_This notebook is part of the material for the [ML Tutorials](https://github.com/NNPDF/como-2025) session._

# Real life PDF fit

In the previous tutorial (PDF Fitting) we have fitted a Neural Network to PDF data that we have obtained from LHAPDF. Ah! If life were so simple!

Reality is much more complicated:

1. We cannot measure the PDF: we have no PDF data!

2. The data has some uncertainties associated with it.

In this tutorial we are going to do a more realistic (albeit simplified) PDF fit.

## From the PDF to the experimental data

In an experiment we only have access to observables. These observables, while they can be computed theoretically, depend on the PDF in a non-trivial manner. For hadronic collisions (such as those at the LHC) we have:

\begin{equation}
    \mathcal{O} = \displaystyle\sum_{ij} \int dx_{1} dx_{2}  \  f_{i}  (x_1, \mu_F) \  f_{j}(x_2, \mu_F) \ \hat{\sigma}_{ij}(x_{1}, x_{2}, \mu_{R}, \mu_{F})
\end{equation}

For simplicity (and because that topic was already covered in the tutorials of the third day) we are going to drop the dependence on $\mu_F$ from the PDF. All scale dependence is contained in the partonic cross section instead.

Utilizing the model of the PDF that we built in the previous tutorial, our PDFs instead have the following form: $f_{i}(x) = (1 - x)^{1+\beta}NN_{i}(x)$ where the index $i$ refers to the parton. The observable then reads:

\begin{equation}
    \mathcal{O} = \displaystyle\sum_{ij} \int dx_{1} dx_{2}  \ (1 - x_1)^{1+\beta}NN_{i}(x_1) \ (1 - x_2)^{1+\beta}NN_{j}(x_2) \ \hat{\sigma}_{ij}(x_{1}, x_{2}, \mu_{R}, \mu_{F})
\end{equation}

Note that in the previous equation both Neural Networks are the same, but are evaluated at different values of $x$ (and potentially contribute with different partons $i$ and $j$). This means the observable depends non-linearly on the Neural Network, which greatly complicates the training. For this tutorial we are going to limit ourselves to DIS observables so one of the two PDFs is set to 1, which will facilitate the construction of the network, but in a global PDF fit both DIS and double-hadronic observables need to be considered.

Up to this point we have used Mean Squared Errors as the loss function to be optimized. In this case we are comparing datapoints ($D$) with observables ($\mathcal{O}$). The loss function thus looks like:

\begin{equation}
    L = \frac{1}{N}\sum_{k} (\mathcal{O}_{k} - D_{k})^2 = \frac{1}{N}\sum_{k}\left(\displaystyle\sum_{ij} \int dx_{1} dx_{2}  \ (1 - x_1)^{1+\beta}NN_{i}(x_1) \ (1 - x_2)^{1+\beta}NN_{j}(x_2) \ \hat{\sigma}_{ij}^{(k)}(x_{1}, x_{2}, \mu_{R}, \mu_{F}) - D_{k}\right)^2
\end{equation}

with $k$ running over the datapoints in the fit. Once the experimental uncertainties are included (see the next section), this loss function corresponds to what is usually called $\chi^{2}$.


## Experimental uncertainties

In addition to the previous consideration, we need to include the information about the experimental uncertainties. Experimental uncertainties introduce correlations between datapoints which need to be taken into account:

\begin{equation}
    L = \frac{1}{N}\sum_{k,l} (\mathcal{O}_{k} - D_{k})s_{kl}^{-1}(\mathcal{O}_{l} - D_{l})
\end{equation}

where $s_{kl}^{-1}$ is the inverse of the covariance matrix. In the limit of a diagonal covariance matrix (no correlation between datapoints) one recovers the simpler form that we have used before, with each term weighted by the inverse of its variance.
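
As a minimal numerical sketch of the formula above, the correlated loss can be evaluated with NumPy and the diagonal limit checked explicitly. The three datapoints and the covariance matrix are made-up values, not taken from any dataset, and the helper name `chi2` is only illustrative.

```python
import numpy as np

# Toy values, invented for illustration only
observables = np.array([1.0, 2.0, 3.0])  # O_k
data = np.array([1.1, 1.8, 3.3])         # D_k
covmat = np.array(
    [
        [0.04, 0.01, 0.00],
        [0.01, 0.09, 0.02],
        [0.00, 0.02, 0.16],
    ]
)


def chi2(obs, dat, cov):
    """Return (1/N) * sum_kl (O_k - D_k) s^-1_kl (O_l - D_l)."""
    diff = obs - dat
    # Solving s y = diff avoids forming the inverse explicitly
    return diff @ np.linalg.solve(cov, diff) / len(diff)


chi2_full = chi2(observables, data, covmat)

# Diagonal limit: each squared residual is weighted by 1/sigma_k^2
sigma2 = np.diag(covmat)
chi2_diag = chi2(observables, data, np.diag(sigma2))
chi2_by_hand = np.mean((observables - data) ** 2 / sigma2)

print(chi2_full, chi2_diag, chi2_by_hand)
```

The last two numbers agree, while the first differs because of the off-diagonal correlations.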

## Tutorial outline

In this tutorial we are going to take the final multi-flavour PDF we constructed in the previous one as the starting point.

In order to have a realistic-looking PDF from only a few datasets, we are going to use the `.npz` files that you downloaded on the first day of the school. They contain:

- `D`: the experimental data
- `covmat`: the experimental covariance matrix
- `fktable`: an interpolation table for the partonic cross section; the fktable is a tensor with axes `(ndata, luminosity channel, x)`
- `xgrid`: grid in x in which to evaluate the PDF
- `luminosity`: the relevant indices of the luminosity
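
To get a feeling for these arrays without relying on the downloaded files, the sketch below writes a synthetic `.npz` file with the same keys and reads it back. The file name `toy_dataset.npz` and all shapes and values are assumptions chosen for illustration; the real files will differ.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
ndata, nchannels, nx = 5, 3, 8  # hypothetical sizes

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_dataset.npz"
    np.savez(
        toy_path,
        D=rng.random(ndata),
        covmat=np.eye(ndata) * 0.01,
        fktable=rng.random((ndata, nchannels, nx)),
        xgrid=np.logspace(-4, 0, nx),
        luminosity=np.array([1, 2, 9]),  # hypothetical flavour indices
    )

    # np.load returns a dictionary-like object; arrays are read on access
    with np.load(toy_path) as data:
        shapes = {key: data[key].shape for key in data.files}

for key, shape in shapes.items():
    print(f"{key:>10}: {shape}")
```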

The first thing we will do is to create an observable that we can compare to data. In a first step we will simply compare to data without taking into account the experimental uncertainties. 

Then we will create a custom loss function, introducing the covariance matrix into the problem.

We will finish the tutorial by creating replicas of the data to generate a PDF ensemble.

In [None]:
from pathlib import Path

import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
from tensorflow import keras
from tensorflow.keras.models import Sequential

tf.keras.backend.clear_session()

data_folder = Path("data") / "pdf_fit"
if not data_folder.exists():
    print("Warning! The data folder does not exist!")

available_datasets = ["HERACOMBNCEP920", "HERACOMBCCEM", "SLACP", "HERACOMB_SIGMARED_B"]

### 1. Prepare the PDF model

1. Prepare a PDF model that takes as input `x` and outputs `9` different flavours. You should be able to use what you wrote in the previous tutorial.

2. Create a layer that rotates a PDF to the evolution basis, in which the fk-tables are generated (as you learned two tutorials ago!)


code suggestions:
```python

# Model building
class Preprocessing(tf.keras.layers.Layer):
    """This layer generates a preprocessing (1-x)**(1+beta)"""

    def build(self, input_shape):
        """The build function will be called before a forward pass and the trainable weight
        will be generated. Beta is constrained to be a positive value to avoid 1/0"""
        self._beta = self.add_weight(
            shape=(1,),
            trainable=True,
            name="beta",
            constraint=tf.keras.constraints.non_neg(),
            initializer="ones",
        )

    def call(self, x):
        return (1.0 - x) ** (self._beta + 1.0)


class InputScaling(tf.keras.layers.Layer):
    """This layer applies a logarithmic scaling to the input and then concatenates it to the actual input
    This layer is dim=1 --> dim=2
    """

    def call(self, x):
        return tf.concat([x, tf.math.log(x)], axis=-1)


def generate_pdf_model(outputs=9, units=16, nlayers=4, activation="tanh"):
    """Generate a PDF model such that
    f(x) = (1-x)^(1+beta) * NN(x, log(x))
    """
    # Note that we have added a "None" size here, we will see in a moment why!
    input_layer = tf.keras.layers.Input(shape=(None, 1))
    scaled_input = InputScaling()
    preprocessing_factor = Preprocessing()
    mm_layer = tf.keras.layers.Multiply()

    # Prepare the sequential PDF model
    pdf_raw = Sequential(name="pdf")
    pdf_raw.add(scaled_input)
    for _ in range(nlayers):
        pdf_raw.add(keras.layers.Dense(units, activation=activation))
    pdf_raw.add(keras.layers.Dense(outputs, activation="linear"))

    final_result = mm_layer([pdf_raw(input_layer), preprocessing_factor(input_layer)])
    return tf.keras.models.Model(input_layer, final_result)


pdf_model = generate_pdf_model()
test = pdf_model(np.random.rand(2, 20, 1))
pdf_model.summary()

# Rotation to the evolution basis
class EvolutionRotation(tf.keras.layers.Layer):
    """
    The PDFs that we are fitting are in the flavour basis. However, due to the peculiarities
    of the DGLAP evolution (which is contained in the fktable together with the partonic
    cross section) it is more convenient to work in what is known as the "evolution basis",
    and the fktables do expect the PDFs to be in the evolution basis.
    In this tutorial we keep the fit in the flavour basis, which means we need a rotation
    from the NN output into the evolution basis.
    Note that the PDFs we convolute are fitted at a scale of Q=1.65 and contain no
    contribution from the bottom or top quark. In addition, we are not considering a photon
    contribution in this tutorial.

    This layer takes a single PDF = NN(x)*(1-x)^(1+beta) in the flavour basis and rotates it
    to the evolution basis

                                                       0   1   2   3   4  5  6  7  8
    The flavour basis of the NN model in our case is (g, -c, -s, -u, -d, d, u, s, c)
    """

    def call(self, pdf):
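        # The input pdf has shape (batch, nx, flavours); with batch=1 the transpose
        # has shape (flavours, nx, 1), so each pdfT[i] is a column of shape (nx, 1)
        pdfT = tf.transpose(pdf)

        # The singlet is the sum of all quarks and antiquarks (the gluon, index 0, is excluded)
        singlet = tf.reduce_sum(pdfT[1:], axis=0)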
        g  = pdfT[0]
        v = (pdfT[6]-pdfT[3]) + (pdfT[5]-pdfT[4]) + (pdfT[7]-pdfT[2]) + (pdfT[8]-pdfT[1])
        v3  = (pdfT[6]-pdfT[3]) - (pdfT[5]-pdfT[4]) 
        v8  = (pdfT[6]-pdfT[3]) + (pdfT[5]-pdfT[4]) - 2*(pdfT[7]-pdfT[2])
        v15 = (pdfT[6]-pdfT[3]) + (pdfT[5]-pdfT[4]) + (pdfT[7]-pdfT[2]) - 3*(pdfT[8]-pdfT[1])
        t3  = (pdfT[6]+pdfT[3]) - (pdfT[5]+pdfT[4])
        t8  = (pdfT[6]+pdfT[3]) + (pdfT[5]+pdfT[4]) - 2*(pdfT[7]+pdfT[2])
        t15 = (pdfT[6]+pdfT[3]) + (pdfT[5]+pdfT[4]) + (pdfT[7]+pdfT[2]) - 3*(pdfT[8]+pdfT[1])

        # All other members of the evolution basis contain redundant information at the fitting scale
        photon = tf.zeros_like(g)
        v24 = v
        v35 = v
        t24 = singlet
        t35 = singlet

        pdf_evol_list = [photon, singlet, g, v, v3, v8, v15, v24, v35, t3, t8, t15, t24, t35]
        # For a batch of size 1 the output will be (nx, 14)
        return tf.concat(pdf_evol_list, axis=1)



```
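
Since the rotation is linear, it can equivalently be written as a matrix acting on the flavour axis. The NumPy sketch below builds the rows for a few members of the evolution basis (g, V, V3, T3), using the flavour ordering given in the docstring above. The random PDF values and the reduced four-row matrix are assumptions made for illustration; this is not a replacement for the layer.

```python
import numpy as np

# Flavour ordering: (g, cbar, sbar, ubar, dbar, d, u, s, c)
G, CB, SB, UB, DB, D, U, S, C = range(9)

rotation = np.zeros((4, 9))
# Gluon
rotation[0, G] = 1.0
# Total valence V: sum over flavours of (q - qbar)
rotation[1, [D, U, S, C]] = 1.0
rotation[1, [DB, UB, SB, CB]] = -1.0
# V3 = (u - ubar) - (d - dbar)
rotation[2, [U, DB]] = 1.0
rotation[2, [UB, D]] = -1.0
# T3 = (u + ubar) - (d + dbar)
rotation[3, [U, UB]] = 1.0
rotation[3, [D, DB]] = -1.0

# A toy PDF evaluated on nx=5 points, shape (nx, flavours)
rng = np.random.default_rng(1)
pdf_flavour = rng.random((5, 9))

# Rotate: (nx, 9) x (9, 4) -> (nx, 4)
pdf_evol = pdf_flavour @ rotation.T
gluon, valence, v3, t3 = pdf_evol.T
```

Writing the rotation as a single matrix makes it easy to check each row against its definition before trusting it in a fit.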

In [None]:
pdf_model = generate_pdf_model()

# Our PDF model should now be able to take any number of "datasets",
# each of which needs "nx" points to perform the PDF x cross section convolution,
# and each of these "nx" points has dimension 1
fake_input = np.random.rand(2, 20, 1)

test = pdf_model(fake_input)
pdf_model.summary()

print(f"Does the PDF model produce the correct output? {np.allclose(test.shape, [2, 20, 9])}")

### 2. Create a trainable model for which the output is the data!

In order to use the built-in algorithms in tensorflow for training, we need to write the integral over the `x` in a way that tensorflow can understand.
Furthermore, computing the complete integral on a per-epoch or per-event basis would be too computationally expensive.
Thanks to the power of the FK-Tables we can turn that integral into a convolution, also with TensorFlow!

Furthermore, training on datasets is a bit more subtle than the simple training with the PDF as the target.

When training against a PDF we had a situation in which every point in the input corresponds to a single point in the output, so the loss function (and the model) is a relatively simple one:

\begin{equation}
    l(x) = y(x) - t(x)
\end{equation}

However, now in order to train against data we need to perform the integral (which we approximate by a convolution). This means that many values of `x` correspond to a single value of the output. And many values of the output are generated by the same input grid.
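
This many-to-one structure can be illustrated with plain NumPy. In the sketch below the FK-table and the PDF values are random placeholders with hypothetical sizes; the point is only that each datapoint is a sum over every flavour channel and every grid point, which `einsum` expresses in one line.

```python
import numpy as np

rng = np.random.default_rng(2)
ndata, nflavours, nx = 4, 3, 6  # hypothetical sizes

fktable = rng.random((ndata, nflavours, nx))  # axes (n, f, x)
pdf_on_grid = rng.random((nx, nflavours))     # axes (x, f), same grid for every datapoint

# Explicit loops: many x values (and flavours) feed a single observable
predictions_loop = np.zeros(ndata)
for n in range(ndata):
    for f in range(nflavours):
        for ix in range(nx):
            predictions_loop[n] += fktable[n, f, ix] * pdf_on_grid[ix, f]

# The same contraction written with einsum
predictions = np.einsum("nfx,xf->n", fktable, pdf_on_grid)
```

Both results are identical, but the `einsum` form is the one that maps directly onto a TensorFlow layer.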


1. Write a "Convolution" layer that is constructed taking as input the fktable and the basis of flavours that the fktables uses and computes the predictions to be compared with the experimental data.

2. Write a function to compare the results of your fit with the actual experimental data and look at the comparison.

3. Write a loss function so that you can train a model which includes a convolution

4. Fit against one of the datasets; you choose which.

5. Compare the results of your fit with your dataset.

6. Compare now the results of your fit for a different dataset! What has happened? 

code suggestion:
```python
# On Tuesday you used pineappl to perform the convolution;
# in this tutorial you will need to write the convolution of the grid and the PDF yourself!

# Step 1 is to select the flavours of the PDF that participate in the convolution
# (inside the `call` method of your layer, with `pdf` of shape (nx, 14)):
#     masked_pdf = tf.boolean_mask(pdf, basis, axis=1)
# Then you can use einsum to perform the actual convolution:
#     tf.einsum("nfx, xf -> n", fktable, masked_pdf)
    
# Construct loss function that is able to digest both the output of the model and the experimental data 
def chi2_simplified(ytrue, ypred):
    """Loss function to pass to the model"""
    return tf.reduce_sum((ytrue - ypred) ** 2)
    
# In order to train against data, you can do the following:   
dsname = available_datasets[3]
data = np.load(data_folder / f"{dsname}.npz")
# We reshape the input so that it has 3 axes (even if two of them have size=1)
x = data.get("xgrid").reshape(1, -1, 1)  # 1 batch, nx grid points, 1 dim
lbasis = data.get("luminosity")
fktable = data.get("fktable")

# We would need to also reshape the experimental data, so that we match through the batch dimension
experimental_data = data.get("D").reshape(1, -1)
```

In [None]:
# 1. Write the convolution layer

class Convolution(tf.keras.layers.Layer):
    """Convolutes a partonic cross section with a PDF in the evolution basis

    This follows the same strategy that you learned in the previous tutorial!
    We can approximate the integral quite well by a convolution with an interpolation grid
    """

    def __init__(self, fktable, basis, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._fktable = tf.constant(fktable, dtype=tf.float32)
        # Make the basis of flavours into a boolean mask
        basis_mask = np.zeros(14, dtype=bool)
        for i in basis:
            basis_mask[i] = True
        self._basis = tf.constant(basis_mask)

    def call(self, pdf):
        ...

In [None]:
# 2. Compare the results using your PDF model with the actual data

# Wrapper function to compare the predictions of a PDF with the results of a dataset
def theory_data_comparison(dsname, pdf):
    """Compare the predictions of a given pdf with the selected dataset
    
    Note that in this example we are using the custom layers that we have created as if they were
    numpy operations, without creating a model out of them!
    
    This function assumes that there are two layers, change it accordingly if it is not the case for you
        - EvolutionRotation
        - Convolution
    """
    # The data is in the `.npz` format which can be loaded by the numpy "load" function
    data = np.load(data_folder / f"{dsname}.npz")
    x = data.get("xgrid").reshape(1, -1, 1)
    lbasis = data.get("luminosity")
    fktable = data.get("fktable")
    experimental_data = data.get("D")

    # Rotate the PDF (evaluated in the grid in x)
    evolution_pdf = EvolutionRotation()(pdf(x))
    convolution_layer = Convolution(fktable, lbasis)
    
    theory_predictions = convolution_layer(evolution_pdf)
    idata = np.arange(len(experimental_data)) # we don't have information about the kinematic variable
    
    plt.title(f"Theory-data comparison for {dsname}")
        
    plt.errorbar(idata, experimental_data, yerr=0.0, fmt=".", label="Experimental data")
    plt.errorbar(idata, theory_predictions, yerr=0.0, fmt="x", color="red", label="Prediction")
    plt.legend()
    plt.xlabel("Data index")

In [None]:
# In order to train against data, you can do the following:   
dsname = available_datasets[3]
data = np.load(data_folder / f"{dsname}.npz")
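# We reshape the input so that it has 3 axes (even if two of them have size=1)
x = data.get("xgrid").reshape(1, -1, 1)  # 1 batch, nx grid points, 1 dim
lbasis = data.get("luminosity")
fktable = data.get("fktable")
# The experimental data is reshaped as well, so that it matches through the batch dimension
experimental_data = data.get("D").reshape(1, -1)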

theory_data_comparison(dsname, pdf_model)

In [None]:
# 3. Write a loss function so that you can train a model which includes a convolution
def loss_function(ytrue, ypred):
    return


In [None]:
# 4. Now instantiate your model, compile with the loss function above and train it!

# Let's construct a model
pdf_model = generate_pdf_model()

# Build the necessary layers
evolution_layer = EvolutionRotation()
convolution_layer = Convolution(fktable, lbasis)

# And finally prepare the observable model
# Note how we use the Sequential model again to build up a model that includes our original PDF
# and its output is then passed to the evolution and then convolution layers
obs_model = keras.models.Sequential([pdf_model, evolution_layer, convolution_layer])

# Now you can compile with our custom loss function...
obs_model.compile(keras.optimizers.Nadam(), loss=loss_function)
print("Training started...")

# And train!
history = obs_model.fit(x, experimental_data.reshape(1, -1), epochs=1000, verbose=0)


In [None]:
# 5. Perform the comparison with the same dataset you trained against

theory_data_comparison(dsname, pdf_model)

In [None]:
# 6. Now, select a different dataset and repeat the same comparison (without retraining!)

### 3. Train on multiple dataset and with experimental errors

For that we are going to generate several separate observable models that all share the same PDF model.
We will then concatenate all such models and, at the end, compare with the same data for HERACOMB_SIGMARED_B.

Note that the input grid in x is always the same. This allows us to greatly simplify the model. In general the grids can be different, and this could be handled either in your model (where each observable takes a different input) or by preprocessing the data (adding extra 0s if necessary).
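
As a sketch of the preprocessing option, an FK-table defined on a smaller grid can be embedded into a common grid by filling the missing points with zeros, which leaves the prediction unchanged. The helper `pad_fktable` and all values are hypothetical, and the sketch assumes that every point of the dataset grid also appears in the (sorted) common grid.

```python
import numpy as np


def pad_fktable(fktable, xgrid, common_xgrid):
    """Embed an (ndata, nfl, nx) table into the common grid, padding with zeros."""
    # Position of each dataset grid point inside the common grid
    positions = np.searchsorted(common_xgrid, xgrid)
    if not np.allclose(common_xgrid[positions], xgrid):
        raise ValueError("xgrid must be a subset of common_xgrid")
    ndata, nfl, _ = fktable.shape
    padded = np.zeros((ndata, nfl, len(common_xgrid)))
    padded[:, :, positions] = fktable
    return padded


rng = np.random.default_rng(3)
common_xgrid = np.logspace(-3, 0, 10)  # sorted, increasing
used_points = [1, 4, 7, 9]             # this dataset uses only four of the points
xgrid = common_xgrid[used_points]
fktable = rng.random((2, 3, len(xgrid)))

padded = pad_fktable(fktable, xgrid, common_xgrid)

# A toy PDF evaluated on the common grid, shape (nx, nfl)
pdf_common = rng.random((len(common_xgrid), 3))
pred_padded = np.einsum("nfx,xf->n", padded, pdf_common)
pred_original = np.einsum("nfx,xf->n", fktable, pdf_common[used_points])
```

The padded table gives the same predictions as the original one, at the cost of some wasted multiplications by zero.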

1. Train all datasets at the same time. There are two possibilities for this, either concatenating all outputs, or creating a loss function per dataset. Note that, since in either case the PDF model will be a single one, you will always be training the same PDF!
2. Check your results.
3. (optional) train leaving one of the datasets out, to see how the model generalizes

code suggestions:

```python

# Create an observable per dataset
observables = [ ]

## If you choose to concatenate
# Use the Keras `Concatenate` layer, to create a concatenation of observables
final_layer = keras.layers.Concatenate()(observables)
# Use said concatenation as the model output, now your output will be a concatenation of all data
final_model = keras.models.Model(model_input, final_layer)

## If you choose to use a different loss per output
chi2_list = []
for covmat in covmats:
    # `chi2_function` is a placeholder: build here a loss that uses `covmat`
    chi2_list.append(chi2_function)
final_model = keras.models.Model(model_input, observables)
final_model.compile(keras.optimizers.Nadam(), loss=chi2_list)
```

In [None]:
# 0. Use the code suggestions to generate a final model (starting from a pdf_model) that is able to fit several datasets at once

pdf_model = generate_pdf_model()

output_data = []
observables = []
covmats = []  # Let's save the covmats to use them later!

# Let's recover a reference to the initial layer, the information is always available
# this will allow us to instantiate our observable model with the same input shape
model_input = pdf_model.input

for dsname in available_datasets:
    data = np.load(data_folder / f"{dsname}.npz")
    x = data.get("xgrid").reshape(1, -1, 1)
    lbasis = data.get("luminosity")
    fktable = data.get("fktable")
    covmat = data.get("covmat")
    experimental_data = data.get("D").reshape(1, -1)

    convolution = Convolution(fktable, lbasis)

    obs = keras.models.Sequential([pdf_model, evolution_layer, convolution])

    observables.append(obs(model_input))
    covmats.append(covmat)
    output_data.append(experimental_data)

In [None]:
# 1. Perform the fit

In [None]:
# 2. Check your results against any of the datasets

theory_data_comparison(dsname, pdf_model)

In [None]:
# 3. (optional) perform again the fit leaving one dataset out and check the result against that one

### 4. Include experimental errors

Let us summarize what we have done up to now: we have created a single PDF model which takes as input a single grid in x and produces PDF values on this grid. Then the results of this PDF are rotated into the evolution basis and convoluted with an interpolation table to produce a physical observable.

Note that while all our models utilize the same convolution and rotation, this is not a requirement. Indeed, we could in the same fit include FKTables for DIS and hadronic observables, and include extra contributions or physical constraints.

For the output of the model, we have considered that all datapoints are created equal and have concatenated the outputs and compared against a concatenation of the experimental results. This is, as well, not a requirement.
In the next exercise we need to use a different loss per output (i.e., one per dataset) since each loss will be different due to the covariance matrix!

1. Write a class such that you can generate a different loss function per experiment with different data
2. Modify the script that compares model and data so that experimental errors are taken into account.
3. Fit and check!


code suggestion:
```python

class Chi2:
    def __init__(self, covmat):
        self._invcovmat = tf.constant(np.linalg.inv(covmat), dtype=tf.float32)
        
    def __call__(self, ytrue, ypred):
        tmp = (ytrue - ypred)
        return tf.einsum("bi,ij,bj->b", tmp, self._invcovmat, tmp)


# Add error bars to a plot as the diagonal of the covariance matrix
errors = np.sqrt(np.diag(data.get("covmat")))
plt.errorbar(idata, experimental_data, yerr=errors, fmt=".", label="Experimental data")
```
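
Summing one $\chi^2$ per dataset amounts to assuming that different datasets are uncorrelated: the sum of the individual values equals the $\chi^2$ of the concatenated data with a block-diagonal covariance matrix. The NumPy sketch below checks this with made-up residuals and covariance matrices; the helper names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)


def random_covmat(n):
    """Build a random symmetric positive-definite matrix (toy values)."""
    a = rng.random((n, n))
    return a @ a.T + n * np.eye(n)


def chi2(residual, covmat):
    """Unnormalized chi2 = r^T s^-1 r, computed without an explicit inverse."""
    return residual @ np.linalg.solve(covmat, residual)


# Two toy datasets with 3 and 4 points
res_a, cov_a = rng.normal(size=3), random_covmat(3)
res_b, cov_b = rng.normal(size=4), random_covmat(4)

chi2_separate = chi2(res_a, cov_a) + chi2(res_b, cov_b)

# Block-diagonal covariance: no correlation between the two datasets
cov_total = np.zeros((7, 7))
cov_total[:3, :3] = cov_a
cov_total[3:, 3:] = cov_b
chi2_joint = chi2(np.concatenate([res_a, res_b]), cov_total)
```

If two datasets did share correlated uncertainties, the off-diagonal blocks would be non-zero and the two numbers would no longer agree.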

In [None]:
# 1. Use the suggestion above to wrap the loss function as a part of a class so it can hold data

chi2_list = []
for covmat in covmats:
    chi2_list.append(Chi2(covmat))

In [1]:
# 2. Modify the plot function above to take into account the experimental errors

In [2]:
# 3. Refit and recheck!

### 5. Generate PDF errors
Finally, we are going to use the Monte Carlo replica method to also generate an error bar for the predictions.

The basis of this method, once we have arrived at this point, is actually quite simple. We are going to generate variations of the output data according to the covariance matrix of each dataset. These variations will be random.

Then you will train separate PDF models so that each one of them is optimized with a different "replica". At the end you can use this set of separate PDF models to generate uncertainties.

1. Generate variations of the datasets. Try to do it on your own! Otherwise there's a suggestion below
2. Train a number of independent PDF models (e.g., 5)
3. Modify the comparison wrapper so that it can also plot uncertainties for the predictions of the model


code suggestion:

```python
# let's save all the information that is shared by all the replicas
# (note that the PDF model is not among that!)

from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    expdata: np.ndarray
    covmat: np.ndarray
    convolution: Convolution
    chi2: Chi2

    @property
    def ndata(self):
        return self.expdata.shape[-1]

    def generate_replica(self):
        # Gaussian fluctuation around the data, distributed according to the covariance matrix
        central = self.expdata.reshape(-1)
        return np.random.multivariate_normal(central, self.covmat).reshape(1, -1)


datasets = []

for dsname in available_datasets:
    data = np.load(data_folder / f"{dsname}.npz")
    x = data.get("xgrid").reshape(1, -1, 1)
    lbasis = data.get("luminosity")
    fktable = data.get("fktable")
    covmat = data.get("covmat")
    edata = data.get("D").reshape(1, -1)

    chi2 = Chi2(covmat)
    cc = Convolution(fktable, lbasis)
    dd = Dataset(dsname, edata, covmat, cc, chi2)

    datasets.append(dd)
```
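
One common way to draw fluctuations that follow a given covariance matrix is to multiply standard normal numbers by its Cholesky factor. The sketch below uses toy central values and a toy covariance matrix (both invented for illustration) and checks that the sample covariance of many replicas approaches the input one; the helper `generate_replicas` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy central values and covariance matrix (invented for illustration)
expdata = np.array([1.0, 2.0, 3.0])
covmat = np.array(
    [
        [0.04, 0.01, 0.00],
        [0.01, 0.09, 0.02],
        [0.00, 0.02, 0.16],
    ]
)


def generate_replicas(central, cov, nreplicas):
    """Return an (nreplicas, ndata) array of Gaussian fluctuations around central."""
    chol = np.linalg.cholesky(cov)  # cov = chol @ chol.T
    z = rng.standard_normal((nreplicas, len(central)))
    return central + z @ chol.T


replicas = generate_replicas(expdata, covmat, nreplicas=20000)
sample_mean = replicas.mean(axis=0)
sample_cov = np.cov(replicas, rowvar=False)
```

In an actual fit only a few replicas are trained; the large number here is used purely to make the statistical check meaningful.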

In [None]:
# 1. Generate variations of the dataset

In [None]:
# 2. Fit each of the variations separately as a separate PDF model

In [None]:
# 3. Use the mean value and the variance of the results of the separate PDF models as a measure of the PDF errors