# Preprocessing - AI safety for b-tagging@CMS edition
This introduction will explain the different steps inside the `prep_new.py` script that prepares the cleaned data for further usage by splitting the data set, scaling the features and storing sample weights.

<div align="center">
<img src="https://www.mkanalysis.com/static/img/Data%20Science%20Tom%20and%20Jerry.jpg" width="400"/><br>
    Fig. 1 <a href="https://www.mkanalysis.com/tutorial/?page=10">[from mkanalysis.com]</a>
</div>

The tutorial is split into five parts:
- [Prerequisites](#prerequisites)
- [Reweighting](#reweighting)
- [How the preprocessing code works](#code)
- [Perform preprocessing yourself](#perform)
    - [Calculating scalers](#scalers)
    - [Apply preprocessing to QCD and TT](#apply)
- [Some tasks](#tasks)

## Prerequisites<a name="prerequisites"></a>
The cleaning step explained in the previous tutorial must be completed before proceeding. You also need access to the arrays that contain the weights; these either come with the cloned repository or can be downloaded.

More information on the preprocessing is given in section 2.2.5 of the thesis.

## Reweighting <a name="reweighting"></a>
For the reweighting, arrays already exist that store the weights for every flavour in 50x50 bins in $\eta$ and $p_\text{T}$. They are part of the `may_21` directory and come in two variants: one for the method that produces similar, but non-flat, distributions, and one for flat distributions. In total, eight `.npy` files are necessary to proceed. The reweighting is explained in section 2.3 of the thesis (how the weights are derived is part of the `reweighting_prototyping.ipynb` notebook in the `may_21` directory). Just make sure you have the weights stored under your HPC account.

## How the preprocessing code works<a name="code"></a>
Here is some theoretical explanation of the preprocessing code which helps you understand what happens later when you actually do the preprocessing yourself.

```python
import numpy as np

import torch

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from scipy.stats import binned_statistic_2d

import gc



import argparse
#import ast

parser = argparse.ArgumentParser(description="Perform data preprocessing")
parser.add_argument("default", type=float, help="Default value relative to the minimum of the distribution, with positive sign")
parser.add_argument("prepstep", help="calc_scalers, apply_to_TT or apply_to_QCD")
parser.add_argument('-s', "--startvar", type=int, help="Start calculating scaler for this variable", default=-1)
parser.add_argument('-e',"--endvar", type=int, help="End calculating scaler for this variable", default=-1)
args = parser.parse_args()

prepstep = args.prepstep
startvar = args.startvar
endvar = args.endvar
default = args.default
if int(default) == default:
    default = int(default)
minima = np.load('/home/um106329/aisafety/april_21/from_Nik/default_value_studies_minima.npy')
defaults = minima - default
```
The libraries are similar to what was already discussed for the cleaning. Additionally, scikit-learn and scipy are used because they offer some special functions for the splitting and scaling, or the 2d-binning, which is relevant to get the correct sample weights.

The parser also looks similar to what we had before, but the last two arguments come with a default value. This is because the preprocessing script can be called in two variants with different purposes (more on that later); for one of the two variants, the last two arguments are not used.

There is a block that checks whether the default is an integer. This was necessary in the past, when defaults were placed at e.g. -999, but it can be ignored now because it doesn't interfere with our standard default value of 0.001. The code block then proceeds with loading the minima and defining the defaults per variable, just like in the cleaning script. Make sure to update the path to your own one.
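
As a minimal sketch of what `defaults = minima - default` does, assume a made-up `minima` array with three entries (the real file holds the minima found in the default value studies, and its actual values are not reproduced here):

```python
import numpy as np

# Hypothetical minima for three features; the real values come from the .npy file.
minima = np.array([-2.5, 20.0, 0.0])

default = 0.001
# Same integer check as in the script: 0.001 stays a float, 999.0 would become 999.
if int(default) == default:
    default = int(default)

# Each feature gets its own default value, placed slightly below its minimum.
defaults = minima - default
print(defaults)
```

The value of `default` also ends up in the file names (`..._with_default_{default}`), which is why the integer check mattered for the old -999 convention.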

```python
if prepstep != 'calc_scalers':
    # if one wants to change pt-eta such that in each bin, there is an equal amount of entries from every flavour (target: average per bin over the four flavours), but keeping
    # the average distribution (more low-pt than high-pt, more small eta than large eta and such)

    b_weights = np.load('/home/um106329/aisafety/may_21/absweights_b.npy')
    bb_weights = np.load('/home/um106329/aisafety/may_21/absweights_bb.npy')
    c_weights = np.load('/home/um106329/aisafety/may_21/absweights_c.npy')
    l_weights = np.load('/home/um106329/aisafety/may_21/absweights_l.npy')

    flavour_lookuptables = np.array([b_weights,bb_weights,c_weights,l_weights])


    # if one wants to use flat distributions (target for weighting is average over the whole pt-eta-histogram per flavour), multiplied by class imbalance, leads to rectangular
    # shapes, naturally (almost) identical between the different flavours ('almost' because not every bin is filled for large eta / pt for every flavour)

    b_weights_flat = np.load('/home/um106329/aisafety/may_21/weights_flat_b.npy')
    bb_weights_flat = np.load('/home/um106329/aisafety/may_21/weights_flat_bb.npy')
    c_weights_flat = np.load('/home/um106329/aisafety/may_21/weights_flat_c.npy')
    l_weights_flat = np.load('/home/um106329/aisafety/may_21/weights_flat_l.npy')

    flavour_lookuptables_flat = np.array([b_weights_flat,bb_weights_flat,c_weights_flat,l_weights_flat])


# TT to Semileptonic
# this will have all starts from 0 to (including) 2400
ttstarts = np.arange(0,2450,50)
# this will have all ends from 49 to 2399 as well as 2446 (this was the number of original .root-files)
ttends = np.concatenate((np.arange(49,2449,50), np.arange(2446,2447)))             
#print(starts)
#print(ends)
TTNUM_DATASETS = len(ttstarts)
print(TTNUM_DATASETS)
qcdstarts = np.arange(0,11450,50)
TTdataset_paths = [f'/hpcwork/um106329/may_21/cleaned_TT/inputs_{ttstarts[k]}_to_{ttends[k]}_with_default_{default}.npy' for k in range(0, TTNUM_DATASETS)]
TTDeepCSV_paths = [f'/hpcwork/um106329/may_21/cleaned_TT/deepcsv_{ttstarts[k]}_to_{ttends[k]}_with_default_{default}.npy' for k in range(0, TTNUM_DATASETS)]

# QCD
# this will have all starts from 0 to (including) 11400
qcdstarts = np.arange(0,11450,50)
# this will have all ends from 49 to 11399 as well as 11407 (this was the number of original .root-files)
qcdends = np.concatenate((np.arange(49,11449,50), np.arange(11407,11408)))             
#print(starts)
#print(ends)
QCDNUM_DATASETS = len(qcdstarts)
print(QCDNUM_DATASETS)
QCDdataset_paths = [f'/hpcwork/um106329/may_21/cleaned_QCD/inputs_{qcdstarts[k]}_to_{qcdends[k]}_with_default_{default}.npy' for k in range(0, QCDNUM_DATASETS)]
QCDDeepCSV_paths = [f'/hpcwork/um106329/may_21/cleaned_QCD/deepcsv_{qcdstarts[k]}_to_{qcdends[k]}_with_default_{default}.npy' for k in range(0, QCDNUM_DATASETS)]



if prepstep == 'apply_to_TT':
   
    dataset_paths = TTdataset_paths
    DeepCSV_paths = TTDeepCSV_paths
    NUM_DATASETS = TTNUM_DATASETS
    sample = 'TT'

elif prepstep == 'apply_to_QCD':

    dataset_paths = QCDdataset_paths
    DeepCSV_paths = QCDDeepCSV_paths
    NUM_DATASETS = QCDNUM_DATASETS
    sample = 'QCD'


else:  # calc_scalers
    
    dataset_paths = TTdataset_paths + QCDdataset_paths
    # I checked: no need to use the DeepCSV files when producing the scalers after splitting into train/val/test. This works because train_test_split handles the splitting
    # independently of the number of arrays passed into the function call, and has reproducible results if random_state is given.



```
The first check concerns the stage (or step) of the preprocessing with which the script is executed. The first step is the calculation of the scalers; the following steps apply the actual scaling etc. to the inputs. So in the case here, `!= 'calc_scalers'` actually means the second case: applying the scaling. Only then is it necessary to load the weights from disk into so-called lookup tables, which hold the arrays for the different flavours. When calculating the scalers instead, the weights are not needed.

From the cleaning we ended up with the input files (for our tagger and also the DeepCSV outputs). Lines 57 to 80 collect all the file paths under which the cleaned files have been stored, in line with the starting and ending indices of the original files; that's why there is the seemingly complicated array structure of starts and ends. At this point, the files themselves are not loaded, only the paths to the files are stored inside lists.
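
To see what the starts and ends look like, here is a small sketch that rebuilds the TT index pairs exactly as in the script and turns them into paths. The directory `base` is only a placeholder, and the default value in the file name is assumed to be 0.001.

```python
import numpy as np

# Same construction as in the script for the TT sample.
ttstarts = np.arange(0, 2450, 50)
ttends = np.concatenate((np.arange(49, 2449, 50), np.arange(2446, 2447)))

# Pair every start index with its end index.
pairs = list(zip(ttstarts.tolist(), ttends.tolist()))
print(len(pairs))                      # number of cleaned TT files
print(pairs[0], pairs[1], pairs[-1])   # first, second and last pair

# Build the paths without loading anything; `base` is a placeholder directory.
base = "/hpcwork/<your ID>/may_21/cleaned_TT"
paths = [f"{base}/inputs_{a}_to_{b}_with_default_0.001.npy" for a, b in pairs]
print(paths[-1])
```

The last pair is the only irregular one, because the final cleaned file covers fewer than 50 original `.root` files.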

From line 84 on, there are again some checks that concern the preprocessing step. This is to make sure that the right paths are used depending on the process (QCD or TT or both for scaling).

Also here, you may have noticed that the paths are based on my account / file system - so you'd have to change them to your own ones again (look at the data cleaning if you do not remember where to find the paths).

```python
def calc_scalers_from_full_training_sample(startvar,endvar):

    def get_trainingsamples(dataset):
        # to calculate the scalers, one really only needs to use the training samples, which are created by using the train_test_split twice
        # the first function call splits test from train/val, and the second call further splits the train and val set
        train_and_val,_ = train_test_split(dataset, test_size=0.2, random_state=1)
        trainset, _ = train_test_split(train_and_val, test_size=0.1, random_state=1)
        return trainset
    
    # first idea was to do the scalers similar to before, all in one go,
    # but to run on interactive node: split up per variable
    # second idea: do it per variable
    # third idea: not the full set, but at least several variables together, that stil fit into memory and use rather fast array slicing
    scalers = []
    
    #for i in range(0,67):
    # get the training set for the current input, considering all available files for the training at once
    # to keep the same split as for the later creation of the train / val / test sets, it is necessary to do the splitting on all files separately,
    # and only merge the training samples afterwards
    all_train_inputs_variable_start_end = np.concatenate([get_trainingsamples(np.load(path)[:,startvar:endvar+1]) for path in dataset_paths])

    # do not compute scalers with default values, which were set to minima-default
    # (defaults is indexed by the global variable index, hence the offset by startvar)
    for i in range(endvar+1-startvar):
        scaler = StandardScaler().fit(all_train_inputs_variable_start_end[:,i][all_train_inputs_variable_start_end[:,i] != defaults[startvar+i]].reshape(-1,1))
        scalers.append(scaler)
    
    return scalers
    #return scaler



```
Now there is a function that is relevant only when using the preprocessing script for the first time, namely the calculation of the scalers. Given a starting and an end variable between which the function is executed (the only reason for that is the interactive time limit, otherwise one could just do all inputs in one go), this function will output the scalers for the respective variables. A scaler, by the way, can be understood as a pair $(\mu,\sigma)$ which will later be used to scale the individual features (standardization via $\frac{x-\mu}{\sigma}$).
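
The following toy example (made-up numbers, not taken from the data) illustrates that a fitted `StandardScaler` really is just a $(\mu,\sigma)$ pair, accessible via its `mean_` and `scale_` attributes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column with made-up values; fit() expects shape (n_samples, 1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)

scaler = StandardScaler().fit(x)   # only estimates mu and sigma, x is untouched
mu, sigma = scaler.mean_[0], scaler.scale_[0]

# transform() applies (x - mu) / sigma
scaled = scaler.transform(x)
manual = (x - mu) / sigma
print(mu, sigma)
print(np.allclose(scaled, manual))
```

Note that `scale_` is the population standard deviation (no Bessel correction), which makes no practical difference for samples of this size.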

Because it's a bit easier to understand the function introduced above when looking at what happens actually during runtime, take a look at the end of the script, where the function will be called:
```python
...
if prepstep == 'calc_scalers':
    # get the 67 scalers, computed from the full training set (meaning: for each input, only one scaler for all files together;
    # but splitting up calculation per variable ensures running on interactive nodes)
    #for v in range(startvar,endvar+1):
    #    scaler = calc_scalers_from_full_training_sample(v)
    #    torch.save(scaler, f'/hpcwork/um106329/june_21/scaler_{v}_with_default_{default}.pt')
    # with this third version, the calculation itself does not happen in the loop, but for a set of variables simultaneously
    # only storing the scaler per variable separately needs the loop, so everything should be much faster (e.g. loading and splitting the data)
    currentscalers = calc_scalers_from_full_training_sample(startvar,endvar)
    for v in range(endvar+1-startvar):
        torch.save(currentscalers[v], f'/hpcwork/um106329/june_21/scaler_{startvar+v}_with_default_{default}.pt')
# a 'scaler' consists of mu and sigma, which is in the following applied to train, val, test)
else:
...
```
Ignore the lines that are commented out. I tried several methods to make this script run efficiently while staying within the time limit, so only look at line 276f. for now. As I said, scalers are calculated for some of the variables, and in the end each one is stored with a save command to a `.pt` file (that's a pytorch variant of pickle). So the first preprocessing step does not apply the scaling; it ends with the calculation of the scalers.

Back to the code inside the function that calculates scalers: there is another function inside which does the splitting of the samples, but we first look at the other parts. From line 119 on, there is a placeholder list that will carry the scalers. Then, line 125 concatenates a set of columns from all available datasets, selecting only the specified range between the starting and ending index of the features. All datasets with all features would not fit into [memory](https://help.itc.rwth-aachen.de/service/rhr4fjjutttf/article/0a23d513f31b4cf1849986aaed475789/), but selecting only some features works just fine.

Each of those truncated datasets is passed to the `get_trainingsamples` function, which splits it into three parts (training, validation, test). Only the training part is kept for this preprocessing step, because the calculation of the scalers only uses the training samples. The scalers will later be applied to all samples, but to avoid a bias and to have really independent sets, only the training data is taken into account when deriving them.

Inside `get_trainingsamples` you see the `train_test_split` function for the first time. It's a handy function provided by sklearn which can be used iteratively to split a big sample into two (in the end: three) parts by randomly selecting entries. `random_state=1` sets the seed for the randomized selection. If no seed is specified, the results differ between executions of the same code: your derived scalers would not match the latest training sets, and the splitting would be different for different features. In short, everything would be a mess that can not be controlled otherwise. With `test_size=0.2` followed by `test_size=0.1`, the data ends up as 72% training, 8% validation and 20% test, reflecting the percentages given in the thesis.

Assume we now have all training samples collected for the features defined in the beginning. Then the loop that starts in line 128 goes through all those features and uses the so-called `StandardScaler` class (again, scikit-learn). Note that calculating the scalers only requires *fitting* the standard scaler; the scalers are applied only later (in this step, datasets are *not* transformed). All default values present in the data are excluded from the calculation of the scalers (that's done with the mask inside the `.fit()` call), and a reshaping is done to get the right dimensions for the fit function. The current scaler is added to the list that stores all scalers, which is returned by the function in the end (and then saved according to the code explained above).
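
Here is a small sketch of the two ideas above on random toy data: splitting a column slice selects the same rows as splitting the full array (as long as the seed is fixed), and the scaler is fitted only on training entries that are not at the default. The array `data` and the value `toy_default` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def get_trainingsamples(dataset):
    # same two-step split as in the script: 20% test, then 10% of the rest as validation
    train_and_val, _ = train_test_split(dataset, test_size=0.2, random_state=1)
    trainset, _ = train_test_split(train_and_val, test_size=0.1, random_state=1)
    return trainset

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 4))   # toy stand-in for one cleaned file
toy_default = -10.0                 # hypothetical default for column 2
data[::5, 2] = toy_default

# Splitting a column slice selects the same rows as splitting the full array.
full_train = get_trainingsamples(data)
slice_train = get_trainingsamples(data[:, 2:4])
print(full_train.shape, np.array_equal(full_train[:, 2:4], slice_train))

# Fit on the training rows only, leaving out entries that sit at the default value.
column = slice_train[:, 0]
scaler = StandardScaler().fit(column[column != toy_default].reshape(-1, 1))
print(scaler.mean_[0], scaler.scale_[0])
```

Because the defaults are masked out, the fitted mean and width stay close to those of the underlying normal distribution instead of being pulled towards `toy_default`.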

Here is now the lengthy preprocessing function:
```python
# preprocess the datasets, create train, val, test + DeepCSV
def preprocess(dataset, DeepCSV_dataset, s):    
    
    trainingset,testset,_,DeepCSV_testset = train_test_split(dataset, DeepCSV_dataset, test_size=0.2, random_state=1)
    #torch.save(DeepCSV_testset, f'/hpcwork/um106329/may_21/scaled_{sample}/DeepCSV_testset_%d_with_default_{default}.pt' % s)
    del DeepCSV_testset
    trainset, valset = train_test_split(trainingset,test_size=0.1, random_state=1)
    del trainingset
    gc.collect()
    # get the indices of the binned 2d histogram (eta, pt) for each jet
    # these arrays will have the shape (2,len(data)) where len(data) is the length of the testset, valset and trainset
    # to not waste too much memory & diskspace later, one only needs 8-bit unsigned integer (each going from 0 to 255, which is enough for 50 bins in each direction,
    # so only 50 possible values --> use np.ubyte directly, also one only needs to unpack the fourth return value from binned_statistic_2d, we don't need the histogram
    # or the bin edges, just the indices that will serve as a kind of look-up-table during the sampling for the training)
    # first sub-array are the indices for eta, second one for pt (notice: this is really a nested array because expand_binnumbers was set to true, otherwise it would have been flat)                                                
    test_targets = (torch.Tensor(testset[:,-1])).long()      
    '''
    _,_,_,test_pt_eta_bins = binned_statistic_2d(testset[:,0],testset[:,1],None,'count',bins=(50,50),range=((-2.5,2.5),(20,1000)),expand_binnumbers=True)
    test_eta_bins = test_pt_eta_bins[0]-1
    test_pt_bins = test_pt_eta_bins[1]-1
    test_all_weights = flavour_lookuptables[test_targets,test_eta_bins,test_pt_bins]
    test_weights = test_all_weights/sum(test_all_weights)
    test_all_weights_flat = flavour_lookuptables_flat[test_targets,test_eta_bins,test_pt_bins]
    test_weights_flat = test_all_weights_flat/sum(test_all_weights_flat)
    np.save(f'/hpcwork/um106329/may_21/scaled_{sample}/test_pt_eta_bins_%d_with_default_{default}.npy' % s,test_pt_eta_bins.astype(np.ubyte))
    np.save(f'/hpcwork/um106329/may_21/scaled_{sample}/test_sample_weights_%d_with_default_{default}.npy' % s,test_weights)
    np.save(f'/hpcwork/um106329/may_21/scaled_{sample}/test_sample_weights_flat_%d_with_default_{default}.npy' % s,test_weights_flat)
    del test_pt_eta_bins
    del test_eta_bins
    del test_pt_bins
    del test_all_weights
    del test_weights
    gc.collect()
    '''
    val_targets = (torch.Tensor(valset[:,-1])).long()
    '''
    _,_,_,val_pt_eta_bins = binned_statistic_2d(valset[:,0],valset[:,1],None,'count',bins=(50,50),range=((-2.5,2.5),(20,1000)),expand_binnumbers=True)
    val_eta_bins = val_pt_eta_bins[0]-1
    val_pt_bins = val_pt_eta_bins[1]-1
    val_all_weights = flavour_lookuptables[val_targets,val_eta_bins,val_pt_bins]
    val_weights = val_all_weights/sum(val_all_weights)
    val_all_weights_flat = flavour_lookuptables_flat[val_targets,val_eta_bins,val_pt_bins]
    val_weights_flat = val_all_weights_flat/sum(val_all_weights_flat)
    np.save(f'/hpcwork/um106329/may_21/scaled_{sample}/val_pt_eta_bins_%d_with_default_{default}.npy' % s,val_pt_eta_bins.astype(np.ubyte))
    np.save(f'/hpcwork/um106329/may_21/scaled_{sample}/val_sample_weights_%d_with_default_{default}.npy' % s,val_weights)
    np.save(f'/hpcwork/um106329/may_21/scaled_{sample}/val_sample_weights_flat_%d_with_default_{default}.npy' % s,val_weights_flat)
    del val_pt_eta_bins
    del val_eta_bins
    del val_pt_bins
    del val_all_weights
    del val_weights
    gc.collect()
    '''
    train_targets = (torch.Tensor(trainset[:,-1])).long()
    '''
    _,_,_,train_pt_eta_bins = binned_statistic_2d(trainset[:,0],trainset[:,1],None,'count',bins=(50,50),range=((-2.5,2.5),(20,1000)),expand_binnumbers=True)
    train_eta_bins = train_pt_eta_bins[0]-1
    train_pt_bins = train_pt_eta_bins[1]-1
    train_all_weights = flavour_lookuptables[train_targets,train_eta_bins,train_pt_bins]
    train_weights = train_all_weights/sum(train_all_weights)
    train_all_weights_flat = flavour_lookuptables_flat[train_targets,train_eta_bins,train_pt_bins]
    train_weights_flat = train_all_weights_flat/sum(train_all_weights_flat)
    np.save(f'/hpcwork/um106329/may_21/scaled_{sample}/train_pt_eta_bins_%d_with_default_{default}.npy' % s,train_pt_eta_bins.astype(np.ubyte))
    np.save(f'/hpcwork/um106329/may_21/scaled_{sample}/train_sample_weights_%d_with_default_{default}.npy' % s,train_weights)
    np.save(f'/hpcwork/um106329/may_21/scaled_{sample}/train_sample_weights_flat_%d_with_default_{default}.npy' % s,train_weights_flat)
    del train_pt_eta_bins
    del train_eta_bins
    del train_pt_bins
    del train_all_weights
    del train_weights
    gc.collect()
    '''
    # the indices have been retrieved before the scaling happened (because afterwards, the values will be different and not be placed in the bins defined during the calculation
    # of the weights)
    
    test_inputs = torch.Tensor(testset[:,0:67])                                                
    #test_targets = (torch.Tensor(testset[:,-1])).long()        
    val_inputs = torch.Tensor(valset[:,0:67])
    #val_targets = (torch.Tensor(valset[:,-1])).long()
    train_inputs = torch.Tensor(trainset[:,0:67])
    #train_targets = (torch.Tensor(trainset[:,-1])).long()
    
    norm_train_inputs,norm_val_inputs,norm_test_inputs = train_inputs.clone().detach(),val_inputs.clone().detach(),test_inputs.clone().detach()
    #scalers = []
    
    # scalers are computed without defaulted values, but applied to all values
    if default == 999:
        for i in range(0,67): # do not compute scalers with default values, which were set to -999
            #scaler = StandardScaler().fit(train_inputs[:,i][train_inputs[:,i]!=-999].reshape(-1,1))
            scaler = torch.load(f'/hpcwork/um106329/june_21/scaler_{i}_with_default_{default}.pt')
            norm_train_inputs[:,i][train_inputs[:,i]!=-999]   = torch.Tensor(scaler.transform(train_inputs[:,i][train_inputs[:,i]!=-999].reshape(-1,1)).reshape(1,-1))
            norm_val_inputs[:,i][val_inputs[:,i]!=-999]	  = torch.Tensor(scaler.transform(val_inputs[:,i][val_inputs[:,i]!=-999].reshape(-1,1)).reshape(1,-1))
            norm_test_inputs[:,i][test_inputs[:,i]!=-999]     = torch.Tensor(scaler.transform(test_inputs[:,i][test_inputs[:,i]!=-999].reshape(-1,1)).reshape(1,-1))
            #scalers.append(scaler)
    else:
        for i in range(0,67): # do not compute scalers with default values, which were set to minima-default
            #scaler = StandardScaler().fit(train_inputs[:,i][train_inputs[:,i]!=defaults[i]].reshape(-1,1))
            scaler = torch.load(f'/hpcwork/um106329/june_21/scaler_{i}_with_default_{default}.pt')
            norm_train_inputs[:,i]   = torch.Tensor(scaler.transform(train_inputs[:,i].reshape(-1,1)).reshape(1,-1))
            norm_val_inputs[:,i]       = torch.Tensor(scaler.transform(val_inputs[:,i].reshape(-1,1)).reshape(1,-1))
            norm_test_inputs[:,i]     = torch.Tensor(scaler.transform(test_inputs[:,i].reshape(-1,1)).reshape(1,-1))
            #scalers.append(scaler)
    
    
    train_inputs = norm_train_inputs.clone().detach().to(torch.float16)
    val_inputs = norm_val_inputs.clone().detach().to(torch.float16)
    test_inputs = norm_test_inputs.clone().detach().to(torch.float16)
    
    
    torch.save(train_inputs, f'/hpcwork/um106329/june_21/scaled_{sample}/train_inputs_%d_with_default_{default}.pt' % s)
    torch.save(val_inputs, f'/hpcwork/um106329/june_21/scaled_{sample}/val_inputs_%d_with_default_{default}.pt' % s)
    torch.save(test_inputs, f'/hpcwork/um106329/june_21/scaled_{sample}/test_inputs_%d_with_default_{default}.pt' % s)
    #torch.save(train_targets, f'/hpcwork/um106329/may_21/scaled_{sample}/train_targets_%d_with_default_{default}.pt' % s)
    #torch.save(val_targets, f'/hpcwork/um106329/may_21/scaled_{sample}/val_targets_%d_with_default_{default}.pt' % s)
    #torch.save(test_targets, f'/hpcwork/um106329/may_21/scaled_{sample}/test_targets_%d_with_default_{default}.pt' % s)
    #torch.save(scalers, f'/hpcwork/um106329/june_21/scaled_{sample}/scalers_%d_with_default_{default}.pt' % s)
    
    del train_inputs
    del val_inputs
    del test_inputs
    del train_targets
    del val_targets
    del test_targets
    del scaler
    #del scalers
    del trainset
    del testset
    del valset
    gc.collect() 
```
and the part of the code where this function is called:

```python
...
else:
    for s in range(NUM_DATASETS): #range(1,49):
        preprocess(np.load(dataset_paths[s]), np.load(DeepCSV_paths[s]), s)
```
Here the idea is that only TT or QCD, but not both processes are preprocessed per step. They should not be mixed, but the user does not have to modify the code, it will be given by the `prepstep` argument of the parser. More on executing the preprocessing will come later.

To understand what happens, you can again look at the code that actually calls the `preprocess` function, starting in line 280. There is a loop that iterates over all the files that have been identified with a given process (say, TT, where there would be `len(ttstarts)=49` paths to the cleaned files, and in the TT case, this is also stored in the `NUM_DATASETS` variable). For every cleaned file, the preprocessing is run with the inputs+targets stored under `dataset_paths` as well as the DeepCSV outputs. The counting variable `s` is used to create individual output paths for the preprocessing step, so that already existing files are not overwritten.

If we now go into the `preprocess` function itself, you will see it in the state in which I left it after several iterations. This is not the initial configuration of the uncommented code, which we will now restore together by walking through the lines. The reason is again that more and more functionality was added over time, while files that had already been stored did not have to be stored a second, third, ... time. That's why you see some code inside triple-quoted strings.

Starting at line 141, the already introduced `train_test_split` function is used. Now it takes two different arrays as inputs, but the good thing is that the splits will match exactly, meaning the same indices are used for both the custom `dataset` and the `DeepCSV_dataset`, so you can actually compare the correct jets with each other later when evaluating your tagger.
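
A quick way to convince yourself that the two arrays stay aligned is the following sketch. Both arrays are toy stand-ins, and their first column carries a jet index purely so that rows can be matched afterwards.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 50
# Toy stand-ins: column 0 of each array carries the jet index so rows can be matched.
dataset = np.column_stack([np.arange(n), np.random.default_rng(1).normal(size=n)])
deepcsv = np.column_stack([np.arange(n), np.random.default_rng(2).uniform(size=n)])

# Return order: first array (train, test), then second array (train, test).
trainingset, testset, deepcsv_train, deepcsv_test = train_test_split(
    dataset, deepcsv, test_size=0.2, random_state=1
)

# Both test sets contain the same jets in the same order.
print(np.array_equal(testset[:, 0], deepcsv_test[:, 0]))
```

This is also why the script can unpack only the first, second and fourth return value and discard the third with an underscore.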

Line 142 is one example that needs to be uncommented: just delete the `#` in front of the line, which is used to store the DeepCSV test data. But make sure to update the path to your own one. This is actually a good time to create the directories that will hold the preprocessed samples and auxiliary files. To use the same naming scheme as shown here, you'd have to run something like `mkdir /hpcwork/<your ID>/<month>_21/scaled_TT` and the respective QCD version: `mkdir /hpcwork/<your ID>/<month>_21/scaled_QCD`. To free memory, the next line clears the variable that held the DeepCSV outputs. Actually, the whole DeepCSV array could also be 'deleted' here, as we are not using it any further. You might want to add another line to do that, but it's not strictly necessary; the code already ran fine on the HPC without exhausting the available memory. Something similar to the step described above happens when splitting a second time, now between training and validation data. Memory is freed here as well, this time by deleting the `trainingset`, which consisted of both the actual training data and the validation data.

You might want to read the comments that explain a bit the choice of datatypes, it's not repeated here.

Then going to the different sets of data: in line 153, the last column of the `testset` is used to create a pytorch tensor (this will be the format that gets inserted into the pytorch model later), and as this contains classes, the datatype must be something that stores whole numbers. In the version of pytorch with which I coded everything, probably also in newer versions, the datatype is `long`. Then you see (in total) three blocks of code, separated by the extraction of validation and training targets, which you'd need to uncomment. They are all relevant when running the preprocessing yourself. And as you can see, you need to update many paths again to your own ones.

I'll explain the idea for the first of the three blocks; the other two are similar, only for other parts of the data. Look at line 155, for example. With the `binned_statistic_2d` function of scipy, we figure out into which bins the $\eta$ and $p_\text{T}$ of a given jet in the test set fall. The ranges are adapted to what we defined during cleaning, and the number of bins matches the definitions inside the reweighting notebook mentioned earlier. `None` means we do not pass additional values, because only the binning is of interest for us. This is also reflected by the parameter `count`, but not even the exact counts per bin are of relevance, only the fourth return value of the function, which we named `test_pt_eta_bins`. To get this in the exact same shape as our binned distributions, `expand_binnumbers` has to be set to `True`. Whatever we do not need from the returns we just don't save, that's why you see the three leading underscores. The next two lines restore zero-indexing of the bins.

Then, inside our lookup tables created much earlier, we look up the correct weight per sample in a vectorized form. Three things define which weight will be used: first the true flavour, second the `eta_bin` and third the `pt_bin`. Weights are normalized to take into account the splitting into the three sets. This is not strictly necessary, as the normalization happens during training (for the training set), but it should also do no harm. The weights are read for the two cases, averaging among flavours and creating flat distributions, as explained in the thesis (section 2.3). Then we can finally save something, namely the bins (in case we need to look them up later without redoing the preprocessing) and the two sample weight arrays explained above. After that, we get rid of everything that does not need to persist in memory. Everything is then repeated for the validation and training sets.
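
The binning and lookup can be tried out in isolation. The sketch below uses three made-up jets and a hypothetical `lookup` array filled with consecutive numbers in place of the real weights; only the call to `binned_statistic_2d` and the indexing mirror the script.

```python
import numpy as np
from scipy.stats import binned_statistic_2d

# Three toy jets (made-up values, chosen away from bin edges).
eta = np.array([-2.45, 0.05, 2.45])
pt = np.array([30.0, 500.0, 990.0])
targets = np.array([0, 2, 3])   # flavour index per jet

# Hypothetical lookup tables with shape (flavour, eta bin, pt bin).
lookup = np.arange(1, 4 * 50 * 50 + 1, dtype=float).reshape(4, 50, 50)

# Only the fourth return value (the bin numbers) is needed.
_, _, _, bins = binned_statistic_2d(
    eta, pt, None, "count", bins=(50, 50),
    range=((-2.5, 2.5), (20, 1000)), expand_binnumbers=True,
)
eta_bins = bins[0] - 1          # scipy numbers the in-range bins starting at 1
pt_bins = bins[1] - 1

# Vectorized lookup: one weight per jet, then normalize to unit sum.
all_weights = lookup[targets, eta_bins, pt_bins]
weights = all_weights / all_weights.sum()
print(eta_bins, pt_bins, weights)
```

Jets outside the given ranges would receive bin numbers 0 or 51 from scipy, which is one reason why the ranges have to agree with the cuts applied during cleaning.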

Once we have that, we care about the inputs themselves. First, they are all transformed into a format that can be read by the pytorch neural network, so those are now `Tensor` variables. You don't need to uncomment the lines in between, where the targets would be defined, this is also old and has been translated to the above code that concerns the weights, which already depend on the targets. More interesting is again the next line 220: there, we create actual copies of the inputs, because we need both the original data in memory, as well as something to store the normalized data.

Line 224 is a bit unnecessary (as is the whole block ending at line 232) because we are not dealing with the 999 defaults anymore. It has just not been updated, because it does not interfere with what we want to do. So simply ignore it and only look at the loop starting at line 233. This goes through all features (there are 67), loads the corresponding scalers (assume we calculated them in an earlier step, which you will do soon enough), and transforms the data according to those scalers, to finally update the column with new, scaled Tensors. You can leave lines 234 and 239 as they are, because we already have the scalers and there is no need to calculate them again. The same goes for line 221: the `#` sign can stay there, although creating an empty list is not as problematic as overwriting your *correct* scalers with *wrong* ones (wrong, because they would only be calculated based on a single file. I made that mistake earlier, but not again. I even have a sketch that explains why this is wrong, it's Fig. 2.5 in the thesis).
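
One consequence of this loop is worth a tiny illustration: the scaler is fitted without the defaults but applied to every entry, so all defaulted entries of a feature end up at one common scaled value. The numbers below, including `toy_default`, are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_default = -0.001   # hypothetical default of one feature
column = np.array([0.2, 0.5, toy_default, 0.9, toy_default, 0.4])

# The scaler was fitted without the defaulted entries ...
scaler = StandardScaler().fit(column[column != toy_default].reshape(-1, 1))

# ... but it is applied to every entry, defaults included.
scaled = scaler.transform(column.reshape(-1, 1)).reshape(-1)

# All defaults end up at the same scaled value, (default - mu) / sigma.
scaled_default = (toy_default - scaler.mean_[0]) / scaler.scale_[0]
print(scaled_default, scaled[column == toy_default])
```

In the real workflow the scaled values are additionally stored with reduced precision, so an exact comparison against `scaled_default` is not guaranteed to work there.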

In a slightly inefficient way, the newly scaled arrays are now (starting at line 242) copied back into the old placeholders, combined with a conversion to a smaller datatype than the standard one: the files are smaller if one uses float16 instead of float32. However, there are some caveats that have to be kept in mind when plotting the variables with very fine binning. At some point, the high granularity reveals that we reduced the precision of our inputs, which is of course bad, but was necessary to fit everything into memory. Anyway, with this limitation, the next lines just store the scaled inputs, and as you might notice, lines 250 to 252 need to be uncommented to store the targets as well. Don't change line 253, we already have the scalers saved. But as always, whenever something is saved or loaded, there needs to be a valid path to your own file system.
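
To get a feeling for the precision that is lost, here is a sketch that uses numpy's `float16` as a stand-in for `torch.float16` (an assumption made only to keep the example free of pytorch; both are the same half-precision format). The input values are a made-up, finely spaced grid.

```python
import numpy as np

# Finely spaced toy values in a typical range of scaled inputs.
x32 = np.linspace(-3.0, 3.0, 10001, dtype=np.float32)
x16 = x32.astype(np.float16)

# Largest rounding error introduced by the conversion.
max_err = np.max(np.abs(x16.astype(np.float32) - x32))

# Distinct values before and after: float16 merges neighbouring values.
n32, n16 = len(np.unique(x32)), len(np.unique(x16))
print(max_err, n32, n16)
print(x32.nbytes, x16.nbytes)   # memory footprint is halved
```

The merged values are what shows up as a comb-like structure in histograms with very fine binning.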

## Perform preprocessing yourself <a name="perform"></a>
Assuming you implemented all the changes mentioned above, you can finally proceed with the preprocessing.
### Calculating scalers <a name="scalers"></a>
The first time you run the preprocessing, you will use the `calc_scalers` option, and you want to define the variables to be used in one iteration. Sure, you could automate this a bit yourself by creating a batch job (or several jobs, actually), but with the same trick as for the cleaning, logging into several nodes at a time will do.

So assume you are at one of those nodes now, have activated your conda environment and want to start the calculation of the scalers for some features there. You'd type
```
python prep_new.py 0.001 calc_scalers -s 0 -e 5
```
and do something similar with all other features, packaged into something like 6 features at a time. Although there are 67 features in total, the last one gets index 66 (due to zero-indexing, which you should always remember).
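
If you do not want to write out the individual calls by hand, a few lines of Python can print them. This is only a convenience sketch; the chunk size of 6 is taken from the suggestion above and can be changed freely.

```python
n_features = 67
chunk = 6   # assumed chunk size, as suggested in the text

commands = []
for start in range(0, n_features, chunk):
    end = min(start + chunk - 1, n_features - 1)   # inclusive end index, at most 66
    commands.append(f"python prep_new.py 0.001 calc_scalers -s {start} -e {end}")

for cmd in commands:
    print(cmd)
```

Each printed line can then be pasted into a separate node, the last one covering only the single remaining feature with index 66.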
### Apply preprocessing to QCD and TT <a name="apply"></a>
Next up, with all scalers available, you can apply them along with the other preprocessing steps in (almost) one go, it's just split between TT and QCD to avoid overwriting. So you need to do
```
python prep_new.py 0.001 apply_to_TT
```
as well as
```
python prep_new.py 0.001 apply_to_QCD
```
and that should be it for the preprocessing. These two things can of course be split also between nodes to not waste time with waiting and blocking resources for a long time on a single node.

## Some tasks <a name="tasks"></a>
Now you should have produced the cleaned, as well as scaled inputs. Here are some ideas how to investigate them a bit further:
- Plot some histograms of the scaled features
- Compare those with the original (cleaned) features. You can even try to use the `.inverse_transform` function of the scalers to go back from scaled to original units and then compare with the ingoing numpy arrays as a cross-check
- If you want to add a useful addition to the preprocessing script: it would be nice to have some sort of boolean information per jet per feature if the feature is at a default value or not - because you will see soon enough that accessing the information (default / not default) will be based on scaled quantities, and they come with floating point errors, which is very inconvenient when you want to exclude them from the attacks. You can also come back to this later if you first want to "*quickly*" reproduce my results. 
