# Data Preprocessing


## 1. Downsize Samples
Limited computing power may force you to train on a smaller sample. Load_Downsize_SaveAsH5.py creates a downsized sample that your computer can handle. For example:



In [1]:
filePath = ' '                     # change this to your data path before use
features = ['j_index','j1_phirel','j1_etarel','j1_pt','j1_e','j1_ptrel'] 
labels = ['j_g','j_q','j_w','j_z','j_t']
# ratio=[.25,.25,.25,.25,1]
size = 15000                       # number of each kind of jets
seed = 42                          # random seed
N_par = 20                         # number of constituents in a jet

from Data_Preprocessing import Load_Downsize_SaveAsH5 as cvt
cvt.LoadTransSave(filePath,features, labels,size=size,seed=seed)

You can change these parameters. The ratio parameter (commented out above) selects the task: a top tagger or a 5-class tagger.
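The internals of Load_Downsize_SaveAsH5.py are not reproduced here. As a rough illustration of what class-balanced downsizing with a fixed seed can look like, the sketch below draws at most `size` jets per class from an array of integer class ids. The function name `balanced_subsample` and the integer-id label encoding are assumptions made for this example, not the script's actual interface.

```python
import numpy as np

def balanced_subsample(class_ids, size, seed=42):
    """Return sorted indices holding at most `size` jets of each class.

    Hypothetical helper for illustration; not part of Load_Downsize_SaveAsH5.py.
    """
    class_ids = np.asarray(class_ids)
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(class_ids):
        idx = np.flatnonzero(class_ids == c)      # all jets of class c
        n_keep = min(size, len(idx))              # never ask for more than exist
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Toy input: 5 classes with 100 jets each (assumed encoding 0..4).
toy_ids = np.repeat(np.arange(5), 100)
picked = balanced_subsample(toy_ids, size=30, seed=42)
```

Fixing the seed makes the selection reproducible, so two runs with the same arguments pick the same jets.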

### Exercise
Load_Downsize_SaveAsH5.py creates an .h5 file. Read through the code and figure out how it works. In particular, remember to change filePath before use!

## 2. Understand Data
Unlike DGCNN, ParticleNet takes a dictionary with three keys, {"mask","features","points"}, as input. Below, N is the number of jets, P is the number of particles per jet (N_par), and C is the number of features. The shapes follow the layout produced by h5_to_data in Section 3.
- Mask: an array of shape (N, P, 1) that indicates whether a position holds a real particle or a zero-padded value.
- Features: an array of shape (N, P, C) that holds the features used for classification.
- Points: an array of shape (N, P, 2) that holds the coordinates of the particles in (eta, phi) space.
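To make the three keys concrete, here is a minimal sketch that builds a toy input dictionary from random numbers. The axis order (jets, particles, channels) follows the arrays that h5_to_data in Section 3 builds; the values, the jet multiplicities, and the 0/1 mask are invented for illustration.

```python
import numpy as np

N, P, C = 4, 20, 4                      # jets, particles per jet (N_par), features

rng = np.random.default_rng(0)
n_real = np.array([20, 7, 12, 3])       # toy number of real particles in each jet
# occupied[j, p] is True when slot p of jet j holds a real particle
occupied = np.arange(P)[None, :] < n_real[:, None]

toy_inputs = {
    # zero out the padded slots by broadcasting the (N, P, 1) occupancy
    "points": rng.normal(size=(N, P, 2)) * occupied[..., None],
    "features": rng.normal(size=(N, P, C)) * occupied[..., None],
    "mask": occupied[..., None].astype(float),
}

for key, arr in toy_inputs.items():
    print(key, arr.shape)
```

Every array shares the first two axes, so slot p of jet j refers to the same particle in all three entries.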


## 3. Prepare Data

In [2]:
# The 4 features we are going to feed to the model.
features_list = ['j1_phirel','j1_etarel','j1_ptLog','j1_eLog']
# Map each column name to its index, so that we can look up a column by name instead of searching the list.
cols = dict(zip(features_list+['constituents_index','j_index'], range(len(features_list)+2)))
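As a quick illustration of how the (name, index) mapping is used, the sketch below rebuilds the same dictionary and reads values from one row by column name. The row values are made up for this example.

```python
features_list = ['j1_phirel', 'j1_etarel', 'j1_ptLog', 'j1_eLog']
column_names = features_list + ['constituents_index', 'j_index']
cols = {name: i for i, name in enumerate(column_names)}

# A toy row laid out in the same column order (invented values).
row = [0.1, -0.2, 3.5, 4.0, 7, 12]
eta, phi = row[cols['j1_etarel']], row[cols['j1_phirel']]

print(cols['constituents_index'], cols['j_index'])   # 4 5
print(eta, phi)                                      # -0.2 0.1
```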

In [5]:
def h5_to_data(h5path):
    Data = {'mask':[], 'points':[], 'features':[],'label':[]}
    # Read one array per column in cols; the file is closed when the block ends.
    with h5py.File(h5path,'r') as f:
        raw_data = np.array([f[col][()] for col in cols])
        label_arr = f['label'][()]
    raw_data = raw_data.transpose((1,0))                                        # one row per constituent
    mask, features, points = np.zeros((N_par,1)), np.zeros((N_par,len(features_list))), np.zeros((N_par,2)) # prepare constituents list
    n_rows = len(raw_data)
    for i in range(n_rows):
        cIndex = int(raw_data[i][cols['constituents_index']])
        if cIndex < N_par:                                                      # drop constituents beyond N_par
            mask[cIndex] = np.array([raw_data[i][cols['j1_ptLog']]])            # the mask stores ptLog, so padded slots stay 0
            # alternative: mask[cIndex] = np.array([1])
            points[cIndex] = np.array([raw_data[i][cols['j1_etarel']],raw_data[i][cols['j1_phirel']]])
            features[cIndex] = np.array([raw_data[i][cols[feat]] for feat in features_list])

        # Save the jet on its last row: either j_index changes on the next row or the data ends.
        # This check also runs for dropped constituents, so long jets and the final jet are not lost.
        if i == n_rows-1 or raw_data[i][cols['j_index']] != raw_data[i+1][cols['j_index']]:
            Data['mask'].append(mask)
            Data['points'].append(points)
            Data['features'].append(features)
            Data['label'].append(label_arr[i])
            mask, features, points = np.zeros((N_par,1)), np.zeros((N_par,len(features_list))), np.zeros((N_par,2))
    y = Data.pop('label')
    return Data, y

In [6]:
import numpy as np
import h5py
import pandas as pd
import tensorflow as tf
from keras.losses import categorical_crossentropy
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.metrics import roc_curve, auc

h5Path = "data_500jets_5labels.h5"
Data,y = h5_to_data(h5Path)
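h5_to_data returns Python lists of per-jet arrays, so a quick sanity check before training can be useful. The sketch below counts the non-zero mask entries of each jet. It assumes that a padded slot is exactly zero in the mask, as in the zero-initialised arrays above; the helper name `real_particle_counts` and the toy values are hypothetical.

```python
import numpy as np

def real_particle_counts(mask_list):
    """Count the non-zero mask entries of each jet (hypothetical helper)."""
    mask = np.asarray(mask_list)                 # stacks to (N, P, 1)
    return np.count_nonzero(mask[..., 0], axis=1)

# Toy stand-in for Data['mask']: 3 jets with 5 slots each.
toy_mask = [np.zeros((5, 1)) for _ in range(3)]
toy_mask[0][:2] = 1.7                            # jet 0 has 2 real particles
toy_mask[1][:5] = 0.9                            # jet 1 fills all 5 slots
counts = real_particle_counts(toy_mask)
print(counts)                                    # [2 5 0]
```

With the real data, the call would be `real_particle_counts(Data['mask'])`. A jet with a count of 0 is worth a closer look, since it carries no particles at all.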


## 4. Train Test Split
As usual, the default split ratio is Train:Validation:Test = 6:2:2, and the default random seed is 42.
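A 6:2:2 split is usually done in two stages: first hold out validation plus test, then divide the held-out part between the two. The arithmetic can be sketched as follows; the function name `two_stage_split_sizes` is introduced only for this example.

```python
def two_stage_split_sizes(rateval=0.2, ratetest=0.2):
    """Return the two fractions needed for a two-stage split.

    holdout:    fraction taken from the full sample in the first stage
    test_share: fraction of the held-out part that becomes the test set
    """
    holdout = rateval + ratetest
    test_share = ratetest / holdout
    return holdout, test_share

holdout, test_share = two_stage_split_sizes()

# Resulting fractions of the full sample.
train_frac = 1 - holdout
val_frac = holdout * (1 - test_share)
test_frac = holdout * test_share
print(train_frac, val_frac, test_frac)
```

With the defaults, 40% of the jets are held out and half of those go to the test set, which gives the 6:2:2 ratio.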

In [7]:
def splitData(Data, y, rateval=.2, ratetest=.2, seed=42):
    features_train, features_test, features_val={},{},{}
    from sklearn.model_selection import train_test_split
    mask = Data["mask"]
    features = Data["features"]
    points = Data["points"]
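    X_ind = list(range(len(y)))
    # First hold out validation+test, then divide the held-out part between the two.
    X_train, X_ind, y_train, y_ind = train_test_split(X_ind, y, test_size=rateval+ratetest, random_state=seed)
    # test_size is the share of the held-out part that goes to the test set.
    X_val, X_test, y_val, y_test = train_test_split(X_ind, y_ind, test_size=(ratetest/(rateval+ratetest)), random_state=seed)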
    
    features_train['mask']=np.array([mask[i] for i in X_train])
    features_train['features']=np.array([features[i] for i in X_train])
    features_train['points']=np.array([points[i] for i in X_train])
    
    features_test['mask']=np.array([mask[i] for i in X_test])
    features_test['features']=np.array([features[i] for i in X_test])
    features_test['points']=np.array([points[i] for i in X_test])
    
    features_val['mask']=np.array([mask[i] for i in X_val])
    features_val['features']=np.array([features[i] for i in X_val])
    features_val['points']=np.array([points[i] for i in X_val])
    
    return features_train, features_val, features_test, np.array(y_train), np.array(y_val), np.array(y_test)

In [8]:
X_train, X_val, X_test, y_train, y_val, y_test = splitData(Data,y)
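After splitting, it is worth confirming that every input array still lines up with its labels. The following is a minimal sketch of such a check on toy arrays; `check_split` and `make_toy` are hypothetical helpers, and the 5-column label array is an assumption made for the toy data.

```python
import numpy as np

def check_split(X, y):
    """Return the jet count after checking each input array against the labels."""
    n = len(y)
    for key, arr in X.items():
        if len(arr) != n:
            raise ValueError(f"{key} has {len(arr)} jets, labels have {n}")
    return n

def make_toy(n, P=20, C=4):
    """Build an all-zero (inputs, labels) pair with n jets, for illustration only."""
    X = {"mask": np.zeros((n, P, 1)),
         "features": np.zeros((n, P, C)),
         "points": np.zeros((n, P, 2))}
    return X, np.zeros((n, 5))

toy_parts = [make_toy(6), make_toy(2), make_toy(2)]   # train, val, test
sizes = [check_split(X, y) for X, y in toy_parts]
fractions = [s / sum(sizes) for s in sizes]
print(sizes, fractions)
```

With the real split, the same check would be `check_split(X_train, y_train)`, and likewise for the validation and test sets.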

### Exercise
1. Prepare a dataset for top tagging.
2. You always want to maximize your sample size. Find a value of N_par that avoids both losing too much information and adding too much zero-padding.
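One possible starting point for the second exercise is to compare candidate values of N_par by how many constituents they keep and how many slots they leave padded. The sketch below does this on randomly generated multiplicities; the helper `n_par_coverage` and the toy numbers are assumptions for illustration. It keeps the first N_par constituents of each jet, which matches how h5_to_data drops constituents with an index of N_par or more.

```python
import numpy as np

def n_par_coverage(n_constituents, candidates):
    """Map each candidate N_par to (fraction of constituents kept, fraction of padded slots)."""
    n = np.asarray(n_constituents)
    report = {}
    for n_par in candidates:
        kept = np.minimum(n, n_par).sum()           # constituents that fit in n_par slots
        kept_frac = kept / n.sum()
        padded_frac = 1 - kept / (n_par * len(n))   # empty slots over all slots
        report[n_par] = (kept_frac, padded_frac)
    return report

# Toy multiplicities: 1000 jets with 5 to 59 constituents each (invented).
rng = np.random.default_rng(42)
toy_counts = rng.integers(5, 60, size=1000)
report = n_par_coverage(toy_counts, [10, 20, 40, 60])
for n_par, (kept_frac, padded_frac) in report.items():
    print(n_par, round(kept_frac, 3), round(padded_frac, 3))
```

With real data, the multiplicities could be obtained with `np.unique(j_index, return_counts=True)[1]`, assuming each jet has a distinct j_index value.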