# (2) Data Structure



An HDF5 data container is a standardized, highly-customizable data receptacle designed for portability.
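As a rough illustration of what such a container holds, the sketch below writes a tiny table to an HDF5 file and reads it back with h5py. The file name `toy.h5`, the dataset name `jets`, and all values are made up for this example; only the field names are borrowed from the physics sample.

```python
import os
import tempfile

import h5py
import numpy as np

# Toy structured array standing in for a jet table (values are made up).
toy = np.array(
    [(-2.9, 33.0, 0), (-3.1, 63.0, 1)],
    dtype=[("j_zlogz", "f4"), ("j_multiplicity", "f4"), ("j_t", "i4")],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.h5")  # hypothetical file name

    # Write: a file can hold several named datasets; here we store one.
    with h5py.File(path, "w") as fout:
        fout.create_dataset("jets", data=toy)  # "jets" is a hypothetical name

    # Read: the context manager closes the file even if an error occurs.
    with h5py.File(path, "r") as fin:
        dataset_names = list(fin.keys())
        loaded = fin["jets"][()]  # copy the whole dataset into memory

print(dataset_names)
print(loaded.dtype.names)
```

Opening the file in a `with` block is a design choice for this sketch: it guarantees the file handle is released, which matters when the same file is reopened later.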

Our physics data samples will be provided in h5 format. Get access to it [here](https://drive.google.com/drive/u/3/folders/1j5iMgARqRhtrMmVHDfM9o2XRTXXlZgcI).

## Accessing the Data: Part 1

Navigate to the directory where you've saved the processed-pythia.z data file. Mine is saved in 'Deep-Learning-Model-Evaluation'.

In [1]:
import os
import h5py

In [2]:
os.chdir('C:\\Users\\alexc\\OneDrive\\Projects\\Deep-Learning-Model-Evaluation\\1 layer NN')

In [3]:
f = h5py.File('../processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z', 'r')

In [4]:
f.keys()

<KeysViewHDF5 ['t_allpar_new']>

In [5]:
treeArray = f['t_allpar_new'][()] #Empty tuple indexing retrieves all values
print(treeArray.dtype.names)

('j_ptfrac', 'j_pt', 'j_eta', 'j_mass', 'j_tau1_b1', 'j_tau2_b1', 'j_tau3_b1', 'j_tau1_b2', 'j_tau2_b2', 'j_tau3_b2', 'j_tau32_b1', 'j_tau32_b2', 'j_zlogz', 'j_c1_b0', 'j_c1_b1', 'j_c1_b2', 'j_c2_b1', 'j_c2_b2', 'j_d2_b1', 'j_d2_b2', 'j_d2_a1_b1', 'j_d2_a1_b2', 'j_m2_b1', 'j_m2_b2', 'j_n2_b1', 'j_n2_b2', 'j_tau1_b1_mmdt', 'j_tau2_b1_mmdt', 'j_tau3_b1_mmdt', 'j_tau1_b2_mmdt', 'j_tau2_b2_mmdt', 'j_tau3_b2_mmdt', 'j_tau32_b1_mmdt', 'j_tau32_b2_mmdt', 'j_c1_b0_mmdt', 'j_c1_b1_mmdt', 'j_c1_b2_mmdt', 'j_c2_b1_mmdt', 'j_c2_b2_mmdt', 'j_d2_b1_mmdt', 'j_d2_b2_mmdt', 'j_d2_a1_b1_mmdt', 'j_d2_a1_b2_mmdt', 'j_m2_b1_mmdt', 'j_m2_b2_mmdt', 'j_n2_b1_mmdt', 'j_n2_b2_mmdt', 'j_mass_trim', 'j_mass_mmdt', 'j_mass_prun', 'j_mass_sdb2', 'j_mass_sdm1', 'j_multiplicity', 'j_g', 'j_q', 'j_w', 'j_z', 'j_t', 'j_undef')
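The names printed above are the fields of a NumPy structured array. The following minimal sketch, which uses a toy array with made-up values rather than the real `treeArray`, shows the usual ways to select fields and rows from such an array.

```python
import numpy as np

# Toy stand-in for treeArray (values are illustrative, not from the real sample).
toy_tree = np.array(
    [(-2.90, 79.5, 33.0, 0), (-3.11, 81.1, 63.0, 1), (-3.36, 97.9, 60.0, 1)],
    dtype=[("j_zlogz", "f8"), ("j_mass_mmdt", "f8"),
           ("j_multiplicity", "f8"), ("j_t", "i4")],
)

# One field name returns a plain 1-D array with one value per jet.
masses = toy_tree["j_mass_mmdt"]

# A list of field names returns a structured array restricted to those fields.
subset = toy_tree[["j_zlogz", "j_t"]]

# A boolean mask selects rows (jets); here, the jets labelled as top.
top_jets = toy_tree[toy_tree["j_t"] == 1]

print(masses.shape, subset.dtype.names, len(top_jets))
```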


## Understanding the Data
The structure of a jet is introduced in tutorial 12 from the computing tutorial. It is very important to understand the structure of the sample in order to see the applications of different network types. 

In our pre-processed sample, each row is a jet with the above features. The two- and four-layer networks use the features of the jet to classify whether or not the jet originated from the decay of a top quark.

The features/inputs we will use are:

|    Features         |    Labels     |
|  :-------:          |  :---------:  |
|  j_zlogz            |  j_t          |
|  j_c1_b0_mmdt       |               |
|  j_c1_b1_mmdt       |               |
|  j_c2_b1_mmdt       |               |
|  j_d2_b1_mmdt       |               |
|  j_d2_a1_b1_mmdt    |               |
|  j_m2_b1_mmdt       |               |
|  j_n2_b1_mmdt       |               |
|  j_mass_mmdt        |               |
|  j_multiplicity     |               |

Our label, j_t, records whether the jet is a top quark jet (1) or not (0). A training sample contains both features and labels.

Not all networks use the features of already-clustered jets to identify the elementary particle from which a jet originated. The best-performing ones are usually either _constituent_-based or _image_-based classifiers.

[ResNet-50](https://arxiv.org/pdf/1512.03385.pdf) is an example of an image-based classifier which achieves state-of-the-art performance, with an input layer populated by the pixels in a _jet image_. The sensors on the inside surface of the cylindrical detector can be represented as a two-dimensional histogram in $\eta$ and $\phi$, and the activation of each pixel is the energy or transverse momentum deposited in the corresponding sensors. This choice of input influences the architecture of the network as well as its performance: it requires immense amounts of data and computational power to train, since training time grows with the number of neurons the network needs to train (and with a 244x244 pixel image, that's an input layer of nearly 60,000 neurons).
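To make the idea of a jet image concrete, here is a minimal sketch that bins the constituents of a single jet into an $\eta$-$\phi$ histogram weighted by transverse momentum. The constituent values are randomly generated, and the image extent of $\pm 1$ around the jet axis is an assumption made for illustration; only the 244x244 size is taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical constituents of one jet: positions relative to the jet axis, and pT.
n_constituents = 40
eta = rng.normal(0.0, 0.3, n_constituents)
phi = rng.normal(0.0, 0.3, n_constituents)
pt = rng.exponential(5.0, n_constituents)

n_pixels = 244  # image side length quoted in the text

# Each pixel accumulates the pT of the constituents that fall inside it.
image, eta_edges, phi_edges = np.histogram2d(
    eta, phi, bins=n_pixels, range=[[-1.0, 1.0], [-1.0, 1.0]], weights=pt
)

# Flattening the image gives one input value per pixel.
flat_input = image.ravel()
print(image.shape, flat_input.size)
```

Most pixels in such an image are empty, which is one reason image-based classifiers tend to need large training samples.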

[Long Short-Term Memory](https://arxiv.org/pdf/1711.09059.pdf) (more [here](https://www.sciencedirect.com/science/article/pii/S0167278919305974)) is an example of a network which excels at constituent-based classification. The jets in our data sample come pre-clustered and pre-processed; with constituent-based classifiers, the particles composing the jet are each analyzed individually. Constituent-based classification makes sense for a recurrent neural network like LSTM, as the previous constituent ought to influence the classification of the next constituent of the same jet. With a relatively small list of inputs and a simple network architecture, the LSTM network achieves very good performance with O(1000) neurons, which makes it very quick to train and means it requires comparatively few data points.
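Recurrent networks expect each jet as an ordered, fixed-length sequence of constituents. The sketch below shows one possible way to prepare such an input with NumPy; the ordering by descending transverse momentum, the maximum length, the three per-constituent features, and the helper `to_sequence` are all assumptions for illustration, not the method of the papers linked above.

```python
import numpy as np

rng = np.random.default_rng(1)

max_len = 5      # assumed maximum number of constituents kept per jet
n_features = 3   # assumed per-constituent features: pT, eta, phi

# Two hypothetical jets with different numbers of constituents.
jets = [rng.random((3, n_features)), rng.random((7, n_features))]

def to_sequence(constituents, max_len):
    """Sort constituents by descending pT (column 0), then truncate or zero-pad."""
    order = np.argsort(constituents[:, 0])[::-1]
    ordered = constituents[order][:max_len]
    padded = np.zeros((max_len, constituents.shape[1]))
    padded[: len(ordered)] = ordered
    return padded

# Resulting shape: (number of jets, sequence length, features per constituent).
batch = np.stack([to_sequence(jet, max_len) for jet in jets])
print(batch.shape)
```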

## Accessing the Data: Part 2

In [6]:
features = ['j_zlogz', 'j_c1_b0_mmdt', 'j_c1_b1_mmdt', 'j_c2_b1_mmdt', 'j_d2_b1_mmdt', 'j_d2_a1_b1_mmdt',
            'j_m2_b1_mmdt', 'j_n2_b1_mmdt', 'j_mass_mmdt', 'j_multiplicity']
labels = ['j_t']

In [7]:
features+labels

['j_zlogz',
 'j_c1_b0_mmdt',
 'j_c1_b1_mmdt',
 'j_c2_b1_mmdt',
 'j_d2_b1_mmdt',
 'j_d2_a1_b1_mmdt',
 'j_m2_b1_mmdt',
 'j_n2_b1_mmdt',
 'j_mass_mmdt',
 'j_multiplicity',
 'j_t']

In [8]:
import pandas as pd

In [9]:
features_labels_df = pd.DataFrame(treeArray,columns=features+labels)
features_labels_df = features_labels_df.drop_duplicates()

In [10]:
features_labels_df

Unnamed: 0,j_zlogz,j_c1_b0_mmdt,j_c1_b1_mmdt,j_c2_b1_mmdt,j_d2_b1_mmdt,j_d2_a1_b1_mmdt,j_m2_b1_mmdt,j_n2_b1_mmdt,j_mass_mmdt,j_multiplicity,j_t
0,-2.901162,0.462566,0.039364,0.035541,0.902895,0.902895,0.069127,0.215783,79.503227,33.0,0
1,-3.112807,0.460751,0.049826,0.039287,0.788490,0.788490,0.066843,0.200235,81.145767,63.0,1
2,-3.363088,0.474168,0.060443,0.094000,1.555191,1.555191,0.126817,0.380907,97.876595,60.0,1
3,-2.287620,0.383430,0.008700,0.017118,1.967673,1.967673,0.109983,0.369561,17.177235,38.0,0
4,-2.878532,0.453209,0.047442,0.036979,0.779445,0.779445,0.066729,0.200641,92.293953,50.0,1
...,...,...,...,...,...,...,...,...,...,...,...
986801,-2.687196,0.443691,0.063865,0.078020,1.221655,1.221655,0.088727,0.257189,123.400360,55.0,0
986802,-2.218838,0.372641,0.003415,0.004913,1.438412,1.438412,0.101022,0.316370,5.798234,28.0,0
986803,-2.652741,0.446208,0.040134,0.025503,0.635463,0.635463,0.043024,0.146011,79.840851,36.0,1
986804,-2.544672,0.392504,0.012692,0.027972,2.203917,2.203917,0.106957,0.386887,23.755823,58.0,0


The above dataframe represents ~1 million jets each with the listed features and labelled as top (1) or not (0).
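Before training, it can be useful to check how many rows `drop_duplicates` removed and how balanced the two classes are. The sketch below does this on a small made-up DataFrame; the same calls should apply to `features_labels_df`, although the numbers shown here say nothing about the real sample.

```python
import pandas as pd

# Toy table with one exact duplicate row (values are made up).
toy_df = pd.DataFrame({
    "j_mass_mmdt": [79.5, 81.1, 81.1, 17.2, 92.3],
    "j_multiplicity": [33.0, 63.0, 63.0, 38.0, 50.0],
    "j_t": [0, 1, 1, 0, 1],
})

# Rows that are identical in every column are collapsed to a single row.
deduped = toy_df.drop_duplicates()

# The mean of a 0/1 label column is the fraction of jets labelled as top.
top_fraction = deduped["j_t"].mean()

print(len(toy_df), len(deduped), top_fraction)
```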

One of the most important steps to ensure the robustness of your machine learning solution is to retain a portion of the data as a testing set. Understand the testing set's importance [here](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7). It is also important to shuffle the data before training a neural network, so that each batch is representative of the whole sample and the optimizer is less likely to get stuck at a poor local minimum of the loss. When trying to create reproducible results, it is also useful to specify a seed for the random number generators.
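To show what shuffling, splitting, and seeding amount to, here is a minimal by-hand sketch in NumPy. The helper `shuffle_and_split` and the toy arrays are hypothetical; the point is that features and labels are reordered with the same permutation, and that a fixed seed reproduces the same split.

```python
import numpy as np

# Toy data: 10 samples with 2 features each, and one 0/1 label per sample.
X = np.arange(20, dtype=float).reshape(10, 2)
y = (np.arange(10) % 2).reshape(-1, 1)

def shuffle_and_split(X, y, test_fraction=0.2, seed=42):
    """Shuffle rows with a seeded generator, then hold out a test fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))  # one permutation shared by X and y
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_train, X_test, y_train, y_test = shuffle_and_split(X, y)
X_train_again, _, _, _ = shuffle_and_split(X, y)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, X_train_again))  # same seed, same split
```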

scikit-learn can separate the training and testing sets, as well as shuffle the data, with the useful function train_test_split.

In [11]:
features_val = features_labels_df[features].values #Convert to numpy array
labels_val = features_labels_df[labels].values

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(features_val, labels_val, test_size=0.2, random_state=42)

Finally, we have shuffled training and testing data to use with our model. Now, follow the next tutorial to learn how to build a model.

##### Exercise

The four-layer model is more generally applicable due to its depth; instead of only tagging jets that originate from top quarks, it can tag jets originating from several different fundamental particles. Extract the training and testing data sets from the sample for the four-layer model training. The features and labels you are trying to extract are:


|    Features         |    Labels     |
|  :-------:          |  :---------:  |
|  j_zlogz            |  j_g          |
|  j_c1_b0_mmdt       |  j_q          |
|  j_c1_b1_mmdt       |  j_w          |
|  j_c1_b2_mmdt       |  j_z          |
|  j_c2_b1_mmdt       |  j_t          |
|  j_c2_b2_mmdt       |               |
|  j_d2_b1_mmdt       |               |
|  j_d2_b2_mmdt       |               |
|  j_d2_a1_b1_mmdt    |               |
|  j_d2_a1_b2_mmdt    |               |
|  j_m2_b1_mmdt       |               |
|  j_m2_b2_mmdt       |               |
|  j_n2_b1_mmdt       |               |
|  j_n2_b2_mmdt       |               |
|  j_mass_mmdt        |               |
|  j_multiplicity     |               |