# Tutorial 1 - Dataset Preparation

Data preparation makes a dataset more suitable for a machine learning algorithm.

This tutorial walks through some preprocessing steps that are useful before training a neural network.

## Table of contents
- Data loading and reading
- Data cleaning
- Label encoding
- Scaling
- Test/train splits

## Data loading and reading

We will use the dataset provided for the Kaggle Higgs challenge.
The dataset can be downloaded with the function `get_data`, which saves a copy (`data.csv`) in the current directory.

In [1]:
import requests
from program import get_data
data_file = get_data("https://www.dropbox.com/s/dr64r7hb0fmy76p/atlas-higgs-challenge-2014-v2.csv?dl=1")

Using TensorFlow backend.


Downloading Dataset
Writing dataset on disk
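
The internals of `get_data` are not shown in this tutorial. As a rough sketch only, assuming it simply copies the content at the URL to a local file, a download helper could look like the following. The name `download_to_csv` and its parameters are hypothetical, and the standard-library `urllib` is used here instead of `requests` to keep the example free of dependencies.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_csv(url, out_path="data.csv", overwrite=False):
    """Hypothetical stand-in for get_data: copy the content at `url` to `out_path`."""
    out_path = Path(out_path)
    # Skip the download when a local copy already exists.
    if out_path.exists() and not overwrite:
        return str(out_path)
    # Stream the response to disk in chunks instead of loading it all in memory.
    with urllib.request.urlopen(url) as response, open(out_path, "wb") as fh:
        shutil.copyfileobj(response, fh)
    return str(out_path)
```

Returning the path makes it easy to pass the result straight to `pd.read_csv`.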


The file contains comma-separated values (CSV), and we will use a pandas `DataFrame` to handle the data. For more information about **`pandas`**, see the documentation [here](https://pandas.pydata.org/pandas-docs/stable/).

In [2]:
import pandas as pd
df = pd.read_csv('data.csv')
df.head(10)

| Unnamed: 0 | EventId | DER_mass_MMC | DER_mass_transverse_met_lep | DER_mass_vis | DER_pt_h | DER_deltaeta_jet_jet | DER_mass_jet_jet | DER_prodeta_jet_jet | DER_deltar_tau_lep | DER_pt_tot | ... | PRI_jet_leading_eta | PRI_jet_leading_phi | PRI_jet_subleading_pt | PRI_jet_subleading_eta | PRI_jet_subleading_phi | PRI_jet_all_pt | Weight | Label | KaggleSet | KaggleWeight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 100000 | 138.47 | 51.655 | 97.827 | 27.98 | 0.91 | 124.711 | 2.666 | 3.064 | 41.928 | ... | 2.15 | 0.444 | 46.062 | 1.24 | -2.475 | 113.497 | 0.000814 | s | t | 0.002653 |
| 1 | 100001 | 160.937 | 68.768 | 103.235 | 48.146 | -999.0 | -999.0 | -999.0 | 3.473 | 2.078 | ... | 0.725 | 1.158 | -999.0 | -999.0 | -999.0 | 46.226 | 0.681042 | b | t | 2.233584 |
| 2 | 100002 | -999.0 | 162.172 | 125.953 | 35.635 | -999.0 | -999.0 | -999.0 | 3.148 | 9.336 | ... | 2.053 | -2.028 | -999.0 | -999.0 | -999.0 | 44.251 | 0.715742 | b | t | 2.347389 |
| 3 | 100003 | 143.905 | 81.417 | 80.943 | 0.414 | -999.0 | -999.0 | -999.0 | 3.31 | 0.414 | ... | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -0.0 | 1.660654 | b | t | 5.446378 |
| 4 | 100004 | 175.864 | 16.915 | 134.805 | 16.405 | -999.0 | -999.0 | -999.0 | 3.891 | 16.405 | ... | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | 0.0 | 1.904263 | b | t | 6.245333 |
| 5 | 100005 | 89.744 | 13.55 | 59.149 | 116.344 | 2.636 | 284.584 | -0.54 | 1.362 | 61.619 | ... | -2.412 | -0.653 | 56.165 | 0.224 | 3.106 | 193.66 | 0.025434 | b | t | 0.083414 |
| 6 | 100006 | 148.754 | 28.862 | 107.782 | 106.13 | 0.733 | 158.359 | 0.113 | 2.941 | 2.545 | ... | 0.864 | 1.45 | 56.867 | 0.131 | -2.767 | 179.877 | 0.000814 | s | t | 0.002653 |
| 7 | 100007 | 154.916 | 10.418 | 94.714 | 29.169 | -999.0 | -999.0 | -999.0 | 2.897 | 1.526 | ... | -0.715 | -1.724 | -999.0 | -999.0 | -999.0 | 30.638 | 0.005721 | s | t | 0.018636 |
| 8 | 100008 | 105.594 | 50.559 | 100.989 | 4.288 | -999.0 | -999.0 | -999.0 | 2.904 | 4.288 | ... | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | 0.0 | 1.614803 | b | t | 5.296003 |
| 9 | 100009 | 128.053 | 88.941 | 69.272 | 193.392 | -999.0 | -999.0 | -999.0 | 1.609 | 28.859 | ... | -2.767 | -2.514 | -999.0 | -999.0 | -999.0 | 167.735 | 0.000461 | s | t | 0.001502 |


The `EventId` column is redundant because a pandas `DataFrame` already has a default integer index, so we drop it:

In [3]:
df.drop('EventId', axis=1, inplace=True)

## Data cleaning

Looking at the first 10 rows, we can see that the dataset contains missing values, which are encoded with the placeholder value -999.0.
To avoid introducing additional bias, we replace each missing value with the mean of the defined values of the corresponding feature.

In [4]:
from program import Clean_Missing_Data
df, feature_list = Clean_Missing_Data(df)
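
`Clean_Missing_Data` lives in `program` and its source is not shown here. The following is a minimal sketch of mean imputation, assuming the function replaces -999.0 by the per-column mean of the defined values and returns the cleaned frame together with the list of feature names. The name `impute_missing_with_mean` and the `non_feature_cols` default are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_missing_with_mean(df, missing_value=-999.0,
                             non_feature_cols=("Weight", "Label", "KaggleSet", "KaggleWeight")):
    """Hypothetical sketch: replace the sentinel value by the per-column mean."""
    df = df.copy()
    # Treat every column that is not metadata or target as an input feature (assumption).
    feature_list = [c for c in df.columns if c not in non_feature_cols]
    # Mark the sentinel as NaN so that it is ignored when computing the mean.
    features = df[feature_list].replace(missing_value, np.nan)
    # fillna with a Series fills each column with its own mean.
    df[feature_list] = features.fillna(features.mean())
    return df, feature_list
```

Working on a copy keeps the original frame untouched, which makes it easy to compare the data before and after imputation.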

## Label encoding

The class labels of the events are strings (`s` and `b` in the `Label` column), so they should be encoded as numerical labels.

In [5]:
from program import Label_to_Binary
df['Label'] = Label_to_Binary(df['Label'])
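
The implementation of `Label_to_Binary` is not shown. As an illustration only, a binary encoding can be sketched with `Series.map`. The function name and the mapping (`b` to 0, `s` to 1) are assumptions, not taken from `program`.

```python
import pandas as pd


def label_to_binary_sketch(labels, mapping=None):
    """Hypothetical sketch: map string class labels to integers."""
    # Assumed convention: background 'b' -> 0, signal 's' -> 1.
    if mapping is None:
        mapping = {"b": 0, "s": 1}
    encoded = pd.Series(labels).map(mapping)
    # map() yields NaN for labels missing from the mapping; fail loudly instead.
    if encoded.isna().any():
        raise ValueError("found labels that are not in the mapping")
    return encoded.astype(int)
```

Raising on unknown labels avoids silently carrying NaN targets into training.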

## Scaling

It is useful to scale the values of each feature so that all features have approximately similar ranges. Without scaling, features can span very different orders of magnitude, which can hamper the convergence of the training algorithm by causing large fluctuations in the magnitude of the internal weights.
It is recommended to test different scalers and keep the one that gives the best performance. In this tutorial we will use `StandardScaler`.

In [6]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[feature_list] = scaler.fit_transform(df[feature_list])
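
To make the transformation concrete: `StandardScaler` standardizes each column as z = (x - mean) / std, using the population standard deviation. The NumPy sketch below reproduces that computation on a small invented array. The function name `standardize` is illustrative.

```python
import numpy as np


def standardize(values):
    """Column-wise z-score: subtract the mean, divide by the population std."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # ddof=0, i.e. the population standard deviation
    # Leave constant columns unscaled to avoid dividing by zero.
    std = np.where(std == 0.0, 1.0, std)
    return (values - mean) / std


# Two toy features whose ranges differ by orders of magnitude.
toy = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0]])
scaled = standardize(toy)
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns have comparable magnitudes even though the raw values differ by a factor of about a thousand.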

## Test/train splits

Now we split the dataset into 70% training data and 30% testing data. Let's first assign the design matrix to `X` and the target to `y`:

In [7]:
y = df['Label']
# Remove the target and the bookkeeping columns so that only features remain
df.drop(['Weight', 'Label', 'KaggleSet', 'KaggleWeight'], axis=1, inplace=True)
X = df

In [8]:
from sklearn.model_selection import train_test_split
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print('Class counts:', np.bincount(y_train))

Class counts: [377074 195692]
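
The `stratify=y` argument keeps the class fractions close to equal in both splits. Note also that in the cells above the scaler was fitted on the full dataset before splitting. A common alternative, sketched here on synthetic data, is to fit the scaler on the training split only, so that no statistics from the test set enter the preprocessing. All data and variable names below are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced toy data: 1000 events, 3 features, 30% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
y_toy = np.array([1] * 300 + [0] * 700)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratification both splits keep roughly the original positive fraction.
print(y_tr.mean(), y_te.mean())

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

With this ordering the test features are scaled with the training mean and standard deviation, so their own mean is close to, but not exactly, zero.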


Now we save the four datasets to CSV files, ready for Tutorial 2:

In [9]:
X_train.to_csv('X_train.csv', index=False, header=True)
X_test.to_csv('X_test.csv', index=False, header=True)
y_train.to_csv('y_train.csv', index=False, header=True)
y_test.to_csv('y_test.csv', index=False, header=True)
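
For reference, the saved files can be read back with `pd.read_csv`. The sketch below round-trips a tiny invented frame through a temporary directory. Because `y_train` is a Series, it is written as a one-column CSV and comes back as a one-column DataFrame, from which the `Label` column can be selected.

```python
import os
import tempfile

import pandas as pd

# Tiny invented stand-ins for X_train and y_train.
X_small = pd.DataFrame({"DER_mass_MMC": [0.1, -1.2], "DER_pt_h": [0.5, 0.3]})
y_small = pd.Series([1, 0], name="Label")

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_train.csv")
    y_path = os.path.join(tmp, "y_train.csv")
    X_small.to_csv(x_path, index=False, header=True)
    y_small.to_csv(y_path, index=False, header=True)

    X_loaded = pd.read_csv(x_path)
    # A Series saved with a header is read back as a one-column DataFrame.
    y_loaded = pd.read_csv(y_path)["Label"]

print(X_loaded.shape, y_loaded.tolist())
```

Writing with `index=False` means the row index is not stored, so the loaded frames get a fresh default index.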