# Data Exploration
TODO: we are not allowed to use pandas, so everything below needs to be replaced with NumPy.

## Observations

* there are **30** features
* the value -999 stands for a missing (undefined) value, i.e. NaN

## Things we can do for cleaning

Feature engineering ([link 1](https://www.analyticsvidhya.com/blog/2021/03/step-by-step-process-of-feature-engineering-for-machine-learning-algorithms-in-data-science/), [link 2](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114#3abe)):
* Standardization/Normalization
* Feature Selection
* Handling Missing Values
* Handling Outliers
* Log Transform
* (Splitting features)
* (Binning)
* (One-Hot Encoding)
* (Grouping Operations)

The [winner](https://github.com/melisgl/higgsml/blob/master/doc/model.md) of the Higgs boson contest did the following:
* Data normalization to have **mean = 0 and standard deviation = 1**
* Dropping the phi features (`*_phi` columns such as `PRI_met_phi`)
* Log transform of features with long tails
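
The last two ideas can be sketched in plain NumPy. This is only a rough illustration on toy data: `log_transform_columns` is a hypothetical helper name, the column indices are placeholders rather than the real positions of the phi or long-tailed features, and the log transform assumes the selected columns are non-negative.

```python
import numpy as np

def log_transform_columns(x, cols):
    """Return a copy of x with log(1 + value) applied to the given columns.

    Assumes the selected columns only hold non-negative values.
    """
    out = x.astype(float)  # astype returns a copy, so x is left untouched
    out[:, cols] = np.log1p(out[:, cols])
    return out

# Toy data: column 1 has a long right tail, column 2 plays the role of a phi feature.
toy = np.array([[1.0, 10.0, 0.5],
                [2.0, 1000.0, -1.2],
                [3.0, 100000.0, 2.9]])

toy_log = log_transform_columns(toy, cols=[1])
toy_reduced = np.delete(toy_log, [2], axis=1)  # drop the placeholder phi column
print(toy_reduced.shape)
```

On the real data the -999 entries would have to be handled before taking the log, since `np.log1p` is undefined for values below -1.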

In [4]:
# Useful starting lines
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
DATA_TRAIN_PATH = '../data/train.csv'

In [6]:
data = pd.read_csv(DATA_TRAIN_PATH)
data.head()

Unnamed: 0,Id,Prediction,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,...,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt
0,100000,s,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,...,-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497
1,100001,b,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,...,-1.916,164.546,1,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226
2,100002,b,-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,...,-2.186,260.414,1,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251
3,100003,b,143.905,81.417,80.943,0.414,-999.0,-999.0,-999.0,3.31,...,0.06,86.062,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0
4,100004,b,175.864,16.915,134.805,16.405,-999.0,-999.0,-999.0,3.891,...,-0.871,53.131,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0


In [7]:
data.Id.is_unique

True

In [8]:
tX = data.drop(['Id', 'Prediction'], axis=1)
print(tX.shape)
tX.head()

(250000, 30)


Unnamed: 0,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,DER_sum_pt,...,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt
0,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,...,-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497
1,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,2.078,125.157,...,-1.916,164.546,1,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226
2,-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,9.336,197.814,...,-2.186,260.414,1,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251
3,143.905,81.417,80.943,0.414,-999.0,-999.0,-999.0,3.31,0.414,75.968,...,0.06,86.062,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0
4,175.864,16.915,134.805,16.405,-999.0,-999.0,-999.0,3.891,16.405,57.983,...,-0.871,53.131,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0


In [9]:
y = data[['Prediction']]
y.Prediction.describe()

count     250000
unique         2
top            b
freq      164333
Name: Prediction, dtype: object
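
Since pandas has to go, the label summary above could be reproduced with `np.unique`. A minimal sketch on a toy label vector (the `labels` array is made up for illustration; on the real data it would be the `Prediction` column):

```python
import numpy as np

# Toy label vector standing in for the Prediction column ('s' = signal, 'b' = background).
labels = np.array(['s', 'b', 'b', 'b', 's'])

# Unique label values together with how often each one occurs.
values, counts = np.unique(labels, return_counts=True)
for value, count in zip(values, counts):
    print(value, count, count / labels.size)
```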

## Dealing with undefined values, i.e. -999

In [61]:
from proj1_helpers import *
DATA_TRAIN_PATH = '../data/train.csv' # TODO: download train data and supply path here 
y, input_data, ids = load_csv_data(DATA_TRAIN_PATH)

In [13]:
y.shape

(250000,)

In [14]:
input_data.shape

(250000, 30)

A natural choice is to replace each -999 in a given feature vector (a column of the data matrix) by the mean of the defined (non -999) elements of that column.

Example for Mike and Sophie

In [56]:
x = np.array([[2,2,-999],[-999,5,-999]])
x

array([[   2,    2, -999],
       [-999,    5, -999]])

In [57]:
# replace the -999 in the first column
x[:,0][x[:,0] == -999] = 5
x

array([[   2,    2, -999],
       [   5,    5, -999]])

In [70]:
def replace_undefined_values_with_mean(tX):
    """Replace the -999 values of each feature column by the mean of its defined elements.

    Note: tX is modified in place and is also returned.
    """
    for col in range(tX.shape[1]):
        # boolean mask of the undefined entries in this column
        undefined = tX[:, col] == -999
        # mean over the defined entries only, written into the undefined ones
        tX[undefined, col] = np.mean(tX[~undefined, col])

    return tX

Test

In [71]:
#there are -999 values
input_data[input_data == -999].size

1541938

In [72]:
clean_input_data = replace_undefined_values_with_mean(input_data)
clean_input_data[clean_input_data == -999].size

0
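
Note that `replace_undefined_values_with_mean` writes into its argument, so after the test above `input_data` and `clean_input_data` are the same array. If the original data should be kept, a non-mutating variant could look like the sketch below. `impute_with_column_mean` is a hypothetical name, and the approach (mark undefined entries as NaN, then use `np.nanmean`) is one possible alternative, not the method used above.

```python
import numpy as np

def impute_with_column_mean(x, undefined=-999):
    """Return a copy of x where undefined entries are replaced by the mean
    of the defined entries of the same column. The input is left untouched."""
    x = x.astype(float)                 # copy, and make room for NaN
    x[x == undefined] = np.nan          # mark undefined entries
    col_means = np.nanmean(x, axis=0)   # per-column mean, ignoring NaN
    # col_means has one entry per column and broadcasts across the rows
    return np.where(np.isnan(x), col_means, x)

toy = np.array([[2, 2, -999],
                [-999, 5, 7],
                [4, -999, 9]])
print(impute_with_column_mean(toy))
```

A column that contains only -999 has no defined mean; in that case `np.nanmean` returns NaN for it (with a warning), so such columns would need separate treatment.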

## Standardize

Standardizing the data is a useful preprocessing step: for each feature (column), subtract its mean and divide by its standard deviation, so that every feature ends up with mean 0 and standard deviation 1.

In [74]:
def standardize(x):
    """Standardize the data set."""
    
    centered_data = x - np.mean(x, axis=0)
    std_data = centered_data / np.std(centered_data, axis=0)
    
    return std_data

Test

In [83]:
clean_std_input_data = standardize(clean_input_data)

print(np.mean(clean_std_input_data, axis=0), "\n\n", np.std(clean_std_input_data, axis=0))

[-9.68885265e-13  4.50019089e-15 -3.48448848e-15  7.19675786e-15
 -7.25996786e-12 -6.28511637e-12  6.81155990e-13  2.16429719e-14
  6.39742126e-15  2.86409207e-15 -7.00447966e-15  4.45924897e-15
  5.42301347e-13 -5.96492045e-15  1.35646161e-16  7.13136217e-17
  2.58030370e-14 -1.06327391e-16 -1.87188487e-16  8.24369382e-15
  1.41040513e-16 -9.00283004e-15 -6.01698247e-16  2.88956741e-12
 -2.76637636e-15  2.53944285e-14 -8.41148019e-12  2.10209063e-14
 -5.90545076e-15 -8.76751116e-16] 

 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
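
If the same transformation later has to be applied to other data (for example a test set), the mean and standard deviation of the training data need to be reused. A possible sketch, assuming a hypothetical helper `standardize_with_stats` that also guards against constant columns (standard deviation 0):

```python
import numpy as np

def standardize_with_stats(x, mean=None, std=None):
    """Standardize x column-wise; reuse the given mean/std if provided."""
    if mean is None:
        mean = np.mean(x, axis=0)
    if std is None:
        std = np.std(x, axis=0)
    # constant columns have std 0; divide by 1 instead to avoid NaN/inf
    safe_std = np.where(std == 0, 1.0, std)
    return (x - mean) / safe_std, mean, std

# Toy data: the second column is constant.
train = np.array([[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]])
test = np.array([[3.0, 5.0]])

train_std, mean, std = standardize_with_stats(train)
test_std, _, _ = standardize_with_stats(test, mean, std)  # reuse training statistics
print(train_std)
print(test_std)
```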


## Bias term

We need to add a column of 1s to the data to account for the bias term: the weight that multiplies this column acts as the intercept of a linear model.

In [88]:
def add_bias_term(input_data):
    """Add column of 1s which corresponds to the bias term"""
    num_samples = input_data.shape[0]
    tX = np.c_[np.ones(num_samples), input_data]
    
    return tX

Test

In [91]:
tX = add_bias_term(clean_std_input_data)
tX[:,0]

array([1., 1., 1., ..., 1., 1., 1.])
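
To see why the column of ones accounts for the bias, here is a small check on toy data. The weights `w` are made up for illustration: multiplying the augmented matrix by `w` gives the same result as adding `w[0]` to the product of the original features with the remaining weights.

```python
import numpy as np

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
tX_toy = np.c_[np.ones(x.shape[0]), x]  # same construction as add_bias_term
w = np.array([0.5, 1.0, -1.0])          # hypothetical weights; w[0] is the bias

with_bias_column = tX_toy @ w
explicit_bias = w[0] + x @ w[1:]
print(np.allclose(with_bias_column, explicit_bias))
```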