## Exploratory analysis 

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from proj1_helpers import *
from implementations import *
import codecs, json 
%load_ext autoreload
%autoreload 2

### Load and merge training and test data

In [3]:
data_path = "../dataset/train.csv"
y_loaded, data_loaded, _ = load_csv_data(data_path)

In [4]:
data_path = "../dataset/test.csv"
_, data_test_loaded, _ = load_csv_data(data_path)

In [5]:
data_loaded.shape, data_test_loaded.shape

((250000, 30), (568238, 30))

In [7]:
data_merged = np.concatenate((data_loaded,data_test_loaded), axis=0)
data_merged.shape

(818238, 30)

### Plot distributions
The first step to understand the data is to plot it. Here we plot (1) every value of every feature, (2) the histogram and (3) the distribution for each feature. 

#### Plot all values
Plotting all the values can be useful to determine whether there are outliers, and how many, and to identify the categorical features. In these plots we consider both the training and the test set.

In [None]:
plot_features(data_merged, col_labels = column_labels(), title="values")

We observe that there are very few outliers and that, given the amount of data, they should not noticeably affect the computation of the mean and the standard deviation. Moreover, the only categorical feature is PRI_jet_num (column 22); we will study further what this feature represents and whether and how it affects the other features.
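
As a numeric companion to the plots, categorical features can also be spotted by counting the distinct values in each column. The sketch below runs on synthetic data; `count_distinct` is a hypothetical helper (not part of `proj1_helpers`) and the threshold of 10 is an arbitrary assumption.

```python
import numpy as np

def count_distinct(x):
    """Number of distinct values in each column of a 2-D array."""
    return np.array([np.unique(x[:, col]).size for col in range(x.shape[1])])

rng = np.random.default_rng(0)
demo = np.column_stack((
    rng.normal(size=1000),          # continuous feature
    rng.integers(0, 4, size=1000),  # categorical feature with values in {0, 1, 2, 3}
))

n_distinct = count_distinct(demo)
# columns with very few distinct values are candidates for categorical features
candidates = np.where(n_distinct <= 10)[0]
print(n_distinct, candidates)
```

On the real data the same call would be `count_distinct(data_merged)`; a column such as PRI_jet_num should then stand out with only four distinct values.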

#### Plot the histograms and the distributions
In the graphs below we plot the histogram (left column) and the distribution (right column) of each feature, split by the output. This lets us compare the distributions of each feature and check by eye whether they change depending on the output y. For these graphs we can use only the training data, since we need the value of the output y to split the input data.

In [None]:
plot_distributions(data_loaded, y_loaded, col_labels = column_labels(), title = "", normed=False)

We observe that the distributions of the four features PRI_tau_phi, PRI_lep_pt, PRI_lep_phi and PRI_met_phi (columns 15, 16, 18 and 20, respectively) seem to be independent of the output y. We will try dropping these columns and check how that affects our model. Finally, we notice the presence of many -999 values.
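
The check "by eye" can be complemented with a rough number. The sketch below is not the method used in this notebook: it computes the total variation distance between the two per-class histograms of a feature, assuming binary labels. `class_histogram_distance` is a hypothetical helper and the data is synthetic; on the real data the -999 entries would have to be filtered out first.

```python
import numpy as np

def class_histogram_distance(feature, y, bins=20):
    """Total variation distance between the per-class histograms of one feature."""
    feature = np.asarray(feature, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)
    if labels.size != 2:
        raise ValueError("expected exactly two classes")
    # shared bin edges so that the two histograms are comparable
    edges = np.histogram_bin_edges(feature, bins=bins)
    h_a, _ = np.histogram(feature[y == labels[0]], bins=edges)
    h_b, _ = np.histogram(feature[y == labels[1]], bins=edges)
    p_a = h_a / h_a.sum()
    p_b = h_b / h_b.sum()
    return 0.5 * np.abs(p_a - p_b).sum()

rng = np.random.default_rng(0)
n = 5000
y = rng.choice([-1, 1], size=n)
informative = rng.normal(size=n) + 2.0 * (y == 1)  # distribution shifts with y
uninformative = rng.normal(size=n)                 # same distribution for both classes

d_informative = class_histogram_distance(informative, y)
d_uninformative = class_histogram_distance(uninformative, y)
print(d_informative, d_uninformative)
```

A distance close to 0 suggests the feature looks the same for both outputs; it does not prove independence, so it should only be used to shortlist columns to try dropping.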

### Percentile

In [None]:
# clean the merged dataset
x_all, _ = clean_input_data(data_merged.copy())

In [None]:
x_all[0].shape[0] + x_all[1].shape[0] +  x_all[2].shape[0] + x_all[3].shape[0]

In [None]:
data_merged = fill_with_nan_list(data_merged, nan_values=[-999])
data_merged.shape

In [None]:
list(range(0, 101, 5))

In [None]:
# [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
# hierarchical maps: jet -> column -> percentile
percentiles = {} 
for jet, d in enumerate(x_all):
    percentiles[jet] = {} 
    
    scan_perc = list(range(0, 101, 5))
    #col_perc = np.zeros((len(scan_perc), data_merged.shape[1]))
    for col in range(d.shape[1]): # scan columns
        percentiles[jet][col] = {}
        for p in scan_perc:
            percentiles[jet][col][p] = np.nanpercentile(d[:, col], p)

In [None]:
percentiles

In [None]:
# store the computed percentiles 
file_path = "percentiles.json"
with codecs.open(file_path, 'w', encoding='utf-8') as f:  # the file is closed once the dump is done
    json.dump(percentiles, f, separators=(',', ':'), sort_keys=True, indent=4)
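
One thing to keep in mind when reading the stored percentiles back: JSON object keys are always strings, so the integer keys (jet, column, percentile) come back as "0", "1", and so on. The sketch below shows one way to restore them; `restore_int_keys` is a hypothetical helper and the small dictionary is made-up demo data with the same nesting.

```python
import json

def restore_int_keys(raw):
    """Convert the string keys of a jet -> column -> percentile mapping back to int."""
    return {
        int(jet): {
            int(col): {int(p): value for p, value in by_perc.items()}
            for col, by_perc in by_col.items()
        }
        for jet, by_col in raw.items()
    }

# made-up mapping with the same structure as `percentiles`
small = {0: {0: {0: 1.5, 50: 2.5, 100: 3.5}}}

raw = json.loads(json.dumps(small))  # round trip through JSON text
restored = restore_int_keys(raw)
print(list(raw.keys()), list(restored.keys()))
```

With the real file, `raw` would come from `json.load` on "percentiles.json" instead of the in-memory round trip.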

### Our approach: split data depending on the feature PRI_jet_num

#### Why divide data depending on jet numbers and which columns can we drop?

After reading the documentation of the dataset we learned that the jet number affects the presence of -999 (invalid) values in other features. We therefore split our input data into four sets depending on the jet number and verified this.

Divide the data depending on the jet number (column 22), which is a categorical value in {0, 1, 2, 3}

In [None]:
all_data = data_merged.copy()
jets_0 = all_data[all_data[:, 22]==0, :]
jets_1 = all_data[all_data[:, 22]==1, :]
jets_2 = all_data[all_data[:, 22]==2, :]
jets_3 = all_data[all_data[:, 22]==3, :]
jets_0.shape, jets_1.shape, jets_2.shape, jets_3.shape 

Where are the -999 values?
- jet = 0: columns [4, 5, 6, 12, 23, 24, 25, 26, 27, 28] contain only -999 values, 26.1% of entries in the first column have -999
- jet = 1: columns [4, 5, 6, 12, 26, 27, 28] contain only -999 values, 7.6% of entries in the first column have -999
- jet = 2: 3% of entries in the first column have -999
- jet = 3: 1.4% of entries in the first column have -999

We decided to drop, in every set, the columns that contain only -999 values, since they carry no information. The first feature (DER_mass_MMC), however, is the only one that contains -999 values in all four sets without filling the whole column. We tried imputing this invalid value with the mean, the std and the median, and also simply substituting it with 0 (after standardization). After several trials we concluded that the invalid value does not "fit" in the column and may need a different weight from the valid values of the same column. We therefore opted for the most versatile solution: add a boolean column that flags the positions of the -999 values and remove them from the first column. By "remove" we mean: replace them with 0 after standardization, where the standardization ignores the invalid values so that they do not affect the mean and the standard deviation.
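
The indicator-column idea can be sketched on a toy column as follows. This is a minimal illustration under stated assumptions, not the code in `implementations`: `standardize_with_indicator` is a hypothetical helper and it assumes the invalid entries are still encoded as -999.

```python
import numpy as np

def standardize_with_indicator(col, invalid=-999.0):
    """Standardize one column ignoring invalid entries.

    Returns the standardized column (invalid entries set to 0)
    and a 0/1 indicator column flagging the invalid entries.
    """
    col = np.asarray(col, dtype=float)
    is_invalid = col == invalid
    valid = col[~is_invalid]
    # mean and std are computed on the valid entries only
    mean, std = valid.mean(), valid.std()
    out = np.zeros_like(col)  # invalid entries stay at 0
    out[~is_invalid] = (valid - mean) / std
    return out, is_invalid.astype(float)

mass = np.array([100.0, -999.0, 120.0, 80.0, -999.0])
mass_std, mass_missing = standardize_with_indicator(mass)

# the indicator is appended as an extra feature next to the cleaned column
augmented = np.column_stack((mass_std, mass_missing))
print(augmented)
```

Because the indicator is a separate feature, a linear model can learn its own weight for "mass is missing" instead of treating the missing entries as an average mass.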

In [None]:
for jet, cur_set in enumerate([jets_0, jets_1, jets_2, jets_3]):
    print("Features in the dataset with jet=", jet, "contains this many values != -999")
    for col in range(30):
        print(col, np.sum(cur_set[:, col] != -999))
    print()

Where are the 0 values?
- jet = 0: column 29 contains only 0s and, obviously, column 22 too since it stores the jet num.
- jet = 1: spread
- jet = 2: spread
- jet = 3: spread

In [None]:
for jet, cur_set in enumerate([jets_0, jets_1, jets_2, jets_3]):
    print("Features in the dataset with jet=", jet, "contains this many values != 0")
    for col in range(30):
        print(col, np.sum(cur_set[:, col] != 0))
    print()

After this first step we surely want to drop the following columns since they do not contain any useful information:
- jet = 0: [4, 5, 6, 12, 22, 23, 24, 25, 26, 27, 28, 29]
- jet = 1: [4, 5, 6, 12, 22, 26, 27, 28] 
- jet = 2: [22]
- jet = 3: [22]

Column 22 is dropped from every resulting dataset since, after the split, it only stores a constant: the jet number.

### Divide data and compute the correlation matrix 

We now divide our data, drop the columns listed above and check whether some features are highly correlated. If so, it is worth trying to drop all but one column in each set of correlated features.

In [8]:
# split data
datasets, _ = split_input_data(data_merged) # split and drop
datasets[0].shape, datasets[1].shape, datasets[2].shape, datasets[3].shape

Jet 0 columns dropped: [4, 5, 6, 12, 22, 23, 24, 25, 26, 27, 28, 29]
Jet 1 columns dropped: [4, 5, 6, 12, 22, 26, 27, 28]
Jet 2 columns dropped: [22]
Jet 3 columns dropped: [22]


((327371, 18), (252882, 22), (165027, 29), (72958, 29))

In [9]:
# compute correlation matrices
corr_matrices = [None]*4
for jet in range(4):
    # don't consider the first column since it contains nan values (we will simply keep that column)
    corr_matrices[jet] = np.corrcoef(datasets[jet][:, 1:].T) 
    
    # to keep the same indexing of the columns just add one row above and one column at the left
    corr_matrices[jet] = np.column_stack((np.zeros((corr_matrices[jet].shape[0], 1)), corr_matrices[jet]))   
    corr_matrices[jet] = np.vstack((np.zeros((1, corr_matrices[jet].shape[1])), corr_matrices[jet]))  # np.row_stack is a deprecated alias of np.vstack

corr_matrices[0].shape, corr_matrices[1].shape, corr_matrices[2].shape, corr_matrices[3].shape

((18, 18), (22, 22), (29, 29), (29, 29))

In [35]:
# compute the mapping of |correlations| > min_corr 
min_correlations = [0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,  0.95]
corr_mappings = {} # mapping: jet -> minimum correlation -> features -> list of correlated features
for jet in range(4): # for each dataset build the correlation mapping
    corr_mappings[jet] = {}
    for min_corr in min_correlations: #for each min correlation considered
        corr_mappings[jet][min_corr] = {}
        corr_matrix_bool = np.abs(corr_matrices[jet]) > min_corr 
        nfeature = corr_matrix_bool.shape[0]
        # i is surely correlated to itself, drop that (useless) information
        for i in range(nfeature):
            corr_matrix_bool[i][i] = False

        # compute the mapping of correlations
        for i in range(nfeature):
            c = np.where(corr_matrix_bool[i])[0].tolist()
            if len(c) > 0: # if it is not correlated to any other column then ignore it
                corr_mappings[jet][min_corr][i] = c
    

In [36]:
corr_mappings

{0: {0.7: {1: [15], 3: [5], 5: [3], 6: [9, 12], 9: [6], 12: [6], 15: [1]},
  0.75: {3: [5], 5: [3], 6: [9, 12], 9: [6], 12: [6]},
  0.8: {3: [5], 5: [3]},
  0.85: {3: [5], 5: [3]},
  0.9: {3: [5], 5: [3]},
  0.95: {3: [5], 5: [3]}},
 1: {0.7: {3: [6, 17, 18, 21],
   6: [3, 17, 18, 21],
   7: [12],
   12: [7],
   17: [3, 6, 18, 21],
   18: [3, 6, 17, 21],
   21: [3, 6, 17, 18]},
  0.75: {3: [6, 18, 21],
   6: [3, 17, 18, 21],
   17: [6],
   18: [3, 6, 21],
   21: [3, 6, 18]},
  0.8: {3: [6, 18, 21], 6: [3, 18, 21], 18: [3, 6, 21], 21: [3, 6, 18]},
  0.85: {3: [6, 18, 21], 6: [3, 18, 21], 18: [3, 6, 21], 21: [3, 6, 18]},
  0.9: {3: [18, 21], 6: [18, 21], 18: [3, 6, 21], 21: [3, 6, 18]},
  0.95: {18: [21], 21: [18]}},
 2: {0.7: {3: [9, 19, 21, 22, 28],
   4: [5, 6],
   5: [4, 6],
   6: [4, 5],
   9: [3, 21, 22, 28],
   10: [16],
   16: [10],
   19: [3],
   21: [3, 9, 22, 28],
   22: [3, 9, 21, 28],
   25: [28],
   28: [3, 9, 21, 22, 25]},
  0.75: {3: [9, 22, 28],
   4: [5, 6],
   5: [4, 6

Now that we have the mappings of the correlated columns we compute the columns that could be dropped for each considered correlation.

In [37]:
def empty(map_):
    for k in map_:
        if len(map_[k]) > 0:
            return False
    return True

# jet -> minimum correlation -> features -> list of correlated features

tobe_deleted = {} # mapping: jet -> minimum correlation -> list of columns that can be dropped
for jet in range(4): # for each dataset build the correlation mapping
    tobe_deleted[jet] = {}
    for min_corr, corr in corr_mappings[jet].items():
        tobe_deleted[jet][min_corr] = []
        # fetch all the columns that can be deleted and put them in tobe_deleted
        while not empty(corr): 
            longer_key = -1 
            longer_length = 0

            # look for the longest list
            for key in corr:
                curr_length = len(corr[key])
                if curr_length > longer_length:
                    longer_length = curr_length
                    longer_key = key

            tobe_deleted[jet][min_corr].append(corr[longer_key])
            # delete all the columns that are correlated to column longer_key
            # i.e. all the columns whose index is in corr[longer_key]
            for corr_colum in corr[longer_key]:
                corr[corr_colum] = []

            # since those columns have been dropped they must be removed from all the other lists
            for key in corr: 
                if key != longer_key:
                    corr[key] = list(set(corr[key]) - set(corr[longer_key]))
            corr[longer_key] = []

        tobe_deleted[jet][min_corr] = [val for sublist in tobe_deleted[jet][min_corr] for val in sublist]
        tobe_deleted[jet][min_corr].sort()
tobe_deleted

{0: {0.7: [5, 9, 12, 15],
  0.75: [5, 9, 12],
  0.8: [5],
  0.85: [5],
  0.9: [5],
  0.95: [5]},
 1: {0.7: [6, 12, 17, 18, 21],
  0.75: [3, 17, 18, 21],
  0.8: [6, 18, 21],
  0.85: [6, 18, 21],
  0.9: [3, 6, 21],
  0.95: [21]},
 2: {0.7: [5, 6, 9, 16, 19, 21, 22, 28],
  0.75: [3, 5, 6, 21, 22, 28],
  0.8: [5, 6, 21, 22, 28],
  0.85: [6, 21, 22, 28],
  0.9: [22, 28],
  0.95: [28]},
 3: {0.7: [5, 6, 16, 19, 21, 22, 25, 28],
  0.75: [5, 6, 16, 21, 22, 25, 28],
  0.8: [9, 21, 22, 25],
  0.85: [21, 22, 28],
  0.9: [21, 28],
  0.95: [28]}}

For every dataset and every chosen minimum correlation, the resulting maps give a list of features that can be dropped. For example, tobe_deleted[2][0.85] lists the features in the jet=2 dataset that have a |correlation| > 0.85 with at least one of the features kept in the dataset (and can therefore be dropped).
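
To make the selection concrete, here is a small worked example on the jet=0 mapping at minimum correlation 0.75 shown earlier. The function is a sketch of the same greedy idea, written so that it works on a copy: the cell above empties the lists stored in corr_mappings as it goes, which this version avoids. `greedy_drop` is a hypothetical name.

```python
import copy

def greedy_drop(corr_map):
    """Return the sorted columns to drop, given feature -> list of correlated features."""
    corr = copy.deepcopy(corr_map)  # work on a copy so the input mapping is untouched
    dropped = []
    while any(len(partners) > 0 for partners in corr.values()):
        # keep the feature with the most remaining correlated partners
        keep = max(corr, key=lambda k: len(corr[k]))
        partners = corr[keep]
        dropped.extend(partners)
        # dropped columns no longer contribute partners of their own
        for col in partners:
            corr[col] = []
        # dropped columns must also disappear from every other list
        for key in corr:
            if key != keep:
                corr[key] = [c for c in corr[key] if c not in partners]
        corr[keep] = []
    return sorted(dropped)

example = {3: [5], 5: [3], 6: [9, 12], 9: [6], 12: [6]}
drop_cols = greedy_drop(example)
print(drop_cols)  # [5, 9, 12]
```

Column 6 is kept first because it has the most partners (9 and 12), then column 3 is kept and 5 is dropped, which reproduces the entry tobe_deleted[0][0.75].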

### Compute mean and std

We must use the same standardisation process for both training and prediction. Since the -999 values will be removed from the first column, we exclude them when computing the mean and the std.
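
A minimal sketch of this step, assuming the -999 entries have already been replaced by NaN and that the statistics are computed once on the merged data and then reused. `fit_standardizer` and `apply_standardizer` are hypothetical helpers and the arrays are synthetic.

```python
import numpy as np

def fit_standardizer(x):
    """Column-wise mean and std, ignoring NaN entries."""
    return np.nanmean(x, axis=0), np.nanstd(x, axis=0)

def apply_standardizer(x, mean, std):
    """Standardize x with precomputed statistics; NaN entries stay NaN."""
    return (x - mean) / std

rng = np.random.default_rng(0)
train = rng.normal(5.0, 2.0, size=(100, 3))
test = rng.normal(5.0, 2.0, size=(50, 3))
train[::10, 0] = np.nan  # stand-in for the invalid entries of the first column

# statistics are computed once and shared by both sets
mean, std = fit_standardizer(np.concatenate((train, test), axis=0))
train_std = apply_standardizer(train, mean, std)
test_std = apply_standardizer(test, mean, std)
print(mean, std)
```

Storing `mean` and `std` alongside the model guarantees that new data is transformed exactly as the training data was.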