# Data Preprocessing:

### Problem:
Until now, we have considered data that is nice and clean: no NaN-values have been present, nor have there been any serious outliers. In this exercise we will focus on the first problem (NaN-values, i.e. "non-values") and how to deal with it.

We will use the Aleph b-quark jet data (50000 events), which exists in two versions:
1. Normal (which means "perfect")
2. Flawed (where NaN-values have been introduced)

We want you to preprocess the flawed data to a point where you can train a Neural Net on it (further specified below).


### Data:
The input variables (X) are (where Aleph uses only **the first six**):
* **prob_b**: Probability of being a b-jet from the pointing of the tracks to the vertex.
* **spheri**: Sphericity of the event, i.e. how spherical it is.
* **pt2rel**: The transverse momentum squared of the tracks relative to the jet axis, i.e. width of the jet.
* **multip**: Multiplicity of the jet (in a relative measure).
* **bqvjet**: b-quark vertex of the jet, i.e. the probability of a detached vertex.
* **ptlrel**: Transverse momentum (in GeV) of possible lepton with respect to jet axis (about 0 if no leptons).
* energy: Measured energy of the jet in GeV. Should be 45 GeV, but fluctuates.
* cTheta: cos(theta), i.e. the cosine of the polar angle of the jet with respect to the beam axis. Note that the detector works best in the central region (|cTheta| small) and less well in the forward regions.
* phi:    The azimuth angle of the jet. As the detector is uniform in phi, this should not matter (much).

The target variable (Y) is:
* isb:    1 if the jet is from a b-quark, and 0 if it is not.

Finally, those before you (the Aleph collaboration in the mid 1990s) produced a Neural Net (6 input variables, two hidden layers with 10 neurons in each, and 1 output variable) based classification variable, which you can compare to (and compete with?):
* nnbjet: Value of original Aleph b-jet tagging algorithm, using only the first six variables (for reference).

---

* Author: Troels C. Petersen (NBI)
* Email:  petersen@nbi.dk
* Date:   21st of May 2024

In [1]:
from __future__ import print_function, division   # Ensures Python3 printing & division standard
from matplotlib import pyplot as plt
from matplotlib import colors
from matplotlib.colors import LogNorm
import numpy as np
import csv
import pandas as pd 
from pandas import Series, DataFrame 

Possible other packages to consider:
corner (corner plots), seaborn, sklearn.decomposition.PCA

In [2]:
r = np.random
r.seed(42)

SavePlots = False
plt.close('all')

In [3]:
data = pd.read_csv('AlephBtag_MC_train_Nev50000_flawed.csv')  # Read the data
variables = data.columns   # Get column names

# When a CSV file is created, columns are sometimes saved as "object" type and must be converted to float.
# This is the case for the flawed dataset.
data['prob_b'] = data['prob_b'].astype(float)
data['spheri'] = data['spheri'].astype(float)
data['pt2rel'] = data['pt2rel'].astype(float)
data['bqvjet'] = data['bqvjet'].astype(float)
data['ptlrel'] = data['ptlrel'].astype(float)
data['energy'] = data['energy'].astype(float)
data['cTheta'] = data['cTheta'].astype(float)
data['phi']    = data['phi'].astype(float)
data['multip'] = data['multip'].astype(float)
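
If a column holds entries that cannot be parsed as numbers, `astype(float)` raises a `ValueError`. A minimal sketch of a more forgiving conversion, assuming that unparsable entries should simply be treated as missing, uses `pd.to_numeric` with `errors='coerce'`. The small `toy` frame and its contents are made up for illustration and are not part of the Aleph data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: numbers stored as strings, with unparsable entries
toy = pd.DataFrame({'prob_b': ['0.12', '0.80', 'bad'],
                    'spheri': ['0.30', '', '0.25']})

# Convert every column to float; anything that cannot be parsed becomes NaN
toy_numeric = toy.apply(pd.to_numeric, errors='coerce')

print(toy_numeric.dtypes)
print(toy_numeric.isna().sum())
```

Note that coercion hides the reason why an entry was unparsable, so it is worth counting the NaNs before and after the conversion.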

## Inspect the data

In [4]:
# Total number of NaNs:
nan_count = data.isna().sum().sum()
print('Number of NaNs:', nan_count)

Number of NaNs: 23461


In [5]:
# Count number of missing values in each column
total_rows = data.shape[0]
perc_dict = {}
for i in range(data.shape[1]):
    n_miss = data.iloc[:, i].isnull().sum()
    perc = (n_miss / total_rows) * 100
    perc_dict[i] = perc
    print('> %d, Missing: %d (%.2f%%)' % (i, n_miss, perc))


> 0, Missing: 395 (0.79%)
> 1, Missing: 416 (0.83%)
> 2, Missing: 422 (0.84%)
> 3, Missing: 407 (0.81%)
> 4, Missing: 385 (0.77%)
> 5, Missing: 413 (0.83%)
> 6, Missing: 424 (0.85%)
> 7, Missing: 411 (0.82%)
> 8, Missing: 20188 (40.38%)
> 9, Missing: 0 (0.00%)
> 10, Missing: 0 (0.00%)


In [6]:
# Evaluate:
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.preprocessing import MinMaxScaler

input_variables = variables[(variables != 'nnbjet') & (variables != 'isb') &
                            (variables != 'energy') & (variables != 'phi') & (variables != 'cTheta')]
input_data = data[input_variables]
truth_data = data['isb']
benchmark_data = data['nnbjet']

# Normalize the input data
scaler = MinMaxScaler()
normalized_input_data = pd.DataFrame(scaler.fit_transform(input_data), columns=input_variables)

# Initialize dictionaries to store fpr and tpr values
fpr_dict = {}
tpr_dict = {}
auc_dict = {}

# Loop through each input variable and compute its ROC curve from the non-NaN entries
for column_name in input_variables:
    filtered_data = pd.concat([normalized_input_data[column_name], data['isb']], axis=1).dropna()
    # Compute ROC curve
    fpr, tpr, _ = roc_curve(filtered_data['isb'], filtered_data[column_name])
    # Store the results in the dictionaries, keyed by column name
    fpr_dict[column_name] = fpr
    tpr_dict[column_name] = tpr
    auc_dict[column_name] = auc(fpr, tpr)

# Let's plot the ROC curves for these results:
fig = plt.figure(figsize=[10, 10])
plt.title('ROC from non-NaN entries', size=16)
for i, column_name in enumerate(input_variables):
    nan_perc = data[column_name].isna().mean() * 100   # Percentage of NaNs in this column
    plt.plot(fpr_dict[column_name], tpr_dict[column_name],
             label=f'x{i}: {column_name} {nan_perc:.2f}% NaN (AUC = {auc_dict[column_name]:5.3f})')

plt.legend(fontsize=16)
plt.xlabel('False Positive Rate', size=16)
plt.ylabel('True Positive Rate', size=16)
plt.show()

In [None]:
# Get the fraction of NaNs per event (row)
missing_data_per_event = data.isna().mean(axis=1)

# Store the result in a new DataFrame
result_df = pd.DataFrame({
    'portion_NaNs': missing_data_per_event,
})

# Print or inspect the result
print(result_df)

In [None]:
fig, axs = plt.subplots()
bins = np.linspace(0, 1, 51)   # The fraction of NaNs per event lies in [0, 1]
axs.hist(result_df['portion_NaNs'], bins=bins, histtype='step', label='Flawed data')
axs.set_xlabel('$\\rho_{NaN}$')
axs.set_ylabel('Events')

# axs.grid(color="grey")
axs.legend(loc="upper right", fontsize=14)
axs.set_xlim(0, 1)
axs.set_yscale('log')
plt.show()

# Suggested problems:

1. We have kindly provided you with a plot that shows you how strong the different input variables are and <b>what fraction of these are NaN-values</b>. Consider this plot, and discuss in your group how to interpret it, and how to use it! Based on this, would you consider excluding any of the input variables?

2. We also kindly provide you with code for a plot that shows <b>the fraction of NaNs in the entries</b>. Once again, consider this plot and discuss how to use it.

3. Apply a BDT to both the "normal" and the (original) "flawed" datasets, and see to what extent the NaNs ruin the training by considering the performance on a "normal" test set. Any degradation?

4. Apply an NN to both the "normal" and the preprocessed (repaired?) "flawed" datasets, and see to what extent the NaNs ruin the training by considering the performance on a "normal" test set. Any degradation?

# Learning points:

From this exercise you should learn to inspect and "repair" your data. The exercise focuses on NaN-values. The typical pitfalls you want to get rid of are:
1. NaN-values and "Non-values" (e.g. -9999)
2. Wild outliers (i.e. values far outside the typical range)
3. Shifts in distributions (e.g. part of the data having a different mean/width/definition/etc.)
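
The first two pitfalls can be illustrated with a small sketch. The `toy` column, the sentinel value `-9999.0`, and the 1.5 x IQR outlier rule are all assumptions chosen for illustration; the real data may use a different sentinel, and other outlier criteria may suit it better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with a sentinel "non-value" and one wild outlier
toy = pd.DataFrame({'energy': [44.1, 45.3, -9999.0, 46.0, 45.2, 5000.0, 44.8]})

# 1. Turn the sentinel into a proper NaN so that pandas treats it as missing
SENTINEL = -9999.0   # assumed convention, check what your dataset uses
toy['energy'] = toy['energy'].replace(SENTINEL, np.nan)

# 2. Flag outliers with the interquartile range (IQR) rule
q1, q3 = toy['energy'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy['energy'] < q1 - 1.5 * iqr) | (toy['energy'] > q3 + 1.5 * iqr)

print(toy.assign(is_outlier=is_outlier))
```

Converting the sentinel first matters: if -9999 were left in place, it would distort the quartiles and any mean or median used later for imputation.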

You should have learned how to find, evaluate, and eliminate NaN-values first column-wise (input variables) and then row-wise (entries). And you should have understood the concept of "imputation" (replacing missing data with actual values) and methods to do so (mean, median, and more advanced methods).
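
The column-wise, then row-wise, then imputation sequence can be sketched as follows. This is a minimal illustration on made-up data: the `toy` frame, the threshold `MAX_NAN_FRACTION = 0.5`, and the choice of median imputation via scikit-learn's `SimpleImputer` are assumptions, not a prescribed solution.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column 'c' is mostly missing, row 3 is mostly missing
toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, np.nan, 5.0],
    'b': [0.1, 0.2, 0.3, np.nan, 0.5],
    'c': [np.nan, np.nan, 7.0, np.nan, np.nan],
    'd': [10.0, 20.0, 30.0, 40.0, 50.0],
})

MAX_NAN_FRACTION = 0.5   # assumed threshold, to be tuned on the real data

# 1. Column-wise: drop input variables with too many NaNs
col_nan_fraction = toy.isna().mean(axis=0)
kept = toy.loc[:, col_nan_fraction <= MAX_NAN_FRACTION]

# 2. Row-wise: drop entries with too many NaNs among the remaining columns
row_nan_fraction = kept.isna().mean(axis=1)
kept = kept.loc[row_nan_fraction <= MAX_NAN_FRACTION]

# 3. Impute the remaining NaNs with the column median
imputer = SimpleImputer(strategy='median')
repaired = pd.DataFrame(imputer.fit_transform(kept),
                        columns=kept.columns, index=kept.index)

print(repaired)
```

When a separate test set is used, the imputer should be fitted on the training data only and then applied to the test data with `imputer.transform`, so that no information from the test set leaks into the preprocessing.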

Finally, you should have gained some experience with the impact of NaN-values on the performance of a subsequent ML analysis.