## Classification Experiment 2:  
**Engineered features with uniform sampled data**  
**Followed by the Classification Experiment 1, fit on different momentum range**

Based on the previous researches, the ring radius between pion and muon is too close in higher momentum range to make the classification more difficult. Therefore, we are exploring the classifier's performance on different momentum range in this notebook.

In [1]:
import h5py 
import numpy as np 
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import sys
import glob
import warnings

### 1.0 Read hit data and separate by momentum range

In [2]:
# read hit data from sampled dataset
grouped_hit_data = pd.read_csv('data_0/grouped_hit_data.csv')

In [3]:
# convert label to int datatype
grouped_hit_data['label'] = grouped_hit_data['label'].astype('int')
grouped_hit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900602 entries, 0 to 900601
Data columns (total 29 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Unnamed: 0          900602 non-null  int64  
 1   event               900602 non-null  int64  
 2   x_realigned_min     900602 non-null  float64
 3   x_realigned_max     900602 non-null  float64
 4   x_realigned_median  900602 non-null  float64
 5   y_realigned_min     900602 non-null  float64
 6   y_realigned_max     900602 non-null  float64
 7   y_realigned_median  900602 non-null  float64
 8   min_hit_radius      900602 non-null  float64
 9   max_hit_radius      900602 non-null  float64
 10  mean_hit_radius     900602 non-null  float64
 11  median_hit_radius   900602 non-null  float64
 12  rms_hit_radius      900602 non-null  float64
 13  momentum            900602 non-null  float64
 14  label               900602 non-null  int64  
 15  ring_radius_cal     900602 non-nul

In [4]:
# seperate data by momentum with range size = 5 Gev
grouped_hit_data_20_25 = grouped_hit_data.query('momentum >= 20 & momentum <= 25')
grouped_hit_data_25_30 = grouped_hit_data.query('momentum > 25 & momentum <= 30')
grouped_hit_data_30_35 = grouped_hit_data.query('momentum > 30 & momentum <= 35')
grouped_hit_data_35_40 = grouped_hit_data.query('momentum > 35 & momentum <= 40')
grouped_hit_data_40_45 = grouped_hit_data.query('momentum > 40 & momentum <= 45')

In [5]:
# check number of muon and pion in each momentum range (o for muon; 1 for pion)
grouped_hit_data_20_25['label'].value_counts()

0    156254
1     23860
Name: label, dtype: int64

In [6]:
grouped_hit_data_25_30['label'].value_counts()

0    156471
1     23645
Name: label, dtype: int64

In [7]:
grouped_hit_data_30_35['label'].value_counts()

0    158556
1     21564
Name: label, dtype: int64

In [8]:
grouped_hit_data_35_40['label'].value_counts()

0    160841
1     19285
Name: label, dtype: int64

In [9]:
grouped_hit_data_40_45['label'].value_counts()

0    163570
1     16556
Name: label, dtype: int64

### 2.0 Perform model training & cross valid (catboost/lightgbm/xgboost)

In [10]:
# import ML models and tools
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from catboost import CatBoostClassifier
from lightgbm.sklearn import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from helpers import mean_std_cross_val_scores

In [11]:
classification_metrics = ["accuracy", "precision", "recall", "f1"]

In [12]:
# define pipelines

preprocessor = StandardScaler()

pipe_catboost = make_pipeline(
    preprocessor, CatBoostClassifier(verbose=0, random_state=123))

pipe_lgbm = make_pipeline(
    preprocessor, LGBMClassifier(random_state=123))

pipe_xgb = make_pipeline(
    preprocessor, XGBClassifier(random_state=123, verbosity=0))

classifiers = {
    "CatBoost": pipe_catboost,
    "LightGBM": pipe_lgbm,    
    "XGBoost": pipe_xgb
}

In [14]:
def train_test_split_range(df):
    """Create X_train, y_train, X_test, y_test for different momentum range data"""
    train_df, test_df = train_test_split(df,
                                         test_size=0.25,
                                         random_state=42,
                                         shuffle=True)
    X_train, y_train = train_df.drop(columns=["Unnamed: 0",
                                              "event",
                                              "label",
                                              "ring_radius_cal"]), train_df["label"]
    X_test, y_test = test_df.drop(columns=["Unnamed: 0",
                                           "event", 
                                           "label",
                                           "ring_radius_cal"]), test_df["label"]
    return X_train, y_train, X_test, y_test

#### For momentum range 20 - 25

In [15]:
X_train, y_train, X_test, y_test = train_test_split_range(grouped_hit_data_20_25)

In [18]:
%%time
results_20_25 = {}

for (name, model) in classifiers.items():
    results_20_25[name] = mean_std_cross_val_scores(
        model, X_train, y_train, return_train_score=True, scoring=classification_metrics
    )

CPU times: user 4min 16s, sys: 29.9 s, total: 4min 46s
Wall time: 56.3 s


In [19]:
pd.DataFrame(results_20_25).T

Unnamed: 0,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall,test_f1,train_f1
CatBoost,6.654 (+/- 0.099),0.026 (+/- 0.002),0.999 (+/- 0.000),1.000 (+/- 0.000),0.996 (+/- 0.001),1.000 (+/- 0.000),0.994 (+/- 0.001),0.999 (+/- 0.000),0.995 (+/- 0.001),0.999 (+/- 0.000)
LightGBM,0.649 (+/- 0.010),0.044 (+/- 0.003),0.998 (+/- 0.000),1.000 (+/- 0.000),0.995 (+/- 0.000),1.000 (+/- 0.000),0.992 (+/- 0.001),1.000 (+/- 0.000),0.994 (+/- 0.001),1.000 (+/- 0.000)
XGBoost,3.482 (+/- 0.060),0.027 (+/- 0.001),0.999 (+/- 0.000),1.000 (+/- 0.000),0.998 (+/- 0.001),1.000 (+/- 0.000),0.996 (+/- 0.001),1.000 (+/- 0.000),0.997 (+/- 0.000),1.000 (+/- 0.000)


#### For momentum range 25 - 30

In [22]:
X_train, y_train, X_test, y_test = train_test_split_range(grouped_hit_data_25_30)

In [23]:
%%time
results_25_30 = {}

for (name, model) in classifiers.items():
    results_25_30[name] = mean_std_cross_val_scores(
        model, X_train, y_train, return_train_score=True, scoring=classification_metrics
    )

CPU times: user 4min 21s, sys: 29.8 s, total: 4min 51s
Wall time: 57.3 s


In [24]:
pd.DataFrame(results_25_30).T

Unnamed: 0,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall,test_f1,train_f1
CatBoost,6.623 (+/- 0.158),0.027 (+/- 0.001),0.999 (+/- 0.000),1.000 (+/- 0.000),0.999 (+/- 0.001),1.000 (+/- 0.000),0.997 (+/- 0.001),1.000 (+/- 0.000),0.998 (+/- 0.001),1.000 (+/- 0.000)
LightGBM,0.668 (+/- 0.023),0.044 (+/- 0.002),0.998 (+/- 0.000),1.000 (+/- 0.000),0.995 (+/- 0.001),1.000 (+/- 0.000),0.989 (+/- 0.003),0.998 (+/- 0.000),0.992 (+/- 0.002),0.999 (+/- 0.000)
XGBoost,3.664 (+/- 0.035),0.030 (+/- 0.003),1.000 (+/- 0.000),1.000 (+/- 0.000),1.000 (+/- 0.001),1.000 (+/- 0.000),0.997 (+/- 0.001),1.000 (+/- 0.000),0.998 (+/- 0.001),1.000 (+/- 0.000)


#### For momentum range 30 - 35

In [26]:
X_train, y_train, X_test, y_test = train_test_split_range(grouped_hit_data_30_35)

In [27]:
%%time
results_30_35 = {}

for (name, model) in classifiers.items():
    results_30_35[name] = mean_std_cross_val_scores(
        model, X_train, y_train, return_train_score=True, scoring=classification_metrics
    )

CPU times: user 4min 24s, sys: 29.5 s, total: 4min 53s
Wall time: 57.5 s


In [28]:
pd.DataFrame(results_30_35).T

Unnamed: 0,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall,test_f1,train_f1
CatBoost,6.579 (+/- 0.119),0.027 (+/- 0.001),1.000 (+/- 0.000),1.000 (+/- 0.000),0.999 (+/- 0.001),1.000 (+/- 0.000),0.998 (+/- 0.000),1.000 (+/- 0.000),0.999 (+/- 0.000),1.000 (+/- 0.000)
LightGBM,0.653 (+/- 0.003),0.044 (+/- 0.001),0.998 (+/- 0.000),1.000 (+/- 0.000),0.998 (+/- 0.001),1.000 (+/- 0.000),0.989 (+/- 0.001),0.997 (+/- 0.000),0.993 (+/- 0.001),0.998 (+/- 0.000)
XGBoost,3.768 (+/- 0.048),0.028 (+/- 0.002),1.000 (+/- 0.000),1.000 (+/- 0.000),0.999 (+/- 0.001),1.000 (+/- 0.000),0.997 (+/- 0.001),1.000 (+/- 0.000),0.998 (+/- 0.001),1.000 (+/- 0.000)


#### For momentum range 35 - 40

In [30]:
X_train, y_train, X_test, y_test = train_test_split_range(grouped_hit_data_35_40)

In [31]:
%%time
results_35_40 = {}

for (name, model) in classifiers.items():
    results_35_40[name] = mean_std_cross_val_scores(
        model, X_train, y_train, return_train_score=True, scoring=classification_metrics
    )

CPU times: user 4min 25s, sys: 29.9 s, total: 4min 55s
Wall time: 57.8 s


In [32]:
pd.DataFrame(results_35_40).T

Unnamed: 0,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall,test_f1,train_f1
CatBoost,6.707 (+/- 0.091),0.027 (+/- 0.002),1.000 (+/- 0.000),1.000 (+/- 0.000),0.999 (+/- 0.001),1.000 (+/- 0.000),0.999 (+/- 0.001),1.000 (+/- 0.000),0.999 (+/- 0.001),1.000 (+/- 0.000)
LightGBM,0.651 (+/- 0.044),0.046 (+/- 0.001),0.999 (+/- 0.000),1.000 (+/- 0.000),0.999 (+/- 0.001),1.000 (+/- 0.000),0.990 (+/- 0.001),0.998 (+/- 0.000),0.994 (+/- 0.001),0.999 (+/- 0.000)
XGBoost,3.710 (+/- 0.049),0.028 (+/- 0.001),1.000 (+/- 0.000),1.000 (+/- 0.000),0.999 (+/- 0.001),1.000 (+/- 0.000),0.997 (+/- 0.001),1.000 (+/- 0.000),0.998 (+/- 0.001),1.000 (+/- 0.000)


#### For momentum range 40 - 45

In [34]:
X_train, y_train, X_test, y_test = train_test_split_range(grouped_hit_data_40_45)

In [35]:
%%time
results_40_45 = {}

for (name, model) in classifiers.items():
    results_40_45[name] = mean_std_cross_val_scores(
        model, X_train, y_train, return_train_score=True, scoring=classification_metrics
    )

CPU times: user 4min 36s, sys: 30.2 s, total: 5min 6s
Wall time: 58.9 s


In [36]:
pd.DataFrame(results_40_45).T

Unnamed: 0,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall,test_f1,train_f1
CatBoost,6.813 (+/- 0.083),0.027 (+/- 0.001),1.000 (+/- 0.000),1.000 (+/- 0.000),0.999 (+/- 0.000),1.000 (+/- 0.000),0.997 (+/- 0.001),1.000 (+/- 0.000),0.998 (+/- 0.001),1.000 (+/- 0.000)
LightGBM,0.664 (+/- 0.004),0.044 (+/- 0.000),0.999 (+/- 0.000),1.000 (+/- 0.000),0.999 (+/- 0.001),1.000 (+/- 0.000),0.992 (+/- 0.001),0.998 (+/- 0.000),0.995 (+/- 0.000),0.999 (+/- 0.000)
XGBoost,3.816 (+/- 0.036),0.028 (+/- 0.001),1.000 (+/- 0.000),1.000 (+/- 0.000),0.999 (+/- 0.001),1.000 (+/- 0.000),0.996 (+/- 0.001),1.000 (+/- 0.000),0.997 (+/- 0.001),1.000 (+/- 0.000)


### 3.0 Performance of best model (CatBoostClassifier) on test data

In [37]:
pipe_catboost.fit(X_train, y_train)

In [38]:
# prediction accuracy on test data
pipe_catboost.score(X_test, y_test)

0.9998223485521407

In [39]:
# prediction recall on test data
recall_score(y_test, pipe_catboost.predict(X_test))

0.9982939312698026

In [40]:
# output predicted label vs given label (assuming as ground truth)
y_pred = pipe_catboost.predict(X_test)

pred_df = pd.DataFrame(y_pred, y_test).reset_index()
pred_df.columns=['predited_label','given_label']

In [41]:
pred_df

Unnamed: 0,predited_label,given_label
0,0,0
1,0,0
2,0,0
3,1,1
4,0,0
...,...,...
45027,0,0
45028,0,0
45029,0,0
45030,0,0


**Manually calculate pion efficiency:**

In [42]:
# total number of true pion
num_pi_true = len(pred_df.query('given_label == 1'))
# Ture Positive (correctly predicted as pion)
num_pi_TP = len(pred_df.query('given_label == 1 & predited_label == 1'))

# recall(True Positive Rate)/pion efficiency
pi_efficiency = num_pi_TP/num_pi_true
pi_efficiency

0.9997559189650964

**Manually calculate muon efficiency:**

In [43]:
# total number of true muon
num_mu_true = len(pred_df.query('given_label == 0'))
# False Positive (incorrectly predicted as pion)
num_mu_FP = len(pred_df.query('given_label == 0 & predited_label == 1'))

# False Positive Rate/muon efficiency
mu_efficiency = num_mu_FP/num_mu_true
mu_efficiency

0.0001710028093318676