This is a simple notebook to train a Support Vector Machine to discriminate between two types of collisional events.

It accompanies Chapter 4 of the book.

Data for this exercise were kindly provided by [Sascha Caron](https://www.nikhef.nl/~scaron/).

Author: Viviana Acquaviva, with contributions by Jake Postiglione and Olga Privman.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict, cross_validate
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

In [None]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 100)
rc('text', usetex=False)

Read in features and labels.

In [None]:
features = pd.read_csv('../data/ParticleID_features.csv', index_col='ID')

In [None]:
features.head(10)

In [None]:
features.shape

In [None]:
y = np.genfromtxt('../data/ParticleID_labels.txt', dtype = str)

In [None]:
y

#### We need to turn categorical (string-type) labels into a numerical array, e.g. 0/1.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder() #turns categorical labels into integers 0 ... N-1

In [None]:
y

In [None]:
y = le.fit_transform(y)

In [None]:
y #LabelEncoder assigns codes in sorted order of the class names, so 4top is mapped to 0; I actually wanted 4top to be my positive label (1).

In [None]:
target = np.abs(y - 1)

In [None]:
target # Happier now.
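
The flip above works because there are only two classes. As a minimal sketch of what `LabelEncoder` does and of a more explicit alternative, the example below uses a toy label array; the class name "ttbar" is a hypothetical stand-in for the second label, which is an assumption and not taken from the data file.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels; "ttbar" is a hypothetical name for the second class.
y_str = np.array(["ttbar", "4top", "ttbar", "ttbar", "4top"])

le = LabelEncoder()
y_enc = le.fit_transform(y_str)

# classes_ is sorted; each integer code is the position in this array.
print(le.classes_)  # ['4top' 'ttbar']
print(y_enc)        # [1 0 1 1 0]

# Explicit alternative: choose the positive class by name.
target_explicit = (y_str == "4top").astype(int)
print(target_explicit)  # [0 1 0 0 1]
```

Selecting the positive class by name does not depend on the sort order of the labels, so it keeps working if the class names change.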

#### Let's take a look at these features, using the "describe" method.

In [None]:
features.describe() #Note that this automatically excludes non-numerical type columns

### Important:

Looking at the "count" row, we can see that the whole data set has 5,000 rows, but some columns are present only for a fraction of them. This is because of the variable number of products in each collision.

#### Option 1: Only consider MET, METphi, and the first 16 P columns (first four products), so we have limited imputing/manipulation problems.

In [None]:
features_lim = features[['MET', 'METphi', 'P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10', 'P11',
       'P12',  'P13', 'P14', 'P15', 'P16']]

In [None]:
features_lim.head(20)

In [None]:
features_lim.describe() #This automatically excludes non-numerical type columns, and missing values/NaNs are not counted.

Some feature columns still have different counts! This means that they contain NaN values. Let's replace them with 0 for the moment.

In [None]:
# Take a look for column P10

np.where(np.isnan(features_lim.P10))

Fill with 0 everywhere there is a NaN

In [None]:
features_lim = features_lim.fillna(0) #Fill with 0 everywhere there is a NaN

#### Let's see what "describe" says now.

In [None]:
features_lim.describe()

Yay - we now have consistent sizes, so we can use these as feature arrays, BUT be mindful of possible negative impacts of our imputing strategies.
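
To make the caveat about imputing concrete, here is a small sketch with made-up numbers (not values from the data set). It shows how filling NaNs with 0 pulls the column mean down, and how an indicator column can record which entries were imputed; the indicator is one possible mitigation, not something the notebook itself uses.

```python
import numpy as np
import pandas as pd

# Toy column: three measured values and two missing ones (values are made up).
col = pd.Series([100.0, 120.0, 80.0, np.nan, np.nan])

mean_observed = col.mean()               # NaNs are skipped by default
mean_zero_filled = col.fillna(0).mean()  # zeros now count as real measurements

# An indicator column keeps track of which entries were imputed.
was_missing = col.isna().astype(int)

print(mean_observed, mean_zero_filled)  # 100.0 60.0
print(was_missing.tolist())             # [0, 0, 0, 1, 1]
```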

### Learning Check-in
    
Q: What does the "describe" method of pandas do, and what does it tell us about a data frame?  

<details>
    <summary style="display: list-item;">Click here for the answer!</summary>
    <p>
         The describe method gives us summary statistics about the data in our dataframe. For each numerical column, it outputs the count of non-missing values, as well as the mean, standard deviation, min, max, and quartiles. By default, it includes only numerical columns; for non-numerical columns it would instead report quantities such as the number of unique values and the frequency of the most common one.
    </p>
</details>

### Let's move onto a quick exploration of labels.

In [None]:
np.sum(target)/len(target) #distribution (helps with benchmarking!)

84\% in the negative label, 16\% in the positive label. A bit unbalanced; a classifier that puts everything in the negative class will have 84\% accuracy.

#### How about a random classifier that just assigns a random class according to class distribution?

In [None]:
#Numerical solution

acc=0
for i in range(1000):
    x = np.random.choice(target,5000)
    acc += metrics.accuracy_score(target,x)
print(acc/1000)

#Analytic solution

print(0.8378*(0.8378) + 0.1622*0.1622)
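
The two baselines above can also be sketched on synthetic labels. The example below is an illustration under stated assumptions: `y_toy` is a made-up stand-in for `target` with exactly 16% positives, and scikit-learn's `DummyClassifier` plays the role of the "everything negative" classifier.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for `target`: 5000 labels, 800 of them (16%) positive.
n_total, n_pos = 5000, 800
y_toy = np.zeros(n_total, dtype=int)
y_toy[:n_pos] = 1
rng.shuffle(y_toy)
X_unused = np.zeros((n_total, 1))  # DummyClassifier ignores the features

# Baseline 1: always predict the most frequent class.
majority = DummyClassifier(strategy="most_frequent").fit(X_unused, y_toy)
acc_majority = accuracy_score(y_toy, majority.predict(X_unused))

# Baseline 2: guess at random with the observed class frequencies.
p = y_toy.mean()
acc_random = np.mean([
    accuracy_score(y_toy, (rng.random(n_total) < p).astype(int))
    for _ in range(200)
])

# Analytic expectation: both positive, or both negative.
acc_analytic = p**2 + (1 - p)**2

print(acc_majority, acc_random, acc_analytic)
```

The random guess agrees with the true label when both are positive (probability p squared) or both are negative (probability (1 - p) squared), which is where the analytic formula comes from.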

### Let's start with a linear model; model = LinearSVC()

Establish a benchmark: linear model, almost no regularization (very high C parameter).

In [None]:
bmodel = LinearSVC(dual = False, C = 1000) #Prefer dual = False when n_samples > n_features. If not, will not converge!!

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=101) 

In [None]:
l_benchmark_lim = cross_validate(bmodel, features_lim, target, cv = cv, scoring = 'accuracy', return_train_score=True)

In [None]:
l_benchmark_lim

In [None]:
np.round(l_benchmark_lim['test_score'].mean(),3), np.round(l_benchmark_lim['test_score'].std(), 3)

We can also check the predicted labels. cross\_val\_predict compiles the labels predicted for each object when it was in the test fold.

In [None]:
ypred_bench_lim = cross_val_predict(bmodel, features_lim, target, cv = cv)

Slightly better than a random classifier, but worse than a super silly classifier that says no to everything.
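
One way to use the cross-validated predictions is to compare their accuracy with the majority-class baseline and to inspect the confusion matrix. The sketch below runs on synthetic data from `make_classification` (an assumption standing in for `features_lim` and `target`), so its numbers will not match the notebook's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in data (roughly 84% / 16%).
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=0
)

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
y_pred = cross_val_predict(LinearSVC(dual=False, C=1000), X_toy, y_toy, cv=cv_toy)

acc = accuracy_score(y_toy, y_pred)
majority_acc = max(y_toy.mean(), 1 - y_toy.mean())

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_toy, y_pred)

print(f"accuracy = {acc:.3f}, majority baseline = {majority_acc:.3f}")
print(cm)
```

With unbalanced classes, the confusion matrix shows whether the accuracy comes from actually finding positives or mostly from predicting the majority class.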

### How about with scaling?

In [None]:
from sklearn.pipeline import make_pipeline #This allows one to build different steps together

In [None]:
piped_model = make_pipeline(StandardScaler(), LinearSVC(dual = False, C = 1000)) #changed to linear SVC

benchmark_lim_piped = cross_validate(piped_model, features_lim, target, cv = cv, scoring = 'accuracy', return_train_score=True)

In [None]:
benchmark_lim_piped

In [None]:
np.round(benchmark_lim_piped['test_score'].mean(),3), np.round(benchmark_lim_piped['test_score'].std(), 3)

This is a significant improvement, and the comparison between train and test scores already tells us something about the problem we have. We can formalize this by looking at the learning curves, which tell us both about the gap between train and test scores and about whether we need more data.

### Learning curves

In [None]:
from sklearn.model_selection import learning_curve

In [None]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=5,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5), scoring = 'accuracy', scale = False):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, optional (default=-1)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually have to
        be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("# of training examples",fontsize = 14)

    plt.ylabel("Accuracy score",fontsize = 14)

    if scale:
        # Note: scaling all of X here uses test-fold statistics; prefer a pipeline with StandardScaler.
        scaler = StandardScaler()
        X = scaler.fit_transform(X)

    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring = scoring)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
#    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="b")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="b",
             label="Training score from CV")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Test score from CV")

    plt.legend(loc="best",fontsize = 12)
    return plt

In [None]:
plot_learning_curve(piped_model, 'Generalized Learning Curves, linear SVC model, no reg', features_lim, target, train_sizes = np.array([0.05,0.1,0.2,0.5,1.0]), cv = KFold(n_splits=5, shuffle=True))

### Learning Check-in

Q: Based on the learning curves, what do you think the issue is, and what can we change to improve our model?

<details>
    <summary style="display: list-item;">Click here for the answer!</summary>
    <p>
        The model suffers from high bias. We can see this by looking at the learning curves, which show only a small (and not statistically significant) gap between the train and test scores, for our current sample size, n = 4000. This excludes the problem of high variance. So, we can look for fixes that target high bias!
    </p>
</details>

<br/>

Q: Would having more data help? Why?  

<details>
    <summary style="display: list-item;">Click here for the answer!</summary>
    <p>
        It wouldn't. More data would correspond to imagining that graph continuing to the right (n > 4000 samples). But the learning curves have plateaued already (i.e., they look flat). This means that having more data would not help improve the scores and fix our issue of high bias.
    </p>
</details>
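
The same diagnosis can be read off numerically rather than from the plot. The following sketch uses synthetic data (an assumption, not the particle data set) and prints the train/test gap at each training size together with a rough uncertainty from the fold-to-fold scatter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data.
X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=0)
model_toy = make_pipeline(StandardScaler(), LinearSVC(dual=False, C=1000))

sizes, train_scores, test_scores = learning_curve(
    model_toy, X_toy, y_toy,
    train_sizes=np.array([0.1, 0.2, 0.5, 1.0]),
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)

# Gap between mean train and test scores, with the fold scatter added in quadrature.
gap = train_scores.mean(axis=1) - test_scores.mean(axis=1)
gap_err = np.sqrt(train_scores.std(axis=1) ** 2 + test_scores.std(axis=1) ** 2)

for n, g, e in zip(sizes, gap, gap_err):
    print(f"n = {n:4d}: train - test = {g:+.3f} +/- {e:.3f}")
```

A gap that is small compared with its scatter, combined with a test score that stops improving as n grows, points to high bias rather than high variance.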

### Parameter optimization 

(note: this is NOT nested cross validation).

In [None]:
piped_model = make_pipeline(StandardScaler(), SVC()) #non linear so I can change the kernel

piped_model.get_params() #this shows how we can access parameters both for the scaler and the classifier

### We can define a grid of parameter values to run the optimization. 

(should do nested CV to estimate generalization error!)

Note that this might take a while (about 5 minutes on my laptop, but 15 minutes on my previous one); the early time estimates are misleading because more complex models (in particular those with high gamma) take longer to fit.
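
For reference, a minimal sketch of what nested cross validation could look like: the grid search (inner loop) is itself wrapped in an outer cross validation, so the reported scores come from data that played no part in choosing the parameters. The data, the small grid, and the fold counts below are illustrative assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic stand-in data set.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # picks parameters
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)  # estimates error

small_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": ["scale", 0.01]}
search = GridSearchCV(make_pipeline(StandardScaler(), SVC()), small_grid, cv=inner_cv)

# Each outer fold reruns the whole grid search on its own training portion.
nested_scores = cross_val_score(search, X_toy, y_toy, cv=outer_cv, scoring="accuracy")

print(nested_scores.mean(), nested_scores.std())
```

The cost is roughly the number of outer folds times the cost of one grid search, which is why it is often skipped in a first pass.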


In [44]:
#optimizing SVC: THIS IS NOT YET NESTED CV

parameters = {'svc__kernel':['poly', 'rbf'], \
              'svc__gamma':[0.00001,'scale', 0.01, 0.1], 'svc__C':[0.1, 1.0, 10.0, 100.0, 1000], \
              'svc__degree': [2, 4, 8]}

model = GridSearchCV(piped_model, parameters, cv = StratifiedKFold(n_splits=5, shuffle=True), \
                     verbose = 2, n_jobs = 4, return_train_score=True)

model.fit(features_lim,target)

print('Best params, best score:', "{:.4f}".format(model.best_score_), \
      model.best_params_)

Best params, best score: 0.8956 {'svc__C': 1.0, 'svc__degree': 2, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}


#### Visualize the scores in a data frame, and rank them according to test scores.

I like to look at mean, std of test scores, mean of train scores (so I can evaluate if they differ and the significance of the result), and also fitting time (would pick a model that takes less time if scores are comparable).

In [45]:
scores_lim = pd.DataFrame(model.cv_results_)

scores_lim[['params','mean_test_score','std_test_score','mean_train_score', \
            'mean_fit_time']].sort_values(by = 'mean_test_score', ascending = False)

Unnamed: 0,params,mean_test_score,std_test_score,mean_train_score,mean_fit_time
27,"{'svc__C': 1.0, 'svc__degree': 2, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}",0.8956,0.008958,0.9218,0.279999
43,"{'svc__C': 1.0, 'svc__degree': 8, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}",0.8956,0.008958,0.9218,0.2768
35,"{'svc__C': 1.0, 'svc__degree': 4, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}",0.8956,0.008958,0.9218,0.267401
29,"{'svc__C': 1.0, 'svc__degree': 2, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.8946,0.009604,0.9007,0.2261
37,"{'svc__C': 1.0, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.8946,0.009604,0.9007,0.2148
45,"{'svc__C': 1.0, 'svc__degree': 8, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.8946,0.009604,0.9007,0.2135
47,"{'svc__C': 1.0, 'svc__degree': 8, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}",0.894,0.009295,0.93875,0.3163
31,"{'svc__C': 1.0, 'svc__degree': 2, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}",0.894,0.009295,0.93875,0.362499
39,"{'svc__C': 1.0, 'svc__degree': 4, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}",0.894,0.009295,0.93875,0.3285
53,"{'svc__C': 10.0, 'svc__degree': 2, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.8932,0.008183,0.9127,0.248099
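
In the table above, rbf rows that differ only in `svc__degree` have identical scores, because `degree` is used only by the polynomial kernel. One possible way to avoid these redundant fits is to pass a list of grids, one per kernel. The sketch below only counts the candidates with `ParameterGrid`; the name `split_grid` is made up for this illustration.

```python
from sklearn.model_selection import ParameterGrid

gammas = [0.00001, "scale", 0.01, 0.1]
Cs = [0.1, 1.0, 10.0, 100.0, 1000]

# The grid used above: degree is crossed with both kernels.
full_grid = {
    "svc__kernel": ["poly", "rbf"],
    "svc__gamma": gammas,
    "svc__C": Cs,
    "svc__degree": [2, 4, 8],
}

# Alternative: one grid per kernel, with degree varied only for poly.
split_grid = [
    {"svc__kernel": ["rbf"], "svc__gamma": gammas, "svc__C": Cs},
    {"svc__kernel": ["poly"], "svc__gamma": gammas, "svc__C": Cs,
     "svc__degree": [2, 4, 8]},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)  # 120 80
```

`GridSearchCV` accepts such a list in place of a single dictionary, so `split_grid` could be passed as the parameter grid directly.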


#### We can also isolate one type of kernel to look at it more closely.

In [46]:
scores_lim[scores_lim['param_svc__kernel'] == 'poly'][['params','mean_test_score','std_test_score',\
                        'mean_train_score','mean_fit_time']].sort_values(by = 'mean_test_score', ascending = False)

Unnamed: 0,params,mean_test_score,std_test_score,mean_train_score,mean_fit_time
30,"{'svc__C': 1.0, 'svc__degree': 2, 'svc__gamma': 0.1, 'svc__kernel': 'poly'}",0.8772,0.006969,0.8842,0.437499
76,"{'svc__C': 100.0, 'svc__degree': 2, 'svc__gamma': 0.01, 'svc__kernel': 'poly'}",0.8772,0.006969,0.88425,0.530502
50,"{'svc__C': 10.0, 'svc__degree': 2, 'svc__gamma': 'scale', 'svc__kernel': 'poly'}",0.8764,0.005314,0.8861,0.8353
54,"{'svc__C': 10.0, 'svc__degree': 2, 'svc__gamma': 0.1, 'svc__kernel': 'poly'}",0.8764,0.006406,0.88695,2.295999
102,"{'svc__C': 1000, 'svc__degree': 2, 'svc__gamma': 0.1, 'svc__kernel': 'poly'}",0.8764,0.004224,0.8875,184.779399
98,"{'svc__C': 1000, 'svc__degree': 2, 'svc__gamma': 'scale', 'svc__kernel': 'poly'}",0.8762,0.0044,0.88765,62.524099
74,"{'svc__C': 100.0, 'svc__degree': 2, 'svc__gamma': 'scale', 'svc__kernel': 'poly'}",0.876,0.004472,0.8875,6.6827
78,"{'svc__C': 100.0, 'svc__degree': 2, 'svc__gamma': 0.1, 'svc__kernel': 'poly'}",0.876,0.004561,0.88745,19.912899
100,"{'svc__C': 1000, 'svc__degree': 2, 'svc__gamma': 0.01, 'svc__kernel': 'poly'}",0.8758,0.006013,0.88705,2.3793
26,"{'svc__C': 1.0, 'svc__degree': 2, 'svc__gamma': 'scale', 'svc__kernel': 'poly'}",0.8754,0.007392,0.8819,0.4065


In [47]:
scores_lim.columns

Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_svc__C', 'param_svc__degree', 'param_svc__gamma',
       'param_svc__kernel', 'params', 'split0_test_score', 'split1_test_score',
       'split2_test_score', 'split3_test_score', 'split4_test_score',
       'mean_test_score', 'std_test_score', 'rank_test_score',
       'split0_train_score', 'split1_train_score', 'split2_train_score',
       'split3_train_score', 'split4_train_score', 'mean_train_score',
       'std_train_score'],
      dtype='object')

### Final diagnosis 

The problem here is high bias, which is not that surprising given that we are using only a subset of features.

We can try two things: engineering new features that might help, based on what we know about the problem, and using an imputing strategy to include information about the discarded features.

### Next step: define some new variables. 

In [48]:
features = features.fillna(0) #takes care of nan

In [49]:
features = features.replace('', 0) #takes care of empty string values

In [50]:
features.head()

ID,MET,METphi,Type_1,P1,P2,P3,P4,Type_2,P5,P6,P7,P8,Type_3,P9,P10,P11,P12,Type_4,P13,P14,P15,P16,Type_5,P17,P18,P19,P20,Type_6,P21,P22,P23,P24,Type_7,P25,P26,P27,P28,Type_8,P29,P30,P31,P32,Type_9,P33,P34,P35,P36,Type_10,P37,P38,P39,P40,Type_11,P41,P42,P43,P44,Type_12,P45,P46,P47,P48,Type_13,P49,P50,P51,P52
0,62803.5,-1.81001,j,137571.0,128444.0,-0.345744,-0.307112,j,174209.0,127932.0,0.826569,2.332,b,86788.9,84554.9,-0.180795,2.18797,j,140289.0,76955.8,-1.19933,-1.3028,m+,85230.6,70102.4,-0.645689,-1.65954,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0
1,57594.2,-0.509253,j,161529.0,80458.3,-1.31801,1.40205,j,291490.0,68462.9,-2.12674,-2.58231,e-,44270.1,35139.6,-0.70612,-0.371392,e+,72883.9,26902.2,-1.65386,-3.12963,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0
2,82313.3,1.68684,b,167130.0,113078.0,0.937258,-2.06868,j,102423.0,54922.3,1.22685,0.646589,j,60768.9,36244.3,1.10289,-1.43448,j,77714.0,27801.5,1.68461,1.38969,j,26840.0,24469.3,-0.388937,-1.64726,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0
3,30610.8,2.61712,j,112267.0,61383.9,-1.21105,-1.4578,b,40647.8,39472.0,-0.024646,-2.2228,j,201589.0,32978.6,-2.49604,1.13781,j,90096.7,26964.5,1.87132,0.817631,j,28235.4,25887.9,-0.411528,2.02429,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0
4,45153.1,-2.24135,j,178174.0,100164.0,1.16688,-0.018721,j,92351.3,69762.1,0.774114,2.56874,j,61625.2,50086.7,0.652572,-3.0128,j,104193.0,31151.0,1.87641,0.865381,j,746585.0,26219.3,4.04182,-0.874169,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0


In [51]:
np.unique(features.Type_1.values)

array(['b', 'j'], dtype=object)

Let's start by looking at what kind of particles we have as a product of the collision.

In [52]:
np.unique(np.array([features['Type_'+str(i)].values for i in range(1,14)]).astype('str'))

array(['0', 'b', 'e+', 'e-', 'g', 'j', 'm+', 'm-'], dtype='<U2')

Here are the proposed new features (justification can be found in Chapter 4 of the textbook!)
    
    1. The total number of particles produced
    2. The total number of b jets
    3. The total number of jets
    4. The total number of leptons (electrons, positrons, mu+, mu-)

In [53]:
#count number of non-zero types 

ntot = np.array([-(np.sum(np.array([features['Type_'+str(i)].values[j] == 0 for i in range(1,14)])) - 13) for j in range(features.shape[0])])

In [54]:
#define new column in my data frame

features['Total_products'] = ntot

In [55]:
#count number of b jets 

nbtot = np.array([np.sum(np.array([features['Type_'+str(i)].values[j] == 'b' for i in range(1,14)])) for j in range(features.shape[0])])

In [56]:
#define new column in my data frame

features['Total_b'] = nbtot

In [57]:
#Actually, let's count all types (jets, photons g, e-, e+, mu-, mu+)

njtot = np.array([np.sum(np.array([features['Type_'+str(i)].values[j] == 'j' for i in range(1,14)])) for j in range(features.shape[0])])

In [58]:
ngtot = np.array([np.sum(np.array([features['Type_'+str(i)].values[j] == 'g' for i in range(1,14)])) for j in range(features.shape[0])])

In [59]:
n_el_tot = np.array([np.sum(np.array([features['Type_'+str(i)].values[j] == 'e-' for i in range(1,14)])) for j in range(features.shape[0])])

In [60]:
n_pos_tot = np.array([np.sum(np.array([features['Type_'+str(i)].values[j] == 'e+' for i in range(1,14)])) for j in range(features.shape[0])])

In [61]:
n_muneg_tot = np.array([np.sum(np.array([features['Type_'+str(i)].values[j] == 'm-' for i in range(1,14)])) for j in range(features.shape[0])])

In [62]:
n_mupos_tot = np.array([np.sum(np.array([features['Type_'+str(i)].values[j] == 'm+' for i in range(1,14)])) for j in range(features.shape[0])])

In [63]:
n_lepton_tot = n_el_tot + n_pos_tot + n_muneg_tot + n_mupos_tot

And here we define the other new features:

In [64]:
features['Total_j'] = njtot
features['Total_g'] = ngtot
features['Total_leptons'] = n_lepton_tot
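
The per-row loops above can be slow on large tables. As a sketch of an equivalent vectorized approach, the example below counts particle types with pandas comparisons on a toy frame; the frame has three `Type_` columns instead of 13, and its contents are made up for illustration.

```python
import pandas as pd

# Toy frame; 0 marks an empty slot, as after fillna(0).
toy = pd.DataFrame({
    "Type_1": ["j", "b", "j"],
    "Type_2": ["b", "e-", "j"],
    "Type_3": ["m+", 0, 0],
})

type_cols = [c for c in toy.columns if c.startswith("Type_")]
types = toy[type_cols].astype(str)  # compare everything as strings; empty slots become '0'

# Boolean comparisons summed across columns give per-event counts.
total_products = (types != "0").sum(axis=1)
total_b = (types == "b").sum(axis=1)
total_j = (types == "j").sum(axis=1)
total_leptons = types.isin(["e-", "e+", "m-", "m+"]).sum(axis=1)

print(total_products.tolist())  # [3, 2, 2]
print(total_b.tolist())         # [1, 1, 0]
print(total_j.tolist())         # [1, 0, 2]
print(total_leptons.tolist())   # [1, 1, 0]
```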

### Learning Check-in

What method can we use to peek at the first few lines of our features table? <i>(Test your code in the cell below!)</i>


In [65]:
# Enter code here!


<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
```python
features.head()
```
    
</p>
</details>

### Feature engineering 1: impact of ad-hoc variables

In [66]:
features_lim_2 = features[['MET', 'METphi', 'P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10', 'P11',
       'P12',  'P13', 'P14', 'P15', 'P16','Total_products', 'Total_b' ,'Total_j','Total_g', 
              'Total_leptons']]

In [67]:
bmodel #remember our benchmark model?

LinearSVC(C=1000, dual=False)

In [68]:
piped_model = make_pipeline(StandardScaler(), LinearSVC(dual = False, C = 1000))

In [69]:
benchmark_lim2_piped = cross_validate(piped_model, features_lim_2, target, cv = cv, scoring = 'accuracy', return_train_score=True)

In [70]:
benchmark_lim2_piped

{'fit_time': array([0.01100373, 0.00850034, 0.00849962, 0.00850034, 0.00903392]),
 'score_time': array([0.00249696, 0.00200105, 0.00147676, 0.00249982, 0.00146747]),
 'test_score': array([0.952, 0.939, 0.961, 0.948, 0.94 ]),
 'train_score': array([0.95   , 0.9555 , 0.94825, 0.95   , 0.95   ])}

In [71]:
np.round(benchmark_lim2_piped['test_score'].mean(),3), np.round(benchmark_lim2_piped['test_score'].std(), 3)

(0.948, 0.008)

In [72]:
piped_model = make_pipeline(StandardScaler(), SVC())

We can optimize this model as well; it will take a while, just like the previous time.

In [73]:
#optimizing SVC: Takes a while!

parameters = {'svc__kernel':['poly', 'rbf'], \
              'svc__gamma':[0.00001,'scale', 0.01, 0.1], 'svc__C':[0.1, 1.0, 10.0, 100.0], 'svc__degree': [2, 4, 8]}

nmodels = np.prod([len(el) for el in parameters.values()]) #np.product was removed in NumPy 2.0
model = GridSearchCV(piped_model, parameters, cv = StratifiedKFold(n_splits=5, shuffle=True), \
                     verbose = 2, n_jobs = 4, return_train_score=True)
model.fit(features_lim_2,target)

print('Best params, best score:', "{:.4f}".format(model.best_score_), \
      model.best_params_)

Fitting 5 folds for each of 96 candidates, totalling 480 fits
Best params, best score: 0.9448 {'svc__C': 1.0, 'svc__degree': 2, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}


In [74]:
scores_lim_2 = pd.DataFrame(model.cv_results_)
scores_lim_2[['params','mean_test_score','mean_train_score','mean_fit_time']].sort_values(by = 'mean_test_score', \
                                                    ascending = False)

Unnamed: 0,params,mean_test_score,mean_train_score,mean_fit_time
29,"{'svc__C': 1.0, 'svc__degree': 2, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9448,0.95335,0.138
37,"{'svc__C': 1.0, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9448,0.95335,0.125798
45,"{'svc__C': 1.0, 'svc__degree': 8, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9448,0.95335,0.125604
69,"{'svc__C': 10.0, 'svc__degree': 8, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9438,0.9619,0.1417
53,"{'svc__C': 10.0, 'svc__degree': 2, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9438,0.9619,0.168
61,"{'svc__C': 10.0, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9438,0.9619,0.1382
35,"{'svc__C': 1.0, 'svc__degree': 4, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}",0.9432,0.96495,0.1805
27,"{'svc__C': 1.0, 'svc__degree': 2, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}",0.9432,0.96495,0.172898
43,"{'svc__C': 1.0, 'svc__degree': 8, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}",0.9432,0.96495,0.171201
73,"{'svc__C': 100.0, 'svc__degree': 2, 'svc__gamma': 1e-05, 'svc__kernel': 'rbf'}",0.943,0.94565,0.164399


### Another feature engineering option is to use the type of product in the i-th position as a feature.

We could do it with label encoding, but this introduces a spurious notion of distance (labels that are mapped to 0 and 1 are interpreted as closer to each other than labels that are mapped to 0 and 7).

Instead, we use one-hot encoding: we introduce as many new columns as there are categorical labels, and use a 0/1 to indicate whether the particle is of that type.
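
As a small illustration of the difference between the two encodings, the sketch below applies both to a toy column of particle types (the values are made up).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy_types = pd.Series(["j", "b", "e-", "j"], name="Type_1")

# Label encoding: one integer per category, which implies an ordering.
codes = LabelEncoder().fit_transform(toy_types)
print(codes.tolist())  # [2, 0, 1, 2]

# One-hot encoding: one 0/1 column per category, no ordering implied.
one_hot = pd.get_dummies(toy_types, prefix="Type_1", dtype=int)
print(list(one_hot.columns))  # ['Type_1_b', 'Type_1_e-', 'Type_1_j']
print(one_hot.values.tolist())
```

With label encoding, "b" (0) appears closer to "e-" (1) than to "j" (2); in the one-hot representation, any two different categories are equally far apart.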

In [75]:
features_add = pd.get_dummies(data=features, columns=['Type_'+str(i) for i in range(1,14)], dtype=int) #dtype=int gives 0/1 columns (recent pandas versions default to booleans)

In [76]:
features_add.columns[58:80]

Index(['Total_leptons', 'Type_1_b', 'Type_1_j', 'Type_2_0', 'Type_2_b',
       'Type_2_e+', 'Type_2_e-', 'Type_2_g', 'Type_2_j', 'Type_2_m+',
       'Type_2_m-', 'Type_3_0', 'Type_3_b', 'Type_3_e+', 'Type_3_e-',
       'Type_3_g', 'Type_3_j', 'Type_3_m+', 'Type_3_m-', 'Type_4_0',
       'Type_4_b', 'Type_4_e+'],
      dtype='object')

In [77]:
features_add.shape

(5000, 156)

### Feature engineering 2: add other variables (type of product)

In [78]:
features_lim_3 = features_add[['MET', 'METphi', 'P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10', 'P11',
       'P12',  'P13', 'P14', 'P15', 'P16','Total_products', 'Total_b' ,'Total_j','Total_g', 
              'Total_leptons','Type_1_b',
       'Type_1_j', 'Type_2_0', 'Type_2_b', 'Type_2_e+', 'Type_2_e-',
       'Type_2_g', 'Type_2_j', 'Type_2_m+', 'Type_2_m-', 'Type_3_0',
       'Type_3_b', 'Type_3_e+', 'Type_3_e-', 'Type_3_g', 'Type_3_j',
       'Type_3_m+', 'Type_3_m-', 'Type_4_0', 'Type_4_b', 'Type_4_e+',
       'Type_4_e-', 'Type_4_g', 'Type_4_j', 'Type_4_m+', 'Type_4_m-']]

In [79]:
features_lim_3.head()

ID,MET,METphi,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14,P15,P16,Total_products,Total_b,Total_j,Total_g,Total_leptons,Type_1_b,Type_1_j,Type_2_0,Type_2_b,Type_2_e+,Type_2_e-,Type_2_g,Type_2_j,Type_2_m+,Type_2_m-,Type_3_0,Type_3_b,Type_3_e+,Type_3_e-,Type_3_g,Type_3_j,Type_3_m+,Type_3_m-,Type_4_0,Type_4_b,Type_4_e+,Type_4_e-,Type_4_g,Type_4_j,Type_4_m+,Type_4_m-
0,62803.5,-1.81001,137571.0,128444.0,-0.345744,-0.307112,174209.0,127932.0,0.826569,2.332,86788.9,84554.9,-0.180795,2.18797,140289.0,76955.8,-1.19933,-1.3028,5,1,3,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,57594.2,-0.509253,161529.0,80458.3,-1.31801,1.40205,291490.0,68462.9,-2.12674,-2.58231,44270.1,35139.6,-0.70612,-0.371392,72883.9,26902.2,-1.65386,-3.12963,4,0,2,0,2,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
2,82313.3,1.68684,167130.0,113078.0,0.937258,-2.06868,102423.0,54922.3,1.22685,0.646589,60768.9,36244.3,1.10289,-1.43448,77714.0,27801.5,1.68461,1.38969,5,1,4,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
3,30610.8,2.61712,112267.0,61383.9,-1.21105,-1.4578,40647.8,39472.0,-0.024646,-2.2228,201589.0,32978.6,-2.49604,1.13781,90096.7,26964.5,1.87132,0.817631,5,1,4,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
4,45153.1,-2.24135,178174.0,100164.0,1.16688,-0.018721,92351.3,69762.1,0.774114,2.56874,61625.2,50086.7,0.652572,-3.0128,104193.0,31151.0,1.87641,0.865381,5,0,5,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0


In [80]:
piped_model = make_pipeline(StandardScaler(), LinearSVC(dual = False, C = 10**3))

In [81]:
benchmark = cross_validate(piped_model, features_lim_3, target, cv = cv, scoring = 'accuracy', return_train_score=True)

In [82]:
benchmark

{'fit_time': array([0.02250028, 0.01900148, 0.02149987, 0.02349949, 0.02499962]),
 'score_time': array([0.00199986, 0.00200272, 0.00149941, 0.00200295, 0.0025003 ]),
 'test_score': array([0.953, 0.937, 0.959, 0.949, 0.942]),
 'train_score': array([0.95075, 0.955  , 0.947  , 0.951  , 0.952  ])}

In [83]:
np.round(benchmark['test_score'].mean(),3), np.round(benchmark['test_score'].std(), 3)

(0.948, 0.008)

In [84]:
np.round(benchmark['train_score'].mean(),3), np.round(benchmark['train_score'].std(), 3)

(0.951, 0.003)

#### No further improvement is observed, although we should optimize the model.

In [85]:
piped_model = make_pipeline(StandardScaler(), SVC())

In [86]:
#optimizing SVC: 

parameters = {'svc__kernel':['poly', 'rbf'], \
              'svc__gamma':[0.00001,'scale', 0.01, 0.1], 'svc__C':[0.1, 1.0, 10.0, 100.0, 1000.0], 'svc__degree': [4]} #poly never helps
nmodels = np.prod([len(el) for el in parameters.values()]) #np.product was removed in NumPy 2.0
model = GridSearchCV(piped_model, parameters, cv = StratifiedKFold(n_splits=5, shuffle=True), \
                     verbose = 2, n_jobs = 4, return_train_score=True)
model.fit(features_lim_3,target)

print('Best params, best score:', "{:.4f}".format(model.best_score_), \
      model.best_params_)

scores_lim_3 = pd.DataFrame(model.cv_results_)
scores_lim_3[['params','mean_test_score','mean_train_score','mean_fit_time']].sort_values(by = 'mean_test_score', \
                                                    ascending = False)

Fitting 5 folds for each of 40 candidates, totalling 200 fits
Best params, best score: 0.9478 {'svc__C': 1000.0, 'svc__degree': 4, 'svc__gamma': 1e-05, 'svc__kernel': 'rbf'}


Unnamed: 0,params,mean_test_score,mean_train_score,mean_fit_time
33,"{'svc__C': 1000.0, 'svc__degree': 4, 'svc__gamma': 1e-05, 'svc__kernel': 'rbf'}",0.9478,0.95075,0.1592
13,"{'svc__C': 1.0, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9464,0.955,0.162197
11,"{'svc__C': 1.0, 'svc__degree': 4, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}",0.9432,0.96145,0.2023
25,"{'svc__C': 100.0, 'svc__degree': 4, 'svc__gamma': 1e-05, 'svc__kernel': 'rbf'}",0.9428,0.9447,0.193201
21,"{'svc__C': 10.0, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9426,0.9693,0.195494
19,"{'svc__C': 10.0, 'svc__degree': 4, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}",0.9408,0.9846,0.2434
5,"{'svc__C': 0.1, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.938,0.9418,0.2634
3,"{'svc__C': 0.1, 'svc__degree': 4, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}",0.9374,0.94095,0.2563
18,"{'svc__C': 10.0, 'svc__degree': 4, 'svc__gamma': 'scale', 'svc__kernel': 'poly'}",0.9316,0.9756,0.310104
29,"{'svc__C': 100.0, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9316,0.9898,0.359501
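
The `svc__` prefix on the grid keys comes from how `make_pipeline` names its steps: each step takes the lowercased name of its class, and a parameter of a step is addressed as `<step>__<parameter>`. A minimal sketch for checking which names are available is shown here. The pipeline is built only for inspection and nothing is fit, and `demo_pipe` is a name introduced for this illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same structure as piped_model; demo_pipe is a hypothetical name for this sketch.
demo_pipe = make_pipeline(StandardScaler(), SVC())

# Step names are generated automatically from the class names.
step_names = list(demo_pipe.named_steps)
print(step_names)

# Every tunable parameter of the SVC step is exposed with the 'svc__' prefix.
svc_params = sorted(p for p in demo_pipe.get_params() if p.startswith('svc__'))
print(svc_params)

# The same prefixed names work with set_params, which is what GridSearchCV relies on.
demo_pipe.set_params(svc__C=10.0, svc__kernel='rbf')
```

Listing the prefixed names before writing the grid helps catch typos in the keys.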


### Finally, we can try with all the features.

In [87]:
features_add.shape

(5000, 156)

In [88]:
piped_model = make_pipeline(StandardScaler(), LinearSVC(dual = False, C = 1000))

In [89]:
cv

StratifiedKFold(n_splits=5, random_state=101, shuffle=True)

In [90]:
benchmark = cross_validate(piped_model, features_add, target, cv = cv, scoring = 'accuracy', return_train_score=True)

In [91]:
benchmark

{'fit_time': array([0.70750189, 1.81350398, 1.32249832, 0.37649345, 0.10750222]),
 'score_time': array([0.00700259, 0.00349569, 0.00303292, 0.00350261, 0.00349498]),
 'test_score': array([0.94 , 0.93 , 0.957, 0.929, 0.929]),
 'train_score': array([0.95275, 0.957  , 0.949  , 0.95675, 0.956  ])}

In [92]:
np.round(benchmark['test_score'].mean(),3), np.round(benchmark['test_score'].std(), 3)

(0.937, 0.011)

In [93]:
np.round(benchmark['train_score'].mean(),3), np.round(benchmark['train_score'].std(), 3)

(0.954, 0.003)

### Learning Check-in

With all of these changes and the new benchmarks, what can we observe about our model? Does it still have high bias?

<details>
<summary style="display: list-item;">Click here for the answer!</summary>
<p>
    
We can observe that the model no longer exhibits traits of high bias, but instead high variance: the mean train score (0.954) is now noticeably higher than the mean test score (0.937). This isn't that surprising, and is rather expected, since our data has a lot of noise when all features are included.

It's possible that we could re-run the optimization, but it's likely that it won't help much, considering what we've observed thus far.
    
</p>
</details>
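
One simple way to put a number on this is to compare the gap between mean train and mean test scores for the two benchmarks. The sketch below copies the score arrays from the `cross_validate` outputs printed above. The helper `train_test_gap` and the variable names are introduced here for illustration only.

```python
import numpy as np

def train_test_gap(benchmark):
    """Mean train score minus mean test score for a cross_validate-style dict."""
    return benchmark['train_score'].mean() - benchmark['test_score'].mean()

# Scores copied from the benchmark printed for features_lim_3.
bench_lim_3 = {
    'test_score': np.array([0.953, 0.937, 0.959, 0.949, 0.942]),
    'train_score': np.array([0.95075, 0.955, 0.947, 0.951, 0.952]),
}

# Scores copied from the benchmark printed for all 156 features.
bench_all = {
    'test_score': np.array([0.94, 0.93, 0.957, 0.929, 0.929]),
    'train_score': np.array([0.95275, 0.957, 0.949, 0.95675, 0.956]),
}

gap_lim_3 = train_test_gap(bench_lim_3)
gap_all = train_test_gap(bench_all)

print(f"features_lim_3: gap = {gap_lim_3:.3f}")
print(f"all features:   gap = {gap_all:.3f}")
```

A gap that grows when features are added, while the train score stays about the same, is a typical symptom of a model starting to fit noise.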

In [94]:
piped_model = make_pipeline(StandardScaler(), SVC())

In [95]:
#optimizing SVC: THIS IS NOT YET NESTED CV

parameters = {'svc__kernel':['poly', 'rbf'], \
              'svc__gamma':[0.00001, 0.001, 0.01, 0.1], 'svc__C':[0.1, 1.0, 10.0, 1000.0], 'svc__degree': [4]} #poly never helps
nmodels = np.prod([len(el) for el in parameters.values()]) # number of candidate models in the grid
model = GridSearchCV(piped_model, parameters, cv = StratifiedKFold(n_splits=5, shuffle=True), \
                     verbose = 2, n_jobs = 4, return_train_score=True)
model.fit(features_add,target)

print('Best params, best score:', "{:.4f}".format(model.best_score_), \
      model.best_params_)

Fitting 5 folds for each of 32 candidates, totalling 160 fits
Best params, best score: 0.9438 {'svc__C': 1.0, 'svc__degree': 4, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}
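
The comment in the cell above notes that this search is not yet nested cross-validation, so `best_score_` is likely to be slightly optimistic: the same folds are used both to pick the hyperparameters and to report the score. A minimal sketch of the nested version follows, under stated assumptions. It uses synthetic data from `make_classification` and a deliberately small grid so that it runs quickly on its own, and names such as `X_demo`, `small_grid`, `inner_cv` and `outer_cv` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, so the sketch is self-contained).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The inner folds choose hyperparameters; the outer folds estimate generalization.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)

# Reduced grid, for speed only.
small_grid = {'svc__C': [0.1, 1.0, 10.0], 'svc__gamma': ['scale', 0.01]}

search = GridSearchCV(make_pipeline(StandardScaler(), SVC()), small_grid, cv=inner_cv)

# Each outer fold refits the whole grid search on its training portion,
# then scores the selected model on data the search never saw.
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring='accuracy')
print(nested_scores.mean(), nested_scores.std())
```

On the real data, the same pattern would pass `features_add` and `target` in place of the synthetic arrays, at the cost of one full grid search per outer fold.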


In [96]:
scores_all = pd.DataFrame(model.cv_results_)
scores_all[['params','mean_test_score','mean_train_score','mean_fit_time']].sort_values(by = 'mean_test_score', \
                                                    ascending = False)

Unnamed: 0,params,mean_test_score,mean_train_score,mean_fit_time
11,"{'svc__C': 1.0, 'svc__degree': 4, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}",0.9438,0.9511,0.5412
19,"{'svc__C': 10.0, 'svc__degree': 4, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}",0.9432,0.96735,0.540098
25,"{'svc__C': 1000.0, 'svc__degree': 4, 'svc__gamma': 1e-05, 'svc__kernel': 'rbf'}",0.9402,0.9548,0.944007
13,"{'svc__C': 1.0, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9366,0.97645,0.851505
21,"{'svc__C': 10.0, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.93,0.99745,1.115001
29,"{'svc__C': 1000.0, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9276,1.0,1.767103
20,"{'svc__C': 10.0, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'poly'}",0.9252,0.9935,1.167206
17,"{'svc__C': 10.0, 'svc__degree': 4, 'svc__gamma': 1e-05, 'svc__kernel': 'rbf'}",0.9242,0.92645,0.827703
12,"{'svc__C': 1.0, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'poly'}",0.9234,0.97935,0.899207
5,"{'svc__C': 0.1, 'svc__degree': 4, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}",0.9202,0.9441,1.112998


### Moral of the story: feature engineering often works best if we use subject matter knowledge, and more features are not necessarily better.