# Week 11 - Machine Learning with Scikit-Learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import HTML, display
import tabulate

In [None]:
!curl http://archive.ics.uci.edu/ml/machine-learning-databases/00279/SUSY.csv.gz > SUSY.csv.gz


In [None]:
!gunzip SUSY.csv.gz

In [None]:
VarNames=["signal", "l_1_pT", "l_1_eta","l_1_phi", "l_2_pT", "l_2_eta", "l_2_phi", "MET", "MET_phi", "MET_rel", "axial_MET", "M_R", "M_TR_2", "R", "MT2", "S_R", "M_Delta_R", "dPhi_r_b", "cos_theta_r1"]
RawNames=["l_1_pT", "l_1_eta","l_1_phi", "l_2_pT", "l_2_eta", "l_2_phi", "MET", "MET_phi"]
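# Derived (non-raw) features, kept in the same order as VarNames so the list is reproducible.
FeatureNames=[name for name in VarNames[1:] if name not in RawNames]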

In [None]:
filename="SUSY.csv"
df = pd.read_csv(filename, dtype='float64', names=VarNames)
df_sig=df[df.signal==1]
df_bkg=df[df.signal==0]

## ML with Scikit-Learn

Scikit-Learn provides a large library of ML algorithms with a common interface which makes it easy to try and compare algorithms.

Last week we created a Fisher discriminant by computing the weights with NumPy using the analytical solution. Let's use Scikit-Learn to do the same thing.
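As a reminder of what the analytical solution looks like, here is a minimal sketch. It assumes the usual form of the Fisher weights, $w = S_W^{-1}(\mu_{sig} - \mu_{bkg})$, and runs on toy Gaussian data rather than the SUSY sample. The function name `fisher_weights` and the toy arrays are illustrative, not taken from last week's notebook.

```python
import numpy as np

def fisher_weights(X_sig, X_bkg):
    """Return Fisher weights w = S_W^{-1} (mu_sig - mu_bkg)."""
    mu_sig = X_sig.mean(axis=0)
    mu_bkg = X_bkg.mean(axis=0)
    # Within-class scatter: sum of the two class covariance matrices.
    S_w = np.cov(X_sig, rowvar=False) + np.cov(X_bkg, rowvar=False)
    # Solve S_w w = (mu_sig - mu_bkg) instead of inverting S_w explicitly.
    return np.linalg.solve(S_w, mu_sig - mu_bkg)

# Toy data (an assumption for illustration, not the SUSY sample).
rng = np.random.default_rng(0)
X_sig_toy = rng.normal(loc=[1.0, 0.5], scale=1.0, size=(2000, 2))
X_bkg_toy = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(2000, 2))

w = fisher_weights(X_sig_toy, X_bkg_toy)
score_sig = X_sig_toy @ w   # projection of signal events onto w
score_bkg = X_bkg_toy @ w   # projection of background events onto w
print("weights:", w)
print("mean score (signal, background):", score_sig.mean(), score_bkg.mean())
```

The projected signal scores sit above the background scores on average, which is the separation the discriminant is built to maximize.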

First we import the algorithm and the helper utilities we will need:

In [None]:
import sklearn.discriminant_analysis as DA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score


Define the training and test samples:

In [None]:
N_train=4000000
Train_sample=df[:N_train]
Test_sample=df[N_train:]

X_Train=Train_sample[VarNames[1:]]
y_Train=Train_sample["signal"]

X_Test=Test_sample[VarNames[1:]]
y_Test=Test_sample["signal"]

Test_sig=Test_sample[Test_sample.signal==1]
Test_bkg=Test_sample[Test_sample.signal==0]

Instantiate and fit the ML model:

In [None]:
Fisher=DA.LinearDiscriminantAnalysis() # Fisher is our LDA model.
Fisher.fit(X_Train,y_Train)
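Once the model is fit, each event can be given a continuous discriminant score, which is what you would histogram or feed into a ROC curve. The sketch below shows the idea on toy Gaussian data; the helper `make_toy_sample` and all `toy_*` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

def make_toy_sample(n_per_class, rng):
    """Two Gaussian blobs standing in for signal (1) and background (0)."""
    X_sig = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, 3))
    X_bkg = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 3))
    X = np.vstack([X_sig, X_bkg])
    y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return X, y

rng = np.random.default_rng(1)
X_toy_train, y_toy_train = make_toy_sample(2000, rng)
X_toy_test, y_toy_test = make_toy_sample(1000, rng)

toy_fisher = LinearDiscriminantAnalysis()
toy_fisher.fit(X_toy_train, y_toy_train)

# decision_function returns one signed score per event; larger means more signal-like.
toy_scores = toy_fisher.decision_function(X_toy_test)
toy_auc = roc_auc_score(y_toy_test, toy_scores)
print("test AUC:", toy_auc)
```

In this notebook the equivalent call would be `Fisher.decision_function(X_Test)`, with the scores compared against `y_Test`.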

In [None]:
# Split into features and target
X = df.drop(columns=["signal"]).values
y = df["signal"].values

# Standardize the features (note: the scaler is fit on all rows, before the split)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0, stratify=y)
lda = LinearDiscriminantAnalysis(n_components=1)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda  = lda.transform(X_test)

y_pred = lda.predict(X_test)

#results
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=4))

Accuracy: 0.757341
              precision    recall  f1-score   support

         0.0     0.7180    0.9100    0.8027    542435
         1.0     0.8438    0.5763    0.6849    457565

    accuracy                         0.7573   1000000
   macro avg     0.7809    0.7432    0.7438   1000000
weighted avg     0.7756    0.7573    0.7488   1000000
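The cell above fits the `StandardScaler` on the full data set before splitting, so the test rows contribute to the scaling constants. With millions of events the effect is probably negligible, but a common alternative is to split first and let a pipeline fit the scaler on the training rows only. A minimal sketch on toy data (all `toy` names are illustrative assumptions):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the SUSY features.
rng = np.random.default_rng(2)
n = 1000
X_toy = np.vstack([rng.normal(1.0, 1.0, size=(n, 4)),
                   rng.normal(0.0, 1.0, size=(n, 4))])
y_toy = np.concatenate([np.ones(n), np.zeros(n)])

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

# The pipeline fits the scaler and the LDA on the training rows only,
# then reuses the same scaling when scoring the test rows.
pipe = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```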



## Deep Learning with Keras

Now let's define the training and test samples. DNNs take a long time to train, so for testing purposes we will use only about 10% of the 5 million events for the training/validation sample. Once you get everything working, make the final version of your plots with the full sample.

Also note that Keras has had trouble with Pandas DataFrames, so after doing all of the nice manipulation that Pandas enables, we convert the DataFrame to a regular NumPy array.
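The conversion itself is a one-liner. The sketch below uses a three-row toy frame (an assumption for illustration) and `DataFrame.to_numpy`, which is equivalent in effect to wrapping the selection in `np.array` as done in the next cell.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the SUSY DataFrame (values are made up for illustration).
toy_df = pd.DataFrame({
    "signal": [1.0, 0.0, 1.0],
    "MET":    [0.5, 1.2, 0.9],
    "M_R":    [1.1, 0.7, 1.4],
})

feature_columns = ["MET", "M_R"]
# Select columns with Pandas, then hand Keras a plain NumPy array.
X_toy = toy_df[feature_columns].to_numpy(dtype="float32")
y_toy = toy_df["signal"].to_numpy()
print(type(X_toy), X_toy.shape, X_toy.dtype)
```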

In [None]:
N_max=550000
N_Train=500000

Train_sample=df[:N_Train]
Test_sample=df[N_Train:N_max]

X_Train=np.array(Train_sample[VarNames[1:]])
y_Train=np.array(Train_sample["signal"])

X_Test=np.array(Test_sample[VarNames[1:]])
y_Test=np.array(Test_sample["signal"])

In [None]:
X_Train.shape

(500000, 18)

In [None]:
X_Test.shape

(50000, 18)

Now we will build a simple model, as described in class. Note that this is a very small model, so things run fast. You should attempt more ambitious models.

In most deep learning frameworks, models are built layer by layer. In Keras, you can make simple models using `Sequential`. This approach is limited to a linear stack of layers, but it still works and is illustrative:

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

Method 1:

In [None]:
from tensorflow.keras.layers import Input

model=Sequential()
model.add(Input(shape=(X_Train.shape[1],)))  # declare the input shape explicitly
model.add(Dense(12,activation="relu"))
model.add(Dense(8,activation="relu"))
model.add(Dense(8,activation="relu"))
model.add(Dense(1,activation="sigmoid")) # single sigmoid output in (0, 1) for binary classification



In [None]:
model.summary()
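The parameter counts that `model.summary()` reports can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A short sketch for the 18-input, 12-8-8-1 network defined above (the helper name `dense_param_count` is illustrative):

```python
def dense_param_count(n_inputs, layer_sizes):
    """Return the number of trainable parameters in each Dense layer."""
    counts = []
    for n_units in layer_sizes:
        # One weight per (input, unit) pair, plus one bias per unit.
        counts.append(n_inputs * n_units + n_units)
        n_inputs = n_units  # this layer's output feeds the next layer
    return counts

counts = dense_param_count(18, [12, 8, 8, 1])
print(counts, sum(counts))  # [228, 104, 72, 9] 413
```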

Method 2: The preferred way to create a model is to use the functional API, where the deep neural network is viewed as a composition of functions. With this API, you can create very sophisticated models with multiple inputs and outputs.

In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input

# Each layer is called like a function on the output of the previous one.
in_x=Input(shape=X_Train.shape[1:])
x=Dense(12,activation="relu")(in_x)  # the input shape comes from in_x, so input_dim is not needed
x=Dense(8,activation="relu")(x)
x=Dense(8,activation="relu")(x)
out_y=Dense(1,activation="sigmoid")(x)





In [None]:
model=Model(in_x,out_y)
model.summary()

In [None]:
model.compile(loss="binary_crossentropy",optimizer="adam",metrics=['accuracy'])
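To make the `loss` and `metrics` arguments concrete, here is a NumPy sketch of what binary cross-entropy and accuracy compute for a batch of sigmoid outputs. The clipping constant `eps` and the 0.5 threshold are assumptions of this sketch, chosen to mirror common defaults, and the four example predictions are made up.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y log p + (1 - y) log(1 - p)] over the batch."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of events whose thresholded prediction matches the label."""
    return float(np.mean((p_pred > threshold) == (y_true == 1)))

y_example = np.array([1.0, 0.0, 1.0, 0.0])
p_example = np.array([0.9, 0.1, 0.8, 0.4])

loss_example = binary_crossentropy(y_example, p_example)
acc_example = accuracy(y_example, p_example)
# A classifier that always outputs 0.5 has loss ln(2), a useful baseline.
loss_coin_flip = binary_crossentropy(y_example, np.full(4, 0.5))
print(loss_example, acc_example, loss_coin_flip)
```

The baseline of ln(2), about 0.693, gives some context for the loss values in the training log.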

In [None]:
history=model.fit(X_Train, y_Train, validation_data=(X_Test,y_Test),epochs=10, batch_size=2048)

Epoch 1/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 3s 10ms/step - accuracy: 0.7311 - loss: 0.5858 - val_accuracy: 0.7874 - val_loss: 0.4609
Epoch 2/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7913 - loss: 0.4539 - val_accuracy: 0.7910 - val_loss: 0.4519
Epoch 3/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7945 - loss: 0.4465 - val_accuracy: 0.7928 - val_loss: 0.4473
Epoch 4/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7967 - loss: 0.4428 - val_accuracy: 0.7938 - val_loss: 0.4449
Epoch 5/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7974 - loss: 0.4411 - val_accuracy: 0.7950 - val_loss: 0.4428
Epoch 6/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7981 - loss: 0.4391 - val_accuracy: 0.7950 - val_loss: 0.4416
Epoch 7/10
245/245

Note that `fit` takes care of:

* Running multiple epochs
* Dividing the data into batches
* Computing the gradient on each batch
* Using the optimizer to take a step
* Evaluating performance on the training and testing data
* Keeping track of everything
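To see what those steps look like when written out, here is a hand-rolled training loop in NumPy. It is only a sketch: it trains a logistic regression (a single sigmoid unit) with plain SGD rather than a deep network with Adam, on toy Gaussian data, and it recomputes the training loss at the end of each epoch as a simplification. All names in it are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def fit_logistic(X, y, X_val, y_val, epochs=5, batch_size=256, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    history = {"loss": [], "val_loss": []}           # keeping track of everything
    for epoch in range(epochs):                      # running multiple epochs
        order = rng.permutation(len(X))              # reshuffle every epoch
        for start in range(0, len(X), batch_size):   # dividing the data into batches
            idx = order[start:start + batch_size]
            p = sigmoid(X[idx] @ w + b)
            grad_w = X[idx].T @ (p - y[idx]) / len(idx)  # gradient on this batch
            grad_b = float(np.mean(p - y[idx]))
            w = w - lr * grad_w                      # optimizer step (plain SGD)
            b = b - lr * grad_b
        # Evaluating performance on training and validation data.
        history["loss"].append(bce(y, sigmoid(X @ w + b)))
        history["val_loss"].append(bce(y_val, sigmoid(X_val @ w + b)))
    return w, b, history

def make_blobs(n, rng):
    X = np.vstack([rng.normal(1.0, 1.0, (n, 2)), rng.normal(0.0, 1.0, (n, 2))])
    y = np.concatenate([np.ones(n), np.zeros(n)])
    return X, y

rng = np.random.default_rng(3)
X_tr_toy, y_tr_toy = make_blobs(1000, rng)
X_val_toy, y_val_toy = make_blobs(500, rng)
w_fit, b_fit, toy_history = fit_logistic(X_tr_toy, y_tr_toy, X_val_toy, y_val_toy)
print(toy_history)
```

The number of steps per epoch is the number of batches: with 500,000 training events and `batch_size=2048` that is `ceil(500000 / 2048) = 245`, which matches the `245/245` counter in the log above.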

But in some instances you may wish to consume data from a different source (rather than a tensor in memory) or do something different (e.g. training an adversarial network), which will require you to perform some of these steps yourself. Keras provides easy methods to enable such functionality.

The model history keeps track of the loss and accuracy for each epoch. Note that the training above was set up to run on the validation sample at the end of each epoch:

In [None]:
print(history.history)
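`history.history` is a plain dictionary of lists, one entry per epoch, so it is easy to summarize. The sketch below uses a stand-in dictionary whose values are copied from the first six epochs of the training log above; the helper `summarize_history` is an illustrative name, not part of Keras.

```python
def summarize_history(hist):
    """Find the epoch with the lowest validation loss and the final train/val gap."""
    val_loss = hist["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best_index + 1,              # epochs are numbered from 1
        "best_val_loss": val_loss[best_index],
        "final_gap": hist["val_loss"][-1] - hist["loss"][-1],
    }

# Stand-in for history.history, using the first six epochs logged above.
example_history = {
    "loss":     [0.5858, 0.4539, 0.4465, 0.4428, 0.4411, 0.4391],
    "val_loss": [0.4609, 0.4519, 0.4473, 0.4449, 0.4428, 0.4416],
}

summary = summarize_history(example_history)
print(summary)
```

A validation loss that keeps falling and a small gap between training and validation loss suggest the model has not started to overfit. The same lists can be passed to `plt.plot` to draw the learning curves.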