# Workshop SL01: Classification

## Agenda
- Introduction to training and testing data distribution
- Common classification models

## Recap of the last 2 workshops
In the last 2 workshops we covered how to pre-process data before model training:
- Read data into dataframes
- Join multiple dataframes
- Encode string data into float/int
- Feature selection/engineering 

## Exercise
- Think about how to tune hyperparameters for better performance (hint: the scikit-learn documentation)

### Prepping the data
These steps are taken directly from the last 2 workshops and produce the dataframe we will work with.

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [4]:
# read csv file into a dataframe
df_id_train = pd.read_csv("train_identity.csv")
df_tran_train = pd.read_csv("train_transaction.csv")
df_id_test = pd.read_csv("test_identity.csv")
df_tran_test = pd.read_csv("test_transaction.csv")

# joining table
df_train = pd.merge(df_tran_train,df_id_train, on='TransactionID' ,how='left')

# target dataframe 
Y_train = df_train['isFraud']
Y_train = pd.DataFrame(Y_train)

# dropping the columns that are irrelevant for training
# (named drop_cols so that the built-in `list` is not shadowed)
drop_cols = ['isFraud','TransactionID','DeviceInfo']
X_train = df_train.drop(drop_cols, axis=1)

# encoding strings
obj_df = X_train.select_dtypes(include=['object']).copy()
int_df = X_train.select_dtypes(include=['int64']).copy()
float_df = X_train.select_dtypes(include=['float64']).copy()

# convert each string column to integer category codes (missing values become -1)
for column in obj_df.columns:
    obj_df[column] = obj_df[column].astype('category')
    obj_df[column] = obj_df[column].cat.codes

X_train = pd.concat([obj_df,int_df,float_df],axis=1, sort=False) 

# filling na
X_train.fillna(value=-1,inplace=True)


Alternatively, we can save the prepared dataframes as csv files. We only need to save them once; in future sessions we can read these csv files straight into dataframes.

In [None]:
# saving the dataframes as csv
X_train.to_csv('X_train.csv', index=False, header=True)
Y_train.to_csv('Y_train.csv', index=False, header=True)

In [3]:
X_train = pd.read_csv('X_train.csv')
Y_train = pd.read_csv('Y_train.csv')

### Testing/Training Set Distribution

For **model selection** we need to split the data into a training set and a testing set and compute the model error on both, i.e. the train error and the test error. If we select a model based solely on train error, we risk over-fitting: the model performs really well on the training data but poorly on data it has not seen. Test error is a better tool for judging whether the model will perform well on new data. The graph below illustrates this.

<img src="train-test-error.png">

Source: [In-depth introduction to machine learning in 15 hours of expert videos](https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/)
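The gap between train error and test error can be reproduced on a small scale. The sketch below is only an illustration: it uses a synthetic dataset from `make_classification` (not the fraud data) and a decision tree whose `max_depth` stands in for "model complexity". All names ending in `_demo` are placeholders introduced for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data with 10% label noise (NOT the fraud dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

errors = {}
for depth in (1, 3, 5, 10, None):  # None lets the tree grow until it fits the training set
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_error = 1 - model.score(X_tr, y_tr)
    test_error = 1 - model.score(X_te, y_te)
    errors[depth] = (train_error, test_error)
    print(depth, round(train_error, 3), round(test_error, 3))
```

Train error keeps falling as the tree gets deeper, while test error stops improving at some point. That is why the test error, not the train error, should drive model selection.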


In [5]:
# splitting the data into train/test sets (80:20)
from sklearn.model_selection import train_test_split
train_size = int(0.8*X_train.shape[0])
test_size = X_train.shape[0]-train_size
X_train, X_test, Y_train, Y_test = train_test_split(
    X_train, Y_train, train_size=train_size, test_size=test_size, random_state=4)

In [6]:
# this cell is optional: run this if you don't want to read the future warning messages
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
# flatten the targets to 1-D arrays; this stops sklearn's warning about column-vector targets
Y_train = np.array(Y_train).ravel()
Y_test = np.array(Y_test).ravel()
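
The 80:20 split above is purely random. With a rare positive class such as fraud, one option worth knowing about is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. This is a sketch on made-up data (the arrays `X_demo` and `y_demo` and the 3.5% positive rate are assumptions for illustration), not a change to the workshop pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up imbalanced data: roughly 3.5% positives
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(10000, 3))
y_demo = (rng.uniform(size=10000) < 0.035).astype(int)

# stratify=y_demo keeps the share of positives (almost) identical in both parts
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=4)

print(y_demo.mean(), y_a.mean(), y_b.mean())
```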

### Model Training
There are many models up our sleeves. Here is a list of models for you to explore:
- [Generalized Linear Models](https://scikit-learn.org/stable/modules/linear_model.html) (Logistic regression, [SGD](https://scikit-learn.org/stable/modules/sgd.html), Perceptron) 
- [Linear and Quadratic Discriminant Analysis](https://scikit-learn.org/stable/modules/lda_qda.html#dimensionality-reduction-using-linear-discriminant-analysis)
- [Support Vector Machines](https://scikit-learn.org/stable/modules/svm.html) (SVC)
- [Nearest Neighbors](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification)
- [Gaussian Processes](https://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-classification-gpc)
- [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)
- [Trees](https://scikit-learn.org/stable/modules/tree.html)
- [Ensemble Methods](https://scikit-learn.org/stable/modules/ensemble.html) 
- Neural Network (more about this in later workshops) 

If you are interested in the math, you can click on the links and read more.

Later when we talk about regression we will notice the list for regression problems is pretty similar to this one. Actually a classification problem can be viewed as a regression problem, with the regression output being the probability of being classified into a specific category. 
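To make the "classification as regression on a probability" view concrete, the sketch below fits a logistic regression on synthetic data (an assumption for illustration, not the fraud data) and compares `predict_proba`, which returns the modelled probabilities, with `predict`, which returns the hard labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo[:5])  # shape (5, 2); columns follow clf.classes_
labels = clf.predict(X_demo[:5])       # hard class labels

# picking the class with the larger probability reproduces predict()
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print(proba)
print(labels, labels_from_proba)
```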


#### 1) GLM
**Logistic regression** is a linear model for classification, in which the probability of belonging to a specific category is modeled using a logistic function.

In [7]:
from sklearn.linear_model import LogisticRegression
# training lr model
lr = LogisticRegression()
lr.fit(X_train, Y_train)
# predict on test data
Y_lr = lr.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_lr,digits = 6))
print (confusion_matrix(Y_test, Y_lr))
print (accuracy_score(Y_test, Y_lr))

              precision    recall  f1-score   support

           0   0.964863  0.999798  0.982020    113954
           1   0.178571  0.001204  0.002391      4154

   micro avg   0.964676  0.964676  0.964676    118108
   macro avg   0.571717  0.500501  0.492206    118108
weighted avg   0.937208  0.964676  0.947565    118108

[[113931     23]
 [  4149      5]]
0.9646763978731331


This is a very poor result: the model detects only 5 of the 4,154 fraudulent transactions in the test set, even though its overall accuracy is about 96.5%. Cross-validation does not help much either (we tried, and it detected only slightly more fraud).

Perhaps there is a way to fix this by putting a heavier penalty on false negatives? This is an exercise for you to find out how! ([hint](https://stackoverflow.com/questions/49151325/how-to-penalize-false-negatives-more-than-false-positives))
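As a starting point for the exercise, one possible approach (not necessarily the best one for this dataset) is the `class_weight` parameter of `LogisticRegression`, which scales the loss per class so that mistakes on the rare class cost more. The sketch below runs on synthetic imbalanced data, so the numbers it prints say nothing about the fraud dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# synthetic imbalanced data: roughly 5% positives (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=5000, n_features=10,
                                     weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# 'balanced' weights each class inversely to its frequency in the training data
weighted = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(recall_plain, recall_weighted)
```

Higher recall on the positive class usually comes at the price of more false positives, so check precision and the confusion matrix as well.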

If you have forgotten how to read a confusion matrix, here is a [link](https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62) that explains it. <img src="cm.png">
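As a worked example, the logistic regression confusion matrix printed above can be unpacked by hand. In sklearn's layout, rows are the true classes and columns are the predicted classes, so for a binary problem `ravel()` yields true negatives, false positives, false negatives and true positives, in that order.

```python
import numpy as np

# confusion matrix of the logistic regression model, copied from the output above
cm = np.array([[113931, 23],
               [4149, 5]])

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)       # of the transactions flagged as fraud, how many were fraud
recall = tp / (tp + fn)          # of the actual fraud transactions, how many were flagged
accuracy = (tp + tn) / cm.sum()

print(round(precision, 6), round(recall, 6), round(accuracy, 6))
# 0.178571 0.001204 0.964676
```

These values match the row for class 1 and the accuracy in the classification report.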

The **perceptron** is another simple classification algorithm that works well for large scale learning. The **passive-aggressive algorithms** are a family of algorithms for large-scale learning. They are similar to the Perceptron but do not require a learning rate and include a regularization parameter.

In [17]:
from sklearn.linear_model import Perceptron
# training model
pct = Perceptron()
pct.fit(X_train, Y_train)
# predict on test data
Y_pct = pct.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_pct,digits = 6))
print (confusion_matrix(Y_test, Y_pct))
print (accuracy_score(Y_test, Y_pct))

              precision    recall  f1-score   support

           0   0.964931  0.997955  0.981165    113954
           1   0.082677  0.005055  0.009528      4154

   micro avg   0.963034  0.963034  0.963034    118108
   macro avg   0.523804  0.501505  0.495347    118108
weighted avg   0.933901  0.963034  0.946992    118108

[[113721    233]
 [  4133     21]]
0.9630338334405798


In [18]:
from sklearn.linear_model import PassiveAggressiveClassifier
# training model
pac = PassiveAggressiveClassifier()
pac.fit(X_train, Y_train)
# predict on test data
Y_pac = pac.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_pac,digits = 6))
print (confusion_matrix(Y_test, Y_pac))
print (accuracy_score(Y_test, Y_pac))

              precision    recall  f1-score   support

           0   0.965120  0.993603  0.979154    113954
           1   0.078382  0.014925  0.025076      4154

   micro avg   0.959181  0.959181  0.959181    118108
   macro avg   0.521751  0.504264  0.502115    118108
weighted avg   0.933932  0.959181  0.945598    118108

[[113225    729]
 [  4092     62]]
0.9591814271683544


**SGD**, or stochastic gradient descent, is a simple yet very efficient approach to fitting linear models (e.g. linear SVM, logistic regression). It updates the model one sample at a time, with a step size (the learning rate) that follows a decreasing schedule. So SGD itself is not a model but an algorithm that minimises a loss function. Because of this it is particularly useful when the number of samples (and the number of features) is very large, and it also supports minibatch (online/out-of-core) learning through the `partial_fit` method. For best results with the default learning rate schedule, the data should be standardised (zero mean and unit variance). In a later workshop we will revisit SGD for fitting neural networks. Learn more [here](https://scikit-learn.org/stable/modules/sgd.html).
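The two practical points above, standardising the features and feeding the data in minibatches via `partial_fit`, can be sketched as follows. The data is synthetic and the batch size of 500 is an arbitrary choice for illustration; for simplicity the scaler is fitted on the whole array, which a truly out-of-core pipeline would not do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=3000, n_features=10, random_state=0)

# standardise to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)

sgd_demo = SGDClassifier(random_state=0)
classes = np.unique(y_demo)  # partial_fit needs the full list of classes on its first call
for start in range(0, len(X_scaled), 500):
    batch = slice(start, start + 500)
    sgd_demo.partial_fit(X_scaled[batch], y_demo[batch], classes=classes)

print(sgd_demo.score(X_scaled, y_demo))
```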

In [40]:
from sklearn.linear_model import SGDClassifier
# training SGD model
SGD = SGDClassifier()
SGD.fit(X_train, Y_train)
# predict on test data
Y_SGD = SGD.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_SGD,digits = 6))
print (confusion_matrix(Y_test, Y_SGD))
print (accuracy_score(Y_test, Y_SGD))

              precision    recall  f1-score   support

           0   0.964935  0.997832  0.981108    113954
           1   0.081784  0.005296  0.009948      4154

   micro avg   0.962924  0.962924  0.962924    118108
   macro avg   0.523360  0.501564  0.495528    118108
weighted avg   0.933874  0.962924  0.946951    118108

[[113707    247]
 [  4132     22]]
0.9629237646899448


Even though it trains really fast, can you see the problem here?

#### 2) Linear and Quadratic Discriminant Analysis

Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are two classic classifiers, with, as their names suggest, a linear and a quadratic decision surface, respectively.

In [20]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# training model
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, Y_train)
# predict on test data
Y_lda = lda.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_lda,digits = 6))
print (confusion_matrix(Y_test, Y_lda))
print (accuracy_score(Y_test, Y_lda))

              precision    recall  f1-score   support

           0   0.976051  0.989970  0.982961    113954
           1   0.548043  0.333654  0.414784      4154

   micro avg   0.966886  0.966886  0.966886    118108
   macro avg   0.762047  0.661812  0.698872    118108
weighted avg   0.960997  0.966886  0.962978    118108

[[112811   1143]
 [  2768   1386]]
0.9668862397128052


In [21]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
# training model
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, Y_train)
# predict on test data
Y_qda = qda.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_qda,digits = 6))
print (confusion_matrix(Y_test, Y_qda))
print (accuracy_score(Y_test, Y_qda))



              precision    recall  f1-score   support

           0   0.982177  0.911587  0.945567    113954
           1   0.183814  0.546221  0.275064      4154

   micro avg   0.898737  0.898737  0.898737    118108
   macro avg   0.582996  0.728904  0.610315    118108
weighted avg   0.954098  0.898737  0.921984    118108

[[103879  10075]
 [  1885   2269]]
0.898736749415789


#### 3) Support Vector Machines

In [22]:
from sklearn import svm
# training model
# svc = svm.SVC(kernel='linear')
# SVC uses libsvm and will take a very long time on this dataset, because the core of an SVM
# is a quadratic programming (QP) problem (probably hours; we haven't run it ourselves, but you can try)
# SGDClassifier optimises the same cost function as a linear SVC when loss='hinge' and penalty='l2'
# in fact, these are the defaults, so the default SGDClassifier is a linear SVM
svc = SGDClassifier(loss='hinge', penalty='l2')
svc.fit(X_train, Y_train)
# predict on test data
Y_svc = svc.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_svc,digits = 6))
print (confusion_matrix(Y_test, Y_svc))
print (accuracy_score(Y_test, Y_svc))

              precision    recall  f1-score   support

           0   0.964888  0.998613  0.981461    113954
           1   0.076023  0.003130  0.006012      4154

   micro avg   0.963601  0.963601  0.963601    118108
   macro avg   0.520456  0.500871  0.493736    118108
weighted avg   0.933626  0.963601  0.947153    118108

[[113796    158]
 [  4141     13]]
0.9636011108476987


#### 4) Nearest Neighbour

In [23]:
from sklearn import neighbors
# training knn model
knn = neighbors.KNeighborsClassifier()
knn.fit(X_train, Y_train)
# Predict on test data
Y_knn = knn.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_knn,digits = 6))
print (confusion_matrix(Y_test, Y_knn))
print (accuracy_score(Y_test, Y_knn))

              precision    recall  f1-score   support

           0   0.966557  0.996736  0.981414    113954
           1   0.375839  0.053924  0.094316      4154

   micro avg   0.963576  0.963576  0.963576    118108
   macro avg   0.671198  0.525330  0.537865    118108
weighted avg   0.945780  0.963576  0.950214    118108

[[113582    372]
 [  3930    224]]
0.9635757103667829


#### 5) Gaussian Processes
Gaussian processes are a generic supervised learning method designed to solve regression and probabilistic classification problems, but they lose efficiency in high dimensional spaces (when the number of features exceeds a few dozen). So this method is not particularly useful for our dataset.

In [26]:
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
# training model
mini_size = 2000
gpc = GaussianProcessClassifier(1.0 * RBF(1.0),n_jobs=-1)
gpc.fit(X_train.iloc[0:mini_size], Y_train[0:mini_size])
# Predict on test data
Y_gcp = gpc.predict(X_test.iloc[0:mini_size])
# accuracy 
print (classification_report(Y_test[0:mini_size], Y_gcp,digits = 6))
print (confusion_matrix(Y_test[0:mini_size], Y_gcp))
print (accuracy_score(Y_test[0:mini_size], Y_gcp))
# we have reduced the dataset to a mini size, otherwise the kernel will die
# a warning message appears because the model never predicts a fraud detection

              precision    recall  f1-score   support

           0   0.959500  1.000000  0.979331      1919
           1   0.000000  0.000000  0.000000        81

   micro avg   0.959500  0.959500  0.959500      2000
   macro avg   0.479750  0.500000  0.489666      2000
weighted avg   0.920640  0.959500  0.939669      2000

[[1919    0]
 [  81    0]]
0.9595




#### 6) Naive Bayes
From the [scikit-learn documentation](https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes): "Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering."

In [27]:
from sklearn.naive_bayes import GaussianNB
# training model
gnb = GaussianNB()
gnb.fit(X_train, Y_train)
# Predict on test data
Y_gnb = gnb.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_gnb,digits = 6))
print (confusion_matrix(Y_test, Y_gnb))
print (accuracy_score(Y_test, Y_gnb))

              precision    recall  f1-score   support

           0   0.952437  0.028643  0.055614    113954
           1   0.034801  0.960761  0.067169      4154

   micro avg   0.061427  0.061427  0.061427    118108
   macro avg   0.493619  0.494702  0.061391    118108
weighted avg   0.920162  0.061427  0.056020    118108

[[  3264 110690]
 [   163   3991]]
0.06142682968130863


#### 7) Trees
Tree models are non-parametric: rather than fitting a fixed set of parameters, they predict the target by learning a set of if-then-else decision rules from the features. The deeper the tree, the more complex the decision rules and the more closely the model fits the training data.
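To see what such decision rules look like, the sketch below trains a shallow and an unrestricted tree on the small iris dataset that ships with sklearn (chosen only because it is tiny, it has nothing to do with the fraud data) and prints the shallow tree's rules with `export_text`, which is available in scikit-learn 0.21 and later.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# a shallow tree: at most two levels of if-then-else rules
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
# an unrestricted tree keeps splitting until the training data is fitted
deep = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

rules = export_text(shallow, feature_names=list(iris.feature_names))
print(rules)
print(shallow.get_depth(), deep.get_depth())
```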

In [32]:
from sklearn import tree
# training model
dt = tree.DecisionTreeClassifier(max_depth=50)
dt.fit(X_train, Y_train)
# Predict on test data
Y_dt = dt.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_dt,digits = 6))
print (confusion_matrix(Y_test, Y_dt))
print (accuracy_score(Y_test, Y_dt))

              precision    recall  f1-score   support

           0   0.984969  0.986231  0.985600    113954
           1   0.608533  0.587145  0.597648      4154

   micro avg   0.972195  0.972195  0.972195    118108
   macro avg   0.796751  0.786688  0.791624    118108
weighted avg   0.971730  0.972195  0.971955    118108

[[112385   1569]
 [  1715   2439]]
0.9721949402242016


#### 8) Ensemble Methods
Lastly, ensemble methods combine the predictions of several estimators built with a given learning algorithm. The goal is to improve generalizability and robustness over a single estimator.

There are two families of ensemble methods:

- **Averaging methods**: build several estimators independently and then average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.

- Examples: Bagging methods, Forests of randomized trees (Random Forests, Extremely Randomized Trees, Totally Random Trees Embedding), …

- **Boosting methods**: base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

- Examples: AdaBoost, Gradient Tree Boosting, … (a Voting Classifier, by contrast, combines independently trained models and is not a boosting method)

Bagging methods work best with strong and complex models (e.g. fully developed decision trees), whereas boosting methods usually work best with weak models (e.g. shallow decision trees).
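The pairing described above can be sketched as follows: bagging wrapped around fully grown decision trees, and AdaBoost wrapped around depth-1 trees (stumps). The data is synthetic and `n_estimators=20` is an arbitrary choice, so treat the scores as an illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# bagging: strong base models (fully grown trees), averaged to reduce variance
bag_demo = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20,
                             random_state=0).fit(X_tr, y_tr)

# boosting: weak base models (stumps), added sequentially to reduce bias
boost_demo = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=20,
                                random_state=0).fit(X_tr, y_tr)

bag_score = bag_demo.score(X_te, y_te)
boost_score = boost_demo.score(X_te, y_te)
print(bag_score, boost_score)
```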

In [43]:
from sklearn.ensemble import BaggingClassifier
# training model
bag = BaggingClassifier(SGDClassifier())
bag.fit(X_train, Y_train)
# Predict on test data
Y_bag = bag.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_bag,digits = 6))
print (confusion_matrix(Y_test, Y_bag))
print (accuracy_score(Y_test, Y_bag))

              precision    recall  f1-score   support

           0   0.964972  0.996490  0.980477    113954
           1   0.074074  0.007703  0.013956      4154

   micro avg   0.961713  0.961713  0.961713    118108
   macro avg   0.519523  0.502097  0.497217    118108
weighted avg   0.933638  0.961713  0.946484    118108

[[113554    400]
 [  4122     32]]
0.9617130084329597


In [41]:
from sklearn.ensemble import RandomForestClassifier
# training model
rf = RandomForestClassifier(max_depth=50,n_estimators=20)
rf.fit(X_train, Y_train)
# Predict on test data
Y_rf = rf.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_rf,digits = 6))
print (confusion_matrix(Y_test, Y_rf))
print (accuracy_score(Y_test, Y_rf))

              precision    recall  f1-score   support

           0   0.979656  0.998956  0.989212    113954
           1   0.937664  0.430910  0.590467      4154

   micro avg   0.978977  0.978977  0.978977    118108
   macro avg   0.958660  0.714933  0.789839    118108
weighted avg   0.978179  0.978977  0.975187    118108

[[113835    119]
 [  2364   1790]]
0.9789768686287127


In [38]:
from sklearn.ensemble import AdaBoostClassifier
# training model
ab = AdaBoostClassifier(n_estimators=20) # default is DecisionTreeClassifier(max_depth=1)
ab.fit(X_train, Y_train)
# Predict on test data
Y_ab = ab.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_ab,digits = 6))
print (confusion_matrix(Y_test, Y_ab))
print (accuracy_score(Y_test, Y_ab))

              precision    recall  f1-score   support

           0   0.973777  0.997499  0.985495    113954
           1   0.793179  0.263120  0.395155      4154

   micro avg   0.971670  0.971670  0.971670    118108
   macro avg   0.883478  0.630309  0.690325    118108
weighted avg   0.967425  0.971670  0.964732    118108

[[113669    285]
 [  3061   1093]]
0.9716699969519423


In [37]:
from sklearn.ensemble import GradientBoostingClassifier
# training model
gb = GradientBoostingClassifier(n_estimators=20)
gb.fit(X_train, Y_train)
# Predict on test data
Y_gb = gb.predict(X_test)
# accuracy 
print (classification_report(Y_test, Y_gb,digits = 6))
print (confusion_matrix(Y_test, Y_gb))
print (accuracy_score(Y_test, Y_gb))

              precision    recall  f1-score   support

           0   0.972438  0.998833  0.985459    113954
           1   0.874647  0.223399  0.355896      4154

   micro avg   0.971560  0.971560  0.971560    118108
   macro avg   0.923542  0.611116  0.670678    118108
weighted avg   0.968999  0.971560  0.963316    118108

[[113821    133]
 [  3226    928]]
0.9715599282013073


We hope you have enjoyed this session. As you can see, different models perform very differently, so it is worth taking some time to research which hyperparameters to tune and how to better prepare the data (feature selection, feature engineering, stacking etc.) before feeding it to the model. [Here](https://www.kaggle.com/c/ieee-fraud-detection/discussion/111284#latest-655464) is the first place solution for the IEEE fraud detection competition. It may give you some ideas on how to better manipulate data and train models.
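For the hyperparameter exercise, one common tool is `GridSearchCV`, which tries every combination in a parameter grid with cross-validation. The sketch below is a minimal example on synthetic imbalanced data; the grid values and the choice of `scoring="f1"` (rather than accuracy, which is misleading on imbalanced data) are assumptions for illustration, not recommended settings for the fraud dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic imbalanced data: roughly 10% positives
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)

# every combination of these values is evaluated with 3-fold cross-validation
param_grid = {"max_depth": [3, 10], "n_estimators": [10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```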