# Workshop SL01: Classification

## Agenda
- Introduction to training and testing data distribution
- Common classification models

## Previously, in the last 2 workshops
The last 2 workshops covered how to pre-process data before model training:
- Read data into dataframes
- Join multiple dataframes
- Encode string data into float/int
- Feature selection/engineering 

## Exercise
- Find the best model and tune it for better performance.

### Prepping the data
The code below is taken directly from the last 2 workshops and builds the dataframes we will work with.

In [3]:
import pandas as pd
import numpy as np
# read csv file into a dataframe
df_id_train = pd.read_csv("train_identity.csv")
df_tran_train = pd.read_csv("train_transaction.csv")
df_id_test = pd.read_csv("test_identity.csv")
df_tran_test = pd.read_csv("test_transaction.csv")

# left-join the identity table onto the transaction table
df_train = pd.merge(df_tran_train, df_id_train, on='TransactionID', how='left')

# target dataframe 
Y_train = df_train['isFraud']
Y_train = pd.DataFrame(Y_train)

# drop the target and the columns not used for training
# (named drop_cols so the built-in `list` is not shadowed)
drop_cols = ['isFraud', 'TransactionID', 'DeviceInfo']
X_train = df_train.drop(drop_cols, axis=1)

# encoding strings
obj_df = X_train.select_dtypes(include=['object']).copy()
int_df = X_train.select_dtypes(include=['int64']).copy()
float_df = X_train.select_dtypes(include=['float64']).copy()

# replace each string column with its integer category codes
for column in obj_df.columns:
    obj_df[column] = obj_df[column].astype('category').cat.codes

X_train = pd.concat([obj_df,int_df,float_df],axis=1, sort=False) 

# filling na
X_train.fillna(value=-1,inplace=True)
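
To make the encoding step concrete, here is a minimal sketch of what `astype('category').cat.codes` does on a toy column. The column name `card_type` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy column; the name and values are illustrative only
toy = pd.DataFrame({"card_type": ["visa", "mastercard", None, "visa"]})

# Categories are sorted, then numbered from 0; missing values get the code -1
codes = toy["card_type"].astype("category").cat.codes
print(codes.tolist())
```

Missing strings come out as -1, which matches the -1 used by `fillna` for the numeric columns. Keep in mind that the integer codes imply an ordering that the original strings do not have, which can matter for distance-based models.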


To avoid repeating these steps, save the prepared dataframes to CSV once; in later sessions we only need to read those CSV files back into dataframes.

In [None]:
# save the dataframes as csv (without the row index)
X_train.to_csv('X_train.csv', index=False, header=True)
Y_train.to_csv('Y_train.csv', index=False, header=True)

In [18]:
X_train = pd.read_csv('X_train.csv')
Y_train = pd.read_csv('Y_train.csv')

### Testing/Training Set Distribution

For model selection, we split the data into a training set and a testing set and compute the model's error on each: the train error and the test error. If we select a model on train error alone, we risk over-fitting: the model performs very well on the data it was trained on but poorly on data it has not seen. Test error is a better guide to how the model will perform on new data. The graph below illustrates this.

<img src="train-test-error.png">

Source: [In-depth introduction to machine learning in 15 hours of expert videos](https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/)
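
The gap between train error and test error can also be reproduced in a few lines. The sketch below uses a small synthetic dataset (not the fraud data) and a KNN classifier; the dataset parameters and the values of `k` are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 10% label noise (assumed settings, illustration only)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

errors = {}
for k in (1, 5, 25, 100):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_err = 1 - model.score(X_tr, y_tr)  # error on data the model has seen
    test_err = 1 - model.score(X_te, y_te)   # error on held-out data
    errors[k] = (train_err, test_err)
    print(f"k={k:3d}  train error={train_err:.3f}  test error={test_err:.3f}")
```

With `k=1` every training point is its own nearest neighbour, so the train error is zero even though the test error is not. Picking the model by train error would therefore always favour the most over-fitted setting.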


In [27]:
# splitting the data into train/test sets with an 80:20 ratio
from sklearn.model_selection import train_test_split
train_size = int(0.8*X_train.shape[0])
test_size = X_train.shape[0]-train_size
X_train, X_test, Y_train, Y_test = train_test_split(
    X_train, Y_train, train_size=train_size, test_size=test_size, random_state=4)
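
Since fraud cases are rare, it may be worth checking that both splits contain a similar share of them. As a sketch under that assumption, `train_test_split` accepts a `stratify` argument that preserves the class ratio in both sets. The toy labels below (5% positives) are made up and stand in for `isFraud`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, 5% positive class (illustrative stand-in for isFraud)
y = np.array([1] * 50 + [0] * 950)
X = np.arange(1000).reshape(-1, 1)

# stratify=y keeps the positive rate the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=4, stratify=y)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

The workshop cell above does not stratify; this is an optional variation, not the method used for the results shown in this notebook.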


### Model Training

#### 1) KNN model

In [26]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import neighbors
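# train a KNN model with the default settings
knn = neighbors.KNeighborsClassifier()
# np.ravel turns the single-column target into the 1-D array scikit-learn expects
knn.fit(X_train, np.ravel(Y_train))
# predict on the held-out test set
Y_knn = knn.predict(X_test)
# evaluate: per-class precision/recall, confusion matrix, overall accuracy
Y_test = np.ravel(Y_test)

print(classification_report(Y_test, Y_knn, digits=6))
print(confusion_matrix(Y_test, Y_knn))
print(accuracy_score(Y_test, Y_knn))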


              precision    recall  f1-score   support

           0   0.966557  0.996736  0.981414    113954
           1   0.375839  0.053924  0.094316      4154

   micro avg   0.963576  0.963576  0.963576    118108
   macro avg   0.671198  0.525330  0.537865    118108
weighted avg   0.945780  0.963576  0.950214    118108

[[113582    372]
 [  3930    224]]
0.9635757103667829


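This result is poor despite the 96.4% accuracy: the model detects only 224 of the 4,154 fraudulent transactions in the test set (a recall of about 5.4% for class 1) and misses the other 3,930. Because only about 3.5% of the transactions are fraud, always predicting "not fraud" would reach roughly the same accuracy, so accuracy alone is misleading here. Perhaps there is a way to put more penalty on false negatives (missed frauds)? This is an exercise to find out how!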
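
To see where the figures in the report come from, the confusion matrix printed above can be unpacked by hand. The sketch below only redoes the arithmetic on those four numbers; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[113582, 372],
               [3930, 224]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)          # share of real frauds that were caught
precision = tp / (tp + fp)       # share of fraud predictions that were right
accuracy = (tp + tn) / cm.sum()  # share of all predictions that were right
baseline = (tn + fp) / cm.sum()  # accuracy of always predicting class 0

print(f"recall={recall:.6f}  precision={precision:.6f}")
print(f"accuracy={accuracy:.6f}  always-0 baseline={baseline:.6f}")
```

The baseline line shows why accuracy is a weak yardstick on data this imbalanced: a model that never flags fraud scores about as well as the KNN model does.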

#### 2) Gaussian Processes Classification