# Project 6 - Create a counterfeit banknote detection algorithm based on Logistic Regression
# Part 3 -  Classification

We have selected our 3 principal components, we can now move on to the next and final part, which consists of creating a predictive model to detect the fake banknotes. The problem we want to solve is a binary classification and the dataset is labelled so this is a supervised problem. 
The size of the dataset is relatively small and the main focus is to be accurate (Remember: we want to detect fake banknotes).
Therefore, Logistic Regression is especially indicated for this case-study. 

This part include all the following steps:

    ✅ Create a predictive model based on Logistic Regression 

## Get started

In [79]:
import pandas as pd
import numpy as np
from pathlib import Path

import matplotlib.pyplot as plt
from matplotlib.cbook import boxplot_stats  
import seaborn as sns
%matplotlib inline

from sklearn import decomposition
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn import metrics

In [98]:
# Read the data
X_PCA1 = pd.read_csv(Path.cwd()/'dataset_PCA.csv',index_col=0)
X_PCA2 = pd.read_csv(Path.cwd()/'dataset_cleaned_PCA.csv',index_col=0)

## Logistic Regression

Last step of this project: creating a predictive model based on Logistic Regression to classify the banknotes and detect the fake ones.

### Preprocessing

In [99]:
# Split between features and labels
# With outliers
features1 = X_PCA1.iloc[:,1:4] #keeping only the 3 first PCs
labels1 =  X_PCA1['is_genuine']

features1.head()

Unnamed: 0,PC1,PC2,PC3
0,2.143117,2.982124,-1.947397
1,-2.051636,0.411908,0.249463
2,-1.953085,0.808068,0.247236
3,-2.03515,-0.359593,-0.537573
4,-2.432789,2.792122,1.962433


In [82]:
# Without outliers
features2 = X_PCA2.iloc[:,1:4]
labels2 =  X_PCA2['is_genuine']

features2.head()

Unnamed: 0,PC1,PC2,PC3
0,-2.098539,-0.486727,0.22854
1,-1.991465,-0.87688,0.154188
2,-2.055383,0.404777,-0.479884
3,1.071453,-1.914158,-1.656851
4,-2.230656,-0.256015,0.323828


In [83]:
# Split between train and test set
    # We apply a ratio 80:20
    # We stratify based on column is_genuine to keep the proportion of fake and genuine banknotes in both sets
    
X1_train, X1_test, y1_train, y1_test = train_test_split(features1,labels1,stratify= labels1,test_size=0.2) 
X2_train, X2_test, y2_train, y2_test = train_test_split(features2,labels2,stratify= labels2,test_size=0.2) 

In [84]:
# Quick check on the structure
print(X1_train.shape, X1_test.shape, y1_train.shape, y1_test.shape)
print(X2_train.shape, X2_test.shape, y2_train.shape, y2_test.shape)

(136, 3) (34, 3) (136,) (34,)
(131, 3) (33, 3) (131,) (33,)


### Apply

In [85]:
# Train the model
model = LogisticRegression(random_state=42)
lgr1 = model.fit(X1_train, y1_train)
lgr2 = model.fit(X2_train, y2_train)

In [86]:
# Make predictions
y1_pred = lgr1.predict(X1_test)
y2_pred = lgr2.predict(X2_test)

In [76]:
# Collect the predictions and probabilities associated into a dataframe
    # Example for dataset with outliers
results = X1_test.copy()
results['proba_Fake'] = lgr1.predict_proba(X1_test)[:,0]
results['proba_Genuine'] = lgr1.predict_proba(X1_test)[:,1]
results['predict_is_genuine'] = lgr1.predict(X1_test)
results.head()

Unnamed: 0,PC1,PC2,PC3,proba_Fake,proba_Genuine,predict_is_genuine
51,0.310311,1.56064,-0.461629,0.562571,0.437429,False
7,-2.538766,-1.190436,-0.229546,0.000247,0.999753,True
68,-1.807811,0.369765,0.050761,0.006432,0.993568,True
128,1.198144,-0.378908,-0.25741,0.85356,0.14644,False
120,2.333952,-1.210722,-0.535306,0.976824,0.023176,False


### Performances

While there are many ways of measuring model performance (precision, recall, F1 Score, ROC Curve, etc), we are going to keep this simple and use the accuracy and the confusion matrix (comparable to what we did before with Kmean).

In [87]:
# Use score method to get accuracy of model
score1 = lgr1.score(X1_test, y1_test)
score2 = lgr2.score(X2_test, y2_test)
print(score1, score2)

0.9411764705882353 0.9696969696969697


- The accuracy is 94% on the dataset with outliers and about 97% when we clean the data. Removing the outliers had a positive impact on the accuracy of the model. 
- The performance of the Logistic Regression model is higher than the performance observed with Kmean when the dataset is clean (about 96%). 

In [88]:
# Confusion matrix
cm = metrics.confusion_matrix(y1_test, y1_pred)
print(cm)

[[14  0]
 [ 2 18]]


In [90]:
# Zoom-in the 2 False Negative
# Extract the test and predictions values
test = pd.Series(y1_test.reset_index().iloc[:,1],name='test')
pred = pd.Series(y1_pred,name='predictions')

# Create a dataframe with the informations
tmp = pd.concat([test, pred], axis=1)

# Select the rows when the 2 columns are not matched
tmp[tmp['test'] != tmp['predictions']]

Unnamed: 0,test,predictions
4,True,False
12,True,False


- On the dataset with outliers, we can see that 2 genuine banknotes have been mislabelled as fake (False Negative).

In [91]:
# Confusion matrix
cm = metrics.confusion_matrix(y2_test, y2_pred)
print(cm)

[[13  1]
 [ 0 19]]


- On the dataset without outliers, 1 fake banknotes have been placed into the genuine set (False positive).

The question here is: Which model to choose? 
- The first model trained on the dataset with outliers, is somehow too selective. It mislabelled 2 genuine banknotes as fake. The global accuracy is still good with 94%.
- The second model trained on the dataset without outliers, has a better accuracy (97%) but it mislabelled a fake banknote as genuine.

In spite of having a lower accuracy, we choose to keep the first model because mislabelling a fake banknote is more "costly" than mislabelling a genuine banknote for a counterfake algorithm.

### Wrap Model

Here we wrap all the steps together:
- Reading the data
- Standardization
- PCA
- Predictions by the trained model
- Extracting the results and probabilities associated

In [100]:
def fake_detector(dataset,idxcol):
    
    #Read the data
    df = pd.read_csv(Path.cwd()/dataset)
    df = df.set_index(idxcol) # set id as index
    df.drop('diagonal', inplace=True, axis=1) # Drop the variable diagonal 

    #Standardization
    mask = df.columns
    df[mask] = StandardScaler().fit_transform(df[mask])
    
    # Extract features
    features = df.columns
    
    # Reduce dimension with PCA
    pca = PCA()
    name = ['PC1','PC2','PC3','PC4','PC5']
    df_PCA = df.reset_index()[['id']]
    df_PCA[name] = pd.DataFrame(pca.fit_transform(df.values),columns=[name])
    
    # Predict
    results = df_PCA.iloc[:,0:4].copy()
    results['proba_Fake'] = lgr2.predict_proba(df_PCA.iloc[:,1:4])[:,0]
    results['proba_Genuine'] = lgr2.predict_proba(df_PCA.iloc[:,1:4])[:,1]
    results['predict_is_genuine'] = lgr2.predict(df_PCA.iloc[:,1:4])
    
    return results[['id','proba_Fake','proba_Genuine','predict_is_genuine']]

In [101]:
# We test the function on a new set
fake_detector('example.csv','id')

Unnamed: 0,id,proba_Fake,proba_Genuine,predict_is_genuine
0,A_1,0.022523,0.977477,True
1,A_2,0.042593,0.957407,True
2,A_3,0.001527,0.998473,True
3,A_4,0.983772,0.016228,False
4,A_5,0.994046,0.005954,False
