# Introduction

In this project, we evaluate the performance and predictive power of a model trained and tested on data collected during a research collaboration between Worldline and the Machine Learning Group. A model that fits this data well could then be used to recognize fraudulent credit card transactions.

## Credit card fraud detection:

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

## Content:

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
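
To make the imbalance concrete, the counts quoted above can be turned into a fraud share and a ratio of legitimate transactions per fraud. This is only a small arithmetic sketch in plain Python; the variable names are illustrative and not part of the notebook.

```python
# Class balance implied by the counts quoted in the dataset description.
n_transactions = 284_807
n_frauds = 492

# Share of fraudulent transactions, in percent.
fraud_share = 100 * n_frauds / n_transactions

# How many legitimate transactions there are for every fraud.
legit_per_fraud = (n_transactions - n_frauds) / n_frauds

print(f"Fraud share: {fraud_share:.2f}%")
print(f"Legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

A ratio of several hundred to one is the reason the later cells resample the training data and report recall rather than accuracy.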

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; it can be used, for example, for example-dependent cost-sensitive learning. The feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

## Acknowledgements

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available at https://www.researchgate.net/project/Fraud-detection-5 and on the page of the DefeatFraud project.

In [None]:
# Import libraries necessary for this project
import os
import warnings
import zipfile

import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from tqdm import tqdm

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import (auc, classification_report, confusion_matrix,
                             f1_score, recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [None]:
# global variable 
SEED = 7

In [None]:
# Unzipping
new_folder = "Data"
zip_file = os.path.join(new_folder, "creditcardfraud.zip")
new_files_location = os.path.join(os.getcwd(), new_folder)

try:
    with zipfile.ZipFile(zip_file, "r") as zip_ref:
        zip_ref.extractall(new_files_location)
except (FileNotFoundError, zipfile.BadZipFile) as err:
    print("Error during unzipping:", err)

del zip_file, new_files_location

raw_data_file = os.path.join(new_folder, "creditcard.csv")

In [None]:
# read raw data
raw_data = pd.read_csv(raw_data_file)


# Overall analysis

In [None]:
# Show shape
raw_data.shape

In [None]:
# Show data
raw_data.head(4)

Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'.

# Data cleaning

In [None]:
#count the missing values for each column
raw_data.isnull().sum()

There are no missing values.

# Look inside data

In [None]:
#Check distribution of class column
count_classes = raw_data['Class'].value_counts().sort_index()
count_classes.plot(kind = 'bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")

In [None]:
#Now let us check the class counts as percentages
class_counts = raw_data['Class'].value_counts()
dataframe_class = pd.DataFrame({'Target': class_counts.index, 'Count': class_counts.values})
dataframe_class['percent'] = (100 * dataframe_class['Count'] / len(raw_data)).round(2)
dataframe_class

The data is imbalanced: only 0.17% of transactions are fraudulent, while 99.83% are valid.
Collecting more data is not possible in this case, so we have to resample the data, for example by oversampling the minority class (which includes generating synthetic samples).
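
To illustrate what generating synthetic samples can mean, here is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the imblearn implementation used later in the notebook; the function name `smote_like_samples`, the toy points, and the choice of `k` are assumptions made only for this example.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=7):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class points in a two-dimensional feature space
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_samples(minority, n_new=6)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies.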


We also have to change the metric: accuracy is not the best metric for evaluating imbalanced datasets, as it can be very misleading. Metrics that provide better insight include the confusion matrix, precision, recall, and F1. I am going to use the confusion matrix and recall.
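
A quick way to see why accuracy misleads here is to score a "classifier" that labels every transaction as legitimate. The sketch below uses only the class counts quoted earlier and plain Python; the helper `confusion_counts` is a hypothetical name for this example, not a function from the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


n_total, n_fraud = 284_807, 492
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # baseline: never predict fraud

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / n_total
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy = {accuracy:.4f}, recall = {recall:.4f}")
```

The baseline is right on almost every transaction yet catches no fraud at all, which is exactly the failure that recall exposes and accuracy hides.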

In [None]:
#Check distribution of transactions in time.
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,4))

bins = 100

ax1.hist(raw_data.Time[raw_data.Class == 1], bins = bins)
ax1.set_title('Fraud')

ax2.hist(raw_data.Time[raw_data.Class == 0], bins = bins)
ax2.set_title('Normal')

plt.xlabel('Time (in Seconds)')
plt.ylabel('Number of Transactions')
plt.show()

In [None]:
def PlotHistogram(df,norm):
    bins = np.arange(df['hour'].min(),df['hour'].max()+2)
    plt.figure(figsize=(15,4))
    sns.distplot(df[df['Class']==0.0]['hour'],
                 norm_hist=norm,
                 bins=bins,
                 kde=False,
                 color='b',
                 hist_kws={'alpha':.5},
                 label='Legit')
    sns.distplot(df[df['Class']==1.0]['hour'],
                 norm_hist=norm,
                 bins=bins,
                 kde=False,
                 color='r',
                 label='Fraud',
                 hist_kws={'alpha':.5})
    plt.xticks(range(0,24))
    plt.legend()
    plt.show()

In [None]:
raw_data['hour'] = raw_data['Time'].apply(lambda x: np.ceil(float(x)/3600) % 24)

In [None]:
print('Normalized histogram of Legit/Fraud over hour of the day')
PlotHistogram(raw_data,True)
print('Counts histogram of Legit/Fraud over hour of the day')
PlotHistogram(raw_data,False)

You can barely see the fraud cases in the counts histogram, since there are so few of them. Hour of the day seems to have some impact on the number of fraud cases. I'll be sure to add the 'hour' dimension to later visualizations to investigate its impact further.
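
For reference, the 'hour' value is derived from the 'Time' column, which counts seconds since the first transaction. The sketch below shows one way to bucket seconds into an hour index; it uses floor division, whereas the notebook cell above rounds up with `np.ceil`, so treat the bucketing choice as an assumption. Since the clock time of the first transaction is not given, the result is an hour offset modulo 24 rather than a confirmed time of day.

```python
def hour_of_day(seconds_elapsed):
    """Map seconds since the first transaction to an hour index in 0..23."""
    return int(seconds_elapsed // 3600) % 24


# A few illustrative offsets: start, end of the first hour, one day later
for seconds in (0, 3599, 3600, 86_400, 90_000):
    print(seconds, "->", hour_of_day(seconds))
```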

Before we train our classifiers, we need to normalize the Amount, since it is on a totally different scale from the other features. Its distribution is also highly skewed, with a lot of statistical outliers. All fraud cases are in the low dollar values of Amount.
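
The normalization meant here is standardization: subtract the mean and divide by the standard deviation. A minimal sketch in plain Python follows; the amounts are made-up values and `standardize` is a hypothetical helper. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up transaction amounts with one large outlier
amounts = [1.0, 5.0, 20.0, 99.0, 2125.87]
scaled = standardize(amounts)
print([round(v, 3) for v in scaled])
```

Standardization changes the scale but not the shape of the distribution, so the skew and the outliers are still present afterwards.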


In [None]:

#Check distribution of amount in time.

f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,6))

ax1.scatter(raw_data.Time[raw_data.Class == 1], raw_data.Amount[raw_data.Class == 1])
ax1.set_title('Fraud')

ax2.scatter(raw_data.Time[raw_data.Class == 0], raw_data.Amount[raw_data.Class == 0])
ax2.set_title('Normal')

plt.xlabel('Time (in Seconds)')
plt.ylabel('Amount')
plt.show()




The 'Time' feature looks pretty similar across both types of transactions. You could argue that fraudulent transactions are more uniformly distributed, while normal transactions have a cyclical distribution. This could make it easier to detect a fraudulent transaction at an 'off-peak' time.



In [None]:
#Transaction amount differs between the two types.
f, (ax1, ax2) = plt.subplots(1, 2, sharex=True, figsize=(12,8))

bins = 8

ax1.hist(raw_data.Amount[raw_data.Class == 1], bins = bins)
ax1.set_title('Fraud')

ax2.hist(raw_data.Amount[raw_data.Class == 0], bins = bins)
ax2.set_title('Normal')

plt.xlabel('Amount (in dollars)')
plt.ylabel('Number of Transactions')
plt.yscale('log')
plt.show()

Most transactions are for small amounts, less than 100 dollars.
Fraudulent transactions have a maximum value far lower than normal transactions:
2,125.87 dollars vs 25,691.16 dollars.

In [None]:
print ("Fraud")
print (raw_data.Amount[raw_data.Class == 1].describe())
print ()
print ("Normal")
print (raw_data.Amount[raw_data.Class == 0].describe())

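The mean amount of fraudulent transactions is about 122 dollars. For normal transactions the mean is about 88 dollars; the figure of roughly 250 dollars in the output above is their standard deviation, not their mean.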

In [None]:
#Select only the anonymized features.
v_features = raw_data.iloc[:,1:29].columns

In [None]:
plt.figure(figsize=(12,28*4))
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(raw_data[v_features]):
    ax = plt.subplot(gs[i])
    sns.distplot(raw_data[cn][raw_data.Class == 1], bins=50)
    sns.distplot(raw_data[cn][raw_data.Class == 0], bins=50)
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn))
plt.show()

In [None]:
#Drop all of the features that have very similar distributions between the two types of transactions.
df_drop = raw_data.drop(['V28','V27','V26','V25','V24','V23','V22','V20','V15','V13','V8'], axis =1)

In [None]:
#Normalize amount of transaction
raw_data['norm_Amount'] = StandardScaler().fit_transform(pd.DataFrame(raw_data['Amount']))
raw_data = raw_data.drop(['Amount'], axis=1)

df_drop['norm_Amount'] = StandardScaler().fit_transform(pd.DataFrame(df_drop['Amount']))
df_drop = df_drop.drop(['Amount'], axis=1)

raw_data.head()

# Oversampling

In [None]:
#Split data to X and y
X = raw_data.drop('Class', axis=1)
y = raw_data.Class

print('Shape of X: {}'.format(X.shape))
print('Shape of y: {}'.format(y.shape))

In [None]:
#Split data to train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

In [None]:
#Oversampling by SMOTE methods
smote = SMOTE(random_state=SEED)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train.to_numpy())

print('After OverSampling, the shape of train_X: {}'.format(X_train_smote.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_smote.shape))
print("After OverSampling, counts of label '1': {}".format(sum(y_train_smote==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_smote==0)))

In [None]:
import itertools

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix `cm` as a heatmap with the raw counts
    written in each cell; `classes` gives the tick labels.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)


    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
def optimum_C(model_X, X_train_smote, y_train_smote):
    #Cross validate with C parameter 
    parameters = {
    'C': np.linspace(1, 10, 2)
             }
    clf = GridSearchCV(model_X, parameters, cv=5, verbose=5, n_jobs=3, scoring = 'recall')
    clf.fit(X_train_smote, y_train_smote.ravel())
    return (clf.best_params_['C'])

In [None]:
def plot_confusion_m(y_test, y_pre):
    #Plot non-normalized confusion matrix
    cnf_matrix = confusion_matrix(y_test, y_pre)
    class_names = [0,1]
    plt.figure(figsize=(6,3))
    plot_confusion_matrix(cnf_matrix , classes=class_names, title='Confusion matrix')
    plt.show()
    return cnf_matrix

In [None]:
def plot_roc_c(y_test, y_pred_sample_score):
    #Plot ROC curve
    
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_sample_score)
    roc_auc = auc(fpr,tpr)
    
    plt.title('Roc curve')
    plt.plot(fpr, tpr, 'b',label='AUC = %0.3f'% roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.xlim([-0.1,1.1])
    plt.ylim([-0.1,1.01])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

In [None]:
def write_results(cnf_matrix,y_test, y_pre):
    #Write recall metric, confusion matrix and bigger classification report
    print("Recall metric in the testing dataset: {}%".format(round(100*cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]), 4)))
    print(confusion_matrix(y_test, y_pre))
    print(classification_report(y_test, y_pre))

In [None]:
def show_results(y_test, y_pre, y_pred_sample_score):
    #Show all results
    cnf_matrix = plot_confusion_m(y_test, y_pre)
    plot_roc_c(y_test, y_pred_sample_score)
    write_results(cnf_matrix,y_test, y_pre)



In [None]:
def show_results_without_auc(y_test, y_pre):
    #Show all results without roc curve
    cnf_matrix = plot_confusion_m(y_test, y_pre)
    write_results(cnf_matrix,y_test, y_pre)


In [None]:
#Choose model Logistic Regression
model1 = LogisticRegression(penalty='l1', verbose=5, solver='liblinear')
#Find optimum C
opt_C = optimum_C(model1, X_train_smote, y_train_smote)
print('optimum C is ', opt_C)
#Prepare model with the best parameter C
lr1 = LogisticRegression(C=opt_C,penalty='l1', verbose=5, solver='liblinear')
#Learn model
lr1.fit(X_train_smote, y_train_smote.ravel())
#Predict label in test set
y_pre = lr1.predict(X_test)
#Decision scores for the ROC curve (the model is already fitted above)
y_pred_sample_score = lr1.decision_function(X_test)
#Print results
show_results(y_test, y_pre, y_pred_sample_score)

In [None]:
#Choose model GaussianNB
lr1 = GaussianNB()
#Learn model
lr1.fit(X_train_smote, y_train_smote.ravel())
#Predict label in test set
y_pre = lr1.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

In [None]:
#Find the best parameter n_neighbors for K-NN

train_accuracy = []
test_accuracy = []

#Train one model per k and record recall on the train and test sets
for i in range(1,15):
    model_lr = KNeighborsClassifier(n_neighbors=i,metric = 'minkowski', p = 2)
    model_lr.fit(X_train_smote, y_train_smote.ravel())
    pred_al = model_lr.predict(X_train_smote)
    pred_lr = model_lr.predict(X_test)
    #The lists keep their original names but hold recall, not accuracy
    train_accuracy.append(recall_score(y_train_smote.ravel(), pred_al))
    test_accuracy.append(recall_score(y_test, pred_lr))
print(' Best k on train: ',np.argmax(train_accuracy)+1, ' Train recall equals:', train_accuracy[np.argmax(train_accuracy)] )
print(' Best k on test: ',np.argmax(test_accuracy)+1, ' Test recall equals:', test_accuracy[np.argmax(test_accuracy)] )

In [None]:
#Plot recall against the K parameter of the K Neighbors Classifier
fig, ax1 = plt.subplots(figsize=(12,8))
plt.title('K for KNeighborsClassifier')
plt.plot(train_accuracy, color='r')
#Print recall for train data set
ax1.set_ylabel('train_recall',color='r')
plt.legend(['train_recall'],loc=(0.01,0.95))
ax2 = ax1.twinx()
#Print recall for test data set
plt.plot(test_accuracy,color='b')
ax2.set_ylabel('test_recall',color='b')
plt.legend(['test_recall'],loc=(0.01,0.90))
plt.grid(True)

In [None]:
#Choose KNN model
model3 = KNeighborsClassifier(n_neighbors = 2, metric = 'minkowski', p = 2)
#Learn model
model3.fit(X_train_smote, y_train_smote.ravel())
#Predict label in test set
y_pre = model3.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

In [None]:
#Choose DecisionTreeClassifier model
model4 = DecisionTreeClassifier(criterion = 'entropy', random_state = 1)
#Learn model
model4.fit(X_train, y_train)
#Predict label in test set
y_pre = model4.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

In [None]:
#Choose RandomForestClassifier model
model5 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 1)
#Learn model
model5.fit(X_train, y_train)
#Predict label in test set
y_pre = model5.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

In [None]:
#Show feature importances in Random Forest Classifier
importances = pd.Series(model5.feature_importances_, index=X.columns)
importances.plot(kind='barh', figsize=(12,8),title='Features importances in Random Forest Classifier ')

In [None]:
#Print the 10 most important features in Random Forest Classifier
sorted_importances_model5 = pd.Series(model5.feature_importances_, index=X.columns).sort_values(ascending=False)
print (sorted_importances_model5[:10])

In [None]:
sorted_importances_model5.index.values[:10]

# 1. Feature engineering

Use the 10 best features from the Random Forest Classifier on raw_data.
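
Selecting the best features comes down to ranking them by importance and keeping the top entries. The sketch below shows that step in plain Python on a hypothetical importance mapping; the feature names and scores are invented for illustration and are not results from this notebook.

```python
def top_k_features(importances, k=10):
    """Return the names of the k features with the highest importance."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores, not taken from the fitted model
importances = {"V14": 0.20, "V10": 0.15, "V4": 0.05, "Time": 0.01, "V17": 0.25}
best = top_k_features(importances, k=3)
print(best)
```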

In [None]:
best_features = sorted_importances_model5.index.values[:10]
best_features = list(best_features)
#Keep only the selected features
X = raw_data.loc[:, raw_data.columns.intersection(best_features)]
y = raw_data.Class

In [None]:
#Split data to train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

In [None]:
#Oversampling by SMOTE methods
smote = SMOTE(random_state=SEED)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train.to_numpy())

print('After OverSampling, the shape of train_X: {}'.format(X_train_smote.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_smote.shape))
print("After OverSampling, counts of label '1': {}".format(sum(y_train_smote==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_smote==0)))

In [None]:
#Choose model Logistic Regression
model6 = LogisticRegression(penalty='l1', verbose=5, solver='liblinear')
#Find optimum C
opt_C = optimum_C(model6, X_train_smote, y_train_smote)
print('optimum C is ', opt_C)
#Prepare model with the best parameter C
lr6 = LogisticRegression(C=opt_C,penalty='l1', verbose=5, solver='liblinear')
#Learn model
lr6.fit(X_train_smote, y_train_smote.ravel())
#Predict label in test set
y_pre = lr6.predict(X_test)
#Decision scores for the ROC curve (the model is already fitted above)
y_pred_sample_score = lr6.decision_function(X_test)
#Print results
show_results(y_test, y_pre, y_pred_sample_score)

In [None]:
#Choose model GaussianNB
lr7 = GaussianNB()
#Learn model
lr7.fit(X_train_smote, y_train_smote.ravel())
#Predict label in test set
y_pre = lr7.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

In [None]:
#Find the best parameter n_neighbors for K-NN

train_accuracy = []
test_accuracy = []

#Train one model per k and record recall on the train and test sets
for i in range(1,15):
    model_lr = KNeighborsClassifier(n_neighbors=i,metric = 'minkowski', p = 2)
    model_lr.fit(X_train_smote, y_train_smote.ravel())
    pred_al = model_lr.predict(X_train_smote)
    pred_lr = model_lr.predict(X_test)
    #The lists keep their original names but hold recall, not accuracy
    train_accuracy.append(recall_score(y_train_smote.ravel(), pred_al))
    test_accuracy.append(recall_score(y_test, pred_lr))
print(' Best k on train: ',np.argmax(train_accuracy)+1, ' Train recall equals:', train_accuracy[np.argmax(train_accuracy)] )
print(' Best k on test: ',np.argmax(test_accuracy)+1, ' Test recall equals:', test_accuracy[np.argmax(test_accuracy)] )

In [None]:
#Plot recall against the K parameter of the K Neighbors Classifier
fig, ax1 = plt.subplots(figsize=(12,8))
plt.title('K for KNeighborsClassifier')
plt.plot(train_accuracy, color='r')
#Print recall for train data set
ax1.set_ylabel('train_recall',color='r')
plt.legend(['train_recall'],loc=(0.01,0.95))
ax2 = ax1.twinx()
#Print recall for test data set
plt.plot(test_accuracy,color='b')
ax2.set_ylabel('test_recall',color='b')
plt.legend(['test_recall'],loc=(0.01,0.90))
plt.grid(True)

In [None]:
#Choose KNN model
model7 = KNeighborsClassifier(n_neighbors = 2, metric = 'minkowski', p = 2)
#Learn model
model7.fit(X_train_smote, y_train_smote.ravel())
#Predict label in test set
y_pre = model7.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

In [None]:
#Choose DecisionTreeClassifier model
model8 = DecisionTreeClassifier(criterion = 'entropy', random_state = 1)
#Learn model
model8.fit(X_train, y_train)
#Predict label in test set
y_pre = model8.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

In [None]:
#Choose RandomForestClassifier model
model9 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 1)
#Learn model
model9.fit(X_train, y_train)
#Predict label in test set
y_pre = model9.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

# 2. Feature engineering

Use the data with all of the features dropped that have very similar distributions between the two types of transactions.
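
The features dropped earlier were chosen by looking at the histograms. As a rough numerical counterpart, one could compare the class means of a feature relative to its spread. The sketch below computes such a standardized mean difference in plain Python; this is an assumption about how similarity might be quantified, not the method used in this notebook, and the sample values are invented.

```python
import math
import statistics


def standardized_mean_difference(a, b):
    """Absolute difference of means divided by the pooled standard deviation."""
    pooled_var = (statistics.pvariance(a) + statistics.pvariance(b)) / 2
    if pooled_var == 0:
        return 0.0
    return abs(statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)


# Invented values: one feature that separates the classes, one that does not
separating = standardized_mean_difference([-5, -6, -4, -7], [0, 1, -1, 0.5])
similar = standardized_mean_difference([0, 1, -1, 0.2], [0.1, 0.9, -1.1, 0.0])
print(f"separating feature: {separating:.2f}, similar feature: {similar:.2f}")
```

A value near zero suggests the two class distributions overlap heavily on that feature, although a comparison of means alone can miss differences in shape that a histogram would show.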

In [None]:
#Split data to X and y
X = df_drop.drop('Class', axis=1)
y = df_drop.Class

print('Shape of X: {}'.format(X.shape))
print('Shape of y: {}'.format(y.shape))

In [None]:
#Split data to train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

In [None]:
#Oversampling by SMOTE methods
smote = SMOTE(random_state=SEED)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train.to_numpy())

print('After OverSampling, the shape of train_X: {}'.format(X_train_smote.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_smote.shape))
print("After OverSampling, counts of label '1': {}".format(sum(y_train_smote==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_smote==0)))

In [None]:
#Choose model Logistic Regression
model10 = LogisticRegression(penalty='l1', verbose=5, solver='liblinear')
#Find optimum C
opt_C = optimum_C(model10, X_train_smote, y_train_smote)
print('optimum C is ', opt_C)
#Prepare model with the best parameter C
lr10 = LogisticRegression(C=opt_C,penalty='l1', verbose=5, solver='liblinear')
#Learn model
lr10.fit(X_train_smote, y_train_smote.ravel())
#Predict label in test set
y_pre = lr10.predict(X_test)
#Decision scores for the ROC curve (the model is already fitted above)
y_pred_sample_score = lr10.decision_function(X_test)
#Print results
show_results(y_test, y_pre, y_pred_sample_score)

In [None]:
#Choose model GaussianNB
lr11 = GaussianNB()
#Learn model
lr11.fit(X_train_smote, y_train_smote.ravel())
#Predict label in test set
y_pre = lr11.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

In [None]:
#Find the best parameter n_neighbors for K-NN

train_accuracy = []
test_accuracy = []

#Train one model per k and record recall on the train and test sets
for i in range(1,15):
    model_lr = KNeighborsClassifier(n_neighbors=i,metric = 'minkowski', p = 2)
    model_lr.fit(X_train_smote, y_train_smote.ravel())
    pred_al = model_lr.predict(X_train_smote)
    pred_lr = model_lr.predict(X_test)
    #The lists keep their original names but hold recall, not accuracy
    train_accuracy.append(recall_score(y_train_smote.ravel(), pred_al))
    test_accuracy.append(recall_score(y_test, pred_lr))
print(' Best k on train: ',np.argmax(train_accuracy)+1, ' Train recall equals:', train_accuracy[np.argmax(train_accuracy)] )
print(' Best k on test: ',np.argmax(test_accuracy)+1, ' Test recall equals:', test_accuracy[np.argmax(test_accuracy)] )

In [None]:
#Plot recall against the K parameter of the K Neighbors Classifier
fig, ax1 = plt.subplots(figsize=(12,8))
plt.title('K for KNeighborsClassifier')
plt.plot(train_accuracy, color='r')
#Print recall for train data set
ax1.set_ylabel('train_recall',color='r')
plt.legend(['train_recall'],loc=(0.01,0.95))
ax2 = ax1.twinx()
#Print recall for test data set
plt.plot(test_accuracy,color='b')
ax2.set_ylabel('test_recall',color='b')
plt.legend(['test_recall'],loc=(0.01,0.90))
plt.grid(True)

In [None]:
#Choose KNN model
model12 = KNeighborsClassifier(n_neighbors = 2, metric = 'minkowski', p = 2)
#Learn model
model12.fit(X_train_smote, y_train_smote.ravel())
#Predict label in test set
y_pre = model12.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

In [None]:
#Choose DecisionTreeClassifier model
model13 = DecisionTreeClassifier(criterion = 'entropy', random_state = 1)
#Learn model
model13.fit(X_train, y_train)
#Predict label in test set
y_pre = model13.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

In [None]:
#Choose RandomForestClassifier model
model14 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 1)
#Learn model
model14.fit(X_train, y_train)
#Predict label in test set
y_pre = model14.predict(X_test)
#Print results
show_results_without_auc(y_test, y_pre)

# Neural Net