## Week 11 Assignment

Description:

Logistic Regression and SVM for Credit Card Fraud Detection

In this assignment, you will use logistic regression and SVM to predict whether a credit card transaction is fraudulent, using the Credit Card Fraud Detection dataset.

Instructions:
    
    1. Use the dataset file “creditcardfraud.csv”.

    2. Preprocess the data by:

        ● Dropping the Time column since it is not useful for classification.

        ● Scaling the Amount column using a standard scaler.

    3. Split the data into training and test sets using an 80:20 split ratio. Use random_state=42 for reproducibility.

    4. Train a logistic regression model on the training set using the default hyperparameters.

    5. Evaluate the model's performance on the test set using:

        ● Confusion matrix

        ● Classification report

    6. Train an SVM model on the training set using the default hyperparameters. Evaluate the model's performance on the test set using the same evaluation metrics as in step 5.

    7. Tune the hyperparameters of the logistic regression model and the SVM model using grid search cross-validation. Use a range of values for the hyperparameters of your choice. Choose an evaluation metric of your choice (e.g. accuracy score) to optimize the hyperparameters.

    8. Train the logistic regression model and the SVM model with the optimal hyperparameters on the training set. Evaluate their performance on the test set using the same evaluation metrics as in step 5.

    9. Compare the performance of the logistic regression model and the SVM model using the evaluation results from steps 5, 6, and 8. Interpret the results and provide insights on which model performed better and why.

    10. Summarize your findings and conclusions in a brief report.

Note: Use the Python programming language and any library of your choice.

Data set - https://drive.google.com/file/d/1cGPZnVkF4MIZDyaPZjhI6ZxwV8Ax_GjG/view?usp=sharing

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
dataset = pd.read_csv("creditcardfraud.csv")
dataset

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,82450,1.314539,0.590643,-0.666593,0.716564,0.301978,-1.125467,0.388881,-0.288390,-0.132137,...,-0.170307,-0.429655,-0.141341,-0.200195,0.639491,0.399476,-0.034321,0.031692,0.76,0
1,50554,-0.798672,1.185093,0.904547,0.694584,0.219041,-0.319295,0.495236,0.139269,-0.760214,...,0.202287,0.578699,-0.092245,0.013723,-0.246466,-0.380057,-0.396030,-0.112901,4.18,0
2,55125,-0.391128,-0.245540,1.122074,-1.308725,-0.639891,0.008678,-0.701304,-0.027315,-2.628854,...,-0.133485,0.117403,-0.191748,-0.488642,-0.309774,0.008100,0.163716,0.239582,15.00,0
3,116572,-0.060302,1.065093,-0.987421,-0.029567,0.176376,-1.348539,0.775644,0.134843,-0.149734,...,0.355576,0.907570,-0.018454,-0.126269,-0.339923,-0.150285,-0.023634,0.042330,57.00,0
4,90434,1.848433,0.373364,0.269272,3.866438,0.088062,0.970447,-0.721945,0.235983,0.683491,...,0.103563,0.620954,0.197077,0.692392,-0.206530,-0.021328,-0.019823,-0.042682,0.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,160243,-2.783865,1.596824,-2.084844,2.512986,-1.446749,-0.828496,-0.732262,-0.203329,-0.347046,...,0.203563,0.293268,0.199568,0.146868,0.163602,-0.624085,-1.333100,0.428634,156.00,1
596,110547,-1.532810,2.232752,-5.923100,3.386708,-0.153443,-1.419748,-3.878576,1.444656,-1.465542,...,0.632505,-0.070838,-0.490291,-0.359983,0.050678,1.095671,0.471741,-0.106667,0.76,1
597,70071,-0.440095,1.137239,-3.227080,3.242293,-2.033998,-1.618415,-3.028013,0.764555,-1.801937,...,0.764187,-0.275578,-0.343572,0.233085,0.606434,-0.315433,0.768291,0.459623,227.30,1
598,93879,-13.086519,7.352148,-18.256576,10.648505,-11.731476,-3.659167,-14.873658,8.810473,-5.418204,...,2.761157,-0.266162,-0.412861,0.519952,-0.743909,-0.167808,-2.498300,-0.711066,30.31,1


In [3]:
dataset.drop("Time",inplace=True,axis=1)
dataset

,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,1.314539,0.590643,-0.666593,0.716564,0.301978,-1.125467,0.388881,-0.288390,-0.132137,-0.597739,...,-0.170307,-0.429655,-0.141341,-0.200195,0.639491,0.399476,-0.034321,0.031692,0.76,0
1,-0.798672,1.185093,0.904547,0.694584,0.219041,-0.319295,0.495236,0.139269,-0.760214,0.170547,...,0.202287,0.578699,-0.092245,0.013723,-0.246466,-0.380057,-0.396030,-0.112901,4.18,0
2,-0.391128,-0.245540,1.122074,-1.308725,-0.639891,0.008678,-0.701304,-0.027315,-2.628854,2.051312,...,-0.133485,0.117403,-0.191748,-0.488642,-0.309774,0.008100,0.163716,0.239582,15.00,0
3,-0.060302,1.065093,-0.987421,-0.029567,0.176376,-1.348539,0.775644,0.134843,-0.149734,-1.238598,...,0.355576,0.907570,-0.018454,-0.126269,-0.339923,-0.150285,-0.023634,0.042330,57.00,0
4,1.848433,0.373364,0.269272,3.866438,0.088062,0.970447,-0.721945,0.235983,0.683491,1.166335,...,0.103563,0.620954,0.197077,0.692392,-0.206530,-0.021328,-0.019823,-0.042682,0.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,-2.783865,1.596824,-2.084844,2.512986,-1.446749,-0.828496,-0.732262,-0.203329,-0.347046,-2.162061,...,0.203563,0.293268,0.199568,0.146868,0.163602,-0.624085,-1.333100,0.428634,156.00,1
596,-1.532810,2.232752,-5.923100,3.386708,-0.153443,-1.419748,-3.878576,1.444656,-1.465542,-5.208335,...,0.632505,-0.070838,-0.490291,-0.359983,0.050678,1.095671,0.471741,-0.106667,0.76,1
597,-0.440095,1.137239,-3.227080,3.242293,-2.033998,-1.618415,-3.028013,0.764555,-1.801937,-4.711769,...,0.764187,-0.275578,-0.343572,0.233085,0.606434,-0.315433,0.768291,0.459623,227.30,1
598,-13.086519,7.352148,-18.256576,10.648505,-11.731476,-3.659167,-14.873658,8.810473,-5.418204,-13.202577,...,2.761157,-0.266162,-0.412861,0.519952,-0.743909,-0.167808,-2.498300,-0.711066,30.31,1


In [4]:
scaler = StandardScaler()

In [5]:
scaler.fit_transform(dataset[["Amount"]])

array([[-4.43050529e-01],
       [-4.27820564e-01],
       [-3.79636872e-01],
       [-1.92602209e-01],
       [-4.46434966e-01],
       [-2.50983743e-01],
       [-2.24531698e-01],
       [-2.90572747e-01],
       [ 5.28817203e-01],
       [-3.93664472e-01],
       [-2.86564861e-01],
       [-4.37617617e-01],
       [ 3.54910147e-02],
       [-1.96209307e-01],
       [-4.01947435e-01],
       [-3.84134610e-01],
       [ 1.49382700e+00],
       [-1.31815944e-01],
       [-3.39157227e-01],
       [-3.67078830e-01],
       [ 1.87809417e+00],
       [-3.57370841e-01],
       [-3.98607531e-01],
       [-4.41981760e-01],
       [ 5.18084976e-01],
       [ 3.51072864e+00],
       [-4.37617617e-01],
       [ 1.59201084e-01],
       [-4.42471612e-01],
       [-3.95802011e-01],
       [-1.93092062e-01],
       [-3.11102028e-01],
       [-3.25218692e-01],
       [-3.59597444e-01],
       [-3.73847704e-01],
       [-3.48508960e-01],
       [-3.79636872e-01],
       [-4.43005997e-01],
       [-4.3

In [6]:
dataset["Amount"] = scaler.fit_transform(dataset[["Amount"]])
dataset

,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,1.314539,0.590643,-0.666593,0.716564,0.301978,-1.125467,0.388881,-0.288390,-0.132137,-0.597739,...,-0.170307,-0.429655,-0.141341,-0.200195,0.639491,0.399476,-0.034321,0.031692,-0.443051,0
1,-0.798672,1.185093,0.904547,0.694584,0.219041,-0.319295,0.495236,0.139269,-0.760214,0.170547,...,0.202287,0.578699,-0.092245,0.013723,-0.246466,-0.380057,-0.396030,-0.112901,-0.427821,0
2,-0.391128,-0.245540,1.122074,-1.308725,-0.639891,0.008678,-0.701304,-0.027315,-2.628854,2.051312,...,-0.133485,0.117403,-0.191748,-0.488642,-0.309774,0.008100,0.163716,0.239582,-0.379637,0
3,-0.060302,1.065093,-0.987421,-0.029567,0.176376,-1.348539,0.775644,0.134843,-0.149734,-1.238598,...,0.355576,0.907570,-0.018454,-0.126269,-0.339923,-0.150285,-0.023634,0.042330,-0.192602,0
4,1.848433,0.373364,0.269272,3.866438,0.088062,0.970447,-0.721945,0.235983,0.683491,1.166335,...,0.103563,0.620954,0.197077,0.692392,-0.206530,-0.021328,-0.019823,-0.042682,-0.446435,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,-2.783865,1.596824,-2.084844,2.512986,-1.446749,-0.828496,-0.732262,-0.203329,-0.347046,-2.162061,...,0.203563,0.293268,0.199568,0.146868,0.163602,-0.624085,-1.333100,0.428634,0.248265,1
596,-1.532810,2.232752,-5.923100,3.386708,-0.153443,-1.419748,-3.878576,1.444656,-1.465542,-5.208335,...,0.632505,-0.070838,-0.490291,-0.359983,0.050678,1.095671,0.471741,-0.106667,-0.443051,1
597,-0.440095,1.137239,-3.227080,3.242293,-2.033998,-1.618415,-3.028013,0.764555,-1.801937,-4.711769,...,0.764187,-0.275578,-0.343572,0.233085,0.606434,-0.315433,0.768291,0.459623,0.565779,1
598,-13.086519,7.352148,-18.256576,10.648505,-11.731476,-3.659167,-14.873658,8.810473,-5.418204,-13.202577,...,2.761157,-0.266162,-0.412861,0.519952,-0.743909,-0.167808,-2.498300,-0.711066,-0.311458,1
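
Cell 6 fits the scaler on all 600 rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. The assignment prescribes this order, but a common alternative is to put the scaler inside a pipeline so it is fitted on the training rows only. The following is a minimal sketch of that idea on synthetic stand-in data; the column values, the `demo` frame, and the step names are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): two features plus Amount.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "V1": rng.normal(size=200),
    "V2": rng.normal(size=200),
    "Amount": rng.exponential(scale=50.0, size=200),
})
demo["Class"] = (demo["V1"] + 0.5 * demo["V2"] > 0).astype(int)

X_demo = demo.drop("Class", axis=1)
y_demo = demo["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scale only Amount; pass the remaining columns through unchanged.
preprocess = ColumnTransformer(
    [("scale_amount", StandardScaler(), ["Amount"])],
    remainder="passthrough",
)
pipe = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# fit() learns the scaling statistics from the training rows only.
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
```

A pipeline like this can also be passed directly to GridSearchCV, in which case the scaler is re-fitted inside every cross-validation fold.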


In [7]:
X=dataset.drop("Class",axis=1)
X

,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,1.314539,0.590643,-0.666593,0.716564,0.301978,-1.125467,0.388881,-0.288390,-0.132137,-0.597739,...,-0.058040,-0.170307,-0.429655,-0.141341,-0.200195,0.639491,0.399476,-0.034321,0.031692,-0.443051
1,-0.798672,1.185093,0.904547,0.694584,0.219041,-0.319295,0.495236,0.139269,-0.760214,0.170547,...,-0.081298,0.202287,0.578699,-0.092245,0.013723,-0.246466,-0.380057,-0.396030,-0.112901,-0.427821
2,-0.391128,-0.245540,1.122074,-1.308725,-0.639891,0.008678,-0.701304,-0.027315,-2.628854,2.051312,...,0.065716,-0.133485,0.117403,-0.191748,-0.488642,-0.309774,0.008100,0.163716,0.239582,-0.379637
3,-0.060302,1.065093,-0.987421,-0.029567,0.176376,-1.348539,0.775644,0.134843,-0.149734,-1.238598,...,-0.169706,0.355576,0.907570,-0.018454,-0.126269,-0.339923,-0.150285,-0.023634,0.042330,-0.192602
4,1.848433,0.373364,0.269272,3.866438,0.088062,0.970447,-0.721945,0.235983,0.683491,1.166335,...,-0.282777,0.103563,0.620954,0.197077,0.692392,-0.206530,-0.021328,-0.019823,-0.042682,-0.446435
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,-2.783865,1.596824,-2.084844,2.512986,-1.446749,-0.828496,-0.732262,-0.203329,-0.347046,-2.162061,...,-0.515001,0.203563,0.293268,0.199568,0.146868,0.163602,-0.624085,-1.333100,0.428634,0.248265
596,-1.532810,2.232752,-5.923100,3.386708,-0.153443,-1.419748,-3.878576,1.444656,-1.465542,-5.208335,...,0.520840,0.632505,-0.070838,-0.490291,-0.359983,0.050678,1.095671,0.471741,-0.106667,-0.443051
597,-0.440095,1.137239,-3.227080,3.242293,-2.033998,-1.618415,-3.028013,0.764555,-1.801937,-4.711769,...,0.895841,0.764187,-0.275578,-0.343572,0.233085,0.606434,-0.315433,0.768291,0.459623,0.565779
598,-13.086519,7.352148,-18.256576,10.648505,-11.731476,-3.659167,-14.873658,8.810473,-5.418204,-13.202577,...,-1.376298,2.761157,-0.266162,-0.412861,0.519952,-0.743909,-0.167808,-2.498300,-0.711066,-0.311458


In [8]:
y = dataset["Class"]
y

0      0
1      0
2      0
3      0
4      0
      ..
595    1
596    1
597    1
598    1
599    1
Name: Class, Length: 600, dtype: int64

In [11]:
xtrain, xtest, ytrain, ytest =  train_test_split(X, y , test_size=0.2, random_state = 42)

In [12]:
(xtrain.shape,xtest.shape)

((480, 29), (120, 29))

In [13]:
(ytrain.shape,ytest.shape)

((480,), (120,))
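
The split above is purely random, so the class proportions in the training and test sets can drift from those of the full data. If that matters, `train_test_split` accepts a `stratify` argument. The sketch below uses hypothetical labels (400 legitimate, 200 fraudulent) rather than the real dataset, whose class counts are not shown in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels (assumption): 400 legitimate (0) and 200 fraudulent (1) rows.
y_demo = np.array([0] * 400 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep the 2:1 class ratio of the full data.
train_fraud_share = y_tr.mean()
test_fraud_share = y_te.mean()
```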

In [14]:
log_r = LogisticRegression()

In [15]:
log_r.fit(xtrain,ytrain)

In [16]:
ypred = log_r.predict(xtest)

In [17]:
confusion_matrix(y_true=ytest,y_pred=ypred)

array([[59,  3],
       [ 4, 54]], dtype=int64)

In [18]:
confusion_matrix_df = pd.DataFrame(confusion_matrix(y_true=ytest,y_pred=ypred),
                                   columns=pd.MultiIndex.from_tuples([("Predicted","False"),("Predicted","True")]),
                                   index=pd.MultiIndex.from_tuples([("Actual","False"),("Actual","True")]))
confusion_matrix_df

,,Predicted,Predicted
,,False,True
Actual,False,59,3
Actual,True,4,54


In [19]:
classification_report(y_true=ytest,y_pred=ypred,output_dict=True)

{'0': {'precision': 0.9365079365079365,
  'recall': 0.9516129032258065,
  'f1-score': 0.944,
  'support': 62},
 '1': {'precision': 0.9473684210526315,
  'recall': 0.9310344827586207,
  'f1-score': 0.9391304347826087,
  'support': 58},
 'accuracy': 0.9416666666666667,
 'macro avg': {'precision': 0.9419381787802841,
  'recall': 0.9413236929922135,
  'f1-score': 0.9415652173913043,
  'support': 120},
 'weighted avg': {'precision': 0.9417571707045391,
  'recall': 0.9416666666666667,
  'f1-score': 0.9416463768115941,
  'support': 120}}

In [20]:
classification_report_df = pd.DataFrame(classification_report(y_true=ytest,y_pred=ypred,output_dict=True))
classification_report_df

,0,1,accuracy,macro avg,weighted avg
precision,0.936508,0.947368,0.941667,0.941938,0.941757
recall,0.951613,0.931034,0.941667,0.941324,0.941667
f1-score,0.944,0.93913,0.941667,0.941565,0.941646
support,62.0,58.0,0.941667,120.0,120.0
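
The figures in the classification report follow directly from the confusion matrix of the default logistic regression. As a quick check, this sketch recomputes the fraud-class (label 1) precision, recall, and F1 score, plus the overall accuracy, from the four counts reported above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the default logistic regression:
# rows are actual classes, columns are predicted classes.
cm = np.array([[59, 3],
               [4, 54]])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

precision_fraud = tp / (tp + fp)   # 54 / 57
recall_fraud = tp / (tp + fn)      # 54 / 58
f1_fraud = 2 * precision_fraud * recall_fraud / (precision_fraud + recall_fraud)
accuracy = (tp + tn) / (tn + fp + fn + tp)   # 113 / 120
```

For fraud detection, the recall of class 1 is usually the number to watch, since each false negative is a fraudulent transaction that went undetected.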


In [21]:
svc = SVC()

In [22]:
svc.fit(xtrain, ytrain)

In [23]:
ypred = svc.predict(xtest)

In [24]:
confusion_matrix(y_true=ytest,y_pred=ypred)

array([[62,  0],
       [ 6, 52]], dtype=int64)

In [25]:
confusion_matrix_df = pd.DataFrame(confusion_matrix(y_true=ytest,y_pred=ypred),
                                   columns=pd.MultiIndex.from_tuples([("Predicted","False"),("Predicted","True")]),
                                   index=pd.MultiIndex.from_tuples([("Actual","False"),("Actual","True")]))
confusion_matrix_df

,,Predicted,Predicted
,,False,True
Actual,False,62,0
Actual,True,6,52


In [26]:
classification_report(y_true=ytest,y_pred=ypred,output_dict=True)

{'0': {'precision': 0.9117647058823529,
  'recall': 1.0,
  'f1-score': 0.9538461538461539,
  'support': 62},
 '1': {'precision': 1.0,
  'recall': 0.896551724137931,
  'f1-score': 0.9454545454545454,
  'support': 58},
 'accuracy': 0.95,
 'macro avg': {'precision': 0.9558823529411764,
  'recall': 0.9482758620689655,
  'f1-score': 0.9496503496503497,
  'support': 120},
 'weighted avg': {'precision': 0.9544117647058824,
  'recall': 0.95,
  'f1-score': 0.9497902097902099,
  'support': 120}}

In [27]:
classification_report_df = pd.DataFrame(classification_report(y_true=ytest,y_pred=ypred,output_dict=True))
classification_report_df

,0,1,accuracy,macro avg,weighted avg
precision,0.911765,1.0,0.95,0.955882,0.954412
recall,1.0,0.896552,0.95,0.948276,0.95
f1-score,0.953846,0.945455,0.95,0.94965,0.94979
support,62.0,58.0,0.95,120.0,120.0


In [28]:
grid = { 
    "solver" : ["lbfgs", "liblinear", "newton-cg", "newton-cholesky", "sag", "saga"],
    'C': [1, 10,100],
    'random_state' : [42,None],
    "warm_start" : [True,False],
    "max_iter" : [50000]
}
log_r_GSCV = GridSearchCV(LogisticRegression(),param_grid=grid,cv=5,scoring="accuracy")

In [29]:
log_r_GSCV.fit(xtrain, ytrain)

In [30]:
log_r_GSCV.best_params_

{'C': 1,
 'max_iter': 50000,
 'random_state': 42,
 'solver': 'liblinear',
 'warm_start': True}
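
Re-creating the model by hand from `best_params_` works, but it is not strictly required: with the default `refit=True`, GridSearchCV retrains the best configuration on the whole training set and exposes it as `best_estimator_`. A minimal sketch on synthetic stand-in data follows; the data and the small grid are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the fraud data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already trained on all of X_tr.
best_model = search.best_estimator_
cv_accuracy = search.best_score_            # mean cross-validated accuracy
test_predictions = best_model.predict(X_te)
```

Reading `best_score_` alongside `best_params_` also shows how much the tuned configuration gained in cross-validation before the test set is touched.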

In [31]:
#Training the logistic regression model with the most optimal hyperparameters

In [32]:
log_r = LogisticRegression(C=1,max_iter=50000,random_state=42,solver="liblinear",warm_start=True)

In [33]:
log_r.fit(xtrain,ytrain)

In [34]:
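# Predict with the tuned logistic regression. The recorded run called svc.predict(xtest) here,
# so the outputs shown for cells 35 and 36 repeat the default SVC results.
ypred = log_r.predict(xtest)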

In [35]:
confusion_matrix_df = pd.DataFrame(confusion_matrix(y_true=ytest,y_pred=ypred),
                                   columns=pd.MultiIndex.from_tuples([("Predicted","False"),("Predicted","True")]),
                                   index=pd.MultiIndex.from_tuples([("Actual","False"),("Actual","True")]))
confusion_matrix_df

,,Predicted,Predicted
,,False,True
Actual,False,62,0
Actual,True,6,52


In [36]:
classification_report_df = pd.DataFrame(classification_report(y_true=ytest,y_pred=ypred,output_dict=True))
classification_report_df

,0,1,accuracy,macro avg,weighted avg
precision,0.911765,1.0,0.95,0.955882,0.954412
recall,1.0,0.896552,0.95,0.948276,0.95
f1-score,0.953846,0.945455,0.95,0.94965,0.94979
support,62.0,58.0,0.95,120.0,120.0
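
Reusing a single `ypred` variable across several models makes it easy to score the wrong estimator. One way to reduce that risk is a small helper that takes the model as an argument and produces both reports from its own predictions. The sketch below is illustrative only: `evaluate_model` is a hypothetical name, and the data is a synthetic stand-in.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def evaluate_model(model, X_test, y_test):
    """Return (confusion matrix, classification report) for one fitted model."""
    y_pred = model.predict(X_test)  # predictions always come from `model`
    cm = pd.DataFrame(
        confusion_matrix(y_true=y_test, y_pred=y_pred, labels=[0, 1]),
        index=["Actual 0", "Actual 1"],
        columns=["Predicted 0", "Predicted 1"],
    )
    report = pd.DataFrame(
        classification_report(y_true=y_test, y_pred=y_pred, output_dict=True)
    )
    return cm, report


# Synthetic stand-in data (assumption), just to exercise the helper.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for name, model in {"log_r": LogisticRegression(max_iter=1000), "svc": SVC()}.items():
    model.fit(X_tr, y_tr)
    results[name] = evaluate_model(model, X_te, y_te)
```

Keeping the results in a dictionary keyed by model name also makes the final comparison easier to tabulate.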


In [37]:
grid = { 
    "kernel" : ["linear", "poly", "rbf", "sigmoid"],
    'random_state' : [42,None],
    'C': [1, 10,100],
    'gamma': [0.001, 0.0001],
    "max_iter" : [430000,440000,450000]
}
svc_GSCV = GridSearchCV(SVC(),param_grid=grid,cv=5,scoring="accuracy")

In [38]:
svc_GSCV.fit(xtrain, ytrain)

In [39]:
svc_GSCV.best_params_

{'C': 10,
 'gamma': 0.001,
 'kernel': 'rbf',
 'max_iter': 430000,
 'random_state': 42}
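
The grid in cell 37 crosses 4 kernels × 2 random_state values × 3 C values × 2 gamma values × 3 max_iter values, which gives 144 candidates. Several of these are duplicates in effect: `gamma` is ignored by the linear kernel, and SVC's `random_state` only matters when `probability=True`. GridSearchCV also accepts a list of dictionaries, so each kernel family can get only the parameters it uses. A hedged sketch on synthetic stand-in data (the data and the reduced grid are assumptions, not the notebook's actual search):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the fraud data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# One dictionary per kernel family: gamma is searched only where it has an effect.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10, 100]},
    {"kernel": ["rbf", "poly", "sigmoid"], "C": [1, 10, 100], "gamma": [0.001, 0.0001]},
]
search = GridSearchCV(SVC(), param_grid=param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

# 3 linear candidates + 3 kernels * 3 C values * 2 gamma values = 21 candidates.
n_candidates = len(search.cv_results_["params"])
```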

In [40]:
#Training the Support Vector Machine Classification model with the most optimal hyperparameters

In [41]:
svc = SVC(C = 10,gamma=0.001,kernel="rbf",max_iter=430000,random_state=42)

In [42]:
svc.fit(xtrain,ytrain)

In [43]:
ypred = svc.predict(xtest)

In [44]:
confusion_matrix_df = pd.DataFrame(confusion_matrix(y_true=ytest,y_pred=ypred),
                                   columns=pd.MultiIndex.from_tuples([("Predicted","False"),("Predicted","True")]),
                                   index=pd.MultiIndex.from_tuples([("Actual","False"),("Actual","True")]))
confusion_matrix_df

,,Predicted,Predicted
,,False,True
Actual,False,62,0
Actual,True,5,53


In [45]:
classification_report_df = pd.DataFrame(classification_report(y_true=ytest,y_pred=ypred,output_dict=True))
classification_report_df

,0,1,accuracy,macro avg,weighted avg
precision,0.925373,1.0,0.958333,0.962687,0.961443
recall,1.0,0.913793,0.958333,0.956897,0.958333
f1-score,0.96124,0.954955,0.958333,0.958098,0.958202
support,62.0,58.0,0.958333,120.0,120.0


1) For the Logistic Regression model, training with the default hyperparameters gave a test accuracy of 0.941667.
   
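2) The tuned Logistic Regression model (C=1, solver="liblinear") is reported with an accuracy of 0.95, but that figure and its confusion matrix are identical to the default SVC's, because the predictions in cell 34 of the recorded run came from svc rather than log_r. The tuned Logistic Regression's test accuracy is therefore not established by the recorded outputs.
   
3) Consequently, the apparent increase of about 0.008 for the Logistic Regression model (0.941667 to 0.95) is not supported. Note also that the grid search kept C=1, which is the default value.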
   
4) For the Support Vector Classifier, training with the default hyperparameters gave a test accuracy of 0.95.
   
5) Training the Support Vector Classifier with the tuned hyperparameters (C=10, gamma=0.001, kernel="rbf") gave a test accuracy of 0.958333.
    
6) That is an increase of about 0.008, which corresponds to one more of the 120 test transactions being classified correctly (5 missed frauds instead of 6).
    
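7) The SVM performs slightly better than the Logistic Regression on this test set. One possible reason is that Logistic Regression learns a linear decision boundary, whereas the SVM with an RBF kernel can model non-linear relationships. The gap is small (one or two of the 120 test transactions), so the comparison should be treated as tentative.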
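
To put the size of these differences in context, the sketch below estimates the binomial standard error of each recorded accuracy on a 120-row test set. The counts of correct predictions are taken from the confusion matrices above (113, 114, and 115 out of 120); treating the test rows as independent draws is an assumption of this rough check, and the names are illustrative.

```python
import math

n_test = 120
# Correct predictions on the test set, read off the recorded confusion matrices.
correct = {"log_r default": 113, "svc default": 114, "svc tuned": 115}

summary = {}
for name, k in correct.items():
    acc = k / n_test
    # Binomial standard error of an accuracy estimated from n_test samples.
    se = math.sqrt(acc * (1 - acc) / n_test)
    summary[name] = (acc, se)
    print(f"{name}: accuracy={acc:.4f}, standard error={se:.4f}")
```

Each standard error comes out at roughly 0.02, which is about as large as the gaps between the models, so a larger test set or cross-validated scores would be needed to rank them with confidence.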