---
**NB:** It is **MANDATORY** to comment every line of code that needs explanation.
<br>

**Interpret results obtained for questions 2-4**

---

# Assignment 5 - Modelling

In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include the confidential variables `V1` through `V28`, as well as `Amount`, which is the amount of the transaction.

The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Question 1
Import the data from `fraud_data.csv`. What percentage of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 

In [2]:
df = pd.read_csv("fraud_data.csv")

In [3]:
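def solution_1():
    # Count the rows in each class (most frequent class first)
    counts = df.Class.value_counts()
    # Share of each class among all transactions, rounded to 2 decimals
    percent_list = [round(n / counts.sum(), 2) for n in counts]
    # Take the labels from the same Series so they line up with the shares
    classes = list(counts.index)

    # Note: the question asks for one float; that is the Percent value for class 1
    return pd.concat([pd.DataFrame(classes),pd.DataFrame(percent_list)], axis=1,  keys=['Class','Percent'])

solution_1()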

| | Class | Percent |
|---|---|---|
| 0 | 0 | 0.98 |
| 1 | 1 | 0.02 |
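
Because the question asks for a single float between 0 and 1, the fraud share can also be read directly as the mean of the 0/1 target column. The sketch below is a minimal illustration, not the notebook's solution: the `toy` frame and the helper `fraud_fraction` are hypothetical stand-ins for `fraud_data.csv` and `solution_1`.

```python
import pandas as pd

# Hypothetical stand-in for fraud_data.csv: 3 frauds among 10 transactions
toy = pd.DataFrame({
    "Amount": [12.5, 3.0, 250.0, 8.2, 40.0, 5.5, 999.0, 17.3, 2.1, 310.0],
    "Class":  [0,    0,   1,     0,   0,    0,   1,     0,    0,   1],
})

def fraud_fraction(frame):
    """Return the share of rows whose Class is 1 as a plain Python float."""
    # The mean of a 0/1 column equals the fraction of ones
    return float(frame["Class"].mean())

toy_fraction = fraud_fraction(toy)
```

On this toy frame `toy_fraction` is 0.3. Applied to `df`, the same expression would give the unrounded fraud share that the table above reports as 0.02.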


In [4]:
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


### Question 2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*

In [6]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score, accuracy_score

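def Dummy_Classifier(X_TRAIN,X_TEST,Y_TRAIN,Y_TEST):
    # Always predict the most frequent class seen in the training labels,
    # as the question requires (older scikit-learn defaults to 'stratified')
    dummy_class = DummyClassifier(strategy='most_frequent')
    dummy_class.fit(X_TRAIN,Y_TRAIN)
    Y_PRED = dummy_class.predict(X_TEST)
    
    # Recall: share of actual frauds that were flagged
    TPR = recall_score(Y_TEST,Y_PRED)
    # Accuracy: share of all test rows predicted correctly
    ACC = round(accuracy_score(Y_TEST,Y_PRED),4)
    
    return "Accuracy: ",ACC,"Recall: ",TPR

Dummy_Classifier(X_train,X_test,y_train,y_test)

('Accuracy: ', 0.9853, 'Recall: ', 0.0)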
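
A classifier that always predicts the majority class never flags a fraud, so its recall is 0 by construction, while its accuracy equals the share of the majority class. The sketch below illustrates this on a small hypothetical dataset (`X_toy`, `y_toy` are invented for the example and are not the fraud data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy labels: 98 legitimate transactions (0) and 2 frauds (1)
X_toy = np.zeros((100, 1))            # DummyClassifier ignores the features
y_toy = np.array([0] * 98 + [1] * 2)

# Baseline that always predicts the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

toy_accuracy = accuracy_score(y_toy, y_hat)  # share of correct predictions
toy_recall = recall_score(y_toy, y_hat)      # share of frauds that were caught
```

Here `toy_accuracy` is 0.98 and `toy_recall` is 0.0. On imbalanced data, high accuracy alone therefore says little about how well fraud is detected.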

### Question 3

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [7]:
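from sklearn.svm import SVC
from sklearn.metrics import recall_score, accuracy_score, precision_score

def SVC_Classifier(X_TRAIN,X_TEST,Y_TRAIN,Y_TEST):
    # Fit a support vector classifier with default hyperparameters
    SV_class = SVC(random_state=0)
    SV_class.fit(X_TRAIN,Y_TRAIN)
    # Predict the class of every test transaction
    Y_PRED = SV_class.predict(X_TEST)
    
    # Recall: share of actual frauds that were flagged
    TPR = recall_score(Y_TEST,Y_PRED)
    # Accuracy: share of all test rows predicted correctly
    ACC = round(accuracy_score(Y_TEST,Y_PRED),4)
    # Precision: share of flagged transactions that really were fraud
    PREC = round(precision_score(Y_TEST,Y_PRED),4)
    
    return "Accuracy: ",ACC,"Recall: ",TPR, "Precision: ", PREC

SVC_Classifier(X_train,X_test,y_train,y_test)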

('Accuracy: ', 0.99, 'Recall: ', 0.35, 'Precision: ', 0.9333)
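
To interpret these scores, it helps to translate them back into counts. The sketch below is a back-of-the-envelope reconstruction, assuming the test set holds 80 fraud and 5344 non-fraud rows, as read from the Question 4 confusion matrix (which uses the same `X_test` and `y_test`). The variable names are illustrative.

```python
# Assumption: class sizes of the test set, taken from the Question 4 matrix
n_fraud, n_legit = 80, 5344

# Scores reported above for the default SVC
recall, precision = 0.35, 0.9333

tp = round(recall * n_fraud)        # frauds the model caught
fn = n_fraud - tp                   # frauds the model missed
fp = round(tp / precision - tp)     # legitimate transactions flagged as fraud

# Accuracy implied by those counts
accuracy = (tp + n_legit - fp) / (n_fraud + n_legit)
```

Under that assumption the default SVC catches 28 frauds, misses 52, and raises only 2 false alarms, which is consistent with the reported accuracy of 0.99: the model is rarely wrong when it flags fraud, but it lets most fraud through.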

### Question 4

Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use `X_test` and `y_test`.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*
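
Thresholding the decision function means labelling a row as class 1 whenever its score is at or above a chosen cut-off, instead of relying on `predict`. The sketch below shows the idea on a hypothetical imbalanced toy problem built with `make_classification`; the data, the helper `predict_at`, and the threshold of -0.5 are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

# Hypothetical imbalanced toy problem (roughly 10% positives), not the fraud data
X_toy, y_toy = make_classification(
    n_samples=400, n_features=5, weights=[0.9, 0.1], random_state=0
)

clf = SVC(random_state=0).fit(X_toy, y_toy)
scores = clf.decision_function(X_toy)  # one score per row; higher favours class 1

def predict_at(scores, threshold):
    """Label a row 1 when its score is at or above the threshold, else 0."""
    return (scores >= threshold).astype(int)

# Confusion matrices (rows = actual, columns = predicted) at two thresholds
cm_default = confusion_matrix(y_toy, predict_at(scores, 0.0))
cm_lenient = confusion_matrix(y_toy, predict_at(scores, -0.5))
```

Lowering the threshold can only move rows from predicted 0 to predicted 1, so both true positives and false positives can rise but never fall. That is the recall-versus-precision trade-off the question explores.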

In [8]:
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

def Conf_matrix(X_TRAIN,X_TEST,Y_TRAIN,Y_TEST):
    # Fit the SVC with the hyperparameters given in the question
    SV_class = SVC(C=1e9, gamma=1e-07, random_state=0)
    SV_class.fit(X_TRAIN,Y_TRAIN)

    # Raw decision scores for the test set; higher means more fraud-like
    Dec_func = SV_class.decision_function(X_TEST)
    thresh = -220

    # Label a transaction as fraud (1) when its score is at or above the threshold
    Y_PRED = (Dec_func >= thresh).astype(int)

    # Rows are the actual classes, columns are the predicted classes
    matrix = pd.DataFrame(confusion_matrix(Y_TEST,Y_PRED), columns=['Predicted Not Fraud','Predicted Fraud'],
                 index=['Not Fraud','Fraud'])
    
    return matrix

Conf_matrix(X_train,X_test,y_train,y_test)

| | Predicted Not Fraud | Predicted Fraud |
|---|---|---|
| Not Fraud | 5320 | 24 |
| Fraud | 14 | 66 |
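
The matrix can be turned into the same scores used in Questions 2 and 3, which makes it easier to interpret. The sketch below copies the four counts from the output above into a NumPy array and derives recall, precision, and accuracy from them. The variable names are illustrative.

```python
import numpy as np

# Counts copied from the matrix above: rows = actual, columns = predicted
cm = np.array([[5320, 24],
               [14,   66]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)           # 66 / 80 = 0.825
precision = tp / (tp + fp)        # 66 / 90, about 0.7333
accuracy = (tp + tn) / cm.sum()   # 5386 / 5424, about 0.9930
```

Compared with the default SVC in Question 3 (recall 0.35, precision 0.9333), this model and threshold catch far more fraud, 66 of 80 cases, at the cost of more false alarms: 24 legitimate transactions are flagged.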
