---
**NB:** It is **MANDATORY** to comment every line of code you write wherever the need arises.
<br>

**Interpret results obtained for questions 2-4**

---

# Assignment 5 - Modelling

In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. 
 
The target is stored in the `Class` column, where a value of 1 corresponds to a fraudulent transaction and 0 to a non-fraudulent one.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Question 1
Import the data from `fraud_data.csv`. What percentage of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 

In [2]:
df  = pd.read_csv('fraud_data.csv')

In [3]:
df.head(3)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,1.176563,0.323798,0.536927,1.047002,-0.368652,-0.728586,0.084678,-0.069246,-0.266389,0.155315,...,-0.109627,-0.341365,0.057845,0.49918,0.415211,-0.581949,0.015472,0.018065,4.67,0
1,0.681109,-3.934776,-3.801827,-1.147468,-0.73554,-0.501097,1.038865,-0.626979,-2.274423,1.527782,...,0.652202,0.272684,-0.982151,0.1659,0.360251,0.195321,-0.256273,0.056501,912.0,0
2,1.140729,0.453484,0.24701,2.383132,0.343287,0.432804,0.09338,0.17331,-0.808999,0.775436,...,-0.003802,0.058556,-0.121177,-0.304215,0.645893,0.1226,-0.012115,-0.005945,1.0,0


In [4]:
#df.info()

In [5]:
df.iloc[:,2:30].values

array([[ 5.36927047e-01,  1.04700168e+00, -3.68652161e-01, ...,
         1.80650395e-02,  4.67000000e+00,  0.00000000e+00],
       [-3.80182745e+00, -1.14746778e+00, -7.35540319e-01, ...,
         5.65009045e-02,  9.12000000e+02,  0.00000000e+00],
       [ 2.47010003e-01,  2.38313244e+00,  3.43287073e-01, ...,
        -5.94502519e-03,  1.00000000e+00,  0.00000000e+00],
       ...,
       [ 1.88313689e+00, -2.67439566e-01,  1.05697245e+00, ...,
        -1.50290873e-01,  1.31100000e+02,  0.00000000e+00],
       [ 8.32325026e-02, -7.97911888e-01,  5.64317706e-01, ...,
         7.94200575e-02,  4.49000000e+00,  0.00000000e+00],
       [ 1.43060545e+00,  6.27951207e-01,  3.17724805e-01, ...,
        -2.35628647e-02,  1.49000000e+01,  0.00000000e+00]])

In [6]:
df.Class.values

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

<h3>Checking the Count of Fraudulent and Non-Fraudulent Transactions</h3>

In [7]:
#pd.DataFrame(df['Class'].value_counts())

<h3>Checking the Count and Percentage of Fraudulent and Non-Fraudulent Transactions</h3>

In [8]:
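def solution_1():
    # Count how many rows fall in each class (0 = not fraud, 1 = fraud)
    occ = df['Class'].value_counts()
    # Convert the counts to fractions of the whole dataset
    ratio_cases = occ / len(df.index)
    print(f'Fraudulent Cases: {ratio_cases[1]}\nNon-fraudulent cases: {ratio_cases[0]}')
    # Return the fraud fraction as a float between 0 and 1
    return ratio_cases[1]

fraud_ratio = solution_1()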

Fraudulent Cases: 0.016410823768035772
Non-fraudulent cases: 0.9835891762319642
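
As a cross-check on the calculation above, the same share can be obtained with `value_counts(normalize=True)` or, for a 0/1 target, with the column mean. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `fraud_data.csv`), so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for fraud_data.csv: 2 fraud rows out of 10
toy = pd.DataFrame({'Class': [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]})

# normalize=True returns class shares instead of raw counts
shares = toy['Class'].value_counts(normalize=True)

# For a 0/1 target, the mean equals the share of ones
fraud_share = toy['Class'].mean()

print(shares.loc[1], fraud_share)
```

Both routes should agree; the mean is the shorter option when only the positive-class share is needed.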


In [9]:
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,1.176563,0.323798,0.536927,1.047002,-0.368652,-0.728586,0.084678,-0.069246,-0.266389,0.155315,...,-0.137258,-0.109627,-0.341365,0.057845,0.49918,0.415211,-0.581949,0.015472,0.018065,4.67
1,0.681109,-3.934776,-3.801827,-1.147468,-0.73554,-0.501097,1.038865,-0.626979,-2.274423,1.527782,...,1.341809,0.652202,0.272684,-0.982151,0.1659,0.360251,0.195321,-0.256273,0.056501,912.0
2,1.140729,0.453484,0.24701,2.383132,0.343287,0.432804,0.09338,0.17331,-0.808999,0.775436,...,-0.232185,-0.003802,0.058556,-0.121177,-0.304215,0.645893,0.1226,-0.012115,-0.005945,1.0
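
`train_test_split` holds out 25% of the rows by default. With only about 1.6% of rows labelled as fraud, the fraud rate in each partition can drift slightly from the overall rate. One option, which this assignment does not use, is to pass `stratify=y` so that both partitions keep the same class proportions. The sketch below illustrates this on made-up arrays (`X_toy`, `y_toy` are hypothetical names, not part of the assignment).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 8 of them positive
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 8 + [0] * 92)

# Default split: 75% train, 25% test, class mix left to chance
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# Stratified split: each partition keeps the 8% positive rate
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy
)

print(len(X_te), ys_tr.sum(), ys_te.sum())
```

Switching to a stratified split would change the rows in `X_test`, so the results reported in this notebook would no longer be reproduced exactly.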


### Question 2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*

In [19]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score, accuracy_score

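def solution_2():
    # Baseline that always predicts the most frequent class in the training data
    d = DummyClassifier(strategy='most_frequent')
    d.fit(X_train, y_train)

    # Predict on the held-out test set
    y_p = d.predict(X_test)

    # scikit-learn metrics expect (y_true, y_pred), in that order
    return (accuracy_score(y_test, y_p), recall_score(y_test, y_p))

solution_2()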


(0.9852507374631269, 0.0)
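
To interpret this result: a majority-class predictor labels every row as non-fraud, so its accuracy is simply the share of non-fraud rows in the test set, and its recall is zero because it never catches a fraud case. The arithmetic can be sketched as follows, assuming the test-set class counts are the row sums of the confusion matrix reported for Question 4 (5320 + 24 non-fraud rows, 14 + 66 fraud rows).

```python
# Test-set class counts, read off the row sums of the Question 4 confusion matrix
n_negative = 5320 + 24   # non-fraud rows in y_test
n_positive = 14 + 66     # fraud rows in y_test

# A majority-class predictor gets every negative right and every positive wrong
accuracy = n_negative / (n_negative + n_positive)
recall = 0 / n_positive

print(accuracy, recall)
```

The high accuracy therefore says nothing about fraud detection; it only reflects the class imbalance.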

### Question 3

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [11]:
from sklearn.metrics import recall_score, precision_score
from sklearn.svm import SVC

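def solution_3():
    # Support vector classifier with default parameters
    svc = SVC()
    svc.fit(X_train, y_train)
    # Class predictions for the held-out test set
    y_pred = svc.predict(X_test)
    # score() returns mean accuracy; recall and precision take (y_true, y_pred)
    return (svc.score(X_test, y_test), recall_score(y_test, y_pred), precision_score(y_test, y_pred))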

pd.DataFrame(solution_3())

Unnamed: 0,0
0,0.990044
1,0.35
2,0.933333
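
One way to interpret these three numbers is to convert them back into counts. The sketch below is a back-calculation rather than output from the model: it assumes the test set holds 5424 rows, 80 of them fraud (the totals implied by the Question 4 confusion matrix), and rounds to the nearest whole row.

```python
# Test-set size and fraud count, from the Question 4 confusion matrix totals
n_total = 5320 + 24 + 14 + 66
n_fraud = 14 + 66

# Question 3 output: accuracy, recall, precision
accuracy, recall, precision = 0.990044, 0.35, 0.933333

tp = round(recall * n_fraud)        # fraud rows caught
fn = n_fraud - tp                   # fraud rows missed
fp = round(tp / precision) - tp     # non-fraud rows flagged as fraud
tn = n_total - tp - fn - fp         # non-fraud rows left alone

print(tp, fn, fp, tn)
```

Under these assumptions the default SVC is rarely wrong when it flags a transaction, but it misses roughly two thirds of the fraud cases, which accuracy alone hides.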


### Question 4

Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use `X_test` and `y_test`.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*

In [12]:
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

def solution_4():
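    # SVC with the parameters given in the question
    svc = SVC(C=1e9, gamma=1e-07)
    svc.fit(X_train, y_train)

    # Decision-function score for each test row
    scores = svc.decision_function(X_test)

    # Flag a row as fraud when its score exceeds -220 instead of the usual 0
    y_pred_with_threshold = scores > -220
    # Rows are actual classes, columns are predicted classes
    confusion = confusion_matrix(y_test, y_pred_with_threshold)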
    
    return confusion
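
For a binary SVC, `predict` corresponds to cutting the decision function at 0, so moving the cut-off down to -220 flags more rows as fraud. A minimal sketch of that effect, using made-up scores and labels (`scores_toy`, `y_toy`, and `positives_at` are hypothetical and unrelated to `fraud_data.csv`):

```python
import numpy as np

# Hypothetical decision-function scores and true labels
scores_toy = np.array([-500.0, -300.0, -150.0, -50.0, 10.0, 80.0])
y_toy = np.array([0, 0, 1, 0, 1, 1])

def positives_at(threshold):
    """Flag a row as fraud when its score exceeds the threshold."""
    y_pred = scores_toy > threshold
    tp = int(np.sum(y_pred & (y_toy == 1)))   # fraud rows caught
    fp = int(np.sum(y_pred & (y_toy == 0)))   # non-fraud rows flagged
    return tp, fp

print(positives_at(0))      # cut-off at 0
print(positives_at(-220))   # lower cut-off: more rows flagged
```

Lowering the threshold can only add flagged rows, so true positives and false positives both stay the same or increase.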



In [13]:
pd.DataFrame(solution_4())

Unnamed: 0,0,1
0,5320,24
1,14,66
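
To interpret this matrix, the usual metrics can be recomputed from its four cells. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes), so the cells unpack as TN, FP, FN, TP.

```python
import numpy as np

# Confusion matrix from Question 4 (rows = actual class, columns = predicted class)
confusion = np.array([[5320, 24],
                      [14, 66]])

# Row-major order gives TN, FP, FN, TP for a binary problem
tn, fp, fn, tp = confusion.ravel()

recall = tp / (tp + fn)                 # share of fraud rows caught
precision = tp / (tp + fp)              # share of flagged rows that are fraud
accuracy = (tp + tn) / confusion.sum()  # share of all rows classified correctly

print(f'recall={recall:.3f} precision={precision:.3f} accuracy={accuracy:.4f}')
```

Compared with the default SVC in Question 3 (recall 0.35, precision about 0.93), this configuration catches far more fraud at the cost of more false alarms.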
