# Evaluate Fraud Data Detection

In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include the confidential variables `V1` through `V28` as well as `Amount`, which is the amount of the transaction.
 
The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to a non-fraudulent transaction.

In [1]:
import numpy as np
import pandas as pd

### What percentage of the observations in the dataset are instances of fraud?

In [2]:
def answer_one():
    fraud=pd.read_csv('../input/fraud-data/fraud_data.csv')
    fraud_instances=fraud[fraud['Class']==1]
    return len(fraud_instances)/len(fraud)
print('Approximately ' + str (100*round(answer_one(),2))+' % of the observations in dataset are fraud.')

Approximately 2.0 % of the observations in dataset are fraud.


### Split data into train and test datasets

In [3]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('../input/fraud-data/fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

### Training a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?


In [4]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, recall_score
    # Baseline that always predicts the most frequent class in y_train
    dummy_clf=DummyClassifier(strategy='most_frequent').fit(X_train,y_train)
    y_predicted=dummy_clf.predict(X_test)

    print('Accuracy Score: ',accuracy_score(y_test,y_predicted), '\nRecall Score: ',recall_score(y_test,y_predicted))
answer_two()

Accuracy Score:  0.9852507374631269 
Recall Score:  0.0
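
The dummy result can be reproduced by hand, which shows why accuracy alone says little on imbalanced data. The sketch below is a minimal illustration on hypothetical labels (1,000 rows with 15 fraud cases, chosen only to resemble the class balance here); it does not use `fraud_data.csv`.

```python
import numpy as np

# Hypothetical labels: 1,000 transactions, 15 of them fraud (1.5%).
y_true = np.zeros(1000, dtype=int)
y_true[:15] = 1

# A majority-class baseline predicts "not fraud" (0) for every row.
y_pred = np.zeros_like(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = np.mean(y_pred == y_true)

# Recall: share of actual fraud cases that were predicted as fraud.
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(accuracy, recall)
```

Because the baseline never predicts fraud, its accuracy equals the share of non-fraud rows while its recall is zero.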


### Training an SVC classifier using the default parameters. What is the accuracy, recall, and precision of this classifier?

In [5]:
def answer_three():
    from sklearn.metrics import recall_score, precision_score
    from sklearn.svm import SVC
    svc_clf=SVC().fit(X_train,y_train)
    y_pred_svc=svc_clf.predict(X_test)
    
    print('Accuracy Score: ',svc_clf.score(X_test,y_test),'\nRecall Score: ',recall_score(y_test,y_pred_svc),'\nPrecision Score: ',precision_score(y_test,y_pred_svc))
answer_three()

Accuracy Score:  0.9900442477876106 
Recall Score:  0.35 
Precision Score:  0.9333333333333333


### Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use `X_test` and `y_test`.

In [6]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    svc_clf=SVC(C= 1e9, gamma= 1e-07).fit(X_train,y_train)
    decision=svc_clf.decision_function(X_test)
    y_pred_thrh=np.where(decision>-220,1,0)
    confusion=confusion_matrix(y_test,y_pred_thrh)
    return confusion
answer_four()

array([[5320,   24],
       [  14,   66]])
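
As a quick check on what this matrix implies, the sketch below derives precision, recall, and accuracy from the four counts shown above. It assumes scikit-learn's usual layout (rows are true classes, columns are predicted classes, class 0 first) and hard-codes the reported counts rather than refitting the model.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class.
confusion = np.array([[5320, 24],
                      [14, 66]])

# Flattening a 2x2 matrix in row order yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion.ravel()

precision = tp / (tp + fp)              # 66 / 90
recall = tp / (tp + fn)                 # 66 / 80
accuracy = (tp + tn) / confusion.sum()  # 5386 / 5424

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

Lowering the decision threshold to -220 trades precision for recall compared with the default-parameter SVC above, whose recall was 0.35.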

### Train a logistic regression classifier.

For the logistic regression classifier, create a precision-recall curve and an ROC curve using `y_test` and the probability estimates for `X_test` (the probability that a transaction is fraud).

Looking at the precision-recall curve, what is the recall when the precision is `0.75`?

Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?

In [7]:
def answer_five():
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve,roc_curve
    import matplotlib.pyplot  as plt
    log_reg=LogisticRegression().fit(X_train,y_train)
    y_prob=log_reg.predict_proba(X_test)
    precision,recall,thresholds=precision_recall_curve(y_test,y_prob[:,1])
    plt.figure()
    plt.plot(precision,recall,label='PRC')
    plt.xlabel('precision')
    plt.ylabel('recall')
    fpr, tpr,mm=roc_curve(y_test,y_prob[:,1])
    plt.figure()
    plt.plot(fpr,tpr,label='ROC curve')
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.show()
    
    return  (0.8,0.9)
#answer_five()
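
The two values returned above are hard-coded. As an alternative to reading them off the plots, the lookup could be done numerically. The sketch below is one possible approach, not the notebook's method: `value_at` is a hypothetical helper, and the `*_demo` arrays are made-up stand-ins for the arrays returned by `precision_recall_curve` and `roc_curve`.

```python
import numpy as np

def value_at(x, y, target):
    """Return the y value at the curve point whose x is closest to target."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.argmin(np.abs(x - target))
    return y[idx]

# Hypothetical curve points (not computed from the fraud data).
precision_demo = np.array([0.02, 0.50, 0.74, 0.90, 1.00])
recall_demo = np.array([1.00, 0.90, 0.82, 0.60, 0.00])
fpr_demo = np.array([0.00, 0.05, 0.15, 0.40, 1.00])
tpr_demo = np.array([0.00, 0.70, 0.93, 0.97, 1.00])

# Recall at the point nearest precision 0.75, and TPR nearest FPR 0.16.
recall_at_p75 = value_at(precision_demo, recall_demo, 0.75)
tpr_at_fpr16 = value_at(fpr_demo, tpr_demo, 0.16)

print(recall_at_p75, tpr_at_fpr16)
```

Note that precision is not guaranteed to be monotonic along a real precision-recall curve, so several points may lie near the target; this helper simply returns the first closest one.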

### Perform a grid search over the parameters listed below for a logistic regression classifier, using recall for scoring and 3-fold cross-validation.

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

From `.cv_results_`, create an array of the mean test scores for each parameter combination, arranged as follows:

|      	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|
| **`100`**   	|    ?	|   ? 	|

<br>

In [8]:
def answer_six():    
    import warnings
    warnings.filterwarnings('ignore', category=UserWarning, append=False)
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    # liblinear supports both 'l1' and 'l2'; the default solver in recent
    # scikit-learn releases (lbfgs) does not support the 'l1' penalty
    log_reg=LogisticRegression(solver='liblinear')
    param_grid={'penalty': ['l1', 'l2'],'C':[0.01, 0.1, 1, 10, 100]}
    log_grid=GridSearchCV(log_reg,param_grid=param_grid,cv=3, scoring='recall')
    log_grid.fit(X_train,y_train)
    
    return log_grid.cv_results_['mean_test_score'].reshape(5,2)
#answer_six()
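
The `reshape(5,2)` call relies on the grid search listing its parameter combinations with `C` varying slowest and `penalty` fastest. A more defensive option is to place each score by explicit lookup. The sketch below illustrates that idea with a hypothetical helper, `scores_table`, and made-up placeholder scores standing in for `cv_results_['params']` and `cv_results_['mean_test_score']`.

```python
import numpy as np

def scores_table(params, mean_scores, c_values, penalties):
    """Arrange scores as rows = C values, columns = penalties, by lookup."""
    table = np.full((len(c_values), len(penalties)), np.nan)
    for p, score in zip(params, mean_scores):
        row = c_values.index(p['C'])
        col = penalties.index(p['penalty'])
        table[row, col] = score
    return table

c_values = [0.01, 0.1, 1, 10, 100]
penalties = ['l1', 'l2']

# Hypothetical stand-ins for the grid search results; the scores are
# placeholders, not recall values measured on the fraud data.
params = [{'C': c, 'penalty': p} for c in c_values for p in penalties]
mean_scores = np.arange(10) / 10.0

table = scores_table(params, mean_scores, c_values, penalties)
print(table.shape)
```

When the combinations arrive in the assumed order, this table is identical to `mean_scores.reshape(5, 2)`; if the order ever differed, the lookup would still place each score in the right cell.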

### Visualization of grid search on heatmap

In [9]:

%matplotlib notebook
import seaborn as sns
import matplotlib.pyplot as plt

def GridSearch_Heatmap(scores):
    # Rows are the C values, columns are the penalty types
    plt.figure()
    sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])
    plt.yticks(rotation=0)

#GridSearch_Heatmap(answer_six())