# Assignment 3 - Evaluation

In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. 
 
The target is stored in the `Class` column, where a value of 1 marks a fraudulent transaction and 0 marks a non-fraudulent one.

In [3]:
import numpy as np
import pandas as pd

### Question 1
Import the data from `fraud_data.csv`. What fraction of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 

In [None]:
def answer_one():
    fr_data = pd.read_csv('fraud_data.csv')

    # Class is 0/1, so its mean is the share of fraud rows among all rows
    return fr_data['Class'].mean()
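
As a quick check on the arithmetic for Question 1: the fraction of fraud is the number of fraud rows divided by the total number of rows, which for a 0/1 label is the same as the mean. The sketch below uses a made-up label list (an assumption for illustration, not the assignment data) to show how that differs from the ratio of fraud rows to non-fraud rows.

```python
# Hypothetical labels for illustration only: 2 fraud cases out of 10 rows
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

n_fraud = sum(labels)
n_total = len(labels)

# Fraction of all observations that are fraud (what the question asks for)
fraud_fraction = n_fraud / n_total

# Ratio of fraud rows to non-fraud rows: a different quantity, larger
# than the fraction whenever at least one fraud case is present
fraud_to_nonfraud_ratio = n_fraud / (n_total - n_fraud)

print(fraud_fraction)           # 0.2
print(fraud_to_nonfraud_ratio)  # 0.25
```

On a heavily imbalanced dataset the two numbers are close, so the difference is easy to miss.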


In [4]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')
df.head()
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [11]:
X_test.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
16629,0.866892,-0.899856,0.97169,0.131152,-0.931563,0.906297,-0.806258,0.485835,1.017235,-0.405591,...,0.055604,-0.066068,-0.177038,0.038159,-0.211134,-0.10259,0.974237,-0.029268,0.015414,119.21
19225,-2.443959,-3.320793,1.038459,-0.593688,-3.925352,2.008374,2.053072,-0.378888,-2.196404,0.585541,...,0.20656,-0.28598,-0.172471,0.600713,0.056783,0.298285,-0.050735,0.385018,0.031201,992.42
8754,2.066008,0.212734,-1.676676,0.407158,0.510979,-0.790015,0.228,-0.25996,0.363376,-0.411215,...,-0.109995,-0.347788,-0.847449,0.339762,0.531585,-0.247659,0.172542,-0.053897,-0.029045,1.29
1524,2.071216,-0.956856,-0.889735,-0.524264,-0.776255,-0.324803,-0.731801,-0.041067,-0.342191,1.021587,...,-0.540049,-0.330418,-0.440544,0.205605,-0.502165,-0.365702,0.566753,-0.044129,-0.061905,39.8
4330,-1.440668,0.04171,0.610007,-2.820097,-1.664394,-0.877558,-0.884012,0.786296,-1.961644,0.528843,...,-0.308033,-0.319893,-0.610358,0.215085,-0.015094,-0.377717,-0.573045,0.309037,0.067352,42.46


### Question 2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*

In [40]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import recall_score

    dummy_majority=DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    y_pred=dummy_majority.predict(X_test)
    recall=recall_score(y_test, y_pred)
    accur=dummy_majority.score(X_test, y_test)
    
    return (accur, recall)
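
To see why a majority-class baseline can look strong on accuracy while being useless for fraud detection, here is a minimal sketch with hand-made labels (hypothetical, not taken from `fraud_data.csv`). It computes both metrics directly from their definitions rather than through scikit-learn.

```python
# Hypothetical test labels for illustration only: 1 fraud case in 20 rows
y_true = [0] * 19 + [1]

# A majority-class baseline predicts 0 (not fraud) for every row
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(y_true)

accuracy = correct / len(y_true)  # share of rows labeled correctly
recall = true_pos / actual_pos    # share of fraud cases that were caught

print(accuracy)  # 0.95
print(recall)    # 0.0
```

The baseline never predicts the positive class, so its recall is zero no matter how imbalanced the data is.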

### Question 3

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [42]:
def answer_three():
    from sklearn.metrics import recall_score, precision_score
    from sklearn.svm import SVC

    svc=SVC().fit(X_train, y_train)
    y_pred=svc.predict(X_test)
    recall=recall_score(y_test, y_pred)
    accur=svc.score(X_test, y_test)
    prec=precision_score(y_test, y_pred)
   
    return (accur, recall, prec)

### Question 4

Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use `X_test` and `y_test`.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*

In [64]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    svc = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)

    # Label a transaction as fraud (1) when its decision score exceeds the threshold
    y_scores = svc.decision_function(X_test)
    y_pred = (y_scores > -220).astype(int)

    return confusion_matrix(y_test, y_pred)
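
The effect of moving the threshold can be sketched by hand. The scores and labels below are invented for illustration (they are not outputs of the SVC above), and `confusion_at` is a hypothetical helper that lays the counts out in the same `[[TN, FP], [FN, TP]]` arrangement that `confusion_matrix` uses for 0/1 labels.

```python
# Hypothetical decision scores and true labels, for illustration only
scores = [-300.0, -250.0, -210.0, -5.0, 12.0, -230.0]
y_true = [0, 0, 0, 1, 1, 1]

def confusion_at(threshold):
    """Return [[TN, FP], [FN, TP]] when scores above threshold are labeled 1."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return [[tn, fp], [fn, tp]]

print(confusion_at(0))     # [[3, 0], [2, 1]]
print(confusion_at(-220))  # [[2, 1], [1, 2]]
```

In this toy data, lowering the threshold from 0 to -220 catches one more fraud case at the cost of one additional false positive.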

### Question 5

Train a logistic regression classifier with default parameters using `X_train` and `y_train`.

For the logistic regression classifier, create a precision-recall curve and a ROC curve using `y_test` and the probability estimates for `X_test` (the probability that a transaction is fraud).

Looking at the precision-recall curve, what is the recall when the precision is `0.75`?

Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*

In [93]:
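def answer_five():
    from sklearn.metrics import roc_curve
    from sklearn.metrics import precision_recall_curve
    from sklearn.linear_model import LogisticRegression

    lr = LogisticRegression().fit(X_train, y_train)
    # Both curves depend only on how the scores rank the samples, so the
    # decision function is used here in place of the fraud probability
    y_scores = lr.decision_function(X_test)

    precision, recall, _ = precision_recall_curve(y_test, y_scores)
    # Index of the curve point whose precision is closest to 0.75
    p075 = np.argmin(abs(precision - 0.75))

    fpr, tpr, _ = roc_curve(y_test, y_scores)
    # Index of the curve point whose false positive rate is closest to 0.16
    fpr016 = np.argmin(abs(fpr - 0.16))

    return (recall[p075], tpr[fpr016])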
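
Reading a value off a curve this way is a nearest-point lookup: the curve is only defined at a finite set of thresholds, so the answer comes from the point closest to the target rather than from the target itself. A minimal sketch of the idea, using made-up ROC points (an assumption for illustration, not the curve from the classifier above):

```python
# Hypothetical ROC curve points, for illustration only
fpr = [0.0, 0.05, 0.15, 0.30, 1.0]
tpr = [0.0, 0.60, 0.85, 0.95, 1.0]

target_fpr = 0.16

# Index of the curve point whose FPR is closest to the target
nearest = min(range(len(fpr)), key=lambda i: abs(fpr[i] - target_fpr))

print(fpr[nearest], tpr[nearest])  # 0.15 0.85
```

The reported true positive rate belongs to the point at FPR 0.15, not exactly 0.16, so the result is an approximation whose quality depends on how densely the curve is sampled near the target.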

### Question 6

Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and 3-fold cross validation (the default when this assignment was written; scikit-learn 0.22 and later default to 5-fold).

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

From `.cv_results_`, create an array of the mean test scores of each parameter combination. i.e.

|      	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|
| **`100`**   	|    ?	|   ? 	|

<br>

*This function should return a 5 by 2 numpy array with 10 floats.* 

*Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array.*

In [8]:
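def answer_six():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    param = {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}
    # liblinear supports both the l1 and l2 penalties; cv=3 pins the 3-fold
    # split regardless of the installed scikit-learn version's default
    lr = LogisticRegression(solver='liblinear')
    grid_clf = GridSearchCV(lr, param_grid=param, scoring='recall', cv=3)
    grid_clf = grid_clf.fit(X_train, y_train)

    # Combinations are ordered with C as the outer loop and penalty as the
    # inner loop, so each row is one C value and each column one penalty
    return grid_clf.cv_results_['mean_test_score'].reshape(5, 2)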

In [9]:
answer_six()

array([[ 0.66666667,  0.76086957],
       [ 0.80072464,  0.80434783],
       [ 0.8115942 ,  0.8115942 ],
       [ 0.80797101,  0.8115942 ],
       [ 0.80797101,  0.80797101]])
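
The `reshape(5, 2)` only matches the table in the question if the flat list of scores is ordered with `C` changing slowly and `penalty` changing quickly. The sketch below mimics that enumeration with `itertools.product`, assuming the grid keys are visited in sorted order with the last key varying fastest (my understanding of how scikit-learn enumerates a parameter grid; `cv_results_['params']` is the authoritative record of the actual order).

```python
from itertools import product

# Same grid as in the question
param = {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}

# Assumed enumeration order: keys sorted by name, last key varying fastest
keys = sorted(param)
combos = [dict(zip(keys, values))
          for values in product(*(param[k] for k in keys))]

# Group the flat list into rows of 2, as reshape(5, 2) does with the scores
rows = [combos[i:i + 2] for i in range(0, len(combos), 2)]
for row in rows:
    print([(c['C'], c['penalty']) for c in row])
```

Each printed row holds a single `C` value with `l1` first and `l2` second, which is the layout of the table above.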

In [10]:
# Use the following function to help visualize results from the grid search
def GridSearch_Heatmap(scores):
    %matplotlib notebook
    import seaborn as sns
    import matplotlib.pyplot as plt
    plt.figure()
    sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])
    plt.yticks(rotation=0);

#GridSearch_Heatmap(answer_six())