---

_You are currently looking at **version 1.2** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

# Assignment 3 - Evaluation

In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. 
 
The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

In [1]:
import numpy as np
import pandas as pd

### Question 1
Import the data from `fraud_data.csv`. What fraction of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 

In [7]:
def answer_one():
    a1 = pd.read_csv('fraud_data.csv')
    classes = a1['Class']
    # Count the rows in each class: index 0 is not fraud, index 1 is fraud
    a = classes.groupby(classes).count().values
    # Fraud fraction = fraud count / total count
    return a[1] / (a[1] + a[0])

answer_one()

0.016410823768035772
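
Because the labels are 0 and 1, the fraction of fraud cases is also just the mean of the label column. The sketch below shows this on a small, made-up label array (hypothetical values, not taken from `fraud_data.csv`); with a pandas Series of 0/1 labels, `.mean()` gives the same quantity.

```python
import numpy as np

# Hypothetical labels for illustration only: 3 fraud cases out of 10 transactions
labels = np.array([0, 0, 1, 0, 0, 0, 1, 0, 1, 0])

# The mean of a 0/1 array equals (number of ones) / (number of elements)
fraud_fraction = labels.mean()

# Same quantity computed from explicit counts, as answer_one() does
counts = np.bincount(labels)          # counts[0] = not fraud, counts[1] = fraud
fraction_from_counts = counts[1] / counts.sum()

print(fraud_fraction, fraction_from_counts)
```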

In [8]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

### Question 2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*

In [14]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import recall_score, accuracy_score

    # Always predict the most frequent class in the training labels
    dc = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    y_dummy_pred = dc.predict(X_test)
    acc = accuracy_score(y_test, y_dummy_pred)
    rec = recall_score(y_test, y_dummy_pred)
    return acc, rec

answer_two()

(0.98525073746312686, 0.0)
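
The high accuracy and zero recall follow directly from the class imbalance. As a rough check, the arithmetic can be reproduced by hand from the test-set class counts implied by the rows of the Question 4 confusion matrix (5320 + 24 non-fraud rows and 14 + 66 fraud rows). This is only a sketch of the calculation, not part of the graded answer.

```python
# Test-set class counts, taken from the rows of the Question 4 confusion matrix
n_negative = 5320 + 24   # true non-fraud rows
n_positive = 14 + 66     # true fraud rows

# A majority-class classifier predicts "not fraud" for every row,
# so every non-fraud row is correct and every fraud row is missed
true_negatives = n_negative
false_negatives = n_positive
true_positives = 0

accuracy = true_negatives / (n_negative + n_positive)
recall = true_positives / (true_positives + false_negatives)

print(accuracy, recall)
```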

### Question 3

Using X_train, X_test, y_train, y_test (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [20]:
def answer_three():
    from sklearn.metrics import recall_score, precision_score, accuracy_score
    from sklearn.svm import SVC

    # Fit an SVC with default parameters and score its test-set predictions
    svc1 = SVC().fit(X_train, y_train)
    y_pred = svc1.predict(X_test)
    acc_svc = accuracy_score(y_test, y_pred)
    rec_svc = recall_score(y_test, y_pred)
    prec_svc = precision_score(y_test, y_pred)
    return acc_svc, rec_svc, prec_svc

answer_three()

(0.99078171091445433, 0.375, 1.0)

### Question 4

Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use X_test and y_test.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*

In [22]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    SVM = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)
    # Label a row as fraud when its decision-function score exceeds -220
    SVM_y_predict = SVM.decision_function(X_test) > -220

    confusion = confusion_matrix(y_test, SVM_y_predict)

    return confusion

answer_four()

array([[5320,   24],
       [  14,   66]])
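
For reference, the matrix above can be unpacked into the usual metrics. scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, so the layout is `[[TN, FP], [FN, TP]]`. The following is a small sketch of that arithmetic using the printed values.

```python
# Confusion matrix from the output above: rows are true classes, columns are predicted classes
confusion = [[5320, 24],
             [14, 66]]

(tn, fp), (fn, tp) = confusion

precision = tp / (tp + fp)                    # share of predicted fraud that is real fraud
recall = tp / (tp + fn)                       # share of real fraud that was caught
accuracy = (tp + tn) / (tn + fp + fn + tp)    # share of all rows classified correctly

print(precision, recall, accuracy)
```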

### Question 5

Train a logistic regression classifier with default parameters using X_train and y_train.

For the logistic regression classifier, create a precision-recall curve and an ROC curve using y_test and the probability estimates for X_test (the probability that a transaction is fraud).

Looking at the precision-recall curve, what is the recall when the precision is `0.75`?

Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve

def answer_five():
    # Train the classifier
    lr = LogisticRegression().fit(X_train, y_train)

    # predict_proba returns one column per class: column 0 is P(not fraud), column 1 is P(fraud)
    y_scores_lr_predict_proba = lr.predict_proba(X_test)
    precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr_predict_proba[:, 1])
    # Recall values at the curve points whose precision is exactly 0.75
    Recall_answer = [rec for pre, rec in zip(precision, recall) if pre == 0.75]

    y_scores_lr_decision_func = lr.decision_function(X_test)
    # The third value returned by roc_curve is the array of decision thresholds (unused here)
    fpr_lr, tpr_lr, _ = roc_curve(y_test, y_scores_lr_decision_func)
    # True positive rates at the curve points whose false positive rate rounds to 0.16
    TPR_answer = [tpr for fpr, tpr in zip(fpr_lr, tpr_lr) if round(fpr, 2) == 0.16]

    return Recall_answer[0], TPR_answer[1]

answer_five()

(0.82499999999999996, 0.94999999999999996)
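
Reading a value off a curve with an exact comparison such as `pre == 0.75` only works when the curve happens to contain that exact float. A more tolerant alternative, sketched below, picks the curve point whose x value is closest to the target. The helper name `value_at` and the curve points are hypothetical and made up for illustration; they are not taken from the fitted model.

```python
import numpy as np

def value_at(x_values, y_values, target):
    """Return the y value paired with the x value closest to target."""
    x_values = np.asarray(x_values, dtype=float)
    y_values = np.asarray(y_values, dtype=float)
    idx = np.argmin(np.abs(x_values - target))   # first index of the nearest x value
    return float(y_values[idx])

# Made-up curve points for illustration only
precision = [0.60, 0.7499999999, 0.90, 1.00]
recall = [0.95, 0.825, 0.50, 0.00]

# Exact equality finds nothing because 0.7499999999 != 0.75
exact_matches = [rec for pre, rec in zip(precision, recall) if pre == 0.75]

# Nearest-point lookup still returns the recall paired with precision of about 0.75
recall_at_075 = value_at(precision, recall, 0.75)

print(exact_matches, recall_at_075)
```

The same helper could be applied to `fpr_lr` and `tpr_lr` for the ROC question, although the nearest point may differ slightly from the one selected by rounding.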

### Question 6

Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and the default 3-fold cross validation.

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

From `.cv_results_`, create an array of the mean test scores of each parameter combination, i.e.:

|      	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|
| **`100`**   	|    ?	|   ? 	|

<br>

*This function should return a 5 by 2 numpy array with 10 floats.* 

*Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array. You might need to reshape your raw result to meet the format we are looking for.*

In [30]:
import numpy

def answer_six():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    answer = []
    # One grid per penalty, so each search returns the five scores for C = 0.01 ... 100
    grid_values = [{'penalty': ['l2'], 'C': [0.01, 0.1, 1, 10, 100]},
                   {'penalty': ['l1'], 'C': [0.01, 0.1, 1, 10, 100]}]

    for elements in grid_values:
        clf = LogisticRegression()
        grid_clf = GridSearchCV(clf, param_grid=elements, scoring='recall')
        grid_clf = grid_clf.fit(X_train, y_train)

        mean_test_score = grid_clf.cv_results_['mean_test_score']
        answer.append(mean_test_score)

    # answer has shape (2, 5) with rows [l2, l1];
    # rotating it three times gives shape (5, 2) with columns [l1, l2]
    answer = numpy.asarray(answer)

    return numpy.rot90(answer, 3)

answer_six()

array([[ 0.66666667,  0.76086957],
       [ 0.80072464,  0.80434783],
       [ 0.8115942 ,  0.8115942 ],
       [ 0.80797101,  0.8115942 ],
       [ 0.80797101,  0.80797101]])
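
The `numpy.rot90(answer, 3)` call is what turns the two per-penalty score rows into the requested 5 by 2 layout. The sketch below uses placeholder scores (hypothetical values, chosen only so each cell is easy to trace) to show that the rotation is equivalent to stacking the `l1` scores and the `l2` scores as columns.

```python
import numpy as np

# Placeholder scores, one row per penalty in the order the loop visits them
scores = np.array([
    [0.1, 0.2, 0.3, 0.4, 0.5],   # 'l2' scores for C = 0.01, 0.1, 1, 10, 100
    [1.1, 1.2, 1.3, 1.4, 1.5],   # 'l1' scores for the same C values
])

# Three counterclockwise quarter turns: shape (2, 5) becomes (5, 2)
rotated = np.rot90(scores, 3)

# Equivalent, more explicit construction: column 0 is 'l1', column 1 is 'l2'
stacked = np.column_stack([scores[1], scores[0]])

print(rotated)
print(np.array_equal(rotated, stacked))
```

Building the array with `column_stack` states the intended column order directly, which may be easier to read than the rotation.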