---

_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

# Assignment 3 - Evaluation

In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include the confidential variables `V1` through `V28` as well as `Amount`, the amount of the transaction.
 
The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

In [4]:
import numpy as np
import pandas as pd
#import seaborn as sns
#import matplotlib.pyplot as plt


### Question 1
Import the data from `fraud_data.csv`. What percentage of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 

In [None]:
def answer_one():
    # Load the transactions; the target column `Class` is 1 for fraud, 0 otherwise
    fraud_df = pd.read_csv('fraud_data.csv')

    # The mean of a 0/1 column is the fraction of rows equal to 1
    return fraud_df['Class'].mean()
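
As a quick sanity check of the idea behind Question 1, the sketch below computes the fraud fraction on a tiny made-up frame. The `toy_df` data is hypothetical and only stands in for `fraud_data.csv`; the column names mirror the ones described above.

```python
import pandas as pd

# Hypothetical toy frame standing in for fraud_data.csv: 1 fraud in 4 rows
toy_df = pd.DataFrame({'Amount': [12.5, 80.0, 3.2, 950.0],
                       'Class': [0, 0, 0, 1]})

# Fraction of rows labelled 1, computed two equivalent ways
fraud_fraction = toy_df['Class'].mean()
class_shares = toy_df['Class'].value_counts(normalize=True)

print(fraud_fraction)   # 0.25
print(class_shares[1])  # 0.25
```

`value_counts(normalize=True)` is handy when you also want the share of the majority class, which is the accuracy a majority-class baseline would reach.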

In [5]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

### Question 2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*

In [None]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, recall_score

    # The 'most_frequent' strategy always predicts the majority class of the
    # training labels, which is 0 (not fraud) for this dataset
    dclf = DummyClassifier(strategy='most_frequent')
    dclf.fit(X_train, y_train)

    # Predict on the held-out test set
    y_pred = dclf.predict(X_test)

    return accuracy_score(y_test, y_pred), recall_score(y_test, y_pred)
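
The sketch below illustrates why a majority-class baseline is a useful reference on imbalanced data: it can score high accuracy while finding no fraud at all. The labels are hypothetical toy values, not the assignment data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy labels: 98 legitimate transactions, 2 fraudulent ones
y_toy = np.array([0] * 98 + [1] * 2)
X_toy = np.zeros((100, 1))  # the dummy classifier ignores the features

baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

toy_accuracy = accuracy_score(y_toy, y_hat)  # 98 of 100 labels match
toy_recall = recall_score(y_toy, y_hat)      # 0 of 2 frauds are found
print(toy_accuracy, toy_recall)
```

Any real classifier should be judged against this baseline rather than against 50% accuracy.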

### Question 3

Using X_train, X_test, y_train, y_test (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [None]:
def answer_three():
    from sklearn.metrics import accuracy_score, recall_score, precision_score
    from sklearn.svm import SVC

    # SVC() with no arguments uses the default parameters (RBF kernel, C=1.0)
    svc = SVC().fit(X_train, y_train)

    # Predict class labels for the held-out test set
    y_pred = svc.predict(X_test)

    return (accuracy_score(y_test, y_pred),
            recall_score(y_test, y_pred),
            precision_score(y_test, y_pred))
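
For reference, the three metrics returned above all come from the same four confusion-matrix counts. The worked arithmetic below uses hypothetical counts chosen for round numbers; they are not results from the fraud data.

```python
# Hypothetical confusion-matrix counts for a binary fraud classifier
tp, fp, fn, tn = 30, 10, 20, 940

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are right
recall = tp / (tp + fn)                     # share of actual frauds that were caught
precision = tp / (tp + fp)                  # share of fraud alerts that were real

print(accuracy, recall, precision)  # 0.97 0.6 0.75
```

Note how accuracy stays high even though 40% of the frauds are missed, which is why recall and precision are reported alongside it.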

### Question 4

Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use X_test and y_test.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*

In [None]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    # Parameters given in the question (these are not the defaults)
    svc = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)

    # Signed score of each test sample relative to the decision boundary
    decision_scores = svc.decision_function(X_test)

    # Predict fraud (1) whenever the score exceeds the threshold of -220
    y_pred_threshold = (decision_scores > -220).astype(int)

    # Return the result directly so the imported function name is not shadowed
    return confusion_matrix(y_test, y_pred_threshold)
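
To see what moving the threshold does, the sketch below applies two thresholds to a handful of hypothetical decision scores. The scores and labels are invented for illustration; only the thresholding pattern matches the answer above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical decision-function outputs and true labels for six samples
scores = np.array([-300.0, -250.0, -200.0, -100.0, 50.0, 120.0])
y_true = np.array([0, 0, 0, 1, 0, 1])

# Default SVC behaviour corresponds to a threshold of 0
cm_default = confusion_matrix(y_true, (scores > 0).astype(int))

# A much lower threshold flags more samples as positive
cm_lowered = confusion_matrix(y_true, (scores > -220).astype(int))

print(cm_default)  # [[3 1]
                   #  [1 1]]
print(cm_lowered)  # [[2 2]
                   #  [0 2]]
```

Lowering the threshold removes the false negative at the cost of an extra false positive, which is the usual recall versus precision trade-off.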

### Question 5

Train a logistic regression classifier with default parameters using X_train and y_train.

For the logistic regression classifier, create a precision recall curve and a roc curve using y_test and the probability estimates for X_test (probability it is fraud).

Looking at the precision recall curve, what is the recall when the precision is `0.75`?

Looking at the roc curve, what is the true positive rate when the false positive rate is `0.16`?

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*

In [6]:
def answer_five():
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, roc_curve

    # max_iter is raised above its default only so that the solver converges
    logreg = LogisticRegression(max_iter=100000).fit(X_train, y_train)

    # Probability estimates for the positive (fraud) class
    y_scores = logreg.predict_proba(X_test)[:, 1]

    # Both curves must be built from scores, not from hard 0/1 predictions
    precision_scores, recall_scores, _ = precision_recall_curve(y_test, y_scores)
    fpr, tpr, _ = roc_curve(y_test, y_scores)

    # Precision is not monotonic, so take the curve point closest to 0.75
    recall = recall_scores[np.argmin(np.abs(precision_scores - 0.75))].round(2)
    # fpr is non-decreasing, so linear interpolation is valid here
    true_positive_rate = np.interp(0.16, fpr, tpr).round(2)

    return recall, true_positive_rate
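
The lookup of a single point on each curve can be checked on a small example. The sketch below uses eight hypothetical scores and labels, small enough to verify by hand; the target values 0.75 and 0.4 are chosen to suit this toy data and are not the assignment's.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

# Hypothetical labels and predicted fraud probabilities for eight samples
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

precision, recall, _ = precision_recall_curve(y_true, y_score)
fpr, tpr, _ = roc_curve(y_true, y_score)

# Precision zig-zags as the threshold moves, so pick the nearest curve point
recall_at_p = recall[np.argmin(np.abs(precision - 0.75))]

# fpr never decreases, so interpolating between ROC points is safe
tpr_at_fpr = np.interp(0.4, fpr, tpr)

print(recall_at_p, tpr_at_fpr)  # 0.75 0.75
```

With a threshold of 0.6, four samples are flagged and three of them are frauds, giving precision 3/4 and recall 3/4, which matches the first value.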

In [None]:
def answer_five_plot():
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

    #%matplotlib notebook

    # Same model as answer_five; max_iter is raised only so the solver converges
    logreg = LogisticRegression(max_iter=100000).fit(X_train, y_train)

    # Probability estimates for the positive (fraud) class
    y_scores = logreg.predict_proba(X_test)[:, 1]

    # Both curves are computed from scores, not from hard 0/1 predictions
    precision_scores, recall_scores, _ = precision_recall_curve(y_test, y_scores)
    fpr, tpr, _ = roc_curve(y_test, y_scores)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 4))

    # ROC curve, with the chance diagonal for reference
    logit_roc_auc = roc_auc_score(y_test, y_scores)
    ax1.plot(fpr, tpr, label='ROC (area = %0.2f)' % logit_roc_auc, color='violet', lw=1.5)
    ax1.plot([0, 1], [0, 1], color='blue', lw=1.5, linestyle='--')
    ax1.set_xlim([-0.01, 1])
    ax1.set_ylim([-0.01, 1])
    ax1.set_xlabel('False Positive Rate', size=10)
    ax1.set_ylabel('True Positive Rate', size=10)
    ax1.set_title('Receiver operating characteristic', size=10)
    ax1.legend(loc="lower right")

    # Precision-recall curve
    ax2.plot(recall_scores, precision_scores, color='limegreen', lw=1.5)
    ax2.set_xlabel('Recall', size=10)
    ax2.set_ylabel('Precision', size=10)
    ax2.set_title('Precision-recall curve', size=10)

    recall = recall_scores[np.argmin(np.abs(precision_scores - 0.75))].round(2)
    true_positive_rate = np.interp(0.16, fpr, tpr).round(2)

    # Mark the two requested points (ax2 has recall on x and precision on y)
    ax1.plot(0.16, true_positive_rate, ls="", marker="*", ms=7, color="crimson")
    ax2.plot(recall, 0.75, ls="", marker="*", ms=7, color="crimson")

    fig.tight_layout()

    return fig

### Question 6

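Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and 3-fold cross validation. This was the default when the assignment was written; newer scikit-learn releases default to 5-fold, so pass `cv=3` explicitly there.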

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

From `.cv_results_`, create an array of the mean test scores of each parameter combination. i.e.

|      	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|
| **`100`**   	|    ?	|   ? 	|

<br>

*This function should return a 5 by 2 numpy array with 10 floats.* 

*Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array.*

In [None]:
def answer_six():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    # liblinear supports both the l1 and l2 penalties searched below
    logreg = LogisticRegression(solver='liblinear')

    grid_params = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}

    # cv=3 is passed explicitly to match the 3-fold setting in the question
    gs_logreg = GridSearchCV(logreg, param_grid=grid_params, scoring='recall', cv=3)
    gs_logreg.fit(X_train, y_train)

    # Candidates are ordered with C varying slowest and penalty fastest,
    # so a (5, 2) reshape gives one row per C and one column per penalty
    return gs_logreg.cv_results_['mean_test_score'].reshape(5, 2)
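
The `(5, 2)` reshape above relies on the order in which the grid search enumerates candidates. The sketch below uses scikit-learn's `ParameterGrid` to show that order for the assignment's grid; the `flat_scores` array is a hypothetical stand-in for `mean_test_score`, not real results.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid_params = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}

# Keys are sorted alphabetically and the last key varies fastest
order = list(ParameterGrid(grid_params))
print(order[:3])
# [{'C': 0.01, 'penalty': 'l1'}, {'C': 0.01, 'penalty': 'l2'}, {'C': 0.1, 'penalty': 'l1'}]

# Hypothetical placeholder scores 0..9 make the resulting layout visible
flat_scores = np.arange(10)
score_grid = flat_scores.reshape(5, 2)
print(score_grid[0])  # [0 1]  -> C=0.01 with l1, then C=0.01 with l2
```

Because each consecutive pair shares one value of `C`, every row of the reshaped array corresponds to one `C` and the two columns to `l1` and `l2`, matching the table in the question.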

In [None]:
# Use the following function to help visualize results from the grid search
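def GridSearch_Heatmap(scores):
    # Imported here because the module-level plotting imports are commented out
    import matplotlib.pyplot as plt
    import seaborn as sns
    #%matplotlib notebook
    plt.figure()
    sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])
    plt.yticks(rotation=0)
    plt.show()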
#GridSearch_Heatmap(answer_six())