---

_You are currently looking at **version 0.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the Jupyter Notebook FAQ course resource._

---

In [1]:
import numpy as np
import pandas as pd

### Question 1
Import the data from `assets/fraud_data.csv`. What percentage of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 

In [2]:
def answer_one():
    # Import the data
    data = pd.read_csv('assets/fraud_data.csv')
    
    # Calculate the percentage of fraud observations
    fraud_percentage = data['Class'].mean()
    
    return fraud_percentage

# Test the function
answer_one()


0.016410823768035772

In [3]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('assets/fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

### Question 2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*

In [4]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, recall_score

    # Train a dummy classifier
    dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    
    # Calculate accuracy and recall scores
    accuracy = accuracy_score(y_test, dummy_majority.predict(X_test))
    recall = recall_score(y_test, dummy_majority.predict(X_test))
    
    return accuracy, recall

# Test the function
answer_two()


(0.9852507374631269, 0.0)
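
The gap between these two numbers is the point of the question: on imbalanced data, a classifier that never predicts fraud is still right most of the time. A minimal NumPy sketch with made-up labels (985 legitimate, 15 fraud; these counts are illustrative assumptions, not taken from `fraud_data.csv`) shows the same pattern:

```python
import numpy as np

# Hypothetical labels: 985 negatives and 15 positives (illustrative only)
y_true = np.array([0] * 985 + [1] * 15)

# A majority-class "classifier" predicts 0 for every observation
y_pred = np.zeros_like(y_true)

# Accuracy: share of predictions that match the true label
accuracy = np.mean(y_pred == y_true)

# Recall: share of actual positives that were predicted positive
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(accuracy, recall)
```

Accuracy here equals the share of the majority class, while recall is zero because no positive is ever predicted.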

### Question 3

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [5]:
def answer_three():
    from sklearn.metrics import accuracy_score, recall_score, precision_score
    from sklearn.svm import SVC
    
    # Train an SVC classifier with default parameters
    svc = SVC().fit(X_train, y_train)
    
    # Calculate accuracy, recall, and precision scores
    accuracy = accuracy_score(y_test, svc.predict(X_test))
    recall = recall_score(y_test, svc.predict(X_test))
    precision = precision_score(y_test, svc.predict(X_test))
    
    return accuracy, recall, precision

# Test the function
answer_three()


(0.9900442477876106, 0.35, 0.9333333333333333)

### Question 4

Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use `X_test` and `y_test`.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*

In [6]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    
    # Train an SVC classifier with specified parameters
    svc = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)
    
    # Calculate decision function and predict labels based on the threshold
    y_decision = svc.decision_function(X_test) > -220
    y_pred = y_decision.astype(int)
    
    # Calculate the confusion matrix
    confusion = confusion_matrix(y_test, y_pred)
    
    return confusion

# Test the function
answer_four()


array([[5320,   24],
       [  14,   66]], dtype=int64)
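
To read this matrix, recall that scikit-learn puts true classes in rows and predicted classes in columns, so the entries are `[[TN, FP], [FN, TP]]`. The sketch below recomputes recall, precision, and accuracy from the four counts shown above; it only does arithmetic on that output and does not retrain the model.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[5320, 24],
               [14, 66]])

# ravel() flattens row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)          # 66 / 80
precision = tp / (tp + fp)       # 66 / 90
accuracy = (tp + tn) / cm.sum()  # 5386 / 5424

print(round(recall, 3), round(precision, 3), round(accuracy, 3))
# → 0.825 0.733 0.993
```

Compared with the default-threshold SVC in Question 3, this lower threshold trades some precision for much higher recall.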

### Question 5

Train a logistic regression classifier with default parameters using `X_train` and `y_train`.

For the logistic regression classifier, create a precision-recall curve and a ROC curve using `y_test` and the probability estimates for `X_test` (the probability that an observation is fraud).

Looking at the precision-recall curve, what is the recall when the precision is `0.75`?

Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*

In [13]:
def answer_five():
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, roc_curve

    # Train a logistic regression classifier with default parameters
    log_reg = LogisticRegression().fit(X_train, y_train)

    # Probability estimates for the positive (fraud) class
    y_proba = log_reg.predict_proba(X_test)[:, 1]

    # Precision-recall curve: recall at the first point where precision reaches 0.75
    precision, recall, _ = precision_recall_curve(y_test, y_proba)
    recall_at_precision = recall[np.argmax(precision >= 0.75)]

    # ROC curve: true positive rate at the point whose false positive rate is closest to 0.16
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    tpr_at_fpr = tpr[np.argmin(np.abs(fpr - 0.16))]

    return recall_at_precision, tpr_at_fpr

# Test the function
answer_five()
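
Picking the nearest curve point works well when the curve has many points near the target value. An alternative is to interpolate linearly between the two neighbouring points. The sketch below compares both lookups on a small set of hypothetical ROC points (the `fpr` and `tpr` arrays are made-up values for illustration, not output from the model above):

```python
import numpy as np

# Hypothetical ROC points (illustrative values only).
# np.interp expects increasing x-values; roc_curve returns fpr in non-decreasing order.
fpr = np.array([0.0, 0.1, 0.2, 1.0])
tpr = np.array([0.0, 0.6, 0.9, 1.0])

target_fpr = 0.16

# Option 1: take the curve point whose fpr is closest to the target
nearest_tpr = tpr[np.argmin(np.abs(fpr - target_fpr))]

# Option 2: interpolate linearly between the two neighbouring points
interp_tpr = np.interp(target_fpr, fpr, tpr)

print(nearest_tpr, round(float(interp_tpr), 2))
# → 0.9 0.78
```

With sparse points the two lookups can differ noticeably, so it is worth checking how dense the curve is around the value being read off.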

### Question 6

Perform a grid search over the parameters listed below for a logistic regression classifier, using recall for scoring and 3-fold cross validation (the default in older scikit-learn releases; pass `cv=3` explicitly in newer ones). We suggest using `solver='liblinear'`; see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for more explanation.

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10]`

From `.cv_results_`, create an array of the mean test scores of each parameter combination. i.e.

|      	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|

<br>

*This function should return a 4 by 2 numpy array with 8 floats.* 

*Note: do not return a DataFrame, just the values denoted by `?` in a numpy array.*

In [14]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

def answer_six():
    param_grid = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10]}
    log_reg = LogisticRegression(solver='liblinear')
    grid_search = GridSearchCV(log_reg, param_grid, cv=3, scoring='recall')
    grid_search.fit(X_train, y_train)
    
    # cv_results_ lists candidates with C varying slowest and penalty fastest,
    # so reshaping to (4, 2) gives one row per C value and one column per penalty
    mean_test_scores = grid_search.cv_results_['mean_test_score'].reshape(4, 2)
    
    return mean_test_scores
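
The layout of the returned array depends on the order in which the grid search enumerates parameter combinations. For a dictionary grid, `GridSearchCV` iterates over a `ParameterGrid`, which sorts parameter names alphabetically and varies the last one fastest. The short sketch below prints that order for the grid used here, so the row and column meaning of a `(4, 2)` reshape can be checked directly:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10]}

# Parameter names are sorted, so 'C' varies slowest and 'penalty' fastest
combos = list(ParameterGrid(param_grid))

for i, params in enumerate(combos):
    print(i, params)
```

Because consecutive pairs share the same `C`, reshaping the eight scores to `(4, 2)` yields rows indexed by `C` and columns indexed by penalty, matching the table in the question.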

In [15]:
# Use the following function to help visualize results from the grid search
def GridSearch_Heatmap(scores):
    %matplotlib notebook
    import seaborn as sns
    import matplotlib.pyplot as plt
    plt.figure()
    sns.heatmap(scores.reshape(4,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10])
    plt.yticks(rotation=0);

#GridSearch_Heatmap(answer_six())