# Logistic Regression for Credit Card Fraud Detection (10 pts)

Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. 
 
The target is stored in the `class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

### Loading the data (1 pts)
Load the data from `fraud_data.csv`.

In [4]:
import numpy as np
import pandas as pd

In [5]:
import numpy as np
import pandas as pd

data = pd.read_csv("fraud_data.csv")

## Print the percentage of fraud observations

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # Your code here
print("Percentage of fraud observation ", (list(y).count(1)/len(y))*100)

Percentage of fraud observation  1.6410823768035772


**Question:** What percentage of the observations in the dataset are instances of fraud? 1.64

### Predictions using the majority class label (4pts)

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? (Here accuracy is the ratio of the number of correctly classified transactions to the total number of transactions)

In [6]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
    
## Instantiate and fit a dummy classifier that always predict class label by the majority class of the training data
## Use DummyClassifier in sklearn with strategy 'most_frequent
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

dummy_test_pred = dummy.predict(X_test)

## Measure test accuracy of your dummy classifier
dummy_test_acc = accuracy_score(y_test, dummy_test_pred)


print('Dummy classifier accuray:', dummy_test_acc)

Dummy classifier accuray: 0.9852507374631269


**Question:** *How does the accuracy of the dummy classifier look (very low, low, high, very high)? 

It is very high. This is due the fact that we have unbalanced data (frauds are less than 2%), therefore when we predict that every instance will be non fraudulent (98.5 %). So we are only mistaking in the 1.64%  that is fraudulent

**Question:** *How many fraudulent transactions are correctly classified? Zero

In [7]:
from sklearn.metrics import recall_score

## Measure test recall score of your dummy classifier
dummy_test_recall = recall_score(y_test, dummy_test_pred)

print('Dummy classifier recall:', dummy_test_recall)

Dummy classifier recall: 0.0


**Question:** *How does the recall of the dummy classifier look (very low, low, high, very high)? 
It is very low, because we are not classifing any of the flaudulent transactions correctly 

### Training a logistic regression model (3pts)

Train a logisitic regression classifier with default parameters using X_train and y_train.

In [79]:
from sklearn.linear_model import LogisticRegression
    
## Instantiate a logistic regression model and fit to the training data
logR = LogisticRegression()
logR.fit(X_train, y_train)
logR_test_pred = logR.predict(X_test)


## Measure test accuracy 
logR_test_acc = accuracy_score(y_test, logR_test_pred)

print('Logistic classifier accuray:', logR_test_acc)

## Measure test recall
logR_test_recall = recall_score(y_test, logR_test_pred)

print('Logistic classifier recall:', logR_test_recall)

Logistic classifier accuray: 0.9964970501474927
Logistic classifier recall: 0.7875


**Question:** *Compare the results of logistic regression with those of the above dummy classifier*

The results of the logistic regresion are much better than the ones from the dummy clasifier in both metrics but specially in the recall measure because now we are able to identify fraudulent transactions

### Grid search for selecting hyperparameters for Logistic Regression (2pts)

Perform a grid search over the parameters listed below for a Logisitic Regression classifier, using recall for scoring and the default 3-fold cross validation.

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

In [81]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
from sklearn.model_selection import GridSearchCV

## Define the grid of logistic regression parameters
parameters = {"penalty":["l1","l2"],"C": [0.01, 0.1, 1, 10, 100]}
model = LogisticRegression()
    
## Perform grid search CV to find best model parameter setting
cmodel = GridSearchCV(model, param_grid=parameters, cv=3)
cmodel.fit(X_train, y_train)

best_C =cmodel.best_estimator_.C
best_penalty = cmodel.best_estimator_.penalty

## Fit logistic regression with best parameters to the entire training data
model = LogisticRegression(C=best_C, penalty=best_penalty)
model.fit(X_train, y_train)
    
logR_test_pred = model.predict(X_test)

## Measure test accuracy
logR_test_acc = accuracy_score(y_test, logR_test_pred)

print('Logistic classifier accuray:', logR_test_acc)

## Measure test recall
logR_test_recall = recall_score(y_test, logR_test_pred)
print('Logistic classifier recall:', logR_test_recall)

Logistic classifier accuray: 0.9963126843657817
Logistic classifier recall: 0.775


**Question:** *Compare the results with that of logistic regression with default parameters*

Accuracy is slightly worst when we do grid search. I belive this happens because the range of parameter is not sufficiently large to include the optimal solution. 