# Assignment A7 1

**Description**  
This assignment lets you summarise the machine learning algorithms you have learned so far and apply your knowledge to solving a real-life security problem.  
You can use the provided links or other sources to access large data sets related to data security issues.

Choose a data set and the appropriate algorithms for solving a task in the context of the data.

Design and develop an IPython solution, and upload either the notebook or a link to it.



## Dataset Information

We chose the Credit Card Fraud Detection dataset from Kaggle, which is linked below.  
[Dataset reference](https://www.kaggle.com/mlg-ulb/creditcardfraud)

This dataset is a collection of credit card transactions, some of which are fraudulent.  
Our task is to create a model that can distinguish fraudulent transactions from legitimate ones.  
Because the dataset contains labels, we can use supervised learning models.

## Why Classification?
**Difference between Discrete and Continuous**  
Classification predicts a class, such as gender, color or object type.  
It is based on discrete values, meaning the possible outcomes form a countable set of distinct categories.  

Regression predicts a value, such as size, income or time.  
It is based on continuous values, meaning the value can be any real number within a range, so the possible outcomes are uncountable.  
A number such as π illustrates this: it lies on the continuous number line, and its decimal expansion never terminates.  

The label in this dataset takes one of two values (fraud or not fraud), so we chose classification to predict the type of each transaction. 
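
The difference between the two kinds of prediction can be seen in a small sketch. The data below is made up purely for illustration and has nothing to do with the credit card dataset; it only shows that a classifier returns one of the known labels while a regressor returns a real number.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data (invented for this example): one feature, six samples.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Discrete target: a class label, either 0 or 1.
y_class = np.array([0, 0, 0, 1, 1, 1])
# Continuous target: any real number.
y_value = np.array([1.5, 2.5, 3.5, 10.5, 11.5, 12.5])

clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
reg = DecisionTreeRegressor(random_state=0).fit(X, y_value)

class_prediction = clf.predict([[11.0]])  # one of the labels seen in training
value_prediction = reg.predict([[11.0]])  # a real-valued estimate

print(class_prediction, value_prediction)
```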

## Chosen Classifiers
* Decision Tree
* Random Forest
* Naive Bayes

## Imports

In [None]:
import numpy as np
import pandas as pd

# classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# utilities
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# from sklearn.metrics import confusion_matrix

# plotting
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

from zipfile import ZipFile

## Unzipping Zip File

In [None]:
zip_path = 'creditcardfraud.zip'
with ZipFile(zip_path, 'r') as zip_file:
    # extract all the contents of the zip file into the current directory
    zip_file.extractall()

## Data Preparation

### Loading the CSV File into a DataFrame

In [None]:
file_path = 'creditcard.csv'
df = pd.read_csv(file_path)

### Display Dataset Section

In [None]:
df.head()

### Check Dataset for Null Values

In [None]:
df.isnull().sum()

### Display Dataset Information

In [None]:
df.info()

### Calculate Descriptive Statistics for the Dataset

In [None]:
df.describe()
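
The summary statistics do not show directly how many rows belong to each class. A minimal sketch of such a check follows, using a tiny hypothetical stand-in (`toy_df`) rather than the real `df`; the same two `value_counts` calls should work on any DataFrame with a `Class` column.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real DataFrame:
# nine legitimate rows (Class 0) and one fraudulent row (Class 1).
toy_df = pd.DataFrame({
    "Amount": [12.5, 3.0, 40.0, 7.2, 19.9, 5.5, 60.0, 1.0, 8.8, 250.0],
    "Class": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Absolute number of rows per class.
class_counts = toy_df["Class"].value_counts()
# Share of rows per class (fractions that sum to 1).
class_share = toy_df["Class"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```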

### Display Correlation Matrix

In [None]:
# pairwise correlation between all columns
corr_matrix = df.corr()

# plot the matrix as a heat map
fig, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corr_matrix, annot=True, ax=ax)
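
With many columns, a full heat map can be hard to read. One possible complement is to look only at how each feature correlates with the label. The sketch below assumes a hypothetical three-column frame (`toy_df`); on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Hypothetical data: V1 rises with the label, V2 falls with it.
toy_df = pd.DataFrame({
    "V1": [1, 2, 3, 4, 5, 6],
    "V2": [6, 5, 4, 3, 2, 1],
    "Class": [0, 0, 0, 1, 1, 1],
})

# Take the 'Class' column of the correlation matrix, drop the
# trivial self-correlation, and sort from most negative to most positive.
class_corr = toy_df.corr()["Class"].drop("Class").sort_values()

print(class_corr)
```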

### Splitting Test and Train Data

In [None]:
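# hold out 20% of the rows for testing; random_state makes the split reproducible
train_data, test_data = train_test_split(df, train_size=0.8, random_state=3)

# the last column ('Class') is the label; all other columns are features
columns = list(df)

X_train = train_data[columns[:-1]]
y_train = train_data[columns[-1]]

X_test = test_data[columns[:-1]]
y_test = test_data[columns[-1]]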
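
When one class is rare, a purely random split can leave the test set with very few positive rows. As an optional variation (not what the split above does), `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. The sketch below uses synthetic arrays with hypothetical names (`X_toy`, `y_toy`) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 samples, of which 20 (2%) are positive.
rng = np.random.default_rng(3)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy preserves the 2% positive rate in both subsets.
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=3, stratify=y_toy
)

print("positives in train:", y_toy_train.sum())
print("positives in test:", y_toy_test.sum())
```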

### Separating 'Fraud' and 'Not Fraud' into DataFrames

In [None]:
not_frauds = df.loc[df['Class'] == 0]
frauds = df.loc[df['Class'] == 1]

not_frauds = not_frauds.drop(['Class'], axis=1)
frauds = frauds.drop(['Class'], axis=1)

## Decision Tree Classifier

### Initialize Classification Model

In [None]:
decisionTreeClassifier = DecisionTreeClassifier(max_depth=5)

### Train Model for Classification

In [None]:
decisionTreeClassifier.fit(X_train, y_train)

### Validate Model Accuracy

In [None]:
dtc_score = decisionTreeClassifier.score(X_test, y_test)
dtc_score
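
Accuracy should be read with care when one class is much rarer than the other. The sketch below uses made-up labels, not results from this notebook, to show that a "model" which always predicts "not fraud" can score a very high accuracy while detecting no fraud at all.

```python
# Made-up labels: 998 legitimate transactions (0) and 2 frauds (1).
y_true = [0] * 998 + [1] * 2
# A trivial predictor that ignores its input and always says "not fraud".
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the fraud class: share of actual frauds that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = true_positives / sum(t == 1 for t in y_true)

print("accuracy:", accuracy)
print("fraud recall:", fraud_recall)
```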

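### Show Model Report

In [None]:
# evaluate on the held-out test set so the report reflects unseen data;
# print() renders the report as a table instead of an escaped string
print(classification_report(y_test, decisionTreeClassifier.predict(X_test), target_names=['not fraud', 'fraud']))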
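
If individual numbers from the report are needed later, `classification_report` can also return a dictionary through its `output_dict` parameter. The following is a small sketch with invented labels, so the values are illustrative only.

```python
from sklearn.metrics import classification_report

# Invented labels for illustration: 0 = not fraud, 1 = fraud.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# output_dict=True returns nested dictionaries instead of a text table.
report = classification_report(
    y_true, y_pred, target_names=['not fraud', 'fraud'], output_dict=True
)

fraud_precision = report['fraud']['precision']
fraud_recall = report['fraud']['recall']

print("fraud precision:", fraud_precision)
print("fraud recall:", fraud_recall)
```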

### Perform Predictions

In [None]:
# iloc[[i]] keeps the row as a one-row DataFrame, so the feature names match the training data
dtc_predict_fraud = decisionTreeClassifier.predict(frauds.iloc[[0]])
dtc_predict_not_fraud = decisionTreeClassifier.predict(not_frauds.iloc[[0]])

print('predict fraud, should be [1] =>', dtc_predict_fraud)
print('predict not fraud, should be [0] =>', dtc_predict_not_fraud)

## Random Forest Classifier

### Initialize Classification Model

In [None]:
randomForestClassifier = RandomForestClassifier(n_estimators = 20, max_depth = 5)

### Train Model for Classification

In [None]:
randomForestClassifier.fit(X_train, y_train)

### Validate Model Accuracy

In [None]:
rfc_score = randomForestClassifier.score(X_test, y_test)
rfc_score

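### Show Model Report

In [None]:
# evaluate on the held-out test set so the report reflects unseen data
print(classification_report(y_test, randomForestClassifier.predict(X_test), target_names=['not fraud', 'fraud']))

### Perform Predictions

In [None]:
# iloc[[i]] keeps the row as a one-row DataFrame, so the feature names match the training data
predict_fraud = randomForestClassifier.predict(frauds.iloc[[5]])
predict_not_fraud = randomForestClassifier.predict(not_frauds.iloc[[0]])

print('predict fraud, should be [1] =>', predict_fraud)
print('predict not fraud, should be [0] =>', predict_not_fraud)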

## Naïve Bayes Classifier

### Initialize Classification Model

In [None]:
naiveBayesClassifier = GaussianNB()

### Train Model for Classification

In [None]:
naiveBayesClassifier.fit(X_train, y_train)

### Validate Model Accuracy

In [None]:
nbc_score = naiveBayesClassifier.score(X_test, y_test)
nbc_score

### Show Model Report

In [None]:
# evaluate on the held-out test set so the report reflects unseen data
print(classification_report(y_test, naiveBayesClassifier.predict(X_test), target_names=['not fraud', 'fraud']))

### Perform Predictions

In [None]:
# iloc[[i]] keeps the row as a one-row DataFrame, so the feature names match the training data
predict_fraud = naiveBayesClassifier.predict(frauds.iloc[[3]])
predict_not_fraud = naiveBayesClassifier.predict(not_frauds.iloc[[0]])

print('predict fraud, should be [1] =>', predict_fraud)
print('predict not fraud, should be [0] =>', predict_not_fraud)

## Compare Model Scores

In [None]:
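# map each classifier name to its test accuracy;
# keying by name keeps all three entries even if two scores are equal
scores = {'Decision Tree Classifier': dtc_score, 'Random Forest Classifier': rfc_score, 'Naive Bayes Classifier': nbc_score}

print("score for decision tree classifier", dtc_score)
print("score for random forest classifier", rfc_score)
print("score for naive bayes classifier", nbc_score)

# pick the name whose score is highest
print("best classifier is", max(scores, key=scores.get))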
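
A note on looking up the best model: a dictionary keyed by score silently loses an entry when two models reach the same score, while a dictionary keyed by name does not. The sketch below uses hypothetical model names and made-up scores, with a deliberate tie, to show the difference.

```python
# Hypothetical scores; "Model A" and "Model B" tie on purpose.
scores_by_name = {"Model A": 0.999, "Model B": 0.999, "Model C": 0.978}

# Keying by score: the tie collapses two models into one entry.
scores_by_value = {score: name for name, score in scores_by_name.items()}

# Keying by name: every model is kept, and max() with key= compares scores.
best_name = max(scores_by_name, key=scores_by_name.get)

print("entries when keyed by score:", len(scores_by_value))
print("best model when keyed by name:", best_name)
```

On a tie, `max` returns the first of the tied names in insertion order, so the result is still deterministic.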

### Conclusion

Based on the accuracy scores above, we conclude that the Decision Tree Classifier is the most accurate of the three classifiers for this dataset.