<a href="https://colab.research.google.com/github/Gulayrose/Fraud_Detection_Project/blob/main/Fraud_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# WELCOME!

Welcome to "***Fraud Detection Project***". This is the last project of the Capstone Series.

One challenge in this project is the absence of domain knowledge: the column names carry no meaning, so you can only work with their values. The other is that the class frequencies of the target variable are highly imbalanced.

You will implement the ***Logistic Regression, Random Forest, Neural Network*** algorithms and the ***SMOTE*** technique. You will also visualize the performance of the models in a variety of ways using ***Seaborn, Matplotlib*** and ***Yellowbrick***.

At the end of the project, you will have the opportunity to deploy your model with the ***Streamlit API***.

Before diving into the project, please take a look at the Determines and Tasks.

- ***NOTE:*** *This tutorial assumes that you already know the basics of coding in Python and are familiar with model deployment (Streamlit API) as well as the theory behind Logistic Regression, Random Forest and Neural Networks.*



---
---


# #Determines
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with **492 frauds** out of **284,807** transactions. The dataset is **highly unbalanced**: the positive class (frauds) accounts for 0.172% of all transactions.

**Feature Information:**

**Time**: The seconds elapsed between each transaction and the first transaction in the dataset.

**Amount**: The transaction amount. It can be used for example-dependent cost-sensitive learning.

**Class**: This feature is the target variable and it takes value 1 in case of fraud and 0 otherwise.

---

The aim of this project is to predict whether a credit card transaction is fraudulent. Of course, this is not easy to do.
First of all, you need to analyze and understand your data well in order to draw your roadmap and choose the right techniques. Accordingly, you can examine the frequency distributions of the variables, observe the correlations between them and explore multicollinearity, and show how the classes of the target variable are distributed over the other variables.
It is also useful to check for missing values and outliers.

After these procedures, you can move on to the model building stage by doing the basic data pre-processing you are familiar with. 

Start with Logistic Regression and evaluate model performance. You will apply the SMOTE technique used to increase the sample for unbalanced data. Next, rebuild your Logistic Regression model with SMOTE applied data to observe its effect.

Then, you will use three different algorithms in the model building phase. You have applied Logistic Regression and Random Forest in your previous projects. However, the Deep Learning Neural Network algorithm will appear for the first time.

In the final step, you will deploy your model using ***Streamlit API***. 

**Optional**: You can Dockerize your project and deploy on cloud.

---
---


# #Tasks

#### 1. Exploratory Data Analysis & Data Cleaning

- Import Modules, Load Data & Data Review
- Exploratory Data Analysis
- Data Cleaning



    
#### 2. Data Preprocessing

- Scaling
- Train - Test Split


#### 3. Model Building

- Logistic Regression without SMOTE
- Apply SMOTE
- Logistic Regression with SMOTE
- Random Forest Classifier with SMOTE
- Neural Network

#### 4. Model Deployment

- Save and Export the Model as .pkl
- Save and Export Variables as .pkl 




---
---


## 1. Exploratory Data Analysis & Data Cleaning

### Import Modules, Load Data & Data Review

In [147]:
import numpy as np 
import pandas as pd 
from pandas.plotting import register_matplotlib_converters
%matplotlib inline

import seaborn as sns
sns.set_style("darkgrid")

import matplotlib.pyplot as plt
from collections import Counter

from pylab import rcParams
plt.rcParams['figure.figsize'] = (6,6)
plt.rcParams['figure.dpi'] = 100

import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

pd.set_option('display.max_columns', None)



In [None]:
!pip install unrar
!unrar x /content/drive/MyDrive/creditcard.part1.rar

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

UNRAR 5.50 freeware      Copyright (c) 1993-2017 Alexander Roshal


Extracting from /content/drive/MyDrive/creditcard.part1.rar


Would you like to replace the existing file creditcard.csv
150828752 bytes, modified on 2021-02-15 06:26
with a new one
150828752 bytes, modified on 2021-02-15 06:26

[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit 

In [None]:
df = pd.read_csv("/content/drive/MyDrive/creditcard.csv")

The dataset contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, of which 492 out of 284,807 are fraudulent. The dataset is highly imbalanced: the positive class (fraud) makes up 0.172% of all transactions.

In [None]:
df.head()

### Exploratory Data Analysis

In [None]:
df.shape

In [None]:
df.describe().T

In [None]:
df.info()

In [None]:
df.Time.value_counts()  # the values here are in seconds

In [None]:
df.Amount.value_counts()

In [None]:
df.Class.value_counts()  # takes the value 1 in case of fraud, 0 otherwise

In [None]:
fig, ax = plt.subplots(figsize=(6,4))
df.Class.value_counts(ascending=False).plot.bar();
for p in ax.patches:
    ax.annotate((p.get_height()), (p.get_x()+0.2, p.get_height()+20),rotation=95);

In [None]:
df_class1=df[df.Class == 1]

In [None]:
df_class0=df[df.Class== 0 ]

In [None]:
df_class1.head()

In [None]:
df[df.Class == 1].describe().T.style.background_gradient(cmap='Spectral_r')

In [None]:
plt.figure(figsize = (30, 15))
sns.heatmap(round(df.corr(), 3), annot = True, cmap = 'RdYlGn', linewidth = 0.2, annot_kws = {'size' : 16});

We also built models with the features that are highly correlated with Class. However, since it is sounder to build the model by looking at the relationships of the minority class (1), we chose to do feature selection with the SHAP method below.


In [None]:
plt.figure(figsize=(15,10))
df.corr()["Class"].drop("Class").sort_values().plot.barh();

In [None]:
df_class0.head()

Fraudulent transactions are spread randomly across the 48 hours:


In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(18,8))

bins = 50

ax1.hist(df.Time[df.Class == 1], bins = bins)
ax1.set_title('Fraud')

ax2.hist(df.Time[df.Class == 0], bins = bins)
ax2.set_title('NoFraud')

plt.xlabel('Time (in Seconds)')
plt.ylabel('Number of Transactions')
plt.show()

In [None]:
ax = df.loc[df['Class'] == 1].plot.scatter(x='Amount', y='Class', color='Orange', label='Fraud')
df.loc[df['Class'] == 0].plot.scatter(x='Amount', y='Class', color='Blue', label='NoFraud', ax=ax)
plt.show()

The plots above and below show the transaction amounts. The largest fraudulent transaction is around 2,500 euros, whereas non-fraudulent transactions go up to about 25,000 euros.

In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(15,7))

bins = 30

ax1.hist(df.Amount[df.Class == 1], bins = bins)
ax1.set_title('Fraud')

ax2.hist(df.Amount[df.Class == 0], bins = bins)
ax2.set_title('NoFraud')

plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.yscale('log')
plt.show()

To see how classes 1 and 0 are distributed within the components produced by PCA, we drew the plots below. The classes separate better in components 9, 10, 11, 12, 14, 16 and especially 17 and 18:

In [None]:
import matplotlib.gridspec as gridspec


In [None]:
plt.figure(figsize=(18,28*4))
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(df.drop(['Time', 'Class', 'Amount'], axis=1)):
    ax = plt.subplot(gs[i])
    plt.hist(df[cn][df.Class == 1], bins=50, alpha = 0.7)
    plt.hist(df[cn][df.Class == 0], bins=50, alpha = 0.3)
    plt.yscale('log')
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn))
plt.show()

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.Class.value_counts()

In [None]:

import plotly.express as px

In [None]:
fig = px.pie(df, values = df['Class'].value_counts(), 
             names = (df['Class'].value_counts()).index, 
             title = '"Class" Column Distribution')
fig.show()

### Data Cleaning
Check Missing Values and Outliers

In [None]:
df.isna().sum().sum()

In [None]:
index = 0
plt.figure(figsize=(20,20))
for feature in df.columns :
    if feature != 'Class' :
        index += 1
        plt.subplot(8,4,index)
        sns.boxplot(x = 'Class', y = feature, data = df)
plt.tight_layout()
plt.show();

A small number of outliers were found in the features. However, since the features are components obtained with PCA, we have no information about what the outliers represent. Because there are so few class 1 samples, we decided to continue without removing the outliers so as not to lose data:

---
---


## 2. Data Preprocessing

#### Train - Test Split

As in this case, for extremely imbalanced datasets you may want to make sure that classes are balanced across train and test data.
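To make the idea of a stratified split concrete, here is a minimal pure-Python sketch. It is not how `train_test_split(..., stratify=y)` is implemented; the helper name `stratified_split` and the toy labels are assumptions made only for illustration.

```python
import random
from collections import Counter

def stratified_split(labels, test_size=0.10, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                          # shuffle within the class only
        n_test = round(len(indices) * test_size)      # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (assumption): 1,000 rows with 20 positives, far milder than the real data.
labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(labels)

print("train:", Counter(labels[i] for i in train_idx))
print("test: ", Counter(labels[i] for i in test_idx))
```

Because the quota is computed per class, the rare class cannot end up entirely in one part by chance, which is the risk with a plain random split.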

In [None]:
X = df.drop(['Class', 'Time'], axis = 1)
y = df.Class  # Time is dropped as well because it has no effect; this also lightens the data.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, stratify=y, random_state=42)

#### Scaling

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [None]:
operations = [("scaler", StandardScaler()), ('log', LogisticRegression())]
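Putting the scaler inside a pipeline means its statistics are learned from the training data only and then reused on the test data. The sketch below shows that idea in plain Python with made-up numbers; the helper names `fit_scaler` and `transform` are hypothetical and are not part of scikit-learn.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values only."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return mean, (std if std > 0 else 1.0)   # guard against constant columns

def transform(column, mean, std):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(x - mean) / std for x in column]

train_amount = [10.0, 20.0, 30.0, 40.0]   # made-up training values
test_amount = [25.0, 100.0]               # made-up test values

mean, std = fit_scaler(train_amount)
scaled_train = transform(train_amount, mean, std)
scaled_test = transform(test_amount, mean, std)   # uses the training mean and std

print(scaled_train)
print(scaled_test)
```

The scaled training values have mean 0 and standard deviation 1, while the test values are simply expressed on the same scale, so no information from the test set leaks into preprocessing.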

---
---


## 3. Model Building
As stated earlier, you need to predict the class with three different algorithms. With unbalanced data such as this, different approaches are required to obtain better performance.

This dataset is severely **unbalanced** (most of the transactions are non-fraud), so the algorithms are much more likely to assign new observations to the majority class, and a high accuracy by itself won't tell us anything. To address the imbalance we can use undersampling and oversampling techniques. Oversampling increases the number of minority class members in the training set. Its advantage is that, unlike undersampling, no information from the original training set is lost, as all observations from the minority and majority classes are kept. On the other hand, it is prone to overfitting.
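A quick piece of arithmetic shows why accuracy is uninformative here. Using the counts quoted in the dataset description (before any duplicates are dropped), a "model" that always predicts non-fraud already looks excellent on accuracy while catching nothing:

```python
n_total = 284_807   # transactions, from the dataset description
n_fraud = 492       # fraudulent transactions, from the dataset description

fraud_share = n_fraud / n_total

# Trivial baseline: predict class 0 (non-fraud) for every transaction.
baseline_accuracy = (n_total - n_fraud) / n_total
baseline_recall = 0 / n_fraud            # not a single fraud is detected

print(f"fraud share:       {fraud_share:.2%}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```

Any model you build has to be judged against this baseline, which is why the rest of the notebook focuses on recall, precision and F1 for class 1.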

There is a type of oversampling called **[SMOTE](https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-near-miss-algorithm-in-python/)** (Synthetic Minority Oversampling Technique), which we are going to use to make our dataset balanced. It creates synthetic points from the minority class.
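The core step of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line between them. The code below is a simplified illustration with made-up 2-D points, not the imbalanced-learn implementation; `smote_like_point` is a hypothetical helper name.

```python
import math
import random

def smote_like_point(minority, k, rng):
    """Create one synthetic point between a minority sample and one of its k nearest minority neighbours."""
    i = rng.randrange(len(minority))
    base = minority[i]
    others = [p for j, p in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()                               # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

# Made-up minority-class samples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (3.0, 3.0)]

rng = random.Random(42)
synthetic = [smote_like_point(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Every synthetic point is an interpolation of two real minority samples, so it stays inside the region the minority class already occupies. That is also why SMOTE should be applied to the training data only, never to the test set.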

- It is important that you can evaluate the effectiveness of SMOTE. For this reason, implement the Logistic Regression algorithm in two different ways, with SMOTE applied and without.

***Note***: 

- *Do not forget to import the necessary libraries and modules before starting the model building!*

- *If you are going to use the cross validation method to be more sure of the performance of your model for unbalanced data, you should make sure that the class distributions in the iterations are equal. For this case, you should use **[StratifiedKFold](https://www.analyseup.com/python-machine-learning/stratified-kfold.html)** instead of regular cross validation method.*
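As a rough illustration of what stratification means for cross validation, the sketch below deals the rows of each class out to the folds in turn, so every fold receives a near-equal share of the rare class. It is a simplified stand-in (no shuffling, toy labels), not scikit-learn's `StratifiedKFold`; the function name `stratified_folds` is an assumption.

```python
from collections import Counter

def stratified_folds(labels, n_splits):
    """Assign each row to a fold so that every fold gets a near-equal share of each class."""
    fold_of = [None] * len(labels)
    seen = Counter()                               # how many rows of each class were placed so far
    for idx, label in enumerate(labels):
        fold_of[idx] = seen[label] % n_splits      # round-robin within the class
        seen[label] += 1
    return fold_of

labels = [0] * 95 + [1] * 5                        # toy data: 5% positives
fold_of = stratified_folds(labels, n_splits=5)

for fold in range(5):
    held_out = [y for y, f in zip(labels, fold_of) if f == fold]
    print(f"fold {fold}: {dict(Counter(held_out))}")
```

With a plain, unstratified split of such data, some folds could contain no positive rows at all, and recall or precision for class 1 would then be undefined for those folds.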

### Logistic Regression without SMOTE

- The steps you are going to cover for this algorithm are as follows: 

   *i. Import Libraries*
   
   *ii. Model Training*
   
   *iii. Prediction and Model Evaluating*
   
   *iv. Plot Precision and Recall Curve*
   
   *v. Apply and Plot StratifiedKFold*

***i. Import Libraries***

In [None]:
from sklearn.linear_model import LogisticRegression

***ii. Model Training***

In [None]:
def eval_metric(model, X_train, y_train, X_test, y_test):
    y_train_pred = model.predict(X_train)
    y_pred = model.predict(X_test)
    
    print("Test_Set")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print()
    print("Train_Set")
    print(confusion_matrix(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score, accuracy_score, f1_score, average_precision_score

In [None]:
pipe_model = Pipeline(steps=operations)

In [None]:
pipe_model.fit(X_train, y_train)


In [None]:
eval_metric(pipe_model, X_train, y_train, X_test, y_test)

CROSS VALIDATE

In [None]:
from sklearn.model_selection import cross_validate

In [None]:
operations = [('scaler',StandardScaler()),('log',LogisticRegression())]
model = Pipeline(operations)

scores = cross_validate(model, X_train, y_train, scoring = ['precision','recall','f1','accuracy'], cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

In [None]:
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve, roc_auc_score, auc, roc_curve, average_precision_score

In [None]:
y_pred = pipe_model.predict(X_test)
log_f1 = f1_score(y_test, y_pred)
log_recall = recall_score(y_test, y_pred)
log_auc = roc_auc_score(y_test, y_pred)

with class_weight

In [None]:
class_weights = {0:1, 1:15}

In [None]:
operations = [("scaler", StandardScaler()), ('log', LogisticRegression(class_weight=class_weights))]

In [None]:
pipe_model_weight = Pipeline(steps=operations)

In [None]:
pipe_model_weight.fit(X_train, y_train)

In [None]:
eval_metric(pipe_model_weight, X_train, y_train, X_test, y_test)  # after applying class_weight, precision dropped while recall rose

The precision scores obtained from cross-validation came out slightly lower than the single-run scores:

In [None]:
operations = [('scaler',StandardScaler()),('log',LogisticRegression(class_weight=class_weights))]
model = Pipeline(operations)

scores = cross_validate(model, X_train, y_train, scoring = ['precision','recall','f1','accuracy'], cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

***iii. Prediction and Model Evaluating***

In [None]:
y_pred = pipe_model_weight.predict(X_test)

In [None]:
y_pred_proba = pipe_model_weight.predict_proba(X_test)

In [None]:
test_data = pd.concat([X_test, y_test], axis=1)
test_data["pred"] = y_pred
test_data["pred_proba"] = y_pred_proba[:,1]
test_data.sample(10)

In [None]:
log_weighted_f1 = f1_score(y_test, y_pred)
log_weighted_recall = recall_score(y_test, y_pred)
log_weighted_auc = roc_auc_score(y_test, y_pred)

Prediction for a sample from the data:

In [None]:
df[df.Class == 1].head()

In [None]:
pipe_model_weight.predict(X.loc[[541]])    # True prediction

In [None]:
pipe_model_weight.predict(X.loc[[623]])      # Wrong prediction

matthews_corrcoef --> the correlation between the actual values and the predicted values.

matthews_corrcoef and cohen_kappa_score are the scores to look at for overall performance on imbalanced datasets.
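Both scores can be computed directly from the four cells of a binary confusion matrix. The sketch below uses the standard formulas with made-up counts (not results from this notebook) to show how they react compared with accuracy; the function names are hypothetical helpers, not the scikit-learn functions.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: observed agreement corrected for the agreement expected by chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0

# Made-up counts for a reasonable classifier on 10,000 rows with 50 frauds.
good = dict(tp=40, fp=10, fn=10, tn=9940)
# Made-up counts for a classifier that always predicts non-fraud.
always_negative = dict(tp=0, fp=0, fn=50, tn=9950)

for name, cm in [("good", good), ("always negative", always_negative)]:
    accuracy = (cm["tp"] + cm["tn"]) / sum(cm.values())
    print(f"{name:16s} accuracy={accuracy:.4f}  mcc={mcc(**cm):.3f}  kappa={cohen_kappa(**cm):.3f}")
```

The always-negative classifier still has an accuracy above 99%, but both MCC and kappa fall to zero, which is why they are more useful summary scores on imbalanced data.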

In [None]:
from sklearn.metrics import matthews_corrcoef

y_pred = pipe_model_weight.predict(X_test)

matthews_corrcoef(y_test, y_pred)

In [None]:
from sklearn.metrics import cohen_kappa_score

cohen_kappa_score(y_test, y_pred)


Are you evaluating the "accuracy score"? Does your performance metric reflect real success? You may need to use different metrics to evaluate performance on unbalanced data. You should use **[precision and recall metrics](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#:~:text=The%20precision%2Drecall%20curve%20shows,a%20low%20false%20negative%20rate.)**.
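As a reminder of what these metrics measure, here is a small sketch that derives precision, recall and F1 for the fraud class from confusion-matrix counts. The counts are invented for illustration and `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # share of fraud alerts that are real frauds
    recall = tp / (tp + fn) if tp + fn else 0.0      # share of real frauds that are caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented counts: 10,000 transactions, 50 of them fraudulent.
tp, fp, fn, tn = 30, 5, 20, 9945

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```

In this example accuracy is above 99% even though 20 of the 50 frauds are missed, which only the recall figure reveals.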

***iv. Plot Precision and Recall Curve***


In [None]:
from sklearn.metrics import precision_recall_curve, plot_precision_recall_curve, plot_roc_curve, roc_auc_score, roc_curve, average_precision_score

In [None]:
plot_precision_recall_curve(pipe_model_weight, X_test, y_test);

In [None]:
y_pred_proba = pipe_model.predict_proba(X_train)
average_precision_score(y_train, y_pred_proba[:,1])

***v. Apply StratifiedKFold***

In [None]:
precisions, recalls, thresholds = precision_recall_curve(y_train, y_pred_proba[:,1])


In [None]:
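# F1 per threshold; the last precision/recall pair has no threshold, and nanargmax skips 0/0 cases.
f1_by_threshold = (2 * precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1])
optimal_idx = np.nanargmax(f1_by_threshold)
optimal_threshold = thresholds[optimal_idx]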
optimal_threshold
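The threshold search above can be sketched without any library: try each predicted probability as a cut-off, compute F1 for the resulting labels, and keep the best one. The labels and scores below are made up, and `best_f1_threshold` is a hypothetical helper written only for this illustration.

```python
def best_f1_threshold(y_true, scores):
    """Return the cut-off (taken from the observed scores) with the highest F1, and that F1."""
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1

# Made-up labels and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

threshold, f1 = best_f1_threshold(y_true, scores)
print(f"best threshold: {threshold}, F1: {f1:.3f}")
```

Keep in mind that a threshold tuned on the training data is optimistic; it is best confirmed on held-out folds, which is what the cross-validation step that follows is for.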

In [None]:
from sklearn.model_selection import StratifiedKFold    # Generates the train/test indices for the chosen number of folds.

def CV(n, est, X, y, optimal_threshold):
    skf = StratifiedKFold(n_splits = n, shuffle = True, random_state = 42)
    acc_scores = []
    pre_scores = []
    rec_scores = []
    f1_scores  = []
    
    X = X.reset_index(drop=True)       # Reset the index so the positions returned by skf.split match the labels used by .loc.
    y = y.reset_index(drop=True)
    
    for train_index, test_index in skf.split(X, y):
        
        X_train = X.loc[train_index]
        y_train = y.loc[train_index]
        X_test = X.loc[test_index]
        y_test = y.loc[test_index]
        
        est.fit(X_train, y_train)
        y_pred_proba = est.predict_proba(X_test)
        
        # Label a row as fraud when its predicted probability reaches the threshold.
        y_pred2 = (y_pred_proba[:,1] >= optimal_threshold).astype(int)
        
        acc_scores.append(accuracy_score(y_test, y_pred2))
        pre_scores.append(precision_score(y_test, y_pred2, pos_label=1))
        rec_scores.append(recall_score(y_test, y_pred2, pos_label=1))
        f1_scores.append(f1_score(y_test, y_pred2, pos_label=1))
    
    print(f'Accuracy {np.mean(acc_scores)*100:>10,.2f}%  std {np.std(acc_scores)*100:.2f}%')
    print(f'Precision-1 {np.mean(pre_scores)*100:>7,.2f}%  std {np.std(pre_scores)*100:.2f}%')
    print(f'Recall-1 {np.mean(rec_scores)*100:>10,.2f}%  std {np.std(rec_scores)*100:.2f}%')
    print(f'F1_score-1 {np.mean(f1_scores)*100:>8,.2f}%  std {np.std(f1_scores)*100:.2f}%')

In [None]:
CV(10, pipe_model, pd.DataFrame(X_train), y_train, 0.5)

In [None]:
CV(10, pipe_model, pd.DataFrame(X_train), y_train, optimal_threshold)

- Didn't the performance of the model you implemented above satisfy you? If your model is biased towards the majority class and minority class recall is not sufficient, apply **SMOTE**.

### Apply SMOTE

In [None]:
# conda install -c conda-forge imbalanced-learn    (or: pip install imbalanced-learn)


In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as imbpipeline


In [None]:
y_train.value_counts()

In [None]:
#over = SMOTE(sampling_strategy=0.1)
#under = RandomUnderSampler(sampling_strategy=0.5)

With SMOTE we synthetically increased the number of samples in the minority class (1) to roughly three times its size (1,000 samples), and we reduced the number of samples in class 0 by a factor of about 2.5 (to 100,000):

In [None]:
over = SMOTE(sampling_strategy={1: 1000})
under = RandomUnderSampler(sampling_strategy={0: 100000})

In [None]:
X_resampled, y_resampled = over.fit_resample(X_train, y_train)

In [None]:
X_resampled, y_resampled = under.fit_resample(X_resampled, y_resampled)

In [None]:
y_resampled.value_counts()

### Logistic Regression with SMOTE

- The steps you are going to cover for this algorithm are as follows:
   
   *i. Train-Test Split (Again)*
   
   *ii. Model Training*
   
   *iii. Prediction and Model Evaluating*
   
   *iv. Plot Precision and Recall Curve*
   
   *v. Apply and Plot StratifiedKFold*

***i. Train-Test Split (Again)***

Use SMOTE applied data.

In [None]:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)

In [None]:
class_weight_smote = {0:1, 1:1}

In [None]:
operations = [('o', over), ('u', under), ("scaler", StandardScaler()), 
              ('log', LogisticRegression(class_weight=class_weight_smote, random_state = 42))]

In [None]:
smote_pipeline_log = imbpipeline(steps=operations)

***ii. Model Training***

In [None]:
smote_pipeline_log.fit(X_train, y_train)

***iii. Prediction and Model Evaluating***

In the Logistic Regression model built after SMOTE, the precision and recall scores moved a little closer to each other:

In [None]:
eval_metric(smote_pipeline_log, X_train, y_train, X_test, y_test)

Cross Validate

In [None]:
model = imbpipeline(steps=operations)  # separate, unfitted copy for cross-validation; smote_pipeline_log stays fitted

scores = cross_validate(model, X_train, y_train, scoring = ['accuracy', 'recall', 'f1'], cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]


***iv. Plot Precision and Recall Curve***

In [None]:
plot_precision_recall_curve(smote_pipeline_log, X_test, y_test);

***v. Apply StratifiedKFold***

In [None]:
CV(10, smote_pipeline_log, pd.DataFrame(X_train), y_train, optimal_threshold)

In [None]:
CV(10, smote_pipeline_log, pd.DataFrame(X_train), y_train, 0.5)

In [None]:
y_pred = smote_pipeline_log.predict(X_test)
smote_pipeline_f1 = f1_score(y_test, y_pred)
smote_pipeline_recall = recall_score(y_test, y_pred)
smote_pipeline_auc = roc_auc_score(y_test, y_pred)

### Random Forest Classifier with SMOTE

- The steps you are going to cover for this algorithm are as follows:

   *i. Model Training*
   
   *ii. Prediction and Model Evaluating*
   
   *iii. Plot Precision and Recall Curve*
   
   *iv. Apply and Plot StratifiedKFold*
   

***i. Model Training***

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
class_weights = {0 : 1, 1 : 1}

In [None]:
over = SMOTE(sampling_strategy={1: 1000})
under = RandomUnderSampler(sampling_strategy={0: 100000})

In [None]:
rf_operations = [('o', over), ('u', under), ('rf', RandomForestClassifier(class_weight=class_weights, max_depth=7, random_state=42))]
smote_rf_model = imbpipeline(steps=rf_operations)

In [None]:
smote_rf_model.fit(X_train, y_train)

The RF model built with the default parameters showed overfitting. After experimenting with the parameters, the best score was obtained with max_depth=7:

In [None]:
eval_metric(smote_rf_model, X_train, y_train, X_test, y_test)

In [None]:
model = RandomForestClassifier(class_weight = class_weights, max_depth=7, random_state=42)

scores = cross_validate(model, X_train, y_train, scoring = ['accuracy', 'precision', 'recall', 'f1'], cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))

df_scores.mean()[2:]

***ii. Prediction and Model Evaluating***

In [None]:
y_pred = smote_rf_model.predict(X_test)
smote_rf_f1 = f1_score(y_test, y_pred)
smote_rf_recall = recall_score(y_test, y_pred)
smote_rf_auc = roc_auc_score(y_test, y_pred)

In [None]:
smote_rf_model.predict(X.loc[[541]])      # True prediction


In [None]:
smote_rf_model.predict(X.loc[[623]])        # Wrong prediction

***iii. Plot Precision and Recall Curve***


In [None]:
plot_precision_recall_curve(smote_rf_model, X_test, y_test);

***iv. Apply StratifiedKFold***

In [None]:
# without class_weight for StratifiedKFold :

rf_operations = [('o', over), ('u', under), ('rf', RandomForestClassifier(max_depth=7, random_state=42))]
rf_stratified = imbpipeline(steps=rf_operations)

rf_stratified.fit(X_train, y_train)

eval_metric(rf_stratified, X_train, y_train, X_test, y_test)

In [None]:
model = RandomForestClassifier(max_depth=7, random_state=42)

scores = cross_validate(model, X_train, y_train, scoring = ['accuracy', 'precision', 'recall', 'f1'], cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))

df_scores.mean()[2:]

In [None]:
y_pred_proba = rf_stratified.predict_proba(X_train)
average_precision_score(y_train, y_pred_proba[:,1])

In [None]:
precisions, recalls, thresholds = precision_recall_curve(y_train, y_pred_proba[:,1])

In [None]:
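# F1 per threshold; the last precision/recall pair has no threshold, and nanargmax skips 0/0 cases.
f1_by_threshold = (2 * precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1])
optimal_idx = np.nanargmax(f1_by_threshold)
optimal_threshold = thresholds[optimal_idx]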
optimal_threshold

In [None]:
CV(10, smote_rf_model, pd.DataFrame(X_train), y_train, optimal_threshold)

In [None]:
CV(10, smote_rf_model, pd.DataFrame(X_train), y_train, 0.5)


In [None]:
CV(10, rf_stratified, pd.DataFrame(X_train), y_train, optimal_threshold)

In [None]:
CV(10, rf_stratified, pd.DataFrame(X_train), y_train, 0.5)

### Neural Network

In the final step, you will make classification with Neural Network which is a Deep Learning algorithm. 

Neural networks are a series of algorithms that mimic the operations of a human brain to recognize relationships between vast amounts of data. They are used in a variety of applications in financial services, from forecasting and marketing research to fraud detection and risk assessment.

A neural network contains layers of interconnected nodes. Each node is a perceptron and is similar to a multiple linear regression. The perceptron feeds the signal produced by a multiple linear regression into an activation function that may be nonlinear.
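The description above maps to very little code: a weighted sum of the inputs plus a bias (the "multiple linear regression" part), passed through an activation function. The sketch below uses a sigmoid activation and arbitrary, untrained weights chosen only for illustration.

```python
import math

def perceptron(inputs, weights, bias):
    """One node: linear combination of the inputs followed by a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias   # linear part
    return 1.0 / (1.0 + math.exp(-z))                        # squashes z into (0, 1)

inputs = [0.5, -1.2, 3.0]      # three made-up feature values
weights = [0.8, 0.1, -0.4]     # arbitrary weights, not learned from data
bias = 0.2

output = perceptron(inputs, weights, bias)
print(f"activation: {output:.4f}")
```

With a sigmoid in the output node, the result can be read as a probability of the positive class, which is how the final `Dense(1, activation="sigmoid")` layer in the model below is used.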

In a multi-layered perceptron (MLP), perceptrons are arranged in interconnected layers. The input layer collects input patterns. The output layer has classifications or output signals to which input patterns may map. 

Hidden layers fine-tune the input weightings until the neural network’s margin of error is minimal. It is hypothesized that hidden layers extrapolate salient features in the input data that have predictive power regarding the outputs.

You will discover **[how to create](https://towardsdatascience.com/building-our-first-neural-network-in-keras-bdc8abbc17f5)** your deep learning neural network model in Python using **[Keras](https://keras.io/about/)**. Keras is a powerful and easy-to-use free open source Python library for developing and evaluating deep learning models.

- The steps you are going to cover for this algorithm are as follows:

   *i. Import Libraries*
   
   *ii. Define Model*
    
   *iii. Compile Model*
   
   *iv. Fit Model*
   
   *v. Prediction and Model Evaluating*
   
   *vi. Plot Precision and Recall Curve*

***i. Import Libraries***

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import classification_report, confusion_matrix

***ii. Define Model***

In [None]:
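# `feature` still holds the last column name from the box-plot loop ("Class"), so df[feature]
# would select the target itself. The SHAP-selected feature list is not defined in this notebook;
# as an assumption, the same columns as in Section 2 are used here.
X2 = df.drop(['Class', 'Time'], axis=1)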
y = df.Class.values
seed = 42
X_train, X_test, y_train, y_test = train_test_split(X2,
                                                    y,
                                                    stratify=y,
                                                    test_size=0.1,
                                                    random_state=seed)

In [None]:
scaler = StandardScaler()

In [None]:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
tf.random.set_seed(seed)

model_ann = Sequential()

model_ann.add(Dense(30, activation = "relu", input_dim = X_train.shape[1]))
model_ann.add(Dense(15, activation = "relu"))
model_ann.add(Dense(1, activation = "sigmoid"))

***iii. Compile Model***

***iv. Fit Model***

***v. Prediction and Model Evaluating***

***vi. Plot Precision and Recall Curve***

## 4. Model Deployment
You cooked the food in the kitchen and moved on to the serving stage. The question is how do you showcase your work to others? Model deployment helps you showcase your work to the world and make better decisions with it. But deploying a model can get a little tricky at times. Before deploying the model, many things such as data storage, preprocessing, model building and monitoring need to be studied.

Deploying machine learning models means making them available to your other business systems. Once deployed, other systems can send data to the models and get their predictions, which are in turn populated back into the company systems. Through machine learning model deployment, you can begin to take full advantage of the model you built.

Data science is concerned with how to build machine learning models, which algorithm is more predictive, how to design features, and what variables to use to make the models more accurate. However, how these models are actually used is often neglected. And yet this is the most important step in the machine learning pipeline. Only when a model is fully integrated with the business systems can real value be extracted from its predictions.

After doing the following operations in this notebook, jump to a proper IDE and create your web app with Streamlit API.

### Save and Export the Model as .pkl
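A minimal sketch of the save-and-reload round trip with Python's built-in `pickle` module is shown below. The dictionary is a hypothetical stand-in for a fitted model, and the file name `model.pkl` and the temporary directory are assumptions; in the notebook you would dump your own fitted pipeline to a path of your choice.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted model: any picklable Python object works the same way.
model = {"weights": [0.8, 0.1, -0.4], "bias": 0.2, "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model.pkl"

    # Save: serialize the object to a binary file.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load: this is what the deployed app would do at start-up.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code. The same library versions should also be used when saving and when loading the model.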


### Save and Export Variables as .pkl
