## Credit Card Fraud - Oversampling and undersampling

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#eda">Exploratory Data Analysis -EDA </a></li>
<li><a href="#under">RandomUnderSampler & SMOTE combination</a></li>
<li><a href="#over">SMOTE</a></li>    
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
This dataset contains credit card transactions made over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced, so we use two resampling methods to handle it. SMOTE oversamples the minority class: for each minority sample it picks one of its k nearest minority neighbors and creates a new synthetic point between the two. RandomUnderSampler undersamples the majority class by randomly dropping majority samples until the classes reach the requested ratio.
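
To make the two ideas concrete, here is a minimal pure-Python sketch of one SMOTE-style interpolation step and of random undersampling. The helper names `smote_like_sample` and `random_undersample` and the toy points are hypothetical, chosen only for illustration. This is not the imblearn implementation, which is what the rest of the notebook uses.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of
    its k nearest minority neighbours (illustrative helper, not imblearn)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # rank the other minority points by Euclidean distance to the base point
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    # interpolate: base + lam * (neighbour - base), with lam in [0, 1)
    lam = rng.random()
    return [b + lam * (n - b) for b, n in zip(base, neighbour)]

def random_undersample(majority, n_keep, rng=None):
    """Keep a random subset of the majority class (illustrative helper)."""
    rng = rng or random.Random(0)
    return rng.sample(majority, n_keep)

# toy data: 4 minority points and 20 majority points
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
majority = [[float(i), float(-i)] for i in range(20)]

synthetic = smote_like_sample(minority, k=2, rng=random.Random(42))
kept = random_undersample(majority, n_keep=8, rng=random.Random(42))
print("synthetic minority point:", synthetic)
print("majority samples kept:", len(kept))
```

The synthetic point always lies on the segment between two existing minority points, so SMOTE fills in the minority region rather than duplicating rows.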

This is an intro to precision and recall:
* Precision = True Positives / (True Positives + False Positives)<br>
 A high precision score means that the classifier is making very few false positive predictions, which is good if we want to minimize false alarms.<br>
* Recall = True Positives / (True Positives + False Negatives)<br>
A high recall score means that the classifier is correctly identifying a large fraction of the positive instances in the dataset, which is good if we want to minimize false negatives.
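
As a small worked example of the two formulas above, the sketch below counts true positives, false positives and false negatives by hand. The function name `precision_recall` and the toy labels are made up for illustration. Later cells in the notebook rely on `precision_score` and `recall_score` from scikit-learn instead.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # guard against division by zero when there are no positive predictions/labels
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# toy labels: 4 frauds (1) and 6 normal transactions (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Here TP = 2, FP = 1 and FN = 2, so precision is 2/3 and recall is 1/2: the classifier raises few false alarms but misses half of the frauds.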

<a id='eda'></a>
## EDA

In [None]:
# import the libraries needed for the EDA
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing

In [None]:
df=pd.read_csv("/kaggle/input/creditcardfraud/creditcard.csv")
df.head()

In [None]:
# Normalize the Amount column to the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler()
df["Amount"] = min_max_scaler.fit_transform(np.array(df["Amount"]).reshape(-1, 1))
df.head()

In [None]:
df["Class"].value_counts()

In [None]:
# plot the distribution of y using two of our features (V1 and V2)
from collections import Counter

counter = Counter(df["Class"])

for label, _ in counter.items():
    row_ix = np.where(df["Class"] == label)[0]
    plt.scatter(df.iloc[row_ix,1], df.iloc[row_ix,2], label=str(label))
print(Counter(df["Class"]))
plt.legend()
plt.show()

The plot above shows how the two classes of y are distributed across columns V1 and V2.

### Handling the imbalanced dataset with RandomUnderSampler & SMOTE

In [None]:
# import the required libraries
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from imblearn.ensemble import BalancedRandomForestClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,precision_score,recall_score

In [None]:
# X and y from the original data; this split is used to evaluate every model, before and after applying the oversampling & undersampling methods
X=df.drop("Class",axis=1).values
y=df["Class"].values
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=5)

In [None]:
# baseline: LogisticRegression without any resampling
linear=LogisticRegression(max_iter=200)
linear.fit(x_train,y_train)

y_train_pred=linear.predict(x_train)
print("linear Report for training : \n",classification_report(y_train,y_train_pred))

print("*"*50)

y_test_pred_LR=linear.predict(x_test)
print("linear Report for testing : \n",classification_report(y_test,y_test_pred_LR))

<a id='under'></a>
## RandomUnderSampler & SMOTE combination

In [None]:
# define X (features) and y (label) from the data
X1=df.drop("Class",axis=1).values
y1=df["Class"].values

In [None]:
# oversample with SMOTE, then undersample with RandomUnderSampler
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
print("over : ",over)

steps = [('o', over), ('u', under)]
print("steps : ",steps)

# use pipeline to apply the previous steps
pipeline = Pipeline(steps=steps)
print("pipeline : ",pipeline)

# resample the full dataset (applied before the train/test split below)
X1, y1 = pipeline.fit_resample(X1, y1)

# summarize the new class distribution
counter = Counter(y1)
print(counter)

# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = np.where(y1 == label)[0]
    plt.scatter(X1[row_ix, 1], X1[row_ix, 2], label=str(label))
    
plt.legend()
plt.show()

In the cell above we combine SMOTE and RandomUnderSampler to balance the data. SMOTE with sampling_strategy=0.1 first oversamples the minority class to 10% of the majority class. RandomUnderSampler with sampling_strategy=0.5 then undersamples the majority class until the minority class is 50% of it, a 2:1 ratio. The final output is class 0 = 56,862 and class 1 = 28,431.
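
The class counts can be checked by hand from the two `sampling_strategy` values used above. The sketch below is plain arithmetic, not a call into imblearn. The helper name `resampled_counts` is hypothetical, and truncating with `int()` is an assumption about how the ratios are rounded.

```python
def resampled_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Expected class sizes after oversampling, then undersampling.

    over_ratio  : desired minority/majority ratio after oversampling
    under_ratio : desired minority/majority ratio after undersampling
    """
    # oversampling grows the minority class up to over_ratio * majority
    n_minority_after = max(n_minority, int(n_majority * over_ratio))
    # undersampling shrinks the majority class down to minority / under_ratio
    n_majority_after = min(n_majority, int(n_minority_after / under_ratio))
    return n_majority_after, n_minority_after

n_total, n_fraud = 284_807, 492
counts = resampled_counts(n_total - n_fraud, n_fraud, over_ratio=0.1, under_ratio=0.5)
print("class 0:", counts[0], "class 1:", counts[1])
```

With 284,315 normal transactions, 10% gives 28,431 fraud samples after oversampling, and a 0.5 ratio leaves 56,862 normal samples after undersampling.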

## LogisticRegression with SMOTE & RandomUnderSampler

In [None]:
x1_train , x1_test , y1_train , y1_test = train_test_split(X1,y1,test_size=.2,random_state=1)

In [None]:
linear=LogisticRegression(C=.01,max_iter=200)
linear.fit(x1_train,y1_train)

# evaluate on the original (non-resampled) split: x_train / x_test
y1_train_pred=linear.predict(x_train)
print("linear Report for training : \n",classification_report(y_train,y1_train_pred))

print("*"*50)

y1_test_pred_LR=linear.predict(x_test)
print("linear Report for testing : \n",classification_report(y_test,y1_test_pred_LR))

This model reaches a recall of 88% on the test set, compared with 59% for the previous model trained without SMOTE and RandomUnderSampler.

### DecisionTreeClassifier with SMOTE & RandomUnderSampler

In [None]:
DT=DecisionTreeClassifier()
DT.fit(x1_train,y1_train)

y1_train_pred=DT.predict(x_train)
print("DecisionTreeClassifier Report for training : \n",classification_report(y_train,y1_train_pred))

print("*"*50)

y1_test_pred_DT=DT.predict(x_test)
print("DecisionTreeClassifier Report for testing : \n",classification_report(y_test,y1_test_pred_DT))

## RandomForestClassifier with SMOTE & RandomUnderSampler

In [None]:
RF=RandomForestClassifier()
RF.fit(x1_train,y1_train)

y1_train_pred=RF.predict(x_train)
print("RandomForestClassifier Report for training : \n",classification_report(y_train,y1_train_pred))

print("*"*50)

y1_test_pred_RF=RF.predict(x_test)
print("RandomForestClassifier Report for testing : \n",classification_report(y_test,y1_test_pred_RF))

<a id='over'></a>
## SMOTE

In [None]:
# prepare new x and y
X_somte=df.drop("Class",axis=1).values
y_somte=df["Class"].values

In [None]:
# transform the dataset
oversample = SMOTE()
X_somte, y_somte = oversample.fit_resample(X_somte, y_somte)

In [None]:
# summarize the new class distribution
counter = Counter(y_somte)
print(counter)

In [None]:
# scatter plot after SMOTE
for label, _ in counter.items():
    row_ix = np.where(y_somte == label)[0]
    plt.scatter(X_somte[row_ix, 1], X_somte[row_ix, 2], label=str(label))
plt.legend()
plt.show()

The plot above shows how the two classes of y are distributed across columns V1 and V2 after SMOTE.

In [None]:
x_smote_train ,x_smote_test , y_smote_train , y_smote_test = train_test_split(X_somte, y_somte,test_size=.3,random_state=2)

## LogisticRegression SMOTE

In [None]:
linear=LogisticRegression(max_iter=200)
linear.fit(x_smote_train,y_smote_train)

y_smote_train_pred=linear.predict(x_train)
print("linear Report for training : \n",classification_report(y_train,y_smote_train_pred))

print("*"*50)

y_smote_test_pred_LR=linear.predict(x_test)
print("linear Report for testing : \n",classification_report(y_test,y_smote_test_pred_LR))

## DecisionTreeClassifier SMOTE

In [None]:
DT=DecisionTreeClassifier()
DT.fit(x_smote_train,y_smote_train)

y_smote_train_pred=DT.predict(x_train)
print("DecisionTreeClassifier Report for training : \n",classification_report(y_train,y_smote_train_pred))

print("*"*50)

y_smote_test_pred_DT=DT.predict(x_test)

print("DecisionTreeClassifier Report for testing : \n",classification_report(y_test,y_smote_test_pred_DT))

<a id='conclusions'></a>
## Conclusions

In [None]:
# compare test-set recall and precision for each model and resampling method
# (RandomForest was not trained on the SMOTE-only data, hence "-")
pd.DataFrame({"Model Name":["LR","DT","RF"],
              "recall RUS&SMOTE":[recall_score(y_test,y1_test_pred_LR),recall_score(y_test,y1_test_pred_DT),recall_score(y_test,y1_test_pred_RF)],
              "precision RUS&SMOTE":[precision_score(y_test,y1_test_pred_LR),precision_score(y_test,y1_test_pred_DT),precision_score(y_test,y1_test_pred_RF)],
              "recall SMOTE":[recall_score(y_test,y_smote_test_pred_LR),recall_score(y_test,y_smote_test_pred_DT),"-"],
              "precision SMOTE":[precision_score(y_test,y_smote_test_pred_LR),precision_score(y_test,y_smote_test_pred_DT),"-"]
             }
            )
