#Overview
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount, which can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.

#Inspiration
Identify fraudulent credit card transactions.

Given the class imbalance ratio, we are measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
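To see why plain accuracy says little on this dataset, consider a hypothetical baseline that labels every transaction as non-fraud. The sketch below uses only the counts quoted in the overview; the variable names are illustrative and not part of the notebook.

```python
# Counts quoted in the overview above
n_transactions = 284_807
n_frauds = 492

# Hypothetical baseline: predict "non-fraud" (0) for every transaction
true_positives = 0                            # no fraud is ever flagged
false_negatives = n_frauds                    # every fraud is missed
true_negatives = n_transactions - n_frauds    # every genuine transaction is labelled correctly

accuracy = (true_positives + true_negatives) / n_transactions
recall = true_positives / (true_positives + false_negatives)

print('Accuracy: {:.4%}'.format(accuracy))
print('Recall:   {:.4%}'.format(recall))
```

The baseline is right on almost every row yet finds no fraud at all, which is why a precision-recall based metric is preferred here.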

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
# Data can be downloaded from https://www.kaggle.com/mlg-ulb/creditcardfraud/downloads/creditcardfraud.zip/3
df= pd.read_csv('/content/Credit card fraud detection.zip')

In [None]:
df.head()

In [None]:
#Null Value Check
df.isnull().values.any()

In [None]:
#Data Class Balance Check
print('Fraud Percentage: {}'.format(round((df['Class'].value_counts()[1]/len(df))*100,2)))
print('Non Fraud Percentage: {}'.format(round((df['Class'].value_counts()[0]/len(df))*100,2)))

In [None]:
count= df['Class'].value_counts()
count.plot(kind='bar')
plt.xticks(range(2),['Non Fraud','Fraud'])
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()

Notice how imbalanced the original dataset is! Most of the transactions are non-fraud. If we use this dataframe as the base for our predictive models and analysis, a model can reach a high accuracy simply by "assuming" that most transactions are not fraud, while missing most of the fraud cases. But we don't want our model to assume, we want it to detect patterns that give signs of fraud!

#Distributions:

By looking at the distributions we can get an idea of how skewed these features are. The same approach can be used to inspect the distributions of the other features.

In [None]:
fig, ax= plt.subplots(2,1, figsize=(20,10))

amount= df['Amount'].values
time= df['Time'].values

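# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(amount, bins=50, kde=True, ax=ax[0], color='r')
sns.histplot(time, bins=50, kde=True, ax=ax[1], color='b')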

From the distributions we can see that most transaction amounts are very small, whereas the transaction times are spread across the whole two-day period.

#Scaling
The features V1 to V28 were already scaled as part of the PCA transformation that hides the original data, so only 'Time' and 'Amount' remain to be scaled.
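As a rough illustration of what robust scaling does, the sketch below centres values on the median and divides by the interquartile range, which is what scikit-learn's `RobustScaler` does with its default settings. The helper `robust_scale` and the toy `amounts` list are hypothetical and only serve the example.

```python
import statistics

def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    # 'inclusive' interpolates linearly between the observed data points
    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

amounts = [1.0, 2.0, 3.0, 4.0, 100.0]   # toy data with one extreme value
scaled = robust_scale(amounts)
print(scaled)
```

The extreme value stays extreme after scaling, but it does not distort the centre and spread used for the remaining values, because the median and IQR barely react to it.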

In [None]:
from sklearn.preprocessing import RobustScaler # scales with the median and IQR, so it is less sensitive to outliers
ss1= RobustScaler()
df['Amount']= ss1.fit_transform(df['Amount'].values.reshape(-1, 1))

In [None]:
ss2= RobustScaler()
df['Time']= ss2.fit_transform(df['Time'].values.reshape(-1, 1))

In [None]:
df.head()

#Splitting the Data (Original DataFrame)
Before proceeding with the Random UnderSampling technique we have to separate the original dataframe. Why? For testing purposes. Although we split the data again when implementing the Random UnderSampling or OverSampling techniques, we want to test our models on the original testing set, not on a testing set created by either of these techniques. The main goal is to fit the model on the undersampled or oversampled dataframes (so that our models can detect the patterns) and to test it on the original testing set.

In [None]:
xorg=df.drop('Class',axis=1)
yorg= df.loc[:,'Class']

In [None]:
from sklearn.model_selection import train_test_split
xorgtrain,xorgtest,yorgtrain,yorgtest= train_test_split(xorg,yorg,test_size=0.2,random_state=9)

In [None]:
print(xorgtrain.shape,xorgtest.shape,yorgtrain.shape,yorgtest.shape)

#Random Sampling
We will implement "Random Under-Sampling", which consists of removing data in order to have a more balanced dataset, so that our models are not dominated by the majority class.

Note: The main issue with "Random Under-Sampling" is that we run the risk that our classification models will not perform as accurately as we would like, since there is a great deal of information loss (we keep only 492 of the 284,315 non-fraud transactions).

In [None]:
#Using imblearn library
# from imblearn.under_sampling import NearMiss

# nm=NearMiss(random_state=9)
# xr,yr= nm.fit_sample(xorg,yorg)

# from collections import Counter
# print('Original Count: {}'.format(Counter(yorg)))
# print('Sampled Count: {}'.format(Counter(yr)))

# # Now we have equal fraud and non fraud data.

# new_df= pd.concat([pd.DataFrame(xr,columns=xorg.columns),pd.DataFrame(yr)],axis=1)

# new_df= new_df.rename({0:'Class'},axis=1)

# new_df.head()

Using shuffling and selecting first 492 non fraud

In [None]:
# Since our classes are highly skewed we make them equal in size, giving a 50/50 balance between fraud and non-fraud.

# Lets shuffle the data before creating the subsamples

df = df.sample(frac=1)

# amount of fraud classes 492 rows.
fraud_df = df.loc[df['Class'] == 1]
non_fraud_df = df.loc[df['Class'] == 0][:492] #Taking top 492 row for 0

normal_distributed_df = pd.concat([fraud_df, non_fraud_df])

# Shuffle dataframe rows
new_df = normal_distributed_df.sample(frac=1, random_state=42)

new_df.head()

#Oversampling using RandomOverSampler

In [None]:
# from imblearn.over_sampling import RandomOverSampler
# nm= RandomOverSampler(ratio=1,random_state=42)
# xr,yr= nm.fit_sample(xorg,yorg)

# from collections import Counter
# print('Original Count: {}'.format(Counter(yorg)))
# print('Sampled Count: {}'.format(Counter(yr)))

# # Now we have equal fraud and non fraud data.

# new_df= pd.concat([pd.DataFrame(xr,columns=xorg.columns),pd.DataFrame(yr)],axis=1)

# new_df= new_df.rename({0:'Class'},axis=1)

# new_df.head()

In [None]:
new_df.shape

In [None]:
df.columns

#Correlation Matrices
Correlation matrices are essential for understanding our data. We want to know whether there are features that heavily influence whether a specific transaction is a fraud. However, it is important that we use the correct dataframe (the balanced subsample) in order to see which features have a high positive or negative correlation with fraud transactions.
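A heatmap of all pairwise correlations can be hard to read, so it may help to look only at the column that relates each feature to the label. The sketch below does this on a small synthetic frame; the column names `feat_neg`, `feat_pos` and `feat_noise` are made up for the illustration and do not exist in the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=500)

# Synthetic balanced frame: one feature falls with the label, one rises, one is unrelated
toy = pd.DataFrame({
    "Class": label,
    "feat_neg": -2.0 * label + rng.normal(size=500),
    "feat_pos": 2.0 * label + rng.normal(size=500),
    "feat_noise": rng.normal(size=500),
})

# Correlation of every feature with the label, sorted from most negative to most positive
class_corr = toy.corr()["Class"].drop("Class").sort_values()
print(class_corr)
```

Applied to the notebook's data, the same pattern would be `new_df.corr()["Class"]`, assuming `new_df` is the balanced subsample built above.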

In [None]:
#for Original Data frame
plt.figure(figsize=(20,20))

sns.heatmap(df.corr(),annot=True,cmap='coolwarm_r')

In [None]:
#For the new sampled (balanced) dataframe
plt.figure(figsize=(20,20))

sns.heatmap(new_df.corr(),annot=True,cmap='coolwarm_r')

Negative Correlations: V17, V16, V14, V12 and V10 are negatively correlated with the class. Notice how the lower these values are, the more likely the end result will be a fraud transaction.
Positive Correlations: V2, V4, V11, and V19 are positively correlated with the class. Notice how the higher these values are, the more likely the end result will be a fraud transaction.
BoxPlots: We will use boxplots to get a better understanding of the distribution of these features in fraudulent and non-fraudulent transactions.

#Negative Correlation

In [None]:
f, axes = plt.subplots(ncols=5, figsize=(20,4))

# Negative Correlations with our Class (The lower our feature value the more likely it will be a fraud transaction)
sns.boxplot(x="Class", y="V17", data=new_df, ax=axes[0])
axes[0].set_title('V17 vs Class Negative Correlation')

sns.boxplot(x="Class", y="V16", data=new_df, ax=axes[1])
axes[1].set_title('V16 vs Class Negative Correlation')

sns.boxplot(x="Class", y="V14", data=new_df, ax=axes[2])
axes[2].set_title('V14 vs Class Negative Correlation')


sns.boxplot(x="Class", y="V12", data=new_df, ax=axes[3])
axes[3].set_title('V12 vs Class Negative Correlation')


sns.boxplot(x="Class", y="V10", data=new_df, ax=axes[4])
axes[4].set_title('V10 vs Class Negative Correlation')

plt.show()

#Positive Correlation

In [None]:
f, axes = plt.subplots(ncols=4, figsize=(20,4))

# Positive Correlations with our Class (The higher our feature value the more likely it will be a fraud transaction)
sns.boxplot(x="Class", y="V2", data=new_df, ax=axes[0])
axes[0].set_title('V2 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V4", data=new_df, ax=axes[1])
axes[1].set_title('V4 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V11", data=new_df, ax=axes[2])
axes[2].set_title('V11 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V19", data=new_df, ax=axes[3])
axes[3].set_title('V19 vs Class Positive Correlation')

plt.show()

In [None]:
neg= ['V17','V16','V14','V12','V10']

f, axes = plt.subplots(ncols=len(neg), figsize=(20,4))
for i,j in enumerate(neg):
# Negative Correlations with our Class (The lower our feature value the more likely it will be a fraud transaction)
    sns.boxplot(x="Class", y=j, data=new_df, ax=axes[i])
    axes[i].set_title(j + ' vs Class Negative Correlation')

In [None]:
pos= ['V2','V4','V11','V19']

f, axes = plt.subplots(ncols=len(pos), figsize=(20,4))
for i,j in enumerate(pos):
    # Positive Correlations with our Class (The higher our feature value the more likely it will be a fraud transaction)
    sns.boxplot(x="Class", y=j, data=new_df, ax=axes[i])
    axes[i].set_title(j + ' vs Class Positive Correlation')

Boxplots are a standardized way of displaying the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).

median (Q2/50th percentile): the middle value of the dataset.

first quartile (Q1/25th percentile): the middle value between the smallest value (not the “minimum”) and the median of the dataset.

third quartile (Q3/75th percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.

interquartile range (IQR): the distance between the 25th and the 75th percentile, i.e. Q3 - Q1.

whiskers: the lines extending from the box to the most extreme data points that still lie within the “minimum” and “maximum”.

outliers: points drawn individually beyond the whiskers.

“maximum”: Q3 + 1.5*IQR

“minimum”: Q1 - 1.5*IQR

extreme outliers: points lying 3*IQR or more beyond Q1 or Q3.
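The boxplot limits described above can be computed directly. The following is a minimal sketch using only the standard library; the helper `iqr_fences` and the `sample` list are hypothetical and only illustrate the arithmetic.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return the boxplot "minimum" and "maximum": Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper = iqr_fences(sample)

# Points outside the fences are the ones a boxplot draws individually
outliers = [v for v in sample if v < lower or v > upper]
print(lower, upper, outliers)
```

A smaller `k` narrows the fences and flags more points, while a larger `k` (for example 3) keeps only the most extreme ones.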

#Anomaly Detection
Visualize Distributions: We first start by visualizing the distribution of the features we are going to use to eliminate some of the outliers. Of V14, V12 and V10, V14 is the only one whose distribution looks approximately Gaussian.
Determining the threshold: After we decide which number to multiply the IQR by (the lower the multiplier, the more outliers are removed), we determine the lower and upper thresholds as q25 - threshold (lower extreme threshold) and q75 + threshold (upper extreme threshold).
Conditional Dropping: Lastly, we create a conditional drop stating that if a value exceeds the threshold at either extreme, the instance is removed.
Boxplot Representation: Visualize through the boxplot that the number of "extreme outliers" has been reduced considerably.

In [None]:
from scipy.stats import norm

f, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(20, 6))

v14_fraud_dist = new_df['V14'].loc[new_df['Class'] == 1].values
sns.distplot(v14_fraud_dist,ax=ax1, fit=norm, color='#FB8861')
ax1.set_title('V14 Distribution \n (Fraud Transactions)', fontsize=14)

v12_fraud_dist = new_df['V12'].loc[new_df['Class'] == 1].values
sns.distplot(v12_fraud_dist,ax=ax2, fit=norm, color='#56F9BB')
ax2.set_title('V12 Distribution \n (Fraud Transactions)', fontsize=14)


v10_fraud_dist = new_df['V10'].loc[new_df['Class'] == 1].values
sns.distplot(v10_fraud_dist,ax=ax3, fit=norm, color='#C5B3F9')
ax3.set_title('V10 Distribution \n (Fraud Transactions)', fontsize=14)

plt.show()

#Outliers Treatment
As we saw in the boxplots, all the features correlated with the class have outliers, so we now treat them using the IQR together with the lower bound (lb) and upper bound (ub).

In [None]:
df2= new_df.copy() # work on an explicit copy so that new_df keeps the original data
treat= ['V14','V12','V10']
for j in treat:
    q25,q75= new_df[j].quantile(q=0.25),new_df[j].quantile(q=0.75)
    iqr= q75-q25
    cut_off= iqr*1.5
    lb,ub= q25-cut_off,q75+cut_off
    outliers= [x for x in new_df[j] if x<=lb or x>=ub]
    print(j,'Q25: {} , Q75: {}, IQR: {}, Cutoff: {}, LB: {}, UB: {},'.format(q25,q75,iqr,cut_off,lb,ub))
    print(len(outliers), outliers)
    df2= df2.drop(df2[(df2[j] > ub) | (df2[j]< lb)].index, axis=0)  # filter on the current feature j, not always V14
    print(df2.shape)
    print('----' * 44)


# Tried imputing outliers.
#     lb,ub= new_df[j].quantile(q=0.05),new_df[j].quantile(q=0.95)
#     outliers= [x for x in new_df[j] if x>ub or x<lb]
#     print(len(outliers),outliers)
#     func= (lambda x: x if x>=ub or x<=lb else new_df[j].mean())
#     df2[j]= df2[j].apply(func)

In [None]:
df2.shape

In [None]:
from collections import Counter
Counter(df2['Class'])

In [None]:
f, axes = plt.subplots(ncols=len(treat), figsize=(20,4))
for i,j in enumerate(treat):
    # Boxplots of the treated features after the outliers were removed
    sns.boxplot(x="Class", y=j, data=df2, ax=axes[i])
    axes[i].set_title(j)

In [None]:
from sklearn.model_selection import train_test_split,cross_val_score,cross_val_predict
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

x=df2.drop('Class',axis=1).values
y= df2.loc[:,'Class'].values

# Splitting the test and train after removing outliers

xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=42)

#Fitting the models and calculating test and training score

In [None]:
classifier= {
    'Logistic Regression':LogisticRegression(),
    'KNN':KNeighborsClassifier(),
    'SVC':SVC(),
    'DecisionTree':DecisionTreeClassifier()
}

import warnings
warnings.filterwarnings('ignore')

In [None]:
for key,values in classifier.items():
    # Cross-validated scores on the undersampled training split
    training_score= cross_val_score(values,xtrain,ytrain,cv=5)
    print('Training accuracy score of {} is {}'.format(key,round(training_score.mean()*100,2)))
    train_pred = cross_val_predict(values, xtrain, ytrain, cv=5)
    print('Roc_Auc training score for {} is {}: '.format(key, round(roc_auc_score(ytrain,train_pred)*100,2)))
    # Fit once on the training split and score that fitted model on the held-out test split
    values.fit(xtrain,ytrain)
    test_pred = values.predict(xtest)
    print('Test accuracy score of {} is {}'.format(key,round(values.score(xtest,ytest)*100,2)))
    print('Roc_Auc test score for {} is {}: '.format(key, round(roc_auc_score(ytest,test_pred)*100,2)))
    print('---'*30)

#Calculating for original data

In [None]:
for key,values in classifier.items():
    # Fit on the undersampled training split, then score the fitted model on the untouched original test set
    values.fit(xtrain,ytrain)
    test_pred = values.predict(xorgtest.values)
    print('Test accuracy score of {} is {}'.format(key,round(values.score(xorgtest.values,yorgtest)*100,2)))
    print('Roc_Auc test score for {} is {}: '.format(key, round(roc_auc_score(yorgtest,test_pred)*100,2)))
    print('---'*30)

#Hyper Parameter Tuning using GridSearchCv

In [None]:
# Use GridSearchCV to find the best parameters.
from sklearn.model_selection import GridSearchCV


# Logistic Regression
log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
knears_params = {"n_neighbors": list(range(2,5,1)), 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
svc_params = {'C': [0.5, 0.7, 0.9, 1], 'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
tree_params = {"criterion": ["gini", "entropy"], "max_depth": list(range(2,4,1)),
              "min_samples_leaf": list(range(5,7,1))}


classifier= {
    'Logistic Regression':LogisticRegression(),
    'KNN':KNeighborsClassifier(),
    'SVC':SVC(),
    'DecisionTree':DecisionTreeClassifier()
}

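def grid_search(classifier,Param):
    # Exhaustive search over Param using GridSearchCV's default cross validation
    grid = GridSearchCV(classifier,param_grid=Param)
    grid.fit(xtrain, ytrain)
    best_estimator = grid.best_estimator_
    print('{} algorithm best estimator is : {}'.format(classifier.__class__.__name__,best_estimator))



# 'liblinear' supports both the l1 and l2 penalties in the grid; the default 'lbfgs' solver rejects l1
grid_search(LogisticRegression(solver='liblinear'),log_reg_params)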

In [None]:
log_reg = LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                             intercept_scaling=1, l1_ratio=None, max_iter=100,
                             multi_class='ovr', n_jobs=None, penalty='l1',
                             random_state=None, solver='liblinear', tol=0.0001,
                             verbose=0, warm_start=False)
log_reg.fit(xtrain,ytrain)

In [None]:
log_reg_pred = cross_val_predict(log_reg, xtest, ytest, cv=5,method="decision_function")

In [None]:
from sklearn.metrics import roc_curve, auc
fpr,tpr,threshold= roc_curve(ytest,log_reg_pred)

In [None]:
plt.figure(figsize=(15,10))
plt.plot([0,1],[0,1],'r--')
plt.plot(fpr,tpr,'g')
plt.title('Auc Score is :'+str(auc(fpr,tpr)))

#Deeper look into the logistic regression classifier.

True Positives: Fraud transactions correctly classified as fraud
False Positives: Non-fraud transactions incorrectly classified as fraud
True Negatives: Non-fraud transactions correctly classified as non-fraud
False Negatives: Fraud transactions incorrectly classified as non-fraud
Precision: True Positives/(True Positives + False Positives), i.e. the ratio of correctly predicted positives to the total predicted positives.
Recall or Sensitivity: True Positives/(True Positives + False Negatives), i.e. the percentage of fraud that is correctly identified.
F1 Score = 2*(Recall * Precision) / (Recall + Precision), the harmonic mean of Precision and Recall.
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP), i.e. the percentage of non-fraud that is correctly identified.
Precision, as the name says, tells how precise (how sure) our model is when it flags a transaction as fraud, while recall is the share of fraud cases our model is able to detect.
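The definitions above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper `classification_metrics` and the counts passed to it are invented for illustration and are not results from this notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics defined above from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        'precision': precision,
        'recall': recall,
        'specificity': specificity,
        'accuracy': accuracy,
        'f1': f1,
    }

# Made-up counts: 100 frauds (80 caught, 20 missed) and 900 non-frauds (20 wrongly flagged)
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print('{}: {:.3f}'.format(name, value))
```

With these counts accuracy is well above precision and recall, because the many true negatives dominate it.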

Precision/Recall Tradeoff: The more precise (selective) our model is, the fewer cases it will detect. Example: suppose the model only flags a transaction when it is at least 95% confident that it is fraud, and only 5 fraud cases reach that level. Suppose there are 5 more fraud cases that the model scores at 90%. If we lower the threshold to 90%, the model detects those cases too, so recall goes up, but more non-fraud transactions are likely to be flagged as well, so precision tends to go down.
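The tradeoff can be made concrete by sweeping a decision threshold over a handful of scored transactions. The scores and labels below are invented, and `precision_recall_at` is a hypothetical helper written for this sketch.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when every score >= threshold is flagged as fraud."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0   # nothing flagged: no false alarms
    recall = tp / (tp + fn)
    return precision, recall

# Invented model scores (probability of fraud) and true labels (1 = fraud)
scores = [0.97, 0.96, 0.95, 0.92, 0.91, 0.60, 0.40, 0.30]
labels = [1, 1, 1, 0, 1, 1, 0, 0]

for threshold in (0.95, 0.90, 0.50):
    precision, recall = precision_recall_at(threshold, scores, labels)
    print('threshold={:.2f} precision={:.2f} recall={:.2f}'.format(threshold, precision, recall))
```

Lowering the threshold catches more of the frauds in this toy data, at the price of flagging a genuine transaction as well.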

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score

# Define and configure the Logistic Regression model
log_reg = LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                             intercept_scaling=1, l1_ratio=None, max_iter=100,
                             multi_class='ovr', n_jobs=None, penalty='l1',
                             random_state=None, solver='liblinear', tol=0.0001,
                             verbose=0, warm_start=False)

# Fit the model with the training data
log_reg.fit(xtrain, ytrain)

# Predict on the test set
y_pred = log_reg.predict(xtest)

# Calculate and print evaluation metrics
print('Recall Score: {:.2f}'.format(recall_score(ytest, y_pred)))
print('Precision Score: {:.2f}'.format(precision_score(ytest, y_pred)))
print('F1 Score: {:.2f}'.format(f1_score(ytest, y_pred)))
print('Accuracy Score: {:.2f}'.format(accuracy_score(ytest, y_pred)))


#Testing on original dataset

In [None]:
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
y_pred = log_reg.predict(xorgtest)

print('Recall Score: {:.2f}'.format(recall_score(yorgtest, y_pred)))
print('Precision Score: {:.2f}'.format(precision_score(yorgtest, y_pred)))
print('F1 Score: {:.2f}'.format(f1_score(yorgtest, y_pred)))
print('Accuracy Score: {:.2f}'.format(accuracy_score(yorgtest, y_pred)))

When predicting on the original test set the model gives unacceptable results, which shows that our model is overfitted. This overfitting occurred because we did the sampling before cross validating.

So we will do the undersampling during cross validation, which is the correct way to evaluate down-sampling.

#Overfitting during Cross Validation:
In our undersample analysis I want to share a common mistake I made. It is simple: if you want to undersample or oversample your data, you should not do it before cross validating. Why? Because you will be directly influencing the validation set before implementing cross-validation, causing a "data leakage" problem. You will see amazing precision and recall scores, but in reality the model is overfitting!

Below I am doing the undersampling during cross validation, but the scores still won't be good, because undersampling loses information.

If we resample the data (for example by creating synthetic points for the minority class, "Fraud" in our case) before cross validating, we have a certain influence on the "validation set" of the cross validation process. Remember how cross validation works: let's assume we are splitting the data into 5 batches, 4/5 of the dataset will be the training set while 1/5 will be the validation set. The test set should not be touched! For that reason, the resampling has to happen "during" cross-validation and not before, just like below:
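Before the full version, here is a stripped-down sketch of the idea on synthetic data: the resampling is applied to the training part of each fold only, and the validation part keeps its original imbalance. The `undersample` helper, the synthetic data from `make_classification` and all variable names are assumptions made for this illustration; plain random undersampling is used instead of NearMiss to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold

def undersample(X, y, rng):
    """Randomly drop majority-class (0) rows until both classes have equal size."""
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([minority_idx, kept_majority])
    return X[keep], y[keep]

# Synthetic imbalanced data standing in for the real transactions
X_demo, y_demo = make_classification(n_samples=5000, n_features=10,
                                     weights=[0.98, 0.02], random_state=0)
rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # Resample the training part of the fold only
    X_bal, y_bal = undersample(X_demo[train_idx], y_demo[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    # Validate on the untouched, still imbalanced part of the fold
    val_scores = model.decision_function(X_demo[val_idx])
    fold_scores.append(average_precision_score(y_demo[val_idx], val_scores))

print('Mean average precision over folds: {:.3f}'.format(np.mean(fold_scores)))
```

Because the validation rows are never resampled, the reported score reflects the class ratio the model would face in practice.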

In [None]:
import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.under_sampling import NearMiss
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score, roc_auc_score

# Assuming df is your DataFrame containing the data
undersample_X = df.drop('Class', axis=1)
undersample_y = df['Class']

sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)

for train_index, test_index in sss.split(undersample_X, undersample_y):
    print("Train:", train_index, "Test:", test_index)
    undersample_Xtrain, undersample_Xtest = undersample_X.iloc[train_index], undersample_X.iloc[test_index]
    undersample_ytrain, undersample_ytest = undersample_y.iloc[train_index], undersample_y.iloc[test_index]

undersample_Xtrain = undersample_Xtrain.values
undersample_Xtest = undersample_Xtest.values
undersample_ytrain = undersample_ytrain.values
undersample_ytest = undersample_ytest.values

undersample_accuracy = []
undersample_precision = []
undersample_recall = []
undersample_f1 = []
undersample_auc = []

# Implementing NearMiss Technique
# Distribution of NearMiss (Just to see how it distributes the labels we won't use these variables)
X_nearmiss, y_nearmiss = NearMiss().fit_resample(undersample_X.values, undersample_y.values)
print('NearMiss Label Distribution: {}'.format(Counter(y_nearmiss)))

# Define parameter grid for Logistic Regression
log_reg_params = {
    "penalty": ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'solver': ['liblinear', 'saga'],  # Only solvers that support 'l1' penalty
    'multi_class': ['ovr', 'auto']  # Valid values for multi_class
}

rand_log_reg1 = RandomizedSearchCV(LogisticRegression(), log_reg_params, n_iter=4)

# Cross Validating the right way
for train, test in sss.split(undersample_Xtrain, undersample_ytrain):
    undersample_pipeline = imbalanced_make_pipeline(NearMiss(sampling_strategy='majority'), rand_log_reg1)
    undersample_model = undersample_pipeline.fit(undersample_Xtrain[train], undersample_ytrain[train])
    undersample_prediction = undersample_model.predict(undersample_Xtrain[test])

    undersample_accuracy.append(undersample_pipeline.score(undersample_Xtrain[test], undersample_ytrain[test]))
    undersample_precision.append(precision_score(undersample_ytrain[test], undersample_prediction))
    undersample_recall.append(recall_score(undersample_ytrain[test], undersample_prediction))
    undersample_f1.append(f1_score(undersample_ytrain[test], undersample_prediction))
    undersample_auc.append(roc_auc_score(undersample_ytrain[test], undersample_prediction))

print('--' * 45)
print("accuracy: {}".format(np.mean(undersample_accuracy)))
print("precision: {}".format(np.mean(undersample_precision)))
print("recall: {}".format(np.mean(undersample_recall)))
print("f1: {}".format(np.mean(undersample_f1)))
print("AUC: {}".format(np.mean(undersample_auc)))
print('--' * 45)


In [None]:
from sklearn.metrics import classification_report
labels = ['No Fraud', 'Fraud']
nm_prediction = rand_log_reg1.best_estimator_.predict(undersample_Xtest)
print(classification_report(undersample_ytest, nm_prediction, target_names=labels))

In [None]:
best= rand_log_reg1.best_estimator_
y_score = best.decision_function(xorgtest)

from sklearn.metrics import average_precision_score
average_precision = average_precision_score(yorgtest, y_score)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))

fig = plt.figure(figsize=(12,6))

from sklearn.metrics import precision_recall_curve

precision, recall, threshold = precision_recall_curve(yorgtest, y_score)

plt.step(recall, precision, color='r', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
                 color='#F59B00')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Under Sampling Precision-Recall curve: \n Average Precision-Recall Score ={0:0.2f}'.format(
          average_precision), fontsize=16)

#Now we will increase our data points so that we don't lose any information

SMOTE stands for Synthetic Minority Over-sampling Technique. Unlike Random UnderSampling, SMOTE creates new synthetic points in order to have an equal balance of the classes. This is another alternative for solving the "class imbalance problem".

Understanding SMOTE:

Solving the Class Imbalance: SMOTE creates synthetic points from the minority class in order to reach an equal balance between the minority and majority class.
Location of the synthetic points: SMOTE takes a minority-class point and one of its closest minority-class neighbors, and creates a synthetic point somewhere on the line segment between them.
Final Effect: More information is retained since we don't have to delete any rows, unlike in random undersampling.
Accuracy vs. Time Tradeoff: Although it is likely that SMOTE will be more accurate than random under-sampling, it will take more time to train since no rows are eliminated, as previously stated.
SMOTE will be done "during" cross validation and not "prior" to the cross validation process. Synthetic data are created only for the training set without affecting the validation set.
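The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the implementation used by imbalanced-learn; the function `smote_sketch`, its parameters and the toy points are assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points between minority points and their nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority points (a point is not its own neighbour)
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base point, one of its k neighbours, and a random position on the segment between them
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Four toy minority points on the corners of the unit square
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(corners, n_new=10, k=2)
print(synthetic)
```

Every synthetic point lies on a segment between two real minority points, so no value outside the range of the observed minority data is produced.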

In [None]:

from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE


print('Length of X (train): {} | Length of y (train): {}'.format(len(xorgtrain), len(yorgtrain)))
print('Length of X (test): {} | Length of y (test): {}'.format(len(xorgtest), len(yorgtest)))

# List to append the score and then find the average
accuracy_lst = []
precision_lst = []
recall_lst = []
f1_lst = []
auc_lst = []

# Parameter grid for the randomized search, defined before it is used.
# 'liblinear' supports both the l1 and l2 penalties.
log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'solver': ['liblinear']}
rand_log_reg = RandomizedSearchCV(LogisticRegression(), log_reg_params, n_iter=4)

sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
# Implementing SMOTE Technique
# Cross Validating the right way
for train, test in sss.split(xorgtrain, yorgtrain):
    pipeline = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'), rand_log_reg)
    # SMOTE happens during Cross Validation not before..
    model = pipeline.fit(xorgtrain.iloc[train], yorgtrain.iloc[train])
    best_est = rand_log_reg.best_estimator_
    prediction = best_est.predict(xorgtrain.iloc[test])

    accuracy_lst.append(pipeline.score(xorgtrain.iloc[test], yorgtrain.iloc[test]))
    precision_lst.append(precision_score(yorgtrain.iloc[test], prediction))
    recall_lst.append(recall_score(yorgtrain.iloc[test], prediction))
    f1_lst.append(f1_score(yorgtrain.iloc[test], prediction))
    auc_lst.append(roc_auc_score(yorgtrain.iloc[test], prediction))

print('---' * 45)
print('')
print("accuracy: {}".format(np.mean(accuracy_lst)))
print("precision: {}".format(np.mean(precision_lst)))
print("recall: {}".format(np.mean(recall_lst)))
print("f1: {}".format(np.mean(f1_lst)))
print('---' * 45)

In [None]:
from sklearn.metrics import classification_report
labels = ['No Fraud', 'Fraud']
smote_prediction = best_est.predict(xorgtest)
print(classification_report(yorgtest, smote_prediction, target_names=labels))

In [None]:

y_score = best_est.decision_function(xorgtest)

from sklearn.metrics import average_precision_score
average_precision = average_precision_score(yorgtest, y_score)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))

fig = plt.figure(figsize=(12,6))

from sklearn.metrics import precision_recall_curve

precision, recall, threshold = precision_recall_curve(yorgtest, y_score)

plt.step(recall, precision, color='r', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
                 color='#F59B00')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('OverSampling Precision-Recall curve: \n Average Precision-Recall Score ={0:0.2f}'.format(
          average_precision), fontsize=16)

#Neural network testing for Under and Over Sample data.

I am building a simple network with two hidden layers and will use it for both the undersampled and the oversampled data.

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense

import warnings
warnings.filterwarnings('ignore')

In [None]:
classifier= Sequential()
classifier.add(Dense(15, activation='relu',kernel_initializer='uniform',input_shape=(30,)))
classifier.add(Dense(15, activation='relu',kernel_initializer='uniform'))
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='uniform' ))

In [None]:
classifier.summary()

In [None]:
classifier.compile(optimizer='adam',loss= ['binary_crossentropy'],metrics=['accuracy'])

In [None]:
from imblearn.under_sampling import NearMiss

# Undersample only the original training split so that xorgtest stays unseen during training
X = xorgtrain
y = yorgtrain

# Applying NearMiss technique to undersample the data
X_nearmiss, y_nearmiss = NearMiss().fit_resample(X, y)

from keras.models import Sequential
from keras.layers import Dense

# Define a simple neural network model
classifier = Sequential()
classifier.add(Dense(units=16, activation='relu', input_dim=X_nearmiss.shape[1]))
classifier.add(Dense(units=8, activation='relu'))
classifier.add(Dense(units=1, activation='sigmoid'))

# Compile the model
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Fit the model with the undersampled data
classifier.fit(X_nearmiss, y_nearmiss, batch_size=10, epochs=100)


In [None]:
undersample_pred_prob = classifier.predict(xorgtest, batch_size=200, verbose=0)

In [None]:
from sklearn.metrics import confusion_matrix

# Predict on the test set and convert probabilities to class labels
undersample_pred = (classifier.predict(xorgtest, batch_size=200, verbose=0) > 0.5).astype("int32")

# Calculate and print the confusion matrix
conf_matrix = confusion_matrix(yorgtest, undersample_pred)
print(conf_matrix)


In [None]:
print(classification_report(yorgtest, undersample_pred, target_names=labels))

In [None]:
# Use the predicted probabilities (not the thresholded labels) so the curve covers all thresholds
precision, recall, threshold = precision_recall_curve(yorgtest, undersample_pred_prob.ravel())

plt.step(recall, precision, color='r', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
                 color='#F59B00')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('UnderSampling Precision-Recall curve: \n Average Precision-Recall Score ={0:0.2f}'.format(
          average_precision_score(yorgtest, undersample_pred_prob.ravel())), fontsize=16)

In [None]:
# SMOTE technique (oversampling), applied to the original training split only
from imblearn.over_sampling import SMOTE

sm = SMOTE(sampling_strategy='minority', random_state=42)
Xsm_train, ysm_train = sm.fit_resample(xorgtrain, yorgtrain)

# Continue training the Keras network on the oversampled data
classifier.fit(Xsm_train, ysm_train, batch_size=200, epochs=100)

In [None]:
oversample_pred = (classifier.predict(xorgtest, batch_size=200, verbose=0) > 0.5).astype("int32")
conf_matrix = confusion_matrix(yorgtest, oversample_pred)
print(conf_matrix)


In [None]:
# Use the predicted probabilities (not the thresholded labels) so the curve covers all thresholds
oversample_pred_prob = classifier.predict(xorgtest, batch_size=200, verbose=0).ravel()
precision, recall, threshold = precision_recall_curve(yorgtest, oversample_pred_prob)

plt.step(recall, precision, color='r', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
                 color='#F59B00')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('OverSampling Precision-Recall curve: \n Average Precision-Recall Score ={0:0.2f}'.format(
          average_precision_score(yorgtest, oversample_pred_prob)), fontsize=16)

In [None]:
print(classification_report(yorgtest, oversample_pred, target_names=labels))