<a href="https://colab.research.google.com/github/Asmaaad37/Machine-Learning/blob/main/Credit_Card_Fraud.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
organizations_mlg_ulb_creditcardfraud_path = kagglehub.dataset_download('organizations/mlg-ulb/creditcardfraud')

print('Data source import complete.')


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Introduction to our Dataset**

In this notebook, I'll use various models and check how accurately they detect whether a transaction is a normal payment or a fraud.

Let's Start Exploring!!

**Tasks:**

* Build a sub-dataframe with a 50/50 ratio of "Fraud" and "Non-Fraud" transactions (we'll use the NearMiss algorithm).
* Experiment with different classifiers and determine their accuracy.
* Also determine the accuracy with a neural net.


# **Outline**

***I. Understanding our data***

a) Gather Sense of our data

***II. Preprocessing***

a) Scaling and Distributing

b) Splitting the Data


***III. Random UnderSampling and Oversampling***

a) Distributing and Correlating

b) Anomaly Detection

c) Dimensionality Reduction and Clustering (t-SNE)

d) Classifiers

***What more can be done to this Notebook!***

e) A Deeper Look into Logistic Regression

f) Oversampling with SMOTE

***IV. Testing***

a) Testing with Logistic Regression

b) Neural Networks Testing (Undersampling vs Oversampling)

**What we already know!!**

* The transaction amounts are relatively small, with the average transaction amount being approximately USD 88.
* There are no missing values in the dataset; therefore, no imputation is required.
* The majority of transactions (99.83%) are non-fraudulent, while fraudulent transactions account for only 0.17% of the total dataset.

# References:
* https://www.kaggle.com/code/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets/notebook - I learned some really helpful insights from this notebook by **janiobachmann**

# Coding

In [None]:
# Importing Libraries

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
# dimensionality reduction and visualization
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
import matplotlib.patches as mpatches
import time

# Classifier Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import collections


# Other Libraries
# Data Splitting & Pipelines
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
# Handling Imbalanced Data
from imblearn.over_sampling import SMOTE # Over-sampling
from imblearn.under_sampling import NearMiss # Under-sampling
# Performance Metrics
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
# Class Distribution
from collections import Counter
# Cross-Validation
from sklearn.model_selection import KFold, StratifiedKFold

import warnings
warnings.filterwarnings("ignore")

# Getting some sense of our data

In [None]:
df = pd.read_csv('../input/creditcardfraud/creditcard.csv')

In [None]:
df.head()

In [None]:
# Let's check some statistics about our data
df.describe()

In [None]:
# Shape and info
print("Shape of our Dataset is: ", df.shape)
print()
df.info()

In [None]:
# The output above shows that there is not a single null value in our dataframe.
# We can double-check by counting the nulls per column.
df.isnull().sum()

In [None]:
# Columns in the dataset
df.columns

In [None]:
class_counts = df['Class'].value_counts(normalize=True) * 100
print(f"No Frauds: {class_counts[0]:.2f}% of the dataset")
print(f"Frauds: {class_counts[1]:.2f}% of the dataset")


This confirms that the dataset is heavily imbalanced. If we train our models on this dataframe as it is, they will most probably be biased toward the majority class and simply assume that most transactions are not fraud, so a high accuracy score would be misleading.
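
To see why plain accuracy is misleading here, consider a minimal sketch with synthetic labels (not the real dataset). The 0.17% fraud rate comes from the numbers above; the sample size and the "always predict non-fraud" baseline are assumptions made purely for illustration.

```python
import numpy as np

# Synthetic labels with roughly the fraud rate reported above (0.17%).
# The sample size is an arbitrary choice for illustration.
n_samples = 100_000
n_fraud = 170  # 0.17% of 100,000
y_true = np.zeros(n_samples, dtype=int)
y_true[:n_fraud] = 1

# A "model" that always predicts the majority class (non-fraud).
y_pred = np.zeros(n_samples, dtype=int)

accuracy = (y_pred == y_true).mean()
# Recall = detected frauds / all actual frauds.
recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"Accuracy: {accuracy:.4f}")
print(f"Fraud recall: {recall:.4f}")
```

The baseline scores very high on accuracy while detecting no fraud at all, which is why recall, precision and F1 are tracked later on.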

In [None]:
# Let's visualize our fraud and Non-Fraud classes.
class_palette = {0: "#0101DF", 1: "#DF0101"}

sns.countplot(x="Class", data=df, palette=class_palette)

plt.title("Class Distribution\n(0: No Fraud | 1: Fraud)", fontsize=14)

plt.xlabel("Transaction Classes")
plt.ylabel("Count")

plt.show();

Plotting the distributions gives us an idea of how skewed these features are.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18, 4))

columns = ["Amount", "Time"]
colors = ["red", "blue"]

# Creating distribution plot using loop
for i, col in enumerate(columns):
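    # histplot replaces the deprecated distplot; bins=50 keeps the skewed Amount plot fast
    sns.histplot(df[col], ax=axes[i], kde=True, stat="density", bins=50, color=colors[i])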
    axes[i].set_title(f"Distribution of transaction {col}", fontsize=14)
    axes[i].set_xlabel(f"{col} Value")
    axes[i].set_ylabel("Density")
    axes[i].set_xlim(df[col].min(), df[col].max())


plt.show()

# SubSampling and Scaling
* Scaled amount and scaled time are the columns with scaled values.
* There are 492 cases of fraud in our dataset so we can randomly get 492 cases of non-fraud to create our new sub dataframe.
* We concatenate the 492 fraud cases and the 492 non-fraud cases to create a new sub-sample.
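
As a rough illustration of what the scaled columns contain, here is a minimal sketch of robust scaling done by hand: center on the median and divide by the interquartile range, which is what `RobustScaler` does with its default settings. The helper `robust_scale` and the sample amounts are hypothetical.

```python
import numpy as np

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    q1, q3 = np.percentile(values, [25, 75])
    return (values - median) / (q3 - q1)

# Hypothetical transaction amounts with one extreme value.
amounts = np.array([1.0, 5.0, 10.0, 20.0, 2500.0])
scaled = robust_scale(amounts)
print(scaled)
```

Because the median and IQR barely move when a single extreme amount is added, the typical values stay on a comparable scale while the extreme one remains clearly separated.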

# Why Create Sub-Samples?

1️⃣ **Overfitting to the Majority Class:**

* The model may learn to classify most transactions as non-fraudulent, failing to detect actual fraud cases.
* Our goal is to improve fraud detection accuracy rather than just maximizing overall classification accuracy.

2️⃣ **Misleading Feature Importance & Correlations:**

* Since we do not know the exact meaning of the "V" features, we rely on statistical relationships to understand their impact.
* If the dataset is imbalanced, these relationships may be skewed, preventing the model from learning meaningful correlations.

In [None]:
# SubSampling and Scaling

from sklearn.preprocessing import RobustScaler

columns_to_scale = ["Amount", "Time"]

# Robust scaler uses median and IQR, less sensitive to outliers.
scaler = RobustScaler()

df[["Scaled_Amount", "Scaled_Time"]] = scaler.fit_transform(df[columns_to_scale])

df.drop(columns=columns_to_scale, inplace=True)

In [None]:
# Reordering the columns
columns_order = ["Scaled_Amount", "Scaled_Time"] + [col for col in df.columns if col not in ["Scaled_Amount", "Scaled_Time"]]
df = df[columns_order]

df.head()
# Amount & Time - Scaled now

**Splitting the Data (Original DataFrame)**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

# Display Class Distributions
class_counts = df["Class"].value_counts(normalize=True) * 100
print(f"No Frauds: {class_counts[0]:.2f}% of the dataset")
print(f"Frauds: {class_counts[1]:.2f}% of the dataset")

# Define features and target
X, y = df.drop(columns=["Class"]), df["Class"]

# Stratified k-Fold for balanced train-test split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_idx, test_idx = next(skf.split(X, y)) # Getting the first split

# Create train-test sets
original_Xtrain, original_Xtest = X.iloc[train_idx].values, X.iloc[test_idx].values
original_ytrain, original_ytest = y.iloc[train_idx].values, y.iloc[test_idx].values

# Check label distribution
for label, counts in zip(["Train", "Test"], [original_ytrain, original_ytest]):
    unique, count = np.unique(counts, return_counts=True)
    print(f'{label} Label Distribution: {count / len(counts)}')



**Random Under-Sampling:**

In [None]:
df["Class"].value_counts()

In [None]:
from sklearn.utils import shuffle
# Shuffle data
df = shuffle(df, random_state=42)

# Separate Fraud and Non-Fraud Classes
fraud_df = df[df["Class"] == 1]
non_fraud_df = df[df["Class"] == 0].sample(n=len(fraud_df), random_state=42) # Match Fraud Count

# Combine and Shuffle
new_df = shuffle(pd.concat([fraud_df, non_fraud_df]), random_state=42)

In [None]:
new_df.head()

**Data Preprocessing**

In [None]:
# Class Distributions
class_distributions = new_df["Class"].value_counts(normalize=True)
print(f"Distribution of the Classes in the subsample dataset:\n{class_distributions}")

In [None]:
# Visualizing Class Distributions
plt.figure(figsize=(6, 4))
sns.countplot(x='Class', data=new_df, palette=['#1f77b4', '#ff7f0e'])
plt.title("Equally Distributed Classes", fontsize=14)
plt.xlabel("Class (0: No Fraud) | 1: Fraud")
plt.ylabel("Count")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# **Correlation Matrices**

In [None]:
# Set up figure with two subplots
fig, axes = plt.subplots(nrows=2, figsize=(18, 16))

# Correlation matrix for imbalanced dataset
sns.heatmap(df.corr(), cmap='coolwarm', annot=False, ax=axes[0])
axes[0].set_title("Imbalanced Correlation Matrix (Not Reliable)", fontsize=16)

# Correlation Matrix for balanced dataset
sns.heatmap(new_df.corr(), cmap='coolwarm', annot=False, ax=axes[1])
axes[1].set_title("Subsample Correlation Matrix (More Reliable)", fontsize=16)

plt.tight_layout()
plt.show()

# **BoxPlots for Negative Correlations**

In [None]:
# Define negatively correlated features
neg_corr_features = ["V17", "V14", "V12", "V10"]

# Create figure with subplots
fig, axes = plt.subplots(ncols=len(neg_corr_features), figsize=(18, 5))

# Boxplots
for i, feature in enumerate(neg_corr_features):
    sns.boxplot(x="Class", y=feature, data=new_df, palette=["#0101DF", "#DF0101"], ax=axes[i])
    axes[i].set_title(f"{feature} vs Class Negative Correlation")


plt.tight_layout()
plt.show()

In [None]:
# Define positively correlated features
pos_corr_features = ["V2", "V4", "V11", "V19"]

# Create figure with subplots
fig, axes = plt.subplots(ncols=len(pos_corr_features), figsize=(18, 5))

# Boxplots
for i, feature in enumerate(pos_corr_features):
    sns.boxplot(x="Class", y=feature, data=new_df, palette=["#0101DF", "#DF0101"], ax=axes[i])
    axes[i].set_title(f"{feature} vs Class Positive Correlation")


plt.tight_layout()
plt.show()

# Outliers Detection: (Only Extreme Ones)
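
The cell below only visualizes the fraud distributions. One common way to flag extreme values, shown here as an assumption rather than something this notebook applies, is the IQR rule: anything further than `k` times the interquartile range from the quartiles is treated as an outlier. The helper `iqr_bounds`, the multiplier `k=1.5` and the sample values are hypothetical.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) cut-offs; points outside are flagged as outliers."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

# Hypothetical feature values for fraud cases (not taken from the dataset).
v = np.array([-19.0, -9.5, -8.0, -7.0, -6.5, -6.0, -5.0, -4.0])
lower, upper = iqr_bounds(v)
outliers = v[(v < lower) | (v > upper)]
print(f"Bounds: [{lower:.2f}, {upper:.2f}] -> outliers: {outliers}")
```

A larger `k` flags only the most extreme points, which matches the "only extreme ones" intent of this section.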

In [None]:
from scipy.stats import norm

# Define features to visualize
features = ["V14", "V12", "V10"]
colors = ["#FB8861", "#56F9BB", "#C5B3F9"]

# Fig and Subplots
fig, axes = plt.subplots(1, len(features), figsize=(18, 6))

for i, feature in enumerate(features):
    fraud_dist = new_df[feature].loc[new_df["Class"] == 1].values

    # Histogram
    sns.histplot(fraud_dist, ax=axes[i], kde=True, stat="density", color=colors[i], bins=30)

    # Overlaying a normal distribution curve
    xmin, xmax = axes[i].get_xlim()
    x = np.linspace(xmin, xmax, 100)
    p = norm.pdf(x, np.mean(fraud_dist), np.std(fraud_dist))
    axes[i].plot(x, p, 'k', linewidth=2)   # Normal Curve

    axes[i].set_title(f"{feature} Distribution \n (Fraud Transactions)", fontsize=14)

plt.tight_layout()
plt.show()

# Dimensionality Reduction and Clustering:

* In the t-SNE projection, fraud and non-fraud cases form largely separate clusters, even with a small subsample.
* Shuffling the dataset before applying t-SNE is a way to check that this separation does not depend on the row order.
* The separation suggests that predictive models should be able to distinguish fraudulent transactions reasonably well.

**Note**: If you want a simple instructive video, look at "StatQuest: t-SNE, Clearly Explained" by Joshua Starmer.

In [None]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE
from matplotlib.ticker import FuncFormatter

# Define dimensionality reduction techniques
methods = {
    "t-SNE": TSNE(n_components=2, random_state=42),
    "PCA": PCA(n_components=2, random_state=42),
    "Truncated SVD": TruncatedSVD(n_components=2, algorithm="randomized", random_state=42),
}

X = new_df.drop('Class', axis=1).values
y = new_df['Class'].values

# Plot setup
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
colors = {0: 'blue', 1: 'red'}  # Non-Fraud = Blue, Fraud = Red

for i, (name, model) in enumerate(methods.items()):
    t0 = time.time()
    X_reduced = model.fit_transform(X)
    t1 = time.time()

    print(f"{name} took {t1 - t0:.2f} seconds")

    # Scatter plot of reduced dimensions
    scatter = axes[i].scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap="coolwarm", alpha=0.6)
    axes[i].set_title(f"{name} Projection", fontsize=14)
    axes[i].set_xlabel("Component 1")
    axes[i].set_ylabel("Component 2")

# Add legend
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', label='Non-Fraud', markersize=8, markerfacecolor='blue'),
                    plt.Line2D([0], [0], marker='o', color='w', label='Fraud', markersize=8, markerfacecolor='red')],
           loc="upper right")

plt.tight_layout()
plt.show()



# Training Classifiers

**Summary of Classifiers (UnderSampling):**

* **Four classifiers** are trained to identify the most effective model for fraud detection.
* **Logistic Regression** outperforms the other classifiers in most cases and is selected for further analysis.

In [None]:
X = new_df.drop('Class', axis=1)
y = new_df['Class']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values


classifiers = {
    "LogisticRegression": LogisticRegression(),
    "KNearest": KNeighborsClassifier(),
    "Support Vector Classifier": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier()
}

from sklearn.model_selection import cross_val_score
from tqdm import tqdm  # For progress bar

# Store results in a list of dictionaries (one per model)
results = []

# Iterate over classifiers
for key, classifier in tqdm(classifiers.items(), desc="Training Models"):
    classifier.fit(X_train, y_train)  # Train model
    scores = cross_val_score(classifier, X_train, y_train, cv=5)  # 5-fold CV

    # Store results
    results.append({
        "Model": classifier.__class__.__name__,
        "Mean Accuracy": round(scores.mean() * 100, 2),
        "Std Deviation": round(scores.std() * 100, 2)  # Model stability
    })

# Convert results into DataFrame and sort by accuracy
results_df = pd.DataFrame(results).sort_values(by="Mean Accuracy", ascending=False)
print(results_df)

# Finding Optimal Hyperparameters.

1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Support Vector Classifier (SVC)
4. Decision Tree Classifier

**GridSearchCV** exhaustively tries every combination of the given parameter values and selects the one with the best cross-validated performance. Its major disadvantage is that this is computationally expensive.

**RandomizedSearchCV** is a faster alternative: it randomly samples a fixed number of parameter combinations and usually finds near-optimal results in less time.

If computational resources are not a concern, **GridSearchCV** guarantees the best-scoring combination within the grid.

So, for these classifiers I'll be using *RandomizedSearchCV* to save time and resources.
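
To make the cost difference concrete, here is a minimal sketch that only counts candidates. It does not reproduce scikit-learn's internals; the search space below is a hypothetical one shaped like a decision-tree grid, and `n_iter` and `cv_folds` are illustrative values.

```python
import itertools
import random

# Hypothetical search space, shaped like a decision-tree grid.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(2, 10)),
    "min_samples_leaf": list(range(1, 10)),
}

# Exhaustive search: every combination of every value.
keys = list(param_grid)
all_combos = [dict(zip(keys, values)) for values in itertools.product(*param_grid.values())]

# Randomized search: a fixed budget of combinations drawn without replacement.
n_iter = 10
rng = random.Random(42)
sampled_combos = rng.sample(all_combos, n_iter)

cv_folds = 5
print(f"Grid search:       {len(all_combos)} candidates, {len(all_combos) * cv_folds} fits")
print(f"Randomized search: {len(sampled_combos)} candidates, {len(sampled_combos) * cv_folds} fits")
```

The number of model fits grows with the product of all value lists for the exhaustive search, while the randomized budget stays fixed at `n_iter` times the number of folds.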

In [None]:
from sklearn.model_selection import RandomizedSearchCV

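# Logistic Regression
# liblinear supports both the l1 and l2 penalties (the default lbfgs solver rejects l1)
log_reg_params = {"penalty": ['l1', 'l2'], 'C': np.logspace(-3, 3, 7)}
random_log_reg = RandomizedSearchCV(LogisticRegression(solver='liblinear'), log_reg_params, n_iter=10, random_state=42)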
random_log_reg.fit(X_train, y_train)
log_reg = random_log_reg.best_estimator_

# K-Nearest Neighbors
knears_params = {"n_neighbors": list(range(2, 11)), 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
random_knears = RandomizedSearchCV(KNeighborsClassifier(), knears_params, n_iter=10, random_state=42)
random_knears.fit(X_train, y_train)
knears_neighbors = random_knears.best_estimator_

# Support Vector Classifier
svc_params = {'C': np.linspace(0.1, 2, 10), 'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
random_svc = RandomizedSearchCV(SVC(), svc_params, n_iter=10, random_state=42)
random_svc.fit(X_train, y_train)
svc = random_svc.best_estimator_

# Decision Tree Classifier
tree_params = {"criterion": ["gini", "entropy"], "max_depth": list(range(2, 10)),
              "min_samples_leaf": list(range(1, 10))}
random_tree = RandomizedSearchCV(DecisionTreeClassifier(), tree_params, n_iter=10, random_state=42)
random_tree.fit(X_train, y_train)
tree_clf = random_tree.best_estimator_

In [None]:
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Stratified K-Fold for better class balance in each fold
strat_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# hyperparameter search space for each model
param_distributions = {
    'log_reg': {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100]},
    'knears': {'n_neighbors': range(2, 10), 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']},
    'svc': {'C': [0.1, 1, 10, 100], 'kernel': ['rbf', 'linear']},
    'tree': {'criterion': ['gini', 'entropy'], 'max_depth': range(2, 10), 'min_samples_leaf': range(1, 5)}
}

# Dictionary of models
models = {
    'log_reg': LogisticRegression(solver='liblinear'),  # liblinear supports both l1 and l2 penalties
    'knears': KNeighborsClassifier(),
    'svc': SVC(),
    'tree': DecisionTreeClassifier()
}

# Perform RandomizedSearchCV for each model
best_estimators = {}
for name, model in models.items():
    search = RandomizedSearchCV(model, param_distributions[name], cv=strat_kfold, n_iter=10, scoring='accuracy', random_state=42, n_jobs=-1)
    search.fit(X_train, y_train)
    best_estimators[name] = search.best_estimator_

# Evaluate the best models
for name, model in best_estimators.items():
    score = cross_val_score(model, X_train, y_train, cv=strat_kfold).mean()
    print(f'{name.upper()} Best Model Cross Validation Score: {round(score * 100, 2)}%')


If we had used *GridSearchCV* the results could have been different!

In [None]:
# Overfitting Case

log_reg_score = cross_val_score(log_reg, X_train, y_train, cv=5)
print(f"Logistic Regression Cross Validation Score: {log_reg_score.mean() * 100:.2f}%")

knears_score = cross_val_score(knears_neighbors, X_train, y_train, cv=5)
print(f"Knears Neighbors Cross Validation Score: {knears_score.mean() * 100:.2f}%")

svc_score = cross_val_score(svc, X_train, y_train, cv=5)
print(f"Support Vector Classifier Cross Validation Score: {svc_score.mean() * 100:.2f}%")

tree_score = cross_val_score(tree_clf, X_train, y_train, cv=5)
print(f"DecisionTree Classifier Cross Validation Score: {tree_score.mean() * 100:.2f}%")

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

In [None]:
# We will undersample during cross validating
undersample_X = df.drop('Class', axis=1)
undersample_y = df['Class']

# Define StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

for train_index, test_index in sss.split(undersample_X, undersample_y):
    print("Train:", train_index, "Test:", test_index)
    undersample_Xtrain, undersample_Xtest = undersample_X.iloc[train_index], undersample_X.iloc[test_index]
    undersample_ytrain, undersample_ytest = undersample_y.iloc[train_index], undersample_y.iloc[test_index]

# Convert DataFrames to NumPy arrays
undersample_Xtrain = undersample_Xtrain.values
undersample_Xtest = undersample_Xtest.values
undersample_ytrain = undersample_ytrain.values
undersample_ytest = undersample_ytest.values

# Lists to store metrics
undersample_accuracy = []
undersample_precision = []
undersample_recall = []
undersample_f1 = []
undersample_auc = []

# # Applying NearMiss (fit_resample replaces the removed fit_sample method)
# X_nearmiss, y_nearmiss = NearMiss().fit_resample(undersample_X.values, undersample_y.values)
# print('NearMiss Label Distribution: {}'.format(Counter(y_nearmiss)))

# Cross-validating the right way: NearMiss is applied inside each fold, not before splitting
for train, test in sss.split(undersample_Xtrain, undersample_ytrain):
    undersample_pipeline = imbalanced_make_pipeline(NearMiss(sampling_strategy='majority'), log_reg)
    undersample_model = undersample_pipeline.fit(undersample_Xtrain[train], undersample_ytrain[train])
    undersample_prediction = undersample_model.predict(undersample_Xtrain[test])

    # Score against the labels of the same rows that were predicted
    undersample_accuracy.append(undersample_pipeline.score(undersample_Xtrain[test], undersample_ytrain[test]))
    undersample_precision.append(precision_score(undersample_ytrain[test], undersample_prediction))
    undersample_recall.append(recall_score(undersample_ytrain[test], undersample_prediction))
    undersample_f1.append(f1_score(undersample_ytrain[test], undersample_prediction))
    undersample_auc.append(roc_auc_score(undersample_ytrain[test], undersample_prediction))

In [None]:
# Print final results
print("Mean Accuracy:", np.mean(undersample_accuracy))
print("Mean Precision:", np.mean(undersample_precision))
print("Mean Recall:", np.mean(undersample_recall))
print("Mean F1 Score:", np.mean(undersample_f1))
print("Mean ROC AUC Score:", np.mean(undersample_auc))
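
As a reminder of how the metrics printed above relate to each other, here is a minimal sketch that derives precision, recall and F1 from confusion-matrix counts. The helper `precision_recall_f1` and the counts are hypothetical; they mimic a model that catches most frauds but also flags many normal transactions, which is a typical risk when an undersampled model is scored on imbalanced data.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts: most frauds are caught, but many normal transactions are flagged too.
precision, recall, f1 = precision_recall_f1(tp=90, fp=910, fn=10)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```

High recall combined with low precision pulls F1 down, so all three should be read together rather than relying on any single number.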

In [None]:
# Let's plot the learning curves of all four classifiers
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve


def plot_learning_curve(estimator1, estimator2, estimator3, estimator4, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Plot training and cross-validation learning curves for four estimators on a 2x2 grid."""
    f, axes = plt.subplots(2, 2, figsize=(20, 14), sharey=True)
    if ylim is not None:
        plt.ylim(*ylim)  # the y-axis is shared, so the limits apply to every panel

    panels = [
        (estimator1, "Logistic Regression Learning Curve"),
        (estimator2, "Knears Neighbors Learning Curve"),
        (estimator3, "Support Vector Classifier \n Learning Curve"),
        (estimator4, "Decision Tree Classifier \n Learning Curve"),
    ]

    for ax, (estimator, title) in zip(axes.ravel(), panels):
        sizes, train_scores, test_scores = learning_curve(
            estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)

        # Shaded bands show one standard deviation around the mean score
        ax.fill_between(sizes, train_scores_mean - train_scores_std,
                        train_scores_mean + train_scores_std, alpha=0.1, color="#ff9124")
        ax.fill_between(sizes, test_scores_mean - test_scores_std,
                        test_scores_mean + test_scores_std, alpha=0.1, color="#2492ff")
        ax.plot(sizes, train_scores_mean, 'o-', color="#ff9124", label="Training score")
        ax.plot(sizes, test_scores_mean, 'o-', color="#2492ff", label="Cross-validation score")
        ax.set_title(title, fontsize=14)
        ax.set_xlabel('Training size (m)')
        ax.set_ylabel('Score')
        ax.grid(True)
        ax.legend(loc="best")
    return plt

In [None]:
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
plot_learning_curve(log_reg, knears_neighbors, svc, tree_clf, X_train, y_train, (0.87, 1.01), cv=cv, n_jobs=4);

In [None]:
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

# Collect cross-validated predictions (decision scores where available) for each classifier.

log_reg_pred = cross_val_predict(log_reg, X_train, y_train, cv=5,
                             method="decision_function")

knears_pred = cross_val_predict(knears_neighbors, X_train, y_train, cv=5)

svc_pred = cross_val_predict(svc, X_train, y_train, cv=5,
                             method="decision_function")

tree_pred = cross_val_predict(tree_clf, X_train, y_train, cv=5)

**ROC_AUC Score**
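
ROC AUC can be read as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. The sketch below computes it that way by hand; the helper `pairwise_auc`, the labels, the scores and the 0.5 threshold are all hypothetical. It also shows that feeding thresholded labels instead of continuous scores (as happens for classifiers predicted without `decision_function`) discards ranking information and can change the result.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (fraud, non-fraud) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and continuous scores (e.g. from decision_function).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.45, 0.8]
hard_labels = [int(s >= 0.5) for s in scores]  # thresholded predictions

auc_scores = pairwise_auc(y_true, scores)
auc_labels = pairwise_auc(y_true, hard_labels)
print(auc_scores, auc_labels)
```

The pairwise loop is quadratic in the number of samples, so it is meant for intuition only; `roc_auc_score` is the practical choice on real data.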

In [None]:
from sklearn.metrics import roc_auc_score

print('Logistic Regression: ', roc_auc_score(y_train, log_reg_pred))
print('KNears Neighbors: ', roc_auc_score(y_train, knears_pred))
print('Support Vector Classifier: ', roc_auc_score(y_train, svc_pred))
print('Decision Tree Classifier: ', roc_auc_score(y_train, tree_pred))