<a href="https://colab.research.google.com/github/Palak730/creditcarddefaultprediction/blob/main/Copy_of_Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Credit Card Default Prediction




##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -**  Palak Srivastava


# **Project Summary -**

The aim of this study is to use supervised machine learning algorithms to identify the key drivers of credit card default, with attention to the mathematical aspects behind the methods used. Credit card default happens when a cardholder becomes severely delinquent on credit card payments. In order to increase market share, card-issuing banks in Taiwan over-issued cash and credit cards to unqualified applicants. At the same time, most cardholders, irrespective of their repayment ability, overused credit cards for consumption and accumulated heavy debts.

The goal is to build an automated model that both identifies the key factors and predicts credit card default from information about the client and their historical transactions. The general concepts of the supervised machine learning paradigm are reported later, together with the techniques and algorithms used to build the models. In particular, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost and Support Vector Machine classifiers have been applied.

# **GitHub Link -**

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import os
import warnings

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical data visualization
import xgboost as xgb
from pandas import set_option
from scipy.stats import randint
from imblearn.over_sampling import SMOTE # oversampling of the minority class
from sklearn import metrics # error and accuracy checks
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import (GridSearchCV, KFold, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.neighbors import KNeighborsClassifier # KNN
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler # standardization (zero mean, unit variance)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

warnings.filterwarnings("ignore", category=FutureWarning)

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
file_path = '/content/drive/MyDrive/'
# os.path.join avoids the doubled slash produced by plain string concatenation
data = pd.read_csv(os.path.join(file_path, 'creditcardclients.csv'))

### Dataset First View

In [None]:
# Dataset First Look
data.head()

In [None]:
# Convert integer columns to float
float_columns = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5',
                 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4',
                 'PAY_AMT5', 'PAY_AMT6', 'LIMIT_BAL']
data[float_columns] = data[float_columns].astype(float)


### **Dataset Information**

What we know about the dataset:

We have records of 30,000 customers. Below is a description of each feature.

ID: ID of each client

LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)

SEX: Gender (1 = male, 2 = female)

EDUCATION: (1 = graduate school, 2 = university, 3 = high school, 0,4,5,6 = others)

MARRIAGE: Marital status (0 = others, 1 = married, 2 = single, 3 = others)

AGE: Age in years

Scale for PAY_0 to PAY_6: (-2 = no consumption, -1 = paid in full, 0 = use of revolving credit (paid minimum only), 1 = payment delay for one month, 2 = payment delay for two months, ... 8 = payment delay for eight months, 9 = payment delay for nine months and above)

PAY_0: Repayment status in September, 2005 (scale same as above)

PAY_2: Repayment status in August, 2005 (scale same as above)

PAY_3: Repayment status in July, 2005 (scale same as above)

PAY_4: Repayment status in June, 2005 (scale same as above)

PAY_5: Repayment status in May, 2005 (scale same as above)

PAY_6: Repayment status in April, 2005 (scale same as above)

BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)

BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)

PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)

PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)

PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)

PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)

PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)

default payment next month: Default payment (1 = yes, 0 = no)

The dataset contains each customer's credit card history for the past six months, on the basis of which we have to predict whether the customer will default.

So let's begin.

In [None]:
data.info()

In [None]:
data.describe(include='all')

First, we check whether the dataset contains any null values.

In [None]:
data.isnull().sum()

## **Data Preprocessing**

**Changing name of some columns for simplicity and better understanding**

In [None]:
#renaming columns
data.rename(columns = {'default payment next month' : 'Defaulter'  },inplace=True )
data.rename(columns={'PAY_0':'PAY_SEPT','PAY_2':'PAY_AUG','PAY_3':'PAY_JUL','PAY_4':'PAY_JUN','PAY_5':'PAY_MAY','PAY_6':'PAY_APR'},inplace=True)
data.rename(columns={'BILL_AMT1':'BILL_AMT_SEPT','BILL_AMT2':'BILL_AMT_AUG','BILL_AMT3':'BILL_AMT_JUL','BILL_AMT4':'BILL_AMT_JUN','BILL_AMT5':'BILL_AMT_MAY','BILL_AMT6':'BILL_AMT_APR'}, inplace = True)
data.rename(columns={'PAY_AMT1':'PAY_AMT_SEPT','PAY_AMT2':'PAY_AMT_AUG','PAY_AMT3':'PAY_AMT_JUL','PAY_AMT4':'PAY_AMT_JUN','PAY_AMT5':'PAY_AMT_MAY','PAY_AMT6':'PAY_AMT_APR'},inplace=True)

In [None]:
data.replace({'SEX': {1 : 'MALE', 2 : 'FEMALE'},
               'EDUCATION' : {1 : 'graduate school', 2 : 'university', 3 : 'high school', 4 : 'others'},
               'MARRIAGE' : {1 : 'married', 2 : 'single', 3 : 'others'}}, inplace = True)

In [None]:
data.head()

# ***Exploratory Data Analysis***

## **Calculate the frequency of defaults and non-defaults credit cards**

In [None]:
# Calculate the frequency of defaults and non-defaults
yes = data['Defaulter'].sum()
no = len(data) - yes

# Calculate the percentage of defaults and non-defaults
total = len(data)
yes_perc = round(yes / total * 100, 1)
no_perc = round(no / total * 100, 1)

# Set the figure size and context
plt.figure(figsize=(7, 4))
sns.set_context('notebook', font_scale=1.2)

# Create the count plot
sns.countplot(x='Defaulter', data=data, palette="Blues")

# Annotate the counts and percentages
plt.annotate('Non-default: {}'.format(no), xy=(-0.3, 15000), xytext=(-0.3, 3000), size=12)
plt.annotate('Default: {}'.format(yes), xy=(0.7, 15000), xytext=(0.7, 3000), size=12)
plt.annotate(str(no_perc) + "%", xy=(-0.3, 15000), xytext=(-0.1, 8000), size=12)
plt.annotate(str(yes_perc) + "%", xy=(0.7, 15000), xytext=(0.9, 8000), size=12)

# Set the title and axis labels
plt.title('COUNT OF CREDIT CARDS', size=14)
plt.xlabel('Default')
plt.ylabel('Count')

# Remove the top and right spines
sns.despine()

# Show the plot
plt.show()


In this sample of 30,000 credit card holders, 6,636 cards defaulted; that is, the proportion of defaults in the data is 22.1%.
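
The percentage follows directly from the two counts quoted in the text. A minimal check using only those figures (the variable names are illustrative):

```python
# Figures quoted in the text above
total_clients = 30_000
default_clients = 6_636

default_rate = default_clients / total_clients
non_default_rate = 1 - default_rate

print(f"Default rate: {default_rate:.1%}")          # Default rate: 22.1%
print(f"Non-default rate: {non_default_rate:.1%}")  # Non-default rate: 77.9%
```

One consequence of this imbalance is that a model which always predicts "no default" would already reach about 77.9% accuracy, so accuracy alone is a weak yardstick for this problem.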

## **SEX**


In [None]:
data['SEX'].value_counts()

## **Education**


In [None]:
# Group the undocumented codes 0, 5 and 6 (and any remaining 4) under 'others'
condition = data['EDUCATION'].isin([0, 4, 5, 6])
data.loc[condition, 'EDUCATION'] = 'others'

# Print the updated value counts for each category in the 'EDUCATION' column
print(data['EDUCATION'].value_counts())

The codes without descriptions (0, 5, 6) are grouped into the "others" category, which is later encoded as 4.

## **Marriage**



In [None]:
# Group the undocumented code 0 (and any remaining 3) under 'others' using .loc
condition = data['MARRIAGE'].isin([0, 3])
data.loc[condition, 'MARRIAGE'] = 'others'
data['MARRIAGE'].value_counts()

A few records have the value 0, which is not defined, so they are added to the "others" category.

# **Plotting our categorical features**

In [None]:
categorical_features = ['SEX', 'EDUCATION', 'MARRIAGE']

In [None]:
# Create a new DataFrame 'df_cat' containing only the categorical features
df_cat = data[categorical_features].copy()

# Use .loc to set the 'Defaulter' column in 'df_cat' based on the 'Defaulter' column in the original 'data' DataFrame
df_cat.loc[:, 'Defaulter'] = data['Defaulter']

In [None]:
for col in categorical_features:
    # Left: share of each category; right: category counts split by default status
    fig, axes = plt.subplots(ncols=2, figsize=(13, 8))
    data[col].value_counts().plot(kind="pie", ax=axes[0])
    sns.countplot(x=col, hue='Defaulter', data=df_cat, ax=axes[1])
    plt.show()

## Observations for categorical features:

Gender: In absolute numbers, more of the credit card holders who defaulted are female.

Education: In absolute numbers, most defaulters have completed graduate school or university.

Marital Status: In absolute numbers, more defaulters are single than married.

These observations compare counts. Because the categories may differ in size, the default rate within each category would be needed to say which group is more likely to default.
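
A count plot shows how many defaulters fall into each category, which is not the same as how likely a member of that category is to default. The sketch below uses a small hypothetical table (not the credit card data) to show how the two can point in opposite directions; the frame `toy` and its values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the credit card dataset
toy = pd.DataFrame({
    "SEX": ["FEMALE"] * 6 + ["MALE"] * 2,
    "Defaulter": [1, 1, 0, 0, 0, 0, 1, 0],
})

# Number of defaulters per group (what a count plot displays)
default_counts = toy.groupby("SEX")["Defaulter"].sum()

# Share of each group that defaulted (the default rate)
default_rates = toy.groupby("SEX")["Defaulter"].mean()

print(default_counts)
print(default_rates)
```

In this toy table the larger group has more defaulters but the lower default rate. Applying the same `groupby(...).mean()` to `data` with `'Defaulter'` as the value column would give the per-category default rates for the real dataset.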

### Limit Balance

In [None]:
data['LIMIT_BAL'].max()


In [None]:
data['LIMIT_BAL'].min()

In [None]:
data['LIMIT_BAL'].describe()

In [None]:
sns.barplot(x='Defaulter', y='LIMIT_BAL', data=data)

In [None]:
# Explore the numerical features using histograms and box plots
numerical_features = ['LIMIT_BAL', 'AGE']

for feature in numerical_features:
    plt.figure(figsize=(8, 4))
    plt.subplot(1, 2, 1)
    plt.hist(data[feature], bins=20, color='skyblue', edgecolor='black')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.title(f'{feature} Distribution')

    plt.subplot(1, 2, 2)
    sns.boxplot(x='Defaulter', y=feature, data=data, palette='pastel')
    plt.xlabel('Default')
    plt.ylabel(feature)
    plt.title(f'{feature} by Defaulters')
    plt.xticks(ticks=[0, 1], labels=['No', 'Yes'])
    plt.tight_layout()
    plt.show()


1. The 'LIMIT_BAL' histogram shows that a significant number of credit card holders have lower credit limits, while the 'AGE' histogram suggests a relatively uniform distribution across various age groups.

2. The box plots indicate that the median credit limit for customers who defaulted is slightly lower than those who did not default, but there are no significant differences in the median ages between defaulters and non-defaulters.

## **AGE**



In [None]:
# Cast first, then display the counts (only the last expression of a cell is shown)
data['AGE'] = data['AGE'].astype('int')
data['AGE'].value_counts()


In [None]:
# Create a bar plot for the age distribution
plt.figure(figsize=(12, 6))
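# Name the columns explicitly; the defaults of value_counts().reset_index() differ between pandas versions
age_counts_df = data['AGE'].value_counts().rename_axis('Age').reset_index(name='Count')
sns.barplot(x='Age', y='Count', data=age_counts_df, palette='pastel')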
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution - Bar Plot')
plt.xticks(rotation=90)
plt.show()

# Value counts for AGE split by Defaulter
plt.figure(figsize=(20,8))
sns.countplot(x = 'AGE', hue = 'Defaulter', data = data)

## **BILL AMOUNT**

Creating a pair plot to visualize the pairwise relationships and distributions among the bill amount columns from the original dataset.

In [None]:
# Selecting the bill amount columns
bill_amnt_df = data[['BILL_AMT_SEPT', 'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY', 'BILL_AMT_APR']]

# Create a pair plot for the bill amount columns
sns.pairplot(data=bill_amnt_df)
plt.show()

**History payment status**

The count plots visualize the distribution of payment status for defaulters and non-defaulters in different months, providing insights into how payment behavior relates to the likelihood of defaulting.

In [None]:
# Selecting the payment status columns
pay_col = ['PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR']

# Create count plots for the payment status columns
for col in pay_col:
    plt.figure(figsize=(10, 5))
    sns.countplot(x=col, hue='Defaulter', data=data, palette='dark')
    plt.xlabel(f'{col} (Repayment Status)')
    plt.ylabel('Count')
    plt.title(f'Payment Status Distribution - {col}')
    plt.legend(title='Defaulter', labels=['No', 'Yes'])
    plt.show()


### Correlation Heatmap

In [None]:
# Calculate and visualize correlations between features
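# SEX, EDUCATION and MARRIAGE hold strings at this point, so restrict to numeric columns
correlation_matrix = data.corr(numeric_only=True)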
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, cmap='coolwarm', annot=True, fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()


## **Label Encoding**

In [None]:

data.replace({'SEX':{'FEMALE': 0, 'MALE' : 1},
               'EDUCATION' : { 'graduate school' : 1 ,  'university' :2 , 'high school' : 3,  'others' : 4},
               'MARRIAGE' : { 'married' : 1,  'single':2,  'others' : 3}}, inplace = True)

In [None]:
data.info()

## **One Hot Encoding**

In [None]:
# Perform one-hot encoding for the 'EDUCATION' and 'MARRIAGE' columns
data = pd.get_dummies(data, columns=["EDUCATION", "MARRIAGE"])

In [None]:
data.head()

In [None]:
data.drop(['EDUCATION_4', 'MARRIAGE_3'], axis=1, inplace=True)
data.head()

In [None]:
# Creating dummy variables, dropping the first level of each column
data = pd.get_dummies(data, columns = ['PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR'], drop_first = True )

In [None]:
data.head()
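
The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses a hypothetical three-row-level column (the frame `toy` is an assumption, not part of the dataset) and compares the full encoding with the reduced one.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"MARRIAGE": [1, 2, 3, 1]})

# Full one-hot encoding: one indicator column per category
full = pd.get_dummies(toy, columns=["MARRIAGE"], dtype=int)

# Reduced encoding: the first category is dropped and becomes the baseline
reduced = pd.get_dummies(toy, columns=["MARRIAGE"], drop_first=True, dtype=int)

print(list(full.columns))     # ['MARRIAGE_1', 'MARRIAGE_2', 'MARRIAGE_3']
print(list(reduced.columns))  # ['MARRIAGE_2', 'MARRIAGE_3']
```

In the full encoding the indicator columns always sum to 1, so one of them is redundant. Dropping a level removes that redundancy, which matters most for linear models such as logistic regression.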

## ***6. Feature Engineering & Data Pre-processing***

Class imbalance is addressed here with SMOTE (Synthetic Minority Oversampling Technique), a common approach for imbalanced datasets. SMOTE generates synthetic samples for the minority class by interpolating between existing minority samples.
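
The interpolation step can be illustrated with plain NumPy. This is a simplified sketch of the idea, not imblearn's implementation; the toy points, the helper name `smote_like_sample` and the choice `k=2` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples with two features each
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [8.0, 9.0]])

def smote_like_sample(samples, index, k=2):
    """Create one synthetic point between samples[index] and one of its k nearest neighbours."""
    base = samples[index]
    distances = np.linalg.norm(samples - base, axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it
    neighbour_ids = np.argsort(distances)[1:k + 1]
    neighbour = samples[rng.choice(neighbour_ids)]
    gap = rng.random()  # interpolation weight in [0, 1)
    return base + gap * (neighbour - base)

synthetic = smote_like_sample(minority, index=0)
print(synthetic)
```

The synthetic point always lies on the line segment between an existing minority sample and one of its neighbours, which is why SMOTE does not create values outside the range spanned by the minority class.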

In [None]:
# Assumes the DataFrame 'data' holds the features plus the target column 'Defaulter'

# random_state makes the resampling reproducible
smote = SMOTE(random_state=42)

# Resample features and target so that both classes are equally represented
x_smote, y_smote = smote.fit_resample(data.drop(columns=['Defaulter']), data['Defaulter'])

print('Original unbalanced dataset shape', len(data))
print('Resampled balanced dataset shape', len(y_smote))


In [None]:
# Create the balanced DataFrame 'balanced_data' from the resampled data
balanced_data = pd.DataFrame(x_smote, columns=[col for col in data.columns if col != 'Defaulter'])
# Add the 'Defaulter' column to the balanced DataFrame
balanced_data['Defaulter'] = y_smote

# Print the shape of the balanced DataFrame
print(balanced_data.shape)

In [None]:
# Removing feature ID from dataset
balanced_data.drop('ID',axis = 1, inplace = True)


In [None]:
balanced_data.head()

In [None]:
# Separating dependent and independent variables
X = balanced_data.drop(columns=['Defaulter'])
y = balanced_data['Defaulter']

In [None]:
X.shape

In [None]:
y.shape

## Data Transformation

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

## Train Test Splitting

In [None]:
# Split the data into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.shape


In [None]:
X_test.shape
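
In the cells above, SMOTE and `StandardScaler` are applied to the full dataset before the split, so information derived from test rows can influence training. A commonly used alternative is to split first and fit the preprocessing on the training part only. The sketch below illustrates the ordering on synthetic data from `make_classification`; the data, the `*_demo` names and the class weights are assumptions, and the oversampling step is left out to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit card features (roughly 22% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=42
)

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training part only, then reuse it on the test part
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With this ordering, any oversampling would be applied to `X_tr_scaled` and `y_tr` only, so the test set keeps the original class distribution and contains no synthetic rows.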

# **Models used for predictions:**

Random Forest

Decision Tree

Logistic Regression

Gradient Boosting

XGBoost

SVM






### **Random Forest Classifier**

In [None]:
# Create and train the Random Forest Classifier model
rf_classifier = RandomForestClassifier(n_estimators=50, random_state=42)
rf_classifier.fit(X_train, y_train)

# Class prediction of y on test data
y_pred_rf = rf_classifier.predict(X_test)
y_train_pred_rf = rf_classifier.predict(X_train)


# Getting all scores for Random Forest Classifier on test data
train_accuracy_rf = accuracy_score(y_train, y_train_pred_rf)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_score_rf = precision_score(y_test, y_pred_rf)
recall_score_rf = recall_score(y_test, y_pred_rf)
f1_score_rf = f1_score(y_test, y_pred_rf)
roc_score_rf = roc_auc_score(y_test, y_pred_rf)

print("The accuracy on train data is:", round(train_accuracy_rf, 3))
print("The accuracy on test data is:", round(accuracy_rf, 3))
print("The precision on test data is:", round(precision_score_rf, 3))
print("The recall on test data is:", round(recall_score_rf, 3))
print("The f1 on test data is:", round(f1_score_rf, 3))
print("The roc_score on test data is:", round(roc_score_rf, 3))


In [None]:
# Random Forest
cm_rf_test = confusion_matrix(y_test, y_pred_rf)
# Print the confusion matrix for Random Forest
print("\nConfusion Matrix for Random Forest - Test:")
print(cm_rf_test)
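
Precision, recall and F1 can all be read off a confusion matrix. The sketch below works through the arithmetic on a small set of hypothetical labels (not the notebook's results) and also shows why the argument order `(y_true, y_pred)` matters for scikit-learn's metric functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only (1 = default, 0 = no default)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)  # share of predicted defaults that are real
recall_manual = tp / (tp + fn)     # share of real defaults that are caught
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

# sklearn expects (y_true, y_pred); swapping them exchanges precision and recall
precision_swapped = precision_score(y_pred_demo, y_true_demo)

print(precision_manual, recall_manual, f1_manual, precision_swapped)
```

Here precision is 2/3 and recall is 1/2, while the swapped call returns 1/2, the recall value. Accuracy is unaffected by the order, which makes this kind of mix-up easy to miss.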

### **Decision Tree Classifier**

In [None]:
# Create and train the Decision Tree Classifier model
dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(X_train, y_train)

# Class prediction of y on test data
y_pred_dt = dt_classifier.predict(X_test)
y_train_pred_dt = dt_classifier.predict(X_train)

# Getting all scores for Decision Tree Classifier (sklearn metrics expect y_true first, then y_pred)
train_accuracy_dt = accuracy_score(y_train, y_train_pred_dt)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_score_dt = precision_score(y_test, y_pred_dt)
recall_score_dt = recall_score(y_test, y_pred_dt)
f1_score_dt = f1_score(y_test, y_pred_dt)
roc_score_dt = roc_auc_score(y_test, y_pred_dt)

print("The accuracy on train data is:", round(train_accuracy_dt, 3))
print("The accuracy on test data is:", round(accuracy_dt, 3))
print("The precision on test data is:", round(precision_score_dt, 3))
print("The recall on test data is:", round(recall_score_dt, 3))
print("The f1 on test data is:", round(f1_score_dt, 3))
print("The roc_score on test data is:", round(roc_score_dt, 3))



In [None]:
# Decision Tree
cm_dt_test = confusion_matrix(y_test, y_pred_dt)
print("\nConfusion Matrix for Decision Tree - Test:")
print(cm_dt_test)

### **Logistic Regression**

In [None]:
# Create and train the Logistic Regression model
logi = LogisticRegression()
logi.fit(X_train, y_train)

# Class prediction of y
y_pred_logi = logi.predict(X_test)
y_train_pred_logi = logi.predict(X_train)

# Getting all scores for Logistic Regression
train_accuracy_logi = round(accuracy_score(y_train, y_train_pred_logi), 3)
accuracy_logi = round(accuracy_score(y_test, y_pred_logi), 3)
precision_score_logi = round(precision_score(y_test, y_pred_logi), 3)
recall_score_logi = round(recall_score(y_test, y_pred_logi), 3)
f1_score_logi = round(f1_score(y_test, y_pred_logi), 3)
roc_score_logi = round(roc_auc_score(y_test, y_pred_logi), 3)

print("The accuracy on train data is:", train_accuracy_logi)
print("The accuracy on test data is:", accuracy_logi)
print("The precision on test data is:", precision_score_logi)
print("The recall on test data is:", recall_score_logi)
print("The f1 on test data is:", f1_score_logi)
print("The roc_score on test data is: ", roc_score_logi)


In [None]:
# Logistic Regression
cm_logi_test = confusion_matrix(y_test, y_pred_logi)
#print confusion matrix for Logistic Regression
print("\nConfusion Matrix for Logistic Regression - Test:")
print(cm_logi_test)

### **Gradient Boosting**

In [None]:
# Create and train the Gradient Boosting Classifier model
gb_classifier = GradientBoostingClassifier( random_state=42)
gb_classifier.fit(X_train, y_train)

# Class prediction of y on test data
y_pred_gb = gb_classifier.predict(X_test)
y_train_pred_gb = gb_classifier.predict(X_train)

# Getting all scores for Gradient Boosting Classifier on test data
train_accuracy_gb = accuracy_score(y_train, y_train_pred_gb)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
precision_score_gb = precision_score(y_test, y_pred_gb)
recall_score_gb = recall_score(y_test, y_pred_gb)
f1_score_gb = f1_score(y_test, y_pred_gb)
roc_score_gb = roc_auc_score(y_test, y_pred_gb)

print("The accuracy on train data is:", round(train_accuracy_gb, 3))
print("The accuracy on test data is:", round(accuracy_gb, 3))
print("The precision on test data is:", round(precision_score_gb, 3))
print("The recall on test data is:", round(recall_score_gb, 3))
print("The f1 on test data is:", round(f1_score_gb, 3))
print("The roc_score on test data is:", round(roc_score_gb, 3))


In [None]:
# Gradient Boosting
cm_gb_test = confusion_matrix(y_test, y_pred_gb)
print("\nConfusion Matrix for Gradient Boosting - Test:")
print(cm_gb_test)


### **XGBoost Classifier**

In [None]:
# Train and predict using XGBoost
xgb_classifier = XGBClassifier(random_state=42)
xgb_classifier.fit(X_train, y_train)

# Class prediction of y on test data
y_pred_xgb = xgb_classifier.predict(X_test)
y_train_pred_xgb = xgb_classifier.predict(X_train) # Predict on train data

# Getting all scores for XGBoost Classifier (sklearn metrics expect y_true first, then y_pred)
train_accuracy_xgb = accuracy_score(y_train, y_train_pred_xgb)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
precision_score_xgb = precision_score(y_test, y_pred_xgb)
recall_score_xgb = recall_score(y_test, y_pred_xgb)
f1_score_xgb = f1_score(y_test, y_pred_xgb)
roc_score_xgb = roc_auc_score(y_test, y_pred_xgb)

print("The accuracy on train data is:", round(train_accuracy_xgb, 3))
print("The accuracy on test data is:", round(accuracy_xgb, 3))
print("The precision on test data is:", round(precision_score_xgb, 3))
print("The recall on test data is:", round(recall_score_xgb, 3))
print("The f1 on test data is:", round(f1_score_xgb, 3))
print("The roc_score on test data is:", round(roc_score_xgb, 3))



In [None]:
# XG Boosting
cm_xgb_test = confusion_matrix(y_test, y_pred_xgb)
print("\nConfusion Matrix for XG Boosting - Test:")
print(cm_xgb_test)


### **SVM Classifier**

In [None]:
# Create and train the SVM Classifier model
svm_classifier = SVC(kernel='linear', random_state=42)
svm_classifier.fit(X_train, y_train)

# Class prediction of y on test data
y_pred_svm = svm_classifier.predict(X_test)
y_train_pred_svm = svm_classifier.predict(X_train)

# Getting all scores for SVM Classifier on test data
train_accuracy_svm = accuracy_score(y_train, y_train_pred_svm)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
precision_score_svm = precision_score(y_test, y_pred_svm)
recall_score_svm = recall_score(y_test, y_pred_svm)
f1_score_svm = f1_score(y_test, y_pred_svm)
roc_score_svm = roc_auc_score(y_test, y_pred_svm)

print("The accuracy on train data is:", round(train_accuracy_svm, 3))
print("The accuracy on test data is:", round(accuracy_svm, 3))
print("The precision on test data is:", round(precision_score_svm, 3))
print("The recall on test data is:", round(recall_score_svm, 3))
print("The f1 on test data is:", round(f1_score_svm, 3))
print("The roc_score on test data is:", round(roc_score_svm, 3))


In [None]:
# SVM
cm_svm_test = confusion_matrix(y_test, y_pred_svm)
print("\nConfusion Matrix for SVM - Test:")
print(cm_svm_test)



## **Evaluating the models**

In [None]:
# Collect the evaluation metrics for each classifier
classifiers = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'SVM', 'Gradient Boosting', 'XG Boosting']
train_accuracy = [train_accuracy_logi, train_accuracy_dt, train_accuracy_rf, train_accuracy_svm, train_accuracy_gb, train_accuracy_xgb]
test_accuracy = [accuracy_logi, accuracy_dt, accuracy_rf, accuracy_svm, accuracy_gb, accuracy_xgb]
# The '_values' suffix keeps the sklearn functions precision_score, recall_score and f1_score from being overwritten
precision_values = [precision_score_logi, precision_score_dt, precision_score_rf, precision_score_svm, precision_score_gb, precision_score_xgb]
recall_values = [recall_score_logi, recall_score_dt, recall_score_rf, recall_score_svm, recall_score_gb, recall_score_xgb]
f1_values = [f1_score_logi, f1_score_dt, f1_score_rf, f1_score_svm, f1_score_gb, f1_score_xgb]
auc_values = [roc_score_logi, roc_score_dt, roc_score_rf, roc_score_svm, roc_score_gb, roc_score_xgb]


In [None]:
# Create a DataFrame to store the results
metrics_df = pd.DataFrame({
    'Classifier': classifiers,
    'Train Accuracy': train_accuracy,
    'Test Accuracy': test_accuracy,
    'Precision': precision_values,
    'Recall': recall_values,
    'F1 Score': f1_values,
    'AUC': auc_values
})

# Round off the values to 3 decimal places
metrics_df = metrics_df.round(3)
metrics_df

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

# Create a list of model names and classifiers
model_names = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'SVM', 'Gradient Boosting', 'XGBoost']
classifiers = [logi, dt_classifier, rf_classifier, svm_classifier, gb_classifier, xgb_classifier]

# Plot ROC AUC for each model
plt.figure(figsize=(8, 6))
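for name, clf in zip(model_names, classifiers):
    # SVC without probability=True has no predict_proba, so fall back to decision_function
    if hasattr(clf, 'predict_proba'):
        y_score = clf.predict_proba(X_test)[:, 1]
    else:
        y_score = clf.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_score)
    roc_auc = roc_auc_score(y_test, y_score)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})')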

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC AUC Curve for Different Classifiers')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
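
ROC AUC measures how well a model ranks positives above negatives, so it is normally computed from continuous scores (probabilities or decision values) rather than from hard class labels. The sketch below contrasts the two on synthetic data; the data, the model settings and the variable names are assumptions for illustration, not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

log_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
svm_model = SVC(kernel="linear").fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: the ROC curve collapses to a single operating point
auc_from_labels = roc_auc_score(y_te, log_model.predict(X_te))

# AUC from predicted probabilities: uses the full ranking of the test samples
auc_from_probs = roc_auc_score(y_te, log_model.predict_proba(X_te)[:, 1])

# SVC built without probability=True provides decision_function scores instead
auc_svm = roc_auc_score(y_te, svm_model.decision_function(X_te))

print(round(auc_from_labels, 3), round(auc_from_probs, 3), round(auc_svm, 3))
```

The per-model `roc_score_*` values earlier in the notebook are computed from hard predictions, so they are not directly comparable with the AUC values shown in the ROC plot legend.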




### Mean Accuracy (cross-validation)
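
A mean cross-validated accuracy can be obtained with `cross_val_score`, which is imported at the top of the notebook. The sketch below is one possible way to do it, using synthetic data and a small random forest as stand-ins; the five stratified folds, the `*_cv` names and the model settings are assumptions rather than the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared feature matrix and target
X_cv, y_cv = make_classification(n_samples=500, n_features=10, random_state=42)

# Five stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model_cv = RandomForestClassifier(n_estimators=50, random_state=42)

# One accuracy value per fold
fold_scores = cross_val_score(model_cv, X_cv, y_cv, cv=cv, scoring="accuracy")
mean_accuracy = fold_scores.mean()

print("Fold accuracies:", fold_scores.round(3))
print("Mean accuracy:", round(mean_accuracy, 3))
```

Reporting the spread of `fold_scores` alongside the mean gives a sense of how much the estimate depends on the particular split.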

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***