<a href="https://colab.research.google.com/github/Rishabhyadav888/credit_card_fraud_dedication/blob/main/Credit_card_fraud_dedicution_in_Taiwan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - 



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** Rishabh kumar yadav


# **Project Summary -**

The project aimed to predict the probability of default for Taiwanese credit card clients to minimize losses for banks and credit card companies. The classification problem involved predicting if a client would default or not in the next month. The study highlighted the importance of risk management and demonstrated how machine learning algorithms can improve predictive accuracy.

In the data preprocessing step, the ID column was dropped, and the numeric columns were converted to the integer data type. The values of the categorical columns were also changed from integer to string data type for better understanding. Some columns were renamed to make them more meaningful. Additionally, a new feature called "total due" was added, which represented the total amount left for payment. These steps were taken to improve the quality and usability of the data for machine learning analysis.

In the data visualization step, univariate, bivariate, and multivariate analyses were performed to understand the relationships between variables. The correlation heatmap showed that the monthly bill columns were strongly correlated with one another, and their VIF values were greater than 20, which indicates multicollinearity. To address this issue, the columns were added together into a new column called "total bill". This step helped to improve the accuracy of the machine learning algorithm by reducing the impact of multicollinearity on the analysis.
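The VIF check described above can be sketched with plain NumPy. This is only an illustration on synthetic data, not the notebook's own VIF code: the helper name `vif`, the toy columns `bill_a`, `bill_b`, and `other`, and the random seed are all assumptions. Each feature is regressed on the remaining features (with an intercept), and the VIF is 1 / (1 - R²).

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()           # R^2 of feature j on the others
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
base = rng.normal(size=500)
# Hypothetical stand-ins for two strongly correlated monthly bills
bill_a = base + rng.normal(scale=0.1, size=500)
bill_b = base + rng.normal(scale=0.1, size=500)
other = rng.normal(size=500)                       # an unrelated feature

vifs = vif(np.column_stack([bill_a, bill_b, other]))
print(vifs.round(1))        # the two bill columns get large VIFs

# Summing the correlated columns into one "total bill" removes the redundancy
vifs_total = vif(np.column_stack([bill_a + bill_b, other]))
print(vifs_total.round(1))  # both values are now close to 1
```

On the toy data the two correlated columns have large VIF values, while the summed column does not, which is the effect the "total bill" feature is meant to achieve.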

In the feature engineering step, category encoding was performed on several categorical columns. For sex, label encoding was used, and for marriage and education, one-hot encoding was used. In the feature selection step, recursive feature elimination (RFE) was employed, which involved the backward elimination of features based on their coefficient values by running different combinations. The data was scaled using StandardScaler. To handle the imbalanced data of the dependent variable, the SMOTE oversampling method was used. Finally, the data was split into training and testing sets using the train-test split method. These steps were taken to improve the quality of the data and prepare it for machine learning analysis.
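The encoding and scaling steps above can be sketched on a small toy frame. The values below are made up, the 0/1 assignment for sex is an assumption, and the scaling is written out by hand to show what StandardScaler computes (zero mean, unit variance using the population standard deviation).

```python
import pandas as pd

# Toy data with hypothetical values, using the notebook's column names
toy = pd.DataFrame({
    "SEX": ["Male", "Female", "Female", "Male"],
    "MARRIAGE": ["married", "single", "others", "single"],
    "LIMIT_BAL": [20000, 120000, 90000, 50000],
})

# Label encoding for the binary column (which label gets 0 or 1 is an assumption)
toy["SEX"] = toy["SEX"].map({"Female": 0, "Male": 1})

# One-hot encoding for a multi-class column: one 0/1 column per category
toy = pd.get_dummies(toy, columns=["MARRIAGE"], dtype=int)

# Standardization: subtract the mean and divide by the population std (ddof=0)
col = toy["LIMIT_BAL"]
toy["LIMIT_BAL"] = (col - col.mean()) / col.std(ddof=0)

print(toy)
```

Each row has exactly one of the `MARRIAGE_*` columns set to 1, and the scaled `LIMIT_BAL` column has mean 0 and standard deviation 1.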

In the ML model implementation step, various algorithms were used to build the predictive model, including logistic regression, decision tree, K-nearest neighbor classifier, random forest, and XGBoost classifier. Evaluation metrics such as accuracy score, ROC-AUC score, and confusion matrix were used to assess the performance of the models. These metrics helped to determine the accuracy and reliability of the model and were used to select the best performing model for the final analysis.

Identifying credit card clients who are likely to default on payment is crucial for businesses to minimize financial losses, protect their credit score, avoid legal issues, and maintain a good reputation. This type of problem requires minimizing false negatives, that is, clients who default but are predicted not to. Recall measures exactly this, so it was used as the deciding metric. The K-nearest neighbor classifier had the highest recall score of 0.92, indicating that it is the most suitable model for predicting credit card defaults. Therefore, it can be used for the final prediction to help businesses prevent financial losses and minimize risk.
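To make the link between false negatives and recall concrete, here is a small worked example in plain Python. The labels are made up and the helper `confusion_counts` is hypothetical; it only illustrates the definition recall = TP / (TP + FN).

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false negatives, false positives, true negatives."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1   # a defaulter the model missed
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

# Hypothetical labels: 1 = "Default", 0 = "Not Default"
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(tp, fn, fp, tn)  # 3 1 2 4
print(recall)          # 0.75
print(precision)       # 0.6
```

Every missed defaulter lowers recall, which is why a model chosen to minimize false negatives is the one with the highest recall.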

In conclusion, the project demonstrated the importance of risk management and how machine learning algorithms can be used to improve predictive accuracy for credit card default prediction in Taiwan. Through various steps of data preprocessing, visualization, feature engineering, selection, and ML model implementation, the project aimed to prepare the data for analysis and select the best performing model for the final prediction. The K-nearest neighbor classifier had the highest recall score, indicating its suitability for predicting credit card defaults, and can be used to help businesses minimize financial losses and risk.

# **GitHub Link -**


https://github.com/Rishabhyadav888/credit_card_fraud_dedication

# **Problem Statement**


 
*   Some customers in Taiwan default on their credit card payments.
*   From a risk management perspective, a bank or credit card company is most interested in minimizing its losses from any particular customer.
*   Goal: To measure how accurately the probability of default can be predicted for a Taiwanese credit card client.
*   Problem Analysis: Classify the probability of default for the next month, with 1 as "Default" and 0 as "Not Default".
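As a minimal sketch of the last point, predicted probabilities can be turned into the two classes with a cut-off. The helper `to_labels`, the example probabilities, and the default threshold of 0.5 are all assumptions made for illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted default probabilities to 1 (Default) or 0 (Not Default)."""
    return [1 if p >= threshold else 0 for p in probabilities]

probs = [0.10, 0.48, 0.52, 0.91]   # hypothetical model outputs
labels = to_labels(probs)
print(labels)                      # [0, 0, 1, 1]

# A lower threshold flags more clients as likely defaulters
stricter = to_labels(probs, threshold=0.4)
print(stricter)                    # [0, 1, 1, 1]
```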



# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xg
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
import warnings; warnings.simplefilter('ignore')


### Dataset Loading

In [None]:
# connect to google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
path='/content/drive/MyDrive/Capstone Project/Credit Card fraud/'
df= pd.read_excel(path+'default of credit card clients.xls')

### Dataset First View

In [None]:
# Dataset First Look
# The first row holds the real column names, so promote it to the header
headers = df.iloc[0]
df = pd.DataFrame(df.values[1:], columns=headers)

In [None]:
# First 5 rows of dataset
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

There are no NaN/null values in our dataset, so we don't have to impute any records.

### What did you know about your dataset?

* Our dataset contains 30000 rows with 23 input variables and 1 target column.
* All the columns have the object dtype, so the numeric features need to be converted to int or float.
* Some variables need to be converted to categories (SEX, EDUCATION, MARRIAGE).
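The object dtypes are most likely a side effect of rebuilding the frame from `df.values` when the header row was promoted. The sketch below reproduces this on a tiny made-up frame (the names `raw` and `toy` and all values are hypothetical) and shows that `pd.to_numeric` restores numeric dtypes.

```python
import pandas as pd

# Toy frame that mimics a sheet whose real header sits in the first data row
raw = pd.DataFrame(
    [["ID", "LIMIT_BAL", "SEX"],
     [1, 20000, 2],
     [2, 120000, 2]],
    columns=["Unnamed: 0", "X1", "X2"],
)

# Promote the first row to the header, as done in the first-look cell
toy = pd.DataFrame(raw.values[1:], columns=raw.iloc[0])
dtypes_before = toy.dtypes
print(dtypes_before)   # every column is object

# Convert each column to a numeric dtype
toy = toy.apply(pd.to_numeric)
dtypes_after = toy.dtypes
print(dtypes_after)    # every column is now an integer dtype
```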

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description 

* ID: ID of each client
* LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
* SEX: Gender (1=male, 2=female)
* EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
* MARRIAGE: Marital status (1=married, 2=single, 3=others)
* AGE: Age in years
* PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
* PAY_2: Repayment status in August, 2005 (scale same as above)
* PAY_3: Repayment status in July, 2005 (scale same as above)
* PAY_4: Repayment status in June, 2005 (scale same as above)
* PAY_5: Repayment status in May, 2005 (scale same as above)
* PAY_6: Repayment status in April, 2005 (scale same as above)
* BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
* BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
* BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
* BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
* BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
* BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
* PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
* PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
* PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
* PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
* PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
* PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
* default payment next month: Default payment (1=yes, 0=no)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
label_colum = ['SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'default payment next month', 'PAY_0']
for colm in label_colum:
    print(f"Unique values of {colm}: {df[colm].unique()}")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Deleted the ID column from df
df.drop('ID',axis=1,inplace=True)

In [None]:
# Convert every non-categorical column from object to a numeric dtype
categorical_colum = ['SEX', 'EDUCATION', 'MARRIAGE']
numeric_colum = df.columns.drop(categorical_colum)
df[numeric_colum] = df[numeric_colum].apply(pd.to_numeric)

In [None]:
# Data type of each variable
df.dtypes

In [None]:
df.describe().T

In [None]:
# Convert the numeric codes of the categorical columns to readable labels.
# Codes that are not listed (e.g. EDUCATION 5 and 6, "unknown") are left unchanged.
df.replace({
    'SEX': {1: 'Male', 2: 'Female'},
    'EDUCATION': {1: 'graduate school', 2: 'university', 3: 'high school', 4: 'others'},
    'MARRIAGE': {1: 'married', 2: 'single', 3: 'others'},
}, inplace=True)

In [None]:
#renaming columns 

df.rename(columns={'PAY_0':'PAY_SEPT','PAY_2':'PAY_AUG','PAY_3':'PAY_JUL','PAY_4':'PAY_JUN','PAY_5':'PAY_MAY','PAY_6':'PAY_APR'},inplace=True)
df.rename(columns={'BILL_AMT1':'BILL_AMT_SEPT','BILL_AMT2':'BILL_AMT_AUG','BILL_AMT3':'BILL_AMT_JUL','BILL_AMT4':'BILL_AMT_JUN','BILL_AMT5':'BILL_AMT_MAY','BILL_AMT6':'BILL_AMT_APR'}, inplace = True)
df.rename(columns={'PAY_AMT1':'PAY_AMT_SEPT','PAY_AMT2':'PAY_AMT_AUG','PAY_AMT3':'PAY_AMT_JUL','PAY_AMT4':'PAY_AMT_JUN','PAY_AMT5':'PAY_AMT_MAY','PAY_AMT6':'PAY_AMT_APR'},inplace=True)
df.rename(columns={'default payment next month':'default'},inplace=True)

In [None]:
df.head()

In [None]:
# Add new feature (total due): sum of all six monthly bills minus sum of all six monthly payments
months = ['APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEPT']
bill_cols = [f'BILL_AMT_{m}' for m in months]
pay_cols = [f'PAY_AMT_{m}' for m in months]
df['due_amount'] = df[bill_cols].sum(axis=1) - df[pay_cols].sum(axis=1)

#### What all manipulations have you done and insights you found?

* Dropped the ID column.
* Converted the numeric columns to int dtype.
* Changed the values of the categorical columns from int to str.
* Renamed several columns to make them easier to understand.
* Added a new feature, `due_amount` (total due: the total amount left for payment).
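As a quick check of the total due feature, here is a worked example on two made-up clients. It assumes that total due means the sum of all six monthly bills minus the sum of all six monthly payments, and it reuses the renamed column names; the amounts are hypothetical.

```python
import pandas as pd

months = ["APR", "MAY", "JUN", "JUL", "AUG", "SEPT"]

# Two hypothetical clients: the first pays 400 of a 1000 bill each month,
# the second pays the full 500 bill each month
toy = pd.DataFrame({
    **{f"BILL_AMT_{m}": [1000, 500] for m in months},
    **{f"PAY_AMT_{m}": [400, 500] for m in months},
})

bill_cols = [f"BILL_AMT_{m}" for m in months]
pay_cols = [f"PAY_AMT_{m}" for m in months]

# Total due = total billed over six months - total paid over six months
toy["due_amount"] = toy[bill_cols].sum(axis=1) - toy[pay_cols].sum(axis=1)
print(toy["due_amount"].tolist())  # [3600, 0]
```

The first client is left with 6 × 1000 − 6 × 400 = 3600 to pay, while the second has nothing outstanding.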