<a href="https://colab.research.google.com/github/Rishabhyadav888/credit_card_fraud_dedication/blob/main/Credit_card_fraud_dedicution_in_Taiwan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - 



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** Rishabh kumar yadav


# **Project Summary -**

The project aimed to predict the probability of default for Taiwanese credit card clients to minimize losses for banks and credit card companies. The classification problem involved predicting if a client would default or not in the next month. The study highlighted the importance of risk management and demonstrated how machine learning algorithms can improve predictive accuracy.

In the data preprocessing step, the ID column was dropped, and the numeric columns were converted to the integer data type. The values of the categorical columns were also changed from integer to string data type for better understanding. Some columns were renamed to make them more meaningful. Additionally, a new feature called "total due" was added, which represented the total amount left for payment. These steps were taken to improve the quality and usability of the data for machine learning analysis.

In the data visualization step, univariate analysis, bivariate analysis, and multivariate analysis were performed to understand the relationships between variables. The correlation heatmap showed that bill months had multicollinearity with VIF values greater than 20. To address this issue, the columns were added together, and a new column called "total bill" was created. This step helped to improve the accuracy of the machine learning algorithm by reducing the impact of multicollinearity on the analysis.

In the feature engineering step, category encoding was performed on several categorical columns. For sex, label encoding was used, and for marriage and education, one-hot encoding was used. In the feature selection step, recursive feature elimination was employed, which involved the backward elimination of features based on their coefficient values. The data scaling was performed using the standard scaler. To handle the imbalanced data of the dependent variable, the SMOTE oversampling method was used. Finally, the data was split into training and testing sets using the train-test split method. These steps were taken to improve the quality of the data and prepare it for machine learning analysis.

In the ML model implementation step, various algorithms were used to build the predictive model, including logistic regression, decision tree, K-nearest neighbor classifier, random forest, and XGBoost classifier. Evaluation metrics such as accuracy score, ROC-AUC score, and confusion matrix were used to assess the performance of the models. These metrics helped to determine the accuracy and reliability of the model and were used to select the best performing model for the final analysis.

Identifying clients who are likely to default on payment is crucial for businesses to minimize financial losses, avoid legal issues, and maintain a good reputation. This type of problem requires minimizing false negatives, so recall is the metric that matters most. The K-nearest neighbor classifier had the highest recall score of 0.92, indicating that it is the most suitable of the models tried for predicting credit card defaults. Therefore, it can be used for the final prediction to help businesses prevent financial losses and minimize risk.

In conclusion, the project demonstrated the importance of risk management and how machine learning algorithms can be used to improve predictive accuracy for credit card default prediction in Taiwan. Through various steps of data preprocessing, visualization, feature engineering, selection, and ML model implementation, the project aimed to prepare the data for analysis and select the best performing model for the final prediction. The K-nearest neighbor classifier had the highest recall score, indicating its suitability for predicting credit card defaults, and can be used to help businesses minimize financial losses and risk.

# **GitHub Link -**


https://github.com/Rishabhyadav888/credit_card_fraud_dedication

# **Problem Statement**


 
*   Certain customers default on payments in Taiwan.
*   From a risk management perspective, a bank or credit card company is more interested in minimizing its losses towards a particular customer.
*   Goal: To compute the predictive accuracy of the probability of default for a Taiwanese credit card client.
*   Problem Analysis - Classify the probability of default for next month: 1 as "Default" and 0 as "Not Default".



# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xg
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
import warnings; warnings.simplefilter('ignore')


### Dataset Loading

In [None]:
# connect to google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
path='/content/drive/MyDrive/Capstone Project/Credit Card fraud/'
df= pd.read_excel(path+'default of credit card clients.xls')

### Dataset First View

In [None]:
# Dataset First Look
headers=df.iloc[0]
df  = pd.DataFrame(df.values[1:], columns=headers)

In [None]:
# First 5 rows of dataset
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

There are no NaN/null values in our dataset, so we do not have to impute any records.

### What did you know about your dataset?

* Our dataset contains 30000 rows with 23 input variables and 1 target column.
* All the columns are of object type; some of the features need to be converted to int or float type.
* Some variables need to be converted to categories (SEX, EDUCATION, MARRIAGE).
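
The object dtype noted above is a side effect of rebuilding the frame from `df.values`. A minimal sketch on a made-up three-column frame (the `raw` and `toy` names and all values are illustrative assumptions, not taken from the dataset) shows the effect:

```python
import pandas as pd

# Toy frame that mimics the raw sheet: the real column names sit in row 0.
raw = pd.DataFrame(
    [["ID", "LIMIT_BAL", "SEX"],
     [1, 20000, 2],
     [2, 120000, 2]],
    columns=["Unnamed: 0", "X1", "X2"],
)

# Same steps as the notebook: use row 0 as headers, keep the remaining rows.
headers = raw.iloc[0]
toy = pd.DataFrame(raw.values[1:], columns=headers)

print(toy.columns.tolist())
print(toy.dtypes)  # every column keeps the generic object dtype
```

If the column names are known to sit in the second row of the sheet, `pd.read_excel` can also be given `header=1`, which may let pandas infer numeric dtypes directly. That is a possible alternative, not what this notebook does.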

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description 

* ID: ID of each client
* LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
* SEX: Gender (1=male, 2=female)
* EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
* MARRIAGE: Marital status (1=married, 2=single, 3=others)
* AGE: Age in years
* PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
* PAY_2: Repayment status in August, 2005 (scale same as above)
* PAY_3: Repayment status in July, 2005 (scale same as above)
* PAY_4: Repayment status in June, 2005 (scale same as above)
* PAY_5: Repayment status in May, 2005 (scale same as above)
* PAY_6: Repayment status in April, 2005 (scale same as above)
* BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
* BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
* BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
* BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
* BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
* BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
* PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
* PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
* PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
* PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
* PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
* PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
* default payment next month: Default payment (1=yes, 0=no)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
label_colum=['SEX', 'EDUCATION', 'MARRIAGE', 'AGE','default payment next month','PAY_0']
list_columns=list(label_colum)
for colm in list_columns:
 print(f"Feature unique values for each variable:{colm} {df[colm].unique()}")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Deleted the ID column from df
df.drop('ID',axis=1,inplace=True)

In [None]:
# Converted the numeric columns to int dtype
categorical_colum=['SEX', 'EDUCATION', 'MARRIAGE']
numeric_colum=df.columns.drop(categorical_colum)
df[numeric_colum] = df[numeric_colum].apply(pd.to_numeric)

In [None]:
# Data type of each variable
df.dtypes

In [None]:
df.describe().T

In [None]:
# Converted the numeric value to categorical
df.replace({'SEX': {1 : 'Male', 2 : 'Female'}, 'EDUCATION' : {1 : 'graduate school', 2 : 'university', 3 : 'high school', 4 : 'others'}, 'MARRIAGE' : {1 : 'married', 2 : 'single', 3 : 'others'}}, inplace = True)

In [None]:
#renaming columns 

df.rename(columns={'PAY_0':'PAY_SEPT','PAY_2':'PAY_AUG','PAY_3':'PAY_JUL','PAY_4':'PAY_JUN','PAY_5':'PAY_MAY','PAY_6':'PAY_APR'},inplace=True)
df.rename(columns={'BILL_AMT1':'BILL_AMT_SEPT','BILL_AMT2':'BILL_AMT_AUG','BILL_AMT3':'BILL_AMT_JUL','BILL_AMT4':'BILL_AMT_JUN','BILL_AMT5':'BILL_AMT_MAY','BILL_AMT6':'BILL_AMT_APR'}, inplace = True)
df.rename(columns={'PAY_AMT1':'PAY_AMT_SEPT','PAY_AMT2':'PAY_AMT_AUG','PAY_AMT3':'PAY_AMT_JUL','PAY_AMT4':'PAY_AMT_JUN','PAY_AMT5':'PAY_AMT_MAY','PAY_AMT6':'PAY_AMT_APR'},inplace=True)
df.rename(columns={'default payment next month':'default'},inplace=True)

In [None]:
df.head()

In [None]:
# Add new feature (total due): billed amounts minus payments
# Note: BILL_AMT_AUG is not included in the bill sum below
df['due_amount'] = (df['BILL_AMT_APR']+df['BILL_AMT_MAY']+df['BILL_AMT_JUN']+df['BILL_AMT_JUL']+df['BILL_AMT_SEPT'])-(df['PAY_AMT_APR']+df['PAY_AMT_MAY']+df['PAY_AMT_JUN']+df['PAY_AMT_JUL']+df['PAY_AMT_AUG']+df['PAY_AMT_SEPT'])

#### What all manipulations have you done and insights you found?

* Dropped the ID column.
* Converted the numeric columns to int dtype.
* Changed the values of the categorical columns from int to str type.
* Renamed a few columns to make them more understandable.
* Added a new feature, total due (the total amount left for payment).
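
The "total due" idea (amount billed minus amount paid) can be sketched on a toy frame. This is an illustration under assumptions: the two clients and their values are made up, and the sketch sums all six months on both sides, whereas the notebook cell above leaves `BILL_AMT_AUG` out of the bill sum.

```python
import pandas as pd

months = ["APR", "MAY", "JUN", "JUL", "AUG", "SEPT"]
bill_cols = [f"BILL_AMT_{m}" for m in months]
pay_cols = [f"PAY_AMT_{m}" for m in months]

# Two made-up clients (illustrative values only).
toy = pd.DataFrame(
    [[1000, 1200, 900, 800, 700, 600, 500, 500, 500, 500, 500, 500],
     [0, 0, 0, 0, 0, 0, 100, 0, 0, 0, 0, 0]],
    columns=bill_cols + pay_cols,
)

# Total billed minus total paid over the six months.
toy["due_amount"] = toy[bill_cols].sum(axis=1) - toy[pay_cols].sum(axis=1)
print(toy["due_amount"].tolist())  # [2200, -100]
```

A negative value, as for the second client, means more was paid than billed over the period.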

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### **Univariate Analysis**

#### DEFAULT

In [None]:
# visualization code
plt.figure(figsize=(7,7))
sns.countplot(x = 'default', data = df)

In [None]:
# Unique value count of target column(default)
df['default'].value_counts()

In [None]:
# Relative frequency of each class in the target column (default)
df['default'].value_counts(normalize=True)

##### 1. Why did you pick the specific chart?

A bar plot makes it easy to see the difference between the share of defaulters and non-defaulters.

##### 2. What is/are the insight(s) found from the chart?

* The dataset shows that 77.88% of clients are not expected to default, whereas 22.12% are expected to default.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* As the graph above shows, the two classes are not in proportion, so the dataset is imbalanced.

#### SEX

In [None]:
# visualization code
plt.figure(figsize=(5,5))
df.groupby('SEX').size().plot(kind='pie', autopct='%.0f%%',).set_ylabel('SEX')

##### 1. Why did you pick the specific chart?

A pie chart shows the share of credit card clients by gender.

##### 2. What is/are the insight(s) found from the chart?

* More of the credit card clients are female than male, possibly because of purchases of household items, luxury items, cosmetic products, etc.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Because there are more female clients, a larger number of defaulters can be expected among them.

#### EDUCATION

In [None]:
# visualization code
plt.figure(figsize=(10,7))
sns.countplot(x = 'EDUCATION', data = df)

##### 1. Why did you pick the specific chart?

A count plot shows the education level of the credit card clients.

##### 2. What is/are the insight(s) found from the chart?

* Most clients are university graduates.
* The values 0, 5 and 6 also appear without a usable description, so we merge them into "others".

In [None]:
# Value without description added to others 
fil =(df['EDUCATION']==0)|(df['EDUCATION']==5)|(df['EDUCATION']==6)
df.loc[fil, 'EDUCATION'] = 'others'

In [None]:
# Education unique value count
df['EDUCATION'].value_counts()

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* The education level of 468 clients is not known, so it is difficult to tell which group they belong to.

#### AGE

In [None]:
# visualization code

plt.figure(figsize=(15,7))
sns.countplot(x = 'AGE', data = df)

In [None]:
df['AGE'].describe()

##### 1. Why did you pick the specific chart?

A count plot shows the number of clients at each age.

##### 2. What is/are the insight(s) found from the chart?

* Clients range in age from a minimum of 21 to a maximum of 79.
* The largest number of clients are aged 29.
* 50% of clients are between 21 and 34 years old.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* This shows which age groups hold the most credit cards, so defaults in repayment may be more frequent in the 24 to 34 age group.

#### MARRIAGE

In [None]:
# visualization code
plt.figure(figsize=(10,7))
sns.countplot(x = 'MARRIAGE', data = df)

In [None]:
# Changed the value from 0 to others
df.loc[(df['MARRIAGE']==0),'MARRIAGE'] = 'others'

##### 1. Why did you pick the specific chart?

A bar plot shows the number of married and single clients.

##### 2. What is/are the insight(s) found from the chart?

* More of the clients are single than married.

#### LIMIT BALANCE

In [None]:
# visualization code
plt.figure(figsize = (14,6))
sns.histplot(df.LIMIT_BAL,kde=True,bins=200)
plt.show()

In [None]:
df['LIMIT_BAL'].describe()

In [None]:
# Limit balance with 5 largest group of people
df['LIMIT_BAL'].value_counts().head(5)

##### 1. Why did you pick the specific chart?

A histogram shows how the credit limit is distributed among clients.

##### 2. What is/are the insight(s) found from the chart?

* The minimum credit limit is 10,000.
* The maximum credit limit is 1,000,000.
* The largest group of clients has a credit limit of 50,000.

#### BILL AMOUNT

In [None]:
# visualization code
plt.figure(figsize=(10,5))
sns.boxplot(data=df.loc[:,['BILL_AMT_SEPT', 'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY', 'BILL_AMT_APR']])
plt.ticklabel_format(style='plain', axis='y')
plt.show()

In [None]:
# Bill amount description
df[['BILL_AMT_SEPT', 'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY', 'BILL_AMT_APR']].describe()

##### 1. Why did you pick the specific chart?

Through boxplot we can find out the IQR of bill amount.

##### 2. What is/are the insight(s) found from the chart?

* Bill amounts are both positive and negative; the negative amounts may be due to advance payment of bills.
* Many of the bills have a zero amount.
* The boxplot shows outliers on both the negative and the positive side; the high bill amounts may belong to clients whose credit limit is also high.

#### PAY

In [None]:
#  visualization code
cols = 3
rows = 3
num_cols = ['PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR']
fig = plt.figure( figsize=(cols*4, rows*4))
for i, col in enumerate(num_cols):
    
    ax=fig.add_subplot(rows,cols,i+1)
    
    sns.countplot(x = df[col], ax = ax)
    
fig.tight_layout()  
plt.show()

##### 1. Why did you pick the specific chart?

A count plot shows the repayment status for each month, i.e. whether the payment was made on time, early, or late.

##### 2. What is/are the insight(s) found from the chart?

* Most payments are made on time or before the due date.
* Most delayed payments are two months late.
* A few payments are delayed by as much as eight months.

#### PAYMENT AMOUNT

In [None]:
# visualization code
plt.figure(figsize=(10,5))
sns.boxplot(data=df.loc[:,['PAY_AMT_SEPT', 'PAY_AMT_AUG', 'PAY_AMT_JUL', 'PAY_AMT_JUN', 'PAY_AMT_MAY', 'PAY_AMT_APR']])
plt.ticklabel_format(style='plain',axis='y')
plt.show()

In [None]:
df[['PAY_AMT_SEPT', 'PAY_AMT_AUG', 'PAY_AMT_JUL', 'PAY_AMT_JUN', 'PAY_AMT_MAY', 'PAY_AMT_APR']].describe()

##### 1. Why did you pick the specific chart?

Through boxplot we can find out the IQR of payment amount.

##### 2. What is/are the insight(s) found from the chart?

* The minimum payment is 0.
* 75% of payments were under 5,000.

#### DUE AMOUNT

In [None]:
# visualization code
plt.figure(figsize=(7,5))
sns.boxplot(y=df['due_amount'])  # pass as y so the amounts sit on the y-axis formatted below
plt.ticklabel_format(style='plain',axis="y")
plt.show()

In [None]:
df[['due_amount']].describe()

##### 1. Why did you pick the specific chart?

Through boxplot we can find out the IQR of total due amount.

##### 2. What is/are the insight(s) found from the chart?

* The minimum due amount is negative (about -2,800,000), which corresponds to an advance payment towards the bill.
* The average due amount is 189,000.
* The maximum due amount is 3,100,000.

### **Bivariate Analysis**

#### SEX & DEFAULT

In [None]:
# visualization code
plt.figure(figsize=(8,6))
sns.countplot(x = 'SEX', hue = 'default', data = df)

##### 1. Why did you pick the specific chart?

A count plot lets us compare the number of defaults between the two genders.

##### 2. What is/are the insight(s) found from the chart?

* More female clients than male clients default next month, as there are more female clients overall.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

This analysis shows which gender accounts for more defaults next month.
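
A count plot compares absolute numbers, which also depend on group size. One way to separate the two effects is to look at the default rate per group. The sketch below uses a made-up ten-row frame (the values are assumptions for illustration, not dataset results) to show how counts and rates can point in different directions:

```python
import pandas as pd

# Illustrative data only: 6 female and 4 male clients.
toy = pd.DataFrame({
    "SEX": ["Female"] * 6 + ["Male"] * 4,
    "default": [1, 1, 0, 0, 0, 0, 1, 1, 0, 0],
})

# Per group: number of clients, number of defaulters, share of defaulters.
summary = toy.groupby("SEX")["default"].agg(
    clients="count", defaulters="sum", default_rate="mean"
)
print(summary)
```

In this toy frame both groups have the same number of defaulters, but the smaller group has the higher default rate. The same `groupby` call can be applied to `EDUCATION`, `MARRIAGE` or `AGE` on the real frame.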

#### EDUCATION & DEFAULT

In [None]:
#  visualization code
plt.figure(figsize=(10,7))
sns.countplot(x = df['EDUCATION'], hue = 'default', data = df)

##### 1. Why did you pick the specific chart?

A count plot shows which education groups account for more defaults next month.

##### 2. What is/are the insight(s) found from the chart?

* University graduates account for the most defaults next month.
* Most defaults on credit card bills come from the more educated groups.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The "others" group is not defined; it may include clients without formal education, but it accounts for far fewer defaults than the other education groups.

#### MARRIAGE & DEFAULT

In [None]:
#  visualization code
plt.figure(figsize=(10,7))
sns.countplot(x = df['MARRIAGE'], hue = 'default', data = df)

##### 1. Why did you pick the specific chart?

A count plot shows which marital-status groups account for more defaults next month.

##### 2. What is/are the insight(s) found from the chart?

* Married and single clients have a roughly equal number of defaults next month.
* Single clients are more likely to pay their bills on time than married clients.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

* Some marriage categories are not labeled; "others" may correspond to divorced clients.

#### AGE & DEFAULT

In [None]:
# visualization code
plt.figure(figsize=(15,7))
sns.countplot(x = 'AGE',hue='default', data = df)
plt.show()

##### 1. Why did you pick the specific chart?

A count plot shows which ages account for more defaults next month.

##### 2. What is/are the insight(s) found from the chart?

* Clients aged 23 to 30 account for the most defaults next month.
* After the age of 30 the number of defaults falls as age rises, possibly because credit card use also declines with age.

#### LIMIT BALANCE & DEFAULT

In [None]:
# visualization code
plt.figure(figsize=(8,6))
sns.violinplot(y = 'LIMIT_BAL',x='default' ,data = df)
plt.ticklabel_format(style='plain',axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

A violin plot shows the distribution of a continuous variable for each category of a categorical variable, as it combines a box plot with a kernel density plot.

##### 2. What is/are the insight(s) found from the chart?

* Clients with a credit limit under 200,000 are more likely to default next month.

####  DUE AMOUNT & DEFAULT

In [None]:
# visualization code
plt.figure(figsize=(10,7))
sns.boxplot(y='due_amount',x='default',data=df)
plt.ticklabel_format(style='plain',axis='y')

In [None]:
df.loc[df['default']==1,'due_amount'].describe()

##### 1. Why did you pick the specific chart?

A boxplot shows the IQR of the due amount for clients who default next month.

##### 2. What is/are the insight(s) found from the chart?

* 75% of the clients who default next month have a due amount under 233,000.
* The average due amount for clients who default next month is 193,000.

### **Multivariate Analysis**

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,8))
correlation = df.corr(numeric_only=True)  # skip the string-valued categorical columns
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')

##### 1. Why did you pick the specific chart?

A correlation heatmap shows the correlation among the features as well as with the dependent variable.

##### 2. What is/are the insight(s) found from the chart?

* The pay-month columns are highly correlated with the dependent variable (default).
* Some features, such as the pay-month and bill-month columns, show multicollinearity. Let's check with the VIF method whether they are highly correlated.

In [None]:
# create function for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
# calculate VIF
calc_vif(df[[i for i in df.describe().columns]])


* The pay-month columns have VIF values below 5, so there is no need to remove them, especially as they are correlated with the default column.
* The bill amount columns have VIF values above 20, so we will add them together into a new column [total_bill].
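
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it by hand with NumPy on synthetic data; `vif_manual` and the toy columns are hypothetical names, and the sketch adds an intercept to each auxiliary regression, so its numbers need not match the statsmodels output above exactly.

```python
import numpy as np

def vif_manual(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a
c = rng.normal(size=500)                 # independent of a and b
vifs = vif_manual(np.column_stack([a, b, c]))
print(vifs.round(1))  # large for a and b, close to 1 for c
```

The two nearly identical columns inflate each other's VIF, which is the same pattern the consecutive bill-amount months show.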

#### 15. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# create new column total_bill
df['total_bill']= df['BILL_AMT_APR'] + df['BILL_AMT_MAY'] + df['BILL_AMT_JUN'] + df['BILL_AMT_JUL'] + df['BILL_AMT_AUG'] + df['BILL_AMT_SEPT'] 

In [None]:
# drop bill amount columns
df.drop(['BILL_AMT_SEPT', 'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN','BILL_AMT_MAY', 'BILL_AMT_APR'],axis=1,inplace=True)

In [None]:
# calculate VIF
calc_vif(df[[i for i in df.describe().columns]])

* Now VIF value is less than 5.

## ***5. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

There are no null values in the dataset.

### 2. Categorical Encoding

In [None]:
# Encode your categorical columns
#label encoding
encoders_nums = {"SEX":{"Male":1,"Female":0}}

df = df.replace(encoders_nums)
# one hot encoding
df=pd.get_dummies(df,columns=["MARRIAGE",'EDUCATION'],drop_first=True)

In [None]:
# shape of data after encoding
df.shape

#### What all categorical encoding techniques have you used & why did you use those techniques?

Encoding categorical columns:
* For the SEX column we used label encoding.
* The MARRIAGE and EDUCATION features contain more than two categories, so we used one-hot encoding.
* Before encoding, the shape of df was (30000, 19); after encoding the categorical columns it is (30000, 22), an increase of 3 columns.
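
The effect of the two encodings on the column count can be checked on a small frame. This is a sketch on made-up rows; it uses `Series.map` for the binary column instead of `DataFrame.replace`, which is a substitution made here for brevity and not the notebook's call.

```python
import pandas as pd

toy = pd.DataFrame({
    "SEX": ["Male", "Female", "Female", "Male"],
    "MARRIAGE": ["married", "single", "others", "single"],
    "EDUCATION": ["university", "graduate school", "high school", "others"],
})

# Label encoding for the binary column.
toy["SEX"] = toy["SEX"].map({"Male": 1, "Female": 0})

# One-hot encoding; drop_first removes one level per feature to avoid redundancy.
encoded = pd.get_dummies(toy, columns=["MARRIAGE", "EDUCATION"], drop_first=True)

print(encoded.columns.tolist())
print(toy.shape, "->", encoded.shape)
```

With three MARRIAGE levels and four EDUCATION levels, `drop_first=True` yields 2 + 3 dummy columns in place of the 2 original ones, a net gain of 3 columns.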

### 3. Feature Manipulation & Selection

#### 2. Feature Selection

In [None]:
# divided the input feature & target columns
x=df.drop(['default'],axis=1)
y=df['default']

In [None]:
# Select your features wisely to avoid overfitting
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
rfe =RFE(lr)
rfe = rfe.fit(x,y)

print(rfe.support_)
print(rfe.ranking_)

In [None]:
col=x.columns[rfe.support_]

In [None]:
x=x[col]

##### What all feature selection methods have you used  and why?

* For feature selection we used recursive feature elimination (RFE), which repeatedly fits the model and eliminates the features with the smallest coefficient values until the desired number of features remains.
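
A minimal sketch of RFE on synthetic data follows. The dataset from `make_classification`, the choice of four features and the names `X_toy`, `y_toy` and `selector` are assumptions for illustration. When `n_features_to_select` is left at its default, as in the cell above, scikit-learn keeps half of the features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic problem: 8 features, 3 of them informative.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Coefficient magnitudes depend on feature scale, so standardize first.
X_toy = StandardScaler().fit_transform(X_toy)

# Drop one feature per round until four remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4, step=1)
selector.fit(X_toy, y_toy)

print(selector.support_)   # True for the kept features
print(selector.ranking_)   # 1 for kept features, larger for earlier eliminations
```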

##### Which all features you found important and why?

In [None]:
# Important Feature
x.columns

* ['LIMIT_BAL', 'AGE', 'PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_AMT_SEPT', 'PAY_AMT_AUG', 'PAY_AMT_JUN', 'PAY_AMT_APR', 'due_amount', 'total_bill']
* These features had higher coefficient values than the others, meaning they explain relatively more of the relationship with the dependent variable.

### 4. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x=sc.fit_transform(x)


##### Which method have you used to scale you data and why?

* We used the standard scaler, which rescales each feature to a mean of zero and a variance of 1, because some features had much larger values than others.
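
The transformation is z = (x - mean) / std, applied per column. A short NumPy sketch on made-up values (the array `x_toy` is an assumption for illustration) shows the result:

```python
import numpy as np

# Two features on very different scales (illustrative values).
x_toy = np.array([[10_000.0, 21.0],
                  [50_000.0, 35.0],
                  [200_000.0, 60.0]])

mean = x_toy.mean(axis=0)
std = x_toy.std(axis=0)      # population std (ddof=0), as StandardScaler uses
z = (x_toy - mean) / std

print(z.mean(axis=0).round(6))  # approximately 0 for each column
print(z.std(axis=0).round(6))   # approximately 1 for each column
```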

### 5. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

In [None]:
# Value count percentage of dependent variable
y.value_counts(1)

* 77.8% of the clients did not default, which is the majority class, and the minority class of 22% of clients defaults next month.
* With a ratio of about 2/7, we can say that our dependent variable is imbalanced.

In [None]:
# Handling Imbalanced Dataset (By using smote)
from imblearn.over_sampling import SMOTE

smote=SMOTE()
x,y = smote.fit_resample(x,y)

In [None]:
# New value count of dependent variable
y.value_counts()

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

* To handle the imbalanced data we used the SMOTE function.
* SMOTE uses a k-nearest neighbour algorithm to create synthetic data. It starts by choosing a random sample from the minority class and finding its k nearest neighbours in that class; a synthetic sample is then created at a random point between the chosen sample and one of those neighbours.
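
The core step of SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration of generating a single synthetic point, not the imblearn implementation; the `minority` array, `k = 5` and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
minority = rng.normal(size=(20, 2))  # toy minority-class points
k = 5

# 1. Pick one minority sample at random.
i = rng.integers(len(minority))
x_i = minority[i]

# 2. Find its k nearest minority neighbours (position 0 is the point itself).
dist = np.linalg.norm(minority - x_i, axis=1)
neighbours = np.argsort(dist)[1:k + 1]

# 3. Pick one neighbour and interpolate at a random position between the two.
x_nn = minority[rng.choice(neighbours)]
lam = rng.random()
synthetic = x_i + lam * (x_nn - x_i)

print(synthetic)
```

Repeating these steps until the minority class matches the majority class in size gives the balanced counts that `fit_resample` returns.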

### 6. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=10)

print(x_train.shape)
print(x_test.shape)


##### What data splitting ratio have you used and why? 

* A splitting ratio of 80/20 is used for training and evaluating the model.
* 80% for training & 20% for testing.
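
In this notebook, scaling and oversampling are applied before the split. A variation that is sometimes used to keep the test set free of information from the training data is to split first and fit the scaler (and any oversampler) on the training portion only. The sketch below shows that ordering on synthetic data with scikit-learn alone; the data and all names are assumptions, and it is offered as an option to compare against, not as the notebook's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.22).astype(int)  # roughly 22% positives

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=10
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr.shape, X_te.shape)
print(y_tr.mean().round(3), y_te.mean().round(3))
```

An oversampler such as SMOTE would then be applied to `X_tr_s` and `y_tr` only, leaving the test portion at its original class ratio.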