# Predicting Credit Default

## Built a classification methodology to determine whether a person defaults the credit card payment for the next month

This code loads the dataset into a pandas DataFrame and prints the first five rows of the dataset. Note that the header=1 argument is used to skip the first row of the Excel file, which contains a description of the dataset.

In [40]:
import pandas as pd

# Load the dataset
df = pd.read_excel('C:\\Users\\Manjunath\\Desktop\\RawData\\default of credit card clients.xls', header=1)

# Print the first five rows of the dataset
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   ID                          30000 non-null  int64
 1   LIMIT_BAL                   30000 non-null  int64
 2   SEX                         30000 non-null  int64
 3   EDUCATION                   30000 non-null  int64
 4   MARRIAGE                    30000 non-null  int64
 5   AGE                         30000 non-null  int64
 6   PAY_0                       30000 non-null  int64
 7   PAY_2                       30000 non-null  int64
 8   PAY_3                       30000 non-null  int64
 9   PAY_4                       30000 non-null  int64
 10  PAY_5                       30000 non-null  int64
 11  PAY_6                       30000 non-null  int64
 12  BILL_AMT1                   30000 non-null  int64
 13  BILL_AMT2                   30000 non-null  int64
 14  BILL_A

In [42]:
df.shape

(30000, 25)

In [None]:
Here are some of the variables in the dataset:

ID: ID of each client
LIMIT_BAL: Amount of the given credit (in New Taiwan dollar)
SEX: Gender (1 = male; 2 = female)
EDUCATION: Education level (1 = graduate school; 2 = university; 3 = high school; 4 = others)
MARRIAGE: Marital status (1 = married; 2 = single; 3 = others)
AGE: Age (in years)
PAY_0 to PAY_6: History of past payments. 
PAY_0 represents the repayment status in September 2005; 
PAY_2 represents the repayment status in August 2005; and so on. 
The measurement scale for the repayment status is: 
    -2 = no consumption; 
    -1 = paid in full; 
    0 = use of revolving credit; 
    1 = payment delay for one month; 
    2 = payment delay for two months; and so on, 
    up to 8 = payment delay for eight months and 
    9 = payment delay for nine months or more.
BILL_AMT1 to BILL_AMT6: Amount of bill statement. 
BILL_AMT1 represents the bill statement amount in September 2005;
BILL_AMT2 represents the bill statement amount in August 2005; and so on.
PAY_AMT1 to PAY_AMT6: Amount of previous payment.
PAY_AMT1 represents the payment amount in September 2005; 
PAY_AMT2 represents the payment amount in August 2005; and so on.
default payment next month: 
    Whether or not the client defaulted on their credit card payment the following month (1 = yes; 0 = no)
 

## data preprocessing:

### Remove unnecessary columns:

The ID column is not useful for modeling, so we can remove it. 

The PAY_0 column is redundant since it represents the same information as PAY_1, so we can remove it as well.

In [31]:
df = df.drop(['ID', 'PAY_0'], axis=1)

### Handle missing values:

Missing values in the dataset and handle them accordingly. 

In this dataset, missing values are represented as "Not Available" or "Unknown", so we can replace them with NaN values and then use an imputer to fill in the missing values.
python

In [32]:
import numpy as np
df = df.replace(['Not Available', 'Unknown'], np.nan)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(df)
df = pd.DataFrame(imputer.transform(df), columns=df.columns)

### Handle categorical features: 

The SEX, EDUCATION, and MARRIAGE columns are categorical features, so we need to convert them into one-hot encoded columns using pandas' get_dummies() function.

In [33]:
df = pd.get_dummies(df, columns=['SEX', 'EDUCATION', 'MARRIAGE'])

### Normalize the data: 
We can use the StandardScaler from scikit-learn to normalize the data.

In [34]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df.iloc[:, 1:] = scaler.fit_transform(df.iloc[:, 1:])

After preprocessing the data, we can then split the dataset into training and testing sets and build a machine learning model. The specific steps and code for building the model will depend on the algorithm used for modeling.

### Split the dataset:
We can split the dataset into a training set and a testing set using scikit-learn's train_test_split() function.

In [51]:
from sklearn.model_selection import train_test_split
X = df.drop(['default payment next month'], axis=1)
y = df['default payment next month']
from sklearn import preprocessing
from sklearn import utils

#convert y values to categorical values
lab = preprocessing.LabelEncoder()
y_transformed = lab.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [52]:
df.shape

(30000, 25)

### Build a machine learning model: 
We can choose a machine learning algorithm to model the data. For example, we can use a logistic regression model from scikit-learn to predict credit card defaulters.

In [65]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(random_state=42)

### Evaluate the model

We can evaluate the performance of the model using metrics such as accuracy, precision, recall, and F1 score.

In [64]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))

Accuracy: 0.8205
Precision: 0.670028818443804
Recall: 0.3541507996953541
F1 score: 0.4633781763826606


## DecisionTreeClassifier

In [69]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))


Accuracy: 0.722
Precision: 0.37420269312544296
Recall: 0.4021325209444021
F1 score: 0.3876651982378854


## RandomForestClassifier

In [66]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))

Accuracy: 0.8136666666666666
Precision: 0.6312247644683715
Recall: 0.3571972581873572
F1 score: 0.45622568093385213


## GradientBoostingClassifier

In [67]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))
print('roc_auc_score',roc_auc_score(y_test,y_pred))
print('confusion_matrix',confusion_matrix(y_test,y_pred))

Accuracy: 0.8205
Precision: 0.670028818443804
Recall: 0.3541507996953541
F1 score: 0.4633781763826606
roc_auc_score 0.6526461273919485
confusion_matrix [[4458  229]
 [ 848  465]]
