## Problem Statement
The data describe customers' default payments in Taiwan. We build predictive models that predict whether a customer defaults, fine-tune them using validation, and compare them with respect to the evaluation metrics.

In [9]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport

In [None]:
df = pd.read_excel(r"C:\Users\jithe\OneDrive\Documents\Datasets\Credit_Cards\default of credit card clients.xls",skiprows=[0])

### Variable information
- This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:
- Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- Gender (1 = male; 2 = female).
- Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- Marital status (1 = married; 2 = single; 3 = others).
- Age (year).
- PAY_0-PAY_6: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- BILL_AMT1-BILL_AMT6: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. 
- PAY_AMT1-PAY_AMT6: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.
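
The integer codes listed above can be mapped to readable labels for plots and summaries. The following is a minimal sketch, assuming the column names `SEX`, `EDUCATION` and `MARRIAGE` from the loaded file. The three-row `sample` frame and the `*_LABELS` dictionaries are hypothetical stand-ins for illustration and are not part of the original notebook.

```python
import pandas as pd

# Code-to-label mappings taken from the variable descriptions above.
SEX_LABELS = {1: "male", 2: "female"}
EDUCATION_LABELS = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "others"}

# Hypothetical three-row sample standing in for the real DataFrame.
sample = pd.DataFrame({"SEX": [2, 2, 1], "EDUCATION": [2, 2, 1], "MARRIAGE": [1, 2, 3]})

# Replace each integer code with its label; the original frame is left unchanged.
decoded = sample.assign(
    SEX=sample["SEX"].map(SEX_LABELS),
    EDUCATION=sample["EDUCATION"].map(EDUCATION_LABELS),
    MARRIAGE=sample["MARRIAGE"].map(MARRIAGE_LABELS),
)
print(decoded)
```

Codes that are missing from a mapping come back as `NaN`, which makes undocumented category values easy to spot.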

### Exploratory Data Analysis

In [None]:
profile = ProfileReport(df, title="Credit Default Data")
profile

### Observations
- The dataset is clean: it has no missing values or duplicate rows, and all columns are numerically encoded.
- Features BILL_AMT1 to BILL_AMT6 are highly correlated, which is expected because they are consecutive monthly statements for the same account.
- The dataset has 25 columns and 30000 observations. The ID column can be dropped for further analysis, and "default payment next month" is the target variable.
- The classes are imbalanced: the target variable contains around 78% zeros (no default) and 22% ones (default).
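
The class shares above set a floor for accuracy: a model that always predicts the majority class is already right about 78% of the time. A small sketch of that baseline, using a hypothetical 100-element label list (`y_example`) that only mimics the reported shares:

```python
from collections import Counter

# Hypothetical label vector with roughly the class shares reported above.
y_example = [0] * 78 + [1] * 22

# The majority-class baseline predicts the most frequent label for everyone.
counts = Counter(y_example)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(y_example)

print("Majority class:", majority_class)
print("Baseline accuracy:", baseline_accuracy)
```

Any model whose accuracy sits near this value should be checked for whether it predicts the minority class at all.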

In [None]:
# Dropping the ID column as it doesn't provide any info on target variable
df.drop(labels=["ID"],inplace=True,axis=1)

In [6]:
df.head()

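| | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |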


### PCA to reduce the number of columns

In [11]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(df.iloc[:,:-1])

In [12]:
explained_var_ratio = pca.explained_variance_ratio_
cumulative_var_ratio = np.cumsum(explained_var_ratio)

plt.plot(cumulative_var_ratio, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance vs. Number of Principal Components')
plt.grid(True)
plt.show()


- 95% of the variance in the dataset is explained by just 4 components, and almost all of it by 10 components. Let's keep 10 components, which is a reasonable number of variables in this scenario.
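
PCA is sensitive to feature scale, and the PCA above is fitted on the raw columns, where amounts in NT dollars are far larger than the category codes. The sketch below uses hypothetical synthetic data (`X_demo`), not the credit dataset, to show how one large-scale feature can dominate the explained variance until the features are standardized. Whether standardizing changes the component count on the real data is not shown in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
# Hypothetical data: one feature on a currency-like scale, two on a unit scale.
X_demo = np.column_stack([
    rng.normal(0, 100_000, n),
    rng.normal(0, 1, n),
    rng.normal(0, 1, n),
])

# Explained variance ratios without and with standardization.
raw_ratio = PCA().fit(X_demo).explained_variance_ratio_
scaled_ratio = PCA().fit(StandardScaler().fit_transform(X_demo)).explained_variance_ratio_

print("Unscaled:    ", np.round(raw_ratio, 4))
print("Standardized:", np.round(scaled_ratio, 4))

# Smallest number of components whose cumulative ratio reaches 95%.
n_95 = int(np.searchsorted(np.cumsum(scaled_ratio), 0.95) + 1)
print("Components for 95% variance after scaling:", n_95)
```

In this toy case the unscaled fit attributes almost all variance to the first component, while the standardized fit needs all three.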

In [13]:
# Select the first 10 principal components and transform the data in a single fit
pca = PCA(n_components=10)
X = pca.fit_transform(df.iloc[:, :-1])
print(pca.explained_variance_ratio_.sum())  # about 99.4% of the total variance

0.9944388491466637


### Splitting the data into test and train


In [17]:
y = df.iloc[:,-1]

In [18]:
X.shape,y.shape

((30000, 10), (30000,))

In [19]:
# y is a 1-D pandas Series, which scikit-learn accepts as-is, so no reshape is needed
y.shape

In [8]:
from sklearn.model_selection import train_test_split
# Split the data into temp (90%) and test set (10%)
X_train_temp, X_test, y_train_temp, y_test = train_test_split(X, y, test_size=0.1, random_state=1,stratify=y)


In [21]:
# Further split the temp set into validation (10%) and final train set (90%)
X_train, X_val, y_train, y_val = train_test_split(X_train_temp, y_train_temp, test_size=0.1, random_state=1)

In [22]:
X_train_temp.shape, y_train_temp.shape
X_train.shape,y_train.shape
X_val.shape,y_val.shape
X_test.shape,y_test.shape

((3000, 10), (3000,))
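
With 30000 rows, the two splits above give 3000 test rows, 2700 validation rows and 24300 training rows. Only the first split is stratified. The sketch below stratifies both splits so that each subset keeps the same share of defaults. It runs on hypothetical random data (`X_demo`, `y_demo`) and is an illustration under that assumption, not the split used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Hypothetical stand-in data with roughly 22% positives.
X_demo = rng.normal(size=(1000, 10))
y_demo = (rng.random(1000) < 0.22).astype(int)

# First split: 90% temporary set, 10% test set, stratified on the labels.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1, stratify=y_demo
)
# Second split: 10% of the temporary set becomes validation, again stratified.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.1, random_state=1, stratify=y_tmp
)

for name, labels in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: n={len(labels)}, positive share={labels.mean():.3f}")
```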

### Logistic Regression

In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'penalty': ['l1', 'l2'],
              'solver': ['liblinear', 'saga']}

# Create logistic regression model
model = LogisticRegression()

# Set up k-fold cross-validation (StratifiedKFold for classification tasks)
k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Grid search for hyperparameter tuning
grid_search = GridSearchCV(model, param_grid, cv=k_fold, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Print the best hyperparameters
print("Best Hyperparameters:", best_params)

# Access the best model (GridSearchCV refits it on all of X_train with the best parameters)
best_model = grid_search.best_estimator_


# Optionally, you can also access cross-validation results
cv_results = grid_search.cv_results_
mean_scores = cv_results['mean_test_score']
std_scores = cv_results['std_test_score']
# Print mean cross-validation scores for each hyperparameter combination
print("Mean Cross-Validation Scores:")
for params, mean_score,std_score in zip(cv_results['params'], mean_scores,std_scores):
    print(f"{params}: {mean_score:.4f},{std_score:.4f}")

Best Hyperparameters: {'C': 0.001, 'penalty': 'l1', 'solver': 'liblinear'}
Mean Cross-Validation Scores:
{'C': 0.001, 'penalty': 'l1', 'solver': 'liblinear'}: 0.7798,0.0001
{'C': 0.001, 'penalty': 'l1', 'solver': 'saga'}: 0.5050,0.0060
{'C': 0.001, 'penalty': 'l2', 'solver': 'liblinear'}: 0.7798,0.0001
{'C': 0.001, 'penalty': 'l2', 'solver': 'saga'}: 0.5050,0.0059
{'C': 0.01, 'penalty': 'l1', 'solver': 'liblinear'}: 0.7798,0.0001
{'C': 0.01, 'penalty': 'l1', 'solver': 'saga'}: 0.5050,0.0059
{'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}: 0.7797,0.0001
{'C': 0.01, 'penalty': 'l2', 'solver': 'saga'}: 0.5050,0.0059
{'C': 0.1, 'penalty': 'l1', 'solver': 'liblinear'}: 0.7798,0.0001
{'C': 0.1, 'penalty': 'l1', 'solver': 'saga'}: 0.5050,0.0059
{'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}: 0.7798,0.0001
{'C': 0.1, 'penalty': 'l2', 'solver': 'saga'}: 0.5050,0.0059
{'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}: 0.7798,0.0001
{'C': 1, 'penalty': 'l1', 'solver': 'saga'}: 0.5050,0.005
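
Every `liblinear` score above equals roughly the majority-class share, and every `saga` score is near 0.505. One plausible reading is that `saga` did not converge on unscaled inputs and that accuracy cannot distinguish between settings on imbalanced data. The following sketch shows one common alternative: scale inside a `Pipeline`, weight the classes, and tune on F1. The synthetic data, the reduced grid, `class_weight="balanced"` and `scoring="f1"` are all assumptions for illustration and were not used in the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced data (about 78% / 22%) standing in for X_train, y_train.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=0
)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# F1 on the positive class rewards finding defaulters, unlike plain accuracy.
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
search.fit(X_demo, y_demo)

y_pred_demo = search.predict(X_demo)
print("Best C:", search.best_params_["clf__C"])
print("Best mean F1:", round(search.best_score_, 4))
print("Predicted defaults:", int(y_pred_demo.sum()))
```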

In [26]:
# Evaluate the best model on the validation set
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
y_pred = best_model.predict(X_val)
accuracy_lg = accuracy_score(y_val, y_pred)
# zero_division=0 scores a class with no predicted samples as 0 without emitting a warning
precision_lg = precision_score(y_val, y_pred, average='weighted', zero_division=0)
recall_lg = recall_score(y_val, y_pred, average='weighted')
f1_lg = f1_score(y_val, y_pred, average='weighted', zero_division=0)
conf_matrix_lg = confusion_matrix(y_val, y_pred)


In [27]:
# Print the evaluation metrics
print("Accuracy:", accuracy_lg)
print("Precision:", precision_lg)
print("Recall:", recall_lg)
print("F1 Score:", f1_lg)
print("\nConfusion Matrix:")
print(conf_matrix_lg)

Accuracy: 0.7703703703703704
Precision: 0.5934705075445816
Recall: 0.7703703703703704
F1 Score: 0.6704478537114521

Confusion Matrix:
[[2080    0]
 [ 620    0]]
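
The confusion matrix shows that this model predicts "no default" for every validation row, yet the weighted metrics still look moderate. The arithmetic below reproduces the reported weighted precision and recall from the matrix alone, to show how support-weighted averaging hides a recall of zero on defaulters. The variable names are illustrative.

```python
# Validation confusion matrix reported above: rows = actual, columns = predicted.
tn, fp = 2080, 0
fn, tp = 620, 0
total = tn + fp + fn + tp

# Support is the number of actual samples in each class.
support_0, support_1 = tn + fp, fn + tp

# Per-class precision and recall; a class with no predictions gets precision 0.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
precision_1 = tp / (tp + fp) if (tp + fp) else 0.0
recall_1 = tp / (tp + fn)

# Weighted averages use each class's support as its weight.
weighted_precision = (support_0 * precision_0 + support_1 * precision_1) / total
weighted_recall = (support_0 * recall_0 + support_1 * recall_1) / total

print(f"weighted precision   = {weighted_precision:.4f}")
print(f"weighted recall      = {weighted_recall:.4f}")
print(f"recall on defaulters = {recall_1:.4f}")
```

The weighted values match the reported 0.5935 and 0.7704, while recall on the default class is 0: none of the 620 defaulters in the validation set is identified.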


### Decision Tree Classifier

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {'max_depth': [None, 10, 20, 30],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [ 2, 4]}

# Create decision tree model
model = DecisionTreeClassifier()

# Set up k-fold cross-validation (StratifiedKFold for classification tasks)
k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Grid search for hyperparameter tuning
grid_search = GridSearchCV(model, param_grid, cv=k_fold, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Print the best hyperparameters
print("Best Hyperparameters:", best_params)

# Access the best model (GridSearchCV refits it on all of X_train with the best parameters)
best_model = grid_search.best_estimator_


# Optionally, you can also access cross-validation results
cv_results = grid_search.cv_results_
mean_scores = cv_results['mean_test_score']
std_scores = cv_results['std_test_score']
# Print mean cross-validation scores for each hyperparameter combination
print("Mean Cross-Validation Scores:")
for params, mean_score,std_score in zip(cv_results['params'], mean_scores,std_scores):
    print(f"{params}: {mean_score:.4f},{std_score:.4f}")


In [148]:
# Evaluate the best model on the validation set
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
y_pred = best_model.predict(X_val)
accuracy_dt = accuracy_score(y_val, y_pred)
precision_dt = precision_score(y_val, y_pred, average='weighted')
recall_dt = recall_score(y_val, y_pred, average='weighted')
f1_dt = f1_score(y_val, y_pred, average='weighted')
conf_matrix_dt = confusion_matrix(y_val, y_pred)

In [149]:
# Print the evaluation metrics
print("Accuracy:", accuracy_dt)
print("Precision:", precision_dt)
print("Recall:", recall_dt)
print("F1 Score:", f1_dt)
print("\nConfusion Matrix:")
print(conf_matrix_dt)

Accuracy: 0.7607407407407407
Precision: 0.7103518865484459
Recall: 0.7607407407407407
F1 Score: 0.7157422759165307

Confusion Matrix:
[[1953  127]
 [ 519  101]]


### Random Forest Classifier

In [135]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# Define hyperparameter grid for RandomForestClassifier
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10],
    'min_samples_split': [5, 10],
    'max_features': ['sqrt', 'log2']
}

# Create RandomForestClassifier model
model = RandomForestClassifier()

# Set up k-fold cross-validation (StratifiedKFold for classification tasks)
k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Grid search for hyperparameter tuning with k-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=k_fold, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Print the best hyperparameters
print("Best Hyperparameters:", best_params)

# Access the best model (GridSearchCV refits it on all of X_train with the best parameters)
best_model = grid_search.best_estimator_


# Optionally, you can also access cross-validation results
cv_results = grid_search.cv_results_
mean_scores = cv_results['mean_test_score']
std_scores = cv_results['std_test_score']
# Print mean cross-validation scores for each hyperparameter combination
print("Mean Cross-Validation Scores:")
for params, mean_score,std_score in zip(cv_results['params'], mean_scores,std_scores):
    print(f"{params}: {mean_score:.4f},{std_score:.4f}")

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
             estimator=RandomForestClassifier(),
             param_grid={'max_depth': [5, 10], 'max_features': ['sqrt', 'log2'],
                         'min_samples_split': [5, 10],
                         'n_estimators': [50, 100, 150]},
             scoring='accuracy')

Best Hyperparameters: {'max_depth': 10, 'max_features': 'log2', 'min_samples_split': 10, 'n_estimators': 50}
Mean Cross-Validation Scores:


TypeError: unsupported format string passed to numpy.ndarray.__format__

In [141]:
# Evaluate the best model on the validation set
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
y_pred = best_model.predict(X_val)
accuracy_rf = accuracy_score(y_val, y_pred)
precision_rf = precision_score(y_val, y_pred, average='weighted')
recall_rf = recall_score(y_val, y_pred, average='weighted')
f1_rf = f1_score(y_val, y_pred, average='weighted')
conf_matrix_rf = confusion_matrix(y_val, y_pred)

In [142]:
# Print the evaluation metrics
print("Accuracy:", accuracy_rf)
print("Precision:", precision_rf)
print("Recall:", recall_rf)
print("F1 Score:", f1_rf)
print("\nConfusion Matrix:")
print(conf_matrix_rf)

Accuracy: 0.7748148148148148
Precision: 0.7614041757754332
Recall: 0.7748148148148148
F1 Score: 0.6859791328927132

Confusion Matrix:
[[2072    8]
 [ 600   20]]
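
Since the goal is to compare models on the evaluation metrics, it helps to put the three reported validation confusion matrices side by side and look at the default class only. The helper `positive_class_metrics` below is a hypothetical name introduced for this sketch. The counts are copied from the outputs above.

```python
# Validation confusion matrices reported above, as (tn, fp, fn, tp).
reported = {
    "Logistic Regression": (2080, 0, 620, 0),
    "Decision Tree": (1953, 127, 519, 101),
    "Random Forest": (2072, 8, 600, 20),
}

def positive_class_metrics(tn, fp, fn, tp):
    """Return precision, recall and F1 for the default (positive) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

summary = {name: positive_class_metrics(*cm) for name, cm in reported.items()}
for name, (p, r, f1) in summary.items():
    print(f"{name:<20} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Model that finds the largest share of actual defaulters.
best_by_recall = max(summary, key=lambda name: summary[name][1])
print("Highest recall on defaulters:", best_by_recall)
```

On these numbers the decision tree has the lowest accuracy of the three but the highest recall on defaulters (101 of 620), which the weighted averages do not make obvious.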


### XGBoost

In [151]:
from xgboost import XGBClassifier

# Define hyperparameter grid
param_grid = {'learning_rate': [0.01, 0.1, 0.2],
              'max_depth': [3, 5, 7],
              'n_estimators': [50, 100, 200]}

# Create XGBoost model
model = XGBClassifier()

# Set up k-fold cross-validation (StratifiedKFold for classification tasks)
k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Grid search for hyperparameter tuning with k-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=k_fold, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Print the best hyperparameters
print("Best Hyperparameters:", best_params)

# Access the best model (GridSearchCV refits it on all of X_train with the best parameters)
best_model = grid_search.best_estimator_


# Optionally, you can also access cross-validation results
cv_results = grid_search.cv_results_
mean_scores = cv_results['mean_test_score']
std_scores = cv_results['std_test_score']
# Print mean cross-validation scores for each hyperparameter combination
print("Mean Cross-Validation Scores:")
for params, mean_score,std_score in zip(cv_results['params'], mean_scores,std_scores):
    print(f"{params}: {mean_score:.4f},{std_score:.4f}")

KeyboardInterrupt: 
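
The grid above has 27 combinations, so 5-fold cross-validation needs 135 fits, and the run was interrupted before it finished. One way to cut the cost is to sample a few combinations with `RandomizedSearchCV`. In the sketch below, scikit-learn's `GradientBoostingClassifier` is swapped in for `XGBClassifier` so that it runs without xgboost installed. It happens to accept the same three parameter names. The synthetic data, `n_iter=5` and the 3-fold setting are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Hypothetical small data set standing in for X_train, y_train.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.78, 0.22], random_state=0
)

# Same value lists as the XGBoost cell above: 27 combinations in total.
param_distributions = {
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "n_estimators": [50, 100, 200],
}

# GradientBoostingClassifier is a stand-in for XGBClassifier in this sketch.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=5,  # sample 5 of the 27 combinations
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="accuracy",
    random_state=0,
)
search.fit(X_demo, y_demo)

print("Combinations tried:", len(search.cv_results_["params"]))
print("Best parameters:", search.best_params_)
```

Both `GridSearchCV` and `RandomizedSearchCV` also accept `n_jobs=-1` to spread the fits across available CPU cores.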

In [None]:
# Evaluate the best model on the validation set
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
y_pred = best_model.predict(X_val)
accuracy_xg = accuracy_score(y_val, y_pred)
precision_xg = precision_score(y_val, y_pred, average='weighted')
recall_xg = recall_score(y_val, y_pred, average='weighted')
f1_xg = f1_score(y_val, y_pred, average='weighted')
conf_matrix_xg = confusion_matrix(y_val, y_pred)

In [None]:
# Print the evaluation metrics
print("Accuracy:", accuracy_xg)
print("Precision:", precision_xg)
print("Recall:", recall_xg)
print("F1 Score:", f1_xg)
print("\nConfusion Matrix:")
print(conf_matrix_xg)

### Neural Network

In [152]:
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build the neural network model
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    layers.Dropout(0.5),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train_scaled, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model on the test set
y_pred_probs = model.predict(X_test_scaled)
y_pred = (y_pred_probs > 0.5).astype(int)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

ModuleNotFoundError: No module named 'tensorflow'