# Credit Card Defaulter Prediction

## Problem Statement:
With the rapid growth of the financial industry, commercial banks face significant threats from credit risks. One of the critical challenges is predicting the likelihood of credit card defaults. This task involves assessing the probability of credit default based on the credit card owner's characteristics and payment history.

## Objective:
The goal is to develop a machine learning model that can predict the probability of credit default. This model will help banks and financial institutions mitigate risks and make informed lending decisions.

## Expected Results:
Build a robust solution capable of predicting the probability of credit default, enabling financial institutions to assess risk effectively and make data-driven decisions.


## Step 1: Load and Explore the Dataset

In this step, we load the `UCI_Credit_Card.csv` dataset with `pandas` and display the first few rows to get an overview of the data structure.
We also check for missing values and drop any rows that contain them. The output below reports zero missing values in every column, so `dropna()` leaves all 30,000 rows in place here; a different strategy may suit other datasets.


In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("UCI_Credit_Card.csv")

# Preview the first rows (print explicitly: a notebook only renders a cell's last expression)
print(df.head())

# Check for missing values
print(df.isna().sum())

# Handle missing values (if any)
# For now, let's drop rows with missing values if there are any (can be modified based on strategy)
df_cleaned = df.dropna()

# Display the cleaned data
df_cleaned.info()

ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default.payment.next.month    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------          
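
Dropping rows is only one way to handle missing values. As a minimal sketch of an alternative, the snippet below fills gaps with each column's median. The `toy` frame and its values are made up for illustration and are not taken from the dataset, which has no missing values to begin with.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit card data (values are invented)
toy = pd.DataFrame({
    "LIMIT_BAL": [20000.0, np.nan, 90000.0, 50000.0],
    "AGE": [24.0, 26.0, np.nan, 37.0],
})

# Fill each numeric column's gaps with that column's median
toy_imputed = toy.fillna(toy.median(numeric_only=True))

print(toy_imputed)
```

Median imputation keeps every row, at the cost of slightly shrinking the variance of the imputed columns.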

## Step 2: Feature Selection

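Here, we separate the target variable `default.payment.next.month` from the other features and drop the `ID` column, which carries no predictive information. Using `SelectKBest` with the ANOVA F-test (`f_classif`), we keep the 8 features with the highest F-scores, that is, the features whose means differ most between defaulters and non-defaulters. The F-test scores each feature on its own, so it does not capture interactions between features. These selected features will be used for modeling.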

In [2]:
from sklearn.feature_selection import SelectKBest, f_classif

# Separating the target variable (default payment) and features
X = df_cleaned.drop(columns=['default.payment.next.month', 'ID'])
y = df_cleaned['default.payment.next.month']

# Applying SelectKBest to pick the top 8 features
selector = SelectKBest(score_func=f_classif, k=8)
X_new = selector.fit_transform(X, y)

# Display the selected features
selected_features = selector.get_support(indices=True)
print(X.columns[selected_features])

Index(['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
       'PAY_AMT1'],
      dtype='object')
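
To see why particular features were kept, it can help to inspect the scores behind the selection. The sketch below runs the same `SelectKBest` and `f_classif` combination on synthetic data from `make_classification` (a stand-in, not the credit card data) and tabulates each feature's F-score and p-value. The names `X_demo`, `y_demo`, `selector_demo`, and `ranking` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for X and y (not the real credit card data)
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

selector_demo = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)

# Pair each column with its F-score and p-value, highest score first
ranking = pd.DataFrame({
    "feature": X_demo.columns,
    "f_score": selector_demo.scores_,
    "p_value": selector_demo.pvalues_,
    "selected": selector_demo.get_support(),
}).sort_values("f_score", ascending=False)

print(ranking)
```

Applied to the notebook's own `selector` and `X`, the same table would show how far the eighth feature's score sits from the ninth.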


## Step 3: Feature Scaling

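In this step, we standardize the selected features using `StandardScaler` so that each feature has zero mean and unit variance. This matters for algorithms such as SVM and KNN, which rely on distances between samples and are therefore sensitive to feature scales. Note that the scaler here is fitted on the full dataset before the train/test split, so test-set statistics influence the scaling; fitting it on the training portion only would avoid this leakage.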

In [3]:
from sklearn.preprocessing import StandardScaler

# Scaling the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_new)

# Display the scaled data
print(X_scaled[:5])

[[-1.13672015  1.79456386  1.78234817 -0.69666346 -0.66659873 -1.53004603
  -1.48604076 -0.34194162]
 [-0.3659805  -0.87499115  1.78234817  0.1388648   0.18874609  0.23491652
   1.99231551 -0.34194162]
 [-0.59720239  0.01486052  0.1117361   0.1388648   0.18874609  0.23491652
   0.25313738 -0.25029158]
 [-0.90549825  0.01486052  0.1117361   0.1388648   0.18874609  0.23491652
   0.25313738 -0.22119058]
 [-0.90549825 -0.87499115  0.1117361  -0.69666346  0.18874609  0.23491652
   0.25313738 -0.22119058]]
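
As a quick check of what `StandardScaler` computes, the sketch below compares its output with a manual z-score, `(x - mean) / std`, on a small made-up array. The values in `demo` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up array: two columns on very different scales
demo = np.array([
    [20000.0, 2.0],
    [120000.0, -1.0],
    [90000.0, 0.0],
    [50000.0, 0.0],
])

scaled = StandardScaler().fit_transform(demo)

# Manual z-score per column; StandardScaler uses the population std (ddof=0)
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```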


## Step 4: Split the Data for Training and Testing

In this step, we split the scaled features and the target variable into training and testing sets. We allocate 70% of the data to training and 30% to testing, and set `random_state=42` for reproducibility.

In [4]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
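
When the target classes are imbalanced, a plain random split can leave the two sets with slightly different class ratios. The sketch below, on made-up data, passes `stratify` to `train_test_split` to preserve the ratio and fits the scaler on the training portion only. The names `X_demo`, `y_demo`, and `scaler_demo`, as well as the positive rate used to generate the labels, are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up features and an imbalanced binary label
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.22).astype(int)

# stratify keeps the share of positives (nearly) equal in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit the scaler on training data only, then apply it to both sets
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(y_demo.mean(), y_tr.mean(), y_te.mean())
```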

## Step 5: Perform Model Training with Cross-Validation and Grid Search
In this step, we train the models using **Grid Search with 5-fold Cross-Validation** to optimize hyperparameters. Grid Search fits each model with every combination of the listed hyperparameter values and keeps the combination with the best mean cross-validated score, which here is the F1-score.

The models evaluated include:

1. **Logistic Regression**: We vary the penalty type (L1/L2) and the regularization strength (C).
2. **Decision Tree**: We tune the maximum depth and minimum samples required to split a node.
3. **K-Nearest Neighbors (KNN)**: We test different numbers of neighbors, weight functions, and distance metrics.
4. **Support Vector Machine (SVM)**: We try different regularization strengths (C) and kernel types (linear and RBF).

We evaluate each tuned model on the test set using precision, recall, accuracy, and F1-score, and select the model with the highest F1-score.
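
One possible variation, sketched here under assumptions, is to wrap feature selection and scaling in a `Pipeline` so that they are refitted inside every cross-validation fold rather than once on the full dataset. The example uses synthetic data from `make_classification` and only the logistic regression grid; the step names `select`, `scale`, and `clf` are arbitrary labels chosen for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit card data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=5,
    weights=[0.78, 0.22], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Selection and scaling are refitted on the training part of each CV fold
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=8)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid_demo = {
    "clf__penalty": ["l1", "l2"],
    "clf__C": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid_demo, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Mean cross-validated F1: {search.best_score_:.4f}")
```

The fitted `search` object can then be used like any estimator, for example `search.predict(X_te)`.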

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV

# Define models and hyperparameter grids for GridSearchCV
param_grids = {
    "Logistic Regression": {
        'penalty': ['l1', 'l2'],
        'C': [0.01, 0.1, 1, 10]
    },
    "Decision Tree": {
        'max_depth': [5, 10, 15],
        'min_samples_split': [2, 5, 10]
    },
    "KNN": {
        'n_neighbors': [3, 5, 7],
        'weights': ['uniform', 'distance'],
        'metric': ['euclidean', 'manhattan']
    },
    "SVM": {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf']
    }
}

# Define models
models = {
    "Logistic Regression": LogisticRegression(solver='liblinear'),
    "Decision Tree": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC()
}

# Dictionary to store the results
results = {}

# Function to perform Grid Search and evaluate the model
def evaluate_model_with_grid_search(model, param_grid, X_train, X_test, y_train, y_test):
    # Perform Grid Search with 5-fold cross-validation
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='f1', n_jobs=-1)
    
    # Fit the model
    grid_search.fit(X_train, y_train)
    
    # Get the best model
    best_model = grid_search.best_estimator_
    
    # Make predictions on the test set
    y_pred = best_model.predict(X_test)
    
    # Calculate metrics
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    # Return the best parameters and the metrics
    return {
        "Best Params": grid_search.best_params_,
        "Precision": precision,
        "Recall": recall,
        "Accuracy": accuracy,
        "F1-Score": f1
    }

# Perform Grid Search for each model and evaluate
for model_name, model in models.items():
    print(f"Performing Grid Search for {model_name}...")
    results[model_name] = evaluate_model_with_grid_search(model, param_grids[model_name], X_train, X_test, y_train, y_test)

# Display results for all models
for model_name, metrics in results.items():
    print(f"\n{model_name} Results:")
    print(f"Best Hyperparameters: {metrics['Best Params']}")
    for metric_name, score in metrics.items():
        if metric_name != "Best Params":
            print(f"{metric_name}: {score:.4f}")

# Find the name of the best model based on test-set F1-Score
best_model_name = max(results, key=lambda name: results[name]['F1-Score'])
print(f"\nBest model based on F1-Score: {best_model_name}")

Performing Grid Search for Logistic Regression...
Performing Grid Search for Decision Tree...
Performing Grid Search for KNN...




Performing Grid Search for SVM...
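
The stated objective is a probability of default, whereas `predict` returns hard 0/1 labels. As a minimal sketch on synthetic data, `predict_proba` on a fitted classifier returns one column per class, and the column for class 1 can be read as the estimated default probability. The names `clf` and `default_proba` are illustrative. Note that `SVC` only exposes `predict_proba` when constructed with `probability=True`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit card data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.78, 0.22], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# Columns follow clf.classes_; column 1 is the estimated probability of class 1
default_proba = clf.predict_proba(X_te)[:, 1]

print(default_proba[:5].round(3))
```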
