# Loan Approval Prediction - Kaggle Competition

This notebook walks through a baseline solution to the Kaggle competition **Playground Series Season 4, Episode 10**. The goal of the competition is to predict loan approval status from features such as applicant demographics, financial background, and loan details.

## Problem Statement

Financial institutions need to make quick and accurate decisions when approving or rejecting loan applications. Incorrect approvals may lead to financial losses, while excessive rejections can drive customers to competitors. The goal is to develop a machine learning model that accurately predicts loan approvals, enabling institutions to balance risk and customer satisfaction.

## Dataset Description

- **train.csv**: Contains labeled data for model training.
- **test.csv**: Contains unlabeled data for which predictions need to be made.
- **loan_status**: The target column to predict, indicating loan approval (1 for approved, 0 for rejected).

The dataset includes both numerical and categorical features, with potential missing values and categorical variables requiring encoding.

## Solution Approach

We will follow these steps:

1. **Data Exploration**: Understand the dataset's structure, identify missing values, and analyze distributions.
2. **Data Preprocessing**: Handle missing values, encode categorical features, and scale numeric features if needed.
3. **Feature Engineering**: Create additional features and select the most informative ones.
4. **Model Training**: Use Random Forest as the baseline model and explore hyperparameter tuning.
5. **Evaluation**: Validate the model on a held-out split and with cross-validation, using metrics such as accuracy, F1-score, and ROC-AUC.
6. **Submission Preparation**: Generate predictions for the test set and prepare the submission file.


## Step 1: Importing Libraries

We begin by importing essential libraries for data manipulation, visualization, and machine learning.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## Step 2: Loading the Data

We load the training and test datasets. The training set contains labeled data with the target variable (`loan_status`), while the test set is unlabeled and used for predictions.


In [2]:
# Upload kaggle.json (the Kaggle API token) from the local machine
from google.colab import files
files.upload()

# Place the token where the Kaggle CLI expects it and restrict its permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download the competition data and unzip it into ./data
!kaggle competitions download -c playground-series-s4e10
!unzip -q playground-series-s4e10.zip -d data

Saving kaggle.json to kaggle.json
Downloading playground-series-s4e10.zip to /content
  0% 0.00/1.45M [00:00<?, ?B/s]
100% 1.45M/1.45M [00:00<00:00, 69.9MB/s]


In [19]:
# Load data
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')

# Display the first few rows
train_data.head()


| | id | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37 | 35000 | RENT | 0.0 | EDUCATION | B | 6000 | 11.49 | 0.17 | N | 14 | 0 |
| 1 | 1 | 22 | 56000 | OWN | 6.0 | MEDICAL | C | 4000 | 13.35 | 0.07 | N | 2 | 0 |
| 2 | 2 | 29 | 28800 | OWN | 8.0 | PERSONAL | A | 6000 | 8.9 | 0.21 | N | 10 | 0 |
| 3 | 3 | 30 | 70000 | RENT | 14.0 | VENTURE | B | 12000 | 11.11 | 0.17 | N | 5 | 0 |
| 4 | 4 | 22 | 60000 | RENT | 2.0 | MEDICAL | A | 6000 | 6.92 | 0.1 | N | 3 | 0 |
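
Before preprocessing, it can help to check missing values and the class balance of the target. The following is a minimal sketch on a tiny hand-made frame that borrows a few column names from `train.csv`; the frame `sample` is hypothetical, with one value blanked and one label flipped purely for illustration. The same calls could be applied to `train_data`.

```python
import pandas as pd

# Tiny hypothetical sample using a few column names from train.csv.
# One employment length is blanked and one label is set to 1 for illustration only.
sample = pd.DataFrame({
    "person_age": [37, 22, 29, 30, 22],
    "person_income": [35000, 56000, 28800, 70000, 60000],
    "person_home_ownership": ["RENT", "OWN", "OWN", "RENT", "RENT"],
    "person_emp_length": [0.0, 6.0, None, 14.0, 2.0],
    "loan_status": [0, 0, 0, 1, 0],
})

# Number of missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)

# Share of each class in the target column
class_share = sample["loan_status"].value_counts(normalize=True)
print(class_share)

# Summary statistics for the numeric columns
print(sample.describe())
```

A strongly skewed `class_share` would suggest looking beyond accuracy when evaluating the model.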


## Step 3: Data Preprocessing

Data preprocessing ensures that the dataset is clean and ready for modeling. This involves:
1. Separating numeric and categorical columns.
2. Handling missing values:
   - For numeric columns: Fill missing values with the column mean.
   - For categorical columns: Fill missing values with the column mode.
3. Encoding categorical variables using one-hot encoding to make them suitable for the model.


In [20]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Assumes train_data and test_data are already loaded (see In [19])

# Separate numeric and categorical columns using the training data
numeric_columns = train_data.select_dtypes(include=['number']).columns
categorical_columns = train_data.select_dtypes(include=['object']).columns

# The test set has no 'loan_status' column, so exclude the target from the feature columns
# Note: 'id' is numeric, so it stays in the feature matrix and is reused for the submission file
numeric_columns_test = numeric_columns.drop('loan_status')

# Fill missing numeric values with the training-set column means (applied to both sets)
numeric_means = train_data[numeric_columns_test].mean()
train_data[numeric_columns_test] = train_data[numeric_columns_test].fillna(numeric_means)
test_data[numeric_columns_test] = test_data[numeric_columns_test].fillna(numeric_means)

# Fill missing categorical values with the training-set mode
# (assign the result back instead of using inplace=True on a column selection)
for column in categorical_columns:
    mode_value = train_data[column].mode()[0]
    train_data[column] = train_data[column].fillna(mode_value)
    test_data[column] = test_data[column].fillna(mode_value)

# Fit the one-hot encoder on the training data only;
# handle_unknown='ignore' avoids errors if the test data has unseen categories
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore', drop='first')
encoder.fit(train_data[categorical_columns])

# Transform training and test data using the fitted encoder
encoded_columns = encoder.get_feature_names_out(categorical_columns)
encoded_train_df = pd.DataFrame(encoder.transform(train_data[categorical_columns]),
                                columns=encoded_columns, index=train_data.index)
encoded_test_df = pd.DataFrame(encoder.transform(test_data[categorical_columns]),
                               columns=encoded_columns, index=test_data.index)

# Concatenate encoded features with numeric features
train_data = pd.concat([train_data[numeric_columns], encoded_train_df], axis=1)
test_data = pd.concat([test_data[numeric_columns_test], encoded_test_df], axis=1)

# Split features and target
X = train_data.drop('loan_status', axis=1)
y = train_data['loan_status']


## Step 4: Model Training

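We train a Random Forest model, an ensemble of decision trees, to predict loan approvals. The data is split 80/20 into training and validation sets to evaluate the model, first with default hyperparameters and then with a randomized hyperparameter search (50 candidates, 5-fold cross-validation).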

In [21]:
# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy}")

Validation Accuracy: 0.9510614715662035
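
If the classes are imbalanced, accuracy alone can hide weak performance on the minority class. The sketch below shows how F1-score, ROC-AUC, and stratified cross-validation could be computed for a Random Forest. It runs on synthetic data from `make_classification`, not on the competition data, and the names `X_demo`, `y_demo`, and `clf` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the loan data (roughly 85% / 15%)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Stratified split keeps the class ratio the same in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_tr, y_tr)

# F1 uses hard class predictions; ROC-AUC uses the predicted probability of class 1
val_pred = clf.predict(X_va)
val_proba = clf.predict_proba(X_va)[:, 1]
f1 = f1_score(y_va, val_pred)
auc = roc_auc_score(y_va, val_proba)
print(f"F1: {f1:.3f}  ROC-AUC: {auc:.3f}")

# 5-fold stratified cross-validation scored by ROC-AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(f"CV ROC-AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```

On the notebook's own data, the same calls would take `model`, `X_val`, and `y_val` in place of the synthetic names.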


In [23]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}

rf = RandomForestClassifier(random_state=42)
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_grid,
                                   n_iter=50, cv=5, verbose=2, random_state=42, n_jobs=-1)

random_search.fit(X_train, y_train)
best_rf = random_search.best_estimator_
print(f"Best Parameters: {random_search.best_params_}")

# Evaluate the model
y_pred = random_search.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy}")

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Parameters: {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': None, 'max_depth': 20, 'bootstrap': True}
Validation Accuracy: 0.9516582828885668


## Step 5: Submission Preparation

Finally, we generate predictions for the test dataset and create a CSV file for submission to Kaggle.


In [24]:
# Predict on test data
predictions = random_search.predict(test_data) # test_data now has the same columns as training data

# Create submission file
submission = pd.DataFrame({'id': test_data['id'], 'loan_status': predictions})
submission.to_csv('submission.csv', index=False)
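
If the competition is scored on ROC-AUC over predicted probabilities (worth confirming on the competition's evaluation page), submitting probabilities rather than hard 0/1 labels is usually the better fit. The following is a minimal sketch on synthetic data; `X_fit`, `X_new`, `new_ids`, and `demo_submission` are hypothetical stand-ins for the notebook's training features, `test_data`, its `id` column, and `submission`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for the training features, labels, and test features
rng = np.random.default_rng(42)
X_fit = rng.normal(size=(200, 4))
y_fit = (X_fit[:, 0] + X_fit[:, 1] > 0).astype(int)
X_new = rng.normal(size=(5, 4))
new_ids = np.arange(1000, 1005)  # hypothetical ids

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_fit, y_fit)

# Locate the probability column that corresponds to class 1
positive_col = list(clf.classes_).index(1)
proba = clf.predict_proba(X_new)[:, positive_col]

# Same two-column layout as the submission file, with probabilities as values
demo_submission = pd.DataFrame({"id": new_ids, "loan_status": proba})
print(demo_submission)
```

In this notebook, the equivalent call would be `random_search.predict_proba(test_data)[:, 1]`, since `RandomizedSearchCV` delegates `predict_proba` to the refitted best estimator.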