# 📊 Loan Default Prediction
This project presents a professional machine learning pipeline to predict loan default risk using borrower data.

**Author:** Reda El Gadida  
**Tools:** Python, Pandas, Scikit-learn, Seaborn, Jupyter Notebook

---

## 🧭 Project Overview
This notebook demonstrates how to build and evaluate a predictive model for loan default risk. We simulate a real-world scenario where a financial institution wants to automate credit risk assessment.

**Steps covered:**
- Load and explore the dataset
- Clean and preprocess the data
- Train a classification model
- Evaluate model performance
- Provide insights and conclusions
---


### Understanding the Datasets

#### Train vs. Test
In this competition, you’ll gain access to two datasets sampled from the past borrowers of a financial institution. Each contains information about the individual and the specific loan. One dataset is titled `train.csv` and the other `test.csv`.

`train.csv` contains 70% of the overall sample (255,347 borrowers to be exact) and importantly, will reveal whether or not the borrower has defaulted on their loan payments (the “ground truth”).

The `test.csv` dataset contains the exact same information about the remaining segment of the overall sample (109,435 borrowers to be exact), but does not disclose the “ground truth” for each borrower. It’s your job to predict this outcome!

Using the patterns you find in the `train.csv` data, predict whether the borrowers in `test.csv` will default on their loan payments, or not.

#### Dataset descriptions
Both `train.csv` and `test.csv` contain one row for each unique loan. Each loan is identified by `LoanID` and is represented by a single observation taken while the loan was active.

In addition to this identifier column, the `train.csv` dataset also contains the target label for the task, a binary column `Default` which indicates if a borrower has defaulted on payments.

Besides that column, both datasets have an identical set of features that can be used to train your model to make predictions. Below you can see descriptions of each feature. Familiarize yourself with them so that you can harness them most effectively for this machine learning task!

In [None]:
# --- Code Block ---
import pandas as pd
data_descriptions = pd.read_csv('data_descriptions.csv')
pd.set_option('display.max_colwidth', None)
data_descriptions

In [None]:
# --- Code Block ---
# Import required packages

# Data packages
import pandas as pd
import numpy as np

# Machine Learning / Classification packages
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier


# Visualization Packages
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

### Load the Data

Let's start by loading `train.csv` into a dataframe `train_df` and `test.csv` into a dataframe `test_df`, then display the shape of each dataframe.

In [None]:
# --- Code Block ---
train_df = pd.read_csv("train.csv")
print('train_df Shape:', train_df.shape)
train_df.head()

In [None]:
# --- Code Block ---
test_df = pd.read_csv("test.csv")
print('test_df Shape:', test_df.shape)
test_df.head()

### Explore, Clean, Validate, and Visualize the Data (optional)

Feel free to explore, clean, validate, and visualize the data however you see fit for this competition to help determine or optimize your predictive model. Please note that the final autograding depends only on the quality of the `prediction_df` predictions, which the hidden grading cell scores with AUC.
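
As a starting point for this kind of exploration, the sketch below runs three quick checks (missing values, duplicate identifiers, and class balance) on a tiny hand-made frame. The values and the `Income` column are illustrative assumptions, not rows from `train.csv`; the same calls can be pointed at `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical values, not real borrower data).
toy_df = pd.DataFrame({
    "LoanID": ["A1", "B2", "C3", "D4", "E5"],
    "Income": [50000, 62000, None, 48000, 75000],
    "Default": [0, 0, 1, 0, 0],
})

# Count missing values per column.
missing_per_column = toy_df.isna().sum()

# Each loan should appear exactly once.
duplicate_ids = toy_df["LoanID"].duplicated().sum()

# Share of loans that defaulted; a low rate signals class imbalance.
default_rate = toy_df["Default"].mean()

print(missing_per_column)
print("duplicate LoanIDs:", duplicate_ids)
print("default rate:", default_rate)
```

A strongly imbalanced target is one reason to prefer a ranking metric such as AUC over plain accuracy and to stratify any validation split.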

In [None]:
# --- Code Block ---
# Separate the features from the identifier (LoanID) and the target (Default)
X_train = train_df.drop(columns=['LoanID', 'Default'])
y_train = train_df['Default']
X_test = test_df.drop(columns='LoanID')

In [None]:
# --- Code Block ---
# One-hot encode the categorical (object-typed) columns
categorical_cols = X_train.select_dtypes(include="object").columns
X_train_encoded = pd.get_dummies(X_train, columns=categorical_cols)
X_test_encoded = pd.get_dummies(X_test, columns=categorical_cols)

# Align the test columns to the training columns; columns missing from the test set are filled with 0
X_train_encoded, X_test_encoded = X_train_encoded.align(X_test_encoded, join="left", axis=1, fill_value=0)
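
To see what the `align` call does, here is a minimal sketch on two toy frames. The column names and category values are made up for illustration. The test frame lacks one training category and contains one the training frame never saw.

```python
import pandas as pd

# Toy frames (hypothetical values): "PhD" appears only in train, "Other" only in test.
toy_train = pd.DataFrame({"Income": [50, 60, 70], "Education": ["HS", "PhD", "HS"]})
toy_test = pd.DataFrame({"Income": [55, 65], "Education": ["HS", "Other"]})

toy_train_enc = pd.get_dummies(toy_train, columns=["Education"])
toy_test_enc = pd.get_dummies(toy_test, columns=["Education"])

# join="left" keeps exactly the training columns: columns missing from the
# test frame are added and filled with 0, and test-only columns are dropped.
toy_train_enc, toy_test_enc = toy_train_enc.align(
    toy_test_enc, join="left", axis=1, fill_value=0
)

print(list(toy_test_enc.columns))
print(toy_test_enc)
```

After alignment both frames share the same columns in the same order, which is what a fitted scaler or model expects at prediction time.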

In [None]:
# --- Code Block ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_test_scaled = scaler.transform(X_test_encoded)

# Hold out a stratified 20% validation split
X_train_final, X_val, y_train_final, y_val = train_test_split(
    X_train_scaled, y_train, test_size=0.2, random_state=42, stratify=y_train
)
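
The cell above fits the scaler on all training rows before splitting, so the validation rows contribute to the scaling statistics. One possible variation, sketched here on synthetic data rather than the loan dataset, is to split first and fit the scaler on the training part only. The `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42
)

# Split first, stratifying on the label so both parts keep a similar positive rate.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training part only, then reuse it for validation.
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_va_scaled = demo_scaler.transform(X_va)

print(f"positive rate: train={y_tr.mean():.3f}, validation={y_va.mean():.3f}")
```

For tree-based models the effect of scaling is usually small, so this matters more if a scale-sensitive model is added later.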

### Make predictions (required)

Remember to create a dataframe named `prediction_df` with exactly 109,435 entries plus a header row, containing the predicted likelihood that each borrower in `test_df` defaults on their loan. Your submission will throw an error if you have extra columns (beyond `LoanID` and `predicted_probability`) or extra rows.

The file should have exactly 2 columns:
- `LoanID` (sorted in any order)
- `predicted_probability` (contains your numeric predicted probabilities between 0 and 1, e.g. from `estimator.predict_proba(X)[:, 1]`)

The names of the dataframe and its columns are critical for our autograding, so please make sure to use exactly `prediction_df` with the column names `LoanID` and `predicted_probability`!
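
The format rules above can be checked before submission with a small helper. The sketch below is an illustration only: `check_submission_format` is a hypothetical function, not part of the autograder, and the toy frame stands in for the real `prediction_df`.

```python
import pandas as pd

def check_submission_format(df, expected_rows):
    """Return a list of format problems (hypothetical helper, not the autograder)."""
    problems = []
    if list(df.columns) != ["LoanID", "predicted_probability"]:
        problems.append("columns must be exactly LoanID, predicted_probability")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if "predicted_probability" in df.columns:
        # between() is False for NaN, so missing probabilities are flagged too.
        if not df["predicted_probability"].between(0, 1).all():
            problems.append("probabilities must lie in [0, 1]")
    return problems

# Toy frame with made-up IDs and probabilities.
toy_prediction_df = pd.DataFrame({
    "LoanID": ["A1", "B2", "C3"],
    "predicted_probability": [0.12, 0.80, 0.05],
})

print(check_submission_format(toy_prediction_df, expected_rows=3))
```

For the real submission the call would use `expected_rows=109435`.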

#### Example prediction submission:

The original template used a Dummy Classifier purely to illustrate the required submission format. The cells below replace it with three tree-based models (random forest, gradient boosting, and XGBoost), compare their AUC on the validation split, and use gradient boosting to generate `prediction_df`.

**PLEASE CHANGE CODE BELOW TO IMPLEMENT YOUR OWN PREDICTIONS**

In [None]:
# --- Code Block ---
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_final, y_train_final)
rf_val_proba = rf_model.predict_proba(X_val)[:, 1]
rf_auc = roc_auc_score(y_val, rf_val_proba)

In [None]:
# --- Code Block ---
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train_final, y_train_final)
gb_val_proba = gb_model.predict_proba(X_val)[:, 1]
gb_auc = roc_auc_score(y_val, gb_val_proba)

In [None]:
# --- Code Block ---
# Train XGBoost
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(X_train_final, y_train_final)
xgb_val_proba = xgb_model.predict_proba(X_val)[:, 1]
xgb_auc = roc_auc_score(y_val, xgb_val_proba)

In [None]:
# --- Code Block ---
# Display AUC scores
print(f"Random Forest AUC: {rf_auc:.4f}")
print(f"Gradient Boosting AUC: {gb_auc:.4f}")
print(f"XGBoost AUC: {xgb_auc:.4f}")
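
For readers less familiar with the metric, ROC AUC can be read as the fraction of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score. The sketch below checks that reading against `roc_auc_score` on four toy values, which are illustrative and not results from this notebook.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Toy labels and scores (illustrative values only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

positives = [s for s, t in zip(scores, y_true) if t == 1]
negatives = [s for s, t in zip(scores, y_true) if t == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half.
pairwise = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(positives, negatives)
) / (len(positives) * len(negatives))

auc = roc_auc_score(y_true, scores)
constant_auc = roc_auc_score(y_true, [0.5] * 4)

print(pairwise)       # 0.75
print(auc)            # 0.75
print(constant_auc)   # 0.5: a constant score carries no ranking information
```

An AUC near 0.5 therefore means a model ranks borrowers no better than a constant guess, which gives a reference point for the three scores printed above.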

In [None]:
# --- Code Block ---
# Train the final model with GradientBoostingClassifier and score the test set
final_model = GradientBoostingClassifier(random_state=42)
final_model.fit(X_train_final, y_train_final)
y_test_proba = final_model.predict_proba(X_test_scaled)[:, 1]

# Assemble the submission dataframe in the required two-column format
prediction_df = pd.DataFrame({
    'LoanID': test_df['LoanID'],
    'predicted_probability': y_test_proba
})

**PLEASE CHANGE CODE ABOVE TO IMPLEMENT YOUR OWN PREDICTIONS**

### Final Tests - **IMPORTANT** - the cells below must be run prior to submission

Below are some tests to ensure your submission is in the correct format for autograding. The autograding process accepts a csv `prediction_submission.csv`, which we will generate from our `prediction_df` below. Please run the tests below and ensure no assertion errors are thrown.

In [None]:
# --- Code Block ---
# FINAL TEST CELLS - please make sure all of your code is above these test cells

# Writing to csv for autograding purposes
prediction_df.to_csv("prediction_submission.csv", index=False)
submission = pd.read_csv("prediction_submission.csv")

assert isinstance(submission, pd.DataFrame), 'You should have a dataframe named prediction_df.'

In [None]:
# --- Code Block ---
# FINAL TEST CELLS - please make sure all of your code is above these test cells

assert submission.columns[0] == 'LoanID', 'The first column name should be LoanID.'
assert submission.columns[1] == 'predicted_probability', 'The second column name should be predicted_probability.'

In [None]:
# --- Code Block ---
# FINAL TEST CELLS - please make sure all of your code is above these test cells

assert submission.shape[0] == 109435, 'The dataframe prediction_df should have 109435 rows.'

In [None]:
# --- Code Block ---
# FINAL TEST CELLS - please make sure all of your code is above these test cells

assert submission.shape[1] == 2, 'The dataframe prediction_df should have 2 columns.'

In [None]:
# --- Code Block ---
# FINAL TEST CELLS - please make sure all of your code is above these test cells

## This cell calculates the auc score and is hidden. Submit Assignment to see AUC score.

## ✅ Conclusion
We developed a basic machine learning pipeline to predict loan default risk. It compares random forest, gradient boosting, and XGBoost classifiers on a validation split and uses gradient boosting for the final predictions. This approach can be extended with hyperparameter tuning and feature engineering techniques to improve performance. All source code and analysis are fully documented for transparency and reuse.

For business use, model monitoring, explainability, and fairness checks should also be included.
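
As one possible starting point for the explainability check, the sketch below computes permutation importance with scikit-learn on synthetic data. The data, the model settings, and the `feature_i` labels are assumptions for illustration; nothing here reflects results on the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan features (not the real dataset).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn
)

demo_model = GradientBoostingClassifier(n_estimators=50, random_state=42)
demo_model.fit(X_fit, y_fit)

# Shuffle each feature on held-out data and measure the drop in AUC.
result = permutation_importance(
    demo_model, X_hold, y_hold, scoring="roc_auc", n_repeats=5, random_state=42
)

# Print features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature_{idx}: {mean:.4f} +/- {std:.4f}")
```

Features whose shuffling barely changes the held-out AUC contribute little to the model's ranking, which can guide both feature review and later fairness analysis.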