# Loan Default Prediction Dataset

## Overview
This dataset supports predicting which borrowers are most likely to default on their loan payments, a binary classification problem that is common in the financial services industry.

Banks and other financial institutions use such data to reduce payment defaults and encourage timely loan repayments. A machine learning model can flag high-risk borrowers so that organisations can intervene before a default occurs.

The dataset originates from **Coursera's Loan Default Prediction Challenge**, offering a unique resource to test and enhance modelling skills.

---

## Columns
The dataset contains **18 distinct columns**, each providing specific details about the loan and the borrower:

| Column Name | Description | Range / Values |
| :--- | :--- | :--- |
| **LoanID** | A unique identifier for each loan. | Unique string/number |
| **Age** | The age of the borrower. | 18 to 69 |
| **Income** | The annual income of the borrower. | £15,000 to £150,000 |
| **LoanAmount** | The specific amount of money borrowed. | £5,000 to £250,000 |
| **CreditScore** | The credit score of the borrower. | 300 to 849 |
| **MonthsEmployed** | The number of months the borrower has been employed. | 0 to 119 months |
| **NumCreditLines** | The total number of open credit lines the borrower possesses. | 1 to 4 |
| **InterestRate** | The interest rate applied to the loan. | 2% to 25% |
| **LoanTerm** | The duration of the loan in months. | 12, 24, 36, 48, or 60 months |
| **DTIRatio** | The Debt-to-Income ratio. | 0.1 to 0.9 |
| **Education** | The highest level of education attained by the borrower. | `Bachelor's`, `High School`, `Other` |
| **EmploymentType** | The borrower's employment status. | `Part-time`, `Unemployed`, `Other` |
| **MaritalStatus** | The marital status of the borrower. | `Married`, `Divorced`, `Other` |
| **HasMortgage** | A boolean indicating whether the borrower has a mortgage. | `true` / `false` |
| **HasDependents** | A boolean indicating whether the borrower has dependents. | `true` / `false` |
| **LoanPurpose** | The stated purpose for which the loan was taken. | `Business`, `Home`, `Other` |
| **HasCoSigner** | A boolean indicating if the loan has a co-signer. | `true` / `false` |
| **Default** | **The target variable**, indicating whether the loan defaulted or not. | `0` (No) or `1` (Yes) |
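
The numeric ranges in the table above can be checked against the loaded data. The following is a minimal sketch, not part of the original notebook: `DOCUMENTED_RANGES` and `count_out_of_range` are hypothetical names, the bounds are copied from the table, and it assumes `InterestRate` is stored as a percentage (for example `2.0` rather than `0.02`).

```python
import pandas as pd

# Inclusive numeric ranges copied from the column table.
# Assumption: InterestRate is stored in percent units.
DOCUMENTED_RANGES = {
    "Age": (18, 69),
    "Income": (15_000, 150_000),
    "LoanAmount": (5_000, 250_000),
    "CreditScore": (300, 849),
    "MonthsEmployed": (0, 119),
    "NumCreditLines": (1, 4),
    "InterestRate": (2, 25),
    "DTIRatio": (0.1, 0.9),
}


def count_out_of_range(frame: pd.DataFrame, ranges: dict) -> pd.Series:
    """Return, per column, how many rows fall outside the documented range.

    Missing values count as out of range, because Series.between()
    evaluates to False for NaN. Columns absent from the frame are skipped.
    """
    counts = {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
        if col in frame.columns
    }
    return pd.Series(counts, dtype="int64")


# Toy rows for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    "Age": [25, 17, 69],
    "CreditScore": [300, 850, 700],
    "DTIRatio": [0.5, 0.1, 0.95],
})
violations = count_out_of_range(toy, DOCUMENTED_RANGES)
print(violations)
```

Once the CSV is loaded, calling `count_out_of_range(df, DOCUMENTED_RANGES)` should ideally return zero for every column; a non-zero count suggests that either the table or the data needs a second look.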

In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt


In [None]:
DATA_PATH = os.path.join("..", "Data", "raw", "Loan_default.csv")
df = pd.read_csv(DATA_PATH)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
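# LoanID is only a unique identifier, so drop it before analysis.
# errors='ignore' keeps the cell safe to re-run once the column is gone.
df = df.drop(columns=['LoanID'], errors='ignore')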

In [None]:
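# Class balance of the target; sort_index() keeps the bars in 0, 1 order
legend_labels = ['No Default', 'Default']
ax = df['Default'].value_counts(normalize=True).sort_index().plot(kind='bar')
ax.set_xticklabels(legend_labels, rotation=0)
plt.title('Proportion of Loan Defaults')
plt.xlabel('Default Status')
plt.ylabel('Proportion')
plt.show()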

In [None]:
df.duplicated().sum()

In [None]:
cat_df=df.select_dtypes(include=['object'])
cat_df.nunique()

In [None]:
cat_df.describe()

In [None]:
unique_values = {col: cat_df[col].unique().tolist() for col in cat_df.columns}
unique_values

In [None]:
# Every non-object column, selected by dtype so the column order stays stable
num_df = df.select_dtypes(exclude=['object'])
num_df.describe()

In [None]:
num_df.skew()

In [None]:
df.hist(bins=40, figsize=(15,10))
plt.tight_layout()
plt.show()

In [None]:
# Path for the processed dataset; note that this overwrites the raw-data DATA_PATH set earlier
DATA_PATH = os.path.join("..", "Data", "processed", "loan_data.csv")

In [None]:
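yesNoColumns = ["HasMortgage", "HasDependents", "HasCoSigner"]

# List comprehensions keep df's column order; set arithmetic does not
categorical_features = [c for c in df.select_dtypes(include=['object']).columns if c not in yesNoColumns]

# Everything that is left over, which includes the target column Default
numeric_features = [c for c in df.columns if c not in categorical_features and c not in yesNoColumns]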

In [None]:
categorical_features

In [None]:
# Sanity check: the three groups together should cover every column exactly once.
# Compare sorted lists, because the concatenated order differs from df.columns.
all_features = numeric_features + yesNoColumns + categorical_features
sorted(all_features) == sorted(df.columns)

In [None]:
numeric_features+ yesNoColumns + categorical_features

In [None]:
df.info()