# Project Overview

**Predicting Loan Defaults**  
This project aims to develop a machine learning model that predicts whether a customer of a financial institution will default on a loan, using data from the loan application. Identifying likely defaulters accurately helps financial institutions make informed lending decisions, adjust loan terms, and manage financial risk.

**Problem**  
Loan default depends on many interacting factors, including customer demographics (e.g., age, income, profession) and background information such as house ownership and years in the current job. Traditional credit scoring models often rely on limited historical data and assume linear relationships, which can oversimplify risk assessment. Machine learning models can capture non-linear patterns and interactions among variables, which may improve the accuracy of default predictions.
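The difference between linear and non-linear models can be illustrated with a minimal sketch on synthetic data (not the loan dataset). The label below depends only on the interaction of two made-up features, so a linear decision boundary cannot separate the classes while a tree ensemble can. All names and values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (not the loan dataset): the label depends only on the
# interaction of the two features, i.e. whether they share the same sign.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_fit, X_eval, y_fit, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear_model = LogisticRegression().fit(X_fit, y_fit)
forest_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)

# Accuracy on the held-out quarter of the synthetic data.
linear_acc = linear_model.score(X_eval, y_eval)
forest_acc = forest_model.score(X_eval, y_eval)
print(f"Logistic Regression accuracy: {linear_acc:.2f}")
print(f"Random Forest accuracy:       {forest_acc:.2f}")
```

Whether the loan data contains interactions this strong is an open question that the modeling steps would need to answer.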

**Data**  
The dataset contains information provided by customers of a financial institution during the loan application process. It is sourced from the "Loan Prediction Based on Customer Behavior" dataset by Subham Jain, available on [Kaggle](https://www.kaggle.com/datasets/subhamjain/loan-prediction-based-on-customer-behavior). 

Data Overview Table:

| Column | Description | Storage Type | Semantic Type | Theoretical Range | Training Data Range |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Risk Flag | Defaulted on loan (0: No, 1: Yes) | Integer | Categorical (Binary) | [0, 1] | [0, 1] |
| Income | Income of the applicant | Integer | Numerical | [0, ∞) | [4K, 224K] |
| Age | Age of the applicant (in years) | Integer | Numerical | [18, ∞) | [20, 65] |
| Experience | Work experience (in years) | Integer | Numerical | [0, ∞) | [0, 42] |
| Profession | Applicant's profession | String | Categorical (Nominal) | Any profession [e.g., "Architect", "Dentist"] | 8 unique professions |
| Married | Marital status | String | Categorical (Binary) | ["single", "married"] | ["single", "married"] |
| House Ownership | Applicant owns or rents a house | String | Categorical (Nominal) | ["rented", "owned", "norent_noown"] | ["rented", "owned", "norent_noown"] |
| Car Ownership | Whether applicant owns a car | String | Categorical (Binary) | ["yes", "no"] | ["yes", "no"] |
| Current Job Years | Years in the current job | Integer | Numerical | [0, ∞) | [0, 40] |
| Current House Years | Years in the current house | Integer | Numerical | [0, ∞) | [0, 35] |
| City | City of residence | String | Categorical (Nominal) | Any city [e.g., "Mumbai", "Bangalore"] | 12 unique cities |
| State | State of residence | String | Categorical (Nominal) | Any state [e.g., "Maharashtra", "Tamil_Nadu"] | 5 unique states |
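A "Training Data Range" column like the one above could be derived directly from a DataFrame. The following is a minimal sketch, assuming numeric columns are summarized by their minimum and maximum and all other columns by their number of unique values. The helper `summarize_ranges` and the lowercase column names are hypothetical, and the rows are made up for illustration.

```python
import pandas as pd

def summarize_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: storage dtype and observed range."""
    rows = []
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_numeric_dtype(col):
            # Numeric columns: report the observed minimum and maximum.
            observed = f"[{col.min()}, {col.max()}]"
        else:
            # Everything else: report how many distinct values occur.
            observed = f"{col.nunique()} unique values"
        rows.append({"column": name, "dtype": str(col.dtype), "observed": observed})
    return pd.DataFrame(rows)

# Toy rows with hypothetical column names, for illustration only.
toy = pd.DataFrame({
    "income": [1303834, 6256451, 3991815],
    "age": [23, 41, 66],
    "married": ["single", "single", "married"],
})
summary = summarize_ranges(toy)
print(summary.to_string(index=False))
```

Running the same helper on the loaded training data would make it easy to check the observed ranges against the theoretical ones.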

The dataset consists of three CSV files:
1. `Training Data.csv`: Contains all the features along with the target variable (`Risk Flag`) and an `ID` column. 
2. `Test Data.csv`: Contains the features and an `ID` column for the test data, without the target variable.
3. `Sample Prediction Dataset.csv`: Contains the target variable (`Risk Flag`) and an `ID` column for the test data.
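Because the test features and their `Risk Flag` values live in separate files that share an `ID` column, it may be worth checking that the IDs line up before using them together. The sketch below is one way to do that with `DataFrame.merge`. The helper `join_on_id`, the toy frames, and the lowercase column names are hypothetical.

```python
import pandas as pd

def join_on_id(features: pd.DataFrame, labels: pd.DataFrame, id_col: str = "ID") -> pd.DataFrame:
    """Join features to labels, failing loudly if IDs repeat or are missing."""
    # validate="one_to_one" raises pandas.errors.MergeError on duplicate IDs.
    merged = features.merge(
        labels, on=id_col, how="outer", validate="one_to_one", indicator=True
    )
    # indicator=True adds a "_merge" column marking rows found in only one frame.
    unmatched = merged[merged["_merge"] != "both"]
    if not unmatched.empty:
        raise ValueError(f"{len(unmatched)} IDs appear in only one file")
    return merged.drop(columns="_merge")

# Toy frames with hypothetical column names, for illustration only.
toy_features = pd.DataFrame({"ID": [1, 2, 3], "age": [23, 41, 66]})
toy_labels = pd.DataFrame({"ID": [3, 1, 2], "risk_flag": [0, 0, 1]})
joined = join_on_id(toy_features, toy_labels)
print(joined)
```

Joining on `ID` rather than relying on row order avoids silent misalignment if the two files are sorted differently.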

Example Training Data:

| Risk Flag | Income    | Age | Experience | Profession         | Married | House Ownership | Car Ownership | Current Job Years | Current House Years | City      | State         |
| :-------- | :-------- | :-- | :--------- | :----------------- | :------ | :-------------- | :------------ | :---------------- | :------------------ | :-------- | :------------ |
| 0         | 1,303,834 | 23  | 3          | Mechanical_engineer | single  | rented          | no            | 3                 | 13                   | Rewa      | Madhya_Pradesh |
| 1         | 6,256,451 | 41  | 2          | Software_Developer | single  | rented          | yes           | 2                 | 12                   | Bangalore | Tamil_Nadu    |
| 0         | 3,991,815 | 66  | 4          | Technical_writer   | married | rented          | no            | 4                 | 10                   | Alappuzha | Kerala        |

**Objectives**  
The project aims to:
- Develop a machine learning model to predict loan defaults based on customer data from loan applications.
- Evaluate and compare different models (e.g., Logistic Regression, Random Forest, XGBoost) using metrics such as Accuracy, Precision, Recall, F1-Score, and AUC-ROC.
- Analyze feature importance to understand key factors driving loan default risk.
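The evaluation metrics named above can all be computed with `sklearn.metrics`. The following is a minimal sketch, assuming the model outputs a default probability per applicant and that a threshold of 0.5 turns it into a class prediction. The helper `evaluate`, the threshold, and the labels and scores are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

def evaluate(y_true, y_score, threshold=0.5):
    """Compute the five metrics from true labels and predicted default probabilities."""
    y_pred = [int(s >= threshold) for s in y_score]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # AUC-ROC uses the raw scores, not the thresholded predictions.
        "auc_roc": roc_auc_score(y_true, y_score),
    }

# Made-up labels and scores, for illustration only.
toy_true = [0, 0, 0, 0, 0, 0, 1, 1]
toy_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.7, 0.4]
metrics = evaluate(toy_true, toy_score)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy example accuracy is noticeably higher than precision and recall, which shows why accuracy alone can be misleading when one class is much rarer than the other.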

**Value Proposition**  
This project provides several benefits to financial institutions:
- Improved risk assessment through data-driven prediction of loan defaults.
- More efficient loan application processing through automated risk evaluation.
- Enhanced customer segmentation based on risk profiles.
- Better-informed lending decisions supported by machine learning insights.

**Business Requirements**  
The project must meet the following business requirements:
- **Predictive Accuracy:** Achieve an accuracy of at least 85% in predicting loan defaults.
- **Regulatory Compliance:** Ensure the model adheres to relevant financial regulations and ethical standards.
- **Scalability:** Design the model to handle large datasets and adapt to various loan products.
- **Interpretability:** Provide clear explanations for predictions to support decision-making and regulatory scrutiny.
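One possible way to approach the interpretability requirement is permutation importance, which measures how much a model's score drops when a single feature is shuffled. This is a sketch on synthetic data, not the project's chosen method. The feature labels `signal` and `noise` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: only the first feature carries signal; the second is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)

# Fit on the first 700 rows and measure importance on the remaining 300.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:700], y[:700])
result = permutation_importance(model, X[700:], y[700:], n_repeats=5, random_state=0)

for name, score in zip(["signal", "noise"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```

Because the importance is computed on held-out rows, it reflects what the model relies on for new applicants rather than what it memorized during training.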

**Technical Requirements**  
The project must meet the following technical requirements:
- **Data Preprocessing:** Implement data cleaning, transformation, and feature engineering using Python and libraries such as Pandas and Scikit-learn.
- **Exploratory Data Analysis (EDA):** Conduct thorough data analysis and visualization using tools like Seaborn and Matplotlib.
- **Modeling:** Train and evaluate multiple machine learning models, including Logistic Regression, Random Forest, and XGBoost, utilizing Scikit-learn and XGBoost libraries.
- **Model Evaluation:** Assess model performance using metrics such as Accuracy, Precision, Recall, F1-Score, and AUC-ROC.
- **Deployment:** Deploy the final model as a REST API for integration with existing loan processing systems.
  - **Scalability:** Ensure the model can process large volumes of loan applications efficiently.
  - **Cloud Infrastructure:** Utilize cloud platforms (e.g., AWS, Microsoft Azure, Google Cloud Platform) for scalable and secure deployment.
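The preprocessing and modeling requirements above can be combined into a single scikit-learn `Pipeline`, so that the same transformations are applied at training and prediction time. The following is a minimal sketch under stated assumptions: the lowercase column names are hypothetical (the real CSV headers may differ), the rows are made up, and Logistic Regression stands in for whichever model is finally selected.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real CSV headers may differ.
numeric_cols = ["income", "age"]
categorical_cols = ["married", "house_ownership"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
pipeline = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "income": [1_300_000, 6_200_000, 3_900_000, 800_000],
    "age": [23, 41, 66, 35],
    "married": ["single", "single", "married", "married"],
    "house_ownership": ["rented", "rented", "owned", "norent_noown"],
    "risk_flag": [0, 1, 0, 1],
})
feature_cols = numeric_cols + categorical_cols
pipeline.fit(toy[feature_cols], toy["risk_flag"])

# Predicted probability of default (class 1) for each row.
probabilities = pipeline.predict_proba(toy[feature_cols])[:, 1]
print(probabilities.round(3))
```

Keeping preprocessing inside the pipeline also simplifies deployment, since a single fitted object carries both the transformations and the model.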

Meeting these objectives and requirements would give financial institutions a tool for predicting loan defaults and supporting their lending decisions.

# Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data Loading

## CSV 
Load the three CSV files into Pandas DataFrames.

In [None]:
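try:
    # Training set: features, the target (`Risk Flag`), and an `ID` column.
    df_train = pd.read_csv("data/training_data.csv")
    # Test set: features and an `ID` column, without the target.
    X_test = pd.read_csv("data/test_data.csv")
    # `Risk Flag` values and `ID` column for the test set.
    y_test = pd.read_csv("data/sample_prediction_dataset.csv")
    print("Data loaded successfully.")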
except FileNotFoundError:
    print("Error: File not found. Please check the file path.")
except pd.errors.EmptyDataError:
    print("Error: The file is empty.")
except pd.errors.ParserError:
    print("Error: The file content could not be parsed as a CSV.")
except PermissionError:
    print("Error: Permission denied when accessing the file.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
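After loading, a quick sanity check of each DataFrame can catch problems early. The sketch below collects shape, missing values, duplicate rows, and class balance. The helper `basic_report`, the toy frame, and the target column name `risk_flag` are hypothetical; the actual column name in the CSV may differ.

```python
import pandas as pd

def basic_report(df: pd.DataFrame, target: str) -> dict:
    """Collect shape, missing-value count, duplicate rows, and class balance."""
    return {
        "shape": df.shape,
        "missing": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # Share of rows per target class, e.g. {0: 0.75, 1: 0.25}.
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }

# Toy frame with a hypothetical target column name, for illustration only.
toy = pd.DataFrame({"age": [23, 41, 66, None], "risk_flag": [0, 1, 0, 0]})
report = basic_report(toy, target="risk_flag")
print(report)
```

The class share is worth noting before modeling, since a strongly imbalanced target affects how metrics such as accuracy should be read.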