# Project Overview

**Predicting Loan Defaults**  
This project aims to develop a machine learning model to predict loan defaults based on customer behavior. Accurately identifying potential defaulters enables financial institutions to make informed lending decisions, manage risk effectively, and maintain financial stability.

**Problem**  
Loan defaults are hard to predict because many factors interact: customer demographics (e.g., age, income, employment status), financial history (e.g., credit history, existing debts), and loan characteristics (e.g., loan amount, interest rate). Traditional credit scoring models may not capture these interactions, leading to suboptimal risk assessments. Machine learning offers a data-driven way to identify potential defaulters more accurately.

**Data**  
The dataset, "Loan Prediction Based on Customer Behavior" by Subham Jain, is available on [Kaggle](https://www.kaggle.com/datasets/subhamjain/loan-prediction-based-on-customer-behavior). It comprises customer information provided at the time of the loan application.

| Column               | Description                                     | Data Type            |
| :------------------- | :---------------------------------------------- | :------------------- |
| Risk Flag            | Target Variable: Defaulted on loan (No vs. Yes) | Categorical (String) |
| Income               | Income of the applicant                         | Numerical (Integer)  |
| Age                  | Age of the applicant (in years)                 | Numerical (Integer)  |
| Experience           | Work experience (in years)                      | Numerical (Integer)  |
| Profession           | Applicant's profession                          | Categorical (String) |
| Married              | Marital status (e.g., Single, Married)          | Categorical (String) |
| House Ownership      | Whether the applicant owns or rents a house     | Categorical (String) |
| Car Ownership        | Whether the applicant owns a car                | Categorical (String) |
| Current Job Years    | Years in the current job                        | Numerical (Integer)  |
| Current House Years  | Years in the current house                      | Numerical (Integer)  |
| City                 | City of residence                               | Categorical (String) |
| State                | State of residence                              | Categorical (String) |

**Example Data**

| Income (€) | Age | Experience (years) | Marital Status | House Ownership | Car Ownership | Current Job Years | Current House Years | Credit Score | Loan Amount (€) | Loan Term (years) | Interest Rate (%) | Loan Status |
| :-------- | :-- | :----------------- | :------------- | :-------------- | :------------ | :---------------- | :----------------- | :----------- | :------------- | :---------------- | :--------------- | :---------- |
| 50,000    | 35  | 10                | Married        | Own             | Yes           | 5                 | 7                  | 700          | 20,000         | 5                 | 5.5              | 0          |
| 30,000    | 28  | 5                 | Single         | Rent            | No            | 2                 | 3                  | 650          | 10,000         | 3                 | 6.0              | 1          |

**Objectives**  
The project aims to:

- Develop a machine learning model to accurately predict the likelihood of loan default.
- Evaluate and compare different models (e.g., Logistic Regression, Random Forest, XGBoost) using metrics such as Accuracy, Precision, Recall, F1-Score, and AUC-ROC.
- Identify key factors influencing loan defaults through feature importance analysis.
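
As a rough illustration of the threshold-based metrics named above, the sketch below computes Accuracy, Precision, Recall, and F1-Score from the confusion-matrix counts in plain Python. The helper `classification_metrics` and the label lists are made up for this example. In the project itself, the equivalent functions in `sklearn.metrics` would normally be used. AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-defaults correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-defaults wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defaults that were missed

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 10 applicants, 2 of whom defaulted.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_always_no_default = [0] * 10                  # trivial "nobody defaults" baseline
y_some_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # hypothetical model output

baseline_metrics = classification_metrics(y_true, y_always_no_default)
model_metrics = classification_metrics(y_true, y_some_model)
print("baseline:", baseline_metrics)
print("model:   ", model_metrics)
```

Both prediction lists reach an accuracy of 0.8 here, but only the second one detects any default (recall 0.5 versus 0.0). This is why the comparison reports Precision, Recall, and F1-Score alongside Accuracy.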

**Value Proposition**  
This project offers significant benefits to financial institutions:

- **Risk Management:** Enhance the ability to identify high-risk loan applicants, reducing default rates.
- **Operational Efficiency:** Streamline the loan approval process by automating risk assessments.
- **Customer Relationship Management:** Develop targeted strategies for customer segments based on predicted risk levels.

**Business Requirements**  
The project must meet the following business requirements:

- **Predictive Accuracy:** Achieve an accuracy of at least 85% in predicting loan defaults.
- **Regulatory Compliance:** Ensure the model adheres to relevant financial regulations and ethical standards.
- **Scalability:** Design the model to handle large datasets and adapt to various loan products.
- **Interpretability:** Provide clear explanations for predictions to support decision-making and regulatory scrutiny.

**Technical Requirements**  
The project must meet the following technical requirements:

- **Data Preprocessing:** Implement data cleaning, transformation, and feature engineering using Python and libraries such as Pandas and Scikit-learn.
- **Exploratory Data Analysis (EDA):** Conduct thorough data analysis and visualization using tools like Seaborn and Matplotlib.
- **Modeling:** Train and evaluate multiple machine learning models, including Logistic Regression, Random Forest, and XGBoost, utilizing Scikit-learn and XGBoost libraries.
- **Model Evaluation:** Assess model performance using metrics such as Accuracy, Precision, Recall, F1-Score, and AUC-ROC.
- **Deployment:** Deploy the final model as a REST API for integration with existing loan processing systems.
  - **Scalability:** Ensure the model can process large volumes of loan applications efficiently.
  - **Cloud Infrastructure:** Utilize cloud platforms (e.g., AWS, Microsoft Azure, Google Cloud Platform) for scalable and secure deployment.
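
The preprocessing, modeling, and evaluation steps listed above could be combined into a single scikit-learn pipeline. The following is a minimal sketch under stated assumptions, not the project's final implementation. It runs on synthetic data, the column names (`income`, `age`, `house_ownership`, `risk_flag`) and the rule that generates the target are invented for illustration, and Logistic Regression stands in for any of the listed models.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; names and values are assumptions, not the real dataset.
rng = np.random.default_rng(42)
n_rows = 400
toy = pd.DataFrame({
    "income": rng.integers(10_000, 100_000, size=n_rows),
    "age": rng.integers(21, 70, size=n_rows),
    "house_ownership": rng.choice(["rented", "owned"], size=n_rows),
})
# Made-up rule so the target is learnable: lower income -> higher default probability.
p_default = 1.0 / (1.0 + np.exp((toy["income"].to_numpy() - 40_000) / 15_000))
toy["risk_flag"] = (rng.random(n_rows) < p_default).astype(int)

numeric_cols = ["income", "age"]
categorical_cols = ["house_ownership"]

# Scale numeric features and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Keeping preprocessing inside the pipeline ensures it is fitted on training data only.
clf_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X = toy[numeric_cols + categorical_cols]
y = toy["risk_flag"]
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf_pipeline.fit(X_tr, y_tr)
val_scores = clf_pipeline.predict_proba(X_val)[:, 1]  # predicted probability of default
auc = roc_auc_score(y_val, val_scores)
print(f"Validation AUC-ROC: {auc:.3f}")
```

Swapping the `classifier` step for a Random Forest or XGBoost estimator would leave the rest of the pipeline unchanged, which keeps the model comparison like-for-like.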

By fulfilling these objectives and requirements, the project will provide a robust tool for predicting loan defaults, thereby enhancing decision-making processes within financial institutions.

# Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data Loading

## CSV
Load the training, test, and sample prediction CSV files into Pandas DataFrames.

In [None]:
# Load the training set, the test features, and the sample prediction file.
try:
    df_train = pd.read_csv("data/training_data.csv")
    X_test = pd.read_csv("data/test_data.csv")
    y_test = pd.read_csv("data/sample_prediction_dataset.csv")
except FileNotFoundError as e:
    print(f"Error: File not found. Please check the file path. ({e})")
    raise  # re-raise so later cells do not run with undefined DataFrames
except pd.errors.EmptyDataError:
    print("Error: The file is empty.")
    raise
except pd.errors.ParserError:
    print("Error: The file content could not be parsed as a CSV.")
    raise
except PermissionError:
    print("Error: Permission denied when accessing the file.")
    raise
else:
    # Runs only when all three files were read without an exception.
    print("Data loaded successfully.")
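
A successful read does not guarantee that a file has the expected schema, so a quick column check right after loading can catch problems early. The sketch below shows one way to do this. `load_csv_checked` is a hypothetical helper, and the header names in the in-memory sample (`Income`, `Age`, `Risk_Flag`) are assumptions, since the actual CSV headers are not shown in this notebook.

```python
import io

import pandas as pd


def load_csv_checked(source, required_columns):
    """Read a CSV and raise ValueError if any required column is missing."""
    df = pd.read_csv(source)
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


# Tiny in-memory CSV standing in for a real file (header names are assumed).
sample_csv = io.StringIO("Income,Age,Risk_Flag\n50000,35,0\n30000,28,1\n")
df_sample = load_csv_checked(sample_csv, ["Income", "Age", "Risk_Flag"])
print(df_sample.shape)
print(df_sample.dtypes)
```

The same helper accepts a file path such as `"data/training_data.csv"` in place of the in-memory buffer, because `pd.read_csv` handles both.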