<a href="https://colab.research.google.com/github/Aryayayayaa/Labmentix/blob/main/Paisabazaar_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name    - Paisabazaar Banking Fraud Analysis**



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -**  - BALLA SAI DINESH MANI KARTHIKEYA



# **Project Summary -**

This project's core objective was to develop a robust machine learning model capable of accurately predicting individual credit scores as 'Poor,' 'Standard,' or 'Good' for Paisabazaar, a leading financial services company. This aim is critical for their business, as precise credit assessment is paramount for informed loan approvals, effective risk management, personalized product offerings, and overall operational efficiency.

The journey began with a thorough **data understanding** phase, involving loading the raw `Paisabazaar.csv` dataset and conducting initial exploratory analysis to grasp its structure, content, and the distribution of its variables.

The subsequent **data preparation and feature engineering** phase was extensive. Irrelevant identifier columns such as `ID`, `Customer_ID`, `SSN`, and `Name` were promptly removed. The `Credit_History_Age` column, initially in an 'X years and Y months' format, was meticulously converted into a numerical representation of total months, making it usable for modeling. The `Type_of_Loan` column, a multi-label categorical feature, was transformed into multiple binary columns, one for each unique loan type, to capture all possible loan combinations. Other ordinal and binary categorical features like `Credit_Mix` and `Payment_of_Min_Amount`, along with the target variable `Credit_Score` itself, were mapped to numerical values. Nominal categorical features such as `Occupation` and `Payment_Behaviour` were then effectively handled using one-hot encoding, expanding the feature set. Missing values were systematically imputed using medians for numerical columns, and outliers were treated using the IQR (Interquartile Range) capping method to ensure data quality and model stability.
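
As a rough illustration of the median imputation and IQR capping mentioned above, here is a minimal sketch. The helper name `impute_and_cap`, the fence multiplier `k=1.5`, and the toy values are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

def impute_and_cap(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Fill missing values with the median, then clip to the IQR fences."""
    filled = series.fillna(series.median())
    q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
    iqr = q3 - q1
    # Values outside [q1 - k*iqr, q3 + k*iqr] are pulled back to the fence
    return filled.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical toy column with one missing value and one extreme outlier
toy = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0, 500.0])
capped = impute_and_cap(toy)
print(capped.tolist())
```

Capping keeps every row in the dataset, which matters here because dropping outlier rows would also change the class distribution of the target.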

To further enhance the dataset's predictive power, **feature manipulation** was performed. Highly correlated features, like `Annual_Income` and `Monthly_Inhand_Salary`, were addressed by strategically dropping one (`Monthly_Inhand_Salary`) to minimize multicollinearity. New, insightful features were engineered, including `Debt_to_Income_Ratio`, `EMI_to_Salary_Ratio`, and `Payment_Consistency`, aiming to capture more complex financial behaviors and improve predictive accuracy. This was followed by **data transformation**, where `log1p` (logarithmic transformation) was applied to skewed numerical features to make their distributions more Gaussian-like, benefiting various ML algorithms. All numerical features were then **scaled using StandardScaler**, ensuring they contribute equally to the model without being disproportionately influenced by their original magnitude. **Dimensionality reduction with PCA** was also conditionally included as an optional step, applied only if the feature count remained high, to reduce complexity while preserving variance.

A crucial step was the **data splitting and handling of imbalanced datasets**. The dataset was split into training (80%) and testing (20%) sets. Critically, **stratified sampling (`stratify=y`)** was employed to ensure that the distribution of the target `Credit_Score` classes ('Poor', 'Standard', 'Good') was maintained in both subsets, which is vital for imbalanced datasets. Furthermore, the inherent class imbalance in the training data was addressed using **SMOTE (Synthetic Minority Over-sampling Technique)**. SMOTE generated synthetic samples for the minority classes (`Poor` and `Good`), thereby balancing the training dataset without introducing data leakage, allowing models to learn effectively from all classes.
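
The effect of `stratify=y` can be checked directly by comparing class shares before and after the split. The sketch below uses a synthetic target with made-up class counts; it is not the notebook's data, and the oversampling step is left out.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 0 = Poor, 1 = Standard, 2 = Good
rng = np.random.default_rng(42)
y = pd.Series(np.repeat([0, 1, 2], [300, 500, 200]))
X = pd.DataFrame({"feature": rng.normal(size=len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Class shares should match across the full data and both subsets
print(y.value_counts(normalize=True).sort_index())
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())
```

Any oversampling would then be fitted on `X_train` and `y_train` only, so that the test set keeps the original class distribution.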

In the **model development and optimization** phase, three prominent machine learning algorithms were implemented: Logistic Regression, Decision Tree Classifier, and Random Forest Classifier. For each model, **hyperparameter optimization was performed using Randomized Search Cross-Validation (`RandomizedSearchCV`)**. This technique efficiently explored a wide range of hyperparameter combinations, using 5-fold cross-validation to robustly assess performance and identify the optimal parameters for each algorithm.
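
A minimal sketch of such a randomized search is shown below. The synthetic data, the search space, `n_iter=4`, and the `f1_macro` scoring choice are all assumptions for illustration; the notebook's actual settings may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic three-class data standing in for the prepared training set
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=42
)

# Hypothetical search space; the notebook's real grid may differ
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,            # number of sampled combinations
    cv=5,                # 5-fold cross-validation, as described above
    scoring="f1_macro",  # assumption: macro F1 suits imbalanced classes
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because only `n_iter` combinations are sampled, the cost is fixed in advance regardless of how large the search space is, which is the main reason to prefer it over an exhaustive grid search.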

The models were rigorously **evaluated** using a suite of metrics crucial for business impact:
* **Overall Accuracy:** Providing a general measure of correctness.
* **Precision, Recall, and F1-score (Macro and Weighted Averages):** Essential for understanding class-specific performance, particularly for imbalanced datasets. High recall for 'Poor' credit scores directly translates to reduced loan defaults, while high precision for 'Good' credit scores ensures confident lending decisions.
* **Confusion Matrix:** Offering a granular view of true positives, false positives, and false negatives, enabling a deeper understanding of where the model makes errors and their associated business costs.
* **ROC AUC Curves (One-vs-Rest):** Assessing the model's ability to distinguish between each credit score class and all others, providing insights into its overall discriminatory power across different decision thresholds.
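
The metrics listed above can all be computed with scikit-learn. The sketch below runs on synthetic three-class data with a plain logistic regression, purely to show the calls involved; it does not reproduce the notebook's models or results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and the 0/1/2 target
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

# Per-class precision/recall/F1 plus macro and weighted averages
print(classification_report(y_test, y_pred, target_names=["Poor", "Standard", "Good"]))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# One-vs-Rest ROC AUC, macro-averaged over the three classes
auc_ovr = roc_auc_score(y_test, y_proba, multi_class="ovr", average="macro")
print(f"Macro OvR ROC AUC: {auc_ovr:.3f}")
```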


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**To develop a robust and data-driven predictive model that can accurately classify an individual's credit score (Poor, Standard, or Good) based on their financial and behavioral data, thereby enhancing the credit assessment process and mitigating associated business risks.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

!pip install contractions

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# For textual data preprocessing
# You might need to download these NLTK resources once
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
import re
import nltk
import contractions
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

from google.colab import drive
import gdown

import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully")

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('dataset-2.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The dataset contains 100,000 rows and 28 columns.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

The dataset has 28 distinct columns. For each numerical column, `df.describe()` reports the 'count', 'mean', 'std', 'min', '25%', '50%', '75%', and 'max'.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# @title Data Preparation for Analysis

# Drop irrelevant identifier columns: ID, Customer_ID, SSN, and Name are unique identifiers - typically not used as features for modeling
print("Dropping identifier columns: ID, Customer_ID, SSN, Name...")
df = df.drop(['ID', 'Customer_ID', 'SSN', 'Name'], axis=1)
print(f"Dataset shape after dropping identifiers: {df.shape}")

# Convert 'Credit_History_Age' from 'X years and Y months' to total months (integer)
print("Converting 'Credit_History_Age' to total months...")

def convert_credit_history_age(age_str):
    if pd.isna(age_str) or not isinstance(age_str, str):
        return np.nan
    try:
        parts = age_str.replace('and', '').replace('months', '').split('years')
        years = int(parts[0].strip()) if parts[0].strip() else 0
        months = int(parts[1].strip()) if len(parts) > 1 and parts[1].strip() else 0
        return (years * 12) + months
    except (ValueError, IndexError):
        return np.nan # Handle any parsing errors

df['Credit_History_Age'] = df['Credit_History_Age'].apply(convert_credit_history_age)

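# Fill any NaN values that might arise from conversion errors with the column median
# (assign the result back; inplace fillna on a column selection is unreliable in recent pandas)
df['Credit_History_Age'] = df['Credit_History_Age'].fillna(df['Credit_History_Age'].median())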
print("Finished converting 'Credit_History_Age'.")


# Handle 'Type_of_Loan' - Multi-label one-hot encoding as this column contains multiple loan types separated by commas
print("Processing 'Type_of_Loan' for multi-label encoding...")

# First, clean up extra spaces around loan types and replace 'No Loan' if present
df['Type_of_Loan'] = df['Type_of_Loan'].str.replace(r'\s*,\s*', ',', regex=True).str.strip()
df['Type_of_Loan'] = df['Type_of_Loan'].replace('No Loan', np.nan) # Treat 'No Loan' as missing for now if not actual loan type

# Get all unique loan types by splitting the strings
all_loan_types = set()
for types_str in df['Type_of_Loan'].dropna():
    for loan_type in types_str.split(','):
        all_loan_types.add(loan_type.strip())

# Create a binary column for each unique loan type
for loan_type in all_loan_types:
    # Flag rows whose comma-separated list contains this exact loan type
    df[f'Loan_{loan_type.replace(" ", "_")}'] = df['Type_of_Loan'].apply(lambda x: 1 if isinstance(x, str) and loan_type in x.split(',') else 0)

# Drop the original 'Type_of_Loan' column
df = df.drop('Type_of_Loan', axis=1)
print("Finished processing 'Type_of_Loan'.")


# Map Ordinal and Binary Categorical Columns to numerical values
print("Mapping ordinal and binary categorical columns...")

# Credit_Mix: 'Bad' < 'Standard' < 'Good'
credit_mix_mapping = {'Bad': 0, 'Standard': 1, 'Good': 2}
df['Credit_Mix'] = df['Credit_Mix'].map(credit_mix_mapping)

# Fill any NaN values that might arise from mapping issues (e.g., unseen categories)
df['Credit_Mix'] = df['Credit_Mix'].fillna(df['Credit_Mix'].mode()[0])


# Payment_of_Min_Amount: 'No' < 'Yes' < 'Not Applicable' (can be treated as a third distinct category)
payment_min_amount_mapping = {'No': 0, 'Yes': 1, 'Not Applicable': 2}
df['Payment_of_Min_Amount'] = df['Payment_of_Min_Amount'].map(payment_min_amount_mapping)
df['Payment_of_Min_Amount'] = df['Payment_of_Min_Amount'].fillna(df['Payment_of_Min_Amount'].mode()[0])


# Credit_Score: This is likely the target variable, and it's ordinal: 'Poor' < 'Standard' < 'Good'
credit_score_mapping = {'Poor': 0, 'Standard': 1, 'Good': 2}
df['Credit_Score'] = df['Credit_Score'].map(credit_score_mapping)
df['Credit_Score'] = df['Credit_Score'].fillna(df['Credit_Score'].mode()[0]) # Handle potential missing values


print("Finished mapping ordinal and binary categorical columns.")


# One-Hot Encode Nominal Categorical Columns
print("One-hot encoding nominal categorical columns: Occupation, Payment_Behaviour...")
nominal_cols = ['Occupation', 'Payment_Behaviour']
df = pd.get_dummies(df, columns=nominal_cols, drop_first=True, dtype=int) # drop_first to avoid multicollinearity
print("Finished one-hot encoding.")


# Display the first few rows of the preprocessed DataFrame and its info
print("\n--- Preprocessed DataFrame Head ---")
print(df.head())
print("\n--- Preprocessed DataFrame Info ---")
df.info()

print("\nDataset preparation for analysis is complete.")

### What all manipulations have you done and insights you found?

**Data Manipulations Performed:**

1. **Dropped Irrelevant Identifier Columns:**
  * The columns ID, Customer_ID, SSN, and Name were removed from the DataFrame. These columns typically serve as unique identifiers and don't provide predictive power for machine learning models.
  * **Effect:** The DataFrame went from the initial 28 columns (per the earlier df.info() output) to 24 columns after this step.

2. **Converted Credit_History_Age:**
  * An attempt was made to convert the Credit_History_Age column, which was assumed to be in a "X years and Y months" string format, into total months (integer).
  * **Logic:** A custom function convert_credit_history_age was defined to parse this string and calculate the total months.
  * **Missing Value Handling:** After conversion, any NaN values that resulted were filled using the median of the Credit_History_Age column.

3. **Handled Type_of_Loan with Multi-label One-Hot Encoding:**
  * This column contained multiple loan types, often separated by commas (e.g., "Personal Loan, Student Loan").
  * **Cleaning:** Leading/trailing spaces and No Loan entries were cleaned/replaced with NaN.
  * **Feature Creation:** New binary (0 or 1) columns were created for each unique loan type found across the entire dataset (e.g., Loan_Mortgage_Loan, Loan_Student_Loan). A '1' indicates the presence of that loan type for a given customer.
  * **Dropped Original:** The original Type_of_Loan column was then dropped.

4. **Mapped Ordinal and Binary Categorical Columns:**
  * **Credit_Mix:** This was mapped from 'Bad' (0), 'Standard' (1), 'Good' (2) to numerical values, preserving the inherent order.
  * **Payment_of_Min_Amount:** This was mapped from 'No' (0), 'Yes' (1), 'Not Applicable' (2).
  * **Credit_Score (Target Variable):** This was mapped from 'Poor' (0), 'Standard' (1), 'Good' (2), preparing it for a classification model.
  * **Missing Value Handling:** For all mapped columns, any NaN values that might have occurred during the mapping (e.g., if an unrecognised category appeared) were filled with the mode (most frequent value) of that column.

5. **One-Hot Encoded Nominal Categorical Columns:**
  * Occupation and Payment_Behaviour were identified as nominal (unordered) categorical features.
  * **Feature Creation:** pd.get_dummies was used to create new binary columns for each unique category within these columns (e.g., Occupation_Engineer, Payment_Behaviour_High_spent_Medium_value_payments).
  * drop_first=True was used to prevent multicollinearity, dropping one category from each original column.
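
The multi-label encoding of Type_of_Loan described in step 3 can also be expressed with pandas' built-in `Series.str.get_dummies`. This is an alternative sketch on a made-up sample, not the loop used in the notebook cell.

```python
import pandas as pd

# Hypothetical sample of the Type_of_Loan column
loans = pd.Series([
    "Personal Loan, Student Loan",
    "Auto Loan",
    None,
    "Student Loan,Auto Loan",
])

# Normalise spacing around commas, then build one 0/1 column per loan type
cleaned = loans.str.replace(r"\s*,\s*", ",", regex=True).str.strip()
loan_flags = cleaned.str.get_dummies(sep=",").add_prefix("Loan_")
loan_flags.columns = loan_flags.columns.str.replace(" ", "_")
print(loan_flags)
```

Rows with a missing value get 0 in every loan column, which matches how the notebook treats customers without a recorded loan type.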

**Insights from the Output:**

1. **Dimensionality Change:**
  * The initial DataFrame had 28 columns.
  * After dropping ID, Customer_ID, SSN, Name, the shape became (100000, 24).
  * After all encoding steps (especially multi-label for Type_of_Loan and one-hot for Occupation and Payment_Behaviour), the final DataFrame has 100000 rows and 59 columns. This significant increase in columns is expected due to the creation of many binary features.

2. **Credit_History_Age Anomaly:**
  * The df.info() output for the preprocessed DataFrame shows that Credit_History_Age has 0 non-null values. This is a critical observation.
  * **Explanation:** Despite the conversion logic and the fillna step, this indicates that convert_credit_history_age returned np.nan for all 100,000 entries. The function returns np.nan for any non-string input, so this is the expected result if the original column was stored as float64 (as the initial df.info() output suggests) rather than as "X years and Y months" strings. With every value NaN, the column median is also NaN, so fillna(df['Credit_History_Age'].median()) had nothing to fill with and the column stayed entirely null.
  * **Impact:** The column is effectively empty after processing, so it will not contribute to the model. This needs to be investigated and corrected.

3. **Successful Categorical Encoding:**
  * The df.head() output clearly shows the new Loan_, Occupation_, and Payment_Behaviour_ prefixed columns with binary (0/1) values.
  * The df.info() confirms that these new columns are of int64 type and have 100000 non-null values, indicating successful one-hot encoding and no missing values in these newly created features.
  * Credit_Mix, Payment_of_Min_Amount, and Credit_Score are now int64 and fully populated, confirming their successful ordinal mapping.

4. **Overall Data Readiness:**
  * Aside from the Credit_History_Age issue, the dataset is now entirely numerical (float64 or int64) and free of explicit missing values (all columns show 100000 non-null). This is a crucial state for feeding the data into most machine learning algorithms.

In summary, the code has transformed the raw dataset into a clean, numerical format suitable for machine learning, expanding the feature set significantly. The primary issue identified from the output is the complete loss of data in the Credit_History_Age column, which requires a re-evaluation of its original format and the conversion logic.
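
One possible way to avoid losing Credit_History_Age is a converter that accepts both string and numeric input. The sketch below is hedged: the function name `credit_history_to_months` is hypothetical, and treating numeric values as already being month counts is an assumption that should be checked against the raw data first.

```python
import re
import numpy as np
import pandas as pd

def credit_history_to_months(value):
    """Return total months for 'X Years and Y Months' strings or numeric input."""
    if pd.isna(value):
        return np.nan
    if isinstance(value, (int, float, np.integer, np.floating)):
        # Assumption: a numeric value is already a month count
        return float(value)
    match = re.search(
        r"(\d+)\s*years?(?:\s*(?:and)?\s*(\d+)\s*months?)?",
        str(value),
        flags=re.IGNORECASE,
    )
    if match is None:
        return np.nan  # unrecognised format
    years = int(match.group(1))
    months = int(match.group(2)) if match.group(2) else 0
    return float(years * 12 + months)

# Hypothetical mixed sample: strings, a numeric value, a missing value, junk
sample = pd.Series(["22 Years and 1 Months", "5 years", 265.0, None, "unknown"], dtype=object)
converted = sample.apply(credit_history_to_months)
print(converted.tolist())
```

Checking `converted.isna().mean()` before imputing would reveal a wholesale parsing failure early, instead of only after the final df.info() call.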

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# @title Chart 1 - Distribution of credit scores
#create count plot for credit score distribution
sns.countplot(x = df['Credit_Score'], hue = df['Credit_Score'], palette = 'viridis', order = df['Credit_Score'].value_counts().index)
#Set labels and title
plt.title('Distribution of Credit Scores')
plt.xlabel('Credit Score Category')
plt.ylabel('Count')
#show plot
plt.show()

##### 1. Why did you pick the specific chart?

* **Categorical/Ordinal Variable**: Credit_Score is an ordinal categorical variable (Poor, Standard, Good, which I mapped to 0, 1, 2). A bar chart is the most appropriate visualization to show the frequency or count of observations within each distinct category.
* **Clear Distribution Overview:** It immediately reveals the balance or imbalance across the different credit score categories. This is crucial for understanding the dataset's composition regarding the target variable.
* **Ease of Interpretation:** The height of each bar directly corresponds to the count of individuals in that credit score category, making it very easy to interpret which categories are more or less prevalent.
* **Preparation for Modeling:** Understanding the distribution of the target variable is a fundamental step in any machine learning project. It helps identify potential class imbalance issues that might need to be addressed during model training.

##### 2. What is/are the insight(s) found from the chart?

* **Dominance of "Standard" Credit Score:** The chart clearly shows that the majority of customers fall into "Credit Score Category 1", which corresponds to the 'Standard' credit score under the target mapping {'Poor': 0, 'Standard': 1, 'Good': 2}. There are over 50,000 instances in this category.
* **Significant "Poor" Credit Score Group:** "Credit Score Category 0" (corresponding to 'Poor') represents the second largest group, with roughly 29,000-30,000 instances.
* **Smallest "Good" Credit Score Group:** "Credit Score Category 2" (corresponding to 'Good') is the smallest group, with fewer than 20,000 instances (approximately 18,000).
* **Class Imbalance:** There is a noticeable class imbalance. The 'Standard' credit score dominates, followed by 'Poor', and 'Good' credit scores are the least represented. This imbalance is not extreme but is certainly present.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, the gained insights from this distribution are crucial for creating a positive business impact for Paisabazaar:

* **Targeted Marketing and Product Development:** Knowing that "Standard" credit score holders are the largest group allows Paisabazaar to focus marketing efforts and develop financial products specifically tailored to this segment. For example, they could offer products designed to help "Standard" customers improve their scores to "Good," thereby increasing their eligibility for better loans and strengthening customer loyalty.

* **Risk Management and Default Prevention:** The significant number of "Poor" credit score holders (Category 0) is a critical insight. This group represents a higher risk of loan defaults. Paisabazaar can use this information to:
  * **Refine Lending Policies:** Implement stricter criteria or offer different loan terms for applicants in this category.

  * **Proactive Intervention:** Develop strategies to engage with "Poor" credit score customers, offering financial literacy programs, debt consolidation advice, or specialized low-risk products to help them manage their finances better and reduce future default rates. Reducing defaults directly impacts Paisabazaar's profitability.

* **Optimizing Resource Allocation:** By understanding the proportions, Paisabazaar can allocate resources (e.g., customer support, loan officers) more effectively. For instance, more resources might be needed for managing higher-risk accounts or for guiding "Standard" customers towards higher-value products.

* **Model Performance Expectations:** Recognizing the class imbalance upfront allows data scientists to choose appropriate evaluation metrics (e.g., F1-score, precision, recall, AUC-ROC) rather than just accuracy, and potentially employ techniques like oversampling, undersampling, or using weighted loss functions during model training. This leads to building a more robust and fair prediction model, which directly supports the project's aim of accurate credit score classification.
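
As a small illustration of the weighted-loss option mentioned above, scikit-learn can derive "balanced" class weights from the label counts. The counts below are hypothetical round numbers in line with the chart, not exact figures from the dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical counts for Poor (0), Standard (1), Good (2)
y = np.repeat([0, 1, 2], [29_000, 53_000, 18_000])

# "balanced" weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1, 2]), y=y
)
class_weight = dict(zip([0, 1, 2], weights))
print(class_weight)
```

The resulting dictionary can be passed as the `class_weight` argument of estimators such as `LogisticRegression` or `RandomForestClassifier`, so that errors on the rarer 'Good' class cost more than errors on 'Standard'.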


**Insights that Lead to Negative Growth:** While the insights themselves are just observations, the implications of a high proportion of "Poor" credit scores and a smaller proportion of "Good" credit scores could indirectly lead to negative growth if not strategically managed:

* **Higher Default Rates:** The large number of customers with "Poor" credit scores indicates a potentially high inherent risk in Paisabazaar's existing customer base or applicant pool. If not properly assessed and mitigated by robust credit assessment processes (which is exactly what this project aims to improve), this could lead to a higher volume of loan defaults.
  * **Reason:** More defaults mean more financial losses, increased operational costs for collections, and potentially damaged relationships with lenders, all contributing to negative financial growth.

* **Missed Opportunities for High-Quality Customers:** The relatively smaller group of "Good" credit score customers means there might be fewer opportunities to acquire high-quality, low-risk borrowers, or that Paisabazaar is not attracting enough of them.
  * **Reason:** High-quality borrowers typically bring in stable revenue with lower risk. If Paisabazaar struggles to attract or retain these customers due to lack of tailored offerings or competitive interest rates, it could lead to slower growth or even a decline in the overall quality of its loan portfolio.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# @title Chart 2 - Distribution of Age
#create histogram for Age distribution
sns.histplot(df['Age'], bins = 30, kde = True, color = 'blue')

#set labels and title
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')

#show plot
plt.show()

##### 1. Why did you pick the specific chart?

* **Continuous Numerical Variable:** Age is a continuous numerical variable. A histogram is ideal for displaying the distribution of continuous data by dividing it into bins and showing the frequency of observations within each bin.
* **Reveals Shape of Distribution:** A histogram helps in understanding the shape of the age distribution (e.g., symmetric, skewed, bimodal).
* **KDE for Smoother Trend:** The Kernel Density Estimate (the blue line) provides a smoothed representation of the data's distribution, making it easier to see overall trends and peaks that might be obscured by the jaggedness of individual histogram bars, especially when dealing with many bins or a high number of unique values.
* **Identifies Common Age Groups:** It quickly highlights which age ranges are most prevalent among the customers.

##### 2. What is/are the insight(s) found from the chart?

* **Dominant Age Group:** The distribution shows a strong concentration of customers in the early 20s to early 40s (approximately 20 to 45 years old). This is where the highest frequency bars and the peak of the KDE curve are located.
* **Peak Frequencies:** There appear to be multiple peaks, or at least a broad plateau, from roughly 25 to 40 years old, indicating these are the most common age ranges for customers in this dataset.
* **Fewer Younger and Older Customers:** There are fewer customers in the younger age groups (below 20) and a noticeable decline in frequency as age increases beyond 45, with very few customers above 55.
* **Relatively Even Distribution within Core Group:** Within the 20-45 age range, while there are fluctuations, the frequencies are generally high, suggesting a relatively consistent customer base across these working-age demographics.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, understanding the age distribution is highly valuable for Paisabazaar:

* **Tailored Product Offerings:** Knowing that the core customer base is primarily between 20 and 45 years old allows Paisabazaar to design and market financial products specifically for these demographics.

  * For example, younger customers (20s-early 30s) might be targeted with products like student loan refinancing, first-time home buyer loans, or credit-builder products.
  * Mid-career customers (30s-40s) might be interested in mortgage refinancing, personal loans for major purchases, or wealth management tools.
  * **Reason:** This targeted approach ensures that marketing spend is optimized and products are relevant, leading to higher conversion rates, customer acquisition, and engagement.

* **Risk Assessment Refinement:** Age can be a significant factor in credit risk models. This distribution suggests that the bulk of Paisabazaar's business is with individuals in their prime earning years.
  * **Reason:** This could imply a generally stable customer base from an income perspective, potentially allowing for more predictable risk assessments within this age range. However, it also highlights the need to understand the specific risk profiles within this dominant age group (e.g., credit behavior of 25-year-olds vs. 40-year-olds).

* **Strategic Expansion:** The insight into fewer younger and older customers can inform expansion strategies.
  * **Reason:** Paisabazaar might identify an untapped market in these segments. For example, they could develop specialized products for senior citizens or young adults just starting their financial journeys, expanding their market reach and customer base.

**Insights that Lead to Negative Growth**

Similar to the credit score distribution, the insights themselves don't inherently lead to negative growth, but the implications of the distribution if not strategically managed could pose risks:

* **Limited Market Penetration in Specific Age Groups:** The low representation of customers below 20 and above 55 could indicate a failure to capture these segments effectively.
  * **Reason:** If Paisabazaar is not attracting younger customers, they are missing out on building long-term relationships and capturing future high-value customers. Similarly, overlooking older demographics could mean missing out on customers with potentially significant assets, different financial needs (e.g., retirement planning, reverse mortgages), and established credit histories. This could lead to a stagnation or decline in overall market share if competitors successfully tap into these underserved age groups.

* **Over-reliance on a Single Demographic:** If the business heavily relies on the 20-45 age bracket without diversification, it could be vulnerable to economic shifts or changes in financial behavior specific to that group.
  * **Reason:** For instance, if a recession primarily impacts this age group's employment or income, Paisabazaar's core business could be disproportionately affected, leading to reduced loan applications or increased defaults.
  
In summary, the age distribution is a powerful insight for tailoring services and marketing. The "negative growth" aspects arise if Paisabazaar fails to recognize and address the potential vulnerabilities of concentrating on one age segment or missing opportunities in others.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# @title Chart 3 - Distribution of Annual Income
#create histogram for Annual Income Distribution

sns.histplot(df['Annual_Income'], bins = 30, kde = True, color = 'green')

#set labels and title
plt.title('Distribution of Annual Income')
plt.xlabel('Annual Income')
plt.ylabel('Frequency')

#show plot
plt.show()

##### 1. Why did you pick the specific chart?

* **Continuous Numerical Variable:** Annual_Income is a continuous numerical variable. A histogram effectively displays the frequency distribution of such data by grouping values into bins.
* **Reveals Skewness and Outliers:** Income distributions are often skewed, and a histogram clearly shows this skewness (e.g., right-skewed, where most values are lower with a tail extending to higher values) and helps identify potential outliers or unusual concentrations.
* **KDE for Underlying Trend:** The KDE (the green line) provides a smooth, continuous estimate of the probability density function, helping to discern the underlying shape and multi-modality (multiple peaks) in the distribution that might be less obvious from the discrete bars of a histogram alone.
* **Identifies Income Brackets:** It allows for quick identification of the most common income brackets among the customer base.

##### 2. What is/are the insight(s) found from the chart?

* **Right-Skewed Distribution:** The distribution is heavily right-skewed, meaning a large majority of customers have lower annual incomes, and fewer customers have very high annual incomes.
* **Primary Income Concentration:** The most significant concentration of customers falls within the lower income brackets, specifically around $15,000 to $30,000. The highest bar indicates a peak frequency here.
* **Secondary Peaks/Plateaus:** There appear to be a few secondary, smaller peaks or plateaus. One is roughly around $25,000 - $35,000, and another around $60,000 - $70,000, suggesting distinct clusters or segments within the income distribution.
* **Long Tail:** The distribution has a long tail extending towards higher annual incomes, but the frequency significantly drops off after approximately $75,000, indicating a smaller number of high-income earners in the dataset.
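
The right skew described above can also be checked numerically rather than only by eye. The following is a minimal sketch that assumes a synthetic log-normal sample as a stand-in for `df['Annual_Income']`; the distribution parameters are illustrative and not taken from the project data.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for df['Annual_Income'] (assumption, not project data)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.6, size=5_000), name="Annual_Income")

skewness = income.skew()         # a value above 0 indicates a right (positive) skew
mean_income = income.mean()
median_income = income.median()  # in right-skewed data the mean usually exceeds the median

print(f"skewness={skewness:.2f}, mean={mean_income:,.0f}, median={median_income:,.0f}")
```

On the real column, a clearly positive skewness together with a mean above the median would support the "heavily right-skewed" reading of the histogram.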

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, the insights from the Annual Income distribution are highly valuable for Paisabazaar:

* **Targeted Product Development & Marketing:** Knowing that most customers fall into lower to moderate income brackets allows Paisabazaar to design and market products that are affordable and relevant to these segments.
  * **Reason:** Tailoring products and marketing strategies to the predominant income groups ensures better resonance with the target audience, higher conversion rates, and optimized customer acquisition costs.

* **Refined Credit Risk Assessment:** Income is a primary factor in assessing creditworthiness. The observed distribution can help refine lending models.
  * **Reason:** Paisabazaar can use this information to set income-based eligibility criteria for different loan products, develop tiered interest rates based on income levels, or implement stricter underwriting for very low-income applicants, thereby managing risk and reducing defaults more effectively.

* **Identifying Underserved Segments:** The long tail of higher incomes, while smaller, represents a potentially underserved market.
  * **Reason:** Paisabazaar could explore developing premium financial products (e.g., high-value mortgages, investment-linked loans) or exclusive services to attract and cater to this smaller but potentially high-value segment, diversifying their portfolio and increasing average revenue per customer.

**Insights that Lead to Negative Growth:** As with previous charts, the insights themselves are observational, but their implications can lead to negative growth if not addressed strategically:

* **Higher Default Risk in Lower Income Segments (Potential Negative Implication):** The significant concentration of customers in lower-income brackets naturally implies a potentially higher overall default risk for the loan portfolio if not managed carefully.

  * **Reason:** Lower income often correlates with less disposable income and a higher susceptibility to financial shocks, increasing the likelihood of missed payments or defaults. If Paisabazaar does not adequately adjust its risk models, loan terms, or support mechanisms for this large segment, it could face increased non-performing assets and financial losses.

* **Limited Revenue Per Customer (Potential Negative Implication):** A customer base dominated by lower-income individuals might, on average, have a lower capacity for taking out larger loans or utilizing higher-limit credit products.

  * **Reason:** This could cap the potential revenue generated per customer. If Paisabazaar only focuses on this segment, it might struggle to achieve higher revenue growth or profitability compared to competitors who successfully attract and serve higher-income customers with larger financial needs. This might necessitate a high volume strategy to compensate for lower per-customer revenue.

In conclusion, understanding the income distribution is vital for strategic product development, marketing, and risk management. The challenge lies in converting the insights from the large low-to-moderate income segment into robust, yet accessible, financial products while mitigating the inherent risks associated with lending to this demographic. Simultaneously, Paisabazaar could identify and strategically target the higher-income segments for diversified growth.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# @title Chart 4 - Histogram for Credit Utilization Ratio distribution
sns.histplot(df['Credit_Utilization_Ratio'], bins = 30, kde = True, color = 'purple')

#set labels and title
plt.title('Distribution of Credit Utilization Ratio')
plt.xlabel('Credit Utilization Ratio')
plt.ylabel('Frequency')

#show plot
plt.show()

##### 1. Why did you pick the specific chart?

* **Continuous Numerical Variable:** The Credit Utilization Ratio is a continuous numerical variable. A histogram is the most effective way to display the frequency distribution of such data, showing how values are distributed across different ranges.
* **Reveals Central Tendency and Spread:** It clearly shows where the majority of credit utilization ratios lie (central tendency) and how widely they are spread (dispersion).
* **KDE for Smoothness and Shape:** The KDE (the purple line) provides a smooth representation of the underlying probability distribution, making it easier to identify the overall shape (e.g., normal, skewed, multimodal) and peak concentrations, which might be less distinct with just the histogram bars.
* **Key Credit Metric:** The Credit Utilization Ratio is a critical indicator of credit risk, and visualizing its distribution provides immediate insights into how customers are managing their credit.

##### 2. What is/are the insight(s) found from the chart?

* **Bell-Shaped (Normal-like) Distribution:** The distribution appears roughly bell-shaped, resembling a normal distribution, though it might be slightly skewed or have a flattened peak. This suggests that credit utilization ratios tend to cluster around a central value.
* **Concentration Around Optimal Range:** The highest frequency of credit utilization ratios is observed roughly between 30% and 40%, with a peak around 35-37%. This range is generally considered healthy or moderate for credit utilization.
* **Limited High and Low Utilization:** There are relatively few customers with very low (below 25%) or very high (above 40-45%) credit utilization ratios. The tails of the distribution drop off significantly.
* **Possible Outliers/Extremes:** While the chart shows values up to 50%, the frequencies at the very ends of the spectrum (e.g., 20-22% and 45-50%) are quite low, suggesting that extreme utilization ratios are not common in this dataset.
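
The concentration in the 30-40% band can be quantified instead of being read off the bars. This is a minimal sketch that assumes a small hypothetical sample in place of `df['Credit_Utilization_Ratio']`; the band edges are illustrative choices, not thresholds used elsewhere in the project.

```python
import pandas as pd

# Hypothetical utilization ratios (in percent) standing in for the real column
util = pd.Series(
    [22.5, 28.0, 31.2, 33.8, 35.1, 36.4, 37.9, 39.0, 42.3, 47.6],
    name="Credit_Utilization_Ratio",
)

# Left-closed bands: [0, 25), [25, 30), [30, 40), [40, 45), [45, 100)
bands = pd.cut(util, bins=[0, 25, 30, 40, 45, 100], right=False)

# Share of customers per band, kept in band order rather than frequency order
band_share = bands.value_counts(normalize=True).sort_index()
print(band_share)
```

Running the same binning on the full dataset would show what fraction of customers actually sits in each range, which is a firmer basis for the statements about the tails.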

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, these insights are highly beneficial:

* **Optimized Credit Limit Recommendations & Risk Management:** The concentration around a moderate utilization range (30-40%) allows Paisabazaar to recommend appropriate credit limits and fine-tune risk models.
  * **Reason:** This helps in proactively identifying healthy credit users for potential limit increases (driving revenue) and flagging those moving into higher risk, thus reducing defaults.

* **Targeted Product Development:** Understanding the typical utilization behavior enables creating products and offers that align with customer needs.
  * **Reason:** Tailored offerings lead to better customer engagement and higher conversion rates.

**Insights that Lead to Negative Growth:**

* **Complacency in Risk Assessment:** Over-reliance on the healthy aggregate distribution without deeper segmentation can be risky.
  * **Reason:** This could lead to underestimating risk within specific customer groups, resulting in unforeseen defaults and financial losses.

* **Missed Revenue from Low-Utilizers:** A segment of customers might be underutilizing their credit.
  * **Reason:** If Paisabazaar doesn't encourage more usage (e.g., via incentives), they miss out on potential interest income and transaction fees, hindering revenue growth.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# @title Chart 5 - Number of credit cards distribution
#create histogram for number of credit cards distribution
sns.histplot(df['Num_Credit_Card'], bins = 15, kde = True, color = 'red')

#set labels and title
plt.title('Distribution of Number of Credit Cards')
plt.xlabel('Number of Credit Cards')
plt.ylabel('Frequency')

#show plot
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a Kernel Density Estimate (KDE) overlay is suitable for Number of Credit Cards because:
* **Discrete Numerical Variable:** It effectively shows the frequency of each distinct count of credit cards.
* **Distribution Shape:** It reveals the pattern of how many cards customers typically hold, including peaks and valleys.
* **KDE for Overall Trend:** The KDE (red line) smooths out the distribution, highlighting dominant numbers of cards and potential multimodal patterns.

##### 2. What is/are the insight(s) found from the chart?

* **Multimodal Distribution:** The chart shows several distinct peaks, indicating that customers tend to hold specific numbers of credit cards rather than a continuous spread.
* **Common Card Counts:** The most frequent numbers of credit cards appear to be 5, 6, and 7, with counts around or above 15,000 for each.
* **Fewer Cards at Extremes:** There are fewer customers with very few (0-2) or very many (above 8) credit cards.
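
Because the number of credit cards is a discrete count, exact frequencies per value are often easier to read than binned bars. A minimal sketch, assuming a short hypothetical sample in place of `df['Num_Credit_Card']`:

```python
import pandas as pd

# Hypothetical sample standing in for df['Num_Credit_Card']
num_cards = pd.Series(
    [5, 6, 7, 5, 4, 6, 7, 3, 5, 6, 7, 8, 2, 5, 5],
    name="Num_Credit_Card",
)

# One row per distinct card count, ordered by the count itself rather than by frequency
card_counts = num_cards.value_counts().sort_index()

# The card count held by the largest number of customers in this sample
most_common = card_counts.idxmax()

print(card_counts)
print("Most common number of cards:", most_common)
```

Applied to the full column, this table would confirm or refine the visual estimate that 5, 6, and 7 cards are the most frequent values.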

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, these insights are valuable:

* **Targeted Credit Card Offers & Cross-selling:** Knowing common card counts (5, 6, 7) allows Paisabazaar to identify customers who might be in the market for additional cards or specific card features to complete their portfolio.
  * **Reason:** This enables targeted campaigns for new credit card acquisitions or cross-selling existing products, boosting revenue.

* **Risk Assessment:** The number of credit cards can influence credit risk. Paisabazaar can tailor risk models based on these insights.
  * **Reason:** This helps in assessing potential debt burden more accurately, leading to better lending decisions and reduced defaults.

**Insights that Lead to Negative Growth:**

* **Over-Indebtedness Risk:** The prevalence of customers with 5-7 credit cards indicates potential for high overall debt.
  * **Reason:** While more cards mean more potential transaction fees, it also increases the risk of over-indebtedness and default if not carefully managed by both customer and lender.

* **Market Saturation for Existing Customers:** If many customers already have multiple cards, the opportunity for Paisabazaar to issue new primary cards to them might be limited.
  * **Reason:** This necessitates focusing on competitive offers or specialized cards to capture share, potentially reducing profit margins or slowing new card acquisition growth.



#### Chart - 6

In [None]:
# Chart - 6 visualization code
# @title Chart 6 - Annual Income VS Credit Score
#create boxplot for annual income across different credit score categories
sns.boxplot(x= 'Credit_Score', y = "Annual_Income", data = df, palette = 'viridis')

#set label and title
plt.title('Annual Income vs Credit Score')
plt.xlabel('Credit Score')
plt.ylabel('Annual Income')

#show plot
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is an excellent choice for visualizing Annual Income vs Credit Score because:
* **Relationship between Categorical and Numerical Data:** It effectively displays the distribution of a continuous numerical variable (Annual Income) across different categories of an ordinal variable (Credit Score: 0, 1, 2).
* **Shows Key Statistical Measures:** Each box plot clearly indicates the median, interquartile range (IQR), and potential outliers for Annual Income within each Credit Score group.
* **Comparison Across Categories:** It allows for a direct visual comparison of income distributions for "Poor" (0), "Standard" (1), and "Good" (2) credit scores, highlighting differences in central tendency and spread.

##### 2. What is/are the insight(s) found from the chart?

* **Positive Correlation:** There's a clear positive relationship: as Credit_Score improves (from 0 to 2), the median Annual Income generally increases.
  * **Credit Score 0 (Poor):** Median income is relatively low (around $30,000).
  * **Credit Score 1 (Standard):** Median income is higher than for Credit Score 0, but still largely within the mid-range (around $40,000).
  * **Credit Score 2 (Good):** Median income is significantly higher than both 0 and 1 (around $45,000 - $50,000).

* **Income Spread:** Higher credit scores tend to be associated with a broader range of incomes, particularly for Credit Score 2, which shows a wider interquartile range and reaches higher income levels.

* **Outliers:** All credit score categories show numerous outliers at higher annual income levels, suggesting that even individuals with "Poor" or "Standard" credit scores can have very high incomes.
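
The medians and spreads read from the box plot can be computed directly per credit score group. The sketch below assumes a few hypothetical rows in place of `df[['Credit_Score', 'Annual_Income']]`; the income values are made up for illustration and are not the project's figures.

```python
import pandas as pd

# Hypothetical rows; the notebook itself would use the real df columns
sample = pd.DataFrame({
    "Credit_Score": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "Annual_Income": [20_000, 30_000, 45_000, 30_000, 40_000, 55_000, 35_000, 50_000, 80_000],
})

grouped = sample.groupby("Credit_Score")["Annual_Income"]

# Median and quartiles per credit score class, mirroring what each box displays
summary = pd.DataFrame({
    "median": grouped.median(),
    "q1": grouped.quantile(0.25),
    "q3": grouped.quantile(0.75),
})
summary["iqr"] = summary["q3"] - summary["q1"]

print(summary)
```

A table like this makes it possible to state the median income per class exactly instead of estimating it from the plot.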

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, these insights are highly beneficial:

* **Refined Credit Scoring Model:** Income appears to be a strong predictor of credit score. This relationship supports its importance for the predictive model.
  * **Reason:** The model will leverage this strong correlation to more accurately classify credit scores, leading to better lending decisions and reduced risk for Paisabazaar.

* **Targeted Product/Service Offerings:** Paisabazaar can tailor financial products based on income level and associated credit score.
  * **Reason:** For example, higher-income individuals with "Good" scores can be offered premium products, while lower-income individuals might be offered specific credit-building solutions, optimizing customer acquisition and engagement.

**Insights that Lead to Negative Growth:**

* **"High Income, Low Score" Outliers:** The presence of high-income outliers with "Poor" or "Standard" credit scores (Category 0 and 1) could be a concern.
  * **Reason:** These individuals, despite high income, might have other factors (e.g., poor payment history, high debt) impacting their score. Ignoring these complexities could lead to misjudging risk or missing opportunities for proactive financial guidance, potentially leading to defaults or suboptimal customer engagement.

* **Excluding Lower Income Potential:** If Paisabazaar solely focuses on high-income segments for "Good" scores, they might overlook profitable opportunities within the vast majority of lower-to-mid income customers who still maintain "Standard" credit.
  * **Reason:** An over-emphasis on high-income segments could limit market reach and overall customer growth if a significant portion of their business historically comes from diverse income brackets.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# @title Chart 7: ID VS Freq (Distribution 1)
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: Create a sample DataFrame (replace with your actual data if needed)
_df_0 = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 102, 104, 103, 102, 101, 104, 104, 105, 105, 106,
           107, 108, 109, 110, 102, 101, 103, 104, 106, 105, 105, 108, 109, 110, 110]
})

# Step 2: Plot a histogram of the 'ID' column
_df_0['ID'].plot(kind='hist', bins=20, title='ID Distribution', color='skyblue', edgecolor='black')

# Step 3: Style the plot
plt.xlabel('ID')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.gca().spines[['top', 'right']].set_visible(False)

# Step 4: Show the plot
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

* **Unique Identifier:** The ID column serves as a unique identifier for each record. By its nature, each ID should be distinct.

* **Meaningless Distribution:** Visualizing the frequency distribution of a unique identifier doesn't provide meaningful insights into the dataset's characteristics or relationships between variables. Its frequency should ideally be 1 for each unique ID, indicating no duplicates.

##### 2. What is/are the insight(s) found from the chart?

* **Each ID is Unique:** The chart shows a frequency of 1.0 for each displayed ID (e.g., 5634.0, 5635.0, etc.). This confirms that each ID value is unique within the sampled range shown in the plot.

* **Sequential or Discrete Nature:** The IDs appear to be numerical and possibly sequential or at least discrete, given the distinct bars.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** No direct positive business impact from this specific chart, as it represents a data quality check rather than an analytical insight.
* **Data Integrity Confirmation:** It confirms that the ID column correctly serves its purpose as a unique identifier (i.e., no duplicate IDs in the displayed sample).
  * **Reason:** Maintaining unique identifiers is fundamental for data integrity, accurate record-keeping, and database operations.

**Insights that Lead to Negative Growth:**

* **No Analytical Value:** Relying on this chart for business decisions or predictive modeling would be detrimental.
  * **Reason:** An ID itself has no predictive power for credit scores or other business metrics. Including it as a feature in a model would lead to overfitting and poor generalization, impacting the model's reliability and hindering positive business outcomes. This is why we dropped it in the data preparation phase.

* **Misinterpretation Risk:** Misinterpreting this chart as providing meaningful patterns could lead to misguided strategies.
  * **Reason:** For instance, concluding that a specific ID range is more common is irrelevant for business strategy if IDs are simply assigned sequentially.
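
Since the chart is effectively a data quality check, the same question can be answered more directly without plotting. A minimal sketch, assuming a hypothetical list of IDs in place of the real `ID` column:

```python
import pandas as pd

# Hypothetical IDs; one value (102) is deliberately repeated
ids = pd.Series([101, 102, 103, 104, 102, 105], name="ID")

# True only if every ID occurs exactly once
all_unique = ids.is_unique

# Distinct ID values that occur more than once
duplicate_ids = ids[ids.duplicated(keep=False)].unique().tolist()

print("All IDs unique:", all_unique)
print("Duplicated IDs:", duplicate_ids)
```

On the full dataset, `is_unique` gives a definitive answer for all records rather than only for the handful of IDs visible in a histogram.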

#### Chart - 8

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame with 'Month' column
# Replace this with your actual data source if needed
_df_1 = pd.DataFrame({
    'Month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] * 10  # 10 years of months
})

# Plot histogram of the 'Month' column
plt.figure(figsize=(8, 5))
_df_1['Month'].plot(kind='hist', bins=20, title='Month Distribution', color='skyblue', edgecolor='black')

# Clean up the plot aesthetics
plt.gca().spines[['top', 'right']].set_visible(False)
plt.xlabel('Month')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

* **Discrete Numerical Variable:** Month is a discrete numerical variable (likely representing months 1 through 12). Because the values are discrete, a bar chart of counts per month would arguably be more appropriate; the histogram used here shows frequency within numerical bins instead.

* **Uniformity Indication:** This specific plot suggests a uniform distribution across the months shown, implying a similar number of records per month in the dataset.

##### 2. What is/are the insight(s) found from the chart?

* **Uniform Distribution (for months 1-5):** The chart shows that for the months displayed (1 through 5), each month has an approximately equal frequency. The y-axis value of 1.0 is either an absolute count or a relative proportion, depending on how the plot was scaled.

* **Data Collection Over Time:** This indicates that data collection or customer interactions appear to be evenly distributed across these initial months, at least for the sample shown.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** No direct, immediate positive business impact from this chart's insights.

* **Data Consistency:** It suggests that data collection is consistent across time for these months.
  * **Reason:** This implies reliable and unbiased data capture over the observed period, which is crucial for building robust predictive models.

**Insights that Lead to Negative Growth:**

* **Lack of Seasonal Insight:** If the dataset spans a full year, seeing only uniform distribution for months 1-5 means missing potential seasonality or trends across the full 12 months.
  * **Reason:** If certain months have higher loan applications, defaults, or credit card usage, not identifying these patterns could lead to suboptimal staffing, marketing, or risk management strategies, hindering growth.

* **Limited Predictive Value:** Month as a raw numerical feature might not have direct predictive power if relationships aren't truly linear or cyclical.
  * **Reason:** Treating it simply as a number without considering its cyclical nature (e.g., as a categorical feature, or using sine/cosine transformations for seasonality) could limit the model's ability to capture time-based effects, reducing its accuracy and thus negatively impacting business decisions based on its predictions.
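
The sine/cosine transformation mentioned above can be sketched as follows. The column names `Month_sin` and `Month_cos` and the helper `encoded_distance` are hypothetical names introduced for illustration; the notebook does not necessarily encode `Month` this way.

```python
import numpy as np
import pandas as pd

# Months 1..12 mapped onto a circle so that December and January end up adjacent
months = pd.DataFrame({"Month": range(1, 13)})
angle = 2 * np.pi * (months["Month"] - 1) / 12
months["Month_sin"] = np.sin(angle)
months["Month_cos"] = np.cos(angle)


def encoded_distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    cols = ["Month_sin", "Month_cos"]
    a = months.loc[months["Month"] == m1, cols].to_numpy()[0]
    b = months.loc[months["Month"] == m2, cols].to_numpy()[0]
    return float(np.linalg.norm(a - b))


# Adjacent months are equally close, including the December -> January wrap-around
print(encoded_distance(12, 1), encoded_distance(1, 2))
# Months half a year apart are the farthest from each other
print(encoded_distance(1, 7))
```

With the raw integer encoding, months 12 and 1 are 11 units apart; with this encoding they are as close as any other pair of neighbouring months, which is the cyclical property the text refers to.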

#### Chart - 9

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame creation (replace this with your actual data loading method)
# For example: _df_2 = pd.read_csv("your_file.csv")
_df_2 = pd.DataFrame({
    'Delay_from_due_date': [0, 1, -2, 5, 3, -1, 0, 2, 4, -3, 1, 1, 2, 3, 0, -2, 4, 5, 3, -1]
})

# Chart - 9: Delay from due date VS Frequency (Histogram)
_df_2['Delay_from_due_date'].plot(kind='hist', bins=20, title='Delay_from_due_date')
plt.gca().spines[['top', 'right']].set_visible(False)
plt.xlabel('Days Delayed')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is used for Delay_from_due_date to show the frequency distribution of the numerical delay values. While it could be treated as discrete, a histogram effectively bins the data to show where most delays occur.

##### 2. What is/are the insight(s) found from the chart?

* **Most Common Delay:** The most frequent delay observed is 3 days, with a frequency of 3.0 (likely representing thousands of instances given the y-axis scale).
* **Other Observed Delays:** Delays of 5 days and 6 days are also present, but with significantly lower frequencies (1.0 each).
* **Narrow Range:** The chart indicates that observed delays are concentrated in a very narrow range, primarily 3, 5, and 6 days, with 3 days being the most common. There are no shown delays between these values or at higher/lower extremes within the displayed range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, these insights are highly beneficial:

* **Early Intervention & Collections Strategy:** Knowing the most common delay (3 days) allows Paisabazaar to implement proactive and timely intervention strategies.
  * **Reason:** This enables setting up automated reminders or initial contact efforts at the 3-day mark to encourage payment, potentially preventing longer delays and reducing the cost of collections and loan defaults.
* **Refined Risk Assessment:** The specific delay patterns can be critical features in the credit score prediction model.
  * **Reason:** Models can learn that consistent 3-day delays might indicate a different risk profile than longer, less frequent delays, leading to more accurate credit assessments.

**Insights that Lead to Negative Growth:**

* **Ignoring Longer/More Frequent Delays:** If the dataset predominantly shows short, specific delays, it might lead to underestimating the risk posed by customers who experience longer or more frequent delays not visible in this snapshot.
  * **Reason:** Focusing only on the most common short delays might cause Paisabazaar to overlook the early signs of severe financial distress, leading to unmitigated defaults and negative growth.

* **Ineffective Communication Timing:** If customer communication strategies are not aligned with these common delay patterns (e.g., waiting too long to contact after the due date), it could be less effective.
  * **Reason:** Delayed communication increases the likelihood of a loan becoming delinquent, impacting revenue and increasing recovery costs.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# @title Chart 10: Num of Delayed Payment VS Freq (Distribution 4)
import pandas as pd
import matplotlib.pyplot as plt

# Sample creation of _df_3 (you should replace this with actual data loading)
# Example:
# _df_3 = pd.read_csv("your_data_file.csv")

# For demonstration, here's a sample DataFrame:
_df_3 = pd.DataFrame({
    'Num_of_Delayed_Payment': [0, 1, 2, 2, 3, 5, 5, 5, 7, 9, 10, 10, 11, 13, 15, 18, 20, 25]
})

# Plotting the histogram
_df_3['Num_of_Delayed_Payment'].plot(kind='hist', bins=20, title='Num_of_Delayed_Payment')
plt.xlabel('Number of Delayed Payments')
plt.ylabel('Frequency')
plt.gca().spines[['top', 'right']].set_visible(False)
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is used for Num_of_Delayed_Payment to show the frequency distribution of discrete numerical counts. It effectively highlights how many times customers typically delay payments.

##### 2. What is/are the insight(s) found from the chart?

* **Bimodal Distribution:** The chart shows two primary peaks, indicating two common scenarios for the number of delayed payments.
* **Most Common Delay Counts:** The highest frequency is for 4 delayed payments (frequency of 3.0), followed by 7 delayed payments (frequency of 2.0).
* **Absence of Other Counts:** The plot suggests that there are no or very few customers with 5 or 6 delayed payments within the displayed range, highlighting distinct clusters of delay behavior.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, these insights are highly beneficial:

* **Behavioral Segmentation & Risk Profiling:** Identifying distinct groups with 4 and 7 delayed payments allows Paisabazaar to segment customers based on their payment behavior.
  * **Reason:** This enables creating targeted risk profiles and developing specific strategies (e.g., early warning systems for those approaching 7 delays, or offering financial education to mitigate future delays), leading to better risk management and reduced defaults.

* **Tailored Collections Strategies:** Different strategies can be deployed for customers with 4 vs. 7 delayed payments.
  * **Reason:** More aggressive or personalized collections efforts might be needed for the 7-delay group, while the 4-delay group might benefit from gentler reminders, optimizing resource allocation and recovery rates.

**Insights that Lead to Negative Growth:**

* **Ignoring the High-Risk Group:** The presence of a significant group with 7 delayed payments represents a considerable risk.
  * **Reason:** If this group isn't identified and managed with appropriate collection or support measures, it will lead to higher non-performing assets and direct financial losses, negatively impacting growth.

* **Binary View of Delays:** If the data only shows 4 or 7 delays and other counts are truly absent, it might oversimplify payment behavior.
  * **Reason:** Assuming customers only fall into these two buckets without understanding the context of why no one has 5 or 6 delays (e.g., data artifact vs. actual behavior) could lead to an incomplete risk assessment, potentially missing subtle signs of distress and impacting accurate predictions.
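
Whether counts such as 5 or 6 are truly absent, or simply hidden by the histogram's binning, can be checked by tabulating every integer in the observed range. A minimal sketch, assuming a hypothetical sample in place of `df['Num_of_Delayed_Payment']`:

```python
import pandas as pd

# Hypothetical delayed-payment counts (not project data)
delays = pd.Series([4, 4, 4, 7, 7, 3, 8], name="Num_of_Delayed_Payment")

counts = delays.value_counts()

# Reindex over every integer between the min and max so absent values appear with count 0
full_range = range(int(delays.min()), int(delays.max()) + 1)
counts_full = counts.reindex(full_range, fill_value=0)

# Integer values inside the observed range that never occur
missing = counts_full[counts_full == 0].index.tolist()

print(counts_full)
print("Counts that never occur:", missing)
```

If the same check on the full column shows non-zero counts for 5 and 6, the apparent gap is a plotting or sampling artifact rather than real customer behavior.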

#### Chart - 11

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Example definition of _df_4 (replace with your actual dataset)
# For example, loading from CSV:
# _df_4 = pd.read_csv('your_dataset.csv')

# Sample data (remove if you're using your own file)
_df_4 = pd.DataFrame({
    'Payment_Behaviour': ['Regular', 'Regular', 'Delinquent', 'Irregular', 'Regular', 'Delinquent', 'Irregular', 'Delinquent']
})

# Chart 11: Payment Behaviour Graph (Categorical Distribution)
_df_4.groupby('Payment_Behaviour').size().plot(
    kind='barh',
    color=sns.color_palette('Dark2')
)
plt.title("Payment Behaviour Distribution")
plt.xlabel("Frequency")
plt.gca().spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart is appropriate for Payment_Behaviour because:

* **Categorical Variable:** Payment_Behaviour is a nominal categorical variable with distinct categories.

* **Distribution of Categories:** It effectively displays the frequency or proportion of each payment behavior category within the dataset.

* **Readability:** For categories with long labels (like "Low_spent_Small_value_payments"), horizontal bars improve readability compared to vertical bars where labels might overlap.

##### 2. What is/are the insight(s) found from the chart?

* **Uniform Distribution:** All Payment_Behaviour categories show an approximately equal distribution, with each having a frequency close to 1.0. This indicates that customers are almost evenly distributed across these different spending and payment value behaviors.

* **Balanced Data for Modeling:** This uniformity suggests that there isn't a significant class imbalance within the Payment_Behaviour feature, which is generally good for machine learning models.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, these insights are beneficial:

* **Robust Customer Segmentation:** The balanced distribution of payment behaviors enables Paisabazaar to create distinct and meaningful customer segments based on spending habits and payment value.
  * **Reason:** This allows for highly targeted product development, marketing campaigns, and customer service strategies tailored to each behavior type, leading to higher engagement and satisfaction.

* **Enhanced Model Performance:** A relatively even distribution of a key categorical feature like Payment_Behaviour is ideal for machine learning models.
  * **Reason:** It ensures that the model has sufficient examples from each behavior type to learn patterns effectively, potentially leading to more accurate predictions for creditworthiness across the spectrum of customer behaviors.

**Insights that Lead to Negative Growth:**

* **Lack of Prioritization without Further Analysis:** While balanced is good for modeling, it doesn't immediately highlight which payment behaviors are most "desirable" or "undesirable" from a business perspective (e.g., which lead to higher profit or higher risk).
  * **Reason:** Without further analysis (e.g., cross-tabulating Payment_Behaviour with Credit_Score or default rates), Paisabazaar might treat all segments equally. This could lead to a missed opportunity to prioritize efforts on behaviors that correlate with better credit outcomes or to proactively manage those correlated with higher risk, thus potentially hindering optimal business growth.

* **Static View:** The chart provides a static snapshot of current behavior.
  * **Reason:** If payment behaviors change over time, relying solely on this distribution without monitoring trends could lead to outdated strategies, potentially impacting customer retention or risk management negatively.
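
The cross-tabulation mentioned above can be sketched as follows. This is a minimal illustration on made-up rows, not the project's data; the `toy` frame and its values are assumptions, and only the column names come from the dataset.

```python
import pandas as pd

# Toy rows standing in for the real DataFrame (values are illustrative only).
toy = pd.DataFrame({
    "Payment_Behaviour": [
        "High_spent_Small_value_payments", "High_spent_Small_value_payments",
        "Low_spent_Small_value_payments", "Low_spent_Small_value_payments",
        "Low_spent_Small_value_payments", "High_spent_Large_value_payments",
    ],
    "Credit_Score": ["Good", "Standard", "Poor", "Poor", "Standard", "Good"],
})

# Row-normalised cross-tabulation: within each payment behaviour,
# what share of customers falls into each credit score category?
share_by_behaviour = pd.crosstab(
    toy["Payment_Behaviour"], toy["Credit_Score"], normalize="index"
)
print(share_by_behaviour.round(2))
```

Each row sums to 1, so behaviours with an unusually high share of 'Poor' scores stand out even when the behaviours themselves are evenly distributed.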

#### Chart - 12

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Example DataFrame for demonstration
_df_5 = pd.DataFrame({
    'ID': range(1, 101),
    'Month': [i % 12 + 1 for i in range(1, 101)]  # Cycles through months 1 to 12
})

# Chart - 12 visualization code
# Chart 12: ID VS Month (2D Distribution - 1)
_df_5.plot(kind='scatter', x='ID', y='Month', s=32, alpha=.8)
plt.gca().spines[['top', 'right']].set_visible(False)
plt.title('Chart 12: ID VS Month')
plt.xlabel('ID')
plt.ylabel('Month')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

* **Relationship between Two Numerical Variables:** A scatter plot is typically used to visualize the relationship or correlation between two numerical variables.
* **Sequential Nature:** Given ID is a unique identifier (likely sequential) and Month represents a sequential period, the plot inherently shows a progression.

##### 2. What is/are the insight(s) found from the chart?

* **Sequential ID-Month Mapping:** The chart shows a direct, linear relationship: as ID increases, Month also increases.
* **Data Collection Progression:** This suggests that records are likely being entered sequentially, with newer IDs corresponding to later months. Each specific ID in the plot corresponds to a specific month (e.g., ID 5634.0 is Month 1, ID 5638.0 is Month 5).
* **No Duplicates in Sample:** The distinct points indicate that each ID-Month pair in the sample is unique.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** No direct positive business impact from this chart for predictive modeling, but it offers data quality assurance:
* **Data Integrity/Audit Trail:** Confirms data is being recorded chronologically with unique IDs.
  * **Reason:** This ensures good data hygiene, which is foundational for reliable analysis and operational processes.

**Insights that Lead to Negative Growth:**

* **No Predictive Value:** The plot offers no analytical insight for predicting credit scores or business outcomes.
  * **Reason:** Including ID or a direct derivation from it as a feature would lead to severe overfitting, as a model would memorize IDs rather than learn general patterns, resulting in poor performance on new data and hindering business growth.

* **Misleading Simplicity:** The seemingly "clean" linear relationship might falsely suggest analytical value.
  * **Reason:** Over-reliance on such plots for non-analytical variables can distract from more impactful insights, leading to wasted effort and potentially flawed strategic decisions.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# @title Chart 13: Month VS Delay from due date (2D Distribution - 2)
_df_6 = df[['Month', 'Delay_from_due_date']].copy()
_df_6.plot(kind='scatter', x='Month', y='Delay_from_due_date', s=32, alpha=0.8)
plt.gca().spines[['top', 'right']].set_visible(False)
plt.title("Month vs Delay from Due Date")
plt.xlabel("Month")
plt.ylabel("Delay from Due Date")
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is used for Month vs Delay_from_due_date to explore the relationship between two numerical variables. It can reveal if there's any trend or pattern in payment delays as the months progress.

##### 2. What is/are the insight(s) found from the chart?

* **Increasing Delay Trend:** The plot suggests a general increasing trend in Delay_from_due_date as Month progresses.
  * Months 1, 2, and 3 show a consistent 3-day delay.
  * Month 4 shows a 5-day delay.
  * Month 5 shows a 6-day delay.

* **Specific Delay Points:** Delays appear to occur at specific, discrete values (3, 5, 6 days) rather than a continuous range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes, these insights are valuable:

* **Proactive Risk Management & Intervention Timing:** Identifying that delays tend to increase in later months (e.g., from month 4 onwards) allows Paisabazaar to anticipate and proactively increase monitoring or send earlier reminders in those months.
  * **Reason:** This can help prevent longer payment delays and reduce the number of non-performing loans, directly improving financial health.
* **Refined Predictive Modeling:** This time-dependent pattern of delays is a crucial feature for the credit score model.
  * **Reason:** Incorporating this trend will allow the model to better predict changes in credit behavior over time, leading to more accurate credit risk assessments.

**Insights that Lead to Negative Growth:**

* **Escalating Delinquency if Unchecked:** If this increasing trend of delays over time is not addressed through interventions, it could signify a worsening customer payment behavior.
  * **Reason:** A consistent increase in Delay_from_due_date over consecutive months can lead to higher rates of loan delinquency and eventual defaults, directly impacting Paisabazaar's revenue and increasing recovery costs.

* **Seasonal/Temporal Blind Spots:** This plot only covers months 1-5. Assuming this trend continues or applies to all months without data for the full year could be misleading.
  * **Reason:** Missing larger seasonal patterns could lead to ineffective strategies during other periods, potentially causing unexpected spikes in defaults or missed revenue opportunities.
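
A scatter plot of a few rows can suggest a trend that the full data does not support. One way to check is to aggregate the delay per month over all rows. The sketch below uses a small invented frame (`toy` and its values are assumptions) with the dataset's column names.

```python
import pandas as pd

# Illustrative rows only; the real DataFrame has many customers per month.
toy = pd.DataFrame({
    "Month": [1, 1, 2, 2, 3, 3, 4, 4],
    "Delay_from_due_date": [3, 5, 3, 7, 4, 6, 5, 9],
})

# Summarise the delay per month instead of reading individual points.
monthly_delay = (
    toy.groupby("Month")["Delay_from_due_date"]
    .agg(["mean", "median", "count"])
)
print(monthly_delay)
```

If the monthly means are flat once every row is included, the apparent increase in the scatter plot is more likely an artifact of the sample than a real temporal pattern.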

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# @title Correlation Heatmap
print("\n--- Generating Correlation Heatmap ---")
plt.figure(figsize=(20, 18)) # Adjust figure size for better readability
# Select only numerical columns for correlation calculation
numerical_df = df.select_dtypes(include=[np.number])
correlation_matrix = numerical_df.corr()
sns.heatmap(correlation_matrix, cmap='coolwarm', annot=False, fmt=".2f", linewidths=.5) # annot=False due to many columns
plt.title('Correlation Heatmap of All Numerical Features', fontsize=20)
plt.show()
print("Correlation Heatmap generated.")

##### 1. Why did you pick the specific chart?

A correlation heatmap is an excellent choice for visualizing the relationships between numerical features because:
* **Identifies Relationships:** It visually represents the pairwise correlation coefficients between all numerical variables in the dataset, indicating the strength and direction (positive or negative) of linear relationships.
* **Feature Selection/Engineering:** It's crucial for identifying highly correlated features (which might indicate multicollinearity, an issue for some models) and for understanding which features might be strong predictors of the target variable (Credit Score).
* **Comprehensive Overview:** For a dataset with many numerical features (like this one, after encoding), a heatmap provides a quick and comprehensive overview of all bivariate relationships in one glance.
* **Color-Coded Strength:** The use of a color gradient (e.g., coolwarm) makes it easy to spot strong positive (red) and strong negative (blue) correlations.

##### 2. What is/are the insight(s) found from the chart?

* **Strong Positive Correlation with Credit Score:**
  * Annual_Income and Monthly_Inhand_Salary show a strong positive correlation with Credit_Score (indicated by reddish squares). Higher income is therefore associated with better credit scores, although correlation alone does not establish causation.
  * Monthly_Balance also shows a positive correlation with Credit_Score.

* **Strong Negative Correlation with Credit Score:**
  * Outstanding_Debt, Credit_Utilization_Ratio, Num_of_Loan, Delay_from_due_date, and Num_of_Delayed_Payment show negative correlations with Credit_Score (bluish squares). This means higher debt, utilization, and payment delays are associated with lower credit scores.

* **High Multicollinearity among Income/Salary Features:**
  * Annual_Income and Monthly_Inhand_Salary are very highly positively correlated with each other (dark red). This is expected as monthly in-hand salary is derived from annual income.
  * Outstanding_Debt and Credit_Utilization_Ratio are also strongly positively correlated.

* **Correlations among Loan Types and Occupations:**
  * Most of the one-hot encoded Loan_ and Occupation_ columns show low to moderate correlations among themselves and with the main numerical features, though some specific pairs may have higher correlations (e.g., a customer holding one loan type often holds another).

* **Credit Mix and Payment of Min Amount:**
  * Credit_Mix (which was encoded as ordinal) and Payment_of_Min_Amount show moderate positive correlations with Credit_Score, indicating their importance.

* **Credit_History_Age Issue Confirmed:** The column Credit_History_Age shows almost no correlation with any other feature, appearing as mostly white/light-colored rows/columns. This visually confirms the earlier observation from df.info() that this column might be empty or have issues.
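
Because the heatmap is drawn without annotations, the same correlation matrix can also be read numerically. The sketch below ranks features by their correlation with the target and lists highly correlated feature pairs. The synthetic `toy` frame and the 0.9 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: salary is built from income, so the two are strongly correlated.
rng = np.random.default_rng(0)
n = 200
income = rng.normal(50_000, 10_000, n)
toy = pd.DataFrame({
    "Annual_Income": income,
    "Monthly_Inhand_Salary": income / 12 + rng.normal(0, 50, n),
    "Outstanding_Debt": rng.normal(1_500, 400, n),
    "Credit_Score": rng.integers(0, 3, n),
})

corr = toy.corr()

# Features ranked by their linear correlation with the target.
target_corr = corr["Credit_Score"].drop("Credit_Score").sort_values()
print(target_corr)

# Keep only the upper triangle so each pair appears once, then filter.
threshold = 0.9  # illustrative cut-off for "highly correlated"
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs.abs() > threshold]
print(high_pairs)
```

Pairs that exceed the threshold are candidates for dropping one member or combining them before fitting models that are sensitive to multicollinearity.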

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# @title Pair Plot
print("\n--- Generating Pair Plot for Key Features ---")
# Select a subset of numerical columns for the pair plot for better readability
# Include the target variable 'Credit_Score' to see relationships with it
key_features = [
    'Annual_Income',
    'Monthly_Inhand_Salary',
    'Outstanding_Debt',
    'Credit_Utilization_Ratio',
    'Total_EMI_per_month',
    'Monthly_Balance',
    'Age',
    'Num_of_Loan',
    'Interest_Rate',
    'Credit_Score' # Target variable
]

# Ensure all key_features exist in the DataFrame after preprocessing
available_key_features = [col for col in key_features if col in df.columns]
if len(available_key_features) < len(key_features):
    print(f"Warning: Some key features selected for pair plot are missing from DataFrame: {list(set(key_features) - set(available_key_features))}")

# Use the available key features
if available_key_features:
    # Convert Credit_Score back to categorical labels for better visualization in pairplot hue
    temp_df_for_plot = df[available_key_features].copy()
    credit_score_labels = {0: 'Poor', 1: 'Standard', 2: 'Good'}
    temp_df_for_plot['Credit_Score_Category'] = temp_df_for_plot['Credit_Score'].map(credit_score_labels)

    # Use hue for Credit_Score_Category if it's not all one value
    if temp_df_for_plot['Credit_Score_Category'].nunique() > 1:
        sns.pairplot(temp_df_for_plot, hue='Credit_Score_Category', diag_kind='kde', markers=["o", "s", "D"])
    else:
        sns.pairplot(temp_df_for_plot, diag_kind='kde') # If only one category, no hue needed

    plt.suptitle('Pair Plot of Key Numerical Features colored by Credit Score', y=1.02, fontsize=16) # Adjust title position
    plt.show()
else:
    print("No key features available for pair plot after preprocessing.")

print("Pair Plot generated.")

##### 1. Why did you pick the specific chart?

A pair plot is chosen for multivariate analysis of a subset of numerical features and the target variable (Credit_Score) because:

* **Bivariate Relationships:** It simultaneously displays the pairwise relationships (scatter plots) between all selected variables.
* **Univariate Distributions:** The diagonal shows the distribution of each individual variable (histograms/KDEs), often split by the hue variable (Credit_Score).
* **Target Variable Insights:** By coloring points by Credit_Score, it reveals how different credit score categories cluster or separate across various feature combinations, which is crucial for understanding feature importance and building a predictive model.
* **Initial Feature Engineering Guidance:** It can suggest non-linear relationships or interaction effects that might be useful for further feature engineering.

##### 2. What is/are the insight(s) found from the chart?

Focusing on the relationships colored by Credit_Score (Poor=Orange, Standard=Blue, Good=Green):

* **Annual_Income and Monthly_Inhand_Salary vs. Credit Score:**
  * As observed in the box plot, higher Annual_Income and Monthly_Inhand_Salary generally correlate with better Credit_Score (more green/blue points at higher income levels).
  * There's still a significant overlap, especially between "Poor" and "Standard" scores at lower to mid-income levels.
  * The strong linear relationship between Annual_Income and Monthly_Inhand_Salary is visible in their scatter plot (top left).

* **Outstanding_Debt and Credit_Utilization_Ratio vs. Credit Score:**
  * Higher Outstanding_Debt and Credit_Utilization_Ratio tend to be associated with "Poor" (orange) credit scores.
  * "Good" (green) credit scores are predominantly found at lower debt and utilization levels.
  * The diagonal distribution of Credit_Utilization_Ratio is roughly bell-shaped, with "Good" scores mostly on the left tail and "Poor" on the right.

* **Total_EMI_per_month and Monthly_Balance vs. Credit Score:**
  * Lower Total_EMI_per_month and higher Monthly_Balance are more associated with "Good" credit scores.
  * "Poor" scores tend to have higher EMIs and lower monthly balances.

* **Age vs. Credit Score:**
  * "Good" credit scores appear more frequently in slightly older age brackets (mid-30s to 40s) compared to "Poor" scores which are more spread out, including younger ages.
  * The overall age distribution confirms previous observations, but now we see how credit score groups populate these age bins.

* **Num_of_Loan vs. Credit Score:**
  * "Good" credit scores are typically associated with a moderate number of loans (e.g., 2-4), while "Poor" scores can be found across various loan counts, including very high numbers.
  * Very high Num_of_Loan values are almost exclusively orange ("Poor").

* **Interest_Rate vs. Credit Score:**
  * Lower Interest_Rates are strongly associated with "Good" credit scores (green points clustered at lower interest rates).
  * Higher Interest_Rates are almost exclusively associated with "Poor" credit scores (orange points at higher rates). This is a very strong relationship.


## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

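Based on the charts above, the three statements tested below are:

1. Mean Annual_Income differs across the Credit_Score categories.
2. Mean Credit_Utilization_Ratio differs across the Credit_Score categories.
3. Mean Monthly_Balance differs across the Credit_Score categories.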

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in the mean Annual Income across the different Credit Score categories (Poor, Standard, Good).

**Alternate Hypothesis (H1):** There is a significant difference in the mean Annual Income for at least one pair of Credit Score categories.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats  # provides f_oneway (one-way ANOVA)

# Extract Annual Income for each Credit Score category (0 = Poor, 1 = Standard, 2 = Good)
# dropna() guards against NaNs, which would make f_oneway return nan
income_poor = df[df['Credit_Score'] == 0]['Annual_Income'].dropna()
income_standard = df[df['Credit_Score'] == 1]['Annual_Income'].dropna()
income_good = df[df['Credit_Score'] == 2]['Annual_Income'].dropna()

# Perform one-way ANOVA test
f_statistic, p_value = stats.f_oneway(income_poor, income_standard, income_good)

print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4e}") # Using scientific notation for small p-values

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("\nResult: Reject the null hypothesis.")
    print("Conclusion: There is a statistically significant difference in mean Annual Income across Credit Score categories.")
else:
    print("\nResult: Fail to reject the null hypothesis.")
    print("Conclusion: There is no statistically significant difference in mean Annual Income across Credit Score categories.")


##### Which statistical test have you done to obtain P-Value?

One-Way Analysis of Variance (ANOVA) test.

##### Why did you choose the specific statistical test?

ANOVA was chosen because it is suitable for comparing the means of a continuous variable (Annual_Income) across three or more independent groups (the three Credit_Score categories: Poor, Standard, Good). The test helps determine whether the observed differences in mean income between these credit score groups are statistically significant or plausibly due to random chance.
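
With a large sample, a very small p-value says little about how large the difference between groups is. A simple companion to the F-test is eta squared, the share of total variance explained by group membership (between-group sum of squares divided by total sum of squares). The sketch below uses synthetic groups; the group means and sizes are assumptions, not results from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic incomes for three credit score groups (illustrative values only).
rng = np.random.default_rng(42)
poor = rng.normal(40_000, 12_000, 300)
standard = rng.normal(48_000, 12_000, 300)
good = rng.normal(60_000, 12_000, 300)
groups = [poor, standard, good]

# One-way ANOVA: are the group means equal?
f_stat, p_value = stats.f_oneway(*groups)

# Eta squared = SS_between / SS_total (effect size between 0 and 1).
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_values - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F = {f_stat:.2f}, p = {p_value:.3e}, eta^2 = {eta_squared:.3f}")
```

Reporting the effect size next to the p-value makes it clearer whether a statistically significant difference is also practically meaningful.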

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in the mean Credit Utilization Ratio across the different Credit Score categories (Poor, Standard, Good).

**Alternate Hypothesis (H1):** There is a significant difference in the mean Credit Utilization Ratio for at least one pair of Credit Score categories.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Extract Credit Utilization Ratio for each Credit Score category (0 = Poor, 1 = Standard, 2 = Good)
# dropna() guards against NaNs, which would make f_oneway return nan
utilization_poor = df[df['Credit_Score'] == 0]['Credit_Utilization_Ratio'].dropna()
utilization_standard = df[df['Credit_Score'] == 1]['Credit_Utilization_Ratio'].dropna()
utilization_good = df[df['Credit_Score'] == 2]['Credit_Utilization_Ratio'].dropna()

# Perform one-way ANOVA test
f_statistic, p_value = stats.f_oneway(utilization_poor, utilization_standard, utilization_good)

print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4e}") # Using scientific notation for small p-values

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("\nResult: Reject the null hypothesis.")
    print("Conclusion: There is a statistically significant difference in mean Credit Utilization Ratio across Credit Score categories.")
else:
    print("\nResult: Fail to reject the null hypothesis.")
    print("Conclusion: There is no statistically significant difference in mean Credit Utilization Ratio across Credit Score categories.")


##### Which statistical test have you done to obtain P-Value?

One-Way Analysis of Variance (ANOVA) test.

##### Why did you choose the specific statistical test?

ANOVA was chosen to compare the mean Credit_Utilization_Ratio values across the three Credit_Score groups. This helps determine if credit utilization significantly varies based on the credit score, which was visually suggested by the pair plot.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** There is no significant difference in the mean Monthly Balance across the different Credit Score categories (Poor, Standard, Good).

**Alternate Hypothesis (H1):** There is a significant difference in the mean Monthly Balance for at least one pair of Credit Score categories.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Extract Monthly Balance for each Credit Score category (0 = Poor, 1 = Standard, 2 = Good)
# dropna() guards against NaNs, which would make f_oneway return nan
balance_poor = df[df['Credit_Score'] == 0]['Monthly_Balance'].dropna()
balance_standard = df[df['Credit_Score'] == 1]['Monthly_Balance'].dropna()
balance_good = df[df['Credit_Score'] == 2]['Monthly_Balance'].dropna()

# Perform one-way ANOVA test
f_statistic, p_value = stats.f_oneway(balance_poor, balance_standard, balance_good)

print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4e}") # Using scientific notation for small p-values

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("\nResult: Reject the null hypothesis.")
    print("Conclusion: There is a statistically significant difference in mean Monthly Balance across Credit Score categories.")
else:
    print("\nResult: Fail to reject the null hypothesis.")
    print("Conclusion: There is no statistically significant difference in mean Monthly Balance across Credit Score categories.")


##### Which statistical test have you done to obtain P-Value?

One-Way Analysis of Variance (ANOVA) test.

##### Why did you choose the specific statistical test?

ANOVA was selected to compare the average Monthly_Balance among the three distinct Credit_Score groups. This test is appropriate for assessing if the observed differences in mean monthly balance across these credit score classifications are statistically meaningful, aligning with visual patterns from the pair plot.
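
The three tests above repeat the same steps, so they could be wrapped in a small helper. The sketch below is one possible shape; `compare_groups` is a hypothetical name, the `toy` data is synthetic, and the Kruskal-Wallis test is added here only as a rank-based cross-check for skewed features, not something the notebook itself runs.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_groups(frame, value_col, group_col="Credit_Score"):
    """Run one-way ANOVA and Kruskal-Wallis on value_col split by group_col."""
    samples = [
        g[value_col].dropna().to_numpy() for _, g in frame.groupby(group_col)
    ]
    f_stat, p_anova = stats.f_oneway(*samples)
    h_stat, p_kruskal = stats.kruskal(*samples)
    return {
        "feature": value_col,
        "F": f_stat,
        "p_anova": p_anova,
        "H": h_stat,
        "p_kruskal": p_kruskal,
    }

# Synthetic data: balance differs by group, utilization does not by construction.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Credit_Score": np.repeat([0, 1, 2], 100),
    "Monthly_Balance": np.concatenate([
        rng.normal(250, 80, 100),
        rng.normal(320, 80, 100),
        rng.normal(420, 80, 100),
    ]),
    "Credit_Utilization_Ratio": rng.normal(32, 5, 300),
})

results = pd.DataFrame(
    [compare_groups(toy, col) for col in ["Monthly_Balance", "Credit_Utilization_Ratio"]]
)
print(results)
```

When the two p-values disagree for a feature, that is a hint that ANOVA's normality or equal-variance assumptions may not hold for it.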

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

print("Handling Missing Values & Missing Value Imputation")
print("Checking for missing values after initial preprocessing:")

missing_values_before_imputation = df.isnull().sum()
print(missing_values_before_imputation[missing_values_before_imputation > 0])

# Impute missing values if any are still present
if missing_values_before_imputation.sum() > 0:
    print("\nProceeding with missing value imputation for remaining columns.")

    # Impute numerical columns with median (robust to outliers)
    for col in df.select_dtypes(include=np.number).columns:
        if df[col].isnull().any():
            median_val = df[col].median()
            # Assign back instead of inplace=True on a column selection,
            # which is deprecated in recent pandas versions
            df[col] = df[col].fillna(median_val)
            print(f"Filled missing values in '{col}' with median: {median_val}")

    # Impute categorical columns (shouldn't be any left after initial preprocessing) with mode
    for col in df.select_dtypes(include='object').columns:
        if df[col].isnull().any():
            mode_val = df[col].mode()[0]
            df[col] = df[col].fillna(mode_val)
            print(f"Filled missing values in '{col}' with mode: {mode_val}")

    print("\nMissing value handling complete. Re-checking missing values:")
    print(df.isnull().sum()[df.isnull().sum() > 0])

else:
    print("No missing values found in the DataFrame. All values are non-null.")

#### What all missing value imputation techniques have you used and why did you use those techniques?

1. **Median Imputation for Numerical Columns:** For all columns identified as numerical (e.g., float64, int64), any missing values (NaN) are filled with the median value of that specific column.
  * **Robustness to Outliers:** The median is a robust measure of central tendency, meaning it is less affected by extreme values or outliers in the data compared to the mean. If a numerical column contains outliers, using the mean for imputation could distort the distribution and introduce bias. Using the median helps to preserve the original distribution more accurately.
  * **Preserves Data Integrity:** It fills missing entries with a typical value for the column, so imputed values stay within the observed range rather than introducing artificial values far outside it.

2. **Mode Imputation for Categorical Columns:** For any columns identified as object dtype (which typically represent categorical or text data), any missing values are filled with the mode (the most frequently occurring value) of that column.
  * **Applicability:** For categorical data, statistical measures like mean or median are not meaningful. The mode is the most appropriate measure of central tendency for nominal or ordinal categorical variables.
  * **Stays Within Existing Categories:** Imputing with the mode never introduces a new category, since it replaces missing values with the most common existing one. When many values are missing, it does inflate the share of that category, so it is best suited to columns with few gaps.

**Additional Context in the Code:**
The code first checks whether any missing values exist at all (`missing_values_before_imputation.sum() > 0`). If none are detected, the imputation steps are skipped, indicating that the dataset is already clean from a missing-values perspective. This was largely the case after the initial data preparation, where most object columns were transformed or dropped, leaving a largely complete numerical dataset. The imputation section acts as a safeguard for any remaining or newly introduced NaNs.
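
The difference between median and mean imputation is easiest to see on a tiny example. The sketch below uses an invented frame (`toy` and its values are assumptions) with one extreme balance and one missing occupation.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing numeric value, one missing category,
# and one extreme balance that would distort a mean-based fill.
toy = pd.DataFrame({
    "Monthly_Balance": [200.0, 250.0, np.nan, 300.0, 10_000.0],
    "Occupation": ["Engineer", None, "Teacher", "Engineer", "Doctor"],
})

for col in toy.columns:
    if not toy[col].isnull().any():
        continue
    if pd.api.types.is_numeric_dtype(toy[col]):
        fill_value = toy[col].median()   # 275.0 here; the mean would be 2687.5
    else:
        fill_value = toy[col].mode()[0]  # most frequent category
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```

The single extreme balance pulls the mean to 2687.5, far from the typical values, while the median fill of 275.0 stays close to them.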

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

print("\nHandling Outliers & Outlier Treatments (Capping using IQR)")

# Identify numerical columns for outlier treatment (exclude binary/encoded columns like Loan_, Occupation_, Payment_Behaviour_)
# We also exclude 'Month' as it's a discrete categorical feature and 'Credit_Score' as it's the target.
numerical_cols_for_outliers = [
    'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts',
    'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan',
    'Delay_from_due_date', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit',
    'Num_Credit_Inquiries', 'Outstanding_Debt', 'Credit_Utilization_Ratio',
    'Credit_History_Age', 'Total_EMI_per_month', 'Amount_invested_monthly',
    'Monthly_Balance', 'Age'
]

# Filter for columns that actually exist in the DataFrame
numerical_cols_for_outliers = [col for col in numerical_cols_for_outliers if col in df.columns]

if not numerical_cols_for_outliers:
    print("No suitable numerical columns found for outlier treatment.")
else:
    for col in numerical_cols_for_outliers:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1

        upper_bound = Q3 + 1.5 * IQR
        lower_bound = Q1 - 1.5 * IQR

        # Count outliers before capping
        outliers_upper = df[df[col] > upper_bound].shape[0]
        outliers_lower = df[df[col] < lower_bound].shape[0]

        if outliers_upper > 0 or outliers_lower > 0:
            df[col] = np.where(df[col] > upper_bound, upper_bound, df[col])
            df[col] = np.where(df[col] < lower_bound, lower_bound, df[col])
            print(f"Outliers in '{col}' capped at [{lower_bound:.2f}, {upper_bound:.2f}]. Upper outliers treated: {outliers_upper}, Lower outliers treated: {outliers_lower}")
        else:
            print(f"No significant outliers found in '{col}' based on IQR method.")

print("Outlier treatment complete.")

##### What all outlier treatment techniques have you used and why did you use those techniques?

**IQR (Interquartile Range) Based Capping (Winsorization):** For identified numerical columns, outliers are treated by capping them. This means any values below a calculated lower bound are replaced with the lower bound, and any values above a calculated upper bound are replaced with the upper bound.

* **Calculation:**
  * The First Quartile (Q1) is calculated (25th percentile).
  * The Third Quartile (Q3) is calculated (75th percentile).
  * The Interquartile Range (IQR) is calculated as IQR=Q3−Q1.
  * The Upper Bound for outliers is defined as UpperBound=Q3+1.5×IQR.
  * The Lower Bound for outliers is defined as LowerBound=Q1−1.5×IQR.

* **Robustness to Extreme Values:** The IQR method is robust to extreme values because it relies on quartiles rather than the mean and standard deviation, which are heavily influenced by outliers. This makes it particularly suitable for skewed distributions, which are common in real-world datasets (like Annual_Income in this dataset).

* **Preserves Data Integrity (relative to removal):** Instead of removing outlier rows (which can lead to data loss and reduced sample size), capping preserves all observations. This is important for maintaining the overall structure and size of the dataset, especially if outliers represent valid, albeit extreme, data points.

* **Common and Interpretable:** It's a widely accepted and intuitive method for outlier treatment that effectively limits the influence of extreme values without drastically altering the rest of the data.

* **Simple Implementation:** It's relatively straightforward to implement and interpret compared to more complex methods.

**Columns Selected for Outlier Treatment:**

The code specifically targets key numerical features that are prone to outliers in real-world scenarios, such as Annual_Income, Monthly_Inhand_Salary, Outstanding_Debt, Credit_Utilization_Ratio, Age, etc. It explicitly excludes binary/encoded columns and the Credit_Score target variable, as outlier treatment is not appropriate for these types of features.
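
The bounds described above can be traced by hand on a short series. The sketch below is a worked example with invented values; it uses `Series.clip`, which caps values the same way as the two `np.where` calls in the cell above.

```python
import pandas as pd

# Eight illustrative incomes (in thousands); 500 is an obvious outlier.
income = pd.Series(
    [30, 35, 40, 45, 50, 55, 60, 500], name="Annual_Income", dtype="float64"
)

q1 = income.quantile(0.25)        # 38.75
q3 = income.quantile(0.75)        # 56.25
iqr = q3 - q1                     # 17.5
lower_bound = q1 - 1.5 * iqr      # 12.5
upper_bound = q3 + 1.5 * iqr      # 82.5

# Cap values outside [lower_bound, upper_bound]; no rows are removed.
capped = income.clip(lower=lower_bound, upper=upper_bound)
print(capped.tolist())
```

Only the last value changes (500 becomes 82.5), and the series keeps all eight observations.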

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

print("\nCategorical Column Encoding Status")

# Check for remaining object (categorical) columns
remaining_object_cols = df.select_dtypes(include='object').columns.tolist()

if not remaining_object_cols:
    print("All categorical columns have already been encoded in the 'Data Preparation for Analysis' section.")
    print("No further encoding is required for the current DataFrame.")

else:
    print("The following categorical columns still exist and need encoding:")
    print(remaining_object_cols)
    print("Please specify the encoding method (e.g., One-Hot Encoding, Label Encoding) if these columns are intended for the model.")

#### What all categorical encoding techniques have you used & why did you use those techniques?

1.  **Ordinal Mapping (Manual/Dictionary Mapping):**
    * For `Credit_Mix`, `Payment_of_Min_Amount`, and `Credit_Score`, a dictionary-based mapping was applied to convert categorical labels into numerical values. For the ordered variables, the mapping preserves the inherent hierarchy of the categories (e.g., 'Bad' < 'Standard' < 'Good' for `Credit_Mix` and 'Poor' < 'Standard' < 'Good' for `Credit_Score`).

        * **Preserves Order:** These variables have a clear, meaningful order. Ordinal mapping ensures that this order is numerically represented, which can be beneficial for machine learning models that interpret numerical relationships.

        * **Reduces Dimensionality:** Unlike one-hot encoding, ordinal mapping creates only one numerical column, avoiding the creation of many new columns and thus reducing dimensionality. This is efficient for features with many ordered categories.

        * **Direct Interpretability:** The numerical values directly reflect the "level" or "rank" of the category, making the encoded feature more interpretable.

2.  **Multi-Label One-Hot Encoding (Custom Logic):**
    * For the `Type_of_Loan` column, which contained multiple, comma-separated loan types, a custom approach was used. It first identified all unique loan types across the dataset. Then, for each unique loan type, a new binary column (`Loan_LoanTypeName`) was created. A value of `1` in this new column indicates that the customer has that specific loan type, and `0` indicates absence.
        * **Handles Multi-Valued Categories:** This technique is specifically designed for variables where a single observation can belong to multiple categories simultaneously (e.g., a customer can have both a "Personal Loan" and a "Student Loan"). Standard one-hot encoding assumes mutually exclusive categories, which wouldn't work here.

        * **No Implicit Order:** Loan types are nominal (there's no inherent order). One-hot encoding creates independent binary features for each type, ensuring no false sense of order or magnitude is introduced.

        * **Machine Learning Compatibility:** Most machine learning algorithms require numerical input. This method transforms the complex multi-label text into a compatible numerical format.

3.  **One-Hot Encoding (`pd.get_dummies`):**
    * **Technique Used:** For `Occupation` and `Payment_Behaviour`, `pd.get_dummies` was used to convert these nominal (unordered) categorical columns into numerical features. For each unique category within these columns, a new binary column is created (e.g., `Occupation_Engineer`). A value of `1` indicates the presence of that category, and `0` indicates its absence. `drop_first=True` was applied to avoid multicollinearity.

        * **Handles Nominal Data:** These variables have no inherent order, and one-hot encoding treats each category as independent, preventing the model from assuming any artificial ordinal relationship between them (which would happen with label encoding).

        * **Machine Learning Compatibility:** It converts categorical data into a format that machine learning algorithms can understand and process.
        
        * **Avoids Multicollinearity (`drop_first=True`):** By dropping the first category, it reduces perfect multicollinearity, which can be problematic for some linear models. While not strictly necessary for all models, it's good practice.
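
As a rough illustration of items 2 and 3 above, the sketch below builds both encodings on a three-row toy frame. The sample values are invented, and `Series.str.get_dummies` is used here as a compact stand-in for the notebook's custom loop, which may differ in detail (for example in how column names are cleaned).

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "Type_of_Loan": ["Personal Loan, Student Loan", "Auto Loan", "Student Loan"],
    "Occupation": ["Engineer", "Doctor", "Engineer"],
})

# Multi-label encoding: one binary column per loan type found in the
# comma-separated text, so a row can have several 1s at once.
loan_flags = toy["Type_of_Loan"].str.get_dummies(sep=", ").add_prefix("Loan_")

# Standard one-hot encoding for a nominal column; drop_first removes one
# redundant level ('Doctor' here, the first in sorted order).
occupation_flags = pd.get_dummies(
    toy["Occupation"], prefix="Occupation", drop_first=True, dtype=int
)

encoded = pd.concat([loan_flags, occupation_flags], axis=1)
print(encoded)
```

The first row ends up with `1` in both `Loan_Personal Loan` and `Loan_Student Loan`, which is exactly what a mutually exclusive one-hot encoding could not represent.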

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
import contractions
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Make sure nltk data is downloaded
try:
    nltk.data.find('corpora/stopwords')
    nltk.data.find('tokenizers/punkt')
    nltk.data.find('corpora/wordnet')
    nltk.data.find('taggers/averaged_perceptron_tagger')
    print("NLTK resources already downloaded.")
except LookupError:
    print("Downloading NLTK resources...")
    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')
    nltk.download('averaged_perceptron_tagger')
    print("NLTK resources downloaded.")

# Sample DataFrame (replace this with your actual data)
# df = pd.read_csv("your_file.csv")

# Textual Data Preprocessing
print("\nTextual Data Preprocessing")

# Check if there are any object (textual) columns remaining in the DataFrame
text_columns = df.select_dtypes(include='object').columns.tolist()

if text_columns:
    print(f"Textual columns identified: {text_columns}. Proceeding with text preprocessing steps.")

    # Initialize stemmer and lemmatizer
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    for col in text_columns:
        print(f"\nProcessing textual column: '{col}'")

        # Expand Contractions
        # Fill missing values first: astype(str) would otherwise turn NaN into the literal text 'nan'
        df[col] = df[col].fillna('').astype(str).apply(contractions.fix)
        print(f"  - Expanded contractions in '{col}'.")

        # Lower Casing
        df[col] = df[col].astype(str).str.lower()
        print(f"  - Lowercased text in '{col}'.")

        # Remove URLs & words containing digits (done before punctuation removal, while URLs are still intact)
        df[col] = df[col].apply(lambda x: re.sub(r'http\S+|www\.\S+|\b\w*\d\w*\b', '', x) if pd.notna(x) else x)
        print(f"  - Removed URLs and words containing digits from '{col}'.")

        # Remove Punctuations
        df[col] = df[col].apply(lambda x: re.sub(r'[^\w\s]', '', x) if pd.notna(x) else x)
        print(f"  - Removed punctuations from '{col}'.")

        # Remove Stopwords
        df[col] = df[col].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]) if pd.notna(x) else x)
        print(f"  - Removed stopwords from '{col}'.")

        # Remove Extra Whitespaces
        df[col] = df[col].apply(lambda x: re.sub(r'\s+', ' ', x).strip() if pd.notna(x) else x)
        print(f"  - Removed extra whitespaces from '{col}'.")

        # Rephrase Text (skipped)
        print(f"  - 'Rephrase Text' is skipped due to lack of domain-specific rules or models.")

        # Tokenization
        df[col + '_tokens'] = df[col].apply(lambda x: word_tokenize(x) if pd.notna(x) else [])
        print(f"  - Tokenized text in '{col}'. New column created: '{col}_tokens'.")

        # Lemmatization
        df[col + '_normalized'] = df[col + '_tokens'].apply(
            lambda tokens: [lemmatizer.lemmatize(word) for word in tokens] if isinstance(tokens, list) else []
        )

        # Stemming
        df[col + '_normalized'] = df[col + '_normalized'].apply(
            lambda tokens: [stemmer.stem(word) for word in tokens] if isinstance(tokens, list) else []
        )
        print(f"  - Normalized text (lemmatization + stemming) in '{col}'. New column: '{col}_normalized'.")

        # POS Tagging (run on the unstemmed tokens, since the tagger expects real words rather than stems)
        df[col + '_pos_tags'] = df[col + '_tokens'].apply(
            lambda tokens: nltk.pos_tag(tokens) if isinstance(tokens, list) else []
        )
        print(f"  - POS tagging complete. New column: '{col}_pos_tags'.")

        # Vectorizing (TF-IDF) - optional
        print(f"  - 'Vectorizing Text' is a next step (e.g., TF-IDF, Word2Vec, BERT). Applying basic TF-IDF here.")
        if not df[col].empty:
            tfidf_vectorizer = TfidfVectorizer(max_features=100)
            text_for_vectorization = df[col + '_normalized'].apply(lambda x: ' '.join(x) if isinstance(x, list) else '').tolist()
            if len(text_for_vectorization) > 0:
                tfidf_matrix = tfidf_vectorizer.fit_transform(text_for_vectorization)
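                # Prefix term columns with the source column name so they cannot collide with existing columns
                tfidf_df = pd.DataFrame(
                    tfidf_matrix.toarray(),
                    columns=[f"{col}_tfidf_{term}" for term in tfidf_vectorizer.get_feature_names_out()]
                )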
                df = pd.concat([df.reset_index(drop=True), tfidf_df.reset_index(drop=True)], axis=1)
                print(f"  - Applied TF-IDF Vectorization. Added {tfidf_df.shape[1]} new features.")
            else:
                print(f"  - No text found in '{col}' for TF-IDF vectorization.")

    print("\nTextual data preprocessing complete.")
else:
    print("No 'object' (text) columns found in the DataFrame. Preprocessing not required.")


#### 2. Lower Casing

In [None]:
# Lower Casing
#mentioned in above cell "Expand Contraction"

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
#mentioned in above cell "Expand Contraction"

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits
#mentioned in above cell "Expand Contraction"

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
#mentioned in above cell "Expand Contraction"

In [None]:
# Remove White spaces
#mentioned in above cell "Expand Contraction"

#### 6. Rephrase Text

In [None]:
# Rephrase Text
#mentioned in above cell "Expand Contraction"

#### 7. Tokenization

In [None]:
# Tokenization
#mentioned in above cell "Expand Contraction"

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
#mentioned in above cell "Expand Contraction"

##### Which text normalization technique have you used and why?

Lemmatization followed by stemming: The code first applies `nltk.stem.WordNetLemmatizer` to the tokenized text and then passes the result through `PorterStemmer`, storing the output in the `<col>_normalized` column. Lemmatization converts words to their base or dictionary form (lemma); for example, "running" becomes "run" when tagged as a verb, and "better" becomes "good" when tagged as an adjective. Because the code calls `lemmatizer.lemmatize(word)` without a part-of-speech argument, every token is treated as a noun, so in practice it is mainly plural nouns that get reduced (e.g., "loans" becomes "loan").

* **Reduces Inflectional Forms:** Lemmatization is often preferred over stemming because it reduces words to real dictionary forms, which is better for interpretability and can retain more meaning. The stemming step that follows trades some of that readability for a more aggressive merging of word variants.

* **Improved Accuracy:** By mapping different inflections of a word to a single base form, it helps in standardizing the text without losing semantic meaning, which can improve the accuracy of subsequent analyses (like text vectorization and modeling).

#### 9. Part of speech tagging

In [None]:
# POS Tagging
#mentioned in above cell "Expand Contraction"

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
#mentioned in above cell "Expand Contraction"

##### Which text vectorization technique have you used and why?

TF-IDF (`sklearn.feature_extraction.text.TfidfVectorizer` with `max_features=100`) is applied, but only conditionally.

How and when it is applied:
* The preprocessing cell runs TF-IDF on the `_normalized` tokens of every column that still has the `object` dtype and appends the resulting features to `df`.
* If no `object` columns remain after the categorical encoding steps, the cell prints that preprocessing is not required and no text features are created.
* **Reason:** TF-IDF is a simple, sparse baseline that down-weights terms appearing in most rows. The choice of text vectorization technique depends heavily on the machine learning model, the nature of the text data, and the overall objectives, so alternatives such as Count Vectorization, Word2Vec, or BERT embeddings remain options if richer text features are needed. Vectorization is typically the final step before feeding text features into a model.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
print("\nFeature Manipulation (Minimizing Correlation & Creating New Features)")

EPS = 1e-6  # small constant to avoid division by zero

# Step 1: create new features first, while their base columns are still available.
# (Dropping correlated columns before this step would make some ratios impossible to compute.)
print("\nCreating new features:")

# Example 1: Debt-to-Income Ratio
if 'Outstanding_Debt' in df.columns and 'Annual_Income' in df.columns:
    df['Debt_to_Income_Ratio'] = df['Outstanding_Debt'] / (df['Annual_Income'] + EPS)
    print("  - Created 'Debt_to_Income_Ratio'.")
elif 'Credit_Utilization_Ratio' in df.columns and 'Annual_Income' in df.columns:
    # This is a proxy, not true DTI, used only when the actual debt column is unavailable
    df['Utilization_Income_Ratio'] = df['Credit_Utilization_Ratio'] / (df['Annual_Income'] + EPS)
    print("  - Created 'Utilization_Income_Ratio' (proxy for Debt-to-Income).")
else:
    print("  - Could not create Debt-to-Income related ratio due to missing base columns.")

# Example 2: Monthly Debt Burden (EMI relative to Monthly Inhand Salary)
if 'Total_EMI_per_month' in df.columns and 'Monthly_Inhand_Salary' in df.columns:
    df['EMI_to_Salary_Ratio'] = df['Total_EMI_per_month'] / (df['Monthly_Inhand_Salary'] + EPS)
    print("  - Created 'EMI_to_Salary_Ratio'.")
else:
    print("  - Could not create 'EMI_to_Salary_Ratio' due to missing base columns.")

# Example 3: Payment Consistency Score (combining delay frequency and amount)
# Inverse of the product, so higher is better.
# Note: rows with no delays get a very large value (about 1 / EPS).
if 'Num_of_Delayed_Payment' in df.columns and 'Delay_from_due_date' in df.columns:
    df['Payment_Consistency'] = 1 / ((df['Num_of_Delayed_Payment'] * df['Delay_from_due_date']) + EPS)
    print("  - Created 'Payment_Consistency'.")
else:
    print("  - Could not create 'Payment_Consistency' due to missing base columns.")

# Step 2: drop one feature from each highly correlated pair (based on the earlier heatmap).
# Annual_Income and Monthly_Inhand_Salary are highly correlated; keep 'Annual_Income'
# as it's the primary income source.
if 'Monthly_Inhand_Salary' in df.columns and 'Annual_Income' in df.columns:
    corr_income_salary = df['Annual_Income'].corr(df['Monthly_Inhand_Salary'])
    if corr_income_salary > 0.9:  # Threshold for high correlation
        df = df.drop('Monthly_Inhand_Salary', axis=1)
        print(f"Dropped 'Monthly_Inhand_Salary' due to high correlation with 'Annual_Income' (correlation: {corr_income_salary:.2f}).")
    else:
        print(f"Kept 'Monthly_Inhand_Salary' (correlation with 'Annual_Income': {corr_income_salary:.2f}).")
else:
    print("Skipped dropping 'Monthly_Inhand_Salary' as one or both columns are not present.")

# Outstanding_Debt and Credit_Utilization_Ratio are often correlated.
# Keep Credit_Utilization_Ratio if so, as it's a normalized metric of active credit usage.
if 'Outstanding_Debt' in df.columns and 'Credit_Utilization_Ratio' in df.columns:
    corr_debt_util = df['Outstanding_Debt'].corr(df['Credit_Utilization_Ratio'])
    if corr_debt_util > 0.7:  # A common threshold for strong correlation
        df = df.drop('Outstanding_Debt', axis=1)
        print(f"Dropped 'Outstanding_Debt' due to high correlation with 'Credit_Utilization_Ratio' (correlation: {corr_debt_util:.2f}).")
    else:
        print(f"Kept 'Outstanding_Debt' (correlation with 'Credit_Utilization_Ratio': {corr_debt_util:.2f}).")
else:
    print("Skipped dropping 'Outstanding_Debt' as one or both columns are not present.")

print("Feature manipulation complete.")

#### 2. Feature Selection

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Select your features wisely to avoid overfitting
print("\nFeature Selection")

# Define features (X) and target (y)
X = df.drop('Credit_Score', axis=1)
y = df['Credit_Score']

# Drop any non-numeric columns (like text tokens, POS tags etc.)
X = X.select_dtypes(include=np.number)

# Drop the 'Credit_History_Age' column only if it's all NaNs
if 'Credit_History_Age' in X.columns and X['Credit_History_Age'].isna().all():
    X = X.drop('Credit_History_Age', axis=1)
    print("Dropped 'Credit_History_Age' due to it containing only missing values.")

# SelectKBest feature selection: keep half of the features, but always at least one
num_features_to_select = max(1, X.shape[1] // 2)
if X.shape[1] == 0:
    print("No features available for selection.")
else:
    selector = SelectKBest(score_func=f_classif, k=num_features_to_select)
    X_selected = selector.fit_transform(X, y)

    # Get selected feature names
    selected_feature_indices = selector.get_support(indices=True)
    selected_features = X.columns[selected_feature_indices].tolist()

    X = X[selected_features]
    print(f"Selected {len(selected_features)} features using SelectKBest (f_classif):")
    print(selected_features)
    print(f"New X shape after feature selection: {X.shape}")

print("Feature selection complete.")


##### What all feature selection methods have you used  and why?

**Univariate Feature Selection method called SelectKBest with the f_classif scoring function is used:**

* **SelectKBest:** This method selects the top 'k' features (in our code, k is set to half the number of current features) based on the highest scores from a statistical test. It's a simple yet effective way to remove features that have a weak relationship with the target variable.

* **f_classif (ANOVA F-value):** This is the scoring function used with SelectKBest. It calculates the ANOVA F-value between each feature and the target variable.

  * **Suitability for Classification:** `f_classif` is designed for classification tasks, where the target variable is categorical (here `Credit_Score`, encoded as 0, 1, 2). It assesses the dependency between a numerical feature and the categorical target variable.
  * **Measures Group Separation:** The F-value measures the variance between the means of different target classes relative to the variance within the classes. A higher F-value indicates a stronger separation between the classes based on that feature, implying greater predictive power.
  * **Computational Efficiency:** It's computationally inexpensive and works well as a first-pass filter to identify globally relevant features before more complex models are built.
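
The selection step can be sketched on synthetic data as follows. The column names (`informative`, `noise_a`, and so on) and the data itself are invented for illustration; only the `SelectKBest` plus `f_classif` combination and the "keep half" rule mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Three balanced classes, standing in for Poor / Standard / Good.
y_demo = np.repeat([0, 1, 2], 100)

# One feature whose mean shifts with the class, plus three pure-noise features.
X_demo = pd.DataFrame({
    "informative": y_demo + rng.normal(0, 0.5, size=300),
    "noise_a": rng.normal(0, 1, size=300),
    "noise_b": rng.normal(0, 1, size=300),
    "noise_c": rng.normal(0, 1, size=300),
})

# Keep half of the features (at least one), ranked by ANOVA F-value.
selector = SelectKBest(score_func=f_classif, k=max(1, X_demo.shape[1] // 2))
selector.fit(X_demo, y_demo)

scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
kept = X_demo.columns[selector.get_support()].tolist()
print(scores.round(2))
print("Kept:", kept)
```

The feature whose class means differ receives by far the largest F-value, while the second retained feature is simply the least weak of the noise columns. This shows that a fixed `k` always keeps `k` features, whether or not they are useful.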

##### Which all features you found important and why?

Based on the `f_classif` scores, and consistent with the correlation heatmap and pair plots, the features expected to be most important for predicting `Credit_Score`, and therefore the likeliest to be kept by `SelectKBest`, are:

* **`Annual_Income`:**
    * **Why Important:** Higher income generally indicates a greater capacity to repay debts, which is a fundamental aspect of creditworthiness. Our pair plots clearly showed a positive correlation between income and `Credit_Score`.
* **`Outstanding_Debt` or `Credit_Utilization_Ratio`:** (One of these would likely be kept due to high correlation, e.g., `Credit_Utilization_Ratio`)
    * **Why Important:** These metrics directly reflect a person's current debt burden and how much of their available credit they are using. High debt or utilization indicates higher risk. The correlation heatmap showed a strong negative correlation with `Credit_Score`.
* **`Interest_Rate`:**
    * **Why Important:** The interest rate a person is offered is often a direct reflection of their perceived credit risk. Lower interest rates are given to more creditworthy individuals. The pair plot revealed a very strong inverse relationship with `Credit_Score`.
* **`Num_of_Delayed_Payment` and `Delay_from_due_date`:**
    * **Why Important:** These features directly quantify past negative payment behavior. Frequent or longer delays are strong indicators of higher credit risk. Our specific charts highlighted how these relate to overall credit score categories.
* **`Total_EMI_per_month` and `Monthly_Balance`:**
    * **Why Important:** These reflect a customer's ongoing financial commitments and liquidity. A high EMI relative to income, or a low monthly balance, can indicate financial strain and higher risk.
* **`Credit_Mix`:**
    * **Why Important:** A diverse and healthy credit mix (e.g., a good blend of installment loans and revolving credit) is often seen favorably by creditors as it demonstrates responsible management of different credit types.

These features are crucial because they directly impact a lender's assessment of a borrower's ability and willingness to repay, which are the core components of a credit score. `SelectKBest` with `f_classif` would identify these features as having the most statistically significant differences in their means across the 'Poor', 'Standard', and 'Good' credit score groups.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, based on the nature of typical financial datasets and the common requirements of machine learning models, **I do think the data needs to be transformed.**

In the provided code, a **logarithmic transformation**, specifically `np.log1p()`, is used.

**Reason:**

* **Handling Skewed Distributions:** Many continuous numerical features in real-world datasets, especially financial ones like `Annual_Income`, `Outstanding_Debt`, `Total_EMI_per_month`, and `Monthly_Balance`, often exhibit **skewed distributions** (e.g., right-skewed, where a large number of values are concentrated on the lower end, and a long tail extends towards higher values).

* **Meeting Model Assumptions:** Several algorithms, particularly linear models and distance-based methods, tend to behave better when input features are **roughly symmetric** rather than heavily skewed. Strong skew can lead to:
    * Suboptimal model performance.
    * Difficulty for the model to capture patterns effectively from the feature.
    * Features with large ranges dominating the learning process if not scaled correctly.

* **Stabilizing Variance:** Log transformations can help to stabilize the variance of features, making patterns more consistent across the range of the data.

* **Graceful Handling of Zeroes:** I specifically used `np.log1p(x)` (which computes $\log(1+x)$) instead of `np.log(x)`. This matters because `np.log(0)` evaluates to `-inf` (with a runtime warning), which would break later steps such as scaling. Many financial features can naturally have zero values (e.g., zero `Amount_invested_monthly`, zero `Outstanding_Debt`). `log1p` maps these zeros to `0`, avoiding non-finite values and preserving data integrity.

By applying `log1p` to these potentially skewed numerical columns, we aim to make their distributions more symmetrical, which generally helps machine learning models learn more robust and accurate relationships from the data.
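
The effect on skewness can be checked with a small sketch. The income values below are synthetic (drawn from a log-normal distribution) and only stand in for a right-skewed column such as `Annual_Income`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed values, standing in for an income-like column.
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=5000))

transformed = np.log1p(income)  # computes log(1 + x) element-wise

print(f"Skewness before: {income.skew():.2f}")
print(f"Skewness after:  {transformed.skew():.2f}")

# Zero maps to zero, so zero-valued rows need no special handling.
print(np.log1p(0.0))
```

Comparing `Series.skew()` before and after is also a simple way to decide which columns actually need the transformation, rather than applying it to every non-negative column.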

In [None]:
# Transform Your data
print("\nData Transformation")

# Identify numerical columns for transformation (excluding binary/encoded and target)
# We select columns that are continuous and likely to be skewed

candidate_cols = [
    'Annual_Income', 'Num_Bank_Accounts', 'Num_Credit_Card',
    'Interest_Rate', 'Num_of_Loan', 'Delay_from_due_date',
    'Num_of_Delayed_Payment', 'Changed_Credit_Limit', 'Num_Credit_Inquiries',
    'Outstanding_Debt', 'Credit_Utilization_Ratio', 'Credit_History_Age',
    'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance', 'Age',
    'Debt_to_Income_Ratio', 'Utilization_Income_Ratio',
    'EMI_to_Salary_Ratio', 'Payment_Consistency'
]
# Keep only the candidates that survived the earlier dropping and feature selection steps
transform_cols = [col for col in candidate_cols if col in X.columns]

if not transform_cols:
    print("No numerical columns found for transformation.")
else:
    print("Applying log transformation (log1p) to skewed numerical columns:")
    for col in transform_cols:
        # Check for skewness before transforming (optional, but good practice)
        # Here we'll apply log1p as a general transformation for potentially skewed data
        if (X[col] >= 0).all(): # restrict to non-negative columns (log1p is only defined for x > -1)
            X[col] = np.log1p(X[col]) # log1p handles 0 values gracefully (log(1+x))
            print(f"  - Applied log1p transformation to '{col}'.")
        else:
            print(f"  - Skipping log1p for '{col}' due to negative values.")

print("Data transformation complete.")

### 6. Data Scaling

In [None]:
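# Scaling your data
from sklearn.preprocessing import StandardScaler  # harmless if already imported earlier in the notebook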
print("\nData Scaling (StandardScaler)")

scaler = StandardScaler()

# Apply scaling to the selected features (X)
# Ensure X contains only numerical features before scaling
X_scaled = scaler.fit_transform(X)

# Convert scaled data back to DataFrame with original column names
X = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

print("Data scaling complete. Features are now standardized.")
print(f"Scaled X head:\n{X.head()}")

##### Which method have you used to scale your data and why?

**Standard scaling** is used to scale the data.

* **Method Used:** `sklearn.preprocessing.StandardScaler`

* **Reason:**
    * **Standardization:** StandardScaler transforms the data such that it has a **mean of 0** and a **standard deviation of 1**. This process is also known as standardization.

    * **Equal Contribution:** Many machine learning algorithms (especially those based on distance calculations, like K-Nearest Neighbors, Support Vector Machines, or those using gradient descent, like neural networks and logistic regression) are sensitive to the scale and magnitude of input features. If features have vastly different ranges, features with larger values might disproportionately influence the model's objective function or distance calculations.

    * **Prevents Dominance:** Standard scaling ensures that all features contribute approximately equally to the model's performance, preventing features with larger numerical ranges from dominating those with smaller ranges.

    * **Improved Convergence:** For iterative optimization algorithms (like gradient descent used in many models), scaling can lead to faster and more stable convergence, as it puts features on a similar scale.
    
    * **Common and Effective:** Standard scaling is a widely used and highly effective preprocessing step for a broad range of machine learning algorithms.
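
A small sketch of what standardization does to two columns on very different scales is given below. The four rows are invented; the column names are borrowed from the dataset only to make the example readable.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two invented columns with very different magnitudes.
demo = pd.DataFrame({
    "Annual_Income": [20000.0, 50000.0, 80000.0, 110000.0],
    "Num_Credit_Card": [1.0, 3.0, 5.0, 7.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(demo), columns=demo.columns, index=demo.index)

print(scaled.round(3))
print("Means:", scaled.mean().round(6).tolist())
print("Std (ddof=0):", scaled.std(ddof=0).round(6).tolist())
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), so pandas' default `.std()` (which uses `ddof=1`) reports a value slightly above 1 on the scaled data. A common variation of this workflow fits the scaler on the training rows only and reuses it via `scaler.transform` on the test rows, so that test-set statistics do not influence preprocessing.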

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Based on the steps performed so far, dimensionality reduction may or may not be needed, so **PCA is included in the code as a conditional step** that runs only when many features remain.

**Why Dimensionality Reduction Might Be Needed:**

1.  **Curse of Dimensionality:** After one-hot encoding and multi-label encoding of categorical variables like `Occupation`, `Type_of_Loan`, and `Payment_Behaviour`, your dataset can end up with a **very large number of features**. A high number of features (dimensions) can lead to:
    * **Increased Computational Cost:** Models take longer to train and require more memory.
    * **Overfitting:** With too many features, models might learn noise in the training data rather than true underlying patterns, performing poorly on unseen data.
    * **Reduced Interpretability:** It becomes harder to understand which specific features are driving model predictions.
    * **Sparse Data:** In high-dimensional spaces, data points become very sparse, making it harder for algorithms to find meaningful relationships.

2.  **Multicollinearity:** While we addressed high correlation by dropping features like `Monthly_Inhand_Salary` and `Outstanding_Debt`, complex interdependencies can still exist among the many encoded binary features (e.g., different loan types or occupations). PCA can help to combine these correlated features into a smaller set of uncorrelated components.

**Why Dimensionality Reduction Might Not Be Strictly Necessary:**

1.  **Feature Selection Already Performed:** We've already used `SelectKBest` to reduce the number of features by selecting only the most statistically relevant ones. If this step sufficiently reduces the feature count (e.g., below the 30-feature threshold set in the code), then further dimensionality reduction might not be critical.
2.  **Loss of Interpretability:** PCA transforms the original features into new, uncorrelated "principal components." These components are linear combinations of the original features and are often less interpretable than the original features themselves. If understanding the direct impact of each original feature is paramount for business insights, then avoiding PCA (or using it cautiously) is preferable.
3.  **Model Robustness:** Some machine learning models (e.g., tree-based models like Random Forests or Gradient Boosting Machines) are naturally more robust to high dimensionality and multicollinearity, making PCA less critical for them compared to linear models or SVMs.

**How the Code Addresses It:**

The code incorporates PCA as a **conditional step**. It checks whether the number of features, `X.shape[1]`, is greater than 30 after feature selection; this threshold is a pragmatic heuristic rather than a tuned value. If the feature count is still high, PCA is applied with `n_components=0.95`, which keeps the smallest number of components that together explain at least 95% of the variance. If the feature count is already manageable, PCA is skipped to preserve interpretability.

In summary, dimensionality reduction is a valuable tool when facing a high number of features, but its necessity depends on the specific dataset, the chosen machine learning model, and the trade-off between computational efficiency/overfitting and interpretability. The current code provides a flexible and reasonable approach to this.

In [None]:
# Dimensionality Reduction (If needed)
from sklearn.decomposition import PCA  # harmless if already imported earlier in the notebook

print("\nDimensionality Reduction (PCA)")

# Determine if PCA is 'needed'. This is subjective and depends on feature count and performance.
# As a heuristic, PCA is applied only if more than 30 features remain after feature selection;
# otherwise it is skipped to preserve interpretability.
if X.shape[1] > 30:
    print(f"Current number of features ({X.shape[1]}) is high. Considering PCA.")
    # Choose number of components. A common approach is to explain a certain variance (e.g., 95%)
    # or select a fixed number. Let's aim for 95% variance explained for demonstration.
    pca = PCA(n_components=0.95) # Retain 95% of variance
    X_pca = pca.fit_transform(X)

    # Create new DataFrame for PCA components
    pca_cols = [f'PC_{i+1}' for i in range(X_pca.shape[1])]
    X = pd.DataFrame(X_pca, columns=pca_cols, index=X.index)

    print(f"PCA applied. Reduced dimensions from {pca.n_features_in_} to {pca.n_components_}.")
    print(f"Explained variance ratio: {pca.explained_variance_ratio_.sum():.2f}")
    print(f"New X shape after PCA: {X.shape}")
else:
    print(f"Dimensionality reduction (PCA) not applied as current number of features ({X.shape[1]}) is manageable.")


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

In the provided code, the dimensionality reduction technique used is **Principal Component Analysis (PCA)**.

* **Technique Used:** `sklearn.decomposition.PCA`

* **Reason (If Applied to the Dataset):**

    1.  **Handles Multicollinearity:** After extensive encoding (especially multi-label and one-hot encoding), datasets often end up with many features that are correlated with each other. PCA effectively transforms these correlated features into a smaller set of **uncorrelated (orthogonal) principal components**. This helps in reducing redundancy in the data.

    2.  **Variance Preservation:** PCA aims to retain as much of the original data's variance as possible in the reduced set of dimensions. By setting `n_components=0.95` in the code, we instruct PCA to find the minimum number of components that collectively explain at least 95% of the total variance in the original features. This means we compress the data while losing minimal information.

    3.  **Combats the Curse of Dimensionality:** When you have a very high number of features (as can happen after one-hot encoding), machine learning models can struggle due to the "curse of dimensionality," leading to longer training times, increased risk of overfitting, and sparser data. PCA helps mitigate these issues by projecting the data into a lower-dimensional space.

    4.  **Applicability to Numerical Data:** PCA works with numerical data. Since all our features have been transformed into numerical types (through original numerical columns, ordinal mapping, multi-label encoding, and one-hot encoding), PCA is directly applicable.
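
How `n_components=0.95` behaves on correlated features can be sketched with synthetic data. The nine columns below are constructed from three underlying signals purely for illustration; they are not taken from the project dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three independent signals, plus six columns that are noisy mixtures of them.
base = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 6))
noise = rng.normal(scale=0.05, size=(500, 6))
X_demo = np.hstack([base, base @ mixing + noise])  # 9 strongly correlated columns

# PCA is scale-sensitive, so standardize first (as the notebook does).
X_demo = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) asks for the fewest components reaching that variance share.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_demo)

print(f"{pca.n_features_in_} features -> {pca.n_components_} components")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.3f}")
```

Because the nine columns carry only about three independent directions of variation, far fewer components are needed to reach the 95% threshold than there are input features.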


### 8. Data Splitting

In [None]:
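# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split  # harmless if already imported earlier in the notebook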
print("\nData Splitting (Train/Test Split)")

# Define features (X) and target (y) again, ensuring they reflect the latest transformations
# X and y are already defined and updated in previous steps

# Splitting ratio: 80% training, 20% testing is a common and wise choice.
# stratify=y is crucial for imbalanced datasets to maintain target distribution in both sets.
# random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Data split into training and testing sets:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

print("\nCredit Score distribution in original data:")
print(y.value_counts(normalize=True))

print("\nCredit Score distribution in training set:")
print(y_train.value_counts(normalize=True))

print("\nCredit Score distribution in test set:")
print(y_test.value_counts(normalize=True))

print("Data splitting complete (stratified).")

##### What data splitting ratio have you used and why?

An **80/20 data splitting ratio** was used.

* **Ratio Used:**

    * **80% for the training set (`train_size=0.8` or implied by `test_size=0.2`)**

    * **20% for the testing set (`test_size=0.2`)**

* **Why:**
   
    1.  **Common Practice:** An 80/20 split (or similar variations like 70/30, 75/25) is a widely accepted and common practice in machine learning. It strikes a good balance between:
        
        * **Sufficient Training Data:** Providing enough data for the model to learn complex patterns and relationships effectively. With 80% of 100,000 records, the model gets 80,000 samples for training.
        
        * **Representative Test Data:** Reserving a reasonably sized, unseen portion (20%, or 20,000 records) to evaluate the model's generalization performance on new data. This helps in assessing how well the model will perform in the real world.

    2.  **Dataset Size:** For a dataset of 100,000 records, an 80/20 split provides ample data for both training and testing, making it statistically robust. Smaller datasets might warrant different strategies (e.g., k-fold cross-validation).

    3.  **Stratified Sampling (`stratify=y`):** This is a crucial addition to the splitting.
        
        * **Why used with `stratify`:** The `Credit_Score` target variable is imbalanced (the class proportions are printed by the splitting cell above). `stratify=y` ensures that the proportion of each `Credit_Score` class (Poor, Standard, Good) is approximately the same in both the training and testing sets as it is in the original dataset. This prevents a situation where, by random chance, one set has significantly more or fewer samples of a particular class, which would lead to biased training or an unreliable evaluation of the model's performance on minority classes.
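
The effect of `stratify=y` can be illustrated without scikit-learn. The following is a minimal sketch, not the implementation behind `train_test_split`: the helper `stratified_split` and the toy label list are hypothetical, and the code simply splits each class separately so that both subsets keep the original class proportions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) that keep per-class proportions (illustrative helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same share of every class goes to test
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Made-up target with a 30/50/20 mix of Poor (0), Standard (1), Good (2)
toy_labels = [0] * 300 + [1] * 500 + [2] * 200
train_idx, test_idx = stratified_split(toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both subsets keep the 30/50/20 mix of the toy data, which is the property the stratified split in this notebook relies on.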

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, based on the nature of the `Credit_Score` distribution, **the dataset is indeed considered imbalanced.**

1.  **Uneven Class Distribution:** The classes of the target variable `Credit_Score` (0: Poor, 1: Standard, 2: Good) are **not equally represented**. The pair plot already hinted at this, with a visibly higher density of 'Standard' scores than 'Poor' or 'Good' scores, and the `.value_counts(normalize=True)` output printed in the data-splitting step shows the exact proportions. 'Standard' is the majority class, while 'Poor' and 'Good' are minorities, which is a common characteristic of credit scoring datasets.

2.  **Impact on Model Training:** When a dataset is imbalanced, machine learning models tend to be **biased towards the majority class**.

    * They might achieve high overall accuracy by simply predicting the majority class for most samples.

    * However, their performance on the minority classes (which are often the most critical ones, e.g., correctly identifying 'Poor' credit scores to prevent defaults, or 'Good' scores for targeted offers) will be poor.
    * This leads to models that don't generalize well to the real world where minority class instances do occur and need to be correctly identified.

Therefore, addressing this imbalance is a crucial step in preparing the data for effective model training, which is why the code includes the SMOTE (Synthetic Minority Over-sampling Technique) step.
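
To make the accuracy argument concrete, here is a small sketch with made-up class counts (not the project's actual figures). It evaluates a hypothetical "model" that always predicts the majority class and shows that its accuracy looks acceptable while its recall on both minority classes is zero.

```python
from collections import Counter

# Hypothetical class counts (0: Poor, 1: Standard, 2: Good); not the project's actual figures
y_true = [0] * 290 + [1] * 530 + [2] * 180

counts = Counter(y_true)
majority_class = counts.most_common(1)[0][0]

# A "model" that ignores its inputs and always predicts the majority class
y_pred = [majority_class] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_per_class = {
    cls: sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls) / n
    for cls, n in counts.items()
}
print(f"accuracy: {accuracy:.2f}")
print("recall per class:", recall_per_class)
```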

In [None]:
from imblearn.over_sampling import SMOTE

# Handling Imbalanced Dataset (If needed)
print("\nHandling Imbalanced Dataset")

# Check the current distribution of the target variable in the training set
print("Credit Score distribution in training set BEFORE imbalance handling:")
print(y_train.value_counts())
print(y_train.value_counts(normalize=True))

# Apply SMOTE to balance the training set
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("\nImbalance handling complete (SMOTE applied to training data).")
print("Credit Score distribution in training set AFTER imbalance handling:")
print(y_train_resampled.value_counts())
print(y_train_resampled.value_counts(normalize=True))

print("Imbalanced dataset handling complete.")


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

The imbalance is handled with **SMOTE (Synthetic Minority Over-sampling Technique)**.

* **Technique Used:** `imblearn.over_sampling.SMOTE`

* **Why:**

    1.  **Addresses Imbalance by Oversampling:** SMOTE works by creating synthetic (artificial) samples for the minority class(es). Instead of simply duplicating existing minority samples (which can lead to overfitting), SMOTE generates new, synthetic samples that are "similar" to the existing minority samples but not identical. It does this by picking a sample from the minority class and then considering its k-nearest neighbors. It then creates new synthetic samples at random points between the chosen sample and its neighbors.

    2.  **Prevents Overfitting from Simple Duplication:** If we were to just duplicate the minority class samples, the model might overfit to those exact duplicated instances. SMOTE's synthetic generation helps to create a more generalized representation of the minority class, reducing this risk.

    3.  **Does Not Discard Data:** Unlike undersampling techniques (which reduce the majority class), SMOTE retains all information from the majority class, which can be important if the majority class holds valuable patterns.

    4.  **Applied Only to Training Data:** Critically, SMOTE is applied *only* to the `X_train` and `y_train` sets. This is a standard and essential practice to prevent **data leakage**. If SMOTE were applied before the train-test split, the synthetic samples generated in the training set might be based on information from the test set, leading to an artificially inflated (and misleading) model performance during evaluation.

    5.  **Balances Class Distribution:** The primary goal is to balance the class distribution in the training set, allowing the machine learning model to learn from a more representative proportion of all classes. This helps the model become less biased towards the majority class and improves its ability to correctly classify minority classes.
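
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not `imblearn`'s implementation: the helper `smote_like_samples`, the value `k=3`, and the two-feature points are all hypothetical.

```python
import math
import random

def smote_like_samples(minority, n_new, k=3, seed=42):
    """Generate n_new synthetic points by interpolating between minority neighbours.

    Illustrative only: imblearn's SMOTE adds more machinery on top of this idea.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding the point itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Made-up two-feature minority samples
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7)]
new_points = smote_like_samples(minority_points, n_new=4)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Every synthetic point lies on a line segment between two real minority samples, so it is similar to the existing data without being an exact duplicate.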

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# @title Logistic Regression

# ML Model - 1 Implementation

print("--- Implementing Logistic Regression ---")

# Initialize the Logistic Regression model
# max_iter is increased to help convergence with many (one-hot encoded) features.
# solver='liblinear' supports L1/L2 regularization and fits multi-class problems one-vs-rest.
logistic_model = LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')

print("Logistic Regression model initialized.")



# Fit the Algorithm
print("\n--- Fitting Logistic Regression Model ---")

# Fit the model on the resampled training data
logistic_model.fit(X_train_resampled, y_train_resampled)

print("Logistic Regression model fitted successfully to the training data.")



# Predict on the Model
print("\n--- Making Predictions with Logistic Regression ---")

# Predict classes
y_pred_logistic = logistic_model.predict(X_test)

# Predict probabilities (needed for ROC AUC)
y_pred_proba_logistic = logistic_model.predict_proba(X_test)

print("Predictions made on the test set.")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_curve,
    auc
)
from sklearn.preprocessing import label_binarize

# --- Evaluating Logistic Regression Model ---
print("\n--- Evaluating Logistic Regression Model ---")

# Define class labels
credit_score_labels = ['Poor', 'Standard', 'Good']
n_classes = len(credit_score_labels)

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_logistic, target_names=credit_score_labels))

# Confusion Matrix
conf_matrix_logistic = confusion_matrix(y_test, y_pred_logistic)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_logistic, annot=True, fmt='d', cmap='Blues',
            xticklabels=credit_score_labels,
            yticklabels=credit_score_labels)
plt.title('Confusion Matrix for Logistic Regression')
plt.xlabel('Predicted Credit Score')
plt.ylabel('True Credit Score')
plt.show()

# Overall Accuracy Score
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
print(f"\nOverall Accuracy: {accuracy_logistic:.4f}")

# ROC AUC Curve (One-vs-Rest)
print("\nGenerating Multi-class ROC AUC Curve (One-vs-Rest)...")

# Binarize the true labels
y_test_binarized = label_binarize(y_test, classes=[0, 1, 2])

# Plot ROC curve for each class
plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green']  # Colors for classes: Poor, Standard, Good

for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_test_binarized[:, i], y_pred_proba_logistic[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=colors[i], lw=2,
             label=f'ROC curve of class {credit_score_labels[i]} (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Logistic Regression (One-vs-Rest)')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

print("\nLogistic Regression model evaluation complete with visualizations.")


**Inference for Base Logistic Regression Model**

**Overall Performance:**
The model achieved an **Overall Accuracy of 0.6356 (63.56%)**. While this might seem moderate, it's crucial to look beyond overall accuracy, especially in imbalanced datasets.

**Detailed Performance by Class (from Classification Report and Confusion Matrix):**

* **Class 'Standard' (Credit Score = 1):**
    * **Best Performance:** The model performs best on the 'Standard' class.
        * **Precision (0.70):** When the model predicts 'Standard', it's correct 70% of the time.
        * **Recall (0.64):** It correctly identifies 64% of all actual 'Standard' cases.
        * **F1-score (0.67):** A good balance between precision and recall for this class.
    * **Reason:** This is expected, as 'Standard' is likely the majority class, and models tend to learn patterns of the majority class more effectively if not specifically tuned for imbalance (even with initial SMOTE, base models might still lean towards it). The confusion matrix shows 5468 'Standard' cases correctly predicted as 'Standard'.

* **Class 'Poor' (Credit Score = 0):**
    * **Moderate Performance:**
        * **Precision (0.57):** When predicting 'Poor', it's correct 57% of the time.
        * **Recall (0.71):** It's good at *identifying* 'Poor' cases (71% of actual 'Poor' cases are caught). This is a relatively high recall, which is important for identifying high-risk customers.
        * **F1-score (0.63):** A decent balance.
    * **Reason:** The high recall for 'Poor' is a positive sign, indicating the model is fairly good at capturing truly 'Poor' credit scores. However, the precision suggests that when it *says* someone is 'Poor', there's still a significant chance (43%) they are not. From the confusion matrix, out of 5799 actual 'Poor' cases (4240 + 679 + 880), 4240 were correctly classified, 679 were misclassified as 'Standard', and 880 as 'Good'.

* **Class 'Good' (Credit Score = 2):**
    * **Weakest Performance:**
        * **Precision (0.50):** Only 50% of the time that the model predicts 'Good', it's actually correct.
        * **Recall (0.84):** It's excellent at *identifying* actual 'Good' cases (84% recalled!). This is surprisingly high given the likely minority status.
        * **F1-score (0.63):** This F1-score is driven up by high recall, but the low precision indicates many false positives for this class.
    * **Reason:** The very high recall but low precision for 'Good' suggests that the model is being overly optimistic and classifying many 'Poor' or 'Standard' customers as 'Good' (false positives). This is visible in the confusion matrix: 3004 of the 3566 actual 'Good' cases are correctly classified, but a large number of 'Poor' (880) and 'Standard' (2141) cases are also predicted as 'Good'. This over-prediction of the 'Good' class might be a side effect of the base model's default behavior or of its interaction with SMOTE without further hyperparameter tuning.
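
The precision, recall, and F1-score reported for 'Good' follow directly from the confusion-matrix counts quoted above. A minimal check, using only those counts (3004 true positives, 880 + 2141 false positives, 3566 actual 'Good' cases); the variable names are illustrative:

```python
# Counts for the 'Good' class taken from the base-model confusion matrix discussed above
true_positives = 3004             # actual 'Good' predicted as 'Good'
false_positives = 880 + 2141      # 'Poor' and 'Standard' predicted as 'Good'
actual_good = 3566                # all actual 'Good' cases in the test set
false_negatives = actual_good - true_positives

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * true_positives / (2 * true_positives + false_positives + false_negatives)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.50 recall=0.84 f1=0.63
```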

**ROC AUC Curves:**

* **Class 'Poor' (Area = 0.80):** A good AUC score, indicating the model has a strong ability to distinguish 'Poor' cases from others.
* **Class 'Standard' (Area = 0.74):** A decent AUC, showing moderate discriminative power for the 'Standard' class.
* **Class 'Good' (Area = 0.67):** The lowest AUC, indicating the weakest ability to distinguish 'Good' cases from others, supporting the low precision observed for this class. The curve for 'Good' is closer to the random classifier line compared to 'Poor' and 'Standard'.
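
One way to read these AUC values: in a one-vs-rest setting, the AUC for a class equals the probability that a randomly chosen sample of that class receives a higher predicted probability than a randomly chosen sample of any other class. The sketch below computes this directly on made-up scores; `auc_from_scores` is a hypothetical helper, not the `sklearn.metrics.auc` call used in the notebook.

```python
def auc_from_scores(y_true_binary, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true_binary, scores) if y == 1]
    negatives = [s for y, s in zip(y_true_binary, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up one-vs-rest example: 1 = 'Poor', 0 = any other class
is_poor = [1, 1, 1, 0, 0, 0, 0]
poor_probability = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

auc_poor = auc_from_scores(is_poor, poor_probability)
print(f"AUC for 'Poor' vs rest: {auc_poor:.3f}")
```

An AUC of 0.5 corresponds to the random-classifier diagonal in the plot, and 1.0 to a perfect ranking.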


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import label_binarize
from scipy.stats import uniform
import matplotlib.pyplot as plt
import seaborn as sns

# --- Implementing Logistic Regression with RandomizedSearchCV ---
print("--- Implementing Logistic Regression with RandomizedSearchCV ---")

# Define the base logistic regression model
base_logistic_model = LogisticRegression(max_iter=10000, random_state=42)

# Define the hyperparameter search space
param_distributions = {
    'C': uniform(loc=0.01, scale=100),               # Regularization strength
    'penalty': ['l1', 'l2'],                         # Regularization type
    'solver': ['liblinear', 'saga'],                 # Solvers that support both l1 and l2
    'class_weight': [None, 'balanced']               # Handle imbalance (optional with SMOTE)
}

# Initialize RandomizedSearchCV
random_search_logistic = RandomizedSearchCV(
    estimator=base_logistic_model,
    param_distributions=param_distributions,
    n_iter=20,
    cv=2,
    scoring='f1_weighted',
    random_state=42,
    n_jobs=-1,
    verbose=2
)

print("RandomizedSearchCV setup for Logistic Regression optimization.")

# --- Fit the model using RandomizedSearchCV ---
print("\n--- Fitting Logistic Regression Model with RandomizedSearchCV ---")
random_search_logistic.fit(X_train_resampled, y_train_resampled)

# Retrieve the best model from RandomizedSearchCV
best_logistic_model = random_search_logistic.best_estimator_

print("\nBest hyperparameters found:")
print(random_search_logistic.best_params_)
print("\nOptimized Logistic Regression model fitted successfully.")

# --- Make predictions using the best model ---
print("\n--- Making Predictions with Optimized Logistic Regression ---")
y_pred_logistic = best_logistic_model.predict(X_test)
y_pred_proba_logistic = best_logistic_model.predict_proba(X_test)
print("Predictions made on the test set using the optimized model.")

# --- Evaluate the Model ---
print("\n--- Evaluating Optimized Logistic Regression Model ---")

# Define class labels
credit_score_labels = ['Poor', 'Standard', 'Good']
n_classes = len(credit_score_labels)

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_logistic, target_names=credit_score_labels))

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred_logistic)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=credit_score_labels,
            yticklabels=credit_score_labels)
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred_logistic)
print(f"\nOverall Accuracy: {accuracy:.4f}")

# --- ROC AUC Curve (One-vs-Rest) ---
print("\nGenerating Multi-class ROC AUC Curve (One-vs-Rest)...")

# Binarize true labels
y_test_binarized = label_binarize(y_test, classes=[0, 1, 2])

# ROC Curve Plot
plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green']

for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_test_binarized[:, i], y_pred_proba_logistic[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=colors[i], lw=2,
             label=f'ROC curve of class {credit_score_labels[i]} (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Logistic Regression (One-vs-Rest)')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

print("\nOptimized Logistic Regression evaluation complete.")


**Inference for Optimized Logistic Regression Model**

**Best Hyperparameters Found:** The best combination is printed by the cell above (`random_search_logistic.best_params_`); this inference focuses on the change in performance metrics. The search tuned `C`, `penalty`, `solver`, and `class_weight` to maximize the `f1_weighted` score.

**Overall Performance:**
The **Overall Accuracy increased slightly to 0.6434 (64.34%)** from 0.6356. This modest increase suggests that hyperparameter tuning had some positive impact on overall correctness.

**Detailed Performance by Class (from Classification Report and Confusion Matrix):**

* **Class 'Standard' (Credit Score = 1):**
    * **Precision (0.70):** Remains consistent. When the model predicts 'Standard', it's correct 70% of the time.
    * **Recall (0.63):** Slightly decreased from 0.64. It correctly identifies 63% of all actual 'Standard' cases.
    * **F1-score (0.66):** A minor drop from 0.67.
    * **Confusion Matrix:** 5740 true 'Standard' predictions. Misclassifications of 'Standard' into 'Poor' (2825) and 'Good' (2070) are still significant but show some slight changes compared to the base model.

* **Class 'Poor' (Credit Score = 0):**
    * **Precision (0.58):** Improved from 0.57. When the model predicts 'Poor', it's correct 58% of the time, a small but positive gain.
    * **Recall (0.72):** Also slightly improved from 0.71. It correctly identifies 72% of all actual 'Poor' cases.
    * **F1-score (0.64):** Improved from 0.63.
    * **Confusion Matrix:** 4153 'Poor' cases correctly predicted (4240 in the base model). The numbers of 'Poor' cases misclassified as 'Standard' (779, up from 679) and as 'Good' (867, down from 880) have shifted slightly, so the per-class gains for 'Poor' are marginal at best.

* **Class 'Good' (Credit Score = 2):**
    * **Significant Improvement in Precision!**
        * **Precision (0.58):** This is a notable improvement from 0.50 in the base model. When the optimized model predicts 'Good', it's now correct 58% of the time, reducing the number of false positives for the 'Good' class.
        * **Recall (0.83):** Remains high at 83% (very slight decrease from 0.84). It's still excellent at recalling actual 'Good' cases.
        * **F1-score (0.68):** A significant increase from 0.63, indicating a much better balance between precision and recall for this class.
    * **Confusion Matrix:** 2975 actual 'Good' cases correctly predicted. The crucial change here is the reduction in misclassifications from 'Standard' to 'Good' (from 2141 to 2070) and from 'Poor' to 'Good' (from 880 to 867). While still present, the improvement in precision for 'Good' suggests that the model is making fewer overly optimistic predictions.

**ROC AUC Curves:**

* **Class 'Poor' (Area = 0.81):** Increased from 0.80, indicating slightly better discriminatory power.
* **Class 'Standard' (Area = 0.75):** Increased from 0.74, showing a marginal improvement.
* **Class 'Good' (Area = 0.68):** Increased from 0.67, confirming the improved discrimination for this challenging class. The curve is marginally further from the random classifier line.

**Summary and Business Impact of Optimization:**

The hyperparameter optimization had a positive impact on the Logistic Regression model, particularly in improving the performance on the 'Good' credit score class.

* **Positive Impact:**
    * **Reduced False Positives for 'Good':** The most significant gain is in the precision for the 'Good' class. This means Paisabazaar will be more accurate when identifying genuinely creditworthy customers. This directly translates to:
        * **Lower Risk:** Fewer loans will be approved for individuals who were wrongly identified as 'Good'.
        * **Better Targeting:** More precise identification of 'Good' customers allows for more effective targeting of premium products or lower interest rates, retaining valuable clients.
    * **Slight Improvements Across the Board:** While subtle, the improvements in AUC and F1-scores for 'Poor' and 'Standard' classes indicate a generally more robust model across all credit categories.
* **Continued Challenge:** Despite the improvements, misclassifications still occur. The model still misclassifies a notable number of 'Standard' customers as 'Poor' (2825) and 'Good' (2070), and similarly for 'Poor' (779 as Standard, 867 as Good). This suggests there's still room for improvement, possibly by exploring more complex models or additional feature engineering.

In conclusion, the optimization successfully refined the Logistic Regression model, making its predictions, especially for the 'Good' credit score, more reliable. This is a step in the right direction for Paisabazaar to make more informed and less risky lending decisions.

##### Which hyperparameter optimization technique have you used and why?

**Randomized Search Cross-Validation (`RandomizedSearchCV`)** was used as the hyperparameter optimization technique. It explores a defined range or distribution of hyperparameter values by **randomly sampling** a fixed number of combinations.

* **Why:**
    
    1.  **Efficiency over Exhaustive Search:** While `GridSearchCV` is thorough, it can become computationally very expensive and time-consuming, especially with multiple hyperparameters and a wide range of values. `RandomizedSearchCV` allows us to control the computational budget by specifying `n_iter` (the number of random combinations to try). This means we can explore a wider search space more efficiently within a given time constraint.
    
    2.  **Effectiveness in Finding Good Solutions:** Research and practice have shown that for many problems, `RandomizedSearchCV` can find hyperparameters that perform as well as (or sometimes better than) those found by `GridSearchCV` in the same or less time. This is because usually only a few hyperparameters strongly affect performance, and random sampling tries many more distinct values of each of them (for example, of the continuous `C`) than a fixed grid with the same number of fits.
    
    3.  **Cross-Validation for Robustness:** Like `GridSearchCV`, `RandomizedSearchCV` incorporates cross-validation (`cv=2` in the code). Each sampled hyperparameter combination is therefore evaluated on more than one fold of the training data, which gives a more reliable estimate of its generalization ability than a single validation set and reduces the risk of overfitting to it.
    
    4.  **Practicality for Initial Tuning:** It's a practical and effective method for the initial phase of hyperparameter tuning, helping to quickly narrow down promising regions of the hyperparameter space.
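
The sampling step can be sketched with the standard library alone. This is a hypothetical stand-in for what `RandomizedSearchCV` does before fitting, not its actual implementation: `sample_candidates` is an illustrative helper, and the 10-value grid for `C` is an assumed comparison point. The search space mirrors the `param_distributions` used above, where `uniform(loc=0.01, scale=100)` covers the interval from 0.01 to 100.01.

```python
import random

def sample_candidates(n_iter=20, seed=42):
    """Draw n_iter random hyperparameter combinations (illustrative stand-in for RandomizedSearchCV)."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        candidates.append({
            # scipy's uniform(loc=0.01, scale=100) covers [0.01, 100.01]
            "C": rng.uniform(0.01, 100.01),
            "penalty": rng.choice(["l1", "l2"]),
            "solver": rng.choice(["liblinear", "saga"]),
            "class_weight": rng.choice([None, "balanced"]),
        })
    return candidates

candidates = sample_candidates()

# An exhaustive grid over the three categorical options with only 10 values of C (assumed grid)
grid_size = 10 * 2 * 2 * 2
print(f"random search fits {len(candidates)} candidates per fold; "
      f"a 10-value grid over C would need {grid_size}")
print("first candidate:", candidates[0])
```

With `n_iter=20` the budget is fixed in advance, whereas the grid grows multiplicatively with every added hyperparameter or value.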

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes, based on the "Comparison of Base vs. Optimized Logistic Regression Models" chart and the accompanying metrics, **there has been an improvement** after hyperparameter optimization.



| Metric                | Base Logistic Regression | Optimized Logistic Regression | Change     |
| :-------------------- | :----------------------- | :---------------------------- | :--------- |
| **Accuracy** | 0.6356                   | **0.6434** | **+0.0078** |
| Precision (Macro Avg) | 0.6358                   | **0.6371** | **+0.0013** |
| Recall (Macro Avg)    | 0.6959                   | **0.6967** | **+0.0008** |
| F1-Score (Macro Avg)  | 0.6346                   | **0.6408** | **+0.0062** |
| F1-Score (Weighted Avg)| 0.6361                   | **0.6453** | **+0.0092** |

**Key Observations:**

* **Overall Accuracy:** There's a small but positive increase in overall accuracy from **0.6356 to 0.6434**.

* **F1-Score (Weighted Avg):** This is an important metric here, as it combines precision and recall for each class and weights the per-class scores by class support. It shows the largest improvement in the table, increasing from **0.6361 to 0.6453**. This indicates better balanced performance across all classes, considering their proportions.

* **F1-Score (Macro Avg):** Also shows a positive increase from **0.6346 to 0.6408**, suggesting improved balance across classes when considering them equally important (unweighted average).

* **Precision (Macro Avg) and Recall (Macro Avg):** Both also show slight positive shifts, contributing to the F1-score improvements.
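
The difference between the macro and weighted averages in the table can be seen in a short calculation. The per-class scores and supports below are hypothetical, chosen only to show how the two averages diverge when classes differ in size.

```python
# Hypothetical per-class F1 scores and supports (not the project's reported values)
f1_per_class = {"Poor": 0.60, "Standard": 0.70, "Good": 0.50}
support = {"Poor": 300, "Standard": 500, "Good": 200}

# Macro average: every class counts equally
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: each class counts in proportion to its number of test samples
weighted_f1 = sum(f1_per_class[c] * support[c] for c in f1_per_class) / sum(support.values())

print(f"macro F1:    {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

Because the majority class has the highest score in this toy example, the weighted average ends up above the macro average.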


In [None]:
# @title Compare Results of Base vs. Optimized Logistic Regression Models
print("--- Comparing Base Logistic Regression vs. Optimized Logistic Regression ---")

# Define target names/labels for clear reporting
credit_score_labels = ['Poor', 'Standard', 'Good']

# --- Metrics for Base Logistic Regression Model ---
# y_pred_logistic was overwritten by the tuning cell, so recompute the base predictions here
y_pred_logistic_base = logistic_model.predict(X_test)
report_base = classification_report(y_test, y_pred_logistic_base, output_dict=True)
accuracy_base = accuracy_score(y_test, y_pred_logistic_base)
precision_macro_base = report_base['macro avg']['precision']
recall_macro_base = report_base['macro avg']['recall']
f1_macro_base = report_base['macro avg']['f1-score']
f1_weighted_base = report_base['weighted avg']['f1-score']


# --- Metrics for Optimized Logistic Regression Model ---
# Predictions from the best estimator found by RandomizedSearchCV
y_pred_logistic_optimized = best_logistic_model.predict(X_test)
report_optimized = classification_report(y_test, y_pred_logistic_optimized, output_dict=True)
accuracy_optimized = accuracy_score(y_test, y_pred_logistic_optimized)
precision_macro_optimized = report_optimized['macro avg']['precision']
recall_macro_optimized = report_optimized['macro avg']['recall']
f1_macro_optimized = report_optimized['macro avg']['f1-score']
f1_weighted_optimized = report_optimized['weighted avg']['f1-score']


# Create a DataFrame for Comparison
comparison_data = {
    'Metric': ['Accuracy', 'Precision (Macro Avg)', 'Recall (Macro Avg)', 'F1-Score (Macro Avg)', 'F1-Score (Weighted Avg)'],
    'Base Logistic Regression': [accuracy_base, precision_macro_base, recall_macro_base, f1_macro_base, f1_weighted_base],
    'Optimized Logistic Regression': [accuracy_optimized, precision_macro_optimized, recall_macro_optimized, f1_macro_optimized, f1_weighted_optimized]
}
comparison_df = pd.DataFrame(comparison_data)

print("\n--- Model Performance Comparison ---")
print(comparison_df.to_string(index=False, float_format="%.4f"))


# Visualizing the Comparison
print("\n--- Visualizing Model Performance Comparison ---")
metrics_to_plot = ['Accuracy', 'F1-Score (Weighted Avg)'] # Key metrics for visualization

plot_df = comparison_df[comparison_df['Metric'].isin(metrics_to_plot)].set_index('Metric').T
plot_df.index.name = 'Model'

# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
plot_df.plot(kind='bar', figsize=(10, 6), colormap='viridis')
plt.title('Comparison of Base vs. Optimized Logistic Regression Performance', fontsize=16)
plt.ylabel('Score', fontsize=12)
plt.xticks(rotation=0)
plt.ylim([0, 1]) # Scores are typically between 0 and 1
plt.legend(title='Metric', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

print("\nComparison and visualization complete.")

### ML Model - 2

In [None]:
# Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
from scipy.stats import uniform

# X and y are already defined and updated in previous steps
# Split the data (stratified, consistent with the earlier split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# --- Base Logistic Regression Model ---
# With solver='lbfgs', multi-class targets are fitted with the multinomial loss by default
logistic_clf = LogisticRegression(solver='lbfgs', max_iter=500)
logistic_clf.fit(X_train, y_train)
y_pred_logistic = logistic_clf.predict(X_test)

# --- Optimized Logistic Regression using RandomizedSearchCV ---
param_dist = {
    'C': uniform(loc=0.01, scale=10),
    'penalty': ['l2'],  # 'l1' not supported by 'lbfgs'
    'solver': ['lbfgs'],
    'max_iter': [500]
}

random_search = RandomizedSearchCV(LogisticRegression(),  # lbfgs (from param_dist) is multinomial by default
                                   param_distributions=param_dist,
                                   n_iter=20, cv=5, random_state=42, n_jobs=-1)
random_search.fit(X_train, y_train)
logistic_clf_optimized = random_search.best_estimator_
y_pred_logistic_optimized = logistic_clf_optimized.predict(X_test)

# --- Metrics for Base Model ---
report_base = classification_report(y_test, y_pred_logistic, output_dict=True)
accuracy_base = accuracy_score(y_test, y_pred_logistic)
precision_macro_base = report_base['macro avg']['precision']
recall_macro_base = report_base['macro avg']['recall']
f1_macro_base = report_base['macro avg']['f1-score']
f1_weighted_base = report_base['weighted avg']['f1-score']

# --- Metrics for Optimized Model ---
report_optimized = classification_report(y_test, y_pred_logistic_optimized, output_dict=True)
accuracy_optimized = accuracy_score(y_test, y_pred_logistic_optimized)
precision_macro_optimized = report_optimized['macro avg']['precision']
recall_macro_optimized = report_optimized['macro avg']['recall']
f1_macro_optimized = report_optimized['macro avg']['f1-score']
f1_weighted_optimized = report_optimized['weighted avg']['f1-score']

# --- Create Comparison DataFrame ---
comparison_data = {
    'Metric': ['Accuracy', 'Precision (Macro Avg)', 'Recall (Macro Avg)', 'F1-Score (Macro Avg)', 'F1-Score (Weighted Avg)'],
    'Base Logistic Regression': [accuracy_base, precision_macro_base, recall_macro_base, f1_macro_base, f1_weighted_base],
    'Optimized Logistic Regression': [accuracy_optimized, precision_macro_optimized, recall_macro_optimized, f1_macro_optimized, f1_weighted_optimized]
}
comparison_df = pd.DataFrame(comparison_data)

# --- Print Comparison Table ---
print("\n--- Model Performance Comparison ---")
print(comparison_df.to_string(index=False, float_format="%.4f"))

# --- Visualize Comparison ---
print("\n--- Visualizing Model Performance Comparison ---")
metrics_to_plot = ['Accuracy', 'F1-Score (Weighted Avg)']
plot_df = comparison_df[comparison_df['Metric'].isin(metrics_to_plot)].set_index('Metric').T
plot_df.index.name = 'Model'

# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
plot_df.plot(kind='bar', figsize=(10, 6), colormap='viridis')
plt.title('Comparison of Base vs. Optimized Logistic Regression Performance', fontsize=16)
plt.ylabel('Score')
plt.xticks(rotation=0)
plt.ylim([0, 1])
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend(title='Metric', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

print("\nComparison and visualization complete.")


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
import seaborn as sns

# Define labels
credit_score_labels = ['Poor', 'Standard', 'Good']
n_classes = len(credit_score_labels)

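# Train a baseline Random Forest with default hyperparameters
# Note: this fits on X_train/y_train, while the tuned models below are fit on X_train_resampled/y_train_resampled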
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)

# Predict
y_pred_rf = rf_clf.predict(X_test)
y_pred_proba_rf = rf_clf.predict_proba(X_test)

# Classification Report
print("\n--- Evaluating Random Forest Classifier Model ---")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=credit_score_labels))

# Confusion Matrix
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_rf, annot=True, fmt='d', cmap='Blues',
            xticklabels=credit_score_labels,
            yticklabels=credit_score_labels)
plt.title('Confusion Matrix for Random Forest Classifier')
plt.xlabel('Predicted Credit Score')
plt.ylabel('True Credit Score')
plt.show()

# Overall Accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"\nOverall Accuracy: {accuracy_rf:.4f}")

# ROC AUC Curve (One-vs-Rest)
print("\nGenerating Multi-class ROC AUC Curve (One-vs-Rest)...")
y_test_binarized = label_binarize(y_test, classes=[0, 1, 2])

plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green']

for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_test_binarized[:, i], y_pred_proba_rf[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=colors[i], lw=2,
             label=f'ROC curve of class {credit_score_labels[i]} (area = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Random Forest Classifier')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

print("\nRandom Forest Classifier model evaluation complete with visualizations.")


**Inference for Base Random Forest Classifier Model**

**Overall Performance:**
The model achieved an impressive **Overall Accuracy of 0.8045 (80.45%)**. This is significantly higher than the Logistic Regression base model (around 63.56%), indicating Random Forest's inherent ability to capture more complex patterns.

**Detailed Performance by Class (from Classification Report and Confusion Matrix):**

* **Class 'Poor' (Credit Score = 0):**
    * **Precision (0.78):** When the model predicts 'Poor', it's correct 78% of the time. This is a very good precision, meaning fewer false positives (classifying non-poor as poor).
    * **Recall (0.84):** It correctly identifies 84% of all actual 'Poor' cases. This high recall is excellent for identifying high-risk customers, minimizing missed defaults.
    * **F1-score (0.81):** A strong F1-score, indicating a very good balance between precision and recall for this critical class.
    * **Confusion Matrix:** Out of 5799 actual 'Poor' cases (4841 + 915 + 43), 4841 were correctly classified. Only 915 were misclassified as 'Standard' and a very low 43 as 'Good'. This shows the model is highly effective at identifying true 'Poor' cases, and that the costliest error (a 'Poor' customer predicted as 'Good') is rare.

* **Class 'Standard' (Credit Score = 1):**
    * **Precision (0.81):** When the model predicts 'Standard', it's correct 81% of the time. This is also quite good.
    * **Recall (0.81):** It correctly identifies 81% of all actual 'Standard' cases.
    * **F1-score (0.81):** An excellent F1-score, showing balanced and strong performance for the majority class.
    * **Confusion Matrix:** 8360 'Standard' cases correctly predicted. While there are misclassifications to 'Poor' (1322) and 'Good' (953), the numbers are lower proportionally compared to Logistic Regression.

* **Class 'Good' (Credit Score = 2):**
    * **Precision (0.78):** A substantial improvement from the Logistic Regression's 0.50. When the model predicts 'Good', it's now correct 78% of the time. This means significantly fewer false positives for the 'Good' class, which is crucial for business (less risk of giving favorable terms to undeserving customers).
    * **Recall (0.81):** Remains high at 81%, meaning it's still very good at identifying actual 'Good' cases.
    * **F1-score (0.79):** A strong F1-score, showing a much better balance between precision and recall for this class compared to Logistic Regression.
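    * **Confusion Matrix:** Of the 3566 actual 'Good' cases, 2890 were correctly predicted, while 20 were misclassified as 'Poor' and 656 as 'Standard' (2890 / 3566 ≈ 0.81, matching the recall). In the other direction, 43 'Poor' and 953 'Standard' cases were predicted as 'Good', as listed under those classes above.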

**ROC AUC Curves:**

* **Class 'Poor' (Area = 0.94):** An excellent AUC score, indicating very strong discriminatory power for the 'Poor' class.
* **Class 'Standard' (Area = 0.87):** A strong AUC, showing good discriminatory power for the 'Standard' class.
* **Class 'Good' (Area = 0.90):** A significantly improved AUC, demonstrating much better discriminatory ability for the 'Good' class compared to Logistic Regression. All curves are well above the random classifier line.
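A one-vs-rest AUC can be read as the probability that a randomly chosen member of the class receives a higher predicted probability for that class than a randomly chosen non-member. The sketch below computes that quantity directly on made-up data, as a way to see what the areas above mean. The helper names `binary_auc` and `ovr_auc` and the toy probabilities are illustrative assumptions, not part of the notebook.

```python
from itertools import product


def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def ovr_auc(y_true, proba, n_classes):
    """One-vs-rest AUC per class: class k is 'positive', all other classes 'negative'."""
    return [
        binary_auc([1 if y == k else 0 for y in y_true], [row[k] for row in proba])
        for k in range(n_classes)
    ]


# Hypothetical toy data (0 = Poor, 1 = Standard, 2 = Good); not the notebook's predictions
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.5, 0.4],
    [0.1, 0.3, 0.6],
    [0.2, 0.3, 0.5],
]

auc_per_class = ovr_auc(y_true, proba, n_classes=3)
for name, value in zip(["Poor", "Standard", "Good"], auc_per_class):
    print(f"{name}: AUC = {value:.4f}")
```

The pairwise count is quadratic in the number of samples, so it is only meant for small illustrations; on the real test set the `roc_curve` and `auc` calls used in the cell above are the practical route.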

**Summary and Business Impact:**

The **Base Random Forest Classifier** demonstrates significantly superior performance compared to the base Logistic Regression model across all key metrics (Accuracy, Precision, Recall, F1-score, and AUC).

* **High Accuracy and Robustness:** The overall accuracy of 80.45% is a strong indicator of the model's predictive capability.
* **Excellent Identification of High-Risk ('Poor') Customers:** The high precision and recall for the 'Poor' class mean Paisabazaar can effectively identify customers who are likely to default, significantly reducing potential loan losses.
* **Much Improved Precision for 'Good' Customers:** The substantial jump in precision for the 'Good' class is a critical business improvement. It means the model is much more reliable when it identifies highly creditworthy individuals, allowing Paisabazaar to offer them better terms with greater confidence and less risk.
* **Balanced Performance Across Classes:** The F1-scores are consistently high across all three classes, indicating that the model performs well on both the majority and minority classes. Because this base model is fit on `X_train`/`y_train` in the cell above, the balance here reflects Random Forest itself; the resampled training data (`X_train_resampled`) is only used for the tuned model.



#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# ML Model 2: Random Forest Classifier w/ Hyperparameter Optimization
print("--- Implementing Random Forest Classifier with RandomizedSearchCV ---")

# Define the base Random Forest Classifier model
base_rf_model = RandomForestClassifier(random_state=42)

# Define the parameter distributions for RandomizedSearchCV
param_distributions_rf = {
    'n_estimators': randint(100, 500),
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(5, 30),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'bootstrap': [True, False],
    'class_weight': [None, 'balanced', 'balanced_subsample']
}

# Initialize RandomizedSearchCV
random_search_rf = RandomizedSearchCV(
    estimator=base_rf_model,
    param_distributions=param_distributions_rf,
    n_iter=20,
    cv=2,
    scoring='f1_weighted',
    random_state=42,
    n_jobs=-1,
    verbose=2
)

print("RandomizedSearchCV setup for Random Forest optimization.")

# Fit the Algorithm
print("\n--- Fitting Random Forest Classifier Model with RandomizedSearchCV ---")
random_search_rf.fit(X_train_resampled, y_train_resampled)

# Get the best estimator
best_rf_model = random_search_rf.best_estimator_

print("\nBest hyperparameters found for Random Forest:")
print(random_search_rf.best_params_)
print("\nOptimized Random Forest Classifier model fitted successfully.")

# Make Predictions
print("\n--- Making Predictions with Optimized Random Forest Classifier ---")
y_pred_rf_optimized = best_rf_model.predict(X_test)
y_pred_proba_rf_optimized = best_rf_model.predict_proba(X_test)
print("Predictions made on the test set using the optimized Random Forest model.")


#### Explain the ML model and its performance using Evaluation metric Score chart.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
import seaborn as sns

# Predict with the optimized model (if not done already)
y_pred_rf_optimized = best_rf_model.predict(X_test)
y_pred_proba_rf_optimized = best_rf_model.predict_proba(X_test)

print("\n--- Evaluating Optimized Random Forest Classifier Model ---")

# Define class labels
credit_score_labels = ['Poor', 'Standard', 'Good']
n_classes = len(credit_score_labels)

# Classification Report
print("\nClassification Report (Optimized Random Forest Classifier):")
print(classification_report(y_test, y_pred_rf_optimized, target_names=credit_score_labels))

# Confusion Matrix
conf_matrix_rf_optimized = confusion_matrix(y_test, y_pred_rf_optimized)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_rf_optimized, annot=True, fmt='d', cmap='Blues',
            xticklabels=credit_score_labels,
            yticklabels=credit_score_labels)
plt.title('Confusion Matrix for Optimized Random Forest Classifier')
plt.xlabel('Predicted Credit Score')
plt.ylabel('True Credit Score')
plt.show()

# Accuracy
accuracy_rf_optimized = accuracy_score(y_test, y_pred_rf_optimized)
print(f"\nOverall Accuracy (Optimized Random Forest Classifier): {accuracy_rf_optimized:.4f}")

# ROC Curve (OvR)
print("\nGenerating Multi-class ROC AUC Curve (One-vs-Rest) for Optimized Random Forest Classifier...")

# Convert y_test to binary format
y_test_binarized = label_binarize(y_test, classes=[0, 1, 2])

plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green']

for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_test_binarized[:, i], y_pred_proba_rf_optimized[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=colors[i], lw=2,
             label=f'ROC curve of class {credit_score_labels[i]} (area = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Optimized Random Forest Classifier (OvR)')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

print("\nOptimized Random Forest Classifier model evaluation complete with visualizations.")


**Inference for Optimized Random Forest Classifier Model**

**Best Hyperparameters Found:** The selected values are printed from `random_search_rf.best_params_` in the tuning cell above. The search tuned `n_estimators`, `criterion`, `max_depth`, `min_samples_split`, `min_samples_leaf`, `bootstrap`, and `class_weight` to maximize the `f1_weighted` score.

**Overall Performance:**
The **Overall Accuracy slightly increased to 0.8102 (81.02%)** from the base Random Forest's 0.8045. This indicates that hyperparameter tuning, even on an already strong base model, yielded a marginal but positive gain in overall correctness.

**Detailed Performance by Class (from Classification Report and Confusion Matrix):**

* **Class 'Poor' (Credit Score = 0):**
    * **Precision (0.79):** Slightly improved from 0.78. When the model predicts 'Poor', it's now correct 79% of the time, meaning even fewer false positives for the high-risk category.
    * **Recall (0.86):** Improved from 0.84. It correctly identifies 86% of all actual 'Poor' cases, which is excellent for minimizing missed defaults.
    * **F1-score (0.82):** Improved from 0.81. This is a very strong F1-score, showing better balance and higher overall performance for this critical class.
    * **Confusion Matrix:** Out of 5799 actual 'Poor' cases (4975 + 711 + 113), **4975 were correctly classified**, an increase from 4841 in the base model. Misclassifications as 'Standard' fell from 915 to 711, while misclassifications as 'Good' rose from 43 to 113.

* **Class 'Standard' (Credit Score = 1):**
    * **Precision (0.77):** Slightly decreased from 0.81. This means when it predicts 'Standard', it's correct slightly less often.
    * **Recall (0.83):** Slightly improved from 0.81. It correctly identifies more actual 'Standard' cases.
    * **F1-score (0.80):** A slight decrease from 0.81.
    * **Confusion Matrix:** **8175 'Standard' cases correctly predicted**, down from 8360 in the base model. There's a slight shift in how 'Standard' cases are misclassified.

* **Class 'Good' (Credit Score = 2):**
    * **Precision (0.78):** Remains consistent at 0.78. The model is still highly precise when identifying 'Good' customers.
    * **Recall (0.86):** Improved from 0.81. It's now even better at identifying actual 'Good' cases, catching 86% of them. This is a significant gain for maximizing the identification of creditworthy individuals.
    * **F1-score (0.82):** Improved from 0.79. This is a strong F1-score, indicating a very good balance between precision and recall for this important class.
    * **Confusion Matrix:** Of the 3566 actual 'Good' cases, **3053 were correctly predicted**, an increase from 2890 in the base model. Of the rest, 24 were misclassified as 'Poor' and 489 as 'Standard' (20 and 656 in the base model), which is what drives the higher recall.

**ROC AUC Curves:**

* **Class 'Poor' (Area = 0.94):** Remains consistently excellent, indicating robust discriminatory power.
* **Class 'Standard' (Area = 0.88):** Slightly improved from 0.87, showing marginal gains in discriminative power.
* **Class 'Good' (Area = 0.90):** Remains strong at 0.90, reinforcing the model's ability to distinguish 'Good' cases.

#### Which hyperparameter optimization technique have you used and why?

**Randomized Search Cross-Validation (RandomizedSearchCV)** is being used as the hyperparameter optimization technique.

* It works by defining a distribution (or range) for each hyperparameter and then randomly sampling a fixed number of combinations from these distributions. It then evaluates each sampled combination using cross-validation.

    1.  **Efficiency over Exhaustive Search (Grid Search):** Random Forests have several hyperparameters (`n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`, `criterion`, etc.) that interact in complex ways. An exhaustive grid search across all these parameters can become computationally very expensive and time-consuming, especially with a larger dataset and more complex models. `RandomizedSearchCV` allows us to explore a broad range of the hyperparameter space more efficiently by controlling the number of iterations (`n_iter`), making it feasible within practical time limits.

    2.  **Effectiveness:** Studies have shown that `RandomizedSearchCV` is often as effective as, or even more effective than, `GridSearchCV` in finding good hyperparameter combinations for a given computational budget. This is because not all hyperparameters are equally important, and random sampling has a higher chance of hitting optimal values for the more important ones.
    
    3.  **Robust Evaluation with Cross-Validation:** The technique integrates **cross-validation (`cv=2` here, chosen to keep the search fast; 5 folds would ideally be used)**. This means that for each randomly sampled set of hyperparameters, the model is trained and evaluated multiple times on different subsets of the training data. This provides a more reliable estimate of the model's performance and helps to prevent overfitting to a single train-validation split.
    
    4.  **Flexibility in Defining Search Space:** It allows sampling numerical hyperparameters from `scipy.stats` distributions (for example `randint` for integers or `uniform` for continuous values), which provides more flexibility than the fixed lists of values used in `GridSearchCV`.
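To make the efficiency argument concrete, the sketch below counts how many model fits an exhaustive grid over the same ranges as `param_distributions_rf` would need, compared with the randomized search configured above. It is plain arithmetic on the ranges in the tuning cell; the dictionary name `grid_sizes` is an illustrative assumption, and the counts ignore the single final refit on the full training data that scikit-learn performs by default.

```python
from math import prod

# Number of distinct values per hyperparameter in param_distributions_rf.
# scipy.stats.randint(low, high) draws integers from [low, high), i.e. high - low values.
grid_sizes = {
    "n_estimators": 500 - 100,
    "criterion": 2,            # 'gini', 'entropy'
    "max_depth": 30 - 5,
    "min_samples_split": 20 - 2,
    "min_samples_leaf": 10 - 1,
    "bootstrap": 2,            # True, False
    "class_weight": 3,         # None, 'balanced', 'balanced_subsample'
}

cv_folds = 2   # cv value passed to RandomizedSearchCV
n_iter = 20    # number of sampled combinations

grid_combinations = prod(grid_sizes.values())
grid_fits = grid_combinations * cv_folds
random_fits = n_iter * cv_folds

print(f"Exhaustive grid:   {grid_combinations:,} combinations -> {grid_fits:,} fits")
print(f"Randomized search: {n_iter} combinations -> {random_fits} fits")
```

With these ranges the full grid has about 19.4 million combinations, so the 20 sampled configurations cover only a tiny fraction of the space. That is the trade-off being made: a bounded runtime in exchange for no guarantee that the best combination is visited.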

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes, based on the "Comparison of Base vs. Optimized Random Forest Models" chart and the accompanying metrics, **there has been an improvement** after hyperparameter optimization.

**Improvement Details**

| Metric                  | Base Random Forest | Optimized Random Forest | Change      |
| :---------------------- | :----------------- | :---------------------- | :---------- |
| Accuracy                | 0.8045             | **0.8102**              | **+0.0057** |
| Precision (Macro Avg)   | 0.7895             | **0.7895**              | **+0.0000** |
| Recall (Macro Avg)      | 0.8184             | **0.8276**              | **+0.0092** |
| F1-Score (Macro Avg)    | 0.7989             | **0.8047**              | **+0.0058** |
| F1-Score (Weighted Avg) | 0.8049             | **0.8109**              | **+0.0060** |

* **Overall Accuracy:** There's a positive increase in overall accuracy from **0.8045 to 0.8102**.
* **F1-Score (Weighted Avg):** This metric, crucial for imbalanced datasets, shows a good improvement from **0.8049 to 0.8109**. This indicates a better overall balanced performance across all classes, considering their proportions.
* **Recall (Macro Avg):** This metric saw the most significant improvement, increasing from **0.8184 to 0.8276**. This suggests the optimized model is better at identifying true positive cases across all classes on average.
* **F1-Score (Macro Avg):** Also shows a positive increase from **0.7989 to 0.8047**, indicating a better unweighted balance across classes.
* **Precision (Macro Avg):** Remains constant, indicating that the model's ability to avoid false positives (when it predicts a class, how often it's correct) stayed consistent.
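The "Change" column is simply the optimized score minus the base score. A minimal sketch of that arithmetic, using the values copied from the table above, is shown here; the variable names are illustrative assumptions, and in the notebook the same numbers would come from `classification_report(..., output_dict=True)` as in the Logistic Regression comparison cell.

```python
# (base, optimized) scores copied from the comparison table above
scores = {
    "Accuracy": (0.8045, 0.8102),
    "Precision (Macro Avg)": (0.7895, 0.7895),
    "Recall (Macro Avg)": (0.8184, 0.8276),
    "F1-Score (Macro Avg)": (0.7989, 0.8047),
    "F1-Score (Weighted Avg)": (0.8049, 0.8109),
}

# Absolute change per metric, rounded to the table's four decimals
changes = {metric: round(opt - base, 4) for metric, (base, opt) in scores.items()}

for metric, delta in changes.items():
    print(f"{metric:<25} {delta:+.4f}")

# Metric with the largest absolute gain
largest_gain = max(changes, key=changes.get)
print(f"Largest gain: {largest_gain}")
```

All gains are below one percentage point, so it is worth keeping in mind that they come from a single train/test split and a 2-fold search.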


#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

**Understanding Evaluation Metrics and Their Business Indication**

For a credit scoring model, each metric tells a specific story about the model's performance from a business perspective:

1.  **Accuracy**
    * **Indication:** The proportion of total predictions that were correct (both true positives and true negatives).
    * **Business Relevance:** It provides a general sense of how often the model is right.
    * **Business Impact:** While intuitive, it can be misleading in imbalanced datasets (like ours). If 90% of customers are 'Standard', a model that always predicts 'Standard' would have 90% accuracy, but would be useless for identifying 'Poor' or 'Good' customers. Therefore, relying solely on accuracy for credit scoring is risky.

2.  **Precision**
    * **Indication:** Out of all instances the model *predicted* as a certain class, what percentage were actually correct?
        * **Formula:** Precision = True Positives / (True Positives + False Positives)
    * **Business Relevance (by Class):**
        * **Precision for 'Poor' Score:** Very important! If the model predicts a customer has a 'Poor' credit score (high risk), high precision means a low chance that the customer is actually 'Standard' or 'Good'. This prevents Paisabazaar from unnecessarily denying loans to creditworthy individuals or misclassifying them into higher interest rate tiers. **High precision for 'Poor' reduces false alarms of high risk.**
        * **Precision for 'Good' Score:** Extremely important! If the model predicts a customer has a 'Good' credit score (low risk), high precision means a low chance that the customer is actually 'Poor' or 'Standard'. This prevents Paisabazaar from offering favorable terms or loans to high-risk individuals, which can lead to financial losses from defaults. **High precision for 'Good' reduces the risk of bad loans.**
        * **Precision for 'Standard' Score:** Important for correct tiering and product assignment. High precision means customers predicted as 'Standard' are likely genuinely 'Standard'.

3.  **Recall (Sensitivity)**
    * **Indication:** Out of all actual instances of a certain class, what percentage did the model correctly identify?
        * **Formula:** Recall = True Positives / (True Positives + False Negatives)
    * **Business Relevance (by Class):**
        * **Recall for 'Poor' Score:** Crucial! If the model *misses* a truly 'Poor' credit score (a false negative), Paisabazaar might approve a risky loan. High recall ensures that most genuinely 'Poor' customers are flagged. **High recall for 'Poor' reduces potential loan defaults.**
        * **Recall for 'Good' Score:** Important for business growth. If the model *misses* a truly 'Good' credit score (a false negative), Paisabazaar might miss an opportunity to offer a loan to a reliable customer. High recall ensures that most genuinely 'Good' customers are identified. **High recall for 'Good' helps capture valuable customers.**
        * **Recall for 'Standard' Score:** Important for volume and efficiency. High recall ensures most 'Standard' customers are routed correctly.

4.  **F1-Score**
    * **Indication:** The harmonic mean of Precision and Recall. It provides a single score that balances both metrics, especially useful when there's an uneven class distribution.
        * **Macro Avg F1-Score:** Calculates F1-score for each class and then takes the unweighted average. Treats all classes equally important.
        * **Weighted Avg F1-Score:** Calculates F1-score for each class and then takes the average weighted by the number of true instances for each class (support). This is often more appropriate for imbalanced datasets, as it reflects the overall F1-score better.
    * **Business Relevance:** A high F1-score indicates that the model is doing a good job of both being precise (not making many false positive errors) and complete (not missing many true positive cases) for a given class or across all classes (for macro/weighted). For credit scoring, a strong F1-score suggests the model effectively manages the trade-off between denying valid customers and approving risky ones.

5.  **Confusion Matrix**
    * **Indication:** A table that visually summarizes the performance of a classification model. Each row represents the instances in an actual class, while each column represents the instances in a predicted class.
    * **Business Relevance:**
        * **True Positives (Diagonal):** Correct predictions. Business wants to maximize these.
        * **False Positives (off-diagonal entries in a class's column):** Type I error for that class. E.g., predicting 'Good' when the actual class is 'Poor' or 'Standard'. **Directly leads to higher risk and potential losses.**
        * **False Negatives (off-diagonal entries in a class's row):** Type II error for that class. E.g., an actual 'Good' customer predicted as 'Standard' or 'Poor', or an actual 'Poor' customer predicted as 'Standard' or 'Good'. **Leads to missed opportunities (for 'Good') or undetected risks (for 'Poor').**
    * **Business Impact:** The confusion matrix provides a granular view of where the model makes mistakes. For Paisabazaar, it helps understand the types of misclassifications and their associated costs/risks. For example, misclassifying a 'Poor' customer as 'Good' (false positive for 'Good' class) is usually much more costly than misclassifying a 'Good' customer as 'Standard' (false negative for 'Good' class).

6.  **ROC AUC (Receiver Operating Characteristic - Area Under the Curve)**
    * **Indication:** Measures the model's ability to distinguish between classes across all possible classification thresholds. A higher AUC (closer to 1) indicates better discrimination power.
    * **Business Relevance:** A high AUC means the model is good at separating the positive class from the negative class(es). For multi-class (One-vs-Rest):
        * **AUC for 'Poor' vs. Rest:** How well can the model truly differentiate 'Poor' customers from everyone else? High AUC here means the model is excellent at flagging risky customers.
        * **AUC for 'Good' vs. Rest:** How well can the model truly differentiate 'Good' customers from everyone else? High AUC here means the model is excellent at identifying highly creditworthy customers.
    * **Business Impact:** High AUC for relevant classes instills confidence that the model can be used to set effective cut-off points (thresholds) for making business decisions (e.g., if probability > X, approve loan; if probability < Y, deny loan).
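The definitions above can be tied together by deriving every metric from a single confusion matrix. The sketch below does this for a small made-up 3x3 matrix (rows are actual classes, columns are predicted classes) and also computes the accuracy of always predicting the majority class, which is the baseline that makes raw accuracy misleading. The counts and the helper name `per_class_metrics` are hypothetical and are not the notebook's results.

```python
def per_class_metrics(cm):
    """Precision, recall, F1 and support per class from a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp  # off-diagonal entries in column k
        fn = sum(cm[k]) - tp                       # off-diagonal entries in row k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1, "support": tp + fn})
    return results


labels = ["Poor", "Standard", "Good"]
# Hypothetical counts for illustration only
cm = [
    [80, 15, 5],     # actual Poor
    [20, 160, 20],   # actual Standard
    [2, 18, 80],     # actual Good
]

metrics = per_class_metrics(cm)
total = sum(m["support"] for m in metrics)

accuracy = sum(cm[k][k] for k in range(len(cm))) / total
macro_f1 = sum(m["f1"] for m in metrics) / len(metrics)                # every class counts equally
weighted_f1 = sum(m["f1"] * m["support"] for m in metrics) / total     # weighted by class size
majority_baseline = max(m["support"] for m in metrics) / total         # always predict the largest class

for name, m in zip(labels, metrics):
    print(f"{name:<9} precision={m['precision']:.3f} recall={m['recall']:.3f} f1={m['f1']:.3f}")
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
print(f"majority-class baseline accuracy={majority_baseline:.3f}")
```

Comparing `accuracy` with `majority_baseline` shows how much of the headline accuracy is actually earned by the model, while the per-class rows expose which kind of error (column or row) is driving a weak precision or recall.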



**Business Impact of the ML Models (General Inference)**

The machine learning models developed for credit score prediction (Logistic Regression, Decision Tree, Random Forest) provide several key business impacts for Paisabazaar:

1.  **Automated and Consistent Risk Assessment:**
    * Models offer a standardized, objective, and consistent way to evaluate creditworthiness, reducing human bias and speeding up loan application processing. This is vital for handling large volumes of applications.

2.  **Reduced Loan Defaults and Financial Losses:**
    * By accurately identifying 'Poor' credit scores (high recall for 'Poor' and high precision for 'Poor'), the models help Paisabazaar avoid or mitigate lending to high-risk individuals, directly reducing potential losses from loan defaults.

3.  **Optimized Lending Strategies:**
    * **Targeted Product Offerings:** With better predictions of 'Good' credit scores (high precision and recall for 'Good'), Paisabazaar can confidently offer premium financial products or more favorable interest rates to genuinely creditworthy customers. This can improve customer acquisition, retention, and increase profitability.
    * **Risk-Based Pricing:** The model's predicted credit score can inform dynamic pricing of loans, ensuring interest rates align with the assessed risk level.

4.  **Improved Operational Efficiency:**
    * Automating credit score prediction frees up human resources from manual review, allowing them to focus on more complex cases or other strategic tasks.

5.  **Enhanced Customer Experience (for some):**
    * Faster approvals for creditworthy customers.
    * More appropriate product recommendations based on their predicted profile.

6.  **Data-Driven Decision Making:**
    * The model's insights (e.g., feature importance from tree-based models) can help Paisabazaar understand which factors most strongly influence credit scores, allowing them to refine their lending policies or marketing strategies.
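As a rough illustration of how predicted class probabilities could feed the kind of automated routing described above, the sketch below maps a probability vector `[Poor, Standard, Good]` to an action. The function name `route_application`, the action labels, and both thresholds are hypothetical; real cut-offs would have to be chosen from the ROC curves and the relative cost of each error type.

```python
def route_application(proba, flag_poor=0.5, approve_good=0.6):
    """Map class probabilities [Poor, Standard, Good] to a hypothetical action.

    flag_poor and approve_good are illustrative thresholds, not tuned values.
    """
    p_poor, _p_standard, p_good = proba
    if p_poor >= flag_poor:
        return "manual_review"      # likely high risk: send to a human underwriter
    if p_good >= approve_good:
        return "fast_track"         # likely creditworthy: offer better terms quickly
    return "standard_process"       # everything else follows the normal flow


# Hypothetical probability vectors, e.g. rows of best_rf_model.predict_proba(X_test)
examples = [
    [0.70, 0.20, 0.10],
    [0.10, 0.20, 0.70],
    [0.30, 0.50, 0.20],
]
decisions = [route_application(p) for p in examples]
print(decisions)
```

Checking the 'Poor' threshold first reflects the assumption that approving a risky customer costs more than delaying a good one; a different cost structure would justify a different order or different thresholds.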


### ML Model - 3

In [None]:
# @title ML Model 3: DecisionTreeClassifier w/o hyperparameter optimization

# @title Implement Decision Tree Classifier Algorithm
from sklearn.tree import DecisionTreeClassifier

print("--- Implementing Decision Tree Classifier ---")

# Initialize the Decision Tree Classifier with default hyperparameters
# (no hyperparameter optimization yet); random_state is fixed for reproducibility
dt_model = DecisionTreeClassifier(random_state=42)

print("Decision Tree Classifier model initialized.")


# @title Fit the Algorithm
print("\n--- Fitting Decision Tree Classifier Model ---")

# Fit the model on the resampled training data
dt_model.fit(X_train_resampled, y_train_resampled)

print("Decision Tree Classifier model fitted successfully to the training data.")


# @title Predict on the Model
print("\n--- Making Predictions with Decision Tree Classifier ---")

# Predict classes
y_pred_dt = dt_model.predict(X_test)

# Predict probabilities (needed for ROC AUC)
y_pred_proba_dt = dt_model.predict_proba(X_test)

print("Predictions made on the test set.")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing Evaluation Metric Score Chart
print("\n--- Evaluating Decision Tree Classifier Model ---")

# Define target names/labels
credit_score_labels = ['Poor', 'Standard', 'Good']
n_classes = len(credit_score_labels)

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_dt, target_names=credit_score_labels))

# Confusion Matrix
conf_matrix_dt = confusion_matrix(y_test, y_pred_dt)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_dt, annot=True, fmt='d', cmap='Blues',
            xticklabels=credit_score_labels,
            yticklabels=credit_score_labels)
plt.title('Confusion Matrix for Decision Tree Classifier')
plt.xlabel('Predicted Credit Score')
plt.ylabel('True Credit Score')
plt.show()

# Overall Accuracy Score
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"\nOverall Accuracy: {accuracy_dt:.4f}")

# ROC AUC Curve (One-vs-Rest)
print("\nGenerating Multi-class ROC AUC Curve (One-vs-Rest)...")

# Binarize the true labels for OvR
y_test_binarized = label_binarize(y_test, classes=[0, 1, 2])

plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green'] # Colors for Poor, Standard, Good

for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_test_binarized[:, i], y_pred_proba_dt[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=colors[i], lw=2,
             label=f'ROC curve of class {credit_score_labels[i]} (area = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Decision Tree Classifier (One-vs-Rest)')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

print("\nDecision Tree Classifier model evaluation complete with visualizations.")

**Inference for Base Decision Tree Classifier Model**

**Overall Performance:**
The model achieved an **Overall Accuracy of 0.7374 (73.74%)**. This is a significant improvement over the base Logistic Regression model (63.56%) but falls short of the performance seen with the base Random Forest Classifier (80.45%). This is generally expected, as a single Decision Tree can be prone to overfitting and doesn't benefit from the ensemble power of Random Forest.
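The intuition behind the "ensemble power" mentioned above can be sketched with an idealized calculation: if several classifiers make independent errors, a majority vote is right more often than any single one. The code below computes this for a binary problem using the binomial distribution. It is a simplification under loudly stated assumptions: the trees in a Random Forest are correlated rather than independent, the task here has three classes, and scikit-learn's forest averages class probabilities instead of counting hard votes, so the real gain is much smaller than this idealization suggests. The function name and the 0.6 accuracy are hypothetical.

```python
from math import comb


def majority_vote_accuracy(p, n_voters):
    """Probability that more than half of n independent binary voters are correct.

    Each voter is assumed to be correct with probability p, independently of the others.
    """
    k_min = n_voters // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )


single = majority_vote_accuracy(0.6, 1)
three = majority_vote_accuracy(0.6, 3)
many = majority_vote_accuracy(0.6, 101)

print(f"1 voter:    {single:.3f}")
print(f"3 voters:   {three:.3f}")
print(f"101 voters: {many:.3f}")
```

Odd voter counts are used so that ties cannot occur. In this notebook the observed gap between the single tree (73.74%) and the forest (80.45%) is far more modest than the independent-voter picture, which is consistent with the trees sharing most of their training signal.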

**Detailed Performance by Class (from Classification Report and Confusion Matrix):**

* **Class 'Poor' (Credit Score = 0):**
    * **Precision (0.72):** When the model predicts 'Poor', it's correct 72% of the time. This is a decent precision, better than Logistic Regression, but lower than Random Forest.
    * **Recall (0.73):** It correctly identifies 73% of all actual 'Poor' cases. This is moderate recall; it's missing a notable portion of truly 'Poor' customers.
    * **F1-score (0.72):** A balanced F1-score, indicating reasonable performance for this critical high-risk class.
    * **Confusion Matrix:** Out of 5799 actual 'Poor' cases (4259 + 1448 + 92), 4259 were correctly classified. A substantial number (1448) were misclassified as 'Standard', and a small number (92) as 'Good'. The high misclassification into 'Standard' is a concern for risk management.

* **Class 'Standard' (Credit Score = 1):**
    * **Precision (0.73):** When the model predicts 'Standard', it's correct 73% of the time. This is fair.
    * **Recall (0.74):** It correctly identifies 74% of all actual 'Standard' cases.
    * **F1-score (0.74):** A reasonable F1-score for the majority class.
    * **Confusion Matrix:** 8024 'Standard' cases were correctly predicted. However, a significant number were misclassified as 'Poor' (1519) and 'Good' (1092).

* **Class 'Good' (Credit Score = 2):**
    * **Precision (0.69):** When the model predicts 'Good', it's correct 69% of the time. This is a good improvement compared to base Logistic Regression (0.50), but still means 31% of predictions are false positives.
    * **Recall (0.69):** It correctly identifies 69% of all actual 'Good' cases. This is moderate recall, meaning it's missing some truly 'Good' customers.
    * **F1-score (0.69):** This is the lowest F1-score among the three classes for this model (0.72 for 'Poor', 0.74 for 'Standard'), although precision and recall are evenly balanced for this class.
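    * **Confusion Matrix:** Of the 3566 actual 'Good' cases, 2465 were correctly predicted; 106 were misclassified as 'Poor' and 995 as 'Standard', which accounts for the moderate recall. In the other direction, 92 'Poor' and 1092 'Standard' cases were predicted as 'Good' (as listed under those classes above), which is what contributes to the precision concerns.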


**ROC AUC Curves:**

* **Class 'Poor' (Area = 0.81):** A good AUC score, indicating the model has decent ability to distinguish 'Poor' cases from others.
* **Class 'Standard' (Area = 0.75):** A moderate AUC.
* **Class 'Good' (Area = 0.81):** A good AUC score, showing reasonable discriminatory ability for the 'Good' class, which is a positive aspect for this model.



#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# @title ML Model 3: DecisionTreeClassifier w/ hyperparameter optimization

# Implement Decision Tree Classifier with Hyperparameter Optimization (RandomizedSearchCV)
print("--- Implementing Decision Tree Classifier with RandomizedSearchCV ---")

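# Imports used in this cell (harmless if already imported earlier in the notebook)
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Define the base Decision Tree Classifier model
base_dt_model = DecisionTreeClassifier(random_state=42)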

# Define the parameter distributions for RandomizedSearchCV
param_distributions_dt = {
    'criterion': ['gini', 'entropy'], # Function to measure the quality of a split
    'max_depth': randint(1, 20), # Maximum depth of the tree
    'min_samples_split': randint(2, 20), # Minimum number of samples required to split an internal node
    'min_samples_leaf': randint(1, 10), # Minimum number of samples required to be at a leaf node
    'class_weight': [None, 'balanced'] # Handle class imbalance
}

# Initialize RandomizedSearchCV for Decision Tree
random_search_dt = RandomizedSearchCV(
    estimator=base_dt_model,
    param_distributions=param_distributions_dt,
    n_iter=20, # Number of random combinations to try (kept at 20 for faster execution; 50 otherwise)
    cv=2, # 2-fold cross-validation (for faster execution; 5 folds would be preferable)
    scoring='f1_weighted', # F1-score is good for imbalanced multi-class problems
    random_state=42,
    n_jobs=-1, # Use all available CPU cores
    verbose=2 # Show progress
)

print("RandomizedSearchCV setup for Decision Tree optimization.")


# Fit the Algorithm (Hyperparameter Search)
print("\n--- Fitting Decision Tree Classifier Model with RandomizedSearchCV ---")

# Fit RandomizedSearchCV on the resampled training data
# This step performs the hyperparameter search using cross-validation
random_search_dt.fit(X_train_resampled, y_train_resampled)

# Get the best estimator found by RandomizedSearchCV
best_dt_model = random_search_dt.best_estimator_

print("\nBest hyperparameters found for Decision Tree:")
print(random_search_dt.best_params_)
print("\nOptimized Decision Tree Classifier model fitted successfully.")


# Predict on the Model
print("\n--- Making Predictions with Optimized Decision Tree Classifier ---")

# Predict classes using the best model
y_pred_dt_optimized = best_dt_model.predict(X_test)

# Predict probabilities using the best model (needed for ROC AUC)
y_pred_proba_dt_optimized = best_dt_model.predict_proba(X_test)

print("Predictions made on the test set using the optimized Decision Tree model.")

In [None]:
# Visualizing Evaluation Metric Score Chart
print("\n--- Evaluating Optimized Decision Tree Classifier Model ---")

# Define target names/labels
credit_score_labels = ['Poor', 'Standard', 'Good']
n_classes = len(credit_score_labels)

# Classification Report
print("\nClassification Report (Optimized Decision Tree Classifier):")
print(classification_report(y_test, y_pred_dt_optimized, target_names=credit_score_labels))

# Confusion Matrix
conf_matrix_dt_optimized = confusion_matrix(y_test, y_pred_dt_optimized)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_dt_optimized, annot=True, fmt='d', cmap='Blues',
            xticklabels=credit_score_labels,
            yticklabels=credit_score_labels)
plt.title('Confusion Matrix for Optimized Decision Tree Classifier')
plt.xlabel('Predicted Credit Score')
plt.ylabel('True Credit Score')
plt.show()

# Overall Accuracy Score
accuracy_dt_optimized = accuracy_score(y_test, y_pred_dt_optimized)
print(f"\nOverall Accuracy (Optimized Decision Tree Classifier): {accuracy_dt_optimized:.4f}")

# ROC AUC Curve (One-vs-Rest) for Multi-Class
print("\nGenerating Multi-class ROC AUC Curve (One-vs-Rest) for Optimized Decision Tree Classifier...")

# Binarize the true labels for OvR
y_test_binarized = label_binarize(y_test, classes=[0, 1, 2])

plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green'] # Colors for Poor, Standard, Good

for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_test_binarized[:, i], y_pred_proba_dt_optimized[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=colors[i], lw=2,
             label=f'ROC curve of class {credit_score_labels[i]} (area = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for Optimized Decision Tree Classifier (One-vs-Rest)')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

print("\nOptimized Decision Tree Classifier model evaluation complete with visualizations.")

In [None]:
# @title Compare Results of Base vs. Optimized Decision Tree Models
print("--- Comparing Base Decision Tree Classifier vs. Optimized Decision Tree Classifier ---")

# Define target names/labels for clear reporting
credit_score_labels = ['Poor', 'Standard', 'Good']

# --- Metrics for Base Decision Tree Classifier Model ---
report_base_dt = classification_report(y_test, y_pred_dt, output_dict=True)
accuracy_base_dt = accuracy_score(y_test, y_pred_dt)
precision_macro_base_dt = report_base_dt['macro avg']['precision']
recall_macro_base_dt = report_base_dt['macro avg']['recall']
f1_macro_base_dt = report_base_dt['macro avg']['f1-score']
f1_weighted_base_dt = report_base_dt['weighted avg']['f1-score']


# --- Metrics for Optimized Decision Tree Classifier Model ---
report_optimized_dt = classification_report(y_test, y_pred_dt_optimized, output_dict=True)
accuracy_optimized_dt = accuracy_score(y_test, y_pred_dt_optimized)
precision_macro_optimized_dt = report_optimized_dt['macro avg']['precision']
recall_macro_optimized_dt = report_optimized_dt['macro avg']['recall']
f1_macro_optimized_dt = report_optimized_dt['macro avg']['f1-score']
f1_weighted_optimized_dt = report_optimized_dt['weighted avg']['f1-score']


# Create a DataFrame for Comparison
comparison_data_dt = {
    'Metric': ['Accuracy', 'Precision (Macro Avg)', 'Recall (Macro Avg)', 'F1-Score (Macro Avg)', 'F1-Score (Weighted Avg)'],
    'Base Decision Tree': [accuracy_base_dt, precision_macro_base_dt, recall_macro_base_dt, f1_macro_base_dt, f1_weighted_base_dt],
    'Optimized Decision Tree': [accuracy_optimized_dt, precision_macro_optimized_dt, recall_macro_optimized_dt, f1_macro_optimized_dt, f1_weighted_optimized_dt]
}
comparison_df_dt = pd.DataFrame(comparison_data_dt)

print("\n--- Decision Tree Model Performance Comparison ---")
print(comparison_df_dt.to_string(index=False, float_format="%.4f"))


# Visualizing the Comparison
print("\n--- Visualizing Decision Tree Model Performance Comparison ---")
metrics_to_plot_dt = ['Accuracy', 'F1-Score (Weighted Avg)'] # Key metrics for visualization

plot_df_dt = comparison_df_dt[comparison_df_dt['Metric'].isin(metrics_to_plot_dt)].set_index('Metric').T
plot_df_dt.index.name = 'Model'

# DataFrame.plot creates its own figure, so a separate plt.figure() call would only leave an empty figure behind
plot_df_dt.plot(kind='bar', figsize=(10, 6), colormap='plasma') # Using a different colormap for distinction
plt.title('Comparison of Base vs. Optimized Decision Tree Performance', fontsize=16)
plt.ylabel('Score', fontsize=12)
plt.xticks(rotation=0)
plt.ylim([0, 1]) # Scores are typically between 0 and 1
plt.legend(title='Metric', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

print("\nComparison and visualization for Decision Tree models complete.")


**Inference for Optimized Decision Tree Classifier Model**

**Best Hyperparameters Found:** The selected values are printed by the search cell above via `random_search_dt.best_params_`; this inference focuses on the resulting change in performance metrics. The search looked for the combination of `criterion`, `max_depth`, `min_samples_split`, `min_samples_leaf`, and `class_weight` that maximizes the cross-validated `f1_weighted` score.


**Overall Performance:**
The **overall accuracy decreased slightly to 0.7334 (73.34%)** from the base Decision Tree's 0.7374. Tuning does not guarantee a higher test accuracy: the search maximized the cross-validated `f1_weighted` score on the resampled training data, so the selected tree can trade a little overall accuracy for better recall on the minority classes or for less overfitting.


**Detailed Performance by Class (from Classification Report and Confusion Matrix):**

* **Class 'Poor' (Credit Score = 0):**
    * **Precision (0.70):** Decreased from 0.72. The model is now slightly less precise when it predicts 'Poor'.
    * **Recall (0.81):** **Significantly improved from 0.73.** This is a strong positive! It means the model is now much better at identifying actual 'Poor' cases (81% caught), crucial for risk management.
    * **F1-score (0.75):** Improved from 0.72. This indicates a better balance of precision and recall for this critical high-risk class.
    * **Confusion Matrix:** Out of 5799 actual 'Poor' cases (4711 + 769 + 319), **4711 were correctly classified**, an increase from 4259 in the base model. The number of 'Poor' cases misclassified as 'Standard' decreased from 1448 to 769, while misclassifications as 'Good' increased from 92 to 319. This suggests the model is less prone to misclassifying 'Poor' as 'Standard', but more prone to misclassifying them as 'Good'.

* **Class 'Standard' (Credit Score = 1):**
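    * **Precision (0.83):** 7188 of the 8623 'Standard' predictions are correct. The optimized tree predicts 'Standard' less often than the base tree (8623 vs. 10467 predictions) but is right more often when it does.
    * **Recall (0.68):** Only 7188 of the 10635 actual 'Standard' cases are identified, down from 8024 in the base model.
    * **F1-score (0.75):** The gain in precision is largely offset by the loss in recall.
    * **Confusion Matrix:** **7188 'Standard' cases were correctly predicted**, a decrease from 8024 in the base model. Misclassifications of 'Standard' as 'Poor' increased from 1519 to 1865 and as 'Good' from 1092 to 1582, so more majority-class customers are now pushed into the two outer classes.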

* **Class 'Good' (Credit Score = 2):**
    * **Precision (0.59):** Only 2769 of the 4670 'Good' predictions are correct, because 319 'Poor' and 1582 'Standard' cases are now predicted as 'Good' (up from 92 and 1092 in the base model). This is a clear decrease in precision.
    * **Recall (0.78):** **Significantly improved from 0.69.** The model now identifies 78% of actual 'Good' cases (2769 of 3566), so fewer creditworthy customers are missed.
    * **F1-score (0.67):** Slightly lower than in the base model, because the gain in recall is outweighed by the loss in precision.
    * **Confusion Matrix:** **2769 actual 'Good' cases were correctly predicted**, an increase from 2465 in the base model. Of the remaining 'Good' cases, 131 were misclassified as 'Poor' (up from 106) and 666 as 'Standard' (down from 995).


**ROC AUC Curves:**

* **Class 'Poor' (Area = 0.88):** **Significantly improved from 0.81.** This is a substantial gain in discriminatory power for the high-risk class.
* **Class 'Standard' (Area = 0.80):** Improved from 0.75, showing a good gain in discriminative power for the majority class.
* **Class 'Good' (Area = 0.88):** **Significantly improved from 0.81.** This is a substantial gain in discriminatory power for the low-risk class.


##### Which hyperparameter optimization technique have you used and why?

**Randomized Search Cross-Validation (`RandomizedSearchCV`)** was used as the hyperparameter optimization technique. It performs a random search over a defined hyperparameter space: instead of testing every single combination (like `GridSearchCV`), it samples a fixed number of combinations (`n_iter`) from the specified distributions for each hyperparameter. It was chosen for the following reasons:

    1.  **Efficiency:** Decision Trees can have a significant number of hyperparameters (e.g., `max_depth`, `min_samples_split`, `criterion`, `min_samples_leaf`). Exploring every combination exhaustively with `GridSearchCV` can be computationally very expensive and time-consuming, especially with cross-validation. `RandomizedSearchCV` offers a more efficient way to explore the search space by sampling, allowing us to find good parameters within a reasonable time budget.
    2.  **Effectiveness:** In practice, `RandomizedSearchCV` often finds hyperparameter combinations that are as good as, or even better than, those found by `GridSearchCV` for the same computational cost. This is because not all hyperparameters are equally important, and random sampling has a good chance of hitting effective values for the more influential ones.
    3.  **Cross-Validation:** It integrates cross-validation (`cv=2` in this run to keep execution fast; 5 folds would give a more stable estimate), which scores each sampled set of hyperparameters on more than one train/validation split and so reduces the risk of tuning to a single split.
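
To put the efficiency argument in numbers, the short sketch below counts how many candidates an exhaustive grid over the same ranges would contain. It assumes the integer ranges from `param_distributions_dt` (`scipy.stats.randint(low, high)` excludes `high`) and the `n_iter=20`, `cv=2` settings from the search cell. It is plain arithmetic, not part of the pipeline, and it ignores the single final refit on the full training data.

```python
from math import prod

# Number of distinct values per hyperparameter, mirroring param_distributions_dt.
# scipy.stats.randint(low, high) draws from low .. high-1, so it has high - low values.
n_values = {
    "criterion": 2,               # 'gini', 'entropy'
    "max_depth": 20 - 1,          # randint(1, 20) -> 1..19
    "min_samples_split": 20 - 2,  # randint(2, 20) -> 2..19
    "min_samples_leaf": 10 - 1,   # randint(1, 10) -> 1..9
    "class_weight": 2,            # None, 'balanced'
}

cv_folds = 2  # as in the search cell
n_iter = 20   # as in the search cell

grid_candidates = prod(n_values.values())
grid_fits = grid_candidates * cv_folds
random_fits = n_iter * cv_folds

print(f"Exhaustive grid: {grid_candidates} candidates, {grid_fits} fits")
print(f"Randomized search: {n_iter} candidates, {random_fits} fits")
```

Under these assumptions the grid holds 12,312 candidates (24,624 fits with 2 folds), against 40 fits for the randomized search.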


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

The hyperparameter optimization for the Decision Tree Classifier resulted in a strategic shift in performance, indicating a trade-off that prioritizes certain aspects (like recall) even if overall accuracy sees a minor dip.

**Improvement/Change Details:**


| Metric                | Base Decision Tree | Optimized Decision Tree | Change     |
| :-------------------- | :----------------- | :---------------------- | :--------- |
| Accuracy | 0.7374             | **0.7334** | **-0.0040** |
| Precision (Macro Avg) | 0.7220             | **0.7096** | **-0.0124** |
| Recall (Macro Avg)    | 0.7267             | **0.7549** | **+0.0282** |
| F1-Score (Macro Avg)  | 0.7241             | **0.7241** | **+0.0000** |
| F1-Score (Weighted Avg)| 0.7376             | **0.7353** | **-0.0023** |
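
The optimized-model column can be cross-checked from the confusion-matrix counts quoted in the inference above. The sketch assumes rows are true classes and columns are predicted classes, in the order Poor, Standard, Good (the convention of scikit-learn's `confusion_matrix`); the helper name `summarize` is illustrative and not part of the notebook.

```python
# Confusion-matrix counts for the optimized Decision Tree, as quoted in the
# inference section. Assumed layout: rows = true class, columns = predicted class.
labels = ["Poor", "Standard", "Good"]
cm_optimized = [
    [4711, 769, 319],    # actual Poor
    [1865, 7188, 1582],  # actual Standard
    [131, 666, 2769],    # actual Good
]

def summarize(cm):
    """Return accuracy plus macro and weighted averages from a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    support = [sum(row) for row in cm]                               # actual cases per class
    predicted = [sum(cm[r][c] for r in range(n)) for c in range(n)]  # predictions per class
    precision = [cm[i][i] / predicted[i] for i in range(n)]
    recall = [cm[i][i] / support[i] for i in range(n)]
    f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
    return {
        "accuracy": sum(cm[i][i] for i in range(n)) / total,
        "precision_macro": sum(precision) / n,
        "recall_macro": sum(recall) / n,
        "f1_macro": sum(f1) / n,  # unweighted mean of the per-class F1-scores
        "f1_weighted": sum(f * s for f, s in zip(f1, support)) / total,
        "per_class": {lab: (precision[i], recall[i], f1[i]) for i, lab in enumerate(labels)},
    }

summary = summarize(cm_optimized)
for name in ["accuracy", "precision_macro", "recall_macro", "f1_macro", "f1_weighted"]:
    print(f"{name}: {summary[name]:.4f}")
```

`summary["per_class"]` holds the (precision, recall, F1) triple for each class, which is useful when comparing the per-class figures with the averages in the table.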

**Key Observations and Inference:**

1.  **Slight Decrease in Overall Accuracy:** The overall accuracy of the optimized Decision Tree slightly *decreased* from 0.7374 to 0.7334. This is a common outcome when hyperparameter tuning (especially with `class_weight` or regularization parameters like `max_depth`, `min_samples_leaf`) aims to reduce overfitting or improve performance on minority classes. The resulting tree is more constrained, which can help on unseen data, even though here it costs a little test-set accuracy.

2.  **Trade-off: Decreased Precision for Significantly Increased Recall:**
    * **Precision (Macro Avg)** saw a small decrease from 0.7220 to 0.7096. This suggests that, on average, when the model makes a positive prediction for a class, it's slightly less likely to be correct.
    * **Recall (Macro Avg)**, however, experienced a **significant increase** from 0.7267 to **0.7549**. This is a major improvement, indicating that the optimized model is now much better at **identifying all actual instances of each class**. For instance, if the model prioritizes finding all high-risk customers, this improvement is very valuable.

3.  **F1-Score (Macro Avg) remained constant:** Despite the shifts in precision and recall, the unweighted mean of the three per-class F1-scores stayed at 0.7241. The model reached a different, but on average equally balanced, trade-off between precision and recall across classes.

4.  **F1-Score (Weighted Avg) slightly decreased:** This also decreased marginally from 0.7376 to 0.7353, reflecting the minor drop in overall accuracy when accounting for class support.

**Conclusion on Improvement (Decision Tree Classifier):**

While the optimization did not lead to a blanket increase in *all* metrics (e.g., accuracy slightly dipped), it resulted in a **clear improvement in recall for the 'Poor' and 'Good' classes**. For applications like credit scoring, where identifying high-risk individuals (high recall for 'Poor') is often paramount, this shift can be a meaningful business improvement. The confusion matrix also shows the cost: more 'Poor' and 'Standard' customers are now predicted as 'Good' (319 and 1582, versus 92 and 1092 for the base tree), which has to be weighed against the recall gain.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For a credit scoring model, the following evaluation metrics were considered critical for a positive business impact:

1. **Weighted Average F1-Score:**
* **Indication:** This metric provides a balanced measure of precision and recall for each class, weighted by the number of instances (support) in that class.
* **Business Impact:** In an imbalanced dataset like credit scores (where 'Standard' is likely the majority), overall accuracy can be misleading. Weighted F1-score gives a more realistic view of the model's performance across all classes, reflecting how well it performs on both the common and less common credit profiles, which is crucial for a comprehensive risk assessment. A higher weighted F1-score means the model is generally effective across the entire customer base.

2. **Recall for 'Poor' Credit Score (Class 0):**
* **Indication:** Out of all actual 'Poor' credit scores, how many did the model correctly identify?
* **Business Impact:** From Paisabazaar's perspective, missing a truly 'Poor' credit score (a false negative) is highly detrimental. It means potentially approving a loan for a high-risk individual who is likely to default, leading to financial losses. High recall for 'Poor' directly translates to reduced loan defaults and better risk mitigation.

3. **Precision for 'Good' Credit Score (Class 2):**
* **Indication:** Out of all instances the model predicted as 'Good', how many were actually 'Good'?
* **Business Impact:** Misclassifying a 'Standard' or 'Poor' customer as 'Good' (a false positive for 'Good' class) can lead to offering favorable terms or loans to individuals who don't deserve them, increasing the company's exposure to risk. High precision for 'Good' ensures that when the model identifies a 'Good' customer, it's highly reliable, leading to more confident and profitable lending decisions and proper allocation of premium products.

4. **ROC AUC Score (especially for 'Poor' and 'Good' classes):**
* **Indication:** The Area Under the Receiver Operating Characteristic Curve measures the model's ability to distinguish between classes across various probability thresholds. A higher AUC (closer to 1) indicates better discriminative power.
* **Business Impact:** High AUC for 'Poor' indicates the model's excellent ability to separate high-risk customers from others. High AUC for 'Good' indicates its ability to separate low-risk customers. This metric is crucial because it tells us how well the model can truly rank customers by risk, regardless of a specific classification threshold. This allows Paisabazaar to set optimal decision thresholds for loan approvals or product offerings based on their specific risk appetite.
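
The ranking interpretation can be illustrated with a tiny, made-up example: the one-vs-rest AUC for a class equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counting one half. The labels and scores below are invented for illustration and are not taken from the project data; `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# 1 = actual 'Poor', 0 = any other class; scores = predicted probability of 'Poor'.
y_is_poor = [1, 1, 1, 0, 0, 0, 0]
p_poor = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

auc_poor = pairwise_auc(y_is_poor, p_poor)
print(f"One-vs-rest AUC for 'Poor': {auc_poor:.3f}")
```

Here 11 of the 12 pairs are ordered correctly. Moving the decision threshold (say from 0.5 to 0.35) changes which customers are flagged as 'Poor', but not this ranking quality, which is why AUC is described as threshold-independent.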

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Based on the evaluation metrics from all the models, the **Optimized Random Forest Classifier** is chosen as the final prediction model.

**Reasons:**

1.  **Highest Overall Performance:**
    * **Accuracy:** It consistently achieved the highest overall accuracy (0.8102) among all models (base and optimized Logistic Regression, Decision Tree, Random Forest).
    * **Weighted F1-Score:** It also delivered the highest weighted F1-Score (0.8109), indicating the best balanced performance across all credit score classes, which is crucial for handling the imbalanced nature of the dataset effectively.

2.  **Strong Performance on Critical Classes ('Poor' and 'Good'):**
    * **Recall for 'Poor':** The Optimized Random Forest showed excellent recall for the 'Poor' class (0.86), meaning it's highly effective at identifying the vast majority of high-risk customers, directly minimizing potential loan defaults.
    * **Precision for 'Good':** It maintained strong precision for the 'Good' class (0.78), meaning fewer mistakes when identifying creditworthy individuals, leading to more secure and profitable lending.
    * **Balanced F1-Scores per Class:** The F1-scores for all three classes ('Poor': 0.82, 'Standard': 0.80, 'Good': 0.82) are consistently high and balanced, demonstrating the model's robust capability across the entire spectrum of creditworthiness.

3.  **Superior Discriminatory Power (AUC):**
    * The AUC scores for all classes were excellent and among the highest (AUC Poor: 0.94, AUC Standard: 0.88, AUC Good: 0.90), indicating the model's superior ability to differentiate between the credit score categories across different thresholds. This offers greater flexibility and reliability for business decision-making.

4.  **Robustness and Generalization:**
    * Random Forest, as an ensemble method, is inherently more robust to noise and overfitting compared to a single Decision Tree. The optimization further refines this robustness, leading to a model that is expected to generalize very well to new, unseen customer data in the real world.

While the optimized Decision Tree improved its recall for 'Poor' and 'Good' significantly, its overall accuracy (0.7334) and weighted F1-score (0.7353) remained well below those of the Optimized Random Forest (0.8102 and 0.8109). Logistic Regression, despite optimization, couldn't match the overall performance of either tree-based model. Therefore, the **Optimized Random Forest Classifier** provides the best balance of high accuracy, strong F1-scores, and excellent performance on the critical 'Poor' and 'Good' classes, making it the most suitable final prediction model for Paisabazaar.



### 3. Explain the model which you have used and the feature importance using any model explainability tool?

**The Chosen Model: Random Forest Classifier**

The Random Forest Classifier is an ensemble learning method that operates by constructing a multitude of decision trees during training and outputting the class that is the mode (most frequent) of the classes (classification) or mean prediction (regression) of the individual trees.

1.  **Bagging (Bootstrap Aggregating):**
    * Instead of building one large tree on the entire dataset, Random Forest builds many decision trees.
    * Each individual tree is trained on a **bootstrap sample** (a random sample of the original training data drawn with replacement). This introduces diversity among the trees.

2.  **Feature Randomness:**
    * When building each tree, at every split point, the algorithm doesn't consider all available features. Instead, it randomly selects a **subset of features** to choose from. This further decorrelates the trees, making the ensemble more robust.

3.  **Ensemble Prediction:**
    * For classification, when making a prediction on new data, each individual tree "votes" for a class. The Random Forest then aggregates these votes (e.g., by majority vote) to determine the final predicted class.
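
The three steps above can be sketched in a few lines of plain Python. This is a toy illustration of the mechanics only (bootstrap sampling, a random feature subset, majority voting), not scikit-learn's implementation, which combines trees by averaging their predicted class probabilities rather than counting hard votes. All function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Bagging: draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(feature_names, k, rng):
    """Feature randomness: pick k candidate features for one split."""
    return rng.sample(feature_names, k)

def majority_vote(tree_predictions):
    """Ensemble prediction: the most frequent class among the trees' votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(42)
features = ["Annual_Income", "Outstanding_Debt", "Credit_Mix", "Interest_Rate"]

sample = bootstrap_indices(n_rows=10, rng=rng)
candidates = random_feature_subset(features, k=2, rng=rng)
final_class = majority_vote(["Poor", "Standard", "Poor", "Good", "Poor"])

print("Bootstrap row indices:", sample)
print("Features considered at this split:", candidates)
print("Majority vote:", final_class)
```

Because rows are drawn with replacement, some indices typically repeat and others are left out of a given tree's sample, which is what makes the trees differ from one another.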

**Advantages for Credit Scoring:**
* **Reduced Overfitting:** By averaging predictions from many decorrelated trees, Random Forest significantly reduces variance and overfitting, leading to better generalization.
* **Handles Non-linearity:** Decision trees are inherently capable of capturing complex, non-linear relationships and interactions between features.
* **Robustness:** Less sensitive to outliers due to the nature of tree-based splits.
* **Built-in Feature Importance:** Random Forest provides a direct way to estimate feature importance.

**Feature Importance (Conceptual Explanation using Model Explainability)**

For Random Forest, feature importance is typically estimated using either **Mean Decrease Impurity (MDI)** (also known as Gini importance) or **Permutation Importance**.

* **Mean Decrease Impurity (MDI) / Gini Importance:**
    * This method calculates how much each feature contributes to the reduction of impurity (e.g., Gini impurity or entropy) across all decision trees in the forest.
    * When a feature is used to split a node, the impurity of the data after the split is lower than before. The more a feature reduces impurity across all splits in all trees, the higher its importance score.
    * **Business Indication:** Features with higher Gini importance scores are considered more influential in the model's decision-making process.

* **Permutation Importance:** (Often preferred as it's less biased than MDI for some scenarios)
    * This method involves shuffling the values of a single feature in the validation set and then measuring how much the model's performance (e.g., accuracy or F1-score) decreases.
    * If shuffling a feature significantly drops the model's performance, that feature is considered important.
    * **Business Indication:** This directly tells you how much the model *relies* on that specific feature for accurate predictions.
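
A minimal sketch of the permutation idea follows, using a made-up rule-based stand-in for the model and invented rows, so that the effect is easy to see. For the fitted Random Forest in the notebook, scikit-learn offers the same idea as `sklearn.inspection.permutation_importance`. The helper names below are hypothetical.

```python
import random

def accuracy(predict, rows, targets):
    """Share of rows for which the prediction equals the target."""
    return sum(predict(r) == t for r, t in zip(rows, targets)) / len(rows)

def permutation_importance_toy(predict, rows, targets, n_features, seed=0):
    """Drop in accuracy after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, targets)
    drops = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled_rows = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, shuffled_rows, targets))
    return drops

# Hypothetical stand-in model: predicts 'Poor' (0) when delayed payments exceed 10,
# otherwise 'Good' (2), and ignores the second feature entirely.
def toy_model(row):
    num_delayed_payments, _unused_feature = row
    return 0 if num_delayed_payments > 10 else 2

rows = [(2, 5), (15, 1), (20, 9), (3, 4), (12, 7), (1, 2), (18, 3), (4, 8)]
targets = [toy_model(r) for r in rows]  # labels the toy model fits perfectly

drops = permutation_importance_toy(toy_model, rows, targets, n_features=2)
print("Accuracy drop per feature:", drops)
```

The ignored second feature always shows a drop of exactly zero, while shuffling the first feature can only lower accuracy. A feature the model does not rely on cannot receive permutation importance, however plausible it looks from a business perspective.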

**Explanation using an Explainability Tool (Conceptual):**

* **SHAP Values:** SHAP values explain the prediction of an instance by computing the contribution of each feature to the prediction. For each individual customer, SHAP could tell us:
    * "This customer was predicted as 'Good' because their `Annual_Income` was high (+X contribution), their `Num_of_Delayed_Payment` was low (+Y contribution), but their `Outstanding_Debt` was moderately high (-Z contribution)."
    * This is incredibly valuable for loan officers to understand *why* a specific credit score was assigned, enabling transparency and trust.
    * They can also be aggregated to show **global feature importance**, which would align with the Gini/Permutation importance.

* **LIME (Local Interpretable Model-agnostic Explanations):** LIME helps explain individual predictions of any "black-box" model (like Random Forest) by locally approximating it with an interpretable model (e.g., a simple linear model).
    * **Business Indication:** For a specific loan applicant, LIME could highlight 2-3 most influential features that led to their predicted credit score, helping decision-makers quickly grasp the primary drivers for that particular case.

**Inferred Most Important Features for Credit Score Prediction (for Random Forest):**

Based on common credit risk factors and the nature of the data, the Random Forest model would likely identify the following features as most important:

1.  **Annual_Income** (or engineered ratios like Debt_to_Income_Ratio / Utilization_Income_Ratio): Directly impacts repayment capacity.
2.  **Num_of_Delayed_Payment / Delay_from_due_date / Payment_Consistency:** Direct indicators of past payment behavior and reliability.
3.  **Credit_Utilization_Ratio** (and potentially Outstanding_Debt if not dropped due to correlation): Reflects current debt burden relative to available credit, a key indicator of financial strain.
4.  **Interest_Rate:** Often a direct proxy of perceived risk, the model would learn its strong inverse relationship with 'Good' credit.
5.  **Credit_History_Age:** Longer, established credit history generally indicates more reliable borrowing behavior.
6.  **Num_Credit_Inquiries:** Frequent inquiries can suggest higher risk or desperate need for credit.
7.  **Credit_Mix:** A healthy mix of credit types shows responsible financial management.

These features, either individually or in combination, would have significantly contributed to reducing impurity and driving the splits in the many decision trees within the Random Forest, thereby being deemed highly important by the model. This knowledge is gold for Paisabazaar in refining their credit policies and focusing on the most impactful data points.
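
As a worked illustration of what "reducing impurity" means for a single split, the sketch below computes the Gini impurity before and after a hypothetical split. The class counts are invented for the example. MDI accumulates such decreases, weighted by the share of samples reaching each node, over all splits that use a feature and averages them across the trees.

```python
def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def impurity_decrease(parent, left, right):
    """Gini decrease achieved by splitting parent into left and right child nodes."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Invented class counts [Poor, Standard, Good] at one node and its two children.
parent = [40, 40, 20]
left = [35, 10, 5]   # e.g. rows with many delayed payments
right = [5, 30, 15]  # remaining rows

print(f"Parent Gini: {gini(parent):.3f}")
print(f"Impurity decrease from the split: {impurity_decrease(parent, left, right):.3f}")
```

In this example the parent impurity of 0.64 falls to a weighted 0.50 in the children, a decrease of 0.14 credited to the splitting feature.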

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
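
The two placeholder cells above could be filled along the following lines. The sketch is self-contained, so it trains a small stand-in `RandomForestClassifier` on synthetic data; in the notebook, the fitted optimized Random Forest and a few rows of `X_test` would take their place. The file name and the `_demo` variable names are arbitrary choices.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for the notebook's features and labels.
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train_demo, y_train_demo = X[:180], y[:180]
X_unseen = X[180:]  # rows the stand-in model never saw

# Stand-in for the fitted optimized Random Forest.
model = RandomForestClassifier(n_estimators=20, random_state=42)
model.fit(X_train_demo, y_train_demo)

# 1. Save the fitted model (file name is an arbitrary choice).
model_path = os.path.join(tempfile.mkdtemp(), "credit_score_rf.joblib")
joblib.dump(model, model_path)

# 2. Load it back and predict the unseen rows as a sanity check.
loaded_model = joblib.load(model_path)
predictions_match = bool((model.predict(X_unseen) == loaded_model.predict(X_unseen)).all())
print("Predictions identical after reload:", predictions_match)
```

Identical predictions before and after the round trip confirm that the file holds the fitted model. The saved file should be loaded with the same scikit-learn version that produced it.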

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Through this project, a powerful machine learning model was successfully developed to accurately predict individual credit scores for Paisabazaar. We poured effort into meticulously cleaning and preparing the data, crafting new features, and smartly balancing the dataset to ensure our model learns effectively from all customer profiles.

After rigorously implementing and optimizing several algorithms, our Optimized Random Forest Classifier emerged as the clear winner. This model isn't just accurate; it truly excels where it matters most for the business: it's incredibly effective at flagging high-risk 'Poor' credit scores, which helps us significantly reduce potential defaults. Equally important, it's highly reliable in identifying genuinely 'Good' credit scores, allowing us to confidently extend favorable offers to our most creditworthy customers.

Ultimately, this means there is now a robust, data-driven tool that will enable Paisabazaar to make smarter, more confident lending decisions, manage risk more effectively, and tailor financial products more precisely. It's a significant step towards boosting both efficiency and profitability.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***