# Exploratory Data Analysis – Credit Score Classification

This notebook presents a **text-driven Exploratory Data Analysis (EDA)** for a credit score classification problem.

The working assumption is that the dataset is stored in a file named `train.csv` in the same directory as this notebook.


## 1. Imports and data loading

In this section we load the dataset and take a first look at the head of the table. The shape and column types are checked in Section 2.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', lambda x: f'{x:0.4f}')

df = pd.read_csv('train.csv')
df.head()

## 2. Dataset structure and basic properties

In this project each row represents a **(customer, month)** observation. The dataset is panel-like: the same customer appears in multiple months.

Key questions in this section:
- How many rows and columns does the dataset have?
- How many unique customers and months are present?
- What is the granularity of the data?


In [None]:
print("Shape (rows, columns):", df.shape)
print("\nUnique customers:", df['Customer_ID'].nunique())
print("Months present:", df['Month'].unique())

print("\nColumn data types:\n")
print(df.dtypes)
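The (customer, month) granularity described above can be checked rather than assumed. The sketch below uses a hypothetical helper, `panel_summary`, which is not part of the original notebook. It runs on a toy frame and assumes the `Customer_ID` and `Month` column names used in this notebook.

```python
import pandas as pd

def panel_summary(frame, id_col='Customer_ID', time_col='Month'):
    """Summarise the (customer, month) granularity of a panel-like table."""
    months_per_customer = frame.groupby(id_col)[time_col].nunique()
    return {
        'n_customers': int(months_per_customer.shape[0]),
        'min_months': int(months_per_customer.min()),
        'max_months': int(months_per_customer.max()),
        # Rows that repeat an already seen (customer, month) pair
        'duplicate_pairs': int(frame.duplicated(subset=[id_col, time_col]).sum()),
    }

# Toy data for illustration only
toy = pd.DataFrame({
    'Customer_ID': ['C1', 'C1', 'C2', 'C2', 'C2'],
    'Month': ['January', 'February', 'January', 'February', 'February'],
})
summary = panel_summary(toy)
print(summary)
```

On the real data, `panel_summary(df)` would show whether every customer has the same number of months and whether any (customer, month) pair is duplicated.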

### 2.1. High-level description

At a high level the dataset contains:

- roughly **100,000 observations** (monthly records),
- **28 columns**, covering identification, demographics, income, credit products, payment history, behavioural indicators and the target label,
- approximately **12,500 unique customers**, each observed across multiple months (January–August).

The target variable is **`Credit_Score`**, a three-class label:

- `Good` – low credit risk,
- `Standard` – medium credit risk,
- `Poor` – high credit risk.


## 3. Target variable: `Credit_Score`

Before analysing features, we need to understand the distribution of the target classes. This directly affects model choice, metrics and evaluation strategy.

In [None]:
target_col = 'Credit_Score'
value_counts = df[target_col].value_counts()
value_pct = (value_counts / len(df) * 100).round(2)
display(pd.DataFrame({'count': value_counts, 'percentage': value_pct}))

In [None]:
plt.figure(figsize=(5, 4))
plt.bar(value_counts.index.astype(str), value_counts.values)
plt.xlabel('Credit_Score')
plt.ylabel('Number of observations')
plt.title('Class distribution – Credit_Score')
plt.tight_layout()
plt.show()

### 3.1. Interpretation

The class distribution is moderately imbalanced:

- `Standard` is the most frequent label, representing a little more than half of all records,
- `Poor` constitutes roughly one third of the dataset,
- `Good` is the smallest class.

Because of this imbalance, **macro-averaged F1** is a more robust primary metric than raw accuracy. It penalises models that ignore minority classes and better reflects real performance across all risk segments.
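To make the argument concrete, the sketch below compares accuracy and macro-averaged F1 for a baseline that always predicts the majority class. The ten labels are a toy example chosen only to mimic the ordering of class sizes described above; they are not the dataset's actual proportions. The `macro_f1` helper is written out by hand for illustration; in practice `sklearn.metrics.f1_score(..., average='macro')` is the usual route.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (F1 = 2TP / (2TP + FP + FN))."""
    scores = []
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly gets F1 = 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (illustrative only): Standard > Poor > Good
y_true = ['Standard'] * 5 + ['Poor'] * 3 + ['Good'] * 2
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f'accuracy={accuracy:.3f}, macro F1={score:.3f}')
```

The baseline reaches 50% accuracy on this toy sample while its macro F1 stays near 0.22, because two of the three classes are never predicted.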

## 4. Missing values and data quality

Next, we examine missing values and overall data quality. This is critical in a credit-scoring setting, where data quality issues can easily translate into biased or unstable models.

In [None]:
missing_counts = df.isnull().sum().sort_values(ascending=False)
missing_pct = (missing_counts / len(df) * 100).round(2)
missing_df = pd.DataFrame({'missing_count': missing_counts, 'missing_pct': missing_pct})
missing_df.head(15)

In [None]:
top_n = 15
top_missing = missing_df.head(top_n)

plt.figure(figsize=(8, 5))
plt.barh(top_missing.index, top_missing['missing_pct'])
plt.xlabel('Missing values [%]')
plt.title('Columns with the highest share of missing values')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

### 4.1. Key observations on missingness and quality

Several columns exhibit substantial missingness:

- **`Monthly_Inhand_Salary`**, **`Type_of_Loan`**, **`Credit_History_Age`**, **`Num_of_Delayed_Payment`**, **`Amount_invested_monthly`**, **`Num_Credit_Inquiries`** and **`Monthly_Balance`** have notable fractions of missing values,
- identification-like fields such as `Name` may also be missing but are not relevant as predictive features.

In addition to missing values, there are evident data-quality issues:

- numeric fields stored as text (for example: `Annual_Income`, `Outstanding_Debt`, `Amount_invested_monthly`, `Monthly_Balance`),
- invalid values in columns that should be strictly numeric (e.g. negative ages, unrealistic numbers of bank accounts or credit cards, extremely high interest rates),
- composite textual formats such as `Credit_History_Age` (e.g. `'10 Years and 5 Months'`).

These issues need to be addressed systematically in preprocessing: parsing numeric values, enforcing valid ranges, handling outliers and deciding on an imputation strategy for missing values.
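A minimal sketch of the first two steps (parsing numeric values and enforcing valid ranges) might look as follows. The helper name `clean_numeric` is hypothetical. The code assumes that stray characters appear as leading or trailing underscores or spaces, and the 0–120 age range is only an illustrative choice. Both assumptions should be checked against the raw data.

```python
import pandas as pd

def clean_numeric(series, lower=None, upper=None):
    """Parse a text column to float and blank out values outside [lower, upper]."""
    # Assumption: stray characters are leading/trailing underscores or spaces
    stripped = series.astype(str).str.strip('_ ')
    # Anything that still cannot be parsed becomes NaN
    values = pd.to_numeric(stripped, errors='coerce').astype('float64')
    if lower is not None:
        values = values.where(values >= lower)
    if upper is not None:
        values = values.where(values <= upper)
    return values

# Toy values for illustration only
raw_age = pd.Series(['23', '28_', '-500', None, 'abc'])
age = clean_numeric(raw_age, lower=0, upper=120)
print(age.tolist())
```

Out-of-range values are set to missing rather than clipped, which leaves the decision about them to the later imputation step.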

## 5. Feature overview

Below is a structured overview of the most important feature groups. The focus is on **semantics** and **expected relationship with credit risk**, rather than on raw distributions only.


### 5.1. Identification and time

- **`ID`** – technical row identifier, not informative for modelling.
- **`Customer_ID`** – unique customer identifier. The presence of repeated `Customer_ID` values across months turns the dataset into a panel. During model evaluation, train/test splits must be constructed so that records belonging to the same customer do not leak across folds.
- **`Month`** – calendar month (January–August). Can be used to check temporal drift or seasonality in behaviour.
- **`Name`** and **`SSN`** – personally identifiable information; such columns should not be used as model inputs and are only useful for linkage and consistency checks.
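The leakage concern around `Customer_ID` can be addressed by splitting on customers instead of rows. The sketch below is a hand-rolled illustration with a hypothetical helper, `split_by_customer`, run on toy data. scikit-learn's `GroupKFold` or `GroupShuffleSplit` serve the same purpose for cross-validation.

```python
import numpy as np
import pandas as pd

def split_by_customer(frame, group_col='Customer_ID', test_frac=0.2, seed=0):
    """Split rows so that each customer lands entirely in train or in test."""
    rng = np.random.default_rng(seed)
    customers = np.array(sorted(frame[group_col].dropna().unique()))
    shuffled = rng.permutation(customers)
    n_test = max(1, int(round(len(shuffled) * test_frac)))
    test_ids = shuffled[:n_test].tolist()
    in_test = frame[group_col].isin(test_ids)
    # Rows with a missing identifier stay in the training part
    return frame[~in_test], frame[in_test]

# Toy panel: 5 customers observed in 2 months each
toy = pd.DataFrame({
    'Customer_ID': ['C1', 'C1', 'C2', 'C2', 'C3', 'C3', 'C4', 'C4', 'C5', 'C5'],
    'Month': ['January', 'February'] * 5,
})
train_part, test_part = split_by_customer(toy, test_frac=0.4, seed=42)
shared = set(train_part['Customer_ID']) & set(test_part['Customer_ID'])
print(len(train_part), len(test_part), shared)
```

No customer appears on both sides of the split, so repeated monthly records cannot leak information from training into evaluation.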


### 5.2. Demographics and income

- **`Age`** – customer age. In the raw data it is stored as text and occasionally contains invalid values (e.g. negative ages or strings with suffixes). After cleaning, age is expected to be a strong risk driver: very young customers typically exhibit higher risk.
- **`Occupation`** – categorical variable describing the customer's profession (e.g. Engineer, Scientist, etc.). It is an indirect proxy for income stability and social profile.
- **`Annual_Income`** – annual gross income, stored as text. After conversion to a numeric type and appropriate outlier handling, this feature captures the earning capacity of a customer.
- **`Monthly_Inhand_Salary`** – net monthly salary. The distribution is positively skewed: most customers earn moderate amounts, while a small subset earns significantly more. This variable is essential for evaluating affordability and capacity to service debt.


### 5.3. Accounts and credit products

- **`Num_Bank_Accounts`** – number of bank accounts. The raw data include unrealistic values (e.g. negative counts or extremely high numbers), so validation is required. Reasonable ranges can carry information on financial sophistication and fragmentation of funds.
- **`Num_Credit_Card`** – number of credit cards. As above, extreme values must be treated as data errors. Medium values may correspond to typical consumer behaviour, while very high counts could signal risk or synthetic patterns.
- **`Num_of_Loan`** – number of active loans, originally stored as text. Once standardised, it reflects the customer's overall exposure to credit.
- **`Type_of_Loan`** – multi-valued textual description of loan types held by the customer (e.g. Auto Loan, Home Loan, Credit-Card Loan). For modelling, this variable should be decomposed into binary indicators for each loan category.
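The decomposition of `Type_of_Loan` into binary indicators could be sketched as below. The helper `loan_type_indicators` is hypothetical, and the code assumes that loan types are separated by commas and possibly the word "and". The exact separator should be verified against the raw column.

```python
import pandas as pd

def loan_type_indicators(series):
    """One 0/1 column per loan type found in a multi-valued text column."""
    # Assumption: items are separated by commas and/or the word 'and'
    cleaned = series.fillna('').str.replace(r'\band\b', ',', regex=True)
    tokens = cleaned.str.split(',').explode().str.strip()
    tokens = tokens[tokens != '']
    dummies = pd.get_dummies(tokens, dtype=int)
    # Collapse back to one row per original record; rows with no loans get zeros
    return dummies.groupby(level=0).max().reindex(series.index, fill_value=0)

# Toy values for illustration only
loans = pd.Series([
    'Auto Loan, Personal Loan, and Home Loan',
    None,
    'Home Loan',
    'Auto Loan, Auto Loan',
])
indicators = loan_type_indicators(loans)
print(indicators)
```

A repeated loan type within one record still yields a single indicator of 1. If the number of loans of each type matters, `sum()` could replace `max()` in the final aggregation.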


### 5.4. Credit history and delinquencies

- **`Interest_Rate`** – average interest rate across active obligations. The distribution contains a reasonable central range and a few extreme outliers that must be handled separately.
- **`Delay_from_due_date`** – how many days a payment is delayed relative to the due date. While most values fall within a plausible range, there are cases with negative values, which should be either interpreted as early payment or treated as invalid.
- **`Num_of_Delayed_Payment`** – total number of delayed payments. This column is stored as text and includes missing values. After cleaning it is expected to be one of the strongest predictors of `Credit_Score`.
- **`Credit_History_Age`** – age of the credit history in a composite string format (e.g. `'10 Years and 5 Months'`). A robust preprocessing strategy converts this into total months of credit history, which is a standard feature in real-world scoring systems.
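The conversion of `Credit_History_Age` into total months can be sketched with a small regular expression. The function name `history_age_to_months` is hypothetical, and the pattern assumes the `'<N> Years and <M> Months'` layout shown in the example above. Any other layout returns `None`.

```python
import re

# Matches e.g. '10 Years and 5 Months' (singular forms are accepted too)
_HISTORY_PATTERN = re.compile(r'(\d+)\s*Years?\s*and\s*(\d+)\s*Months?', re.IGNORECASE)

def history_age_to_months(text):
    """Convert '<N> Years and <M> Months' to a total number of months."""
    if not isinstance(text, str):
        return None  # missing values (NaN / None) stay missing
    match = _HISTORY_PATTERN.search(text)
    if match is None:
        return None  # unexpected format
    years, months = int(match.group(1)), int(match.group(2))
    return years * 12 + months

print(history_age_to_months('10 Years and 5 Months'))
```

On the full table this could be applied as `df['Credit_History_Age'].map(history_age_to_months)`.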


### 5.5. Debt, utilisation and cash flow

- **`Outstanding_Debt`** – total outstanding debt, stored as text. After conversion to numeric type and trimming of extreme outliers, this feature captures direct exposure to credit.
- **`Credit_Utilization_Ratio`** – proportion of the credit limit that is currently used. This is one of the most important risk indicators in consumer credit. High utilisation typically correlates strongly with elevated risk.
- **`Total_EMI_per_month`** – total monthly instalments (EMIs) across all loans. This reflects the fixed monthly repayment burden on the customer.
- **`Amount_invested_monthly`** – monthly investment amount (savings, investments). Missing values and textual numeric representations must be resolved, but this feature can act as a proxy for financial discipline and surplus cash.
- **`Monthly_Balance`** – monthly balance after accounting for income and expenses, also stored as text. A positive balance indicates a more comfortable financial position, whereas negative or very low balances can signal stress.


### 5.6. Behavioural and risk-profile features

- **`Payment_of_Min_Amount`** – whether the customer pays only the minimum amount due on credit products. Habitually paying only the minimum is a classic red flag in credit risk management.
- **`Payment_Behaviour`** – categorical variable describing combined spending and payment patterns (e.g. High_spent / Small_value_payments). Correct encoding of this feature can significantly improve model performance, as it condenses complex behavioural information.
- **`Credit_Mix`** – high-level categorisation of the customer's credit portfolio quality (Good / Standard / Bad). It is a compressed proxy for overall risk profile and tends to correlate strongly with the final credit score.
- **`Changed_Credit_Limit`** – amount of recent change in credit limit. Positive changes may reflect lender confidence, while reductions can be a reaction to perceived risk.
- **`Num_Credit_Inquiries`** – number of credit inquiries. Higher values often indicate aggressive credit seeking and are consistently associated with higher default rates in real-world portfolios.


## 6. Summary statistics for key numeric features

Here we build a numeric view of the most important quantitative variables.
Many of them are stored as `object` and need to be converted to numeric with coercion.


In [None]:
numeric_candidates = [
    'Age', 'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts',
    'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan', 'Delay_from_due_date',
    'Num_of_Delayed_Payment', 'Changed_Credit_Limit', 'Num_Credit_Inquiries',
    'Outstanding_Debt', 'Credit_Utilization_Ratio', 'Total_EMI_per_month',
    'Amount_invested_monthly', 'Monthly_Balance'
]

existing_numeric = [c for c in numeric_candidates if c in df.columns]
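# errors='coerce' turns every value that cannot be parsed as a number into NaN,
# including numbers with stray text suffixes, so check the 'count' row below.
df_num = df[existing_numeric].apply(pd.to_numeric, errors='coerce')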
df_num.describe().T

### 6.1. High-level interpretation of distributions

From the summary statistics we can typically observe that:

- income-related variables (`Annual_Income`, `Monthly_Inhand_Salary`) are **right-skewed**, with a long tail of high-income customers,
- debt-related measures (`Outstanding_Debt`, `Total_EMI_per_month`) display substantial variance and a few extreme outliers,
- utilisation (`Credit_Utilization_Ratio`) is concentrated in a mid-range band (for example around 30–40%); how it relates to the credit score is examined in Section 8,
- delinquency measures (`Delay_from_due_date`, `Num_of_Delayed_Payment`) have a mass at 0 or low values, with a fraction of customers exhibiting significantly higher levels of arrears.
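Skewness and outlier claims like those above can be quantified instead of read off the summary table. The sketch below uses a hypothetical helper, `skew_and_outlier_share`, and the common 1.5 × IQR rule as one possible outlier definition. The toy series stands in for an income-like column.

```python
import pandas as pd

def skew_and_outlier_share(series, k=1.5):
    """Return (sample skewness, share of values outside the k * IQR fences)."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return float(values.skew()), float(outside.mean())

# Toy values for illustration only: one extreme value in a small sample
toy_income = pd.Series([1, 2, 2, 3, 3, 3, 4, 50], dtype=float)
skewness, outlier_share = skew_and_outlier_share(toy_income)
print(f'skewness={skewness:.2f}, outlier share={outlier_share:.3f}')
```

A positive skewness indicates a right tail. Applying the helper column by column, for example with `df_num.apply(skew_and_outlier_share)`, gives a quick ranking of the variables that need outlier handling.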


## 7. Correlation structure among numeric features

While tree-based models do not require strict linear independence, it is useful to understand how the main numeric features relate to each other. This also helps in spotting redundant variables or potential leakage.

In [None]:
corr = df_num.corr()

plt.figure(figsize=(9, 7))
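# Fix the colour scale to [-1, 1] with a diverging map so the sign of a correlation is readable
im = plt.imshow(corr, aspect='auto', cmap='coolwarm', vmin=-1, vmax=1)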
plt.colorbar(im, fraction=0.046, pad=0.04)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.index)), corr.index)
plt.title('Correlation matrix – numeric features')
plt.tight_layout()
plt.show()

### 7.1. Interpretation

The correlation heatmap typically reveals:

- moderate positive relationships between different measures of indebtedness (`Outstanding_Debt`, `Total_EMI_per_month`, `Credit_Utilization_Ratio`),
- expected links between income-related variables and affordability metrics (`Annual_Income`, `Monthly_Inhand_Salary`, `Monthly_Balance`),
- limited linear correlation for several of the variables converted from text, whose relationships with other features may still be picked up by non-linear models.

No single pair of variables is perfectly collinear, but there are clear clusters of related features. This is consistent with using non-linear models with automatic interaction handling (such as gradient boosting), which tolerate correlated inputs well.
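A heatmap is hard to read with many columns, so it can help to list the strongest pairs explicitly. The sketch below uses a hypothetical helper, `top_correlated_pairs`, on toy data. It keeps only the upper triangle of the correlation matrix so that each pair appears once.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=5):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep the strict upper triangle: drops the diagonal and mirrored duplicates
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy values for illustration only
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.5],
    'c': [4.0, 1.0, 3.0, 2.0],
})
pairs = top_correlated_pairs(toy)
print(pairs)
```

With the notebook's data the call would be `top_correlated_pairs(df_num, n=10)`.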

## 8. Relationship between key features and `Credit_Score`

In this final EDA section we focus on intuitive relationships between selected drivers and the target. The goal is not to build a model yet, but to check whether the data align with risk intuition.


In [None]:
features_to_inspect = [
    'Credit_Utilization_Ratio',
    'Num_of_Delayed_Payment',
    'Num_Credit_Inquiries',
    'Total_EMI_per_month'
]

for col in features_to_inspect:
    if col in df_num.columns:
        plt.figure(figsize=(7, 4))
        for label in df[target_col].dropna().unique():
            mask = df[target_col] == label
            # density=True puts classes of different size on a comparable scale
            plt.hist(df_num.loc[mask, col].dropna(), bins=40, alpha=0.4, density=True, label=str(label))
        plt.xlabel(col)
        plt.ylabel('Density')
        plt.title(f'{col} vs Credit_Score')
        plt.legend(title='Credit_Score')
        plt.tight_layout()
        plt.show()

### 8.1. Qualitative insights

From these comparisons one typically observes patterns such as:

- customers labelled **`Poor`** tend to exhibit higher utilisation ratios, more delayed payments, more credit inquiries and heavier EMI burdens,
- customers labelled **`Good`** tend to have lower utilisation, fewer delinquencies and more moderate inquiry levels,
- the **`Standard`** class usually sits between these extremes, representing a medium-risk population.

These relationships are aligned with standard credit risk intuition and provide a sanity check that the dataset is coherent from a business perspective.
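The visual comparison can be complemented by a small table of per-class medians, which are less sensitive to the extreme values noted earlier than means are. The helper `class_medians` is hypothetical, and the toy numbers below are invented purely to show the mechanics. They are not results from the dataset.

```python
import pandas as pd

def class_medians(features, target):
    """Median of each numeric feature within each target class."""
    return features.groupby(target).median()

# Toy values for illustration only
toy_features = pd.DataFrame({
    'Num_of_Delayed_Payment': [2, 4, 9, 11, 20, 22],
    'Num_Credit_Inquiries': [1, 2, 4, 5, 8, 9],
})
toy_target = pd.Series(
    ['Good', 'Good', 'Standard', 'Standard', 'Poor', 'Poor'],
    name='Credit_Score',
)
medians = class_medians(toy_features, toy_target)
print(medians)
```

On the real data the equivalent call would be `class_medians(df_num[features_to_inspect], df[target_col])`.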

## 9. EDA summary

Key takeaways from the exploratory analysis:

1. The dataset is a **panel of monthly customer-level records**, with a rich mix of demographic, financial, behavioural and credit-history features.
2. The target variable `Credit_Score` is moderately imbalanced across three classes (Good / Standard / Poor), which motivates the use of macro-averaged F1 as a core evaluation metric.
3. Data quality issues are non-trivial: many numeric fields are stored as text, there are missing values in several critical columns, and some records contain clearly invalid or extreme values. Robust preprocessing is essential.
4. From a risk perspective, the most promising predictors are linked to **delinquencies, utilisation, debt burden, inquiries and payment behaviour**. These are classic drivers in real-world credit scoring.
5. The correlation structure shows clusters of related variables rather than strict multicollinearity, which supports the use of **non-linear, tree-based models** that can exploit interactions.
6. Qualitative comparisons across `Credit_Score` classes confirm that higher-risk labels correspond to patterns typically associated with financial stress: higher utilisation, more delayed payments and more credit inquiries.

This concludes the EDA phase and provides a solid foundation for the **modelling & evaluation** stage, where different algorithms (including gradient boosting and random forests) can be benchmarked and tuned for the credit scoring task.