
# 💳 Home Credit Default Risk — Exploratory Data Analysis

## 1. Goal
This assignment requires learners to:
- Understand the structure of a real-world business dataset
- Explore customer attributes and their relationship to default risk
- Practice handling missing values and class imbalance
- Formulate reasonable business-driven questions



## Problem 1: Load dataset
Load `application_train.csv` into a Pandas DataFrame.


In [None]:

import pandas as pd

# Load dataset
data = pd.read_csv("application_train.csv")
print("Shape:", data.shape)
data.head()



## Problem 2: Dataset overview
Check the first rows, data types, and summary statistics to understand the data structure.


In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe().T.head()


## Problem 3: Target variable (default risk)
The target variable is `TARGET`: 1 = default, 0 = non-default.


In [None]:

print(data['TARGET'].value_counts())
print(data['TARGET'].value_counts(normalize=True))



## Problem 4: Missing values
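Measure the share of missing values per column, visualize it, then drop columns with five or more missing values and any remaining rows that still contain missing values.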


In [None]:

import missingno as msno
msno.matrix(data)


In [None]:

missing_ratio = data.isnull().sum() / len(data)
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
print("Missing ratios (top 20):")
print(missing_ratio.head(20))


In [None]:

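# Drop columns with 5 or more missing values (keep those with at most 4).
# Note: dropna(axis=1, thresh=len(data)-5) would still keep columns with exactly 5 missing values.
data_reduced = data.loc[:, data.isnull().sum() < 5]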

# Drop remaining rows with missing values
data_reduced = data_reduced.dropna()

print("Shape before:", data.shape)
print("Shape after:", data_reduced.shape)
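
The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-missing values a column needs in order to be kept, not the number of missing values tolerated. The sketch below checks this on a toy frame (the column names and values are made up for illustration and are not taken from `application_train.csv`) and compares it with an explicit missing-count mask.

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows; all column names here are hypothetical
toy = pd.DataFrame({
    "no_missing": range(10),
    "four_missing": [np.nan] * 4 + list(range(6)),
    "five_missing": [np.nan] * 5 + list(range(5)),
    "six_missing": [np.nan] * 6 + list(range(4)),
})

n_missing = toy.isnull().sum()

# thresh = minimum number of NON-missing values required to keep a column
kept_thresh_5 = toy.dropna(axis=1, thresh=len(toy) - 5).columns.tolist()
kept_thresh_4 = toy.dropna(axis=1, thresh=len(toy) - 4).columns.tolist()

# Explicit rule: keep columns with fewer than 5 missing values
kept_mask = toy.loc[:, n_missing < 5].columns.tolist()

print("thresh=len-5:", kept_thresh_5)
print("thresh=len-4:", kept_thresh_4)
print("mask < 5    :", kept_mask)
```

With ten rows, `thresh=len(toy) - 5` still keeps the column with exactly five missing values, while `thresh=len(toy) - 4` and the mask `n_missing < 5` both drop it.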



## Additional Question 1: Class Imbalance
Visualize the percentage of defaults versus non-defaults.


In [None]:

import matplotlib.pyplot as plt

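# sort_index() fixes the order as 0, 1 so the readable labels below line up with the counts
counts = data_reduced['TARGET'].value_counts().sort_index()
plt.pie(counts, labels=["Non-default (0)", "Default (1)"], autopct="%.1f%%")
plt.title("Class Distribution: Default vs Non-Default")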
plt.show()
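
The chart shows the imbalance but does not handle it. One common option, sketched here as an assumption rather than a requirement of the assignment, is to derive per-class weights with the "balanced" heuristic `n_samples / (n_classes * class_count)`. The `target` series below is a hypothetical stand-in with invented counts, not the real label distribution.

```python
import pandas as pd

# Hypothetical stand-in for data_reduced['TARGET']: 92 non-defaults, 8 defaults
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

counts = target.value_counts().sort_index()

# How many non-defaults there are per default
imbalance_ratio = counts[0] / counts[1]

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = len(target) / (len(counts) * counts)

print("Counts:", counts.to_dict())
print("Non-default : default ratio:", imbalance_ratio)
print("Class weights:", class_weights.round(3).to_dict())
```

Weights like these could later be passed to a model that accepts per-class weights. For pure exploration, the ratio alone is often enough to show that accuracy would be a misleading metric.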



## Additional Question 2: Demographics and Risk
Explore relationships between demographic features and default risk.


In [None]:

import seaborn as sns

sns.barplot(x="NAME_CONTRACT_TYPE", y="TARGET", data=data_reduced)
plt.title("Default Rate by Contract Type")
plt.show()

sns.barplot(x="NAME_FAMILY_STATUS", y="TARGET", data=data_reduced)
plt.title("Default Rate by Family Status")
plt.xticks(rotation=45)
plt.show()



### Age and default risk
Convert `DAYS_BIRTH` to age in years.


In [None]:

data_reduced['AGE'] = (-data_reduced['DAYS_BIRTH'] / 365).astype(int)
sns.histplot(data=data_reduced, x="AGE", hue="TARGET", bins=30, multiple="stack")
plt.title("Age Distribution by Default Status")
plt.show()
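
A stacked histogram shows counts, which makes default rates hard to compare between age groups. As a complementary sketch, the default rate can be computed per age band with `pd.cut`. The rows, the band edges, and the `AGE_BAND` column are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy rows (invented): DAYS_BIRTH is negative, counted back from the application date
toy = pd.DataFrame({
    "DAYS_BIRTH": [-8000, -9000, -10000, -12000, -15000, -16000, -20000, -22000],
    "TARGET":     [1, 0, 1, 0, 0, 1, 0, 0],
})
toy["AGE"] = (-toy["DAYS_BIRTH"] // 365).astype(int)

# Left-closed bands: [20, 30), [30, 40), [40, 50), [50, 70)
toy["AGE_BAND"] = pd.cut(
    toy["AGE"],
    bins=[20, 30, 40, 50, 70],
    right=False,
    labels=["20-29", "30-39", "40-49", "50-69"],
)

# The mean of a 0/1 target is the default rate; n shows how many rows back each rate
summary = toy.groupby("AGE_BAND", observed=True)["TARGET"].agg(
    default_rate="mean", n="count"
)
print(summary)
```

Reporting `n` next to each rate helps avoid over-reading bands that contain only a handful of applicants.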



## Additional Question 3: Income and Credit Ratios
Explore income, credit, and credit-to-income ratio.


In [None]:

data_reduced['CREDIT_INCOME_RATIO'] = data_reduced['AMT_CREDIT'] / data_reduced['AMT_INCOME_TOTAL']

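# common_norm=False normalizes each TARGET group separately, so the rare default class stays visible
sns.histplot(data=data_reduced, x="CREDIT_INCOME_RATIO", hue="TARGET", bins=50, element="step", stat="density", common_norm=False)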
plt.title("Credit-to-Income Ratio Distribution by Default Status")
plt.xlim(0, 5)
plt.show()
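
A numeric summary can back up what the histogram suggests. The sketch below builds the ratio on a few invented rows and compares its median per class. The amounts are made up, so the resulting numbers say nothing about the real data.

```python
import pandas as pd

# Invented amounts for illustration only
toy = pd.DataFrame({
    "AMT_CREDIT":       [200000, 450000, 900000, 300000, 600000, 1000000],
    "AMT_INCOME_TOTAL": [100000, 150000, 180000, 200000, 120000, 250000],
    "TARGET":           [0, 0, 1, 0, 1, 0],
})

toy["CREDIT_INCOME_RATIO"] = toy["AMT_CREDIT"] / toy["AMT_INCOME_TOTAL"]

# The median is less sensitive than the mean to a few very large loans
median_by_target = toy.groupby("TARGET")["CREDIT_INCOME_RATIO"].median()
print(median_by_target)
```

On the real data, the same `groupby` call on `data_reduced` would give one median per class to compare against the plotted distributions.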



## Additional Question 4: External Scores
Explore the `EXT_SOURCE_*` variables, normalized scores from external data sources that act as external credit scores.


In [None]:

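ext_sources = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]
# The EXT_SOURCE columns contain missing values, so Problem 4 removed them from data_reduced.
# Use the full data instead; corr() skips missing values pair by pair.
corr = data[ext_sources + ["TARGET"]].corr()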
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation of External Scores with TARGET")
plt.show()
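
Because score columns like these are only partly filled, it helps to know how `DataFrame.corr` treats missing values: each pair of columns is correlated on the rows where both are present. The sketch below checks that on a toy frame with invented scores. Only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Invented scores; EXT_SOURCE_1 is missing in two of six rows
toy = pd.DataFrame({
    "EXT_SOURCE_1": [0.1, np.nan, 0.5, np.nan, 0.9, 0.3],
    "EXT_SOURCE_2": [0.2, 0.4, 0.6, 0.8, 0.7, 0.1],
    "TARGET":       [1, 1, 0, 0, 0, 1],
})

# Share of missing values per column
missing_share = toy.isnull().mean()

# Pairwise-complete correlation matrix
corr_pairwise = toy.corr()

# Same pair computed by hand on the rows where both values exist
complete_pair = toy[["EXT_SOURCE_1", "TARGET"]].dropna()
corr_manual = complete_pair["EXT_SOURCE_1"].corr(complete_pair["TARGET"])

print(missing_share)
print(len(complete_pair), "rows used for EXT_SOURCE_1 vs TARGET")
print(corr_pairwise["TARGET"])
```

Each cell of the heatmap can therefore rest on a different number of rows, which is worth stating when the scores are compared with one another.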



## Problem 7: Correlation coefficients
Find the 10 features most strongly correlated with `TARGET`, ranked by absolute correlation.


In [None]:

corr_all = data_reduced.corr(numeric_only=True)
# Rank by absolute correlation; head(11) because TARGET itself comes first (correlation 1.0)
top10_corr = corr_all['TARGET'].abs().sort_values(ascending=False).head(11).index

plt.figure(figsize=(10,8))
sns.heatmap(data_reduced[top10_corr].corr(), annot=True, cmap="coolwarm", square=True)
plt.title("Top 10 Correlated Features with TARGET")
plt.show()

print("Top correlated features:")
print(corr_all['TARGET'].abs().sort_values(ascending=False).head(11))
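
Ranking by absolute value hides whether a feature moves with or against default. A small sketch of one way to keep the sign while still ordering by strength follows. The feature names `score`, `debt`, and `noise` are hypothetical and the values are invented.

```python
import pandas as pd

# Invented data: one feature falls with TARGET, one rises, one is unrelated
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 1, 1],
    "score":  [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],
    "debt":   [1, 2, 2, 3, 3, 5],
    "noise":  [1, 2, 3, 1, 2, 3],
})

# Signed correlation of every feature with TARGET (TARGET itself removed)
corr_target = toy.corr()["TARGET"].drop("TARGET")

# Order by strength, but report the signed values
order = corr_target.abs().sort_values(ascending=False).index
ranked = corr_target.reindex(order)
print(ranked)
```

In this toy case `score` ranks first with a negative sign and `debt` second with a positive sign, a distinction the absolute ranking alone would not show.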



## Problem 8: Business Interpretation
Explain findings in terms of business strategy:
- High credit-to-income ratios may signal higher risk.
- Younger applicants may have higher default rates.
- External credit scores are predictive and can serve as proxies for creditworthiness.



## 9. Summary
This assignment demonstrates the ability to:
- Explore dataset structure and missingness
- Handle missing values carefully
- Examine class imbalance
- Investigate relationships between demographics, income ratios, and default risk
- Identify the features most strongly correlated with default
