<a href="https://colab.research.google.com/github/AMANDASTHE/mcom-prep-portfolio/blob/main/projects/week1-intro-eda/notebooks/01_initial_eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📘 Week 1: Initial Exploratory Data Analysis (EDA)
### Project: Credit Risk (Default Prediction)

This notebook performs a structured exploratory data analysis (EDA) on the Week 1 dataset.  
The goal is to understand data quality, variable distributions, correlations, outliers, and early risk signals.

---

## 🔍 Objectives
- Understand the dataset structure and variable types.
- Inspect missing data and potential data quality issues.
- Explore distributions of numeric and categorical variables.
- Compute correlations between features and with the target.
- Identify early trends related to credit risk.
- Produce insights to inform Week 2 (feature engineering + modelling).

---

## 📂 Dataset
File: `data/50k_dataset/credit_risk_dataset_50k.csv`  
Rows: 50,000  
Domain: Credit default prediction  
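Target variable: `default` (the `df.head()` preview below shows no column with this name; `WasEver60DPD` looks like the closest candidate and should be confirmed)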

Let's begin.


In [10]:
# Core libraries
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
pd.set_option('display.max_columns', None)
sns.set(style="whitegrid")


In [16]:
url = "https://raw.githubusercontent.com/AMANDASTHE/mcom-prep-portfolio/main/projects/week1-intro-eda/data/raw/credit_risk_dataset_50k.csv"

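# Parse the timestamp column so it is not treated as a free-text categorical later
df = pd.read_csv(url, parse_dates=["LastPaymentDate"])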

df.head()



| | CustomerID | Age | Income | CreditLimit | CurrentBalance | Utilisation | DelinquencyStatus | DaysPastDue | Region | EmploymentLength | CreditScore | TenureMonths | LastPaymentAmount | LastPaymentDate | WasEver60DPD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 59 | 19836.683698 | 50986.253155 | 35087.886588 | 0.688183 | 3 | 102 | Other | 27 | 598.411178 | 101 | 1058.710933 | 2023-01-27 10:30:11 | 0 |
| 1 | 2 | 72 | 35851.242779 | 4318.061619 | 3157.630868 | 0.731261 | 3 | 81 | Gauteng | 17 | 592.657601 | 166 | 168.370142 | 2024-03-15 09:01:18 | 0 |
| 2 | 3 | 49 | 25922.242386 | 31455.71958 | 21160.175507 | 0.672697 | 0 | 0 | Other | 7 | 397.834314 | 68 | 1967.250509 | 2023-12-28 07:26:25 | 0 |
| 3 | 4 | 35 | 34319.612887 | 30891.788111 | 28524.320248 | 0.923363 | 5 | 25 | Gauteng | 8 | 531.20048 | 139 | 3931.987364 | 2023-08-20 03:53:06 | 0 |
| 4 | 5 | 63 | 49103.692491 | 34762.55522 | 24843.309915 | 0.714657 | 0 | 0 | Gauteng | 25 | 637.040174 | 34 | 203.924929 | 2022-06-04 04:17:24 | 0 |


## 📊 Dataset Snapshot

We start with an overview of the dataset structure, variable types, and completeness.


In [None]:
print("Shape:", df.shape)

print("\n--- Info ---")
df.info()

print("\n--- Summary Statistics ---")
df.describe(include='all')
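
The snapshot above does not check for duplicates, which the findings template later asks about. A minimal sketch of such a check follows, run on a small made-up frame rather than the real `df`. The helper name `duplicate_report` is hypothetical, and treating `CustomerID` as a unique key is an assumption based on the preview.

```python
import pandas as pd


def duplicate_report(frame: pd.DataFrame, key: str) -> dict:
    """Count fully duplicated rows and repeated values of a key column."""
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "duplicate_keys": int(frame[key].duplicated().sum()),
    }


# Toy stand-in for df; the values are illustrative only.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3, 3],
    "Age": [59, 72, 72, 49, 35],
})

report = duplicate_report(toy, key="CustomerID")
print(report)
```

On the real data the call would be `duplicate_report(df, key="CustomerID")`. A repeated key with differing values (as for customer 3 in the toy frame) usually deserves a closer look than an exact duplicate row.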



## 🧩 Missing Values Analysis

Understanding missingness helps guide imputation and feature engineering.


In [None]:
missing = df.isna().sum().sort_values(ascending=False)
missing[missing > 0]
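
Raw counts are easier to act on when shown next to the share of rows affected. The sketch below builds such a table on a small made-up frame; the helper name `missing_summary` is hypothetical and the toy values are not taken from the dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps.
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for df; the values are illustrative only.
toy = pd.DataFrame({
    "Income": [19836.7, np.nan, 25922.2, np.nan],
    "Region": ["Other", "Gauteng", None, "Gauteng"],
    "Age": [59, 72, 49, 35],
})

summary = missing_summary(toy)
print(summary)
```

On the real data this would be `missing_summary(df)`.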



## 🗂️ Categorical Variable Distributions

Here we inspect distributions of categorical fields to understand class balance.


In [None]:
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

for col in cat_cols:
    print(f"--- {col} ---")
    display(df[col].value_counts(normalize=True).head())
    print("\n")
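
Category shares alone say nothing about risk. One possible follow-up is to compare the target rate across categories, sketched here on made-up rows. Using `WasEver60DPD` as the target is an assumption (it is the only flag-like column in the preview), and `rate_by_category` is a hypothetical helper name.

```python
import pandas as pd


def rate_by_category(frame: pd.DataFrame, cat_col: str, target_col: str) -> pd.DataFrame:
    """Mean of a 0/1 target and row count within each category, highest rate first."""
    grouped = frame.groupby(cat_col)[target_col].agg(["mean", "count"])
    grouped = grouped.rename(columns={"mean": "target_rate", "count": "n"})
    return grouped.sort_values("target_rate", ascending=False)


# Toy stand-in for df; the values are illustrative only.
toy = pd.DataFrame({
    "Region": ["Gauteng", "Gauteng", "Other", "Other", "Other", "Gauteng"],
    "WasEver60DPD": [1, 0, 0, 0, 1, 1],  # assumed target column
})

rates = rate_by_category(toy, "Region", "WasEver60DPD")
print(rates)
```

The `n` column matters when reading the result: a high rate in a category with only a handful of rows is weak evidence.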



## 📈 Numeric Variable Distributions

This helps identify skewness, heavy tails, and potential outliers.


In [None]:
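# Select every numeric dtype, then drop the CustomerID identifier, which is not a feature
num_cols = df.select_dtypes(include='number').columns.drop('CustomerID', errors='ignore').tolist()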

df[num_cols].hist(figsize=(18, 14), bins=40)
plt.tight_layout()
plt.show()
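
Histograms show skew visually; a numeric ranking makes it easier to list the skewed variables in the findings. The sketch below uses pandas' sample skewness on a made-up frame. The cutoff of 1 is only a common rule of thumb, not something this notebook prescribes.

```python
import pandas as pd

# Toy stand-in for df[num_cols]; the values are illustrative only.
toy = pd.DataFrame({
    "Income": [10, 12, 11, 13, 12, 200],  # one large value creates a long right tail
    "Age": [30, 40, 50, 60, 70, 80],      # evenly spaced, so symmetric
})

# Sample skewness per column, most right-skewed first.
skew = toy.skew().sort_values(ascending=False)

# Rule-of-thumb flag (assumption): |skew| > 1 counts as strongly skewed.
strongly_skewed = skew[skew.abs() > 1].index.tolist()

print(skew)
print("Strongly skewed:", strongly_skewed)
```

On the real data the first step would be `df[num_cols].skew()`.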



## 📦 Outlier Detection Using Boxplots


In [None]:
n_cols = 3
n_rows = -(-len(num_cols) // n_cols)  # ceiling division: no empty trailing row

plt.figure(figsize=(16, 12))

for i, col in enumerate(num_cols, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.boxplot(x=df[col])
    plt.title(col)

plt.tight_layout()
plt.show()
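
Boxplots flag points beyond the whiskers, which by default sit at 1.5 times the interquartile range. Counting those points per column gives a number to report. A minimal sketch on made-up data follows; the helper name `iqr_outlier_counts` is hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)


# Toy stand-in for df; the values are illustrative only.
toy = pd.DataFrame({
    "LastPaymentAmount": [100, 110, 120, 130, 140, 5000],
    "Age": [30, 40, 50, 60, 70, 80],
})

counts = iqr_outlier_counts(toy)
print(counts)
```

On the real data this would be `iqr_outlier_counts(df[num_cols])`. For heavily skewed fields a large count is expected and does not by itself mean the values are errors.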



## 🔗 Correlation Analysis

This includes:
- Feature-to-feature correlations  
- Relationship of each feature with the **target variable: `default`**  


In [None]:
plt.figure(figsize=(14, 10))
sns.heatmap(df[num_cols].corr(), cmap="coolwarm", annot=False)
plt.title("Correlation Matrix")
plt.show()
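
An unannotated heatmap makes it hard to read off exact values. One way to complement it is to list the strongest feature pairs directly, sketched here on made-up data. The helper name `top_correlated_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """List feature pairs by absolute Pearson correlation, strongest first."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy stand-in for df; the values are illustrative only.
toy = pd.DataFrame({
    "CreditLimit": [1000, 2000, 3000, 4000, 5000],
    "CurrentBalance": [900, 1900, 3100, 3900, 5100],
    "CreditScore": [700, 650, 600, 550, 500],
})

pairs = top_correlated_pairs(toy)
print(pairs)
```

On the real data this would be `top_correlated_pairs(df[num_cols])`. Pairs near 1 or -1 are candidates for dropping or combining during feature engineering.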



In [None]:
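# The preview above shows no 'default' column; 'WasEver60DPD' may be the intended target
target = 'default'

if target in num_cols:
    # display() is required: a bare expression inside an if-block is not rendered
    display(df[num_cols].corr()[target].drop(target).sort_values(ascending=False))
else:
    print(f"Target variable '{target}' not found among the numeric columns.")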



# 🧠 Week 1 Findings & Insights

### 🔍 Data Quality
- Missing values: (summarize key findings here)
- Duplicates: (if any)
- Outlier-heavy fields: (income, debt, etc.)

### 📈 Key Distributions
- Skewed variables:  
- Well-balanced variables:  

### 🔗 Correlations & Credit Risk Signals
- Top positive correlates of default:
- Top negative correlates:
- Early hypotheses:

---

# 🟢 Next Steps (Week 2)
- Handle missing values  
- Encode categorical variables  
- Scale numeric variables  
- Feature selection  
- Train baseline models (Logistic, Tree-based)  
- Start integrating with your GitHub project structure  


# Key Insights

- (Fill in after you analyze)
