# 📊 Dataset Loading and Exploratory Data Analysis (EDA)

This notebook loads the `credit-g` credit risk dataset from OpenML and explores it. This phase corresponds to the **data understanding stage** of the machine learning lifecycle.

The dataset is structured tabular data for credit risk evaluation. Each row represents a client credit record, and the target variable `class` (`good` or `bad`) indicates whether the client paid or did not pay the credit obligation.

Main objectives of this analysis:

- Identify number of records and variables
- Inspect data types
- Review sample records
- Analyze statistical summaries
- Check class distribution
- Detect missing values


In [2]:
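import os
import pandas as pd
from sklearn.datasets import fetch_openml

# Make sure the output folder for dataset artifacts exists.
os.makedirs("../artifacts/dataset", exist_ok=True)

# Download the credit-g dataset from OpenML as a pandas DataFrame.
data = fetch_openml(name="credit-g", version=1, as_frame=True)
df = data.frame.copy()

print("Dataset shape:", df.shape)

# Show the first five records (display is provided by the notebook environment).
display(df.head())

# Persist an untouched copy of the raw data for later steps.
raw_path = "../artifacts/dataset/german_credit_raw.csv"
df.to_csv(raw_path, index=False)
print("Saved:", raw_path)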


Dataset shape: (1000, 21)



|  | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | ... | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | none | ... | real estate | 67.0 | none | own | 2.0 | skilled | 1.0 | yes | yes | good |
| 1 | 0<=X<200 | 48.0 | existing paid | radio/tv | 5951.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | none | ... | real estate | 22.0 | none | own | 1.0 | skilled | 1.0 | none | yes | bad |
| 2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | none | ... | real estate | 49.0 | none | own | 1.0 | unskilled resident | 2.0 | none | yes | good |
| 3 | <0 | 42.0 | existing paid | furniture/equipment | 7882.0 | <100 | 4<=X<7 | 2.0 | male single | guarantor | ... | life insurance | 45.0 | none | for free | 1.0 | skilled | 2.0 | none | yes | good |
| 4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | none | ... | no known property | 53.0 | none | for free | 2.0 | skilled | 2.0 | none | yes | bad |


Saved: ../artifacts/dataset/german_credit_raw.csv


## Dataset Dimensions

The dataset contains **1,000 rows (clients)** and **21 variables (20 features + target)**.  
From these variables, only a subset will be selected for modeling based on relevance and data quality.
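As a rough illustration of how dimensions and data quality could feed that selection, the sketch below counts rows and columns and computes the share of missing values per column. The small DataFrame is a hypothetical stand-in for the notebook's `df` (its missing values are invented for the example), and the `max_missing` threshold is an assumption, not a value taken from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; values are illustrative only.
df = pd.DataFrame({
    "duration": [6.0, 48.0, 12.0, 42.0],
    "credit_amount": [1169.0, 5951.0, None, 7882.0],
    "purpose": ["radio/tv", "radio/tv", "education", None],
    "class": ["good", "bad", "good", "good"],
})

# Number of records and variables.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Fraction of missing values in each column, worst first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Assumed cutoff: keep columns with at most 20% missing values.
max_missing = 0.2
keep = missing_share[missing_share <= max_missing].index.tolist()
print("Columns passing the missing-value check:", keep)
```

In the notebook itself, the same lines would run directly on the loaded `df` without the stand-in definition.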



## Variable Description

The dataset includes demographic and financial variables such as:

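- credit duration (`duration`)
- credit amount (`credit_amount`)
- age (`age`)
- employment duration (`employment`)
- savings category (`savings_status`)
- checking account status (`checking_status`)
- credit purpose (`purpose`)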

These variables are commonly used in credit scoring tasks.
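To inspect data types, one option is to split the columns into numerical and categorical groups and list the levels of each categorical variable. The sketch below uses a hypothetical stand-in for `df` and assumes the categorical columns carry the pandas `category` dtype; if they are loaded as plain `object` columns instead, the `include` argument would need adjusting.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; values are illustrative only.
df = pd.DataFrame({
    "duration": [6.0, 48.0, 12.0],
    "credit_amount": [1169.0, 5951.0, 2096.0],
    "checking_status": pd.Categorical(["<0", "0<=X<200", "no checking"]),
    "purpose": pd.Categorical(["radio/tv", "radio/tv", "education"]),
})

# Data type of every column.
print(df.dtypes)

# Separate numerical and categorical variables by dtype.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="category").columns.tolist()
print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)

# List the levels of each categorical variable to see how it is encoded.
for col in categorical_cols:
    print(col, "->", df[col].cat.categories.tolist())
```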


## Sample Records

Sample records are displayed to understand the structure and encoding of categorical and numerical variables.


## Statistical Analysis

Descriptive statistics show the distribution, central tendency, and dispersion of numerical variables. This helps detect scale differences and potential outliers.
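A minimal sketch of this step is shown below: `describe()` gives the summary statistics, and a 1.5 × IQR rule flags potential outliers. The IQR rule is one common heuristic chosen here for illustration, not necessarily the method used later in this project, and the data is a hypothetical stand-in for the numerical columns of `df`.

```python
import pandas as pd

# Hypothetical stand-in for the numerical columns of `df`; values are illustrative only.
df = pd.DataFrame({
    "duration": [6.0, 12.0, 18.0, 24.0, 36.0, 48.0],
    "credit_amount": [1169.0, 2096.0, 2500.0, 3000.0, 4870.0, 50000.0],
})

# Central tendency and dispersion per variable (one row per column).
summary = df.describe().T
print(summary[["mean", "std", "min", "50%", "max"]])

# Flag values outside 1.5 * IQR from the quartiles as potential outliers.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
outside = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)

# Number of flagged values per column.
outlier_counts = outside.sum()
print(outlier_counts)
```

Comparing the `mean` and `max` columns across variables also makes scale differences visible, for example between a duration and a monetary amount.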


## Target Distribution

The target class distribution is analyzed to understand whether the dataset is balanced or imbalanced. This is important because class imbalance affects model training and evaluation metrics.
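The class balance can be checked with `value_counts`, as sketched below. The `target` series is a hypothetical stand-in for `df["class"]` with made-up labels, and the majority-to-minority ratio is just one simple way to summarize imbalance.

```python
import pandas as pd

# Hypothetical stand-in for df["class"]; labels are illustrative only.
target = pd.Series(
    ["good", "bad", "good", "good", "bad", "good", "good", "good", "good", "bad"],
    name="class",
)

# Absolute and relative frequency of each class.
counts = target.value_counts()
shares = target.value_counts(normalize=True)
print(counts)
print(shares.round(3))

# Ratio of the majority class to the minority class (1.0 means perfectly balanced).
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

When the ratio is well above 1, accuracy alone can look good for a model that mostly predicts the majority class, so metrics such as precision, recall, or F1 per class are usually worth reporting as well.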
