# Competition Dataset: Synthetic Credit Risk Data

Author: Jakub Pilchoń, IAD, 1st year

### 📂 Dataset Overview

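The dataset consists of **10,000 records** across **16 columns**. The raw CSV additionally carries an unnamed row-number column (`Unnamed: 0`), which is why `raw_data.info()` reports 17 columns. Below is an overview of the columns: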

| Column             | Description |
|--------------------|-------------|
| `age`              | Age of the client (years) |
| `income`           | Annual income of the client, stored as text with a "złoty" currency suffix (e.g., "15689 złoty") |
| `children`         | Number of children the client has, stored as text (e.g., "4 dzieci"); 0 is encoded as "brak" ("none") |
| `credit_history`   | Credit history status: "no history," "good history," "bad history" |
| `overdue_payments` | Status of overdue payments: "no overdue" or "overdue" |
| `active_loans`     | Number of active loans held by the client |
| `years_in_job`     | Number of years in current employment |
| `employment_type`  | Employment status (e.g., "self-employed," "permanent") |
| `owns_property`    | Whether the client owns property: "yes" or "no" |
| `assets_value`     | Value of assets owned, stored as text with a "złoty" currency suffix |
| `other_loans`      | Number of other loans held by the client |
| `education`        | Education level (e.g., "higher," "secondary") |
| `city`             | Size category of the city of residence (e.g., "small," "medium," "large") |
| `marital_status`   | Marital status of the client |
| `support_indicator`| An auxiliary metric introduced in data generation |
| `credit_risk`      | Target variable: 0 (low risk) or 1 (high risk) |




In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from typing import Union

In [None]:
raw_data = pd.read_csv("data_atlas.csv")
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          10000 non-null  int64  
 1   age                 10000 non-null  int64  
 2   income              8000 non-null   object 
 3   children            10000 non-null  object 
 4   credit_history      8000 non-null   object 
 5   overdue_payments    8000 non-null   object 
 6   active_loans        10000 non-null  int64  
 7   years_in_job        10000 non-null  int64  
 8   employment_type     10000 non-null  object 
 9   owns_property       8000 non-null   object 
 10  assets_value        8000 non-null   object 
 11  other_loans         10000 non-null  int64  
 12  education           10000 non-null  object 
 13  city                10000 non-null  object 
 14  marital_status      10000 non-null  object 
 15  support_indicator   10000 non-null  float64
 16  credi
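
An `Unnamed: 0` column like the one in the listing above is what pandas typically produces when a CSV was saved together with its index and then read back without declaring it. A minimal sketch of avoiding it at load time, assuming that is what happened here; the inline `sample_csv` is a made-up stand-in for `data_atlas.csv`:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for data_atlas.csv, written with its index.
sample_csv = io.StringIO(
    ",age,credit_risk\n"
    "0,44,0\n"
    "1,38,0\n"
)

# index_col=0 tells pandas to use the first (unnamed) column as the index
# instead of loading it as a regular "Unnamed: 0" column.
sample = pd.read_csv(sample_csv, index_col=0)
print(list(sample.columns))  # ['age', 'credit_risk']
```

With this option the later `drop` of `Unnamed: 0` would no longer be needed.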

In [6]:
raw_data.head()

Unnamed: 0.1,Unnamed: 0,age,income,children,credit_history,overdue_payments,active_loans,years_in_job,employment_type,owns_property,assets_value,other_loans,education,city,marital_status,support_indicator,credit_risk
0,0,44,15689 złoty,brak,dobra historia,brak opóźnień,2,9,samozatrudnienie,,,1,wyższe,małe,żonaty/zamężna,0.178131,0
1,1,38,18906 złoty,4 dzieci,brak historii,brak opóźnień,0,1,stała,tak,62965 złoty,0,średnie,średnie,kawaler/panna,0.37048,0
2,2,46,16338 złoty,2 dzieci,,,2,4,brak,tak,124967 złoty,0,podstawowe,duże,żonaty/zamężna,0.712334,0
3,3,55,23276 złoty,3 dzieci,dobra historia,opóźnienia,2,10,stała,tak,52147 złoty,1,średnie,małe,kawaler/panna,0.66505,0
4,4,37,40000 złoty,1 dzieci,brak historii,,1,9,określona,nie,33957 złoty,1,wyższe,małe,kawaler/panna,0.607151,0
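
The `children` column is stored as text ("brak", "4 dzieci"), so it needs parsing before it can be used as a number. A possible sketch, assuming only these two formats occur; `children_map` is a hypothetical helper that is not part of the original notebook, and any other stray values would need extra handling:

```python
import numpy as np
import pandas as pd

def children_map(children):
    # Hypothetical helper: "brak" means no children, "4 dzieci" means 4.
    if pd.isna(children):
        return np.nan
    if children == "brak":
        return 0
    # Take the leading number from strings such as "4 dzieci".
    return int(children.split()[0])

# Sample values copied from the head() output above.
sample_children = pd.Series(["brak", "4 dzieci", "2 dzieci", "1 dzieci"])
parsed_children = sample_children.map(children_map)
print(parsed_children.tolist())  # [0, 4, 2, 1]
```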


# Cleaning

In [31]:
# work on a copy of the dataset so that raw_data stays untouched
an_data = raw_data.copy()

In [32]:
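# first drop the unnecessary columns; drop() returns a new frame, so assign it back
# note the trailing space in "support_indicator ": it has to match the column name exactly
an_data = an_data.drop(columns=["Unnamed: 0", "support_indicator "])
an_data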

Unnamed: 0,age,income,children,credit_history,overdue_payments,active_loans,years_in_job,employment_type,owns_property,assets_value,other_loans,education,city,marital_status,credit_risk
0,44,15689 złoty,brak,dobra historia,brak opóźnień,2,9,samozatrudnienie,,,1,wyższe,małe,żonaty/zamężna,0
1,38,18906 złoty,4 dzieci,brak historii,brak opóźnień,0,1,stała,tak,62965 złoty,0,średnie,średnie,kawaler/panna,0
2,46,16338 złoty,2 dzieci,,,2,4,brak,tak,124967 złoty,0,podstawowe,duże,żonaty/zamężna,0
3,55,23276 złoty,3 dzieci,dobra historia,opóźnienia,2,10,stała,tak,52147 złoty,1,średnie,małe,kawaler/panna,0
4,37,40000 złoty,1 dzieci,brak historii,,1,9,określona,nie,33957 złoty,1,wyższe,małe,kawaler/panna,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,53,26739 złoty,3 dzieci,dobra historia,brak opóźnień,1,14,stała,tak,28518 złoty,1,średnie,małe,żonaty/zamężna,0
9996,20,40000 złoty,1 dzieci,dobra historia,brak opóźnień,0,10,stała,,,0,średnie,duże,żonaty/zamężna,0
9997,32,26613 złoty,2 dzieci,dobra historia,2,1,11,określona,nie,27826 złoty,0,podstawowe,duże,żonaty/zamężna,0
9998,44,40000 złoty,1 dzieci,brak historii,opóźnienia,0,8,określona,tak,62710 złoty,0,średnie,małe,kawaler/panna,0


In [None]:
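# now let's make income and assets continuous variables
def income_map(income: Union[str, float]) -> float:
    # missing entries arrive as NaN (a float), so pass them through unchanged
    if pd.isna(income):
        return np.nan
    # remove the "złoty" currency suffix and convert the rest to a number
    income = income.replace("złoty", "").strip()
    return np.float64(income)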
    
an_data["income"] = an_data["income"].map(income_map)
an_data["assets_value"] = an_data["assets_value"].map(income_map)
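
The same conversion can also be written without a row-by-row function. A minimal vectorized sketch on sample values taken from the table above; `sample_income` and `numeric_income` are illustrative names, and `errors="coerce"` is an assumption about how unparseable entries should be treated (they become NaN):

```python
import numpy as np
import pandas as pd

sample_income = pd.Series(["15689 złoty", "40000 złoty", np.nan])

# Strip the literal currency suffix, then convert; NaN entries stay NaN.
numeric_income = pd.to_numeric(
    sample_income.str.replace(" złoty", "", regex=False),
    errors="coerce",
)
print(numeric_income.tolist())
```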

In [None]:
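# now clear NaNs from both "income" and "assets_value"
# I will replace NaNs with the mean value of their respective columns
# (fillna returns a new Series, so the result is assigned back)

an_data["income"] = an_data["income"].fillna(an_data["income"].mean())
an_data["assets_value"] = an_data["assets_value"].fillna(an_data["assets_value"].mean())

# mean income; filling NaNs with the mean leaves it unchanged
an_data["income"].mean()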

np.float64(23598.62125)
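
By default `Series.fillna` returns a new Series and leaves the original untouched, so the result has to be assigned back for the imputation to take effect. A small illustration on a made-up three-row frame (the `sample` values are not from the dataset):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({"income": [10000.0, np.nan, 30000.0]})

# Result is discarded: the column still contains its NaN.
sample["income"].fillna(sample["income"].mean())
missing_before = int(sample["income"].isna().sum())

# Assigning the result back actually replaces the NaN with the mean.
sample["income"] = sample["income"].fillna(sample["income"].mean())
missing_after = int(sample["income"].isna().sum())

print(missing_before, missing_after)  # 1 0
print(sample["income"].tolist())      # [10000.0, 20000.0, 30000.0]
```

Mean imputation keeps the column mean unchanged but shrinks its variance; the median is a common alternative when the distribution is skewed.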

## Demographic overview
Let's look at basic information about the clients in the dataset.

In [None]:
fig, ax = plt.subplots()
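
`plt.subplots()` returns the figure first and the axes second. One possible first demographic plot is an age histogram; this is only a sketch, since the original does not say which plot was intended, and `sample_ages` holds just the five ages visible in the `head()` output:

```python
import matplotlib.pyplot as plt

# Ages from the first five rows shown earlier; the real plot would use an_data["age"].
sample_ages = [44, 38, 46, 55, 37]

fig, ax = plt.subplots()          # figure first, axes second
ax.hist(sample_ages, bins=5)      # distribution of client age
ax.set_xlabel("Age (years)")
ax.set_ylabel("Number of clients")
ax.set_title("Age distribution")
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.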