# Competition Dataset: Synthetic Credit Risk Data

Author: Jakub Pilchoń, IAD, 1st year

### 📂 Dataset Overview

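The dataset consists of **10,000 records** across **16 columns** (the raw CSV also contains an unnamed index column). Categorical values are stored in Polish; the descriptions below give them in English. Below is an overview of the key columns: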

| Column             | Description |
|--------------------|-------------|
| `age`              | Age of the client (years) |
| `income`           | Annual income of the client (with "złoty" added as currency) |
| `children`         | Number of children the client has; 0 is encoded as "none" |
| `credit_history`   | Credit history status: "no history," "good history," "bad history" |
| `overdue_payments` | Status of overdue payments: "no overdue" or "overdue" |
| `active_loans`     | Number of active loans held by the client |
| `years_in_job`     | Number of years in current employment |
| `employment_type`  | Employment status (e.g., "self-employed," "permanent") |
| `owns_property`    | Whether the client owns property: "yes" or "no" |
| `assets_value`     | Value of assets owned (with "złoty" as currency) |
| `other_loans`      | Number of other loans held by the client |
| `education`        | Education level (e.g., "higher," "secondary") |
| `city`             | Size category of the city of residence (e.g., "small," "medium," "large") |
| `marital_status`   | Marital status of the client |
| `support_indicator`| An auxiliary metric introduced in data generation |
| `credit_risk`      | Target variable: 0 (low risk) or 1 (high risk) |




In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from typing import Union

In [2]:
raw_data = pd.read_csv("data_atlas.csv")
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          10000 non-null  int64  
 1   age                 10000 non-null  int64  
 2   income              8000 non-null   object 
 3   children            10000 non-null  object 
 4   credit_history      8000 non-null   object 
 5   overdue_payments    8000 non-null   object 
 6   active_loans        10000 non-null  int64  
 7   years_in_job        10000 non-null  int64  
 8   employment_type     10000 non-null  object 
 9   owns_property       8000 non-null   object 
 10  assets_value        8000 non-null   object 
 11  other_loans         10000 non-null  int64  
 12  education           10000 non-null  object 
 13  city                10000 non-null  object 
 14  marital_status      10000 non-null  object 
 15  support_indicator   10000 non-null  float64
 16  credi

In [3]:
raw_data.head()

Unnamed: 0.1,Unnamed: 0,age,income,children,credit_history,overdue_payments,active_loans,years_in_job,employment_type,owns_property,assets_value,other_loans,education,city,marital_status,support_indicator,credit_risk
0,0,44,15689 złoty,brak,dobra historia,brak opóźnień,2,9,samozatrudnienie,,,1,wyższe,małe,żonaty/zamężna,0.178131,0
1,1,38,18906 złoty,4 dzieci,brak historii,brak opóźnień,0,1,stała,tak,62965 złoty,0,średnie,średnie,kawaler/panna,0.37048,0
2,2,46,16338 złoty,2 dzieci,,,2,4,brak,tak,124967 złoty,0,podstawowe,duże,żonaty/zamężna,0.712334,0
3,3,55,23276 złoty,3 dzieci,dobra historia,opóźnienia,2,10,stała,tak,52147 złoty,1,średnie,małe,kawaler/panna,0.66505,0
4,4,37,40000 złoty,1 dzieci,brak historii,,1,9,określona,nie,33957 złoty,1,wyższe,małe,kawaler/panna,0.607151,0


# Cleaning
*Note: missing values (NaNs) are handled later, since dealing with them at this stage could bias the analysis.*
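
Before deciding how to treat the missing values, it can help to see how they are spread across columns. A minimal sketch on a small made-up frame (the `toy` frame and its values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({
    "age": [44, 38, 46, 55],
    "income": ["15689 złoty", None, "16338 złoty", None],
    "owns_property": [None, "tak", "tak", "nie"],
})

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = toy.isna().mean()
print(missing_share)
```

The same call on `an_data` would presumably show which columns carry the gaps, without changing any values.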

In [4]:
# make a copy of the dataset for analysis, so the raw data stays untouched
an_data = raw_data.copy()

In [5]:
# first, drop the unnecessary columns
# (the trailing space in "support_indicator " matches the column name in the raw file)
an_data = an_data.drop(["Unnamed: 0", "support_indicator "], axis=1)

In [6]:
# now let's make income and assets_value continuous variables
def income_map(income: Union[str, float]) -> float:
    # missing entries arrive as NaN (a float), so pass them through unchanged
    if pd.isna(income):
        return np.nan
    # drop the " złoty" currency suffix and convert the rest to a float
    return np.float64(income.removesuffix(" złoty"))

an_data["income"] = an_data["income"].map(income_map)
an_data["assets_value"] = an_data["assets_value"].map(income_map)
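
The same conversion could also be done without a Python-level function, using pandas string methods. This is only a sketch of an alternative: `parse_zloty` and `sample` are hypothetical names, and it assumes pandas 1.4 or newer, where `Series.str.removesuffix` is available.

```python
import pandas as pd

def parse_zloty(series: pd.Series) -> pd.Series:
    """Strip a trailing ' złoty' and convert to numbers; missing values stay NaN."""
    cleaned = series.str.removesuffix(" złoty")
    # errors="coerce" turns anything that is not a number into NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")

# Invented sample values in the same format as the income column.
sample = pd.Series(["15689 złoty", None, "40000 złoty"], dtype=object)
parsed = parse_zloty(sample)
print(parsed)
```

Note that `errors="coerce"` also hides malformed entries, so it is worth comparing the NaN count before and after when applying this to real data.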

In [7]:
# now let's handle children
an_data["children"].value_counts()

children
brak        5043
1 dzieci    2066
2 dzieci    1477
3 dzieci     945
4 dzieci     364
5 dzieci     105
Name: count, dtype: int64

*Next, I will handle the binary and discrete ordinal variables. \
The process is fairly repetitive, so there is nothing particularly interesting here. \
Non-ordinal variables will be handled during machine learning preprocessing, \
as one-hot encoding them here would clutter the analysis.*
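
To make the distinction concrete, here is a small sketch of the two encodings on an invented frame (the `toy` frame is illustrative only). An ordinal variable fits into a single integer column, while one-hot encoding a non-ordinal variable creates one indicator column per category.

```python
import pandas as pd

# Invented rows; only the column names and labels mirror the dataset.
toy = pd.DataFrame({
    "education": ["podstawowe", "wyższe", "średnie"],           # ordinal
    "employment_type": ["stała", "samozatrudnienie", "stała"],  # no natural order
})

# Ordinal: a single integer column preserves the ordering of the levels.
edu_order = {"podstawowe": 0, "średnie": 1, "wyższe": 2}
toy["education"] = toy["education"].map(edu_order)

# Non-ordinal: one-hot encoding adds one indicator column per category.
encoded = pd.get_dummies(toy, columns=["employment_type"])
print(encoded.columns.tolist())
```

With more categories the number of indicator columns grows accordingly, which is why this step is left for the modelling stage.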

In [8]:
# no need to handle NaNs here, as there are none in this column
def children_map(children: str) -> np.int64:
    if children == "brak":  # "brak" means no children
        return np.int64(0)
    # entries look like "3 dzieci"; keep only the leading number
    return np.int64(children.removesuffix(" dzieci"))
    

an_data["children"] = an_data["children"].map(children_map)

In [9]:
an_data["overdue_payments"].value_counts()

overdue_payments
brak opóźnień    4831
opóźnienia       2429
2                 634
3                  98
4                   8
Name: count, dtype: int64

In [10]:
def overdue_map(payments: Union[str, float]) -> Union[np.int64, float]:
    match payments:
        case "brak opóźnień":  # no overdue payments
            return np.int64(0)
        case "opóźnienia":  # overdue payments
            return np.int64(1)
        # NaN never compares equal to itself, so it cannot be matched as a value; use a guard instead
        case _ if pd.isna(payments):
            return np.nan
        case _:
            # the remaining values ("2", "3", "4") are rare, so they are grouped into a single class 2
            return np.int64(2)
        
an_data["overdue_payments"] = an_data["overdue_payments"].map(overdue_map)
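
The guard in `overdue_map` is needed because NaN is never equal to itself, so an equality-based pattern cannot catch it. A tiny illustration (the `value` name is just for the example):

```python
import pandas as pd

value = float("nan")

print(value == value)  # False: NaN compares unequal to everything, including itself
print(pd.isna(value))  # True: pd.isna is the reliable way to detect it
```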

In [11]:
def property_map(prop: Union[str, float]) -> Union[np.int64, float]:
    match prop:
        case "tak":
            return np.int64(1)
        case "nie":
            return np.int64(0)
        case _:
            return np.nan

an_data["owns_property"] = an_data["owns_property"].map(property_map)

In [12]:
def edu_map(edu: str) -> np.int64:
    match edu:
        case "podstawowe":
            return np.int64(0)
        case "średnie":
            return np.int64(1)
        case "wyższe":
            return np.int64(2)
        
an_data["education"] = an_data["education"].map(edu_map)

In [13]:
def city_map(city: str) -> np.int64:
    match city:
        case "małe":
            return np.int64(0)
        case "średnie":
            return np.int64(1)
        case "duże":
            return np.int64(2)
        
an_data["city"] = an_data["city"].map(city_map)
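
The mapping functions above return nothing for a label they do not know, which `map` turns into a missing value without any warning. A possible safeguard is sketched below; `map_strict` is a hypothetical helper, not part of the notebook, and it uses a plain dictionary instead of `match`.

```python
import pandas as pd

def map_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to codes and fail loudly if a non-missing label has no code."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped labels: {sorted(unknown)}")
    return series.map(mapping)

# Invented sample using the city size labels from the dataset.
city = pd.Series(["małe", "średnie", "duże", "małe"])
codes = map_strict(city, {"małe": 0, "średnie": 1, "duże": 2})
print(codes.tolist())
```

Missing values are left alone by `dropna()`, so columns with NaNs can go through the same helper.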

In [14]:
an_data.tail(10)

Unnamed: 0,age,income,children,credit_history,overdue_payments,active_loans,years_in_job,employment_type,owns_property,assets_value,other_loans,education,city,marital_status,credit_risk
9990,54,,0,brak historii,1.0,2,4,stała,1.0,127494.0,0,1,0,kawaler/panna,0
9991,41,17425.0,0,dobra historia,1.0,3,12,stała,1.0,60775.0,1,1,1,żonaty/zamężna,0
9992,46,34352.0,0,brak historii,0.0,1,4,samozatrudnienie,,115158.0,1,1,1,kawaler/panna,0
9993,25,34587.0,0,dobra historia,0.0,1,11,samozatrudnienie,1.0,86394.0,1,1,0,rozwiedziony/rozwiedziona,0
9994,51,35464.0,0,dobra historia,1.0,3,9,stała,0.0,181977.0,0,1,0,kawaler/panna,0
9995,53,26739.0,3,dobra historia,0.0,1,14,stała,1.0,28518.0,1,1,0,żonaty/zamężna,0
9996,20,40000.0,1,dobra historia,0.0,0,10,stała,,,0,1,2,żonaty/zamężna,0
9997,32,26613.0,2,dobra historia,2.0,1,11,określona,0.0,27826.0,0,0,2,żonaty/zamężna,0
9998,44,40000.0,1,brak historii,1.0,0,8,określona,1.0,62710.0,0,1,0,kawaler/panna,0
9999,46,9799.0,0,dobra historia,1.0,0,12,samozatrudnienie,,,0,1,1,kawaler/panna,0


In [15]:
# change object type to categorical
an_data = an_data.apply(lambda column: column.astype("category") if column.dtype == "object" else column)

In [16]:
# save the cleaned dataset
an_data.to_csv("cleared_data.csv")

In [17]:
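# reload the cleaned dataset; CSV does not store dtypes, so restore the categories
# and drop the index column that was written out when saving
an_data = pd.read_csv("cleared_data.csv")
an_data = an_data.apply(lambda column: column.astype("category") if column.dtype == "object" else column)
del an_data["Unnamed: 0"]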
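
The round trip through CSV could be made a bit tidier by not writing the index and by asking for the categorical dtype on load. A minimal sketch, assuming an in-memory buffer in place of the real file and an invented two-row `frame`:

```python
import io
import pandas as pd

# Invented frame with one numeric and one categorical column.
frame = pd.DataFrame({
    "city": [0, 2],
    "marital_status": pd.Categorical(["kawaler/panna", "żonaty/zamężna"]),
})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # index=False avoids the extra "Unnamed: 0" column
buffer.seek(0)

# CSV carries no dtype information, so the category dtype is requested explicitly.
restored = pd.read_csv(buffer, dtype={"marital_status": "category"})
print(restored.dtypes)
```

For the real data, the `dtype` dictionary would need one entry per categorical column.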

In [18]:
an_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   age               10000 non-null  int64   
 1   income            8000 non-null   float64 
 2   children          10000 non-null  int64   
 3   credit_history    8000 non-null   category
 4   overdue_payments  8000 non-null   float64 
 5   active_loans      10000 non-null  int64   
 6   years_in_job      10000 non-null  int64   
 7   employment_type   10000 non-null  category
 8   owns_property     8000 non-null   float64 
 9   assets_value      8000 non-null   float64 
 10  other_loans       10000 non-null  int64   
 11  education         10000 non-null  int64   
 12  city              10000 non-null  int64   
 13  marital_status    10000 non-null  category
 14  credit_risk       10000 non-null  int64   
dtypes: category(3), float64(4), int64(8)
memory usage: 967.4 KB
