# Assessing Bank Borrower Reliability Under Incomplete Information

**Study objective:** to assess the reliability of a bank borrower under incomplete initial information, specifically the presence of missing data and artefacts in data provided by the bank.

 - Client: the bank’s Credit Department.
 - Input data from the bank: statistics on clients’ creditworthiness.
 - Object of the study: the bank borrower.
 - Subject of the study: borrower reliability.


The results of the study will be used in building a **credit scoring** model — a specialised system that assesses a potential borrower’s ability to repay a loan.


**Tasks:**

- Develop an approach to data preprocessing;
- Handle missing values and artefacts;
- Handle duplicate records;
- Within the scope of this analysis, answer the following questions:
    - Is there a relationship between having children and repaying the loan on time?
    - Is there a relationship between marital status and repaying the loan on time?
    - Is there a relationship between income level and repaying the loan on time?
    - How do different loan purposes affect on-time repayment?

    

## 1. Overview of the data

In [1]:
import pandas as pd #loading the necessary libraries
import numpy as np
import re
pd.options.mode.chained_assignment = None  # default='warn'
data = pd.read_csv('data.csv') #read the file
display(data.head(10))
data.info()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля
2,0,-5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу
5,0,-926.185831,27,высшее,0,гражданский брак,1,M,компаньон,0,255763.565419,покупка жилья
6,0,-2879.202052,43,высшее,0,женат / замужем,0,F,компаньон,0,240525.97192,операции с жильем
7,0,-152.779569,50,СРЕДНЕЕ,1,женат / замужем,0,M,сотрудник,0,135823.934197,образование
8,2,-6929.865299,35,ВЫСШЕЕ,0,гражданский брак,1,F,сотрудник,0,95856.832424,на проведение свадьбы
9,0,-2188.756445,41,среднее,1,женат / замужем,0,M,сотрудник,0,144425.938277,покупка жилья для семьи


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


**Summary**


We have a dataset consisting of a table with 12 columns:

- `children` — number of children in the family
- `days_employed` — total employment length in days
- `dob_years` — client’s age in years
- `education` — client’s education level
- `education_id` — education level identifier
- `family_status` — marital status
- `family_status_id` — marital status identifier
- `gender` — client’s gender
- `income_type` — employment type
- `debt` — whether the client had overdue debt on loan repayment
- `total_income` — monthly income
- `purpose` — loan purpose

The column names contain no spaces; all names use a consistent case and are in one language.

The dataset includes both categorical and numerical variables.

There are missing values in the `days_employed` and `total_income` columns.

Most of the values shown for `days_employed` are negative, and one (row 4) is implausibly large.

The column data types match the information stored in them.

Some values are in lower case and some in upper case (e.g., in `education`: “среднее” vs “СРЕДНЕЕ”). It is better to convert all values to lower case.


## 2. Data preprocessing


<a name='duplicates'></a>
### Handling duplicates


In [2]:
data.duplicated().sum()

np.int64(54)

There are 54 rows that are exact duplicates.

Duplicates are usually removed from a DataFrame. However, we first take a closer look at them.

In this dataset, it is possible that two different people could have identical values in the following columns: `children`, `dob_years`, `education`, `family_status`, `gender`, `income_type`, `debt`, `purpose`. These are not necessarily duplicates that should be removed.

However, it is unlikely that two different clients would have exactly the same values in `days_employed` and `total_income`. Let us check the `days_employed` and `total_income` values for these duplicate rows.


In [3]:
print(data[data.duplicated()].days_employed.unique()) 
print(data[data.duplicated()].total_income.unique())

[nan]
[nan]



Although these rows are identical across all available columns, two key variables (`days_employed` and `total_income`) are missing in these records. Because of this, we cannot confirm whether these are true repeated entries or different clients who share the same recorded characteristics (e.g., the same family status, education, employment type, and loan purpose).

Given this uncertainty, and to avoid removing potentially valid observations, we keep these rows in the dataset. We note that this assumption may slightly affect group-level default rate estimates.

During preprocessing, additional duplicates may appear (for example, after converting values to lower case, imputing missing values using the median, or after lemmatisation). However, we will continue to keep duplicate rows, since we assume they may represent different clients with identical recorded characteristics.
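The reasoning above can be turned into a reusable check: flag the repeated rows and confirm that all of them lack the two distinguishing numeric fields. This is a minimal sketch on toy data; the frame `toy` and the helper name `duplicates_lack_key_fields` are illustrative assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 are exact duplicates with both key fields missing.
toy = pd.DataFrame({
    'children': [0, 1, 1, 2],
    'days_employed': [-1200.5, np.nan, np.nan, -300.0],
    'total_income': [150000.0, np.nan, np.nan, 90000.0],
})

def duplicates_lack_key_fields(df, key_cols):
    """Return True if every repeated row has NaN in all key columns."""
    repeated = df[df.duplicated()]  # second and later occurrences only
    return bool(repeated[key_cols].isna().all().all())

print(duplicates_lack_key_fields(toy, ['days_employed', 'total_income']))
```

If the check returned `False`, at least one repeated row would carry identical employment length and income, which would be much stronger evidence of a true duplicate.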


### Handling invalid values and missing data

Real-world datasets often contain missing values for various reasons, which can reduce the effectiveness of statistical models. They may also include artefacts (impossible or invalid values) which, if left uncorrected, can distort model results and lead to errors. Let us check whether such issues are present in our data.


In [4]:
print('Missing values (percentage):')
print(data.isna().sum().apply(lambda x: round(x*100/len(data)))) # calculate the percentage of missing values for each column
print('Missing values occur in the same rows - ', data[data.days_employed.isna()].equals(data[data.total_income.isna()])) # check if NaN values are in the same rows

Missing values (percentage):
children             0
days_employed       10
dob_years            0
education            0
education_id         0
family_status        0
family_status_id     0
gender               0
income_type          0
debt                 0
total_income        10
purpose              0
dtype: int64
Missing values occur in the same rows -  True


Already at the initial data review stage, we noticed that some values in certain columns do not match what those variables should contain. We also found missing values in two columns: `days_employed` and `total_income`. Around 10% of the values are missing, which is too much to ignore and simply drop.

We also checked that if the total employment length is missing, then the monthly income is missing as well. One possible explanation is that some clients chose not to provide this information.
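An equivalent way to confirm that the gaps coincide is to compare the two missing-value masks directly. The sketch below uses a small hypothetical frame (`toy`) rather than the bank data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in the same two rows of both columns.
toy = pd.DataFrame({
    'days_employed': [-100.0, np.nan, -250.0, np.nan],
    'total_income': [90000.0, np.nan, 120000.0, np.nan],
})

# Boolean masks marking the missing entries in each column.
missing_days = toy['days_employed'].isna()
missing_income = toy['total_income'].isna()

same_rows = bool((missing_days == missing_income).all())  # gaps coincide row by row
share_missing = missing_days.mean() * 100                 # percentage of rows with a gap

print(same_rows, round(share_missing, 1))
```

Comparing masks avoids building two filtered copies of the frame and makes it easy to list the rows where the masks disagree, if any.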

To decide how to handle incorrect and missing values, we will review each column separately.


<a name='children'></a>
#### Column `children`

In [5]:
print(f'Unique values for column children: {list(data.children.unique())}') #find unique values for column 'children'

Unique values for column children: [np.int64(1), np.int64(0), np.int64(3), np.int64(2), np.int64(-1), np.int64(4), np.int64(20), np.int64(5)]


In [6]:
data.loc[(data.children == -1), 'children'] = 1  # replace -1 with 1
print(f'Unique values for column children after changes: {list(data.children.unique())}') #check the replacement

Unique values for column children after changes: [np.int64(1), np.int64(0), np.int64(3), np.int64(2), np.int64(4), np.int64(20), np.int64(5)]


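In the `children` column, there was a value of `-1`. A negative number of children is impossible, so this is most likely a data-entry or technical error. We corrected it by replacing `-1` with `1`. The value `20` also looks implausible, but it is left unchanged here.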


<a name='dob_years'></a>
#### Column `dob_years`

In [7]:
print('Descriptive statistics `dob_years`:')
data.dob_years.describe()

Descriptive statistics `dob_years`:


count    21525.000000
mean        43.293380
std         12.574584
min          0.000000
25%         33.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

The minimum borrower age is 0 years. Let us check what other unrealistic values may appear in this column.


In [8]:
print('Ages below 18 present in the data:', data[data.dob_years < 18].dob_years.unique())  # check for clients under 18
print(f'{len(data[data.dob_years == 0])} clients have an age of 0')

Ages below 18 present in the data: [0]
101 clients have an age of 0


In [9]:
data.loc[(data.dob_years == 0), 'dob_years'] = round(data.dob_years.mean())  # replace zeros with the rounded column mean
print('Descriptive statistics `dob_years` after changes:')
data.dob_years.describe()

Descriptive statistics `dob_years` after changes:


count    21525.000000
mean        43.495145
std         12.218213
min         19.000000
25%         34.000000
50%         43.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

We replaced the 101 zero values with the rounded mean of this column (43 years). The descriptive statistics of the column changed only slightly, and the minimum age is now 19.
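One detail worth noting: a mean computed over all rows includes the zero placeholders, which pulls it slightly downward. The sketch below, on hypothetical toy ages, contrasts the two variants. In the notebook data the two variants appear to round to the same value, so the result is unaffected.

```python
import pandas as pd

ages = pd.Series([0, 25, 35, 45, 55], name='dob_years')  # hypothetical toy ages

mean_with_zeros = ages.mean()            # the zero placeholder is counted
mean_valid_only = ages[ages > 0].mean()  # placeholder zeros are excluded

# Replace zeros with the rounded mean of the valid ages.
filled = ages.mask(ages == 0, round(mean_valid_only))
print(mean_with_zeros, mean_valid_only, filled.tolist())
```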


<a name='days_employed'></a>
#### Column `days_employed`

In [10]:
print('Descriptive statistics `days_employed`:')
data.days_employed.describe() #show descriptive statistics

Descriptive statistics `days_employed`:


count     19351.000000
mean      63046.497661
std      140827.311974
min      -18388.949901
25%       -2747.423625
50%       -1203.369529
75%        -291.095954
max      401755.400475
Name: days_employed, dtype: float64

The `days_employed` column contains values that are not plausible for employment length:
- negative values;
- extremely large values (e.g., the maximum is 401,755.4 days, which would correspond to about 1,100 years of work).

Such values indicate artefacts in the dataset. In practice, employment length should be non-negative and within a realistic human working lifespan. We will check the distribution and identify these outliers for further handling.

As an approximate sanity check, we can assume employment length should not exceed around 60 working years (≈ 60 × 247 working days ≈ 14820 days). We will check whether the dataset contains values far beyond this range.
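The arithmetic behind this sanity check can be written out explicitly. The two constants are the assumptions stated in the text (60 working years, 247 working days per year); the maximum is taken from the descriptive statistics above.

```python
WORKING_DAYS_PER_YEAR = 247  # assumption used in the text
MAX_WORKING_YEARS = 60       # assumption used in the text

plausible_limit_days = MAX_WORKING_YEARS * WORKING_DAYS_PER_YEAR
max_observed_days = 401755.400475  # maximum of days_employed from describe()

# The same maximum expressed in years under two readings of "days".
years_if_calendar_days = max_observed_days / 365
years_if_working_days = max_observed_days / WORKING_DAYS_PER_YEAR

print(plausible_limit_days, round(years_if_calendar_days), round(years_if_working_days))
```

Under either reading, the maximum exceeds the plausible limit by more than an order of magnitude.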



In [11]:
data[(data.days_employed>=0) & (data.days_employed<=14820)]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


No rows fall within the plausible range of 0 to 14,820 days (the filtered table is empty), so every non-missing value is either negative or far above the limit. Next, we will examine the rows where `days_employed` is positive separately from those where it is negative.

In [12]:
df_days_empl_neg = data[(data.days_employed<0)] #slice dataframe - only negative days_employed
print('Descriptive statistics `days_employed` with negative values:')
print(df_days_empl_neg.days_employed.describe())
print(f'Values for income_type column when days_employed is negative: {df_days_empl_neg.income_type.unique()}')

Descriptive statistics `days_employed` with negative values:
count    15906.000000
mean     -2353.015932
std       2304.243851
min     -18388.949901
25%      -3157.480084
50%      -1630.019381
75%       -756.371964
max        -24.141633
Name: days_employed, dtype: float64
Values for income_type column when days_employed is negative: ['сотрудник' 'компаньон' 'госслужащий' 'студент' 'предприниматель'
 'в декрете']


The negative sign in these values is most likely a technical error. To address this, we will remove the minus sign (take the absolute value).

In [13]:
df_days_empl_pos = data[(data.days_employed>=0)] #slice dataframe - only positive days_employed
print('Descriptive statistics `days_employed` with positive values:')
print(df_days_empl_pos.days_employed.describe())
print(f'Values for income_type column when days_employed is positive: {df_days_empl_pos.income_type.unique()}')

Descriptive statistics `days_employed` with positive values:
count      3445.000000
mean     365004.309916
std       21075.016396
min      328728.720605
25%      346639.413916
50%      365213.306266
75%      383246.444219
max      401755.400475
Name: days_employed, dtype: float64
Values for income_type column when days_employed is positive: ['пенсионер' 'безработный']



We observe 3,445 rows with implausibly large `days_employed` values. These rows occur only for pensioners and unemployed clients, which suggests a systematic data artefact (e.g., a placeholder value) rather than true employment length.

Ideally, this should be clarified with the data provider. Since this is not possible here, we treat these values as missing by replacing them with `NaN`.

To impute missing `days_employed`, we use group-wise medians. As an approximation, we group clients by gender and age and assign the median `days_employed` within each group to the missing values in the same group. (This is a simplifying assumption and may introduce some uncertainty for specific employment types.)


In [14]:
data.loc[data['days_employed'] > 0, 'days_employed'] = np.nan  # treat implausible positive values as missing
data['days_employed'] = data['days_employed'].abs()  # remove the erroneous minus sign
print('Descriptive statistics `days_employed` after changes:')
data.days_employed.describe()

Descriptive statistics `days_employed` after changes:


count    15906.000000
mean      2353.015932
std       2304.243851
min         24.141633
25%        756.371964
50%       1630.019381
75%       3157.480084
max      18388.949901
Name: days_employed, dtype: float64

Now the `days_employed` column contains only positive values and `NaN`s. There are 5,619 missing values, which is too many to ignore and drop.

We impute missing values using group-wise medians based on age and gender. We first compute the median `days_employed` for each (age × gender) group. If a median cannot be computed for a specific age (no valid values in that group), we fall back to the median of the per-age medians for that gender, which serves as an approximate overall median. We then assign these values to the missing entries in the corresponding groups.



In [15]:
table_days_empl = data.pivot_table(values='days_employed',index='dob_years',
                                   columns='gender',aggfunc='median')# pivot table with median values by age and gender

for g in table_days_empl.columns:  # fill gaps in the pivot table with the median of that gender's per-age medians
    table_days_empl[g] = table_days_empl[g].fillna(table_days_empl[g].median())

def fill_nan_days_employed(row):
    """
    Return a group-wise median value of days_employed (by age and gender)
    if days_employed is NaN; otherwise return the original value.
    """
    if pd.isna(row.days_employed):
        return table_days_empl.loc[row.dob_years, row.gender]
    return row.days_employed

data['days_employed'] = data.apply(fill_nan_days_employed, axis=1)

print('Descriptive statistics `days_employed` without NaN values:')
data['days_employed'].describe()


Descriptive statistics `days_employed` without NaN values:


count    21525.000000
mean      2391.334416
std       2041.346327
min         24.141633
25%        984.469860
50%       1991.395531
75%       3017.005276
max      18388.949901
Name: days_employed, dtype: float64
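The same kind of imputation can also be sketched without a row-wise `apply`, using `groupby(...).transform('median')`. This is an illustrative alternative, not the notebook's method: the helper name `impute_group_median` is an assumption, and its fallback is the gender-wide median of the raw values rather than the median of per-age medians used above.

```python
import numpy as np
import pandas as pd

def impute_group_median(df, column, group_cols, fallback_col):
    """Fill NaN in `column` with its group median; if a whole group is
    missing, fall back to the median within `fallback_col`."""
    group_median = df.groupby(group_cols)[column].transform('median')
    fallback_median = df.groupby(fallback_col)[column].transform('median')
    return df[column].fillna(group_median).fillna(fallback_median)

# Hypothetical toy data: the (40, 'F') group has no valid value at all.
toy = pd.DataFrame({
    'dob_years': [30, 30, 30, 40, 40],
    'gender': ['F', 'F', 'F', 'F', 'M'],
    'days_employed': [100.0, 300.0, np.nan, np.nan, 500.0],
})
toy['days_employed'] = impute_group_median(
    toy, 'days_employed', ['dob_years', 'gender'], 'gender')
print(toy['days_employed'].tolist())
```

Because the helper takes the column name as a parameter, the same function could serve both `days_employed` and `total_income`.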

<a name='total_income'></a>
#### Column `total_income`

In [16]:
pd.options.display.float_format ='{:,.2f}'.format
print('Descriptive statistics `total_income`:')
data.total_income.describe()

Descriptive statistics `total_income`:


count      19,351.00
mean      167,422.30
std       102,971.57
min        20,667.26
25%       103,053.15
50%       145,017.94
75%       203,435.07
max     2,265,604.03
Name: total_income, dtype: float64

This column contains missing values. As with `days_employed`, we impute them using the median for the client's age and gender.


In [17]:
table_total_income = data.pivot_table(values='total_income',index='dob_years',  # Pivot table with median values by age and gender
                                      columns='gender',aggfunc='median')

for g in table_total_income.columns:  # fill gaps in the pivot table with the median of that gender's per-age medians
    table_total_income[g] = table_total_income[g].fillna(table_total_income[g].median())

def fill_nan_total_income(row):
    """
    Return a group-wise median value of total_income (by age and gender)
    if total_income is NaN; otherwise return the original value.
    """
    if pd.isna(row.total_income):
        return table_total_income.loc[row.dob_years, row.gender]
    return row.total_income

data['total_income'] = data.apply(fill_nan_total_income, axis=1)

print('Descriptive statistics `total_income` without NaN values:')
data['total_income'].describe()

Descriptive statistics `total_income` without NaN values:


count      21,525.00
mean      165,148.16
std        98,062.93
min        20,667.26
25%       107,485.65
50%       143,362.57
75%       195,543.62
max     2,265,604.03
Name: total_income, dtype: float64

<a name='education'></a>
#### Columns `education` and `education_id`

In [18]:
print(data.education.unique()) 

['высшее' 'среднее' 'Среднее' 'СРЕДНЕЕ' 'ВЫСШЕЕ' 'неоконченное высшее'
 'начальное' 'Высшее' 'НЕОКОНЧЕННОЕ ВЫСШЕЕ' 'Неоконченное высшее'
 'НАЧАЛЬНОЕ' 'Начальное' 'Ученая степень' 'УЧЕНАЯ СТЕПЕНЬ'
 'ученая степень']


The `education` column contains the same categories written with different letter cases (e.g., “среднее”, “Среднее”, “СРЕДНЕЕ”). In addition, the categories are recorded in Russian. To standardise the data and make the analysis readable for an English-speaking audience, we:
1) convert all values to lower case;
2) translate the category names into English equivalents.


In [19]:
data['education'] = data['education'].str.lower() # standardise case

education_map = {'начальное': 'primary', # translate categories to English
                 'среднее': 'secondary',
                 'высшее': 'higher education',
                 'неоконченное высшее': 'incomplete higher education',
                 'ученая степень': 'academic degree'}

data['education'] = data['education'].replace(education_map)
print(data.education.unique()) 

['higher education' 'secondary' 'incomplete higher education' 'primary'
 'academic degree']


In [20]:
print(data.education_id.unique()) 

[0 1 2 3 4]


Both columns now have the same number of unique values (five).
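Matching counts are a necessary but not sufficient condition for consistency. A stricter check, sketched below on hypothetical toy data, verifies that each label corresponds to exactly one identifier and vice versa.

```python
import pandas as pd

# Hypothetical toy data with a consistent label-to-id mapping.
toy = pd.DataFrame({
    'education': ['secondary', 'secondary', 'higher education', 'primary'],
    'education_id': [1, 1, 0, 3],
})

# Each id should carry exactly one label, and each label exactly one id.
labels_per_id = toy.groupby('education_id')['education'].nunique()
ids_per_label = toy.groupby('education')['education_id'].nunique()

is_one_to_one = bool((labels_per_id == 1).all() and (ids_per_label == 1).all())
print(is_one_to_one)
```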


<a name='family_status'></a>
#### Columns `family_status` and `family_status_id`

In [21]:
print(data.family_status.unique())
print(data.family_status_id.unique())

['женат / замужем' 'гражданский брак' 'вдовец / вдова' 'в разводе'
 'Не женат / не замужем']
[0 1 2 3 4]


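The `family_status` column is recorded in Russian, and one category starts with a capital letter. To standardise the data and make the analysis readable for an English-speaking audience, we convert the values to lower case and translate the category names into English equivalents.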

In [22]:
data['family_status'] = data['family_status'].str.lower() # standardise case
family_status_map = {'женат / замужем': 'married',
                     'гражданский брак': 'civil partnership',
                     'вдовец / вдова': 'widowed',
                     'в разводе': 'divorced',
                     'не женат / не замужем': 'not married'}

data['family_status'] = data['family_status'].replace(family_status_map)
print(data.family_status.unique())

['married' 'civil partnership' 'widowed' 'divorced' 'not married']


<a name='gender'></a>
#### Column `gender`

In [23]:
data.gender.value_counts()

gender
F      14236
M       7288
XNA        1
Name: count, dtype: int64

As shown in the table above, only one record has the gender value `XNA`. We could remove this row, which would leave only two unique gender values. However, we keep it: at this stage of the analysis, a single record does not affect the results.


<a name='income_type'></a>
#### Column `income_type`

In [24]:
data.income_type.value_counts()

income_type
сотрудник          11119
компаньон           5085
пенсионер           3856
госслужащий         1459
безработный            2
предприниматель        2
студент                1
в декрете              1
Name: count, dtype: int64

The `income_type` column contains categorical values recorded in Russian. To make the dataset readable for an English-speaking audience, we translate these categories into English equivalents.


In [25]:
income_type_map = {'сотрудник': 'employee',
                   'компаньон': 'business owner',
                   'пенсионер': 'pensioner',
                   'госслужащий': 'civil servant',
                   'безработный': 'unemployed',
                   'предприниматель': 'entrepreneur',
                   'студент': 'student',
                   'в декрете': 'on maternity leave'}

data['income_type'] = data['income_type'].replace(income_type_map)

Four categories (unemployed, entrepreneur, student, on maternity leave) contain only one or two clients each. They are too small to be informative, so we remove these six rows from the dataset.


In [26]:
rare_types = ['unemployed', 'entrepreneur', 'student', 'on maternity leave']
data = data[~data['income_type'].isin(rare_types)] # keep only sufficiently represented categories
data['income_type'].value_counts()

income_type
employee          11119
business owner     5085
pensioner          3856
civil servant      1459
Name: count, dtype: int64
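Instead of listing the rare categories by hand, they can be derived from a minimum group size. The sketch below uses hypothetical toy data; the cut-off `MIN_CLIENTS` is an assumption chosen for illustration and is not a value used in the notebook.

```python
import pandas as pd

MIN_CLIENTS = 30  # hypothetical cut-off for a category to be kept

# Hypothetical toy data: two well-populated and two rare categories.
toy = pd.Series(['employee'] * 40 + ['pensioner'] * 35
                + ['student'] * 1 + ['unemployed'] * 2, name='income_type')

counts = toy.value_counts()
rare_types = sorted(counts[counts < MIN_CLIENTS].index)  # categories below the cut-off
kept = toy[~toy.isin(rare_types)]

print(rare_types, len(kept))
```

A threshold makes the rule explicit and keeps it valid if the data are refreshed and category sizes change.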

<a name='debt'></a>
#### Column `debt`

In [27]:
data.debt.value_counts()

debt
0    19780
1     1739
Name: count, dtype: int64

In the `debt` column, `0` means there was no overdue debt, and `1` means the client had overdue debt.

Overall, the share of clients with overdue debt in this dataset is 8.1% (1,739 of 21,519 clients).
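A short sketch of how the overall share follows from the counts in the table above. It also computes the ratio of debtors to non-debtors, which is a different quantity and is easy to confuse with the share. The variable names are illustrative.

```python
no_overdue = 19780  # clients with debt == 0 (from the table above)
overdue = 1739      # clients with debt == 1 (from the table above)

# Share of all clients who had overdue debt, in percent.
share_overdue = overdue / (overdue + no_overdue) * 100

# Ratio of debtors to non-debtors, in percent (not the same as the share).
ratio_to_non_debtors = overdue / no_overdue * 100

print(round(share_overdue, 1), round(ratio_to_non_debtors, 1))
```

On the DataFrame itself, the share is simply the mean of the binary `debt` column multiplied by 100.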


<a name='purpose'></a>
#### Column `purpose`

In [28]:
print('Unique values of column `purpose`:')
print(data.purpose.unique())

Unique values of column `purpose`:
['покупка жилья' 'приобретение автомобиля' 'дополнительное образование'
 'сыграть свадьбу' 'операции с жильем' 'образование'
 'на проведение свадьбы' 'покупка жилья для семьи' 'покупка недвижимости'
 'покупка коммерческой недвижимости' 'покупка жилой недвижимости'
 'строительство собственной недвижимости' 'недвижимость'
 'строительство недвижимости' 'на покупку подержанного автомобиля'
 'на покупку своего автомобиля' 'операции с коммерческой недвижимостью'
 'строительство жилой недвижимости' 'жилье'
 'операции со своей недвижимостью' 'автомобили' 'заняться образованием'
 'сделка с подержанным автомобилем' 'получение образования' 'автомобиль'
 'свадьба' 'получение дополнительного образования' 'покупка своего жилья'
 'операции с недвижимостью' 'получение высшего образования'
 'свой автомобиль' 'сделка с автомобилем' 'профильное образование'
 'высшее образование' 'покупка жилья для сдачи' 'на покупку автомобиля'
 'ремонт жилью' 'заняться высшим образован

We can see that many loan purposes are essentially the same, but they are recorded in different ways. It appears that the operators who entered the data did not have clear guidance on how these values should be recorded. For example, the purposes “сыграть свадьбу” ("to have a wedding"), “на проведение свадьбы” ("for holding a wedding"), and “свадьба” ("wedding") all mean the same thing.

To standardise the `purpose` column, we apply rule-based keyword matching to the original Russian text and map each value to a small set of consistent categories. For readability, we then translate the final grouped categories into English.




In [29]:
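def normalise_purpose_ru(text: str) -> str:
    """Map a raw Russian loan purpose to one of a few keyword-based groups."""
    t = str(text).lower()

    # The order of the checks matters: the first matching keyword wins.
    if re.search(r'свад', t):       # wedding
        return 'свадьба'
    if re.search(r'авто', t):       # car ('авто' also covers 'автомоб...')
        return 'автомобиль'
    if re.search(r'образован', t):  # education
        return 'образование'
    if re.search(r'жиль|недвиж|строит|ремонт', t):  # housing and real estate
        return 'недвижимость'
    return 'другое'                 # fallback: other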

data['purpose_group_ru'] = data['purpose'].apply(normalise_purpose_ru)

purpose_map_en = {'свадьба': 'wedding',
                  'автомобиль': 'car',
                  'образование': 'education',
                  'недвижимость': 'housing',
                  'другое': 'other'}

data['purpose_group'] = data['purpose_group_ru'].replace(purpose_map_en)

data['purpose_group'].value_counts()


purpose_group
housing      10836
car           4314
education     4022
wedding       2347
Name: count, dtype: int64
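Rule-based grouping is easy to validate by listing the raw values that fall through to the fallback category. The sketch below restates the keyword rules in a table-driven form; `RULES`, `group_purpose`, and the sample value 'отпуск' ("vacation") are illustrative assumptions and do not come from the dataset.

```python
import re
import pandas as pd

# Hypothetical restatement of the keyword rules; the first match wins.
RULES = [(r'свад', 'wedding'),
         (r'авто', 'car'),
         (r'образован', 'education'),
         (r'жиль|недвиж|строит|ремонт', 'housing')]

def group_purpose(text):
    """Return the label of the first rule whose pattern occurs in the text."""
    t = str(text).lower()
    for pattern, label in RULES:
        if re.search(pattern, t):
            return label
    return 'other'

# 'отпуск' is a made-up purpose used only to trigger the fallback.
raw = pd.Series(['свадьба', 'на покупку автомобиля', 'жилье', 'ремонт жилью', 'отпуск'])
grouped = raw.apply(group_purpose)

# Raw values that reached the fallback category deserve a manual look.
unmatched = sorted(raw[grouped == 'other'].unique())
print(grouped.tolist(), unmatched)
```

In the notebook output above, no value ends up in the `other` group, which suggests the four keyword rules cover every recorded purpose.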

<a name='data-types'></a>
### Data type conversion

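We convert `days_employed` and `dob_years` to integer type, since both are recorded in whole units (days and years). Note that the conversion truncates the fractional part of `days_employed`. Continuous variables such as `total_income` are kept as float to avoid losing precision (especially after median-based imputation).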


In [30]:
data = data.astype({'days_employed':'int', 'dob_years':'int'})
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,purpose_group_ru,purpose_group
0,1,8437,42,higher education,0,married,0,F,employee,0,253875.64,покупка жилья,недвижимость,housing
1,1,4024,36,secondary,1,married,0,F,employee,0,112080.01,приобретение автомобиля,автомобиль,car
2,0,5623,33,secondary,1,married,0,M,employee,0,145885.95,покупка жилья,недвижимость,housing
3,3,4124,32,secondary,1,married,0,M,employee,0,267628.55,дополнительное образование,образование,education
4,0,2575,53,secondary,1,civil partnership,1,F,pensioner,0,158616.08,сыграть свадьбу,свадьба,wedding
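The integer conversion above drops the fractional part rather than rounding it. A minimal sketch of the difference, using three `days_employed` values from the first preview table with the sign removed:

```python
import pandas as pd

# Values taken from the data preview (absolute values).
days = pd.Series([8437.673028, 4024.803754, 926.185831])

truncated = days.astype('int')        # drops the fractional part
rounded = days.round().astype('int')  # rounds to the nearest whole day first

print(truncated.tolist(), rounded.tolist())
```

For employment length measured in days the difference is at most one day, so either choice is reasonable here; rounding first is the safer habit when the fractional part matters.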


At this point, the data has been cleaned. Next, we will perform the categorisation required to answer the research questions.


<a name='categorization'></a>
### Categorisation

First, we will categorise the `children` column. We will create a new column `children_cat` with three categories: no children, 1–2 children, and large families (3 or more children).


In [31]:
data['children_cat'] = data.children.apply(lambda x: 'no children' if x == 0
                                                    else '1-2 children' if x in (1, 2)
                                                    else 'large families')  # categorise `children`

Next, we will categorise monthly income. First, we review the descriptive statistics for this variable, then define income categories guided by its quartiles.

Quartiles are values that split the data into four groups with approximately equal numbers of observations. The code below uses round thresholds close to the quartile boundaries rather than the exact quartiles, so the four groups are only roughly equal in size:
- low: up to 100,000;
- below average: above 100,000 and up to 150,000;
- above average: above 150,000 and up to 200,000;
- high: above 200,000.

In [32]:
data.total_income.describe()

count      21,519.00
mean      165,144.19
std        98,043.20
min        20,667.26
25%       107,489.29
50%       143,362.57
75%       195,540.34
max     2,265,604.03
Name: total_income, dtype: float64

In [33]:
def categorize_total_income(income): #function for categorizing `total_income`
    if income<=100000:
        return 'low'
    elif 100000<income<=150000:
        return 'below average'
    elif 150000<income<=200000:
        return 'above average'
    else:
        return 'high'
    
data['total_income_cat'] = data['total_income'].apply(categorize_total_income)
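The same binning can be sketched with `pd.cut`, and exact quartile groups can be obtained with `pd.qcut`. The toy incomes below are hypothetical and chosen so that the two schemes produce different group sizes.

```python
import pandas as pd

# Hypothetical toy incomes.
income = pd.Series([50_000, 90_000, 110_000, 120_000,
                    130_000, 140_000, 180_000, 400_000])
labels = ['low', 'below average', 'above average', 'high']

# Fixed thresholds: right-closed bins reproduce the if/elif chain.
edges = [float('-inf'), 100_000, 150_000, 200_000, float('inf')]
fixed = pd.cut(income, bins=edges, labels=labels)

# Exact quartiles: bin edges come from the data, so groups are equal in size.
by_quartile = pd.qcut(income, q=4, labels=labels)

fixed_sizes = [fixed.tolist().count(label) for label in labels]
quartile_sizes = [by_quartile.tolist().count(label) for label in labels]
print(fixed_sizes, quartile_sizes)
```

Fixed thresholds are easier to communicate and stay stable when the data change, while quartile-based bins guarantee balanced groups. Both functions return an ordered categorical, so a separate conversion step would not be needed.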

We convert `children_cat` and `total_income_cat` to ordered categorical types so that the categories are displayed in the intended order.


In [34]:
data = data.astype({'children_cat': pd.CategoricalDtype(['no children', '1-2 children', 'large families'], ordered=True),
                    'total_income_cat': pd.CategoricalDtype(['low', 'below average', 'above average', 'high'], ordered=True)})

### **Data Preprocessing Summary**

We have now completed the data preprocessing.

- We assumed that the duplicate rows are not corrupted data, but represent different clients with identical recorded characteristics.

    This is described in detail in the [Handling duplicates](#duplicates) section. Therefore, these rows were kept in the dataset.

- We identified artefacts and missing values in the dataset. To correct the data and handle missing values, we reviewed each column separately:
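    * the columns [`gender`](#gender) and [`debt`](#debt) did not require corrections;
    * the columns [`family_status`](#family_status) and [`income_type`](#income_type) were translated into English, and four rare `income_type` categories (six rows in total) were removed;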
    * minor issues were found and corrected in [`children`](#children), [`dob_years`](#dob_years), and [`education`](#education);
    * the columns [`days_employed`](#days_employed) and [`total_income`](#total_income) required the most preprocessing, as they contained the largest number of invalid values and missing data;
    * for the [`purpose`](#purpose) column, we standardised the values using rule-based keyword matching and then translated the final grouped categories into English.

- We performed [data categorisation](#categorization) for the variables needed for the subsequent analysis in this work.

Next, we will examine whether any category is associated with a higher probability of overdue debt and answer the questions stated at the beginning.


## 3. Descriptive analysis



For each question, we build a summary table with the number of clients in each group and the share of clients with overdue debt (in percent). This share serves as an estimate of the probability of overdue debt for clients in that group.
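Since the same aggregation is repeated for every question, it can be wrapped in a small helper. The function name `debt_rate_by` and the toy frame are illustrative assumptions; the aggregation itself mirrors the cells in this section.

```python
import pandas as pd

def debt_rate_by(df, column):
    """Number of clients and overdue-debt rate (%) for each value of `column`."""
    return (df.groupby(column, observed=False)['debt']
              .agg(clients='size', debt_rate=lambda s: s.mean() * 100)
              .round({'debt_rate': 1}))

# Hypothetical toy data: one of three car loans is overdue.
toy = pd.DataFrame({'purpose_group': ['car', 'car', 'car', 'housing', 'housing'],
                    'debt': [1, 0, 0, 0, 0]})

summary = debt_rate_by(toy, 'purpose_group')
print(summary)
```

Because `debt` is coded as 0 or 1, its mean within a group equals the share of clients with overdue debt.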


#### Is there a relationship between having children and repaying the loan on time?


In [35]:
debt_by_children = (data.groupby('children_cat', observed=False)['debt']
                    .agg(clients='size', debt_rate=lambda s: s.mean() * 100)
                    .round({'debt_rate': 1}))
debt_by_children

Unnamed: 0_level_0,clients,debt_rate
children_cat,Unnamed: 1_level_1,Unnamed: 2_level_1
no children,14145,7.5
1-2 children,6918,9.2
large families,456,8.6


Clients with 1–2 children have the highest share of overdue debt in this sample (9.2%). Clients with 3 or more children have a slightly lower share (8.6%), and clients without children have the lowest share (7.5%).

Overall, having children is associated with a slightly higher overdue-debt rate in this dataset, but there is no clear monotonic trend as the number of children increases.
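Because the groups differ greatly in size, it helps to gauge how precisely each rate is estimated. The sketch below computes an approximate 95% interval for each rate using the normal approximation for a proportion. This is an added illustration under that assumption, not part of the original analysis; the counts and rates are taken from the table above.

```python
import math

def rate_interval(clients, debt_rate_pct, z=1.96):
    """Approximate 95% interval (normal approximation) for a rate in percent."""
    p = debt_rate_pct / 100
    half_width = z * math.sqrt(p * (1 - p) / clients) * 100
    return debt_rate_pct - half_width, debt_rate_pct + half_width

# (clients, debt_rate in %) per group, from the summary table above.
groups = {'no children': (14145, 7.5),
          '1-2 children': (6918, 9.2),
          'large families': (456, 8.6)}

for name, (clients, rate) in groups.items():
    low, high = rate_interval(clients, rate)
    print(f'{name}: {low:.1f}% to {high:.1f}%')
```

The smallest group (456 clients) yields by far the widest interval, so its rate should be interpreted with more caution than the rates of the two larger groups.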


#### Is there a relationship between marital status and repaying the loan on time?

In [36]:
debt_by_family = (data.groupby('family_status', observed=False)['debt']
                        .agg(clients='size', debt_rate=lambda s: s.mean() * 100)
                        .round({'debt_rate': 1})
                        .sort_values('debt_rate', ascending=False))
debt_by_family

Unnamed: 0_level_0,clients,debt_rate
family_status,Unnamed: 1_level_1,Unnamed: 2_level_1
not married,2812,9.7
civil partnership,4175,9.3
married,12377,7.5
divorced,1195,7.1
widowed,960,6.6


Clients who are not married, as well as clients in a civil partnership, have the highest overdue-debt rates (9.7% and 9.3%, respectively).

The lowest overdue-debt rate is observed among widowed clients (6.6%).


#### Is there a relationship between income level and repaying the loan on time?


In [37]:
debt_by_income = (data.groupby('total_income_cat', observed=False)['debt']
                        .agg(clients='size', debt_rate=lambda s: s.mean() * 100)
                        .round({'debt_rate': 1})
                        .sort_index())
debt_by_income

Unnamed: 0_level_0,clients,debt_rate
total_income_cat,Unnamed: 1_level_1,Unnamed: 2_level_1
low,4461,7.9
below average,7251,8.2
above average,4743,9.2
high,5064,7.1


As expected, clients with a high income level have the lowest overdue-debt rate (7.1%).

However, the highest overdue-debt rate is observed among clients with above-average income (9.2%).


#### How do different loan purposes affect on-time repayment?

In [38]:
debt_by_purpose = (data.groupby('purpose_group', observed=False)['debt']
                        .agg(clients='size', debt_rate=lambda s: s.mean() * 100)
                        .round({'debt_rate': 1})
                        .sort_values('debt_rate', ascending=False))
debt_by_purpose

Unnamed: 0_level_0,clients,debt_rate
purpose_group,Unnamed: 1_level_1,Unnamed: 2_level_1
car,4314,9.3
education,4022,9.2
wedding,2347,7.9
housing,10836,7.2


The highest overdue-debt rates are observed for car loans (9.3%) and education loans (9.2%).

The lowest overdue-debt rate is observed for housing-related loans (7.2%).


## 4. Conclusion

Within the scope of this work, the following tasks were completed:

- Data preprocessing was carried out:
    - anomalous values and missing data were identified;
    - an approach to handling them was developed, documented, and applied;
    - the variables required for the analysis were categorised;
- The research questions were answered.

_**Practical significance:**_

The preprocessing performed in this work enables the cleaned dataset to be used for further analysis and for building credit scoring models. The resulting summary tables help assess how borrower reliability varies with selected client characteristics.

_**Recommendations for data collection:**_

- review the logic used to compute `days_employed` for pensioners and unemployed clients;
- provide standardised data-entry guidance for recording values in the `purpose` column.

