# Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building the **credit score** of a potential customer. The **credit score** is used to evaluate the ability of a potential borrower to repay their loan.



In [1]:
# Loading all the libraries

import pandas as pd


# Load the data
data=pd.read_csv('/datasets/credit_scoring_eng.csv')

ModuleNotFoundError: No module named 'pandas'

## Task 1. Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - whether the client had any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan

[Now let's explore our data. You'll want to see how many columns and rows it has, look at a few rows to check for potential issues with the data.]

In [2]:
# Let's see how many rows and columns our dataset has


data.shape

(21525, 12)

In [3]:
# let's print the first 15 rows

data.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


[The printed sample shows both positive and negative values in `days_employed`, as well as missing values in `total_income` and `days_employed`. The education values are also written in inconsistent letter case, so I think we should convert them all to lowercase.]

In [4]:
# Get info on data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


[Yes, there are missing values in 2 columns. The data may be missing symmetrically in both columns of the same rows (`days_employed`, `total_income`).]

In [5]:
# Let's look at the table filtered by the first column with missing data
data[data['days_employed'].isnull()].head(20)


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
65,0,,21,secondary education,1,unmarried,4,M,business,0,,transactions with commercial real estate
67,0,,52,bachelor's degree,0,married,0,F,retiree,0,,purchase of the house for my family
72,1,,32,bachelor's degree,0,married,0,M,civil servant,0,,transactions with commercial real estate
82,2,,50,bachelor's degree,0,married,0,F,employee,0,,housing
83,0,,52,secondary education,1,married,0,M,employee,0,,housing


[Do the missing values seem symmetric? Yes, at first glance they do. To be sure, however, I want to filter the table down to the rows where both columns are missing and count them. If the number of rows matches the number of missing values in each column, then the assumption about symmetry between the two columns is right.]

In [6]:
# Let's apply multiple conditions for filtering data and look at the number of rows in the filtered table.

data[(data['days_employed'].isnull())&(data['total_income'].isnull())].shape

(2174, 12)

**Intermediate conclusion**

[Does the number of rows in the filtered table match the number of missing values? Yes, they match. Thus, the missing values in the two columns are symmetrical.]
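
The symmetry check can also be done directly by comparing the two missing-value masks row by row. The following is a minimal sketch on a toy frame; the frame `toy` and its values are assumptions made for illustration and are not taken from the bank's data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    'days_employed': [100.0, np.nan, 250.0, np.nan],
    'total_income': [20000.0, np.nan, 31000.0, np.nan],
})

# Boolean masks marking where each column is missing
missing_days = toy['days_employed'].isnull()
missing_income = toy['total_income'].isnull()

# True only if both columns are missing in exactly the same rows
same_rows = bool((missing_days == missing_income).all())
print(same_rows)
```

If `same_rows` is `True`, no row has a gap in only one of the two columns.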

[We need to tell the colleagues who prepare the reports about the missing values and their symmetry in the dataset. About 10% of clients are missing information in both `days_employed` and `total_income`. Removing such a fraction of the data could distort the final results, so we will fill in the missing values instead.]

[First of all, we should consider the nature of the missing values. Symmetrical missing values may reflect specific client characteristics. Two of the main indicators when deciding whether to issue a loan are the type of employment and the customer's age. So we will start by checking whether the missing values depend on the `income_type` column (type of employment) and the `dob_years` column (customer's age).]

In [7]:
# Let's select the clients whose `days_employed` value is missing

data_nan = data[data['days_employed'].isnull()]

In [8]:
# Checking distribution

data_nan['income_type'].value_counts(normalize=True)

employee         0.508280
business         0.233671
retiree          0.189972
civil servant    0.067617
entrepreneur     0.000460
Name: income_type, dtype: float64

Data is mostly missing for employees, businesspeople, and retirees. Let's display the distribution in the original dataset to check whether the missing values are random.

**Possible reasons for missing values in data**

[The lack of data in `days_employed` and `total_income` among retirees may be because customers of that age receive income only as a pension and applied for the loan after retiring, so they had no total income or days employed at the time of the application. It's possible that `total_income` records private income rather than a government pension.
It's necessary to request detailed metadata for the dataset from the colleagues who generate the reports.]

[Let's start checking whether the missing values are random.]

In [9]:
# Checking the distribution in the whole dataset

data['income_type'].value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
entrepreneur                       2
unemployed                         2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

**Intermediate conclusion**

[The distribution of income type in the original dataset is very similar to the distribution in the filtered table.]
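
To compare the two distributions more easily, the shares can be placed side by side in one table. This is a sketch on toy data; the frame `toy` and the column names `all_rows`, `missing_rows`, and `difference` are hypothetical names chosen for this example.

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'business', 'retiree',
                    'employee', 'business', 'retiree', 'employee'],
    'days_employed': [100.0, None, 300.0, None, 50.0, None, 700.0, None],
})

# Rows where `days_employed` is missing
missing = toy[toy['days_employed'].isnull()]

# Shares of each income type in the full data and in the rows with gaps
comparison = pd.DataFrame({
    'all_rows': toy['income_type'].value_counts(normalize=True),
    'missing_rows': missing['income_type'].value_counts(normalize=True),
})
comparison['difference'] = comparison['missing_rows'] - comparison['all_rows']
print(comparison)
```

Differences close to zero for every income type are consistent with values missing at random with respect to this column.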


In [10]:
# Check for other reasons and patterns that could lead to missing values
data['income_type'].value_counts()/len(data['income_type'])


employee                       0.516562
business                       0.236237
retiree                        0.179141
civil servant                  0.067782
entrepreneur                   0.000093
unemployed                     0.000093
student                        0.000046
paternity / maternity leave    0.000046
Name: income_type, dtype: float64

**Intermediate conclusion**

[The situation may still indicate that values are missing at random. To test this, let's look at the data on clients who may also receive government benefits, such as those on maternity leave and students. We will also compare the data for entrepreneurs.]

In [11]:
data[(data['income_type']=='student')|(data['income_type']=='paternity / maternity leave')|\
     (data['income_type']=='unemployed')|(data['income_type']=='entrepreneur')]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
3133,1,337524.466835,31,secondary education,1,married,0,M,unemployed,1,9593.119,buying property for renting out
5936,0,,58,bachelor's degree,0,married,0,M,entrepreneur,0,,buy residential real estate
9410,0,-578.751554,22,bachelor's degree,0,unmarried,4,M,student,0,15712.26,construction of own property
14798,0,395302.838654,45,Bachelor's Degree,0,civil partnership,1,F,unemployed,0,32435.602,housing renovation
18697,0,-520.848083,27,bachelor's degree,0,civil partnership,1,F,entrepreneur,0,79866.103,having a wedding
20845,2,-3296.759962,39,SECONDARY EDUCATION,1,married,0,F,paternity / maternity leave,1,8612.661,car


In [12]:
data_nan_pivot=data_nan.pivot_table(index='dob_years', columns='income_type', values='debt', aggfunc='count', margins=True)

In [13]:
data_nan_pivot.sort_values(by='All', ascending=False)

dob_years,business,civil servant,employee,entrepreneur,retiree,All
All,508.0,147.0,1105.0,1.0,413.0,2174
34,22.0,4.0,43.0,,,69
40,26.0,9.0,30.0,,1.0,66
42,20.0,1.0,40.0,,4.0,65
31,28.0,7.0,29.0,,1.0,65
35,25.0,2.0,37.0,,,64
36,16.0,9.0,36.0,,2.0,63
47,16.0,5.0,38.0,,,59
41,14.0,2.0,42.0,,1.0,59
30,10.0,4.0,44.0,,,58


In [14]:
data_nan_pivot.sort_values(by='employee', ascending=False)

dob_years,business,civil servant,employee,entrepreneur,retiree,All
All,508.0,147.0,1105.0,1.0,413.0,2174
30,10.0,4.0,44.0,,,58
34,22.0,4.0,43.0,,,69
41,14.0,2.0,42.0,,1.0,59
42,20.0,1.0,40.0,,4.0,65
47,16.0,5.0,38.0,,,59
35,25.0,2.0,37.0,,,64
49,9.0,3.0,36.0,,2.0,50
36,16.0,9.0,36.0,,2.0,63
37,16.0,2.0,35.0,,,53


The assumption that the missing values are not random is not confirmed. We examined the rows with missing values by type of employment and by age and found no relationship between these customer characteristics and the gaps. The summary table shows that most customers with missing data are over 30 years old and work as employees or in business. So the data appears to be missing completely at random.

To restore the missing values well, we will rely on the characteristics that income most commonly depends on: the client's age, education, and type of employment.

I will split customers into age groups of roughly 20 years:
1. 25 or younger
2. 26-45 y.o.
3. 46-65 y.o.
4. 66 or older

The recovery of monthly income will be based on three columns that contain duplicates and other issues, so we should first transform the data and only then restore the missing values. The customer age column contains some zero values. We should also check the data for duplicates and convert the `education` column to lowercase.
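
A more direct way to look for a pattern in the gaps is to compute the share of missing values inside each group. The sketch below uses a toy frame with made-up values; on the real data, similar shares across groups would support the conclusion that the values are missing at random.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'retiree', 'retiree', 'business', 'business'],
    'days_employed': [120.0, np.nan, 900.0, np.nan, np.nan, 310.0],
})

# The mean of a boolean mask is the share of True values, i.e. the share of gaps
missing_share = (
    toy['days_employed'].isnull()
    .groupby(toy['income_type'])
    .mean()
)
print(missing_share)
```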

## Data transformation

[Let's begin with removing duplicates and fixing educational information.]

In [15]:
# Let's see all values in education column to check if and what spellings will need to be fixed
data['education'].value_counts()

secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
Graduate Degree            1
GRADUATE DEGREE            1
Name: education, dtype: int64

In [16]:
data.duplicated().sum()

54

In [17]:
data['education']=data['education'].str.lower()

In [18]:
# Checking all the values in the column to make sure we fixed them

data['education'].value_counts()

secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64

In [19]:

data.duplicated().sum()

71

[Yes, there is something strange here: the number of duplicates increased from 54 to 71. Rows that used to differ only in the letter case of `education` are now identical, so more of them are counted as duplicates. We will remove the duplicates after fixing the other problematic values.]
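
The effect can be reproduced on a tiny example. This is a sketch with made-up rows, showing that rows differing only in letter case start to count as duplicates once the text is lowercased.

```python
import pandas as pd

# Three rows that differ only in the letter case of `education`
toy = pd.DataFrame({
    'education': ['secondary education', 'Secondary Education', 'SECONDARY EDUCATION'],
    'dob_years': [40, 40, 40],
})

before = toy.duplicated().sum()   # case differs, so no row repeats another
toy['education'] = toy['education'].str.lower()
after = toy.duplicated().sum()    # the second and third rows now repeat the first
print(before, after)
```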

In [20]:
# Checking the values in the `children` column for anything that needs fixing

data['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

In [21]:
data.drop(data[data['children']==20].index, inplace=True)
data.drop(data[data['children']==-1].index, inplace=True)

[We checked the `children` column and found some strange values, such as -1 and 20 children. This may have been an error made by a manager, so we will inform our colleagues. Deleting these rows is not a big deal, since they make up less than 1% of the sample.]
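
As an alternative sketch, the implausible values can be flagged with one boolean mask, which also gives their share before anything is dropped. The toy frame and the names `invalid`, `share_invalid`, and `cleaned` are assumptions for this example.

```python
import pandas as pd

# Toy column with two implausible values (-1 and 20)
toy = pd.DataFrame({'children': [0, 1, 20, 2, -1, 0]})

# Flag the implausible values with a single mask
invalid = toy['children'].isin([-1, 20])

# Share of flagged rows, then the frame without them
share_invalid = invalid.mean()
cleaned = toy[~invalid].reset_index(drop=True)
print(f"{share_invalid:.2%}", cleaned['children'].tolist())
```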

In [22]:
data['children'].value_counts()

0    14149
1     4818
2     2055
3      330
4       41
5        9
Name: children, dtype: int64

In [23]:
# Find problematic data in `days_employed`, if they exist, and calculate the percentage
print('Rows whith negative days_employed:', data[data['days_employed']<0].shape[0])
print(f"% Rows whith negative days_employed:{ data[data['days_employed']<0].shape[0]/data.shape[0]:.2%}")



Rows whith negative days_employed: 15809
% Rows whith negative days_employed:73.87%


Although the share of problematic data is high (about 74%), I think it could be due to a technical error. So we will inform our colleagues about this and convert the negative `days_employed` values to positive ones.

In [24]:
# Address the problematic values, if they exist
data['days_employed'] = data['days_employed'].abs()


In [25]:
# Check the result - make sure it's fixed
data.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


[Let's now look at the client's age and whether there are any issues there.] An age of zero makes no sense. We will report this error to our colleagues and drop these rows, because they make up less than 1% of the data.

In [26]:
data.duplicated().sum()

71


In [27]:
# Address the issues in the `dob_years` column, if they exist
data['dob_years'].value_counts().sort_index().head()

0     100
19     14
20     51
21    110
22    183
Name: dob_years, dtype: int64

In [28]:
data.drop(data[data['dob_years']==0].index, inplace=True)

In [29]:
data[data['dob_years']==0]['dob_years'].count()

0

[Now let's check the `family_status` column. Let's use str.lower() to bring all values to a single case.]

In [30]:
# Let's see the values for the column
data['family_status'].unique()


array(['married', 'civil partnership', 'widow / widower', 'divorced',
       'unmarried'], dtype=object)

In [31]:
# We will convert the values to lowercase


data['family_status']=data['family_status'].str.lower()

In [32]:

data.duplicated().sum()

71

[Now let's check the `gender` column. We see 1 person with the gender XNA(?), so let's drop this erroneous row.]

In [33]:
# Let's see the values in the column
data['gender'].value_counts()

F      14083
M       7218
XNA        1
Name: gender, dtype: int64

In [34]:
# Address the problematic values, if they exist

data.drop(data[data['gender']=='XNA'].index, inplace=True)

In [35]:
# Check the result - make sure it's fixed

data['gender'].value_counts()

F    14083
M     7218
Name: gender, dtype: int64

[Now let's check the `income_type` column.  We see no problems here ]

In [36]:
# Let's see the values in the column

data['income_type'].value_counts()

employee                       10996
business                        5033
retiree                         3819
civil servant                   1447
entrepreneur                       2
unemployed                         2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

In [37]:
# Address the problematic values, if they exist
# No problematic values. Let's move on to removing duplicates

[Now let's see if we have any duplicates in our data. We have 71 duplicates.]

In [38]:
# Checking duplicates

data.duplicated().sum()

71

In [39]:
# Checking retirees. Some people retired very early, even at 22 y.o. It could be a mistake, but we will keep this data.
# The number is small, and it can be real, e.g. people receiving a disability pension.
data[data['income_type']=='retiree']['dob_years'].value_counts().sort_index().head(10)

22    1
24    1
26    2
27    3
31    1
32    3
33    2
34    3
35    1
36    5
Name: dob_years, dtype: int64

In [40]:
data=data.drop_duplicates().reset_index(drop=True)

In [41]:
# Last check whether we have any duplicates
data.duplicated().sum()

0

In [42]:
# Check the size of the dataset that you now have after your first manipulations with it
data.shape

(21230, 12)

The dataset started with 21525 rows, and 21230 remain after the transformation. So I removed 295 rows, which is a small share (1.37%).
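
The share of removed rows follows directly from the two row counts reported above. A short worked calculation:

```python
# Row counts taken from the `data.shape` outputs before and after cleaning
rows_before = 21525
rows_after = 21230

removed = rows_before - rows_after
share_removed = removed / rows_before
print(removed, f"{share_removed:.2%}")
```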

## Working with missing values

[To speed up working with the data, we want to use dictionaries for the columns where IDs are provided. So we will build dictionaries for education and family status.]
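
As a sketch of how such a dictionary can be used for lookups, an ID-to-label table can be turned into a plain Python `dict`. The toy frame and the name `education_lookup` are assumptions for this example.

```python
import pandas as pd

# Toy frame with repeated (id, label) pairs, as in the real columns
toy = pd.DataFrame({
    'education_id': [0, 1, 1, 0, 2],
    'education': ["bachelor's degree", 'secondary education', 'secondary education',
                  "bachelor's degree", 'some college'],
})

# Keep one row per identifier, then convert to a dict for lookups
education_lookup = (
    toy[['education_id', 'education']]
    .drop_duplicates()
    .set_index('education_id')['education']
    .to_dict()
)
print(education_lookup[1])
```

With the lookup in place, the text column could in principle be dropped from the main table and restored by ID when needed.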

In [43]:
# Find the dictionaries
education_dict=data[['education_id', 'education']]
education_dict=education_dict.drop_duplicates().reset_index(drop=True)
education_dict

Unnamed: 0,education_id,education
0,0,bachelor's degree
1,1,secondary education
2,2,some college
3,3,primary education
4,4,graduate degree


In [44]:
family_status_dict=data[['family_status_id', 'family_status']]
family_status_dict=family_status_dict.drop_duplicates().reset_index(drop=True)

In [45]:
family_status_dict

Unnamed: 0,family_status_id,family_status
0,0,married
1,1,civil partnership
2,2,widow / widower
3,3,divorced
4,4,unmarried


### Restoring missing values in `total_income`

[Let's create an age category to make calculating the missing values in the monthly income column easier. To keep it simple, we will divide age into 4 categories: 25 and under, 26-45 y.o., 46-65 y.o., and 66 and over.]
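
A vectorized alternative to a row-by-row function is `pd.cut`. The sketch below assumes the same four brackets and reuses the labels `'<25'`, `'26-45'`, `'46-65'`, and `'>66'`; the upper edge of 150 is an arbitrary assumption that simply needs to exceed any real age.

```python
import pandas as pd

# Ages chosen to sit on both sides of every bracket boundary
ages = pd.Series([19, 25, 26, 45, 46, 65, 66, 74])

# Bin edges are right-inclusive: (0, 25], (25, 45], (45, 65], (65, 150]
age_groups = pd.cut(
    ages,
    bins=[0, 25, 45, 65, 150],
    labels=['<25', '26-45', '46-65', '>66'],
)
print(age_groups.tolist())
```

Note that an age of 0 falls outside the first bin and becomes missing, so this approach assumes zero ages have already been removed.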


In [46]:
# Let's write a function that calculates the age category

def age_group(age):
    """Return the age-bracket label for an age given in years."""
    if age <= 25:
        return '<25'
    if age <= 45:
        return '26-45'
    if age <= 65:
        return '46-65'
    # The label '>66' covers everyone aged 66 and over
    return '>66'
            

In [47]:
# Test if the function works
age_group(41)

'26-45'

In [48]:
# Creating new column based on function
data['age_group']=data['dob_years'].apply(age_group)


In [49]:
# Checking the values in the new column
data['age_group'].value_counts()


26-45    10900
46-65     8404
<25       1226
>66        700
Name: age_group, dtype: int64

[Income usually depends on three factors: 1. age, 2. education, 3. type of employment. I want to find out whether I should use mean or median values to replace the missing values. To make this decision, let's look at the distribution of income by age and type of employment.]

[I will use a pivot table that only has data without missing values. This data will be used to restore the missing values.]

In [50]:
# Create a table without missing values and print a few of its rows to make sure it looks fine
data_without_nan = data[data['days_employed'].notnull()]

In [51]:
data_without_nan.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,26-45
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,26-45
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,26-45
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,26-45
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,46-65
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,26-45
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,26-45
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,46-65
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,26-45
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,26-45


In [52]:
data_without_nan.pivot_table(index='income_type', columns= 'age_group', values='total_income', aggfunc='mean')


income_type,26-45,46-65,<25,>66
business,32930.411643,33051.369974,25752.351209,33470.062
civil servant,27442.064958,27499.138709,24510.974242,30992.299
employee,26128.693783,26015.18575,22265.980563,26185.02487
entrepreneur,79866.103,,,
paternity / maternity leave,8612.661,,,
retiree,24408.993847,22343.926661,14298.976,19663.470405
student,,,15712.26,
unemployed,21014.3605,,,


In [53]:
data_without_nan.pivot_table(index='income_type', columns= 'age_group', values='total_income', aggfunc='median')


income_type,26-45,46-65,<25,>66
business,28094.431,28406.377,22814.5995,29314.4045
civil servant,24368.015,23847.285,22758.5535,26089.687
employee,23066.173,22781.846,20634.665,24643.1985
entrepreneur,79866.103,,,
paternity / maternity leave,8612.661,,,
retiree,20028.725,19420.007,14298.976,17074.579
student,,,15712.26,
unemployed,21014.3605,,,


In [54]:
data_without_nan.pivot_table(index='income_type', columns= 'education', values='total_income', aggfunc='median')


income_type,bachelor's degree,graduate degree,primary education,secondary education,some college
business,32285.664,,21887.825,25441.23,28778.744
civil servant,27564.459,17822.757,23734.287,21864.475,25694.775
employee,26587.423,31771.321,20159.186,21841.813,24209.43
entrepreneur,79866.103,,,,
paternity / maternity leave,,,,8612.661,
retiree,23030.247,28334.215,16415.785,18372.071,19221.903
student,15712.26,,,,
unemployed,32435.602,,,9593.119,


In [55]:
data_without_nan.pivot_table(index=['income_type', 'age_group'], columns= 'education', values='total_income', \
                             aggfunc='median')

income_type,age_group,bachelor's degree,graduate degree,primary education,secondary education,some college
business,26-45,32348.966,,21664.873,25857.158,29868.2435
business,46-65,34807.689,,26144.483,25569.3035,35066.262
business,<25,24721.904,,19891.3025,21711.044,22499.4125
business,>66,36808.968,,,24259.687,
civil servant,26-45,27034.605,17822.757,30545.949,21933.996,26452.0395
civil servant,46-65,29517.239,,16922.625,21549.283,42561.9115
civil servant,<25,23839.4605,,,21857.0015,21640.4795
civil servant,>66,31984.616,,,25307.4555,
employee,26-45,26766.026,25161.5835,19810.253,22047.114,25591.706
employee,46-65,27847.316,42945.794,20923.857,21741.8725,27120.275


The median income describes these three characteristics best.


[I will use the median to avoid the influence of abnormal outliers. This is a common choice for income: if Elon Musk walks into a bar, every visitor becomes a multimillionaire on average.
We should also check how income depends on the type of income and on education level.]
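
The bar example can be checked with a few made-up numbers. In this sketch the incomes are hypothetical; it shows that one extreme value moves the mean enormously while the median barely changes.

```python
from statistics import mean, median

# Hypothetical monthly incomes of five bar visitors
incomes = [20_000, 22_000, 25_000, 27_000, 30_000]

# The same bar after one extremely wealthy visitor walks in
incomes_with_outlier = incomes + [1_000_000_000]

print(mean(incomes), median(incomes))
print(mean(incomes_with_outlier), median(incomes_with_outlier))
```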


In [56]:
# Build a lookup table of median incomes, then write a function that reads from it
pivot_table_for_total_income = data_without_nan.pivot_table(index=['age_group', 'income_type'],
                                                            columns='education',
                                                            values='total_income',
                                                            aggfunc='median')

def get_median_total_income(x):
    """Return the median income for the row's education, age group and income type."""
    education = x['education']
    age_group = x['age_group']
    income_type = x['income_type']
    try:
        return pivot_table_for_total_income[education][age_group][income_type]
    except KeyError:
        # No clients with known income in this group; flag the row for manual handling
        return 'error'

In [57]:
# Check if it works
pivot_table_for_total_income ["some college"]['<25']['employee']

21588.03

In [58]:
# Apply it to every row
data['median_total_income']=data.apply(get_median_total_income, axis=1)

In [59]:
# Check if we got any errors
data[data['median_total_income']=='error']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,median_total_income
5880,0,,58,bachelor's degree,0,married,0,M,entrepreneur,0,,buy residential real estate,46-65,error


[I want to fix this manually, because there is no data to calculate a median for this entrepreneur's group. So let's copy the income from the first entrepreneur to the second.]

We replace all the missing values in `total_income` with the appropriate `median_total_income`.
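
A group median can also be computed and aligned to the rows in one step with `groupby(...).transform('median')`. This is a sketch of an alternative to the pivot-table lookup, not the approach used in this notebook, and the toy frame holds made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap in each group (illustrative values only)
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree', 'retiree'],
    'age_group': ['26-45', '26-45', '26-45', '46-65', '46-65', '46-65'],
    'total_income': [20000.0, 30000.0, np.nan, 15000.0, np.nan, 19000.0],
})

# Median of each (income_type, age_group) group, broadcast back to every row
group_median = toy.groupby(['income_type', 'age_group'])['total_income'].transform('median')

# Keep known incomes and fill only the gaps
toy['total_income'] = toy['total_income'].fillna(group_median)
print(toy)
```

A group with no known incomes at all still ends up with a missing value, so a case like the lone entrepreneur would need manual handling either way.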

In [60]:
# Fill missing `total_income` values with the median of the matching group
data['total_income']=data['total_income'].fillna(data['median_total_income'])

[When you think you've finished with `total_income`, check that the total number of values in this column matches the number of values in other ones.]

In [61]:
# Checking the number of entries in the columns
data.head(15)


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,median_total_income
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,26-45,26766.026
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,26-45,22047.114
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,26-45,22047.114
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,26-45,22047.114
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,46-65,18749.915
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,26-45,32348.966
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,26-45,32348.966
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,46-65,21741.8725
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,26-45,26766.026
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,26-45,22047.114


In [62]:
data.loc[data['total_income']=='error', ['total_income']]=79866
data.loc[data['median_total_income']=='error',['median_total_income']]=79866

In [63]:
data[data['income_type']=='entrepreneur']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,median_total_income
5880,0,,58,bachelor's degree,0,married,0,M,entrepreneur,0,79866.0,buy residential real estate,46-65,79866.0
18450,0,520.848083,27,bachelor's degree,0,civil partnership,1,F,entrepreneur,0,79866.103,having a wedding,26-45,79866.103


In [64]:
data['total_income']=data['total_income'].astype('int')

In [65]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21230 entries, 0 to 21229
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   children             21230 non-null  int64  
 1   days_employed        19149 non-null  float64
 2   dob_years            21230 non-null  int64  
 3   education            21230 non-null  object 
 4   education_id         21230 non-null  int64  
 5   family_status        21230 non-null  object 
 6   family_status_id     21230 non-null  int64  
 7   gender               21230 non-null  object 
 8   income_type          21230 non-null  object 
 9   debt                 21230 non-null  int64  
 10  total_income         21230 non-null  int64  
 11  purpose              21230 non-null  object 
 12  age_group            21230 non-null  object 
 13  median_total_income  21230 non-null  object 
dtypes: float64(1), int64(6), object(7)
memory usage: 2.3+ MB


### Restoring values in `days_employed`

[Let's check the parameters that may help restore the missing values in this column. I want to find out whether to use mean or median values to replace the missing values, so I will conduct research similar to what I did when restoring data in the previous column.]

In [66]:
# Distribution of `days_employed` medians based on your identified parameters
data_without_nan.pivot_table(index='income_type', columns= 'age_group', values='days_employed', aggfunc='median')

age_group,26-45,46-65,<25,>66
income_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
business,1484.965033,2086.501856,748.818654,2318.709538
civil servant,2521.457009,3601.346755,1132.739641,4137.331615
employee,1513.345802,2184.737485,798.699314,2830.361431
entrepreneur,520.848083,,,
paternity / maternity leave,3296.759962,,,
retiree,364348.197352,365062.716683,334764.259831,366157.236636
student,,,578.751554,
unemployed,366413.652744,,,


In [67]:
# Distribution of `days_employed` means based on your identified parameters
data_without_nan.pivot_table(index='income_type', columns='age_group', values='days_employed', aggfunc='mean')


age_group,26-45,46-65,<25,>66
income_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
business,1884.45509,2898.938303,861.980178,3725.387
civil servant,2940.240153,4583.807565,1174.47888,4145.742201
employee,2067.795829,3148.116338,929.403103,4092.413329
entrepreneur,520.848083,,,
paternity / maternity leave,3296.759962,,,
retiree,364295.980201,364941.071441,334764.259831,365676.37598
student,,,578.751554,
unemployed,366413.652744,,,


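Again we will use the median. In several groups the mean is noticeably higher than the median (for example, about 1884 versus 1485 days for business clients aged 26-45), which points to skew from large values, so the median is the more representative value for restoring `days_employed`.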
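A toy illustration of why the median is the safer choice: a single value of the size seen for retirees (around 365,000 days) drags the mean far away from the typical case, while the median barely moves. The numbers below are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical work-experience values in days; the last one mimics the
# implausibly large values seen for retirees in the tables above.
days = [800, 1500, 2100, 365000]

typical_mean = mean(days)      # pulled up by the single extreme value
typical_median = median(days)  # stays close to the bulk of the data

print(typical_mean, typical_median)
```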

In [68]:
# Median days employed for each income type, used to fill the missing values
groupby_income_type_for_days_employed= data.groupby('income_type')['days_employed'].median()

def fill_nan_days_employed(income_type):
    # Look up the median for this income type; flag types that have no median
    try:
        return groupby_income_type_for_days_employed[income_type]
    except KeyError:
        return 'error'

In [69]:
# Check that the function works

fill_nan_days_employed('student')

578.7515535382181

In [70]:
# Apply function to the income_type

data['median_days_employed']= data['income_type'].apply(fill_nan_days_employed)

In [71]:
# Check if function worked
data['median_days_employed'].value_counts()


1573.791064      10961
1555.993659       5026
365269.100414     3792
2672.903939       1445
366413.652744        2
520.848083           2
578.751554           1
3296.759962          1
Name: median_days_employed, dtype: int64

In [72]:
# Replacing missing values
data['days_employed']=data['days_employed'].fillna(data['median_days_employed'])
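The same fill can probably be done without a helper column. The sketch below is one possible alternative, not the approach used in this notebook: it relies on `groupby(...).transform('median')`, and the `toy` frame with its values is hypothetical, with only the column names mirroring the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree'],
    'days_employed': [1000.0, np.nan, 3000.0, 365000.0, np.nan],
})

# transform('median') returns one value per row, aligned with the original index,
# so the result can be passed straight to fillna.
group_median = toy.groupby('income_type')['days_employed'].transform('median')
toy['days_employed'] = toy['days_employed'].fillna(group_median)

print(toy)
```

Each missing value receives the median of its own income type, which is what the `median_days_employed` column achieves above.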


In [73]:
data.isnull().sum()

children                0
days_employed           0
dob_years               0
education               0
education_id            0
family_status           0
family_status_id        0
gender                  0
income_type             0
debt                    0
total_income            0
purpose                 0
age_group               0
median_total_income     0
median_days_employed    0
dtype: int64

Let's check that the total number of values in this column matches the number of values in the other columns.

In [74]:
data['days_employed']=data['days_employed'].astype('int')
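Note that `astype('int')` drops the fractional part instead of rounding. For a count of days the difference is at most one day, so it should not matter for this analysis. The small sketch below shows the effect on two of the group medians from the output above.

```python
import pandas as pd

# Two of the group medians shown above, used here as sample inputs.
values = pd.Series([520.848083, 1573.791064])

truncated = values.astype('int')        # drops the fractional part
rounded = values.round().astype('int')  # rounds to the nearest day first

print(truncated.tolist(), rounded.tolist())
```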

In [75]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21230 entries, 0 to 21229
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   children              21230 non-null  int64  
 1   days_employed         21230 non-null  int64  
 2   dob_years             21230 non-null  int64  
 3   education             21230 non-null  object 
 4   education_id          21230 non-null  int64  
 5   family_status         21230 non-null  object 
 6   family_status_id      21230 non-null  int64  
 7   gender                21230 non-null  object 
 8   income_type           21230 non-null  object 
 9   debt                  21230 non-null  int64  
 10  total_income          21230 non-null  int64  
 11  purpose               21230 non-null  object 
 12  age_group             21230 non-null  object 
 13  median_total_income   21230 non-null  object 
 14  median_days_employed  21230 non-null  float64
dtypes: float64(1), int6

## Categorization of purpose

[To answer the questions and test the hypotheses, we want to work with categorized data. The purpose data will need to be categorized.]


In [76]:
# Print the values for your selected data for categorization
data['purpose'].value_counts()


wedding ceremony                            785
having a wedding                            759
to have a wedding                           755
real estate transactions                    669
buy commercial real estate                  655
buying property for renting out             647
transactions with commercial real estate    643
housing transactions                        641
purchase of the house for my family         636
housing                                     635
purchase of the house                       634
property                                    627
construction of own property                626
transactions with my real estate            623
building a property                         619
purchase of my own house                    618
building a real estate                      617
buy real estate                             612
housing renovation                          602
buy residential real estate                 599
buying my own car                       

[Let's check unique values of purpose]

In [77]:
# Check the unique values
unique_purpose=data['purpose'].unique()

In [78]:
unique_purpose

array(['purchase of the house', 'car purchase', 'supplementary education',
       'to have a wedding', 'housing transactions', 'education',
       'having a wedding', 'purchase of the house for my family',
       'buy real estate', 'buy commercial real estate',
       'buy residential real estate', 'construction of own property',
       'property', 'building a property', 'buying a second-hand car',
       'buying my own car', 'transactions with commercial real estate',
       'building a real estate', 'housing',
       'transactions with my real estate', 'cars', 'to become educated',
       'second-hand car purchase', 'getting an education', 'car',
       'wedding ceremony', 'to get a supplementary education',
       'purchase of my own house', 'real estate transactions',
       'getting higher education', 'to own a car', 'purchase of a car',
       'profile education', 'university education',
       'buying property for renting out', 'to buy a car',
       'housing renovation', 'going

In [79]:
# Let's write a function to categorize the data based on common topics
def purpose_category(purpose):
    # `purpose` is the loan-purpose text; the first matching keyword decides the category
    if 'car' in purpose:
        return 'car'
    elif 'wedd' in purpose:
        return 'wedding'
    elif 'hous' in purpose:
        return 'real estate'
    elif 'educat' in purpose or 'university' in purpose:
        return 'education'
    else:
        # Remaining purposes mention property or real estate
        return 'real estate'

In [80]:
purpose_category('property')



'real estate'

In [81]:
purpose_category('to buy a car')


'car'

In [82]:
purpose_category('wedding ceremony')


'wedding'

In [83]:
purpose_category('housing renovation')


'real estate'

In [84]:
purpose_category('going to university')


'education'

I wrote a function and use it with the apply() method to aggregate the loan purposes into 4 big categories: car, real estate, education, wedding. Every purpose falls into exactly one of these categories.
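The keyword matching could also be driven by a small lookup table, which makes it easier to add a category later. This is only a sketch: the names `PURPOSE_KEYWORDS` and `categorize_purpose` are hypothetical and are not used elsewhere in the notebook.

```python
# Hypothetical keyword table: each category lists the substrings that identify it.
# Order matters, because the first matching category wins.
PURPOSE_KEYWORDS = {
    'car': ['car'],
    'wedding': ['wedd'],
    'education': ['educat', 'university'],
}

def categorize_purpose(purpose, default='real estate'):
    """Return the first category whose keyword occurs in the purpose text."""
    for category, keywords in PURPOSE_KEYWORDS.items():
        if any(keyword in purpose for keyword in keywords):
            return category
    # Everything left (housing, property, real estate) falls into the default.
    return default

print(categorize_purpose('going to university'))
```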

In [85]:
#ADDED BY REVIEWER

data['purpose_category']=data['purpose'].apply(purpose_category)

data.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,median_total_income,median_days_employed,purpose_category
0,1,8437,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house,26-45,26766.026,1573.791064,real estate
1,1,4024,36,secondary education,1,married,0,F,employee,0,17932,car purchase,26-45,22047.114,1573.791064,car
2,0,5623,33,secondary education,1,married,0,M,employee,0,23341,purchase of the house,26-45,22047.114,1573.791064,real estate
3,3,4124,32,secondary education,1,married,0,M,employee,0,42820,supplementary education,26-45,22047.114,1573.791064,education
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding,46-65,18749.915,365269.100414,wedding
5,0,926,27,bachelor's degree,0,civil partnership,1,M,business,0,40922,purchase of the house,26-45,32348.966,1555.993659,real estate
6,0,2879,43,bachelor's degree,0,married,0,F,business,0,38484,housing transactions,26-45,32348.966,1555.993659,real estate
7,0,152,50,secondary education,1,married,0,M,employee,0,21731,education,46-65,21741.8725,1573.791064,education
8,2,6929,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337,having a wedding,26-45,26766.026,1573.791064,wedding
9,0,2188,41,secondary education,1,married,0,M,employee,0,23108,purchase of the house for my family,26-45,22047.114,1573.791064,real estate


In [86]:
data['total_income'].value_counts()

22047    470
18749    260
21741    250
25857    172
26766    150
        ... 
18317      1
57230      1
22415      1
12176      1
52973      1
Name: total_income, Length: 15279, dtype: int64

In [87]:
data['total_income'].describe().astype('int')

count     21230
mean      26493
std       15768
min        3306
25%       17131
50%       22934
75%       31720
max      362496
Name: total_income, dtype: int64

In [88]:
# Let's create a function that assigns each income to an income level
def income_level(income):
    if 0 < income <= 17000:
        return 'small'
    elif 17000 < income <= 40000:
        return 'average'
    elif 40000 < income <= 80000:
        return 'above average'
    elif 80000 < income <= 200000:
        return 'high'
    elif income > 200000:
        return 'very high'



In [89]:
data['income_level']=data['total_income'].apply(income_level)

In [90]:
data['income_level'].value_counts()

average          13234
small             5212
above average     2562
high               211
very high           11
Name: income_level, dtype: int64
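For reference, the same binning can likely be expressed with `pd.cut`. The sketch below reuses the thresholds and labels from `income_level`; the sample incomes are hypothetical values chosen to hit every bin, including two boundary cases.

```python
import numpy as np
import pandas as pd

# Same thresholds and labels as income_level; intervals are closed on the right,
# e.g. (0, 17000], which matches the `<=` comparisons in the function.
bins = [0, 17000, 40000, 80000, 200000, np.inf]
labels = ['small', 'average', 'above average', 'high', 'very high']

# Hypothetical incomes chosen to hit each bin, including two boundary values.
incomes = pd.Series([3306, 17000, 17001, 40620, 150000, 362496])
levels = pd.cut(incomes, bins=bins, labels=labels)

print(levels.tolist())
```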

## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?**

In [91]:
# Check the children data and paying back on time
pivot_table_children=data.pivot_table(index='children', columns='debt',values='days_employed', aggfunc='count')

# Calculating default-rate based on the number of children
pivot_table_children['percent_1']=pivot_table_children[1]/(pivot_table_children[1]+pivot_table_children[0])*100



pivot_table_children.sort_values(by='percent_1', ascending=True)

debt,0,1,percent_1
children,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,12963.0,1058.0,7.545824
3,301.0,27.0,8.231707
1,4351.0,441.0,9.202838
2,1845.0,194.0,9.514468
4,37.0,4.0,9.756098
5,9.0,,


**Conclusion**
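The default rate does not change consistently with the number of children: it is 9.2% for one child, 9.5% for two, 8.2% for three, and 9.8% for four. However, the rate for customers without kids (7.5%) is lower than for any of these groups. So having children matters more than the exact number of children. The group with five children has only 9 clients and no recorded defaults, so it is too small to draw conclusions from.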
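The empty cells for five children appear because the pivot table has no count for a combination that never occurs. Since `debt` is 0 or 1, the default rate can also be computed from a count and a sum, which gives 0 for such groups. This is a sketch on hypothetical toy data; the helper name `default_rate_by` is an assumption and is not part of the notebook.

```python
import pandas as pd

def default_rate_by(df, column):
    """Count clients and defaults per group and return the default rate in percent."""
    summary = df.groupby(column)['debt'].agg(clients='count', defaults='sum')
    summary['default_rate_pct'] = summary['defaults'] / summary['clients'] * 100
    return summary.sort_values('default_rate_pct')

# Toy data with hypothetical values; group 5 has no defaults at all.
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 5, 5],
    'debt':     [0, 0, 0, 1, 0, 1, 0, 0],
})

result = default_rate_by(toy, 'children')
print(result)
```

The same helper would work for `family_status`, `income_level`, and `purpose_category`, which would replace the repeated pivot-table code.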


**Is there a correlation between family status and paying back on time?**

In [92]:
# Check the family status data and paying back on time

pivot_table_family_status= data.pivot_table(index='family_status', columns='debt', values='days_employed', aggfunc='count')

# Calculating default-rate based on family status
pivot_table_family_status['percent_1']=pivot_table_family_status[1]/(pivot_table_family_status[1]\
                                                                    +pivot_table_family_status[0])*100

pivot_table_family_status.sort_values(by='percent_1', ascending=True)


debt,0,1,percent_1
family_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
widow / widower,884,62,6.553911
divorced,1095,84,7.124682
married,11290,923,7.557521
civil partnership,3729,383,9.314202
unmarried,2508,272,9.784173


**Conclusion**
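The only conclusion I can draw is that clients who are unmarried or in a civil partnership default more often (9.8% and 9.3%) than married, divorced, or widowed clients (6.6% to 7.6%). One possible explanation is that unmarried people carry fewer responsibilities.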


**Is there a correlation between income level and paying back on time?**

In [93]:
# Check the income level data and paying back on time

pivot_table_income_level=data.pivot_table(index='income_level', columns='debt', values='days_employed', aggfunc='count')
pivot_table_income_level['percent_1']=pivot_table_income_level[1]/(pivot_table_income_level[1]\
                                                                   +pivot_table_income_level[0])*100

pivot_table_income_level.sort_values(by='percent_1', ascending=True)

debt,0,1,percent_1
income_level,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
high,198,13,6.161137
above average,2383,179,6.986729
small,4799,413,7.924021
average,12116,1118,8.447937
very high,10,1,9.090909


**Conclusion**
Higher-income customers tend to repay on time more often: the above average and high income groups have the lowest default rates (7.0% and 6.2%). The trend is not strict, since the small income group (7.9%) defaults slightly less often than the average group (8.4%). The very high income group looks like an exception, but I think this is due to the small sample: there are only 11 such clients and one of them did not repay the debt.
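To put a rough number on the small-sample caveat, a Wilson score interval shows how uncertain a rate based on 11 clients is. This is a sketch under the usual assumption of independent borrowers; the helper name `wilson_interval` is hypothetical, and `z=1.96` corresponds to roughly 95% confidence.

```python
from math import sqrt

def wilson_interval(defaults, clients, z=1.96):
    """Approximate 95% Wilson score interval for a default rate."""
    p = defaults / clients
    denom = 1 + z**2 / clients
    center = (p + z**2 / (2 * clients)) / denom
    half_width = z * sqrt(p * (1 - p) / clients + z**2 / (4 * clients**2)) / denom
    return center - half_width, center + half_width

# Counts taken from the income-level table above.
low_vh, high_vh = wilson_interval(1, 11)          # 'very high': 1 default of 11
low_avg, high_avg = wilson_interval(1118, 13234)  # 'average': 1118 of 13234

print(f"very high: {low_vh:.1%} to {high_vh:.1%}")
print(f"average:   {low_avg:.1%} to {high_avg:.1%}")
```

The interval for the very high group is far wider than the one for the average group, so its 9.1% rate should not be read as evidence that very high earners default more often.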

**How does credit purpose affect the default rate?**

In [94]:
# Check the percentages for default rate for each credit purpose and analyze them
pivot_table_purpose_category=data.pivot_table(index='purpose_category', columns='debt', values='days_employed', aggfunc='count')
pivot_table_purpose_category['percent_1']=pivot_table_purpose_category[1]/(pivot_table_purpose_category[1]\
                                                                   +pivot_table_purpose_category[0])*100

pivot_table_purpose_category.sort_values(by='percent_1', ascending=True)


debt,0,1,percent_1
purpose_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
real estate,9926,777,7.259647
wedding,2118,181,7.872988
education,3601,369,9.29471
car,3861,397,9.323626


**Conclusion**

It is safest to issue real estate loans and loans for weddings, which have the lowest default rates (7.3% and 7.9%). Borrowers taking out education loans and car loans have the highest default rates (about 9.3% each).

# General Conclusion 

1. The missing values in `days_employed` and `total_income` appear to be random. We should notify the colleagues responsible for data collection about this problem. However, some of these missing values may be explained by issues in the bank's software.
2. Duplicates can appear due to technical mistakes, because we do not have a standard form for transferring client data to the data analyst. Obviously, it is necessary to store the client's ID.
3. One standardized form for transferring data from engineer to analyst could solve the problem and decrease the time spent on preprocessing data. For example, in this case we wouldn't have had a problem with negative values for children.
4. For a new scoring system we must take into account the following facts for the identified groups:
   - Borrowers with children have a higher default rate than borrowers without children.
   - Clients with small and average incomes are more likely to default than clients with above average and high incomes.
   - Unmarried clients and clients in a civil partnership are more likely to miss payments.
   - Car and education loan borrowers have the highest default rate.


