# Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building the **credit score** of a potential customer. The **credit score** is used to evaluate the ability of a potential borrower to repay their loan.


## Open the data file and have a look at the general information. 



In [1]:
import pandas as pd  # Load the libraries

df=pd.read_csv('/datasets/credit_scoring_eng.csv')  # Load the data

print(df)

       children  days_employed  dob_years            education  education_id  \
0             1   -8437.673028         42    bachelor's degree             0   
1             1   -4024.803754         36  secondary education             1   
2             0   -5623.422610         33  Secondary Education             1   
3             3   -4124.747207         32  secondary education             1   
4             0  340266.072047         53  secondary education             1   
...         ...            ...        ...                  ...           ...   
21520         1   -4529.316663         43  secondary education             1   
21521         0  343937.404131         67  secondary education             1   
21522         1   -2113.346888         38  secondary education             1   
21523         3   -3112.481705         38  secondary education             1   
21524         2   -1984.507589         40  secondary education             1   

           family_status  family_status

## Task 1. Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan



In [2]:
# Let's see how many rows and columns our dataset has
print(df.shape)


(21525, 12)


In [3]:
df.head(10)# let's print the first N rows



Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


Two problems are visible right away: the `days_employed` column contains negative values and implausibly large values, and the `education` column mixes upper and lower case. The data has to be brought to a single format during data preprocessing to avoid errors later.

In [4]:
df.info()# Get info on data


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


There are missing values in only two columns:'days_employed' and 'total_income'. 

In [5]:
# Count the missing values in each of the two columns
print(df['days_employed'].isna().sum())
print(df['total_income'].isna().sum())

2174
2174


Both columns have exactly 2174 missing values, which suggests that the gaps occur in the same rows. One possible explanation is that these customers have no working days and therefore no income, but the counts alone do not prove that the two columns are related, so we check the rows directly.

In [6]:
# Let's apply multiple conditions for filtering data and look at the number of rows in the filtered table.
print(df[(df['days_employed'].isna())&(df['total_income'].isna())].count())
print('the_percentage_of_the_missing_values {:.0%}'.format(df['days_employed'].isna().sum()/df.shape[0]))#Calculate the percentage of the missing values


children            2174
days_employed          0
dob_years           2174
education           2174
education_id        2174
family_status       2174
family_status_id    2174
gender              2174
income_type         2174
debt                2174
total_income           0
purpose             2174
dtype: int64
the_percentage_of_the_missing_values 10%


**Intermediate conclusion**

The number of rows in the filtered table matches the number of missing values, so the two values are always missing together.
Missing values make up 10% of the data. This is a significant share, so the gaps need to be filled in rather than dropped.
Next, we check whether the missing values are related to the income type: we list the income types that occur among the rows with missing data and see whether any of them could indicate the absence of income and working days.
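The check that both columns are missing in the same rows can also be done by comparing the two missing-value masks directly. The snippet below is a minimal sketch on a made-up four-row table (the `sample` frame and its values are assumptions, not the bank's data).

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real table (values are made up)
sample = pd.DataFrame({
    'days_employed': [-8437.7, np.nan, -5623.4, np.nan],
    'total_income': [40620.1, np.nan, 23341.8, np.nan],
})

# True where a value is missing, per column
missing_days = sample['days_employed'].isna()
missing_income = sample['total_income'].isna()

# The two masks are identical only if the gaps occur in exactly the same rows
same_rows = bool((missing_days == missing_income).all())

# The mean of a boolean mask is the share of missing rows
share_missing = missing_days.mean()

print(same_rows)
print('{:.0%}'.format(share_missing))
```

If `same_rows` is `True`, a single mask is enough for all later filtering.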

In [7]:
# Let's investigate clients who do not have data on identified characteristic and the column with the missing values
print(df[df['days_employed'].isna()]['income_type'].unique())
print(df[df['days_employed'].isna()]['education'].unique())
print(df[df['days_employed'].isna()]['debt'].unique())

['retiree' 'civil servant' 'business' 'employee' 'entrepreneur']
['secondary education' "bachelor's degree" 'SECONDARY EDUCATION'
 'some college' 'Secondary Education' 'Some College' "Bachelor's Degree"
 'SOME COLLEGE' 'primary education' "BACHELOR'S DEGREE"
 'Primary Education' 'PRIMARY EDUCATION']
[0 1]


There is no direct relationship between the missing values and the values in other columns (income type, education, debt): missing values occur across many different values of these columns.
**Possible reasons are errors in the collection and entry of data or the reluctance of customers to provide information about their income and**
The missing values may simply be random.

In [8]:
# Share of rows with missing `days_employed` within each income type
for income_type in df['income_type'].unique():
    subset=df[df['income_type']==income_type]
    share_missing=subset['days_employed'].isna().mean()
    print(income_type,'{:.0%}'.format(share_missing))



employee 10%
retiree 11%
business 10%
civil servant 10%
unemployed 0%
entrepreneur 50%
student 0%
paternity / maternity leave 0%


**Intermediate conclusion**
In each of the large income types (employee, retiree, business, civil servant), about 10-11% of the rows have missing values, which is close to the overall share. The 50% for entrepreneurs is based on very few customers.
The unemployed, student, and paternity / maternity leave groups have no missing values, so they need no filling.
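The per-group shares can also be computed without a loop, and it helps to show the group size next to each share so that tiny groups are easy to spot. This is a minimal sketch on made-up rows; the `sample` frame and the `summary` column names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two larger groups and one single-row group
sample = pd.DataFrame({
    'income_type': ['employee', 'employee', 'retiree', 'retiree', 'student'],
    'days_employed': [-100.0, np.nan, 340000.0, np.nan, -50.0],
})

# isna() gives booleans; their mean within each group is the share of missing rows
missing_share = sample['days_employed'].isna().groupby(sample['income_type']).mean()

# Number of rows per group, to judge how reliable each share is
group_size = sample.groupby('income_type').size()

summary = pd.DataFrame({'rows': group_size, 'missing_share': missing_share})
print(summary)
```

A share computed from one or two rows, like the `student` row here, says little about the group as a whole.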


In [9]:
# Check for other reasons and patterns that could lead to missing values
print(df[df['dob_years']<18].shape)
print(df[(df['dob_years']<18)&(df['days_employed'].isna())].shape)


(101, 12)
(10, 12)


**Intermediate conclusion**
The table contains 101 rows with an age below 18. Only 10 of them (about 10%) also have missing values, which is the same share as in the whole dataset, so the missing values do not depend on these ages.


**Conclusions**

We did not find a direct relationship between the missing values in the 'days_employed' and 'total_income' columns and the values in other columns.
The missing values appear to be random. We will therefore fill them in with typical values (mean or median) for customers with similar characteristics in other columns.

## Data transformation

We have to remove duplicates and fix the inconsistent spelling in the education column.

In [11]:
# Let's see all values in education column to check if and what spellings will need to be fixed
print(df['education'].unique())

["bachelor's degree" 'secondary education' 'Secondary Education'
 'SECONDARY EDUCATION' "BACHELOR'S DEGREE" 'some college'
 'primary education' "Bachelor's Degree" 'SOME COLLEGE' 'Some College'
 'PRIMARY EDUCATION' 'Primary Education' 'Graduate Degree'
 'GRADUATE DEGREE' 'graduate degree']


In [12]:
# Fix the letter case, then apply the two renames used in the rest of the notebook
df['education']=df['education'].str.lower().replace({
    "bachelor's degree":'bachelor_s degree',
    'some college':'college',
})



In [13]:
# Checking all the values in the column to make sure we fixed them
print(df['education'].unique())



['bachelor_s degree' 'secondary education' 'college' 'primary education'
 'graduate degree']
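Because the table also carries an `education_id`, one way to confirm that the cleanup worked is to check that every identifier now maps to exactly one label. The sketch below uses a made-up four-row sample; it is an illustration, not part of the original analysis.

```python
import pandas as pd

# Toy data (made up): the same two education levels in mixed case
sample = pd.DataFrame({
    'education': ["bachelor's degree", 'Secondary Education',
                  'SECONDARY EDUCATION', "BACHELOR'S DEGREE"],
    'education_id': [0, 1, 1, 0],
})

# Number of distinct spellings per identifier before the cleanup
labels_before = sample['education'].groupby(sample['education_id']).nunique()

# After lowercasing, each identifier should have a single label
normalized = sample['education'].str.lower()
labels_after = normalized.groupby(sample['education_id']).nunique()

print(labels_before)
print(labels_after)
```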


In [14]:
print(df['children'].value_counts())# Let's see the distribution of values in the `children` column
# Share of rows with implausible values; rows are counted with len() because DataFrame.value_counts() skips rows that contain NaN
print('{:.2%}'.format(len(df[df['children'].isin([20,-1])])/len(df)))

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64
0.57%


We see implausible values in the number of children: 20 and -1. They make up less than 1 percent of all data, so it is better not to use these rows in calculations. We can remove them.

In [15]:
# [remove strange values in 'children']
df=df[df['children']!=-1]
df=df[df['children']!=20]




In [16]:
print(df['children'].value_counts())# Checking the `children` column again to make sure it's all fixed



0    14149
1     4818
2     2055
3      330
4       41
5        9
Name: children, dtype: int64


We see two problems in the data in the `days_employed` column:
1) some values are very large
2) there are negative values

This could be due to an incorrect format or unit of measurement.
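One way to make "very large" concrete is to compare the employment length with the customer's age. The sketch below flags rows where the absolute number of days, converted to years, exceeds the age. The sample values are made up, and the 365-day year is a simplifying assumption.

```python
import pandas as pd

# Toy data (made up): absolute `days_employed` values next to ages
sample = pd.DataFrame({
    'dob_years': [42, 53, 27],
    'days_employed': [8437.7, 340266.1, 926.2],
})

# Convert days to years (assumes 365 days per year)
years_employed = sample['days_employed'] / 365

# Nobody can have worked for longer than they have been alive
implausible = years_employed > sample['dob_years']

print(sample[implausible])
```

Rows flagged this way cannot be real day counts, which supports the idea that they were recorded in a different unit or entered incorrectly.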

In [17]:
# Share and number of negative values in `days_employed`
print('{:.2%}'.format(df[df['days_employed']<0]['days_employed'].count()/df['days_employed'].count()))
print(df[df['days_employed']<0]['days_employed'].count())
# Number of negative `days_employed` values in rows where the age is zero or negative
print(df[(df['days_employed']<0)&(df['dob_years']<=0)]['days_employed'].count())

82.17%
15809
73


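Negative values make up 82.17% of the non-missing `days_employed` values (15809 rows), and only 73 of them belong to rows with a zero age, so the two problems are unrelated.
There are too many such rows to drop them, so the values have to be corrected instead.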


In [18]:
# Remove the sign and divide by 24, which treats the recorded values as hours rather than days
df['days_employed']=df['days_employed'].abs()/24
    

In [19]:
print(df['days_employed'].max())
print(df['days_employed'].min())

16739.80835313875
1.005901385020049


In [20]:
print(df['dob_years'].value_counts(ascending=False))



35    614
40    603
41    603
34    597
38    595
42    592
33    577
39    572
31    556
36    553
29    543
44    543
48    536
30    536
37    531
43    510
50    509
32    506
49    505
28    501
45    494
27    490
52    483
56    482
47    480
54    476
46    469
58    461
53    457
57    457
51    446
59    441
55    441
26    406
60    376
25    356
61    353
62    351
63    268
24    263
64    263
23    252
65    194
66    183
22    183
67    167
21    110
0     100
68     99
69     83
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64


The only problematic value in `dob_years` is 0 (100 rows). All other ages lie between 19 and 75, so there are no negative ages and no customers under 18.

In [21]:
# Share of rows with a zero age in `dob_years`
print('{:.2%}'.format(df[df['dob_years']==0]['dob_years'].count()/df['dob_years'].count()))


0.47%


The problematic age values are zeros. They make up less than 1% of the data, so these rows can be dropped.

In [22]:
#Address the issues in the `dob_years` column, if they exist
df=df[df['dob_years']!=0]

In [23]:
print(df[df['dob_years']==0])# Check the result - make sure it's fixed


Empty DataFrame
Columns: [children, days_employed, dob_years, education, education_id, family_status, family_status_id, gender, income_type, debt, total_income, purpose]
Index: []


Now let's check the `family_status` column. 

In [24]:
df['family_status'].value_counts()# Let's see the values for the column

married              12254
civil partnership     4139
unmarried             2783
divorced              1179
widow / widower        947
Name: family_status, dtype: int64

Now let's check the `gender` column. 

In [25]:
df['gender'].value_counts()# Let's see the values in the column


F      14083
M       7218
XNA        1
Name: gender, dtype: int64

In [26]:
df=df[df['gender']!='XNA']# Address the problematic values

In [27]:
df['gender'].value_counts()# Check the result 



F    14083
M     7218
Name: gender, dtype: int64

Now let's check the `income_type` column.

In [28]:
df['income_type'].value_counts()# Let's see the values in the column

employee                       10996
business                        5033
retiree                         3819
civil servant                   1447
entrepreneur                       2
unemployed                         2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

In [29]:
df['purpose'].value_counts()# Let's see the values in the column

wedding ceremony                            791
having a wedding                            768
to have a wedding                           764
real estate transactions                    670
buy commercial real estate                  658
buying property for renting out             649
transactions with commercial real estate    644
housing transactions                        642
purchase of the house for my family         639
housing                                     636
purchase of the house                       635
property                                    628
construction of own property                626
transactions with my real estate            626
building a property                         620
building a real estate                      619
purchase of my own house                    618
buy real estate                             615
housing renovation                          607
buy residential real estate                 600
buying my own car                       

Many values have the same meaning but different wording. For further analysis, we combine them into a few groups.


In [30]:
def assign_purpose_group(purpose):
    """Map a free-text loan purpose to one of four groups."""
    if any(word in purpose for word in ('real estate','property','housing','house')):
        return 'real_estate'
    if 'wedding' in purpose:
        return 'wedding'
    if 'car' in purpose:  # also matches 'cars'
        return 'car'
    if any(word in purpose for word in ('university','education','educated')):
        return 'education'
    return None  # no keyword matched

In [31]:
df['purpose']=df['purpose'].apply(assign_purpose_group)
print(df['purpose'].unique())


['real_estate' 'car' 'education' 'wedding']
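Keyword matching silently returns nothing for a purpose that contains none of the keywords, so it is worth listing such values before overwriting the column. The sketch below is a hypothetical variant: the `KEYWORDS` table and `group_purpose` helper are illustrative names, and the last two purposes in the sample are made up.

```python
import pandas as pd

# Hypothetical keyword table; the keywords mirror the ones used in the notebook
KEYWORDS = {
    'real_estate': ('real estate', 'property', 'housing', 'house'),
    'wedding': ('wedding',),
    'car': ('car',),
    'education': ('university', 'education', 'educated'),
}

def group_purpose(purpose):
    """Return the first group whose keyword occurs in the text, else None."""
    for group, words in KEYWORDS.items():
        if any(word in purpose for word in words):
            return group
    return None

purposes = pd.Series([
    'purchase of the house',
    'to have a wedding',
    'buying my own car',
    'going to university',  # made-up example
    'vacation',             # made-up example with no matching keyword
])

grouped = purposes.apply(group_purpose)

# Purposes that matched no keyword and would end up as missing values
unmatched = purposes[grouped.isna()]
print(unmatched.tolist())
```

An empty `unmatched` list on the real column would confirm that the four groups cover every purpose.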


Now we need to get rid of duplicates.

In [32]:
print(df.duplicated().sum())# Checking duplicates
print(df[df.duplicated()].head(20))


404
      children  days_employed  dob_years            education  education_id  \
360          0            NaN         27  secondary education             1   
829          0            NaN         57  secondary education             1   
1010         0            NaN         66  secondary education             1   
1072         0            NaN         44  secondary education             1   
1247         0            NaN         54  secondary education             1   
2210         1            NaN         49    bachelor_s degree             0   
2343         0            NaN         47  secondary education             1   
2700         0            NaN         65  secondary education             1   
2849         0            NaN         41  secondary education             1   
3018         1            NaN         26  secondary education             1   
3075         0            NaN         62  secondary education             1   
3127         0            NaN         50  second

In [33]:
df=df.drop_duplicates()# Address the duplicates, if they exist

In [34]:
df.duplicated().sum()# Last check whether we have any duplicates


0

In [35]:
print(df.shape)# Check the size of the dataset after the first manipulations
print('the_percentage_of_the_changes {:.2%}'.format((21525-df.shape[0])/21525))

(20897, 12)
the_percentage_of_the_changes 2.92%


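We removed the rows with invalid values (`children` equal to -1 or 20, zero age, and the `XNA` gender), which made up a small share of the data, and then removed the duplicates. The dataset shrank from 21525 to 20897 rows; the missing values are still in place.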


## Working with missing values

[To speed up working with some data, you may want to work with dictionaries for some values, where IDs are provided. Explain why and which dictionaries you will work with.]

In [36]:
# Find the dictionaries
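The two identifier columns already pair each text label with a number, so lookup dictionaries can be built straight from the table. The sketch below shows one possible way on a made-up three-row sample; the names `education_dict` and `family_status_dict` are assumptions.

```python
import pandas as pd

# Toy data (made up) with the same identifier/label column pairs as the dataset
sample = pd.DataFrame({
    'education': ['bachelor_s degree', 'secondary education', 'secondary education'],
    'education_id': [0, 1, 1],
    'family_status': ['married', 'civil partnership', 'married'],
    'family_status_id': [0, 1, 0],
})

# Keep one row per identifier, then turn the pair into an id -> label mapping
education_dict = (
    sample[['education_id', 'education']]
    .drop_duplicates()
    .set_index('education_id')['education']
    .to_dict()
)
family_status_dict = (
    sample[['family_status_id', 'family_status']]
    .drop_duplicates()
    .set_index('family_status_id')['family_status']
    .to_dict()
)

print(education_dict)
print(family_status_dict)
```

With such dictionaries, grouping can use the compact integer identifiers, and the text labels are looked up only when results are displayed.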

### Restoring missing values in `total_income`

We need to fill in the missing values in the `total_income` and `days_employed` columns.
To do this, we calculate typical values for groups of customers defined by characteristics such as age and income type.


In [37]:
# Let's write a function that calculates the age category
def assign_age_group(age):
    if age < 30:
        return '18-29'
    elif age < 40:
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    elif age >= 60:
        return '60+'

    

In [38]:
df['dob_years'].apply(assign_age_group).head(20)# Test if the function works


0     40-49
1     30-39
2     30-39
3     30-39
4     50-59
5     18-29
6     40-49
7     50-59
8     30-39
9     40-49
10    30-39
11    40-49
12      60+
13    50-59
14    50-59
15    18-29
16    30-39
17    30-39
18    50-59
19    40-49
Name: dob_years, dtype: object

In [39]:
df['assign_age_group']=df['dob_years'].apply(assign_age_group)# Creating new column based on function



In [40]:
df.info()# Check that the new column is present and has no missing values


<class 'pandas.core.frame.DataFrame'>
Int64Index: 20897 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          20897 non-null  int64  
 1   days_employed     19149 non-null  float64
 2   dob_years         20897 non-null  int64  
 3   education         20897 non-null  object 
 4   education_id      20897 non-null  int64  
 5   family_status     20897 non-null  object 
 6   family_status_id  20897 non-null  int64  
 7   gender            20897 non-null  object 
 8   income_type       20897 non-null  object 
 9   debt              20897 non-null  int64  
 10  total_income      19149 non-null  float64
 11  purpose           20897 non-null  object 
 12  assign_age_group  20897 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.2+ MB


Create a table that only has data without missing values. This data will be used to restore the missing values.

In [41]:
df_no_missing_values=df.dropna()
print(df_no_missing_values.head(30))
print(df_no_missing_values.info())

    children  days_employed  dob_years            education  education_id  \
0          1     351.569709         42    bachelor_s degree             0   
1          1     167.700156         36  secondary education             1   
2          0     234.309275         33  secondary education             1   
3          3     171.864467         32  secondary education             1   
4          0   14177.753002         53  secondary education             1   
5          0      38.591076         27    bachelor_s degree             0   
6          0     119.966752         43    bachelor_s degree             0   
7          0       6.365815         50  secondary education             1   
8          2     288.744387         35    bachelor_s degree             0   
9          0      91.198185         41  secondary education             1   
10         2     173.811819         36    bachelor_s degree             0   
11         0      33.029245         40  secondary education             1   

In [42]:
# Look at the mean values for income based on the identified factors
print(df_no_missing_values.groupby(['gender'])['total_income'].mean())
print(df_no_missing_values.groupby(['education_id'])['total_income'].mean())
print(df_no_missing_values.groupby(['income_type'])['total_income'].mean())
print(df_no_missing_values.groupby(['assign_age_group'])['total_income'].mean())

gender
F    24666.721477
M    30913.888696
Name: total_income, dtype: float64
education_id
0    33197.258790
1    24594.390815
2    29028.844227
3    21144.882211
4    27960.024667
Name: total_income, dtype: float64
income_type
business                       32424.420789
civil servant                  27336.442546
employee                       25822.872585
entrepreneur                   79866.103000
paternity / maternity leave     8612.661000
retiree                        21950.722935
student                        15712.260000
unemployed                     21014.360500
Name: total_income, dtype: float64
assign_age_group
18-29    25544.231203
30-39    28314.525654
40-49    28575.427654
50-59    25807.707523
60+      23015.440182
Name: total_income, dtype: float64


In [43]:
# Look at the median values for income based on the identified factors
print(df_no_missing_values.groupby(['gender'])['total_income'].median())
print(df_no_missing_values.groupby(['education_id'])['total_income'].median())
print(df_no_missing_values.groupby(['income_type'])['total_income'].median())
# Note: this last line calls .mean(), so the age-group output below repeats the means from the previous cell
print(df_no_missing_values.groupby(['assign_age_group'])['total_income'].mean())


gender
F    21465.6375
M    26828.2450
Name: total_income, dtype: float64
education_id
0    28086.5425
1    21832.2410
2    25618.4640
3    18741.9760
4    25161.5835
Name: total_income, dtype: float64
income_type
business                       27594.6410
civil servant                  24076.1150
employee                       22815.1035
entrepreneur                   79866.1030
paternity / maternity leave     8612.6610
retiree                        18959.6260
student                        15712.2600
unemployed                     21014.3605
Name: total_income, dtype: float64
assign_age_group
18-29    25544.231203
30-39    28314.525654
40-49    28575.427654
50-59    25807.707523
60+      23015.440182
Name: total_income, dtype: float64



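In most groups the mean is noticeably higher than the median (for example, 24666.72 vs 21465.64 for women), which points to a skewed distribution with large outliers, so we use the median.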

In [44]:
# Median income per income type, used to fill the gaps in `total_income`
group_df=df_no_missing_values.groupby(['income_type'])['total_income'].median()
group_df_dict=group_df.to_dict()
df['total_income']=df['total_income'].fillna(df['income_type'].map(group_df_dict))
        
        

In [45]:
print(df['total_income'].isna().sum())
print(df['total_income'].head(30))

0
0     40620.102
1     17932.802
2     23341.752
3     42820.568
4     25378.572
5     40922.170
6     38484.156
7     21731.829
8     15337.093
9     23108.150
10    18230.959
11    12331.077
12    18959.626
13    20873.317
14    26420.466
15    18691.345
16    46272.433
17    14465.694
18     9091.804
19    38852.977
20    33528.423
21    21089.953
22    23948.983
23    20522.515
24    46487.558
25     8818.041
26    24076.115
27    49415.837
28    30058.118
29    18959.626
Name: total_income, dtype: float64
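The same group-wise fill can be written without an intermediate dictionary by using `groupby(...).transform('median')`, which returns a Series aligned with the original rows. This is a minimal sketch on made-up numbers, shown as an alternative rather than as the approach used above.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two income types with one gap each
sample = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree'],
    'total_income': [20000.0, 30000.0, np.nan, 18000.0, np.nan],
})

# For every row, the median income of that row's income type (NaN values are ignored)
group_median = sample.groupby('income_type')['total_income'].transform('median')

# Replace only the missing values; existing values stay untouched
filled = sample['total_income'].fillna(group_median)

print(filled.tolist())
```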


In [46]:
df.info()# Check if we got any errors
print(df.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20897 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          20897 non-null  int64  
 1   days_employed     19149 non-null  float64
 2   dob_years         20897 non-null  int64  
 3   education         20897 non-null  object 
 4   education_id      20897 non-null  int64  
 5   family_status     20897 non-null  object 
 6   family_status_id  20897 non-null  int64  
 7   gender            20897 non-null  object 
 8   income_type       20897 non-null  object 
 9   debt              20897 non-null  int64  
 10  total_income      20897 non-null  float64
 11  purpose           20897 non-null  object 
 12  assign_age_group  20897 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.2+ MB
   children  days_employed  dob_years            education  education_id  \
0         1     351.569709         42    bachelo

###  Restoring values in `days_employed`

To fill in the missing values, we calculate typical values of `days_employed` for each age group (`assign_age_group`).

In [47]:
print(df_no_missing_values.groupby(['assign_age_group'])['days_employed'].mean())# Distribution of `days_employed` means by age group




assign_age_group
18-29       81.829595
30-39      173.698644
40-49      512.857068
50-59     5551.078523
60+      11934.444208
Name: days_employed, dtype: float64


In [48]:
print(df_no_missing_values.groupby(['assign_age_group'])['days_employed'].median())# Distribution of `days_employed` medians by age group

assign_age_group
18-29       41.626203
30-39       66.746661
40-49       88.028940
50-59      200.977686
60+      14801.234092
Name: days_employed, dtype: float64


The mean is much higher than the median in most age groups (for example, 5551.08 vs 200.98 for ages 50-59), so the distribution is strongly skewed and we use the median.
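A tiny made-up example shows why a few extreme values make the mean a poor "typical" value while the median barely moves. The numbers below are illustrative only.

```python
import pandas as pd

# Made-up values: four ordinary records and one extreme one
days = pd.Series([100.0, 150.0, 200.0, 250.0, 14000.0])

mean_days = days.mean()      # pulled far upward by the single extreme value
median_days = days.median()  # the middle value, unaffected by the outlier

print(mean_days)
print(median_days)
```

Here the mean is 2940.0 and the median is 200.0, so filling a gap with the mean would assign a value that no ordinary record comes close to.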

In [49]:
print(df['days_employed'].isna().sum())
# Fill the gaps with the median `days_employed` of the customer's age group
group_df=df_no_missing_values.groupby(['assign_age_group'])['days_employed'].median()
group_df_dict=group_df.to_dict()
df['days_employed']=df['days_employed'].fillna(df.assign_age_group.map(group_df_dict))


1748


In [50]:
df['days_employed'].isna().sum()# Check that the function works



0

In [51]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 20897 entries, 0 to 21524
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          20897 non-null  int64  
 1   days_employed     20897 non-null  float64
 2   dob_years         20897 non-null  int64  
 3   education         20897 non-null  object 
 4   education_id      20897 non-null  int64  
 5   family_status     20897 non-null  object 
 6   family_status_id  20897 non-null  int64  
 7   gender            20897 non-null  object 
 8   income_type       20897 non-null  object 
 9   debt              20897 non-null  int64  
 10  total_income      20897 non-null  float64
 11  purpose           20897 non-null  object 
 12  assign_age_group  20897 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.2+ MB


In [52]:
df['total_income'].astype('int')# Preview the integer conversion; the result is not assigned, so the column itself stays float

0        40620
1        17932
2        23341
3        42820
4        25378
         ...  
21520    35966
21521    24959
21522    14347
21523    39054
21524    13127
Name: total_income, Length: 20897, dtype: int64

## Categorization of data

To analyze customer debt, we find the share of customers with debt in each category and identify the categories where this share is higher than in the others.
We will consider the following characteristics: the number of children, family_status, total_income, income_type, and the purpose of the loan.


In [53]:
print(df['total_income'].min()) # Print min and max the values for selected data for categorization
print(df['total_income'].max())


3306.762
362496.645


In [54]:
print(df['total_income'].median())


22815.103499999997


Let's group the values into three categories: low, middle, and high income.



In [55]:
def assign_total_income_group(total_income):
    if total_income < 15000:
        return 'low'
    elif total_income < 30000:
        return 'middle'
    elif total_income >= 30000:
        return 'high'


In [56]:
df['total_income_group']=df['total_income'].apply(assign_total_income_group)# Create a column with the categories and count the values for them



In [57]:
# Looking through all the numerical data in your selected column for categorization
df.head(30)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,assign_age_group,total_income_group
0,1,351.569709,42,bachelor_s degree,0,married,0,F,employee,0,40620.102,real_estate,40-49,high
1,1,167.700156,36,secondary education,1,married,0,F,employee,0,17932.802,car,30-39,middle
2,0,234.309275,33,secondary education,1,married,0,M,employee,0,23341.752,real_estate,30-39,middle
3,3,171.864467,32,secondary education,1,married,0,M,employee,0,42820.568,education,30-39,high
4,0,14177.753002,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,wedding,50-59,middle
5,0,38.591076,27,bachelor_s degree,0,civil partnership,1,M,business,0,40922.17,real_estate,18-29,high
6,0,119.966752,43,bachelor_s degree,0,married,0,F,business,0,38484.156,real_estate,40-49,high
7,0,6.365815,50,secondary education,1,married,0,M,employee,0,21731.829,education,50-59,middle
8,2,288.744387,35,bachelor_s degree,0,civil partnership,1,F,employee,0,15337.093,wedding,30-39,middle
9,0,91.198185,41,secondary education,1,married,0,M,employee,0,23108.15,real_estate,40-49,middle


In [58]:
df.groupby(['total_income_group'])['total_income'].count()# Getting summary statistics for the column



total_income_group
high       5860
low        3706
middle    11331
Name: total_income, dtype: int64
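The same three income bands can be produced with `pd.cut` instead of a row-wise function. The sketch below reuses the 15000 and 30000 thresholds from the function above; the sample incomes are chosen to sit on and around the band edges, and `right=False` makes each band include its lower edge, matching the `<` comparisons.

```python
import pandas as pd

# Sample incomes, including values exactly on the band edges
income = pd.Series([3306.762, 14999.99, 15000.0, 22815.1, 30000.0, 362496.645])

# Bands: [0, 15000) -> low, [15000, 30000) -> middle, [30000, inf) -> high
groups = pd.cut(
    income,
    bins=[0, 15000, 30000, float('inf')],
    labels=['low', 'middle', 'high'],
    right=False,
)

print(groups.tolist())
```

Besides being vectorized, `pd.cut` returns an ordered categorical, so the groups sort as low, middle, high rather than alphabetically.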

## Checking the Hypotheses


**First hypothesis: The higher the income, the less often debt arises.**

In [59]:
# The income categories are used for grouping as they are; copying the column keeps the labels in the output below
df['debt_total_income_group']=df['total_income_group']
df_total_income=df.pivot_table(index='debt_total_income_group',columns='debt',values='total_income',aggfunc='count')        
df_total_income['percentages_total_income_group']=df_total_income[1]/(df_total_income[0]+df_total_income[1])*100
print(df_total_income.sort_values(by='percentages_total_income_group',ascending=True))

debt                         0    1  percentages_total_income_group
debt_total_income_group                                            
high                      5427  433                        7.389078
low                       3409  297                        8.014031
middle                   10339  992                        8.754744


According to the percentages obtained, high-income customers have the lowest share of debt (7.39%), that is, they repay their loans on time more often. However, the relationship does not hold across the other groups: middle-income customers have a higher share of debt (8.75%) than low-income customers (8.01%).
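The check behind this conclusion can be sketched directly from the counts printed above. This is a minimal illustration, not the notebook's own method: the column names `repaid`, `defaulted` and `default_rate_pct` are introduced here for readability, and the counts are copied from the pivot table output.

```python
import pandas as pd

# Counts copied from the pivot table output above (debt == 0 and debt == 1).
counts = pd.DataFrame(
    {"repaid": [5427, 3409, 10339], "defaulted": [433, 297, 992]},
    index=pd.Index(["high", "low", "middle"], name="total_income_group"),
)

# Share of clients with debt in each income group, in percent.
counts["default_rate_pct"] = (
    counts["defaulted"] / (counts["repaid"] + counts["defaulted"]) * 100
)

# Order the groups from lowest to highest income and test whether the
# rate falls at every step, which is what the hypothesis would require.
rates = counts.loc[["low", "middle", "high"], "default_rate_pct"]
is_monotonic = rates.is_monotonic_decreasing

print(rates.round(2))
print("Rate falls with every income step:", is_monotonic)
```

Ordering the groups by income before testing makes the partial nature of the result explicit: the rate rises from low to middle income and only then drops for the high-income group.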

**Hypothesis 2: The more children, the more debt.**

In [61]:
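# Make sure the number of children is stored as a numeric type
df['children'] = pd.to_numeric(df['children'])
# Number of clients without debt (0) and with debt (1) for each number of children
print(df.groupby(['children'])['debt'].value_counts())

# Share of clients with debt among clients who have x children
def percentages_children_group(x):
    a = (df[(df['children'] == x) & (df['debt'] == 1)].shape[0]) / (df[(df['children'] == x)].shape[0])
    return a

for x in [0, 1, 2, 3]:
    print('Percentage of indebted clients with', x, 'child: {:.2%}'.format(percentages_children_group(x)))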

children  debt
0         0       12704
          1        1056
1         0        4294
          1         441
2         0        1832
          1         194
3         0         300
          1          27
4         0          36
          1           4
5         0           9
Name: debt, dtype: int64
Percentage of indebted clients with 0 child: 7.67%
Percentage of indebted clients with 1 child: 9.31%
Percentage of indebted clients with 2 child: 9.58%
Percentage of indebted clients with 3 child: 8.26%


**Conclusion**
Since very few customers have 4 or 5 children (40 and 9 respectively), we exclude these groups so that small samples do not distort the result.
The analysis shows that customers without children have the lowest share of debt (7.67%). Among customers with children, however, the share does not rise consistently with the number of children: it is 9.31% for one child, 9.58% for two and 8.26% for three.
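The same per-group calculation, including the exclusion of very small groups, can be sketched with a single `groupby`. This is only one possible approach: the helper `default_rate_by` and the `min_count=100` threshold are assumptions made for illustration, and the row-level data is rebuilt from the counts printed above rather than loaded from the original file.

```python
import pandas as pd

def default_rate_by(data, column, min_count=100):
    """Share of clients with debt per category (hypothetical helper).

    Categories with fewer than min_count clients are dropped; the
    threshold of 100 is an illustrative assumption.
    """
    summary = data.groupby(column)["debt"].agg(
        clients="count", defaults="sum", rate="mean"
    )
    return summary[summary["clients"] >= min_count].sort_values("rate")

# (children, debt, number of clients), copied from the output above.
counts = pd.DataFrame(
    [(0, 0, 12704), (0, 1, 1056), (1, 0, 4294), (1, 1, 441),
     (2, 0, 1832), (2, 1, 194), (3, 0, 300), (3, 1, 27),
     (4, 0, 36), (4, 1, 4), (5, 0, 9)],
    columns=["children", "debt", "n"],
)

# Expand the counts back into one row per client.
rows = counts.loc[counts.index.repeat(counts["n"]), ["children", "debt"]]
rows = rows.reset_index(drop=True)

result = default_rate_by(rows, "children", min_count=100)
print(result)
```

Because `debt` is coded as 0 or 1, its mean within a group equals the share of clients with debt, so no separate filtering by `debt == 1` is needed.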


**Hypothesis 3: Unmarried customers have the highest percentage of debt.**

In [62]:
print(df.groupby(['family_status'])['debt'].value_counts())
def percentages_family_status(x):
    a=(df[(df['family_status']==x)&(df['debt']==1)].shape[0])/(df[(df['family_status']==x)].shape[0])
    return a
for x in df['family_status'].unique():
    print('Percentage of indebted clients -',x,': {:.2%}'.format(percentages_family_status(x)))



family_status      debt
civil partnership  0        3702
                   1         383
divorced           0        1093
                   1          84
married            0       11030
                   1         921
unmarried          0        2482
                   1         272
widow / widower    0         868
                   1          62
Name: debt, dtype: int64
Percentage of indebted clients - married : 7.71%
Percentage of indebted clients - civil partnership : 9.38%
Percentage of indebted clients - widow / widower : 6.67%
Percentage of indebted clients - divorced : 7.14%
Percentage of indebted clients - unmarried : 9.88%


**Conclusion**
According to the results obtained, widows and widowers have the lowest percentage of debt (6.67%), although this is also the smallest group (930 customers), so the estimate is less reliable. Divorced and married customers have similar percentages (7.14% and 7.71%). The highest percentages of debt belong to unmarried customers (9.88%) and customers in a civil partnership (9.38%).
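One way to put these percentages in context is to compare each group with the overall share of customers with debt. The sketch below does this from the counts printed above; the column names (`repaid`, `defaulted`, `clients`, `rate_pct`, `vs_overall_pct`) are illustrative and do not come from the notebook.

```python
import pandas as pd

# Counts copied from the value_counts output above.
counts = pd.DataFrame(
    {
        "repaid": [3702, 1093, 11030, 2482, 868],
        "defaulted": [383, 84, 921, 272, 62],
    },
    index=["civil partnership", "divorced", "married",
           "unmarried", "widow / widower"],
)

counts["clients"] = counts["repaid"] + counts["defaulted"]
counts["rate_pct"] = counts["defaulted"] / counts["clients"] * 100

# Overall share of clients with debt across all family statuses.
overall_pct = counts["defaulted"].sum() / counts["clients"].sum() * 100

# Positive values mean the group is above the overall level.
counts["vs_overall_pct"] = counts["rate_pct"] - overall_pct
above = counts[counts["vs_overall_pct"] > 0].index.tolist()

print(counts.sort_values("rate_pct").round(2))
print("Overall: {:.2f}%".format(overall_pct))
print("Above overall:", above)
```

Only the unmarried and civil partnership groups sit above the overall level, which matches the split between riskier and safer family statuses described above.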

**Hypothesis 4: Business owners have a lower percentage of debt than other customers.**

In [63]:
print(df.groupby(['income_type'])['debt'].value_counts())

def percentages_income_type(x):
    a=(df[(df['income_type']==x)&(df['debt']==1)].shape[0])/(df[(df['income_type']==x)].shape[0])
    return a
for x in ['business','civil servant','employee','retiree']:
    print('Percentage of indebted clients -',x,': {:.2%}'.format(percentages_income_type(x)))


income_type                  debt
business                     0       4608
                             1        373
civil servant                0       1352
                             1         86
employee                     0       9731
                             1       1046
entrepreneur                 0          2
paternity / maternity leave  1          1
retiree                      0       3480
                             1        215
student                      0          1
unemployed                   0          1
                             1          1
Name: debt, dtype: int64
Percentage of indebted clients - business : 7.49%
Percentage of indebted clients - civil servant : 5.98%
Percentage of indebted clients - employee : 9.71%
Percentage of indebted clients - retiree : 5.82%


**Conclusion**
Since there are only a handful of students, unemployed customers, entrepreneurs and customers on paternity / maternity leave (one or two each), we exclude these groups so that small samples do not distort the result.
The results show that retirees (5.82%) and civil servants (5.98%) have the lowest percentage of debt. Business owners (7.49%) have a lower percentage than employees (9.71%), but a higher one than retirees and civil servants, so the hypothesis holds only in comparison with employees.
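The comparison between business owners and the other large groups can be sketched as follows. This is an illustration built from the counts printed above, not part of the original analysis; the pooled figure for "all other groups" is an extra comparison introduced here.

```python
import pandas as pd

# Counts for the four large income types, copied from the output above.
counts = pd.DataFrame(
    {"repaid": [4608, 1352, 9731, 3480], "defaulted": [373, 86, 1046, 215]},
    index=["business", "civil servant", "employee", "retiree"],
)
counts["clients"] = counts["repaid"] + counts["defaulted"]
rate_pct = counts["defaulted"] / counts["clients"] * 100

# Pooled rate for everyone except business owners.
others = counts.drop(index="business")
pooled_others_pct = others["defaulted"].sum() / others["clients"].sum() * 100

# Groups whose rate is lower than that of business owners.
lower_than_business = rate_pct[rate_pct < rate_pct["business"]].index.tolist()

print(rate_pct.round(2))
print("Other groups pooled: {:.2f}%".format(pooled_others_pct))
print("Lower than business:", lower_than_business)
```

Business owners come out below the pooled rate of the other groups, mainly because employees are by far the largest group, while two individual groups still have lower rates than business owners.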

**Hypothesis 5: Loans taken for real estate have the lowest percentage of debt.**

In [64]:
print(df.groupby(['purpose'])['debt'].value_counts())

def percentages_purpose_type(x):
    a=(df[(df['purpose']==x)&(df['debt']==1)].shape[0])/(df[(df['purpose']==x)].shape[0])
    return a
for x in df['purpose'].unique():
    print('Percentage of indebted clients with purpose -',x,': {:.2%}'.format(percentages_purpose_type(x)))



purpose      debt
car          0       3828
             1        396
education    0       3552
             1        369
real_estate  0       9695
             1        776
wedding      0       2100
             1        181
Name: debt, dtype: int64
Percentage of indebted clients with purpose - real_estate : 7.41%
Percentage of indebted clients with purpose - car : 9.38%
Percentage of indebted clients with purpose - education : 9.41%
Percentage of indebted clients with purpose - wedding : 7.94%



**Conclusion**
Based on the results, real estate loans (7.41%) and wedding loans (7.94%) are less risky than car loans (9.38%) and education loans (9.41%). This may be because wedding loans tend to be small and real estate can generate income, which would lower the risk of non-payment.


# General Conclusion
Based on the study, we can conclude that the data were sufficient and that, despite the duplicates and missing values found in the raw data, the results of the analysis can be considered reliable. The first, second and fourth hypotheses were partly confirmed; the third and fifth were fully confirmed. Based on the results of the study, the following conclusions can be drawn:

1) The safest customers for the bank:
- high-income customers
- civil servants
- customers taking a loan for real estate
- customers without children
- married or divorced customers

2) The riskiest customers:
- employees
- customers taking a loan for education or a car
- unmarried customers