# Analyzing borrowers’ risk of defaulting

The purpose of the project is to prepare a report for a bank’s loan division.
The hypothesis is that a customer’s marital status and number of children have an impact on whether they will default on a loan. Other customer characteristics may also influence their ability to repay a loan. The project's conclusions should help build a credit score for a potential customer.
Let's go step by step.

##  Open the data file and have a look at the general information. 

We start with importing the libraries and loading the data. 

In [1]:
import pandas as pd


In [2]:
df=pd.read_csv('/datasets/credit_scoring_eng.csv')

## Task 1. Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan



**Let's see how many rows and columns our dataset has and check for potential issues with the data.**

In [3]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


Let's print the first 10 rows:

In [4]:
df.head(10)


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


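**There are 21525 rows and 12 columns. The data types look appropriate. Some values need further investigation and correction: 'education' mixes upper and lower case, and 'days_employed' contains negative and implausibly large values. Two columns have missing values.**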

In [5]:
df.isnull().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

In [6]:
df.isnull().sum()/len(df)

children            0.000000
days_employed       0.100999
dob_years           0.000000
education           0.000000
education_id        0.000000
family_status       0.000000
family_status_id    0.000000
gender              0.000000
income_type         0.000000
debt                0.000000
total_income        0.100999
purpose             0.000000
dtype: float64

**There are missing values in the 'days_employed' and 'total_income' columns, and both columns have the same number of them (2174).
That is about 10% of the dataset.**
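
The same shares can also be obtained in one step, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up frame (the `toy` data is illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy frame with illustrative values; one entry is missing in each numeric column.
toy = pd.DataFrame({
    "days_employed": [-100.0, None, -250.0, -30.0],
    "total_income": [20000.0, None, 31000.0, 15000.0],
    "debt": [0, 0, 1, 0],
})

# isna() yields booleans; their mean is the share of missing values per column.
missing_share = toy.isna().mean()
print(missing_share)
```

On the real data this should reproduce the fractions shown above.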

**Let's look at the rows where the first of these columns, 'days_employed', is missing.**

In [7]:
df[df.days_employed.isna()]


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


**In rows with missing days_employed we also see missing total_income. Is that always the case? Let's check by building a filtered table, df_na, that contains only the rows where both 'days_employed' and 'total_income' are missing. We combine the two conditions in the filter and look at the number of rows in the result.**

In [8]:
df_na=df[df.days_employed.isna() & df.total_income.isna()]
df_na

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


**The filtered table has all 2174 rows, so the rows where 'days_employed' is missing and the rows where 'total_income' is missing coincide completely. There was probably a technical issue with data input or storage.**
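
The coincidence can also be checked without reading row counts off a table. The sketch below, on made-up data, compares the number of rows where both values are missing with the number of rows where at least one is missing; the frame `toy` and the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy frame with illustrative values: both columns are missing in the same rows.
toy = pd.DataFrame({
    "days_employed": [-100.0, None, -250.0, None],
    "total_income": [20000.0, None, 31000.0, None],
})

both_missing = toy["days_employed"].isna() & toy["total_income"].isna()
either_missing = toy["days_employed"].isna() | toy["total_income"].isna()

# If the two counts match, no row has exactly one of the two values missing.
masks_coincide = bool(both_missing.sum() == either_missing.sum())
print(masks_coincide)
```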




**Intermediate conclusion**

**The number of missing values is relatively large: about 10% of the whole dataset. These values need to be investigated further and restored if possible.
To do that, we should first consider whether the missing data could be tied to a specific client characteristic, such as employment type. Second, we should check whether the rows with missing values differ from the rest of the dataset in the other columns.**

**Let's compare the distribution of specific client characteristics (income_type and gender) in the filtered table df_na and in the rows of df without missing values.**


In [9]:
df_na['income_type'].value_counts()

employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: income_type, dtype: int64

In [10]:
df_na['income_type'].value_counts()/len(df_na)

employee         0.508280
business         0.233671
retiree          0.189972
civil servant    0.067617
entrepreneur     0.000460
Name: income_type, dtype: float64

In [11]:
df['income_type'].value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
entrepreneur                       2
unemployed                         2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

In [12]:
df[df.total_income.notna()]['income_type'].value_counts()/len(df[df.total_income.notna()])


employee                       0.517493
business                       0.236525
retiree                        0.177924
civil servant                  0.067800
unemployed                     0.000103
student                        0.000052
entrepreneur                   0.000052
paternity / maternity leave    0.000052
Name: income_type, dtype: float64

In [13]:
df_na['gender'].value_counts()/len(df_na)

F    0.682613
M    0.317387
Name: gender, dtype: float64

In [14]:
df[df.total_income.notna()]['gender'].value_counts()/len(df[df.total_income.notna()])

F      0.658984
M      0.340964
XNA    0.000052
Name: gender, dtype: float64

In [15]:
df_na['purpose'].nunique()

38

In [16]:
df[df.total_income.notna()]['purpose'].nunique()

38

**The shares of 'gender' and 'income_type' values are almost the same in the rows with and without missing values, and both subsets contain the same number of distinct 'purpose' values (38), so these characteristics do not appear to explain the missing values in days_employed and total_income.**
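
Comparing two separate outputs by eye is error-prone, so the shares can be placed side by side. A minimal sketch on made-up data, assuming a frame with the same column names as ours (the values and the `comparison` table are illustrative):

```python
import pandas as pd

# Toy frame with illustrative values.
toy = pd.DataFrame({
    "income_type": ["employee", "employee", "business", "retiree", "employee", "business"],
    "total_income": [20000.0, None, 31000.0, None, 18000.0, 40000.0],
})

missing = toy["total_income"].isna()

# Shares of each income type among rows with and without income data.
comparison = pd.concat(
    {
        "missing": toy.loc[missing, "income_type"].value_counts(normalize=True),
        "present": toy.loc[~missing, "income_type"].value_counts(normalize=True),
    },
    axis=1,
).fillna(0)

# Differences near zero suggest the category does not explain the gaps.
comparison["diff"] = comparison["missing"] - comparison["present"]
print(comparison)
```

The same pattern should work for 'gender' or any other categorical column.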

**Let's check whether the clients with missing values differ in any other way. We compare the summary statistics of all numeric columns in the two tables, df and df_na.**

In [17]:

df_na.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,2174.0,0.0,2174.0,2174.0,2174.0,2174.0,0.0
mean,0.552438,,43.632015,0.800828,0.975161,0.078197,
std,1.469356,,12.531481,0.530157,1.41822,0.268543,
min,-1.0,,0.0,0.0,0.0,0.0,
25%,0.0,,34.0,0.25,0.0,0.0,
50%,0.0,,43.0,1.0,0.0,0.0,
75%,1.0,,54.0,1.0,1.0,0.0,
max,20.0,,73.0,3.0,4.0,1.0,


In [18]:
df.describe()


Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,26787.568355
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,16475.450632
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,3306.762
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,23202.87
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,32549.611
max,20.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


**The distribution of values in the original df and in df_na is very similar, so the missing values show no dependency on any of the other columns.
The data appear to be missing at random, perhaps for technical reasons.**



**Intermediate conclusion**

**The distribution in the original dataset is similar to the distribution in the filtered table, so we can treat the missing values as independent of the other columns and fill them with mean or median values.**



**The missing values in the 'days_employed' and 'total_income' columns are distributed randomly. They occur in about every tenth row,
and we found no correlation with any of the other characteristics. They were most likely caused by technical issues.**


**Conclusions**

**Since the missing values appear to be accidental, we can replace them with mean or median values for these quantitative columns. But first we need to address issues with other values (inconsistent letter case, implausible values) and delete duplicates; then we'll fill in the missing values.**



## Data transformation

### Let's go through each column to see what issues we may have in them.

#### First we look through all values in education column to check if and what spellings will need to be fixed


In [19]:

df['education'].value_counts()

secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
GRADUATE DEGREE            1
Graduate Degree            1
Name: education, dtype: int64

**The letter case is inconsistent, so let's convert all values to lowercase.
Then we check the values in the column to make sure they are fixed.**

In [20]:

df['education']=df['education'].str.lower()



In [21]:

df['education'].value_counts()


secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64

**Now we see exactly the same distribution of values as in the 'education_id' column:**

In [22]:
df['education_id'].value_counts()

1    15233
0     5260
2      744
3      282
4        6
Name: education_id, dtype: int64
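
Matching counts suggest that each education label corresponds to one identifier, but they do not prove it. A minimal sketch of a direct check on made-up data (the `toy` frame and the helper variables are illustrative assumptions):

```python
import pandas as pd

# Toy frame with illustrative values and mixed letter case.
toy = pd.DataFrame({
    "education": ["secondary education", "Secondary Education",
                  "BACHELOR'S DEGREE", "bachelor's degree"],
    "education_id": [1, 1, 0, 0],
})
toy["education"] = toy["education"].str.lower()

# Each label should map to exactly one id, and each id to exactly one label.
ids_per_label = toy.groupby("education")["education_id"].nunique()
labels_per_id = toy.groupby("education_id")["education"].nunique()
is_one_to_one = bool((ids_per_label == 1).all() and (labels_per_id == 1).all())
print(is_one_to_one)
```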

#### Check the data in the `children` column

In [23]:

df['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

**There are some implausible values: in 47 rows the number of children is -1, and in 76 rows it is 20. The tables below show that these values are found among people of all ages and of different family status, education and income type, so they are most likely typos or the result of a technical issue.**

In [24]:
df[df['children'] == -1]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
291,-1,-4417.703588,46,secondary education,1,civil partnership,1,F,employee,0,16450.615,profile education
705,-1,-902.084528,50,secondary education,1,married,0,F,civil servant,0,22061.264,car purchase
742,-1,-3174.456205,57,secondary education,1,married,0,F,employee,0,10282.887,supplementary education
800,-1,349987.852217,54,secondary education,1,unmarried,4,F,retiree,0,13806.996,supplementary education
941,-1,,57,secondary education,1,married,0,F,retiree,0,,buying my own car
1363,-1,-1195.264956,55,secondary education,1,married,0,F,business,0,11128.112,profile education
1929,-1,-1461.303336,38,secondary education,1,unmarried,4,M,employee,0,17459.451,purchase of the house
2073,-1,-2539.761232,42,secondary education,1,divorced,3,F,business,0,26022.177,purchase of the house
3814,-1,-3045.290443,26,secondary education,1,civil partnership,1,F,civil servant,0,21102.846,having a wedding
4201,-1,-901.101738,41,secondary education,1,married,0,F,civil servant,0,36220.123,transactions with my real estate


In [25]:
df[df.children == 20]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
606,20,-880.221113,21,secondary education,1,married,0,M,business,0,23253.578,purchase of the house
720,20,-855.595512,44,secondary education,1,married,0,F,business,0,18079.798,buy real estate
1074,20,-3310.411598,56,secondary education,1,married,0,F,employee,1,36722.966,getting an education
2510,20,-2714.161249,59,bachelor's degree,0,widow / widower,2,F,employee,0,42315.974,transactions with commercial real estate
2941,20,-2161.591519,0,secondary education,1,married,0,F,employee,0,31958.391,to buy a car
...,...,...,...,...,...,...,...,...,...,...,...,...
21008,20,-1240.257910,40,secondary education,1,married,0,F,employee,1,21363.842,to own a car
21325,20,-601.174883,37,secondary education,1,married,0,F,business,0,16477.771,profile education
21390,20,,53,secondary education,1,married,0,M,business,0,,buy residential real estate
21404,20,-494.788448,52,secondary education,1,married,0,M,business,0,25060.749,transactions with my real estate


**We assume that the intended values were 1 and 2 instead of -1 and 20. Let's replace these values in the 'children' column.**

In [26]:
# Treat 20 as a mistyped 2 (the assumption stated above)
df['children']=df['children'].replace(20,2)


In [27]:
df['children']=df['children'].replace(-1,1)

**Let's check the `children` column again to make sure it's all fixed**

In [28]:

df['children'].value_counts()

0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64
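
The two replacements can also be written as a single call, since `Series.replace` accepts a mapping of old to new values. The sketch below uses a made-up series and only restates the assumption that -1 stands for 1 and 20 stands for 2:

```python
import pandas as pd

# Toy series with illustrative values, including the two suspicious codes.
children = pd.Series([0, 1, -1, 20, 3, 20])

# One mapping documents both assumed corrections in a single place.
fixed = children.replace({-1: 1, 20: 2})
print(fixed.tolist())
```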

#### Check the data in the `days_employed` column. 

In [29]:
df['days_employed'].value_counts().sort_index()


-18388.949901     1
-17615.563266     1
-16593.472817     1
-16264.699501     1
-16119.687737     1
                 ..
 401663.850046    1
 401674.466633    1
 401675.093434    1
 401715.811749    1
 401755.400475    1
Name: days_employed, Length: 19351, dtype: int64

**We see a large amount of problematic data: negative values and implausibly large values.**

**Let's first address the negative values:**

In [30]:

df['days_employed'][df['days_employed']<0].value_counts()

-327.685916     1
-479.053707     1
-7096.394827    1
-9686.561022    1
-5686.275084    1
               ..
-5830.601748    1
-4822.288673    1
-1389.210351    1
-540.038562     1
-3382.113891    1
Name: days_employed, Length: 15906, dtype: int64

**The majority of values in the column are negative, so this cannot be an occasional typo. But what about the other values? Let's have a look at the positive values.**

In [31]:
df['days_employed'][df['days_employed']> 0].value_counts().sort_index()

328728.720605    1
328734.923996    1
328771.341387    1
328795.726728    1
328827.345667    1
                ..
401663.850046    1
401674.466633    1
401675.093434    1
401715.811749    1
401755.400475    1
Name: days_employed, Length: 3445, dtype: int64

**The positive values in the column are far too high: they are all greater than 328,728 days, which is about 900 years. That is not realistic and also cannot be an ordinary typo.**

**All in all, negative values make up about 74% of the 'days_employed' column (15906 of 21525 rows).
Implausibly large values are another 16% (3445 of 21525).
As we saw earlier, 2174 values in the column are missing,
so NaN accounts for the remaining 10%.
We conclude that this column is corrupted throughout, and we cannot reliably correct
the wrong 'days_employed' values or restore the missing ones.**

**Fortunately, the 'days_employed' column is not needed for our task, so it is better to delete it completely.**
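
The shares can be recomputed from the counts printed above, taking the 21525 rows reported by `df.info()` as the total. This is only a worked arithmetic check; the variable names are illustrative.

```python
# Counts taken from the outputs above.
total_rows = 21525
counts = {"negative": 15906, "too_large": 3445, "missing": 2174}

# The three groups should cover every row of the column.
assert sum(counts.values()) == total_rows

shares = {name: round(count / total_rows, 3) for name, count in counts.items()}
print(shares)

# The smallest positive value, converted from days to years.
smallest_positive_years = 328728.720605 / 365.25
print(round(smallest_positive_years))
```

The smallest positive value already corresponds to roughly 900 years of work experience.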


In [32]:
df.drop('days_employed', inplace = True, axis=1)


In [33]:
df.head()


Unnamed: 0,children,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


**Now our table has one column fewer, and the implausible 'days_employed' values are gone.**

#### Let's now look at the client's age and whether there are any issues there. 

In [34]:
df['dob_years'].value_counts().sort_index()

0     101
19     14
20     51
21    111
22    183
23    254
24    264
25    357
26    408
27    493
28    503
29    545
30    540
31    560
32    510
33    581
34    603
35    617
36    555
37    537
38    598
39    573
40    609
41    607
42    597
43    513
44    547
45    497
46    475
47    480
48    538
49    508
50    514
51    448
52    484
53    459
54    479
55    443
56    487
57    460
58    461
59    444
60    377
61    355
62    352
63    269
64    265
65    194
66    183
67    167
68     99
69     85
70     65
71     58
72     33
73      8
74      6
75      1
Name: dob_years, dtype: int64

In [35]:
df['dob_years'].value_counts()/len(df)

35    0.028664
40    0.028293
41    0.028200
34    0.028014
38    0.027782
42    0.027735
33    0.026992
39    0.026620
31    0.026016
36    0.025784
44    0.025412
29    0.025319
30    0.025087
48    0.024994
37    0.024948
50    0.023879
43    0.023833
32    0.023693
49    0.023600
28    0.023368
45    0.023089
27    0.022904
56    0.022625
52    0.022485
47    0.022300
54    0.022253
46    0.022067
58    0.021417
57    0.021370
53    0.021324
51    0.020813
59    0.020627
55    0.020581
26    0.018955
60    0.017515
25    0.016585
61    0.016492
62    0.016353
63    0.012497
64    0.012311
24    0.012265
23    0.011800
65    0.009013
66    0.008502
22    0.008502
67    0.007758
21    0.005157
0     0.004692
68    0.004599
69    0.003949
70    0.003020
71    0.002695
20    0.002369
72    0.001533
19    0.000650
73    0.000372
74    0.000279
75    0.000046
Name: dob_years, dtype: float64

**We see that 101 rows have the value 0 in the 'dob_years' column, which is not realistic. That is about 0.5% of the data.
Let's take a closer look at the rows where dob_years is 0.**

In [36]:
df[df['dob_years'] == 0]

Unnamed: 0,children,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
99,0,0,secondary education,1,married,0,F,retiree,0,11406.644,car
149,0,0,secondary education,1,divorced,3,F,employee,0,11228.230,housing transactions
270,3,0,secondary education,1,married,0,F,employee,0,16346.633,housing renovation
578,0,0,secondary education,1,married,0,F,retiree,0,15619.310,construction of own property
1040,0,0,bachelor's degree,0,divorced,3,F,business,0,48639.062,to own a car
...,...,...,...,...,...,...,...,...,...,...,...
19829,0,0,secondary education,1,married,0,F,employee,0,,housing
20462,0,0,secondary education,1,married,0,F,retiree,0,41471.027,purchase of my own house
20577,0,0,secondary education,1,unmarried,4,F,retiree,0,20766.202,property
21179,2,0,bachelor's degree,0,married,0,M,business,0,38512.321,building a real estate


In [37]:
df[df['dob_years'] == 0].describe(include='all')

Unnamed: 0,children,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
count,101.0,101.0,101,101.0,101,101.0,101,101,101.0,91.0,101
unique,,,3,,5,,2,4,,,35
top,,,secondary education,,married,,F,employee,,,housing transactions
freq,,,64,,49,,72,55,,,6
mean,0.49505,0.0,,0.673267,,1.237624,,,0.079208,25334.07289,
std,0.807759,0.0,,0.512033,,1.524129,,,0.27141,11901.096532,
min,0.0,0.0,,0.0,,0.0,,,0.0,5595.912,
25%,0.0,0.0,,0.0,,0.0,,,0.0,15933.259,
50%,0.0,0.0,,1.0,,1.0,,,0.0,24387.07,
75%,1.0,0.0,,1.0,,3.0,,,0.0,34007.9075,


**The tables above show that the 0s in the 'dob_years' column occur among people with different education, family status, income and other characteristics, so they are distributed randomly.
They make up only 0.5% of the data, so we can replace 0 with the median age.**


In [38]:
df['dob_years']=df['dob_years'].replace(0,int(df['dob_years'].median()))


**Now we check the result and see no 0s in the client's age. The count for age 42, the median, has grown by 101, from 597 to 698.**

In [39]:

df['dob_years'].value_counts().sort_index()

19     14
20     51
21    111
22    183
23    254
24    264
25    357
26    408
27    493
28    503
29    545
30    540
31    560
32    510
33    581
34    603
35    617
36    555
37    537
38    598
39    573
40    609
41    607
42    698
43    513
44    547
45    497
46    475
47    480
48    538
49    508
50    514
51    448
52    484
53    459
54    479
55    443
56    487
57    460
58    461
59    444
60    377
61    355
62    352
63    269
64    265
65    194
66    183
67    167
68     99
69     85
70     65
71     58
72     33
73      8
74      6
75      1
Name: dob_years, dtype: int64
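
Note that the replacement above computes the median over the whole column, zeros included. With so few zeros the effect is small, but a variant that excludes the placeholder zeros first is easy to write. A minimal sketch on a made-up series (the values are chosen to exaggerate the difference):

```python
import pandas as pd

# Toy ages with illustrative values; 0 stands for "unknown".
ages = pd.Series([0, 25, 30, 0, 41, 58, 63])

# Median over plausible ages only, so the placeholders do not pull it down.
median_age = int(ages[ages > 0].median())

# Keep valid ages and substitute the median where the age is 0.
fixed = ages.where(ages > 0, median_age)
print(median_age, fixed.tolist())
```

In this toy series the median with zeros included would be 30 rather than 41.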

#### Now let's check the `family_status` column. See what kind of values there are and what problems may need addressing

In [40]:
df['family_status'].value_counts()


married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64

**All values in 'family_status' column are correct and realistic.**

#### Now let's check the `gender` column. 

In [41]:
df['gender'].value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

**We do not know whether XNA denotes another gender or is simply incorrect information. Let's replace it with the value 'unknown'.**

In [42]:
df['gender']= df['gender'].replace('XNA', 'unknown')

In [43]:
print(df['gender'].value_counts())

F          14236
M           7288
unknown        1
Name: gender, dtype: int64


**Now the 'gender' column is ok.**

#### Now let's check the `income_type` column. 

In [44]:
print(df['income_type'].value_counts())

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
entrepreneur                       2
unemployed                         2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64


**These values are correct.
So we have fixed all incorrect values where possible and deleted one corrupted column.**

### Now let's see if we have any duplicates in our data.

In [45]:
df.duplicated().sum()


72

In [46]:
df[df.duplicated()] 

Unnamed: 0,children,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2849,0,41,secondary education,1,married,0,F,employee,0,,purchase of the house for my family
3290,0,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
4182,1,34,bachelor's degree,0,civil partnership,1,F,employee,0,,wedding ceremony
4851,0,60,secondary education,1,civil partnership,1,F,retiree,0,,wedding ceremony
5557,0,58,secondary education,1,civil partnership,1,F,retiree,0,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...
20702,0,64,secondary education,1,married,0,F,retiree,0,,supplementary education
21032,0,60,secondary education,1,married,0,F,retiree,0,,to become educated
21132,0,47,secondary education,1,married,0,F,employee,0,,housing renovation
21281,1,30,bachelor's degree,0,married,0,F,employee,0,,buy commercial real estate


**There are 72 duplicated rows that can be dropped.**

In [47]:
df=df.drop_duplicates()

In [48]:
df.duplicated().sum()

0

**Now there are no duplicates, and every column has fewer values: 21453 rows remain.**
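
The duplicates listed above all lack 'total_income', so some of them may have become identical only after the cleaning steps (lowercasing 'education', dropping 'days_employed'). This is an assumption and has not been verified on the real data. The sketch below shows the mechanism on a made-up frame, and also resets the index after dropping rows:

```python
import pandas as pd

# Toy frame with illustrative values; the first two rows differ only in letter case.
toy = pd.DataFrame({
    "education": ["Secondary Education", "secondary education", "bachelor's degree"],
    "total_income": [None, None, 30000.0],
    "purpose": ["wedding ceremony", "wedding ceremony", "car purchase"],
})

before = int(toy.duplicated().sum())   # rows still differ in letter case
toy["education"] = toy["education"].str.lower()
after = int(toy.duplicated().sum())    # now the second row repeats the first

# Drop the repeats and renumber the rows so the index has no gaps.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(before, after, len(deduplicated))
```

Whether to reset the index is a design choice; the notebook keeps the original index, which is why `df.info()` still reports entries 0 to 21524.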

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21453 entries, 0 to 21524
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21453 non-null  int64  
 1   dob_years         21453 non-null  int64  
 2   education         21453 non-null  object 
 3   education_id      21453 non-null  int64  
 4   family_status     21453 non-null  object 
 5   family_status_id  21453 non-null  int64  
 6   gender            21453 non-null  object 
 7   income_type       21453 non-null  object 
 8   debt              21453 non-null  int64  
 9   total_income      19351 non-null  float64
 10  purpose           21453 non-null  object 
dtypes: float64(1), int64(5), object(5)
memory usage: 2.0+ MB


**So far we have corrected values in the 'education', 'children', 'dob_years' and 'gender' columns and deleted the corrupted 'days_employed' column. The other columns didn't need any changes. Then we deleted all duplicated rows in the dataset. Now we have 21453 rows and 11 columns, one of which (total_income) has missing values that need addressing.**

### Working with missing values

**Only one column 'total_income' has missing values. The other column 'days_employed' that had missing values in the same rows was deleted as totally corrupted and not important for our analysis.**

#### Restoring missing values in `total_income`

**We saw that the total_income column has about 10% missing values, which are distributed randomly and do not depend on other columns. We are going to replace them with a mean or median value of total income. Since total income differs a lot between people of different ages and income types, let's find the mean and median values for groups, compare them, and then decide which is better for filling the missing values.
First of all, let's consider the client's age and form age groups.**



In [50]:
df['dob_years'].describe()

count    21453.000000
mean        43.469025
std         12.214162
min         19.000000
25%         33.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

In [51]:
df['dob_years'].value_counts().sort_index()

19     14
20     51
21    111
22    183
23    252
24    264
25    357
26    408
27    493
28    503
29    544
30    537
31    559
32    509
33    581
34    601
35    616
36    554
37    536
38    597
39    572
40    607
41    605
42    696
43    512
44    545
45    496
46    472
47    477
48    536
49    508
50    513
51    446
52    484
53    459
54    476
55    443
56    483
57    456
58    454
59    443
60    374
61    354
62    348
63    269
64    260
65    193
66    182
67    167
68     99
69     85
70     65
71     56
72     33
73      8
74      6
75      1
Name: dob_years, dtype: int64

**Let's write a function that calculates the age category**

In [52]:
def assign_age_group(dob_years):
    if dob_years < 20:
        return '10-19'
    elif dob_years < 30:
        return '20-29'
    elif dob_years < 40:
        return '30-39'
    elif dob_years < 50:
        return '40-49'
    elif dob_years < 60:
        return '50-59'
    elif dob_years < 70:
        return '60-69'
    else:
        return '70+'
    


**Apply function assign_age_group**

In [53]:
df['dob_years'].apply(assign_age_group)


0        40-49
1        30-39
2        30-39
3        30-39
4        50-59
         ...  
21520    40-49
21521    60-69
21522    30-39
21523    30-39
21524    40-49
Name: dob_years, Length: 21453, dtype: object

**Test if the function works**

In [54]:
example = df.loc[0, 'dob_years']
assign_age_group(example)

'40-49'

**We create a new column based on the function**

In [55]:
df['age_group'] = df['dob_years'].apply(assign_age_group)

In [56]:
df['age_group'].value_counts()

30-39    5662
40-49    5454
50-59    4657
20-29    3166
60-69    2331
70+       169
10-19      14
Name: age_group, dtype: int64

**Our new column shows age group for every client. It has 7 unique values that are age groups.**
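
As an alternative to a hand-written function, the same grouping can be sketched with `pd.cut`. The bin edges and labels below mirror `assign_age_group`; the series `ages` is made up. One difference to keep in mind: ages below 10 would become missing here, while the function above labels them '10-19'.

```python
import pandas as pd

# Toy ages with illustrative values, chosen to hit the group boundaries.
ages = pd.Series([19, 20, 29, 30, 45, 59, 60, 69, 70, 75])

bins = [10, 20, 30, 40, 50, 60, 70, float("inf")]
labels = ["10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"]

# right=False makes each interval closed on the left: [10, 20), [20, 30), ...
age_group = pd.cut(ages, bins=bins, right=False, labels=labels)
print(age_group.astype(str).tolist())
```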

In [57]:
df['age_group'].describe()

count     21453
unique        7
top       30-39
freq       5662
Name: age_group, dtype: object

**Now let's compare the mean and median values of total income for the whole dataset and for the age groups.**

In [58]:
df['total_income'].describe()

count     19351.000000
mean      26787.568355
std       16475.450632
min        3306.762000
25%       16488.504500
50%       23202.870000
75%       32549.611000
max      362496.645000
Name: total_income, dtype: float64

**The difference between the mean and the median for the whole dataset is substantial: $26,788 vs. $23,203.
We need to compare the two statistics within age groups and choose which of
them to use for filling the missing values.**

**mean value for age groups**

In [59]:
age_income_mean=df.groupby(['age_group'])['total_income'].mean()
age_income_mean

age_group
10-19    16993.942462
20-29    25572.630177
30-39    28312.479963
40-49    28491.929026
50-59    25811.700327
60-69    23242.812818
70+      20125.658331
Name: total_income, dtype: float64

**median values for age groups**

In [60]:
age_income_median=df.groupby(['age_group'])['total_income'].median()
age_income_median

age_group
10-19    14934.9010
20-29    22799.2580
30-39    24667.5280
40-49    24755.6960
50-59    22203.0745
60-69    19817.4400
70+      18751.3240
Name: total_income, dtype: float64

**We see that in all age groups the mean is noticeably higher than the median, which points to a right-skewed distribution with high-income outliers. Age is connected with income type: young people may be students or have less experience, so they earn less, and retirees are mostly older people.**
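
The two statistics can also be put in one table so that the gap is visible per group. A minimal sketch on made-up data, assuming the same column names as in our frame (the values and the `summary` table are illustrative):

```python
import pandas as pd

# Toy frame with illustrative values; one large income skews the first group.
toy = pd.DataFrame({
    "age_group": ["20-29", "20-29", "20-29", "30-39", "30-39", "30-39"],
    "total_income": [10000.0, 12000.0, 50000.0, 20000.0, 22000.0, None],
})

# count ignores missing incomes, so it also shows how much data backs each statistic.
summary = toy.groupby("age_group")["total_income"].agg(["count", "mean", "median"])
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

A large positive gap in a group is a hint that a few high incomes pull the mean up.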

**Let's find the mean and median total income for the income_type groups.**


In [61]:
df['income_type'].value_counts()

employee                       11083
business                        5078
retiree                         3829
civil servant                   1457
entrepreneur                       2
unemployed                         2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

In [62]:
inctype_income_mean=df.groupby(['income_type'])['total_income'].mean()
inctype_income_mean

income_type
business                       32386.793835
civil servant                  27343.729582
employee                       25820.841683
entrepreneur                   79866.103000
paternity / maternity leave     8612.661000
retiree                        21940.394503
student                        15712.260000
unemployed                     21014.360500
Name: total_income, dtype: float64

In [63]:
inctype_income_median=df.groupby(['income_type'])['total_income'].median()
inctype_income_median

income_type
business                       27577.2720
civil servant                  24071.6695
employee                       22815.1035
entrepreneur                   79866.1030
paternity / maternity leave     8612.6610
retiree                        18962.3180
student                        15712.2600
unemployed                     21014.3605
Name: total_income, dtype: float64

**As with age groups, the mean values here are greater than the median values. 
The two statistics coincide only in very small groups (1-2 people).
The gap between the mean and the median is probably caused by outliers, and the median is less sensitive to them.
So the best choice is to replace missing values with the median of the corresponding income_type group.**

   
        

In [64]:
df['total_income'] = df['total_income'].fillna(df.groupby('income_type')['total_income'].transform('median'))

**Check if we got any errors:**

In [65]:
df['total_income'].isna().sum()

0

**Now there are no missing values in our data. Every column contains the same number of non-null values, which is equal to the number of rows.**
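To see why `transform('median')` works for this kind of fill, here is a small sketch on toy data (hypothetical values; `group_median` is just an illustrative name).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one gap in each group
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree'],
    'total_income': [20000.0, 30000.0, np.nan, 15000.0, np.nan],
})

# transform('median') returns a Series aligned with the original rows;
# each row holds the median of its own group (NaN values are skipped)
group_median = toy.groupby('income_type')['total_income'].transform('median')

# fillna replaces only the missing entries, row by row
toy['total_income'] = toy['total_income'].fillna(group_median)
print(toy)
```

If a group had no known incomes at all, its median would itself be NaN and the gap would remain, so the `isna().sum()` check is worth keeping.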



In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21453 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21453 non-null  int64  
 1   dob_years         21453 non-null  int64  
 2   education         21453 non-null  object 
 3   education_id      21453 non-null  int64  
 4   family_status     21453 non-null  object 
 5   family_status_id  21453 non-null  int64  
 6   gender            21453 non-null  object 
 7   income_type       21453 non-null  object 
 8   debt              21453 non-null  int64  
 9   total_income      21453 non-null  float64
 10  purpose           21453 non-null  object 
 11  age_group         21453 non-null  object 
dtypes: float64(1), int64(5), object(6)
memory usage: 2.6+ MB


####  Restoring values in `days_employed`


**Here I can repeat what was stated above. 
Missing values make up about 10% of the days_employed column (2174 NaN of 21525).
Negative values make up about 74% of the column (15906 of 21525). 
Unrealistically large values make up the remaining 16% (3445 of 21525).
So this column is badly corrupted: it contains no reliable information and cannot be corrected.
Moreover, this problem has little effect on our main task, so let's not spend time on restoring the column. 
I decided to delete the column as irrelevant and corrupted.**
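The shares quoted above can be computed along these lines. This is a sketch on a toy column with hypothetical values, and it assumes that every positive value belongs to the unrealistically large group, which is what the three percentages imply.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): NaN, negative and implausibly large entries
toy = pd.DataFrame({'days_employed': [np.nan, -1200.5, -340.0, -8000.2, 350000.0]})

col = toy['days_employed']

# The mean of a boolean mask is the share of True values;
# comparisons with NaN are False, so missing rows count only under 'missing'
shares = pd.Series({
    'missing': col.isna().mean(),
    'negative': (col < 0).mean(),
    'positive': (col > 0).mean(),
})
print(shares)

# Dropping the column once it is judged unusable
toy = toy.drop(columns='days_employed')
```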



## Categorization of data

**To answer the questions and test the hypotheses, we need categorized data.**

**The 'purpose' column contains many string values that need categorization.**


In [67]:
df['purpose'].unique()

array(['purchase of the house', 'car purchase', 'supplementary education',
       'to have a wedding', 'housing transactions', 'education',
       'having a wedding', 'purchase of the house for my family',
       'buy real estate', 'buy commercial real estate',
       'buy residential real estate', 'construction of own property',
       'property', 'building a property', 'buying a second-hand car',
       'buying my own car', 'transactions with commercial real estate',
       'building a real estate', 'housing',
       'transactions with my real estate', 'cars', 'to become educated',
       'second-hand car purchase', 'getting an education', 'car',
       'wedding ceremony', 'to get a supplementary education',
       'purchase of my own house', 'real estate transactions',
       'getting higher education', 'to own a car', 'purchase of a car',
       'profile education', 'university education',
       'buying property for renting out', 'to buy a car',
       'housing renovation', 'going

**Here we have similar values such as 'wedding ceremony', 'having a wedding' and 'to have a wedding'. These values are variants of a single category, wedding. The same applies to education, house and car. There are 4 big categories, and it is possible to divide all values in the 'purpose' column into them to simplify further analysis.**

In [68]:
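# Let's define the function get_cat.
# It maps a free-text loan purpose to one of four categories by keyword;
# the rules are checked in order and the first match wins.
def get_cat(x):
    if 'wedding' in x:
        return 'wedding'
    if 'real estate' in x or 'hous' in x or 'proper' in x:
        return 'real estate'
    if 'car' in x:
        return 'car'
    if 'educat' in x or 'university' in x:
        return 'education'
    # Fallback for strings that match none of the keywords
    return 'other'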



In [69]:
df.loc[0, 'purpose']

'purchase of the house'

In [70]:
example = df.loc[0, 'purpose']
get_cat(example)

'real estate'

**We create a new column 'purpose_cat':**

In [71]:
df['purpose_cat'] = df['purpose'].apply(get_cat)


**Let's now check the new categories and how many rows fall into each:**

In [72]:
df['purpose_cat'].value_counts()

real estate    10811
car             4306
education       4013
wedding         2323
Name: purpose_cat, dtype: int64

**Now all values in the 'purpose' column are divided into 4 big categories, saved in the 'purpose_cat' column.**
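A quick way to double-check a keyword mapping like this is to list the raw strings that ended up in each category and to count the ones that matched no rule. The sketch below repeats the same rules on a handful of purpose strings copied from the `unique()` output; `mapping` and `unmatched` are illustrative names.

```python
import pandas as pd

def get_cat(x):
    # Same keyword rules as in the notebook cell
    if 'wedding' in x:
        return 'wedding'
    if 'real estate' in x or 'hous' in x or 'proper' in x:
        return 'real estate'
    if 'car' in x:
        return 'car'
    if 'educat' in x or 'university' in x:
        return 'education'
    return 'other'

# A few purpose strings copied from the output of df['purpose'].unique()
toy = pd.DataFrame({'purpose': [
    'purchase of the house', 'car purchase', 'to become educated',
    'wedding ceremony', 'buying property for renting out', 'university education',
]})
toy['purpose_cat'] = toy['purpose'].apply(get_cat)

# Raw strings grouped under each category, for a visual check
mapping = toy.groupby('purpose_cat')['purpose'].unique()
print(mapping)

# Number of strings that matched none of the rules
unmatched = (toy['purpose_cat'] == 'other').sum()
print(unmatched)
```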


**The 'total_income' column contains many distinct numerical values, which can also be categorized.**

In [73]:
df['total_income'].describe()

count     21453.000000
mean      26451.382421
std       15710.314734
min        3306.762000
25%       17219.352000
50%       22815.103500
75%       31331.461000
max      362496.645000
Name: total_income, dtype: float64

**To categorize income values we create the income_lev function, with thresholds chosen close to the quartiles (Q1, Q2, Q3) from the table above.**

In [74]:
def income_lev(income):
    if income < 17214:
        return 'level 1'
    if income < 23240:
        return 'level 2'
    if income < 31330:
        return 'level 3'
    else:
        return 'level 4'
    

In [75]:
df['income_lev']=df['total_income'].apply(income_lev)

In [76]:
df['income_lev']

0        level 4
1        level 2
2        level 3
3        level 4
4        level 3
          ...   
21520    level 4
21521    level 3
21522    level 1
21523    level 4
21524    level 1
Name: income_lev, Length: 21453, dtype: object
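As an alternative to hand-written thresholds, pandas can compute the quartile edges itself with `pd.qcut`. This is only a sketch on toy incomes (hypothetical values), not the method used in this notebook; note that `qcut` bins are right-inclusive, while `income_lev` above uses strict `<` comparisons.

```python
import pandas as pd

# Toy incomes (hypothetical values)
income = pd.Series([9000.0, 15000.0, 18000.0, 21000.0,
                    24000.0, 29000.0, 35000.0, 90000.0])

# q=4 splits the values into four bins with roughly equal numbers of rows
labels = ['level 1', 'level 2', 'level 3', 'level 4']
income_lev = pd.qcut(income, q=4, labels=labels)

print(income_lev.value_counts().sort_index())
```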

**Now let's take a look at the 'debt' column, which is the focus of our analysis. It only takes the values 0 and 1:**

In [77]:
df['debt']

0        0
1        0
2        0
3        0
4        0
        ..
21520    0
21521    0
21522    1
21523    1
21524    0
Name: debt, Length: 21453, dtype: int64

**Let's find the mean value of the column:**

df['debt'].describe()


**We get the same value by dividing the number of 1s by the total number of values in the column:**

In [78]:
# Debt rate: number of clients with debt divided by the total number of clients
df['debt'].sum()/df['debt'].count()


0.08115415093460122

**We found the overall DEBT RATE: 0.081. It is the mean of the 'debt' column, i.e. the number of debt cases divided by the total number of cases.
Let's investigate the relationship between the debt rate and other client characteristics.**
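The equivalence of the two calculations holds for any 0/1 column, because summing the flags counts the ones. A minimal sketch with hypothetical flags:

```python
import pandas as pd

# Hypothetical 0/1 flags: 2 clients with debt out of 10
debt = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 0, 0])

# Number of ones divided by the number of values ...
rate_by_division = debt.sum() / debt.count()

# ... is the same as the mean of the column
rate_by_mean = debt.mean()

print(rate_by_division, rate_by_mean)
```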

## Checking the Hypotheses


**The initial hypothesis is that a customer's marital status and number of children have an impact on whether they will default on a loan.**

**Let's check whether a customer's ability to pay back on time depends on the number of children, marital status and other characteristics.**


### Is there a correlation between having children and paying back on time?

**We may need some statistics, so let's import sidetable**

In [79]:
!pip install sidetable
import sidetable

Collecting sidetable
  Downloading sidetable-0.9.1-py3-none-any.whl (19 kB)
Installing collected packages: sidetable
Successfully installed sidetable-0.9.1


**Let's divide the cases with debt (debt==1) into groups by number of children.**



In [80]:
df[df.debt==1].groupby(['children'])['debt'].count()

children
0    1063
1     445
2     202
3      27
4       4
Name: debt, dtype: int64

In [81]:
df.stb.freq(['children'])

|   | children | count | percent | cumulative_count | cumulative_percent |
|---|---|---|---|---|---|
| 0 | 0 | 14090 | 65.67846 | 14090 | 65.67846 |
| 1 | 1 | 4855 | 22.630867 | 18945 | 88.309327 |
| 2 | 2 | 2128 | 9.919359 | 21073 | 98.228686 |
| 3 | 3 | 330 | 1.538246 | 21403 | 99.766932 |
| 4 | 4 | 41 | 0.191115 | 21444 | 99.958048 |
| 5 | 5 | 9 | 0.041952 | 21453 | 100.0 |


**Most of the debts belong to clients with no children, but this is also the biggest client category (66% of all clients).
So we need to compare not the number of debts but the mean value of 'debt' (the debt rate) in each children category.**

In [82]:
df.groupby(['children'])['debt'].mean().reset_index().sort_values(by='debt')

|   | children | debt |
|---|---|---|
| 5 | 5 | 0.0 |
| 0 | 0 | 0.075444 |
| 3 | 3 | 0.081818 |
| 1 | 1 | 0.091658 |
| 2 | 2 | 0.094925 |
| 4 | 4 | 0.097561 |
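Since group sizes differ a lot, it can help to see the number of clients, the number of defaults and the debt rate of each group in one table. The sketch below uses toy data (hypothetical values); the column names `clients`, `defaults` and `debt_rate` are illustrative.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 2, 2, 5],
    'debt':     [0, 0, 1, 0, 1, 0, 0, 1, 0],
})

# Named aggregation: group size, number of defaults and debt rate together
summary = (
    toy.groupby('children')['debt']
       .agg(clients='count', defaults='sum', debt_rate='mean')
       .sort_values('debt_rate')
)
print(summary)
```

A rate computed on a handful of clients, like the toy group with 5 children here, should be read with caution.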


**Conclusion**


**Comparing each group with the overall debt rate of 0.081, we can assign debt/children scores:**
* 0 children: -1
* 1 child: +1
* 2 children: +2
* 3 children: 0
* 4 children: +3
* 5 children: -2

**The risk of default is highest for customers with 4 children and lowest for customers with no children. The group with 5 children shows no defaults at all, but it contains only 9 clients.**

### Is there a correlation between family status and paying back on time?

**Let's check family_status data**

In [83]:
df.stb.freq(['family_status'])

|   | family_status | count | percent | cumulative_count | cumulative_percent |
|---|---|---|---|---|---|
| 0 | married | 12339 | 57.516431 | 12339 | 57.516431 |
| 1 | civil partnership | 4150 | 19.344614 | 16489 | 76.861045 |
| 2 | unmarried | 2810 | 13.098401 | 19299 | 89.959446 |
| 3 | divorced | 1195 | 5.570317 | 20494 | 95.529763 |
| 4 | widow / widower | 959 | 4.470237 | 21453 | 100.0 |


**Find mean debt in family_status groups:**

In [84]:

df.groupby(['family_status'])['debt'].mean().reset_index().sort_values(by='debt')


|   | family_status | debt |
|---|---|---|
| 4 | widow / widower | 0.065693 |
| 1 | divorced | 0.07113 |
| 2 | married | 0.075452 |
| 0 | civil partnership | 0.093494 |
| 3 | unmarried | 0.097509 |


**Conclusion**

**Comparing each group with the overall debt rate of 0.081, we can assign debt/family status scores:**
* married: -1
* civil partnership: +1
* unmarried: +2
* divorced: -2
* widow / widower: -3

**The risk of default is highest for unmarried customers and lowest for widows and widowers.**

### Is there a correlation between income level and paying back on time?

**We see that a customer's number of children and marital status are related to their ability to pay back on time.
What about other client characteristics?**

**Let's check the income level data.**

In [85]:
df.stb.freq(['income_lev'])

|   | income_lev | count | percent | cumulative_count | cumulative_percent |
|---|---|---|---|---|---|
| 0 | level 2 | 5800 | 27.035846 | 5800 | 27.035846 |
| 1 | level 4 | 5365 | 25.008157 | 11165 | 52.044003 |
| 2 | level 1 | 5363 | 24.998835 | 16528 | 77.042838 |
| 3 | level 3 | 4925 | 22.957162 | 21453 | 100.0 |


**Find mean debt in income level groups:**

In [86]:
# Calculating the default rate for each income level
df.groupby(['income_lev'])['debt'].mean().reset_index().sort_values(by='debt')



|   | income_lev | debt |
|---|---|---|
| 3 | level 4 | 0.071389 |
| 0 | level 1 | 0.07962 |
| 2 | level 3 | 0.085279 |
| 1 | level 2 | 0.088103 |


**Conclusion**

**Comparing each group with the overall debt rate of 0.081, we can assign debt/income scores:**
* level 1: -1
* level 2: +1
* level 3: +2
* level 4: -2

**The risk of default is higher in the middle income levels (2 and 3) than in the lowest one, while the highest-income clients (level 4) have the lowest risk of default.**
**A larger number of income categories could show the relation in more detail, but the principle would be the same.**


### How does credit purpose affect the debt rate?

**We categorized the values in the 'purpose' column and now have 4 purpose categories:**

In [87]:
df.stb.freq(['purpose_cat'])

|   | purpose_cat | count | percent | cumulative_count | cumulative_percent |
|---|---|---|---|---|---|
| 0 | real estate | 10811 | 50.393884 | 10811 | 50.393884 |
| 1 | car | 4306 | 20.071785 | 15117 | 70.465669 |
| 2 | education | 4013 | 18.706008 | 19130 | 89.171678 |
| 3 | wedding | 2323 | 10.828322 | 21453 | 100.0 |


**Find mean debt in purpose groups:**

In [88]:
df.groupby(['purpose_cat'])['debt'].mean().reset_index().sort_values(by='debt')


|   | purpose_cat | debt |
|---|---|---|
| 2 | real estate | 0.072334 |
| 3 | wedding | 0.080069 |
| 1 | education | 0.0922 |
| 0 | car | 0.09359 |


**Conclusion**

**Comparing each group with the overall debt rate of 0.081, we can assign debt/purpose scores:**

* real estate: -2
* car: +2
* education: +1
* wedding: -1

**The risk of default is highest when the customer's purpose is buying a car and lowest when the money is borrowed for real estate.**

### Conclusion

**Our hypothesis was confirmed: a customer's marital status and number of children have an impact on their ability to pay back on time. The customer's income level and the purpose of the loan also affect the risk of default.**

**To estimate a client's default risk, we can look up and sum their debt scores.**


**debt/children scores (-2 to +3):**

* 0 children: -1
* 1 child: +1
* 2 children: +2
* 3 children: 0
* 4 children: +3
* 5 children: -2

**debt/family status scores (-3 to +2):**

* married: -1
* civil partnership: +1
* unmarried: +2
* divorced: -2
* widow / widower: -3

**debt/income scores (-2 to +2):**

* level 1: -1
* level 2: +1
* level 3: +2
* level 4: -2

**debt/purpose scores (-2 to +2):**

* real estate: -2
* car: +2
* education: +1
* wedding: -1

**The total score ranges from -9 to +9, where -9 means the minimum risk of a client's default and +9 means the maximum risk.
The scale is far from perfect: the categories are rough and asymmetric. But as a first approach to the task it is acceptable.**
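The total score can be sketched as a simple lookup-and-sum. The dictionaries below copy the four score lists above; the function name `total_score` and the constant names are hypothetical and only serve as an illustration.

```python
# Score tables copied from the lists above
CHILDREN_SCORE = {0: -1, 1: 1, 2: 2, 3: 0, 4: 3, 5: -2}
FAMILY_SCORE = {
    'married': -1, 'civil partnership': 1, 'unmarried': 2,
    'divorced': -2, 'widow / widower': -3,
}
INCOME_SCORE = {'level 1': -1, 'level 2': 1, 'level 3': 2, 'level 4': -2}
PURPOSE_SCORE = {'real estate': -2, 'car': 2, 'education': 1, 'wedding': -1}


def total_score(children, family_status, income_lev, purpose_cat):
    """Sum the four partial scores for one client (hypothetical helper)."""
    return (
        CHILDREN_SCORE[children]
        + FAMILY_SCORE[family_status]
        + INCOME_SCORE[income_lev]
        + PURPOSE_SCORE[purpose_cat]
    )


# Range check: sum of the smallest and of the largest score in each table
tables = [CHILDREN_SCORE, FAMILY_SCORE, INCOME_SCORE, PURPOSE_SCORE]
min_total = sum(min(t.values()) for t in tables)
max_total = sum(max(t.values()) for t in tables)
print(min_total, max_total)

# Example: a married client with no children, income level 1, wedding loan
print(total_score(0, 'married', 'level 1', 'wedding'))
```

For a whole DataFrame the same lookups could be done column by column, for instance with `Series.map` and the dictionaries above.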



## General conclusion


**We investigated our dataset and revealed some problems with the data: missing values, weird values and duplicates. The problem values were addressed: weird values were corrected when possible, missing values and zeros were in some cases replaced with mean or median values, and duplicates were deleted. One column ('days_employed') was deleted entirely, as it contained only incorrect and missing values which could not be restored and were not necessary for the main goal of our analysis.
After preprocessing, the data was categorized. We created functions to divide the data in the 'purpose' column into 4 big categories and to define age groups and income levels.
Then we calculated the overall debt rate (0.081) and looked at its relation to having children, marital status, income level and credit purpose.
We confirmed our hypothesis that a customer's ability to pay back on time is affected by their marital status and number of children, as well as by their income level and the purpose of the loan.
Four debt scores were proposed for the customer's characteristics, with a total score ranging from -9 to +9, which can be used to estimate a customer's default risk.**
