# Credit Scoring Project
### Analyzing borrowers’ risk of defaulting

The goal is to prepare a report for a bank’s loan division. We want to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan.



## Content plan

1. [Step1](#Step1) Opening the data file and studying the general information
2. [Step2](#Step2) Data preprocessing
3. [Step3](#Step3) Analysing the data
4. [Overall conclusion](#oc)

### Step1

In [111]:
import pandas as pd
data = pd.read_csv('credit.csv')

data.info() # there are 12 columns and 21525 rows; total_income and days_employed have only 19351 non-null values, so some values are missing. I will check this during preprocessing


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [112]:
data.head() # the same values are written in different ways (for example, in education), which I will need to fix. The days_employed column has negative numbers; I need to check these values during data preprocessing

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


In [113]:
data.describe() # I need to check the children and dob_years columns as well: -1 and 20 children look abnormal, and an age of 0 can't be right either

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,26787.568355
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,16475.450632
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,3306.762
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,23202.87
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,32549.611
max,20.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


In [114]:
data['purpose'].value_counts() # the unique values in this column can be combined into four groups. I will use this information later for lemmatization

wedding ceremony                            797
having a wedding                            777
to have a wedding                           774
real estate transactions                    676
buy commercial real estate                  664
buying property for renting out             653
housing transactions                        653
transactions with commercial real estate    651
housing                                     647
purchase of the house                       647
purchase of the house for my family         641
construction of own property                635
property                                    634
transactions with my real estate            630
building a real estate                      626
buy real estate                             624
building a property                         620
purchase of my own house                    620
housing renovation                          612
buy residential real estate                 607
buying my own car                       

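### Conclusion

The DataFrame contains a number of problems: missing values in days_employed and total_income, negative and unrealistically large values in days_employed, suspicious values in children (-1 and 20) and dob_years (0), and inconsistent capitalization in education. The values in some columns have no effect on the study, while others are important for answering the questions.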

### Step2

#### Processing missing values

In [115]:
# As I know from the first part, the total_income column has missing values.
data['total_income'].isnull().sum()

2174

In [116]:
data['days_employed'].isnull().sum()

2174

The number of rows with missing values is the same in both columns. I will check whether these are the same rows. To do this, I will make a new DataFrame containing the rows with missing values in the total_income column.

In [117]:
total_income_check = data[data['total_income'].isnull()]
total_income_check.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2174 entries, 12 to 21510
Data columns (total 12 columns):
children            2174 non-null int64
days_employed       0 non-null float64
dob_years           2174 non-null int64
education           2174 non-null object
education_id        2174 non-null int64
family_status       2174 non-null object
family_status_id    2174 non-null int64
gender              2174 non-null object
income_type         2174 non-null object
debt                2174 non-null int64
total_income        0 non-null float64
purpose             2174 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 220.8+ KB
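
The same check can also be done directly by comparing the two null masks row by row. This is only a minimal sketch on a hypothetical four-row sample (the `sample` frame and its values are made up for illustration, not taken from `credit.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset with only the two columns of interest.
sample = pd.DataFrame({
    'days_employed': [-8437.67, np.nan, -4024.80, np.nan],
    'total_income': [40620.10, np.nan, 17932.80, np.nan],
})

# Row-wise null masks for each column.
income_missing = sample['total_income'].isnull()
days_missing = sample['days_employed'].isnull()

# True only if every row is missing in both columns or in neither of them.
same_rows = bool((income_missing == days_missing).all())
print(same_rows)
```

On the real data the same two masks would be built from `data` instead of `sample`.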


As we can see, the missing values in total_income and days_employed are in the same rows: among the 2174 rows with missing total_income, days_employed has 0 non-null values. Perhaps an error occurred while uploading the data, or the bank did not require this data in loan applications.
We can replace the missing values with the mean or the median. Since the median is less affected by extreme values, I'll use it. In addition, income_type can affect the typical total_income. After calculating the overall median and the medians for different groups, I can determine which option is best for the replacement.

In [118]:
total_income_median = data['total_income'].median()
total_income_median

23202.87

In [119]:
data['income_type'].value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
paternity / maternity leave        1
student                            1
Name: income_type, dtype: int64

In [120]:
data_employee = data[data['income_type'] == 'employee']
total_income_employee_median = data_employee['total_income'].median()
total_income_employee_median

22815.103499999997

In [121]:
data_business = data[data['income_type'] == 'business']
total_income_business_median = data_business['total_income'].median()
total_income_business_median

27577.272

In [122]:
data_retiree = data[data['income_type'] == 'retiree']
total_income_retiree_median = data_retiree['total_income'].median()
total_income_retiree_median

18962.318

In [123]:
data_civil_servant = data[data['income_type'] == 'civil servant']
total_income_civil_servant_median = data_civil_servant['total_income'].median()
total_income_civil_servant_median

24071.6695

The median total_income differs noticeably between income_type groups (from about 18,962 for retirees to about 27,577 for business), so I will use the group medians.

In [124]:
data.loc[data['income_type'] == 'employee', 'total_income'] = data.loc[data['income_type'] == 'employee', 'total_income'].fillna(value = total_income_employee_median)
data.loc[data['income_type'] == 'business', 'total_income'] = data.loc[data['income_type'] == 'business', 'total_income'].fillna(value = total_income_business_median)
data.loc[data['income_type'] == 'retiree', 'total_income'] = data.loc[data['income_type'] == 'retiree', 'total_income'].fillna(value = total_income_retiree_median)
data.loc[data['income_type'] == 'civil servant', 'total_income'] = data.loc[data['income_type'] == 'civil servant', 'total_income'].fillna(value = total_income_civil_servant_median)
data['total_income'] = data['total_income'].fillna(value = total_income_median) # if a NaN value remains in a row with another income_type, I fill it with the overall median
data['total_income'].isnull().sum()

0
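
The four near-identical `fillna` lines above can be written more compactly with `groupby(...).transform('median')`. The sketch below runs on a hypothetical six-row sample and is not identical to the cell above: it uses the group's own median for every income_type that has at least one known income, and falls back to the overall median only for groups with no known income at all.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the values are made up for illustration.
sample = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree', 'student'],
    'total_income': [20000.0, 30000.0, np.nan, 18000.0, np.nan, np.nan],
})

# Median of each row's own income_type group, aligned with the original index.
group_median = sample.groupby('income_type')['total_income'].transform('median')

# Overall median of the known values, used when a whole group is missing.
overall_median = sample['total_income'].median()

sample['total_income'] = (
    sample['total_income'].fillna(group_median).fillna(overall_median)
)
print(sample)
```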

The next step is to deal with the odd values in the children column. As I already know, it contains the negative value -1. I assume it was a typo, so I will change -1 to 1.

In [125]:
data['children'] = data['children'].replace(-1, 1)
data['children'].value_counts()

0     14149
1      4865
2      2055
3       330
20       76
4        41
5         9
Name: children, dtype: int64

The children column also contains the value 20. It looks suspicious, because there are no values between 5 and 20, so I assume that this is a typo too. I don't want to drop these rows, so I will change the value. But should it be 0 or 2? I will calculate the median total_income for the groups with 0, 2 and 20 children to try to figure out the correct value.

In [126]:
children_0 = data.query('children == 0')
children_0_income_median = children_0['total_income'].median()
children_0_income_median

22815.103499999997

In [127]:
children_2 = data.query('children == 2')
children_2_income_median = children_2['total_income'].median()
children_2_income_median

22815.103499999997

In [128]:
children_20 = data.query('children == 20')
children_20_income_median = children_20['total_income'].median()
children_20_income_median

23045.9195

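The median total_income is identical for the groups with 0 and 2 children, and the value for the group with 20 children is close to both, so income does not help to choose between them. I will change 20 to 2, treating the extra zero as a typo.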

In [129]:
data['children'] = data['children'].replace(20, 2)
data['children'].value_counts()

0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64

Now I want to deal with the 0 values in the dob_years column.

In [130]:
data['dob_years'].value_counts()

35    617
40    609
41    607
34    603
38    598
42    597
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
43    513
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
66    183
22    183
67    167
21    111
0     101
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64

There are 101 people with age 0. I will replace 0 with the median age, which is a whole number here.

In [131]:
dob_years_median = data['dob_years'].median()
dob_years_median

42.0

In [132]:
data['dob_years'] = data['dob_years'].replace(0, 42)
data['dob_years'].value_counts()

42    698
35    617
40    609
41    607
34    603
38    598
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
43    513
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
66    183
22    183
67    167
21    111
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64

We have one more column with missing values: days_employed. It also contains negative numbers, which can't be right. Maybe it was a typo during input or a systematic error. I will take the absolute values and convert days into years to check whether these values are plausible.

In [133]:
data['days_employed'] = abs(data['days_employed'])
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


In [134]:
data['years_employed'] = data['days_employed'] / 365.25
data['years_employed'] = data['years_employed'].fillna(value=0)
data.sort_values(by='years_employed', ascending=False).head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,years_employed
6954,0,401755.400475,56,secondary education,1,widow / widower,2,F,retiree,0,28204.551,housing renovation,1099.946339
10006,0,401715.811749,69,bachelor's degree,0,unmarried,4,F,retiree,0,9182.441,getting an education,1099.837951
7664,1,401675.093434,61,secondary education,1,married,0,F,retiree,0,20194.323,housing transactions,1099.726471
2156,0,401674.466633,60,secondary education,1,married,0,M,retiree,0,52063.316,cars,1099.724755
7794,0,401663.850046,61,secondary education,1,civil partnership,1,F,retiree,0,7725.831,wedding ceremony,1099.695688
4697,0,401635.032697,56,secondary education,1,married,0,F,retiree,0,7718.772,buy real estate,1099.61679
13420,0,401619.633298,63,Secondary Education,1,civil partnership,1,F,retiree,0,8231.966,to have a wedding,1099.574629
17823,0,401614.475622,59,secondary education,1,married,0,F,retiree,0,24443.151,buying property for renting out,1099.560508
10991,0,401591.828457,56,secondary education,1,divorced,3,F,retiree,0,6322.163,to get a supplementary education,1099.498504
8369,0,401590.452231,58,secondary education,1,married,0,F,retiree,0,28049.01,education,1099.494736


In [135]:
data.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income,years_employed
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,21525.0,21525.0
mean,0.479721,66914.728907,43.490453,0.817236,0.972544,0.080883,26433.419484,164.699299
std,0.755528,139030.880527,12.218595,0.548138,1.420324,0.272661,15682.773718,365.108637
min,0.0,24.141633,19.0,0.0,0.0,0.0,3306.762,0.0
25%,0.0,927.009265,34.0,1.0,0.0,0.0,17247.708,1.671874
50%,0.0,2194.220567,42.0,1.0,0.0,0.0,22815.1035,4.950181
75%,1.0,5537.882441,53.0,1.0,1.0,0.0,31286.979,13.085798
max,5.0,401755.400475,75.0,4.0,4.0,1.0,362496.645,1099.946339


1099 working years is unrealistic. Considering the negative values in the raw data and these unrealistic results, I assume that there was some error during input or output. It would be good to ask the IT specialists what could have gone wrong. One way or another, this column is not needed to answer the questions, so I will delete it.
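
One simple way to flag such rows is to compare the years of employment with the person's age. The sketch below uses two rows copied from the `data.head()` output above; the `implausible` column name is hypothetical and only serves the illustration.

```python
import pandas as pd

# days_employed and dob_years of rows 0 and 4 from data.head() above.
sample = pd.DataFrame({
    'days_employed': [-8437.673028, 340266.072047],
    'dob_years': [42, 53],
})

# Same conversion as in the notebook: absolute value, then days to years.
sample['years_employed'] = sample['days_employed'].abs() / 365.25

# Nobody can have worked longer than they have lived.
sample['implausible'] = sample['years_employed'] > sample['dob_years']
print(sample)
```

The first row gives about 23 years of employment at age 42, which is plausible; the second gives more than 900 years and is flagged.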

In [136]:
list(data)

['children',
 'days_employed',
 'dob_years',
 'education',
 'education_id',
 'family_status',
 'family_status_id',
 'gender',
 'income_type',
 'debt',
 'total_income',
 'purpose',
 'years_employed']

In [137]:
data = data[['children', 'dob_years', 'education', 'education_id', 'family_status', 'family_status_id', 'gender', 'income_type', 'debt', 'total_income', 'purpose']]
data.head()

Unnamed: 0,children,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


#### Conclusion

- The missing values were in two columns, total_income and days_employed, and in the same rows. It is possible that this data was absent from the start because the bank did not require it in loan applications.
- I replaced the missing data in total_income with the median values for the different income types for greater accuracy, since the medians differed quite strongly between some groups.
- The data in the days_employed column was most likely uploaded with an error: the negative values and the unrealistic result of about 1,100 working years clearly point to this. I replaced the missing values with 0 in the derived years_employed column. Since it is impossible to get the corrected data and this column is not required for the research, I removed it and will not use it in further work.
- There were no gaps in the children column, but there were incorrect values, which I considered typos and replaced with plausible values. I also replaced the zero values in dob_years with the median age.

#### Data type replacement

For convenience in further work, I will convert the data in Total_income to integers. 

In [138]:
data['total_income'] = data['total_income'].astype('int')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 11 columns):
children            21525 non-null int64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        21525 non-null int64
purpose             21525 non-null object
dtypes: int64(6), object(5)
memory usage: 1.8+ MB


#### Conclusion

For convenience and readability, the data in the total_income column has been converted to integers. Since the digits after the decimal point do not play a significant role, this does not affect the results.
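
Note that converting floats with `astype` drops the fractional part instead of rounding, which matches the integer incomes shown in the later tables (17932.802 becomes 17932). A short sketch with two income values from `data.head()`; rounding first is shown only as an alternative, not as what the notebook does:

```python
import pandas as pd

# Two total_income values from the first rows of the dataset.
income = pd.Series([40620.102, 17932.802])

# astype truncates toward zero.
truncated = income.astype('int64')

# Rounding before the conversion gives the nearest integer instead.
rounded = income.round().astype('int64')

print(truncated.tolist(), rounded.tolist())
```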

#### Processing duplicates

As I already know, the same values are written in different ways in the education column. I will deal with this problem using the str.lower() method.

In [139]:
data['education'] = data['education'].str.lower()
data['education'].value_counts()

secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64

Now I will check the family_status column and after that check all the data for duplicates using the duplicated() and sum() methods.

In [140]:
data['family_status'].value_counts()

married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64

In [141]:
data.duplicated().sum()

72

In [142]:
data = data.drop_duplicates()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21453 entries, 0 to 21524
Data columns (total 11 columns):
children            21453 non-null int64
dob_years           21453 non-null int64
education           21453 non-null object
education_id        21453 non-null int64
family_status       21453 non-null object
family_status_id    21453 non-null int64
gender              21453 non-null object
income_type         21453 non-null object
debt                21453 non-null int64
total_income        21453 non-null int64
purpose             21453 non-null object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


#### Conclusion

By this point in the project, I've changed all of the values in the education column to a single text style, checked the family_status column and removed the 72 duplicate rows. The data is ready for further work.
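
The order of these steps matters: rows that differ only in capitalization are counted as duplicates only after the text has been normalized. A minimal sketch on a hypothetical three-row sample (the values are made up); `reset_index(drop=True)` is an optional extra that removes the gaps left in the index after dropping rows:

```python
import pandas as pd

# Hypothetical rows that differ only in the capitalization of education.
sample = pd.DataFrame({
    'education': ['secondary education', 'Secondary Education', 'SECONDARY EDUCATION'],
    'debt': [0, 0, 0],
})

# Before normalization no row is an exact copy of another.
before = int(sample.duplicated().sum())

sample['education'] = sample['education'].str.lower()

# After normalization the second and third rows repeat the first one.
after = int(sample.duplicated().sum())

deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(before, after, len(deduplicated))
```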

#### Categorizing Data

After checking the unique values in the purpose column, I think it is a good idea to group all the purposes by theme.

In [143]:
data['purpose'].value_counts()

wedding ceremony                            791
having a wedding                            767
to have a wedding                           765
real estate transactions                    675
buy commercial real estate                  661
housing transactions                        652
buying property for renting out             651
transactions with commercial real estate    650
housing                                     646
purchase of the house                       646
purchase of the house for my family         638
construction of own property                635
property                                    633
transactions with my real estate            627
building a real estate                      624
buy real estate                             621
purchase of my own house                    620
building a property                         619
housing renovation                          607
buy residential real estate                 606
buying my own car                       

Lemmatize the purposes

In [144]:
import nltk
from nltk.stem import WordNetLemmatizer
purpose_lemmatizer = WordNetLemmatizer()
lemmas_token = []
lemmas_after_lem = []
 
for word in data['purpose']:
    lemmas_token.append(nltk.word_tokenize(word))
 
for element in lemmas_token:
    lemmas_after_lem.append([purpose_lemmatizer.lemmatize(w, pos = 'n') for w in element])


The function below assigns each row to a group by looking for keywords in the purpose text.

In [145]:
def grouped_purpose(purpose):
    
    if 'estate' in purpose or 'house' in purpose or 'property' in purpose or 'housing' in purpose:
        return 'estate'
    elif 'car' in purpose:
        return 'car'
    elif 'education' in purpose or 'educated' in purpose or 'university' in purpose:
        return 'education'
    elif 'wedding' in purpose:
        return 'wedding'
    else:
        return 'other'
 
 
data['grouped_purpose'] = data['purpose'].apply(grouped_purpose)
 
data['grouped_purpose'].value_counts()

estate       10811
car           4306
education     4013
wedding       2323
Name: grouped_purpose, dtype: int64
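
Because `grouped_purpose` checks for substrings, a keyword such as 'car' would also match inside longer words. The sketch below is an alternative that compares whole words instead; the `PURPOSE_KEYWORDS` mapping, the `group_by_words` name and the 'healthcare costs' example are hypothetical and do not come from the dataset.

```python
# Keyword sets mirror the conditions in grouped_purpose; the order of the
# groups is the order in which they are checked.
PURPOSE_KEYWORDS = {
    'estate': {'estate', 'house', 'property', 'housing'},
    'car': {'car', 'cars'},
    'education': {'education', 'educated', 'university'},
    'wedding': {'wedding'},
}


def group_by_words(purpose):
    """Return the first group whose keywords share a whole word with purpose."""
    words = set(purpose.lower().split())
    for group, keywords in PURPOSE_KEYWORDS.items():
        if words & keywords:
            return group
    return 'other'


for text in ['buying my own car', 'housing renovation', 'healthcare costs']:
    print(text, '->', group_by_words(text))
```

With substring matching, the made-up purpose 'healthcare costs' would land in the 'car' group; with whole-word matching it falls into 'other'.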

We have a few main questions:

1. Is there a connection between having kids and repaying a loan on time?
2. Is there a connection between marital status and repaying a loan on time?
3. Is there a connection between income level and repaying a loan on time?
4. How do different loan purposes affect timely loan repayment?

To answer the first question I will make a consolidated table where I will group debtors and non-debtors by the number of children.

In [146]:
pivot_data_children = data.pivot_table(index=['children'], columns='debt', values='education', aggfunc='count')
pivot_data_children

children,debt=0,debt=1
0,13027.0,1063.0
1,4410.0,445.0
2,1926.0,202.0
3,303.0,27.0
4,37.0,4.0
5,9.0,


I will add a column with the share of debtors from the total number of people in a group with a certain number of children

In [147]:
pivot_data_children['total'] = pivot_data_children[0] + pivot_data_children[1]
pivot_data_children['debt_ratio'] = pivot_data_children[1] / pivot_data_children['total'] * 100
pivot_data_children

children,debt=0,debt=1,total,debt_ratio
0,13027.0,1063.0,14090.0,7.544358
1,4410.0,445.0,4855.0,9.165808
2,1926.0,202.0,2128.0,9.492481
3,303.0,27.0,330.0,8.181818
4,37.0,4.0,41.0,9.756098
5,9.0,,,
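
In the table above, the row for 5 children has no value for debt = 1, so total and debt_ratio come out as NaN. Since debt only takes the values 0 and 1, the same summary can be computed with `groupby` and plain aggregations, which avoids the empty cells. A minimal sketch on a hypothetical sample (the counts are made up):

```python
import pandas as pd

# Hypothetical sample in which the group with 5 children has no debtors.
sample = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 5, 5],
    'debt':     [0, 0, 0, 1, 0, 1, 0, 0],
})

# count = people in the group, sum = debtors, mean = share of debtors.
summary = sample.groupby('children')['debt'].agg(['count', 'sum', 'mean'])
summary = summary.rename(
    columns={'count': 'total', 'sum': 'debtors', 'mean': 'debt_ratio'}
)
summary['debt_ratio'] = summary['debt_ratio'] * 100
print(summary)
```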


In each group of people, identified by the number of children, the share of debtors in the group was calculated as a percentage. The results show that the percentage of debtors is lower among childless people than among people with children. The group with 5 children has only 9 people and no debtors, so the pivot table has no value for it in the debt column and the ratio is not calculated. To get a more accurate ratio of debtors, I will unite people with children into one group.

In [148]:
def children_total (row):
    
    child = row['children']
        
    if child == 0:
        return 'nochildren'
        
    else:
        return 'children'
    

data['children_total'] = data.apply(children_total, axis=1)
data.head()

Unnamed: 0,children,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,grouped_purpose,children_total
0,1,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house,estate,children
1,1,36,secondary education,1,married,0,F,employee,0,17932,car purchase,car,children
2,0,33,secondary education,1,married,0,M,employee,0,23341,purchase of the house,estate,nochildren
3,3,32,secondary education,1,married,0,M,employee,0,42820,supplementary education,education,children
4,0,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding,wedding,nochildren


In [149]:
pivot_data_children_total = data.pivot_table(index=['children_total'], columns='debt', values='education', aggfunc='count')
pivot_data_children_total['total'] = pivot_data_children_total[0] + pivot_data_children_total[1]
pivot_data_children_total['children_debt_ratio'] = pivot_data_children_total[1] / pivot_data_children_total['total'] * 100
pivot_data_children_total

children_total,debt=0,debt=1,total,children_debt_ratio
children,6685,678,7363,9.208203
nochildren,13027,1063,14090,7.544358


People without children repay loans on time somewhat more often: about 7.5% of them are debtors versus about 9.2% of people with children. Although the difference is less than 2 percentage points, this factor could be taken into account as an additional one when making a decision.
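
To make the size of this gap explicit, it can be expressed both in percentage points and relative to the childless group. The arithmetic below uses only the counts from the summary table above:

```python
# Counts taken from the summary table above.
debtors_children, total_children = 678, 7363
debtors_nochildren, total_nochildren = 1063, 14090

rate_children = debtors_children / total_children * 100
rate_nochildren = debtors_nochildren / total_nochildren * 100

# Absolute gap in percentage points.
point_gap = rate_children - rate_nochildren

# Relative gap: how much higher the share is for people with children.
relative_gap = (rate_children / rate_nochildren - 1) * 100

print(round(point_gap, 2), round(relative_gap, 1))
```

The gap is about 1.7 percentage points, which corresponds to a share of debtors roughly 22% higher among people with children.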

To answer the second question, I will make a summary table by marital status, add a column with the share of debtors relative to the total number of people in the group

In [150]:
pivot_data_family_status = data.pivot_table(index=['family_status'], columns='debt', values='education', aggfunc='count')
pivot_data_family_status['total'] = pivot_data_family_status[0] + pivot_data_family_status[1]
pivot_data_family_status['debt_ratio'] = pivot_data_family_status[1] / pivot_data_family_status['total'] * 100
pivot_data_family_status

family_status,debt=0,debt=1,total,debt_ratio
civil partnership,3762,388,4150,9.349398
divorced,1110,85,1195,7.112971
married,11408,931,12339,7.545182
unmarried,2536,274,2810,9.75089
widow / widower,896,63,959,6.569343


From the table, you can see that unmarried people have the highest share of debtors (about 9.8%), closely followed by people in a civil partnership (about 9.3%). The lowest shares are among widowed (about 6.6%) and divorced (about 7.1%) borrowers.

To answer the third question, let's create a table containing the columns Total_income and Debt

In [151]:
data_total_income = data[['total_income', 'debt']]
data_total_income.head()

Unnamed: 0,total_income,debt
0,40620,0
1,17932,0
2,23341,0
3,42820,0
4,25378,0


I will calculate the average income of debtors and non-debtors by grouping the table by the presence of debt

In [152]:
data_total_income.groupby('debt')['total_income'].mean()

debt
0    26506.856331
1    25784.954049
Name: total_income, dtype: float64

The average income of both groups is about the same, but let's check the median income.

In [153]:
data_total_income.groupby('debt')['total_income'].median()

debt
0    22815
1    22815
Name: total_income, dtype: int64

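The medians are identical and the means differ by less than 3%, so this comparison shows no noticeable link between income level and repaying the loan on time. Note that 22815 is also the employee median that was used to fill the missing values, which may be why the two medians coincide.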
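
Comparing two means or medians can hide differences between income levels. One possible complement, not used in this project, is to split borrowers into income groups of equal size with `pd.qcut` and compare the share of debtors per group. A sketch on a hypothetical sample (the values and the 'low'/'high' labels are made up):

```python
import pandas as pd

# Hypothetical sample of incomes and debt flags.
sample = pd.DataFrame({
    'total_income': [10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000],
    'debt':         [1, 0, 1, 0, 0, 0, 0, 1],
})

# Split into two equal-sized groups at the median income.
sample['income_group'] = pd.qcut(sample['total_income'], q=2, labels=['low', 'high'])

# debt is 0 or 1, so its mean is the share of debtors in each group.
rates = sample.groupby('income_group', observed=True)['debt'].mean() * 100
print(rates)
```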

To answer the fourth question, I will group people by loan purpose and add a column with the share of debtors relative to the total number of people in the group

In [154]:
pivot_data_purpose = data.pivot_table(index=['grouped_purpose'], columns='debt', values='education', aggfunc='count')
pivot_data_purpose['total'] = pivot_data_purpose[0] + pivot_data_purpose[1]
pivot_data_purpose['purpose_ratio'] = pivot_data_purpose[1] / pivot_data_purpose['total'] * 100
pivot_data_purpose.sort_values(by='purpose_ratio')

grouped_purpose,debt=0,debt=1,total,purpose_ratio
estate,10029,782,10811,7.233373
wedding,2137,186,2323,8.006888
education,3643,370,4013,9.220035
car,3903,403,4306,9.359034


Perhaps there is a certain relationship between the purpose of the loan and its repayment: loans taken for real estate transactions or weddings are repaid on time more often than loans taken for a car or education.

We have answered the main questions, but we can also check whether age or education is related to loan repayment. To test the idea about age, I will divide people into groups: 30 and under are young people, 31 to 64 are adults, 65 and over are retirees.

In [155]:
def age_group(row):
    
    age = row['dob_years']
        
    if age <= 30:
        return 'young'
    
    elif 30 < age < 65:
        return 'adult'
        
    else:
        return 'retiree'
    

data['age_group'] = data.apply(age_group, axis=1)
data.head(10)

Unnamed: 0,children,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,grouped_purpose,children_total,age_group
0,1,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house,estate,children,adult
1,1,36,secondary education,1,married,0,F,employee,0,17932,car purchase,car,children,adult
2,0,33,secondary education,1,married,0,M,employee,0,23341,purchase of the house,estate,nochildren,adult
3,3,32,secondary education,1,married,0,M,employee,0,42820,supplementary education,education,children,adult
4,0,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding,wedding,nochildren,adult
5,0,27,bachelor's degree,0,civil partnership,1,M,business,0,40922,purchase of the house,estate,nochildren,young
6,0,43,bachelor's degree,0,married,0,F,business,0,38484,housing transactions,estate,nochildren,adult
7,0,50,secondary education,1,married,0,M,employee,0,21731,education,education,nochildren,adult
8,2,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337,having a wedding,wedding,children,adult
9,0,41,secondary education,1,married,0,M,employee,0,23108,purchase of the house for my family,estate,nochildren,adult
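
The row-wise `age_group` function can also be expressed with `pd.cut`, which assigns the bins for the whole column at once. The sketch below uses a hypothetical list of ages chosen to hit the boundaries; the bin edges reproduce the conditions of the function for whole-number ages.

```python
import pandas as pd

# Hypothetical ages around the group boundaries.
ages = pd.Series([19, 30, 31, 64, 65, 75])

# Intervals are closed on the right: (0, 30], (30, 64], (64, inf).
age_groups = pd.cut(
    ages,
    bins=[0, 30, 64, float('inf')],
    labels=['young', 'adult', 'retiree'],
)
print(age_groups.tolist())
```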


In [156]:
data_dob_years = data[['age_group', 'debt']]
data_dob_years.head()

Unnamed: 0,age_group,debt
0,adult,0
1,adult,0
2,adult,0
3,adult,0
4,adult,0


I will also group the data by age group and calculate the share of debtors.

In [157]:
pivot_data_age = data.pivot_table(index=['age_group'], columns='debt', values='education', aggfunc='count')
pivot_data_age['total'] = pivot_data_age[0] + pivot_data_age[1]
pivot_data_age['age_ratio'] = pivot_data_age[1] / pivot_data_age['total'] * 100
pivot_data_age.sort_values(by='age_ratio')

age_group,debt=0,debt=1,total,age_ratio
retiree,846,49,895,5.47486
adult,15552,1289,16841,7.65394
young,3314,403,3717,10.842077


The share of debtors falls with age: about 10.8% among young people, 7.7% among adults and 5.5% among retirees. The difference is quite noticeable, so the bank can be advised to pay extra attention to this factor.

To be thorough, let's look at the effect of education on loan repayment on time.

In [158]:
pivot_data_education = data.pivot_table(index=['education'], columns='debt', values='age_group', aggfunc='count')
pivot_data_education['total'] = pivot_data_education[0] + pivot_data_education[1]
pivot_data_education['education_ratio'] = pivot_data_education[1] / pivot_data_education['total'] * 100
pivot_data_education.sort_values(by='education_ratio')

education,debt=0,debt=1,total,education_ratio
bachelor's degree,4972.0,278.0,5250.0,5.295238
secondary education,13807.0,1364.0,15171.0,8.990838
some college,676.0,68.0,744.0,9.139785
primary education,251.0,31.0,282.0,10.992908
graduate degree,6.0,,,


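Borrowers with a bachelor's degree have a clearly lower share of debtors (about 5.3%) than those with secondary education (9.0%), some college (9.1%) or primary education (11.0%), although they make up a much smaller part of the sample than borrowers with secondary education. The graduate degree group contains only 6 borrowers and no debtors, so no ratio was calculated for it. This factor can also be considered when deciding on lending.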

### Conclusion

The study of borrowers grouped by various factors made it possible to draw some conclusions about what is and is not associated with repaying a loan on time. The comparison of mean and median total_income showed no noticeable difference between debtors and non-debtors, while the presence of children, marital status and the purpose of the loan show some relationship with repayment. In addition, the age of the borrower and their education are also associated with repaying the loan on time.

### Step3

#### Analysing the data to answer the main questions

- Is there a relation between having kids and repaying a loan on time?

In [159]:
pivot_data_children_total #Let's display the results obtained

children_total,debt=0,debt=1,total,children_debt_ratio
children,6685,678,7363,9.208203
nochildren,13027,1063,14090,7.544358


#### Conclusion

People without children outnumber people with children in the data (14,090 against 7,363), so the estimate for childless borrowers is likely the more precise of the two. Even so, the available data show that people without children repay more reliably: about 7.5% of them defaulted, against 9.2% of those with children. Although the difference is less than 2 percentage points, this factor can be taken into account when there are doubts about the borrower's reliability.
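
As a rough check on whether a gap this small could be sampling noise, the counts from the table above can be put through a two-proportion z-test. This is only a sketch and not part of the original analysis: the helper name `two_proportion_z` is hypothetical, and the test assumes the borrowers are independent observations.

```python
from math import sqrt, erfc

def two_proportion_z(defaults_a, total_a, defaults_b, total_b):
    """Return (difference in default share, z statistic, two-sided p-value)."""
    p_a = defaults_a / total_a
    p_b = defaults_b / total_b
    # Pooled share under the hypothesis that both groups default equally often.
    pooled = (defaults_a + defaults_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return p_a - p_b, z, p_value

# Counts taken from the table above: with children vs. without children.
diff, z, p_value = two_proportion_z(678, 7363, 1063, 14090)
print(f"difference: {diff * 100:.2f} percentage points, z = {z:.2f}, p = {p_value:.2g}")
```

A small p-value here would only say that the gap is unlikely to be chance; it does not say that having children causes late repayment.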

- Is there a relation between marital status and repaying a loan on time?

In [160]:
pivot_data_family_status #Let's display the results obtained earlier.

family_status,debt=0,debt=1,total,debt_ratio
civil partnership,3762,388,4150,9.349398
divorced,1110,85,1195,7.112971
married,11408,931,12339,7.545182
unmarried,2536,274,2810,9.75089
widow / widower,896,63,959,6.569343


#### Conclusion

There is a clear difference between borrowers who have officially registered a marriage and those who have not: about 7.5% of married borrowers defaulted, against 9.3% of those in civil partnerships and 9.8% of the unmarried. Perhaps this reflects a higher degree of responsibility in general. Since the lowest shares of debtors are among widowed (6.6%) and divorced (7.1%) people, I would not conclude that having a spouse helps with the repayments. Either way, I would advise taking this criterion into account.

- Is there a relation between income level and repaying a loan on time?

In [161]:
data_total_income.groupby('debt')['total_income'].mean()

debt
0    26506.856331
1    25784.954049
Name: total_income, dtype: float64

In [162]:
data_total_income.groupby('debt')['total_income'].median()

debt
0    22815
1    22815
Name: total_income, dtype: int64

#### Conclusion

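I was unable to identify a relationship between income level and loan repayment: the median income is identical in both groups (22,815), and the mean differs by only about 720 (26,507 for those who repaid on time against 25,785 for debtors). Income level alone is therefore not worth taking into account when deciding whether to issue a loan, but I suppose it is worth tying the loan amount to the income level.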
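
Comparing group means and medians is a coarse check. One complementary view, sketched here under assumptions, is to split borrowers into income quartiles and compare the share of debtors in each. The rows below are toy values invented for illustration, and the names `income_group` and `rate_by_income` are hypothetical; only `total_income` and `debt` come from the notebook.

```python
import pandas as pd

# Toy rows invented for illustration; they are not taken from credit.csv.
toy = pd.DataFrame({
    'total_income': [12000, 15000, 18000, 21000, 24000, 27000, 30000, 36000],
    'debt':         [0, 1, 0, 0, 1, 0, 0, 0],
})

# Split borrowers into four equally sized income groups (quartiles).
toy['income_group'] = pd.qcut(toy['total_income'], q=4,
                              labels=['low', 'lower-mid', 'upper-mid', 'high'])

# debt is 0/1, so its mean within a group is the share of debtors.
rate_by_income = toy.groupby('income_group', observed=False)['debt'].mean() * 100
print(rate_by_income)
```

If the shares were close across all four groups on the real data, that would support the conclusion above more directly than the comparison of averages.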


- How do different loan purposes affect on-time repayment of the loan?

In [163]:
pivot_data_purpose.sort_values(by='purpose_ratio') #Earlier results

grouped_purpose,debt=0,debt=1,total,purpose_ratio
estate,10029,782,10811,7.233373
wedding,2137,186,2323,8.006888
education,3643,370,4013,9.220035
car,3903,403,4306,9.359034


#### Conclusion

The share of debtors differs noticeably by loan purpose, so this factor should be taken into account in scoring. A purchased car can be lost (accident, theft), which can reduce the client's willingness to repay the loan, so it is logical that this category is the riskiest (about 9.4%). Education, especially when started in adulthood, may never be completed, and it is harder to repay a loan for a goal that was not achieved. Loans for the repair or purchase of real estate are the least risky (about 7.2%). Perhaps this is a less impulsive step, and the result of the investment is in front of the borrower every day, which is an additional incentive to repay on time.

<a id="oc"></a>

### Overall conclusion

We managed to identify relationships between certain factors and the likelihood of on-time loan repayment.

- People without children are more disciplined in repaying loans. Although the difference is less than 2 percentage points, this factor can be considered when there are doubts about the borrower's reliability.
- There is a clear difference between those who officially registered their marriage and those who did not. Perhaps this reflects a higher degree of responsibility in general. Since the lowest shares of debtors are among widowed and divorced people, I would not conclude that having a spouse helps with loan repayment. Either way, I would advise taking this criterion into account.
- The purpose of the loan is also related to on-time repayment: loans for buying a car or for education carry more risk than loans for a wedding or for real estate.
- In addition, the borrower's age and education level are related to on-time repayment. Young people under 30 are the most at risk, while retirees and people with higher education are the most conscientious payers.
- At the same time, no relationship between income level and loan repayment was found, so this factor can be ignored when deciding whether to lend. I would, however, recommend linking the loan amount to the income level.