## Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children has an impact on whether they will default on a loan. The bank already has some data on customers’ credit worthiness.

Your report will be considered when building a **credit scoring** of a potential customer. A ** credit scoring ** is used to evaluate the ability of a potential borrower to repay their loan.

### Step 1. Open the data file and have a look at the general information. 

In [1]:
import pandas as pd
from nltk.stem import SnowballStemmer

In [2]:
data_loan = pd.read_csv('credit_scoring_eng.csv')
data_loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [3]:
data_loan['children'].value_counts() # determine how many missing values there are

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

In [4]:
data_loan['days_employed'].value_counts().head(30)

-986.927316       1
-7026.359174      1
-4236.274243      1
-6620.396473      1
-1238.560080      1
-3047.519891      1
-7373.150635      1
-1048.380461      1
-4906.125062      1
-1893.222792      1
-849.764227       1
-1741.489608      1
-5135.928528      1
-1453.358707      1
-4977.646061      1
 396078.542064    1
-1399.361282      1
-1645.463049      1
 386155.404320    1
-2793.736218      1
-1171.322169      1
-158.140270       1
-978.830019       1
-941.943913       1
-5208.815041      1
-951.868467       1
-501.078443       1
-725.971741       1
-2308.532389      1
-666.570650       1
Name: days_employed, dtype: int64

In [5]:
data_loan['dob_years'].value_counts()

35    617
40    609
41    607
34    603
38    598
42    597
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
43    513
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
66    183
22    183
67    167
21    111
0     101
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64

In [6]:
data_loan['education'].value_counts()

secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
Graduate Degree            1
GRADUATE DEGREE            1
Name: education, dtype: int64

In [7]:
data_loan['education_id'].value_counts()

1    15233
0     5260
2      744
3      282
4        6
Name: education_id, dtype: int64

In [8]:
data_loan['family_status'].value_counts()

married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64

In [9]:
data_loan['family_status_id'].value_counts()

0    12380
1     4177
4     2813
3     1195
2      960
Name: family_status_id, dtype: int64

In [10]:
data_loan['gender'].value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

In [11]:
data_loan.loc[data_loan['gender'] == 'XNA'] # row containing missing gender

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
10701,0,-2358.600502,24,some college,2,civil partnership,1,XNA,business,0,32624.825,buy real estate


In [12]:
data_loan['income_type'].value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

In [13]:
data_loan['debt'].value_counts()

0    19784
1     1741
Name: debt, dtype: int64

In [14]:
data_loan['total_income'].value_counts()

17312.717    2
31791.384    2
42413.096    2
54857.666    1
26935.722    1
            ..
48796.341    1
34774.610    1
15710.698    1
19232.334    1
9591.824     1
Name: total_income, Length: 19348, dtype: int64

In [15]:
data_loan['purpose'].value_counts()

wedding ceremony                            797
having a wedding                            777
to have a wedding                           774
real estate transactions                    676
buy commercial real estate                  664
buying property for renting out             653
housing transactions                        653
transactions with commercial real estate    651
housing                                     647
purchase of the house                       647
purchase of the house for my family         641
construction of own property                635
property                                    634
transactions with my real estate            630
building a real estate                      626
buy real estate                             624
building a property                         620
purchase of my own house                    620
housing renovation                          612
buy residential real estate                 607
buying my own car                       

In [16]:
data_loan[data_loan['children'].isnull()] #take a look at the rows to check for missing values in a column

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


In [17]:
data_loan[data_loan['family_status'].isnull()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


In [18]:
data_loan[data_loan['purpose'].isnull()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


In [19]:
data_loan[data_loan['total_income'].isnull()].count() # Check how significant data absence is

children            2174
days_employed          0
dob_years           2174
education           2174
education_id        2174
family_status       2174
family_status_id    2174
gender              2174
income_type         2174
debt                2174
total_income           0
purpose             2174
dtype: int64

### Conclusion

Column 'children' contains 47 negative values and 76 - not so realistic values, namely, 20 children. Int64 is not proper format for this column, int8 would be pretty enough as the number of children a family can have is considerably less than even int8 can address.
Column 'days_employed' seems totally corrupted (probably there was wrong data source for that particular column). Besides predominant negative values that do not make sense it contains too big positive values that even exceed human life span. About 10% of data are missing. 
Column ‘dob_years’ contains 101 zero values. Proper format would be int8.
Column ‘education’ contains duplicate data with case sensitivity. 
Column ‘education_id’ format to be changed to int8.
Column ‘family_status_id’ format to be changed to int8.
Column ‘gender’ contains meaningless ‘XNA’ value.
Column ‘total_income’ contains about 10% of missing data.
Column ‘purpose’ contains a lot of duplicates by meaning, to be stemmed and categorized.

### Step 2. Data preprocessing

### Processing missing values

In [20]:
# looks like totally meaningful row
data_loan.loc[data_loan['gender'] == 'XNA', 'gender'] = 'F' 

In [21]:
dob_mean = int(data_loan['dob_years'].sum()/(21525-101)) # mean of age excluding '0' values
data_loan.loc[data_loan['dob_years'] == 0, 'dob_years'] = dob_mean # filling '0' with mean

In [22]:
#print(data_loan.groupby('income_type')['total_income'].mean())
print(data_loan.groupby('income_type')['total_income'].median())
print(data_loan.loc[data_loan.groupby('income_type')['total_income'].idxmax()]) 
# - there are outliers, that's why we should use median

income_type
business                       27577.2720
civil servant                  24071.6695
employee                       22815.1035
entrepreneur                   79866.1030
paternity / maternity leave     8612.6610
retiree                        18962.3180
student                        15712.2600
unemployed                     21014.3605
Name: total_income, dtype: float64
       children  days_employed  dob_years            education  education_id  \
12412         0   -1477.438114         44    bachelor's degree             0   
3097          1   -7814.429301         49    bachelor's degree             0   
9169          1   -5248.554336         35  secondary education             1   
18697         0    -520.848083         27    bachelor's degree             0   
20845         2   -3296.759962         39  SECONDARY EDUCATION             1   
15660         0  342783.405779         64    bachelor's degree             0   
9410          0    -578.751554         22    bachelor's d

In [23]:
data_loan['median_total_income'] = data_loan.groupby(
    ['income_type','gender','dob_years'])['total_income'].transform('median') 
# - adding a new column: grouped by income type, gender and age - total_income medians
data_loan.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,median_total_income
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,21437.923
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,20064.0295
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house,28264.478
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,26266.115
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,17872.245


In [24]:
#print(data_loan[data_loan['median_total_income'].isnull()])
data_loan['median_total_income'] = data_loan['median_total_income'].fillna(value=0) # filling NaN cells with value 0
mean_total_income = data_loan['total_income'].mean()
#print(mean_total_income)
data_loan.loc[data_loan['median_total_income'] == 0, 'median_total_income'] = mean_total_income 
# - filling with simple mean, 
# as this 5 rows are outliers in their groups 

data_loan.sort_values(by = ['median_total_income']).tail(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,median_total_income
10931,0,-3046.540954,60,secondary education,1,widow / widower,2,M,civil servant,0,42925.538,buy commercial real estate,42925.538
5842,2,397548.767244,34,secondary education,1,married,0,F,retiree,0,47686.626,transactions with commercial real estate,47686.626
2207,2,-5686.183889,45,secondary education,1,married,0,M,civil servant,0,75587.609,to become educated,47961.5505
17127,0,-1925.75648,45,bachelor's degree,0,civil partnership,1,M,civil servant,0,16814.868,to have a wedding,47961.5505
13365,0,-5109.799174,45,bachelor's degree,0,civil partnership,1,M,civil servant,0,39801.327,wedding ceremony,47961.5505
19253,0,-2933.126391,45,secondary education,1,married,0,M,civil servant,0,56121.774,university education,47961.5505
8514,0,-5668.86228,69,secondary education,1,divorced,3,F,civil servant,0,50046.414,purchase of a car,50046.414
7682,0,-6025.506521,70,bachelor's degree,0,widow / widower,2,F,civil servant,0,57508.032,real estate transactions,57508.032
18697,0,-520.848083,27,bachelor's degree,0,civil partnership,1,F,entrepreneur,0,79866.103,having a wedding,79866.103
11563,0,-1025.402943,64,bachelor's degree,0,married,0,M,civil servant,0,113024.236,profile education,113024.236


In [25]:
data_loan['total_income'] = data_loan['total_income'].fillna(value=0) # filled NaN cells with value 0
data_loan.loc[data_loan['total_income'] == 0, 'total_income'] = data_loan['median_total_income'] 
# - filled missing total_income with respective median
#print(data_loan.head(50))

data_loan.drop('median_total_income', axis='columns', inplace=True)
#print(data_loan.head(50))
data_loan.sort_values(by = ['total_income']).tail(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
11071,1,-1851.200013,36,bachelor's degree,0,civil partnership,1,F,employee,0,205804.96,buy commercial real estate
15268,1,-10207.448165,64,bachelor's degree,0,divorced,3,M,business,0,216039.297,housing
18353,1,-3173.282035,41,bachelor's degree,0,unmarried,4,F,business,0,228469.514,car
18368,1,-333.935516,41,BACHELOR'S DEGREE,0,civil partnership,1,M,business,0,248184.463,wedding ceremony
17503,0,-2285.476482,43,secondary education,1,married,0,M,business,0,255618.158,real estate transactions
17178,0,-5734.127087,42,bachelor's degree,0,civil partnership,1,M,business,0,273809.483,to have a wedding
20809,0,-4719.273476,61,secondary education,1,unmarried,4,F,employee,0,274402.943,purchase of the house for my family
9169,1,-5248.554336,35,secondary education,1,civil partnership,1,M,employee,0,276204.162,supplementary education
19606,1,-2577.664662,39,bachelor's degree,0,married,0,M,business,1,352136.354,building a property
12412,0,-1477.438114,44,bachelor's degree,0,married,0,M,business,0,362496.645,housing renovation


In [26]:
data_loan.loc[data_loan['children'] == 20, 'children'] = 2 # replacing a nonrealistic value, probably typo, with 2
data_loan.loc[data_loan['children'] == -1, 'children'] = 1 # replacing a nonrealistic value, probably typo, with 1
#print(data_loan['children'].value_counts())

In [27]:
data_loan.drop('days_employed', axis='columns', inplace=True) 
# - colimn is totally corrupted, so useless; and not important for our task
#print(data_loan)
data_loan.sort_values(by = ['total_income']).tail(10)

Unnamed: 0,children,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
11071,1,36,bachelor's degree,0,civil partnership,1,F,employee,0,205804.96,buy commercial real estate
15268,1,64,bachelor's degree,0,divorced,3,M,business,0,216039.297,housing
18353,1,41,bachelor's degree,0,unmarried,4,F,business,0,228469.514,car
18368,1,41,BACHELOR'S DEGREE,0,civil partnership,1,M,business,0,248184.463,wedding ceremony
17503,0,43,secondary education,1,married,0,M,business,0,255618.158,real estate transactions
17178,0,42,bachelor's degree,0,civil partnership,1,M,business,0,273809.483,to have a wedding
20809,0,61,secondary education,1,unmarried,4,F,employee,0,274402.943,purchase of the house for my family
9169,1,35,secondary education,1,civil partnership,1,M,employee,0,276204.162,supplementary education
19606,1,39,bachelor's degree,0,married,0,M,business,1,352136.354,building a property
12412,0,44,bachelor's degree,0,married,0,M,business,0,362496.645,housing renovation


In [28]:
print(data_loan.info())
data_loan.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   dob_years         21525 non-null  int64  
 2   education         21525 non-null  object 
 3   education_id      21525 non-null  int64  
 4   family_status     21525 non-null  object 
 5   family_status_id  21525 non-null  int64  
 6   gender            21525 non-null  object 
 7   income_type       21525 non-null  object 
 8   debt              21525 non-null  int64  
 9   total_income      21525 non-null  float64
 10  purpose           21525 non-null  object 
dtypes: float64(1), int64(5), object(5)
memory usage: 1.8+ MB
None


(21525, 11)

### Conclusion

We replaced one missing value 'XNA' in column 'gender' with 'F', as both 'M' and 'F' represented by big numbers, so adding 1 to any of them will not change the distribution. Beside this, investigation of debt-gender relation is beyond the scope of this project. There is no reason to drop this row; it looks reliable all through other columns.

Column ‘dob_years’ – filling ‘0’ values with mean of non-zero values.

In column ‘total_income’ we filled 2174 missing values. For this we generated a new column of medians, calculated by grouping by columns ‘gender’, ‘income_type’ and ‘dob_years’. This grouping by three columns gives us more precise approach, as we noticed that medians are vary significantly by gender, income source and age of customers. We decided to use medians, but not means, as we found outliers in each group (e.g.362 496. 645, while mean is about 20K+).

Some of ‘income_type’ groups turned out to be unrepresentative – student (only 1 row), entrepreneur (2), unemployed (2), paternity/maternity leave (1). For them we filled ‘total_income’ column with simply mean of all total incomes. They are only 5 rows, unable to affect on general picture, and there is no reason to drop.

In column ‘children’ we replaced non-realistic values ‘20’ and ‘-1’ with 2 and 1 respectively, as it could be considered as typo errors.

We decided to drop ‘days_employed’ column, as it’s totally corrupted and not important for our investigation.


In [29]:
data_loan['total_income'].describe()

count     21525.000000
mean      26440.851617
std       15716.529296
min        3306.762000
25%       17117.387000
50%       23061.267000
75%       31535.534000
max      362496.645000
Name: total_income, dtype: float64

### Data type replacement

In [30]:
data_loan['children'] = data_loan['children'].astype('int8')
data_loan['dob_years'] = data_loan['dob_years'].astype('int8')
data_loan['education_id'] = data_loan['education_id'].astype('int8')
data_loan['family_status_id'] = data_loan['family_status_id'].astype('int8')
data_loan['debt'] = data_loan['debt'].astype('int8')
data_loan['total_income'] = data_loan['total_income'].astype('float32')

print (data_loan['education'].apply(type))
print (data_loan['family_status'].apply(type))
print (data_loan['gender'].apply(type))
print (data_loan['income_type'].apply(type))
print (data_loan['purpose'].apply(type))
data_loan['education'] = data_loan['education'].astype('str')
data_loan['family_status'] = data_loan['family_status'].astype('str')
data_loan['gender'] = data_loan['gender'].astype('str')
data_loan['income_type'] = data_loan['income_type'].astype('str')
data_loan['purpose'] = data_loan['purpose'].astype('str')
print (data_loan['education'].apply(type))
print (data_loan['family_status'].apply(type))
print (data_loan['gender'].apply(type))
print (data_loan['income_type'].apply(type))
print (data_loan['purpose'].apply(type))

0        <class 'str'>
1        <class 'str'>
2        <class 'str'>
3        <class 'str'>
4        <class 'str'>
             ...      
21520    <class 'str'>
21521    <class 'str'>
21522    <class 'str'>
21523    <class 'str'>
21524    <class 'str'>
Name: education, Length: 21525, dtype: object
0        <class 'str'>
1        <class 'str'>
2        <class 'str'>
3        <class 'str'>
4        <class 'str'>
             ...      
21520    <class 'str'>
21521    <class 'str'>
21522    <class 'str'>
21523    <class 'str'>
21524    <class 'str'>
Name: family_status, Length: 21525, dtype: object
0        <class 'str'>
1        <class 'str'>
2        <class 'str'>
3        <class 'str'>
4        <class 'str'>
             ...      
21520    <class 'str'>
21521    <class 'str'>
21522    <class 'str'>
21523    <class 'str'>
21524    <class 'str'>
Name: gender, Length: 21525, dtype: object
0        <class 'str'>
1        <class 'str'>
2        <class 'str'>
3        <class 'str'>
4        <

 ### Conclusion

5 columns were converted from  int64 to int8 to save space. Int8 would be more than enough for our data. One column was converted from float64 to float32. 5 ‘object’ columns were converted to ‘str’. However,it appears, they were ‘str’ from the very beginning.

### Processing duplicates

In [31]:
print(data_loan.duplicated().sum())
data_loan['education'] = data_loan['education'].str.lower() # all - to lower case (removing duplicates by case)
print(data_loan['education'].value_counts())
print(data_loan.duplicated().sum())
duplicates_income = data_loan[data_loan.duplicated(keep=False)]
duplicates_income.sort_values(by=['total_income']).head(60)

56
secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64
74


Unnamed: 0,children,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
18380,0,67,secondary education,1,civil partnership,1,F,employee,0,16690.019531,to have a wedding
7048,0,67,secondary education,1,civil partnership,1,F,employee,0,16690.019531,to have a wedding
5865,0,66,secondary education,1,widow / widower,2,F,retiree,0,16968.695312,transactions with my real estate
9528,0,66,secondary education,1,widow / widower,2,F,retiree,0,16968.695312,transactions with my real estate
19321,0,23,secondary education,1,unmarried,4,F,employee,0,17189.03125,second-hand car purchase
8853,1,23,secondary education,1,civil partnership,1,F,employee,0,17189.03125,to have a wedding
15892,0,23,secondary education,1,unmarried,4,F,employee,0,17189.03125,second-hand car purchase
20297,1,23,secondary education,1,civil partnership,1,F,employee,0,17189.03125,to have a wedding
1191,0,61,secondary education,1,married,0,F,retiree,0,17746.972656,real estate transactions
19688,0,61,secondary education,1,married,0,F,retiree,0,17746.972656,real estate transactions


### Conclusion

Duplicates with case sensitivity (result of caps lock - on) in column ‘education’ were processed with lower() method. In real life - they are not duplicates, they are different borrowers to the bank. As we do not have any customer ID, or Name and Surname or/and birthday in dataset provided, we cannot stipulate they are the same people. There are millions of people of same age, gender, type of income, education level and marital status who want a loan to buy a property. As for 'total_income' column, we remember, that we filled it with many same values by our discretion. So the conclusion is: we must keep all this rows until personal ID data will be available and will confirm (or ruin) the idea of duplicates.

In [32]:
data_loan.duplicated().sum()
#print([data_loan.duplicated(keep=False)].head(21))
duplicateDFRow = data_loan[data_loan.duplicated()]
data_loan.sort_values(by = ['total_income'])

Unnamed: 0,children,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
14585,0,57,secondary education,1,married,0,F,retiree,1,3306.761963,property
13006,0,37,secondary education,1,civil partnership,1,M,retiree,0,3392.844971,going to university
16174,1,52,secondary education,1,married,0,M,employee,0,3418.823975,car purchase
1598,0,68,secondary education,1,civil partnership,1,M,retiree,0,3471.216064,having a wedding
14276,0,61,secondary education,1,married,0,F,retiree,0,3503.298096,property
...,...,...,...,...,...,...,...,...,...,...,...
17178,0,42,bachelor's degree,0,civil partnership,1,M,business,0,273809.468750,to have a wedding
20809,0,61,secondary education,1,unmarried,4,F,employee,0,274402.937500,purchase of the house for my family
9169,1,35,secondary education,1,civil partnership,1,M,employee,0,276204.156250,supplementary education
19606,1,39,bachelor's degree,0,married,0,M,business,1,352136.343750,building a property


### Categorizing Data

In [33]:
english_stemmer = SnowballStemmer('english')

words = ["estate", "property", "house", "wedding", "car", "university", "education", "educated"] 
# comprehensive list of words for stemmimg 

for word in words:
    print('Source word - {}, after stemming - {}'.format(word, english_stemmer.stem(word)))

Source word - estate, after stemming - estat
Source word - property, after stemming - properti
Source word - house, after stemming - hous
Source word - wedding, after stemming - wed
Source word - car, after stemming - car
Source word - university, after stemming - univers
Source word - education, after stemming - educ
Source word - educated, after stemming - educ


In [34]:
purpose = data_loan['purpose']
x=-1

for query in purpose: # categorizing loan purpose 
    x+=1 # setting index
    for word in query.split(' '):
        stemmed_word = english_stemmer.stem(word)
        if stemmed_word == 'estat' or stemmed_word=='properti' or stemmed_word=='hous':
            data_loan.at[x,'purpose']='real estate' # setting new value in data_loan cell
        elif stemmed_word == 'educ' or stemmed_word=='univers':
            data_loan.at[x,'purpose']= 'education'  # setting new value in data_loan cell
        elif stemmed_word == 'car':
            data_loan.at[x,'purpose']= 'car' # setting new value in data_loan cell
        elif stemmed_word == 'wed':
            data_loan.at[x,'purpose']= 'wedding' # setting new value in data_loan cell
        else:
            a=0 # doing nothing with data_loan (i'm not sure what to be placed here to look proper, except for "print")
print(data_loan['purpose'].value_counts())

real estate    10840
car             4315
education       4022
wedding         2348
Name: purpose, dtype: int64


In [35]:
total_income = data_loan['total_income']
data_loan['total_income_levels'] = ''
x=-1
y=0

for quer in total_income: # categorizing income level
    x+=1
    if quer < 10000.0:
        data_loan.at[x,'total_income_levels']= '<10K'
    elif 10000.0 <= quer < 15000.0:
        data_loan.at[x,'total_income_levels']= '10-15K'
    elif 15000.0 <= quer < 20000.0:
        data_loan.at[x,'total_income_levels']= '15-20K'
    elif 20000.0 <= quer < 25000.0:
        data_loan.at[x,'total_income_levels']= '20-25K'
    elif 25000.0 <= quer < 30000.0:
        data_loan.at[x,'total_income_levels']= '25-30K'
    elif 30000.0 <= quer < 35000.0:
        data_loan.at[x,'total_income_levels']= '30-35K'
    elif 35000.0 <= quer < 40000.0:
        data_loan.at[x,'total_income_levels']= '35-40K'
    elif 40000.0 <= quer < 45000.0:
        data_loan.at[x,'total_income_levels']= '40-45K'
    elif 45000.0 <= quer < 50000.0:
        data_loan.at[x,'total_income_levels']= '45-50K'
    elif 50000.0 <= quer < 60000.0:
        data_loan.at[x,'total_income_levels']= '50-60K'
    else:
        data_loan.at[x,'total_income_levels']= '>60K'
        
data_loan['total_income_levels'].value_counts()

20-25K    4389
15-20K    4054
25-30K    3237
10-15K    2819
30-35K    2004
35-40K    1283
40-45K     954
<10K       926
>60K       672
50-60K     648
45-50K     539
Name: total_income_levels, dtype: int64

### Conclusion

We categorized column ‘purpose’ .  First, we made a comprehensive list of words for stemming, then assign every value of column ‘purpose’ to one of 4 categories: ‘car’, ‘education’, ‘real estate’, ‘wedding’.

For column ‘total_income’ we made 11 categories corresponding to different levels of income. However, it could be less detailed, as further investigation’s result shows.

### Step 3. Answer these questions

- Is there a relation between having kids and repaying a loan on time?

In [37]:
data_no_children = data_loan.loc[data_loan['children'] == 0]['children'].count() # no kids
data_children = data_loan.loc[data_loan['children'] != 0]['children'].count() # with kids

no_children_debt = data_loan.loc[data_loan['children'] == 0]['debt'].sum() # no kids, with debt
children_debt = data_loan.loc[data_loan['children'] != 0]['debt'].sum() # with kids, with debt

print('Debts to all loans:')
print('No kids - {:.1%}'.format(no_children_debt/data_no_children))
print('With kids - {:.1%}'.format(children_debt/data_children))


data_loan_children_income_type_pivot = data_loan.pivot_table(index='income_type', columns='children', values='debt', aggfunc='sum', fill_value=0)
data_loan_children_income_type_pivot

Debts to all loans:
No kids - 7.5%
With kids - 9.2%


children,0,1,2,3,4,5
income_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
business,226,106,39,5,0,0
civil servant,59,19,6,2,0,0
employee,580,305,153,19,4,0
entrepreneur,0,0,0,0,0,0
paternity / maternity leave,0,0,1,0,0,0
retiree,198,14,3,1,0,0
student,0,0,0,0,0,0
unemployed,0,1,0,0,0,0


### Conclusion

Obtained results illustrate relation between having kids and repaying a loan on time – percentage of debt is higher in customers with kids.

In [38]:
data_loan['debt'].mean()*100

8.088269454123113

In [39]:
data_loan.groupby('children').agg({'debt': ['count', 'sum', 'mean']})

Unnamed: 0_level_0,debt,debt,debt
Unnamed: 0_level_1,count,sum,mean
children,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
0,14149,1063.0,0.075129
1,4865,445.0,0.09147
2,2131,202.0,0.094791
3,330,27.0,0.081818
4,41,4.0,0.097561
5,9,0.0,0.0


In [40]:
data_loan.query("children != 0").agg({'debt': ['count', 'sum', 'mean']})

Unnamed: 0,debt
count,7376.0
sum,678.0
mean,0.09192


In [41]:
data_loan.query("children >= 3").agg({'debt': ['count', 'sum', 'mean']})

Unnamed: 0,debt
count,380.0
sum,31.0
mean,0.081579


- Is there a relation between marital status and repaying a loan on time?

In [42]:
data_family_status = data_loan.groupby(['family_status'])['debt'].sum()

data_family_status_total = data_loan.groupby(['family_status'])['family_status'].count()
family_status_debt_percent = round(data_family_status/data_family_status_total*100,1)

print('Debts to all loans by (%)')
print(family_status_debt_percent)

Debts to all loans by (%)
family_status
civil partnership    9.3
divorced             7.1
married              7.5
unmarried            9.7
widow / widower      6.6
dtype: float64


### Conclusion

There is a relation between marital status and repaying a loan on time. The less reliable category is unmarried - 9.7% of debt, slightly better is civil partnership - 9.3%. Married looks much better - 7.5%, and the most reliable category turned out widow/widower - 6.6%.

- Is there a relation between income level and repaying a loan on time?

In [43]:
# how many customers have this level of total income
data_income_levels_1 = data_loan.loc[data_loan['total_income_levels'] == '<10K']['total_income'].count() 
data_income_levels_2 = data_loan.loc[data_loan['total_income_levels'] == '10-15K']['total_income'].count() 
data_income_levels_3 = data_loan.loc[data_loan['total_income_levels'] == '15-20K']['total_income'].count()
data_income_levels_4 = data_loan.loc[data_loan['total_income_levels'] == '20-25K']['total_income'].count()
data_income_levels_5 = data_loan.loc[data_loan['total_income_levels'] == '25-30K']['total_income'].count()
data_income_levels_6 = data_loan.loc[data_loan['total_income_levels'] == '30-35K']['total_income'].count()
data_income_levels_7 = data_loan.loc[data_loan['total_income_levels'] == '35-40K']['total_income'].count()
data_income_levels_8 = data_loan.loc[data_loan['total_income_levels'] == '40-45K']['total_income'].count()
data_income_levels_9 = data_loan.loc[data_loan['total_income_levels'] == '45-50K']['total_income'].count()
data_income_levels_10 = data_loan.loc[data_loan['total_income_levels'] == '50-60K']['total_income'].count()
data_income_levels_11 = data_loan.loc[data_loan['total_income_levels'] == '>60K']['total_income'].count()


# how many customers have this level of total income with debt
income_levels_debt_1 = data_loan.loc[data_loan['total_income_levels'] == '<10K']['debt'].sum()
income_levels_debt_2 = data_loan.loc[data_loan['total_income_levels'] == '10-15K']['debt'].sum()
income_levels_debt_3 = data_loan.loc[data_loan['total_income_levels'] == '15-20K']['debt'].sum()
income_levels_debt_4 = data_loan.loc[data_loan['total_income_levels'] == '20-25K']['debt'].sum()
income_levels_debt_5 = data_loan.loc[data_loan['total_income_levels'] == '25-30K']['debt'].sum()
income_levels_debt_6 = data_loan.loc[data_loan['total_income_levels'] == '30-35K']['debt'].sum()
income_levels_debt_7 = data_loan.loc[data_loan['total_income_levels'] == '35-40K']['debt'].sum()
income_levels_debt_8 = data_loan.loc[data_loan['total_income_levels'] == '40-45K']['debt'].sum()
income_levels_debt_9 = data_loan.loc[data_loan['total_income_levels'] == '45-50K']['debt'].sum()
income_levels_debt_10 = data_loan.loc[data_loan['total_income_levels'] == '50-60K']['debt'].sum()
income_levels_debt_11 = data_loan.loc[data_loan['total_income_levels'] == '>60K']['debt'].sum()

print('Relation between income level and repaying a loan on time')
print('Debts % to all loans:')

print('<10K - {:.1%}'.format(income_levels_debt_1/data_income_levels_1))
print('10-15K - {:.1%}'.format(income_levels_debt_2/data_income_levels_2))
print('15-20K - {:.1%}'.format(income_levels_debt_3/data_income_levels_3))
print('20-25K - {:.1%}'.format(income_levels_debt_4/data_income_levels_4))
print('25-30K - {:.1%}'.format(income_levels_debt_5/data_income_levels_5))
print('30-35K - {:.1%}'.format(income_levels_debt_6/data_income_levels_6))
print('35-40K - {:.1%}'.format(income_levels_debt_7/data_income_levels_7))
print('40-45K - {:.1%}'.format(income_levels_debt_8/data_income_levels_8))
print('45-50K - {:.1%}'.format(income_levels_debt_9/data_income_levels_9))
print('50-60K - {:.1%}'.format(income_levels_debt_10/data_income_levels_10))
print('>60K - {:.1%}'.format(income_levels_debt_11/data_income_levels_11))

Relation between income level and repaying a loan on time
Debts % to all loans:
<10K - 6.3%
10-15K - 8.5%
15-20K - 8.4%
20-25K - 8.3%
25-30K - 9.0%
30-35K - 7.9%
35-40K - 7.6%
40-45K - 6.5%
45-50K - 7.4%
50-60K - 8.3%
>60K - 5.7%


In [44]:
data_loan_income_level_pivot = data_loan.pivot_table(
    index='income_type', columns='total_income_levels', values='debt', aggfunc='sum', fill_value=0)
data_loan_income_level_pivot

total_income_levels,10-15K,15-20K,20-25K,25-30K,30-35K,35-40K,40-45K,45-50K,50-60K,<10K,>60K
income_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
business,36,58,62,76,49,25,18,14,18,3,17
civil servant,18,15,14,11,12,5,6,1,1,2,1
employee,150,214,241,181,85,55,30,23,32,32,18
entrepreneur,0,0,0,0,0,0,0,0,0,0,0
paternity / maternity leave,0,0,0,0,0,0,0,0,0,1,0
retiree,36,53,46,23,12,12,8,2,3,19,2
student,0,0,0,0,0,0,0,0,0,0,0
unemployed,0,0,0,0,0,0,0,0,0,1,0


### Conclusion

10K to 25K of total income gives us percentage of debts 8.5-8.3%. BTW, it’s also covers about half of customers. 25K to 30K shows peak in debts – 9.0%.  Minimum of debts are on categories - under 10K and above 60K.

- How do different loan purposes affect on-time repayment of the loan?

In [45]:
data_car = data_loan.loc[data_loan['purpose'] == 'car']['purpose'].count() # how many customers want car
data_education = data_loan.loc[data_loan['purpose'] == 'education']['purpose'].count() # how many customers want education
data_realestate = data_loan.loc[data_loan['purpose'] == 'real estate']['purpose'].count() # how many customers want real estate
data_wedding = data_loan.loc[data_loan['purpose'] == 'wedding']['purpose'].count() # how many customers want wedding

car_debt = data_loan.loc[data_loan['purpose'] == 'car']['debt'].sum() # car, with debt
education_debt = data_loan.loc[data_loan['purpose'] != 'education']['debt'].sum() # education, with debt
realestate_debt = data_loan.loc[data_loan['purpose'] == 'real estate']['debt'].sum() # real estate, with debt
wedding_debt = data_loan.loc[data_loan['purpose'] == 'wedding']['debt'].sum() # wedding, with debt


print('Debts % to all loans:')
print('For purpose "car" - {:.1%}'.format(car_debt/data_car))
print('For purpose "education" - {:.1%}'.format(education_debt/data_education))
print('For purpose "real estate" - {:.1%}'.format(realestate_debt/data_realestate))
print('For purpose "wedding" - {:.1%}'.format(wedding_debt/data_education))


Debts % to all loans:
For purpose "car" - 9.3%
For purpose "education" - 34.1%
For purpose "real estate" - 7.2%
For purpose "wedding" - 4.6%


### Conclusion

All loan purposes found in our data were categorized into only 4 categories – namely, “car”, “education”, “real estate” and “wedding”. Purpose “education” shows a record percentage of debts – 34.1%, followed by “car”. “Wedding” looks safest.

## Step 4. General conclusion

Our project is to prepare a report for a bank’s loan division. We need to find out if a customer’s marital status and number of children has an impact on whether they will default on a loan. There is a dataset provided by bank, which includes customers’ worthiness, family status, age, gender, education level, type of income, total income, number of children and loan purpose.  

While completing the project, we open the data file /datasets/credit_scoring_eng.csv and have a look at the general information. Than we start to pre-process the data: we Identify and fill in missing values in columns “gender”, “total_income”, “children”. It was quite significant number of missing values in column “total_income” – about 10% (probably the form customer was required to fill up was different and missing field about the total income). We replace them with medians, calculated with grouping by income type, gender and age, as they notably vary from one group to another. We drop one whole column “dob_years”, because it does not make sense, probably due to wrong source or wrong formatting. We also convert data type int64, which is redundant, to int8, and float64 to float32 to reduce memory usage. We check dataset for duplicates; there are some case sensitivity duplicates. We convert everything to lower case. It’s not duplicates anymore from the bank’s point of view (see above explanations), so we do not delete them here. We categorize data on total income into 11 categories. Investigation reveals, however, it was over detailed, half (5-6) would already be enough. We categorized data in “purpose” column. Investigation shows that all entries in “purpose” column can be classified into 4 categories: “car”, “education”, “real estate” and “wedding”.

Investigation of dataset provided shows that number of children, marital status, income level and purpose of loan can affect on the ability of a potential borrower to repay their loan.

Obtained results illustrate connection between having kids and repaying a loan on time – percentage of debt is higher in customers with kids – 9.2% against 7.5%

Connection between marital status and repaying a loan on time was also found. The less reliable category is unmarried - 9.7% of debt, slightly better is civil partnership - 9.3%. Married looks much better - 7.5%, and the most reliable category turned out widow/widower - 6.6%. Divorced got 7.1%.

There is rather little dependence of financial reliability on total income. 10K to 25K of total income gives us percentage of debts 8.5-8.3%. BTW, it’s also covers about half of customers. 25K to 30K shows peak in debts – 9.0%. From 30K to 50K it is 6.5-7.8%. 50-60K gives 8.3%. Minimum of debts are in categories - under 10K and above 60K – 6.3% and 5.6%, respectively.

The most impressive is connection by the criterion of the purpose of loan. Thus, purpose “education” shows a record percentage of debts – 34.1%, followed by “car” – 9.3% and “real estate” – 7.2%. “Wedding” seems to be the safest.

So, childfree widow/widower (or divorced) with income above 60K, who wants a loan for wedding, makes the highest overall credit score.
