# Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building the **credit score** of a potential customer. **Credit scoring** is used to evaluate the ability of a potential borrower to repay their loan.

## Open the data file and have a look at the general information. 

In [1]:
import pandas as pd

In [2]:
pip install -U sidetable

Defaulting to user installation because normal site-packages is not writeable
Requirement already up-to-date: sidetable in /home/jovyan/.local/lib/python3.7/site-packages (0.9.0)
Note: you may need to restart the kernel to use updated packages.


In [3]:
import sidetable

Let's read the data file and give it a short name: data.

In [4]:
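try:
    # Try the local copy first
    data = pd.read_csv(r'C:\Users\Ron\Documents\credit_scoring_eng.csv')
except FileNotFoundError:
    # Fall back to the platform path when the local file does not exist
    data = pd.read_csv('/datasets/credit_scoring_eng.csv')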

Let's see how the data looks by viewing the first five rows.

In [5]:
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


### Conclusion

We read the data and explored it to get familiar with it and to see whether anything stands out at first glance.

At first glance, I already see some odd things: negative values in the days_employed column, and values in the education column that are duplicated with different letter case.

## Data preprocessing

### Processing missing values

OK, let's check the info of the data to see whether there are missing values and whether each column has the correct data type.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


After reading the info of the data, we can see that there are missing values in the days_employed and total_income columns.

Also, I could change the data type of total_income from float to integer so that the data is easier to read and work with.

But let's double-check that the columns mentioned above have missing values by calling the isnull() and sum() methods.

In [7]:
for i in data:
    if data[i].isnull().sum()>0:
        print(i)

days_employed
total_income


Let's look at the percentage of missing values in columns:

In [8]:
data.isna().sum()*100/len(data)

pd.DataFrame(round((data.isna().mean()*100),2)).style.background_gradient('coolwarm')

data.isna().mean() * 100

children             0.000000
days_employed       10.099884
dob_years            0.000000
education            0.000000
education_id         0.000000
family_status        0.000000
family_status_id     0.000000
gender               0.000000
income_type          0.000000
debt                 0.000000
total_income        10.099884
purpose              0.000000
dtype: float64

In [9]:
data.stb.missing()

Unnamed: 0,missing,total,percent
days_employed,2174,21525,10.099884
total_income,2174,21525,10.099884
children,0,21525,0.0
dob_years,0,21525,0.0
education,0,21525,0.0
education_id,0,21525,0.0
family_status,0,21525,0.0
family_status_id,0,21525,0.0
gender,0,21525,0.0
income_type,0,21525,0.0


Yes, I was right: the days_employed and total_income columns have missing values.
Let's review each column separately.

In [10]:
data[data.days_employed.isnull()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In [11]:
data[data.total_income.isnull()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


OK, the two results look identical: it seems that every row with a missing value in the days_employed column also has a missing value in the total_income column.

But it is normal and logical that those people who do not report the number of days they work will not have a value in the total income column.
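The observation that both columns are missing in the same rows can be checked directly rather than by eye. This is a minimal sketch on a toy table (the `toy` DataFrame and its values are assumptions for illustration, not the real dataset); the same comparison should work on `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table (illustrative values only)
toy = pd.DataFrame({
    'days_employed': [-8437.67, np.nan, -4024.80, np.nan],
    'total_income': [40620.10, np.nan, 17932.80, np.nan],
})

# True only if every row is missing in both columns or in neither
same_rows_missing = (toy['days_employed'].isna() == toy['total_income'].isna()).all()
print(same_rows_missing)
```

If the check returns False on the real data, some rows are missing in only one of the two columns and would need a separate look.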

And also, we cannot restore the data in this column because this data is filled manually by the employee or the company system.
Let's look at the rows that don't have missing values in the investigated columns.

In [12]:
data[data['days_employed'].notna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.422610,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21520,1,-4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions
21521,0,343937.404131,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car
21522,1,-2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property
21523,3,-3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car


In [13]:
data[data['total_income'].notna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.422610,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21520,1,-4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions
21521,0,343937.404131,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car
21522,1,-2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property
21523,3,-3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car


After looking at the days_employed column, we can see negative values, which don't make sense. Let's dig deeper.

In [14]:
data['days_employed'].describe()

count     19351.000000
mean      63046.497661
std      140827.311974
min      -18388.949901
25%       -2747.423625
50%       -1203.369529
75%        -291.095954
max      401755.400475
Name: days_employed, dtype: float64

We can see that at least 75% of the non-missing values in days_employed are negative, which doesn't make sense, and the maximum (about 401,755 days, more than 1,000 years) is not plausible either. However, our project questions are not related to this column, so we can leave it as it is and make sure we ask the data engineer about it.
Alternatively, we could use the abs() method.
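To see where the odd values come from before deciding on abs(), one option is to look at the sign of days_employed per income type. The sketch below uses four rows modelled on the preview above; treating the minus sign as an entry error and the `years_employed` column are assumptions for illustration, not something the data confirms.

```python
import pandas as pd

# Toy rows modelled on the preview above (not the full dataset)
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'retiree', 'retiree'],
    'days_employed': [-8437.673028, -4024.803754, 340266.072047, 343937.404131],
})

# Share of negative values within each income type
negative_share = (toy['days_employed'] < 0).groupby(toy['income_type']).mean()

# Hypothetical clean-up: drop the sign and express the result in years
toy['years_employed'] = toy['days_employed'].abs() / 365

print(negative_share)
print(toy)
```

In this toy sample, the positive values convert to more than 900 years, so abs() alone would not make them plausible; they may be recorded in a different unit, which is a question for the data owner.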

For now, let's fill the missing values in the total_income column.

We can do that by filling them with the median total income for each combination of income type and education id.
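How the group-median fill works can be seen on a small example. This is a sketch on toy values (the numbers are assumptions chosen so the medians are easy to verify by hand): `transform('median')` returns a Series aligned with the original rows, so it can be passed straight to `fillna`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree'],
    'education_id': [1, 1, 1, 1, 1],
    'total_income': [10000.0, 30000.0, np.nan, 20000.0, np.nan],
})

# Median of each (income_type, education_id) group, repeated for every row of that group
group_median = toy.groupby(['income_type', 'education_id'])['total_income'].transform('median')

# Only the missing entries are replaced; existing values are kept
toy['total_income'] = toy['total_income'].fillna(group_median)
print(toy)
```

A group in which every income is missing has no median, so its rows would stay empty; that is why it is worth re-checking for missing values afterwards.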

In [15]:
data['total_income']=data['total_income'].abs()

In [16]:
data['total_income'].describe()

count     19351.000000
mean      26787.568355
std       16475.450632
min        3306.762000
25%       16488.504500
50%       23202.870000
75%       32549.611000
max      362496.645000
Name: total_income, dtype: float64

In [17]:
data['total_income']= data['total_income'].fillna(data.groupby(['income_type', 'education_id'])['total_income'].transform('median'))

In [18]:
data[data.total_income.isnull()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


Now there are no missing values in the total_income column.

Next, let's explore all the columns in the data table.

In [20]:
data['children'].describe()

count    21525.000000
mean         0.538908
std          1.381587
min         -1.000000
25%          0.000000
50%          0.000000
75%          1.000000
max         20.000000
Name: children, dtype: float64

We can see weird values in the children column, such as -1 and 20: a negative number of children is impossible, and 20 children is highly suspicious.

Because these values are rare in the column, we will treat them as typos.

Let's fix this.

In [21]:
data['children']=data['children'].abs()

In [22]:
data['children'].describe()

count    21525.000000
mean         0.543275
std          1.379876
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max         20.000000
Name: children, dtype: float64

In [23]:
data["children"]=data["children"].replace(20, 2)

In [24]:
data['children'].describe()

count    21525.000000
mean         0.479721
std          0.755528
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max          5.000000
Name: children, dtype: float64

In [25]:
data['children'].unique()

array([1, 0, 3, 2, 4, 5])
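The two corrections above (absolute value for -1, then 20 replaced by 2) can also be written as one explicit mapping, which documents exactly which values are treated as typos. A minimal sketch on a toy Series, assuming -1 was meant as 1 and 20 as 2:

```python
import pandas as pd

children = pd.Series([1, 0, -1, 20, 3, 2])

# Explicit mapping of suspected typos to their assumed intended values
cleaned = children.replace({-1: 1, 20: 2})
print(cleaned.tolist())
```

Unlike abs(), the mapping touches only the listed values, so any other unexpected number would remain visible instead of being silently changed.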

Now the children column contains only plausible values, from 0 to 5.

Let's move on to the dob_years column.

In [26]:
data['dob_years'].unique()

array([42, 36, 33, 32, 53, 27, 43, 50, 35, 41, 40, 65, 54, 56, 26, 48, 24,
       21, 57, 67, 28, 63, 62, 47, 34, 68, 25, 31, 30, 20, 49, 37, 45, 61,
       64, 44, 52, 46, 23, 38, 39, 51,  0, 59, 29, 60, 55, 58, 71, 22, 73,
       66, 69, 19, 72, 70, 74, 75])

In [27]:
data['dob_years'].describe()

count    21525.000000
mean        43.293380
std         12.574584
min          0.000000
25%         33.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

Looks like we found a wrong value in the dob_years column: an age of 0, which is impossible.

Let's change it to the average age in the column, which is 43.

In [28]:
data["dob_years"]=data["dob_years"].replace(0, 43)

In [29]:
data['dob_years'].describe()

count    21525.000000
mean        43.495145
std         12.218213
min         19.000000
25%         34.000000
50%         43.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64
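Instead of hard-coding 43, the replacement value can be computed from the valid ages only, so the zeros do not pull the mean down. A sketch on toy ages (the values are assumptions for illustration):

```python
import pandas as pd

ages = pd.Series([42, 36, 0, 54, 0, 28])

# Mean of the plausible ages only, rounded to a whole year
valid_mean = int(round(ages[ages > 0].mean()))

# Replace the impossible zeros with that value
fixed = ages.replace(0, valid_mean)
print(valid_mean, fixed.tolist())
```

On the real column both approaches give 43 after rounding, judging by the describe() outputs above (43.29 with the zeros, 43.50 after replacement).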

In [30]:
data['education_id'].describe()

count    21525.000000
mean         0.817236
std          0.548138
min          0.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          4.000000
Name: education_id, dtype: float64

Now it looks better. Let's move on to the family status column.

In [31]:
data['family_status'].unique()

array(['married', 'civil partnership', 'widow / widower', 'divorced',
       'unmarried'], dtype=object)

In [32]:
data['family_status'].describe()

count       21525
unique          5
top       married
freq        12380
Name: family_status, dtype: object

In [33]:
data['family_status_id'].unique()

array([0, 1, 2, 3, 4])

In [34]:
data['gender'].describe()

count     21525
unique        3
top           F
freq      14236
Name: gender, dtype: object

In [35]:
data['gender'].unique()

array(['F', 'M', 'XNA'], dtype=object)

In [36]:
data.stb.freq(['gender'])

Unnamed: 0,gender,count,percent,cumulative_count,cumulative_percent
0,F,14236,66.13705,14236,66.13705
1,M,7288,33.858304,21524,99.995354
2,XNA,1,0.004646,21525,100.0


I don't understand what 'XNA' stands for; it appears in only one row.
Can you please be more specific about what kind of error this is?

In [37]:
data['income_type'].describe()

count        21525
unique           8
top       employee
freq         11119
Name: income_type, dtype: object

In [38]:
data['income_type'].unique()

array(['employee', 'retiree', 'business', 'civil servant', 'unemployed',
       'entrepreneur', 'student', 'paternity / maternity leave'],
      dtype=object)

In [39]:
data['debt'].describe()

count    21525.000000
mean         0.080883
std          0.272661
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: debt, dtype: float64

In [40]:
data['debt'].unique()

array([0, 1])

In [41]:
data[data.debt.isnull()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose


In [42]:
data['purpose'].describe()

count                21525
unique                  38
top       wedding ceremony
freq                   797
Name: purpose, dtype: object

In [43]:
data['purpose'].unique()

array(['purchase of the house', 'car purchase', 'supplementary education',
       'to have a wedding', 'housing transactions', 'education',
       'having a wedding', 'purchase of the house for my family',
       'buy real estate', 'buy commercial real estate',
       'buy residential real estate', 'construction of own property',
       'property', 'building a property', 'buying a second-hand car',
       'buying my own car', 'transactions with commercial real estate',
       'building a real estate', 'housing',
       'transactions with my real estate', 'cars', 'to become educated',
       'second-hand car purchase', 'getting an education', 'car',
       'wedding ceremony', 'to get a supplementary education',
       'purchase of my own house', 'real estate transactions',
       'getting higher education', 'to own a car', 'purchase of a car',
       'profile education', 'university education',
       'buying property for renting out', 'to buy a car',
       'housing renovation', 'going

Let's see if we have zeros in the table.

In [44]:
for i in data:
    print(i, len(data[data[i]==0]))

children 14149
days_employed 0
dob_years 0
education 0
education_id 5260
family_status 0
family_status_id 12380
gender 0
income_type 0
debt 19784
total_income 0
purpose 0


Looking good. We do have zeros in some columns, but they make sense in those columns.

### Conclusion

So far, we have explored the columns in the data and found missing values in the days_employed and total_income columns.

We then filled the missing values in total_income with the median for each combination of income type and education id.

Then we concluded that there is not much we can do about the days_employed column.

We replaced weird values in the children column: we took the absolute value of the negative entries and concluded that 20 children was a typo for 2. In the dob_years column, we replaced an age of 0 with the mean age of 43.

Let's see if we need to change some data types in the table.

### Data type replacement

In [45]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [46]:
import numpy as np

In [47]:
data['days_employed'] = data['days_employed'].fillna(0).astype(np.int64, errors='ignore')

In [48]:
data['total_income'] = data['total_income'].fillna(0).astype(np.int64, errors='ignore')

Now it will look better!

In [49]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   children          21525 non-null  int64 
 1   days_employed     21525 non-null  int64 
 2   dob_years         21525 non-null  int64 
 3   education         21525 non-null  object
 4   education_id      21525 non-null  int64 
 5   family_status     21525 non-null  object
 6   family_status_id  21525 non-null  int64 
 7   gender            21525 non-null  object
 8   income_type       21525 non-null  object
 9   debt              21525 non-null  int64 
 10  total_income      21525 non-null  int64 
 11  purpose           21525 non-null  object
dtypes: int64(7), object(5)
memory usage: 2.0+ MB


I got an unknown error while trying to do this.

### Conclusion

We changed the total_income and days_employed columns from float to int, which makes these columns easier to read and work with.
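What the conversion does to individual values can be illustrated on a toy Series. This is a sketch, not the notebook's exact cells: it shows that casting to `np.int64` truncates toward zero and needs the gaps filled first, and, as an alternative that is an assumption on my part, that pandas' nullable `Int64` dtype can keep missing values instead of turning them into 0.

```python
import numpy as np
import pandas as pd

values = pd.Series([-8437.673028, 40620.102, np.nan])

# Plain int64 cannot hold NaN, so gaps are filled first; the cast truncates toward zero
as_int = values.fillna(0).astype(np.int64)

# Alternative: truncate, then use the nullable Int64 dtype so the gap stays missing
as_nullable = np.trunc(values).astype('Int64')

print(as_int.tolist())
print(as_nullable)
```

Filling days_employed with 0 makes missing entries indistinguishable from a real zero, which matters if the column is ever analysed later.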

### Processing duplicates

In [50]:
data.duplicated().sum()

54

In [51]:
data['education'].duplicated().sum()

21510

In [52]:
data['education'].unique()

array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

Looks like we have duplicate values that differ only by letter case in the education column.

Let's unify them with the str.lower() method.

In [53]:
data['education'] = data['education'].str.lower()

In [54]:
data.drop_duplicates().reset_index(drop=True)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house
1,1,-4024,36,secondary education,1,married,0,F,employee,0,17932,car purchase
2,0,-5623,33,secondary education,1,married,0,M,employee,0,23341,purchase of the house
3,3,-4124,32,secondary education,1,married,0,M,employee,0,42820,supplementary education
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21449,1,-4529,43,secondary education,1,civil partnership,1,F,business,0,35966,housing transactions
21450,0,343937,67,secondary education,1,married,0,F,retiree,0,24959,purchase of a car
21451,1,-2113,38,secondary education,1,civil partnership,1,M,employee,1,14347,property
21452,3,-3112,38,secondary education,1,married,0,M,employee,1,39054,buying my own car


In [57]:
data.duplicated().sum()

71
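The count above is still nonzero, which is consistent with the result of drop_duplicates() not being assigned back to data: the method returns a new DataFrame and leaves the original unchanged. A minimal sketch on a toy table (the rows are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'education': ['Secondary Education', 'secondary education'],
    'debt': [0, 0],
})
toy['education'] = toy['education'].str.lower()

# Result is discarded here, so toy still holds both rows
toy.drop_duplicates()
rows_before = len(toy)

# Assigning the result back actually removes the duplicate
toy = toy.drop_duplicates().reset_index(drop=True)
rows_after = len(toy)

print(rows_before, rows_after)
```

Note that the later tables in this notebook were computed on all 21,525 rows, so applying this assignment to `data` would slightly change the counts and rates reported below.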

In [58]:
data['education'].duplicated().sum()

21520

In [56]:
data.stb.freq(['education'])

Unnamed: 0,education,count,percent,cumulative_count,cumulative_percent
0,secondary education,15233,70.768873,15233,70.768873
1,bachelor's degree,5260,24.436702,20493,95.205575
2,some college,744,3.456446,21237,98.662021
3,primary education,282,1.310105,21519,99.972125
4,graduate degree,6,0.027875,21525,100.0


In [59]:
data['education'].duplicated().sum()

21520

In [60]:
data['education'].unique()

array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)

In [61]:
data['education_id'].unique()

array([0, 1, 2, 3, 4])

In [52]:
data['education_id'].describe()

count    21525.000000
mean         0.817236
std          0.548138
min          0.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          4.000000
Name: education_id, dtype: float64

### Conclusion

We noticed early on that there were duplicated values differing only by letter case, and we unified them with str.lower().

Then we checked that the number of categories in the education column matches the number of ids in the education_id column.
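Matching counts do not by themselves guarantee that each label maps to exactly one id. A stricter check can be sketched as follows, on a toy table whose rows are assumptions for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'education': ["bachelor's degree", 'secondary education',
                  'secondary education', 'some college'],
    'education_id': [0, 1, 1, 2],
})

# Each label should have exactly one id, and each id exactly one label
ids_per_label = toy.groupby('education')['education_id'].nunique()
labels_per_id = toy.groupby('education_id')['education'].nunique()
one_to_one = bool((ids_per_label == 1).all() and (labels_per_id == 1).all())

print(one_to_one)
```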

Let's see how to decide who is poor, middle class, and wealthy using the describe() method.

### Categorizing Data

In [53]:
data['total_income'].describe()

count     21525.000000
mean      26456.597412
std       15703.769010
min        3306.762000
25%       17235.090000
50%       22959.405000
75%       31703.887000
max      362496.645000
Name: total_income, dtype: float64

With the describe() method, income is already split at the minimum, 25%, 50%, 75%, and maximum values.

Let's say that below 25% is considered poor.

Between 25% and 75% is considered middle class.

And above 75% is considered wealthy.

Let's define a function that will help us do so!

In [54]:
def income_group(row):
    # Thresholds are the 25th and 75th percentiles of total_income, truncated
    income = row['total_income']
    if income < 17235:
        return 'poor'
    elif income <= 31703:
        return 'middle_class'
    elif income > 31703:
        return 'wealthy'
    else:
        # Reached only when total_income is missing (NaN)
        return 'other'

data['income_groups'] = data.apply(income_group, axis=1)

In [55]:
data['income_groups'].value_counts()

middle_class    10762
wealthy          5382
poor             5381
Name: income_groups, dtype: int64
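The same grouping can be done without a row-wise apply. The sketch below uses `np.select`, which takes the first condition that matches, on toy incomes placed around the two thresholds; it is an alternative I am suggesting, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

income = pd.Series([3306.762, 17235.0, 22959.405, 31703.0, 31703.887, 362496.645])

# Conditions are checked in order; the first match wins
conditions = [income < 17235, income <= 31703]
labels = ['poor', 'middle_class']

groups = pd.Series(np.select(conditions, labels, default='wealthy'), index=income.index)
print(groups.tolist())
```

One difference from the function above: a missing income would fall into the default 'wealthy' label here, so missing values should be handled before this step.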

In [56]:
data['income_groups'].describe()

count            21525
unique               3
top       middle_class
freq             10762
Name: income_groups, dtype: object

In [57]:
data.income_groups.unique()

array(['wealthy', 'middle_class', 'poor'], dtype=object)

In [58]:
data.purpose.unique()

array(['purchase of the house', 'car purchase', 'supplementary education',
       'to have a wedding', 'housing transactions', 'education',
       'having a wedding', 'purchase of the house for my family',
       'buy real estate', 'buy commercial real estate',
       'buy residential real estate', 'construction of own property',
       'property', 'building a property', 'buying a second-hand car',
       'buying my own car', 'transactions with commercial real estate',
       'building a real estate', 'housing',
       'transactions with my real estate', 'cars', 'to become educated',
       'second-hand car purchase', 'getting an education', 'car',
       'wedding ceremony', 'to get a supplementary education',
       'purchase of my own house', 'real estate transactions',
       'getting higher education', 'to own a car', 'purchase of a car',
       'profile education', 'university education',
       'buying property for renting out', 'to buy a car',
       'housing renovation', 'going

In [59]:
from pymystem3 import Mystem
from collections import Counter

In [60]:
m=Mystem()

In [61]:
m.lemmatize("to become educated")

['to', ' ', 'become', ' ', 'educated', '\n']

In [62]:
study_cat = ['education', 'educated', 'university']

house_cat = ['house', 'housing', 'residential', 'estate', 'construction', 'property']

car_cat = ['car', 'cars', 'second-hand']

wedding_cat = ['wedding']

In [63]:
def lemmatization_func(line):
    lemmatized=m.lemmatize(line)
    return lemmatized

In [64]:
example=data.loc[0]['purpose']
example

'purchase of the house'

In [65]:
lemmatization_func(example)

['purchase', ' ', 'of', ' ', 'the', ' ', 'house', '\n']

In [66]:
any(word in lemmatization_func(example) for word in study_cat)

False

In [67]:
def lemmatization_func(line):
    # Map a loan purpose to one of four categories by keyword match
    lemmatized=m.lemmatize(line)
    if any(word in lemmatized for word in study_cat):
        return 'study'
    elif any(word in lemmatized for word in house_cat):
        return 'house'
    elif any(word in lemmatized for word in car_cat):
        return 'car'
    elif any(word in lemmatized for word in wedding_cat):
        return 'wedding'
    else:
        return 'other'

In [68]:
data['clean_purpose']= data['purpose'].apply(lemmatization_func)

In [69]:
data['clean_purpose'].value_counts()

house      10840
car         4315
study       4022
wedding     2348
Name: clean_purpose, dtype: int64
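Mystem is a morphological analyzer built for Russian, and the outputs above suggest it leaves English words unchanged ('educated' is not reduced to a base form). Under that assumption, the same categorization can be sketched with plain string splitting. The `categorize_purpose` function and `CATEGORY_KEYWORDS` dictionary are hypothetical names; the keyword lists are copied from the cell above.

```python
# Keyword lists copied from the notebook; checked in this order
CATEGORY_KEYWORDS = {
    'study': {'education', 'educated', 'university'},
    'house': {'house', 'housing', 'residential', 'estate', 'construction', 'property'},
    'car': {'car', 'cars', 'second-hand'},
    'wedding': {'wedding'},
}

def categorize_purpose(purpose):
    """Return the first category whose keywords appear among the words of the purpose."""
    tokens = set(purpose.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if tokens & keywords:
            return category
    return 'other'

for text in ['purchase of the house', 'to become educated',
             'second-hand car purchase', 'wedding ceremony']:
    print(text, '->', categorize_purpose(text))
```

With a DataFrame this would be applied as `data['purpose'].apply(categorize_purpose)`. A proper English lemmatizer would only be needed if the purposes contained inflected forms that the keyword lists do not cover.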

### Conclusion

I decided to categorize the total_income column into three general categories:

poor people, with an income below 17,235;

the middle class, with an income between 17,235 and 31,703;

and the wealthy, with an income above 31,703.

In the purpose column, we noticed repeated reasons people wanted to take a loan, and we grouped them into four categories: house, car, study, and wedding.

## Answer these questions

- Is there a relation between having kids and repaying a loan on time?

In [70]:
data

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,income_groups,clean_purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,wealthy,house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,middle_class,car
2,0,-5623.422610,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,middle_class,house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,wealthy,study
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,middle_class,wedding
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21520,1,-4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions,wealthy,house
21521,0,343937.404131,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car,middle_class,car
21522,1,-2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property,poor,house
21523,3,-3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car,wealthy,car


In [71]:
children_status = data.pivot_table(index='children', values='debt', aggfunc='count',
                                 margins=True).reset_index()
children_status

Unnamed: 0,children,debt
0,0,14149
1,1,4865
2,2,2131
3,3,330
4,4,41
5,5,9
6,All,21525


In [72]:
data.groupby(['children'])['debt'].mean().reset_index()

Unnamed: 0,children,debt
0,0,0.075129
1,1,0.09147
2,2,0.094791
3,3,0.081818
4,4,0.097561
5,5,0.0


In [73]:
data.pivot_table(index='children', values='debt', aggfunc=['mean', 'sum', 'count'])

children,mean (debt),sum (debt),count (debt)
0,0.075129,1063,14149
1,0.09147,445,4865
2,0.094791,202,2131
3,0.081818,27,330
4,0.097561,4,41
5,0.0,0,9


### Conclusion

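More people without kids take loans, but the default rate is higher among people with kids: about 7.5% of customers without children did not repay on time, compared with 9.1% to 9.8% of customers with one, two, or four children (8.2% with three). The group with five children has only nine customers, so its 0% rate is not reliable.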
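To compare customers with and without children as two groups, the counts from the pivot table above can be combined. This is a worked arithmetic sketch using only those counts; the variable names are mine.

```python
# children -> (customers with debt, all customers), copied from the pivot table above
counts = {0: (1063, 14149), 1: (445, 4865), 2: (202, 2131),
          3: (27, 330), 4: (4, 41), 5: (0, 9)}

defaults_no_kids, total_no_kids = counts[0]
defaults_kids = sum(d for k, (d, n) in counts.items() if k > 0)
total_kids = sum(n for k, (d, n) in counts.items() if k > 0)

rate_no_kids = defaults_no_kids / total_no_kids
rate_kids = defaults_kids / total_kids

print(f"no kids: {rate_no_kids:.2%}, with kids: {rate_kids:.2%}")
```

This gives roughly 7.5% for customers without children and 9.2% for customers with at least one child, a difference of under two percentage points.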

- Is there a relation between marital status and repaying a loan on time?

In [74]:
family_status = data.pivot_table(index='family_status', columns = 'debt', values='family_status_id', aggfunc='count',
                                 margins=True).reset_index()
family_status

debt,family_status,0,1,All
0,civil partnership,3789,388,4177
1,divorced,1110,85,1195
2,married,11449,931,12380
3,unmarried,2539,274,2813
4,widow / widower,897,63,960
5,All,19784,1741,21525


In [75]:
data.groupby(['family_status'])['debt'].mean().reset_index()

Unnamed: 0,family_status,debt
0,civil partnership,0.09289
1,divorced,0.07113
2,married,0.075202
3,unmarried,0.097405
4,widow / widower,0.065625


### Conclusion

We find that people who are unmarried or in a civil partnership have a higher rate of not repaying the loan on time (about 9.7% and 9.3%) than married (7.5%), divorced (7.1%), or widowed (6.6%) customers.

- Is there a relation between income level and repaying a loan on time?

In [76]:
income_level = data.pivot_table(index='income_groups', values='debt', aggfunc='count',
                                 margins=True).reset_index()
income_level

Unnamed: 0,income_groups,debt
0,middle_class,10762
1,poor,5381
2,wealthy,5382
3,All,21525


In [77]:
data.groupby(['income_groups'])['debt'].mean().reset_index()

Unnamed: 0,income_groups,debt
0,middle_class,0.087251
1,poor,0.079353
2,wealthy,0.069677


### Conclusion

The middle class is the largest group and has the highest rate of not repaying the loan on time (about 8.7%).

Although poor people make less than the middle class, they have a higher chance of repaying the loan on time (a default rate of about 7.9%).

Wealthy people have the highest chance of paying the loan on time (a default rate of about 7.0%).

- How do different loan purposes affect on-time repayment of the loan?

In [78]:
loan_purposes = data.pivot_table(index='clean_purpose', values='debt', aggfunc='count',
                                 margins=True).reset_index()
loan_purposes

Unnamed: 0,clean_purpose,debt
0,car,4315
1,house,10840
2,study,4022
3,wedding,2348
4,All,21525


In [79]:
data.groupby(['clean_purpose'])['debt'].mean().reset_index()

Unnamed: 0,clean_purpose,debt
0,car,0.093395
1,house,0.07214
2,study,0.091994
3,wedding,0.079216


### Conclusion

Slightly more than 50% of people who take a loan intend to use the money for a house-related purpose.

As for repaying the loan on time, people who took a loan for study or to purchase a car have a lower chance of repaying on time (default rates of about 9.2% and 9.3%) than those who borrowed for a wedding (7.9%) or a house (7.2%).
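All four questions follow the same pattern: group by a category and summarize the debt column. That pattern can be wrapped in one helper. This is a sketch; `default_rate_by` is a hypothetical name and the toy rows are assumptions for illustration.

```python
import pandas as pd

def default_rate_by(df, column):
    """Customers, defaults and default rate for each value of `column`."""
    summary = df.groupby(column)['debt'].agg(['count', 'sum', 'mean'])
    summary.columns = ['customers', 'defaults', 'default_rate']
    # Highest default rate first
    return summary.sort_values('default_rate', ascending=False)

toy = pd.DataFrame({
    'clean_purpose': ['car', 'car', 'house', 'house', 'house', 'house'],
    'debt': [1, 0, 0, 0, 1, 0],
})

result = default_rate_by(toy, 'clean_purpose')
print(result)
```

Showing the number of customers next to each rate makes small groups easy to spot, such as the nine customers with five children.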

## General conclusion

We started by cleaning the data: handling missing values and duplicated values.
Then we categorized the required columns to prepare the data to answer the four questions that were given to us.

We answered the questions and concluded that people without kids should have a higher credit score because they are more likely to pay the loan on time.

People who are or have been married should get a higher credit score than those who are unmarried or in a civil partnership because they are more likely to pay a loan on time.

People who took a loan for a wedding or for buying a house should have a higher credit score than people who took a loan to buy a car or pay for their education.

And finally, poor people and wealthy people should have a higher credit score than people in the middle class because a higher percentage of them pay the loan back on time.