# Analysis of Borrowers’ Risk of Defaulting

The task is to prepare a report for a bank’s loan division. We need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

The report will be considered when building a credit score for a potential customer. A credit score is used to evaluate the ability of a potential borrower to repay their loan.

# Table of Contents

- 1) [Open the data and study the general information](#1)
<br> <br>
- 2) [Data preprocessing](#2)
<br> <br>
    - 2.1) [Processing missing values](#2.1)
    - 2.2) [Data type replacement](#2.2)
    - 2.3) [Processing duplicates](#2.3)
    - 2.4) [Categorizing data](#2.4)
<br> <br>
- 3) [Analyse relations between variables](#3)
<br> <br>
    - 3.1) [Relation between having kids and repaying a loan on time](#3.1)
    - 3.2) [Relation between marital status and repaying a loan on time](#3.2)
    - 3.3) [Relation between income level and repaying a loan on time](#3.3)
    - 3.4) [Loan purposes affect on-time repayment of the loan](#3.4)
<br> <br>
- 4) [General conclusion](#4)

<a id="1"></a>

## Open the data and study the general information

In [1]:
# import libraries
import pandas as pd
import nltk
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
wordnet_lemma = WordNetLemmatizer() 
english_stemmer = SnowballStemmer('english')

In [2]:
# read csv file
df = pd.read_csv('C:/Users/Herbert/Documents/Practicum100/datasets/data_preprocessing_credit_scoring_eng.csv')

In [3]:
df.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


### Conclusion

This table gives some information about the customers; it has 21525 rows and 12 columns. <br>
The only two columns of type float64 are also the only two columns with missing values:
'days_employed' and 'total_income' each have 19351 non-null entries,
and both are quantitative columns. <br>
The first 5 rows already show implausible entries in the column 'days_employed':
on the one hand, the values are negative for some customers;
on the other hand, the customer in the fifth row has been employed for longer than she has been alive.
In addition, the same education level appears in the column 'education' with different capitalization.
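
The two plausibility problems described above can be flagged programmatically. The following is a minimal sketch on a small toy frame, not the notebook's own code; the toy values and the helper columns `implausible` and `negative` are assumptions made for illustration.

```python
import pandas as pd

# Toy rows mimicking the two columns discussed above (illustrative values).
sample = pd.DataFrame({
    "days_employed": [-8437.67, 340266.07, None],
    "dob_years": [42, 53, 30],
})

# Employment length in years, ignoring the sign.
years_employed = sample["days_employed"].abs() / 365

# Implausible: the customer has been employed for longer than their age.
sample["implausible"] = years_employed > sample["dob_years"]

# Negative durations are a second, separate problem.
sample["negative"] = sample["days_employed"] < 0

print(sample)
```

Comparisons with a missing value evaluate to False, so rows without 'days_employed' are not flagged by either check and have to be handled separately.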

<a id="2"></a>

## Data preprocessing

<a id="2.1"></a>

### Processing missing values

In [5]:
# numbers for calculating the share of missing values
number = df.isnull().sum()
size = len(df)
quote = number / size

# creating subsets of the table by the sign of the values in days_employed
df_neg = df[df['days_employed'] < 0]
df_pos = df[df['days_employed'] > 0]
df_nan = df[df['days_employed'].isnull()]
df_days_employed = df.loc[:, 'days_employed']

print("The share of missing values for the columns 'days_employed' and 'total_income' is {:.2%}."
      .format(quote['days_employed']))

The share of missing values for the columns 'days_employed' and 'total_income' is 10.10%.
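
As a side note, the share of missing values per column can also be read directly from `isna().mean()`. The sketch below uses a toy frame (an assumption, not the bank data) and additionally shows one way to check whether two columns are missing in exactly the same rows, which the cell above does not verify.

```python
import numpy as np
import pandas as pd

# Toy data: two columns that are missing in the same rows (illustrative).
toy = pd.DataFrame({
    "days_employed": [-100.0, np.nan, -250.0, np.nan],
    "total_income": [1000.0, np.nan, 2000.0, np.nan],
    "debt": [0, 1, 0, 0],
})

# The mean of a boolean mask is the share of True values, per column.
missing_share = toy.isna().mean()

# True only if both columns are missing in exactly the same rows.
same_rows = (toy["days_employed"].isna() == toy["total_income"].isna()).all()

print(missing_share)
print("Missing in the same rows:", same_rows)
```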


In [6]:
print("Take a look at the maxima and minima of the positive and negative values for 'days_employed':")
print()
print('The smallest negative value is {:.2f},'.format(df_days_employed.loc[df['days_employed'] < 0].min()))
print('The biggest negative value is {:.2f},'.format(df_days_employed.loc[df['days_employed'] < 0].max()))
print('The smallest positive value is {:.2f},'.format(df_days_employed.loc[df['days_employed'] > 0].min()))
print('The biggest positive value is {:.2f}.'.format(df_days_employed.loc[df['days_employed'] > 0].max()))
print()
print("The smallest negative value for days employed corresponds to {:.0f} years."
      .format(df_days_employed.loc[df['days_employed'] < 0].min()/365))
print("The smallest positive value for days employed corresponds to {:.0f} years,"
      .format(df_days_employed.loc[df['days_employed'] > 0].min()/365))

Take a look at the maxima and minima of the positive and negative values for 'days_employed':

The smallest negative value is -18388.95,
The biggest negative value is -24.14,
The smallest positive value is 328728.72,
The biggest positive value is 401755.40.

The smallest negative value for days employed corresponds to -50 years.
The smallest positive value for days employed corresponds to 901 years,


In absolute terms, values of around 900 years are clearly unrealistic, while the negative values, which range from 24 days to about 50 years in magnitude, are plausible.

In [7]:
print("There are {} entries with zero days employed.".format(len(df[df['days_employed'] == 0])))

There are 0 entries with zero days employed.


In [8]:
# Differentiate between positive, negative and missing values in the column 'days_employed' for each type of income:

print('Number of customers with negative values grouped by income types:')
print(df_neg['income_type'].value_counts())
print()
print('Number of customers with positive values grouped by income types:')
print(df_pos['income_type'].value_counts())
print()
print('Number of customers with missing values grouped by income types:')
print(df_nan['income_type'].value_counts())

Number of customers with negative values grouped by income types:
employee                       10014
business                        4577
civil servant                   1312
paternity / maternity leave        1
entrepreneur                       1
student                            1
Name: income_type, dtype: int64

Number of customers with positive values grouped by income types:
retiree       3443
unemployed       2
Name: income_type, dtype: int64

Number of customers with missing values grouped by income types:
employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: income_type, dtype: int64


In [9]:
# For the retirees we only have positive and missing values in the column 'days_employed'.
# Since the largest realistic value corresponds to about 50 years and the smallest positive value
# corresponds to 901 years, I set the value in the column 'days_employed' for all retirees to 51 years;
# that corresponds to 18615 days employed. The value is stored with a minus sign for now because
# the sign of the whole column is flipped in the next cell.

df.loc[df['income_type'] == 'retiree', 'days_employed'] = -18615.00

# The unemployed customers only have unrealistically high positive values in this column;
# since they aren't employed, I set their value for this column to zero
# (written as -0.0 so that the sign flip in the next cell yields 0.0).

df.loc[df['income_type'] == 'unemployed', 'days_employed'] = -0.00

In [10]:
# now all values for days employed are either missing, zero or negative, so flip the sign to make them non-negative:
df['days_employed'] = -df['days_employed']

In [11]:
print("Grouped by the type of income, I calculate the means and the medians of the days employed:")
print()
print('medians:')
print(df.groupby('income_type')['days_employed'].median())
print()
print('means:')
print(df.groupby('income_type')['days_employed'].mean())

Grouped by the type of income, I calculate the means and the medians of the days employed:

medians:
income_type
business                        1547.382223
civil servant                   2689.368353
employee                        1574.202821
entrepreneur                     520.848083
paternity / maternity leave     3296.759962
retiree                        18615.000000
student                          578.751554
unemployed                         0.000000
Name: days_employed, dtype: float64

means:
income_type
business                        2111.524398
civil servant                   3399.896902
employee                        2326.499216
entrepreneur                     520.848083
paternity / maternity leave     3296.759962
retiree                        18615.000000
student                          578.751554
unemployed                         0.000000
Name: days_employed, dtype: float64


The means are noticeably higher than or equal to the corresponding medians.

For the remaining income types I calculate the median of the days employed per type (now without the minus sign) and use the fillna() method to replace the missing values in the column 'days_employed' with the median of the matching income type.

In [12]:
# create list with all income types besides 'retiree' and 'unemployed'
type_list = ['employee', 'civil servant', 'business', 'student', 'entrepreneur', 'paternity / maternity leave']

# create a function that fills the missing days employed with the median value regarding the income type:
def fill_missing_days(income_type):
    df.loc[df['income_type'] == income_type, 'days_employed'] = \
            df['days_employed'].fillna(value = df.loc[df['income_type'] == income_type, 'days_employed'].median())

# apply that function on all types of the type list
for income_type in type_list:
    fill_missing_days(income_type)
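
The per-group median fill can also be written without a loop. The following is a sketch of an alternative formulation using `groupby(...).transform("median")` on a toy frame; it is not the code used in this notebook, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data: one missing value per income type (illustrative values).
toy = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "business", "business"],
    "days_employed": [100.0, 300.0, np.nan, 50.0, np.nan],
})

# For every row, the median of 'days_employed' within that row's income type.
group_median = toy.groupby("income_type")["days_employed"].transform("median")

# fillna() aligns on the index, so each gap gets the median of its own group.
toy["days_employed"] = toy["days_employed"].fillna(group_median)

print(toy)
```

Because `transform` returns a Series with the same index as the original column, no explicit list of income types is needed.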

In [13]:
print("Take a look at the medians and the means of the total income grouped by the type of income:")
print()
print("medians:")
print(df.groupby('income_type')['total_income'].median())
print()
print("means:")
print(df.groupby('income_type')['total_income'].mean())

Take a look at the medians and the means of the total income grouped by the type of income:

medians:
income_type
business                       27577.2720
civil servant                  24071.6695
employee                       22815.1035
entrepreneur                   79866.1030
paternity / maternity leave     8612.6610
retiree                        18962.3180
student                        15712.2600
unemployed                     21014.3605
Name: total_income, dtype: float64

means:
income_type
business                       32386.793835
civil servant                  27343.729582
employee                       25820.841683
entrepreneur                   79866.103000
paternity / maternity leave     8612.661000
retiree                        21940.394503
student                        15712.260000
unemployed                     21014.360500
Name: total_income, dtype: float64


In [14]:
# Here, too, the means are higher than or equal to the corresponding medians,
# so I replace the missing values in the column 'total_income' with the median of the
# matching income type, using a function that takes the income type
def fill_missing_incomes(income_type):
    df.loc[df['income_type'] == income_type, 'total_income'] = \
            df['total_income'].fillna(value = df.loc[df['income_type'] == income_type, 'total_income'].median())

# apply function on all income types
for income_type in df['income_type'].unique():
    fill_missing_incomes(income_type)
    
print("There are {:.0f} missing values left in the whole table.".format(df.isnull().sum().sum()))

There are 0 missing values left in the whole table.


#### Conclusion

I identified missing values in the columns 'total_income' and 'days_employed'.
Only the negative values in the column 'days_employed' were realistic in magnitude,
so I didn't use the positive values in this column to fill the missing values. <br>
I assumed that the days employed and the total income depend strongly on the type of income.
That's why I compared the means and the medians of both columns grouped by the type of income.
Since the means are higher than or equal to the corresponding medians, I replaced the
missing values with the corresponding medians by using the fillna() method. <br>
I did not use this method for unemployed and retired customers in the column 'days_employed', as they had only missing or unrealistically high positive values. Instead, I set the days employed for all retirees slightly above the largest realistic value in the table (51 years); for unemployed customers, zero days employed makes more sense than values like 900 years. <br>
Missing values occur across many income types, so it's possible that these customers simply
didn't answer the question about days employed.
Since the only customers with unrealistically high values for days employed
are unemployed or retired, the system that processes the data possibly can't handle the value '0' for days employed.

<a id="2.2"></a>

### Data type replacement

In [15]:
# value for smallest total income
smallest_income = df['total_income'].min()

print("The smallest value for the total income in the list is {} units.".format(smallest_income))

# use the astype() method to convert both columns to an integer type
# (astype('int') truncates the decimals; on this platform it yields int32):
try:
    df['total_income'] = df['total_income'].astype('int')
    df['days_employed'] = df['days_employed'].astype('int')
except ValueError:
    print("Couldn't convert to an integer type.")
    
df.info()

The smallest value for the total income in the list is 3306.762 units.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       21525 non-null int32
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        21525 non-null int32
purpose             21525 non-null object
dtypes: int32(2), int64(5), object(5)
memory usage: 1.8+ MB


#### Conclusion

I used the astype() method to convert the two float64 columns to an integer type (int32 in the output above) because it is a simple way to change the type of a whole column. It doesn't make sense to record half a day of employment in this table. Note that astype() truncates the decimal part instead of rounding; since the smallest total income is about 3307 units, the loss of less than one unit per value is negligible.
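
The difference between truncating and rounding during the type change can be seen on three income values taken from the outputs above. This is a small sketch; requesting `"int64"` explicitly is a choice made here so that the result does not depend on the platform's default integer width.

```python
import pandas as pd

# Three total incomes that appear in the outputs above.
income = pd.Series([3306.762, 17932.802, 40620.102])

# astype() alone cuts off the decimal part (truncation toward zero).
truncated = income.astype("int64")

# Rounding first gives the nearest integer instead.
rounded = income.round().astype("int64")

print(pd.DataFrame({"original": income, "truncated": truncated, "rounded": rounded}))
```

Either way the difference per value is below one unit, so the choice does not affect the analysis.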

<a id="2.3"></a>

### Processing duplicates

In [16]:
print("There are {} duplicates in the DataFrame.".format(df.duplicated().sum()))

# Delete them
df = df.drop_duplicates().reset_index(drop = True)

There are 54 duplicates in the DataFrame.


In [17]:
print("I group by the elements in every object typed column:")
print()
print(df['gender'].value_counts())
print()
print(df['education'].value_counts())
print()
print(df['family_status'].value_counts())
print()
print(df['income_type'].value_counts())
print()
print(df['purpose'].value_counts())

I group by the elements in every object typed column:

F      14189
M       7281
XNA        1
Name: gender, dtype: int64

secondary education    13705
bachelor's degree       4710
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        273
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
Graduate Degree            1
GRADUATE DEGREE            1
Name: education, dtype: int64

married              12344
civil partnership     4163
unmarried             2810
divorced              1195
widow / widower        959
Name: family_status, dtype: int64

employee                       11091
business                        5080
retiree                         3837
civil servant                   1457
entrepreneur                       2
unemployed                         2
student             

The only columns with ambiguous entries are 'education' and 'purpose'. <br>
In the column 'education' we have the following categories: <br>
bachelor's degree, graduate degree, primary education, secondary education, some college

In [18]:
print("I transform all letters in the column 'education' to lowercase letters.")
df['education'] = df['education'].str.lower()
print("Let's check the column 'education':")
print()
print(df['education'].value_counts())

I transform all letters in the column 'education' to lowercase letters.
Let's check the column 'education':

secondary education    15188
bachelor's degree       5251
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64


In [19]:
# keep a second handle on the data; note that this binds a second name to the same DataFrame and does not copy it
df_2 = df
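
A plain assignment such as `df_2 = df` does not duplicate the data: both names refer to the same object, so in-place changes made through one name are visible through the other. The sketch below illustrates the difference from `DataFrame.copy()` on a toy frame; the variable names `original`, `alias` and `independent` are made up for this example.

```python
import pandas as pd

original = pd.DataFrame({"purpose": ["car purchase", "to have a wedding"]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

# An in-place change through the alias ...
alias.loc[0, "purpose"] = "car"

# ... is visible through 'original', but not in the independent copy.
print(original.loc[0, "purpose"])
print(independent.loc[0, "purpose"])
```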

In [20]:
print("I now define categories for purposes to find more duplicates:")
for i in range(len(df_2)):
    words = nltk.word_tokenize(df_2.loc[i, 'purpose'])
    lemmas = [wordnet_lemma.lemmatize(w, pos = 'n') for w in words]
    if 'education' in lemmas or 'university' in lemmas or 'educated' in lemmas:
        df_2.loc[i, 'purpose'] = 'education'
    elif 'wedding' in lemmas:
        df_2.loc[i, 'purpose'] = 'wedding'
    elif 'second-hand' in lemmas:
        df_2.loc[i, 'purpose'] = 'second-hand car'
    elif 'car' in lemmas and not 'second-hand' in lemmas:
        df_2.loc[i, 'purpose'] = 'new car'
    elif 'renovation' in lemmas:
        df_2.loc[i, 'purpose'] = 'housing renovation'
    elif 'commercial' in lemmas:
        df_2.loc[i, 'purpose'] = 'commercial real estate'
    else:
        df_2.loc[i, 'purpose'] = 'property'
print()
print(df_2['purpose'].value_counts())

I now define categories for purposes to find more duplicates:

property                  8895
education                 4014
new car                   3344
wedding                   2335
commercial real estate    1312
second-hand car            964
housing renovation         607
Name: purpose, dtype: int64


In [21]:
dupl_sum = df_2.duplicated().sum()
print("There are still {} duplicates in the data frame that I delete.".format(dupl_sum))
df_2 = df_2.drop_duplicates().reset_index(drop = True)

There are still 282 duplicates in the data frame that I delete.


In [22]:
print("I now define categories for purposes more roughly to find more duplicates:")
for i in range(len(df)):
    words = nltk.word_tokenize(df.loc[i, 'purpose'])
    lemmas = [wordnet_lemma.lemmatize(w, pos = 'n') for w in words]
    if 'education' in lemmas or 'university' in lemmas or 'educated' in lemmas:
        df.loc[i, 'purpose'] = 'education'
    if 'wedding' in lemmas:
        df.loc[i, 'purpose'] = 'wedding'
    if 'car' in lemmas:
        df.loc[i, 'purpose'] = 'car'
    if 'housing' in lemmas or 'house' in lemmas or 'estate' in lemmas or 'property' in lemmas:
        df.loc[i, 'purpose'] = 'property'
print()
print(df['purpose'].value_counts())

I now define categories for purposes more roughly to find more duplicates:

property     10814
car           4308
education     4014
wedding       2335
Name: purpose, dtype: int64


In [23]:
print("There are still {} duplicates in the data frame that I delete."
      .format(df.duplicated().sum()-dupl_sum))
df = df.drop_duplicates().reset_index(drop = True)

There are still 69 duplicates in the data frame that I delete.


In [24]:
print("Let's look at the tables again:")
print()
print(df_2['purpose'].value_counts())
print()
print(df['purpose'].value_counts())
print()
print("Note that the table with the more differentiated purposes keeps 69 rows that become duplicates under the coarser categories.")

Let's look at the tables again:

property                  8716
education                 3964
new car                   3322
wedding                   2306
commercial real estate    1311
second-hand car            963
housing renovation         607
Name: purpose, dtype: int64

property     10578
car           4272
education     3964
wedding       2306
Name: purpose, dtype: int64

Note that the table with the more differentiated purposes keeps 69 rows that become duplicates under the coarser categories.


#### Conclusion

The columns 'education' and 'purpose' contain many strings that describe the same category in different ways, so I counted the number of entries for each string. I saw that the purposes could be grouped by hand. With the word_tokenize() method I created a list of words for each string in the column 'purpose'. By lemmatizing the words as nouns with lemmatize() I avoided mistakes like separating 'cars' and 'car'. A series of if-statements then assigns each purpose to a category. In the column 'education', the variants of a string differed only in uppercase and lowercase letters, so I applied the str.lower() method to make every letter lowercase. <br> 
After reducing the number of distinct strings in these two columns, I found many more duplicates. A possible reason for these duplicates is that some customers discussed their loan with the bank several times and their reasons for the loan were entered into the data processing system each time; the customers or the bank employees didn't pay attention to uppercase and lowercase letters or to the exact wording of the purpose.
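
The coarse grouping of purposes can also be expressed as a small function applied with `Series.map`, which avoids writing to the table row by row. The sketch below deliberately swaps the lemmatization for simple word-prefix matching so that it runs without NLTK; the names `RULES` and `categorize_purpose`, the fallback to 'property', and the last two example strings are assumptions made for illustration.

```python
import pandas as pd

# (word prefix, category) pairs, checked in order; the first match wins.
RULES = [
    ("educat", "education"),
    ("university", "education"),
    ("wedding", "wedding"),
    ("car", "car"),
]

def categorize_purpose(text: str) -> str:
    """Map a free-text loan purpose to a coarse category."""
    words = text.lower().split()
    for prefix, category in RULES:
        if any(word.startswith(prefix) for word in words):
            return category
    # Assumption: everything else is treated as a real-estate purpose.
    return "property"

purposes = pd.Series([
    "purchase of the house",
    "car purchase",
    "supplementary education",
    "to have a wedding",
    "cars",
    "to become educated",
])
categories = purposes.map(categorize_purpose)
print(categories.value_counts())
```

Prefix matching is cruder than lemmatization (it would also match unrelated words that start with 'car'), so it only suits a small, known vocabulary like the one in this column.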

<a id="2.4"></a>

### Categorizing data

In [25]:
print("Let's see the values for the highest and the lowest total incomes:")
print()
print(df['total_income'].sort_values().head(3))
print()
print(df['total_income'].sort_values().tail(3))

Let's see the values for the highest and the lowest total incomes:

14386    3306
12843    3392
15935    3418
Name: total_income, dtype: int32

9082     276204
19265    352136
12262    362496
Name: total_income, dtype: int32


In [26]:
print("The table has {} rows. I sort the table by total income".format(len(df)))
df_sorted = df['total_income'].sort_values().reset_index(drop = True)
print("and divide the number of clients in three groups of equal size. In the sorted table, we have:")
print("Total income of customer 7040: {}".format(df_sorted.loc[7039]))
print("Total income of customer 14080: {}".format(df_sorted.loc[14079]))

The table has 21120 rows. I sort the table by total income
and divide the number of clients in three groups of equal size. In the sorted table, we have:
Total income of customer 7040: 19055
Total income of customer 14080: 27577


In [27]:
# I define a function that assigns an income group to each customer. With the function,
# I create a new column 'income_third' with the possible entries 'lower third', 'middle third' and 'upper third'.
# For example, 'middle third' means that the customer's total income is higher than the incomes of about a third
# of the customers and lower than the incomes of about another third.

# define a function that maps a total income to its third of the income distribution
def income_third_func(row):
    income = row['total_income']
    if income <= df_sorted.loc[7039]:
        return 'lower third'
    if income > df_sorted.loc[14079]:
        return 'upper third'
    return 'middle third'

df['income_third'] = df.apply(income_third_func, axis = 1)

In [28]:
# I define another column, 'income_level', using a function that tells us whether the
# total income is at most 9999 ('4-digit'), at most 99999 ('5-digit') or at least 100,000 ('6-digit').

# define a function that attaches a total income to its number of digits
def income_level_func(row):
    income = row['total_income']
    if income <= 9999:
        return '4-digit'
    if income > 99999:
        return '6-digit'
    return '5-digit'

df['income_level'] = df.apply(income_level_func, axis = 1)

In [29]:
# Let's take a look at the extended table:
df.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,income_third,income_level
0,1,8437,42,bachelor's degree,0,married,0,F,employee,0,40620,property,upper third,5-digit
1,1,4024,36,secondary education,1,married,0,F,employee,0,17932,car,lower third,5-digit
2,0,5623,33,secondary education,1,married,0,M,employee,0,23341,property,middle third,5-digit
3,3,4124,32,secondary education,1,married,0,M,employee,0,42820,education,upper third,5-digit
4,0,18615,53,secondary education,1,civil partnership,1,F,retiree,0,25378,wedding,middle third,5-digit
5,0,926,27,bachelor's degree,0,civil partnership,1,M,business,0,40922,property,upper third,5-digit
6,0,2879,43,bachelor's degree,0,married,0,F,business,0,38484,property,upper third,5-digit
7,0,152,50,secondary education,1,married,0,M,employee,0,21731,education,middle third,5-digit
8,2,6929,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337,wedding,lower third,5-digit
9,0,2188,41,secondary education,1,married,0,M,employee,0,23108,property,middle third,5-digit


#### Conclusion

I decided not to categorize the values in the column 'days_employed' because these values have been heavily revised. <br>
I want to know how the values of the other quantitative column 'total_income' are distributed. I divided the customers into three groups of equal size sorted by total income. The customer with the highest income earns over 100 times as much as the one with the lowest income. Interestingly, the middle third spans only a narrow range, from 19055 to 27577. <br> Given that the third-highest income is roughly 86,000 units lower than the highest income, we can assume large outliers in the distribution of the total incomes.
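
Splitting a column into three equally sized groups is also available as a built-in: `pd.qcut`. The sketch below applies it to nine income values that appear in the outputs above; it is an alternative formulation, not the method used in this notebook, and because `qcut` interpolates its bin edges the boundaries can differ slightly from thresholds read off a sorted table.

```python
import pandas as pd

# Nine total incomes taken from the outputs above.
income = pd.Series([3306, 17932, 19055, 23341, 27577, 40620, 42820, 276204, 362496])

labels = ["lower third", "middle third", "upper third"]

# q=3 cuts at the 1/3 and 2/3 quantiles, giving three groups of equal size.
income_third = pd.qcut(income, q=3, labels=labels)

print(pd.DataFrame({"total_income": income, "income_third": income_third}))
```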

<a id="3"></a>

## Analyse relations between variables

<a id="3.1"></a>

### Relation between having kids and repaying a loan on time

Is there a relation between having kids and repaying a loan on time?

In [30]:
# create a pivot table grouped by number of children including the percentage fractions of debtors
kids_debt_table = df.pivot_table(index = 'children', columns = 'debt', values = 'dob_years', aggfunc = 'count')
# The NaN value at the following cell corresponds to zero
kids_debt_table.loc[5, 1] = 0
kids_debt_table['quote_debt_in_percent'] = kids_debt_table[1] / ( kids_debt_table[0] + kids_debt_table[1] ) * 100

In [31]:
# The following table shows the number of customers without and with debts and the share of customers
# with debts (in percent), depending on the number of the customer's children:

kids_debt_table

children,debt=0,debt=1,quote_debt_in_percent
-1,46.0,1.0,2.12766
0,12768.0,1061.0,7.672283
1,4307.0,444.0,9.345401
2,1845.0,194.0,9.514468
3,302.0,27.0,8.206687
4,36.0,4.0,10.0
5,9.0,0.0,0.0
20,68.0,8.0,10.526316


In the column 'quote_debt_in_percent' there are two outliers: 0% at 5 children and 2% at -1 children. As the number of customers with 5 children is very small, this deviation is not surprising (if a single additional customer with 5 children and debts were added, the share would jump to 10%). The row with -1 children can't be trusted since a bank won't record fewer than 0 children for a customer. As I already had problems with the minus sign in the column 'days_employed', it's possible that the system processing the data entered by the customer or the bank employee is again at fault. The number of customers decreases as the number of children grows: there are only 9 customers with 5 children and none with 6, 7, 8, ... children. A group of 76 customers with 20 children is therefore implausible, so I exclude that row as well.

In [32]:
# keep only the rows for 0 to 5 children, which drops the rows for -1 and 20 children
kids_debt_table = kids_debt_table.loc[0:5]

kids_debt_table

children,debt=0,debt=1,quote_debt_in_percent
0,12768.0,1061.0,7.672283
1,4307.0,444.0,9.345401
2,1845.0,194.0,9.514468
3,302.0,27.0,8.206687
4,36.0,4.0,10.0
5,9.0,0.0,0.0


In [33]:
# number of customers with children
df_kids = df[(df['children'] > 0) & (df['children'] < 6)]
number_kids = len(df_kids)
# number of customers with children and debts
number_kids_debt = len(df_kids[df_kids['debt'] == 1])
# share of customers with children having debts
quote_kids = number_kids_debt / number_kids
# number of customers without children
df_no_kids = df[df['children'] == 0]
number_no_kids = len(df_no_kids)
# number of customers without children and debts
number_no_kids_debt = len(df_no_kids[df_no_kids['debt'] == 1])
# share of customers without children having debts
quote_no_kids = number_no_kids_debt / number_no_kids

In [34]:
# The following calculations exclude the rows with -1 and 20 children:
print("{:.2%} of all customers that have no children have debts.".format(quote_no_kids))
print("{:.2%} of all customers that have children have debts.".format(quote_kids))
print("Relative to the childless customers, the share of customers with debts among")
print("parents is higher by {:.2%}.".format(quote_kids / quote_no_kids - 1))

7.67% of all customers that have no children have debts.
9.33% of all customers that have children have debts.
Relative to the childless customers, the share of customers with debts among
parents is higher by 21.65%.


#### Conclusion

Looking at the share of customers with debts in the rows with 0 to 4 children, I observe a slight, though not strictly monotonic, increase with the number of children (the share dips to 8.2% at 3 children). Overall, the share rises from 7.67% for customers without children to 9.33% for customers with children, a relative increase of 21.65% (about 1.7 percentage points). Given the large sample of over 20,000 customers, I find that having children is associated with a roughly 22% higher probability of having debts, with a slight upward trend as the number of children increases.
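
Since 'debt' only takes the values 0 and 1, the share of customers with debts per group is simply the mean of that column within the group. The sketch below shows this with `groupby` and named aggregation on toy data (an assumption, not the bank data); the output column names `customers`, `debtors` and `debt_rate` are made up for this example.

```python
import pandas as pd

# Toy data: number of children and debt flag per customer (illustrative).
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 2, 5],
    "debt":     [0, 0, 0, 1, 0, 1, 1, 0, 0],
})

# count = group size, sum = number of debtors, mean = share of debtors.
summary = toy.groupby("children")["debt"].agg(
    customers="count",
    debtors="sum",
    debt_rate="mean",
)

print(summary)
```

A group without any debtors gets a rate of 0 here rather than a missing value, so no manual correction of empty cells is needed.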

<a id="3.2"></a>

### Relation between marital status and repaying a loan on time

Is there a relation between marital status and repaying a loan on time?

In [35]:
# create a pivot table grouped by family status including the percentage fractions of debtors
family_debt_table = df.pivot_table(index = 'family_status', columns = 'debt', 
                      values = 'dob_years', aggfunc = 'count')
family_debt_table['quote_debt_in_percent'] = family_debt_table[1] / ( family_debt_table[0] + family_debt_table[1] ) * 100

In [36]:
# number of customers with certain family status
widowed = len(df[df['family_status'] == 'widow / widower'])
divorced = len(df[df['family_status'] == 'divorced'])
married = len(df[df['family_status'] == 'married'])
civil = len(df[df['family_status'] == 'civil partnership'])
unmarried = len(df[df['family_status'] == 'unmarried'])

In [37]:
# number of customers with certain family status having debts
widowed_debt = len(df[(df['family_status'] == 'widow / widower') & (df['debt'] == 1)])
divorced_debt = len(df[(df['family_status'] == 'divorced') & (df['debt'] == 1)])
married_debt = len(df[(df['family_status'] == 'married') & (df['debt'] == 1)])
civil_debt = len(df[(df['family_status'] == 'civil partnership') & (df['debt'] == 1)])
unmarried_debt = len(df[(df['family_status'] == 'unmarried') & (df['debt'] == 1)])

In [38]:
# calculating the share of debtors in the two groups
group_1_quote = (widowed_debt+divorced_debt+married_debt)/(widowed+divorced+married)
group_2_quote = (civil_debt+unmarried_debt)/(civil+unmarried)

In [39]:
family_debt_table.sort_values('quote_debt_in_percent')

family_status,debt=0,debt=1,quote_debt_in_percent
widow / widower,880,63,6.680806
divorced,1108,85,7.124895
married,11147,929,7.692945
civil partnership,3736,388,9.408341
unmarried,2510,274,9.841954
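
Because `debt` is coded as 0/1, the share of debtors in a group equals the group mean of that column. The following is a minimal sketch of this alternative, not the method used in the notebook; the frame `toy` is a hypothetical stand-in for `df`, rebuilt from the counts in the table above.

```python
import pandas as pd

# Hypothetical stand-in for df, rebuilt from the (no debt, debt) counts shown above
counts = {
    'widow / widower': (880, 63),
    'divorced': (1108, 85),
    'married': (11147, 929),
    'civil partnership': (3736, 388),
    'unmarried': (2510, 274),
}
rows = []
for status, (no_debt, with_debt) in counts.items():
    rows += [(status, 0)] * no_debt + [(status, 1)] * with_debt
toy = pd.DataFrame(rows, columns=['family_status', 'debt'])

# debt is 0/1, so the mean per group is the share of debtors
debt_rate = toy.groupby('family_status')['debt'].mean().mul(100).sort_values()
print(debt_rate.round(2))
```

Applied to the real `df`, the same `groupby(...)['debt'].mean()` call would presumably replace both the pivot table and the manual `len(...)` counts.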


In [40]:
print("{:.2%} of the widowed customers have debts."
      .format(family_debt_table.loc['widow / widower', 'quote_debt_in_percent']/100))
print("{:.2%} of the divorced customers have debts."
      .format(family_debt_table.loc['divorced', 'quote_debt_in_percent']/100))
print("{:.2%} of the married customers have debts."
      .format(family_debt_table.loc['married', 'quote_debt_in_percent']/100))
print("{:.2%} of the customers in a civil partnership have debts."
      .format(family_debt_table.loc['civil partnership', 'quote_debt_in_percent']/100))
print("{:.2%} of the unmarried customers have debts."
      .format(family_debt_table.loc['unmarried', 'quote_debt_in_percent']/100))

6.68% of the widowed customers have debts.
7.12% of the divorced customers have debts.
7.69% of the married customers have debts.
9.41% of the customers in a civil partnership have debts.
9.84% of the unmarried customers have debts.


In [41]:
print("Let's divide these five family statuses into two groups:")
print("In the group of widowed, divorced and married customers, the overall debt rate is {:.2%}."
     .format(group_1_quote))
print("In the group of customers who are in a civil partnership or unmarried, the overall debt rate is {:.2%}."
     .format(group_2_quote))
print("The higher rate corresponds to a relative increase of {:.2%} over the lower rate."
     .format(group_2_quote / group_1_quote - 1))

Let's divide these five family statuses into two groups:
In the group of widowed, divorced and married customers, the overall debt rate is 7.58%.
In the group of customers who are in a civil partnership or unmarried, the overall debt rate is 9.58%.
The higher rate corresponds to a relative increase of 26.46% over the lower rate.


#### Conclusion

Since the largest gap between adjacent family statuses (approx. 1.7 percentage points) lies between 'married' and 'civil partnership', I split all customers into two groups at this point. <br> Given the large number of customers in both groups (group 1: widowed, divorced and married; group 2: civil partnership and unmarried) and the relative increase of 26.46% between them, I find a higher probability of having debts for customers in group 2 than in group 1. <br>
With these two groups defined, one can observe different probabilities of repaying the loan on time.
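
The two group rates and the 26.46% figure can be reproduced by hand from the counts in the family status table. The sketch below also separates the absolute difference (in percentage points) from the relative increase; the dictionaries are copied from the table and are not objects created in the notebook.

```python
# (customers, debtors) per family status, taken from the table above
group_1 = {'widow / widower': (943, 63), 'divorced': (1193, 85), 'married': (12076, 929)}
group_2 = {'civil partnership': (4124, 388), 'unmarried': (2784, 274)}

def debt_rate(group):
    """Share of debtors across all statuses in a group."""
    total = sum(n for n, _ in group.values())
    debtors = sum(d for _, d in group.values())
    return debtors / total

rate_1 = debt_rate(group_1)
rate_2 = debt_rate(group_2)

point_diff = (rate_2 - rate_1) * 100     # absolute difference in percentage points
relative_increase = rate_2 / rate_1 - 1  # relative increase, as reported in the text

print("{:.2%} vs. {:.2%}".format(rate_1, rate_2))
print("{:.2f} percentage points, {:.2%} relative".format(point_diff, relative_increase))
```

The absolute gap between the groups is about 2 percentage points; the 26.46% describes how much larger the group 2 rate is relative to the group 1 rate.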

<a id="3.3"></a>

### Relation between income level and repaying a loan on time

Is there a relation between income level and repaying a loan on time?

In [42]:
# number of customers in each third of total income
lower_third = len(df[df['income_third'] == 'lower third'])
middle_third = len(df[df['income_third'] == 'middle third'])
upper_third = len(df[df['income_third'] == 'upper third'])

In [43]:
# number of customers with debts in each third of total income
lower_third_debt = len(df[(df['income_third'] == 'lower third') & (df['debt'] == 1)])
middle_third_debt = len(df[(df['income_third'] == 'middle third') & (df['debt'] == 1)])
upper_third_debt = len(df[(df['income_third'] == 'upper third') & (df['debt'] == 1)])

In [44]:
# number of customers in each income level (given as number of digits)
digit4 = len(df[df['income_level'] == '4-digit'])
digit5 = len(df[df['income_level'] == '5-digit'])
digit6 = len(df[df['income_level'] == '6-digit'])

In [45]:
# number of customers with debts in each income level
digit4_debt = len(df[(df['income_level'] == '4-digit') & (df['debt'] == 1)])
digit5_debt = len(df[(df['income_level'] == '5-digit') & (df['debt'] == 1)])
digit6_debt = len(df[(df['income_level'] == '6-digit') & (df['debt'] == 1)])

In [46]:
# combine both income conditions in order to calculate more detailed debt rates
lower_third_digit5 = len(df[(df['income_level'] == '5-digit') & (df['income_third'] == 'lower third')])
middle_third_digit5 = len(df[(df['income_level'] == '5-digit') & (df['income_third'] == 'middle third')])
upper_third_digit5 = len(df[(df['income_level'] == '5-digit') & (df['income_third'] == 'upper third')])
lower_third_digit5_debt = len(df[(df['income_level'] == '5-digit') & (df['income_third'] == 'lower third') & (df['debt'] == 1)])
middle_third_digit5_debt = len(df[(df['income_level'] == '5-digit') & (df['income_third'] == 'middle third')&(df['debt'] == 1)])
upper_third_digit5_debt = len(df[(df['income_level'] == '5-digit') & (df['income_third'] == 'upper third') & (df['debt'] == 1)])

In [47]:
# all the previously defined numbers are shown in pivot tables that include the debt rates
income_debt_table = df.pivot_table(index = ['income_level', 'income_third'], columns = 'debt', 
                      values = 'dob_years', aggfunc = 'count')
income_debt_table['quote_debt_in_percent'] = income_debt_table[1] / ( income_debt_table[0] + income_debt_table[1] )*100

income_debt_table_2 = df.pivot_table(index = 'income_third', columns = 'debt', 
                      values = 'dob_years', aggfunc = 'count')
income_debt_table_2['quote_debt_in_percent'] = income_debt_table_2[1] / ( income_debt_table_2[0] + income_debt_table_2[1] )*100

income_debt_table_3 = df.pivot_table(index = 'income_level', columns = 'debt', 
                      values = 'dob_years', aggfunc = 'count')
income_debt_table_3['quote_debt_in_percent'] = income_debt_table_3[1] / ( income_debt_table_3[0] + income_debt_table_3[1] )*100

In the categorisation of the data I defined two columns regarding the total income. I use these columns to create pivot tables getting an overview on the debts.
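
The two income columns are built in the categorisation step and are not recomputed here. As a rough sketch of how such columns could be derived: the helper functions, the sample incomes and the tertile cut-offs below are assumptions for illustration, and the cut-offs actually used in the notebook may differ.

```python
from statistics import quantiles

def income_level(income):
    """Label an income by the number of digits before the decimal point."""
    return "{}-digit".format(len(str(int(income))))

def income_third(income, lower_cut, upper_cut):
    """Label an income as lower, middle or upper third given two cut-offs."""
    if income <= lower_cut:
        return 'lower third'
    if income <= upper_cut:
        return 'middle third'
    return 'upper third'

# Made-up incomes for illustration only
incomes = [9500.0, 17932.8, 23341.7, 28000.0, 40620.1, 120000.0]
lower_cut, upper_cut = quantiles(incomes, n=3)  # tertile cut-offs

for value in incomes:
    print(value, income_level(value), income_third(value, lower_cut, upper_cut))
```

Note that the digit-based grouping depends only on the magnitude of the income, so its groups can be very unequal in size, whereas the thirds are roughly balanced by construction.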

In [48]:
income_debt_table.sort_values('quote_debt_in_percent')

Unnamed: 0_level_0,debt,0,1,quote_debt_in_percent
income_level,income_third,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
6-digit,upper third,93,6,6.060606
4-digit,lower third,868,58,6.263499
5-digit,upper third,6396,522,7.545533
5-digit,lower third,5591,523,8.554138
5-digit,middle third,6433,630,8.919722


In [49]:
income_debt_table_2.sort_values('quote_debt_in_percent')

debt,0,1,quote_debt_in_percent
income_third,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
upper third,6489,528,7.524583
lower third,6459,581,8.252841
middle third,6433,630,8.919722


In [50]:
income_debt_table_3.sort_values('quote_debt_in_percent')

debt,0,1,quote_debt_in_percent
income_level,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
6-digit,93,6,6.060606
4-digit,868,58,6.263499
5-digit,18420,1675,8.335407


In [51]:
print("{:.2%} of the customers in the lower third of income have debts.".format(lower_third_debt/lower_third))
print("{:.2%} of the customers in the middle third of income have debts.".format(middle_third_debt/middle_third))
print("{:.2%} of the customers in the upper third of income have debts.".format(upper_third_debt/upper_third))
print()
print("{:.2%} of the customers with a 4-digit income have debts.".format(digit4_debt/digit4))
print("{:.2%} of the customers with a 5-digit income have debts.".format(digit5_debt/digit5))
print("{:.2%} of the customers with a 6-digit income have debts.".format(digit6_debt/digit6))

8.25% of the customers in the lower third of income have debts.
8.92% of the customers in the middle third of income have debts.
7.52% of the customers in the upper third of income have debts.

6.26% of the customers with a 4-digit income have debts.
8.34% of the customers with a 5-digit income have debts.
6.06% of the customers with a 6-digit income have debts.


In [52]:
print("Regarding the number of digits, the relative increase from the 6-digit to the 5-digit group is {:.2%}."
     .format((digit5_debt/digit5)/(digit6_debt/digit6) - 1))
print()
print("{:.2%} of all customers with a 5-digit income in the lower third have debts."
     .format(lower_third_digit5_debt/lower_third_digit5))
print("{:.2%} of all customers with a 5-digit income in the middle third have debts."
     .format(middle_third_digit5_debt/middle_third_digit5))
print("{:.2%} of all customers with a 5-digit income in the upper third have debts."
     .format(upper_third_digit5_debt/upper_third_digit5))

Regarding the number of digits, the relative increase from the 6-digit to the 5-digit group is 37.53%.

8.55% of all customers with a 5-digit income in the lower third have debts.
8.92% of all customers with a 5-digit income in the middle third have debts.
7.55% of all customers with a 5-digit income in the upper third have debts.


#### Conclusion

Overall, the differences between the shares of customers with debts in the groups I defined are small. The biggest difference relates to the number of digits in total income: <br> At approximately 6%, the share of customers with a 4- or 6-digit income who are in debt is smaller than for the group with a 5-digit income (8.34%); relative to the 6-digit group, this is an increase of up to 37.53%. <br> Judging by the number of digits in total income, the income level does appear to have an effect on repaying loans on time.
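
The income-level groups differ greatly in size (99 customers in the 6-digit group versus 20,095 in the 5-digit group), so the precision of their debt rates differs as well. One way to gauge this, sketched here as an addition rather than as part of the original analysis, is an approximate 95% Wilson score interval per group; the function `wilson_interval` is a hypothetical helper and the counts are copied from the table above.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# (debtors, customers) per income level, taken from the table above
groups = {'4-digit': (58, 926), '5-digit': (1675, 20095), '6-digit': (6, 99)}

for level, (debtors, total) in groups.items():
    low, high = wilson_interval(debtors, total)
    print("{}: {:.2%} (95% interval {:.2%} to {:.2%})".format(level, debtors / total, low, high))
```

With only 99 customers, the interval for the 6-digit group is wide and includes the 5-digit rate of 8.34%, which is worth keeping in mind when reading the 37.53% figure.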

<a id="3.4"></a>

### Loan purposes affect on-time repayment of the loan

How do different loan purposes affect on-time repayment of the loan?

In [53]:
# create two pivot tables from df and df_2 since df_2 includes more detailed purposes
purpose_debt_table_1 = df.pivot_table(index = 'purpose', columns = 'debt', 
                      values = 'dob_years', aggfunc = 'count')
purpose_debt_table_1['quote_debt_in_percent'] = purpose_debt_table_1[1] / ( 
    purpose_debt_table_1[0] + purpose_debt_table_1[1] ) * 100
purpose_debt_table_2 = df_2.pivot_table(index = 'purpose', columns = 'debt', 
                      values = 'dob_years', aggfunc = 'count')
purpose_debt_table_2['quote_debt_in_percent'] = purpose_debt_table_2[1] / ( 
    purpose_debt_table_2[0] + purpose_debt_table_2[1] ) * 100

In [54]:
# define the total number of customers having purposes regarding cars for each table
total_car_table_1 = purpose_debt_table_1.loc['car', 0]+purpose_debt_table_1.loc['car', 1]
total_car_table_2 = purpose_debt_table_2.loc['new car', 0]+purpose_debt_table_2.loc['new car', 1] \
                    +purpose_debt_table_2.loc['second-hand car', 0]+purpose_debt_table_2.loc['second-hand car', 1]

In [55]:
# count all customers having debts regarding property purposes in purpose_debt_table_2
debt_property_table_2 = purpose_debt_table_2.loc['housing renovation', 1] \
                        +purpose_debt_table_2.loc['property', 1] \
                        +purpose_debt_table_2.loc['commercial real estate', 1]

In [56]:
# calculate a corrected value for the rate of customers having debts that want to renovate a house
new_quote_renovation = purpose_debt_table_2.loc['housing renovation', 1]/(purpose_debt_table_2.loc['housing renovation', 0]-56)


In [57]:
purpose_debt_table_1.sort_values('quote_debt_in_percent')

debt,0,1,quote_debt_in_percent
purpose,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
property,9797,781,7.383248
wedding,2120,186,8.065915
education,3594,370,9.334006
car,3870,402,9.410112


In [58]:
purpose_debt_table_2.sort_values('quote_debt_in_percent')

debt,0,1,quote_debt_in_percent
purpose,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
housing renovation,572,35,5.766063
property,8069,647,7.423130
commercial real estate,1212,99,7.551487
wedding,2120,186,8.065915
second-hand car,876,87,9.034268
education,3594,370,9.334006
new car,3007,315,9.482240


The second table gives a more detailed view of the purposes, but it also contains 69 duplicates that must be taken into account.

In [59]:
print("The first table has {} customers with the purpose 'car'.".format(total_car_table_1))
print("When combining the number of customers with the purposes 'new car' and")
print("'second-hand car' in the second table, there are {} customers.".format(total_car_table_2))
print("The difference between these two counts is {}.".format(abs(total_car_table_1-total_car_table_2)))
print("This means that the remaining 56 duplicates concern the property purposes.")

The first table has 4272 customers with the purpose 'car'.
When combining the number of customers with the purposes 'new car' and
'second-hand car' in the second table, there are 4285 customers.
The difference between these two counts is 13.
This means that the remaining 56 duplicates concern the property purposes.


In [60]:
print("To account for these 56 duplicates, let's assume that they all concern housing renovations.")
print("The number of customers with debts for property purposes is the same in both tables: {} customers."
      .format(debt_property_table_2))
print("I can therefore assume that the 56 duplicates are housing renovation customers without debts.")

To account for these 56 duplicates, let's assume that they all concern housing renovations.
The number of customers with debts for property purposes is the same in both tables: 781 customers.
I can therefore assume that the 56 duplicates are housing renovation customers without debts.
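
The duplicate bookkeeping can be cross-checked directly from the counts printed in the two purpose tables. This is a small sketch with hand-copied numbers; the dictionaries and the helper `totals` are illustrative and not objects from the notebook.

```python
# (no debt, debt) counts per purpose, taken from the two tables above
table_1 = {'property': (9797, 781), 'car': (3870, 402)}
table_2 = {
    'housing renovation': (572, 35),
    'property': (8069, 647),
    'commercial real estate': (1212, 99),
    'new car': (3007, 315),
    'second-hand car': (876, 87),
}

def totals(table, purposes):
    """Sum the no-debt and debt counts over the given purposes."""
    no_debt = sum(table[p][0] for p in purposes)
    debt = sum(table[p][1] for p in purposes)
    return no_debt, debt

prop_1 = totals(table_1, ['property'])
prop_2 = totals(table_2, ['housing renovation', 'property', 'commercial real estate'])
car_1 = totals(table_1, ['car'])
car_2 = totals(table_2, ['new car', 'second-hand car'])

print("extra property rows (no debt, debt):", prop_2[0] - prop_1[0], prop_2[1] - prop_1[1])
print("extra car rows (no debt, debt):", car_2[0] - car_1[0], car_2[1] - car_1[1])
```

The totals confirm that the extra rows are all customers without debts. Which of the three property purposes the 56 rows belong to cannot be read from these totals, so assigning them to housing renovation remains an assumption.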


In [61]:
print("Consequently, the debt rate of customers renovating a house increases to {:.2%}.".format(new_quote_renovation))
print("Housing renovation is still the purpose with the smallest share of customers with debts.")
print()
print("The highest of these rates corresponds to a relative increase of {:.2%} over this smallest rate."
     .format(0.01 * purpose_debt_table_1.loc['car', 'quote_debt_in_percent']/new_quote_renovation - 1))

Consequently, the debt rate of customers renovating a house increases to 6.78%.
Housing renovation is still the purpose with the smallest share of customers with debts.

The highest of these rates corresponds to a relative increase of 38.73% over this smallest rate.


#### Conclusion

In the first table you can see that customers buying cars with the loan have the highest share of customers with debts (9.41%). This share is almost the same for education purposes (9.33%), but lower for those who want to pay for a wedding (8.07%). Customers borrowing for property show the smallest share of customers with debts (7.38%). Taking the duplicates into account, I find that housing renovation has the smallest share (6.78%). <br> The highest of these rates corresponds to a relative increase of 38.73% over this smallest rate, so one can infer that the purpose does have an effect on repaying loans on time.

<a id="4"></a>

## General conclusion

By finding appropriate groups for the different customer properties, I could find a connection to repaying loans on time for every question. <br> Calculated from the given table, the biggest relative differences in the share of customers with debts are

- 21.65% concerning the number of children,
- 26.46% concerning the family status,
- 37.53% concerning the level of income and
- 38.73% concerning the purposes. 

All four percentages are of the same order of magnitude, so one can say that each of these four customer properties has a small but noticeable effect on repaying the loan on time.
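
Whether such differences are statistically significant was not tested in the notebook. As an illustrative sketch only, a pooled two-proportion z-test (normal approximation) can be applied to the two family status groups from section 3.2; the function name is hypothetical and the counts are summed from the family status table.

```python
from math import sqrt, erfc

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for the difference of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value under the normal approximation
    return z, p_value

# Debtors and customers: group 1 (widowed, divorced, married) vs. group 2 (civil partnership, unmarried)
z, p_value = two_proportion_z_test(1077, 14212, 662, 6908)
print("z = {:.2f}, p = {:.2g}".format(z, p_value))
```

The same function could be applied to the other comparisons, keeping in mind that the groups here were chosen after looking at the data, which makes such a test optimistic.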
