## Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children has an impact on whether they will default on a loan. The bank already has some data on customers’ credit worthiness.

Your report will be considered when building a **credit scoring** of a potential customer. A ** credit scoring ** is used to evaluate the ability of a potential borrower to repay their loan.

### Step 1. Open the data file and have a look at the general information. 

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv("/datasets/credit_scoring_eng.csv")
data.head()

    

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


### Conclusion

The column "days_employed" seems to represent technical errors.
Index number 4 would have been working for over 900 years. Also, an integer type would be more appropriate than a float.
The column represents a work start date, which is represented in a "Unix Time Stamp" ( Index 4 = 340266. 072047 = 01/04/1970 @ 10:31pm (UTC)) 
The column "total_income" shows 3 decimal places, strange representation of a currency.
The column "days_employd" and "total_income" have NaN values.
The column "dob_years" shows customers with the age 0 1 and 2 years, "age" would fit better for the columnname.
The column "children" shows -1 and 20. I suspect 20 is a typing error, since the 0 is directly under the 2 on the number pad.

### Step 2. Data preprocessing

### Processing missing values

In [2]:

data = pd.read_csv("/datasets/credit_scoring_eng.csv")
days_employed_median = data['days_employed'].median()
total_income_mean = data['total_income'].mean()
print(data.isnull().sum() * 100 / len(data))# check how much values are NaN's in % 
data["days_employed"] = data["days_employed"].fillna(value=days_employed_median)
data["total_income"] = data["total_income"].fillna(value=total_income_mean)
# fill NaN from days_employed and total_income with median and mean


children             0.000000
days_employed       10.099884
dob_years            0.000000
education            0.000000
education_id         0.000000
family_status        0.000000
family_status_id     0.000000
gender               0.000000
income_type          0.000000
debt                 0.000000
total_income        10.099884
purpose              0.000000
dtype: float64


### Conclusion

The NaN values in the "total_income" and "days_employed" columns were padded with the mean for "days_employed" and the median for "total_income". The column "days_employed" has been left unchanged. The NaN values in "total_income" are random, a 0 can be excluded, because it makes no sense to ask for a loan without income.

### Data type replacement

In [3]:
import pandas as pd
data = pd.read_csv("/datasets/credit_scoring_eng.csv")
days_employed_median = data['days_employed'].median()
total_income_mean = data['total_income'].mean()
data["days_employed"] = data["days_employed"].fillna(value=days_employed_median)
data["total_income"] = data["total_income"].fillna(value=total_income_mean)
# i need to use this again or get errors :/ 
data.days_employed = data.days_employed.astype(int)
#change the days_employed to integer type
data.rename(columns={'dob_years': 'age'}, inplace=True)
# rename the dob_years column with an easier understanding "age"
data.loc[data['children'] == 20] = 2
data.drop(data[data.children == -1].index, inplace=True)
# replace 20 children with 2 and drop -1 children
data.drop(data[data.age < 4].index, inplace=True)
# drop the ages 2 1 and 0 sum of them is 223, give them a value would impact the analysis


### Conclusion

First, I converted the float type to integer to match the task. After that I renamed the column “dob_years” to “age” for easier understanding. I have found 76 customers with 20 children and 47 with -1 children. The 20 turned to 2 and the deleted the -1 children. 223 clients had an age of 0, 1 or 2 years, this might affect the analysis, why I droped this data too. A single client has gender XNA.

### Processing duplicates

In [4]:
data['education'] = data['education'].str.lower()
#correct the education with the .lower() function to group them up

In [5]:
from nltk.stem import SnowballStemmer
english_stemmer = SnowballStemmer('english')
#create the stem
def stem(data): # create function
    for word in data['purpose'].split(" "):# for every word in the string in the column "purpose" splitet bei space
        stemmed_word = english_stemmer.stem(word)# use the word stem
        if stemmed_word == 'car': # if the condition: the string includes the word "car" group them 
            data['purpose_stem'] = 'car'
            return data #return the row
        elif stemmed_word == 'educ':
            data['purpose_stem'] = 'education'
            return data
        elif stemmed_word == 'estat':
            data['purpose_stem'] = 'estate'
            return data
        elif word == 'university': # there is onle 1 word and i grouped it with education
            data['purpose_stem'] = 'education'
            return data
        elif stemmed_word == 'properti':
            data['purpose_stem'] = 'estate'
            return data
        elif stemmed_word == 'wed':
            data['purpose_stem'] = 'wedding'
            return data
        elif stemmed_word == 'hous':
            data['purpose_stem'] = 'estate'
            return data
        
data = data.apply(stem, axis=1) # fill data in the function and run it

print(data["purpose_stem"].value_counts())
#clean 

#new column with purpose_stem for better overview and grouping different names with same mean.
data.head(5)

estate       10733
car           4267
education     3979
wedding       2323
Name: purpose_stem, dtype: int64


Unnamed: 0,children,days_employed,age,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,purpose_stem
0,1,-8437,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,estate
1,1,-4024,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,car
2,0,-5623,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,estate
3,3,-4124,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,education
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,wedding


### Conclusion

For a better overview I have adjusted the "education" with the lower case letters uniformly. I added the "purpose" column with the "purpose_stem" to better summarize them in a consistent way.

### Categorizing Data

In [6]:
def getlevel(total_income): # create function to categorize the income level bye the mean
        if total_income <= total_income_mean * 0.25: 
            return "low income"
        elif total_income <= total_income_mean * 0.50:
            return "medium income"
        elif total_income <= total_income_mean * 0.75:
            return "medium to high income"
        else:
            return "very high income"

data['income_level'] = data['total_income'].apply(getlevel) # fill the data in the function and create a new column in data
data['income_level'].value_counts()



very high income         13955
medium to high income     4711
medium income             2486
low income                 150
Name: income_level, dtype: int64

### Conclusion

The income_level has a heavy head and a light tail. This could be an indication that the bank segments its customers and only cares for "good" customers.The currency and country is unknown, so I have to access the averages. Comparing it to google data would perhaps skew the analysis. 

### Step 3. Answer these questions

- Is there a relation between having kids and repaying a loan on time?

In [7]:


print(data.groupby('children').agg({'debt':['sum', 'count', 'mean']}))
pivoting_children = data.pivot_table(index=['children'], values='debt', aggfunc='mean')

pivoting_children_with_children = pivoting_children[1:5].mean() # get the average for the customer with children

pivoting_children_without_children = pivoting_children[0:1] # put the customer without children in a variable for calculating



result1 = 1 - pivoting_children_without_children["debt"] / pivoting_children_with_children["debt"]



print("The risk of default debt for customers with children and customers without children, compared with the average is:{: >.1%}".format(result1[0]))



          debt                 
           sum  count      mean
children                       
0         1058  14080  0.075142
1          441   4802  0.091837
2          194   2042  0.095005
3           27    328  0.082317
4            4     41  0.097561
5            0      9  0.000000
The risk of default debt for customers with children and customers without children, compared with the average is:18.0%


### Conclusion

I drop the values for 5 children because they are too small to evaluate and 0 would affect the result too much. Customers with children actually have an 18% higher risk than customers without children. Children may be a negative cash flow, which has an impact on the default of payments.

- Is there a relation between marital status and repaying a loan on time?

In [8]:
grouped_family_data = data.groupby('family_status_id').agg({'debt':['sum', 'count', 'mean']}) # group data by famaly id, because its not a string
grouped_family_data.insert(3, "Status", ["married", "civil_partnership", "widow_widower", "divorced", "unmarried"], True) #add a column for better overview
pivoting_family = data.pivot_table(index=['family_status_id'], values='debt', aggfunc='mean')
family_mean = pivoting_family.mean() # get the family status mean


married_dif = 1 - pivoting_family.loc[0] / family_mean # calculate for married status the ratio 
civil_partnership_dif = (1 - pivoting_family.loc[1]  / family_mean) * -1 # calculate for civil partnerships ratio and change negative to positive number
widow_widower_dif = 1 - pivoting_family.loc[2] / family_mean # for better overview
divorced_dif = 1 - pivoting_family.loc[3] / family_mean  
unmarried_dif = (1 - pivoting_family.loc[4]  / family_mean) * -1



print(grouped_family_data)

print("The risk of default debt for familystatus married is {: >.1%} lower compare to the average.".format(married_dif[0]))
print("The risk of default debt for familystatus civil partnership is {: >.1%} higher compare to the average.".format(civil_partnership_dif[0]))
print("The risk of default debt for familystatus widow/widower is {: >.1%} lower compare to the average.".format(widow_widower_dif[0]))
print("The risk of default debt for familystatus divorced is {: >.1%} lower compare to the average.".format(divorced_dif[0]))
print("The risk of default debt for familystatus unmarried is {: >.1%} higher compare to the average.".format(unmarried_dif[0]))

                 debt                              Status
                  sum  count      mean                   
family_status_id                                         
0                 923  12254  0.075322            married
1                 383   4139  0.092534  civil_partnership
2                  62    947  0.065470      widow_widower
3                  84   1179  0.071247           divorced
4                 272   2783  0.097736          unmarried
The risk of default debt for familystatus married is 6.4% lower compare to the average.
The risk of default debt for familystatus civil partnership is 15.0% higher compare to the average.
The risk of default debt for familystatus widow/widower is 18.6% lower compare to the average.
The risk of default debt for familystatus divorced is 11.5% lower compare to the average.
The risk of default debt for familystatus unmarried is 21.5% higher compare to the average.


### Conclusion

There is a relation between the debt ratio and the marital status and repaying the loan on time. 
Divorced, married and widows/widowers seem to be less default than unmarried and in a partnership. I suspect that married people may be both responsible for repaying the loan or at least make an effort. This allows a spouse to support the partner financially. In the case of divorced people, the financial situation also seems to have been clarified in good time, and in some cases they are even regulated by law. In the case of widows/widowers, external help may be possible through a widow’s pension, which influences credit agreements. Unmarried people usually do not have a partner to support them, which is why the risk of failure is greatly increased. Even in partnerships, financial relationships seem to be unclear, which is why it can lead to failures due to the lack of structure and law.

- Is there a relation between income level and repaying a loan on time?

In [9]:
print(data.groupby('income_level').agg({'debt':['sum', 'count', 'mean']}))
pivoting_income_level = data.pivot_table(index=['income_level'], values='debt', aggfunc='mean')
income_mean = pivoting_income_level.mean() # get the average

low_income = 1 - pivoting_income_level.loc["low income"] / income_mean # compare the average
medium_income = 1 - pivoting_income_level.loc["medium income"] / income_mean  # compare the average
medium_to_high_income = (1 - pivoting_income_level.loc["medium to high income"] / income_mean ) *-1 # compare the average change to positive
very_high_income = (1 - pivoting_income_level.loc["very high income"] / income_mean ) * -1 # compare the average change to positive


print("The risk of default debt for the income level low is {: >.1%} lower compare to the average.".format(low_income["debt"])) 
print("The risk of default debt for the income level low is {: >.1%} lower compare to the average.".format(medium_income["debt"]))
print("The risk of default debt for the income level low is {: >.1%} higher compare to the average.".format(medium_to_high_income["debt"]))
print("The risk of default debt for the income level low is {: >.1%} higher compare to the average.".format(very_high_income["debt"]))


                       debt                 
                        sum  count      mean
income_level                                
low income               11    150  0.073333
medium income           194   2486  0.078037
medium to high income   401   4711  0.085120
very high income       1118  13955  0.080115
The risk of default debt for the income level low is 7.4% lower compare to the average.
The risk of default debt for the income level low is 1.4% lower compare to the average.
The risk of default debt for the income level low is 7.5% higher compare to the average.
The risk of default debt for the income level low is 1.2% higher compare to the average.


### Conclusion

It seems to have a minimal impact on how high the income_level is. Compared to the average, the group with the lowest income performs best. Unfortunately, the data are very small to provide a reliable conclusion. I would assume that income_level has little impact on credit scores and other factors are more crucial.

- How do different loan purposes affect on-time repayment of the loan?

In [10]:
grouped_purpose_data = data.groupby('purpose_stem').agg({'debt':['sum', 'count', 'mean']})

pivoting_purpose_stem = data.pivot_table(index=['purpose_stem'], values='debt', aggfunc='mean')
purpose_stem_mean = pivoting_purpose_stem.mean()

car_stem = (1 - pivoting_purpose_stem.loc["car"] / purpose_stem_mean) * -1
education_stem = (1 - pivoting_purpose_stem.loc["education"]  / purpose_stem_mean) * -1
estate_stem = 1 - pivoting_purpose_stem.loc["estate"] / purpose_stem_mean 
#house_stem = 1 - pivoting_purpose_stem.loc["house"] / purpose_stem_mean  
#property_stem = 1 - pivoting_purpose_stem.loc["property"]  / purpose_stem_mean
wedding_stem = 1 - pivoting_purpose_stem.loc["wedding"]  / purpose_stem_mean

print(grouped_purpose_data)
print('The risk of default debt for purpose stem "car" is {: >.1%} higher compare to the average.'.format(car_stem[0]))
print('The risk of default debt for purpose stem "education" is {: >.1%} higher compare to the average.'.format(education_stem[0]))
print('The risk of default debt for purpose stem "estate" is {: >.1%} lower compare to the average.'.format(estate_stem[0]))
#print('The risk of default debt for purpose stem "house" is {: >.1%} lower compare to the average.'.format(house_stem[0]))
#print('The risk of default debt for purpose stem "property" is {: >.1%} lower compare to the average.'.format(property_stem[0]))
print('The risk of default debt for purpose stem "wedding" is {: >.1%} lower compare to the average.'.format(wedding_stem[0]))

             debt                 
              sum  count      mean
purpose_stem                      
car           397   4267  0.093040
education     369   3979  0.092737
estate        777  10733  0.072394
wedding       181   2323  0.077916
The risk of default debt for purpose stem "car" is 10.7% higher compare to the average.
The risk of default debt for purpose stem "education" is 10.4% higher compare to the average.
The risk of default debt for purpose stem "estate" is 13.8% lower compare to the average.
The risk of default debt for purpose stem "wedding" is 7.3% lower compare to the average.


### Conclusion

It looks like specifying "car and "education" bring a slightly higher risk. Perhaps the transition between an education and paying back the loan is a bit of a hurdle. Nevertheless, the differences between the purposes are very small. Purpose selection does not appear to be an appropriate criterion for determining creditworthiness.

### Step 4. General conclusion

I cannot see any reliable correlation between the given criteria. Possibly the amount of data is too small or other criteria have a much greater impact on a customer’s creditworthiness. The heavy head of the “income_level” suggests a segmentation of customers, since income does not fit into a normal distribution (Gaussian curve). The fluctuations in all tasks are very small and could be randomly.