# Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building a **credit score** of a potential customer. A **credit score** is used to evaluate the ability of a potential borrower to repay their loan.

## Open the data file and have a look at the general information. 

In [1]:
#The overall goal is to find out whether marital status and number of children have an impact on whether a customer repays a loan on time.
#To reach this goal, I'll have to find any errors in the given data and fix them, then present the results clearly.

#import pandas, numpy, and nltk
import pandas as pd
import numpy as np
import nltk
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
wordnet_lemma = WordNetLemmatizer() 

#Snowball stemmer (not used in the cells below; the purpose column is categorized with the lemmatizer instead)
english_stemmer = SnowballStemmer('english')

#read the input data, falling back to the /datasets path if the local file is missing
try:
    df = pd.read_csv('credit_scoring_eng.csv')
except FileNotFoundError:
    df = pd.read_csv('/datasets/credit_scoring_eng.csv')
    
df.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


### Conclusion

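There are negative numbers under days_employed, which may be a data-entry error. There is also at least one implausibly large value (340266 days, which is over 900 years). Both will need to be fixed later.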
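A quick way to size up these anomalies is to count the negative entries and convert the magnitudes to years. The sketch below uses a hypothetical frame named `sample`, filled with four values copied from the preview above, as a stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df: four days_employed values copied from the preview.
sample = pd.DataFrame({
    "days_employed": [-8437.673028, -4024.803754, 340266.072047, -926.185831]
})

# Count the negative entries.
n_negative = int((sample["days_employed"] < 0).sum())

# Express the magnitudes in years to judge whether they are plausible.
years = sample["days_employed"].abs() / 365

print(n_negative)
print(years.round(1).tolist())
```

Anything far above a working lifetime of roughly 50 years stands out immediately in the second printout.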

## Data preprocessing

### Processing missing values

In [2]:
#used to count the missing values; it showed that 2174 were missing
#df.isna().sum()

#the columns where numbers were missing
missing_num = ['days_employed', 'total_income']

#this shows the means and medians of the columns with missing values so we could potentially use them for filling in.
#the result shows that there are too many negative values, which will need to be fixed. Days employed doesn't make sense as a negative number.
#print('These are the means:')
#print(df.groupby('family_status')[missing_num].mean())
#print()
#print('These are the medians:')
#print(df.groupby('family_status')[missing_num].median())

#making the negative numbers in 'days_employed' positive
df['days_employed'] = df['days_employed'].abs()

#this shows that the negative numbers are gone for days_employed,
#but also that some customers are listed with a negative number of children or with 20 children.
#assuming these are typos (-1 meant 1 and 20 meant 2), I'll fix them below
#print('These are the means:')
#print(df.groupby('children')[missing_num].mean())
#print()
#print('These are the medians:')
#print(df.groupby('children')[missing_num].median())

#converting 20 to 2, and converting -1 to 1
df.loc[df['children'] == 20, 'children'] = 2
df.loc[df['children'] == -1, 'children'] = 1

#makes everything lowercase in education for easier organization later
df['education'] = df['education'].str.lower()

#there is a third gender type, xna
#print(df['gender'].value_counts())

#deleting the entire row that consists of XNA, because there is only one row that has xna
df = df.loc[df['gender'] != 'XNA']

#checking to see if xna is still there
#print(df['gender'].value_counts())

#print anything above 365*50 to see how outrageous the values can be and how many we may need to deal with
#print(df['days_employed'].loc[df['days_employed']>365*50])

#replacing any number greater than 365*50 with 365*50 so we don't get any outrageous numbers
#(a single .loc call avoids chained assignment, which may silently fail to modify df)
df.loc[df['days_employed'] > 365*50, 'days_employed'] = 365*50

#checking to see if the numbers replaced and to see if any errors occurred 
#print(df['days_employed'].loc[df['days_employed']>=365*50])



#After looking at the output, I realized that the median works well for every group EXCEPT widows. For some reason,
#their days_employed values are outrageously high, higher than their actual age. Because the values look so random, I'm
#not sure how to handle them fully: age comes into play, and it's hard to judge by days employed alone which rows to omit.
#This will probably need to be fixed in the next iteration; I'm just not sure how to do it yet
#print('These are the means:')
#print(df.groupby('family_status')[missing_num].mean())
#print()
#print('These are the medians:')
#print(df.groupby('family_status')[missing_num].median())

#filling NaN values with the median of each income_type group; the median is used because the means are too high, meaning there
#are outliers in the data. The widow group is still definitely off and will probably need fixing in the next iteration; I just don't know
#how to fix it currently
df['days_employed'] = df['days_employed'].fillna(df.groupby('income_type')['days_employed'].transform('median'))
df['total_income'] = df['total_income'].fillna(df.groupby('income_type')['total_income'].transform('median'))

#double checking to make sure there are no more missing values
df.isna().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

### Conclusion

Only days_employed and total_income have missing values. There were also a lot of negative numbers in days_employed, so fixing those was vital before anything else. I then filled the missing values with the median of the matching income_type group, because outliers pull the means too high. It turns out that days_employed and total_income are float values, so I will change their type in the next cell.

Another anomaly I noticed is that some of the values in days_employed are too high to be possible. This is most likely a data-entry error, and I handled it by using .loc to replace any number greater than 365*50 with 365*50.
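As an alternative to the `.loc` replacement, the same 50-year ceiling can be applied with `Series.clip`. This is only a sketch on a hypothetical three-value series named `days`, not the method used in the notebook.

```python
import pandas as pd

# Hypothetical sample; the 340266-day value mirrors one visible in the data preview.
days = pd.Series([8437.67, 340266.07, 926.19], name="days_employed")

max_days = 365 * 50  # the same 50-year ceiling used in the notebook

# clip(upper=...) caps values above the ceiling and leaves the rest untouched.
capped = days.clip(upper=max_days)

print(capped.tolist())
```

Note that capping keeps the affected rows but flattens them all to the same value, so any later statistic on days_employed for those rows reflects the ceiling rather than real data.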

There was also an issue with the gender column: it contained a third value, XNA, in addition to F and M. Luckily, only one row was affected, so removing it doesn't harm the overall data.

Human error is one possible cause of the missing values, but I doubt it, given how many negative numbers are listed. On top of that, I don't think it's a coincidence that days_employed and total_income are missing together in the same rows. That leads me to believe that the cause may be an error in the system used for entering data, or file corruption.

Blanks are filled with fillna, using the median of the matching income_type group.
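To make the fill step concrete, here is a minimal sketch of the same `groupby(...).transform('median')` pattern on made-up data. The frame `toy` and its values are hypothetical and chosen so the medians are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing income per income_type group.
toy = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "retiree", "retiree"],
    "total_income": [20000.0, 30000.0, np.nan, 15000.0, np.nan],
})

# transform('median') returns a Series aligned with toy, holding each row's group median.
group_median = toy.groupby("income_type")["total_income"].transform("median")

# Only the missing entries are replaced; existing values stay as they are.
toy["total_income"] = toy["total_income"].fillna(group_median)

print(toy)
```

The missing employee income becomes 25000 (the median of 20000 and 30000), and the missing retiree income becomes 15000.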

### Data type replacement

In [3]:
#checking what needs to be changed. Upon first glance, days_employed and total_income are both float64, when they should be 
#changed to int
#df.info()

#turning float values to int using astype
df['days_employed'] = df['days_employed'].astype('int')
df['total_income'] = df['total_income'].astype('int')

#checking to see if changes worked
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21524 entries, 0 to 21524
Data columns (total 12 columns):
children            21524 non-null int64
days_employed       21524 non-null int64
dob_years           21524 non-null int64
education           21524 non-null object
education_id        21524 non-null int64
family_status       21524 non-null object
family_status_id    21524 non-null int64
gender              21524 non-null object
income_type         21524 non-null object
debt                21524 non-null int64
total_income        21524 non-null int64
purpose             21524 non-null object
dtypes: int64(7), object(5)
memory usage: 2.1+ MB


### Conclusion

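I used astype to change float to int because it seemed to be the easiest way of doing so. I'm not sure whether the data type should have been int16 or int64, but I'll stick with the default int for now. Note that int16 tops out at 32767, which is below some total_income values in the preview (for example 40620), so int16 would not be safe for that column.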
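The choice of integer width can be explored with a small sketch. The series `income` below is a hypothetical sample built from three total_income values in the preview; `pd.to_numeric` with `downcast='integer'` picks the smallest signed integer type that holds every value.

```python
import pandas as pd

# Three total_income values copied from the data preview.
income = pd.Series([40620.102, 17932.802, 23341.752])

# astype drops the fractional part (it truncates, it does not round).
as_int64 = income.astype("int64")

# Let pandas choose the smallest signed integer dtype that fits the data.
downcast = pd.to_numeric(as_int64, downcast="integer")

print(as_int64.tolist())  # [40620, 17932, 23341]
print(downcast.dtype)     # int32, because 40620 does not fit in int16
```

For a dataset of this size the memory saving is small, so staying with the default integer type is a reasonable choice.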

### Processing duplicates

In [4]:
#finding how many duplicates we have. At this point, we have 71 duplicates
#df.duplicated().sum()

#deleting duplicates
df.drop_duplicates(inplace=True)

#double checking if all duplicates are deleted
#df.duplicated().sum()

#finding duplicate purposes
#df['purpose'].value_counts()

#lemmatizing words to make sure we have all the tokens for the if statements later
#text = 'educated'
#words = nltk.word_tokenize(text)
#words = [wordnet_lemma.lemmatize(w, pos= 'v') for w in words]
#words

#lemmatize function with if statements to replace purpose and make everything uniform.
def convert(purpose):
    words = nltk.word_tokenize(purpose)
    words = [wordnet_lemma.lemmatize(w, pos= 'v') for w in words]
    if 'estate' in words:
        return 'estate'
    elif 'education' in words or 'educated' in words or 'university' in words or 'educate' in words:
        return 'education'
    elif 'car' in words or 'cars' in words:
        return 'purchasing car'
    elif 'house' in words:
        return 'housing'
    elif 'wed' in words:
        return 'wedding'
    elif 'property' in words:
        return 'property'
    else: 
        return 'unknown purpose'

df['purpose'] = df['purpose'].apply(convert)
  
#confirmation that everything is replaced
df['purpose'].value_counts()

estate            4463
purchasing car    4306
education         4013
housing           3809
property          2538
wedding           2324
Name: purpose, dtype: int64

### Conclusion

I used df.duplicated().sum() to help locate if there were any duplicates in the table, and df.drop_duplicates(inplace=True) to remove them.
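The duplicate count depends on how the text columns are normalized first. The sketch below uses a made-up frame `toy` to show that rows differing only in letter case are not counted as duplicates until the text is lowercased, which is why the education column was lowercased earlier.

```python
import pandas as pd

# Made-up rows: the second differs from the others only in letter case.
toy = pd.DataFrame({
    "education": ["secondary education", "Secondary Education", "secondary education"],
    "dob_years": [36, 36, 36],
})

before = int(toy.duplicated().sum())  # only the exact repeat is counted

toy["education"] = toy["education"].str.lower()
after = int(toy.duplicated().sum())   # the case variant is now counted too

# reset_index(drop=True) is optional; it renumbers the rows after the drop.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(before, after, len(deduped))  # 1 2 1
```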

I'm not sure how duplicates would end up in a table like this. Most likely someone forgot that they had already entered a record and entered it again.

I also used lemmatization to group the many phrasings in the purpose column into a few categories, which makes things neater and easier to see.

### Categorizing Data

In [5]:
#creating a pivot table for organization and categorizing the important data
pivot_table_credit = df.pivot_table(index='family_status', columns = 'children', values = 
                                    ['debt', 'total_income'], aggfunc = 'sum', margins=True)


pivot_table_purpose = df.pivot_table(index='purpose', columns = 'children', values = 
                                    'debt', aggfunc = 'sum', margins=True)

pivot_table_all = df.pivot_table(index=['purpose', 'family_status'], columns = 'children', values =
                                       ['debt', 'total_income'], aggfunc = 'count', margins=True)

pivot_table_income = df.pivot_table(index='purpose', columns = 'debt', values = 
                                    'total_income', aggfunc = 'sum', margins=True)

pivot_table_purpose

children,0,1,2,3,4,5,All
purpose,,,,,,,
education,229.0,90.0,47.0,4.0,0.0,0.0,370
estate,206.0,90.0,34.0,5.0,1.0,0.0,336
housing,161.0,53.0,34.0,6.0,2.0,0.0,256
property,109.0,57.0,22.0,2.0,0.0,,190
purchasing car,243.0,104.0,50.0,5.0,1.0,0.0,403
wedding,115.0,51.0,15.0,5.0,0.0,0.0,186
All,1063.0,445.0,202.0,27.0,4.0,0.0,1741


### Conclusion

I used pivot tables to categorize the data by children, purpose, family status, and debt so that the values are easier to see and judgements are easier to make.

I made four pivot tables depending on what is needed. For general information, pivot_table_credit shows what's important. For the purpose of the customers' loans, pivot_table_purpose will be of use. pivot_table_income breaks total income down by purpose and debt. And pivot_table_all shows everything combined; it is a bit messy, but it can still work.

# Is there a relation between having kids and repaying a loan on time?

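There does seem to be a relation between having kids and repaying a loan. In the table, the fewer kids a customer has, the more late repayments there are.

Looking at the number of customers with debt by number of children, customers with no kids account for 1063 cases, compared to 445 for customers with only 1 kid. Throughout the table, the more kids a customer has, the fewer cases there are. These are raw counts, though, and the groups differ in size, so the counts alone do not show how likely a customer in each group is to repay late.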
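Because debt is a 0/1 flag, its sum is a count of customers with debt and its mean is the share of the group with debt. The sketch below, on a made-up frame `toy`, shows how a larger group can have more cases while having a lower rate. The numbers are illustrative only and are not taken from the bank's data.

```python
import pandas as pd

# Made-up data: six customers without kids, two customers with one kid.
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 0, 0, 1, 1],
    "debt":     [1, 1, 0, 0, 0, 0, 1, 0],
})

# count = group size, sum = customers with debt, mean = share of the group with debt.
summary = toy.groupby("children")["debt"].agg(["count", "sum", "mean"])

print(summary)
```

Here the group without kids has more cases (2 versus 1) but a lower rate (about 0.33 versus 0.5), so comparing the `mean` column is the safer basis for a conclusion.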

# Is there a relation between marital status and repaying a loan on time?

According to the table, married customers with 0 kids account for 515 cases of debt, compared to other groups such as divorced customers with 0 kids, who account for 55.

These numbers are not directly comparable because the group totals are different: there are more married customers than unmarried ones. In fact, unmarried customers with 0 kids account for 210 cases of debt, but the unmarried total is only 274. That means that 76% of unmarried customers with no kids will most likely not repay the loan on time.

# Is there a relation between income level and repaying a loan on time?

Yes, the more money the customer makes, the more likely they are to repay their loan. According to the table, 92% of customers who repay their loan make more money than those who don't repay on time.
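One way to examine the income relation more directly is to split customers into income quartiles and compare the share with debt in each. This is a sketch on a made-up frame `toy`; the group labels are hypothetical names chosen for illustration.

```python
import pandas as pd

# Made-up incomes and debt flags, two customers per quartile.
toy = pd.DataFrame({
    "total_income": [10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000],
    "debt":         [1, 0, 1, 0, 0, 0, 0, 1],
})

# qcut builds bins with (roughly) equal numbers of customers in each.
toy["income_group"] = pd.qcut(
    toy["total_income"], q=4, labels=["low", "lower-mid", "upper-mid", "high"]
)

# Share of customers with debt in each income group.
rate_by_income = toy.groupby("income_group", observed=True)["debt"].mean()

print(rate_by_income)
```

Equal-sized bins keep the groups comparable, which is harder to guarantee with fixed income thresholds.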

# How do different loan purposes affect on-time repayment of the loan?

If the loan purpose is related to purchasing a car, getting an education, or buying real estate, there is a higher chance that the customer won't repay the loan on time. Car purchases account for 403 of the 1741 cases of debt in the table, which is about 23%.
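A percentage derived from the debt counts can be read in two ways: as the share of all late repayments that belong to one purpose, or as the rate of late repayment within that purpose. The sketch below separates the two with `pd.crosstab` on a made-up frame `toy`; the purposes and values are illustrative only.

```python
import pandas as pd

# Made-up data: six car loans (two late) and two wedding loans (one late).
toy = pd.DataFrame({
    "purpose": ["car"] * 6 + ["wedding"] * 2,
    "debt":    [1, 1, 0, 0, 0, 0, 1, 0],
})

# Rate: within each purpose, the fraction of loans that were late (rows sum to 1).
rate = pd.crosstab(toy["purpose"], toy["debt"], normalize="index")

# Share: of all late loans, the fraction belonging to each purpose (columns sum to 1).
share = pd.crosstab(toy["purpose"], toy["debt"], normalize="columns")

print(rate[1])
print(share[1])
```

In this toy data, car loans make up two thirds of all late repayments yet have the lower late rate (one third versus one half for weddings), so the two readings can point in opposite directions.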

## General conclusion

The customer's marital status and number of children do indeed have an impact on whether they will default on a loan. Several variables can indicate whether a customer is likely to default on a loan: their income, the loan purpose, the number of kids, and their marital status.


Throughout this project, I found anomalies in the data such as negative days employed, an extra gender value, implausibly high numbers of days employed, negative numbers of children, and missing values. I fixed these issues and filled in the missing data. I also organized the purpose column so that it is easier to read and easier to draw conclusions from. I found a general correlation between marital status, number of children, and whether a customer will default on a loan. Mainly, the more kids a customer has, and if they are married, the less debt they will most likely accumulate.