## Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children has an impact on whether they will default on a loan. The bank already has some data on customers’ credit worthiness.

Your report will be considered when building a **credit scoring** of a potential customer. A ** credit scoring ** is used to evaluate the ability of a potential borrower to repay their loan.

### Step 1. Open the data file and have a look at the general information. 

In [1]:
import pandas as pd
data = pd.read_csv('/datasets/credit_scoring_eng.csv')

I opened the file

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [3]:
data.shape

(21525, 12)

Checking how many rows and columns there are in the data

Checked the information for the data

In [4]:
data.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


Checked ten rows of the data to see what I am dealing with. Noticed issues of uppercase lowercase as well as purpose having different wording in the data. I will only really need children and marital status for the questions.

In [5]:
data.columns

Index(['children', 'days_employed', 'dob_years', 'education', 'education_id',
       'family_status', 'family_status_id', 'gender', 'income_type', 'debt',
       'total_income', 'purpose'],
      dtype='object')

Checked which columns I have

In [6]:
data['dob_years'].value_counts()

35    617
40    609
41    607
34    603
38    598
42    597
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
43    513
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
66    183
22    183
67    167
21    111
0     101
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64

Checked data of dob_years. Found 101 people had 0 as an answer. That is not a significant number in the amount that we have of data. But also I would say its the 101 people who did not want to tell their age.

In [7]:
data['debt'].value_counts()

0    19784
1     1741
Name: debt, dtype: int64

Checked how many people defaulted on debt. Only 1741 have out of 19784. Good amount for customer credit scoring

In [8]:
data['income_type'].value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
unemployed                         2
entrepreneur                       2
paternity / maternity leave        1
student                            1
Name: income_type, dtype: int64

Checked data on how many people in data have a job. Over 90% have an income source

In [9]:
data.dtypes

children              int64
days_employed       float64
dob_years             int64
education            object
education_id          int64
family_status        object
family_status_id      int64
gender               object
income_type          object
debt                  int64
total_income        float64
purpose              object
dtype: object

Checked the data types and days_employed and total_income are floating data types and I will change them to integer types. 

In [10]:
data.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,26787.568355
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,16475.450632
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,3306.762
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,23202.87
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,32549.611
max,20.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


Checked data more clearly and most people have an education of some sort. I found negative numbers in children and days-employed and I will change that. Also there seems to be families with 20 children. I see that as a typo and will change it to 2 children

In [11]:
data.isnull().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

I found there to be missing values in the column days_employed and total_income

### Conclusion

I examined the data and noticed missing values in 2174 rows of the columns days_employed and total_income. Also the children column and days_employed column have negative numbers. There are families with 20 children but this does not make sense and I will count this as a typo. In the education column there are issues of lowercase and uppercase. I will change everything to lowercase. Also discovered that 101 people have 0 in age. This is probably because they declined to write there age for the analysis.

### Step 2. Data preprocessing

In [12]:
data['children'] = data['children'].abs()

Changed the Children column to positive numbers

In [13]:
data['children'] = data['children'].replace(20,2) 

Changed the 20 in 'Children' column to 2

In [14]:
data['children'].describe()

count    21525.000000
mean         0.479721
std          0.755528
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max          5.000000
Name: children, dtype: float64

checking if the code worked for the Children column and it looks like it did. Instead of 20 max I have 5 max and min is not negative 1 anymore

In [15]:
data['days_employed'] = data['days_employed'].abs()

changed the Days_employed column to positive numbers

In [16]:
data['days_employed'].describe()

count     19351.000000
mean      66914.728907
std      139030.880527
min          24.141633
25%         927.009265
50%        2194.220567
75%        5537.882441
max      401755.400475
Name: days_employed, dtype: float64

check to see if days_employed is positive

In [17]:
data['education'].value_counts()

secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
Graduate Degree            1
GRADUATE DEGREE            1
Name: education, dtype: int64

looking into education column to see the issue of lowercase uppercase

In [18]:
data['education'] = data['education'].str.lower()

Made all the education column lowercase

In [19]:
data['education'].value_counts()

secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64

checked to see if lowercase worked.

In [20]:
data['purpose'] = data['purpose'].str.lower()

Even though did not see any issue of lowercase in purpose it did not hurt to do the same for purpose in case.

In [21]:
data['purpose'].value_counts()

wedding ceremony                            797
having a wedding                            777
to have a wedding                           774
real estate transactions                    676
buy commercial real estate                  664
buying property for renting out             653
housing transactions                        653
transactions with commercial real estate    651
housing                                     647
purchase of the house                       647
purchase of the house for my family         641
construction of own property                635
property                                    634
transactions with my real estate            630
building a real estate                      626
buy real estate                             624
building a property                         620
purchase of my own house                    620
housing renovation                          612
buy residential real estate                 607
buying my own car                       

Checked to see all the different categories in purpose. Will need to categorize this better in order to analyze the data better

### Processing missing values

In [22]:
median_employed = data['days_employed'].median()
data['days_employed'].fillna(median_employed, inplace=True)

Filled in the missing data for days employed with the median of days_employed so there will be some value there. It may mess up the data later but I do not believe it will be significant

In [23]:
data['days_employed'].describe()

count     21525.000000
mean      60378.032733
std      133257.558514
min          24.141633
25%        1025.608174
50%        2194.220567
75%        4779.587738
max      401755.400475
Name: days_employed, dtype: float64

This column will not be significant in calculating the conclusion because we need to look at marital status and number of children more closely. It also is I believe MCAR based on the fact that I can still figure out analysis from other columns. I decided to use this method because it was simplest for me and also because I do not believe it will be significant in my investigation.

In [24]:
median_income = data['total_income'].median()
data['total_income'].fillna(median_employed, inplace=True)

Did the same with total income in which I did with days employed. 

In [25]:
data['total_income'].describe()

count     21525.000000
mean      24303.668792
std       17290.003687
min        2194.220567
25%       14178.053000
50%       21682.354000
75%       31286.979000
max      362496.645000
Name: total_income, dtype: float64

I don't think this will make a significant change in the data and it was the easiest to use the median here.

In [26]:
data.isnull().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

Checked to see if there were still any missing values but there are none

### Conclusion

I replaced all the missing values in 'days_employed' and 'total_income' with median value. This changed the data a little bit but not significantly enough for what we are investigating.

### Data type replacement

In [27]:
data['days_employed'] = data['days_employed'].astype('int')

In [28]:
data['total_income'] = data['total_income'].astype('int')

changed days employed and total income to integer type to better work with the data and have all of the number columns be in integer. Also to change them with astype is better because it won't change the other way.

In [29]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       21525 non-null int64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        21525 non-null int64
purpose             21525 non-null object
dtypes: int64(7), object(5)
memory usage: 2.0+ MB


### Conclusion

I was able to change the days_employed column and total_income column to integer instead of float and should be able to work better with the data now.

### Processing duplicates

In [30]:
data.duplicated().sum()

71

Checking how many duplicated data I have

In [31]:
data.loc[data.duplicated(), :]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2849,0,2194,41,secondary education,1,married,0,F,employee,0,2194,purchase of the house for my family
3290,0,2194,58,secondary education,1,civil partnership,1,F,retiree,0,2194,to have a wedding
4182,1,2194,34,bachelor's degree,0,civil partnership,1,F,employee,0,2194,wedding ceremony
4851,0,2194,60,secondary education,1,civil partnership,1,F,retiree,0,2194,wedding ceremony
5557,0,2194,58,secondary education,1,civil partnership,1,F,retiree,0,2194,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
20702,0,2194,64,secondary education,1,married,0,F,retiree,0,2194,supplementary education
21032,0,2194,60,secondary education,1,married,0,F,retiree,0,2194,to become educated
21132,0,2194,47,secondary education,1,married,0,F,employee,0,2194,housing renovation
21281,1,2194,30,bachelor's degree,0,married,0,F,employee,0,2194,buy commercial real estate


There are duplications in days employed and total income column

I dropped the duplicates from the data

In [32]:
data = data.drop_duplicates(keep = 'first')

In [33]:
data.duplicated().sum()

0

Checked to see if any duplicates remained

### Conclusion

Checking the 71 duplications and in Days_employed and total_income columns they are the same numbers. This happened because I added the median to fill the missing data. This won't affect the data that I am looking at and 71 duplicates is only .32% of the data so should not affect my conclusion. I dropped the duplicated data from the dataframe and have no more duplicates. Days_employed is not a necessary column for my analysis because I am not assessing how long the person has worked but I am looking at their total_income. This second column may affect the analysis.

### Categorizing Data

In [35]:
import nltk
from nltk.stem import WordNetLemmatizer
 
def lemmatizing(text):
    wordnet_lemma = WordNetLemmatizer()
    words = nltk.word_tokenize(text)
    return [wordnet_lemma.lemmatize(w, pos = 'n') for w in words]
 
data['lemmas'] = data['purpose'].apply(lemmatizing)

Importing the lemmatizer to be able to split the purpose column into different categories

In [36]:
data['lemmas']

0        [purchase, of, the, house]
1                   [car, purchase]
2        [purchase, of, the, house]
3        [supplementary, education]
4            [to, have, a, wedding]
                    ...            
21520        [housing, transaction]
21521        [purchase, of, a, car]
21522                    [property]
21523        [buying, my, own, car]
21524             [to, buy, a, car]
Name: lemmas, Length: 21454, dtype: object

checking the lemmatization worked

In [37]:
data['lemmas'].value_counts()

[car]                                            972
[wedding, ceremony]                              791
[having, a, wedding]                             768
[to, have, a, wedding]                           765
[real, estate, transaction]                      675
[buy, commercial, real, estate]                  661
[housing, transaction]                           652
[buying, property, for, renting, out]            651
[transaction, with, commercial, real, estate]    650
[purchase, of, the, house]                       646
[housing]                                        646
[purchase, of, the, house, for, my, family]      638
[construction, of, own, property]                635
[property]                                       633
[transaction, with, my, real, estate]            627
[building, a, real, estate]                      624
[buy, real, estate]                              621
[purchase, of, my, own, house]                   620
[building, a, property]                       

Checking the different categories I would like to make based on the lemmatization. I will make a category of wedding, car, real estate, personal property, and education

In [38]:
from nltk.stem import SnowballStemmer
english_stemmer = SnowballStemmer('english')

importing stemmer to see if any words need to be added to my function

In [39]:
stemmed_list_0=[english_stemmer.stem(word)for word in data['purpose'][0].split(" ")]
stemmed_list_0

['purchas', 'of', 'the', 'hous']

In [40]:
stemmed_list_1=[english_stemmer.stem(word)for word in data['purpose'][1].split(" ")]
stemmed_list_1

['car', 'purchas']

In [41]:
stemmed_list_3=[english_stemmer.stem(word)for word in data['purpose'][3].split(" ")]
stemmed_list_3

['supplementari', 'educ']

In [42]:
stemmed_list_4=[english_stemmer.stem(word)for word in data['purpose'][4].split(" ")]
stemmed_list_4

['to', 'have', 'a', 'wed']

In [43]:
def purpose_category(element):
    if 'estate' in element or 'housing' in element or 'house' in element or 'construction' in element or 'property' in element or'renting' in element:
        element = 'property'
    elif 'university' in element or 'education' in element or 'educated' in element:
        element = 'education'
    elif 'wedding' in element:
        element = 'wedding'
    elif 'car' in element:
        element = 'car'
    return element
   
data['purpose_category'] = data['purpose'].apply(purpose_category)

Created purpose category column and applied it to the data. I took four separate categories to make the data clearer and get rid of too many variables

In [44]:
data['purpose_category'].value_counts()

property     10811
car           4306
education     4013
wedding       2324
Name: purpose_category, dtype: int64

now I have four categories and much easier to work with the data

In [45]:
data['total_income'].describe()

count     21454.000000
mean      24376.367950
std       17271.632609
min        2194.000000
25%       14254.000000
50%       21724.500000
75%       31330.000000
max      362496.000000
Name: total_income, dtype: float64

rechecking total income to pull numbers for my next categorization which would be of income

In [46]:
def income_bracket(income):
    if income < 14178:
        return 'low income'
    elif income >= 14178 and income <= 21682:
        return 'mid income'
    elif income > 21682:
        return 'high income'
    return income
data['income_bracket'] = data['total_income'].apply(income_bracket)    

Making a new column income_bracket to categorize income into low income bracket, mid income bracket, and high income bracket and applied to to the table

In [47]:
data['income_bracket'].value_counts()

high income    10762
mid income      5382
low income      5310
Name: income_bracket, dtype: int64

Found that many people have a high income and then there is a split between mid income and low income.

In [48]:
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,lemmas,purpose_category,income_bracket
0,1,8437,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house,"[purchase, of, the, house]",property,high income
1,1,4024,36,secondary education,1,married,0,F,employee,0,17932,car purchase,"[car, purchase]",car,mid income
2,0,5623,33,secondary education,1,married,0,M,employee,0,23341,purchase of the house,"[purchase, of, the, house]",property,high income
3,3,4124,32,secondary education,1,married,0,M,employee,0,42820,supplementary education,"[supplementary, education]",education,high income
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding,"[to, have, a, wedding]",wedding,high income


Checked my table to see everything was there including the new columns

In [49]:
credit_scoring = data.drop(['days_employed', 'dob_years', 'education', 'education_id', 'family_status_id', 'gender', 'income_type', 'purpose', 'lemmas'], axis=1)

dropped the columns I felt was not needed to contribute to the questions that I would be analyzing

In [50]:
credit_scoring

Unnamed: 0,children,family_status,debt,total_income,purpose_category,income_bracket
0,1,married,0,40620,property,high income
1,1,married,0,17932,car,mid income
2,0,married,0,23341,property,high income
3,3,married,0,42820,education,high income
4,0,civil partnership,0,25378,wedding,high income
...,...,...,...,...,...,...
21520,1,civil partnership,0,35966,property,high income
21521,0,married,0,24959,car,high income
21522,1,civil partnership,1,14347,property,mid income
21523,3,married,1,39054,car,high income


Printed out the new table

### Conclusion

I was able to categorize the purpose column which showed that most loans were taken for housing projects and the least were for weddings. Also categorized the total_income and found that most of the families had a middle income. This will help me determine better what I am analyzing when it comes to income because I will be able to see which family income types default on a loan as well as where the loan is being used for in a much more concise way. I also created data_credit_scoring in order to clean the data for the questions I will be asking. 

### Step 3. Answer these questions

- Is there a relation between having kids and repaying a loan on time?

In [51]:
credit_scoring_pivot_children = credit_scoring.pivot_table(index = ["children"], columns= ["debt"], aggfunc = "count")
credit_scoring_pivot_children

Unnamed: 0_level_0,family_status,family_status,income_bracket,income_bracket,purpose_category,purpose_category,total_income,total_income
debt,0,1,0,1,0,1,0,1
children,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
0,13028.0,1063.0,13028.0,1063.0,13028.0,1063.0,13028.0,1063.0
1,4410.0,445.0,4410.0,445.0,4410.0,445.0,4410.0,445.0
2,1926.0,202.0,1926.0,202.0,1926.0,202.0,1926.0,202.0
3,303.0,27.0,303.0,27.0,303.0,27.0,303.0,27.0
4,37.0,4.0,37.0,4.0,37.0,4.0,37.0,4.0
5,9.0,,9.0,,9.0,,9.0,


### Conclusion

The numbers show that the more kids the family has the less likely they will default on their debt. I do have NaN for families with 5 children but I can tell already from the rest of the data in the area of 1 to 3 children that the numbers of default on debt go lower and lower. The average of people who end up defaulting on the loan is around the same.

In [52]:
# credit_scoring.groupby('children')['debt'].count()

children
0    14091
1     4855
2     2128
3      330
4       41
5        9
Name: debt, dtype: int64

In [53]:
credit_scoring.groupby('children')['debt'].mean()

children
0    0.075438
1    0.091658
2    0.094925
3    0.081818
4    0.097561
5    0.000000
Name: debt, dtype: float64

- Is there a relation between marital status and repaying a loan on time?

In [54]:
credit_scoring_pivot_family = credit_scoring.pivot_table(index = ["family_status"], columns = ["debt"], aggfunc = "count")
credit_scoring_pivot_family

Unnamed: 0_level_0,children,children,income_bracket,income_bracket,purpose_category,purpose_category,total_income,total_income
debt,0,1,0,1,0,1,0,1
family_status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
civil partnership,3763,388,3763,388,3763,388,3763,388
divorced,1110,85,1110,85,1110,85,1110,85
married,11408,931,11408,931,11408,931,11408,931
unmarried,2536,274,2536,274,2536,274,2536,274
widow / widower,896,63,896,63,896,63,896,63


In [55]:
credit_scoring.groupby('family_status')['debt'].mean()

family_status
civil partnership    0.093471
divorced             0.071130
married              0.075452
unmarried            0.097509
widow / widower      0.065693
Name: debt, dtype: float64

### Conclusion

According to the table most of the people are married and they have the higher number of defaults on loans. The average is around the same for all types of family status. Most people are married and have the most defaults on loans but it is not so different from the rate of people defaulting if not married.

- Is there a relation between income level and repaying a loan on time?

In [56]:
credit_scoring_pivot_income = credit_scoring.pivot_table(index = ["income_bracket"], columns = ["debt"], aggfunc = "count")
credit_scoring_pivot_income

Unnamed: 0_level_0,children,children,family_status,family_status,purpose_category,purpose_category,total_income,total_income
debt,0,1,0,1,0,1,0,1
income_bracket,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
high income,9890,872,9890,872,9890,872,9890,872
low income,4893,417,4893,417,4893,417,4893,417
mid income,4930,452,4930,452,4930,452,4930,452


In [57]:
credit_scoring.groupby('income_bracket')['debt'].mean()

income_bracket
high income    0.081026
low income     0.078531
mid income     0.083984
Name: debt, dtype: float64

### Conclusion

There is no relation. The numbers show that there will almost always be the same number of people defaulting on their loan no matter the bracket of income they are in.

- How do different loan purposes affect on-time repayment of the loan?

In [58]:
credit_scoring_pivot_purpose = credit_scoring.pivot_table(index = ["purpose_category"], columns = ["debt"], aggfunc = "count")
credit_scoring_pivot_purpose

Unnamed: 0_level_0,children,children,family_status,family_status,income_bracket,income_bracket,total_income,total_income
debt,0,1,0,1,0,1,0,1
purpose_category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
car,3903,403,3903,403,3903,403,3903,403
education,3643,370,3643,370,3643,370,3643,370
property,10029,782,10029,782,10029,782,10029,782
wedding,2138,186,2138,186,2138,186,2138,186


In [62]:
# credit_scoring.groupby('purpose_category')['debt'].count().sort_values(ascending=False)

purpose_category
property     10811
car           4306
education     4013
wedding       2324
Name: debt, dtype: int64

In [63]:
credit_scoring.groupby('purpose_category')['debt'].mean().sort_values(ascending=False)

purpose_category
car          0.093590
education    0.092200
wedding      0.080034
property     0.072334
Name: debt, dtype: float64

### Conclusion

Again it seems to be that the numbers are higher for not paying back debt but the numbers of people with that purpose are higher as well. I do not see a correlation between different loan purposes and on-time repayments. 

### Step 4. General conclusion

After going through the data the credit scoring of individuals is not adversely affected by having more children. On the contrary there are less loans given the more children you have. And the purpose for the loan or your marital status does not necessarily affect your payment back for the loan either. The numbers of defaulting on loans are similar with each category. I would think that checking the age differences and life experience will be more relevant to see if their is a credit worthiness for the individual. Some of the data needed to be manipulated to fill missing values and this may have affected the total_income of individuals but I deemed it not significant enough.

### Project Readiness Checklist

Put 'x' in the completed points. Then press Shift + Enter.

- [x]  file open;
- [x]  file examined;
- [x]  missing values defined;
- [x]  missing values are filled;
- [x]  an explanation of which missing value types were detected;
- [ ]  explanation for the possible causes of missing values;
- [x]  an explanation of how the blanks are filled;
- [x]  replaced the real data type with an integer;
- [x]  an explanation of which method is used to change the data type and why;
- [x]  duplicates deleted;
- [x]  an explanation of which method is used to find and remove duplicates;
- [x]  description of the possible reasons for the appearance of duplicates in the data;
- [x]  data is categorized;
- [x]  an explanation of the principle of data categorization;
- [x]  an answer to the question "Is there a relation between having kids and repaying a loan on time?";
- [x]  an answer to the question " Is there a relation between marital status and repaying a loan on time?";
- [x]   an answer to the question " Is there a relation between income level and repaying a loan on time?";
- [x]  an answer to the question " How do different loan purposes affect on-time repayment of the loan?"
- [x]  conclusions are present on each stage;
- [x]  a general conclusion is made.