# Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building the **credit scoring** of a potential customer. **Credit scoring** is used to evaluate the ability of a potential borrower to repay their loan.

## Open the data file and have a look at the general information

In [1]:
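import pandas as pd
from pymystem3 import Mystem

# Load the bank's customer data
data = pd.read_csv('/datasets/credit_scoring_eng.csv')
m = Mystem()  # lemmatizer used later to categorize loan purposes

# Normalize column names to lowercase (the result must be assigned back)
data.columns = data.columns.str.lower()
data.info()
data.head(10)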

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


### Conclusion

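The dataset has 21,525 rows and 12 columns, which should be sufficient for the analysis, even though some columns are irrelevant to the questions asked. `days_employed` and `total_income` each have 19,351 non-null values, so 2,174 rows are missing a value in those columns. The preview also shows inconsistent letter case in `education` and negative values in `days_employed`.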
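
The non-null counts reported by `info()` can be turned into a direct count of missing values per column. The sketch below uses a small made-up frame (the `sample` values are illustrative, not taken from the bank's file) and also checks whether two columns are missing in the same rows.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
sample = pd.DataFrame({
    "children": [1, 0, 3],
    "days_employed": [-8437.67, None, -4124.75],
    "total_income": [40620.10, None, 42820.57],
})

# Number of missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)

# True if both columns are missing in exactly the same rows
same_rows = (sample["days_employed"].isna() == sample["total_income"].isna()).all()
print(same_rows)
```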

## Data preprocessing

### Processing missing values

In [2]:
data = data.drop(columns = ['days_employed', 'dob_years', 'education', 'education_id', 'gender'])

In [3]:
data = data.dropna().reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19351 entries, 0 to 19350
Data columns (total 7 columns):
children            19351 non-null int64
family_status       19351 non-null object
family_status_id    19351 non-null int64
income_type         19351 non-null object
debt                19351 non-null int64
total_income        19351 non-null float64
purpose             19351 non-null object
dtypes: float64(1), int64(3), object(3)
memory usage: 1.0+ MB


### Conclusion

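Some values are problematic, probably because of human error, and most of the missing values cannot be restored. After dropping the unused columns and the rows with a missing `total_income`, 19,351 rows remain. `family_status` and `family_status_id` carry the same information, so either one could be used to fill gaps in the other.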
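
Neither `family_status` nor `family_status_id` has missing values in this dataset, so the following is only an illustration of how one column could fill the other. It is a minimal sketch on a made-up frame; the mapping is built from the rows where both values are present.

```python
import pandas as pd

# Toy frame with one missing label (illustrative values only)
sample = pd.DataFrame({
    "family_status": ["married", None, "civil partnership", "married"],
    "family_status_id": [0, 1, 1, 0],
})

# Build an id -> label lookup from the complete rows
pairs = sample.dropna(subset=["family_status"]).drop_duplicates(["family_status_id"])
id_to_status = pairs.set_index("family_status_id")["family_status"]

# Fill missing labels by looking up the id
sample["family_status"] = sample["family_status"].fillna(
    sample["family_status_id"].map(id_to_status)
)
print(sample)
```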

### Data type replacement

In [4]:
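# Negative child counts are treated as sign errors
data['children'] = data['children'].abs()
# Fill any remaining missing incomes with the median income of the same income type
income_median = data.groupby('income_type')['total_income'].transform('median')
data['total_income'] = data['total_income'].fillna(income_median)
data.head()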

Unnamed: 0,children,family_status,family_status_id,income_type,debt,total_income,purpose
0,1,married,0,employee,0,40620.102,purchase of the house
1,1,married,0,employee,0,17932.802,car purchase
2,0,married,0,employee,0,23341.752,purchase of the house
3,3,married,0,employee,0,42820.568,supplementary education
4,0,civil partnership,1,retiree,0,25378.572,to have a wedding


### Conclusion

Negative values in `children` were replaced with their absolute values, on the assumption that the minus sign is an entry error. No missing values remain in the data.
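
In this notebook the rows with a missing income were already dropped in the previous step. As an alternative that keeps those rows, missing incomes could be filled with the median of the borrower's `income_type` group. The sketch below shows that idea on made-up values.

```python
import pandas as pd

# Toy frame with missing incomes (illustrative values only)
sample = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "retiree", "retiree"],
    "total_income": [20000.0, None, 30000.0, 15000.0, None],
})

# transform('median') returns one value per row, aligned with the original index
group_median = sample.groupby("income_type")["total_income"].transform("median")
sample["total_income"] = sample["total_income"].fillna(group_median)
print(sample)
```

Using `transform` rather than `median()` keeps the result the same length as the frame, so it can be passed straight to `fillna`.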

### Processing duplicates

In [5]:
data = data.drop_duplicates().reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19351 entries, 0 to 19350
Data columns (total 7 columns):
children            19351 non-null int64
family_status       19351 non-null object
family_status_id    19351 non-null int64
income_type         19351 non-null object
debt                19351 non-null int64
total_income        19351 non-null float64
purpose             19351 non-null object
dtypes: float64(1), int64(3), object(3)
memory usage: 1.0+ MB


### Conclusion

No duplicate rows are apparent in the data; the check is included for completeness.
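
`drop_duplicates()` returns a new DataFrame and leaves the original untouched, so its result has to be assigned for the removal to take effect. A minimal sketch on made-up rows, which also counts the duplicates before removing them:

```python
import pandas as pd

# Toy frame where the first two rows are identical (illustrative values only)
sample = pd.DataFrame({
    "children": [1, 1, 0],
    "debt": [0, 0, 1],
})

# Count fully duplicated rows (the first occurrence is not counted)
n_duplicates = sample.duplicated().sum()
print(n_duplicates)

# Keep the result; sample itself is not modified
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(len(sample), len(deduplicated))
```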


### Categorizing Data

In [6]:
def income_lvl(x):
    # Thresholds appear to be the quartile boundaries of total_income
    if x <= 16488.5045:
        return 'low'
    elif x <= 23202.87:
        return 'sufficient'
    elif x <= 32549.611:
        return 'average'
    elif x > 32549.611:
        return 'wealthy'
    else:
        # Reached only when x is not a comparable number (for example NaN)
        return 'other'

In [7]:
data['clean_income'] = data['total_income'].apply(income_lvl)

In [8]:
housing_category = ['purchase of the house', 'housing transactions',
                    'purchase of the house for my family', 'buy real estate', 'buy commercial real estate',
                    'buy residential real estate', 'construction of own property', 'property', 'building a property',
                    'transactions with commercial real estate', 'housing', 'transactions with my real estate', 'purchase of my own house',
                    'real estate transactions', 'buying property for renting out', 'building a real estate', 'housing renovation', 'house', 'real estate', 'realestate', 'commercial',
                    'residential', 'transactions', 'building', 'real', 
                    'estate']

car_category = ['car purchase', 'buying a second-hand car', 'buying my own car', 'cars', 'second-hand car purchase', 'car', 'to own a car', 'purchase of a car', 'to buy a car']

wedding_category = ['to have a wedding', 'having a wedding', 'wedding ceremony', 'wedding', 'ceremony']

education_category = ['supplementary education', 'education', 'to become educated', 'getting an education', 'to get a supplementary education', 'getting higher education', 'profile education', 'university education', 'going to university', 'university', 'educated']

In [9]:
def lemmatazation_func(line):
    # lemmatize returns a list of lemmas, so membership is tested token by token;
    # only the single-word entries of each category list can match
    lemmatized = m.lemmatize(line)
    if any(word in lemmatized for word in housing_category):
        return 'housing'
    elif any(word in lemmatized for word in car_category):
        return 'cars'
    elif any(word in lemmatized for word in wedding_category):
        return 'weddings'
    elif any(word in lemmatized for word in education_category):
        return 'education'
    else:
        return 'other'

In [10]:
data['clean_purpose'] = data['purpose'].apply(lemmatazation_func)
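
The same four categories can also be assigned without a lemmatizer. The sketch below is an alternative, not the method used in this notebook: the function name `categorize_purpose` and the keyword stems are assumptions chosen to cover the purpose strings listed above.

```python
def categorize_purpose(purpose):
    """Map a free-text loan purpose to a category by substring matching."""
    text = purpose.lower()
    # Hypothetical keyword stems; housing is checked first
    keywords = {
        "housing": ("hous", "estate", "property"),
        "cars": ("car",),
        "weddings": ("wedding",),
        "education": ("educat", "university"),
    }
    for category, stems in keywords.items():
        if any(stem in text for stem in stems):
            return category
    return "other"


for example in ["purchase of the house for my family", "to buy a car",
                "having a wedding", "going to university"]:
    print(example, "->", categorize_purpose(example))
```

Substring matching is simple but blunt: a short stem such as `car` would also match unrelated words that contain it, so the stems should be checked against the actual list of purposes.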

### Conclusion

Two new columns were created: `clean_income` groups income into four levels, and `clean_purpose` groups loan purposes into four categories. The grouping makes the data easier to summarize and compare.
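
Assuming the thresholds in `income_lvl` were meant as quartile boundaries of `total_income`, the same grouping could be produced without hard-coded numbers using `pd.qcut`. A minimal sketch on made-up incomes:

```python
import pandas as pd

# Toy incomes (illustrative values only)
incomes = pd.Series([12000.0, 15000.0, 18000.0, 21000.0,
                     25000.0, 30000.0, 40000.0, 60000.0])

# Split into four equal-sized groups and name them like income_lvl does
labels = ["low", "sufficient", "average", "wealthy"]
income_group = pd.qcut(incomes, q=4, labels=labels)
print(income_group.value_counts())
```

With `qcut` the boundaries are recomputed from whatever data is passed in, so the groups stay balanced if the dataset changes.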

## Answer these questions

- Is there a relation between having kids and repaying a loan on time?

In [11]:
kids_loan_relation = pd.pivot_table(data, index = 'children', columns = 'debt', values = 'family_status_id', aggfunc = 'count', margins = True).reset_index()
kids_loan_relation

debt,children,0,1,All
0,0,11758.0,952.0,12710
1,1,3978.0,409.0,4387
2,2,1674.0,177.0,1851
3,3,272.0,22.0,294
4,4,31.0,3.0,34
5,5,8.0,,8
6,20,59.0,8.0,67
7,All,17780.0,1571.0,19351


In [12]:
kids_loan_relation.fillna(0, inplace = True)

In [13]:
kids_loan_relation['relation'] = (kids_loan_relation[1] / kids_loan_relation['All']) * 100
kids_loan_relation

debt,children,0,1,All,relation
0,0,11758.0,952.0,12710,7.490165
1,1,3978.0,409.0,4387,9.323
2,2,1674.0,177.0,1851,9.562399
3,3,272.0,22.0,294,7.482993
4,4,31.0,3.0,34,8.823529
5,5,8.0,0.0,8,0.0
6,20,59.0,8.0,67,11.940299
7,All,17780.0,1571.0,19351,8.118443


### Conclusion

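The relation between the number of children and repaying a loan is not monotonic. For 0 to 4 children the default rate stays between roughly 7.5% and 9.6%, with the lowest rates at 0 and 3 children. None of the 8 borrowers with 5 children defaulted, but the group is too small to support a conclusion. The 67 borrowers recorded with 20 children have the highest default rate (11.9%); families of that size do exist, but the value may also be a data-entry error, so this result should be treated with caution.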
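
Some of the groups above contain very few borrowers, which makes their rates unreliable. The sketch below recomputes the default rate from the counts in the pivot table and flags small groups. The threshold `MIN_GROUP_SIZE` is a hypothetical cut-off chosen for illustration, not a value from the project.

```python
import pandas as pd

# Counts copied from the pivot table above (debt = 0 and debt = 1 per number of children)
counts = pd.DataFrame({
    "children": [0, 1, 2, 3, 4, 5, 20],
    "no_default": [11758, 3978, 1674, 272, 31, 8, 59],
    "default": [952, 409, 177, 22, 3, 0, 8],
})

counts["total"] = counts["no_default"] + counts["default"]
counts["default_rate_pct"] = (counts["default"] / counts["total"] * 100).round(2)

# Hypothetical cut-off below which a group is considered too small to interpret
MIN_GROUP_SIZE = 100
counts["small_group"] = counts["total"] < MIN_GROUP_SIZE
print(counts)
```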

- Is there a relation between marital status and repaying a loan on time?

In [14]:
marital_loan_relation = pd.pivot_table(data, index = 'family_status', columns = 'debt', values = 'family_status_id', aggfunc = 'count', margins = True).reset_index()
marital_loan_relation

debt,family_status,0,1,All
0,civil partnership,3396,339,3735
1,divorced,1007,76,1083
2,married,10297,846,11143
3,unmarried,2271,254,2525
4,widow / widower,809,56,865
5,All,17780,1571,19351


In [15]:
marital_loan_relation['relation'] = (marital_loan_relation[1] / marital_loan_relation['All']) * 100
marital_loan_relation

debt,family_status,0,1,All,relation
0,civil partnership,3396,339,3735,9.076305
1,divorced,1007,76,1083,7.017544
2,married,10297,846,11143,7.59221
3,unmarried,2271,254,2525,10.059406
4,widow / widower,809,56,865,6.473988
5,All,17780,1571,19351,8.118443


### Conclusion

Widowed borrowers have the lowest default rate (6.5%), followed by divorced (7.0%) and married (7.6%) borrowers. Unmarried borrowers have the highest rate (10.1%), with those in a civil partnership close behind (9.1%).

- Is there a relation between income level and repaying a loan on time?

In [16]:
income_loan_relation = pd.pivot_table(data, index = 'clean_income', columns = 'debt', values = 'family_status_id', aggfunc = 'count', margins = True).reset_index()
income_loan_relation

debt,clean_income,0,1,All
0,average,4411,426,4837
1,low,4455,383,4838
2,sufficient,4417,421,4838
3,wealthy,4497,341,4838
4,All,17780,1571,19351


In [17]:
income_loan_relation['relation'] = (income_loan_relation[1] / income_loan_relation['All']) * 100
income_loan_relation

debt,clean_income,0,1,All,relation
0,average,4411,426,4837,8.807112
1,low,4455,383,4838,7.916494
2,sufficient,4417,421,4838,8.701943
3,wealthy,4497,341,4838,7.048367
4,All,17780,1571,19351,8.118443


### Conclusion

There is no monotonic relation between income level and default rate. The two middle groups, 'sufficient' (8.7%) and 'average' (8.8%), default more often than the 'low' (7.9%) and 'wealthy' (7.0%) groups.
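
One way to check whether the differences between income groups are larger than chance alone would produce is Pearson's chi-square test of independence. The notebook does not use this test; the sketch below is an assumption about how it could be applied, computed by hand from the counts in the table above.

```python
# Counts copied from the income pivot table: (debt = 0, debt = 1) per income group
observed = {
    "low": (4455, 383),
    "sufficient": (4417, 421),
    "average": (4411, 426),
    "wealthy": (4497, 341),
}

grand_total = sum(a + b for a, b in observed.values())
col_totals = [sum(row[i] for row in observed.values()) for i in range(2)]

# Sum (observed - expected)^2 / expected over all eight cells
chi2 = 0.0
for row in observed.values():
    row_total = sum(row)
    for i, obs in enumerate(row):
        expected = row_total * col_totals[i] / grand_total
        chi2 += (obs - expected) ** 2 / expected

# 5% critical value of the chi-square distribution with (4 - 1) * (2 - 1) = 3 degrees of freedom
CRITICAL_5PCT_DF3 = 7.815
print(round(chi2, 2), chi2 > CRITICAL_5PCT_DF3)
```

A statistic above the critical value would suggest that the default rate differs between income groups, although it says nothing about the direction or shape of the relation.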

- How do different loan purposes affect on-time repayment of the loan?

In [18]:
purpose_loan_relation = pd.pivot_table(data, index = 'clean_purpose', columns = 'debt', values = 'family_status_id', aggfunc = 'count', margins = True).reset_index()
purpose_loan_relation

debt,clean_purpose,0,1,All
0,cars,3530,367,3897
1,education,3266,331,3597
2,housing,9043,715,9758
3,weddings,1941,158,2099
4,All,17780,1571,19351


In [19]:
purpose_loan_relation['relation'] = (purpose_loan_relation[1] / purpose_loan_relation['All']) * 100
purpose_loan_relation

debt,clean_purpose,0,1,All,relation
0,cars,3530,367,3897,9.417501
1,education,3266,331,3597,9.202113
2,housing,9043,715,9758,7.327321
3,weddings,1941,158,2099,7.527394
4,All,17780,1571,19351,8.118443


### Conclusion

Loans for cars (9.4%) and education (9.2%) have higher default rates than loans for housing (7.3%) and weddings (7.5%), so the latter two are the safer bet for the lender.
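
The four questions above repeat the same pivot-and-divide steps for different columns. As a sketch, the steps could be wrapped in one helper; `default_rate_by` is a hypothetical name, and the example runs on made-up rows rather than the bank's data.

```python
import pandas as pd


def default_rate_by(frame, column):
    """Return borrower count, default count and default rate (%) per group."""
    summary = frame.groupby(column)["debt"].agg(borrowers="count", defaults="sum")
    summary["default_rate_pct"] = summary["defaults"] / summary["borrowers"] * 100
    # Highest-risk groups first
    return summary.sort_values("default_rate_pct", ascending=False)


# Toy frame (illustrative values only); debt = 1 marks a default
sample = pd.DataFrame({
    "clean_purpose": ["cars", "cars", "housing", "housing", "housing", "housing"],
    "debt": [1, 0, 0, 0, 1, 0],
})
result = default_rate_by(sample, "clean_purpose")
print(result)
```

Because `debt` is coded as 0 or 1, its sum within a group is the number of defaults, so no pivot or `fillna` step is needed.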

## General conclusion

Repayment varies with borrowers' circumstances in several ways, although most of the differences amount to a few percentage points. Borrowers with 0 or 3 children have the lowest default rates among the larger groups, and none of the 8 borrowers with 5 children defaulted. Unmarried borrowers and those in a civil partnership default most often, the lowest and highest income groups repay better than the middle groups, and loans for cars and education are less likely to be repaid on time. Taken together, the lowest-risk profile in this data is a wealthy, widowed borrower taking a loan for housing; the result for 5 children rests on too few borrowers to rely on.