## Analyzing borrowers’ risk of defaulting

My project is to prepare a report for a bank’s loan division. I need to find out whether a customer’s marital status and number of children affect the likelihood of defaulting on a loan.

This report will be considered when building the **credit score** of a potential customer.

### Table of Contents
* <a href="#Step 1">Opening the Data</a><br>
    * <a href="#df_tail">Data tail</a><br>
    * <a href="#df_shape">Data shape</a><br>
    * <a href="#df_columns">Data columns</a><br>
    * <a href="#df_describe">Data describe</a><br>
        * <a href="#step1_conclusion">Conclusion</a><br>
* <a href="#Step 2">Data preprocessing</a><br>
    * <a href="#data_typ_rep">Data type replacement</a><br>
    * <a href="#real2int">Real number data type to Integer type</a><br>
    * <a href="#processdup">Processing Duplicates</a><br>
        * <a href="#dupconclusion">Conclusion</a><br>
    * <a href="#Categ_data">Categorizing the Data</a><br>
        * <a href="#Categ_conclusion">Conclusion</a><br>
* <a href="#Step 3">Answer these Questions</a><br>
* <a href="#Step 4">General Conclusion</a><br>

<p><a name="Step 1"></a></p>

### Step 1. Opening the Data

In [24]:
import pandas as pd
import numpy as np

df = pd.read_csv(r'C:\Users\User\Documents\YandexDataA\sprint 2/credit_scoring_eng (2).csv')

print(df.head())


   children  days_employed  dob_years            education  education_id  \
0         1   -8437.673028         42    bachelor's degree             0   
1         1   -4024.803754         36  secondary education             1   
2         0   -5623.422610         33  Secondary Education             1   
3         3   -4124.747207         32  secondary education             1   
4         0  340266.072047         53  secondary education             1   

       family_status  family_status_id gender income_type  debt  total_income  \
0            married                 0      F    employee     0     40620.102   
1            married                 0      F    employee     0     17932.802   
2            married                 0      M    employee     0     23341.752   
3            married                 0      M    employee     0     42820.568   
4  civil partnership                 1      F     retiree     0     25378.572   

                   purpose  
0    purchase of the house 

<p><a name="df_tail"></a></p>

print(df.tail())

<p><a name="df_shape"></a></p>

<p><a name="df_columns"></a></p>

In [3]:
df.columns

#Columns of the dataset

Index(['children', 'days_employed', 'dob_years', 'education', 'education_id',
       'family_status', 'family_status_id', 'gender', 'income_type', 'debt',
       'total_income', 'purpose'],
      dtype='object')

<p><a name="df_describe"></a></p>

In [4]:
df.describe()

## Descriptive stats
# Distribution of the data

| | children | days_employed | dob_years | education_id | family_status_id | debt | total_income |
|---|---|---|---|---|---|---|---|
| count | 21525.0 | 19351.0 | 21525.0 | 21525.0 | 21525.0 | 21525.0 | 19351.0 |
| mean | 0.538908 | 63046.497661 | 43.29338 | 0.817236 | 0.972544 | 0.080883 | 26787.568355 |
| std | 1.381587 | 140827.311974 | 12.574584 | 0.548138 | 1.420324 | 0.272661 | 16475.450632 |
| min | -1.0 | -18388.949901 | 0.0 | 0.0 | 0.0 | 0.0 | 3306.762 |
| 25% | 0.0 | -2747.423625 | 33.0 | 1.0 | 0.0 | 0.0 | 16488.5045 |
| 50% | 0.0 | -1203.369529 | 42.0 | 1.0 | 0.0 | 0.0 | 23202.87 |
| 75% | 1.0 | -291.095954 | 53.0 | 1.0 | 1.0 | 0.0 | 32549.611 |
| max | 20.0 | 401755.400475 | 75.0 | 4.0 | 4.0 | 1.0 | 362496.645 |


<p><a name="step1_conclusion"></a></p>

### Conclusion
The dataset has 21525 rows and 12 columns. There are NaN values in the `days_employed` and `total_income` columns: `describe()` reports only 19351 non-null values for each of them, so 2174 values are missing in each. I'm using a pandas DataFrame since my task involves data cleaning and data manipulation.
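
The missing values can also be counted directly instead of being read off the `describe()` output. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "children": [1, 0, 2, 0],
    "days_employed": [-8437.6, np.nan, -5623.4, np.nan],
    "total_income": [40620.1, np.nan, 23341.7, np.nan],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Share of rows with a missing value, per column.
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

Running the same two calls on `df` should report the missing counts for `days_employed` and `total_income` in one step.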

<p><a name="Step 2"></a></p>

### Step 2. Data preprocessing

<p><a name="data_typ_rep"></a></p>

### Data type replacement

In [5]:
# Inspect the unique values of the two columns that contain missing values.
days_employed = df['days_employed'].unique()
total_income = df['total_income'].unique()

In [6]:
print('Unique days employed:' , len(days_employed))
print('Unique total income:', len (total_income))


Unique days employed: 19352
Unique total income: 19349


These two columns have the same number of missing values, likely because both values come from the same payroll data. Therefore, when `days_employed` is missing, `total_income` is missing as well.
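
One way to check that the two columns are missing in exactly the same rows is to compare their missing-value masks. This is only a sketch on a hypothetical toy frame (`toy` and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "days_employed": [-8437.6, np.nan, -5623.4, np.nan, -926.0],
    "total_income": [40620.1, np.nan, 23341.7, np.nan, 40922.0],
})

days_missing = toy["days_employed"].isna()
income_missing = toy["total_income"].isna()

# True only if every row is missing in both columns or in neither.
same_rows = (days_missing == income_missing).all()

# Rows where both values are missing at once.
both_missing = days_missing & income_missing

print(same_rows, both_missing.sum())
```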

In [7]:
# Determine how many rows are missing in the days_employed and total_income columns.
print(df[df['days_employed'].isna()].count())


children            2174
days_employed          0
dob_years           2174
education           2174
education_id        2174
family_status       2174
family_status_id    2174
gender              2174
income_type         2174
debt                2174
total_income           0
purpose             2174
dtype: int64


In [8]:
print(df[df['total_income'].isna()].count()) 


children            2174
days_employed          0
dob_years           2174
education           2174
education_id        2174
family_status       2174
family_status_id    2174
gender              2174
income_type         2174
debt                2174
total_income           0
purpose             2174
dtype: int64


In [9]:
# These are the means of the days_employed and total_income columns, which I will use to
# fill the NaN values of each column.

mean_days_employed = df['days_employed'].mean()
mean_total_income = df['total_income'].mean()

In [10]:
# Fill the missing values of each column with that column's mean using fillna.

df['days_employed'] = df['days_employed'].fillna(value=mean_days_employed)
df['total_income'] = df['total_income'].fillna(value=mean_total_income)

In [11]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       21525 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        21525 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


As the `info()` output shows, the dataset no longer has missing values. I replaced the missing values with the mean because both columns are numerical. I chose the mean first because it is common practice, and second because the share of outliers in the data is not that high. If outliers made up more than 1% of the data, I would use the median instead.
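
To illustrate why the choice between mean and median matters when outliers are present, here is a small sketch on made-up income values (the series `income` is an assumption, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up incomes with one missing value and one large outlier.
income = pd.Series([20000.0, 22000.0, 25000.0, np.nan, 360000.0])

# The mean is pulled up by the outlier; the median is not.
mean_value = income.mean()
median_value = income.median()

filled_with_mean = income.fillna(mean_value)
filled_with_median = income.fillna(median_value)

print(mean_value, median_value)
```

In this toy case the mean is 106750.0 and the median is 23500.0, so the filled value differs a lot depending on the statistic chosen.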

<p><a name="real2int"></a></p>

#### Real number data type to Integer type

In [12]:
# The 'days_employed' and 'total_income' columns contain real numbers, so I convert them to integers
# with astype. I use astype because it can convert several columns of a DataFrame
# from one type to another in a single call.

df = df.astype({"days_employed": int, "total_income": int})

In [13]:
print(df.head())

   children  days_employed  dob_years            education  education_id  \
0         1          -8437         42    bachelor's degree             0   
1         1          -4024         36  secondary education             1   
2         0          -5623         33  Secondary Education             1   
3         3          -4124         32  secondary education             1   
4         0         340266         53  secondary education             1   

       family_status  family_status_id gender income_type  debt  total_income  \
0            married                 0      F    employee     0         40620   
1            married                 0      F    employee     0         17932   
2            married                 0      M    employee     0         23341   
3            married                 0      M    employee     0         42820   
4  civil partnership                 1      F     retiree     0         25378   

                   purpose  
0    purchase of the house 

### Conclusion
The `days_employed` column records how many days before the application the person started their current employment (time relative to the application). Negative values therefore refer to days employed before the application, and positive values to days after it. However, the very large positive values (for example 340266) most likely mean that `days_employed` is unknown for those rows.
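
If the large positive values are treated as unknown, one possible cleanup is to take absolute values and mark implausible durations as missing. This is a sketch under an assumed threshold of 60 years (`MAX_PLAUSIBLE_DAYS` is a hypothetical name and value, not something the dataset documents):

```python
import pandas as pd

# A few values in the style of the days_employed column.
days = pd.Series([-8437, -4024, 340266, -926])

# Assumption: nobody has been employed for more than 60 years.
MAX_PLAUSIBLE_DAYS = 60 * 365

# Work with magnitudes, then blank out values above the threshold.
days_abs = days.abs().astype(float)
days_clean = days_abs.where(days_abs <= MAX_PLAUSIBLE_DAYS)

print(days_clean)
```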


<p><a name="processdup"></a></p>

### Processing duplicates

In [14]:
# Now I want to find out how many duplicate entries the dataset has,
# so I call duplicated() and sum().

print('Duplicated rows in the dataset:', df.duplicated().sum())


Duplicated rows in the dataset: 54


In [15]:
duplicated = df.duplicated().sum()
total = df.index
number_of_rows = len(total)

In [16]:
duplicated_percentage = duplicated /  number_of_rows

In [17]:
print('the percentage of duplicates in the data:', "{:.2%}".format(duplicated_percentage))

the percentage of duplicates in the data: 0.25%


There are 54 duplicated rows in the dataset, which is 0.25% of our data. A duplicate credit entry means that the same credit account information has been recorded more than once, so it appears as two separate accounts.

In [18]:
# Since value_counts did not work for me here, I use drop_duplicates on the entire dataset.

df = df.drop_duplicates(subset=None, keep='first', inplace=False)

In [19]:
print(df.duplicated().sum())


0


In [20]:
# I call info() to verify the result.

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21471 entries, 0 to 21524
Data columns (total 12 columns):
children            21471 non-null int64
days_employed       21471 non-null int64
dob_years           21471 non-null int64
education           21471 non-null object
education_id        21471 non-null int64
family_status       21471 non-null object
family_status_id    21471 non-null int64
gender              21471 non-null object
income_type         21471 non-null object
debt                21471 non-null int64
total_income        21471 non-null int64
purpose             21471 non-null object
dtypes: int64(7), object(5)
memory usage: 2.1+ MB


<p><a name="dupconclusion"></a></p>

### Conclusion
The DataFrame had 21525 rows before the `drop_duplicates` call, and now there are 21471 entries, with the same non-null count in every column. I used the `drop_duplicates` method since it is the easiest way to remove duplicates from a DataFrame, and I passed these arguments:

- **subset**: takes a column label or a list of column labels; its default value is `None`. When columns are passed, only those columns are considered when identifying duplicates. I kept `None`, so all columns are compared.
- **keep**: *'first'* treats the first occurrence as unique and the remaining identical rows as duplicates.
- **inplace**: *False* returns a new DataFrame with the duplicates removed instead of modifying the original in place, which is why the result is assigned back to `df`.
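
The behaviour of these arguments can be seen on a small made-up frame (`toy` and its values are assumptions). The sketch also shows that the surviving rows keep their original index labels, which is why the index above still runs to 21524; `reset_index(drop=True)` is one optional way to renumber them:

```python
import pandas as pd

# Toy frame with rows 1 and 3 duplicating row 0 (values are made up).
toy = pd.DataFrame({
    "children": [1, 1, 0, 1],
    "education": ["secondary education", "secondary education",
                  "bachelor's degree", "secondary education"],
    "total_income": [17932, 17932, 40620, 17932],
})

# Defaults: subset=None, keep='first', inplace=False.
deduped = toy.drop_duplicates()

# The original frame is untouched because inplace=False.
print(len(toy), len(deduped))

# Surviving rows keep their old labels; reset_index renumbers them.
deduped_reset = deduped.reset_index(drop=True)
print(list(deduped.index), list(deduped_reset.index))
```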

<p><a name="Categ_data"></a></p>

### Categorizing Data
I categorized the data to see whether demographic variables such as marital status, number of children and total income have a direct effect on the financial variable, the loan purpose, in terms of paying back the bank loan.

In [21]:
# I rename some columns and keep the other names, so that the content of each column is easier to understand.
# Assigning to df.columns avoids the inplace argument of set_axis, which newer pandas versions no longer accept.

df.columns = ['children', 'days_employed', 'age', 'education', 'education_id', 'marital_status', 'status_id', 'gender', 'income_type', 'debt', 'total_income', 'purpose']

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21471 entries, 0 to 21524
Data columns (total 12 columns):
children          21471 non-null int64
days_employed     21471 non-null int64
age               21471 non-null int64
education         21471 non-null object
education_id      21471 non-null int64
marital_status    21471 non-null object
status_id         21471 non-null int64
gender            21471 non-null object
income_type       21471 non-null object
debt              21471 non-null int64
total_income      21471 non-null int64
purpose           21471 non-null object
dtypes: int64(7), object(5)
memory usage: 2.1+ MB


In [23]:
person_info = df[['children','marital_status' , 'total_income', 'purpose']]

In [24]:
print(person_info.head(10))

   children     marital_status  total_income  \
0         1            married         40620   
1         1            married         17932   
2         0            married         23341   
3         3            married         42820   
4         0  civil partnership         25378   
5         0  civil partnership         40922   
6         0            married         38484   
7         0            married         21731   
8         2  civil partnership         15337   
9         0            married         23108   

                               purpose  
0                purchase of the house  
1                         car purchase  
2                purchase of the house  
3              supplementary education  
4                    to have a wedding  
5                purchase of the house  
6                 housing transactions  
7                            education  
8                     having a wedding  
9  purchase of the house for my family  


In [25]:
person_info_grouped = person_info.groupby('marital_status').count()

# To get a better view of the data I need for scoring, I take these columns out and group them
# by marital status.

In [26]:
print(person_info_grouped)

                   children  total_income  purpose
marital_status                                    
civil partnership      4163          4163     4163
divorced               1195          1195     1195
married               12344         12344    12344
unmarried              2810          2810     2810
widow / widower         959           959      959


I specifically chose these 4 columns since I wanted to see whether demographic variables like `marital_status` and `children` have an effect on the financial variables `total_income` and `purpose`.

In [27]:
# Grouping by the demographic variable marital_status and
# passing agg the count, min, median and max of total_income.

info_grouped = df.groupby('marital_status').agg({'total_income': ['count', 'min', 'median', 'max']})

In [28]:
print(info_grouped)

                  total_income                       
                         count   min   median     max
marital_status                                       
civil partnership         4163  3392  24903.0  276204
divorced                  1195  5402  25357.0  216039
married                  12344  3306  25216.0  362496
unmarried                 2810  3913  25046.5  274402
widow / widower            959  5443  21903.0  117616


In [29]:
print(df['marital_status'].value_counts(normalize=True) * 100)

married              57.491500
civil partnership    19.388943
unmarried            13.087420
divorced              5.565647
widow / widower       4.466490
Name: marital_status, dtype: float64


Our results indicate that 57.49% of the clients are married, with an average income of 27016. I have to take these two figures into account because they might affect repayment capability.

In [30]:
purpose_grouped = df.groupby('purpose').agg({'total_income': ['count', 'min', 'median', 'max']})

In [31]:
print(purpose_grouped.head(10))

                                total_income                       
                                       count   min   median     max
purpose                                                            
building a property                      619  4664  24490.0  352136
building a real estate                   625  5217  24519.0  107520
buy commercial real estate               662  4245  23690.0  205804
buy real estate                          621  4650  26221.0  198426
buy residential real estate              606  5029  25876.5  125832
buying a second-hand car                 478  5639  23345.5  137492
buying my own car                        505  5579  25469.0  107686
buying property for renting out          652  5195  25127.0  115521
car                                      494  5452  25194.0  228469
car purchase                             461  3418  24660.0  130563


In [32]:
print(df['purpose'].value_counts(normalize=True) * 100)

wedding ceremony                            3.693354
having a wedding                            3.600205
to have a wedding                           3.581575
real estate transactions                    3.143775
buy commercial real estate                  3.083229
housing transactions                        3.036654
buying property for renting out             3.036654
transactions with commercial real estate    3.027339
purchase of the house                       3.008709
housing                                     3.008709
purchase of the house for my family         2.971450
construction of own property                2.957478
property                                    2.948163
transactions with my real estate            2.920218
building a real estate                      2.910903
buy real estate                             2.892273
purchase of my own house                    2.887616
building a property                         2.882958
housing renovation                          2.

As we can see here, the 4 major purposes of the loans are buying a house or other real property, wedding expenses, buying a car, and education.
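
Because the same purpose is worded in many different ways, the purposes could be collapsed into these four groups with simple keyword matching. The function below is a hypothetical sketch (`purpose_category` and its keyword list are assumptions and may need extending for wordings not shown here):

```python
import pandas as pd

def purpose_category(purpose: str) -> str:
    """Map a free-text loan purpose to one of four broad groups."""
    text = purpose.lower()
    if "wedding" in text:
        return "wedding"
    if "car" in text:
        return "car"
    if "educat" in text or "university" in text:
        return "education"
    if any(word in text for word in ("house", "housing", "estate", "property")):
        return "real estate"
    return "other"

# A few purposes copied from the outputs above.
purposes = pd.Series([
    "purchase of the house",
    "car purchase",
    "supplementary education",
    "to have a wedding",
    "housing transactions",
    "buying property for renting out",
])

categories = purposes.apply(purpose_category)
print(categories.value_counts())
```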

In [33]:
print(df['total_income'].value_counts(normalize=True) * 100)

26787    9.873783
19552    0.032602
23344    0.027945
17855    0.023287
15835    0.023287
           ...   
37016    0.004657
30869    0.004657
14780    0.004657
22673    0.004657
40964    0.004657
Name: total_income, Length: 15389, dtype: float64


In [34]:
def income_level(income):
    
    if income <= 15000:
        return 'low'
    if income <= 30000:
        return 'middle'
    return 'high' 

In [35]:
df['income_level'] =df['total_income'].apply(income_level)
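
As an alternative to the hand-written function, the same three levels can be produced with `pd.cut`. This is a sketch using the same thresholds as `income_level` (15000 and 30000); the sample incomes are made up apart from the minimum and maximum seen above:

```python
import numpy as np
import pandas as pd

income = pd.Series([3306, 15000, 15001, 30000, 362496])

# Right-closed bins: (-inf, 15000], (15000, 30000], (30000, inf),
# which matches the "<=" comparisons in income_level.
levels = pd.cut(
    income,
    bins=[-np.inf, 15000, 30000, np.inf],
    labels=["low", "middle", "high"],
)

print(levels.tolist())
```

The result is a categorical column, so the levels keep their low, middle, high order when grouped or sorted.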

In [36]:
print(df.head(20))

    children  days_employed  age            education  education_id  \
0          1          -8437   42    bachelor's degree             0   
1          1          -4024   36  secondary education             1   
2          0          -5623   33  Secondary Education             1   
3          3          -4124   32  secondary education             1   
4          0         340266   53  secondary education             1   
5          0           -926   27    bachelor's degree             0   
6          0          -2879   43    bachelor's degree             0   
7          0           -152   50  SECONDARY EDUCATION             1   
8          2          -6929   35    BACHELOR'S DEGREE             0   
9          0          -2188   41  secondary education             1   
10         2          -4171   36    bachelor's degree             0   
11         0           -792   40  secondary education             1   
12         0          63046   65  secondary education             1   
13    

In [37]:
pd.pivot_table(df,index=["marital_status","children", "purpose"],values=["total_income"], aggfunc=[np.sum, len])

| marital_status | children | purpose | sum (total_income) | len (total_income) |
|---|---|---|---|---|
| civil partnership | -1 | having a wedding | 21102 | 1 |
| civil partnership | -1 | profile education | 16450 | 1 |
| civil partnership | -1 | to buy a car | 51456 | 1 |
| civil partnership | -1 | to have a wedding | 15700 | 1 |
| civil partnership | -1 | to own a car | 8705 | 1 |
| ... | ... | ... | ... | ... |
| widow / widower | 3 | university education | 22735 | 1 |
| widow / widower | 4 | housing transactions | 26235 | 1 |
| widow / widower | 20 | housing transactions | 22687 | 2 |
| widow / widower | 20 | purchase of the house | 26787 | 1 |


<p><a name="Categ_conclusion"></a></p>

### Conclusion

Our results indicate that 57.49% of the clients are married, with an average income of 27016. I have to take these two figures into account because they might affect repayment capability. As for the purpose, most loans are related to real property purchases, followed by car purchases.

<p><a name="Step 3"></a></p>

### Step 3. Answer these questions

- Is there a relation between having kids and repaying a loan on time?

The number of children a client has likely affects the ability to repay on time, because more children mean more financial pressure. For example, among the widowed clients in the pivot table, a client with 4 children and a total income of 26235 borrowed for housing transactions. A client with the same loan purpose and no children would face less financial pressure at the same or even a somewhat lower income. Note that the 22687 shown for the same purpose is the combined income of two clients recorded with 20 children, so it is not a like-for-like comparison. Confirming the relation requires comparing the `debt` rate across the values of `children`.
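
A direct way to examine this question is to compute the share of clients with `debt == 1` for each number of children. The following is only a sketch on a made-up toy frame (`toy` and its values are assumptions, so the rates say nothing about the real data):

```python
import pandas as pd

# Toy data: debt is 1 if the client defaulted, 0 otherwise.
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 2],
    "debt":     [0, 0, 0, 1, 0, 1, 1, 1],
})

# Clients, defaults and default rate per number of children.
by_children = toy.groupby("children")["debt"].agg(
    clients="count",
    defaults="sum",
    default_rate="mean",
)

print(by_children)
```

Because `debt` is coded as 0 or 1, its mean within each group is the default rate for that group.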

### Conclusion

- Is there a relation between marital status and repaying a loan on time?

There is a relation between marital status and repaying the loan on time. In our data, 57.49% of the clients are married. Married couples have the flexibility to use both of their incomes to pay back the loan if financial stress arises. For divorced clients, who make up around 5.6% of the data, marital status could also be a factor, since the client might have monthly alimony or childcare payments. These regular payments factor into the debt-to-income ratio, which in turn affects repaying the loan on time.
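
Since the report is about marital status and children together, a two-way table of default rates may help. This sketch uses a made-up toy frame and a hypothetical helper column `has_children`; the rates are illustrative only:

```python
import pandas as pd

toy = pd.DataFrame({
    "marital_status": ["married", "married", "married", "married",
                       "divorced", "divorced", "divorced", "divorced"],
    "children": [0, 0, 2, 1, 0, 0, 1, 3],
    "debt":     [0, 0, 0, 1, 0, 1, 1, 1],
})

# Hypothetical helper column: does the client have any children?
toy["has_children"] = (toy["children"] > 0).map(
    {True: "with children", False: "no children"}
)

# Mean of debt (the default rate) for each status and children group.
rate_table = toy.pivot_table(
    index="marital_status",
    columns="has_children",
    values="debt",
    aggfunc="mean",
)

print(rate_table)
```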


### Conclusion

- Is there a relation between income level and repaying a loan on time?

Income and credit score are usually positively correlated, since a higher income is associated with a higher probability of paying back the loan.
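
To test this on the data rather than assume it, the `income_level` column created earlier could be cross-tabulated against `debt`. This is a sketch on made-up values (the frame `toy` is an assumption):

```python
import pandas as pd

toy = pd.DataFrame({
    "income_level": ["low", "low", "middle", "middle",
                     "middle", "middle", "high", "high"],
    "debt":         [1, 0, 0, 0, 0, 1, 0, 0],
})

# Row-normalised crosstab: each row sums to 1, so column 1 is the
# share of clients in that income level who defaulted.
shares = pd.crosstab(toy["income_level"], toy["debt"], normalize="index")

print(shares)
```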

### Conclusion

- How do different loan purposes affect on-time repayment of the loan?

The dataset mostly consists of personal loans. The loan purpose affects on-time repayment in the sense that a smaller loan size and a shorter maturity period decrease credit risk.

<p><a name="Step 4"></a></p>

### Step 4. General conclusion

I therefore conclude that the demographic variables, namely marital status, number of children and total income, are correlated with the financial variable, the loan purpose, in terms of on-time repayment of the loan. This is due to the financial stress that these demographic factors may cause during the loan repayment period.