# Research on bank borrowers


The customer is the bank's credit department. The task is to find out whether a client's marital status and number of children affect whether a loan is repaid on time. The input data from the bank are statistics on clients' solvency.
The results of the study will be taken into account when building a credit scoring model, a system that assesses the ability of a potential borrower to repay a loan to the bank.

The dataframe contains:

- `children`: number of children in the family
- `days_employed`: total work experience in days
- `dob_years`: client's age in years
- `education`: client's education level
- `education_id`: education level identifier
- `family_status`: marital status
- `family_status_id`: marital status identifier
- `gender`: client's gender
- `income_type`: employment type
- `debt`: whether the client has ever been overdue on a loan (`1` for yes, `0` for no)
- `total_income`: monthly income
- `purpose`: purpose of the loan

## 1. Data Acquaintance

**We import the dataset to examine the data**

In [1]:
import pandas as pd

try:
    data = pd.read_csv('/datasets/data.csv')
except FileNotFoundError:
    # Fall back to the remote copy when the local file is not available.
    data = pd.read_csv('https://code.s3.yandex.net/datasets/data.csv')

**We display the first 20 rows**

In [2]:
data.head(20)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля
2,0,-5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу
5,0,-926.185831,27,высшее,0,гражданский брак,1,M,компаньон,0,255763.565419,покупка жилья
6,0,-2879.202052,43,высшее,0,женат / замужем,0,F,компаньон,0,240525.97192,операции с жильем
7,0,-152.779569,50,СРЕДНЕЕ,1,женат / замужем,0,M,сотрудник,0,135823.934197,образование
8,2,-6929.865299,35,ВЫСШЕЕ,0,гражданский брак,1,F,сотрудник,0,95856.832424,на проведение свадьбы
9,0,-2188.756445,41,среднее,1,женат / замужем,0,M,сотрудник,0,144425.938277,покупка жилья для семьи


**We show the basic details of the data using the `info()` method.**

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


## 2. Data preprocessing



### 2.1 Removing gaps

**We output the number of missing values for each column**

In [4]:
data.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

**There are missing values in two columns. One of them, `total_income`, stores income data. Income depends most on the type of employment, so we fill the gaps in this column with the median income for each type from the `income_type` column. For example, for a client with the employment type `сотрудник` (employee), a gap in `total_income` is filled with the median income of all records of that type.**

In [5]:
for t in data['income_type'].unique():
    data.loc[(data['income_type'] == t) & (data['total_income'].isna()), 'total_income'] = \
    data.loc[(data['income_type'] == t), 'total_income'].median()
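The loop above can also be written without explicit iteration. The following is only a sketch on made-up toy data (the English `income_type` labels and the income values are assumptions, not values from the dataset); it relies on `groupby(...).transform('median')`, which returns the group median aligned to every original row.

```python
import numpy as np
import pandas as pd

# Toy data: column names match the notebook, values are invented for illustration.
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree'],
    'total_income': [100.0, 300.0, np.nan, 50.0, np.nan],
})

# Median income per employment type, repeated for each row of that type.
group_median = toy.groupby('income_type')['total_income'].transform('median')

# Fill only the missing values; existing incomes stay untouched.
toy['total_income'] = toy['total_income'].fillna(group_median)
print(toy)
```

Whether this reads better than the loop is a matter of taste; the result should be the same as long as every employment type has at least one known income.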

### 2.2 Anomaly Handling


**The data contain artifacts (anomalies): values that do not reflect reality and appeared because of some error. One such artifact is a negative number of days of work experience in the `days_employed` column. We process the values in this column by replacing all negative values with positive ones using the `abs()` method.**


In [6]:
data['days_employed'] = data['days_employed'].abs()

**For each type of employment, we will display the median `days_employed` in days.**

In [7]:
data.groupby('income_type')['days_employed'].agg('median')

income_type
безработный        366413.652744
в декрете            3296.759962
госслужащий          2689.368353
компаньон            1547.382223
пенсионер          365213.306266
предприниматель       520.848083
сотрудник            1574.202821
студент               578.751554
Name: days_employed, dtype: float64

Two employment types (the unemployed and pensioners) have abnormally large values. Correcting such values is difficult, so we leave them as they are. Moreover, this column is not needed for the research.


**We will display a list of unique values in the `children` column**

In [8]:
data['children'].unique()

array([ 1,  0,  3,  2, -1,  4, 20,  5])

**There are two anomalous values in the `children` column: `-1` and `20`. We will remove rows that contain these values from the dataframe.**

In [9]:
data = data[(data['children'] != -1) & (data['children'] != 20)]

**Once again, we will check the data in the `children` column.**

In [10]:
data['children'].unique()

array([1, 0, 3, 2, 4, 5])
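Filtering with a boolean mask, as in the cell above, produces a new object that pandas may treat as a slice of the original dataframe, and later column assignments on it can trigger `SettingWithCopyWarning`. A minimal sketch of one common way to avoid that, on invented toy data (the names `toy` and `clean` are hypothetical), is to take an explicit copy after filtering:

```python
import pandas as pd

# Toy data: values are invented for illustration.
toy = pd.DataFrame({'children': [1, 0, -1, 20, 2],
                    'education': ['A', 'b', 'C', 'd', 'E']})

# ~isin(...) keeps the rows whose value is not in the list of anomalies;
# .copy() makes the result an independent dataframe.
clean = toy[~toy['children'].isin([-1, 20])].copy()

# Assigning to a column of the copy leaves `toy` untouched.
clean['education'] = clean['education'].str.lower()
print(clean)
```

The `isin` form is equivalent to the two chained `!=` comparisons and is easier to extend if more anomalous values turn up.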

### 2.3 Removing gaps (continued)




**We will fill in the gaps in the `days_employed` column with the median values for each `income_type` employment type.**

In [11]:
for t in data['income_type'].unique():
    data.loc[(data['income_type'] == t) & (data['days_employed'].isna()), 'days_employed'] = \
    data.loc[(data['income_type'] == t), 'days_employed'].median()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)

**We will check that there are no gaps in the data.**

In [12]:
data.isna().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

### 2.4 Changing Data Types

**We will replace the float type in the `total_income` column with an integer type using the `astype()` method.**

In [13]:
data['total_income'] = data['total_income'].astype(int)

### 2.5 Duplicate Handling

**We will check the data for duplicates and remove them**

In [14]:
data.duplicated().sum()

54

In [15]:
data = data.drop_duplicates()
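The cell above drops exact duplicates, while the `education` values are converted to lower case only in the next step. The sketch below, on invented toy data, illustrates why the order can matter: rows that differ only in letter case are not counted as duplicates until the case has been normalized. It is an illustration of the effect, not a statement about how many such rows the real dataset contains.

```python
import pandas as pd

# Toy data: three rows that differ only in the letter case of `education`.
toy = pd.DataFrame({'education': ['среднее', 'Среднее', 'СРЕДНЕЕ', 'высшее'],
                    'debt': [0, 0, 0, 1]})

before = toy.duplicated().sum()   # case differences hide the duplicates
toy['education'] = toy['education'].str.lower()
after = toy.duplicated().sum()    # the same rows are now exact duplicates

deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(before, after, len(deduplicated))
```

Re-running `duplicated().sum()` after the lowercasing step is a cheap way to check whether any duplicates remain.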

**We will handle implicit duplicates in the `education` column. This column has the same values, but written differently: using uppercase and lowercase letters. We will convert them to lower case.**

In [16]:
data['education'] = data['education'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


### 2.6 Data categorization

**We will create a `total_income_category` column in the `data` dataframe with categories:**

- 0–30000: `'E'`;
- 30001–50000: `'D'`;
- 50001–200000: `'C'`;
- 200001–1000000: `'B'`;
- 1000001 and above: `'A'`.


**For example, a borrower with an income of 25,000 should be assigned the `'E'` category, and a client with an income of 235,000 the `'B'` category.**

In [17]:
def categorize_income(income):
    try:
        if 0 <= income <= 30000:
            return 'E'
        elif 30001 <= income <= 50000:
            return 'D'
        elif 50001 <= income <= 200000:
            return 'C'
        elif 200001 <= income <= 1000000:
            return 'B'
        elif income >= 1000001:
            return 'A'
    except:
        pass

In [18]:
data['total_income_category'] = data['total_income'].apply(categorize_income)
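As an alternative to applying a Python function row by row, the same binning can be sketched with `pd.cut`. The income values below are invented for illustration; the bin edges are taken from the category list above, and the lower edge of `-1` is an assumption made so that an income of 0 falls into `'E'` (by default `pd.cut` uses intervals that are open on the left and closed on the right).

```python
import pandas as pd

# Invented integer incomes, including values on the category boundaries.
incomes = pd.Series([25000, 30000, 30001, 235000, 1000000, 1500000])

# Right-closed bins: (-1, 30000], (30000, 50000], (50000, 200000], ...
bins = [-1, 30000, 50000, 200000, 1000000, float('inf')]
labels = ['E', 'D', 'C', 'B', 'A']

categories = pd.cut(incomes, bins=bins, labels=labels)
print(categories.tolist())
```

For integer incomes these bins match the ranges in the list; the result is a categorical column, which also keeps the order of the categories.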

**We will display a list of unique loan purposes from the `purpose` column**

In [19]:
data['purpose'].unique()

array(['покупка жилья', 'приобретение автомобиля',
       'дополнительное образование', 'сыграть свадьбу',
       'операции с жильем', 'образование', 'на проведение свадьбы',
       'покупка жилья для семьи', 'покупка недвижимости',
       'покупка коммерческой недвижимости', 'покупка жилой недвижимости',
       'строительство собственной недвижимости', 'недвижимость',
       'строительство недвижимости', 'на покупку подержанного автомобиля',
       'на покупку своего автомобиля',
       'операции с коммерческой недвижимостью',
       'строительство жилой недвижимости', 'жилье',
       'операции со своей недвижимостью', 'автомобили',
       'заняться образованием', 'сделка с подержанным автомобилем',
       'получение образования', 'автомобиль', 'свадьба',
       'получение дополнительного образования', 'покупка своего жилья',
       'операции с недвижимостью', 'получение высшего образования',
       'свой автомобиль', 'сделка с автомобилем',
       'профильное образование', 'высшее об

**We will create a function that, based on the data from the `purpose` column, will form a new `purpose_category` column, which will include the following categories:**

- `'car operations'`,
- `'real estate transactions'`,
- `'conducting a wedding'`,
- `'getting an education'`.

**For example, if the `purpose` column contains the substring `'car purchase'`, then the `purpose_category` column should contain the string `'car operations'`**



In [20]:
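def categorize_purpose(row):
    # Match on Russian word stems, because the purposes are free-text strings.
    try:
        if 'автом' in row:
            return 'car operations'
        elif 'жил' in row or 'недвиж' in row:
            return 'real estate transactions'
        elif 'свад' in row:
            return 'conducting a wedding'
        elif 'образов' in row:
            return 'getting an education'
    except TypeError:
        # Non-string values (for example NaN) cannot be searched for substrings.
        pass
    # Nothing matched, or the value was not a string.
    return 'no category'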

In [21]:
data['purpose_category'] = data['purpose'].apply(categorize_purpose)
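The stem matching can also be driven by a lookup table, which makes it easy to list the purposes that no stem covers. This is a sketch only: `STEM_TO_CATEGORY` and `categorize_by_stem` are hypothetical names, the stems are the ones used in the function above, and the purpose `'ремонт'` is an invented value added to show the fallback.

```python
import pandas as pd

# Stem -> category, checked in insertion order.
STEM_TO_CATEGORY = {
    'автом': 'car operations',
    'жил': 'real estate transactions',
    'недвиж': 'real estate transactions',
    'свад': 'conducting a wedding',
    'образов': 'getting an education',
}

def categorize_by_stem(purpose):
    """Return the category of the first stem found in the purpose string."""
    for stem, category in STEM_TO_CATEGORY.items():
        if stem in purpose:
            return category
    return 'no category'

# The last value is invented and matches no stem.
purposes = pd.Series(['покупка жилья', 'свой автомобиль', 'свадьба',
                      'получение образования', 'ремонт'])
result = purposes.apply(categorize_by_stem)

# Purposes that fell through to the fallback category.
unmatched = purposes[result == 'no category']
print(result.tolist())
print(unmatched.tolist())
```

Inspecting `unmatched` after categorizing real data is a quick check that the stems cover every purpose.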

## 3. Exploratory data analysis

**3.1 Is there a relationship between the number of children and loan repayment on time?**

*Let's write two functions to create two pivot tables, which we will need further*

In [22]:
def group_data(data, column_name1, column_name2):
    # Sum, count and mean of column_name2 for each value of column_name1.
    return pd.pivot_table(data, index=column_name1, values=column_name2,
                          aggfunc=['sum', 'count', 'mean'])

In [23]:
def group_data_2(data, column_name1, column_name2, column_name3):
    # Same aggregation, grouped by two index columns.
    return pd.pivot_table(data, index=[column_name1, column_name2],
                          values=column_name3, aggfunc=['sum', 'count', 'mean'])
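The two helpers differ only in the number of index columns, so they could be merged into one. The sketch below is one possible variant, not the notebook's method: `debt_summary` is a hypothetical name, the toy data are invented, and it uses `groupby` with named aggregation instead of `pivot_table`, which also gives flat, readable column names.

```python
import pandas as pd

def debt_summary(df, index_columns, value_column='debt'):
    """Return the number of debtors, the group size and the debt share per group.

    index_columns may be a single column name or a list of names.
    """
    return (
        df.groupby(index_columns)[value_column]
          .agg(debtors='sum', clients='count', share='mean')
    )

# Invented toy data with the notebook's column names.
toy = pd.DataFrame({'children': [0, 0, 0, 1, 1],
                    'gender': ['F', 'M', 'F', 'F', 'M'],
                    'debt': [0, 1, 0, 1, 1]})

by_children = debt_summary(toy, 'children')
by_children_gender = debt_summary(toy, ['children', 'gender'])
print(by_children)
print(by_children_gender)
```

Because `debt` holds only 0 and 1, the mean of the column within a group is the share of clients with a debt in that group.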

**To trace the relationship between the number of children and repayment on time, we group the data by the `children` column and aggregate the `debt` column with three functions: `sum` gives the number of clients with a debt for each number of children, `count` gives the sample size of each category, and `mean` gives the share of clients with a debt in the category, since the `debt` column takes the value "1" (there is a debt) or "0" (there is no debt).**

In [24]:
group_children  = group_data(data,'children','debt')
group_children

children,debt (sum),debt (count),debt (mean)
0,1063,14107,0.075353
1,444,4809,0.092327
2,194,2052,0.094542
3,27,330,0.081818
4,4,41,0.097561
5,0,9,0.0


<div style="border:solid green 2px; padding: 20px">
    
**Conclusion:**
Clients with a debt make up about 7.5% to 10% of each group (excluding the group with five children, which has only 9 records and no debts). <br/>
The samples of clients with more than three children are too small (41 and 9 records) to compare with the rest of the data.<br/> Clients without children have a debt rate about 2 percentage points lower than clients with one or two children (7.5% against 9.2% and 9.5%). Note, however, that the childless sample (14,107 records) is about twice as large as all samples of clients with children combined.<br/> The difference between clients with one child and clients with two children is insignificant, about 0.2 percentage points.

</div>
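One way to make the remark about insufficient samples more concrete is to attach an interval estimate to each share. The sketch below computes an approximate 95% Wilson score interval in plain Python; the counts are taken from the pivot table above, while the choice of the Wilson interval and the helper name `wilson_interval` are assumptions of this example, not part of the original analysis.

```python
import math

def wilson_interval(debtors, clients, z=1.96):
    """Approximate 95% Wilson score interval for the share of debtors."""
    p = debtors / clients
    denom = 1 + z**2 / clients
    center = (p + z**2 / (2 * clients)) / denom
    half = z * math.sqrt(p * (1 - p) / clients + z**2 / (4 * clients**2)) / denom
    return center - half, center + half

# (children, debtors, clients) from the grouping by number of children.
for children, debtors, clients in [(0, 1063, 14107), (1, 444, 4809), (4, 4, 41)]:
    low, high = wilson_interval(debtors, clients)
    print(f'{children} children: {debtors / clients:.3f} [{low:.3f}, {high:.3f}]')
```

On these counts the interval for four children should come out several times wider than the intervals for the large groups, which is consistent with treating that group as too small for comparison.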

*Let's add gender to the grouping to split the sample and see additional trends*

In [25]:
group_children_gender  = group_data_2(data,'children','gender','debt')
group_children_gender

children,gender,debt (sum),debt (count),debt (mean)
0,F,592,9534,0.062094
0,M,471,4572,0.103018
0,XNA,0,1,0.0
1,F,245,3086,0.079391
1,M,199,1723,0.115496
2,F,134,1256,0.106688
2,M,60,796,0.075377
3,F,17,196,0.086735
3,M,10,134,0.074627
4,F,1,28,0.035714

*The grouping reveals a single record with the gender value `XNA`. One record is not enough for any conclusion, so we remove it from the dataframe.*


In [26]:
data = data.loc[data['gender'] != 'XNA']

In [27]:
data['gender'].unique()

array(['F', 'M'], dtype=object)

*Let's display the grouping once more*

In [28]:
group_children_gender  = group_data_2(data,'children','gender','debt')
group_children_gender

children,gender,debt (sum),debt (count),debt (mean)
0,F,592,9534,0.062094
0,M,471,4572,0.103018
1,F,245,3086,0.079391
1,M,199,1723,0.115496
2,F,134,1256,0.106688
2,M,60,796,0.075377
3,F,17,196,0.086735
3,M,10,134,0.074627
4,F,1,28,0.035714
4,M,3,13,0.230769


<div style="border:solid green 2px; padding: 20px">
    
**Conclusion:**
When the data are split by gender, childless men turn out to have a debt more often (10.3%) than childless clients in general (7.5%), and more often than the overall rate in any group with children. Childless women, by contrast, have the lowest debt rate among the well-populated groups (6.2%), lower than for childless clients as a whole. Among the groups with enough data, the highest debt rate (more than 11%) is found for men with one child.<br/> The samples of clients with more than three children are still too small and require more data.

</div>
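To check whether a gap between two groups is larger than chance variation would suggest, a two-proportion z-test can be sketched in plain Python. The counts for childless men and women come from the table above; the test itself and the helper name `two_proportion_z_test` are additions of this example, and the notebook does not perform any significance testing.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Return the z statistic and two-sided p-value for H0: p1 == p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Childless men (471 of 4572) against childless women (592 of 9534).
z, p_value = two_proportion_z_test(471, 4572, 592, 9534)
print(f'z = {z:.2f}, p-value = {p_value:.2g}')
```

The same function can be applied to any pair of rows in the grouped tables; for the small groups the normal approximation behind the test becomes unreliable, so the result should be read with caution there.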

**3.2 Is there a relationship between marital status and loan repayment on time?**

*To answer this question, we group the data by the marital status of clients. We group by the `family_status` column and aggregate the `debt` column with three functions: `sum` gives the number of clients with a debt for each marital status, `count` gives the sample size of each category, and `mean` gives the share of clients with a debt in the category, because the `debt` column takes the value "1" (debt) or "0" (no debt).*

In [29]:
group_family = group_data(data,'family_status','debt')
group_family 


family_status,debt (sum),debt (count),debt (mean)
Не женат / не замужем,273,2796,0.097639
в разводе,84,1189,0.070648
вдовец / вдова,63,951,0.066246
гражданский брак,385,4145,0.092883
женат / замужем,927,12266,0.075575


<div style="border:solid green 2px; padding: 20px">
    
    
**Conclusion:**
It is difficult to speak of a direct relationship between marital status and debts: in every group, clients with a debt make up about 7% to 10% of the total.<br/> The group sizes are more balanced than in the grouping by number of children. Two groups stand out with the highest share of debts: "single / not married" (9.8%) and "civil marriage" (9.3%), about 2 to 3 percentage points above the other groups. <br/>The remaining groups are at approximately the same level (about 7%).

</div>

*We can also look at the relationship of marital status and gender along with the presence of loan debt. To do this, add the 'gender' column to the grouping*

In [30]:
group_family_gender = group_data_2(data, 'family_status', 'gender', 'debt')
group_family_gender

family_status,gender,debt (sum),debt (count),debt (mean)
Не женат / не замужем,F,118,1723,0.068485
Не женат / не замужем,M,155,1073,0.144455
в разводе,F,61,931,0.065521
в разводе,M,23,258,0.089147
вдовец / вдова,F,52,896,0.058036
вдовец / вдова,M,11,55,0.2
гражданский брак,F,232,2843,0.081604
гражданский брак,M,153,1302,0.117512
женат / замужем,F,526,7714,0.068188
женат / замужем,M,401,4552,0.088093


<div style="border:solid green 2px; padding: 20px">
    
    
**Conclusion:**
This grouping shows that gender is also strongly associated with loan debt: in the "single / not married" group the share of men with a debt reaches 14%, and in the "civil marriage" group it is almost 12%, which is higher than in the other well-populated groups and higher than in the data as a whole. Widowed men show an even higher share (20%), but that sample contains only 55 records.

</div>

**3.3. Is there a relationship between income level and loan repayment on time?**

*Below we group the data by income level, using the `total_income_category` column, to see whether financial well-being is related to the ability to repay a loan.*

In [31]:
total_income_group = group_data(data,'total_income_category','debt')
total_income_group 


total_income_category,debt (sum),debt (count),debt (mean)
A,2,25,0.08
B,354,5013,0.070616
C,1353,15938,0.084891
D,21,349,0.060172
E,2,22,0.090909


<div style="border:solid green 2px; padding: 20px">
    
**Conclusion:**
For two categories, the clients with the highest incomes (category A) and the clients with the lowest incomes (category E), the samples are too small (25 and 22 records). For the remaining categories the picture is fairly homogeneous, from 6.0% in category D to 8.5% in category C, and no group clearly stands out.

</div>

*We will also try to split the data by adding gender to the grouping.*

In [32]:
total_income_group_gender = group_data_2(data,'total_income_category','gender','debt')
total_income_group_gender

total_income_category,gender,debt (sum),debt (count),debt (mean)
A,F,0,10,0.0
A,M,2,15,0.133333
B,F,162,2695,0.060111
B,M,192,2318,0.08283
C,F,810,11078,0.073118
C,M,543,4860,0.111728
D,F,15,308,0.048701
D,M,6,41,0.146341
E,F,2,16,0.125
E,M,0,6,0.0


<div style="border:solid green 2px; padding: 20px">

**Conclusion:**
The additional grouping by gender shows that, among the well-populated groups, men with an average income (category C) stand out with a debt share above 11%. Men with a low income (category D) show an even higher share (14.6%), but that sample contains only 41 records.
<br/>Several groups are too small for reliable conclusions, so more data need to be accumulated for further research.

</div>

**3.4. How do different purposes of a loan affect its repayment on time?**





*We group the data on the same principle, this time by the purpose for which the loan was taken, using the `purpose_category` column.*

In [33]:
purpose_group = group_data(data,'purpose_category','debt')
purpose_group 

purpose_category,debt (sum),debt (count),debt (mean)
car operations,400,4281,0.093436
conducting a wedding,183,2324,0.078744
getting an education,369,3989,0.092504
real estate transactions,780,10753,0.072538


<div style="border:solid green 2px; padding: 20px">


**Conclusion:**
Loan debt occurs most often when the loan is taken for car operations or for education: in about 9% of cases, against about 7% for real estate transactions and about 8% for weddings.<br/> Note also that the sample for real estate transactions (10,753 records) is about 2.5 to 4.6 times larger than the samples for the other loan purposes.

</div>

*We will also look for a dependence on gender by adding it to the grouping*

In [34]:
purpose_group_gender = group_data_2(data,'purpose_category','gender','debt')
purpose_group_gender 

purpose_category,gender,debt (sum),debt (count),debt (mean)
car operations,F,232,2845,0.081547
car operations,M,168,1436,0.116992
conducting a wedding,F,107,1564,0.068414
conducting a wedding,M,76,760,0.1
getting an education,F,206,2650,0.077736
getting an education,M,163,1339,0.121733
real estate transactions,F,444,7048,0.062997
real estate transactions,M,336,3705,0.090688


<div style="border:solid green 2px; padding: 20px">


**Conclusion:** When the data are split by gender, the riskiest group is men who take a loan for car operations or for education (about 12% have a debt).

</div>

## General conclusion

<div style="border:solid pink 2px; padding: 20px">
 1. The study reveals some trends that link the grouping criteria, including marital status and the number of children, to the repayment of a loan on time.<br/>
2. Clients without children repay on time more often: only 7.5% of clients in this group had a debt, against 9% to 10% of clients with one or two children. When clients are split by gender, however, the debt share among childless men rises to about 10%.<br/> For clients with 3 or more children, more data and additional research are needed.<br/>
3. The groups most prone to loan debt by marital status are "single / not married" (9.8%) and "civil marriage" (9.3%), about 2 to 3 percentage points above the other groups. Within these groups, men become debtors more often (14% and 12% respectively).<br/>
4. The level of income does not greatly affect loan debt: from 6% in the low-income category (category D) to 8.5% in the average-income category (category C). When gender is also considered, men with an average income (category C) stand out with a debt share above 11%; men in category D show 14.6%, but on only 41 records. The samples for clients with very high income (category A) and very low income (category E) are too small, and more research is needed on them.<br/>
5. The purpose of the loan is also related to debt: loans for car operations and for education end in debt in about 9% of cases. There is also a link with gender: men in these categories do not repay loans on time in about 12% of cases.<br/>
6. According to the study, the client with the highest risk of loan debt is a single, childless man with a low or average income who takes a loan for a car or for education.
</div>
 