# Description of the project

The customer is the bank's credit department. The task is to determine whether a client's marital status and number of children affect whether the loan is repaid on time. The input data from the bank are statistics on customers' solvency.

The results of the study will be taken into account when building the **credit scoring** model, a special system that evaluates the ability of a potential borrower to repay a loan to a bank.

## General information

In [1]:
# Import packages
import pandas as pd
%pip install pymystem3
from pymystem3 import Mystem
m = Mystem()


Note: you may need to restart the kernel to use updated packages.


In [2]:
# Read the data
credit_info = pd.read_csv('credit_info.csv', index_col=0) 

In [3]:
# Info
credit_info.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.1+ MB


### Summary

The file has 12 columns. Two contain floating-point numbers (total work experience in days and monthly income). Five contain integers (number of children, client's age in years, education level identifier, marital status identifier, and whether the client has ever been in debt on a loan). Five contain strings (education level, marital status, gender, type of employment, and purpose of the loan). The days_employed and total_income columns have fewer non-null values (19351) than the other columns (21525), so they contain missing values.

In [4]:
# Show the first rows where total_income is missing
credit_info[credit_info['total_income'].isnull()].head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,среднее,1,гражданский брак,1,M,пенсионер,0,,сыграть свадьбу
26,0,,41,среднее,1,женат / замужем,0,M,госслужащий,0,,образование
29,0,,63,среднее,1,Не женат / не замужем,4,F,пенсионер,0,,строительство жилой недвижимости
41,0,,50,среднее,1,женат / замужем,0,F,госслужащий,0,,сделка с подержанным автомобилем
55,0,,54,среднее,1,гражданский брак,1,F,пенсионер,1,,сыграть свадьбу


In [5]:
# First 5 rows
credit_info.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875.639453,покупка жилья
1,1,-4024.803754,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080.014102,приобретение автомобиля
2,0,-5623.42261,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885.952297,покупка жилья
3,3,-4124.747207,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628.550329,дополнительное образование
4,0,340266.072047,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616.07787,сыграть свадьбу


In [6]:
# Count missing values in each column
credit_info.isna().sum() 

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

In [8]:
# Distribution of the number of children
credit_info['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

In [9]:
# Distribution of marital status identifiers
credit_info['family_status_id'].value_counts()

0    12380
1     4177
4     2813
3     1195
2      960
Name: family_status_id, dtype: int64

In [10]:
# Distribution of education level identifiers
credit_info['education_id'].value_counts() 

1    15233
0     5260
2      744
3      282
4        6
Name: education_id, dtype: int64

### Summary

1) We displayed the first 5 rows of the table to get familiar with the data.
2) We counted the gaps in each column. There are missing values (NaN) in the days_employed and total_income columns, 2174 in each. The equal counts, together with the sample rows above, suggest that both values are missing in the same rows. One possible explanation is that these clients did not report employment and therefore income, although the sample includes pensioners and civil servants, so a data-entry or export problem is also plausible.
3) The children column contains two values that stand out from the rest: -1 (47 rows) and 20 (76 rows). We treat them as typos and replace -1 with 1 and 20 with 2.
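
The claim in item 2 that both columns are missing in the same rows can be checked directly rather than inferred from equal counts. The following is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration, not the bank's data); the same comparison could be run on `credit_info`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for credit_info; the values are invented for illustration.
toy = pd.DataFrame({
    'income_type': ['пенсионер', 'сотрудник', 'госслужащий', 'сотрудник'],
    'days_employed': [np.nan, -4024.8, np.nan, -5623.4],
    'total_income': [np.nan, 112080.0, np.nan, 145885.9],
})

# Boolean masks: True where the value is missing.
missing_days = toy['days_employed'].isna()
missing_income = toy['total_income'].isna()

# If the masks are identical, the gaps always occur together.
same_rows = missing_days.equals(missing_income)

# How the gaps are spread across employment types.
gaps_by_type = toy.loc[missing_income, 'income_type'].value_counts()

print(same_rows)
print(gaps_by_type)
```

If the gaps are spread across many employment types, the "not yet employed" explanation becomes less convincing.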

In [11]:
# Fill missing total_income with the median income of the client's type of employment
median_by_type = credit_info.groupby('income_type')['total_income'].transform('median')
credit_info['total_income'] = credit_info['total_income'].fillna(median_by_type)
# Treat -1 and 20 children as typos for 1 and 2
credit_info['children'] = credit_info['children'].replace({-1: 1, 20: 2})
# Replace missing days_employed with 0 (the column is not used in the analysis)
credit_info['days_employed'] = credit_info['days_employed'].fillna(0)
# Convert both columns to integers
credit_info['total_income'] = credit_info['total_income'].astype('int')
credit_info['days_employed'] = credit_info['days_employed'].astype('int')

In [12]:
# First 5 rows after converting
credit_info.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437,42,высшее,0,женат / замужем,0,F,сотрудник,0,253875,покупка жилья
1,1,-4024,36,среднее,1,женат / замужем,0,F,сотрудник,0,112080,приобретение автомобиля
2,0,-5623,33,Среднее,1,женат / замужем,0,M,сотрудник,0,145885,покупка жилья
3,3,-4124,32,среднее,1,женат / замужем,0,M,сотрудник,0,267628,дополнительное образование
4,0,340266,53,среднее,1,гражданский брак,1,F,пенсионер,0,158616,сыграть свадьбу


### Summary

NaN is a float value that marks a missing number, and pandas skips it by default when computing statistics such as the median. We replace the missing values in the total_income column with the median income for the corresponding type of employment, so that each gap is filled with a value typical of that group rather than of the whole sample. In the days_employed column we replace the missing values with 0, since this column is not used in the analysis and does not affect the results. We also convert total_income and days_employed to integers, since fractions of a currency unit and fractions of a day are not relevant to the analysis.
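
To illustrate why the median is taken per type of employment, here is a small sketch comparing group-wise and global median filling. The frame `toy` and its numbers are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Invented incomes for two employment types, one gap in each.
toy = pd.DataFrame({
    'income_type': ['сотрудник', 'сотрудник', 'сотрудник',
                    'пенсионер', 'пенсионер', 'пенсионер'],
    'total_income': [100000.0, 200000.0, np.nan, 50000.0, 70000.0, np.nan],
})

# Median of each employment type, aligned with the original rows.
group_median = toy.groupby('income_type')['total_income'].transform('median')
filled_by_group = toy['total_income'].fillna(group_median)

# A single median over the whole column, for comparison.
global_median = toy['total_income'].median()
filled_globally = toy['total_income'].fillna(global_median)

print(pd.DataFrame({'by_group': filled_by_group, 'global': filled_globally}))
```

In this toy frame the global median sits between the two groups, so it would understate the employee's income and overstate the pensioner's.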

In [13]:
# Count the values by education level before the conversion
credit_info['education'].value_counts() 
# Convert the values in the education column to lowercase
credit_info['education'] = credit_info['education'].str.lower()


In [14]:
# Count duplicate rows after the lowercase conversion
credit_info.duplicated().sum()  

71

In [15]:
# Show the first 5 duplicate rows
credit_info[credit_info.duplicated()].head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2849,0,0,41,среднее,1,женат / замужем,0,F,сотрудник,0,142594,покупка жилья для семьи
3290,0,0,58,среднее,1,гражданский брак,1,F,пенсионер,0,118514,сыграть свадьбу
4182,1,0,34,высшее,0,гражданский брак,1,F,сотрудник,0,142594,свадьба
4851,0,0,60,среднее,1,гражданский брак,1,F,пенсионер,0,118514,свадьба
5557,0,0,58,среднее,1,гражданский брак,1,F,пенсионер,0,118514,сыграть свадьбу


In [16]:
# Let's delete the duplicates in the table
credit_info = credit_info.drop_duplicates() 
credit_info.duplicated().sum()

0

In [17]:
# Let's count the number of values after the conversion
credit_info['education'].value_counts()

среднее                15172
высшее                  5250
неоконченное высшее      744
начальное                282
ученая степень             6
Name: education, dtype: int64

### Summary

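After the education values were converted to lowercase, 71 duplicate rows were found. We delete these rows so that they do not distort the analysis. They may have appeared because the same record was entered or loaded more than once. Note that the duplicates shown above all have days_employed equal to 0 and a median-filled total_income, so some of them may have become identical only after the missing values were filled.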
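
The order of the steps matters here: rows that differ only in letter case are not counted as duplicates until the case is normalized. A minimal sketch on a made-up frame (`toy` is an assumption for illustration):

```python
import pandas as pd

# Three rows that differ only in the letter case of 'education'.
toy = pd.DataFrame({
    'education': ['среднее', 'Среднее', 'СРЕДНЕЕ', 'высшее'],
    'education_id': [1, 1, 1, 0],
    'debt': [0, 0, 0, 0],
})

# Before normalization the rows look distinct to pandas.
duplicates_before = int(toy.duplicated().sum())

# After lowercasing, the case-only variants collapse into one value.
toy['education'] = toy['education'].str.lower()
duplicates_after = int(toy.duplicated().sum())

deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(duplicates_before, duplicates_after, len(deduplicated))
```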

## Lemmatization

In [18]:
# Collect the unique loan purposes; the lemmas column is added below
purposes = credit_info[['purpose']].drop_duplicates().reset_index(drop=True)

# Function for reducing a word to its dictionary form(lemma)
def find_lemmas(text): 
    lemmas = m.lemmatize(text)
    return lemmas


# Let's add another column
purposes['lemmas'] = purposes['purpose'].apply(find_lemmas) 

# Let's write a function to assign the purpose of the loan to a certain category
def purpose_category(row): 
    lemmas = row['lemmas']
    if 'автомобиль' in lemmas:
        return 'покупка автомобиля'
    elif ('недвижимость' in lemmas) or ('жилье' in lemmas):
        return 'покупка недвижимости'
    elif 'свадьба' in lemmas:
        return 'свадьба'
    elif 'образование' in lemmas:
        return 'образование'
    else:
        return 'другая категория'
    
purposes['purpose_category'] = purposes.apply(purpose_category, axis=1)
credit_info = credit_info.merge(purposes, on='purpose', how='left')
credit_info['purpose_category'].value_counts()



покупка недвижимости    10811
покупка автомобиля       4306
образование              4013
свадьба                  2324
Name: purpose_category, dtype: int64

### Summary

The borrowers described the same loan purposes in different words, so the purposes have to be normalized before they can be grouped. 
We lemmatized the purpose column with pymystem3, a library that lemmatizes Russian text, and assigned each purpose to a category based on the lemmas it contains.
After the transformation, 4 main loan purpose categories were identified: "Buying real estate", "Buying a car", "Education", "Wedding". No purpose fell into the fallback "other" category.
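
For readers who cannot install pymystem3, the categorization step can be approximated without a lemmatizer. The sketch below swaps lemmatization for plain substring matching on word stems, which is a cruder technique than the one used in the notebook. The names `CATEGORY_STEMS` and `categorize_purpose` are hypothetical, and the stems were chosen by hand for the purposes seen in this dataset.

```python
# Hand-picked word stems per category (an assumption, not Mystem output).
# Categories are checked in order; the first match wins.
CATEGORY_STEMS = [
    ('покупка автомобиля', ('автомобил',)),
    ('покупка недвижимости', ('недвижимост', 'жиль')),
    ('свадьба', ('свадьб',)),
    ('образование', ('образован',)),
]


def categorize_purpose(purpose: str) -> str:
    """Return the loan category whose stem occurs in the purpose text."""
    text = purpose.lower()
    for category, stems in CATEGORY_STEMS:
        if any(stem in text for stem in stems):
            return category
    return 'другая категория'


for example in ['покупка жилья', 'сыграть свадьбу', 'сделка с подержанным автомобилем']:
    print(example, '->', categorize_purpose(example))
```

Substring matching can misfire on unrelated words that share a stem, which is why the notebook's lemmatizer-based approach is the more robust choice.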

## Categorization of data

In [19]:
education_group = credit_info.pivot_table(index='education', values='education_id').to_dict()['education_id']
education_group

{'высшее': 0,
 'начальное': 3,
 'неоконченное высшее': 2,
 'среднее': 1,
 'ученая степень': 4}

In [20]:
# Let's make a summary table for the columns education and education_id
education_group = credit_info.pivot_table(index='education', values='education_id').to_dict()['education_id'] 
print(education_group)

# Let's make a summary table of the family_status and family_status_id columns
family_group = credit_info.pivot_table(index='family_status', values='family_status_id').to_dict()['family_status_id'] 
print(family_group)

def income_level(income):
    if income <= 27000.00:
        return 'Низкий'
    if 27000 < income <= 108000.00:
        return 'Средний'
    if 108000.00 < income <= 225000.00:
        return 'Выше среднего'
    return 'Высокий'

credit_info['income_level'] = credit_info['total_income'].apply(income_level)
credit_info['income_level'].value_counts()

{'высшее': 0, 'начальное': 3, 'неоконченное высшее': 2, 'среднее': 1, 'ученая степень': 4}
{'Не женат / не замужем': 4, 'в разводе': 3, 'вдовец / вдова': 2, 'гражданский брак': 1, 'женат / замужем': 0}


Выше среднего    12269
Средний           5399
Высокий           3774
Низкий              12
Name: income_level, dtype: int64

### Summary

Two dictionaries can be extracted from the data: one for education and one for marital status, because each of these text columns has a matching column with a numeric identifier. 
We also categorize clients by income level. The classification is based on the income structure of the population according to official statistics: "Low" is an income of up to 27,000 per person, "Average" is from 27,000 to 108,000, "Above average" is from 108,000 to 225,000, and "High" is over 225,000. Note that the resulting groups are very uneven in size: only 12 clients fall into the "Low" category.
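
The same thresholds can also be expressed with `pd.cut`, which avoids a Python-level function call per row. This is a sketch with made-up incomes; the bin edges and labels mirror the `income_level` function above, and `right=True` keeps the upper edge of each bin inclusive, as in the original `<=` comparisons.

```python
import numpy as np
import pandas as pd

# Invented incomes, including values exactly on the thresholds.
incomes = pd.Series([20000, 27000, 27001, 108000, 150000, 225000, 300000])

# Intervals are (-inf, 27000], (27000, 108000], (108000, 225000], (225000, inf).
bins = [-np.inf, 27000, 108000, 225000, np.inf]
labels = ['Низкий', 'Средний', 'Выше среднего', 'Высокий']

income_level = pd.cut(incomes, bins=bins, labels=labels, right=True)

print(pd.DataFrame({'income': incomes, 'level': income_level}))
```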

In [21]:
# Let's build a summary table of the relationship between having children and paying back the loan on time
children_pivot_table = credit_info.pivot_table(index=['children'], values='debt', aggfunc=['count', 'sum', 'mean'])
children_pivot_table


children,count (debt),sum (debt),mean (debt)
0,14091,1063,0.075438
1,4855,445,0.091658
2,2128,202,0.094925
3,330,27,0.081818
4,41,4,0.097561
5,9,0,0.0


In [22]:
# Info
children_pivot_table.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   (count, debt)  6 non-null      int64  
 1   (sum, debt)    6 non-null      int64  
 2   (mean, debt)   6 non-null      float64
dtypes: float64(1), int64(2)
memory usage: 192.0 bytes


### Summary

Borrowers without children have the largest absolute number of debts (1063), but this is because they are by far the largest group; their share of debtors is the lowest among the sizable groups, about 7.5%. Borrowers with 1 or 2 children have a higher share of debtors, about 9.2% and 9.5% respectively. Borrowers with 4 children show the highest share (about 9.8%), and borrowers with 5 children have no debts at all, but these groups contain only 41 and 9 clients, so their figures are not reliable.
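
Since group size matters for how much weight each share deserves, it can help to keep the counts next to the rates and flag small groups explicitly. Below is a sketch using named aggregation on a made-up frame. The column names `clients`, `debtors`, `debt_rate`, and the threshold `MIN_CLIENTS` are hypothetical choices for illustration.

```python
import pandas as pd

# Invented data: 'debt' is 1 if the client has been in debt, else 0.
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 2],
    'debt':     [0, 0, 0, 1, 0, 1, 0],
})

# Named aggregation gives flat, readable column names.
summary = toy.groupby('children')['debt'].agg(
    clients='count',
    debtors='sum',
    debt_rate='mean',
)

# Hypothetical threshold below which a group's rate is treated as unreliable.
MIN_CLIENTS = 3
summary['reliable'] = summary['clients'] >= MIN_CLIENTS

print(summary)
```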

In [23]:
# Let's build a summary table of the relationship between marital status and repayment of the loan on time
family_status_pivot_table = credit_info.pivot_table(index=['family_status'], values='debt', aggfunc=['count', 'sum', 'mean']) 
family_status_pivot_table

family_status,count (debt),sum (debt),mean (debt)
Не женат / не замужем,2810,274,0.097509
в разводе,1195,85,0.07113
вдовец / вдова,959,63,0.065693
гражданский брак,4151,388,0.093471
женат / замужем,12339,931,0.075452


In [24]:
#Info
family_status_pivot_table.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Не женат / не замужем to женат / замужем
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   (count, debt)  5 non-null      int64  
 1   (sum, debt)    5 non-null      int64  
 2   (mean, debt)   5 non-null      float64
dtypes: float64(1), int64(2)
memory usage: 160.0+ bytes


### Summary

According to the summary table, married borrowers have the largest absolute number of debts (931), because this group is by far the largest. By share, the "Unmarried" and "Civil marriage" groups have the highest probability of falling into debt, about 9.8% and 9.3% respectively. Widowed and divorced borrowers have the lowest shares, about 6.6% and 7.1%.

In [25]:
# Let's build a summary table of the relationship between income level and repayment of the loan on time
total_income_pivot_table = credit_info.pivot_table(index=['income_level'], values='debt', aggfunc=['count', 'sum', 'mean']) 
total_income_pivot_table

income_level,count (debt),sum (debt),mean (debt)
Высокий,3774,267,0.070747
Выше среднего,12269,1043,0.085011
Низкий,12,1,0.083333
Средний,5399,430,0.079644


In [26]:
total_income_pivot_table.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Высокий to Средний
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   (count, debt)  4 non-null      int64  
 1   (sum, debt)    4 non-null      int64  
 2   (mean, debt)   4 non-null      float64
dtypes: float64(1), int64(2)
memory usage: 128.0+ bytes


### Summary

In the summary table by income level, low-income borrowers show a debt share of about 8.3%, but this figure cannot be taken as a rule: the group contains only 12 clients and a single debtor. Among the sizable groups, "Above average" has both the highest share of debtors (about 8.5%) and the largest number of debtors (1043), while "High" has the lowest share (about 7.1%).
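
To make the small-sample caveat concrete, one can attach a confidence interval to each share. The sketch below uses the Wilson score interval, which is our choice for illustration and not part of the original analysis; the helper `wilson_interval` is hypothetical. The inputs are the counts from the table above: 1 debtor out of 12 "Low" clients and 1043 out of 12269 "Above average" clients.

```python
import math


def wilson_interval(debtors: int, clients: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = debtors / clients
    denom = 1 + z ** 2 / clients
    center = (p + z ** 2 / (2 * clients)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / clients + z ** 2 / (4 * clients ** 2)
    ) / denom
    return center - half_width, center + half_width


# Counts taken from the income-level summary table.
low_income = wilson_interval(1, 12)
above_average = wilson_interval(1043, 12269)

print('Low:           %.3f to %.3f' % low_income)
print('Above average: %.3f to %.3f' % above_average)
```

The interval for the "Low" group is more than 30 percentage points wide, while the one for "Above average" is about one percentage point wide, which is why only the latter supports a conclusion.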

In [27]:
# Let's build a summary table of the relationship between the objectives of the loan and the repayment of the loan on time
purpose_pivot_table = credit_info.pivot_table(index=['purpose_category'], values='debt', aggfunc=['count', 'sum', 'mean']) 
purpose_pivot_table

purpose_category,count (debt),sum (debt),mean (debt)
образование,4013,370,0.0922
покупка автомобиля,4306,403,0.09359
покупка недвижимости,10811,782,0.072334
свадьба,2324,186,0.080034


In [28]:
purpose_pivot_table.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, образование to свадьба
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   (count, debt)  4 non-null      int64  
 1   (sum, debt)    4 non-null      int64  
 2   (mean, debt)   4 non-null      float64
dtypes: float64(1), int64(2)
memory usage: 128.0+ bytes


### Summary

In the summary table by loan purpose, real estate loans account for the largest absolute number of debts (782), followed by car loans (403). This is primarily because these are the largest groups by number of borrowers. By share, the groups with the highest probability of falling into debt are "Car purchase" (about 9.4%) and "Education" (about 9.2%), while "Real estate" has the lowest share (about 7.2%).

# Summary

Several data processing steps were carried out in the project: missing values were filled, data types were changed, duplicates were removed, lemmatization was performed, and the data was categorized. 
The original data had 21525 rows and 12 columns, 2 of which (days_employed and total_income) contained missing values.
The gaps in total_income were filled with the median income for the corresponding type of employment. The gaps in days_employed were filled with 0, since this column was not used in the further analysis. Both columns were then converted to integers. 
There were 71 duplicate rows in the table; they were deleted during processing. 
Lemmatization was performed on the purpose column, and a new column, purpose_category, was added to group the loan purposes. Four categories were identified: "Buying real estate", "Buying a car", "Education", "Wedding". 
The data was categorized by the total_income column into four income levels: "Low", "Average", "Above average" and "High". 
Based on the results of the project, the following conclusions can be drawn:
1) Borrowers without children have the lowest share of debtors among the sizable groups (about 7.5%), while borrowers with 1 or 2 children have a share of 9.2% to 9.5%. Borrowers with 4 children show the highest share (about 9.8%) and borrowers with 5 children have no debts, but these groups are too small (41 and 9 clients) to support a conclusion.
2) By share, the "Unmarried" and "Civil marriage" groups have the highest probability of falling into debt, about 9.8% and 9.3% respectively.
3) Among the sizable income groups, "Above average" (from 108,000 to 225,000 per person) has the highest share of debtors, about 8.5%, as well as the largest number of debtors.
4) By loan purpose, the groups with the highest probability of falling into debt are "Car purchase" and "Education", both above 9%.