# Project Description

### Analyzing borrowers’ risk of defaulting

My project is to prepare a report for a bank’s loan division. I have to find out if a customer’s marital status and number of children has an impact on whether they will default on a loan. The bank already has some data on customers’ credit worthiness.

### Description of the Data

- children: the number of children in the family
- days_employed: how long the customer has been working
- dob_years: the customer’s age
- education: the customer’s education level
- education_id: identifier for the customer’s education
- family_status: the customer’s marital status
- family_status_id: identifier for the customer’s marital status
- gender: the customer’s gender
- income_type: the customer’s income type
- debt: whether the customer has ever defaulted on a loan
- total_income: monthly income
- purpose: reason for taking out a loan

# Project Goal

My report will be considered when building a **credit scoring** of a potential customer. A ** credit scoring ** is used to evaluate the ability of a potential borrower to repay their loan.

# Project Contents

-  <a href='#the_destination1'>Open the data file and read the general information:</a>
-  <a href='#the_destination2'>Preprocess the data:</a>
-  <a href='#the_destination3'>Interpretion of repaying loan on time:</a>
-  <a href='#the_destination4'>Overall conclusion:</a>

<a id='the_destination1'></a>
# Step 1. Open the data file and read the general information. 

**Importing the libraries:**

In [1]:
import pandas as pd
from nltk.stem import SnowballStemmer

**Information for the "credit_scoring" dataset:**

In [2]:
try:
    credit_scoring = pd.read_csv('credit_scoring_eng.csv')
except:
    credit_scoring = pd.read_csv('/datasets/credit_scoring_eng.csv')
credit_scoring.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


There are 21525 rows & 12 columns . Columns name are "children", "days_employed", "dob_years", "education", "education_id", "family_status", "family_status_id", "gender", "income_type", "debt", "total_income", "purpose".

<a id='the_destination2'></a>
# Step 2. Preprocess the data:

**Determing missing values in the "credit_scoring" dataset:**

In [3]:
credit_scoring.isnull().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

- There are 2174 missing values in days_employed column & total_income column.
- There are some possible reasons for missing data:
  - People do not respond to survey.
  - The individual dies or drops out before sampling.
  - Some things are easier to measure than others.
  - Data entry errors.

**Data type replacement:**

In [4]:
credit_scoring['days_employed'] = credit_scoring['days_employed'].fillna(value=0)
credit_scoring['total_income'] = credit_scoring['total_income'].fillna(value=0)
credit_scoring['days_employed'] = credit_scoring['days_employed'].astype(int).abs()
credit_scoring['total_income'] = credit_scoring['total_income'].astype(int)
credit_scoring['children'] = credit_scoring['children'].abs()
display(credit_scoring.head())

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,8437,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house
1,1,4024,36,secondary education,1,married,0,F,employee,0,17932,car purchase
2,0,5623,33,Secondary Education,1,married,0,M,employee,0,23341,purchase of the house
3,3,4124,32,secondary education,1,married,0,M,employee,0,42820,supplementary education
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding


- I used fillna() method for filling the missing values. According to the question for changing data type, I called astype(int) because it changes data from fractional data to integer type data and abs() function it also changes the data from real number to positive number in the days_employed column.
- Missing values are available in the days_employed & total_income column and their dtype is float64.
- There are some hypothesis for why the missing values appear:
  - User forgot to fill in a field.
  - Data was lost while transferring manually from a legacy database.
  - There was a programming error.
  - Users chose not to fill out a field tied to their beliefs about how the results would be used or interpreted.
- Fillin missing values with zero - is a good idea because they were not employed which means they did not have income.If I put another number rather than zero it will give us inappropriate value.

**Checking duplicates values in the "credit_scoring" dataset:**

In [5]:
credit_scoring['education'] = credit_scoring['education'].str.lower()
credit_scoring=credit_scoring.drop_duplicates().reset_index(drop = True)
credit_scoring.duplicated().sum()

0

I used drop_duplicates() method for deleting duplicates and for checking duplicates I used duplicated() method.From my assumption, there are many reasons for having duplicate values:
- Lack of data standardization.
- Changing demographic data.
- Lack of multiple matching demographic data points.
- Default values.

**Categorizing data:**

In [6]:
english_stemmer = SnowballStemmer('english')
query = credit_scoring['purpose']
def purpose_new_group(query):
        for word in query.split(' '):
            stemmed_word = english_stemmer.stem(word)
            if stemmed_word == 'educ' or stemmed_word == 'uni':
                return 'education'
            if stemmed_word == 'car':
                return 'car'
            if stemmed_word == 'hous' or stemmed_word == 'estat' or 'prop' in query:
                return 'house'
            if stemmed_word == 'wed':
                return 'wedding'
credit_scoring['purpose_new'] = credit_scoring['purpose'].apply(purpose_new_group)
display(credit_scoring.head())

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,purpose_new
0,1,8437,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house,house
1,1,4024,36,secondary education,1,married,0,F,employee,0,17932,car purchase,car
2,0,5623,33,secondary education,1,married,0,M,employee,0,23341,purchase of the house,house
3,3,4124,32,secondary education,1,married,0,M,employee,0,42820,supplementary education,education
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding,wedding


For categorizing data I used NLTK dictionary because it's convenient to have existing text collections to explore.It also contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.

**Defining income category:**

In [7]:
def income_category(income):
    if income < 100000:
        return 1
    elif 100000 < income < 200000:
        return 2
    else:
        return 3
credit_scoring["income_category"] = credit_scoring["total_income"].apply(lambda x: income_category)
display(credit_scoring.head())

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,purpose_new,income_category
0,1,8437,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house,house,<function income_category at 0x000001B1DDDC2D38>
1,1,4024,36,secondary education,1,married,0,F,employee,0,17932,car purchase,car,<function income_category at 0x000001B1DDDC2D38>
2,0,5623,33,secondary education,1,married,0,M,employee,0,23341,purchase of the house,house,<function income_category at 0x000001B1DDDC2D38>
3,3,4124,32,secondary education,1,married,0,M,employee,0,42820,supplementary education,education,<function income_category at 0x000001B1DDDC2D38>
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding,wedding,<function income_category at 0x000001B1DDDC2D38>


- There are 2174 missing values in days_employed column & total_income column.
- There are some possible reasons for missing data:
  - People do not respond to survey.
  - The individual dies or drops out before sampling.
  - Some things are easier to measure than others.
  - Data entry errors.
- I used fillna() method for filling the missing values. According to the question for changing data type, I called astype(int) because it changes data from fractional data to integer type data and abs() function it also changes the data from real number to positive number in the days_employed column.
- Missing values are available in the days_employed & total_income column and their dtype is float64.
- There are some hypothesis for why the missing values appear:
  - User forgot to fill in a field.
  - Data was lost while transferring manually from a legacy database.
  - There was a programming error.
  - Users chose not to fill out a field tied to their beliefs about how the results would be used or interpreted.
- Fillin missing values with zero - is a good idea because they were not employed which means they did not have income.If I put another number rather than zero it will give us inappropriate value.
- I used drop_duplicates() method for deleting duplicates and for checking duplicates I used duplicated() method.From my assumption, there are many reasons for having duplicate values:
  - Lack of data standardization.
  - Changing demographic data.
  - Lack of multiple matching demographic data points.
  - Default values.
-  For categorizing data I used NLTK dictionary because it's convenient to have existing text collections to explore.It also contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.

<a id='the_destination3'></a>
# Step 3. Interpretion of repaying loan on time:

**Relation between having kids and repaying a loan on time:**

In [8]:
credit_scoring[["children", "debt"]][credit_scoring.debt == 0].groupby("children").count()

Unnamed: 0_level_0,debt
children,Unnamed: 1_level_1
0,13028
1,4410
2,1858
3,303
4,37
5,9
20,68


**Counting the number of children:**

In [9]:
credit_scoring['children'].value_counts()

0     14091
1      4855
2      2052
3       330
20       76
4        41
5         9
Name: children, dtype: int64

Most of the potential borrower have less children. Very few potential borrower have more than 3 children.

**Relation between children and repaying a loan on time:**

In [10]:
credit_scoring[["children", "debt"]][credit_scoring.debt == 1].groupby("children").count()

Unnamed: 0_level_0,debt
children,Unnamed: 1_level_1
0,1063
1,445
2,194
3,27
4,4
20,8


From the data we have 14107 people who don't have children. Out of these 13044 people (although they don't have children)
they repay loan on time. That means, 92.46% of the people who have no children, repay loan on time. On the the other hand 1063 people(They have children)  they can't repay on time.There are 7364 people who have at least one children . Among 7364 people, 6686 people repay loan on time and 678 people don't repay loan on time. That means, 90.79% of the people having children, repay loan on time. From my analysis, there is a little connection between having kids and repaying a loan on time because people who don't have children, are a bit more likely to repay loan on time.

**Relation between marital status and repaying a loan on time:**

In [11]:
credit_scoring[["family_status", "debt"]][credit_scoring.debt == 0].groupby(["family_status"]).count()

Unnamed: 0_level_0,debt
family_status,Unnamed: 1_level_1
civil partnership,3763
divorced,1110
married,11408
unmarried,2536
widow / widower,896


Among all of the family status, "married" family status people are unable to repay loan on time.

**Counting the number of each family status:**

In [12]:
credit_scoring['family_status'].value_counts()

married              12339
civil partnership     4151
unmarried             2810
divorced              1195
widow / widower        959
Name: family_status, dtype: int64

The percentage of repaying loan on time for married,civil partnership,unmarried,divorced and widow / widower are 92.45%,90.68%,90.25%,92.89% and 93.43% respectively. Among all of them, widow/widower are more flexible for repaying loan on time. 

**Relation between income level and repaying a loan on time:**

In [13]:
credit_scoring.pivot_table(index=['debt'], columns='income_category', aggfunc='size', fill_value=0)

income_category,<function income_category at 0x000001B1DDDC2D38>
debt,Unnamed: 1_level_1
0,19713
1,1741


Yes, there is a relation between total_income and repaying loan on time. In general, the people who are in income_category 2 and 3, they are more flexible for repaying loan on time rather than income_category 1. 

**Different loan purposes affect on-time repayment of the loan:**

In [14]:
credit_scoring[["debt", "purpose_new"]][credit_scoring.debt == 0].groupby(["purpose_new"]).count()

Unnamed: 0_level_0,debt
purpose_new,Unnamed: 1_level_1
car,3903
education,3190
house,10029
wedding,2138


Among all of the purposes, the house purpose is the worst condition in repaying loan on time.

**Counting the new purposes for loan:**

In [15]:
credit_scoring['purpose_new'].value_counts()

house        10811
car           4306
education     3517
wedding       2324
Name: purpose_new, dtype: int64

For house, car, education, wedding purposes number of loans are 10811, 4306, 3517, 2324 respectively.

**Calculating the number of loan repaying on time:**

In [16]:
credit_scoring[["debt", "purpose_new"]][credit_scoring.debt == 1].groupby(["purpose_new"]).count()

Unnamed: 0_level_0,debt
purpose_new,Unnamed: 1_level_1
car,403
education,327
house,782
wedding,186


The percentage of repaying loan on time for house,car,education and wedding are 92.77%,90.65%,90.71% and 92.03% respectively. Among all of them, House and wedding purposes are more flexible for repaying loan on time.

**Conversely**,The percentage of non-repaying loan on time for house,car,education and wedding are 7.23%,9.35%,9.29% and 7.97% respectively. Among all of them, House and wedding purposes are more flexible compare to car and education purposes for repaying loan on time.

-  Most of the potential borrower have less children. Very few potential borrower have more than 3 children.
-  From the data we have 14107 people who don't have children. Out of these 13044 people (although they don't have children) they repay loan on time. That means, 92.46% of the people who have no children, repay loan on time. On the the other hand 1063 people(They have children)  they can't repay on time.There are 7364 people who have at least one children . Among 7364 people, 6686 people repay loan on time and 678 people don't repay loan on time. That means, 90.79% of the people having children, repay loan on time. From my analysis, there is a little connection between having kids and repaying a loan on time because people who don't have children, are a bit more likely to repay loan on time.
- Among all of the family status, "married" family status people are unable to repay loan on time.
- The percentage of repaying loan on time for married,civil partnership,unmarried,divorced and widow / widower are 92.45%,90.68%,90.25%,92.89% and 93.43% respectively. Among all of them, widow/widower are more flexible for repaying loan on time. 
-  Yes, there is a relation between total_income and repaying loan on time. In general, the people who are in income_category 2 and 3, they are more flexible for repaying loan on time rather than income_category 1.
-  Among all of the purposes, the house purpose is the worst condition in repaying loan on time.
-  For house, car, education, wedding purposes number of loans are 10811, 4306, 3517, 2324 respectively.
-  The percentage of repaying loan on time for house,car,education and wedding are 92.77%,90.65%,90.71% and 92.03% respectively. Among all of them, House and wedding purposes are more flexible for repaying loan on time. **Conversely**,The percentage of non-repaying loan on time for house,car,education and wedding are 7.23%,9.35%,9.29% and 7.97% respectively. Among all of them, House and wedding purposes are more flexible compare to car and education purposes for repaying loan on time.

<a id='the_destination4'></a>
# Step 4. Overall conclusion.

**Final Observations**
- People with higher *total_income* are more likely to repay loan on time. 
- Though there are not much relation between repaying loan on time and other categories, but house and wedding purposes, widow / widoer are also more recommended for repaying loan on time.
- *Income_Category* 2 and 3 are more flexible for repaying loan on time

# Project Readiness Checklist

Put 'x' in the completed points. Then press Shift + Enter.

- [x]  file open;
- [x]  missing values are filled;
- [x]  an explanation of which missing value types were detected;
- [x]  explanation for the possible causes of missing values;
- [x]  an explanation of how the blanks are filled;
- [x]  replaced the real data type with an integer;
- [x]  an explanation of which method is used to change the data type and why;
- [x]  duplicates deleted;
- [x]  an explanation of which method is used to find and remove duplicates;
- [x]  description of the possible reasons for the appearance of duplicates in the data;
- [x]  data is categorized;
- [x]  an explanation of the principle of data categorization;
- [x]  an answer to the question "Is there a relation between having kids and repaying a loan on time?";
- [x]  an answer to the question " Is there a relation between marital status and repaying a loan on time?";
- [x]   an answer to the question " Is there a relation between income level and repaying a loan on time?";
- [x]  an answer to the question " How do different loan purposes affect on-time repayment of the loan?"
- [x]  conclusions are present on each stage;
- [x]  a general conclusion is made.