In [142]:
import pandas as pd
import numpy as np

Read and preview the data.

In [143]:
credit_data = pd.read_csv('credit_scoring_eng.csv')
credit_data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


###### Look at data types and identify columns with null values in the DataFrame, for further investigation.
I have used the info() method, which lists each column's data type and non-null count, to identify which columns have null values.

In [144]:
credit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


###### Two columns have null values: 'days_employed' and 'total_income'

The shape of the DataFrame tells me how much data I am working with, which is especially useful for checking the effect of deleting null values and duplicates.
"shape" is an attribute of the DataFrame, so it is accessed without parentheses.

In [145]:
credit_data.shape

(21525, 12)

To count the number of null values in every column, I have chained the isnull() and sum() methods.

In [146]:
credit_data.isnull().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

###### In this DataFrame, the two columns with null values both have 2174 null values.

My interpretation is that the null values in the two columns are not missing at random.
###### MNAR - Missing Not At Random - the likelihood of a value being missing depends on the values in the column itself.
The 'days_employed' and 'total_income' columns are related: both have the same number of null values, which suggests that where there is no employment record there is also no reported income.
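
Matching null counts do not by themselves prove that the gaps fall on the same rows. The following is a minimal sketch of how that could be checked, using a small made-up DataFrame (`toy`) in place of credit_data; the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for credit_data; the values are invented for illustration.
toy = pd.DataFrame({
    'income_type': ['employee', 'retiree', 'business', 'employee'],
    'days_employed': [-8437.7, np.nan, -1200.5, np.nan],
    'total_income': [40620.1, np.nan, 30100.0, np.nan],
})

# Boolean masks that are True where a value is missing.
missing_days = toy['days_employed'].isnull()
missing_income = toy['total_income'].isnull()

# Do the gaps in the two columns fall on exactly the same rows?
same_rows = bool((missing_days == missing_income).all())
print(same_rows)

# How are the incomplete rows spread across income types?
print(toy.loc[missing_days, 'income_type'].value_counts())
```

If the incomplete rows turned out to include many employed clients, the "no employment, hence no income" reading would need a second look.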

I will fill the null values in the 'days_employed' and 'total_income' columns with 0, using the fillna() method with the keyword arguments value=0, axis=1, and inplace=True, which modifies the DataFrame in place. Since no other column has null values, calling fillna() on the whole DataFrame only changes these two columns.

I will then confirm the sum of null values in each column with method chaining of isnull() and sum().

In [147]:
credit_data.fillna(value=0, axis=1, inplace=True)
credit_data.isnull().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64
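
Calling fillna() on the whole DataFrame works here because only two columns contain nulls. As a hedged alternative, the fill could be limited to the intended columns so that a null appearing elsewhere is not silently turned into 0. The sketch below uses a made-up DataFrame (`toy`), not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with a null in a column that should NOT be filled with 0.
toy = pd.DataFrame({
    'days_employed': [-8437.7, np.nan],
    'total_income': [40620.1, np.nan],
    'purpose': ['car purchase', None],
})

# Fill only the two numeric columns of interest.
cols = ['days_employed', 'total_income']
toy[cols] = toy[cols].fillna(0)

# 'purpose' still reports its missing value.
print(toy.isnull().sum())
```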

I'll look at the DataFrame again with the info() method to identify data types in the columns.
The aim is to convert the float columns to an integer data type.

In [148]:
print(credit_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     21525 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB
None


I'll convert the columns of type 'float64' to type 'int64' using the astype() method, with np.int64 as the argument.

In [149]:
#days_employed to int
credit_data['days_employed'] = credit_data['days_employed'].astype(np.int64)

In [150]:
#total_income to int
credit_data['total_income'] = credit_data['total_income'].astype(np.int64)
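
Note that astype(np.int64) drops the fractional part rather than rounding. A small sketch with three values taken from the preview at the top of the notebook shows the difference; rounding first is only an option, not something the analysis requires.

```python
import numpy as np
import pandas as pd

values = pd.Series([-8437.673028, 340266.072047, 17932.802])

# astype() truncates toward zero.
truncated = values.astype(np.int64)

# Rounding first gives the nearest integer instead.
rounded = values.round().astype(np.int64)

print(truncated.tolist())  # [-8437, 340266, 17932]
print(rounded.tolist())    # [-8438, 340266, 17933]
```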

To identify duplicated rows in the data, I have used the duplicated() method, and chained it with sum() to find the total.
The sum helps me to see how much of my data I would lose if I dropped the duplicates.
These duplicates could have come up during data collection where clients filled in their details more than once.

In [151]:
print(credit_data.duplicated().sum())

54


Only 54 out of 21525 rows are duplicates. I will drop them using the drop_duplicates() method, with the keyword argument ignore_index=True so that I don't have to reset the index, and inplace=True to keep the same DataFrame.

In [152]:
credit_data.drop_duplicates(inplace=True, ignore_index=True)

In [153]:
#the shape of my DataFrame has changed because of the duplicated rows that have been removed.
credit_data.shape

(21471, 12)

In [154]:
credit_data['dob_years'].describe()

count    21471.000000
mean        43.279074
std         12.574291
min          0.000000
25%         33.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

The age column has 0 values. This could be because the client did not give their age.
I will replace the 0 values with the median of the age column.

In [155]:
credit_data.loc[credit_data['dob_years'] == 0, 'dob_years'] = credit_data['dob_years'].median()
credit_data['dob_years'].describe()

count    21471.000000
mean        43.476643
std         12.217612
min         19.000000
25%         33.500000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64
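
Two side effects of this step may be worth knowing about: the zeros themselves take part in the median, and assigning a float median turns the column into float64. A minimal sketch of one alternative, on a made-up Series of ages, computes the median from the valid ages only and keeps the integer type.

```python
import pandas as pd

# Made-up ages; 0 marks a missing age.
ages = pd.Series([42, 0, 36, 33, 0, 53])

# Median of the valid ages only, cast to a whole number of years.
valid = ages[ages > 0]
median_age = int(valid.median())

# Replace the zeros and keep the column as int64.
cleaned = ages.mask(ages == 0, median_age).astype('int64')

print(cleaned.tolist())  # [42, 39, 36, 33, 39, 53]
```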

###### Categorize the data

I will categorize the following columns: education, family_status, gender, income_type, purpose.
This is because the values in these columns are discrete and can be combined into categories.

I will use the value_counts() method to identify unique values and their counts in the education column.

In [156]:
credit_data['education'].value_counts()

secondary education    13705
bachelor's degree       4710
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        273
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
GRADUATE DEGREE            1
Graduate Degree            1
Name: education, dtype: int64

###### There are a lot of repeated values in the education column, mainly because of differences between uppercase and lowercase letters.

I will convert all the values in the education column to lowercase using the str.lower() method.
The .str accessor applies the string method lower() to every value in the column.

The unique() method is used to confirm that only distinct values remain in the education column.

In [157]:
credit_data['education'] = credit_data['education'].str.lower()
credit_data['education'].unique()

array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)
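
Lowercasing can turn rows that previously differed only by letter case into exact duplicates, so it may be worth counting duplicates again at this point. The sketch below illustrates the effect on a made-up three-row DataFrame (`toy`); it is not a result from the real data.

```python
import pandas as pd

# Rows 0 and 1 differ only in the letter case of 'education'.
toy = pd.DataFrame({
    'education': ['secondary education', 'Secondary Education', "bachelor's degree"],
    'dob_years': [33, 33, 42],
    'total_income': [23341, 23341, 40620],
})

before = int(toy.duplicated().sum())
toy['education'] = toy['education'].str.lower()
after = int(toy.duplicated().sum())

print(before, after)  # 0 1
```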

I will use the value_counts() method to identify unique values and their counts in the family_status column.

In [158]:
credit_data['family_status'].value_counts()

married              12344
civil partnership     4163
unmarried             2810
divorced              1195
widow / widower        959
Name: family_status, dtype: int64

I'll create dictionaries (lookup tables) for the columns that have ids, then drop the id columns from the main DataFrame.

In [159]:
education_dict = credit_data[['education', 'education_id']].drop_duplicates().reset_index(drop=True)
print(education_dict)

             education  education_id
0    bachelor's degree             0
1  secondary education             1
2         some college             2
3    primary education             3
4      graduate degree             4


In [160]:
family_dict = credit_data[['family_status', 'family_status_id']].drop_duplicates().reset_index(drop=True)
print(family_dict)

       family_status  family_status_id
0            married                 0
1  civil partnership                 1
2    widow / widower                 2
3           divorced                 3
4          unmarried                 4
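
education_dict and family_dict are small lookup DataFrames. If a plain Python dict is ever needed, for example to translate an id back to its label, one way to build it is sketched below. The name `id_to_education` is hypothetical and the table is a shortened copy of the output above.

```python
import pandas as pd

# Shortened copy of the education lookup table printed above.
education_lookup = pd.DataFrame({
    'education': ["bachelor's degree", 'secondary education', 'some college'],
    'education_id': [0, 1, 2],
})

# Turn the two columns into an {id: label} dictionary.
id_to_education = education_lookup.set_index('education_id')['education'].to_dict()

print(id_to_education[1])  # secondary education
```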


I will drop the id columns from my data since I already have them as dictionaries.

In [161]:
credit_data.drop(['education_id', 'family_status_id'], axis=1, inplace=True)
credit_data.head()

Unnamed: 0,children,days_employed,dob_years,education,family_status,gender,income_type,debt,total_income,purpose
0,1,-8437,42.0,bachelor's degree,married,F,employee,0,40620,purchase of the house
1,1,-4024,36.0,secondary education,married,F,employee,0,17932,car purchase
2,0,-5623,33.0,secondary education,married,M,employee,0,23341,purchase of the house
3,3,-4124,32.0,secondary education,married,M,employee,0,42820,supplementary education
4,0,340266,53.0,secondary education,civil partnership,F,retiree,0,25378,to have a wedding


I will use the value_counts() method to identify unique values and their counts in the gender column.

In [162]:
credit_data['gender'].value_counts()

F      14189
M       7281
XNA        1
Name: gender, dtype: int64

To have two categories in the gender column, I'll replace 'XNA' with the most frequent gender in the data.

I use the describe() method to get summary statistics of the gender column, access top as an attribute of the result, then assign it to the row that has 'XNA'.

In [163]:
credit_data.loc[credit_data['gender'] == 'XNA', 'gender'] = credit_data.gender.describe().top
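
An equivalent way to find the most frequent value is the mode() method. The sketch below works on a made-up Series, and excluding 'XNA' before taking the mode is my own precaution rather than something the cell above does.

```python
import pandas as pd

gender = pd.Series(['F', 'M', 'F', 'XNA', 'F'])

# Most frequent value among the valid genders.
most_common = gender[gender != 'XNA'].mode().iloc[0]

# Replace the placeholder with it.
gender = gender.replace('XNA', most_common)

print(gender.tolist())  # ['F', 'M', 'F', 'F', 'F']
```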

I will use the value_counts() method to identify unique values and their counts in the income_type column.

In [164]:
credit_data['income_type'].value_counts()

employee                       11091
business                        5080
retiree                         3837
civil servant                   1457
unemployed                         2
entrepreneur                       2
paternity / maternity leave        1
student                            1
Name: income_type, dtype: int64

In cleaning the income_type column, I'll assume that the student is unemployed and that the client on paternity / maternity leave is employed (they were on leave at the time of data collection, but are generally employed).

So I'll change these values using the loc indexer, which selects every row in the column that meets a given condition so that the desired value can be assigned.

In [165]:
credit_data.loc[credit_data['income_type'] == 'student', 'income_type'] = 'unemployed'
credit_data.loc[credit_data['income_type'] == 'paternity / maternity leave', 'income_type'] = 'employee'

Business and entrepreneur should also fall in the same category, so I'll change entrepreneur to business.

In [166]:
credit_data.loc[credit_data['income_type'] == 'entrepreneur', 'income_type'] = 'business'
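
The three loc assignments can also be written as one replace() call with a mapping, which keeps all the merging decisions in a single place. This is a sketch on a made-up Series; `merge_map` is a hypothetical name.

```python
import pandas as pd

income_type = pd.Series([
    'employee', 'student', 'entrepreneur',
    'paternity / maternity leave', 'retiree',
])

# Rare categories mapped onto the broader category they are merged into.
merge_map = {
    'student': 'unemployed',
    'paternity / maternity leave': 'employee',
    'entrepreneur': 'business',
}

merged = income_type.replace(merge_map)
print(merged.tolist())
# ['employee', 'unemployed', 'business', 'employee', 'retiree']
```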

I will use stemming to categorize the strings in the purpose column, because stemming reduces a word to its stem.

The NLTK library is used for stemming; the stem() method finds the stem of each word.

I identified the distinguishing words in the purpose column by eye and put them in a list.
I used a for loop to find the stem of each word. These stems are used in the next step to write the if conditions.

In [167]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

#find the stems that the categorization function in the next cell will test for
words = ['wedding', 'estate', 'house', 'property', 'car', 'education', 'university']
stem = []
for word in words:
    stem.append(stemmer.stem(word))
print(stem)

['wed', 'estat', 'hous', 'properti', 'car', 'educ', 'univers']


I have defined a function that stems every word of a value in the purpose column and checks the stems against a series of conditions to return the matching category.

In [168]:
def purpose_cat(purpose):
    stemmed = [stemmer.stem(word) for word in purpose.split(' ')]
    if 'wed' in stemmed:
        return 'wedding'
    if 'estat' in stemmed:
        return 'real estate'
    if 'hous' in stemmed:
        return 'real estate'
    if 'properti' in stemmed:
        return 'real estate'
    if 'car' in stemmed:
        return 'car'
    if 'educ' in stemmed:
        return 'education'
    if 'univers' in stemmed:
        return 'education'
    else:
        return 'unknown'
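
The chain of if statements could also be driven by a dictionary, so that adding a category means adding one entry. To keep the sketch below runnable without NLTK, it swaps Snowball stemming for plain prefix matching, which is cruder (the prefix 'car' would also match 'career', for instance). The names `PREFIX_TO_CATEGORY` and `purpose_cat_prefix` are hypothetical.

```python
# Word prefixes mapped to the category they signal. Dictionary order sets the
# priority, mirroring the order of the if statements above.
PREFIX_TO_CATEGORY = {
    'wed': 'wedding',
    'estate': 'real estate',
    'hous': 'real estate',
    'propert': 'real estate',
    'car': 'car',
    'educ': 'education',
    'univers': 'education',
}


def purpose_cat_prefix(purpose):
    """Return the category of the first prefix that matches any word in purpose."""
    words = purpose.lower().split()
    for prefix, category in PREFIX_TO_CATEGORY.items():
        if any(word.startswith(prefix) for word in words):
            return category
    return 'unknown'


for text in ['purchase of the house', 'car purchase', 'to have a wedding']:
    print(text, '->', purpose_cat_prefix(text))
```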

I will call the apply() method for the purpose_cat function on the purpose column.

In [169]:
credit_data['purpose'] = credit_data['purpose'].apply(purpose_cat)

In [170]:
#check data columns
credit_data.columns

Index(['children', 'days_employed', 'dob_years', 'education', 'family_status',
       'gender', 'income_type', 'debt', 'total_income', 'purpose'],
      dtype='object')

After cleaning and organizing the required columns, I will now use the astype() method with 'category' as the argument to convert them to the categorical data type.

In [171]:
credit_data['education'] = credit_data.education.astype('category')
credit_data['family_status'] = credit_data.family_status.astype('category')
credit_data['gender'] = credit_data.gender.astype('category')
credit_data['income_type'] = credit_data.income_type.astype('category')
credit_data['purpose'] = credit_data.purpose.astype('category')

In [172]:
#check the data types of the columns
credit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21471 entries, 0 to 21470
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   children       21471 non-null  int64   
 1   days_employed  21471 non-null  int64   
 2   dob_years      21471 non-null  float64 
 3   education      21471 non-null  category
 4   family_status  21471 non-null  category
 5   gender         21471 non-null  category
 6   income_type    21471 non-null  category
 7   debt           21471 non-null  int64   
 8   total_income   21471 non-null  int64   
 9   purpose        21471 non-null  category
dtypes: category(5), float64(1), int64(4)
memory usage: 944.5 KB


In the children column, I'll use the describe() method to obtain summary statistics.

In [173]:
credit_data['children'].describe()

count    21471.000000
mean         0.539565
std          1.382978
min         -1.000000
25%          0.000000
50%          0.000000
75%          1.000000
max         20.000000
Name: children, dtype: float64

I can see that there are values of -1 and 20 children!
I'll change -1 children to 1, because the minus sign was probably an error made during input.
I'll also change 20 children to 2, because the extra 0 was probably an error made during input.

In [174]:
credit_data.loc[credit_data['children'] == -1, 'children'] = 1
credit_data.loc[credit_data['children'] == 20, 'children'] = 2
credit_data['children'].value_counts()

0    14107
1     4856
2     2128
3      330
4       41
5        9
Name: children, dtype: int64

I will create a function that will group the number of children into categories, and test it.

In [175]:
def child_group(children):
    if children == 0:
        return 'None'
    if 1 <= children <= 2:
        return 'Some'
    if 3 <= children <= 4:
        return 'Many'
    else:
        return 'Most'
print(child_group(5))

Most
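
The same grouping can be expressed with pd.cut(), which bins numbers into labelled intervals and returns an ordered categorical in one step. This is a sketch under the same group boundaries as child_group; the bin edges are my translation of its conditions.

```python
import numpy as np
import pandas as pd

children = pd.Series([0, 1, 2, 3, 4, 5])

# Intervals are right-closed: (-1, 0], (0, 2], (2, 4], (4, inf)
groups = pd.cut(
    children,
    bins=[-1, 0, 2, 4, np.inf],
    labels=['None', 'Some', 'Many', 'Most'],
)

print(groups.tolist())
# ['None', 'Some', 'Some', 'Many', 'Many', 'Most']
```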


I will call the apply() method for the child_group function on the children column.
Then categorize it.

In [176]:
credit_data['children'] = credit_data['children'].apply(child_group)

In [177]:
credit_data['children'] = credit_data['children'].astype('category')

In [178]:
#sample the data
credit_data.sample(20)

Unnamed: 0,children,days_employed,dob_years,education,family_status,gender,income_type,debt,total_income,purpose
2766,Some,-3285,27.0,secondary education,married,F,employee,0,41554,education
20259,Some,-1649,29.0,secondary education,married,M,employee,0,24506,education
8563,None,-509,45.0,bachelor's degree,married,F,civil servant,0,28463,real estate
19203,None,-3368,42.0,secondary education,married,M,business,0,18201,real estate
13964,None,-3709,49.0,secondary education,married,M,employee,0,27600,car
15492,Some,-2830,54.0,secondary education,civil partnership,F,employee,0,18721,real estate
1641,None,-730,37.0,bachelor's degree,unmarried,F,employee,0,24749,real estate
17344,Many,-3092,30.0,bachelor's degree,married,F,business,0,30342,real estate
16693,None,341613,64.0,secondary education,civil partnership,F,retiree,0,16866,real estate
8564,None,-178,26.0,bachelor's degree,unmarried,M,employee,0,26439,real estate


I will use the value_counts() method to identify unique values and their counts in the debt column.

In [179]:
credit_data['debt'].value_counts()

0    19730
1     1741
Name: debt, dtype: int64

I will create a function that returns a string, labelling 0 as having paid the debt and 1 as having defaulted on it.

In [180]:
def debt_group(debt):
    if debt == 0:
        return 'paid'
    else:
        return 'defaulted'

I will then create a new column, debt_group, by applying the debt_group function to the debt column with the apply() method.

In [181]:
credit_data['debt_group'] = credit_data['debt'].apply(debt_group)

I will categorize the new column that I created.

In [182]:
credit_data['debt_group'] = credit_data['debt_group'].astype('category')
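
For a fixed two-value mapping like this one, map() with a dictionary gives the same labels without a helper function. A minimal sketch on a made-up Series:

```python
import pandas as pd

debt = pd.Series([0, 1, 0])

# Translate the 0/1 flag into labels and convert to the category type.
debt_labels = debt.map({0: 'paid', 1: 'defaulted'}).astype('category')

print(debt_labels.tolist())  # ['paid', 'defaulted', 'paid']
```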

In [183]:
#check the age column
credit_data['dob_years'].describe()

count    21471.000000
mean        43.476643
std         12.217612
min         19.000000
25%         33.500000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

In [184]:
#change dob_years to int
credit_data['dob_years'] = credit_data['dob_years'].astype('int64')

In [185]:
credit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21471 entries, 0 to 21470
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   children       21471 non-null  category
 1   days_employed  21471 non-null  int64   
 2   dob_years      21471 non-null  int64   
 3   education      21471 non-null  category
 4   family_status  21471 non-null  category
 5   gender         21471 non-null  category
 6   income_type    21471 non-null  category
 7   debt           21471 non-null  int64   
 8   total_income   21471 non-null  int64   
 9   purpose        21471 non-null  category
 10  debt_group     21471 non-null  category
dtypes: category(7), int64(4)
memory usage: 819.0 KB


### Task 3

### Is there a connection between having kids and repaying a loan on time?

I will create pivot tables to summarize the data by the chosen characteristic.

I use the pivot_table() method with the index, values, columns and aggfunc arguments.
I will rename the columns and drop the column of defaulted counts, since I do not use it in my analysis.
I will then express the number of paid loans in each group as a percentage of the total number of loans in the data, to look for a connection.

In [186]:
children = credit_data.pivot_table(index=['children'], values='debt', columns='debt_group', aggfunc=['count'])
total_data = credit_data.shape[0] #total number of loans in the data
children.columns = ['defaulted', 'paid_counts']
children.drop('defaulted', axis=1, inplace=True)
children['%paid'] = (children['paid_counts'] / total_data) * 100
print(children.sort_values(by='%paid', ascending=False))

          paid_counts      %paid
children                        
None          13044.0  60.751712
Some           6337.0  29.514228
Many            340.0   1.583531
Most              9.0   0.041917


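##### There appears to be a connection between having kids and repaying a loan on time: the group with no children accounts for the largest share of paid loans (about 61% of all loans), and the group with the most children for the smallest (about 0.04%).

Because %paid divides by the total number of loans in the data, it also reflects how many clients fall into each group.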
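
To separate group size from repayment behaviour, the paid count can also be divided by the number of loans within each group. The sketch below shows both percentages side by side on a made-up DataFrame (`toy`); the column names `share_of_all_loans` and `repayment_rate` are hypothetical.

```python
import pandas as pd

# Made-up data: 4 clients without children, 2 with some children.
toy = pd.DataFrame({
    'children': ['None', 'None', 'None', 'None', 'Some', 'Some'],
    'debt': [0, 0, 0, 1, 0, 1],   # 1 = defaulted
})
total_loans = len(toy)

# Loans and defaults per group.
summary = toy.groupby('children')['debt'].agg(loans='count', defaulted='sum')
summary['paid'] = summary['loans'] - summary['defaulted']

# Paid loans as a share of ALL loans (the measure used above).
summary['share_of_all_loans'] = summary['paid'] / total_loans * 100

# Paid loans as a share of the group's OWN loans.
summary['repayment_rate'] = summary['paid'] / summary['loans'] * 100

print(summary)
```

In this toy example the larger group has both the higher share and the higher rate, but the two measures need not agree in general.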

### Is there a connection between marital status and repaying a loan on time?

In [187]:
family = credit_data.pivot_table(index='family_status', values='debt', columns='debt_group', aggfunc='count')
family.columns = ['defaulted', 'paid_counts']
family.drop('defaulted', axis=1, inplace=True)
family['total'] = credit_data.groupby('family_status')['debt_group'].count()
family['%paid'] = (family['paid_counts'] / total_data) * 100
print(family.sort_values(by='%paid', ascending=False))

                   paid_counts  total      %paid
family_status                                   
married                  11413  12344  53.155419
civil partnership         3775   4163  17.581855
unmarried                 2536   2810  11.811280
divorced                  1110   1195   5.169764
widow / widower            896    959   4.173071


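#### Yes, there is. Married clients paid the most loans, accounting for more than half of all loans in the data.

#### They are followed by the civil partnership group.

#### The divorced and widow / widower groups paid the fewest loans.

These percentages are shares of all loans, so they largely reflect the size of each group. Dividing paid_counts by total within each group gives repayment rates that are much closer together, from about 90% (unmarried) to about 93% (widow / widower).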
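
Within-group percentages like these can be obtained directly with pd.crosstab() and normalize='index', which makes each row sum to 1. A minimal sketch on a made-up DataFrame (`toy`), not the real data:

```python
import pandas as pd

toy = pd.DataFrame({
    'family_status': ['married', 'married', 'married', 'divorced', 'divorced'],
    'debt_group': ['paid', 'paid', 'defaulted', 'paid', 'paid'],
})

# Each row shows how that group's own loans split into defaulted and paid.
rates = pd.crosstab(toy['family_status'], toy['debt_group'], normalize='index') * 100

print(rates.round(2))
```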

### Is there a connection between income level and repaying a loan on time?

In [188]:
income = credit_data.pivot_table(index='income_type', values='debt', columns='debt_group', aggfunc='count')
income.columns = ['defaulted', 'paid_counts']
income.drop('defaulted', axis=1, inplace=True)
income['%paid'] = (income['paid_counts'] / total_data) * 100
print(income.sort_values(by='%paid', ascending=False))

               paid_counts      %paid
income_type                          
employee             10030  46.714173
business              4706  21.917936
retiree               3621  16.864608
civil servant         1371   6.385357
unemployed               2   0.009315


##### Employed clients paid the most loans, followed by the business and retiree groups.

##### The unemployed group paid the fewest loans.

### How do different loan purposes affect timely loan repayment?

In [189]:
purposes = credit_data.pivot_table(index='purpose', values='debt', columns='debt_group', aggfunc='count')
purposes.columns = ['defaulted', 'paid_counts']
purposes.drop('defaulted', axis=1, inplace=True)
purposes['%paid'] = (purposes['paid_counts'] / total_data) * 100
print(purposes.sort_values(by='%paid', ascending=False))

             paid_counts      %paid
purpose                            
real estate        10032  46.723487
car                 3905  18.187322
education           3644  16.971729
wedding             2149  10.008849


#### Clients who borrowed for real estate paid the most loans.

#### Car and education loans were paid at almost the same percentage.

#### Wedding loans were paid the least.

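#### Use of the try-except statement

In [190]:
#create a reusable function that sorts a Series of percentages in descending order.
def sort_percentage(data):
    return data.sort_values(ascending=False)

try:
    #pass the whole column to the function; apply() would pass one number at a time
    percentages = sort_percentage(purposes['%paid'])
    print(percentages)
except (KeyError, AttributeError) as error:
    print('The function has raised an error:', error)

purpose
real estate    46.723487
car            18.187322
education      16.971729
wedding        10.008849
Name: %paid, dtype: float64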
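
The four pivot-table cells above repeat the same steps with a different column each time, so they are a natural candidate for a reusable function. The following is a sketch only: `paid_share` is a hypothetical helper, it counts paid loans with groupby() instead of pivot_table(), and it is demonstrated on a made-up DataFrame (`toy`).

```python
import pandas as pd


def paid_share(data, column, total):
    """Count paid loans per group in `column` and express them as % of `total`."""
    paid = data[data['debt'] == 0].groupby(column)['debt'].count()
    result = paid.to_frame('paid_counts')
    result['%paid'] = result['paid_counts'] / total * 100
    return result.sort_values(by='%paid', ascending=False)


# Made-up data for illustration.
toy = pd.DataFrame({
    'purpose': ['car', 'car', 'wedding', 'real estate', 'real estate', 'real estate'],
    'debt': [0, 1, 0, 0, 0, 0],
})

result = paid_share(toy, 'purpose', total=len(toy))
print(result)
```

With the real data, the call would look like paid_share(credit_data, 'purpose', total_data).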


### Overall conclusion:

#### An ideal customer for credit is one who has no children, is married, employed and takes credit for real estate purposes.

#### A non-ideal customer for credit is one with more than 4 children, is a widow/widower or divorced, unemployed and takes credit for wedding purposes.

#### Most customers who took loans paid them back: there are far more paid loans than defaulted loans.
