In [6]:
import pandas as pd
import numpy as np

###### Read the data.

In [7]:
credit_data = pd.read_csv('credit_scoring_eng.csv')
credit_data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


###### Look at data types and identify columns with null values in the DataFrame, for further investigation.
I have used the info() method to identify which columns have null values.

In [8]:
credit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


###### Two columns have null values: 'days_employed' and 'total_income'

The shape of the DataFrame tells me the size of the data I am working with, which is especially useful when deleting null values and duplicates.
"shape" is an attribute of the DataFrame, not a method.

In [9]:
credit_data.shape

(21525, 12)

To count the number of null values in every column, I have chained the isnull() and sum() methods.

In [10]:
credit_data.isnull().sum()
#total_income has 2174 null values.

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

###### In this DataFrame, the two columns with null values both have 2174 null values.

My interpretation is that the null values in the two columns are not missing at random.
###### MNAR - Missing Not At Random - the likelihood of a value being missing depends on the value itself.
The 'days_employed' and 'total_income' columns are related: I assume that a missing value in both means there is no employment and hence no annual income.
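
Before relying on this interpretation, it may help to check that the nulls really fall in the same rows and to see which income types they belong to. The following is a minimal sketch on made-up data; the `toy` DataFrame and its values are assumptions for illustration, not rows from credit_scoring_eng.csv.

```python
import numpy as np
import pandas as pd

# Toy data standing in for credit_data (values are made up for illustration).
toy = pd.DataFrame({
    'income_type': ['employee', 'retiree', 'employee', 'business'],
    'days_employed': [-8437.7, np.nan, -4024.8, np.nan],
    'total_income': [40620.1, np.nan, 17932.8, np.nan],
})

# True when every row is either missing in both columns or present in both.
same_rows = toy['days_employed'].isnull().eq(toy['total_income'].isnull()).all()
print(same_rows)

# Share of rows with a missing income, per income type.
missing_share = toy['total_income'].isnull().groupby(toy['income_type']).mean()
print(missing_share)
```

If the missing share were high for income types such as employee, the "no employment" reading would be harder to defend.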

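I will fill the null values in the 'days_employed' and 'total_income' columns with 0, using the fillna() method with the keyword arguments value=0 and inplace=True (to modify the same DataFrame). The call also passes axis=1, but the axis has no effect when filling with a single scalar value. Because fillna() is called on the whole DataFrame, it would fill nulls in every column; here only these two columns contain any.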

I will then confirm the sum of null values in each column with method chaining of isnull() and sum().

In [11]:
credit_data.fillna(value=0, axis=1, inplace=True)
credit_data.isnull().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64
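
As a sketch of a more targeted fill, fillna() also accepts a dictionary that maps column names to fill values, so that other columns are left untouched. The `toy` DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: 'purpose' also has a missing value that should NOT be filled with 0.
toy = pd.DataFrame({
    'days_employed': [-8437.7, np.nan],
    'total_income': [40620.1, np.nan],
    'purpose': ['car purchase', None],
})

# Fill only the two numeric columns; the dict keys are column names.
filled = toy.fillna({'days_employed': 0, 'total_income': 0})
print(filled.isnull().sum())
```

This returns a new DataFrame instead of using inplace=True, which makes it easier to compare the data before and after the fill.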

I'll look at the DataFrame again with the info() method to identify data types in the columns.
The aim is to replace floats with integer data type.

In [12]:
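#info() prints its report and returns None; the print() wrapper is what adds the trailing "None" below
print(credit_data.info())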

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     21525 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      21525 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB
None


I'll convert the columns of type 'float64' to type 'int64' using the astype() method with np.int64 as the argument. The conversion drops the fractional part rather than rounding.

In [13]:
#days_employed to int
credit_data['days_employed'] = credit_data['days_employed'].astype(np.int64)

In [14]:
#total_income to int
credit_data['total_income'] = credit_data['total_income'].astype(np.int64)
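
To illustrate what astype(np.int64) does to fractional values, here is a small sketch comparing it with rounding first. The series values are illustrative (the first two are taken from the head() output above, the third is made up).

```python
import numpy as np
import pandas as pd

days = pd.Series([-8437.673028, 340266.072047, 0.9])

# astype() drops the fractional part (truncates toward zero).
truncated = days.astype(np.int64)
# Rounding first gives the nearest integer instead.
rounded = days.round().astype(np.int64)

print(truncated.tolist())  # [-8437, 340266, 0]
print(rounded.tolist())    # [-8438, 340266, 1]
```

For income and days employed the difference is small, but it is worth knowing which of the two behaviours is being applied.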

To identify duplicated rows in the data, I have used the duplicated() method, and chained it with sum() to find the total.
The sum helps me to see how much of my data I would lose if I dropped the duplicates.
These duplicates could have arisen during data collection, for example if clients filled in their details more than once.

In [15]:
print(credit_data.duplicated().sum())

54


Only 54 out of 21525 rows are duplicates. I will drop them using the drop_duplicates() method, with the keyword argument ignore_index=True so that I don't have to reset the index, and inplace=True to keep the same DataFrame.

In [16]:
credit_data.drop_duplicates(inplace=True, ignore_index=True)

In [17]:
#the shape of my DataFrame has changed because of the duplicated rows that have been removed.
credit_data.shape

(21471, 12)
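
One thing worth checking: duplicated() compares strings exactly, so rows that differ only in letter case are not counted as duplicates. The sketch below uses a made-up `toy` DataFrame to show how the count can change once the text is lowercased first; whether this affects the real data is not something this notebook verifies.

```python
import pandas as pd

# Toy data: the three rows differ only in the letter case of 'education'.
toy = pd.DataFrame({
    'education': ['secondary education', 'Secondary Education', 'secondary education'],
    'total_income': [17932, 17932, 17932],
})

# Only the exact repeat (row 2 of row 0) is flagged.
exact_dupes = toy.duplicated().sum()

# After lowercasing, all three rows match, so two are flagged.
normalized = toy.assign(education=toy['education'].str.lower())
normalized_dupes = normalized.duplicated().sum()

# ignore_index=True renumbers the remaining rows from 0.
deduped = normalized.drop_duplicates(ignore_index=True)
print(exact_dupes, normalized_dupes, deduped.shape)
```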

In [18]:
#the age column has 0 values. This could be because the client did not give his/her age during data collection.
#I will set the 0 values to the median of the age column
credit_data.loc[credit_data['dob_years'] == 0, 'dob_years'] = credit_data['dob_years'].median() 
credit_data['dob_years'].describe()

count    21471.000000
mean        43.476643
std         12.217612
min         19.000000
25%         33.500000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64
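
The describe() output shows that the column became float64 after the assignment, because median() returns a float. As a sketch of one alternative, the median can be computed from the non-zero ages only and cast to an integer before it is used. The `ages` series is made up, and truncating the median with int() is my own choice for the example.

```python
import pandas as pd

ages = pd.Series([42, 0, 33, 53, 0, 36])

# Median of the known ages only (33, 36, 42, 53), so the zeros do not pull it down.
median_age = int(ages[ages != 0].median())

# Keep the age where it is non-zero, otherwise use the median.
cleaned = ages.where(ages != 0, median_age)
print(cleaned.tolist())  # [42, 39, 33, 53, 39, 36]
print(cleaned.dtype)
```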

###### Categorize the data

I will categorize the following columns: education, family_status, gender, income_type, purpose.
This is because the values in these columns are discrete and can be combined into categories.

I will use the value_counts() method to identify unique values and their counts in the education column.

In [19]:
credit_data['education'].value_counts()

secondary education    13705
bachelor's degree       4710
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        273
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
GRADUATE DEGREE            1
Graduate Degree            1
Name: education, dtype: int64

###### There are a lot of repeated values in the education column, mainly because of differences in uppercase and lowercase letters.

I will convert all the values in the education column to lowercase using the str.lower() method.
The .str accessor applies the string method lower() to every value in the column.

The unique() method is used to confirm that only distinct values remain in the education column.

In [20]:
credit_data['education'] = credit_data['education'].str.lower()
credit_data['education'].unique()

array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)
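
A further check that could be made here is that each education_id now corresponds to exactly one education label, which would also give a small lookup table for the pair. This is a sketch on made-up rows; the names `labels_per_id` and `education_dict` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'education': ["bachelor's degree", 'Secondary Education',
                  'secondary education', "BACHELOR'S DEGREE"],
    'education_id': [0, 1, 1, 0],
})
toy['education'] = toy['education'].str.lower()

# Number of distinct labels per id; every value should be 1 after lowercasing.
labels_per_id = toy.groupby('education_id')['education'].nunique()
print(labels_per_id)

# One row per (id, label) pair, usable as a lookup table.
education_dict = toy[['education_id', 'education']].drop_duplicates().reset_index(drop=True)
print(education_dict)
```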

I will use the value_counts() method to identify unique values and their counts in the family_status column.

In [21]:
credit_data['family_status'].value_counts()

married              12344
civil partnership     4163
unmarried             2810
divorced              1195
widow / widower        959
Name: family_status, dtype: int64

I will use the value_counts() method to identify unique values and their counts in the gender column.

In [22]:
credit_data['gender'].value_counts()

F      14189
M       7281
XNA        1
Name: gender, dtype: int64

To have two categories in the gender column, I'll replace the XNA with the most frequent gender in the data.

I use the describe() method to get summary statistics of the gender column, access 'top' (the most frequent value) as an attribute of the result, and assign it to the row that has 'XNA'.

In [23]:
credit_data.loc[credit_data['gender'] == 'XNA', 'gender'] = credit_data.gender.describe().top
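
The same replacement can be sketched with mode(), which returns the most frequent value(s) directly. The `gender` series below is made up for illustration.

```python
import pandas as pd

gender = pd.Series(['F', 'M', 'F', 'XNA', 'F'])

# Most frequent value among the known genders; mode() returns a Series, so take the first entry.
most_common = gender[gender != 'XNA'].mode()[0]

gender = gender.replace('XNA', most_common)
print(gender.tolist())  # ['F', 'M', 'F', 'F', 'F']
```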

I will use the value_counts() method to identify unique values and their counts in the income_type column.

In [24]:
credit_data['income_type'].value_counts()

employee                       11091
business                        5080
retiree                         3837
civil servant                   1457
unemployed                         2
entrepreneur                       2
paternity / maternity leave        1
student                            1
Name: income_type, dtype: int64

In cleaning the income_type column, I'll assume that the student is unemployed, and that the client on paternity / maternity leave is employed (they were on leave at the time of data collection, but are generally employed).

So, I'll change that using the loc accessor to select every row in the column that meets a given condition, then assign the desired value.

In [25]:
credit_data.loc[credit_data['income_type'] == 'student', 'income_type'] = 'unemployed'
credit_data.loc[credit_data['income_type'] == 'paternity / maternity leave', 'income_type'] = 'employee'

Business and entrepreneur should also fall in the same category, so I'll change entrepreneur to business.

In [26]:
credit_data.loc[credit_data['income_type'] == 'entrepreneur', 'income_type'] = 'business'
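
The three loc assignments above could also be written as one replace() call with a dictionary, which keeps all the merging decisions in one place. A minimal sketch on a made-up series (the name `merge_map` is hypothetical):

```python
import pandas as pd

income_type = pd.Series(['employee', 'student', 'entrepreneur',
                         'paternity / maternity leave', 'retiree'])

# Old value -> new value; values not listed are left unchanged.
merge_map = {
    'student': 'unemployed',
    'paternity / maternity leave': 'employee',
    'entrepreneur': 'business',
}
merged = income_type.replace(merge_map)
print(merged.tolist())
```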

I will use stemming to categorize the strings in the purpose column, because stemming reduces a word to its stem, so that different forms of the same word match.

The NLTK library is used for stemming. The stem() method finds the stem of each word.

I identified the distinguishing words in the purpose column by eyeballing the rows, and put them in a list.
I used a for loop to find the stem of each word. These stems are used in the next step, to create the if conditions.

In [27]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

#preview the stems of the keywords; purpose_cat below checks for these stems
words = ['wedding', 'estate', 'house', 'property', 'car', 'education', 'university']
stem = [stemmer.stem(word) for word in words]
print(stem)

['wed', 'estat', 'hous', 'properti', 'car', 'educ', 'univers']


I have defined a function that finds the stem of every word in every row in the purpose column, and checks with a condition to return a given categorical word.

In [28]:
def purpose_cat(purpose):
    stemmed = [stemmer.stem(word) for word in purpose.split(' ')]
    if 'wed' in stemmed:
        return 'wedding'
    if 'estat' in stemmed:
        return 'real estate'
    if 'hous' in stemmed:
        return 'real estate'
    if 'properti' in stemmed:
        return 'real estate'
    if 'car' in stemmed:
        return 'car'
    if 'educ' in stemmed:
        return 'education'
    if 'univers' in stemmed:
        return 'education'
    else:
        return 'unknown'
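
The chain of if statements can also be driven by a table of rules, which is easier to extend. The sketch below deliberately swaps the NLTK stemmer for plain prefix matching so that it runs without extra libraries; it is not the same technique, and a short prefix such as 'car' would also match words like 'career'. The names `PURPOSE_RULES` and `purpose_cat_by_prefix` are hypothetical.

```python
# Category -> word prefixes that indicate it. Order matters: the first match wins,
# mirroring the order of the if statements in purpose_cat.
PURPOSE_RULES = [
    ('wedding', ('wed',)),
    ('real estate', ('estat', 'hous', 'propert')),
    ('car', ('car',)),
    ('education', ('educ', 'univers')),
]

def purpose_cat_by_prefix(purpose):
    """Return the first category whose prefixes match any word of the purpose."""
    words = purpose.lower().split()
    for category, prefixes in PURPOSE_RULES:
        # str.startswith accepts a tuple and checks each prefix in turn.
        if any(word.startswith(prefixes) for word in words):
            return category
    return 'unknown'

print(purpose_cat_by_prefix('purchase of the house'))
print(purpose_cat_by_prefix('supplementary education'))
```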

I will create a new column, purpose_category, by calling the apply() method on the purpose column with the purpose_cat function.

In [29]:
credit_data['purpose_category'] = credit_data['purpose'].apply(purpose_cat)


In [30]:
#check data columns
credit_data.columns

Index(['children', 'days_employed', 'dob_years', 'education', 'education_id',
       'family_status', 'family_status_id', 'gender', 'income_type', 'debt',
       'total_income', 'purpose', 'purpose_category'],
      dtype='object')

After cleaning and organizing these required columns, now I will use the astype() method with 'category' as an argument, to categorize the columns.

In [31]:
credit_data['education'] = credit_data.education.astype('category')
credit_data['family_status'] = credit_data.family_status.astype('category')
credit_data['gender'] = credit_data.gender.astype('category')
credit_data['income_type'] = credit_data.income_type.astype('category')
credit_data['purpose_category'] = credit_data.purpose_category.astype('category')

In [32]:
#check the data types of the columns
credit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21471 entries, 0 to 21470
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   children          21471 non-null  int64   
 1   days_employed     21471 non-null  int64   
 2   dob_years         21471 non-null  float64 
 3   education         21471 non-null  category
 4   education_id      21471 non-null  int64   
 5   family_status     21471 non-null  category
 6   family_status_id  21471 non-null  int64   
 7   gender            21471 non-null  category
 8   income_type       21471 non-null  category
 9   debt              21471 non-null  int64   
 10  total_income      21471 non-null  int64   
 11  purpose           21471 non-null  object  
 12  purpose_category  21471 non-null  category
dtypes: category(5), float64(1), int64(6), object(1)
memory usage: 1.4+ MB


In the children column, I'll use the describe() method to obtain summary statistics.

In [33]:
credit_data['children'].describe()

count    21471.000000
mean         0.539565
std          1.382978
min         -1.000000
25%          0.000000
50%          0.000000
75%          1.000000
max         20.000000
Name: children, dtype: float64

I can see that there are values of -1 and 20 children!
I'll change the -1 to 1, because it is probably a sign error made during input.
I'll also change the 20 to 2, because it is probably an extra zero typed during input.

In [34]:
credit_data.loc[credit_data['children'] == -1, 'children'] = 1
credit_data.loc[credit_data['children'] == 20, 'children'] = 2
credit_data['children'].value_counts()

0    14107
1     4856
2     2128
3      330
4       41
5        9
Name: children, dtype: int64

I will create a function that will group the number of children into categories, and test it.

In [35]:
def child_group(children):
    if children == 0:
        return 'None'
    if 1 <= children <= 2:
        return 'Some'
    if 3 <= children <= 4:
        return 'Many'
    else:
        return 'Most'
print(child_group(5))

Most
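
The same grouping can be sketched with pd.cut(), which assigns each value to a labelled interval without a hand-written function. The bin edges below are my reading of the conditions in child_group (0, 1 to 2, 3 to 4, 5 or more); the `children` series is made up.

```python
import numpy as np
import pandas as pd

children = pd.Series([0, 1, 2, 3, 4, 5])

# Intervals are right-closed: (-1, 0], (0, 2], (2, 4], (4, inf].
child_group = pd.cut(
    children,
    bins=[-1, 0, 2, 4, np.inf],
    labels=['None', 'Some', 'Many', 'Most'],
)
print(child_group.tolist())  # ['None', 'Some', 'Some', 'Many', 'Many', 'Most']
```

The result is already a categorical column, so no separate astype('category') step would be needed.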


I will create a new column, child_group, by calling the apply() method on the children column with the child_group function.

In [36]:
credit_data['child_group'] = credit_data['children'].apply(child_group)

In [37]:
#sample the data
credit_data.sample(2)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,purpose_category,child_group
5857,2,-138,30.0,secondary education,1,married,0,F,employee,1,29973,housing transactions,real estate,Some
18383,0,391094,57.0,secondary education,1,widow / widower,2,F,retiree,1,24944,getting higher education,education,None


I will use the value_counts() method to identify unique values and their counts in the debt column.

In [38]:
credit_data['debt'].value_counts()

0    19730
1     1741
Name: debt, dtype: int64

I will create a function that returns a string to define 0 as having paid the debt, and 1 as having defaulted on paying the debt.

In [39]:
def debt_group(debt):
    if debt == 0:
        return 'paid'
    else:
        return 'defaulted'

I will then create a new column, debt_group, by calling the apply() method on the debt column with the debt_group function.

In [40]:
credit_data['debt_group'] = credit_data['debt'].apply(debt_group)
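
For a simple two-value recoding like this one, map() with a dictionary is a compact alternative. Note one difference from debt_group: any value that is not a key in the dictionary becomes NaN instead of 'defaulted'. The `debt` series is made up.

```python
import pandas as pd

debt = pd.Series([0, 1, 0])

# Each value is looked up in the dictionary; unlisted values would become NaN.
debt_labels = debt.map({0: 'paid', 1: 'defaulted'})
print(debt_labels.tolist())  # ['paid', 'defaulted', 'paid']
```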

I will categorize the new columns that I created.

In [41]:
credit_data['debt_group'] = credit_data['debt_group'].astype('category')
credit_data['child_group'] = credit_data['child_group'].astype('category')

In [52]:
#check the age column after cleaning
credit_data['dob_years'].describe()

count    21471.000000
mean        43.476643
std         12.217612
min         19.000000
25%         33.500000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

In [43]:
#change dob_years to int
credit_data['dob_years'] = credit_data['dob_years'].astype('int64')

In [44]:
credit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21471 entries, 0 to 21470
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   children          21471 non-null  int64   
 1   days_employed     21471 non-null  int64   
 2   dob_years         21471 non-null  int64   
 3   education         21471 non-null  category
 4   education_id      21471 non-null  int64   
 5   family_status     21471 non-null  category
 6   family_status_id  21471 non-null  int64   
 7   gender            21471 non-null  category
 8   income_type       21471 non-null  category
 9   debt              21471 non-null  int64   
 10  total_income      21471 non-null  int64   
 11  purpose           21471 non-null  object  
 12  purpose_category  21471 non-null  category
 13  child_group       21471 non-null  category
 14  debt_group        21471 non-null  category
dtypes: category(7), int64(7), object(1)
memory usage: 1.5+ MB


In [45]:
#To do: create dictionaries, and note which dictionaries were selected for this data set and why.
#education, family_status: create a dictionary, drop the id column, then categorize the remaining column.
#children and purpose: categorize directly, without creating a new column.

### Task 3

#### Is there a connection between having kids and repaying a loan on time?

I will create pivot tables to summarize the data by the chosen variables.

I use the pivot_table() method with the index, values, columns and aggfunc arguments.
I will rename the columns and drop the column with the defaulted counts, since I only need the paid counts and the totals for my analysis.
I will then calculate the percentage of paid loans for each group to look for a connection.

In [46]:
children = credit_data.pivot_table(index=['child_group'], values='debt', columns='debt_group', aggfunc=['count'])

#type(children)
children.columns = ['defaulted', 'paid']
children.drop('defaulted', axis=1, inplace=True)
#children.columns
children['total'] = credit_data.groupby('child_group')['debt_group'].count()
children['%paid'] = (children['paid'] / children['total']) * 100
print(children.sort_values(by='%paid', ascending=False))

                paid  total       %paid
child_group                            
Most             9.0      9  100.000000
None         13044.0  14107   92.464734
Many           340.0    371   91.644205
Some          6337.0   6984   90.735968


##### There is no clear connection between having kids and repaying a loan on time. The group with the most children paid all their loans, but it has only 9 clients. The group with no children came second (92.5%) despite having the most loans, and the remaining groups are within about two percentage points of it.
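
Since the same table is built for every question in this task, the steps can be wrapped in one function. The sketch below relies on debt being 0 or 1, so that the sum of debt is the number of defaulted loans; the function name `paid_share` and the `toy` data are hypothetical.

```python
import pandas as pd

def paid_share(data, by):
    """Return paid count, total and %paid per group, highest %paid first."""
    # 'count' gives the number of loans, 'sum' of a 0/1 column gives the defaults.
    summary = data.groupby(by)['debt'].agg(total='count', defaulted='sum')
    summary['paid'] = summary['total'] - summary['defaulted']
    summary['%paid'] = summary['paid'] / summary['total'] * 100
    return summary[['paid', 'total', '%paid']].sort_values(by='%paid', ascending=False)

# Made-up data: two clients without children, four with some, one of whom defaulted.
toy = pd.DataFrame({
    'child_group': ['None', 'None', 'Some', 'Some', 'Some', 'Some'],
    'debt': [0, 0, 0, 1, 0, 0],
})
result = paid_share(toy, 'child_group')
print(result)
```

On the real data the call would look like paid_share(credit_data, 'family_status'), and so on for each grouping column.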

#### Is there a connection between marital status and repaying a loan on time?

In [47]:
#percentage of paid loans by family status
family = credit_data.pivot_table(index='family_status', values='debt', columns='debt_group', aggfunc='count')
family.columns = ['defaulted', 'paid']
family.drop('defaulted', axis=1, inplace=True)
family['total'] = credit_data.groupby('family_status')['debt_group'].count()
family['%paid'] = (family['paid'] / family['total']) * 100
print(family.sort_values(by='%paid', ascending=False))

                    paid  total      %paid
family_status                             
widow / widower      896    959  93.430657
divorced            1110   1195  92.887029
married            11413  12344  92.457874
civil partnership   3775   4163  90.679798
unmarried           2536   2810  90.249110


#### Yes, there is. Widows / widowers paid their loans the most often (93.4%), followed closely by the divorced group (92.9%).

#### The unmarried group paid their loans the least often (90.2%).

#### Is there a connection between income level and repaying a loan on time?

In [48]:
income = credit_data.pivot_table(index='income_type', values='debt', columns='debt_group', aggfunc='count')
income.columns = ['defaulted', 'paid']
income.drop('defaulted', axis=1, inplace=True)
income['total'] = credit_data.groupby('income_type')['debt_group'].count()
income['%paid'] = (income['paid'] / income['total']) * 100
print(income.sort_values(by='%paid', ascending=False))

                paid  total      %paid
income_type                           
retiree         3621   3837  94.370602
civil servant   1371   1457  94.097461
business        4706   5082  92.601338
employee       10030  11092  90.425532
unemployed         2      3  66.666667


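##### Retirees (94.4%) and civil servants (94.1%) paid their loans most often, followed by the business (92.6%) and employee (90.4%) groups.

##### The unemployed group paid their loans the least often (66.7%), but it has only 3 clients, so this figure is not reliable.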
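
The table above groups clients by income_type. To look at income level itself, total_income could be split into equally sized groups with pd.qcut() and compared in the same way. This is only a sketch on made-up numbers; the four labels and the choice of quartiles are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    'total_income': [10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000],
    'debt':         [1,     0,     0,     0,     1,     0,     0,     0],
})

# Split the incomes into four groups with (roughly) equal numbers of clients.
toy['income_level'] = pd.qcut(
    toy['total_income'], q=4,
    labels=['low', 'lower-middle', 'upper-middle', 'high'],
)

# debt is 0/1, so its mean is the default rate; 1 minus that is the paid share.
paid_pct = (1 - toy.groupby('income_level', observed=True)['debt'].mean()) * 100
print(paid_pct)
```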

#### How do different loan purposes affect timely loan repayment?

In [49]:
purposes = credit_data.pivot_table(index='purpose_category', values='debt', columns='debt_group', aggfunc='count')
purposes.columns = ['defaulted', 'paid']
purposes.drop('defaulted', axis=1, inplace=True)
purposes['total'] = credit_data.groupby('purpose_category')['debt_group'].count()
purposes['%paid'] = (purposes['paid'] / purposes['total']) * 100
print(purposes.sort_values(by='%paid', ascending=False))

                   paid  total      %paid
purpose_category                         
real estate       10032  10814  92.768633
wedding            2149   2335  92.034261
education          3644   4014  90.782262
car                3905   4308  90.645311


#### Clients who borrowed for real estate paid their loans most often (92.8%), followed by those who borrowed for a wedding (92.0%).

#### Clients who borrowed for education (90.8%) or a car (90.6%) paid their loans the least often.

In [56]:
#create a function for the three tasks - reusable code
def sort_percentage(data):
    return data.sort_values(ascending=False)

try:
    #call the function on the whole column; Series.apply() would pass one number at a time
    top = sort_percentage(purposes['%paid'])
    print(top)
except KeyError:
    #'return' is only valid inside a function, so the message is printed instead
    print('The function has raised an error.')