# Analyzing borrowers’ risk of defaulting

The goal of this project is to determine whether a client's marital status and number of children, among other factors, affect the likelihood that they will default on a loan. The findings will be considered when building a potential customer's credit score. The questions I will be testing:
- Does the number of children affect whether a client will default on a loan?
- Does the marital status of a client affect whether they will default on a loan?
- Does a client's income level affect whether they will default on a loan?
- Does the credit's purpose affect whether a client will default on a loan?
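
Each of these questions comes down to comparing default rates across groups. As a minimal sketch, assuming `debt` is coded 1 for a client who has defaulted and 0 otherwise (the toy rows below are invented for illustration and are not taken from the dataset), the mean of `debt` within each group gives that group's default rate:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'children': [0, 0, 1, 1, 2, 2],
    'debt':     [0, 1, 0, 0, 1, 1],
})

# Because `debt` is 0/1, its mean per group is the share of defaulting clients
default_rate = toy.groupby('children')['debt'].mean()
print(default_rate)
```

The same pattern would apply to marital status, income level, or loan purpose by changing the grouping column.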


In [1]:
# Loading all the libraries
import pandas as pd 

# Load the data
bank_data = pd.read_csv('/datasets/credit_scoring_eng.csv')

## Task 1. Data exploration

**Description of the data**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - client's age in years
- `education` - client's education
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - marital status identifier
- `gender` - gender of the client
- `income_type` - type of employment
- `debt` - was there any debt on loan repayment
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan

Next we will take a preliminary look through our data and take note of any changes that will have to be made to aid in clarity and processing.

In [2]:
# Let's see how many rows and columns our dataset has
bank_data.shape

(21525, 12)

In [3]:


print(bank_data.head(10))

   children  days_employed  dob_years            education  education_id  \
0         1   -8437.673028         42    bachelor's degree             0   
1         1   -4024.803754         36  secondary education             1   
2         0   -5623.422610         33  Secondary Education             1   
3         3   -4124.747207         32  secondary education             1   
4         0  340266.072047         53  secondary education             1   
5         0    -926.185831         27    bachelor's degree             0   
6         0   -2879.202052         43    bachelor's degree             0   
7         0    -152.779569         50  SECONDARY EDUCATION             1   
8         2   -6929.865299         35    BACHELOR'S DEGREE             0   
9         0   -2188.756445         41  secondary education             1   

       family_status  family_status_id gender income_type  debt  total_income  \
0            married                 0      F    employee     0     40620.102   
1

We can see that some entries convey the same information (e.g. 'secondary education') but are logged differently: some are in all capital letters, some in title case, and others in all lower case.

It is also apparent that the 'days_employed' column has some negative values that might need to be investigated. 

In [4]:
bank_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


Of the 12 columns, it seems like there are 2 columns with missing values: 'days_employed' and 'total_income'. Beyond that, the other 10 columns have 21,525 filled rows.



In [5]:
# Count the missing values in each column

bank_data.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

In [6]:
bank_data.isna().sum()/ bank_data.shape[0]

children            0.000000
days_employed       0.100999
dob_years           0.000000
education           0.000000
education_id        0.000000
family_status       0.000000
family_status_id    0.000000
gender              0.000000
income_type         0.000000
debt                0.000000
total_income        0.100999
purpose             0.000000
dtype: float64

It would seem that the missing values are symmetric, with both 'days_employed' and 'total_income' missing 2,174 entries, or about 10% of their data. My initial assumption is that clients who are not working (for example, retirees) would have no entry for 'days_employed' and, having no job, no 'total_income' either, but this assumption needs to be investigated further.
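
Equal counts alone do not prove that the two columns are missing on the same rows. One way to check this is to compare their missing-value masks directly. A minimal sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd
import numpy as np

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'days_employed': [100.0, np.nan, 250.0, np.nan],
    'total_income':  [2000.0, np.nan, 3100.0, np.nan],
})

# True for each row where the column has no value
missing_days = toy['days_employed'].isna()
missing_income = toy['total_income'].isna()

# True only if both columns are missing on exactly the same rows
same_rows = missing_days.equals(missing_income)
print(same_rows)
```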



In [7]:
bank_data[bank_data['income_type'] == 'unemployed']

| | children | days_employed | dob_years | education | education_id | family_status | family_status_id | gender | income_type | debt | total_income | purpose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3133 | 1 | 337524.466835 | 31 | secondary education | 1 | married | 0 | M | unemployed | 1 | 9593.119 | buying property for renting out |
| 14798 | 0 | 395302.838654 | 45 | Bachelor's Degree | 0 | civil partnership | 1 | F | unemployed | 0 | 32435.602 | housing renovation |


In [8]:
# Filter the rows where `days_employed` is missing and reset the index
missing_data = bank_data.loc[bank_data['days_employed'].isna()].reset_index(drop=True)
print()
print(missing_data.head(10))


   children  days_employed  dob_years            education  education_id  \
0         0            NaN         65  secondary education             1   
1         0            NaN         41  secondary education             1   
2         0            NaN         63  secondary education             1   
3         0            NaN         50  secondary education             1   
4         0            NaN         54  secondary education             1   
5         0            NaN         21  secondary education             1   
6         0            NaN         52    bachelor's degree             0   
7         1            NaN         32    bachelor's degree             0   
8         2            NaN         50    bachelor's degree             0   
9         0            NaN         52  secondary education             1   

       family_status  family_status_id gender    income_type  debt  \
0  civil partnership                 1      M        retiree     0   
1            married  

**Intermediate conclusion**

The number of rows in the filtered table (missing_data) matches the number of missing values. As stated above, missing values make up about 10% of both the 'days_employed' and 'total_income' columns.

My original thought was that the missing entries all belonged to retired customers, but the first 10 rows show that while some are indeed retired, other entries in the first 10 rows include 'civil servant', 'business', and 'employee'. So my original thought is at best partially correct, and some of the other entries may have been logged incorrectly or skipped by the client.

It's also interesting to note that clients who are labeled as 'unemployed' still have an entry in 'days_employed' and 'total_income'.

My next steps are to find possible reasons for the missing values as my original idea proved at least partially incorrect. I also will determine whether or not there is a pattern among the missing values. This will aid me later when I will have to fill in the missing values.


In [9]:
# Let's look at all columns for the clients with missing values
missing_data.head(10)


| | children | days_employed | dob_years | education | education_id | family_status | family_status_id | gender | income_type | debt | total_income | purpose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | NaN | 65 | secondary education | 1 | civil partnership | 1 | M | retiree | 0 | NaN | to have a wedding |
| 1 | 0 | NaN | 41 | secondary education | 1 | married | 0 | M | civil servant | 0 | NaN | education |
| 2 | 0 | NaN | 63 | secondary education | 1 | unmarried | 4 | F | retiree | 0 | NaN | building a real estate |
| 3 | 0 | NaN | 50 | secondary education | 1 | married | 0 | F | civil servant | 0 | NaN | second-hand car purchase |
| 4 | 0 | NaN | 54 | secondary education | 1 | civil partnership | 1 | F | retiree | 1 | NaN | to have a wedding |
| 5 | 0 | NaN | 21 | secondary education | 1 | unmarried | 4 | M | business | 0 | NaN | transactions with commercial real estate |
| 6 | 0 | NaN | 52 | bachelor's degree | 0 | married | 0 | F | retiree | 0 | NaN | purchase of the house for my family |
| 7 | 1 | NaN | 32 | bachelor's degree | 0 | married | 0 | M | civil servant | 0 | NaN | transactions with commercial real estate |
| 8 | 2 | NaN | 50 | bachelor's degree | 0 | married | 0 | F | employee | 0 | NaN | housing |
| 9 | 0 | NaN | 52 | secondary education | 1 | married | 0 | M | employee | 0 | NaN | housing |


In [10]:
print(missing_data['income_type'].value_counts())
print(missing_data['income_type'].value_counts()/ missing_data.shape[0])

employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: income_type, dtype: int64
employee         0.508280
business         0.233671
retiree          0.189972
civil servant    0.067617
entrepreneur     0.000460
Name: income_type, dtype: float64


In [11]:
print(bank_data['income_type'].value_counts())

print(bank_data['income_type'].value_counts()/bank_data.shape[0])

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
entrepreneur                       2
unemployed                         2
paternity / maternity leave        1
student                            1
Name: income_type, dtype: int64
employee                       0.516562
business                       0.236237
retiree                        0.179141
civil servant                  0.067782
entrepreneur                   0.000093
unemployed                     0.000093
paternity / maternity leave    0.000046
student                        0.000046
Name: income_type, dtype: float64


Checking the distribution of income type among the rows with missing values, it is apparent that 'employee' is the most prevalent income type, accounting for about 51% of the missing data. This goes against my original assumption that retirees would make up the bulk of the missing data. The bank_data table also contains retirees who do report an income, which means not all retirees failed to report one.


**Possible reasons for missing values in data**

I propose that the missing values are due to a possible human error. Comparing the percentage of income types with the percentage of income types with missing values, the percentages are very similar. This leads me to believe that there are missing values across almost all the income types except for 3 (unemployed, student, and paternity/maternity leave) and that there is no pattern.  



In [12]:
# Checking the distribution in the whole dataset
bank_data.isna().sum()/ bank_data.shape[0]


children            0.000000
days_employed       0.100999
dob_years           0.000000
education           0.000000
education_id        0.000000
family_status       0.000000
family_status_id    0.000000
gender              0.000000
income_type         0.000000
debt                0.000000
total_income        0.100999
purpose             0.000000
dtype: float64

In [13]:
print(missing_data['income_type'].value_counts())
print()
print(missing_data['income_type'].value_counts()/missing_data.shape[0])

employee         1105
business          508
retiree           413
civil servant     147
entrepreneur        1
Name: income_type, dtype: int64

employee         0.508280
business         0.233671
retiree          0.189972
civil servant    0.067617
entrepreneur     0.000460
Name: income_type, dtype: float64


**Intermediate conclusion**
I am first investigating whether income type will lend some clues as to whether there is a pattern for the missing values. The distribution in this filtered table is similar to the original dataset which leads me to believe that the missing values are random. 
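
The comparison described here can be sketched by placing the two distributions side by side in one table. The toy rows are invented for illustration; `value_counts(normalize=True)` returns shares rather than counts:

```python
import pandas as pd
import numpy as np

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'income_type':  ['employee', 'employee', 'retiree', 'business',
                     'employee', 'retiree'],
    'total_income': [2000.0, np.nan, 1500.0, np.nan, 2500.0, 1800.0],
})
toy_missing = toy[toy['total_income'].isna()]

comparison = pd.concat(
    [
        toy['income_type'].value_counts(normalize=True).rename('all_rows'),
        toy_missing['income_type'].value_counts(normalize=True).rename('missing_rows'),
    ],
    axis=1,
).fillna(0)  # a category absent from the missing subset has a share of 0
print(comparison)
```

Similar shares in the two columns would be consistent with values missing at random with respect to that characteristic.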




In [14]:
# Check for other reasons and patterns that could lead to missing values
missing_data.loc[missing_data['gender'] == 'F']['gender'].count() / missing_data.shape[0]

0.6826126954921803

In [15]:
print(bank_data.loc[bank_data['gender'] == 'F']['gender'].value_counts()/bank_data.shape[0])

F    0.66137
Name: gender, dtype: float64


In [16]:
missing_data.loc[missing_data['gender'] == 'M']['gender'].count() / missing_data.shape[0]

0.3173873045078197

In [17]:
print(bank_data.loc[bank_data['gender'] == 'M']['gender'].value_counts()/bank_data.shape[0])

M    0.338583
Name: gender, dtype: float64


In [18]:
print(missing_data['purpose'].value_counts())
print()
print(missing_data['purpose'].value_counts()/missing_data.shape[0])

having a wedding                            92
to have a wedding                           81
wedding ceremony                            76
construction of own property                75
housing transactions                        74
buy real estate                             72
purchase of the house for my family         71
transactions with my real estate            71
transactions with commercial real estate    70
housing renovation                          70
buy commercial real estate                  67
buying property for renting out             65
property                                    62
buy residential real estate                 61
real estate transactions                    61
housing                                     60
building a property                         59
cars                                        57
going to university                         56
to become educated                          55
second-hand car purchase                    54
buying my own

In [19]:
print(bank_data['purpose'].value_counts()/bank_data.shape[0])

wedding ceremony                            0.037027
having a wedding                            0.036098
to have a wedding                           0.035958
real estate transactions                    0.031405
buy commercial real estate                  0.030848
buying property for renting out             0.030337
housing transactions                        0.030337
transactions with commercial real estate    0.030244
housing                                     0.030058
purchase of the house                       0.030058
purchase of the house for my family         0.029779
construction of own property                0.029501
property                                    0.029454
transactions with my real estate            0.029268
building a real estate                      0.029082
buy real estate                             0.028990
purchase of my own house                    0.028804
building a property                         0.028804
housing renovation                          0.

**Intermediate conclusion**

After looking into the distribution of other possible factors (gender and loan purpose), it is apparent that the distributions in the filtered table are similar to those in the original table. I believe it is reasonable to conclude that the missing values are accidental. I will report this to the department responsible for this data and ask whether it is possible to get the metadata.
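
To avoid repeating this comparison by hand for every column, it could be wrapped in a small helper. `max_share_gap` below is a hypothetical function written for this sketch, not part of pandas, and the toy data is invented:

```python
import pandas as pd
import numpy as np

def max_share_gap(full, subset, column):
    """Largest absolute difference in category shares between two tables."""
    full_share = full[column].value_counts(normalize=True)
    subset_share = subset[column].value_counts(normalize=True)
    # Align on the union of categories; a missing category counts as share 0
    gap = full_share.subtract(subset_share, fill_value=0).abs()
    return gap.max()

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'gender':       ['F', 'F', 'M', 'M'],
    'purpose':      ['car', 'car', 'housing', 'wedding'],
    'total_income': [np.nan, 2000.0, np.nan, 1800.0],
})
toy_missing = toy[toy['total_income'].isna()]

for col in ['gender', 'purpose']:
    print(col, max_share_gap(toy, toy_missing, col))
```

A small gap for every column points the same way as the visual comparison; what counts as "small" is a judgment call that this sketch does not settle.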


**Conclusions**

After further investigation, I did not find any pattern in the missing values. The distributions of the missing values when filtering the data by gender, loan purpose, and income type did not yield any pattern as the distributions of the filtered table were similar to the original dataset.  



Using the absolute value method, I will adjust the 'days_employed' column so that all negative values become positive. As total income is imperative to our hypotheses, I will determine the possible factors that can affect one's income and then find the median or mean income for each of those factors. If there are outliers in these calculations I will use the median, and if there are none I will use the mean.



My next steps will be transforming the data. This will include converting all entries to lower case to make duplicates easier to find and eliminate, filling in missing values using the most suitable method, and addressing any potential errors.



## Data transformation




In [20]:
# Let's see all values in education column to check if and what spellings will need to be fixed
bank_data['education'].value_counts()


secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
Graduate Degree            1
GRADUATE DEGREE            1
Name: education, dtype: int64

In [21]:
# Convert all values to lower case
bank_data['education'] = bank_data['education'].str.lower()

In [22]:
# Checking all the values in the column to make sure we fixed them
bank_data['education'].value_counts()


secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64

In [23]:
# Let's see the distribution of values in the `children` column
bank_data['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

In [24]:
bank_data['children'].value_counts()/ bank_data['children'].count()

 0     0.657329
 1     0.223833
 2     0.095470
 3     0.015331
 20    0.003531
-1     0.002184
 4     0.001905
 5     0.000418
Name: children, dtype: float64

There are 47 entries of a client having -1 children. This makes up about 0.2% of the children column, a low percentage. It is possible that the client meant to indicate 1 child and a typo was made. Given the low percentage, I am going to change the -1 to 1, adding those 47 entries to the 1-child category.


In [25]:
# Replace -1 with 1 in the `children` column
bank_data['children'] = bank_data['children'].replace(-1, 1)

In [26]:
# Checking the `children` column again to make sure it's all fixed

bank_data['children'].value_counts()

0     14149
1      4865
2      2055
3       330
20       76
4        41
5         9
Name: children, dtype: int64

In [27]:
# Find problematic data in `days_employed`, if they exist, and calculate the percentage
print(bank_data['days_employed'].head(10))
print()
print(bank_data.loc[bank_data['days_employed'] < 0]['days_employed'].count())
print()
print(bank_data.loc[bank_data['days_employed'] < 0]['days_employed'].count() / bank_data['days_employed'].count())

0     -8437.673028
1     -4024.803754
2     -5623.422610
3     -4124.747207
4    340266.072047
5      -926.185831
6     -2879.202052
7      -152.779569
8     -6929.865299
9     -2188.756445
Name: days_employed, dtype: float64

15906

0.8219730246498889


In the 'days_employed' column, a large percentage (82%) of the non-missing entries are logged as negative numbers. Because the percentage is so high, it is possible this was caused by a technical issue. I propose that the numbers themselves are accurate but should be positive rather than negative. I will use the absolute value method to turn the negative values into positive ones.


In [28]:
# Address the problematic values, if they exist
bank_data['days_employed'] = bank_data['days_employed'].abs()


In [29]:
# Check the result - make sure it's fixed
bank_data['days_employed'].head(10)

0      8437.673028
1      4024.803754
2      5623.422610
3      4124.747207
4    340266.072047
5       926.185831
6      2879.202052
7       152.779569
8      6929.865299
9      2188.756445
Name: days_employed, dtype: float64

In [30]:
# Check the `dob_years` for suspicious values and count the percentage
print(bank_data['dob_years'].value_counts().sort_index())
print(bank_data.loc[bank_data['dob_years'] == 0].count() / bank_data['dob_years'].count())

0     101
19     14
20     51
21    111
22    183
23    254
24    264
25    357
26    408
27    493
28    503
29    545
30    540
31    560
32    510
33    581
34    603
35    617
36    555
37    537
38    598
39    573
40    609
41    607
42    597
43    513
44    547
45    497
46    475
47    480
48    538
49    508
50    514
51    448
52    484
53    459
54    479
55    443
56    487
57    460
58    461
59    444
60    377
61    355
62    352
63    269
64    265
65    194
66    183
67    167
68     99
69     85
70     65
71     58
72     33
73      8
74      6
75      1
Name: dob_years, dtype: int64
children            0.004692
days_employed       0.004228
dob_years           0.004692
education           0.004692
education_id        0.004692
family_status       0.004692
family_status_id    0.004692
gender              0.004692
income_type         0.004692
debt                0.004692
total_income        0.004228
purpose             0.004692
dtype: float64


In the 'dob_years' column, which records the client's age in years, there are 101 clients listed as 0 years old. This accounts for about 0.5% of the column. As the percentage is negligible, I will drop the rows that have 0 as the age.



In [31]:
# Address the issues in the `dob_years` column, if they exist
bank_data = bank_data.loc[bank_data['dob_years'] != 0]

In [32]:
# Check the result - make sure it's fixed
print(bank_data['dob_years'].value_counts().sort_index())

19     14
20     51
21    111
22    183
23    254
24    264
25    357
26    408
27    493
28    503
29    545
30    540
31    560
32    510
33    581
34    603
35    617
36    555
37    537
38    598
39    573
40    609
41    607
42    597
43    513
44    547
45    497
46    475
47    480
48    538
49    508
50    514
51    448
52    484
53    459
54    479
55    443
56    487
57    460
58    461
59    444
60    377
61    355
62    352
63    269
64    265
65    194
66    183
67    167
68     99
69     85
70     65
71     58
72     33
73      8
74      6
75      1
Name: dob_years, dtype: int64


In [33]:
# Let's see the values for the column
print(bank_data['family_status'].value_counts())

married              12331
civil partnership     4156
unmarried             2797
divorced              1185
widow / widower        955
Name: family_status, dtype: int64


In [34]:
# Address the problematic values in `family_status`, if they exist
 


There do not seem to be any problematic values in the 'family_status' column.

In [35]:
# Check the result - make sure it's fixed


In [36]:
# Let's see the values in the column
bank_data['gender'].value_counts()

F      14164
M       7259
XNA        1
Name: gender, dtype: int64

There is one entry in the gender column that is 'XNA'.

In [37]:
bank_data[bank_data['gender'] == 'XNA']

| | children | days_employed | dob_years | education | education_id | family_status | family_status_id | gender | income_type | debt | total_income | purpose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10701 | 0 | 2358.600502 | 24 | some college | 2 | civil partnership | 1 | XNA | business | 0 | 32624.825 | buy real estate |


There is no way of knowing if this was a typo or if this client identifies as 'non-binary'. In case the latter is true, I will keep this row and not change anything. 

In [38]:
# Let's see the values in the column
bank_data['income_type'].value_counts()

employee                       11064
business                        5065
retiree                         3836
civil servant                   1453
entrepreneur                       2
unemployed                         2
paternity / maternity leave        1
student                            1
Name: income_type, dtype: int64

One potential problem is the lack of clarity between 'business' and 'entrepreneur'. These values are not defined which leaves room for interpretation. I am going to interpret it the following way: An entrepreneur is someone who earns an income from their business. With that in mind, I am going to change 'entrepreneur' into 'business'. 

In [39]:
bank_data.loc[bank_data['income_type'] == 'entrepreneur','income_type'] = 'business'

In [40]:
# Check the result - make sure it's fixed
bank_data['income_type'].value_counts()


employee                       11064
business                        5067
retiree                         3836
civil servant                   1453
unemployed                         2
paternity / maternity leave        1
student                            1
Name: income_type, dtype: int64

In [41]:
# Checking duplicates
print(bank_data.duplicated().sum())

71


There are 71 duplicates in the dataset. As there is no unique identifier for each client, I will use the drop_duplicates method and reset the index.
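
Before dropping anything, it can help to look at the repeated rows themselves. In this sketch on invented data, `keep=False` marks every copy of a repeated row rather than only the later ones:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'children':  [0, 1, 0, 2],
    'dob_years': [35, 42, 35, 50],
    'purpose':   ['housing', 'cars', 'housing', 'education'],
})

# All rows that have at least one exact copy elsewhere in the table
repeated = toy[toy.duplicated(keep=False)]
print(repeated)

# Keep the first occurrence of each row and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), '->', len(deduplicated))
```

Without a client identifier, identical rows could in principle belong to different clients, so inspecting them first makes the decision to drop them easier to justify.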

In [42]:
# Address the duplicates, if they exist
bank_data = bank_data.drop_duplicates().reset_index(drop=True)

In [43]:
# Last check whether we have any duplicates
print(bank_data.duplicated().sum())

0


In [44]:
# Checking the size of the dataset 
bank_data.shape

(21353, 12)

After my initial processing my dataset shape has changed from 21,525 rows to 21,353 rows, a change of about 0.8%. There are currently still 12 columns. I've changed the 'entrepreneur' income type to business and made sure the entries in education were all uniform. I've also dropped duplicate rows and eliminated any rows where a client was logged as being 0 years old.




## Working with missing values

In case I need them, I will create two dictionaries: one for the education/education_id columns and another for the family_status/family_status_id columns. I chose these columns because they already come with unique ids that reference specific entries.


In [45]:
bank_data['education'].value_counts()


secondary education    15108
bachelor's degree       5215
some college             742
primary education        282
graduate degree            6
Name: education, dtype: int64

In [46]:
# Build the dictionaries
education = {
    0: 'bachelor\'s degree',
    1: 'secondary education',
    2: 'some college',
    3: 'primary education',
    4: 'graduate degree'
}



In [47]:
print(bank_data['family_status'].value_counts())


married              12290
civil partnership     4130
unmarried             2794
divorced              1185
widow / widower        954
Name: family_status, dtype: int64


In [48]:
family =  {
    0: 'married',
    1: 'civil partnership',
    2: 'widow / widower',
    3: 'divorced',
    4: 'unmarried'
}
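
Rather than typing the mappings by hand, they could also be derived from the data, which guards against typos. A minimal sketch, assuming each identifier maps to exactly one label once the letter case has been normalized (the toy rows are invented, and `education_map` is a name chosen for this example):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'education_id': [0, 1, 1, 2, 0],
    'education':    ["bachelor's degree", 'secondary education',
                     'secondary education', 'some college',
                     "bachelor's degree"],
})

# One row per distinct (id, label) pair
pairs = toy[['education_id', 'education']].drop_duplicates()

# Each id should appear once; otherwise the mapping would be ambiguous
assert pairs['education_id'].is_unique

education_map = pairs.set_index('education_id')['education'].to_dict()
print(education_map)
```

The same steps would work for `family_status_id` and `family_status`.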

### Restoring missing values in `total_income`

The 'total_income' and 'days_employed' columns have missing values that I will need to address. I will first decide which parameter best helps predict one's income or days employed, and then find the median and mean for each group of that parameter. Finally, after choosing between the mean and the median, I will use the chosen statistic to fill in the missing values.


In [49]:
bank_data['dob_years'].value_counts().sort_index()

19     14
20     51
21    111
22    183
23    252
24    264
25    357
26    408
27    493
28    503
29    544
30    537
31    559
32    509
33    581
34    601
35    616
36    554
37    536
38    597
39    572
40    607
41    605
42    596
43    512
44    545
45    496
46    472
47    477
48    536
49    508
50    513
51    446
52    484
53    459
54    476
55    443
56    483
57    456
58    454
59    443
60    374
61    354
62    348
63    269
64    260
65    193
66    182
67    167
68     99
69     85
70     65
71     56
72     33
73      8
74      6
75      1
Name: dob_years, dtype: int64

In [50]:
# Writing a function that calculates the age category

def age_category(age):
    if age < 20:
        return 'under twenty'
    if 20 <= age <= 29:
        return '20s'
    if 30 <= age <= 39:
        return '30s'
    if 40 <= age <= 49:
        return '40s'
    if 50 <= age <= 59:
        return '50s'
    if 60 <= age <= 69:
        return '60s'
    if age >= 70:
        return 'over 70'
    

In [51]:
#Function test
age_category(57)

'50s'
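
The same binning can also be expressed with `pd.cut`, which avoids the chain of `if` statements. This is an alternative sketch, not the approach used in this notebook; the bin edges and labels mirror `age_category`, and the sample ages are invented:

```python
import pandas as pd

ages = pd.Series([19, 20, 29, 30, 57, 69, 70, 75])

# right=False makes each bin include its left edge: [0, 20), [20, 30), ...
age_cat = pd.cut(
    ages,
    bins=[0, 20, 30, 40, 50, 60, 70, float('inf')],
    right=False,
    labels=['under twenty', '20s', '30s', '40s', '50s', '60s', 'over 70'],
)
print(age_cat.tolist())
```

Note that `pd.cut` returns a categorical column, whereas `apply(age_category)` returns plain strings.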

In [52]:
# Creating new column based on function
bank_data['age_cat'] = bank_data['dob_years'].apply(age_category)
print(bank_data.head())

   children  days_employed  dob_years            education  education_id  \
0         1    8437.673028         42    bachelor's degree             0   
1         1    4024.803754         36  secondary education             1   
2         0    5623.422610         33  secondary education             1   
3         3    4124.747207         32  secondary education             1   
4         0  340266.072047         53  secondary education             1   

       family_status  family_status_id gender income_type  debt  total_income  \
0            married                 0      F    employee     0     40620.102   
1            married                 0      F    employee     0     17932.802   
2            married                 0      M    employee     0     23341.752   
3            married                 0      M    employee     0     42820.568   
4  civil partnership                 1      F     retiree     0     25378.572   

                   purpose age_cat  
0    purchase of th

In [53]:
# Checking the distribution of values in the new column
bank_data['age_cat'].value_counts()


30s             5662
40s             5354
50s             4657
20s             3166
60s             2331
over 70          169
under twenty      14
Name: age_cat, dtype: int64

One's income typically depends on one's job, education, age, experience/days employed, and gender. I will look at these parameters and choose the one that I feel has the strongest impact. 
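
One way to compare candidate parameters is to compute the mean, median, and group size for each in a single call. A minimal sketch on invented rows:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'education':    ['bachelor', 'bachelor', 'bachelor', 'secondary', 'secondary'],
    'gender':       ['F', 'M', 'F', 'M', 'F'],
    'total_income': [30000.0, 34000.0, 80000.0, 20000.0, 22000.0],
})

for col in ['education', 'gender']:
    # A mean well above the median hints at high-income outliers in the group
    summary = toy.groupby(col)['total_income'].agg(['mean', 'median', 'count'])
    print(summary, end='\n\n')
```

Including the count is useful because a mean or median computed from only a handful of rows is a weak basis for filling in missing values.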



In [54]:
# Creating a table without missing values
no_missing = bank_data.loc[bank_data['days_employed'].notna() & bank_data['total_income'].notna()]
print(no_missing.head())
print(no_missing.isna().sum())

   children  days_employed  dob_years            education  education_id  \
0         1    8437.673028         42    bachelor's degree             0   
1         1    4024.803754         36  secondary education             1   
2         0    5623.422610         33  secondary education             1   
3         3    4124.747207         32  secondary education             1   
4         0  340266.072047         53  secondary education             1   

       family_status  family_status_id gender income_type  debt  total_income  \
0            married                 0      F    employee     0     40620.102   
1            married                 0      F    employee     0     17932.802   
2            married                 0      M    employee     0     23341.752   
3            married                 0      M    employee     0     42820.568   
4  civil partnership                 1      F     retiree     0     25378.572   

                   purpose age_cat  
0    purchase of th

In [55]:
# Looking at the mean income for each of the identified factors
no_missing.groupby('age_cat')['total_income'].mean().sort_index()

age_cat
20s             25572.630177
30s             28312.479963
40s             28551.375635
50s             25811.700327
60s             23242.812818
over 70         20125.658331
under twenty    16993.942462
Name: total_income, dtype: float64

In [56]:
# Looking at the median income for each of the identified factors
no_missing.groupby('age_cat')['total_income'].median().sort_index()

age_cat
20s             22799.2580
30s             24667.5280
40s             24764.2290
50s             22203.0745
60s             19817.4400
over 70         18751.3240
under twenty    14934.9010
Name: total_income, dtype: float64

In [57]:
no_missing.groupby('gender')['total_income'].mean()

gender
F      24664.752169
M      30905.772981
XNA    32624.825000
Name: total_income, dtype: float64

In [58]:
no_missing.groupby('gender')['total_income'].median()

gender
F      21469.0015
M      26819.5670
XNA    32624.8250
Name: total_income, dtype: float64

In [59]:
no_missing.groupby('income_type')['total_income'].mean()

income_type
business                       32407.766937
civil servant                  27361.316126
employee                       25824.679592
paternity / maternity leave     8612.661000
retiree                        21939.310393
student                        15712.260000
unemployed                     21014.360500
Name: total_income, dtype: float64

In [60]:
no_missing.groupby('income_type')['total_income'].median()

income_type
business                       27571.0825
civil servant                  24083.5065
employee                       22815.1035
paternity / maternity leave     8612.6610
retiree                        18969.1490
student                        15712.2600
unemployed                     21014.3605
Name: total_income, dtype: float64

In [61]:
no_missing.groupby('education')['total_income'].mean()

education
bachelor's degree      33172.428387
graduate degree        27960.024667
primary education      21144.882211
secondary education    24600.353617
some college           29040.391842
Name: total_income, dtype: float64

In [62]:
education_income = no_missing.groupby('education')['total_income'].median()
# Median income for one education level, used to spot-check the grouped result
graduate_median = no_missing[no_missing['education'] == 'graduate degree']['total_income'].median()
print(education_income)
print(graduate_median)


education
bachelor's degree      28054.5310
graduate degree        25161.5835
primary education      18741.9760
secondary education    21839.4075
some college           25618.4640
Name: total_income, dtype: float64
25161.5835


Based on the above findings, I will choose education as the characteristic that best defines income. Within each education group the mean is noticeably higher than the median (for example, about 33,172 versus 28,055 for a bachelor's degree), which suggests high-income outliers, so I will use the median to fill in the missing values.
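The same fill can be sketched without hard-coding the medians, assuming pandas' `groupby(...).transform('median')`. The small `demo` frame and its values are made up for illustration and are not the bank data.

```python
import pandas as pd

# Hypothetical toy data: two education groups, one missing income in each
demo = pd.DataFrame({
    'education': ['some college', 'some college', 'some college',
                  'primary education', 'primary education', 'primary education'],
    'total_income': [20000.0, 30000.0, None, 10000.0, 14000.0, None],
})

# Median income of each row's education group (missing values are skipped)
group_median = demo.groupby('education')['total_income'].transform('median')

# Fill only the missing incomes with the median of their own group
demo['total_income'] = demo['total_income'].fillna(group_median)
print(demo)
```

Because the medians are computed from the data at fill time, they stay in step with the frame if rows are later dropped or corrected.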



In [63]:
# A function that returns the median income for a given education level
# (values copied from the medians printed above)
def income_values(edu_income):
    if edu_income == "bachelor's degree":
        return 28054.5310
    if edu_income == 'graduate degree':
        return 25161.5835
    if edu_income == 'primary education':
        return 18741.9760
    if edu_income == 'secondary education':
        return 21839.4075
    if edu_income == 'some college':
        return 25618.4640
    

In [64]:
# Check if it works
income_values('secondary education')

21839.4075

In [65]:
bank_data['total_income'].head()

0    40620.102
1    17932.802
2    23341.752
3    42820.568
4    25378.572
Name: total_income, dtype: float64

In [66]:
# Apply it to every row
bank_data['total_income'] = bank_data['total_income'].fillna(value = bank_data['education'].apply(income_values))

In [67]:
# Check if we got any errors
bank_data['total_income'].head()

0    40620.102
1    17932.802
2    23341.752
3    42820.568
4    25378.572
Name: total_income, dtype: float64

In [68]:
# Checking the number of entries in the columns
print(bank_data.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21353 entries, 0 to 21352
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21353 non-null  int64  
 1   days_employed     19260 non-null  float64
 2   dob_years         21353 non-null  int64  
 3   education         21353 non-null  object 
 4   education_id      21353 non-null  int64  
 5   family_status     21353 non-null  object 
 6   family_status_id  21353 non-null  int64  
 7   gender            21353 non-null  object 
 8   income_type       21353 non-null  object 
 9   debt              21353 non-null  int64  
 10  total_income      21353 non-null  float64
 11  purpose           21353 non-null  object 
 12  age_cat           21353 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.1+ MB
None


###  Restoring values in `days_employed`

In [69]:
no_missing = bank_data.loc[bank_data['days_employed'].notna()]
print(no_missing.head())

   children  days_employed  dob_years            education  education_id  \
0         1    8437.673028         42    bachelor's degree             0   
1         1    4024.803754         36  secondary education             1   
2         0    5623.422610         33  secondary education             1   
3         3    4124.747207         32  secondary education             1   
4         0  340266.072047         53  secondary education             1   

       family_status  family_status_id gender income_type  debt  total_income  \
0            married                 0      F    employee     0     40620.102   
1            married                 0      F    employee     0     17932.802   
2            married                 0      M    employee     0     23341.752   
3            married                 0      M    employee     0     42820.568   
4  civil partnership                 1      F     retiree     0     25378.572   

                   purpose age_cat  
0    purchase of th

In [70]:
# Distribution of `days_employed` medians based on identified parameters
print(no_missing.groupby('age_cat')['days_employed'].median())
print()
print(no_missing.groupby('education')['days_employed'].median())
print()
print(no_missing.groupby('income_type')['days_employed'].median())



age_cat
20s               1005.629955
30s               1601.784231
40s               2111.489906
50s               4796.767897
60s             354935.619093
over 70         361336.993449
under twenty       724.492610
Name: days_employed, dtype: float64

education
bachelor's degree      1895.747795
graduate degree        5660.057032
primary education      3043.933615
secondary education    2392.805768
some college           1209.230373
Name: days_employed, dtype: float64

income_type
business                         1548.009883
civil servant                    2673.404956
employee                         1576.067689
paternity / maternity leave      3296.759962
retiree                        365176.336775
student                           578.751554
unemployed                     366413.652744
Name: days_employed, dtype: float64


In [71]:
# Distribution of `days_employed` means based on identified parameters
print(no_missing.groupby('age_cat')['days_employed'].mean())
print()
print(no_missing.groupby('education')['days_employed'].mean())
print()
print(no_missing.groupby('income_type')['days_employed'].mean())

age_cat
20s               2089.054192
30s               4155.029251
40s              12383.580460
50s             132907.545543
60s             283926.481689
over 70         320819.151927
under twenty       633.678086
Name: days_employed, dtype: float64

education
bachelor's degree       42352.124485
graduate degree        121323.630206
primary education      130340.426349
secondary education     76376.801662
some college            20717.298527
Name: days_employed, dtype: float64

income_type
business                         2112.449218
civil servant                    3388.508552
employee                         2328.603723
paternity / maternity leave      3296.759962
retiree                        365015.727554
student                           578.751554
unemployed                     366413.652744
Name: days_employed, dtype: float64


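I will use the median, because the means are inflated by extreme values: in the 50s age group, for example, the mean is about 132,908 days while the median is about 4,797 days. I will fill the gaps by age category.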
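A minimal illustration of why the median is the safer choice here. The four values are hypothetical, with one extreme entry of the same order of magnitude as the largest `days_employed` values above.

```python
import pandas as pd

# Hypothetical values: three plausible tenures and one extreme entry
days = pd.Series([1000.0, 1500.0, 2000.0, 340000.0])

print(days.mean())    # pulled far upward by the single extreme value
print(days.median())  # stays close to the typical values
```

A single extreme entry moves the mean by tens of thousands of days, while the median stays between the two middle values.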


In [72]:
twentymedian = no_missing.loc[no_missing['age_cat'] == '20s']['days_employed'].median()

thirtymedian = no_missing.loc[no_missing['age_cat'] == '30s']['days_employed'].median()

fourtymedian = no_missing.loc[no_missing['age_cat'] == '40s']['days_employed'].median()

fiftymedian = no_missing.loc[no_missing['age_cat'] == '50s']['days_employed'].median()

sixtiesmedian = no_missing.loc[no_missing['age_cat'] == '60s']['days_employed'].median()

seventiesmedian = no_missing.loc[no_missing['age_cat'] == 'over 70']['days_employed'].median()

# 'under twenty' needs its own median; otherwise it would fall through to the 'over 70' value
undertwentymedian = no_missing.loc[no_missing['age_cat'] == 'under twenty']['days_employed'].median()

In [73]:
# Let's write a function that returns the median for a given age category
def days_median(age_cat):
    if age_cat == 'under twenty':
        return undertwentymedian
    elif age_cat == '20s':
        return twentymedian
    elif age_cat == '30s':
        return thirtymedian
    elif age_cat == '40s':
        return fourtymedian
    elif age_cat == '50s':
        return fiftymedian
    elif age_cat == '60s':
        return sixtiesmedian
    else:
        return seventiesmedian

In [74]:
# Check that the function works

days_median('50s')

4796.7678969987155

In [75]:
# Replacing missing values
bank_data['days_employed'] = bank_data['days_employed'].fillna(value = bank_data['age_cat'].apply(days_median))


In [76]:
# Check the entries in all columns - make sure we fixed all missing values
bank_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21353 entries, 0 to 21352
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21353 non-null  int64  
 1   days_employed     21353 non-null  float64
 2   dob_years         21353 non-null  int64  
 3   education         21353 non-null  object 
 4   education_id      21353 non-null  int64  
 5   family_status     21353 non-null  object 
 6   family_status_id  21353 non-null  int64  
 7   gender            21353 non-null  object 
 8   income_type       21353 non-null  object 
 9   debt              21353 non-null  int64  
 10  total_income      21353 non-null  float64
 11  purpose           21353 non-null  object 
 12  age_cat           21353 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.1+ MB


## Categorization of data



In [77]:
bank_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21353 entries, 0 to 21352
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21353 non-null  int64  
 1   days_employed     21353 non-null  float64
 2   dob_years         21353 non-null  int64  
 3   education         21353 non-null  object 
 4   education_id      21353 non-null  int64  
 5   family_status     21353 non-null  object 
 6   family_status_id  21353 non-null  int64  
 7   gender            21353 non-null  object 
 8   income_type       21353 non-null  object 
 9   debt              21353 non-null  int64  
 10  total_income      21353 non-null  float64
 11  purpose           21353 non-null  object 
 12  age_cat           21353 non-null  object 
dtypes: float64(2), int64(5), object(6)
memory usage: 2.1+ MB


In [78]:
# Print the values for selected data for categorization
bank_data['family_status']

0                  married
1                  married
2                  married
3                  married
4        civil partnership
               ...        
21348    civil partnership
21349              married
21350    civil partnership
21351              married
21352              married
Name: family_status, Length: 21353, dtype: object

In [79]:
# Check the unique values
bank_data['family_status'].value_counts()


married              12290
civil partnership     4130
unmarried             2794
divorced              1185
widow / widower        954
Name: family_status, dtype: int64

Based on the unique values of family status, I will group that column into clients who are 'Single' and clients who are 'Coupled'.
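One way to sketch this grouping is an explicit lookup with `Series.map`. The labels come from the value counts above; the `status_groups` dictionary and the `demo` series are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Every family status label is listed explicitly with its group
status_groups = {
    'married': 'Coupled',
    'civil partnership': 'Coupled',
    'unmarried': 'Single',
    'divorced': 'Single',
    'widow / widower': 'Single',
}

# Hypothetical sample with one row per label
demo = pd.Series(['married', 'divorced', 'civil partnership', 'widow / widower', 'unmarried'])
status_cat = demo.map(status_groups)
print(status_cat.value_counts())
```

With an explicit mapping, a label that is not in the dictionary becomes `NaN` instead of being counted silently as 'Single', which makes unexpected values easier to spot.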



In [80]:
# Let's write a function to categorize the data based on common topics
def marital_status(status):
    if status in ('married', 'civil partnership'):
        return 'Coupled'
    else:
        return 'Single'


In [81]:
# Create a column with the categories
bank_data['status_cat'] = bank_data['family_status'].apply(marital_status)


In [82]:
bank_data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_cat,status_cat
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,40s,Coupled
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30s,Coupled
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,30s,Coupled
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,30s,Coupled
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,50s,Coupled


In [83]:
bank_data['children']

0        1
1        1
2        0
3        3
4        0
        ..
21348    1
21349    0
21350    1
21351    3
21352    2
Name: children, Length: 21353, dtype: int64

In [84]:
bank_data['children'].value_counts()

0     14022
1      4839
2      2039
3       328
20       75
4        41
5         9
Name: children, dtype: int64

Based on the unique values above, I will categorize the clients into three categories:
- Those with no children ('No Children')
- Those with a few children ('1-3 Children')
- Those with more than a few children ('More than 3 Children')
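The three categories can also be sketched with `pd.cut`. The `children` series below is rebuilt from the value counts printed above, so the group sizes can be checked; the bin edges are one possible choice. Note that the 75 entries recorded as 20 children fall into 'More than 3 Children'.

```python
import pandas as pd

# Rebuild the `children` column from the value counts shown above
children = pd.Series([0, 1, 2, 3, 20, 4, 5]).repeat(
    [14022, 4839, 2039, 328, 75, 41, 9]
).reset_index(drop=True)

# Right-closed bins: (-1, 0], (0, 3], (3, inf)
child_cat = pd.cut(
    children,
    bins=[-1, 0, 3, float('inf')],
    labels=['No Children', '1-3 Children', 'More than 3 Children'],
)
print(child_cat.value_counts())
```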

In [85]:
def child_count(children):
    if children == 0:
        return 'No Children'
    elif 1 <= children <= 3:
        return '1-3 Children'
    else:
        return 'More than 3 Children'
    

In [86]:
bank_data['child_cat'] = bank_data['children'].apply(child_count)

In [87]:
bank_data['child_cat'].value_counts()

No Children             14022
1-3 Children             7206
More than 3 Children      125
Name: child_cat, dtype: int64

In [88]:
bank_data['purpose'].value_counts()

wedding ceremony                            786
having a wedding                            764
to have a wedding                           760
real estate transactions                    672
buy commercial real estate                  658
buying property for renting out             649
transactions with commercial real estate    648
housing transactions                        646
housing                                     640
purchase of the house                       640
purchase of the house for my family         637
construction of own property                633
property                                    629
transactions with my real estate            627
building a real estate                      621
purchase of my own house                    619
building a property                         619
buy real estate                             618
housing renovation                          605
buy residential real estate                 603
buying my own car                       

Based on the unique values for the purpose column, I will categorize my clients as:
- Those who were using the loan for a wedding ceremony ('wedding ceremony')
- Those who were using the loan for a car purchase ('car purchase')
- Those who were using the loan for educational reasons ('education')
- Those who were using the loan for real estate/housing ('real estate')

In [89]:
def loan_purpose_cat(loan):
    if 'wedding' in loan:
        return 'wedding ceremony'
    elif 'car' in loan:
        return 'car purchase'
    elif 'education' in loan:
        return 'education'
    else:
        return 'real estate'



In [90]:
bank_data['loan_purpose_cat'] = bank_data['purpose'].apply(loan_purpose_cat)

In [91]:
bank_data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_cat,status_cat,child_cat,loan_purpose_cat
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,40s,Coupled,1-3 Children,real estate
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30s,Coupled,1-3 Children,car purchase
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,30s,Coupled,No Children,real estate
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,30s,Coupled,1-3 Children,education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,50s,Coupled,No Children,wedding ceremony


In [92]:
# Looking through the sorted values of the column selected for categorization
bank_data['total_income'].sort_values()


14480      3306.762
12919      3392.845
16054      3418.824
1590       3471.216
14174      3503.298
            ...    
17051    273809.483
20643    274402.943
9115     276204.162
19452    352136.354
12327    362496.645
Name: total_income, Length: 21353, dtype: float64

In [93]:
# Getting summary statistics for the column
print(bank_data['total_income'].mean())
print(bank_data['total_income'].median())
print(bank_data['total_income'].min())
print(bank_data['total_income'].max())



26472.470093710486
22586.069
3306.762
362496.645


Based on the sorted values and the summary statistics for the total income column, I have decided to group the clients into Low, Mid-low, Mid, and High income levels. The ranges are in increments of 40000, as I felt that would cover the data well without creating too many groups.

Low income: x < 40000

Mid-low income: 40000 <= x < 80000

Mid income: 80000 <= x < 120000

High income: x >= 120000
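As a sketch, the same 40000-wide ranges can be expressed with `pd.cut`. The first, third, and last incomes below are taken from the outputs above; the values 40000, 80000, and 120000 are hypothetical and are included only to show where exact boundary values land.

```python
import pandas as pd

incomes = pd.Series([3306.762, 40000.0, 40620.102, 80000.0, 120000.0, 362496.645])

# right=False makes each bin include its lower edge: [0, 40000), [40000, 80000), ...
income_level = pd.cut(
    incomes,
    bins=[0, 40000, 80000, 120000, float('inf')],
    labels=['Low income', 'Mid-low income', 'Mid income', 'High income'],
    right=False,
)
print(pd.DataFrame({'total_income': incomes, 'income_level': income_level}))
```

Stating the bin edges once, with a single rule for which side is closed, avoids leaving an exact boundary value without a category.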


In [94]:
# Creating function for categorizing into different numerical groups based on ranges
def income_cat(income):
    # Each boundary value belongs to the higher group, so no income is left uncategorized
    if income < 40000:
        return 'Low income'
    elif income < 80000:
        return 'Mid-low income'
    elif income < 120000:
        return 'Mid income'
    else:
        return 'High income'


In [95]:
# Creating column with categories
bank_data['income_cat'] = bank_data['total_income'].apply(income_cat)

In [96]:
# Count each categories values to see the distribution
bank_data['income_cat'].value_counts()

Low income        18554
Mid-low income     2577
Mid income          173
High income          49
Name: income_cat, dtype: int64

In [97]:
bank_data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_cat,status_cat,child_cat,loan_purpose_cat,income_cat
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,40s,Coupled,1-3 Children,real estate,Mid-low income
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,30s,Coupled,1-3 Children,car purchase,Low income
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,30s,Coupled,No Children,real estate,Low income
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,30s,Coupled,1-3 Children,education,Mid-low income
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,50s,Coupled,No Children,wedding ceremony,Low income


The final new data categorizations are: income level, loan purpose, children, marital status, and age.

## Checking the Hypotheses


**Is there a correlation between having children and paying back on time?**

In [98]:
# Check the children data and paying back on time

print(bank_data.groupby('child_cat')['debt'].value_counts())
# Calculating each children category's share of all defaults (not the default rate within the category)

print(bank_data.loc[(bank_data['child_cat'] == '1-3 Children') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())
print()
print(bank_data.loc[(bank_data['child_cat'] == 'No Children') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())
print()
print(bank_data.loc[(bank_data['child_cat'] == 'More than 3 Children') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())


child_cat             debt
1-3 Children          0        6543
                      1         663
More than 3 Children  0         113
                      1          12
No Children           0       12964
                      1        1058
Name: debt, dtype: int64
0.3825735718407386

0.6105020196191575

0.006924408540103866


In [100]:
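# Note: values='children' with aggfunc='sum' adds up the number of children in each cell,
# so this table shows total children, not client counts
print(bank_data.pivot_table(index='child_cat', columns='debt', values='children', aggfunc='sum', margins=True))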

debt                      0     1    All
child_cat                               
1-3 Children           8990   911   9901
More than 3 Children   1533   176   1709
No Children               0     0      0
All                   10523  1087  11610


**Conclusion**
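Based on my manipulations of the child_cat column, clients with no children account for most of the defaults, about 61%, followed by clients with 1 to 3 children at about 38% and clients with more than 3 children at about 0.7%. These shares largely mirror group sizes, since clients with no children are by far the largest group. Dividing the defaults by the group sizes shown above gives default rates of about 7.5% for clients with no children (1058 of 14022), 9.2% for clients with 1 to 3 children (663 of 7206), and 9.6% for clients with more than 3 children (12 of 125). So clients with children default slightly more often than clients without, although the last group is too small to say much about.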
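To separate a group's share of all defaults from the default rate inside that group, here is a small sketch built from the client counts in the groupby output above. The column names `repaid`, `defaulted`, `share_of_defaults`, and `default_rate` are illustrative.

```python
import pandas as pd

# Client counts copied from the groupby output above
counts = pd.DataFrame(
    {'repaid': [6543, 113, 12964], 'defaulted': [663, 12, 1058]},
    index=['1-3 Children', 'More than 3 Children', 'No Children'],
)

total = counts['repaid'] + counts['defaulted']
summary = pd.DataFrame({
    # Share of all defaults that comes from each group
    'share_of_defaults': counts['defaulted'] / counts['defaulted'].sum(),
    # Default rate within each group
    'default_rate': counts['defaulted'] / total,
})
print(summary.round(3))
```

On the full frame, `bank_data.groupby('child_cat')['debt'].mean()` should give the same within-group rate, because `debt` is coded as 0 or 1.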



**Is there a correlation between family status and paying back on time?**

In [110]:
# Check the family status data and paying back on time
print(bank_data.groupby('status_cat')['debt'].value_counts())


# Calculating each family status group's share of all defaults (not the default rate within the group)
print(bank_data.loc[(bank_data['status_cat'] == 'Coupled') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())
print()
print(bank_data.loc[(bank_data['status_cat'] == 'Single') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())




status_cat  debt
Coupled     0       15107
            1        1313
Single      0        4513
            1         420
Name: debt, dtype: int64
0.7576457010963646

0.2423542989036353


In [117]:
# Note: this sums the `children` column, so the cells are total children, not client counts
print(bank_data.pivot_table(index='status_cat', columns='debt', values='children', aggfunc='sum', margins=True))

debt            0     1    All
status_cat                    
Coupled      9105   920  10025
Single       1418   167   1585
All         10523  1087  11610


**Conclusion**
Based on my manipulations, clients who are in a couple make up about 76% of loan defaults, but there are also significantly more clients in couples than single clients. Within each group the default rates are close: about 8.0% for coupled clients (1313 of 16420) and about 8.5% for single clients (420 of 4933), so single clients default slightly more often.
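Because the two groups differ so much in size, a row-normalized table is a useful check. The sketch below rebuilds a two-column frame from the counts printed above (the `demo` frame is illustrative) and uses `pd.crosstab` with `normalize='index'`.

```python
import pandas as pd

# Rebuild a two-column frame from the counts printed above
rows = [('Coupled', 0, 15107), ('Coupled', 1, 1313), ('Single', 0, 4513), ('Single', 1, 420)]
demo = pd.DataFrame(
    [(status, debt) for status, debt, n in rows for _ in range(n)],
    columns=['status_cat', 'debt'],
)

# normalize='index' turns each row into proportions that sum to 1,
# so the `1` column is the default rate within each group
rates = pd.crosstab(demo['status_cat'], demo['debt'], normalize='index')
print(rates.round(3))
```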


**Is there a correlation between income level and paying back on time?**

In [111]:
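# The column was created as `income_cat` above; rename it to the name used from here on
bank_data = bank_data.rename(columns={'income_cat': 'income_categories'})

# Check the income level data and paying back on time
print(bank_data.groupby('income_categories')['debt'].value_counts())
print()

# Calculating each income level's share of all defaults (not the default rate within the level)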
print(bank_data.loc[(bank_data['income_categories'] == 'Low income') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())
print()
print(bank_data.loc[(bank_data['income_categories'] == 'Mid-low income') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())
print()
print(bank_data.loc[(bank_data['income_categories'] == 'Mid income') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())
print()
print(bank_data.loc[(bank_data['income_categories'] == 'High income') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())



income_categories  debt
High income        0          45
                   1           4
Low income         0       17014
                   1        1540
Mid income         0         163
                   1          10
Mid-low income     0        2398
                   1         179
Name: debt, dtype: int64

0.8886324293133295

0.10328909405654933

0.005770340450086555

0.002308136180034622


In [116]:
# Note: this sums the `children` column, so the cells are total children, not client counts
print(bank_data.pivot_table(index='income_categories', columns='debt', values='children', aggfunc='sum', margins=True))

debt                   0     1    All
income_categories                    
High income           35     1     36
Low income          8998   971   9969
Mid income            91     6     97
Mid-low income      1399   109   1508
All                10523  1087  11610


**Conclusion**

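Clients with low incomes (less than $40,000) make up almost 89% of those who defaulted, but they also make up the large majority of all clients (18554 of 21353). Within each group, the default rate is about 8.3% for low income, 6.9% for mid-low income, 5.8% for mid income, and 8.2% for high income. Default rates therefore dip for the middle income groups, though the mid and high income groups are small (173 and 49 clients), so their rates are uncertain.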



**How does credit purpose affect the default rate?**

In [112]:
# Check each credit purpose's share of all defaults (not the default rate within the purpose)
print(bank_data.groupby('loan_purpose_cat')['debt'].value_counts())
print()
print(bank_data.loc[(bank_data['loan_purpose_cat'] == 'car purchase') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())
print()
print(bank_data.loc[(bank_data['loan_purpose_cat'] == 'education') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())
print()
print(bank_data.loc[(bank_data['loan_purpose_cat'] == 'real estate') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())
print()
print(bank_data.loc[(bank_data['loan_purpose_cat'] == 'wedding ceremony') & (bank_data['debt'] == 1)]['debt'].count() / bank_data[bank_data['debt'] == 1]['debt'].count())


loan_purpose_cat  debt
car purchase      0        3884
                  1         400
education         0        2807
                  1         288
real estate       0       10803
                  1         861
wedding ceremony  0        2126
                  1         184
Name: debt, dtype: int64

0.2308136180034622

0.1661858049624928

0.49682631275245237

0.10617426428159261


In [115]:
# Note: this sums the `children` column, so the cells are total children, not client counts
print(bank_data.pivot_table(index='loan_purpose_cat', columns='debt', values='children', aggfunc='sum', margins=True))

debt                  0     1    All
loan_purpose_cat                    
car purchase       2036   257   2293
education          1586   166   1752
real estate        5797   514   6311
wedding ceremony   1104   150   1254
All               10523  1087  11610


**Conclusion**

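Based on my findings, real estate related loans account for about half (50%) of the defaults, but they also account for more than half of all loans (11664 of 21353). Within each purpose, the default rate is about 9.3% for car purchases, 9.3% for education, 8.0% for wedding ceremonies, and 7.4% for real estate, so real estate loans have the lowest default rate and car and education loans the highest.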
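For the loan purposes, client counts, defaults, and default rates can be put side by side in one table. This is a sketch that rebuilds the rows from the counts printed above; the `demo` frame and the column names `clients`, `defaults`, and `default_rate` are illustrative.

```python
import pandas as pd

# (purpose, clients who repaid, clients who defaulted), copied from the output above
rows = [('car purchase', 3884, 400), ('education', 2807, 288),
        ('real estate', 10803, 861), ('wedding ceremony', 2126, 184)]
demo = pd.DataFrame(
    [(purpose, debt)
     for purpose, repaid, defaulted in rows
     for debt, n in ((0, repaid), (1, defaulted))
     for _ in range(n)],
    columns=['loan_purpose_cat', 'debt'],
)

# count = number of clients, sum = number of defaults, mean = default rate
summary = demo.groupby('loan_purpose_cat')['debt'].agg(['count', 'sum', 'mean'])
summary.columns = ['clients', 'defaults', 'default_rate']
print(summary.round(3))
```

Aggregating the 0/1 `debt` column counts clients directly, unlike a pivot that sums `children`.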


# General Conclusion 

The main goal of this report was to determine whether a client's marital status, whether they had children, and other factors correlated with the rate of clients who default on their loans. 
To reach this goal the data had to be analyzed, cleaned up, transformed, and then categorized. There seemed to be examples of both human and technical errors that resulted in problematic data. Some entries had to be transformed (the education and days employed columns) to ensure that removing duplicates would go smoothly. Other categories were combined into one for better clarity of the data.
The original dataset included 2 columns that had missing values (days employed and total income). Various possible reasons for these missing values were investigated, and my conclusion was that there was no pattern to be found. By using the median value of the parameters that I deemed to have the greatest impact on these values, I was able to fill in the missing values. 
Once the data was cleaned up, it was time to categorize it to be able to draw conclusions. Based on the value counts of certain columns, it seemed best to create categorizations based on clients' age, income level, marital status, number of children, and loan purpose. These categorizations then aided me in answering the posed questions.  



As for determining whether or not certain factors predicted whether a client would default on their loan, I made some fascinating observations. Clients with no children, low incomes, or real estate loans made up the majority of the clients who defaulted on their loans, but these are also the largest groups in the sample, so their share of defaults mostly reflects group size.
Comparing default rates within each group gives a different picture: clients with children defaulted somewhat more often (about 9.2% to 9.6%) than clients with no children (about 7.5%), and clients in a couple defaulted slightly less often (about 8.0%) than single clients (about 8.5%). By income, the low and high income groups had the highest default rates (about 8.3% and 8.2%), with the mid-low and mid groups lower (about 6.9% and 5.8%), though the mid and high income groups are small.
Finally, by loan purpose, car and education loans had the highest default rates (about 9.3% each) and real estate loans the lowest (about 7.4%).



