# Analyzing Borrower Default Risk

Your task is to prepare a report for the credit division of a bank. You will determine how a customer's marital status and number of children affect the probability of defaulting on a loan. The bank already has some data on customers' creditworthiness.

Your report will be considered when making a **credit assessment** for potential customers. **Credit scoring** is used to evaluate a potential borrower's ability to repay their loan.

## Open the data *file* and read the general information.

In [1]:
# Load all libraries
import pandas as pd


# Load the data
data = pd.read_csv('/datasets/credit_scoring_eng.csv')

## Question 1. Data exploration

**Data Description**
- `children` - number of children in the family
- `days_employed` - customer work experience in days
- `dob_years` - customer age in years
- `education` - customer education level
- `education_id` - identifier for the customer's education level
- `family_status` - customer's marital status
- `family_status_id` - marital status identifier
- `gender` - customer gender
- `income_type` - job type
- `debt` - whether the customer has ever defaulted on a loan
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan



In [2]:
# Let's see how many rows and columns our dataset has
data.shape

(21525, 12)

In [3]:
# Let's display the first N rows
data.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family



**The rows displayed above reveal two problems that need further investigation.**
1. The `days_employed` column contains negative values, which cannot be right because the number of days worked cannot be negative.
2. The `education` column records the same categories with inconsistent capitalization.
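Both problems can be quantified before deciding how to fix them. The following is a minimal sketch on a few made-up rows (the `sample` frame and its values are illustrative assumptions, not the real dataset):

```python
import pandas as pd

# Toy rows standing in for the real dataset (illustrative values only)
sample = pd.DataFrame({
    'days_employed': [-8437.67, -4024.80, 340266.07, -152.78],
    'education': ["bachelor's degree", 'Secondary Education',
                  'secondary education', 'SECONDARY EDUCATION'],
})

# Share of rows with a negative number of working days
negative_share = (sample['days_employed'] < 0).mean()

# Number of distinct spellings before and after normalizing the case
raw_variants = sample['education'].nunique()
normalized_variants = sample['education'].str.lower().nunique()

print(f"negative days_employed: {negative_share:.0%}")
print(f"education spellings: {raw_variants} raw, {normalized_variants} normalized")
```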

In [4]:
# Get information about the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


**The results show missing values in two columns: `days_employed` and `total_income`.**

In [5]:
# Let's look at the table filtered to rows with missing values in the first column that contains missing data
data[data['days_employed'].isna()]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


**The missing values in the two columns appear to be symmetrical: wherever `days_employed` is missing, `total_income` is missing as well. This suggests the two values were lost together in the same rows.**

In [6]:
# Let's apply several conditions to filter the data and look at the number of rows in the filtered table.
data[(data['days_employed'].isna()) & (data['total_income'].isna())].shape

(2174, 12)

**Tentative conclusions**

**The filtered table has 2,174 rows, which equals the number of missing values in each column (21,525 - 19,351). This confirms that `days_employed` and `total_income` are always missing together, in the same rows. That alone does not explain why the values are missing, so other characteristics, such as the type of work in the `income_type` column, need to be investigated.**

**About 10% of the rows in the *dataset* have missing values. That is too many to ignore, so the gaps should be filled in, taking into account customer characteristics such as type of work.**

**Next, check whether any of these characteristics is associated with the missing values.**
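The check that two columns are missing in exactly the same rows, and the share of rows affected, can be sketched as follows. The `sample` frame is a made-up stand-in for the real data:

```python
import numpy as np
import pandas as pd

# Toy data: both columns are missing in rows 1 and 3 (illustrative values)
sample = pd.DataFrame({
    'days_employed': [-8437.67, np.nan, -926.19, np.nan, 340266.07],
    'total_income': [40620.10, np.nan, 40922.17, np.nan, 25378.57],
})

missing_days = sample['days_employed'].isna()
missing_income = sample['total_income'].isna()

# True only if the two columns are missing in exactly the same rows
same_rows = missing_days.equals(missing_income)

# Share of rows where both values are missing
missing_share = (missing_days & missing_income).mean()

print(same_rows, f"{missing_share:.0%}")
```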

In [7]:
# Let's examine the customers with missing values in the identified columns
df_null = data[(data['days_employed'].isnull()) & (data['total_income'].isnull())]
df_null

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
12,0,,65,secondary education,1,civil partnership,1,M,retiree,0,,to have a wedding
26,0,,41,secondary education,1,married,0,M,civil servant,0,,education
29,0,,63,secondary education,1,unmarried,4,F,retiree,0,,building a real estate
41,0,,50,secondary education,1,married,0,F,civil servant,0,,second-hand car purchase
55,0,,54,secondary education,1,civil partnership,1,F,retiree,1,,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21489,2,,47,Secondary Education,1,married,0,M,business,0,,purchase of a car
21495,1,,50,secondary education,1,civil partnership,1,F,employee,0,,wedding ceremony
21497,0,,48,BACHELOR'S DEGREE,0,married,0,F,business,0,,building a property
21502,1,,42,secondary education,1,married,0,F,employee,0,,building a real estate


In [8]:
# Check the distribution
a = df_null['income_type'].value_counts(normalize=True).reset_index().rename(columns={"income_type":"percentage"})
a['percentage'] = a['percentage'].apply("{:,.2%}".format)
b = df_null['income_type'].value_counts().reset_index().rename(columns={"income_type":"count"})
c = pd.concat([a , b[['count']]], axis=1)
c

Unnamed: 0,index,percentage,count
0,employee,50.83%,1105
1,business,23.37%,508
2,retiree,19.00%,413
3,civil servant,6.76%,147
4,entrepreneur,0.05%,1


In [9]:
df_null['family_status'].value_counts(normalize=True)

married              0.568997
civil partnership    0.203312
unmarried            0.132475
divorced             0.051518
widow / widower      0.043698
Name: family_status, dtype: float64

**If a missing `total_income` simply meant that the customer has no income, the affected customers should be mostly `retiree` or `unemployed`. However, `unemployed` does not appear in the results at all, and about half of the affected rows (50.83%) belong to `employee`.
One possible explanation is an input error.**


**Possible causes of missing values in the data**

**So far, the missing values show no particular pattern. To check this, the distribution of `income_type` among the rows with missing `days_employed` is compared with the distribution in the whole dataset.**

In [10]:
# Check the distribution across the whole *dataset*
data['income_type'].value_counts(normalize=True)


employee                       0.516562
business                       0.236237
retiree                        0.179141
civil servant                  0.067782
unemployed                     0.000093
entrepreneur                   0.000093
student                        0.000046
paternity / maternity leave    0.000046
Name: income_type, dtype: float64

**Tentative conclusions**

**Comparing the two distributions, the share of each `income_type` among the rows with missing values is close to its share in the whole dataset (for example, about 51% versus about 52% for `employee`). The missing values therefore appear to occur at random with respect to type of work, and our initial hypothesis that type of work explains the missing income values is not supported.**
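Placing the two distributions side by side makes the comparison easier to read. This is a minimal sketch on made-up data; the names `full`, `missing_mask`, and `comparison` are assumptions introduced for illustration:

```python
import pandas as pd

# Toy income types and a mask marking rows with missing values (illustrative)
full = pd.Series(['employee', 'employee', 'business', 'retiree',
                  'employee', 'business', 'retiree', 'employee'],
                 name='income_type')
missing_mask = pd.Series([True, False, True, False, False, False, True, False])

# Proportions in the whole data and in the rows with missing values
comparison = pd.DataFrame({
    'all_rows': full.value_counts(normalize=True),
    'missing_rows': full[missing_mask].value_counts(normalize=True),
})

# A difference close to zero means the category is neither over- nor under-represented
comparison['difference'] = comparison['missing_rows'] - comparison['all_rows']
print(comparison)
```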

In [11]:
# Check for other causes and patterns that could lead to missing values
data['total_income'].describe()


count     19351.000000
mean      26787.568355
std       16475.450632
min        3306.762000
25%       16488.504500
50%       23202.870000
75%       32549.611000
max      362496.645000
Name: total_income, dtype: float64

**Tentative conclusions**

Missing values can have several causes. Because both columns are always missing together, a likely cause is a technical error in the system rather than coincidence.

In [12]:
# Check for other patterns and describe them
data_nan_pivot = df_null.pivot_table(index='dob_years', columns='income_type', values='debt', aggfunc='count', margins=True)
data_nan_pivot

income_type,business,civil servant,employee,entrepreneur,retiree,All
dob_years,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,2.0,,5.0,,3.0,10
19,,,1.0,,,1
20,1.0,,4.0,,,5
21,7.0,1.0,10.0,,,18
22,6.0,,11.0,,,17
23,5.0,1.0,30.0,,,36
24,9.0,1.0,10.0,,1.0,21
25,4.0,4.0,15.0,,,23
26,9.0,2.0,24.0,,,35
27,6.0,3.0,27.0,,,36


**Conclusion**

We did not find a particular pattern between `dob_years` and `income_type` in the rows with missing values.

Before dealing with the missing values, we first look at each problem in the data and then decide which method to use. Two main options are available: `drop()` to delete rows that are deemed unusable, or `df.loc[]` to replace the values.

**For the problems found above, the following tools are suitable:**
- `str.lower()` method = to unify inconsistent spellings by converting the text to lowercase.
- `df.loc[]` indexer = to replace the values in a column with the values we want.
- `drop()` method = to remove unwanted rows or columns.
- `duplicated()` method = to check whether any duplicated rows remain after the other problems have been resolved.
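The four tools listed above can be tried on a tiny example first. This is only a sketch; the `toy` frame and its values are assumptions made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'education': ['Secondary Education', 'secondary education', 'SECONDARY EDUCATION'],
    'children': [-1, 0, 1],
    'dob_years': [35, 0, 35],
})

# str.lower(): unify the spelling of categorical text
toy['education'] = toy['education'].str.lower()

# .loc[]: overwrite the values selected by a condition
toy.loc[toy['children'] == -1, 'children'] = 1

# drop(): remove rows by index label (here, rows with an impossible age)
toy = toy.drop(toy[toy['dob_years'] == 0].index)

# duplicated(): flag rows that repeat an earlier row
n_duplicates = toy.duplicated().sum()
print(n_duplicates)
```

After the spelling and value fixes, the two remaining rows become identical, which is why duplicates are best checked last.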

## Data Transformation

In [13]:
# Let's look at all values in the education column to see which spellings need fixing
data['education'].unique()

array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

In [14]:
# Fix the spelling if needed
data['education'] = data['education'].str.lower()

In [15]:
# Check all values in the column to make sure we fixed them correctly
data['education'].unique()

array(["bachelor's degree", 'secondary education', 'some college',
       'primary education', 'graduate degree'], dtype=object)

In [16]:
# Let's look at the distribution of values in the `children` column
data['children'].unique()

array([ 1,  0,  3,  2, -1,  4, 20,  5])

In [17]:
data['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

**The values `-1` and `20` are most likely typing errors, so I decided to change `-1` to `1` and `20` to `2` rather than remove the affected rows from the dataset.**

In [18]:
# [fix the data based on your decision]
data.loc[data['children'] == -1, 'children'] = 1
data.loc[data['children'] == 20, 'children'] = 2

In [19]:
# Check the `children` column again to make sure everything has been fixed
data['children'].value_counts()

0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64

In [20]:
# Find problematic data in the `days_employed` column, if any, and calculate its percentage
data['days_employed'].describe()

count     19351.000000
mean      63046.497661
std      140827.311974
min      -18388.949901
25%       -2747.423625
50%       -1203.369529
75%        -291.095954
max      401755.400475
Name: days_employed, dtype: float64

**A large share of the non-missing values is negative (even the 75th percentile is below zero), so deleting them would discard too much data. A better solution is the `abs()` method, which returns the absolute value of a number: the negative values become positive and the rows are kept.**

In [21]:
data[data['days_employed'] < 0].shape

(15906, 12)

In [22]:
# Fix the problematic values, if any
data['days_employed'] = data['days_employed'].abs()

In [23]:
# Check the result and make sure the problem has been fixed
data['days_employed'].describe()

count     19351.000000
mean      66914.728907
std      139030.880527
min          24.141633
25%         927.009265
50%        2194.220567
75%        5537.882441
max      401755.400475
Name: days_employed, dtype: float64

In [96]:
# Set `days_employed` to 0 for retirees and unemployed customers
data.loc[(data['income_type'] == 'retiree') | (data['income_type'] == 'unemployed'), 'days_employed'] = 0

In [97]:
data['days_employed'].describe()

count    21352.000000
mean      1933.681248
std       2181.759841
min          0.000000
25%        317.424164
50%       1354.849526
75%       2577.503256
max      18388.949901
Name: days_employed, dtype: float64

In [98]:
# Convert work experience from days to years (displayed only, not stored)
data['days_employed'] / 365

0        23.116912
1        11.026860
2        15.406637
3        11.300677
4         0.000000
           ...    
21520    12.409087
21521     0.000000
21522     5.789991
21523     8.527347
21524     5.437007
Name: days_employed, Length: 21352, dtype: float64

[Now let's look at customer age and check whether there are any problems there. Again, think about what anomalies we might encounter in this column, for example implausible ages.]

In [27]:
# Check `dob_years` for suspicious values and calculate their percentage
data['dob_years'].describe()

count    21525.000000
mean        43.293380
std         12.574584
min          0.000000
25%         33.000000
50%         42.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

In [28]:
#data['dob_years'].value_counts()
data.loc[data['dob_years'] == 0].shape

(101, 12)

**The minimum age of 0 is clearly wrong. Since only 101 rows (about 0.5% of the dataset) have `dob_years = 0`, the simplest fix is to drop those rows.**

In [29]:
# Fix the problems in the `dob_years` column, if any
data = data.drop(data[data['dob_years'] == 0].index)

In [30]:
# Check the result and make sure the problem has been fixed
data['dob_years'].describe()

count    21424.000000
mean        43.497479
std         12.246934
min         19.000000
25%         33.000000
50%         43.000000
75%         53.000000
max         75.000000
Name: dob_years, dtype: float64

In [31]:
# Let's look at the values in this column
data['family_status'].value_counts()

married              12331
civil partnership     4156
unmarried             2797
divorced              1185
widow / widower        955
Name: family_status, dtype: int64

In [32]:
# Fix the problematic values in `family_status`, if any


In [33]:
# Check the result and make sure the values have been fixed
data['family_status'].value_counts()

married              12331
civil partnership     4156
unmarried             2797
divorced              1185
widow / widower        955
Name: family_status, dtype: int64

In [34]:
# Let's look at the values in this column
data['gender'].value_counts()

F      14164
M       7259
XNA        1
Name: gender, dtype: int64

In [35]:
# Fix the problematic values, if any
data = data.drop(data[data['gender'] == 'XNA'].index)

In [36]:
# Check the result and make sure the problem has been fixed
data['gender'].value_counts()


F    14164
M     7259
Name: gender, dtype: int64

In [37]:
# Let's look at the values in this column
data['income_type'].value_counts()

employee                       11064
business                        5064
retiree                         3836
civil servant                   1453
unemployed                         2
entrepreneur                       2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

In [38]:
# Fix the problematic values, if any

In [39]:
# Check the result and make sure the problem has been fixed


In [40]:
# Check for duplicates
data.duplicated().sum()
data[data.duplicated()].sort_values(by=list(data.columns))

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
19321,0,,23,secondary education,1,unmarried,4,F,employee,0,,second-hand car purchase
18328,0,,29,bachelor's degree,0,married,0,M,employee,0,,buy residential real estate
6312,0,,30,secondary education,1,married,0,M,employee,0,,building a real estate
13773,0,,35,secondary education,1,civil partnership,1,F,employee,0,,to have a wedding
19387,0,,38,bachelor's degree,0,civil partnership,1,F,business,0,,having a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
17755,1,,43,secondary education,1,married,0,M,employee,0,,to become educated
13025,1,,44,secondary education,1,married,0,F,employee,0,,second-hand car purchase
9238,2,,34,secondary education,1,married,0,F,employee,0,,buying property for renting out
14432,2,,36,bachelor's degree,0,married,0,F,civil servant,0,,getting an education


In [41]:
# Handle the duplicates, if any
data = data.drop_duplicates()

In [42]:
# Do a final check for remaining duplicates
data.duplicated().sum()

0

In [43]:
# Check the size of the dataset after the first round of manipulations
data.shape

(21352, 12)


**During the analysis, several rows were removed and some values were replaced:**
1. Removed the rows where `dob_years` is 0, the row where `gender` is `XNA`, and the duplicate rows.
2. Replaced values in the `education`, `children`, and `days_employed` columns.

**As a result, the dataset shrank from 21,525 entries to 21,352 entries, a reduction of about 0.8%.**
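The size of the change follows directly from the two `data.shape` results shown earlier. A short worked calculation:

```python
rows_before = 21525   # from the first data.shape
rows_after = 21352    # from data.shape after cleaning

rows_removed = rows_before - rows_after
share_removed = rows_removed / rows_before

print(rows_removed)
print(f"{share_removed:.1%}")
```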

## Working with missing values

**A dictionary maps one value to another; in other words, a *dictionary* stores several key-value pairs in one variable. Here we build dictionaries for `education` and `family_status`, pairing each category name with its identifier.**
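To show how such a dictionary is used once it exists, here is a minimal sketch that translates identifiers back into labels with `Series.map`. The `toy` frame, `id_to_label`, and `ids` are assumptions introduced for illustration:

```python
import pandas as pd

# Toy category table with one row per unique (label, id) pair
toy = pd.DataFrame({
    'education': ["bachelor's degree", 'secondary education', 'some college'],
    'education_id': [0, 1, 2],
})

# Build an id -> label lookup from the unique pairs
id_to_label = dict(zip(toy['education_id'], toy['education']))

# Use the lookup to translate a column of ids into labels
ids = pd.Series([1, 0, 1, 2])
labels = ids.map(id_to_label)
print(labels.tolist())
```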

In [44]:
# Build the dictionary
edu_dictionary = data[['education', 'education_id']]
edu_dictionary = edu_dictionary.drop_duplicates()
edu_dictionary

Unnamed: 0,education,education_id
0,bachelor's degree,0
1,secondary education,1
13,some college,2
31,primary education,3
2963,graduate degree,4


In [45]:
# Convert the unique pairs into a Python dict
cek = edu_dictionary.drop_duplicates(subset = 'education_id')
test = dict(zip(cek['education'], cek['education_id']))
test

{"bachelor's degree": 0,
 'secondary education': 1,
 'some college': 2,
 'primary education': 3,
 'graduate degree': 4}

In [46]:
fam_dictionary = data[['family_status', 'family_status_id']]
fam_dictionary = fam_dictionary.drop_duplicates()
fam_dictionary

Unnamed: 0,family_status,family_status_id
0,married,0
4,civil partnership,1
18,widow / widower,2
19,divorced,3
24,unmarried,4


In [47]:
cek = fam_dictionary.drop_duplicates(subset = 'family_status_id')
test = dict(zip(cek['family_status'], cek['family_status_id']))
test

{'married': 0,
 'civil partnership': 1,
 'widow / widower': 2,
 'divorced': 3,
 'unmarried': 4}

### Fixing missing values in `total_income`


**Only two columns have missing values that need to be fixed: `total_income` and `days_employed`. The *missing values* are filled in with the `fillna()` method. Instead of using a single value for the whole column, we look at the mean or median within groups defined by customer characteristics such as `education` and `age_group`.**
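Group-wise filling can also be written with `groupby(...).transform(...)`, which returns one value per row that can be passed straight to `fillna()`. This is a sketch of that alternative on made-up data, not the approach used in the cells of this notebook:

```python
import numpy as np
import pandas as pd

# Toy data with one missing income in each age group (illustrative values)
toy = pd.DataFrame({
    'age_group': ['31-40', '31-40', '31-40', '41-50', '41-50'],
    'total_income': [20000.0, 30000.0, np.nan, 40000.0, np.nan],
})

# Median income of each row's own age group, aligned with the original rows
group_median = toy.groupby('age_group')['total_income'].transform('median')

# Fill only the missing entries; existing values are left untouched
toy['total_income'] = toy['total_income'].fillna(group_median)
print(toy)
```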



In [48]:
# Preview filling the missing income values with the overall mean (the result is not assigned)
data['total_income'].fillna(data['total_income'].mean())
    

0        40620.102
1        17932.802
2        23341.752
3        42820.568
4        25378.572
           ...    
21520    35966.698
21521    24959.969
21522    14347.610
21523    39054.888
21524    13127.587
Name: total_income, Length: 21352, dtype: float64

In [49]:
# Check the overall mean income
data['total_income'].mean()

26794.133120774703

In [50]:
# Create a new column based on the function

def age_group(age):
    """Map an age in years to an age bracket label (the '<30' bracket includes age 30)."""
    if age <= 30:
        return '<30'
    if age <= 40:
        return '31-40'
    if age <= 50:
        return '41-50'
    if age <= 60:
        return '51-60'
    if age <= 70:
        return '61-70'
    return '>70'

In [51]:
age_group(27)

'<30'

In [52]:
data['age_group'] = data['dob_years'].apply(age_group)
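The same brackets can be produced without a row-by-row function by using `pd.cut`. This is a sketch of an alternative, assuming the bracket edges and labels used by `age_group` above; the `ages` series is made up:

```python
import numpy as np
import pandas as pd

ages = pd.Series([19, 30, 31, 40, 53, 70, 75])

# Right-closed intervals: (-inf, 30], (30, 40], ..., (70, inf)
bins = [-np.inf, 30, 40, 50, 60, 70, np.inf]
labels = ['<30', '31-40', '41-50', '51-60', '61-70', '>70']

# pd.cut uses right=True by default, so age 30 falls into '<30'
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.astype(str).tolist())
```

The result is a categorical column, which keeps the brackets in their natural order when used as a pivot table index.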

In [53]:
# Check the values in the new column
data[['age_group']]

Unnamed: 0,age_group
0,41-50
1,31-40
2,31-40
3,31-40
4,51-60
...,...
21520,41-50
21521,61-70
21522,31-40
21523,31-40


In [54]:
# Create a table without missing values and display its size to make sure everything works
df_without_nan = data[data['days_employed'].notnull()]
df_without_nan.shape

(19259, 13)

In [55]:
# Look at the mean income by the factors you identified
test = df_without_nan.pivot_table(index='age_group', values='total_income', aggfunc='mean')
test

Unnamed: 0_level_0,total_income
age_group,Unnamed: 1_level_1
31-40,28376.735148
41-50,28390.207085
51-60,25482.856294
61-70,23245.390243
<30,25815.651899
>70,19575.454327


In [56]:
# Look at the median income by the factors you identified
test2 = data.pivot_table(index='age_group', values='total_income', aggfunc='median')
test2

Unnamed: 0_level_0,total_income
age_group,Unnamed: 1_level_1
31-40,24825.1865
41-50,24569.968
51-60,22056.771
61-70,19705.855
<30,22955.474
>70,18611.5935


In [57]:
test3 = data.pivot_table(index='age_group', columns='education', values='total_income', aggfunc='median')
test3

education,bachelor's degree,graduate degree,primary education,secondary education,some college
age_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
31-40,28964.104,18187.3015,19674.2825,23086.242,28829.711
41-50,30451.01,31771.321,21807.668,22832.514,29497.709
51-60,27665.237,42945.794,18022.0315,21028.01,22718.9595
61-70,25194.173,28334.215,16240.844,18794.68,28058.676
<30,26176.547,,23388.807,21334.837,22753.6105
>70,26223.0685,,15013.505,18146.7015,19946.795


**According to these observations, `education` appears to influence `total_income` more than `age_group` does: within each age group, income differs considerably by education level. In most groups, a higher education level goes together with a higher median income, although the pattern is not perfectly consistent. The median is considered more appropriate than the mean here because the income data contains outliers, with some *income* values far above the rest.**
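The effect of an outlier on the mean and on the median can be seen in a small example. The income values below are made up for illustration:

```python
from statistics import mean, median

incomes = [18000, 21000, 23000, 25000, 27000]
incomes_with_outlier = incomes + [362000]

# Without the outlier, the mean and median are close to each other
print(mean(incomes), median(incomes))

# One extreme value pulls the mean far up, while the median barely moves
print(mean(incomes_with_outlier), median(incomes_with_outlier))
```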

In [58]:
# Write the function we will use to fill in the missing values

def get_median_income(age_group):
    """Return the median income for an age group, or "error" if the group is unknown."""
    try:
        return test2['total_income'][age_group]
    except KeyError:
        return "error"
        

In [59]:
# Check that the function works
get_median_income('>70')

18611.593500000003

In [60]:
# Apply the function to every row
data['median_income'] = data['age_group'].apply(get_median_income)
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,median_income
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,41-50,24569.968
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,31-40,24825.1865
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,31-40,24825.1865
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,31-40,24825.1865
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,51-60,22056.771


In [61]:
# Check whether we got any errors
data[data['median_income'] == 'error']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,median_income


In [62]:
# Replace the missing values if there are errors


In [63]:
# Fill in the missing values
data['total_income'] = data['total_income'].fillna(data['median_income'])

[Once you are done with `total_income`, check that the total number of values in this column matches the number of values in the other columns.]

In [64]:
# Check the number of entries in each column
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21352 entries, 0 to 21524
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21352 non-null  int64  
 1   days_employed     19259 non-null  float64
 2   dob_years         21352 non-null  int64  
 3   education         21352 non-null  object 
 4   education_id      21352 non-null  int64  
 5   family_status     21352 non-null  object 
 6   family_status_id  21352 non-null  int64  
 7   gender            21352 non-null  object 
 8   income_type       21352 non-null  object 
 9   debt              21352 non-null  int64  
 10  total_income      21352 non-null  float64
 11  purpose           21352 non-null  object 
 12  age_group         21352 non-null  object 
 13  median_income     21352 non-null  float64
dtypes: float64(3), int64(5), object(6)
memory usage: 2.4+ MB


In [65]:
data.isna().sum()

children               0
days_employed       2093
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income           0
purpose                0
age_group              0
median_income          0
dtype: int64

### Fixing missing values in `days_employed`

In [66]:
# Median of `days_employed` by the parameter you identified
test1 = data.pivot_table(index='income_type', values='days_employed', aggfunc='median')
test1

Unnamed: 0_level_0,days_employed
income_type,Unnamed: 1_level_1
business,1548.009883
civil servant,2673.404956
employee,1576.067689
entrepreneur,520.848083
paternity / maternity leave,3296.759962
retiree,365176.336775
student,578.751554
unemployed,366413.652744


In [67]:
# Mean of `days_employed` by the parameter you identified
test2 = data.pivot_table(index='income_type', values='days_employed', aggfunc='mean')
test2

Unnamed: 0_level_0,days_employed
income_type,Unnamed: 1_level_1
business,2112.744402
civil servant,3388.508552
employee,2328.603723
entrepreneur,520.848083
paternity / maternity leave,3296.759962
retiree,365015.727554
student,578.751554
unemployed,366413.652744


In [68]:
# Let's write a function that looks up the mean or median (depending on your decision) for the parameter you identified
# Note: `test2` now holds the per-income-type mean, so despite its name this function returns the mean

def get_median_days_employed(income_type):
    
    try:
        return test2['days_employed'][income_type]
    except KeyError:
        return "error"

In [69]:
# Check that your function works
get_median_days_employed('student')


578.7515535382181

In [70]:
# Apply the function to income_type

data['median_days_employed'] = data['income_type'].apply(get_median_days_employed)
data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,median_income,median_days_employed
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,41-50,24569.968,2328.603723
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,31-40,24825.1865,2328.603723
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,31-40,24825.1865,2328.603723
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,31-40,24825.1865,2328.603723
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,51-60,22056.771,365015.727554


In [71]:
data[data['median_days_employed'] == 'error']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,median_income,median_days_employed


In [72]:
# Fill the missing `days_employed` values using the lookup column
data['days_employed'] = data['days_employed'].fillna(data['median_days_employed'])

In [74]:
# Check the entries in all columns and make sure we fixed all missing values
data.isna().sum()

children                0
days_employed           0
dob_years               0
education               0
education_id            0
family_status           0
family_status_id        0
gender                  0
income_type             0
debt                    0
total_income            0
purpose                 0
age_group               0
median_income           0
median_days_employed    0
dtype: int64

## Data Categorization

In [75]:
# Display the values of the data you chose to categorize

data['purpose'].unique()

array(['purchase of the house', 'car purchase', 'supplementary education',
       'to have a wedding', 'housing transactions', 'education',
       'having a wedding', 'purchase of the house for my family',
       'buy real estate', 'buy commercial real estate',
       'buy residential real estate', 'construction of own property',
       'property', 'building a property', 'buying a second-hand car',
       'buying my own car', 'transactions with commercial real estate',
       'building a real estate', 'housing',
       'transactions with my real estate', 'cars', 'to become educated',
       'second-hand car purchase', 'getting an education', 'car',
       'wedding ceremony', 'to get a supplementary education',
       'purchase of my own house', 'real estate transactions',
       'getting higher education', 'to own a car', 'purchase of a car',
       'profile education', 'university education',
       'buying property for renting out', 'to buy a car',
       'housing renovation', 'going

In [76]:
# Print every purpose that mentions a car
for i in data['purpose']:
    if 'car' in i:
        print(i, 'car')
        

car purchase car
car purchase car
buying a second-hand car car
buying my own car car
car purchase car
buying a second-hand car car
cars car
car purchase car
second-hand car purchase car
car purchase car
buying my own car car
second-hand car purchase car
car car
cars car
cars car
second-hand car purchase car
cars car
buying a second-hand car car
to own a car car
purchase of a car car
cars car
purchase of a car car
second-hand car purchase car
purchase of a car car
car car
buying my own car car
buying a second-hand car car
second-hand car purchase car
to buy a car car
car car
cars car
purchase of a car car
car car
car car
second-hand car purchase car
to own a car car
buying my own car car
purchase of a car car
buying my own car car
car purchase car
buying my own car car
car purchase car
to own a car car
second-hand car purchase car
car purchase car
to own a car car
to own a car car
buying a second-hand car car
buying my own car car
second-hand car purchase car
buying a second-hand car ca



**There are 4 general categories, namely car, house, education and wedding.**


In [77]:
# Build a dictionary that maps each purpose to a general category
# ('car' and 'education' already equal their category names, so they need no entry)
sample_dictionary = {'purchase of the house' : 'house', 'housing transactions' : 'house', 'purchase of the house for my family' : 'house',
                    'buy real estate' : 'house', 'buy commercial real estate' : 'house', 'buy residential real estate' : 'house',
                    'construction of own property' : 'house', 'property' : 'house', 'building a property' : 'house',
                    'transactions with commercial real estate' : 'house', 'building a real estate' : 'house', 'housing' : 'house',
                    'transactions with my real estate' : 'house', 'purchase of my own house' : 'house',
                    'real estate transactions' : 'house', 'buying property for renting out' : 'house', 'housing renovation' : 'house',
                     
                    'car purchase' : 'car', 'buying a second-hand car' : 'car', 'buying my own car' : 'car', 'cars' : 'car',
                    'second-hand car purchase' : 'car', 'to own a car' : 'car', 'purchase of a car' : 'car', 'to buy a car' : 'car', 
                    'to have a wedding' : 'wedding', 'having a wedding' : 'wedding', 'wedding ceremony' : 'wedding', 
                     
                    'supplementary education' : 'education', 'to become educated' : 'education', 'getting an education' : 'education',
                    'to get a supplementary education' : 'education',  'getting higher education' : 'education',
                    'profile education' : 'education', 'university education' : 'education', 'going to university' : 'education', 
                   }

In [78]:
# Create a column that holds the category and count its values
data['kategori_purpose'] = data['purpose'].replace(sample_dictionary)
data.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,median_income,median_days_employed,kategori_purpose
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,41-50,24569.968,2328.603723,house
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,31-40,24825.1865,2328.603723,car
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,31-40,24825.1865,2328.603723,house
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,31-40,24825.1865,2328.603723,education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,51-60,22056.771,365015.727554,wedding
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,<30,22955.474,2112.744402,house
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,41-50,24569.968,2112.744402,house
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,41-50,24569.968,2328.603723,education
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,31-40,24825.1865,2328.603723,wedding
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,41-50,24569.968,2328.603723,house


In [79]:
data['kategori_purpose'].value_counts()

house        10763
car           4284
education     3995
wedding       2310
Name: kategori_purpose, dtype: int64
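
The dictionary above has to list every spelling of every purpose. A keyword-based function is one possible alternative, sketched below. The keyword list, the function name `categorize_purpose`, and the fallback label `'other'` are assumptions made for this example and are not part of the notebook.

```python
import pandas as pd

# Hypothetical keyword rules: the first keyword found in the text decides the category.
KEYWORD_TO_CATEGORY = [
    ('car', 'car'),
    ('wedding', 'wedding'),
    ('educat', 'education'),
    ('university', 'education'),
    ('hous', 'house'),
    ('estate', 'house'),
    ('property', 'house'),
]

def categorize_purpose(purpose):
    """Return a general category for one purpose string, or 'other' if no keyword matches."""
    text = purpose.lower()
    for keyword, category in KEYWORD_TO_CATEGORY:
        if keyword in text:
            return category
    return 'other'

sample = pd.Series([
    'buying my own car',
    'to have a wedding',
    'going to university',
    'buy commercial real estate',
    'housing renovation',
])
categories = sample.apply(categorize_purpose)
print(categories.value_counts())
```

The explicit `'other'` fallback makes unmatched purposes visible in `value_counts()` instead of silently passing them through unchanged.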

[If you decide to categorize numerical data, you also need to create categories for that data.]

In [80]:
# Look at the numeric columns that are candidates for categorization:
# total_income, days_employed, dob_years

column = ['total_income', 'days_employed', 'dob_years']
data1 = pd.DataFrame(data, columns=column)
data1


Unnamed: 0,total_income,days_employed,dob_years
0,40620.102,8437.673028,42
1,17932.802,4024.803754,36
2,23341.752,5623.422610,33
3,42820.568,4124.747207,32
4,25378.572,340266.072047,53
...,...,...,...
21520,35966.698,4529.316663,43
21521,24959.969,343937.404131,67
21522,14347.610,2113.346888,38
21523,39054.888,3112.481705,38


In [81]:
# Get summary statistics for these columns
data1.describe()


Unnamed: 0,total_income,days_employed,dob_years
count,21352.0,21352.0,21352.0
mean,26453.638835,67083.443967,43.476817
std,15707.465557,139144.318819,12.241877
min,3306.762,24.141633,19.0
25%,17223.82125,1023.688788,33.0
50%,23227.295,2328.603723,43.0
75%,31321.653,5321.001947,53.0
max,362496.645,401755.400475,75.0


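**`total_income` will be grouped into ranges. Of the three columns it is the best candidate: `days_employed` still contains implausible values (its maximum of about 401,755 days is more than 1,000 years), and `dob_years` is already covered by `age_group`.**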

In [82]:
# Write a function that assigns each total_income value to a named income group

def income_group(total_income):
    """Return the income group label for a single total_income value."""
    try:
        if 0 < total_income <= 10000:
            return 'low'
        if 10000 < total_income <= 30000:
            return 'middle'
        if 30000 < total_income <= 60000:
            return 'high'
        if total_income > 60000:
            return 'very high'
    except TypeError:
        # Non-numeric input cannot be compared with the thresholds
        return 0


In [83]:
income_group(40620.102)

'high'

In [84]:
# Create a column that holds the income category
# (a new column built with apply, as was done when filling in missing values)
data['income_group'] = data['total_income'].apply(income_group)
data.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,age_group,median_income,median_days_employed,kategori_purpose,income_group
0,1,8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house,41-50,24569.968,2328.603723,house,high
1,1,4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase,31-40,24825.1865,2328.603723,car,middle
2,0,5623.42261,33,secondary education,1,married,0,M,employee,0,23341.752,purchase of the house,31-40,24825.1865,2328.603723,house,middle
3,3,4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education,31-40,24825.1865,2328.603723,education,high
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding,51-60,22056.771,365015.727554,wedding,middle
5,0,926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house,<30,22955.474,2112.744402,house,high
6,0,2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions,41-50,24569.968,2112.744402,house,high
7,0,152.779569,50,secondary education,1,married,0,M,employee,0,21731.829,education,41-50,24569.968,2328.603723,education,middle
8,2,6929.865299,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337.093,having a wedding,31-40,24825.1865,2328.603723,wedding,middle
9,0,2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family,41-50,24569.968,2328.603723,house,middle
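
As a hedged alternative to applying a row-by-row function, `pd.cut` can assign the same ranges in one vectorized call. The sketch below reuses the thresholds from `income_group`. The sample incomes are made up, apart from two values taken from the rows shown above.

```python
import numpy as np
import pandas as pd

# Sample incomes: 17932.802 and 40620.102 appear in the notebook, the rest are illustrative.
incomes = pd.Series([8000.0, 10000.0, 17932.802, 40620.102, 75000.0])

# Bin edges follow the thresholds in income_group; each bin is (left, right].
bins = [0, 10000, 30000, 60000, np.inf]
labels = ['low', 'middle', 'high', 'very high']

groups = pd.cut(incomes, bins=bins, labels=labels)
print(groups.tolist())
```

With the default `right=True`, a value that sits exactly on an edge (for example 10000) falls into the lower bin, which matches the `<=` comparisons in `income_group`.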


## Checking hypotheses


**Is there a correlation between having children and the probability of defaulting on a loan?**

In [86]:
data['debt']

0        0
1        0
2        0
3        0
4        0
        ..
21520    0
21521    0
21522    1
21523    1
21524    0
Name: debt, Length: 21352, dtype: int64

In [87]:
# Check the children data against the loan default data
# Compute the default percentage for each number of children
pivot_table_children = data.pivot_table(index='children', columns='debt', values='dob_years', aggfunc='count')
# Share of defaulters in each group: defaults / (defaults + non-defaults) * 100
pivot_table_children['percentage_gagal_bayar'] = pivot_table_children[1] / (pivot_table_children[1] + pivot_table_children[0]) * 100
pivot_table_children

children,debt=0,debt=1,percentage_gagal_bayar
0,12963.0,1058.0,7.545824
1,4397.0,442.0,9.134119
2,1912.0,202.0,9.555345
3,301.0,27.0,8.231707
4,37.0,4.0,9.756098
5,9.0,,


**Conclusion**

**Customers without children default least often: 1058 of 14021, about 7.5%. Customers with 1 to 4 children default more often, between about 8.2% (27 of 328 with 3 children) and 9.8% (4 of 41 with 4 children), although the groups with 3 and 4 children are small. None of the 9 customers with 5 children has defaulted; the NaN in the table means that no rows with `debt` = 1 exist for that group, and 9 customers are too few to draw a conclusion.**
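
Because `debt` only takes the values 0 and 1, the default rate of a group is simply the mean of `debt` within that group. The sketch below shows this with `groupby` on made-up toy data. The output column names `customers`, `defaults`, `default_rate`, and `default_rate_pct` are chosen for this example only.

```python
import pandas as pd

# Toy data: the counts are made up and much smaller than in the notebook.
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 5, 5],
    'debt':     [0, 0, 0, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the default rate.
summary = toy.groupby('children')['debt'].agg(
    customers='count',
    defaults='sum',
    default_rate='mean',
)
summary['default_rate_pct'] = summary['default_rate'] * 100
print(summary)
```

A group with no defaults shows a rate of 0 here rather than NaN, and the `customers` column keeps the group size in view so that very small groups are not over-interpreted.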

In [88]:
# Check the family status data against the loan default data
# Compute the default percentage for each family status
pivot_table_family = data.pivot_table(index='family_status', columns='debt', values='dob_years', aggfunc='count')
# Share of defaulters in each group: defaults / (defaults + non-defaults) * 100
pivot_table_family['percentage_gagal_bayar'] = pivot_table_family[1] / (pivot_table_family[1] + pivot_table_family[0]) * 100
pivot_table_family


family_status,debt=0,debt=1,percentage_gagal_bayar
civil partnership,3743,386,9.348511
divorced,1100,85,7.172996
married,11363,927,7.542718
unmarried,2521,273,9.770938
widow / widower,892,62,6.498952


**Conclusion**


For `family_status`, **unmarried** customers (273 of 2794, about 9.8%) and customers in a **civil partnership** (386 of 4129, about 9.3%) default most often. The other groups default less often: married customers about 7.5% (927 of 12290), divorced customers about 7.2% (85 of 1185), and widowed customers about 6.5% (62 of 954).

**Is there a correlation between income level and the probability of defaulting on a loan?**

In [89]:
# Check the income level data against the loan default data
# Compute the default percentage for each income level
pivot_table_income = data.pivot_table(index='income_group', columns='debt', values='dob_years', aggfunc='count')
# Share of defaulters in each group: defaults / (defaults + non-defaults) * 100
pivot_table_income['percentage_gagal_bayar'] = pivot_table_income[1] / (pivot_table_income[1] + pivot_table_income[0]) * 100
pivot_table_income


income_group,debt=0,debt=1,percentage_gagal_bayar
high,4821,397,7.608279
low,863,58,6.297503
middle,13302,1240,8.527025
very high,633,38,5.663189


**Conclusion**

Customers with a *middle* income default most often: 1240 of 14542, about 8.5%. Customers with a *very high* income default least often: 38 of 671, about 5.7%. The *high* (about 7.6%) and *low* (about 6.3%) groups fall in between.
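
Row percentages like these can also be obtained with `pd.crosstab` and `normalize='index'`, which avoids writing the division by hand. This is a minimal sketch on made-up toy data, not the notebook's own computation.

```python
import pandas as pd

# Toy data: group labels mirror the notebook, the counts are made up.
toy = pd.DataFrame({
    'income_group': ['low', 'low', 'middle', 'middle', 'middle', 'middle', 'high', 'high'],
    'debt':         [0, 1, 0, 0, 0, 1, 0, 0],
})

# normalize='index' divides each row by its own total, so every row sums to 1.
shares = pd.crosstab(toy['income_group'], toy['debt'], normalize='index') * 100
print(shares)
```

The column labelled `1` then holds the default percentage of each group, and the column labelled `0` holds the share of customers who repaid on time.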

**How does the purpose of the loan affect the default rate?**

In [90]:
# Check the default percentage for each loan purpose and analyze it

pivot_table_purpose = data.pivot_table(index='kategori_purpose', columns='debt', values='dob_years', aggfunc='count')
# Share of defaulters in each group: defaults / (defaults + non-defaults) * 100
pivot_table_purpose['percentage_gagal_bayar'] = pivot_table_purpose[1] / (pivot_table_purpose[1] + pivot_table_purpose[0]) * 100
pivot_table_purpose

kategori_purpose,debt=0,debt=1,percentage_gagal_bayar
car,3884,400,9.337068
education,3625,370,9.261577
house,9984,779,7.237759
wedding,2126,184,7.965368


In [91]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21352 entries, 0 to 21524
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   children              21352 non-null  int64  
 1   days_employed         21352 non-null  float64
 2   dob_years             21352 non-null  int64  
 3   education             21352 non-null  object 
 4   education_id          21352 non-null  int64  
 5   family_status         21352 non-null  object 
 6   family_status_id      21352 non-null  int64  
 7   gender                21352 non-null  object 
 8   income_type           21352 non-null  object 
 9   debt                  21352 non-null  int64  
 10  total_income          21352 non-null  float64
 11  purpose               21352 non-null  object 
 12  age_group             21352 non-null  object 
 13  median_income         21352 non-null  float64
 14  median_days_employed  21352 non-null  float64
 15  kategori_purpose   

**Conclusion**

Loans for a *car* (400 defaults among 4284 customers, about 9.3%) and for *education* (370 among 3995, about 9.3%) have the highest default rates. *Wedding* loans follow with 184 defaults among 2310 customers, about 8.0%. Loans in the *house* category have the lowest rate: 779 defaults among 10763 customers, about 7.2%.

# General conclusion

After carrying out the full analysis, the following can be concluded:
- The final dataset contains 21352 entries, down from 21525 at the start. The 173 missing rows were removed because they contained invalid data.
- The analysis involved several cleaning steps: correcting invalid entries in `children`, removing the invalid value 0 from `dob_years`, and filling in missing values in `days_employed` and `total_income`, which may have been lost to a *system error*. The gaps in these two columns were filled with the middle value (*median*) of the existing data.
- From all the main processes, it was finally decided to create several new categories such as income