# 🧠 Feature Engineering - Credit Risk Modeling Project

This document outlines the complete **feature engineering workflow** applied to the Home Credit Default Risk dataset. The goal is to transform raw data into meaningful features that improve model performance and interpretability.

---

## 🔹 1. Data Cleaning

- Converted all `DAYS_*` columns from negative values to positive (e.g., `DAYS_BIRTH` to `AGE_YEARS`).
- Handled missing values for critical features such as `EXT_SOURCE_2`, `EXT_SOURCE_3`, and `OCCUPATION_TYPE` using imputation strategies.
- Ensured consistent binary flag formatting (0/1) across all `FLAG_*` columns.
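
The cleaning steps above can be sketched on a toy frame. This is a minimal illustration, not the project's actual code: the values are made up, and the specific imputation choices (median for `EXT_SOURCE_2`, an `"Unknown"` label for `OCCUPATION_TYPE`) are assumptions, since the text only says "imputation strategies".

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the columns named above; the values are made up.
df = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765, -19046],
    "EXT_SOURCE_2": [0.26, np.nan, 0.55],
    "OCCUPATION_TYPE": ["Laborers", None, "Core staff"],
    "FLAG_OWN_CAR": ["N", "Y", "N"],
})

# DAYS_* values count days before the application, so they are negative;
# negating and dividing by 365 gives a positive number of years.
df["AGE_YEARS"] = -df["DAYS_BIRTH"] / 365

# Assumed imputation choices: median for the numeric score,
# an explicit "Unknown" label for the categorical column.
df["EXT_SOURCE_2"] = df["EXT_SOURCE_2"].fillna(df["EXT_SOURCE_2"].median())
df["OCCUPATION_TYPE"] = df["OCCUPATION_TYPE"].fillna("Unknown")

# Normalise a Y/N flag to 0/1.
df["FLAG_OWN_CAR"] = df["FLAG_OWN_CAR"].map({"N": 0, "Y": 1})
```

Keeping an explicit `"Unknown"` category, rather than dropping rows, preserves the information that the occupation was not reported.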

---

## 🔹 2. Categorical Encoding

- **Label Encoding** for binary categorical variables (e.g., `CODE_GENDER`, `FLAG_OWN_CAR`, `FLAG_OWN_REALTY`).
- **One-Hot Encoding** for nominal variables (e.g., `NAME_CONTRACT_TYPE`, `NAME_EDUCATION_TYPE`, `NAME_FAMILY_STATUS`).
- **Frequency Encoding** for high-cardinality variables like `ORGANIZATION_TYPE`.
- Converted `WEEKDAY_APPR_PROCESS_START` to numerical day of the week (Mon=0 to Sun=6).
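
The four encodings listed above could look roughly like the sketch below. The toy values, the `F`/`M` to 0/1 mapping, and the upper-case weekday spelling are assumptions made for illustration; only standard pandas calls are used.

```python
import pandas as pd

# Toy frame; the category values are illustrative.
df = pd.DataFrame({
    "CODE_GENDER": ["M", "F", "F", "M"],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", "Revolving loans", "Cash loans"],
    "ORGANIZATION_TYPE": ["School", "Bank", "School", "School"],
    "WEEKDAY_APPR_PROCESS_START": ["MONDAY", "SUNDAY", "WEDNESDAY", "MONDAY"],
})

# Label encoding for a binary column (the 0/1 assignment is an arbitrary choice).
df["CODE_GENDER"] = df["CODE_GENDER"].map({"F": 0, "M": 1})

# Frequency encoding: replace each category by its relative frequency.
freq = df["ORGANIZATION_TYPE"].value_counts(normalize=True)
df["ORGANIZATION_TYPE"] = df["ORGANIZATION_TYPE"].map(freq)

# Weekday name to integer, Monday=0 ... Sunday=6.
weekdays = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
            "FRIDAY", "SATURDAY", "SUNDAY"]
weekday_to_int = {name: i for i, name in enumerate(weekdays)}
df["WEEKDAY_APPR_PROCESS_START"] = df["WEEKDAY_APPR_PROCESS_START"].map(weekday_to_int)

# One-hot encoding for a nominal column; produces one 0/1 column per category.
df = pd.get_dummies(df, columns=["NAME_CONTRACT_TYPE"], dtype=int)
```

When train and test are encoded separately, the frequency table and the one-hot column set should come from the training data so both frames end up with the same columns.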

---

## 🔹 3. Numerical Feature Transformation

Created new features and ratios to provide more meaningful relationships:

- `INCOME_PER_PERSON` = `AMT_INCOME_TOTAL` / `CNT_FAM_MEMBERS`
- `CREDIT_TO_INCOME` = `AMT_CREDIT` / `AMT_INCOME_TOTAL`
- `ANNUITY_TO_INCOME` = `AMT_ANNUITY` / `AMT_INCOME_TOTAL`
- `ANNUITY_TO_CREDIT` = `AMT_ANNUITY` / `AMT_CREDIT`
- `GOODS_TO_CREDIT` = `AMT_GOODS_PRICE` / `AMT_CREDIT`
- `AGE_YEARS` = `-DAYS_BIRTH / 365`
- `EMPLOYMENT_YEARS` = `-DAYS_EMPLOYED / 365`
- `YEARS_SINCE_REGISTRATION` = `-DAYS_REGISTRATION / 365`
- `YEARS_ID_ISSUED` = `-DAYS_ID_PUBLISH / 365`
- `WORKING_LIFE_RATIO` = `EMPLOYMENT_YEARS / AGE_YEARS`
- `TIME_SINCE_LAST_PHONE_CHANGE` = `-DAYS_LAST_PHONE_CHANGE / 365`
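
A few of the ratios above are sketched here on made-up values. The `safe_ratio` helper is hypothetical and not part of the original workflow; it is added only to show one way of avoiding infinite values when a denominator is 0.

```python
import numpy as np
import pandas as pd

# Toy frame; the second row has CNT_FAM_MEMBERS = 0 to exercise the guard.
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [200000.0, 90000.0],
    "AMT_CREDIT": [400000.0, 180000.0],
    "AMT_ANNUITY": [20000.0, 9000.0],
    "CNT_FAM_MEMBERS": [2.0, 0.0],
    "DAYS_BIRTH": [-14600, -10950],
    "DAYS_EMPLOYED": [-3650, -1095],
})


def safe_ratio(num: pd.Series, den: pd.Series) -> pd.Series:
    """Divide two Series, returning NaN where the denominator is 0."""
    return num / den.replace(0, np.nan)


df["INCOME_PER_PERSON"] = safe_ratio(df["AMT_INCOME_TOTAL"], df["CNT_FAM_MEMBERS"])
df["CREDIT_TO_INCOME"] = safe_ratio(df["AMT_CREDIT"], df["AMT_INCOME_TOTAL"])
df["ANNUITY_TO_INCOME"] = safe_ratio(df["AMT_ANNUITY"], df["AMT_INCOME_TOTAL"])
df["ANNUITY_TO_CREDIT"] = safe_ratio(df["AMT_ANNUITY"], df["AMT_CREDIT"])

# Time-based features: negate the day counts and convert to years.
df["AGE_YEARS"] = -df["DAYS_BIRTH"] / 365
df["EMPLOYMENT_YEARS"] = -df["DAYS_EMPLOYED"] / 365
df["WORKING_LIFE_RATIO"] = safe_ratio(df["EMPLOYMENT_YEARS"], df["AGE_YEARS"])
```

Rows where a ratio is undefined come out as `NaN` and can then be imputed or left for tree-based models that handle missing values natively.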

---

## ✅ Final Feature Matrix

After the above transformations, the final dataset included:

- Original base features
- 11 engineered features (the ratios and year conversions listed in Section 3)
- Cleaned and encoded categorical variables
- Aggregated and interaction-based enhancements

These were used for training various models including Logistic Regression, Random Forest, XGBoost, and LightGBM.

---



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
train = pd.read_csv('train_preprocessed_3.csv')

In [3]:
test = pd.read_csv('test_preprocessed_3.csv')

In [4]:
train.columns

Index(['Unnamed: 0', 'SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE',
       'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE',
       'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH',
       'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'OBS_30_CNT_SOCIAL_CIRC

In [5]:
train.shape

(304527, 82)

In [6]:
train.head()

Unnamed: 0.1,Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,...,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_Income,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year
0,0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,...,0.0,1.0,0,0,0,0,0,0,0,0
1,1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,...,0.0,0.0,0,0,0,0,0,0,0,0
2,2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,...,0.0,0.0,0,0,0,0,0,0,0,0
3,3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,...,2.0,3.0,0,0,1,0,0,1,0,0
4,4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,...,0.0,0.0,0,0,0,0,0,0,0,0


In [7]:
test.head()

Unnamed: 0.1,Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_amt_income_total,is_outlier_amt_credit,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_year,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_hour
0,0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,...,0.0,0.0,0,0,0,0,0,0,0,0
1,1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,...,0.0,3.0,0,0,0,0,0,0,0,0
2,2,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,...,0.0,3.0,0,1,0,0,0,0,0,0
3,3,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,...,1.0,1.0,0,0,1,0,0,1,1,0
4,4,100042,Cash loans,F,Y,Y,0,270000.0,959688.0,34600.5,...,1.0,2.0,0,0,0,0,0,0,0,0


In [8]:
train.drop('Unnamed: 0',axis=1,inplace=True)

In [9]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_Income,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,0.0,1.0,0,0,0,0,0,0,0,0
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0.0,0.0,0,0,0,0,0,0,0,0
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0.0,0.0,0,0,0,0,0,0,0,0
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,2.0,3.0,0,0,1,0,0,1,0,0
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0.0,0.0,0,0,0,0,0,0,0,0


In [10]:
test.drop('Unnamed: 0',axis=1,inplace=True)

In [11]:
test.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_amt_income_total,is_outlier_amt_credit,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_year,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_hour
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0.0,0.0,0,0,0,0,0,0,0,0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0.0,3.0,0,0,0,0,0,0,0,0
2,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0.0,3.0,0,1,0,0,0,0,0,0
3,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,1.0,1.0,0,0,1,0,0,1,1,0
4,100042,Cash loans,F,Y,Y,0,270000.0,959688.0,34600.5,810000.0,...,1.0,2.0,0,0,0,0,0,0,0,0


In [12]:
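# Preview only: set_index returns a new DataFrame and the result is not assigned, so train itself is unchanged
train.set_index('SK_ID_CURR')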

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_Income,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year
100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,351000.0,...,0.0,1.0,0,0,0,0,0,0,0,0
100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,1129500.0,...,0.0,0.0,0,0,0,0,0,0,0,0
100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,135000.0,...,0.0,0.0,0,0,0,0,0,0,0,0
100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,297000.0,...,2.0,3.0,0,0,1,0,0,1,0,0
100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,513000.0,...,0.0,0.0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456251,0,Cash loans,M,N,N,0,157500,254700.0,27558.0,225000.0,...,2.0,2.0,0,0,1,0,1,0,0,0
456252,0,Cash loans,F,N,Y,0,72000,269550.0,12001.5,225000.0,...,1.0,1.0,0,0,0,0,1,1,0,0
456253,0,Cash loans,F,N,Y,0,153000,677664.0,29979.0,585000.0,...,0.0,1.0,0,0,1,0,0,1,0,0
456254,1,Cash loans,F,N,Y,0,171000,370107.0,20205.0,319500.0,...,0.0,0.0,0,0,0,0,0,0,0,0


In [13]:
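# Preview only: set_index returns a new DataFrame and the result is not assigned, so test itself is unchanged
test.set_index('SK_ID_CURR')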

SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,...,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_amt_income_total,is_outlier_amt_credit,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_year,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_hour
100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,Unaccompanied,...,0.0,0.0,0,0,0,0,0,0,0,0
100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,Unaccompanied,...,0.0,3.0,0,0,0,0,0,0,0,0
100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,Unaccompanied,...,0.0,3.0,0,1,0,0,0,0,0,0
100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,Unaccompanied,...,1.0,1.0,0,0,1,0,0,1,1,0
100042,Cash loans,F,Y,Y,0,270000.0,959688.0,34600.5,810000.0,Unaccompanied,...,1.0,2.0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456221,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,Unaccompanied,...,0.0,1.0,0,0,0,0,0,0,0,0
456222,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,Unaccompanied,...,1.0,0.0,0,0,1,0,0,1,0,0
456223,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,Unaccompanied,...,3.0,1.0,0,0,0,0,1,0,0,0
456224,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,Family,...,0.0,2.0,0,0,0,0,0,0,0,0


In [14]:
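# Note: train still has its default index here, so this checks the row index,
# not SK_ID_CURR; train['SK_ID_CURR'].duplicated().sum() would check the IDs themselves
train.index.duplicated().sum()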

0

In [15]:
train['DAYS_BIRTH'].isnull().sum()

0

In [16]:
train['AGE_YEAR'] = train['DAYS_BIRTH']/-365

In [17]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_Income,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,1.0,0,0,0,0,0,0,0,0,25.920548
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0.0,0,0,0,0,0,0,0,0,45.931507
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0.0,0,0,0,0,0,0,0,0,52.180822
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,3.0,0,0,1,0,0,1,0,0,52.068493
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0.0,0,0,0,0,0,0,0,0,54.608219


In [18]:
train['AGE_YEAR'].max()

69.12054794520547

In [19]:
train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_C

In [20]:
train['DAYS_EMPLOYED'].max()

365243

In [21]:
train['DAYS_EMPLOYED'].sort_values()

278271    -17912
270409    -17583
206879    -17546
34841     -17531
231902    -17522
           ...  
49473     365243
49471     365243
190393    365243
219654    365243
159783    365243
Name: DAYS_EMPLOYED, Length: 304527, dtype: int64

In [22]:
train['DAYS_EMPLOYED'].value_counts()

DAYS_EMPLOYED
 365243    54852
-200         154
-224         149
-199         149
-230         147
           ...  
-13952         1
-11498         1
-14619         1
-9361          1
-8694          1
Name: count, Length: 12556, dtype: int64

- The `DAYS_EMPLOYED` column contains 54,852 rows with the value 365243, which corresponds to more than 1,000 years of employment.
- This cannot be a real employment length, so these values are treated as invalid data.
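
The observation above can be checked numerically, and the sketch below shows one common alternative for handling such a placeholder: mark it as missing, keep an indicator, and impute from the valid rows only. This is an illustration on made-up values and is not what the following cells do; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # placeholder value observed in DAYS_EMPLOYED

# Toy series mixing valid (negative) day counts with the placeholder.
days_employed = pd.Series([-637, 365243, -1188, 365243, -225], name="DAYS_EMPLOYED")

# 365243 days is roughly 1000 years, which cannot be a real employment length.
sentinel_years = SENTINEL / 365

# Indicator that records which rows carried the placeholder.
is_sentinel = (days_employed == SENTINEL).astype(int)

# Mark the placeholder as missing, then impute with the median of the valid rows only,
# so the placeholder never influences the imputed value.
cleaned = days_employed.replace(SENTINEL, np.nan)
cleaned = cleaned.fillna(cleaned.median())

# Convert to a positive number of years.
years_employed = -cleaned / 365
```

Computing the statistic after masking the placeholder matters: if the placeholder (or a 0 substituted for it) is still in the column when the mean is taken, the imputed value is pulled away from the typical valid value.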

In [23]:
# Replace the placeholder value 365243 in DAYS_EMPLOYED with 0
train['DAYS_EMPLOYED'] = train['DAYS_EMPLOYED'].replace(365243, 0)

In [24]:
train['DAYS_EMPLOYED'].value_counts()

DAYS_EMPLOYED
 0        54854
-200        154
-224        149
-199        149
-230        147
          ...  
-7859         1
-11498        1
-14619        1
-9361         1
-8694         1
Name: count, Length: 12555, dtype: int64

In [25]:
train['DAYS_EMPLOYED'].mean()

-1956.1603174759546

In [26]:
# Replace 0 in DAYS_EMPLOYED with the column mean (the zeros are still included when this mean is computed)
train['DAYS_EMPLOYED'] = train['DAYS_EMPLOYED'].replace(0, train['DAYS_EMPLOYED'].mean())

In [27]:
train['DAYS_EMPLOYED'].value_counts()

DAYS_EMPLOYED
-1956.160317     54854
-200.000000        154
-224.000000        149
-199.000000        149
-230.000000        147
                 ...  
-7859.000000         1
-11498.000000        1
-14619.000000        1
-9361.000000         1
-8694.000000         1
Name: count, Length: 12555, dtype: int64

In [28]:
train['YEAR_EMPLOYED'] = train['DAYS_EMPLOYED']/-365

In [29]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_Income,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,0,0,0,0,0,0,0,0,25.920548,1.745205
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0,0,0,0,0,0,0,0,45.931507,3.254795
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0,0,0,0,0,0,0,0,52.180822,0.616438
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,0,0,1,0,0,1,0,0,52.068493,8.326027
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0,0,0,0,0,0,0,0,54.608219,8.323288


In [30]:
train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_C

In [31]:
train['DAYS_REGISTRATION'].value_counts()

DAYS_REGISTRATION
-1.0        113
-7.0         97
-6.0         95
-2.0         91
-4.0         90
           ... 
-15854.0      1
-15240.0      1
-14539.0      1
-15524.0      1
-14798.0      1
Name: count, Length: 15678, dtype: int64

In [32]:
train['DAYS_REGISTRATION'].describe()

count    304527.000000
mean      -4986.698950
std        3521.597752
min      -24672.000000
25%       -7478.000000
50%       -4505.000000
75%       -2012.000000
max           0.000000
Name: DAYS_REGISTRATION, dtype: float64

In [33]:
train['DAYS_REGISTRATION'].sort_values()

231828   -24672.0
162048   -23738.0
247054   -23416.0
286371   -22928.0
208909   -22858.0
           ...   
247321        0.0
115298        0.0
139046        0.0
171698        0.0
115213        0.0
Name: DAYS_REGISTRATION, Length: 304527, dtype: float64

In [34]:
# Replace -1 with 0 in DAYS_REGISTRATION so these rows are dropped together with the zeros below
train['DAYS_REGISTRATION'] = train['DAYS_REGISTRATION'].replace(-1, 0)

In [35]:
train['DAYS_REGISTRATION'].value_counts()

DAYS_REGISTRATION
 0.0        192
-7.0         97
-6.0         95
-2.0         91
-4.0         90
           ... 
-15854.0      1
-15240.0      1
-14539.0      1
-15524.0      1
-14798.0      1
Name: count, Length: 15677, dtype: int64

In [36]:
# Drop rows where DAYS_REGISTRATION is 0 (copy so later column assignments act on an independent frame)
train = train[train['DAYS_REGISTRATION'] != 0].copy()

In [37]:
train['DAYS_REGISTRATION'].value_counts()

DAYS_REGISTRATION
-7.0        97
-6.0        95
-2.0        91
-4.0        90
-5.0        85
            ..
-19207.0     1
-14993.0     1
-16022.0     1
-14410.0     1
-14798.0     1
Name: count, Length: 15676, dtype: int64

In [38]:
train['YEAR_REGISTRATION'] = train['DAYS_REGISTRATION']/-365

In [39]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED,YEAR_REGISTRATION
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,0,0,0,0,0,0,0,25.920548,1.745205,9.994521
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0,0,0,0,0,0,0,45.931507,3.254795,3.249315
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0,0,0,0,0,0,0,52.180822,0.616438,11.671233
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,0,1,0,0,1,0,0,52.068493,8.326027,26.939726
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0,0,0,0,0,0,0,54.608219,8.323288,11.810959


In [40]:
train['DAYS_ID_PUBLISH'].describe()

count    304335.000000
mean      -2994.964848
std        1509.337299
min       -7197.000000
25%       -4299.000000
50%       -3255.000000
75%       -1721.000000
max           0.000000
Name: DAYS_ID_PUBLISH, dtype: float64

In [41]:
train['DAYS_ID_PUBLISH'].value_counts()

DAYS_ID_PUBLISH
-4053    168
-4046    161
-4095    159
-4256    157
-4417    156
        ... 
-6233      1
-6151      1
-5902      1
-5921      1
-6211      1
Name: count, Length: 6167, dtype: int64

In [42]:
train['DAYS_ID_PUBLISH'].min()

-7197

In [43]:
train['DAYS_ID_PUBLISH'].max()

0

In [44]:
train[train['DAYS_ID_PUBLISH'] == 0]

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED,YEAR_REGISTRATION
2198,102606,0,Cash loans,F,N,N,0,90000,467257.5,16911.0,...,0,0,0,0,1,0,0,45.079452,4.049315,22.038356
20146,123746,0,Cash loans,F,N,N,0,225000,808650.0,26217.0,...,0,1,1,1,1,0,0,45.10137,23.2,23.224658
37812,144217,0,Cash loans,M,Y,Y,2,225000,526491.0,32337.0,...,0,0,0,0,0,0,0,37.139726,4.975342,4.219178
74984,187795,0,Cash loans,M,N,Y,0,135000,760225.5,34483.5,...,0,0,0,0,0,0,0,45.068493,5.709589,5.309589
82605,196732,1,Cash loans,F,N,N,0,117000,545040.0,20677.5,...,0,0,0,0,1,0,0,45.054795,0.931507,4.260274
85621,200329,0,Cash loans,F,Y,Y,0,135000,1350000.0,37255.5,...,0,0,0,0,1,0,0,45.89863,12.271233,17.561644
85859,200618,0,Cash loans,M,Y,Y,0,675000,1227901.5,48825.0,...,0,0,0,0,0,0,0,45.057534,8.356164,12.854795
90415,206024,0,Cash loans,F,N,Y,0,144000,1546020.0,45333.0,...,0,0,0,0,0,0,0,45.087671,7.29589,11.747945
148363,273684,0,Cash loans,M,Y,Y,0,135000,595903.5,26379.0,...,0,1,0,0,1,0,0,45.558904,10.942466,7.219178
214454,350917,0,Cash loans,F,N,Y,0,135000,1125000.0,33025.5,...,0,0,0,0,0,0,0,51.909589,5.359343,27.561644


In [45]:
# Drop rows where DAYS_ID_PUBLISH is 0 (copy so later column assignments act on an independent frame)
train = train[train['DAYS_ID_PUBLISH'] != 0].copy()

In [46]:
train['DAYS_ID_PUBLISH'].describe()

count    304319.000000
mean      -2995.122312
std        1509.220736
min       -7197.000000
25%       -4299.000000
50%       -3255.000000
75%       -1721.000000
max          -1.000000
Name: DAYS_ID_PUBLISH, dtype: float64

In [47]:
train['YEAR_ID_PUBLISH'] = train['DAYS_ID_PUBLISH']/-365

In [48]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED,YEAR_REGISTRATION,YEAR_ID_PUBLISH
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,0,0,0,0,0,0,25.920548,1.745205,9.994521,5.808219
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0,0,0,0,0,0,45.931507,3.254795,3.249315,0.79726
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0,0,0,0,0,0,52.180822,0.616438,11.671233,6.934247
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,1,0,0,1,0,0,52.068493,8.326027,26.939726,6.676712
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0,0,0,0,0,0,54.608219,8.323288,11.810959,9.473973


In [49]:
train['DAYS_LAST_PHONE_CHANGE'].describe()

count    304319.000000
mean       -965.416080
std         826.939133
min       -4292.000000
25%       -1572.000000
50%        -761.000000
75%        -276.000000
max           0.000000
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64

In [50]:
train['DAYS_LAST_PHONE_CHANGE'].value_counts()

DAYS_LAST_PHONE_CHANGE
 0.0       37302
-1.0        2721
-2.0        2200
-3.0        1684
-4.0        1235
           ...  
-3514.0        1
-3894.0        1
-3884.0        1
-3713.0        1
-3538.0        1
Name: count, Length: 3770, dtype: int64

In [51]:
# Replace 0 in DAYS_LAST_PHONE_CHANGE with the column mean
train['DAYS_LAST_PHONE_CHANGE'] = train['DAYS_LAST_PHONE_CHANGE'].replace(0, train['DAYS_LAST_PHONE_CHANGE'].mean())

In [52]:
train['DAYS_LAST_PHONE_CHANGE'].value_counts()

DAYS_LAST_PHONE_CHANGE
-965.41608     37302
-1.00000        2721
-2.00000        2200
-3.00000        1684
-4.00000        1235
               ...  
-3514.00000        1
-3894.00000        1
-3884.00000        1
-3713.00000        1
-3538.00000        1
Name: count, Length: 3770, dtype: int64

In [53]:
train['YEAR_LAST_PHONE_CHANGE'] = train['DAYS_LAST_PHONE_CHANGE']/-365

In [54]:
train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_C

In [55]:
train.drop(['DAYS_BIRTH','DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE'],axis=1,inplace=True)

In [56]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED,YEAR_REGISTRATION,YEAR_ID_PUBLISH,YEAR_LAST_PHONE_CHANGE
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,0,0,0,0,0,25.920548,1.745205,9.994521,5.808219,3.106849
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0,0,0,0,0,45.931507,3.254795,3.249315,0.79726,2.268493
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0,0,0,0,0,52.180822,0.616438,11.671233,6.934247,2.232877
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,0,0,1,0,0,52.068493,8.326027,26.939726,6.676712,1.690411
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0,0,0,0,0,54.608219,8.323288,11.810959,9.473973,3.030137


### Test

In [57]:
test.columns

Index(['SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE',
       'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE',
       'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
       'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START',
       'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION',
       'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
       'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_2',
       'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE'

In [58]:
test.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_amt_income_total,is_outlier_amt_credit,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_year,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_hour
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0.0,0.0,0,0,0,0,0,0,0,0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0.0,3.0,0,0,0,0,0,0,0,0
2,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0.0,3.0,0,1,0,0,0,0,0,0
3,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,1.0,1.0,0,0,1,0,0,1,1,0
4,100042,Cash loans,F,Y,Y,0,270000.0,959688.0,34600.5,810000.0,...,1.0,2.0,0,0,0,0,0,0,0,0


In [59]:
test['DAYS_BIRTH'].value_counts()

DAYS_BIRTH
-11590    13
-11667    12
-11997    12
-11603    12
-15812    12
          ..
-17212     1
-24095     1
-9484      1
-15975     1
-16019     1
Name: count, Length: 15412, dtype: int64

In [60]:
test['DAYS_BIRTH'].max()

-7338

In [61]:
test['DAYS_BIRTH'].min()

-25195

In [62]:
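# Note: the train set names this column 'AGE_YEAR'; the two names need to be aligned before modeling
test['AGE_YEARS'] = test['DAYS_BIRTH']/-365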

In [63]:
test['DAYS_EMPLOYED'].value_counts()

DAYS_EMPLOYED
 365243    9047
-1119        32
-389         31
-1240        29
-148         28
           ... 
-8589         1
-7690         1
-4661         1
-4048         1
-6551         1
Name: count, Length: 7813, dtype: int64

In [64]:
# Replace the placeholder value 365243 in DAYS_EMPLOYED with 0
test['DAYS_EMPLOYED'] = test['DAYS_EMPLOYED'].replace(365243, 0)

In [65]:
# Replace 0 in DAYS_EMPLOYED with the column mean (the zeros are still included when this mean is computed)
test['DAYS_EMPLOYED'] = test['DAYS_EMPLOYED'].replace(0, test['DAYS_EMPLOYED'].mean())

In [66]:
test['YEAR_EMPLOYED'] = test['DAYS_EMPLOYED']/-365

In [67]:
test['DAYS_ID_PUBLISH'].value_counts()

DAYS_ID_PUBLISH
-4557    39
-4255    38
-4291    34
-4592    32
-4543    30
         ..
-5326     1
-5560     1
-5912     1
-5904     1
-6120     1
Name: count, Length: 5874, dtype: int64

In [68]:
test['DAYS_ID_PUBLISH'].max()

0

In [69]:
test[test['DAYS_ID_PUBLISH'] == 0]

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,is_outlier_amt_income_total,is_outlier_amt_credit,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_year,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_hour,AGE_YEARS,YEAR_EMPLOYED
8212,161300,Cash loans,M,Y,Y,0,180000.0,630000.0,34308.0,630000.0,...,0,0,0,0,0,0,0,0,45.046575,2.205479
8335,162133,Cash loans,F,N,Y,0,121500.0,1196802.0,38605.5,999000.0,...,0,0,0,1,0,0,0,0,45.060274,4.452055
8545,163667,Cash loans,F,N,Y,0,67500.0,351000.0,18049.5,351000.0,...,0,0,0,1,0,0,0,0,45.10137,2.235616
34101,354333,Cash loans,M,N,Y,1,207000.0,447453.0,22977.0,373500.0,...,0,0,0,0,0,0,0,0,45.09589,4.69589
40016,400422,Cash loans,M,Y,Y,0,202500.0,732915.0,75231.0,675000.0,...,0,0,0,0,0,1,1,1,45.230137,7.994521


In [70]:
# Remove rows where DAYS_ID_PUBLISH is 0 (copy so later column assignments act on an independent frame)
test = test[test['DAYS_ID_PUBLISH'] != 0].copy()

In [71]:
test['DAYS_ID_PUBLISH'].value_counts()

DAYS_ID_PUBLISH
-4557    39
-4255    38
-4291    34
-4592    32
-4263    30
         ..
-5790     1
-5367     1
-5888     1
-6208     1
-6120     1
Name: count, Length: 5873, dtype: int64

In [72]:
test['YEAR_ID_PUBLISH'] = test['DAYS_ID_PUBLISH']/-365

In [73]:
test['DAYS_LAST_PHONE_CHANGE'].value_counts()

DAYS_LAST_PHONE_CHANGE
 0.0       5649
-1.0        174
-2.0         71
-3.0         45
-1799.0      43
           ... 
-3355.0       1
-3007.0       1
-3925.0       1
-4187.0       1
-2671.0       1
Name: count, Length: 3571, dtype: int64

In [74]:
# Replace 0 in DAYS_LAST_PHONE_CHANGE with the column mean
test['DAYS_LAST_PHONE_CHANGE'] = test['DAYS_LAST_PHONE_CHANGE'].replace(0, test['DAYS_LAST_PHONE_CHANGE'].mean())

In [75]:
test['DAYS_LAST_PHONE_CHANGE'].value_counts()

DAYS_LAST_PHONE_CHANGE
-1078.183139    5649
-1.000000        174
-2.000000         71
-3.000000         45
-1799.000000      43
                ... 
-3355.000000       1
-3007.000000       1
-3925.000000       1
-4187.000000       1
-2671.000000       1
Name: count, Length: 3571, dtype: int64

In [76]:
test['YEAR_LAST_PHONE_CHANGE'] = test['DAYS_LAST_PHONE_CHANGE']/-365

In [77]:
test['DAYS_REGISTRATION'].value_counts()

DAYS_REGISTRATION
-818.0      20
-427.0      19
-991.0      18
-3758.0     16
-223.0      15
            ..
-11954.0     1
-10726.0     1
-11859.0     1
-11303.0     1
-12088.0     1
Name: count, Length: 12572, dtype: int64

In [78]:
test['DAYS_REGISTRATION'].max()

0.0

In [79]:
test[test['DAYS_REGISTRATION']==0]

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_year,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_hour,AGE_YEARS,YEAR_EMPLOYED,YEAR_ID_PUBLISH,YEAR_LAST_PHONE_CHANGE
7896,158963,Cash loans,F,Y,Y,0,315000.0,859500.0,44014.5,859500.0,...,0,0,0,0,0,0,50.512329,8.410959,4.939726,0.909589
13894,202273,Revolving loans,F,Y,N,1,171000.0,180000.0,9000.0,180000.0,...,0,0,0,0,0,0,26.556164,3.506849,2.224658,1.509589
23183,272013,Cash loans,F,N,N,0,99000.0,440784.0,27094.5,360000.0,...,0,0,0,0,0,0,45.660274,1.315068,0.019178,1.819178
23295,272938,Cash loans,F,N,Y,0,382500.0,871029.0,37035.0,765000.0,...,0,0,0,0,0,0,47.747945,10.005479,2.687671,1.605479
25616,290331,Cash loans,M,Y,Y,0,360000.0,927612.0,55368.0,810000.0,...,1,0,0,1,1,1,59.460274,13.139726,9.819178,3.090411
30279,324356,Cash loans,F,Y,Y,0,292500.0,1058197.5,38137.5,913500.0,...,0,0,0,0,0,0,44.578082,10.364384,11.890411,4.821918
30660,327301,Cash loans,F,N,Y,0,112500.0,306306.0,16744.5,247500.0,...,0,1,0,0,0,0,32.454795,2.361644,2.457534,2.419178
35298,363538,Cash loans,F,N,Y,0,405000.0,1545624.0,63918.0,1381500.0,...,0,0,0,0,0,0,38.186301,12.219178,13.09863,2.526027
36463,372179,Cash loans,M,Y,N,0,135000.0,189351.0,17496.0,153000.0,...,0,0,0,1,0,1,62.890411,5.500103,14.339726,1.646575
41948,412111,Cash loans,F,N,Y,1,202500.0,1971072.0,68512.5,1800000.0,...,1,0,0,0,0,0,33.528767,10.60274,0.112329,2.284932


In [80]:
# Drop rows where DAYS_REGISTRATION is 0 (copy so later column assignments do not raise SettingWithCopyWarning)
test = test[test['DAYS_REGISTRATION'] != 0].copy()

In [81]:
test['YEAR_REGISTRATION'] = test['DAYS_REGISTRATION']/-365

In [82]:
test.drop(['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE','DAYS_REGISTRATION'],axis=1,inplace=True)
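The cells above repeat the same pattern for each `DAYS_*` column: divide by -365, store the result under a `YEAR_*` name, then drop the original. As a minimal sketch, the pattern could be wrapped in one helper so train and test go through identical steps. The function name `days_to_years` and the demo values are hypothetical.

```python
import pandas as pd

def days_to_years(df, mapping, days_per_year=365):
    """Return a copy of df where each DAYS_* column in `mapping` is
    converted to positive years under its new name and then dropped."""
    out = df.copy()
    for days_col, years_col in mapping.items():
        out[years_col] = out[days_col] / -days_per_year
    return out.drop(columns=list(mapping))

# Made-up values for illustration only.
demo = pd.DataFrame({
    "DAYS_ID_PUBLISH": [-730, -365],
    "DAYS_REGISTRATION": [-3650.0, -1825.0],
})
converted = days_to_years(demo, {
    "DAYS_ID_PUBLISH": "YEAR_ID_PUBLISH",
    "DAYS_REGISTRATION": "YEAR_REGISTRATION",
})
print(converted)
```

Calling the same helper with the same mapping on both frames also keeps the new column names identical between them.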

In [83]:
train.shape

(304319, 81)

In [84]:
test.shape

(47755, 80)

In [85]:
train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCL

## Encoding Categorical Features

In [86]:
train['NAME_CONTRACT_TYPE'].unique()

array(['Cash loans', 'Revolving loans'], dtype=object)

In [87]:
test['NAME_CONTRACT_TYPE'].unique()

array(['Cash loans', 'Revolving loans'], dtype=object)

In [88]:
# Assign 0 to Cash loans and 1 to Revolving loans
train['NAME_CONTRACT_TYPE'] = train['NAME_CONTRACT_TYPE'].map({'Cash loans':0,'Revolving loans':1})
test['NAME_CONTRACT_TYPE'] = test['NAME_CONTRACT_TYPE'].map({'Cash loans':0,'Revolving loans':1})

In [89]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED,YEAR_REGISTRATION,YEAR_ID_PUBLISH,YEAR_LAST_PHONE_CHANGE
0,100002,1,0,M,N,Y,0,202500,406597.5,24700.5,...,0,0,0,0,0,25.920548,1.745205,9.994521,5.808219,3.106849
1,100003,0,0,F,N,N,0,270000,1293502.5,35698.5,...,0,0,0,0,0,45.931507,3.254795,3.249315,0.79726,2.268493
2,100004,0,1,M,Y,Y,0,67500,135000.0,6750.0,...,0,0,0,0,0,52.180822,0.616438,11.671233,6.934247,2.232877
3,100006,0,0,F,N,Y,0,135000,312682.5,29686.5,...,0,0,1,0,0,52.068493,8.326027,26.939726,6.676712,1.690411
4,100007,0,0,M,N,Y,0,121500,513000.0,21865.5,...,0,0,0,0,0,54.608219,8.323288,11.810959,9.473973,3.030137


In [90]:
train['NAME_CONTRACT_TYPE'].unique()

array([0, 1])

In [91]:
train['CODE_GENDER'].unique()

array(['M', 'F'], dtype=object)

In [92]:
test['CODE_GENDER'].unique()

array(['F', 'M'], dtype=object)

In [93]:
# Assign 0 to M and 1 to F
train['CODE_GENDER'] = train['CODE_GENDER'].map({'M':0,'F':1})
test['CODE_GENDER'] = test['CODE_GENDER'].map({'M':0,'F':1})

In [94]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED,YEAR_REGISTRATION,YEAR_ID_PUBLISH,YEAR_LAST_PHONE_CHANGE
0,100002,1,0,0,N,Y,0,202500,406597.5,24700.5,...,0,0,0,0,0,25.920548,1.745205,9.994521,5.808219,3.106849
1,100003,0,0,1,N,N,0,270000,1293502.5,35698.5,...,0,0,0,0,0,45.931507,3.254795,3.249315,0.79726,2.268493
2,100004,0,1,0,Y,Y,0,67500,135000.0,6750.0,...,0,0,0,0,0,52.180822,0.616438,11.671233,6.934247,2.232877
3,100006,0,0,1,N,Y,0,135000,312682.5,29686.5,...,0,0,1,0,0,52.068493,8.326027,26.939726,6.676712,1.690411
4,100007,0,0,0,N,Y,0,121500,513000.0,21865.5,...,0,0,0,0,0,54.608219,8.323288,11.810959,9.473973,3.030137


In [95]:
# Assign 0 to N and 1 to Y in FLAG_OWN_CAR and FLAG_OWN_REALTY
train['FLAG_OWN_CAR'] = train['FLAG_OWN_CAR'].map({'N':0,'Y':1})
test['FLAG_OWN_CAR'] = test['FLAG_OWN_CAR'].map({'N':0,'Y':1})
train['FLAG_OWN_REALTY'] = train['FLAG_OWN_REALTY'].map({'N':0,'Y':1})
test['FLAG_OWN_REALTY'] = test['FLAG_OWN_REALTY'].map({'N':0,'Y':1})
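`Series.map` with a dict silently turns any label missing from the dict into `NaN`. The sketch below wraps the mapping in a small check that fails loudly instead. The helper name `map_binary` is hypothetical, and the `"XNA"` label in the second call is only an illustrative unexpected value.

```python
import pandas as pd

def map_binary(series, mapping):
    """Map a two-level column to integers and raise if any value
    (including a missing one) is not covered by `mapping`."""
    mapped = series.map(mapping)
    unmapped = series[mapped.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped values in {series.name}: {list(unmapped)}")
    return mapped.astype("int64")

# Works when every label is covered.
flags = map_binary(pd.Series(["N", "Y", "Y"], name="FLAG_OWN_CAR"), {"N": 0, "Y": 1})
print(flags.tolist())

# Raises when an unexpected label appears.
try:
    map_binary(pd.Series(["M", "XNA"], name="CODE_GENDER"), {"M": 0, "F": 1})
except ValueError as err:
    print(err)
```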

In [96]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED,YEAR_REGISTRATION,YEAR_ID_PUBLISH,YEAR_LAST_PHONE_CHANGE
0,100002,1,0,0,0,1,0,202500,406597.5,24700.5,...,0,0,0,0,0,25.920548,1.745205,9.994521,5.808219,3.106849
1,100003,0,0,1,0,0,0,270000,1293502.5,35698.5,...,0,0,0,0,0,45.931507,3.254795,3.249315,0.79726,2.268493
2,100004,0,1,0,1,1,0,67500,135000.0,6750.0,...,0,0,0,0,0,52.180822,0.616438,11.671233,6.934247,2.232877
3,100006,0,0,1,0,1,0,135000,312682.5,29686.5,...,0,0,1,0,0,52.068493,8.326027,26.939726,6.676712,1.690411
4,100007,0,0,0,0,1,0,121500,513000.0,21865.5,...,0,0,0,0,0,54.608219,8.323288,11.810959,9.473973,3.030137


In [97]:
test.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,is_outlier_amt_req_credit_bureau_year,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_hour,AGE_YEARS,YEAR_EMPLOYED,YEAR_ID_PUBLISH,YEAR_LAST_PHONE_CHANGE,YEAR_REGISTRATION
0,100001,0,1,0,1,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0,52.715068,6.380822,2.224658,4.767123,14.164384
1,100005,0,0,0,1,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0,49.490411,12.243836,4.446575,2.953926,24.980822
2,100028,0,1,0,1,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0,38.290411,5.112329,11.528767,4.945205,5.479452
3,100038,0,0,1,0,1,180000.0,625500.0,32067.0,625500.0,...,0,0,1,1,0,35.726027,6.00274,11.676712,2.249315,10.958904
4,100042,0,1,1,1,0,270000.0,959688.0,34600.5,810000.0,...,0,0,0,0,0,50.969863,32.90137,5.553425,4.671233,16.756164
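The two `head()` outputs above show the age column as `AGE_YEAR` in `train` but `AGE_YEARS` in `test`. A model fitted on one name will not find the other at prediction time, so it may be worth checking column agreement before going further. The following is a minimal sketch on empty demo frames; the helper name `column_mismatch` is hypothetical.

```python
import pandas as pd

def column_mismatch(train_df, test_df, target="TARGET"):
    """Return (columns only in train, columns only in test), ignoring the target."""
    train_cols = set(train_df.columns) - {target}
    test_cols = set(test_df.columns)
    return sorted(train_cols - test_cols), sorted(test_cols - train_cols)

# Empty frames that only carry column names, for illustration.
train_demo = pd.DataFrame(columns=["SK_ID_CURR", "TARGET", "AGE_YEAR", "YEAR_EMPLOYED"])
test_demo = pd.DataFrame(columns=["SK_ID_CURR", "AGE_YEARS", "YEAR_EMPLOYED"])

only_train, only_test = column_mismatch(train_demo, test_demo)
print(only_train, only_test)

# One way to reconcile the names: rename the test column to match train.
test_demo = test_demo.rename(columns={"AGE_YEARS": "AGE_YEAR"})
after_fix = column_mismatch(train_demo, test_demo)
print(after_fix)
```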


In [98]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 304319 entries, 0 to 304526
Data columns (total 81 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   SK_ID_CURR                             304319 non-null  int64  
 1   TARGET                                 304319 non-null  int64  
 2   NAME_CONTRACT_TYPE                     304319 non-null  int64  
 3   CODE_GENDER                            304319 non-null  int64  
 4   FLAG_OWN_CAR                           304319 non-null  int64  
 5   FLAG_OWN_REALTY                        304319 non-null  int64  
 6   CNT_CHILDREN                           304319 non-null  int64  
 7   AMT_INCOME_TOTAL                       304319 non-null  int64  
 8   AMT_CREDIT                             304319 non-null  float64
 9   AMT_ANNUITY                            304319 non-null  float64
 10  AMT_GOODS_PRICE                        304319 non-null  float

In [99]:
# Categorical features
train.select_dtypes(include=['object']).columns

Index(['NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE'],
      dtype='object')

In [100]:
test.select_dtypes(include=['object']).columns

Index(['NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE'],
      dtype='object')

In [101]:
train['NAME_TYPE_SUITE'].unique()

array(['Unaccompanied', 'Family', 'Spouse, partner', 'Children',
       'Other_A', 'Other_B', 'Group of people'], dtype=object)

In [102]:
test['NAME_TYPE_SUITE'].unique()

array(['Unaccompanied', 'Family', 'Spouse, partner', 'Group of people',
       'Other_B', 'Children', 'Other_A'], dtype=object)

In [103]:
# Ordinal Encoding of NAME_TYPE_SUITE using sklearn OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(train[['NAME_TYPE_SUITE']])
train['NAME_TYPE_SUITE'] = oe.transform(train[['NAME_TYPE_SUITE']])
test['NAME_TYPE_SUITE'] = oe.transform(test[['NAME_TYPE_SUITE']])

In [104]:
# Inspect the fitted category order; each category's code is its position in this array
oe.categories_


[array(['Children', 'Family', 'Group of people', 'Other_A', 'Other_B',
        'Spouse, partner', 'Unaccompanied'], dtype=object)]

In [105]:
train['NAME_TYPE_SUITE'].unique()

array([6., 1., 5., 0., 3., 4., 2.])

In [106]:
test['NAME_TYPE_SUITE'].unique()

array([6., 1., 5., 2., 4., 0., 3.])

In [107]:
train['NAME_TYPE_SUITE'].dtype

dtype('float64')

In [108]:
test['NAME_TYPE_SUITE'].dtype

dtype('float64')
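The encoded column comes back as `float64`, and `transform` would raise an error if `test` contained a category not seen in `train`. As a hedged sketch, `OrdinalEncoder` can encode several columns in one fit, emit integers through its `dtype` parameter, and send unseen categories to a sentinel value. The demo frames and the sentinel `-1` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["NAME_TYPE_SUITE", "NAME_HOUSING_TYPE"]

# Small made-up frames; "Other_A" does not occur in train_demo.
train_demo = pd.DataFrame({
    "NAME_TYPE_SUITE": ["Unaccompanied", "Family", "Children"],
    "NAME_HOUSING_TYPE": ["With parents", "House / apartment", "Rented apartment"],
})
test_demo = pd.DataFrame({
    "NAME_TYPE_SUITE": ["Family", "Other_A"],
    "NAME_HOUSING_TYPE": ["With parents", "House / apartment"],
})

# Fit on train only; categories unseen at fit time are encoded as -1.
encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
    dtype=np.int64,
)
train_demo[cat_cols] = encoder.fit_transform(train_demo[cat_cols])
test_demo[cat_cols] = encoder.transform(test_demo[cat_cols])

print(train_demo)
print(test_demo)
```

Fitting a single encoder on a list of columns also removes the need to re-create `oe` for every feature; `encoder.categories_` then holds one array per column, in the order of `cat_cols`.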

In [110]:
train.select_dtypes(include=['object']).columns

Index(['NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START',
       'ORGANIZATION_TYPE'],
      dtype='object')

In [111]:
test.select_dtypes(include=['object']).columns

Index(['NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START',
       'ORGANIZATION_TYPE'],
      dtype='object')

In [112]:
train['NAME_INCOME_TYPE'].unique()

array(['Working', 'State servant', 'Commercial associate', 'Pensioner',
       'Unemployed', 'Student', 'Businessman', 'Maternity leave'],
      dtype=object)

In [113]:
train['NAME_INCOME_TYPE'].value_counts()

NAME_INCOME_TYPE
Working                 157233
Commercial associate     70743
Pensioner                54802
State servant            21490
Unemployed                  19
Student                     17
Businessman                 10
Maternity leave              5
Name: count, dtype: int64

In [114]:
# Ordinal Encoding of NAME_INCOME_TYPE using sklearn OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(train[['NAME_INCOME_TYPE']])
train['NAME_INCOME_TYPE'] = oe.transform(train[['NAME_INCOME_TYPE']])
test['NAME_INCOME_TYPE'] = oe.transform(test[['NAME_INCOME_TYPE']])

In [115]:
oe.categories_

[array(['Businessman', 'Commercial associate', 'Maternity leave',
        'Pensioner', 'State servant', 'Student', 'Unemployed', 'Working'],
       dtype=object)]

In [116]:
train['NAME_INCOME_TYPE'].unique()

array([7., 4., 1., 3., 6., 5., 0., 2.])

In [117]:
train['NAME_EDUCATION_TYPE'].unique()

array(['Secondary / secondary special', 'Higher education',
       'Incomplete higher', 'Lower secondary', 'Academic degree'],
      dtype=object)

In [118]:
train['NAME_EDUCATION_TYPE'].value_counts()

NAME_EDUCATION_TYPE
Secondary / secondary special    216438
Higher education                  73769
Incomplete higher                 10166
Lower secondary                    3784
Academic degree                     162
Name: count, dtype: int64

In [119]:
test['NAME_EDUCATION_TYPE'].value_counts()

NAME_EDUCATION_TYPE
Secondary / secondary special    33319
Higher education                 12241
Incomplete higher                 1687
Lower secondary                    468
Academic degree                     40
Name: count, dtype: int64

In [120]:
# Ordinal encoding of NAME_EDUCATION_TYPE (codes follow alphabetical order, not education level)
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(train[['NAME_EDUCATION_TYPE']])
train['NAME_EDUCATION_TYPE'] = oe.transform(train[['NAME_EDUCATION_TYPE']])
test['NAME_EDUCATION_TYPE'] = oe.transform(test[['NAME_EDUCATION_TYPE']])
oe.categories_

[array(['Academic degree', 'Higher education', 'Incomplete higher',
        'Lower secondary', 'Secondary / secondary special'], dtype=object)]

In [121]:
train['NAME_EDUCATION_TYPE'].unique()

array([4., 1., 2., 3., 0.])
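With the default settings, `OrdinalEncoder` sorts categories alphabetically, so 'Academic degree' gets code 0 and 'Secondary / secondary special' gets code 4. If the codes are meant to reflect education level, the order can be passed explicitly through the `categories` parameter. The ranking below is an assumption about how the levels should be ordered, not something stated in the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ranking from lowest to highest education level.
education_order = [
    "Lower secondary",
    "Secondary / secondary special",
    "Incomplete higher",
    "Higher education",
    "Academic degree",
]

demo = pd.DataFrame({"NAME_EDUCATION_TYPE": [
    "Secondary / secondary special",
    "Higher education",
    "Academic degree",
    "Lower secondary",
]})

# One list of categories per encoded column.
edu_encoder = OrdinalEncoder(categories=[education_order])
encoded = edu_encoder.fit_transform(demo[["NAME_EDUCATION_TYPE"]]).ravel()
print(encoded)
```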

In [122]:
train['NAME_FAMILY_STATUS'].unique()

array(['Single / not married', 'Married', 'Civil marriage', 'Widow',
       'Separated'], dtype=object)

In [123]:
test['NAME_FAMILY_STATUS'].unique()

array(['Married', 'Single / not married', 'Civil marriage', 'Widow',
       'Separated'], dtype=object)

In [124]:
# ordinal encoding of NAME_FAMILY_STATUS
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(train[['NAME_FAMILY_STATUS']])
train['NAME_FAMILY_STATUS'] = oe.transform(train[['NAME_FAMILY_STATUS']])
test['NAME_FAMILY_STATUS'] = oe.transform(test[['NAME_FAMILY_STATUS']])
oe.categories_ 

[array(['Civil marriage', 'Married', 'Separated', 'Single / not married',
        'Widow'], dtype=object)]

In [125]:
train['NAME_FAMILY_STATUS'].unique()

array([3., 1., 0., 4., 2.])
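`NAME_FAMILY_STATUS` has no natural order, so integer codes impose an arbitrary ranking that linear models in particular may pick up on. A one-hot alternative can be sketched with `pandas.get_dummies`, reindexing the test frame so both frames end up with the same columns. The demo frames are made up for illustration.

```python
import pandas as pd

train_demo = pd.DataFrame({"NAME_FAMILY_STATUS": ["Married", "Widow", "Separated"]})
test_demo = pd.DataFrame({"NAME_FAMILY_STATUS": ["Married", "Married"]})

# One indicator column per category, stored as 0/1 integers.
train_ohe = pd.get_dummies(train_demo, columns=["NAME_FAMILY_STATUS"], dtype=int)
test_ohe = pd.get_dummies(test_demo, columns=["NAME_FAMILY_STATUS"], dtype=int)

# Align test to the train layout: categories absent from test become all-zero
# columns, and categories absent from train would be dropped.
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(train_ohe.columns.tolist())
print(test_ohe)
```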

In [126]:
train['NAME_HOUSING_TYPE'].unique()

array(['House / apartment', 'Rented apartment', 'With parents',
       'Municipal apartment', 'Office apartment', 'Co-op apartment'],
      dtype=object)

In [127]:
test['NAME_HOUSING_TYPE'].unique()

array(['House / apartment', 'With parents', 'Rented apartment',
       'Municipal apartment', 'Office apartment', 'Co-op apartment'],
      dtype=object)

In [128]:
# Ordinal encoding of NAME_HOUSING_TYPE
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(train[['NAME_HOUSING_TYPE']])
train['NAME_HOUSING_TYPE'] = oe.transform(train[['NAME_HOUSING_TYPE']])
test['NAME_HOUSING_TYPE'] = oe.transform(test[['NAME_HOUSING_TYPE']])
oe.categories_

[array(['Co-op apartment', 'House / apartment', 'Municipal apartment',
        'Office apartment', 'Rented apartment', 'With parents'],
       dtype=object)]

In [129]:
train['NAME_HOUSING_TYPE'].unique()

array([1., 4., 5., 2., 3., 0.])

In [130]:
train['OCCUPATION_TYPE'].unique()

array(['Laborers', 'Core staff', 'Accountants', 'Managers',
       'Cleaning staff', 'Waiters/barmen staff', 'Drivers', 'Sales staff',
       'Secretaries', 'Cooking staff', 'Security staff',
       'Private service staff', 'HR staff', 'Medicine staff',
       'High skill tech staff', 'IT staff', 'Low-skill Laborers',
       'Realty agents'], dtype=object)

In [131]:
test['OCCUPATION_TYPE'].unique()

array(['Private service staff', 'Low-skill Laborers', 'Sales staff',
       'IT staff', 'Drivers', 'High skill tech staff', 'Core staff',
       'Cleaning staff', 'Laborers', 'Managers', 'Accountants',
       'Medicine staff', 'Secretaries', 'Waiters/barmen staff',
       'Security staff', 'Cooking staff', 'Realty agents', 'HR staff'],
      dtype=object)

In [132]:
# Ordinal encoding of OCCUPATION_TYPE
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(train[['OCCUPATION_TYPE']])
train['OCCUPATION_TYPE'] = oe.transform(train[['OCCUPATION_TYPE']])
test['OCCUPATION_TYPE'] = oe.transform(test[['OCCUPATION_TYPE']])
oe.categories_

[array(['Accountants', 'Cleaning staff', 'Cooking staff', 'Core staff',
        'Drivers', 'HR staff', 'High skill tech staff', 'IT staff',
        'Laborers', 'Low-skill Laborers', 'Managers', 'Medicine staff',
        'Private service staff', 'Realty agents', 'Sales staff',
        'Secretaries', 'Security staff', 'Waiters/barmen staff'],
       dtype=object)]

In [133]:
train['OCCUPATION_TYPE'].unique()

array([ 8.,  3.,  0., 10.,  1., 17.,  4., 14., 15.,  2., 16., 12.,  5.,
       11.,  6.,  7.,  9., 13.])

In [134]:
train['WEEKDAY_APPR_PROCESS_START'].unique()

array(['WEDNESDAY', 'MONDAY', 'THURSDAY', 'SUNDAY', 'SATURDAY', 'FRIDAY',
       'TUESDAY'], dtype=object)

In [135]:
test['WEEKDAY_APPR_PROCESS_START'].unique()

array(['TUESDAY', 'FRIDAY', 'WEDNESDAY', 'MONDAY', 'THURSDAY', 'SATURDAY',
       'SUNDAY'], dtype=object)

In [136]:
# Ordinal encoding of WEEKDAY_APPR_PROCESS_START (codes follow alphabetical order, e.g. FRIDAY=0, not calendar order)
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(train[['WEEKDAY_APPR_PROCESS_START']])
train['WEEKDAY_APPR_PROCESS_START'] = oe.transform(train[['WEEKDAY_APPR_PROCESS_START']])
test['WEEKDAY_APPR_PROCESS_START'] = oe.transform(test[['WEEKDAY_APPR_PROCESS_START']])
oe.categories_

[array(['FRIDAY', 'MONDAY', 'SATURDAY', 'SUNDAY', 'THURSDAY', 'TUESDAY',
        'WEDNESDAY'], dtype=object)]

In [137]:
train['WEEKDAY_APPR_PROCESS_START'].unique()

array([6., 1., 4., 3., 2., 0., 5.])

In [138]:
test['WEEKDAY_APPR_PROCESS_START'].unique()

array([5., 0., 6., 1., 4., 2., 3.])
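The fitted categories above are alphabetical (FRIDAY=0, MONDAY=1, ..., WEDNESDAY=6), which does not match the Mon=0 to Sun=6 convention described in the project overview. A minimal sketch of a calendar-order mapping, using a plain dict on a made-up series:

```python
import pandas as pd

weekday_order = [
    "MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
    "FRIDAY", "SATURDAY", "SUNDAY",
]
# MONDAY -> 0, ..., SUNDAY -> 6
weekday_to_int = {day: i for i, day in enumerate(weekday_order)}

demo = pd.Series(["WEDNESDAY", "MONDAY", "SUNDAY"], name="WEEKDAY_APPR_PROCESS_START")
encoded_days = demo.map(weekday_to_int)
print(encoded_days.tolist())
```

The same dict would be applied to both `train` and `test`, in the same way as the earlier `map` calls for the binary columns.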

In [139]:
train['ORGANIZATION_TYPE'].unique()

array(['Business Entity Type 3', 'School', 'Government', 'Religion',
       'Other', 'XNA', 'Electricity', 'Medicine',
       'Business Entity Type 2', 'Self-employed', 'Transport: type 2',
       'Construction', 'Housing', 'Kindergarten', 'Trade: type 7',
       'Industry: type 11', 'Military', 'Services', 'Security Ministries',
       'Transport: type 4', 'Industry: type 1', 'Emergency', 'Security',
       'Trade: type 2', 'University', 'Police', 'Business Entity Type 1',
       'Postal', 'Transport: type 3', 'Industry: type 4', 'Agriculture',
       'Restaurant', 'Culture', 'Hotel', 'Industry: type 7',
       'Trade: type 3', 'Industry: type 3', 'Bank', 'Industry: type 9',
       'Insurance', 'Trade: type 6', 'Industry: type 2',
       'Transport: type 1', 'Industry: type 12', 'Mobile',
       'Trade: type 1', 'Industry: type 5', 'Industry: type 10',
       'Legal Services', 'Advertising', 'Trade: type 5', 'Cleaning',
       'Industry: type 13', 'Trade: type 4', 'Telecom',
       'I

In [140]:
test['ORGANIZATION_TYPE'].unique()

array(['Kindergarten', 'Self-employed', 'Business Entity Type 3',
       'Government', 'Industry: type 9', 'School', 'Trade: type 2', 'XNA',
       'Services', 'Bank', 'Industry: type 3', 'Other', 'Trade: type 6',
       'Industry: type 12', 'Trade: type 7', 'Postal', 'Medicine',
       'Housing', 'Business Entity Type 2', 'Construction', 'Military',
       'Industry: type 4', 'Trade: type 3', 'Legal Services', 'Security',
       'Industry: type 11', 'University', 'Business Entity Type 1',
       'Agriculture', 'Transport: type 2', 'Transport: type 3',
       'Security Ministries', 'Industry: type 7', 'Transport: type 4',
       'Telecom', 'Emergency', 'Police', 'Industry: type 1',
       'Transport: type 1', 'Electricity', 'Industry: type 5', 'Hotel',
       'Restaurant', 'Advertising', 'Mobile', 'Trade: type 1',
       'Industry: type 8', 'Realtor', 'Cleaning', 'Industry: type 2',
       'Trade: type 4', 'Industry: type 6', 'Culture', 'Insurance',
       'Religion', 'Industry: type 1

In [141]:
# Ordinal encoding of ORGANIZATION_TYPE (58 categories; codes follow alphabetical order)
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(train[['ORGANIZATION_TYPE']])
train['ORGANIZATION_TYPE'] = oe.transform(train[['ORGANIZATION_TYPE']])
test['ORGANIZATION_TYPE'] = oe.transform(test[['ORGANIZATION_TYPE']])
oe.categories_

[array(['Advertising', 'Agriculture', 'Bank', 'Business Entity Type 1',
        'Business Entity Type 2', 'Business Entity Type 3', 'Cleaning',
        'Construction', 'Culture', 'Electricity', 'Emergency',
        'Government', 'Hotel', 'Housing', 'Industry: type 1',
        'Industry: type 10', 'Industry: type 11', 'Industry: type 12',
        'Industry: type 13', 'Industry: type 2', 'Industry: type 3',
        'Industry: type 4', 'Industry: type 5', 'Industry: type 6',
        'Industry: type 7', 'Industry: type 8', 'Industry: type 9',
        'Insurance', 'Kindergarten', 'Legal Services', 'Medicine',
        'Military', 'Mobile', 'Other', 'Police', 'Postal', 'Realtor',
        'Religion', 'Restaurant', 'School', 'Security',
        'Security Ministries', 'Self-employed', 'Services', 'Telecom',
        'Trade: type 1', 'Trade: type 2', 'Trade: type 3', 'Trade: type 4',
        'Trade: type 5', 'Trade: type 6', 'Trade: type 7',
        'Transport: type 1', 'Transport: type 2', 'Trans

In [142]:
train['ORGANIZATION_TYPE'].unique()

array([ 5., 39., 11., 37., 33., 57.,  9., 30.,  4., 42., 53.,  7., 13.,
       28., 51., 16., 31., 43., 41., 55., 14., 10., 40., 46., 56., 34.,
        3., 35., 54., 21.,  1., 38.,  8., 12., 24., 47., 20.,  2., 26.,
       27., 50., 19., 52., 17., 32., 45., 22., 15., 29.,  0., 49.,  6.,
       18., 48., 44., 25., 36., 23.])
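`ORGANIZATION_TYPE` has 58 levels, and the project overview lists frequency encoding for this column rather than ordinal codes. A hedged sketch of frequency encoding follows: frequencies are learned on the training frame only and then mapped onto both frames. The `_FREQ` column suffix, the demo rows, and the choice to fill unseen categories with 0.0 are assumptions.

```python
import pandas as pd

# Made-up rows for illustration; "Realtor" does not occur in train_demo.
train_demo = pd.DataFrame({"ORGANIZATION_TYPE": ["School", "School", "Bank", "XNA"]})
test_demo = pd.DataFrame({"ORGANIZATION_TYPE": ["Bank", "School", "Realtor"]})

# Relative frequency of each category in the training data only.
freq = train_demo["ORGANIZATION_TYPE"].value_counts(normalize=True)

train_demo["ORGANIZATION_TYPE_FREQ"] = train_demo["ORGANIZATION_TYPE"].map(freq)
# Categories never seen in train get frequency 0.0.
test_demo["ORGANIZATION_TYPE_FREQ"] = test_demo["ORGANIZATION_TYPE"].map(freq).fillna(0.0)

print(train_demo)
print(test_demo)
```

Learning the frequencies on `train` alone avoids leaking information about the test distribution into the feature.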

In [144]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 304319 entries, 0 to 304526
Data columns (total 81 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   SK_ID_CURR                             304319 non-null  int64  
 1   TARGET                                 304319 non-null  int64  
 2   NAME_CONTRACT_TYPE                     304319 non-null  int64  
 3   CODE_GENDER                            304319 non-null  int64  
 4   FLAG_OWN_CAR                           304319 non-null  int64  
 5   FLAG_OWN_REALTY                        304319 non-null  int64  
 6   CNT_CHILDREN                           304319 non-null  int64  
 7   AMT_INCOME_TOTAL                       304319 non-null  int64  
 8   AMT_CREDIT                             304319 non-null  float64
 9   AMT_ANNUITY                            304319 non-null  float64
 10  AMT_GOODS_PRICE                        304319 non-null  float