# 🧠 Feature Engineering - Credit Risk Modeling Project

This document outlines the complete **feature engineering workflow** applied to the Home Credit Default Risk dataset. The goal is to transform raw data into meaningful features that improve model performance and interpretability.

---

## 🔹 1. Data Cleaning

- Converted the `DAYS_*` columns from negative day offsets to positive durations in years (e.g., `DAYS_BIRTH` to `AGE_YEARS`).
- Handled missing values for critical features such as `EXT_SOURCE_2`, `EXT_SOURCE_3`, and `OCCUPATION_TYPE` using imputation strategies.
- Ensured consistent binary flag formatting (0/1) across all `FLAG_*` columns.
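
The cleaning steps above could look roughly like the sketch below. The toy values, the median imputation for `EXT_SOURCE_2`, and the `"Unknown"` category for `OCCUPATION_TYPE` are assumptions made for illustration; the summary does not state which imputation strategies were actually used.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the application table (values are illustrative).
df = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765, -19046],
    "EXT_SOURCE_2": [0.26, np.nan, 0.56],
    "OCCUPATION_TYPE": ["Laborers", None, "Core staff"],
    "FLAG_OWN_CAR": ["N", "Y", "N"],
})

# DAYS_* columns count days backwards from the application date, so negate them.
df["AGE_YEARS"] = -df["DAYS_BIRTH"] / 365

# Assumed strategies: median for the numeric score, an explicit level for the category.
df["EXT_SOURCE_2"] = df["EXT_SOURCE_2"].fillna(df["EXT_SOURCE_2"].median())
df["OCCUPATION_TYPE"] = df["OCCUPATION_TYPE"].fillna("Unknown")

# Normalise Y/N flags to 0/1 integers.
df["FLAG_OWN_CAR"] = df["FLAG_OWN_CAR"].map({"N": 0, "Y": 1}).astype(int)
```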

---

## 🔹 2. Categorical Encoding

- **Label Encoding** for binary categorical variables (e.g., `CODE_GENDER`, `FLAG_OWN_CAR`, `FLAG_OWN_REALTY`).
- **One-Hot Encoding** for nominal variables (e.g., `NAME_CONTRACT_TYPE`, `NAME_EDUCATION_TYPE`, `NAME_FAMILY_STATUS`).
- **Frequency Encoding** for high-cardinality variables like `ORGANIZATION_TYPE`.
- Converted `WEEKDAY_APPR_PROCESS_START` to numerical day of the week (Mon=0 to Sun=6).
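
A minimal pandas sketch of the four encodings listed above, on a toy frame. The F=0/M=1 order, the use of relative frequencies (rather than raw counts), and the upper-case weekday spellings are assumptions, not details confirmed by this document.

```python
import pandas as pd

df = pd.DataFrame({
    "CODE_GENDER": ["M", "F", "F", "M"],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", "Revolving loans", "Cash loans"],
    "ORGANIZATION_TYPE": ["School", "Bank", "School", "School"],
    "WEEKDAY_APPR_PROCESS_START": ["MONDAY", "SUNDAY", "WEDNESDAY", "FRIDAY"],
})

# Binary variable: map the two levels to 0/1 (the order is an arbitrary choice).
df["CODE_GENDER"] = df["CODE_GENDER"].map({"F": 0, "M": 1})

# High-cardinality variable: replace each level by its relative frequency.
freq = df["ORGANIZATION_TYPE"].value_counts(normalize=True)
df["ORGANIZATION_TYPE_FREQ"] = df["ORGANIZATION_TYPE"].map(freq)

# Weekday names to integers, Monday=0 ... Sunday=6.
weekdays = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
            "FRIDAY", "SATURDAY", "SUNDAY"]
df["WEEKDAY_APPR_PROCESS_START"] = df["WEEKDAY_APPR_PROCESS_START"].map(
    {name: i for i, name in enumerate(weekdays)}
)

# Nominal variable: one indicator column per level.
df = pd.get_dummies(df, columns=["NAME_CONTRACT_TYPE"], dtype=int)
```

In a train/test setting, the frequency table would normally be computed on the training data only and then mapped onto the test data.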

---

## 🔹 3. Numerical Feature Transformation

Created new features and ratios to provide more meaningful relationships:

- `INCOME_PER_PERSON` = `AMT_INCOME_TOTAL` / `CNT_FAM_MEMBERS`
- `CREDIT_TO_INCOME` = `AMT_CREDIT` / `AMT_INCOME_TOTAL`
- `ANNUITY_TO_INCOME` = `AMT_ANNUITY` / `AMT_INCOME_TOTAL`
- `ANNUITY_TO_CREDIT` = `AMT_ANNUITY` / `AMT_CREDIT`
- `GOODS_TO_CREDIT` = `AMT_GOODS_PRICE` / `AMT_CREDIT`
- `AGE_YEARS` = `-DAYS_BIRTH / 365`
- `EMPLOYMENT_YEARS` = `-DAYS_EMPLOYED / 365`
- `YEARS_SINCE_REGISTRATION` = `-DAYS_REGISTRATION / 365`
- `YEARS_ID_ISSUED` = `-DAYS_ID_PUBLISH / 365`
- `WORKING_LIFE_RATIO` = `EMPLOYMENT_YEARS / AGE_YEARS`
- `TIME_SINCE_LAST_PHONE_CHANGE` = `-DAYS_LAST_PHONE_CHANGE / 365`
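
A few of these ratios can be sketched as follows. The helper `safe_ratio` is a hypothetical name introduced here, and returning NaN for a zero denominator is an assumption; the family-size values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0, 270000.0],
    "AMT_CREDIT": [406597.5, 1293502.5],
    "AMT_ANNUITY": [24700.5, 35698.5],
    "CNT_FAM_MEMBERS": [1.0, 2.0],      # illustrative values
    "DAYS_BIRTH": [-9461, -16765],
    "DAYS_EMPLOYED": [-637, -1188],
})

def safe_ratio(num: pd.Series, den: pd.Series) -> pd.Series:
    """Divide two Series, yielding NaN where the denominator is zero."""
    return num / den.replace(0, np.nan)

df["INCOME_PER_PERSON"] = safe_ratio(df["AMT_INCOME_TOTAL"], df["CNT_FAM_MEMBERS"])
df["CREDIT_TO_INCOME"] = safe_ratio(df["AMT_CREDIT"], df["AMT_INCOME_TOTAL"])
df["ANNUITY_TO_INCOME"] = safe_ratio(df["AMT_ANNUITY"], df["AMT_INCOME_TOTAL"])

# Day offsets are negative, so negate before converting to years.
df["AGE_YEARS"] = -df["DAYS_BIRTH"] / 365
df["EMPLOYMENT_YEARS"] = -df["DAYS_EMPLOYED"] / 365
df["WORKING_LIFE_RATIO"] = safe_ratio(df["EMPLOYMENT_YEARS"], df["AGE_YEARS"])
```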

---

## 🔹 4. Flag Aggregation

Combined multiple binary flags to create summary features:

- `PHONE_FLAGS_SUM` = Sum of `FLAG_EMP_PHONE`, `FLAG_WORK_PHONE`, `FLAG_CONT_MOBILE`, `FLAG_PHONE`, `FLAG_MOBIL`, `FLAG_EMAIL`
- `REGION_MISMATCH_SUM` = Sum of `REG_REGION_NOT_LIVE_REGION`, `REG_REGION_NOT_WORK_REGION`, `LIVE_REGION_NOT_WORK_REGION`, `REG_CITY_NOT_LIVE_CITY`, `REG_CITY_NOT_WORK_CITY`, `LIVE_CITY_NOT_WORK_CITY`
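
As an illustration, both sums reduce to a row-wise `sum` over a list of columns. The toy frame below includes only two of the six mismatch flags to keep it short.

```python
import pandas as pd

df = pd.DataFrame({
    "FLAG_EMP_PHONE": [1, 0],
    "FLAG_WORK_PHONE": [0, 0],
    "FLAG_CONT_MOBILE": [1, 1],
    "FLAG_PHONE": [1, 0],
    "FLAG_MOBIL": [1, 1],
    "FLAG_EMAIL": [0, 0],
    "REG_REGION_NOT_LIVE_REGION": [0, 1],
    "REG_CITY_NOT_WORK_CITY": [1, 1],
})

phone_flags = ["FLAG_EMP_PHONE", "FLAG_WORK_PHONE", "FLAG_CONT_MOBILE",
               "FLAG_PHONE", "FLAG_MOBIL", "FLAG_EMAIL"]
# Subset of the mismatch flags present in this toy frame.
region_flags = ["REG_REGION_NOT_LIVE_REGION", "REG_CITY_NOT_WORK_CITY"]

# Row-wise sums of the 0/1 flags.
df["PHONE_FLAGS_SUM"] = df[phone_flags].sum(axis=1)
df["REGION_MISMATCH_SUM"] = df[region_flags].sum(axis=1)
```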

---

## 🔹 5. Social Circle Features

Aggregated social circle statistics:

- `SOCIAL_OBS_TOTAL` = `OBS_30_CNT_SOCIAL_CIRCLE` + `OBS_60_CNT_SOCIAL_CIRCLE`
- `SOCIAL_DEF_TOTAL` = `DEF_30_CNT_SOCIAL_CIRCLE` + `DEF_60_CNT_SOCIAL_CIRCLE`
- `SOCIAL_DEF_RATIO` = `SOCIAL_DEF_TOTAL` / `SOCIAL_OBS_TOTAL` (handled divide-by-zero cases)
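
One way to handle the divide-by-zero case is sketched below. Setting the ratio to 0 when there are no observations is an assumption; the summary does not say which fallback value was used.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "OBS_30_CNT_SOCIAL_CIRCLE": [2.0, 0.0, 4.0],
    "OBS_60_CNT_SOCIAL_CIRCLE": [2.0, 0.0, 4.0],
    "DEF_30_CNT_SOCIAL_CIRCLE": [1.0, 0.0, 0.0],
    "DEF_60_CNT_SOCIAL_CIRCLE": [1.0, 0.0, 2.0],
})

df["SOCIAL_OBS_TOTAL"] = df["OBS_30_CNT_SOCIAL_CIRCLE"] + df["OBS_60_CNT_SOCIAL_CIRCLE"]
df["SOCIAL_DEF_TOTAL"] = df["DEF_30_CNT_SOCIAL_CIRCLE"] + df["DEF_60_CNT_SOCIAL_CIRCLE"]

# Zero observations would give 0/0; turn the denominator into NaN first,
# then fall back to 0 (assumed convention) for those rows.
denominator = df["SOCIAL_OBS_TOTAL"].replace(0, np.nan)
df["SOCIAL_DEF_RATIO"] = (df["SOCIAL_DEF_TOTAL"] / denominator).fillna(0)
```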

---

## 🔹 6. Document Flags Aggregation

- `NUM_DOCUMENTS_PROVIDED` = Sum of all `FLAG_DOCUMENT_*` features
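
A short sketch of this aggregation, selecting the document flags by name prefix so the column list does not have to be typed out. Only three document flags appear in the toy frame.

```python
import pandas as pd

df = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003],
    "FLAG_DOCUMENT_2": [0, 0],
    "FLAG_DOCUMENT_3": [1, 1],
    "FLAG_DOCUMENT_6": [0, 1],
})

# Pick every column whose name starts with FLAG_DOCUMENT_.
doc_cols = df.filter(regex=r"^FLAG_DOCUMENT_").columns
df["NUM_DOCUMENTS_PROVIDED"] = df[doc_cols].sum(axis=1)
```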

---

## 🔹 7. Interaction Features

Created features that combine external scores and financial ratios:

- `EXT_SOURCE_AVG` = Mean of `EXT_SOURCE_2` and `EXT_SOURCE_3`
- `EXT_SOURCE_MIN`, `EXT_SOURCE_MAX` = Min and Max of `EXT_SOURCE_2`, `EXT_SOURCE_3`
- `CREDIT_INCOME_EXT_SOURCE` = (`AMT_CREDIT` / `AMT_INCOME_TOTAL`) * `EXT_SOURCE_AVG`
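
These interaction features might be computed as in the sketch below. It relies on the pandas default of skipping NaN in row-wise `mean`, `min`, and `max`, so a row with one missing score still gets a value; whether the original pipeline imputed the scores first is not stated here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "EXT_SOURCE_2": [0.25, 0.50],
    "EXT_SOURCE_3": [0.75, np.nan],
    "AMT_CREDIT": [400000.0, 300000.0],
    "AMT_INCOME_TOTAL": [200000.0, 100000.0],
})

ext = df[["EXT_SOURCE_2", "EXT_SOURCE_3"]]

# Row-wise statistics; NaN entries are skipped by default.
df["EXT_SOURCE_AVG"] = ext.mean(axis=1)
df["EXT_SOURCE_MIN"] = ext.min(axis=1)
df["EXT_SOURCE_MAX"] = ext.max(axis=1)

# Credit-to-income ratio scaled by the average external score.
df["CREDIT_INCOME_EXT_SOURCE"] = (
    df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"] * df["EXT_SOURCE_AVG"]
)
```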

---

## 🔹 8. Temporal Features

- `APPLICATION_HOUR_BIN` = Categorized `HOUR_APPR_PROCESS_START` into time bins (Night, Morning, Afternoon, Evening)
- Converted `WEEKDAY_APPR_PROCESS_START` into integer codes (0–6)
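
The hour binning can be sketched with `pd.cut`. The bin edges (0, 6, 12, 18, 24) are an assumption, since the summary names the four bins but not their boundaries.

```python
import pandas as pd

hours = pd.Series([3, 9, 14, 20, 0, 23], name="HOUR_APPR_PROCESS_START")

# Assumed left-closed bins: [0, 6) Night, [6, 12) Morning,
# [12, 18) Afternoon, [18, 24) Evening.
bins = [0, 6, 12, 18, 24]
labels = ["Night", "Morning", "Afternoon", "Evening"]

application_hour_bin = pd.cut(hours, bins=bins, labels=labels, right=False)
```

With `right=False` each interval includes its left edge, so hour 0 falls into Night and hour 6 into Morning.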


---

## ✅ Final Feature Matrix

After the above transformations, the final dataset included:

- Original base features
- 20+ engineered features
- Cleaned and encoded categorical variables
- Aggregated and interaction-based enhancements

These were used for training various models including Logistic Regression, Random Forest, XGBoost, and LightGBM.

---



In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [6]:
train = pd.read_csv('train_preprocessed_3.csv')
test = pd.read_csv('test_preprocessed_3.csv')

In [7]:
train.columns

Index(['Unnamed: 0', 'SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE',
       'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE',
       'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH',
       'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'OBS_30_CNT_SOCIAL_CIRC

In [8]:
train.head()

,Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,...,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_Income,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year
0,0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,...,0.0,1.0,0,0,0,0,0,0,0,0
1,1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,...,0.0,0.0,0,0,0,0,0,0,0,0
2,2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,...,0.0,0.0,0,0,0,0,0,0,0,0
3,3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,...,2.0,3.0,0,0,1,0,0,1,0,0
4,4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,...,0.0,0.0,0,0,0,0,0,0,0,0


In [9]:
test.head()

,Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier
0,0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0,0
2,2,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0,0
3,3,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,...,0,0,0,0.0,1.0,1.0,1.0,1.0,1.0,0
4,4,100042,Cash loans,F,Y,Y,0,270000.0,959688.0,34600.5,...,0,0,0,0.0,0.0,0.0,0.0,1.0,2.0,0


In [10]:
train.drop('Unnamed: 0',axis=1,inplace=True)

In [11]:
train.head()

,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_Income,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,0.0,1.0,0,0,0,0,0,0,0,0
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0.0,0.0,0,0,0,0,0,0,0,0
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0.0,0.0,0,0,0,0,0,0,0,0
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,2.0,3.0,0,0,1,0,0,1,0,0
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0.0,0.0,0,0,0,0,0,0,0,0


In [12]:
test.drop('Unnamed: 0',axis=1,inplace=True)

In [13]:
test.head()

,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0,0
2,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0,0
3,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0.0,1.0,1.0,1.0,1.0,1.0,0
4,100042,Cash loans,F,Y,Y,0,270000.0,959688.0,34600.5,810000.0,...,0,0,0,0.0,0.0,0.0,0.0,1.0,2.0,0


In [14]:
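# Preview only: set_index returns a new DataFrame and the result is not assigned, so train is unchanged
train.set_index('SK_ID_CURR')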

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_Income,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year
100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,351000.0,...,0.0,1.0,0,0,0,0,0,0,0,0
100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,1129500.0,...,0.0,0.0,0,0,0,0,0,0,0,0
100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,135000.0,...,0.0,0.0,0,0,0,0,0,0,0,0
100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,297000.0,...,2.0,3.0,0,0,1,0,0,1,0,0
100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,513000.0,...,0.0,0.0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456251,0,Cash loans,M,N,N,0,157500,254700.0,27558.0,225000.0,...,2.0,2.0,0,0,1,0,1,0,0,0
456252,0,Cash loans,F,N,Y,0,72000,269550.0,12001.5,225000.0,...,1.0,1.0,0,0,0,0,1,1,0,0
456253,0,Cash loans,F,N,Y,0,153000,677664.0,29979.0,585000.0,...,0.0,1.0,0,0,1,0,0,1,0,0
456254,1,Cash loans,F,N,Y,0,171000,370107.0,20205.0,319500.0,...,0.0,0.0,0,0,0,0,0,0,0,0


In [15]:
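# Preview only: set_index returns a new DataFrame and the result is not assigned, so test is unchanged
test.set_index('SK_ID_CURR')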

SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier
100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0,0
100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0,0
100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,Unaccompanied,...,0,0,0,0.0,1.0,1.0,1.0,1.0,1.0,0
100042,Cash loans,F,Y,Y,0,270000.0,959688.0,34600.5,810000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,1.0,2.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456221,Cash loans,F,N,Y,0,121500.0,412560.0,17473.5,270000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0
456222,Cash loans,F,N,N,2,157500.0,622413.0,31909.5,495000.0,Unaccompanied,...,0,0,0,0.0,0.0,2.0,1.0,1.0,0.0,0
456223,Cash loans,F,Y,Y,1,202500.0,315000.0,33205.5,315000.0,Unaccompanied,...,0,0,0,0.0,0.0,0.0,0.0,3.0,1.0,0
456224,Cash loans,M,N,N,0,225000.0,450000.0,25128.0,450000.0,Family,...,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0,0


In [16]:
# train still has its default integer index at this point, so this counts duplicate row labels
train.index.duplicated().sum()

0

In [17]:
train['DAYS_BIRTH'].isnull().sum()

0

In [18]:
train['AGE_YEAR'] = train['DAYS_BIRTH']/-365

In [19]:
train.head()

,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,AMT_REQ_CREDIT_BUREAU_YEAR,is_outlier_Income,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,1.0,0,0,0,0,0,0,0,0,25.920548
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0.0,0,0,0,0,0,0,0,0,45.931507
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0.0,0,0,0,0,0,0,0,0,52.180822
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,3.0,0,0,1,0,0,1,0,0,52.068493
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0.0,0,0,0,0,0,0,0,0,54.608219


In [20]:
train['AGE_YEAR'].max()

69.12054794520547

In [21]:
train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_C

In [22]:
train['DAYS_EMPLOYED'].max()

365243

In [23]:
train['DAYS_EMPLOYED'].sort_values()

278271    -17912
270409    -17583
206879    -17546
34841     -17531
231902    -17522
           ...  
49473     365243
49471     365243
190393    365243
219654    365243
159783    365243
Name: DAYS_EMPLOYED, Length: 304527, dtype: int64

In [24]:
train['DAYS_EMPLOYED'].value_counts()

DAYS_EMPLOYED
 365243    54852
-200         154
-224         149
-199         149
-230         147
           ...  
-13952         1
-11498         1
-14619         1
-9361          1
-8694          1
Name: count, Length: 12556, dtype: int64

- The `DAYS_EMPLOYED` column contains 54,852 rows with the value 365243, which converts to roughly 1,000 years of employment.
- This is clearly invalid data, so it has to be handled before the column is converted to years.
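
As a point of comparison only, and not the approach taken in the cells that follow, a common alternative is to keep a flag for the affected rows and mark the value as missing instead of overwriting it. The variable names below are hypothetical.

```python
import pandas as pd

days = pd.Series([-637, 365243, -1188, 365243], name="DAYS_EMPLOYED")
SENTINEL = 365243  # implausible value observed in DAYS_EMPLOYED

# 365243 days is about 1000.7 years, which cannot be a real employment length.
sentinel_years = SENTINEL / 365

# Keep a 0/1 indicator so the information "value was a placeholder" is not lost.
is_placeholder = days.eq(SENTINEL).astype(int)

# Mask the placeholder as NaN, then convert the remaining offsets to years.
days_clean = days.mask(days.eq(SENTINEL))
years_employed = -days_clean / 365
```

Tree-based models such as XGBoost and LightGBM accept NaN inputs directly, while models like Logistic Regression would still need an imputation step.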

In [25]:
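# Replace the invalid value 365243 in DAYS_EMPLOYED with 0
# (assign the result back; inplace=True on a selected column may not update train)
train['DAYS_EMPLOYED'] = train['DAYS_EMPLOYED'].replace(365243, 0)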

In [26]:
train['DAYS_EMPLOYED'].value_counts()

DAYS_EMPLOYED
 0        54854
-200        154
-224        149
-199        149
-230        147
          ...  
-7859         1
-11498        1
-14619        1
-9361         1
-8694         1
Name: count, Length: 12555, dtype: int64

In [27]:
train['DAYS_EMPLOYED'].mean()

-1956.1603174759546

In [28]:
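# Replace every 0 in DAYS_EMPLOYED with the column mean
# (the mean is computed with the zeros included, and the 2 rows that were already 0 are replaced too)
train['DAYS_EMPLOYED'] = train['DAYS_EMPLOYED'].replace(0, train['DAYS_EMPLOYED'].mean())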

In [29]:
train['DAYS_EMPLOYED'].value_counts()

DAYS_EMPLOYED
-1956.160317     54854
-200.000000        154
-224.000000        149
-199.000000        149
-230.000000        147
                 ...  
-7859.000000         1
-11498.000000        1
-14619.000000        1
-9361.000000         1
-8694.000000         1
Name: count, Length: 12555, dtype: int64

In [30]:
train['YEAR_EMPLOYED'] = train['DAYS_EMPLOYED']/-365

In [31]:
train.head()

,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_Income,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,0,0,0,0,0,0,0,0,25.920548,1.745205
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0,0,0,0,0,0,0,0,45.931507,3.254795
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0,0,0,0,0,0,0,0,52.180822,0.616438
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,0,0,1,0,0,1,0,0,52.068493,8.326027
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0,0,0,0,0,0,0,0,54.608219,8.323288


In [32]:
train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_C

In [33]:
train['DAYS_REGISTRATION'].value_counts()

DAYS_REGISTRATION
-1.0        113
-7.0         97
-6.0         95
-2.0         91
-4.0         90
           ... 
-15854.0      1
-15240.0      1
-14539.0      1
-15524.0      1
-14798.0      1
Name: count, Length: 15678, dtype: int64

In [34]:
train['DAYS_REGISTRATION'].describe()

count    304527.000000
mean      -4986.698950
std        3521.597752
min      -24672.000000
25%       -7478.000000
50%       -4505.000000
75%       -2012.000000
max           0.000000
Name: DAYS_REGISTRATION, dtype: float64

In [35]:
train['DAYS_REGISTRATION'].sort_values()

231828   -24672.0
162048   -23738.0
247054   -23416.0
286371   -22928.0
208909   -22858.0
           ...   
247321        0.0
115298        0.0
139046        0.0
171698        0.0
115213        0.0
Name: DAYS_REGISTRATION, Length: 304527, dtype: float64

In [36]:
# Replace -1 with 0 in DAYS_REGISTRATION so both values are dropped together in the next step
train['DAYS_REGISTRATION'] = train['DAYS_REGISTRATION'].replace(-1, 0)

In [37]:
train['DAYS_REGISTRATION'].value_counts()

DAYS_REGISTRATION
 0.0        192
-7.0         97
-6.0         95
-2.0         91
-4.0         90
           ... 
-15854.0      1
-15240.0      1
-14539.0      1
-15524.0      1
-14798.0      1
Name: count, Length: 15677, dtype: int64

In [38]:
# Drop rows where DAYS_REGISTRATION is 0; .copy() avoids assigning new columns to a view later
train = train[train['DAYS_REGISTRATION'] != 0].copy()

In [39]:
train['DAYS_REGISTRATION'].value_counts()

DAYS_REGISTRATION
-7.0        97
-6.0        95
-2.0        91
-4.0        90
-5.0        85
            ..
-19207.0     1
-14993.0     1
-16022.0     1
-14410.0     1
-14798.0     1
Name: count, Length: 15676, dtype: int64

In [40]:
train['YEAR_REGISTRATION'] = train['DAYS_REGISTRATION']/-365

In [41]:
train.head()

,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED,YEAR_REGISTRATION
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,0,0,0,0,0,0,0,25.920548,1.745205,9.994521
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0,0,0,0,0,0,0,45.931507,3.254795,3.249315
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0,0,0,0,0,0,0,52.180822,0.616438,11.671233
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,0,1,0,0,1,0,0,52.068493,8.326027,26.939726
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0,0,0,0,0,0,0,54.608219,8.323288,11.810959


In [42]:
train['DAYS_ID_PUBLISH'].describe()

count    304335.000000
mean      -2994.964848
std        1509.337299
min       -7197.000000
25%       -4299.000000
50%       -3255.000000
75%       -1721.000000
max           0.000000
Name: DAYS_ID_PUBLISH, dtype: float64

In [43]:
train['DAYS_ID_PUBLISH'].value_counts()

DAYS_ID_PUBLISH
-4053    168
-4046    161
-4095    159
-4256    157
-4417    156
        ... 
-6233      1
-6151      1
-5902      1
-5921      1
-6211      1
Name: count, Length: 6167, dtype: int64

In [44]:
train['DAYS_ID_PUBLISH'].min()

-7197

In [45]:
train['DAYS_ID_PUBLISH'].max()

0

In [46]:
train[train['DAYS_ID_PUBLISH'] == 0]

,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_Credit,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED,YEAR_REGISTRATION
2198,102606,0,Cash loans,F,N,N,0,90000,467257.5,16911.0,...,0,0,0,0,1,0,0,45.079452,4.049315,22.038356
20146,123746,0,Cash loans,F,N,N,0,225000,808650.0,26217.0,...,0,1,1,1,1,0,0,45.10137,23.2,23.224658
37812,144217,0,Cash loans,M,Y,Y,2,225000,526491.0,32337.0,...,0,0,0,0,0,0,0,37.139726,4.975342,4.219178
74984,187795,0,Cash loans,M,N,Y,0,135000,760225.5,34483.5,...,0,0,0,0,0,0,0,45.068493,5.709589,5.309589
82605,196732,1,Cash loans,F,N,N,0,117000,545040.0,20677.5,...,0,0,0,0,1,0,0,45.054795,0.931507,4.260274
85621,200329,0,Cash loans,F,Y,Y,0,135000,1350000.0,37255.5,...,0,0,0,0,1,0,0,45.89863,12.271233,17.561644
85859,200618,0,Cash loans,M,Y,Y,0,675000,1227901.5,48825.0,...,0,0,0,0,0,0,0,45.057534,8.356164,12.854795
90415,206024,0,Cash loans,F,N,Y,0,144000,1546020.0,45333.0,...,0,0,0,0,0,0,0,45.087671,7.29589,11.747945
148363,273684,0,Cash loans,M,Y,Y,0,135000,595903.5,26379.0,...,0,1,0,0,1,0,0,45.558904,10.942466,7.219178
214454,350917,0,Cash loans,F,N,Y,0,135000,1125000.0,33025.5,...,0,0,0,0,0,0,0,51.909589,5.359343,27.561644


In [47]:
# Drop rows where DAYS_ID_PUBLISH is 0; .copy() avoids assigning new columns to a view later
train = train[train['DAYS_ID_PUBLISH'] != 0].copy()

In [48]:
train['DAYS_ID_PUBLISH'].describe()

count    304319.000000
mean      -2995.122312
std        1509.220736
min       -7197.000000
25%       -4299.000000
50%       -3255.000000
75%       -1721.000000
max          -1.000000
Name: DAYS_ID_PUBLISH, dtype: float64

In [49]:
train['YEAR_ID_PUBLISH'] = train['DAYS_ID_PUBLISH']/-365

In [50]:
train.head()

,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_amt_req_credit_bureau_hour,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED,YEAR_REGISTRATION,YEAR_ID_PUBLISH
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,0,0,0,0,0,0,25.920548,1.745205,9.994521,5.808219
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0,0,0,0,0,0,45.931507,3.254795,3.249315,0.79726
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0,0,0,0,0,0,52.180822,0.616438,11.671233,6.934247
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,1,0,0,1,0,0,52.068493,8.326027,26.939726,6.676712
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0,0,0,0,0,0,54.608219,8.323288,11.810959,9.473973


In [51]:
train['DAYS_LAST_PHONE_CHANGE'].describe()

count    304319.000000
mean       -965.416080
std         826.939133
min       -4292.000000
25%       -1572.000000
50%        -761.000000
75%        -276.000000
max           0.000000
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64

In [52]:
train['DAYS_LAST_PHONE_CHANGE'].value_counts()

DAYS_LAST_PHONE_CHANGE
 0.0       37302
-1.0        2721
-2.0        2200
-3.0        1684
-4.0        1235
           ...  
-3514.0        1
-3894.0        1
-3884.0        1
-3713.0        1
-3538.0        1
Name: count, Length: 3770, dtype: int64

In [53]:
# Replace every 0 in DAYS_LAST_PHONE_CHANGE with the column mean (computed with the zeros included)
train['DAYS_LAST_PHONE_CHANGE'] = train['DAYS_LAST_PHONE_CHANGE'].replace(0, train['DAYS_LAST_PHONE_CHANGE'].mean())

In [54]:
train['DAYS_LAST_PHONE_CHANGE'].value_counts()

DAYS_LAST_PHONE_CHANGE
-965.41608     37302
-1.00000        2721
-2.00000        2200
-3.00000        1684
-4.00000        1235
               ...  
-3514.00000        1
-3894.00000        1
-3884.00000        1
-3713.00000        1
-3538.00000        1
Name: count, Length: 3770, dtype: int64

In [55]:
train['YEAR_LAST_PHONE_CHANGE'] = train['DAYS_LAST_PHONE_CHANGE']/-365

In [56]:
train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_C

In [57]:
# Drop the raw DAYS_* columns now that their year-based versions exist
train.drop(['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'DAYS_LAST_PHONE_CHANGE'], axis=1, inplace=True)

In [58]:
train.head()

,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,is_outlier_amt_req_credit_bureau_day,is_outlier_amt_req_credit_bureau_week,is_outlier_amt_req_credit_bureau_mon,is_outlier_amt_req_credit_bureau_qrt,is_outlier_amt_req_credit_bureau_year,AGE_YEAR,YEAR_EMPLOYED,YEAR_REGISTRATION,YEAR_ID_PUBLISH,YEAR_LAST_PHONE_CHANGE
0,100002,1,Cash loans,M,N,Y,0,202500,406597.5,24700.5,...,0,0,0,0,0,25.920548,1.745205,9.994521,5.808219,3.106849
1,100003,0,Cash loans,F,N,N,0,270000,1293502.5,35698.5,...,0,0,0,0,0,45.931507,3.254795,3.249315,0.79726,2.268493
2,100004,0,Revolving loans,M,Y,Y,0,67500,135000.0,6750.0,...,0,0,0,0,0,52.180822,0.616438,11.671233,6.934247,2.232877
3,100006,0,Cash loans,F,N,Y,0,135000,312682.5,29686.5,...,0,0,1,0,0,52.068493,8.326027,26.939726,6.676712,1.690411
4,100007,0,Cash loans,M,N,Y,0,121500,513000.0,21865.5,...,0,0,0,0,0,54.608219,8.323288,11.810959,9.473973,3.030137
