## The goal of this notebook is to take the cleaned dataset from the data cleaning and label construction notebook, apply some feature engineering, and then run an exploratory analysis comparing the features to the credit label. We will focus primarily on the lifetime performance window.

### Feature Engineering

Our feature set mixes continuous features, binary categorical features, and multiclass features. We want to encode all of them so that they sit on an equal footing. To do so, we will do two things: encode the multiclass features via a one-of-k (one-hot) encoding, and normalize the continuous features so that they lie between 0 and 1. Let's begin with normalization, as it is the easier step. The dataset we will use is the final dataframe constructed in the "Label Construction" notebook, which contains the application data with one row per customer, along with the labels for each customer.
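As a rough illustration of the two transformations described above, here is a minimal sketch on a toy dataframe. The values are made up for the example and the column names only mimic the ones used later in this notebook; `pd.get_dummies` is one common way to get a one-of-k encoding and is not necessarily the approach taken further on.

```python
import pandas as pd

# Toy frame with hypothetical values; not taken from the real dataset.
toy = pd.DataFrame({
    "Income": [90000.0, 180000.0, 270000.0],
    "Housing": ["Rented apartment", "House / apartment", "Municipal apartment"],
})

# Min-max scaling: (x - min) / (max - min) maps the column onto [0, 1].
income = toy["Income"]
toy["Income"] = (income - income.min()) / (income.max() - income.min())

# One-of-k encoding: one indicator column per category level.
encoded = pd.get_dummies(toy, columns=["Housing"], dtype=float)
print(encoded)
```

After both steps every column lies in the range 0 to 1, which is the "equal footing" the paragraph above asks for.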

In [8]:
import sys
import pickle
import itertools
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

In [3]:
# pd.read_csv accepts a path directly, so no explicit file handle is needed.
feature_df = pd.read_csv("application_data_with_harsh_delinquency_labels.csv")

In [6]:
# drop() returns a new DataFrame for display; feature_df itself still contains 'Unnamed: 0'.
feature_df.drop(columns='Unnamed: 0')

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,...,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,IDList,6_Month,12_Month,24_Month,Lifetime
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,...,1,0,0,Null,2.0,"[5008804, 5008805]",1.0,1.0,1.0,1.0
1,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,...,0,0,0,Security staff,2.0,[5008806],0.0,0.0,0.0,0.0
2,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,...,0,1,1,Sales staff,1.0,"[5008808, 5008809, 5008810, 5008811]",1.0,1.0,1.0,0.0
3,5008812,F,N,Y,0,283500.0,Pensioner,Higher education,Separated,House / apartment,...,0,0,0,Null,1.0,"[5008812, 5008813, 5008814]",0.0,0.0,1.0,0.0
4,5008815,M,Y,Y,0,270000.0,Working,Higher education,Married,House / apartment,...,1,1,1,Accountants,2.0,"[5008815, 5112956]",0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9723,5148694,F,N,N,0,180000.0,Pensioner,Secondary / secondary special,Civil marriage,Municipal apartment,...,0,0,0,Laborers,2.0,[5148694],1.0,1.0,1.0,1.0
9724,5149055,F,N,Y,0,112500.0,Commercial associate,Secondary / secondary special,Married,House / apartment,...,1,1,0,Null,2.0,"[5149055, 5149056]",1.0,1.0,1.0,1.0
9725,5149729,M,Y,Y,0,90000.0,Working,Secondary / secondary special,Married,House / apartment,...,0,0,0,Null,2.0,[5149729],1.0,1.0,1.0,1.0
9726,5149838,F,N,Y,0,157500.0,Pensioner,Higher education,Married,House / apartment,...,0,1,1,Medicine staff,2.0,[5149838],1.0,1.0,1.0,1.0


In [11]:
print(feature_df.columns.values)

['Unnamed: 0' 'ID' 'CODE_GENDER' 'FLAG_OWN_CAR' 'FLAG_OWN_REALTY'
 'CNT_CHILDREN' 'AMT_INCOME_TOTAL' 'NAME_INCOME_TYPE'
 'NAME_EDUCATION_TYPE' 'NAME_FAMILY_STATUS' 'NAME_HOUSING_TYPE'
 'DAYS_BIRTH' 'DAYS_EMPLOYED' 'FLAG_MOBIL' 'FLAG_WORK_PHONE' 'FLAG_PHONE'
 'FLAG_EMAIL' 'OCCUPATION_TYPE' 'CNT_FAM_MEMBERS' 'IDList' '6_Month'
 '12_Month' '24_Month' 'Lifetime']


In [17]:
# Map the raw column names to shorter, readable ones.
rename_dict = {'CODE_GENDER': 'Gender', 'FLAG_OWN_CAR': 'Car', 'FLAG_OWN_REALTY':'Property', 'CNT_CHILDREN': 'Children',
                'AMT_INCOME_TOTAL': 'Income', 'NAME_INCOME_TYPE': 'Income_Type', 'NAME_EDUCATION_TYPE': 'Education',
                'NAME_FAMILY_STATUS': 'Marriage_Status', 'NAME_HOUSING_TYPE': 'Housing', 'DAYS_BIRTH': 'Age', 
                'DAYS_EMPLOYED': 'Employment_Length', 'FLAG_MOBIL': 'Mobile_Phone', 'FLAG_WORK_PHONE': 'Work_Phone',
                'FLAG_PHONE': 'Phone', 'FLAG_EMAIL': 'Email', 'OCCUPATION_TYPE': 'Occupation', 'CNT_FAM_MEMBERS': 'Family_Size'}

In [18]:
feat_normalized = feature_df.rename(columns=rename_dict)

In [41]:
# Scale Income onto [0, 1].
Scaler = MinMaxScaler()
feat_normalized['Income'] = Scaler.fit_transform(feat_normalized[['Income']])

In [50]:
# Scale Age onto [0, 1], then reverse it: the smallest raw DAYS_BIRTH value maps to 1.
Scaler = MinMaxScaler()
feat_normalized['Age'] = 1 - Scaler.fit_transform(feat_normalized[['Age']])
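To see what the `1 -` in the Age cell does, here is a small sketch written with plain NumPy arithmetic so that the formula is visible. It assumes that `DAYS_BIRTH` stores negative day counts (older customers have more negative values); that convention is an assumption on our part, and the numbers are invented for the example.

```python
import numpy as np

# Hypothetical DAYS_BIRTH-style values (assumed convention: negative day counts).
days_birth = np.array([-20000.0, -15000.0, -10000.0])

# Plain min-max scaling sends the most negative value (the oldest customer) to 0.
scaled = (days_birth - days_birth.min()) / (days_birth.max() - days_birth.min())

# Subtracting from 1 reverses the order, so the oldest customer maps to 1.
age_score = 1 - scaled
print(age_score)
```

Under that assumption, the reversed column increases with age, which is easier to read in later plots.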

To normalize the employment length, we need to be a bit more careful: unemployment has been recorded as simply a large positive number, which would skew our range. Instead, we fit the scaler on the employed customers only and then use that fit to transform the whole column.

In [57]:
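Scaler = MinMaxScaler()
# Employed customers have non-positive values; fit the scaler on those rows only.
feat_employed = feat_normalized.loc[feat_normalized['Employment_Length']<=0]
scale = Scaler.fit(feat_employed[['Employment_Length']])
# Apply the employed-only fit to every row; unemployed rows will land above 1.
feat_normalized['Employment_Length'] = scale.transform(feat_normalized[['Employment_Length']])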

In [61]:
# Rows that fall outside [0, 1] after scaling (the unemployed placeholder).
feat_normalized.loc[feat_normalized['Employment_Length']>1, 'Employment_Length']

3       24.270897
16      24.270897
24      24.270897
34      24.270897
40      24.270897
          ...    
9698    24.270897
9703    24.270897
9704    24.270897
9719    24.270897
9721    24.270897
Name: Employment_Length, Length: 1699, dtype: float64
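The constant value above 1 is what we would expect from fitting on a subset and transforming the full column. The sketch below reproduces the effect on a toy column. The placeholder `SENTINEL` and all numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not the values used in the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

SENTINEL = 1024.0  # hypothetical placeholder meaning "not employed"
toy = pd.DataFrame({"Employment_Length": [-512.0, -256.0, 0.0, SENTINEL, SENTINEL]})

# Fit on the employed rows only, so the fitted range is [-512, 0].
employed = toy.loc[toy["Employment_Length"] <= 0, ["Employment_Length"]]
scaler = MinMaxScaler().fit(employed)

# Transform every row with that fit: (x - min) / (max - min).
toy["Scaled"] = scaler.transform(toy[["Employment_Length"]]).ravel()
print(toy)

# The placeholder rows all map to the same out-of-range value.
n_out_of_range = int((toy["Scaled"] > 1).sum())
```

Employed rows land in the range 0 to 1, while every placeholder row maps to the same number above 1, so the out-of-range rows are exactly the ones carrying the placeholder.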