### 3rd level. Home Credit Default Risk

- [Reference 1](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction), [Reference 2](https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering)

## prepare

### data

7 sources, [conceptual diagram of how the tables are linked](https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png)

- application_train / application_test: keyed by SK_ID_CURR (looks like the index); TARGET (0: loan repaid, 1: loan not yet repaid)
- bureau (banks or other lenders?): client's previous credits from other financial institutions
- bureau_balance: monthly data about the previous credits in bureau.
- previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each one is identified by the feature 'SK_ID_PREV'.
- POS_CASH_BALANCE: monthly data about previous point of sale or cash loans clients have had with Home Credit.
- credit_card_balance: monthly data about previous credit cards clients have had with Home Credit.
- installments_payments: payment history

Judging from the diagram, nearly every table carries ID-type keys that act as foreign keys, all named in the form SK_ID_*.
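
A minimal sketch of how such a key could be used, assuming toy stand-ins for application_train and bureau with made-up values. The output names `bureau_loan_count` and `bureau_credit_sum` are hypothetical; only `SK_ID_CURR`, `SK_ID_BUREAU` and `AMT_CREDIT_SUM` are taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for application_train and bureau; all values are made up.
app = pd.DataFrame({"SK_ID_CURR": [100001, 100002, 100003],
                    "TARGET": [0, 1, 0]})
bureau = pd.DataFrame({"SK_ID_CURR": [100001, 100001, 100003],
                       "SK_ID_BUREAU": [1, 2, 3],
                       "AMT_CREDIT_SUM": [1000.0, 500.0, 2000.0]})

# Collapse the child table to one row per client.
bureau_agg = (bureau.groupby("SK_ID_CURR")
              .agg(bureau_loan_count=("SK_ID_BUREAU", "count"),
                   bureau_credit_sum=("AMT_CREDIT_SUM", "sum"))
              .reset_index())

# Left-join on the shared key so every application row is kept.
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left")

# Clients without any bureau record get NaN from the join; a count of 0 reads better.
merged["bureau_loan_count"] = merged["bureau_loan_count"].fillna(0).astype(int)
print(merged)
```

Aggregating before merging keeps one row per SK_ID_CURR, which is what the application tables expect.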

### imports

In [None]:
import numpy as np
import pandas as pd
import os

from sklearn.preprocessing import LabelEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import seaborn as sns

### read data

In [None]:
print(os.listdir("../input/home-credit-default-risk/"))

In [None]:
train = pd.read_csv("../input/home-credit-default-risk/application_train.csv")
print("Training data shape:", train.shape)
train.head()

In [None]:
test = pd.read_csv("../input/home-credit-default-risk/application_test.csv")
print("Test data shape:", test.shape)
test.head()

## Exploratory Data Analysis

### target distribution (visualization)

In [None]:
train.TARGET.astype(int).plot.hist()

### examine missing values

In [None]:
def missing_values_table(df):
    # Count and percentage of missing values per column
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Name the two columns, keep only columns that have missing values,
    # and sort by percentage in descending order
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: "Missing Values", 1: "% of Total Values"})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        "% of Total Values", ascending=False).round(1)

    # Report how many columns are affected
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.",
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.", sep="\n")
    return mis_val_table_ren_columns

In [None]:
missing = missing_values_table(train)
missing.head(20)
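
The same kind of table can be written more compactly. The sketch below runs on a made-up toy frame and is only an illustration, not a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: "a" is half missing, "c" is a quarter missing.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                    "b": [1, 2, 3, 4],
                    "c": ["x", None, "z", "w"]})

# isnull().mean() gives the missing fraction directly.
summary = pd.DataFrame({"Missing Values": toy.isnull().sum(),
                        "% of Total Values": 100 * toy.isnull().mean()})

# Keep only columns with missing values, largest share first.
summary = (summary[summary["Missing Values"] > 0]
           .sort_values("% of Total Values", ascending=False)
           .round(1))
print(summary)
```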

### Columns types

Before exploring the features in earnest, check which dtypes they have.

In [None]:
train.dtypes.value_counts()

In [None]:
train.select_dtypes("object").apply(pd.Series.nunique, axis=0)

### encoding categorical variables

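Categorical columns with at most two categories are label encoded; the remaining categorical columns are one-hot encoded.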

In [None]:
le = LabelEncoder()

In [None]:
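le_count = 0
for col in train:
    # Only object (string) columns with at most two distinct values
    if train[col].dtype == "object":
        if len(list(train[col].unique())) <= 2:
            # Fit on train only, then apply the same mapping to train and test
            le.fit(train[col])
            train[col] = le.transform(train[col])
            test[col] = le.transform(test[col])
            le_count += 1
print("%d columns were label encoded." % le_count)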

In [None]:
train = pd.get_dummies(train)
test = pd.get_dummies(test)
print(f"Training Features shape: {train.shape}", f"Test Features shape: {test.shape}",
      sep="\n")
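
A minimal sketch of the two encodings on toy data. The column names `FLAG` and `TYPE` and all values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data: FLAG has two categories, TYPE has three in train but only two in test.
toy_train = pd.DataFrame({"FLAG": ["Y", "N", "Y"], "TYPE": ["a", "b", "c"]})
toy_test = pd.DataFrame({"FLAG": ["N", "Y"], "TYPE": ["a", "b"]})

# Label encoding: categories are sorted, so "N" -> 0 and "Y" -> 1.
encoder = LabelEncoder()
encoder.fit(toy_train["FLAG"])
toy_train["FLAG"] = encoder.transform(toy_train["FLAG"])
toy_test["FLAG"] = encoder.transform(toy_test["FLAG"])

# One-hot encoding: each remaining category becomes its own column.
toy_train = pd.get_dummies(toy_train)
toy_test = pd.get_dummies(toy_test)

print(list(toy_train.columns))
print(list(toy_test.columns))
```

The test frame ends up without a `TYPE_c` column because that category never occurs in it, which is why the column sets of train and test can differ after one-hot encoding.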

#### make train and test share the same features (TARGET kept aside)

In [None]:
train_labels = train.TARGET
train, test = train.align(test, join="inner", axis=1)
train["TARGET"] = train_labels

print(f"Training Features shape: {train.shape}", f"Test Features shape: {test.shape}",
      sep="\n")
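
What `align` with `join="inner", axis=1` does can be seen on a toy example. The frames and values below are made up.

```python
import pandas as pd

# Toy frames: train has an extra dummy column (TYPE_c) and the label.
toy_train = pd.DataFrame({"x": [1, 2], "TYPE_a": [1, 0],
                          "TYPE_c": [0, 1], "TARGET": [0, 1]})
toy_test = pd.DataFrame({"x": [3], "TYPE_a": [1]})

# Save the label first: the inner join drops it because test has no TARGET.
labels = toy_train["TARGET"]

# axis=1 aligns columns only; rows are left untouched.
toy_train, toy_test = toy_train.align(toy_test, join="inner", axis=1)

# Put the label back on the training frame.
toy_train["TARGET"] = labels

print(toy_train)
print(toy_test)
```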

### EDA in earnest

#### anomalies (detecting outliers)

**DAYS_BIRTH (nothing unusual)**

In [None]:
(train.DAYS_BIRTH / -365).describe()

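**DAYS_EMPLOYED (anomaly detected)**

- Something is clearly off here. Even the third quartile is negative, yet the maximum is absurdly large. Check the distribution.
- The histogram is extremely lopsided: a small group of abnormally large values exists. Check what share of the rows they make up.
- The default rate differs between the anomalous and the normal rows (compare the two printed rates). Record the anomaly in a new feature and replace the anomalous value with NaN.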

In [None]:
train.DAYS_EMPLOYED.describe()

In [None]:
train.DAYS_EMPLOYED.plot.hist(title="Days Employment Histogram")
plt.xlabel("Days Employment")
plt.show()

In [None]:
the_number = 365243
anom = train[train.DAYS_EMPLOYED == the_number]
non_anom = train[train.DAYS_EMPLOYED != the_number]
print("The non-anomalies default on %0.2f%% of loans" % (100 * non_anom.TARGET.mean()),
      "The anomalies default on %0.2f%% of loans" % (100 * anom.TARGET.mean()),
      "There are %d anomalous days of employment" % len(anom), sep="\n")

In [None]:
train["DAYS_EMPLOYED_ANOM"] = train.DAYS_EMPLOYED == the_number
train["DAYS_EMPLOYED"] = train["DAYS_EMPLOYED"].replace({the_number: np.nan})

train.DAYS_EMPLOYED.plot.hist(title="Days Employment Histogram")
plt.xlabel("Days Employment")
plt.show()

In [None]:
test["DAYS_EMPLOYED_ANOM"] = test.DAYS_EMPLOYED == the_number
test["DAYS_EMPLOYED"] = test["DAYS_EMPLOYED"].replace({the_number: np.nan})

print("There are %d anomalies in the test data out of %d entries"
      % (test.DAYS_EMPLOYED_ANOM.sum(), len(test)))
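
A quick piece of arithmetic shows why 365243 cannot be a real employment length. The conversion assumes 365 days per year, as in the age calculation earlier.

```python
# The suspicious DAYS_EMPLOYED value singled out in this section.
sentinel_days = 365243

# Convert to years, assuming 365 days per year.
sentinel_years = sentinel_days / 365
print(f"{sentinel_days} days is about {sentinel_years:.1f} years")
```

A value of roughly a thousand years reads like a placeholder rather than a measurement, which supports treating it as missing instead of as a large number.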

#### Correlations

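Is the author a statistics major? The absolute value of the correlation coefficient, from 0 to 1, is sorted into five levels when interpreting it.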
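
A minimal sketch of such a five-level reading. The cut points 0.2, 0.4, 0.6 and 0.8 and the labels are an assumption (a common rule of thumb), not something fixed by the data, and `correlation_strength` is a hypothetical helper name.

```python
def correlation_strength(r):
    """Map a correlation coefficient to one of five descriptive labels.

    The thresholds are an assumed rule of thumb, not a standard.
    """
    magnitude = abs(r)  # the sign only gives the direction
    if magnitude < 0.2:
        return "very weak"
    if magnitude < 0.4:
        return "weak"
    if magnitude < 0.6:
        return "moderate"
    if magnitude < 0.8:
        return "strong"
    return "very strong"


for r in (-0.08, 0.45, 1.0):
    print(r, correlation_strength(r))
```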

In [None]:
correlations = train.corr().TARGET.sort_values()
print("Most Positive Correlations:", correlations.tail(15),
      "\nMost Negative Correlations:", correlations.head(15), sep="\n")

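**Effect of Age on Repayment**

- Age does not seem to be strongly correlated with the target.
- Still, in case it turns out to be useful, convert age to a categorical variable and check again.
- The effect is numerically small, but the younger the client, the more likely the loan is still unpaid.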

In [None]:
train["DAYS_BIRTH"] = abs(train.DAYS_BIRTH)
train.DAYS_BIRTH.corr(train.TARGET)

In [None]:
plt.style.use("fivethirtyeight")
plt.hist(train.DAYS_BIRTH / 365, edgecolor='k', bins=25)

plt.xlabel("Age (years)")
plt.ylabel("Count")
plt.title("Age of Client")
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
sns.kdeplot(train.loc[train.TARGET == 0, "DAYS_BIRTH"] / 365, label="target == 0")
sns.kdeplot(train.loc[train.TARGET == 1, "DAYS_BIRTH"] / 365, label="target == 1")

plt.xlabel("Age (years)")
plt.ylabel("Density")
plt.legend(loc="best")
plt.title("Distribution of Ages")
plt.show()

In [None]:
age_data = train[["TARGET", "DAYS_BIRTH"]].copy()  # copy so new columns are not written to a slice of train
age_data["YEARS_BIRTH"] = age_data["DAYS_BIRTH"] / 365
age_data["YEARS_BINNED"] = pd.cut(age_data.YEARS_BIRTH, bins=np.linspace(20, 70, num=11))
age_data.head(10)

In [None]:
age_groups = age_data.groupby("YEARS_BINNED").mean()
age_groups
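
How the binning works can be checked on a handful of made-up ages. This is a sketch on toy values, using the same bin edges as above.

```python
import numpy as np
import pandas as pd

# Toy data with made-up ages and targets.
toy = pd.DataFrame({"TARGET": [1, 0, 0, 1, 0, 0],
                    "YEARS_BIRTH": [22.0, 24.5, 31.0, 33.0, 58.0, 66.0]})

# 11 edges from 20 to 70 give ten 5-year bins: (20, 25], (25, 30], ..., (65, 70].
edges = np.linspace(20, 70, num=11)
toy["YEARS_BINNED"] = pd.cut(toy["YEARS_BIRTH"], bins=edges)

# The mean of a 0/1 target per bin is the share of unpaid loans in that bin.
# observed=True keeps only the bins that actually contain rows.
rates = toy.groupby("YEARS_BINNED", observed=True)["TARGET"].mean()
print(rates)
```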

In [None]:
plt.figure(figsize=(8, 8))
plt.bar(age_groups.index.astype(str), 100 * age_groups.TARGET)

plt.xticks(rotation=75)
plt.xlabel("Age Group (years)")
plt.ylabel("Failure to Repay (%)")
plt.title("Failure to Repay by Age Group")
plt.show()

**Exterior Sources**

> EXT_SOURCE_\[1-3\] is a normalized score from external data sources

- Consider combining EXT_SOURCE_\[1-3\] into one feature.
- Each EXT_SOURCE has a negative correlation with TARGET.
- EXT_SOURCE_1 seems to be related to DAYS_BIRTH.
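
One possible way to combine the three scores is a row-wise mean that ignores missing values. This is an assumption for illustration, since the notes do not say how to combine them; the column name `EXT_SOURCE_MEAN` is hypothetical and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy scores with made-up values and some gaps.
toy = pd.DataFrame({"EXT_SOURCE_1": [0.2, np.nan, 0.8],
                    "EXT_SOURCE_2": [0.4, 0.6, np.nan],
                    "EXT_SOURCE_3": [0.6, np.nan, np.nan]})
ext_cols = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]

# mean(axis=1) skips NaN by default, so a row with one score still gets a value.
toy["EXT_SOURCE_MEAN"] = toy[ext_cols].mean(axis=1)
print(toy)
```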

In [None]:
ext_data = train[["TARGET", "EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3", "DAYS_BIRTH"]]
ext_data_corrs = ext_data.corr()
ext_data_corrs

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(ext_data_corrs, cmap=plt.cm.RdYlBu_r, vmin=-0.25, annot=True, vmax=0.6)
plt.title("Correlation Heatmap")
plt.show()

In [None]:
plt.figure(figsize=(10, 12))
for i, source in enumerate(["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]):
    plt.subplot(3, 1, i + 1)
    sns.kdeplot(train.loc[train.TARGET == 0, source], label="target == 0")
    sns.kdeplot(train.loc[train.TARGET == 1, source], label="target == 1")
    
    plt.xlabel("%s" % source)
    plt.ylabel("Density")
    plt.legend(loc="best")
    plt.title("Distribution of %s by Target Value" % source)
plt.tight_layout(h_pad=2.5)

In [None]:
plot_data = ext_data.drop(columns=["DAYS_BIRTH"]).copy()
plot_data["YEARS_BIRTH"] = age_data["YEARS_BIRTH"]
plot_data = plot_data.dropna().loc[:100000, :]

In [None]:
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r), xy=(.2, .8), xycoords=ax.transAxes, size=20)

grid = sns.PairGrid(data=plot_data, height=3, diag_sharey=False, hue="TARGET",
                    vars=[x for x in list(plot_data.columns) if x != "TARGET"])
grid.map_upper(plt.scatter, alpha=0.2)
grid.map_diag(sns.kdeplot)
grid.map_lower(sns.kdeplot, cmap=plt.cm.OrRd_r)
plt.suptitle("Ext Source and Age Features Pairs Plot", size=32, y=1.05)
plt.show()

## Feature Engineering

- Polynomial features
- Domain knowledge features

In [None]:
poly_features = train[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3", "DAYS_BIRTH",
                       "TARGET"]]
poly_features_test = test[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3", "DAYS_BIRTH"]]

poly_target = poly_features.TARGET
poly_features = poly_features.drop(columns=["TARGET"])

In [None]:
imputer = SimpleImputer(strategy="median")

poly_features = imputer.fit_transform(poly_features)
poly_features_test = imputer.transform(poly_features_test)
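
The reason for calling `fit_transform` on train but only `transform` on test can be shown with toy arrays. The values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-column arrays with gaps.
toy_train = np.array([[1.0], [3.0], [np.nan], [100.0]])
toy_test = np.array([[np.nan], [5.0]])

toy_imputer = SimpleImputer(strategy="median")

# fit_transform learns the median of the observed training values (1, 3, 100 -> 3).
train_filled = toy_imputer.fit_transform(toy_train)

# transform reuses the training median, so nothing is learned from the test set.
test_filled = toy_imputer.transform(toy_test)

print(toy_imputer.statistics_)
print(test_filled.ravel())
```

The median is barely moved by the extreme value 100, which is one reason to prefer it over the mean for skewed columns.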