### (week4) Home Credit Default Risk

![img](https://storage.googleapis.com/kaggle-competitions/kaggle/9120/logos/header.png)

>`Cf.`
> + [Home-Credit-Default-Risk - github](https://github.com/rishabhrao1997/Home-Credit-Default-Risk/blob/main/EDA%20-%20Home%20Credit%20Default.ipynb)
> + [HOME CREDIT DEFAULT RISK — An End to End ML Case Study — PART 1: Introduction and EDA - medium](https://medium.com/thecyphy/home-credit-default-risk-part-1-3bfe3c7ddd7a)
> + [HOME CREDIT DEFAULT RISK — An End to End ML Case Study — PART 2: Feature Engineering and Modelling](https://medium.com/thecyphy/home-credit-default-risk-part-2-84b58c1ab9d5)
> + [A roundup of Python code commonly used in machine learning - AI研究所](https://ai-kenkyujo.com/2020/06/08/kikaigakusyu-python/#i)

> ```point.```
> Supervised Classification

##### - import

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings(action="ignore")

# algorithm
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier #XGBoost
from lightgbm import LGBMClassifier #LightGBM

# evaluations
from sklearn.metrics import accuracy_score # accuracy
from sklearn.metrics import precision_score # precision
from sklearn.metrics import recall_score # recall
from sklearn.metrics import f1_score # F1 score
from sklearn.metrics import confusion_matrix # confusion matrix

# visualization
import missingno as msn
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import IPython
def display(*dfs, head=True):
    for df in dfs:
        IPython.display.display(df.head() if head else df)

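#### [Problem 1] Understanding the competition
> + (a) What is learned, and what is predicted?
> + (b) What kind of file is created and submitted to Kaggle?
> + (c) Which metric is used to evaluate the submission?

(a): Learn from clients' loan application data and predict each client's repayment ability (the `TARGET` column).

(b): Create and submit the file specified under [Submission File - kaggle](https://www.kaggle.com/c/home-credit-default-risk/overview/evaluation): one row per `SK_ID_CURR` in the test set with a predicted probability for `TARGET`.

(c): The area under the [ROC curve](https://ja.wikipedia.org/wiki/%E5%8F%97%E4%BF%A1%E8%80%85%E6%93%8D%E4%BD%9C%E7%89%B9%E6%80%A7) between the predicted probability and the observed target.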
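
A minimal sketch of the metric in (c), assuming scikit-learn's `roc_auc_score`. The labels and scores are made-up toy values. It compares the AUC computed from predicted probabilities with the AUC computed after thresholding them into hard 0/1 labels, which discards the ranking among clients.

```python
from sklearn.metrics import roc_auc_score

# Toy data (hypothetical): 1 = client had payment difficulties
y_true = [0, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.9]   # predicted probabilities of class 1

# AUC depends only on how the scores rank positives against negatives
auc_from_scores = roc_auc_score(y_true, y_score)

# Thresholding at 0.5 turns most positives into 0 and creates ties
y_label = [1 if s >= 0.5 else 0 for s in y_score]
auc_from_labels = roc_auc_score(y_true, y_label)

print(auc_from_scores, auc_from_labels)
```

Because the ranking is what counts, a submission built from `predict_proba` is usually a better fit for this metric than one built from `predict`.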

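#### [Problem 2] Training and validation
> Build and run the whole flow: briefly analyze the data, preprocess it, train, and validate.

> `memo`
> Apparently Kagglers first submit with a minimal dataset to see roughly what score they get.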

##### - Dataset

In [None]:
train_raw = pd.read_csv('input/application_train.csv')
test_raw = pd.read_csv('input/application_test.csv')
print('The size of the train data :', train_raw.shape)
print('The size of the test data :', test_raw.shape)

# Flag each row as train or test
train_mid = train_raw.copy()
train_mid['train_or_test'] = 'train'
test_mid = test_raw.copy()
test_mid['train_or_test'] = 'test'
test_mid['TARGET'] = 0.5  # placeholder so the test rows have no NaN in TARGET

alldata = pd.concat([train_mid, test_mid], sort=False, axis=0).reset_index(drop=True)
print('The size of alldata:', alldata.shape)

In [None]:
train_mid.head()

In [None]:
test_mid.head()

In [None]:
alldata

##### (Through Problem 3: version that drops all rows containing NaN)

##### - EDA / Preprocessing

In [None]:
# Check the distribution of missing data
msn.matrix(alldata)

In [None]:
# Drop every row that contains any missing value
alldata = alldata.dropna(how="any")

In [None]:
print(alldata.shape)
print(alldata.isnull().sum())
msn.matrix(alldata)
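
A small sketch of why `dropna(how="any")` can discard most of the data: a row is removed as soon as a single column is missing. The frame below is a toy with made-up values; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): three rows each miss a different column
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100.0, np.nan, 300.0, 400.0],
    "AMT_CREDIT": [10.0, 20.0, np.nan, 40.0],
    "OWN_CAR_AGE": [np.nan, 5.0, 7.0, 9.0],
})

rows_before = len(toy)
rows_after = len(toy.dropna(how="any"))   # keeps only fully observed rows
missing_ratio = toy.isnull().mean()        # share of NaN per column

print(f"rows kept: {rows_after}/{rows_before}")
print(missing_ratio)
```

Each column is only 25% missing, yet three of the four rows are lost, so comparing `len()` before and after the drop is a cheap sanity check.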

In [None]:
# List the object-dtype columns (they are one-hot encoded with get_dummies in the next cell)
alldata.columns[alldata.dtypes == object]

In [None]:
alldata = pd.get_dummies(alldata)

In [None]:
alldata.columns[alldata.dtypes == object]

In [None]:
alldata.head(3)
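
For reference, a short sketch contrasting label encoding with the one-hot encoding that `pd.get_dummies` performs above. The toy column reuses a dataset column name, but the three rows are invented.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"]})

# Label encoding: a single integer column, categories numbered in sorted order
label_encoded = LabelEncoder().fit_transform(toy["NAME_CONTRACT_TYPE"])

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy)

print(label_encoded)
print(one_hot.columns.tolist())
```

One-hot encoding avoids implying an order between categories, at the cost of many more columns.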

In [None]:
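# Split the combined frame back into train and test
train_rows = alldata[alldata['train_or_test_train'] == 1]
test_rows = alldata[alldata['train_or_test_train'] == 0]
# Keep TARGET and the train/test flags out of the features to avoid label leakage
non_feature_cols = ['TARGET', 'train_or_test_train', 'train_or_test_test']
train_feature = train_rows.drop(columns=non_feature_cols, errors='ignore')
test_feature = test_rows.drop(columns=non_feature_cols, errors='ignore')
train_target = train_rows['TARGET']
print("train: {}".format(train_feature.shape))
print("test: {}".format(test_feature.shape))

X_train, X_test, y_train, y_test = train_test_split(train_feature, train_target, test_size=0.2, random_state=0)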
display(X_train)

##### - Baseline

In [None]:
ratio = y_train.sum() / len(y_train)
print(f'Target rate:{ratio}')
print(f'base line accuracy: {1 - ratio}')
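
The baseline above can also be expressed with scikit-learn's `DummyClassifier`. This is a sketch on toy labels (10% positives, an invented ratio) showing that a majority-class model gets a high accuracy while its ROC AUC stays at chance level.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels (hypothetical): 10% positives, imbalanced like a default flag
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))          # features are irrelevant for this baseline

# Always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = accuracy_score(y, baseline.predict(X))
auc = roc_auc_score(y, baseline.predict_proba(X)[:, 1])

print(f"accuracy={acc:.2f}, ROC AUC={auc:.2f}")
```

This is why accuracy alone says little on an imbalanced target.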

##### - Normalize

In [None]:
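# Standardize
sts = StandardScaler()
sts.fit(X_train)  # the scaler ignores y, so only the features are passed
X_train_norm = sts.transform(X_train)  # note: the cells below train on the unscaled X_train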
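
One way to make sure the scaled features are actually used, and that the validation split is scaled with the training statistics, is to wrap the scaler and the model in a pipeline. This is a sketch on synthetic data, not the notebook's own flow; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real feature matrix (assumption)
rng = np.random.default_rng(0)
X = rng.normal(loc=1000.0, scale=50.0, size=(200, 3))
y = (X[:, 0] > 1000.0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

# The pipeline fits the scaler on the training split only and reuses
# the same mean and scale when it transforms the validation split
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

X_tr_scaled = model.named_steps["standardscaler"].transform(X_tr)
val_accuracy = model.score(X_va, y_va)
print(X_tr_scaled.mean(axis=0).round(6), val_accuracy)
```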

##### - Machine Learning
    Train with logistic regression and random forest

In [None]:
# Evaluation helper, defined before the model cells that call it
def evaluations(test, predict, average):
    accuracy = accuracy_score(test, predict)
    precision = precision_score(test, predict, average=average)
    recall = recall_score(test, predict, average=average)
    f1 = f1_score(test, predict, average=average)
    scores = {
        "accuracy" : round(accuracy, 3),
        "precision" : round(precision, 3),
        "recall" : round(recall, 3),
        "F1" : round(f1, 3)
    }
    return scores

In [None]:
# Logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
logreg_ev = evaluations(y_test, y_pred, "macro")

In [None]:
# Random forest
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
rfc_ev = evaluations(y_test, y_pred, "macro")

##### - Evaluation

In [None]:
pd.DataFrame([logreg_ev, rfc_ev], index=["Logistic regression", "Random forest"])

As an experiment, look at the random forest's `feature_importances_`.

In [None]:
rfc.fit(X_train, y_train)
# n = rfc.feature_importances_
n = np.argsort(rfc.feature_importances_) # indices that sort the importances in ascending order
x = X_train.columns[n]
y = rfc.feature_importances_[n]

plt.figure(figsize=(20, 50))
plt.barh(x, y, label="Random Forest Classifier")
plt.title('RandomForestClassifier feature importance')
plt.show()
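
With a couple of hundred one-hot columns the full bar chart is hard to read. A possible alternative, sketched here on synthetic data with hypothetical feature names, is to pair the importances with the column names in a `pd.Series` and keep only the top few.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names
X_arr, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(X_arr.shape[1])])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair importances with column names and keep the five largest
importances = pd.Series(forest.feature_importances_, index=X.columns)
top5 = importances.sort_values(ascending=False).head(5)
print(top5)
```

`top5.plot.barh()` would then draw a compact version of the chart above.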

#### [Problem 3] Prediction on the test data
> Run predictions on the test data (`application_test.csv`) and submit them to Kaggle.

In [None]:
test_feature

In [None]:
X_test = test_feature.values # Kaggle test features, shape (1739, 238) in the original run; overwrites the validation X_test

In [None]:
# Logistic regression
logreg = LogisticRegression(random_state=0)
logreg.fit(X_train, y_train) # shapes in the original run: (6881, 238), (6881,)
logreg_pred = logreg.predict_proba(X_test)

In [None]:
# Random forest
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict_proba(X_test)

#### - submit

In [None]:
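# Probability of the positive class (TARGET = 1)
a = pd.DataFrame({"TARGET": logreg_pred[:, 1]})

In [None]:
# Client IDs as integers, aligned by position with the predictions
b = pd.DataFrame({"SK_ID_CURR": test_feature["SK_ID_CURR"].astype(int).values})

In [None]:
# Both frames already carry their column names, so no rename is needed.
# Rows with NaN were dropped earlier, so this file covers only the remaining test rows.
submit = pd.concat([b, a], axis=1)
submit.to_csv('output/demo_logreg.csv', index=False)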
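
Before uploading, a quick structural check can catch problems such as missing test rows or duplicated column names. `check_submission` is a hypothetical helper written for this note, and the IDs are toy values; it only assumes the `SK_ID_CURR` / `TARGET` layout used above.

```python
import pandas as pd

def check_submission(submit, expected_ids):
    """Return a list of problems found in a submission frame (hypothetical helper)."""
    if list(submit.columns) != ["SK_ID_CURR", "TARGET"]:
        return ["columns must be exactly SK_ID_CURR, TARGET"]
    problems = []
    if submit["SK_ID_CURR"].duplicated().any():
        problems.append("duplicate SK_ID_CURR values")
    if set(submit["SK_ID_CURR"]) != set(expected_ids):
        problems.append("SK_ID_CURR values do not match the test set")
    if not submit["TARGET"].between(0, 1).all():
        problems.append("TARGET must be a probability in [0, 1]")
    return problems

test_ids = pd.Series([100001, 100005, 100013])   # toy IDs
good = pd.DataFrame({"SK_ID_CURR": test_ids, "TARGET": [0.1, 0.5, 0.9]})
partial = good.iloc[:2]                           # one test row is missing

good_result = check_submission(good, test_ids)
partial_result = check_submission(partial, test_ids)
print(good_result)
print(partial_result)
```

In the notebook, `expected_ids` would be `test_raw["SK_ID_CURR"]`.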

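#### [Problem 4] Feature engineering
>     To improve accuracy, carry out feature engineering from the following viewpoints:
>        - Which features to use
>        - How to preprocess them
>     Summarize what you did and how the evaluation metric on the validation data changed. Run at least 5 training/validation patterns.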
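
One possible way to organize the required patterns is a small table of (feature subset, preprocessing pipeline) pairs scored with the same cross-validated ROC AUC. The sketch below uses synthetic data and three illustrative patterns; the pattern names, columns, and model are assumptions, not the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training features, with some injected NaN
X_arr, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X.iloc[::10, 0] = np.nan

def make_model(strategy):
    # Impute, scale, then fit a logistic regression
    return make_pipeline(SimpleImputer(strategy=strategy), StandardScaler(), LogisticRegression())

# Each pattern = (columns to use, preprocessing + model)
patterns = {
    "median impute, all features": (list(X.columns), make_model("median")),
    "mean impute, all features": (list(X.columns), make_model("mean")),
    "median impute, first 3 features": (["f0", "f1", "f2"], make_model("median")),
}

results = {}
for name, (cols, model) in patterns.items():
    scores = cross_val_score(model, X[cols], y, cv=3, scoring="roc_auc")
    results[name] = scores.mean()

summary = pd.Series(results, name="mean_cv_auc").sort_values(ascending=False)
print(summary)
```

Keeping the split and the metric fixed across patterns makes the numbers comparable.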

In [None]:
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

##### - dataset2

In [None]:
display(train_mid.head(3))
display(test_mid.head(3))

In [None]:
df = pd.concat([train_mid, test_mid]).reset_index(drop=True)
display(df.head(3))
display(df.tail(3))
df.shape

##### - EDA2 / Preprocessing2

In [None]:
# Count NaNs per column
nan = df.isnull().sum().reset_index()
nan.columns = ["name", "count"]
nan["ratio"] = (nan["count"] / df.shape[0])*100
nan["usabilty"] = np.where(nan["ratio"] > 20, "Discard", "Keep")
nan = nan[nan["count"] > 0].sort_values(by="ratio")
nan

In [None]:
# Plot the NaN ratios
plt.figure(figsize=(15, 6))
sns.barplot(x=nan["name"], y=nan["ratio"])
plt.xticks(rotation=90) # rotate the labels by 90°
plt.title("Features containing NaN")
plt.show()

In [None]:
# Remove the columns marked "Discard" from df
drop_list = nan[nan["usabilty"] == "Discard"]["name"].values.tolist()
df = df.drop(drop_list, axis=1)
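
The same 20% rule can be written as a small reusable function. `columns_over_nan_threshold` is a hypothetical helper and the frame is a toy; the threshold is passed as a parameter so other cut-offs can be tried as separate patterns.

```python
import numpy as np
import pandas as pd

def columns_over_nan_threshold(frame, threshold_pct=20.0):
    """Names of columns whose NaN percentage exceeds threshold_pct (hypothetical helper)."""
    nan_pct = frame.isnull().mean() * 100
    return nan_pct[nan_pct > threshold_pct].index.tolist()

# Toy frame with 10 rows
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 6 + [1.0, 2.0, 3.0, 4.0],   # 60% NaN
    "rarely_missing": [np.nan] + [1.0] * 9,                    # 10% NaN
    "complete": list(range(10)),                               # 0% NaN
})

to_drop = columns_over_nan_threshold(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop, reduced.columns.tolist())
```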

In [None]:
print("# columns: ", len(df.columns))

In [None]:
# What should be done with these...?
keep_nan = nan[nan["usabilty"]=="Keep"]
keep_nan
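
One option for the "Keep" columns is to impute rather than drop them. The sketch below assumes scikit-learn's `SimpleImputer` with the median strategy; the values are invented. The imputer is fitted on the training part only and then reused for the test part.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/test parts (hypothetical values)
train_part = pd.DataFrame({"AMT_ANNUITY": [10.0, 20.0, np.nan, 40.0]})
test_part = pd.DataFrame({"AMT_ANNUITY": [np.nan, 100.0]})

# Learn the median from train only, then fill both parts with it
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train_part), columns=train_part.columns)
test_filled = pd.DataFrame(imputer.transform(test_part), columns=test_part.columns)

print(train_filled["AMT_ANNUITY"].tolist())
print(test_filled["AMT_ANNUITY"].tolist())
```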

In [None]:
# Convert the object-dtype columns to numeric

# Build the list
obj_list = df.columns[df.dtypes == "object"].tolist()
print(obj_list, len(obj_list), type(obj_list))

In [None]:
# Build the one-hot encoded df
obj_df = pd.get_dummies(df[obj_list])
display(obj_df.head(3))
print(len(obj_df.columns))

In [None]:
# Drop the columns listed in obj_list
df = df.drop(obj_list, axis=1)
# Concatenate the one-hot encoded obj_df
df = pd.concat([df, obj_df], axis=1)

In [None]:
df.columns[df.dtypes=="object"]

In [None]:
# Correlation Matrix
f, ax = plt.subplots(figsize=(30, 25))
mat = df.corr("pearson")
mask = np.triu(np.ones_like(mat, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(mat, mask=mask, cmap=cmap, vmax=1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

In [None]:
mat["TARGET"].sort_values(ascending=False)

In [None]:
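# Note: this drops every feature that is negatively correlated with TARGET, although a
# negative correlation can be as informative as a positive one. The matrix also includes
# the test rows, whose TARGET is the 0.5 placeholder.
drop_list = mat.columns[mat["TARGET"] < 0].tolist()
# Keep the ID and the train/test flag; filtering avoids a ValueError when a name is absent
drop_list = [c for c in drop_list if c not in ("SK_ID_CURR", "train_or_test_train")]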

In [None]:
drop_list
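
An alternative selection rule, sketched on synthetic data: rank features by the absolute value of their correlation with `TARGET`, so that strongly negative features are kept. The 0.3 cut-off and the column names are arbitrary assumptions for the illustration. On the real data the correlation would be computed on the train rows only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)

# Synthetic features (hypothetical): two informative, one pure noise
toy = pd.DataFrame({
    "TARGET": target,
    "pos_signal": target + rng.normal(scale=0.5, size=n),    # rises with TARGET
    "neg_signal": -target + rng.normal(scale=0.5, size=n),   # falls with TARGET
    "noise": rng.normal(size=n),
})

corr_with_target = toy.corr()["TARGET"].drop("TARGET")

# Weak features are those whose absolute correlation is small
weak = corr_with_target[corr_with_target.abs() < 0.3].index.tolist()
print(corr_with_target.round(2))
print(weak)
```

With a sign-based rule, `neg_signal` would have been dropped even though it carries as much information as `pos_signal`.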

> `Cf.`
> + [2019 Data Science Bowl](https://www.kaggle.com/c/data-science-bowl-2019/discussion/122021)
> + [First steps with LightGBM](https://fukki.pythonanywhere.com/post_detail/19/)

In [None]:
# Split the combined frame back into train and test
train = df[df['train_or_test_train'] == 1]
test = df[df['train_or_test_test'] == 1]
train

In [None]:
# Define the target variable and the columns not needed for training
target_col = "TARGET"
drop_col = drop_list

In [None]:
# Keep only the features needed for training
train_feature = train.drop(drop_col, axis=1)
test_feature = test.drop(drop_col, axis=1)
train_tagert = train[target_col]

In [None]:
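# Columns that still contain NaN in either split
drop_list2 = train_feature.columns[train_feature.isnull().sum() > 0].tolist()
drop_list3 = test_feature.columns[test_feature.isnull().sum() > 0].tolist()

In [None]:
# Drop the union from the already filtered frames so train and test keep identical columns.
# TARGET and the train/test flags are removed too, so the label cannot leak into the features.
nan_cols = sorted(set(drop_list2) | set(drop_list3))
non_feature_cols = [target_col, "train_or_test_train", "train_or_test_test"]
train_feature = train_feature.drop(columns=nan_cols + non_feature_cols, errors="ignore")
test_feature = test_feature.drop(columns=nan_cols + non_feature_cols, errors="ignore")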

# LightGBM raises an error when column names contain special JSON characters, because it cannot parse them
# train_feature.columns = ["".join(c if c.isalnum() else "_" for c in str(x)) for x in train_feature.columns]
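
The renaming idea in the comment above can be wrapped in a function and applied to both train and test so their columns stay aligned. `sanitize_columns` is a hypothetical helper; the two column names are illustrative examples of what `get_dummies` can produce from category values containing commas, colons, or spaces.

```python
import pandas as pd

def sanitize_columns(frame):
    """Return a copy whose column names contain only alphanumerics and underscores."""
    cleaned = ["".join(ch if ch.isalnum() else "_" for ch in str(col)) for col in frame.columns]
    # Two different names could collapse into one after cleaning
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("sanitizing produced duplicate column names")
    out = frame.copy()
    out.columns = cleaned
    return out

toy = pd.DataFrame({
    "NAME_TYPE_SUITE_Spouse, partner": [1, 0],
    "ORGANIZATION_TYPE_Trade: type 1": [0, 1],
})
clean = sanitize_columns(toy)
print(clean.columns.tolist())
```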

In [None]:
# Split the train data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(
    train_feature, train_tagert, test_size=0.2, random_state=0, stratify=train_tagert)
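
A quick sketch of what `stratify` does in the split above: both parts keep the same share of positives. The labels are a toy array with 8% positives, an invented ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 80 positives out of 1000 rows
y = np.array([0] * 920 + [1] * 80)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class ratio identical in both parts
_, _, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

print(y_tr.mean(), y_va.mean())
```

Without `stratify`, the validation share of positives would vary with the random seed, which makes metrics on a rare class noisier.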

##### - Baseline
> (精度の基準となるモデル)

In [None]:
# A naive baseline built only from the class frequency in train
default_rate = y_train.sum()/len(y_train)
print(f'default rate:{default_rate}')
print(f'base line accuracy: {1 - default_rate}')

##### - Normalize

In [None]:
# Standardize
sts = StandardScaler()
sts.fit(X_train, y_train)
X_train_norm = sts.transform(X_train)

##### - Machine Learning
> `Cf.`
> + [Interpreting the coefficient of determination returned by .score - teratail](https://teratail.com/questions/100203)
> + [Introduction to handy features of the machine learning library scikit-learn - Qiita](https://qiita.com/ishizakiiii/items/0650723cc2b4eef2c1cf)

In [None]:
# LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("="*20)
print("LogisticRegression")
print("train acc: ", logreg.score(X_train, y_train))
print("test acc: ", logreg.score(X_test, y_test))

# # RandomForest
# rfc = RandomForestClassifier()
# rfc.fit(train_feature, train_tagert)
# print("="*20)
# print("RandomForest")
# print("train acc: ", rfc.score(X_train, y_train))
# print("test acc: ", rfc.score(X_test, y_test))

# # XGBoost
# xgb = XGBClassifier()
# xgb.fit(train_feature, train_tagert)
# print("="*20)
# print("XGBoost")
# print("train acc: ", xgb.score(X_train, y_train))
# print("test acc: ", xgb.score(X_test, y_test))

# # LightGBM
# lgb = LGBMClassifier()
# lgb.fit(train_feature, train_tagert)
# print("="*20)
# print("LightGBM")
# print("train acc: ", lgb.score(X_train, y_train))
# print("test acc: ", lgb.score(X_test, y_test))

# # SVC
# svc = SVC()
# svc.fit(train_feature, train_tagert)
# print("="*20)
# print("SVC")
# print("train acc: ", svc.score(X_train, y_train))
# print("test acc: ", svc.score(X_test, y_test))

In [None]:
# LogisticRegression
logreg = LogisticRegression()
logreg.fit(train_feature, train_tagert)

# RandomForest
rfc = RandomForestClassifier()
rfc.fit(train_feature, train_tagert)

# XGBoost
xgb = XGBClassifier()
xgb.fit(train_feature, train_tagert)

# LightGBM
lgb = LGBMClassifier()
lgb.fit(train_feature, train_tagert)

# SVC
svc = SVC()
svc.fit(train_feature, train_tagert)

# Inference
# predict() returns hard 0/1 labels; the competition is scored on probabilities
# (predict_proba was used in Problem 3)
pred = {
    'rfc': rfc.predict(test_feature),
    'xgb': xgb.predict(test_feature),
    'lgb': lgb.predict(test_feature),
    'logreg': logreg.predict(test_feature),
    'svc': svc.predict(test_feature)
}

# Write one submission file per model
for key, value in pred.items():
    pd.concat(
        [
            test[["SK_ID_CURR"]].reset_index(drop=True),
            pd.DataFrame(value, columns=["TARGET"])
        ],
        axis=1
    ).to_csv(f'output/{key}.csv', index=False)