# [Practice Problem] Reducing Default Risk
- https://signate.jp/competitions/294

***
<br>

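___Description___

In this task you build a model that predicts default risk from customer data such as total loan amount, repayment term, interest rate, and loan purpose.


Financial companies lend money to individuals and corporations (that is, they provide loans) and earn a profit by adding interest to the amount repaid.
For various reasons, however, some loans are never repaid; this is called a charge-off. A charge-off is a major loss for the lender, so it wants to avoid them as far as possible,
but in practice a certain rate of charge-offs cannot be avoided. Lenders therefore want to reduce charge-off risk as much as they can and to set interest rates so that they still make a profit when charge-offs occur.
So in this exercise, let's try building a model that predicts default risk from customer data such as total loan amount, repayment term, interest rate, and loan purpose.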

<br>
  
***

<br>

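___Data overview___

Task type: classification

Data type: multivariate

Number of training samples: 242156

Number of explanatory variables: 9

Missing values: none

Target variable: predictions must encode ChargedOff as 1 and FullyPaid as 0.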

<br>

***

<br>

___Evaluation___

- Predictions are scored with the F1 score ("F1Score").

- The score ranges from 0 to 1; the more accurate the predictions, the higher the value.
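
To make the metric concrete, here is a minimal sketch of how the F1 score for the positive class (ChargedOff = 1) follows from the counts of true positives, false positives, and false negatives. The helper name `f1_from_labels` and the toy labels are illustrative assumptions, not part of the competition material.

```python
def f1_from_labels(y_true, y_pred):
    """Compute the F1 score for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive prediction: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 4))
```

Because true negatives do not enter the formula, the score depends only on how well the ChargedOff class is identified.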



# Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
# from sklearn.preprocessing import LabelEncoder

## Inspecting the training data

#### Loading the data

In [2]:
train = pd.read_csv("/train.csv")
test = pd.read_csv("/test.csv")

#### Checking the contents of the data

In [3]:
train.shape

(242156, 10)

In [4]:
test.shape

(26906, 9)

In [5]:
train.head()

| | id | loan_amnt | term | interest_rate | grade | employment_length | purpose | credit_score | application_type | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 88194295 | 1800.0 | 3 years | 14.49 | C4 | | debt_consolidation | 665.0 | Individual | FullyPaid |
| 1 | 5146039 | 1200.0 | 5 years | 16.29 | C4 | 2 years | debt_consolidation | 700.0 | Individual | ChargedOff |
| 2 | 3095896 | 2000.0 | 5 years | 21.98 | E4 | 10 years | home_improvement | 670.0 | Individual | FullyPaid |
| 3 | 88625044 | 1000.0 | 3 years | 8.59 | A5 | 4 years | debt_consolidation | 710.0 | Individual | FullyPaid |
| 4 | 1178189 | 1500.0 | 3 years | 13.99 | C1 | 4 years | debt_consolidation | 680.0 | Individual | FullyPaid |


In [6]:
test.head()

| | id | loan_amnt | term | interest_rate | grade | employment_length | purpose | credit_score | application_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1496754 | 1912.5 | 3 years | 10.16 | B1 | 5 years | debt_consolidation | 725.0 | Individual |
| 1 | 84909594 | 1800.0 | 3 years | 8.99 | B1 | 9 years | credit_card | 695.0 | Individual |
| 2 | 1165403 | 550.0 | 3 years | 14.65 | C2 | 10 years | credit_card | 660.0 | Individual |
| 3 | 91354446 | 2000.0 | 5 years | 15.59 | C5 | 10 years | credit_card | 695.0 | Individual |
| 4 | 85636932 | 1500.0 | 5 years | 12.79 | C1 | 0 years | medical | 720.0 | Individual |


In [7]:
len(train[train['loan_status'] == 'FullyPaid'])

193815

In [8]:
len(train[train['loan_status'] == 'ChargedOff'])

48341
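
The two counts above show that the classes are imbalanced. The following sketch, using only those two numbers, works out the share of ChargedOff rows and two trivial baselines. The variable names are illustrative assumptions.

```python
n_fully_paid = 193815   # rows with loan_status == 'FullyPaid'
n_charged_off = 48341   # rows with loan_status == 'ChargedOff'
n_total = n_fully_paid + n_charged_off

# Share of the positive class (ChargedOff)
pos_rate = n_charged_off / n_total

# Error rate of a model that always predicts FullyPaid (0)
baseline_error = pos_rate

# F1 of a model that always predicts ChargedOff (1):
# precision equals pos_rate and recall equals 1
f1_all_positive = 2 * pos_rate / (1 + pos_rate)

print(n_total, round(pos_rate, 4), round(f1_all_positive, 4))
```

An error rate of about 0.20 can therefore be reached without learning anything, so error rate alone says little about how well the ChargedOff class is detected.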

## Preparing for training

#### Shaping the data (preprocessing)

In [9]:
# train.replace({'Individual': 0, 'Joint App': 1}, inplace=True)
train.replace({'FullyPaid': 0, 'ChargedOff': 1}, inplace=True)

# test.replace({'Individual': 0, 'Joint App': 1}, inplace=True)

train.drop(columns=['term', 'grade', 'employment_length', 'purpose', 'application_type'], inplace=True)
test.drop(columns=['term', 'grade', 'employment_length', 'purpose', 'application_type'], inplace=True)
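
The cell above drops every string column. As an alternative that is not used in this notebook, the sketch below shows one way such columns could be turned into numbers instead. The helpers `parse_years` and `grade_to_ordinal` are hypothetical, and the grade mapping assumes one letter followed by a sub-grade from 1 to 5, which is only inferred from the values visible in `train.head()`.

```python
def parse_years(value):
    """Turn strings such as '3 years' into an int; return None for missing values."""
    if not isinstance(value, str) or not value.strip():
        return None  # covers None, NaN (a float), and empty strings
    return int(value.split()[0])


def grade_to_ordinal(grade, subgrades_per_letter=5):
    """Map 'A1' -> 0, 'A2' -> 1, ..., 'B1' -> 5 (assumed grade layout)."""
    letter, sub = grade[0], int(grade[1:])
    return (ord(letter) - ord("A")) * subgrades_per_letter + (sub - 1)


# Two rows copied from train.head() (only the string columns)
rows = [
    {"term": "3 years", "grade": "C4", "employment_length": None},
    {"term": "5 years", "grade": "E4", "employment_length": "10 years"},
]

encoded = [
    {
        "term": parse_years(r["term"]),
        "grade": grade_to_ordinal(r["grade"]),
        "employment_length": parse_years(r["employment_length"]),
    }
    for r in rows
]
print(encoded)
```

Whether these extra features help the F1 score would have to be checked on the validation split.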

#### Train/validation split

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(train, train['loan_status'], test_size=0.3, random_state=0, stratify=train['loan_status'])
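
The `stratify` argument asks for a split in which both parts keep the class ratio of the full data. The plain-Python sketch below illustrates that idea on toy labels. It is a simplified stand-in written for this explanation, not scikit-learn's actual implementation, and `stratified_split` is a hypothetical helper.

```python
import random


def stratified_split(labels, test_size, seed=0):
    """Split row positions so that each class contributes the same fraction to the validation part."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                      # shuffle within the class
        n_valid = round(len(idx) * test_size)
        valid_idx.extend(idx[:n_valid])       # first part goes to validation
        train_idx.extend(idx[n_valid:])       # the rest goes to training
    return sorted(train_idx), sorted(valid_idx)


# Toy labels with a 20% positive rate, similar to the imbalance in this dataset
labels = [0] * 80 + [1] * 20
train_idx, valid_idx = stratified_split(labels, test_size=0.3)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
valid_rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
print(len(train_idx), len(valid_idx), train_rate, valid_rate)
```

Both parts keep the 20% positive rate, which matters here because the F1 score is computed on the minority class.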

#### Checking the split data

In [32]:
X_train.shape

(169509, 4)

In [28]:
X_train.head()

| | id | loan_amnt | interest_rate | credit_score |
|---|---|---|---|---|
| 52712 | 5786159 | 2437.5 | 22.95 | 665.0 |
| 213103 | 2371000 | 2917.5 | 17.27 | 680.0 |
| 180618 | 1202248 | 3235.0 | 12.12 | 725.0 |
| 148232 | 84700735 | 450.0 | 10.99 | 665.0 |
| 105475 | 122985525 | 1680.0 | 15.05 | 660.0 |


In [33]:
X_valid.shape

(72647, 4)

In [29]:
X_valid.head()

| | id | loan_amnt | interest_rate | credit_score |
|---|---|---|---|---|
| 80469 | 1302409 | 1000.0 | 12.12 | 705.0 |
| 96031 | 2076083 | 1500.0 | 16.29 | 685.0 |
| 189115 | 2864747 | 750.0 | 12.12 | 680.0 |
| 185327 | 85515436 | 1600.0 | 23.99 | 670.0 |
| 117158 | 4894665 | 2100.0 | 6.62 | 715.0 |


In [34]:
y_train.shape

(169509,)

In [30]:
y_train.head()

52712     1
213103    1
180618    0
148232    0
105475    0
Name: loan_status, dtype: int64

In [35]:
y_valid.shape

(72647,)

In [31]:
y_valid.head()

80469     1
96031     0
189115    0
185327    1
117158    0
Name: loan_status, dtype: int64

#### Removing the target variable

In [11]:
# Reassign instead of dropping in place, which avoids pandas' SettingWithCopyWarning
X_train = X_train.drop(columns='loan_status')
X_valid = X_valid.drop(columns='loan_status')


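#### Setting the explanatory variables

In [12]:
# Note: these three columns are numeric, but they are passed to LightGBM as categorical features below
categorical_feature = ['loan_amnt', 'interest_rate', 'credit_score']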

## Building and training the model

In [13]:
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb

# Train one model per CV fold (num_fold = 6 below)
# Initialize the list that stores the trained models
models = []
pred_ave = []
first_judge = True
num_fold = 6

# Row positions of the training data (consecutive integers from 0 to the last row)
row_no_list = list(range(len(y_train)))

# Instantiate StratifiedKFold (splits the data into num_fold = 6 folds)
K_fold = StratifiedKFold(n_splits=num_fold, shuffle=True,  random_state=42)

# Run once per fold (6 times here)
for train_cv_no, eval_cv_no in K_fold.split(row_no_list, y_train):
    # Select the rows of this fold with iloc
    X_train_cv = X_train.iloc[train_cv_no, :]
    y_train_cv = pd.Series(y_train).iloc[train_cv_no]
    X_eval_cv = X_train.iloc[eval_cv_no, :]
    y_eval_cv = pd.Series(y_train).iloc[eval_cv_no]
    
    # Training set for this fold
    lgb_train = lgb.Dataset(X_train_cv, y_train_cv,
                            categorical_feature=categorical_feature)
    # Validation set for this fold
    lgb_eval = lgb.Dataset(X_eval_cv, y_eval_cv, reference=lgb_train,
                           categorical_feature=categorical_feature)
    
    # Set the parameters
    params = {'objective': 'binary',
              'metric': 'binary_error',
#               'learning_rate':0.1,
#               'num_iterations':100,
#               'num_leaves':31,
#               'max_depth':-1
#               'weight_columns':[0.07495,0.110919,0.065571,0.067531,0.05158,
#                                0.10085,0.053923,0.147468,0.065578,0.060578,
#                                0.10546,0.095594]
             }
    
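    # Train
    evaluation_results = {}                                     # placeholder for the training history (not used below)
    # Note: early_stopping_rounds and verbose_eval are keyword arguments of lgb.train in LightGBM < 4.0;
    # newer versions expect the lgb.early_stopping / lgb.log_evaluation callbacks instead.
    model = lgb.train(params,                                   # parameters defined above
                      lgb_train,                                # training dataset
                      num_boost_round=1000,                     # maximum number of boosting rounds
                      valid_sets=[lgb_train, lgb_eval],         # datasets evaluated during training
                      categorical_feature=categorical_feature,  # columns treated as categorical
                      early_stopping_rounds=100,                # stop after 100 rounds without improvement
                      verbose_eval=10)                          # log the metrics every 10 rounds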
    
    # Predict on the test data
    y_pred = model.predict(test, num_iteration=model.best_iteration)

    if first_judge:
        pred_ave = y_pred
        first_judge = False
    else:
        pred_ave = pred_ave + y_pred
    
    # Keep the trained model in the list
    models.append(model) 

pred_ave = pred_ave/num_fold



Training until validation scores don't improve for 100 rounds.
[10]	training's binary_error: 0.199629	valid_1's binary_error: 0.199632
[20]	training's binary_error: 0.198687	valid_1's binary_error: 0.199632
[30]	training's binary_error: 0.195289	valid_1's binary_error: 0.199844
[40]	training's binary_error: 0.194291	valid_1's binary_error: 0.200729
[50]	training's binary_error: 0.193555	valid_1's binary_error: 0.20126
[60]	training's binary_error: 0.192748	valid_1's binary_error: 0.201579
[70]	training's binary_error: 0.192153	valid_1's binary_error: 0.201897
[80]	training's binary_error: 0.191254	valid_1's binary_error: 0.202003
[90]	training's binary_error: 0.190787	valid_1's binary_error: 0.202251
[100]	training's binary_error: 0.190327	valid_1's binary_error: 0.202711
[110]	training's binary_error: 0.189923	valid_1's binary_error: 0.203101
[120]	training's binary_error: 0.189385	valid_1's binary_error: 0.203101
Early stopping, best iteration is:
[21]	training's binary_error: 0.1981
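
The log above ends with an early stop. As a rough illustration of the rule behind it, the sketch below tracks the best validation error seen so far and stops once a given number of rounds pass without improvement. It is a simplified model of the idea, not LightGBM's code, and `best_iteration_with_patience` and the toy error values are assumptions made for this example.

```python
def best_iteration_with_patience(valid_errors, patience):
    """Return (best_iteration, best_error), stopping after `patience` rounds without improvement."""
    best_iter, best_err = 0, float("inf")
    for i, err in enumerate(valid_errors, start=1):  # iterations are 1-based
        if err < best_err:
            best_iter, best_err = i, err
        elif i - best_iter >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_iter, best_err


# Toy validation errors: the best value appears at round 3
errors = [0.30, 0.25, 0.24, 0.26, 0.27, 0.28, 0.23]
best_iter, best_err = best_iteration_with_patience(errors, patience=3)
print(best_iter, best_err)
```

In the toy run, training stops at round 6, so the lower error at round 7 is never reached. This is the trade-off a patience value controls. The same pattern is visible in the log: the best iteration is 21 and, with a patience of 100, training ends shortly after round 120.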

#### Average of the models' outputs on the validation data

In [15]:
first_judge = True

for model in models:
    x_pred = model.predict(X_valid, num_iteration=model.best_iteration)
    if first_judge:
        x_pred_ave = x_pred
        first_judge = False
    else:
        x_pred_ave = x_pred_ave + x_pred
        
x_pred_ave = x_pred_ave/num_fold
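
The loop above sums the predicted probabilities of all fold models and divides by the number of folds. A compact way to express the same element-wise average is sketched below with plain lists. The helper `average_predictions` and the toy probabilities are illustrative assumptions.

```python
def average_predictions(per_model_preds):
    """Average predicted probabilities element-wise across models."""
    n_models = len(per_model_preds)
    # zip(*...) groups the predictions of all models for the same row
    return [sum(vals) / n_models for vals in zip(*per_model_preds)]


# Toy predictions from three models for four rows
per_model_preds = [
    [0.10, 0.20, 0.30, 0.90],
    [0.20, 0.20, 0.50, 0.70],
    [0.30, 0.20, 0.40, 0.80],
]
ensemble = average_predictions(per_model_preds)
print([round(p, 2) for p in ensemble])
```

Dividing by the number of models actually collected, rather than a separately maintained constant, keeps the average correct if the fold count changes.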

#### F1 score on the validation data

In [69]:
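y_test = y_valid.values
X_pred = (x_pred_ave > 0.18).astype(int)   # label as ChargedOff when the averaged probability exceeds 0.18
f1_score(y_test, X_pred)                   # f1_score expects (y_true, y_pred)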

0.41372537465873155
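
The cut-off of 0.18 sits far below 0.5, which is plausible given that only about 20% of the rows are ChargedOff. One way such a threshold could be chosen is to sweep candidate values on the validation data and keep the one with the highest F1. The sketch below shows the idea on toy data. The helpers `f1` and `best_threshold` and all numbers are illustrative assumptions, not the procedure the notebook actually followed.

```python
def f1(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_threshold(y_true, probs, candidates):
    """Return (threshold, f1) for the candidate with the highest F1."""
    scored = []
    for t in candidates:
        y_pred = [1 if p > t else 0 for p in probs]  # same rule as (x_pred_ave > t)
        scored.append((f1(y_true, y_pred), t))
    best_f1, best_t = max(scored)
    return best_t, best_f1


# Toy validation labels and averaged probabilities
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.60]
best_t, best_f1 = best_threshold(y_true, probs, [0.1, 0.18, 0.3, 0.5])
print(best_t, round(best_f1, 4))
```

A threshold tuned this way is fitted to the validation split, so the resulting F1 is an optimistic estimate of the score on unseen data.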

In [74]:
pred_ave = (pred_ave > 0.18325).astype(int)
Id = test.id.astype(int)
my_solution = pd.DataFrame(pred_ave, Id, columns=['loan_status'])
my_solution.to_csv("my_prediction_data.csv", header=False)

In [75]:
from google.colab import files
files.download("my_prediction_data.csv")


In [73]:
my_solution.head()

| id | loan_status |
|---|---|
| 1496754 | 0.104368 |
| 84909594 | 0.152909 |
| 1165403 | 0.214976 |
| 91354446 | 0.297667 |
| 85636932 | 0.208847 |
