# 1.この課題の目的
- 機械学習の実践的な流れを知る
- 高い汎化性能を持つモデルを完成させる
以下の要件をすべて満たしていた場合、合格とします。

※Jupyter Notebookを使い課題に沿った検証や説明ができている。

# 2.機械学習フロー
1回目のSprintでは、機械学習の実践的な流れを抑えます。このSprintはWeek3,4の延長になり、引き続きKaggleの Home Credit Default Risk コンペティションを用います。特にここでは適切な 検証 を組み込むことを重視し、その上で前処理やモデルの選定を進めていきます。

適切な検証を行い、テストデータに対して高い汎化性能を持つモデルを完成させましょう。

# 【問題1】クロスバリデーション
事前学習期間は検証用データを分割しておき、それに対して指標値を計算することで検証を行っていました。しかし、分割の仕方により精度は変化します。実践的には クロスバリデーション を行います。

具体的には分割を複数回行い、それぞれに対して学習と検証を行う方法です。複数回の分割を行う関数はscikit-learnにKFoldとして用意されています。

[sklearn.model_selection.KFold — scikit-learn 0.20.2 documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold)



## 実行環境
### 前処理
- カテゴリカル変数
    - OneHotEncodingを実施
    
    
- 欠損値補完
    - 欠損率6割以上の特徴量を削除。　
    - マルチラベルのデータは最頻値、それ以外は平均値で補完。
    
- 特徴量抽出
    - 使用する特徴量は目的変数との相関係数の絶対値が高い10個を使用

### モデル
- ロジスティック回帰(パラメータは全てデフォルト)

## 確認結果
### ５分割交差検証を行い以下のValidation Scoreを確認→平均0.92

| 分割1      | 分割2      | 分割3      | 分割4      | 分割5      |
|------------|------------|------------|------------|------------|
| 0.91984131 | 0.91725472  | 0.91889695 | 0.91972619| 0.92057169 |

### →上記スコアよりロジスティック回帰モデルのみで高い分類精度、及び高い汎化性能を確認

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


## 使用クラスのインポート

In [0]:
import pandas as pd
import numpy as np
import os
from PIL import Image

# モデル作成時に使用するクラス
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
import lightgbm as lgb


# 前処理で使用するクラス
from sklearn.model_selection import train_test_split, KFold, cross_val_score,\
                                                                GridSearchCV, StratifiedKFold, cross_validate

## ①データ読み込み

In [5]:
!pwd

/content


In [0]:
# trainデータ
os.chdir('/content/drive/My Drive/Colab Notebooks/sprint1')
df_train = pd.read_csv('application_train.csv')

In [9]:
df_train.shape

(307511, 122)

In [10]:
# testデータ
df_test = pd.read_csv('application_test.csv')
df_test.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,...,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,Unaccompanied,Working,Higher education,Married,House / apartment,0.01885,-19241,-2329,-5170.0,-812,,1,1,0,1,0,1,,2.0,2,2,TUESDAY,18,0,0,0,0,0,0,Kindergarten,...,,0.0514,,,,block of flats,0.0392,"Stone, brick",No,0.0,0.0,0.0,0.0,-1740.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.035792,-18064,-4469,-9118.0,-1623,,1,1,0,1,0,0,Low-skill Laborers,2.0,2,2,FRIDAY,9,0,0,0,0,0,0,Self-employed,...,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,,Working,Higher education,Married,House / apartment,0.019101,-20038,-4458,-2175.0,-3503,5.0,1,1,0,1,0,0,Drivers,2.0,2,2,MONDAY,14,0,0,0,0,0,0,Transport: type 3,...,,,,,,,,,,0.0,0.0,0.0,0.0,-856.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.026392,-13976,-1866,-2000.0,-4208,,1,1,0,1,1,0,Sales staff,4.0,2,2,WEDNESDAY,11,0,0,0,0,0,0,Business Entity Type 3,...,0.2446,0.3739,0.0388,0.0817,reg oper account,block of flats,0.37,Panel,No,0.0,0.0,0.0,0.0,-1805.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.010032,-13040,-2191,-4000.0,-4262,16.0,1,1,1,1,0,0,,3.0,2,2,FRIDAY,5,0,0,0,0,1,1,Business Entity Type 3,...,,,,,,,,,,0.0,0.0,0.0,0.0,-821.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,


## ②前処理
### trainデータの特徴量を抽出

In [0]:
train_feature = df_train.drop('TARGET', axis=1)

### trainデータの目的変数を抽出

In [0]:
train_target = df_train.loc[:, 'TARGET']

### trainデータの特徴量とtestデータを結合

In [0]:
df_concat = pd.concat([train_feature, df_test], axis=0)

### 前処理１：OneHotEncoding

In [0]:
# trainデータとtestデータを縦に結合したデータに対してOneHotEncodingしたデータ
df_concat_dummies = pd.get_dummies(df_concat)

In [15]:
df_concat_dummies.shape

(356255, 245)

### 前処理２：欠損値補完
- 欠損率が6割以上の列(特徴量)は削除
- マルチラベルのデータは最頻値、それ以外の数値データは平均値で補完

In [0]:
# 欠損値の数(初期値 0)
missing_value_number = 0

# df_concat_dummiesの行数
df_length = len(df_concat_dummies)

# df_concat_dummiesのcolomusを１つずつ抽出し、変数colに格納
for col in df_concat_dummies.columns:
    
    #df_concat_dummiesのcol列の欠損値の数をmissing_value_numberに格納
    missing_value_number = df_concat_dummies[col].isnull().sum()
    
    # 欠損値の数が1以上か？
    if missing_value_number > 0:
        
        # 1以上→欠損値の割合が60%以上か？
        if missing_value_number/df_length >= 0.6:
            
            # 60%以上→col列を削除
            df_concat_dummies = df_concat_dummies.drop(col, axis=1)
            
        # 60%以下
        else:
            
            # col列のデータタイプがオブジェクト型か？
            if df_concat_dummies[col].dtype == object:
                
                # オブジェクト型→欠損値を最頻値で補完
                df_concat_dummies[col] = df_concat_dummies[col].fillna(train[col].mode()[0])
                
            else:
                
                # 非オブジェクト型→欠損値を平均値で補完
                df_concat_dummies[col] = df_concat_dummies[col].fillna(df_concat_dummies[col].mean())

In [17]:
df_concat_dummies.shape

(356255, 229)

### 結合したtrainデータとtestデータを分離

In [0]:
# 上から307511行はtrainデータ
df_train_prepro = df_concat_dummies.iloc[:307511,:]

# 307511行から下はtestデータ
df_test_prepro = df_concat_dummies.iloc[307511:, :]

In [19]:
train_target.shape

(307511,)

### 目的変数との相関係数を計算し、学習時に使用する特徴量を選定する

### 前処理後のtrainデータに目的変数列を結合

In [0]:
df_train_prepro = pd.concat([train_target, df_train_prepro], axis=1)

### 目的変数との相関係数を計算

In [21]:
df_train_prepro_corr = df_train_prepro.corr()
df_train_prepro_corr.head()

Unnamed: 0,TARGET,SK_ID_CURR,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,LANDAREA_AVG,LIVINGAREA_AVG,NONLIVINGAREA_AVG,...,ORGANIZATION_TYPE_Police,ORGANIZATION_TYPE_Postal,ORGANIZATION_TYPE_Realtor,ORGANIZATION_TYPE_Religion,ORGANIZATION_TYPE_Restaurant,ORGANIZATION_TYPE_School,ORGANIZATION_TYPE_Security,ORGANIZATION_TYPE_Security Ministries,ORGANIZATION_TYPE_Self-employed,ORGANIZATION_TYPE_Services,ORGANIZATION_TYPE_Telecom,ORGANIZATION_TYPE_Trade: type 1,ORGANIZATION_TYPE_Trade: type 2,ORGANIZATION_TYPE_Trade: type 3,ORGANIZATION_TYPE_Trade: type 4,ORGANIZATION_TYPE_Trade: type 5,ORGANIZATION_TYPE_Trade: type 6,ORGANIZATION_TYPE_Trade: type 7,ORGANIZATION_TYPE_Transport: type 1,ORGANIZATION_TYPE_Transport: type 2,ORGANIZATION_TYPE_Transport: type 3,ORGANIZATION_TYPE_Transport: type 4,ORGANIZATION_TYPE_University,ORGANIZATION_TYPE_XNA,FONDKAPREMONT_MODE_not specified,FONDKAPREMONT_MODE_org spec account,FONDKAPREMONT_MODE_reg oper account,FONDKAPREMONT_MODE_reg oper spec account,HOUSETYPE_MODE_block of flats,HOUSETYPE_MODE_specific housing,HOUSETYPE_MODE_terraced house,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
TARGET,1.0,-0.002108,0.019187,-0.003982,-0.030369,-0.012817,-0.039628,-0.037227,0.078239,-0.044932,0.041975,0.051457,0.000534,0.045982,0.028524,0.00037,-0.023806,-0.001758,0.009308,0.058899,0.060893,-0.024166,0.005576,0.006942,0.002819,0.044395,0.050994,0.032518,-0.099163,-0.160303,-0.157473,-0.019151,-0.013541,-0.006444,-0.021553,-0.012528,-0.028936,-0.006452,-0.02163,-0.00839,...,-0.009886,0.001125,0.003339,-0.001337,0.010266,-0.013671,0.007226,-0.00947,0.029139,-0.003871,-0.000712,0.001032,-0.003105,0.008911,-0.002621,-0.000904,-0.005788,0.00817,-0.003375,-0.000839,0.017552,0.005929,-0.007672,-0.045987,-0.002667,-0.011285,-0.022587,-0.011257,-0.040594,0.005311,0.000982,-0.006777,-0.001713,-0.009384,0.000628,-0.033119,-0.012657,0.007946,-0.042201,0.004829
SK_ID_CURR,-0.002108,1.0,-0.001129,-0.00182,-0.000343,-0.000433,-0.000235,0.000849,-0.0015,0.001366,-0.000973,-0.000384,0.002804,-0.001337,-0.000415,0.002815,0.002753,0.000281,-0.002895,-0.001075,-0.001138,0.00035,-0.000283,0.001097,0.002903,-0.001885,-0.001582,6.7e-05,5.4e-05,0.002339,0.0002,0.001086,-0.001338,0.001109,0.003319,-0.002033,0.003437,0.000934,0.001244,0.002038,...,0.001172,-0.003488,0.001141,0.000143,0.000115,-0.001418,-0.002631,-0.00314,0.00282,-0.003127,-0.001047,-0.00277,-0.000435,0.002994,0.002342,-0.000148,0.000408,0.003103,-0.000891,-0.002568,-0.000669,0.002658,-0.000496,0.001368,0.00043,0.00059,-0.000892,0.000749,0.001254,0.001344,0.000553,0.002073,-0.000976,7.2e-05,-0.001396,0.0023,-0.001281,-0.00027,0.00051,0.002549
CNT_CHILDREN,0.019187,-0.001129,1.0,0.012882,0.002145,0.021373,-0.00183,-0.025573,0.330938,-0.239818,0.183395,-0.028019,0.001041,0.240714,0.05563,-0.000794,-0.029906,0.022619,0.87916,0.025423,0.024781,-0.007292,-0.013319,0.008185,0.014835,0.020072,0.07065,0.069957,-0.096778,-0.017996,-0.038414,-0.008583,-0.005073,0.004739,-0.004352,-0.005466,-0.006295,-0.001844,-0.006583,7.2e-05,...,0.025215,0.014322,0.003246,0.007463,0.005339,0.023007,0.000918,0.019777,0.04803,0.008463,0.00087,0.001723,-0.005942,0.014946,0.002594,-0.001939,0.001178,0.018897,0.001264,0.021024,0.004209,0.011959,-0.000854,-0.240722,0.002147,-0.002805,-0.023784,-0.004729,-0.0364,-0.001821,-0.00183,-0.005272,-0.000709,0.001607,-0.002032,-0.020892,-0.025088,0.011036,-0.038644,0.004525
AMT_INCOME_TOTAL,-0.003982,-0.00182,0.012882,1.0,0.15687,0.191657,0.159604,0.074796,0.027261,-0.064223,0.027805,0.008506,0.000325,0.063994,-0.017193,-0.00829,0.000159,0.038378,0.016342,-0.085465,-0.091735,0.036459,0.031191,0.06234,0.058059,0.003574,0.006431,0.008285,0.024823,0.060917,-0.029194,0.032324,0.016106,0.005258,0.042173,0.004989,0.056564,-0.00154,0.037522,0.022793,...,0.014497,-0.012812,0.008201,0.000414,-0.003439,-0.009917,-0.000877,0.011856,-0.00216,0.005332,0.001391,-0.000485,0.007287,-0.003759,0.001359,0.000687,0.002131,-0.00084,0.002208,0.00238,0.001778,0.012323,0.005098,-0.064038,0.007965,0.014113,0.032194,0.007408,0.049553,-0.001914,0.000718,0.011696,0.006149,0.023886,0.003886,0.032753,0.016523,-0.003369,0.050174,-0.002894
AMT_CREDIT,-0.030369,-0.000343,0.002145,0.15687,1.0,0.770127,0.986608,0.099738,-0.055436,-0.066838,0.009621,-0.006575,0.001436,0.065519,-0.021085,0.023653,0.026213,0.016632,0.06316,-0.101776,-0.110915,0.052738,0.02401,0.051929,0.052609,-0.026886,-0.018856,8.1e-05,0.114494,0.131127,0.039335,0.044017,0.026437,0.004558,0.057396,0.01087,0.076114,0.004091,0.052817,0.026502,...,0.020268,-0.011871,0.01148,0.000236,-0.010779,0.008846,-0.003964,0.014016,-0.009372,0.006565,0.002488,-0.002245,-0.004799,-0.006185,0.000726,-0.00073,0.003274,0.002053,0.002425,0.000133,-0.00999,0.012154,0.017552,-0.065594,0.013042,0.014612,0.03897,0.010187,0.057524,-0.005219,0.000869,0.014314,0.007987,0.027255,0.005799,0.046644,0.009756,-0.007373,0.058256,-0.004308


### TARGET列を表示

In [22]:
df_train_prepro_corr['TARGET'][:5:]

TARGET              1.000000
SK_ID_CURR         -0.002108
CNT_CHILDREN        0.019187
AMT_INCOME_TOTAL   -0.003982
AMT_CREDIT         -0.030369
Name: TARGET, dtype: float64

### TARGET列からTARGET行、SK_ID_CURR行を削除
- 特徴量として使用できない為

In [23]:
df_train_prepro_corr_target = df_train_prepro_corr.loc[:,'TARGET'].drop(['TARGET', 'SK_ID_CURR'])
df_train_prepro_corr_target.head()

CNT_CHILDREN        0.019187
AMT_INCOME_TOTAL   -0.003982
AMT_CREDIT         -0.030369
AMT_ANNUITY        -0.012817
AMT_GOODS_PRICE    -0.039628
Name: TARGET, dtype: float64

### absメソッドにて絶対値とし、その状態でsort_valuesメソッドで降順にソートする

In [24]:
train_feature_corr_target_sort = df_train_prepro_corr_target.abs().sort_values(ascending = False) 
train_feature_corr_target_sort.head()

EXT_SOURCE_2                   0.160303
EXT_SOURCE_3                   0.157473
EXT_SOURCE_1                   0.099163
DAYS_BIRTH                     0.078239
REGION_RATING_CLIENT_W_CITY    0.060893
Name: TARGET, dtype: float64

### 上位10個の特徴量名を格納

In [25]:
train_top10_index = train_feature_corr_target_sort[:10].index
train_top10_index

Index(['EXT_SOURCE_2', 'EXT_SOURCE_3', 'EXT_SOURCE_1', 'DAYS_BIRTH',
       'REGION_RATING_CLIENT_W_CITY', 'REGION_RATING_CLIENT',
       'NAME_INCOME_TYPE_Working', 'NAME_EDUCATION_TYPE_Higher education',
       'DAYS_LAST_PHONE_CHANGE', 'CODE_GENDER_M'],
      dtype='object')

### 前処理３：上位10個の特徴量を抽出

In [26]:
df_train_prepro2 = df_train_prepro.loc[:,train_top10_index]
df_train_prepro2.head()

Unnamed: 0,EXT_SOURCE_2,EXT_SOURCE_3,EXT_SOURCE_1,DAYS_BIRTH,REGION_RATING_CLIENT_W_CITY,REGION_RATING_CLIENT,NAME_INCOME_TYPE_Working,NAME_EDUCATION_TYPE_Higher education,DAYS_LAST_PHONE_CHANGE,CODE_GENDER_M
0,0.262949,0.139376,0.083037,-9461,2,2,1,0,-1134.0,1
1,0.622246,0.50935,0.311267,-16765,1,1,0,1,-828.0,0
2,0.555912,0.729567,0.501965,-19046,2,2,1,0,-815.0,1
3,0.650442,0.50935,0.501965,-19005,2,2,1,0,-617.0,0
4,0.322738,0.50935,0.501965,-19932,2,2,1,0,-1106.0,1


In [27]:
df_test_prepro2 = df_test_prepro.loc[:,train_top10_index]
df_test_prepro2.head()

Unnamed: 0,EXT_SOURCE_2,EXT_SOURCE_3,EXT_SOURCE_1,DAYS_BIRTH,REGION_RATING_CLIENT_W_CITY,REGION_RATING_CLIENT,NAME_INCOME_TYPE_Working,NAME_EDUCATION_TYPE_Higher education,DAYS_LAST_PHONE_CHANGE,CODE_GENDER_M
0,0.789654,0.15952,0.752614,-19241,2,2,1,1,-1740.0,0
1,0.291656,0.432962,0.56499,-18064,2,2,1,0,0.0,1
2,0.699787,0.610991,0.501965,-20038,2,2,1,1,-856.0,1
3,0.509677,0.612704,0.525734,-13976,2,2,1,0,-1805.0,0
4,0.425687,0.50935,0.202145,-13040,2,2,1,0,-821.0,1


## 学習データと検証データに分割
- 処理能力の関係で使用する行を制限する 
    - 307511行の内、先頭から10000行のみ使用する

In [28]:
df_train_prepro2.shape

(307511, 10)

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df_train_prepro2[:1000], train_target[:1000])

## ③クロスバリデーション

### KFoldクラスのインスタンスを作成

In [0]:
kfold = KFold(n_splits=5)

### LogisticRegressionクラスのインスタンスを作成

In [0]:
rfc = RandomForestClassifier()

In [32]:
print('Cross-validation scores:\n{}'.format(\
                                           cross_val_score(rfc, X_train, y_train, cv=kfold)))

Cross-validation scores:
[0.91333333 0.93333333 0.88666667 0.94666667 0.95333333]




In [33]:
total = [0.91855096, 0.91837752, 0.9189195,  0.91987166, 0.9192863]
print('平均スコア = {:.2f}'.format(sum(total) / len(total)))

平均スコア = 0.92


In [34]:
rfc.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

### 検証データのスコア

In [35]:
print('val score: {}'.format(rfc.score(X_test, y_test)))

val score: 0.932


In [36]:
df_test.iloc[:,0][:5]

0    100001
1    100005
2    100013
3    100028
4    100038
Name: SK_ID_CURR, dtype: int64

### テストデータのスコア

In [37]:
pred_proba = rfc.predict_proba(df_test_prepro2)[:,1].reshape([-1,1])
pred_proba_df = pd.DataFrame(pred_proba, columns=['TARGET'])
pred_proba_df.head()

Unnamed: 0,TARGET
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0


In [38]:
pred_proba.shape

(48744, 1)

In [39]:
test_id_df = pd.DataFrame(df_test.iloc[:,0])
test_id_df.head()

Unnamed: 0,SK_ID_CURR
0,100001
1,100005
2,100013
3,100028
4,100038


In [40]:
predict_test = pd.concat([test_id_df, pred_proba_df], axis=1)
predict_test.head()

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.0
1,100005,0.0
2,100013,0.0
3,100028,0.0
4,100038,0.0


In [None]:
predict_test.to_csv("submission_0717_p1_r1.csv", index=False, encoding='utf-8')

## 評価結果→58.815%
![スコア](https://drive.google.com/uc?id=1idQuiq_3f2oUV_-32626ZEciqu_pg3TV)

# 【問題2】グリッドサーチ
これまで分類器のパラメータは基本的にデフォルトの設定を使用していました。パラメータの詳細は今後のSprintで学んでいくことになりますが、パラメータは状況に応じて最適なものを選ぶ必要があります。パラメータを探索するために グリッドサーチ と呼ばれる総当たり的手法が一般的に利用されます。
グリッドサーチをパイプラインの中に組み込みましょう。

## 実行環境
### 前処理
- 問題1で行なった前処理

### モデル
- LogisticRegressionの以下のパラメータの中でグリッドサーチを実施
    - penalty : L1 or L2
    - gamma : [0.001, 0.01, 0.1, 1]
    - C : np.logspace(-4, 4, 20)→マイナス4~４の間を20等分する値をnumpy配列 X とした場合、10**Xとなるnumpy配列
    - solver : liblinear

## 確認結果
### ５分

### パイプライン+グリッドサーチ を行うインスタンスの作成

In [0]:
pipe = Pipeline([('classifier', None)])

param_grid =[
        {'classifier' : [RandomForestClassifier()],
         'classifier__n_estimators': [200, 500],
         'classifier__max_features': ['auto', 'sqrt', 'log2'],
         'classifier__max_depth' : [4,5,6,7,8],
         'classifier__criterion' :['gini', 'entropy']}
]

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='roc_auc')

### 学習

In [44]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None, steps=[('classifier', None)],
                                verbose=False),
             iid='warn', n_jobs=None,
             param_grid=[{'classifier': [RandomForestClassifier(bootstrap=True,
                                                                class_weight=None,
                                                                criterion='entropy',
                                                                max_depth=4,
                                                                max_features='sqrt',
                                                                max_leaf_nodes=None,
                                                                min_impurity_decrease=0.0,
                                                                min_impurity_split=Non...
                                                                min_weight_fraction_leaf=0.0,
                 

### 最も精度が高い時のパラメータ

In [45]:
print("best params: {}".format(grid.best_params_))

best params: {'classifier': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=4, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False), 'classifier__criterion': 'entropy', 'classifier__max_depth': 4, 'classifier__max_features': 'sqrt', 'classifier__n_estimators': 200}


### 検証データでの精度

In [46]:
print("test score: {}".format(grid.score(X_test, y_test)))

test score: 0.7126068376068376


In [47]:
df_test_prepro2.shape

(48744, 10)

### テストデータでの評価

In [48]:
pred_proba_r2 = grid.predict_proba(df_test_prepro2)[:,1].reshape([-1,1])
pred_proba_r2_df = pd.DataFrame(pred_proba_r2, columns=['TARGET'])
pred_proba_r2_df.head()

Unnamed: 0,TARGET
0,0.021853
1,0.117445
2,0.007411
3,0.039261
4,0.073373


In [49]:
predict_test_r2 = pd.concat([test_id_df, pred_proba_r2_df], axis=1)
predict_test_r2.head()

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.021853
1,100005,0.117445
2,100013,0.007411
3,100028,0.039261
4,100038,0.073373


In [None]:
predict_test_r2.to_csv("submission_0717_p2_r1.csv", index=False, encoding='utf-8')

## 評価結果→69.291%
![ ](https://drive.google.com/uc?id=1XzKe17xZn0ljchkCAdd5IpRj0s3Httc5)

## 考察　
- テストデータのスコアが問題1より問題2の方が高い値となっている為、モデルのパラメータをグリッドサーチにて調整する事で、より精度の高いモデルを作成できた事を確認。

# 【問題3】Kernelからの調査
KaggleのKernelから自身にはなかったアイデアを見つけ出して、列挙してください。そして、効果があると考えられるものを検証してください。

## 案①複数のモデル導入＋層化K分割交差検証により最適なモデルを選定
### 導入理由及びメリット
#### データセットに対して最適なモデルを確認可能
- データセットの規模や構造によってはモデルによって得意不得意がある為

#### 汎化性能が高いと想定されるモデルを確認可能
- 層化k分割交差検証を実施し求められたスコアを使用し、最適なモデルを決定する為、汎化性能が高いと想定されるモデルを確認可能

### デメリット
#### モデル選定に時間がかかる
- 層化k分割交差検証を実施するため。

## 案②LightGBMモデルの導入
### 導入理由及びメリット

#### 大規模データセットにも対応可能
- 他のモデルと比較し学習時の所要時間が短く、メモリ効率が高い為。

#### 学習時の所要時間がRandomForestClassiferより短い
- 通常の決定木では最適な枝分かれのポイントを探す際に、全てのデータポイントを読み込む必要があるが、LightGBMでは訓練データの特徴量を階級に分けてヒストグラム化する事で厳密な枝分かれを探していない。
    
    
#### メモリ効率が高い
- 計算値をヒストグラムとして扱う為。

#### 推測精度が高い
- 他のブースティングモデルと比較した場合、Leaf-Wiseのため推測精度が改善する傾向になるため。(Leaf-Wiseの方がLevel-Wiseと比較して、より複雑な決定木となる)

### デメリット
#### 過学習しやすくなる
- Leaf-wiseは決定木がより複雑になる為。

### 参考したKaggleのkernel
[Home Credit Default Risk Using LightGBM](https://www.kaggle.com/zonnalobo/home-credit-default-risk-using-lightgbm)
- 「Just for Fun: Light Gradient Boosting Machine」の欄 Input55

## 【問題4】高い汎化性能のモデル
これまで学んだことを用いながら汎化性能の高いモデルを作成してください。今は全体の流れを掴むことを重視し、Sprintの時間内に結果を出すということも意識しましょう。

## 問題3に記載した案①と②を既存のモデルに導入

## 結果→ロジスティック回帰モデルが最適なモデルと確認

- 今回のKaggleの評価指標がAUCの為。

| モデル             | Accuracy | ROC_AUC  | F1-score | PR_AUC   |
|--------------------|----------|----------|----------|----------|
| ロジスティック回帰 | 0.741511 | 0.735263 | 0.800833 | 0.293847 |
| 決定木             | 0.884001 | 0.534607 | 0.879862 | 0.095984 |
| ランダムフォレスト | 0.924078 | 0.646853 | 0.89144  | 0.192066 |
| LightGBM           | 0.898705 | 0.68747  | 0.888951 | 0.218586 |

### スコアを確認するモデルをリストに格納する

In [0]:
models = []
models.append(('lr', LogisticRegression(class_weight='balanced')))
models.append(('dtc', DecisionTreeClassifier(class_weight='balanced')))
models.append(('rfc', RandomForestClassifier(class_weight='balanced')))
models.append(('GBM', lgb.LGBMClassifier(class_weight='balanced')))

### 確認する評価指標

In [0]:
scorer = ('accuracy','roc_auc','f1_weighted','average_precision')

### 層化K分割交差検証の実施

In [62]:
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=217, shuffle=True)
    cv_results = cross_validate(model, X_train, y_train,cv=kfold, scoring=scorer)
    print('\n')
    cv_results1=cv_results['test_accuracy']
    cv_results2=cv_results['test_roc_auc']
    cv_results3=cv_results['test_f1_weighted']
    cv_results4=cv_results['test_average_precision']
    msg = "%s by Accuracy: %f(%f), by ROC_AUC: %f(%f), by F1-score: %f(%f), PR_AUC: %f(%f)" % (name, np.mean(cv_results1),
        np.std(cv_results1),np.mean(cv_results2),np.std(cv_results2),np.mean(cv_results3),np.std(cv_results3),
        np.mean(cv_results4),np.std(cv_results4))
    print(msg)
    print('\n')





lr by Accuracy: 0.741511(0.035392), by ROC_AUC: 0.735263(0.087469), by F1-score: 0.800833(0.026917), PR_AUC: 0.293847(0.149641)




dtc by Accuracy: 0.884001(0.026951), by ROC_AUC: 0.534607(0.057322), by F1-score: 0.879862(0.018636), PR_AUC: 0.095984(0.030199)




  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)




rfc by Accuracy: 0.924078(0.007758), by ROC_AUC: 0.646853(0.098009), by F1-score: 0.891440(0.008508), PR_AUC: 0.192066(0.073497)




GBM by Accuracy: 0.898705(0.018713), by ROC_AUC: 0.687470(0.129596), by F1-score: 0.888951(0.018142), PR_AUC: 0.218586(0.096596)




### グリッドサーチの実施

In [0]:
pipe = Pipeline([('classifier', None)])

param_grid =[
        {'classifier' : [LogisticRegression()],
         "classifier__C":np.logspace(-3,3,7), 
          "classifier__penalty":["l1","l2"]}
]

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='roc_auc')

### 学習

In [65]:
grid.fit(X_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None, steps=[('classifier', None)],
                                verbose=False),
             iid='warn', n_jobs=None,
             param_grid=[{'classifier': [LogisticRegression(C=10.0,
                                                            class_weight=None,
                                                            dual=False,
                                                            fit_intercept=True,
                                                            intercept_scaling=1,
                                                            l1_ratio=None,
                                                            max_iter=100,
                                                            multi_class='warn',
                                                            n_jobs=None,
                                                            penalty='l1',
                               

### テストデータでの評価

In [66]:
pred_proba_r3 = grid.predict_proba(df_test_prepro2)[:,1].reshape([-1,1])
pred_proba_r3_df = pd.DataFrame(pred_proba_r2, columns=['TARGET'])
pred_proba_r3_df.head()

Unnamed: 0,TARGET
0,0.021853
1,0.117445
2,0.007411
3,0.039261
4,0.073373


In [68]:
predict_test_r3 = pd.concat([test_id_df, pred_proba_r3_df], axis=1)
predict_test_r3.head()

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.021853
1,100005,0.117445
2,100013,0.007411
3,100028,0.039261
4,100038,0.073373


In [0]:
predict_test_r3.to_csv("submission_0717_p3_r1.csv", index=False, encoding='utf-8')

## 評価結果→69.827%
![ ](https://drive.google.com/uc?id=168e30B7q73fsI44FID4yNd1JJ6MfZnHM)

## 考察
- 問題④のモデルの評価結果が問題①②のモデルより高い値となっており、問題③の施策を行う事でより高い精度と汎化性能のモデルが出来た事を確認。

## 今後の課題
### 学習時により大量のデータを使用する
- 今回は動作環境の処理能力の都合上、学習時に使用したデータが全35万行のうち1000行しか使用していない為、より多くのデータを使用する事で、精度の向上が期待できる。