- competition/dataset : [https://www.kaggle.com/c/costa-rican-household-poverty-prediction](https://www.kaggle.com/c/costa-rican-household-poverty-prediction)
- date : 2021/02/08
- original : [https://www.kaggle.com/skooch/xgboost](https://www.kaggle.com/skooch/xgboost)

## XGBoost

**✏ Transcription pass 1** 

### 1. LGBM with random split for early stopping
**Edits by Eric Antoine Scuccimarra:**  
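Based on [Misha Losvyi's notebook](https://www.kaggle.com/mlisovyi/feature-engineering-lighgbm-with-f1-macro), with the following changes:  
* Uses XGBoost instead of the LightGBM model  
* Fits VotingClassifiers of random forests and combines the XGB results with the RF results  
* Adds features  
* Edits the code  
* Rather than splitting the data once and using the validation data for LGBM early stopping, the data is split during training so that the entire training set can be trained on. This worked better than a k-fold split here.  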

The additional features were taken from [Kuriyaman's notebook](https://www.kaggle.com/kuriyaman1002/reduce-features-140-84-keeping-f1-score).  

**Notes from Original Kernel (edited by EAS):**  
The content is similar to Misha Losvyi's notebook, but instead of optimizing hyperparameters it reuses the optimal values found in that kernel, so it runs much faster.  

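Key points:  
* **This kernel trains only on heads of household** (after extracting aggregate information about the household). This follows the announced scoring scheme, under which only heads of household are scored. All household members are included in the test set and the sample submission, but only heads of household are scored. However, according to [https://www.kaggle.com/c/costa-rican-household-poverty-prediction/discussion/61403#360115](https://www.kaggle.com/c/costa-rican-household-poverty-prediction/discussion/61403#360115), non-head members currently appear to be evaluated as well. Indeed, in a submission scoring ~0.4 on the public leaderboard (PLB), changing all non-head entries to class 1 drops the score to ~0.2 PLB.  
* **Balancing class frequencies seems to matter a lot.** Without balancing, the trained model scores ~0.39 PLB / ~0.43 local test; with balancing it scores ~0.42 PLB / ~0.47 local test. Balancing can be done by hand or through undersampling. However, the simplest approach, and a more powerful one than undersampling, is to set ```class_weight='balanced'``` when creating the LightGBM model through the sklearn API.  
* **This kernel uses the macro F1 score for early stopping during training.** This keeps training aligned with the competition's scoring strategy.  
* Categorical fields are converted to numbers through a proper mapping rather than an arbitrary label encoding. 
* **OHE is reversed into label encoding, because that is easier for a tree model to digest.** This trick can be harmful for non-tree models, so use it with care.  
* **idhogar is not used for training.** The only way it could carry information is through a data leak. We are fighting poverty here, and exploiting a leak would not reduce poverty in any way.  
* **Aggregations are done within households, and new features are hand-crafted.** Note that few features can be aggregated, because most are already household-level data.  
* **A voting classifier is used to average over several LightGBM models.**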
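
The class-balancing point above can be illustrated with a minimal sketch. It assumes the usual "balanced" heuristic, in which each class is weighted by `n_samples / (n_classes * class_count)`. The function name `balanced_class_weights` and the toy labels are hypothetical and are not part of the original kernel.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical, made-up label distribution (not the competition data)
toy_labels = [0] * 1 + [1] * 2 + [2] * 2 + [3] * 5
weights = balanced_class_weights(toy_labels)

# Per-sample weights: rare classes get large weights, common ones small
sample_weights = [weights[c] for c in toy_labels]
```

With these weights every class contributes the same total weight to the loss, which is why a rare class is no longer drowned out by a frequent one.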

In [260]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import f1_score
from joblib import Parallel, delayed
# from sklearn.externals.joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.ensemble import VotingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.utils import class_weight

import warnings
warnings.filterwarnings('ignore')

In [261]:
# Categorical variable mapping
from sklearn.preprocessing import LabelEncoder

# Only the idhogar field is transformed here
def encode_data(df):
    df['idhogar'] = LabelEncoder().fit_transform(df['idhogar'])

# Feature importances of a fitted sklearn-style tree ensemble
def feature_importance(forest, x_train, display_results=True):
    ranked_list = []
    zero_features = []
    
    importances = forest.feature_importances_
    
    indices = np.argsort(importances)[::-1]
    if display_results:
        print('Feature ranking:')
    
    for f in range(x_train.shape[1]):
        if display_results:
            print('%d. feature %d (%f) - %s'%(f+1, indices[f], importances[indices[f]], x_train.columns[indices[f]]))
        ranked_list.append(x_train.columns[indices[f]])
        
        if importances[indices[f]] == 0.0:
            zero_features.append(x_train.columns[indices[f]])
    
    return ranked_list, zero_features

In [262]:
def do_features(df):
    feats_div = [
        ('children_fraction', 'r4t1', 'r4t3'), 
        ('working_man_fraction', 'r4h2', 'r4t3'),
        ('all_man_fraction', 'r4h3', 'r4t3'),
        ('human_density', 'tamviv', 'rooms'),
        ('human_bed_density', 'tamviv', 'bedrooms'),
        ('rent_per_person', 'v2a1', 'r4t3'),
        ('rent_per_room', 'v2a1', 'rooms'),
        ('mobile_density', 'qmobilephone', 'r4t3'),
        ('tablet_density', 'v18q1', 'r4t3'),
        ('mobile_adult_density', 'qmobilephone', 'r4t2'),
        ('tablet_adult_density', 'v18q1', 'r4t2'),
    ]
    
    feats_sub = [('people_not_living', 'tamhog', 'tamviv'),
                 ('people_weird_stat', 'tamhog', 'r4t3')
                ]
    
    for f_new, f1, f2 in feats_div:
        df['fe_' + f_new] = (df[f1]/df[f2]).astype(np.float32)
    for f_new, f1, f2 in feats_sub:
        df['fe_' + f_new] = (df[f1]-df[f2]).astype(np.float32)
    
    # Aggregation rules per household
    aggs_num = {'age':['min', 'max', 'mean'],
                'escolari': ['min', 'max', 'mean']
               }
    
    aggs_cat = {'dis':['mean']}
    for s_ in ['estadocivil', 'parentesco', 'instlevel']:
        for f_ in [f_ for f_ in df.columns if f_.startswith(s_)]:
            aggs_cat[f_] = ['mean', 'count']
    
    # Aggregate per household (adults only)
    for name_, df_ in [('18', df.query('age >= 18'))]:
        df_agg = df_.groupby('idhogar').agg({**aggs_num, **aggs_cat}).astype(np.float32)
        df_agg.columns = pd.Index(
            ['agg' + name_ + '_' + e[0] + '_' + e[1].upper() for e in df_agg.columns.tolist()]
        )
        df = df.join(df_agg, how='left', on='idhogar')
        del df_agg
    
    # Drop the Id column
    df.drop('Id', axis=1, inplace=True)
    
    return df
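
As a rough illustration of the household aggregation step in `do_features`, the sketch below aggregates adults per household and joins the result back onto every member row. The toy data frame and its values are hypothetical. Only the `groupby` / `agg` / `join` pattern mirrors the function above.

```python
import pandas as pd

# Hypothetical toy rows: one row per person, two households
people = pd.DataFrame({
    'idhogar': ['a', 'a', 'a', 'b', 'b'],
    'age': [40, 38, 10, 70, 16],
    'escolari': [11, 9, 3, 6, 9],
})

# Aggregate adults only, as do_features does with df.query('age >= 18')
adults = people.query('age >= 18')
agg = adults.groupby('idhogar').agg({'age': ['min', 'max', 'mean'],
                                     'escolari': ['min', 'max', 'mean']})

# Flatten the two-level column index into names like agg18_age_MAX
agg.columns = ['agg18_' + col + '_' + stat.upper() for col, stat in agg.columns]

# Broadcast the household-level values back onto every member row
people = people.join(agg, how='left', on='idhogar')
```

Note that children also receive the adult aggregates of their household, so the head-of-household row used for training carries information about all adult members.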

In [263]:
# One-hot encoded fields -> label encoding
def convert_OHE2LE(df):
    tmp_df = df.copy(deep=True)
    for s_ in ['pared', 'piso', 'techo', 'abastagua', 'sanitario', 'energcocinar', 'elimbasu', 
               'epared', 'etecho', 'eviv', 'estadocivil', 'parentesco', 
               'instlevel', 'lugar', 'tipovivi', 'manual_elec']:
        if 'manual_' not in s_:
            cols_s_ = [f_ for f_ in df.columns if f_.startswith(s_)]
        elif 'elec' in s_:
            cols_s_ = ['public', 'planpri', 'noelec', 'coopele']
        sum_ohe = tmp_df[cols_s_].sum(axis=1).unique()
        
        # Some rows have no flag set (row sum is 0)
        if 0 in sum_ohe:
            print('The OHE in {} is incomplete. A new column will be added before label encoding'.format(s_))
            # Name of the dummy column to add
            col_dummy = s_ + '_dummy'
            # Add the column to the dataframe
            tmp_df[col_dummy] = (tmp_df[cols_s_].sum(axis=1) == 0).astype(np.int8)
            # Append it to the column list used for label encoding
            cols_s_.append(col_dummy)
            # Verify that the encoding is now complete
            sum_ohe = tmp_df[cols_s_].sum(axis=1).unique()
            if 0 in sum_ohe:
                print('The category completion did not work')
        
        tmp_cat = tmp_df[cols_s_].idxmax(axis=1)
        tmp_df[s_ + '_LE'] = LabelEncoder().fit_transform(tmp_cat).astype(np.int16)
        
        if 'parentesco1' in cols_s_:
            cols_s_.remove('parentesco1')
        tmp_df.drop(cols_s_, axis=1, inplace=True)
    return tmp_df
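
The core of `convert_OHE2LE` can be sketched on a tiny example: complete the one-hot block with a dummy column, recover the active column name with `idxmax`, then map names to integers. The toy frame is hypothetical, and a plain dictionary stands in for `LabelEncoder` here (an assumption made to keep the sketch dependency-light). Both assign codes in sorted order of the category names.

```python
import pandas as pd

# Hypothetical one-hot block with an incomplete encoding: row 2 has no flag set
ohe = pd.DataFrame({
    'techozinc': [1, 0, 0],
    'techocane': [0, 1, 0],
})
cols = list(ohe.columns)

# Add a dummy category for rows whose flags sum to zero
ohe['techo_dummy'] = (ohe[cols].sum(axis=1) == 0).astype('int8')
cols.append('techo_dummy')

# idxmax returns the name of the column holding the 1 in each row
category = ohe[cols].idxmax(axis=1)

# Map names to integer codes in sorted order of the names
codes = {name: i for i, name in enumerate(sorted(category.unique()))}
ohe['techo_LE'] = category.map(codes)
```

The resulting integer codes carry no natural order, which is acceptable for tree models that split on thresholds but is the reason the notes warn against using this trick with non-tree models.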

### 2. Read in data and clean it up

In [264]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

test_ids = test.Id

In [265]:
description = [
("v2a1"," Monthly rent payment"),
("hacdor"," =1 Overcrowding by bedrooms"),
("rooms","  number of all rooms in the house"),
("hacapo"," =1 Overcrowding by rooms"),
("v14a"," =1 has toilet in the household"),
("refrig"," =1 if the household has refrigerator"),
("v18q"," owns a tablet"),
("v18q1"," number of tablets household owns"),
("r4h1"," Males younger than 12 years of age"),
("r4h2"," Males 12 years of age and older"),
("r4h3"," Total males in the household"),
("r4m1"," Females younger than 12 years of age"),
("r4m2"," Females 12 years of age and older"),
("r4m3"," Total females in the household"),
("r4t1"," persons younger than 12 years of age"),
("r4t2"," persons 12 years of age and older"),
("r4t3"," Total persons in the household"),
("tamhog"," size of the household"),
("tamviv"," number of persons living in the household"),
("escolari"," years of schooling"),
("rez_esc"," Years behind in school"),
("hhsize"," household size"),
("paredblolad"," =1 if predominant material on the outside wall is block or brick"),
("paredzocalo"," =1 if predominant material on the outside wall is socket (wood, zinc or absbesto"),
("paredpreb"," =1 if predominant material on the outside wall is prefabricated or cement"),
("pareddes"," =1 if predominant material on the outside wall is waste material"),
("paredmad"," =1 if predominant material on the outside wall is wood"),
("paredzinc"," =1 if predominant material on the outside wall is zink"),
("paredfibras"," =1 if predominant material on the outside wall is natural fibers"),
("paredother"," =1 if predominant material on the outside wall is other"),
("pisomoscer"," =1 if predominant material on the floor is mosaic ceramic   terrazo"),
("pisocemento"," =1 if predominant material on the floor is cement"),
("pisoother"," =1 if predominant material on the floor is other"),
("pisonatur"," =1 if predominant material on the floor is  natural material"),
("pisonotiene"," =1 if no floor at the household"),
("pisomadera"," =1 if predominant material on the floor is wood"),
("techozinc"," =1 if predominant material on the roof is metal foil or zink"),
("techoentrepiso"," =1 if predominant material on the roof is fiber cement,   mezzanine "),
("techocane"," =1 if predominant material on the roof is natural fibers"),
("techootro"," =1 if predominant material on the roof is other"),
("cielorazo"," =1 if the house has ceiling"),
("abastaguadentro"," =1 if water provision inside the dwelling"),
("abastaguafuera"," =1 if water provision outside the dwelling"),
("abastaguano"," =1 if no water provision"),
("public"," =1 electricity from CNFL,  ICE, ESPH/JASEC"),
("planpri"," =1 electricity from private plant"),
("noelec"," =1 no electricity in the dwelling"),
("coopele"," =1 electricity from cooperative"),
("sanitario1"," =1 no toilet in the dwelling"),
("sanitario2"," =1 toilet connected to sewer or cesspool"),
("sanitario3"," =1 toilet connected to  septic tank"),
("sanitario5"," =1 toilet connected to black hole or letrine"),
("sanitario6"," =1 toilet connected to other system"),
("energcocinar1"," =1 no main source of energy used for cooking (no kitchen)"),
("energcocinar2"," =1 main source of energy used for cooking electricity"),
("energcocinar3"," =1 main source of energy used for cooking gas"),
("energcocinar4"," =1 main source of energy used for cooking wood charcoal"),
("elimbasu1"," =1 if rubbish disposal mainly by tanker truck"),
("elimbasu2"," =1 if rubbish disposal mainly by botan hollow or buried"),
("elimbasu3"," =1 if rubbish disposal mainly by burning"),
("elimbasu4"," =1 if rubbish disposal mainly by throwing in an unoccupied space"),
("elimbasu5"," =1 if rubbish disposal mainly by throwing in river,   creek or sea"),
("elimbasu6"," =1 if rubbish disposal mainly other"),
("epared1"," =1 if walls are bad"),
("epared2"," =1 if walls are regular"),
("epared3"," =1 if walls are good"),
("etecho1"," =1 if roof are bad"),
("etecho2"," =1 if roof are regular"),
("etecho3"," =1 if roof are good"),
("eviv1"," =1 if floor are bad"),
("eviv2"," =1 if floor are regular"),
("eviv3"," =1 if floor are good"),
("dis"," =1 if disable person"),
("male"," =1 if male"),
("female"," =1 if female"),
("estadocivil1"," =1 if less than 10 years old"),
("estadocivil2"," =1 if free or coupled uunion"),
("estadocivil3"," =1 if married"),
("estadocivil4"," =1 if divorced"),
("estadocivil5"," =1 if separated"),
("estadocivil6"," =1 if widow/er"),
("estadocivil7"," =1 if single"),
("parentesco1"," =1 if household head"),
("parentesco2"," =1 if spouse/partner"),
("parentesco3"," =1 if son/doughter"),
("parentesco4"," =1 if stepson/doughter"),
("parentesco5"," =1 if son/doughter in law"),
("parentesco6"," =1 if grandson/doughter"),
("parentesco7"," =1 if mother/father"),
("parentesco8"," =1 if father/mother in law"),
("parentesco9"," =1 if brother/sister"),
("parentesco10"," =1 if brother/sister in law"),
("parentesco11"," =1 if other family member"),
("parentesco12"," =1 if other non family member"),
("idhogar"," Household level identifier"),
("hogar_nin"," Number of children 0 to 19 in household"),
("hogar_adul"," Number of adults in household"),
("hogar_mayor"," # of individuals 65+ in the household"),
("hogar_total"," # of total individuals in the household"),
("dependency"," Dependency rate"),
("edjefe"," years of education of male head of household"),
("edjefa"," years of education of female head of household"),
("meaneduc","average years of education for adults (18+)"),
("instlevel1"," =1 no level of education"),
("instlevel2"," =1 incomplete primary"),
("instlevel3"," =1 complete primary"),
("instlevel4"," =1 incomplete academic secondary level"),
("instlevel5"," =1 complete academic secondary level"),
("instlevel6"," =1 incomplete technical secondary level"),
("instlevel7"," =1 complete technical secondary level"),
("instlevel8"," =1 undergraduate and higher education"),
("instlevel9"," =1 postgraduate higher education"),
("bedrooms"," number of bedrooms"),
("overcrowding"," # persons per room"),
("tipovivi1"," =1 own and fully paid house"),
("tipovivi2"," =1 own, paying in installments"),
("tipovivi3"," =1 rented"),
("tipovivi4"," =1 precarious"),
("tipovivi5"," =1 other(assigned, borrowed)"),
("computer"," =1 if the household has notebook or desktop computer"),
("television"," =1 if the household has TV"),
("mobilephone"," =1 if mobile phone"),
("qmobilephone"," # of mobile phones"),
("lugar1"," =1 region Central"),
("lugar2"," =1 region Chorotega"),
("lugar3"," =1 region Pacífico central"),
("lugar4"," =1 region Brunca"),
("lugar5"," =1 region Huetar Atlántica"),
("lugar6"," =1 region Huetar Norte"),
("area1"," =1 zona urbana"),
("area2"," =2 zona rural"),
("age"," Age in years"),
("SQBescolari"," escolari squared"),
("SQBage"," age squared"),
("SQBhogar_total"," hogar_total squared"),
("SQBedjefe"," edjefe squared"),
("SQBhogar_nin"," hogar_nin squared"),
("SQBovercrowding"," overcrowding squared"),
("SQBdependency"," dependency squared"),
("SQBmeaned"," meaned squared"),
("agesq"," Age squared"),]

description = pd.DataFrame(description, columns=['varname', 'description'])

In [266]:
def process_df(df_):
    # idhogar 인코딩
    encode_data(df_)
    
    # 집계 feature 생성
    return do_features(df_)

train = process_df(train)
test = process_df(test)

Clean up missing values and convert string fields to numeric.

In [267]:
# dependency has missing values, so use the square root of SQBdependency instead
train['dependency'] = np.sqrt(train['SQBdependency'])
test['dependency'] = np.sqrt(test['SQBdependency'])

# Replace 'no' in the education fields with 0
train.loc[train['edjefa'] == 'no', 'edjefa'] = 0
train.loc[train['edjefe'] == 'no', 'edjefe'] = 0
test.loc[test['edjefa'] == 'no', 'edjefa'] = 0
test.loc[test['edjefe'] == 'no', 'edjefe'] = 0

# Where education is 'yes' and the person is the head of household, use escolari
train.loc[(train['edjefa'] == 'yes')&(train['parentesco1'] == 1), 'edjefa'] = train.loc[(train['edjefa'] == 'yes')&(train['parentesco1'] == 1), 'escolari']
train.loc[(train['edjefe'] == 'yes')&(train['parentesco1'] == 1), 'edjefe'] = train.loc[(train['edjefe'] == 'yes')&(train['parentesco1'] == 1), 'escolari']
test.loc[(test['edjefa'] == 'yes')&(test['parentesco1'] == 1), 'edjefa'] = test.loc[(test['edjefa'] == 'yes')&(test['parentesco1'] == 1), 'escolari']
test.loc[(test['edjefe'] == 'yes')&(test['parentesco1'] == 1), 'edjefe'] = test.loc[(test['edjefe'] == 'yes')&(test['parentesco1'] == 1), 'escolari']

# edjefa and edjefe encode the interaction of gender and escolari; replace any remaining 'yes' with 4
train.loc[train['edjefa'] == 'yes', 'edjefa'] = 4
train.loc[train['edjefe'] == 'yes', 'edjefe'] = 4
test.loc[test['edjefa'] == 'yes', 'edjefa'] = 4
test.loc[test['edjefe'] == 'yes', 'edjefe'] = 4

# Convert to int
train['edjefa'] = train['edjefa'].astype('int')
train['edjefe'] = train['edjefe'].astype('int')
test['edjefa'] = test['edjefa'].astype('int')
test['edjefe'] = test['edjefe'].astype('int')

# Create a feature with the maximum years of education of the head of household
train['edjef'] = np.max(train[['edjefa', 'edjefe']], axis=1)
test['edjef'] = np.max(test[['edjefa', 'edjefe']], axis=1)

# Fill NaN with 0
train['v2a1'].fillna(0, inplace=True)
test['v2a1'].fillna(0, inplace=True)

train['v18q1'].fillna(0, inplace=True)
test['v18q1'].fillna(0, inplace=True)

train['rez_esc'].fillna(0, inplace=True)
test['rez_esc'].fillna(0, inplace=True)

train.loc[train['meaneduc'].isnull(), 'meaneduc'] = 0
test.loc[test['meaneduc'].isnull(), 'meaneduc'] = 0

train.loc[train['SQBmeaned'].isnull(), 'SQBmeaned'] = 0
test.loc[test['SQBmeaned'].isnull(), 'SQBmeaned'] = 0

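# Fix inconsistent data - some rows report both having and not having a toilet;
# where there is no water supply, treat them as having no toilet.
# Note: the second assignment of each pair matches no rows once v14a has been reset to 0.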
train.loc[(train['v14a'] == 1)&(train['sanitario1'] == 1)&(train['abastaguano'] == 0), 'v14a'] = 0
train.loc[(train['v14a'] == 1)&(train['sanitario1'] == 1)&(train['abastaguano'] == 0), 'sanitario1'] = 1
test.loc[(test['v14a'] == 1)&(test['sanitario1'] == 1)&(test['abastaguano'] == 0), 'v14a'] = 0
test.loc[(test['v14a'] == 1)&(test['sanitario1'] == 1)&(test['abastaguano'] == 0), 'sanitario1'] = 1

In [268]:
def train_test_apply_func(train_, test_, func_):
    test_['Target'] = 0
    xx = pd.concat([train_, test_])
    
    xx_func = func_(xx)
    train_ = xx_func.iloc[:train_.shape[0], :]
    test_ = xx_func.iloc[train_.shape[0]:, :].drop('Target', axis=1)
    
    del xx, xx_func
    return train_, test_

In [269]:
# Convert the one-hot encoded fields to label encoding
train, test = train_test_apply_func(train, test, convert_OHE2LE)

The OHE in techo is incomplete. A new column will be added before label encoding
The OHE in instlevel is incomplete. A new column will be added before label encoding
The OHE in manual_elec is incomplete. A new column will be added before label encoding


### 3. Geo aggregates

In [270]:
cols_2_ohe = ['eviv_LE', 'etecho_LE', 'epared_LE', 'elimbasu_LE',
              'energcocinar_LE', 'sanitario_LE', 'manual_elec_LE', 'pared_LE']
cols_nums = ['age', 'meaneduc', 'dependency', 'hogar_nin', 'hogar_adul',
             'hogar_mayor', 'hogar_total', 'bedrooms', 'overcrowding']

def convert_geo2aggs(df_):
    tmp_df = pd.concat([df_[(['lugar_LE', 'idhogar'] + cols_nums)],
                        pd.get_dummies(df_[cols_2_ohe], columns=cols_2_ohe)], axis=1)
    geo_agg = tmp_df.groupby(['lugar_LE', 'idhogar']).mean().groupby('lugar_LE').mean().astype(np.float32)
    geo_agg.columns = pd.Index(['geo_' + e for e in geo_agg.columns.tolist()])
    
    del tmp_df
    return df_.join(geo_agg, how='left', on='lugar_LE')

# Add the aggregates by geography
train, test = train_test_apply_func(train, test, convert_geo2aggs)
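
`convert_geo2aggs` averages in two stages (first per household, then per region), so a large household does not count several times in the regional mean. The sketch below contrasts this with a naive per-row mean on hypothetical toy values.

```python
import pandas as pd

# Hypothetical toy data: region 0 has a 3-person and a 1-person household
df = pd.DataFrame({
    'lugar_LE': [0, 0, 0, 0],
    'idhogar': [1, 1, 1, 2],
    'overcrowding': [3.0, 3.0, 3.0, 1.0],
})

# Naive mean over person rows: the large household counts three times
person_mean = df.groupby('lugar_LE')['overcrowding'].mean()

# Two-stage mean as in convert_geo2aggs: households first, then regions
household_mean = (df.groupby(['lugar_LE', 'idhogar'])['overcrowding'].mean()
                    .groupby(level='lugar_LE').mean())
```

Here the per-row mean is pulled toward the larger household, while the two-stage mean treats both households equally.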

In [271]:
# Number of people aged 18 or over in each household
# ('age' is selected before transform so that a single Series is assigned)
train['num_over_18'] = 0
train['num_over_18'] = train[train['age'] >= 18].groupby('idhogar')['age'].transform('count')
train['num_over_18'] = train.groupby('idhogar')['num_over_18'].transform('max')
train['num_over_18'].fillna(0, inplace=True)

test['num_over_18'] = 0
test['num_over_18'] = test[test['age'] >= 18].groupby('idhogar')['age'].transform('count')
test['num_over_18'] = test.groupby('idhogar')['num_over_18'].transform('max')
test['num_over_18'].fillna(0, inplace=True)

# Add some extra features (taken from another kernel)
def extract_features(df):
    df['bedroom_to_rooms'] = df['bedrooms'] / df['rooms']
    df['rent_to_rooms'] = df['v2a1'] / df['rooms']
    df['tamhog_to_rooms'] = df['tamhog'] / df['rooms']
    df['r4t3_to_tamhog'] = df['r4t3'] / df['tamhog']
    df['r4t3_to_rooms'] = df['r4t3'] / df['rooms']
    df['v2a1_to_r4t3'] = df['v2a1'] / df['r4t3']
    df['v2a1_to_under_12'] = df['v2a1'] / (df['r4t3'] - df['r4t1'])
    df['hhsize_to_rooms'] = df['hhsize'] / df['rooms']
    df['rent_to_hhsize'] = df['v2a1'] / df['hhsize']
    df['rent_to_over_18'] = df['v2a1'] / df['num_over_18']
    # Households with no one aged 18 or over: use the total rent instead
    df.loc[df['num_over_18'] == 0, 'rent_to_over_18'] = df[df['num_over_18'] == 0].v2a1

extract_features(train)
extract_features(test)

In [272]:
# Drop redundant columns
needless_cols = ['r4t3', 'tamhog', 'tamviv', 'hhsize', 'v18q', 'v14a', 'agesq',
                 'mobilephone', 'female']

instlevel_cols = [s for s in train.columns.tolist() if 'instlevel' in s]
needless_cols.extend(instlevel_cols)

train.drop(needless_cols, axis=1, inplace=True)
test.drop(needless_cols, axis=1, inplace=True)

#### Split the data
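Rows that belong to the same household mostly share the same data, so the data is split by household to avoid leakage. Because the data is filtered down to heads of household, this is not technically necessary, but it provides an easy way to use the entire training set.  

Keep in mind that after the split, the training data is overwritten with the full data set so that all of the data can be trained on. As a consequence, `x_test` is a subset of the training data, and scores computed on it later are optimistic. The split_data function does the same thing without the overwrite and is used inside the training loop to approximate a K-fold split.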

In [273]:
def split_data(train, y, sample_weight=None, households=None,
               test_percentage=0.20, seed=None):
    np.random.seed(seed=seed)
    train2 = train.copy()
    
    # Randomly pick the households to hold out for testing
    cv_hhs = np.random.choice(households, size=int(len(households)*test_percentage), replace=False)
    
    # Select the rows of the randomly chosen households
    cv_idx = np.isin(households, cv_hhs)
    x_test = train2[cv_idx]
    y_test = y[cv_idx]
    
    x_train = train2[~cv_idx]
    y_train = y[~cv_idx]
    
    if sample_weight is not None:
        y_train_weights = sample_weight[~cv_idx]
        return x_train, y_train, x_test, y_test, y_train_weights
    
    return x_train, y_train, x_test, y_test
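
The idea behind `split_data`, holding out whole households rather than individual rows, can be sketched independently of the data frames. This is a simplified illustration on hypothetical ids. It uses `np.random.default_rng` instead of the `np.random.seed` call in the function above, and it samples from the unique household ids.

```python
import numpy as np

# Hypothetical household id per row (several rows can share a household)
households = np.array([10, 10, 11, 12, 12, 12, 13, 14])

rng = np.random.default_rng(0)
unique_hhs = np.unique(households)

# Hold out 40% of the households, not 40% of the rows
n_holdout = int(len(unique_hhs) * 0.4)
holdout_hhs = rng.choice(unique_hhs, size=n_holdout, replace=False)

# Row mask: True for every row belonging to a held-out household
holdout_mask = np.isin(households, holdout_hhs)
train_hhs = set(households[~holdout_mask])
test_hhs = set(households[holdout_mask])
```

Because the mask is built from household ids, no household can end up on both sides of the split, which is what prevents leakage between near-identical rows.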

In [274]:
x = train.query('parentesco1 == 1').copy()

# Extract and drop the target variable
y = x['Target'] - 1
x.drop('Target', axis=1, inplace=True)

np.random.seed(seed=None)
x_train, y_train, x_test, y_test = split_data(x, y, households=x['idhogar'].unique(), test_percentage=0.15)

# Train on the full data set
x_train = x
y_train = y

train_households = x_train['idhogar']

# Sample weights for training with imbalanced classes
y_train_weights = class_weight.compute_sample_weight('balanced', y_train, indices=None)

In [275]:
# Drop features that LGBM does not use or that have very low feature importance
extra_drop_features = [
 'agg18_estadocivil1_MEAN', 'agg18_estadocivil6_COUNT', 'agg18_estadocivil7_COUNT',
 'agg18_parentesco10_COUNT', 'agg18_parentesco11_COUNT', 'agg18_parentesco12_COUNT',
 'agg18_parentesco1_COUNT', 'agg18_parentesco2_COUNT', 'agg18_parentesco3_COUNT',
 'agg18_parentesco4_COUNT', 'agg18_parentesco5_COUNT', 'agg18_parentesco6_COUNT',
 'agg18_parentesco7_COUNT', 'agg18_parentesco8_COUNT', 'agg18_parentesco9_COUNT',
 'geo_elimbasu_LE_4', 'geo_energcocinar_LE_1', 'geo_energcocinar_LE_2',
 'geo_epared_LE_0', 'geo_hogar_mayor', 'geo_manual_elec_LE_2', 'geo_pared_LE_3',
 'geo_pared_LE_4', 'geo_pared_LE_5', 'geo_pared_LE_6', 'num_over_18',
 'parentesco_LE', 'rez_esc']

xgb_drop_cols = extra_drop_features + ['idhogar', 'parentesco1']

### 4. Fit a voting classifier
Define a derived VotingClassifier that can pass ```fit_params``` through for early stopping. Voting is based on LGBM models that use early stopping on macro F1 together with a decaying learning rate.  

The parameters were optimized with a random search in this kernel: [https://www.kaggle.com/mlisovyi/lighgbm-hyperoptimisation-with-f1-macro](https://www.kaggle.com/mlisovyi/lighgbm-hyperoptimisation-with-f1-macro)

In [276]:
opt_parameters = {
    'max_depth':35, 'eta':0.15, 'silent':1, 'objective':'multi:softmax',
    'min_child_weight':2, 'num_class':4, 'gamma':2.5, 'colsample_bylevel':1,
    'subsample':0.95, 'colsample_bytree':0.85, 'reg_lambda':0.35
}

def evaluate_macroF1_lgb(predictions, truth):
    # Follows https://github.com/Microsoft/LightGBM/issues/1483
    pred_labels = predictions.argmax(axis=1)
    truth = truth.get_label()
    f1 = f1_score(truth, pred_labels, average='macro')
    return ('macroF1', 1-f1)

fit_params = {
    'early_stopping_rounds':500, 'eval_metric':evaluate_macroF1_lgb, 
    'eval_set':[(x_train, y_train), (x_test, y_test)], 'verbose':False,
}

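# Exponentially decaying learning rate with a floor; the decay factor used is 0.995, despite the function name
def learning_rate_power_0997(current_iter):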
    base_learning_rate = 0.1
    min_learning_rate = 0.02
    lr = base_learning_rate * np.power(.995, current_iter)
    return max(lr, min_learning_rate)

fit_params['verbose'] = 50
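
To make the early-stopping metric concrete, here is a minimal pure-Python sketch of macro F1: the unweighted mean of the per-class F1 scores. The helper `macro_f1` and the toy labels are hypothetical. The kernel itself relies on `f1_score(..., average='macro')` and reports `1 - f1` so that lower is better.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels: class 0 is common, class 1 is rare
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)

# Plain accuracy for comparison
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this example the single mistake on the rare class lowers macro F1 more than it lowers accuracy, which is why the metric rewards models that handle the minority classes well.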

In [277]:
np.random.seed(100)

def _parallel_fit_estimator(estimator1, x, y, sample_weight=None, threshold=True, **fit_params):
    estimator = clone(estimator1)
    
    # Randomly split the data
    if sample_weight is not None:
        x_train, y_train, x_test, y_test, y_train_weight = split_data(x, y, sample_weight, households=train_households)
    else:
        x_train, y_train, x_test, y_test = split_data(x, y, households=train_households)
        
    # Update the fit params for the new split
    fit_params['eval_set'] = [(x_test, y_test)]
    
    # fit the estimator
    if sample_weight is not None:
        if isinstance(estimator1, ExtraTreesClassifier) or isinstance(estimator1, RandomForestClassifier):
            estimator.fit(x_train, y_train)
        else:
            _ = estimator.fit(x_train, y_train, sample_weight=y_train_weight, **fit_params)
    else:
        if isinstance(estimator1, ExtraTreesClassifier) or isinstance(estimator1, RandomForestClassifier):
            estimator.fit(x_train, y_train)
        else:
            _ = estimator.fit(x_train, y_train, **fit_params)
    
    if not isinstance(estimator1, ExtraTreesClassifier) and not isinstance(estimator1, RandomForestClassifier) and not isinstance(estimator1, xgb.XGBClassifier):
        best_cv_round = np.argmax(estimator.evals_result_['validation_0']['mlogloss'])
        best_cv = np.max(estimator.evals_result_['validation_0']['mlogloss'])
        best_train = estimator.evals_result_['train']['macroF1'][best_cv_round]
    else:
        best_train = f1_score(y_train, estimator.predict(x_train), average='macro')
        best_cv = f1_score(y_test, estimator.predict(x_test), average='macro')
        print('Train F1:', best_train)
        print('Test F1:', best_cv)
    
    # reject some estimators based on their performance on train and test sets
    if threshold:
        # If the validation score is very high, the train score requirement is more lenient
        if ((best_cv > 0.37) and (best_train > 0.75)) or ((best_cv > 0.44) and (best_train > 0.65)):
            return estimator
        # Otherwise, try again until the scores are good enough
        else:
            print('Unacceptable!!! Trying again...')
            return _parallel_fit_estimator(estimator1, x, y, sample_weight=sample_weight, **fit_params)
    else:
        return estimator
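
The acceptance rule at the end of `_parallel_fit_estimator` can be isolated as a small predicate. The sketch below pairs it with a bounded retry loop. The loop, `fit_until_acceptable`, `fit_once`, and `max_tries` are hypothetical additions for illustration. The original function instead retries by calling itself recursively, without a limit.

```python
def is_acceptable(best_cv, best_train):
    """Same thresholds as above: a higher validation score relaxes
    the required train score."""
    return ((best_cv > 0.37 and best_train > 0.75)
            or (best_cv > 0.44 and best_train > 0.65))

def fit_until_acceptable(fit_once, max_tries=10):
    """Call fit_once() until its scores pass the acceptance rule."""
    for attempt in range(1, max_tries + 1):
        model, best_cv, best_train = fit_once()
        if is_acceptable(best_cv, best_train):
            return model, attempt
    raise RuntimeError('no acceptable model found')

# Hypothetical score sequence standing in for repeated random splits
fake_results = iter([('m1', 0.30, 0.90), ('m2', 0.45, 0.70), ('m3', 0.50, 0.80)])
model, attempts = fit_until_acceptable(lambda: next(fake_results))
```

A bounded loop is one way to avoid unbounded recursion when no split ever passes the thresholds. In the calls shown in this notebook, `threshold=False` is passed, so the rule is not applied.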

In [278]:
class VotingClassifierLGBM(VotingClassifier):
    # Reimplements VotingClassifier.fit so that fit_params are propagated
    def fit(self, x, y, sample_weight=None, threshold=True, **fit_params):
        if isinstance(y, np.ndarray) and len(y.shape) > 1 and y.shape[1] > 1:
            raise NotImplementedError('Multilabel and multi-output classification is not supported.')
        if self.voting not in ('soft', 'hard'):
            raise ValueError("Voting must be 'soft' or 'hard'; got (voting=%r)"%self.voting)
        if self.estimators is None or len(self.estimators) == 0:
            raise AttributeError('Invalid `estimators` attribute, `estimators` should be a list of (string, estimator) tuples')
        if (self.weights is not None and len(self.weights) != len(self.estimators)):
            raise ValueError('Number of classifiers and weights must be equal; got %d weights, %d estimators'%(len(self.weights), len(self.estimators)))
        
        names, clfs = zip(*self.estimators)
        self._validate_names(names)
        
        n_isnone = np.sum([clf is None for _, clf in self.estimators])
        if n_isnone == len(self.estimators):
            raise ValueError('All estimators are None. At least one is required to be a classifier!')
        
        self.le_ = LabelEncoder().fit(y)
        self.classes_ = self.le_.classes_
        self.estimators_ = []
        
        transformed_y = self.le_.transform(y)
        
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_parallel_fit_estimator)(
                clone(clf), x, transformed_y, sample_weight=sample_weight, threshold=threshold, **fit_params
            ) for clf in clfs if clf is not None)
    
        return self

In [279]:
clfs = []
for i in range(15):
    clf = xgb.XGBClassifier(random_state=217+i, n_estimators=300, learning_rate=0.15,
                            n_jobs=4, **opt_parameters)
    clfs.append(('xgb{}'.format(i), clf))

vc = VotingClassifierLGBM(clfs, voting='soft')
del clfs

# Train the final model with learning rate decay
_ = vc.fit(x_train.drop(xgb_drop_cols, axis=1), y_train,
           sample_weight=y_train_weights, threshold=False, **fit_params)

clf_final = vc.estimators_[0]

Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0]	validation_0-mlogloss:1.30382	validation_0-macroF1:0.66847
[50]	validation_0-mlogloss:0.91046	validation_0-macroF1:0.63081
[100]	validation_0-mlogloss:0.90507	validation_0-macroF1:0.62905
[150]	validation_0-mlogloss:0.90503	validation_0-macroF1:0.63133
[200]	validation_0-mlogloss:0.90546	validation_0-macroF1:0.63140
[250]	validation_0-mlogloss:0.90707	validation_0-macroF1:0.63409
[299]	validation_0-mlogloss:0.90637	validation_0-macroF1:0.63619
Train F1: 0.8772064230716127
Test F1: 0.3858798662205686

In [280]:
# params 4 = 400 early stop - 15 estimators - l1 used features - weighted
global_score = f1_score(y_test, clf_final.predict(x_test.drop(xgb_drop_cols, axis=1)), average='macro')

vc.voting = 'soft'
global_score_soft = f1_score(y_test, vc.predict(x_test.drop(xgb_drop_cols, axis=1)), average='macro')

vc.voting = 'hard'
global_score_hard = f1_score(y_test, vc.predict(x_test.drop(xgb_drop_cols, axis=1)), average='macro')

print('Validation score of a single LGBM Classifier: {:.4f}'.format(global_score))
print('Validation score for a VotingClassifier on 3 LGBMs with soft voting strategy: {:.4f}'.format(global_score_soft))
print('Validation score for a VotingClassifier on 3 LGBMs with hard voting strategy: {:.4f}'.format(global_score_hard))

Validation score of a single LGBM Classifier: 0.8215
Validation score for a VotingClassifier on 3 LGBMs with soft voting strategy: 0.8970
Validation score for a VotingClassifier on 3 LGBMs with hard voting strategy: 0.8940
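
The difference between the soft and hard voting strategies compared above can be shown on a single sample. The probabilities below are hypothetical and chosen so that the two strategies disagree.

```python
import numpy as np

# Hypothetical class probabilities from 3 models for one sample, 2 classes
probas = np.array([
    [0.45, 0.55],   # model 1 weakly prefers class 1
    [0.40, 0.60],   # model 2 weakly prefers class 1
    [0.95, 0.05],   # model 3 strongly prefers class 0
])

# Soft voting: average the probabilities, then take the argmax
soft_pred = int(np.argmax(probas.mean(axis=0)))

# Hard voting: each model casts one vote for its own argmax
votes = probas.argmax(axis=1)
hard_pred = int(np.bincount(votes, minlength=probas.shape[1]).argmax())
```

Soft voting lets one confident model outweigh two hesitant ones, whereas hard voting counts every model equally regardless of confidence.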


In [281]:
# Features that none of the models use
useless_features = []
drop_features = set()
counter = 0
for est in vc.estimators_:
    ranked_features, unused_features = feature_importance(est, x_train.drop(xgb_drop_cols, axis=1), display_results=False)
    useless_features.append(unused_features)
    if counter == 0:
        drop_features = set(unused_features)
    else:
        drop_features = drop_features.intersection(set(unused_features))
    counter += 1

drop_features

{'agg18_estadocivil5_COUNT',
 'geo_energcocinar_LE_0',
 'geo_epared_LE_2',
 'geo_manual_elec_LE_3'}
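
The loop above keeps only the features that every estimator left unused. A compact equivalent, shown here on hypothetical feature names, is a single set intersection over the per-model lists.

```python
# Hypothetical zero-importance lists from three fitted estimators
unused_per_model = [
    ['feat_a', 'feat_b', 'feat_c'],
    ['feat_b', 'feat_c'],
    ['feat_c', 'feat_d', 'feat_b'],
]

# Keep only the features that every estimator ignored
drop_candidates = set.intersection(*(set(u) for u in unused_per_model))
```

Taking the intersection rather than the union is the conservative choice: a feature is dropped only if no model in the ensemble found it useful.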

In [282]:
ranked_features = feature_importance(clf_final, x_train.drop(xgb_drop_cols, axis=1))

Feature ranking:
1. feature 59 (0.023249) - agg18_escolari_MAX
2. feature 42 (0.018774) - fe_children_fraction
3. feature 37 (0.015907) - SQBedjefe
4. feature 128 (0.014577) - geo_manual_elec_LE_0
5. feature 60 (0.013138) - agg18_escolari_MEAN
6. feature 36 (0.012107) - SQBhogar_total
7. feature 125 (0.012032) - geo_sanitario_LE_2
8. feature 22 (0.011352) - dependency
9. feature 74 (0.011255) - agg18_parentesco2_MEAN
10. feature 7 (0.010973) - r4h2
11. feature 107 (0.010781) - geo_overcrowding
12. feature 13 (0.010682) - r4t2
13. feature 87 (0.010281) - piso_LE
14. feature 15 (0.010171) - cielorazo
15. feature 40 (0.010127) - SQBdependency
16. feature 112 (0.010100) - geo_etecho_LE_1
17. feature 102 (0.009908) - geo_dependency
18. feature 105 (0.009904) - geo_hogar_total
19. feature 65 (0.009884) - agg18_estadocivil3_MEAN
20. feature 117 (0.009835) - geo_elimbasu_LE_1
21. feature 124 (0.009646) - geo_sanitario_LE_1
22. feature 23 (0.009519) - edjefe
23. feature 96 (0.009518) - estadoci

#### Random Forest

In [283]:
et_drop_cols = [
    'agg18_age_MAX', 'agg18_age_MEAN', 'agg18_age_MIN', 'agg18_dis_MEAN',
    'agg18_escolari_MAX', 'agg18_escolari_MEAN', 'agg18_escolari_MIN',
    'agg18_estadocivil1_COUNT', 'agg18_estadocivil1_MEAN',
    'agg18_estadocivil2_COUNT', 'agg18_estadocivil2_MEAN',
    'agg18_estadocivil3_COUNT', 'agg18_estadocivil3_MEAN',
    'agg18_estadocivil4_COUNT', 'agg18_estadocivil4_MEAN',
    'agg18_estadocivil5_COUNT', 'agg18_estadocivil5_MEAN',
    'agg18_estadocivil6_COUNT', 'agg18_estadocivil6_MEAN',
    'agg18_estadocivil7_COUNT', 'agg18_estadocivil7_MEAN',
    'agg18_parentesco10_COUNT', 'agg18_parentesco10_MEAN',
    'agg18_parentesco11_COUNT', 'agg18_parentesco11_MEAN',
    'agg18_parentesco12_COUNT', 'agg18_parentesco12_MEAN',
    'agg18_parentesco1_COUNT', 'agg18_parentesco1_MEAN',
    'agg18_parentesco2_COUNT', 'agg18_parentesco2_MEAN',
    'agg18_parentesco3_COUNT', 'agg18_parentesco3_MEAN',
    'agg18_parentesco4_COUNT', 'agg18_parentesco4_MEAN',
    'agg18_parentesco5_COUNT', 'agg18_parentesco5_MEAN',
    'agg18_parentesco6_COUNT', 'agg18_parentesco6_MEAN',
    'agg18_parentesco7_COUNT', 'agg18_parentesco7_MEAN',
    'agg18_parentesco8_COUNT', 'agg18_parentesco8_MEAN',
    'agg18_parentesco9_COUNT', 'agg18_parentesco9_MEAN'
]
    #+ ['parentesco_LE', 'rez_esc']

et_drop_cols.extend([
    'idhogar', 'parentesco1', 'fe_rent_per_person', 'fe_rent_per_room',
    'fe_tablet_adult_density', 'fe_tablet_density'
])

In [284]:
# Build ten random forests with different seeds
ets = []
for i in range(10):
    rf = RandomForestClassifier(
        max_depth=None, random_state=217+i, n_jobs=4, n_estimators=700,
        min_impurity_decrease=1e-3, min_samples_leaf=2, verbose=0, class_weight='balanced'
    )
    ets.append(('rf{}'.format(i), rf))

vc2 = VotingClassifierLGBM(ets, voting='soft')
_ = vc2.fit(x_train.drop(et_drop_cols, axis=1), y_train, threshold=False)

Train F1: 0.8954372019112491
Test F1: 0.4458623810259543
Train F1: 0.8902466898736323
Test F1: 0.434252672331085
Train F1: 0.8975150989017542
Test F1: 0.43286306265457275
Train F1: 0.8936934335737534
Test F1: 0.4182419864722403
Train F1: 0.9005328671076449
Test F1: 0.39998242036131787
Train F1: 0.8961073649993477
Test F1: 0.4779487868342571
Train F1: 0.8948105591249009
Test F1: 0.4127369072617175
Train F1: 0.8939783041785534
Test F1: 0.44027366000562207
Train F1: 0.8848922924264201
Test F1: 0.4586126835893988
Train F1: 0.8978049456841827
Test F1: 0.43410845734325465


In [285]:
vc2.voting = 'soft'
global_rf_score_soft = f1_score(y_test, vc2.predict(x_test.drop(et_drop_cols, axis=1)), average='macro')

vc2.voting = 'hard'
global_rf_score_hard = f1_score(y_test, vc2.predict(x_test.drop(et_drop_cols, axis=1)), average='macro')

print('Validation score of a VotingClassifier on 3 LGBMs with soft voting strategy: {:.4f}'.format(global_rf_score_soft))
print('Validation score of a VotingClassifier on 3 LGBMs with hard voting strategy: {:.4f}'.format(global_rf_score_hard))

Validation score of a VotingClassifier on 3 LGBMs with soft voting strategy: 0.8794
Validation score of a VotingClassifier on 3 LGBMs with hard voting strategy: 0.8793


In [286]:
# Features that none of the models use
useless_features = []
drop_features = set()
counter = 0
for est in vc2.estimators_:
    ranked_features, unused_features = feature_importance(est, x_train.drop(et_drop_cols, axis=1), display_results=False)
    useless_features.append(unused_features)
    if counter == 0:
        drop_features = set(unused_features)
    else:
        drop_features = drop_features.intersection(set(unused_features))
    counter += 1

drop_features

{'parentesco_LE', 'rez_esc'}

In [287]:
def combine_voters(data, weights=[0.5, 0.5]):
    # Soft voting across both voting classifiers
    vc.voting = 'soft'
    vc1_probs = vc.predict_proba(data.drop(xgb_drop_cols, axis=1))
    
    vc2.voting = 'soft'
    vc2_probs = vc2.predict_proba(data.drop(et_drop_cols, axis=1))
    
    final_vote = (vc1_probs * weights[0]) + (vc2_probs * weights[1])
    predictions = np.argmax(final_vote, axis=1)
    
    return predictions

In [288]:
combo_preds = combine_voters(x_test, weights=[0.5, 0.5])
global_combo_score_soft = f1_score(y_test, combo_preds, average='macro')
global_combo_score_soft

0.9009788676236044

In [289]:
combo_preds = combine_voters(x_test, weights=[0.4, 0.6])
global_combo_score_soft = f1_score(y_test, combo_preds, average='macro')
global_combo_score_soft

0.8963703061529148

In [290]:
combo_preds = combine_voters(x_test, weights=[0.6, 0.4])
global_combo_score_soft = f1_score(y_test, combo_preds, average='macro')
global_combo_score_soft

0.9009788676236044

### 5. Prepare submission

In [291]:
y_subm = pd.DataFrame()
y_subm['Id'] = test_ids

In [292]:
vc.voting = 'soft'
y_subm_lgb = y_subm.copy(deep=True)
y_subm_lgb['Target'] = vc.predict(test.drop(xgb_drop_cols, axis=1)) + 1

vc2.voting = 'soft'
y_subm_rf = y_subm.copy(deep=True)
y_subm_rf['Target'] = vc2.predict(test.drop(et_drop_cols, axis=1)) + 1

y_subm_ens = y_subm.copy(deep=True)
y_subm_ens['Target'] = combine_voters(test) + 1

In [294]:
from datetime import datetime
now = datetime.now()

sub_file_lgb = 'data/submission_3_soft_XGB_{:.4f}_{}.csv'.format(global_score_soft, str(now.strftime('%Y-%m-%d-%H-%M')))
sub_file_rf = 'data/submission_3_soft_RF_{:.4f}_{}.csv'.format(global_rf_score_soft, str(now.strftime('%Y-%m-%d-%H-%M')))
sub_file_ens = 'data/submission_3_soft_ens_{:.4f}_{}.csv'.format(global_combo_score_soft, str(now.strftime('%Y-%m-%d-%H-%M')))

y_subm_lgb.to_csv(sub_file_lgb, index=False)
y_subm_rf.to_csv(sub_file_rf, index=False)
y_subm_ens.to_csv(sub_file_ens, index=False)

## XGBoost

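**✏ Transcription, pass 2** 

### 1. LGBM with random split for early stopping
**Edits by Eric Antoine Scuccimarra:**  
This notebook builds on [Misha Lisovyi's notebook](https://www.kaggle.com/mlisovyi/feature-engineering-lighgbm-with-f1-macro), with the following changes:  
* Uses XGBoost instead of LightGBM  
* Fits a VotingClassifier of random forests and combines the XGB results with the RF results  
* Adds features  
* Cleans up the code  
* Instead of splitting the data once and using the validation set for LGBM early stopping, the data is split during training so that the entire training set can be used for fitting. Here this worked better than a k-fold split.  

The additional features are taken from [Kuriyaman's notebook](https://www.kaggle.com/kuriyaman1002/reduce-features-140-84-keeping-f1-score).  

**Notes from Original Kernel (edited by EAS):**  
The content is similar to Misha Lisovyi's notebook, but instead of optimizing the hyperparameters it reuses the optimal values from that kernel, so it runs much faster.  

Key points:  
* (After extracting aggregate information about the family) **this kernel trains only on head-of-household records.** This follows the announced scoring policy, under which only heads of household are scored. All family members are included in the test set and the sample submission, but only heads of household are supposed to be scored. However, according to [https://www.kaggle.com/c/costa-rican-household-poverty-prediction/discussion/61403#360115](https://www.kaggle.com/c/costa-rican-household-poverty-prediction/discussion/61403#360115), it appears that non-head members are currently evaluated as well. In practice, taking a submission that scores ~0.4 PLB and replacing the predictions for all non-head members with class 1 drops the score to ~0.2 PLB.  
* **Balancing the class frequencies appears to be very important.** Without balancing, the trained model scores ~0.39 PLB / ~0.43 local test, while with balancing it scores ~0.42 PLB / ~0.47 local test. Balancing can be done by hand, for example by undersampling. However, the simplest approach, and a more powerful one than undersampling, is to set ```class_weight='balanced'``` when constructing the LightGBM model through its sklearn API.  
* **This kernel uses the macro F1 score for early stopping during training.** This matches the competition's scoring strategy.  
* Categorical features are converted to numeric values with a proper mapping rather than an arbitrary label encoding. 
* **OHE columns are converted back to label encoding, because that is easier for tree models to digest.** This trick can be harmful for non-tree models, so use it with care.  
* **idhogar is not used for training.** The only way it could carry information is through a data leak. We are fighting poverty here, and exploiting a leak will not reduce poverty in any way.  
* **Aggregations are computed within households, and new features are hand-crafted.** Note that there are not many features to aggregate, because most of them are already household-level.  
* **A voting classifier is used to average over several LightGBM models.**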
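
The points about class balance and macro F1 can be illustrated with a minimal sketch. The labels below are invented for illustration and are not taken from the competition data; the sketch only assumes scikit-learn's `compute_sample_weight` and `f1_score`. It shows that `'balanced'` weights give every class the same total weight, and that a model which always predicts the majority class can look acceptable on accuracy while scoring poorly on macro F1.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical, heavily imbalanced labels for four classes (0..3).
y_true = np.array([3] * 6 + [2] * 2 + [1] * 1 + [0] * 1)

# 'balanced' assigns each sample the weight n_samples / (n_classes * class_count).
weights = compute_sample_weight('balanced', y_true)

# The summed weight is the same for every class, so rare classes count as much as common ones.
per_class_weight = {int(c): float(weights[y_true == c].sum()) for c in np.unique(y_true)}

# A trivial "model" that always predicts the majority class.
y_majority = np.full_like(y_true, 3)
accuracy = float((y_majority == y_true).mean())
macro_f1 = f1_score(y_true, y_majority, average='macro', zero_division=0)

print(per_class_weight)
print('accuracy:', accuracy, 'macro F1:', macro_f1)
```

Because macro F1 averages the per-class F1 scores without weighting by class size, the three classes that are never predicted each contribute a zero, which is why balancing matters for this metric.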

In [358]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import f1_score
from joblib import Parallel, delayed
# from sklearn.externals.joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.ensemble import VotingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.utils import class_weight

import warnings
warnings.filterwarnings('ignore')

In [359]:
from sklearn.preprocessing import LabelEncoder

def encode_data(df):
    df['idhogar'] = LabelEncoder().fit_transform(df['idhogar'])

def feature_importance(forest, x_train, display_results=True):
    ranked_list = []
    zero_features = []
    importances = forest.feature_importances_
    indices = np.argsort(importances)[::-1]
    if display_results:
        print('Feature ranking:')
    for f in range(x_train.shape[1]):
        if display_results:
            print('%d. feature %d (%f) - %s'%(f+1, indices[f], importances[indices[f]], x_train.columns[indices[f]]))
        ranked_list.append(x_train.columns[indices[f]])
        if importances[indices[f]] == 0.0:
            zero_features.append(x_train.columns[indices[f]])
    
    return ranked_list, zero_features

In [360]:
def do_features(df):
    feats_div = [
        ('children_fraction', 'r4t1', 'r4t3'), 
        ('working_man_fraction', 'r4h2', 'r4t3'),
        ('all_man_fraction', 'r4h3', 'r4t3'),
        ('human_density', 'tamviv', 'rooms'),
        ('human_bed_density', 'tamviv', 'bedrooms'),
        ('rent_per_person', 'v2a1', 'r4t3'),
        ('rent_per_room', 'v2a1', 'rooms'),
        ('mobile_density', 'qmobilephone', 'r4t3'),
        ('tablet_density', 'v18q1', 'r4t3'),
        ('mobile_adult_density', 'qmobilephone', 'r4t2'),
        ('tablet_adult_density', 'v18q1', 'r4t2'),
    ]
    feats_sub = [('people_not_living', 'tamhog', 'tamviv'),
                 ('people_weird_stat', 'tamhog', 'r4t3')
                ]
    for f_new, f1, f2 in feats_div:
        df['fe_' + f_new] = (df[f1]/df[f2]).astype(np.float32)
    for f_new, f1, f2 in feats_sub:
        df['fe_' + f_new] = (df[f1]-df[f2]).astype(np.float32)
    aggs_num = {'age':['min', 'max', 'mean'],
                'escolari': ['min', 'max', 'mean']
               }
    aggs_cat = {'dis':['mean']}
    for s_ in ['estadocivil', 'parentesco', 'instlevel']:
        for f_ in [f_ for f_ in df.columns if f_.startswith(s_)]:
            aggs_cat[f_] = ['mean', 'count']
    for name_, df_ in [('18', df.query('age >= 18'))]:
        df_agg = df_.groupby('idhogar').agg({**aggs_num, **aggs_cat}).astype(np.float32)
        df_agg.columns = pd.Index(
            ['agg' + name_ + '_' + e[0] + '_' + e[1].upper() for e in df_agg.columns.tolist()]
        )
        df = df.join(df_agg, how='left', on='idhogar')
        del df_agg
    df.drop('Id', axis=1, inplace=True)
    
    return df

In [361]:
def convert_OHE2LE(df):
    tmp_df = df.copy(deep=True)
    for s_ in ['pared', 'piso', 'techo', 'abastagua', 'sanitario', 'energcocinar', 'elimbasu', 
               'epared', 'etecho', 'eviv', 'estadocivil', 'parentesco', 
               'instlevel', 'lugar', 'tipovivi', 'manual_elec']:
        if 'manual_' not in s_:
            cols_s_ = [f_ for f_ in df.columns if f_.startswith(s_)]
        elif 'elec' in s_:
            cols_s_ = ['public', 'planpri', 'noelec', 'coopele']
        sum_ohe = tmp_df[cols_s_].sum(axis=1).unique()
        
        if 0 in sum_ohe:
            print('The OHE in {} is incomplete. A new column will be added before label encoding'.format(s_))
            col_dummy = s_ + '_dummy'
            tmp_df[col_dummy] = (tmp_df[cols_s_].sum(axis=1) == 0).astype(np.int8)
            cols_s_.append(col_dummy)
            sum_ohe = tmp_df[cols_s_].sum(axis=1).unique()
            if 0 in sum_ohe:
                print('The category completion did not work')
        tmp_cat = tmp_df[cols_s_].idxmax(axis=1)
        tmp_df[s_ + '_LE'] = LabelEncoder().fit_transform(tmp_cat).astype(np.int16)
        
        if 'parentesco1' in cols_s_:
            cols_s_.remove('parentesco1')
        tmp_df.drop(cols_s_, axis=1, inplace=True)
    return tmp_df

### 2. Read in data and clean it up

In [362]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

test_ids = test.Id

In [363]:
def process_df(df_):
    encode_data(df_)
    return do_features(df_)

train = process_df(train)
test = process_df(test)

Remove missing values and convert string columns to numeric.
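
As a small illustration of the string-to-number step, the sketch below applies the same `edjefe` rules as the next cell to a toy frame. The column names follow the dataset, but the rows are invented, and the column is created with `dtype=object` only so that mixed strings and integers can coexist during the conversion.

```python
import pandas as pd

# Toy rows: column names follow the dataset, values are made up.
toy = pd.DataFrame({
    'edjefe': pd.Series(['no', 'yes', '11', 'yes'], dtype=object),
    'parentesco1': [1, 1, 0, 0],
    'escolari': [0, 6, 9, 3],
})

# 'no' is mapped to zero years of education.
toy.loc[toy['edjefe'] == 'no', 'edjefe'] = 0

# For heads of household, 'yes' is replaced by their own years of schooling.
head_yes = (toy['edjefe'] == 'yes') & (toy['parentesco1'] == 1)
toy.loc[head_yes, 'edjefe'] = toy.loc[head_yes, 'escolari']

# Any remaining 'yes' falls back to the constant 4 used in the notebook.
toy.loc[toy['edjefe'] == 'yes', 'edjefe'] = 4

# Everything is now an int or a numeric string, so the cast succeeds.
toy['edjefe'] = toy['edjefe'].astype(int)
print(toy['edjefe'].tolist())
```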

In [364]:
train['dependency'] = np.sqrt(train['SQBdependency'])
test['dependency'] = np.sqrt(test['SQBdependency'])

train.loc[train['edjefa'] == 'no', 'edjefa'] = 0
train.loc[train['edjefe'] == 'no', 'edjefe'] = 0
test.loc[test['edjefa'] == 'no', 'edjefa'] = 0
test.loc[test['edjefe'] == 'no', 'edjefe'] = 0

train.loc[(train['edjefa'] == 'yes')&(train['parentesco1'] == 1), 'edjefa'] = train.loc[(train['edjefa'] == 'yes')&(train['parentesco1'] == 1), 'escolari']
train.loc[(train['edjefe'] == 'yes')&(train['parentesco1'] == 1), 'edjefe'] = train.loc[(train['edjefe'] == 'yes')&(train['parentesco1'] == 1), 'escolari']
test.loc[(test['edjefa'] == 'yes')&(test['parentesco1'] == 1), 'edjefa'] = test.loc[(test['edjefa'] == 'yes')&(test['parentesco1'] == 1), 'escolari']
test.loc[(test['edjefe'] == 'yes')&(test['parentesco1'] == 1), 'edjefe'] = test.loc[(test['edjefe'] == 'yes')&(test['parentesco1'] == 1), 'escolari']

train.loc[train['edjefa'] == 'yes', 'edjefa'] = 4
train.loc[train['edjefe'] == 'yes', 'edjefe'] = 4
test.loc[test['edjefa'] == 'yes', 'edjefa'] = 4
test.loc[test['edjefe'] == 'yes', 'edjefe'] = 4

train['edjefa'] = train['edjefa'].astype('int')
train['edjefe'] = train['edjefe'].astype('int')
test['edjefa'] = test['edjefa'].astype('int')
test['edjefe'] = test['edjefe'].astype('int')

train['edjef'] = np.max(train[['edjefa', 'edjefe']], axis=1)
test['edjef'] = np.max(test[['edjefa', 'edjefe']], axis=1)

train['v2a1'].fillna(0, inplace=True)
test['v2a1'].fillna(0, inplace=True)

train['v18q1'].fillna(0, inplace=True)
test['v18q1'].fillna(0, inplace=True)

train['rez_esc'].fillna(0, inplace=True)
test['rez_esc'].fillna(0, inplace=True)

train.loc[train['meaneduc'].isnull(), 'meaneduc'] = 0
test.loc[test['meaneduc'].isnull(), 'meaneduc'] = 0

train.loc[train['SQBmeaned'].isnull(), 'SQBmeaned'] = 0
test.loc[test['SQBmeaned'].isnull(), 'SQBmeaned'] = 0

train.loc[(train['v14a'] == 1)&(train['sanitario1'] == 1)&(train['abastaguano'] == 0), 'v14a'] = 0
train.loc[(train['v14a'] == 1)&(train['sanitario1'] == 1)&(train['abastaguano'] == 0), 'sanitario1'] = 1
test.loc[(test['v14a'] == 1)&(test['sanitario1'] == 1)&(test['abastaguano'] == 0), 'v14a'] = 0
test.loc[(test['v14a'] == 1)&(test['sanitario1'] == 1)&(test['abastaguano'] == 0), 'sanitario1'] = 1

In [365]:
def train_test_apply_func(train_, test_, func_):
    test_['Target'] = 0
    xx = pd.concat([train_, test_])
    xx_func = func_(xx)
    train_ = xx_func.iloc[:train_.shape[0], :]
    test_ = xx_func.iloc[train_.shape[0]:, :].drop('Target', axis=1)
    
    del xx, xx_func
    return train_, test_

In [366]:
train, test = train_test_apply_func(train, test, convert_OHE2LE)

The OHE in techo is incomplete. A new column will be added before label encoding
The OHE in instlevel is incomplete. A new column will be added before label encoding
The OHE in manual_elec is incomplete. A new column will be added before label encoding


### 3. Geo aggregates

In [367]:
cols_2_ohe = ['eviv_LE', 'etecho_LE', 'epared_LE', 'elimbasu_LE',
              'energcocinar_LE', 'sanitario_LE', 'manual_elec_LE', 'pared_LE']
cols_nums = ['age', 'meaneduc', 'dependency', 'hogar_nin', 'hogar_adul',
             'hogar_mayor', 'hogar_total', 'bedrooms', 'overcrowding']

def convert_geo2aggs(df_):
    tmp_df = pd.concat([df_[(['lugar_LE', 'idhogar'] + cols_nums)],
                        pd.get_dummies(df_[cols_2_ohe], columns=cols_2_ohe)], axis=1)
    geo_agg = tmp_df.groupby(['lugar_LE', 'idhogar']).mean().groupby('lugar_LE').mean().astype(np.float32)
    geo_agg.columns = pd.Index(['geo_' + e for e in geo_agg.columns.tolist()])
    
    del tmp_df
    return df_.join(geo_agg, how='left', on='lugar_LE')

train, test = train_test_apply_func(train, test, convert_geo2aggs)

In [368]:
# Count household members aged 18 or older; selecting a single column keeps the result a Series
train['num_over_18'] = 0
train['num_over_18'] = train[train['age'] >= 18].groupby('idhogar')['age'].transform('count')
train['num_over_18'] = train.groupby('idhogar')['num_over_18'].transform('max')
train['num_over_18'].fillna(0, inplace=True)

test['num_over_18'] = 0
test['num_over_18'] = test[test['age'] >= 18].groupby('idhogar')['age'].transform('count')
test['num_over_18'] = test.groupby('idhogar')['num_over_18'].transform('max')
test['num_over_18'].fillna(0, inplace=True)


def extract_features(df):
    df['bedroom_to_rooms'] = df['bedrooms'] / df['rooms']
    df['rent_to_rooms'] = df['v2a1'] / df['rooms']
    df['tamhog_to_rooms'] = df['tamhog'] / df['rooms']
    df['r4t3_to_tamhog'] = df['r4t3'] / df['tamhog']
    df['r4t3_to_rooms'] = df['r4t3'] / df['rooms']
    df['v2a1_to_r4t3'] = df['v2a1'] / df['r4t3']
    df['v2a1_to_under_12'] = df['v2a1'] / (df['r4t3'] - df['r4t1'])
    df['hhsize_to_rooms'] = df['hhsize'] / df['rooms']
    df['rent_to_hhsize'] = df['v2a1'] / df['hhsize']
    df['rent_to_over_18'] = df['v2a1'] / df['num_over_18']
    df.loc[df['num_over_18'] == 0, 'rent_to_over_18'] = df[df['num_over_18'] == 0].v2a1

extract_features(train)
extract_features(test)

In [369]:
needless_cols = ['r4t3', 'tamhog', 'tamviv', 'hhsize', 'v18q', 'v14a', 'agesq',
                 'mobilephone', 'female']

instlevel_cols = [s for s in train.columns.tolist() if 'instlevel' in s]
needless_cols.extend(instlevel_cols)

train.drop(needless_cols, axis=1, inplace=True)
test.drop(needless_cols, axis=1, inplace=True)

#### Split the data
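Rows that belong to the same household mostly share the same data, so the data is split by household to avoid leakage. Because the data is filtered down to heads of household this is not technically necessary, but doing it this way makes it easy to use the entire training data if we want to.  

Note that after splitting, the training data is overwritten with the entire data set so that we can train on all of it. The split_data function does everything except that overwrite, and it is used inside the training loop to approximate a K-fold split.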
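
A minimal sketch of a household-level split on a toy frame is shown here. The `idhogar` column name follows the dataset; the rows, the 0.34 validation fraction, and the fixed seed are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: 'idhogar' is the household id; the values are invented.
df = pd.DataFrame({
    'idhogar': [0, 0, 1, 2, 2, 2, 3, 4, 4, 5],
    'feature': np.arange(10),
})

rng = np.random.RandomState(0)
households = df['idhogar'].unique()

# Draw whole households for validation, never individual rows.
n_val = int(len(households) * 0.34)
val_households = rng.choice(households, size=n_val, replace=False)
val_mask = df['idhogar'].isin(val_households).to_numpy()

val_part = df[val_mask]
train_part = df[~val_mask]

# No household appears on both sides of the split.
shared = set(train_part['idhogar']) & set(val_part['idhogar'])
print('households in both parts:', shared)
```

In the notebook itself `x_train` is then set to the full data, so the rows in `x_test` are also seen during training. Scores computed on `x_test` are therefore optimistic compared with the per-estimator "Test F1" values printed during fitting, which come from fresh splits made by `split_data`.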

In [370]:
def split_data(train, y, sample_weight=None, households=None, test_percentage=0.20, seed=None):
    np.random.seed(seed=seed)
    train2 = train.copy()
    cv_hhs = np.random.choice(households, size=int(len(households) * test_percentage), replace=False)
    cv_idx = np.isin(households, cv_hhs)
    x_test = train2[cv_idx]
    y_test = y[cv_idx]
    x_train = train2[~cv_idx]
    y_train = y[~cv_idx]
    
    if sample_weight is not None:
        y_train_weights = sample_weight[~cv_idx]
        return x_train, y_train, x_test, y_test, y_train_weights
    
    return x_train, y_train, x_test, y_test

In [376]:
x = train.query('parentesco1 == 1')

y = x['Target'] - 1
x = x.drop('Target', axis=1)

np.random.seed(seed=None)
train2 = x.copy()
households = train2.idhogar.unique()
cv_hhs = np.random.choice(households, size=int(len(households)*0.15), replace=False)
cv_idx = np.isin(train2.idhogar, cv_hhs)
x_test = train2[cv_idx]
y_test = y[cv_idx]
x_train = train2
y_train = y
train_households = x_train.idhogar
y_train_weights = class_weight.compute_sample_weight('balanced', y_train, indices=None)

In [377]:
extra_drop_features = [
 'agg18_estadocivil1_MEAN', 'agg18_estadocivil6_COUNT', 'agg18_estadocivil7_COUNT',
 'agg18_parentesco10_COUNT', 'agg18_parentesco11_COUNT', 'agg18_parentesco12_COUNT',
 'agg18_parentesco1_COUNT', 'agg18_parentesco2_COUNT', 'agg18_parentesco3_COUNT',
 'agg18_parentesco4_COUNT', 'agg18_parentesco5_COUNT', 'agg18_parentesco6_COUNT',
 'agg18_parentesco7_COUNT', 'agg18_parentesco8_COUNT', 'agg18_parentesco9_COUNT',
 'geo_elimbasu_LE_4', 'geo_energcocinar_LE_1', 'geo_energcocinar_LE_2',
 'geo_epared_LE_0', 'geo_hogar_mayor', 'geo_manual_elec_LE_2', 'geo_pared_LE_3',
 'geo_pared_LE_4', 'geo_pared_LE_5', 'geo_pared_LE_6', 'num_over_18',
 'parentesco_LE', 'rez_esc']

xgb_drop_cols = extra_drop_features + ['idhogar', 'parentesco1']

### 4. Fit a voting classifier
We define a derived VotingClassifier that can pass ```fit_params``` through to the estimators for early stopping. Voting is based on LGBM models that use early stopping on macro F1 together with a decaying learning rate.  

The parameters were optimized by a random search in this kernel: [https://www.kaggle.com/mlisovyi/lighgbm-hyperoptimisation-with-f1-macro](https://www.kaggle.com/mlisovyi/lighgbm-hyperoptimisation-with-f1-macro)
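
To make the decaying learning rate concrete, here is a small sketch of an exponentially decayed schedule with a floor. The base rate 0.1, decay factor 0.995, and floor 0.02 are the constants that appear in the notebook's schedule function; the function name `decayed_learning_rate` and the sampled iterations are chosen only for this illustration.

```python
import numpy as np

def decayed_learning_rate(current_iter, base_lr=0.1, decay=0.995, min_lr=0.02):
    """Exponentially decayed learning rate that never drops below min_lr."""
    return max(base_lr * np.power(decay, current_iter), min_lr)

# Sample the schedule at the start, after 100 rounds, and after 500 rounds.
schedule = [float(decayed_learning_rate(i)) for i in (0, 100, 500)]
print(schedule)
```

With these constants the rate starts at 0.1, falls to roughly 0.06 after 100 rounds, and has hit the 0.02 floor well before round 500.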

In [378]:
opt_parameters = {
    'max_depth':35, 'eta':0.15, 'silent':1, 'objective':'multi:softmax',
    'min_child_weight':2, 'num_class':4, 'gamma':2.5, 'colsample_bylevel':1,
    'subsample':0.95, 'colsample_bytree':0.85, 'reg_lambda':0.35
}

def evaluate_macroF1_lgb(predictions, truth):
    pred_labels = predictions.argmax(axis=1)
    truth = truth.get_label()
    f1 = f1_score(truth, pred_labels, average='macro')
    return ('macroF1', 1-f1)

fit_params={"early_stopping_rounds":500,
            "eval_metric" : evaluate_macroF1_lgb, 
            "eval_set" : [(x_train, y_train), (x_test, y_test)],
            'verbose': False}

def learning_rate_power_0997(current_iter):
    base_learning_rate = 0.1
    min_learning_rate = 0.02
    lr = base_learning_rate * np.power(.995, current_iter)
    return max(lr, min_learning_rate)

fit_params['verbose'] = 50

In [379]:
np.random.seed(100)

def _parallel_fit_estimator(estimator1, x, y, sample_weight=None, threshold=True, **fit_params):
    estimator = clone(estimator1)

    if sample_weight is not None:
        x_train, y_train, x_test, y_test, y_train_weight = split_data(x, y, sample_weight, households=train_households)
    else:
        x_train, y_train, x_test, y_test = split_data(x, y, households=train_households)
    fit_params['eval_set'] = [(x_test, y_test)]

    if sample_weight is not None:
        if isinstance(estimator1, ExtraTreesClassifier) or isinstance(estimator1, RandomForestClassifier):
            estimator.fit(x_train, y_train)
        else:
            _ = estimator.fit(x_train, y_train, sample_weight=y_train_weight, **fit_params)
    else:
        if isinstance(estimator1, ExtraTreesClassifier) or isinstance(estimator1, RandomForestClassifier):
            estimator.fit(x_train, y_train)
        else:
            _ = estimator.fit(x_train, y_train, **fit_params)
    
    if not isinstance(estimator1, ExtraTreesClassifier) and not isinstance(estimator1, RandomForestClassifier) and not isinstance(estimator1, xgb.XGBClassifier):
        best_cv_round = np.argmax(estimator.evals_result_['validation_0']['mlogloss'])
        best_cv = np.max(estimator.evals_result_['validation_0']['mlogloss'])
        best_train = estimator.evals_result_['train']['macroF1'][best_cv_round]
    else:
        best_train = f1_score(y_train, estimator.predict(x_train), average='macro')
        best_cv = f1_score(y_test, estimator.predict(x_test), average='macro')
        print('Train F1:', best_train)
        print('Test F1:', best_cv)
    
    if threshold:
        if ((best_cv > 0.37) and (best_train > 0.75)) or ((best_cv > 0.44) and (best_train > 0.65)):
            return estimator
        else:
            print('Unacceptable!!! Trying again...')
            return _parallel_fit_estimator(estimator1, x, y, sample_weight=sample_weight, **fit_params)
    else:
        return estimator

In [380]:
class VotingClassifierLGBM(VotingClassifier):
    def fit(self, x, y, sample_weight=None, threshold=True, **fit_params):
        if isinstance(y, np.ndarray) and len(y.shape) > 1 and y.shape[1] > 1:
            raise NotImplementedError('Multilabel and multi-output classification is not supported.')
        if self.voting not in ('soft', 'hard'):
            raise ValueError("Voting must be 'soft' or 'hard'; got (voting=%r)"%self.voting)
        if self.estimators is None or len(self.estimators) == 0:
            raise AttributeError('Invalid `estimators` attribute, `estimators` should be a list of (string, estimator) tuples')
        if (self.weights is not None and len(self.weights) != len(self.estimators)):
            raise ValueError('Number of classifiers and weights must be equal; got %d weights, %d estimators'%(len(self.weights), len(self.estimators)))
       
        names, clfs = zip(*self.estimators)
        self._validate_names(names)
        n_isnone = np.sum([clf is None for _, clf in self.estimators])
        if n_isnone == len(self.estimators):
            raise ValueError('All estimators are None. At least one is required to be a classifier!')
        self.le_ = LabelEncoder().fit(y)
        self.classes_ = self.le_.classes_
        self.estimators_ = []
        transformed_y = self.le_.transform(y)
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_parallel_fit_estimator)(
                clone(clf), x, transformed_y, sample_weight=sample_weight, threshold=threshold, **fit_params
            ) for clf in clfs if clf is not None)
    
        return self

In [384]:
clfs = []
for i in range(15):
    clf = xgb.XGBClassifier(random_state=217+i, n_estimators=300, learning_rate=0.15, n_jobs=4, **opt_parameters)
    clfs.append(('xgb{}'.format(i), clf))

vc = VotingClassifierLGBM(clfs, voting='soft')
del clfs

_ = vc.fit(x_train.drop(xgb_drop_cols, axis=1), y_train, sample_weight=y_train_weights, threshold=False, **fit_params)
clf_final = vc.estimators_[0]

Parameters: { h_jobs, silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0]	validation_0-mlogloss:1.30198	validation_0-macroF1:0.62322
[50]	validation_0-mlogloss:0.93623	validation_0-macroF1:0.58164
[100]	validation_0-mlogloss:0.93519	validation_0-macroF1:0.58533
[150]	validation_0-mlogloss:0.93394	validation_0-macroF1:0.58252
[200]	validation_0-mlogloss:0.93058	validation_0-macroF1:0.57567
[250]	validation_0-mlogloss:0.93034	validation_0-macroF1:0.57423
[299]	validation_0-mlogloss:0.93250	validation_0-macroF1:0.57463
Train F1: 0.9257652559128702
Test F1: 0.4283567965201257
Parameters: { h_jobs, silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not us

In [399]:
global_score = f1_score(y_test, clf_final.predict(x_test.drop(xgb_drop_cols, axis=1)), average='macro')
print('Validation score of a single LGBM Classifier: {:.4f}'.format(global_score))

vc.voting = 'soft'
global_score_soft = f1_score(y_test, vc.predict(x_test.drop(xgb_drop_cols, axis=1)), average='macro')
print('Validation score of a VotingClassifier on 3 LGBMs with soft voting strategy: {:.4f}'.format(global_score_soft))

vc.voting = 'hard'
global_score_hard = f1_score(y_test, vc.predict(x_test.drop(xgb_drop_cols, axis=1)), average='macro')
print('Validation score of a VotingClassifier on 3 LGBMs with hard voting strategy: {:.4f}'.format(global_score_hard))

Validation score of a single LGBM Classifier: 0.8367
Validation score of a VotingClassifier on 3 LGBMs with soft voting strategy: 0.9055
Validation score of a VotingClassifier on 3 LGBMs with hard voting strategy: 0.9176


In [401]:
useless_features = []
drop_features = set()
counter = 0
for est in vc.estimators_:
    ranked_features, unused_features = feature_importance(est, x_train.drop(xgb_drop_cols, axis=1), display_results=False)
    useless_features.append(unused_features)
    if counter == 0:
        drop_features = set(unused_features)
    else:
        drop_features = drop_features.intersection(set(unused_features))
    counter += 1

drop_features

{'agg18_estadocivil5_COUNT', 'geo_energcocinar_LE_0', 'geo_epared_LE_2'}

In [402]:
ranked_features = feature_importance(clf_final, x_train.drop(xgb_drop_cols, axis=1))

Feature ranking:
1. feature 59 (0.020225) - agg18_escolari_MAX
2. feature 74 (0.017963) - agg18_parentesco2_MEAN
3. feature 17 (0.017889) - male
4. feature 42 (0.017873) - fe_children_fraction
5. feature 60 (0.014787) - agg18_escolari_MEAN
6. feature 111 (0.013076) - geo_etecho_LE_0
7. feature 100 (0.012871) - geo_age
8. feature 129 (0.012395) - geo_manual_elec_LE_1
9. feature 110 (0.012376) - geo_eviv_LE_2
10. feature 101 (0.011999) - geo_meaneduc
11. feature 97 (0.011980) - lugar_LE
12. feature 40 (0.011299) - SQBdependency
13. feature 49 (0.011270) - fe_mobile_density
14. feature 112 (0.010889) - geo_etecho_LE_1
15. feature 69 (0.010647) - agg18_estadocivil5_MEAN
16. feature 117 (0.010591) - geo_elimbasu_LE_1
17. feature 12 (0.010582) - r4t1
18. feature 41 (0.009625) - SQBmeaned
19. feature 32 (0.009536) - area2
20. feature 128 (0.009521) - geo_manual_elec_LE_0
21. feature 52 (0.009370) - fe_tablet_adult_density
22. feature 98 (0.009328) - tipovivi_LE
23. feature 37 (0.009307) - SQB

#### Random Forest

In [403]:
et_drop_cols = [
    'agg18_age_MAX', 'agg18_age_MEAN', 'agg18_age_MIN', 'agg18_dis_MEAN',
    'agg18_escolari_MAX', 'agg18_escolari_MEAN', 'agg18_escolari_MIN',
    'agg18_estadocivil1_COUNT', 'agg18_estadocivil1_MEAN',
    'agg18_estadocivil2_COUNT', 'agg18_estadocivil2_MEAN',
    'agg18_estadocivil3_COUNT', 'agg18_estadocivil3_MEAN',
    'agg18_estadocivil4_COUNT', 'agg18_estadocivil4_MEAN',
    'agg18_estadocivil5_COUNT', 'agg18_estadocivil5_MEAN',
    'agg18_estadocivil6_COUNT', 'agg18_estadocivil6_MEAN',
    'agg18_estadocivil7_COUNT', 'agg18_estadocivil7_MEAN',
    'agg18_parentesco10_COUNT', 'agg18_parentesco10_MEAN',
    'agg18_parentesco11_COUNT', 'agg18_parentesco11_MEAN',
    'agg18_parentesco12_COUNT', 'agg18_parentesco12_MEAN',
    'agg18_parentesco1_COUNT', 'agg18_parentesco1_MEAN',
    'agg18_parentesco2_COUNT', 'agg18_parentesco2_MEAN',
    'agg18_parentesco3_COUNT', 'agg18_parentesco3_MEAN',
    'agg18_parentesco4_COUNT', 'agg18_parentesco4_MEAN',
    'agg18_parentesco5_COUNT', 'agg18_parentesco5_MEAN',
    'agg18_parentesco6_COUNT', 'agg18_parentesco6_MEAN',
    'agg18_parentesco7_COUNT', 'agg18_parentesco7_MEAN',
    'agg18_parentesco8_COUNT', 'agg18_parentesco8_MEAN',
    'agg18_parentesco9_COUNT', 'agg18_parentesco9_MEAN'
]
et_drop_cols.extend([
    'idhogar', 'parentesco1', 'fe_rent_per_person', 'fe_rent_per_room',
    'fe_tablet_adult_density', 'fe_tablet_density'
])

In [405]:
# Build ten random forests with different seeds
ets = []
for i in range(10):
    rf = RandomForestClassifier(
        max_depth=None, random_state=217+i, n_jobs=4, n_estimators=700,
        min_impurity_decrease=1e-3, min_samples_leaf=2, verbose=0, class_weight='balanced'
    )
    ets.append(('rf{}'.format(i), rf))

vc2 = VotingClassifierLGBM(ets, voting='soft')
_ = vc2.fit(x_train.drop(et_drop_cols, axis=1), y_train, threshold=False)

Train F1: 0.8967053028831977
Test F1: 0.4415984453281717
Train F1: 0.8895165722836282
Test F1: 0.425542097614492
Train F1: 0.8966419931929255
Test F1: 0.4042115261132996
Train F1: 0.8948606285884819
Test F1: 0.44192363902082576
Train F1: 0.8971289703135341
Test F1: 0.3761610570281515
Train F1: 0.8879124321976221
Test F1: 0.4182277668373249
Train F1: 0.8884395609577884
Test F1: 0.39850654650738626
Train F1: 0.8924923382778056
Test F1: 0.4412403268965936
Train F1: 0.890453876137964
Test F1: 0.4314992204062239
Train F1: 0.8966033200269068
Test F1: 0.4287285338189526


In [406]:
vc2.voting = 'soft'
global_rf_score_soft = f1_score(y_test, vc2.predict(x_test.drop(et_drop_cols, axis=1)), average='macro')
print('Validation score of a VotingClassifier on 3 LGBMs with soft voting strategy: {:.4f}'.format(global_rf_score_soft))

vc2.voting = 'hard'
global_rf_score_hard = f1_score(y_test, vc2.predict(x_test.drop(et_drop_cols, axis=1)), average='macro')
print('Validation score of a VotingClassifier on 3 LGBMs with hard voting strategy: {:.4f}'.format(global_rf_score_hard))

Validation score of a VotingClassifier on 3 LGBMs with soft voting strategy: 0.8864
Validation score of a VotingClassifier on 3 LGBMs with hard voting strategy: 0.8957


In [408]:
useless_features = []
drop_features = set()
counter = 0
for est in vc2.estimators_:
    ranked_features, unused_features = feature_importance(est, x_train.drop(et_drop_cols, axis=1), display_results=False)
    useless_features.append(unused_features)
    if counter == 0:
        drop_features = set(unused_features)
    else:
        drop_features = drop_features.intersection(set(unused_features))
    counter += 1

drop_features

{'parentesco_LE', 'rez_esc'}

In [409]:
def combine_voters(data, weights=[0.5, 0.5]):
    vc.voting = 'soft'
    vc1_probs = vc.predict_proba(data.drop(xgb_drop_cols, axis=1))
    
    vc2.voting = 'soft'
    vc2_probs = vc2.predict_proba(data.drop(et_drop_cols, axis=1))
    
    final_vote = (vc1_probs * weights[0]) + (vc2_probs * weights[1])
    predictions = np.argmax(final_vote, axis=1)
    
    return predictions
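
The weighted blend in `combine_voters` can be checked by hand on a tiny example. The probability matrices below are invented and stand in for the outputs of the two voting classifiers; `blend` is a hypothetical helper that mirrors only the averaging and argmax steps.

```python
import numpy as np

# Hypothetical class probabilities from two ensembles: three samples, four classes.
probs_a = np.array([[0.6, 0.2, 0.1, 0.1],
                    [0.1, 0.5, 0.3, 0.1],
                    [0.2, 0.2, 0.3, 0.3]])
probs_b = np.array([[0.3, 0.4, 0.2, 0.1],
                    [0.1, 0.2, 0.6, 0.1],
                    [0.1, 0.1, 0.2, 0.6]])

def blend(p_a, p_b, weights=(0.5, 0.5)):
    """Weighted average of two probability matrices, then argmax per row."""
    combined = p_a * weights[0] + p_b * weights[1]
    return np.argmax(combined, axis=1)

equal = blend(probs_a, probs_b)
favor_b = blend(probs_a, probs_b, weights=(0.1, 0.9))
print(equal.tolist(), favor_b.tolist())
```

Shifting weight toward the second ensemble flips the first sample from class 0 to class 1, which is the kind of change the `[0.4, 0.6]` and `[0.6, 0.4]` runs below are probing.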

In [411]:
combo_preds = combine_voters(x_test, weights=[0.5, 0.5])
global_combo_score_soft = f1_score(y_test, combo_preds, average='macro')
global_combo_score_soft

0.9086853371310112

In [414]:
combo_preds = combine_voters(x_test, weights=[0.4, 0.6])
global_combo_score_soft = f1_score(y_test, combo_preds, average='macro')
global_combo_score_soft

0.9132077555154477

In [413]:
combo_preds = combine_voters(x_test, weights=[0.6, 0.4])
global_combo_score_soft = f1_score(y_test, combo_preds, average='macro')
global_combo_score_soft

0.9031594390425502

### 5. Prepare submission

In [415]:
y_subm = pd.DataFrame()
y_subm['Id'] = test_ids

In [416]:
vc.voting = 'soft'
y_subm_lgb = y_subm.copy(deep=True)
y_subm_lgb['Target'] = vc.predict(test.drop(xgb_drop_cols, axis=1)) + 1

vc2.voting = 'soft'
y_subm_rf = y_subm.copy(deep=True)
y_subm_rf['Target'] = vc2.predict(test.drop(et_drop_cols, axis=1)) + 1

y_subm_ens = y_subm.copy(deep=True)
y_subm_ens['Target'] = combine_voters(test) + 1

In [417]:
from datetime import datetime
now = datetime.now()

sub_file_lgb = 'data/submission_3_soft_XGB_{:.4f}_{}.csv'.format(global_score_soft, str(now.strftime('%Y-%m-%d-%H-%M')))
sub_file_rf = 'data/submission_3_soft_RF_{:.4f}_{}.csv'.format(global_rf_score_soft, str(now.strftime('%Y-%m-%d-%H-%M')))
sub_file_ens = 'data/submission_3_soft_ens_{:.4f}_{}.csv'.format(global_combo_score_soft, str(now.strftime('%Y-%m-%d-%H-%M')))

y_subm_lgb.to_csv(sub_file_lgb, index=False)
y_subm_rf.to_csv(sub_file_rf, index=False)
y_subm_ens.to_csv(sub_file_ens, index=False)