# Very simple 11th place solution【Details】

This Notebook explains Pseudo Labeling. For an overview of the 11th place solution, see this Notebook:  
【日本語&English】TPS Feb 11th place solution  
https://www.kaggle.com/maostack/english-tps-feb-11th-place-solution

## Pseudo Labeling
Pseudo labeling is a semi-supervised learning technique.  
  
It consists of two stages.  
First, prepare a prediction model (LightGBM in this case).  
In the 1st stage, train the model on the labeled training data and predict the test data as usual. These predictions are used as pseudo-labels for the test data; in other words, the predicted values are treated as if they were the target variable (label/target).  
In the 2nd stage, train the model again on the original training data combined with the pseudo-labeled test data, and use it to predict the test data.
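
The two stages can be sketched on synthetic data as follows. This is only a minimal illustration: the data, the least-squares stand-in model, and the helper names `fit_predict` and `kfold_test_predictions` are assumptions made for the example, not the LightGBM setup used later in this Notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, not the competition data).
n_train, n_test, n_features = 200, 100, 3
true_w = np.array([1.5, -2.0, 0.5])
X_tr = rng.normal(size=(n_train, n_features))
y_tr = X_tr @ true_w + rng.normal(scale=0.1, size=n_train)
X_te = rng.normal(size=(n_test, n_features))


def fit_predict(X_fit, y_fit, X_new):
    """Stand-in model: ordinary least squares fitted with NumPy."""
    w, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)
    return X_new @ w


def kfold_test_predictions(X_fit, y_fit, X_new, n_splits=5):
    """Train one model per fold and average the predictions on X_new."""
    folds = np.array_split(rng.permutation(len(X_fit)), n_splits)
    fold_preds = []
    for k in range(n_splits):
        # Train on every fold except fold k (fold k would be the validation set).
        tr_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        fold_preds.append(fit_predict(X_fit[tr_idx], y_fit[tr_idx], X_new))
    return np.mean(fold_preds, axis=0)


# Stage 1: train on the labeled data only; the averaged test predictions
# become the pseudo-labels.
pseudo_labels = kfold_test_predictions(X_tr, y_tr, X_te)

# Stage 2: append the pseudo-labeled test rows to the training data,
# retrain, and predict the test data again.
X_aug = np.vstack([X_tr, X_te])
y_aug = np.concatenate([y_tr, pseudo_labels])
final_pred = kfold_test_predictions(X_aug, y_aug, X_te)

print(X_aug.shape, y_aug.shape, final_pred.shape)
```

The structure mirrors the rest of this Notebook: the same K-fold routine runs twice, and only the training data changes between the two runs.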

In [1]:
import os
import numpy as np
import pandas as pd
import warnings
warnings.simplefilter("ignore")

The data loaded here as train and test has already been preprocessed in another Notebook.  
The preprocessing steps were:  
・Exclude rows whose target is an outlier (rows where target is less than 4)  
・Exclude rows where "cat6" is "G", because "G" appears only in the training data (no test row takes the value G)  
・Label Encoding for the categorical variables  
・RankGauss transformation of the cont columns  
The RankGauss transformation is not strictly necessary, because a monotonic transformation of a feature has essentially no effect on decision-tree-based models, but I applied it just in case.  
After the exclusions, the training data has 299963 rows (37 rows were removed). The number of test rows is unchanged.
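
The preprocessing Notebook itself is not reproduced here, so the following is only a rough sketch of what the four steps could look like on a toy frame. The toy values, the helper `rank_gauss`, the choice to encode and transform train and test together, and the rank-to-normal-quantile formula are all assumptions made for illustration.

```python
import pandas as pd
from statistics import NormalDist


def rank_gauss(values: pd.Series) -> pd.Series:
    """Map values to normal quantiles via their ranks (one common RankGauss variant)."""
    ranks = values.rank(method="average")   # ranks 1..n
    uniform = ranks / (len(values) + 1)     # strictly inside (0, 1)
    return uniform.map(NormalDist().inv_cdf)


# Hypothetical toy data with one categorical and one continuous column.
raw_train = pd.DataFrame({
    "cat6": ["A", "B", "G", "A", "B"],
    "cont0": [0.1, 0.9, 0.5, 0.3, 0.7],
    "target": [7.2, 8.1, 6.9, 1.5, 7.8],
})
raw_test = pd.DataFrame({"cat6": ["B", "A", "A"], "cont0": [0.2, 0.8, 0.4]})

# Steps 1 and 2: drop outlier targets (< 4) and rows where cat6 == "G".
keep = (raw_train["target"] >= 4) & (raw_train["cat6"] != "G")
train_toy = raw_train[keep].copy()
test_toy = raw_test.copy()

# Step 3: label encoding with one mapping shared by train and test.
categories = sorted(set(train_toy["cat6"]) | set(test_toy["cat6"]))
mapping = {c: i for i, c in enumerate(categories)}
train_toy["cat6"] = train_toy["cat6"].map(mapping)
test_toy["cat6"] = test_toy["cat6"].map(mapping)

# Step 4: RankGauss on the continuous column, fitted on train and test together.
combined = pd.concat([train_toy["cont0"], test_toy["cont0"]], ignore_index=True)
transformed = rank_gauss(combined)
train_toy["cont0"] = transformed.iloc[:len(train_toy)].to_numpy()
test_toy["cont0"] = transformed.iloc[len(train_toy):].to_numpy()

print(train_toy)
print(test_toy)
```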

In [2]:
train = pd.read_csv("../input/tps-feb-eda-fe/train_data.csv")
test = pd.read_csv("../input/tps-feb-eda-fe/test_data.csv")

In [3]:
train

Unnamed: 0,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,target
0,0,1,0,0,1,3,0,4,2,8,...,0.908575,0.056403,0.883833,1.060525,0.713408,0.768699,0.330933,1.126528,0.415784,6.994023
1,1,0,0,0,1,1,0,4,0,5,...,-0.005509,-0.428840,-0.472035,0.550824,0.157424,0.312172,0.597869,-0.547837,1.007245,8.071256
2,0,0,0,2,1,3,0,1,2,13,...,1.099466,-0.253375,1.119124,0.862004,1.021238,-0.754509,0.285567,1.046086,1.426711,5.760456
3,0,0,0,2,1,3,0,4,6,10,...,1.249950,0.424574,0.072623,0.874236,0.635667,-0.035676,0.303143,0.720188,0.148221,7.806457
4,0,1,0,0,1,1,0,4,2,5,...,-0.248527,0.694318,-0.154434,0.082267,-0.049087,0.769007,0.214702,-0.335699,-0.419706,6.868974
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299958,0,1,0,2,1,1,0,4,4,11,...,-0.651785,-1.075792,0.009048,-0.539073,-0.587157,-0.897136,-0.224822,-1.764247,0.800649,8.343538
299959,0,1,0,2,1,1,0,4,4,11,...,-1.353569,-1.089192,0.344379,-0.852094,-0.706559,-0.665378,-0.445194,-0.789555,-0.884763,7.851861
299960,0,1,0,2,1,1,0,4,2,12,...,0.034636,-0.140409,0.173324,-1.184553,0.033184,-0.896589,-0.477534,-1.485441,-0.231799,7.600558
299961,0,1,1,2,1,1,0,3,4,5,...,-0.382145,-0.601613,-1.543380,-0.867771,0.066891,-0.275560,-0.792100,-0.826383,0.063476,8.272095


In [4]:
test

Unnamed: 0,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont4,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13
0,0,1,0,2,1,3,0,4,4,6,...,0.595238,0.309205,-0.873422,-0.458534,-0.920132,-1.117440,-0.668930,-0.694335,0.230501,-0.384539
1,0,1,0,2,1,3,0,4,2,11,...,-1.036099,0.140669,-0.070859,0.174289,0.827521,0.069893,0.095434,0.547913,-0.649927,0.505568
2,0,1,0,2,1,3,0,4,2,5,...,-0.541313,0.424204,0.668697,-0.829366,0.538120,0.164693,1.154018,0.650631,-0.459497,-0.650035
3,0,0,1,0,1,3,0,4,4,5,...,0.234134,0.591905,-1.230619,-0.316383,0.091343,0.268959,-0.208796,-0.737892,0.141772,-0.332619
4,0,1,0,0,1,1,0,4,4,8,...,0.810410,-1.178937,-0.557295,0.418894,-1.029271,-1.029516,0.658701,-0.920389,-1.045779,0.814884
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199995,0,0,0,2,1,3,0,4,6,11,...,-1.062735,1.501875,-1.226386,0.679031,1.232057,1.425871,-0.103254,0.036514,1.009991,-0.012324
199996,0,0,0,2,1,3,0,4,4,5,...,0.347952,-0.931046,-0.298485,0.714352,-1.179449,-0.730398,-0.369483,-0.208399,0.248459,-0.943160
199997,0,0,0,2,1,3,0,4,2,10,...,0.332920,-0.031559,0.997821,-0.139357,0.367153,0.135578,-0.088173,0.116135,0.522593,0.024800
199998,0,1,0,0,1,3,0,4,2,5,...,-0.615755,0.770680,0.635364,0.285544,0.386398,0.791350,0.376017,0.671068,2.168309,0.831018


In [5]:
cat_columns = [f"cat{i}" for i in range(10)]

In [6]:
X = train.drop(["target"], axis=1)
X_test = test
y = train.target

print(X.shape)
print(X_test.shape)
print(y.shape)

(299963, 24)
(200000, 24)
(299963,)


In [7]:
X

Unnamed: 0,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont4,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13
0,0,1,0,0,1,3,0,4,2,8,...,-0.278224,0.908575,0.056403,0.883833,1.060525,0.713408,0.768699,0.330933,1.126528,0.415784
1,1,0,0,0,1,1,0,4,0,5,...,-0.222351,-0.005509,-0.428840,-0.472035,0.550824,0.157424,0.312172,0.597869,-0.547837,1.007245
2,0,0,0,2,1,3,0,1,2,13,...,-0.162923,1.099466,-0.253375,1.119124,0.862004,1.021238,-0.754509,0.285567,1.046086,1.426711
3,0,0,0,2,1,3,0,4,6,10,...,0.880367,1.249950,0.424574,0.072623,0.874236,0.635667,-0.035676,0.303143,0.720188,0.148221
4,0,1,0,0,1,1,0,4,2,5,...,-0.626103,-0.248527,0.694318,-0.154434,0.082267,-0.049087,0.769007,0.214702,-0.335699,-0.419706
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299958,0,1,0,2,1,1,0,4,4,11,...,0.872306,-0.651785,-1.075792,0.009048,-0.539073,-0.587157,-0.897136,-0.224822,-1.764247,0.800649
299959,0,1,0,2,1,1,0,4,4,11,...,0.915129,-1.353569,-1.089192,0.344379,-0.852094,-0.706559,-0.665378,-0.445194,-0.789555,-0.884763
299960,0,1,0,2,1,1,0,4,2,12,...,-0.154583,0.034636,-0.140409,0.173324,-1.184553,0.033184,-0.896589,-0.477534,-1.485441,-0.231799
299961,0,1,1,2,1,1,0,3,4,5,...,0.815325,-0.382145,-0.601613,-1.543380,-0.867771,0.066891,-0.275560,-0.792100,-0.826383,0.063476


In [8]:
X_test

Unnamed: 0,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont4,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13
0,0,1,0,2,1,3,0,4,4,6,...,0.595238,0.309205,-0.873422,-0.458534,-0.920132,-1.117440,-0.668930,-0.694335,0.230501,-0.384539
1,0,1,0,2,1,3,0,4,2,11,...,-1.036099,0.140669,-0.070859,0.174289,0.827521,0.069893,0.095434,0.547913,-0.649927,0.505568
2,0,1,0,2,1,3,0,4,2,5,...,-0.541313,0.424204,0.668697,-0.829366,0.538120,0.164693,1.154018,0.650631,-0.459497,-0.650035
3,0,0,1,0,1,3,0,4,4,5,...,0.234134,0.591905,-1.230619,-0.316383,0.091343,0.268959,-0.208796,-0.737892,0.141772,-0.332619
4,0,1,0,0,1,1,0,4,4,8,...,0.810410,-1.178937,-0.557295,0.418894,-1.029271,-1.029516,0.658701,-0.920389,-1.045779,0.814884
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199995,0,0,0,2,1,3,0,4,6,11,...,-1.062735,1.501875,-1.226386,0.679031,1.232057,1.425871,-0.103254,0.036514,1.009991,-0.012324
199996,0,0,0,2,1,3,0,4,4,5,...,0.347952,-0.931046,-0.298485,0.714352,-1.179449,-0.730398,-0.369483,-0.208399,0.248459,-0.943160
199997,0,0,0,2,1,3,0,4,2,10,...,0.332920,-0.031559,0.997821,-0.139357,0.367153,0.135578,-0.088173,0.116135,0.522593,0.024800
199998,0,1,0,0,1,3,0,4,2,5,...,-0.615755,0.770680,0.635364,0.285544,0.386398,0.791350,0.376017,0.671068,2.168309,0.831018


In [9]:
y

0         6.994023
1         8.071256
2         5.760456
3         7.806457
4         6.868974
            ...   
299958    8.343538
299959    7.851861
299960    7.600558
299961    8.272095
299962    6.025685
Name: target, Length: 299963, dtype: float64

In [10]:
from sklearn.model_selection import KFold
import lightgbm as lgb

SEED = 8970365

In [11]:
kf = KFold(n_splits=5, shuffle=True, random_state=SEED)
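
As a small illustration of what `kf.split` yields, the sketch below runs the same kind of splitter on 10 toy rows. The toy array and the names `toy_X`, `toy_kf`, and `val_rows` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(20).reshape(10, 2)  # 10 toy rows, hypothetical data
toy_kf = KFold(n_splits=5, shuffle=True, random_state=8970365)

val_rows = []
for fold, (tr_id, vl_id) in enumerate(toy_kf.split(toy_X)):
    # tr_id and vl_id are positional row indices, which is why the
    # training loops in this Notebook use .iloc to select rows.
    print(f"fold {fold + 1}: train rows={len(tr_id)}, validation rows={len(vl_id)}")
    val_rows.extend(vl_id.tolist())

# Across the 5 folds, every row is used for validation exactly once.
print(sorted(val_rows) == list(range(10)))
```

Because `random_state` is fixed, the same splitter object produces the same partition each time `split` is called on data of the same length.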

In [12]:
# The parameter values were taken from another person's Notebook.
# Thank you.

params_lgb = {
    "task": "train",
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.003899156646724397,
    "num_leaves": 63,
    "max_depth": 99,
    "feature_fraction": 0.2256038826485174,
    "bagging_fraction": 0.8805303688019942,
    "min_child_samples": 290,
    "reg_alpha": 9.562925363678952,
    "reg_lambda": 9.355810045480153,
    "max_bin": 882,
    "min_data_per_group": 127,
    "bagging_freq": 1,
    "cat_smooth": 96,
    "cat_l2": 19,
    "verbosity": -1,
    "bagging_seed": SEED,
    "feature_fraction_seed": SEED,
    "seed": SEED
}

First, make predictions for the test data as usual.

In [13]:
# DataFrame to store the predictions
preds_lgb = pd.DataFrame()

X[cat_columns] = X[cat_columns].astype("category")
X_test[cat_columns] = X_test[cat_columns].astype("category")

for k, (tr_id, vl_id) in enumerate(kf.split(X, y)):
    print("="*50)
    print(f"               KFold{k+1}")
    print("="*50)
    
    X_train, X_val = X.iloc[tr_id, :], X.iloc[vl_id, :]
    y_train, y_val = y.iloc[tr_id], y.iloc[vl_id]
    
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_val = lgb.Dataset(X_val, y_val)
    
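    # LightGBM 4.0 removed the early_stopping_rounds and verbose_eval arguments
    # of lgb.train; the callbacks below (available since LightGBM 3.3) replace them.
    model_lgb = lgb.train(params=params_lgb,
                          train_set=lgb_train,
                          valid_sets=[lgb_val],
                          num_boost_round=100000,
                          callbacks=[lgb.early_stopping(stopping_rounds=200),
                                     lgb.log_evaluation(period=1000)])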
    
    pred_lgb = model_lgb.predict(X_test, num_iteration=model_lgb.best_iteration)
    pred_lgb = pd.DataFrame(pred_lgb)
    
    # Concatenate the predictions horizontally
    preds_lgb = pd.concat([preds_lgb, pred_lgb], axis=1)

               KFold1
Training until validation scores don't improve for 200 rounds
[1000]	valid_0's rmse: 0.851661
[2000]	valid_0's rmse: 0.844799
[3000]	valid_0's rmse: 0.842719
[4000]	valid_0's rmse: 0.841835
[5000]	valid_0's rmse: 0.841445
[6000]	valid_0's rmse: 0.841315
[7000]	valid_0's rmse: 0.841248
Early stopping, best iteration is:
[7542]	valid_0's rmse: 0.841226
               KFold2
Training until validation scores don't improve for 200 rounds
[1000]	valid_0's rmse: 0.850553
[2000]	valid_0's rmse: 0.843702
[3000]	valid_0's rmse: 0.84185
[4000]	valid_0's rmse: 0.841178
[5000]	valid_0's rmse: 0.840995
[6000]	valid_0's rmse: 0.840971
Early stopping, best iteration is:
[5873]	valid_0's rmse: 0.840962
               KFold3
Training until validation scores don't improve for 200 rounds
[1000]	valid_0's rmse: 0.852435
[2000]	valid_0's rmse: 0.845282
[3000]	valid_0's rmse: 0.843149
[4000]	valid_0's rmse: 0.842291
[5000]	valid_0's rmse: 0.84192
[6000]	valid_0's rmse: 0.841766
Early st

In [14]:
preds_lgb

Unnamed: 0,0,0.1,0.2,0.3,0.4
0,7.602948,7.638474,7.613636,7.629203,7.636820
1,7.813892,7.815879,7.753977,7.844712,7.820410
2,7.611244,7.651846,7.615406,7.580313,7.625089
3,7.551531,7.499887,7.536462,7.452304,7.511212
4,7.289235,7.273643,7.229313,7.230314,7.235424
...,...,...,...,...,...
199995,7.518459,7.518438,7.500072,7.499188,7.458684
199996,7.210795,7.212373,7.298205,7.247478,7.254758
199997,7.506612,7.538585,7.540951,7.538262,7.570632
199998,7.486802,7.454527,7.514924,7.457591,7.443532


In [15]:
# Take the mean across folds and use it as the pseudo-labels for the test data

label = preds_lgb.mean(axis=1)
label

0         7.624216
1         7.809774
2         7.616780
3         7.510279
4         7.251586
            ...   
199995    7.498968
199996    7.244722
199997    7.539008
199998    7.471475
199999    7.277810
Length: 200000, dtype: float64
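
The row-wise averaging can be checked on a tiny example. The numbers and the names `fold_outputs`, `toy_preds`, and `toy_label` below are hypothetical. Giving each fold its own column name is an optional variation: in the cell above, each fold's frame has a single column labeled 0, so the concatenated frame carries repeated column labels.

```python
import pandas as pd

# Hypothetical predictions from 3 folds for 4 test rows.
fold_outputs = [
    [7.6, 7.8, 7.5, 7.2],
    [7.7, 7.9, 7.4, 7.3],
    [7.5, 7.7, 7.6, 7.1],
]

# One named column per fold instead of repeated labels.
toy_preds = pd.DataFrame({f"fold{k + 1}": vals for k, vals in enumerate(fold_outputs)})

# Row-wise mean: one averaged prediction (pseudo-label) per test row.
toy_label = toy_preds.mean(axis=1)
print(toy_preds)
print(toy_label)
```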

In [16]:
# Vertically concatenate the test data and pseudo-labels onto the original training data X, y.
# This becomes the new training data.

X = pd.concat([X, X_test], axis=0).reset_index(drop=True)
y = pd.concat([y, label], axis=0).reset_index(drop=True)

print("X.shape: ", X.shape)
print("y.shape: ", y.shape)

X.shape:  (499963, 24)
y.shape:  (499963,)
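
One thing to watch when stacking frames with `pd.concat`: if a categorical column does not have identical categories in both frames, pandas does not keep the `category` dtype in the result. The toy frames `a` and `b` below are hypothetical, and whether this applies to the actual train and test data depends on whether their categories match, but it shows why casting to `category` again after the concatenation is a safe step.

```python
import pandas as pd

# Hypothetical toy columns: the two frames do not share the same categories.
a = pd.DataFrame({"cat0": pd.Series(["x", "y"], dtype="category")})
b = pd.DataFrame({"cat0": pd.Series(["x", "z"], dtype="category")})

stacked = pd.concat([a, b], axis=0).reset_index(drop=True)

# Is the stacked column still categorical?
print(isinstance(stacked["cat0"].dtype, pd.CategoricalDtype))

# Casting again gives one categorical column covering all observed values.
stacked["cat0"] = stacked["cat0"].astype("category")
print(list(stacked["cat0"].cat.categories))
```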


In [17]:
X

Unnamed: 0,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont4,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13
0,0,1,0,0,1,3,0,4,2,8,...,-0.278224,0.908575,0.056403,0.883833,1.060525,0.713408,0.768699,0.330933,1.126528,0.415784
1,1,0,0,0,1,1,0,4,0,5,...,-0.222351,-0.005509,-0.428840,-0.472035,0.550824,0.157424,0.312172,0.597869,-0.547837,1.007245
2,0,0,0,2,1,3,0,1,2,13,...,-0.162923,1.099466,-0.253375,1.119124,0.862004,1.021238,-0.754509,0.285567,1.046086,1.426711
3,0,0,0,2,1,3,0,4,6,10,...,0.880367,1.249950,0.424574,0.072623,0.874236,0.635667,-0.035676,0.303143,0.720188,0.148221
4,0,1,0,0,1,1,0,4,2,5,...,-0.626103,-0.248527,0.694318,-0.154434,0.082267,-0.049087,0.769007,0.214702,-0.335699,-0.419706
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499958,0,0,0,2,1,3,0,4,6,11,...,-1.062735,1.501875,-1.226386,0.679031,1.232057,1.425871,-0.103254,0.036514,1.009991,-0.012324
499959,0,0,0,2,1,3,0,4,4,5,...,0.347952,-0.931046,-0.298485,0.714352,-1.179449,-0.730398,-0.369483,-0.208399,0.248459,-0.943160
499960,0,0,0,2,1,3,0,4,2,10,...,0.332920,-0.031559,0.997821,-0.139357,0.367153,0.135578,-0.088173,0.116135,0.522593,0.024800
499961,0,1,0,0,1,3,0,4,2,5,...,-0.615755,0.770680,0.635364,0.285544,0.386398,0.791350,0.376017,0.671068,2.168309,0.831018


In [18]:
y

0         6.994023
1         8.071256
2         5.760456
3         7.806457
4         6.868974
            ...   
499958    7.498968
499959    7.244722
499960    7.539008
499961    7.471475
499962    7.277810
Length: 499963, dtype: float64

In [19]:
# DataFrame to store the final predictions
preds_lgb = pd.DataFrame()

X[cat_columns] = X[cat_columns].astype("category")
X_test[cat_columns] = X_test[cat_columns].astype("category")

for k, (tr_id, vl_id) in enumerate(kf.split(X, y)):
    print("="*50)
    print(f"               KFold{k+1}")
    print("="*50)
    
    X_train, X_val = X.iloc[tr_id, :], X.iloc[vl_id, :]
    y_train, y_val = y.iloc[tr_id], y.iloc[vl_id]
    
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_val = lgb.Dataset(X_val, y_val)
    
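    # LightGBM 4.0 removed the early_stopping_rounds and verbose_eval arguments
    # of lgb.train; the callbacks below (available since LightGBM 3.3) replace them.
    model_lgb = lgb.train(params=params_lgb,
                          train_set=lgb_train,
                          valid_sets=[lgb_val],
                          num_boost_round=100000,
                          callbacks=[lgb.early_stopping(stopping_rounds=200),
                                     lgb.log_evaluation(period=1000)])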
    
    pred_lgb = model_lgb.predict(X_test, num_iteration=model_lgb.best_iteration)
    pred_lgb = pd.DataFrame(pred_lgb)
    preds_lgb = pd.concat([preds_lgb, pred_lgb], axis=1)

               KFold1
Training until validation scores don't improve for 200 rounds
[1000]	valid_0's rmse: 0.664132
[2000]	valid_0's rmse: 0.655471
[3000]	valid_0's rmse: 0.65283
[4000]	valid_0's rmse: 0.651688
[5000]	valid_0's rmse: 0.65112
[6000]	valid_0's rmse: 0.65082
[7000]	valid_0's rmse: 0.650654
[8000]	valid_0's rmse: 0.650549
[9000]	valid_0's rmse: 0.650482
[10000]	valid_0's rmse: 0.650443
[11000]	valid_0's rmse: 0.650421
Early stopping, best iteration is:
[11398]	valid_0's rmse: 0.650416
               KFold2
Training until validation scores don't improve for 200 rounds
[1000]	valid_0's rmse: 0.66204
[2000]	valid_0's rmse: 0.653306
[3000]	valid_0's rmse: 0.650662
[4000]	valid_0's rmse: 0.649505
[5000]	valid_0's rmse: 0.648942
[6000]	valid_0's rmse: 0.648648
[7000]	valid_0's rmse: 0.648484
[8000]	valid_0's rmse: 0.648372
[9000]	valid_0's rmse: 0.648309
[10000]	valid_0's rmse: 0.648263
Early stopping, best iteration is:
[10172]	valid_0's rmse: 0.648257
               KFold3
Tra
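
Note that the validation RMSE in this second stage (about 0.65) is not directly comparable with the first stage (about 0.84). A likely reason is that the validation folds now contain pseudo-labeled test rows, whose targets are themselves model predictions and are therefore easier to reproduce. The sketch below uses made-up numbers and a hypothetical `is_original` mask to show how the score could be restricted to rows with real targets. This check is not part of the original solution.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


# Hypothetical validation fold: 3 rows with real targets, then 3 pseudo-labeled rows.
y_val_toy = np.array([7.0, 8.2, 6.1, 7.5, 7.3, 7.6])
pred_val_toy = np.array([7.4, 7.6, 7.0, 7.5, 7.35, 7.55])
is_original = np.array([True, True, True, False, False, False])

rmse_all = rmse(y_val_toy, pred_val_toy)
rmse_original = rmse(y_val_toy[is_original], pred_val_toy[is_original])

print("all rows     :", round(rmse_all, 4))
print("original only:", round(rmse_original, 4))
```

In the loop above, such a mask could be derived from the row positions, because the first 299963 rows of the concatenated X are the original training data.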

# Submission

In [20]:
submission = pd.read_csv("../input/tabular-playground-series-feb-2021/sample_submission.csv")

# Average the fold predictions to get the final prediction.
pred = preds_lgb.mean(axis=1)
submission.target = pred

submission.head()

Unnamed: 0,id,target
0,0,7.629647
1,5,7.768953
2,15,7.611821
3,16,7.515771
4,17,7.253258


In [21]:
submission.to_csv("submission_pseudo_lgb_5.csv", index=False)

Using the same prediction method as in this Notebook, create multiple submission files while varying the parameter values, the seed, the early stopping rounds, and so on. Finally, ensemble them.  
That is the 11th place solution.
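
The ensembling step is not shown in this Notebook. As one simple possibility, the sketch below takes an unweighted mean of several submissions. The frames `sub_a`, `sub_b`, and `sub_c` and their values are hypothetical stand-ins for submission files from separate runs, and the actual ensemble may have used a different weighting.

```python
import pandas as pd

# Hypothetical stand-ins for submission files produced by separate runs
# (in practice these would be loaded with pd.read_csv).
sub_a = pd.DataFrame({"id": [0, 5, 15], "target": [7.60, 7.80, 7.50]})
sub_b = pd.DataFrame({"id": [0, 5, 15], "target": [7.70, 7.70, 7.60]})
sub_c = pd.DataFrame({"id": [0, 5, 15], "target": [7.65, 7.75, 7.55]})

runs = [sub_a, sub_b, sub_c]

# Align the runs on id, then take the unweighted mean of the target columns.
targets = pd.concat([s.set_index("id")["target"] for s in runs], axis=1)
ensemble = targets.mean(axis=1).rename("target").reset_index()

print(ensemble)
```

Aligning on `id` before averaging guards against files whose rows are in a different order.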