## A metric for how bad an answer is
At +0.006, this stood out by far among the features we built 😇

### How it is built

Using all of train, compute how often each choice of each question is selected

Example: selection rates for choices 1/2/3/4 of content_id=XXXX: 9% / 5% / 1% / 85%

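For each question, accumulate the selection rates from the least-selected choice upward to get each choice's percentile

Example:
- Choice 1: 15% (=1+5+9)
- Choice 2: 6% (=1+5)
- Choice 3: 1%
- Choice 4: 100% (=1+5+9+85)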
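
As a minimal sketch of the two steps above (selection rate, then cumulative percentile), the following pandas snippet rebuilds the worked example. The toy log and the column name `user_answer_percentile` are assumptions for illustration and do not come from the original notebook; choices with identical rates would be ordered arbitrarily here.

```python
import pandas as pd

# Toy answer log for one hypothetical question (content_id=0), built so the
# selection rates match the example: 9% / 5% / 1% / 85% for choices 1/2/3/4.
log = pd.DataFrame({
    "content_id": 0,
    "user_answer": [1] * 9 + [2] * 5 + [3] * 1 + [4] * 85,
})

# Step 1: share of responses that picked each choice, per question.
rate = (
    log.groupby("content_id")["user_answer"]
    .value_counts(normalize=True)
    .rename("user_answer_rate")
    .reset_index()
)

# Step 2: within each question, order choices from rarest to most common
# and accumulate the rates to obtain each choice's percentile.
rate = rate.sort_values(["content_id", "user_answer_rate"])
rate["user_answer_percentile"] = (
    rate.groupby("content_id")["user_answer_rate"].cumsum()
)

# Map choice -> percentile for the single toy question.
percentile = (
    rate.set_index("user_answer")["user_answer_percentile"].round(2).to_dict()
)
print(percentile)
```

A rarely chosen option ends up with a percentile near 0, while the most popular option always ends at 1.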

Aggregate the percentiles of each user's past choices (std, avg, min, etc.)
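
The aggregation step might look roughly like the sketch below. It assumes the log is already in chronological order per user and already carries a hypothetical `user_answer_percentile` column; the `past_*` feature names are made up for illustration. The per-user `shift(1)` keeps the current answer out of its own feature, so only earlier answers contribute.

```python
import pandas as pd

# Hypothetical interaction log, chronological within each user, with the
# percentile of the chosen answer attached to every row.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2],
    "user_answer_percentile": [1.00, 0.06, 1.00, 0.01, 1.00, 1.00, 0.15],
})

grouped = df.groupby("user_id")["user_answer_percentile"]

# expanding() covers all rows up to and including the current one;
# shift(1) inside the group then drops the current row from the window.
df["past_mean"] = grouped.transform(lambda s: s.expanding().mean().shift(1))
df["past_std"] = grouped.transform(lambda s: s.expanding().std().shift(1))
df["past_min"] = grouped.transform(lambda s: s.expanding().min().shift(1))

print(df)
```

The first row of every user is NaN for all three features, and `past_std` needs two earlier answers before it is defined.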


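### Intuition
- Someone who picks a choice that almost nobody else picks is probably in trouble
- A strong student, even when wrong, should not pick a wildly implausible choice
- Someone who picks something like choice 3 in the example above is probably in trouble, and will likely stay that way afterwards
- This reasoning seems to hold: the std aggregation was extremely effective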

In [1]:
import cudf
import pandas as pd
print('cudf_version: ', cudf.__version__)
print('pd_version: ', pd.__version__)

cudf_version:  21.10.01
pd_version:  1.3.5


In [2]:
import pandas as pd
import numpy as np
import gc
from sklearn.metrics import roc_auc_score
from collections import defaultdict
from tqdm.notebook import tqdm
import lightgbm as lgb

## Using cudf

In [3]:
%%time
validaten_flg = True
if validaten_flg:
    data =  cudf.from_pandas(pd.read_pickle('../input/riiid-cross-validation-files/cv1_train.pickle'))
else:
    data = cudf.read_csv("../input/riiid-test-answer-prediction/train.csv")

print("Train size:", data.shape)

Train size: (98730332, 13)
CPU times: user 1.34 s, sys: 3 s, total: 4.35 s
Wall time: 4.36 s


In [6]:
data.head(10).shift(1)

Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation,max_time_stamp,rand_time_stamp,viretual_time_stamp
0,,,,,,,,,,,,,
1,32933156.0,0.0,705741139.0,128.0,0.0,0.0,0.0,1.0,,,87425772049.0,0.0,0.0
2,32933157.0,20666.0,705741139.0,7860.0,0.0,1.0,0.0,1.0,16000.0,False,87425772049.0,0.0,20666.0
3,32933158.0,39172.0,705741139.0,7922.0,0.0,2.0,1.0,1.0,19000.0,False,87425772049.0,0.0,39172.0
4,32933159.0,58207.0,705741139.0,156.0,0.0,3.0,2.0,1.0,17000.0,False,87425772049.0,0.0,58207.0
5,32933160.0,75779.0,705741139.0,51.0,0.0,4.0,0.0,1.0,17000.0,False,87425772049.0,0.0,75779.0
6,32933161.0,96110.0,705741139.0,50.0,0.0,5.0,3.0,1.0,16000.0,False,87425772049.0,0.0,96110.0
7,32933162.0,113305.0,705741139.0,7896.0,0.0,6.0,2.0,1.0,18000.0,False,87425772049.0,0.0,113305.0
8,32933163.0,131516.0,705741139.0,7863.0,0.0,7.0,0.0,1.0,16000.0,False,87425772049.0,0.0,131516.0
9,32933164.0,152038.0,705741139.0,152.0,0.0,8.0,1.0,1.0,16000.0,False,87425772049.0,0.0,152038.0


In [5]:
data.groupby('user_id').shift(1).expanding().mean()

AttributeError: DataFrame object has no attribute expanding
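
The cell above fails because this cudf DataFrame has no `expanding`. Note also that even in pandas, calling `.expanding()` on the result of `groupby(...).shift(1)` would expand over the whole frame rather than per user. One way to get a lagged per-user mean without `expanding` is to combine cumulative sums and counts, sketched here in pandas on toy data. Whether `cumsum` and `cumcount` behave the same on cudf groupby objects in this version is an assumption that has not been checked here.

```python
import pandas as pd

# Toy log, chronological within each user (values are made up).
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "answered_correctly": [1, 0, 1, 1, 1],
})

g = df.groupby("user_id")["answered_correctly"]

# Running total up to and including the current row, minus the current row,
# leaves the sum over the user's earlier rows only.
past_sum = g.cumsum() - df["answered_correctly"]

# cumcount() is the number of earlier rows for the same user (0 for the first).
past_count = g.cumcount()

# Lagged expanding mean; a user's first row has no history, so it becomes NaN.
df["past_mean"] = past_sum / past_count.where(past_count > 0)

print(df)
```

The same pattern extends to a lagged std by also accumulating the squared values, at the cost of some numerical precision.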

In [5]:
# Per-question selection rate of each choice, excluding rows with no answer (user_answer == -1)
data_pdf = data.query('user_answer != -1').to_pandas().groupby('content_id')['user_answer'].value_counts(normalize=True)

In [6]:
data_gdf = cudf.Series.from_pandas(data_pdf)
data_gdf

content_id  user_answer
0           0              0.907334
            1              0.049727
            2              0.030544
            3              0.012395
1           1              0.890571
                             ...   
13521       3              0.012453
13522       3              0.909887
            1              0.048811
            2              0.032541
            0              0.008761
Name: user_answer, Length: 52050, dtype: float64

In [7]:
gdf_user_answer = data_gdf.rename('user_answer_rate').reset_index()
gdf_user_answer

Unnamed: 0,content_id,user_answer,user_answer_rate
0,0,0,0.907334
1,0,1,0.049727
2,0,2,0.030544
3,0,3,0.012395
4,1,1,0.890571
...,...,...,...
52045,13521,3,0.012453
52046,13522,3,0.909887
52047,13522,1,0.048811
52048,13522,2,0.032541


In [8]:
# Attach the selection rate of the chosen answer to every row of the log
data_merge = cudf.merge(data, gdf_user_answer, on=['content_id','user_answer'], how='left')
data_merge

Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation,max_time_stamp,rand_time_stamp,viretual_time_stamp,user_answer_rate
0,43357222,2660812,918309463,2052,0,68,1,1,22333.0,False,3213491466,280812803,283473615,0.821034
1,43357223,2660812,918309463,2051,0,68,2,1,22333.0,False,3213491466,280812803,283473615,0.745510
2,10405786,150097537,226364065,7862,0,355,1,1,20000.0,True,32007438472,133378842,283476379,0.980425
3,60610523,824690,1286426216,559,0,29,3,1,17000.0,True,906141,282660684,283485374,0.695524
4,67030152,265484710,1424896302,2595,0,46,2,0,12000.0,True,1222216830,18007988,283492698,0.080073
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98730327,90430642,15293028075,1920242653,1309,0,5413,1,1,18000.0,True,15732999210,66957716406,82250744481,0.421801
98730328,28719730,3113539480,613867353,4417,0,882,1,0,13000.0,True,4614454983,79137207687,82250747167,0.343949
98730329,42136729,3159622708,893296561,5679,0,584,1,0,1000.0,True,5227351169,79091124634,82250747342,0.055309
98730330,209682,9702810145,4222121,7898,0,324,2,1,53600.0,True,12375567057,72547937929,82250748074,0.543301
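
The merged output above no longer starts at the same rows as `data.head(10)`, so it seems safer not to rely on row order after the merge. A small pandas sketch of the same left join on toy data follows. Using `row_id` as the ordering key and `validate="many_to_one"` as a guard against duplicated rate rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy log; the last row has user_answer == -1, i.e. no answer was given.
log = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "content_id": [10, 10, 11, 11],
    "user_answer": [0, 3, 1, -1],
})

# Toy rate table with one row per (content_id, user_answer) pair.
rates = pd.DataFrame({
    "content_id": [10, 10, 11],
    "user_answer": [0, 3, 1],
    "user_answer_rate": [0.9, 0.1, 1.0],
})

merged = (
    log.merge(
        rates,
        on=["content_id", "user_answer"],
        how="left",
        validate="many_to_one",  # fail loudly if the rate table has duplicate keys
    )
    .sort_values("row_id")       # restore a known order before time-based features
    .reset_index(drop=True)
)

print(merged)
```

Pairs that are missing from the rate table, such as the `user_answer == -1` row here, come out of the left join with a NaN rate.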


In [9]:
import pickle

def pickle_dump(obj, path):
    with open(path, mode='wb') as f:
        pickle.dump(obj,f)

if validaten_flg:
    pickle_dump(gdf_user_answer.to_pandas(), '../input/my_validaten_datasets/user_answer_rate_cv1.pickle')
else:
    gdf_user_answer.to_csv('../input/user_answer_rate.csv', index=False)

In [10]:
# data_merge.to_pandas().to_csv('../input/all_train.csv', index=False)