I'd like to share my train/valid split script.

In [15]:
import pandas as pd
import random
import gc

random.seed(1)

In [16]:
train = pd.read_csv('/Users/hesu/Documents/KT/riiid/train.csv',
                   dtype={'row_id': 'int64',
                          'timestamp': 'int64',
                          'user_id': 'int32',
                          'content_id': 'int16',
                          'content_type_id': 'int8',
                          'task_container_id': 'int16',
                          'user_answer': 'int8',
                          'answered_correctly':'int8',
                          'prior_question_elapsed_time': 'float32',
                          'prior_question_had_explanation': 'boolean'}
                   )

Using the last several entries of each user as validation data is easy and doesn't look too bad.
However, this split method may focus too much on light users over heavy users.
As a result, the average percentage of correct answers in the validation data becomes lower (0.541 vs. 0.660 below), which risks leading us in the wrong direction.

In [17]:
valid_split1 = train.groupby('user_id').tail(5)
train_split1 = train[~train.row_id.isin(valid_split1.row_id)]
valid_split1 = valid_split1[valid_split1.content_type_id == 0]
train_split1 = train_split1[train_split1.content_type_id == 0]
print(f'{train_split1.answered_correctly.mean():.3f} {valid_split1.answered_correctly.mean():.3f}')

0.660 0.541
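
To illustrate why taking the same number of rows from every user shifts the validation set toward light users, here is a minimal sketch on made-up data. The frame `toy` and its user counts are hypothetical and only meant to show the effect, not to reproduce the numbers above.

```python
import pandas as pd

# Hypothetical toy data: one heavy user (100 rows) and two light users (5 rows each).
toy = pd.DataFrame({
    'user_id': [1] * 100 + [2] * 5 + [3] * 5,
    'row_id': range(110),
})

# Same split rule as above: last 5 rows of every user go to validation.
toy_valid = toy.groupby('user_id').tail(5)

# Share of rows that belong to the light users (2 and 3).
light_share_all = toy.user_id.isin([2, 3]).mean()
light_share_valid = toy_valid.user_id.isin([2, 3]).mean()
print(f'{light_share_all:.3f} {light_share_valid:.3f}')
```

In this toy case the light users make up about 9% of all rows but two thirds of the validation rows.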


In [18]:
train_split1.head(10)

Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,0,0,115,5692,0,1,3,1,,
1,1,56943,115,5716,0,2,2,1,37000.0,False
2,2,118363,115,128,0,0,0,1,55000.0,False
3,3,131167,115,7860,0,3,0,1,19000.0,False
4,4,137965,115,7922,0,4,1,1,11000.0,False
5,5,157063,115,156,0,5,2,1,5000.0,False
6,6,176092,115,51,0,6,0,1,17000.0,False
7,7,194190,115,50,0,7,3,1,17000.0,False
8,8,212463,115,7896,0,8,2,1,16000.0,False
9,9,230983,115,7863,0,9,0,1,16000.0,False


In [19]:
valid_split1.head(10)

Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
41,41,667971812,115,2064,0,40,1,1,17000.0,False
42,42,667971812,115,2063,0,40,3,0,17000.0,False
43,43,668090043,115,3363,0,41,1,0,14333.0,False
44,44,668090043,115,3365,0,41,0,0,14333.0,False
45,45,668090043,115,3364,0,41,1,1,14333.0,False
71,71,554504,124,6911,0,14,2,0,7000.0,False
72,72,571323,124,7218,0,15,3,0,6500.0,False
73,73,571323,124,7216,0,15,0,0,6500.0,False
74,74,571323,124,7217,0,15,3,0,6500.0,False
75,75,571323,124,7219,0,15,1,0,6500.0,False


In [20]:
del valid_split1, train_split1
gc.collect()

80

Since training data and test data are split by time, the validation data should also be split by time.
However, the given timestamp is the time that has elapsed since the user's first event, not the actual time.
So I set a random first access time for each user within a certain interval.

In [21]:
max_timestamp_u = train[['user_id','timestamp']].groupby(['user_id']).agg(['max']).reset_index()
max_timestamp_u.columns = ['user_id', 'max_time_stamp']
MAX_TIME_STAMP = max_timestamp_u.max_time_stamp.max()

In [22]:
max_timestamp_u.head(10)

Unnamed: 0,user_id,max_time_stamp
0,115,668090043
1,124,571323
2,2746,835457
3,5382,2101551456
4,8623,862338736
5,8701,1571291
6,12741,4465486358
7,13134,18122046414
8,24418,14243735782
9,24600,1550831


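The random first access time for each user is drawn from 0 up to `(MAX_TIME_STAMP for all users) - (max_time_stamp for each user)`, so that no user's shifted timeline goes past MAX_TIME_STAMP.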

In [23]:
def rand_time(max_time_stamp):
    interval = MAX_TIME_STAMP - max_time_stamp
    rand_time_stamp = random.randint(0,interval)
    return rand_time_stamp

max_timestamp_u['rand_time_stamp'] = max_timestamp_u.max_time_stamp.apply(rand_time)
# rand_time_stamp is a randomly generated start time for each user,
# so viretual_time_stamp below can be treated as each user's actual interaction time
train = train.merge(max_timestamp_u, on='user_id', how='left')
train['viretual_time_stamp'] = train.timestamp + train['rand_time_stamp']

In [24]:
del train['max_time_stamp']
del train['rand_time_stamp']
del max_timestamp_u
gc.collect()

80
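
The steps above can be checked on a tiny example. This is only a sketch on hypothetical data (`toy`, `toy_max` and `TOY_MAX_TIME_STAMP` are made-up names); it follows the same idea as the cells above and shows that the shift keeps each user's internal ordering and never exceeds the global maximum.

```python
import random
import pandas as pd

random.seed(1)

# Hypothetical toy data: timestamp is the time elapsed since each user's first event.
toy = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2],
    'timestamp': [0, 50, 100, 0, 10],
})

toy_max = (toy.groupby('user_id').timestamp.max()
              .rename('max_time_stamp').reset_index())
TOY_MAX_TIME_STAMP = int(toy_max.max_time_stamp.max())

# Random start offset in [0, TOY_MAX_TIME_STAMP - max_time_stamp], both ends inclusive.
toy_max['rand_time_stamp'] = toy_max.max_time_stamp.apply(
    lambda m: random.randint(0, int(TOY_MAX_TIME_STAMP - m)))

toy = toy.merge(toy_max, on='user_id', how='left')
toy['viretual_time_stamp'] = toy.timestamp + toy.rand_time_stamp
print(toy[['user_id', 'timestamp', 'viretual_time_stamp']])
```

User 1 already spans the full range, so its offset can only be 0; user 2 is shifted by some value between 0 and 90.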

In [25]:
kaggle_env = True
if kaggle_env:
    # Full dataframe can not be sorted on kaggle kernel due to lack of memory.
    train = train[:10000000]
train = train.sort_values(['viretual_time_stamp', 'row_id']).reset_index(drop=True)

Now that the dataframe is sorted by viretual_time_stamp, we can easily split it by time.

In [26]:
if kaggle_env:
    val_size = 250000
else:
    val_size = 2500000

for cv in range(5):
    valid = train[-val_size:]
    train = train[:-val_size]
    # check new users and new contents
    new_users = len(valid[~valid.user_id.isin(train.user_id)].user_id.unique())
    valid_question = valid[valid.content_type_id == 0]
    train_question = train[train.content_type_id == 0]
    new_contents = len(valid_question[~valid_question.content_id.isin(train_question.content_id)].content_id.unique())    
    print(f'cv{cv} {train_question.answered_correctly.mean():.3f} {valid_question.answered_correctly.mean():.3f} {new_users} {new_contents}')
    valid.to_pickle(f'/Users/hesu/Documents/KT/riiid/cv{cv+1}_valid.pickle')
    train.to_pickle(f'/Users/hesu/Documents/KT/riiid/cv{cv+1}_train.pickle')

cv0 0.659 0.657 1392 0
cv1 0.658 0.668 1142 0
cv2 0.659 0.644 1015 0
cv3 0.659 0.661 1014 0
cv4 0.659 0.667 815 3
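
To make the structure of these folds easier to see, here is a small sketch of the same peel-from-the-end loop on a hypothetical 10-row frame (the names `toy`, `rest` and `folds` are made up for illustration). Each iteration takes the latest `val_size` rows as validation and keeps everything earlier as training, so later folds are nested inside earlier ones.

```python
import pandas as pd

# Hypothetical toy frame, already sorted by viretual_time_stamp.
toy = pd.DataFrame({'viretual_time_stamp': range(10)})
val_size = 2

folds = []
rest = toy
for cv in range(3):
    valid = rest[-val_size:]   # latest rows become validation
    rest = rest[:-val_size]    # everything earlier stays as training
    folds.append((rest, valid))
    print(cv, list(rest.viretual_time_stamp), list(valid.viretual_time_stamp))
```

In every fold, all training rows come strictly before the validation rows in time.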


For full data, this would be:
<pre>
cv0 0.658 0.642 15119 0
cv1 0.658 0.651 11198 0
cv2 0.658 0.647 10159 0
cv3 0.658 0.651 9687 3
cv4 0.658 0.655 9184 0
</pre>
The average percentage of correct answers in train and validation seems to match much better now!


These files can be downloaded from:
https://www.kaggle.com/its7171/riiid-cross-validation-files

This notebook is a sample that uses this dataset:
https://www.kaggle.com/its7171/iter-test-emulator