In [1]:
import pandas as pd

# Idea

The test data for each installation is a sequence of events (which may include earlier assessments) followed by the start of an assessment, for which we have to predict the number of attempts.

The train data is also a sequence of events (which may include assessments), but here we have the full event data for every assessment.

This means that each observation is the collection of events up to the start of an assessment, and its label is calculated from what happens during that assessment. So each installation_id in train gives us several data points (one for each assessment).

We should drop data from installation_ids that have no assessments, since they cannot produce any labelled observations.
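A minimal sketch of that filter on a toy frame. The `events` frame and its values are made up for illustration; only the column names `installation_id` and `type` come from the real data. It assumes "has assessments" means at least one row of type Assessment, which may be looser than "has at least one attempt".

```python
import pandas as pd

# Toy stand-in for the train events (values are invented).
events = pd.DataFrame({
    "installation_id": ["a", "a", "b", "c", "c"],
    "type": ["Clip", "Assessment", "Game", "Assessment", "Activity"],
})

# Installations with at least one Assessment event.
has_assessment = events.loc[events["type"] == "Assessment", "installation_id"].unique()

# Keep every event (of any type) that belongs to one of those installations.
kept = events[events["installation_id"].isin(has_assessment)]
```

Here installation `b` is dropped entirely, while the non-assessment events of `a` and `c` are kept because they may still be useful as history.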

In [2]:
train = pd.read_pickle('../data/processed/train.pkl')
train.head(3)

Unnamed: 0,event_id,game_session,timestamp,event_data,installation_id,event_count,event_code,game_time,title,type,world
4152505,27253bdc,5f9ff9bf9350a7ef,2019-07-23T02:12:17.279Z,"{""event_code"": 2000, ""event_count"": 1}",5b826029,1,2000,0,Welcome to Lost Lagoon!,Clip,NONE
4152506,27253bdc,9fc66af070776a8d,2019-07-23T02:12:53.427Z,"{""event_code"": 2000, ""event_count"": 1}",5b826029,1,2000,0,Tree Top City - Level 1,Clip,TREETOPCITY
4152507,27253bdc,0f1889a6816bd427,2019-07-23T02:14:27.038Z,"{""event_code"": 2000, ""event_count"": 1}",5b826029,1,2000,0,Ordering Spheres,Clip,TREETOPCITY


The pre-calculated labels live in train_labels.csv. We want to be able to recreate this information ourselves from the raw events, because we will eventually have to calculate it for the test data.

In [3]:
labels = pd.read_csv('../data/raw/train_labels.csv')
labels.query("installation_id == '0006a69f'")

Unnamed: 0,game_session,installation_id,title,num_correct,num_incorrect,accuracy,accuracy_group
0,6bdf9623adc94d89,0006a69f,Mushroom Sorter (Assessment),1,0,1.0,3
1,77b8ee947eb84b4e,0006a69f,Bird Measurer (Assessment),0,11,0.0,0
2,901acc108f55a5a1,0006a69f,Mushroom Sorter (Assessment),1,0,1.0,3
3,9501794defd84e4d,0006a69f,Mushroom Sorter (Assessment),1,1,0.5,2
4,a9ef3ecb3d1acc6a,0006a69f,Bird Measurer (Assessment),1,0,1.0,3


In [4]:
labels.query("game_session == 'a9ef3ecb3d1acc6a'")

Unnamed: 0,game_session,installation_id,title,num_correct,num_incorrect,accuracy,accuracy_group
4,a9ef3ecb3d1acc6a,0006a69f,Bird Measurer (Assessment),1,0,1.0,3


For all assessments except Bird Measurer, attempts are recorded with event code 4100. Bird Measurer uses 4110. The corresponding event_data contains one of

- "correct":true
- "correct":false

to indicate whether the attempt was successful.

However, Bird Measurer also emits code 4100 events whose event_data contains "correct":true, so those rows must be excluded. We must also make sure we only apply this to rows of type Assessment.

Below is an example of all attempts by one particular installation. Pre-calculated labels for it are shown above.

In [5]:
train.query("installation_id == '0006a69f' and type == 'Assessment' and event_code in [4100, 4110]")

Unnamed: 0,event_id,game_session,timestamp,event_data,installation_id,event_count,event_code,game_time,title,type,world
2228,25fa8af4,901acc108f55a5a1,2019-08-06T05:22:32.357Z,"{""correct"":true,""stumps"":[1,2,4],""event_count""...",0006a69f,44,4100,31011,Mushroom Sorter (Assessment),Assessment,TREETOPCITY
2709,17113b36,77b8ee947eb84b4e,2019-08-06T05:35:54.898Z,"{""correct"":false,""caterpillars"":[11,8,3],""even...",0006a69f,29,4110,35771,Bird Measurer (Assessment),Assessment,TREETOPCITY
2715,17113b36,77b8ee947eb84b4e,2019-08-06T05:36:01.927Z,"{""correct"":false,""caterpillars"":[11,8,11],""eve...",0006a69f,35,4110,42805,Bird Measurer (Assessment),Assessment,TREETOPCITY
2720,17113b36,77b8ee947eb84b4e,2019-08-06T05:36:06.512Z,"{""correct"":false,""caterpillars"":[11,8,5],""even...",0006a69f,40,4110,47388,Bird Measurer (Assessment),Assessment,TREETOPCITY
2725,17113b36,77b8ee947eb84b4e,2019-08-06T05:36:09.739Z,"{""correct"":false,""caterpillars"":[11,8,7],""even...",0006a69f,45,4110,50605,Bird Measurer (Assessment),Assessment,TREETOPCITY
2730,17113b36,77b8ee947eb84b4e,2019-08-06T05:36:13.951Z,"{""correct"":false,""caterpillars"":[11,8,4],""even...",0006a69f,50,4110,54822,Bird Measurer (Assessment),Assessment,TREETOPCITY
2733,17113b36,77b8ee947eb84b4e,2019-08-06T05:36:17.407Z,"{""correct"":false,""caterpillars"":[11,8,4],""even...",0006a69f,53,4110,58280,Bird Measurer (Assessment),Assessment,TREETOPCITY
2738,17113b36,77b8ee947eb84b4e,2019-08-06T05:36:21.390Z,"{""correct"":false,""caterpillars"":[11,8,2],""even...",0006a69f,58,4110,62256,Bird Measurer (Assessment),Assessment,TREETOPCITY
2743,17113b36,77b8ee947eb84b4e,2019-08-06T05:36:26.296Z,"{""correct"":false,""caterpillars"":[11,8,1],""even...",0006a69f,63,4110,67164,Bird Measurer (Assessment),Assessment,TREETOPCITY
2750,17113b36,77b8ee947eb84b4e,2019-08-06T05:36:32.187Z,"{""correct"":false,""caterpillars"":[11,8,1],""even...",0006a69f,70,4110,73056,Bird Measurer (Assessment),Assessment,TREETOPCITY


Add a binary column specifying whether each event is an attempt.

In [22]:
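train['attempt'] = (
    (train.type == "Assessment") &
    (
        # every assessment except Bird Measurer records attempts as 4100
        ((train.event_code == 4100) & (train.title != 'Bird Measurer (Assessment)')) |
        # Bird Measurer records attempts as 4110
        ((train.event_code == 4110) & (train.title == 'Bird Measurer (Assessment)'))
    )
)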
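To sanity-check the rule, here is a small sketch that wraps it in a helper and runs it on a toy frame. The helper name `attempt_mask` and the toy rows are hypothetical; the event codes and titles are the ones described above.

```python
import numpy as np
import pandas as pd

BIRD = "Bird Measurer (Assessment)"

def attempt_mask(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: True where a row is an assessment attempt."""
    is_bird = df["title"] == BIRD
    # Bird Measurer attempts use 4110; every other assessment uses 4100.
    attempt_code = np.where(is_bird, 4110, 4100)
    return (df["type"] == "Assessment") & (df["event_code"] == attempt_code)

toy = pd.DataFrame({
    "type": ["Assessment", "Assessment", "Assessment", "Game", "Assessment"],
    "title": ["Mushroom Sorter (Assessment)", BIRD, BIRD,
              "Toy Game", "Mushroom Sorter (Assessment)"],
    "event_code": [4100, 4110, 4100, 4100, 4110],
})
mask = attempt_mask(toy)
```

Only the first two rows count as attempts: the Bird Measurer 4100 row, the non-assessment row, and the Mushroom Sorter 4110 row are all excluded.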

Add a column specifying whether each attempt was correct: True means success, False means failure, and NaN on all non-attempt rows.

In [21]:
# Literal substring match (regex=False); non-attempt rows are left as NaN
train.loc[train.attempt, 'correct'] = (
    train.loc[train.attempt].event_data.str.contains('"correct":true', regex=False)
)
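The substring test depends on the exact spacing of the JSON (the attempt rows shown above use `"correct":true`, while other rows in train use a space after the colon, e.g. `"event_code": 2000`). As a possible alternative, here is a sketch that parses event_data with `json.loads` instead. The toy frame is invented, and this would likely be slower on the full data.

```python
import json
import pandas as pd

# Toy rows: two attempts with different JSON spacing, one non-attempt.
toy = pd.DataFrame({
    "attempt": [True, True, False],
    "event_data": [
        '{"correct":true,"event_count":44}',
        '{"correct": false, "event_count": 29}',
        '{"event_code": 2000, "event_count": 1}',
    ],
})

# Parse only the attempt rows and read the "correct" field.
parsed = toy.loc[toy["attempt"], "event_data"].map(
    lambda s: json.loads(s).get("correct")
)

# Index alignment leaves the non-attempt row as NaN.
toy["correct"] = parsed
```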

For now we only care about assessment attempts, and we no longer need several of these columns. Let's have a look at what's left.

In [32]:
subset = (
    train
    .loc[train.attempt]
    .drop(columns = ['type', 'event_code', 'game_time', 'event_count', 'event_id', 'world'])
)
subset

Unnamed: 0,game_session,timestamp,event_data,installation_id,title,attempt,correct
3897982,3c96c025b64f4bd9,2019-07-23T17:09:00.809Z,"{""correct"":true,""left"":[{""id"":""gem05"",""weight""...",554fc65a,Cart Balancer (Assessment),True,True
3898026,70e401e4086e1864,2019-07-23T17:11:39.686Z,"{""correct"":false,""pillars"":[],""event_count"":39...",554fc65a,Chest Sorter (Assessment),True,False
5366529,d7cd15f8c2a14dbf,2019-07-23T20:08:27.864Z,"{""correct"":false,""left"":[],""right"":[],""event_c...",76ed11ab,Cart Balancer (Assessment),True,False
5366533,d7cd15f8c2a14dbf,2019-07-23T20:08:32.180Z,"{""correct"":false,""left"":[],""right"":[],""event_c...",76ed11ab,Cart Balancer (Assessment),True,False
5366537,d7cd15f8c2a14dbf,2019-07-23T20:08:37.875Z,"{""correct"":false,""left"":[],""right"":[],""event_c...",76ed11ab,Cart Balancer (Assessment),True,False
...,...,...,...,...,...,...,...
9166495,83aa3029cc448cca,2019-10-14T21:09:42.272Z,"{""correct"":true,""pillars"":[1,2,3],""event_count...",ce4eb68c,Chest Sorter (Assessment),True,True
1069004,5906ac17ead8dc42,2019-10-14T21:37:17.733Z,"{""correct"":false,""caterpillars"":[4,10,3],""even...",191b94d8,Bird Measurer (Assessment),True,False
6199506,2ade4b725b8c5de8,2019-10-14T21:39:10.710Z,"{""correct"":false,""left"":[{""id"":""gem07"",""weight...",89754278,Cart Balancer (Assessment),True,False
6199517,2ade4b725b8c5de8,2019-10-14T21:39:28.723Z,"{""correct"":true,""left"":[{""id"":""gem07"",""weight""...",89754278,Cart Balancer (Assessment),True,True


In [51]:
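# Number of correct attempts per assessment session
# (summing the booleans counts the True values)
correct = (
    subset
    # .query("installation_id == '0006a69f'")
    .groupby('game_session', observed=True)['correct']
    .sum()
    .astype(int)
)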
correct

game_session
3c96c025b64f4bd9    1
70e401e4086e1864    0
d7cd15f8c2a14dbf    0
47b0add25b24c37d    1
ba4d8db4f7773219    1
                   ..
cc380bc05b6489f4    1
83aa3029cc448cca    1
5906ac17ead8dc42    0
2ade4b725b8c5de8    1
ca84b8d7c6758795    1
Name: correct, Length: 17690, dtype: int64

In [45]:
# Total number of attempts per assessment session
counts = (
    subset
    # .query("installation_id == '0006a69f'")
    .groupby('game_session', observed=True)['correct']
    .count()
    .rename("n_attempts")
)
counts

game_session
3c96c025b64f4bd9    1
70e401e4086e1864    1
d7cd15f8c2a14dbf    3
47b0add25b24c37d    1
ba4d8db4f7773219    1
                   ..
cc380bc05b6489f4    1
83aa3029cc448cca    8
5906ac17ead8dc42    1
2ade4b725b8c5de8    2
ca84b8d7c6758795    1
Name: n_attempts, Length: 17690, dtype: int64
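The two groupby cells above could probably be collapsed into a single named aggregation. A sketch on invented data (the session ids `s1` to `s3` are made up; `num_correct` and `num_incorrect` mirror the column names in train_labels.csv):

```python
import pandas as pd

# Toy attempts: one row per attempt, as in `subset`.
toy = pd.DataFrame({
    "game_session": ["s1", "s2", "s2", "s2", "s3", "s3"],
    "correct": [True, False, False, False, False, True],
})

per_session = (
    toy
    .groupby("game_session")["correct"]
    # named aggregation: output column name = aggregation function
    .agg(num_correct="sum", n_attempts="count")
    .astype(int)
)
per_session["num_incorrect"] = per_session["n_attempts"] - per_session["num_correct"]
```

This gives one row per session with all three counts, so no join between `counts` and `correct` is needed afterwards.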

## Create labels data

In [55]:
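# One row per assessment session, joined with its attempt and correct counts
(
    subset[['installation_id', 'game_session', 'timestamp', 'title']]
    .drop_duplicates(subset=['game_session'])
    .set_index('game_session')
    .join(counts)
    .join(correct)
    # .query("installation_id == '0006a69f'")  # uncomment to inspect one installation
)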

game_session,installation_id,timestamp,title,n_attempts,correct
3c96c025b64f4bd9,554fc65a,2019-07-23T17:09:00.809Z,Cart Balancer (Assessment),1,1
70e401e4086e1864,554fc65a,2019-07-23T17:11:39.686Z,Chest Sorter (Assessment),1,0
d7cd15f8c2a14dbf,76ed11ab,2019-07-23T20:08:27.864Z,Cart Balancer (Assessment),3,0
47b0add25b24c37d,8123bc13,2019-07-23T20:19:20.553Z,Cauldron Filler (Assessment),1,1
ba4d8db4f7773219,8123bc13,2019-07-23T20:20:28.915Z,Cart Balancer (Assessment),1,1
...,...,...,...,...,...
cc380bc05b6489f4,ce4eb68c,2019-10-14T21:07:28.359Z,Cart Balancer (Assessment),1,1
83aa3029cc448cca,ce4eb68c,2019-10-14T21:08:37.250Z,Chest Sorter (Assessment),8,1
5906ac17ead8dc42,191b94d8,2019-10-14T21:37:17.733Z,Bird Measurer (Assessment),1,0
2ade4b725b8c5de8,89754278,2019-10-14T21:39:10.710Z,Cart Balancer (Assessment),2,1
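To get from these counts to the remaining label columns, something like the sketch below might work. The counts are copied from the train_labels.csv rows for installation 0006a69f shown earlier. The grouping rule is an assumption: groups 3, 2 and 0 are consistent with those rows, but group 1 (solved after three or more attempts) does not appear in them and is taken from my understanding of the competition rules. It also assumes a session stops at the first correct attempt.

```python
import numpy as np
import pandas as pd

# num_correct / num_incorrect copied from the label rows shown earlier.
labels = pd.DataFrame(
    {"num_correct": [1, 0, 1, 1, 1], "num_incorrect": [0, 11, 0, 1, 0]},
    index=["6bdf9623adc94d89", "77b8ee947eb84b4e", "901acc108f55a5a1",
           "9501794defd84e4d", "a9ef3ecb3d1acc6a"],
)

n_attempts = labels["num_correct"] + labels["num_incorrect"]
labels["accuracy"] = labels["num_correct"] / n_attempts

# Assumed rule: 0 = never solved, 3 = first attempt, 2 = second attempt,
# 1 = three or more attempts. Conditions are checked in order.
labels["accuracy_group"] = np.select(
    [labels["num_correct"] == 0, n_attempts == 1, n_attempts == 2],
    [0, 3, 2],
    default=1,
)
```

On these five sessions the computed accuracy and accuracy_group agree with the pre-calculated values in train_labels.csv.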
