# 2019 Data Science Bowl

In this competition, we analyse usage events from the "PBS KIDS Measure Up!" app.<br>
The goal is to predict how many attempts a user needs to answer an "Assessment" (problem) correctly.<br>
I will proceed with exploratory data analysis to understand this competition, focusing especially on **how the labels are made**.


# Table of Contents:

**1. [PBS KIDS Measure Up!](#id1)** <br>
**2. [Data description](#id2)** <br>
**3. [Data Visualization](#id3)** <br>
**4. [References](#ref)** <br>


<a id="id1"></a>
## PBS KIDS Measure Up!

Quoted from the official website [PBS KIDS Measure Up!](https://pbskids.org/apps/pbs-kids-measure-up.html):

> children ages 3 to 5 learn early math concepts focused on length, width, capacity, and weight while going on an adventure through Treetop City, Magma Peak, and Crystal Caves.

### Specific features of Measure Up! include:

 - 19 unique measuring games.
 - 10 measurement-focused video clips.
 - Sticker books featuring favorite PBS KIDS characters.
 - Rewards for completion of tasks.
 - Embedded challenges and reports to help parents and caregivers monitor kids’ progress.
 - Ability to track your child's progress using the PBS KIDS Super Vision companion app. (Read more about Super Vision below.)



In [None]:
from IPython.display import HTML, IFrame

HTML('<iframe width="400" height="200" src="https://pbskids.org/apps/media/video/Seesaw_v6_subtitled_ccmix.mp4" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

In [None]:
import gc
import os
from pathlib import Path
import random
import sys

from tqdm.notebook import tqdm  # tqdm_notebook is deprecated
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display, HTML

# --- plotly ---
from plotly import tools, subplots
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff

# --- models ---
from sklearn import preprocessing
from sklearn.model_selection import KFold
import lightgbm as lgb
import xgboost as xgb
import catboost as cb

<a id="id2"></a>
# Data description

In [None]:
# Input data files are available in the "../input/" directory.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Note that the data is large and takes about one minute to load.<br>
To load it more efficiently, see [How to Work with BIG Datasets on Kaggle Kernels (16G RAM)](https://www.kaggle.com/yuliagm/how-to-work-with-big-datasets-on-16g-ram-dask) by @yuliagm.
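
One common way to cut memory use is to pass explicit dtypes and read only the columns you need. The sketch below runs on a tiny in-memory CSV with made-up rows. The dtype choices are my assumptions and not something prescribed by the competition, so check them against the real files before relying on them.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (hypothetical rows; the column
# names are ones used elsewhere in this notebook).
csv_text = """installation_id,game_session,event_code,title,type
0006a69f,901acc108f55a5a1,4100,Mushroom Sorter (Assessment),Assessment
0006a69f,901acc108f55a5a1,2000,Mushroom Sorter (Assessment),Assessment
0006a69f,77b8ee947eb84b4e,4110,Bird Measurer (Assessment),Assessment
"""

# Assumed dtypes: repeated strings as category, small integers as int16.
dtypes = {
    "installation_id": "category",
    "game_session": "category",
    "event_code": "int16",
    "title": "category",
    "type": "category",
}

# usecols limits parsing to the listed columns; dtype avoids object columns.
small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, usecols=list(dtypes))
print(small.dtypes)
print(small.memory_usage(deep=True).sum(), "bytes")
```

With the real files, the same `dtype` and `usecols` arguments can be passed to the `pd.read_csv(datadir/'train.csv')` call.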

In [None]:
%%time
datadir = Path('/kaggle/input/data-science-bowl-2019')

# Read in the data CSV files
train = pd.read_csv(datadir/'train.csv')
train_labels = pd.read_csv(datadir/'train_labels.csv')
test = pd.read_csv(datadir/'test.csv')
specs = pd.read_csv(datadir/'specs.csv')
ss = pd.read_csv(datadir/'sample_submission.csv')

In [None]:
train["timestamp"] = pd.to_datetime(train["timestamp"])
test["timestamp"] = pd.to_datetime(test["timestamp"])

The train data consists of about 11M rows, while the test data has about 1M rows.

The train labels have only about 17K rows, which means the task is not to predict a value for each row, as we will see later.

In [None]:
print(f'train.shape        : {train.shape}')
print(f'train_labels.shape : {train_labels.shape}')
print(f'test.shape         : {test.shape}')
print(f'specs.shape        : {specs.shape}')
print(f'ss.shape           : {ss.shape}')      

# train and test data

The train and test data consist of event information.

In [None]:
train.head()

The same `installation_id` continues across rows, which means the same user played "Welcome to Lost Lagoon!" -> "Magma Peak - Level 1" -> "Sandcastle Builder (Activity)".

These plays are separated by `game_session`. Even within a single game session many events happen, each described by `event_id`.
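
To make this hierarchy concrete (user, then session, then event), the sketch below counts events per session on a small toy frame. The rows are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy events; only the column names come from the dataset.
events = pd.DataFrame({
    "installation_id": ["u1", "u1", "u1", "u1", "u2"],
    "game_session": ["s1", "s1", "s2", "s2", "s3"],
    "event_id": ["e1", "e2", "e1", "e3", "e1"],
})

# One row per (user, session) with the number of events logged in it.
events_per_session = (
    events.groupby(["installation_id", "game_session"])
    .size()
    .reset_index(name="n_events")
)
print(events_per_session)
```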

# train_labels

`accuracy_group` is the label we want to predict in this competition. It is categorized as:

 - 3: the assessment was solved on the first attempt
 - 2: the assessment was solved on the second attempt
 - 1: the assessment was solved after 3 or more attempts
 - 0: the assessment was never solved

More detailed information is given by `num_correct`, `num_incorrect` and `accuracy`.

 - When `num_correct` is 0, `accuracy_group` is 0.
 - When `num_correct` is 1 and `num_incorrect` is 0, `accuracy_group` is 3.
 - When `num_correct` is 1 and `num_incorrect` is 1, `accuracy_group` is 2.
 - When `num_correct` is 1 and `num_incorrect` is more than 1, `accuracy_group` is 1.
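
The rules above can be written as a small function. This is a minimal sketch of the mapping as described here; the function name `accuracy_group` is my own choice for illustration.

```python
def accuracy_group(num_correct: int, num_incorrect: int) -> int:
    """Map the attempt counts of one assessment session to its accuracy_group."""
    if num_correct == 0:
        return 0  # never solved
    if num_incorrect == 0:
        return 3  # solved on the first attempt
    if num_incorrect == 1:
        return 2  # solved on the second attempt
    return 1      # solved after three or more attempts


# A few example (num_correct, num_incorrect) pairs.
for nc, ni in [(0, 11), (1, 0), (1, 1), (1, 5)]:
    print(nc, ni, accuracy_group(nc, ni))
```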

In [None]:
train_labels.head()

`accuracy_group` is determined for each assessment play.

You can see that several `accuracy_group` values exist for the same `installation_id`.

This is because when the same user (identified by `installation_id`) plays several game sessions (identified by `game_session`), `accuracy_group` is calculated for each play.

The same user can also play the same game several times. As you can see above, user `0006a69f` plays `Mushroom Sorter` 3 times.

In [None]:
train_labels['accuracy_group'].value_counts().sort_index().plot(kind="bar", title='accuracy group counts')

It seems many assessments are solved without any incorrect attempt, yay! :)

# sample_submission

`sample_submission` shows the submission format. It seems we need to predict `accuracy_group` for each `installation_id`.

In [None]:
ss.head()

I was confused when I first compared `train_labels` and `sample_submission`.<br>
`train_labels` contains many `accuracy_group` values per `installation_id`, but `sample_submission` assumes only one `accuracy_group` is assigned to each `installation_id`.

Why?

This can be understood by looking carefully at the test data. <br>
In conclusion, you need to predict the `accuracy_group` of the last assessment for each user (`installation_id`) in the test dataset.
I will explain the details in the following sections.

# Checking test data for each installation_id


The number of unique `installation_id` values in the `test` dataset is 1000. This is the same as the number of rows in `sample_submission`.

In [None]:
test.installation_id.nunique()

Let's focus on one user's full history in the test dataset.

In [None]:
#101999d8
# f47ef997

test_tmp = test[test['installation_id'] == '101999d8']

In [None]:
test_tmp

As you can see, the user starts from the "Welcome to Lost Lagoon!" event and **ends with "Chest Sorter (Assessment)"**.

It seems that the last row of each `installation_id` in the test data is always an `Assessment` event. We are going to predict the `accuracy_group` of this last Assessment.<br>
Of course, the user may have played other Assessments in the past, so Assessment events may appear several times, and you can tell whether this user answered those past Assessments correctly.
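
One way to check this observation is to take the final event of each `installation_id` and look at its `type`. The sketch below does this on a hypothetical toy frame, where the rows and the `type` values of the non-assessment events are made up; the same `groupby(...).tail(1)` pattern should carry over to the real `test` frame.

```python
import pandas as pd

# Hypothetical toy test set; rows and non-assessment types are made up.
toy_test = pd.DataFrame({
    "installation_id": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2019-09-01 10:00", "2019-09-01 10:05", "2019-09-02 09:00",
        "2019-09-03 12:00", "2019-09-03 12:30",
    ]),
    "title": ["Welcome to Lost Lagoon!", "Magma Peak - Level 1",
              "Chest Sorter (Assessment)", "Welcome to Lost Lagoon!",
              "Mushroom Sorter (Assessment)"],
    "type": ["Clip", "Clip", "Assessment", "Clip", "Assessment"],
})

# Sort by time, then keep the final row of each installation_id.
last_events = (
    toy_test.sort_values(["installation_id", "timestamp"])
    .groupby("installation_id")
    .tail(1)
)
print(last_events[["installation_id", "title", "type"]])

# True if every user's final event belongs to an Assessment.
print((last_events["type"] == "Assessment").all())
```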

I chose an `installation_id` with a very short history in the above example.<br>
But most of them contain thousands of events. Some users have more than 20,000 events.

In [None]:
ax = sns.distplot(test['installation_id'].value_counts().values)
ax.set_title('Number of events for each installation_id')

# Calculating labels from train/test data

Now we understand the meaning of the input `train` and `test` data, the true labels `train_labels`, and the submission format `sample_submission`.<br>
But how are `train_labels` calculated from the `train` data? I will try writing code to reconstruct these labels.

Again, let's focus on a specific `installation_id`, `0006a69f`, for the analysis.<br>
This user took Assessments 5 times.

In [None]:
train_labels[train_labels.installation_id == '0006a69f']

Let's look at this user's events.

In [None]:
tmp_train = train[train.installation_id == '0006a69f']
tmp_train

The events span 3801 rows, so we cannot inspect them all in detail...

As written in the [data description](https://www.kaggle.com/c/data-science-bowl-2019/data):

> Assessment attempts are captured in event_code 4100 for all assessments except for Bird Measurer, which uses event_code 4110. If the attempt was correct, it contains "correct":true.

So let's extract the rows where `event_code` is 4100 or 4110. We are also only interested in the "Assessment" type.

In [None]:
tmp_train[tmp_train['event_code'].isin([4100, 4110]) & (tmp_train['type'] == 'Assessment')]

Now the size is reduced and we can see all the rows.

As we can see, for each `game_session`:

 - `901acc108f55a5a1`: Tried "Mushroom Sorter (Assessment)", correct on the first try
 - `77b8ee947eb84b4e`: Tried "Bird Measurer (Assessment)", incorrect 11 times
 - `6bdf9623adc94d89`: Tried "Mushroom Sorter (Assessment)", correct on the first try
 - `9501794defd84e4d`: Tried "Mushroom Sorter (Assessment)", incorrect 1 time, then correct
 - `a9ef3ecb3d1acc6a`: Tried "Bird Measurer (Assessment)", incorrect 1 time, then correct
 
This is consistent with the `train_labels` we saw above!
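
The whole procedure can be sketched in code: pick the attempt events, count correct and incorrect attempts per `game_session`, then map the counts to `accuracy_group`. The toy rows below are hypothetical, and the `event_data` column name and its exact JSON text are my assumptions based on the data description. Following that description, this sketch counts only 4110 for Bird Measurer and only 4100 for the other assessments, which is slightly stricter than the filter used above.

```python
import pandas as pd

# Hypothetical toy events; event_data is assumed to be a raw JSON string.
toy = pd.DataFrame({
    "game_session": ["s1", "s2", "s2", "s3", "s3", "s3"],
    "title": ["Mushroom Sorter (Assessment)"] * 3
             + ["Bird Measurer (Assessment)"] * 3,
    "type": ["Assessment"] * 6,
    "event_code": [4100, 4100, 4100, 4110, 4110, 4100],
    "event_data": ['{"correct":true}', '{"correct":false}', '{"correct":true}',
                   '{"correct":false}', '{"correct":false}', '{"correct":true}'],
})

# Bird Measurer logs attempts under 4110, every other assessment under 4100.
is_bird = toy["title"] == "Bird Measurer (Assessment)"
is_attempt = (toy["type"] == "Assessment") & (
    (is_bird & (toy["event_code"] == 4110))
    | (~is_bird & (toy["event_code"] == 4100))
)
attempts = toy[is_attempt].copy()
attempts["correct"] = (
    attempts["event_data"].str.contains('"correct":true', regex=False).astype(int)
)

# Count correct and incorrect attempts per session.
grouped = attempts.groupby("game_session")["correct"]
labels = pd.DataFrame({
    "num_correct": grouped.sum(),
    "num_incorrect": grouped.count() - grouped.sum(),
}).reset_index()


def to_group(row):
    """Apply the accuracy_group rules to one session's counts."""
    if row["num_correct"] == 0:
        return 0
    return {0: 3, 1: 2}.get(int(row["num_incorrect"]), 1)


labels["accuracy_group"] = labels.apply(to_group, axis=1)
print(labels)
```

In the toy data, the last `s3` row is a Bird Measurer event with code 4100, so it is not counted as an attempt.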

<a id="id3"></a>
# Data visualization

Now let's proceed with some data visualization to understand data more deeply.

### Difficulty by each Assessment

It seems "Mushroom Sorter" and "Cart Balancer" are easy (many 3s), and "Chest Sorter" is difficult (many 0s).

In [None]:
g = sns.FacetGrid(train_labels, col="title")
g = g.map(plt.hist, "accuracy_group")

# sns.distplot(train_labels, x='accuracy_group', hue='title')

All Assessments are played a comparable number of times.

In [None]:
train_labels['title'].value_counts().plot(kind="bar")

How many users are in `train_labels`, and how many Assessments did each user take?

Many users took ~10 Assessments, but some users took 160 Assessments!

In [None]:
print('{} users solved {} Assessments in train_labels'
      .format(train_labels['installation_id'].nunique(), len(train_labels)))

In [None]:
sns.distplot(train_labels['installation_id'].value_counts().values)

## Users usage in time

The code is inspired by this [great kernel](https://www.kaggle.com/robikscube/2019-data-science-bowl-an-introduction), 
but I converted the `timestamp` column to datetime to show it on a proper time scale. I also used `plotly` for interactive visualization.

This user played on Aug 6 at 6am and 5pm, on Aug 9 at 6pm, and on Aug 29 at 4pm.

In [None]:
target_id = '0006a69f'

px.scatter(train[train['installation_id'] == target_id], x='timestamp', y='event_code')

Check all the event titles and types.

In [None]:
train.groupby(['title', 'type']).size().reset_index().rename(columns={0: 'count'}).sort_values('type')

# Specs

From the [data description](https://www.kaggle.com/c/data-science-bowl-2019/data):

> This file gives the specification of the various event types.
>
>  - event_id - Global unique identifier for the event type. Joins to event_id column in events table.
>  - info - Description of the event.
>  - args - JSON formatted string of event arguments. Each argument contains:
>     - name - Argument name.
>     - type - Type of the argument (string, int, number, object, array).
>     - info - Description of the argument.

In [None]:
specs.head()

Each `args` entry stores a lot of information. How to use this information is a feature engineering task left for you!

In [None]:
specs.loc[0, 'args']
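
Since `args` is described as a JSON formatted string, it can be parsed with the standard `json` module. The sample string below is hypothetical: it is only shaped like the `name` / `type` / `info` fields described above, and the actual contents of `specs.csv` may differ.

```python
import json

# Hypothetical args string shaped like the documented fields
# (name, type, info); real specs.csv entries may differ.
sample_args = (
    '[{"name": "game_time", "type": "int", "info": "time since game start"},'
    ' {"name": "description", "type": "string", "info": "text shown to the player"}]'
)

# json.loads turns the string into a list of dicts, one per argument.
parsed = json.loads(sample_args)
arg_names = [arg["name"] for arg in parsed]
print(arg_names)
```

On the real data, the same call could be applied to every row, for example with `specs['args'].apply(json.loads)`.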

Work in progress: questions still to check...

 - Understanding the difficulty of each Assessment.
   - Does each Assessment have some "stage" or "level"?


<a id="ref"></a>
# References

 - [A baseline for DSB 2019](https://www.kaggle.com/mhviraf/a-baseline-for-dsb-2019): This kernel helped me understand the relationship between the event data and the labels.
 
 - [🚸 2019 Data Science Bowl - An Introduction](https://www.kaggle.com/robikscube/2019-data-science-bowl-an-introduction)