# DS Bowl 2019: Explore How Young Children Learn

An exploration of the [Kaggle DS Bowl 2019](https://www.kaggle.com/c/data-science-bowl-2019) competition dataset.

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

## Load Data

Data specification:

__train.csv:__
_Gameplay events._

* `event_id` - Randomly generated unique identifier for the event type. Maps to `event_id` column in specs table.
* `game_session` - Randomly generated unique identifier grouping events within a single game or video play session.
* `timestamp` - Client-generated datetime.
* `event_data` - Semi-structured JSON formatted string containing the event's parameters. Default fields are: `event_count`, `event_code`, and `game_time`; otherwise fields are determined by the event type.
* `installation_id` - Randomly generated unique identifier grouping game sessions within a single installed application instance.
* `event_count` - Incremental counter of events within a game session (starting at 1). Extracted from `event_data`.
* `event_code` - Identifier of the event `class`. Unique per game, but may be duplicated across games. E.g. event code `2000` always identifies the `Start Game` event for all games. Extracted from `event_data`.
* `game_time` - Time in milliseconds since the start of the game session. Extracted from `event_data`.
* `title` - Title of the game or video.
* `type` - Media type of the game or video. Possible values are: `Game`, `Assessment`, `Activity`, `Clip`.
* `world` - The section of the application the game or video belongs to. Helpful to identify the educational curriculum goals of the media. Possible values are: `NONE` (at the app's start screen), `TREETOPCITY` (Length/Height), `MAGMAPEAK` (Capacity/Displacement), `CRYSTALCAVES` (Weight).
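Because `event_data` is stored as a JSON string, its fields have to be decoded before they can be used as columns. The following is a minimal sketch on two made-up rows shaped like `train.csv`; the `sample` and `event_fields` names are illustrative and not part of the notebook.

```python
import json

import pandas as pd

# Toy rows shaped like train.csv; the values are invented for illustration.
sample = pd.DataFrame({
    "event_id": ["27253bdc", "77261ab5"],
    "event_data": [
        '{"event_code": 2000, "event_count": 1}',
        '{"version": "1.0", "event_count": 1, "game_time": 0, "event_code": 2000}',
    ],
})

# Decode each JSON string into a dict, then flatten the dicts into columns.
parsed = sample["event_data"].apply(json.loads)
event_fields = pd.json_normalize(parsed.tolist())

# Fields that are absent from a given event show up as NaN.
print(event_fields)
```

On the full log this decoding can be slow, so it may be worth applying it only to the rows of interest (for example, a single `event_code`).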

__specs.csv:__
_Specification of the various event types._

* `event_id` - Global unique identifier for the event type. Joins to `event_id` column in events table.
* `info` - Description of the event.
* `args` - JSON formatted string of event arguments. Each argument contains:
    * `name` - Argument name.
    * `type` - Type of the argument (string, int, number, object, array).
    * `info` - Description of the argument.
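The `event_id` join and the `args` structure described above can be sketched as follows. The frames are toy stand-ins with invented ids and descriptions, and `arg_names` is a hypothetical helper column introduced only for this example.

```python
import json

import pandas as pd

# Toy stand-ins for specs.csv and train.csv; all values are invented.
specs_sample = pd.DataFrame({
    "event_id": ["aaaa0001"],
    "info": ["Hypothetical event description."],
    "args": ['[{"name": "game_time", "type": "int", "info": "millis since session start"}]'],
})
logs_sample = pd.DataFrame({
    "event_id": ["aaaa0001", "aaaa0001", "bbbb0002"],
    "event_code": [2000, 2000, 3010],
})

# Each `args` value decodes to a list of {name, type, info} dicts.
specs_sample["arg_names"] = specs_sample["args"].apply(
    lambda s: [arg["name"] for arg in json.loads(s)]
)

# A left join keeps every event, even when no matching spec exists.
merged = logs_sample.merge(
    specs_sample[["event_id", "info", "arg_names"]], on="event_id", how="left"
)
print(merged)
```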

In [4]:
logs = pd.read_csv('./data/train.csv')
specs = pd.read_csv('./data/specs.csv')

In [5]:
logs.head()

|   | event_id | game_session | timestamp | event_data | installation_id | event_count | event_code | game_time | title | type | world |
| --- | --- | --- | --- | --- | --- | ---: | ---: | ---: | --- | --- | --- |
| 0 | 27253bdc | 45bb1e1b6b50c07b | 2019-09-06T17:53:46.937Z | `{"event_code": 2000, "event_count": 1}` | 0001e90f | 1 | 2000 | 0 | Welcome to Lost Lagoon! | Clip | NONE |
| 1 | 27253bdc | 17eeb7f223665f53 | 2019-09-06T17:54:17.519Z | `{"event_code": 2000, "event_count": 1}` | 0001e90f | 1 | 2000 | 0 | Magma Peak - Level 1 | Clip | MAGMAPEAK |
| 2 | 77261ab5 | 0848ef14a8dc6892 | 2019-09-06T17:54:56.302Z | `{"version":"1.0","event_count":1,"game_time":0...` | 0001e90f | 1 | 2000 | 0 | Sandcastle Builder (Activity) | Activity | MAGMAPEAK |
| 3 | b2dba42b | 0848ef14a8dc6892 | 2019-09-06T17:54:56.387Z | `{"description":"Let's build a sandcastle! Firs...` | 0001e90f | 2 | 3010 | 53 | Sandcastle Builder (Activity) | Activity | MAGMAPEAK |
| 4 | 1bb5fbdb | 0848ef14a8dc6892 | 2019-09-06T17:55:03.253Z | `{"description":"Let's build a sandcastle! Firs...` | 0001e90f | 3 | 3110 | 6972 | Sandcastle Builder (Activity) | Activity | MAGMAPEAK |


In [6]:
specs.head()

|   | event_id | info | args |
| --- | --- | --- | --- |
| 0 | 2b9272f4 | The end of system-initiated feedback (Correct)... | `[{"name":"game_time","type":"int","info":"mill...` |
| 1 | df4fe8b6 | The end of system-initiated feedback (Incorrec... | `[{"name":"game_time","type":"int","info":"mill...` |
| 2 | 3babcb9b | The end of system-initiated instruction event ... | `[{"name":"game_time","type":"int","info":"mill...` |
| 3 | 7f0836bf | The end of system-initiated instruction event ... | `[{"name":"game_time","type":"int","info":"mill...` |
| 4 | ab3136ba | The end of system-initiated instruction event ... | `[{"name":"game_time","type":"int","info":"mill...` |


## Questions

`1` a. Find the top 5 most active users (`installation_id`) in August 2019 (`timestamp`) by number of events.

`1` b. Find the top 5 most active users (`installation_id`) in August 2019 (`timestamp`) by number of sessions.
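One possible approach to both parts, sketched on a toy log rather than the real `logs` frame: parse `timestamp`, filter to August 2019, then count rows (events) or distinct `game_session` values (sessions) per `installation_id`. The ids and timestamps below are invented, and treating the month boundary in UTC is an assumption.

```python
import pandas as pd

# Toy log; ids and timestamps are invented for illustration.
toy = pd.DataFrame({
    "installation_id": ["a", "a", "a", "b", "b", "c"],
    "game_session": ["s1", "s1", "s2", "s3", "s4", "s5"],
    "timestamp": [
        "2019-08-01T10:00:00.000Z", "2019-08-01T10:00:05.000Z",
        "2019-08-02T09:00:00.000Z", "2019-08-15T12:00:00.000Z",
        "2019-08-16T12:00:00.000Z", "2019-09-01T00:00:00.000Z",
    ],
})
toy["timestamp"] = pd.to_datetime(toy["timestamp"], utc=True)

# Keep only events stamped in August 2019 (UTC).
august = toy[(toy["timestamp"].dt.year == 2019) & (toy["timestamp"].dt.month == 8)]

# 1a: each row is one event, so counting rows counts events.
top_by_events = august["installation_id"].value_counts().head(5)

# 1b: count distinct sessions per installation.
top_by_sessions = (
    august.groupby("installation_id")["game_session"]
    .nunique()
    .sort_values(ascending=False)
    .head(5)
)

print(top_by_events)
print(top_by_sessions)
```

The two rankings can differ: an installation with a few very long sessions may lead by events yet trail by sessions.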

`2` Which assessment is the most complicated?

Assessment attempts are captured in `event_code` 4100 for all assessments except Bird Measurer, which uses 4110. To separate the two cases, filter on `title`: use `title != 'Bird Measurer (Assessment)'` with 4100 and `title == 'Bird Measurer (Assessment)'` with 4110. If the attempt was correct, `event_data` contains `"correct":true`.

In [50]:
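# Example of a successful attempt.
# Attempts use event_code 4110 for Bird Measurer and 4100 for every other assessment.
is_bird = logs.title == 'Bird Measurer (Assessment)'
logs[((logs.event_code == 4100) & ~is_bird) | ((logs.event_code == 4110) & is_bird)].event_data.values[0]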

'{"correct":true,"stumps":[1,2,4],"event_count":44,"game_time":31011,"event_code":4100}'
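A sketch of one way to approach question 2, assuming "most complicated" means the lowest share of correct attempts (that reading is an assumption, not something the notebook defines). The rows are invented, and `Toy Sorter (Assessment)` is a hypothetical title used only for this example.

```python
import json

import pandas as pd

# Toy attempts; "Toy Sorter (Assessment)" is a made-up title.
toy = pd.DataFrame({
    "title": [
        "Toy Sorter (Assessment)", "Toy Sorter (Assessment)",
        "Bird Measurer (Assessment)", "Bird Measurer (Assessment)",
        "Bird Measurer (Assessment)",
    ],
    "event_code": [4100, 4100, 4110, 4110, 4100],
    "event_data": [
        '{"correct":true,"event_code":4100}',
        '{"correct":false,"event_code":4100}',
        '{"correct":false,"event_code":4110}',
        '{"correct":false,"event_code":4110}',
        '{"correct":true,"event_code":4100}',
    ],
})

is_bird = toy["title"] == "Bird Measurer (Assessment)"
# Attempts: 4110 for Bird Measurer, 4100 for every other assessment.
is_attempt = (is_bird & (toy["event_code"] == 4110)) | (
    ~is_bird & (toy["event_code"] == 4100)
)
attempts = toy[is_attempt].copy()

# Pull the boolean `correct` flag out of the JSON payload.
attempts["correct"] = attempts["event_data"].apply(lambda s: json.loads(s)["correct"])

# Share of correct attempts per assessment, lowest (hardest) first.
success_rate = attempts.groupby("title")["correct"].mean().sort_values()
print(success_rate)
```

Other definitions of difficulty, such as the number of attempts before the first correct one, would need a per-session aggregation instead.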

`3` Create a pivot table that counts events by media type (`type`) and month (hint: use the `pivot_table` function).

Example:

| Type / Month | 07  | 08  |
| ------------ | ---:| ---:|
| Activity     | 100 | 100 |
| Assessment   | 100 | 100 |
| Clip         | 100 | 100 |
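A table of this shape could be produced roughly as follows. This is a sketch on invented rows; the `month` helper column is an assumption introduced for the example, and counting the `timestamp` column is just one way to count rows per cell.

```python
import pandas as pd

# Toy log; timestamps are invented for illustration.
toy = pd.DataFrame({
    "type": ["Clip", "Clip", "Activity", "Assessment", "Activity", "Game"],
    "timestamp": [
        "2019-07-30T08:00:00.000Z", "2019-08-01T08:00:00.000Z",
        "2019-08-02T08:00:00.000Z", "2019-08-03T08:00:00.000Z",
        "2019-07-15T08:00:00.000Z", "2019-08-20T08:00:00.000Z",
    ],
})

# Two-digit month label, e.g. "07" or "08".
toy["month"] = pd.to_datetime(toy["timestamp"], utc=True).dt.strftime("%m")

# Count events per (type, month) cell; empty cells become 0.
counts = toy.pivot_table(
    index="type", columns="month", values="timestamp",
    aggfunc="count", fill_value=0,
)
print(counts)
```

If the data spanned more than one year, the label would need to include the year (for example `"%Y-%m"`) to keep months from different years apart.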

`4` Bin the `game_time` column into the following bins:
* `early` - `game_time < 30000`
* `mid` - `30000 <= game_time < 70000`
* `late` - `game_time >= 70000`

What is the difference between the `cut` and `qcut` functions?
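The contrast can be illustrated on a small made-up series: `pd.cut` bins by edges you supply, while `pd.qcut` derives the edges from the sample quantiles so that bins hold roughly equal numbers of rows. The values and the `q1`..`q4` labels below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Invented game_time values in milliseconds.
game_time = pd.Series([0, 53, 6972, 29999, 30000, 69999, 70000, 250000])

# cut: fixed edges. right=False makes each bin [left, right),
# which matches "< 30000", ">= 30000 and < 70000", ">= 70000".
fixed_bins = pd.cut(
    game_time,
    bins=[0, 30000, 70000, np.inf],
    labels=["early", "mid", "late"],
    right=False,
)

# qcut: edges come from the data's quartiles, so each bin
# receives about the same number of observations.
quantile_bins = pd.qcut(game_time, q=4, labels=["q1", "q2", "q3", "q4"])

print(fixed_bins.value_counts(sort=False))
print(quantile_bins.value_counts(sort=False))
```

With fixed edges the bin sizes depend on how the data is distributed; with quantile edges the sizes are balanced but the thresholds change from one dataset to another.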