# Data Tree Structure Visualization

When I first looked at the data, I had a hard time understanding its structure, especially how `accuracy_group` relates to the progression of users through the game. In this notebook I briefly explore the structure of the train data by visualizing it for one `installation_id` as a **tree structure**.

So first let's read in the data:

In [None]:
import numpy as np
import pandas as pd
import pydot
from IPython.display import Image, display

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

train_df = pd.read_csv("../input/data-science-bowl-2019/train.csv")
test_df = pd.read_csv("../input/data-science-bowl-2019/test.csv")
train_labels_df = pd.read_csv("../input/data-science-bowl-2019/train_labels.csv")
specs_df = pd.read_csv("../input/data-science-bowl-2019/specs.csv")

Before we start visualizing there are a couple of things we should keep in mind when looking at the data:
* `installation_id` designates a specific installation of the app. It can be treated as the user's ID, although we cannot know whether one installation is shared by multiple users.
* `game_session` denotes a specific part of the game, and every `installation_id` has multiple unique game sessions.
* Events within each `game_session` are ordered by `event_count` (they are typical user events such as click, start, etc.). Event data is identified through event-title combinations coded in `event_id`.
* The first `event_code` in a `game_session` is always **2000**; the last event varies.
* Assessments have the `event_code` **4110** or **4100**, and these map to `installation_id` and `game_session` in the **train_labels** data.
* The last event having an assessment (ordered by `timestamp` or `event_count`) is always the one for which a label exists. In the test data this is an event with `event_code`=2000 (the label we have to predict).
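
The claim about the first `event_code` can be checked directly. The snippet below is a minimal sketch on a made-up toy frame (`toy_events` and its values are assumptions for illustration, not rows from the competition data); the same pattern should carry over to `train_df`:

```python
import pandas as pd

# Toy event log with made-up values, deliberately shuffled.
toy_events = pd.DataFrame({
    "game_session": ["s1", "s2", "s1", "s2", "s1"],
    "event_count": [2, 1, 1, 2, 3],
    "event_code": [3010, 2000, 2000, 4100, 4070],
})

# Order the events inside each session, then take the first event_code per session.
first_codes = (
    toy_events.sort_values(["game_session", "event_count"])
    .groupby("game_session")["event_code"]
    .first()
)

# True only if every session starts with event_code 2000.
all_start_with_2000 = bool((first_codes == 2000).all())
print(first_codes.to_dict(), all_start_with_2000)
```

Replacing `toy_events` with `train_df` would run the same check on the real data, at the cost of sorting the full frame.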

These were derived from the following excellent kernels/discussions: [Measure Up! Media types introduction](https://www.kaggle.com/c/data-science-bowl-2019/discussion/115034#latest-671298),  [Big picture: Understanding the given data](https://www.kaggle.com/c/data-science-bowl-2019/discussion/117019), [Data Science Bowl 2019 EDA and Baseline](https://www.kaggle.com/erikbruin/data-science-bowl-2019-eda-and-baseline), [Some insights and observations](https://www.kaggle.com/c/data-science-bowl-2019/discussion/116137), [Some more insights and observations](https://www.kaggle.com/c/data-science-bowl-2019/discussion/119715)

Next, we will identify one `installation_id` and visualize the **train data structure** in a tree. First let's filter the data to only contain events with assessments and pick one `installation_id` from the result.

In [None]:
train_ass = train_df.loc[train_df['event_code'].isin([4100, 4110])]
train_ass.head(15)

We will pick `installation_id` = **0006a69f** as our **root node**:

In [None]:
train_filtered = train_df.loc[train_df['installation_id'] == '0006a69f']
# Rows keep their original file order here. DataFrame.sort_values returns a new
# DataFrame rather than sorting in place, so an unassigned call has no effect.
print(train_filtered.shape)
train_filtered.head()

In order to visualize the data as a tree, we add **accuracy_group** from `train_labels` to our selection and take a subset of the data so that the tree stays small enough to draw (not too many nodes).

For the current purpose we select data with the following filters, so that the graph can be rendered in the notebook and remains readable. You can modify the filtering below to see other outputs, especially bigger graphs:
* `world` = **TREETOPCITY**
* the first **20 rows** of that world
* up to **20 additional rows** of type Assessment, so that assessment sessions are included

In [None]:
# left join with labels; both frames have a `title` column, so pandas
# suffixes them as `title_x` (train) and `title_y` (labels)
train_filtered_final = pd.merge(train_filtered, train_labels_df, how='left', on=['installation_id', 'game_session'])

# FILTERING:
n_samples = 20
world = 'TREETOPCITY'
train_filtered_final = train_filtered_final.loc[train_filtered_final['world'] == world]
# first n_samples rows of type Assessment
train_with_assessment = train_filtered_final.loc[train_filtered_final['type'] == 'Assessment'][:n_samples]
# first n_samples rows overall
train_filtered_final = train_filtered_final[:n_samples]
# union both
train_filtered_final = pd.concat([train_filtered_final, train_with_assessment])

# inspect df
print(train_filtered_final.shape)
train_filtered_final.head(50)
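
The left join above has two side effects that matter for the tree: sessions without a label get a missing `accuracy_group`, and the shared `title` column is split into `title_x` and `title_y`. A minimal sketch on made-up toy frames (`toy_train`, `toy_labels` and all their values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for train and train_labels (made-up values).
toy_train = pd.DataFrame({
    "installation_id": ["id_a", "id_a"],
    "game_session": ["s1", "s2"],
    "title": ["Some Activity", "Some Assessment"],
})
toy_labels = pd.DataFrame({
    "installation_id": ["id_a"],
    "game_session": ["s2"],
    "title": ["Some Assessment"],
    "accuracy_group": [3],
})

# Left join on the two keys; `title` exists in both frames and is not a key,
# so pandas applies its default suffixes `_x` (left) and `_y` (right).
merged = pd.merge(toy_train, toy_labels, how="left",
                  on=["installation_id", "game_session"])

merged_columns = list(merged.columns)
missing_label = merged["accuracy_group"].isna().tolist()
print(merged_columns)
print(missing_label)
```

Here session `s1` has no row in the labels frame, so its `accuracy_group` is missing after the join, while `s2` keeps its label.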

Next, we collect the relevant columns in a Python dictionary so that the tree can be built recursively. We ignore time-related columns here because this information is already implied, at least partly, by the direction of the tree (top to bottom). We include `game_session` as the second-to-last level to see which session each branch corresponds to. The leaf information is `accuracy_group`:

In [None]:
cols_to_select = ['installation_id', 'world', 'title_x', 'game_session', 'accuracy_group']
train_viz_df = train_filtered_final[cols_to_select]
train_viz_df.head(50)

Now the data will be collected in a python dict and plotted in a tree:

In [None]:
# collect data in dict
def create_nested_dict(df):
    d = {}
    for row in df.values:
        here = d
        for elem in row[:-2]:
            if elem not in here:
                here[elem] = {}
            here = here[elem]
        here[row[-2]] = row[-1]
    return d

train_dict = create_nested_dict(train_viz_df)
train_dict
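
To see what the nesting does, here is a minimal sketch of the same idea on plain tuples instead of a DataFrame. `nest_rows` is a hypothetical list-based variant of `create_nested_dict`, and the rows are made up for illustration:

```python
def nest_rows(rows):
    """Nest rows level by level; the last two fields become a key/value leaf."""
    d = {}
    for row in rows:
        here = d
        # Walk (and create on demand) one dict level per leading column.
        for elem in row[:-2]:
            here = here.setdefault(elem, {})
        # Second-to-last field is the leaf key, last field is the leaf value.
        here[row[-2]] = row[-1]
    return d


# Made-up rows: (installation_id, world, title, game_session, accuracy_group)
toy_rows = [
    ("id_a", "WORLD", "Some Activity", "s1", None),
    ("id_a", "WORLD", "Some Activity", "s1", None),  # second event of session s1
    ("id_a", "WORLD", "Some Assessment", "s2", 3),
]

toy_tree = nest_rows(toy_rows)
print(toy_tree)
```

Note that the two event rows of session `s1` collapse into a single leaf, which is why many event rows end up as only a handful of nodes in the tree.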

In [None]:
def draw(parent_name, child_name):
    # add an edge between two named nodes to the global graph
    edge = pydot.Edge(parent_name, child_name)
    graph.add_edge(edge)

def visit(node, parent=None):
    for k, v in node.items():
        if isinstance(v, dict):
            # start with the root node where parent is None
            # we don't want to graph the None node
            if parent is not None:
                draw(str(parent), str(k))
            visit(v, k)
        else:
            draw(str(parent), str(k))
            # drawing the label using a distinct name
            draw(str(k), str(k) + '_' + str(v))

def show_graph(pdot_graph):
    # render the graph to PNG bytes and show it inline
    img = Image(pdot_graph.create_png())
    display(img)

# instantiate pydot, recursive call, show graph
# and write graph to output directory
graph = pydot.Dot(graph_type='graph')
visit(train_dict)
show_graph(graph)
graph.write_png('sample_train_data_tree.png')
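
If the rendered tree looks unexpected, it can help to inspect the edges without pydot. The sketch below mirrors the traversal of `visit` but returns the edges as a list; `collect_edges` and the toy dictionary are assumptions for illustration, not part of the notebook's pipeline:

```python
def collect_edges(node, parent=None):
    """Return (parent, child) name pairs in the order the tree is traversed."""
    edges = []
    for k, v in node.items():
        if isinstance(v, dict):
            # The root has no parent, so it contributes no incoming edge.
            if parent is not None:
                edges.append((str(parent), str(k)))
            edges.extend(collect_edges(v, k))
        else:
            edges.append((str(parent), str(k)))
            # The label node gets a distinct name built from key and value.
            edges.append((str(k), str(k) + "_" + str(v)))
    return edges


# Made-up nested dict in the same shape as train_dict.
toy_tree = {"id_a": {"WORLD": {"Some Activity": {"s1": None},
                               "Some Assessment": {"s2": 3}}}}

toy_edges = collect_edges(toy_tree)
for edge in toy_edges:
    print(edge)
```

Because nodes are identified by name only, two branches that share a key (for example the same title under two worlds) would be drawn as one node.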

* For our subset, TREETOPCITY consists of different activities.
* As expected, different activities have their own unique game sessions.
* The assessment activity has an accuracy group of `..._3`.
* We can use the same code to plot the test data. There, none of the leaves would carry an `accuracy_group`.
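
For the test data, one possible way to reuse the same functions is sketched below, assuming the test rows carry the columns `installation_id`, `world`, `title` and `game_session`. Since there is no join with labels, the column is `title` rather than `title_x`, and a placeholder leaf value stands in for the missing `accuracy_group`. The frame `toy_test`, its values and the placeholder `"unknown"` are made up for illustration:

```python
import pandas as pd

# Toy stand-in for a few test rows (made-up values).
toy_test = pd.DataFrame({
    "installation_id": ["id_b", "id_b", "id_b"],
    "world": ["TREETOPCITY", "TREETOPCITY", "TREETOPCITY"],
    "title": ["Some Activity", "Some Activity", "Some Assessment"],
    "game_session": ["t1", "t1", "t2"],
})

cols = ["installation_id", "world", "title", "game_session"]

# Keep one row per session and add a placeholder leaf column so the frame
# has the same five-column shape as train_viz_df.
test_viz_df = (
    toy_test[cols]
    .drop_duplicates()
    .assign(accuracy_group="unknown")
)
print(test_viz_df)
```

A frame of this shape could then be passed to `create_nested_dict` and `visit` in the same way as `train_viz_df`.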

Feel free to change the filters above and construct nested trees of different sizes. The output is also saved as a PNG file, so you can download it and resize or crop it.
