# Introduction 
This kernel explores the [2019 Data Science Bowl](https://www.kaggle.com/c/data-science-bowl-2019) competition.

### PBS KIDS Measure UP!
![](https://github.com/seriousran/img_link/blob/master/kg/app.PNG?raw=true)
- Try it :)
- [Official website](https://pbskids.org/apps/pbs-kids-measure-up.html)
- [App Store](https://apps.apple.com/us/app/pbs-kids-measure-up/id1088888867)
- [Google Play](https://play.google.com/store/apps/details?id=org.pbskids.measureup&hl=en)

### History:
1. [National Data Science Bowl](https://www.kaggle.com/c/datasciencebowl/overview)
    - For measuring and monitoring plankton populations
2. [Second Annual Data Science Bowl](https://www.kaggle.com/c/second-annual-data-science-bowl)
    - To automatically determine cardiac volumes from MRI scans (Diagnosing Heart Diseases)
3. [Data Science Bowl 2017](https://www.kaggle.com/c/data-science-bowl-2017)
    - To detect lung cancer
4. [2018 Data Science Bowl](https://www.kaggle.com/c/data-science-bowl-2018)
    - To automate nucleus detection.
       
The Data Science Bowl is the world’s premier data science for social good competition, created in 2014 and presented by Booz Allen Hamilton and Kaggle. <br/>
Official website: https://datasciencebowl.com/ <br/>
    
### The 2019 Data Science Bowl is the fifth! :)
- Predict scores on in-game assessments and create an algorithm that will lead to better-designed games and improved learning outcomes.
- Aid in discovering important relationships between engagement with high-quality educational media and learning processes.
- Use anonymous gameplay data (including knowledge of videos watched and games played).

### The outcomes in this competition are grouped into 4 groups (labeled accuracy_group in the data):
- 3: the assessment was solved on the first attempt
- 2: the assessment was solved on the second attempt
- 1: the assessment was solved after 3 or more attempts
- 0: the assessment was never solved
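
The grouping above can be sketched as a small function. This is only an illustration: the function name `accuracy_group` and its two arguments are hypothetical, and it assumes a session has at most one correct attempt, which is always the last one.

```python
def accuracy_group(num_correct: int, num_incorrect: int) -> int:
    """Map the attempt counts of one assessment session to a group 0-3.

    Assumption: at most one correct attempt per session, and it comes last,
    so the number of incorrect attempts decides the group.
    """
    if num_correct == 0:
        return 0  # the assessment was never solved
    if num_incorrect == 0:
        return 3  # solved on the first attempt
    if num_incorrect == 1:
        return 2  # solved on the second attempt
    return 1      # solved after 3 or more attempts


# (num_correct, num_incorrect) pairs covering all four groups
for correct, incorrect in [(1, 0), (1, 1), (1, 4), (0, 3)]:
    print(correct, incorrect, accuracy_group(correct, incorrect))
```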

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Data Description

### train.csv & test.csv
These are the main data files which contain the gameplay events.
- event_id - Randomly generated unique identifier for the event type. Maps to event_id column in specs table.
- game_session - Randomly generated unique identifier grouping events within a single game or video play session.
- timestamp - Client-generated datetime
- event_data - Semi-structured JSON formatted string containing the event's parameters. Default fields are: event_count, event_code, and game_time; otherwise fields are determined by the event type.
- installation_id - Randomly generated unique identifier grouping game sessions within a single installed application instance.
- event_count - Incremental counter of events within a game session (offset at 1). Extracted from event_data.
- event_code - Identifier of the event 'class'. Unique per game, but may be duplicated across games. E.g. event code '2000' always identifies the 'Start Game' event for all games. Extracted from event_data.
- game_time - Time in milliseconds since the start of the game session. Extracted from event_data.
- title - Title of the game or video.
- type - Media type of the game or video. Possible values are: 'Game', 'Assessment', 'Activity', 'Clip'.
- world - The section of the application the game or video belongs to. Helpful to identify the educational curriculum goals of the media. Possible values are: 'NONE' (at the app's start screen), 'TREETOPCITY' (Length/Height), 'MAGMAPEAK' (Capacity/Displacement), 'CRYSTALCAVES' (Weight).
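
Since event_data is a JSON string, it has to be decoded before its fields can be used. Below is a minimal sketch of one way to do that. The two rows are made up for illustration, and the `correct` field is an assumed example of an event-specific field, not a documented one.

```python
import json

import pandas as pd

# Hypothetical rows; only event_count, event_code and game_time are
# default fields, "correct" is an invented event-specific field.
events = pd.DataFrame({
    "game_session": ["s1", "s1"],
    "event_data": [
        '{"event_count": 1, "event_code": 2000, "game_time": 0}',
        '{"event_count": 2, "event_code": 4100, "game_time": 1500, "correct": true}',
    ],
})

# Decode every JSON string into a dict, then spread the dicts into columns.
# Fields missing from an event become NaN in that row.
parsed = pd.DataFrame(events["event_data"].map(json.loads).tolist())
print(parsed)
```

On the full train.csv this decoding is slow, so it may be worth applying it only to the event types of interest.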

### specs.csv
This file gives the specification of the various event types.
- event_id - Global unique identifier for the event type. Joins to event_id column in events table.
- info - Description of the event.
- args - JSON formatted string of event arguments. Each argument contains:
    - name - Argument name.
    - type - Type of the argument (string, int, number, object, array).
    - info - Description of the argument.
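
As a rough sketch of how specs.csv can be used, the snippet below decodes the args column and joins the result to events on event_id. Both tables are tiny, made-up stand-ins (the ids, descriptions and argument names are hypothetical), and `arg_names` is a helper column introduced only for this example.

```python
import json

import pandas as pd

# Hypothetical specs table with the same three columns as specs.csv.
specs = pd.DataFrame({
    "event_id": ["aaa111", "bbb222"],
    "info": ["Start of the game.", "Player submitted an answer."],
    "args": [
        '[{"name": "game_time", "type": "int", "info": "ms since session start"}]',
        '[{"name": "description", "type": "string", "info": "text shown to the player"},'
        ' {"name": "game_time", "type": "int", "info": "ms since session start"}]',
    ],
})

# Hypothetical events table; only the join key and the session id are kept.
events = pd.DataFrame({
    "event_id": ["bbb222", "aaa111", "bbb222"],
    "game_session": ["s1", "s1", "s2"],
})

# Decode the JSON list of arguments and keep only the argument names.
specs["arg_names"] = specs["args"].map(lambda s: [a["name"] for a in json.loads(s)])

# A left join keeps every event and attaches the description of its type.
merged = events.merge(specs[["event_id", "info", "arg_names"]], on="event_id", how="left")
print(merged)
```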

### train_labels.csv
This file demonstrates how to compute the ground truth for the assessments in the training set.

### sample_submission.csv
A sample submission in the correct format.

# Data Exploration

In [None]:
ROOT = '../input/data-science-bowl-2019/'
train_df = pd.read_csv(ROOT + 'train.csv')
train_labels_df = pd.read_csv(ROOT + 'train_labels.csv')
specs_df = pd.read_csv(ROOT + 'specs.csv')
test_df = pd.read_csv(ROOT + 'test.csv')
sample_submission = pd.read_csv(ROOT + 'sample_submission.csv')

In [None]:
print(train_df.shape)
train_df.head()

In [None]:
train_sub_df = train_df.sample(n=1000000, random_state=2019)

In [None]:
train_sub_df['installation_id'].value_counts()

In [None]:
numerical = ['timestamp', 'game_time']
categorical = ['event_id', 'game_session', 'installation_id', 'event_code', 'title', 'type', 'world']
dictionary = ['event_data']

In [None]:
print(train_labels_df.shape)
train_labels_df.head()

In [None]:
# One labels DataFrame per assessment title.
# .copy() makes each subset independent of train_labels_df, so adding or
# changing columns later does not trigger SettingWithCopyWarning.
train_label_CB_df = train_labels_df[train_labels_df['title']=='Cart Balancer (Assessment)'].copy()
train_label_CF_df = train_labels_df[train_labels_df['title']=='Cauldron Filler (Assessment)'].copy()
train_label_MS_df = train_labels_df[train_labels_df['title']=='Mushroom Sorter (Assessment)'].copy()
train_label_CS_df = train_labels_df[train_labels_df['title']=='Chest Sorter (Assessment)'].copy()
train_label_BM_df = train_labels_df[train_labels_df['title']=='Bird Measurer (Assessment)'].copy()

In [None]:
# Round accuracy to two decimals so the count plots get a manageable number of bars.
for df in [train_label_CB_df, train_label_CF_df, train_label_MS_df, train_label_CS_df, train_label_BM_df]:
    df['accuracy'] = df['accuracy'].round(2)

In [None]:
# Cart Balancer: pass the column as x= so it is not read as the data argument
sns.set(rc={'figure.figsize':(20,8)})
sns.countplot(x=train_label_CB_df['accuracy'])

In [None]:
# Cauldron Filler
sns.set(rc={'figure.figsize':(20,8)})
sns.countplot(x=train_label_CF_df['accuracy'])

In [None]:
# Mushroom Sorter
sns.set(rc={'figure.figsize':(20,8)})
sns.countplot(x=train_label_MS_df['accuracy'])

In [None]:
# Chest Sorter
sns.set(rc={'figure.figsize':(20,8)})
sns.countplot(x=train_label_CS_df['accuracy'])

In [None]:
# Bird Measurer (figure size set above still applies)
sns.countplot(x=train_label_BM_df['accuracy'])
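
The five plots above are easier to compare side by side as a table. The following is a minimal sketch using `pd.crosstab`. The `labels` frame is a made-up stand-in for train_labels.csv, so the numbers it prints are not real results.

```python
import pandas as pd

# Hypothetical labels; the real values come from train_labels.csv.
labels = pd.DataFrame({
    "title": [
        "Bird Measurer (Assessment)",
        "Bird Measurer (Assessment)",
        "Cart Balancer (Assessment)",
        "Cart Balancer (Assessment)",
        "Cart Balancer (Assessment)",
    ],
    "accuracy_group": [0, 3, 3, 3, 1],
})

# Rows: assessment title, columns: accuracy_group,
# values: share of that title's sessions in each group (each row sums to 1).
shares = pd.crosstab(labels["title"], labels["accuracy_group"], normalize="index")
print(shares.round(2))
```

With the real data, `labels` would simply be replaced by `train_labels_df`.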