# Settings

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Loading the data

In [5]:
sample_submission = pd.read_csv("C:/Users/odeli/Desktop/dataCamp_data/sample_submission.csv")
test = pd.read_csv("C:/Users/odeli/Desktop/dataCamp_data/test.csv")
specs = pd.read_csv("C:/Users/odeli/Desktop/dataCamp_data/specs.csv")
train_labels = pd.read_csv("C:/Users/odeli/Desktop/dataCamp_data/train_labels.csv")
train = pd.read_csv("C:/Users/odeli/Desktop/dataCamp_data/train.csv")

In [6]:
print('Shapes:\n - train: ', np.shape(train), '\n - train_labels: ',np.shape(train_labels), '\n - test: ',np.shape(test), '\n - specs: ',np.shape(specs), '\n - sample_submission: ',np.shape(sample_submission))

Shapes:
 - train:  (11341042, 11) 
 - train_labels:  (17690, 7) 
 - test:  (1156414, 11) 
 - specs:  (386, 3) 
 - sample_submission:  (1000, 2)


# Context And Objective

We have data about children playing games in an app called PBS KIDS Measure Up.
The app contains 5 assessments, each with several levels:

* Bird Measurer 
* Cart Balancer 
* Cauldron Filler 
* Chest Sorter
* Mushroom Sorter

Our objective is to predict how many times a child has to play a game (number of attempts) to win it (pass the assessment).

# Understanding the data

We are provided with 5 files:
* the train_labels file, an example of what the final (cleaned-up) train set should look like.
* the train file, which is messy: the labels are not available, but the information needed to build them is contained in the event_data column.
* the specs file, which contains the specification of the event types
* the test file
* the sample_submission file, which shows the format required for a submission on Kaggle

### The target Variable

The outcomes in this competition are grouped into 4 groups (labeled accuracy_group in the data):

* 3: the assessment was solved on the first attempt
* 2: the assessment was solved on the second attempt
* 1: the assessment was solved after 3 or more attempts
* 0: the assessment was never solved
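
The mapping above can be written as a small function. This is only a sketch: the name `accuracy_group` and its two arguments are illustrative, and it assumes that an assessment session stops at the first correct answer, so `num_correct` is either 0 or 1.

```python
def accuracy_group(num_correct: int, num_incorrect: int) -> int:
    """Map the attempt counts of one assessment session to its accuracy_group.

    Assumption: a session contains at most one correct answer.
    """
    if num_correct == 0:
        return 0  # the assessment was never solved
    if num_incorrect == 0:
        return 3  # solved on the first attempt
    if num_incorrect == 1:
        return 2  # solved on the second attempt
    return 1      # solved after 3 or more attempts


# One example per group: first try, second try, fifth try, never solved
print([accuracy_group(1, 0), accuracy_group(1, 1), accuracy_group(1, 4), accuracy_group(0, 6)])
```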

In [27]:
train_labels.groupby('accuracy_group').count()

accuracy_group,game_session,installation_id,title,num_correct,num_incorrect,accuracy
0,4229,4229,4229,4229,4229,4229
1,2411,2411,2411,2411,2411,2411
2,2205,2205,2205,2205,2205,2205
3,8845,8845,8845,8845,8845,8845


Warning: the classes are imbalanced (group 3 accounts for 8845 of the 17690 sessions). Class weights may be needed. Check whether the imbalance is the same in the "real" train set after cleaning.
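
To get a feel for the imbalance, the counts from the table above can be turned into class proportions and into one possible set of class weights. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * count); whether such weights actually help here is an open question, and the variable names are illustrative.

```python
import numpy as np

# Session counts per accuracy_group (0, 1, 2, 3), copied from the groupby output
counts = np.array([4229, 2411, 2205, 8845])
n_samples = counts.sum()
n_classes = len(counts)

# Share of each class in train_labels
proportions = counts / n_samples

# "Balanced" heuristic: rare classes get a weight above 1, frequent ones below 1
class_weight = {group: n_samples / (n_classes * c) for group, c in enumerate(counts)}

print(proportions.round(3))
print(class_weight)
```

A dictionary of this shape can be passed to scikit-learn estimators that accept a `class_weight` argument.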

### Train 

In [14]:
train.head()

Unnamed: 0,event_id,game_session,timestamp,event_data,installation_id,event_count,event_code,game_time,title,type,world
0,27253bdc,45bb1e1b6b50c07b,2019-09-06T17:53:46.937Z,"{""event_code"": 2000, ""event_count"": 1}",0001e90f,1,2000,0,Welcome to Lost Lagoon!,Clip,NONE
1,27253bdc,17eeb7f223665f53,2019-09-06T17:54:17.519Z,"{""event_code"": 2000, ""event_count"": 1}",0001e90f,1,2000,0,Magma Peak - Level 1,Clip,MAGMAPEAK
2,77261ab5,0848ef14a8dc6892,2019-09-06T17:54:56.302Z,"{""version"":""1.0"",""event_count"":1,""game_time"":0...",0001e90f,1,2000,0,Sandcastle Builder (Activity),Activity,MAGMAPEAK
3,b2dba42b,0848ef14a8dc6892,2019-09-06T17:54:56.387Z,"{""description"":""Let's build a sandcastle! Firs...",0001e90f,2,3010,53,Sandcastle Builder (Activity),Activity,MAGMAPEAK
4,1bb5fbdb,0848ef14a8dc6892,2019-09-06T17:55:03.253Z,"{""description"":""Let's build a sandcastle! Firs...",0001e90f,3,3110,6972,Sandcastle Builder (Activity),Activity,MAGMAPEAK


* **event_id** : identifier for the event type ( = event_id in specs)


* **game_session** : identifier grouping events within a single game or video play session


* **timestamp** : Client-generated datetime


* **event_data** : Semi-structured JSON formatted string containing the events parameters. Default fields are: event_count, event_code, and game_time; otherwise fields are determined by the event type.


* **installation_id** : identifier grouping game sessions within a single installed application instance.


* **event_count** : Incremental counter of events within a game session (offset at 1). Extracted from event_data.


* **event_code** : Identifier of the event 'class'. Unique per game, but may be duplicated across games. E.g. event code '2000' always identifies the 'Start Game' event for all games. Extracted from event_data.


* **game_time** : Time in milliseconds since the start of the game session. Extracted from event_data.


* **title** : Title of the game or video.


* **type** : Media type of the game or video. Possible values are: 'Game', 'Assessment', 'Activity', 'Clip'.


* **world** : The section of the application the game or video belongs to. Helpful to identify the educational curriculum goals of the media. Possible values are: 'NONE' (at the app's start screen), 'TREETOPCITY' (Length/Height), 'MAGMAPEAK' (Capacity/Displacement), 'CRYSTALCAVES' (Weight).
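
Since event_data is a JSON string, its fields can be extracted with `json.loads`. The sketch below works on a two-row toy frame: the first string is copied from the `train.head()` output, while the second row (with a `correct` field) is a made-up example of an attempt event.

```python
import json
import pandas as pd

events = pd.DataFrame({
    "event_data": [
        '{"event_code": 2000, "event_count": 1}',                    # from train.head()
        '{"event_code": 4100, "event_count": 12, "correct": true}',  # hypothetical row
    ]
})

# Turn each JSON string into a Python dict
parsed = events["event_data"].apply(json.loads)

# Fields that are absent for a given event type come back as None
events["correct"] = parsed.apply(lambda d: d.get("correct"))
print(events)
```

On the full train file (more than 11 million rows) parsing every row this way is likely to be slow, so filtering on event_code first is probably worthwhile.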

### Specs

In [19]:
specs.head()

Unnamed: 0,event_id,info,args
0,2b9272f4,The end of system-initiated feedback (Correct)...,"[{""name"":""game_time"",""type"":""int"",""info"":""mill..."
1,df4fe8b6,The end of system-initiated feedback (Incorrec...,"[{""name"":""game_time"",""type"":""int"",""info"":""mill..."
2,3babcb9b,The end of system-initiated instruction event ...,"[{""name"":""game_time"",""type"":""int"",""info"":""mill..."
3,7f0836bf,The end of system-initiated instruction event ...,"[{""name"":""game_time"",""type"":""int"",""info"":""mill..."
4,ab3136ba,The end of system-initiated instruction event ...,"[{""name"":""game_time"",""type"":""int"",""info"":""mill..."


This file gives the specification of the various event types.

* **event_id** : Global unique identifier for the event type. Joins to event_id column in events table.


* **info** : Description of the event.


* **args** : JSON formatted string of event arguments. Each argument contains:

    * **name** : Argument name.
    * **type** : Type of the argument (string, int, number, object, array).
    * **info** : Description of the argument.


In [34]:
sample_submission.head()

Unnamed: 0,installation_id,accuracy_group
0,00abaee7,3
1,01242218,3
2,017c5718,3
3,01a44906,3
4,01bc6cb6,3


## Making labels

The first and most important step is to create the labels.
As noted above, the required information is contained in the event_data column.
Each row of this column holds a JSON dictionary whose size varies from row to row.

How the application organises the information it records:

* When a child installs the app, an id is created: installation_id
* When the child starts playing a game, another id is created: game_session
* Within a game_session the player can do many things. Every event type has its own id: event_id. The possible events are characterised by a code stored in the event_code column.

The only events needed to build the labels are the last ones in a game session: they correspond to the success or failure of the game in that game_session. The corresponding event_code is 4100 for 4 of the 5 games and 4110 for the last one.

In [53]:
def make_labels(data):
    
    """
     1. Keep only the rows where event_code is 4100 or 4110: these are the final events of a game_session,
         so they carry the success or failure information
         
     2. Group by game_session
     
     3. Within each game_session, count the number of successes and failures
     
     4. With a merge, build a table holding the number of successes and failures for each game_session.
         Warning: some game_sessions contain only successes or only failures; in that case
         the corresponding empty cell is NA, which we replace with 0.
         
     5. Compute the accuracy score = nb_success / nb_attempts, where nb_attempts = nb_success + nb_failures
     
     6. Build the accuracy_group variable: assignment to the classes (0,1,2,3)
     
     7. Add the informative train variables to the final DataFrame
    """
    
    # 1 (not restricted to type == 'Assessment', so games that emit these codes are kept too)
    data_assignement = data[data['event_code'].isin([4100, 4110])]
    
    # 2-3
    num_correct = pd.DataFrame(data_assignement[data_assignement['event_data'].str.contains('correct":true')].groupby('game_session').count()["event_id"].rename('num_correct'))
    
    num_incorrect = pd.DataFrame(data_assignement[data_assignement['event_data'].str.contains('correct":false')].groupby('game_session').count()["event_id"].rename('num_incorrect'))
    
    # 4
    data_l = pd.DataFrame(num_correct.merge(num_incorrect, how='outer', left_on=num_correct.index, right_on=num_incorrect.index)).fillna(0)
    data_l = pd.DataFrame(data_l.rename(columns={'key_0':'game_session', 'num_correct_x':'num_correct', 'num_correct_y':'num_incorrect'}))
    
    # 5
    data_l["accuracy"] = data_l["num_correct"]/(data_l["num_correct"]+data_l["num_incorrect"])
    
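    # 6 (thresholds on accuracy; equivalent to the attempt-based definition when a session has at most one correct answer)
    data_l["accuracy_group"] = data_l["accuracy"].apply(lambda x: 0 if x==0 else (1 if x<0.5 else (2 if x<0.9 else 3)))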
    
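    # 7 (use the function argument rather than the global train; this yields one row per event of each kept game_session)
    data_l = data_l.merge(data[['installation_id', 'game_session', 'title','event_data','event_count','game_time','world','timestamp']], how='inner', left_on='game_session', right_on='game_session')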

    return data_l

In [54]:
trainL = make_labels(train)
trainL.sample(100)

Unnamed: 0,game_session,num_correct,num_incorrect,accuracy,accuracy_group,installation_id,title,event_data,event_count,game_time,world,timestamp
1268560,37ad3390310414fd,0.0,17.0,0.000000,0,906b1791,Bird Measurer (Assessment),"{""hat"":0,""caterpillar"":""middle"",""coordinates"":...",85,84371,TREETOPCITY,2019-07-31T15:51:15.344Z
433928,5f0285ead92b6a11,1.0,0.0,1.000000,3,d2f30b4a,Mushroom Sorter (Assessment),"{""coordinates"":{""x"":228,""y"":517,""stage_width"":...",63,33260,TREETOPCITY,2019-10-03T18:46:42.703Z
287317,3ef20ba8d6cb46d9,1.0,5.0,0.166667,1,ad9df0ca,Air Show,"{""coordinates"":{""x"":247,""y"":146,""stage_width"":...",190,140221,TREETOPCITY,2019-09-10T01:00:32.664Z
379273,536b25018382f0e8,1.0,4.0,0.200000,1,8188b29c,Bird Measurer (Assessment),"{""hat"":4,""source"":""middle"",""coordinates"":{""x"":...",82,115392,TREETOPCITY,2019-09-07T18:32:45.881Z
577330,7c95e316c91e304b,14.0,0.0,1.000000,3,61d105aa,Pan Balance,"{""location"":""TABLE"",""scale_weights"":3,""table_w...",85,125015,CRYSTALCAVES,2019-07-27T17:20:29.464Z
780808,a7548f9136be65e3,14.0,2.0,0.875000,2,f3e09002,Pan Balance,"{""location"":""TABLE"",""scale_weights"":2,""table_w...",85,84698,CRYSTALCAVES,2019-08-07T22:30:57.187Z
958942,cd3b6d31a109fdd6,14.0,14.0,0.500000,2,f1c21eda,Pan Balance,"{""scale_weights"":5,""target_weight"":5,""table_we...",130,133004,CRYSTALCAVES,2019-09-14T06:27:26.072Z
511031,6e1cf6cfffe47070,8.0,2.0,0.800000,2,c6695e15,Pan Balance,"{""location"":""BALANCE"",""scale_weights"":4,""table...",153,289590,CRYSTALCAVES,2019-09-22T02:33:50.289Z
564174,799003e70ee16459,12.0,1.0,0.923077,3,f17dd9ce,Pan Balance,"{""coordinates"":{""x"":666,""y"":238,""stage_width"":...",158,183247,CRYSTALCAVES,2019-08-17T16:56:39.361Z
1390135,9d8622efa8dc0dc3,0.0,6.0,0.000000,0,6a7b1b7a,Bird Measurer (Assessment),"{""hat"":0,""caterpillar"":""middle"",""coordinates"":...",58,119896,TREETOPCITY,2019-08-05T20:18:29.986Z


In [55]:
np.shape(train), np.shape(trainL)

((11341042, 11), (1502699, 12))

# Mini-data to keep my computer from dying

### Reminder: take care of the class balancing !!

In [35]:
from sklearn.model_selection import train_test_split

In [56]:
trash, mini_train = train_test_split(trainL, test_size=0.1, random_state=42, stratify = trainL['accuracy_group'])
#trash = the rest of the "big" dataset

In [57]:
np.shape(trainL)  ,np.shape(mini_train)

((1502699, 12), (150270, 12))
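
To check that `stratify` really preserves the class balance, the class shares of the full set and of the sample can be compared. The sketch below does this on made-up toy labels (the `toy` frame and its `feature` column are illustrative); the same comparison could be run on `trainL` and `mini_train`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with an imbalance loosely resembling the real one (made-up data)
toy = pd.DataFrame({"accuracy_group": [0] * 200 + [1] * 100 + [2] * 100 + [3] * 400})
toy["feature"] = range(len(toy))

rest, mini = train_test_split(
    toy, test_size=0.1, random_state=42, stratify=toy["accuracy_group"]
)

# Share of each class before and after sampling
full_share = toy["accuracy_group"].value_counts(normalize=True).sort_index()
mini_share = mini["accuracy_group"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_share, "mini": mini_share}))
```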

# Dummy submission just to try (V2)

In [45]:
from sklearn.linear_model import LogisticRegression

In [85]:
X_train = pd.DataFrame(mini_train[['event_count', 'game_time']])#, index = mini_train['installation_id'])
Y_train = pd.DataFrame(mini_train['accuracy_group'])
X_test = pd.DataFrame(test[['event_count', 'game_time']])#, index = test['installation_id'])

In [86]:
np.shape(X_train), np.shape(Y_train), np.shape(X_test)

((150270, 2), (150270, 1), (1156414, 2))

In [93]:
# ravel() turns the one-column DataFrame into the 1-D target that fit expects
MLR = LogisticRegression(multi_class = 'multinomial', solver = 'newton-cg').fit(X_train, Y_train.values.ravel())

pred_test = MLR.predict(X_test)


In [96]:
np.shape(pred_test), np.shape(X_test)

((1156414,), (1156414, 2))

In [118]:
pred_test = pd.DataFrame(pred_test, columns = ['accuracy_group'])
pred_test.sample(10)

Unnamed: 0,accuracy_group
1130245,3
1081836,2
943670,3
1092665,3
937471,1
411429,1
61525,3
485384,3
541791,3
118499,1


In [130]:
submission2 = pd.DataFrame({'installation_id' : test['installation_id'], 'accuracy_group': pred_test['accuracy_group']})
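
Note that this frame has one row per test event (1156414 rows), whereas sample_submission has only 1000 rows, which appears to be one per installation_id. One possible way to reduce the predictions, shown here on toy data, is to keep the prediction attached to the last event of each installation. This is an assumption rather than the method used in this notebook, and it relies on the rows being in chronological order; the names `event_level` and `per_installation` are illustrative.

```python
import pandas as pd

# Toy stand-in for the event-level predictions (hypothetical values)
event_level = pd.DataFrame({
    "installation_id": ["00abaee7", "00abaee7", "01242218", "01242218", "01242218"],
    "accuracy_group": [3, 1, 2, 3, 3],
})

# Keep the prediction of the last event of each installation_id
per_installation = (
    event_level.groupby("installation_id", sort=False)["accuracy_group"]
    .last()
    .reset_index()
)
print(per_installation)
```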