# Settings

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Loading the data

In [2]:
sample_submission = pd.read_csv("C:/Users/odeli/Desktop/dataCamp_data/sample_submission.csv")
test = pd.read_csv("C:/Users/odeli/Desktop/dataCamp_data/test.csv")
specs = pd.read_csv("C:/Users/odeli/Desktop/dataCamp_data/specs.csv")
train_labels = pd.read_csv("C:/Users/odeli/Desktop/dataCamp_data/train_labels.csv")
train = pd.read_csv("C:/Users/odeli/Desktop/dataCamp_data/train.csv")

In [3]:
print('Shapes:\n - train: ', np.shape(train), '\n - train_labels: ',np.shape(train_labels), '\n - test: ',np.shape(test), '\n - specs: ',np.shape(specs), '\n - sample_submission: ',np.shape(sample_submission))

Shapes:
 - train:  (11341042, 11) 
 - train_labels:  (17690, 7) 
 - test:  (1156414, 11) 
 - specs:  (386, 3) 
 - sample_submission:  (1000, 2)


## Making labels

The first important step is to create the labels.
As noted earlier, the relevant information is contained in the event_data column.
Each row of this column holds a JSON string (a serialized dictionary) whose size varies from one row to the next.

Organisation of the information recorded by the application:
    * When a child downloads the app, an id is created: installation_id
    * When the child starts playing a game, another id is created: game_session
    * Inside a game_session, the player can do many things. Every event has its own id: event_id. The different possible events are characterized by a code contained in the event_code column.
The only useful events for building the labels are the last ones in a game session: they record the success or failure of the game in that game_session. The corresponding event_code is 4100 for 4 of the 5 games and 4110 for the last one (Bird Measurer).
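Because event_data holds JSON text, the outcome flag can also be read by parsing each row rather than matching substrings. The sketch below runs on made-up rows: the sample values and the helper name `extract_correct` are assumptions for illustration, not taken from the dataset.

```python
import json
import pandas as pd

# Toy rows imitating the train layout; the event_data values are invented.
events = pd.DataFrame({
    "event_code": [2000, 4100, 4100, 4110],
    "event_data": [
        '{"event_count": 1, "game_time": 0}',
        '{"correct": false, "event_count": 5}',
        '{"correct": true, "event_count": 9}',
        '{"correct": true, "event_count": 12}',
    ],
})

def extract_correct(raw):
    """Return True/False when the event carries a 'correct' flag, else None."""
    return json.loads(raw).get("correct")

# Keep only the attempt events, then decode their outcome.
attempts = events[events["event_code"].isin([4100, 4110])].copy()
attempts["correct"] = attempts["event_data"].apply(extract_correct)
print(attempts[["event_code", "correct"]])
```

Parsing is slower than a substring test on 11 million rows, so it is probably best applied after filtering on event_code, as done here.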

In [76]:
def get_accuracy(data):
    """
    input: data (train or test events)
    output: one row per (installation_id, game_session, title) with
            num_correct, num_incorrect, accuracy and accuracy_group,
            for the 5 games
    """
    games = ['Bird Measurer', 'Cart Balancer', 'Cauldron Filler', 'Chest Sorter', 'Mushroom Sorter']
    keys = ['installation_id', 'game_session', 'title']
    frames = []

    # Loop on the 5 games
    for game in games:
        # regex=False: the game name is matched as a literal substring
        tmp = data[data['title'].str.contains(game, regex=False)]

        # Filter the last event : 4110/4100 (code)
        # Note: 4100 is also kept for Bird Measurer here, although the text above names 4110 only.
        if game == 'Bird Measurer':
            tmp = tmp[tmp['event_code'].isin([4110, 4100])]
        else:
            tmp = tmp[tmp['event_code'] == 4100]
        tmp = tmp.copy()  # work on a copy to avoid SettingWithCopyWarning

        # num_correct and num_incorrect, read from the text of event_data.
        # If both flags appear in a row, it counts as incorrect;
        # a row with neither flag contributes 0 to both counters.
        incorrect = tmp['event_data'].str.contains('correct":false', regex=False)
        correct = tmp['event_data'].str.contains('correct":true', regex=False) & ~incorrect
        tmp['num_correct'] = correct.astype(int)
        tmp['num_incorrect'] = incorrect.astype(int)

        # Group by a list of keys (recent pandas versions read a tuple as a single key)
        # and sum only the two counters.
        tmp = tmp.groupby(keys)[['num_correct', 'num_incorrect']].sum()

        # accuracy
        tmp['accuracy'] = tmp['num_correct'] / (tmp['num_correct'] + tmp['num_incorrect'])

        # accuracy_group
        tmp["accuracy_group"] = tmp["accuracy"].apply(lambda x: 0 if x==0 else (1 if x<0.5 else (2 if x<0.9 else 3)))
        frames.append(tmp)

    df = pd.concat(frames)
    df = df.reset_index()[['game_session','installation_id','title','num_correct','num_incorrect','accuracy','accuracy_group']]
    return df

In [77]:
my_train_labels = get_accuracy(train)



In [78]:
my_train_labels = my_train_labels.sort_values(['installation_id', 'game_session']).reset_index(drop=True)
#my_train_labels.head()

In [79]:
np.shape(my_train_labels), np.shape(train_labels)

((17692, 7), (17690, 7))
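The rebuilt table has two more rows than train_labels. One way to locate such rows is an outer merge on the session keys with `indicator=True`. The two frames below are small stand-ins with invented ids, not the real tables, so this is only a sketch of the comparison.

```python
import pandas as pd

# Hypothetical stand-ins for my_train_labels and train_labels.
mine = pd.DataFrame({
    "installation_id": ["a", "a", "b", "c"],
    "game_session": ["s1", "s2", "s3", "s4"],
    "accuracy_group": [3, 0, 1, 2],
})
official = pd.DataFrame({
    "installation_id": ["a", "b", "c"],
    "game_session": ["s1", "s3", "s4"],
    "accuracy_group": [3, 1, 2],
})

keys = ["installation_id", "game_session"]
merged = mine.merge(official, on=keys, how="outer",
                    suffixes=("_mine", "_official"), indicator=True)

# Sessions present in only one of the two tables.
only_one_side = merged[merged["_merge"] != "both"]
print(only_one_side[keys + ["_merge"]])
```

On the real tables, the same call with `mine=my_train_labels` and `official=train_labels` would list the sessions responsible for the 17692 vs 17690 gap.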

## The test file

How many distinct installation_id values are there in each file?

In [106]:
len(set(test['installation_id'])), len(set(train['installation_id'])) ,len(set(train_labels['installation_id'])), len(set(my_train_labels['installation_id']))

(1000, 17000, 3614, 3614)

Shape of the sample submission file:

In [19]:
np.shape(sample_submission)

(1000, 2)

There are 1000 distinct children whose accuracy must be predicted in the test set: this is exactly the number of rows in the sample_submission file, so the two are consistent.
It remains to understand what the other rows of the test set are.

Are there installation_id values common to train and test? (No)

In [None]:
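# Count the installation_id values present in both train and test.
# The set intersection builds each set once instead of once per loop iteration.
common_ids = set(test['installation_id']) & set(train['installation_id'])
print(len(common_ids))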

There are 1000 distinct children who took an assessment in the test set: these are the ones we have to predict.
Moreover, none of these children appear in the train set.

What happens when we apply the get_accuracy function to the test set?

In [81]:
test_labels = get_accuracy(test)



In [49]:
np.shape(test_labels)

(2018, 7)

In [50]:
len(set(test_labels['installation_id']))

557

The results look odd (only 557 of the 1000 installation_id values appear in test_labels) and need further investigation.

Test data

Per installation_id, the last row contains the event start of the assessment (event_code 2000).  

We need to predict the accuracy_group for this assessment.  

Note: In the test data, you may find previous assessments (with their outcomes, event_code 4100 or 4110)
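The description above suggests that the assessment to predict can be found by taking the last row of each installation_id. A minimal sketch on an invented frame follows; it assumes the timestamp column sorts chronologically, and the ids, titles and timestamps are made up.

```python
import pandas as pd

# Toy test-like frame; ids, timestamps and titles are invented.
toy_test = pd.DataFrame({
    "installation_id": ["x", "x", "x", "y", "y"],
    "timestamp": pd.to_datetime([
        "2019-09-01 10:00", "2019-09-01 10:05", "2019-09-02 08:00",
        "2019-09-03 12:00", "2019-09-03 12:30",
    ]),
    "event_code": [2000, 4100, 2000, 2000, 2000],
    "type": ["Assessment", "Assessment", "Assessment", "Game", "Assessment"],
    "title": ["Chest Sorter (Assessment)", "Chest Sorter (Assessment)",
              "Bird Measurer (Assessment)", "Air Show",
              "Cart Balancer (Assessment)"],
})

# Last recorded event of each installation_id, in chronological order.
to_predict = (toy_test.sort_values(["installation_id", "timestamp"])
                      .groupby("installation_id")
                      .tail(1))
print(to_predict[["installation_id", "title", "event_code"]])
```

Checking that every selected row has event_code 2000 and type "Assessment" is a cheap way to confirm the description on the real file.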

For 'c31c4183', how many assessments were carried through to the end (start: event_code 2000, end: 4100/4110)?

In [84]:
np.shape(test[(test['installation_id'] == 'c31c4183') & (test['type'] == 'Assessment') & ((test['event_code'] == 4100) | (test['event_code'] == 4110))])

(11, 11)

In [83]:
test_labels[test_labels['installation_id'] == 'c31c4183']

index,game_session,installation_id,title,num_correct,num_incorrect,accuracy,accuracy_group
231,151807bba56c5059,c31c4183,Bird Measurer (Assessment),1,3,0.25,1
694,6a817f6a1fee05d7,c31c4183,Cart Balancer (Assessment),1,3,0.25,1
1514,7a70d528a32097ab,c31c4183,Chest Sorter (Assessment),0,3,0.0,0


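Number of assessments started by each installation_id in the train set (note that the filter below does not restrict on type, unlike the test query further down, so it counts every row with event_code 2000)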

In [85]:
train[train['event_code'] == 2000].groupby('installation_id').count()

installation_id,event_id,game_session,timestamp,event_data,event_count,event_code,game_time,title,type,world
0001e90f,10,10,10,10,10,10,10,10,10,10
000447c4,5,5,5,5,5,5,5,5,5,5
0006a69f,80,80,80,80,80,80,80,80,80,80
0006c192,50,50,50,50,50,50,50,50,50,50
0009a5a9,7,7,7,7,7,7,7,7,7,7
0011edc8,14,14,14,14,14,14,14,14,14,14
00129856,9,9,9,9,9,9,9,9,9,9
0016b7cc,23,23,23,23,23,23,23,23,23,23
00195df7,7,7,7,7,7,7,7,7,7,7
001d0ed0,52,52,52,52,52,52,52,52,52,52


Same for the test set

In [94]:
test[(test['event_code'] == 2000)  & (test['type'] == "Assessment")].groupby('installation_id').count()

installation_id,event_id,game_session,timestamp,event_data,event_count,event_code,game_time,title,type,world
00abaee7,2,2,2,2,2,2,2,2,2,2
01242218,6,6,6,6,6,6,6,6,6,6
017c5718,1,1,1,1,1,1,1,1,1,1
01a44906,1,1,1,1,1,1,1,1,1,1
01bc6cb6,1,1,1,1,1,1,1,1,1,1
02256298,2,2,2,2,2,2,2,2,2,2
0267757a,1,1,1,1,1,1,1,1,1,1
027e7ce5,10,10,10,10,10,10,10,10,10,10
02a29f99,4,4,4,4,4,4,4,4,4,4
0300c576,1,1,1,1,1,1,1,1,1,1


Note that in both train and test there are children who attempted an assessment only once, so there is no assessment history for these children.

The structure of the test set has to be reproduced on the train set, i.e. for each installation_id, the course of the last assessment must be truncated. (In the test set it is indeed the last assessment that was truncated, not an assessment picked at random according to some distribution.)
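A possible way to mimic the test layout on train is to cut each installation_id at the start event (event_code 2000) of its last assessment. The function below is a sketch under assumptions: the name `truncate_at_last_assessment` is hypothetical, the timestamp column is assumed to sort chronologically, and ids that never start an assessment are simply dropped.

```python
import pandas as pd

def truncate_at_last_assessment(events):
    """Keep, per installation_id, the events up to and including the
    start (event_code 2000) of its last assessment."""
    events = (events.sort_values(["installation_id", "timestamp"])
                    .reset_index(drop=True))
    is_start = (events["type"] == "Assessment") & (events["event_code"] == 2000)
    # Row position of the last assessment start for each installation_id.
    last_start = events[is_start].groupby("installation_id").tail(1)
    cutoff = pd.Series(last_start.index.to_numpy(),
                       index=last_start["installation_id"].to_numpy())
    # Ids without any assessment start map to NaN and compare as False.
    limit = events["installation_id"].map(cutoff)
    keep = events.index.to_numpy() <= limit.to_numpy()
    return events[keep]

# Toy train-like frame; all values are invented.
toy_train = pd.DataFrame({
    "installation_id": ["a"] * 5 + ["b"],
    "timestamp": pd.to_datetime([
        "2019-08-01 09:00", "2019-08-01 09:02", "2019-08-01 09:10",
        "2019-08-02 09:00", "2019-08-02 09:03", "2019-08-05 14:00",
    ]),
    "type": ["Assessment", "Assessment", "Game",
             "Assessment", "Assessment", "Game"],
    "event_code": [2000, 4100, 2000, 2000, 4100, 2000],
})

truncated = truncate_at_last_assessment(toy_train)
print(truncated)
```

The attempt events removed by the cut (here the final 4100 of "a") are exactly the ones from which the label of that last assessment would be computed, so they should be set aside before truncating.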