# Duolingo SLAM Explorer

We are focused on creating new features to use with gradient boosted trees (Microsoft LightGBM):

---
So far Alex created features that fall into these categories:

**Basic word features:**  
These are a bit like the mental lexicon: definitions, stuff you look up in WordNet.  Noun? Verb?
Plural? etc...  Many came for free from the dataset itself and we aren't sure about adding much more here. We aren't word people anyway.
- word length
- morphological features
- tokenid (one-hot word index)

**Position/sequence features:**  
These are sort of like grammatical aspect because they capture something about sequential structure.
- previous word part of speech
- next word part of speech
- root word part of speech

**User features:**  
Features about the users themselves.
- userid (one-hot user index)

**Temporal features (per word):**  
- number of observations of a word (total, unlabeled, labeled)
- time since last observation (labeled, unlabeled)
- exponentially smoothed running average of the probability of remembering (4 different fixed rates); there is no decay here in the absence of information
- is it 1st encounter with word? (true/false)
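
The exponentially smoothed running average in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in `processing.py`: the function name `update_running_average`, the starting value of 0.5, and the four rate values are all assumptions made for the example.

```python
def update_running_average(avg, outcome, rate):
    """Move the running average toward the latest outcome by a fixed rate.

    outcome is 0.0 or 1.0 for a labeled observation (whichever coding the
    feature uses). When outcome is None (an unlabeled observation) the
    average is returned unchanged, i.e. no decay in the absence of information.
    """
    if outcome is None:
        return avg
    return avg + rate * (outcome - avg)


# Hypothetical rates: the notes only say that four fixed rates are used.
rates = [0.3, 0.1, 0.03, 0.01]
averages = [0.5] * len(rates)  # assumed starting value

# Two labeled observations, one unlabeled one, then another labeled one.
for outcome in [0.0, 0.0, None, 1.0]:
    averages = [update_running_average(a, outcome, r)
                for a, r in zip(averages, rates)]
```

Keeping several rates side by side lets the trees pick between a fast-moving estimate (recent performance) and a slow-moving one (long-run performance).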

**Semantic features:**  
Not sure if these are particularly useful here.  Something about word meaning, similarities in meanings, e.g., positive or negative word, emotion?, some might be in the basic features, etc...
- none currently

---
Plan of attack for this weekend:  
1. [ ] Focus on user features (more information about user motivation, session structure, etc...).  (**Anselm is pursuing this**)
1. [ ] Focus on temporal features that capture spaced/massed practice.  (**Alex is pursuing this**) 
1. [ ] Model something about context (repeated contexts aid memory) (**Todd is pursuing this**)
1. [ ] Cognates and word similarity both in terms of letters and meaning (**Pam is pursuing this**)
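
For the letter-level half of the word similarity idea, a rough starting point could look like the sketch below. It only uses `difflib` from the standard library; the function name `letter_similarity` and the example word pairs are made up for illustration, and this is not something that exists in `processing.py`.

```python
from difflib import SequenceMatcher


def letter_similarity(word_a, word_b):
    """Return a 0..1 similarity score based on shared letter sequences."""
    return SequenceMatcher(None, word_a.lower(), word_b.lower()).ratio()


# A likely cognate pair should score higher than an unrelated pair.
cognate_score = letter_similarity('family', 'familia')
unrelated_score = letter_similarity('dog', 'perro')
```

This needs a translation for each token to compare against, which the dataset does not obviously provide, so treat it as a sketch of the scoring step only.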

In [7]:
import os
from processing import build_data
import pandas as pd
#from sklearn.feature_extraction import DictVectorizer
#from sklearn.metrics import roc_auc_score
#import lightgbm as lgb

<div class="alert alert-warning">
You can control which language you are working with here. The options are `all`, `en_es` (reverse Spanish), `fr_en` (French), and `es_en` (Spanish).
</div>

In [8]:
# use this to change language pair trained on
lang = 'en_es'

<div class="alert alert-warning">
The main script for parsing and constructing features is `processing.py`.  You should edit it in a different editor (e.g., sublime) and then run the cell below to re-load it into this jupyter kernel.
</div>

In [36]:
%run processing

In [37]:
success_failure = False
ave_success = True
verbconj = True

In [60]:
# configuration options
NUSERS = 10. # set this to None to load all the users for the given language
FEATUREIZED = True # set this to true to return the features as dict() instead of instances of the User() class

# load data
if lang == 'all':
    data = build_data(
        'all',
        [
            'data/data_{0}/{0}.slam.20171218.train.new'.format('en_es'),
            'data/data_{0}/{0}.slam.20171218.train.new'.format('fr_en'),
            'data/data_{0}/{0}.slam.20171218.train.new'.format('es_en')
        ],
        [
            'data/data_{0}/{0}.slam.20171218.dev.new'.format('en_es'),
            'data/data_{0}/{0}.slam.20171218.dev.new'.format('fr_en'),
            'data/data_{0}/{0}.slam.20171218.dev.new'.format('es_en')
        ],
        labelfiles=[
            'data/data_{0}/{0}.slam.20171218.dev.key'.format('en_es'),
            'data/data_{0}/{0}.slam.20171218.dev.key'.format('fr_en'),
            'data/data_{0}/{0}.slam.20171218.dev.key'.format('es_en')
        ],
        n_users=NUSERS)
else:
    data = build_data(
        lang[:2],
        ['data/data_{0}/{0}.slam.20171218.train.new'.format(lang)],
        ['data/data_{0}/{0}.slam.20171218.dev.new'.format(lang)],
        labelfiles=['data/data_{0}/{0}.slam.20171218.dev.key'.format(lang)],
        n_users=NUSERS)
train_x, train_ids, train_y, test_x, test_ids, test_y = data

using ave success
loading data files
retrieving labels
building features
retrieving features


In [61]:
train_x

[{'client:web': 1.0,
  'context_rep': 1,
  'dependency_label:nsubj': 1.0,
  'exercise_length': 4,
  'exercise_num': 0,
  'format:reverse_translate': 1.0,
  'lang': 'en',
  'morphological_feature:case_Nom': 1.0,
  'morphological_feature:fpos_PRON++PRP': 1.0,
  'morphological_feature:number_Sing': 1.0,
  'morphological_feature:person_1': 1.0,
  'morphological_feature:prontype_Prs': 1.0,
  'next_pos:AUX': 1.0,
  'next_token': 'am_en',
  'parseroot_pos:NOUN': 1.0,
  'parseroot_token': 'boy_en',
  'part_of_speech:PRON': 1.0,
  'prev_pos:None': 1.0,
  'prev_token': '_NONE_',
  'root': 'i_en',
  'root:encounters': 1,
  'root:encounters_unlab': 1,
  'root:time_since_last_encounter': 0.0,
  'root:time_since_last_label': nan,
  'session:lesson': 1.0,
  'time': 9,
  'token': 'i_en',
  'token:encounters': 1,
  'token:encounters_unlab': 1,
  'token:time_since_last_encounter': 0.0,
  'token:time_since_last_label': nan,
  'user': 'XEinXf5+en',
  'word_length': 4},
 {'client:web': 1.0,
  'context_rep'

If you ran with `FEATUREIZED = False` then the following cells will let you explore individual users:

---

## Exploring the basic data structures programmatically

Get the user id and the language out of the user object:

In [40]:
train_x[0].id, train_x[0].features['user'], train_x[0].features['lang']

('XEinXf5+en', 'XEinXf5+en', 'en')

A list of the exercises this user completed, each as an Exercise() instance:

In [41]:
train_x[0].exercises

[<__main__.Exercise at 0x11fd35b00>,
 <__main__.Exercise at 0x11fd35b70>,
 <__main__.Exercise at 0x11fd24d30>,
 <__main__.Exercise at 0x11fd243c8>,
 <__main__.Exercise at 0x11fd2f550>,
 <__main__.Exercise at 0x11fd30c88>,
 <__main__.Exercise at 0x11fd301d0>,
 <__main__.Exercise at 0x11fd2e860>,
 <__main__.Exercise at 0x11fd2e668>,
 <__main__.Exercise at 0x11fd31828>,
 <__main__.Exercise at 0x11fd31c50>,
 <__main__.Exercise at 0x11fd31358>,
 <__main__.Exercise at 0x11fd36588>,
 <__main__.Exercise at 0x11fd36358>,
 <__main__.Exercise at 0x11fd36fd0>,
 <__main__.Exercise at 0x11fd3abe0>,
 <__main__.Exercise at 0x11fd3ea90>,
 <__main__.Exercise at 0x11fd3e080>,
 <__main__.Exercise at 0x11fd40cf8>,
 <__main__.Exercise at 0x11fd404e0>,
 <__main__.Exercise at 0x11fd45ef0>,
 <__main__.Exercise at 0x11fd456d8>,
 <__main__.Exercise at 0x11fd48940>,
 <__main__.Exercise at 0x11fd4ae80>,
 <__main__.Exercise at 0x11fd4a400>,
 <__main__.Exercise at 0x11fc0d940>,
 <__main__.Exercise at 0x11fc10da0>,
 

Get the first exercise this person did:

In [42]:
train_x[0].exercises[0]

<__main__.Exercise at 0x11fd35b00>

Examine the raw text of the exercise:

In [43]:
train_x[0].exercises[0].textlist

['# user:XEinXf5+  countries:CO  days:0.003  client:web  session:lesson  format:reverse_translate  time:9',
 'DRihrVmh0101  I             I             PRON    case=Nom|prontype=Prs|fpos=PRON++PRP|number=Sing|person=1               nsubj        4  0',
 'DRihrVmh0102  am            be            AUX     mood=Ind|fpos=AUX++VBP|number=Sing|person=1|tense=Pres|verbform=Fin     cop          4  0',
 'DRihrVmh0103  a             a             DET     prontype=Art|definite=Ind|fpos=DET++DT                                  det          4  0',
 'DRihrVmh0104  boy           boy           NOUN    fpos=NOUN++NN|number=Sing                                               root         0  0']
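
The first line of `textlist` is a header of `key:value` pairs. A minimal sketch of pulling those out into a dict is below; `parse_exercise_header` is a hypothetical helper written for this example and is not necessarily how `processing.py` does it. Values are left as strings so the caller decides how to convert them.

```python
def parse_exercise_header(line):
    """Turn '# user:X  days:0.003 ...' into a dict of string values."""
    fields = {}
    for item in line.lstrip('#').split():
        key, _, value = item.partition(':')
        fields[key] = value
    return fields


header = ('# user:XEinXf5+  countries:CO  days:0.003  client:web  '
          'session:lesson  format:reverse_translate  time:9')
fields = parse_exercise_header(header)
days = float(fields['days'])  # time since the user started, in days
```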

Some of the other features defined on the exercise:

In [44]:
train_x[0].exercises[24].features

{'client:web': 1.0,
 'context_rep': 3,
 'exercise_length': 4,
 'exercise_num': 24,
 'format:reverse_translate': 1.0,
 'session:practice': 1.0,
 'time': 10}

Each exercise has a list of Instance() instances, which are structured Python representations of the entries of the exercise:

In [45]:
train_x[0].exercises[0].instances

[<__main__.Instance at 0x11fd35a58>,
 <__main__.Instance at 0x11fd35c88>,
 <__main__.Instance at 0x11fd35e48>,
 <__main__.Instance at 0x11fd24278>]

Each instance in turn has a lot of features, many of which are akin to the basic and position features described above, but also including temporal features such as `root:erravg0`, which keeps track of an exponentially smoothed average of error probability, etc...  

Note: these features aren't all present for the first items in the exercise list (the cell below shows the last item) because the error average isn't yet defined.

In [52]:
train_x[0].exercises[-1].instances[1].features

{'dependency_label:cop': 1.0,
 'morphological_feature:fpos_AUX++VBZ': 1.0,
 'morphological_feature:mood_Ind': 1.0,
 'morphological_feature:number_Sing': 1.0,
 'morphological_feature:person_3': 1.0,
 'morphological_feature:tense_Pres': 1.0,
 'morphological_feature:verbform_Fin': 1.0,
 'next_pos:DET': 1.0,
 'next_token': 'a_en',
 'parseroot_pos:NOUN': 1.0,
 'parseroot_token': 'student_en',
 'part_of_speech:AUX': 1.0,
 'prev_pos:PRON': 1.0,
 'prev_token': 'she_en',
 'root': 'be_en',
 'root:encounters': 168,
 'root:encounters_lab': 152,
 'root:encounters_unlab': 16,
 'root:erravg0': 0.05424194615766341,
 'root:erravg1': 0.17654470117893167,
 'root:erravg2': 0.22300850447260473,
 'root:erravg3': 0.14759599680700788,
 'root:time_since_last_encounter': 0.0,
 'root:time_since_last_label': 2.4940000000000015,
 'token': 'is_en',
 'token:encounters': 104,
 'token:encounters_lab': 91,
 'token:encounters_unlab': 13,
 'token:erravg0': 0.07833455549365839,
 'token:erravg1': 0.14165994872492743,
 'tok

In [53]:
xl = pd.ExcelFile("data/13428_2013_348_MOESM1_ESM.xlsx")
# Print the sheet names
print(xl.sheet_names)

# Load a sheet into a DataFrame by name: df1
df1 = xl.parse('Sheet1')

['Sheet1', 'Sheet2', 'Sheet3']


In [79]:
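df1[df1['Word'] == 'boy']
# Note: `in` on a Series tests the index labels, not the values, so this can be
# False even when 'boy' is in the column; use `'boy' in df1['Word'].values` instead.
'boy' in df1['Word']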

False

In [93]:
for d in train_x:
    word = d['token'].lower()
    word = word.split('_', 1)[0]
    word_df = df1[df1['Word'] == word]
    if not word_df.empty:
        print(word_df['Rating.Mean'].values[0])

2.893384
3.67
4.441998
2.715564
3.68
3.568124
2.893384
4.0
3.813235
2.893384
3.67
3.56
2.893384
7.58
4.090317
3.72
2.893384
11.17
7.44
4.239515
2.715564
5.78
2.89
5.25193
3.56
2.893384
4.22
4.772365
3.813235
2.893384
3.67
2.893384
4.95
3.568124
2.893384
4.95
3.813235
2.893384
3.67
2.893384
3.11
2.893384
4.95
2.893384
3.67
2.893384
3.11
2.893384
4.95
3.813235
2.893384
3.67
3.568124
2.893384
4.0
3.568124
2.893384
4.0
3.568124
2.893384
4.95
2.893384
3.67
3.813235
2.893384
5.14536
3.813235
2.893384
3.67
2.893384
3.11
2.893384
4.95
3.55
4.0
4.569882
3.55
3.61
3.75995
4.569882
3.53
5.3585
4.346085
3.482868
4.569882
3.53
3.44
3.280385
3.56
2.893384
7.58
4.090317
3.56
2.893384
4.39
4.06
3.951776
2.310598
6.33
3.280385
2.37
2.78
3.781264
5.53
3.482868
3.855863
3.5
4.772365
3.482868
6.1
3.685351
3.568124
2.893384
4.0
3.568124
2.715564
2.63
4.569882
3.813235
2.715564
4.11
4.569882
3.813235
2.715564
3.63
3.813235
2.715564
5.53
3.2271
3.67
4.569882
4.239515
4.11
4.569882
3.78
3.813235
2.715564
2.58

KeyboardInterrupt: 
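
The loop above filters the whole DataFrame once per token, which is probably why it had to be interrupted. A hedged sketch of a faster variant is to build a plain dict once and look tokens up in it. The helper names below are made up for this example, and the toy ratings are placeholder numbers, not values from the spreadsheet. With the real data, something like `build_rating_lookup(df1['Word'], df1['Rating.Mean'])` should give the equivalent lookup.

```python
def build_rating_lookup(words, ratings):
    """Map each word to its rating, keeping the first row for duplicates."""
    lookup = {}
    for word, rating in zip(words, ratings):
        lookup.setdefault(word, rating)  # first match wins, like .values[0]
    return lookup


def token_to_word(token):
    """Strip the language suffix: 'Boy_en' -> 'boy'."""
    return token.lower().split('_', 1)[0]


# Toy stand-in for the spreadsheet columns (placeholder numbers).
lookup = build_rating_lookup(['boy', 'girl', 'boy'], [1.0, 2.0, 3.0])

tokens = ['Boy_en', 'girl_en', 'pedro_en']
found = [lookup.get(token_to_word(t)) for t in tokens]  # None when missing
```

Returning `None` for missing words keeps one value per instance, which is what a per-word feature would need.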

## Examining user properties

`days` is not currently a feature, but it is a value in the header of each exercise that says how long after the person started Duolingo the current exercise was completed.  These intervals might index something about user engagement or consistency and so might be interesting.  The following cells step through an example so you can see how to analyze and possibly add this feature.

In [95]:
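import plotly

# Note: `plotly.plotly` is the plotly 3.x location; from plotly 4 onward the
# same module lives in the separate `chart_studio` package (`chart_studio.plotly`).
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot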

This draws a scatter plot where the x location of each point is the time in days since the first session (so whole numbers are 24-hour intervals).   If you try different `user_number` values you can see how different people used it.

In [102]:
user_number = 0
x_days = [train_x[user_number].exercises[i].days for i in range(len(train_x[user_number].exercises))]

trace = go.Scatter(
    x = x_days,
    y = [1.0]*len(x_days),
    mode='markers',
    marker=dict(opacity=0.2)
)

data = [trace]
py.iplot(data, filename='basic')

This takes the number of days for each exercise modulo 1 (i.e., the fractional part of the day) and then plots the resulting data as a histogram.  Most of the plots then show a bimodal distribution, presumably because of nighttime.  Is there something interesting there about consistency in time and performance?

In [121]:
user_number = 5
x_days = [train_x[user_number].exercises[i].days%1.0 for i in range(len(train_x[user_number].exercises))]


data = [go.Histogram(x=x_days)]

py.iplot(data, filename='basic')

This is everyone we loaded originally (NUSERS) together:

In [122]:
times = []
for user_number in range(len(train_x)):
    for i in range(len(train_x[user_number].exercises)):
        times.append(train_x[user_number].exercises[i].days%1.0)
    

data = [go.Histogram(x=times)]

py.iplot(data, filename='basic')
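
If `days` does turn out to be useful, one possible way to turn it into per-exercise features is sketched below. The function name `day_features` and the feature names are hypothetical. Note that `days % 1.0` is the phase relative to the user's first session rather than an actual clock time, since `days` counts from when the person started.

```python
def day_features(days):
    """Build per-exercise timing features from a list of `days` values.

    days: one 'days since the user started' value per exercise, in order.
    """
    features = []
    previous = None
    for d in days:
        features.append({
            # position within a 24-hour cycle, relative to the first session
            'time_of_day': d % 1.0,
            # gap since the previous exercise; None for the first one
            'days_since_prev_exercise': None if previous is None else d - previous,
        })
        previous = d
    return features


feats = day_features([0.0, 0.25, 1.5])
```

With the User objects above, the input would be something like `[ex.days for ex in train_x[user_number].exercises]`.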

## Playing with chunks (Todd)

Each word has a unique token.  An exercise can be summarized then by the list of words presented within the exercise.

Let's look at the list of words for the first exercise of the first user:

In [144]:
ex = train_x[0].exercises[0].instances
','.join([i.token for i in ex])

'i_en,am_en,a_en,boy_en'

Now let's look at this for all the exercises this user did:

In [146]:
for i in range(len(train_x[0].exercises)):
    ex= train_x[0].exercises[i].instances
    print(','.join([i.token  for i in ex]))

i_en,am_en,a_en,boy_en
i_en,am_en,from_en,mexico_en
my_en,name_en,is_en,pedro_en
she_en,is_en,a_en,girl_en
is_en,he_en,a_en,boy_en
i_en,need_en,a_en,taxi_en
where_en
i_en,have_en,a_en,reservation_en
i_en,am_en,fine_en
when_en
my_en,newspaper_en
stop_en,now_en
i_en,need_en,a_en,room_en,today_en
is_en,he_en,a_en,boy_en
a_en,woman_en
she_en,is_en,a_en,woman_en
is_en,he_en,a_en,boy_en
a_en,man_en,a_en,woman_en
i_en,am_en,a_en,boy_en
a_en,man_en,a_en,woman_en
is_en,he_en,a_en,boy_en
she_en,is_en,a_en,girl_en
is_en,she_en,a_en,girl_en
she_en,is_en,a_en,woman_en
i_en,am_en,a_en,boy_en
he_en,is_en,a_en,child_en
is_en,he_en,a_en,boy_en
a_en,man_en,a_en,woman_en
good_en,morning_en,and_en,good_en,night_en
hello_en,and_en,thanks_en
how_en,are_en,you_en
please_en,and_en,thanks_en
i_en,am_en,sorry_en,andrea_en
bye_en
i_en,need_en,a_en,taxi_en
where_en
i_en,need_en,a_en,table_en
welcome_en,to_en,mexico_en
yes_en,excuse_en,me_en
i_en,am_en,drinking_en,water_en
eat_en,more_en
check_en,please_en
what_en

Ah but this hurts my head.  Are these the same or different or what?

In [148]:
import hashlib

In [150]:
for i in range(len(train_x[0].exercises)):
    ex= train_x[0].exercises[i].instances
    words = ','.join([i.token  for i in ex])
    hash_object = hashlib.md5(words.encode())
    print(hash_object.hexdigest())

99e8c499b93a8ff33077ecb2452ca7f6
489af8f89794c75a4c8dc7698ae66478
df1ca3f09ff00b09ed0652d7653b8f80
c79a61de010f239f7bc750c6ba3b832e
32172570822e93faa12673c944d88cda
f722f10499c0c51a81375809d829e964
e6219f7407217a27aa2271c1f608798b
af5cb51c14a00c8478553c469c138181
3c95d459bbe901c958f5c828bdb1cf64
41b93e9475d02ebe73786f473d39dd88
f2f48f985485419f4340a8bca5058937
574a62544859c788e1911b8a6634d08c
e2126f0068dd9e7296842e464ea354f6
32172570822e93faa12673c944d88cda
ac224395d9958a44c13ef5257c5a1f91
333ba7b1c942c6ff3dda5022df496eb8
32172570822e93faa12673c944d88cda
9dfb6a488fcab2dd5ec3a5a41957d7dc
99e8c499b93a8ff33077ecb2452ca7f6
9dfb6a488fcab2dd5ec3a5a41957d7dc
32172570822e93faa12673c944d88cda
c79a61de010f239f7bc750c6ba3b832e
983223e79744c808bbf9dab47e53894a
333ba7b1c942c6ff3dda5022df496eb8
99e8c499b93a8ff33077ecb2452ca7f6
59232213fde9cacbff518e184973c030
32172570822e93faa12673c944d88cda
9dfb6a488fcab2dd5ec3a5a41957d7dc
3f378e1c5ced2e90ccf5e1aafaa5aeed
25bb979a20733a778383a2824cec3c8b
73483b01c2

Great, now my head is really hurting. These look like totally random numbers!

In [156]:
counts = {}

for i in range(len(train_x[0].exercises)):
    ex= train_x[0].exercises[i].instances
    words = ','.join([i.token  for i in ex])
    hash_object = hashlib.md5(words.encode())
    mykey = hash_object.hexdigest()
    if mykey in counts:
        counts[mykey]+=1
    else:
        counts[mykey]=1

for key in counts.keys():
    print(key,counts[key])


99e8c499b93a8ff33077ecb2452ca7f6 3
489af8f89794c75a4c8dc7698ae66478 1
df1ca3f09ff00b09ed0652d7653b8f80 1
c79a61de010f239f7bc750c6ba3b832e 3
32172570822e93faa12673c944d88cda 5
f722f10499c0c51a81375809d829e964 2
e6219f7407217a27aa2271c1f608798b 2
af5cb51c14a00c8478553c469c138181 1
3c95d459bbe901c958f5c828bdb1cf64 1
41b93e9475d02ebe73786f473d39dd88 2
f2f48f985485419f4340a8bca5058937 1
574a62544859c788e1911b8a6634d08c 3
e2126f0068dd9e7296842e464ea354f6 1
ac224395d9958a44c13ef5257c5a1f91 1
333ba7b1c942c6ff3dda5022df496eb8 2
9dfb6a488fcab2dd5ec3a5a41957d7dc 3
983223e79744c808bbf9dab47e53894a 1
59232213fde9cacbff518e184973c030 1
3f378e1c5ced2e90ccf5e1aafaa5aeed 1
25bb979a20733a778383a2824cec3c8b 3
73483b01c2bef281bd299354557f1db0 3
5fbee0408351faf2b6b6f7cfe47302be 1
89a39458f3236290b0678d2b74848ba9 1
992bac8cd50d0561abfdb604cd6ba9cc 1
ad9c195d2c53ad9f1a927a9fc69640f8 2
8378997bb028ae2e2392ab822b0a0dd2 2
e55a0e64ed5059aae65156ad6ddee0d8 1
7ad820fc4ddfe80dd8e6617ed1331652 2
93e965409167c76d776c

Ok cool, so like I counted how many times each context has appeared.  sweet.

Hey that gives me an idea... What if one of the features on the words was how many times the context had been repeated so far!?
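
A minimal sketch of that idea: walk through the exercises in order and record, for each one, how many times its word list has been presented so far (counting the current presentation). The function name `context_repetitions` is made up here, and the joined token string is used directly as the key, so no hashing is needed. Whether this matches how the existing `context_rep` feature is computed is an assumption.

```python
from collections import Counter


def context_repetitions(exercises):
    """For each exercise, count presentations of its context so far.

    exercises: list of token lists, in the order the user saw them.
    """
    seen = Counter()
    reps = []
    for tokens in exercises:
        key = ','.join(tokens)
        seen[key] += 1
        reps.append(seen[key])
    return reps


exercises = [
    ['i_en', 'am_en', 'a_en', 'boy_en'],
    ['is_en', 'he_en', 'a_en', 'boy_en'],
    ['i_en', 'am_en', 'a_en', 'boy_en'],
]
reps = context_repetitions(exercises)
```

Because the count only uses exercises up to the current one, it can be computed the same way on the dev set without peeking ahead.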