# Duolingo SLAM Explorer

We are focused on creating new features to use with gradient boosted trees (Microsoft LightGBM):

---
So far Alex created features that fall into these categories:

**Basic word features:**  
These are a bit like the mental lexicon: definitions and other things you would look up in WordNet (noun? verb? plural? etc.).
Many came for free from the dataset itself, and we aren't sure about adding much more here. We aren't word people anyway.
- word length
- morphological features
- tokenid (one-hot word index)

**Position/sequence features:**  
These are sort of like grammatical aspect because they capture something about sequential structure.
- previous word part of speech
- next word part of speech
- root word part of speech
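
As a rough illustration of the previous/next part-of-speech features, here is a minimal sketch. The function name `position_features` is hypothetical and this is not the code in `processing.py`; the key format only mirrors names such as `prev_pos:None` and `next_pos:AUX` that show up in the instance features later in this notebook. The root word feature is left out.

```python
def position_features(pos_tags, index):
    """Return one-hot style prev/next part-of-speech features for one token.

    Hypothetical helper for illustration; edges of the sentence get None.
    """
    prev_pos = pos_tags[index - 1] if index > 0 else None
    next_pos = pos_tags[index + 1] if index + 1 < len(pos_tags) else None
    return {
        'prev_pos:{}'.format(prev_pos): 1.0,
        'next_pos:{}'.format(next_pos): 1.0,
    }


pos_tags = ['PRON', 'AUX', 'DET', 'NOUN']  # tags for "I am a boy"
first = position_features(pos_tags, 0)
last = position_features(pos_tags, 3)
```

Using string keys with a value of 1.0 keeps the features in the same dict-of-floats shape as the rest of the feature dicts, which is the shape `DictVectorizer` expects.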

**User features:**  
Features about the users themselves.
- userid (one-hot user index)

**Temporal features (per word):**  
- number of observations of a word (total, unlabeled, labeled)
- time since last observation (labeled, unlabeled)
- exponentially smoothed running average of the probability of remembering (4 different fixed rates); no decay here in the absence of information
- is it 1st encounter with word? (true/false)
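
To make the exponentially smoothed average concrete, here is a minimal sketch. It tracks error outcomes (1 = error, 0 = correct), and the four rates in `RATES` are made-up placeholders, so the actual rates and update rule in `processing.py` may differ. The average only changes when a labeled outcome arrives, which is what "no decay in the absence of information" means.

```python
def update_error_average(prev_avg, error, rate):
    """One step of an exponentially smoothed running average.

    prev_avg is None before the first labeled observation, in which case
    the first outcome becomes the starting value.
    """
    if prev_avg is None:
        return float(error)
    return (1.0 - rate) * prev_avg + rate * error


# Hypothetical smoothing rates; the values used in processing.py may differ.
RATES = [0.1, 0.3, 0.5, 0.9]

outcomes = [1, 0, 0, 1]  # 1 = error, 0 = correct, in order of presentation
averages = [None] * len(RATES)
for outcome in outcomes:
    # Update all four averages with the same outcome, one per rate.
    averages = [update_error_average(avg, outcome, rate)
                for avg, rate in zip(averages, RATES)]
```

Small rates give a slow-moving average that reflects long-run performance on a word, while large rates mostly reflect the most recent outcome.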

**Semantic features:**  
Not sure if these are particularly useful here. They would capture something about word meaning and similarity in meaning, e.g., positive or negative words, or emotion. Some might already be in the basic features.
- none currently

---
Plan of attack for this weekend:  
1. [ ] Focus on user features (more information about user motivation, session structure, etc...).  (**Anselm is pursuing this**)
1. [ ] Focus on temporal features that capture spaced/massed practice.  (**Alex is pursuing this**) 
1. [ ] Model something about context (repeated contexts aid memory) (**Todd is pursuing this**)
1. [ ] Cognates and word similarity both in terms of letters and meaning (**Pam is pursuing this**)

In [1]:
import os
#from processing import build_data
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score
#import lightgbm as lgb

<div class="alert alert-warning">
You can control which language pair you are working with here. Options are `all`, `en_es` (reverse Spanish), `fr_en` (French), `es_en` (Spanish).
</div>

In [27]:
# use this to change language pair trained on
lang = 'en_es'

<div class="alert alert-warning">
The main script for parsing and constructing features is `processing.py`.  You should edit it in a different editor (e.g., sublime) and then run the cell below to re-load it into this jupyter kernel.
</div>

In [41]:
%run processing

In [110]:
# configuration options
NUSERS = 10. # set this to None to load all the users for the given language
FEATUREIZED = False # set this to True to return the features as dict() instead of instances of the User() class

# load data
if lang == 'all':
    data = build_data(
        'all',
        [
            'data/data_{0}/{0}.slam.20171218.train.new'.format('en_es'),
            'data/data_{0}/{0}.slam.20171218.train.new'.format('fr_en'),
            'data/data_{0}/{0}.slam.20171218.train.new'.format('es_en')
        ],
        [
            'data/data_{0}/{0}.slam.20171218.dev.new'.format('en_es'),
            'data/data_{0}/{0}.slam.20171218.dev.new'.format('fr_en'),
            'data/data_{0}/{0}.slam.20171218.dev.new'.format('es_en')
        ],
        labelfiles=[
            'data/data_{0}/{0}.slam.20171218.dev.key'.format('en_es'),
            'data/data_{0}/{0}.slam.20171218.dev.key'.format('fr_en'),
            'data/data_{0}/{0}.slam.20171218.dev.key'.format('es_en')
        ],
        n_users=NUSERS, featurized=FEATUREIZED)
else:
    data = build_data(
        lang[:2],
        ['data/data_{0}/{0}.slam.20171218.train.new'.format(lang)],
        ['data/data_{0}/{0}.slam.20171218.dev.new'.format(lang)],
        labelfiles=['data/data_{0}/{0}.slam.20171218.dev.key'.format(lang)],
        n_users=NUSERS, featurized=FEATUREIZED)
train_x, train_ids, train_y, test_x, test_ids, test_y = data

loading data files
retrieving labels
building features
retrieving features


In [111]:
train_x

[<__main__.User at 0x188a9c588>,
 <__main__.User at 0x188e01f98>,
 <__main__.User at 0x189d30470>,
 <__main__.User at 0x18b53b470>,
 <__main__.User at 0x18b68f908>,
 <__main__.User at 0x18b800400>,
 <__main__.User at 0x18bda8438>,
 <__main__.User at 0x18c08d6a0>,
 <__main__.User at 0x18c456828>,
 <__main__.User at 0x18c61e668>]

If you ran with `FEATUREIZED = False` then the following cells will let you explore individual users:

---

## Exploring the basic data structures programmatically

Get the user id and the language out of the user object:

In [49]:
train_x[0].id, train_x[0].features['user'], train_x[0].features['lang']

('XEinXf5+en', 'XEinXf5+en', 'en')

A list of the exercises this user completed, each as an Exercise() instance:

In [44]:
train_x[0].exercises

[<__main__.Exercise at 0x185507b38>,
 <__main__.Exercise at 0x185507b70>,
 <__main__.Exercise at 0x18dd2fd68>,
 <__main__.Exercise at 0x18dd2f400>,
 <__main__.Exercise at 0x18d9b4588>,
 <__main__.Exercise at 0x18d0f3cc0>,
 <__main__.Exercise at 0x18d0f3208>,
 <__main__.Exercise at 0x18d5b0898>,
 <__main__.Exercise at 0x18d5b06a0>,
 <__main__.Exercise at 0x18d5b4860>,
 <__main__.Exercise at 0x18d5b4c88>,
 <__main__.Exercise at 0x18d5b4390>,
 <__main__.Exercise at 0x18d0bd5c0>,
 <__main__.Exercise at 0x18d0bd390>,
 <__main__.Exercise at 0x18d0b9c18>,
 <__main__.Exercise at 0x18d0b9160>,
 <__main__.Exercise at 0x18dc6cac8>,
 <__main__.Exercise at 0x18dc6c0b8>,
 <__main__.Exercise at 0x18dc6fd30>,
 <__main__.Exercise at 0x18dc6f518>,
 <__main__.Exercise at 0x18d863f28>,
 <__main__.Exercise at 0x18d863710>,
 <__main__.Exercise at 0x18d859978>,
 <__main__.Exercise at 0x18d85deb8>,
 <__main__.Exercise at 0x18d85d438>,
 <__main__.Exercise at 0x18d862978>,
 <__main__.Exercise at 0x18dd7ddd8>,
 

Get the first exercise this person did:

In [50]:
train_x[0].exercises[0]

<__main__.Exercise at 0x185507b38>

Examine the raw text of the exercise:

In [78]:
train_x[0].exercises[0].textlist

['# user:XEinXf5+  countries:CO  days:0.003  client:web  session:lesson  format:reverse_translate  time:9',
 'DRihrVmh0101  I             I             PRON    case=Nom|prontype=Prs|fpos=PRON++PRP|number=Sing|person=1               nsubj        4  0',
 'DRihrVmh0102  am            be            AUX     mood=Ind|fpos=AUX++VBP|number=Sing|person=1|tense=Pres|verbform=Fin     cop          4  0',
 'DRihrVmh0103  a             a             DET     prontype=Art|definite=Ind|fpos=DET++DT                                  det          4  0',
 'DRihrVmh0104  boy           boy           NOUN    fpos=NOUN++NN|number=Sing                                               root         0  0']
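
The first entry of `textlist` is a header of whitespace-separated `key:value` pairs. A minimal sketch of pulling it apart follows; `parse_exercise_header` is a hypothetical helper written for this example, not a function from `processing.py`.

```python
def parse_exercise_header(line):
    """Turn '# user:X  days:0.003 ...' into a dict of string values."""
    fields = {}
    # Drop the leading '#', then split on any run of whitespace.
    for part in line.lstrip('#').split():
        key, _, value = part.partition(':')
        fields[key] = value
    return fields


header = ('# user:XEinXf5+  countries:CO  days:0.003  client:web  '
          'session:lesson  format:reverse_translate  time:9')
fields = parse_exercise_header(header)
days = float(fields['days'])  # numeric fields still need converting
```

All values come back as strings, so numeric fields such as `days` and `time` have to be converted explicitly.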

Some of the other features defined on the exercise:

In [80]:
train_x[0].exercises[1].features

{'client:web': 1.0,
 'exercise_length': 4,
 'exercise_num': 1,
 'format:reverse_translate': 1.0,
 'log_time': 2.5649493574615367,
 'session:lesson': 1.0,
 'time': 12}
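
The `log_time` value above is numerically consistent with `log(1 + time)`. That transform is inferred from this one output rather than read from `processing.py`, so treat the sketch below as an assumption.

```python
import math


def log_time(t):
    """Assumed transform: natural log of (1 + t), so that t = 0 maps to 0."""
    return math.log(1.0 + t)


value = log_time(12)  # compare with the 'log_time' entry printed above
```

A log transform of this kind compresses the long tail of very slow responses, though tree-based models are largely insensitive to monotonic rescaling of a single feature.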

Each exercise has a list of Instance() instances, which are structured Python representations of the entries of the exercise:

In [58]:
train_x[0].exercises[0].instances

[<__main__.Instance at 0x185507ba8>,
 <__main__.Instance at 0x185507cc0>,
 <__main__.Instance at 0x185507e80>,
 <__main__.Instance at 0x18dd2f2b0>]

Each instance in turn has many features. Many are akin to the basic and position features described above, but there are also temporal features such as `root:erravg0`, which keeps track of an exponentially smoothed average of error probability.

Note: not all of these are present for the first items in the exercise list (this shows an instance from the last exercise) because the error average isn't defined yet.

In [68]:
train_x[0].exercises[-1].instances[0].features

{'dependency_label:nsubj': 1.0,
 'morphological_feature:case_Nom': 1.0,
 'morphological_feature:fpos_PRON++PRP': 1.0,
 'morphological_feature:gender_Fem': 1.0,
 'morphological_feature:number_Sing': 1.0,
 'morphological_feature:person_3': 1.0,
 'morphological_feature:prontype_Prs': 1.0,
 'next_pos:AUX': 1.0,
 'part_of_speech:PRON': 1.0,
 'prev_pos:None': 1.0,
 'root:encounters': 24,
 'root:encounters_lab': 21,
 'root:encounters_unlab': 3,
 'root:erravg0': 0.0,
 'root:erravg1': 0.0,
 'root:erravg2': 0.0,
 'root:erravg3': 0.0,
 'root:time_since_last_encounter': 1.1430000000000007,
 'root:time_since_last_label': 4.352,
 'root_pos:NOUN': 1.0,
 'token:encounters': 24,
 'token:encounters_lab': 21,
 'token:encounters_unlab': 3,
 'token:erravg0': 0.0,
 'token:erravg1': 0.0,
 'token:erravg2': 0.0,
 'token:erravg3': 0.0,
 'token:she_en': 1.0,
 'token:time_since_last_encounter': 1.1430000000000007,
 'token:time_since_last_label': 4.352,
 'word_length': 6}

## Examining user properties

`days` is not currently a feature, but it is a value in the header of each exercise that says how long after the person started Duolingo the current exercise was completed. These intervals might index something about user engagement or consistency and so might be interesting. The following cells step through an example so you can see how to analyze and possibly add this feature.

In [95]:
import plotly

import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

This draws a scatter plot where the x location of each point is the time in days since the first session (so whole numbers are 24-hour intervals). If you try different `user_number` values you can see how different people used it.

In [102]:
user_number = 0
x_days = [exercise.days for exercise in train_x[user_number].exercises]  # iterate over the selected user's own exercises

trace = go.Scatter(
    x = x_days,
    y = [1.0]*len(x_days),
    mode='markers',
    marker=dict(opacity=0.2)
)

data = [trace]
py.iplot(data, filename='basic')

This takes the number of days for each exercise modulo 1.0 (i.e., the fraction of a day) and then plots the resulting data as a histogram. Most of the plots then show a bimodal distribution, with the gap presumably corresponding to nighttime. Is there something interesting there about consistency in time and performance?

In [116]:
user_number = 5
x_days = [train_x[user_number].exercises[i].days%1.0 for i in range(len(train_x[user_number].exercises))]


data = [go.Histogram(x=x_days)]

py.iplot(data, filename='basic')
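
If the fraction-of-day value turns out to be useful, it could be turned into features along these lines. This is only a sketch: the function and key names are hypothetical, and because `days` is measured from each user's first session, the "hour" here is an offset relative to that start time, not a clock hour.

```python
def time_of_day_features(days):
    """Hypothetical features from the fractional part of a `days` value."""
    day_fraction = days % 1.0
    # Bin into 24 slots; these are offsets from the user's first session,
    # not wall-clock hours.
    hour_bin = int(day_fraction * 24)
    return {
        'day_fraction': day_fraction,
        'hour_bin:{}'.format(hour_bin): 1.0,
    }


early = time_of_day_features(0.003)
later = time_of_day_features(2.5)
```

The one-hot `hour_bin` key lets a tree split on individual time slots, while `day_fraction` keeps the continuous value.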

This is everyone we loaded originally (`NUSERS` users) together:

In [120]:
times = []
for user_number in range(len(train_x)):
    for i in range(len(train_x[user_number].exercises)):
        times.append(train_x[user_number].exercises[i].days%1.0)
    

data = [go.Histogram(x=times)]

py.iplot(data, filename='basic')
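
One possible way to turn the `days` values into per-user engagement features is sketched below. The function and feature names are hypothetical, and nothing here has been checked against model performance.

```python
def engagement_features(days):
    """Summarise how regularly a user practised, given their `days` values."""
    ordered = sorted(days)
    # Gaps (in days) between consecutive exercises.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    # Number of distinct 24-hour windows (since first session) with activity.
    active_days = len({int(d) for d in ordered})
    return {
        'n_exercises': len(ordered),
        'active_days': active_days,
        'mean_gap': sum(gaps) / len(gaps) if gaps else 0.0,
        'max_gap': max(gaps) if gaps else 0.0,
    }


example = engagement_features([0.0, 0.5, 1.0, 3.0])
```

For a loaded user this could be called as `engagement_features([ex.days for ex in train_x[0].exercises])`, using the same `days` attribute as the plots above.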