# Duolingo SLAM Explorer

We are focused on creating new features to use with gradient boosted trees (Microsoft LightGBM):

---
So far Alex created features that fall into these categories:

**Basic word features:**  
These are a bit like the mental lexicon: definitions, stuff you look up in WordNet.  Noun? Verb?
Plural? Etc.  Many came for free from the dataset itself, and we aren't sure about adding much more here. We aren't word people anyway.
- word length
- morphological features
- tokenid (one-hot word index)
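
For anyone unsure what "one-hot word index" means, here is a minimal sketch. The helpers `build_index` and `one_hot` are made up for illustration and are not taken from `processing.py`; in a real pipeline a sparse encoder would most likely do this step instead of dense lists.

```python
def build_index(values):
    """Map each distinct value to a stable integer position (first seen = 0)."""
    index = {}
    for value in values:
        index.setdefault(value, len(index))
    return index


def one_hot(value, index):
    """Return a dense 0/1 list with a single 1 at the position of `value`."""
    vector = [0] * len(index)
    vector[index[value]] = 1
    return vector


# toy vocabulary, purely illustrative
token_index = build_index(['the', 'cat', 'the', 'dog'])
cat_vector = one_hot('cat', token_index)
```

The same idea applies to `userid`: every distinct user gets their own column.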

**Position/sequence features:**  
These are sort of like grammatical aspects because they capture something about sequential structure.
- previous word part of speech
- next word part of speech
- root word part of speech
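
As a rough illustration of the previous/next part-of-speech features, the sketch below attaches neighbor tags to each token of one exercise. The token dictionaries, the `add_neighbor_pos` name, and the use of `None` at sentence boundaries are assumptions for this example only; the root-word feature needs the dependency parse and is not covered here.

```python
def add_neighbor_pos(tokens):
    """Return copies of the token dicts with 'prev_pos' and 'next_pos' added.

    tokens: list of dicts, each with at least a 'pos' key (hypothetical layout).
    """
    enriched = []
    last = len(tokens) - 1
    for i, token in enumerate(tokens):
        features = dict(token)  # copy so the input is not modified
        features['prev_pos'] = tokens[i - 1]['pos'] if i > 0 else None
        features['next_pos'] = tokens[i + 1]['pos'] if i < last else None
        enriched.append(features)
    return enriched


# toy exercise: "I am happy"
example = add_neighbor_pos([
    {'token': 'I', 'pos': 'PRON'},
    {'token': 'am', 'pos': 'VERB'},
    {'token': 'happy', 'pos': 'ADJ'},
])
```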

**User features:**  
Features about the users themselves.
- userid (one-hot user index)

**Temporal features (per word):**  
- number of observations of a word (total, unlabeled, labeled)
- time since last observation (labeled, unlabeled)
- exponentially smoothed running average of probability of remembering (4 different fixed rates).  no decay here in the absence of information
- is it 1st encounter with word? (true/false)
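
To make the smoothed-average feature concrete, here is a minimal sketch of an exponentially smoothed running average tracked at several fixed rates. The rate values, the function name `update_error_averages`, and the choice to start the average at the first observed outcome are all assumptions, not what `processing.py` necessarily does. "No decay" here means the average only changes when a new labeled observation arrives; elapsed time alone leaves it untouched.

```python
# Assumed smoothing rates; the notebook only says there are 4 fixed rates.
RATES = (0.1, 0.3, 0.5, 0.9)


def update_error_averages(averages, error, rates=RATES):
    """Update one running average per rate after a single labeled observation.

    averages: tuple of current averages, or None before the first observation
    error: 1.0 if the word was missed, 0.0 if it was correct
    """
    if averages is None:
        # Assumption: the first observation initializes every average.
        return tuple(float(error) for _ in rates)
    return tuple((1.0 - r) * avg + r * error for avg, r in zip(averages, rates))


# one word seen four times: missed, correct, correct, missed
averages = None
for outcome in [1, 0, 0, 1]:
    averages = update_error_averages(averages, outcome)
```

Higher rates track the most recent outcomes; lower rates summarize the longer history.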

**Semantic features:**  
Not sure if these are particularly useful here.  Something about word meaning, similarities in meanings, e.g., positive or negative word, emotion?, some might be in the basic features, etc...
- none currently

---
Plan of attack for this weekend:  
1. [ ] Focus on user features (more information about user motivation, session structure, etc...).  (**Anselm is pursuing this**)
1. [ ] Focus on temporal features that capture spaced/massed practice.  (**Alex is pursuing this**) 
1. [ ] Model something about context (repeated contexts aid memory) (**Todd is pursuing this**)
1. [ ] Cognates and word similarity both in terms of letters and meaning (**Pam is pursuing this**)

In [1]:
import os
#from processing import build_data
import pandas as pd
#from sklearn.feature_extraction import DictVectorizer
#from sklearn.metrics import roc_auc_score
#import lightgbm as lgb
import numpy as np

In [2]:
# for longer computations
from IPython.display import Audio
notification = 'notification.wav'

<div class="alert alert-warning">
You can control which language pair you are working with here: options are `all`, `en_es` (reverse Spanish), `fr_en` (French), `es_en` (Spanish)
</div>

In [3]:
# use this to change language pair trained on
lang = 'en_es'

<div class="alert alert-warning">
The main script for parsing and constructing features is `processing.py`.  You should edit it in a different editor (e.g., sublime) and then run the cell below to re-load it into this jupyter kernel.
</div>

In [4]:
%run processing

In [None]:
# configuration options
# NUSERS = 10. # set this to a number to load only that many users
NUSERS = None # set this to None to load all the users for the given language
FEATUREIZED = False # set this to True to return the features as dict() instead of instances of the User() class

# load data
if lang == 'all':
    data = build_data(
        'all',
        [
            'data/data_{0}/{0}.slam.20171218.train.new'.format('en_es'),
            'data/data_{0}/{0}.slam.20171218.train.new'.format('fr_en'),
            'data/data_{0}/{0}.slam.20171218.train.new'.format('es_en')
        ],
        [
            'data/data_{0}/{0}.slam.20171218.dev.new'.format('en_es'),
            'data/data_{0}/{0}.slam.20171218.dev.new'.format('fr_en'),
            'data/data_{0}/{0}.slam.20171218.dev.new'.format('es_en')
        ],
        labelfiles=[
            'data/data_{0}/{0}.slam.20171218.dev.key'.format('en_es'),
            'data/data_{0}/{0}.slam.20171218.dev.key'.format('fr_en'),
            'data/data_{0}/{0}.slam.20171218.dev.key'.format('es_en')
        ],
        n_users=NUSERS, featurized=FEATUREIZED)
else:
    data = build_data(
        lang[:2],
        ['data/data_{0}/{0}.slam.20171218.train.new'.format(lang)],
        ['data/data_{0}/{0}.slam.20171218.dev.new'.format(lang)],
        labelfiles=['data/data_{0}/{0}.slam.20171218.dev.key'.format(lang)],
        n_users=NUSERS, featurized=FEATUREIZED)
train_x, train_ids, train_y, test_x, test_ids, test_y = data

Audio(url=notification, autoplay=True)  # play sound for longer computations

loading data files
retrieving labels
building features


In [None]:
train_x

If you ran with `FEATUREIZED = False` then the following cells will let you explore individual users:

---

## Exploring the basic data structures programmatically

Get the user id and the language out of the user object:

In [None]:
train_x[0].id, train_x[0].features['user'], train_x[0].features['lang']

New: get the user usage entropy:

In [None]:
train_x[0].entropy

A list of the exercises this user completed, each as an Exercise() instance:

In [None]:
train_x[0].exercises

Get the first exercise this person did:

In [None]:
train_x[0].exercises[0]

Examine the raw text of the exercise:

In [None]:
train_x[0].exercises[0].textlist

Some of the other features defined on the exercise:

In [None]:
train_x[0].exercises[24].features

Each exercise has a list of Instance() instances, which are Python structured representations of the entries of the exercise:

In [None]:
train_x[0].exercises[0].instances

Each instance itself has a lot of features, many of which are akin to the basic and position features described above, but also including temporal features such as `root:erravg0`, which keeps track of an exponentially smoothed average of error probability, etc.  

Note: these features aren't all present for the first items in the exercise list (this is showing the last item) because the error average isn't yet defined.

In [None]:
train_x[0].exercises[-1].instances[0].features

## Examining user properties

Days is not currently a feature but is a value in the header of each exercise that says how long after the person started Duolingo the current exercise was completed.  These intervals might index something about user engagement or consistency and so might be interesting.  The following cells step through an example so you can see how to analyze and possibly add this feature.

In [None]:
import plotly

# plotly.plotly was unused here and is no longer importable in plotly 4 and later
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

This draws a scatter plot where the x location of each point is the time in days since the first session (so whole numbers mark 24-hour intervals).   If you try different `user_number` values you can see how different people used the app.

In [None]:
user_number = 0
# index by user_number (not 0) so the range matches the selected user's exercises
x_days = [train_x[user_number].exercises[i].days for i in range(len(train_x[user_number].exercises))]

trace = go.Scatter(
    x = x_days,
    y = [1.0]*len(x_days),
    mode='markers',
    marker=dict(opacity=0.2)
)

data = [trace]
plot(data, filename='basic.html')

This takes the number of days for each session modulo 1 (i.e., keeps the fractional part of the day) and then plots the resulting data as a histogram.  Most of the plots then show a bimodal distribution, which presumably reflects nighttime.  Is there something interesting there about consistency in time and performance?

In [None]:
user_number = 5
x_days = [train_x[user_number].exercises[i].days%1.0 for i in range(len(train_x[user_number].exercises))]


data = [go.Histogram(x=x_days)]

plot(data, filename='basic.html')
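
For reference, the modulo step used in the cell above can be read as "hours into the current 24-hour cycle". The sketch below spells that out on made-up `days` values; `time_of_day_hours` is a hypothetical helper, not something defined in `processing.py`. Keep in mind that the cycle is counted from the user's first session, so hour 0 is not necessarily midnight.

```python
def time_of_day_hours(days):
    """Hours elapsed within the current 24-hour cycle since the first session."""
    return (days % 1.0) * 24.0


# made-up values: start, a quarter day in, a day and a half in, 2.75 days in
sample_days = [0.0, 0.25, 1.5, 2.75]
sample_hours = [time_of_day_hours(d) for d in sample_days]
print(sample_hours)  # [0.0, 6.0, 12.0, 18.0]
```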

This is everyone we loaded originally (NUSERS) together:

In [None]:
times = []
for user_number in range(len(train_x)):
    for i in range(len(train_x[user_number].exercises)):
        times.append(train_x[user_number].exercises[i].days%1.0)
    

data = [go.Histogram(x=times)]

plot(data, filename='basic.html')

## User features (Anselm)

In [None]:
# plotly.plotly was unused here and is no longer importable in plotly 4 and later
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

from scipy.stats import entropy
from collections import Counter

Load data with `NUSERS = None`.

This creates a U-shaped distribution. Explanation: day = 0.00 is the moment they started using the app, not midnight:

In [None]:
times = []
for user_number in range(len(train_x)):
    for i in range(len(train_x[user_number].exercises)):
        times.append(train_x[user_number].exercises[i].days%1.0)
        
data = [go.Histogram(x=times)]

iplot(data)

#### User experience = number of exercises
This has been added as a user feature to processing.py

#### Compute entropy for each user
This has been added as a user feature to processing.py:

In [None]:
def compute_usage_entropy(user_number, train_x):
    x_days = [train_x[user_number].exercises[i].days%1.0 for i in range(len(train_x[user_number].exercises))]
    x_bins = [round(x * 24 * 3) for x in x_days]  # 20-minute bins (rounded to the nearest 20-minute mark)
    freq = Counter(x_bins)
    rel_freq = [freq[key]/len(x_bins) for key in freq]
    return(entropy(rel_freq, base=2))

x = [compute_usage_entropy(user_number, train_x) for user_number in range(len(train_x))]
data = [go.Histogram(x=x)]
iplot(data)
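
To get a feel for the scale of these values, here is a small standard-library sketch of the same quantity (Shannon entropy in bits over bin frequencies). `shannon_entropy_bits` is a hypothetical helper written for this example and the bin lists are made up. A user who always practices in the same 20-minute slot gets 0 bits; a user spread evenly over four slots gets 2 bits. With the binning above there are at most 73 distinct bins, so the ceiling should be roughly 6.2 bits.

```python
import math
from collections import Counter


def shannon_entropy_bits(bins):
    """Shannon entropy (base 2) of the empirical distribution over bin labels."""
    counts = Counter(bins)
    n = len(bins)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


same_slot = shannon_entropy_bits([5, 5, 5, 5])    # always the same time of day
four_slots = shannon_entropy_bits([0, 1, 2, 3])   # evenly spread over 4 slots
```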

#### Variability in accuracy for each user
This has been added as a user feature to processing.py:

In [None]:
def compute_accuracy_variance(user_number, train_x):
    x_accuracy = [e.accuracy for e in train_x[user_number].exercises if not e.test]
    return(np.var(x_accuracy))

compute_accuracy_variance(0, train_x)

x = [compute_accuracy_variance(user_number, train_x) for user_number in range(len(train_x))]
data = [go.Histogram(x=x)]
iplot(data)

#### Continuing after mistakes
This has been added as a user feature to processing.py:

In [None]:
import copy
from scipy.stats import pearsonr  # the scipy.stats.stats namespace is deprecated

In [None]:
def compute_cor_mistakes_break(user_number, train_x):
    """Correlate mistakes per exercise with the time gap since the previous exercise."""
    exercise_mistakes = []
    exercise_days_diff = []
    days = 0
    for e in train_x[user_number].exercises:
        # count instances whose label is anything other than 0 (0 = correct)
        mistakes = sum(1 for i in e.instances if i.label != 0)
        exercise_mistakes.append(mistakes)
        exercise_days_diff.append(e.days - days)
        days = e.days  # save for next exercise
    return pearsonr(exercise_mistakes, exercise_days_diff)[0]

compute_cor_mistakes_break(0, train_x)
x = [compute_cor_mistakes_break(user_number, train_x) for user_number in range(len(train_x))]
data = [go.Histogram(x=x)]
iplot(data)
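
One thing to watch with this feature: a user whose mistake counts (or gaps) never vary has a constant series, and the Pearson correlation is undefined for constant input. The sketch below is one possible guard, written without SciPy so it stands alone. `safe_pearson` is a hypothetical helper, and returning `None` for the undefined case is an assumption about what would be convenient downstream, not what `processing.py` does.

```python
import math


def safe_pearson(xs, ys):
    """Pearson r, or None when it is undefined (too few points or zero variance)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    if sxx == 0 or syy == 0:
        return None  # constant series: correlation is undefined
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)


never_wrong = safe_pearson([0, 0, 0], [0.1, 0.5, 2.0])   # constant mistakes
perfect_line = safe_pearson([1, 2, 3], [2, 4, 6])        # perfectly correlated
```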

In [None]:
# single user example
user_number = 111
exercise_mistakes = []
exercise_days_diff = []
days = 0
for e in train_x[user_number].exercises:
    # count instances whose label is anything other than 0 (0 = correct)
    mistakes = sum(1 for i in e.instances if i.label != 0)
    exercise_mistakes.append(mistakes)
    exercise_days_diff.append(e.days - days)
    days = e.days  # save for next exercise
iplot(go.Figure(data=[go.Scatter(
    x = exercise_mistakes,
    y = exercise_days_diff,
    mode = 'markers'
)], layout=go.Layout(xaxis=dict(title='exercise_mistakes'), yaxis=dict(title='exercise_days_diff'))))

In [None]:
train_x[user_number].exercises[0].days
pearsonr([1,2,3],[0,2,3])[0]

## Exploration

#### User experience vs variability

In [None]:
x = [user.features['experience'] for user in train_x]
y = [compute_accuracy_variance(user_number, train_x) for user_number in range(len(train_x))]

data = [go.Scatter(
    x = x,
    y = y,
    mode = 'markers',
    marker = {'opacity':.2}
)]
layout = go.Layout(xaxis=dict(title='n_exercises'), yaxis=dict(title='variability'))
iplot(go.Figure(data=data, layout=layout))

#### Accuracy over time

In [None]:
user_number = 0
x = list(range(len(train_x[user_number].exercises)))
y = [train_x[user_number].exercises[i].accuracy for i in x]
iplot([{"x": x, "y": y}])

#### Histogram of exercise accuracies (0 = 100% correct):

In [None]:
x = []
for user_number in range(len(train_x)):
    for i in range(len(train_x[user_number].exercises)):
        x.append(train_x[user_number].exercises[i].accuracy)
        
data = [go.Histogram(x=x)]

iplot(data)