# Capstone: English to Spanish Morphosyntax Order of Acquisition

Kelly Slatery | US-DSI-10 | 03.13.2020

## Problem Statement

Goal: Using data from the [2018 Duolingo Shared Task Challenge on Second Language Acquisition Modeling (SLAM)](https://sharedtask.duolingo.com/2018.html), extract a common order of acquisition for morphosyntactic features of Spanish among users learning Spanish on the English platform. Morphosyntactic features are the parts of a word that carry grammatical information, such as the plural -s or the different conjugation endings in Spanish.

We will look at data from English speakers learning Spanish for three major reasons:
- The U.S. has one of the highest populations of monolinguals in the world.
- Foreign language education in the U.S. is subpar and needs reform based on research.
- Spanish is increasingly important in the U.S. as the U.S. Spanish-speaking population grows and generations continue to learn and use Spanish.

The applications of discovering and supporting a data-informed order of acquisition for morphosyntactic features of Spanish are two-fold. First, understanding the natural order in which students acquire different features of Spanish can help Spanish textbook writers and Spanish teachers alike structure their lessons in a way more conducive to learning. Second, understanding this natural order can improve motivation in Spanish learning by adjusting both teachers' and students' expectations for student learning, which will hopefully increase the effectiveness of error correction and student self-esteem.

## Proposed Methods and Models

First, the data needs to be compiled into a DataFrame for easier analysis and filtered to only those sessions where the format is 'reverse_translate' and the session is either 'lesson' or 'practice'. This will most likely require AWS, SQL, or perhaps just cutting down on the data used. Next, as the basis for the classification models, we will label a certain subset of the train data prompt tokens for which possible errors can be made (Ex. {1: plural, 2: ser vs estar, 3: regular past tense, 4: irregular past tense, ...}). Using these generated labels, we can build a model to generate labels for the remaining millions of tokens.
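
The filtering step can be sketched in plain Python before any DataFrame is built. This is a minimal sketch, assuming each exercise has already been parsed into a dict with `format` and `session` keys (the field names come from the SLAM data description). The helper name `keep_exercise` and the sample records are hypothetical.

```python
# Hypothetical helper: keep only reverse_translate exercises
# from lesson or practice sessions.
KEEP_FORMAT = "reverse_translate"
KEEP_SESSIONS = {"lesson", "practice"}


def keep_exercise(record):
    """Return True if the record passes the format and session filters."""
    return (record.get("format") == KEEP_FORMAT
            and record.get("session") in KEEP_SESSIONS)


# Made-up records for illustration only (not taken from the dataset).
sample = [
    {"format": "reverse_translate", "session": "lesson"},
    {"format": "listen", "session": "lesson"},
    {"format": "reverse_translate", "session": "test"},
    {"format": "reverse_translate", "session": "practice"},
]

filtered = [r for r in sample if keep_exercise(r)]
print(len(filtered))  # 2
```

Filtering while reading the file, rather than after loading everything, may also reduce how much data has to be held in memory.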

This project will consist of three models:

- Using a classification model, most likely an ensemble decision tree (random forest, bagged classifier, extra trees), create a label for what error a user made in each prompt. This will mainly be based on the column labeling error/no error [0,1] and the accompanying word (token) and its details (Ex. {1: plural, 2: ser vs estar, 3: regular past tense, 4: irregular past tense, ...}). Then, to identify which errors are most common across the board on each prompt, we will sort the data by prompt and find the value counts for the labeled error column.

- Using another classification model, most likely an ensemble decision tree (random forest, bagged classifier, extra trees), create a label for what errors are possible, given a prompt. This will be based mainly on what words (tokens) are present in the prompt and their token details.

- Using a neural network, predict whether the student will make the possible morphosyntactic error on the next prompt. The inputs will be the available features, such as days and time, along with a newly calculated field indicating whether the user made the labeled morphosyntactic error (combining the possible-error label above, the type-of-error label, and the provided [0,1] label, where 1 indicates that the student made an error).
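
As a starting point for the manual labeling that feeds the second model, the "possible error" categories could be seeded with simple rules over the token and its Universal Dependencies features. The sketch below is only an illustration: the category names follow the examples in the text, while the rules, the function name `possible_errors`, and the small `SER_ESTAR_FORMS` list are assumptions.

```python
# Hypothetical rule-based seed labeler for "possible error" categories.
# The rules below are assumptions, not part of the Duolingo data or code.
SER_ESTAR_FORMS = {"soy", "eres", "es", "somos", "son",
                   "estoy", "estás", "está", "estamos", "están"}


def possible_errors(token, morph):
    """Return the error categories a learner could plausibly make on a token.

    token: the surface word.
    morph: dict of UD morphological features, e.g. {"Number": "Plur"}.
    """
    labels = []
    if morph.get("Number") == "Plur":
        labels.append("plural")
    if token.lower() in SER_ESTAR_FORMS:
        labels.append("ser vs estar")
    if morph.get("Tense") == "Past":
        # Splitting regular from irregular past would need a verb lexicon
        # (not shown here).
        labels.append("past tense")
    return labels


print(possible_errors("niños", {"Gender": "Masc", "Number": "Plur"}))
print(possible_errors("Está", {"Mood": "Ind", "Number": "Sing", "Tense": "Pres"}))
```

Rules like these would only produce candidate labels. The manually checked subset would still be needed to train and evaluate the classifiers.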


## Objective and Success Metrics

Objective: Produce a table describing the order of acquisition of Spanish morphosyntactic features for English-speaking learners, and describe which token features are most represented in errors. 

Success Metrics: 

For the [2018 Duolingo Shared Task Challenge on Second Language Acquisition Modeling (SLAM)](https://sharedtask.duolingo.com/2018.html), the success metrics used were AUC and F1 scores. These metrics also make sense for our third prediction model, as well as for our two classification models. For the classification models, we will be labeling data manually and then using a model to label further data. For the third model, given that there is a development and test dataset, we will be testing our model's error predictions on labeled data (supervised learning) and evaluating success based on the accuracy of these predictions, which works with AUC and F1.

The predictions will build on the two models above, which label morphosyntactic features through supervised learning from manually labeled data. Our error predictions then use these classifications to determine whether a student will make an error on given words in a given prompt at a given time in their study.
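
To make the two metrics concrete, here is a small pure-Python sketch of AUC (computed as the fraction of positive/negative pairs ranked correctly) and F1 at a fixed threshold. The function names, the 0.5 threshold, and the toy labels and scores are assumptions for illustration. On the full dataset, a library implementation such as scikit-learn's `roc_auc_score` and `f1_score` would be the more practical choice.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the data size,
    so this is only suitable for small illustrations.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1(labels, scores, threshold=0.5):
    """F1 score after turning scores into 0/1 predictions at a threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 1 means the student made an error on the token.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(auc(y_true, y_score))  # 0.75
print(f1(y_true, y_score))
```

AUC depends only on how the scores rank the tokens, while F1 depends on the chosen threshold, so the two can move independently.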

## Risks and Assumptions

Risks / Perceived Challenges:
- The dataset is quite large, with over 2 million tokens from over 6,000 users worldwide. It is impossible to analyze all of this data the way we've been doing in class, so it will be necessary to either cut down the data or upload it to AWS and figure out how AWS connects to Jupyter notebooks.
- The data is not yet in a DataFrame and I am only comfortable as of yet performing data analysis on pandas DataFrames. Because the data seems to be compatible with a database-style organization, SQL might be helpful. I like using SQL, but I have no idea how to integrate it into what I'm doing or build a database from the data.
- I may only be able to finish the first model of classifying error types by graduation day. Though I could finish it later, the first model is only a stepping stone, and may not be as interesting or as helpful to my audience (classmates, employers, etc.).

Assumptions:
- This data does not provide the student response. Rather, it indicates which tokens in the user-provided translation contain an error. Thus, we have to assume that the user's mistake was, in fact, morphosyntactic if the error was made on a word that should contain one of the morphosyntactic qualities we are looking for (and labeled [0,1,...,n]).
- In labeling morphosyntactic error types, I will have to assume different groups of error types without being aware of their representation in the dataset. I will base these error labels on previous research on the order of acquisition for Spanish morphosyntactic features, but perhaps it is this grouping that needs to be reanalyzed in future studies of order of acquisition.
- Lessons are often based on only one morphosyntactic feature at a time, so this will need to be taken into account in our models. It may make sense to compare only practice-session errors, rather than lesson errors. On the other hand, previously learned morphosyntactic features are most likely included in future lesson prompts as well, so it may not make a huge difference, except that we need to compare errors by feature over time rather than errors at a single time across features.
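
Comparing errors by feature over time could look something like the sketch below, which tallies an error rate per feature per whole day of study. It assumes each token has already been given a feature label. The function name, the `(feature, days, label)` row layout, and the sample rows are hypothetical.

```python
from collections import defaultdict


def error_rate_by_feature_and_day(rows):
    """Compute error rates per feature and per whole day of study.

    rows: iterable of (feature, days, label) tuples, where label 1 = error.
    Returns {feature: {day: error_rate}} with day = int(days).
    """
    # counts[feature][day] holds [number of errors, number of tokens]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for feature, days, label in rows:
        cell = counts[feature][int(days)]
        cell[0] += int(label)
        cell[1] += 1
    return {
        feature: {day: errors / total
                  for day, (errors, total) in sorted(by_day.items())}
        for feature, by_day in counts.items()
    }


# Made-up rows for illustration only.
rows = [
    ("plural", 0.2, 1),
    ("plural", 0.7, 0),
    ("plural", 3.1, 0),
    ("ser vs estar", 3.5, 1),
]

rates = error_rate_by_feature_and_day(rows)
print(rates)
```

A feature whose error rate drops early and stays low would be a candidate for "acquired early". How to set that cutoff is left open here.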

## Summary

The data summary below is quoted from the "Prediction Task" and "Data Format" sections of the [2018 Duolingo Shared Task Challenge on Second Language Acquisition Modeling (SLAM)](https://sharedtask.duolingo.com/2018.html):

"In these exercises, students construct answers in the L2 they are learning, and make various mistakes along the way. The goal of this task is to predict future mistakes that learners of English, Spanish, and French will make based on a history of the mistakes they have made in the past. More specifically, the data set contains more than 2 million tokens (words) from answers submitted by more than 6,000 Duolingo students over the course of their first 30 days.

We provide token-level labels and dependency parses for the most similar correct answer to each student submission. 

...

Most tokens (about 83%) are perfect matches and are given the label 0 for "OK." Tokens that are missing or spelled incorrectly (ignoring capitalization, punctuation, and accents) are given the label 1 denoting a mistake.

Note: For this task, we provide labels but not actual student responses.

...

The data format is inspired by the Universal Dependencies CoNLL-U format. Each student exercise is represented by a group of lines separated by a blank line: one token per line prepended with exercise-level metadata.

...

The first line of each exercise group (beginning with #) contains the following metadata about the student, session, and exercise:

- user: a B64 encoded, 8-digit, anonymized, unique identifier for each student (may include / or + characters)
- countries: a pipe (|) delimited list of 2-character country codes from which this user has done exercises
- days: the number of days since the student started learning this language on Duolingo
- client: the student's device platform (one of: android, ios, or web)
- session: the session type (one of: lesson, practice, or test; explanation below)
- format: the exercise format (one of: reverse_translate, reverse_tap, or listen; see figures above)
- time: the amount of time (in seconds) it took for the student to construct and submit their whole answer (note: for some exercises, this can be null due to data logging issues)

These fields are separated by whitespaces on the same line, and key:value pairs are denoted with a colon (:).

The lesson sessions (about 77% of the data set) are where new words or concepts are introduced, although lessons also include a lot previously-learned material (e.g., each exercise tries to introduce only one new word or tense, so all other tokens should have been seen by the student before). The practice sessions (22%) should contain only previously-seen words and concepts. The test sessions (1%) are quizzes that allow a student "skip" a particular skill unit of the curriculum (i.e., the student may have never seen this content before in the Duolingo app, but may well have had prior knowledge before starting the course).

The remaining lines in each exercise group represent each token (word) in the correct answer that is most similar to the student's answer, one token per line, arranged into seven (7) columns separated by whitespaces:

- A Unique 12-digit ID for each token instance: the first 8 digits are a B64-encoded ID representing the session, the next 2 digits denote the index of this exercise within the session, and the last 2 digits denote the index of the token (word) in this exercise
- The token (word)
- Part of speech in Universal Dependencies (UD) format
- Morphological features in UD format
- Dependency edge label in UD format
- Dependency edge head in UD format (this corresponds to the last 1-2 digits of the ID in the first column)
- The label to be predicted (0 or 1)

All dependency features (columns 3-6) are generated by the Google SyntaxNet dependency parser using the language-agnostic Universal Dependencies tagset. (In other words, these morpho-syntactic features should be comparable across all three tracks in the shared task. Note that SyntaxNet isn't perfect, so parse errors may occur.)

The only difference between TRAIN and DEV/TEST set formats is that the final column (labels) will be omitted from the DEV/TEST set files. The first column (unique instance IDs) are also used for the submission output format."
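
To make the format description concrete, here is a minimal sketch of reading one metadata line and splitting one token ID. The header line and token ID are made up for illustration, and the helper names `parse_metadata` and `split_token_id` are hypothetical. The baseline's own `load_data` function performs the real parsing, including type conversion.

```python
def parse_metadata(line):
    """Parse a '# key:value key:value ...' exercise header into a dict.

    All values are kept as strings; no type conversion is attempted.
    """
    fields = line.lstrip("#").split()
    # Split on the first colon only, in case a value contains one.
    return dict(field.split(":", 1) for field in fields)


def split_token_id(token_id):
    """Split a 12-character token ID into its three documented parts.

    Returns (session ID, exercise index in session, token index in exercise).
    """
    return token_id[:8], int(token_id[8:10]), int(token_id[10:12])


# Made-up header and ID, shaped like the description above.
header = ("# user:AAAAAAAA countries:US days:1.5 client:web "
          "session:practice format:reverse_translate time:12")

meta = parse_metadata(header)
print(meta["session"], meta["format"])  # practice reverse_translate
print(split_token_id("AAAAAAAA0102"))   # ('AAAAAAAA', 1, 2)
```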

## Self-Notes

In [1]:
# To run in terminal:
# python baseline.py --train ../data_es_en/es_en.slam.20190204.train --test ../data_es_en/es_en.slam.20190204.test

# Duolingo Code

The following classes and functions are from a file called __*baseline.py*__, publicly available from [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/8SWHNO). The code and accompanying datasets were released as part of the [2018 Duolingo Shared Task Challenge on Second Language Acquisition Modeling (SLAM)](https://sharedtask.duolingo.com/2018.html).

- Source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/8SWHNO 
- Resource: https://sharedtask.duolingo.com/2018.html

In [None]:
"""
Duolingo SLAM Shared Task - Baseline Model

This baseline model loads the training and test data that you pass in via --train and --test arguments for a particular
track (course), storing the resulting data in InstanceData objects, one for each instance. The code then creates the
features we'll use for logistic regression, storing the resulting LogisticRegressionInstance objects, then uses those to
train a regularized logistic model with SGD, and then makes predictions for the test set and dumps them to a CSV file
specified with the --pred argument, in a format appropriate to be read in and graded by the eval.py script.

We elect to use two different classes, InstanceData and LogisticRegressionInstance, to delineate the boundary between
the two purposes of this code; the first being to act as a user-friendly interface to the data, and the second being to
train and run a baseline model as an example. Competitors may feel free to use InstanceData in their own code, but
should consider replacing the LogisticRegressionInstance with a class more appropriate for the model they construct.

This code is written to be compatible with both Python 2 or 3, at the expense of dependency on the future library. This
code does not depend on any other Python libraries besides future.
"""

In [3]:
# Imports
import argparse
from collections import defaultdict, namedtuple
from io import open
import math
import os
from random import shuffle, uniform

from future.builtins import range
from future.utils import iteritems

# Sigma is the L2 prior variance, regularizing the baseline model. Smaller sigma means more regularization.
_DEFAULT_SIGMA = 20.0

# Eta is the learning rate/step size for SGD. Larger means larger step size.
_DEFAULT_ETA = 0.1

In [4]:
def main():
    """
    Executes the baseline model. This loads the training data, training labels, and dev data, then trains a logistic
    regression model, then dumps predictions to the specified file.

    Modify the middle of this code, between the two commented blocks, to create your own model.
    """

    parser = argparse.ArgumentParser(description='Duolingo shared task baseline model')
    parser.add_argument('--train', help='Training file name' , required=True)
    parser.add_argument('--test', help='Test file name, to make predictions on' , required=True)
    parser.add_argument('--pred', help='Output file name for predictions, defaults to test_name.pred')
    args = parser.parse_args()

    if not args.pred:
        args.pred = args.test + '.pred'

    assert os.path.isfile(args.train)
    assert os.path.isfile(args.test)

    # Assert that the train course matches the test course
    assert os.path.basename(args.train)[:5] == os.path.basename(args.test)[:5]

    training_data, training_labels = load_data(args.train)
    test_data = load_data(args.test)

    ####################################################################################
    # Here is the delineation between loading the data and running the baseline model. #
    # Replace the code between this and the next comment block with your own.          #
    ####################################################################################

    training_instances = [LogisticRegressionInstance(features=instance_data.to_features(),
                                                     label=training_labels[instance_data.instance_id],
                                                     name=instance_data.instance_id
                                                     ) for instance_data in training_data]

    test_instances = [LogisticRegressionInstance(features=instance_data.to_features(),
                                                 label=None,
                                                 name=instance_data.instance_id
                                                 ) for instance_data in test_data]

    logistic_regression_model = LogisticRegression()
    logistic_regression_model.train(training_instances, iterations=10)

    predictions = logistic_regression_model.predict_test_set(test_instances)

    ####################################################################################
    # This ends the baseline model code; now we just write predictions.                #
    ####################################################################################

    with open(args.pred, 'wt') as f:
        for instance_id, prediction in iteritems(predictions):
            f.write(instance_id + ' ' + str(prediction) + '\n')

In [5]:
def load_data(filename):
    """
    This method loads and returns the data in filename. If the data is labelled training data, it returns labels too.

    Parameters:
        filename: the location of the training or test data you want to load.

    Returns:
        data: a list of InstanceData objects from that data type and track.
        labels (optional): if you specified training data, a dict of instance_id:label pairs.
    """

    # 'data' stores a list of 'InstanceData's as values.
    data = []

    # If this is training data, then 'labels' is a dict that contains instance_ids as keys and labels as values.
    training = False
    if filename.find('train') != -1:
        training = True

    if training:
        labels = dict()

    num_exercises = 0
    print('Loading instances...')
    instance_properties = dict()

    with open(filename, 'rt') as f:
        for line in f:
            line = line.strip()

            # If there's nothing in the line, then we're done with the exercise. Print if needed, otherwise continue
            if len(line) == 0:
                num_exercises += 1
                if num_exercises % 100000 == 0:
                    print('Loaded ' + str(len(data)) + ' instances across ' + str(num_exercises) + ' exercises...')
                instance_properties = dict()

            # If the line starts with #, then we're beginning a new exercise
            elif line[0] == '#':
                if 'prompt' in line:
                    instance_properties['prompt'] = line.split(':')[1]
                else:
                    list_of_exercise_parameters = line[2:].split()
                    for exercise_parameter in list_of_exercise_parameters:
                        [key, value] = exercise_parameter.split(':')
                        if key == 'countries':
                            value = value.split('|')
                        elif key == 'days':
                            value = float(value)
                        elif key == 'time':
                            if value == 'null':
                                value = None
                            else:
                                assert '.' not in value
                                value = int(value)
                        instance_properties[key] = value

            # Otherwise we're parsing a new Instance for the current exercise
            else:
                line = line.split()
                if training:
                    assert len(line) == 7
                else:
                    assert len(line) == 6
                assert len(line[0]) == 12

                instance_properties['instance_id'] = line[0]

                instance_properties['token'] = line[1]
                instance_properties['part_of_speech'] = line[2]

                instance_properties['morphological_features'] = dict()
                for l in line[3].split('|'):
                    [key, value] = l.split('=')
                    if key == 'Person':
                        value = int(value)
                    instance_properties['morphological_features'][key] = value

                instance_properties['dependency_label'] = line[4]
                instance_properties['dependency_edge_head'] = int(line[5])
                if training:
                    label = float(line[6])
                    labels[instance_properties['instance_id']] = label
                data.append(InstanceData(instance_properties=instance_properties))

        print('Done loading ' + str(len(data)) + ' instances across ' + str(num_exercises) +
              ' exercises.\n')

    if training:
        return data, labels
    else:
        return data

In [6]:
class InstanceData(object):
    """
    A bare-bones class to store the included properties of each instance. This is meant to act as easy access to the
    data, and provides a launching point for deriving your own features from the data.
    """
    def __init__(self, instance_properties):

        # Parameters specific to this instance
        self.instance_id = instance_properties['instance_id']
        self.token = instance_properties['token']
        self.part_of_speech = instance_properties['part_of_speech']
        self.morphological_features = instance_properties['morphological_features']
        self.dependency_label = instance_properties['dependency_label']
        self.dependency_edge_head = instance_properties['dependency_edge_head']

        # Derived parameters specific to this instance
        self.exercise_index = int(self.instance_id[8:10])
        self.token_index = int(self.instance_id[10:12])

        # Derived parameters specific to this exercise
        self.exercise_id = self.instance_id[:10]

        # Parameters shared across the whole session
        self.user = instance_properties['user']
        self.countries = instance_properties['countries']
        self.days = instance_properties['days']
        self.client = instance_properties['client']
        self.session = instance_properties['session']
        self.format = instance_properties['format']
        self.time = instance_properties['time']
        self.prompt = instance_properties.get('prompt', None)

        # Derived parameters shared across the whole session
        self.session_id = self.instance_id[:8]

    def to_features(self):
        """
        Prepares those features that we wish to use in the LogisticRegression example in this file. We introduce a bias,
        and take a few included features to use. Note that this dict restructures the corresponding features of the
        input dictionary, 'instance_properties'.

        Returns:
            to_return: a representation of the features we'll use for logistic regression in a dict. A key/feature is a
                key/value pair of the original 'instance_properties' dict, and we encode this feature as 1.0 for 'hot'.
        """
        to_return = dict()

        to_return['bias'] = 1.0
        to_return['user:' + self.user] = 1.0
        to_return['format:' + self.format] = 1.0
        to_return['token:' + self.token.lower()] = 1.0

        to_return['part_of_speech:' + self.part_of_speech] = 1.0
        for morphological_feature in self.morphological_features:
            to_return['morphological_feature:' + morphological_feature] = 1.0
        to_return['dependency_label:' + self.dependency_label] = 1.0

        return to_return

In [7]:
class LogisticRegressionInstance(namedtuple('Instance', ['features', 'label', 'name'])):
    """
    A named tuple for packaging together the instance features, label, and name.
    """
    def __new__(cls, features, label, name):
        if label:
            if not isinstance(label, (int, float)):
                raise TypeError('LogisticRegressionInstance label must be a number.')
            label = float(label)
        if not isinstance(features, dict):
            raise TypeError('LogisticRegressionInstance features must be a dict.')
        return super(LogisticRegressionInstance, cls).__new__(cls, features, label, name)

In [8]:
class LogisticRegression(object):
    """
    An L2-regularized logistic regression object trained using stochastic gradient descent.
    """

    def __init__(self, sigma=_DEFAULT_SIGMA, eta=_DEFAULT_ETA):
        super(LogisticRegression, self).__init__()
        self.sigma = sigma  # L2 prior variance
        self.eta = eta  # initial learning rate
        self.weights = defaultdict(lambda: uniform(-1.0, 1.0)) # weights initialize to random numbers
        self.fcounts = None # this forces smaller steps for things we've seen often before

    def predict_instance(self, instance):
        """
        This computes the logistic function of the dot product of the instance features and the weights.
        We truncate predictions at ~10^(-7) and ~1 - 10^(-7).
        """
        a = min(17., max(-17., sum([float(self.weights[k]) * instance.features[k] for k in instance.features])))
        return 1. / (1. + math.exp(-a))

    def error(self, instance):
        return instance.label - self.predict_instance(instance)

    def reset(self):
        self.fcounts = defaultdict(int)

    def training_update(self, instance):
        if self.fcounts is None:
            self.reset()
        err = self.error(instance)
        for k in instance.features:
            rate = self.eta / math.sqrt(1 + self.fcounts[k])
            # L2 regularization update
            if k != 'bias':
                self.weights[k] -= rate * self.weights[k] / self.sigma ** 2
            # error update
            self.weights[k] += rate * err * instance.features[k]
            # increment feature count for learning rate
            self.fcounts[k] += 1

    def train(self, train_set, iterations=10):
        for it in range(iterations):
            print('Training iteration ' + str(it+1) + '/' + str(iterations) + '...')
            shuffle(train_set)
            for instance in train_set:
                self.training_update(instance)
        print('\n')

    def predict_test_set(self, test_set):
        return {instance.name: self.predict_instance(instance) for instance in test_set}
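
Reading `predict_instance` and `training_update` together, the per-instance update appears to be the following. This is a restatement of the code above, not an additional method. The symbols $c_k$ (for `fcounts[k]`), $y$ (the label), $x_k$ (the feature value), and $p$ (the prediction) are notation introduced here.

```latex
a = \min\Bigl(17,\ \max\bigl(-17,\ \textstyle\sum_k w_k x_k\bigr)\Bigr),
\qquad
p = \frac{1}{1 + e^{-a}}

\eta_k = \frac{\eta}{\sqrt{1 + c_k}}

w_k \leftarrow w_k - \eta_k \, \frac{w_k}{\sigma^2}
\quad \text{(skipped for the bias feature)}

w_k \leftarrow w_k + \eta_k \, (y - p) \, x_k,
\qquad
c_k \leftarrow c_k + 1
```

The prediction $p$ is computed once per instance, before any weight changes. Features seen more often get a smaller step size $\eta_k$, and a smaller $\sigma$ shrinks weights more strongly.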

In [9]:
# if __name__ == '__main__':
#     main()

# Load Train Data

In [11]:
# Call the load_data function above on the train dataset to load data & labels
data, labels = load_data('../data_es_en/es_en.slam.20190204.train')

Loading instances...
Loaded 266882 instances across 100000 exercises...
Loaded 537453 instances across 200000 exercises...
Loaded 804717 instances across 300000 exercises...
Loaded 1075884 instances across 400000 exercises...
Loaded 1348070 instances across 500000 exercises...
Loaded 1620764 instances across 600000 exercises...
Loaded 1887018 instances across 700000 exercises...
Done loading 1973556 instances across 731896 exercises.



In [13]:
# Look at contents of data
data[:3]

[<__main__.InstanceData at 0x106ad6650>,
 <__main__.InstanceData at 0x106add710>,
 <__main__.InstanceData at 0x106e03f90>]

In [20]:
# Look at the first five labels without looping over the whole dict
from itertools import islice
for key, value in islice(labels.items(), 5):
    print(key, value)


d44lo//L0101 0.0
d44lo//L0102 0.0
d44lo//L0201 0.0
d44lo//L0202 0.0
d44lo//L0301 0.0


In [21]:
# Can we convert the data into a dataframe?
import pandas as pd

In [22]:
pd.DataFrame(data)

Unnamed: 0,0
0,<__main__.InstanceData object at 0x106ad6650>
1,<__main__.InstanceData object at 0x106add710>
2,<__main__.InstanceData object at 0x106e03f90>
3,<__main__.InstanceData object at 0x106ab8ad0>
4,<__main__.InstanceData object at 0x10703ca50>
...,...
1973551,<__main__.InstanceData object at 0x1d94b0d10>
1973552,<__main__.InstanceData object at 0x1d94b21d0>
1973553,<__main__.InstanceData object at 0x1d94b2510>
1973554,<__main__.InstanceData object at 0x1d94b2950>
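
The output above has a single column of objects because pandas has no way to look inside `InstanceData`. One possible workaround, sketched here under assumptions, is to turn each object into a dict of its attributes with the built-in `vars()` first. `FakeInstance` is a hypothetical stand-in for `InstanceData` with only a few attributes, and `to_records` is a hypothetical helper name.

```python
class FakeInstance(object):
    """Hypothetical stand-in for InstanceData, with only a few attributes."""

    def __init__(self, instance_id, token, user):
        self.instance_id = instance_id
        self.token = token
        self.user = user


def to_records(instances):
    """Convert objects into a list of {attribute_name: value} dicts."""
    return [dict(vars(obj)) for obj in instances]


# Made-up instances for illustration only.
records = to_records([
    FakeInstance("AAAAAAAA0101", "El", "u1"),
    FakeInstance("AAAAAAAA0102", "pan", "u1"),
])
print(records[0])
```

With pandas imported, `pd.DataFrame(records)` should then give one column per attribute. Nested values such as `morphological_features` would remain dicts in a single column and would probably need flattening separately.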


## Explore Class Attributes

In [39]:
# Helper to collect every attribute of an InstanceData object into a dictionary
# (additional function by Kelly Slatery)
def get_attributes(instance_data):
    '''Returns a dictionary of all attributes set on the given InstanceData object'''
    # vars() exposes the attributes assigned in InstanceData.__init__
    return dict(vars(instance_data))

#### data[0]

In [24]:
data[0].to_features()

{'bias': 1.0,
 'user:+H9QWAV4': 1.0,
 'format:listen': 1.0,
 'token:el': 1.0,
 'part_of_speech:DET': 1.0,
 'morphological_feature:Definite': 1.0,
 'morphological_feature:Gender': 1.0,
 'morphological_feature:Number': 1.0,
 'morphological_feature:PronType': 1.0,
 'morphological_feature:fPOS': 1.0,
 'dependency_label:det': 1.0}

In [26]:
data[0].instance_id

'd44lo//L0101'

In [27]:
data[0].token

'El'

In [28]:
data[0].part_of_speech

'DET'

In [23]:
data[0].morphological_features

{'Definite': 'Def',
 'Gender': 'Masc',
 'Number': 'Sing',
 'PronType': 'Art',
 'fPOS': 'DET++'}

In [29]:
data[0].dependency_label

'det'

In [30]:
data[0].dependency_edge_head

2

In [49]:
data[0].exercise_index

1

In [50]:
data[0].token_index

1

In [51]:
data[0].exercise_id

'd44lo//L01'

In [31]:
data[0].user

'+H9QWAV4'

In [32]:
data[0].countries

['CA']

In [33]:
data[0].days

0.009

In [34]:
data[0].client

'ios'

In [35]:
data[0].session

'lesson'

In [36]:
data[0].format

'listen'

In [37]:
data[0].time

9

In [38]:
data[0].prompt

#### data[1]

In [42]:
data[1].to_features()

{'bias': 1.0,
 'user:+H9QWAV4': 1.0,
 'format:listen': 1.0,
 'token:pan': 1.0,
 'part_of_speech:NOUN': 1.0,
 'morphological_feature:Gender': 1.0,
 'morphological_feature:Number': 1.0,
 'morphological_feature:fPOS': 1.0,
 'dependency_label:ROOT': 1.0}

In [43]:
data[1].morphological_features

{'Gender': 'Masc', 'Number': 'Sing', 'fPOS': 'NOUN++'}