Duolingo Shared Task on Second Language Acquisition Modeling
This repository contains code for running the 2nd-place (Spanish-to-English) and 3rd-place (English-to-Spanish and French-to-English) models in the Duolingo SLAM competition. The paper describing our approach can be found here.
Acquiring the data
Download the data from here and unzip it into the "data" folder.
Running the model
To preprocess the data, run preprocess_syntax.py on each data file. See the file's docstring for details on setting up Google SyntaxNet. Then run translate_frequency.py to generate external word-frequency features.
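The exact featurization in translate_frequency.py is not spelled out in this README; as a minimal sketch, a word-frequency feature from an external corpus might be a smoothed log relative frequency (the function name, smoothing, and lowercasing below are illustrative assumptions, not the script's actual implementation):

```python
import math
from collections import Counter

def frequency_features(tokens, corpus_counts, total):
    """Illustrative sketch: map each token to a log-frequency feature
    from external corpus counts. Add-one smoothing keeps unseen words
    finite; the real script's scheme may differ."""
    feats = {}
    for tok in tokens:
        count = corpus_counts.get(tok.lower(), 0)
        feats[tok] = math.log((count + 1) / (total + 1))
    return feats

# Toy corpus counts; a real run would use a large external corpus.
counts = Counter({"the": 1000, "perro": 3})
feats = frequency_features(["the", "perro", "xyzzy"], counts, total=2000)
```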
The model can then be trained to produce predictions on the dev set using lightgbm_dev.py or on the test set using lightgbm_script.py. The language trained on (an individual language or all of them) and the number of users trained on can be controlled via the scripts' command-line flags. Models trained on each individual language can then be averaged with a model trained on all languages.
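The averaging step itself is not detailed in this README; a minimal sketch, assuming per-item probability predictions and a uniform blend weight (both assumptions, not the repository's actual scheme), could look like:

```python
def blend(per_lang, all_lang, weight=0.5):
    """Illustrative sketch: average predictions from a per-language
    model with predictions from an all-languages model. Uniform
    weighting is assumed here for simplicity."""
    assert per_lang.keys() == all_lang.keys()
    return {item_id: weight * per_lang[item_id]
                     + (1.0 - weight) * all_lang[item_id]
            for item_id in per_lang}

# Toy predictions keyed by item id.
mixed = blend({"i1": 0.2, "i2": 0.8}, {"i1": 0.4, "i2": 0.6})
```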
Testing model lesions
To test the effects of removing different feature sets, first run preprocess_to_pickle.py to create a pickled version of the data and cut down on preprocessing time across different lesions. Then run the training script with the --lesion flag to choose which lesion experiment to conduct. See the code or the paper for the list of options.
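Conceptually, a lesion drops one feature set before training so its contribution can be measured; a minimal sketch (the feature-group prefixes below are hypothetical names, not the repository's actual feature identifiers) might be:

```python
def apply_lesion(features, lesion_prefixes):
    """Illustrative sketch: remove every feature whose name starts with
    one of the lesioned prefixes, leaving the rest for training."""
    return {name: value for name, value in features.items()
            if not any(name.startswith(p) for p in lesion_prefixes)}

# Hypothetical feature row; "syntax_" stands in for one feature set.
row = {"syntax_pos": 1, "syntax_dep": 2, "freq_log": -3.2, "user_id": 7}
lesioned = apply_lesion(row, ["syntax_"])
```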
The results of the lesions can be plotted using graph_lesions.r (in R, not Python).