# NLP for Ed-tech answer / question generation

<a id = 'contents'></a>
### Contents:

[1. Introduction](#intro)

[2. Theoretical background](#theory)

[3. Cleaning, preprocessing and feature engineering](#clean)

[4. Model selection and design](#modelselection)

[5. Modelling the answer generator](#answer_gen)

[5.1 Model training and validation](#answer_gen_train)

[5.2 Model testing](#answer_gen_test)

[6. Modelling the question generator](#question_gen)

[6.1 Model training and validation](#question_gen_train)

[6.2 Model testing](#question_gen_test)

[7. Concluding remarks](#conclusion)

[8. Bibliography](#bib)

<a id = 'intro'></a>

## 1. Introduction

Recent advances in Natural Language Processing (NLP) have opened up new technical applications in a variety of fields. One domain that stands to benefit greatly from recent models is educational technology (edtech) and, by extension, all educational providers. For many edtech platforms a key deliverable is assessment content, which is most likely written painstakingly by human experts, with several plausible answers supplied for each question. An edtech company will automate as much of its content creation as possible, but generating comprehension questions and plausible answers that test a (human) learner's understanding is a complex, human-dependent task that is often deemed too difficult to implement. There are two key tasks that, for the moment, seem too human for a machine to perform: 1. writing a question about a paragraph of text (our document) and 2. writing several equally plausible answers, only one of which a human expert would recognise as truly correct.

The first task is in the domain of Natural Language Generation (NLG) and the second is a Natural Language Understanding (NLU) task. In this notebook I've written up the process by which I've attempted to build two machine learners dedicated to accomplishing the two tasks above in a way that would be most useful to an edtech provider. Section 2 is my 

<a id = 'theory'></a>

## 2. Theoretical background

<a id = 'clean'></a>


## 3. Cleaning, preprocessing and feature engineering

Given the interactive nature of the desired end product, cleaning and preprocessing had to be kept relatively minimal and, wherever necessary, modular. For instance, I initially discarded "stop words" (e.g. "what", "of", "it") but restored them later in the modelling process because I judged that this information would be valuable to the learner.
After extracting the data from the JSON file, I stored the text in a dataframe with three separate columns: the context paragraphs, the questions and the answers. After trying various ways of engineering the data, I decided to incorporate three features:

1. Assign each word a Part of Speech (POS) tag using NLTK's POS tagger.
2. Lemmatize each word using the WordNet lemmatizer.
3. Use pretrained word embeddings from Stanford's GloVe project[2](#2).
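
As a rough illustration of the third item: GloVe's pretrained vectors are distributed as plain text, with one word per line followed by its space-separated components. The sketch below parses that layout into a dictionary. `load_glove` and the sample values are hypothetical and are not taken from this notebook's `lib` module.

```python
import io


def load_glove(lines):
    """Parse GloVe-style text lines ('word v1 v2 ... vd') into a dict of float lists."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Tiny in-memory stand-in for a real GloVe file (the numbers are made up).
sample = io.StringIO(u"leave 0.1 -0.2 0.3\ncity 0.4 0.5 -0.6\n")
glove = load_glove(sample)
print(len(glove["leave"]))  # dimensionality of the toy vectors
```

With a real download, the same function could be passed an open file handle instead of the in-memory sample.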

The GloVe pretrained weights are available as vectors of four different dimensionalities (50, 100, 200 and 300). To translate the POS tag into a weight, I appended one extra element to each of those vectors to hold the tag's weight (the tag weights are shown two cells below). Directly below is a slice of the dataframe as it looked at this point.

In [5]:
df = pd.read_pickle('df1_pos_W.pkl')
df.head()

| Unnamed: 0 | context_lemma_pos | question_lemma_pos | answers_lemma_pos | answer_len | question_len | context_len |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | [(beyoncé, NN), (giselle, NN), (knowlescarter,... | [(when, WRB), (beyonce, NN), (leave, VBP), (de... | [(2003, CD)] | 1 | 8 | 75 |
| 11 | [(beyoncé, NN), (giselle, NN), (knowlescarter,... | [(when, WRB), (beyoncé, NN), (release, NN), (d... | [(2003, CD)] | 1 | 5 | 75 |
| 12 | [(beyoncé, NN), (giselle, NN), (knowlescarter,... | [(how, WRB), (many, JJ), (grammy, JJ), (award,... | [(five, CD)] | 1 | 9 | 75 |
| 15 | [(follow, VBG), (disbandment, NN), (destiny, N... | [(after, IN), (second, JJ), (solo, NN), (album... | [(act, VBG)] | 1 | 8 | 112 |
| 17 | [(follow, VBG), (disbandment, NN), (destiny, N... | [(to, TO), (set, VB), (record, NN), (grammys, ... | [(six, CD)] | 1 | 7 | 112 |


In [8]:
pos_dict = lib.get_pos_dict(df)
pos_dict

{'': 0.0,
 'NN': 0.08078729573077403,
 'VBP': 0.43049489205603175,
 'JJ': 0.22627048703939012,
 'CD': 0.782025845371569,
 'NNS': 0.7196239018674129,
 'VBD': 0.43316906320635873,
 'VBG': 0.2738475150287274,
 'RB': 0.12437301301267711,
 'VBZ': 0.9574997725141698,
 'VBN': 0.6235842224079152,
 'PRP$': 0.7594464769150902,
 'VB': 0.5933595139266659,
 'DT': 0.780331365792381,
 'PRP': 0.21155649677086108,
 'JJR': 0.08616271143743359,
 'IN': 0.3159941099223168,
 'JJS': 0.023459126356141446,
 'NNP': 0.2589476751408578,
 'MD': 0.31947113330180155,
 'WDT': 0.2930305958336812,
 'FW': 0.9362648530794228,
 '$': 0.25367766821903814,
 'WRB': 0.9848137711349689,
 'RP': 0.22819105986883026,
 'RBR': 0.6478409439239122,
 'TO': 0.21128814560671427,
 'RBS': 0.036154152229286085,
 'CC': 0.5883740392104109,
 'WP$': 0.3022983423385648,
 'WP': 0.2606244819690704,
 'EX': 0.5138941380588374,
 'NNPS': 0.5506818065967534,
 'PDT': 0.542542421914276,
 'SYM': 0.16669352976733498,
 'POS': 0.17299372118613776,
 'UH': 0.8
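
One way to read the weights above is as the extra element appended to each word's GloVe vector. The following is a minimal sketch of that step, not the notebook's actual `lib` code: `vector_with_pos` is a hypothetical helper, the numbers are rounded stand-ins, and mapping unknown words to a zero vector is an assumption.

```python
def vector_with_pos(word, tag, glove, pos_weights, dim):
    """Return the word's embedding with the POS tag weight appended as a final element."""
    # Assumption: words missing from the embedding table fall back to zeros.
    base = glove.get(word, [0.0] * dim)
    # Unknown tags fall back to 0.0, mirroring the '' entry in the dictionary above.
    return list(base) + [pos_weights.get(tag, 0.0)]


# Toy 3-dimensional embeddings and rounded tag weights, for illustration only.
glove = {"leave": [0.1, -0.2, 0.3]}
pos_weights = {"": 0.0, "NN": 0.08, "VBP": 0.43}

vec = vector_with_pos("leave", "VBP", glove, pos_weights, dim=3)
unknown = vector_with_pos("knowlescarter", "NN", glove, pos_weights, dim=3)
print(len(vec), len(unknown))  # both are dim + 1 long
```

Under this reading, a 50-dimensional GloVe vector would become a 51-element input once the tag weight is attached.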

The words were fed in as concatenations of the word and its POS tag (e.g. "leave_VBP"), as shown below:

In [9]:
df = lib.concat_word_pos(df, columns=['context_lemma_pos', 'question_lemma_pos','answers_lemma_pos'])
df.tail()

| Unnamed: 0 | context_lemma_pos | question_lemma_pos | answers_lemma_pos | answer_len | question_len | context_len |
| --- | --- | --- | --- | --- | --- | --- |
| 130044 | the_DT main_JJ international_JJ airport_NN ser... | from_IN city_NN arkefly_NN offer_VBP nonstop_J... | amsterdam_NN | 1 | 7 | 103 |
| 130046 | kathmandu_JJ metropolitan_JJ city_NN kmc_NN or... | in_IN us_PRP state_NN kathmandu_VBD first_JJ e... | oregon_NN | 1 | 8 | 71 |
| 130047 | kathmandu_JJ metropolitan_JJ city_NN kmc_NN or... | what_WP yangon_VBZ previously_RB know_VBN | rangoon_NN | 1 | 4 | 71 |
| 130048 | kathmandu_JJ metropolitan_JJ city_NN kmc_NN or... | with_IN belorussian_JJ city_NN kathmandu_NN re... | minsk_NN | 1 | 5 | 71 |
| 130049 | kathmandu_JJ metropolitan_JJ city_NN kmc_NN or... | in_IN year_NN kathmandu_NN create_VBP initial_... | 1975_CD | 1 | 7 | 71 |
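
The `lib.concat_word_pos` helper itself is not shown in this notebook. A minimal sketch of the same idea, assuming each cell holds a list of `(word, tag)` tuples as in the earlier dataframe slice, might look like this (`join_word_pos` is a hypothetical name):

```python
def join_word_pos(tagged_tokens):
    """Turn [(word, tag), ...] into a single 'word_TAG word_TAG ...' string."""
    return " ".join("{}_{}".format(word, tag) for word, tag in tagged_tokens)


# A shortened example row, in the (word, tag) form used before concatenation.
row = [("when", "WRB"), ("beyonce", "NN"), ("leave", "VBP")]
joined = join_word_pos(row)
print(joined)
```

Joining on a single space keeps each `word_TAG` pair intact as one token when the text is later split on whitespace.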


The total vocabulary consists of 127,983 unique word and POS combinations.

In [12]:
vocabulary = lib.get_total_vocab(df, columns=['context_lemma_pos', 'question_lemma_pos'])
df['text'] = df['context_lemma_pos'] + ' ' + df['question_lemma_pos']
df['question_answer'] = df['question_lemma_pos'] + ' ' + df['answers_lemma_pos']
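
The implementation of `lib.get_total_vocab` is not shown here. As a hedged sketch of what counting unique word and POS combinations could involve, the hypothetical `build_vocab` below collects whitespace-separated tokens into a set and then assigns each one an integer index, reserving 0 for padding (an assumption that suits zero-padded sequence inputs).

```python
def build_vocab(texts):
    """Collect the set of unique whitespace-separated tokens across all texts."""
    vocab = set()
    for text in texts:
        vocab.update(text.split())
    return vocab


# Toy stand-ins for the concatenated context and question columns.
contexts = ["the_DT main_JJ airport_NN", "kathmandu_JJ city_NN"]
questions = ["what_WP city_NN", "the_DT city_NN"]

vocab = build_vocab(contexts + questions)

# Index 0 is left free for padding, so real tokens start at 1.
word_index = {token: i for i, token in enumerate(sorted(vocab), start=1)}
print(len(vocab))
```

Because the tag is part of the token, the same word with two different tags counts as two vocabulary entries, which is one reason the total grows so large.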

<a id = 'modelselection'></a>

## 4. Model selection and design

<a id = 'answer_gen'></a>


## 5. Modelling the answer generator

<a id = 'answer_gen_train'></a>


### 5.1 Model training and validation

<a id = 'answer_gen_test'></a>


### 5.2 Model testing

<a id = 'question_gen'></a>


## 6. Modelling the question generator

<a id = 'question_gen_train'></a>


### 6.1 Model training and validation

<a id = 'question_gen_test'></a>


### 6.2 Model testing

<a id = 'conclusion'></a>


## 7. Concluding remarks

<a id = 'bib'></a>

## 8. Bibliography

<a id=1 ></a>
1. Stanford Natural Language Processing Group. The Stanford Question Answering Dataset (SQuAD). https://rajpurkar.github.io/SQuAD-explorer/
<a id=2 ></a>
2. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP 2014.
<a id=3 ></a>
3.

In [3]:
from __future__ import print_function, unicode_literals

# Standard library
import json
import os
import re
import warnings
from unicodedata import normalize

# Project helpers (Python 2 version of the local library)
import library_py2 as lib

# Data handling, plotting and NLP
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.stem.wordnet import WordNetLemmatizer, wordnet

# Show each distinct warning only once, set before the deep learning imports
warnings.filterwarnings(action='once')

# Deep learning (duplicate imports removed)
import tensorflow as tf
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import ModelCheckpoint
from keras.layers import Input, Dense, LSTM, Embedding, Masking, Dropout, TimeDistributed
from keras.layers import Activation, Bidirectional, GlobalMaxPool1D
from keras.models import Model, Sequential, load_model
from keras.optimizers import Adam
from keras.preprocessing import text, sequence
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

%matplotlib inline
%load_ext autoreload
%autoreload 2

