# Predict the topic of a Math Question on Math Education Resources

We will use **Machine Learning** to predict the topic of a Math Question from the [Math Education Resources](http://math-education-resources.com). For simplicity we will only consider three topics. The same [multiclass classification](https://en.wikipedia.org/wiki/Multiclass_classification) approach extends to more topics (at the time of writing, April 2015, we have about 1500 questions with 150 topics on MER).

## Data inspection

In [15]:
import os
import json
import numpy as np
from pymongo import MongoClient

In [16]:
client = MongoClient()

In [17]:
questions_collection = client['merdb'].questions

In [18]:
questions_collection.find_one()

{u'ID': u'UBC+MATH307+April_2012+01_(d)',
 u'_id': ObjectId('55383310cec2a2367cebc622'),
 u'answer_html': u'<p>No content found.</p>',
 u'answer_latex': u'No content found.',
 u'contributors': [u'Konradbe'],
 u'course': u'MATH307',
 u'flags': [u'RQ', u'CH', u'CS', u'CT'],
 u'hints_html': [u'<p>No content found.</p>'],
 u'hints_latex': [u'No content found.'],
 u'hints_raw': [u'No content found.'],
 u'num_votes': 0,
 u'question': u'1 (d)',
 u'rating': -1,
 u'sols_html': [u'<p>No content found.</p>'],
 u'sols_latex': [u'No content found.'],
 u'sols_raw': [u'No content found.'],
 u'statement_html': u'<p>Suppose you are given a set of <em>N</em> data points <em>(x<sub>n</sub>, y<sub>n</sub>)</em>, with <em>x<sub>n</sub></em> increasing, and you wish to interpolate these points with a spline function <em><span class="math">\\(f\\)</span></em>, where <em><span class="math">\\(f\\)</span>(x)</em> is given by the cubic polynomial <em>p<sub>n</sub>(x)</em> on each interval <em>(x<sub>n</sub>, x<

In [19]:
topic_tags = ["Eigenvalues_and_eigenvectors", "Probability_density_function", "Taylor_series"]

In [20]:
# Fetch every question tagged with at least one of the chosen topics
questions = list(questions_collection.find({"topics": {"$in": topic_tags}}))

In [21]:
for t in topic_tags:
    print(questions_collection.find({"topics": t}).count())
    

45
39
50


In [22]:
questions[77].keys()

[u'rating',
 u'contributors',
 u'topics',
 u'year',
 u'answer_html',
 u'course',
 u'solvers',
 u'sols_raw',
 u'hints_html',
 u'votes',
 u'question',
 u'statement_raw',
 u'num_votes',
 u'statement_html',
 u'term',
 u'statement_latex',
 u'hints_raw',
 u'hints_latex',
 u'ID',
 u'sols_latex',
 u'url',
 u'flags',
 u'answer_latex',
 u'sols_html',
 u'_id']

In [23]:
np.random.seed(23)  # for reproducibility we set the seed of the random number generator
# Note: the sampled indices form the TEST set, so 75% of the questions are held out
# for testing and the remaining 25% are used for training.
num_samples = int(0.75 * len(questions))
test_indices = np.random.choice(range(len(questions)), num_samples, replace=False)
questions_train = [q for i, q in enumerate(questions) if i not in test_indices]
questions_test = [q for i, q in enumerate(questions) if i in test_indices]
for t in topic_tags:
    print('%s questions in test set: %d' % (t, sum(1 for q in questions_test if t in q['topics'])))

Eigenvalues_and_eigenvectors questions in test set: 34
Probability_density_function questions in test set: 25
Taylor_series questions in test set: 41
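
The split above works on indices: a random subset of positions is drawn without replacement and every question goes to exactly one of the two lists. As a minimal sketch of the same idea with the fraction made explicit, the following uses only the standard library. `split_by_fraction` is a hypothetical helper, not part of the notebook, and because it uses Python's `random` module instead of NumPy it will not reproduce the exact indices drawn above.

```python
import random

def split_by_fraction(items, test_fraction, seed=23):
    """Return (train, test); roughly `test_fraction` of the items go to the test list."""
    rng = random.Random(seed)  # local generator, the global random state is untouched
    num_test = int(test_fraction * len(items))
    # Draw test positions without replacement
    test_idx = set(rng.sample(range(len(items)), num_test))
    train = [x for i, x in enumerate(items) if i not in test_idx]
    test = [x for i, x in enumerate(items) if i in test_idx]
    return train, test

toy_items = list(range(20))  # made-up stand-in for the list of questions
toy_train, toy_test = split_by_fraction(toy_items, test_fraction=0.75)
print(len(toy_train), len(toy_test))
```

With `test_fraction=0.75`, as in the cell above, three quarters of the items are held out; passing `0.25` would give the more common 75/25 train/test proportion.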


In [24]:
import helpers
from nltk import PorterStemmer
from nltk.corpus import stopwords

In [25]:
def words_from_question(q):
    all_text = q['statement_html'] + q['hints_html'][0] + q['sols_html'][0]
    return helpers.strip_text(all_text)

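def words_stemmed_no_stop(words):
    # Build the stop-word set and the stemmer once instead of once per word
    stop = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    res = []
    for word in words:
        # `stem` is the supported method; `stem_word` is not available in newer NLTK releases
        stemmed = stemmer.stem(word)
        # Drop stop words and single-character tokens
        if stemmed not in stop and len(stemmed) > 1:
            res.append(stemmed)
    return res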
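
The `helpers` module is not shown in this notebook, so the exact behaviour of `helpers.strip_text` is unknown here. For readers who want to follow along without it, here is a hypothetical stand-in that removes HTML tags, lowercases the text and keeps runs of letters. This is an assumption: the real helper evidently does more, since tokens such as `greater_than` appear in the word lists further down.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')   # anything that looks like an HTML tag
WORD_RE = re.compile(r'[a-z]+')   # runs of ASCII letters

def strip_text_sketch(html):
    """Hypothetical stand-in for helpers.strip_text: HTML string -> list of lowercase words."""
    text = TAG_RE.sub(' ', html)  # replace tags by spaces so neighbouring words stay separate
    return WORD_RE.findall(text.lower())

example_tokens = strip_text_sketch('<p>Think eigenvalues and recall part (a).</p>')
print(example_tokens)  # ['think', 'eigenvalues', 'and', 'recall', 'part', 'a']
```

Single-letter tokens such as `'a'` survive this step, but they are removed afterwards by the `len(stemmed) > 1` filter in `words_stemmed_no_stop`.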

In [26]:
vocabulary = []
for q in questions_train:
    vocabulary += words_stemmed_no_stop(words_from_question(q))
vocabulary_sorted = sorted(set(vocabulary))
print('Number of distinct words: %d' % len(vocabulary_sorted))
print(vocabulary_sorted[:15])

Number of distinct words: 431
[u'abl', u'abov', u'absolut', u'accord', u'act', u'ad', u'ahead', u'allow', u'also', u'alt', u'alway', u'amp', u'analyt', u'ani', u'anoth']


In [27]:
def question_to_vector(q, voc):
    x_vec = np.zeros(len(voc))
    words = words_stemmed_no_stop(words_from_question(q))
    for word in words:
        if word in voc:
            x_vec[voc.index(word)] = 1
    return x_vec
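
`question_to_vector` builds a binary bag-of-words vector: entry `i` is 1 if the `i`-th vocabulary word occurs in the question at least once, and 0 otherwise. Since `voc.index(word)` scans the list on every call, a possible variant is to precompute a word-to-position dictionary. The sketch below shows that idea on made-up data in plain Python; `build_index` and `words_to_vector` are hypothetical names, not functions used elsewhere in this notebook.

```python
def build_index(voc):
    """Map each vocabulary word to its column position."""
    return {word: i for i, word in enumerate(voc)}

def words_to_vector(words, index):
    """Binary bag-of-words: 1 if the word occurs at least once, else 0."""
    vec = [0] * len(index)
    for word in words:
        pos = index.get(word)  # None for words outside the vocabulary
        if pos is not None:
            vec[pos] = 1
    return vec

toy_voc = ['determin', 'eigenvalu', 'matrix', 'seri']          # made-up vocabulary
toy_words = ['matrix', 'eigenvalu', 'matrix', 'unknown']       # made-up question words
toy_vec = words_to_vector(toy_words, build_index(toy_voc))
print(toy_vec)  # [0, 1, 1, 0]
```

Repeated words (`'matrix'`) still produce a single 1, and out-of-vocabulary words (`'unknown'`) are ignored, which matches the behaviour of the function above.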

In [28]:
sum(question_to_vector(questions[0], vocabulary_sorted))

20.0

In [29]:
questions[0]

{u'ID': u'UBC+MATH307+December_2008+06_(b)',
 u'_id': ObjectId('55383312cec2a2367cebc643'),
 u'answer_html': u'<p>Therefore, the matrix <span class="math">\\(Q+iI\\)</span> is invertible.</p>',
 u'answer_latex': u'Therefore, the matrix $Q+iI$ is invertible.',
 u'contributors': [u'IainMoyles', u'Konradbe'],
 u'course': u'MATH307',
 u'flags': [u'QGQ', u'QGH', u'QGS', u'RT'],
 u'hints_html': [u'<p>Think eigenvalues and recall part (a).</p>'],
 u'hints_latex': [u'Think eigenvalues and recall part (a).'],
 u'hints_raw': [u'Think eigenvalues and recall part (a).'],
 u'num_votes': 0,
 u'question': u'6 (b)',
 u'rating': -1,
 u'sols_html': [u'<p>Recall the definition of the eigenvalue, we know that <span class="math">\\(-i\\)</span> is one of Q<span class="math">\\(^{\\prime}\\)</span>s eigenvalues when <span class="math">\\(det(Q+iI)=0\\)</span>, and <span class="math">\\(Q+iI\\)</span> is invertible if and only if <span class="math">\\(det(Q+iI)\\neq 0\\)</span>. From <span>part a</span>, we 

In [39]:
from sklearn.linear_model import LogisticRegression

In [43]:
from sklearn.preprocessing import label_binarize

In [59]:
def questions_to_y(qs):
    topic_labels = []
    for q in qs:
        if topic_tags[0] in q['topics']:
            topic_labels.append(0)
        elif topic_tags[1] in q['topics']:
            topic_labels.append(1)
        else:
            topic_labels.append(2)
    return label_binarize(topic_labels, classes = [0, 1, 2])
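
With three classes, `label_binarize` turns each integer label into a row with one indicator column per class. To make that encoding concrete, here is a plain-Python sketch on made-up labels; `one_hot_rows` is a hypothetical name and only mimics the three-class case used here.

```python
def one_hot_rows(labels, classes):
    """One row per label, one column per class; exactly one 1 per row."""
    return [[1 if label == c else 0 for c in classes] for label in labels]

toy_labels = [0, 2, 1, 2]  # made-up topic indices
toy_y = one_hot_rows(toy_labels, classes=[0, 1, 2])
for row in toy_y:
    print(row)
```

Column `i` of such a matrix is the binary target "topic `i` versus the rest", which is what the per-topic ROC computation at the end of the notebook reads via `y_test[:, i]`.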

In [56]:
def questions_to_X(qs, voc):
    X = np.zeros(shape=(len(qs), len(voc)))

    for i, q in enumerate(qs):
        X[i, :] = question_to_vector(q, voc)
    return X

In [57]:
X_train = questions_to_X(questions_train, vocabulary_sorted)
X_test = questions_to_X(questions_test, vocabulary_sorted)

[u'true', u'fals', u'squar', u'matrix', u'ha', u'repeat', u'eigenvalu', u'cannot', u'diagon', u'justifi', u'answer', u'content', u'found', u'fals', u'becaus', u'ha', u'repeat', u'eigenvalu', u'diagon', u'array', u'greater_than']
[u'consid', u'matrix', u'verifi', u'eigenvalu', u'content', u'found', u'recal', u'eigenvalu', u'onli', u'therefor', u'suffic', u'show', u'would', u'prefer', u'avoid', u'lot', u'comput', u'find', u'determin', u'right', u'away', u'instead', u'recal', u'ad', u'multipl', u'one', u'row', u'anoth', u'doe', u'chang', u'determin', u'therefor', u'ad', u'second', u'row', u'third', u'subtract', u'first', u'row', u'second', u'yield', u'matrix', u'ha', u'determin', u'note', u'two', u'row', u'multipl', u'therefor', u'determin', u'abov', u'matrix', u'zero', u'henc', u'matrix', u'greater_than', u'determin']
[u'find', u'number', u'make', u'matrix', u'diagonaliz', u'need', u'diagon', u'content', u'found', u'let', u'find', u'eigenvalu', u'solv', u'equat', u'abov', u'get', u'simil

In [49]:
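# Sanity check: one label row per training question
# (`topic_labels` only exists inside `questions_to_y`, so we call the function here)
assert len(questions_to_y(questions_train)) == len(questions_train)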

In [62]:
y_train = questions_to_y(questions_train)
y_test = questions_to_y(questions_test)

In [53]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn import svm

In [55]:
# One linear SVM per topic; probability=True is needed so that predict_proba is available
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, random_state=np.random.RandomState(0)))

In [63]:
trained_classifier = classifier.fit(X_train, y_train)

In [66]:
preds = trained_classifier.predict_proba(X_test)
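
`preds` holds one row per test question and one score per topic. A simple way to turn such a row into a single predicted topic is to pick the column with the largest score. The sketch below illustrates this on made-up score rows, not on the actual `preds` array; `most_likely_topic` is a hypothetical helper. Note that in a one-vs-rest setup the scores of a row need not sum to one.

```python
topic_tags = ["Eigenvalues_and_eigenvectors", "Probability_density_function", "Taylor_series"]

def most_likely_topic(score_row, tags):
    """Return the tag whose score is largest in this row."""
    best = max(range(len(score_row)), key=lambda i: score_row[i])
    return tags[best]

# Made-up scores for two questions, one column per topic
toy_scores = [[0.91, 0.05, 0.10],
              [0.20, 0.15, 0.70]]
toy_predictions = [most_likely_topic(row, topic_tags) for row in toy_scores]
print(toy_predictions)
```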

In [72]:
from sklearn.metrics import roc_curve, auc

fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], preds[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), preds.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

print(roc_auc)

{0: 0.99955436720142599, 1: 0.98026666666666673, 2: 0.98222405952873093, 'micro': 0.97630000000000028}
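
To interpret these numbers: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The following sketch computes the AUC directly from that definition on made-up data. `auc_by_pairs` is a hypothetical helper written for illustration; it is not how `sklearn.metrics.auc` is implemented.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: 3 of the 4 (positive, negative) pairs are ordered correctly
toy_auc = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75
```

The micro-average follows the same definition, except that the label and score entries of all three topics are pooled into one long list first, which is what the `ravel()` calls above do.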
