# Predict the topic of a Math Question on Math Education Resources

We will use **Machine Learning** to predict the topic of a math question from [Math Education Resources](http://math-education-resources.com) (MER). For simplicity we will only consider two topics. With [multiclass classification](https://en.wikipedia.org/wiki/Multiclass_classification) the approach extends to more than two topics (at the time of writing, April 2015, MER has about 1500 questions covering 150 topics).

## Data inspection

In [2]:
import os
import json
import numpy as np
from pymongo import MongoClient

In [3]:
client = MongoClient()

In [21]:
questions_collection = client['merdb'].questions

In [22]:
questions_collection.find_one()

{u'ID': u'UBC+MATH307+April_2012+01_(d)',
 u'_id': ObjectId('554c104798cccfa52bc9ba95'),
 u'answer_html': u'<p>No content found.</p>\n',
 u'answer_latex': u'No content found.',
 u'contributors': [u'Konradbe'],
 u'course': u'MATH307',
 u'flags': [u'RQ', u'CH', u'CS', u'CT'],
 u'hints_html': [u'<p>No content found.</p>\n'],
 u'hints_latex': [u'No content found.'],
 u'hints_raw': [u'No content found.'],
 u'num_votes': 0,
 u'question': u'1 (d)',
 u'rating': -1,
 u'sols_html': [u'<p>No content found.</p>\n'],
 u'sols_latex': [u'No content found.'],
 u'sols_raw': [u'No content found.'],
 u'statement_html': u'<p>Suppose you are given a set of <em>N</em> data points <em>(x<sub>n</sub>, y<sub>n</sub>)</em>, with <em>x<sub>n</sub></em> increasing, and you wish to interpolate these points with a spline function <em><span class="math">\\(f\\)</span></em>, where <em><span class="math">\\(f\\)</span>(x)</em> is given by the cubic polynomial <em>p<sub>n</sub>(x)</em> on each interval <em>(x<sub>n</su

In [17]:
topic_tags = ["Eigenvalues_and_eigenvectors", "Probability_density_function", "Taylor_series"]

In [34]:
# Fetch every question tagged with at least one of the selected topics
questions = list(questions_collection.find({"topics": {"$in": topic_tags}}))

In [36]:
for t in topic_tags:
    # Cursor.count() is gone from current PyMongo; count_documents() replaces it
    print(questions_collection.count_documents({"topics": t}))

47
40
51


In [37]:
questions[77].keys()

[u'rating',
 u'contributors',
 u'topics',
 u'year',
 u'answer_html',
 u'course',
 u'solvers',
 u'sols_raw',
 u'hints_html',
 u'votes',
 u'question',
 u'statement_raw',
 u'num_votes',
 u'statement_html',
 u'term',
 u'statement_latex',
 u'hints_raw',
 u'hints_latex',
 u'ID',
 u'sols_latex',
 u'url',
 u'flags',
 u'answer_latex',
 u'sols_html',
 u'_id']

In [58]:
np.random.seed(23)  # seed the random number generator for reproducibility
# Note: as written, 75% of the questions are drawn into the *test* set
# and the remaining 25% form the training set.
num_samples = int(0.75 * len(questions))
test_indices = set(np.random.choice(range(len(questions)), num_samples, replace=False))
questions_train = [q for i, q in enumerate(questions) if i not in test_indices]
questions_test = [q for i, q in enumerate(questions) if i in test_indices]
for t in topic_tags:
    print('%s questions in test set: %d' % (t, sum(1 for q in questions_test if t in q['topics'])))

Eigenvalues_and_eigenvectors questions in test set: 37
Probability_density_function questions in test set: 27
Taylor_series questions in test set: 39
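
The split above can be wrapped in a small helper that makes the held-out fraction explicit. This is only a minimal sketch, not the notebook's own code: `split_train_test`, its `test_fraction` parameter and the toy data are names introduced here for illustration, and it uses a local `np.random.RandomState` instead of the global seed.

```python
import numpy as np


def split_train_test(items, test_fraction, seed=23):
    """Shuffle the indices once and cut them into a train and a test part."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(items))
    num_test = int(test_fraction * len(items))
    test_idx = set(order[:num_test].tolist())
    train = [x for i, x in enumerate(items) if i not in test_idx]
    test = [x for i, x in enumerate(items) if i in test_idx]
    return train, test


# Toy stand-in for the list of question documents (hypothetical data)
toy_questions = ['q%d' % i for i in range(8)]
train, test = split_train_test(toy_questions, test_fraction=0.25)
print(len(train), len(test))
```

Passing `test_fraction` by name makes it harder to confuse which part of the data is held out.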


In [49]:
import helpers
from nltk import PorterStemmer
from nltk.corpus import stopwords

In [51]:
def words_from_question(q):
    all_text = q['statement_html'] + q['hints_html'][0] + q['sols_html'][0]
    return helpers.strip_text(all_text)

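stemmer = PorterStemmer()                # build the stemmer once, not once per word
stop = set(stopwords.words('english'))   # set membership tests are fast

def words_stemmed_no_stop(words):
    res = []
    for word in words:
        # `stem` is the public API; `stem_word` is absent from current NLTK releases
        stemmed = stemmer.stem(word)
        # The stop-word test is applied to the stem, so stems such as
        # 'abov' (from 'above') are not filtered out.
        if stemmed not in stop and len(stemmed) > 1:
            res.append(stemmed)
    return res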
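
The `helpers` module is not shown in this notebook, so what `helpers.strip_text` does exactly is unknown here. As a rough, hypothetical stand-in (an assumption, not the author's implementation), HTML can be reduced to lowercase word tokens with two regular expressions. `strip_text_sketch` is a name introduced only for this illustration.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')   # crude match for anything that looks like a tag
WORD_RE = re.compile(r'[a-z]+')   # runs of lowercase ASCII letters


def strip_text_sketch(html):
    """Hypothetical stand-in for helpers.strip_text: HTML -> lowercase words."""
    without_tags = TAG_RE.sub(' ', html)
    # Digits, punctuation and LaTeX symbols are dropped; only letter runs remain.
    return WORD_RE.findall(without_tags.lower())


tokens = strip_text_sketch('<p>Think eigenvalues and recall part (a).</p>\n')
print(tokens)
```

Single-letter tokens such as the `a` from "(a)" survive this step and are only removed later by the `len(stemmed) > 1` filter.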

In [52]:
vocabulary = []
for q in questions_train:
    vocabulary += words_stemmed_no_stop(words_from_question(q))
vocabulary_sorted = sorted(set(vocabulary))
print('Number of distinct words: %d' % len(vocabulary_sorted))
print(vocabulary_sorted[:15])

Number of distinct words: 435
[u'abl', u'abov', u'absolut', u'accord', u'ad', u'ahead', u'also', u'alt', u'amp', u'analyt', u'ani', u'anoth', u'answer', u'anywher', u'appli']


In [53]:
def question_to_vector(q, voc):
    x_vec = np.zeros(len(voc))
    words = words_stemmed_no_stop(words_from_question(q))
    for word in words:
        if word in voc:
            x_vec[voc.index(word)] = 1
    return x_vec
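
`voc.index(word)` scans the whole list on every lookup. A precomputed word-to-index dictionary yields the same binary bag-of-words vector with constant-time lookups. The following is a sketch under assumptions: `build_index` and `words_to_vector` are names introduced here, and they take already tokenized and stemmed words so the example runs without NLTK or `helpers`.

```python
import numpy as np


def build_index(vocabulary_sorted):
    """Map each vocabulary word to its column position."""
    return {word: i for i, word in enumerate(vocabulary_sorted)}


def words_to_vector(words, index):
    """Binary bag-of-words: 1 where a vocabulary word occurs, 0 elsewhere."""
    x_vec = np.zeros(len(index))
    for word in words:
        i = index.get(word)
        if i is not None:  # words outside the vocabulary are ignored
            x_vec[i] = 1
    return x_vec


# Toy vocabulary of stems (hypothetical data)
toy_vocabulary = sorted({'eigenvalu', 'matrix', 'seri', 'taylor'})
index = build_index(toy_vocabulary)
vec = words_to_vector(['matrix', 'eigenvalu', 'matrix', 'unknown'], index)
print(vec)
```

Repeated words still contribute a single 1, so the vector records presence rather than counts, as in `question_to_vector`.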

In [55]:
sum(question_to_vector(questions[0], vocabulary_sorted))  # number of distinct vocabulary words in the question

20.0

In [56]:
questions[0]

{u'ID': u'UBC+MATH307+December_2008+06_(b)',
 u'_id': ObjectId('554c104898cccfa52bc9bab6'),
 u'answer_html': u'<p>Therefore, the matrix <span class="math">\\(Q+iI\\)</span> is invertible.</p>\n',
 u'answer_latex': u'Therefore, the matrix $Q+iI$ is invertible.',
 u'contributors': [u'IainMoyles', u'Konradbe'],
 u'course': u'MATH307',
 u'flags': [u'QGQ', u'QGH', u'QGS', u'RT'],
 u'hints_html': [u'<p>Think eigenvalues and recall part (a).</p>\n'],
 u'hints_latex': [u'Think eigenvalues and recall part (a).'],
 u'hints_raw': [u'Think eigenvalues and recall part (a).'],
 u'num_votes': 0,
 u'question': u'6 (b)',
 u'rating': -1,
 u'sols_html': [u'<p>Recall the definition of the eigenvalue, we know that <span class="math">\\(-i\\)</span> is one of Q<span class="math">\\(^{\\prime}\\)</span>s eigenvalues when <span class="math">\\(det(Q+iI)=0\\)</span>, and <span class="math">\\(Q+iI\\)</span> is invertible if and only if <span class="math">\\(det(Q+iI)\\neq 0\\)</span>. From <span>part a</span>,