# Question Classification

In document classification, we aim to categorize each document with a label from a predetermined set. 

In this example, we will demonstrate how to classify potential user questions using **textblob**, NLTK, and scikit-learn.

Please see http://cogcomp.cs.illinois.edu/Data/QA/QC/ for more information about the dataset.

In [327]:
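from textblob import TextBlob
from nltk.corpus import qc, reuters
import nltk
import numpy as np  # used below to compute accuracy
import pprint
import warnings
# CountVectorizer and TfidfTransformer are the first two pipeline stages
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline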

pp = pprint.PrettyPrinter(indent=4)
warnings.filterwarnings('ignore')

# Data

The dataset we will use contains about 6,000 questions, each labelled with one of 50 categories. There are six high-level categories:
* Abbreviation: ABBR
* Entity: ENTY
* Description: DESC
* Human: HUM
* Location: LOC
* Numeric: NUM

These categories are further divided into subcategories, but for the purposes of this demonstration, we will only concern ourselves with the highest level.
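As a minimal sketch of how the high-level category can be recovered, assume each label is a string of the form `COARSE:fine` (for example `DESC:def`). The helper name `coarse_label` is hypothetical and only used for illustration.

```python
def coarse_label(label):
    """Return the high-level category from a 'COARSE:fine' label string."""
    # Everything before the first colon is the high-level category.
    return label.split(":")[0]


print(coarse_label("DESC:def"))
```

The same `split(":")[0]` expression is what the cells below apply to every question.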

Since the goal of this tutorial is classification, the first thing we will do is split our data into training and test sets. Although the data is provided already split, we will combine all questions and re-split them ourselves, using 80% for training and 20% for testing.
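The cells below split the combined list by position. As a hedged alternative, here is a sketch of a shuffled split with a fixed seed, which avoids depending on the order in which the questions were combined. The function name `split_examples`, its parameters, and the toy data are assumptions made for illustration, not part of the original notebook.

```python
import random


def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and cut it into train and test parts."""
    shuffled = list(examples)              # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy (question, label) pairs standing in for the real corpus.
toy = [("question %d" % i, "NUM") for i in range(10)]
toy_train, toy_test = split_examples(toy)
print(len(toy_train), len(toy_test))
```

Shuffling changes which questions end up in the test set, so an accuracy figure obtained this way would not be directly comparable to one from a positional split.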

In [328]:
questions = qc.tuples("train.txt") + qc.tuples("test.txt")

In [329]:
questions[19]

(u'DESC:def', u'What is an annotated bibliography ?')

In [330]:
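# Keep only the question text and the high-level category (the part before ":")
train = [(q[1], q[0].split(":")[0]) for q in qc.tuples("train.txt")]
test = [(q[1], q[0].split(":")[0]) for q in qc.tuples("test.txt")]
full = test + train
len(full)

# Positional 80/20 split (no shuffling)
train = full[:4760]
test = full[4760:]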

In [331]:
train[0]

(u'How far is it from Denver to Aspen ?', u'NUM')

In [332]:
q_train = [q[0] for q in train]
c_train = [q[1] for q in train]
q_test = [q[0] for q in test]
c_test = [q[1] for q in test]
q_full = [q[0] for q in full]
c_full = [q[1] for q in full]

In [333]:
pipeline = Pipeline([
        ('bow', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('classifier', MultinomialNB())
    ])
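To make the first pipeline stage more concrete, here is a rough pure-Python approximation of a bag-of-words count. It assumes lowercasing and tokens of at least two word characters, which mirrors the default settings of `CountVectorizer` but is not its actual implementation; the function name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercased tokens made of two or more word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


print(bag_of_words("What is an annotated bibliography ?"))
```

In the real pipeline these counts are then reweighted by `TfidfTransformer` before being passed to the Naive Bayes classifier.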

In [334]:
text_clf = pipeline.fit(q_train, c_train)

In [337]:
text_clf.predict(["Who is Michael?"])

array([u'HUM'], 
      dtype='<U4')

In [336]:
predicted = text_clf.predict(q_test)
np.mean(predicted == c_test)

0.77432885906040272
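Overall accuracy can hide categories that are classified poorly. The sketch below breaks accuracy down by gold category; the function name `per_category_accuracy` and the toy labels are assumptions for illustration and are not results from the notebook.

```python
from collections import defaultdict


def per_category_accuracy(gold, predicted):
    """Return {category: fraction of its gold examples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    # float() keeps the division exact under Python 2 as well
    return {label: correct[label] / float(total[label]) for label in total}


# Toy labels only; with the notebook's variables this would be
# per_category_accuracy(c_test, predicted).
toy_gold = ["HUM", "HUM", "NUM", "LOC"]
toy_pred = ["HUM", "NUM", "NUM", "LOC"]
print(per_category_accuracy(toy_gold, toy_pred))
```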

# Resources

http://text-processing.com/demo/

http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

http://wit.ai

IBM Tone Analyzer

https://api.ai/

https://algorithmia.com/tags/text%20analysis

https://demos.explosion.ai/sense2vec/?word=natural%20language%20processing&sense=auto

https://spacy.io/

https://explosion.ai/

http://blog.aylien.com/naive-bayes-for-dummies-a-simple-explanation/