## Text Classifiers
Now that we have explored and built some text features (i.e. features extracted from text data), we want to put them to use. A common machine learning problem involving text data is *text classification*. In this notebook we will explore text classification using a real dataset.

First let's import the required packages.

In [None]:
import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 1000)

### Loading text data
Often, text data will come in the format of *JavaScript Object Notation* (JSON). Let's import some JSON data.

Run the following code to load the JSON file `Data/reviews.json` into a Pandas DataFrame and display the first $10$ rows.

In [None]:
with open('Data/reviews.json','r') as in_file:
    df = pd.DataFrame(json.load(in_file))
df.head(10)

The data above consists of user reviews of Google Play applications and the corresponding rating for each review ([source](http://jmcauley.ucsd.edu/data/amazon/)). All reviews with *medium* ratings ($2$ to $4$) were removed from the raw data, and a *balanced* number of reviews with ratings $1$ and $5$ were extracted. This was done to make it easier to train a classifier.
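
For readers unfamiliar with JSON, here is a small sketch of how such records could look and how they parse into Python objects. The layout (a list of objects with `review` and `rating` keys) and the two records are assumptions for illustration; the actual structure of `Data/reviews.json` may differ.

```python
import json

# Hypothetical JSON text; the real file's layout may differ.
raw = '[{"review": "Love this app!", "rating": 5}, {"review": "Total rubbish", "rating": 1}]'

# json.loads turns a JSON array of objects into a list of dicts.
records = json.loads(raw)

# Pull each field out into its own list, one entry per record.
reviews = [record["review"] for record in records]
ratings = [record["rating"] for record in records]

print(reviews)
print(ratings)
```

A list of dicts like `records` is one of the shapes that `pd.DataFrame` accepts directly, with one column per key.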

### Simple Text Classifier
Here we will introduce a common classifier for text classification, the [*Multinomial naive Bayes*](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes) classifier. Using scikit-learn, it is simple to add a classifier as the last step in the text processing pipeline.

Run the following code to set up a text classification pipeline and fit it to the rating and review data.

In [None]:
text_pipe = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_pipe.fit(df.review, df.rating);
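
To give a feel for what the `clf` step is doing, here is a minimal from-scratch sketch of multinomial naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's implementation: it works on raw word counts rather than tf-idf weights, splits on whitespace only, and the toy reviews and the function names `train_multinomial_nb` and `predict_multinomial_nb` are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    docs_per_class = Counter(labels)

    # Count how often each word occurs in the documents of each class.
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())

    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}
    log_like = {}
    for c in docs_per_class:
        # Add-alpha smoothing keeps unseen words from getting zero probability.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def predict_multinomial_nb(doc, log_prior, log_like):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored.
        word_scores = [log_like[c][w] for w in doc.lower().split() if w in log_like[c]]
        scores[c] = log_prior[c] + sum(word_scores)
    return max(scores, key=scores.get)


# Hypothetical toy reviews with ratings 5 (positive) and 1 (negative).
toy_docs = ["love this app", "great app love it", "hate this app", "total rubbish hate it"]
toy_labels = [5, 5, 1, 1]

log_prior, log_like = train_multinomial_nb(toy_docs, toy_labels)
prediction = predict_multinomial_nb("love it", log_prior, log_like)
print(prediction)
```

Each class gets a score equal to its log prior plus the summed log likelihoods of the input words, and the class with the highest score is returned.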

Given a fitted text classification pipeline, we can create our own inputs and pass them through the pipeline as follows.

In [None]:
X_test = [
    "Love this app!",
    "Hate this app!",
    "Total rubbish",
    "Works perfectly",
]
print(text_pipe.predict(X_test))

Do the ratings output by the text classification pipeline make sense given the example text inputs? Experiment with your own inputs and try to fool the system.

### Performance on training set
Given a text classification pipeline, we can evaluate how well the pipeline performs on the training set.

We can compute the *confusion matrix* as follows.

In [None]:
print(metrics.confusion_matrix(df.rating, text_pipe.predict(df.review)))

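The confusion matrix counts the correctly and incorrectly classified examples in the data. Each row corresponds to a true rating and each column to a predicted rating, so the diagonal entries are the correct classifications.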
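
As a rough illustration of how those counts arise, the following sketch builds a confusion matrix by hand in plain Python. The labels and predictions are made-up toy values, and `confusion_counts` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]


# Hypothetical true ratings and predictions for six reviews.
y_true = [1, 1, 1, 5, 5, 5]
y_pred = [1, 1, 5, 5, 5, 1]

matrix = confusion_counts(y_true, y_pred, labels=[1, 5])
for row in matrix:
    print(row)
```

Here the diagonal holds the four correct predictions and the off-diagonal entries hold the two mistakes, one in each direction.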

These numbers can be used to calculate the *precision*, *recall* and *f1-score* as follows. See [here](https://en.wikipedia.org/wiki/Precision_and_recall) for more details.

In [None]:
print(metrics.classification_report(df.rating, text_pipe.predict(df.review)))
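
To make the link between the counts and the reported scores concrete, here is a small sketch that computes precision, recall and f1-score for one class from scratch. The toy labels and the helper name `precision_recall_f1` are assumptions for illustration only.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and f1 treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical true ratings and predictions for eight reviews.
y_true = [1, 1, 1, 1, 5, 5, 5, 5]
y_pred = [1, 1, 1, 5, 5, 5, 1, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=5)
print(precision, recall, f1)
```

Precision asks how many of the predicted $5$ ratings were right, recall asks how many of the true $5$ ratings were found, and the f1-score is their harmonic mean.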

### Performance on test set
As with other classifiers, we need to evaluate a text classification pipeline on a test set (i.e. a data set it has not seen before). Let's first split the ratings and reviews data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.review, df.rating, test_size=0.2, random_state=0)

Now, let's build another text classification pipeline, but this time let's train it on the training data and evaluate it on the test data.

In [None]:
text_pipe = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_pipe.fit(X_train, y_train);
print(text_pipe.score(X_test, y_test))

### Model Selection
There may be better parameter choices for the above pipeline. For example, maybe stop words are useful...

Run the following code using the same pipeline, but without removing stop words.

In [None]:
text_pipe_wstops = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_pipe_wstops.fit(X_train, y_train);
print(text_pipe_wstops.score(X_test, y_test))

Indeed, we get slightly better performance on the test data when stop words are kept!
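
One possible reason, offered here only as a hypothesis, is that some stop words carry sentiment, negations in particular. The sketch below uses a small hypothetical stop list (not scikit-learn's built-in one) and simple whitespace tokenization to show how two reviews with opposite meaning can collapse to the same tokens once stop words are removed.

```python
# Hypothetical stop list for illustration; not scikit-learn's English list.
HYPOTHETICAL_STOP_WORDS = {"this", "is", "not", "a", "the"}


def tokens_without_stops(text, stop_words):
    """Lowercase, split on whitespace, and drop any stop words."""
    return [word for word in text.lower().split() if word not in stop_words]


negative_tokens = tokens_without_stops("this is not a good app", HYPOTHETICAL_STOP_WORDS)
positive_tokens = tokens_without_stops("this is a good app", HYPOTHETICAL_STOP_WORDS)

print(negative_tokens)
print(positive_tokens)
```

After filtering, both reviews look identical to the classifier, so the negation is lost.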

We can also compare the same pipeline with and without the IDF weighting of the Tfidf transform (its `use_idf` parameter). Let's do that, but this time let's do it using a grid search.

Run the following code.

In [None]:
from sklearn.model_selection import GridSearchCV
grid_params = {
    'tfidf__use_idf': (True, False),
}
search = GridSearchCV(text_pipe_wstops, grid_params)
search.fit(X_train, y_train);
print(search.best_params_)
print(search.score(X_test, y_test))

Looks like, for this data, we don't really need the IDF weighting of the Tfidf transform either!

*Note*: This is often not the case.
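
To make the `use_idf` switch more concrete, here is a plain-Python sketch of the weighting it toggles. It assumes the smoothed formula $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$ followed by L2 normalization of each row, which matches the documented defaults of `TfidfTransformer`; the count matrix and the helper name `tfidf_rows` are made up for illustration.

```python
import math


def tfidf_rows(count_rows, use_idf=True):
    """Weight term counts by smoothed idf (optional), then L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    if use_idf:
        # Document frequency: number of documents containing each term.
        doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
        weights = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    else:
        weights = [1.0] * n_terms

    result = []
    for row in count_rows:
        scaled = [count * weight for count, weight in zip(row, weights)]
        norm = math.sqrt(sum(value * value for value in scaled)) or 1.0
        result.append([value / norm for value in scaled])
    return result


# Hypothetical counts; columns stand for the terms "app", "love", "hate".
counts = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]

with_idf = tfidf_rows(counts, use_idf=True)
without_idf = tfidf_rows(counts, use_idf=False)
print(with_idf[0])
print(without_idf[0])
```

With IDF weighting, the term that appears in every document ("app") is down-weighted relative to rarer terms; without it, the terms in a row are weighted by their counts alone.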