# Modern Methods for Text Classification

Text classification is a very common task. Its applications include sentiment analysis, tagging and filtering news articles, and detecting fraudulent reviews (on websites such as Amazon), to name a few.

For simplicity, we will work with a sentiment analysis dataset. For the same reason, accuracy is the only metric we evaluate here.

We have a curated set of movie reviews from IMDb. Each review is marked as either positive or negative. This label, of course, captures the overall sentiment of the review.

Beginning with this section, we will rely more and more on Machine Learning instead of linguistic analysis alone. In writing this, I assume that you have some basic familiarity with Python packages like [scikit-learn](http://scikit-learn.org/).

If you don't, that's fine too. The intent here is to give you a quick reference for how these APIs work and to save you time in figuring out what to learn.

You can and should learn to use such functions well, even without knowing all the nitty-gritty of the underlying math. For now, you can treat these functions as black boxes.

### Simple Classifiers

We begin by simply trying a few machine learning classifiers such as Logistic Regression, Naive Bayes, and Decision Trees.
Next, we try the Random Forest and Extra Trees classifiers. For all of these implementations, we use nothing except scikit-learn.


### Optimizing Simple Classifiers

We can tweak the simple classifiers above to improve their performance. For this, the most common method is to try several slightly different versions of the classifier. We do this by changing the parameters of our classifier. 

We will learn how to automate this "search" for the best classifier parameters using *GridSearch* and *RandomizedSearch*.

### Ensemble Methods

Ensembling several different classifiers means that we use a group of models instead of a single one. It is a very popular and easy-to-understand machine learning technique, and it is part of almost every winning Kaggle solution.

Despite initial concerns that ensembles might be slow, some teams working on commercial software have begun using them in production as well. This is because ensembling requires very little overhead, is easy to parallelize, and allows for a built-in fallback to a single model.

We will look at some of the simplest ensembling techniques, based on a simple majority (also known as a voting ensemble), and build on that.

In summary, this **Machine Learning for NLP** section covers simple classifiers, parameter optimization, and ensemble methods.

In [1]:
from pathlib import Path
import pandas as pd
import gzip
from urllib.request import urlretrieve
from tqdm import tqdm
import os
import numpy as np
# if you are using the fastAI environment, all of these imports work

In [2]:
class TqdmUpTo(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None: self.total = tsize
        self.update(b * bsize - self.n)

In [3]:
def get_data(url, filename):
    """
    Download data if the filename does not exist already
    Uses Tqdm to show download progress
    """
    if not os.path.exists(filename):

        dirname = os.path.dirname(filename)
        if not os.path.exists(dirname):
            os.makedirs(dirname)

        with TqdmUpTo(unit='B', unit_scale=True, miniters=1, desc=url.split('/')[-1]) as t:
            urlretrieve(url, filename, reporthook=t.update_to)

In [4]:
# Let's download some data:
data_url = 'http://files.fast.ai/data/aclImdb.tgz'
get_data(data_url, 'data/imdb.tgz')
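
The download cell only fetches a compressed archive, while the cells that follow read from `data/imdb/aclImdb`. A minimal sketch of the unpacking step is shown here. The helper name `extract_archive` is hypothetical, and it assumes that the archive contains a top-level `aclImdb` folder, as the path used later suggests.

```python
import tarfile
from pathlib import Path

def extract_archive(archive_path, target_dir):
    """Unpack a gzip-compressed tar archive into target_dir.

    The extraction is skipped when target_dir already exists.
    """
    target_dir = Path(target_dir)
    if target_dir.exists():
        return target_dir
    target_dir.mkdir(parents=True)
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(path=target_dir)
    return target_dir

# Assumed usage in this notebook (not run here, it needs the downloaded file):
# extract_archive('data/imdb.tgz', 'data/imdb')
```

Skipping the work when the folder exists mirrors what `get_data` does for the download, so re-running the notebook stays cheap.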

Once the archive is extracted into `data/imdb`, let's see what the directory contains:

In [10]:
data_path = Path(os.getcwd())/'data'/'imdb'/'aclImdb'
assert data_path.exists()
for pathroute in os.walk(data_path):
    next_path = pathroute[1]
    for stop in next_path:
        print(stop)

test
train
all
neg
pos
all
neg
pos
unsup


This rather crude utility tells us that there are at least two folders: `train` and `test`. Each of these folders in turn has at least 3 folders:
```bash
Test
|- all
|- neg
|- pos
```
and

```bash
Train
|- all
|- neg
|- pos
|- unsup
```
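
The flat listing printed earlier loses the nesting, which is why the trees above had to be written out by hand. As a hedged alternative, the sketch below builds an indented listing with `pathlib`. The function name `dir_tree_lines` is hypothetical, and the demo runs on a throwaway folder that only mimics part of the layout.

```python
import tempfile
from pathlib import Path

def dir_tree_lines(root, max_depth=2, depth=0):
    """Return one line per sub-directory of root, indented by nesting level."""
    lines = []
    if depth >= max_depth:
        return lines
    for child in sorted(Path(root).iterdir()):
        if child.is_dir():
            lines.append('   ' * depth + '|- ' + child.name)
            lines.extend(dir_tree_lines(child, max_depth, depth + 1))
    return lines

# Demo on a temporary folder with a few of the sub-folders seen above
with tempfile.TemporaryDirectory() as tmp:
    for sub in ('train/pos', 'train/neg', 'test/pos', 'test/neg'):
        (Path(tmp) / sub).mkdir(parents=True)
    demo_lines = dir_tree_lines(tmp)

print('\n'.join(demo_lines))
```

In the notebook, `print('\n'.join(dir_tree_lines(data_path)))` would be the equivalent call.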

The `pos` and `neg` folders contain reviews which are positive and negative respectively. The `unsup` folder stands for unsupervised: it holds unlabeled reviews, which are useful for building language models, especially for Deep Learning. We will not use it here. Similarly, the folder `all` is redundant because its reviews are repeated in the `pos` and `neg` folders.

# Read Data 

In [11]:
train_path = data_path/'train'
test_path = data_path/'test'

In [14]:
def read_data(dir_path):
    """read data into pandas dataframe"""
    
    def load_dir_reviews(reviews_path):
        """Read every review file in a folder into a one-column dataframe."""
        reviews = []
        for filename in reviews_path.iterdir():
            # the context manager closes each file handle after reading
            with open(filename, 'r', encoding='utf-8') as f:
                reviews.append(f.read())
        return pd.DataFrame({'text': reviews})
        
    
    pos_path = dir_path/'pos'
    neg_path = dir_path/'neg'
    
    pos_reviews, neg_reviews = load_dir_reviews(pos_path), load_dir_reviews(neg_path)
    
    pos_reviews['label'] = 1
    neg_reviews['label'] = 0
    
    merged = pd.concat([pos_reviews, neg_reviews])
    merged.reset_index(inplace=True)
    
    return merged

In [15]:
train = read_data(train_path)
test = read_data(test_path)

In [16]:
X_train, y_train = train['text'], train['label']
X_test, y_test = test['text'], test['label']

In [17]:
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

## Logistic Regression
This is the simplest of all: we replicate the exact steps that we saw in Chapter 01.

Feature Extraction: 
- Bag of Words
- TF-IDF

In [18]:
from sklearn.linear_model import LogisticRegression as LR

In [19]:
lr_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',LR())])

We saw the Pipeline in our introductory section. A `Pipeline` lets us chain multiple operations into a single Python object.

#### !TIP
We are able to call functions like `fit`, `predict` and `fit_transform` on our `Pipeline` objects because the Pipeline passes the data through each intermediate step in order and then calls the corresponding function of the last component in the list.
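
To make the tip concrete, here is a small sketch that runs the same three steps once through a `Pipeline` and once by hand. The toy reviews are made up for illustration and are not part of the IMDb data.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, only for illustration
toy_texts = ["good movie", "bad movie", "really good acting", "really bad acting"]
toy_labels = [1, 0, 1, 0]
new_texts = ["good acting", "bad plot"]

# Pipeline version: one fit call, one predict call
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', LogisticRegression())])
pipe.fit(toy_texts, toy_labels)
pipe_pred = pipe.predict(new_texts)

# Manual version: fit_transform on each transformer, then fit on the classifier
vect, tfidf, clf = CountVectorizer(), TfidfTransformer(), LogisticRegression()
counts = vect.fit_transform(toy_texts)
weighted = tfidf.fit_transform(counts)
clf.fit(weighted, toy_labels)
# At prediction time the transformers only transform, they are not refit
manual_pred = clf.predict(tfidf.transform(vect.transform(new_texts)))

print(np.array_equal(pipe_pred, manual_pred))
```

Both routes give the same predictions. The Pipeline simply removes the bookkeeping of passing intermediate results along.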

In [20]:
%%time
lr_clf.fit(X=X_train, y=y_train) # note that .fit function calls are inplace, and the Pipeline is not re-assigned

Wall time: 5.48 s


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

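A quick note on three related methods:
- `fit` learns the parameters of a step (or of the whole Pipeline) from the training data.
- `fit_transform` is available on transformers such as `CountVectorizer` and `TfidfTransformer`. It fits the transformer and returns the transformed data in one call.
- `partial_fit` is offered only by some estimators, for example `MultinomialNB`. It updates a model with one batch of data at a time, which helps when the dataset does not fit in memory. `Pipeline` itself does not provide `partial_fit`.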
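
As a hedged illustration of `partial_fit`, the sketch below trains on small batches. It swaps in `HashingVectorizer` and `SGDClassifier`, which are not used elsewhere in this notebook: the hashing vectorizer needs no fitting, so it suits batch-wise training, and `SGDClassifier` supports `partial_fit`. The toy reviews and the batch size are assumptions made for the example.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data, only for illustration
toy_texts = ["great movie, loved it", "terrible plot and bad acting",
             "wonderful and moving", "boring, awful waste of time"]
toy_labels = [1, 0, 1, 0]

# HashingVectorizer is stateless, so transform works without a prior fit
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
clf = SGDClassifier(random_state=0)

batch_size = 2
for start in range(0, len(toy_texts), batch_size):
    X_batch = vectorizer.transform(toy_texts[start:start + batch_size])
    y_batch = toy_labels[start:start + batch_size]
    # every possible label must be declared, since a batch may miss some
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])

preds = clf.predict(vectorizer.transform(["loved it"]))
print(preds.shape)
```

With real data, each batch would typically be read from disk inside the loop instead of being sliced from a list held in memory.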

In [21]:
lr_predicted = lr_clf.predict(X_test)

As mentioned earlier, we are calling the `predict` function on our Pipeline. The test reviews go through the same pre-processing steps (here, `CountVectorizer()` and `TfidfTransformer()`) as the reviews did during training.

This simplicity makes `Pipeline` one of the most frequently used abstractions in software-grade machine learning. In some research or experimentation use cases, users might prefer to execute each step independently or build their own Pipeline equivalents.

In [71]:
lr_acc = sum(lr_predicted == y_test)/len(lr_predicted)
lr_acc

0.88316

**How do we find our model accuracy?**
Let's take a quick look at what is happening in the line above.

Consider that our predictions are `[1, 1, 1]` and the ground truth is `[1, 0, 1]`. The element-wise equality returns a sequence of booleans: `[True, False, True]`. (Here it is a pandas Series rather than a plain Python list, which is why `==` compares element by element.) When we `sum` booleans in Python, each `True` counts as 1, so we get the exact count of correct predictions.

Dividing this value by the total number of predictions made (or, equivalently, the number of test reviews) gives us our accuracy.

Let's wrap this two-line logic in a simple, lightweight function to calculate accuracy. This prevents us from repeating the logic.

In [65]:
from sklearn.metrics import accuracy_score

In [66]:
def imdb_acc(pipeline_clf):
    predictions = pipeline_clf.predict(X_test)
    assert len(y_test) == len(predictions)
    return sum(predictions == y_test)/len(y_test)
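
The notebook imports `accuracy_score` but computes accuracy by hand. The short sketch below, using the three-element example from the explanation above, suggests that both routes give the same number.

```python
import numpy as np
from sklearn.metrics import accuracy_score

predictions = np.array([1, 1, 1])
ground_truth = np.array([1, 0, 1])

# element-wise comparison gives one boolean per prediction
matches = predictions == ground_truth
manual_acc = matches.sum() / len(ground_truth)

# scikit-learn's helper takes the true labels first, then the predictions
sklearn_acc = accuracy_score(ground_truth, predictions)

print(manual_acc, sklearn_acc)
```

Two of the three predictions match, so both values come out as 2/3.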

### Remove Stop Words

In [24]:
lr_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf',LR())])

In [25]:
lr_clf.fit(X=X_train, y=y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [27]:
imdb_acc(lr_clf)

0.879

### Increase the Ngram Range

In [28]:
lr_clf = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range=(1,3))), ('tfidf', TfidfTransformer()), ('clf',LR())])

In [29]:
lr_clf.fit(X=X_train, y=y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None, stop_words='english',
        ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [30]:
imdb_acc(lr_clf)

0.86596

# Multinomial Naive Bayes

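Note: Why is this classifier called Naive? It assumes that every feature (here, every word) is conditionally independent of the others given the class, which is rarely true for text. There are more powerful and complex methods involving Bayesian approaches.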

In [31]:
from sklearn.naive_bayes import MultinomialNB as MNB
mnb_clf = Pipeline([('vect', CountVectorizer()), ('clf',MNB())])

In [32]:
mnb_clf.fit(X=X_train, y=y_train)
imdb_acc(mnb_clf)

0.81356

### Add TF-IDF

Now, let's try the above model with TF-IDF as another step after the Bag of Words (unigrams).

In [33]:
mnb_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',MNB())])

In [34]:
mnb_clf.fit(X=X_train, y=y_train)
imdb_acc(mnb_clf)

0.82956

### Remove Stop Words

In [35]:
mnb_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf',MNB())])

In [36]:
mnb_clf.fit(X=X_train, y=y_train)
imdb_acc(mnb_clf)

0.82992

### Add Ngram Range from 1 to 3

In [37]:
mnb_clf = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range=(1,3))), ('tfidf', TfidfTransformer()), ('clf',MNB())])

In [38]:
mnb_clf.fit(X=X_train, y=y_train)
imdb_acc(mnb_clf)

0.8572

### Change Fit Prior to False

In [43]:
mnb_clf = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range=(1,3))), ('tfidf', TfidfTransformer()), ('clf',MNB(fit_prior=False))])

In [44]:
mnb_clf.fit(X=X_train, y=y_train)
imdb_acc(mnb_clf)

0.8572

In the above example, we made small modifications to try out a few combinations in our Pipeline.

We thought of each combination which might improve our performance. Increasing the `ngram_range` did work, while switching the class prior from one fitted on the training labels to a uniform one (by changing `fit_prior` to False) did not change the accuracy at all. That is not surprising when the two classes are about equally frequent in the training data, because the fitted prior is then already close to uniform. This approach is tedious, and slightly error-prone because it relies too much on human intuition about the underlying data and the machine learning model being correct.

### Why don't we try Gaussian Naive Bayes?

scikit-learn's Gaussian Naive Bayes expects the feature matrix (our TF-IDF) to be a dense array. Owing to the nature of text (where every word is a feature), this is not the case: our TF-IDF matrix is sparse, with most entries equal to zero.

Additionally, Gaussian Naive Bayes models each feature with a Gaussian distribution, and our features are not even close to that.

We don't use Gaussian Naive Bayes for text classification because our data does not meet its requirements and assumptions.
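
A small sketch of the dense-versus-sparse point, on made-up toy reviews. It uses `TfidfVectorizer`, which combines `CountVectorizer` and `TfidfTransformer` in one step, and checks how the two Naive Bayes variants react to the sparse matrix it returns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Hypothetical toy data, only for illustration
toy_texts = ["good movie", "bad movie", "really good acting", "really bad plot"]
toy_labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(toy_texts)  # a SciPy sparse matrix

# fraction of entries that are non-zero; it shrinks as the vocabulary grows
density = X.nnz / (X.shape[0] * X.shape[1])
print(f'non-zero fraction: {density:.2f}')

# MultinomialNB accepts the sparse matrix directly
MultinomialNB().fit(X, toy_labels)

# GaussianNB rejects sparse input and asks for a dense array instead
try:
    GaussianNB().fit(X, toy_labels)
    gaussian_accepts_sparse = True
except (TypeError, ValueError) as err:
    gaussian_accepts_sparse = False
    print(type(err).__name__)

# Converting to dense works on toy data, but it stores every zero explicitly
GaussianNB().fit(X.toarray(), toy_labels)
```

On the full review set, calling `toarray()` would materialize every zero entry, so memory use grows with the number of reviews times the vocabulary size.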

### Support Vector Machine
Prior work such as that by [T Joachims](https://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf), with over 9K citations, recommends Support Vector Classifiers for text classification.

It's difficult to estimate from such literature whether it will be equally effective for us, due to differences in the dataset and pre-processing steps. Let's give it a shot nevertheless:

In [45]:
from sklearn.svm import SVC

In [48]:
svc_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',SVC())])

In [50]:
# %%time
# svc_clf.fit(X=X_train, y=y_train)
# Wall time: 14min 23s

Wall time: 14min 23s


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [52]:
# %%time
# svc_acc = imdb_acc(svc_clf)
# print(svc_acc)
# 0.6562
# Wall time: 13min 4s

0.6562
Wall time: 13min 4s


We use `SVC` with its default settings here, which means a non-linear RBF kernel and no parameter tuning, so this result says little about SVMs in general. We still wanted to give it a try for completeness.

Here, the SVM does not perform well, and it took a really long time to train (~150x that of most other classifiers). We will not look at SVM for this particular dataset again.

## Tree Based Models

### Decision Trees

In [51]:
from sklearn.tree import DecisionTreeClassifier as DTC
dtc_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',DTC())])

In [53]:
dtc_clf.fit(X=X_train, y=y_train)
imdb_acc(dtc_clf)

0.7026

## Random Forest Classifier 

In [54]:
from sklearn.ensemble import RandomForestClassifier as RFC
rfc_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',RFC())])

In [55]:
rfc_clf.fit(X=X_train, y=y_train)
imdb_acc(rfc_clf)

0.72428

## Extra Trees Classifier 

In [56]:
from sklearn.ensemble import ExtraTreesClassifier as XTC
xtc_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',XTC())])

In [57]:
xtc_clf.fit(X=X_train, y=y_train)
imdb_acc(xtc_clf)

0.751

# Automatically Fine Tuning 

Let's focus on our best performing model, Logistic Regression, and see if we can push its performance a little more.

The best performance for our model was **0.88316** accuracy earlier.

We are using the phrases parameter search and hyperparameter search interchangeably here. This is done to stay consistent with the Deep Learning vocabulary.

We want to select the best performing configuration of our pipeline. Each configuration might be different in small ways, like removing stop words or including bigrams and trigrams.

The total number of such configurations can be fairly large, running into a few thousand. Instead of manually selecting a few combinations to try, we could try all of these several thousand combinations *and* evaluate each one. This exhaustive approach is known as Grid Search.

It is too slow for most small-scale experiments such as ours. In large experiments, the possible space can run into millions of combinations and several days of computing, making it cost- and time-prohibitive.

I strongly urge you to read this blog on [Hyperparameter Tuning](https://www.oreilly.com/ideas/evaluating-machine-learning-models/page/5/hyperparameter-tuning) to become familiar with the vocabulary and ideas in the space beyond what is discussed here. 

### RandomizedSearch

An alternative was proposed by [*Bergstra & Bengio, 2012*](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf). They demonstrated that Random Search across a large hyperparameter space is more effective than manual search (as we did for Multinomial Naive Bayes) and often as effective as, or more effective than, Grid Search.

**How do we use it here?**
We build on results such as those of Bergstra et al. and break down our parameter search into two steps:
- Step 1: Run a Randomized Search to go through a wide parameter combination space in a limited number of iterations.
- Step 2: Use the results above to run a GridSearch in a narrower space.

We can repeat the above steps until we stop seeing improvements in our results, but we won't do that here. We leave that as an exercise for the reader.

In [58]:
from sklearn.model_selection import RandomizedSearchCV

#### How to prepare the param_grid?
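Each key in the `param_grid` dictionary has the form `<step name>__<parameter name>`, with a double underscore in between. The step names are the ones we gave when building the Pipeline (`vect`, `tfidf`, `clf`). Each value is the list of candidate settings for that parameter.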
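
One way to find valid parameter names is to ask the Pipeline itself. The sketch below lists every tunable name through `get_params` and sets two of them with `set_params`. The pipeline here is a fresh, unfitted copy built only for this illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

demo_pipe = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', LogisticRegression())])

# Names containing a double underscore are "<step name>__<parameter name>"
tunable_names = sorted(name for name in demo_pipe.get_params() if '__' in name)
for name in tunable_names:
    print(name)

# The same names work with set_params, which is what the search classes rely on
demo_pipe.set_params(clf__C=50, vect__ngram_range=(1, 3))
print(demo_pipe.named_steps['clf'].C)
```

Any name printed by the loop can be used as a key in `param_grid`.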

In [59]:
param_grid = dict(clf__C=[50, 75, 85, 100], 
                  vect__stop_words=['english', None],
                  vect__ngram_range = [(1, 1), (1, 3)],
                  vect__lowercase = [True, False],
                 )

In [60]:
random_search = RandomizedSearchCV(lr_clf, param_distributions=param_grid, n_iter=5, scoring='accuracy', n_jobs=-1, cv=3)

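- What does `cv` do? Passing an integer `cv` for a classifier causes `StratifiedKFold` with that many folds to be used for evaluating the scoring metric. With `cv=3`, each sampled configuration is trained and scored 3 times.
- What does `n_iter` do? It is the number of parameter combinations sampled from `param_grid`. Our grid holds 4 x 2 x 2 x 2 = 32 combinations, and we try only 5 of them.
- What does `scoring` do? It names the metric used to rank the configurations, accuracy in our case.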
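
To see `n_iter` and `cv` at work, here is a sketch on a tiny made-up dataset. The texts, the grid values and `random_state=0` are assumptions for this illustration only. The point is the shape of `cv_results_`, not the scores.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical toy data: six positive and six negative snippets
toy_texts = ["good movie", "great film", "loved the acting",
             "wonderful story", "really good plot", "great fun",
             "bad movie", "awful film", "hated the acting",
             "terrible story", "really bad plot", "boring mess"]
toy_labels = [1] * 6 + [0] * 6

toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression())])

# 3 x 2 = 6 possible combinations in total
toy_grid = dict(clf__C=[0.1, 1, 10],
                vect__ngram_range=[(1, 1), (1, 2)])

toy_search = RandomizedSearchCV(toy_pipe, param_distributions=toy_grid,
                                n_iter=3, scoring='accuracy', cv=3,
                                random_state=0)
toy_search.fit(toy_texts, toy_labels)

# n_iter controls how many combinations were sampled
n_sampled = len(toy_search.cv_results_['params'])
# cv controls how many held-out scores each combination receives
fold_keys = [key for key in toy_search.cv_results_
             if key.startswith('split') and key.endswith('_test_score')]
print(n_sampled, len(fold_keys))
```

Three of the six combinations are sampled, and each is scored on three folds, so the search performs nine fits plus one final refit of the best configuration.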

In [61]:
%%time
random_search.fit(X_train, y_train)

Wall time: 3min 46s


RandomizedSearchCV(cv=3, error_score='raise',
          estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None, stop_words='english',
        ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
          fit_params=None, iid=True, n_iter=5, n_jobs=-1,
          param_distributions={'clf__C': [50, 75, 85, 100], 'vect__stop_words': ['english', None], 'vect__ngram_range': [(1, 1), (1, 3)], 'vect__lowercase': [True, False]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring='accuracy', verbose=0)

In [78]:
print(f'Calculated cross-validation accuracy: {random_search.best_score_}')

Calculated cross-validation accuracy: 0.87396


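To compare this classifier with the ones we have already seen, we need to train it on the complete training set and test it on the same test split as earlier. With the default `refit=True`, `best_estimator_` has already been refit on the full training data, so the explicit `fit` below is redundant but harmless. We do this next: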

In [79]:
best_random_clf = random_search.best_estimator_

In [80]:
best_random_clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,
        stri...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [81]:
imdb_acc(best_random_clf)

0.89916

We see that the classifier's accuracy improves by more than one percentage point (from 0.88316 to 0.89916) simply by changing very few parameters. This is amazing.

Let's see what parameters were selected. To compare them properly, you would need to know the default values for all of the parameters. Alternatively, we can simply look at the parameters from the `param_grid` that we wrote and note the selected values. Everything not in the grid keeps its default value.

In [82]:
best_random_clf.steps

[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
          dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
          lowercase=False, max_df=1.0, max_features=None, min_df=1,
          ngram_range=(1, 3), preprocessor=None, stop_words=None,
          strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
          tokenizer=None, vocabulary=None)),
 ('tfidf',
  TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
 ('clf',
  LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
            intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
            penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
            verbose=0, warm_start=False))]

We notice that in the best classifier:
- the chosen `C` value in `clf` is 100,
- `lowercase` is set to False,
- stop words are not removed, which suggests that removing them is a bad idea here, and
- adding bigrams and trigrams helps.

Observations like these are very specific to this dataset and classifier pipeline. In my experience, this can and does vary widely.

We also cannot assume that these values are the best possible ones when we run `RandomizedSearch` for so few iterations. The rule of thumb is to run it for at least **60 iterations**, and to use a much larger `param_grid` as well.

We used RandomizedSearch to understand the broad layout of parameters we want to try. We add the best values for some of those to our pipeline itself and continue to experiment with values of other parameters. 

We will now run GridSearch for these selected parameters. Here, on a whim, I am choosing to include bigrams and trigrams while running the grid search over the parameter `C` of LogisticRegression.

**!TIP**

I have not mentioned what the parameter `C` stands for or how it influences the classifier. This is definitely important to understand while doing a manual parameter search. I could notice that changing `C` helps simply by trying out different values.

But our intention here is to automate as much as possible. Instead, I include a range of values for `C` in our `RandomizedSearch`. We are trading off human learning time (maybe a few hours) against compute time (maybe a few extra minutes). This mindset saves us both time and effort.

In [87]:
lr_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,3))), ('tfidf', TfidfTransformer()), ('clf',LR())])

In [None]:
from sklearn.model_selection import GridSearchCV

In [89]:
param_grid = dict(clf__C=[85, 100, 125, 150])
grid_search = GridSearchCV(lr_clf, param_grid=param_grid, scoring='accuracy', n_jobs=-1, cv=3)

In [None]:
%%time
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_estimator_.score(X_test, y_test)  # mean accuracy of the refit best pipeline on the test set

# Ensemble Models 
**Ensembling models** is a very powerful technique to improve your model performance across a variety of Machine Learning tasks. 

In the section below, I borrow heavily from the [Kaggle Ensembling Guide](https://mlwave.com/kaggle-ensembling-guide/) written by [MLWave](https://mlwave.com/).

I explain why ensembling helps reduce error, or improve accuracy. I demonstrate all the popular techniques on our chosen task and dataset. Each demo includes the ensembling code (borrowed from MLWave) and performance gain from those. 

To ensure that you understand these techniques, I strongly urge you to try them on a few datasets. 

## Voting Ensemble

### Simple Majority (aka Hard Voting)
The simplest ensembling technique is perhaps to take a simple majority. This works on the intuition that a single model might make an error on a particular prediction, but several different models are unlikely to make identical errors.

Let's look at an example. 

Ground truth: 1**1**011001

Let's assume there are 3 models, and only one of them makes an error in this example.

Model A Prediction: 1**0**011001

Model B Prediction: 1**1**011001

Model C Prediction: 1**1**011001

The majority vote gives us the correct answer in this example:

Majority vote: 1**1**011001

---

We have fitted pipelines from the sections above. Let's combine three of them with scikit-learn's `VotingClassifier`:

In [None]:
from sklearn.ensemble import VotingClassifier

In [None]:
%%time
voting_clf = VotingClassifier(estimators=[('lr', lr_clf), ('xtc', xtc_clf), ('mnb', mnb_clf)], voting='hard')
voting_clf.fit(X_train, y_train)

#### Soft Voting

In [None]:
%%time
voting_clf = VotingClassifier(estimators=[('lr', lr_clf), ('xtc', xtc_clf), ('mnb', mnb_clf)], voting='soft')
voting_clf.fit(X_train, y_train)
voting_predictions = voting_clf.predict(X_test)
sum(voting_predictions == y_test)/len(y_test)

### Weighted Classifiers

In [None]:
%%time
voting_clf = VotingClassifier(estimators=[('lr', lr_clf), ('lr2', lr_clf), ('xtc', xtc_clf), ('mnb2', mnb_clf), ('mnb', mnb_clf)], voting='soft')
voting_clf.fit(X_train, y_train)
voting_predictions = voting_clf.predict(X_test)
sum(voting_predictions == y_test)/len(y_test)
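
The cell above gives more weight to some classifiers by listing them twice. `VotingClassifier` also accepts a `weights` argument, which avoids training the same model twice. The sketch below, on made-up toy reviews with two simple pipelines, suggests that both routes yield the same soft-voting probabilities.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier

# Hypothetical toy data, only for illustration
toy_texts = ["good movie", "great film", "loved it", "bad movie", "awful film", "hated it"]
toy_labels = [1, 1, 1, 0, 0, 0]
new_texts = ["good film", "awful movie"]

toy_lr = Pipeline([('vect', CountVectorizer()), ('clf', LogisticRegression())])
toy_mnb = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])

# Route 1: list the logistic regression pipeline twice
duplicated = VotingClassifier(
    estimators=[('lr', toy_lr), ('lr2', toy_lr), ('mnb', toy_mnb)], voting='soft')
duplicated.fit(toy_texts, toy_labels)

# Route 2: list it once and give it twice the weight
weighted = VotingClassifier(
    estimators=[('lr', toy_lr), ('mnb', toy_mnb)], voting='soft', weights=[2, 1])
weighted.fit(toy_texts, toy_labels)

same_probabilities = np.allclose(duplicated.predict_proba(new_texts),
                                 weighted.predict_proba(new_texts))
print(same_probabilities)
```

The match is exact only for deterministic models. A randomized model such as Extra Trees would differ between its two copies unless its `random_state` is fixed.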

We get an improvement from our simple ensembling technique but nothing significant. Why would this be?

#### Removing Correlated Classifiers 

To see this, let us take 3 simple models again. The ground truth is all 1’s:
```
1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy
```

These models are highly correlated in their predictions. When we take a majority vote we see no improvement:

```
1111111100 = 80% accuracy
```

Now we compare this to 3 lower-performing but highly uncorrelated models:

```
1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy
```

When we ensemble this with a majority vote we get:

```
1111111101 = 90% accuracy
```

We get an accuracy which is higher than that of any of our individual models. Low correlation between model predictions can lead to better performance.
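
The two toy ensembles above can be checked with a few lines of NumPy. This is a minimal sketch: the helper names `to_array` and `majority_vote` are hypothetical, and the ground truth is all 1's as in the example.

```python
import numpy as np

def to_array(bits):
    """Turn a string such as '1011' into an integer array."""
    return np.array([int(ch) for ch in bits])

def majority_vote(predictions):
    """Return 1 wherever more than half of the models predict 1."""
    stacked = np.vstack(predictions)
    return (stacked.sum(axis=0) * 2 > stacked.shape[0]).astype(int)

truth = np.ones(10, dtype=int)
correlated = [to_array(s) for s in ("1111111100", "1111111100", "1011111100")]
uncorrelated = [to_array(s) for s in ("1111111100", "0111011101", "1000101111")]

correlated_acc = (majority_vote(correlated) == truth).mean()
uncorrelated_acc = (majority_vote(uncorrelated) == truth).mean()
print(correlated_acc, uncorrelated_acc)

# Pearson correlation between the first and third model of each ensemble
print(np.corrcoef(correlated[0], correlated[2])[0, 1])
print(np.corrcoef(uncorrelated[0], uncorrelated[2])[0, 1])
```

The accuracies come out as 0.8 for the correlated ensemble and 0.9 for the uncorrelated one, matching the hand-worked example.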

In [None]:
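mnb_predicted = mnb_clf.predict(X_test)  # lr_predicted was computed earlier from the first Logistic Regression pipeline
np.corrcoef(mnb_predicted, lr_predicted)[0][1]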