# Modern Methods for Text Classification

Topics covered in this chapter:
- Classification: the bread and butter of NLP
- Classic machine learning: LR, DT, RFC, xgboost
- How to select the best classifiers using ROC curves
- Ensemble methods: the intuition
- Ensemble methods: programming our own ensembles

Skills you will learn:
- Using machine learning classifiers
- scikit-learn
- Model evaluation basics (deeper dive in the Model Understanding section)
- Ensemble methods: intuition
- Writing our own ensemble implementation

In [1]:
import gzip
from urllib.request import urlretrieve
from tqdm import tqdm
import os
import numpy as np
# if you are using the fastAI environment, all of these imports work
from pathlib import Path
import pandas as pd

In [2]:
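class TqdmUpTo(tqdm):
    """tqdm progress bar whose update_to matches urlretrieve's reporthook signature."""
    def update_to(self, b=1, bsize=1, tsize=None):
        # b: blocks transferred so far, bsize: block size in bytes, tsize: total size in bytes
        if tsize is not None: self.total = tsize
        self.update(b * bsize - self.n)  # advance by the bytes received since the last call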

In [3]:
def get_data(url, filename):
    """
    Download data if the filename does not exist already
    Uses Tqdm to show download progress
    """
    if not os.path.exists(filename):

        dirname = os.path.dirname(filename)
        # dirname is '' when filename has no directory part; os.makedirs('') would raise
        if dirname:
            os.makedirs(dirname, exist_ok=True)

        with TqdmUpTo(unit='B', unit_scale=True, miniters=1, desc=url.split('/')[-1]) as t:
            urlretrieve(url, filename, reporthook=t.update_to)

In [4]:
# Let's download some data:
data_url = 'http://files.fast.ai/data/aclImdb.tgz'
get_data(data_url, 'data/imdb.tgz')

## Manual step
Extract the downloaded archive into `data/imdb`. The extraction tool depends on your operating system: I used 7z on Windows and dtrx on Ubuntu 16.04 LTS.
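
If you would rather stay inside the notebook, Python's standard `tarfile` module can do the extraction. The following is a minimal sketch, assuming the archive was saved as `data/imdb.tgz` and should unpack into `data/imdb`; `extract_archive` is a helper name made up for this illustration. Only extract archives you trust.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir (hypothetical helper)."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(dest_dir)
    return dest_dir


# Assumed paths, matching the download cell above:
# extract_archive('data/imdb.tgz', 'data/imdb')
```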

Let's see what the directory contains:

In [5]:
data_dir = !pwd 
data_dir = Path(data_dir[0])/'data'/'imdb'/'aclImdb'
assert data_dir.exists()

In [6]:
for pathroute in os.walk(data_dir):
    next_path = pathroute[1]
    for stop in next_path:
        print(stop)

test
train
pos
all
neg
unsup
pos
all
neg


This crude listing tells us that there are at least two folders: `train` and `test`. Each of them in turn contains at least three folders:
```bash
test
|- all
|- neg
|- pos
```
and

```bash
train
|- all
|- neg
|- pos
|- unsup
```

The `pos` and `neg` folders contain the positive and negative reviews respectively. The `unsup` folder (short for unsupervised) holds unlabeled reviews, which are useful for building language models, especially for deep learning; we will not use them here. The `all` folder is redundant because its reviews are repeated in the `pos` and `neg` folders.
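
The folder listing can also be produced with indentation, which makes the nesting easier to read. This is a small sketch using `pathlib`; `list_subdirs` and its `max_depth` parameter are names invented here, not part of the original notebook.

```python
from pathlib import Path


def list_subdirs(root, max_depth=2, _depth=0):
    """Return the names of sub-directories under root, indented by nesting level."""
    lines = []
    if _depth >= max_depth:
        return lines
    for child in sorted(Path(root).iterdir()):
        if child.is_dir():
            lines.append('  ' * _depth + child.name)
            lines.extend(list_subdirs(child, max_depth, _depth + 1))
    return lines


# Assuming data_dir is defined as in the cell above:
# print('\n'.join(list_subdirs(data_dir)))
```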

# Read data into separate dataframes/strings

In [7]:
train_path = data_dir/'train'
test_path = data_dir/'test'
assert train_path.exists()
assert test_path.exists()

In [8]:
def load_data(dir_path):
    """Load the pos/neg reviews under dir_path into one DataFrame with text and label columns."""

    def load_dir_reviews(reviews_path):
        reviews = []
        for filename in reviews_path.iterdir():
            # the context manager closes each file after reading
            with open(filename, 'r', encoding='utf-8') as f:
                reviews.append(f.read())
        return pd.DataFrame({'text': reviews})

    pos_path = dir_path/'pos'
    neg_path = dir_path/'neg'
    pos_reviews, neg_reviews = load_dir_reviews(pos_path), load_dir_reviews(neg_path)
    pos_reviews['label'] = 1  # positive reviews
    neg_reviews['label'] = 0  # negative reviews
    merged = pd.concat([pos_reviews, neg_reviews])
    merged.reset_index(inplace=True)
    return merged

In [9]:
train = load_data(train_path)
test = load_data(test_path)

In [10]:
X_train, y_train = train['text'], train['label']
X_test, y_test = test['text'], test['label']

In [11]:
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

## Logistic Regression
This is the simplest of the models we try. We replicate the exact steps that we saw in Chapter 01.

Feature Extraction: 
- Bag of Words
- TF-IDF

In [12]:
from sklearn.linear_model import LogisticRegression as LR

### No preprocessing

In [13]:
lr_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',LR())])

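A `Pipeline` chains a sequence of transformers with a final estimator. Calling `fit` on the pipeline runs `fit_transform` on each intermediate step in order and then fits the classifier on the result; calling `predict` runs `transform` on each step before the classifier predicts. Here, `CountVectorizer` builds the bag of words, `TfidfTransformer` reweights the counts, and `LR` is the classifier.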
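
To make the role of the pipeline concrete, the sketch below runs the same three steps by hand and then through a `Pipeline`. The four toy documents and their labels are made up for illustration; they are not the IMDb data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for this example
docs = ['a great movie', 'great acting great plot', 'an awful movie', 'awful boring plot']
labels = [1, 1, 0, 0]

# Step by step: each stage is fitted on the output of the previous one
vect, tfidf, clf = CountVectorizer(), TfidfTransformer(), LogisticRegression()
counts = vect.fit_transform(docs)        # bag-of-words counts
weights = tfidf.fit_transform(counts)    # TF-IDF reweighting
clf.fit(weights, labels)
manual_pred = clf.predict(tfidf.transform(vect.transform(['great plot'])))

# The same chain expressed as a Pipeline
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', LogisticRegression())])
pipe.fit(docs, labels)
pipe_pred = pipe.predict(['great plot'])
```

Both routes should give the same prediction; the pipeline simply saves us from threading intermediate results through each stage and from accidentally calling `fit_transform` on test data.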


In [14]:
%%time
lr_clf.fit(X=X_train, y=y_train)

CPU times: user 15.3 s, sys: 736 ms, total: 16.1 s
Wall time: 6.57 s


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

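`fit` learns a model's parameters from the training data. Transformers such as `CountVectorizer` also provide `fit_transform`, which fits and then transforms the same data in a single call. Some estimators additionally offer `partial_fit`, which updates the model incrementally on one mini-batch at a time and is useful when the data does not fit in memory. Not every estimator supports it: `LogisticRegression` and `Pipeline` do not, while `SGDClassifier` and `MultinomialNB` do.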
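
As a hedged sketch of incremental training with `partial_fit`: since `CountVectorizer` has to see the whole corpus to build its vocabulary, this example swaps in the stateless `HashingVectorizer`, and uses `SGDClassifier` with its default hinge loss (a linear SVM rather than logistic regression). The toy documents, the batch size, and `n_features=2**10` are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy data, invented for this example
docs = ['great movie', 'awful movie', 'great fun', 'awful boring',
        'great acting', 'boring plot']
labels = [1, 0, 1, 0, 1, 0]

vectorizer = HashingVectorizer(n_features=2**10)  # stateless: no fit needed
clf = SGDClassifier(random_state=0)               # default loss is hinge, not logistic
classes = np.array([0, 1])                        # all labels must be declared up front

batch_size = 2
for start in range(0, len(docs), batch_size):
    X_batch = vectorizer.transform(docs[start:start + batch_size])
    y_batch = labels[start:start + batch_size]
    clf.partial_fit(X_batch, y_batch, classes=classes)  # update on this mini-batch only

pred = clf.predict(vectorizer.transform(['great film']))
```

In practice each batch would be read from disk or a stream instead of sliced from an in-memory list.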

In [15]:
lr_predicted = lr_clf.predict(X_test)
lr_acc = sum(lr_predicted == y_test)/len(lr_predicted)
lr_acc

0.88312
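
The accuracy above is computed by hand as the fraction of matching predictions. scikit-learn's `accuracy_score` computes the same quantity; the short sketch below checks this on a made-up pair of label arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions: 4 of the 5 predictions match
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

manual_acc = sum(y_pred == y_true) / len(y_pred)   # same formula as in the cell above
sklearn_acc = accuracy_score(y_true, y_pred)

# For a fitted classifier pipeline, lr_clf.score(X_test, y_test) also returns accuracy.
```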

### Remove Stopwords

In [16]:
lr_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf',LR())])
lr_clf.fit(X=X_train, y=y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [17]:
lr_predicted = lr_clf.predict(X_test)
lr_acc = sum(lr_predicted == y_test)/len(lr_predicted)
lr_acc

0.879

We notice that removing stop words slightly hurts the performance of the Logistic Regression: accuracy drops from 0.88312 to 0.879. This could be because the distribution of stop words differs between our target labels, so they carry some signal about the class.
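
One plausible contributor, offered here as a hypothesis rather than a measured result, is that scikit-learn's built-in English stop word list contains negations such as "not", which matter for sentiment. The sketch below inspects the list and shows what the analyzer keeps from a made-up sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Which of these negation words are treated as stop words?
negations = [w for w in ('not', 'no', 'never', 'nor') if w in ENGLISH_STOP_WORDS]

# The analyzer applies lowercasing, tokenization and stop word removal
analyzer = CountVectorizer(stop_words='english').build_analyzer()
tokens = analyzer('This movie is not good')
# "not" is dropped, so the negative sentence is reduced to ['movie', 'good']
```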

### Change ngram_range=(1, 3)
Let us include bigrams and trigrams in the `CountVectorizer` stage, with stop words still removed.
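
To see what `ngram_range=(1, 3)` produces, the sketch below runs the vectorizer's analyzer on a made-up sentence. Stop word removal is left off here so that the n-grams are easy to follow; note that the default tokenizer drops single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

analyzer = CountVectorizer(ngram_range=(1, 3)).build_analyzer()
ngrams = analyzer('not a good movie')
# unigrams first, then bigrams, then trigrams:
# ['not', 'good', 'movie', 'not good', 'good movie', 'not good movie']
```

Every distinct n-gram becomes its own feature column, so the vocabulary grows considerably compared with unigrams alone.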

In [18]:
lr_clf = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range=(1,3))), ('tfidf', TfidfTransformer()), ('clf',LR())])
lr_clf.fit(X=X_train, y=y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None, stop_words='english',
        ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [19]:
lr_predicted = lr_clf.predict(X_test)
lr_acc = sum(lr_predicted == y_test)/len(lr_predicted)
lr_acc

0.86596

### Keep Stop Words

In [20]:
lr_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,3))), ('tfidf', TfidfTransformer()), ('clf',LR())])
lr_clf.fit(X=X_train, y=y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,
        strip...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [21]:
lr_predicted = lr_clf.predict(X_test)
lr_acc = sum(lr_predicted == y_test)/len(lr_predicted)
lr_acc

0.87752

# Multinomial Naive Bayes

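Note: Why is this classifier called naive? It assumes that the features (here, word counts) are conditionally independent of each other given the class, which is rarely true for text. There are more powerful and complex methods involving Bayesian approaches.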
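
The independence assumption can be written out as follows. This is the standard textbook form of the Naive Bayes decision rule, with $x_1, \dots, x_n$ standing for the features of one document and $y$ for its class; the notation is introduced here and does not come from the original notebook.

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

Without the "naive" assumption, the product would have to be replaced by the joint likelihood $P(x_1, \dots, x_n \mid y)$, which is far harder to estimate from limited data.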

In [22]:
from sklearn.naive_bayes import MultinomialNB as MNB
mnb_clf = Pipeline([('vect', CountVectorizer()), ('clf',MNB())])

In [23]:
mnb_clf.fit(X=X_train, y=y_train)
mnb_predicted = mnb_clf.predict(X_test)
sum(mnb_predicted == y_test)/len(y_test)

0.81356

### Add TF-IDF

Now, let's try the above model with TF-IDF as another step after the Bag of Words (Unigrams)

In [24]:
mnb_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',MNB())])

In [25]:
mnb_clf.fit(X=X_train, y=y_train)
mnb_predicted = mnb_clf.predict(X_test)
sum(mnb_predicted == y_test)/len(y_test)

0.82956

### Remove Stop Words

In [26]:
mnb_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf',MNB())])

In [27]:
mnb_clf.fit(X=X_train, y=y_train)
mnb_predicted = mnb_clf.predict(X_test)
sum(mnb_predicted == y_test)/len(y_test)

0.82992

### Add Ngram Range from 1 to 3

In [28]:
mnb_clf = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range=(1,3))), ('tfidf', TfidfTransformer()), ('clf',MNB())])

In [29]:
mnb_clf.fit(X=X_train, y=y_train)
mnb_predicted = mnb_clf.predict(X_test)
sum(mnb_predicted == y_test)/len(y_test)

0.8572

### Why don't we try Gaussian Naive Bayes?

scikit-learn's Gaussian Naive Bayes implementation expects a dense feature matrix. Owing to the nature of text (where every word is a feature), our TF-IDF matrix is sparse: each review uses only a small fraction of the vocabulary, so most entries are zero.

Additionally, Gaussian Naive Bayes models each feature with a normal distribution, and our features are not even close to Gaussian.

We don't use Gaussian Naive Bayes for text classification because our data meets neither requirement.
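
The sparsity point can be checked directly. In the sketch below, `TfidfVectorizer` stands in for the `CountVectorizer` plus `TfidfTransformer` pair, the toy documents are made up, and the 75,000-word vocabulary used for the memory estimate is a hypothetical round number rather than a measured value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy data, invented for this example
docs = ['a great movie', 'great acting great plot', 'an awful movie', 'awful boring plot']
labels = [1, 1, 0, 0]

X = TfidfVectorizer().fit_transform(docs)               # SciPy sparse matrix
density = X.nnz / (X.shape[0] * X.shape[1])             # fraction of non-zero entries

# GaussianNB refuses sparse input
sparse_rejected = False
try:
    GaussianNB().fit(X, labels)
except TypeError:
    sparse_rejected = True

# Converting to dense works on toy data...
gnb = GaussianNB().fit(X.toarray(), labels)

# ...but at IMDb scale a dense float64 matrix would be large.
# 25,000 reviews x 75,000 features (hypothetical vocabulary size) x 8 bytes
dense_gb = 25_000 * 75_000 * 8 / 1e9                    # 15.0 GB
```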

# Support Vector Machines used as Classifiers

In [30]:
from sklearn.svm import SVC

In [31]:
svc_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',SVC())])

Warning: The next cell takes about 15 minutes to run on an 8-core CPU, which is why it is commented out.

In [32]:
# %%time
# svc_clf.fit(X=X_train, y=y_train)

In [33]:
# svc_predicted = svc_clf.predict(X_test)
# sum(svc_predicted == y_test)/len(y_test)
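
If the kernel `SVC` is too slow to run, one option is `LinearSVC`, which is usually much faster on high-dimensional sparse text features. Note that this is a different model (a linear SVM trained with liblinear), not a drop-in equivalent of `SVC` with its default RBF kernel. The sketch below uses made-up toy documents; on the real data you would pass `X_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for this example
docs = ['a great movie', 'great acting great plot', 'an awful movie', 'awful boring plot']
labels = [1, 1, 0, 0]

linear_svc_clf = Pipeline([('vect', CountVectorizer()),
                           ('tfidf', TfidfTransformer()),
                           ('clf', LinearSVC())])
linear_svc_clf.fit(docs, labels)
linear_svc_predicted = linear_svc_clf.predict(['great plot'])
```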

# Tree Based Models

## Decision Trees

In [34]:
from sklearn.tree import DecisionTreeClassifier as DTC
dtc_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',DTC())])

In [35]:
dtc_clf.fit(X=X_train, y=y_train)
dtc_predicted = dtc_clf.predict(X_test)
sum(dtc_predicted == y_test)/len(y_test)

0.70308

## Random Forest Classifier 

In [36]:
from sklearn.ensemble import RandomForestClassifier as RFC
rfc_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',RFC())])

In [37]:
rfc_clf.fit(X=X_train, y=y_train)
rfc_predicted = rfc_clf.predict(X_test)
sum(rfc_predicted == y_test)/len(y_test)

0.73696

## Extra Trees Classifier 

In [38]:
from sklearn.ensemble import ExtraTreesClassifier as XTC
xtc_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf',XTC())])

In [39]:
xtc_clf.fit(X=X_train, y=y_train)
xtc_predicted = xtc_clf.predict(X_test)
sum(xtc_predicted == y_test)/len(y_test)

0.75044

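# Implementing a Bag of Words baseline with bigrams using Naive Bayes SVM
Reference implementation: https://github.com/mesnilgr/nbsvm/blob/master/nbsvm.py

Paper: Wang and Manning, "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification" (ACL 2012), https://www.aclweb.org/anthology/P12-2018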
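
The core idea of the paper can be sketched in a few lines: compute a Naive Bayes log-count ratio for every n-gram, scale the binarized features by it, and train a linear classifier on the result. The code below is a simplified illustration under several assumptions, not a port of the linked script: it uses `LinearSVC`, omits the paper's interpolation between the SVM weights and their mean, and runs on made-up toy documents. `log_count_ratio` and `nbsvm_predict` are names invented here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC


def log_count_ratio(X, y, alpha=1.0):
    """Log ratio of smoothed, normalized feature counts in the positive vs negative class."""
    y = np.asarray(y)
    p = alpha + np.asarray(X[y == 1].sum(axis=0)).ravel()  # smoothed positive counts
    q = alpha + np.asarray(X[y == 0].sum(axis=0)).ravel()  # smoothed negative counts
    return np.log((p / p.sum()) / (q / q.sum()))


# Toy data, invented for this example
docs = ['great movie', 'great fun', 'awful movie', 'awful boring']
labels = np.array([1, 1, 0, 0])

# Binarized unigram + bigram features
vect = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vect.fit_transform(docs)

r = log_count_ratio(X, labels)
X_nb = X.multiply(r).tocsr()          # scale each feature column by its log-count ratio
nbsvm = LinearSVC().fit(X_nb, labels)


def nbsvm_predict(texts):
    """Apply the same vectorizer and scaling before predicting."""
    return nbsvm.predict(vect.transform(texts).multiply(r).tocsr())


preds = nbsvm_predict(['great fun movie'])
```

Features that appear equally often in both classes, such as "movie" here, get a ratio of zero and are effectively ignored by the classifier.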