 # ML text classification using Bag-Of-Words [Tutorial](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a)
 
##### [Speech and Language Processing (3rd ed. draft)](https://web.stanford.edu/~jurafsky/slp3/)<br> Dan Jurafsky and James H. Martin

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Load Newspaper20 data

In [2]:
# load data
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

train = twenty_train
train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

In [3]:
print(train.target_names[train.target[0]])
train.data[0]

rec.autos


"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

## Extract Features from Text

We will be using bag-of-words model. BOW segments each text file into words (splitting by space), and count # of times each word occurs in each document. Lastly, it assign's each word an integer id. Each unique word in our dictionary will correspond to a feature (descriptive feature).

using [scikit.CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
CV = CountVectorizer()
CV

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [5]:
train_counts = CV.fit_transform(train.data)# returns n_samples, n_feautres
train_counts.shape

(11314, 130107)

## Term Frequencies

$$ TF =  \frac{Unique-word-count}{total-words} $$


$$ IDF =  \frac{total-number-docs}{number of docs with word} $$
#### IDF is the log scaled inverse fraction of documents that contain a word 

$$TF-IDF = tf(t,d).idf(t,d)$$
#### A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
TF = TfidfTransformer()

In [7]:
train_tfidf = TF.fit_transform(train_counts)
train_tfidf.shape

(11314, 130107)

# Naive Bayes Classifier

In [8]:
from sklearn.naive_bayes import MultinomialNB

In [9]:
len(train.data),len(train.target)

(11314, 11314)

In [10]:
NB = MultinomialNB().fit(train_tfidf, train.target)
NB

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Evaluate NB model
[dimension error](https://stackoverflow.com/questions/12484310/scipy-and-scikit-learn-valueerror-dimension-mismatch)

In [11]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
test = twenty_test
test.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

In [12]:
test_counts = CV.transform(test.data)# returns n_samples, n_feautres
test_counts.shape

(7532, 130107)

In [13]:
test_tfidf = TF.transform(test_counts)
test_tfidf.shape

(7532, 130107)

In [14]:
NB_predict = NB.predict(test_tfidf)
print('NB accuracy: %.3f' %np.mean(NB_predict == test.target))

NB accuracy: 0.774


# Pipelines
allow you to wrap processing step and modelling to single function.

In [15]:
from sklearn.pipeline import Pipeline

## Naive Bayes pipe

In [16]:
NB_pipe = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('model', MultinomialNB())])

NB_pipe = NB_pipe.fit(train.data, train.target)
NB_predict = NB_pipe.predict(test.data)
print('NB accuracy: %.3f' %np.mean(NB_predict == test.target))

NB accuracy: 0.774


## SVM pipe

In [17]:
from sklearn.linear_model import SGDClassifier

In [18]:
SVM_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('model',
                      SGDClassifier(
                          loss='hinge', penalty='l2',
                          alpha=1e-3, max_iter=5, random_state=42)),
                        ])

In [19]:
SVM_pipe = SVM_pipe.fit(train.data, train.target)
SVM_predict = SVM_pipe.predict(test.data)
print('SVM accuracy: %.3f' %np.mean(SVM_predict == test.target))

SVM accuracy: 0.824


## Logit pipe

In [20]:
LOG_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('model',
                      SGDClassifier(
                          loss='log', penalty='l2',
                          alpha=1e-3, max_iter=5, random_state=42)),
                        ])

LOG_pipe = LOG_pipe.fit(train.data, train.target)
LOG_predict = LOG_pipe.predict(test.data)
print('LOG accuracy: %.3f' %np.mean(LOG_predict == test.target))

LOG accuracy: 0.749


In [21]:
# A special case of Log and Huber loss metrics,
# we can see the probs that of each label
LOG_predict_p = LOG_pipe.predict_proba(test.data)
LOG_predict_p.shape,len(train.target_names)

((7532, 20), 20)

In [22]:
results = pd.DataFrame({
    'target':test.target,
    'NB':NB_predict,'SVM':SVM_predict,'LOG':LOG_predict})
results = results[['target', 'NB', 'SVM', 'LOG']].copy()
results[:10]

Unnamed: 0,target,NB,SVM,LOG
0,7,7,7,7
1,5,11,1,1
2,0,0,0,0
3,17,17,17,17
4,19,0,0,0
5,13,13,13,13
6,15,15,15,15
7,15,15,2,2
8,5,5,5,5
9,1,1,1,1


# Model Tuning
Tune model parameters using [GridSearch](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). 

Also possible to tune using [RandomisedSearch](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

In [55]:
from sklearn.model_selection import GridSearchCV

### NaiveBayes tune
- CountVectorizer
  - [n-grams](https://en.wikipedia.org/wiki/N-gram)<br>
  - [n-grams SO](https://stackoverflow.com/questions/24005762/understanding-the-ngram-range-argument-in-a-countvectorizer-in-sklearn)
- Tokenizer
  - [IDF Enable inverse-document-frequency reweighting.](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)
- NaiveBayes Classifier.
  - [Alpha:  Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).](https://stackoverflow.com/a/33840514/4538066)
  - [docs](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.predict)

In [56]:
print(NB_pipe)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ear_tf=False, use_idf=True)), ('model', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])


In [64]:
# model param settings are recognised using "name__"
NB_params = {'vect__ngram_range':[(1,1),(1,2)],
             'tfidf__use_idf':('True','False'),
             'model__alpha':(1e-1, 1e-3)}

In [66]:
gs_NB = GridSearchCV(NB_pipe, param_grid=NB_params, n_jobs=2)
gs_NB = gs_NB.fit(train.data, train.target)

In [73]:
gs_NB.best_score_, gs_NB.best_params_

(0.90622237935301397,
 {'model__alpha': 0.001,
  'tfidf__use_idf': 'True',
  'vect__ngram_range': (1, 2)})

In [74]:
NBcv_predict = gs_NB.predict(test.data)
print('NB-CV accuracy: %.3f' %np.mean(NBcv_predict == test.target))

NB-CV accuracy: 0.836


# SVM tune
[examples tuning SVM](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py)

In [90]:
# model param settings are recognised using "name__"
SVM_params = {'vect__ngram_range':[(1,1),(1,2)],
              'tfidf__use_idf':('True','False'),
              'model__alpha':(1e-1, 1e-3),
              'model__penalty': ('l2', 'elasticnet'),
              'model__n_iter': (10, 50, 80)}

In [None]:
gs_SVM = GridSearchCV(SVM_pipe, param_grid=SVM_params, n_jobs=-1)
gs_SVM = gs_SVM.fit(train.data, train.target)

In [84]:
print(gs_SVM.best_score_, gs_SVM.best_params_)
SVMcv_predict = gs_SVM.predict(test.data)
print('SVM-CV accuracy: %.3f' %np.mean(SVMcv_predict == test.target))

0.89791408874 {'model__alpha': 0.001, 'tfidf__use_idf': 'True', 'vect__ngram_range': (1, 2)}
SVM-CV accuracy: 0.833
