# Text classification via vector semantics and open source software

In [123]:
from nltk.corpus import inaugural

Let's use the Inaugural speeches again.

Grab the president's name as a label by ignoring the year and the .txt extension on each speech's file id.

In [124]:
# one entry per sentence; each label is the president's name taken from the file id
train_data = [sentence for fileid in inaugural.fileids() for sentence in inaugural.sents(fileid)]
train_labels = [fileid[5:-4] for fileid in inaugural.fileids() for _ in inaugural.sents(fileid)]
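
As a quick check of the slice used for the labels, here is a minimal sketch. It assumes file ids of the form `1789-Washington.txt` (a four-digit year, a hyphen, the name, then `.txt`). The helper name `label_from_fileid` is hypothetical and not part of NLTK.

```python
def label_from_fileid(fileid):
    """Drop the 5-character 'YYYY-' prefix and the 4-character '.txt' suffix."""
    return fileid[5:-4]

# Sample ids written out by hand in the format the corpus uses
for fileid in ["1789-Washington.txt", "1861-Lincoln.txt"]:
    print(fileid, "->", label_from_fileid(fileid))
```

Note that this labelling collapses presidents who share a surname into a single class.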

Let's use a pre-trained word embedding.

In this case, we use one trained on ~100 billion words from Google News, which represents a vocabulary of ~3 million words and phrases in a 300-dimensional vector space.

It requires a download of roughly 2GB the first time it is loaded, and then takes up a good 4GB of RAM.

We could also train our own embedding.

In [125]:
# gensim can directly access a few models and datasets
import gensim.downloader as api
api.info('word2vec-google-news-300')

{'num_records': 3000000,
 'file_size': 1743563840,
 'base_dataset': 'Google News (about 100 billion words)',
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-google-news-300/__init__.py',
 'license': 'not found',
 'parameters': {'dimension': 300},
 'description': "Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https://code.google.com/archive/p/word2vec/).",
 'read_more': ['https://code.google.com/archive/p/word2vec/',
  'https://arxiv.org/abs/1301.3781',
  'https://arxiv.org/abs/1310.4546',
  'https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvec

In [126]:
wv = api.load('word2vec-google-news-300')

OK, let's now convert each sentence to a vector in this space by taking the mean of its word vectors.

In [127]:
import numpy as np
# we're ignoring any words not in our embedding's vocab,
# so there is a small risk that no word in a sentence is in the vocab, which would leave an empty list to average
# (`word in wv` works across gensim versions; `wv.vocab` was removed in gensim 4)
train_vecs = np.array([
    np.mean([wv[word] for word in sentence if word in wv], axis=0)
    for sentence in train_data
])

train_vecs

array([[ 0.01371111,  0.03740583, -0.06473214, ..., -0.00234549,
         0.20612444,  0.08544922],
       [ 0.00393775,  0.021065  ,  0.03127806, ..., -0.03549157,
        -0.00971197, -0.04127293],
       [ 0.01896888,  0.05594985,  0.03749165, ..., -0.01623395,
         0.02293476,  0.0019733 ],
       ...,
       [ 0.20052083,  0.08162435,  0.02587891, ..., -0.24495442,
        -0.08821615,  0.07167307],
       [-0.00097656, -0.05688477,  0.10229492, ..., -0.20751953,
        -0.19726562,  0.09008789],
       [ 0.17285156,  0.12792969,  0.02197266, ..., -0.18684895,
        -0.02832031,  0.05301412]], dtype=float32)
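
The empty-sentence risk mentioned in the code comments can be handled with a fallback. The following is a minimal sketch, not the notebook's method: it uses a tiny hand-made embedding dictionary (`toy_wv`, hypothetical) in place of the real model, and returns a zero vector when no word is in the vocabulary. Whether zeros are the right fallback is an assumption; dropping such sentences is another option.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the real 300-dimensional model
toy_wv = {
    "fellow": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "citizens": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

def mean_vector(sentence, embeddings, dim):
    """Average the vectors of in-vocabulary words; fall back to zeros if there are none."""
    vectors = [embeddings[word] for word in sentence if word in embeddings]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)

# The second sentence contains no in-vocabulary words
sentences = [["fellow", "citizens", ":"], [";", "--"]]
vecs = np.array([mean_vector(s, toy_wv, dim=3) for s in sentences])
print(vecs.shape)
```

With the fallback in place every row has the same length, so the result is always a regular 2-D array.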

In [128]:
from gensim.models import Word2Vec

from sklearn.linear_model import LogisticRegression

In [129]:
clf = LogisticRegression().fit(train_vecs, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
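
The warning above means the solver stopped before it converged. Here is a minimal sketch of the two remedies it suggests: raising `max_iter` and scaling the features. It runs on synthetic data from `make_classification` rather than the speech vectors, and the value `max_iter=1000` is an arbitrary assumption, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (train_vecs, train_labels)
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Scale the features, then give the solver a larger iteration budget
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(X, y)

# n_iter_ reports how many iterations the solver actually used
print(scaled_clf.named_steps["logisticregression"].n_iter_)
```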


In [130]:
import pandas as pd
pd.set_option("display.max_rows", None)
# compare predictions against the true labels for the first 100 training sentences
pd.DataFrame({"predicted": clf.predict(train_vecs[:100]), "actual": train_labels[:100]})

| | predicted | actual |
|---|---|---|
| 0 | McKinley | Washington |
| 1 | Harrison | Washington |
| 2 | Harrison | Washington |
| 3 | Harrison | Washington |
| 4 | Harrison | Washington |
| 5 | Harrison | Washington |
| 6 | Harrison | Washington |
| 7 | Harrison | Washington |
| 8 | Harrison | Washington |
| 9 | Harrison | Washington |


Benefits of not rolling your own:

1. We can work effectively with much larger datasets using minimal compute resources.
2. We can achieve a lot in a short time because someone else has already optimized the code and tuned the embeddings, so we can focus on trying different approaches and hyperparameters.

But:

1. It doesn't give much insight into how word2vec (or logistic regression) works, so when we hit problems they are not necessarily easy to debug.
2. There is less control over what data goes into the models (the embeddings may contain biases that we'd like to avoid).
3. There is still a lot of glue code.

### Pipelines
We can use a scikit-learn Pipeline to tidy up and hide some of this glue code.

In [131]:
from sklearn.pipeline import Pipeline

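class MeanEmbeddingVectorizer(object):
    """Turns each tokenized sentence into the mean of its word vectors."""
    def __init__(self, word2vec):
        self.word2vec = word2vec

    def fit(self, X, y=None):
        # nothing to learn: the embeddings are pre-trained
        return self

    def transform(self, X):
        # use the embeddings passed to the constructor rather than the global wv
        return np.array([
            np.mean([self.word2vec[word] for word in sentence if word in self.word2vec], axis=0)
            for sentence in X
        ])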

logreg_w2v = Pipeline([
    ("word2vec vectorizer", MeanEmbeddingVectorizer(wv)),
    ("log reg", LogisticRegression())])

Now we can pass the list of tokenized sentences directly as the training data for the pipeline.

In [132]:
logreg_w2v.fit(train_data, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(steps=[('word2vec vectorizer',
                 <__main__.MeanEmbeddingVectorizer object at 0x7fe75ea395d0>),
                ('log reg', LogisticRegression())])
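
To try the pipeline idea without the 2GB download, here is a minimal sketch under stated assumptions. The class `ToyMeanEmbeddingVectorizer`, the two-dimensional `toy_embeddings`, and the labels are all made up for illustration. Deriving from scikit-learn's `BaseEstimator` and `TransformerMixin` is one possible refinement: it adds `get_params` and `fit_transform`, which tools that clone estimators (such as cross-validation helpers) rely on.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class ToyMeanEmbeddingVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical variant of the vectorizer above, backed by a plain dict."""

    def __init__(self, embeddings, dim):
        self.embeddings = embeddings
        self.dim = dim

    def fit(self, X, y=None):
        return self  # nothing to learn: the embeddings are fixed

    def transform(self, X):
        rows = []
        for sentence in X:
            vectors = [self.embeddings[w] for w in sentence if w in self.embeddings]
            # zero vector when no word is known (an assumed fallback)
            rows.append(np.mean(vectors, axis=0) if vectors else np.zeros(self.dim))
        return np.array(rows)

# Made-up 2-dimensional embeddings and labels, for illustration only
toy_embeddings = {"war": np.array([1.0, 0.0]), "peace": np.array([0.0, 1.0])}
toy_sentences = [["war"], ["war", "war"], ["peace"], ["peace", "peace"]]
toy_labels = ["A", "A", "B", "B"]

toy_pipeline = Pipeline([
    ("vectorizer", ToyMeanEmbeddingVectorizer(toy_embeddings, dim=2)),
    ("log reg", LogisticRegression()),
])
toy_pipeline.fit(toy_sentences, toy_labels)
print(toy_pipeline.predict([["war"], ["peace"]]))
```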

In [133]:
pd.DataFrame({"predicted": logreg_w2v.predict(train_data[:100]), "actual": train_labels[:100]})

| | predicted | actual |
|---|---|---|
| 0 | McKinley | Washington |
| 1 | Harrison | Washington |
| 2 | Harrison | Washington |
| 3 | Harrison | Washington |
| 4 | Harrison | Washington |
| 5 | Harrison | Washington |
| 6 | Harrison | Washington |
| 7 | Harrison | Washington |
| 8 | Harrison | Washington |
| 9 | Harrison | Washington |


### Training embeddings

We could also train our own embeddings if we wanted control over the data they are built from.