# Which is better for news article classification: Naive Bayes using Count Vectorizer or Logistic Regression using word vectors from a model trained on Google News?

## Load model and data

Google has pretrained a neural network on news articles. We can extract the word vectors, or word embeddings, that the model has learned and use them to train our own model. Let's load in the model here. It is relatively large so it will take a minute to load.

In [1]:
import numpy as np
import gensim
from sklearn.datasets import load_files

In [2]:
model = gensim.models.KeyedVectors.load_word2vec_format(
    './GoogleNews-vectors-negative300.bin', 
    binary=True
)  

Next, we need to load the text of the news articles that we want to categorize. If we use scikit-learn's `load_files` function from `sklearn.datasets`, the data and categories are loaded automatically, as long as the text files sit in one folder per category, with each folder named after its category.

The `news` variable stores both the data and the target labels, which are available as `news.data` and `news.target`, respectively.

In [3]:
news = load_files(
    './bbc/', 
    encoding='utf-8',
    decode_error='ignore',
    random_state=42
)
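To make the folder convention concrete, here is a minimal sketch that builds a tiny, made-up corpus on disk and labels it the way I understand `load_files` to do it: alphabetically sorted folder names become the category names, and each document gets the index of its folder. The folder contents and the labeling loop are stand-ins for illustration, not scikit-learn's implementation (the real function also shuffles the documents by default, which this sketch skips).

```python
import tempfile
from pathlib import Path

# Hypothetical miniature corpus: one subfolder per category, one file per article.
layout = {
    "business": {"001.txt": "UK house prices dip in November"},
    "sport": {"001.txt": "Collins named UK Athletics chief"},
}

with tempfile.TemporaryDirectory() as root:
    for category, files in layout.items():
        folder = Path(root) / category
        folder.mkdir()
        for name, text in files.items():
            (folder / name).write_text(text, encoding="utf-8")

    # Stand-in for the labeling convention: sorted folder names become the
    # category names, and each file is labeled with the index of its folder.
    target_names = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    data, target = [], []
    for label, category in enumerate(target_names):
        for path in sorted((Path(root) / category).iterdir()):
            data.append(path.read_text(encoding="utf-8"))
            target.append(label)

print(target_names)
print(target)
```

This is why `news.target` holds integers while `news.target_names` holds the readable category names.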

We can display a few sentences of one of the articles to get an idea of what they look like, and see what the target labels look like as well.

In [4]:
def first_n_sentences(article:str, n):
    "Returns the first n sentences in an article, or None if that number of sentences does not exist"
    first_n = ''
    period_counter = 0
    for character in article:
        first_n += character
        if character == '.':
            period_counter += 1
        if period_counter == n:
            return first_n
    print(f'There are not {n} sentences in this article')
    return None
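One caveat of counting periods is that abbreviations and decimal numbers also count as sentence ends. The sketch below uses a condensed version of the same period-counting idea (it returns `None` silently instead of printing) on a made-up sentence to show the effect. This is fine for a quick preview, but it is not a real sentence splitter.

```python
def first_n_sentences(article: str, n: int):
    "Condensed period-counting helper: return the text up to the n-th period."
    periods = 0
    for i, character in enumerate(article):
        if character == '.':
            periods += 1
            if periods == n:
                return article[:i + 1]
    return None

# Hypothetical text: the period in "2.5" and the one after "Nov" both count.
preview = first_n_sentences("Prices rose 2.5% in Nov. Economists were surprised.", 2)
print(preview)
```

Asking for two "sentences" here returns only the first real sentence, because two periods have already been seen by the end of it.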

In [5]:
first_n_sentences(news.data[0], 3)

'UK house prices dip in November\n\nUK house prices dipped slightly in November, the Office of the Deputy Prime Minister (ODPM) has said.\n\nThe average house price fell marginally to £180,226, from £180,444 in October. Recent evidence has suggested that the UK housing market is slowing after interest rate increases, and economists forecast a drop in prices during 2005.'

In [6]:
first_n_sentences(news.data[1000], 3)

"Collins named UK Athletics chief\n\nUK Athletics has ended its search for a new performance director by appointing psychologist Dave Collins.\n\nCollins, who worked with the British teams at the 2000 and 2004 Olympics, takes over from Max Jones. Six candidates were interviewed for the job, including Denise Lewis' coach Charles van Commenee and former British triple jumper Keith Connor."

In [7]:
news.target_names

['business', 'entertainment', 'politics', 'sport', 'tech']

## Naive Bayes

Naive Bayes can take counts of words and generate a model based on conditional class probabilities. In this case, the classes correspond to the five categories of news. Let's start with Naive Bayes since it is quick to prepare. We can use scikit-learn's `Pipeline` class to easily pass the data through the counting and fitting steps of the Naive Bayes modeling process.

In [8]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [9]:
def text_classification(clf: Pipeline, train_data, test_data, train_target, test_target):
    "Helper function wrapping the Pipeline, courtesy of Brian Spiering"
    clf.fit(train_data, train_target) 
    predicted = clf.predict(test_data)
    accuracy = np.mean(predicted==test_target)
    print(f"The accuracy on the test data is {accuracy:.2%}")
    return predicted

In [10]:
clf = Pipeline([('vect', CountVectorizer()),
                ('clf', MultinomialNB())])
nb_scores = cross_val_score(clf, news.data, news.target, cv=10)

In [44]:
print(f'Naive Bayes Accuracy: {100*nb_scores.mean():0.2f}% (+/- {100*nb_scores.std() * 2:0.2f}%)')

Naive Bayes Accuracy: 97.57% (+/- 1.52%)


Naive Bayes is already doing a great job, with a mean cross-validated accuracy close to 98%.
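To see what "conditional class probabilities" means in practice, here is a from-scratch sketch of multinomial Naive Bayes on a made-up four-document corpus. It is a toy written for illustration, not what `MultinomialNB` does internally line for line. The smoothing value `alpha=1.0` is chosen to match that class's default, and the corpus and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of (text, label) pairs.
train = [
    ("shares profit market", "business"),
    ("market bank profit", "business"),
    ("match goal team", "sport"),
    ("team win match", "sport"),
]

word_counts = defaultdict(Counter)  # word counts per class
class_counts = Counter()            # number of documents per class
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1

vocab = {word for counts in word_counts.values() for word in counts}

def log_posterior(text, label, alpha=1.0):
    "Unnormalized log P(label | text) with additive (Laplace) smoothing."
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        if word not in vocab:
            continue  # words never seen in training are ignored
        score += math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(text):
    "Pick the class with the highest log posterior."
    return max(class_counts, key=lambda label: log_posterior(text, label))

print(predict("profit market win"))
```

The example sentence contains one sport word and two business words, so the business class wins. The smoothing keeps the single unseen-in-class word from driving a probability to zero.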

## Logistic Regression

### Pre-processing

#### Tokenize and remove punctuation

It helps to make things easy for the model, so we will clean and tokenize the text. The `spaCy` library does a great job of tokenizing, that is, separating out individual words. It treats punctuation characters and numbers as tokens too, so we follow up with a very basic regular expression substitution that removes them. Removing punctuation is not always useful: in sentiment analysis, for example, punctuation may add meaning or emphasis. In our case, punctuation is unlikely to give any clues to the topic of the article.

In [12]:
import re
import spacy
# spaCy 3 removed the 'en' shortcut; load the small English model by its package name
nlp = spacy.load('en_core_web_sm')

def tokenize_bbc(text_data):
    "Tokenize BBC news articles and remove punctuation and digits"
    clean_text = list()
    for text_chunk in text_data:
        doc = nlp(text_chunk)
        clean = ' '.join([token.text for token in doc])
        # Raw string so the backslashes reach the regex engine unchanged
        clean = re.sub(r'[\n\.\,\(\)\'\"\-\:\;\&\#\/0-9]', '', clean)
        clean_text.append(clean)
    return clean_text

In [13]:
news.data = tokenize_bbc(news.data)
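To check what the substitution actually removes, we can run it on a single line without spaCy. The sketch below assumes the text has already been joined with single spaces, as `tokenize_bbc` does after tokenizing. The sample string and the `PUNCT_AND_DIGITS` name are made up for illustration.

```python
import re

# Same set of characters as in tokenize_bbc, written compactly as a raw string.
PUNCT_AND_DIGITS = re.compile(r"[\n.,()'\"\-:;&#/0-9]")

# Hypothetical snippet, spaced the way joined spaCy tokens would be.
sample = "The average house price fell to £ 180,226 , from £ 180,444 in October ."
cleaned = PUNCT_AND_DIGITS.sub("", sample)

print(cleaned.split())
```

The numbers and punctuation disappear, but the `£` sign survives because it is not in the character class. Symbols like this are left for the vocabulary matching step to filter out.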

#### Find the unique words in the corpus and see how many match up with the model vocabulary

The goal is to use pre-trained word vectors to train a classification model. If none of the words in our articles match up with words in the pre-trained model, then this effort will be for nothing. This next step checks to make sure we are capturing enough of the words to have good results.

In [14]:
def find_unique_words(text:list)->set: 
    word_set = set()
    for word_list in text:
        word_set = word_set | set(word_list)
    return word_set

In [15]:
def calculate_matched_vocab(articles:list, model:gensim.models.KeyedVectors):
    "Print the share of the dataset vocabulary found in the model and return the matched words"
    data_vocab = find_unique_words([article.split() for article in articles])  # Vocabulary of the dataset
    # gensim < 4.0 exposes the vocabulary as model.vocab; in gensim >= 4.0 use model.key_to_index
    model_vocab = set(model.vocab.keys())  # Vocabulary of the Google model
    matched_vocab = data_vocab & model_vocab  # Intersecting vocabulary
    print(f'{len(matched_vocab) * 100 / len(data_vocab) : .2f}% of the dataset vocabulary has been matched')
    return matched_vocab

In [16]:
matched_vocab = calculate_matched_vocab(news.data, model)

 97.00% of the dataset vocabulary has been matched
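The percentage above counts unique words. It can also be useful to know what share of all word occurrences is covered, and which words are missing. The sketch below shows both on a made-up two-article corpus. The variable names and the toy vocabulary are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical stand-ins for the tokenized articles and the model vocabulary.
articles = ["the market fell the market rose", "the ODPM said prices fell"]
model_vocab = {"the", "market", "fell", "rose", "said", "prices"}

token_counts = Counter(word for article in articles for word in article.split())
data_vocab = set(token_counts)

matched = data_vocab & model_vocab
unmatched = data_vocab - model_vocab

# Share of unique words matched (what calculate_matched_vocab reports) ...
type_coverage = len(matched) / len(data_vocab)
# ... versus share of word occurrences matched, which weights frequent words more.
token_coverage = sum(token_counts[word] for word in matched) / sum(token_counts.values())

print(f"{type_coverage:.2%} of unique words matched")
print(f"{token_coverage:.2%} of word occurrences matched")
print(sorted(unmatched))
```

Frequent words are the most likely to be in the model, so the occurrence-level figure is usually the higher of the two. Printing the unmatched words is a quick way to spot tokenization problems.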


#### Convert words to embeddings

In this step, we create a dictionary that allows us to look up the word embedding for any word. We only include matched words, since those are the only ones that will be used. Then we convert each word in each article to its corresponding word vector. Further down, we save the embedded text so we do not have to repeat these initial preprocessing steps.

In [17]:
embed_dict = {word:model.get_vector(word) for word in matched_vocab}

In [18]:
def string_to_vector(data, embed_dict, matched_vocab):
    "Converts string tokens to word vectors using an embedding dictionary and a set of intersecting vocabulary."
    embedded_text = list()
    for article in data:
        embedded_text.append(
            np.array([embed_dict[word] for word in article.strip().split() if word in matched_vocab])
        )
    return embedded_text

In [19]:
embedded_text = string_to_vector(news.data, embed_dict, matched_vocab)
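As a small illustration of what the conversion produces, the sketch below uses a made-up embedding dictionary with 3-dimensional vectors (the Google News vectors have 300 dimensions). Words without an entry are simply dropped, so each article becomes an array with one row per matched word.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the real 300-dimensional ones.
embed_dict = {
    "house": np.array([0.1, 0.2, 0.3]),
    "prices": np.array([0.4, 0.5, 0.6]),
    "dip": np.array([0.7, 0.8, 0.9]),
}

article = "house prices dip in ODPM figures"
vectors = np.array([embed_dict[word] for word in article.split() if word in embed_dict])

print(vectors.shape)  # (number of matched words, embedding size)
```

Six words go in, but only the three with an embedding come out, which is why articles end up with different numbers of rows.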

#### Find the article with the fewest words

Logistic regression needs every input to have the same shape. To ensure this, we find the article with the fewest matched words and truncate every other article to that length.

In [20]:
def shrink_to_smallest(embedded_text):
    "Shrinks all embedded text documents to match the minimum length document"
    min_length = np.min([article.shape[0] for article in embedded_text])
    embedded_text = [article[:min_length] for article in embedded_text]
    return embedded_text

In [21]:
embedded_text = shrink_to_smallest(embedded_text)
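The truncation is simple, but it does discard data from the longer articles. The sketch below uses randomly generated stand-in arrays (three "articles" of 5, 3 and 8 words with 4-dimensional vectors) to show the resulting shape and the share of word vectors that survive.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical articles with 5, 3 and 8 matched words, each word a 4-d vector.
embedded = [rng.normal(size=(n_words, 4)) for n_words in (5, 3, 8)]

min_length = min(article.shape[0] for article in embedded)
# After truncation every article has the same shape, so they stack into one array.
truncated = np.stack([article[:min_length] for article in embedded])

kept = truncated.shape[0] * min_length / sum(article.shape[0] for article in embedded)
print(truncated.shape)
print(f"{kept:.0%} of word vectors kept")
```

Only the first `min_length` words of each article are used from here on, so a single very short article limits how much of every other article the model sees.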

#### Save our work

Now that we are done pre-processing, we should save our work. We don't want to have to repeat all these steps if we just want to tune or re-run the model in the future.

In [22]:
np.save('./tmp/bbc_text_embed_converted.npy', embedded_text)
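One practical detail: `np.save` writes to the path it is given but does not create missing folders, so the `./tmp/` directory has to exist first. The sketch below shows a save and load round trip. It uses a temporary directory and a made-up array so that it can run anywhere, and the file name is a placeholder.

```python
import os
import tempfile
import numpy as np

embedded = np.zeros((4, 3, 5))  # hypothetical: 4 articles, 3 words each, 5-d vectors

with tempfile.TemporaryDirectory() as root:
    out_dir = os.path.join(root, "tmp")
    os.makedirs(out_dir, exist_ok=True)  # np.save does not create missing folders
    path = os.path.join(out_dir, "embedded.npy")

    np.save(path, embedded)
    restored = np.load(path)

print(restored.shape, np.array_equal(embedded, restored))
```

Because every article was truncated to the same length, the list converts to a regular 3-dimensional array and loads back without any special options.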

### Model time

In [23]:
embedded_text = np.load('./tmp/bbc_text_embed_converted.npy')

We can average the word embeddings to give a notion of meaning to each article and then fit a logistic regression.

In [24]:
from sklearn.linear_model import LogisticRegression

In [25]:
def create_average_vectors(embedded_text):
    "Calculate the average of the word embeddings for each text in a corpus"
    avg_embed_text = list()
    for article in embedded_text:
        avg_embed_text.append(np.mean(article, axis=0))
    return np.array(avg_embed_text)
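Averaging over the word axis turns an article of any length into a single vector of the embedding size. The sketch below uses a made-up three-word article with 2-dimensional vectors to show two equivalent ways to write the average, and one consequence of averaging: word order no longer matters.

```python
import numpy as np

# Hypothetical article of three words with 2-d embeddings.
article = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])

manual = np.sum([word for word in article], axis=0) / len(article)
builtin = article.mean(axis=0)

# Word order is lost: the reversed article has the same average.
reversed_avg = article[::-1].mean(axis=0)

print(manual, builtin, reversed_avg)
```

The classifier therefore sees each article as a bag of word meanings, much as Naive Bayes sees it as a bag of word counts.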

In [None]:
avg_embed_text = create_average_vectors(embedded_text)
clf = lr = LogisticRegression(C=100)
lr_scores = cross_val_score(clf, avg_embed_text, news.target, cv=10)

In [45]:
print(f'Logistic Regression Accuracy: {100*lr_scores.mean():0.2f}% (+/- {100*lr_scores.std() * 2:0.2f}%)')

Logistic Regression Accuracy: 97.71% (+/- 1.62%)
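The `(+/- ...)` figure in the accuracy lines is two standard deviations of the ten fold scores, not a formal confidence interval. The sketch below reproduces the calculation with the standard library on ten made-up fold accuracies. `statistics.pstdev` is used because NumPy's `std()` computes the population standard deviation by default.

```python
import statistics

# Hypothetical accuracies from ten cross-validation folds.
fold_scores = [0.97, 0.98, 0.99, 0.96, 0.98, 0.97, 0.99, 0.98, 0.97, 0.98]

mean = statistics.fmean(fold_scores)
# Population standard deviation (ddof=0), matching NumPy's default std().
spread = 2 * statistics.pstdev(fold_scores)

print(f"Accuracy: {100 * mean:0.2f}% (+/- {100 * spread:0.2f}%)")
```

With only ten folds the spread is a rough guide, so differences of a few tenths of a percent between two models should not be read as meaningful.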


## Repeat with newsgroup data

Both models did really well on the news articles, but I wonder how they would do on a different data set. There is a newsgroup data set that contains online postings on various topics. The original data set has more categories, but I made my own aggregate categories (anything having to do with religion was put into the religion folder, etc.) and made sure there were roughly equal numbers of posts in each.

In [27]:
data_path = './newsgroups/'
boards = load_files(
    data_path, 
    encoding='utf-8',
    decode_error='ignore',
    random_state=42
)

In [51]:
boards.target_names

['computer', 'politics', 'recreation', 'religion', 'science']

The tokenization process is slightly different for the newsgroup data, mainly because the `spaCy` tokenizer does not separate this much less cleanly formatted text as well. We replace punctuation with spaces first so that `spaCy` can do a better job.

In [28]:
def tokenize_newsgroup(text_data):
    "Tokenize newsgroup posts"
    clean_text = list()
    for text_chunk in text_data:
        # Raw string so the backslashes reach the regex engine unchanged
        text_chunk = re.sub(r'[\n\.\,\(\)\'\"\-\:\;\&\#\!\*\<\>\@\^\`\~\|\\\$\/0-9]', ' ', text_chunk)
        doc = nlp(text_chunk)
        clean = ' '.join([token.text.strip() for token in doc if len(token) > 1])
        clean_text.append(clean)
    return clean_text
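The reason for substituting a space rather than an empty string shows up clearly on a header line. The sketch below applies both variants to the `From:` line of the first post, leaving spaCy out of the picture. The `pattern` name is made up, and the character class is my compact rewrite of the one in `tokenize_newsgroup`.

```python
import re

header = "From: mcovingt@aisun3.ai.uga.edu (Michael Covington)"
# Compact raw-string version of the newsgroup character class.
pattern = re.compile(r"[\n.,()'\"\-:;&#!*<>@^`~|\\$/0-9]")

deleted = pattern.sub("", header)   # drop the characters
spaced = pattern.sub(" ", header)   # replace them with spaces

print(deleted.split())
print(spaced.split())
```

Deleting the characters glues the address into one long token that no vocabulary will contain, while replacing them with spaces leaves short pieces such as `uga` and `edu` that can be matched or discarded individually.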

Before:

In [54]:
boards.data[0]

"Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!pitt.edu!zaphod.mps.ohio-state.edu!howland.reston.ans.net!gatech!emory!athena!aisun3.ai.uga.edu!mcovingt\nFrom: mcovingt@aisun3.ai.uga.edu (Michael Covington)\nNewsgroups: sci.med\nSubject: Re: Any info. on Vasomotor Rhinitis\nMessage-ID: <C5t573.L18@athena.cs.uga.edu>\nDate: 21 Apr 93 00:25:51 GMT\nReferences: <1r1t1a$njq@europa.eng.gtefsd.com>\nSender: usenet@athena.cs.uga.edu\nOrganization: AI Programs, University of Georgia, Athens\nLines: 15\nNntp-Posting-Host: aisun3.ai.uga.edu\n\n(Disclaimer: I'm a sufferer, not a doctor.)\n\nI'm not sure there's a really sharp distinction between allergic and\nvasomotor rhinitis.  Basically, vasomotor rhinitis means your nose is\nstuffy when it has no reason to be (not even an identifiable allergy).\n\nDecongestants and steroid sprays work for vasomotor rhinitis.  Also,\nI can get surprising relief from purely superficial measures such as\nsaline moisturizing spray and moisturizing gel.

In [None]:
boards.data = tokenize_newsgroup(boards.data)

After:

In [55]:
boards.data[0]

'Path cantaloupe srv cs cmu edu magnesium club cc cmu edu pitt edu zaphod mps ohio state edu howland reston ans net gatech emory athena aisun ai uga edu mcovingt From mcovingt aisun ai uga edu Michael Covington Newsgroups sci med Subject Re Any info on Vasomotor Rhinitis Message ID    athena cs uga edu Date  Apr  GMT References  njq europa eng gtefsd com Sender usenet athena cs uga edu Organization AI Programs University of Georgia Athens Lines  Nntp Posting Host aisun ai uga edu  Disclaimer sufferer not doctor  not sure there really sharp distinction between allergic and vasomotor rhinitis  Basically vasomotor rhinitis means your nose is stuffy when it has no reason to be not even an identifiable allergy  Decongestants and steroid sprays work for vasomotor rhinitis  Also can get surprising relief from purely superficial measures such as saline moisturizing spray and moisturizing gel  Michael Covington Associate Research Scientist  Artificial Intelligence Programs  mcovingt ai uga edu 

Check Naive Bayes first

In [68]:
clf = Pipeline([('vect', CountVectorizer()),
                ('clf', MultinomialNB())])
nb_scores_2 = cross_val_score(clf, boards.data, boards.target, cv=10)

In [69]:
print(f'Naive Bayes Accuracy: {100*nb_scores_2.mean():0.2f}% (+/- {100*nb_scores_2.std() * 2:0.2f}%)')

Naive Bayes Accuracy: 95.88% (+/- 0.99%)


Naive Bayes does slightly worse on this data set. For logistic regression, let's match the vocabulary and repeat the modeling steps from before.

In [29]:
matched_vocab_2 = calculate_matched_vocab(boards.data, model)

 60.31% of the dataset vocabulary has been matched


In [30]:
embed_dict_2 = {word:model.get_vector(word) for word in matched_vocab_2}
embedded_text_2 = string_to_vector(boards.data, embed_dict_2, matched_vocab_2)
embedded_text_2 = shrink_to_smallest(embedded_text_2)
np.save('./tmp/newsgroup_text_embed_converted.npy', embedded_text_2)

In [70]:
embedded_text_2 = np.load('./tmp/newsgroup_text_embed_converted.npy')

In [71]:
avg_embed_text_2 = create_average_vectors(embedded_text_2)
clf_2 = LogisticRegression(C=100)
lr_scores_2 = cross_val_score(clf_2, avg_embed_text_2, boards.target, cv=10)

In [72]:
print(f'Newsgroup Logistic Regression Accuracy: {100*lr_scores_2.mean():0.2f}% (+/- {100*lr_scores_2.std() * 2:0.2f}%)')

Newsgroup Logistic Regression Accuracy: 97.63% (+/- 0.89%)


Logistic regression appears to be the more robust method, since its average accuracy is similar on both data sets, even though only about 60% of the newsgroup vocabulary matched the model. The spread in accuracy dropped in this round, likely because there was more data in this second try.

## Why always 98%? Let's model some random data

I'm starting to get suspicious that the logistic regression is doing so well, so I am going to generate some random text files and see how well the approach works on them. I will start by loading in the `words.txt` file which is a list of 466,000 English words. Then I will randomly sample it 1000 times, each time grabbing 100 words with replacement, and assign each 'text file' a random category. At that point we can use the same steps as before to predict the categories and see how well we do.

In [34]:
model_vocab = set(model.vocab.keys())
with open('./words.txt') as f:
    words_list = f.readlines()
words_list = [word.lower().replace('\n','') for word in words_list if word.lower().replace('\n','') in model_vocab]

In [35]:
X,y = list(), list()
for i in range(1000):
    X.append(' '.join(np.random.choice(words_list, 100)))
    y.append(np.random.randint(5))  # random category label from 0 to 4

In [47]:
matched_vocab_3 = calculate_matched_vocab(X, model)

 100.00% of the dataset vocabulary has been matched


In [48]:
embed_dict_3 = {word:model.get_vector(word) for word in matched_vocab_3}
embedded_text_3 = string_to_vector(X, embed_dict_3, matched_vocab_3)

In [49]:
avg_embed_text_3 = create_average_vectors(embedded_text_3)
clf_3 = LogisticRegression(C=100)
lr_scores_3 = cross_val_score(clf_3, avg_embed_text_3, y, cv=10)

In [73]:
print(f'Logistic Regression Accuracy on Random Data: {100*lr_scores_3.mean():0.2f}% (+/- {100*lr_scores_3.std() * 2:0.2f}%)')

Logistic Regression Accuracy on Random Data: 22.00% (+/- 9.56%)


With five randomly assigned categories, guessing would be right about 20% of the time, and the 22.00% we get is within noise of that. So the approach is not broken; it is just working very well on the large data sets. In the next section we will cut down the number of training samples and figure out whether Naive Bayes or Logistic Regression works better in the small data regime.
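As a sanity check on the chance level, here is a short standard-library simulation that guesses one of five labels at random and measures how often the guess matches an equally random label. The sample size and seed are arbitrary choices for illustration.

```python
import random

random.seed(0)
n_classes, n_samples = 5, 10_000

labels = [random.randrange(n_classes) for _ in range(n_samples)]
guesses = [random.randrange(n_classes) for _ in range(n_samples)]

accuracy = sum(guess == label for guess, label in zip(guesses, labels)) / n_samples
print(f"Chance-level accuracy: {accuracy:.2%}")
```

The result lands close to 20%. In the experiment above each of the ten folds tests on only 100 samples, so the standard deviation of a 20% hit rate per fold is about sqrt(0.2 × 0.8 / 100) = 4%, which makes a two-standard-deviation spread of roughly 8% unsurprising.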