# Which is better for news article classification: Naive Bayes using Count Vectorizer or Logistic Regression using word vectors from a model trained on Google News?

## Load model and data

Google has pretrained a word2vec neural network on news articles. We can extract the word vectors, or word embeddings, that the model has learned and use them as features to train our own model. Let's load the vectors here. The file is relatively large, so loading will take a minute.
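To make the idea of a word vector concrete, here is a minimal sketch with made-up three-dimensional vectors. The values and the `cosine_similarity` helper are illustrative assumptions and are not taken from the Google News model; the point is only that related words are expected to point in similar directions.

```python
import math

# Hypothetical toy embeddings; real values come from the pretrained model.
toy_vectors = {
    "football": [0.9, 0.1, 0.0],
    "rugby":    [0.8, 0.2, 0.1],
    "economy":  [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sports_sim = cosine_similarity(toy_vectors["football"], toy_vectors["rugby"])
mixed_sim = cosine_similarity(toy_vectors["football"], toy_vectors["economy"])
print(f"football vs rugby:   {sports_sim:.2f}")
print(f"football vs economy: {mixed_sim:.2f}")
```

With these toy values the two sports words score much higher with each other than with the unrelated word.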

In [1]:
import numpy as np
import gensim
from sklearn.datasets import load_files

In [2]:
model = gensim.models.KeyedVectors.load_word2vec_format(
    './GoogleNews-vectors-negative300.bin', 
    binary=True
)  

Next, we need to load the text of the news articles that we want to categorize. If we use scikit-learn's `load_files` function from `sklearn.datasets`, the data and categories are loaded automatically, as long as the text files sit in folders named after their categories.

The `news` variable stores both the data and the target labels, which can be accessed as `news.data` and `news.target`, respectively.
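The folder convention can be illustrated with a standard-library sketch. This is not scikit-learn's implementation: `load_labeled_folders` is a hypothetical helper, the two tiny articles are made up, and unlike `load_files` it does not shuffle the documents.

```python
import tempfile
from pathlib import Path

def load_labeled_folders(root):
    """Read root/<category>/<file> into parallel data, target and name lists."""
    root = Path(root)
    target_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    data, target = [], []
    for label, name in enumerate(target_names):
        for path in sorted((root / name).iterdir()):
            data.append(path.read_text(encoding="utf-8", errors="ignore"))
            target.append(label)  # integer index into target_names
    return data, target, target_names

# Build a tiny, made-up corpus in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for category, text in [("business", "Shares rose."), ("sport", "They won.")]:
        folder = Path(tmp) / category
        folder.mkdir()
        (folder / "001.txt").write_text(text, encoding="utf-8")
    data, target, target_names = load_labeled_folders(tmp)

print(target_names)
print(target)
```

Each label in `target` is an index into `target_names`, which is the same relationship that holds between `news.target` and `news.target_names`.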

In [176]:
news = load_files(
    './bbc/', 
    encoding='utf-8',
    decode_error='ignore',
    random_state=42
)

We can display a few sentences of one of the articles to get an idea of what they look like, and see what the target labels look like as well.

In [177]:
def first_n_sentences(article: str, n: int):
    "Returns the first n sentences in an article (counting periods as sentence ends), or None if that number of sentences does not exist"
    first_n = ''
    period_counter = 0
    for character in article:
        first_n += character
        if character == '.':
            period_counter += 1
        if period_counter == n:
            return first_n
    print(f'There are not {n} sentences in this article')
    return None

In [178]:
first_n_sentences(news.data[0], 3)

'UK house prices dip in November\n\nUK house prices dipped slightly in November, the Office of the Deputy Prime Minister (ODPM) has said.\n\nThe average house price fell marginally to £180,226, from £180,444 in October. Recent evidence has suggested that the UK housing market is slowing after interest rate increases, and economists forecast a drop in prices during 2005.'

In [179]:
first_n_sentences(news.data[1000], 3)

"Collins named UK Athletics chief\n\nUK Athletics has ended its search for a new performance director by appointing psychologist Dave Collins.\n\nCollins, who worked with the British teams at the 2000 and 2004 Olympics, takes over from Max Jones. Six candidates were interviewed for the job, including Denise Lewis' coach Charles van Commenee and former British triple jumper Keith Connor."

In [180]:
news.target_names

['business', 'entertainment', 'politics', 'sport', 'tech']

## Naive Bayes

Naive Bayes takes word counts and builds a model from class-conditional word probabilities. In this case, the classes correspond to the five categories of news. Let's start with Naive Bayes since it is quick to prepare. We can use scikit-learn's `Pipeline` class to pass the data through the counting and fitting steps of the Naive Bayes modeling process.
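As a rough illustration of what happens inside those two steps, here is a from-scratch sketch of multinomial Naive Bayes with add-one smoothing on a made-up four-document corpus. It is a simplified stand-in, not scikit-learn's `MultinomialNB`; the documents, labels, and the `predict` helper are all assumptions for the example.

```python
import math
from collections import Counter, defaultdict

# Made-up training documents: (text, label).
train = [
    ("shares profit market", "business"),
    ("market bank profit", "business"),
    ("match goal team", "sport"),
    ("team win match", "sport"),
]

word_counts = defaultdict(Counter)  # per-class word counts
class_counts = Counter()            # number of documents per class
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text, alpha=1.0):
    """Return the class with the highest log posterior (add-alpha smoothing)."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))  # log prior
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            likelihood = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("team goal"))
print(predict("bank shares"))
```

The "naive" part is the sum of per-word log likelihoods: each word is treated as independent of the others given the class.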

In [181]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [182]:
def text_classification(clf: Pipeline, train_data, test_data, train_target, test_target):
    "Helper function wrapping the Pipeline, courtesy of Brian Spiering"
    clf.fit(train_data, train_target) 
    predicted = clf.predict(test_data)
    accuracy = np.mean(predicted==test_target)
    print(f"The accuracy on the test data is {accuracy:.2%}")
    return predicted

In [183]:
clf = Pipeline([('vect', CountVectorizer()),
                ('clf', MultinomialNB())])
nb_scores = cross_val_score(clf, news.data, news.target, cv=10)

In [184]:
print("Accuracy: %0.2f (+/- %0.2f)" % (nb_scores.mean(), nb_scores.std() * 2))

Accuracy: 0.98 (+/- 0.02)


Naive Bayes is already doing a great job, with a mean cross-validated accuracy of about 98%.
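For readers unsure how to read the reported figure, the sketch below reproduces the same "mean plus or minus two standard deviations" summary from a list of per-fold scores. The ten scores are hypothetical placeholders, not the actual fold results from this notebook.

```python
import statistics

# Hypothetical per-fold accuracies, standing in for cross_val_score output.
fold_scores = [0.97, 0.98, 0.99, 0.98, 0.97, 0.99, 0.98, 0.98, 0.97, 0.99]

mean_acc = statistics.fmean(fold_scores)
# pstdev is the population standard deviation, matching NumPy's default std().
spread = 2 * statistics.pstdev(fold_scores)
print(f"Accuracy: {mean_acc:.2f} (+/- {spread:.2f})")
```

The second number describes how much accuracy varies between folds, so a small value suggests the result does not depend heavily on which articles land in the test fold.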

## Logistic Regression

### Pre-processing

#### Tokenize and remove punctuation

Cleaning and tokenizing the text makes the model's job easier. The `spaCy` library does a great job of tokenizing, that is, separating out individual words. spaCy treats punctuation characters and numbers as tokens, so we follow up with a very basic regular expression substitution that removes them. Removing punctuation is not always useful: in sentiment analysis, for example, punctuation may add meaning or emphasis. In our case, punctuation is unlikely to give any clues to the topic of the article.
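The substitution step can be tried on its own, without spaCy, on text that is already split into space-separated tokens. The sketch below uses the same character class as the notebook's BBC cleaning function; the `strip_punctuation` helper, the sample sentence, and the extra whitespace collapsing are assumptions added for readability.

```python
import re

# Characters to drop: newlines, common punctuation, and digits.
PUNCT_AND_DIGITS = re.compile(r"[\n.,()'\"\-:;&#/0-9]")

def strip_punctuation(text):
    """Remove punctuation and digits, then collapse repeated whitespace."""
    cleaned = PUNCT_AND_DIGITS.sub("", text)
    return " ".join(cleaned.split())

sample = "UK house prices dipped in 2004 , the ODPM ( a ministry ) said ."
print(strip_punctuation(sample))
```

Collapsing whitespace is optional here, since a later `split()` on the cleaned text ignores repeated spaces anyway.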

In [185]:
import re
import spacy
# The 'en' shortcut works in spaCy 2.x; spaCy 3.x needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

def tokenize_bbc(text_data):
    "Tokenize BBC news articles and remove punctuation and digits"
    clean_text = list()
    for text_chunk in text_data:
        doc = nlp(text_chunk)
        clean = ' '.join([token.text for token in doc])
        clean = re.sub(r'[\n.,()\'"\-:;&#/0-9]', '', clean)  # raw string avoids invalid-escape warnings
        clean_text.append(clean)
    return clean_text

In [186]:
news.data = tokenize_bbc(news.data)

#### Find the unique words in the corpus and see how many match up with the model vocabulary

The goal is to use pre-trained word vectors to train a classification model. If none of the words in our articles match up with words in the pre-trained model, then this effort will be for nothing. This next step checks to make sure we are capturing enough of the words to have good results.
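The check boils down to a set intersection. Here is a minimal sketch on made-up data, where `pretrained_vocab` is a hypothetical stand-in for the model's vocabulary and one article contains a deliberate misspelling.

```python
# Made-up tokenized articles and a made-up pretrained vocabulary.
articles = ["the bank cut rates", "the team won the cup", "bank of englnad"]
pretrained_vocab = {"the", "bank", "cut", "rates", "team", "won", "cup", "of"}

# Unique words across the corpus.
data_vocab = set()
for article in articles:
    data_vocab.update(article.split())

matched = data_vocab & pretrained_vocab
unmatched = data_vocab - pretrained_vocab
coverage = 100 * len(matched) / len(data_vocab)
print(f"{coverage:.2f}% of the dataset vocabulary has been matched")
print(f"Unmatched words: {sorted(unmatched)}")
```

Note that this measures the share of unique words, not the share of word occurrences, so a frequent unmatched word counts the same as a rare one.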

In [187]:
def find_unique_words(text: list):
    "Returns the set of unique words across a list of tokenized documents"
    word_set = set()
    for word_list in text:
        word_set.update(word_list)
    return word_set

In [188]:
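def calculate_matched_vocab(articles: list, model: gensim.models.KeyedVectors):
    "Prints the share of the dataset vocabulary found in the model and returns the matched words"
    data_vocab = find_unique_words([article.split() for article in articles])  # Vocabulary of the dataset
    # `model.vocab` exists in gensim < 4.0; gensim 4.0+ exposes the same keys as `model.key_to_index`
    model_vocab = set(model.vocab.keys())  # Vocabulary of the Google model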
    matched_vocab = data_vocab & model_vocab  # Intersecting vocabulary
    print(f'{len(matched_vocab) * 100 / len(data_vocab) : .2f}% of the dataset vocabulary has been matched')
    return matched_vocab

In [189]:
matched_vocab = calculate_matched_vocab(news.data, model)

 97.00% of the dataset vocabulary has been matched


#### Convert words to embeddings

In this step, we create a dictionary that lets us look up the word embedding for any word. We include only matched words, since those are the only ones that will be used. Then we convert each word in each article to its corresponding word vector. Finally, we save the embedded text so we do not have to repeat these initial preprocessing steps.
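A small sketch of the lookup step, assuming hypothetical two-dimensional embeddings in place of the real vectors. The `toy_embed` dictionary and the `embed_article` helper are illustrative names, not part of the notebook.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings standing in for the real word vectors.
toy_embed = {
    "bank":  np.array([0.1, 0.9]),
    "rates": np.array([0.2, 0.8]),
    "team":  np.array([0.9, 0.1]),
}

def embed_article(article, embed_dict):
    """Stack the vectors of known words; out-of-vocabulary words are skipped."""
    vectors = [embed_dict[w] for w in article.split() if w in embed_dict]
    return np.array(vectors)

matrix = embed_article("bank rates xyzzy team", toy_embed)
print(matrix.shape)  # one row per matched word, one column per dimension
```

Each article becomes a matrix whose number of rows depends on how many of its words were matched, which is why article lengths differ at this stage.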

In [22]:
embed_dict = {word:model.get_vector(word) for word in matched_vocab}

In [23]:
def string_to_vector(data, embed_dict, matched_vocab):
    "Converts string tokens to word vectors using an embedding dictionary and a set of intersecting vocabulary."
    embedded_text = list()
    for article in data:
        embedded_text.append(
            np.array([embed_dict[word] for word in article.strip().split() if word in matched_vocab])
        )
    return embedded_text

In [24]:
embedded_text = string_to_vector(news.data, embed_dict, matched_vocab)

#### Find the shortest article

Logistic regression needs every input to have the same length. To ensure this, we find the article with the fewest matched words and truncate all other articles to that length.
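To see what this truncation does, here is a sketch on made-up token lists rather than embedding matrices. The `kept` ratio is an extra diagnostic added for illustration and is not computed in the notebook.

```python
# Made-up articles as lists of tokens; their lengths differ.
articles = [
    ["bank", "cut", "rates", "again"],
    ["team", "won"],
    ["new", "phone", "launched"],
]

# Length of the shortest article sets the common length.
min_length = min(len(article) for article in articles)
truncated = [article[:min_length] for article in articles]

# Fraction of all tokens that survive truncation.
kept = sum(len(a) for a in truncated) / sum(len(a) for a in articles)
print(min_length, truncated)
print(f"{kept:.0%} of tokens kept")
```

Every article now has the same length, at the cost of discarding everything after the first `min_length` words of the longer ones.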

In [25]:
def shrink_to_smallest(embedded_text):
    "Shrinks all embedded text documents to match the minimum length document"
    min_length = np.min([article.shape[0] for article in embedded_text])
    embedded_text = [article[:min_length] for article in embedded_text]
    return embedded_text

In [26]:
embedded_text = shrink_to_smallest(embedded_text)

#### Save our work

Now that we are done pre-processing, we should save our work. We don't want to have to repeat all these steps if we just want to tune or re-run the model in the future.

In [28]:
np.save('./tmp/bbc_text_embed_converted.npy', embedded_text)

### Model time

In [82]:
embedded_text = np.load('./tmp/bbc_text_embed_converted.npy')

We can average the word embeddings to give a notion of meaning to each article and then fit a logistic regression.
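The averaging step can be sketched on a tiny made-up array. The shapes here (2 articles, 3 words each, 4 dimensions) are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np

# Hypothetical corpus: 2 articles, 3 words each, 4-dimensional embeddings.
embedded = np.array([
    [[1.0, 0.0, 0.0, 2.0], [3.0, 0.0, 0.0, 2.0], [2.0, 3.0, 0.0, 2.0]],
    [[0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 3.0, 0.0], [0.0, 1.0, 2.0, 0.0]],
])

# Average over the word axis (axis=1): one fixed-length vector per article.
doc_vectors = embedded.mean(axis=1)
print(doc_vectors.shape)
print(doc_vectors)
```

Averaging produces one vector per article with as many features as the embedding has dimensions, and it discards word order.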

In [83]:
from sklearn.linear_model import LogisticRegression

In [204]:
def create_average_vectors(embedded_text):
    "Calculate the average of the word embeddings for each text in a corpus"
    avg_embed_text = list()
    for article in embedded_text:
        avg_embed_text.append(np.mean(article, axis=0))  # mean over the word axis
    return np.array(avg_embed_text)

In [205]:
avg_embed_text = create_average_vectors(embedded_text)
clf = LogisticRegression(C=100)
lr_scores = cross_val_score(clf, avg_embed_text, news.target, cv=10)
print(f'Accuracy: {lr_scores.mean() : 0.2f} (+/- {lr_scores.std() * 2 : 0.2f})')

Accuracy:  0.98 (+/-  0.02)


## Repeat with newsgroup data

In [102]:
data_path = './newsgroups/'
boards = load_files(
    data_path, 
    encoding='utf-8',
    decode_error='ignore',
    random_state=42
)

In [164]:
def tokenize_newsgroup(text_data):
    "Tokenize newsgroup posts"
    clean_text = list()
    for text_chunk in text_data:
        text_chunk = re.sub(r'[\n.,()\'"\-:;&#!*<>@^`~|\\$/0-9]', ' ', text_chunk)  # raw string avoids invalid-escape warnings
        doc = nlp(text_chunk)
        clean = ' '.join([token.text.strip() for token in doc if len(token) > 1])
        clean_text.append(clean)
    return clean_text

In [167]:
boards.data = tokenize_newsgroup(boards.data)

In [191]:
matched_vocab_2 = calculate_matched_vocab(boards.data, model)

 60.31% of the dataset vocabulary has been matched


In [192]:
embed_dict_2 = {word:model.get_vector(word) for word in matched_vocab_2}

In [193]:
embedded_text_2 = string_to_vector(boards.data, embed_dict_2, matched_vocab_2)

In [194]:
embedded_text_2 = shrink_to_smallest(embedded_text_2)

In [195]:
np.save('./tmp/newsgroup_text_embed_converted.npy', embedded_text_2)

In [None]:
embedded_text_2 = np.load('./tmp/newsgroup_text_embed_converted.npy')

In [206]:
avg_embed_text_2 = create_average_vectors(embedded_text_2)
clf_2 = LogisticRegression(C=100)
lr_scores_2 = cross_val_score(clf_2, avg_embed_text_2, boards.target, cv=10)
print(f'Accuracy: {lr_scores_2.mean() : 0.2f} (+/- {lr_scores_2.std() * 2 : 0.2f})')

Accuracy:  0.98 (+/-  0.01)
