# News Classification with Naive Bayes

The Naive Bayes classifier is well known for its good performance on `text classification` tasks. Here we go through the main concepts of text classification before building our own model.

## The Bag of Words Model

Two of the most important sub-tasks in pattern classification are `feature extraction` and `selection`. Before fitting a model with a machine learning algorithm, we need to decide how best to represent a text document as a `feature vector`.

A commonly used model in Natural Language Processing (NLP) is the so-called `bag of words` model. The idea behind it is intuitive. First we create the `vocabulary`: the collection of all distinct words that occur in the training set, each associated with a count of how often it occurs.

The vocabulary can then be used to construct the d-dimensional feature vectors for the individual documents where the dimensionality is equal to the number of different words in the vocabulary. This process is called `vectorization`.

During feature extraction we have to decide whether to use `word occurrences` (binary encoding: 0 if the word is not in the text, 1 if it is) or `word frequencies` (absolute counts of the words in the text). The answer depends on the data, so it is worth trying both approaches. In general, the first method tends to work better on short texts.
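To make the difference between frequencies and occurrences concrete, here is a minimal from-scratch sketch of the bag of words model. The two training sentences and the helper names `tokenize` and `vectorize` are made up for illustration; the whitespace tokenizer is deliberately naive.

```python
from collections import Counter

def tokenize(text):
    # Deliberately simple: lowercase and split on whitespace
    return text.lower().split()

# Hypothetical training documents
train_docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word in the training set, in sorted order
vocabulary = sorted({word for doc in train_docs for word in tokenize(doc)})

def vectorize(text, binary=False):
    # One entry per vocabulary word: its count, or 0/1 if binary=True
    counts = Counter(tokenize(text))
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

print(vocabulary)                                         # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectorize("the cat sat on the mat"))                # [1, 0, 1, 1, 1, 2]
print(vectorize("the cat sat on the mat", binary=True))   # [1, 0, 1, 1, 1, 1]
```

The only difference between the two encodings here is the last entry: `the` occurs twice, which the frequency vector records and the occurrence vector does not.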

## Tokenization

`Tokenization` describes the general process of breaking down a text corpus into individual elements that serve as input for various NLP algorithms (we have performed this task earlier in the programming module). Usually, tokenization is accompanied by other optional processing steps, such as the removal of `stop words` and `punctuation characters`, `stemming` or `lemmatizing`, and the construction of `n-grams`. Below is an example of a simple but typical tokenization step that splits a sentence into individual words, removes punctuation, and converts all letters to lowercase.

![tokenization](https://sebastianraschka.com/images/blog/2014/naive_bayes_1/tokenization-1.png)
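A tokenization step of this kind can be sketched with the standard `re` module. The sample sentence and the choice of which characters count as part of a token are assumptions made for this example, not a fixed rule.

```python
import re

def tokenize(sentence):
    # Lowercase, then keep runs of letters, digits and apostrophes;
    # everything else (punctuation, whitespace) acts as a separator.
    return re.findall(r"[a-z0-9']+", sentence.lower())

tokens = tokenize("A swimmer likes swimming, thus he swims!")
print(tokens)  # ['a', 'swimmer', 'likes', 'swimming', 'thus', 'he', 'swims']
```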

## Stop words

Stop words are words that are particularly common in a text corpus and are therefore considered rather uninformative (e.g., `so`, `and`, `or`, `the`, ...). One approach to stop word removal is to look words up in a language-specific stop word dictionary. An alternative is to build a stop list by sorting all words in the entire text corpus by frequency. After conversion into a set of unique words, the top n words of this list are removed from the input documents.

![stop words](https://sebastianraschka.com/images/blog/2014/naive_bayes_1/stop-1.png)
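The frequency-based approach can be sketched as follows. The three-sentence corpus, the function names and the choice of `top_n=2` are all hypothetical; on a real corpus n would be chosen by inspecting the ranked list.

```python
from collections import Counter

# Hypothetical mini corpus
corpus = [
    "the army and the government",
    "the economy and the market",
    "the budget and the army",
]

def build_stop_list(documents, top_n):
    # Count every word in the corpus and keep the top_n most frequent ones
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return {word for word, _ in counts.most_common(top_n)}

def remove_stop_words(text, stop_words):
    return [word for word in text.lower().split() if word not in stop_words]

stop_words = build_stop_list(corpus, top_n=2)
print(sorted(stop_words))                                          # ['and', 'the']
print(remove_stop_words("the budget and the market", stop_words))  # ['budget', 'market']
```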

## Stemming and Lemmatization

`Stemming` describes the process of transforming a word into its root form. A widely used stemming algorithm was developed by Martin F. Porter in 1979 and is hence known as the Porter stemmer.

![stemming](https://sebastianraschka.com/images/blog/2014/naive_bayes_1/porter-1.png)

Stemming can create non-real words, such as "thu" in the example above. 
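The effect is easy to reproduce with a toy suffix stripper. Note that this is not the Porter algorithm, only a crude illustration of rule-based suffix removal; the suffix list and the minimum stem length are assumptions picked for the example.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "ly", "es", "s")  # hypothetical, hand-picked list

def toy_stem(word, min_stem_len=3):
    # Strip the first matching suffix, keeping at least min_stem_len characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["walking", "walked", "walks", "quickly", "thus"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['walk', 'walk', 'walk', 'quick', 'thu']
```

The rule that correctly maps `walks` to `walk` also turns `thus` into the non-word `thu`, because the stemmer has no notion of which words are real.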

In contrast to stemming, `lemmatization` aims to obtain the canonical (grammatically correct) forms of the words, the so-called lemmas. Lemmatization is computationally more difficult and expensive than stemming, and in practice, both stemming and lemmatization have little impact on the performance of text classification.

![lemmatization](https://sebastianraschka.com/images/blog/2014/naive_bayes_1/lemma-1.png)

## N-Grams

In the `n-gram` model, a token can be defined as a sequence of n items. The simplest case is the so-called unigram (1-gram), where each token consists of exactly one word, letter, or symbol. All examples so far were unigrams. The optimal value of n depends on the language as well as the particular application.

![n-grams](https://sebastianraschka.com/images/blog/2014/naive_bayes_1/grams-1.png)
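Word-level n-grams can be generated by sliding a window of length n over the token list. This is a minimal sketch; the function name `ngrams` and the sample sentence are chosen for illustration.

```python
def ngrams(tokens, n):
    # Slide a window of n tokens over the list and join each window into one string
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is the first document".split()
print(ngrams(tokens, 1))  # ['this', 'is', 'the', 'first', 'document']
print(ngrams(tokens, 2))  # ['this is', 'is the', 'the first', 'first document']
print(ngrams(tokens, 3))  # ['this is the', 'is the first', 'the first document']
```

A sentence with k tokens yields k - n + 1 n-grams, so larger n produces fewer but more specific features.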

## Term Frequency - Inverse Document Frequency (Tf-idf)

The term frequency - inverse document frequency (Tf-idf) is another way of characterizing text documents. It can be understood as a weighted term frequency, which is especially useful if stop words have not been removed from the text corpus.

The Tf-idf approach assumes that the importance of a word is inversely proportional to how often it occurs across all documents. 

$$\text{Tf-idf}(t,d)=tf(t,d)\cdot idf(t),$$

where $tf(t,d)$ is the count of term $t$ in document $d$ and

$$idf(t)=\log \Big(\frac{N+\alpha}{df(t) + \alpha} \Big) + 1,$$

where $N$ is the number of documents in the corpus, $df(t)$ is the number of documents containing the term $t$, and $\alpha \in \{0, 1\}$ is the smoothing parameter.
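The formula can be checked by hand on a tiny corpus. The sketch below implements it directly with the standard library; the two tokenized documents and the helper names `tf`, `df`, `idf` and `tf_idf` are assumptions made for illustration.

```python
import math

# Toy corpus (hypothetical), already tokenized
docs = [
    ["this", "is", "a", "sentence"],
    ["this", "is", "the", "other"],
]

def tf(term, doc):
    # Raw count of the term in one document
    return doc.count(term)

def df(term, corpus):
    # Number of documents that contain the term
    return sum(1 for doc in corpus if term in doc)

def idf(term, corpus, alpha=1):
    # alpha=1 smooths the ratio; with alpha=0 a term with df=0 would divide by zero
    n_docs = len(corpus)
    return math.log((n_docs + alpha) / (df(term, corpus) + alpha)) + 1

def tf_idf(term, doc, corpus, alpha=1):
    return tf(term, doc) * idf(term, corpus, alpha)

for term in ("this", "sentence"):
    print(term, round(idf(term, docs, alpha=1), 4), round(idf(term, docs, alpha=0), 4))
# this 1.0 1.0
# sentence 1.4055 1.6931
```

A term that appears in every document (`this`) gets the minimum weight of 1, while a rarer term (`sentence`) is weighted up. scikit-learn's `TfidfVectorizer` additionally L2-normalizes each document vector by default, so its output values differ from these raw products.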

## Basic examples of Vectorizing

In [9]:
X_train = ['This is a sentence!', 'This is the other']
X_test = ["Is this a sentence?", "I am the second sentence", "And the third one!", "The sentence and a sentence"]

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Vectorizing with frequencies
# cv = CountVectorizer() 
# cv = CountVectorizer(stop_words='english')

# Vectorizing with occurrences 
cv = CountVectorizer(binary = True)

trained = cv.fit_transform(X_train)
tested = cv.transform(X_test)

print('Vocabulary')
print(cv.get_feature_names_out())  # vocabulary (get_feature_names() was removed in scikit-learn 1.2)
print('Train')
print(trained.toarray())
print('Test')
print(tested.toarray())

In [2]:
# tfidf = TfidfVectorizer()
# tfidf = TfidfVectorizer(smooth_idf=False)
tfidf = TfidfVectorizer(stop_words='english')
 
trained1 = tfidf.fit_transform(X_train)
tested1 = tfidf.transform(X_test)

print('Vocabulary')
print(tfidf.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print('Idf')
print(tfidf.idf_)
print('Train')
print(trained1.toarray())
print('Test')
print(tested1.toarray())

By default, CountVectorizer drops punctuation and single-character tokens. To keep them, set its `token_pattern` parameter.

## Examples on real dataset
We will work on a real dataset of news from two categories, `army` and `economy`, scraped from [this website](https://armenpress.am/eng/).

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
news1 = pd.read_csv("data/raw/armenpress_army.csv", encoding="utf8")
news2 = pd.read_csv("data/raw/armenpress_economy.csv", encoding="utf8")

news1.head()

| Unnamed: 0 | article_title | article_paragraph |
|---|---|---|
| 0 | Chief of General Staff of Armenian Armed Force... | YEREVAN, APRIL 22, ARMENPRESS. Chief of Genera... |
| 1 | Russian mobile lab deployed in Armenian milita... | YEREVAN, APRIL 8, ARMENPRESS. The mobile lab w... |
| 2 | 1 out of 58 new confirmed coronavirus cases in... | YEREVAN, MARCH 30, ARMENPRESS. 1 out of the 58... |
| 3 | Many quarantined Armenia servicemen return to ... | YEREVAN, MARCH 30, ARMENPRESS. “Dozens” of qua... |
| 4 | Armenia soldier wounded by Azerbaijani shooting | YEREVAN, MARCH 27, ARMENPRESS. Soldier of the ... |


In [3]:
# Each article paragraph starts with the location, the date and the name of the news agency.
# This prefix carries no information useful for classification, so we cut it off.
# The pattern is a regular expression (the dot is escaped); n=1 splits only at the first match.
news1.article_paragraph = news1.article_paragraph.str.split(r'[0-9], ARMENPRESS\.', n=1, regex=True, expand=True)[1]
news2.article_paragraph = news2.article_paragraph.str.split(r'[0-9], ARMENPRESS\.', n=1, regex=True, expand=True)[1]

# Attach a column for a label to be predicted (the label is the news type: military or economy)
news1['type'] = 'military'
news2['type'] = 'economy'

In [4]:
# Join 2 data frames in order to work with united dataset
news_df = pd.concat([news1, news2], axis=0, ignore_index=True)

# Check if there are null values in dataset
news_df.isna().sum()

article_title        0
article_paragraph    1
type                 0
dtype: int64

In [5]:
# If the dataset contains null values (e.g. an article whose paragraph is missing), drop those rows
news_df = news_df.dropna()

# Check once again for null values to make sure they are no longer in the dataset
news_df.isna().sum()

article_title        0
article_paragraph    0
type                 0
dtype: int64

### Fitting on word frequencies

News classification based on article paragraph

In [6]:
X_train, X_test, y_train, y_test = train_test_split(news_df['article_paragraph'], news_df['type'], random_state = 0)

print("Training dataset: {instance_count} instances".format(instance_count=X_train.shape[0]))
print("Test dataset: {instance_count} instances".format(instance_count=X_test.shape[0]))

Training dataset: 279 instances
Test dataset: 93 instances


In [7]:
# Get counts of instances with all labels (i.e. count of military news and count of economy news) in test dataset
y_test.value_counts()

type
military    48
economy     45
Name: count, dtype: int64

In [14]:
# CountVectorizer that ignores scikit-learn's built-in English stop words
frequency_vector_paragraph = CountVectorizer(stop_words ='english')

# Learn the vocabulary from the training data and encode each training document as a vector of word counts
training_data = frequency_vector_paragraph.fit_transform(X_train)

# Encode the test documents as word counts using the vocabulary learned from the training data
testing_data = frequency_vector_paragraph.transform(X_test)

# frequency_vector_paragraph.get_feature_names_out()
# training_data.toarray()

In [13]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB(alpha=1e-10)
naive_bayes.fit(training_data, y_train)

In [12]:
from sklearn.metrics import accuracy_score

predictions = naive_bayes.predict(testing_data)

print("Accuracy score by Scikit-learn accuracy metric: ", accuracy_score(y_test, predictions))

In [None]:
def accuracy(actual, predicted):
    # Fraction of instances whose predicted label equals the true label
    return sum(actual == predicted) / len(predicted)

print("Accuracy by custom accuracy function: ", accuracy(y_test, predictions))

In [None]:
# Indexes and true labels of misclassified instances
y_test[predictions != y_test]

In [None]:
# Indexes and paragraphs of misclassified articles
with pd.option_context('max_colwidth', 100):
    print(X_test[predictions != y_test])

News classification based on article title

In [15]:
X_train1, X_test1, y_train, y_test = train_test_split(news_df['article_title'], news_df['type'], random_state = 0)

print("Training dataset: {instance_count} instances".format(instance_count=X_train1.shape[0]))
print("Test dataset: {instance_count} instances".format(instance_count=X_test1.shape[0]))

Training dataset: 279 instances
Test dataset: 93 instances


In [16]:
frequency_vector_title = CountVectorizer(stop_words ='english')
training_data1 = frequency_vector_title.fit_transform(X_train1)
testing_data1 = frequency_vector_title.transform(X_test1)

In [17]:
naive_bayes1 = MultinomialNB(alpha=1e-10)
naive_bayes1.fit(training_data1, y_train)
predictions = naive_bayes1.predict(testing_data1)
print("Accuracy score by Scikit-learn accuracy metric: ", accuracy_score(y_test, predictions))

In [None]:
y_test[predictions != y_test]

In [None]:
with pd.option_context('max_colwidth', 100):
  print(X_test1[predictions != y_test])

### Fitting on word occurrences

News classification based on article paragraph

In [None]:
# If binary is true in CountVectorizer object, then it will only consider whether the word is present in text (will be encoded as 1) or not (will be encoded as 0)
occurrence_vector_paragraph = CountVectorizer(stop_words ='english', binary=True)

training_data2 = occurrence_vector_paragraph.fit_transform(X_train)
testing_data2 = occurrence_vector_paragraph.transform(X_test)

In [None]:
from sklearn.naive_bayes import BernoulliNB

bernoulli_naive_bayes = BernoulliNB()
bernoulli_naive_bayes.fit(training_data2, y_train)
predictions = bernoulli_naive_bayes.predict(testing_data2)
print("Accuracy score by Scikit-learn accuracy metric: ", accuracy_score(y_test, predictions))

The model performance decreased when the word occurrences were given as input to the model.

News classification based on article title

In [None]:
occurrence_vector_title = CountVectorizer(stop_words = 'english', binary=True)
training_data3 = occurrence_vector_title.fit_transform(X_train1)
testing_data3 = occurrence_vector_title.transform(X_test1)

In [None]:
bernoulli_naive_bayes = BernoulliNB()
bernoulli_naive_bayes.fit(training_data3, y_train)
predictions = bernoulli_naive_bayes.predict(testing_data3)
print("Accuracy score by Scikit-learn accuracy metric: ", accuracy_score(y_test, predictions))

For article titles, the model performance is the same whether we use word occurrences or word frequencies.

### Fitting on tf-idf

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vector = TfidfVectorizer(stop_words = 'english')
training_data4 = tfidf_vector.fit_transform(X_train)
testing_data4 = tfidf_vector.transform(X_test)

In [None]:
naive_bayes = MultinomialNB(alpha=1e-10)
naive_bayes.fit(training_data4, y_train)
predictions = naive_bayes.predict(testing_data4)
print("Accuracy score by Scikit-learn accuracy metric: ", accuracy_score(y_test, predictions))

### Fitting on n-grams

The concept is first illustrated on a simple example.

In [None]:
corpus = ['This is the first document.', 'This document is the second document.', 
          'And this is the third one.', 'Is this the first document?']

# Consider phrases made up of 1, 2 and 3 words
vectorizer2 = CountVectorizer(ngram_range=(1, 3))

X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names_out())
print(X2.toarray())

In [None]:
# Consider phrases made up of 1, 2, 3 and 4 words
frequency_vector = CountVectorizer(stop_words ='english', ngram_range=(1, 4))
training_data = frequency_vector.fit_transform(X_train)
testing_data = frequency_vector.transform(X_test)

naive_bayes = MultinomialNB(alpha=1e-10)
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)
print("Accuracy score by Scikit-learn accuracy metric: ", accuracy_score(y_test, predictions))