# Bag of words and Naive Bayes Classifier    

Here the goal is to try the Naive Bayes classifier with the BoW method.

## What is bag of words anyway?     

The Bag of Words (BoW) method is a popular way to **represent text data** in machine learning, which treats **each document as an unordered collection** or "bag" of words. This method is used for feature extraction in text data. 

In the Bag of Words method, **a text** (such as a sentence or a document) **is represented as the bag (multiset) of its words**, **disregarding grammar** and even **word order** but **keeping multiplicity**. The **frequency of each word** is used as a **feature** for training a classifier.        

## How does it work?       

1. **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation. In this project our dataset is already split into sentences and labeled.

2. **Vocabulary building**: All the words are collected and a dictionary is created in which the words are the keys and their indexes are the values. The length of the dictionary is the length of each text's vector representation.

3. **Text to Vector**: Each sentence or document is converted into a vector whose length equals the vocabulary size. Each position holds the number of times the corresponding vocabulary word occurs in the text: 0 if the word is absent, 1 if it occurs once, and n if it occurs n times.

For example, let's consider two sentences:
- Sentence 1: "The cat sat on the mat."
- Sentence 2: "The dog sat on the log."

In a BoW model, these sentences would first be tokenized (lowercased, with punctuation removed) into:

- Sentence 1: ["the", "cat", "sat", "on", "the", "mat"]
- Sentence 2: ["the", "dog", "sat", "on", "the", "log"]

Then, the vocabulary of unique words would be: ["the", "cat", "sat", "on", "mat", "dog", "log"]

Lastly, the sentences would be transformed into vectors based on this vocabulary:

- Sentence 1: [2, 1, 1, 1, 1, 0, 0]
- Sentence 2: [2, 0, 1, 1, 0, 1, 1]

As you can see, each index in the vector corresponds to a word in the vocabulary, and the value at each index corresponds to the number of times that word appears in the sentence. 
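The three steps can be sketched in plain Python. This is a minimal illustration, not how `CountVectorizer` is implemented: the helper names `tokenize`, `build_vocabulary`, and `vectorize` are made up for this example, and the vocabulary is ordered by first appearance (scikit-learn sorts its vocabulary alphabetically and, by default, ignores single-character tokens).

```python
import string
from collections import Counter


def tokenize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_vocabulary(tokenized_docs):
    """Map each unique word to an index, in order of first appearance."""
    vocabulary = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def vectorize(tokens, vocabulary):
    """Return a count vector with one position per vocabulary word."""
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]  # Counter returns 0 for absent words
    return vector


sentences = ["The cat sat on the mat.", "The dog sat on the log."]
tokenized = [tokenize(sentence) for sentence in sentences]
vocabulary = build_vocabulary(tokenized)
vectors = [vectorize(tokens, vocabulary) for tokens in tokenized]

print(list(vocabulary))
for vector in vectors:
    print(vector)
```

With lowercasing applied, the sketch yields a seven-word vocabulary and reproduces the two count vectors shown above.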

The BoW approach is simple and effective, but it has some downsides. It creates sparse vectors because the length of the vector is the same as the length of the vocabulary, and for each sentence or document, many positions will be zero if the word is not present in the document. Also, this approach doesn't account for word order or context, so it might not be as effective for tasks where these elements are important.
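Both downsides can be seen on a toy example. The sketch below assumes simple whitespace tokenization and a small hand-picked vocabulary (both are illustrative choices, not part of the notebook's pipeline): two sentences with opposite meanings collapse to the same bag, and most positions of the dense vector are zero.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())


# Word order is lost: both sentences produce exactly the same bag.
first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")
same_bag = first == second
print("Same representation:", same_bag)

# Sparsity: with a larger (hypothetical) vocabulary, most entries are zero.
vocabulary = ["dog", "bites", "man", "cat", "sat", "mat", "log", "on", "the"]
dense = [first[word] for word in vocabulary]
zero_fraction = dense.count(0) / len(dense)
print("Dense vector:", dense)
print(f"Fraction of zeros: {zero_fraction:.2f}")
```

On a real corpus the vocabulary has tens of thousands of entries while a single document uses only a handful of them, which is why sparse matrix formats are used to store the result.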

We start by loading the dataset, then convert the training data into a list of strings and pass that list to **CountVectorizer**, which transforms the text data into a bag-of-words representation.

In [35]:
#reading the dataset
from datasets import load_dataset
raw_datasets = load_dataset("ag_news")

Found cached dataset ag_news (/Users/mahnaz/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)
100%|██████████| 2/2 [00:00<00:00, 326.15it/s]


In [36]:
# let's first have a look at the data
print("keys: ",raw_datasets.keys())
print(type(raw_datasets['train']))
print(raw_datasets['train'].features)
print(f"example of train datapoint: \n {raw_datasets['train'][0]}")
print(raw_datasets['train'][0]['text'])
print(raw_datasets['train'][0]['label'])

keys:  dict_keys(['train', 'test'])
<class 'datasets.arrow_dataset.Dataset'>
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None)}
example of train datapoint: 
 {'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2}
Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
2


In [37]:
# printing the first 10 lines of the data:
for i in range(10):
    print(raw_datasets['train'][i])

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2}
{'text': 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.', 'label': 2}
{'text': "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.", 'label': 2}
{'text': 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.', 'lab

In [38]:
# Convert the training data into list of strings
train_texts = [example['text'] for example in raw_datasets['train']]
train_labels = [example['label'] for example in raw_datasets['train']]

In [39]:
print(train_texts[0])
print(train_labels[0])

Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
2


In [40]:
# Create an instance of CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

The CountVectorizer transforms each text into a vector in a high-dimensional space. The dimension of this space is equal to the size of the learned vocabulary (i.e., the number of unique words in all documents of the training data). Each unique word has its own dimension and the value in that dimension is the count of this word in the corresponding document.

In [41]:
# Learn a vocabulary dictionary of all tokens in the raw documents and return term-document matrix
X_train = vectorizer.fit_transform(train_texts)


In [42]:
type(X_train)
print("Shape of X_train: ", X_train.shape)
print(type(X_train[0]))
print(X_train[0])

Shape of X_train:  (120000, 65006)
<class 'scipy.sparse._csr.csr_matrix'>
  (0, 62536)	2
  (0, 54842)	1
  (0, 7396)	1
  (0, 12600)	1
  (0, 6510)	1
  (0, 30178)	1
  (0, 57946)	1
  (0, 8345)	1
  (0, 48864)	2
  (0, 52479)	1
  (0, 51606)	1
  (0, 55636)	1
  (0, 18921)	1
  (0, 6832)	1
  (0, 40992)	1
  (0, 60125)	1
  (0, 15556)	1
  (0, 5306)	1
  (0, 51513)	1
  (0, 25524)	1
  (0, 3522)	1


**Note** 

The above output is a representation of a sparse matrix in the Compressed Sparse Row (CSR) format. Let's break down the information:

The object, which is the matrix representation of a single datapoint (one text), is a sparse matrix in CSR format from the SciPy library (`<class 'scipy.sparse._csr.csr_matrix'>`).

Following that, there are multiple lines with a specific format `(row_index, column_index) value`. These lines **represent the non-zero elements of the sparse matrix**. Here's an explanation of each line:

- `(0, 62536) 2`: The value `2` is stored at row index `0` and column index `62536`, meaning the word with vocabulary index `62536` occurs twice in this document.
- `(0, 54842) 1`: The value `1` is stored at row index `0` and column index `54842`.
- `(0, 7396) 1`: The value `1` is stored at row index `0` and column index `7396`.
- ...         

Each subsequent line follows the same pattern, representing the row index, column index, and value of a non-zero element in the sparse matrix.         

In summary, this output is a **representation of a sparse matrix in CSR format**, where the **non-zero elements are shown with their corresponding row and column indices**.
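To make the CSR idea concrete, here is a minimal pure-Python sketch of the format. It follows the usual three-array layout (`data`, `indices`, `indptr`, which are also the attribute names SciPy's `csr_matrix` exposes), but the functions `to_csr` and `iter_nonzero` are hypothetical helpers written for this illustration, not SciPy APIs.

```python
def to_csr(dense_rows):
    """Convert a list of dense rows into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for column, value in enumerate(row):
            if value != 0:
                data.append(value)      # the non-zero value itself
                indices.append(column)  # the column it belongs to
        indptr.append(len(data))        # where the next row starts in data
    return data, indices, indptr


def iter_nonzero(data, indices, indptr):
    """Yield (row, column, value) for every stored entry."""
    for row in range(len(indptr) - 1):
        for position in range(indptr[row], indptr[row + 1]):
            yield row, indices[position], data[position]


# The two toy count vectors from the cat/dog example
dense_rows = [
    [2, 1, 1, 1, 1, 0, 0],
    [2, 0, 1, 1, 0, 1, 1],
]
data, indices, indptr = to_csr(dense_rows)
entries = list(iter_nonzero(data, indices, indptr))

for row, column, value in entries:
    print(f"({row}, {column})\t{value}")
```

Only the non-zero counts are stored, so memory use grows with the number of words actually present in each document rather than with the vocabulary size.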

We also convert labels to a numpy array for later use with Scikit-learn:

In [43]:
import numpy as np
y_train = np.array(train_labels)

In [44]:
print("Shape of y_train: ", y_train.shape)
print(type(y_train[10]))
print(y_train[10])

Shape of y_train:  (120000,)
<class 'numpy.int64'>
2


In [45]:
unique_values, counts = np.unique(y_train, return_counts=True)

print("Unique values:", unique_values)
print("Counts:", counts)

Unique values: [0 1 2 3]
Counts: [30000 30000 30000 30000]


Now let's try the Naive Bayes classifier with a small sample of the data:

In [46]:
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np

# Load the dataset
raw_datasets = load_dataset("ag_news")

# Select the first 1000 samples
train_subset = raw_datasets['train'].select(range(1000))


# Convert the train data into list of strings
train_texts = train_subset['text']
train_labels = train_subset['label']

# Initialize a CountVectorizer
vectorizer = CountVectorizer()

# Transform the train data into the BoW representation
X_train = vectorizer.fit_transform(train_texts)

# Convert labels to numpy array
y_train = np.array(train_labels)

# Printing the counts of each class to make sure all classes have datapoints
unique_values, counts = np.unique(y_train, return_counts=True)

print("Unique values:", unique_values)
print("Counts:", counts)


# Initialize the Multinomial Naive Bayes classifier
naive_bayes_clf = MultinomialNB()

# Train the classifier
naive_bayes_clf.fit(X_train, y_train)

# Select the first 1000 samples from test set
test_subset = raw_datasets['test'].select(range(1000))

# Convert the test data into list of strings
test_texts = test_subset['text']
test_labels = test_subset['label']

# Transform the test data into the BoW representation
X_test = vectorizer.transform(test_texts)

# Convert labels to numpy array
y_test = np.array(test_labels)

# Predict the labels of the test data
y_pred = naive_bayes_clf.predict(X_test)

# Print the accuracy of the classifier
accuracy = (y_pred == y_test).mean()
print(f"Accuracy: {accuracy}")



Unique values: [0 1 2 3]
Counts: [212 142 174 472]
Accuracy: 0.726





We get an accuracy of 0.726 with the Naive Bayes algorithm, which is weaker than the accuracy of 0.885 obtained with fine-tuned BERT.

# TF-IDF and Naive Bayes Classifier

Now let's try the Naive Bayes classifier with the TF-IDF method and compare its accuracy with that of the BoW method.

## What is the TF-IDF?

TF-IDF stands for **Term Frequency-Inverse Document Frequency**, and it's a **numerical statistic** used to reflect **how important a word is to a document** in a collection or corpus. It's one of the most popular techniques used for information retrieval to represent how important a specific word or phrase is to a given document.       

TF-IDF is a combination of two concepts: term frequency (TF) and inverse document frequency (IDF):

**Term Frequency (TF)**: This measures the **frequency of a word in a document**. The more often a word appears in a document, the higher its TF. It is given by:

**TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)**     

**Inverse Document Frequency (IDF)**: This measures the **importance of a word in the entire corpus**. A word that appears in many documents does not help to tell them apart, so such words are usually less important. It is given by:

**IDF(t) = log_e(Total number of documents / Number of documents with term t in it)**

The **TF-IDF** value is obtained by multiplying these two quantities: **TF * IDF**. This will increase proportionally to the number of times a word appears in the document but is offset by the number of documents in the corpus that contain the word.

This technique **rescales the frequency of words by how often they appear in all documents**, so that words like "the", which are frequent across all documents, are penalized. Words that are more specific to a document then carry more weight, which can improve the performance of many text mining tasks such as text classification, clustering, and information retrieval.
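The two formulas can be checked by hand on a tiny corpus. The sketch below implements exactly the textbook definitions given above; the function names are illustrative, and it assumes the term occurs in at least one document (otherwise the IDF would divide by zero). Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each document vector, so its values will differ from these.

```python
import math


def term_frequency(term, tokens):
    """TF(t) = occurrences of t in the document / total terms in the document."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, tokenized_docs):
    """IDF(t) = ln(total documents / documents containing t)."""
    containing = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(len(tokenized_docs) / containing)


def tf_idf(term, tokens, tokenized_docs):
    """TF-IDF is the product of the two quantities."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, tokenized_docs)


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

# "the" occurs in every document, so its IDF (and TF-IDF) is zero.
score_the = tf_idf("the", docs[0], docs)
# "cat" occurs only in the first document, so it gets a positive weight.
score_cat = tf_idf("cat", docs[0], docs)

print(f"the: {score_the:.4f}")
print(f"cat: {score_cat:.4f}")
```

Although "the" is the most frequent word in the first sentence, its score is zero because it appears in both documents, while the rarer "cat" receives a score of ln(2) / 6.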

In [49]:
# TF-IDF and Naive Bayes Classifier

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from datasets import load_dataset

raw_datasets = load_dataset("ag_news")

train_subset = raw_datasets['train'].select(range(1000))
test_subset = raw_datasets['test'].select(range(1000))

train_texts = [example['text'] for example in train_subset]
train_labels = [example['label'] for example in train_subset]
test_texts = [example['text'] for example in test_subset]
test_labels = [example['label'] for example in test_subset]

# Apply the TF-IDF Vectorizer
tfidf_vect = TfidfVectorizer()
X_train_tfidf = tfidf_vect.fit_transform(train_texts)
X_test_tfidf = tfidf_vect.transform(test_texts)

# Train Naive Bayes Classifier
clf = MultinomialNB().fit(X_train_tfidf, train_labels)

# Make predictions and evaluate the model
predicted = clf.predict(X_test_tfidf)
print(f"Accuracy: {metrics.accuracy_score(test_labels, predicted)}")



Accuracy: 0.402





The accuracy of Naive Bayes with TF-IDF (0.402) is much lower than that of both the fine-tuned BERT (0.885) and Naive Bayes with BoW (0.726).

In [50]:
print(type(X_train_tfidf)) 

<class 'scipy.sparse._csr.csr_matrix'>


In [51]:
# what does the matrix representation of a document look like in TF-IDF?
print(X_train_tfidf[0] )

  (0, 306)	0.1892144669935855
  (0, 2984)	0.22803951498848163
  (0, 5840)	0.2222928298136565
  (0, 508)	0.11395365544824688
  (0, 1755)	0.25387980729831355
  (0, 6999)	0.25387980729831355
  (0, 4539)	0.05940129262366908
  (0, 684)	0.22803951498848163
  (0, 2167)	0.24315511699740658
  (0, 6389)	0.2021992226786497
  (0, 5855)	0.24315511699740658
  (0, 5972)	0.20544297736560763
  (0, 5569)	0.2041636162525598
  (0, 838)	0.2089961086129379
  (0, 6724)	0.04471510643864858
  (0, 3479)	0.13808618375812984
  (0, 657)	0.1706122451939926
  (0, 1357)	0.25387980729831355
  (0, 738)	0.2348364009227698
  (0, 6270)	0.21292391297955668
  (0, 7250)	0.4043984453572994


# MLP with TF-IDF 



In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from datasets import load_dataset

# Load the dataset
raw_datasets = load_dataset("ag_news")
train_subset = raw_datasets['train'].select(range(1000))
test_subset = raw_datasets['test'].select(range(1000))

# Prepare training data
train_texts = [example['text'] for example in train_subset]
train_labels = [example['label'] for example in train_subset]

# Prepare test data
test_texts = [example['text'] for example in test_subset]
test_labels = [example['label'] for example in test_subset]

# Initialize a TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Define MLP Classifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))

# Train the classifier
mlp.fit(X_train, train_labels)

# Test the classifier
predicted = mlp.predict(X_test)

# Print the classification report
print(classification_report(test_labels, predicted))




              precision    recall  f1-score   support

           0       0.76      0.66      0.71       268
           1       0.94      0.61      0.74       274
           2       0.74      0.49      0.59       205
           3       0.53      0.94      0.68       253

    accuracy                           0.68      1000
   macro avg       0.74      0.68      0.68      1000
weighted avg       0.75      0.68      0.68      1000



# MLP with BERT vectorization (feature extraction)   

Here we use BERT for vectorization (also known as feature extraction) and an MLP as the classifier.
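The cell below encodes all 1,000 texts in a single forward pass, which can require a lot of memory on larger samples. A possible alternative is to extract features batch by batch. The sketch below shows only the batching logic: `iter_batches`, `extract_features`, and `fake_embed_batch` are hypothetical names, and `fake_embed_batch` is a stand-in for the real tokenizer-plus-model call so that the example runs without downloading BERT.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def extract_features(texts, embed_batch, batch_size=32):
    """Apply embed_batch to each batch and collect one feature vector per text."""
    features = []
    for batch in iter_batches(texts, batch_size):
        features.extend(embed_batch(batch))
    return features


def fake_embed_batch(batch):
    """Stand-in for BERT: a 2-dimensional 'feature vector' per text.

    In the real notebook this would tokenize the batch, run the model under
    torch.no_grad(), and return the rows of the pooled output.
    """
    return [[float(len(text)), float(text.count(" ") + 1)] for text in batch]


texts = [f"sample text number {i}" for i in range(70)]
features = extract_features(texts, fake_embed_batch, batch_size=32)
batch_sizes = [len(batch) for batch in iter_batches(texts, 32)]

print("Batch sizes:", batch_sizes)
print("Number of feature vectors:", len(features))
```

The resulting list has one feature vector per input text, in the original order, so it can be passed to the classifier in the same way as features computed in one pass.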

In [53]:
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from datasets import load_dataset

# Load the dataset
raw_datasets = load_dataset("ag_news")
train_subset = raw_datasets['train'].select(range(1000))
test_subset = raw_datasets['test'].select(range(1000))

# Prepare training data
train_texts = [example['text'] for example in train_subset]
train_labels = [example['label'] for example in train_subset]

# Prepare test data
test_texts = [example['text'] for example in test_subset]
test_labels = [example['label'] for example in test_subset]

# Load pretrained model/tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# BERT Vectorization
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    train_features = model(**train_encodings)['pooler_output'].numpy()

test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    test_features = model(**test_encodings)['pooler_output'].numpy()

# Define MLP Classifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))

# Train the classifier
mlp.fit(train_features, train_labels)

# Test the classifier
predicted = mlp.predict(test_features)

# Print the classification report
print(classification_report(test_labels, predicted))


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClass

              precision    recall  f1-score   support

           0       0.80      0.81      0.80       268
           1       0.90      0.87      0.88       274
           2       0.78      0.42      0.55       205
           3       0.64      0.89      0.74       253

    accuracy                           0.77      1000
   macro avg       0.78      0.75      0.74      1000
weighted avg       0.78      0.77      0.76      1000



On this small sample of data, the MLP works better with BERT feature extraction (accuracy of 0.77) than with TF-IDF vectorization (accuracy of 0.68).