# Bag of words and Naive Bayes Classifier

## What is bag of words anyway?     

The Bag of Words (BoW) method is a popular way to **represent text data** in machine learning, which treats **each document as an unordered collection** or "bag" of words. This method is used for feature extraction in text data. 

In the Bag of Words method, **a text** (such as a sentence or a document) **is represented as the bag (multiset) of its words**, **disregarding grammar** and even **word order** but **keeping multiplicity**. The **frequency of each word** is used as a **feature** for training a classifier.        

## How does it work?       

1. **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation. In this project our dataset is already split into sentences and labeled.

2. **Vocabulary building**: All the words are collected and a dictionary is created in which the words are the keys and their indexes are the values. The size of this dictionary is the length of each text's vector representation.

3. **Text to Vector**: Each sentence or document is converted into a vector whose length equals the vocabulary size. Each position holds the number of times the corresponding vocabulary word occurs in the text: 0 if the word is absent, 1 if it appears once, and 'n' if it appears 'n' times.

For example, let's consider two sentences:
- Sentence 1: "The cat sat on the mat."
- Sentence 2: "The dog sat on the log."

In a BoW model, these sentences would first be tokenized (lowercased, with punctuation removed) into:

- Sentence 1: ["the", "cat", "sat", "on", "the", "mat"]
- Sentence 2: ["the", "dog", "sat", "on", "the", "log"]

Then, the vocabulary of unique words would be: ["the", "cat", "sat", "on", "mat", "dog", "log"]

Lastly, the sentences would be transformed into vectors based on this vocabulary:

- Sentence 1: [2, 1, 1, 1, 1, 0, 0]
- Sentence 2: [2, 0, 1, 1, 0, 1, 1]

As you can see, each index in the vector corresponds to a word in the vocabulary, and the value at each index corresponds to the number of times that word appears in the sentence. 
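
The three steps can be sketched in plain Python. This is only a minimal illustration, not the implementation used later in this notebook: the helper names `tokenize`, `build_vocabulary` and `vectorize` are hypothetical, and the vocabulary is assumed to be ordered by first appearance. With lowercasing applied, the vocabulary has seven entries, which matches the seven-element vectors shown above.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (punctuation is dropped)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    """Map each unique token to a column index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(doc, vocab):
    """Return the count of every vocabulary word in the document."""
    counts = Counter(tokenize(doc))
    # Python dicts preserve insertion order, so this follows the column indexes.
    return [counts.get(word, 0) for word in vocab]

sentences = ["The cat sat on the mat.", "The dog sat on the log."]
vocab = build_vocabulary(sentences)
vectors = [vectorize(s, vocab) for s in sentences]

print(list(vocab))  # ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'log']
print(vectors)      # [[2, 1, 1, 1, 1, 0, 0], [2, 0, 1, 1, 0, 1, 1]]
```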

The BoW approach is simple and effective, but it has some downsides. It produces sparse vectors: each vector is as long as the vocabulary, and for any single sentence or document most positions are zero because most vocabulary words do not appear in it. Also, this approach doesn't account for word order or context, so it might not be as effective for tasks where these elements are important.

We start by loading the dataset. Then we convert the training data into a list of strings and pass that list to **CountVectorizer**, which transforms the text data into a bag of words representation.

In [35]:
#reading the dataset
from datasets import load_dataset
raw_datasets = load_dataset("ag_news")

Found cached dataset ag_news (/Users/mahnaz/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)
100%|██████████| 2/2 [00:00<00:00, 326.15it/s]


In [36]:
# lets first have a look at the data
print("keys: ",raw_datasets.keys())
print(type(raw_datasets['train']))
print(raw_datasets['train'].features)
print(f"example of train datapoint: \n {raw_datasets['train'][0]}")
print(raw_datasets['train'][0]['text'])
print(raw_datasets['train'][0]['label'])

keys:  dict_keys(['train', 'test'])
<class 'datasets.arrow_dataset.Dataset'>
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None)}
example of train datapoint: 
 {'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2}
Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
2


In [37]:
# printing the first 10 lines of the data:
for i in range(10):
    print(raw_datasets['train'][i])

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2}
{'text': 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.', 'label': 2}
{'text': "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.", 'label': 2}
{'text': 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.', 'lab

In [38]:
# Convert the training data into list of strings
train_texts = [example['text'] for example in raw_datasets['train']]
train_labels = [example['label'] for example in raw_datasets['train']]

In [39]:
print(train_texts[0])
print(train_labels[0])

Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
2


In [40]:
# Create an instance of CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

The CountVectorizer transforms each text into a vector in a high-dimensional space. The dimension of this space is equal to the size of the learned vocabulary (i.e., the number of unique words in all documents of the training data). Each unique word has its own dimension and the value in that dimension is the count of this word in the corresponding document.

In [41]:
# Learn a vocabulary dictionary of all tokens in the raw documents and return the document-term matrix
X_train = vectorizer.fit_transform(train_texts)


In [42]:
type(X_train)
print("Shape of X_train: ", X_train.shape)
print(type(X_train[0]))
print(X_train[0])

Shape of X_train:  (120000, 65006)
<class 'scipy.sparse._csr.csr_matrix'>
  (0, 62536)	2
  (0, 54842)	1
  (0, 7396)	1
  (0, 12600)	1
  (0, 6510)	1
  (0, 30178)	1
  (0, 57946)	1
  (0, 8345)	1
  (0, 48864)	2
  (0, 52479)	1
  (0, 51606)	1
  (0, 55636)	1
  (0, 18921)	1
  (0, 6832)	1
  (0, 40992)	1
  (0, 60125)	1
  (0, 15556)	1
  (0, 5306)	1
  (0, 51513)	1
  (0, 25524)	1
  (0, 3522)	1


**Note** 

The above output is a representation of a sparse matrix in the Compressed Sparse Row (CSR) format. Let's break down the information:

The printed type, `<class 'scipy.sparse._csr.csr_matrix'>`, tells us that `X_train[0]` (the matrix representation of a single datapoint/sentence) is a sparse matrix in CSR format from the SciPy library.

Following that, there are multiple lines with a specific format `(row_index, column_index) value`. These lines **represent the non-zero elements of the sparse matrix**. Here's an explanation of each line:

- `(0, 62536) 2`: This line indicates that the value `2` is present at row index `0` and column index `62536`.
- `(0, 54842) 1`: This line indicates that the value `1` is present at row index `0` and column index `54842`.
- `(0, 7396) 1`: This line indicates that the value `1` is present at row index `0` and column index `7396`.
- ...

Each subsequent line follows the same pattern. The row index is always `0` here because we printed only the first document; each column index is the position of a word in the learned vocabulary, and the value is the number of times that word occurs in the document.

In summary, this output is a **representation of a sparse matrix in CSR format**, where the **non-zero elements are shown with their corresponding row and column indices**.
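
To see what CSR actually stores, here is a minimal sketch on a made-up 2x4 matrix (the array `dense` is an assumption for illustration, not data from this notebook). A CSR matrix keeps only three arrays: the non-zero values (`data`), their column indexes (`indices`), and row boundaries (`indptr`).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy count matrix: 2 documents, 4 vocabulary words
dense = np.array([[2, 0, 0, 1],
                  [0, 0, 3, 0]])
sparse = csr_matrix(dense)

print(sparse.data)     # non-zero values, row by row:   [2 1 3]
print(sparse.indices)  # column index of each value:    [0 3 2]
print(sparse.indptr)   # row i spans data[indptr[i]:indptr[i+1]]: [0 2 3]

# Recover the (column, value) pairs of the first row, like the printout above
row0 = sparse[0]
pairs = [(int(col), int(val)) for col, val in zip(row0.indices, row0.data)]
print(pairs)           # [(0, 2), (3, 1)]
```

Only 3 of the 8 cells are stored here; for a matrix of shape (120000, 65006) that is almost entirely zeros, the saving is what makes the representation practical.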

We also convert labels to a numpy array for later use with Scikit-learn:

In [43]:
import numpy as np
y_train = np.array(train_labels)

In [44]:
print("Shape of y_train: ", y_train.shape)
print(type(y_train[10]))
print(y_train[10])

Shape of y_train:  (120000,)
<class 'numpy.int64'>
2


In [45]:
unique_values, counts = np.unique(y_train, return_counts=True)

print("Unique values:", unique_values)
print("Counts:", counts)

Unique values: [0 1 2 3]
Counts: [30000 30000 30000 30000]


Now let's try the Naive Bayes classifier with a small sample of the data:

In [46]:
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np

# Load the dataset
raw_datasets = load_dataset("ag_news")

# Select the first 1000 samples
train_subset = raw_datasets['train'].select(range(1000))


# Convert the train data into list of strings
train_texts = train_subset['text']
train_labels = train_subset['label']

# Initialize a CountVectorizer
vectorizer = CountVectorizer()

# Transform the train data into the BoW representation
X_train = vectorizer.fit_transform(train_texts)

# Convert labels to numpy array
y_train = np.array(train_labels)

# Printing the counts of each class to make sure all classes have datapoints
unique_values, counts = np.unique(y_train, return_counts=True)

print("Unique values:", unique_values)
print("Counts:", counts)


# Initialize the Multinomial Naive Bayes classifier
naive_bayes_clf = MultinomialNB()

# Train the classifier
naive_bayes_clf.fit(X_train, y_train)

# Select the first 1000 samples from test set
test_subset = raw_datasets['test'].select(range(1000))

# Convert the test data into list of strings
test_texts = test_subset['text']
test_labels = test_subset['label']

# Transform the test data into the BoW representation
X_test = vectorizer.transform(test_texts)

# Convert labels to numpy array
y_test = np.array(test_labels)

# Predict the labels of the test data
y_pred = naive_bayes_clf.predict(X_test)

# Print the accuracy of the classifier
accuracy = (y_pred == y_test).mean()
print(f"Accuracy: {accuracy}")


Found cached dataset ag_news (/Users/mahnaz/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)
100%|██████████| 2/2 [00:00<00:00, 383.15it/s]

Unique values: [0 1 2 3]
Counts: [212 142 174 472]
Accuracy: 0.726





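We get an accuracy of 0.726 with the Naive Bayes classifier, which is weaker than the accuracy of 0.885 obtained with fine-tuned BERT. Keep in mind that this model was trained on only the first 1,000 training examples and evaluated on the first 1,000 test examples.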
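
One detail of the cell above is worth isolating: the vectorizer is fitted on the training texts only, and the test texts are passed through `transform`, so words that never appeared in training are simply ignored. The sketch below shows this on a made-up toy corpus (all `toy_*` names and texts are assumptions for illustration, not part of the AG News experiment).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-class toy corpus: 0 = sports, 1 = business
toy_train_texts = ["goal match team", "stock market profit"]
toy_train_labels = [0, 1]

# Learn the vocabulary from the training texts only
toy_vectorizer = CountVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_train_texts)

toy_clf = MultinomialNB()
toy_clf.fit(toy_X_train, toy_train_labels)

# "referee" was never seen during fit, so transform() drops it
toy_test_texts = ["team match referee", "market stock"]
toy_X_test = toy_vectorizer.transform(toy_test_texts)
toy_pred = toy_clf.predict(toy_X_test)

print(toy_X_test.shape)  # (2, 6): same six columns as the training matrix
print(toy_pred)          # [0 1]
```

Calling `fit_transform` on the test texts instead would build a different vocabulary, and the resulting columns would no longer line up with the features the classifier was trained on.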