# Bag Of Words (BOW)
* The bag-of-words (BOW) model is a **representation** of text that keeps word counts and discards word order.
* An early reference to "bag of words" in a linguistic context can be found in Zellig Harris's 1954 article on [Distributional Structure](https://www.tandfonline.com/doi/abs/10.1080/00437956.1954.11659520).
* The model converts text of arbitrary length into **fixed-length vectors** by counting how many times each vocabulary word appears.
* This process is also called **vectorization**.
* The bag-of-words model is commonly used in methods of **document classification** where the (frequency of) occurrence of each word is used as a feature for training a classifier.
* The BOW model is one example of a Vector Space Model (VSM).
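
To make the "bag" idea concrete, here is a minimal standard-library sketch. It assumes a simplified tokenizer (lowercasing and splitting on whitespace), and the helper name `bag_of_words` is hypothetical. It shows that two texts with the same words in a different order produce the same bag.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count tokens in `text` (assumption: lowercase, whitespace-separated)."""
    return Counter(text.lower().split())


a = bag_of_words("the cat sat in the hat")
b = bag_of_words("in the hat sat the cat")

print(a["the"])  # 2
print(a == b)    # True: same counts, word order is ignored
```

Because only the counts survive, anything that depends on word order (such as who did what to whom) is lost in this representation.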

In [None]:
%pip install tensorflow

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer  # Updated import

docs = [
  'the cat sat',
  'the cat sat in the hat',
  'the cat with the hat',
]

## Step 1: Determine the Vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
print(f'Vocabulary: {list(tokenizer.word_index.keys())}')
print(f'Word index: {tokenizer.word_index}')

## Step 2: Count
vectors = tokenizer.texts_to_matrix(docs, mode='count')
print("\nCount vectors:")
print(vectors)

Vocabulary: ['the', 'cat', 'sat', 'hat', 'in', 'with']
Word index: {'the': 1, 'cat': 2, 'sat': 3, 'hat': 4, 'in': 5, 'with': 6}

Count vectors:
[[0. 1. 1. 1. 0. 0. 0.]
 [0. 2. 1. 1. 1. 1. 0.]
 [0. 2. 1. 0. 1. 0. 1.]]
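
The matrix above has seven columns although the vocabulary has six words. Keras word indices start at 1 and index 0 is reserved, so column 0 is always zero. The following plain-Python sketch, which does not need TensorFlow, rebuilds the same rows from the `word_index` printed above and labels each non-zero column with its word. The helper name `count_row` is hypothetical.

```python
# word_index copied from the output above. Index 0 is reserved by Keras,
# so column 0 of the count matrix never corresponds to a word.
word_index = {'the': 1, 'cat': 2, 'sat': 3, 'hat': 4, 'in': 5, 'with': 6}
docs = ['the cat sat', 'the cat sat in the hat', 'the cat with the hat']


def count_row(doc, word_index):
    """Return one count row with len(word_index) + 1 columns."""
    row = [0] * (len(word_index) + 1)
    for word in doc.split():
        row[word_index[word]] += 1
    return row


rows = [count_row(doc, word_index) for doc in docs]
index_word = {index: word for word, index in word_index.items()}

for doc, row in zip(docs, rows):
    # Skip the reserved column 0 and any word that does not occur.
    labelled = {index_word[i]: n for i, n in enumerate(row) if i != 0 and n}
    print(f"{doc!r}: {row} -> {labelled}")
```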


In [10]:
sentence1 = ['John likes to watch movies. Mary likes movies too.']
sentence2 = ['Mary also likes to watch football games.']

def print_bow(texts: list[str]) -> None:
    """Print the word counts for the first text in `texts`."""
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    word_index = tokenizer.word_index
    # Count how often each word's index occurs in the first sequence.
    bow = {word: sequences[0].count(index) for word, index in word_index.items()}

    print(f"Bag of words:\n{bow}")
    print(f'We found {len(word_index)} unique tokens.')

In [12]:
print_bow(sentence1)

Bag of words:
{'likes': 2, 'movies': 2, 'john': 1, 'to': 1, 'watch': 1, 'mary': 1, 'too': 1}
We found 7 unique tokens.


In [14]:
print_bow(sentence2)

Bag of words:
{'mary': 1, 'also': 1, 'likes': 1, 'to': 1, 'watch': 1, 'football': 1, 'games': 1}
We found 7 unique tokens.
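
`print_bow` fits a fresh tokenizer on each sentence, so the two dictionaries above have different keys and cannot be compared position by position. To get fixed-length vectors that can be compared, the vocabulary has to be built once over all sentences. The sketch below uses only the standard library. It assumes a simple regex tokenizer that approximates, but is not identical to, the Keras default filtering, and it passes the sentences as plain strings rather than one-element lists.

```python
import re
from collections import Counter

sentence1 = 'John likes to watch movies. Mary likes movies too.'
sentence2 = 'Mary also likes to watch football games.'


def tokenize(text):
    # Assumption: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


tokens = [tokenize(s) for s in (sentence1, sentence2)]

# Shared vocabulary, in order of first appearance across both sentences.
vocab = list(dict.fromkeys(word for doc in tokens for word in doc))

# One count vector per sentence, all with len(vocab) entries.
counts = [Counter(doc) for doc in tokens]
vectors = [[c[word] for word in vocab] for c in counts]

print(vocab)
for vector in vectors:
    print(vector)
```

Both vectors now have one position per word in the shared vocabulary, and a word that does not occur in a sentence gets a count of 0.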
