**Bag of Words (BoW)**

**Concept:**
* A document is represented as a bag (multiset) of its words, ignoring grammar and word order.

* Only word frequency is kept (how many times each word appears in the document).

* The resulting count vectors are useful for machine learning models that need fixed-length numerical input.
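
To make the "ignores word order" point concrete, here is a minimal sketch using `collections.Counter` as the multiset. The helper name `bag` and the two sample sentences are illustrative assumptions, and splitting on whitespace is a simplification.

```python
from collections import Counter

def bag(text):
    # Hypothetical helper: lowercase, split on whitespace, count each word
    return Counter(text.lower().split())

a = bag("the dog chased the cat")
b = bag("the cat chased the dog")

# Same words with the same counts, so the two bags are equal
# even though the sentences mean different things.
print(a == b)    # True
print(a["the"])  # 2
```

This also shows the main limitation: sentences with different meanings can map to the same representation.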



**Example**


Let’s say we have 2 sentences:

1. "I love data"

2. "I love machine learning"

Step-by-step:

Build Vocabulary: collect the unique words from both sentences
➤ ["I", "love", "data", "machine", "learning"]

Vectorize: count how many times each vocabulary word appears in each sentence


| Sentence                      | I | love | data | machine | learning |
|------------------------------|---|------|------|---------|----------|
| "I love data"                | 1 | 1    | 1    | 0       | 0        |
| "I love machine learning"    | 1 | 1    | 0    | 1       | 1        |
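
The table above can be reproduced in a few lines of plain Python. This is a sketch under simple assumptions: words are split on whitespace, case is kept as written, and the vocabulary is ordered by first appearance so it matches the column order of the table.

```python
sentences = ["I love data", "I love machine learning"]

# Build the vocabulary in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Count each vocabulary word per sentence
vectors = [
    [sentence.split().count(word) for word in vocabulary]
    for sentence in sentences
]

print(vocabulary)  # ['I', 'love', 'data', 'machine', 'learning']
for sentence, vector in zip(sentences, vectors):
    print(vector, sentence)
# [1, 1, 1, 0, 0] I love data
# [1, 1, 0, 1, 1] I love machine learning
```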


In [1]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This this this  is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

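# Learn the vocabulary and build the document-term count matrix (sparse)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Vocabulary terms, in the same order as the columns of X
feature_names = vectorizer.get_feature_names_out()
print(feature_names)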


['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
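
The feature names come out lowercase and without punctuation because `CountVectorizer`, with its default settings, lowercases the text and extracts tokens with the pattern `(?u)\b\w\w+\b` (runs of two or more word characters). The sketch below imitates that behaviour with the standard `re` module; `default_like_tokenize` is a hypothetical helper, not part of scikit-learn.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokenize(text):
    # Hypothetical helper: lowercase, then keep tokens of 2+ word characters
    return TOKEN_PATTERN.findall(text.lower())

print(default_like_tokenize("Is this the first document?"))
# ['is', 'this', 'the', 'first', 'document']
print(default_like_tokenize("I love data"))
# ['love', 'data']
```

One consequence is that single-character words such as "I" are dropped, so the default settings would not reproduce the hand-built vocabulary from the two-sentence example.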


In [2]:
# Convert the sparse count matrix to a dense NumPy array for display
dense_matrix = X.toarray()

In [3]:
import pandas as pd
df = pd.DataFrame(dense_matrix, columns=feature_names)
print(df)

   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     3
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1


**From Scratch**

In [4]:
corpus = [
    'This this this  is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
]

In [5]:
# Tokenizer: lowercase the text and split on whitespace.
# Punctuation is not stripped, which is why the corpus above omits it.

def tokenizer(text):
    return text.lower().split()

In [6]:
# Vocabulary: sorted unique tokens across the whole corpus
vocab = []
for doc in corpus:
    vocab.extend(tokenizer(doc))

vocab = sorted(set(vocab))
print("Vocabulary:", vocab)


Vocabulary: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [7]:
emb = []


In [8]:
for doc in corpus:
    txt = tokenizer(doc)
    freq = {word: 0 for word in vocab}  # start every vocabulary word at 0

    for word in txt:
        if word in freq:
            freq[word] += 1
    print(freq)

    emb.append(freq)

{'and': 0, 'document': 1, 'first': 1, 'is': 1, 'one': 0, 'second': 0, 'the': 1, 'third': 0, 'this': 3}
{'and': 0, 'document': 2, 'first': 0, 'is': 1, 'one': 0, 'second': 1, 'the': 1, 'third': 0, 'this': 1}
{'and': 1, 'document': 0, 'first': 0, 'is': 1, 'one': 1, 'second': 0, 'the': 1, 'third': 1, 'this': 1}
{'and': 0, 'document': 1, 'first': 1, 'is': 1, 'one': 0, 'second': 0, 'the': 1, 'third': 0, 'this': 1}
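
The counting loop above can also be written with `collections.Counter`, which returns 0 for words it has not seen. This is an alternative sketch rather than a change to the cells above; it redefines the corpus and vocabulary so it runs on its own, and the name `rows` is an illustrative assumption.

```python
from collections import Counter

corpus = [
    'This this this  is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
]

# Sorted unique lowercase tokens across the corpus
vocab = sorted({word for doc in corpus for word in doc.lower().split()})

rows = []
for doc in corpus:
    counts = Counter(doc.lower().split())
    # Counter yields 0 for vocabulary words missing from this document
    rows.append([counts[word] for word in vocab])

for row in rows:
    print(row)
# [0, 1, 1, 1, 0, 0, 1, 0, 3]
# [0, 2, 0, 1, 0, 1, 1, 0, 1]
# [1, 0, 0, 1, 1, 0, 1, 1, 1]
# [0, 1, 1, 1, 0, 0, 1, 0, 1]
```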


In [9]:
emb

[{'and': 0,
  'document': 1,
  'first': 1,
  'is': 1,
  'one': 0,
  'second': 0,
  'the': 1,
  'third': 0,
  'this': 3},
 {'and': 0,
  'document': 2,
  'first': 0,
  'is': 1,
  'one': 0,
  'second': 1,
  'the': 1,
  'third': 0,
  'this': 1},
 {'and': 1,
  'document': 0,
  'first': 0,
  'is': 1,
  'one': 1,
  'second': 0,
  'the': 1,
  'third': 1,
  'this': 1},
 {'and': 0,
  'document': 1,
  'first': 1,
  'is': 1,
  'one': 0,
  'second': 0,
  'the': 1,
  'third': 0,
  'this': 1}]

In [10]:
import pandas as pd
df = pd.DataFrame(emb)
print("\nBag of Words Embedding:\n")
print(df)



Bag of Words Embedding:

   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     3
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1


In [11]:
corpus = [
    'This this this  is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
]




In [12]:
import pandas as pd

def BOW(corpus):
    """Return a DataFrame of word counts: one row per document, one column per vocabulary word."""

    def tokenizer(text):
        # Lowercase and split on whitespace; punctuation is not stripped
        return text.lower().split()

    # Vocabulary: sorted unique tokens across the whole corpus
    vocab = []
    for doc in corpus:
        vocab.extend(tokenizer(doc))
    vocab = sorted(set(vocab))

    emb = []
    for doc in corpus:
        # Start every vocabulary word at 0, then count this document's tokens
        freq = {word: 0 for word in vocab}
        for word in tokenizer(doc):
            # Every token is in vocab because vocab was built from this corpus
            freq[word] += 1
        emb.append(freq)

    return pd.DataFrame(emb)





In [13]:
BOW(corpus)

   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     3
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1
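
`BOW` builds the vocabulary and the counts from the same corpus in one call. To vectorize a new document against an existing vocabulary, words outside the vocabulary have to be skipped, which is the role of an `if word in freq` guard. The sketch below separates the two steps; `fit_vocabulary` and `transform` are hypothetical helper names, and the new sentence is made up for illustration.

```python
def fit_vocabulary(corpus):
    # Hypothetical helper: sorted unique lowercase tokens across the corpus
    return sorted({word for doc in corpus for word in doc.lower().split()})

def transform(doc, vocab):
    # Hypothetical helper: count vector aligned with vocab
    freq = {word: 0 for word in vocab}
    for word in doc.lower().split():
        if word in freq:  # skip words that are not in the vocabulary
            freq[word] += 1
    return [freq[word] for word in vocab]

corpus = [
    'This this this  is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
]

vocab = fit_vocabulary(corpus)
# "a" and "new" are not in the vocabulary, so they are ignored
print(transform("This is a new document", vocab))
# [0, 1, 0, 1, 0, 0, 0, 0, 1]
```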
