# ***Engr. Muhammad Javed***

# 1. Bag of Words (BoW)

Bag of Words (BoW) builds a vocabulary of all unique words in the corpus and represents each document as a vector of counts, one entry per vocabulary word.

## Vocabulary
The set of unique words (tokens) in the corpus. Each word is assigned a fixed column index in the document-term matrix.
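
As a rough illustration of the idea, the sketch below builds a vocabulary and count vectors by hand. The two toy sentences and the whitespace tokenizer are assumptions made for the example; they are not taken from the dataset and are simpler than what a real vectorizer does.

```python
from collections import Counter

# Toy corpus (hypothetical, not from the dataset)
docs = ["i feel happy", "i feel sad and i feel tired"]

# Naive tokenization: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: the sorted set of unique tokens across all documents
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One count vector per document, aligned with the vocabulary order
vectors = []
for toks in tokenized:
    counts = Counter(toks)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each position in a vector corresponds to the word at the same position in `vocabulary`, which is what makes vectors from different documents comparable.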

## CountVectorizer
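scikit-learn provides `CountVectorizer` (in `sklearn.feature_extraction.text`) to implement BoW. Its `fit_transform` method learns the vocabulary and returns the document-term count matrix as a SciPy sparse matrix.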

## Pros & Cons
**Pros:**
- Simple to understand and implement.

**Cons:**
- High-dimensional, mostly-zero vectors (memory-heavy, especially when converted to a dense array).
- Ignores word order.
- Treats all words equally, so frequent but uninformative words (e.g. "the", "to") can dominate the counts.
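
The loss of word order can be seen in a small sketch. It uses plain `collections.Counter` word counts as a stand-in for BoW vectors, and the two sentences are invented for the example.

```python
from collections import Counter

# Two sentences with opposite meanings but the same words (hypothetical)
doc_a = "dog bites man"
doc_b = "man bites dog"

# Bag-of-words view: only the counts per word survive
bow_a = Counter(doc_a.split())
bow_b = Counter(doc_b.split())

# The two representations are indistinguishable
identical = bow_a == bow_b
print(identical)
```

Any model trained on such counts alone cannot tell these two sentences apart.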

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load Data
df_train = pd.read_csv('../Dataset/train.txt', sep=';', names=['text', 'emotion'])
# Use a small subset for demonstration
documents = df_train['text'].iloc[:5].tolist()

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and Transform
X = vectorizer.fit_transform(documents)

# Get Vocabulary
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", vocab)

# Show Array
print("\nEncoded Document (Array):\n", X.toarray())
```

```text
Vocabulary: ['about' 'am' 'and' 'around' 'awake' 'being' 'can' 'cares' 'damned'
 'didnt' 'ever' 'feel' 'feeling' 'fireplace' 'from' 'go' 'grabbing'
 'greedy' 'grouchy' 'hopeful' 'hopeless' 'humiliated' 'im' 'is' 'it'
 'just' 'know' 'minute' 'nostalgic' 'on' 'post' 'property' 'so' 'someone'
 'still' 'that' 'the' 'to' 'who' 'will' 'wrong']

Encoded Document (Array):
 [[0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]
 [0 0 1 1 1 1 1 1 1 0 0 0 1 0 2 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 2 1 0 0
  0 1 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
  0 1 0 0 1]
 [1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 1 1
  2 0 0 1 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0]]
```
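
To read a row of the encoded array, pair each count with the vocabulary word at the same index. The sketch below does this for the first row, with the vocabulary and the row copied from the printed output above. It is only a way of decoding the output, not part of the original notebook.

```python
# Vocabulary as printed by the notebook (41 words, alphabetical order)
vocab = [
    "about", "am", "and", "around", "awake", "being", "can", "cares", "damned",
    "didnt", "ever", "feel", "feeling", "fireplace", "from", "go", "grabbing",
    "greedy", "grouchy", "hopeful", "hopeless", "humiliated", "im", "is", "it",
    "just", "know", "minute", "nostalgic", "on", "post", "property", "so",
    "someone", "still", "that", "the", "to", "who", "will", "wrong",
]

# First row of the encoded array, copied from the printed output
row0 = [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0,
]

# Every row has exactly one entry per vocabulary word
assert len(row0) == len(vocab)

# Keep only the words that actually occur in the document
decoded = {word: count for word, count in zip(vocab, row0) if count > 0}
print(decoded)
```

The first document therefore contributes only the words `didnt`, `feel`, and `humiliated` to the matrix; every other column in that row is zero.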
