# Bag of Words (BoW) - Text Representation

## Overview

**Bag of Words (BoW)** is one of the most fundamental text representation techniques in NLP. It converts text into fixed-length numerical vectors by counting word occurrences, completely ignoring grammar and word order.

### Key Concepts

| Concept | Description |
|:--------|:------------|
| **Vocabulary** | Set of all unique words across all documents |
| **Document Vector** | A vector where each dimension represents a word count |
| **Sparse Matrix** | Most values are 0 (words not present in document) |

### How It Works

```
Document: "Thor is eating pizza"
Vocabulary: [eating, is, pizza, thor]
Vector:     [1, 1, 1, 1]
```
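
The counting step above can be sketched in plain Python. This is a minimal illustration, not what `CountVectorizer` does internally: the whitespace tokenizer is a simplifying assumption, and the second document (about "Loki") is made up so that the two vectors differ.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of raw text documents."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of unique tokens across all documents.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["Thor is eating pizza", "Loki is eating pizza pizza"]  # second doc is hypothetical
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Every document gets a vector of the same length (the vocabulary size), regardless of how many words it contains.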

---

## 📊 Dataset: SMS Spam Collection

We'll use a spam classification dataset to demonstrate BoW in action.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("spam.csv")
df.head()

| | Category | Message |
|:--|:---------|:--------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
df.Category.value_counts()

Category
ham     4825
spam     747
Name: count, dtype: int64

### Class Distribution Analysis

Understanding the class distribution is crucial before building any classifier:
- **Balanced dataset**: Similar number of samples in each class
- **Imbalanced dataset**: One class dominates (requires special handling)

Our spam dataset is imbalanced: 4,825 of the 5,572 messages (about 87%) are "ham" (legitimate) and only 747 (about 13%) are "spam".
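
To put numbers on the imbalance, the class shares can be computed with `value_counts(normalize=True)`. The sketch below rebuilds a label column from the counts printed above rather than reading `spam.csv`, so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Rebuild a label column from the counts shown in the value_counts() output.
categories = pd.Series(["ham"] * 4825 + ["spam"] * 747, name="Category")

shares = categories.value_counts(normalize=True)
print(shares.round(3))

# A model that always predicts the majority class already reaches this accuracy,
# which is why accuracy alone is a weak metric on imbalanced data.
majority_baseline = shares.max()
print(f"always-ham baseline accuracy: {majority_baseline:.3f}")
```

Any classifier worth keeping should beat that baseline, and per-class precision and recall say more than overall accuracy.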

In [4]:
df['spam'] = df['Category'].apply(lambda x: 1 if x =='spam' else 0)

In [5]:
df.shape

(5572, 3)

### Label Encoding

We convert categorical labels ("spam"/"ham") to numerical values (1/0) because machine learning algorithms work with numbers, not text.

| Category | Numeric Label |
|:---------|:-------------:|
| spam | 1 |
| ham | 0 |
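
One caveat with the `lambda` used for encoding: any label that is not exactly `"spam"` becomes 0, including typos. A possible alternative, sketched here on a made-up four-row frame with one deliberately malformed label, is an explicit `map`, which leaves unknown labels as `NaN` so they can be spotted.

```python
import pandas as pd

# Hypothetical data: the last label is malformed on purpose.
toy = pd.DataFrame({"Category": ["ham", "spam", "ham", "Spam "]})

# The lambda approach silently maps anything that is not exactly "spam" to 0.
toy["spam_lambda"] = toy["Category"].apply(lambda x: 1 if x == "spam" else 0)

# An explicit mapping leaves unexpected labels as NaN instead of guessing.
toy["spam_map"] = toy["Category"].map({"ham": 0, "spam": 1})

unmapped = int(toy["spam_map"].isna().sum())
print(toy)
print("unmapped rows:", unmapped)
```

On this dataset both approaches give the same result, since `value_counts()` shows only the two expected labels.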

In [6]:
df.head()

| | Category | Message | spam |
|:--|:---------|:--------|:----:|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


---

## 🔀 Train-Test Split

We split our data into training and testing sets:
- **Training set (80%)**: Used to train the model
- **Test set (20%)**: Used to evaluate model performance on unseen data

**Important**: Never evaluate on training data - it leads to overly optimistic results!

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.2)

In [8]:
X_train.shape

(4457,)

In [9]:
X_test.shape

(1115,)
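
The split above is unseeded, so the exact rows (and every number downstream) change from run to run. A variant worth considering for an imbalanced dataset is a seeded, stratified split. The sketch below uses a synthetic stand-in for the data; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 messages, 13 of them spam (label 1).
messages = np.array([f"message {i}" for i in range(100)])
labels = np.array([1] * 13 + [0] * 87)

X_tr, X_te, y_tr, y_te = train_test_split(
    messages,
    labels,
    test_size=0.2,
    stratify=labels,   # keep the spam/ham ratio similar in both splits
    random_state=42,   # arbitrary seed so the split is repeatable
)

print(len(X_tr), len(X_te))
print(f"spam share: train={y_tr.mean():.3f}, test={y_te.mean():.3f}")
```

Applied to the notebook's data, this would mean passing `stratify=df.spam` and a `random_state` to the existing call. The outputs shown in this notebook come from the unseeded version.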

In [10]:
type(X_train)

pandas.core.series.Series

In [11]:
X_train[:4]

3394                                          Ok thanx...
890          Wife.how she knew the time of murder exactly
1596          Pls confirm the time to collect the cheque.
1880    U have a secret admirer who is looking 2 make ...
Name: Message, dtype: object

In [12]:
type(y_train)

pandas.core.series.Series

In [13]:
y_train[:4]

3394    0
890     0
1596    0
1880    1
Name: spam, dtype: int64

In [14]:
type(X_train.values)

numpy.ndarray

---

## 🔢 Creating Bag of Words with CountVectorizer

`CountVectorizer` from scikit-learn converts text documents into a matrix of word counts.

### Key Methods:
- **`fit()`**: Learn the vocabulary from training data
- **`transform()`**: Convert documents to vectors using learned vocabulary
- **`fit_transform()`**: Do both in one step (only on training data!)

### Important Concept:
```python
# On training data: fit_transform() - learns vocabulary AND transforms
X_train_cv = vectorizer.fit_transform(X_train)

# On test data: transform() ONLY - uses existing vocabulary
X_test_cv = vectorizer.transform(X_test)
```

⚠️ **Never use `fit_transform()` on test data** - it would learn a new vocabulary from the test set, so the columns would no longer line up with the features the model was trained on.
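
A small sketch of what goes wrong, using made-up documents: refitting on the test set produces a matrix with a different number of columns, and even a matching column count would not mean the columns refer to the same words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus for illustration only.
train_docs = ["free prize call now", "are we meeting for lunch"]
test_docs = ["call me about lunch tomorrow"]

vectorizer = CountVectorizer()
X_train_demo = vectorizer.fit_transform(train_docs)
X_test_demo = vectorizer.transform(test_docs)      # correct: reuse the learned vocabulary

# Mistake: refitting on the test set learns a new, incompatible vocabulary.
wrong_vectorizer = CountVectorizer()
X_test_wrong = wrong_vectorizer.fit_transform(test_docs)

print("train:", X_train_demo.shape)
print("test (transform):", X_test_demo.shape)
print("test (fit_transform):", X_test_wrong.shape)
```

Only the `transform()` result has the same width as the training matrix. Test words that were never seen in training ("me", "about", "tomorrow") are simply dropped.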

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()

X_train_cv = v.fit_transform(X_train.values)
X_train_cv

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 59188 stored elements and shape (4457, 7736)>

In [16]:
X_train_cv.toarray()[:2][0]

array([0, 0, 0, ..., 0, 0, 0], shape=(7736,))

In [17]:
X_train_cv.shape

(4457, 7736)

### Understanding the Sparse Matrix

The output shape `(4457, 7736)` means:
- **4457 documents** (training samples)
- **7736 unique words** (vocabulary size)

This creates a sparse matrix where most values are 0 (a document typically contains only a small fraction of all vocabulary words).
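
How sparse is it? A quick back-of-the-envelope calculation, using only the numbers from the sparse-matrix repr above (shape and stored-element count):

```python
# Numbers copied from the sparse-matrix repr above.
n_docs, n_features = 4457, 7736
stored = 59188

total_cells = n_docs * n_features
density = stored / total_cells
dense_bytes = total_cells * 8  # int64 = 8 bytes per cell if stored densely

print(f"total cells: {total_cells:,}")
print(f"non-zero share: {density:.4%}")
print(f"avg distinct tokens per message: {stored / n_docs:.1f}")
print(f"dense int64 size: {dense_bytes / 1e6:.0f} MB")
```

Only roughly 0.17% of the cells are stored, about 13 distinct tokens per message. That is why scikit-learn returns a sparse matrix, and why calling `.toarray()` on a much larger corpus can exhaust memory.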

In [34]:
v.get_feature_names_out()[1771]

'chickened'

In [19]:
v.vocabulary_

{'ok': 4910,
 'thanx': 6810,
 'wife': 7504,
 'how': 3503,
 'she': 6063,
 'knew': 3942,
 'the': 6815,
 'time': 6901,
 'of': 4885,
 'murder': 4636,
 'exactly': 2675,
 'pls': 5250,
 'confirm': 1950,
 'to': 6934,
 'collect': 1884,
 'cheque': 1760,
 'have': 3340,
 'secret': 5967,
 'admirer': 821,
 'who': 7489,
 'is': 3734,
 'looking': 4178,
 'make': 4308,
 'contact': 1970,
 'with': 7541,
 'find': 2839,
 'out': 4997,
 'they': 6838,
 'reveal': 5755,
 'thinks': 6847,
 'ur': 7198,
 'so': 6270,
 'special': 6352,
 'call': 1600,
 'on': 4927,
 '09058094594': 176,
 'reaching': 5585,
 'home': 3458,
 'in': 3630,
 'min': 4464,
 'bring': 1507,
 'some': 6285,
 'wendy': 7450,
 'message': 4438,
 'from': 3008,
 'am': 934,
 'at': 1121,
 'truro': 7056,
 'hospital': 3486,
 'ext': 2718,
 'you': 7697,
 'can': 1622,
 'phone': 5178,
 'me': 4393,
 'here': 3397,
 'as': 1086,
 'by': 1583,
 'my': 4652,
 'side': 6141,
 'yeah': 7671,
 'probably': 5423,
 'but': 1568,
 'not': 4819,
 'sure': 6618,
 'ilol': 3604,
 'let': 40

In [20]:
X_train_np = X_train_cv.toarray()
X_train_np[0]

array([0, 0, 0, ..., 0, 0, 0], shape=(7736,))

In [21]:
np.where(X_train_np[0]!=0)

(array([4910, 6810]),)

In [23]:
X_train[1579]

"How to Make a girl Happy? It's not at all difficult to make girls happy. U only need to be... 1. A friend 2. Companion 3. Lover 4. Chef . . .  &lt;#&gt; . Good listener  &lt;#&gt; . Organizer  &lt;#&gt; . Good boyfriend  &lt;#&gt; . Very clean  &lt;#&gt; . Sympathetic  &lt;#&gt; . Athletic  &lt;#&gt; . Warm . . .  &lt;#&gt; . Courageous  &lt;#&gt; . Determined  &lt;#&gt; . True  &lt;#&gt; . Dependable  &lt;#&gt; . Intelligent . . .  &lt;#&gt; . Psychologist  &lt;#&gt; . Pest exterminator  &lt;#&gt; . Psychiatrist  &lt;#&gt; . Healer . .  &lt;#&gt; . Stylist  &lt;#&gt; . Driver . . Aaniye pudunga venaam.."

In [24]:
X_train_np[0][1771]

np.int64(0)

---

## 🤖 Training Naive Bayes Classifier

**Multinomial Naive Bayes** is a common, strong baseline for text classification with word counts because:

1. **Based on word frequencies**: It models features as counts
2. **Fast training**: Fitting only needs to tally per-class word counts, so it stays efficient even with large vocabularies
3. **Works well with sparse data**: It accepts the sparse BoW matrix directly
4. **Probabilistic**: Provides probability estimates for predictions

### The Math Behind It

Naive Bayes calculates:
$$P(\text{spam} | \text{words}) \propto P(\text{spam}) \times \prod P(\text{word}_i | \text{spam})$$

The "naive" assumption is that words are conditionally independent of one another given the class.
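
To make the formula concrete, here is a hand-rolled version checked against `MultinomialNB`. It is a sketch on a made-up 5x3 count matrix (the column meanings "free", "lunch", "win" are hypothetical). It works in log space and uses additive smoothing with `alpha=1.0`, the library default.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: columns are the (hypothetical) words [free, lunch, win].
X = np.array([
    [2, 0, 1],   # spam
    [1, 0, 2],   # spam
    [0, 2, 0],   # ham
    [0, 1, 0],   # ham
    [1, 1, 0],   # ham
])
y = np.array([1, 1, 0, 0, 0])
alpha = 1.0  # Laplace smoothing

def manual_log_posterior(X, y, x_new, alpha):
    scores = []
    for c in np.unique(y):
        word_counts = X[y == c].sum(axis=0)
        # P(word_i | class) with additive smoothing
        word_probs = (word_counts + alpha) / (word_counts.sum() + alpha * X.shape[1])
        log_prior = np.log(np.mean(y == c))
        # A word seen k times contributes k * log P(word | class).
        scores.append(log_prior + x_new @ np.log(word_probs))
    scores = np.array(scores)
    # Normalize in log space so the posteriors sum to 1.
    return scores - np.logaddexp.reduce(scores)

x_new = np.array([1, 0, 1])  # a message containing "free" and "win"
manual = manual_log_posterior(X, y, x_new, alpha)
sklearn_result = MultinomialNB(alpha=alpha).fit(X, y).predict_log_proba([x_new])[0]

print("P(ham), P(spam):", np.exp(manual))
print("matches sklearn:", np.allclose(manual, sklearn_result))
```

Smoothing matters because without it a single word never seen in a class would drive that class's product to zero.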

In [None]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_cv, y_train)

MultinomialNB()


In [26]:
X_test_cv = v.transform(X_test)

### Transform Test Data

⚠️ **Critical**: Use `transform()` (not `fit_transform()`) on test data to use the same vocabulary learned from training data.

---

## 📈 Evaluate Performance

### Understanding the Classification Report

| Metric | Description |
|:-------|:------------|
| **Precision** | Of all predicted spam, how many were actually spam? |
| **Recall** | Of all actual spam, how many did we catch? |
| **F1-Score** | Harmonic mean of precision and recall |
| **Support** | Number of samples in each class |

For spam detection:
- **High Precision** = Few false positives (legitimate messages are rarely flagged as spam)
- **High Recall** = Few false negatives (little spam slips through undetected)
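
The definitions in the table reduce to simple ratios over the confusion matrix. The sketch below computes them by hand on ten made-up labels (not the notebook's predictions) and compares them with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions (1 = spam, 0 = ham).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of everything flagged as spam, how much was spam
recall = tp / (tp + fn)      # of all real spam, how much was flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

Here one missed spam message costs recall, while two ham messages flagged as spam cost precision. Which error is worse depends on the application.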

In [27]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       974
           1       0.97      0.97      0.97       141

    accuracy                           0.99      1115
   macro avg       0.98      0.98      0.98      1115
weighted avg       0.99      0.99      0.99      1115



### Testing on New Emails

Let's test our model on completely new messages it has never seen:

In [28]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1])

---

## 🔧 Using Sklearn Pipeline

**Pipeline** combines multiple steps into a single object, making code cleaner and preventing data leakage.

### Benefits of Pipeline:
1. **Cleaner code**: One object instead of multiple
2. **Prevents leakage**: Ensures proper fit/transform sequence
3. **Easy deployment**: Save and load entire pipeline
4. **Cross-validation friendly**: Works seamlessly with GridSearchCV

In [29]:
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [30]:
clf.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])


In [31]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       974
           1       0.97      0.97      0.97       141

    accuracy                           0.99      1115
   macro avg       0.98      0.98      0.98      1115
weighted avg       0.99      0.99      0.99      1115
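
Because the vectorizer lives inside the pipeline, cross-validation refits the vocabulary on each training fold only, which is the leakage protection mentioned earlier. A minimal sketch with `cross_val_score` follows. The twelve-message corpus is made up purely to keep the example runnable, so the scores themselves mean nothing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus: 6 spam-like and 6 ham-like messages.
texts = [
    "win a free prize now", "free entry claim your cash", "urgent call to win cash",
    "claim your free reward today", "cash prize waiting call now", "free offer win big",
    "are we still on for lunch", "see you at the meeting", "can you call me later",
    "thanks for dinner last night", "running late be there soon", "what time is the game",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Each of the 3 folds fits the vectorizer and the classifier on its own training part.
scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy")
print(scores, scores.mean())
```

With the notebook's data, the same call would take `clf`, `df.Message`, and `df.spam`, and give a less split-dependent estimate than a single train/test split.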



---

## 🎯 Key Takeaways

1. **Bag of Words** converts text to numerical vectors by counting word occurrences
2. **CountVectorizer** creates a vocabulary and transforms documents to vectors
3. **Always use `transform()`** on test data, not `fit_transform()`
4. **Multinomial Naive Bayes** is excellent for text classification with word counts
5. **Pipeline** simplifies the workflow and prevents common mistakes

### Limitations of BoW:
- ❌ Loses word order ("dog bites man" and "man bites dog" get identical vectors)
- ❌ Cannot capture semantics ("happy" and "joyful" are treated as unrelated words)
- ❌ High dimensionality with large vocabularies
- ❌ Cannot handle out-of-vocabulary words (words not seen during `fit()` are silently dropped by `transform()`)
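
Two of these limitations are easy to reproduce. The sketch below uses made-up sentences to show that word order is lost and that unseen words vanish. It then shows that adding bigrams via `ngram_range=(1, 2)` recovers some order information.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["dog bites man", "happy birthday"])

# 1. Word order is lost: both sentences map to the same vector.
a, b = vec.transform(["dog bites man", "man bites dog"]).toarray()
same_unigram_vectors = bool((a == b).all())
print("identical unigram vectors:", same_unigram_vectors)

# 2. Out-of-vocabulary words are dropped: the result is an all-zero vector.
oov_total = int(vec.transform(["joyful zebra"]).toarray().sum())
print("sum of OOV vector:", oov_total)

# 3. Unigrams plus bigrams can tell the two sentences apart.
bigram_vec = CountVectorizer(ngram_range=(1, 2))
c, d = bigram_vec.fit_transform(["dog bites man", "man bites dog"]).toarray()
same_bigram_vectors = bool((c == d).all())
print("identical with bigrams:", same_bigram_vectors)
```

The price of bigrams is a larger vocabulary, which makes the dimensionality problem worse.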

**Next**: Learn about **N-grams** to capture some word order!