# Bag of Words

Bag of Words (BoW) is a text representation technique that converts text into numerical vectors based on word frequency, ignoring grammar and word order.

Machine learning models cannot operate on raw text, so BoW converts text into numbers.
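
The text-to-numbers idea can be sketched in plain Python. The two example sentences and the helper name `bag_of_words` are made up for illustration, and the sketch only lowercases and splits on whitespace, which is far less than a real tokenizer does.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of strings (illustrative helper)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Vocabulary: every distinct word, sorted so each word gets a stable column.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary word; words absent from the document count as 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each document becomes a row, each vocabulary word a column, and the position of a word inside the sentence is discarded.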

### NLP Pipeline (where BoW fits)

1. Text cleaning
2. Tokenization
3. Stopword removal
4. Bag of Words
5. ML Model
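
As a rough sketch of how these steps chain together, scikit-learn's `CountVectorizer` can handle lowercasing, tokenization, stopword removal, and the BoW step in one object, which then feeds a classifier. The four training texts, their labels, and the choice of `MultinomialNB` are assumptions made for this example, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for this sketch.
train_texts = [
    "win free prize now",
    "free money offer",
    "meeting schedule tomorrow",
    "project report attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Steps 1-4 happen inside CountVectorizer; step 5 is the classifier.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["free prize offer", "project meeting tomorrow"])
print(list(predictions))  # ['spam', 'ham']
```

Note that `stop_words="english"` uses scikit-learn's built-in list, which differs from the NLTK list used later in this notebook.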

### Key Characteristics of Bag of Words

- ✔ Simple and easy to implement
- ✔ Fast computation
- ✔ Works well for basic text classification

- ❌ Ignores semantic meaning
- ❌ Ignores word order and context
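
The word-order limitation is easy to see with two sentences that mean opposite things but contain the same words. This is a minimal sketch; the sentences and the helper `to_bag` are illustrative.

```python
from collections import Counter

def to_bag(text):
    """Count lowercased, whitespace-separated tokens (illustrative helper)."""
    return Counter(text.lower().split())

bag_a = to_bag("dog bites man")
bag_b = to_bag("man bites dog")

# Same words, same counts: the two bags are indistinguishable.
print(bag_a == bag_b)  # True
```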

### Types of Bag of Words

1. Binary Bag of Words
    - Uses 0 or 1
    - Indicates presence or absence of a word

2. Count-based Bag of Words
    - Uses word frequency
    - Most commonly used

### Applications of Bag of Words

- Text classification
- Spam detection
- Document similarity
- Information retrieval
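
For document similarity, one common choice is the cosine of the angle between two BoW vectors. The sketch below computes it in plain Python; the helper name and the example sentences are assumptions for illustration.

```python
import math
from collections import Counter

def bow_cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

related = bow_cosine_similarity("machine learning models", "machine learning algorithms")
unrelated = bow_cosine_similarity("machine learning models", "game playing robotics")
print(round(related, 3))  # 0.667
print(unrelated)          # 0.0
```

Two of the three words overlap in the first pair, giving 2/3; the second pair shares no words, so the similarity is 0.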

### Required libraries:

- re → text cleaning
- nltk → sentence tokenization and stopwords
- sklearn → Bag of Words model (`CountVectorizer`)
- pandas → tabular view of the resulting matrix

In [91]:
import re
import nltk

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

In [92]:
nltk.download('punkt')      # sentence tokenizer data; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('stopwords')  # stopword lists

[nltk_data] Downloading package punkt to C:\Users\Purvi
[nltk_data]     jain\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Purvi
[nltk_data]     jain\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [93]:
paragraph = '''
Machine Learning (ML) is a branch of Artificial Intelligence that enables systems to learn from data and make predictions or decisions without explicit programming.
What is Machine Learning?
Machine Learning is a subset of Artificial Intelligence (AI) that focuses on developing algorithms and statistical models that allow computers to perform specific tasks without using explicit instructions. Instead, they rely on patterns and inference derived from data. This capability enables machines to improve their performance over time as they are exposed to more data. 
Types of Machine Learning
Supervised Learning: In this approach, models are trained on labeled data, meaning the input data is paired with the correct output. The model learns to predict outcomes based on this training. Common algorithms include Linear Regression and Decision Trees.
Unsupervised Learning: This type involves training models on data without labeled responses. The model tries to find patterns or groupings in the data. Clustering algorithms, such as K-means, are examples of unsupervised learning.
Reinforcement Learning: Here, an agent learns to make decisions by taking actions in an environment to maximize cumulative rewards. This approach is often used in robotics and game playing.
Semi-Supervised and Self-Supervised Learning: These methods combine labeled and unlabeled data to improve learning efficiency, especially when labeling data is costly. 
'''

In [94]:
sentences = nltk.sent_tokenize(paragraph)
print(sentences)

['\nMachine Learning (ML) is a branch of Artificial Intelligence that enables systems to learn from data and make predictions or decisions without explicit programming.', 'What is Machine Learning?', 'Machine Learning is a subset of Artificial Intelligence (AI) that focuses on developing algorithms and statistical models that allow computers to perform specific tasks without using explicit instructions.', 'Instead, they rely on patterns and inference derived from data.', 'This capability enables machines to improve their performance over time as they are exposed to more data.', 'Types of Machine Learning\nSupervised Learning: In this approach, models are trained on labeled data, meaning the input data is paired with the correct output.', 'The model learns to predict outcomes based on this training.', 'Common algorithms include Linear Regression and Decision Trees.', 'Unsupervised Learning: This type involves training models on data without labeled responses.', 'The model tries to find 

In [95]:
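corpus = []
# Build the stopword set once; calling stopwords.words() for every word re-reads the list each time.
stop_words = set(stopwords.words('english'))

for sentence in sentences:
    # Keep ASCII letters only. This also strips digits, punctuation, and non-ASCII letters;
    # for multilingual text a pattern such as r'[^\w\s]' is a safer choice.
    review = re.sub('[^a-zA-Z]', ' ', sentence)
    review = review.lower()
    review = review.split()

    # Drop stopwords
    review = [word for word in review if word not in stop_words]

    review = ' '.join(review)
    corpus.append(review)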
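
The effect of the ASCII-only pattern on accented text can be checked directly. The sample string below is made up; the comparison is between the pattern used in the cell above and a Unicode-aware alternative built on `\w` and `\s`.

```python
import re

text = "naïve café, résumé 2024!"

# ASCII-only: every character outside a-z/A-Z becomes a space.
ascii_only = re.sub(r"[^a-zA-Z]", " ", text)
# Unicode-aware: only characters that are neither word characters nor whitespace are replaced.
unicode_aware = re.sub(r"[^\w\s]", " ", text)

print(ascii_only.split())     # ['na', 've', 'caf', 'r', 'sum']
print(unicode_aware.split())  # ['naïve', 'café', 'résumé', '2024']
```

Keep in mind that `\w` also matches digits and underscores, so numbers survive the second pattern.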


In [96]:
corpus

['machine learning ml branch artificial intelligence enables systems learn data make predictions decisions without explicit programming',
 'machine learning',
 'machine learning subset artificial intelligence ai focuses developing algorithms statistical models allow computers perform specific tasks without using explicit instructions',
 'instead rely patterns inference derived data',
 'capability enables machines improve performance time exposed data',
 'types machine learning supervised learning approach models trained labeled data meaning input data paired correct output',
 'model learns predict outcomes based training',
 'common algorithms include linear regression decision trees',
 'unsupervised learning type involves training models data without labeled responses',
 'model tries find patterns groupings data',
 'clustering algorithms k means examples unsupervised learning',
 'reinforcement learning agent learns make decisions taking actions environment maximize cumulative rewards',

In [97]:
cv = CountVectorizer()
x = cv.fit_transform(corpus)

In [99]:
cv.vocabulary_

{'machine': 47,
 'learning': 44,
 'ml': 54,
 'branch': 8,
 'artificial': 6,
 'intelligence': 39,
 'enables': 23,
 'systems': 80,
 'learn': 43,
 'data': 17,
 'make': 49,
 'predictions': 66,
 'decisions': 19,
 'without': 94,
 'explicit': 27,
 'programming': 67,
 'subset': 78,
 'ai': 2,
 'focuses': 30,
 'developing': 21,
 'algorithms': 3,
 'statistical': 77,
 'models': 56,
 'allow': 4,
 'computers': 13,
 'perform': 62,
 'specific': 76,
 'tasks': 82,
 'using': 93,
 'instructions': 38,
 'instead': 37,
 'rely': 70,
 'patterns': 61,
 'inference': 35,
 'derived': 20,
 'capability': 9,
 'machines': 48,
 'improve': 33,
 'performance': 63,
 'time': 83,
 'exposed': 28,
 'types': 89,
 'supervised': 79,
 'approach': 5,
 'trained': 84,
 'labeled': 41,
 'meaning': 51,
 'input': 36,
 'paired': 60,
 'correct': 14,
 'output': 59,
 'model': 55,
 'learns': 45,
 'predict': 65,
 'outcomes': 58,
 'based': 7,
 'training': 85,
 'common': 12,
 'include': 34,
 'linear': 46,
 'regression': 68,
 'decision': 18,
 't

In [100]:
x.toarray()

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 1, 1],
       ...,
       [1, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
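
Most entries are zero because each sentence uses only a few of the vocabulary words. `fit_transform` returns a SciPy sparse matrix for that reason, and `toarray()` converts it to a dense NumPy array for display. A small sketch on a made-up corpus:

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["machine learning", "deep learning"]

sparse_matrix = CountVectorizer().fit_transform(docs)

print(sparse.issparse(sparse_matrix))    # True
print(sparse_matrix.shape)               # (2, 3)
print(sparse_matrix.nnz)                 # 4 stored non-zero entries
print(sparse_matrix.toarray().tolist())  # [[0, 1, 1], [1, 1, 0]]
```

For large corpora, calling `toarray()` can exhaust memory, so it is usually better to pass the sparse matrix to the model directly.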

In [101]:
cv.get_feature_names_out()

array(['actions', 'agent', 'ai', 'algorithms', 'allow', 'approach',
       'artificial', 'based', 'branch', 'capability', 'clustering',
       'combine', 'common', 'computers', 'correct', 'costly',
       'cumulative', 'data', 'decision', 'decisions', 'derived',
       'developing', 'efficiency', 'enables', 'environment', 'especially',
       'examples', 'explicit', 'exposed', 'find', 'focuses', 'game',
       'groupings', 'improve', 'include', 'inference', 'input', 'instead',
       'instructions', 'intelligence', 'involves', 'labeled', 'labeling',
       'learn', 'learning', 'learns', 'linear', 'machine', 'machines',
       'make', 'maximize', 'meaning', 'means', 'methods', 'ml', 'model',
       'models', 'often', 'outcomes', 'output', 'paired', 'patterns',
       'perform', 'performance', 'playing', 'predict', 'predictions',
       'programming', 'regression', 'reinforcement', 'rely', 'responses',
       'rewards', 'robotics', 'self', 'semi', 'specific', 'statistical',
       'subse

In [102]:
print(x.shape)
print(len(cv.get_feature_names_out()))

(14, 95)
95


In [105]:
import pandas as pd

bow_df = pd.DataFrame(
    x.toarray(),
    columns=cv.get_feature_names_out()
)

bow_df

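| | actions | agent | ai | algorithms | allow | approach | artificial | based | branch | capability | ... | training | trees | tries | type | types | unlabeled | unsupervised | used | using | without |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |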
