# Bag of Words
Bag of Words (BoW) is a text representation technique that converts text into numerical vectors so that machine learning models can process it. It involves three steps:
1. Tokenization → Split text into words (tokens).
2. Vocabulary Creation → Create a list of all unique words in the dataset.
3. Vectorization → Convert each text into a vector based on word occurrences.
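
The three steps above can be sketched in plain Python before bringing in any library. This is a minimal illustration, not the notebook's pipeline: the two sample sentences, the whitespace tokenizer, and the helper name `vectorize` are assumptions made for the example.

```python
from collections import Counter

# Two toy documents (hypothetical, chosen only for illustration)
docs = ["the cat sat on the mat", "the dog barked"]

# 1. Tokenization: split each document on whitespace
tokenized = [doc.split() for doc in docs]

# 2. Vocabulary creation: every unique token, sorted for a stable column order
vocab = sorted({token for tokens in tokenized for token in tokens})

# 3. Vectorization: count how often each vocabulary word occurs in a document
def vectorize(tokens, vocab):
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [vectorize(tokens, vocab) for tokens in tokenized]

print(vocab)
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary word, so all documents share the same length regardless of how many words they contain.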


In [2]:
text = '''The cat sat on the mat.
The dog barked at the stranger.
The bird is singing in the tree.
The sun is shining brightly.
The cat and dog are playing together.
I love reading books on artificial intelligence.
The weather is cold and rainy today.
My laptop battery died while working.
The football team won the championship.
The chef is preparing a delicious meal.
She enjoys hiking in the mountains.
The train arrived at the station on time.
Scientists are researching new medical treatments.
The smartphone has a powerful camera.
The students are studying for their exams.
The movie was full of action and suspense.
He listens to music while coding.
The artist painted a beautiful landscape.
The internet speed is very slow today.
The bakery sells fresh bread every morning.
'''
from io import StringIO
text_io = StringIO(text)

In [3]:
import pandas as pd
messages = pd.read_csv(text_io,sep='\t',names=['message'])
messages

Unnamed: 0,message
0,The cat sat on the mat.
1,The dog barked at the stranger.
2,The bird is singing in the tree.
3,The sun is shining brightly.
4,The cat and dog are playing together.
5,I love reading books on artificial intelligence.
6,The weather is cold and rainy today.
7,My laptop battery died while working.
8,The football team won the championship.
9,The chef is preparing a delicious meal.


In [4]:
import re
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [6]:
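corpus = []
# Build the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
for i in range(len(messages)):
    # Replace everything except letters with spaces, then lowercase and tokenize
    review = re.sub('[^a-zA-Z]', ' ', messages['message'][i])
    review = review.lower().split()
    # Drop stop words and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))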


In [7]:
corpus

['cat sat mat',
 'dog bark stranger',
 'bird sing tree',
 'sun shine brightli',
 'cat dog play togeth',
 'love read book artifici intellig',
 'weather cold raini today',
 'laptop batteri die work',
 'footbal team championship',
 'chef prepar delici meal',
 'enjoy hike mountain',
 'train arriv station time',
 'scientist research new medic treatment',
 'smartphon power camera',
 'student studi exam',
 'movi full action suspens',
 'listen music code',
 'artist paint beauti landscap',
 'internet speed slow today',
 'bakeri sell fresh bread everi morn']

In [None]:
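# Create the BoW model
from sklearn.feature_extraction.text import CountVectorizer
# binary=True records presence/absence (1/0) instead of raw counts
# ngram_range=(2, 3) extracts bigrams and trigrams only (no single words)
# max_features=50 keeps the 50 most frequent n-grams across the corpus
cv = CountVectorizer(max_features=50, binary=True, ngram_range=(2, 3))
x = cv.fit_transform(corpus).toarray()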
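
To see what the `binary=True` setting changes, the following sketch compares a count vector with its binary counterpart in plain Python. The sample sentence and the fixed vocabulary are assumptions for illustration; the code mimics the idea of clipping non-zero counts to 1 and does not call scikit-learn.

```python
from collections import Counter

# Hypothetical document and vocabulary, chosen so one word repeats and one is absent
tokens = "the cat chased the other cat".split()
vocab = ["bird", "cat", "chased", "the"]

counts = Counter(tokens)

# Count BoW: how many times each vocabulary word occurs
count_vector = [counts[word] for word in vocab]

# Binary BoW: 1 if the word occurs at all, otherwise 0
binary_vector = [1 if counts[word] > 0 else 0 for word in vocab]

print(count_vector)
print(binary_vector)
```

The binary form discards frequency information, which can be useful when the mere presence of a term matters more than how often it repeats.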


# N-grams
An N-gram is a contiguous sequence of N words from a given text. N-grams capture local word order and context, which a plain (unigram) Bag of Words model ignores.
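
As a rough sketch of how N-grams are formed, the helper below slides a window of size N over a token list. The function name `ngrams` is hypothetical and the input is one of the stemmed sentences from the corpus above; this only illustrates the idea and is not how `CountVectorizer` is implemented internally.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# One preprocessed sentence from the corpus
tokens = "cat dog play togeth".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - N + 1 N-grams, so short sentences contribute few or no higher-order N-grams.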

In [25]:
cv.vocabulary_

{'cat sat': np.int64(20),
 'cat sat mat': np.int64(21),
 'dog bark': np.int64(28),
 'bark stranger': np.int64(8),
 'dog bark stranger': np.int64(29),
 'bird sing': np.int64(12),
 'bird sing tree': np.int64(13),
 'cat dog': np.int64(18),
 'dog play': np.int64(30),
 'cat dog play': np.int64(19),
 'dog play togeth': np.int64(31),
 'love read': np.int64(48),
 'book artifici': np.int64(14),
 'artifici intellig': np.int64(3),
 'love read book': np.int64(49),
 'book artifici intellig': np.int64(15),
 'cold raini': np.int64(24),
 'cold raini today': np.int64(25),
 'laptop batteri': np.int64(44),
 'batteri die': np.int64(9),
 'die work': np.int64(27),
 'laptop batteri die': np.int64(45),
 'batteri die work': np.int64(10),
 'footbal team': np.int64(35),
 'footbal team championship': np.int64(36),
 'chef prepar': np.int64(22),
 'delici meal': np.int64(26),
 'chef prepar delici': np.int64(23),
 'enjoy hike': np.int64(32),
 'hike mountain': np.int64(41),
 'enjoy hike mountain': np.int64(33),
 'arri