**Bag of words (BoW) model in NLP**

In Natural Language Processing (NLP), text data must be converted into numbers before machine learning algorithms can work with it. One common way to do this is the Bag of Words (BoW) model. It turns a piece of text, such as a sentence, paragraph or document, into a collection of words and counts how often each word appears, ignoring word order and grammar.

**Key Components of BoW**

**Vocabulary:** It is a list of all unique words from the entire dataset. Each word in the vocabulary corresponds to a feature in the model.

**Document Representation:** Each document is represented as a vector with one element per vocabulary word. Each element holds the number of times that word occurs in the document, and these counts are the features passed to the model.
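
The two components above can be sketched in plain Python. This is only a minimal illustration: the two toy documents and the helper name `to_vector` are made up for the example and are not part of the notebook's data.

import-free apart from the standard library, the sketch builds the vocabulary first and then one count vector per document:

```python
from collections import Counter

# Two toy documents (hypothetical, not taken from the notebook's data)
docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: sorted list of the unique words across all documents
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc, vocabulary):
    """Return the count of each vocabulary word in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# Document representation: one count vector per document
vectors = [to_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Every vector has the same length as the vocabulary, whatever the length of the document.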

In [1]:
import nltk
import re

para="Beans. I was trying to explain to somebody as we were flying in, that's corn. That's beans. And they were very impressed at my agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction. I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country and we're lucky to have him, your Senator, Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven't seen in a long time and somehow he has not aged and I have. And it's great to see you, Governor. I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. He is somebody who set the path for so much outstanding public service here in Illinois. Now, I want to start by addressing the elephant in the room. I know people are still wondering why I didn't speak at the commencement."

para=nltk.sent_tokenize(para)

para

['Beans.',
 "I was trying to explain to somebody as we were flying in, that's corn.",
 "That's beans.",
 'And they were very impressed at my agricultural knowledge.',
 'Please give it up for Amaury once again for that outstanding introduction.',
 "I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country and we're lucky to have him, your Senator, Dick Durbin is here.",
 "I also noticed, by the way, former Governor Edgar here, who I haven't seen in a long time and somehow he has not aged and I have.",
 "And it's great to see you, Governor.",
 'I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today.',
 'And I am deeply honored at the Paul Douglas Award that is being given to me.',
 'He is somebody who set the path for so much outstanding public service here in Illinois.',
 'Now, I want to start by addressing the elephant in the room.',
 "I know people are s

In [2]:
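for i in range(len(para)):
    para[i]=para[i].lower()               # lowercase the sentence
    para[i]=re.sub(r"\W"," ",para[i])     # replace non-word characters with a space
    para[i]=re.sub(r"\s+"," ",para[i])    # collapse repeated whitespace into one space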

for i,sentence in enumerate(para):
    print(f"Sentence{i+1}:{sentence}")

Sentence1:beans 
Sentence2:i was trying to explain to somebody as we were flying in that s corn 
Sentence3:that s beans 
Sentence4:and they were very impressed at my agricultural knowledge 
Sentence5:please give it up for amaury once again for that outstanding introduction 
Sentence6:i have a bunch of good friends here today including somebody who i served with who is one of the finest senators in the country and we re lucky to have him your senator dick durbin is here 
Sentence7:i also noticed by the way former governor edgar here who i haven t seen in a long time and somehow he has not aged and i have 
Sentence8:and it s great to see you governor 
Sentence9:i want to thank president killeen and everybody at the u of i system for making it possible for me to be here today 
Sentence10:and i am deeply honored at the paul douglas award that is being given to me 
Sentence11:he is somebody who set the path for so much outstanding public service here in illinois 
Sentence12:now i want to st

**Advantages of Bag of Words (BoW)**

Simple and intuitive, and easy to implement in Python

Fixed-size input: every document becomes a vector of the same length (the vocabulary size), which is easy for ML algorithms to consume

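**Disadvantages of Bag of Words (BoW)**

Sparse matrix or array: most entries are zero, and the large number of features can lead to overfitting

Word order is lost, so texts that use the same words in a different order get the same vector

Out of Vocabulary (OOV): words that were not in the vocabulary when it was built are ignored in new text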
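
The last two disadvantages can be seen in a small sketch. The vocabulary, the sentences and the helper `bow_vector` below are hypothetical and chosen only to make the effect visible:

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Count vocabulary words in text; words outside the vocabulary are skipped."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary, assumed to be fixed when the model was built
vocabulary = ["bites", "dog", "man"]

# Word order is lost: opposite meanings, identical vectors
v1 = bow_vector("dog bites man", vocabulary)
v2 = bow_vector("man bites dog", vocabulary)
print(v1, v2, v1 == v2)  # [1, 1, 1] [1, 1, 1] True

# OOV: "cat" is not in the vocabulary, so it leaves no trace in the vector
v3 = bow_vector("cat bites man", vocabulary)
print(v3)  # [1, 0, 1]
```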

# Implementing Bag of Words using a CSV file

In [3]:
import pandas as pd

messages=pd.read_csv(r"C:\Users\aasif\Downloads\Streamlit\spamhamdata.csv",sep="\t",names=["label","message"])

In [4]:
messages

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [5]:
#Data cleaning and preprocessing
import re
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aasif\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()

In [7]:
corpus = []
stop_words = set(stopwords.words("english"))  # build the stop-word set once, not per word

for i in range(len(messages)):
    # Keep letters only, then lowercase and split into tokens
    review = re.sub("[^a-zA-Z]", " ", messages["message"][i])
    review = review.lower().split()
    # Remove stop words and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    # Store exactly one cleaned string per message
    corpus.append(" ".join(review))

In [8]:
corpus

['go jurong point crazi avail bugi n great world la e buffet cine got amor wat',
 'ok lar joke wif u oni',
 'free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli',
 'u dun say earli hor u c alreadi say',
 'nah thin

In [15]:
pip install scikit-learn

In [16]:
#Creating Bag of Words(BoW) model
from sklearn.feature_extraction.text import CountVectorizer
cov=CountVectorizer(max_features=2500,lowercase=True) #for binary BOW enable binary=True

In [17]:
X=cov.fit_transform(corpus).toarray()

In [18]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [19]:
X.shape

(5572, 2500)
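
The comment on the `CountVectorizer` line mentions `binary=True`, which records whether a word is present instead of how many times it occurs. The difference can be sketched in plain Python; the vocabulary, the message and both helper functions here are hypothetical, and the sketch does not reproduce scikit-learn's own tokenization:

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Count BoW: how many times each vocabulary word occurs."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

def binary_vector(text, vocabulary):
    """Binary BoW: 1 if the vocabulary word occurs at all, else 0."""
    present = set(text.split())
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical vocabulary and message, for illustration only
vocabulary = ["call", "free", "now", "win"]
message = "free free win free"

print(count_vector(message, vocabulary))   # [0, 3, 0, 1]
print(binary_vector(message, vocabulary))  # [0, 1, 0, 1]
```

Binary vectors may be a reasonable choice when the presence of a word matters more than its repetition, as can be the case for short messages.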

In [20]:
para="taj mahal is a beautiful monument"

cov=CountVectorizer(max_features=5)
y=cov.fit_transform([para]).toarray()

In [21]:
print(y)
print(cov.get_feature_names_out())


[[1 1 1 1 1]]
['beautiful' 'is' 'mahal' 'monument' 'taj']
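
The sentence has six words but only five features come back, and the missing one is `a`. This is not caused by `max_features=5`: the default `token_pattern` of `CountVectorizer` only matches tokens of two or more word characters, so single-letter words are dropped before counting. A small sketch with the standard `re` module, applying that default pattern by hand (the variable names are made up for the example):

```python
import re

# Default token pattern of scikit-learn's CountVectorizer:
# two or more word characters between word boundaries
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

para = "taj mahal is a beautiful monument"

# Lowercase first, as CountVectorizer does by default, then extract tokens
tokens = re.findall(TOKEN_PATTERN, para.lower())

print(tokens)          # ['taj', 'mahal', 'is', 'beautiful', 'monument']
print(sorted(tokens))  # ['beautiful', 'is', 'mahal', 'monument', 'taj']
```

The sorted list matches the feature names printed above, since the vocabulary is reported in alphabetical order.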


**Note: Find the dataset file in assets**