# <div class='alert alert-warning'> Bag of Words</div>

**In this article, we discuss a Natural Language Processing (NLP) text-modeling technique known as the Bag of Words model. NLP algorithms work on numbers, so we cannot feed raw text into them directly. The Bag of Words model solves this by converting each text into a vector of word counts: it builds a vocabulary from the whole corpus and records how many times each vocabulary word occurs in each document, ignoring word order.**
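
To make the idea concrete before touching the real data, here is a minimal sketch in plain Python. The two sentences are made-up examples (not taken from the dataset), and the variable names are only illustrative.

```python
from collections import Counter

# Two toy documents (hypothetical examples, not from the dataset)
docs = ["the app is great", "the app crashes and the app is slow"]

# Vocabulary: every distinct word across all documents, sorted alphabetically
vocab = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, aligned with the vocabulary order
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a row of numbers with one column per vocabulary word, which is the shape of data an algorithm can work with.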

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('spotify_reviews.csv')

In [3]:
df.head(2)

Unnamed: 0,Time_submitted,Review,Rating,Total_thumbsup,Reply
0,2022-07-09 15:00:00,"Great music service, the audio is high quality...",5,2,
1,2022-07-09 14:21:22,Please ignore previous negative rating. This a...,5,1,


In [4]:
dg=df[['Review']]

In [5]:
dg.shape

(61594, 1)

**We keep only the first 100 reviews for faster processing**

In [6]:
dg=dg.head(100)

In [7]:
dg.shape

(100, 1)

In [8]:
dg.head()

Unnamed: 0,Review
0,"Great music service, the audio is high quality..."
1,Please ignore previous negative rating. This a...
2,"This pop-up ""Get the best Spotify experience o..."
3,Really buggy and terrible to use as of recently
4,Dear Spotify why do I get songs that I didn't ...


In [9]:
dg.isnull().sum()

Review    0
dtype: int64

In [10]:
dg.shape

(100, 1)

In [11]:
len(dg)

100

In [12]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [13]:
ps=PorterStemmer()

In [14]:
c=[]

In [15]:
# nltk.download('stopwords')  # uncomment on the first run to fetch the stop-word list

In [16]:
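stop_words=set(stopwords.words('english'))  # build the stop-word set once, not once per word
for i in range(len(dg)):
    t=re.sub('[^a-zA-Z]',' ',dg['Review'][i])  # replace every non-letter with a space
    t=t.lower()
    t=t.split()
    t=[ps.stem(word) for word in t if word not in stop_words]  # drop stop words, stem the rest
    t=' '.join(t)
    c.append(t)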
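
The cleaning loop can be followed on a single string. The sketch below is a simplified stand-in, not the notebook's exact pipeline: it assumes a deliberately tiny, hypothetical stop-word set in place of NLTK's English list, and it skips stemming so that it runs without any downloads. The helper name `clean` and the sample sentence are illustrative.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the notebook uses NLTK's English list
STOP_WORDS = {"the", "is", "and", "to", "it", "a", "of"}

def clean(text):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)  # digits and punctuation become spaces
    words = letters_only.lower().split()
    return ' '.join(w for w in words if w not in STOP_WORDS)

sample = "Great music service, the audio is HIGH quality!!! 10/10"
cleaned = clean(sample)
print(cleaned)
```

In the notebook, the Porter stemmer additionally shortens words to their stems, which is why the real output shows `servic` and `qualiti` rather than `service` and `quality`.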

In [17]:
c

['great music servic audio high qualiti app easi use also quick friendli support',
 'pleas ignor previou neg rate app super great give five star',
 'pop get best spotifi experi android annoy pleas let get rid',
 'realli buggi terribl use recent',
 'dear spotifi get song put playlist shuffl play',
 'player control sometim disappear reason app restart forget play fix issu',
 'love select lyric provid song listen',
 'still extrem slow chang storag extern sd card convinc done purpos spotifi know issu done noth solv time chang sd card faster read write speed samsung brand pleas add like song never appear search playlist',
 'great app best mp music app ever use one problem play song find song despit app wonder recommend best',
 'delet app follow reason app fail busi model whether stream servic like consum want pay music fulli ad success upon log singl song much close app ad number patient way profit alreadi peak left declin',
 'love spotifi usual app best other state control button disappear

### <font color='green'> Let's create the bag of words now</font>

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

In [19]:
cv=CountVectorizer()

In [20]:
bw=cv.fit(c)

In [21]:
tokens=bw.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

In [22]:
vect=bw.transform(c)

In [23]:
print(tokens)

['abil', 'abl', 'acc', 'accept', 'access', 'account', 'ad', 'add', 'album', 'alexa', 'alot', 'alreadi', 'also', 'alway', 'amaz', 'amazon', 'android', 'annoy', 'anoth', 'anymor', 'anyth', 'anytim', 'anywher', 'app', 'appear', 'appl', 'applic', 'armi', 'artist', 'ask', 'audio', 'auto', 'avail', 'awesom', 'babi', 'back', 'bar', 'bare', 'bc', 'becam', 'becom', 'behaviour', 'best', 'better', 'big', 'birthday', 'bit', 'black', 'blink', 'blue', 'bot', 'bottom', 'brand', 'briefli', 'bring', 'brother', 'brought', 'buggi', 'busi', 'button', 'buzz', 'cant', 'card', 'cast', 'chang', 'check', 'choos', 'christian', 'clear', 'click', 'close', 'collect', 'come', 'commerci', 'compar', 'connect', 'constant', 'constantli', 'consum', 'continu', 'control', 'convinc', 'cool', 'correct', 'cost', 'could', 'crash', 'creat', 'current', 'custom', 'cut', 'data', 'day', 'deal', 'dear', 'declin', 'delet', 'desktop', 'desper', 'despit', 'devic', 'differ', 'directli', 'disappear', 'disappoint', 'disney', 'disturb', '

In [25]:
df_vect=pd.DataFrame(data=vect.toarray(),columns=tokens)

In [27]:
df_vect

Unnamed: 0,abil,abl,acc,accept,access,account,ad,add,album,alexa,...,wouldnt,wow,write,xbox,ye,yeah,year,yo,youth,youtub
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
97,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
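
Once the count matrix exists, a natural next step is to total each column and see which tokens occur most often. The sketch below does this with plain Python lists so it runs without the dataset. The token names and counts are made-up values, not results from the reviews.

```python
# Toy count matrix (hypothetical values): rows are documents, columns follow `tokens`
tokens = ["app", "music", "song"]
matrix = [
    [2, 1, 0],
    [1, 0, 3],
    [1, 0, 2],
]

# Sum each column to get the total occurrences of each token across all documents
totals = {token: sum(column) for token, column in zip(tokens, zip(*matrix))}

# Rank tokens from most to least frequent
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

On the notebook's own `df_vect`, the same ranking should be obtainable with `df_vect.sum().sort_values(ascending=False)`.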


### Hence we got our bag of words

                                          #Happy Learning