# TalentSprint

## Objective

At the end of the experiment, you will be able to:

* understand Bag of words
* apply Bag of words to any text dataset to extract meaningful insights from it
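
As a minimal illustration of the idea, a bag of words represents each text as a vector of word counts over a shared vocabulary, ignoring word order. The two sentences and the helper name `to_counts` below are made up for this sketch and are not part of the dataset or the notebook:

```python
from collections import Counter

# Two made-up sentences, used only for illustration
sentences = ["the fries were cold", "the service was slow and the fries were cold"]

# Shared vocabulary: every distinct word, sorted so the column order is stable
vocab = sorted({word for s in sentences for word in s.split()})

def to_counts(sentence, vocab):
    # Count each word, then read the counts off in vocabulary order
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

vectors = [to_counts(s, vocab) for s in sentences]
print(vocab)
for v in vectors:
    print(v)
```

Each position in a vector corresponds to one vocabulary word, so the second sentence has a count of 2 in the column for "the".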

## Dataset

The dataset chosen for this experiment is the McDonald's review dataset. It contains 1525 samples. The raw file has 10 columns, of which two (`city` and `review`) are kept for this experiment.

**Importing Necessary Libraries**

In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import string
from sklearn.feature_extraction.text import CountVectorizer

In [34]:
data = pd.read_csv("McDonalds-Yelp-Sentiment-DFE.csv", encoding='latin')

In [35]:
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   _unit_id                      1525 non-null   int64  
 1   _golden                       1525 non-null   bool   
 2   _unit_state                   1525 non-null   object 
 3   _trusted_judgments            1525 non-null   int64  
 4   _last_judgment_at             1525 non-null   object 
 5   policies_violated             1471 non-null   object 
 6   policies_violated:confidence  1471 non-null   object 
 7   city                          1438 non-null   object 
 8   policies_violated_gold        0 non-null      float64
 9   review                        1525 non-null   object 
dtypes: bool(1), float64(1), int64(2), object(6)
memory usage: 108.8+ KB


**Removing unwanted columns**

In [36]:
data = data[['city', 'review']]

In [37]:
data.columns

Index(['city', 'review'], dtype='object')

In [38]:
data.shape

(1525, 2)

In [11]:
data.head()

|   | city | review |
|---|------|--------|
| 0 | Atlanta | I'm not a huge mcds lover, but I've been to be... |
| 1 | Atlanta | Terrible customer service. Î¾I came in at 9:30... |
| 2 | Atlanta | First they "lost" my order, actually they gave... |
| 3 | Atlanta | I see I'm not the only one giving 1 star. Only... |
| 4 | Atlanta | Well, it's McDonald's, so you know what the fo... |


In [39]:
reviews = data["review"]

In [41]:
reviews[1]

'Terrible customer service. Î¾I came in at 9:30pm and stood in front of the register and no one bothered to say anything or help me for 5 minutes. Î¾There was no one else waiting for their food inside either, just outside at the window. Î¾ I left and went to Chickfila next door and was greeted before I was all the way inside. This McDonalds is also dirty, the floor was covered with dropped food. Obviously filled with surly and unhappy workers.'

**Pre-processing**

In [42]:
# Collapse runs of repeated characters
def replaceTwoOrMore(s):
    # Match a character repeated two or more times (e.g. "soooo") and
    # shorten the run to exactly two occurrences ("soo")
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)

In [43]:
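# Normalise a single review string
def processReview(review):
    # Convert to lower case
    review = review.lower()
    # Remove ASCII punctuation (this also removes '#', quotes, '.' and ',')
    review = review.translate(str.maketrans('', '', string.punctuation))
    # Remove digits
    review = re.sub(r'[0-9]', '', review)
    # Collapse runs of whitespace into a single space
    review = re.sub(r'\s+', ' ', review)
    # Replace #word with word (no effect here: '#' was already removed above)
    review = re.sub(r'#([^\s]+)', r'\1', review)
    # Trim quotes, periods and commas from both ends (also already removed above)
    review = review.strip('\'"')
    review = review.strip('.,')
    # Shorten runs of repeated characters to two
    review = replaceTwoOrMore(review)
    return review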

In [44]:
processedReviews = [processReview(review) for review in reviews]

In [47]:
print(processedReviews[1])

terrible customer service î¾i came in at pm and stood in front of the register and no one bothered to say anything or help me for minutes î¾there was no one else waiting for their food inside either just outside at the window î¾ i left and went to chickfila next door and was greeted before i was all the way inside this mcdonalds is also dirty the floor was covered with dropped food obviously filled with surly and unhappy workers
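
To see what the repeated-character step does in isolation, here is a small sketch on made-up strings (the inputs and the name `collapse_repeats` are hypothetical, not taken from the notebook). Every run of two or more identical characters is shortened to exactly two, so ordinary double letters are preserved:

```python
import re

def collapse_repeats(s):
    # Same pattern as replaceTwoOrMore: shorten any run of 2+ identical
    # characters to exactly two
    return re.sub(r"(.)\1{1,}", r"\1\1", s, flags=re.DOTALL)

examples = ["soooo goooood", "coffee", "yummmmy!!!!"]
results = [collapse_repeats(e) for e in examples]
for before, after in zip(examples, results):
    print(repr(before), "->", repr(after))
```

Words such as "coffee" pass through unchanged, while elongated spellings are reduced to a shorter, more consistent form.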


**BoW using sklearn**

In [48]:
vectorizer = CountVectorizer()
featurevector = vectorizer.fit_transform(processedReviews).todense()

In [57]:
print(len(vectorizer.vocabulary_))

10041


In [54]:
featurevector[0]

matrix([[0, 0, 0, ..., 0, 0, 0]])

**Bag of words without using sklearn**

In [55]:
# Building vocabulary

vocabulary = set()
for review in processedReviews:
  for word in review.split():
    vocabulary.add(word)

In [59]:
len(vocabulary)

10064
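
The manual vocabulary (10064 words) is slightly larger than the one built by `CountVectorizer` (10041). A likely contributor is tokenization: by default `CountVectorizer` uses the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters, whereas `str.split()` also keeps single-character words such as "i" and "a". The sketch below compares the two approaches on a made-up sentence, applying that pattern with `re` directly rather than calling sklearn. It is an illustration, not a full account of the difference of 23 words.

```python
import re

# Default token pattern of sklearn's CountVectorizer
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "i had a burger and a coke"  # made-up example

# Whitespace splitting keeps every token, including single characters
split_vocab = set(text.split())
# The regex keeps only tokens with at least two word characters
regex_vocab = set(TOKEN_PATTERN.findall(text))

print(sorted(split_vocab))
print(sorted(regex_vocab))
print(sorted(split_vocab - regex_vocab))
```

The last line prints the words that only the whitespace-based vocabulary contains.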

In [60]:
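def bagofwords(sentence, words):
    # Sort the vocabulary so each word maps to a stable column index
    index = {word: i for i, word in enumerate(sorted(words))}
    # frequency word count
    bag = np.zeros(len(index))
    # Split into words; iterating over the raw string would yield characters
    for sw in sentence.split():
        if sw in index:
            bag[index[sw]] += 1

    return bag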

In [65]:
print(bagofwords(processedReviews[1500], vocabulary))

[0. 0. 0. ... 0. 0. 0.]
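
Because the printed vector is truncated and mostly zeros, it can help to list only the non-zero entries alongside their words. A minimal sketch, assuming a small made-up vocabulary and two hypothetical helpers, `bag_of_words` and `nonzero_counts` (neither is part of the notebook):

```python
def bag_of_words(sentence, vocab):
    # vocab is a list, so each word has a fixed position
    index = {word: i for i, word in enumerate(vocab)}
    bag = [0] * len(vocab)
    for token in sentence.split():
        if token in index:
            bag[index[token]] += 1
    return bag

def nonzero_counts(bag, vocab):
    # Pair each non-zero count with its word
    return {word: count for word, count in zip(vocab, bag) if count > 0}

# Made-up vocabulary and sentence, for illustration only
vocab = sorted({"burger", "cold", "fries", "service", "slow", "the", "was"})
bag = bag_of_words("the burger was cold and the service was slow", vocab)
print(bag)
print(nonzero_counts(bag, vocab))
```

Words outside the vocabulary ("and" in this example) are simply ignored, which mirrors how a fitted vocabulary treats unseen words.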
