Scrape at least 400 full reviews and ratings from Yelp for a restaurant that has mixed reviews. (10 pts.)

Clean and pre-process the data. (5 pts.)

Develop a Word2Vec model using the reviews. (5 pts.)

Identify words that are most similar to 3 items on the restaurant menu using the Word2Vec model. Comment on whether the similar words make sense. (5 pts.)

In [1]:
import pandas as pd

df=pd.read_csv('/content/panchos_burritos.csv')

In [2]:
df

Unnamed: 0,name,rating,location,date,body,sentiment
0,Sarah P.,3,"Montclair, NJ","Mar 7, 2023",Let's start with the good - the ambiance and s...,Negative
1,Joseph F.,4,"New Milford, NJ","Dec 23, 2023",Pancho's food and service are very good BUT ri...,Positive
2,Dena H.,3,"Englewood, NJ","Jan 26, 2023",I really loved the vibes of this place and the...,Negative
3,Ellie O.,4,"Westwood, NJ","Jan 12, 2024",We have way too much fun when we come here! Fo...,Positive
4,Lenny G.,3,"Verona, NJ","Jan 14, 2023",The food is pretty mediocre and is way overpri...,Negative
...,...,...,...,...,...,...
633,Janis H.,2,"New Milford, NJ","Jun 16, 2010","This place is great for drinks, guac and chips...",Negative
634,Rita S.,5,"Manhattan, NY","Mar 4, 2012",The bean and cheese burritos are the best I've...,Positive
635,C B.,5,"Wood-Ridge, NJ","Sep 19, 2013",my coworker brought me here one afternoon. it ...,Positive
636,Michelle S.,3,"New Milford, NJ","Jan 31, 2014","ive had better.....food is pretty good, staff ...",Negative


In [44]:
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
from gensim.models import Word2Vec

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

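# Build these once rather than on every call to clean_text.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Lowercase and strip punctuation before tokenizing.
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    tokens = word_tokenize(text)
    # Drop English stop words, then lemmatize what remains.
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Word2Vec expects each review as a list of tokens, not a joined string.
    return tokens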

cleaned_reviews = df['body'].apply(clean_text).tolist()

w2v_model = Word2Vec(sentences=cleaned_reviews, vector_size=100, window=5, min_count=1, workers=4)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [39]:
print(cleaned_reviews)



In [45]:
w2v_model.wv.most_similar(positive=['burrito'], topn=5)

[('got', 0.9998496770858765),
 ('chicken', 0.9998348355293274),
 ('ordered', 0.999828040599823),
 ('taco', 0.9998267889022827),
 ('food', 0.999812126159668)]

In [46]:
w2v_model.wv.most_similar(positive=['margarita'], topn=5)

[('would', 0.9997919797897339),
 ('drink', 0.999788224697113),
 ('like', 0.999782919883728),
 ('im', 0.9997766017913818),
 ('go', 0.9997730255126953)]

In [47]:
w2v_model.wv.most_similar(positive=['guacamole'], topn=5)

[('food', 0.9997414350509644),
 ('burrito', 0.9997348785400391),
 ('got', 0.9997231364250183),
 ('drink', 0.9997216463088989),
 ('taco', 0.9997212290763855)]

The similarity scores were all extremely high (above 0.999). The similar words for 'burrito' seemed accurate, but the words for the other items looked random. For 'margarita' in particular, only 'drink' seemed fitting.
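
For context on these scores: `most_similar` ranks words by the cosine similarity between their vectors. The sketch below computes that measure by hand on made-up 3-dimensional vectors; the vectors and the `cosine_similarity` helper are illustrative assumptions, not values from the trained model. When nearly every pair scores close to 1.0, as in the outputs above, one plausible reading is that the vectors all point in almost the same direction, which can happen when a small corpus receives little training.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only (not from the Word2Vec model).
burrito = [1.0, 2.0, 3.0]
taco = [1.1, 2.0, 2.9]          # almost the same direction as burrito
unrelated = [3.0, -1.0, -0.5]   # a very different direction

print(cosine_similarity(burrito, taco))       # close to 1
print(cosine_similarity(burrito, unrelated))  # negative
```

A score only separates words well when some pairs land far from 1.0, so the spread of scores is as informative as the ranking.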

I adjusted the parameters because the similar words seemed generic and the scores were too high. The scores remained high, but the words seemed much more accurate.

In [56]:
w2v_model = Word2Vec(sentences=cleaned_reviews, vector_size=150, window=10, min_count=5, workers=4, epochs=10)
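
One effect of raising `min_count` from 1 to 5 in the cell above is that words appearing fewer than five times in the corpus are left out of the vocabulary. The sketch below imitates that filter with `collections.Counter` on a tiny made-up corpus. `toy_reviews` and `filter_rare_tokens` are hypothetical names, and this is only an approximation of the idea, not gensim's internal code.

```python
from collections import Counter

def filter_rare_tokens(reviews, min_count):
    """Keep only tokens whose corpus-wide count is at least min_count."""
    counts = Counter(token for review in reviews for token in review)
    return [[t for t in review if counts[t] >= min_count] for review in reviews]

# Made-up token lists standing in for cleaned_reviews.
toy_reviews = [
    ["burrito", "great", "salsa"],
    ["burrito", "cold", "great"],
    ["burrito", "margarita", "great"],
]

# With a threshold of 3, only tokens seen in all three reviews survive.
filtered = filter_rare_tokens(toy_reviews, min_count=3)
print(filtered)
```

Rare words get few training updates, so removing them may be part of why the neighbors look less random after the change.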


In [57]:
w2v_model.wv.most_similar(positive=['burrito'], topn=5)

[('steak', 0.9991416931152344),
 ('chicken', 0.9990193247795105),
 ('taco', 0.9985199570655823),
 ('cheese', 0.9982560873031616),
 ('rice', 0.9975676536560059)]

In [58]:
w2v_model.wv.most_similar(positive=['margarita'], topn=5)

[('super', 0.9990503787994385),
 ('amazing', 0.9982589483261108),
 ('frozen', 0.9982538819313049),
 ('bulldog', 0.9981478452682495),
 ('love', 0.9975672960281372)]

In [59]:
w2v_model.wv.most_similar(positive=['guacamole'], topn=5)

[('bowl', 0.9991681575775146),
 ('cream', 0.9989661574363708),
 ('sour', 0.9989660978317261),
 ('grilled', 0.9989659786224365),
 ('sauce', 0.9989293217658997)]

These results were much stronger and fit the menu items better. 'burrito' returned fillings (steak, chicken, cheese, rice) along with 'taco', which is a similar dish. 'margarita' returned fitting descriptions, including 'frozen' and 'bulldog', which match the restaurant's frozen bulldog margarita menu item. Finally, 'guacamole' fit its role as a dip much better than it did before the parameter adjustment: it leaned toward 'sauce' and 'sour' 'cream', which makes sense.
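
A rough way to cross-check whether neighbors such as 'frozen' or 'bulldog' make sense is to count which tokens actually appear near the target word in the reviews. The sketch below does this with a symmetric window over made-up token lists. `window_neighbors` and `toy_reviews` are hypothetical, and raw co-occurrence counts are only a sanity check, not what Word2Vec itself computes.

```python
from collections import Counter

def window_neighbors(reviews, target, window=2):
    """Count tokens within `window` positions of each occurrence of target."""
    counts = Counter()
    for tokens in reviews:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Count every token in the window except the target position itself.
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts

# Made-up token lists standing in for cleaned_reviews.
toy_reviews = [
    ["frozen", "bulldog", "margarita", "amazing"],
    ["love", "frozen", "margarita"],
    ["burrito", "steak", "rice"],
]

neighbors = window_neighbors(toy_reviews, "margarita", window=2)
print(neighbors.most_common(3))
```

Running the same function on the real `cleaned_reviews` with `window=10` would mirror the window size used in the second model.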