## Yelp Restaurant Clustering

July 2017

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [2]:
df = pd.read_csv('last_2_years_restaurant_reviews.csv')

In [3]:
df.head(3)

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id,name,categories,avg_stars
0,--9e1ONYQuAa-CB_Rrw7Tw,1,2016-05-17,0,0Qc1THNHSapDL7cv-ZzW5g,5,What can I say.. Wowzers! Probably one of the ...,0,4LxKRRIikhr65GfPDW626w,Delmonico Steakhouse,"['Steakhouses', 'Cajun/Creole', 'Restaurants']",4.0
1,--9e1ONYQuAa-CB_Rrw7Tw,0,2017-01-20,0,L8lo5SKXfZRlbn1bpPiC9w,5,Went here for guys weekend. Unbelievable. Ravi...,0,nT8zgjoc-PbdBoQsFEXFLw,Delmonico Steakhouse,"['Steakhouses', 'Cajun/Creole', 'Restaurants']",4.0
2,--9e1ONYQuAa-CB_Rrw7Tw,52,2016-09-25,30,6eUT3IwwWPP3CZkAhxqOIw,5,"One word my friends: tableside!!! Yes, tablesi...",56,7RlyCglsIzhBn081inwvcg,Delmonico Steakhouse,"['Steakhouses', 'Cajun/Creole', 'Restaurants']",4.0


### 1. Cluster the review text data for all the restaurants

In [4]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents = df['text'].values

#### Define 'favorable' reviews as the reviews with 4 or more stars

In [5]:
# "4 or more stars" means >= 4; the original "> 4" kept only 5-star reviews
df['favorable'] = df['stars'] >= 4
target = df['favorable'].values

#### Train Test Split

In [6]:
# sklearn.cross_validation was deprecated in 0.18 and later removed; model_selection replaces it
from sklearn.model_selection import train_test_split



In [7]:
documents_train, documents_test, target_train, target_test = train_test_split(
    documents, target, test_size = 0.9, random_state = 16)
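
Note that `test_size = 0.9` holds out 90% of the rows, so only about 10% of the reviews are used to fit the vectorizer and the clustering model. A minimal sketch on made-up data (`toy_docs` and `toy_target` are hypothetical stand-ins, not the Yelp data) shows the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the review texts and the favorable flag
toy_docs = np.array(["review %d" % i for i in range(100)])
toy_target = np.arange(100) % 2 == 0

# Same call pattern as the notebook: 90% of rows go to the test side
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_docs, toy_target, test_size=0.9, random_state=16)

print(len(toy_train), len(toy_test))
```

Fitting on a small sample keeps the dense TF-IDF matrix and the K-Means run manageable, at the cost of a vocabulary learned from fewer reviews.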

#### Get NLP representation of the documents

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words = 'english', max_features = 1000)
vectors_train = vectorizer.fit_transform(documents_train).toarray()

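# Get the vocabulary (get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from 1.0 onward)
words = vectorizer.get_feature_names_out()
# Use the fitted vectorizer to transform all the reviews
vectors_documents = vectorizer.transform(documents).toarray()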
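
To make the TF-IDF step concrete, here is a small sketch on three invented reviews (`toy_reviews` is hypothetical). It uses the same vectorizer settings as above and reads the vocabulary from the `vocabulary_` attribute, which maps each term to its column index:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented reviews, only for illustration
toy_reviews = [
    "great pizza and great service",
    "the pizza crust was cold",
    "friendly staff and amazing burger",
]

toy_vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)  # sparse matrix

# Terms ordered by their column index in the matrix
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_matrix.shape)  # (number of reviews, number of kept terms)
print(toy_vocab)
```

`fit_transform` returns a sparse matrix; the notebook calls `.toarray()` to make it dense, which is convenient but can use a lot of memory on the full review set.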

#### Cluster reviews with KMeans

In [9]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 5, init='k-means++', max_iter = 300)
kmeans.fit(vectors_train)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
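
The output above shows `random_state=None`, so the cluster numbering (and possibly the clusters themselves) can differ from run to run. As a hedged sketch on synthetic points (`toy_points` is made up, not the review vectors), fixing the seed makes two fits agree:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs, only for illustration
rng = np.random.RandomState(0)
toy_points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

# Same seed, same data: the two fits produce identical labels
first = KMeans(n_clusters=2, n_init=10, random_state=16).fit(toy_points)
second = KMeans(n_clusters=2, n_init=10, random_state=16).fit(toy_points)

print(np.array_equal(first.labels_, second.labels_))
```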

#### Inspect the top 10 features for each cluster.

In [10]:
# Indices of the 10 largest centroid weights per cluster, largest first
n = 10
top_centroids = kmeans.cluster_centers_.argsort()[:, ::-1][:, :n]

In [13]:
print("Top Features for Each Cluster: ")
for idx, ctrd in enumerate(top_centroids):
    print("Centroid #%d: %s" % (idx+1, ', '.join(words[x] for x in ctrd)))

Top Features for Each Cluster: 
Centroid #1: great, food, service, place, good, amazing, staff, friendly, love, awesome
Centroid #2: good, chicken, food, really, like, ordered, burger, place, just, fries
Centroid #3: food, order, service, time, just, minutes, didn, like, came, got
Centroid #4: place, food, best, vegas, amazing, love, delicious, service, friendly, good
Centroid #5: pizza, crust, place, good, great, slice, cheese, order, just, service
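
The top terms come from sorting each centroid's weights and keeping the largest ones. A small sketch with a made-up vocabulary and made-up centroid weights (`toy_words`, `toy_centers`) shows the indexing:

```python
import numpy as np

# Hypothetical vocabulary and centroid weights, one row per cluster
toy_words = ["burger", "crust", "fries", "pizza", "service"]
toy_centers = np.array([
    [0.05, 0.40, 0.01, 0.50, 0.10],
    [0.45, 0.02, 0.35, 0.03, 0.20],
])

top_n = 3
# argsort is ascending, so reverse each row and keep the first top_n columns
top_idx = np.argsort(toy_centers, axis=1)[:, ::-1][:, :top_n]

# Equivalent to the single negative-step slice -1:-(top_n + 1):-1
assert np.array_equal(top_idx, np.argsort(toy_centers)[:, -1:-(top_n + 1):-1])

for cluster_id, idx in enumerate(top_idx):
    print("Centroid #%d: %s" % (cluster_id + 1, ", ".join(toy_words[i] for i in idx)))
```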


### 2. Cluster restaurants from their category information

#### Take the category text and flag 5-star reviews as the target

In [14]:
docu_category = df['categories'].values
target = (df['stars'] == 5).astype(int).values
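
Each `categories` value is a string that looks like a Python list, and it is passed to the vectorizer as raw text. The sketch below uses the sample value from `df.head(3)` to show how it could be parsed with `ast.literal_eval`, and how `TfidfVectorizer`'s default token pattern, `(?u)\b\w\w+\b`, splits the lowercased string into words:

```python
import ast
import re

# Sample value copied from the df.head(3) output above
raw = "['Steakhouses', 'Cajun/Creole', 'Restaurants']"

# Parse the string into a real list of category names
parsed = ast.literal_eval(raw)
print(parsed)

# Approximate the vectorizer's default tokenization: lowercase, then
# keep runs of two or more word characters
tokens = re.findall(r"(?u)\b\w\w+\b", raw.lower())
print(tokens)
```

Because punctuation acts as a separator, a multi-word category is broken into individual terms, which is presumably why words such as "new" and "traditional" show up on their own in the centroids.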

#### Train Test Split

In [15]:
X_train, X_test, y_train, y_test = train_test_split(docu_category, target, test_size = 0.99, random_state = 30)

#### NLP Representation

In [16]:
vectorizer_ctry    = TfidfVectorizer(stop_words='english', max_features = 1000)
vectors_ctry_train = vectorizer_ctry.fit_transform(X_train).toarray()
words_ctry         = vectorizer_ctry.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
vectors_ctry_test  = vectorizer_ctry.transform(X_test).toarray()
vectors_ctry_docu  = vectorizer_ctry.transform(docu_category).toarray()

#### Use K-Means to perform the clustering

In [17]:
kmeans_ctry = KMeans(n_clusters=10, random_state=16)
kmeans_ctry.fit(vectors_ctry_train)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=16, tol=0.0001, verbose=0)

In [18]:
predict_ctry = kmeans_ctry.predict(vectors_ctry_docu)
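
`predict_ctry` holds one label per review row, so a restaurant with many reviews is counted many times. One possible way to get per-restaurant cluster sizes is to keep a single row per `business_id` first. This is a sketch on an invented frame (`toy`, with a hypothetical `cluster` column standing in for `predict_ctry`):

```python
import pandas as pd

# Invented review-level rows; "cluster" stands in for predict_ctry
toy = pd.DataFrame({
    "business_id": ["a", "a", "a", "b", "c", "c"],
    "name": ["Steak Place", "Steak Place", "Steak Place",
             "Tea Shop", "Grill House", "Grill House"],
    "cluster": [0, 0, 0, 4, 0, 0],
})

# One row per restaurant, then count restaurants in each cluster
per_restaurant = toy.drop_duplicates("business_id")
counts = per_restaurant["cluster"].value_counts().sort_index()
print(counts)
```

Since every review of a restaurant carries the same category string, all of its rows receive the same label and dropping duplicates loses no information.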

#### Check the top 6 features for clustering restaurants using category info

In [19]:
def print_features(cluster, n, vocab):
    for idx, centroid in enumerate(cluster):
        print("No.%d Centroid contains these words:" % (idx+1))
        print("     %s" % ", ".join(vocab[i] for i in centroid.argsort()[::-1][:n]))

print_features(kmeans_ctry.cluster_centers_, 6, words_ctry)

No.1 Centroid contains these words:
     steakhouses, restaurants, american, seafood, new, traditional
No.2 Centroid contains these words:
     breakfast, brunch, american, traditional, restaurants, sandwiches
No.3 Centroid contains these words:
     mexican, restaurants, nightlife, bars, fast, tacos
No.4 Centroid contains these words:
     american, traditional, burgers, new, restaurants, southern
No.5 Centroid contains these words:
     pizza, italian, restaurants, salad, food, seafood
No.6 Centroid contains these words:
     asian, fusion, korean, barbeque, restaurants, japanese
No.7 Centroid contains these words:
     sushi, japanese, bars, restaurants, ramen, noodles
No.8 Centroid contains these words:
     bars, nightlife, wine, american, restaurants, cocktail
No.9 Centroid contains these words:
     tea, coffee, food, cafes, desserts, juice
No.10 Centroid contains these words:
     food, restaurants, chinese, fast, seafood, thai
