# Clustering Analysis

- According to the previous analysis, the rating score is not entirely reliable. For example, reviews in the 4-star and 5-star cases are very similar, and the same holds for the 1-star and 2-star cases. We can therefore infer that a user who rates a restaurant 4 stars or 5 stars really recommends it in both cases.
- In this part, I replace the existing rating system with a two-degree system and a three-degree system in turn, to see which clustering performs better. In the two-degree system, every review maps to one of 2 degrees, like and dislike; in the three-degree system, every review maps to one of 3 degrees, most like, average like and dislike. The clustering method used here is K-means. It turns out that three clusters perform better than two.
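
For reference, the two target scales can be pictured as a direct mapping from stars. The sketch below is only an illustration: the cut-offs (in particular where 3 stars falls) and the helper names `to_two_degree` and `to_three_degree` are assumptions, and the notebook itself derives the degrees from K-means clusters of the review text, not from the stars.

```python
import pandas as pd

def to_two_degree(stars: int) -> str:
    # Assumed cut-off: 4-5 stars -> like, 1-3 stars -> dislike
    return "like" if stars >= 4 else "dislike"

def to_three_degree(stars: int) -> str:
    # Assumed cut-offs: 4-5 -> most like, 3 -> average like, 1-2 -> dislike
    if stars >= 4:
        return "most like"
    if stars == 3:
        return "average like"
    return "dislike"

# Toy ratings, not taken from the dataset
toy = pd.DataFrame({"stars": [1, 2, 3, 4, 5]})
toy["two_degree"] = toy["stars"].map(to_two_degree)
toy["three_degree"] = toy["stars"].map(to_three_degree)
print(toy)
```
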


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [2]:
df = pd.read_csv('2017_restaurant_reviews.csv')
df.head(3)

| Unnamed: 0 | business_id | name | categories | avg_stars | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | --1UhMGODdWsrMastO9DZw | The Spicy Amigos | Restaurants, Mexican | 4.0 | 0 | 2017-12-12 | 0 | e1HiHHD7CzY5NKZG7hvhTw | 5 | Absolutely delicious! And great service as wel... | 0 | Sew1Nht6Q0sGTIZeNvRfLw |
| 3 | --1UhMGODdWsrMastO9DZw | The Spicy Amigos | Restaurants, Mexican | 4.0 | 0 | 2017-08-09 | 0 | oKm8UTv-QSC0oCbniqwxjg | 4 | Tasty, authentic Mexican street food that give... | 0 | NoQCmYKyMPs4D01Wa6dZew |
| 5 | --1UhMGODdWsrMastO9DZw | The Spicy Amigos | Restaurants, Mexican | 4.0 | 1 | 2017-05-10 | 0 | 9CLEOpUCqRkIR02sx-JsMQ | 5 | A little hole in the wall for some really deli... | 0 | atyCaAjUYatIFDOGKy00SA |


# Cluster the review text data for all the restaurants

#### 1. Create a new variable to contain all reviews

In [3]:
Reviews = df['text']

#### 2. Create training dataset and test dataset in 7:3

In [4]:
from sklearn.model_selection import train_test_split
Reviews_train, Reviews_test= train_test_split(Reviews, test_size=0.3)
# The clustering algorithm is slow, so I will take samples to find the optimal number of clusters

#### 3. NLP

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize a TfidfVectorizer object
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True, max_features = 5000
                            )
# Fit the vectorizer on the training data and transform it into TF-IDF vectors;
# keep only the first 300 reviews, because the clustering is slow
Reviews_train_vec = vectorizer.fit_transform(Reviews_train).toarray()[:300]
# Get the word behind each feature column
# (scikit-learn >= 1.0 provides get_feature_names_out() for this instead)
words = vectorizer.get_feature_names()
# Transform all the reviews and the test reviews
Reviews_vec = vectorizer.transform(Reviews).toarray()
Reviews_test_vec = vectorizer.transform(Reviews_test).toarray()

#### 4. Cluster reviews with KMeans (two clusters) in training data

In [7]:
from sklearn.cluster import KMeans
km_clf = KMeans(verbose = 0,n_clusters = 2)
km_clf.fit(Reviews_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

##### Use this model to make predictions on training data, test data, all data respectively

In [8]:
cluster_train = km_clf.predict(Reviews_train_vec)
cluster_test = km_clf.predict(Reviews_test_vec)
cluster = km_clf.predict(Reviews_vec)

#### 5. Inspect the centroids
A useful property of the K-means algorithm is that it exposes the centroid of each cluster. Inspecting the centroids reveals what each cluster is about. Here, each centroid is a vector of TF-IDF weights over the vocabulary, so it can be read as the "average review" of its cluster.

In [9]:
km_clf.cluster_centers_

array([[1.08420217e-19, 0.00000000e+00, 5.79728622e-03, ...,
        1.75513006e-03, 0.00000000e+00, 5.96311195e-19],
       [1.44271932e-03, 0.00000000e+00, 1.37713646e-02, ...,
        5.76241133e-04, 0.00000000e+00, 4.77990782e-04]])
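
The "average review" reading of a centroid can be checked on toy data. The sketch below is an illustration under assumptions (two well-separated synthetic blobs, hypothetical `toy_*` names): it compares each fitted centroid with the mean of the points assigned to that cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two well-separated synthetic blobs standing in for review vectors
toy_X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 3)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 3)),
])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_X)

for k in range(toy_km.n_clusters):
    # Mean of all points that K-means assigned to cluster k
    member_mean = toy_X[toy_km.labels_ == k].mean(axis=0)
    print(k, np.allclose(member_mean, toy_km.cluster_centers_[k], atol=1e-6))
```
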

#### 6. Map the centroids to words and present the top 20 features for each cluster.
To make sense of these numbers, map each centroid vector back into "word space". The words with the largest weights are the most representative of a cluster, so print the top 20 keywords for each one.

In [10]:
cluster_top_features = list()
# Sort each centroid vector to find the top 20 features
for i in range(km_clf.n_clusters):
    cluster_top_features.append(np.argsort(km_clf.cluster_centers_[i])[::-1][:20])
# Map vector back to words and print out
for num, centroid in enumerate(cluster_top_features):
    print ('Cluster',num,':',(", ".join(words[i] for i in centroid)))

Cluster 0 : food, great, best, sushi, recommend, fresh, place, good, highly, service, love, worth, coffee, probably, excellent, time, delicious, nice, amazing, really
Cluster 1 : good, food, service, place, just, great, like, don, chicken, delicious, really, came, restaurant, try, table, meal, definitely, got, sauce, order


Both clusters appear to contain positive reviews, which is strange. Next, draw a random sample from each of the 2 clusters, with its rating and review text, to get a sense of what each cluster holds.

In [11]:
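for i in range(km_clf.n_clusters):
    sub_cluster = np.arange(0, cluster_train.shape[0])[cluster_train == i]
    sample = np.random.choice(sub_cluster, 1)
    # Row k of Reviews_train_vec is the k-th review of the shuffled training
    # split, so map it back to its label in df before looking anything up
    row = Reviews_train.index[sample[0]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].loc[row])
    print("The review is:\n%s.\n" % df['text'].loc[row])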

The cluster is 1.
The star is: 1 stars.
The review is:
So disappointed in my experience today. I've been to Romados several times before when the food was decent with generous portions but today was not their day. First off, if you've bought a Groupon, they don't tell you to first head to the cash to validate it. So we waited in line to order food only to then be told to go line up at the cash before coming back to order our food. Then we go back to the guy prepping the food with our validated Groupon who's fulfilling our order and he hands us a bag of fries. We're hungry, we've been waiting for a while so we start eating them until he says "no, you don't get fries, only poutine" and takes it back only to throw the rest in with the large batch of fries for other customers. How DISGUSTING and unhygienic. Finally (to put the cherry on top), we decided at the last minute to take a small bottle of coke instead of the 1.5L water bottle included in our order. The old woman behind the cash li

In [13]:
from scipy.stats import pearsonr

clu_num = []    # cluster label (1-based) of each sampled review
n_reviews = []  # number of reviews of the restaurant that review belongs to
for j in range(1000):
    for i in range(km_clf.n_clusters):
        sub_cluster = np.arange(0, cluster.shape[0])[cluster == i]
        sample = np.random.choice(sub_cluster, 1)
        clu_num.append(i+1)
        res = df['business_id'].iloc[sample[0]]
        n_reviews.append((df['business_id'] == res).sum())
print('The correlation coefficient between cluster label and number of reviews per restaurant is %f.' \
      % (pearsonr(clu_num, n_reviews)[0]))

The correlation coefficient between cluster label and number of reviews per restaurant is 0.323903.


As expected, the two-cluster model performs poorly, even on the training data.
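
One way to put a number on cluster quality, instead of relying on samples alone, is the silhouette score. This is not part of the original analysis; the sketch below uses synthetic blobs (an assumption) to show how candidate values of `n_clusters` could be compared.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
# Three synthetic blobs; real TF-IDF vectors will be far less clean
toy_X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)
])

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_X)
    # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters
    scores[k] = silhouette_score(toy_X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best number of clusters on the toy data:", best_k)
```

In this notebook the same call would take the form `silhouette_score(Reviews_train_vec, cluster_train)`. Scores on sparse, high-dimensional text vectors tend to be low, so they are best used for comparing values of `n_clusters` against each other.
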

#### 7. Try 3 clusters

In [14]:
# same as 2 clusters used before
km_3_clusters = KMeans(verbose = 0, n_clusters = 3)
km_3_clusters.fit(Reviews_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [15]:
cluster_train_3 = km_3_clusters.predict(Reviews_train_vec)
cluster_test_3 = km_3_clusters.predict(Reviews_test_vec)
cluster_3 = km_3_clusters.predict(Reviews_vec)

In [16]:
cluster_top_features = list()
for i in range(km_3_clusters.n_clusters):
    cluster_top_features.append(np.argsort(km_3_clusters.cluster_centers_[i])[::-1][:20])
for num, centroid in enumerate(cluster_top_features):
    print ('%d: %s' % (num+1, ",".join(words[i] for i in centroid)))

1: food,great,service,place,meal,delicious,don,restaurant,definitely,best,nice,little,wait,worth,night,just,amazing,good,like,recommend
2: good,food,chicken,place,just,really,like,came,service,great,got,time,try,hot,love,come,sauce,fresh,ordered,nice
3: sushi,pizza,et,excellent,est,recommend,service,le,highly,good,rolls,best,try,indian,great,place,nous,bar,tried,delicious


Three clusters appear to capture the positive, negative and in-between types of review.

Next, draw a random sample from each of the 3 clusters, with its rating and review text, to get a sense of what each cluster holds.

In [17]:
for i in range(km_3_clusters.n_clusters):
    sub_cluster = np.arange(0, cluster_train_3.shape[0])[cluster_train_3 == i]
    sample = np.random.choice(sub_cluster, 1)
    # Map the position in the shuffled training split back to its label in df
    row = Reviews_train.index[sample[0]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].loc[row])
    print("The review is:\n%s.\n" % df['text'].loc[row])

The cluster is 1.
The star is: 1 stars.
The review is:
Very small portion for the price. Food tasted funny.  Won't come back again, won't recommend to anyone..

The cluster is 2.
The star is: 3 stars.
The review is:
I saw the raving reviews and am staying right across the street from this food joint, so, I had to stop by. The setup is pretty much made for a quick food grab. The staff was kind and the service was quick. I ordered a chicken combo that comes with chicken, fries, and salad. I also got some rice. The seasoning on the fries and rice really helped bring the quality of the meal up because I really wasn't shocked by this meal. I was expecting amazing chicken and was let down. It was alright, but definitely not what some of these reviews make it out to be. The salad had this weird mustard flavored dressing that I was not fond of..

The cluster is 3.
The star is: 3 stars.
The review is:
I have been to this place since it was called Aka Teppan. It is a great place to eat if you ar

This looks much better than 2 clusters. Next, test the model on the test data and on the entire dataset.

In [18]:
# in test data
for i in range(km_3_clusters.n_clusters):
    sub_cluster = np.arange(0, cluster_test_3.shape[0])[cluster_test_3 == i]
    sample = np.random.choice(sub_cluster, 1)
    # Map the position in the shuffled test split back to its label in df
    row = Reviews_test.index[sample[0]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].loc[row])
    print("The review is:\n%s.\n" % df['text'].loc[row])

The cluster is 1.
The star is: 1 stars.
The review is:
It's my first time writing review on Yelp. But I am so disgusted by the dish I had for take out so I have to speak out. It's just a bowl of brown thick goo with a little ribs, bell pepper and onion pieces. The ribs and veggie tastes weird and a little bit bitter. And my throat felt itchy after tasting it. What my bf had is also terrible. Definitely one of the most horrible food I had for my entire life!.

The cluster is 2.
The star is: 4 stars.
The review is:
Tasty, authentic Mexican street food that gives appropriate portions relative to the prices..

The cluster is 3.
The star is: 5 stars.
The review is:
Meilleur poulet portugais à Montréal. Les frites sont délicieuses. A deux, on commande le demi poulet avec extra sauce piquante à côté. On peut acheter leurs épices piri piri pour cuisiner chez soi: j'en mets sur le pop corn c'est vraiment bon !.



In [19]:
# in entire dataset
sub = []  # sampled row positions, one per cluster
for i in range(km_3_clusters.n_clusters):
    # cluster_3 follows the row order of df, so positions can be used with iloc
    sub_cluster = np.arange(0, cluster_3.shape[0])[cluster_3 == i]
    sample = np.random.choice(sub_cluster, 1)
    sub.append(sample)
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].iloc[sample[0]])
    print("The review is:\n%s.\n" % df['text'].iloc[sample[0]])

    

The cluster is 1.
The star is: 1 stars.
The review is:
My parents are never one to complain, but their experience at this restaurant was so negative that they asked me to post this review on their behalf since they don't have Yelp accounts. The abridged version: My parents arrived around 8:30pm and left at 10:30pm last night having not received any of the food they ordered despite asking three times about the status of their order.

Last night, they waited about half an hour to be seated and over an hour for their food, which they never received. They noticed 2-3 tables who were seated and ordered after them receive their food. Finally, after ONE HOUR at least, they asked their server about the food and the server apologized and informed them that the "kitchen was experiencing problems". They questioned the server about why tables who were seated and ordered after them had received their food already and were given the same excuse about kitchen issues. They ordered meals off the menu w

Associating reviews with clusters works much better than the original rating system!
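
A cross-tabulation of cluster labels against stars gives a fuller picture than a handful of random samples. The sketch below uses made-up values (an assumption, not results from this dataset); the column name `new_cluster` mirrors the one added to `df` in the next cell.

```python
import pandas as pd

# Made-up labels for illustration only
toy = pd.DataFrame({
    "stars":       [1, 1, 2, 3, 3, 4, 5, 5],
    "new_cluster": [1, 1, 1, 2, 2, 2, 3, 3],
})

# Share of each star rating within every cluster (each row sums to 1)
table = pd.crosstab(toy["new_cluster"], toy["stars"], normalize="index")
print(table.round(2))
```

On the real data, the equivalent call would be `pd.crosstab(df['new_cluster'], df['stars'], normalize='index')`, run after the `new_cluster` column has been created.
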

In [None]:
df['new_cluster'] = cluster_3 + 1  # 1-based cluster label for every review

In [None]:
df.to_csv('2017_restaurant_new_reviews.csv')