# Clustering Analysis

- The previous analysis suggests that the star rating is not entirely reliable. For example, reviews with 4 and 5 stars are very similar to each other, and the same holds for reviews with 1 and 2 stars. We can therefore infer that a user who rates a restaurant 4 stars or 5 stars genuinely recommends it in either case.
- In this part, I replace the existing rating system with a two-degree system and a three-degree system in turn, to see which clustering performs better. In the two-degree system, every review corresponds to one of 2 degrees, like and dislike, whereas in the three-degree system reviews correspond to one of 3 degrees: most like, average like and dislike. The clustering method used here is K-means. It turns out that three clusters perform better than two.


In [123]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [124]:
df = pd.read_csv('2017_restaurant_reviews.csv')
df.head(2)

| Unnamed: 0 | business_id | name | categories | avg_stars | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | --1UhMGODdWsrMastO9DZw | The Spicy Amigos | Restaurants, Mexican | 4.0 | 0 | 2017-12-12 | 0 | e1HiHHD7CzY5NKZG7hvhTw | 5 | Absolutely delicious! And great service as wel... | 0 | Sew1Nht6Q0sGTIZeNvRfLw |
| 3 | --1UhMGODdWsrMastO9DZw | The Spicy Amigos | Restaurants, Mexican | 4.0 | 0 | 2017-08-09 | 0 | oKm8UTv-QSC0oCbniqwxjg | 4 | Tasty, authentic Mexican street food that give... | 0 | NoQCmYKyMPs4D01Wa6dZew |


# Cluster the review text data for all the restaurants

#### 1. Create a new variable to contain all reviews

In [125]:
Reviews = df['text']

#### 2. Split the reviews into training and test datasets in a 7:3 ratio

In [126]:
from sklearn.model_selection import train_test_split
Reviews_train, Reviews_test= train_test_split(Reviews, test_size=0.3)
# The clustering algorithm is slow, so I will take samples to find the optimal number of clusters
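Because the split above is random, the sampled reviews and cluster outputs change between runs. A minimal sketch of a reproducible 7:3 split on toy data follows; the `random_state=42` seed and the `toy_*` names are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the review text column
toy_reviews = pd.Series(["review %d" % i for i in range(10)])

# test_size=0.3 gives the 7:3 split; random_state (an assumed seed) fixes the shuffle
toy_train, toy_test = train_test_split(toy_reviews, test_size=0.3, random_state=42)
print(len(toy_train), len(toy_test))

# Repeating the call with the same seed yields the same rows
again_train, again_test = train_test_split(toy_reviews, test_size=0.3, random_state=42)
print(toy_test.index.equals(again_test.index))
```

Note that the split keeps the original row labels as the index of each part, which matters when looking rows up in the full DataFrame later.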

#### 3. NLP

In [127]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize a TfidfVectorizer object
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True, max_features = 5000
                            )
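# Fit the vectorizer on the training data and
# transform the training text into a TF-IDF weighted bag-of-words matrix
Reviews_train_vec = vectorizer.fit_transform(Reviews_train).toarray()#[:300]
# Get the word (feature name) behind each column
# (scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2)
words = vectorizer.get_feature_names()
# Transform all the reviews and the test reviews with the fitted vectorizer
Reviews_vec = vectorizer.transform(Reviews).toarray()#[:300]
Reviews_test_vec = vectorizer.transform(Reviews_test).toarray()#[:300]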

In [80]:
words[90:100]

['accidentally',
 'accommodate',
 'accommodated',
 'accommodating',
 'accompanied',
 'accompaniment',
 'accompany',
 'accompanying',
 'according',
 'accordingly']
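
To make the vectorization step more concrete, here is a small sketch on a toy corpus. The three sentences and the `toy_*` names are made up for illustration; only the `TfidfVectorizer` settings mirror the cell above (without `max_features`, since the toy vocabulary is tiny).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus standing in for the review text
toy_corpus = [
    "great food and great service",
    "terrible service and cold food",
    "amazing ramen",
]
toy_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', lowercase=True)
toy_vec = toy_vectorizer.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each kept word to its column index; columns are in alphabetical order
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_vec.shape)

# With the default norm='l2', every non-empty row has unit length
print(np.linalg.norm(toy_vec, axis=1))
```

Each row is one review and each column one word, so stop words such as "and" never get a column.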

#### 4. Cluster reviews with KMeans (two clusters) in training data

In [128]:
from sklearn.cluster import KMeans
km_clf = KMeans(verbose = 0,n_clusters = 2)
km_clf.fit(Reviews_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

##### Use this model to make predictions on training data, test data, all data respectively

In [129]:
cluster_train = km_clf.predict(Reviews_train_vec)
cluster_test = km_clf.predict(Reviews_test_vec)
cluster = km_clf.predict(Reviews_vec)
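
As a rough illustration of what `predict` does with unseen data, the sketch below fits K-means on toy 2-D points and checks that each new point is assigned to its nearest centroid. All data and `toy_*` names are hypothetical, and `random_state=0` is an assumed seed.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy "training" vectors forming two obvious groups (hypothetical data)
toy_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                      [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
# Toy unseen vectors, one near each group
toy_new = np.array([[0.5, 0.5], [9.0, 9.5]])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_train)
toy_pred = toy_km.predict(toy_new)

# Euclidean distance from every new point to every centroid, then pick the closest
toy_dists = np.linalg.norm(
    toy_new[:, None, :] - toy_km.cluster_centers_[None, :, :], axis=2)
toy_nearest = toy_dists.argmin(axis=1)
print(toy_pred, toy_nearest)
```

The two printed arrays should agree, since prediction only looks up the closest learned centroid and does not refit the model.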

#### 5. Inspect the centroids
A useful property of the KMeans algorithm is that it exposes the centroids of the clusters. By inspecting the centroids, we can find the key points of each cluster. Here, each centroid is a vector over the vocabulary whose entries weight the reviews' topic keywords, so it can be regarded as the average review of its cluster.

In [130]:
km_clf.cluster_centers_

array([[3.90312782e-18, 0.00000000e+00, 8.17484752e-03, ...,
        1.51544771e-03, 0.00000000e+00, 0.00000000e+00],
       [2.53495591e-03, 0.00000000e+00, 7.67806628e-03, ...,
        1.30104261e-18, 0.00000000e+00, 0.00000000e+00]])
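
The "average review" reading of a centroid can be checked on toy data: after fitting, each centroid should match the mean of the vectors assigned to it. This is only a sketch with made-up 2-D points; the `demo_*` names and `random_state=0` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D vectors in two well-separated groups
demo_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
demo_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo_X)
demo_labels = demo_km.predict(demo_X)

for k in range(demo_km.n_clusters):
    # Mean of all vectors assigned to cluster k
    member_mean = demo_X[demo_labels == k].mean(axis=0)
    print(k, demo_km.cluster_centers_[k], member_mean)
```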

#### 6. Map the centroids to words and present the top 20 features for each cluster.
To make sense of these numbers, map the centroid vectors back into 'word space'. The words with the largest weights are the most representative of a cluster. Then print out the top 20 keywords of each cluster.

In [131]:
cluster_top_features = list()
# Sort each centroid vector to find the top 20 features
for i in range(km_clf.n_clusters):
    cluster_top_features.append(np.argsort(km_clf.cluster_centers_[i])[::-1][:20])
# Map vector back to words and print out
for num, centroid in enumerate(cluster_top_features):
    print ('Cluster',num,':',(", ".join(words[i] for i in centroid)))

Cluster 0 : food, great, ramen, amazing, friendly, chicken, service, really, place, delicious, staff, nice, atmosphere, best, definitely, ll, spot, authentic, got, try
Cluster 1 : place, good, food, service, just, like, pizza, order, got, time, great, restaurant, really, love, came, ordered, chicken, meat, come, try
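
The `np.argsort(...)[::-1][:20]` idiom used above can be traced on a tiny example. The five words and weights below are invented for illustration and take the top 3 instead of the top 20.

```python
import numpy as np

# Hypothetical vocabulary and one centroid's weights over it
toy_words = ['bad', 'cold', 'food', 'great', 'service']
toy_centroid = np.array([0.05, 0.01, 0.40, 0.30, 0.20])

# argsort sorts ascending, so reverse it and keep the first 3 indices
top3 = np.argsort(toy_centroid)[::-1][:3]
top_words = [toy_words[i] for i in top3]
print(top_words)  # ['food', 'great', 'service']
```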


Both clusters seem to contain positive reviews, which is quite strange. Next, randomly sample reviews from both clusters, together with their star ratings, to get a sense of each cluster.

In [133]:
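for i in range(km_clf.n_clusters):
    # Positions (within the training set) of the reviews assigned to cluster i
    sub_cluster = np.arange(0, cluster_train.shape[0])[ cluster_train == i]
    sample = np.random.choice(sub_cluster, 1)
    # Map the training-set position back to the matching row label in df
    row = Reviews_train.index[sample[0]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].loc[row])
    print("The review is:\n%s.\n" % df['text'].loc[row])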

The cluster is 1.
The star is: 4 stars.
The review is:
Flaming Kitchen is located in First Markham Place and serves personal hotplate meals. The concept is from Pepper Lunch in Japan. Hot metal plates that are heated to 500 degrees Fahrenheit. raw meat with vegetables and/or rice are then placed on the plates, where they cook in front of the customer. The meat can be eaten rare or well done, depending on the customer's preference.

The restaurant is located in the same location as Aka Teppan before and they have had a complete rebranding. The menu was revemped with lots more dessert additions, with over 15 pages of food ranging from snacks, hot plates, dessert, and bubble tea drinks. 

Appetizers
We tried a variety of appetizers which are mostly fried. Flaming Kitchen is quite generous with the portion of their appetitizer so it is best to be shared among a few persons

Curry Fries is made with Japanese curry served on a hotplate. We love Japanese curry and their sweet and spicy flavor

In [134]:
from scipy.stats import pearsonr

clu_num = []  # cluster label (1-based) of each sampled review
pp = []       # number of reviews written for the sampled review's business
for j in range(1000):
    for i in range(km_clf.n_clusters):
        # Positions of all reviews assigned to cluster i
        sub_cluster = np.arange(0, cluster.shape[0])[cluster == i]
        sample = np.random.choice(sub_cluster, 1)
        clu_num.append(i+1)
        res = df['business_id'].iloc[sample[0]]
        # Count how many reviews this business has in the dataset
        pp.append((df['business_id'] == res).sum())
print('The correlation coefficient between cluster label and number of reviews is %f.' \
      % (pearsonr(clu_num, pp)[0]))

The correlation coefficient between cluster label and number of reviews is 0.109034.


As I suspected, this clustering performs poorly even on the training data.

#### 7. Try 3 clusters

In [135]:
# same as 2 clusters used before
km_3_clusters = KMeans(verbose = 0, n_clusters = 3)
km_3_clusters.fit(Reviews_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [136]:
cluster_train_3 = km_3_clusters.predict(Reviews_train_vec)
cluster_test_3 = km_3_clusters.predict(Reviews_test_vec)
cluster_3 = km_3_clusters.predict(Reviews_vec)

In [137]:
cluster_top_features = list()
for i in range(km_3_clusters.n_clusters):
    cluster_top_features.append(np.argsort(km_3_clusters.cluster_centers_[i])[::-1][:20])
for num, centroid in enumerate(cluster_top_features):
    print ('%d: %s' % (num+1, ",".join(words[i] for i in centroid)))

1: place,meal,like,bad,chicken,ordered,service,good,going,great,cheese,went,beer,time,outside,come,food,instead,seating,just
2: pizza,friendly,staff,quick,great,best,place,slice,ramen,love,food,la,service,fantastic,line,pretty,just,amazing,experience,really
3: food,good,place,service,really,great,got,restaurant,order,chicken,just,like,try,time,delicious,amazing,menu,dishes,sushi,better


It seems that three clusters can capture the positive, negative and in-between types.

Randomly sample reviews from the 3 clusters, together with their star ratings, to get a sense of each cluster.

In [138]:
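for i in range(km_3_clusters.n_clusters):
    # Positions (within the training set) of the reviews assigned to cluster i
    sub_cluster = np.arange(0, cluster_train_3.shape[0])[cluster_train_3 == i]
    sample = np.random.choice(sub_cluster, 1)
    # Map the training-set position back to the matching row label in df
    row = Reviews_train.index[sample[0]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].loc[row])
    print("The review is:\n%s.\n" % df['text'].loc[row])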

The cluster is 1.
The star is: 5 stars.
The review is:
Last minute stopped for a drink while my daughter had lunch. Service was outstanding and we were given complimentary drinks and a sample tapas which was amazing!!  Yes and my daughter said her veggie burger was the best..

The cluster is 2.
The star is: 5 stars.
The review is:
This is our first time here with the renovations and new management, and all we have to say is, "Wow!". They really stepped up their game! The decor is amazing and the food is to die for. They have started making their own tortillas (so good!) and have added several new items to their already fantastic menu. The shrimp especiale taco is UNREAL. If that tamarind sauce was the last thing I ever ate, I would die happy. We also checked out the new fish taco and chicharonn taco - perfect! This was my favourite place in the city for authentic Mexican before, and now I would say it rivals my favourites in the Southern USA..

The cluster is 3.
The star is: 4 stars.
T

This looks much better than 2 clusters. Next, test the model on the test data and on the entire dataset.

In [139]:
#in test data
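for i in range(km_3_clusters.n_clusters):
    # Positions (within the test set) of the reviews assigned to cluster i
    sub_cluster = np.arange(0, cluster_test_3.shape[0])[cluster_test_3 == i]
    sample = np.random.choice(sub_cluster, 1)
    # Map the test-set position back to the matching row label in df
    row = Reviews_test.index[sample[0]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].loc[row])
    print("The review is:\n%s.\n" % df['text'].loc[row])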

The cluster is 1.
The star is: 2 stars.
The review is:
Came here hoping for a semi quick lunch with the hubs. Was throughly disappointed with the experience this past time. We stood at the door for a good ten minutes as there was no host and no servers we could get the attention of. We've always known that service is slower here, but waiting an hour for a stir-fry just didn't seem right! The restaurant was fairly empty when we went, and the server did apologize several times for her slow speed. 
We both opted for stir fry bowls, which took forever to get to us to fill. Once filled and dropped off, we watched as my food sat on the counter for 15 minutes getting cold. In the past I've tried to get it off the bar myself, but after being told no numerous times, I've learned not to try. We were able to speak with a manager during our last visit (which was another 15 minute wait as well) and get my meal compted as it had arrived ice cold.  
Overall, the concept of the restaurant is great, bu

In [140]:
# in entire dataset
for i in range(km_3_clusters.n_clusters):
    sub_cluster = np.arange(0, cluster_3.shape[0])[cluster_3 == i]
    sample = np.random.choice(sub_cluster, 1)
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].iloc[sample[0]])
    print("The review is:\n%s.\n" % df['text'].iloc[sample[0]])

The cluster is 1.
The star is: 5 stars.
The review is:
Last minute stopped for a drink while my daughter had lunch. Service was outstanding and we were given complimentary drinks and a sample tapas which was amazing!!  Yes and my daughter said her veggie burger was the best..

The cluster is 2.
The star is: 5 stars.
The review is:
Meilleur poulet portugais à Montréal. Les frites sont délicieuses. A deux, on commande le demi poulet avec extra sauce piquante à côté. On peut acheter leurs épices piri piri pour cuisiner chez soi: j'en mets sur le pop corn c'est vraiment bon !.

The cluster is 3.
The star is: 5 stars.
The review is:
Food took some time to prepare but the wait was definitely worth it, since you're sure that it's freshly cooked. Will come back soon!.



Associating reviews with clusters works much better than the past rating system!
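
Single random samples give only an impression of each cluster. One possible way to look at how clusters line up with star ratings more systematically is a cross-tabulation, sketched below on invented data. The `toy_stars` and `toy_cluster` values are hypothetical and do not come from the review dataset; with the real data one might pass `cluster_3` and `df['stars']` instead.

```python
import pandas as pd

# Hypothetical star ratings and cluster labels for 8 reviews
toy_stars = pd.Series([5, 4, 1, 2, 5, 3, 1, 4], name='stars')
toy_cluster = pd.Series([0, 0, 1, 1, 0, 2, 1, 0], name='cluster')

# Rows are clusters, columns are star ratings, cells are review counts
toy_table = pd.crosstab(toy_cluster, toy_stars)
print(toy_table)

# Average star rating per cluster as a one-number summary
toy_mean_stars = toy_stars.groupby(toy_cluster).mean()
print(toy_mean_stars)
```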