# Clustering Analysis

- According to the previous analysis, the star rating is not entirely reliable. For example, 4-star and 5-star reviews read very similarly, and the same holds for 1-star and 2-star reviews. We can therefore infer that a rating of either 4 stars or 5 stars means the user genuinely recommends the restaurant.
- In this part, I replace the existing rating system with a two-degree system and then a three-degree system to see which clustering performs better. In the two-degree system, every review corresponds to one of 2 degrees (like, dislike), whereas in the three-degree system, every review corresponds to one of 3 degrees (most like, average like, dislike). The clustering method used here is K-means. It turns out that three clusters perform better than two.


In [221]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [224]:
df = pd.read_csv('2017_restaurant_reviews.csv')
df.head(3)

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
2,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-12-12,0,e1HiHHD7CzY5NKZG7hvhTw,5,Absolutely delicious! And great service as wel...,0,Sew1Nht6Q0sGTIZeNvRfLw
3,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew
5,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,1,2017-05-10,0,9CLEOpUCqRkIR02sx-JsMQ,5,A little hole in the wall for some really deli...,0,atyCaAjUYatIFDOGKy00SA


# Cluster the review text data for all the restaurants

#### 1. Create a new variable to contain all reviews

In [225]:
Reviews = df['text']

#### 2. Split into training and test datasets (7:3)

In [226]:
from sklearn.model_selection import train_test_split
Reviews_train, Reviews_test= train_test_split(Reviews, test_size=0.3)
# The clustering algorithm is slow, so I will take a sample to find the optimal number of clusters

#### 3. NLP

In [227]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize a TfidfVectorizer object
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True, max_features = 5000
                            )
# Fit the vectorizer on the training reviews and turn them into TF-IDF vectors;
# keep only the first 300 rows because clustering is slow
Reviews_train_vec = vectorizer.fit_transform(Reviews_train).toarray()[:300]
# Get the word behind each feature column
# (newer scikit-learn releases provide get_feature_names_out() for this)
words = vectorizer.get_feature_names()
# Transform all the reviews and the test reviews with the fitted vectorizer
Reviews_vec = vectorizer.transform(Reviews).toarray()
Reviews_test_vec = vectorizer.transform(Reviews_test).toarray()
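
To make the vectorizing step more concrete, here is a minimal sketch on three made-up reviews. The names `toy_reviews`, `toy_vectorizer`, `toy_matrix` and `toy_terms` are hypothetical and not part of the notebook; only the `TfidfVectorizer` settings mirror the cell above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews, for illustration only
toy_reviews = [
    "great food and great service",
    "terrible service and cold food",
    "great ramen",
]
toy_vectorizer = TfidfVectorizer(analyzer="word", stop_words="english", lowercase=True)
# Sparse matrix with one row per review and one column per kept term
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)

# vocabulary_ maps each kept term to its column index; sort terms by column
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_terms)
print(np.round(toy_matrix.toarray(), 2))
```

Stop words such as "and" are dropped, and with the default settings each row is scaled to unit length, so long and short reviews become comparable.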

#### 4. Cluster reviews with KMeans (two clusters) in training data

In [228]:
from sklearn.cluster import KMeans
km_clf = KMeans(verbose = 0,n_clusters = 2)
km_clf.fit(Reviews_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

##### Use this model to predict clusters for the training data, the test data, and all data

In [229]:
cluster_train = km_clf.predict(Reviews_train_vec)
cluster_test = km_clf.predict(Reviews_test_vec)
cluster = km_clf.predict(Reviews_vec)

#### 5. Inspect the centroids
A useful property of the KMeans algorithm is that it exposes the centroids of the clusters. By inspecting the centroids, we can find the key points of each cluster. Here, each centroid is a vector with one weight per vocabulary word, so it can be regarded as the "average review" of its cluster.

In [230]:
km_clf.cluster_centers_

array([[-6.50521303e-19,  0.00000000e+00,  3.70386544e-03, ...,
         3.86245330e-03,  1.38112500e-03,  2.30667030e-03],
       [ 8.05673056e-04,  0.00000000e+00,  9.24211355e-03, ...,
        -2.16840434e-19,  1.30104261e-18,  2.38524478e-18]])
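
The "average review" reading of a centroid can be checked on a small example. This is a sketch on made-up 2-D points, not the notebook's data; `X` and `toy_km` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two well-separated groups
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]])
toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for k in range(toy_km.n_clusters):
    members = X[toy_km.labels_ == k]
    # Print the fitted centroid next to the plain mean of its members
    print(k, toy_km.cluster_centers_[k], members.mean(axis=0))
```

On data like this, each centroid coincides with the mean of the points assigned to it, which is why a centroid of TF-IDF vectors can be read as a typical review for that cluster.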

#### 6. Map the centroids to words and present the top 20 features for each cluster.
The centroid vectors need to be mapped back into "word space" to make sense of these numbers. The highest-weighted words are the most representative of a cluster, so we print the top 20 keywords for each cluster.

In [231]:
cluster_top_features = list()
# Sort each centroid vector to find the top 20 features
for i in range(km_clf.n_clusters):
    cluster_top_features.append(np.argsort(km_clf.cluster_centers_[i])[::-1][:20])
# Map vector back to words and print out
for num, centroid in enumerate(cluster_top_features):
    print ('Cluster',num,':',(", ".join(words[i] for i in centroid)))

Cluster 0 : good, place, pork, service, chicken, ramen, food, just, restaurant, ordered, bad, like, delicious, rice, soup, taste, don, shrimp, ve, dessert
Cluster 1 : food, great, place, good, delicious, service, really, just, amazing, nice, like, drinks, time, went, atmosphere, best, definitely, friendly, little, staff


Both clusters seem to describe positive reviews, which is quite strange. Next, randomly sample one review from each of the 2 clusters, together with its rating, to get a sense of the clusters.

In [232]:
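for i in range(km_clf.n_clusters):
    # Positions (within the 300-row training sample) assigned to cluster i
    sub_cluster = np.arange(0, cluster_train.shape[0])[cluster_train == i]
    sample = np.random.choice(sub_cluster, 1)
    # Map the position in the shuffled training split back to the row label in df
    row = Reviews_train.index[sample[0]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].loc[row])
    print("The review is:\n%s.\n" % df['text'].loc[row])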

The cluster is 1.
The star is: 4 stars.
The review is:
Friendly service. Convenient location. Good menu choices. Like it that they buried the sauce within your ride so that the sizzling pan won't dry it up.  The place is small, and with the many sizzling plates, the ventilation is poor. Don't like coming out smelling like food..

The cluster is 2.
The star is: 5 stars.
The review is:
Great options in the menu!  Tough choices.  There are a variety of flavours to suit any mood.  We will be back.  Service was very friendly and had great recommendations..
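
When sampling from a split, it may help to keep in mind that `train_test_split` shuffles the rows, so a position in the training arrays is not the same as a row position in `df`. A minimal sketch on made-up data (the names `toy`, `train_text`, `pos` and `orig_label` are hypothetical) of mapping a position in the split back to the original row:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for the review table
toy = pd.DataFrame({"text": list("abcdefghij"), "stars": range(10)})
train_text, test_text = train_test_split(toy["text"], test_size=0.3, random_state=0)

pos = 0                              # a position inside the training split
orig_label = train_text.index[pos]   # the row label it had in the original frame
print(pos, orig_label)
print(train_text.iloc[pos], toy["text"].loc[orig_label], toy["stars"].loc[orig_label])
```

The split keeps the original index, so `.index[pos]` followed by `.loc[...]` retrieves the rating that belongs to the sampled review.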



In [166]:
df111 = pd.read_csv('output.csv')



In [167]:
df111.head(5)

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id,cluster
0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id,cluster
1,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id,cluster
2,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id,cluster
3,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id,cluster
4,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id,cluster


In [146]:
pd.DataFrame(final).to_csv('final.csv')

In [233]:
# For 1000 rounds, sample one review per cluster and record the cluster number
# together with the number of reviews of the restaurant that review belongs to
clu_num=[]
pop=[]
for j in range(1000):
    for i in range(km_clf.n_clusters):
        sub_cluster = np.arange(0, cluster.shape[0])[cluster == i]
        sample = np.random.choice(sub_cluster, 1)
        clu_num.append(i+1)
        res=df['business_id'].iloc[sample[0]]
        # Count how many reviews this restaurant has
        pop.append((df['business_id']==res).sum())
from scipy.stats import pearsonr
print('The correlation coefficient between cluster number and number of reviews is %f.' \
      % (pearsonr(clu_num, pop)[0]))

The correlation coefficient between cluster number and number of reviews is -0.199299.


As I suspected, this clustering performs poorly even on the training data.
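
A more systematic check than sampling single reviews is to cross-tabulate cluster labels against star ratings. The sketch below uses made-up values; `toy_stars`, `toy_cluster` and `table` are hypothetical names. In the notebook, the same call could presumably be applied to `cluster` and `df['stars']`.

```python
import numpy as np
import pandas as pd

# Made-up ratings and cluster labels, for illustration only
toy_stars = pd.Series([5, 4, 5, 1, 2, 4, 1, 3], name="stars")
toy_cluster = pd.Series([0, 0, 1, 1, 0, 1, 0, 1], name="cluster")

# Share of each star rating within every cluster (each row sums to 1)
table = pd.crosstab(toy_cluster, toy_stars, normalize="index")
print(table.round(2))
```

If the clusters really separated liked from disliked restaurants, each row would concentrate on one end of the star scale; rows that look alike suggest the clusters capture something else, such as cuisine.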

#### 7. Try 3 clusters

In [234]:
# same as 2 clusters used before
km_3_clusters = KMeans(verbose = 0, n_clusters = 3)
km_3_clusters.fit(Reviews_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [235]:
cluster_train_3 = km_3_clusters.predict(Reviews_train_vec)
cluster_test_3 = km_3_clusters.predict(Reviews_test_vec)
cluster_3 = km_3_clusters.predict(Reviews_vec)

In [237]:
cluster_top_features = list()
for i in range(km_3_clusters.n_clusters):
    cluster_top_features.append(np.argsort(km_3_clusters.cluster_centers_[i])[::-1][:20])
for num, centroid in enumerate(cluster_top_features):
    print ('%d: %s' % (num+1, ",".join(words[i] for i in centroid)))

1: service,food,good,like,great,place,pizza,time,delicious,just,don,definitely,experience,best,wait,et,rude,ve,came,worth
2: food,great,atmosphere,drinks,place,nice,love,staff,delicious,amazing,went,service,good,recommend,loved,feel,really,patio,little,couple
3: chicken,good,rice,pork,delicious,just,really,food,spicy,ordered,ramen,place,fish,soup,ve,taste,order,time,friendly,pretty


It seems that three clusters can capture the positive, negative and in-between types.
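
The comparison between two and three clusters here rests on reading the keywords. A quantitative check that this notebook does not use, shown only as a possible complement, is the silhouette score. The sketch below runs on synthetic blobs, and `X_demo` and `scores` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for the TF-IDF vectors
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # The silhouette score lies in [-1, 1]; higher means better-separated clusters
    scores[k] = silhouette_score(X_demo, labels)
print(scores)
```

On the review data, the same loop could be run over `Reviews_train_vec`, keeping in mind that silhouette scores on sparse, high-dimensional TF-IDF vectors tend to be low in absolute terms, so only the relative ordering is informative.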

Randomly sample one review from each of the 3 clusters, together with its rating, to get a sense of them.

In [238]:
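for i in range(km_3_clusters.n_clusters):
    # Positions (within the 300-row training sample) assigned to cluster i
    sub_cluster = np.arange(0, cluster_train_3.shape[0])[cluster_train_3 == i]
    sample = np.random.choice(sub_cluster, 1)
    # Map the position in the shuffled training split back to the row label in df
    row = Reviews_train.index[sample[0]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].loc[row])
    print("The review is:\n%s.\n" % df['text'].loc[row])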

The cluster is 1.
The star is: 5 stars.
The review is:
Fantastic spot for lunch with great value for your money.   I was attending meetings at the building next door and went out to grab a bite with a colleague.  There isn't much around so this was chosen out of convenience but am I glad I did!  I only work a few blocks over and will definitely be back now that I have been introduced to this gem.  First of the staff are fantastic.  Super friendly, keep the orders moving and offer suggestions if you are stuck for what to chose.  There is a nice selection of drinks including mexica sodas (cane sugar) which is a thing for me.  I had quesadilla with the tinga chicken which had a bit of heat but not too much, just a nice kick.  The tortilla was fresh and they put the right amount of cheese, not too much, not greasy, just right.  Fresh lettuce, tomato, jalapenos are an option just not for me.  The portion size was probably a bit big for me for lunch but I couldn't help myself and finished th

This looks much better than 2 clusters. Next, test the model on the test data and on the entire dataset.

In [239]:
# in test data
for i in range(km_3_clusters.n_clusters):
    # Positions (within the test split) assigned to cluster i
    sub_cluster = np.arange(0, cluster_test_3.shape[0])[cluster_test_3 == i]
    sample = np.random.choice(sub_cluster, 1)
    # Map the position in the shuffled test split back to the row label in df
    row = Reviews_test.index[sample[0]]
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].loc[row])
    print("The review is:\n%s.\n" % df['text'].loc[row])

The cluster is 1.
The star is: 3 stars.
The review is:
We waited for more than 30 mins being the second table to order at dinner time. 
The space is very small and tight.
The portion of an entrée is small for what we paid. 
Food is nothing than ordinary with a distintive  presentation on a sizzling plate..

The cluster is 2.
The star is: 1 stars.
The review is:
Each time I come to this TH, there are consistent errors with my order. Today will be the last time I come to this location.  I ask for an english muffin breakfast sandwich to be toasted, it arrives cold and stale. I returned the sandwich to the counter for a properly toasted sandwich, the staff only toasts one slice of the sandwich. I tell them that is unacceptable as now i have one toasted side and one stale and cold side. Their solution is to microwave the entire sandwich. Seriously??  The amount of time spent trying to get a toasted sandwich was rather assinine. This happens all too often here. Save yourself the trouble and 

In [240]:
# in entire dataset
c_n=[]
star=[]
con=[]
sub=[]
for i in range(km_3_clusters.n_clusters):
    sub_cluster = np.arange(0, cluster_3.shape[0])[cluster_3 == i]
    sample = np.random.choice(sub_cluster, 1)
    sub.append(sample)
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].iloc[sample[0]])
    print("The review is:\n%s.\n" % df['text'].iloc[sample[0]])

    

The cluster is 1.
The star is: 4 stars.
The review is:
4.5 Stars. First off, I am super picky with pho. The dish itself is pretty simple, so I believe it should be done right. The broth here was definitely up to standard! The portions were good, as well as the price. 

The service was pleasant as well, tea was brought right away and everything was done in a timely manner. I was super impressed by the cleanliness of this restaurant (though I know it's pretty new, so I guess this part may take some time to gauge properly). 

It's nice to see some Vietnamese cuisine in the Mahogany plaza since it's mostly Western food! I'll definitely be back and hope they don't change their prices/portions. They also had a few big tables (seats probably 8-10) for larger parties! There is plenty of parking in the plaza..

The cluster is 2.
The star is: 5 stars.
The review is:
A Montreal staple, Romados serves up some damn good Portuguese chicken along with other bar b q specialties. The fries doused in he

Associating reviews with clusters works much better than the previous rating system!

In [242]:
content=[]
cluster_num=[]
for i in range(km_3_clusters.n_clusters):
    sub_cluster = np.arange(0, cluster_3.shape[0])[ cluster_3 == i]
    for j in sub_cluster:
        content.append(pd.DataFrame(df.iloc[j]))
        cluster_num.append(i+1)
        
 

In [262]:
final=[]

for i in range(len(content)):
    new1=content[i].T
    new1['cluster']=cluster_num[i]
    final.append(new1)

In [265]:
a=final[1].append(final[2])

In [264]:
final[1]

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id,cluster
3,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew,1


In [272]:
final[1].append(final[2])

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id,cluster
3,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew,1
6,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4,0,2018-01-11,0,hhVr1uH7XaRlbIHI8dYvbA,5,Glad these guys moved in on the block! It's qu...,0,FYhU1fKQ7n11WQ7gcFYOag,1


In [275]:
type(final[1])

pandas.core.frame.DataFrame

In [276]:
type(final)

list

In [165]:
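import csv
# Each element of `final` is a one-row DataFrame; write one CSV row per review
with open("output.csv", 'w', newline='') as f:
    wr = csv.writer(f, dialect='excel')
    # Header row with the column names
    wr.writerow(final[0].columns)
    for i in final:
        wr.writerow(i.iloc[0].tolist())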
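
Attaching the cluster labels and exporting them can also be done without building one DataFrame per review. This is a sketch under assumptions, on a made-up frame: `toy_df`, `toy_labels`, `labeled`, `buf` and `round_trip` are hypothetical names, and in the notebook the labels would presumably be `cluster_3`.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-ins for the review table and the 0-based KMeans labels
toy_df = pd.DataFrame({"review_id": ["a", "b", "c"], "stars": [5, 1, 4]})
toy_labels = np.array([2, 0, 1])

# Add a 1-based cluster column in one step; rows keep their original order
labeled = toy_df.assign(cluster=toy_labels + 1)

# Write to an in-memory buffer here; a file name such as "output.csv" works the same way
buf = io.StringIO()
labeled.to_csv(buf, index=False)
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip)
```

Because the labels from `predict` are in the same order as the rows that were vectorized, a single column assignment keeps every review aligned with its cluster.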