# Yelp Data Challenge - Clustering Analysis
Summary:
- From the sentiment analysis, I concluded that the star rating is not meaningful in the 4-5 star and 1-2 star ranges. I suggest replacing the 1-5 rating system with either a like/dislike system or a below_average/average/above_average system. Here I use clustering analysis to determine which of the two is better.
- The clustering method is K-means. It turns out that three clusters (below_average/average/above_average) perform better than two clusters (like/dislike).
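
The comparison between two and three clusters can also be made quantitative with an internal metric such as the silhouette score. The sketch below is only an illustration of that idea, not the method used in this notebook: it runs on a small synthetic dataset from `make_blobs` (an assumption standing in for the TF-IDF vectors), and the names `X`, `scores` and `best_k` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the review vectors (assumption: three well-separated groups).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit k-means for each candidate k and record the silhouette score (higher is better).
scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the full review set the silhouette score is expensive to compute, so passing `sample_size` to `silhouette_score` is probably necessary.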


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [2]:
df = pd.read_csv('2017_restaurant_reviews.csv')
df.head(2)

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
2,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-12-12,0,e1HiHHD7CzY5NKZG7hvhTw,5,Absolutely delicious! And great service as wel...,0,Sew1Nht6Q0sGTIZeNvRfLw
3,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew


# Cluster the review text data for all the restaurants

#### 1. Define my feature variables - the text of the review

In [3]:
documents = df['text']
print(len(documents))

198993


#### 2. Create training dataset and test dataset

In [4]:
from sklearn.model_selection import train_test_split
# The clustering algorithm is slow, so I take a sample to find the optimal number of clusters
documents_train, documents_test = train_test_split(documents, test_size=0.3)

#### 3. Get NLP representation of the documents

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True, max_features = 5000
                            )
# Fit the vectorizer on the training data (toarray() converts the sparse result to a dense array)
documents_train_vec = vectorizer.fit_transform(documents_train).toarray()
# Get the vocabulary of the TF-IDF model (get_feature_names_out() in scikit-learn >= 1.0)
words = vectorizer.get_feature_names()
# Use the fitted vectorizer to transform all the reviews (documents = df['text'])
documents_vec = vectorizer.transform(documents).toarray()

#### 4. Cluster reviews with KMeans (two clusters)

Fit k-means clustering on the training vectors and apply it to all the data.

In [31]:
from sklearn.cluster import KMeans
km_clf = KMeans(verbose = 0,n_clusters = 2)  

In [None]:
km_clf.fit(documents_train_vec)  # raised a memory error on the dense training matrix
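
The memory error noted above is plausibly caused by the `.toarray()` calls in step 3: a dense float64 array of about 139,000 × 5,000 values needs roughly 5.6 GB, and the full matrix about 8 GB. `KMeans` accepts SciPy sparse matrices directly, so one option is to skip the conversion. The sketch below shows this on a tiny hypothetical corpus (`toy_docs` and the other `toy_*` names are not part of the notebook); it has not been run on the Yelp data.

```python
from scipy.sparse import issparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny hypothetical corpus standing in for documents_train.
toy_docs = [
    "great food and friendly service",
    "friendly staff great service",
    "spicy chicken rice and noodles",
    "fried chicken with rice and soup",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_vec = toy_vectorizer.fit_transform(toy_docs)  # no .toarray(): stays sparse
print(issparse(toy_vec), toy_vec.shape)

# KMeans works on the sparse matrix without densifying it.
toy_km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_km.fit_predict(toy_vec)
print(toy_labels)
```

If fitting is still too slow, `sklearn.cluster.MiniBatchKMeans` takes the same sparse input and trades some accuracy for speed.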

#### 5. Make predictions on all your data

In [30]:
cluster = km_clf.predict(documents_vec)

NotFittedError: This KMeans instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

#### 6. Inspect the centroids
To find out what "topics" K-means has discovered, I must inspect the centroids. Print out the centroids of the K-means clustering.
These centroids are simply vectors. To make sense of them I need to map them back into 'word space'. Think of each centroid as the "average" review of its cluster: each feature/dimension holds the average TF-IDF weight of one word across the reviews in that cluster.

In [8]:
km_clf.cluster_centers_

array([[0.00134598, 0.00027696, 0.00757935, ..., 0.00081127, 0.0001628 ,
        0.00046474],
       [0.00135244, 0.00027081, 0.00895964, ..., 0.00073902, 0.00019527,
        0.00089532]])

#### 7. Find the top 40 features for each cluster.
For topics I am only really interested in the most prominent words, i.e. the features/dimensions with the largest weights in the centroid. Print out the top 40 words for each centroid.

* Sort each centroid vector to find the top 40 features.
* Go back to my vectorizer object to find out which words these features correspond to.


In [9]:
cluster_top_features = list()
for i in range(km_clf.n_clusters):
    cluster_top_features.append(np.argsort(km_clf.cluster_centers_[i])[::-1][:40])
for num, centroid in enumerate(cluster_top_features):
    print('%d: %s' % (num, ", ".join(words[i] for i in centroid)))  # words: feature names from the 'text' (review) column

0: food, great, place, good, service, really, time, like, just, restaurant, delicious, nice, amazing, staff, definitely, friendly, best, menu, pizza, ve, came, ordered, come, try, love, got, order, fresh, sushi, don, recommend, experience, price, wait, toronto, pretty, little, table, went, people
1: chicken, good, rice, food, fried, sauce, place, ordered, soup, ramen, pork, spicy, like, noodles, really, beef, thai, just, got, curry, came, great, meat, restaurant, delicious, try, definitely, time, dish, order, service, dishes, ve, hot, bit, pretty, bowl, small, taste, come


It seems that both clusters contain positive reviews, which is quite strange. Print out the rating and text of a random review from each cluster to get a sense of the clusters.

In [10]:
for i in range(km_clf.n_clusters):
    sub_cluster = np.arange(0, cluster.shape[0])[cluster == i]  #cluster = km_clf.predict(documents_vec)
    sample = np.random.choice(sub_cluster, 1)
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].iloc[sample[0]])
    print("The review is:\n%s.\n" % df['text'].iloc[sample[0]])

The cluster is 1.
The star is: 5 stars.
The review is:
4 stars for quality of food and 5 stars for experience and atmosphere. This place is a must to everyone who loves south Asian food and culture. Usually a few min wait time, so I suggest walking around and exploring the area a bit.
Make sure you get the kabobs!.

The cluster is 2.
The star is: 5 stars.
The review is:
Ordered from Ubereats April 22. 
The food came hot and organized in a paper bag. I wouldn't normally bother reviewing an Ubereats order, but their food was that good!

Pros
Achiote roasted chicken - Amazing flavour, tender and flavourful. The tostones were an unexpected highlight as I don't eat plaintains often.

Pork shoulder - Very unique and tender also came with 2 tostones. The cabbage slaw is surprisingly good. 

Buttermilk donuts - Comes with 6 timbits, wow these were amazing, they melted in my mouth. 

Cons
Food portions a bit too big for my liking, I was only able to half of the portion. I guess it's

In [29]:
from scipy.stats import pearsonr

# For 1000 rounds, draw one random review per cluster and record how many
# reviews its restaurant has; then correlate that count with the cluster label.
clu_num = []
pop = []
for j in range(1000):
    for i in range(km_clf.n_clusters):
        sub_cluster = np.arange(0, cluster.shape[0])[cluster == i]
        sample = np.random.choice(sub_cluster, 1)
        clu_num.append(i + 1)
        res = df['business_id'].iloc[sample[0]]
        pop.append((df['business_id'] == res).sum())
print('The correlation coefficient between cluster label and number of reviews is %f.'
      % (pearsonr(clu_num, pop)[0]))

NameError: name 'cluster' is not defined

#### 8. Try a different number of clusters: k = 3

In [7]:
# K-means with three clusters (fitted in the next cell)
km_3_clusters = KMeans(verbose = 0, n_clusters = 3)

In [8]:
km_3_clusters.fit(documents_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [9]:
cluster_top_features = list()
for i in range(km_3_clusters.n_clusters):
    cluster_top_features.append(np.argsort(km_3_clusters.cluster_centers_[i])[::-1][:40])
for num, centroid in enumerate(cluster_top_features):
    print ('%d: %s' % (num+1, ",".join(words[i] for i in centroid)))  

1: great,food,service,place,good,friendly,amazing,staff,delicious,love,atmosphere,nice,definitely,recommend,really,best,excellent,restaurant,time,fresh,highly,menu,awesome,super,ve,come,price,prices,try,spot,quality,drinks,little,experience,lunch,just,selection,fast,portions,dinner
2: food,good,place,like,just,service,time,really,restaurant,order,ordered,pizza,came,got,menu,don,ve,nice,come,try,best,delicious,didn,sushi,table,definitely,wait,went,pretty,fresh,better,people,experience,little,cheese,burger,great,bit,eat,small
3: chicken,rice,good,food,soup,fried,noodles,beef,place,pork,ramen,spicy,ordered,sauce,like,really,just,restaurant,got,came,try,thai,dish,delicious,noodle,dishes,time,service,meat,curry,pho,definitely,great,order,broth,ve,hot,pretty,taste,bit


It seems that three clusters can capture the positive, negative, and in-between types of review.
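
Top words alone do not show how the clusters relate to the star ratings. One way to check is to cross-tabulate cluster labels against stars and compare the mean rating per cluster. The sketch below uses a small made-up table (the `toy` DataFrame and its values are hypothetical, not results from this notebook); with the real data the same calls could be applied to the `stars` column and the predicted labels.

```python
import pandas as pd

# Hypothetical labels for eight reviews: star ratings and assigned clusters.
toy = pd.DataFrame({
    "stars":   [5, 5, 4, 1, 2, 3, 4, 1],
    "cluster": [1, 1, 1, 2, 2, 3, 3, 2],
})

# Rows: clusters; columns: star ratings; values: share of each rating within the cluster.
star_share = pd.crosstab(toy["cluster"], toy["stars"], normalize="index")
# Mean star rating per cluster.
mean_stars = toy.groupby("cluster")["stars"].mean()

print(star_share.round(2))
print(mean_stars)
```

If the clusters really correspond to below-average, average, and above-average reviews, the mean ratings should differ clearly between clusters.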

In [10]:
three_cluster_pred = km_3_clusters.predict(documents_vec)

Print out the rating and review of a random sample of the reviews assigned to each cluster to get a sense of the cluster.

In [11]:
print(three_cluster_pred.shape)
print(three_cluster_pred.shape[0])
print(np.arange(0, three_cluster_pred.shape[0]).shape)

(198993,)
198993
(198993,)


In [12]:
for i in range(km_3_clusters.n_clusters):  # i = 0, 1, 2
    # Boolean indexing: positions in three_cluster_pred that were assigned to cluster i
    sub_cluster = np.arange(0, three_cluster_pred.shape[0])[three_cluster_pred == i]
    sample = np.random.choice(sub_cluster, 1)  # draw a random sample (second argument is the size) from a 1-D array
    print(sub_cluster.size, sample.shape)
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].iloc[sample[0]])
    print("The review is:\n%s.\n" % df['text'].iloc[sample[0]])

44659 (1,)
The cluster is 1.
The star is: 5 stars.
The review is:
This place was an awesome surprise. The ambiance was comfortable and inviting. The food was excellent and the pricing was fair. Our waitress Colleen absolutely made entire experience fantastic..

114849 (1,)
The cluster is 2.
The star is: 1 stars.
The review is:
This restaurant is ridiculous. I can't believe how can they be so shameful. I ordered a soon tofu stew for takeout. And guess what?  only 1/3 bowl has thing in it, which is 1/3 height of a white plastic spoon.  The onion isnt cooked at all. And I told them that I will come back in 10 mins to pick it up, when i'm back, the chef was sitting there doing nothing. Didn't start to cook mine until I came back and told them I'm here to pick up what I ordered. ALSO, their side dish is soooo bad. like... the kimchi is not even kimchi, it's just raw cabbage with spicy sauce. Don't even go. What a waste of money..

39485 (1,)
The cluster is 3.
The star is: 4 stars.
The revie

In [23]:
df['new_cluster'] = three_cluster_pred + 1  # shift labels from 0-2 to 1-3

In [24]:
df.head(10)

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id,new_cluster
2,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-12-12,0,e1HiHHD7CzY5NKZG7hvhTw,5,Absolutely delicious! And great service as wel...,0,Sew1Nht6Q0sGTIZeNvRfLw,1
3,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew,1
5,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,1,2017-05-10,0,9CLEOpUCqRkIR02sx-JsMQ,5,A little hole in the wall for some really deli...,0,atyCaAjUYatIFDOGKy00SA,2
6,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2018-01-11,0,hhVr1uH7XaRlbIHI8dYvbA,5,Glad these guys moved in on the block! It's qu...,0,FYhU1fKQ7n11WQ7gcFYOag,1
7,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2018-04-22,0,BIecLw546kAlD7kmlk7vXA,2,I have been in search of good grilled steak ta...,0,sTVGcezME7gYBhIlYtcfpg,2
12,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,2,2017-05-01,1,ZULaTgcgMFJ4SHo_9ZcgxQ,5,Fantastic spot for lunch with great value for ...,2,ZS_qdLMh1XyXKGTrpKM-4Q,2
13,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-12-23,0,EiJoov0ZiIeAB527KizL5Q,5,This is our first time here with the renovatio...,0,P3V5p1Wt3znW2-4YtgafbA,2
14,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-07-26,0,X4PwY0mMHxqOf2O1pwYdIQ,5,"Delicious! One of the best burritos, salsa and...",0,4YHZzwlxEMi7zWO6Osszdw,1
16,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-04-03,0,Ueb33Y77jRDQhkjf7nmCUw,5,"We were in the mood for tacos, and came across...",0,Le81nQXs_LwAxRk9eK8oQQ,1
17,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-08-27,0,UfKXTOleA4mtNOOMqzCBTQ,4,"Friendly, fresh, authentic Mexican food for a ...",0,ncf_USO7wFU-B5bJjBkt9w,1


In [20]:
df[df['new_cluster'] == 3]  # reviews assigned to cluster 3

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id,new_cluster
20,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2018-05-06,0,8QUwzeXeyJ3L15lKfhKLsQ,1,I had the chicken Tinga enchilada in mole sauc...,0,Bsy9F-59sl9OT_bvZNl3hA,3
21,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,1,2017-05-09,0,LyEriIaCulKrzrjEULH_zQ,4,Downtown west which honestly feels like Downto...,0,TzU30D-CjtPP3XumggK0Mg,3
62,--DaPTJW3-tB1vP-PfdTEg,Sunnyside Grill,"Restaurants, Breakfast & Brunch",3.5,4,2018-05-10,1,WxvsMJUo8RVvuS_5rXgEVg,4,Stopped in on a weekday while in Toronto just ...,4,fn3Wim-2j042IEDe6VhJFQ,3
85,--SrzpvFLwP_YFwB_Cetow,Keung Kee Restaurant,"Chinese, Restaurants",3.5,0,2017-02-07,0,fJgaKuRGsnlKrOgfuDIDgw,4,Came here on a Sunday afternoon and left feeli...,2,6eCgSb66TE1LeiWPPBPnTg,3
91,--SrzpvFLwP_YFwB_Cetow,Keung Kee Restaurant,"Chinese, Restaurants",3.5,0,2017-08-17,0,EapUDyY8P8wOnvJW9R-zpg,3,MochiPopcorn Rating Chart\nAveraged 5 Star Rat...,0,isJEo2IISFonbenri9KQcw,3
97,--SrzpvFLwP_YFwB_Cetow,Keung Kee Restaurant,"Chinese, Restaurants",3.5,1,2017-06-18,0,8vpBc4T8bs9cg3t9tygmfg,3,"Surprising for a Chinese restaurant, there is ...",0,fM9LC2P8jQrQGGsXKccLQw,3
100,--SrzpvFLwP_YFwB_Cetow,Keung Kee Restaurant,"Chinese, Restaurants",3.5,0,2017-03-17,0,hCwFKMmpX4U4GacYkLtwng,4,This is pretty much a neighborhood institution...,0,XbiKsujS_qxU3xsr0xUqmQ,3
102,--SrzpvFLwP_YFwB_Cetow,Keung Kee Restaurant,"Chinese, Restaurants",3.5,1,2017-03-14,0,OnYwBwVMw3_vcaACT0eHIA,2,Come to think of it I haven't been to this Keu...,1,G-6X-llgA_qAxGxocykHzQ,3
107,--SrzpvFLwP_YFwB_Cetow,Keung Kee Restaurant,"Chinese, Restaurants",3.5,0,2017-12-01,0,mQfBDt8Ewym8h059n3JpAQ,3,This is the go-to restaurant for Chinese Canto...,0,pHUutGZOxIEARm27ODBgvw,3
111,--SrzpvFLwP_YFwB_Cetow,Keung Kee Restaurant,"Chinese, Restaurants",3.5,4,2017-08-18,0,ipmUehX6756Ccw6zKQYQhQ,4,Authentic is what i would call this restaurant...,6,CjbfWpCRLbA-L_eS_ztd6Q,3


In [25]:
df.to_csv('2017_restaurant_new_reviews.csv')

In [26]:
df=pd.read_csv('2017_restaurant_new_reviews.csv')

In [27]:
df.head()

Unnamed: 0.1,Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id,new_cluster
0,2,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-12-12,0,e1HiHHD7CzY5NKZG7hvhTw,5,Absolutely delicious! And great service as wel...,0,Sew1Nht6Q0sGTIZeNvRfLw,1
1,3,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew,1
2,5,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,1,2017-05-10,0,9CLEOpUCqRkIR02sx-JsMQ,5,A little hole in the wall for some really deli...,0,atyCaAjUYatIFDOGKy00SA,2
3,6,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2018-01-11,0,hhVr1uH7XaRlbIHI8dYvbA,5,Glad these guys moved in on the block! It's qu...,0,FYhU1fKQ7n11WQ7gcFYOag,1
4,7,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2018-04-22,0,BIecLw546kAlD7kmlk7vXA,2,I have been in search of good grilled steak ta...,0,sTVGcezME7gYBhIlYtcfpg,2
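
The extra `Unnamed: 0.1` column appears because `to_csv` writes the DataFrame index by default and `read_csv` then reads it back as an unnamed column. A minimal sketch of how to avoid this, using a made-up two-row frame and an in-memory buffer instead of a file (the `toy` frame is hypothetical):

```python
import io
import pandas as pd

toy = pd.DataFrame({"stars": [5, 4], "new_cluster": [1, 2]})

# Default behaviour: the index is written and comes back as 'Unnamed: 0'.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False skips the index, so the columns round-trip unchanged.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```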


In [28]:
df.count()

Unnamed: 0     198993
business_id    198993
name           198993
categories     198993
avg_stars      198993
cool           198993
date           198993
funny          198993
review_id      198993
stars          198993
text           198993
useful         198993
user_id        198993
new_cluster    198993
dtype: int64