# Yelp Data Challenge - Clustering Analysis
Summary: 
- From the sentiment analysis, I concluded that the rating score is unreasonable for 4-5 star and 1-2 star reviews. I suggest replacing the 1-5 rating system with either a like/dislike system or a below_average/average/above_average system. Here I use clustering analysis to determine which of the two fits the reviews better.
- The clustering method is K-means. Judging by the top words of each cluster, three clusters (below_average/average/above_average) separate the reviews better than two clusters (like/dislike).


In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [14]:
df = pd.read_csv('2017_restaurant_reviews.csv')
df.head(2)

Unnamed: 0,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
2,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-12-12,0,e1HiHHD7CzY5NKZG7hvhTw,5,Absolutely delicious! And great service as wel...,0,Sew1Nht6Q0sGTIZeNvRfLw
3,--1UhMGODdWsrMastO9DZw,The Spicy Amigos,"Restaurants, Mexican",4.0,0,2017-08-09,0,oKm8UTv-QSC0oCbniqwxjg,4,"Tasty, authentic Mexican street food that give...",0,NoQCmYKyMPs4D01Wa6dZew


# Cluster the review text data for all the restaurants

#### 1. Define my feature variables - the text of the review

In [15]:
documents = df['text']

#### 2. Create training dataset and test dataset

In [16]:
from sklearn.model_selection import train_test_split
# The clustering algorithm is slow, so I fit on a sample to find the optimal number of clusters
documents_train, documents_test = train_test_split(documents, test_size=0.3)
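
The split above is random, so the clusters found later can change from run to run. A minimal sketch of a reproducible split follows, assuming a toy series in place of `df['text']`; the names `toy_documents`, `toy_train` and `toy_test` and the seed `42` are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df['text']; the notebook itself uses the Yelp reviews.
toy_documents = pd.Series(["review number %d" % i for i in range(10)])

# random_state fixes the shuffle; 42 is an arbitrary seed chosen for this sketch.
toy_train, toy_test = train_test_split(toy_documents, test_size=0.3, random_state=42)
toy_train_again, _ = train_test_split(toy_documents, test_size=0.3, random_state=42)

print(len(toy_train), len(toy_test))
# The same seed yields the same rows in the same order.
print(toy_train.index.equals(toy_train_again.index))
```

Passing the same `random_state` in the notebook would make the sampled training set, and therefore the fitted clusters, easier to compare across runs.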

#### 3. Get NLP representation of the documents

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True, max_features = 5000
                            )
# Fit the vectorizer on the training data and transform it
documents_train_vec = vectorizer.fit_transform(documents_train).toarray()
# Get the vocabulary of the fitted TF-IDF model
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
words = vectorizer.get_feature_names()
# Use the fitted vectorizer to transform all the reviews
documents_vec = vectorizer.transform(documents).toarray()
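
Converting the TF-IDF output with `.toarray()` builds a dense matrix, which can use a lot of memory on a full year of reviews. As a hedged alternative, the sketch below keeps the matrix sparse and passes it straight to `KMeans`, which accepts sparse input. The toy reviews and the names `toy_reviews`, `toy_vectorizer`, `toy_sparse` and `toy_km` are assumptions made for illustration only.

```python
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus standing in for the review text.
toy_reviews = [
    "great food and friendly service",
    "terrible service and cold food",
    "friendly staff, great atmosphere",
    "cold fries and a long wait",
]

toy_vectorizer = TfidfVectorizer(analyzer="word", stop_words="english",
                                 lowercase=True, max_features=5000)
# fit_transform returns a SciPy sparse matrix; no .toarray() call is made here.
toy_sparse = toy_vectorizer.fit_transform(toy_reviews)
print(sparse.issparse(toy_sparse), toy_sparse.shape)

# KMeans can be fitted directly on the sparse matrix.
toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_sparse)
print(toy_km.labels_)
```

Whether this is worthwhile depends on the data size; with 5000 features and many thousands of reviews, the dense array is typically much larger than the sparse one.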

#### 4. Cluster reviews with KMeans (two clusters)

Fit K-means on the training vectors, then apply it to all the data.

In [18]:
from sklearn.cluster import KMeans
km_clf = KMeans(n_clusters=2, verbose=0)
km_clf.fit(documents_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

#### 5. Make predictions on all the data

In [19]:
cluster = km_clf.predict(documents_vec)

#### 6. Inspect the centroids
To find out what "topics" K-means has discovered, I must inspect the centroids. Print out the centroids of the K-means clustering.
These centroids are simply vectors. To make sense of them I need to map them back into 'word space'. Think of each centroid as the "average" review of its cluster: each feature/dimension holds the average TF-IDF weight of one word across the reviews in that cluster.

In [20]:
km_clf.cluster_centers_

array([[ 1.93795602e-03,  0.00000000e+00,  7.25715004e-03, ...,
        -1.73472348e-18,  8.13151629e-19, -2.71050543e-18],
       [ 0.00000000e+00,  0.00000000e+00,  9.55734865e-03, ...,
         5.01363055e-03,  1.01010525e-03,  2.88367898e-03]])
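
To make the "average review" reading of a centroid concrete, here is a small sketch on a made-up corpus. It checks that each fitted centroid equals the mean of the TF-IDF rows assigned to it. The reviews and the names `toy_reviews`, `toy_vec` and `toy_km` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus; the notebook uses the Yelp review text instead.
toy_reviews = [
    "great food great service",
    "great friendly service",
    "slow service long wait",
    "long wait cold food",
]
toy_vec = TfidfVectorizer().fit_transform(toy_reviews).toarray()
toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_vec)

for k in range(toy_km.n_clusters):
    # Rows (reviews) that K-means assigned to cluster k
    members = toy_vec[toy_km.labels_ == k]
    # At convergence the centroid is the per-word mean of those rows
    matches = np.allclose(members.mean(axis=0), toy_km.cluster_centers_[k])
    print("cluster %d: %d reviews, centroid equals member mean: %s"
          % (k, len(members), matches))
```

Because TF-IDF weights are never negative, their averages cannot be negative either. Tiny negative entries such as `-1.73472348e-18` in the array above are floating-point rounding noise and can be read as zero.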

#### 7. Find the top 40 features for each cluster.
For topics I am only really interested in the most prominent words, i.e. the features/dimensions with the largest values in the centroid. Print out the top 40 words for each centroid.

* Sort each centroid vector to find the top 40 features.
* Go back to my vectorizer object to find out which word each of these features corresponds to.


In [21]:
cluster_top_features = list()
for i in range(km_clf.n_clusters):
    # argsort is ascending, so reverse it and keep the indices of the 40 largest weights
    cluster_top_features.append(np.argsort(km_clf.cluster_centers_[i])[::-1][:40])
for num, centroid in enumerate(cluster_top_features):
    # Map feature indices back to words
    print('%d: %s' % (num, ", ".join(words[i] for i in centroid)))

0: food, good, place, really, restaurant, service, chicken, time, just, like, got, order, nice, ordered, menu, people, came, delicious, experience, bit, went, best, amazing, better, ve, try, don, sweet, wait, pretty, little, minutes, sauce, busy, tried, definitely, dish, tasty, taste, say
1: great, place, service, friendly, amazing, food, good, breakfast, staff, est, visit, atmosphere, super, nice, love, cooked, restaurant, delicious, le, family, definitely, et, best, spot, friends, recommend, sandwich, beer, time, pizza, hot, brisket, selection, just, value, excellent, really, fast, large, consistently


Both clusters seem to be dominated by positive words, which is surprising for a like/dislike split. Print out the rating and review text of a random sample from each cluster to get a sense of the clusters.

In [22]:
for i in range(km_clf.n_clusters):
    sub_cluster = np.arange(0, cluster.shape[0])[cluster == i]
    sample = np.random.choice(sub_cluster, 1)
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].iloc[sample[0]])
    print("The review is:\n%s.\n" % df['text'].iloc[sample[0]])

The cluster is 1.
The star is: 5 stars.
The review is:
Romados was fantastic. Super quick, fresh and clean food. Lots of portion for the price we paid. We paid about $12 dollars for two of us and still couldnt finish all the food. Chicken was really good. Tender and falling apart from the bone and very tasty. It should be must when visiting Montreal :).

The cluster is 2.
The star is: 5 stars.
The review is:
Absolutely delicious! And great service as well! Glad I forgot a lunch today and discovered this little gem!.
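
A single random review per cluster is a weak check. As a complementary sketch, the star ratings can be tabulated against the cluster labels. The example below uses made-up ratings and labels (`toy_stars`, `toy_cluster`), not the notebook's results.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings and cluster labels for six reviews.
toy_stars = pd.Series([5, 4, 1, 5, 2, 4], name="stars")
toy_cluster = np.array([0, 0, 1, 0, 1, 1])

# Share of each star rating within each cluster (every row sums to 1).
table = pd.crosstab(pd.Series(toy_cluster, name="cluster"), toy_stars,
                    normalize="index")
print(table.round(2))

# Mean rating per cluster as a one-number summary.
mean_stars = toy_stars.groupby(toy_cluster).mean()
print(mean_stars)
```

In the notebook, the same calls could take `cluster` and `df['stars']` as inputs. If both clusters show a similar star distribution, the two-cluster split is not tracking like/dislike.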



#### 8. Try a different k: k = 3

In [23]:
# Fit K-means again, this time with three clusters
km_3_clusters = KMeans(n_clusters=3, verbose=0)
km_3_clusters.fit(documents_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [24]:
cluster_top_features = list()
for i in range(km_3_clusters.n_clusters):
    cluster_top_features.append(np.argsort(km_3_clusters.cluster_centers_[i])[::-1][:40])
for num, centroid in enumerate(cluster_top_features):
    print('%d: %s' % (num+1, ",".join(words[i] for i in centroid)))

1: place,food,delicious,great,definitely,amazing,nice,good,service,sweet,breakfast,friendly,restaurant,toronto,try,menu,dinner,atmosphere,ordered,est,bar,dish,come,came,spot,cheese,area,just,super,best,le,experience,cream,dishes,got,little,like,love,et,tea
2: food,time,good,service,restaurant,place,great,staff,minutes,really,order,like,busy,wait,know,customer,better,amazing,nice,menu,visit,location,coming,going,best,came,make,drinks,bit,don,people,got,just,experience,10,bad,line,times,table,manager
3: food,good,chicken,place,really,just,ve,service,like,soup,got,meat,ordered,little,don,restaurant,went,pretty,burger,amazing,eat,order,friendly,rice,curry,indian,bit,delicious,taste,best,sauce,fast,great,nice,fresh,spice,people,pork,decent,price


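It seems that three clusters can capture positive, negative and in-between reviews. For example, the top words of cluster 2 include complaint-related terms such as "wait", "minutes", "bad" and "manager".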
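
Reading the top words is a qualitative comparison. One possible quantitative check, which the original notebook does not perform, is the silhouette score: higher values mean tighter, better-separated clusters. The sketch below runs on synthetic blobs rather than the review vectors; `toy_X`, `scores` and `best_k` are illustrative names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (an assumption for this demo).
toy_X, _ = make_blobs(n_samples=300,
                      centers=[[0, 0], [10, 10], [20, 20]],
                      cluster_std=0.5, random_state=0)

scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_X)
    # Mean silhouette over all points, in the range [-1, 1]
    scores[k] = silhouette_score(toy_X, labels)
    print("k=%d silhouette=%.3f" % (k, scores[k]))

best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

On the review data the inputs would be `documents_train_vec` and the labels from each fitted model. Since the score is based on pairwise distances, the `sample_size` argument of `silhouette_score` can be used to evaluate it on a subsample.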

In [25]:
three_cluster_pred = km_3_clusters.predict(documents_vec)

Print out the rating and review of a random sample of the reviews assigned to each cluster to get a sense of the cluster.

In [26]:
for i in range(km_3_clusters.n_clusters):
    sub_cluster = np.arange(0, three_cluster_pred.shape[0])[three_cluster_pred == i]
    sample = np.random.choice(sub_cluster, 1)
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].iloc[sample[0]])
    print("The review is:\n%s.\n" % df['text'].iloc[sample[0]])

The cluster is 1.
The star is: 1 stars.
The review is:
This was our first visit to this restaurant and we will not go back. We did feel sorry for the server who was serving the entire place and her service was adequate given what she had to do. 

One order was missing the sausages that were supposed to be there. The other order was supposed to have swiss cheese not cheddar on the sandwich which was a very disappointing Monte Cristo.  (Very little egg in the French toast). All the home fries were undercooked, tasteless and too cold to enjoy. The portion was large though but we left most of them on the plate.  

Owner...you need to get more staff in the kitchen and serving. The place was busy for a weekday but they may not come back..

The cluster is 2.
The star is: 5 stars.
The review is:
The food was very good! It wasn't extremely to die for I'm moving to Canada for this restaurant, but it was very good! The owner and waitress were really nice. They are very child friendly and I love t