# Yelp Data Challenge - Clustering Analysis
Summary:
- From sentiment analysis, I concluded that the rating scores are unreasonable in the 4-5 star and 1-2 star cases. I suggest replacing the 1-5 rating system with either a like/dislike system or a below_average/average/above_average system. Here I want to determine which one is better through clustering analysis.
- The clustering method is K-means. It turns out that three clusters (below_average/average/above_average) fit the reviews better than two clusters (like/dislike).


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")

In [2]:
df = pd.read_csv('/Users/LiangTan/Documents/BitTigerDS/Yelp/Yelp_Data_Challenge_Project/dataset/last_year_restaurant_reviews.csv')
df.head(2)

| Unnamed: 0 | business_id | name | categories | avg_stars | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | --9e1ONYQuAa-CB_Rrw7Tw | Delmonico Steakhouse | ['Cajun/Creole', 'Steakhouses', 'Restaurants'] | 4.0 | 0 | 2017-02-14 | 0 | Xp3ppynEvVu1KxDHQ3ae8w | 5 | Delmonico Steakhouse is a steakhouse owned by ... | 0 | KC8H7qTZVPIEnanw9fG43g |
| 9 | --9e1ONYQuAa-CB_Rrw7Tw | Delmonico Steakhouse | ['Cajun/Creole', 'Steakhouses', 'Restaurants'] | 4.0 | 1 | 2017-05-28 | 0 | LEzphAnz0vKE32PUCbjLgQ | 4 | One of the top steak places I've had in Vegas ... | 2 | 3RTesI_MAwct13LWm4rhLw |


# Cluster the review text data for all the restaurants

#### 1. Define my feature variables - the text of the review

In [3]:
documents = df['text']

#### 2. Create training dataset and test dataset

In [4]:
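# sklearn.cross_validation was removed in newer scikit-learn releases;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
# The clustering algorithm is slow, so I fit on a sample to find the optimal number of clusters
documents_train, documents_test = train_test_split(documents, test_size=0.3)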
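
The split above is random, so the sample (and therefore the clusters) can change between runs. A minimal sketch of a repeatable split, assuming a fixed `random_state` is acceptable; the `toy_documents` series and the seed value 42 are hypothetical stand-ins, not part of the original notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df['text']; the notebook uses the Yelp reviews.
toy_documents = pd.Series(["review number %d" % i for i in range(10)])

# The same random_state gives the same partition on every call.
train_a, test_a = train_test_split(toy_documents, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(toy_documents, test_size=0.3, random_state=42)

print(len(train_a), len(test_a))             # sizes of the two parts
print(train_a.index.equals(train_b.index))   # identical rows in both runs
```

With a fixed seed, later cells can be rerun without the training sample changing underneath them.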



#### 3. Get NLP representation of the documents

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer = 'word', stop_words = 'english', 
                             lowercase = True, max_features = 5000
                            )
# Train the model with my training data
documents_train_vec = vectorizer.fit_transform(documents_train).toarray()
# Get the vocab of my tfidf
words = vectorizer.get_feature_names()
# Use the trained model to transform all the reviews
documents_vec = vectorizer.transform(documents).toarray()
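
Calling `.toarray()` turns the TF-IDF output into a dense matrix with one column per vocabulary word, which can use a lot of memory on the full review set. As a hedged alternative, scikit-learn's `KMeans` also accepts the sparse matrix directly. The sketch below uses a hypothetical `toy_reviews` list instead of the Yelp data, and the `random_state` is an assumption added for repeatability:

```python
from scipy.sparse import issparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the Yelp review text.
toy_reviews = [
    "great food and friendly staff",
    "amazing service, great atmosphere",
    "waited forty minutes, rude manager",
    "order took forever and the server was rude",
    "delicious pizza, friendly service",
    "cold food, long wait, bad experience",
]

toy_vectorizer = TfidfVectorizer(analyzer="word", stop_words="english",
                                 lowercase=True, max_features=5000)
# fit_transform returns a SciPy sparse matrix; no .toarray() call is made.
sparse_vec = toy_vectorizer.fit_transform(toy_reviews)

km_sparse = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = km_sparse.fit_predict(sparse_vec)

print(issparse(sparse_vec), sparse_vec.shape)
print(toy_labels)
```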

#### 4. Cluster reviews with KMeans (two clusters)

Fit K-means on the training vectors, then apply it to all the data.

In [6]:
from sklearn.cluster import KMeans
km_clf = KMeans(verbose = 0,n_clusters = 2)
km_clf.fit(documents_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

#### 5. Make predictions on all the data

In [7]:
cluster = km_clf.predict(documents_vec)

#### 6. Inspect the centroids
To find out what "topics" K-means has discovered, I must inspect the centroids. Print out the centroids of the K-means clustering.
These centroids are simply a bunch of vectors. To make any sense of them I need to map these vectors back into 'word space'. Think of each centroid as the "average" review of its cluster: each feature/dimension holds the average TF-IDF weight of one word across the reviews in that cluster.
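
To make the "average review" idea concrete, here is a small hand-made illustration. The four-word vocabulary and the weights are hypothetical, chosen only to show that a centroid is the column-wise mean of its cluster's vectors and that sorting it recovers the dominant words:

```python
import numpy as np

# Hypothetical vocabulary and TF-IDF-like weights for three reviews
# that all belong to the same cluster (one row per review).
toy_vocab = ["great", "pizza", "rude", "service"]
toy_cluster_vectors = np.array([
    [0.8, 0.0, 0.0, 0.5],
    [0.6, 0.0, 0.0, 0.6],
    [0.7, 0.1, 0.0, 0.4],
])

# A k-means centroid is the mean of the vectors assigned to the cluster.
toy_centroid = toy_cluster_vectors.mean(axis=0)

# Indices of the two largest centroid weights, largest first.
top_two = np.argsort(toy_centroid)[::-1][:2]
top_words = [toy_vocab[i] for i in top_two]

print(toy_centroid)
print(top_words)
```

Words that appear with high weight in most reviews of a cluster dominate its centroid, which is why the top-weighted dimensions can be read as the cluster's "topic".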

In [8]:
km_clf.cluster_centers_

array([[5.12012020e-04, 7.61292393e-05, 3.80931978e-03, ...,
        2.10223635e-04, 8.32331572e-05, 3.02091912e-04],
       [2.48614214e-03, 2.90765899e-04, 8.00771057e-03, ...,
        1.38446230e-03, 1.41921420e-04, 6.51843630e-04]])

#### 7. Find the top 40 features for each cluster.
For topics I am only really interested in the most prominent words, i.e. the features/dimensions with the largest weights in the centroid. Print out the top 40 words for each centroid.

* Sort each centroid vector to find the top 40 features.
* Go back to my vectorizer object to find out which word each of these features corresponds to.


In [10]:
cluster_top_features = list()
for i in range(km_clf.n_clusters):
    cluster_top_features.append(np.argsort(km_clf.cluster_centers_[i])[::-1][:40])
# Map each centroid's top feature indices back to vocabulary words
for num, centroid in enumerate(cluster_top_features):
    print('%d: %s' % (num, ", ".join(words[i] for i in centroid)))

0: great, food, service, place, amazing, good, friendly, love, best, staff, vegas, definitely, delicious, recommend, awesome, excellent, time, atmosphere, nice, restaurant, come, sushi, really, fresh, highly, try, favorite, experience, ve, customer, eat, super, server, menu, breakfast, prices, fast, clean, las, coming
1: good, food, place, just, like, time, ordered, order, chicken, service, really, got, came, don, vegas, pizza, ve, restaurant, try, delicious, best, come, didn, nice, menu, wait, eat, great, little, better, definitely, pretty, burger, went, cheese, fries, sauce, did, love, minutes


It seems that both clusters capture positive reviews, which is quite strange. Print out the rating and text of one randomly sampled review from each cluster to get a sense of the clusters.

In [16]:
for i in range(km_clf.n_clusters):
    sub_cluster = np.arange(0, cluster.shape[0])[cluster == i]
    sample = np.random.choice(sub_cluster, 1)
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].iloc[sample[0]])
    print("The review is:\n%s.\n" % df['text'].iloc[sample[0]])

The cluster is 1.
The star is: 5 stars.
The review is:
Our favorite restaurant in Vegas just got better!  The premium omakase was the best that we've ever had here.  The roasted fresh Alaskan King Crab was the ultimate indulgence.  Can't say enough.  The food was fresh and well prepared, the staff was great and it made our anniversary a special one..

The cluster is 2.
The star is: 4 stars.
The review is:
Went in the other night and really enjoyed my visit. The bartenders were very friendly and greeted me as soon as I sat down. They were both very attentive. I ordered some food to take home. The mushroom Swiss burger was really good. The chicken philly was a little lack luster but it tasted ok. The cheese was just on half of the sandwich and it could have really used some kind of sauce like mayo on it. 
I'll definitely be going back again just probably won't be ordering the chicken philly again..
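
A single random review per cluster is a noisy check. One possible complement, which is not part of the original analysis, is to cross-tabulate cluster labels against star ratings. The sketch below uses hypothetical `toy_cluster` and `toy_stars` values; in the notebook the corresponding inputs would presumably be `cluster` and `df['stars']`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the predicted labels and the star ratings.
toy_cluster = np.array([0, 0, 1, 1, 0, 1, 1, 0])
toy_stars = pd.Series([5, 4, 4, 2, 5, 1, 3, 5], name="stars")

# Rows are clusters, columns are star ratings, cells are review counts.
star_table = pd.crosstab(pd.Series(toy_cluster, name="cluster"), toy_stars)

# Average rating per cluster as a one-number summary.
mean_stars = toy_stars.groupby(toy_cluster).mean()

print(star_table)
print(mean_stars)
```

If both clusters show similar star distributions, that would support the impression that neither cluster isolates negative reviews.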



#### 8. Try a different k (k = 3)

In [11]:
# Fit K-means with three clusters on the training vectors
km_3_clusters = KMeans(verbose = 0, n_clusters = 3)
km_3_clusters.fit(documents_train_vec)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [13]:
cluster_top_features = list()
for i in range(km_3_clusters.n_clusters):
    cluster_top_features.append(np.argsort(km_3_clusters.cluster_centers_[i])[::-1][:40])
for num, centroid in enumerate(cluster_top_features):
    print ('%d: %s' % (num+1, ",".join(words[i] for i in centroid)))

1: good,food,place,chicken,like,best,really,vegas,just,delicious,ordered,time,service,try,ve,restaurant,love,nice,definitely,got,menu,great,amazing,fresh,come,little,eat,burger,don,pretty,came,sushi,fries,sauce,friendly,rice,better,favorite,cheese,staff
2: great,food,service,place,amazing,good,friendly,staff,love,definitely,best,vegas,delicious,awesome,excellent,recommend,time,atmosphere,nice,come,restaurant,really,customer,experience,highly,fresh,server,try,sushi,prices,drinks,eat,favorite,super,menu,breakfast,ve,fast,clean,coming
3: pizza,order,food,time,minutes,just,service,came,didn,like,place,got,ordered,said,asked,good,wait,took,don,table,told,people,manager,come,server,customer,did,went,bad,really,waited,restaurant,know,10,long,waiting,location,wasn,rude,ve


It seems that three clusters capture the positive (cluster 2), negative (cluster 3) and in-between (cluster 1) types of review.
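
The comparison between two and three clusters above rests on reading the top words. One way to put a number on it, offered here as an assumption rather than the method used in this notebook, is the silhouette score from `sklearn.metrics`, where higher values indicate tighter and better-separated clusters. The `toy_corpus` below is hypothetical:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus standing in for the Yelp review text.
toy_corpus = [
    "great food and amazing friendly staff",
    "amazing service and great atmosphere",
    "friendly staff, excellent food, highly recommend",
    "good chicken, pretty good fries",
    "burger was good, sauce was a little bland",
    "pretty good sushi, rice was a little dry",
    "waited forty minutes, rude manager",
    "order took forever and the server was rude",
    "bad service, waited a long time for a table",
]

# The toy corpus is tiny, so a dense array is fine here.
X_toy = TfidfVectorizer(stop_words="english").fit_transform(toy_corpus).toarray()

scores = {}
for k in (2, 3):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels_k = km.fit_predict(X_toy)
    # Silhouette lies in [-1, 1]; higher means better-separated clusters.
    scores[k] = silhouette_score(X_toy, labels_k)
    print(k, round(scores[k], 3))
```

On the real data the score would be computed on the sampled training vectors, since silhouette needs pairwise distances and becomes expensive on large inputs.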

In [14]:
three_cluster_pred = km_3_clusters.predict(documents_vec)

Print out the rating and text of one randomly sampled review from each cluster to get a sense of the clusters.

In [15]:
for i in range(km_3_clusters.n_clusters):
    sub_cluster = np.arange(0, three_cluster_pred.shape[0])[three_cluster_pred == i]
    sample = np.random.choice(sub_cluster, 1)
    print("The cluster is %d." % (i+1))
    print("The star is: %s stars." % df['stars'].iloc[sample[0]])
    print("The review is:\n%s.\n" % df['text'].iloc[sample[0]])

The cluster is 1.
The star is: 4 stars.
The review is:
At the Venetian and wanted something quick to go, I decided on getting a gourmet burger from a restaurant owned by chefs Mario Batali and Joe Bastianich -  "B&B" for Burger and Beer or Batali and Bastianich - I guess it can go either way.

While waiting at the bar for my "Drive Thru" takeout cheeseburger, the waitress was kind enough to offer me a fountain drink as my order was taking longer than expected, 

The burger was big and moist oozing with cheese and prepared just right. The staff was friendly and service was great! I would recommend B&B in Vegas if you are looking for a decent burger in a casual dining environment..

The cluster is 2.
The star is: 5 stars.
The review is:
In town once a year always take business friends with me, great food and service.
Looking forward to next years meal..

The cluster is 3.
The star is: 1 stars.
The review is:
They refused to give me service for not having an ID!!! Ridiculous they should h