## Using Yelp Open Dataset

##### Currently, a user's opinion of a restaurant on Yelp is heavily influenced by only a few aspects displayed on the business's Yelp page. Here are just a few:

1. Number of reviews for the restaurant
2. The overall rating of the restaurant (out of 5 stars)
3. Photos
4. The top listed reviews picked by Yelp
5. Popular dishes

Despite Yelp's attempt to give users an overview of what the restaurant is like, it is still cumbersome for users to browse through all this information and make an informed decision that leads to a positive experience overall.


This project is an attempt to improve the overall user experience by analyzing all of a restaurant's reviews and summarizing the main topics in the form of unigram adjective clouds, one describing each main topic.

#### Set up MongoClient 

In [1]:
from pymongo import MongoClient
from pprint import pprint

In [2]:
# With no arguments, MongoClient connects to localhost on port 27017
client = MongoClient()

In [3]:
# List the databases available on the MongoDB server
client.list_database_names()

['admin', 'books', 'config', 'local', 'my_tool_database', 'outings', 'yelp']

#### Access and store yelp database

In [4]:
yelp_db = client.yelp

In [5]:
# a list of collections in the yelp database
yelp_db.list_collection_names()

['review', 'business']

#### Create some helper functions to query specific attributes from each collection

In [6]:
def yelp_business(feature, limit):
    query = yelp_db.business.find({}, {'_id': 0, feature: 1}).limit(limit)
    return list(query)

In [7]:
def yelp_review(feature, limit):
    query = yelp_db.review.find({}, {'_id': 0, feature: 1}).limit(limit)
    return list(query)
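
The two helpers above differ only in the collection they query, so they could be folded into one function that takes the collection as an argument. The following is a minimal sketch, not part of the original notebook: `query_feature` is a hypothetical name, and `collection` is assumed to be a pymongo collection such as `yelp_db.business` or `yelp_db.review`.

```python
def query_feature(collection, feature, limit):
    """Return up to `limit` documents that contain only `feature`.

    Hypothetical helper: `collection` is assumed to behave like a
    pymongo Collection (it must offer find(...).limit(...)).
    """
    # The projection drops _id and keeps only the requested field.
    cursor = collection.find({}, {'_id': 0, feature: 1}).limit(limit)
    return list(cursor)
```

Under these assumptions, `query_feature(yelp_db.business, 'name', 5)` would play the role of `yelp_business('name', 5)`, and `query_feature(yelp_db.review, 'text', 5)` the role of `yelp_review('text', 5)`.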

#### Import pandas and numpy

In [8]:
import pandas as pd, numpy as np

#### Let's grab information from the business collection and turn it into a dataframe

In [9]:
business_info = list(yelp_db.business.find({}))

In [10]:
business_df = pd.DataFrame(business_info)

In [11]:
business_df.shape

(192609, 15)

#### Let's grab information from the review collection and turn it into a dataframe

In [12]:
review_info = list(yelp_db.review.find({}))

In [13]:
review_df = pd.DataFrame(review_info)
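
Materializing the whole review collection with `list(...)` holds every document in memory at once. If that becomes a problem, one option is to consume the cursor in fixed-size batches. The sketch below is an assumption about how this could be done, not what the notebook does: `iter_batches` is a hypothetical helper that works on any iterable, including a pymongo cursor.

```python
from itertools import islice


def iter_batches(cursor, batch_size):
    """Yield lists of at most `batch_size` items from any iterable."""
    iterator = iter(cursor)
    while True:
        # islice pulls the next batch_size items without loading the rest.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield batch


# Hypothetical usage with the review collection (not run here):
# for batch in iter_batches(yelp_db.review.find({}), 100_000):
#     process(pd.DataFrame(batch))
```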

In [15]:
# Concatenate all reviews of each business into one string, separated by '@@@@'
groupby_reviews_per_business = review_df.groupby('business_id').agg({
    'text': lambda x: '@@@@'.join(x)
}).reset_index()
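
To make the aggregation concrete, here is a small sketch on made-up data. The toy rows and the names `toy_reviews`, `SEPARATOR`, and `review_list` are illustrative assumptions; only the `groupby(...).agg(...)` pattern comes from the notebook. It also shows that splitting on the separator recovers the individual reviews, provided no review text contains `@@@@` itself.

```python
import pandas as pd

# Toy stand-in for review_df; the rows are invented for illustration.
toy_reviews = pd.DataFrame({
    'business_id': ['b1', 'b1', 'b2'],
    'text': ['Great tacos.', 'Slow service.', 'Cozy spot.'],
})

SEPARATOR = '@@@@'

# Same aggregation pattern as above: one row per business, reviews joined.
joined = toy_reviews.groupby('business_id').agg({
    'text': lambda x: SEPARATOR.join(x)
}).reset_index()

# Splitting on the separator turns the joined string back into a list.
joined['review_list'] = joined['text'].apply(lambda s: s.split(SEPARATOR))
print(joined)
```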

#### Now we can use join to combine our dataframes on business_id

In [21]:
yelp_df = business_df.join(groupby_reviews_per_business.set_index('business_id'), on='business_id')

In [24]:
yelp_df.shape

(192609, 16)
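
The row count matches `business_df` (192609) and only one column was added, which is what a left join produces: `DataFrame.join` defaults to `how='left'`, so every business is kept and any business without reviews gets a missing value in `text`. A minimal sketch on made-up data (the toy frames and names below are illustrative assumptions):

```python
import pandas as pd

# Toy stand-ins for business_df and the grouped review text.
toy_business = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b3'],
    'name': ['Taco Hut', 'Cafe Uno', 'Noodle Bar'],
})
toy_text = pd.DataFrame({
    'business_id': ['b1', 'b2'],
    'text': ['Great tacos.', 'Cozy spot.'],
})

# Same join pattern as above; how='left' is the default.
toy_yelp = toy_business.join(toy_text.set_index('business_id'), on='business_id')

# Businesses with no reviews end up with a missing value in 'text'.
missing = toy_yelp['text'].isna()
print(toy_yelp.shape, int(missing.sum()))
```

On the real data, `yelp_df['text'].isna().sum()` would report how many businesses have no review text.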

#### Pickle Checkpoint

In [3]:
import pandas as pd, numpy as np

In [None]:
yelp_df.to_pickle('./Data/yelp_df.pkl')

In [4]:
yelp_df = pd.read_pickle('./Data/yelp_df.pkl')
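
Note that `to_pickle` does not create missing directories, so `./Data` has to exist before the checkpoint is written. The sketch below shows the same round trip on a toy frame in a temporary directory; the frame and the path are illustrative assumptions, not the notebook's data.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for yelp_df.
toy_df = pd.DataFrame({
    'business_id': ['b1', 'b2'],
    'stars': [4.5, 3.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    data_dir = os.path.join(tmp_dir, 'Data')
    # Create the target directory first; to_pickle will not do it.
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, 'toy_df.pkl')

    toy_df.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame should match the original exactly.
same = toy_df.equals(restored)
print(same)
```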