**NOTE**: This file should be in the same folder as your `PA_businesses.json` and `PA_reviews_full.json` files and the `cached_api` folder.

# Yelp Homework Solutions

In [4]:
import re
import numpy as np
import pandas as pd
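# pandas.io.json.json_normalize has been deprecated since pandas 1.0;
# import it from the top-level namespace instead.
from pandas import json_normalize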

In [5]:
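# Load one top-level key of a large JSON file and flatten its records into a Pandas dataframe.
def fetch_yelp(k, takeout_file):
    with open(takeout_file) as in_file:
        raw = pd.read_json(in_file)
    # raw[k] holds one nested record per row (a business or a review).
    records = raw[k]
    # Flatten nested fields into dotted column names such as attributes.WiFi.
    yelp = json_normalize(records)
    print(yelp.shape)
    print(yelp.columns)
    return yelp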
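
To see what the flattening step does, here is a minimal sketch on two invented records (not taken from the Yelp files). It assumes pandas 1.0 or later, where `pd.json_normalize` is available at the top level.

```python
import pandas as pd

# Invented records that mimic the nesting in the Yelp files (not real data).
records = [
    {"business_id": "a1", "stars": 4.0, "attributes": {"WiFi": "free", "HasTV": True}},
    {"business_id": "b2", "stars": 2.5, "attributes": {"WiFi": "no"}},
]

# Nested dict keys become dotted column names; keys missing from a record become NaN.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat)
```

This is why the real dataframe has columns such as `attributes.WiFi`: each nested attribute gets its own column.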

In [6]:
businesses_yelp = fetch_yelp("businesses", "PA_businesses.json")
reviews_yelp = fetch_yelp("reviews", "PA_reviews_full.json")


(4237, 56)
Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'is_open',
       'categories', 'hours', 'attributes.RestaurantsTakeOut',
       'attributes.BusinessParking', 'attributes.Ambience',
       'attributes.RestaurantsDelivery', 'attributes.RestaurantsReservations',
       'attributes.BusinessAcceptsCreditCards',
       'attributes.RestaurantsPriceRange2',
       'attributes.RestaurantsGoodForGroups', 'attributes.DriveThru',
       'attributes.GoodForKids', 'attributes.GoodForMeal', 'attributes.HasTV',
       'attributes.OutdoorSeating', 'attributes.CoatCheck',
       'attributes.HappyHour', 'attributes.Smoking', 'attributes.WiFi',
       'attributes.RestaurantsTableService', 'attributes.Alcohol',
       'attributes.Caters', 'attributes.Music', 'attributes.BestNights',
       'attributes.WheelchairAccessible', 'attributes.BusinessAcceptsBitcoin',
       'attributes.GoodForDancing', 'attributes.BikePa

## Question 1

I accomplished this by grouping reviews by business ID, averaging the stars within each business, rounding each average to a whole star, and then measuring the size of the resulting rounded groups.

In [7]:
reviews_by_business = reviews_yelp.groupby('business_id')
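# numeric_only=True skips text columns, which newer pandas versions refuse to average
business_means = reviews_by_business.mean(numeric_only=True)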

In [8]:
business_means["rounded"] = round(business_means["stars"])
stars_count = business_means.groupby("rounded").size()

stars_percentages = stars_count / len(business_means)
print(stars_percentages)

rounded
1.0    0.009677
2.0    0.093698
3.0    0.316970
4.0    0.486193
5.0    0.093462
dtype: float64
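
One detail worth knowing about the rounding step: pandas follows NumPy and rounds exact halves to the nearest even integer. The sketch below uses made-up averages that sit exactly on the .5 boundaries. The "round half up" variant is only one possible alternative, not what the solution above uses.

```python
import numpy as np
import pandas as pd

# Hypothetical per-business averages chosen to sit exactly on .5 boundaries.
means = pd.Series([0.5, 1.5, 2.5, 3.5, 4.5])

# Series.round rounds halves to the nearest even integer.
half_even = means.round()

# A possible "round half up" alternative for non-negative values.
half_up = np.floor(means + 0.5)

print(pd.DataFrame({"mean": means, "half_even": half_even, "half_up": half_up}))
```

An average of exactly 2.5 therefore lands in the 2-star group, while 3.5 lands in the 4-star group. Businesses with many reviews rarely average exactly on a half, so the effect on the percentages is likely small.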


Alternatively, you can access business star ratings directly in the businesses JSON file, but these are in half-star increments.

In [9]:
# Step 1: count the businesses at each half-star rating
stars_count_direct = businesses_yelp.groupby("stars").size()
# Step 2: convert the counts to fractions of all businesses
stars_count_direct = stars_count_direct / len(businesses_yelp)
print(stars_count_direct)

stars
1.0    0.004956
1.5    0.015105
2.0    0.044843
2.5    0.084730
3.0    0.168515
3.5    0.240264
4.0    0.242152
4.5    0.164503
5.0    0.034930
dtype: float64


In [10]:
reviews_yelp["text_len"] = reviews_yelp["text"].apply(lambda x: len(x.split(" ")))

# The lambda above is equivalent to this named function:
# def find_text_length(x):
#    return len(x.split(" "))

print(reviews_yelp.loc[:,["text","text_len"]])

# Mean review length (in space-separated tokens) for each star rating
reviews_yelp.groupby("stars")["text_len"].mean()

                                                     text  text_len
0       I'll be the first to admit that I was not exci...       295
1       Wow. So surprised at the one and two star revi...       212
2       if i can give this place no stars i would, i o...       123
3       This place epitomizes the rumored transformati...        45
4       Here's why I don't write reviews for Chinese r...       174
...                                                   ...       ...
216851  managed not to mess up order like north hills ...        16
216852  Amazing! Drove up from cleveland to try it...n...        36
216853  Tasty authentic Mexican food! Eat lunch here f...        25
216854  I came for lunch time. It was full. The place ...        91
216855  This is the worst pizza and other food restaur...        32

[216856 rows x 2 columns]


stars
1.0    124.991192
2.0    141.032934
3.0    137.921116
4.0    121.083072
5.0     94.313278
Name: text_len, dtype: float64
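
A caveat about the counting rule used above: `x.split(" ")` splits only on single space characters, so repeated spaces produce empty tokens and newlines do not separate words. The sketch below compares it with `x.split()` on two invented snippets. Which rule counts as the "right" word count is a judgment call.

```python
import pandas as pd

# Invented review snippets (not from the Yelp data).
texts = pd.Series(["Great   food", "Line one\nline two"])

# Same counting rule as the cell above: split on single space characters.
by_space = texts.apply(lambda x: len(x.split(" ")))

# Alternative: split on any run of whitespace.
by_whitespace = texts.apply(lambda x: len(x.split()))

print(pd.DataFrame({"by_space": by_space, "by_whitespace": by_whitespace}))
```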

# Basic Statistics

## Star Distributions

| Stars | Percent (Option 1: rounded review means) | Percent (Option 2: businesses file) |
|-------|---------|---------|
| 1     | 0.97    | 0.50 |
| 1.5   |   N/A   | 1.51 |
| 2     | 9.37    | 4.48 |
| 2.5   |   N/A   | 8.47 |
| 3     | 31.70   | 16.85 |
| 3.5   |  N/A    | 24.03 |
| 4     | 48.62   | 24.22 |
| 4.5   |  N/A    | 16.45 |
| 5     | 9.35    | 3.49 |

## Word Count Distributions

| Stars | Mean Word Count |
|-------|-----------------|
| 1     | 125             |
| 2     | 141             |
| 3     | 138             |
| 4     | 121             |
| 5     | 94              |

## Notes

   - Star distributions are per-business while word count distributions are per-review. The following code shows a common error in which students reported the distribution of stars per review. The difference is most extreme at the bottom end: 1-star reviews make up about 10% of the reviews, but only about 1% of businesses have a rounded 1-star average.

In [11]:
review_counts_by_star = reviews_yelp.groupby("stars").size()
review_counts_by_star = review_counts_by_star / len(reviews_yelp)
print(review_counts_by_star)

stars
1.0    0.104189
2.0    0.093532
3.0    0.139830
4.0    0.281828
5.0    0.380621
dtype: float64
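
The gap between the two views is easier to see on a tiny invented dataset. The sketch below (made-up reviews, hypothetical business IDs) computes both distributions. One business attracts most of the reviews, so the per-review numbers are dominated by it.

```python
import pandas as pd

# Invented reviews: business "a" has four reviews, business "b" has three.
toy = pd.DataFrame({
    "business_id": ["a", "a", "a", "a", "b", "b", "b"],
    "stars":       [5, 5, 5, 1, 1, 2, 2],
})

# Per-review: every review counts once.
per_review = toy.groupby("stars").size() / len(toy)

# Per-business: average first, round, then count each business once.
business_avg = toy.groupby("business_id")["stars"].mean().round()
per_business = business_avg.value_counts(normalize=True).sort_index()

print(per_review)
print(per_business)
```

Here two of the seven reviews are 1-star, yet neither business has a 1-star average.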


## Question 2

This problem required breaking apart a string into its component parts. I did that using the `.split()` string method, though there are other options.

341 values is probably too many to include in full in a written report. I didn't take off any points for anyone who pasted the whole thing in, but a reasonable best practice is to put the full output in a separate file and give the reader only some highlights.
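
As a rough sketch of that practice, the snippet below writes a full summary table to a file and prints only the top rows. The four rows are copied from this notebook's printed results to stand in for the full 341-row table. The filename `label_summary.csv` and the column names are assumptions made for illustration.

```python
import pandas as pd

# A few rows from the printed results, standing in for the full per-label table.
summary = pd.DataFrame({
    "label": ["Cafes", "Fast Food", "Pizza", "Delis"],
    "stores": [166, 328, 690, 135],
    "mean_stars": [4.03, 2.83, 3.43, 3.89],
})

# Full output goes to a separate file (hypothetical filename).
summary.to_csv("label_summary.csv", index=False)

# Only a few highlights go in the written report.
highlights = summary.sort_values("mean_stars", ascending=False).head(2)
print(highlights)
```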

In [12]:
# Step 1: collect every distinct category label
possible_labels = []
for row in businesses_yelp["categories"]:
    labels = row.split(", ")
    for label in labels:
        if label not in possible_labels:
            possible_labels.append(label)
print(f"There are {len(possible_labels)} labels.")

There are 341 labels.
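
One of the other options is to let pandas do the splitting. The sketch below, on an invented `categories` column, uses `str.split` followed by `explode` (available in pandas 0.25 and later) to get one row per label. It is an alternative to the loop above, not the method used for the reported numbers.

```python
import pandas as pd

# Invented categories column in the same comma-separated format.
toy = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "categories": ["Bars, Nightlife", "Sports Bars, Bars", "Pizza"],
})

# Split each string into a list, then give every label its own row.
labels = toy["categories"].str.split(", ").explode()

print(f"There are {labels.nunique()} labels.")
print(labels.value_counts())
```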


In [13]:
for label in possible_labels:
    # Step 1: select the businesses whose categories string contains this label.
    # Note: str.contains() treats the label as a regular expression and matches
    # substrings, so "Bars" also matches "Sports Bars".
    subset_businesses = businesses_yelp.loc[businesses_yelp.categories.str.contains(label)]
    # Step 2: keep the per-business mean ratings for those businesses.
    # business_means (computed in Question 1) is indexed by business_id, and
    # Index.isin() returns a boolean array marking the index values found in the subset.
    subset_stats = business_means.loc[business_means.index.isin(subset_businesses.business_id)]
    subset_mean = subset_stats.stars.mean()

    # Don't print everything; that would be too much output.
    # The cutoff of 100 is a subjective choice: only labels with more than 100 businesses are shown.
    if len(subset_stats) > 100:
        print(f"{label}: {len(subset_stats)} stores, the mean is {subset_mean:.2f}")

Sandwiches: 572 stores, the mean is 3.54
Salad: 213 stores, the mean is 3.49
Restaurants: 3512 stores, the mean is 3.51
Burgers: 329 stores, the mean is 2.99
Nightlife: 686 stores, the mean is 3.53
Bars: 826 stores, the mean is 3.59
Beer: 169 stores, the mean is 3.75
Wine & Spirits: 127 stores, the mean is 3.79
Food: 1894 stores, the mean is 3.62
Fast Food: 328 stores, the mean is 2.83
Pizza: 690 stores, the mean is 3.43
Delis: 135 stores, the mean is 3.89
Cafes: 166 stores, the mean is 4.03
Bakeries: 180 stores, the mean is 3.87
Event Planning & Services: 201 stores, the mean is 3.81
Caterers: 144 stores, the mean is 3.81
Desserts: 167 stores, the mean is 3.93
Ice Cream & Frozen Yogurt: 162 stores, the mean is 3.91
Italian: 412 stores, the mean is 3.47
Grocery: 194 stores, the mean is 3.54
Coffee & Tea: 382 stores, the mean is 3.64
Breakfast & Brunch: 333 stores, the mean is 3.55
Convenience Stores: 101 stores, the mean is 3.33
Specialty Food: 179 stores, the mean is 4.04
Chicken Wings: 191 stores, the mean is 3.13
Mexican: 175 stores, the mean is 3.38
Sports Bars: 114 stores, the mean is 3.25
Seafood: 146 stores, the mean is 3.50
Diners: 156 stores, the mean is 3.51
Shopping: 146 stores, the mean is 3.60
Chinese: 227 stores, the mean is 3.45
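
Because the loop matches labels as substrings, some counts overlap (for example, any "Sports Bars" business is also counted under "Bars"). The sketch below shows one way to match labels exactly instead, using invented businesses and ratings. It would give different counts from the ones printed above, so treat it as an alternative rather than a reproduction of those results.

```python
import pandas as pd

# Invented data; "Bars" is a substring of "Sports Bars".
businesses = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "categories": ["Bars, Nightlife", "Sports Bars", "Bars, Pizza"],
})
business_means = pd.DataFrame({"stars": [4.0, 3.0, 2.0]}, index=["a", "b", "c"])

# Substring matching counts "Sports Bars" as "Bars" too.
substring_hits = businesses["categories"].str.contains("Bars", regex=False).sum()

# Exact matching: one row per (business, label) pair, then attach the mean stars.
exploded = businesses.assign(label=businesses["categories"].str.split(", ")).explode("label")
exploded = exploded.join(business_means, on="business_id")
per_label = exploded.groupby("label")["stars"].agg(["size", "mean"])

print(substring_hits)
print(per_label)
```

With exact matching, "Bars" covers two businesses here instead of three.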
