In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Yelp Reviews and Restaurants

## **Motivation**
We are working with two datasets from Yelp: one containing a subset of businesses in the USA and Canada, and one containing reviews of these businesses. We chose these datasets because we thought it would be fun and interesting to analyse review trends both over time and geographically. Furthermore, the datasets are large enough and contain enough information to hopefully yield a lot of interesting insights. As the data covers only a subset of the cities on Yelp, we chose to work specifically with the Philadelphia part, as this was the largest city present.

In the end, we want to inform the reader not only about interesting review trends worth taking into account, but also showcase the ratings of specific areas to guide them on where to go next for a good experience.


## **Basic Stats**
In this section we cover the two datasets and our choices in data preparation and preprocessing, as well as some basic stats of the data.

### *Data: Businesses*
The full business dataset contains ~150,000 businesses, and for each of these it has:
- Name
- Business ID
- Location 
- Average rating from 1-5
- Number of reviews
- Categories (Hotel, Restaurant, etc.)
- Various attributes (such as parking or payment options)
- Opening hours

First, we filter the business data to keep only the businesses in Philadelphia:

In [11]:
# Load and filter the business dataset
df_business = pd.read_json('../data/yelp_academic_dataset_business.json', lines=True)
df_business = df_business[df_business['city'] == 'Philadelphia']
print(f"Number of businesses in Philadelphia: {len(df_business)}")
print("Sample data from the business dataset:")
# Display the first row of the filtered dataset
df_business.head(1)

Number of businesses in Philadelphia: 14569
Sample data from the business dataset:


| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |


We then want to find those that are labeled as restaurants:

In [12]:
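# Collect the IDs of businesses whose category string mentions 'Restaurants'
# (missing category values are treated as non-matches)
is_restaurant = df_business['categories'].str.contains('Restaurants', na=False, regex=False)
restaurant_ids = set(df_business.loc[is_restaurant, 'business_id'])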
print(f"Number of restaurants in Philadelphia: {len(restaurant_ids)}")


Number of restaurants in Philadelphia: 5852


We see that we have data on 5852 restaurants in Philadelphia.
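
The check above is a substring match on the comma-separated `categories` string, so it would also accept any category whose name merely contains the word "Restaurants". Whether that changes the count on the real data is not something we have verified. As a minimal sketch on a made-up toy frame (the rows, the category name in the last row, and the helper `has_category` are all hypothetical), an exact match could split the string first:

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the Yelp dataset.
toy = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "categories": [
        "Restaurants, Food, Coffee & Tea",
        "Hotels, Event Planning & Services",
        None,                                   # missing categories
        "Pop-Up Restaurants Supply, Shopping",  # hypothetical category name
    ],
})

def has_category(categories, wanted="Restaurants"):
    """Return True if `wanted` is exactly one of the comma-separated categories."""
    if not isinstance(categories, str):
        return False
    return wanted in [c.strip() for c in categories.split(",")]

# Exact match: only rows where 'Restaurants' is a category of its own.
exact_ids = set(toy.loc[toy["categories"].apply(has_category), "business_id"])

# Substring match: same logic as the cell above.
substring_mask = toy["categories"].str.contains("Restaurants", na=False, regex=False)
substring_ids = set(toy.loc[substring_mask, "business_id"])

print("exact:", sorted(exact_ids))
print("substring:", sorted(substring_ids))
```

On this toy frame the two approaches disagree on row `d` only, which shows the kind of case where the choice would matter.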

### *Data: Reviews*
The full review dataset contains ~7,000,000 reviews, and for each of these it has:
- Review ID
- User ID
- Business ID
- Rating
- Other users' votes on the review (useful, funny, cool)
- Textual review
- Date and time of day

First, we want to look solely at the reviews of the restaurants we have data on:

In [13]:
df_reviews = pd.read_json('../data/philadelphia_restaurant_reviews.json', lines=True)
# Filter reviews to include only those for restaurants in Philadelphia
df_reviews = df_reviews[df_reviews['business_id'].isin(restaurant_ids)]
print(f"Number of reviews for restaurants in Philadelphia: {len(df_reviews)}")
print("Sample data from the reviews dataset:")
# Display the first row of the filtered dataset
df_reviews.head(1)

Number of reviews for restaurants in Philadelphia: 687289
Sample data from the reviews dataset:


| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 |


We see that we have just under 700,000 reviews of restaurants in Philadelphia.
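
The review cell above reads a file named `philadelphia_restaurant_reviews.json`; how that file was derived from the full review dataset is not shown here. One possible way to build such a subset without loading all ~7 million reviews into memory at once is to read the JSON Lines file in chunks. The sketch below is an assumption, not the procedure actually used: the helper `filter_reviews`, the toy rows, and the temporary file name are all hypothetical.

```python
import json
import os
import tempfile

import pandas as pd

def filter_reviews(path, business_ids, chunksize=100_000):
    """Read a JSON Lines file chunk by chunk and keep only the rows
    whose business_id is in `business_ids`."""
    kept = []
    with pd.read_json(path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:
            kept.append(chunk[chunk["business_id"].isin(business_ids)])
    return pd.concat(kept, ignore_index=True)

# Toy reviews, invented for illustration; not taken from the Yelp dataset.
rows = [
    {"review_id": "x1", "business_id": "r1", "stars": 5},
    {"review_id": "x2", "business_id": "h1", "stars": 3},
    {"review_id": "x3", "business_id": "r2", "stars": 4},
    {"review_id": "x4", "business_id": "h2", "stars": 1},
    {"review_id": "x5", "business_id": "r1", "stars": 2},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_reviews.json")
    # Write one JSON object per line, matching the lines=True format.
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # A tiny chunksize so the toy file is actually read in several chunks.
    subset = filter_reviews(path, {"r1", "r2"}, chunksize=2)

print(f"Kept {len(subset)} of {len(rows)} toy reviews")
```

With the real data, `business_ids` would be the `restaurant_ids` set from earlier and `path` the full review file; the chunk size is a memory/speed trade-off that would need tuning.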

## **Data Analysis**


## **Genre**


## **Visualizations**


## **Discussion**


## **Contributions**
