# Yelp Data Analysis

Goal:
Identify one American city with sufficient data to analyze seasonal tourism patterns 
and customer sentiment across multiple years.

Requirements:
1. Minimum 1000 reviews for hospitality businesses
2. Balanced seasonal distribution (all 4 seasons represented)
3. Multiple years of data for comparison
4. Focus on restaurants and hotels (primary tourism touchpoints)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import sys

In [3]:
# Import helper functions from the "scripts" package

from scripts import data_extraction as d_extract
from scripts import data_filtering as d_filter
from scripts import feature_engineering as f_eng
from scripts import visualization as viz

##   Yelp Dataset Structure

In [4]:
# Count of the total number of JSON records (businesses)

total_businesses = d_extract.count_records('../data/raw/yelp_academic_dataset_business.json')


Total records: 150,346
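
The Yelp academic dataset files are newline-delimited JSON (one record per line), so counting records amounts to counting non-empty lines without loading the file into memory. Below is a minimal sketch of how a helper like `count_records` could work. The real implementation in `scripts.data_extraction` is not shown in this notebook, and the name `count_json_lines` is hypothetical.

```python
def count_json_lines(path):
    """Count non-empty lines in a newline-delimited JSON file."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            # Skip blank lines so trailing newlines are not counted as records
            if line.strip():
                count += 1
    return count
```

Streaming line by line keeps memory use flat, even for the multi-gigabyte review file.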


In [5]:
# Load all businesses
df_business = d_extract.extract_all_businesses('../data/raw/yelp_academic_dataset_business.json')

Processed 50,000 businesses...
Processed 100,000 businesses...
Processed 150,000 businesses...

Total businesses loaded: 150,346


In [6]:
# Count of the total number of reviews

total_reviews = d_extract.count_records(
    '../data/raw/yelp_academic_dataset_review.json', 
    'reviews'
)


Total reviews: 6,990,280


In [38]:
# What does the data look like? Let's print one sample business record

d_extract.peek_json('../data/raw/yelp_academic_dataset_business.json', n=1)


--- Record 1 ---
{
  "business_id": "Pns2l4eNsfO8kk83dixA6A",
  "name": "Abby Rappoport, LAC, CMQ",
  "address": "1616 Chapala St, Ste 2",
  "city": "Santa Barbara",
  "state": "CA",
  "postal_code": "93101",
  "latitude": 34.4266787,
  "longitude": -119.7111968,
  "stars": 5.0,
  "review_count": 7,
  "is_open": 0,
  "attributes": {
    "ByAppointmentOnly": "True"
  },
  "categories": "Doctors, Traditional Chinese Medicine, Naturopathic/Holistic, Acupuncture, Health & Medical, Nutritionists",
  "hours": null
}
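
For reference, peeking at the first few records of a newline-delimited JSON file needs only the standard library. This is a sketch under the assumption that each line holds one JSON object. `peek_records` is a hypothetical name, not the `peek_json` helper used above.

```python
import json
from itertools import islice


def peek_records(path, n=1):
    """Return the first n records of a newline-delimited JSON file as dicts."""
    with open(path, "r", encoding="utf-8") as f:
        non_empty = (line for line in f if line.strip())
        # islice stops reading after n records, so the rest of the file is never parsed
        return [json.loads(line) for line in islice(non_empty, n)]
```

Each returned record can be pretty-printed with `json.dumps(record, indent=2)`.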



In [13]:
# Save a full businesses dataframe for reuse
#df_business.to_csv('all_businesses.csv', index=False)

In [12]:
#Checking the overall date range of reviews in the Yelp review dataset.

dataset_date_range = d_extract.check_dataset_date_range('../data/raw/yelp_academic_dataset_review.json')


Scanning Entire Review Dataset for Global Date Range
  Processed 1,000,000 lines... (Current range: 2005-03-01 17:47:15 → 2022-01-19 19:47:59, matches: 1,000,000)
  Processed 2,000,000 lines... (Current range: 2005-03-01 16:57:17 → 2022-01-19 19:47:59, matches: 2,000,000)
  Processed 3,000,000 lines... (Current range: 2005-03-01 16:57:17 → 2022-01-19 19:48:16, matches: 3,000,000)
  Processed 4,000,000 lines... (Current range: 2005-03-01 16:57:17 → 2022-01-19 19:48:16, matches: 4,000,000)
  Processed 5,000,000 lines... (Current range: 2005-02-16 03:23:22 → 2022-01-19 19:48:45, matches: 5,000,000)
  Processed 6,000,000 lines... (Current range: 2005-02-16 03:23:22 → 2022-01-19 19:48:45, matches: 6,000,000)

✓ Completed scan
--------------------------------------------------------------------------------
Earliest review: 2005-02-16 03:23:22
Latest review:   2022-01-19 19:48:45
--------------------------------------------------------------------------------
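
A global date range like the one above can be found in a single streaming pass that tracks the running minimum and maximum. The sketch below assumes each review record has a `date` field formatted as `YYYY-MM-DD HH:MM:SS`, as the printed timestamps suggest. `scan_date_range` is a hypothetical stand-in for `check_dataset_date_range`, whose code is not shown here.

```python
import json
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed from the timestamps printed above


def scan_date_range(lines):
    """Return (earliest, latest) datetimes from an iterable of JSON lines."""
    earliest = latest = None
    for line in lines:
        if not line.strip():
            continue
        ts = datetime.strptime(json.loads(line)["date"], DATE_FORMAT)
        if earliest is None or ts < earliest:
            earliest = ts
        if latest is None or ts > latest:
            latest = ts
    return earliest, latest
```

An open file handle can be passed directly as `lines`, since iterating over a file yields one line at a time.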


------------------------------------
**Because the review data ends in January 2022, leaving little post-pandemic coverage, we'll focus on pre-pandemic years (before 2020).**

##  Business Category Analysis

In [15]:
#Which business categories exist in the Yelp dataset?

category_distribution = d_filter.get_categories_distribution(df_business)

print(category_distribution.head(50).to_string())

                     category  business_count
0                 Restaurants           52268
1                        Food           27781
2                    Shopping           24395
3               Home Services           14356
4               Beauty & Spas           14292
5                   Nightlife           12281
6            Health & Medical           11890
7              Local Services           11198
8                        Bars           11065
9                  Automotive           10773
10  Event Planning & Services            9895
11                 Sandwiches            8366
12     American (Traditional)            8139
13                Active Life            7687
14                      Pizza            7093
15               Coffee & Tea            6703
16                  Fast Food            6472
17         Breakfast & Brunch            6239
18             American (New)            6097
19            Hotels & Travel            5857
20              Home & Garden     

____________________________
**As we can see in the record sample above, the categories are not exclusive: one business can have several categories.**
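
Because `categories` is a single comma-separated string, counting businesses per category means splitting it first. Below is a minimal standard-library sketch of that idea. `count_categories` is a hypothetical name, and the actual `get_categories_distribution` helper may work differently.

```python
from collections import Counter


def count_categories(category_strings):
    """Count how many businesses list each category."""
    counts = Counter()
    for raw in category_strings:
        if not raw:
            continue  # businesses with no categories contribute nothing
        # A set ensures a business is counted once per category, even if a label repeats
        labels = {part.strip() for part in raw.split(",") if part.strip()}
        counts.update(labels)
    return counts
```

Since one business contributes to several categories, the per-category counts sum to more than the number of businesses.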

Given the inconsistencies in the category names, we'll add a broader *tourism establishment type* classification to the entire business dataset. We'll focus on restaurants, cafes, and hotels, and assume that other descriptors, such as keywords indicating cuisine type (e.g. "pizza", "Chinese"), are complementary to these macro categories.

The function `d_filter.classify_tourism_establishment` identifies hospitality and food service businesses by keyword matching on the "categories" column.

**For the purposes of this analysis, we used the following keywords to identify *tourism establishment type*:**

- Hotels: "hotel", "hotels"
- Restaurants: "restaurant", "restaurants", "cafe", "cafes"

Note that bars are not included in this analysis, because customer behaviour in nightlife settings may differ significantly from the food service context.
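
The keyword rules above can be sketched as a small function. This is an illustration only: `classify_establishment_sketch` is a hypothetical name, and two details are assumptions not confirmed by the notebook, namely that matching is a case-insensitive substring search and that hotel keywords take priority when a business matches both groups.

```python
HOTEL_KEYWORDS = ("hotel", "hotels")
RESTAURANT_KEYWORDS = ("restaurant", "restaurants", "cafe", "cafes")


def classify_establishment_sketch(categories):
    """Map a raw categories string to Hotels / Restaurants / Other / Unknown."""
    # Missing values (None, NaN) or empty strings cannot be classified
    if not isinstance(categories, str) or not categories.strip():
        return "Unknown"
    text = categories.lower()
    # Assumption: hotels are checked first, so mixed businesses count as Hotels
    if any(keyword in text for keyword in HOTEL_KEYWORDS):
        return "Hotels"
    if any(keyword in text for keyword in RESTAURANT_KEYWORDS):
        return "Restaurants"
    return "Other"
```

A function with this shape can be passed to `Series.apply` on the `categories` column to produce one label per business.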

In [16]:
df_business['tourism_establishment_type'] = df_business['categories'].apply(
    d_filter.classify_tourism_establishment
)

In [17]:
# Show overall distribution of hotels and restaurants

total_businesses = len(df_business)
for est_type, count in df_business['tourism_establishment_type'].value_counts().items():
    pct = (count / total_businesses) * 100
    print(f"  {est_type:20s}: {count:6,} ({pct:5.1f}%)")

  Other               : 92,462 ( 61.5%)
  Restaurants         : 51,923 ( 34.5%)
  Hotels              :  5,858 (  3.9%)
  Unknown             :    103 (  0.1%)


In [20]:
#Uncomment to save the csv with the added tourism_establishment_type column

#df_business.to_csv('all_businesses_classified.csv', index=False)

##  Geographic Distribution

In [18]:
#The function get_all_city_states() returns a Counter object with (city, state) counts

city_distribution = d_extract.get_all_city_states('../data/raw/yelp_academic_dataset_business.json')

Processed 50,000 businesses...
Processed 100,000 businesses...
Processed 150,000 businesses...

Total businesses: 150,346
Unique locations: 1,467
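
A location tally of this kind can be built with `collections.Counter` while streaming the business file. The sketch below assumes keys are formatted as `"City, ST"` strings, which matches how the counts are printed in this notebook. `count_city_states` is a hypothetical name, not the actual helper.

```python
import json
from collections import Counter


def count_city_states(lines):
    """Tally businesses per 'City, ST' key from an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # 'or ""' guards against null city/state values in the raw data
        city = (record.get("city") or "").strip()
        state = (record.get("state") or "").strip()
        counts[f"{city}, {state}"] += 1
    return counts
```

`Counter.most_common(n)` then returns the `n` largest locations in descending order.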



In [43]:
#What are the top-30 cities by the number of records (businesses)?

for i, (city_state, count) in enumerate(city_distribution.most_common(30), 1):
    print(f"{i:2d}. {city_state:35s}: {count:6,} businesses")

 1. Philadelphia, PA                   : 14,567 businesses
 2. Tucson, AZ                         :  9,249 businesses
 3. Tampa, FL                          :  9,048 businesses
 4. Indianapolis, IN                   :  7,540 businesses
 5. Nashville, TN                      :  6,968 businesses
 6. New Orleans, LA                    :  6,208 businesses
 7. Reno, NV                           :  5,932 businesses
 8. Edmonton, AB                       :  5,054 businesses
 9. Saint Louis, MO                    :  4,827 businesses
10. Santa Barbara, CA                  :  3,829 businesses
11. Boise, ID                          :  2,937 businesses
12. Clearwater, FL                     :  2,221 businesses
13. Saint Petersburg, FL               :  1,663 businesses
14. Metairie, LA                       :  1,643 businesses
15. Sparks, NV                         :  1,623 businesses
16. Wilmington, DE                     :  1,445 businesses
17. Franklin, TN                       :  1,410 business

----------------
 
 **New Orleans is a well-known tourist destination with distinct seasonal events (Mardi Gras, French Quarter Fest, Jazz Fest), and it appears to be well represented in the Yelp dataset (6,208 businesses, sixth among all cities).** 
 
 **In the next steps we will verify that New Orleans has enough reviews for the chosen business categories, as well as sufficient temporal coverage and seasonal balance.**


 For the analysis we will consider the following pre-pandemic years: 

- 2013: Super Bowl XLVII (February) 
- 2016: Normal tourism year (baseline comparison)
- 2018: New Orleans Tricentennial (300th Anniversary celebration)

 -------------------

##   Generating a New Orleans tourism dataset

In [None]:
#Extracting the New Orleans businesses for the desired categories.

df_nola_hospitality = d_filter.filter_by_city_and_establishment_type(
    df_business,
    'New Orleans',
    'LA',
    ['Restaurants', 'Hotels']
)
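
For readers without access to the `scripts` package, a filter like this can be expressed as a pandas boolean mask. The sketch below is an assumption about the behaviour, not the actual `filter_by_city_and_establishment_type` code. In particular, the case-insensitive, whitespace-tolerant city match is a choice made for this illustration, and `filter_city_types` is a hypothetical name.

```python
import pandas as pd


def filter_city_types(df, city, state, types):
    """Keep rows matching the city, state, and any of the establishment types."""
    mask = (
        # Normalise whitespace and case so "new orleans " still matches
        df["city"].str.strip().str.casefold().eq(city.casefold())
        & df["state"].eq(state)
        & df["tourism_establishment_type"].isin(types)
    )
    # .copy() avoids SettingWithCopyWarning if columns are added later
    return df.loc[mask].copy()
```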

In [27]:
#Checking the distribution between restaurants and hotels.

df_nola_hospitality['tourism_establishment_type'].value_counts()

tourism_establishment_type
Restaurants    2224
Hotels          620
Name: count, dtype: int64

---------------------------
**Restaurants outnumber hotels by roughly 3.6 to 1 (2,224 vs. 620), which should be taken into account in further analysis.**

In [31]:
#Extracting the reviews for the New Orleans hospitality businesses, for desired years.

df_nola_reviews = d_extract.extract_city_dataset(
    business_df=df_nola_hospitality,
    review_file='../data/raw/yelp_academic_dataset_review.json',
    user_file='../data/raw/yelp_academic_dataset_user.json',
    target_years=[2013, 2016, 2018]
)

Extracting data for New Orleans, LA
Target years: [2013, 2016, 2018]
Businesses to track: 2,844

STEP 1: Extracting Reviews
  Processed 500,000 reviews... Found 15,984 relevant reviews
  Processed 1,000,000 reviews... Found 26,539 relevant reviews
  Processed 1,500,000 reviews... Found 34,939 relevant reviews
  Processed 2,000,000 reviews... Found 47,351 relevant reviews
  Processed 2,500,000 reviews... Found 60,177 relevant reviews
  Processed 3,000,000 reviews... Found 70,690 relevant reviews
  Processed 3,500,000 reviews... Found 80,215 relevant reviews
  Processed 4,000,000 reviews... Found 94,070 relevant reviews
  Processed 4,500,000 reviews... Found 108,079 relevant reviews
  Processed 5,000,000 reviews... Found 118,788 relevant reviews
  Processed 5,500,000 reviews... Found 132,664 relevant reviews
  Processed 6,000,000 reviews... Found 144,166 relevant reviews
  Processed 6,500,000 reviews... Found 154,333 relevant reviews

✓ Extracted 164,041 reviews

STEP 2: Extracting User 
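
The review extraction step can be pictured as a streaming filter: keep a review only if its business is in the tracked set and its year is one of the target years. This is a minimal sketch assuming `business_id` and `date` fields in each review record. `iter_matching_reviews` is a hypothetical name, and the user-join step of `extract_city_dataset` is not covered.

```python
import json


def iter_matching_reviews(lines, business_ids, target_years):
    """Yield review dicts for tracked businesses within the target years."""
    wanted_ids = set(business_ids)  # set lookup is O(1) per review
    wanted_years = {str(year) for year in target_years}
    for line in lines:
        if not line.strip():
            continue
        review = json.loads(line)
        # The date string starts with the four-digit year, e.g. "2016-05-01 10:00:00"
        if review["business_id"] in wanted_ids and review["date"][:4] in wanted_years:
            yield review
```

Using a generator means only matching reviews are held in memory, which matters when scanning close to seven million lines.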

In [41]:
#Checking the seasonal distribution of reviews for each year


viz.print_dataset_summary(df_nola_reviews)

DATASET SUMMARY
Summarizing ALL YEARS in dataset: [np.int32(2013), np.int32(2016), np.int32(2018)]

2013:
  Total reviews: 30,448
  Unique businesses: 1457
  Unique users: 13249
  Avg review rating: 3.79
  Seasonal distribution:
    Winter  : 7,289 ( 23.9%)
    Spring  : 7,866 ( 25.8%)
    Summer  : 7,506 ( 24.7%)
    Fall    : 7,787 ( 25.6%)

2016:
  Total reviews: 58,590
  Unique businesses: 1750
  Unique users: 30283
  Avg review rating: 3.94
  Seasonal distribution:
    Winter  : 15,000 ( 25.6%)
    Spring  : 16,511 ( 28.2%)
    Summer  : 14,043 ( 24.0%)
    Fall    : 13,036 ( 22.2%)

2018:
  Total reviews: 75,003
  Unique businesses: 1890
  Unique users: 40550
  Avg review rating: 4.00
  Seasonal distribution:
    Winter  : 17,186 ( 22.9%)
    Spring  : 21,472 ( 28.6%)
    Summer  : 19,696 ( 26.3%)
    Fall    : 16,649 ( 22.2%)
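
The notebook does not state which months make up each season. The sketch below assumes meteorological seasons (December to February as winter, and so on) and shows how seasonal counts and percentages of this kind can be computed. `SEASON_BY_MONTH` and `seasonal_share` are hypothetical names and may not match the logic in the `scripts` package.

```python
from collections import Counter
from datetime import datetime

# Assumption: meteorological seasons for the Northern Hemisphere
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def seasonal_share(dates):
    """Return {season: (count, percent)} for an iterable of datetimes."""
    counts = Counter(SEASON_BY_MONTH[d.month] for d in dates)
    total = sum(counts.values())
    return {
        season: (counts[season], 100 * counts[season] / total if total else 0.0)
        for season in ("Winter", "Spring", "Summer", "Fall")
    }
```

With roughly even seasons each share sits near 25%, so the 22% to 29% range seen above indicates a reasonably balanced distribution.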


In [42]:
#Saving into a csv

df_nola_reviews.to_csv('New_Orleans_tourism_reviews.csv', index=False)

In [39]:
df_nola_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164041 entries, 0 to 164040
Data columns (total 27 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   review_id               164041 non-null  object        
 1   business_id             164041 non-null  object        
 2   user_id                 164041 non-null  object        
 3   review_stars            164041 non-null  float64       
 4   review_date             164041 non-null  datetime64[ns]
 5   review_text             164041 non-null  object        
 6   useful                  164041 non-null  int64         
 7   funny                   164041 non-null  int64         
 8   cool                    164041 non-null  int64         
 9   user_name               164041 non-null  object        
 10  user_review_count       164041 non-null  int64         
 11  user_yelping_since      164041 non-null  object        
 12  user_average_stars      164041