# **Yelp Dataset**
Download dataset [here](https://www.yelp.com/dataset/download).

Dataset link: https://www.yelp.com/dataset

| File            | Description                                                                                                                                                                 |
|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `business.json` | Contains business data including location data, attributes, and categories.                                                                                                |
| `review.json`   | Contains full review text data including the user_id that wrote the review and the business_id the review is written for.                                                  |
| `user.json`     | User data including the user's friend mapping and all the metadata associated with the user.                                                                               |
| `checkin.json`  | Checkins on a business.                                                                                                                                                     |
| `tip.json`      | Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.                                                                   |

Reference: https://github.com/ahegel/yelp-dataset

Official sample project from Yelp: [here](https://github.com/Yelp/dataset-examples)
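
The cells below read each file line by line, one JSON object per line, and repeat the same loop for every file. As a minimal sketch, that loop could be wrapped in one helper. The name `load_json_lines` and its `max_rows` parameter are assumptions made for illustration; they are not part of the Yelp dataset or of pandas.

```python
import json
import pandas as pd


def load_json_lines(path, max_rows=None):
    """Read a file with one JSON object per line into a DataFrame.

    Hypothetical helper: lines that fail to parse are reported and skipped,
    mirroring the try/except used in the notebook cells.
    """
    records = []
    with open(path, 'r', encoding='utf-8') as file:
        for line_number, line in enumerate(file, start=1):
            # Optionally stop early, e.g. to preview a large file
            if max_rows is not None and len(records) >= max_rows:
                break
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Line {line_number}: error decoding JSON: {e}")
    return pd.DataFrame(records)
```

With such a helper, a call like `load_json_lines('yelp_academic_dataset_checkin.json')` would stand in for the explicit loop, and `max_rows=1000` would give a quick preview of a large file.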

## **1. Check-In.json**

In [4]:
# import libraries
import json
import numpy as np
import pandas as pd


In [6]:
#checkin_json_file_path = '/content/drive/MyDrive/BA820 Unsupervised and Unstructured Machine Learning/Yelp Dataset/yelp_academic_dataset_checkin.json'
checkin_json_file_path = 'yelp_academic_dataset_checkin.json'

# List to store data from JSON file
checkin_data_list = []

# Open and read the JSON file line by line
with open(checkin_json_file_path, 'r') as file:
    for checkin_line in file:
        try:
            # Load each JSON object into a Python dictionary
            checkin_data = json.loads(checkin_line)
            checkin_data_list.append(checkin_data)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")

# Convert the list of dictionaries into a DataFrame
checkin_df = pd.DataFrame(checkin_data_list)

checkin_df

Unnamed: 0,business_id,date
0,---kPU91CF4Lq2-WlRu9Lw,"2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020..."
1,--0iUa4sNDFiZFrAdIWhZQ,"2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011..."
2,--30_8IhuyMHbSOcNWd6DQ,"2013-06-14 23:29:17, 2014-08-13 23:20:22"
3,--7PUidqRWpRSpXebiyxTg,"2011-02-15 17:12:00, 2011-07-28 02:46:10, 2012..."
4,--7jw19RH9JKXgFohspgQw,"2014-04-21 20:42:11, 2014-04-28 21:04:46, 2014..."
...,...,...
131925,zznJox6-nmXlGYNWgTDwQQ,"2013-03-23 16:22:47, 2013-04-07 02:03:12, 2013..."
131926,zznZqH9CiAznbkV6fXyHWA,2021-06-12 01:16:12
131927,zzu6_r3DxBJuXcjnOYVdTw,"2011-05-24 01:35:13, 2012-01-01 23:44:33, 2012..."
131928,zzw66H6hVjXQEt0Js3Mo4A,"2016-12-03 23:33:26, 2018-12-02 19:08:45"


In [7]:
checkin_df.isna().sum()

business_id    0
date           0
dtype: int64

The dataframe below shows, for each `business_id`, the number of check-ins in each hour of each day of the week. Hours use 24-hour time: "0" is midnight, "12" is noon, "13" is 1 PM, and so on up to "23", which is 11 PM.

In [8]:
# create a list of date-time strings
checkin_df['date_times_list'] = checkin_df['date'].str.split(', ')

# have each date-time as a separate row
checkin_df_exploded = checkin_df.explode('date_times_list')

# Convert the 'date_times_list' to datetime objects
checkin_df_exploded['date_times_list'] = pd.to_datetime(checkin_df_exploded['date_times_list'], errors='coerce')

# create new columns 'hour' and 'day_of_week'
checkin_df_exploded['hour'] = checkin_df_exploded['date_times_list'].dt.hour
checkin_df_exploded['day_of_week'] = checkin_df_exploded['date_times_list'].dt.day_name()

# get hours as columns with a count of check-ins per hour for each day of the week
checkin_df_exploded['count'] = 1
checkin_hourly_ext = checkin_df_exploded.pivot_table(index='business_id', columns=['day_of_week', 'hour'], values='count', aggfunc='sum', fill_value=0)

checkin_hourly_ext.columns = [' '.join([str(col) for col in cols]) for cols in checkin_hourly_ext.columns.values]

# add 'business_id' column again
checkin_hourly_ext.reset_index(inplace=True)

checkin_hourly_ext.head()

Unnamed: 0,business_id,Friday 0,Friday 1,Friday 2,Friday 3,Friday 4,Friday 5,Friday 6,Friday 7,Friday 8,...,Wednesday 14,Wednesday 15,Wednesday 16,Wednesday 17,Wednesday 18,Wednesday 19,Wednesday 20,Wednesday 21,Wednesday 22,Wednesday 23
0,---kPU91CF4Lq2-WlRu9Lw,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1,--0iUa4sNDFiZFrAdIWhZQ,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
2,--30_8IhuyMHbSOcNWd6DQ,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,--7PUidqRWpRSpXebiyxTg,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,--7jw19RH9JKXgFohspgQw,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
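
To make the reshaping easier to follow, here is a small sketch of the same split, explode, and count steps on a made-up two-business sample. The business IDs and timestamps are hypothetical, and the counting uses `groupby(...).size().unstack(...)` rather than the `pivot_table` call in the cell above; for counting rows the two should give the same table.

```python
import pandas as pd

# Hypothetical sample in the same shape as the check-in data
toy_checkins = pd.DataFrame({
    'business_id': ['biz_a', 'biz_b'],
    'date': [
        '2020-03-13 21:10:56, 2020-03-20 21:45:00, 2020-06-02 09:18:06',
        '2021-06-12 01:16:12',
    ],
})

# One row per check-in timestamp
toy_exploded = (
    toy_checkins
    .assign(checkin_time=toy_checkins['date'].str.split(', '))
    .explode('checkin_time', ignore_index=True)
)
toy_exploded['checkin_time'] = pd.to_datetime(toy_exploded['checkin_time'], errors='coerce')
toy_exploded['hour'] = toy_exploded['checkin_time'].dt.hour
toy_exploded['day_of_week'] = toy_exploded['checkin_time'].dt.day_name()

# Count check-ins per business for each (day, hour) pair
toy_counts = (
    toy_exploded
    .groupby(['business_id', 'day_of_week', 'hour'])
    .size()
    .unstack(['day_of_week', 'hour'], fill_value=0)
)

# Flatten the two column levels into labels such as 'Friday 21'
toy_counts.columns = [f'{day} {hour}' for day, hour in toy_counts.columns]
print(toy_counts)
```

Only (day, hour) pairs that occur in the data become columns, which is why the sample yields three columns instead of all 168 possible combinations.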


The dataframe below shows, for each business, the total number of check-ins in each hour of the day, aggregated across all days of the week.

In [None]:
checkin_df['date_times_list'] = checkin_df['date'].str.split(', ')

checkin_df = checkin_df.explode('date_times_list')

checkin_df['date_times_list'] = pd.to_datetime(checkin_df['date_times_list'], errors='coerce')

# create a new column 'hour'
checkin_df['hour'] = checkin_df['date_times_list'].dt.hour

# get hours as columns with a count of check-ins per hour
checkin_df['count'] = 1
checkin_hourly = checkin_df.pivot_table(index='business_id', columns='hour', values='count', aggfunc='sum', fill_value=0)

if isinstance(checkin_hourly.columns, pd.MultiIndex):
    checkin_hourly.columns = checkin_hourly.columns.droplevel(0)

# add 'business_id' to become a column again
checkin_hourly = checkin_hourly.reset_index()

checkin_hourly.columns = ['business_id'] + [f'hour_{col}' if isinstance(col, int) else col for col in checkin_hourly.columns[1:]]

In [10]:
checkin_hourly.head()

Unnamed: 0,business_id,hour_0,hour_1,hour_2,hour_3,hour_4,hour_5,hour_6,hour_7,hour_8,...,hour_14,hour_15,hour_16,hour_17,hour_18,hour_19,hour_20,hour_21,hour_22,hour_23
0,---kPU91CF4Lq2-WlRu9Lw,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,6,2,0
1,--0iUa4sNDFiZFrAdIWhZQ,2,0,0,0,0,0,1,0,0,...,0,1,0,1,0,0,1,1,1,2
2,--30_8IhuyMHbSOcNWd6DQ,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
3,--7PUidqRWpRSpXebiyxTg,0,0,2,0,0,0,0,2,0,...,0,2,1,1,0,0,0,0,0,0
4,--7jw19RH9JKXgFohspgQw,0,0,0,0,0,0,0,0,0,...,5,3,2,0,1,3,3,1,0,1


In [None]:
import matplotlib.pyplot as plt


# sum each hour column across businesses without modifying checkin_hourly
hourly_sums = checkin_hourly.drop(columns='business_id').sum()

plt.figure(figsize=(14, 7))
plt.bar(hourly_sums.index, hourly_sums.values, color='skyblue')

plt.xlabel('Hour of the Day')
plt.ylabel('Number of Check-ins')
plt.title('Aggregate Number of Check-ins for Each Hour')
plt.xticks(rotation=45)

# to avoid overlapping 
plt.tight_layout()
plt.show()
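
The hourly totals plotted above can also be computed directly from the timestamps, without building the per-business table first. The sketch below assumes the timestamps are already one per row and uses made-up values; `reindex(range(24), fill_value=0)` keeps hours with no check-ins in the result so the x-axis has no gaps.

```python
import pandas as pd

# Hypothetical check-in timestamps, already split into one value per row
toy_times = pd.to_datetime(pd.Series([
    '2020-03-13 21:10:56',
    '2020-03-20 21:45:00',
    '2020-06-02 09:18:06',
    '2021-06-12 01:16:12',
]))

# Count check-ins per hour, then reindex so that every hour 0-23 is present
hourly_totals = (
    toy_times.dt.hour
    .value_counts()
    .reindex(range(24), fill_value=0)
)

# Show only the hours that have at least one check-in
print(hourly_totals[hourly_totals > 0])
```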

## **2. Tip.json**

In [None]:
#tip_json_file_path = '/content/drive/MyDrive/BA820 Unsupervised and Unstructured Machine Learning/Yelp Dataset/yelp_academic_dataset_tip.json'
tip_json_file_path = 'yelp_academic_dataset_tip.json'

# List to store data from JSON file
tip_data_list = []

# Open and read the JSON file line by line
with open(tip_json_file_path, 'r') as file:
    for tip_line in file:
        try:
            # Load each JSON object into a Python dictionary
            tip_data = json.loads(tip_line)
            tip_data_list.append(tip_data)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")

# Convert the list of dictionaries into a DataFrame
tip_df = pd.DataFrame(tip_data_list)

tip_df

In [None]:
tip_df.isna().sum()

In [None]:
# keep only tips with a non-zero compliment count
non_zero_compliments = tip_df[tip_df['compliment_count'] > 0]

# group by business_id and sum the compliment counts
business_compliments = non_zero_compliments.groupby('business_id')['compliment_count'].sum()

# get the 10 businesses with the highest totals
top_10_business_compliments = business_compliments.nlargest(10)

plt.figure(figsize=(10, 6))
top_10_business_compliments.plot(kind='bar', color='skyblue')

plt.title('Top 10 Businesses by Non-Zero Compliment Count')
plt.xlabel('Business ID')
plt.ylabel('Total Compliment Count')
plt.xticks(rotation=90)
plt.tight_layout()

plt.show()

## **3. User.json**

In [None]:
#user_json_file_path = '/content/drive/MyDrive/BA820 Unsupervised and Unstructured Machine Learning/Yelp Dataset/yelp_academic_dataset_user.json'
user_json_file_path = 'yelp_academic_dataset_user.json'

# List to store data from JSON file
user_data_list = []

# Open and read the JSON file line by line
with open(user_json_file_path, 'r') as file:
    for user_line in file:
        try:
            # Load each JSON object into a Python dictionary
            user_data = json.loads(user_line)
            user_data_list.append(user_data)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")

# Convert the list of dictionaries into a DataFrame
user_df = pd.DataFrame(user_data_list)

user_df.head()

In [None]:
user_df.isna().sum()

In [None]:
# For each vote type, find the top 10 users, including their names
top_useful = user_df.nlargest(10, 'useful')[['user_id', 'name', 'useful']]
top_funny = user_df.nlargest(10, 'funny')[['user_id', 'name', 'funny']]
top_cool = user_df.nlargest(10, 'cool')[['user_id', 'name', 'cool']]

# Display the top 10 for each category
# ('useful', 'funny', and 'cool' are vote counts, not compliments)
print("Top 10 Users for 'Useful' Votes:")
print(top_useful)
print("\nTop 10 Users for 'Funny' Votes:")
print(top_funny)
print("\nTop 10 Users for 'Cool' Votes:")
print(top_cool)



In [None]:
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 15))

# 'useful'
top_useful.plot(kind='bar', x='name', y='useful', ax=axes[0], color='blue')
axes[0].set_title('Top 10 Users by "Useful" Votes')
axes[0].set_xlabel('User Name')
axes[0].set_ylabel('Count of Useful Votes')

# 'funny'
top_funny.plot(kind='bar', x='name', y='funny', ax=axes[1], color='orange')
axes[1].set_title('Top 10 Users by "Funny" Votes')
axes[1].set_xlabel('User Name')
axes[1].set_ylabel('Count of Funny Votes')

# 'cool'
top_cool.plot(kind='bar', x='name', y='cool', ax=axes[2], color='green')
axes[2].set_title('Top 10 Users by "Cool" Votes')
axes[2].set_xlabel('User Name')
axes[2].set_ylabel('Count of Cool Votes')

plt.tight_layout()
plt.show()

## **4. Review.json**

In [None]:
#review_json_file_path = '/content/drive/MyDrive/BA820 Unsupervised and Unstructured Machine Learning/Yelp Dataset/yelp_academic_dataset_review.json'
review_json_file_path = 'yelp_academic_dataset_review.json'

# List to store data from JSON file
review_data_list = []

# Open and read the JSON file line by line
with open(review_json_file_path, 'r') as file:
    for review_line in file:
        try:
            # Load each JSON object into a Python dictionary
            review_data = json.loads(review_line)
            review_data_list.append(review_data)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")

# Convert the list of dictionaries into a DataFrame
review_df = pd.DataFrame(review_data_list)

# Display only the first 3 rows of the DataFrame
review_df.head(3)


In [None]:
review_df.isna().sum()
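
Because the review file holds the full text of every review, loading it all at once may need a lot of memory. As a hedged alternative, `pd.read_json` accepts `lines=True` together with `chunksize`, which yields the file in pieces. The function name `count_stars_in_chunks` and the default chunk size below are assumptions for illustration.

```python
from collections import Counter

import pandas as pd


def count_stars_in_chunks(path, chunk_size=100_000):
    """Tally the 'stars' column chunk by chunk instead of loading the whole file.

    Hypothetical helper: assumes one JSON object per line with a 'stars' field.
    """
    counts = Counter()
    # With lines=True and chunksize set, read_json returns an iterator of DataFrames
    for chunk in pd.read_json(path, lines=True, chunksize=chunk_size):
        counts.update(chunk['stars'].value_counts().to_dict())
    return pd.Series(counts).sort_index()
```

Calling `count_stars_in_chunks('yelp_academic_dataset_review.json')` would return the number of reviews per star rating, which is the same information the histogram in the next cell displays.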

In [None]:
import seaborn as sns

sns.histplot(review_df['stars'], kde=False)

plt.show()