This notebook cleans the data and writes the cleaned data to the `data/01_cleaned` folder. The primary task of the cleaning step is to remove the attributes that are not used in the analysis.

First we import the necessary libraries and create the folders to store the cleaned data. We also define helper functions.

In [None]:
import os
import json

import pandas as pd

In [None]:
# Create the "data/01_cleaned" directory if it doesn't exist
os.makedirs("data/01_cleaned", exist_ok=True)
# Create the "data/01_cleaned/sample" directory if it doesn't exist
os.makedirs("data/01_cleaned/sample", exist_ok=True)

In [None]:
def silent_remove(filename):
    """Delete `filename`, ignoring the error if it cannot be removed (e.g. it does not exist)."""
    try:
        os.remove(filename)
    except OSError:
        pass

This notebook is designed so that it can be run on the sample data or the full data just by switching one variable.

In [None]:
# Set this to True to run the script on the sample data
# Set this to False to run the script on the full data (takes much longer)
SAMPLE = False
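
The cells below repeat the expression `{'sample/' if SAMPLE else ''}` in every path. As a minimal sketch of how the switch works, the same logic could be wrapped in a small helper. The name `data_path` and its parameters are hypothetical and not part of the original notebook.

```python
def data_path(stage, file_name, sample):
    """Build a path such as data/01_cleaned/sample/businesses.csv.

    Hypothetical helper for illustration; the notebook builds paths inline.
    """
    parts = ["data", stage]
    if sample:
        # The sample data lives in a "sample" subfolder of each stage.
        parts.append("sample")
    parts.append(file_name)
    # Join with "/" to match the path style used in this notebook.
    return "/".join(parts)


full_path = data_path("01_cleaned", "businesses.csv", sample=False)
sample_path = data_path("01_cleaned", "businesses.csv", sample=True)
print(full_path)
print(sample_path)
```

With such a helper, a call like `data_path("00_original", "yelp_academic_dataset_business.json", SAMPLE)` would yield the same path as the inline f-string.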

## Businesses

First we clean the business dataset. We load the data and keep the attributes defined in the `attributes_to_keep` variable.

In [None]:
# Load the JSON Lines file (one JSON object per line) into the 'businesses' data frame
with open(f"data/00_original/{'sample/' if SAMPLE else ''}yelp_academic_dataset_business.json", "r") as f:
    businesses = pd.DataFrame([json.loads(line) for line in f])

In [None]:
# Drop unused attributes from the business data

attributes_to_keep = [
    "longitude",
    "name",
    "categories",
    "review_count",
    "stars",
    "latitude",
    "business_id",
]

businesses = businesses[attributes_to_keep]
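
The load-and-select pattern can be tried on a tiny in-memory example without touching the real files. This is only a sketch: the two records and the extra `is_open` field are invented for illustration.

```python
import io
import json

import pandas as pd

# Two made-up records in JSON Lines format (one JSON object per line).
raw = io.StringIO(
    '{"business_id": "b1", "name": "Cafe A", "stars": 4.5, "is_open": 1}\n'
    '{"business_id": "b2", "name": "Diner B", "stars": 3.0, "is_open": 0}\n'
)

# Parse each line separately, as in the loading cell above.
demo = pd.DataFrame([json.loads(line) for line in raw])

# Keep only the listed columns; every other column (here "is_open") is dropped.
demo = demo[["business_id", "name", "stars"]]
print(demo.columns.tolist())
```

Selecting with a list of column names raises a `KeyError` if a listed column is missing, so a typo in the list fails loudly instead of silently dropping data.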

In [None]:
file_name = f"data/01_cleaned/{'sample/' if SAMPLE else ''}businesses.csv"
with open(file_name, "w") as f:
    businesses.to_csv(f, index=False, header=True)

## Reviews/Ratings

Next we clean the data for reviews and users in much the same way as the business data.

In [None]:
# Load review data
# [!] 5 min
with open(f"data/00_original/{'sample/' if SAMPLE else ''}yelp_academic_dataset_review.json", "r") as f:
    review_df = pd.DataFrame([json.loads(line) for line in f])

In [None]:
# Drop columns not in the attributes_to_keep list

attributes_to_keep = [
    "business_id",
    "date",
    "review_id",
    "stars",
    "text",
    "user_id",
]

review_df = review_df[attributes_to_keep]

In [None]:
# Clean text column
# [!] 4 min

# Replace newlines in the text column with spaces
# (regex=False treats the pattern as a literal string)
review_df["text"] = review_df["text"].str.replace("\n", " ", regex=False)
# Remove double quotes from the text column
review_df["text"] = review_df["text"].str.replace('"', "", regex=False)
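
The effect of the two replacements can be checked on a small example. The review texts below are invented. `regex=False` makes pandas treat each pattern as a literal string rather than a regular expression.

```python
import pandas as pd

# Invented review texts for illustration.
text = pd.Series(
    ['Great food.\nWill come back!', 'They call it "the best" burger.']
)

# Newlines become spaces and double quotes are removed.
cleaned = (
    text.str.replace("\n", " ", regex=False)
    .str.replace('"', "", regex=False)
)
print(cleaned.tolist())
```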

In [None]:
# Write the review data to a CSV
# [!] 4 min
file_name = f"data/01_cleaned/{'sample/' if SAMPLE else ''}reviews.csv"
with open(file_name, "w") as f:
    review_df[["business_id", "review_id", "user_id", "text"]].to_csv(f, index=False, header=True)

Now we drop the review text and write a separate ratings file that contains only the star ratings. This saves time and space in the preprocessing step.

In [None]:
rating_df = review_df[["business_id", "review_id", "user_id", "stars"]]
rating_df = rating_df.rename(columns={"review_id": "rating_id"})
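
As a quick sanity check, the same selection and rename can be run on a single invented review to see which columns end up in the ratings table. This is a sketch on made-up data, not part of the original notebook.

```python
import pandas as pd

# One invented review row for illustration.
demo_reviews = pd.DataFrame(
    {
        "business_id": ["b1"],
        "review_id": ["r1"],
        "user_id": ["u1"],
        "stars": [5],
        "text": ["Nice place."],
    }
)

# Drop the text column and rename review_id to rating_id.
demo_ratings = demo_reviews[
    ["business_id", "review_id", "user_id", "stars"]
].rename(columns={"review_id": "rating_id"})
print(demo_ratings.columns.tolist())
```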

In [None]:
# Write the ratings data to a CSV
# [!] 4 min
file_name = f"data/01_cleaned/{'sample/' if SAMPLE else ''}ratings.csv"
with open(file_name, "w") as f:
    rating_df.to_csv(f, index=False, header=True)

## Users

In [None]:
with open(f"data/00_original/{'sample/' if SAMPLE else ''}yelp_academic_dataset_user.json", "r") as f:
    users = pd.read_json(f, lines=True)

In [None]:
attributes_to_keep = [
    "average_stars",
    "friends",
    "name",
    "review_count",
    "user_id",
]

users = users[attributes_to_keep]

The following step splits the comma-separated `friends` string into a list and extracts it into its own data frame, with one row per user and friend pair.

In [None]:
# Extract the friends column into a separate table
users["friends"] = users["friends"].apply(lambda x: x.split(", "))
users["friends"] = users["friends"].apply(lambda x: [] if x == ["None"] else x)
friends = users[["user_id", "friends"]].explode("friends").rename(columns={"friends": "friend_id"})
friends.head()
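
The split-and-explode steps can be traced on two invented users. Note that `explode` turns an empty list into a single row with a missing value, so a user without friends still appears once in the result. The final `dropna` is an optional extra step shown here as an assumption; the notebook itself does not apply it.

```python
import pandas as pd

# Invented users; "None" marks a user without friends, as handled in the cell above.
demo_users = pd.DataFrame(
    {"user_id": ["u1", "u2"], "friends": ["u2, u3", "None"]}
)

# Split the comma-separated string, then map ["None"] to an empty list.
demo_users["friends"] = demo_users["friends"].apply(lambda x: x.split(", "))
demo_users["friends"] = demo_users["friends"].apply(
    lambda x: [] if x == ["None"] else x
)

# One row per (user_id, friend_id) pair.
demo_friends = (
    demo_users[["user_id", "friends"]]
    .explode("friends")
    .rename(columns={"friends": "friend_id"})
)
print(demo_friends)

# Optional: drop the placeholder rows for users without friends.
demo_edges = demo_friends.dropna(subset=["friend_id"])
print(demo_edges)
```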

In [None]:
file_name = f"data/01_cleaned/{'sample/' if SAMPLE else ''}users.csv"
with open(file_name, "w") as f:
    users.to_csv(f, index=False, header=True)
file_name = f"data/01_cleaned/{'sample/' if SAMPLE else ''}friends.csv"
with open(file_name, "w") as f:
    friends.to_csv(f, index=False, header=True)