This notebook cleans the data and writes the cleaned data to the `data/01_cleaned` folder. The primary task of the cleaning step is to remove the attributes that are not used in the analysis.

First we import the necessary libraries and create the folders that store the cleaned data. We also define a helper function, `silent_remove`, that deletes a file if it exists.

In [None]:
import os
import json

In [None]:
# Create the "data/01_cleaned" directory if it doesn't exist
os.makedirs("data/01_cleaned", exist_ok=True)
# Create the "data/01_cleaned/sample" directory if it doesn't exist
os.makedirs("data/01_cleaned/sample", exist_ok=True)

In [None]:
def silent_remove(filename):
    """Delete filename, ignoring any OSError (for example, a missing file)."""
    try:
        os.remove(filename)
    except OSError:
        pass

This notebook is designed so that it can be run on the sample data or the full data just by switching one variable.

In [None]:
# Set this to True to run the script on the sample data
# Set this to False to run the script on the full data (takes much longer)
SAMPLE = True
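
The cells below build every input and output path from this flag with an inline f-string. As a minimal sketch of the same rule, the function below shows how the `sample/` subfolder is inserted. The name `dataset_path` and its parameters are hypothetical and are not used elsewhere in this notebook.

```python
def dataset_path(stage, filename, sample):
    """Build a path under data/<stage>/, adding sample/ when sample is True.

    Hypothetical helper for illustration; the notebook itself uses f-strings.
    """
    parts = ["data", stage]
    if sample:
        parts.append("sample")
    parts.append(filename)
    return "/".join(parts)


# With the flag on, files are read from and written to the sample subfolder
print(dataset_path("00_original", "yelp_academic_dataset_business.json", True))
# With the flag off, the same call points at the full dataset
print(dataset_path("01_cleaned", "businesses.json", False))
```

Keeping the rule in one place would make it harder for the input and output paths to drift apart, but the inline f-strings used below produce the same paths.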

## Businesses

First we clean the business dataset. We load the data and keep only the attributes listed in the `attributes_to_keep` variable.

In [None]:
# Load the JSON Lines file into a new list, `businesses` (one dict per line)
with open(f"data/00_original/{'sample/' if SAMPLE else ''}yelp_academic_dataset_business.json", "r") as f:
    businesses = [json.loads(line) for line in f]

In [None]:
# Remove unused attributes from the business data

attributes_to_keep = [
    "longitude",
    "name",
    "categories",
    "review_count",
    "stars",
    "latitude",
    "business_id",
]

for business in businesses:
    for key in list(business.keys()):
        if key not in attributes_to_keep:
            business.pop(key)
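
The same filtering pattern is repeated for reviews and users further down. One possible way to express it once is a small function that returns a filtered copy, sketched below. The name `keep_attributes` and the example record are made up for illustration; unlike the loop above, this version does not modify the record in place.

```python
def keep_attributes(record, attributes):
    """Return a new dict holding only the keys listed in attributes.

    Hypothetical helper; the notebook's own cells pop keys in place instead.
    """
    return {key: value for key, value in record.items() if key in attributes}


# Made-up record for illustration only; not taken from the dataset
example = {"business_id": "b1", "name": "Cafe", "stars": 4.5, "is_open": 1}
cleaned = keep_attributes(example, {"business_id", "name", "stars"})
print(cleaned)
```

Passing the attributes as a set makes each membership check constant time, which may matter on the full data; a list works as well.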

In [None]:
silent_remove(f"data/01_cleaned/{'sample/' if SAMPLE else ''}businesses.json")
with open(f"data/01_cleaned/{'sample/' if SAMPLE else ''}businesses.json", "a") as f:
    for business in businesses:
        f.write(json.dumps(business) + "\n")

In [None]:
# Optional cell to release memory (only within Python because Python doesn't like to release memory back to the system)
businesses = None

## Reviews and Users

Next we clean the data for reviews and users in much the same way as the business data.

In [None]:
with open(f"data/00_original/{'sample/' if SAMPLE else ''}yelp_academic_dataset_review.json", "r") as f:
    reviews = [json.loads(line) for line in f]

In [None]:
attributes_to_keep = [
    "business_id",
    "date",
    "review_id",
    "stars",
    "text",
    "user_id",
]

for review in reviews:
    for key in list(review.keys()):
        if key not in attributes_to_keep:
            review.pop(key)

In [None]:
silent_remove(f"data/01_cleaned/{'sample/' if SAMPLE else ''}reviews.json")
with open(f"data/01_cleaned/{'sample/' if SAMPLE else ''}reviews.json", "a") as f:
    for review in reviews:
        f.write(json.dumps(review) + "\n")

Now we remove the review text and write a second file, `ratings.json`, that holds the ratings without the text. The `review_id` key is renamed to `rating_id`. This saves time and space in the preprocessing step.

In [None]:
for review in reviews:
    if "text" in review.keys():
        review.pop("text")
    review["rating_id"] = review.pop("review_id")
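
To make the transformation concrete, here is a small sketch applied to a single record. The function name `to_rating` and the example values are hypothetical. The function works on a copy, whereas the cell above changes each review in place.

```python
def to_rating(review):
    """Copy a cleaned review, drop its text, and rename review_id to rating_id.

    Hypothetical helper for illustration only.
    """
    rating = dict(review)
    rating.pop("text", None)  # no error if the text is already gone
    rating["rating_id"] = rating.pop("review_id")
    return rating


# Made-up record for illustration only; not taken from the dataset
example_review = {
    "business_id": "b1",
    "date": "2020-01-01",
    "review_id": "r1",
    "stars": 5,
    "text": "Great coffee.",
    "user_id": "u1",
}
example_rating = to_rating(example_review)
print(example_rating)
```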

In [None]:
silent_remove(f"data/01_cleaned/{'sample/' if SAMPLE else ''}ratings.json")
with open(f"data/01_cleaned/{'sample/' if SAMPLE else ''}ratings.json", "a") as f:
    for review in reviews:
        f.write(json.dumps(review) + "\n")

In [None]:
reviews = None

Now we clean the user data.

In [None]:
with open(f"data/00_original/{'sample/' if SAMPLE else ''}yelp_academic_dataset_user.json", "r") as f:
    users = [json.loads(line) for line in f]

In [None]:
attributes_to_keep = [
    "average_stars",
    "friends",
    "name",
    "review_count",
    "user_id",
]

for user in users:
    for key in list(user.keys()):
        if key not in attributes_to_keep:
            user.pop(key)

The following step converts the `friends` attribute from a comma-separated string to a list. A user whose value is the string `"None"` gets an empty list.

In [None]:
for user in users:
    user["friends"] = user["friends"].split(", ")
    if user["friends"] == ["None"]:
        user["friends"] = []
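
The conversion can be checked on a few hand-written strings. The sketch below wraps the same logic in a function; the name `parse_friends` and the input values are made up for illustration.

```python
def parse_friends(raw):
    """Split a ", "-separated friends string; treat "None" as no friends.

    Hypothetical helper that mirrors the logic of the cell above.
    """
    friends = raw.split(", ")
    return [] if friends == ["None"] else friends


# Made-up IDs for illustration only
print(parse_friends("u1, u2, u3"))
print(parse_friends("u1"))
print(parse_friends("None"))
```

Under this logic an empty string would come back as `[""]` rather than an empty list. Whether the dataset contains such values is not established here, so it may be worth checking before relying on the list lengths.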

In [None]:
silent_remove(f"data/01_cleaned/{'sample/' if SAMPLE else ''}users.json")
with open(f"data/01_cleaned/{'sample/' if SAMPLE else ''}users.json", "a") as f:
    for user in users:
        f.write(json.dumps(user) + "\n")