# Preprocessing of the data

The dataset used is the [Yelp dataset](https://www.yelp.com/dataset/download).
It contains ~6.9 million reviews from ~150,000 businesses.

The dataset is provided as JSON files with one JSON object per line.

We want to convert the data to CSV files and filter it, because we are only interested in the reviews of McDonald's restaurants.

In [1]:
import json
import csv
import re

In [28]:
# Regex to match all names that include McDonald's
mcRegex = re.compile(r"\b[mM][cC] ?[dD][oO][nN][aA][lL][dD]'?[sS]?\b")
mcHashMap = {}

# Load the JSON data from the file
with open('yelp_dataset/yelp_academic_dataset_business.json', 'r', encoding='utf-8') as json_file:
    with open('csv/mcdonalds_businesses.csv', 'w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ["business_id", "name"]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        for line in json_file:
            data = json.loads(line)
            # check if name matches mcRegex
            if mcRegex.search(data["name"]):
                writer.writerow({"business_id": data["business_id"], "name": data["name"]})
                mcHashMap[data["business_id"]] = data["name"]
        

In [29]:
print(len(mcHashMap))

717


The regex matched 717 businesses in the dataset.
However, inspecting the matches showed that some of them are not actually McDonald's restaurants, for example:
- McDonald Park
- McDonald Pest Control
- Max McDonald Surfboard Repair

So we decided to match the name "McDonald's" exactly instead of using the regex.
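To illustrate the difference between the two strategies, here is a minimal sketch that applies both to the three false positives listed above plus the exact chain name. The `sample_names` list is purely illustrative and is not read from the dataset.

```python
import re

# Same pattern as in the regex cell above
mc_regex = re.compile(r"\b[mM][cC] ?[dD][oO][nN][aA][lL][dD]'?[sS]?\b")

# Illustrative names: the exact chain name plus the three false positives
sample_names = [
    "McDonald's",
    "McDonald Park",
    "McDonald Pest Control",
    "Max McDonald Surfboard Repair",
]

# The regex accepts every name that contains a "McDonald"-like word
regex_hits = [name for name in sample_names if mc_regex.search(name)]

# The exact comparison accepts only the chain name itself
exact_hits = [name for name in sample_names if name == "McDonald's"]

print(len(regex_hits), "regex matches,", len(exact_hits), "exact match")
```

The trade-off is that an exact comparison also rejects legitimate spelling variants, so it favors precision over recall.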

In [35]:
mcHashMap = {}

# Load the JSON data from the file
with open('yelp_dataset/yelp_academic_dataset_business.json', 'r', encoding='utf-8') as json_file:
    with open('csv/mcdonalds_businesses.csv', 'w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ["business_id", "name"]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        for line in json_file:
            data = json.loads(line)
            if data["name"] == "McDonald's":
                writer.writerow({"business_id": data["business_id"], "name": data["name"]})
                mcHashMap[data["business_id"]] = data["name"]
        

In [34]:
print(len(mcHashMap))

703


Matching "McDonald's" exactly drops only 14 businesses (717 - 703).
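To see which businesses the exact match drops, one could compare the two result sets. The sketch below assumes the regex results and the exact-match results are kept in two separate dictionaries. The notebook reuses `mcHashMap` for both, so `regex_matches`, `exact_matches` and `dropped_names` are hypothetical names, and the sample entries are made up for illustration.

```python
def dropped_names(regex_matches, exact_matches):
    """Return the sorted names of businesses found by the regex but not by the exact match."""
    # dict key views support set operations such as difference
    dropped_ids = regex_matches.keys() - exact_matches.keys()
    return sorted(regex_matches[business_id] for business_id in dropped_ids)


# Made-up data in the same {business_id: name} shape as mcHashMap
regex_matches = {
    "id1": "McDonald's",
    "id2": "McDonald Park",
    "id3": "McDonald Pest Control",
}
exact_matches = {"id1": "McDonald's"}

print(dropped_names(regex_matches, exact_matches))
```

Reviewing this list by hand would show whether any of the dropped entries are real McDonald's restaurants with a slightly different spelling.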

In [39]:
mcReviews = []
errorCount = 0

with open('yelp_dataset/yelp_academic_dataset_review.json', 'r', encoding='utf-8') as json_file:
    with open('csv/mcdonalds_reviews.csv', 'w', newline='') as csv_file:
        fieldnames = ["business_id", "stars", "text"]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        for line in json_file:
            try:
                data = json.loads(line)
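                # Keep only reviews that belong to one of the McDonald's businesses
                if data["business_id"] in mcHashMap:
                    # Collapse multi-line review text into one line, separated by spaces
                    text = ' '.join(data["text"].splitlines())
                    writer.writerow({"business_id": data["business_id"], "stars": data["stars"], "text": text})
                    mcReviews.append(data["stars"])
            except Exception:
                # Count reviews that fail to parse or write (does not swallow KeyboardInterrupt)
                errorCount += 1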

In [38]:
print("Number of McDonald's reviews: " + str(len(mcReviews)))
print("Number of Errors: " + str(errorCount))

Number of McDonald's reviews: 18149
Number of Errors: 61


We now have a total of 18149 reviews in the CSV file. 61 reviews could not be processed, apparently because of an encoding error. We did not investigate the cause, but since the number of affected reviews is small compared to the total, we simply skip them.
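Because a catch-all `except` hides what actually went wrong, a possible next step is to count the skipped lines by exception type. The following is a minimal sketch under stated assumptions, not the code that produced the numbers above. `convert_reviews` is a hypothetical helper that works on any file-like objects, so it can be tried on in-memory data. One possibility worth checking (an assumption, not something verified here) is that the CSV file is opened without an explicit encoding, in which case the platform default may be unable to encode some characters. Passing `encoding='utf-8'` to the output `open` call would rule that out.

```python
import csv
import io
import json
from collections import Counter


def convert_reviews(json_lines, csv_file, business_ids):
    """Write reviews of the given businesses to csv_file; return (written, error_counts)."""
    writer = csv.DictWriter(csv_file, fieldnames=["business_id", "stars", "text"])
    writer.writeheader()
    written = 0
    errors = Counter()
    for line in json_lines:
        try:
            data = json.loads(line)
            if data["business_id"] in business_ids:
                writer.writerow({
                    "business_id": data["business_id"],
                    "stars": data["stars"],
                    # Join with a space so words on adjacent lines do not run together
                    "text": " ".join(data["text"].splitlines()),
                })
                written += 1
        except (json.JSONDecodeError, KeyError, UnicodeError) as exc:
            # Record the exception type instead of only incrementing a counter
            errors[type(exc).__name__] += 1
    return written, errors


# Made-up input: one matching review, one for another business, one malformed line
sample = io.StringIO(
    '{"business_id": "a", "stars": 1.0, "text": "Cold\\nfries"}\n'
    '{"business_id": "b", "stars": 5.0, "text": "Great"}\n'
    'not json\n'
)
out = io.StringIO()
written, errors = convert_reviews(sample, out, {"a"})
print(written, dict(errors))
```

With real files, the same function could be called with the two opened file handles and `mcHashMap.keys()` as `business_ids`; the resulting counter would then show whether the 61 failures are decoding errors, encoding errors, or missing fields.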

### We now have the CSV files for the McDonald's data!