# Data Preprocessing

In this notebook, we'll preprocess the individual `.json` source files before loading them into the graph database. Our eventual goal is to have a dataset that we can run inference and queries on to identify hotspots and gaps in local markets.

Defining some paths first:

In [2]:
PREPROC_DIR = "local_data/preproc"


## Businesses

Counting the total number of businesses:

In [3]:
BUSINESS_FILE = "local_data/raw/business.json"

count = 0

with open(BUSINESS_FILE) as f:
	for line in f: count += 1

print(count)

150346



### Sanitization

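Some attribute values are stored as strings holding the `repr` of a Python value rather than as JSON. We'll parse those back with `ast.literal_eval` (leaving any string that isn't a valid literal untouched) and split the comma-separated `categories` string into a proper JSON array.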

In [None]:
import json, ast

with open(BUSINESS_FILE) as fin, open(f"{PREPROC_DIR}/business.json","w") as fout:
	for line in fin:
		b = json.loads(line)

		# parse full attributes object
		attrs = b.get("attributes")

		if isinstance(attrs, dict):
			for k, v in attrs.items():
				if isinstance(v, str):
					
					# fix python-repr'd attributes
					try: b["attributes"][k] = ast.literal_eval(v)
					except Exception: pass

		# convert categories string to list
		cats = b.get("categories")
		if isinstance(cats, str): b["categories"] = [c.strip() for c in cats.split(',')]

		fout.write(json.dumps(b) + "\n")
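
To see what this does to a single record, here is a minimal sketch on a made-up business entry. The field values are illustrative rather than taken from the dataset, and `sanitize_business` is a hypothetical helper name that mirrors the loop above (it catches only `ValueError` and `SyntaxError`, which is narrower than the cell above):

```python
import ast
import json

def sanitize_business(b):
    """Parse repr'd attribute strings and split categories into a list (in place)."""
    attrs = b.get("attributes")
    if isinstance(attrs, dict):
        for k, v in attrs.items():
            if isinstance(v, str):
                # strings that are not valid Python literals are left untouched
                try:
                    attrs[k] = ast.literal_eval(v)
                except (ValueError, SyntaxError):
                    pass

    cats = b.get("categories")
    if isinstance(cats, str):
        b["categories"] = [c.strip() for c in cats.split(",")]
    return b

# hypothetical record, shaped like one line of business.json
sample = {
    "business_id": "example",
    "attributes": {
        "WiFi": "u'free'",
        "BusinessParking": "{'garage': False, 'lot': True}",
        "RestaurantsPriceRange2": "2",
        "Note": "not a literal",
    },
    "categories": "Coffee & Tea, Bakeries",
}

clean = sanitize_business(sample)
print(json.dumps(clean, indent=2))
```

In this sketch the `WiFi` value becomes a plain string, `BusinessParking` becomes a nested object, the price range becomes an integer, and the free-text `Note` passes through unchanged.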

## Users

Counting the total number of users:

In [5]:
USER_FILE = "local_data/raw/user.json"

count = 0

with open(USER_FILE) as f:
	for line in f: count += 1

print(count)

1987897


### Downsizing

There are far too many users to process in the time we have, so we need to sample them in a way that still supports the insights we're after. The idea is this: we keep every business in the dataset, select for each business the top $n$% of its reviewers ranked by fan count (our proxy for relevance), and finally filter reviews down to those written by the selected users. This preserves our primary entity, businesses, and downsizes the overall database while keeping the links and relationships we need.

We first create a user $\to$ fan-count mapping:

In [6]:
fans_map = {}

with open(USER_FILE) as f:
	for line in f:
		u = json.loads(line)
		fans_map[u["user_id"]] = u.get("fans", 0)

# sanity check - no. of entries = no. of users
len(fans_map)

1987897

Next, we map each business to the set of users who reviewed it, skipping dangling reviews whose author is missing from `user.json` (e.g. deleted users):

In [7]:
from collections import defaultdict

REVIEW_FILE = "local_data/raw/review.json"

reviewers_map = defaultdict(set)

with open(REVIEW_FILE) as f:
	for line in f:
		r = json.loads(line)

		# add user if it exists in user.json (proxied by fans_map)
		if r["user_id"] in fans_map: reviewers_map[r["business_id"]].add(r["user_id"])

# sanity check - no. of entries = no. of businesses
len(reviewers_map)

150346

Finally, for each business we take the top $n$% of its reviewers by fan count ($n = 10$ here), keeping at least three reviewers per business:

In [8]:
import math

final_users = set()

for b, users in reviewers_map.items():
	sorted_users = list(users)

	# sort users based on popularity
	sorted_users.sort(key=lambda uid: fans_map.get(uid, 0), reverse=True)

	# keep the top 10% (but at least 3 users) and add to set
	final_users.update(sorted_users[:max(math.ceil(len(users) * 0.1), 3)])

# count no. of chosen users
len(final_users)

164821
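
The selection rule is easier to check on toy data. The sketch below restates it as a hypothetical helper, `top_reviewers`, and applies it to made-up fan counts; the `fraction` and `minimum` parameter names are introduced here for illustration only:

```python
import math

def top_reviewers(users, fans, fraction=0.1, minimum=3):
    """Pick the most-followed reviewers: the top `fraction`, but never fewer than `minimum`."""
    ranked = sorted(users, key=lambda uid: fans.get(uid, 0), reverse=True)
    keep = max(math.ceil(len(ranked) * fraction), minimum)
    return ranked[:keep]

# made-up fan counts: user "u7" has 7 fans, and so on
fans = {f"u{i}": i for i in range(40)}

big = top_reviewers(set(fans), fans)       # 40 reviewers: ceil(40 * 0.1) = 4 are kept
small = top_reviewers({"u1", "u2"}, fans)  # fewer than 3 reviewers: all are kept

print(big)
print(small)
```

Because of the three-user floor, a business with few reviewers contributes all of them, so the final selection is somewhat larger than a strict 10% cut would give.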

And now to create the preprocessed `.json` file itself:

In [12]:
with open(USER_FILE) as fin, open(f"{PREPROC_DIR}/user.json","w") as fout:
	for line in fin:
		u = json.loads(line)

		# select user if in final selection
		if u["user_id"] in final_users:	
			
			# convert friends string to list
			friends = u.get("friends")
			if isinstance(friends, str): u["friends"] = [f.strip() for f in friends.split(",")]
			
			fout.write(json.dumps(u) + "\n")
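
One thing to keep in mind is that the friend lists written above still refer to users outside the selection. If those links would dangle in the graph database, they could be pruned at this point. The following is only a sketch of that option, not what the cell above does: `parse_friends` is a hypothetical helper, and treating the string `"None"` as "no friends" is an assumption that should be checked against the raw file. Unlike the cell above, it also returns an empty list for any non-string value.

```python
def parse_friends(raw, keep=None):
    """Split a comma-separated friends string into a list of user ids.

    `keep` is an optional set of user ids; when given, friends outside it are dropped.
    """
    # assumption: an empty string or the literal "None" means the user has no friends
    if not isinstance(raw, str) or raw.strip() in ("", "None"):
        return []
    ids = [f.strip() for f in raw.split(",") if f.strip()]
    return ids if keep is None else [f for f in ids if f in keep]

selected = {"a", "b"}  # stand-in for final_users

all_friends = parse_friends("a, b, c")
kept_friends = parse_friends("a, b, c", selected)
no_friends = parse_friends("None")

print(all_friends, kept_friends, no_friends)
```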

## Reviews

Counting the total number of reviews:

In [11]:
count = 0

with open(REVIEW_FILE) as f:
	for line in f: count += 1

print(count)

6990280


### Filtering

We'll keep only the reviews written by our selected users:

In [None]:
with open(REVIEW_FILE) as fin, open(f"{PREPROC_DIR}/review.json","w") as fout:
	for line in fin:
		r = json.loads(line)

		# select review if author in final selection
		if r["user_id"] in final_users: fout.write(json.dumps(r) + "\n")
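
The same read-filter-write pattern can be wrapped in a small function that also reports how many records survive. This is a minimal sketch run on a temporary toy file; `filter_jsonl` and the sample records are hypothetical and stand in for `review.json` and `final_users`:

```python
import json
import os
import tempfile

def filter_jsonl(src, dst, keep):
    """Copy the records of a JSON Lines file for which keep(record) is true.

    Returns (kept, total) so the size reduction can be reported.
    """
    kept = total = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            total += 1
            record = json.loads(line)
            if keep(record):
                fout.write(json.dumps(record) + "\n")
                kept += 1
    return kept, total

# toy data standing in for review.json and final_users
selected = {"u1", "u3"}
reviews = [
    {"review_id": "r1", "user_id": "u1"},
    {"review_id": "r2", "user_id": "u2"},
    {"review_id": "r3", "user_id": "u3"},
]

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "review.json")
    dst = os.path.join(tmp, "review_filtered.json")
    with open(src, "w") as f:
        for r in reviews:
            f.write(json.dumps(r) + "\n")

    kept, total = filter_jsonl(src, dst, lambda r: r["user_id"] in selected)
    with open(dst) as f:
        kept_ids = [json.loads(line)["review_id"] for line in f]

print(f"kept {kept} of {total} reviews")
```

Returning the counts makes it easy to compare the filtered file against the raw total printed earlier in this section.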

## Tips

Checking for size first:

In [9]:
count = 0

with open("local_data/raw/tip.json") as f:
	for line in f: count += 1

print(count)

908915
