# Data Preprocessing

In this notebook, we'll preprocess the individual `.ndjson` source files before loading them into the graph database. Our eventual goal is to have a dataset that we can run inference and queries on to identify hotspots and gaps in local markets.

Defining some paths first:

In [1]:
PREPROC_DIR = "local_data/preproc"
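
The later cells open files under `PREPROC_DIR` for writing, which fails if the directory doesn't exist yet. A minimal sketch for creating it up front, assuming it's fine to create it relative to the current working directory:

```python
from pathlib import Path

PREPROC_DIR = "local_data/preproc"

# create the output directory (and any missing parents); no-op if it already exists
Path(PREPROC_DIR).mkdir(parents=True, exist_ok=True)
```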


## Businesses

Counting the total number of businesses:

In [2]:
BUSINESS_FILE = "local_data/raw/business.ndjson"

count = 0

with open(BUSINESS_FILE) as f:
	for line in f: count += 1

print(count)

150346



### Sanitization

We'll sanitize by parsing attribute values stored as Python-`repr`'d strings back into real values, and by converting comma-separated category strings to proper JSON arrays.

In [3]:
import json, ast

with open(BUSINESS_FILE) as fin, open(f"{PREPROC_DIR}/business.ndjson","w") as fout:
	for line in fin:
		b = json.loads(line)

		# parse full attributes object
		attrs = b.get("attributes")

		if isinstance(attrs, dict):
			for k, v in attrs.items():
				if isinstance(v, str):
					
					# fix python-repr'd attributes
					try: b["attributes"][k] = ast.literal_eval(v)
					except Exception: pass

		# convert categories string to list
		cats = b.get("categories")
		if isinstance(cats, str): b["categories"] = [c.strip() for c in cats.split(',')]

		fout.write(json.dumps(b) + "\n")
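
To make the two fixes concrete, here's a small sketch on a made-up record. The field values are illustrative rather than taken from the dataset, and the helper name `sanitize_business` is ours; it just wraps the same logic as the cell above:

```python
import ast
import json

def sanitize_business(b):
    """Apply the same two fixes as above to a single business dict."""
    attrs = b.get("attributes")
    if isinstance(attrs, dict):
        for k, v in attrs.items():
            if isinstance(v, str):
                # "True" -> True, "u'free'" -> "free", "{'lot': True}" -> {"lot": True}
                try:
                    attrs[k] = ast.literal_eval(v)
                except (ValueError, TypeError, SyntaxError):
                    pass  # not a Python literal, so leave the plain string as it is
    cats = b.get("categories")
    if isinstance(cats, str):
        b["categories"] = [c.strip() for c in cats.split(",")]
    return b

# hypothetical record, shaped like a line of business.ndjson
sample = {
    "attributes": {
        "WiFi": "u'free'",
        "BikeParking": "True",
        "BusinessParking": "{'garage': False, 'lot': True}",
        "Note": "plain text",
    },
    "categories": "Coffee & Tea, Cafes",
}

print(json.dumps(sanitize_business(sample), indent=2))
```

Strings that aren't valid Python literals (like `"plain text"`) are left untouched, which mirrors the `try`/`except` in the cell above.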

## Users

Counting the total number of users:

In [4]:
USER_FILE = "local_data/raw/user.ndjson"

count = 0

with open(USER_FILE) as f:
	for line in f: count += 1

print(count)

1987897


### Downsizing

We have far too many users to process in the time available, so we need to sample them in a way that still satisfies our insight criteria. The idea: keep all businesses in the dataset, select the top $n$% most relevant users per business (ranked by fan count) for the richest possible dataset, and finally filter reviews to only those written by the selected users. This way, we preserve our primary vector, businesses, and downsize the overall database while keeping the links and relationships we need.

We first create a users $\to$ fans mapping:

In [59]:
fans_map = {}

with open(USER_FILE) as f:
	for line in f:
		u = json.loads(line)
		fans_map[u["user_id"]] = u.get("fans", 0)

# stability check - no. of fans = no. of users
len(fans_map)

1987897

And next a business $\leftarrow$ (reviewing) users mapping - also, we'll:

- Check for dangling reviews from deleted users
- Filter for data after January 1, 2017, just as an arbitrary "makes sense" cutoff

We'll first define the cutoff:

In [48]:
from datetime import datetime

cutoff = datetime(2017, 1, 1)

And then create the map:

In [49]:
from collections import defaultdict

REVIEW_FILE = "local_data/raw/review.ndjson"

reviewers_map = defaultdict(set)

with open(REVIEW_FILE) as f:
	for line in f:
		r = json.loads(line)

		# parse datetime
		dt = datetime.strptime(r["date"], "%Y-%m-%d %H:%M:%S")

		# add user if they exist in user.ndjson (proxied by fans_map) and the review is after the cutoff
		if r["user_id"] in fans_map and dt > cutoff:
			reviewers_map[r["business_id"]].add(r["user_id"])

# stability check - no. of entries <= no. of businesses
len(reviewers_map)

135542
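
As a small illustration of the two conditions (known user, date after the cutoff), here's a sketch on made-up reviews. The IDs and dates are hypothetical, and the cutoff matches the one defined in the code above:

```python
from collections import defaultdict
from datetime import datetime

cutoff = datetime(2017, 1, 1)

# hypothetical stand-ins for fans_map and a few lines of review.ndjson
toy_fans = {"u1": 3, "u2": 0}
toy_reviews = [
    {"business_id": "b1", "user_id": "u1", "date": "2018-05-01 12:00:00"},  # kept
    {"business_id": "b1", "user_id": "u2", "date": "2016-12-31 23:59:59"},  # before cutoff
    {"business_id": "b2", "user_id": "u9", "date": "2019-01-01 00:00:00"},  # unknown user
]

toy_reviewers = defaultdict(set)
for r in toy_reviews:
    dt = datetime.strptime(r["date"], "%Y-%m-%d %H:%M:%S")
    if r["user_id"] in toy_fans and dt > cutoff:
        toy_reviewers[r["business_id"]].add(r["user_id"])

print(dict(toy_reviewers))  # {'b1': {'u1'}}
```

Note that `b2` never becomes a key, because its only review is filtered out. This is one reason the map can end up with fewer entries than there are businesses.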

Finally, we take the top $n$% of reviewers (by fan count) for each business. For now $n = 100$, so every reviewer is kept:

In [None]:
import math

final_users = set()

for b, users in reviewers_map.items():

	# sort users based on popularity (fan count), most popular first
	sorted_users = sorted(users, key=lambda uid: fans_map.get(uid, 0), reverse=True)

	# get top 100% and add to set (lower the 1.0 to keep a smaller fraction)
	final_users.update(sorted_users[:math.ceil(len(sorted_users) * 1.0)])

# count no. of chosen users
len(final_users)

1357246
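
With the fraction set to 100%, every reviewer is kept. To show how a smaller fraction would behave, here's a sketch on toy data; `top_fraction_users` and `fraction` are hypothetical names, not part of the notebook:

```python
import math

def top_fraction_users(reviewers_map, fans_map, fraction):
    """Union, over all businesses, of each business's top `fraction` reviewers by fan count."""
    selected = set()
    for users in reviewers_map.values():
        ranked = sorted(users, key=lambda uid: fans_map.get(uid, 0), reverse=True)
        # ceil keeps at least one reviewer per business for any fraction > 0
        selected.update(ranked[:math.ceil(len(ranked) * fraction)])
    return selected

# hypothetical toy data
toy_fans = {"a": 10, "b": 5, "c": 1, "d": 0}
toy_reviewers = {"b1": {"a", "b", "c", "d"}, "b2": {"d"}}

print(sorted(top_fraction_users(toy_reviewers, toy_fans, 0.5)))  # ['a', 'b', 'd']
print(sorted(top_fraction_users(toy_reviewers, toy_fans, 1.0)))  # ['a', 'b', 'c', 'd']
```

Because the selection is per business, a user with few fans (like `d`) can still be chosen if they are among the top reviewers of a sparsely reviewed business.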

And now to create the preprocessed `.ndjson` file itself:

In [51]:
with open(USER_FILE) as fin, open(f"{PREPROC_DIR}/user.ndjson","w") as fout:
	for line in fin:
		u = json.loads(line)

		# select user if in final selection
		if u["user_id"] in final_users:	

			# make dates ISO compliant
			dt = datetime.strptime(u["yelping_since"], "%Y-%m-%d %H:%M:%S")
			u["yelping_since"] = dt.isoformat() + "Z"

			# convert elite years string to list and fix 2020 error
			elite = u.get("elite")
			u["elite"] = list(set(2020 if int(y) == 20 else int(y) for y in elite.split(","))) if len(elite) else []

			# convert friends string to list, filter on final users
			friends = u.get("friends")
			u["friends"] = [f.strip() for f in friends.split(",") if f.strip() in final_users]
			
			fout.write(json.dumps(u) + "\n")
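
The elite-year fix is easiest to see on an example. In this sketch, we assume the raw field can look like `"2019,20,20,2021"`, i.e. that 2020 shows up as split `20` entries, which is what the `== 20` check above targets. The sample strings are made up, the helper names are ours, and the years are sorted here only so the printed output is deterministic:

```python
from datetime import datetime

def parse_elite(elite):
    """Turn the comma-separated elite string into a sorted list of unique years."""
    if not elite:
        return []
    return sorted({2020 if int(y) == 20 else int(y) for y in elite.split(",")})

def to_iso(ts):
    """Convert 'YYYY-MM-DD HH:MM:SS' to ISO 8601 with a trailing 'Z', as in the cell above."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").isoformat() + "Z"

print(parse_elite("2019,20,20,2021"))  # [2019, 2020, 2021]
print(parse_elite(""))                 # []
print(to_iso("2013-04-05 18:30:00"))   # 2013-04-05T18:30:00Z
```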

## Reviews

Counting the total number of reviews:

In [52]:
count = 0

with open(REVIEW_FILE) as f:
	for line in f: count += 1

print(count)

6990280


### Filtering

We'll go ahead and keep only reviews written by our selected users and dated after the cutoff:

In [53]:
count = 0
unique_reviewers = set()

with open(REVIEW_FILE) as fin, open(f"{PREPROC_DIR}/review.ndjson","w") as fout:
	for line in fin:
		r = json.loads(line)

		# parse datetime
		dt = datetime.strptime(r["date"], "%Y-%m-%d %H:%M:%S")

		# get reviewer user id
		reviewer_id = r["user_id"]

		# select review if author in final selection and review is after the cutoff
		if reviewer_id in final_users and dt > cutoff:
			count += 1
			unique_reviewers.add(reviewer_id)
			
			# make dates ISO compliant
			r["date"] = dt.isoformat() + "Z"

			fout.write(json.dumps(r) + "\n")

# stability check - no. of reviewers = no. of selected users
print(f"No. of reviewers: {len(unique_reviewers)}")

print(f"No. of reviews: {count}")

No. of reviewers: 1357246
No. of reviews: 3838072


## Final Checks

We'll now run some final checks to verify any remaining assumptions about the data.

### Bidirectional Friendship

We'll check whether friendship is symmetric: if $A$ lists $B$ as a friend, does $B$ also list $A$? We'll create a user $\to$ friends map first:

In [54]:
friend_map = {}

with open(f"{PREPROC_DIR}/user.ndjson") as f:
	for line in f:
		u = json.loads(line)
		
		friend_map[u["user_id"]] = u["friends"]

And then run the verification:

In [55]:
for user, friends in friend_map.items():
	for friend in friends:
		if user not in friend_map[friend]: print(user)

No output, so friendship is symmetric among our selected users.
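
For intuition, here's the same check on a toy map where one edge is deliberately one-sided. The data is made up, and we convert friend lists to sets so each membership test is constant-time, which may matter at the scale above:

```python
# hypothetical toy data: "c" lists "a", but "a" does not list "c"
toy_friend_map = {
    "a": ["b"],
    "b": ["a"],
    "c": ["a"],
}

# sets make the `in` test constant-time instead of scanning a list
friend_sets = {user: set(friends) for user, friends in toy_friend_map.items()}

asymmetric = [
    (user, friend)
    for user, friends in friend_sets.items()
    for friend in friends
    if user not in friend_sets.get(friend, set())
]

print(asymmetric)  # [('c', 'a')]
```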

### User Relevancy

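How important is a user to our inference task? As a rough proxy, we count users who have no friends left in the selection and fewer than two reviews, since such users likely add little connectivity to the graph. We first build a user $\to$ review count map: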

In [65]:
reviews_map = {}

with open(f"{PREPROC_DIR}/user.ndjson") as f:
	for line in f:
		u = json.loads(line)
		reviews_map[u["user_id"]] = u.get("review_count", 0)

# stability check - no. of review counts = no. of users
len(reviews_map)

1357246

In [63]:
count = 0

for user, friends in friend_map.items():
	if not len(friends) and reviews_map[user] < 2: count += 1

count

170917
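
To put that count in context, here's a quick calculation using the two figures printed above (the variable names are ours):

```python
low_signal_users = 170917   # users with no selected friends and fewer than 2 reviews
selected_users = 1357246    # total users in the final selection

share = low_signal_users / selected_users
print(f"{share:.1%}")  # 12.6%
```

So roughly one in eight selected users falls into this low-connectivity group.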