# Dataset Setup

This notebook loads and parses the JSON file containing the Yelp business data to create a well-formed dataset.

In [1]:
import numpy as np
import pandas as pd
import json
import os
from ast import literal_eval

In [2]:
DATA_ROOT = './data'
BUSINESS_FNAME = 'business.json'

CATEGORIES = {
    'Active Life', 'Arts & Entertainment', 'Automotive',
    'Beauty & Spas', 'Education', 'Event Planning & Services',
    'Financial Services', 'Food', 'Health & Medical',
    'Home Services', 'Hotels & Travel', 'Local Flavor',
    'Local Services', 'Mass Media', 'Nightlife', 'Pets',
    'Professional Services', 'Public Services & Government',
    'Real Estate', 'Religious Organizations', 'Restaurants',
    'Shopping'
}

# Drop rows that have NA in these columns
DROP_NA_COLS = ['business_id', 'categories', 'longitude', 'latitude']
# Do not use these columns when constructing POI representation
EXCLUDE_COLS = ['business_id', 'categories', 'longitude', 'latitude']

TEST_SIZE = 0.2
SEED = 2019

Load the data into a dataframe. Each line of the input JSON file holds one business record with nested fields, so we define helper functions that flatten the nesting into dotted column names. We drop POIs (points of interest) that lack a 'business_id', 'categories', 'longitude' or 'latitude' value, and we keep only the 22 main categories listed in 'CATEGORIES' as labels.

In [3]:
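def literal(val):
    """Parse a string as a Python literal; return the value unchanged on failure."""
    try:
        return literal_eval(val)
    except (ValueError, SyntaxError):
        return val

def flatten_dict(dd, separator='.', prefix=''):
    """Flatten nested (possibly stringified) dicts into one level with dotted keys."""
    parsed = literal(dd)
    if not isinstance(parsed, dict):
        # Leaf value: keep the original, unparsed value under the accumulated key.
        return {prefix: dd}
    return {prefix + separator + k if prefix else k: v
            for kk, vv in parsed.items()
            for k, v in flatten_dict(vv, separator, kk).items()}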


with open(os.path.join(DATA_ROOT, BUSINESS_FNAME)) as f:
    data = [flatten_dict(json.loads(line)) for line in f]
df = pd.DataFrame(data)
df = df.dropna(subset=DROP_NA_COLS).reset_index(drop=True)
df = df.fillna('')
df['categories'] = df['categories'].apply(
    lambda x: ', '.join([c for c in x.split(', ') if c in CATEGORIES]))
df.shape

(192127, 109)
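
To make the flattening step concrete, here is a minimal sketch that applies the same two helpers to a single record. The record is hypothetical: its field names only imitate the shape of a line in 'business.json', including a nested dict that arrives as a string.

```python
from ast import literal_eval


def literal(val):
    """Parse a string as a Python literal; return the value unchanged on failure."""
    try:
        return literal_eval(val)
    except (ValueError, SyntaxError):
        return val


def flatten_dict(dd, separator='.', prefix=''):
    """Flatten nested (possibly stringified) dicts into one level with dotted keys."""
    parsed = literal(dd)
    if not isinstance(parsed, dict):
        return {prefix: dd}
    return {
        prefix + separator + k if prefix else k: v
        for kk, vv in parsed.items()
        for k, v in flatten_dict(vv, separator, kk).items()
    }


# Hypothetical record (assumption): shaped like one line of business.json.
record = {
    'business_id': 'abc123',
    'attributes': {
        'WiFi': 'free',
        # Nested dict stored as a string, which literal() turns back into a dict.
        'BusinessParking': "{'garage': False, 'lot': True}",
    },
    'hours': None,
}

flat = flatten_dict(record)
for key, value in flat.items():
    print(key, '->', repr(value))
```

Each nesting level contributes one segment to the key, so the stringified parking dict ends up under 'attributes.BusinessParking.garage' and 'attributes.BusinessParking.lot'. Missing values such as 'hours' stay as None and are replaced by '' later through 'fillna'.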

For each POI, we extract a sequential representation from its attributes. This representation is the concatenation of the POI's attributes as tokens in the format 'attr_name.attr_value'. Attribute values with more than one word are split on whitespace, producing one token per word, and empty values are skipped.

In [4]:
def extract_attr_sequences(df):
    cols = df.columns
    attrs = df.apply(
        lambda x: ' '.join([col + '.' + v for col, val in zip(cols, x)
                            if val != '' for v in str(val).split()]), axis=1)
    return attrs

df['sequence'] = extract_attr_sequences(df.drop(EXCLUDE_COLS, axis=1))
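
As an illustration of the resulting token format, the sketch below runs the same extraction logic on a tiny dataframe. The frame and its values are made up for the example; only the column-naming style follows the flattened data.

```python
import pandas as pd


def extract_attr_sequences(df):
    """Build one space-separated 'column.token' sequence per row."""
    cols = df.columns
    return df.apply(
        lambda row: ' '.join(col + '.' + tok
                             for col, val in zip(cols, row)
                             if val != ''              # skip empty attributes
                             for tok in str(val).split()),  # one token per word
        axis=1)


# Hypothetical two-row frame (assumption): not taken from the Yelp data.
toy = pd.DataFrame({
    'name': ['Joe Pizza', 'City Gym'],
    'attributes.WiFi': ['free', ''],
    'stars': [4.5, 3.0],
})

sequences = extract_attr_sequences(toy)
print(sequences[0])  # name.Joe name.Pizza attributes.WiFi.free stars.4.5
print(sequences[1])  # name.City name.Gym stars.3.0
```

The second row has an empty 'attributes.WiFi' value, so that attribute contributes no token, while the two-word names contribute one token per word.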

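Split the dataset into training and test sets. The split is random, with the test fraction set by 'TEST_SIZE' and the random seed set by 'SEED'.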

In [5]:
train_df = df.sample(frac=1-TEST_SIZE, random_state=SEED)
test_df = df.drop(train_df.index).reset_index(drop=True)
train_df = train_df.reset_index(drop=True)

train_df.to_csv(os.path.join(DATA_ROOT, 'train.csv'), index=False)
test_df.to_csv(os.path.join(DATA_ROOT, 'test.csv'), index=False)

train_df.shape, test_df.shape

((153702, 110), (38425, 110))
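
The sample-then-drop pattern used for the split can be checked on a small stand-in dataframe. This is only a sketch with made-up rows. It relies on the index being unique, which is why the index is reset after dropping rows earlier in the notebook.

```python
import pandas as pd

TEST_SIZE = 0.2
SEED = 2019

# Hypothetical stand-in (assumption) for the full dataframe.
toy = pd.DataFrame({'business_id': [f'b{i}' for i in range(10)]})

train = toy.sample(frac=1 - TEST_SIZE, random_state=SEED)
test = toy.drop(train.index)  # everything that was not sampled

# The two parts are disjoint and together cover every row.
assert set(train.index).isdisjoint(test.index)
assert len(train) + len(test) == len(toy)

# Sampling again with the same seed selects the same rows.
again = toy.sample(frac=1 - TEST_SIZE, random_state=SEED)
assert train.index.equals(again.index)

print(len(train), len(test))  # 8 2
```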