## Automatically Categorizing Yelp Businesses

Build a baseline based on the article from Yelp Software Team, [Automatically Categorizing Yelp Businesses](https://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html)  
Not using text information, guess the multi-label assignment of business. Here, all sample set contains 'Chinese' tag and try to see if the model can detect more subtle labeling.

In [81]:
from utils import * 
import pickle
import numpy as np
import random
import matplotlib.pyplot as plt
from co_occurrence_net.category_map import CategoryMap
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

In [2]:
# load data
chinese_business = pd.read_csv('chinese_business.csv', index_col = False)
chinese_reviews = pd.read_csv('chinese_review_clean.csv', index_col = False)
chinese_business =  [eval(i) for i in chinese_business['categories']]

In [37]:
G = CategoryMap()
G.build_graph(chinese_business['categories'])

In [42]:
G.get_subcategories('Chinese')[:10]

[('Asian Fusion', 449),
 ('Food', 398),
 ('Fast Food', 260),
 ('Thai', 237),
 ('Dim Sum', 230),
 ('Buffets', 211),
 ('Japanese', 179),
 ('Seafood', 163),
 ('Sushi Bars', 158),
 ('Specialty Food', 134)]

## A Series of Binary Classifiers: One for Each Category

is a business of a given category 

We extract terms from names and reviews, using standard lexical analysis techniques of tokenization, normalization (e.g. lowercasing), and stop word filtering. If the business has been categorized as part of a chain (which we’ll describe in an upcoming blog post!) we’ll include that chain’s URL as a feature, and if the business has NAICS codes from one of our data partners, we’ll include those as well.

In [84]:
def data_split(business_df, review_df, topic):
    '''
    Split the original data into 2 classes, ones that includes topic label and don't
    '''
    print ('topic: {}'.format(topic))
    includes = set()
    not_includes = set()
    for i, topics in enumerate(business_df['categories']):
        if topic in topics:
            includes.add(business_df.iloc[i]['business_id'])
        else:
            not_includes.add(business_df.iloc[i]['business_id'])
    review_included = review_df.loc[review_df['business_id'].isin(includes)]
    review_not_included = review_df.loc[review_df['business_id'].isin(not_includes)]
    print ('include topic:     {} business, {} reviews'.format(len(includes), len(review_included)))
    print ('not include topic: {} business, {} reviews'.format(len(not_includes), len(review_not_included)))
    
    return review_included, review_not_included

In [85]:
t, f = data_split(chinese_business, chinese_reviews, G.get_subcategories('Chinese')[4][0])

topic: Dim Sum
include topic:     230 business, 21960 reviews
not include topic: 3545 business, 156189 reviews


Yelp used following features:
- Tokenized Name
- Tokenized Review
- NAICS(we do not have an access)
- country (we disregard)
- Last Term in Name 

In [316]:
def genereate_feature(review_counter, name_counter, review_df, business_df, business_id):
    '''
    '''    
    # filter the restaurant name
    name = business_df.loc[business_df['business_id'] == business_id]['name']

    # filter the reviews for the specified business
    review = review_df.loc[review_df['business_id'] == business_id]['text']
    # extract the last word of the restaurant
    last_name = pd.Series(name_counter.build_analyzer()(name.values[0])[-1])

    # feature length 
    name_length = len(name_counter.get_feature_names()) 
    review_length = len(review_counter.get_feature_names())
    
    # NAME + LAST NAME + REVIEW
    name_feature = np.zeros(name_length)
    last_name_feature = np.zeros(name_length)
    review_feature = np.zeros(review_length)

    for r, d in zip(name_counter.transform(name).indices, name_counter.transform(name).data):
        name_feature[r] = d

    for r, d in zip(name_counter.transform(last_name).indices, name_counter.transform(last_name).data):
        last_name_feature[r] = d
    
    for r, d in zip(review_counter.transform(review).indices, review_counter.transform(review).data):
        review_feature[r] = d
    
    feature = np.array(np.concatenate((name_feature, last_name_feature, review_feature),axis = 0))
    return feature

In [330]:
def create_data_set(review_counter, name_counter, review_df, business_df, topic):
    '''
    '''
    # select business with topic tag
    t, f = data_split(business_df, review_df, topic)
    t_in =  (set(t['business_id']))
    t_not_in =  (set(f['business_id']))
    
    # output dimension
    name_length = len(name_counter.get_feature_names()) 
    review_length = len(review_counter.get_feature_names())
    feature_length = 2*name_length + review_length
    
    X_in = np.array([])
    Y_in = np.array([])
    
    for b in t_in:
        feature = genereate_feature(review_counter, name_counter, review_df, business_df, b)
        X_in = np.append(X_in, feature)
        Y_in = np.append(Y_in, 1)
    print ('positive set done')
    
    X_out = np.array([])
    Y_out = np.array([])
    
    for b in t_not_in:
        feature = genereate_feature(review_counter, name_counter, review_df, business_df, b)
        X_out = np.append(X_out, feature)
        Y_out = np.append(Y_out, 0)
    print ('negative set done')
    
    
    return X_in.reshape(len(t_in), feature_length), Y_in.reshape(-1,1), X_out.reshape(len(t_not_in), feature_length), Y_out.reshape(-1,1)

In [331]:
genereate_feature(review_counter, name_counter, chinese_reviews, chinese_business, 'OygJyqypKFZJIZ6r9dML7w')

array([ 0.,  0.,  0., ...,  0.,  0.,  0.])

In [332]:
create_data_set(review_counter, name_counter, chinese_reviews, chinese_business,'Dim Sum')

topic: Dim Sum
include topic:     230 business, 21960 reviews
not include topic: 3545 business, 156189 reviews
positive set done


IndexError: list index out of range

In [275]:
X = np.append(X, 1)

In [327]:
np.arange(12).reshape(1,-1)

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]])