# Categories

This notebook processes the categories data in the Yelp restaurant dataset.

**Goal:** Provide a categories_processing function, which can be used in main.ipynb.

## 1. Get the categories dataset

The categories dataset contains only the id, name, categories, rating, review_count and price columns. Its extraction is shown in main.ipynb.

In [1]:
import pickle
import numpy as np
import pandas as pd

data_file = '../Dataset/categories_data'

with open(data_file, 'rb') as inFile:
    data = pickle.load(inFile)

display(data.head())
print("data.shape: {0}".format(data.shape))

Unnamed: 0,id,name,categories,rating,review_count,price
0,bwCj2AcoOroZfCTxb6rCcg,A Better Burger,"[{'alias': 'burgers', 'title': 'Burgers'}, {'a...",3.5,6,2
1,S9S9kFJSkmfpbjFForCWLQ,El Castillo,"[{'alias': 'mexican', 'title': 'Mexican'}]",4.0,2,1
2,np8uV1xll22Yr-Q-B-ImkA,Rooster's Pub,"[{'alias': 'restaurants', 'title': 'Restaurant...",4.5,4,1
3,HGY1ojoLu07P_ky2LeRguQ,Redstone Restaurant,"[{'alias': 'newamerican', 'title': 'American (...",4.5,3,1
4,J5XS3VmxnLKhNlpiwDJ-3A,Little Mexico,"[{'alias': 'mexican', 'title': 'Mexican'}]",4.0,5,1


data.shape: (5119, 6)


## 2. Transform the categories feature into categorical features

As shown above, each entry in the categories column of the original dataset is a list of dictionaries, each holding an alias and a title. To incorporate this information into our classification model, we need to transform it into categorical features. We first collect the distinct titles.

In [2]:
title = []
for cates in data.categories:
    for cate in cates:
        title.append(cate['title'])
title = list(set(title))
print('Total number of distinct categories:',len(title))
print(title)

Total number of distinct categories: 213
['Food Stands', 'Local Flavor', 'Greek', 'Pubs', 'Halal', 'Venues & Event Spaces', 'Karaoke', 'Poke', 'Empanadas', 'International Grocery', 'Kosher', 'Live/Raw Food', 'Caribbean', 'Hookah Bars', 'Dominican', 'Fondue', 'Coffee Roasteries', 'Coffee & Tea', 'Street Vendors', 'Tuscan', 'Cafes', 'Spanish', 'Venezuelan', 'Egyptian', 'Seafood', 'Japanese Curry', 'Boat Charters', 'Pasta Shops', 'New Mexican Cuisine', 'Pretzels', 'Cooking Classes', 'Noodles', 'Tapas Bars', 'Cocktail Bars', 'Jazz & Blues', 'Mediterranean', 'Asian Fusion', 'Hawaiian', 'Comfort Food', 'Diners', 'Butcher', 'Active Life', 'Fish & Chips', 'Chinese', 'Whiskey Bars', 'Arcades', 'Brazilian', 'Malaysian', 'Izakaya', 'Food Trucks', 'Internet Cafes', 'Creperies', 'Health Markets', 'Peruvian', 'Imported Food', 'Belgian', 'Kebab', 'Music Venues', 'Vegetarian', 'Pan Asian', 'Laotian', 'Thrift Stores', 'Shaved Ice', 'Wine Tasting Room', 'Custom Cakes', 'Italian', 'Desserts', 'Beer, Wine
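Before grouping the titles, it can help to see how often each one occurs, since rare titles are natural candidates for merging into a broader group. The following is a minimal sketch with collections.Counter. It uses a few made-up rows in the same list-of-dictionaries shape as the categories column; toy_categories and its contents are illustrative and are not taken from the dataset.

```python
from collections import Counter

# Toy rows shaped like data.categories; the values are made up for illustration.
toy_categories = [
    [{'alias': 'burgers', 'title': 'Burgers'}, {'alias': 'fastfood', 'title': 'Fast Food'}],
    [{'alias': 'mexican', 'title': 'Mexican'}],
    [{'alias': 'mexican', 'title': 'Mexican'}, {'alias': 'tacos', 'title': 'Tacos'}],
]

# Count how many times each title appears across all restaurants.
title_counts = Counter(cate['title'] for cates in toy_categories for cate in cates)

# Print the titles, most frequent first.
for title, count in title_counts.most_common():
    print(title, count)
```

On the real data, the same expression would run over data.categories instead of toy_categories.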

Group the 213 categories into 19 more general categories. A title that appears in none of the lists below contributes to no column.

In [3]:
food_dict ={
'french_food': ['French', 'Creperies', 'Cajun/Creole'],
'north_american_food': ['Tex-Mex', 'Colombian','Hawaiian', 'Dominican','American (New)', 'Caribbean', 'American (Traditional)', 'Salvadoran', 'Southern', 'Puerto Rican'],
'european_food': ['Portuguese', 'Tuscan','German', 'British', 'Turkish','Italian', 'Mediterranean', 'Belgian', 'Spanish', 'Modern European', 'Irish', 'Poutineries', 'Greek'],
'south_american_food': ['Mexican', 'Cuban', 'Peruvian','Latin American', 'Venezuelan', 'Argentine', 'New Mexican Cuisine', 'Honduran', 'Brazilian'],
'african_food': ['African', 'South African', 'Moroccan', 'Ethiopian'],
'east_asian_food': ['Shanghainese','Chinese','Izakaya', 'Japanese', 'Guamanian','Cantonese','Sushi Bars','Teppanyaki', 'Szechuan', 'Korean','Japanese Curry', 'Hot Pot', 'Taiwanese', 'Fondue'],
'south_asian_food': ['Thai','Pakistani', 'Malaysian', 'Bangladeshi','Indonesian', 'Indian',  'Burmese', 'Pan Asian', 'Laotian', 'Filipino', 'Vietnamese','Asian Fusion', 'Himalayan/Nepalese','Mongolian'],
'middle_eastern': ['Persian/Iranian', 'Egyptian', 'Middle Eastern', 'Arabian', 'Lebanese', 'Afghan'],
'dinner': ['Dinner Theater', 'Diners'],
'vegterian': ['Salad', 'Vegan', 'Vegetarian', 'Fruits & Veggies'],
'snacks': ['Shaved Ice','Ice Cream & Frozen Yogurt','Bagels', 'Empanadas', 'Cupcakes', 'Tacos', 'Waffles', 'Custom Cakes', 'Desserts', 'Gelato', 'Bakeries', 'Pretzels', 'Donuts'],
'seafood': ['Seafood', 'Seafood Markets'],
'bars_and_clubs': ['Distilleries', 'Dive Bars', 'Whiskey Bars', 'Beer Gardens', 'Bars', 'Irish Pub', 'Hookah Bars','Brasseries', 'Tapas Bars', 'Wine Tasting Room', 'Breweries', 'Brewpubs','Beer, Wine & Spirits', 'Wine Bars', 'Tiki Bars', 'Wineries', 'Cocktail Bars', 'Beer Bar', 'Pubs'],
'steaks': ['Cheesesteaks', 'Steakhouses'],
'cafes': ['Cafes', 'Bubble Tea','Tea Rooms','Internet Cafes', 'Coffee & Tea', 'Coffee Roasteries'],
'fastfood': ['Fish & Chips', 'Wraps','Food Stands','Pizza', 'Burgers', 'Fast Food', 'Sandwiches', 'Hot Dogs', 'Food Trucks', 'Street Vendors'],
'breakfast': ['Bed & Breakfast', 'Breakfast & Brunch'],
'other_service': ['Public Art', 'Golf','Lounges', 'Cooking Classes','Home Decor', 'Boat Charters', 'Pool Halls','Day Spas','Bowling','Music Venues','Petting Zoos', 'Food Delivery Services', 'Karaoke'],
'other_food': ['Local Flavor', 'Tapas/Small Plates', 'Dim Sum', 'Buffets', 'Noodles', 'Ramen', 'Barbeque', 'Kebab', 'Poke', 'Kebab', 'Chicken Wings', 'Kosher', 'Juice Bars & Smoothies', 'Halal', 'Specialty Food', 'Live/Raw Food','Do-It-Yourself Food', 'Imported Food','Comfort Food', 'Soul Food','Soup']
}
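A hand-written mapping like this is easy to get slightly wrong: some titles in the printed list (for example 'Active Life') appear in no group, and 'Kebab' is listed twice under other_food, which is harmless for membership tests. A possible sanity check, sketched here on small stand-ins rather than the real food_dict and title list (toy_food_dict and toy_titles are hypothetical names), inverts the mapping and reports unmapped and multiply-assigned titles.

```python
# Small stand-in for food_dict; the real dictionary has 19 groups.
toy_food_dict = {
    'fastfood': ['Burgers', 'Fast Food', 'Pizza'],
    'south_american_food': ['Mexican', 'Peruvian'],
    'other_food': ['Kebab', 'Poke', 'Kebab'],  # duplicate entry on purpose
}
# Stand-in for the list of distinct titles found in the data.
toy_titles = ['Burgers', 'Mexican', 'Poke', 'Active Life']

# Invert the mapping: title -> list of groups that contain it.
title_to_groups = {}
for group, titles in toy_food_dict.items():
    for t in set(titles):  # set() drops duplicates inside one group
        title_to_groups.setdefault(t, []).append(group)

# Titles present in the data but assigned to no group.
unmapped = sorted(t for t in toy_titles if t not in title_to_groups)
# Titles assigned to more than one group; these would be counted in several columns.
ambiguous = sorted(t for t, groups in title_to_groups.items() if len(groups) > 1)

print('unmapped:', unmapped)
print('ambiguous:', ambiguous)
```

Run against the real dictionary and the title list from the previous cell, the unmapped list would show which titles are silently ignored by the count columns.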

In [4]:
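# For each general group, count how many of a restaurant's category titles fall into it.
for food_cate, titles in food_dict.items():
    title_set = set(titles)  # set membership avoids scanning the list for every title
    data[food_cate] = [
        sum(cate['title'] in title_set for cate in cates)
        for cates in data.categories
    ]
# The raw categories column is no longer needed once the count columns exist.
data_new = data.drop(columns=['categories'])
data_new.head()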

Unnamed: 0,id,name,rating,review_count,price,french_food,north_american_food,european_food,south_american_food,african_food,...,vegterian,snacks,seafood,bars_and_clubs,steaks,cafes,fastfood,breakfast,other_service,other_food
0,bwCj2AcoOroZfCTxb6rCcg,A Better Burger,3.5,6,2,0,0,0,0,0,...,0,0,0,0,0,0,3,0,0,0
1,S9S9kFJSkmfpbjFForCWLQ,El Castillo,4.0,2,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,np8uV1xll22Yr-Q-B-ImkA,Rooster's Pub,4.5,4,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,HGY1ojoLu07P_ky2LeRguQ,Redstone Restaurant,4.5,3,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,J5XS3VmxnLKhNlpiwDJ-3A,Little Mexico,4.0,5,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
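The steps above could be wrapped into the categories_processing function named in the goal. The sketch below is one possible shape for it. The signature (a DataFrame plus the group dictionary) and the decision to return a copy are assumptions, since the notebook only names the function; the toy DataFrame and toy_food_dict are made up for illustration.

```python
import pandas as pd

def categories_processing(data, food_dict):
    """Return a copy of data with one count column per group and no categories column.

    Assumed interface: the source only names this function, not its arguments.
    """
    out = data.copy()  # leave the caller's DataFrame untouched
    for group, titles in food_dict.items():
        title_set = set(titles)
        # Number of this restaurant's titles that belong to the group.
        out[group] = [
            sum(cate['title'] in title_set for cate in cates)
            for cates in out['categories']
        ]
    return out.drop(columns=['categories'])

# Toy input shaped like the categories dataset; values are illustrative only.
toy = pd.DataFrame({
    'id': ['a', 'b'],
    'categories': [
        [{'alias': 'burgers', 'title': 'Burgers'}, {'alias': 'pizza', 'title': 'Pizza'}],
        [{'alias': 'mexican', 'title': 'Mexican'}],
    ],
    'price': [2, 1],
})
toy_food_dict = {'fastfood': ['Burgers', 'Pizza'], 'south_american_food': ['Mexican']}

result = categories_processing(toy, toy_food_dict)
print(result)
```

Working on a copy means the function can be called repeatedly from main.ipynb without accumulating columns on the original DataFrame, which the in-place loop above does.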


## 3. Train the classifier with the categorical features

To check whether the categorical features are useful for our classification model, we train the classifiers on the numerical data combined with the categorical data.

In [5]:
from sklearn.model_selection import train_test_split

# Target: price level. Features: rating, review_count and the 19 category-count columns.
y = data_new['price'].values
X = data_new.drop(columns=['id', 'name', 'price'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
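from sklearn.model_selection import cross_val_score

# Each model is fitted on the whole training split for the test-set scores in the next cell;
# cross_val_score clones the estimator and reports 5-fold accuracy within the training split.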
nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(nb, X_train, y_train, cv=5)
print('BernoulliNB Avg_Acc:',np.mean(nb_scores))

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dt_scores = cross_val_score(dt, X_train, y_train, cv=5)
print('DecisionTreeClassifier Avg_Acc:',np.mean(dt_scores))

svc = LinearSVC(multi_class='ovr', max_iter=10000)
svc.fit(X_train, y_train)
svc_scores = cross_val_score(svc, X_train, y_train, cv=5)
print('LinearSVC Avg_Acc:',np.mean(svc_scores))

lr = LogisticRegression(solver='newton-cg', multi_class='multinomial')
lr.fit(X_train, y_train)
lr_scores = cross_val_score(lr, X_train, y_train, cv=5)
print('LogisticRegression Avg_Acc:',np.mean(lr_scores))

rf = RandomForestClassifier(n_estimators=200)
rf.fit(X_train, y_train)
rf_scores = cross_val_score(rf, X_train, y_train, cv=5)
print('RandomForestClassifier Avg_Acc:', np.mean(rf_scores))

nn = MLPClassifier(max_iter=10000)
nn.fit(X_train, y_train)
nn_scores = cross_val_score(nn, X_train, y_train, cv=5)
print('MLPClassifier Avg_Acc:', np.mean(nn_scores))

BernoulliNB Avg_Acc: 0.6590736736082271
DecisionTreeClassifier Avg_Acc: 0.6468725861571464
LinearSVC Avg_Acc: 0.7328387822959084
LogisticRegression Avg_Acc: 0.7176989176002644
RandomForestClassifier Avg_Acc: 0.6947225404817676
MLPClassifier Avg_Acc: 0.7279550519221489


In [7]:
print('BernoulliNB:', nb.score(X_test, y_test))
print('DecisionTreeClassifier:', dt.score(X_test, y_test))
print('LinearSVC:', svc.score(X_test, y_test))
print('LogisticRegression:', lr.score(X_test, y_test))
print('RandomForestClassifier:', rf.score(X_test, y_test))
print('MLPClassifier:', nn.score(X_test, y_test))

BernoulliNB: 0.62890625
DecisionTreeClassifier: 0.626953125
LinearSVC: 0.6630859375
LogisticRegression: 0.693359375
RandomForestClassifier: 0.6806640625
MLPClassifier: 0.7021484375


The classification accuracy improves after adding these categorical features, hence we incorporate the categorical data into our classification model.
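A conclusion of this kind rests on comparing the same classifier with and without the category columns; the baseline scores are not shown in this section. The sketch below illustrates how such a comparison could be set up with cross_val_score. It runs on synthetic data in which the label is deliberately tied to one category column, so its numbers say nothing about the Yelp data, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-ins: two numerical columns and three category-count columns.
numeric = rng.normal(size=(n, 2))
category_counts = rng.integers(0, 2, size=(n, 3))

# Made-up label: follows the first category column 90% of the time,
# so the category features carry signal in this toy setup.
flip = rng.random(n) >= 0.9
y = np.where(flip, 1 - category_counts[:, 0], category_counts[:, 0])

X_base = numeric                                # numerical features only
X_full = np.hstack([numeric, category_counts])  # numerical + category features

# Same model and same 5-fold protocol for both feature sets.
acc_base = np.mean(cross_val_score(LogisticRegression(), X_base, y, cv=5))
acc_full = np.mean(cross_val_score(LogisticRegression(), X_full, y, cv=5))

print('without category features:', acc_base)
print('with category features:', acc_full)
```

On the real data, X_base would hold only rating and review_count, and X_full would be the X used in the cells above.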