# Categories

This file processes the categories data in the Yelp restaurant dataset.

**Goal:** Provide a categories_processing function, which can be used in main.ipynb.

## 1. Get the categories dataset

The categories dataset contains only the id, name, categories, rating, review_count, and price columns. Its extraction is shown in main.ipynb.

In [1]:
import pickle
import numpy as np
import pandas as pd

all_data_file = '../Dataset/categories_data'
train_data_file = '../Dataset/categories_data_train'
test_data_file = '../Dataset/categories_data_test'

with open(all_data_file, 'rb') as inFile:
    data = pickle.load(inFile)
with open(train_data_file, 'rb') as inFile:
    train_data = pickle.load(inFile)
with open(test_data_file, 'rb') as inFile:
    test_data = pickle.load(inFile)

display(data.head())
print("train_data.shape: {0}".format(train_data.shape))
print("test_data.shape: {0}".format(test_data.shape))

Unnamed: 0,id,name,categories,rating,review_count,price
0,bwCj2AcoOroZfCTxb6rCcg,A Better Burger,"[{'alias': 'burgers', 'title': 'Burgers'}, {'a...",3.5,6,2
1,S9S9kFJSkmfpbjFForCWLQ,El Castillo,"[{'alias': 'mexican', 'title': 'Mexican'}]",4.0,2,1
2,np8uV1xll22Yr-Q-B-ImkA,Rooster's Pub,"[{'alias': 'restaurants', 'title': 'Restaurant...",4.5,4,1
3,HGY1ojoLu07P_ky2LeRguQ,Redstone Restaurant,"[{'alias': 'newamerican', 'title': 'American (...",4.5,3,1
4,J5XS3VmxnLKhNlpiwDJ-3A,Little Mexico,"[{'alias': 'mexican', 'title': 'Mexican'}]",4.0,5,1


train_data.shape: (4095, 6)
test_data.shape: (1024, 6)


## 2. Transform the categories feature into categorical features

As shown above, each entry in the categories column of the original dataset is a list of dictionaries, each holding an alias and a title. To incorporate this feature into our classification model, we need to transform it into categorical features.

In [2]:
from collections import Counter

# Count how often each category alias and title occurs across all restaurants.
alias = Counter()
titles = Counter()
for cates in data.categories:
    for cate in cates:
        alias[cate['alias']] += 1
        titles[cate['title']] += 1

# Sort by frequency, most common first.
alias_sorted = alias.most_common()
titles_sorted = titles.most_common()

print('Top 10 ranked categories: ')
for (a, a_count), (t, t_count) in zip(alias_sorted[:10], titles_sorted[:10]):
    print('  {0}:{1}; {2}:{3}'.format(a, a_count, t, t_count))

Top 10 ranked categories: 
  tradamerican:898; American (Traditional):898
  pizza:675; Pizza:675
  burgers:522; Burgers:522
  hotdogs:519; Fast Food:519
  italian:496; Italian:496
  sandwiches:479; Sandwiches:479
  seafood:457; Seafood:457
  mexican:422; Mexican:422
  newamerican:417; American (New):417
  breakfast_brunch:376; Breakfast & Brunch:376


In [3]:
title = []
for cates in data.categories:
    for cate in cates:
        title.append(cate['title'])
title = list(set(title))
print('Total number of distinct categories:',len(title))
print(title)

Total number of distinct categories: 213
['Shanghainese', 'Convenience Stores', 'Seafood', 'Imported Food', 'Fish & Chips', 'Cuban', 'Donuts', 'Salvadoran', 'Bagels', 'Halal', 'Live/Raw Food', 'Peruvian', 'Tapas Bars', 'Dim Sum', 'Gastropubs', 'Southern', 'African', 'Specialty Food', 'French', 'Wine Bars', 'Chicken Shop', 'Dinner Theater', 'Chicken Wings', 'Home & Garden', 'Cocktail Bars', 'New Mexican Cuisine', 'Laser Tag', 'Modern European', 'Burgers', 'Soul Food', 'Cupcakes', 'Social Clubs', 'Gluten-Free', 'Creperies', 'Malaysian', 'Poutineries', 'Food', 'Sports Bars', 'Szechuan', 'Greek', 'Breweries', 'Pubs', 'Pizza', 'German', 'Hot Dogs', 'Teppanyaki', 'Hot Pot', 'Farmers Market', 'Sushi Bars', 'Cigar Bars', 'Barbeque', 'Tuscan', 'Day Spas', 'Seafood Markets', 'Vietnamese', 'Local Flavor', 'Coffee & Tea', 'Restaurants', 'Middle Eastern', 'Indonesian', 'Comedy Clubs', 'Portuguese', 'Mongolian', 'Food Tours', 'Buffets', 'Poke', 'Cheese Shops', 'Ice Cream & Frozen Yogurt', 'Grocery',

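Group the 213 distinct category titles into 20 general categories. Titles that do not appear in any of the lists below (for example 'Restaurants' or 'Food') do not contribute to any general category.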

In [4]:
food_dict ={
'french_food': ['French', 'Creperies', 'Cajun/Creole'],
'north_american_food': ['Tex-Mex', 'Colombian','Hawaiian', 'Dominican','American (New)', 'Caribbean', 'American (Traditional)', 'Salvadoran', 'Southern', 'Puerto Rican'],
'european_food': ['Portuguese', 'Tuscan','German', 'British', 'Turkish','Italian', 'Mediterranean', 'Belgian', 'Spanish', 'Modern European', 'Irish', 'Poutineries', 'Greek'],
'south_american_food': ['Mexican', 'Cuban', 'Peruvian','Latin American', 'Venezuelan', 'Argentine', 'New Mexican Cuisine', 'Honduran', 'Brazilian'],
'african_food': ['African', 'South African', 'Moroccan', 'Ethiopian'],
'east_asian_food': ['Shanghainese','Chinese','Izakaya', 'Japanese', 'Guamanian','Cantonese','Sushi Bars','Teppanyaki', 'Szechuan', 'Korean','Japanese Curry', 'Hot Pot', 'Taiwanese', 'Fondue'],
'south_asian_food': ['Thai','Pakistani', 'Malaysian', 'Bangladeshi','Indonesian', 'Indian',  'Burmese', 'Pan Asian', 'Laotian', 'Filipino', 'Vietnamese','Asian Fusion', 'Himalayan/Nepalese','Mongolian'],
'middle_eastern': ['Persian/Iranian', 'Egyptian', 'Middle Eastern', 'Arabian', 'Lebanese', 'Afghan'],
'dinner': ['Dinner Theater', 'Diners'],
'vegterian': ['Salad', 'Vegan', 'Vegetarian', 'Fruits & Veggies'],
'snacks': ['Shaved Ice','Ice Cream & Frozen Yogurt','Bagels', 'Empanadas', 'Cupcakes', 'Tacos', 'Waffles', 'Custom Cakes', 'Desserts', 'Gelato', 'Bakeries', 'Pretzels', 'Donuts'],
'seafood': ['Seafood', 'Seafood Markets'],
'bars_and_clubs': ['Distilleries', 'Dive Bars', 'Whiskey Bars', 'Beer Gardens', 'Bars', 'Irish Pub', 'Hookah Bars','Brasseries', 'Tapas Bars', 'Wine Tasting Room', 'Breweries', 'Brewpubs','Beer, Wine & Spirits', 'Wine Bars', 'Tiki Bars', 'Wineries', 'Cocktail Bars', 'Beer Bar', 'Pubs'],
'steaks': ['Cheesesteaks', 'Steakhouses'],
'cafes': ['Cafes', 'Bubble Tea','Tea Rooms','Internet Cafes', 'Coffee & Tea', 'Coffee Roasteries'],
'fastfood': ['Fish & Chips', 'Wraps', 'Pizza', 'Burgers', 'Fast Food', 'Sandwiches', 'Hot Dogs'],
'street_food': ['Food Stands', 'Food Trucks', 'Street Vendors'],
'breakfast': ['Bed & Breakfast', 'Breakfast & Brunch'],
'other_service': ['Public Art', 'Golf','Lounges', 'Cooking Classes','Home Decor', 'Boat Charters', 'Pool Halls','Day Spas','Bowling','Music Venues','Petting Zoos', 'Food Delivery Services', 'Karaoke'],
'other_food': ['Local Flavor', 'Tapas/Small Plates', 'Dim Sum', 'Buffets', 'Noodles', 'Ramen', 'Barbeque', 'Kebab', 'Poke', 'Chicken Wings', 'Kosher', 'Juice Bars & Smoothies', 'Halal', 'Specialty Food', 'Live/Raw Food','Do-It-Yourself Food', 'Imported Food','Comfort Food', 'Soul Food','Soup']
}
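Because the grouping is written by hand, it can help to check which titles it leaves out and whether any title is assigned to more than one general category. The following is a minimal sketch of such a check. The names `food_dict_small`, `all_titles`, `invert_mapping`, `unmapped_titles`, and `multiply_mapped_titles` are hypothetical and not part of the original notebook, and the data is a shortened stand-in for food_dict and the title list.

```python
# Hypothetical, shortened stand-ins for food_dict and the list of distinct titles.
food_dict_small = {
    'seafood': ['Seafood', 'Seafood Markets'],
    'fastfood': ['Pizza', 'Burgers'],
    'other_food': ['Kebab', 'Poke'],
}
all_titles = ['Seafood', 'Pizza', 'Restaurants', 'Kebab', 'Laser Tag']


def invert_mapping(mapping):
    """Return {title: [general categories]} for every title in the mapping."""
    inverted = {}
    for general, titles in mapping.items():
        for t in set(titles):  # set() ignores repeated entries within one list
            inverted.setdefault(t, []).append(general)
    return inverted


def unmapped_titles(mapping, titles):
    """Titles that no general category covers, in sorted order."""
    covered = invert_mapping(mapping)
    return sorted(t for t in titles if t not in covered)


def multiply_mapped_titles(mapping):
    """Titles assigned to more than one general category."""
    return {t: g for t, g in invert_mapping(mapping).items() if len(g) > 1}


print(unmapped_titles(food_dict_small, all_titles))
print(multiply_mapped_titles(food_dict_small))
```

Run against the real food_dict and title list, the same two calls would list the titles that end up with all-zero category counts.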

In [5]:
for food_cate in food_dict.keys():
    food_cate_list = []
    for cates in train_data.categories:
        num = 0
        for cate in cates:
            if cate['title'] in food_dict[food_cate]:
                num += 1
        food_cate_list.append(num)
    train_data[food_cate] = food_cate_list
train_data = train_data.drop(columns=['categories'])

for food_cate in food_dict.keys():
    food_cate_list = []
    for cates in test_data.categories:
        num = 0
        for cate in cates:
            if cate['title'] in food_dict[food_cate]:
                num += 1
        food_cate_list.append(num)
    test_data[food_cate] = food_cate_list
test_data = test_data.drop(columns=['categories'])

train_data.head()

Unnamed: 0,id,name,rating,review_count,price,french_food,north_american_food,european_food,south_american_food,african_food,...,snacks,seafood,bars_and_clubs,steaks,cafes,fastfood,street_food,breakfast,other_service,other_food
2404,h7_M9pgY_hPdYMJNQdmHBg,Taylor's At Market Square,4.0,11,1,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3464,HU37kM5fC1zbchGILJTh3A,Green House Coffee,4.5,37,1,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
3616,aHo6OehrCJAM228b_2yTOA,Country Cookin,3.0,22,2,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,1
4741,m7CIG06JpJdFNJaHb3uIjA,Anatolian Bistro,4.5,156,2,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3980,nbdz7uI4WUv11AfSvUpWKg,Ciro's Italian Pizzeria Restaurant,3.5,30,2,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
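The train and test loops above do the same work, so they could be wrapped in a single reusable function, in line with the goal stated at the top of this file. The following is only a sketch: the signature of `categories_processing` (a DataFrame, a mapping, and a column name) is an assumption, since the notebook names the function but does not define it, and `demo` and `demo_mapping` are small illustrative inputs rather than the real dataset.

```python
import pandas as pd


def categories_processing(df, mapping, column='categories'):
    """Return a copy of df with one count column per general category.

    Each entry of df[column] is expected to be a list of dicts with a
    'title' key, as in the Yelp data shown above. The signature is an
    assumption made for this sketch.
    """
    out = df.copy()
    for general, titles in mapping.items():
        title_set = set(titles)
        # Count how many of the restaurant's titles fall in this group.
        out[general] = [
            sum(1 for cate in cates if cate['title'] in title_set)
            for cates in out[column]
        ]
    return out.drop(columns=[column])


# Tiny illustrative frame and mapping (not the real dataset).
demo = pd.DataFrame({
    'id': ['a', 'b'],
    'categories': [
        [{'alias': 'pizza', 'title': 'Pizza'},
         {'alias': 'italian', 'title': 'Italian'}],
        [{'alias': 'restaurants', 'title': 'Restaurants'}],
    ],
})
demo_mapping = {'fastfood': ['Pizza', 'Burgers'], 'european_food': ['Italian']}

processed = categories_processing(demo, demo_mapping)
print(processed)
```

Working on a copy leaves the input frame unchanged, so the same call can be applied to train_data and test_data in turn.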


## 3. Train the classifier with the categorical features

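To check whether the categorical features are useful for our classification model, we train several classifiers on the numerical features (rating and review_count) combined with the categorical features and compare them by 5-fold cross-validated accuracy on the training set.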

In [6]:
X_train = train_data.drop(columns=['id', 'name', 'price'])
y_train = train_data[['price']].values
y_train = y_train.ravel()
X_test = test_data.drop(columns=['id', 'name', 'price'])
y_test = test_data[['price']].values
y_test = y_test.ravel()

In [7]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

nb = BernoulliNB()
nb.fit(X_train, y_train)
nb_scores = cross_val_score(nb, X_train, y_train, cv=5)
print('BernoulliNB Avg_Acc:',np.mean(nb_scores))

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dt_scores = cross_val_score(dt, X_train, y_train, cv=5)
print('DecisionTreeClassifier Avg_Acc:',np.mean(dt_scores))

svc = LinearSVC(multi_class='ovr', max_iter=10000)
svc.fit(X_train, y_train)
svc_scores = cross_val_score(svc, X_train, y_train, cv=5)
print('LinearSVC Avg_Acc:',np.mean(svc_scores))

lr = LogisticRegression(solver='newton-cg', multi_class='multinomial')
lr.fit(X_train, y_train)
lr_scores = cross_val_score(lr, X_train, y_train, cv=5)
print('LogisticRegression Avg_Acc:',np.mean(lr_scores))

rf = RandomForestClassifier(n_estimators=200)
rf.fit(X_train, y_train)
rf_scores = cross_val_score(rf, X_train, y_train, cv=5)
print('RandomForestClassifier Avg_Acc:', np.mean(rf_scores))

nn = MLPClassifier(max_iter=10000)
nn.fit(X_train, y_train)
nn_scores = cross_val_score(nn, X_train, y_train, cv=5)
print('MLPClassifier Avg_Acc:', np.mean(nn_scores))

BernoulliNB Avg_Acc: 0.6703000534910687
DecisionTreeClassifier Avg_Acc: 0.6454136349759517
LinearSVC Avg_Acc: 0.6954903145028581
LogisticRegression Avg_Acc: 0.7184300314833939
RandomForestClassifier Avg_Acc: 0.6957085953003129
MLPClassifier Avg_Acc: 0.7328301466393242


Classification accuracy improves after adding these categorical features, so we incorporate the categorical data into our classification model.
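One way to make this kind of comparison explicit is to cross-validate the same classifier twice, once on the numerical columns alone and once with the category counts appended. The sketch below uses synthetic data, not the Yelp dataset, and a label constructed to depend on one category column, so its numbers say nothing about the real results. The helper `mean_cv_accuracy` is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins: two numerical columns and three category-count columns.
numerical = rng.normal(size=(n, 2))
category_counts = rng.integers(0, 2, size=(n, 3))

# Synthetic label that depends on a category column, so the extra
# columns carry information the numerical columns lack.
y = (category_counts[:, 0] + (numerical[:, 0] > 1.0) > 0).astype(int)


def mean_cv_accuracy(X, y, cv=5):
    """Mean k-fold cross-validated accuracy of a logistic regression."""
    clf = LogisticRegression(max_iter=1000)
    return float(np.mean(cross_val_score(clf, X, y, cv=cv)))


baseline = mean_cv_accuracy(numerical, y)
combined = mean_cv_accuracy(np.hstack([numerical, category_counts]), y)
print('numerical only:', baseline)
print('numerical + categories:', combined)
```

On the real data, the same pattern would compare X_train restricted to rating and review_count against the full X_train.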

In [11]:
from sklearn.metrics import classification_report

target_names = ['$', '$$', '$$$', '$$$$']
print(classification_report(y_test, nn.predict(X_test), target_names=target_names))
print('Test Accuracy:',nn.score(X_test,y_test))

              precision    recall  f1-score   support

           $       0.65      0.70      0.67       392
          $$       0.74      0.76      0.75       596
         $$$       0.00      0.00      0.00        34
        $$$$       0.00      0.00      0.00         2

   micro avg       0.71      0.71      0.71      1024
   macro avg       0.35      0.36      0.36      1024
weighted avg       0.68      0.71      0.69      1024

Test Accuracy: 0.7060546875




## 4. Save the final model and its predictions on the test set

In [9]:
with open('models/category_clf.pickle', 'wb') as f:
    pickle.dump(nn, f)

In [10]:
outs = pd.DataFrame(columns=['id', 'y_pred'])
outs['id']=test_data['id']
outs['y_pred']=nn.predict(X_test)
outs.to_csv('../Dataset/category_preds.csv')