# Business Category Construction
In the dataset, each business has a set of business categories. When registering on Yelp, a business selects the categories it feels best represent its essence, style, or brand. The categories are wide-reaching and vary in specificity; examples include “Comfort Food”, “Seafood”, “Venues and Event Spaces”, “Internet Service”, and “Ophthalmologists”. They give users a preliminary, semi-abstract idea of the services a business offers. On the Yelp platform, a business can list as many or as few categories as its owners or registrants deem necessary; in the dataset, the number of categories per business ranges from 0 to 36. I suspect that these categories can give our deep learning models a useful perspective on relationships between businesses.

However, because categories are self-identified, with all the issues that come with that (discussed in the BusinessAI paper), it is difficult to group similar businesses together and to separate dissimilar ones. To tackle this issue, I devise a method that uses the Nested Stochastic Block Model (NSBM) to build a hierarchical community structure of business categories.

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import json
from itertools import combinations
import networkx as nx
from glob import glob
from networkx.algorithms import community
import graph_tool.all as gt
import itertools
import csv
import seaborn as sns
import ast
%matplotlib inline

### Get business categories that are members of specified first level communities

In [10]:
# one row per (first-level community, category) pair, sorted by community
busCatDf2 = busCatDf[['Community_1','nameOfNode','degreeOfNode','weightedDegreeOfNode']]
busCatDf2 = busCatDf2.groupby(['Community_1','nameOfNode']).count()
busCatDf2 = busCatDf2.drop(['degreeOfNode','weightedDegreeOfNode'],axis = 1)
busCatDf2 = busCatDf2.reset_index()

In [75]:
busCatDf2[busCatDf2['Community_1'] == 50] # example

| | Community_1 | nameOfNode |
|---|---|---|
| 819 | 50 | Cosmetic Dentists |
| 820 | 50 | Dentists |
| 821 | 50 | Eyewear & Opticians |
| 822 | 50 | General Dentistry |
| 823 | 50 | Laser Eye Surgery/Lasik |
| 824 | 50 | Ophthalmologists |
| 825 | 50 | Optometrists |
| 826 | 50 | Orthodontists |
| 827 | 50 | Orthotics |
| 828 | 50 | Urologists |


In [97]:
# save our Category graph to csv

busCatDf.to_csv("categoryCommunities/categoryHierarchyDf.csv")

In [98]:
# FROM HERE YOU CAN READ IN THE CSV.

#busCatDf = pd.read_csv("categoryCommunities/categoryHierarchyDf.csv")
#busCatDf.head()

## Iterative Naming Process
- We now have a hierarchical structure of business categories and a system for naming the communities. However, each business has multiple categories, and these categories can belong to different communities. To handle this, we run the NSBM an arbitrary number of times and, for each run, keep track of which communities the business's categories belong to. The community that appears most frequently becomes the business's community.
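
The voting step described above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: `tally_business_community`, the `runs` list of per-run lookups, and the toy labels are all hypothetical. It also assumes that community names are comparable across runs, which is what the naming system is meant to provide.

```python
from collections import Counter

def tally_business_community(categories, runs):
    """Count community labels for one business across several NSBM runs.

    categories: category names listed by the business.
    runs: one dict per run mapping category name -> community label
          (a hypothetical stand-in for a lookup against busCatDf).
    """
    counts = Counter()
    for category_to_community in runs:
        for category in categories:
            label = category_to_community.get(category)
            if label is not None:  # skip categories that are not in the graph
                counts[label] += 1
    return counts

# Toy data: the labels are made up for illustration only.
runs = [
    {"Dentists": "health", "Orthodontists": "health", "Cosmetics": "beauty"},
    {"Dentists": "health", "Orthodontists": "health", "Cosmetics": "health"},
    {"Dentists": "health", "Orthodontists": "beauty", "Cosmetics": "beauty"},
]
counts = tally_business_community(["Dentists", "Orthodontists", "Cosmetics"], runs)
winner = counts.most_common(1)[0][0]  # most frequent community label
```

With this toy data, `"health"` collects six votes and `"beauty"` three, so `"health"` is chosen as the business's community.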

In [12]:
Yelp_Business = pd.read_json('YelpDataset/business.json',lines=True)
businessCatSeries = Yelp_Business['categories']
Yelp_Business['categoryDictCounts'] = Yelp_Business['categories'].apply(lambda x: {})

In [52]:
# LEVEL 1 DICT COUNTS; function to get the level one communities (labels) of each category that a business indicates
# counts the number of occurrences of each of the community labels

def categoriesToDictCounts1(row):
    locList = []
    Row0BusinessCats = row['categories']
    Row0Dict1 = row['categoryDictCounts']
    for i in range(len(Row0BusinessCats)):
        locList.append(businessAttributeToCommunity(Row0BusinessCats[i],busCatDf,1,"string")) #level 1
    for elem in locList:
        if elem not in Row0Dict1:
            Row0Dict1[elem] = 1
        else:
            Row0Dict1[elem] += 1
    return Row0Dict1


In [29]:
# LEVEL 2 DICT COUNTS; function to get the level two communities (labels) of each category that a business indicates
# counts the number of occurrences of each of the community labels

def categoriesToDictCounts(row):
    locList = []
    Row0BusinessCats = row['categories']
    Row0Dict = row['categoryDictCounts']
    for i in range(len(Row0BusinessCats)):
        locList.append(businessAttributeToCommunity(Row0BusinessCats[i],busCatDf,2,"string")) #level 2
    for elem in locList:
        for i in range(len(elem)):
            if elem[i] not in Row0Dict:
                Row0Dict[elem[i]] = 1
            else:
                Row0Dict[elem[i]] += 1
    return Row0Dict


In [20]:
# Run the NSBM 100 times. At each iteration, fit a new state and record the length of the
# block-assignment array at each of the first four hierarchy levels (0 when the hierarchy has fewer levels)

summaryDict = dict()
for count in range(100):
    rowList = [count+1]  # 1-based iteration number
    state=gt.minimize_nested_blockmodel_dl(catGraph,deg_corr=True)
    for i in range(4):
        try:
            rowList.append(len(state.get_levels()[i].get_blocks().get_array()))
        except IndexError:
            rowList.append(0)
    summaryDict[count] = rowList
    print(count)
    state.print_summary()
    print("\n")

0
l: 0, N: 1293, B: 75
l: 1, N: 75, B: 21
l: 2, N: 21, B: 4
l: 3, N: 4, B: 1


1
l: 0, N: 1293, B: 76
l: 1, N: 76, B: 20
l: 2, N: 20, B: 6
l: 3, N: 6, B: 1


2
l: 0, N: 1293, B: 79
l: 1, N: 79, B: 22
l: 2, N: 22, B: 5
l: 3, N: 5, B: 1


3
l: 0, N: 1293, B: 79
l: 1, N: 79, B: 19
l: 2, N: 19, B: 5
l: 3, N: 5, B: 1


4
l: 0, N: 1293, B: 76
l: 1, N: 76, B: 18
l: 2, N: 18, B: 4
l: 3, N: 4, B: 1


5
l: 0, N: 1293, B: 75
l: 1, N: 75, B: 19
l: 2, N: 19, B: 5
l: 3, N: 5, B: 1


6
l: 0, N: 1293, B: 85
l: 1, N: 85, B: 21
l: 2, N: 21, B: 5
l: 3, N: 5, B: 1


7
l: 0, N: 1293, B: 76
l: 1, N: 76, B: 16
l: 2, N: 16, B: 3
l: 3, N: 3, B: 1


8
l: 0, N: 1293, B: 63
l: 1, N: 63, B: 11
l: 2, N: 11, B: 1


9
l: 0, N: 1293, B: 77
l: 1, N: 77, B: 19
l: 2, N: 19, B: 3
l: 3, N: 3, B: 1


10
l: 0, N: 1293, B: 75
l: 1, N: 75, B: 20
l: 2, N: 20, B: 4
l: 3, N: 4, B: 1


11
l: 0, N: 1293, B: 66
l: 1, N: 66, B: 17
l: 2, N: 17, B: 4
l: 3, N: 4, B: 1


12
l: 0, N: 1293, B: 82
l: 1, N: 82, B: 19
l: 2, N: 19, B: 6
l: 3, 

In [23]:
df = pd.DataFrame.from_dict(summaryDict, orient='index',
                            columns=['Iteration','Level 0','Level 1','Level 2','Level 3'])

In [24]:
df.to_csv("blockBreakdowns.csv")
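
The printed summaries show that the number of level-0 blocks changes from run to run. As a rough sketch of how that variation could be quantified, the snippet below summarizes the `B` values at `l: 0` copied from the printed output of runs 0 to 12. The `describe` helper is hypothetical and is not part of the notebook.

```python
from statistics import mean, median, stdev

# Level-0 block counts (B at l: 0) copied from the printed summaries of runs 0-12.
level0_blocks = [75, 76, 79, 79, 76, 75, 85, 76, 63, 77, 75, 66, 82]

def describe(values):
    """Return basic summary statistics for a list of block counts."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
    }

stats = describe(level0_blocks)
```

For these thirteen runs the count ranges from 63 to 85 with a median of 76, which is one reason to aggregate over many runs rather than rely on a single fit.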

In [55]:
import operator

# get the key with the max value (None when the dict is empty)

def getKeyWithMaxValue(row):
    Row0Dict = row['categoryDictCounts']
    if Row0Dict != {}:
        return max(Row0Dict.items(), key=operator.itemgetter(1))[0]
    else:
        return None

def getKeyWithMaxValueLevel1(row):
    Row0Dict = row['catDictCountsLevel1']
    if Row0Dict != {}:
        return max(Row0Dict.items(), key=operator.itemgetter(1))[0]
    else:
        return None
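
One detail of this selection rule is how ties are resolved: `max` returns the first maximal item it encounters, so when two communities have equal counts the winner depends on dictionary insertion order. The sketch below reproduces the rule on a plain dict and contrasts it with a hypothetical alternative that breaks ties by key. Both function names and the toy counts are illustrative assumptions.

```python
import operator

def key_with_max_value(counts):
    """Same selection rule as getKeyWithMaxValue, applied to a plain dict."""
    if not counts:
        return None
    return max(counts.items(), key=operator.itemgetter(1))[0]

def key_with_max_value_sorted(counts):
    """Hypothetical variant: highest count first, ties broken by smallest key."""
    if not counts:
        return None
    return min(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0]

# Toy counts with a tie between "Food" and "Bars".
tied = {"Food": 2, "Bars": 2, "Hotels": 1}
first_seen = key_with_max_value(tied)            # tie resolved by insertion order
alphabetical = key_with_max_value_sorted(tied)   # tie resolved by key order
```

Here the first function picks `"Food"` because it was inserted first, while the variant picks `"Bars"`. Whether a deterministic tie-break matters depends on how often ties occur in the real counts.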
    

In [425]:
Yelp_Business["businessLatentCategory"] = Yelp_Business.apply(getKeyWithMaxValue,axis=1)
Yelp_Business['businessCatLevel1'] = Yelp_Business.apply(getKeyWithMaxValueLevel1,axis=1)
Yelp_Business.to_csv("BusinessesWithLatentCategories.csv")