# Similarity of Cities

<p>
You know how different cities have a different feel when you visit them?
Sometimes you can find unique types of business in them. In some cities this is caused by the environment, in others by history. Whatever the reason, a whole area can be more welcoming to certain types of business than to others.</p>
<p>
As with any question in data analysis, that seems like common sense, but is it true?
I decided to check by determining the most common business categories in the Yelp data set and comparing cities based on their top 15 business categories.</p>
<p>
Here is how I approached that issue.</p>

## Data preparation

Importing necessary modules:

In [1]:
import re
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import nltk as nl
import sys



The first step is importing the data from the JSON file and forming a pandas data frame.

In [2]:
fileName = "/Users/Lexa/Documents/coding/yelp-api/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json"
colNames = ['name', 'full_address', 'city', 'state', 'latitude', 'longitude', 
            'stars', 'review_count', 'categories', 'open']

# The file holds one JSON object per line, so each line is parsed on its own.
data = []
with open(fileName, 'r') as f:
    for line in f:
        data.append(json.loads(line))
df = pd.DataFrame(data)
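
The loading cell relies on the business file storing one JSON object per line. Here is a minimal sketch of the same parsing on an in-memory sample. The two records and the helper name `read_json_lines` are made up for illustration; only the field names come from `colNames`.

```python
import io
import json

def read_json_lines(handle):
    """Parse a file-like object that holds one JSON object per line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines instead of failing on them
            records.append(json.loads(line))
    return records

# Hypothetical sample records, not taken from the Yelp data set.
sample = io.StringIO(
    u'{"name": "Cafe A", "city": "Phoenix", "categories": ["Coffee & Tea"]}\n'
    u'{"name": "Bar B", "city": "Las Vegas", "categories": ["Bars", "Nightlife"]}\n'
)
records = read_json_lines(sample)
print(len(records))
print(records[1]["city"])
```

Passing `records` to `pd.DataFrame` gives the data frame. pandas can also read this layout directly with `pd.read_json(path, lines=True)`, which may be worth trying on the real file.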


Next, I like to pickle my data frame. Years of unexpected malfunctions have taught me to save often.

In [3]:
with open('out.pickle', 'wb') as handle:
    pickle.dump(df, handle)
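
Saving is only half of the safety net, since the pickle has to be read back after a restart. Here is a small sketch of the round trip, with a plain dictionary and a temporary file standing in for `df` and `out.pickle`:

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the data frame `df`.
snapshot = {"phoenix": ["restaur", "bar"], "las vegas": ["nightlif"]}

# Temporary path used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "out.pickle")

with open(path, "wb") as handle:
    pickle.dump(snapshot, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == snapshot)
```

For a data frame, `df.to_pickle(path)` and `pd.read_pickle(path)` should do the same job in one call each.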


<p>Now I'm starting with the NLP procedures.
First I need to clean up the Yelp categories. They are entered by people without constraints, meaning that there is a mix of upper and lower case, singular and plural, and unnecessary adjectives. </p>
<p>
So I'll start by making a function that will stem words in the categories.</p>

In [4]:
def cleanup(categoryList):
    """Return lower-cased, stemmed tokens for a list of category names."""
    porter = nl.PorterStemmer()
    cati = []
    for c in categoryList:
        tokens = nl.word_tokenize(c)
        words = [w.lower() for w in tokens]
        # De-duplicate the tokens of one category and sort them for a stable order.
        vocab = sorted(set(words))
        vocStem = [porter.stem(t) for t in vocab]
        # Skip the '&' token that joins names such as "Arts & Entertainment".
        cati.extend(s for s in vocStem if '&' not in s)
    return cati
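
To see what the stemming buys, here is a reduced sketch of the same idea for a single category name. It assumes NLTK is installed and swaps `word_tokenize` for a plain `str.split`, so that no tokenizer models have to be downloaded. The helper name `stem_category` is hypothetical.

```python
from nltk.stem import PorterStemmer

def stem_category(category, stemmer=PorterStemmer()):
    """Lower-case a category name, split on whitespace, drop '&', and stem."""
    words = sorted(set(category.lower().split()))
    return [stemmer.stem(w) for w in words if '&' not in w]

# Singular/plural and upper/lower case variants collapse to the same stem.
print(stem_category("Restaurants") == stem_category("restaurant"))
print(stem_category("Arts & Entertainment"))
```

The first line should print `True`, which is the whole point: differently typed variants of one category end up counted together.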


Next, I make a list of the unique city names.

In [5]:
city=df['city']
cities=np.unique(city.values.ravel())

Now I can form a dictionary that gathers all business categories of a town in one place.

In [6]:
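cityCat = {}

for cit in cities:
    # Some entries hold "City, State" or a full address; keep the part before the comma.
    if ',' in cit:
        cit = cit.split(',')[0]
    citL = cit.lower()
    # Rows are matched on the (possibly truncated) name and stored under its lower-case form.
    df1 = df[df['city'] == cit]
    for ca in df1['categories']:
        # Collect every stemmed token of this business under the city's key.
        cityCat.setdefault(citL, []).extend(cleanup(ca))
print(len(cityCat))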


362


<p>
Of course, the CITY column mixes letter cases as well, so I converted all city names to lower case. That reduced the number of towns to 362. </p>
<p>
Some values in that column actually contain a whole address instead of a city name. However, since I'm only keeping categories that occur at least twice, this is not a problem. Such data will be removed from the analysis in the next steps.</p>
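
The city clean-up can be sketched on its own, without the data frame. The rows below are invented, `normalize_city` is a hypothetical helper, and the extra `strip()` is an assumption of mine rather than something the notebook does.

```python
from collections import defaultdict

def normalize_city(raw):
    """Keep the text before the first comma, trimmed and lower-cased."""
    return raw.split(',')[0].strip().lower()

# Hypothetical (city, stemmed categories) pairs standing in for data frame rows.
rows = [
    ("Phoenix", ["restaur", "mexican"]),
    ("phoenix", ["bar"]),
    ("Phoenix, AZ", ["coffe"]),
]

grouped = defaultdict(list)
for raw_city, stems in rows:
    grouped[normalize_city(raw_city)].extend(stems)

print(dict(grouped))
```

All three spellings land under the single key `phoenix`, which is the effect that shrinks the list of towns.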

In [7]:
#Saving the dataframe with cleaned up categories. 

pkl_file = open('out_cityCat.pkl', 'wb')
pickle.dump(cityCat,pkl_file)
pkl_file.close()

## Finding the most common categories

<p>So, the next step is determining which categories are most common in each city.</p> <p>
 The procedure is relatively straightforward: I'm just determining the frequency of certain words. </p>
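
Counting word frequencies needs nothing more than a counter. Here is a minimal sketch with the standard library and an invented list of stems. In NLTK 3, `FreqDist` is built on the same `collections.Counter`, so `most_common` behaves alike.

```python
from collections import Counter

# Hypothetical stemmed categories for one city.
stems = ["restaur", "bar", "restaur", "pizza", "bar", "restaur"]

counts = Counter(stems)
top_two = counts.most_common(2)
print(top_two)
```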

In [8]:
cities = list(cityCat.keys())

<p>My earlier exploratory analysis showed that restaurants are common in most cities, so that word is sure to end up in the top 15, especially for small places that don't have much variety. </p>
<p>Therefore, I think it is prudent to also determine the frequency of pairs and triplets of words. This approach might give me a more accurate description of the businesses typical of an area.</p>
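
Pairs and triplets here simply mean neighbouring items in a city's list of stems. Here is a rough sketch of counting them with the standard library; the helper `ngrams` and the sample list are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return consecutive n-token tuples from a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical stemmed categories for one city.
stems = ["mexican", "restaur", "bar", "mexican", "restaur", "pizza"]

bigram_counts = Counter(ngrams(stems, 2))
trigram_counts = Counter(ngrams(stems, 3))

print(bigram_counts.most_common(1))
print(len(trigram_counts))
```

Because pairs are taken from neighbouring list positions, a list built by concatenating businesses will also contain pairs that straddle two businesses.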

In [9]:
bigram_measures = nl.collocations.BigramAssocMeasures()
trigram_measures = nl.collocations.TrigramAssocMeasures()

<p>The next step is to calculate the frequency of the words and phrases.</p>
<p>Because of mistakes in the city names themselves, I'm applying a frequency filter to remove word pairs and triplets that appear fewer than two times. This procedure will eliminate categories from any wrongly entered city names, leaving those lists empty. So in the next step, I will quickly remove such cities from my analysis. </p>

In [28]:
cityCatTop2 = {}
cityCatTop3 = {}
cityCatCommon = {}

for cit in cities:
    category = cityCat[cit]
    # Three most frequent single stems for this city.
    cityCatCommon[cit] = nl.FreqDist(category).most_common(3)
    finder = nl.BigramCollocationFinder.from_words(category)
    finder2 = nl.TrigramCollocationFinder.from_words(category)
    # Drop bigrams and trigrams that occur fewer than two times.
    finder.apply_freq_filter(2)
    finder2.apply_freq_filter(2)
    # Best-scoring bigram and trigram by pointwise mutual information (PMI).
    cityCatTop2[cit] = finder.nbest(bigram_measures.pmi, 1)
    cityCatTop3[cit] = finder2.nbest(trigram_measures.pmi, 1)
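
The cell above ranks pairs by `bigram_measures.pmi`, that is, by pointwise mutual information rather than by raw counts. Here is a sketch of the textbook definition, log2(P(x, y) / (P(x) P(y))), on an invented list. NLTK's own counting conventions may differ in detail, and `bigram_pmi` is a hypothetical helper.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Score each adjacent pair with log2(P(x, y) / (P(x) * P(y)))."""
    total = float(len(tokens))
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), n_xy in pair_counts.items():
        # Probabilities are estimated as counts divided by the token total.
        p_xy = n_xy / total
        p_x = word_counts[x] / total
        p_y = word_counts[y] / total
        scores[(x, y)] = math.log(p_xy / (p_x * p_y), 2)
    return scores

stems = ["restaur", "bar", "restaur", "bar", "restaur", "sushi", "japanes", "restaur"]
scores = bigram_pmi(stems)
best = max(scores, key=scores.get)
print(best)
print(scores[best])
```

In this sample the pair seen only once outranks the pair seen twice, because both of its words are rare. That is the behaviour a frequency filter keeps in check.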

And of course, I'm saving the newly obtained results.

In [29]:
with open('out_cityCatTop.pkl', 'wb') as pkl_fil:
    pickle.dump(cityCatCommon, pkl_fil)

with open('out_cityCatBigramTop.pkl', 'wb') as pkl_file2:
    pickle.dump(cityCatTop2, pkl_file2)

with open('out_cityCatTrigramTop.pkl', 'wb') as pkl_file3:
    pickle.dump(cityCatTop3, pkl_file3)
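
The removal of cities with empty lists, mentioned earlier, is not shown in this section. One guess at what it could look like, with a hypothetical helper `drop_empty` and an invented dictionary shaped like `cityCatTop2`:

```python
def drop_empty(city_lists):
    """Return a copy without the cities whose list is empty."""
    return {city: items for city, items in city_lists.items() if items}

# Hypothetical result: one list of top bigrams per city key.
top_bigrams = {
    "phoenix": [("mexican", "restaur")],
    "123 main st": [],
}
kept = drop_empty(top_bigrams)
print(sorted(kept))
```

Returning a new dictionary leaves the pickled originals untouched, so the filter can be tightened later without recomputing anything.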