# Introduction

## 1. Background and Problem

* New York has one of the largest state economies in the US. Its gross state product in 2018 was $1.7 trillion; if New York State were an independent nation, it would rank as the 11th largest economy in the world.
* However, economic disparity among its cities is frequently reported, and the differences in their industrial makeup also deserve attention.
* According to reports, modern finance and business services are the pillar industries in New York City and its suburban counties, while upstate the main industry is tourism, supplemented by traditional industries.

<font color=red size=3.5>So, how do industries differ among the cities of New York State?</font>
* This is the question I aim to answer. The answer can provide useful suggestions to several kinds of stakeholders:
1. **For entrepreneurs**, knowing the similarities and differences among cities in New York State helps them decide where to expand, based on their line of business. For example, a tourism enterprise may want to expand into nearby cities with rich tourism resources.
2. **For job hunters**, it helps them choose a city with abundant job opportunities in their field rather than struggling in their current city, and to move to a new city with similar facilities.

## 2. Data Description

* Inspired by the Toronto analysis, I decided to distinguish the cities based on their venue categories. That means I need to collect three datasets: **the list of cities in New York State**, **the latitude and longitude of each city**, and **information about the main venues in each city**.

<font color=red size=3.5>Where and how can I get the datasets?</font>
* These three datasets come from different sources. I will collect them as follows:
1. **list of cities in New York state**: Fortunately, the list is easy to find on Wikipedia. The link is <https://en.wikipedia.org/wiki/List_of_cities_in_New_York>, and I will scrape the table of cities using pandas.
2. **the latitude and longitude of each city**: As shown in the code below, I will use geopy's geocoders in a `for` loop to get the location of each city. Retrieving all the data may take some time, and the results should be spot-checked, since a geocoder can return a different place with the same name.
***
```
from geopy.geocoders import Nominatim

# geocode a single address and read its coordinates
address = 'New York'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York are {}, {}.'.format(latitude, longitude))
```
***
3. **the information of main venues in each city**: This dataset comes from Foursquare location data, accessed through its API. I will use the endpoint that explores the venues near a given location.
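
As a rough sketch of what such a request might look like, the snippet below assembles the query string for the `venues/explore` endpoint using only the standard library. The helper name `build_explore_url` and the placeholder credentials are illustrative assumptions; the endpoint and parameter names are the ones used in the methodology section of this notebook.

```python
from urllib.parse import urlencode, urlparse, parse_qs

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=10000, limit=100):
    """Return the explore URL for one location (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude of the city
        'radius': radius,                # search radius in meters
        'limit': limit,                  # maximum number of venues returned
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# 'YOUR_ID' and 'YOUR_SECRET' are placeholders, not real credentials.
url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 42.651167, -73.754968)
query = parse_qs(urlparse(url).query)
print(query['ll'])
```

Letting `urlencode` build the query string avoids manual escaping mistakes when a parameter contains characters such as commas.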

## 3. Methodology

### 3.1. Load Data from Wiki

In [71]:
# loading data from wiki
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_cities_in_New_York')[1]

In [72]:
# checking the data
df.head()

| | City | County | Population [1][2](2010 census) | Incorporationdate | FIPS code(subdivision) | FIPS code (place) |
|---|---|---|---|---|---|---|
| 0 | Albany | Albany | 97856 | 1686 | 3600101000 | 3601000 |
| 1 | Amsterdam | Montgomery | 18620 | 1830 | 3605702066 | 3602066 |
| 2 | Auburn | Cayuga | 27687 | 1848 | 3601103078 | 3603078 |
| 3 | Batavia | Genesee | 15465 | 1915 | 3603704715 | 3604715 |
| 4 | Beacon | Dutchess | 15541 | 1913 | 3602705100 | 3605100 |


In [6]:
df.shape

(62, 6)

In [7]:
# checking if there are missing values
df[['City']].isnull().sum()

City    0
dtype: int64

In [8]:
# checking if there are duplicated values
df['City'].value_counts().unique()

array([1], dtype=int64)

We have successfully retrieved the names of all the cities in New York State. Let's go ahead!

### 3.2. Get the Location of Each City

In [18]:
# import the required packages
from geopy.geocoders import Nominatim
import numpy as np
import time
from geopy.exc import GeocoderTimedOut

In [24]:
# try to get the location of Albany
address = 'Albany, NY'
geolocator = Nominatim(user_agent = "Albany_explorer", timeout = 20)
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Albany are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Albany are 42.0774779, -78.4298613.


In [25]:
def do_geocode(address, attempt=1, max_attempts=20):
    """Geocode an address, retrying when the request times out."""
    geolocator = Nominatim(user_agent="{}_explorer".format(address), timeout=60)
    try:
        return geolocator.geocode(address)
    except GeocoderTimedOut:
        if attempt <= max_attempts:
            # pass max_attempts along so a custom limit is respected on retries
            return do_geocode(address, attempt=attempt+1, max_attempts=max_attempts)
        raise
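
The recursive retry above could also be written as a loop with a growing pause between attempts. The following is a minimal sketch under stated assumptions: `call_with_retries` and `flaky_geocode` are hypothetical names, the backoff schedule is my own choice, and the stand-in function only simulates a geocoder that times out twice (its coordinates are made up).

```python
import time

def call_with_retries(func, *args, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(*args), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # wait 1s, 2s, 4s, ...

# Stand-in for the geocoder: fails twice, then succeeds (simulated values).
calls = {'n': 0}

def flaky_geocode(address):
    calls['n'] += 1
    if calls['n'] < 3:
        raise TimeoutError('simulated timeout')
    return (address, (42.65, -73.75))

delays = []  # record the pauses instead of actually sleeping
result = call_with_retries(flaky_geocode, 'Albany, NY',
                           retry_on=(TimeoutError,), sleep=delays.append)
print(result, delays)
```

With geopy, the same wrapper could presumably be called with `geolocator.geocode` as `func` and `retry_on=(GeocoderTimedOut,)`.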

In [None]:
# get the locations of all cities
latitudes = []
longitudes = []

for i in np.arange(df.shape[0]):
    address = '{}, NY'.format(df.loc[i,'City'])
    print(address)
    location = do_geocode(address)
    latitudes.append(location.latitude)
    longitudes.append(location.longitude)
    time.sleep(1)  # pause between requests to respect the geocoder's rate limit
print('All locations retrieved!')
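
Because a geocoder can return nothing or match a different place with the same name, a quick sanity check on the collected coordinates may be worthwhile. The sketch below flags missing results and points outside an approximate bounding box of New York State. The box values are rounded outward and are an assumption for illustration, `outside_bounds` is a hypothetical helper, and the second and third sample entries are made up to show the failure cases.

```python
# Approximate bounding box of New York State (rounded outward; an assumption).
NY_BOUNDS = {'lat_min': 40.4, 'lat_max': 45.1, 'lon_min': -79.8, 'lon_max': -71.7}

def outside_bounds(coords, bounds=NY_BOUNDS):
    """Return the names whose (lat, lon) is missing or falls outside the box."""
    suspicious = []
    for name, point in coords.items():
        if point is None:  # the geocoder found nothing
            suspicious.append(name)
            continue
        lat, lon = point
        inside = (bounds['lat_min'] <= lat <= bounds['lat_max']
                  and bounds['lon_min'] <= lon <= bounds['lon_max'])
        if not inside:
            suspicious.append(name)
    return suspicious

sample = {
    'Albany': (42.651167, -73.754968),  # value from the table in this notebook
    'Amsterdam': (52.37, 4.90),         # made-up wrong match (Amsterdam, Netherlands)
    'Auburn': None,                     # simulated failed lookup
}
print(outside_bounds(sample))
```

A box check like this only catches gross errors; it will not detect a wrong match that still falls inside the state.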

In [73]:
# create location variables
df['lat'] = latitudes
df['lon'] = longitudes
df.head()

| | City | County | Population [1][2](2010 census) | Incorporationdate | FIPS code(subdivision) | FIPS code (place) | lat | lon |
|---|---|---|---|---|---|---|---|---|
| 0 | Albany | Albany | 97856 | 1686 | 3600101000 | 3601000 | 42.651167 | -73.754968 |
| 1 | Amsterdam | Montgomery | 18620 | 1830 | 3605702066 | 3602066 | 42.943367 | -74.185044 |
| 2 | Auburn | Cayuga | 27687 | 1848 | 3601103078 | 3603078 | 42.93202 | -76.567203 |
| 3 | Batavia | Genesee | 15465 | 1915 | 3603704715 | 3604715 | 42.998014 | -78.187551 |
| 4 | Beacon | Dutchess | 15541 | 1913 | 3602705100 | 3605100 | 41.504879 | -73.969682 |


In [35]:
#visualize the cities
import folium
latitude = df[df['City'] == 'Ithaca']['lat'].values[0]
longitude = df[df['City'] == 'Ithaca']['lon'].values[0]

map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, city, county in zip(df['lat'], df['lon'], df['City'], df['County']):
    label = '{}, {}'.format(county,city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_newyork)
    
map_newyork

### 3.3. Utilizing the Foursquare API to Explore the Cities

In [36]:
# load the required packages
import requests

In [44]:
# Define Foursquare credentials and version
# (read from environment variables instead of hard-coding or printing them;
#  the variable names below are an assumption, use whatever names you export)
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605' # Foursquare API version


In [45]:
# define a function that extracts the first category name from a venue row
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Because requests to the API occasionally fail, we record the cities whose requests failed in `wrong_list` and query those cities a second time.

In [50]:
# define function to get nearby venues of each city
def getNearbyVenues(names, latitudes, longitudes):
    radius = 10000 # limit meters
    LIMIT = 100 # limit 100 nearby venues
    venues_list=[]
    wrong_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request (once), keeping the response for diagnostics
        response = {}
        try:
            response = requests.get(url).json()['response']
            results = pd.json_normalize(response['groups'][0]['items'])
            filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
            results = results.loc[:, filtered_columns]
            results['venue.categories'] = results.apply(get_category_type, axis=1)

            # return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                results.loc[i,'venue.name'], 
                results.loc[i,'venue.location.lat'], 
                results.loc[i,'venue.location.lng'],
                results.loc[i,'venue.categories']) for i in range(results.shape[0])])
        except Exception:
            # record the city so that it can be queried again later
            print(response.keys())
            print('{} is wrong'.format(name))
            wrong_list.append(name)
   
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    time.sleep(5)
    return(nearby_venues, wrong_list)
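
The "collect the failures, then query them again" idea could be generalized into a small loop that repeats until nothing is left or a round limit is reached. This is only a sketch: `fetch_with_retry_rounds` and `fake_fetch` are hypothetical names, and the stand-in fetcher simulates one transient failure rather than calling Foursquare.

```python
def fetch_with_retry_rounds(names, fetch, max_rounds=3):
    """Call fetch(name) for every name; retry the failures for up to max_rounds passes."""
    results = {}
    pending = list(names)
    for _ in range(max_rounds):
        failed = []
        for name in pending:
            try:
                results[name] = fetch(name)
            except Exception:
                failed.append(name)  # keep it for the next pass
        pending = failed
        if not pending:
            break
    return results, pending

# Stand-in fetcher: fails the first time it sees 'Albany', succeeds afterwards.
seen = set()

def fake_fetch(name):
    if name == 'Albany' and name not in seen:
        seen.add(name)
        raise RuntimeError('simulated API hiccup')
    return ['venues for ' + name]

results, still_failing = fetch_with_retry_rounds(['Albany', 'Amsterdam'], fake_fetch)
print(sorted(results), still_failing)
```

Returning the names that still fail after the last round keeps the caller informed, much like `wrong_list` does in the function above.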

In [53]:
# get the nearby venues
ny_venues, wrong_list = getNearbyVenues(names=df['City'],
                                        latitudes=df['lat'],
                                        longitudes=df['lon'])
ny_venues.head()

Albany
dict_keys(['suggestedFilters', 'headerLocation', 'headerFullLocation', 'headerLocationGranularity', 'totalResults', 'suggestedBounds', 'groups'])
Albany is wrong
Amsterdam
Auburn
Batavia
Beacon
Binghamton
Buffalo
Canandaigua
Cohoes
Corning
Cortland
Dunkirk
Elmira
Fulton
Geneva
Glen Cove
Glens Falls
Gloversville
Hornell
Hudson
Ithaca
Jamestown
Johnstown
Kingston
Lackawanna
Little Falls
Lockport
Long Beach
Mechanicville
Middletown
Mount Vernon
New Rochelle
New York
Newburgh
Niagara Falls
North Tonawanda
Norwich
Ogdensburg
Olean
Oneida
Oneonta
Oswego
Peekskill
Plattsburgh
dict_keys(['suggestedFilters', 'headerLocation', 'headerFullLocation', 'headerLocationGranularity', 'totalResults', 'suggestedBounds', 'groups'])
Plattsburgh is wrong
Port Jervis
Poughkeepsie
Rensselaer
Rochester
Rome
Rye
Salamanca
Saratoga Springs
Schenectady
Sherrill
Syracuse
dict_keys(['suggestedFilters', 'headerLocation', 'headerFullLocation', 'headerLocationGranularity', 'totalResults', 'suggestedBounds', 'gr

| | City | City Latitude | City Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Amsterdam | 42.943367 | -74.185044 | Bosco's Restaurant & Bar | 42.945115 | -74.200738 | Italian Restaurant |
| 1 | Amsterdam | 42.943367 | -74.185044 | Parillo's Armory Grill | 42.93344 | -74.198602 | Italian Restaurant |
| 2 | Amsterdam | 42.943367 | -74.185044 | Panera Bread | 42.965352 | -74.187407 | Bakery |
| 3 | Amsterdam | 42.943367 | -74.185044 | Full House Buffet | 42.968892 | -74.184887 | Chinese Restaurant |
| 4 | Amsterdam | 42.943367 | -74.185044 | Recovery Sports Grill | 42.958699 | -74.185112 | American Restaurant |


In [81]:
# check which cities failed with the API
wrong_list

['Albany', 'Plattsburgh', 'Syracuse']

In [80]:
print(df[df['City'] == wrong_list[0]].index.values)
print(df[df['City'] == wrong_list[1]].index.values)
print(df[df['City'] == wrong_list[2]].index.values)

[0]
[43]
[54]


In [83]:
# get the remaining nearby venues
ny_venues_2, wrong_list = getNearbyVenues(names=['Albany', 'Plattsburgh', 'Syracuse'],
                                        latitudes=df.loc[[0,43,54],'lat'],
                                        longitudes=df.loc[[0,43,54],'lon'])
wrong_list

Albany
Plattsburgh
Syracuse


[]

In [87]:
# get the final dataset and its size
ny_venues = pd.concat([ny_venues,ny_venues_2],axis=0,ignore_index=True )
print(ny_venues.shape)

(5151, 7)


In [88]:
# checking how many venues we have collected
ny_venues.groupby('City').count()['Venue'].sort_values(ascending = False).head(10)

City
Yonkers            100
North Tonawanda    100
Newburgh           100
New York           100
New Rochelle       100
White Plains       100
Middletown         100
Mechanicville      100
Long Beach         100
Lockport           100
Name: Venue, dtype: int64

In [89]:
# checking how many categories we have collected
print('There are {} unique categories.'.format(len(ny_venues['Venue Category'].unique())))

There are 311 unique categories.


### 3.4. Analyse Each City

In [109]:
# one hot encoding
ny_onehot = pd.get_dummies(ny_venues[['Venue Category']], prefix="", prefix_sep="")

# add city column back to dataframe
ny_onehot['City'] = ny_venues['City'] 

# move city column to the first column (by name rather than by a hard-coded position)
fixed_columns = ['City'] + [col for col in ny_onehot.columns if col != 'City']
ny_onehot = ny_onehot[fixed_columns]

ny_onehot.head()

Unnamed: 0,City,Accessories Store,Airport,Airport Terminal,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Art Gallery,Art Museum,...,Waterfront,Wedding Hall,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Amsterdam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Amsterdam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Amsterdam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Amsterdam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Amsterdam,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [110]:
# getting the shape of onehot dataset
ny_onehot.shape

(5151, 311)

In [111]:
# creating a grouped dataset 
ny_grouped = ny_onehot.groupby('City').mean().reset_index()
print(ny_grouped.shape)
ny_grouped.head()

(62, 311)


Unnamed: 0,City,Accessories Store,Airport,Airport Terminal,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Art Gallery,Art Museum,...,Waterfront,Wedding Hall,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Albany,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0
1,Amsterdam,0.0,0.0,0.0,0.036364,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0
2,Auburn,0.0,0.0,0.0,0.074074,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.018519,0.0,0.0,0.0,0.0,0.0,0.0
3,Batavia,0.0,0.016667,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Beacon,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
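
To make the meaning of the grouped table concrete: averaging one-hot columns per city yields the share of each category among that city's venues. The toy example below (made-up cities and venues, not the notebook's data) shows this and compares it with `pd.crosstab(..., normalize='index')`, which appears to give the same frequencies in one step.

```python
import pandas as pd

# Toy venue list (made-up data for illustration only).
venues = pd.DataFrame({
    'City': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bar', 'Bar', 'Park', 'Cafe', 'Park', 'Park'],
})

# Route used in this notebook: one-hot encode, then average per city.
onehot = pd.get_dummies(venues['Venue Category']).astype(float)
onehot.insert(0, 'City', venues['City'])
freq_onehot = onehot.groupby('City').mean()

# Shortcut: a row-normalized contingency table.
freq_crosstab = pd.crosstab(venues['City'], venues['Venue Category'], normalize='index')

print(freq_crosstab)
```

Each row sums to 1, so cities with different numbers of collected venues remain comparable.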


In [112]:
# printing the most frequent venue categories for a few cities
num_top_venues = 5
sample_cities = ['Albany','Amsterdam','Auburn'] # just a few examples

for code in sample_cities:
    print("----"+code+"----")
    temp = ny_grouped[ny_grouped['City'] == code].T.reset_index().iloc[1:]
    temp.columns = ['venue','freq']
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Albany----
                 venue  freq
0    Convenience Store  0.07
1          Coffee Shop  0.06
2  American Restaurant  0.06
3       Ice Cream Shop  0.05
4                  Pub  0.04


----Amsterdam----
                  venue  freq
0     Convenience Store  0.11
1  Fast Food Restaurant  0.05
2    Italian Restaurant  0.05
3              Pharmacy  0.05
4        Discount Store  0.04


----Auburn----
                 venue  freq
0  American Restaurant  0.07
1           Restaurant  0.07
2               Bakery  0.04
3              Theater  0.04
4   Italian Restaurant  0.04




In [113]:
# defining a function to find the most common venues
def return_most_common_venues(df, num_top_venues):
    df_categories = df.iloc[1:]
    df_categories_sorted = df_categories.sort_values(ascending=False)  
    return df_categories_sorted.index.values[0:num_top_venues]

In [150]:
# creating the sorted dataset
num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# creating columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# creating a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['City'] = ny_grouped['City']

for ind in np.arange(ny_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Albany,Convenience Store,Coffee Shop,American Restaurant,Ice Cream Shop,Pub,Grocery Store,New American Restaurant,Park,Pizza Place,Café,Sushi Restaurant,Bakery,Italian Restaurant,Bookstore,Hotel,Sports Bar,Shopping Mall,Bar,Burger Joint,Sandwich Place
1,Amsterdam,Convenience Store,Italian Restaurant,Pharmacy,Fast Food Restaurant,Sandwich Place,Discount Store,Food,Chinese Restaurant,American Restaurant,Hardware Store,Snack Place,Steakhouse,Supermarket,Supplement Shop,Mobile Phone Shop,Frozen Yogurt Shop,Mexican Restaurant,Donut Shop,Café,Gas Station
2,Auburn,American Restaurant,Restaurant,Theater,Discount Store,Bar,Mexican Restaurant,Bakery,Park,Italian Restaurant,Pet Store,Supermarket,Tex-Mex Restaurant,Furniture / Home Store,Cocktail Bar,Rest Area,Bowling Alley,New American Restaurant,Liquor Store,Donut Shop,Sandwich Place
3,Batavia,American Restaurant,Golf Course,Diner,Pharmacy,Discount Store,Department Store,Ice Cream Shop,Donut Shop,Pizza Place,Coffee Shop,Burger Joint,Candy Store,Mexican Restaurant,Campground,Furniture / Home Store,Racetrack,Miscellaneous Shop,Taco Place,Skating Rink,Farmers Market
4,Beacon,Trail,Italian Restaurant,Park,Restaurant,American Restaurant,Hotel,Café,Brewery,Coffee Shop,Ice Cream Shop,Liquor Store,Burger Joint,Pharmacy,Diner,Pizza Place,Frozen Yogurt Shop,Bar,New American Restaurant,Taco Place,BBQ Joint


### 3.5. Clustering the Cities

In [116]:
# import kmeans
from sklearn.cluster import KMeans

In [147]:
# building the model

# set number of clusters
kclusters = 5

ny_grouped_clustering = ny_grouped.drop('City', axis = 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, init="k-means++", n_init=2, random_state=0).fit(ny_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([3, 0, 3, 2, 3, 3, 3, 1, 1, 2, 2, 2, 2, 0, 2, 1, 3, 0, 2, 3, 3, 2,
       0, 3, 1, 0, 2, 4, 3, 1, 1, 1, 3, 3, 3, 3, 2, 2, 2, 2, 3, 1, 3, 2,
       2, 1, 3, 3, 2, 1, 2, 3, 3, 2, 1, 3, 1, 1, 2, 3, 3, 1])
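
The number of clusters is fixed at 5 above. One common way to sanity-check such a choice, which is not part of the original analysis, is to compare silhouette scores across several values of k. The sketch below does this on synthetic data; `silhouette_by_k` is a hypothetical helper, and the three blobs are generated only for the demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(X, k_values, random_state=0):
    """Return {k: silhouette score} for each candidate number of clusters."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

# Synthetic data: three well-separated blobs in 2-D (not the venue data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = silhouette_by_k(X, range(2, 7))
best_k = max(scores, key=scores.get)
print(best_k)
```

For the data in this notebook, `X` would presumably be `ny_grouped_clustering`; with only 62 cities and roughly 300 sparse features, the scores may be low and should be read as a rough guide rather than a definitive answer.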

In [153]:
# add clustering labels
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [154]:
# merge venues_sorted with df to add latitude/longitude for each city
df_merged = pd.merge(left = df[['City','County','lat','lon']],right = venues_sorted[['City', 'Cluster Labels']], how = 'left', on = 'City')

df_merged.head()

| | City | County | lat | lon | Cluster Labels |
|---|---|---|---|---|---|
| 0 | Albany | Albany | 42.651167 | -73.754968 | 3 |
| 1 | Amsterdam | Montgomery | 42.943367 | -74.185044 | 0 |
| 2 | Auburn | Cayuga | 42.93202 | -76.567203 | 3 |
| 3 | Batavia | Genesee | 42.998014 | -78.187551 | 2 |
| 4 | Beacon | Dutchess | 41.504879 | -73.969682 | 3 |


## 4. Visualize the Results

In [124]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [148]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters (one color per cluster label)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_merged['lat'], df_merged['lon'], df_merged['City'], df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
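
For readers who want to see how a label-to-color mapping can be built without matplotlib, here is a dependency-free sketch using the standard library's `colorsys`. It swaps in evenly spaced hues instead of the `rainbow` colormap used above, and `cluster_palette` is a hypothetical helper name.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters hex colors evenly spaced around the hue wheel."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(5)
# Map each cluster label directly to its own color.
color_of = {label: palette[label] for label in range(5)}
print(color_of)
```

Indexing the palette by the cluster label itself keeps the legend easy to reason about: cluster 0 always gets the first color.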

## 5. Discussion
Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. 
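
One way to make "discriminating" concrete, offered here as a sketch rather than as the method used in this notebook, is to compare each cluster's mean category frequency with the overall mean and list the categories with the largest positive difference. `distinctive_categories` is a hypothetical helper, and the frequency table is made up.

```python
import pandas as pd

def distinctive_categories(freq, labels, top_n=3):
    """For each cluster, list the categories whose mean frequency most exceeds the overall mean."""
    overall = freq.mean()                     # mean frequency per category over all cities
    by_cluster = freq.groupby(labels).mean()  # mean frequency per category within each cluster
    lift = by_cluster - overall
    return {int(cluster): row.nlargest(top_n).index.tolist()
            for cluster, row in lift.iterrows()}

# Toy frequency table (made-up numbers): rows are cities, columns are venue categories.
freq = pd.DataFrame(
    {'Pizza Place': [0.10, 0.12, 0.02, 0.01],
     'Discount Store': [0.01, 0.02, 0.09, 0.11],
     'Park': [0.05, 0.05, 0.05, 0.05]},
    index=['City A', 'City B', 'City C', 'City D'])
labels = pd.Series([0, 0, 1, 1], index=freq.index, name='Cluster Labels')

top = distinctive_categories(freq, labels, top_n=1)
print(top)
```

Applied to this notebook, the inputs would presumably be `ny_grouped.drop('City', axis=1)` and `kmeans.labels_`. Unlike the per-city top-20 lists below, this view discounts categories that are common everywhere.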

In [133]:
# set the display format
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### 5.1. Cluster 0

In [134]:
venues_sorted.loc[venues_sorted['Cluster Labels'] == 0]

Unnamed: 0,Cluster Labels,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
1,0,Amsterdam,Convenience Store,Italian Restaurant,Pharmacy,Fast Food Restaurant,Sandwich Place,Discount Store,Food,Chinese Restaurant,American Restaurant,Hardware Store,Snack Place,Steakhouse,Supermarket,Supplement Shop,Mobile Phone Shop,Frozen Yogurt Shop,Mexican Restaurant,Donut Shop,Café,Gas Station
13,0,Fulton,Sandwich Place,Supermarket,American Restaurant,Lake,Discount Store,Gas Station,Pizza Place,Zoo Exhibit,Electronics Store,English Restaurant,Ethiopian Restaurant,Event Space,Exhibit,Farm,Fabric Shop,Drugstore,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Flower Shop
17,0,Gloversville,Convenience Store,Pharmacy,Fast Food Restaurant,Gas Station,Pizza Place,Sandwich Place,Supermarket,Donut Shop,Discount Store,Italian Restaurant,Diner,Restaurant,Ice Cream Shop,Grocery Store,Sporting Goods Shop,Hotel,Mobile Phone Shop,Liquor Store,Big Box Store,American Restaurant
22,0,Johnstown,Fast Food Restaurant,Convenience Store,Pharmacy,Sandwich Place,Gas Station,Discount Store,Pizza Place,Donut Shop,Supermarket,Italian Restaurant,Construction & Landscaping,Ice Cream Shop,Restaurant,American Restaurant,Diner,Farmers Market,Parking,Movie Theater,Sporting Goods Shop,Department Store
25,0,Little Falls,Rest Area,Discount Store,Fast Food Restaurant,Convenience Store,Pizza Place,Sports Bar,Gas Station,Beer Bar,Sandwich Place,Tree,Video Game Store,Grocery Store,Supermarket,Pharmacy,Hotel,Historic Site,Hardware Store,Donut Shop,Fabric Shop,Ethiopian Restaurant


* As the map shows, these five cities are located around the Ferris Lake Wild Forest, an area with rich water resources.
* The representative venues here, such as hardware stores, farms, fabric shops, discount stores and fish & chips shops, suggest that these cities' economies lean toward agriculture and fisheries, and that people tend to shop at discount stores.
* In addition, several international cuisines appear here, including Ethiopian and Chinese restaurants, which hints that these places may be more multicultural.
* To make it easier to refer to, we can name this cluster **multi**, meaning that these places are multicultural and agricultural.

### 5.2. Cluster 1

In [208]:
print('Cluster 1 contains {} cities'.format(venues_sorted.loc[venues_sorted['Cluster Labels'] == 1].shape[0]))
cluster1 = venues_sorted.loc[venues_sorted['Cluster Labels'] == 1]
cluster1

Cluster 1 contains 14 cities


Unnamed: 0,Cluster Labels,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
7,1,Canandaigua,Brewery,Italian Restaurant,Pizza Place,Coffee Shop,Ice Cream Shop,Bar,Burger Joint,Park,Department Store,American Restaurant,Breakfast Spot,Bowling Alley,Bridal Shop,Sporting Goods Shop,Supermarket,Mexican Restaurant,Café,Ski Shop,Motel,Campground
8,1,Cohoes,Pizza Place,Bakery,Ice Cream Shop,Convenience Store,Bar,Park,Diner,American Restaurant,Sandwich Place,Italian Restaurant,Brewery,Hotel,Pub,Eastern European Restaurant,Indian Restaurant,Coffee Shop,Chinese Restaurant,Performing Arts Venue,Hot Dog Joint,Mexican Restaurant
15,1,Glen Cove,Italian Restaurant,Beach,American Restaurant,Pizza Place,Golf Course,Deli / Bodega,Greek Restaurant,Ice Cream Shop,Mediterranean Restaurant,Grocery Store,Sushi Restaurant,Bakery,Café,Harbor / Marina,Thai Restaurant,Pub,Park,Boxing Gym,Mexican Restaurant,French Restaurant
24,1,Lackawanna,Bar,Pizza Place,Italian Restaurant,Brewery,Ice Cream Shop,American Restaurant,Steakhouse,Beach,Supermarket,Coffee Shop,Mexican Restaurant,Diner,Hotel,Nature Preserve,Plaza,Event Space,Bakery,Sports Bar,Restaurant,Café
29,1,Middletown,Pizza Place,Park,Bakery,Deli / Bodega,Italian Restaurant,Latin American Restaurant,Mexican Restaurant,Caribbean Restaurant,Bagel Shop,Brewery,Zoo,Food & Drink Shop,Coffee Shop,Pub,Gym,Dessert Shop,American Restaurant,Botanical Garden,Market,Garden
30,1,Mount Vernon,Pizza Place,Italian Restaurant,Caribbean Restaurant,Bakery,Grocery Store,Gym / Fitness Center,Brewery,Café,Ice Cream Shop,Bar,Burger Joint,Restaurant,Event Space,Park,Botanical Garden,Coffee Shop,American Restaurant,Gym,Latin American Restaurant,Brazilian Restaurant
31,1,New Rochelle,Pizza Place,Italian Restaurant,Grocery Store,Park,Coffee Shop,Caribbean Restaurant,American Restaurant,Café,Gym / Fitness Center,Ice Cream Shop,Mexican Restaurant,Bakery,Beach,Southern / Soul Food Restaurant,Event Space,Restaurant,Beer Store,Brewery,Sporting Goods Shop,Cycle Studio
41,1,Oswego,Campground,Pizza Place,Donut Shop,Food,American Restaurant,Bar,Restaurant,Discount Store,Grocery Store,Diner,Zoo Exhibit,English Restaurant,Ethiopian Restaurant,Event Space,Fabric Shop,Exhibit,Eastern European Restaurant,Farm,Farmers Market,Fast Food Restaurant
45,1,Poughkeepsie,Pizza Place,Italian Restaurant,American Restaurant,Brewery,Coffee Shop,Convenience Store,Park,New American Restaurant,Bar,Bakery,Café,Diner,Farm,Restaurant,Scenic Lookout,Historic Site,French Restaurant,Japanese Restaurant,Breakfast Spot,Spanish Restaurant
49,1,Rye,Pizza Place,Italian Restaurant,Beach,Deli / Bodega,Park,American Restaurant,Golf Course,Ice Cream Shop,Bar,Latin American Restaurant,Coffee Shop,Salad Place,Bakery,Bagel Shop,Seafood Restaurant,Brazilian Restaurant,Gym,Gourmet Shop,Grocery Store,Mexican Restaurant


* As the map shows, the cities of cluster 1 are more dispersed.
* Pizza places are extremely popular in these cities, so much so that Pizza Place is often the most common venue. Italian restaurants also tend to rank above American restaurants.
* However, apart from these dining preferences, I can't see any other distinguishing features. Therefore, I decided to simply name this cluster **ordinary**.

### 5.3. Cluster 2

In [207]:
print('Cluster 2 contains {} cities'.format(venues_sorted.loc[venues_sorted['Cluster Labels'] == 2].shape[0]))
cluster2 = venues_sorted.loc[venues_sorted['Cluster Labels'] == 2]
cluster2

Cluster 2 contains 19 cities


Unnamed: 0,Cluster Labels,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
3,2,Batavia,American Restaurant,Golf Course,Diner,Pharmacy,Discount Store,Department Store,Ice Cream Shop,Donut Shop,Pizza Place,Coffee Shop,Burger Joint,Candy Store,Mexican Restaurant,Campground,Furniture / Home Store,Racetrack,Miscellaneous Shop,Taco Place,Skating Rink,Farmers Market
9,2,Corning,American Restaurant,Hotel,Golf Course,Ice Cream Shop,Museum,Bar,Coffee Shop,Gas Station,Pizza Place,Sandwich Place,Brewery,Shoe Store,Fast Food Restaurant,Donut Shop,Historic Site,Gym,Trail,Plaza,Sporting Goods Shop,Italian Restaurant
10,2,Cortland,Bar,Fast Food Restaurant,Pizza Place,Diner,Bakery,Sandwich Place,Coffee Shop,Pub,Donut Shop,Discount Store,Hotel,American Restaurant,Moving Target,Mexican Restaurant,Restaurant,Breakfast Spot,Market,Lounge,Gas Station,Sports Bar
11,2,Dunkirk,Coffee Shop,Discount Store,Pizza Place,Chinese Restaurant,Ice Cream Shop,Italian Restaurant,Rental Car Location,Shoe Store,Bar,Beach,Mexican Restaurant,Big Box Store,Gastropub,Golf Course,Diner,Pharmacy,Supplement Shop,Liquor Store,Bakery,Department Store
12,2,Elmira,Sandwich Place,Ice Cream Shop,Discount Store,Pizza Place,Italian Restaurant,American Restaurant,Liquor Store,Grocery Store,Bakery,Park,Donut Shop,Steakhouse,Sports Bar,Coffee Shop,Bar,Supermarket,Pharmacy,Bookstore,Golf Course,Cosmetics Shop
14,2,Geneva,American Restaurant,Vineyard,Pizza Place,Bar,Restaurant,Harbor / Marina,Brewery,Resort,Big Box Store,Discount Store,Italian Restaurant,Pub,Ice Cream Shop,Coffee Shop,Hotel,Convenience Store,Wine Bar,Sports Bar,Breakfast Spot,Liquor Store
18,2,Hornell,Convenience Store,Pizza Place,Discount Store,American Restaurant,Construction & Landscaping,Grocery Store,Bar,Brewery,Sandwich Place,Hotel,Hardware Store,Donut Shop,Bowling Alley,Scenic Lookout,Shipping Store,Shoe Store,Diner,Fast Food Restaurant,Motorcycle Shop,Food
21,2,Jamestown,Discount Store,Coffee Shop,American Restaurant,Bar,Clothing Store,Supermarket,Hotel,Pizza Place,Pharmacy,Breakfast Spot,Seafood Restaurant,Grocery Store,Museum,Pub,Fast Food Restaurant,Lingerie Store,Bakery,Café,Rental Car Location,Mexican Restaurant
26,2,Lockport,Discount Store,Sandwich Place,Coffee Shop,Pizza Place,Pharmacy,Restaurant,American Restaurant,Ice Cream Shop,Video Store,Convenience Store,Hot Dog Joint,BBQ Joint,Bar,Bowling Alley,Italian Restaurant,Sports Bar,Grocery Store,Fruit & Vegetable Store,Drive-in Theater,Steakhouse
36,2,Norwich,Discount Store,Fast Food Restaurant,American Restaurant,Pizza Place,Supermarket,Video Store,Sandwich Place,Grocery Store,Tourist Information Center,Chinese Restaurant,Movie Theater,Airport Terminal,Donut Shop,Gas Station,Trail,Fried Chicken Joint,New American Restaurant,Café,Museum,Diner


In [209]:
# see the most common venues
s = pd.DataFrame(cluster2.iloc[:,2].value_counts().index,columns=[cluster2.iloc[:,2].name])[:5]

for i in np.arange(3,21):
    temp = pd.Series(cluster2.iloc[:,i].value_counts().index)[:5]
    name = cluster2.iloc[:,i].name
    s.insert(int(i-2),name,temp)

s

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue
0,American Restaurant,Pizza Place,Pizza Place,Pizza Place,Pizza Place,Sandwich Place,Coffee Shop,Sandwich Place,Sandwich Place,Discount Store,Mexican Restaurant,Discount Store,Historic Site,Coffee Shop,Shipping Store,Gym,Bowling Alley,Supermarket,Brewery
1,Sandwich Place,Coffee Shop,Coffee Shop,Bar,Bank,Italian Restaurant,Discount Store,Grocery Store,Convenience Store,Clothing Store,Movie Theater,Shipping Store,Mexican Restaurant,Scenic Lookout,Border Crossing,Supermarket,New American Restaurant,Café,Breakfast Spot
2,Discount Store,Discount Store,Sandwich Place,Gas Station,Construction & Landscaping,Bar,Bar,Pizza Place,Pizza Place,Chinese Restaurant,Hardware Store,Restaurant,Steakhouse,Donut Shop,Gas Station,Hardware Store,Discount Store,Sports Bar,New American Restaurant
3,Fast Food Restaurant,Fast Food Restaurant,Discount Store,Ice Cream Shop,Harbor / Marina,American Restaurant,Sandwich Place,Hotel,Diner,Café,Department Store,Grocery Store,Video Game Store,Convenience Store,Bowling Alley,Sports Bar,Bakery,Chinese Restaurant,Bakery
4,Coffee Shop,American Restaurant,Gas Station,Chinese Restaurant,Supermarket,Grocery Store,Pizza Place,Shoe Store,Pub,Bar,Lake,BBQ Joint,Bowling Alley,Pub,Bar,Lingerie Store,Wine Bar,Fruit & Vegetable Store,Drive-in Theater


* From the map, we can clearly see that almost all cities in cluster 2 lie near the border of New York State. So, how does this affect them?
* As mentioned above, there are some discount stores in the cities of cluster 0. However, compared with cluster 0, discount stores feature more prominently among the common venues of cluster 2, which suggests relatively weak economic productivity.
* Thus, we name this cluster **rural**, meaning that these places have room to develop further.
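
The comparison of discount stores across clusters can be made more concrete by measuring, per cluster, the share of cities that list a given category among their top venues. The following is only a minimal sketch: `toy` is a stand-in for `venues_sorted` built from a few rows copied from the tables in this notebook (truncated to four venue columns), and `category_share` is a hypothetical helper, not part of the original analysis. The numbers it prints describe the toy rows only, not the full clusters.

```python
import pandas as pd

# Toy stand-in for `venues_sorted`: a few rows copied from the tables in this
# notebook, truncated to the first four venue columns (illustration only).
toy = pd.DataFrame(
    [
        [2, "Norwich", "Discount Store", "Fast Food Restaurant", "American Restaurant", "Pizza Place"],
        [3, "Albany", "Convenience Store", "Coffee Shop", "American Restaurant", "Ice Cream Shop"],
        [3, "Auburn", "American Restaurant", "Restaurant", "Theater", "Discount Store"],
        [4, "Long Beach", "Beach", "Seafood Restaurant", "Pizza Place", "Bagel Shop"],
    ],
    columns=[
        "Cluster Labels", "City",
        "1st Most Common Venue", "2nd Most Common Venue",
        "3rd Most Common Venue", "4th Most Common Venue",
    ],
)


def category_share(df, category, top_n):
    """Fraction of cities per cluster with `category` among their first `top_n` venue columns."""
    venue_cols = [c for c in df.columns if c.endswith("Most Common Venue")][:top_n]
    # True for each city whose top-N venues include the category
    has_category = df[venue_cols].eq(category).any(axis=1)
    # The mean of a boolean Series is the fraction of True values
    return has_category.groupby(df["Cluster Labels"]).mean()


share = category_share(toy, "Discount Store", top_n=4)
print(share)
```

Applied to the full `venues_sorted` table, the same helper would give one number per cluster, which is easier to compare than reading the tables row by row.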

### 5.4. Cluster 3

In [158]:
print('Cluster 3 contains {} cities'.format(venues_sorted.loc[venues_sorted['Cluster Labels'] == 3].shape[0]))
cluster3 = venues_sorted.loc[venues_sorted['Cluster Labels'] == 3]
cluster3

Cluster 3 contains 23 cities


Unnamed: 0,Cluster Labels,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,3,Albany,Convenience Store,Coffee Shop,American Restaurant,Ice Cream Shop,Pub,Grocery Store,New American Restaurant,Park,Pizza Place,Café,Sushi Restaurant,Bakery,Italian Restaurant,Bookstore,Hotel,Sports Bar,Shopping Mall,Bar,Burger Joint,Sandwich Place
2,3,Auburn,American Restaurant,Restaurant,Theater,Discount Store,Bar,Mexican Restaurant,Bakery,Park,Italian Restaurant,Pet Store,Supermarket,Tex-Mex Restaurant,Furniture / Home Store,Cocktail Bar,Rest Area,Bowling Alley,New American Restaurant,Liquor Store,Donut Shop,Sandwich Place
4,3,Beacon,Trail,Italian Restaurant,Park,Restaurant,American Restaurant,Hotel,Café,Brewery,Coffee Shop,Ice Cream Shop,Liquor Store,Burger Joint,Pharmacy,Diner,Pizza Place,Frozen Yogurt Shop,Bar,New American Restaurant,Taco Place,BBQ Joint
5,3,Binghamton,Diner,Italian Restaurant,Sandwich Place,Bar,Coffee Shop,Pizza Place,Grocery Store,Hardware Store,Ice Cream Shop,Park,Movie Theater,Café,Gift Shop,Spa,Clothing Store,Trail,Pet Store,Dive Bar,Steakhouse,Sporting Goods Shop
6,3,Buffalo,Bar,Brewery,American Restaurant,Coffee Shop,New American Restaurant,Café,Food Truck,Plaza,Hotel,Thai Restaurant,Market,Bakery,Event Space,Cocktail Bar,Pizza Place,Art Gallery,Park,Nature Preserve,Wine Shop,Science Museum
16,3,Glens Falls,Convenience Store,American Restaurant,Brewery,Coffee Shop,Ice Cream Shop,Supermarket,Sushi Restaurant,Café,Bakery,Burger Joint,Mexican Restaurant,Sandwich Place,Clothing Store,Lingerie Store,Bar,Pizza Place,Diner,Pharmacy,Park,Cafeteria
19,3,Hudson,Diner,Pizza Place,American Restaurant,Convenience Store,Bar,Chinese Restaurant,Furniture / Home Store,Italian Restaurant,Hotel,Restaurant,Sandwich Place,Movie Theater,Cheese Shop,Historic Site,Café,BBQ Joint,Farm,Concert Hall,Antique Shop,Mobile Phone Shop
20,3,Ithaca,American Restaurant,Coffee Shop,Ice Cream Shop,Park,Trail,Bagel Shop,Italian Restaurant,Sandwich Place,Garden,Mexican Restaurant,Organic Grocery,Brewery,New American Restaurant,Department Store,Pizza Place,Burger Joint,Bakery,Thai Restaurant,Café,State / Provincial Park
23,3,Kingston,American Restaurant,Café,Ice Cream Shop,Mexican Restaurant,Italian Restaurant,Coffee Shop,Pizza Place,Bar,Restaurant,Thai Restaurant,Brewery,Discount Store,Sushi Restaurant,Farmers Market,Bookstore,Burger Joint,Wine Shop,Antique Shop,Seafood Restaurant,Convenience Store
28,3,Mechanicville,Convenience Store,Pizza Place,Sandwich Place,American Restaurant,Pharmacy,Coffee Shop,Gas Station,Ice Cream Shop,Golf Course,Hotel,Bakery,Diner,Supermarket,Italian Restaurant,Burger Joint,Pet Store,Donut Shop,Mexican Restaurant,Steakhouse,Gym


In [206]:
# Top 5 most frequent categories in each "Nth Most Common Venue" column
# (column 2 is the 1st most common venue; columns 3-20 are the 2nd-19th)
first = cluster3.iloc[:, 2]
s = pd.DataFrame(first.value_counts().index, columns=[first.name])[:5]

for i in np.arange(3, 21):
    temp = pd.Series(cluster3.iloc[:, i].value_counts().index)[:5]
    name = cluster3.iloc[:, i].name
    s.insert(int(i - 2), name, temp)

s

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue
0,American Restaurant,Italian Restaurant,American Restaurant,Coffee Shop,Bar,Coffee Shop,Grocery Store,Park,Hotel,Café,Sandwich Place,Sandwich Place,Brewery,Lingerie Store,Pizza Place,Bowling Alley,Donut Shop,Burger Joint,Ice Cream Shop
1,Convenience Store,Coffee Shop,Park,Ice Cream Shop,Trail,Pizza Place,New American Restaurant,Mexican Restaurant,Italian Restaurant,Brewery,Brewery,Café,Taco Place,Hotel,Bookstore,Burger Joint,Farmers Market,Pharmacy,Park
2,Trail,Mexican Restaurant,Sandwich Place,Park,Ice Cream Shop,Mexican Restaurant,Ice Cream Shop,Italian Restaurant,Bakery,Pet Store,Bakery,Bakery,Outdoor Sculpture,Cocktail Bar,Café,Liquor Store,Dessert Shop,Cosmetics Shop,Steakhouse
3,Diner,Restaurant,Café,Bar,Italian Restaurant,Donut Shop,Gas Station,Taco Place,Pizza Place,Burger Joint,Supermarket,Taco Place,Cheese Shop,Sandwich Place,Burger Joint,Taco Place,Park,Hockey Arena,Mexican Restaurant
4,Park,Park,Ice Cream Shop,Bakery,Pier,Bakery,Bakery,Deli / Bodega,Bar,Gym,American Restaurant,Breakfast Spot,Gift Shop,New American Restaurant,Clothing Store,Pet Store,Bookstore,Sushi Restaurant,Donut Shop


* As we can see, this cluster contains commercial cities, represented by New York City, that offer a wide variety of facilities.
* Recreational facilities such as parks, cafés, bowling alleys and gyms are abundant in these areas. Furthermore, there are many kinds of restaurants and eateries, such as Italian restaurants, Mexican restaurants, taco places and donut shops.
* Thus, we can name this cluster **metropolitan**, meaning that these places have relatively complete service facilities.
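
The per-rank summary that is repeated for each cluster could be wrapped in a small helper. This is a sketch under assumptions: `top_values_per_rank` is a hypothetical name, it selects columns by the `Most Common Venue` suffix instead of by position, and, unlike the cells above, it therefore also includes the 20th column when one is present. The toy frame reuses four rows and two venue columns from the cluster 3 table.

```python
import pandas as pd


def top_values_per_rank(cluster, n=5):
    """For each '... Most Common Venue' column, list its n most frequent categories."""
    venue_cols = [c for c in cluster.columns if c.endswith("Most Common Venue")]
    # value_counts() sorts categories by frequency, so the first n index
    # entries are the n most frequent categories in that column
    return pd.DataFrame(
        {col: pd.Series(cluster[col].value_counts().index[:n]) for col in venue_cols}
    )


# Toy subset of cluster 3 (rows copied from the table above, two venue columns)
toy_cluster = pd.DataFrame(
    {
        "Cluster Labels": [3, 3, 3, 3],
        "City": ["Albany", "Auburn", "Ithaca", "Kingston"],
        "1st Most Common Venue": [
            "Convenience Store", "American Restaurant",
            "American Restaurant", "American Restaurant",
        ],
        "2nd Most Common Venue": ["Coffee Shop", "Restaurant", "Coffee Shop", "Café"],
    }
)

summary = top_values_per_rank(toy_cluster, n=2)
print(summary)
```

Selecting columns by name rather than by `iloc` position keeps the helper working if the column order of the cluster table changes.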

### 5.5. Cluster 4

In [145]:
print('Cluster 4 contains {} cities'.format(venues_sorted.loc[venues_sorted['Cluster Labels'] == 4].shape[0]))
venues_sorted.loc[venues_sorted['Cluster Labels'] == 4]

Cluster 4 contains 1 cities


Unnamed: 0,Cluster Labels,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
27,4,Long Beach,Beach,Seafood Restaurant,Pizza Place,Bagel Shop,Surf Spot,Coffee Shop,Ice Cream Shop,American Restaurant,Bar,Supermarket,Italian Restaurant,Bakery,Deli / Bodega,Grocery Store,Pub,Donut Shop,Café,Hotel Bar,Brewery,Mexican Restaurant


* Clearly, Long Beach is a seaside tourist city: its 1st most common venue is beach, its 2nd is seafood restaurant, and its 5th is surf spot.
* Moreover, there are many kinds of places to drink and have fun, such as bars, bodegas, pubs and breweries. Thus, we can name this cluster **coastal**.
* If you would like to develop a business around the beach or related products, this should be the first place you consider.

## 6. Conclusion

* There are many ways to explore and cluster cities. In this notebook, I distinguish them based on Foursquare's location information, which I think is an interesting way to learn more about cities.
* For my purpose, the cities are divided into five groups, named in turn **multi**, **ordinary**, **rural**, **metropolitan**, and **coastal**. Each cluster has its own features.
* However, this study has some shortcomings. I only use venue categories as clustering features, ignoring other indicators such as population, transportation and GDP.
* To cluster the cities further, we should add more features.
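
As a rough illustration of adding more features, the sketch below joins venue-category frequencies with extra indicators and standardizes every column, so that indicators measured in large units (such as population) do not dominate a distance-based clustering. Everything here is an assumption made for illustration: the city names, column names and all numbers are placeholders rather than real statistics, and `zscore` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical inputs. All values are placeholders, not real statistics.
cities = ["City A", "City B", "City C"]
venue_freq = pd.DataFrame(
    {"Beach": [0.00, 0.10, 0.00], "Discount Store": [0.08, 0.00, 0.02]},
    index=cities,
)
extra = pd.DataFrame(
    {"population": [7_000, 33_000, 97_000], "gdp_per_capita": [30_000, 55_000, 48_000]},
    index=cities,
)


def zscore(df):
    """Standardize each column to mean 0 and standard deviation 1 (ddof=0).

    A constant column would produce NaN here and should be dropped beforehand.
    """
    return (df - df.mean()) / df.std(ddof=0)


# Join on the city index, then put all features on a comparable scale
features = zscore(venue_freq.join(extra))
print(features.round(2))
```

The resulting `features` table could then be passed to the same clustering step used for the venue categories alone.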

# THE END