# Capstone Project Notebook
This notebook will be used for the Data Science Capstone Project.

In [1]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


K-means clustering is used to group zip code areas by their most common venue categories. The zip codes are those associated with Alexandria, VA, plus the zip code of a nearby Kung Fu Tea location. This shows which Alexandria areas resemble the area around the existing store, and so suggests suitable locations for future openings.

These are the necessary packages for the analysis.

In [2]:
import pandas as pd
import numpy as np

from geopy.geocoders import Nominatim

# for handling requests
import requests 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

# Clustering package
from sklearn.cluster import KMeans

The data used in this project comes from https://www.unitedstateszipcodes.org/zip-code-database/

It is a CSV file, so pandas was used to read it into the notebook.

In [3]:
df = pd.read_csv("zip_code_database.csv")

In [4]:
print(df.columns)
df.head()

Index(['zip', 'type', 'decommissioned', 'primary_city', 'acceptable_cities',
       'unacceptable_cities', 'state', 'county', 'timezone', 'area_codes',
       'world_region', 'country', 'latitude', 'longitude',
       'irs_estimated_population_2015'],
      dtype='object')


Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
0,501,UNIQUE,0,Holtsville,,I R S Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,562
1,544,UNIQUE,0,Holtsville,,Irs Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,0
2,601,STANDARD,0,Adjuntas,,"Colinas Del Gigante, Jard De Adjuntas, Urb San...",PR,Adjuntas Municipio,America/Puerto_Rico,787939,,US,18.16,-66.72,0
3,602,STANDARD,0,Aguada,,"Alts De Aguada, Bo Guaniquilla, Comunidad Las ...",PR,Aguada Municipio,America/Puerto_Rico,787939,,US,18.38,-67.18,0
4,603,STANDARD,0,Aguadilla,Ramey,"Bda Caban, Bda Esteves, Bo Borinquen, Bo Ceiba...",PR,Aguadilla Municipio,America/Puerto_Rico,787,,US,18.43,-67.15,0
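One caveat visible in the preview above: because `zip` is parsed as an integer, codes such as 501 lose their leading zeros. This does not affect the Virginia zip codes used below, but a minimal sketch of preserving them could look like the following. The inline CSV is a tiny stand-in for the real file, and it assumes the raw file stores the codes zero-padded.

```python
import io
import pandas as pd

# Tiny stand-in for zip_code_database.csv (only three of its columns).
# Assumption: the raw file stores zip codes zero-padded, e.g. "00501".
sample_csv = io.StringIO(
    "zip,primary_city,state\n"
    "00501,Holtsville,NY\n"
    "22150,Springfield,VA\n"
)

# dtype={"zip": str} keeps the codes as text, so leading zeros survive.
sample_df = pd.read_csv(sample_csv, dtype={"zip": str})
print(sample_df["zip"].tolist())
```

If the column is read as text this way, later comparisons would need string literals (for example `'22150'` rather than `22150`).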


## Cleaning the Data

For this analysis, we need a zip code with a Kung Fu Tea currently in operation. One nearby location is in Springfield, Virginia (zip code 22150).

In [5]:
kft_df = df.loc[df.zip == 22150]
kft_df

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
9251,22150,STANDARD,0,Springfield,,,VA,Fairfax County,America/New_York,703,,US,38.78,-77.17,27470


In [6]:
va_df = df.loc[df.state == 'VA']

We can pull up a list of counties in Virginia. Alexandria is an independent city, so it should appear in the county column.

In [7]:
va_df.county.unique()

array(['Loudoun County', 'Culpeper County', 'Manassas City',
       'Prince William County', 'Manassas city', 'Fauquier County',
       'Fairfax County', 'Clarke County', 'Fairfax City',
       'Falls Church City', 'Falls Church city', 'Arlington County',
       'Alexandria city', 'Alexandria City', nan, 'Fredericksburg city',
       'Fredericksburg City', 'Stafford County', 'Spotsylvania County',
       'Caroline County', 'Northumberland County', 'Orange County',
       'Essex County', 'Westmoreland County', 'King George County',
       'Richmond County', 'Lancaster County', 'Winchester city',
       'Frederick County', 'Winchester City', 'Warren County',
       'Rappahannock County', 'Shenandoah County', 'Page County',
       'Madison County', 'Harrisonburg city', 'Harrisonburg City',
       'Rockingham County', 'Augusta County', 'Albemarle County',
       'Charlottesville city', 'Charlottesville City', 'Nelson County',
       'Greene County', 'Louisa County', 'Fluvanna County',
    

Alexandria appears twice in the list above, as "Alexandria City" and "Alexandria city", so both spellings are included in the filter.

In [8]:
city_df = va_df.loc[va_df.county.isin(['Alexandria City', 'Alexandria city'])]
city_df

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
9306,22301,STANDARD,0,Alexandria,Potomac,,VA,Alexandria city,America/New_York,571703,,US,38.82,-77.06,12980
9307,22302,STANDARD,0,Alexandria,,,VA,Alexandria city,America/New_York,571703,,US,38.82,-77.08,16090
9309,22304,STANDARD,0,Alexandria,,"Cameron Station, Theological Seminary, Trade C...",VA,Alexandria city,America/New_York,571703,,US,38.81,-77.11,42670
9310,22305,STANDARD,0,Alexandria,,George Washington,VA,Alexandria city,America/New_York,571703,,US,38.84,-77.06,15520
9316,22311,STANDARD,0,Alexandria,,,VA,Alexandria city,America/New_York,571703,,US,38.83,-77.13,17150
9318,22313,PO BOX,0,Alexandria,,,VA,Alexandria City,America/New_York,571,,US,38.82,-77.08,457
9319,22314,STANDARD,0,Alexandria,,,VA,Alexandria city,America/New_York,571703,,US,38.81,-77.06,29600
9321,22320,PO BOX,0,Alexandria,,George Mason,VA,Alexandria City,America/New_York,571,,US,38.82,-77.08,184
9322,22321,UNIQUE,1,Alexandria,Firm Zip,,VA,Alexandria City,America/New_York,571,,US,38.8,-77.05,0
9323,22331,STANDARD,0,Alexandria,,,VA,Alexandria City,America/New_York,703,,US,38.82,-77.08,0
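Listing both spellings works, but it has to be repeated for every city with mixed-case labels. As an alternative sketch (on a toy frame, not the real data), the comparison can be made case-insensitive by lower-casing the column first:

```python
import pandas as pd

# Toy frame mimicking the mixed-case county labels seen above.
toy = pd.DataFrame({
    "zip": [22301, 22313, 22150],
    "county": ["Alexandria city", "Alexandria City", "Fairfax County"],
})

# Compare on a lower-cased copy so both spellings match one literal.
mask = toy["county"].str.lower() == "alexandria city"
alexandria_only = toy.loc[mask]
print(alexandria_only["zip"].tolist())
```

Rows with a missing county compare as `False`, so they are simply left out of the result.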


The dataset includes a column called **primary_city**, which gives the city most associated with each zip code. Filtering on this column keeps every zip code associated with Alexandria, not just those inside the independent city of Alexandria (several fall in Fairfax County). This provides more zip codes to work with and more potential locations for a Kung Fu Tea.

In [9]:
city_df = va_df.loc[va_df.primary_city == 'Alexandria']
city_df

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
9306,22301,STANDARD,0,Alexandria,Potomac,,VA,Alexandria city,America/New_York,571703,,US,38.82,-77.06,12980
9307,22302,STANDARD,0,Alexandria,,,VA,Alexandria city,America/New_York,571703,,US,38.82,-77.08,16090
9308,22303,STANDARD,0,Alexandria,"Jefferson Manor, Jefferson Mnr",,VA,Fairfax County,America/New_York,703,,US,38.79,-77.08,14150
9309,22304,STANDARD,0,Alexandria,,"Cameron Station, Theological Seminary, Trade C...",VA,Alexandria city,America/New_York,571703,,US,38.81,-77.11,42670
9310,22305,STANDARD,0,Alexandria,,George Washington,VA,Alexandria city,America/New_York,571703,,US,38.84,-77.06,15520
9311,22306,STANDARD,0,Alexandria,Community,,VA,Fairfax County,America/New_York,571703,,US,38.76,-77.1,31310
9312,22307,STANDARD,0,Alexandria,Belleview,,VA,Fairfax County,America/New_York,703,,US,38.77,-77.06,9770
9313,22308,STANDARD,0,Alexandria,,,VA,Fairfax County,America/New_York,571703,,US,38.73,-77.06,13570
9314,22309,STANDARD,0,Alexandria,Engleside,,VA,Fairfax County,America/New_York,571703,,US,38.72,-77.11,31980
9315,22310,STANDARD,0,Alexandria,Franconia,,VA,Fairfax County,America/New_York,703,,US,38.78,-77.12,28420


The dataframe was also filtered for standard type zip codes. Unique and PO Box zip codes typically fall within standard zip code boundaries and so are redundant for this analysis.

In [10]:
city_df = city_df.loc[city_df.type == 'STANDARD']
city_df

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
9306,22301,STANDARD,0,Alexandria,Potomac,,VA,Alexandria city,America/New_York,571703,,US,38.82,-77.06,12980
9307,22302,STANDARD,0,Alexandria,,,VA,Alexandria city,America/New_York,571703,,US,38.82,-77.08,16090
9308,22303,STANDARD,0,Alexandria,"Jefferson Manor, Jefferson Mnr",,VA,Fairfax County,America/New_York,703,,US,38.79,-77.08,14150
9309,22304,STANDARD,0,Alexandria,,"Cameron Station, Theological Seminary, Trade C...",VA,Alexandria city,America/New_York,571703,,US,38.81,-77.11,42670
9310,22305,STANDARD,0,Alexandria,,George Washington,VA,Alexandria city,America/New_York,571703,,US,38.84,-77.06,15520
9311,22306,STANDARD,0,Alexandria,Community,,VA,Fairfax County,America/New_York,571703,,US,38.76,-77.1,31310
9312,22307,STANDARD,0,Alexandria,Belleview,,VA,Fairfax County,America/New_York,703,,US,38.77,-77.06,9770
9313,22308,STANDARD,0,Alexandria,,,VA,Fairfax County,America/New_York,571703,,US,38.73,-77.06,13570
9314,22309,STANDARD,0,Alexandria,Engleside,,VA,Fairfax County,America/New_York,571703,,US,38.72,-77.11,31980
9315,22310,STANDARD,0,Alexandria,Franconia,,VA,Fairfax County,America/New_York,703,,US,38.78,-77.12,28420


Zip codes with a population of 0 were also omitted from the analysis.

In [11]:
city_df = city_df.loc[city_df.irs_estimated_population_2015 > 0]
city_df

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
9306,22301,STANDARD,0,Alexandria,Potomac,,VA,Alexandria city,America/New_York,571703,,US,38.82,-77.06,12980
9307,22302,STANDARD,0,Alexandria,,,VA,Alexandria city,America/New_York,571703,,US,38.82,-77.08,16090
9308,22303,STANDARD,0,Alexandria,"Jefferson Manor, Jefferson Mnr",,VA,Fairfax County,America/New_York,703,,US,38.79,-77.08,14150
9309,22304,STANDARD,0,Alexandria,,"Cameron Station, Theological Seminary, Trade C...",VA,Alexandria city,America/New_York,571703,,US,38.81,-77.11,42670
9310,22305,STANDARD,0,Alexandria,,George Washington,VA,Alexandria city,America/New_York,571703,,US,38.84,-77.06,15520
9311,22306,STANDARD,0,Alexandria,Community,,VA,Fairfax County,America/New_York,571703,,US,38.76,-77.1,31310
9312,22307,STANDARD,0,Alexandria,Belleview,,VA,Fairfax County,America/New_York,703,,US,38.77,-77.06,9770
9313,22308,STANDARD,0,Alexandria,,,VA,Fairfax County,America/New_York,571703,,US,38.73,-77.06,13570
9314,22309,STANDARD,0,Alexandria,Engleside,,VA,Fairfax County,America/New_York,571703,,US,38.72,-77.11,31980
9315,22310,STANDARD,0,Alexandria,Franconia,,VA,Fairfax County,America/New_York,703,,US,38.78,-77.12,28420
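The three filters applied so far (primary city, zip code type, population) can also be combined into a single boolean mask. The sketch below uses a small made-up frame with the same column names to show the idea; it is not a replacement for the cells above.

```python
import pandas as pd

# Toy rows using the same column names as the zip code database.
toy = pd.DataFrame({
    "zip": [22301, 22313, 22331, 22150],
    "type": ["STANDARD", "PO BOX", "STANDARD", "STANDARD"],
    "primary_city": ["Alexandria", "Alexandria", "Alexandria", "Springfield"],
    "irs_estimated_population_2015": [12980, 457, 0, 27470],
})

# Each condition is a boolean Series; & combines them element-wise.
keep = (
    (toy["primary_city"] == "Alexandria")
    & (toy["type"] == "STANDARD")
    & (toy["irs_estimated_population_2015"] > 0)
)
filtered = toy.loc[keep]
print(filtered["zip"].tolist())
```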


Now that we have the zip codes for Alexandria, we can concatenate the row with the Kung Fu Tea location.

In [12]:
city_df = pd.concat([city_df, kft_df], ignore_index = True)
city_df

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
0,22301,STANDARD,0,Alexandria,Potomac,,VA,Alexandria city,America/New_York,571703,,US,38.82,-77.06,12980
1,22302,STANDARD,0,Alexandria,,,VA,Alexandria city,America/New_York,571703,,US,38.82,-77.08,16090
2,22303,STANDARD,0,Alexandria,"Jefferson Manor, Jefferson Mnr",,VA,Fairfax County,America/New_York,703,,US,38.79,-77.08,14150
3,22304,STANDARD,0,Alexandria,,"Cameron Station, Theological Seminary, Trade C...",VA,Alexandria city,America/New_York,571703,,US,38.81,-77.11,42670
4,22305,STANDARD,0,Alexandria,,George Washington,VA,Alexandria city,America/New_York,571703,,US,38.84,-77.06,15520
5,22306,STANDARD,0,Alexandria,Community,,VA,Fairfax County,America/New_York,571703,,US,38.76,-77.1,31310
6,22307,STANDARD,0,Alexandria,Belleview,,VA,Fairfax County,America/New_York,703,,US,38.77,-77.06,9770
7,22308,STANDARD,0,Alexandria,,,VA,Fairfax County,America/New_York,571703,,US,38.73,-77.06,13570
8,22309,STANDARD,0,Alexandria,Engleside,,VA,Fairfax County,America/New_York,571703,,US,38.72,-77.11,31980
9,22310,STANDARD,0,Alexandria,Franconia,,VA,Fairfax County,America/New_York,703,,US,38.78,-77.12,28420


Any unnecessary columns were dropped to reduce clutter.

In [13]:
city_df.drop(labels = ['type', 'decommissioned', 'primary_city', 'unacceptable_cities', 'state', 'timezone', 'area_codes', 'world_region', 'country', 'irs_estimated_population_2015'], axis = 1, inplace = True)
city_df.head()

Unnamed: 0,zip,acceptable_cities,county,latitude,longitude
0,22301,Potomac,Alexandria city,38.82,-77.06
1,22302,,Alexandria city,38.82,-77.08
2,22303,"Jefferson Manor, Jefferson Mnr",Fairfax County,38.79,-77.08
3,22304,,Alexandria city,38.81,-77.11
4,22305,,Alexandria city,38.84,-77.06


## Analysis

To properly make a map of Alexandria, we need its latitude and longitude.

In [15]:
address = 'Alexandria, Virginia'

geolocator = Nominatim(user_agent="virginia_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Alexandria are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Alexandria are 38.8147596, -77.0902476527272.
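The geocoder can return `None` when it finds no match for an address, in which case `location.latitude` would raise an error. A small guard, sketched here with stand-in geocoder functions so that it runs offline, could look like this. The helper name and the fallback coordinates are illustrative assumptions.

```python
from types import SimpleNamespace

def coordinates_or_default(geocode, address, default):
    """Return (latitude, longitude) for address, or default when nothing is found."""
    location = geocode(address)
    if location is None:  # no match for the address
        return default
    return (location.latitude, location.longitude)

# Stand-in geocoders so the sketch runs offline (hypothetical, illustration only).
def fake_hit(address):
    return SimpleNamespace(latitude=38.8147596, longitude=-77.0902476527272)

def fake_miss(address):
    return None

print(coordinates_or_default(fake_hit, "Alexandria, Virginia", (0.0, 0.0)))
print(coordinates_or_default(fake_miss, "Nowhere", (38.82, -77.09)))
```

In the notebook, `geolocator.geocode` would be passed as the first argument in place of the stand-ins.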


Now that we have the coordinates, we can view a map of the city and plot markers for the zip codes using Folium.

In [16]:
# create map of Alexandria using latitude and longitude values
map_alexandria = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, postal_code in zip(city_df['latitude'], city_df['longitude'], city_df['zip']):
    label = '{}'.format(postal_code)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_alexandria)  
    
map_alexandria

Now we need to use Foursquare to acquire venue data. We start by initializing our Foursquare credentials.

In [17]:
# Foursquare
CLIENT_ID = 'OG3QHO5S1GSFLINGRI1BHRE1TSTM5OOMV2QEPKJHBTBOWGKE' # your Foursquare ID
CLIENT_SECRET = 'YCOTAVLB554F0SFAZNSRS2JEWFETV0PMRDUWPDXDXXXEUFYN' # your Foursquare Secret
VERSION = '20190807' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: OG3QHO5S1GSFLINGRI1BHRE1TSTM5OOMV2QEPKJHBTBOWGKE
CLIENT_SECRET:YCOTAVLB554F0SFAZNSRS2JEWFETV0PMRDUWPDXDXXXEUFYN
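Hard-coding and printing the secret makes it easy to leak when the notebook is shared. One common alternative is to read the credentials from environment variables. The sketch below assumes two hypothetical variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; they are not defined by Foursquare or by this notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from a mapping (the process environment by default)."""
    # Hypothetical variable names chosen for this sketch.
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment.")
    return client_id, client_secret

# Example with an explicit mapping instead of the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID loaded:", bool(demo_id))
```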


These functions, adapted from the New York City analysis, retrieve nearby venues and their categories.

In [18]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# Function to get venues of neighborhoods
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
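`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` for a venue that has no category. A more defensive version of just the row-building step is sketched below. The helper name `venue_rows` and the sample items are made up for illustration; the items only mimic the keys that the function above reads.

```python
def venue_rows(name, lat, lng, items):
    """Build one tuple per venue, using None when a venue has no category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# Made-up items shaped like the keys accessed in getNearbyVenues.
sample_items = [
    {"venue": {"name": "Cafe A", "location": {"lat": 38.82, "lng": -77.05},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unlabeled B", "location": {"lat": 38.83, "lng": -77.06},
               "categories": []}},
]
rows = venue_rows(22301, 38.82, -77.06, sample_items)
print(rows)
```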

In [19]:
city_venues = getNearbyVenues(names=city_df['zip'],
                              latitudes=city_df['latitude'],
                              longitudes=city_df['longitude'],
                              radius = 1000
                              )

22301
22302
22303
22304
22305
22306
22307
22308
22309
22310
22311
22312
22314
22315
22150


In [20]:
print(city_venues.shape)
city_venues.head()

(610, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,22301,38.82,-77.06,Junction Bakery & Bistro,38.82038,-77.057719,Bakery
1,22301,38.82,-77.06,Walgreens,38.820469,-77.057498,Pharmacy
2,22301,38.82,-77.06,Majestic Lounge,38.824128,-77.058616,Lounge
3,22301,38.82,-77.06,Del Ray Cafe,38.82393,-77.057768,Café
4,22301,38.82,-77.06,The Front Porch,38.824192,-77.058409,Beer Garden


In [21]:
city_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
22150,71,71,71,71,71,71
22301,66,66,66,66,66,66
22302,6,6,6,6,6,6
22303,53,53,53,53,53,53
22304,55,55,55,55,55,55
22305,71,71,71,71,71,71
22306,4,4,4,4,4,4
22307,34,34,34,34,34,34
22308,7,7,7,7,7,7
22309,21,21,21,21,21,21


In [22]:
print('There are {} unique categories.'.format(len(city_venues['Venue Category'].unique())))

There are 175 unique categories.


One-hot encoding is needed so that the categorical venue data can be used with the k-means algorithm. The code below converts each value of the Venue Category column into its own 0/1 indicator column.

In [23]:
# one hot encoding
city_onehot = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
city_onehot['Neighborhood'] = city_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [city_onehot.columns[-1]] + list(city_onehot.columns[:-1])
city_onehot = city_onehot[fixed_columns]

city_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Automotive Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Weight Loss Center,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,22301,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,22301,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,22301,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,22301,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,22301,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The rows are then grouped by zip code, taking the mean of each indicator column. The result is the share of each zip code's venues that falls into each category.

In [25]:
city_grouped = city_onehot.groupby('Neighborhood').mean().reset_index()
city_grouped

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Automotive Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Weight Loss Center,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,22150,0.014085,0.0,0.042254,0.0,0.014085,0.0,0.0,0.0,0.014085,...,0.0,0.014085,0.0,0.0,0.0,0.0,0.0,0.0,0.042254,0.0
1,22301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.030303
2,22302,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,22303,0.0,0.0,0.0,0.018868,0.0,0.018868,0.0,0.0,0.0,...,0.0,0.0,0.0,0.037736,0.0,0.0,0.0,0.0,0.0,0.0
4,22304,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018182,...,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,22305,0.0,0.0,0.0,0.0,0.0,0.014085,0.014085,0.0,0.014085,...,0.014085,0.0,0.0,0.0,0.0,0.0,0.0,0.014085,0.014085,0.0
6,22306,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,22307,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.029412,0.0,0.029412,0.0,0.0,0.029412
8,22308,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,...,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,22309,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
city_grouped.shape

(15, 176)
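To see why the mean of the one-hot columns is a per-zip-code category frequency, here is a small sketch on made-up venues. It also shows that `pd.crosstab` with `normalize="index"` gives the same table in one step, which is an alternative to the two cells above rather than the method used in this notebook.

```python
import pandas as pd

# Made-up venues: four in one zip code, two in another.
toy_venues = pd.DataFrame({
    "Neighborhood": [22301, 22301, 22301, 22301, 22302, 22302],
    "Venue Category": ["Bakery", "Pharmacy", "Bakery", "Café", "Park", "Gym"],
})

# Route 1: one-hot encode, then average the indicator columns per zip code.
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])
freq_via_mean = onehot.groupby("Neighborhood").mean()

# Route 2: cross-tabulate and normalize each row to sum to 1.
freq_via_crosstab = pd.crosstab(
    toy_venues["Neighborhood"], toy_venues["Venue Category"], normalize="index"
)

print(freq_via_mean)
```

In the toy data, two of the four venues in 22301 are bakeries, so its Bakery frequency is 0.5.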

In [28]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [57]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = city_grouped['Neighborhood']

for ind in np.arange(city_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(city_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,22150,Clothing Store,Cosmetics Shop,Coffee Shop,Women's Store,Sporting Goods Shop,American Restaurant,Italian Restaurant,Shoe Store,Hotel,Pizza Place
1,22301,Pizza Place,Spa,Gym / Fitness Center,Coffee Shop,Pharmacy,Mexican Restaurant,Lounge,Playground,Cycle Studio,Convenience Store
2,22302,Track,Rental Car Location,Flower Shop,Gas Station,Park,Gym,Yoga Studio,Food,Fast Food Restaurant,Farmers Market
3,22303,Pizza Place,Convenience Store,Smoke Shop,Grocery Store,Chinese Restaurant,Mexican Restaurant,Vietnamese Restaurant,Gym / Fitness Center,Hotel,Bakery
4,22304,Park,Grocery Store,Mexican Restaurant,Chinese Restaurant,Food Truck,Liquor Store,Restaurant,Residential Building (Apartment / Condo),Rental Car Location,Coffee Shop
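The column names above rely on a three-element suffix list plus a fallback to "th", which works for the top ten but would produce labels such as "21th" for larger values. A small helper that handles the general case is sketched below; the name `ordinal` is an illustrative choice.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Same column layout as in the cell above, for the top ten venues.
columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns[1], "|", columns[-1])
```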


With the top ten venue categories determined for each zip code, the data can now be clustered using the venue category frequencies.

In [58]:
# set number of clusters
kclusters = 11

city_grouped_clustering = city_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(city_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:16] 

array([ 1,  1,  0,  5,  3,  1,  2,  9,  4,  7,  6, 10,  8,  1,  5])
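With 11 clusters for 15 zip codes, nine of the clusters in the output above contain a single zip code. One common way to sanity-check the number of clusters is to compare silhouette scores across several values of k. The sketch below does this on synthetic, well-separated points rather than on the venue data, so the numbers say nothing about Alexandria; it only illustrates the procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs of five points each (not the venue data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(5, 2)) for c in centers])

# Fit k-means for a range of k and record the mean silhouette score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("k with the highest silhouette score:", best_k)
```

The same loop could be run on `city_grouped_clustering`, keeping k between 2 and one less than the number of rows.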

The cluster labels can be added to the dataframe with the venue types. This dataframe can then be merged with the original dataframe.

In [59]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

city_merged = city_df

# join the clustered venue table onto city_df to add cluster labels and top venues for each zip code
city_merged = city_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='zip')

city_merged # check the last columns!

Unnamed: 0,zip,acceptable_cities,county,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,22301,Potomac,Alexandria city,38.82,-77.06,1,Pizza Place,Spa,Gym / Fitness Center,Coffee Shop,Pharmacy,Mexican Restaurant,Lounge,Playground,Cycle Studio,Convenience Store
1,22302,,Alexandria city,38.82,-77.08,0,Track,Rental Car Location,Flower Shop,Gas Station,Park,Gym,Yoga Studio,Food,Fast Food Restaurant,Farmers Market
2,22303,"Jefferson Manor, Jefferson Mnr",Fairfax County,38.79,-77.08,5,Pizza Place,Convenience Store,Smoke Shop,Grocery Store,Chinese Restaurant,Mexican Restaurant,Vietnamese Restaurant,Gym / Fitness Center,Hotel,Bakery
3,22304,,Alexandria city,38.81,-77.11,3,Park,Grocery Store,Mexican Restaurant,Chinese Restaurant,Food Truck,Liquor Store,Restaurant,Residential Building (Apartment / Condo),Rental Car Location,Coffee Shop
4,22305,,Alexandria city,38.84,-77.06,1,Pizza Place,Grocery Store,Coffee Shop,Furniture / Home Store,Spa,Supermarket,Bank,Ice Cream Shop,Mediterranean Restaurant,Gym / Fitness Center
5,22306,Community,Fairfax County,38.76,-77.1,2,Park,BBQ Joint,Gym,Trail,Yoga Studio,Ethiopian Restaurant,Food & Drink Shop,Food,Flower Shop,Fast Food Restaurant
6,22307,Belleview,Fairfax County,38.77,-77.06,9,Park,Gym / Fitness Center,Pharmacy,Pool,Bank,Donut Shop,Nature Preserve,Health & Beauty Service,Coffee Shop,Rest Area
7,22308,,Fairfax County,38.73,-77.06,4,Playground,Furniture / Home Store,Business Service,Video Store,Convenience Store,Athletics & Sports,Park,Event Service,Food,Flower Shop
8,22309,Engleside,Fairfax County,38.72,-77.11,7,Spa,Convenience Store,BBQ Joint,Grocery Store,Shipping Store,Sandwich Place,Furniture / Home Store,Gas Station,Thai Restaurant,Bakery
9,22310,Franconia,Fairfax County,38.78,-77.12,6,Pizza Place,Fast Food Restaurant,Furniture / Home Store,Playground,Pool,Discount Store,Convenience Store,Drugstore,Clothing Store,Sandwich Place


Folium can be used again to recreate the map, but with color-coded zip code markers based on the cluster labels.

In [60]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(city_merged['latitude'], city_merged['longitude'], city_merged['zip'], city_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    cluster = int(cluster)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
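The color setup in the cell above only needs one color per cluster label. A shorter way to build such a palette is sketched below, assuming labels run from 0 to `kclusters - 1` as k-means produces them; the dictionary name `cluster_color` is illustrative.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 11

# Sample the rainbow colormap at kclusters evenly spaced points and convert to hex.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Map each cluster label directly to its color.
cluster_color = {label: palette[label] for label in range(kclusters)}
print(cluster_color[0], cluster_color[kclusters - 1])
```

With this mapping, a marker for a given label would use `cluster_color[cluster]` for both `color` and `fill_color`.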

Below is the dataframe filtered for the cluster of interest (label 1). This cluster contains the zip code with an existing Kung Fu Tea (22150) and the zip codes most similar to it.

In [62]:
city_merged.loc[city_merged['Cluster Labels'] == 1, city_merged.columns[[0] + list(range(5, city_merged.shape[1]))]]

Unnamed: 0,zip,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,22301,1,Pizza Place,Spa,Gym / Fitness Center,Coffee Shop,Pharmacy,Mexican Restaurant,Lounge,Playground,Cycle Studio,Convenience Store
4,22305,1,Pizza Place,Grocery Store,Coffee Shop,Furniture / Home Store,Spa,Supermarket,Bank,Ice Cream Shop,Mediterranean Restaurant,Gym / Fitness Center
12,22314,1,Hotel,Sandwich Place,Coffee Shop,Café,Pizza Place,New American Restaurant,Seafood Restaurant,Bookstore,Thrift / Vintage Store,Park
14,22150,1,Clothing Store,Cosmetics Shop,Coffee Shop,Women's Store,Sporting Goods Shop,American Restaurant,Italian Restaurant,Shoe Store,Hotel,Pizza Place
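The filter above hard-codes cluster label 1, which would silently select the wrong cluster if the labels changed on a re-run. As a sketch on a toy frame, the label can instead be looked up from the Kung Fu Tea zip code and used to list the candidate zip codes:

```python
import pandas as pd

# Toy version of city_merged with only the two columns needed here.
toy_merged = pd.DataFrame({
    "zip": [22301, 22302, 22305, 22150],
    "Cluster Labels": [1, 0, 1, 1],
})

kft_zip = 22150  # zip code of the existing Kung Fu Tea

# Look up the cluster that contains the Kung Fu Tea zip code.
kft_cluster = toy_merged.loc[toy_merged["zip"] == kft_zip, "Cluster Labels"].iloc[0]

# Candidate zip codes: same cluster, excluding the existing location.
same_cluster = toy_merged["Cluster Labels"] == kft_cluster
candidates = toy_merged.loc[same_cluster & (toy_merged["zip"] != kft_zip), "zip"].tolist()
print(candidates)
```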
