**Methodology section**

As stated in the Introduction and Data sections above, our objective is to cluster the Dallas zip codes so that we can identify the most promising cluster. We can then pass that cluster's zip codes along to the restaurant owners, who will decide which specific area is the best place to open their second restaurant.

First, we import the libraries we need. Then we read the CSV file containing all the location data into a pandas dataframe so that we can begin the analysis.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np

In [2]:
zcode=pd.read_csv('uszips.csv')
print(zcode.shape)
zcode.head()

(33099, 16)


Unnamed: 0,zip,lat,lng,city,state_id,state_name,zcta,parent_zcta,population,density,county_fips,county_name,all_county_weights,imprecise,military,timezone
0,601,18.18,-66.7522,Adjuntas,PR,Puerto Rico,True,,18570,111.4,72001,Adjuntas,"{'72001':99.43,'72141':0.57}",False,False,America/Puerto_Rico
1,602,18.3607,-67.1752,Aguada,PR,Puerto Rico,True,,41520,523.7,72003,Aguada,{'72003':100},False,False,America/Puerto_Rico
2,603,18.4544,-67.122,Aguadilla,PR,Puerto Rico,True,,54689,667.9,72005,Aguadilla,{'72005':100},False,False,America/Puerto_Rico
3,606,18.1672,-66.9383,Maricao,PR,Puerto Rico,True,,6615,60.4,72093,Maricao,"{'72093':94.88,'72121':1.35,'72153':3.78}",False,False,America/Puerto_Rico
4,610,18.2903,-67.1224,Anasco,PR,Puerto Rico,True,,29016,311.9,72011,Añasco,"{'72003':0.55,'72011':99.45}",False,False,America/Puerto_Rico


We check which states appear in this dataframe and verify that Texas is one of them. Then we filter the dataframe to keep only the Texas zip codes. Later we will narrow this down to the zip codes in Dallas County.

In [3]:
zcode["state_id"].unique()

array(['PR', 'MA', 'RI', 'NH', 'ME', 'VT', 'CT', 'NY', 'NJ', 'PA', 'DE',
       'DC', 'VA', 'MD', 'WV', 'NC', 'SC', 'GA', 'FL', 'AL', 'TN', 'MS',
       'KY', 'OH', 'IN', 'MI', 'IA', 'WI', 'MN', 'SD', 'ND', 'MT', 'IL',
       'MO', 'KS', 'NE', 'LA', 'AR', 'OK', 'TX', 'CO', 'WY', 'ID', 'UT',
       'AZ', 'NM', 'NV', 'CA', 'HI', 'OR', 'WA', 'AK'], dtype=object)

In [4]:
zcode = zcode.drop(zcode.index[zcode['state_id'] != 'TX'])
zcode.head()

Unnamed: 0,zip,lat,lng,city,state_id,state_name,zcta,parent_zcta,population,density,county_fips,county_name,all_county_weights,imprecise,military,timezone
25843,75001,32.96,-96.8385,Addison,TX,Texas,True,,12414,1250.2,48113,Dallas,{'48113':100},False,False,America/Chicago
25844,75002,33.0897,-96.6075,Allen,TX,Texas,True,,63140,655.7,48085,Collin,{'48085':100},False,False,America/Chicago
25845,75006,32.9619,-96.897,Carrollton,TX,Texas,True,,46364,1060.8,48113,Dallas,{'48113':100},False,False,America/Chicago
25846,75007,33.0046,-96.8971,Carrollton,TX,Texas,True,,51624,1709.8,48121,Denton,"{'48113':5.79,'48121':94.21}",False,False,America/Chicago
25847,75009,33.3403,-96.7503,Celina,TX,Texas,True,,8785,35.5,48085,Collin,"{'48085':94.8,'48121':5.2}",False,False,America/Chicago


In [5]:
# We also want to make sure that there are no military zones left in the dataframe
zcode[zcode['military']!=False].size

0

Here we drop the columns that we don't need for the rest of the analysis. We only need the zip code, latitude, longitude, city name, state, and county.

In [6]:
zcode = zcode.drop(['zcta','parent_zcta','population','density','county_fips','all_county_weights','imprecise','military','timezone'], axis=1)
zcode.head()

Unnamed: 0,zip,lat,lng,city,state_id,state_name,county_name
25843,75001,32.96,-96.8385,Addison,TX,Texas,Dallas
25844,75002,33.0897,-96.6075,Allen,TX,Texas,Collin
25845,75006,32.9619,-96.897,Carrollton,TX,Texas,Dallas
25846,75007,33.0046,-96.8971,Carrollton,TX,Texas,Denton
25847,75009,33.3403,-96.7503,Celina,TX,Texas,Collin


In [7]:
zcode.shape

(1935, 7)

In [8]:
# We have 1935 zip codes located in Texas; now we'll focus on the zip codes located in Dallas County
zcode = zcode[zcode['county_name']=='Dallas']
zcode

Unnamed: 0,zip,lat,lng,city,state_id,state_name,county_name
25843,75001,32.96,-96.8385,Addison,TX,Texas,Dallas
25845,75006,32.9619,-96.897,Carrollton,TX,Texas,Dallas
25850,75019,32.9633,-96.9855,Coppell,TX,Texas,Dallas
25861,75038,32.8746,-96.9976,Irving,TX,Texas,Dallas
25862,75039,32.8875,-96.9422,Irving,TX,Texas,Dallas
25863,75040,32.9277,-96.6201,Garland,TX,Texas,Dallas
25864,75041,32.8809,-96.6515,Garland,TX,Texas,Dallas
25865,75042,32.9139,-96.6749,Garland,TX,Texas,Dallas
25866,75043,32.8571,-96.5794,Garland,TX,Texas,Dallas
25867,75044,32.9626,-96.6532,Garland,TX,Texas,Dallas


In [9]:
zcode.shape

(84, 7)

In [10]:
zcode["city"].unique()

array(['Addison', 'Carrollton', 'Coppell', 'Irving', 'Garland', 'Sachse',
       'Grand Prairie', 'Richardson', 'Rowlett', 'Cedar Hill', 'Desoto',
       'Duncanville', 'Lancaster', 'Hutchins', 'Mesquite', 'Seagoville',
       'Wilmer', 'Balch Springs', 'Sunnyvale', 'Dallas'], dtype=object)

**We're going to create a map of Dallas to visually see these cities**

In [11]:
# As we need to import libraries for creating our map,
# let's also import the other libraries we will use later

import json # library to handle JSON files

import requests # library to handle requests

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # install folium; comment this line out if it is already installed
import folium # map rendering library

print('Done')

Collecting package metadata: done
Solving environment: done

# All requested packages already installed.

Done


In [12]:
# We're going to create a map of Dallas, centered on the city center
latitude = 32.78
longitude = -96.8
map_dal = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lati, long, postcode, city in zip(zcode['lat'], zcode['lng'], zcode['zip'], zcode['city']):
    label = '{}, {}'.format(postcode, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lati, long],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dal)  
    
map_dal

**Working with our second data source**

In [13]:
# Our Foursquare credentials

CLIENT_ID = 'PGZ33VUOMSUUH4SUYAPQAOZSBH5YYZNMU1XRKFR3YEL4HF2X' # your Foursquare ID
CLIENT_SECRET = '2GP1WWAGWGBNSHIZSTY5PKILWVMT42DCYKDMVI4XIOYWJJA3' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: PGZ33VUOMSUUH4SUYAPQAOZSBH5YYZNMU1XRKFR3YEL4HF2X
CLIENT_SECRET:2GP1WWAGWGBNSHIZSTY5PKILWVMT42DCYKDMVI4XIOYWJJA3


First, we're going to use the Foursquare API to return venue data for the locations we already have. We request the top 100 venues within a 500-meter radius of our central point (latitude = 32.78, longitude = -96.8). We build the URL from our credentials and the parameters listed below. Once we send the request to Foursquare, we get back a JSON document containing the list of results.
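As a minimal sketch of how the request URL is assembled, the query string can also be built from a dictionary with the standard library's `urllib.parse.urlencode`, which takes care of escaping. The credential values below are placeholders and the helper name `build_explore_url` is hypothetical; the endpoint and parameter names are the ones used in this notebook.

```python
from urllib.parse import urlencode

# Endpoint used throughout this notebook
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Hypothetical helper: assemble the explore URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude of the search center
        'radius': radius,                # search radius in meters
        'limit': limit,                  # maximum number of venues to return
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# Placeholder credentials, for illustration only
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 32.78, -96.8, 500, 100)
print(example_url)
```

The resulting string can be passed to `requests.get` in the same way as a hand-formatted URL.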

In [14]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# latitude and longitude (global variables) were defined above, just before we created the map of Dallas,
# and are the coordinates of our central point

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url # display URL

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5c96a85c9fb6b73b71f88c38'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'},
    {'name': '$-$$$$', 'key': 'price'}]},
  'headerLocation': 'Main Street District',
  'headerFullLocation': 'Main Street District, Dallas',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 104,
  'suggestedBounds': {'ne': {'lat': 32.7845000045, 'lng': -96.794657660037},
   'sw': {'lat': 32.7754999955, 'lng': -96.80534233996299}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bc3321adce4eee1287c719d',
       'name': 'The Joule',
       'location': {'address': '1530 Main St',
        'crossStreet': 'btwn Akard & Ervay St',
        'lat': 32.78055847145238,
        'lng': -96

As you can see, the response contains a long list of venues around our central point in Dallas. Now we're going to extract each venue's category from this list, because that is the main feature we will use to compare how similar the different zip codes are. Once we have measured their similarities, we can begin clustering those areas.

In [15]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# now we can clean the json structure and build our dataframe

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,The Joule,Hotel,32.780558,-96.798247
1,The Westin Dallas Downtown,Hotel,32.780708,-96.801932
2,Weekend,Coffee Shop,32.780309,-96.798191
3,AT&T Store,Mobile Phone Shop,32.779811,-96.79875
4,Bread Zeppelin,Salad Place,32.780309,-96.800749
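To make the extraction step concrete, here is a small sketch that applies the same idea to hand-written records shaped like the entries of `results['response']['groups'][0]['items']`. The sample values and the helper name `first_category` are illustrative assumptions, not real API output, and plain dictionaries are used instead of a dataframe to keep the example self-contained.

```python
# Hand-written sample shaped like the items of the Foursquare response;
# the field values are illustrative, not real API output.
sample_items = [
    {'venue': {'name': 'Sample Hotel',
               'categories': [{'name': 'Hotel'}],
               'location': {'lat': 32.7806, 'lng': -96.7982}}},
    {'venue': {'name': 'Uncategorized Spot',
               'categories': [],
               'location': {'lat': 32.7803, 'lng': -96.7990}}},
]

def first_category(venue):
    """Return the name of the venue's first category, or None if it has none."""
    categories = venue.get('categories', [])
    return categories[0]['name'] if categories else None

# Keep only the four fields we care about for each venue
rows = [
    {'name': item['venue']['name'],
     'categories': first_category(item['venue']),
     'lat': item['venue']['location']['lat'],
     'lng': item['venue']['location']['lng']}
    for item in sample_items
]

for row in rows:
    print(row)
```

A venue with an empty category list ends up with `None`, which mirrors what `get_category_type` returns in that case.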


In [16]:
nearby_venues.shape

(100, 4)

In [17]:
nearby_venues['categories'].unique()

array(['Hotel', 'Coffee Shop', 'Mobile Phone Shop', 'Salad Place',
       'New American Restaurant', 'Bistro', 'Burger Joint',
       'French Restaurant', 'Italian Restaurant', 'Indian Restaurant',
       'Cupcake Shop', 'Department Store', 'Pool', 'Cocktail Bar',
       'Seafood Restaurant', 'Park', 'Mexican Restaurant', 'Restaurant',
       'Boutique', 'Sandwich Place', 'Plaza', 'Bar', 'Deli / Bodega',
       'Wings Joint', 'Taco Place', 'Sports Bar', 'American Restaurant',
       'Japanese Restaurant', 'Fast Food Restaurant', 'IT Services',
       'Convenience Store', 'Bank', 'Thai Restaurant', 'Gym',
       'Cajun / Creole Restaurant', 'Noodle House', 'Lounge',
       'Chinese Restaurant', 'Mediterranean Restaurant', 'Hotel Bar',
       'Café', 'Breakfast Spot', 'Smoothie Shop', 'Sushi Restaurant',
       'Pharmacy', 'BBQ Joint', 'Rental Car Location',
       'Fried Chicken Joint', 'Asian Restaurant',
       'Latin American Restaurant'], dtype=object)

The function `getNearbyVenues` repeats the request above for every zip code and collects the results in a single dataframe. It also prints the number of venues returned for each zip code. The `min_venues` parameter sets the threshold below which we consider a zip code to have too few venues. If a zip code returns very few venues (or none at all), we can take a closer look at it to understand why.

In [18]:
# Create a function to repeat a similar process for all the areas (postal codes)

def getNearbyVenues(postalcodes, latitudes, longitudes, radius=500, min_venues=0):
    """Return a dataframe of the venues Foursquare finds around each zip code."""
    venues_list = []
    for zipcode, lat, lng in zip(postalcodes, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # report the count so that zip codes with few venues can be double-checked
        print("number of venues for zipcode {} is : {} venues.".format(zipcode, len(results)))

        # skip zip codes that return fewer than min_venues venues (no effect with the default of 0)
        if len(results) < min_venues:
            continue

        # keep only the relevant information for each nearby venue
        for v in results:
            categories = v['venue']['categories']
            venues_list.append((
                zipcode,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                categories[0]['name'] if categories else None))  # some venues may have no category

    nearby_venues = pd.DataFrame(venues_list, columns=[
        'PostalCode',
        'PC Latitude',
        'PC Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'])

    return nearby_venues

In [19]:
# Now we can run the function above with our database of areas to get the number of venues per zip code

dal_venues = getNearbyVenues(postalcodes=zcode['zip'],
                            latitudes=zcode['lat'],
                            longitudes=zcode['lng'],
                            min_venues=0)

number of venues for zipcode 75001 is : 11 venues.
number of venues for zipcode 75006 is : 4 venues.
number of venues for zipcode 75019 is : 2 venues.
number of venues for zipcode 75038 is : 2 venues.
number of venues for zipcode 75039 is : 0 venues.
number of venues for zipcode 75040 is : 2 venues.
number of venues for zipcode 75041 is : 6 venues.
number of venues for zipcode 75042 is : 6 venues.
number of venues for zipcode 75043 is : 3 venues.
number of venues for zipcode 75044 is : 6 venues.
number of venues for zipcode 75048 is : 2 venues.
number of venues for zipcode 75050 is : 0 venues.
number of venues for zipcode 75051 is : 1 venues.
number of venues for zipcode 75052 is : 0 venues.
number of venues for zipcode 75060 is : 2 venues.
number of venues for zipcode 75061 is : 8 venues.
number of venues for zipcode 75062 is : 7 venues.
number of venues for zipcode 75063 is : 26 venues.
number of venues for zipcode 75080 is : 21 venues.
number of venues for zipcode 75081 is : 9 venue
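The counts printed above make it easy to flag sparse zip codes. The following sketch shows one way to do that, assuming the counts have been collected into a dictionary; the helper `sparse_zipcodes` is hypothetical, and the numbers are copied from the first ten lines of the output above.

```python
# Venue counts copied from the first lines of the output above
venue_counts = {
    75001: 11, 75006: 4, 75019: 2, 75038: 2, 75039: 0,
    75040: 2, 75041: 6, 75042: 6, 75043: 3, 75044: 6,
}

def sparse_zipcodes(counts, min_venues):
    """Return the zip codes whose venue count is below min_venues, in sorted order."""
    return sorted(z for z, n in counts.items() if n < min_venues)

# min_venues=1 flags only the zip codes with no venues at all
print(sparse_zipcodes(venue_counts, min_venues=1))

# a stricter threshold, chosen purely for illustration
print(sparse_zipcodes(venue_counts, min_venues=3))
```

Zip codes flagged this way are candidates for a second look, for example with a larger search radius.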

In [20]:
print(dal_venues.shape)
dal_venues.head()

(815, 7)


Unnamed: 0,PostalCode,PC Latitude,PC Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,75001,32.96,-96.8385,Cindi's N.Y. Delicatessen,32.961864,-96.838873,Diner
1,75001,32.96,-96.8385,La Spiga,32.958278,-96.837417,Bakery
2,75001,32.96,-96.8385,Enterprise Rent-A-Car,32.960673,-96.838102,Rental Car Location
3,75001,32.96,-96.8385,Scanning Nations,32.961567,-96.839008,Business Service
4,75001,32.96,-96.8385,Ed's Lawn Equipment,32.958521,-96.839043,Hardware Store


We show the number of unique venue categories and the names of those categories.

In [21]:
dal_venues['PostalCode'].unique().shape
print('There are {} unique categories.'.format(len(dal_venues['Venue Category'].unique())))

There are 184 unique categories.


In [22]:
dal_venues['Venue Category'].unique()

array(['Diner', 'Bakery', 'Rental Car Location', 'Business Service',
       'Hardware Store', 'Arts & Crafts Store', 'Music Store',
       'Art Studio', 'Furniture / Home Store', 'Jewelry Store',
       'History Museum', 'Park', 'Water Park',
       'Construction & Landscaping', 'Fried Chicken Joint', 'Golf Course',
       'Costume Shop', 'American Restaurant', 'Other Repair Shop', 'Food',
       'Mexican Restaurant', 'Music Venue', 'Print Shop', 'Movie Theater',
       'Cosmetics Shop', 'Pizza Place', 'Donut Shop', 'Baseball Field',
       'Gym', 'Trail', 'Video Store', 'Café', 'Pool',
       'Gym / Fitness Center', 'Bar', 'Discount Store',
       'Fast Food Restaurant', 'Salvadoran Restaurant',
       'Convenience Store', 'Gas Station', 'Indian Restaurant',
       'Grocery Store', 'Boutique', 'Burger Joint', 'Market',
       'New American Restaurant', 'Bowling Alley', 'Chinese Restaurant',
       'Sandwich Place', 'Smoothie Shop', 'Steakhouse',
       'Mediterranean Restaurant', 'Shi

**Use one-hot encoding to further break down each zip code**

The process we're about to perform is:

We apply one-hot encoding to the dataframe of venues we gathered for each area. Each row is a venue, with a 1 in the column of its category and a 0 in every other column.

We drop the latitude and longitude of the individual venues so that we stay at the zip code level, and we add the postal code back to the one-hot dataframe.

We group the rows by postal code and take the mean of each category column, which gives the frequency of occurrence of each category in that zip code. Then we print the 5 most common venue categories in each zip code.

Finally, we put that information for all areas into a dataframe. Using a function that sorts the venue categories in descending order of frequency, we create the new dataframe and display the top 10 venue categories for each zip code.
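The steps described above can be sketched on a toy dataframe. The values below are made up for illustration and the `toy_` names are hypothetical; only `pd.get_dummies`, `groupby` and `sort_values` are the actual pandas calls involved.

```python
import pandas as pd

# Toy data: three venues in one zip code, two in another (illustrative values)
toy_venues = pd.DataFrame({
    'PostalCode': [75001, 75001, 75001, 75006, 75006],
    'Venue Category': ['Bakery', 'Diner', 'Bakery', 'Park', 'Diner'],
})

# One-hot encode: one row per venue, one column per category
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
toy_onehot.insert(0, 'PostalCode', toy_venues['PostalCode'])

# Mean per zip code = share of that zip code's venues in each category
toy_grouped = toy_onehot.groupby('PostalCode').mean()

# Two most frequent categories per zip code, in descending order of frequency
toy_top = {
    zipcode: freqs.sort_values(ascending=False).head(2).index.tolist()
    for zipcode, freqs in toy_grouped.iterrows()
}

print(toy_grouped)
print(toy_top)
```

Because each venue contributes exactly one 1 per row, the frequencies for a zip code always sum to 1, which puts zip codes with many and few venues on the same scale.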

In [23]:
# one hot encoding
dal_venues_onehot = pd.get_dummies(dal_venues[['Venue Category']], prefix="", prefix_sep="")

# add postalcode column back to dataframe
dal_venues_onehot['PostalCode'] = dal_venues['PostalCode'] 

# move postalcode column to the first column
fixed_columns = [dal_venues_onehot.columns[-1]] + list(dal_venues_onehot.columns[:-1])
dal_venues_onehot = dal_venues_onehot[fixed_columns]

dal_venues_onehot.head()

Unnamed: 0,PostalCode,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,Art Studio,Arts & Crafts Store,Asian Restaurant,Auto Garage,Automotive Shop,BBQ Joint,Bakery,Bank,Bar,Baseball Field,Basketball Court,Beer Garden,Big Box Store,Bistro,Boutique,Bowling Alley,Breakfast Spot,Brewery,Bubble Tea Shop,Burger Joint,Bus Stop,Business Service,Cafeteria,Café,Cajun / Creole Restaurant,Campground,Carpet Store,Check Cashing Service,Cheese Shop,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,College Bookstore,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Costume Shop,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Dry Cleaner,Electronics Store,Farmers Market,Fast Food Restaurant,Field,Flea Market,Flower Shop,Food,Food Truck,Football Stadium,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Garden,Gas Station,Gay Bar,Gift Shop,Go Kart Track,Golf Course,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Halal Restaurant,Hardware Store,Hawaiian Restaurant,Health Food Store,History Museum,Home Service,Hookah Bar,Hotel,IT Services,Ice Cream Shop,Indian Restaurant,Intersection,Italian Restaurant,Japanese Restaurant,Jewelry Store,Kids Store,Lake,Latin American Restaurant,Leather Goods Store,Light Rail Station,Liquor Store,Locksmith,Lounge,Market,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Monument / Landmark,Movie Theater,Museum,Music Store,Music Venue,Nail Salon,New American Restaurant,Nightclub,Nightlife Spot,Noodle House,Opera House,Optical Shop,Organic Grocery,Other Repair Shop,Outlet Store,Paper / Office Supplies Store,Park,Pawn Shop,Performing Arts Venue,Pet Service,Pet Store,Pharmacy,Pizza Place,Plaza,Pool,Print Shop,Recording Studio,Recreation Center,Rental Car Location,Residential Building (Apartment / Condo),Restaurant,Rugby Pitch,Salad Place,Salon / Barbershop,Salvadoran Restaurant,Sandwich Place,School,Seafood Restaurant,Shipping Store,Shoe Store,Shop & Service,Shopping Mall,Skating Rink,Smoke Shop,Smoothie Shop,Soccer Field,South Indian Restaurant,Southern / Soul Food Restaurant,Spa,Sporting Goods Shop,Sports Bar,Steakhouse,Storage Facility,Supermarket,Supplement Shop,Sushi Restaurant,Taco Place,Tanning Salon,Tea Room,Tennis Court,Tex-Mex Restaurant,Thai Restaurant,Theater,Thrift / Vintage Store,Trail,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wine Shop,Wings Joint,Women's Store
0,75001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,75001,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,75001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,75001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,75001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [24]:
dal_venues_onehot.shape

(815, 185)

In [25]:
dal_venues_grouped = dal_venues_onehot.groupby('PostalCode').mean().reset_index()
dal_venues_grouped.head()

Unnamed: 0,PostalCode,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,Art Studio,Arts & Crafts Store,Asian Restaurant,Auto Garage,Automotive Shop,BBQ Joint,Bakery,Bank,Bar,Baseball Field,Basketball Court,Beer Garden,Big Box Store,Bistro,Boutique,Bowling Alley,Breakfast Spot,Brewery,Bubble Tea Shop,Burger Joint,Bus Stop,Business Service,Cafeteria,Café,Cajun / Creole Restaurant,Campground,Carpet Store,Check Cashing Service,Cheese Shop,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,College Bookstore,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Costume Shop,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Dry Cleaner,Electronics Store,Farmers Market,Fast Food Restaurant,Field,Flea Market,Flower Shop,Food,Food Truck,Football Stadium,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Garden,Gas Station,Gay Bar,Gift Shop,Go Kart Track,Golf Course,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Halal Restaurant,Hardware Store,Hawaiian Restaurant,Health Food Store,History Museum,Home Service,Hookah Bar,Hotel,IT Services,Ice Cream Shop,Indian Restaurant,Intersection,Italian Restaurant,Japanese Restaurant,Jewelry Store,Kids Store,Lake,Latin American Restaurant,Leather Goods Store,Light Rail Station,Liquor Store,Locksmith,Lounge,Market,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Monument / Landmark,Movie Theater,Museum,Music Store,Music Venue,Nail Salon,New American Restaurant,Nightclub,Nightlife Spot,Noodle House,Opera House,Optical Shop,Organic Grocery,Other Repair Shop,Outlet Store,Paper / Office Supplies Store,Park,Pawn Shop,Performing Arts Venue,Pet Service,Pet Store,Pharmacy,Pizza Place,Plaza,Pool,Print Shop,Recording Studio,Recreation Center,Rental Car Location,Residential Building (Apartment / Condo),Restaurant,Rugby Pitch,Salad Place,Salon / Barbershop,Salvadoran Restaurant,Sandwich Place,School,Seafood Restaurant,Shipping Store,Shoe Store,Shop & Service,Shopping Mall,Skating Rink,Smoke Shop,Smoothie Shop,Soccer Field,South Indian Restaurant,Southern / Soul Food Restaurant,Spa,Sporting Goods Shop,Sports Bar,Steakhouse,Storage Facility,Supermarket,Supplement Shop,Sushi Restaurant,Taco Place,Tanning Salon,Tea Room,Tennis Court,Tex-Mex Restaurant,Thai Restaurant,Theater,Thrift / Vintage Store,Trail,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wine Shop,Wings Joint,Women's Store
0,75001,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.181818,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,75006,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0
2,75019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,75038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,75040,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
print(dal_venues_grouped.shape)

(70, 185)


In [27]:
# Following our process described above, we print each area along with the top 5 most common categories of venues
num_top_venues = 5

for area in dal_venues_grouped['PostalCode']:
    print("----- {} -----".format(area))
    #print("----"+area+"----")
    temp = dal_venues_grouped[dal_venues_grouped['PostalCode'] == area].T.reset_index()
    temp.columns = ['venue category','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----- 75001 -----
           venue category  freq
0     Arts & Crafts Store  0.18
1           Jewelry Store  0.09
2                  Bakery  0.09
3  Furniture / Home Store  0.09
4        Business Service  0.09


----- 75006 -----
               venue category  freq
0                  Water Park  0.25
1  Construction & Landscaping  0.25
2              History Museum  0.25
3                        Park  0.25
4         American Restaurant  0.00


----- 75019 -----
               venue category  freq
0                        Park   0.5
1  Construction & Landscaping   0.5
2         American Restaurant   0.0
3                   Nightclub   0.0
4              Nightlife Spot   0.0


----- 75038 -----
            venue category  freq
0      Fried Chicken Joint   0.5
1              Golf Course   0.5
2             Outlet Store   0.0
3  New American Restaurant   0.0
4                Nightclub   0.0


----- 75040 -----
        venue category  freq
0         Costume Shop   0.5
1     Business Service

In [28]:
# Create a pandas dataframe and display the top 10 categories of venues for each zip code.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
postalcode_area_venues_sorted = pd.DataFrame(columns=columns)
postalcode_area_venues_sorted['PostalCode'] = dal_venues_grouped['PostalCode']

for ind in np.arange(dal_venues_grouped.shape[0]):
    postalcode_area_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dal_venues_grouped.iloc[ind, :], num_top_venues)


postalcode_area_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,75001,Arts & Crafts Store,Music Store,Bakery,Diner,Jewelry Store,Rental Car Location,Art Studio,Business Service,Hardware Store,Furniture / Home Store
1,75006,Water Park,Construction & Landscaping,History Museum,Park,Women's Store,Donut Shop,Food,Flower Shop,Flea Market,Field
2,75019,Construction & Landscaping,Park,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
3,75038,Fried Chicken Joint,Golf Course,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant,Farmers Market
4,75040,Costume Shop,Business Service,Electronics Store,Football Stadium,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
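
The ranking step above can also be expressed without an explicit row loop. The following is a minimal sketch on a made-up frequency table: the values in `toy` are illustrative rather than taken from the Foursquare data, and `top_categories` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for dal_venues_grouped: one row per zip code,
# one column per venue category, values are mean frequencies (made up).
toy = pd.DataFrame({
    "PostalCode": [75001, 75006],
    "Bakery": [0.09, 0.00],
    "Park": [0.00, 0.25],
    "Arts & Crafts Store": [0.18, 0.00],
    "Water Park": [0.02, 0.50],
})

def top_categories(grouped, n):
    """Return, per zip code, the n category names with the highest frequency."""
    freqs = grouped.set_index("PostalCode")
    # nlargest sorts each row by frequency and keeps the n biggest entries
    return freqs.apply(lambda row: row.nlargest(n).index.tolist(), axis=1)

top2 = top_categories(toy, 2)
print(top2)
```

Note that when many categories have a frequency of 0.0, as in several of the zip codes above, the lower ranks are filled by tie-breaking order rather than by any real signal.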


**K-Means Clustering**

The dataset is now ready for k-means clustering. We start with a default of 5 clusters, as we did in the New York and Toronto labs.

With 84 zip codes in total, 5 clusters is a reasonable starting point for classification. As a reminder, our overall objective is to identify the best cluster. That cluster gives our stakeholders a list of zip codes to target as they look for areas for the new Persian restaurant. When reviewing the clustering results, we therefore assess how similar the zip codes are within each cluster and how dissimilar the clusters are from one another. This is a key phase of our work with the client, because both sides need a shared understanding of the results. If we need more clarity or more robust results, we can run k-means for other values of k and explore further.
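
One common way to put a number on "similar inside a cluster, dissimilar between clusters" is the silhouette score. This is not a step the analysis itself performs; the sketch below is an illustration on synthetic data (the array `X` and its three blobs are assumptions standing in for the venue-frequency matrix), using scikit-learn's `silhouette_score`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the venue-frequency matrix: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette: values near 1 indicate tight, well-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest mean silhouette:", best_k)
```

On real venue data the scores are likely to be much less clear-cut than on these blobs, so the result would be one input to the discussion with the client rather than a final answer.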

In [29]:
# set number of clusters
kclusters = 5

dal_venues_grouped_clustering = dal_venues_grouped.drop(columns='PostalCode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dal_venues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 1, 0, 1, 3, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 3, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 3, 1, 1, 1, 1, 0, 2,
       1, 1, 1, 1], dtype=int32)

In [30]:
# Add clustering labels  
postalcode_area_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dal_merged = zcode

# merge data frames to add latitude/longitude for each zip code
dal_merged = dal_merged.join(postalcode_area_venues_sorted.set_index('PostalCode'), on='zip', how='inner')

dal_merged.head(20)

Unnamed: 0,zip,lat,lng,city,state_id,state_name,county_name,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25843,75001,32.96,-96.8385,Addison,TX,Texas,Dallas,1,Arts & Crafts Store,Music Store,Bakery,Diner,Jewelry Store,Rental Car Location,Art Studio,Business Service,Hardware Store,Furniture / Home Store
25845,75006,32.9619,-96.897,Carrollton,TX,Texas,Dallas,1,Water Park,Construction & Landscaping,History Museum,Park,Women's Store,Donut Shop,Food,Flower Shop,Flea Market,Field
25850,75019,32.9633,-96.9855,Coppell,TX,Texas,Dallas,0,Construction & Landscaping,Park,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25861,75038,32.8746,-96.9976,Irving,TX,Texas,Dallas,1,Fried Chicken Joint,Golf Course,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant,Farmers Market
25863,75040,32.9277,-96.6201,Garland,TX,Texas,Dallas,3,Costume Shop,Business Service,Electronics Store,Football Stadium,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25864,75041,32.8809,-96.6515,Garland,TX,Texas,Dallas,1,American Restaurant,Bakery,Food,Mexican Restaurant,Other Repair Shop,Construction & Landscaping,Diner,Farmers Market,Food Truck,Department Store
25865,75042,32.9139,-96.6749,Garland,TX,Texas,Dallas,1,Print Shop,Movie Theater,Cosmetics Shop,Construction & Landscaping,Business Service,Music Venue,Electronics Store,Food,Flower Shop,Flea Market
25866,75043,32.8571,-96.5794,Garland,TX,Texas,Dallas,1,Donut Shop,Pizza Place,Baseball Field,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25867,75044,32.9626,-96.6532,Garland,TX,Texas,Dallas,1,Pool,Video Store,Café,Trail,Gym / Fitness Center,Gym,Women's Store,Dog Run,Flea Market,Field
25868,75048,32.972,-96.5808,Sachse,TX,Texas,Dallas,2,Bar,Park,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
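
The merge above uses `how='inner'`, which silently drops any zip code that has no row in the venue table. A minimal sketch of that behavior on toy frames follows; the frames `zips` and `venues` are illustrative stand-ins, and the choice of which zip code is missing is made up for the example.

```python
import pandas as pd

# Toy stand-ins: a zip-code table and a per-zip venue summary.
zips = pd.DataFrame({
    "zip": [75001, 75006, 75019],
    "lat": [32.96, 32.9619, 32.9633],
    "lng": [-96.8385, -96.897, -96.9855],
})
venues = pd.DataFrame({
    "PostalCode": [75001, 75019],  # 75006 deliberately left out for illustration
    "Cluster Labels": [1, 0],
})

# how='inner' keeps only zip codes present in both tables.
merged = zips.join(venues.set_index("PostalCode"), on="zip", how="inner")
dropped = sorted(set(zips["zip"]) - set(merged["zip"]))

print(merged)
print("zip codes without venue data:", dropped)
```

Checking `dropped` after a merge like this is a cheap way to confirm how many zip codes actually carry a cluster label.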


In [31]:
# Recreate the map to show the zip codes with their assigned clusters
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lati, long, city, poi, cluster in zip(dal_merged['lat'], dal_merged['lng'], dal_merged['city'], dal_merged['zip'], dal_merged['Cluster Labels']):
    label = folium.Popup(str(city) + ', ' + str(poi) + ', Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lati, long],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.99).add_to(map_clusters)
       
map_clusters

In [32]:
dal_merged.head(5)

Unnamed: 0,zip,lat,lng,city,state_id,state_name,county_name,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25843,75001,32.96,-96.8385,Addison,TX,Texas,Dallas,1,Arts & Crafts Store,Music Store,Bakery,Diner,Jewelry Store,Rental Car Location,Art Studio,Business Service,Hardware Store,Furniture / Home Store
25845,75006,32.9619,-96.897,Carrollton,TX,Texas,Dallas,1,Water Park,Construction & Landscaping,History Museum,Park,Women's Store,Donut Shop,Food,Flower Shop,Flea Market,Field
25850,75019,32.9633,-96.9855,Coppell,TX,Texas,Dallas,0,Construction & Landscaping,Park,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25861,75038,32.8746,-96.9976,Irving,TX,Texas,Dallas,1,Fried Chicken Joint,Golf Course,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant,Farmers Market
25863,75040,32.9277,-96.6201,Garland,TX,Texas,Dallas,3,Costume Shop,Business Service,Electronics Store,Football Stadium,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant


The majority of the points fall in Cluster 1, so we need further analysis to determine whether this cluster is acceptable or needs to be broken down further. We can still review each cluster and build the functions we'll use later. We can also examine each cluster in more detail to see how similar its points are and how different the clusters are from one another.

**Examining each Cluster**

Below we examine each of the 5 clusters to see which venue categories distinguish one cluster from the next, along with the most common venues for each zip code in the cluster.
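
Reading the per-zip tables by eye is one way to spot distinguishing categories. Another possible approach, sketched here on made-up numbers, is to average the category frequencies inside each cluster and compare them with the overall average. The frame `freq` and the series `labels` are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy frequency table plus a cluster label per zip code (all values illustrative).
freq = pd.DataFrame(
    {
        "Park": [0.50, 0.40, 0.00, 0.10],
        "Bar": [0.00, 0.10, 0.60, 0.50],
        "Bakery": [0.50, 0.50, 0.40, 0.40],
    },
    index=[75019, 75212, 75048, 75060],
)
labels = pd.Series([0, 0, 1, 1], index=freq.index, name="cluster")

# Mean frequency of each category inside each cluster (the cluster profile).
profile = freq.groupby(labels).mean()

# Subtract the overall mean so categories common to every cluster cancel out,
# then keep the category that stands out most in each cluster.
distinctive = (profile - freq.mean()).idxmax(axis=1)
print(profile)
print(distinctive)
```

In this toy example "Bakery" is frequent everywhere, so it does not distinguish either cluster even though it tops the raw averages for cluster 0.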

**Cluster 0**

In [33]:
dal_merged.loc[dal_merged['Cluster Labels'] == 0, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25850,75019,Dallas,0,Park,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25965,75212,Dallas,0,Women's Store,Dry Cleaner,Football Stadium,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25986,75236,Dallas,0,Home Service,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25994,75247,Dallas,0,Warehouse Store,Women's Store,Donut Shop,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant


**Cluster 1**

In [35]:
dal_merged.loc[dal_merged['Cluster Labels'] == 1, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25843,75001,Dallas,1,Music Store,Bakery,Diner,Jewelry Store,Rental Car Location,Art Studio,Business Service,Hardware Store,Furniture / Home Store
25845,75006,Dallas,1,Construction & Landscaping,History Museum,Park,Women's Store,Donut Shop,Food,Flower Shop,Flea Market,Field
25861,75038,Dallas,1,Golf Course,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant,Farmers Market
25864,75041,Dallas,1,Bakery,Food,Mexican Restaurant,Other Repair Shop,Construction & Landscaping,Diner,Farmers Market,Food Truck,Department Store
25865,75042,Dallas,1,Movie Theater,Cosmetics Shop,Construction & Landscaping,Business Service,Music Venue,Electronics Store,Food,Flower Shop,Flea Market
25866,75043,Dallas,1,Pizza Place,Baseball Field,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25867,75044,Dallas,1,Video Store,Café,Trail,Gym / Fitness Center,Gym,Women's Store,Dog Run,Flea Market,Field
25870,75051,Dallas,1,Women's Store,Dry Cleaner,Football Stadium,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25877,75061,Dallas,1,Fast Food Restaurant,Discount Store,Gym / Fitness Center,Salvadoran Restaurant,Park,Gas Station,Mexican Restaurant,Women's Store,Dry Cleaner
25878,75062,Dallas,1,Market,Boutique,Grocery Store,Burger Joint,Convenience Store,Farmers Market,Football Stadium,Food Truck,Department Store


**Cluster 2**

In [36]:
dal_merged.loc[dal_merged['Cluster Labels'] == 2, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25868,75048,Dallas,2,Park,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25876,75060,Dallas,2,Park,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25970,75218,Dallas,2,Park,Women's Store,Donut Shop,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant,Farmers Market
25996,75249,Dallas,2,Park,Women's Store,Donut Shop,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant,Farmers Market


**Cluster 3**

In [37]:
dal_merged.loc[dal_merged['Cluster Labels'] == 3, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25863,75040,Dallas,3,Business Service,Electronics Store,Football Stadium,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25952,75182,Dallas,3,Business Service,Women's Store,Electronics Store,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25988,75238,Dallas,3,Women's Store,Electronics Store,Football Stadium,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant


**Cluster 4**

In [38]:
dal_merged.loc[dal_merged['Cluster Labels'] == 4, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25958,75205,Dallas,4,Women's Store,Donut Shop,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant,Farmers Market


It is possible that 5 clusters are too few for this data. We set the number of clusters to 10 and then to 20 to see whether a larger k splits the zip codes more evenly.
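
A complementary check when trying several values of k is the elbow heuristic, which looks at how quickly the within-cluster sum of squares (`inertia_` in scikit-learn) falls as k grows. The analysis here compares cluster sizes instead, so the following is only a sketch of that alternative on synthetic data; `X` and its four groups are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with four groups, standing in for the venue-frequency matrix.
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]], dtype=float)
X = np.vstack([c + 0.4 * rng.standard_normal((15, 2)) for c in centers])

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances from each point to its centroid
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

The "elbow" is the k after which further increases reduce the inertia only marginally; on data without clear groups the curve may show no obvious bend at all.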

In [39]:
# set number of clusters
kclusters = 10
dal_venues_grouped_clustering = dal_venues_grouped.drop(columns='PostalCode')

# run k-means clustering
kmeans_k10 = KMeans(n_clusters=kclusters, random_state=0).fit(dal_venues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans_k10.labels_

array([1, 8, 8, 7, 3, 1, 1, 1, 1, 4, 5, 4, 1, 1, 1, 1, 1, 1, 6, 0, 1, 1,
       1, 1, 1, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 7, 8, 1, 1, 9, 1, 4, 1,
       1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 0, 1, 3, 1, 1, 1, 1, 8, 4,
       1, 1, 1, 1], dtype=int32)

In [40]:
kmeans_k10.labels_[4]

3

In [41]:
a=kmeans_k10.labels_
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts))

{0: 2, 1: 50, 2: 1, 3: 3, 4: 4, 5: 1, 6: 1, 7: 2, 8: 4, 9: 2}

In [42]:
# set number of clusters equal to 20
kclusters = 20
dal_venues_grouped_clustering = dal_venues_grouped.drop(columns='PostalCode')

# run k-means clustering
kmeans_k20 = KMeans(n_clusters=kclusters, random_state=0).fit(dal_venues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
# kmeans_k20.labels_
unique, counts = np.unique(kmeans_k20.labels_, return_counts=True)
#pd.DataFrame(dict(zip(unique, counts)))

In [43]:
dict(zip(unique, counts))

{0: 1,
 1: 1,
 2: 13,
 3: 3,
 4: 2,
 5: 1,
 6: 2,
 7: 2,
 8: 1,
 9: 1,
 10: 1,
 11: 2,
 12: 1,
 13: 1,
 14: 31,
 15: 1,
 16: 1,
 17: 2,
 18: 1,
 19: 2}

In [44]:
unique, counts = np.unique(kmeans_k20.labels_, return_counts=True)

In [46]:
dal_venues_grouped_clustering = dal_venues_grouped.drop(columns='PostalCode')

for i in range(2, 21):
    # set number of clusters
    kclusters = i

    # run k-means clustering
    kmeans_ki = KMeans(n_clusters=kclusters, random_state=0).fit(dal_venues_grouped_clustering)

    # count the number of points assigned to each cluster
    unique, counts = np.unique(kmeans_ki.labels_, return_counts=True)
    print("# of points per cluster for {} clusters: ".format(i))
    print(dict(zip(unique, counts)))

# of points per cluster for 2 clusters: 
{0: 66, 1: 4}
# of points per cluster for 3 clusters: 
{0: 63, 1: 3, 2: 4}
# of points per cluster for 4 clusters: 
{0: 62, 1: 3, 2: 4, 3: 1}
# of points per cluster for 5 clusters: 
{0: 4, 1: 58, 2: 4, 3: 3, 4: 1}
# of points per cluster for 6 clusters: 
{0: 3, 1: 4, 2: 57, 3: 1, 4: 1, 5: 4}
# of points per cluster for 7 clusters: 
{0: 7, 1: 3, 2: 4, 3: 4, 4: 47, 5: 4, 6: 1}
# of points per cluster for 8 clusters: 
{0: 54, 1: 3, 2: 6, 3: 1, 4: 3, 5: 1, 6: 1, 7: 1}
# of points per cluster for 9 clusters: 
{0: 2, 1: 11, 2: 3, 3: 42, 4: 2, 5: 1, 6: 1, 7: 6, 8: 2}
# of points per cluster for 10 clusters: 
{0: 2, 1: 50, 2: 1, 3: 3, 4: 4, 5: 1, 6: 1, 7: 2, 8: 4, 9: 2}
# of points per cluster for 11 clusters: 
{0: 6, 1: 19, 2: 1, 3: 3, 4: 1, 5: 2, 6: 2, 7: 1, 8: 2, 9: 27, 10: 6}
# of points per cluster for 12 clusters: 
{0: 2, 1: 2, 2: 48, 3: 4, 4: 2, 5: 3, 6: 1, 7: 1, 8: 2, 9: 1, 10: 2, 11: 2}
# of points per cluster for 13 clusters: 
{0: 5, 1: 3, 2:

With the further analysis above, we have determined the number of data points in each cluster for cluster counts ranging from 2 to 20.

Based on the list generated above, k=7 appears to be the best number of clusters to use going forward. With the number of points per cluster being {0: 7, 1: 3, 2: 4, 3: 4, 4: 47, 5: 4, 6: 1}, we can re-run our k-means clustering and display it on the map as we did above.
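
To make the comparison across k more explicit, the printed cluster sizes can be reduced to a couple of summary numbers. The sketch below uses the counts shown in full in the output above (k = 2 to 12). The two measures, the share of points in the largest cluster and the number of single-point clusters, are an assumed heuristic for illustration, not a criterion stated in the analysis, and `summarize` is a hypothetical helper.

```python
# Cluster sizes copied from the printed output (k = 2 to 12 are shown in full).
sizes_by_k = {
    2: [66, 4],
    3: [63, 3, 4],
    4: [62, 3, 4, 1],
    5: [4, 58, 4, 3, 1],
    6: [3, 4, 57, 1, 1, 4],
    7: [7, 3, 4, 4, 47, 4, 1],
    8: [54, 3, 6, 1, 3, 1, 1, 1],
    9: [2, 11, 3, 42, 2, 1, 1, 6, 2],
    10: [2, 50, 1, 3, 4, 1, 1, 2, 4, 2],
    11: [6, 19, 1, 3, 1, 2, 2, 1, 2, 27, 6],
    12: [2, 2, 48, 4, 2, 3, 1, 1, 2, 1, 2, 2],
}

def summarize(sizes):
    """Share of points in the largest cluster and count of one-point clusters."""
    total = sum(sizes)
    return {
        "largest_share": max(sizes) / total,
        "singletons": sum(1 for s in sizes if s == 1),
    }

summary = {k: summarize(s) for k, s in sizes_by_k.items()}
for k, row in summary.items():
    print(f"k={k:2d}  largest share={row['largest_share']:.2f}  singletons={row['singletons']}")
```

On these numbers, k=7 has the smallest dominant cluster among the runs that produce at most one single-point cluster; some larger values of k shrink the dominant cluster further, at the cost of more one-point clusters.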

In [47]:
kclusters=7
# Recreate the map with k=7 clusters
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lati, long, city, poi, cluster in zip(dal_merged['lat'], dal_merged['lng'], dal_merged['city'], dal_merged['zip'], dal_merged['Cluster Labels']):
    label = folium.Popup(str(city) + ', ' + str(poi) + ', Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lati, long],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.99).add_to(map_clusters)
       
map_clusters

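The map is redrawn with a seven-color palette. Note that the cell above only changes `kclusters`: it does not refit k-means or update `dal_merged['Cluster Labels']`, so the markers and the per-cluster tables below still carry the labels from the k=5 run, which is why Clusters 5 and 6 come back empty. We analyze each cluster below as we did before.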
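
For a new value of k to reach the map and the per-cluster tables, the model has to be refit and the label column overwritten. A minimal sketch of that step on toy data follows; `features` and `zip_summary` are random, illustrative stand-ins for `dal_venues_grouped_clustering` and the labeled frame. In the notebook itself the labels are inserted into `postalcode_area_venues_sorted` and then joined into `dal_merged`, so both steps would need to be repeated.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-ins (random values): a feature matrix and a frame that already
# carries labels from an earlier run with fewer clusters.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.random((30, 4)), columns=["a", "b", "c", "d"])
zip_summary = pd.DataFrame({"PostalCode": np.arange(75001, 75031)})
zip_summary["Cluster Labels"] = KMeans(
    n_clusters=5, n_init=10, random_state=0
).fit_predict(features)

# Changing kclusters alone does nothing: refit the model and
# overwrite the label column before mapping or filtering.
kclusters = 7
kmeans_k7 = KMeans(n_clusters=kclusters, n_init=10, random_state=0).fit(features)
zip_summary["Cluster Labels"] = kmeans_k7.labels_

print(zip_summary["Cluster Labels"].value_counts().sort_index())
```

After this step, filtering on `'Cluster Labels' == 5` or `== 6` would be expected to return rows rather than empty tables.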

**Cluster 0**

In [48]:
dal_merged.loc[dal_merged['Cluster Labels'] == 0, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25850,75019,Dallas,0,Park,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25965,75212,Dallas,0,Women's Store,Dry Cleaner,Football Stadium,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25986,75236,Dallas,0,Home Service,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25994,75247,Dallas,0,Warehouse Store,Women's Store,Donut Shop,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant


**Cluster 1**

In [49]:
dal_merged.loc[dal_merged['Cluster Labels'] == 1, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25843,75001,Dallas,1,Music Store,Bakery,Diner,Jewelry Store,Rental Car Location,Art Studio,Business Service,Hardware Store,Furniture / Home Store
25845,75006,Dallas,1,Construction & Landscaping,History Museum,Park,Women's Store,Donut Shop,Food,Flower Shop,Flea Market,Field
25861,75038,Dallas,1,Golf Course,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant,Farmers Market
25864,75041,Dallas,1,Bakery,Food,Mexican Restaurant,Other Repair Shop,Construction & Landscaping,Diner,Farmers Market,Food Truck,Department Store
25865,75042,Dallas,1,Movie Theater,Cosmetics Shop,Construction & Landscaping,Business Service,Music Venue,Electronics Store,Food,Flower Shop,Flea Market
25866,75043,Dallas,1,Pizza Place,Baseball Field,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25867,75044,Dallas,1,Video Store,Café,Trail,Gym / Fitness Center,Gym,Women's Store,Dog Run,Flea Market,Field
25870,75051,Dallas,1,Women's Store,Dry Cleaner,Football Stadium,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25877,75061,Dallas,1,Fast Food Restaurant,Discount Store,Gym / Fitness Center,Salvadoran Restaurant,Park,Gas Station,Mexican Restaurant,Women's Store,Dry Cleaner
25878,75062,Dallas,1,Market,Boutique,Grocery Store,Burger Joint,Convenience Store,Farmers Market,Football Stadium,Food Truck,Department Store


**Cluster 2**

In [50]:
dal_merged.loc[dal_merged['Cluster Labels'] == 2, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25868,75048,Dallas,2,Park,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25876,75060,Dallas,2,Park,Women's Store,Dry Cleaner,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25970,75218,Dallas,2,Park,Women's Store,Donut Shop,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant,Farmers Market
25996,75249,Dallas,2,Park,Women's Store,Donut Shop,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant,Farmers Market


**Cluster 3**

In [52]:
dal_merged.loc[dal_merged['Cluster Labels'] == 3, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25863,75040,Dallas,3,Business Service,Electronics Store,Football Stadium,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25952,75182,Dallas,3,Business Service,Women's Store,Electronics Store,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant
25988,75238,Dallas,3,Women's Store,Electronics Store,Football Stadium,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant


**Cluster 4**

In [53]:
dal_merged.loc[dal_merged['Cluster Labels'] == 4, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25958,75205,Dallas,4,Women's Store,Donut Shop,Food Truck,Food,Flower Shop,Flea Market,Field,Fast Food Restaurant,Farmers Market


**Cluster 5**

In [76]:
dal_merged.loc[dal_merged['Cluster Labels'] == 5, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


**Cluster 6**

In [54]:
dal_merged.loc[dal_merged['Cluster Labels'] == 6, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


**We choose Cluster 1 as our cluster for finding the new restaurant**

This reaffirms our choice of Cluster 1, which will determine the exact zip code where the new Persian restaurant will go. After running k-means for many values of k (from k=2 to k=20), we see that the most discriminating result is at k=7, where the number of points per cluster is {0: 7, 1: 3, 2: 4, 3: 4, 4: 47, 5: 4, 6: 1}. We'll explore the results in the Results section of this report. At this stage we realized that the number of clusters is crucial to how well k-means can separate the zip codes. We also saw that we had to rework the features extracted with the Foursquare API to make them more relevant to our problem. Both the exploration of many values of k and these feature modifications led to the results we present in the next section of our report.