# Exploring Newest Trends in Most Populous Cities of the United States

### Introduction

Company X wants to know which venues are trending in the major US cities. This knowledge will help them solve a larger problem: which companies Company X should invest in. Company X is particularly interested in whether the percent increase or decrease in population of the most populous cities correlates with the types of venues that are newly available (venues that opened within the past decade).

##### Data Collection

To start with, we scrape Wikipedia's page for the list of most populous US cities with their corresponding percent-increase/percent-decrease:

https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population

Then, using this initial data set with the corresponding GPS locations (included in the data table), we will obtain venues using the Foursquare API and investigate the trending venues.

I perform the scrape below.

##### Importing Libraries:

In [1]:
# import necessary libraries

# import numpy
import numpy as np

# import pandas
import pandas as pd

# import web scraping tools
from urllib.request import urlopen
from bs4 import BeautifulSoup

# library to handle JSON files
import json

# import geocoder
import geocoder

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

# import folium for maps
import folium

# library to handle requests
import requests

# transform JSON file into a pandas dataframe
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

In [2]:
def get_html_contents(page_url):
    results = requests.get(page_url) # access url location
    soup = BeautifulSoup(results.text,'html.parser') # parse through html of url and store page info
    return soup
    
cities_url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population' # define desired url

cities_contents = get_html_contents(cities_url)

In [3]:
city_pop_table = cities_contents.find('table',{'class': 'wikitable sortable'}) # find the first table of class 'wikitable sortable'
city_pop_header = city_pop_table.tr.find_all('th') # create html table headers list

column_names = [] # create empty list

for header in city_pop_header:
    column_names.append(header.text.strip().replace(',','_')) # collects column names as list
print(column_names)

['2018rank', 'City', 'State[c]', '2018estimate', '2010Census', 'Change', '2016 land area', '2016 population density', 'Location']


In [4]:
file_name = 'us_cities_data.csv' # give name to file

f = open(file_name,'w+', encoding='utf-8') # utf-8 encoding is used in html;
# will need to clean up

for i in range(len(column_names)): # write to file all column names as header to file
    if column_names[i] != column_names[-1]:
        f.write(column_names[i] + ',')
    else:
        f.write(column_names[i] + '\n')

table_rows = city_pop_table.find_all('tr') # list of all tr

for row in table_rows: # grab data from each cell and write to file
    cells = row.find_all('td')
    if len(cells) > 1:
        for i, cell in enumerate(cells):
            cell_data = cell.text.strip().replace(',','')
            if cells[i] != cells[-1]:
                if i == 7 or i == 9: # ignore metric units
                # assuming this is an American company so English units are desired
                    continue
                f.write(cell_data + ',')
            else:
                f.write(cell_data + '\n')
                    
f.close() # be sure to close the file!!!
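
As an alternative to writing commas and newlines by hand, the same header-plus-rows output can be produced with the standard library's `csv` module, which quotes cells that contain commas so they do not need to be stripped first. This is only a minimal sketch: `rows_to_csv`, `sample_header`, and `sample_rows` are hypothetical names, and the sample rows are illustrative values rather than the scraped table.

```python
import csv
import io

def rows_to_csv(header, rows):
    """Serialize a header and data rows to CSV text, quoting cells as needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample rows; the real ones would come from the scraped table cells.
sample_header = ["City", "State", "2018estimate"]
sample_rows = [
    ["New York", "New York", "8,398,748"],
    ["Los Angeles", "California", "3,990,456"],
]

csv_text = rows_to_csv(sample_header, sample_rows)
print(csv_text)
```

Because the writer quotes `8,398,748`, reading the text back with `csv.reader` (or `pd.read_csv`) yields three columns per row rather than five.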

In [5]:
us_cities_df = pd.read_csv('us_cities_data.csv') # read csv file to obtain dataframe
us_cities_df.head() # take a peek at dataframe

Unnamed: 0,2018rank,City,State[c],2018estimate,2010Census,Change,2016 land area,2016 population density,Location
0,1,New York[d],New York,8398748,8175133,+2.74%,301.5 sq mi,28317/sq mi,40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W﻿...
1,2,Los Angeles,California,3990456,3792621,+5.22%,468.7 sq mi,8484/sq mi,34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°...
2,3,Chicago,Illinois,2705994,2695598,+0.39%,227.3 sq mi,11900/sq mi,41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W﻿...
3,4,Houston[3],Texas,2325502,2100263,+10.72%,637.5 sq mi,3613/sq mi,29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W﻿...
4,5,Phoenix,Arizona,1660272,1445632,+14.85%,517.6 sq mi,3120/sq mi,33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°...


##### Data Cleaning:

In [6]:
cities_cleaned = [] # initialize an empty list

for i, text in enumerate(us_cities_df.index): # cleans city names and appends to list
    cities_cleaned.append(us_cities_df.loc[i,'City'].split('[')[0])
    
print(cities_cleaned) # take a peek

['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose', 'Austin', 'Jacksonville', 'Fort Worth', 'Columbus', 'San Francisco', 'Charlotte', 'Indianapolis', 'Seattle', 'Denver', 'Washington', 'Boston', 'El Paso', 'Detroit', 'Nashville', 'Portland', 'Memphis', 'Oklahoma City', 'Las Vegas', 'Louisville', 'Baltimore', 'Milwaukee', 'Albuquerque', 'Tucson', 'Fresno', 'Mesa', 'Sacramento', 'Atlanta', 'Kansas City', 'Colorado Springs', 'Miami', 'Raleigh', 'Omaha', 'Long Beach', 'Virginia Beach', 'Oakland', 'Minneapolis', 'Tulsa', 'Arlington', 'Tampa', 'New Orleans', 'Wichita', 'Cleveland', 'Bakersfield', 'Aurora', 'Anaheim', 'Honolulu', 'Santa Ana', 'Riverside', 'Corpus Christi', 'Lexington', 'Stockton', 'Henderson', 'Saint Paul', 'St. Louis', 'Cincinnati', 'Pittsburgh', 'Greensboro', 'Anchorage', 'Plano', 'Lincoln', 'Orlando', 'Irvine', 'Newark', 'Toledo', 'Durham', 'Chula Vista', 'Fort Wayne', 'Jersey City', 'St. Petersburg

In [7]:
us_cities_df['City'] = pd.Series(cities_cleaned) # replace 'City' column with newly cleaned names
us_cities_df.head()

Unnamed: 0,2018rank,City,State[c],2018estimate,2010Census,Change,2016 land area,2016 population density,Location
0,1,New York,New York,8398748,8175133,+2.74%,301.5 sq mi,28317/sq mi,40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W﻿...
1,2,Los Angeles,California,3990456,3792621,+5.22%,468.7 sq mi,8484/sq mi,34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°...
2,3,Chicago,Illinois,2705994,2695598,+0.39%,227.3 sq mi,11900/sq mi,41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W﻿...
3,4,Houston,Texas,2325502,2100263,+10.72%,637.5 sq mi,3613/sq mi,29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W﻿...
4,5,Phoenix,Arizona,1660272,1445632,+14.85%,517.6 sq mi,3120/sq mi,33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°...


In [8]:
location_cleaned = [] # initialize empty list

for i, content in enumerate(us_cities_df.index): # cleaning location lat/long coords
    location_cleaned.append(us_cities_df.loc[i,'Location'].replace('\ufeff','').split('/')[2].split('(')[0].replace(' ',''))
    
lat_coords = [] # initialize
long_coords = [] # initialize

for i, text in enumerate(location_cleaned): # separate lat and long coords into separate lists
    lat, long = location_cleaned[i].split(';')
    lat_coords.append(lat)
    long_coords.append(long)

us_cities_df['Lat'] = pd.Series(lat_coords).astype(float) # add column for latitude
us_cities_df['Long'] = pd.Series(long_coords).astype(float) # add column for longitude
us_cities_df.head() # take a peek

Unnamed: 0,2018rank,City,State[c],2018estimate,2010Census,Change,2016 land area,2016 population density,Location,Lat,Long
0,1,New York,New York,8398748,8175133,+2.74%,301.5 sq mi,28317/sq mi,40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W﻿...,40.6635,-73.9387
1,2,Los Angeles,California,3990456,3792621,+5.22%,468.7 sq mi,8484/sq mi,34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°...,34.0194,-118.4108
2,3,Chicago,Illinois,2705994,2695598,+0.39%,227.3 sq mi,11900/sq mi,41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W﻿...,41.8376,-87.6818
3,4,Houston,Texas,2325502,2100263,+10.72%,637.5 sq mi,3613/sq mi,29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W﻿...,29.7866,-95.3909
4,5,Phoenix,Arizona,1660272,1445632,+14.85%,517.6 sq mi,3120/sq mi,33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°...,33.5722,-112.0901
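
The chained `replace`/`split` calls above depend on the exact number of `/` separators in the `Location` cell. A regular expression that looks for the trailing `lat; long` pair is a somewhat more forgiving way to do the same extraction. The sketch below is an illustration only: `parse_lat_long` is a hypothetical helper, and the sample string is an assumed reconstruction of a full cell (the displayed cells are truncated), based on the `;` separator the cleaning code splits on.

```python
import re

# Matches a decimal "lat; long" pair such as "40.6635; -73.9387".
DECIMAL_PAIR = re.compile(r"(-?\d+(?:\.\d+)?)\s*;\s*(-?\d+(?:\.\d+)?)")

def parse_lat_long(location_text):
    """Return (lat, long) as floats, or None if no decimal pair is present."""
    cleaned = location_text.replace("\ufeff", "")  # drop byte-order-mark debris
    match = DECIMAL_PAIR.search(cleaned)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Assumed shape of a full 'Location' cell (hypothetical sample).
sample = ("40°39′49″N 73°56′19″W\ufeff / \ufeff40.6635°N 73.9387°W\ufeff"
          " / \ufeff40.6635; -73.9387")

print(parse_lat_long(sample))
```

Returning `None` instead of raising makes it easy to spot rows whose location could not be parsed before converting the column to floats.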


In [9]:
us_cities_df.drop('Location',axis=1,inplace=True) # drop messy location column
us_cities_df.head()

Unnamed: 0,2018rank,City,State[c],2018estimate,2010Census,Change,2016 land area,2016 population density,Lat,Long
0,1,New York,New York,8398748,8175133,+2.74%,301.5 sq mi,28317/sq mi,40.6635,-73.9387
1,2,Los Angeles,California,3990456,3792621,+5.22%,468.7 sq mi,8484/sq mi,34.0194,-118.4108
2,3,Chicago,Illinois,2705994,2695598,+0.39%,227.3 sq mi,11900/sq mi,41.8376,-87.6818
3,4,Houston,Texas,2325502,2100263,+10.72%,637.5 sq mi,3613/sq mi,29.7866,-95.3909
4,5,Phoenix,Arizona,1660272,1445632,+14.85%,517.6 sq mi,3120/sq mi,33.5722,-112.0901


In [10]:
us_cities_df.rename(columns={'State[c]': 'State', # '2018rank' is left as-is; it is dropped in the next cell
                             '2018estimate': '2018 Estimate',
                             '2010Census': '2010 Census',
                             '2016 land area': 'Land Area (sq mi)',
                             '2016 population density': 'Population Density (per sq mi)'},
                    inplace=True) # renaming columns for cleaner look and easier use
us_cities_df.head()

Unnamed: 0,2018rank,City,State,2018 Estimate,2010 Census,Change,Land Area (sq mi),Population Density (per sq mi),Lat,Long
0,1,New York,New York,8398748,8175133,+2.74%,301.5 sq mi,28317/sq mi,40.6635,-73.9387
1,2,Los Angeles,California,3990456,3792621,+5.22%,468.7 sq mi,8484/sq mi,34.0194,-118.4108
2,3,Chicago,Illinois,2705994,2695598,+0.39%,227.3 sq mi,11900/sq mi,41.8376,-87.6818
3,4,Houston,Texas,2325502,2100263,+10.72%,637.5 sq mi,3613/sq mi,29.7866,-95.3909
4,5,Phoenix,Arizona,1660272,1445632,+14.85%,517.6 sq mi,3120/sq mi,33.5722,-112.0901


In [11]:
us_cities_df.drop('2018rank',axis=1,inplace=True) # rank unnecessary with pandas indexing
us_cities_df.head()

Unnamed: 0,City,State,2018 Estimate,2010 Census,Change,Land Area (sq mi),Population Density (per sq mi),Lat,Long
0,New York,New York,8398748,8175133,+2.74%,301.5 sq mi,28317/sq mi,40.6635,-73.9387
1,Los Angeles,California,3990456,3792621,+5.22%,468.7 sq mi,8484/sq mi,34.0194,-118.4108
2,Chicago,Illinois,2705994,2695598,+0.39%,227.3 sq mi,11900/sq mi,41.8376,-87.6818
3,Houston,Texas,2325502,2100263,+10.72%,637.5 sq mi,3613/sq mi,29.7866,-95.3909
4,Phoenix,Arizona,1660272,1445632,+14.85%,517.6 sq mi,3120/sq mi,33.5722,-112.0901


In [12]:
land_area = [] # initialize
pop_density = [] # initialize

for i, text in enumerate(us_cities_df.index): # clean land area and population density columns
    land_area.append(us_cities_df.loc[i,'Land Area (sq mi)'].replace('\xa0sq\xa0mi',''))
    pop_density.append(us_cities_df.loc[i,'Population Density (per sq mi)'].split('/')[0])

us_cities_df['Land Area (sq mi)'] = pd.Series(land_area).astype(float)
us_cities_df['Population Density (per sq mi)'] = pd.Series(pop_density).astype(float)
us_cities_df.head() # take a peek

Unnamed: 0,City,State,2018 Estimate,2010 Census,Change,Land Area (sq mi),Population Density (per sq mi),Lat,Long
0,New York,New York,8398748,8175133,+2.74%,301.5,28317.0,40.6635,-73.9387
1,Los Angeles,California,3990456,3792621,+5.22%,468.7,8484.0,34.0194,-118.4108
2,Chicago,Illinois,2705994,2695598,+0.39%,227.3,11900.0,41.8376,-87.6818
3,Houston,Texas,2325502,2100263,+10.72%,637.5,3613.0,29.7866,-95.3909
4,Phoenix,Arizona,1660272,1445632,+14.85%,517.6,3120.0,33.5722,-112.0901


In [13]:
per_change_float = []

for i, content in enumerate(us_cities_df.index):
    per_change_float.append(us_cities_df.loc[i,'Change'].replace('+','').replace('%','').replace('−','-'))

us_cities_df['Change'] = pd.Series(per_change_float)

for i, content in enumerate(us_cities_df['Change']):
    try:
        us_cities_df.loc[i,'Change'] = float(us_cities_df.loc[i,'Change']) # attempt to convert string to float
    except ValueError:
        us_cities_df.drop(index=i,inplace=True) # drop rows for new cities and/or unknown % change
    
us_cities_df.rename(columns={'Change': '% Change'},inplace=True)
us_cities_df['% Change'] = us_cities_df['% Change'].astype(float)
us_cities_df.head()

Unnamed: 0,City,State,2018 Estimate,2010 Census,% Change,Land Area (sq mi),Population Density (per sq mi),Lat,Long
0,New York,New York,8398748,8175133,2.74,301.5,28317.0,40.6635,-73.9387
1,Los Angeles,California,3990456,3792621,5.22,468.7,8484.0,34.0194,-118.4108
2,Chicago,Illinois,2705994,2695598,0.39,227.3,11900.0,41.8376,-87.6818
3,Houston,Texas,2325502,2100263,10.72,637.5,3613.0,29.7866,-95.3909
4,Phoenix,Arizona,1660272,1445632,14.85,517.6,3120.0,33.5722,-112.0901
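
The percent-change cleanup can also be expressed as one small function that normalizes the sign characters and reports unparseable values explicitly. This is a sketch rather than the notebook's method: `parse_percent` is a hypothetical helper, and the inputs shown are illustrative strings in the same style as the `Change` column.

```python
def parse_percent(text):
    """Convert strings like '+2.74%' or '−1.50%' to a float, or None if unparseable."""
    normalized = (
        text.strip()
        .replace("\u2212", "-")  # Unicode minus sign used on the Wikipedia page
        .replace("+", "")
        .replace("%", "")
    )
    try:
        return float(normalized)
    except ValueError:
        return None

print(parse_percent("+2.74%"))
print(parse_percent("\u22121.50%"))
print(parse_percent("N/A"))
```

Rows for which the function returns `None` correspond to the rows the loop above drops.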


In [14]:
us_cities_df.info() # look at data types within dataframe

<class 'pandas.core.frame.DataFrame'>
Int64Index: 313 entries, 0 to 313
Data columns (total 9 columns):
City                              313 non-null object
State                             313 non-null object
2018 Estimate                     313 non-null int64
2010 Census                       313 non-null int64
% Change                          313 non-null float64
Land Area (sq mi)                 313 non-null float64
Population Density (per sq mi)    313 non-null float64
Lat                               313 non-null float64
Long                              313 non-null float64
dtypes: float64(5), int64(2), object(2)
memory usage: 24.5+ KB


In [15]:
high_influx_df = us_cities_df[us_cities_df['% Change'] >= 20.0].reset_index(drop=True) # cities whose
# population grew by 20% or more (the "high influx" dataframe)
high_influx_df.head(high_influx_df.shape[0]) # take a peek at every row

Unnamed: 0,City,State,2018 Estimate,2010 Census,% Change,Land Area (sq mi),Population Density (per sq mi),Lat,Long
0,Austin,Texas,964254,790390,22.0,312.7,3031.0,30.3039,-97.7544
1,Fort Worth,Texas,895008,741206,20.75,342.9,2491.0,32.7815,-97.3467
2,Seattle,Washington,744955,608660,22.39,83.8,8405.0,47.6205,-122.3509
3,Henderson,Nevada,310390,257729,20.43,104.7,2798.0,36.0097,-115.0357
4,Irvine,California,282572,212375,33.05,65.6,4057.0,33.6784,-117.7713
5,Durham,North Carolina,274291,228330,20.13,109.8,2395.0,35.9811,-78.9029
6,McKinney,Texas,191645,131117,46.16,63.0,2735.0,33.1985,-96.668
7,Cape Coral,Florida,189343,154305,22.71,105.6,1703.0,26.6432,-81.9974
8,Frisco,Texas,188170,116989,60.84,67.7,2417.0,33.1554,-96.8226
9,Cary,North Carolina,168160,135234,24.35,56.5,2873.0,35.7809,-78.8133


##### Visualization:

In [16]:
# grabbing the lat/long coords for United States

address = 'United States'

geolocator = Nominatim(user_agent="us_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of the United States are {}, {}.'.format(latitude, longitude))

The geographical coordinates of the United States are 39.7837304, -100.4458825.


In [17]:
# create map of United States using latitude and longitude values
map_us_cities = folium.Map(location=[latitude, longitude], zoom_start=4)

# add markers to map; color coded by % change (20%+)
for lat, lng, city, state, inflx in zip(high_influx_df['Lat'],
                                        high_influx_df['Long'],
                                        high_influx_df['City'],
                                        high_influx_df['State'],
                                        high_influx_df['% Change']):
    
    label = '{}, {}, {}% increase'.format(city, state, inflx)
    
    label = folium.Popup(label, parse_html=True)
    
    # separate % ranges by different colors here
    if inflx < 30: 
        marker_color='blue'
    elif inflx < 40:
        marker_color='orange'
    elif inflx < 50:
        marker_color='green'
    else:
        marker_color='red'
        
    # adding markers
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=marker_color,
        fill=True,
        fill_color=marker_color,
        fill_opacity=0.7,
        parse_html=False).add_to(map_us_cities)  
    
map_us_cities
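
The `if`/`elif` chain that picks a marker color is a lookup into ordered buckets, so it can be written with `bisect`. The sketch below uses the same thresholds and colors as the cell above; `marker_color_for`, `THRESHOLDS`, and `COLORS` are hypothetical names introduced for illustration.

```python
from bisect import bisect_right

THRESHOLDS = [30, 40, 50]                    # upper bounds of the first three buckets
COLORS = ["blue", "orange", "green", "red"]  # one more color than thresholds

def marker_color_for(pct_change):
    """Map a percent change to the marker color used on the map."""
    return COLORS[bisect_right(THRESHOLDS, pct_change)]

for pct in (22.0, 33.05, 46.16, 60.84):
    print(pct, marker_color_for(pct))
```

Keeping thresholds and colors in two lists makes it straightforward to add or move a bucket without touching the plotting loop.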

##### We now have a good idea of where population is growing fastest among the most populous cities. In particular, the data suggests a large influx in the state of Texas, so it may be worth investigating whether venues in Texas cities differ from those in other states' cities. We will proceed by obtaining further data for each of these cities, using the geospatial coordinates of the zip codes within each city to represent our "neighborhoods." The webpage www.zip-codes.com is useful here.

##### More data scraping and wrangling:

In [18]:
texas_df = high_influx_df[high_influx_df['State'] == 'Texas'].reset_index(drop=True)
texas_df.head()

Unnamed: 0,City,State,2018 Estimate,2010 Census,% Change,Land Area (sq mi),Population Density (per sq mi),Lat,Long
0,Austin,Texas,964254,790390,22.0,312.7,3031.0,30.3039,-97.7544
1,Fort Worth,Texas,895008,741206,20.75,342.9,2491.0,32.7815,-97.3467
2,McKinney,Texas,191645,131117,46.16,63.0,2735.0,33.1985,-96.668
3,Frisco,Texas,188170,116989,60.84,67.7,2417.0,33.1554,-96.8226
4,Midland,Texas,142344,111147,28.07,74.4,1809.0,32.0246,-102.1135


In [19]:
texas_df = texas_df[['City','State','% Change']]
texas_df.head()

Unnamed: 0,City,State,% Change
0,Austin,Texas,22.0
1,Fort Worth,Texas,20.75
2,McKinney,Texas,46.16
3,Frisco,Texas,60.84
4,Midland,Texas,28.07


In [20]:
zip_url = 'https://www.zip-codes.com/state/tx.asp' # url lists ALL zip codes in the state of Texas

zip_contents = get_html_contents(zip_url) # store html contents as soup type

In [21]:
zip_table = zip_contents.find('table', {'id': 'tblZIP'}) # find necessary html table to scrape data from
zip_headers = zip_table.tr.find_all('td') # obtain the cells of the table's first (header) row

In [22]:
zip_rows = zip_table.find_all('tr') # get list of relevant table rows
zip_codes = [] # initialize
city_names = [] # initialize

for data in zip_rows[1:]: # append cell data from table to appropriate lists
    cells = data.find_all('td')
    zip_codes.append(cells[0].text.strip())
    city_names.append(cells[1].text.strip())
    
zip_df = pd.DataFrame({'Zip_Code': zip_codes,'City': city_names}) # create dataframe with lists

zip_df.head() # take a peek

Unnamed: 0,Zip_Code,City
0,ZIP Code 73301,Austin
1,ZIP Code 73344,Austin
2,ZIP Code 73960,Texhoma
3,ZIP Code 75001,Addison
4,ZIP Code 75002,Allen


In [23]:
zip_df.shape # verify shape of data: Texas has roughly 2600 zip codes according to google so looks good

(2598, 2)

##### Exploring the city of Austin, TX (the geocoder results were unreliable here, so treat this data with caution):

In [24]:
austin_df = zip_df[ zip_df['City'] == 'Austin'].reset_index(drop=True) # restrict zip codes to Austin only
austin_df.rename(columns={'Zip_Code': 'ZIP'},inplace=True) # renaming for convenience
austin_df.drop('City',axis=1,inplace=True) # get rid of city name since we know which city we are exploring
austin_df.head() # take a peek

Unnamed: 0,ZIP
0,ZIP Code 73301
1,ZIP Code 73344
2,ZIP Code 78701
3,ZIP Code 78702
4,ZIP Code 78703


In [25]:
austin_df.shape

(73, 1)

In [26]:
austin_zip_cleaned = [] # initialize

for i, ZIP in enumerate(austin_df.index): # formatting ZIP column to be numerical only
    austin_zip_cleaned.append(austin_df.loc[i,'ZIP'].split(' ')[2])
    
austin_df['ZIP'] = pd.Series(austin_zip_cleaned)
austin_df.head() # take a peek

Unnamed: 0,ZIP
0,73301
1,73344
2,78701
3,78702
4,78703
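
Splitting on spaces and taking the third token works for labels of the form `ZIP Code 73301`, but it raises an `IndexError` on anything shorter. A regular expression that pulls out the five-digit code is a slightly more defensive option. This is a sketch only, and `extract_zip` is a hypothetical helper name.

```python
import re

def extract_zip(label):
    """Return the first standalone five-digit code in the label, or None."""
    match = re.search(r"\b(\d{5})\b", label)
    return match.group(1) if match else None

print(extract_zip("ZIP Code 73301"))
print(extract_zip("ZIP Code"))
```

The code is kept as a string, as in the notebook, so that leading zeros in other states' zip codes would not be lost.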


In [27]:
# grabbing the lat/long coords for zip codes within Austin, TX

austin_lat = [] # initialize
austin_long = [] # initialize

for i, ZIP in enumerate(austin_df.index): # look up the lat/long of each Austin zip code
    # Appending ', USA' because the bare zip code returned locations in Mexico,
    # Toronto, and Europe. Treat this data with caution; a better method may be needed.
    address = austin_df.loc[i,'ZIP'] + ', USA'
    geolocator = Nominatim(user_agent='austin_explorer')
    location = geolocator.geocode(address)
    try: # geocode returns None for zip codes it cannot find
        austin_lat.append(location.latitude)
        austin_long.append(location.longitude)
    except AttributeError:
        # NOTE: skipping a failed lookup shortens the lists, so later coordinates
        # shift up and may not line up with their zip codes
        continue
    
austin_df['Lat'] = pd.Series(austin_lat) 
austin_df['Long'] = pd.Series(austin_long)
austin_df.head() # take a peek

Unnamed: 0,ZIP,Lat,Long
0,73301,-4.52712,120.207789
1,73344,30.280466,-97.750088
2,78701,30.27846,-97.7188
3,78702,30.272052,-97.762387
4,78703,30.228031,-97.7734
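
One way to keep coordinates lined up with their zip codes is to record a placeholder for every failed lookup instead of skipping it, so the result always has one entry per zip code. The sketch below illustrates the idea with a stand-in lookup function so that it runs without network access: `geocode_all`, `fake_lookup`, and `FAKE_RESULTS` are hypothetical, and the coordinates are made-up values, not geocoder output.

```python
def geocode_all(zip_codes, lookup):
    """Return one (lat, long) pair per zip code; failed lookups become (None, None)."""
    coords = []
    for zip_code in zip_codes:
        result = lookup(zip_code)
        coords.append(result if result is not None else (None, None))
    return coords

# Hypothetical stand-in for a geocoder: returns None when nothing is found.
FAKE_RESULTS = {"78701": (30.27, -97.74), "78703": (30.29, -97.77)}

def fake_lookup(zip_code):
    return FAKE_RESULTS.get(zip_code)

coords = geocode_all(["78701", "78702", "78703"], fake_lookup)
print(coords)
```

With this shape, the missing entry stays attached to `78702`, and a later `dropna` would remove exactly the zip codes that failed rather than the last rows of the dataframe.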


In [28]:
austin_df.shape # Austin has about 73 zip codes but some locations are NaN...

(73, 3)

In [29]:
austin_df.dropna(axis=0,inplace=True) # only use zip codes with location found

In [30]:
austin_df.shape # restricted to 56 zip codes... again, data is not 100%

(56, 3)

In [31]:
# grabbing the lat/long coords for city of Austin, TX

address = 'Austin, TX'

geolocator = Nominatim(user_agent="tx_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Austin, TX are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Austin, TX are 30.2711286, -97.7436995.


In [32]:
# create map of Austin using latitude and longitude values
map_tx = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for austin_lat, austin_lng, ZIP in zip(austin_df['Lat'],
                         austin_df['Long'],
                         austin_df['ZIP']):
    label = 'Zip Code: {}, Lat: {}, Long: {}'.format(ZIP, austin_lat, austin_lng)
    label = folium.Popup(label, parse_html=True)
    marker_color='blue'
    
    folium.CircleMarker(
        [austin_lat, austin_lng],
        radius=5,
        popup=label,
        color=marker_color,
        fill=True,
        fill_color=marker_color,
        fill_opacity=0.7,
        parse_html=False).add_to(map_tx)  
    
map_tx

##### A good looking map of the 56 zip codes within Austin, TX. Now to explore venues.

##### Foursquare Credentials:

In [None]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

In [34]:
LIMIT = 100 # limit of 100 venues per request

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Zip', 
                  'Zip Latitude', 
                  'Zip Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
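
The function above does two separable things: it builds the request URL and it flattens the JSON response into rows. The sketch below separates them and guards against venues with an empty `categories` list, which would otherwise raise an `IndexError`. It is an illustration under stated assumptions: `build_explore_url` and `venue_rows` are hypothetical helpers, the endpoint and parameter names are copied from the cell above, and `fake_items` is a hand-written stand-in for the `items` list, not real API output.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

def venue_rows(zip_code, zip_lat, zip_lng, items):
    """Flatten response items into tuples matching the dataframe columns."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((zip_code, zip_lat, zip_lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

# Hand-written stand-in for response['groups'][0]['items'] (hypothetical data).
fake_items = [
    {"venue": {"name": "Sample Park", "location": {"lat": 30.28, "lng": -97.75},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorized Spot",
               "location": {"lat": 30.29, "lng": -97.76}, "categories": []}},
]

url = build_explore_url("ID", "SECRET", "20180605", 30.27, -97.74)
rows = venue_rows("78701", 30.27, -97.74, fake_items)
print(url)
print(rows)
```

Keeping the parsing step free of network calls also makes it possible to test it against a saved response.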

In [35]:
# storing nearby venues and listing nearby zip codes
austin_venues = getNearbyVenues(names=austin_df['ZIP'],
                                   latitudes=austin_df['Lat'],
                                   longitudes=austin_df['Long'])

73301
73344
78701
78702
78703
78704
78705
78708
78709
78710
78711
78712
78713
78714
78715
78716
78717
78718
78719
78720
78721
78722
78723
78724
78725
78726
78727
78728
78729
78730
78731
78732
78733
78734
78735
78736
78737
78738
78739
78741
78742
78744
78745
78746
78747
78748
78749
78750
78751
78752
78753
78754
78755
78756
78757
78758


In [36]:
print(austin_venues.shape) # 568 venues

(568, 7)


In [37]:
austin_venues.head() # take a peek

Unnamed: 0,Zip,Zip Latitude,Zip Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,73344,30.280466,-97.750088,Austin Land & Cattle,30.277549,-97.750821,Steakhouse
1,73344,30.280466,-97.750088,Austin Recreation Center,30.278168,-97.749114,Recreation Center
2,73344,30.280466,-97.750088,Pease District Park,30.283418,-97.753301,Park
3,73344,30.280466,-97.750088,Crystal Works,30.27719,-97.750624,Gift Shop
4,73344,30.280466,-97.750088,Castle Hill Fitness,30.276626,-97.752003,Gym


In [38]:
austin_venues.groupby('Zip').count() # take a peek at the number of venues per zip code

Unnamed: 0_level_0,Zip Latitude,Zip Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Zip,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
73344,17,17,17,17,17,17
78701,21,21,21,21,21,21
78702,38,38,38,38,38,38
78703,10,10,10,10,10,10
78704,25,25,25,25,25,25
78705,22,22,22,22,22,22
78708,4,4,4,4,4,4
78709,2,2,2,2,2,2
78710,2,2,2,2,2,2
78711,2,2,2,2,2,2


In [39]:
print('There are {} unique categories.'.format(len(austin_venues['Venue Category'].unique())))

There are 193 unique categories.


##### Begin clustering for Austin (for exploration purposes):

In [40]:
# one hot encoding
austin_onehot = pd.get_dummies(austin_venues[['Venue Category']], prefix="", prefix_sep="")

# add zip code column to dataframe
austin_onehot['Zip'] = austin_venues['Zip'] 

# move zip code column to the first column
fixed_columns = [austin_onehot.columns[-1]] + list(austin_onehot.columns[:-1])
austin_onehot = austin_onehot[fixed_columns]

austin_onehot.head()

Unnamed: 0,Zip,ATM,Adult Boutique,Advertising Agency,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,...,Track,Trail,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wings Joint,Yoga Studio
0,73344,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,73344,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,73344,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,73344,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,73344,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [41]:
austin_onehot.shape # verify shape of dataframe

(568, 194)

In [42]:
austin_grouped = austin_onehot.groupby('Zip').mean().reset_index()
austin_grouped # order zip codes

Unnamed: 0,Zip,ATM,Adult Boutique,Advertising Agency,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,...,Track,Trail,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wings Joint,Yoga Studio
0,73344,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,78701,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,78702,0.0,0.026316,0.0,0.052632,0.026316,0.0,0.0,0.0,0.0,...,0.026316,0.078947,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.105263
3,78703,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0
4,78704,0.04,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,78705,0.0,0.0,0.0,0.045455,0.0,0.090909,0.0,0.0,0.045455,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,78708,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,78709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,78710,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,78711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
austin_grouped.shape # 42 zip codes and 194 columns (zip code plus 193 unique categories)

(42, 194)
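
Taking the mean of one-hot columns within each zip code yields, for every category, the share of that zip code's venues belonging to it. The same quantity can be computed directly from (zip code, category) pairs, which may make the grouped table easier to interpret. This is a standard-library sketch: `category_frequencies` is a hypothetical helper and `sample_venues` is illustrative data, not the Foursquare results.

```python
from collections import Counter, defaultdict

def category_frequencies(venues):
    """Map each zip code to {category: share of that zip code's venues}."""
    counts = defaultdict(Counter)
    for zip_code, category in venues:
        counts[zip_code][category] += 1
    freqs = {}
    for zip_code, counter in counts.items():
        total = sum(counter.values())
        freqs[zip_code] = {cat: n / total for cat, n in counter.items()}
    return freqs

# Hypothetical (zip code, category) pairs.
sample_venues = [
    ("78701", "Bar"), ("78701", "Bar"), ("78701", "Park"), ("78701", "Theater"),
    ("78703", "Gas Station"), ("78703", "Food Truck"),
]

freqs = category_frequencies(sample_venues)
print(freqs)
```

Each zip code's shares sum to 1, so zip codes with very few venues produce coarse values such as 0.5, which is worth keeping in mind when comparing rows.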

In [44]:
num_top_venues = 5

for ZIP in austin_grouped['Zip']: # print out relative frequency of top 5 venues for each zip code
    print("----"+ZIP+"----")
    temp = austin_grouped[austin_grouped['Zip'] == ZIP].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----73344----
           venue  freq
0      Pet Store  0.18
1  Historic Site  0.06
2           Park  0.06
3     Soup Place  0.06
4      Gift Shop  0.06


----78701----
                             venue  freq
0                              Bar  0.10
1                             Park  0.10
2                        BBQ Joint  0.05
3                      Gas Station  0.05
4  Southern / Soul Food Restaurant  0.05


----78702----
                  venue  freq
0  Gym / Fitness Center  0.13
1           Yoga Studio  0.11
2                   Spa  0.11
3                 Trail  0.08
4   American Restaurant  0.05


----78703----
                venue  freq
0   Convenience Store   0.1
1         Video Store   0.1
2          Food Truck   0.1
3  Chinese Restaurant   0.1
4         Gas Station   0.1


----78704----
                   venue  freq
0                   Café  0.08
1         Sandwich Place  0.08
2  Performing Arts Venue  0.08
3                   Park  0.04
4              Gift Shop  0.04


--

In [45]:
def return_most_common_venues(row, num_top_venues): # investigate most common venues
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [46]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Zip']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
austin_venues_sorted = pd.DataFrame(columns=columns)
austin_venues_sorted['Zip'] = austin_grouped['Zip']

for ind in np.arange(austin_grouped.shape[0]):
    austin_venues_sorted.iloc[ind, 1:] = return_most_common_venues(austin_grouped.iloc[ind, :], num_top_venues)

austin_venues_sorted.head()

Unnamed: 0,Zip,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,73344,Pet Store,Board Shop,Recreation Center,Skate Park,Sandwich Place,Soup Place,Salon / Barbershop,Bed & Breakfast,Sports Bar,Steakhouse
1,78701,Bar,Park,Convenience Store,Theater,Lighthouse,Laundromat,Gas Station,Southern / Soul Food Restaurant,Market,BBQ Joint
2,78702,Gym / Fitness Center,Spa,Yoga Studio,Trail,American Restaurant,Dive Bar,Clothing Store,Chinese Restaurant,Café,Sushi Restaurant
3,78703,Platform,Thrift / Vintage Store,Gas Station,Athletics & Sports,Residential Building (Apartment / Condo),Chinese Restaurant,Food Truck,Video Store,Advertising Agency,Convenience Store
4,78704,Café,Performing Arts Venue,Sandwich Place,Road,Motel,Bubble Tea Shop,Bus Station,Bus Stop,Ethiopian Restaurant,ATM


##### Performing clustering:

In [47]:
# set number of clusters
kclusters = 10

austin_grouped_clustering = austin_grouped.drop('Zip', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(austin_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([8, 8, 8, 8, 8, 8, 8, 9, 2, 8])
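
A quick check on whether the chosen number of clusters is informative is to count how many rows fall into each label. The sketch below does this for the ten labels printed above only, so it says nothing about the remaining rows; in the notebook the full `kmeans.labels_` array would be passed instead. The names `labels` and `sizes` are introduced here for illustration.

```python
from collections import Counter

# The first ten cluster labels, copied from the output above.
labels = [8, 8, 8, 8, 8, 8, 8, 9, 2, 8]

sizes = Counter(labels)
for label, count in sizes.most_common():
    print("cluster {}: {} zip codes".format(label, count))
```

If most rows land in a single cluster, as they do in this ten-row sample, it may be worth trying a smaller `kclusters` or inspecting the singleton clusters for zip codes with very few venues.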

In [48]:
# add clustering labels
austin_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

austin_merged = austin_df

# merge Austin venues with Austin zip code dataframe to add latitude/longitude for each zip code
austin_merged = austin_merged.join(austin_venues_sorted.set_index('Zip'), on='ZIP')

austin_merged.dropna(inplace=True) # drops any NaN results that may pop up
austin_merged.head()

Unnamed: 0,ZIP,Lat,Long,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,73344,30.280466,-97.750088,8.0,Pet Store,Board Shop,Recreation Center,Skate Park,Sandwich Place,Soup Place,Salon / Barbershop,Bed & Breakfast,Sports Bar,Steakhouse
2,78701,30.27846,-97.7188,8.0,Bar,Park,Convenience Store,Theater,Lighthouse,Laundromat,Gas Station,Southern / Soul Food Restaurant,Market,BBQ Joint
3,78702,30.272052,-97.762387,8.0,Gym / Fitness Center,Spa,Yoga Studio,Trail,American Restaurant,Dive Bar,Clothing Store,Chinese Restaurant,Café,Sushi Restaurant
4,78703,30.228031,-97.7734,8.0,Platform,Thrift / Vintage Store,Gas Station,Athletics & Sports,Residential Building (Apartment / Condo),Chinese Restaurant,Food Truck,Video Store,Advertising Agency,Convenience Store
5,78704,30.288272,-97.727404,8.0,Café,Performing Arts Venue,Sandwich Place,Road,Motel,Bubble Tea Shop,Bus Station,Bus Stop,Ethiopian Restaurant,ATM


In [49]:
# grabbing the lat/long coords for Austin, TX

address = 'Austin, TX'

geolocator = Nominatim(user_agent="tx_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Austin, TX are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Austin, TX are 30.2711286, -97.7436995.


In [50]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=9)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(austin_merged['Lat'], austin_merged['Long'], austin_merged['ZIP'], austin_merged['Cluster Labels'].astype(int)):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [51]:
austin_merged.loc[austin_merged['Cluster Labels'] == 0.0, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Home Service    1
Name: 1st Most Common Venue, dtype: int64

In [52]:
austin_merged.loc[austin_merged['Cluster Labels'] == 1.0, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Coffee Shop    1
Gym            1
Name: 1st Most Common Venue, dtype: int64

In [53]:
austin_merged.loc[austin_merged['Cluster Labels'] == 2.0, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Park       3
Spa        1
Dog Run    1
Name: 1st Most Common Venue, dtype: int64

In [54]:
austin_merged.loc[austin_merged['Cluster Labels'] == 3.0, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Construction & Landscaping    2
Name: 1st Most Common Venue, dtype: int64

In [55]:
austin_merged.loc[austin_merged['Cluster Labels'] == 4.0, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Pool    1
Name: 1st Most Common Venue, dtype: int64

In [56]:
austin_merged.loc[austin_merged['Cluster Labels'] == 5.0, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Park    1
Name: 1st Most Common Venue, dtype: int64

In [57]:
austin_merged.loc[austin_merged['Cluster Labels'] == 6.0, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Plane    1
Name: 1st Most Common Venue, dtype: int64

In [58]:
austin_merged.loc[austin_merged['Cluster Labels'] == 7.0, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

BBQ Joint    1
Name: 1st Most Common Venue, dtype: int64

In [59]:
austin_merged.loc[austin_merged['Cluster Labels'] == 8.0, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Bar                           2
Mexican Restaurant            2
Gym / Fitness Center          2
Café                          1
Spa                           1
Platform                      1
Discount Store                1
Nightclub                     1
Insurance Office              1
Gym                           1
Capitol Building              1
Auto Workshop                 1
Yoga Studio                   1
Construction & Landscaping    1
Hotel                         1
Pet Store                     1
American Restaurant           1
Comedy Club                   1
Miscellaneous Shop            1
Grocery Store                 1
Mobile Phone Shop             1
Football Stadium              1
Coffee Shop                   1
Pool                          1
Name: 1st Most Common Venue, dtype: int64

In [60]:
austin_merged.loc[austin_merged['Cluster Labels'] == 9.0, austin_merged.columns[[0] + list(range(4, austin_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Pool    1
Name: 1st Most Common Venue, dtype: int64
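
The ten cells above repeat the same filter once per cluster. As a possible shortcut, a single `groupby` can produce the same counts for every cluster in one pass. The sketch below uses a small made-up frame (`toy_merged` is a hypothetical stand-in for `austin_merged`, and its values are illustrative only, with integer labels for simplicity).

```python
import pandas as pd

# toy stand-in for austin_merged (values are illustrative, not real results)
toy_merged = pd.DataFrame({
    'ZIP': ['00001', '00002', '00003', '00004', '00005'],
    'Cluster Labels': [8, 8, 2, 2, 2],
    '1st Most Common Venue': ['Bar', 'Bar', 'Park', 'Park', 'Spa'],
})

# count the top venue categories within every cluster in one pass
summary = (toy_merged
           .groupby('Cluster Labels')['1st Most Common Venue']
           .value_counts())
print(summary)
```

The result is indexed by cluster label and venue category, so one cluster can still be inspected on its own with `summary.loc[2]`.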

##### Now exploring the city of Fort Worth (similar investigation as with Austin):

In [61]:
zip_df[zip_df['City'] == 'Fort Worth'].head()

Unnamed: 0,Zip_Code,City
624,ZIP Code 76101,Fort Worth
625,ZIP Code 76102,Fort Worth
626,ZIP Code 76103,Fort Worth
627,ZIP Code 76104,Fort Worth
628,ZIP Code 76105,Fort Worth


In [62]:
fort_worth_df = zip_df[ zip_df['City'] == 'Fort Worth'].reset_index(drop=True)
fort_worth_df.rename(columns={'Zip_Code': 'ZIP'},inplace=True)
fort_worth_df.drop('City',axis=1,inplace=True)
fort_worth_df.tail()

Unnamed: 0,ZIP
51,ZIP Code 76195
52,ZIP Code 76196
53,ZIP Code 76197
54,ZIP Code 76198
55,ZIP Code 76199


In [63]:
fort_worth_df.shape

(56, 1)

In [64]:
fort_worth_zip_cleaned = []

for i, ZIP in enumerate(fort_worth_df.index):
    fort_worth_zip_cleaned.append(fort_worth_df.loc[i,'ZIP'].split(' ')[2])
    
fort_worth_df['ZIP'] = pd.Series(fort_worth_zip_cleaned)
fort_worth_df.head()

Unnamed: 0,ZIP
0,76101
1,76102
2,76103
3,76104
4,76105
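
The loop above strips the `ZIP Code ` prefix row by row. An alternative, shown here only as a sketch on a toy frame (`toy_df` is a hypothetical name), is to let pandas extract the five-digit code from the whole column at once with a regular expression.

```python
import pandas as pd

# toy frame mimicking the 'ZIP Code 76101' strings shown above
toy_df = pd.DataFrame({'ZIP': ['ZIP Code 76101', 'ZIP Code 76102', 'ZIP Code 76103']})

# pull the five-digit code out of every row at once
toy_df['ZIP'] = toy_df['ZIP'].str.extract(r'(\d{5})', expand=False)
print(toy_df)
```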


In [65]:
# grabbing the lat/long coords for each Fort Worth ZIP code

fort_worth_lat = []
fort_worth_long = []

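for i, ZIP in enumerate(fort_worth_df.index):
    address = fort_worth_df.loc[i,'ZIP'] + ', USA'
    geolocator = Nominatim(user_agent='fort_worth_explorer')
    location = geolocator.geocode(address)
    try:
        fort_worth_lat.append(location.latitude)
        fort_worth_long.append(location.longitude)
    except AttributeError:
        # geocode returned None: store NaN so the coordinates stay aligned with their ZIP rows
        fort_worth_lat.append(np.nan)
        fort_worth_long.append(np.nan)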
    
fort_worth_df['Lat'] = pd.Series(fort_worth_lat)
fort_worth_df['Long'] = pd.Series(fort_worth_long)
fort_worth_df.head()

Unnamed: 0,ZIP,Lat,Long
0,76101,32.750517,-97.33754
1,76102,32.755628,-97.326457
2,76103,32.737108,-97.277394
3,76104,32.727147,-97.330602
4,76105,32.7203,-97.289512
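
When a geocoder cannot find an address, it is worth recording a placeholder rather than skipping the row, so that the list of coordinates keeps the same length and order as the list of ZIP codes. The following is a minimal sketch of that idea using only the standard library. `geocode_all` and the `lookup` callable are hypothetical names, and the dictionary stands in for a real geocoder.

```python
import math

def geocode_all(addresses, lookup):
    """Return one (lat, lon) pair per address, using NaN when a lookup fails.

    `lookup` is a hypothetical callable returning a (lat, lon) tuple or None.
    """
    coords = []
    for address in addresses:
        result = lookup(address)
        if result is None:
            coords.append((math.nan, math.nan))  # placeholder keeps rows aligned
        else:
            coords.append(result)
    return coords

# stub lookup standing in for a real geocoder; only one address is "known"
known = {'76101, USA': (32.750517, -97.33754)}
coords = geocode_all(['76101, USA', '00000, USA'], known.get)
print(coords)
```

Because every address yields exactly one pair, a later `dropna` removes precisely the rows that failed to geocode.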


In [66]:
fort_worth_df.shape

(56, 3)

In [67]:
fort_worth_df.dropna(axis=0,inplace=True)

In [68]:
fort_worth_df.shape

(45, 3)

In [69]:
# storing nearby venues and listing nearby neighborhoods
fort_worth_venues = getNearbyVenues(names=fort_worth_df['ZIP'],
                                   latitudes=fort_worth_df['Lat'],
                                   longitudes=fort_worth_df['Long'])

76101
76102
76103
76104
76105
76106
76107
76108
76109
76110
76111
76112
76113
76114
76115
76116
76118
76119
76120
76121
76122
76123
76124
76126
76129
76130
76131
76132
76133
76134
76135
76136
76137
76140
76147
76148
76150
76155
76161
76162
76163
76164
76166
76177
76179


In [70]:
print(fort_worth_venues.shape)

(374, 7)


In [71]:
fort_worth_venues.head()

Unnamed: 0,Zip,Zip Latitude,Zip Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,76101,32.750517,-97.33754,Buon Giorno Coffee,32.749425,-97.336445,Coffee Shop
1,76101,32.750517,-97.33754,YMCA,32.752413,-97.334263,Gym
2,76101,32.750517,-97.33754,Studio 80,32.753064,-97.333263,Nightclub
3,76101,32.750517,-97.33754,Salsa Limón,32.752802,-97.333446,Food Truck
4,76101,32.750517,-97.33754,Walgreens,32.749379,-97.337187,Pharmacy


In [72]:
fort_worth_venues.groupby('Zip').count()

Unnamed: 0_level_0,Zip Latitude,Zip Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Zip,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
76101,22,22,22,22,22,22
76102,35,35,35,35,35,35
76104,20,20,20,20,20,20
76105,1,1,1,1,1,1
76106,2,2,2,2,2,2
76107,20,20,20,20,20,20
76108,1,1,1,1,1,1
76109,8,8,8,8,8,8
76110,3,3,3,3,3,3
76111,22,22,22,22,22,22


In [73]:
print('There are {} unique categories.'.format(len(fort_worth_venues['Venue Category'].unique())))

There are 132 unique categories.


In [74]:
# one hot encoding
fort_worth_onehot = pd.get_dummies(fort_worth_venues[['Venue Category']], prefix="", prefix_sep="")

# add Zip column back to dataframe
fort_worth_onehot['Zip'] = fort_worth_venues['Zip'] 

# move Zip column to the first column
fixed_columns = [fort_worth_onehot.columns[-1]] + list(fort_worth_onehot.columns[:-1])
fort_worth_onehot = fort_worth_onehot[fixed_columns]

fort_worth_onehot.head()

Unnamed: 0,Zip,Airport,American Restaurant,Antique Shop,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Automotive Shop,BBQ Joint,Bagel Shop,...,Thrift / Vintage Store,Trail,Train Station,Turkish Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfall,Women's Store
0,76101,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,76101,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,76101,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,76101,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,76101,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [75]:
fort_worth_onehot.shape

(374, 133)

In [76]:
fort_worth_grouped = fort_worth_onehot.groupby('Zip').mean().reset_index()
fort_worth_grouped

Unnamed: 0,Zip,Airport,American Restaurant,Antique Shop,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Automotive Shop,BBQ Joint,Bagel Shop,...,Thrift / Vintage Store,Trail,Train Station,Turkish Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfall,Women's Store
0,76101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0
1,76102,0.0,0.171429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,76104,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,76105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,76106,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,76107,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.05
6,76108,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,76109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0
8,76110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,76111,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,...,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0


In [77]:
fort_worth_grouped.shape

(37, 133)
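
Taking the mean of one-hot columns within each ZIP code yields the share of that ZIP code's venues belonging to each category. The toy example below is a sketch of that step; the ZIP labels and categories are made up for illustration.

```python
import pandas as pd

# toy venue list: two ZIP codes with illustrative categories
toy_venues = pd.DataFrame({
    'Zip': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Park', 'Bar', 'Gym', 'Bar'],
})

# one indicator column per category, then the Zip label added back in front
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
toy_onehot.insert(0, 'Zip', toy_venues['Zip'])

# the mean of each indicator column is the share of venues in that category
toy_grouped = toy_onehot.groupby('Zip').mean().reset_index()
print(toy_grouped)
```

Here ZIP code `A` has four venues, two of which are parks, so its `Park` frequency is 0.5.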

In [78]:
num_top_venues = 5

for ZIP in fort_worth_grouped['Zip']:
    print("----"+ZIP+"----")
    temp = fort_worth_grouped[fort_worth_grouped['Zip'] == ZIP].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----76101----
                 venue  freq
0          Video Store  0.09
1  Rental Car Location  0.09
2       Sandwich Place  0.09
3           Food Truck  0.09
4                Hotel  0.05


----76102----
                  venue  freq
0   American Restaurant  0.17
1                 Hotel  0.11
2    Mexican Restaurant  0.06
3  Gym / Fitness Center  0.03
4            Public Art  0.03


----76104----
                     venue  freq
0              Pizza Place  0.10
1       Mexican Restaurant  0.10
2                      Bar  0.05
3  New American Restaurant  0.05
4    Salvadoran Restaurant  0.05


----76105----
                     venue  freq
0                Waterfall   1.0
1                  Airport   0.0
2  North Indian Restaurant   0.0
3        Mobile Phone Shop   0.0
4                    Motel   0.0


----76106----
               venue  freq
0  Convenience Store   0.5
1   Basketball Court   0.5
2            Airport   0.0
3       Optical Shop   0.0
4      Movie Theater   0.0


----7610

In [79]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
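
To see what this helper returns, the sketch below applies it to a single made-up row shaped like one row of `fort_worth_grouped` (the frequencies are illustrative only). The function body is repeated so the example runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # same logic as the cell above: skip the Zip label, sort by frequency
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# toy row shaped like one row of fort_worth_grouped (frequencies are made up)
toy_row = pd.Series({'Zip': '00000', 'Bar': 0.2, 'Gym': 0.1, 'Park': 0.7})
top_two = return_most_common_venues(toy_row, 2)
print(top_two)
```

Categories with equal frequency, including 0.0, come back in whatever order the sort leaves them. For a ZIP code with a single venue, such as 76105 above, only the first entry is therefore meaningful.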

In [80]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Zip']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
fort_worth_venues_sorted = pd.DataFrame(columns=columns)
fort_worth_venues_sorted['Zip'] = fort_worth_grouped['Zip']

for ind in np.arange(fort_worth_grouped.shape[0]):
    fort_worth_venues_sorted.iloc[ind, 1:] = return_most_common_venues(fort_worth_grouped.iloc[ind, :], num_top_venues)

fort_worth_venues_sorted.head()

Unnamed: 0,Zip,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,76101,Sandwich Place,Food Truck,Video Store,Rental Car Location,Gym,Gas Station,Hotel,Italian Restaurant,Cosmetics Shop,Convenience Store
1,76102,American Restaurant,Hotel,Mexican Restaurant,Sandwich Place,Performing Arts Venue,Dessert Shop,Rock Club,Rental Car Location,Public Art,Plaza
2,76104,Mexican Restaurant,Pizza Place,Pharmacy,Basketball Court,Fast Food Restaurant,Japanese Restaurant,Donut Shop,Diner,Convenience Store,New American Restaurant
3,76105,Waterfall,Discount Store,Fast Food Restaurant,Farmers Market,Farm,Electronics Store,Donut Shop,Dive Bar,Women's Store,Financial or Legal Service
4,76106,Convenience Store,Basketball Court,Women's Store,Dive Bar,Fast Food Restaurant,Farmers Market,Farm,Electronics Store,Donut Shop,Discount Store


In [81]:
# set number of clusters
kclusters = 10

fort_worth_grouped_clustering = fort_worth_grouped.drop('Zip', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(fort_worth_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([4, 4, 4, 0, 1, 4, 6, 4, 8, 1])

In [82]:
# add clustering labels
fort_worth_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

fort_worth_merged = fort_worth_df

# merge Fort Worth venues with Fort Worth zip code dataframe to add latitude/longitude for each zip code
fort_worth_merged = fort_worth_merged.join(fort_worth_venues_sorted.set_index('Zip'), on='ZIP')

fort_worth_merged.dropna(inplace=True) # drops any NaN results that may pop up
fort_worth_merged.head()

Unnamed: 0,ZIP,Lat,Long,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,76101,32.750517,-97.33754,4.0,Sandwich Place,Food Truck,Video Store,Rental Car Location,Gym,Gas Station,Hotel,Italian Restaurant,Cosmetics Shop,Convenience Store
1,76102,32.755628,-97.326457,4.0,American Restaurant,Hotel,Mexican Restaurant,Sandwich Place,Performing Arts Venue,Dessert Shop,Rock Club,Rental Car Location,Public Art,Plaza
3,76104,32.727147,-97.330602,4.0,Mexican Restaurant,Pizza Place,Pharmacy,Basketball Court,Fast Food Restaurant,Japanese Restaurant,Donut Shop,Diner,Convenience Store,New American Restaurant
4,76105,32.7203,-97.289512,0.0,Waterfall,Discount Store,Fast Food Restaurant,Farmers Market,Farm,Electronics Store,Donut Shop,Dive Bar,Women's Store,Financial or Legal Service
5,76106,32.784913,-97.374036,1.0,Convenience Store,Basketball Court,Women's Store,Dive Bar,Fast Food Restaurant,Farmers Market,Farm,Electronics Store,Donut Shop,Discount Store


In [83]:
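# create map, centered on the mean of the Fort Worth ZIP code coordinates
# (latitude and longitude still hold the Austin values from In [49])
map_clusters = folium.Map(location=[fort_worth_merged['Lat'].mean(), fort_worth_merged['Long'].mean()], zoom_start=9)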

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(fort_worth_merged['Lat'], fort_worth_merged['Long'], fort_worth_merged['ZIP'], fort_worth_merged['Cluster Labels'].astype(int)):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
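
A map centre can be derived from the data itself by averaging the ZIP code coordinates, which avoids a separate geocoding call for the city. The sketch below does this with the first five Fort Worth coordinates listed earlier; using only five points is an assumption made to keep the example short.

```python
from statistics import mean

# first five Fort Worth ZIP code coordinates from the table earlier in this section
lats = [32.750517, 32.755628, 32.737108, 32.727147, 32.7203]
lons = [-97.33754, -97.326457, -97.277394, -97.330602, -97.289512]

# average the coordinates to get a rough centre for the map
center = [mean(lats), mean(lons)]
print(center)
```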

In [84]:
fort_worth_merged.loc[fort_worth_merged['Cluster Labels'] == 0.0, fort_worth_merged.columns[[0] + list(range(4, fort_worth_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Waterfall    1
Name: 1st Most Common Venue, dtype: int64

In [85]:
fort_worth_merged.loc[fort_worth_merged['Cluster Labels'] == 1.0, fort_worth_merged.columns[[0] + list(range(4, fort_worth_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Fast Food Restaurant    3
Liquor Store            2
Convenience Store       1
Fried Chicken Joint     1
Breakfast Spot          1
Name: 1st Most Common Venue, dtype: int64

In [86]:
fort_worth_merged.loc[fort_worth_merged['Cluster Labels'] == 2.0, fort_worth_merged.columns[[0] + list(range(4, fort_worth_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Park            1
Home Service    1
Name: 1st Most Common Venue, dtype: int64

In [87]:
fort_worth_merged.loc[fort_worth_merged['Cluster Labels'] == 3.0, fort_worth_merged.columns[[0] + list(range(4, fort_worth_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

American Restaurant    1
Name: 1st Most Common Venue, dtype: int64

In [88]:
fort_worth_merged.loc[fort_worth_merged['Cluster Labels'] == 4.0, fort_worth_merged.columns[[0] + list(range(4, fort_worth_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

American Restaurant       3
Park                      3
Pizza Place               2
Discount Store            2
Farm                      1
Motel                     1
Mexican Restaurant        1
Sandwich Place            1
Furniture / Home Store    1
Pharmacy                  1
Hobby Shop                1
BBQ Joint                 1
Basketball Court          1
Name: 1st Most Common Venue, dtype: int64

In [89]:
fort_worth_merged.loc[fort_worth_merged['Cluster Labels'] == 5.0, fort_worth_merged.columns[[0] + list(range(4, fort_worth_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Scenic Lookout    1
Name: 1st Most Common Venue, dtype: int64

In [90]:
fort_worth_merged.loc[fort_worth_merged['Cluster Labels'] == 6.0, fort_worth_merged.columns[[0] + list(range(4, fort_worth_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Gym / Fitness Center    1
Name: 1st Most Common Venue, dtype: int64

In [91]:
fort_worth_merged.loc[fort_worth_merged['Cluster Labels'] == 7.0, fort_worth_merged.columns[[0] + list(range(4, fort_worth_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Gym    1
Name: 1st Most Common Venue, dtype: int64

In [92]:
fort_worth_merged.loc[fort_worth_merged['Cluster Labels'] == 8.0, fort_worth_merged.columns[[0] + list(range(4, fort_worth_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Mexican Restaurant    2
Name: 1st Most Common Venue, dtype: int64

In [93]:
fort_worth_merged.loc[fort_worth_merged['Cluster Labels'] == 9.0, fort_worth_merged.columns[[0] + list(range(4, fort_worth_merged.shape[1]))]]['1st Most Common Venue'].value_counts()

Airport    1
Name: 1st Most Common Venue, dtype: int64

# Not able to finish due to COVID-19 pandemic distractions throughout the week.

##### I was unable to finish because the novel coronavirus was a major distraction. I wish I had more time because I was having fun, so I will finish when time frees up. I did not get around to investigating the newest venues (Foursquare does not actually provide this information, so I will look for another source later), and I was only able to investigate two cities. The quest is to be continued...