# Introduction  
Welcome to my IBM Data Science Capstone Project on Coursera. This notebook holds the project's source code, with headers explaining each step along the way. The last cell contains the code that generates the final outcomes (be sure to run all preceding code cells first). Thank you for reviewing my project; I hope you find it interesting!

In [1]:
#All relevant imports
import numpy as np
import pandas as pd
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
import json
from geopy.geocoders import Nominatim
import geocoder
import requests
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import folium

from bs4 import BeautifulSoup

# Goal 
#### To determine the best place for a business owner to open a business in Chicago.
- We will utilize Chicago census data and the Foursquare API to generate a heatmap of recommended locations.
- Clustering will be used to translate the unstructured Foursquare data into meaningful insights about the business environment of Chicago neighborhoods.

In [2]:
#Some Resources Used
#base API
#https://api.foursquare.com/v2/
#https://api.foursquare.com/v2/venues/
#https://api.foursquare.com/v2/users/
#https://api.foursquare.com/v2/tips/
#https://github.com/blackmad/neighborhoods/blob/master/chicago.geojson
#https://github.com/OpenDataDE/State-zip-code-GeoJSON/blob/master/il_illinois_zip_codes_geo.min.json

# Getting Soup Output from website  
Here we pull the names and zipcodes of Chicago-area neighborhoods from a website for later use.

In [3]:
df_chicago = pd.DataFrame(columns=['Zipcode','Neighborhood'])
df_chicago

Unnamed: 0,Zipcode,Neighborhood


In [4]:
addr = 'https://data.mongabay.com/igapo/zip_codes/metropolitan-areas/metro-zip/Chicago%20(IL)1.html'
source = requests.get(addr).text
soup = BeautifulSoup(source,'lxml')

In [5]:
table = soup.find('table',class_='boldtable')

In [6]:
for i in table.find_all('tr'):
    content = i.td.text.split()
    df_chicago = df_chicago.append(dict(zip(df_chicago.columns,content)),ignore_index=True)
df_chicago

Unnamed: 0,Zipcode,Neighborhood
0,60001,Alden
1,60002,Antioch
2,60002,Old
3,60002,Old
4,60002,Wadsworth
5,60004,Arlington
6,60005,Arlington
7,60006,Arlington
8,60007,Elk
9,60008,Rolling
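
Note that the output above keeps only the first word of multi-word place names (for example `Arlington` or `Old`), because `split()` breaks the cell text on every space and `zip` then discards everything past the second token. Below is a minimal sketch of one way to keep the full name. It assumes each cell's text has the form `<zipcode> <name>`; the helper `split_row` and the sample strings are hypothetical and not taken from the scraped page.

```python
def split_row(cell_text):
    """Split '<zipcode> <place name>' into (zipcode, name), keeping multi-word names."""
    parts = cell_text.split(maxsplit=1)  # split only on the first run of whitespace
    if len(parts) != 2:
        return None  # header or malformed row: nothing to keep
    zipcode, name = parts
    return zipcode, name.strip()


# Hypothetical sample cell texts for illustration
rows = ["60004 Arlington Heights", "60001 Alden", "Zipcode"]
parsed = [r for r in map(split_row, rows) if r is not None]
print(parsed)
```

Whether full names are preferable depends on how the `Neighborhood` column is used later; the rest of this notebook filters on the single word `Chicago`, which this change would not affect.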


In [7]:
#df_chicago.to_csv(r'D:\Desktop\outcomes\chicago.csv')

In [8]:
df_chicago_only = df_chicago[df_chicago["Neighborhood"] == "Chicago"]

In [9]:
codes = df_chicago.groupby(df_chicago["Neighborhood"]).groups

In [10]:
#Create Empty Pandas DF
df_grouped = pd.DataFrame(columns=['Neighborhood','Zipcode'])
df_grouped

Unnamed: 0,Neighborhood,Zipcode


# Combining Data By Neighborhood  
For easy reference, here we collect the zipcodes that belong to each neighborhood name. This will aid individual research outside of the datasets, which we can later use to formulate a cost function.

In [11]:
for nb in codes.keys():
    print("NB",nb)
    zc = []
    for i in codes[nb]:  
        zc.append(df_chicago.iloc[i][0])
        zc = list(set(zc))
        zcf = ', '.join(zc)

NB AT
NB Abbott
NB Addison
NB Alden
NB Algonquin
NB Alsip
NB American
NB Amf
NB Antioch
NB Argo
NB Arlington
NB Ashburn
NB Aurora
NB Ballou
NB Bank
NB Bannockburn
NB Barrington
NB Bartlett
NB Batavia
NB Beach
NB Bedford
NB Beecher
NB Bellwood
NB Bensenville
NB Berkeley
NB Berwyn
NB Big
NB Bloomingdale
NB Blue
NB Bolingbrook
NB Boulder
NB Braceville
NB Bradford
NB Braidwood
NB Bridgeview
NB Bristol
NB Broadview
NB Brookfield
NB Buffalo
NB Bull
NB Burbank
NB Burlington
NB Burnham
NB Burr
NB Burridge
NB CHI
NB CNA
NB Calumet
NB Campton
NB Carbon
NB Carol
NB Carpentersville
NB Carpentersvle
NB Cary
NB Channahon
NB Charles
NB Chase
NB Chemung
NB Chesney
NB Chestnut
NB Chgo
NB Chicago
NB Cicero
NB Citicorp
NB Clare
NB Clarendon
NB Clearing
NB Cloverdale
NB Cntry
NB Coal
NB Commonwealth
NB Coral
NB Cortland
NB Country
NB Countryside
NB Cragin
NB Crest
NB Cresthill
NB Crestwood
NB Crete
NB Crystal
NB Ctry
NB Custer
NB Daniel
NB Darien
NB Deer
NB Deerfield
NB Dekalb
NB Des
NB Despl/Rsmt
NB Diam

In [12]:
for nb in codes.keys():
    content = [nb]
    zc = []
    for i in codes[nb]:  
        zc.append(df_chicago.iloc[i][0])
        zc = list(set(zc))
        zcf = ', '.join(zc)
    content.append(zcf)
    df_grouped = df_grouped.append(dict(zip(df_grouped.columns,content)),ignore_index=True) 
df_grouped.head()

Unnamed: 0,Neighborhood,Zipcode
0,AT,60572
1,Abbott,60064
2,Addison,60101
3,Alden,"60001, 60033"
4,Algonquin,"60156, 60102"
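
The grouping above can also be sketched without a dataframe. This is an illustrative alternative, not the code used in this notebook: the helper `group_zipcodes` is hypothetical, and it sorts the zipcodes so the joined string is the same on every run (joining a plain `set`, as above, gives no guaranteed order).

```python
from collections import defaultdict


def group_zipcodes(pairs):
    """Map each neighborhood to a comma-separated string of its distinct zipcodes."""
    grouped = defaultdict(set)
    for zipcode, neighborhood in pairs:
        grouped[neighborhood].add(zipcode)  # the set drops repeated zipcodes
    return {name: ", ".join(sorted(zips)) for name, zips in grouped.items()}


# A few (zipcode, neighborhood) pairs in the shape of the scraped rows
sample = [("60001", "Alden"), ("60033", "Alden"), ("60101", "Addison"), ("60001", "Alden")]
grouped = group_zipcodes(sample)
print(grouped)
```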


# Illinois geodata  
In this section, GeoJSON data for the state of Illinois is filtered for later use in folium mapping. The file is rather large, so only the relevant features are kept and the remainder is dropped.

In [13]:
#loading GeoJSON file
with open('illinois.json','r') as jsonFile:
    data = json.load(jsonFile)

geo = data

In [14]:
#geo['features'][0]['properties']["ZCTA5CE10"]
geo['features'][0]

{'type': 'Feature',
 'properties': {'STATEFP10': '17',
  'ZCTA5CE10': '62359',
  'GEOID10': '1762359',
  'CLASSFP10': 'B5',
  'MTFCC10': 'G6350',
  'FUNCSTAT10': 'S',
  'ALAND10': 10360074,
  'AWATER10': 7921,
  'INTPTLAT10': '+40.0338795',
  'INTPTLON10': '-091.2014548',
  'PARTFLG10': 'N'},
 'geometry': {'type': 'Polygon',
  'coordinates': [[[-91.182899, 40.026881],
    [-91.182577, 40.026761],
    [-91.182428, 40.026711],
    [-91.182125, 40.026608],
    [-91.181677, 40.02648],
    [-91.181419, 40.026419],
    [-91.18093, 40.026323],
    [-91.180498, 40.026255],
    [-91.180081, 40.026205],
    [-91.179637, 40.026169],
    [-91.179213, 40.026161],
    [-91.1788, 40.026162],
    [-91.178347, 40.026177],
    [-91.177816, 40.026225],
    [-91.177419, 40.026285],
    [-91.177051, 40.026348],
    [-91.176668, 40.026429],
    [-91.176692, 40.02387],
    [-91.176692, 40.023811],
    [-91.176698, 40.021465],
    [-91.176684, 40.021291],
    [-91.176632, 40.021046],
    [-91.176601, 40.02097

In [15]:
#Create Empty Pandas DF
df_geoZips = pd.DataFrame(columns=['Zipcode','Latitude','Longitude'])
df_geoZips

Unnamed: 0,Zipcode,Latitude,Longitude


In [16]:
validZips = []
#zz = set(df_chicago.iloc[:,0].values)
zz = set(df_chicago_only.iloc[:,0].values)
for i in range(len(geo['features'])):
    zi = geo['features'][i]['properties']["ZCTA5CE10"]
    lat = geo['features'][i]['properties']['INTPTLAT10']
    long = geo['features'][i]['properties']['INTPTLON10']
    
    if(zi in zz):
        validZips.append(geo['features'][i])
        df_geoZips = df_geoZips.append(dict(zip(df_geoZips.columns,[zi,lat,long])),ignore_index=True) 
df_geoZips.head()    

Unnamed: 0,Zipcode,Latitude,Longitude
0,60656,41.9742801,-87.8271313
1,60638,41.7814424,-87.7705341
2,60652,41.7479398,-87.7148066
3,60629,41.7758678,-87.7114956
4,60641,41.9466055,-87.7467867
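
The filtering step above can be isolated into a small function. The sketch below is only an illustration on a tiny hand-made feature collection: the function name `filter_features` is hypothetical, and the property names are the ones visible in the feature printed earlier.

```python
def filter_features(geojson, wanted_zips, zip_key="ZCTA5CE10"):
    """Keep only features whose zipcode is wanted; also return their center points."""
    wanted = set(wanted_zips)
    kept, centers = [], []
    for feature in geojson["features"]:
        props = feature["properties"]
        if props[zip_key] in wanted:
            kept.append(feature)
            # The interior-point fields are strings such as '+41.9742801'
            centers.append((props[zip_key],
                            float(props["INTPTLAT10"]),
                            float(props["INTPTLON10"])))
    return kept, centers


# Two trimmed-down features (geometry omitted for brevity)
sample = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "geometry": None,
     "properties": {"ZCTA5CE10": "62359", "INTPTLAT10": "+40.0338795", "INTPTLON10": "-091.2014548"}},
    {"type": "Feature", "geometry": None,
     "properties": {"ZCTA5CE10": "60656", "INTPTLAT10": "+41.9742801", "INTPTLON10": "-087.8271313"}},
]}
kept, centers = filter_features(sample, {"60656"})
print(centers)
```

Converting the coordinates to `float` at this point would remove the need for the later `float(...)` calls when placing map markers.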


In [17]:
#validZips = set(df_chicago.iloc[:,0].values)
#geoData = []
#for i in range(len(geo['features'])):
#    z = geo['features'][i]['properties']["ZCTA5CE10"]
#    if(z in zz):
# 
#df_geoZips.to_csv(r'D:\Desktop\outcomes\chicago_geozips.csv')

In [18]:
geo['features'][0]

{'type': 'Feature',
 'properties': {'STATEFP10': '17',
  'ZCTA5CE10': '62359',
  'GEOID10': '1762359',
  'CLASSFP10': 'B5',
  'MTFCC10': 'G6350',
  'FUNCSTAT10': 'S',
  'ALAND10': 10360074,
  'AWATER10': 7921,
  'INTPTLAT10': '+40.0338795',
  'INTPTLON10': '-091.2014548',
  'PARTFLG10': 'N'},
 'geometry': {'type': 'Polygon',
  'coordinates': [[[-91.182899, 40.026881],
    [-91.182577, 40.026761],
    [-91.182428, 40.026711],
    [-91.182125, 40.026608],
    [-91.181677, 40.02648],
    [-91.181419, 40.026419],
    [-91.18093, 40.026323],
    [-91.180498, 40.026255],
    [-91.180081, 40.026205],
    [-91.179637, 40.026169],
    [-91.179213, 40.026161],
    [-91.1788, 40.026162],
    [-91.178347, 40.026177],
    [-91.177816, 40.026225],
    [-91.177419, 40.026285],
    [-91.177051, 40.026348],
    [-91.176668, 40.026429],
    [-91.176692, 40.02387],
    [-91.176692, 40.023811],
    [-91.176698, 40.021465],
    [-91.176684, 40.021291],
    [-91.176632, 40.021046],
    [-91.176601, 40.02097

# Folium Beta Visual  
This is a pre-calculations visual of our selected zipcodes. Each marker is placed in the center of our target zipcodes.

In [36]:
map_chicago = folium.Map(location=[41.88, -87.62], zoom_start=10)

for i in df_geoZips.values:
    t1 = float(i[1])
    t2 = float(i[2])
    #folium.Marker([t1,t2]).add_to(map_chicago)
    folium.CircleMarker([t1,t2],radius=5,color='blue',fill=True,fill_color='#3186cc',fill_opacity=1).add_to(map_chicago)

    
    
map_chicago

# Folium Geo visual  
This is a pre-calculations visual of our Illinois GeoJSON data trimmed to the relevant zipcodes.

In [37]:
map_chicago = folium.Map(location=[41.82, -87.62], zoom_start=9.5)

print(len(validZips))
for i in range(len(validZips)):
    folium.GeoJson(validZips[i],overlay=True,style_function= lambda x :{'fillColor':'green','color':'green'}).add_to(map_chicago)
    
map_chicago

65


In [38]:
#map_chicago.save("chicago.html")

# Chicago business data  
Here we bring in data from the US Census to judge how many businesses of each size are in our target areas. We can sort the data by NAICS code to mirror our Foursquare data and gain further insight into where a good place for our business might be.

In [39]:
#SOURCE
#zbp16totals
#https://www.census.gov/data/datasets/2016/econ/cbp/2016-cbp.html

| Field Name | Data Type | Description |
|------------|-----------|-------------|
| ZIP | C | ZIP Code |
| NAICS | C | Industry Code - 6-digit NAICS code |
| EST | N | Total Number of Establishments |
| N1_4 | N | Number of Establishments: 1-4 Employee Size Class |
| N5_9 | N | Number of Establishments: 5-9 Employee Size Class |
| N10_19 | N | Number of Establishments: 10-19 Employee Size Class |
| N20_49 | N | Number of Establishments: 20-49 Employee Size Class |
| N50_99 | N | Number of Establishments: 50-99 Employee Size Class |
| N100_249 | N | Number of Establishments: 100-249 Employee Size Class |
| N250_499 | N | Number of Establishments: 250-499 Employee Size Class |
| N500_999 | N | Number of Establishments: 500-999 Employee Size Class |
| N1000 | N | Number of Establishments: 1,000 or More Employee Size Class |

In [40]:
df_ccd = pd.read_csv("zbp16detail.csv")

In [41]:
print(df_ccd.head())
#df_geoZips
print(df_ccd.shape)
distinct_zips = set(df_geoZips.iloc[:,0].values)

   zip   naics  est  n1_4  n5_9  n10_19  n20_49  n50_99  n100_249  n250_499  \
0  501  ------    2     1     0       0       1       0         0         0   
1  501  81----    2     1     0       0       1       0         0         0   
2  501  813///    2     1     0       0       1       0         0         0   
3  501  8131//    2     1     0       0       1       0         0         0   
4  501  81311/    2     1     0       0       1       0         0         0   

   n500_999  n1000  
0         0      0  
1         0      0  
2         0      0  
3         0      0  
4         0      0  
(8418283, 12)
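
The `naics` column in the preview above pads shorter codes with `-` or `/`, so the same establishments appear once per level of the hierarchy (all industries, 2-digit sector, 3-digit subsector, and so on). The following is a small sketch of how the level of a code could be read off; the helper names are hypothetical and the codes are the ones shown in the preview.

```python
def naics_prefix(code):
    """Return the digits of a padded NAICS code, dropping the '-' and '/' padding."""
    return code.rstrip("-/")


def naics_level(code):
    """Number of significant digits: 0 for the all-industries total, 2 for a sector, etc."""
    return len(naics_prefix(code))


codes = ["------", "81----", "813///", "8131//", "81311/"]
levels = [naics_level(c) for c in codes]
print(levels)

# Keeping a single level avoids adding up the same establishments more than once
sector_rows = [c for c in codes if naics_level(c) == 2]
print(sector_rows)
```

This matters for any later sum over rows selected only by their first two digits: such a selection includes every level within the sector, so the totals may count an establishment several times.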


In [42]:
df_ccd = df_ccd[df_ccd["zip"].isin(distinct_zips)]


df_ccd.shape

(42701, 12)

In [43]:
codes = df_ccd.groupby(df_ccd["zip"]).groups
print(len(codes.keys()))

65


# Foursquare Data  
The bulk of our insight is gained from Foursquare data. Here we utilize the API to learn about the businesses in each zipcode and cluster the zipcodes accordingly. Those clusters can then be compared to our target business to find which cluster fits it best, and further compared to our Census dataset.

In [44]:
#trending venues endpoint
#means venues with the most people checked in
#we can use this data for each zipcode along with the chicago business data
#to find the zipcodes with the fewest establishments but the most
#trending venues

In [45]:
#Foursquare credentials
client_id = 'your_id'
client_secret = 'your_secret'
version = '20190526'

In [46]:
radius = 100000
LIMIT = 50

#example request URL; lat and long hold the values left over from the GeoJSON loop above
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            client_id, 
            client_secret, 
            version, 
            lat, 
            long, 
            radius, 
            LIMIT)

In [47]:
def getTrending(lat,long):
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&time=any&day=any'.format(
            client_id, 
            client_secret, 
            version, 
            lat, 
            long, 
            radius, 
            LIMIT)
     # make the GET request
    #results = requests.get(url).json()["response"]['groups'][0]['items']
    results = requests.get(url).json()
    return results
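
As an alternative to formatting the query string by hand, the same request URL can be assembled from a dictionary of parameters. This is a sketch only: `build_explore_url` is a hypothetical helper, the credentials are placeholders, and the endpoint and parameter names are simply those used in `getTrending` above.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore request URL; urlencode escapes the values for us."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude pair
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_URL + "?" + urlencode(params)


url = build_explore_url("your_id", "your_secret", "20190526",
                        41.7634204, -87.5658787, 100000, 50)
print(url)
```

With `requests`, the same dictionary could instead be passed as `requests.get(EXPLORE_URL, params=params)`, which performs the encoding internally.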

In [48]:
#60649	+41.7634204	-087.5658787
fTest = getTrending(+41.7634204,-087.5658787)

In [49]:
k = fTest['response']['groups'][0]['items']
k[0]['venue']

{'id': '42eeb780f964a520b4261fe3',
 'name': 'Museum of Science and Industry',
 'location': {'address': '5700 S Lake Shore Dr',
  'crossStreet': 'at 57th Dr',
  'lat': 41.791617208319984,
  'lng': -87.58306656501914,
  'labeledLatLngs': [{'label': 'display',
    'lat': 41.791617208319984,
    'lng': -87.58306656501914}],
  'distance': 3447,
  'postalCode': '60637',
  'cc': 'US',
  'city': 'Chicago',
  'state': 'IL',
  'country': 'United States',
  'formattedAddress': ['5700 S Lake Shore Dr (at 57th Dr)',
   'Chicago, IL 60637',
   'United States']},
 'categories': [{'id': '4bf58dd8d48988d191941735',
   'name': 'Science Museum',
   'pluralName': 'Science Museums',
   'shortName': 'Science Museum',
   'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/museum_science_',
    'suffix': '.png'},
   'primary': True}],
 'photos': {'count': 0, 'groups': []},
 'venuePage': {'id': '85626555'}}

In [50]:
i = [1234]
for j in fTest['response']['groups'][0]['items']:
    v = dict(j)['venue']
    content = [i[0],v['name'],v['location']['lat'],v['location']['lng']]
    print(content)
    break

[1234, 'Museum of Science and Industry', 41.791617208319984, -87.58306656501914]


# Testing out Locations  
Here we gain human insight into our data by seeing which categories tend to show up for each zipcode. This was also useful because it exposed that some businesses were being duplicated by the API (airports, restaurants, etc.), which was corrected.

In [51]:
#Create Empty Pandas DF
df_trends = pd.DataFrame(columns=['Zipcode','Name','Latitude','Longitude','Category'])
target_category = '5454144b498ec1f095bff2f2'
#https://developer.foursquare.com/docs/resources/categories
df_trends

Unnamed: 0,Zipcode,Name,Latitude,Longitude,Category


In [52]:
#df_geoZips

for i in df_geoZips.values:
    trending_venues = getTrending(i[1],i[2])['response']['groups'][0]['items']
    for j in trending_venues:
        v = dict(j)['venue']
        content = [i[0],v['name'],v['location']['lat'],v['location']['lng'],v['categories'][0]['name']]
        df_trends = df_trends.append(dict(zip(df_trends.columns,content)),ignore_index=True) 
    

In [53]:
df_trends.drop_duplicates(["Zipcode","Name"],inplace = True)
len(df_trends)
df_trends.head()

Unnamed: 0,Zipcode,Name,Latitude,Longitude,Category
0,60656,The Capital Grille,41.974923,-87.862916,American Restaurant
1,60656,Frank Lloyd Wright Home and Studio,41.894157,-87.799517,Historic Site
2,60656,Smoque BBQ,41.950168,-87.727684,BBQ Joint
3,60656,Trader Joe's,41.890123,-87.804593,Grocery Store
4,60656,Portillo's,41.907365,-87.912586,Hot Dog Joint
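
The extraction and de-duplication steps above can be sketched on plain Python data. The helper names and the sample items below are hypothetical (the items only imitate the shape of the `items` list returned by the API). The sketch also guards against a venue with an empty `categories` list, which `v['categories'][0]` would not tolerate.

```python
def venue_rows(zipcode, items):
    """Turn API items into (zipcode, name, lat, lng, category) tuples."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((zipcode, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


def drop_duplicate_venues(rows):
    """Keep the first row for each (zipcode, name) pair, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = (row[0], row[1])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


museum = {"venue": {"name": "Museum of Science and Industry",
                    "location": {"lat": 41.79, "lng": -87.58},
                    "categories": [{"name": "Science Museum"}]}}
uncategorized = {"venue": {"name": "Unnamed Venue",
                           "location": {"lat": 41.80, "lng": -87.60},
                           "categories": []}}
rows = drop_duplicate_venues(venue_rows("60649", [museum, museum, uncategorized]))
print(rows)
```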


In [54]:
# one hot encoding
df_trends_onehot = pd.get_dummies(df_trends[['Category']], prefix="", prefix_sep="")

# add zipcode column back to dataframe
df_trends_onehot['Zipcode'] = df_trends['Zipcode'] 

# move zipcode column to the first column
fixed_columns = [df_trends_onehot.columns[-1]] + list(df_trends_onehot.columns[:-1])
df_trends_onehot = df_trends_onehot[fixed_columns]

df_trends_onehot.head()

Unnamed: 0,Zipcode,African Restaurant,American Restaurant,Amphitheater,Antique Shop,Art Gallery,Art Museum,Asian Restaurant,BBQ Joint,Bakery,Bar,Baseball Stadium,Beach,Beer Bar,Beer Store,Boat or Ferry,Bookstore,Breakfast Spot,Brewery,Butcher,Café,Chinese Restaurant,Chocolate Shop,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,Comedy Club,Concert Hall,Cosmetics Shop,Cupcake Shop,Cycle Studio,Deli / Bodega,Dessert Shop,Diner,Donut Shop,Electronics Store,Farmers Market,Field,Flower Shop,French Restaurant,Frozen Yogurt Shop,Furniture / Home Store,Garden,Garden Center,Gourmet Shop,Grocery Store,Gym,Gym / Fitness Center,Historic Site,History Museum,Hot Dog Joint,Hotel,Ice Cream Shop,Indie Movie Theater,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Lingerie Store,Liquor Store,Mediterranean Restaurant,Mexican Restaurant,Molecular Gastronomy Restaurant,Museum,Music School,Music Venue,Nature Preserve,New American Restaurant,Optical Shop,Other Great Outdoors,Outdoor Sculpture,Park,Pie Shop,Pizza Place,Rock Club,Salad Place,Salon / Barbershop,Sandwich Place,Science Museum,Seafood Restaurant,Spa,Stadium,Sushi Restaurant,Tapas Restaurant,Theater,Trail,Vegetarian / Vegan Restaurant,Waterfront,Yoga Studio,Zoo,Zoo Exhibit
0,60656,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,60656,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,60656,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,60656,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,60656,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [55]:
df_trends_grouped = df_trends_onehot.groupby('Zipcode').mean().reset_index()
df_trends_grouped.head()

Unnamed: 0,Zipcode,African Restaurant,American Restaurant,Amphitheater,Antique Shop,Art Gallery,Art Museum,Asian Restaurant,BBQ Joint,Bakery,Bar,Baseball Stadium,Beach,Beer Bar,Beer Store,Boat or Ferry,Bookstore,Breakfast Spot,Brewery,Butcher,Café,Chinese Restaurant,Chocolate Shop,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,Comedy Club,Concert Hall,Cosmetics Shop,Cupcake Shop,Cycle Studio,Deli / Bodega,Dessert Shop,Diner,Donut Shop,Electronics Store,Farmers Market,Field,Flower Shop,French Restaurant,Frozen Yogurt Shop,Furniture / Home Store,Garden,Garden Center,Gourmet Shop,Grocery Store,Gym,Gym / Fitness Center,Historic Site,History Museum,Hot Dog Joint,Hotel,Ice Cream Shop,Indie Movie Theater,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Lingerie Store,Liquor Store,Mediterranean Restaurant,Mexican Restaurant,Molecular Gastronomy Restaurant,Museum,Music School,Music Venue,Nature Preserve,New American Restaurant,Optical Shop,Other Great Outdoors,Outdoor Sculpture,Park,Pie Shop,Pizza Place,Rock Club,Salad Place,Salon / Barbershop,Sandwich Place,Science Museum,Seafood Restaurant,Spa,Stadium,Sushi Restaurant,Tapas Restaurant,Theater,Trail,Vegetarian / Vegan Restaurant,Waterfront,Yoga Studio,Zoo,Zoo Exhibit
0,60411,0.021277,0.0,0.021277,0.0,0.021277,0.021277,0.021277,0.021277,0.021277,0.042553,0.0,0.0,0.0,0.0,0.021277,0.0,0.021277,0.042553,0.0,0.0,0.021277,0.021277,0.0,0.0,0.0,0.021277,0.0,0.021277,0.0,0.0,0.0,0.021277,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021277,0.021277,0.0,0.0,0.042553,0.021277,0.042553,0.042553,0.0,0.0,0.0,0.0,0.021277,0.021277,0.021277,0.0,0.0,0.0,0.0,0.0,0.021277,0.0,0.0,0.021277,0.021277,0.12766,0.0,0.021277,0.021277,0.0,0.0,0.0,0.021277,0.0,0.0,0.0,0.0,0.0,0.021277,0.021277,0.0,0.042553,0.021277,0.0,0.0
1,60415,0.020833,0.0,0.0,0.0,0.020833,0.020833,0.020833,0.020833,0.020833,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0625,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.020833,0.0,0.020833,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.020833,0.0,0.0,0.020833,0.0,0.0,0.041667,0.020833,0.0,0.020833,0.041667,0.020833,0.020833,0.0625,0.0,0.0,0.0,0.0,0.020833,0.020833,0.020833,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.083333,0.0,0.041667,0.020833,0.0,0.0,0.020833,0.020833,0.0,0.0,0.020833,0.0,0.0,0.020833,0.0,0.0,0.020833,0.041667,0.0,0.0
2,60601,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.04,0.0,0.02,0.02,0.02,0.0,0.02,0.0,0.0,0.02,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.02,0.02,0.02,0.02,0.0,0.0,0.0,0.1,0.0,0.0,0.02,0.0,0.0,0.02,0.02,0.04,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.02,0.0,0.02,0.08,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.06,0.02,0.0,0.04,0.04,0.0,0.0
3,60602,0.0,0.0,0.020408,0.0,0.0,0.020408,0.0,0.020408,0.0,0.020408,0.0,0.0,0.0,0.0,0.040816,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.061224,0.0,0.020408,0.020408,0.0,0.0,0.020408,0.0,0.0,0.020408,0.020408,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.020408,0.020408,0.020408,0.020408,0.0,0.0,0.0,0.102041,0.0,0.0,0.020408,0.0,0.0,0.020408,0.020408,0.040816,0.0,0.0,0.020408,0.0,0.0,0.0,0.020408,0.020408,0.0,0.020408,0.081633,0.0,0.0,0.0,0.020408,0.0,0.020408,0.0,0.040816,0.0,0.0,0.0,0.0,0.061224,0.020408,0.0,0.040816,0.020408,0.0,0.0
4,60603,0.0,0.0,0.020408,0.0,0.0,0.020408,0.0,0.020408,0.0,0.020408,0.0,0.0,0.0,0.0,0.040816,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.061224,0.0,0.020408,0.020408,0.0,0.0,0.020408,0.0,0.0,0.020408,0.020408,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.0,0.020408,0.020408,0.020408,0.0,0.020408,0.0,0.102041,0.0,0.0,0.020408,0.0,0.0,0.020408,0.020408,0.020408,0.0,0.0,0.020408,0.0,0.0,0.0,0.020408,0.020408,0.0,0.020408,0.102041,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.040816,0.0,0.0,0.0,0.0,0.081633,0.020408,0.0,0.040816,0.020408,0.0,0.0
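
Taking the mean of one-hot columns per zipcode yields, for each category, the share of that zipcode's venues falling in the category; a value of 0.02, for instance, corresponds to one venue out of 50. The sketch below computes the same shares with the standard library on a small made-up list of (zipcode, category) pairs. The function name `category_frequencies` is hypothetical.

```python
from collections import Counter, defaultdict


def category_frequencies(rows):
    """Return {zipcode: {category: share of that zipcode's venues}}."""
    counts = defaultdict(Counter)
    for zipcode, category in rows:
        counts[zipcode][category] += 1
    freqs = {}
    for zipcode, counter in counts.items():
        total = sum(counter.values())
        freqs[zipcode] = {cat: n / total for cat, n in counter.items()}
    return freqs


rows = [("60656", "BBQ Joint"), ("60656", "Grocery Store"),
        ("60656", "BBQ Joint"), ("60656", "Park"),
        ("60601", "Hotel"), ("60601", "Park")]
freqs = category_frequencies(rows)
print(freqs["60656"])
```

Unlike the dataframe version, categories that never occur in a zipcode are simply absent here rather than stored as 0.0.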


# Venue Categories  
To make later calculations easier and create a nicer input interface, the venue categories are pulled down from the API and sorted according to category tiers.

In [56]:
#https://api.foursquare.com/v2/venues/categories
#Create Empty Pandas DF
df_category = pd.DataFrame(columns=['Category','Subcategory','Sub-Subcategory'])
df_category

Unnamed: 0,Category,Subcategory,Sub-Subcategory


In [57]:
url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
            client_id, 
            client_secret,version )

categories = requests.get(url).json()['response']['categories']

In [58]:
categories[0]['categories'][20]['categories']

[{'id': '4bf58dd8d48988d18f941735',
  'name': 'Art Museum',
  'pluralName': 'Art Museums',
  'shortName': 'Art Museum',
  'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/museum_art_',
   'suffix': '.png'},
  'categories': []},
 {'id': '559acbe0498e472f1a53fa23',
  'name': 'Erotic Museum',
  'pluralName': 'Erotic Museums',
  'shortName': 'Erotic Museum',
  'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/nightlife/stripclub_',
   'suffix': '.png'},
  'categories': []},
 {'id': '4bf58dd8d48988d190941735',
  'name': 'History Museum',
  'pluralName': 'History Museums',
  'shortName': 'History Museum',
  'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/museum_history_',
   'suffix': '.png'},
  'categories': []},
 {'id': '4bf58dd8d48988d192941735',
  'name': 'Planetarium',
  'pluralName': 'Planetariums',
  'shortName': 'Planetarium',
  'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/museum_plan

In [59]:
for k in categories:
    for i in k['categories']:
        if(len(i['categories']) > 0):
            for j in i['categories']:
                df_category = df_category.append(dict(zip(df_category.columns,[k['name'],i['name'],j['name']])),ignore_index=True)
            else:
                #for-else: after listing the children, also add the subcategory itself as its own row
                df_category = df_category.append(dict(zip(df_category.columns,[k['name'],i['name'],i['name']])),ignore_index=True) 


    
df_category.head()

Unnamed: 0,Category,Subcategory,Sub-Subcategory
0,Arts & Entertainment,Movie Theater,Drive-in Theater
1,Arts & Entertainment,Movie Theater,Indie Movie Theater
2,Arts & Entertainment,Movie Theater,Multiplex
3,Arts & Entertainment,Movie Theater,Movie Theater
4,Arts & Entertainment,Museum,Art Museum
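
The three-tier flattening can be sketched on a plain nested structure. This is an illustration under assumptions: the function `flatten_categories` and the trimmed sample tree are hypothetical. Unlike the loop above, which only records subcategories that have children, this variant also emits a row for a subcategory with no children.

```python
def flatten_categories(tree):
    """Flatten a category tree into (category, subcategory, sub-subcategory) rows."""
    rows = []
    for top in tree:
        for sub in top["categories"]:
            for child in sub["categories"]:
                rows.append((top["name"], sub["name"], child["name"]))
            # Always add the subcategory itself, so venues tagged with the
            # parent name (or with a childless subcategory) can still match
            rows.append((top["name"], sub["name"], sub["name"]))
    return rows


# Trimmed, hand-written sample in the shape of the API response
tree = [{"name": "Arts & Entertainment", "categories": [
    {"name": "Movie Theater", "categories": [
        {"name": "Drive-in Theater", "categories": []},
        {"name": "Indie Movie Theater", "categories": []},
        {"name": "Multiplex", "categories": []}]},
    {"name": "Amphitheater", "categories": []},
]}]
rows = flatten_categories(tree)
for row in rows:
    print(row)
```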


In [60]:
index = df_category[df_category['Subcategory'] == 'Stadium']
print(index)
df_trends[df_trends['Category'].isin(index['Sub-Subcategory'].values)]

                Category Subcategory     Sub-Subcategory
22  Arts & Entertainment     Stadium    Baseball Stadium
23  Arts & Entertainment     Stadium  Basketball Stadium
24  Arts & Entertainment     Stadium      Cricket Ground
25  Arts & Entertainment     Stadium    Football Stadium
26  Arts & Entertainment     Stadium        Hockey Arena
27  Arts & Entertainment     Stadium       Rugby Stadium
28  Arts & Entertainment     Stadium      Soccer Stadium
29  Arts & Entertainment     Stadium      Tennis Stadium
30  Arts & Entertainment     Stadium       Track Stadium
31  Arts & Entertainment     Stadium             Stadium


Unnamed: 0,Zipcode,Name,Latitude,Longitude,Category
67,60638,United Center,41.880759,-87.673974,Stadium
129,60652,United Center,41.880759,-87.673974,Stadium
174,60629,United Center,41.880759,-87.673974,Stadium
278,60625,Wrigley Field,41.94816,-87.655562,Baseball Stadium
373,60626,Wrigley Field,41.94816,-87.655562,Baseball Stadium
591,60630,Wrigley Field,41.94816,-87.655562,Baseball Stadium
627,60651,United Center,41.880759,-87.673974,Stadium
678,60645,Wrigley Field,41.94816,-87.655562,Baseball Stadium
788,60803,United Center,41.880759,-87.673974,Stadium
836,60712,Wrigley Field,41.94816,-87.655562,Baseball Stadium


# Collection of Dataframes  
Below is a summary of all the dataframes collected thus far and the data they hold. In total, 7 dataframes were examined to give us insight into the Chicago business climate. With this data we can now proceed to the final calculations.

* df_category = [Category, Subcategory, Sub-Subcategory]  
* df_trends_grouped = [one-hot encoded nearby businesses by category, averaged per zipcode]  
* df_trends = [nearby businesses and their categories]
* df_ccd = [Chicago census data for businesses]  
* df_geoZips = [zip, lat, long]  
* df_grouped = [all zipcodes for each neighborhood]
* df_chicago = [original scraped data]

In [61]:
def mainCatPrintout():
    types = df_category.Subcategory.unique()
    print("Please Select a type:")
    for i in range(0,len(types),3):
        print("%-30s %-30s %s" %(str(i)+":"+types[i],str(i+1)+":"+types[i+1],str(i+2)+":"+types[i+2]))

        
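def getmainCatSelection(index):
    index = int(index)
    types = df_category.Subcategory.unique()
    #bound the selection by the actual number of subcategories rather than a hard-coded count
    if(index >= 0 and index < len(types)):
        sc = getGeoCats(types[index])
        #print(sc)
        return sc
    else:
        return "Selection Not Found. Please Try Again"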

    
def getGeoCats(category_name):
    index = df_category[df_category['Subcategory'] == category_name]
    mc = index.values[0,0]
    sc = index['Sub-Subcategory'].values
    #print(sc)
    return sc

#mapping of input below to a NAICS code
#https://www.naics.com/business-lists/counts-by-naics-code/?#countsByNAICS
naics_codes = {0:71,1:61,2:71,3:71,4:81,5:71,6:71,7:71,8:61,9:61,10:72,
               11:72,12:72,13:72,14:72,15:72,16:72,17:72,18:72,19:72,20:72,
               21:72,22:72,23:72,24:72,25:72,26:72,27:72,28:72,29:72,30:72,
               31:72,32:44,33:11,34:11,35:92,36:71,37:71,38:92,39:62,40:55,
               41:61,42:71,43:62,44:44,45:72,46:42,47:81,48:48,49:53,50:48}

def getNaicsData(index):
    #use the index argument rather than the global selection variable
    return df_ccd[df_ccd["naics"].str[0:2] == str(naics_codes[int(index)])]

def getFoursquareData():
    limit = 10
    indicators = ['st', 'nd', 'rd']
    # create columns according to number of top venues
    columns = ['Zipcode']
    for ind in np.arange(limit):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except IndexError:
            #ranks past 3rd use the generic 'th' suffix
            columns.append('{}th Most Common Venue'.format(ind+1))
    # create a new dataframe
    df_commons = pd.DataFrame(columns=columns)
    df_commons['Zipcode'] = df_trends_grouped['Zipcode']
    for ind in np.arange(df_trends_grouped.shape[0]):
        df_commons.iloc[ind, 1:] = getMostCommon(df_trends_grouped.iloc[ind, :], limit)
    return df_commons

def getMostCommon(row, limit):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:limit]

In [62]:
getFoursquareData().head()

Unnamed: 0,Zipcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,60411,Park,History Museum,Waterfront,Hotel,Ice Cream Shop,Bar,Brewery,Mediterranean Restaurant,Nature Preserve,Concert Hall
1,60415,Park,Brewery,Ice Cream Shop,Grocery Store,Bar,Pizza Place,History Museum,Yoga Studio,Coffee Shop,Garden
2,60601,Hotel,Park,Theater,Coffee Shop,Yoga Studio,Waterfront,Mediterranean Restaurant,Seafood Restaurant,Boat or Ferry,Gym / Fitness Center
3,60602,Hotel,Park,Theater,Coffee Shop,Seafood Restaurant,Mediterranean Restaurant,Boat or Ferry,Waterfront,Cosmetics Shop,Concert Hall
4,60603,Park,Hotel,Theater,Coffee Shop,Seafood Restaurant,Boat or Ferry,Waterfront,Mediterranean Restaurant,New American Restaurant,Museum


# Clustering on Foursquare Data  
Here we utilize the encoded data from the Foursquare API to cluster zipcodes according to business climate. This will form a large part of our predictions.

In [63]:
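#Num clusters
#NOTE: k is not passed to KMeans below, so the default n_clusters=8 is used, as the output shows
k = 5
#dataSet = getFoursquareData().drop('Zipcode',1)

kmc = KMeans(random_state=0)
kmc.fit(df_trends_grouped.drop(columns='Zipcode'))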

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)

In [64]:
kmc.labels_

array([5, 5, 4, 4, 4, 4, 4, 4, 4, 1, 1, 4, 4, 3, 6, 0, 1, 1, 1, 2, 1, 1,
       1, 3, 3, 3, 6, 6, 5, 5, 2, 2, 5, 1, 7, 1, 1, 5, 7, 6, 7, 0, 5, 3,
       6, 2, 7, 1, 3, 5, 1, 4, 5, 2, 0, 6, 6, 4, 7, 7, 2, 5, 3, 5, 5])
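
To make concrete what the clustering step does with the per-zipcode frequency rows, here is a tiny pure-Python version of the basic k-means loop (Lloyd's algorithm). It is a stand-in for illustration only, not the scikit-learn implementation used above: it starts from the first `k` points rather than k-means++ seeding, and the four sample vectors are made up.

```python
def kmeans(points, k, iterations=100):
    """Minimal Lloyd's-algorithm k-means; returns (labels, centroids)."""
    # Assumption: a deterministic start from the first k points
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Made-up category-share vectors for four zipcodes
points = [[0.5, 0.5, 0.0], [0.6, 0.4, 0.0], [0.0, 0.1, 0.9], [0.0, 0.2, 0.8]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Zipcodes whose venue mixes are similar end up with the same label, which is what the `Cluster` column added below records.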

In [65]:
df_geoZips.sort_values(by=["Zipcode"],inplace=True)
df_geoZips.insert(3,"Cluster",kmc.labels_,True)

In [66]:
df_geoZips.head()

Unnamed: 0,Zipcode,Latitude,Longitude,Cluster
44,60411,41.5087744,-87.5903141,5
22,60415,41.7029482,-87.7788303,5
20,60601,41.8853104,-87.6221295,4
52,60602,41.8830726,-87.6291494,4
33,60603,41.8801879,-87.6255095,4


In [67]:
validZips[0]['properties']['ZCTA5CE10']

for i in validZips[0:1]:
    print(i['properties']['ZCTA5CE10'])

60656


In [68]:
df_geoZips.head()

Unnamed: 0,Zipcode,Latitude,Longitude,Cluster
44,60411,41.5087744,-87.5903141,5
22,60415,41.7029482,-87.7788303,5
20,60601,41.8853104,-87.6221295,4
52,60602,41.8830726,-87.6291494,4
33,60603,41.8801879,-87.6255095,4


# Cluster Map  
This map represents the clustered data. All that remains is to overlay a cost-function choropleth map on top of it to create the final recommendations.

In [69]:
map_chicago = folium.Map(location=[41.88, -87.62], zoom_start=10)
#cluster labels start at 0, so the number of clusters is max + 1
numClusters = int(df_geoZips["Cluster"].max()) + 1
colors_array = cm.rainbow(np.linspace(0, 1, numClusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]


for i in df_geoZips.values:
    t1 = float(i[1])
    t2 = float(i[2])
    #index the palette directly by cluster label so every cluster gets its own color
    folium.CircleMarker([t1,t2],radius=5,color=rainbow[int(i[3])],fill=True,fill_color=rainbow[int(i[3])],fill_opacity=0.1).add_to(map_chicago)

     
for i in range(len(validZips)):
    clust = df_geoZips[df_geoZips["Zipcode"]==validZips[i]['properties']['ZCTA5CE10']].values[0][3]
    folium.GeoJson(validZips[i],style_function= lambda x: {'fillColor':'grey','color':'gray'}).add_to(map_chicago)
    
    
map_chicago

# Recommendation Logic  
Here lies the recommendation cost function for our analysis. It attempts to score zipcodes by business opportunity, balancing the right amount of existing business presence (signaling a market/want) against the threat of competition (too many small businesses or a few large businesses).
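
To make the business-presence part of the score concrete, the sketch below isolates the `buisMod` term from `zipcodeScore` as a standalone function. The name `business_presence_modifier` is hypothetical, and the sample counts are made up.

```python
def business_presence_modifier(small_business_count):
    """Penalty for how far the count falls outside the 4-8 band (0 inside it)."""
    n = small_business_count
    # A count of 0 is left unpenalized, matching the condition in zipcodeScore
    if n != 0 and (n > 8 or n < 4):
        return -(abs(n - 6) - 2)
    return 0

for n in (0, 2, 6, 8, 12):
    print(n, business_presence_modifier(n))
```

Counts of 4 through 8 score 0, and each step further from that band costs one more point, so 2 small businesses give -2 and 12 give -4.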

In [70]:
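fqd = getFoursquareData()
def zipcodeScore(zipCode,ccdSmallBuisnessNum,selection):
    buisMod = 0
    clusterMod = 0
    #Penalize zipcodes whose small business count falls outside the 4-8 range (0 is left unpenalized)
    if(ccdSmallBuisnessNum != 0 and (ccdSmallBuisnessNum > 8 or ccdSmallBuisnessNum < 4)):
        buisMod = (abs(ccdSmallBuisnessNum-6)-2)*-1
    #Look up the selected categories once instead of on every loop iteration
    mainCats = getmainCatSelection(selection)
    #Each value outside the selection subtracts (5-|weight|), with weight stepping up from -5
    weight = -5
    for i in fqd[fqd["Zipcode"]==str(zipCode)].values[0]:
        if(i not in mainCats):
            clusterMod-=(5-abs(weight))
            weight+=1
    return clusterMod+buisMod


def recommendationEngine(selection):
    naics = getNaicsData(selection)
    scores = []
    for i in df_geoZips.values:
        buisnesses = naics[naics["zip"]==int(i[0])]
        #n1_4 and n5_9 are presumably establishment counts for the 1-4 and 5-9 employee size classes
        bNum = buisnesses["n1_4"].sum() + buisnesses["n5_9"].sum()
        scores.append(zipcodeScore(i[0],bNum,selection))
    df_geoZips.insert(4,"Score",scores,True)
    #idxmax returns an index label, so look the row up with .loc rather than .iloc
    bestCluster = df_geoZips.loc[df_geoZips["Score"].idxmax(),"Cluster"]
    #Give every zipcode in the best-scoring cluster a 30 point bonus
    df_geoZips.loc[df_geoZips["Cluster"]==bestCluster,"Score"] += 30
    return df_geoZips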
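
The best-cluster bonus depends on finding the highest-scoring row in a frame whose index is no longer in positional order after `sort_values`. The toy example below sketches that lookup with pandas; the frame `toy` and its values are invented for illustration, and only the 30 point bonus is taken from the code above.

```python
import pandas as pd

# Toy frame with a shuffled index, like df_geoZips after sort_values
toy = pd.DataFrame(
    {"Zipcode": ["60601", "60602", "60603"],
     "Cluster": [4, 1, 4],
     "Score": [-3, 2, -1]},
    index=[20, 52, 33],
)

# idxmax returns an index label (52 here), not a row position
best_label = toy["Score"].idxmax()
best_cluster = toy.loc[best_label, "Cluster"]

# Add the bonus to every row in the best-scoring cluster
toy.loc[toy["Cluster"] == best_cluster, "Score"] += 30
print(toy)
```

Because the label 52 is not a valid position in a three-row frame, the lookup here goes through `.loc`, which selects by label.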

In [71]:
df_geoZips.head()

Unnamed: 0,Zipcode,Latitude,Longitude,Cluster
44,60411,41.5087744,-87.5903141,5
22,60415,41.7029482,-87.7788303,5
20,60601,41.8853104,-87.6221295,4
52,60602,41.8830726,-87.6291494,4
33,60603,41.8801879,-87.6255095,4


In [76]:
mainCatPrintout()
selection = input()
#input() returns a string, so check it is a whole number before converting
if(selection.isdecimal() and int(selection) < 51):
    getmainCatSelection(selection)
    recommendationEngine(selection)
    map_chicago = folium.Map(location=[41.88, -87.62], zoom_start=10)

    # Add the color for the choropleth:
    folium.Choropleth(
     geo_data={"type":"FeatureCollection","features":list(validZips)},
     name='choropleth',
     data=df_geoZips,
     columns=['Zipcode', 'Score'],
     key_on='properties.ZCTA5CE10',
     fill_color='BuGn',
     fill_opacity=0.9,
     line_opacity=0.5,
     legend_name="Recommendation Cost Estimate"
    ).add_to(map_chicago)
    folium.LayerControl().add_to(map_chicago)

    #Overlay the cluster markers on top of the score choropleth
    for i in df_geoZips.values:
        t1 = float(i[1])
        t2 = float(i[2])
        folium.CircleMarker([t1,t2],radius=5,color=rainbow[int(i[3]-1)],fill=True,fill_color=rainbow[int(i[3]-1)],fill_opacity=0.1).add_to(map_chicago)

    display(map_chicago)
    #Remove the Score column so the cell can be run again with a new selection
    df_geoZips.drop(["Score"], axis=1,inplace=True)
else:
    print("Please enter a valid Selection")

Please Select a type:
0:Movie Theater                1:Museum                       2:Music Venue
3:Performing Arts Venue        4:Public Art                   5:Stadium
6:Theme Park                   7:Zoo                          8:College Academic Building
9:College Stadium              10:African Restaurant          11:American Restaurant
12:Asian Restaurant            13:Caribbean Restaurant        14:Dessert Shop
15:Eastern European Restaurant 16:French Restaurant           17:German Restaurant
18:Greek Restaurant            19:Hawaiian Restaurant         20:Indian Restaurant
21:Italian Restaurant          22:Jewish Restaurant           23:Latin American Restaurant
24:Mediterranean Restaurant    25:Mexican Restaurant          26:Middle Eastern Restaurant
27:Russian Restaurant          28:Spanish Restaurant          29:Turkish Restaurant
30:Ukrainian Restaurant        31:Bar                         32:Athletics & Sports
33:Beach                       34:Ski Area                   