# Capstone Project Report – The Battle of Neighbourhoods in Chicago

## Introduction/Business Problem

A sushi franchise owner is seeking perfect locations to open branches where he can intorduce the finest sushi to the residents of the city. However, he is new to the city and couldn't decide where to set root for the business to grow. The three rules for starting a business are 1)location, 2)location, and 3)location! Therefore, he seeks help from data scientists and engineers to solve the problem that could be the deciding factor to this expansion.

## Data

Based on definition of our problem, factors that will influence the decission are:
* Postal and geolocation data of the neighbourhoods in Chicago.
* Number of restaurants and the ratio of restaurants venues to other venues.
* Number of sushi restaurants in the neighbourhoods.

## Methodology

We will first identify and map the neighbourhoods on a folium map to provide a taste of where and how the neighbourhoods in Chicago are located. Next, we will start to explore each neighbourhoods, using Foursquare API to help us find restaurants venues and all other types of venues, and make exploratory data analyses to see what the neighbourhoods are made up of. Then, we will use Foursquare API again to find all the sushi restaurants in each neighbourhood to identify potential competitors, and count them as one of the important factors. Finally, do a K-means clustering analysis to categorize them into 5 clusters, which could be served as indicators to help us determine whether a neighbourhood has the potential to start a successful business.

### Import Libraries

In [62]:
import bs4
from bs4 import BeautifulSoup

import requests

import pandas as pd

import numpy as np

import re

import folium

# !conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

### Make GET request to Wikipedia and extract our target table

In [63]:
url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
data = requests.get(url).text # send GET request and store as text data
my_soup = BeautifulSoup(data, 'html5lib') # parse the data with beautifulsoup

# search for target table
tables = my_soup.find_all('table')

for index, table in enumerate(tables):
    if 'Chicago community areas by number, population, and area' in str(table): # find the table by title
        target_table_index = index
print('There are {} tables found.\nTarget Table Index : {}'.format(index + 1, target_table_index))

There are 4 tables found.
Target Table Index : 0


### Find the coordinates of the neighbourhoods

In [64]:
# convert DMS coordinate to decimal coordinates
# the reason for creating this function is because the coordinates on the web page
# are in the DMS format instead of decimal format, which is what we need
def dms2dd(s):
    if '″' in s:
        degrees, minutes, seconds, direction = re.split('[°′″]+', s)
        dd = float(degrees) + float(minutes)/60 + float(seconds)/(60*60)
        if direction in ('S','W'):
            dd*= -1

    else:
        degrees, minutes, direction = re.split('[°′]+', s)
        dd = float(degrees) + float(minutes)/60
        if direction in ('S','W'):
            dd*= -1

    return dd

# get coordinate from wiki sub page
# the coordinates of the neighbourhoods are in the links of each individual neighbourhood
# therefore will have to make a get request for each neighbourhood and find the coordinates
def get_coordinate(row, name):
    link = 'https://en.wikipedia.org' + row.find('a')['href']
    data = requests.get(link).text
    sub_soup = BeautifulSoup(data,'html5lib')

    table = sub_soup.find('table', {'class':'infobox geography vcard'})
    latitude = table.find('span', {'class':'latitude'}).getText()
    longitude = table.find('span', {'class':'longitude'}).getText()

    latitude = dms2dd(latitude)
    longitude = dms2dd(longitude)

    return latitude, longitude

# create dataframe then append contents from wikipedia : https://en.wikipedia.org/wiki/Community_areas_in_Chicago
community_df =  pd.DataFrame(columns = ['No.', 'Name', 'Latitude', 'Longitude'])

for count, row in enumerate(tables[target_table_index].tbody.find_all('tr')):
    if count > 1 and count < 79:
        number = row.find('td').getText().replace('\n', '')
        name = row.find('a').getText()

        latitude, longitude = get_coordinate(row, name)

        community_df = community_df.append({'No.' : number, 'Name' : name, 'Latitude' : latitude, 'Longitude' : longitude}, ignore_index = True)

community_df

Unnamed: 0,No.,Name,Latitude,Longitude
0,01,Rogers Park,42.010000,-87.670000
1,02,West Ridge,42.000000,-87.690000
2,03,Uptown,41.970000,-87.660000
3,04,Lincoln Square,41.970000,-87.690000
4,05,North Center,41.950000,-87.680000
...,...,...,...,...
72,73,Washington Heights,41.703833,-87.653667
73,74,Mount Greenwood,41.700000,-87.710000
74,75,Morgan Park,41.690000,-87.670000
75,76,O'Hare,42.000000,-87.920000


### After we got our coordinate data, mark them on our map. We will locate Chicago first.

In [65]:
address = 'Chicago, IL'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinates of Chicago are ({}, {}).'.format(latitude, longitude))

The geograpical coordinates of Chicago are (41.8755616, -87.6244212).


### Map out the neighbourhoods in the city of Chicago

In [66]:
map_chicago = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, num, neighbourhood in zip(community_df['Latitude'], community_df['Longitude'], community_df['No.'], community_df['Name']):
    label = '{}. {}'.format(num, neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.5,
        parse_html=False).add_to(map_chicago)  
    
map_chicago

### There we have all of our neighbourhoods in place. Now, we will try to utilize our Foursquare API

In [67]:
CLIENT_ID = 'PFH4NKA0XJIIWJCZDYMADEGSIWG1ODINGUV23CUVG5MUW1S4' # your Foursquare ID
CLIENT_SECRET = 'QOQ1TGJKHA1SWAMXHGMN21OAPLCX34I40B4G4VFXGXK2G2NI' # your Foursquare Secret
VERSION = '20210101' # Foursquare API version
LIMIT = 50 # A default Foursquare API limit value

# function for getting venues
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # append them into our desire form of dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    print('done collecting venues')
    return(nearby_venues)

### A brief summary of the collected venue data

In [68]:
chicago_venues = getNearbyVenues(names = community_df['Name'],
                                 latitudes = community_df['Latitude'],
                                 longitudes = community_df['Longitude'])
print('Shape of the venue dataframe is {}'.format(chicago_venues.shape))
print('There are {} unique venue categories.'.format(len(chicago_venues['Venue Category'].unique())))

done collecting venues
Shape of the venue dataframe is (2919, 7)
There are 297 unique venue categories.


### Put the venues into different categories

In [69]:
chicago_onehot = pd.get_dummies(chicago_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
chicago_onehot['Neighbourhood'] = chicago_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [chicago_onehot.columns[-1]] + list(chicago_onehot.columns[:-1])
chicago_onehot = chicago_onehot[fixed_columns]

print('The shape of the venue category matrix is {}.'.format(chicago_onehot.shape))

The shape of the venue category matrix is (2919, 298).


### A brief summary of venues and the frequency of restaurants of each neighbourhoods

In [70]:
# also group all restaurants as one group
chicago_grouped = chicago_onehot.groupby('Neighbourhood').mean().reset_index()

num_top_venues = 5

restaurant_freq = []

for hood in chicago_grouped['Neighbourhood']:
    temp = chicago_grouped[chicago_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    
    # sum the restaurants' frequency as a individual category
    group_restaurants = temp[temp['venue'].str.contains('Restaurant')]
    restaurant_freq.append(round(group_restaurants['freq'].sum(), 2))

community_df['Restaurant Freq.'] = restaurant_freq
community_df.head()

Unnamed: 0,No.,Name,Latitude,Longitude,Restaurant Freq.
0,1,Rogers Park,42.01,-87.67,0.32
1,2,West Ridge,42.0,-87.69,0.2
2,3,Uptown,41.97,-87.66,0.34
3,4,Lincoln Square,41.97,-87.69,0.12
4,5,North Center,41.95,-87.68,0.18


In [71]:
print('Restaurant Frequency in ascending order')
print(community_df.sort_values('Restaurant Freq.', ascending = True).reset_index(drop = True).head())
print('\nRestaurant Frequency in descending order')
print(community_df.sort_values('Restaurant Freq.', ascending = False).reset_index(drop = True).head())

Restaurant Frequency in ascending order
  No.             Name   Latitude  Longitude  Restaurant Freq.
0  56   Garfield Ridge  41.816667 -87.760000              0.00
1  28   Near West Side  41.880000 -87.666667              0.00
2  60       Bridgeport  41.837500 -87.647500              0.00
3  77        Edgewater  41.990000 -87.660000              0.03
4  46    South Chicago  41.740000 -87.550000              0.06

Restaurant Frequency in descending order
  No.                 Name   Latitude  Longitude  Restaurant Freq.
0  75          Morgan Park  41.690000 -87.670000              0.54
1  67       West Englewood  41.775833 -87.664167              0.42
2  21             Avondale  41.940000 -87.710000              0.40
3  73   Washington Heights  41.703833 -87.653667              0.36
4  24            West Town  41.900000 -87.680000              0.34


### Now we know how each neighbourhood is made of, we will start to find all the sushi restaurants in the neighbourhoods.

In [72]:
category_ID = '4bf58dd8d48988d1d2941735' # ID of category – sushi restaurant

# function for getting sushi restaurants venues
def getSushiVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&categoryId={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION,
            category_ID, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        # return only relevant information - name and coordinates of the sushi restaurant
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Restaurant Name', 
                  'Restaurant Latitude', 
                  'Restaurant Longitude']
    
    print('done collecting venues')
    return(nearby_venues)

In [73]:
Sushi_df = getSushiVenues(names = community_df['Name'],
                          latitudes = community_df['Latitude'],
                          longitudes = community_df['Longitude'])
Sushi_df.head(10)

done collecting venues


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Restaurant Name,Restaurant Latitude,Restaurant Longitude
0,Rogers Park,42.01,-87.67,Asahi Roll,42.005543,-87.660996
1,Rogers Park,42.01,-87.67,Hana,42.005825,-87.66052
2,Rogers Park,42.01,-87.67,Hira's Cafe,42.007936,-87.666718
3,Uptown,41.97,-87.66,Agami Contemporary Sushi,41.967519,-87.658831
4,Uptown,41.97,-87.66,Dib Sushi Bar & Thai Cuisine,41.969042,-87.655973
5,Uptown,41.97,-87.66,Taketei Sushi,41.978093,-87.658353
6,Uptown,41.97,-87.66,Wabi Sabi Rotary,41.964322,-87.654553
7,Uptown,41.97,-87.66,Gorilla Sushi Bar,41.965832,-87.666872
8,Uptown,41.97,-87.66,Ora,41.975715,-87.668389
9,Lincoln Square,41.97,-87.69,Sushi Tokoro,41.968376,-87.688964


### Map out the sushi restaurants

In [74]:
map_sushi = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, name in zip(Sushi_df['Restaurant Latitude'], Sushi_df['Restaurant Longitude'], Sushi_df['Restaurant Name']):
    label = '{}'.format(name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=1,
        parse_html=False).add_to(map_sushi)

map_sushi

### A brief summary of the sushi restaurants in the neighbourhoods

In [75]:
sushi_count = Sushi_df.groupby('Neighbourhood').size().reset_index()
sushi_count.columns = ['Name', 'count']
sushi_count

Unnamed: 0,Name,count
0,Albany Park,2
1,Avondale,3
2,Belmont Cragin,1
3,Bridgeport,2
4,Dunning,1
5,Edgewater,7
6,Forest Glen,2
7,Hyde Park,2
8,Irving Park,1
9,Jefferson Park,3


### Merge them to our dataframe

In [76]:
community_df = community_df.merge(sushi_count, how = 'outer', on = 'Name')
community_df['count'] = community_df['count'].fillna(0).astype('int32')
community_df.head()

Unnamed: 0,No.,Name,Latitude,Longitude,Restaurant Freq.,count
0,1,Rogers Park,42.01,-87.67,0.32,3
1,2,West Ridge,42.0,-87.69,0.2,0
2,3,Uptown,41.97,-87.66,0.34,6
3,4,Lincoln Square,41.97,-87.69,0.12,6
4,5,North Center,41.95,-87.68,0.18,5


### KMean cluster

In [77]:
# set number of clusters
kclusters = 5

chicago_clustering = community_df
chicago_clustering = chicago_clustering.drop(['No.', 'Name', 'Latitude', 'Longitude'], 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(chicago_clustering)

print(kmeans.labels_[0:15])

# add clustering labels
result = community_df
result.insert(0, 'Cluster Labels', kmeans.labels_)

result.head()

[4 0 2 2 2 1 1 1 0 4 4 4 2 4 0]


Unnamed: 0,Cluster Labels,No.,Name,Latitude,Longitude,Restaurant Freq.,count
0,4,1,Rogers Park,42.01,-87.67,0.32,3
1,0,2,West Ridge,42.0,-87.69,0.2,0
2,2,3,Uptown,41.97,-87.66,0.34,6
3,2,4,Lincoln Square,41.97,-87.69,0.12,6
4,2,5,North Center,41.95,-87.68,0.18,5


### Map out the clusters

In [78]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(result['Latitude'], result['Longitude'], result['Name'], result['Cluster Labels']):
    label = folium.Popup(str(poi) + ' => Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.5).add_to(map_clusters)
       
map_clusters

### Examine the results

In [79]:
result.loc[result['Cluster Labels'] == 0, result.columns[[1] + list(range(5, result.shape[1]))]]

Unnamed: 0,No.,Restaurant Freq.,count
1,2,0.2,0
8,9,0.3,0
14,15,0.22,1
15,16,0.3,1
16,17,0.08,1
17,18,0.1,0
18,19,0.2,1
19,20,0.08,0
22,23,0.24,0
24,25,0.15,0


In [80]:
result.loc[result['Cluster Labels'] == 1, result.columns[[1] + list(range(5, result.shape[1]))]]

Unnamed: 0,No.,Restaurant Freq.,count
5,6,0.28,16
6,7,0.12,13
7,8,0.24,12
23,24,0.34,10


In [81]:
result.loc[result['Cluster Labels'] == 2, result.columns[[1] + list(range(5, result.shape[1]))]]

Unnamed: 0,No.,Restaurant Freq.,count
2,3,0.34,6
3,4,0.12,6
4,5,0.18,5
12,13,0.21,5
76,77,0.03,7


In [82]:
result.loc[result['Cluster Labels'] == 3, result.columns[[1] + list(range(5, result.shape[1]))]]

Unnamed: 0,No.,Restaurant Freq.,count
31,32,0.2,24


In [83]:
result.loc[result['Cluster Labels'] == 4, result.columns[[1] + list(range(5, result.shape[1]))]]

Unnamed: 0,No.,Restaurant Freq.,count
0,1,0.32,3
9,10,0.08,2
10,11,0.28,3
11,12,0.12,2
13,14,0.14,2
20,21,0.4,3
21,22,0.34,4
27,28,0.0,3
32,33,0.08,4
40,41,0.18,2


## Results and Discussion

The analysis shows that neighbourhoods of cluster 0(red) shows a high potential of opening a successful sushi restaurant, since there is little competition and moderate rate of restaurants occupied the area. Cluster 4(orange) has second less in competition, followings are cluster 2(blue), cluster 1(purple), and cluster 3(green), which is highly crowded with sushi restaurant. Therefore, we can start our location search by the neighbourhoods with red or orange labels; neighbourhoods with blue labels are not recommended; and neighbourhoods with purple or green labels are places we strongly don't recommend.

## Conclusion

The purpose of this project is to identify the spots in Chicago where there are less competition and show high potential for sushi restaurant businesses to grow. Our analysis could give the stakeholders a better understanding of the ecosystem of the Chicago's restaurant business and the areas around. Location matters the most when starting a business from scratch. Based on our analysis on the neighbourhoods, we enable the stakeholders and give them a head start to spot and secure their seats in this highly competitive food market.