# Capstone Project Report – The Battle of Neighbourhoods in Chicago

## Introduction/Business Problem

A sushi franchise owner is seeking perfect locations to open branches where he can intorduce the finest sushi to the residents of the city. However, he is new to the city and couldn't decide where to set root for the business to grow. The three rules for starting a business are 1)location, 2)location, and 3)location! Therefore, he seeks help from data scientists and engineers to solve the problem that could be the deciding factor to this expansion.

## Data

Retriving postal and geolocation data from wikipedia for locating the neighbourhoods in Chicago.
Use Foursquare API to explore venues in the neighbourhoods for analysis.
Use Foursquare API to search for sushi restaurants in the neighbourhoods.
Use Foursquare API to extract number of users who have liked the restaurants.

In [104]:
import bs4
from bs4 import BeautifulSoup

import requests

import pandas as pd

import re

import folium

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [67]:
url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
data = requests.get(url).text # send GET request and store as text data
my_soup = BeautifulSoup(data, 'html5lib') # parse the data with beautifulsoup

# search for target table
tables = my_soup.find_all('table')

for index, table in enumerate(tables):
    if 'Chicago community areas by number, population, and area' in str(table): # find the string in our target table that is unique to the other tables
        target_table_index = index
print('There are {} tables found.\nTarget Table Index : {}'.format(index + 1, target_table_index))

There are 4 tables found.
Target Table Index : 0


In [100]:
# convert DMS coordinate to decimal coordinates
def dms2dd(s):
    if '″' in s:
        degrees, minutes, seconds, direction = re.split('[°′″]+', s)
        dd = float(degrees) + float(minutes)/60 + float(seconds)/(60*60)
        if direction in ('S','W'):
            dd*= -1

    else:
        degrees, minutes, direction = re.split('[°′]+', s)
        dd = float(degrees) + float(minutes)/60
        if direction in ('S','W'):
            dd*= -1

    return dd

# get coordinate from wiki sub page
def get_coordinate(row, name):
    link = 'https://en.wikipedia.org' + row.find('a')['href']
    data = requests.get(link).text
    sub_soup = BeautifulSoup(data,'html5lib')

    table = sub_soup.find('table', {'class':'infobox geography vcard'})
    latitude = table.find('span', {'class':'latitude'}).getText()
    longitude = table.find('span', {'class':'longitude'}).getText()

    latitude = dms2dd(latitude)
    longitude = dms2dd(longitude)

    return latitude, longitude

# create dataframe then append contents from wikipedia : https://en.wikipedia.org/wiki/Community_areas_in_Chicago
community_df =  pd.DataFrame(columns = ['No.', 'Name', 'Latitude', 'Longitude'])

for count, row in enumerate(tables[target_table_index].tbody.find_all('tr')):
    if count > 1 and count < 79:
        number = row.find('td').getText().replace('\n', '')
        name = row.find('a').getText()

        latitude, longitude = get_coordinate(row, name)

        community_df = community_df.append({'No.' : number, 'Name' : name, 'Latitude' : latitude, 'Longitude' : longitude}, ignore_index = True)

community_df

Unnamed: 0,No.,Name,Latitude,Longitude
0,01,Rogers Park,42.010000,-87.670000
1,02,West Ridge,42.000000,-87.690000
2,03,Uptown,41.970000,-87.660000
3,04,Lincoln Square,41.970000,-87.690000
4,05,North Center,41.950000,-87.680000
...,...,...,...,...
72,73,Washington Heights,41.703833,-87.653667
73,74,Mount Greenwood,41.700000,-87.710000
74,75,Morgan Park,41.690000,-87.670000
75,76,O'Hare,42.000000,-87.920000


In [105]:
address = 'Chicago, IL'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinates of Chicago are ({}, {}).'.format(latitude, longitude))

The geograpical coordinates of Chicago are (41.8755616, -87.6244212).


In [106]:
map_chicago = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, num, neighbourhood in zip(community_df['Latitude'], community_df['Longitude'], community_df['No.'], community_df['Name']):
    label = '{}. {}'.format(num, neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.5,
        parse_html=False).add_to(map_chicago)  
    
map_chicago

In [110]:
CLIENT_ID = 'PFH4NKA0XJIIWJCZDYMADEGSIWG1ODINGUV23CUVG5MUW1S4' # your Foursquare ID
CLIENT_SECRET = 'QOQ1TGJKHA1SWAMXHGMN21OAPLCX34I40B4G4VFXGXK2G2NI' # your Foursquare Secret
VERSION = '20210101' # Foursquare API version
LIMIT = 50 # A default Foursquare API limit value

# function for getting venues
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    print('done collecting venues')
    return(nearby_venues)

In [112]:
chicago_venues = getNearbyVenues(names = community_df['Name'],
                                 latitudes = community_df['Latitude'],
                                 longitudes = community_df['Longitude'])
print('Shape of the venue dataframe is {}'.format(chicago_venues.shape))
print('There are {} unique venue categories.'.format(len(chicago_venues['Venue Category'].unique())))

done collecting venues
Shape of the venue dataframe is (2916, 7)


In [115]:
# one hot encoding
chicago_onehot = pd.get_dummies(chicago_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
chicago_onehot['Neighbourhood'] = chicago_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [chicago_onehot.columns[-1]] + list(chicago_onehot.columns[:-1])
chicago_onehot = chicago_onehot[fixed_columns]

print('The shape of the venue category matrix is {}.'.format(chicago_onehot.shape))

The shape of the venue category matrix is (2916, 293).


In [136]:
# also group all restaurants as one group
chicago_grouped = chicago_onehot.groupby('Neighbourhood').mean().reset_index()

num_top_venues = 5

restaurant_freq = []

for hood in chicago_grouped['Neighbourhood']:
    #print("----"+hood+"----")
    temp = chicago_grouped[chicago_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    #print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    
    group_restaurants = temp[temp['venue'].str.contains('Restaurant')]
    restaurant_freq.append(round(group_restaurants['freq'].sum(), 2))
    #print('Frequency of restaurants ', restaurant_freq[-1])

    #print(group_restaurants.head())
    #print('\n')

community_df['Restaurant Freq.'] = restaurant_freq
community_df.head()
#restaurant_freq

Unnamed: 0,No.,Name,Latitude,Longitude,Restaurant Freq.
0,1,Rogers Park,42.01,-87.67,0.34
1,2,West Ridge,42.0,-87.69,0.16
2,3,Uptown,41.97,-87.66,0.34
3,4,Lincoln Square,41.97,-87.69,0.12
4,5,North Center,41.95,-87.68,0.21


In [142]:
print('Restaurant Frequency in ascending order')
print(community_df.sort_values('Restaurant Freq.', ascending = True).reset_index(drop = True).head())
print('\nRestaurant Frequency in descending order')
print(community_df.sort_values('Restaurant Freq.', ascending = False).reset_index(drop = True).head())

Restaurant Frequency in ascending order
  No.             Name   Latitude  Longitude  Restaurant Freq.
0  28   Near West Side  41.880000 -87.666667              0.00
1  56   Garfield Ridge  41.816667 -87.760000              0.00
2  60       Bridgeport  41.837500 -87.647500              0.00
3  77        Edgewater  41.990000 -87.660000              0.03
4  57   Archer Heights  41.810000 -87.730000              0.06

Restaurant Frequency in descending order
  No.                 Name   Latitude  Longitude  Restaurant Freq.
0  75          Morgan Park  41.690000 -87.670000              0.54
1  21             Avondale  41.940000 -87.710000              0.50
2  67       West Englewood  41.775833 -87.664167              0.46
3  73   Washington Heights  41.703833 -87.653667              0.38
4  01          Rogers Park  42.010000 -87.670000              0.34


In [180]:
category_ID = '4bf58dd8d48988d1d2941735' # ID of category – sushi restaurant

# function for getting venues
def getSushiVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&categoryId={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION,
            category_ID, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Restaurant Name', 
                  'Restaurant Latitude', 
                  'Restaurant Longitude']
    
    print('done collecting venues')
    return(nearby_venues)

In [193]:
Sushi_df = getSushiVenues(names = community_df['Name'],
                          latitudes = community_df['Latitude'],
                          longitudes = community_df['Longitude'])
Sushi_df.head(10)

done collecting venues


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Restaurant Name,Restaurant Latitude,Restaurant Longitude
0,Rogers Park,42.01,-87.67,Asahi Roll,42.005543,-87.660996
1,Rogers Park,42.01,-87.67,Hana,42.005825,-87.66052
2,Rogers Park,42.01,-87.67,Hira's Cafe,42.007936,-87.666718
3,Uptown,41.97,-87.66,Agami Contemporary Sushi,41.967519,-87.658831
4,Uptown,41.97,-87.66,Dib Sushi Bar & Thai Cuisine,41.969042,-87.655973
5,Uptown,41.97,-87.66,Taketei Sushi,41.978093,-87.658353
6,Uptown,41.97,-87.66,Ora,41.975715,-87.668389
7,Uptown,41.97,-87.66,Wabi Sabi Rotary,41.964322,-87.654553
8,Uptown,41.97,-87.66,Gorilla Sushi Bar,41.965832,-87.666872
9,Lincoln Square,41.97,-87.69,Sushi Tokoro,41.968376,-87.688964


In [196]:
restaurant_count = Sushi_df.groupby('Neighbourhood').size().sort_values(ascending = False)
restaurant_count

Neighbourhood
 Loop               24
 Lake View          17
 Lincoln Park       14
 Near North Side    12
 West Town          10
 Edgewater           7
 Uptown              6
 Lincoln Square      6
 North Park          5
 North Center        5
 Logan Square        4
 Near South Side     4
 Avondale            3
 Jefferson Park      3
 Rogers Park         3
 Near West Side      3
 Forest Glen         2
 Hyde Park           2
 Bridgeport          2
 Albany Park         2
 Norwood Park        2
 Irving Park         1
 Lower West Side     1
 Morgan Park         1
 Dunning             1
 Portage Park        1
 Belmont Cragin      1
 South Lawndale      1
 Woodlawn            1
dtype: int64