# Coursera Capstone Project

This notebook will be used for the capstone project.

### Criteria

For the **first week**, you will be required to submit the following:
* A description of the problem and a discussion of the background. (15 marks)
* A description of the data and how it will be used to solve the problem. (15 marks)

For the **second week**, the final deliverables of the project will be:
* A link to your Notebook on your Github repository, showing your code. (15 marks)
* A full report consisting of all of the following components (15 marks):
    * Introduction where you discuss the business problem and who would be interested in this project.
    * Data where you describe the data that will be used to solve the problem and the source of the data.
    * Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and which machine learning techniques were used and why.
    * Results section where you discuss the results.
    * Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
    * Conclusion section where you conclude the report.
* Your choice of a presentation or blogpost. (10 marks)

In [62]:
# Importing libraries

import pandas as pd
import numpy as np
import random
import requests

# module to convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# modules to work with geodata
import geopandas as gp
from geopandas.tools import geocode
import folium
from folium.plugins import HeatMap

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming a json file into a pandas dataframe
# (json_normalize is exposed at the top level of pandas since version 1.0)
from pandas import json_normalize
import json

# import tools for webscraping
from bs4 import BeautifulSoup
from urllib.request import urlopen
import urllib

## 1. Problem Description and Background Discussion

### 1.1 Problem Description
As part of the Capstone Project for the Applied Data Science Coursera course, I have chosen to analyze the effectiveness of the Business Improvement Area (BIA) program of Toronto, ON, Canada. The question I will answer is: **"Are most venues located in or near Business Improvement Areas, or are there clusters of venues that should be made into BIAs?"**

### 1.2 Background Discussion
The **Business Improvement Area (BIA)** is an association of commercial property owners and tenants within a defined area who work in partnership with the City to create thriving, competitive, and safe business areas that attract shoppers, diners, tourists, and new businesses. The question is how effective this association and the areas it creates are at attracting shoppers, diners, tourists, and new businesses.

## 2. Data Description 

### 2.1 Description of Data and Data Source
The BIA layer represents the active BIAs in the City of Toronto that have been enacted by Council. Each BIA has been defined by a by-law and is represented by a Board of Management. The layer is updated as BIAs are created, amended or deleted by Council. This file is a polygon file that shows the BIA areas. The BIA data can be found at ...

I also used data about boroughs and neighborhoods in Toronto. They can be found as GeoJSON data at...

The second part of the data for the analysis comes via the Foursquare API. This dataset contains venues located in Toronto, with their location, name, venue category and user rating. The information will be collected for all the central neighborhoods in Toronto.

### 2.2 How will the Data be used to solve the Problem
In a first step, the locations of the venues will be plotted on a map as an overlay to the BIAs. This will show whether most venues are located within or very near BIAs and therefore give an indication of how effective BIAs are at promoting business in or near them.
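The map overlay gives a visual impression; the same question can also be put as a number. The following is a minimal sketch, not part of the original analysis, that counts the share of venues falling inside at least one BIA polygon using a ray-casting test. It assumes the polygons arrive as GeoJSON `Polygon` strings like those in the BIA `geometry` column; the function names `point_in_ring` and `share_inside`, the square polygon and the three venues are made up for illustration.

```python
import json

def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the closed ring of [lon, lat] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Only edges that straddle the point's latitude can be crossed by a
        # horizontal ray starting at the point
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside

def share_inside(venues, geometries):
    """Fraction of (lat, lon) venues that fall inside at least one polygon.

    `geometries` are GeoJSON Polygon strings. Only the outer ring is used,
    so holes and MultiPolygons are ignored in this sketch.
    """
    rings = [json.loads(g)["coordinates"][0] for g in geometries]
    hits = sum(
        any(point_in_ring(lon, lat, ring) for ring in rings)
        for lat, lon in venues
    )
    return hits / len(venues)

# Made-up square "BIA" and three made-up venues, for illustration only
square = (
    '{"type": "Polygon", "coordinates": [[[-79.40, 43.65], [-79.38, 43.65], '
    '[-79.38, 43.67], [-79.40, 43.67], [-79.40, 43.65]]]}'
)
venues = [(43.66, -79.39), (43.66, -79.41), (43.68, -79.39)]
print(share_inside(venues, [square]))
```

Note that GeoJSON stores coordinates as longitude first, latitude second, while the venue columns are latitude first. A geometry library such as shapely would be the more robust choice for real polygons with holes.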

In a second step, the venues will be clustered by their location and the clusters plotted to show whether they lie within the BIAs. This will help answer the second part of the question: whether there are clusters of venues in Toronto that are not part of any BIA. These clusters could be selected for the creation and siting of future BIAs.


### Getting the BIAs Data

Via the API provided by the City of Toronto 

In [2]:
# Get the dataset metadata by passing package_id to the package_show endpoint
# For example, to retrieve the metadata for this dataset:

url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/package_show"
params = { "id": "9edb9628-1213-42bd-8352-5c4ed28e9e42"}
response = urllib.request.urlopen(url, data=bytes(json.dumps(params), encoding="utf-8"))
package = json.loads(response.read())

# Get the data by passing the resource_id to the datastore_search endpoint
# See https://docs.ckan.org/en/latest/maintaining/datastore.html for detailed parameters options
# For example, to retrieve the data content for the first resource in the datastore:

for idx, resource in enumerate(package["result"]["resources"]):
    if resource["datastore_active"]:
        url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/datastore_search"
        p = { "id": resource["id"] }
        r = urllib.request.urlopen(url, data=bytes(json.dumps(p), encoding="utf-8"))
        data = json.loads(r.read())
        df_BIAs = pd.DataFrame(data["result"]["records"])
        break
df_BIAs.head()

Unnamed: 0,_id,AREA_ID,DATE_EFFECTIVE,AREA_ATTR_ID,PARENT_AREA_ID,AREA_SHORT_CODE,AREA_LONG_CODE,AREA_NAME,AREA_DESC,X,Y,LONGITUDE,LATITUDE,OBJECTID,Shape__Area,Shape__Length,geometry
0,3298,2481875,2020-02-04T17:20:36,26006975,,115-00,115-00,Rogers Road,Rogers Road,307227.635,4837983.077,-79.46989,43.681791,17568785,351093.855469,5936.862796,"{""type"": ""Polygon"", ""coordinates"": [[[-79.4662..."
1,3299,2481874,2020-02-04T17:20:36,26006974,,031-02,031-02,Bloor-Yorkville,Bloor-Yorkville,313738.285,4836723.196,-79.389159,43.670401,17568801,918046.484375,6613.691633,"{""type"": ""Polygon"", ""coordinates"": [[[-79.3872..."
2,3300,2481873,2020-02-04T17:20:36,26006973,,020-01,020-01,Little Italy,Little Italy,311705.037,4835053.901,-79.414394,43.655397,17568817,232341.589844,3917.542802,"{""type"": ""Polygon"", ""coordinates"": [[[-79.4205..."
3,3301,2481872,2020-02-04T17:20:36,26006972,,042-01,042-01,Liberty Village,Liberty Village,311152.727,4833083.985,-79.421265,43.63767,17568833,797292.066406,4400.913504,"{""type"": ""Polygon"", ""coordinates"": [[[-79.4246..."
4,3302,2481871,2020-02-04T17:20:36,26006971,,093-01,093-01,Leslieville,Leslieville,318224.026,4835848.463,-79.333555,43.66246,17568849,351302.890625,6457.749078,"{""type"": ""Polygon"", ""coordinates"": [[[-79.3240..."


In [3]:
df_BIAs.shape

(83, 17)

In [4]:
# dropping BIAs that are out of the central area of Toronto
area_names = ['Albion Islington Square', 'Wilson Village', 'Sheppard East Village', 'The Waterfront', 'Emery Village', 'DuKe Heights', 'Kennedy Road', 'Wexford Heights', 'Crossroads of the Danforth']
index_drop = df_BIAs['AREA_NAME'].isin(area_names)
index_drop = index_drop[index_drop == True]
index_drop.index

Int64Index([5, 17, 22, 28, 45, 50, 57, 58, 79], dtype='int64')

In [5]:
df_BIAs = df_BIAs.drop(index_drop.index)
df_BIAs.reset_index(drop = True, inplace = True)
df_BIAs.shape

(74, 17)

### Loading Data about Neighborhoods

A GeoJSON file is loaded and the geometry and names of the Neighborhoods are used to plot them onto a map of Toronto.

In [6]:
df_neighborhoods = gp.read_file('Data/Neighbourhoods.geojson.json')
print(df_neighborhoods.shape)
df_neighborhoods.head()

(140, 16)


Unnamed: 0,_id,AREA_ID,AREA_ATTR_ID,PARENT_AREA_ID,AREA_SHORT_CODE,AREA_LONG_CODE,AREA_NAME,AREA_DESC,X,Y,LONGITUDE,LATITUDE,OBJECTID,Shape__Area,Shape__Length,geometry
0,5181,25886861,25926662,49885,94,94,Wychwood (94),Wychwood (94),,,-79.425515,43.676919,16491505,3217960.0,7515.779658,"POLYGON ((-79.43592 43.68015, -79.43492 43.680..."
1,5182,25886820,25926663,49885,100,100,Yonge-Eglinton (100),Yonge-Eglinton (100),,,-79.40359,43.704689,16491521,3160334.0,7872.021074,"POLYGON ((-79.41096 43.70408, -79.40962 43.704..."
2,5183,25886834,25926664,49885,97,97,Yonge-St.Clair (97),Yonge-St.Clair (97),,,-79.397871,43.687859,16491537,2222464.0,8130.411276,"POLYGON ((-79.39119 43.68108, -79.39141 43.680..."
3,5184,25886593,25926665,49885,27,27,York University Heights (27),York University Heights (27),,,-79.488883,43.765736,16491553,25418210.0,25632.335242,"POLYGON ((-79.50529 43.75987, -79.50488 43.759..."
4,5185,25886688,25926666,49885,31,31,Yorkdale-Glen Park (31),Yorkdale-Glen Park (31),,,-79.457108,43.714672,16491569,11566690.0,13953.408098,"POLYGON ((-79.43969 43.70561, -79.44011 43.705..."


In [7]:
# dropping neighborhoods that are not in central Toronto
l = [1, 2, 3, 4, 5, 21, 22, 23, 24, 25, 26, 27, 33, 34, 35, 36, 37, 38, 46, 47, 48, 49, 50, 51, 52, 53]
df_neighborhoods_center = df_neighborhoods[df_neighborhoods.AREA_SHORT_CODE < 116]
df_neighborhoods_centerDROP = df_neighborhoods_center['AREA_SHORT_CODE'].isin(l)
df_neighborhoods_centerDROP = df_neighborhoods_centerDROP[df_neighborhoods_centerDROP == True]
df_neighborhoods_centerDROP.index

Int64Index([  3,   6,  20,  24,  25,  35,  37,  40,  56,  63,  64,  68,  69,
             79,  81,  82,  87,  99, 104, 107, 112, 119, 122, 126, 128, 129],
           dtype='int64')

In [8]:
# drop() and reset_index() return new DataFrames, so no slice of df_neighborhoods is modified in place
df_neighborhoods_center = df_neighborhoods_center.drop(df_neighborhoods_centerDROP.index).reset_index()
df_neighborhoods_center.shape

(89, 17)

### Plotting the BIA-Areas and Neighborhoods on a Map

The Folium library is used to plot the neighborhoods and the BIAs on a map of Toronto. The map shows the shapes of the BIAs in green and the shapes of the neighborhoods in blue.

In [9]:
# get location for map centering from Downtown Toronto
address = 'Toronto, Downtown'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto Downtown are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto Downtown are 43.6541737, -79.38081164513409.


In [27]:
# create an OpenStreetMap-based map, center it on a location using latitude and longitude and give it a starting zoom factor
m = folium.Map(location = [latitude, longitude], tiles = 'Stamen Toner', zoom_start = 12)

# create a feature group for the map
fg = folium.map.FeatureGroup(name='BIAs').add_to(m)
fg1 = folium.map.FeatureGroup(name='Center Venue Search').add_to(m)
fg2 = folium.map.FeatureGroup(name='Neighborhoods').add_to(m)

style_BIA = {'color': '#228B22', 'fillColor': '#228B22', 'lineColor': '#228B22'}

for i in range(len(df_BIAs['geometry'])):
    b = folium.GeoJson(df_BIAs['geometry'][i], style_function = lambda x: style_BIA)
    b.add_child(folium.Popup(df_BIAs['AREA_NAME'][i]))
    fg.add_child(b)

style_Neighborhoods = {'color': '#1E90FF', 'fillColor': '#1E90FF', 'lineColor': '#1E90FF', 'opacity': 0.4}    
    
for i in range(len(df_neighborhoods_center['geometry'])):
    a = folium.GeoJson(df_neighborhoods_center['geometry'][i], style_function = lambda x: style_Neighborhoods)
    a.add_child(folium.Popup(df_neighborhoods_center['AREA_NAME'][i]))
    fg2.add_child(a)   
    
for lat, lng, code in zip(df_neighborhoods_center['LATITUDE'], df_neighborhoods_center['LONGITUDE'], df_neighborhoods_center['AREA_SHORT_CODE']):
    # the short code is numeric, so convert it to text for the popup
    popup = folium.Popup(str(code), parse_html = True)
    c = folium.Circle(
        [lat, lng],
        radius = 800,
        popup = popup,
        color = 'red',
        fill = True)
    fg1.add_child(c)
    
folium.LayerControl().add_to(m)
    
# display the map
m

### Setting up the API for accessing foursquare data

In [50]:
import os

# read the Foursquare credentials from environment variables instead of storing them in the notebook
# (the variable names are a suggestion, adapt them to your setup)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604'
LIMIT = 50


### Getting Data for the Venues around the BIAs
Via an API request, the data for the venues in all neighborhoods is collected and stored in a DataFrame for easy manipulation. For each location (the center of a neighborhood), the function defined in the next cell gets the names of the venues within a radius of 800 m, their exact location (latitude, longitude) and the venue category.
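The request function in this notebook indexes straight into the response (`["response"]["groups"][0]["items"]` and `categories[0]`), which raises an error when a location returns no groups or a venue has no category. As a hedged sketch, assuming the same response layout that code reads, the parsing step could be separated out and made tolerant of missing fields. The function name `extract_venues`, the `Neighborhood` key and the sample payload are illustrative and not part of the Foursquare API.

```python
def extract_venues(payload, neighborhood):
    """Flatten one explore-style response into rows, skipping incomplete items."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        if "lat" not in location or "lng" not in location:
            continue  # a venue without coordinates cannot be mapped
        rows.append({
            "Neighborhood": neighborhood,
            "Venue ID": venue.get("id"),
            "Venue": venue.get("name"),
            "Venue Latitude": float(location["lat"]),
            "Venue Longitude": float(location["lng"]),
            # fall back to None when a venue has no category
            "Venue Category": categories[0].get("name") if categories else None,
        })
    return rows

# Made-up payload mimicking the nesting the notebook's request code relies on
sample = {"response": {"groups": [{"items": [
    {"venue": {"id": "a1", "name": "Sample Cafe",
               "location": {"lat": 43.66, "lng": -79.39},
               "categories": [{"name": "Cafe"}]}},
    {"venue": {"id": "b2", "name": "No Category",
               "location": {"lat": "43.67", "lng": "-79.40"},
               "categories": []}},
    {"venue": {"id": "c3", "name": "No Location",
               "location": {}, "categories": []}},
]}]}}

rows = extract_venues(sample, "Sample Neighborhood")
print(len(rows))
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame`, which keeps the column names next to the values instead of assigning them afterwards.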

In [51]:
# defining a function to get the venue ID, name, location and category via the Foursquare API

def getNearbyVenues(names, latitudes, longitudes, radius=800, LIMIT = 50):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['id'],
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['BIA', 
                  'BIA Latitude', 
                  'BIA Longitude',
                  'Venue ID',
                  'Venue', 
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']
    
    return nearby_venues

In [52]:
df_venues = getNearbyVenues(names=df_neighborhoods_center['AREA_NAME'],
                                   latitudes=df_neighborhoods_center['LATITUDE'],
                                   longitudes=df_neighborhoods_center['LONGITUDE']
                                  )

Wychwood (94)
Yonge-Eglinton (100)
Yonge-St.Clair (97)
Yorkdale-Glen Park (31)
Lambton Baby Point (114)
Lawrence Park North (105)
Lawrence Park South (103)
Leaside-Bennington (56)
Little Portugal (84)
Long Branch (19)
Maple Leaf (29)
Markland Wood (12)
Mimico (includes Humber Bay Shores) (17)
Moss Park (73)
Mount Dennis (115)
Mount Pleasant East (99)
Mount Pleasant West (104)
New Toronto (18)
Niagara (82)
North Riverdale (68)
North St.James Town (74)
O'Connor-Parkview (54)
Oakwood Village (107)
Old East York (58)
Palmerston-Little Italy (80)
Parkwoods-Donalda (45)
Playter Estates-Danforth (67)
Princess-Rosethorn (10)
Regent Park (72)
Rockcliffe-Smythe (111)
Roncesvalles (86)
Rosedale-Moore Park (98)
Runnymede-Bloor West Village (89)
Rustic (28)
South Parkdale (85)
South Riverdale (70)
St.Andrew-Windfields (40)
Stonegate-Queensway (16)
Taylor-Massey (61)
The Beaches (63)
Thorncliffe Park (55)
Trinity-Bellwoods (81)
University (79)
Victoria Village (43)
Waterfront Communities-The Island 

In [53]:
print(df_venues.shape)
df_venues.head()

(2694, 8)


Unnamed: 0,BIA,BIA Latitude,BIA Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wychwood (94),43.676919,-79.425515,4b86e89df964a52051a531e3,Wychwood Barns Farmers' Market,43.68001,-79.423849,Farmers Market
1,Wychwood (94),43.676919,-79.425515,4afc6ed3f964a520a82222e3,Wychwood Barns,43.680028,-79.42381,Event Space
2,Wychwood (94),43.676919,-79.425515,52418b0b7e48222eea81d2d2,Pukka Restaurant,43.681055,-79.429187,Indian Restaurant
3,Wychwood (94),43.676919,-79.425515,4aedbe8df964a52080ce21e3,Hillcrest Park,43.676012,-79.424787,Park
4,Wychwood (94),43.676919,-79.425515,4aeb7abbf964a52080c221e3,Ferro Bar Cafe,43.68108,-79.42857,Italian Restaurant


In [54]:
# save df_venues to csv
df_venues.to_csv('Data/df_venues.csv')

### Preparing the Venue Data for Analysis

Cleaning the data: dropping missing values and converting latitude and longitude to float.

In [55]:
# Prepare the data for Heatmap
# Ensure you're handing it floats
df_venues['Venue Latitude'] = df_venues['Venue Latitude'].astype(float)
df_venues['Venue Longitude'] = df_venues['Venue Longitude'].astype(float)

# Filter the DF for rows, then columns, then remove NaNs
heat_df = df_venues[['Venue Latitude', 'Venue Longitude']]
heat_df = heat_df.dropna(axis=0, subset=['Venue Latitude','Venue Longitude'])

# List comprehension to build the list of [latitude, longitude] pairs
heat_data = [[row['Venue Latitude'],row['Venue Longitude']] for index, row in heat_df.iterrows()]

In [59]:
print('Shape of heat_df:', heat_df.shape)

Shape of heat_df: (2694, 2)


## Analysing the Data
The collected data is analysed in the following ways:
1. Mapping the Venues with a Heatmap to visualise the Venue Density
2. Clustering the Venues with k-means
3. Mapping the Clusters to visualise Venue Clusters
4. Identify Clusters without BIAs
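
The clustering step listed above is not shown in this part of the notebook. As a rough sketch of what k-means does with venue coordinates, here is a plain-Python version of Lloyd's algorithm. It treats latitude and longitude as flat coordinates, which is only an approximation over a city-sized area, and the number of clusters `k`, the function name `kmeans` and the sample points are assumptions for illustration. In practice a library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on (lat, lon) pairs; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its closest centroid
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

# Two made-up groups of venue coordinates, for illustration only
points = [
    (43.650, -79.380), (43.651, -79.381), (43.649, -79.379),
    (43.700, -79.450), (43.701, -79.451), (43.699, -79.449),
]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The cluster centroids can then be checked against the BIA polygons to find clusters that lie outside every BIA.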

### 1. Mapping the Venue Density

A heatmap shows the density of venues across the neighborhoods. As the map shows, most venues are located in or very near BIAs. This is a first, purely visual indication of the effectiveness of BIAs in promoting the creation of venues.
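
To make "very near" less dependent on how the map is read, one option is to measure the distance from each venue to the nearest BIA centre point. The sketch below uses the haversine formula and the `LATITUDE`/`LONGITUDE` values of two BIAs from the table earlier in this notebook; the 500 m threshold, the function names and the two venues are assumptions for illustration. Since a BIA is a polygon rather than a point, distance to its centre is only a rough proxy for proximity.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6371000  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

def share_near(venues, centres, max_m=500):
    """Fraction of venues within max_m metres of at least one centre point."""
    near = sum(
        any(haversine_m(vlat, vlon, clat, clon) <= max_m for clat, clon in centres)
        for vlat, vlon in venues
    )
    return near / len(venues)

# Centre points of Little Italy and Bloor-Yorkville as listed in the BIA table
bia_centres = [(43.655397, -79.414394), (43.670401, -79.389159)]
# Two made-up venues: one close to Little Italy, one far from both centres
sample_venues = [(43.6555, -79.4140), (43.7000, -79.3000)]
print(share_near(sample_venues, bia_centres))
```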

In [63]:
# create map of Toronto with BIAs (blue) and Venues within the BIAs (red)

toronto_map = folium.Map(location = [latitude, longitude], tiles = 'Stamen Toner', zoom_start = 12)

# create a feature group for the map
fg = folium.map.FeatureGroup(name='BIAs').add_to(toronto_map)
fg1 = folium.map.FeatureGroup(name = 'Venues').add_to(toronto_map)
fg4 = folium.map.FeatureGroup(name = 'Density of Venues').add_to(toronto_map)

# add geojson data for the BIAs to map
for i in range(len(df_BIAs['geometry'])):
    b = folium.GeoJson(df_BIAs['geometry'][i])
    b.add_child(folium.Popup(df_BIAs['AREA_NAME'][i]))
    fg.add_child(b)
    
for lat, lng, name in zip(df_venues['Venue Latitude'], df_venues['Venue Longitude'], df_venues['Venue']):
    popup = folium.Popup(name, parse_html = True)
    c = folium.Circle(
        [lat, lng],
        radius = 2,
        popup = popup,
        color = 'red',
        fill = True)
    fg1.add_child(c)
    

# Plot heatmap
h = HeatMap(heat_data)
fg4.add_child(h)    
    
folium.LayerControl().add_to(toronto_map)
    
# display the map
toronto_map