<div align = "center"><h1> Coursera Capstone Project - Battle of the Neighbourhoods</h1> </div>
Project By: Jothika Sundaram   

<h2> Table of Contents <a id = "TOC"></a></h2>

A. <a href = "#Intro">Introduction and Business Problem</a>  
B. <a href = "#Methodology">Data and Methodology</a>    
C. <a href = "#Data">Data Collection and Cleaning</a>    
D. <a href = "#Analysis">Analysis</a>   
E. <a href = "#Results">Results and Discussion</a>    
F. <a href = "#Conclusion">Conclusion</a>    

<h2><a id = "Intro"></a>A. Introduction and Business Problem</h2>

The city of Toronto is one of the major metropolises in Canada. With a population of over 2.93 million, it is the most populous city in the country and is known for its iconic skyscrapers, bustling city life and dynamic ethnic diversity. For these reasons, Toronto is also an international centre for business and finance, and a major economic hub in Canada.  

These factors also encourage entrepreneurs, small business owners and startup companies to open their businesses in Toronto. This project aims to act as a "startup company guide" for new entrepreneurs. It analyses the various businesses and local venues located across the city, along with local population demographics such as ethnicity and age group. This information will then give us an idea of **what kinds** of business should open **in which area** of the city, along with the demographics of the local population that they would target.

<h2>B.1 Data </h2><a id = "Methodology"></a> 

To perform this analysis we require the following sets of data:

*  In order to determine the local population demographics, we require the city's **neighbourhood profiles**. This dataset can be obtained from the **[City of Toronto Open Data Portal](https://open.toronto.ca/dataset/neighbourhood-profiles/)**. It includes population distribution and demographics such as age and ethnicity groups. With this we can discover the characteristics of the audiences surrounding the types of venues in each neighbourhood. [(Click here to download the dataset)](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv)


*  We can obtain the **location data** of the city using the **[Foursquare API](https://foursquare.com/)**. With this we can analyse geographical features, such as popular venues and companies in each borough. This will give us an idea of how businesses are spread out across the city, which will help us find locations where potential entrepreneurs could open a venue. 


*   In order to retrieve the information we want from the Foursquare API, we will need to provide the specific locations we want to explore. I will be using a dataset of the different postal code areas in Toronto, along with their respective boroughs and neighbourhoods. This can be obtained by scraping **[this Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)** of postal codes in Toronto.     


* We will need the geographic coordinates of each borough and neighbourhood to pass to the Foursquare API. These will be obtained through the Python geocoding library **[GeoPy](https://geopy.readthedocs.io/en/stable/)**. 

<h2>B.2 Methodology</h2>

1. We will examine the **neighbourhood profiles** of Toronto to get an idea of the demographics in each area. We will then visualize our findings using **choropleth maps** to see the distribution of these different groups.    
          

2. We will collect and clean the data required to feed our **Foursquare API** to get the location data of the different venues across the city. We will visualize these locations on a map.  

   
3. We will do a quantitative analysis on the **most common venues by category** located in each borough and use bar charts to visualize these findings.  


4. We will cluster the neighbourhoods using the **k-means clustering** ML algorithm in order to partition these areas based on the **most common venue type**. These clusters will then be visualized on a map.   
    
    
5. Finally, we will be able to see the distribution of venues across the city along with their local population demographics. This will provide potential locations and target audiences for new entrepreneurs based on their business needs.
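
Step 4 can be sketched on a toy example. The venue data below is made up, and the small `kmeans` function is a plain NumPy illustration of the algorithm rather than the implementation used later in the project (a library such as scikit-learn would normally be used). Each neighbourhood is described by the share of its venues in each category, and neighbourhoods with similar shares end up in the same cluster.

```python
import numpy as np
import pandas as pd

# Hypothetical venue data: one row per venue (names and categories are made up).
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D"],
    "Category": ["Cafe", "Cafe", "Cafe", "Cafe", "Cafe", "Bakery",
                 "Park", "Park", "Park", "Park", "Park", "Gym"],
})

# One-hot encode the categories, then average per neighbourhood to get
# the share of each venue category in each neighbourhood.
onehot = pd.get_dummies(venues["Category"]).astype(float)
onehot["Neighbourhood"] = venues["Neighbourhood"]
profile = onehot.groupby("Neighbourhood").mean()

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal k-means (Lloyd's algorithm): returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels

clusters = pd.Series(kmeans(profile.to_numpy(), k=2),
                     index=profile.index, name="Cluster")
```

Here the cafe-heavy neighbourhoods A and B fall into one cluster and the park-heavy neighbourhoods C and D into the other; which cluster is numbered 0 or 1 depends on the random starting points.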

<h2><a id = "Data"></a>C. Data Collection and Cleaning</h2>
<a href = "#TOC">back to table of contents</a>

### In this Section: <a id = "CTOP"></a>
1. <a href = "#census">Analysing Ethnic Origins of Residents in Toronto</a>   
2. <a href = "#choro-ethnic">Choropleth Map of Foreign Ethnic Origin Density</a>
3. <a href = "#age">Analysing Age Groups of Residents in Toronto</a>
4. <a href = "#choro-age">Choropleth Map of Different Age Groups</a>

First we need to import all the required libraries:

In [1]:
import pandas as pd
import numpy as np
import json # library to handle JSON files
import requests # library to handle requests
import io
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; formerly in pandas.io.json)
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.colors as cols
import matplotlib.pyplot as plt
mpl.style.use('ggplot') # optional: for ggplot-like style
%matplotlib inline
import seaborn as sns
import math
import folium
from folium import plugins

<h2><a id = 'census'></a> Analysing Ethnic Origins</h2>  
<a href = "#CTOP">back to section top</a>

The dataset of neighbourhood profiles is fetched from the URL below and read into a dataframe.

In [2]:
url = 'https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv'
tor_data = pd.read_csv(url)

In [3]:
tor_data.head()

Unnamed: 0,_id,Category,Topic,Data Source,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,1,Neighbourhood Information,Neighbourhood Information,City of Toronto,Neighbourhood Number,,129,128,20,95,...,37,7,137,64,60,94,100,97,27,31
1,2,Neighbourhood Information,Neighbourhood Information,City of Toronto,TSNS2020 Designation,,No Designation,No Designation,No Designation,No Designation,...,No Designation,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,NIA,Emerging Neighbourhood
2,3,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2016",2731571,29113,23757,12054,30526,...,16936,22156,53485,12541,7865,14349,11817,12528,27593,14804
3,4,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2011",2615060,30279,21988,11904,29177,...,15004,21343,53350,11703,7826,13986,10578,11652,27713,14687
4,5,Population,Population and dwellings,Census Profile 98-316-X2016001,Population Change 2011-2016,4.50%,-3.90%,8.00%,1.30%,4.60%,...,12.90%,3.80%,0.30%,7.20%,0.50%,2.60%,11.70%,7.50%,-0.40%,0.80%


I created a function that will quickly reset the index of a given dataframe, which we will need to use for many of the dataframes we make.

In [4]:
# This function will quickly reset the index of a given dataframe if needed
def reset_index_(dataframe, col_name):
    dataframe.reset_index(inplace = True)
    if len(col_name)==0:
        if 'index' in dataframe:
            dataframe.drop(columns='index',inplace= True)  # discard the old index
    else:
        dataframe.rename(columns = {'index':col_name},inplace = True)  # keep the old index under a new name
    if 'level_0' in dataframe:
        dataframe.drop(columns='level_0',inplace= True)
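
To make the helper's two modes concrete, here is a minimal sketch on a toy dataframe. The `reset_index_copy` name and the sample data are assumptions for illustration only. Unlike the helper above, it returns a new dataframe instead of modifying its input in place.

```python
import pandas as pd

def reset_index_copy(dataframe, col_name=""):
    """Return a copy with a fresh 0..n-1 index (hypothetical helper).

    If col_name is empty the old index is discarded; otherwise it is
    kept as a column named col_name.
    """
    if col_name:
        return dataframe.rename_axis(col_name).reset_index()
    return dataframe.reset_index(drop=True)

# Toy data: the index labels are left over from an earlier filter.
toy = pd.DataFrame({"Total": [10, 20]}, index=[5, 9])

dropped = reset_index_copy(toy)                 # old index thrown away
kept = reset_index_copy(toy, "Neighbourhood")   # old index kept as a column
```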

In [5]:
tor_data.drop(columns=['_id','Data Source'],inplace = True)

Let's focus on the ethnic origins of residents in this dataset. 

In [6]:
tor_ethnics = pd.DataFrame(tor_data[tor_data['Category']=='Ethnic origin'])

In [7]:
reset_index_(tor_ethnics,"")

In [8]:
tor_ethnics.head()

Unnamed: 0,Category,Topic,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,Ethnic origin,Ethnic origin population,Guadeloupean,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Ethnic origin,Ethnic origin population,Scottish,256250,600,725,1720,5225,2835,510,...,975,1720,2760,2915,1485,1915,2395,2665,725,375
2,Ethnic origin,Ethnic origin population,Total - Ethnic origin for the population in pr...,2691665,28820,23470,12025,28635,26995,15580,...,16670,22135,53015,12425,7845,13250,11805,12295,27575,14020
3,Ethnic origin,Ethnic origin population,North American Aboriginal origins,35630,40,105,305,475,230,90,...,105,130,605,425,270,335,140,215,220,105
4,Ethnic origin,Ethnic origin population,First Nations (North American Indian),27610,25,90,200,345,175,75,...,60,110,470,355,235,275,90,130,200,85


In [9]:
tor_ethnics['Characteristic'].unique()

array([' Guadeloupean', ' Scottish',
       'Total - Ethnic origin for the population in private households - 25% sample data',
       ' North American Aboriginal origins',
       ' First Nations (North American Indian)', ' Inuit', ' Mtis',
       ' Other North American origins', ' Acadian', ' American',
       ' Canadian', ' New Brunswicker', ' Newfoundlander',
       ' Nova Scotian', ' Ontarian', ' Qubcois', ' Portuguese',
       ' Other North American origins; n.i.e.', ' European origins',
       ' British Isles origins', ' Channel Islander', ' Cornish',
       ' English', ' Irish', ' Manx', ' Welsh',
       ' British Isles origins; n.i.e.', ' French origins', ' Alsatian',
       ' Breton', ' Corsican', ' French',
       ' Western European origins (except French origins)', ' Austrian',
       ' Bavarian', ' Belgian', ' Dutch', ' Flemish', ' Frisian',
       ' German', ' Luxembourger', ' Swiss',
       ' Western European origins; n.i.e.',
       ' Northern European origins (except Br

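There are many unique origins, but we want to focus on those from outside North America and beyond the British Isles and French origins traditionally associated with it. The list below collects the rows to drop, including the overall total row and the aggregate `European origins` row.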

In [10]:
drop_origins = [ ' Other North American origins', ' Acadian', ' American',
       ' Canadian', ' New Brunswicker', ' Newfoundlander',
       ' Nova Scotian', ' Ontarian', ' Qubcois',
       ' Other North American origins; n.i.e.',
 'Total - Ethnic origin for the population in private households - 25% sample data',' European origins',
       ' British Isles origins', ' Channel Islander', ' Cornish',
       ' English', ' Irish', ' Manx', ' Welsh',
       ' British Isles origins; n.i.e.', ' French origins', ' Alsatian',
       ' Breton', ' Corsican', ' French']

We need to drop these rows from the dataframe. We also need to convert all numerical values to type int.

In [11]:
for index in tor_ethnics.index:
    for origin in drop_origins:
        if tor_ethnics.loc[index,'Characteristic']== origin: # drop all the unwanted ethnic origins from the dataset
            tor_ethnics.drop(index,axis=0,inplace = True)
            break
reset_index_(tor_ethnics,"")

for index in tor_ethnics.index: # convert all numerical values to type int
    for col_name in tor_ethnics.columns[3:]:
        value = tor_ethnics.loc[index,col_name] 
        value = value.replace(',',"")
        tor_ethnics.loc[index,col_name] = int(value)
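
The same two steps can also be written without explicit loops. The following is a minimal sketch on a toy dataframe with made-up values; `toy`, `unwanted` and `count_cols` are illustrative names, and the sketch assumes the counts are stored as strings with thousands separators.

```python
import pandas as pd

# Toy stand-in for tor_ethnics (values are made up); counts arrive as strings.
toy = pd.DataFrame({
    "Characteristic": [" Canadian", " Scottish", " Inuit"],
    "Agincourt North": ["1,200", "600", "0"],
    "Alderwood": ["3,450", "1,720", "15"],
})
unwanted = [" Canadian"]

# Keep only the rows whose Characteristic is not in the unwanted list.
filtered = toy[~toy["Characteristic"].isin(unwanted)].reset_index(drop=True)

# Strip thousands separators and convert every count column to int.
count_cols = filtered.columns[1:]
filtered[count_cols] = filtered[count_cols].apply(
    lambda col: col.str.replace(",", "", regex=False).astype(int)
)
```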

In [12]:
tor_ethnics.head()

Unnamed: 0,Category,Topic,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,Ethnic origin,Ethnic origin population,Guadeloupean,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Ethnic origin,Ethnic origin population,Scottish,256250,600,725,1720,5225,2835,510,...,975,1720,2760,2915,1485,1915,2395,2665,725,375
2,Ethnic origin,Ethnic origin population,North American Aboriginal origins,35630,40,105,305,475,230,90,...,105,130,605,425,270,335,140,215,220,105
3,Ethnic origin,Ethnic origin population,First Nations (North American Indian),27610,25,90,200,345,175,75,...,60,110,470,355,235,275,90,130,200,85
4,Ethnic origin,Ethnic origin population,Inuit,515,0,0,15,20,10,0,...,0,0,25,0,0,10,0,0,0,0


In [13]:
columns = tor_ethnics.columns[3:-1]
tor_ethnics['Total'] = tor_ethnics[columns].sum(axis=1).astype(int) # add a Total column at the end

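Now we've added a `Total` column that sums the counts for each ethnic origin across the selected columns. Note that the slice `columns[3:-1]` starts at `City of Toronto` and stops before the last neighbourhood, so each total includes the city-wide figure and comes out at roughly double the number of residents.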
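
Slicing columns by position makes it easy to include the city-wide column or to miss a neighbourhood. As a minimal sketch on made-up data, the neighbourhood columns can instead be selected by name. `hood_cols` and the `Neighbourhood total` column are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for tor_ethnics with made-up counts.
toy = pd.DataFrame({
    "Characteristic": ["Scottish", "Inuit"],
    "City of Toronto": [30, 5],
    "Agincourt North": [10, 2],
    "Alderwood": [20, 3],
})

# Select neighbourhood columns by name rather than by position, so the
# city-wide column is left out and no neighbourhood is skipped.
hood_cols = [c for c in toy.columns
             if c not in ("Characteristic", "City of Toronto")]
toy["Neighbourhood total"] = toy[hood_cols].sum(axis=1).astype(int)
```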

In [14]:
tor_ethnics.head()

Unnamed: 0,Category,Topic,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,...,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park,Total
0,Ethnic origin,Ethnic origin population,Guadeloupean,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Ethnic origin,Ethnic origin population,Scottish,256250,600,725,1720,5225,2835,510,...,1720,2760,2915,1485,1915,2395,2665,725,375,512145
2,Ethnic origin,Ethnic origin population,North American Aboriginal origins,35630,40,105,305,475,230,90,...,130,605,425,270,335,140,215,220,105,71170
3,Ethnic origin,Ethnic origin population,First Nations (North American Indian),27610,25,90,200,345,175,75,...,110,470,355,235,275,90,130,200,85,55080
4,Ethnic origin,Ethnic origin population,Inuit,515,0,0,15,20,10,0,...,0,25,0,0,10,0,0,0,0,1035


Let's create a dataframe that holds the total number of residents of these ethnic origins in each neighbourhood.

In [15]:
local_ethnics = pd.DataFrame(tor_ethnics.drop(columns=['Category','Topic','Characteristic','Total']))

In [16]:
local_ethnics = local_ethnics.T

In [17]:
local_ethnics['Total'] = local_ethnics.sum(axis = 1).astype(int)

In [18]:
reset_index_(local_ethnics,'Neighbourhood')

In [19]:
local_ethnics.drop(0,inplace = True)

In [20]:
local_ethnics.sort_values('Total',ascending = False,inplace = True)

In [21]:
local_ethnics.head()

Unnamed: 0,Neighbourhood,0,1,2,3,4,5,6,7,8,...,245,246,247,248,249,250,251,252,253,Total
123,Waterfront Communities-The Island,0,8355,965,665,0,305,1385,6925,570,...,70,235,210,40,0,0,0,0,0,165680
130,Willowdale East,0,1900,205,165,10,45,470,1795,200,...,45,55,40,10,0,0,0,0,0,146025
133,Woburn,0,2760,605,470,25,110,800,1505,130,...,215,80,10,0,80,65,0,10,0,144360
106,Rouge,0,2485,285,185,10,90,865,1590,45,...,125,50,30,10,0,10,0,0,0,129445
74,Malvern,0,1360,335,270,30,65,590,785,30,...,120,55,20,15,15,15,0,10,0,127005
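
The reshaping above (transpose, sum across origins, sort) can be sketched as a single chain. The example uses a toy dataframe with made-up counts, and `per_hood` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for tor_ethnics with made-up counts.
toy = pd.DataFrame({
    "Characteristic": ["Scottish", "Inuit"],
    "City of Toronto": [30, 5],
    "Agincourt North": [10, 2],
    "Alderwood": [20, 3],
})

per_hood = (
    toy.set_index("Characteristic")      # origins become the row labels
       .drop(columns="City of Toronto")  # leave out the city-wide figures
       .T                                # neighbourhoods become the rows
)
per_hood["Total"] = per_hood.sum(axis=1)  # total across all origins
per_hood = (
    per_hood.rename_axis("Neighbourhood")
            .reset_index()
            .sort_values("Total", ascending=False)
)
```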


<h2><a id = "choro-ethnic"></a> Choropleth Map of Foreign Ethnic Origin Density</h2>  
<a href = "#CTOP">back to section top</a>

We need to download a geojson file of Toronto to use in our choropleth map.    
**NOTE:** if you download the json file yourself, you will need to open the file and remove `var neighbourhoods =`
on the first line of the file.

In [22]:
import wget
# download toronto geojson file
json_url = r'https://raw.githubusercontent.com/adamw523/toronto-geojson/master/neighbourhoods.js'
# json_file = wget.download(json_url)  # uncomment to download
    
print('GeoJSON file downloaded!')

GeoJSON file downloaded!


In [23]:
json_file = r'neighbourhoods.js'
with open(json_file) as tor_json:
    tor_geo = json.load(tor_json)
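
The manual edit described in the note can also be done in code. The following is a minimal sketch, assuming the file consists of the `var neighbourhoods =` prefix followed by a JSON object. The `parse_js_geojson` helper is hypothetical, and the handling of a trailing semicolon is an assumption about the file rather than something confirmed here.

```python
import json

def parse_js_geojson(text, prefix="var neighbourhoods ="):
    """Parse GeoJSON that may be wrapped in a JavaScript assignment."""
    text = text.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    # A trailing semicolon is valid JavaScript but not valid JSON.
    text = text.strip().rstrip(";")
    return json.loads(text)

# Tiny made-up example in the same shape as the downloaded file.
sample = 'var neighbourhoods = {"type": "FeatureCollection", "features": []};'
geo = parse_js_geojson(sample)
```

With a real file, the text would be read with `open(json_file).read()` and passed to the helper.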

Now we can visualize the density of foreign ethnic residents in the different neighbourhoods of the city.

In [104]:
# create a plain world map using the geographic coordinates of Toronto
tor_lat = 43.6532
tor_lon = -79.3832
tor_choro = folium.Map(location=[tor_lat,tor_lon], zoom_start=12)


# generate choropleth map using the local_ethnics data
choro_map = folium.Choropleth(
    geo_data=json_file,
    name = "Ethnic Residents",
    data=local_ethnics,
    columns=['Neighbourhood', 'Total'],
    key_on='feature.properties.HOOD',
    fill_color='RdPu', 
    fill_opacity=1, 
    line_opacity=0.2,
    legend_name='Residents of Foreign Ethnic Origin'
).add_to(tor_choro)


style_function = "font-size: 15px; font-weight: bold"
choro_map.geojson.add_child(
    folium.features.GeoJsonTooltip(['HOOD'], style=style_function, labels=False))

# create a layer control
folium.LayerControl().add_to(tor_choro)


# # display map
tor_choro
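
The choropleth joins each row of the data to a polygon by comparing the `Neighbourhood` column with the `HOOD` property named in `key_on`, so names that differ between the two sources will not be matched. A quick way to check this before plotting is sketched below with a made-up GeoJSON fragment and made-up data; the variable names are illustrative.

```python
import pandas as pd

# Made-up GeoJSON fragment and data; only the HOOD property matters here.
geo = {"features": [
    {"properties": {"HOOD": "Alderwood"}},
    {"properties": {"HOOD": "Annex"}},
]}
data = pd.DataFrame({"Neighbourhood": ["Alderwood", "Annex ", "Woburn"],
                     "Total": [1, 2, 3]})

geo_names = {f["properties"]["HOOD"] for f in geo["features"]}
data_names = set(data["Neighbourhood"])

missing_from_geo = sorted(data_names - geo_names)   # rows with no polygon
missing_from_data = sorted(geo_names - data_names)  # polygons with no value
```

In this toy example the stray trailing space in `"Annex "` is enough to break the match, which is the kind of mismatch the check is meant to surface.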

<h2>Analysing Age Groups<a id = "age"></a></h2>
<a href = "#CTOP">back to section top</a>

Let's take a chunk out of our Toronto census dataframe that contains the age groups of residents.

In [30]:
tor_age_groups = pd.DataFrame(tor_data[tor_data['Topic']=='Age characteristics'].head(6))
reset_index_(tor_age_groups,"")

In [33]:
tor_age_groups

Unnamed: 0,Category,Topic,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,Population,Age characteristics,Children (0-14 years),398135,3840,3075,1760,2360,3605,2325,...,1785,3555,9625,2325,1165,1860,1800,1210,4045,1960
1,Population,Age characteristics,Youth (15-24 years),340270,3705,3360,1235,3750,2730,1940,...,2230,2625,7660,1035,675,1320,1225,920,4750,1870
2,Population,Age characteristics,Working Age (25-54 years),1229555,11305,9965,5220,15040,10810,6655,...,7480,8140,21945,6165,3790,6420,5860,5960,12290,5860
3,Population,Age characteristics,Pre-retirement (55-64 years),336670,4230,3265,1825,3480,3555,2030,...,2070,2905,6245,1625,1150,1595,1325,1540,2965,1810
4,Population,Age characteristics,Seniors (65+ years),426945,6045,4105,2015,5910,6975,2940,...,3370,4905,8010,1380,1095,3150,1600,2905,3530,3295
5,Population,Age characteristics,Older Seniors (85+ years),66000,925,555,320,1040,1640,710,...,655,885,1130,170,125,880,165,470,400,775


In [70]:
for index in tor_age_groups.index: # convert all numerical values to type int
    for col_name in tor_age_groups.columns[3:]:
        value = tor_age_groups.loc[index,col_name] 
        value = value.replace(',',"")
        tor_age_groups.loc[index,col_name] = int(value)

Let's create smaller datasets of each age group to visualize in a choropleth map.   
These will include: ``Children (0-14 years), Youth (15-24 years), Working Age (25-54 years) and Seniors (65+ years)``

In [134]:
children = pd.DataFrame(tor_age_groups.head(1))
children = children.T.reset_index()[4:]
reset_index_(children,"Neighbourhood")
children.rename(columns = {0:"Total"},inplace = True)
children.head()

Unnamed: 0,Neighbourhood,Total
0,Agincourt North,3840
1,Agincourt South-Malvern West,3075
2,Alderwood,1760
3,Annex,2360
4,Banbury-Don Mills,3605


In [125]:
youth = pd.DataFrame(tor_age_groups[1:2])
youth = youth.T.reset_index()[4:]
reset_index_(youth,"Neighbourhood")
youth.rename(columns = {1:"Total"},inplace = True)
youth.head()

Unnamed: 0,Neighbourhood,Total
0,Agincourt North,3705
1,Agincourt South-Malvern West,3360
2,Alderwood,1235
3,Annex,3750
4,Banbury-Don Mills,2730


In [121]:
working_age = pd.DataFrame(tor_age_groups[2:3])
working_age = working_age.T.reset_index()[4:]
reset_index_(working_age,"Neighbourhood")
working_age.rename(columns = {2:"Total"},inplace = True)
working_age.head()

Unnamed: 0,Neighbourhood,Total
0,Agincourt North,11305
1,Agincourt South-Malvern West,9965
2,Alderwood,5220
3,Annex,15040
4,Banbury-Don Mills,10810


It is clear that the working age group has the largest population out of all age groups.

In [122]:
seniors = pd.DataFrame(tor_age_groups[4:5])
seniors = seniors.T.reset_index()[4:]
reset_index_(seniors,"Neighbourhood")
seniors.rename(columns = {4:"Total"},inplace = True)
seniors.head()

Unnamed: 0,Neighbourhood,Total
0,Agincourt North,6045
1,Agincourt South-Malvern West,4105
2,Alderwood,2015
3,Annex,5910
4,Banbury-Don Mills,6975
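
The four cells above repeat the same steps with a different row each time. As a minimal sketch, the pattern could be wrapped in one function. `age_group_totals` is a hypothetical helper, and the toy dataframe reuses a few values from the table above for two neighbourhoods only.

```python
import pandas as pd

def age_group_totals(groups, characteristic):
    """Return a Neighbourhood/Total frame for one age group (hypothetical helper)."""
    row = groups.loc[groups["Characteristic"] == characteristic].iloc[0]
    # Drop the descriptive fields and the city-wide figure, keeping neighbourhoods.
    row = row.drop(["Category", "Topic", "Characteristic", "City of Toronto"])
    return (row.rename("Total")
               .rename_axis("Neighbourhood")
               .reset_index())

# Cut-down copy of tor_age_groups: two age groups, two neighbourhoods.
toy = pd.DataFrame({
    "Category": ["Population", "Population"],
    "Topic": ["Age characteristics"] * 2,
    "Characteristic": ["Children (0-14 years)", "Youth (15-24 years)"],
    "City of Toronto": [398135, 340270],
    "Agincourt North": [3840, 3705],
    "Alderwood": [1760, 1235],
})
youth_toy = age_group_totals(toy, "Youth (15-24 years)")
```

Selecting the row by its `Characteristic` label rather than by position avoids having to remember which slice corresponds to which age group.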


<h2><a id = "choro-age"></a>Choropleth Map of Age Groups</h2>  
<a href = "#CTOP">back to section top</a>

Now we can create a map showing the distribution of these different age groups across the city.   
You can toggle the layer of each dataset by clicking on the layer icon on the right-hand side of the map.


In [132]:
# create a plain world map using the geographic coordinates of Toronto
tor_age = folium.Map(location=[tor_lat,tor_lon],zoom_start=10)


# generate one choropleth layer per age group from the neighbourhood totals
child_map = folium.Choropleth(
    geo_data=json_file,
    name = 'Children 0-14',
    data=children,
    columns=['Neighbourhood', 'Total'],
    key_on='feature.properties.HOOD',
    fill_color='RdPu', 
    fill_opacity=0.8, 
    line_opacity=0.2,
    legend_name='Children 0-14 in Toronto'
).add_to(tor_age)



youth_map = folium.Choropleth(
    geo_data=json_file,
    name = 'Youth 15-24',
    data=youth,
    columns=['Neighbourhood', 'Total'],
    key_on='feature.properties.HOOD',
    fill_color='OrRd', 
    fill_opacity=0.8, 
    line_opacity=0.2,
    legend_name='Youth 15-24 in Toronto'
).add_to(tor_age)



adults_map = folium.Choropleth(
    geo_data=json_file,
    name = 'Working Age 25-54',
    data=working_age,
    columns=['Neighbourhood', 'Total'],
    key_on='feature.properties.HOOD',
    fill_color='YlGn', 
    fill_opacity=0.8, 
    line_opacity=0.2,
    legend_name='Working Age 25-54 in Toronto'
).add_to(tor_age)



seniors_map = folium.Choropleth(
    geo_data=json_file,
    name = 'Seniors 65+',
    data=seniors,
    columns=['Neighbourhood', 'Total'],
    key_on='feature.properties.HOOD',
    fill_color='BuPu', 
    fill_opacity=0.8, 
    line_opacity=0.2,
    legend_name='Seniors 65+ in Toronto'
).add_to(tor_age)

style_function = "font-size: 15px; font-weight: bold"

child_map.geojson.add_child(folium.features.GeoJsonTooltip(['HOOD'], style=style_function, labels=False))
youth_map.geojson.add_child(folium.features.GeoJsonTooltip(['HOOD'], style=style_function, labels=False))

adults_map.geojson.add_child(folium.features.GeoJsonTooltip(['HOOD'], style=style_function, labels=False))
seniors_map.geojson.add_child(folium.features.GeoJsonTooltip(['HOOD'], style=style_function, labels=False))
# create a layer control
folium.LayerControl().add_to(tor_age)


# # display map
tor_age
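
Since the four layers differ only in their name, colour palette and data, their settings could be generated from a small table. The sketch below builds only the keyword arguments, copied from the cell above, and does not call folium itself. `LAYERS` and `choropleth_kwargs` are illustrative names, and `data=None` is a placeholder for the matching dataframe.

```python
# Layer names and palettes copied from the cell above.
LAYERS = [
    ("Children 0-14", "RdPu"),
    ("Youth 15-24", "OrRd"),
    ("Working Age 25-54", "YlGn"),
    ("Seniors 65+", "BuPu"),
]

def choropleth_kwargs(name, palette, data, geo_data):
    """Build the keyword arguments for one folium.Choropleth layer."""
    return {
        "geo_data": geo_data,
        "name": name,
        "data": data,
        "columns": ["Neighbourhood", "Total"],
        "key_on": "feature.properties.HOOD",
        "fill_color": palette,
        "fill_opacity": 0.8,
        "line_opacity": 0.2,
        "legend_name": f"{name} in Toronto",
    }

specs = [choropleth_kwargs(name, palette, data=None, geo_data="neighbourhoods.js")
         for name, palette in LAYERS]
```

In the notebook, each entry would be passed on as `folium.Choropleth(**spec).add_to(tor_age)` once `data` is set to the corresponding dataframe (`children`, `youth`, `working_age` or `seniors`).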