# Final Capstone Report

## by Daniel da Cruz

### (Side note: Because the Foursquare API was not functioning, the project has been adapted to work around the server-side error. The project below focuses on extracting, preparing and analyzing the data, and on drawing a conclusion from it.)

# The Business Problem 

Mr. Ronald is a 56-year-old restaurateur. He has been involved in the restaurant business since the age of 16. He has been quite successful, and has built two businesses in the small town of Dundas, Ontario. 

His restaurants are part of a franchise called "Ronald's Express." Mr. Ronald is the owner and founder of the franchise. Despite not having a technical background, he has identified the factors he believes have made his business a success. The menu is entirely designed by him. Over his years in the industry, he has created and refined his recipes to make his food the most delicious in his town. 

He sells burgers, tacos and pizzas using his very best recipes and cooking style. It is a town favourite. He now believes that Ronald's Express is ready to expand into Toronto, Ontario. The franchise is entirely family-owned, and he would like to pass it down to his sons when he retires. Expanding will allow the business to keep growing for many years to come. 

One of the key factors in opening a restaurant is the location. After a few failed businesses prior to Ronald's Express, he realized that the location should have "plenty of feet." By this he means that the business needs to be placed where it will encounter a lot of people. He has also identified that his main customers are middle-class people. He noticed that most of his customers are not exactly "well off."

Ronald has been to Toronto a couple of times, but it's quite a big place. He has no idea where to begin looking for a location. His sons told him that he may need to find a data scientist to help with his problem. He has contacted me for assistance.

In this data science project, the target audience is Mr. Ronald and his family as they try to expand their family business into the Greater Toronto Area. 

# The Data and Methodology

Toronto publishes its census data on the city website, www.toronto.ca. This website contains the data required to understand the city and how it functions. The most recent census data will be used as part of the analysis. 

The neighborhood data for Toronto is available as an Excel spreadsheet. All unnecessary data was removed, and the remaining data was adjusted so that it could be analyzed. The three most important columns of data, with respect to the project, were the population of each Toronto neighborhood, the area of the neighborhood and the average family income. From this data, we will create a new column that gives us the population density. 

We created the population density column by dividing the population by the area. This is a reasonable derivation: it tells us how many people are present per square kilometre. This is what Mr. Ronald meant by "plenty of feet," even though he could not fully articulate it. We can report this figure to him, and provide a recommendation for where to start his search. 
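As a minimal sketch of this calculation, the density column could be derived as shown below. The column names follow the census table used in this report; the three sample rows and the assumption that the area is expressed in square kilometres are illustrative only.

```python
import pandas as pd

# Illustrative rows; areas are ASSUMED to be in square kilometres.
sample = pd.DataFrame({
    "Neighbourhood": ["West Humber-Clairville", "Rexdale-Kipling", "Elms-Old Rexdale"],
    "Total Population": [34100, 10485, 9550],
    "Total Area": [30.09, 2.5, 2.9],
})

# People per square kilometre, rounded to the nearest whole person.
sample["Density"] = (
    (sample["Total Population"] / sample["Total Area"]).round().astype(int)
)

print(sample[["Neighbourhood", "Density"]])
```

Dividing the two columns directly keeps the calculation vectorized, so the same line works unchanged on the full neighbourhood table.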

We will then focus on the average family income. Mr. Ronald explained that neither the lower nor the upper economic bracket seems to make up his customer base. He believes that the middle class can both afford his products and enjoys them the most. These are rough estimates on his part, and we will take him at his word.

After doing a statistical analysis, we will be able to report the mean and median incomes for the city of Toronto. This will give us a reference point for finding the middle class. It will also tell us which neighborhoods to avoid. Financial data tends to be skewed, so we will likely have to use the median to get a good reference point for the middle class. 
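To illustrate why the median is the safer reference point for skewed data, here is a small sketch using only the Python standard library. The income values are hypothetical and are not taken from the census data; they simply include one very affluent neighbourhood.

```python
from statistics import mean, median

# Hypothetical neighbourhood incomes with one very affluent outlier.
incomes = [52000, 56000, 60000, 65000, 67000, 70000, 71000, 300000]

mean_income = mean(incomes)      # pulled upward by the single large value
median_income = median(incomes)  # middle of the sorted values, robust to it

print("mean:", mean_income)
print("median:", median_income)
```

The single large value raises the mean well above what most neighbourhoods earn, while the median stays close to the typical value.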

After extracting and cleaning the data, we will do a statistical analysis and report the findings using Plotly. For each finding, we will choose the type of graph that presents it best. 

# Libraries Imported

In [42]:
import numpy as np # library to handle data in a vectorized manner
from bs4 import BeautifulSoup
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!pip install folium
import folium # map rendering library

#Graph Plotting Library
import plotly.express as px
import plotly.graph_objs

print('Libraries imported.')

Libraries imported.


## Toronto Location Data 

In [43]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

43.6534817 -79.3839347


## Retrieving Census Data from CSV File
- The data has been cleaned and formatted for easier data handling.
- Density column has been added by dividing 'Total Population' by 'Total Area'

In [44]:
dataset = pd.read_csv(r"C:\Users\Daniel\Desktop\wellbeing-toronto-demographics.csv", delimiter=";")
dataframe = pd.DataFrame(dataset)
dataframe.head()

Unnamed: 0,Neighbourhood,AREA_S_CD,Total Area,Total Population,Density,Average Family Income
0,West Humber-Clairville,1,3009,34100,1133,67240
1,Mount Olive-Silverstone-Jamestown,2,46,32790,7128,52745
2,Thistletown-Beaumond Heights,3,34,10140,2982,71300
3,Rexdale-Kipling,4,25,10485,4194,65215
4,Elms-Old Rexdale,5,29,9550,3293,56515
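The `Total Area` values in the preview above reproduce the `Density` column only if they are read as 30.09, 4.6 and 3.4 rather than 3009, 46 and 34 (for example, 34100 / 30.09 ≈ 1133). One possible cause, which is an assumption here and not something confirmed by the source file, is that the semicolon-delimited CSV uses a comma as the decimal separator. The sketch below uses a hypothetical two-row excerpt to show how `pd.read_csv` can be told about decimal commas.

```python
import io
import pandas as pd

# Hypothetical excerpt of a semicolon-delimited file that uses decimal commas.
raw = io.StringIO(
    "Neighbourhood;Total Area;Total Population\n"
    "West Humber-Clairville;30,09;34100\n"
    "Mount Olive-Silverstone-Jamestown;4,6;32790\n"
)

# decimal="," tells pandas to parse "30,09" as the float 30.09.
areas = pd.read_csv(raw, sep=";", decimal=",")

# Recompute density from the parsed columns as a consistency check.
areas["Density"] = (
    (areas["Total Population"] / areas["Total Area"]).round().astype(int)
)

print(areas)
```

With the areas parsed as floats, the recomputed densities match the first two `Density` values shown in the preview.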


# Obtaining Location Coordinates of Toronto Neighborhoods

In [45]:
neighborhood_lat = []
neighborhood_long = []
# Create the geocoder once rather than on every iteration
geolocator = Nominatim(user_agent="foursquare_agent")
for name in dataframe['Neighbourhood']:
    address = '{}, Toronto, Canada'.format(name)
    location = geolocator.geocode(address)
    if location is None:
        # No match found; mark the neighbourhood so it can be skipped on the map
        neighborhood_lat.append('unknown')
        neighborhood_long.append('unknown')
    else:
        # Append directly so the city-level latitude and longitude
        # (used later to centre the map) are not overwritten
        neighborhood_lat.append(location.latitude)
        neighborhood_long.append(location.longitude)

## Adding the Latitude and Longitude Coordinates to the Dataframe

In [46]:
dataframe.insert(6, 'Latitude', neighborhood_lat)
dataframe.insert(7, 'Longitude', neighborhood_long)
dataframe.head(15)

Unnamed: 0,Neighbourhood,AREA_S_CD,Total Area,Total Population,Density,Average Family Income,Latitude,Longitude
0,West Humber-Clairville,1,3009,34100,1133,67240,43.722563,-79.597039
1,Mount Olive-Silverstone-Jamestown,2,46,32790,7128,52745,unknown,unknown
2,Thistletown-Beaumond Heights,3,34,10140,2982,71300,unknown,unknown
3,Rexdale-Kipling,4,25,10485,4194,65215,43.722114,-79.572292
4,Elms-Old Rexdale,5,29,9550,3293,56515,43.72177,-79.552173
5,Kingsview Village-The Westway,6,51,21725,4260,60440,unknown,unknown
6,Willowridge-Martingrove-Richview,7,55,21345,3881,70855,unknown,unknown
7,Humber Heights-Westmount,8,28,10580,3779,69750,43.695785,-79.520832
8,Edenbridge-Humber Valley,9,55,14945,2717,142550,43.670672,-79.518855
9,Princess-Rosethorn,10,52,11200,2154,152850,unknown,unknown


Note: Location data could not be found for some neighborhoods, so those neighborhoods will not show up on the map. We will continue with the data that is available. The Foursquare API is not functioning properly, but the census data alone allows us to make sufficient inferences for the goals of the project. 
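One way to make the missing coordinates easier to handle is sketched below. It is an alternative to the approach used in this report, not a description of it: failed lookups are recorded as NaN instead of the string 'unknown'. The helper name `geocode_neighbourhoods` and the `FakeLocation` stand-in are hypothetical, and the stand-in replaces a real geocoder so that the example runs without network access.

```python
import math


def geocode_neighbourhoods(names, geocode):
    """Return (latitudes, longitudes), using NaN where no match is found.

    `geocode` is any callable that takes an address string and returns an
    object with `.latitude` and `.longitude`, or None when nothing is found.
    """
    lats, lngs = [], []
    for name in names:
        location = geocode(f"{name}, Toronto, Canada")
        if location is None:
            lats.append(math.nan)
            lngs.append(math.nan)
        else:
            lats.append(location.latitude)
            lngs.append(location.longitude)
    return lats, lngs


class FakeLocation:
    """Hypothetical stand-in for a geocoder result."""

    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


# Only one of the two neighbourhoods is "known" to the stand-in geocoder.
known = {"Rexdale-Kipling, Toronto, Canada": FakeLocation(43.722114, -79.572292)}
lats, lngs = geocode_neighbourhoods(
    ["Rexdale-Kipling", "Princess-Rosethorn"], known.get
)
print(lats, lngs)
```

In a real run, the callable could be a geopy geocoder wrapped in `geopy.extra.rate_limiter.RateLimiter` (for example with `min_delay_seconds=1`) to space out requests, and rows with missing coordinates could be removed with `dataframe.dropna(subset=['Latitude', 'Longitude'])` before mapping.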

## Mapping the neighborhoods of Toronto

In [47]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10.4)

for lat, lng, neighborhood in zip(dataframe['Latitude'], dataframe['Longitude'], dataframe['Neighbourhood']):
    # Skip neighbourhoods whose coordinates could not be found
    if lat == 'unknown' or lng == 'unknown':
        continue
    else:
        label ='{}'.format(neighborhood)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_toronto)
map_toronto

# Results 

### Plotting the Average Family Income of Toronto Neighborhoods

In [48]:
fig = px.histogram(dataframe, x="Average Family Income")
fig.update_layout(
    title={
        'text': "The Distribution of Family Incomes in Toronto",
        'y':1.0,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

- Notice that the data is skewed to the right. This is a common occurrence in financial data. It means that the mean would not be a good measure of the central tendency of the income distribution across Toronto.
- The median would be a better measure of the typical family income. 
- The mean and median values need to be calculated for further statistical analysis. 

### Mean and Median of Toronto Family Incomes

In [49]:
dataframe['Average Family Income'].mean()

80817.96428571429

- The mean of the average family income across Toronto neighborhoods is approximately $80,818.

In [50]:
dataframe['Average Family Income'].median()

66727.5

- The median of the average family income across Toronto neighborhoods is $66,727.50.

In [51]:
fig = px.box(dataframe, y="Average Family Income", title='The Dispersion of Family Income Across Toronto Neighborhoods')
#fig.show()
fig.update_layout(
    title={
        'text': "The Dispersion of Family Income Across Toronto Neighborhoods",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

- Above is a boxplot further showing the impact that the outlier is having on the distribution of income. 
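A boxplot conventionally flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule with pandas to hypothetical income values, not the census data. The quartiles pandas computes may differ slightly from those Plotly draws, since the two libraries can use different quartile methods.

```python
import pandas as pd

# Hypothetical neighbourhood incomes with one very affluent outlier.
incomes = pd.Series([52000, 56000, 60000, 65000, 67000, 70000, 71000, 300000])

# Quartiles and interquartile range.
q1 = incomes.quantile(0.25)
q3 = incomes.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = incomes[(incomes < lower) | (incomes > upper)]
print(outliers.tolist())
```

Applied to the `Average Family Income` column, the same rule would list the neighbourhoods that the boxplot draws as individual points.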

In [52]:
dataframe = dataframe.sort_values('Average Family Income')

In [53]:
fig = px.bar(dataframe.tail(10), x='Neighbourhood', y='Average Family Income', title='The Average Family Income of Toronto Neighbourhoods')
fig.update_layout(
    title={
        'text': "The Average Family Income of Toronto Neighbourhoods",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

- We note that Bridle Path-Sunnybrook-York Mills is an upper-class neighborhood, and is the outlier in this data sample.

## Plotting the Population Density of Toronto Neighborhoods

In [54]:
dataframe = dataframe.sort_values('Density')

In [55]:
fig = px.bar(dataframe.tail(10), x='Neighbourhood', y='Density', title='The Density of Toronto Neighbourhoods')
fig.update_layout(
    title={
        'text': "The Density of Toronto Neighbourhoods",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

- From this graph, the three rightmost neighborhoods warrant further observation: North St. James Town, Mount Pleasant West and Church-Yonge Corridor. 


### Adjusting Average Family Incomes by Population Density of Toronto Neighborhoods

In [56]:
dataframe = dataframe.sort_values('Density')

In [57]:
fig = px.scatter(dataframe.tail(10), x='Neighbourhood', y='Average Family Income', size='Density', title='The Average Family Income of Toronto Neighbourhoods')
fig.update_layout(
    title={
        'text': "The Average Family Income of Toronto Neighbourhoods - Adjusted by Population Density",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

# Conclusion 

- Our top three candidates are North St. James Town, Mount Pleasant West and Church-Yonge Corridor. The data shows that the neighborhood with the highest density is North St. James Town. However, its average family income of $40,230 is well below the median.
- The second and third neighborhoods are Mount Pleasant West and Church-Yonge Corridor. In contrast, their average family incomes of $69,830 and $69,435 respectively are above the median family income of $66,727.50. 
- We recommend Mount Pleasant West and Church-Yonge Corridor as our top two candidates for the placement of Ronald’s Express. However, due diligence is required to choose the right site for the business within each neighborhood. This is best done in person and with the help of a commercial real estate agent.
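The selection logic behind this recommendation can be sketched as follows. The incomes, the median and the density ranking are the ones stated above. The 25% band around the median is a hypothetical definition of "middle class" chosen only for illustration, since the report does not define the bracket numerically.

```python
import pandas as pd

# The three densest neighbourhoods and their incomes, as reported above.
candidates = pd.DataFrame({
    "Neighbourhood": [
        "North St. James Town",
        "Mount Pleasant West",
        "Church-Yonge Corridor",
    ],
    "Density Rank": [1, 2, 3],
    "Average Family Income": [40230, 69830, 69435],
})

median_income = 66727.5  # median reported in the Results section
band = 0.25              # HYPOTHETICAL tolerance around the median

lower = median_income * (1 - band)
upper = median_income * (1 + band)

# Keep only neighbourhoods whose income falls inside the assumed band.
in_band = candidates["Average Family Income"].between(lower, upper)
shortlist = candidates[in_band].sort_values("Density Rank")

print(shortlist["Neighbourhood"].tolist())
```

Under this assumed band, the densest neighbourhood drops out on income and the two recommended neighbourhoods remain; a wider or narrower band could change the shortlist.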

# Reference

1. City of Toronto, 2021. Data, Research & Maps. [online] Available at: <https://www.toronto.ca/city-government/data-research-maps/> [Accessed 13 July 2021].