# Capstone report week 4

In [6]:
import pandas as pd
import numpy as np
import geopy
import folium
import geopandas as gpd
import fiona
import os
import requests

## Introduction/Business Problem

We have been hired by a couple who want to open a hair salon in Toronto.  

They are new to the city and need help researching and analyzing the data to identify the best neighborhood for the new business.  

They provided us with the demographics of the target market of the Hair Salon:  

* Latin American Women
* Between 15 and 70 years old
* Medium income or higher
* College educated family  

They expect a short list of three neighborhoods where these demographics are well represented, plus a visualization of existing hair salons and/or other businesses specifically tailored to women in Toronto.

## Data

We are going to use the neighborhood data provided by the City of Toronto instead of data organized by postal code, because the city's neighborhood divisions can be matched with economic and demographic data more easily.  

We will access our data from the Open Data Catalogue of the city of Toronto. You can find the Open Data page at:  
https://www.toronto.ca/city-government/data-research-maps/open-data/  

Toronto has announced a new Open Data Portal, to which all the data sets will eventually be relocated. You can find the Open Data Portal at:
https://portal0.cf.opendata.inter.sandbox-toronto.ca/catalogue/  

The specific data sets that we are going to use are:    
* Boundaries of City of Toronto Neighbourhoods in WGS84 format. It can be found at: 
https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#a45bd45a-ede8-730e-1abc-93105b2c439f  
    

* Toronto Neighbourhood Profiles in csv format. It can be found at:  
https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#8c732154-5012-9afe-d0cd-ba3ffc813d5a  



### Datasets explanation

* Boundaries of City of Toronto Neighbourhoods  
This is a shapefile with the 140 neighborhoods in Toronto. It contains the neighborhood ID, the neighborhood name, and a geometry column holding the polygon that bounds each neighborhood.

In [10]:
df = gpd.GeoDataFrame.from_file('NEIGHBORHOODS_WGS84.shp')
print(df.columns)
df.head()

Index(['AREA_S_CD', 'AREA_NAME', 'geometry'], dtype='object')


Unnamed: 0,AREA_S_CD,AREA_NAME,geometry
0,97,Yonge-St.Clair (97),"POLYGON ((-79.39119482700001 43.681081124, -79..."
1,27,York University Heights (27),"POLYGON ((-79.505287916 43.759873494, -79.5048..."
2,38,Lansing-Westgate (38),"POLYGON ((-79.439984311 43.761557655, -79.4400..."
3,31,Yorkdale-Glen Park (31),"POLYGON ((-79.439687326 43.705609818, -79.4401..."
4,16,Stonegate-Queensway (16),"POLYGON ((-79.49262119700001 43.64743635, -79...."


In the shapefile, the column 'AREA_S_CD' stores each neighborhood ID as a 3-digit string padded with leading zeros. We will strip the leading zeros so that the IDs are easier to match against other data later.

In [25]:
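# regex=True is required in recent pandas versions, where the pattern is literal by default
df['AREA_S_CD'] = df['AREA_S_CD'].str.replace(r"^0+", '', regex=True)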
df.head()

Unnamed: 0,AREA_S_CD,AREA_NAME,geometry
0,97,Yonge-St.Clair (97),"POLYGON ((-79.39119482700001 43.681081124, -79..."
1,27,York University Heights (27),"POLYGON ((-79.505287916 43.759873494, -79.5048..."
2,38,Lansing-Westgate (38),"POLYGON ((-79.439984311 43.761557655, -79.4400..."
3,31,Yorkdale-Glen Park (31),"POLYGON ((-79.439687326 43.705609818, -79.4401..."
4,16,Stonegate-Queensway (16),"POLYGON ((-79.49262119700001 43.64743635, -79...."
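
The stripping step can be tried on a few codes before running it on the shapefile. This is a minimal sketch: the values in `codes` are invented for illustration, and it assumes the column holds strings. `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string unless told otherwise.

```python
import pandas as pd

# Made-up codes in the zero-padded 3-digit format (not taken from the shapefile)
codes = pd.Series(["097", "027", "140", "005"])

# "^0+" matches one or more zeros at the start of each string only,
# so zeros inside or at the end of a code are left alone
stripped = codes.str.replace(r"^0+", "", regex=True)

print(stripped.tolist())  # ['97', '27', '140', '5']
```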


With that information we can use folium and plot a map with the neighborhoods of Toronto.  
First we are going to convert the shapefile to GeoJSON.

In [27]:
# Export the GeoDataFrame to GeoJSON so folium can read it
df.to_file('toronto_neigh.json', driver='GeoJSON')
# Center the map on downtown Toronto
map_toronto = folium.Map(location=[43.653963, -79.387207], zoom_start=10)
toronto_neigh = 'toronto_neigh.json'  # path to the exported file
folium.GeoJson(
    toronto_neigh,
    name='geojson'
).add_to(map_toronto)
map_toronto

As you can see, we can now use the neighborhoods of Toronto for further analysis.
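
To see which property names are available for joining other data, the GeoJSON structure can be inspected with the standard `json` module. The sketch below builds a tiny FeatureCollection in memory with the same property names as the shapefile; the coordinates are simplified placeholders, not real boundaries.

```python
import json

# Tiny stand-in for toronto_neigh.json: same property names, placeholder geometry
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"AREA_S_CD": "97", "AREA_NAME": "Yonge-St.Clair (97)"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[-79.40, 43.68], [-79.38, 43.68],
                                 [-79.38, 43.70], [-79.40, 43.68]]],
            },
        }
    ],
}

# Round-trip through a JSON string, as if the file had been read from disk
geojson = json.loads(json.dumps(sample))

# Each feature keeps the attribute columns under "properties"
property_names = sorted(geojson["features"][0]["properties"])
area_codes = [f["properties"]["AREA_S_CD"] for f in geojson["features"]]

print(property_names)  # ['AREA_NAME', 'AREA_S_CD']
print(area_codes)      # ['97']
```

The same lookups apply to the dictionary returned by `json.load` on the exported `toronto_neigh.json` file.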

* Toronto Neighbourhood Profiles in csv format.  

The Census of Population is held across Canada every 5 years and collects data about age and sex, families and households, language, immigration and internal migration, ethnocultural diversity, Aboriginal peoples, housing, education, income, and labour.  City of Toronto Neighbourhood Profiles use this Census data to provide a portrait of the demographic, social and economic characteristics of the people and households in each City of Toronto neighbourhood. The profiles present selected highlights from the data, but these accompanying data files provide the full data set assembled for each neighbourhood.  
  
Even though we have access to a large data set, we are going to use a "small" subset in this exercise.  
  
As we will see, the subset holds demographic information for each neighborhood that is relevant to our business problem:  
population, population density, number of females in each age bracket, number of persons with Spanish as mother tongue, income brackets, and education brackets.
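
As an illustration of how the age brackets could be combined, the sketch below adds up the female brackets that fall inside the target age range. It is only a sketch under stated assumptions: the helper name `female_target_columns` is hypothetical, only a few brackets are included to keep it short, the "70 to 74" values are placeholders, and treating the 70 to 74 bracket as outside the "15 to 70" range is our own reading of the requirement.

```python
import re
import pandas as pd

# A few female age brackets for two neighborhoods; the "70 to 74" values are
# placeholders (not from the dataset) used to show that the bracket is excluded
sample = pd.DataFrame(
    {
        "Female: 15 to 19 years": [865, 690],
        "Female: 20 to 24 years": [975, 895],
        "Female: 25 to 29 years": [1005, 975],
        "Female: 30 to 34 years": [935, 835],
        "Female: 35 to 39 years": [775, 715],
        "Female: 70 to 74 years": [100, 100],
    },
    index=["Agincourt North", "Agincourt South-Malvern West"],
)

def female_target_columns(columns, min_age=15, max_age=70):
    """Return 'Female: A to B years' labels with A >= min_age and B < max_age."""
    selected = []
    for name in columns:
        match = re.fullmatch(r"Female: (\d+) to (\d+) years", name)
        if match and int(match.group(1)) >= min_age and int(match.group(2)) < max_age:
            selected.append(name)
    return selected

target_cols = female_target_columns(sample.columns)

# Row-wise sum over the selected brackets only
sample["females_in_target_range"] = sample[target_cols].sum(axis=1)
print(sample["females_in_target_range"].tolist())  # [4555, 4110]
```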


In [13]:
toronto_data = pd.read_csv('neighborhood_census_final_dataset.csv')
print(toronto_data.columns)
toronto_data.head()

Index(['Unnamed: 0', 'Neighbourhood Number', 'Population2016',
       'Population density per square kilometre',
       'Land area in square kilometres', 'Female: 15 to 19 years',
       'Female: 20 to 24 years', 'Female: 25 to 29 years',
       'Female: 30 to 34 years', 'Female: 35 to 39 years',
       'Female: 40 to 44 years', 'Female: 45 to 49 years',
       'Female: 50 to 54 years', 'Female: 55 to 59 years',
       'Female: 60 to 64 years', 'Female: 65 to 69 years',
       'Female: 70 to 74 years', 'motherTongueSpanish',
       'Percentage with total income', 'Under $10000 (including loss)',
       '$10000 to $19999', '$20000 to $29999', '$30000 to $39999',
       '$40000 to $49999', '$50000 to $59999', '$60000 to $69999',
       '$70000 to $79999', '$80000 to $89999', '$90000 to $99999',
       '$100000 and over', 'Americas', 'Brazil', 'Colombia', 'El Salvador',
       'Guyana', 'Haiti', 'Jamaica', 'Mexico', 'Peru', 'Trinidad and Tobago',
       'United States', 'Other places of b

Unnamed: 0.1,Unnamed: 0,Neighbourhood Number,Population2016,Population density per square kilometre,Land area in square kilometres,Female: 15 to 19 years,Female: 20 to 24 years,Female: 25 to 29 years,Female: 30 to 34 years,Female: 35 to 39 years,...,College CEGEP or other non-university certificate or diploma,University certificate or diploma below bachelor level,University certificate diploma or degree at bachelor level or above,Bachelors degree,University certificate or diploma above bachelor level,Degree in medicine dentistry veterinary medicine or optometry,Masters degree,Earned doctorate,Total income: Average amount ($),Total income: Aggregate amount ($000)
0,Agincourt North,129,29113,3929,7.41,865,975,1005,935,775,...,2550,570,4240,3090,215,105,735,85,30414,714879
1,Agincourt South-Malvern West,128,23757,3034,7.83,690,895,975,835,715,...,2340,485,4615,3270,215,120,920,85,31825,616446
2,Alderwood,20,12054,2435,4.95,290,310,350,430,450,...,1735,195,1980,1415,105,45,390,25,47709,473038
3,Annex,95,30526,10863,2.81,550,1520,2265,1675,1040,...,2005,385,12640,6855,670,330,3930,870,112766,2888507
4,Banbury-Don Mills,42,27695,2775,9.98,660,650,745,860,895,...,2420,560,8060,4895,530,430,1970,245,67757,1513345


The column names are not easy to work with, but for now we will rename just the first three to give a small example of what we can accomplish with the data.

In [15]:
toronto_data.rename(columns = {'Unnamed: 0': 'neighborhood_name',
                               'Neighbourhood Number': 'neighborhood_id',
                               'Population2016': 'population2016'  }, inplace=True)

In [16]:
toronto_data.head()

Unnamed: 0,neighborhood_name,neighborhood_id,population2016,Population density per square kilometre,Land area in square kilometres,Female: 15 to 19 years,Female: 20 to 24 years,Female: 25 to 29 years,Female: 30 to 34 years,Female: 35 to 39 years,...,College CEGEP or other non-university certificate or diploma,University certificate or diploma below bachelor level,University certificate diploma or degree at bachelor level or above,Bachelors degree,University certificate or diploma above bachelor level,Degree in medicine dentistry veterinary medicine or optometry,Masters degree,Earned doctorate,Total income: Average amount ($),Total income: Aggregate amount ($000)
0,Agincourt North,129,29113,3929,7.41,865,975,1005,935,775,...,2550,570,4240,3090,215,105,735,85,30414,714879
1,Agincourt South-Malvern West,128,23757,3034,7.83,690,895,975,835,715,...,2340,485,4615,3270,215,120,920,85,31825,616446
2,Alderwood,20,12054,2435,4.95,290,310,350,430,450,...,1735,195,1980,1415,105,45,390,25,47709,473038
3,Annex,95,30526,10863,2.81,550,1520,2265,1675,1040,...,2005,385,12640,6855,670,330,3930,870,112766,2888507
4,Banbury-Don Mills,42,27695,2775,9.98,660,650,745,860,895,...,2420,560,8060,4895,530,430,1970,245,67757,1513345
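
If the remaining column names need cleaning later, one option is to normalize them all with a small function. This is a sketch only; `to_snake_case` is a hypothetical helper, not something used elsewhere in this report.

```python
import re

def to_snake_case(name):
    """Lowercase a column label and replace runs of non-alphanumerics with underscores."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    return cleaned.strip("_")

examples = [
    "Population density per square kilometre",
    "Female: 15 to 19 years",
    "Total income: Average amount ($)",
]
for label in examples:
    print(label, "->", to_snake_case(label))
# Population density per square kilometre -> population_density_per_square_kilometre
# Female: 15 to 19 years -> female_15_to_19_years
# Total income: Average amount ($) -> total_income_average_amount
```

Since `DataFrame.rename` accepts a function for `columns`, the helper could be applied as `toronto_data.rename(columns=to_snake_case)`. Note that income bracket labels such as '$10000 to $19999' would then start with a digit, so they may still deserve manual names.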


We need to be sure that the key used to merge the geographic data with the demographic data has the same data type in both data sets.  
In our JSON file, the column "AREA_S_CD" holds the neighborhood ID as a string.  
Now we will check the type of the column "neighborhood_id" in the demographic data set and change it if necessary.

In [17]:
toronto_data['neighborhood_id'].dtype

dtype('int64')

The type is 'int64', so we have to convert it to string to match "AREA_S_CD".

In [18]:
toronto_data['neighborhood_id'] = toronto_data['neighborhood_id'].astype(str)
toronto_data['neighborhood_id'].dtype

dtype('O')

The dtype is now 'object' ('O'), which is the dtype pandas uses here to hold Python strings.  
Now we can add some information to our map.

In [29]:
map_toronto3 = folium.Map(location=[43.653963, -79.387207], zoom_start=10)
# folium.Choropleth is used here because the older Map.choropleth method
# is deprecated and not available in recent folium releases
folium.Choropleth(geo_data='toronto_neigh.json',
                  data=toronto_data,
                  columns=['neighborhood_id', 'population2016'],
                  key_on='feature.properties.AREA_S_CD',
                  fill_color='YlGn',
                  fill_opacity=0.7,
                  line_opacity=0.2,
                  legend_name='Population'
                  ).add_to(map_toronto3)
map_toronto3

We now have a choropleth map showing the population of each neighborhood of Toronto.  
Now that we have our data sets ready and working, we can start the actual analysis.
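
One check that may be worth running before the analysis is whether every neighborhood ID appears in both data sets, since a key without a match gets no data value on the map. The sketch below uses small invented key columns as stand-ins for "AREA_S_CD" and "neighborhood_id"; with the real data, the same merge could be applied to `df` and `toronto_data`.

```python
import pandas as pd

# Hypothetical stand-ins for the two key columns (values invented for illustration)
geo_keys = pd.DataFrame({"AREA_S_CD": ["97", "27", "38"]})
census_keys = pd.DataFrame({"neighborhood_id": ["97", "27", "140"]})

# An outer merge with indicator=True labels each row in the "_merge" column
# as 'both', 'left_only' or 'right_only'
check = geo_keys.merge(
    census_keys,
    left_on="AREA_S_CD",
    right_on="neighborhood_id",
    how="outer",
    indicator=True,
)

# Rows that exist on one side only point to IDs that will not join
unmatched = check[check["_merge"] != "both"]
print(unmatched[["AREA_S_CD", "neighborhood_id", "_merge"]])
```

An empty `unmatched` frame would suggest that the two key columns line up.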