# Coursera Capstone Project

Tarek Zawi


## Description of Problem

A potential client wants to open a health food store in the city of Toronto but is unsure where the best location would be. Toronto is the most populous city in Canada and the fourth most populous in North America. A number of health food stores already operate there, so choosing a location where competition is less likely to be a factor is essential to maximizing profits. The client is particularly interested in setting up shop in boroughs (districts) whose names contain the word Toronto, i.e., Downtown Toronto, East Toronto, Central Toronto, West Toronto, and Toronto/York.

## Data

Foursquare will be used to address the problem because of its abundant and accurate location data. We will examine each neighborhood in the area of interest and use the Foursquare API to explore nearby venues. In particular, we are interested in the presence of both health food stores and gyms: gym-goers are generally health-conscious individuals who are more likely to frequent a health food store. Locating near gyms should therefore increase traffic to the store and improve its chances of success.
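The reasoning above (many gyms suggest demand, few existing health food stores suggest low competition) can be sketched as a simple per-neighborhood score. This is only an illustration under stated assumptions: the venue rows are invented, the neighborhood labels `A`, `B`, `C` are placeholders, the category strings `Gym` and `Health Food Store` are assumed to match what the venue data contains, and the score (gyms minus stores) is one hypothetical choice among many.

```python
import pandas as pd

# Hypothetical venue records, invented purely for illustration; real rows
# would come from the Foursquare results for each neighborhood.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B", "B", "C"],
    "Venue Category": ["Gym", "Gym", "Health Food Store",
                       "Health Food Store", "Coffee Shop", "Gym"],
})

# Flag the two categories of interest as 0/1 columns.
is_gym = venues["Venue Category"].eq("Gym").astype(int)
is_health = venues["Venue Category"].eq("Health Food Store").astype(int)

# Count gyms and health food stores per neighborhood.
counts = (
    venues.assign(gyms=is_gym, health_food_stores=is_health)
          .groupby("Neighbourhood")[["gyms", "health_food_stores"]]
          .sum()
)

# Assumed scoring rule: reward gyms (demand), penalize stores (competition).
counts["score"] = counts["gyms"] - counts["health_food_stores"]
ranked = counts.sort_values("score", ascending=False)
print(ranked)
```

A weighted score, or a ratio of gyms to stores, may suit the client better; the subtraction here is only the simplest form of the idea.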

The data comes from two sources: 1) a table from Wikipedia listing the borough and neighborhoods for each postal code in Toronto, Canada; 2) a second table listing the geographical coordinates (latitude and longitude) for the different postal codes.

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [12]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None) # show every column when printing
pd.set_option('display.max_rows', None) # show every row when printing

import json # library to handle JSON files

#!pip install geopy # uncomment this line if geopy is not installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas 1.0 and later)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

# import k-means for the clustering stage
from sklearn.cluster import KMeans
from sklearn import datasets

# install and import folium for map rendering
!pip install folium
import folium

print('Libraries imported.')

Libraries imported.


Let's scrape the table with the needed Toronto geographical information from Wikipedia and put it into a dataframe.

In [13]:
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969'
dfs = pd.read_html(url) # returns a list with one dataframe per HTML table found on the page
print(len(dfs))

3


In [14]:
df = dfs[0] # the first table on the page holds the postal codes

Let's preview the dataframe.

In [15]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Only rows with an assigned borough are of interest to us, so rows without an assigned borough will be dropped. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows are combined into one row with the neighborhoods separated by a comma, as shown in row 4 of the table above. If a row has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
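The three cleaning rules just described can be sketched in pandas as follows. This is a minimal illustration on toy rows, not the notebook's actual processing: the frame `raw` and its values are made up to exercise each rule, and the column names are assumed to match the scraped table.

```python
import pandas as pd

# Toy rows in the layout of the scraped table (illustrative values only).
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# 1) Keep only rows with an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# 2) Where the neighborhood is unassigned, fall back to the borough name.
unassigned = clean["Neighbourhood"] == "Not assigned"
clean.loc[unassigned, "Neighbourhood"] = clean.loc[unassigned, "Borough"]

# 3) Combine neighborhoods that share a postal code into one comma-separated row.
clean = (
    clean.groupby(["Postal Code", "Borough"], as_index=False)["Neighbourhood"]
         .agg(", ".join)
)
print(clean)
```

Applying the fallback in step 2 before the grouping in step 3 keeps a literal "Not assigned" string from ending up inside a combined neighborhood list.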

The dataframe does not contain latitude and longitude coordinates for the neighborhoods, and we need them in order to use the Foursquare location data. We will therefore import a CSV file with the necessary information.

In [16]:
import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.

if os.environ.get('RUNTIME_ENV_LOCATION_TYPE') == 'external':
    endpoint_966a994cd8994c5a91dc5f410f9bf19a = 'https://s3-api.us-geo.objectstorage.softlayer.net'
else:
    endpoint_966a994cd8994c5a91dc5f410f9bf19a = 'https://s3-api.us-geo.objectstorage.service.networklayer.com'

client_966a994cd8994c5a91dc5f410f9bf19a = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=os.environ['IBM_API_KEY_ID'], # credential removed; read from an environment variable (the variable name is an assumption)
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url=endpoint_966a994cd8994c5a91dc5f410f9bf19a)

body = client_966a994cd8994c5a91dc5f410f9bf19a.get_object(Bucket='segmentingandclusteringtorontodat-donotdelete-pr-vpkkm68rpotdkb',Key='Geospatial_Coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

dfcoor = pd.read_csv(body)
dfcoor.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We will now join the original scraped table with our new table of latitude/longitude coordinates, matching the entries on Postal Code.

In [17]:
df_toronto = pd.merge(df, dfcoor, how='left', on='Postal Code')
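To see which postal codes fail to find coordinates in a left join like this one, the `indicator` and `validate` parameters of `pd.merge` can help. The sketch below uses small made-up frames (`left`, `right`) with placeholder coordinates; it is a way to inspect the join, not a step taken in this notebook.

```python
import pandas as pd

# Toy stand-ins for the scraped table and the coordinates table
# (coordinates are rounded placeholders, for illustration only).
left = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M5A"],
    "Latitude": [43.75, 43.65],
    "Longitude": [-79.33, -79.36],
})

# validate raises MergeError if a postal code is duplicated on either side;
# indicator adds a "_merge" column recording where each row came from.
merged = pd.merge(left, right, how="left", on="Postal Code",
                  validate="one_to_one", indicator=True)

# Rows marked "left_only" had no matching coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```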

Because this is a left join, postal codes with no match in the coordinates table end up with missing latitude and longitude values. We will drop those rows.

In [18]:
df_toronto=df_toronto.dropna(axis=0,how='any')
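Note that `how='any'` drops a row when any column is missing, not only the coordinates. If the table ever carried other columns with gaps, restricting the check with `subset` would be safer. A small sketch on a made-up frame (the `Note` column is hypothetical and exists only to show the difference):

```python
import numpy as np
import pandas as pd

# Toy frame: "Note" is an unrelated, hypothetical column that also has gaps.
frame = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Note": [None, None, "toy value"],
    "Latitude": [np.nan, 43.75, 43.65],
    "Longitude": [np.nan, -79.33, -79.36],
})

# Drops every row with a missing value in any column.
any_col = frame.dropna(axis=0, how="any")

# Drops only rows whose coordinates are missing.
coords_only = frame.dropna(subset=["Latitude", "Longitude"])

print(len(any_col), len(coords_only))
```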

We will now look only at those boroughs (districts) that contain the word Toronto, i.e., Downtown Toronto, East Toronto, Central Toronto, West Toronto, and Toronto/York. We will filter our dataframe using the following code.

In [19]:
toronto_data=df_toronto[df_toronto['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
