## Assignment: Segmenting and clustering the neighborhoods in the city of Toronto

We will first import the libraries.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle HTTP requests and download web pages

# transform JSON into a pandas dataframe
try:
    from pandas import json_normalize # pandas 1.0 and later
except ImportError:
    from pandas.io.json import json_normalize # older pandas versions

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

!pip install bs4
from bs4 import BeautifulSoup # this module helps with web scraping

print('Libraries imported.')


Next, we download the Wikipedia page that lists the postal codes of Toronto and try to create a dataframe from it.

In [2]:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050."
html_data  = requests.get(url).text

#turning our html into Beautiful Soup
soup = BeautifulSoup(html_data,"html5lib")

#let's have a look at the html through a nested structure
#print(soup.prettify())

In [3]:
tables = soup.find_all('table')
for index,table in enumerate(tables):
    if ("wikitable" in str(table)):
        table_index = index
print('The index of the table we are looking for is',table_index)
#print(tables[table_index].prettify())

The index of the table we are looking for is 0


Now, let's create the dataframe from the table.

In [4]:
rows = []

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col: # skip rows without <td> cells
        rows.append({"PostalCode": col[0].text, "Borough": col[1].text, "Neighborhood": col[2].text})

# build the dataframe once; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

print(df.shape)
df.head()

(287, 3)


|  | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned\n |
| 1 | M2A | Not assigned | Not assigned\n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Harbourfront\n |


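At this point, the dataframe has 287 rows. Let's clean it by removing the trailing \n, dropping the rows with an unassigned borough, grouping the neighborhoods that share a postal code, and filling in the unassigned neighborhoods.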

In [5]:
#drop the \n in Neighborhood column
df["Neighborhood"] = df["Neighborhood"].str.replace("\n", "")

In [6]:
#drop the rows where 'Borough' is not assigned
df.drop(df[df["Borough"]=="Not assigned"].index,inplace=True)

In [7]:
df.reset_index(drop=True, inplace=True)
row = df.shape[0]
print(row)
df.head(10)

210


|  | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
| 5 | M7A | Downtown Toronto | Queen's Park |
| 6 | M9A | Etobicoke | Islington Avenue |
| 7 | M1B | Scarborough | Rouge |
| 8 | M1B | Scarborough | Malvern |
| 9 | M3B | North York | Don Mills North |


In [8]:
#group neighborhoods by postal code
i = 0
while i < row: #checking all the indexes
    k = 1
    #`and` short-circuits, so the label i+k is only looked up when it exists
    while (i+k < row) and (df.loc[i, 'PostalCode'] == df.loc[i+k, 'PostalCode']): #same postal code in both rows
        df.loc[i, 'Neighborhood'] = df.loc[i, 'Neighborhood'] + ', ' + df.loc[i+k, 'Neighborhood'] #append the neighborhood to the first row
        df.drop([i+k], inplace=True) #delete the second row
        k = k+1 #increase k to compare with the next row
    i = i+k
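The same grouping can also be written without an explicit loop. The following is only a minimal sketch, assuming that rows sharing a postal code also share a borough; the sample rows are copied from the table above and the name `grouped` is illustrative.

```python
import pandas as pd

# A few sample rows taken from the table above
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M6A", "M6A", "M1B", "M1B"],
    "Borough": ["North York", "North York", "North York", "Scarborough", "Scarborough"],
    "Neighborhood": ["Parkwoods", "Lawrence Heights", "Lawrence Manor", "Rouge", "Malvern"],
})

# sort=False keeps the postal codes in order of first appearance
grouped = (
    sample.groupby("PostalCode", sort=False, as_index=False)
          .agg({"Borough": "first", "Neighborhood": ", ".join})
)
print(grouped)
```

Unlike the loop, this version does not require rows with the same postal code to be adjacent.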

In [9]:
df.reset_index(drop=True, inplace=True)
row = df.shape[0]
print(row)
df.head(10)

103


|  | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Downtown Toronto | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


In [10]:
#replace the 'Not assigned' neighborhoods by the value in Borough
not_assigned = df['Neighborhood'] == 'Not assigned' #boolean mask of the rows to fix
df.loc[not_assigned, 'Neighborhood'] = df.loc[not_assigned, 'Borough'] #.loc avoids chained assignment

In [11]:
print('This dataframe has', df.shape[0], 'rows.')

This dataframe has 103 rows.



## Second question

Let's download the data for the latitudes and longitudes.

In [14]:
!wget -q -O 'geo_data.csv' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv
print('Data downloaded!')

Data downloaded!


This gives us a second dataframe, which we will try to combine with the first one.

In [15]:
geo_data = pd.read_csv('geo_data.csv')
geo_data.head()

|  | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


First, we must check that the two dataframes have the same number of rows. A different number would mean something went wrong while processing the first dataframe.

In [16]:
print(geo_data.shape)

(103, 3)


Fortunately, both dataframes have 103 rows! Now, let's complete the first dataframe with the coordinates from the second one.

In [17]:
#let's sort the rows by postal code so they line up with the geo dataframe
#we also shouldn't forget to reset the index
df.sort_values(by=['PostalCode'], inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

|  | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


Because both dataframes are now sorted by postal code, we can drop the Postal Code column from the geo dataframe and simply concatenate the two dataframes side by side.

In [18]:
geo_data.drop(['Postal Code'], axis=1, inplace=True)
geo_data.head()

|  | Latitude | Longitude |
|---|---|---|
| 0 | 43.806686 | -79.194353 |
| 1 | 43.784535 | -79.160497 |
| 2 | 43.763573 | -79.188711 |
| 3 | 43.770992 | -79.216917 |
| 4 | 43.773136 | -79.239476 |


In [19]:
df = pd.concat([df,geo_data], axis=1)
df.head()

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
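Concatenating side by side only gives correct results when both dataframes list exactly the same postal codes in the same order. A hedged alternative is to join on the postal code itself. The sketch below uses a few rows from the tables above, deliberately shuffled, and assumes the column names `PostalCode` and `Postal Code` shown earlier; the names `neighborhoods`, `coordinates` and `merged` are illustrative.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M1C", "M1B", "M1E"],  # deliberately not sorted
    "Borough": ["Scarborough"] * 3,
})
coordinates = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# validate="one_to_one" raises an error if a postal code is duplicated on either side
merged = neighborhoods.merge(
    coordinates.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",
    validate="one_to_one",
)
print(merged)
```

With a key-based merge, neither sorting nor resetting the index is needed, and a postal code missing from the coordinates file shows up as NaN instead of silently shifting the rows.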


And now we have a clean dataframe with all the information we need.

In [20]:
df.head(20)

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |



## Third question

Finally, we will try to create a map from the dataframe.

In [21]:
df.shape

(103, 5)

In [22]:
# create map of Toronto using latitude and longitude values
latitude = 43.70611987759837
longitude = -79.26923452997346

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Now, we would like to analyse the number of vegetarian restaurants in Toronto by borough. To do this, we will first use Foursquare to identify the restaurants, then create a dataframe with their locations.

In [27]:
#Let's connect to Foursquare

CLIENT_ID = '14UOKNLJ4VVFS0IDGBH5TK5ATSFCZ5F2PYIFALBS0IQNFQW0' # your Foursquare ID
CLIENT_SECRET = 'U40AJOI5FGSLJH1K1C410KZATD3VIOVM5ZVTBTLTXBDRYMZI' # your Foursquare Secret
ACCESS_TOKEN = 'X05NE5ZPYSYLBSELNVSRQCI0PBPUTE5A52G5Q5LAYJKG3ZT4' # your FourSquare Access Token
VERSION = '20180604'
code = 'X1QWDNRID1NJSXUKDTP1YWFAKG5YYRY0TEHEXR0ZZVOMODEC#_=_'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 14UOKNLJ4VVFS0IDGBH5TK5ATSFCZ5F2PYIFALBS0IQNFQW0
CLIENT_SECRET:U40AJOI5FGSLJH1K1C410KZATD3VIOVM5ZVTBTLTXBDRYMZI
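Printing a client secret in a shared notebook exposes it to every reader. One common alternative is to read the credentials from environment variables and to mask them when printing. The sketch below assumes hypothetical variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and hypothetical helpers `load_credentials` and `mask`; none of these are part of the original notebook.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from the given environment mapping."""
    # The variable names are assumptions made for this example
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first")
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + "*" * (len(secret) - visible)
```

In the notebook, this could be used as `CLIENT_ID, CLIENT_SECRET = load_credentials()` followed by `print('CLIENT_SECRET: ' + mask(CLIENT_SECRET))`.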


In [28]:
search_query = 'vegetarian'
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude,ACCESS_TOKEN, VERSION, search_query)
url

'https://api.foursquare.com/v2/venues/search?client_id=14UOKNLJ4VVFS0IDGBH5TK5ATSFCZ5F2PYIFALBS0IQNFQW0&client_secret=U40AJOI5FGSLJH1K1C410KZATD3VIOVM5ZVTBTLTXBDRYMZI&ll=43.70611987759837,-79.26923452997346&oauth_token=X05NE5ZPYSYLBSELNVSRQCI0PBPUTE5A52G5Q5LAYJKG3ZT4&v=20180604&query=vegetarian'
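String formatting works here, but it does not escape special characters such as spaces in the search query. A minimal sketch using the standard library's `urllib.parse.urlencode` follows. The helper name `build_search_url` and the placeholder credentials are illustrative assumptions, and only a subset of the parameters used above is included.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, lat, lng, version, query):
    """Build a venues/search URL with properly escaped query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "query": query,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

# Placeholder credentials, for illustration only
example_url = build_search_url("ID", "SECRET", 43.70611987759837, -79.26923452997346,
                               "20180604", "vegetarian food")
print(example_url)
```

Here the space in the query is encoded as `+` and the comma between latitude and longitude as `%2C`.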

Now, we will load the Foursquare data about the vegetarian restaurants in Toronto into a dataframe.

In [29]:
results = requests.get(url).json()
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
vdf = json_normalize(venues)
print(vdf.shape)
vdf.head()

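The lookup `results['response']['venues']` raises a `KeyError` when the response does not have the expected shape. A minimal sketch of a more defensive extraction is given below. The payload is a hand-written stand-in with the same nesting as the keys used above, not real API output, and the helper name `venues_to_frame` is hypothetical.

```python
import pandas as pd

def venues_to_frame(results):
    """Flatten the venues of a search response into a dataframe (empty if none)."""
    venues = results.get("response", {}).get("venues", [])
    return pd.json_normalize(venues)  # requires pandas 1.0 or later

# Hand-written stand-in payload, for illustration only
fake_results = {
    "response": {
        "venues": [
            {"name": "Example Vegetarian",
             "location": {"lat": 43.65, "lng": -79.39, "city": "Toronto"}},
        ]
    }
}

example_vdf = venues_to_frame(fake_results)
print(example_vdf.columns.tolist())
```

Nested keys are flattened with a dot, which is where column names such as `location.lat` and `location.city` come from.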

Let's clean our dataframe a little bit. We only keep the relevant columns and only the restaurants in the city of Toronto.

In [24]:
vdf.drop(['categories','hasPerk','location.address','location.labeledLatLngs','location.distance',
         'location.cc','location.state','location.country','location.formattedAddress','location.crossStreet',
          'location.neighborhood'], axis = 1, inplace=True)
vdf.drop(['id','referralId'], axis=1, inplace=True)

In [50]:
outside_toronto = ["Markham", "Mississauga", "Thornhill", "Richmond Hill"] # cities to exclude
vdf.drop(vdf[vdf["location.city"].isin(outside_toronto)].index, inplace=True)
vdf.head(30)

|  | name | location.lat | location.lng | location.city | location.postalCode |
|---|---|---|---|---|---|
| 0 | King's Vegetarian Food 觀自在 | 43.786749 | -79.270004 | Toronto |  |
| 1 | Nelakee Vegetarian | 43.816569 | -79.296205 | Scarborough | M1V 5H4 |
| 2 | The Buddhist Vegetarian Kitchen 佛海齋廚 | 43.806526 | -79.288972 | Scarborough | M1V 4W8 |
| 3 | Annapurna Vegetarian Restaurant | 43.672804 | -79.414087 | Toronto | M5R 3G8 |
| 4 | Vegetarian Cafe in the Big Carrot | 43.677874 | -79.352939 | Toronto |  |
| 5 | Lotus Pond Vegetarian Restaurant 蓮花素食 | 43.819421 | -79.294682 | Scarborough | M1K 5V5 |
| 6 | Graceful Vegetarian Restaurant 法海素食軒 | 43.82878 | -79.306199 | Toronto | L3R 0N4 |
| 9 | Vegetarian Haven | 43.656016 | -79.392758 | Toronto | M5T 1L1 |
| 10 | The Vegetarian Restaurant | 43.656792 | -79.468117 | Toronto | M6P 1Y6 |
| 11 | Green Garden Vegetarian | 43.781869 | -79.279157 | Scarborough | M1S 5A8 |


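Now, we have the list of the vegetarian restaurants returned by the search, with their postal code and location. Let's keep only the first three characters of the postal code, which match the format of the PostalCode column in our first dataframe, and drop the name and city.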

In [25]:
vdf['location.postalCode'] = vdf['location.postalCode'].str[:3]
vdf.drop(['name','location.city'], axis=1, inplace=True)

In [27]:
vdf.rename(columns = {'location.lat':'lat', 'location.lng':'lng', 'location.postalCode':'PostalCode'}, inplace = True)

As we only want to know the number of restaurants in each borough, we will just count the number of rows for each unique postal code with group by.

In [47]:
vdf.drop(vdf[vdf["PostalCode"]=="L3R"].index,inplace=True)
gdf=vdf.groupby('PostalCode').count().reset_index()
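Calling `count()` on the grouped frame counts the non-null values in every remaining column, which is why `lat` later ends up holding the number of restaurants. A hedged alternative that names the count column directly is `groupby(...).size()`. Note that rows whose postal code is missing are left out of the groups by default. The rows and the names `restaurants` and `counts` below are illustrative.

```python
import pandas as pd

restaurants = pd.DataFrame({
    "lat": [43.816569, 43.806526, 43.672804, 43.677874],
    "lng": [-79.296205, -79.288972, -79.414087, -79.352939],
    "PostalCode": ["M1V", "M1V", "M5R", None],  # the last venue has no postal code
})

# size() counts rows per group; reset_index(name=...) names the new column
counts = restaurants.groupby("PostalCode").size().reset_index(name="vege")
print(counts)
```

The venue without a postal code does not appear in `counts`, so it is worth checking how many such rows exist before interpreting the totals.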

We can then merge the two dataframes so that, for each postal code, we have the borough, neighborhood, location and number of vegetarian restaurants in the "vege" column.

In [51]:
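mdf=df.merge(gdf, on='PostalCode', how='left') #keep every postal code, even those without a restaurant
mdf.drop(['lng'], axis=1, inplace=True)
mdf.rename(columns = {'lat':'vege'}, inplace = True) #after count(), 'lat' holds the number of restaurants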
mdf.head()

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude | vege |
|---|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |  |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |  |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |  |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |  |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |  |


Let's clean it a little, so that the NaN values (postal codes with no matching restaurant) become 0.

In [55]:
mdf['vege'] = mdf['vege'].fillna(0)
mdf.head()

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude | vege |
|---|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 | 0.0 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 | 0.0 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 | 0.0 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 | 0.0 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 | 0.0 |
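The merge and the NaN replacement can also be combined so that the count column stays an integer rather than a float. A minimal sketch on illustrative rows, assuming a counts dataframe with the columns `PostalCode` and `vege` (the names `areas`, `counts` and `result` are hypothetical):

```python
import pandas as pd

areas = pd.DataFrame({
    "PostalCode": ["M1B", "M1S", "M1V"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
})
counts = pd.DataFrame({"PostalCode": ["M1S", "M1V"], "vege": [1, 2]})

# A left merge keeps every postal code; missing counts become NaN, then 0
result = areas.merge(counts, on="PostalCode", how="left")
result["vege"] = result["vege"].fillna(0).astype(int)
print(result)
```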


Now, let's see how we can cluster it.

In [58]:
print(mdf['vege'].max())

3.0


The maximum number of vegetarian restaurants per postal code is only 3, so we have four natural clusters: 0, 1, 2 and 3.

The question now is whether we should group them further: for example, one cluster for the postal codes without any restaurant, one for those with 1 or 2, and one for those with 3. Still, keeping four clusters based on the actual count seems fair. It keeps the output small, goes straight to the point, and reflects the data exactly.
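Both options can be sketched in a few lines: one cluster per distinct count, or the coarser grouping (no restaurant, 1 or 2, and 3) built with `pd.cut`. The counts and the label names `none`, `some` and `most` are illustrative assumptions.

```python
import pandas as pd

vege_counts = pd.Series([0, 1, 2, 3, 0], name="vege")

# Option 1: one cluster per distinct count (0, 1, 2, 3)
exact_clusters = vege_counts.astype(int)

# Option 2: three coarser clusters; bins are right-inclusive: (-1, 0], (0, 2], (2, 3]
coarse_clusters = pd.cut(vege_counts, bins=[-1, 0, 2, 3], labels=["none", "some", "most"])
print(coarse_clusters.tolist())
```

With counts this small, either labelling can be assigned directly from the "vege" column without running a clustering algorithm.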