## Assignment: Segmenting and clustering the neighborhoods in the city of Toronto

We will first import the libraries.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to download web pages and call APIs
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

!pip install bs4
from bs4 import BeautifulSoup # this module helps with web scraping

print('Libraries imported.')

Next, we download the Wikipedia page that lists the Toronto postal codes (those starting with M) and build a dataframe from it.

In [2]:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050"
html_data  = requests.get(url).text

#turning our html into Beautiful Soup
soup = BeautifulSoup(html_data,"html5lib")

#let's have a look at the html through a nested structure
#print(soup.prettify())

In [3]:
tables = soup.find_all('table')
for index,table in enumerate(tables):
    if ("wikitable" in str(table)):
        table_index = index
print('The index of the table we are looking for is',table_index)
#print(tables[table_index].prettify())

The index of the table we are looking for is 0


Now, let's create the dataframe from the table.

In [4]:
rows = [] # collect one dict per table row, then build the dataframe once

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if (col != []):
        postalcode = col[0].text
        borough = col[1].text
        neighborhood = col[2].text
        rows.append({"PostalCode":postalcode, "Borough":borough, "Neighborhood":neighborhood})

# DataFrame.append was removed in pandas 2.0, so the dataframe is built from the list instead
df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

print(df.shape)
df.head()

(287, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n


So far, the dataframe has 287 rows. Let's clean it: strip the trailing \n from the Neighborhood column, drop the rows whose borough is Not assigned, and group the neighborhoods that share a postal code.

In [5]:
#drop the \n in Neighborhood column
df["Neighborhood"] = df["Neighborhood"].str.replace("\n", "")

In [6]:
#drop the rows where 'Borough' is not assigned
df.drop(df[df["Borough"]=="Not assigned"].index,inplace=True)

In [7]:
df.reset_index(drop=True, inplace=True)
row = df.shape[0]
print(row)
df.head(10)

210


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Etobicoke,Islington Avenue
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North


In [11]:
#group neighborhoods by postal code
i = 0
while i < row : #checking all the indexes
    k=1
    #`and` short-circuits, so row i+k is only read when it exists
    #(with `&` both sides were evaluated and the lookup at i+k == row raised KeyError)
    while (i+k < row) and (df.at[i, 'PostalCode'] == df.at[i+k, 'PostalCode']) : #comparing the postal codes of two rows, if they are equal :
        df.at[i, 'Neighborhood'] = df.at[i, 'Neighborhood'] + ', ' + df.at[i+k, 'Neighborhood'] #adding the neighborhood to the first row
        df.drop([i+k],inplace=True) #delete the second row
        k = k+1 #increasing k to compare with the next row
    i = i+k

In [12]:
df.reset_index(drop=True, inplace=True)
row = df.shape[0]
print(row)
df.head(10)

103


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
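
The row-by-row grouping above can also be written with pandas `groupby`. The following is a minimal sketch on a small made-up dataframe (the name `toy` and its five rows are assumptions for illustration, not the scraped data). Unlike the loop, it also merges rows that share a postal code without being adjacent.

```python
import pandas as pd

# Toy data, made up for illustration, shaped like the dataframe above
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M6A", "M6A", "M1B", "M1B"],
    "Borough": ["North York", "North York", "North York",
                "Scarborough", "Scarborough"],
    "Neighborhood": ["Parkwoods", "Lawrence Heights", "Lawrence Manor",
                     "Rouge", "Malvern"],
})

# Join the neighborhoods of each (postal code, borough) pair with ", ".
# sort=False keeps the postal codes in order of first appearance.
grouped = (
    toy.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)

print(grouped)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result, under the assumption that each postal code belongs to a single borough.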


In [13]:
#replace the Not assigned Neighborhoods by the value in Borough
mask = df['Neighborhood'] == 'Not assigned' #rows where the neighborhood is 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough'] #copy the borough into those rows (avoids chained assignment)

In [14]:
print('This dataframe has', df.shape[0], 'rows.')

This dataframe has 103 rows.


## Second question

Let's download the data for the latitudes and longitudes.

In [28]:
!wget -q -O 'geo_data.csv' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv
print('Data downloaded!')

Data downloaded!


Now we can load the file into a second dataframe, which we will combine with the first one.

In [35]:
geo_data = pd.read_csv('geo_data.csv')
geo_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


First, we must check that the two dataframes have the same number of rows. A different number would mean something went wrong while processing the first dataframe.

In [36]:
print(geo_data.shape)

(103, 3)


Fortunately, both have 103 rows! Now, let's complete the first dataframe with the coordinates from the second one.

In [24]:
#let's change the order of the Postal Code so it can match the geo dataframe
#we also shouldn't forget to change the index
df.sort_values(by=['PostalCode'], inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
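
Equal row counts and a sorted key column make it likely that the two dataframes line up, but they do not prove it. A minimal sketch of a stricter check, on made-up stand-ins for the two dataframes (the names `left` and `right` and their three rows are assumptions for illustration):

```python
import pandas as pd

# Made-up stand-ins for df and geo_data, with the keys in a different order
left = pd.DataFrame({"PostalCode": ["M1C", "M1B", "M1E"]})
right = pd.DataFrame({"Postal Code": ["M1B", "M1C", "M1E"]})

# Same sort-and-reindex step as in the cell above
left_sorted = left.sort_values(by="PostalCode").reset_index(drop=True)

# Do both tables contain exactly the same postal codes?
same_keys = set(left["PostalCode"]) == set(right["Postal Code"])

# Do the codes match row by row, before and after sorting?
same_order_before = left["PostalCode"].tolist() == right["Postal Code"].tolist()
same_order_after = left_sorted["PostalCode"].tolist() == right["Postal Code"].tolist()

print(same_keys, same_order_before, same_order_after)
```

Only when the row-by-row comparison holds is it safe to combine the two dataframes by position.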


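Both dataframes are now sorted by postal code and have the same length, so their rows line up. We drop the Postal Code column from the second one, which leaves only a side-by-side concatenation to do.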

In [38]:
geo_data.drop(['Postal Code'], axis=1, inplace=True)
geo_data.head()

Unnamed: 0,Latitude,Longitude
0,43.806686,-79.194353
1,43.784535,-79.160497
2,43.763573,-79.188711
3,43.770992,-79.216917
4,43.773136,-79.239476


In [39]:
df = pd.concat([df,geo_data], axis=1)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
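
The concatenation above pairs rows by position, so it depends on both dataframes being in the same order. An alternative that does not depend on row order is to join on the postal code itself. This is a sketch on a three-row sample, not the method used in this notebook; the names `neighborhoods` and `coords` are assumptions for illustration, and the coordinates are copied from the output above.

```python
import pandas as pd

# Three postal codes, deliberately not in sorted order
neighborhoods = pd.DataFrame({
    "PostalCode": ["M1E", "M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
})

# Coordinates for the same codes, in sorted order
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# Join on the key columns; validate raises an error if a postal code
# appears more than once on either side.
merged = neighborhoods.merge(
    coords,
    left_on="PostalCode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
).drop(columns="Postal Code")

print(merged)
```

With `how="left"`, every row of the first dataframe is kept in its original order, and a postal code missing from the coordinates file would show up as NaN rather than shifting the remaining rows.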


We now have a clean dataframe with all the information needed.

In [41]:
df.head(20)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
