# <h1 align="center"><font size="8">Applied Data Science Capstone</font></h1>
<h3 align="center"><font size="5">By Ian Riera Smolinska</font></h3>

<h2> Segmenting and Clustering Neighborhoods in Toronto. </h2>

<h3> Part 1: Scraping the Wikipedia list of postal codes of Canada and creating the DataFrame. </h3>

The proposed source is https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M . Because the table on that page is updated continuously, which could break the scraper, a fixed revision of the page is used instead: https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=933624196

In [1]:
# install the required libraries for scraping
!pip install beautifulsoup4 # library for web scraping
!pip install lxml # xml parser



In [2]:
# import the required libraries
import pandas as pd # library for data analysis
import requests # library to handle requests
from bs4 import BeautifulSoup # library for scraping

print('Libraries imported.')

Libraries imported.


In [3]:
# get the webpage we want to analyze
source = requests.get('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=933624196').text

# apply the scraper on the webpage
soup = BeautifulSoup(source, 'lxml')

# extract the table from the webpage
table = soup.find('table', {'class':'wikitable sortable'})

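# build one comma-separated line per table row
# (the last cell of each row ends with a newline, which keeps the rows apart)
table_data = ""
for tr in table.find_all('tr'):
    row_data = ""
    for tds in tr.find_all('td'):
        row_data = row_data + "," + tds.text
    # drop the leading comma; header rows have no <td> cells and add nothing
    table_data = table_data + row_data[1:]

print(table_data[0:500])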

M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M6A,North York,Lawrence Heights
M6A,North York,Lawrence Manor
M7A,Downtown Toronto,Queen's Park
M8A,Not assigned,Not assigned
M9A,Queen's Park,Not assigned
M1B,Scarborough,Rouge
M1B,Scarborough,Malvern
M2B,Not assigned,Not assigned
M3B,North York,Don Mills North
M4B,East York,Woodbine Gardens
M4B,East York,Parkview Hill
M5B,Downtown Toronto,Ryerso


In [4]:
# define the dataframe columns
column_names = ['PostalCode','Borough', 'Neighborhood'] 

# instantiate the dataframe
df = pd.DataFrame(columns=column_names)
df

Unnamed: 0,PostalCode,Borough,Neighborhood


In [5]:
# get the table into rows
table_rows = table_data.split('\n')

# collect the three fields of each non-empty row, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0 and was slow when called row by row)
rows = [row.split(',')[:3] for row in table_rows if row != '']
df = pd.DataFrame(rows, columns=column_names)

df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


<h4> Preprocessing </h4>


In [6]:
# remove the Postal Codes 'Not assigned' to a Borough
df = df[df.Borough != 'Not assigned']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [7]:
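# assign the borough name to 'Not assigned' neighborhoods
# (mask replaces the flagged values row by row; assign returns a new dataframe)
df = df.assign(Neighborhood=df.Neighborhood.mask(df.Neighborhood == 'Not assigned', df.Borough))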
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
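The two preprocessing steps above can be checked on a small frame. This is a minimal sketch, assuming a hand-made `toy` dataframe (illustrative rows, not the scraped table), that drops unassigned boroughs and then fills unassigned neighborhoods with the borough name:

```python
import pandas as pd

# toy rows for illustration only (not the full scraped table)
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M9A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# keep only rows with an assigned borough; copy() gives an independent frame
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# where the neighborhood is unassigned, take the borough name instead
unassigned = toy['Neighborhood'] == 'Not assigned'
toy['Neighborhood'] = toy['Neighborhood'].mask(unassigned, toy['Borough'])

print(toy)
```

The row for M9A is the case that matters here: its borough is assigned but its neighborhood is not, so the borough name is reused.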


In [8]:
# combine the neighborhoods that share a postal code into one comma-separated string
df = df.groupby(['PostalCode', 'Borough'], sort=True)['Neighborhood'].agg(', '.join)

# the index should be restored after deleting and merging rows from the dataframe
df = df.reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
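To see what the grouping step does in isolation, here is a minimal sketch on a hand-made `toy` frame (an assumption for illustration, three rows only). Rows that share a postal code and borough collapse into one row whose neighborhoods are joined with a comma:

```python
import pandas as pd

# toy rows for illustration only
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# group on the key columns and join the neighborhood names of each group
merged = (
    toy.groupby(['PostalCode', 'Borough'], sort=True)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)

print(merged)
```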


In [9]:
df.shape

(103, 3)
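Beyond the shape, a few quick checks can confirm that the cleaning did what was intended. The helper below is a sketch only: `check_clean` is a hypothetical name introduced here, and the `sample` rows are a small hand-made excerpt rather than the full dataframe.

```python
import pandas as pd

def check_clean(frame):
    """Return a list of problems found in a cleaned postal-code frame (hypothetical helper)."""
    problems = []
    if frame['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    if (frame['Borough'] == 'Not assigned').any():
        problems.append('unassigned borough')
    if (frame['Neighborhood'] == 'Not assigned').any():
        problems.append('unassigned neighborhood')
    return problems

# small excerpt for illustration only
sample = pd.DataFrame({
    'PostalCode': ['M1B', 'M1G', 'M1H'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge, Malvern', 'Woburn', 'Cedarbrae'],
})

print(check_clean(sample))  # an empty list means no problems were found
```

On the real data this would be called as `check_clean(df)` after the grouping step.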

<h3> Part 2: Get the latitude and longitude coordinates of each neighborhood. </h3>

<h4> Plan A: Tried geocoder, but the calls never returned a result and the kernel locked up. </h4>

!pip install geocoder

import geocoder # import geocoder

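latitudes = []
longitudes = []
max_attempts = 5 # give up after a few tries instead of looping forever

for postal_code in df.PostalCode:
    # initialize your variable to None
    lat_lng_coords = None

    # retry a bounded number of times, as geocoder sometimes fails to return a result
    for _ in range(max_attempts):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        if lat_lng_coords:
            break

    # store None when no coordinates were returned
    latitudes.append(lat_lng_coords[0] if lat_lng_coords else None)
    longitudes.append(lat_lng_coords[1] if lat_lng_coords else None)

# add the coordinates as new columns (appending dicts would add new rows instead)
df['Latitude'] = latitudes
df['Longitude'] = longitudes

df.head()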
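The kernel lock described for Plan A is what an unbounded retry loop produces when a lookup never succeeds. A bounded retry can be sketched independently of any geocoding service. In the sketch below, `lookup_with_retries` and `flaky_lookup` are hypothetical names introduced for illustration, and the fake lookup stands in for a real geocoder call:

```python
import time

def lookup_with_retries(lookup, query, max_attempts=5, delay=1.0):
    """Call lookup(query) until it returns a non-empty result or attempts run out.

    Returns the result, or None if every attempt failed (hypothetical helper).
    """
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords:
            return coords
        # wait before the next attempt, but not after the last one
        if attempt < max_attempts - 1:
            time.sleep(delay)
    return None

# fake lookup for illustration: fails twice, then returns coordinates
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return [43.806686, -79.194353] if calls['n'] >= 3 else None

print(lookup_with_retries(flaky_lookup, 'M1B, Toronto, Ontario', delay=0))
```

With geocoder, the `lookup` argument could be something like `lambda q: geocoder.google(q).latlng`. That pairing is an assumption and was not run in this notebook.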

<h4> Plan B: Using the CSV file with the latitudes and longitudes. </h4>

CSV file: http://cocl.us/Geospatial_data

In [10]:
df_geo = pd.read_csv('http://cocl.us/Geospatial_data')

In [11]:
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [16]:
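# copy the coordinates by row position; this assumes df and df_geo list
# the same postal codes in the same order (both are sorted by postal code)
df['Latitude'] = df_geo['Latitude']
df['Longitude'] = df_geo['Longitude']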
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
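A key-based join is an alternative that does not depend on row order. The sketch below uses two small hand-made frames, `codes` and `geo` (illustrative excerpts, deliberately in different orders), and matches them on the postal code:

```python
import pandas as pd

# illustrative excerpts only; note the rows are in different orders
codes = pd.DataFrame({
    'PostalCode': ['M1C', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Highland Creek, Rouge Hill, Port Union', 'Rouge, Malvern'],
})
geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# align the key column name, then join on it;
# validate raises an error if a postal code appears more than once on either side
joined = codes.merge(
    geo.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left',
    validate='one_to_one',
)

print(joined)
```

With `how='left'`, a postal code missing from the coordinates file would show up as `NaN` rather than silently shifting the remaining rows.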
