<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto, Canada</font></h1>

In [2]:
#!pip install bs4

In [5]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

from bs4 import BeautifulSoup

print('Libraries imported.')


Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset

Use Beautiful Soup to fetch and parse the HTML of the Wikipedia page that lists the postal codes.

In [6]:
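# Despite its name, website_url holds the HTML text of the page
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(website_url, 'html.parser')
#print(soup.prettify())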


Extract the table that holds the postal codes.


In [7]:
My_table = soup.find('table',{'class':'wikitable sortable'})
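
As a minimal sketch of this extraction step, the snippet below parses a small hand-written HTML table instead of the live Wikipedia page, so it runs without network access. The `sample_html` string, its rows, and the `rows` variable are illustrative assumptions and not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page (assumption, not the real markup)
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Same lookup as in the cell above: match the table by its class attribute
table = soup.find("table", {"class": "wikitable sortable"})

# Collect the text of every header and data cell, one list per table row
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

Reading the cells individually, rather than splitting the row text on newlines, is one way to make the parsing less sensitive to whitespace in the page source.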

#### Transform the data into a *pandas* dataframe
Convert the table into a dataframe. Rows whose borough is 'Not assigned' are dropped, and a neighbourhood that is 'Not assigned' takes the name of its borough.

The result is a clean dataframe.

In [8]:
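Head = My_table.find_all('tr')  # every table row, including the header row
Table = []
for th in Head:
    # Split the row text into [Postcode, Borough, Neighbourhood]
    row = np.array(th.getText()[1:-1].split('\n'))
    # Skip rows that have no borough assigned
    if row[1] != 'Not assigned':
        # A missing neighbourhood takes the name of its borough
        if row[2] == 'Not assigned':
            row[2] = row[1]
        Table.append(row)

# The first collected row holds the column names
df_Canada = pd.DataFrame(data=Table[1:], columns=Table[0])
df_Canada.head()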

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |
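
The two 'Not assigned' rules can also be expressed with pandas operations once the rows are in a dataframe. The sketch below uses a few made-up rows (the `raw` dataframe is an assumption for illustration, not the scraped data) and is only one possible way to write the cleaning step.

```python
import pandas as pd

# Illustrative rows only (assumption): one row with no borough,
# one complete row, and one row with a borough but no neighbourhood
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# Rule 1: drop rows whose borough is not assigned
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighbourhood is not assigned, fall back to the borough
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```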


## 2. Transform Dataset

Get the column names.

In [9]:
Col = df_Canada.columns
Col

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

Group the neighbourhoods by postcode, joining the names in each group into a single comma-separated string.

In [24]:
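Col = df_Canada.columns
# Col[0] is 'Postcode' and Col[2] is 'Neighbourhood'.
# Each name is followed by a comma, so every joined string ends with a trailing comma.
df_Canada_Group = df_Canada.groupby(Col[0])[Col[2]].apply(lambda x: ''.join(name + ',' for name in x)).reset_index()
df_Canada_Group.head()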

| | Postcode | Neighbourhood |
|---|---|---|
| 0 | M1B | Rouge,Malvern, |
| 1 | M1C | Highland Creek,Rouge Hill,Port Union, |
| 2 | M1E | Guildwood,Morningside,West Hill, |
| 3 | M1G | Woburn, |
| 4 | M1H | Cedarbrae, |
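
The grouped strings above end with a trailing comma because a comma is appended to every name. If that is not wanted, a possible variant is to let `str.join` place the separator only between names. The sketch below uses a small made-up `sample` dataframe (an assumption for illustration) rather than `df_Canada`.

```python
import pandas as pd

# Small illustrative sample (assumption), shaped like df_Canada
sample = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1B", "M1G"],
        "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
    }
)

# ", ".join puts the separator between names only, so no trailing comma
grouped = (
    sample.groupby("Postcode")["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```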


The next task is to produce one row per postcode that carries both the borough and the grouped neighbourhoods. Let's start by creating an empty dataframe with the intended columns.

In [25]:
# define the dataframe columns
column_names = ['Postcode', 'Borough', 'Neighbourhood']

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.

In [26]:
neighborhoods

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|


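Then let's attach the borough to each grouped postcode: sort the original dataframe by postcode, keep one row per postcode, and copy its borough into the grouped dataframe. Both dataframes end up ordered by postcode, so the rows line up by position.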

In [27]:
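# Sort by postcode so the order matches the grouped dataframe
df_repeated = df_Canada.sort_values('Postcode', ascending=True)
# Keep a single row per postcode
df_repeated.drop_duplicates('Postcode', keep="last", inplace=True)
# Positional assignment: relies on both frames having one row per postcode in the same order
df_Canada_Group['Borough'] = df_repeated.reset_index()['Borough']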

In [29]:
df_Canada_Group.head()

| | Postcode | Neighbourhood | Borough |
|---|---|---|---|
| 0 | M1B | Rouge,Malvern, | Scarborough |
| 1 | M1C | Highland Creek,Rouge Hill,Port Union, | Scarborough |
| 2 | M1E | Guildwood,Morningside,West Hill, | Scarborough |
| 3 | M1G | Woburn, | Scarborough |
| 4 | M1H | Cedarbrae, | Scarborough |
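
The borough can also be carried through the grouping itself, which avoids relying on two dataframes having the same row order. The sketch below is a possible alternative, assuming each postcode belongs to exactly one borough; the `sample` dataframe is made up for illustration.

```python
import pandas as pd

# Illustrative sample (assumption), deliberately not sorted by postcode
sample = pd.DataFrame(
    {
        "Postcode": ["M1C", "M1B", "M1B", "M1C"],
        "Borough": ["Scarborough"] * 4,
        "Neighbourhood": ["Highland Creek", "Rouge", "Malvern", "Rouge Hill"],
    }
)

# One pass: take the first borough of each group and join the neighbourhood names
grouped = sample.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ", ".join}
)
print(grouped)
```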


## 3. Geospatial data

In [32]:
#!pip install geocoder

To segment the neighbourhoods and explore them, we need a dataset that contains the latitude and longitude coordinates of each postal code.

Luckily, this dataset exists for free on the web. Feel free to try to find this dataset on your own, but here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

For your convenience, I downloaded the file and placed it on the server, so you can simply run a `wget` command and access the data. So let's go ahead and do that.

In [38]:
!wget -q -O 'postal_data.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


#### Load and explore the data

Next, let's load the data.

In [44]:
df_csv = pd.read_csv('postal_data.csv')
df_csv.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [46]:
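# Positional assignment: assumes df_csv lists the same postal codes in the same order as df_Canada_Group
df_Canada_Group['Latitude'] = df_csv['Latitude']
df_Canada_Group['Longitude'] = df_csv['Longitude']
df_Canada_Group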

| | Postcode | Neighbourhood | Borough | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Rouge,Malvern, | Scarborough | 43.806686 | -79.194353 |
| 1 | M1C | Highland Creek,Rouge Hill,Port Union, | Scarborough | 43.784535 | -79.160497 |
| 2 | M1E | Guildwood,Morningside,West Hill, | Scarborough | 43.763573 | -79.188711 |
| 3 | M1G | Woburn, | Scarborough | 43.770992 | -79.216917 |
| 4 | M1H | Cedarbrae, | Scarborough | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough Village, | Scarborough | 43.744734 | -79.239476 |
| 6 | M1K | East Birchmount Park,Ionview,Kennedy Park, | Scarborough | 43.727929 | -79.262029 |
| 7 | M1L | Clairlea,Golden Mile,Oakridge, | Scarborough | 43.711112 | -79.284577 |
| 8 | M1M | Cliffcrest,Cliffside,Scarborough Village West, | Scarborough | 43.716316 | -79.239476 |
| 9 | M1N | Birch Cliff,Cliffside West, | Scarborough | 43.692657 | -79.264848 |
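
Copying the coordinate columns by position works only while both dataframes hold the same postal codes in the same order. A key-based join is a possible safeguard. The sketch below uses two small made-up dataframes (the `grouped` and `coords` names are assumptions for illustration, and the coordinates are taken from the first rows shown above) and merges them on the postal code.

```python
import pandas as pd

# Illustrative stand-in for df_Canada_Group (assumption)
grouped = pd.DataFrame(
    {"Postcode": ["M1B", "M1C", "M1E"], "Borough": ["Scarborough"] * 3}
)

# Illustrative stand-in for df_csv: different row order, and M1E is missing
coords = pd.DataFrame(
    {
        "Postal Code": ["M1C", "M1B"],
        "Latitude": [43.784535, 43.806686],
        "Longitude": [-79.160497, -79.194353],
    }
)

# A left merge keeps every postcode and matches coordinates by key, not by position
merged = grouped.merge(
    coords, how="left", left_on="Postcode", right_on="Postal Code"
).drop(columns="Postal Code")
print(merged)
```

With a left merge, a postcode that has no match in the coordinates file shows up with missing values instead of silently receiving another postcode's coordinates.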
