# Creating a DataFrame from Wikipedia Using the beautifulsoup4 Package

In this notebook we build the code to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and load it into a pandas DataFrame. Python has several web-scraping libraries and packages; one of the most common is BeautifulSoup. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/  
A very good introductory video on BeautifulSoup: https://www.youtube.com/watch?v=ng2o98k983k

### First of all, we need to install beautifulsoup4

In [183]:
!pip install beautifulsoup4



### Next, we need a parser for the HTML

In [184]:
!pip install lxml
!pip install html5lib



### To get the HTML we need the requests library

In [185]:
!pip install requests



### 1. Let's import the packages

In [186]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

# import k-means for the clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

import matplotlib.cm as cm
import matplotlib.colors as colors






### 2. Let's read in the Wikipedia page

In [187]:

# Get source code of website
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')
print(soup.prettify())


<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":867606113,"wgRevisionId":867606113,"wgArticleId":539066,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wg

### 3. Read the table in the Wikipedia page into our DataFrame
#### In HTML, a table is represented by `<table>...</table>`, each row by `<tr>...</tr>`, and each cell by `<td>...</td>`

In [188]:
data = []
table = soup.find('table')
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
    
# drop the first element: the header row has no <td> cells, so it parses to an empty list
data = data[1:]

Toronto = pd.DataFrame(data,columns=['PostalCode','Borough','Neighborhood'])
Toronto



Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
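
To make the row and cell walk above easier to follow, here is a minimal sketch that runs the same idea on a tiny hand-written table instead of the live page. The HTML string and the names `html_demo`, `soup_demo` and `rows_demo` are made up for this illustration, and the standard-library `html.parser` is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the Wikipedia page
html_demo = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html_demo, "html.parser")

rows_demo = []
for tr in soup_demo.find("table").find_all("tr"):
    # collect the stripped text of every <td> cell in this row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row only has <th> cells, so it gives an empty list
        rows_demo.append(cells)

print(rows_demo)
```

Skipping empty rows inside the loop has the same effect as slicing off the first element afterwards, and it does not depend on the header being the first row.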


### Now we drop rows that have no borough assigned

In [189]:
Toronto.drop(Toronto[Toronto.Borough=='Not assigned'].index,axis=0,inplace=True)
Toronto

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


### Merging rows with the same postal code

In [191]:
Toronto = Toronto.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
Toronto

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


### If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough

In [192]:
for i in range(0,Toronto.shape[0]):
    if Toronto.iloc[i,2] == 'Not assigned':
        Toronto.iloc[i,2] = Toronto.iloc[i,1]
Toronto

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
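
The row-by-row loop above can also be written without a loop. The following is a sketch on invented data (the names `demo_fill` and `mask` are assumptions) that uses a boolean mask with `.loc` to copy the borough into the neighborhood wherever the neighborhood is 'Not assigned'.

```python
import pandas as pd

# Made-up sample: the first row has a borough but no neighborhood
demo_fill = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# select the rows to fix, then copy the borough over in one assignment
mask = demo_fill["Neighborhood"] == "Not assigned"
demo_fill.loc[mask, "Neighborhood"] = demo_fill.loc[mask, "Borough"]
print(demo_fill)
```

Addressing the columns by name also avoids relying on their positions, as `iloc[i, 2]` does.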


### Let's see how many rows we got

In [193]:
Toronto.shape

(103, 3)

### Next step is to get the coordinates for each postal code using the geocoder package

In [194]:
#!pip install geocoder
"""
import geocoder
latitude = []
longitude = []


for post_code in Toronto.PostalCode:
    g = geocoder.google('{}, Toronto, Ontario'.format(post_code))
    latitude.append(g.latlng[0])
    longitude.append(g.latlng[1])
    
latitude
longitude
"""

"\nimport geocoder\nlatitude = []\nlongitude = []\n\n\nfor post_code in Toronto.PostalCode:\n    g = geocoder.google('{}, Toronto, Ontario'.format(post_code))\n    latitude.append(g.latlng[0])\n    longitude.append(g.latlng[1])\n    \nlatitude\nlongitude\n"

### Apparently the geocoder package no longer works
#### Fortunately, I found a CSV file with the geospatial data


In [195]:
latlong = pd.read_csv('http://cocl.us/Geospatial_data')

In [196]:
latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [197]:
Toronto=Toronto.join(latlong[['Latitude','Longitude']])
Toronto

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
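
The `join` above matches rows by index position, so it relies on both frames listing the postal codes in the same order. A more defensive variant, sketched here on invented rows, is to merge on the postal code itself. The frames `codes_demo` and `coords_demo` are hypothetical; the column names follow the ones shown above.

```python
import pandas as pd

# Made-up frames whose rows are deliberately in different orders
codes_demo = pd.DataFrame({
    "PostalCode": ["M1C", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
})
coords_demo = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# match on the key columns, then drop the duplicate key from the right frame
merged_coords = codes_demo.merge(
    coords_demo, left_on="PostalCode", right_on="Postal Code", how="left"
).drop(columns="Postal Code")
print(merged_coords)
```

With `how="left"` every row of the left frame is kept, and a postal code with no match would show up as missing coordinates instead of being silently paired with the wrong row.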


### Now let's select the boroughs whose names contain the word Toronto

In [198]:
Toronto_nhb = Toronto[Toronto['Borough'].str.contains("Toronto")]
Toronto_nhb


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049


In [199]:
print('The Toronto boroughs dataframe has {} boroughs and {} postal codes.'.format(len(Toronto_nhb['Borough'].unique()),Toronto_nhb.shape[0]))

The Toronto boroughs dataframe has 4 boroughs and 38 postal codes.


### Let's create a map of Toronto showing the boroughs that have Toronto in their name

In [200]:
#!conda install -c conda-forge geopy --yes # uncomment this line if geopy is not installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Toronto, CA'

# recent geopy releases require a user_agent; the name used here is an arbitrary placeholder
geolocator = Nominatim(user_agent='toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))



The geographical coordinates of Toronto are 43.653963, -79.387207.


In [201]:
map_Toronto = folium.Map(location=[latitude,longitude],zoom_start=10)
for lat, lng, borough, neighborhood in zip(Toronto_nhb['Latitude'], Toronto_nhb['Longitude'], Toronto_nhb['Borough'], Toronto_nhb['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

## Cluster Neighborhoods

#### Run k-means to cluster the neighborhoods into 4 clusters

In [202]:
num_clusters = 4
kmeans = KMeans(n_clusters=num_clusters,random_state=0).fit(Toronto_nhb[['Latitude','Longitude']])
kmeans.labels_

array([0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 2, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 0], dtype=int32)
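
To see what `labels_` holds, here is a minimal sketch on four invented coordinate pairs that form two obvious groups. The array `points_demo` and the choice of two clusters are assumptions for the illustration; `n_init` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (latitude, longitude) pairs: two in the east, two in the west
points_demo = np.array([
    [43.67, -79.30],
    [43.66, -79.31],
    [43.65, -79.45],
    [43.64, -79.46],
])

model_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points_demo)
labels_demo = model_demo.labels_  # one integer label per input row, in input order
print(labels_demo)
```

The label numbers themselves are arbitrary; what matters is which rows share a label.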

In [203]:
# assign returns a new DataFrame, which avoids the SettingWithCopyWarning on the filtered slice
Toronto_nhb = Toronto_nhb.assign(Cluster_label=kmeans.labels_)
Toronto_nhb


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster_label
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,0
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,0
43,M4M,East Toronto,Studio District,43.659526,-79.340923,0
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197,2
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,2
47,M4S,Central Toronto,Davisville,43.704324,-79.38879,2
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,2
49,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049,2


#### Finally, let's visualize the resulting clusters

In [206]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(num_clusters)
ys = [i+x+(i*x)**2 for i in range(num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]




# add markers to the map
for lat, lon, poi, cluster in zip(Toronto_nhb['Latitude'], Toronto_nhb['Longitude'], Toronto_nhb['Neighborhood'], Toronto_nhb['Cluster_label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to num_clusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
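
In the cell above, `ys` is only used for its length, so the color scheme boils down to sampling the rainbow colormap at evenly spaced points. A sketch of that step on its own, with the hypothetical names `n_demo` and `palette_demo`:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

n_demo = 4  # assumed number of clusters for the illustration

# sample the colormap at n_demo evenly spaced positions and convert each RGBA row to hex
palette_demo = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, n_demo))]
print(palette_demo)
```

Each cluster label can then pick its color by position in the list.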


### Now we can examine clusters

#### Cluster 1 (label 0)

In [207]:
Toronto_nhb.loc[Toronto_nhb['Cluster_label'] == 0]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster_label
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,0
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,0
43,M4M,East Toronto,Studio District,43.659526,-79.340923,0
87,M7Y,East Toronto,Business reply mail Processing Centre969 Eastern,43.662744,-79.321558,0


#### Cluster 2 (label 1)

In [208]:
Toronto_nhb.loc[Toronto_nhb['Cluster_label'] == 1]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster_label
50,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,1
51,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,1
52,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,1
53,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636,1
54,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,1
55,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1
56,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1
57,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,1
58,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568,1
59,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752,1


#### Cluster 3 (label 2)

In [209]:
Toronto_nhb.loc[Toronto_nhb['Cluster_label'] == 2]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster_label
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197,2
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,2
47,M4S,Central Toronto,Davisville,43.704324,-79.38879,2
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,2
49,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049,2
63,M5N,Central Toronto,Roselawn,43.711695,-79.416936,2
64,M5P,Central Toronto,"Forest Hill North, Forest Hill West",43.696948,-79.411307,2


#### Cluster 4 (label 3)

In [210]:
Toronto_nhb.loc[Toronto_nhb['Cluster_label'] == 3]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster_label
75,M6G,Downtown Toronto,Christie,43.669542,-79.422564,3
76,M6H,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259,3
77,M6J,West Toronto,"Little Portugal, Trinity",43.647927,-79.41975,3
78,M6K,West Toronto,"Brockton, Exhibition Place, Parkdale Village",43.636847,-79.428191,3
82,M6P,West Toronto,"High Park, The Junction South",43.661608,-79.464763,3
83,M6R,West Toronto,"Parkdale, Roncesvalles",43.64896,-79.456325,3
84,M6S,West Toronto,"Runnymede, Swansea",43.651571,-79.48445,3
