##### Web scraping is a software technique for extracting information from websites. It mostly focuses on transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet). BeautifulSoup is a Python package used for web scraping. It does not fetch the web page for us, so it is used together with an HTTP library such as requests or urllib.request (imported below as urllib2). Here we will scrape tables from web pages.
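To make the split between fetching and parsing concrete, here is a minimal sketch that parses a hard-coded HTML string instead of a downloaded page. The snippet and its variable names (`html`, `soup`, `header`, `records`) are illustrative assumptions, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page that some HTTP library has fetched.
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# BeautifulSoup only parses; it never touches the network.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

# Header cells live in <th>, data cells in <td>.
header = [th.get_text(strip=True) for th in table.find_all("th")]
records = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")  # skip the header row, which has no <td> cells
]

print(header)
print(records)
```

Unstructured markup goes in and a list of rows comes out, which is the transformation the rest of the notebook performs on the real page.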

In [1]:
#Importing required libraries
from bs4 import BeautifulSoup #To extract table,lists,paragraphs from wikipedia
import requests
import matplotlib.pyplot as plt
import urllib.request as urllib2 #Used in combination with BeautifulSoup
import numpy as np
import pandas as pd
import re


The data scraped from Wikipedia consists of Postcode, Borough and Neighbourhood. In the following code we download the page and parse it into a bs4.BeautifulSoup object.

In [2]:
#First we specify the URL.
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r=requests.get(url)
H=BeautifulSoup(r.content,'html.parser') #Parse the downloaded HTML; naming the parser explicitly avoids a bs4 warning.

To inspect the nested structure of the HTML page, we use prettify.

In [3]:
print(H.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":876823784,"wgRevisionId":876823784,"wgArticleId":539066,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wg

We first need to select the table that we'd like to scrape. Since a web page can contain several tables, we read the opening tag of each table, which carries its class name, into a list.

In [4]:
htmlpage=urllib2.urlopen(url) #Query the website and return the html to the variable 'htmlpage'
lst=[] #Empty list initialised
for line in htmlpage:
    line=line.rstrip()
    if re.search(b'table class',line) :
        lst.append(line)

In [5]:
len(lst)
lst

[b'<table class="wikitable sortable">',
 b'<table class="multicol" role="presentation" style="border-collapse: collapse; padding: 0; border: 0; background:transparent; width:100%;">',
 b'<table class="navbox">']
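The same list of table classes can also be obtained from the parsed tree rather than by matching raw lines with a regular expression. The following is a sketch on a hypothetical three-table snippet; `html` and `table_classes` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with three tables, mirroring the classes listed above.
html = """
<table class="wikitable sortable"><tr><td>M1A</td></tr></table>
<table class="multicol" role="presentation"><tr><td>x</td></tr></table>
<table class="navbox"><tr><td>y</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# bs4 returns the class attribute as a list of tokens, so join them back together.
table_classes = [" ".join(t.get("class", [])) for t in soup.find_all("table")]
print(table_classes)
```

Working on the parsed tree does not depend on how the HTML happens to be split across lines, which the line-by-line regex does.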

We will scrape the first table, so index 0 in lst gives us its class name. The table is read with BeautifulSoup's find function. The simplest option is to type in the class name shown in lst, which in this case is "wikitable sortable".

In [6]:
table=H.find('table',{'class':'wikitable sortable'}) #attrs must be a dict; the original set literal only matched by accident

In [7]:
y=lst[0]    #We capture the tag from the list and then strip off the unnecessary characters
print(y) 
extr=re.findall(b'"([^"]*)"',y)
#table=H.find('table',{'class',str(extr).strip("'[]'")})

b'<table class="wikitable sortable">'


After stripping off the unnecessary characters, we read the headers and the rows separately.

In [8]:
headers=[header.text for header in table.find_all('th')]
headers

['Postcode', 'Borough', 'Neighbourhood\n']

In [9]:
rows=[]
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])

In [10]:
df1=pd.DataFrame(rows,columns=headers)

df1.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,,,
1,b'M1A',b'Not assigned',b'Not assigned\n'
2,b'M2A',b'Not assigned',b'Not assigned\n'
3,b'M3A',b'North York',b'Parkwoods\n'
4,b'M4A',b'North York',b'Victoria Village\n'


The first row contains only None values because the header row has no td cells. To remove it we use dropna.

In [11]:
df1.dropna(axis=0,inplace=True)
df1.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
1,b'M1A',b'Not assigned',b'Not assigned\n'
2,b'M2A',b'Not assigned',b'Not assigned\n'
3,b'M3A',b'North York',b'Parkwoods\n'
4,b'M4A',b'North York',b'Victoria Village\n'
5,b'M5A',b'Downtown Toronto',b'Harbourfront\n'


Every value in the table is prefixed with b because it is a byte string (bytes), not a str. We decode each column with str.decode as follows.

In [12]:
df1['Borough']=df1['Borough'].str.decode("utf-8")


In [13]:
df1['Postcode']=df1['Postcode'].str.decode("utf-8")
df1['Neighbourhood\n']=df1['Neighbourhood\n'].str.decode("utf-8")


In [14]:
df1.columns

Index(['Postcode', 'Borough', 'Neighbourhood\n'], dtype='object')

In [15]:
df1.columns=[i.strip() for i in df1.columns]

In [16]:
df1.columns
df1['Neighbourhood']=[i.strip() for i in df1.Neighbourhood]

In [17]:
df1.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


After decoding, we get the table in the required format.
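The decode and strip steps above could also be written as a single loop over the columns. This is a sketch on a small hypothetical frame (`frame` is an assumed name), not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical frame with byte values and a trailing newline, as in the scraped table.
frame = pd.DataFrame({
    "Postcode": [b"M3A", b"M4A"],
    "Borough": [b"North York", b"North York"],
    "Neighbourhood\n": [b"Parkwoods\n", b"Victoria Village\n"],
})

# Clean the column names first, then decode and strip every column.
frame.columns = frame.columns.str.strip()
for col in frame.columns:
    frame[col] = frame[col].str.decode("utf-8").str.strip()

print(frame)
```

Another option would be to leave out the `.encode('utf8')` call when collecting the rows, so that the values are ordinary strings from the start.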

We will remove all rows whose Borough value is 'Not assigned'.

In [18]:
df1.head()
df2=df1[df1['Borough'] !='Not assigned']
df2.head()
#df2.index

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


We then combine the rows that share a postcode so that their neighbourhoods are collected into one list, as shown below.

In [19]:
df3=df2.Neighbourhood.groupby([df2.Postcode,df2.Borough]).apply(list).reset_index()
df3.head(100)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"[Rouge, Malvern]"
1,M1C,Scarborough,"[Highland Creek, Rouge Hill, Port Union]"
2,M1E,Scarborough,"[Guildwood, Morningside, West Hill]"
3,M1G,Scarborough,[Woburn]
4,M1H,Scarborough,[Cedarbrae]
5,M1J,Scarborough,[Scarborough Village]
6,M1K,Scarborough,"[East Birchmount Park, Ionview, Kennedy Park]"
7,M1L,Scarborough,"[Clairlea, Golden Mile, Oakridge]"
8,M1M,Scarborough,"[Cliffcrest, Cliffside, Scarborough Village West]"
9,M1N,Scarborough,"[Birch Cliff, Cliffside West]"


Next we check the Neighbourhood column for any 'Not assigned' value. Where one is found, we replace it with the corresponding value from the Borough column.

In [20]:
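def replace(df,col,KEY,val):
    #Boolean mask of the rows whose value in col equals KEY
    m=[v==KEY for v in df[col]]
    #Assign val to the masked rows; a Series is aligned on the index
    df.loc[m,col]=val

#After grouping, each Neighbourhood is a list, so the key is ['Not assigned']
replace(df3,'Neighbourhood',['Not assigned'],df3.Borough)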
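The same replacement can be sketched with a boolean mask on plain string values, for example if it were applied before grouping. The two-row frame below is illustrative, and `frame` and `missing` are assumed names.

```python
import pandas as pd

# Illustrative rows: one neighbourhood is missing, one is present.
frame = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Select the rows with a missing neighbourhood and copy the borough into them.
missing = frame["Neighbourhood"] == "Not assigned"
frame.loc[missing, "Neighbourhood"] = frame.loc[missing, "Borough"]

print(frame)
```

The vectorised comparison only works while the column holds strings. Once the values are lists, an element-wise comparison like the one in the replace function above is needed.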


In [21]:
df3.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"[Rouge, Malvern]"
1,M1C,Scarborough,"[Highland Creek, Rouge Hill, Port Union]"
2,M1E,Scarborough,"[Guildwood, Morningside, West Hill]"
3,M1G,Scarborough,[Woburn]
4,M1H,Scarborough,[Cedarbrae]


In [22]:
df3.shape

(103, 3)

In [23]:
#Checking whether df3 is a dataframe or not
type(df3)

pandas.core.frame.DataFrame

In [24]:
df3.dtypes

Postcode         object
Borough          object
Neighbourhood    object
dtype: object

After running the above code, we get a dataframe with 103 rows and 3 columns. We have taken a Wikipedia page of postal codes and reproduced its table here using Python.

In [25]:
df3.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"[Rouge, Malvern]"
1,M1C,Scarborough,"[Highland Creek, Rouge Hill, Port Union]"
2,M1E,Scarborough,"[Guildwood, Morningside, West Hill]"
3,M1G,Scarborough,[Woburn]
4,M1H,Scarborough,[Cedarbrae]


In [26]:
DF=pd.read_csv("https://cocl.us/Geospatial_data")
DF.head()
DF.shape

(103, 3)

In [27]:
DF_toronto=pd.merge(df3,DF,left_on='Postcode',right_on='Postal Code',how='right').drop('Postal Code',axis=1)
DF_toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"[Rouge, Malvern]",43.806686,-79.194353
1,M1C,Scarborough,"[Highland Creek, Rouge Hill, Port Union]",43.784535,-79.160497
2,M1E,Scarborough,"[Guildwood, Morningside, West Hill]",43.763573,-79.188711
3,M1G,Scarborough,[Woburn],43.770992,-79.216917
4,M1H,Scarborough,[Cedarbrae],43.773136,-79.239476
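When joining the two tables, it can help to ask pandas to check that each postcode appears only once on either side. The sketch below uses two rows copied from the output above and the `validate` argument of `merge`; the frame names are illustrative.

```python
import pandas as pd

# Two rows from the grouped table and the matching rows of the coordinates file.
neighbourhoods = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": [["Rouge", "Malvern"], ["Highland Creek", "Rouge Hill", "Port Union"]],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate="one_to_one" raises MergeError if a key is duplicated on either side.
merged = neighbourhoods.merge(
    coords,
    left_on="Postcode",
    right_on="Postal Code",
    how="inner",
    validate="one_to_one",
).drop(columns="Postal Code")

print(merged)
```

Here an inner join is used, so postcodes missing from either table are dropped. The notebook's `how='right'` instead keeps every row of the coordinates file.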


In [28]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level function since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Use the geopy library to get the latitude and longitude values of Toronto.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ny_explorer, as shown below.

Folium is a great visualization library. Feel free to zoom into the map below, and click on each circle marker to reveal the name of the neighbourhood and its respective borough.

In [29]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [32]:
# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10.7)

# add markers to map
for lat, lng, borough, neighbourhood in zip(DF_toronto['Latitude'], DF_toronto['Longitude'],DF_toronto['Borough'], DF_toronto['Neighbourhood']):
    
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Toronto)  
    
map_Toronto
