<a href="https://colab.research.google.com/github/1jlal/Coursera_Capstone/blob/main/Table_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Scraping Canada Postal Codes Table from Wikipedia**

The table is fetched with requests, parsed with the BeautifulSoup library, and loaded into a pandas dataframe.

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

In [2]:
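r = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
r.raise_for_status()  # stop early if the page could not be downloaded

# Name the parser explicitly so BeautifulSoup does not have to guess one.
webpage = bs(r.content, 'html.parser')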

In [3]:
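# The first wikitable on the page holds the postal codes.
table = webpage.select('table.wikitable')[0]

# Column names come from the header cells.
columns = table.find_all('th')
column_names = [c.get_text().strip() for c in columns]

# Collect the text of every data cell, one list per table row.
# The header row has no <td> cells, so it yields an empty list here.
rows = []
table_rows = table.find('tbody').find_all('tr')

for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.get_text().strip() for cell in cells]
    rows.append(row)

df = pd.DataFrame(rows, columns=column_names)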

In [4]:
# The header row contains only <th> cells, so the first dataframe row is empty; drop it.
df.drop(df.index[0], inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [5]:
df_new = df[df.Borough != 'Not assigned']
df_new

Unnamed: 0,Postal Code,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
161,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
166,M4Y,Downtown Toronto,Church and Wellesley
169,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
170,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [6]:
df_new.shape

(103, 3)

In [10]:
!pip install geocoder

import geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592c

### Using geocoder to obtain coordinates of the postal codes

In [11]:
latitude = []
longitude = []
for code in df_new['Postal Code']:
    address = '{}, Toronto, Ontario'.format(code)
    g = geocoder.arcgis(address)
    # ArcGIS sometimes returns no result; query again until coordinates come back.
    # Note: this loop never ends if a postal code cannot be resolved at all.
    while g.latlng is None:
        g = geocoder.arcgis(address)
    print(code, g.latlng)
    latitude.append(g.latlng[0])
    longitude.append(g.latlng[1])

M3A [43.75245000000007, -79.32990999999998]
M4A [43.73057000000006, -79.31305999999995]
M5A [43.65512000000007, -79.36263999999994]
M6A [43.72327000000007, -79.45041999999995]
M7A [43.66253000000006, -79.39187999999996]
M9A [43.662630000000036, -79.52830999999998]
M1B [43.811390000000074, -79.19661999999994]
M3B [43.74923000000007, -79.36185999999998]
M4B [43.70718000000005, -79.31191999999999]
M5B [43.65739000000008, -79.37803999999994]
M6B [43.70687000000004, -79.44811999999996]
M9B [43.65034000000003, -79.55361999999997]
M1C [43.78574000000003, -79.15874999999994]
M3C [43.72168000000005, -79.34351999999996]
M4C [43.68970000000007, -79.30681999999996]
M5C [43.65215000000006, -79.37586999999996]
M6C [43.69211000000007, -79.43035999999995]
M9C [43.64857000000006, -79.57824999999997]
M1E [43.765750000000025, -79.17469999999997]
M4E [43.67709000000008, -79.29546999999997]
M5E [43.64536000000004, -79.37305999999995]
M6E [43.68784000000005, -79.45045999999996]
M1G [43.76812000000007, -79.2
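
The retry loop above repeats the same request until ArcGIS answers, which can spin forever if a postal code never resolves. Below is a minimal sketch of a bounded alternative. `fetch_with_retry`, `max_attempts`, and `delay` are hypothetical names, not part of geocoder, and the lookup is passed in as a plain callable so the helper can be tried without network access.

```python
import time


def fetch_with_retry(lookup, query, max_attempts=5, delay=1.0):
    """Call lookup(query) until it returns a non-None value or attempts run out.

    `lookup` is any callable returning [lat, lng] or None. This helper and its
    parameters are illustrative assumptions, not part of the geocoder library.
    """
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay)  # pause before the next try
    return None  # the caller decides what to do with an unresolved code


# Stand-in lookup that fails twice, then succeeds (no network involved).
calls = {'count': 0}


def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.75245, -79.32991]


coords = fetch_with_retry(flaky_lookup, 'M3A, Toronto, Ontario', delay=0)
```

In the notebook, the callable could be something like `lambda q: geocoder.arcgis(q).latlng`, and a `None` return would mark a postal code to skip or inspect by hand.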

Converting the coordinate lists into a dataframe

In [25]:
# Build the dataframe directly from the two lists, one column each.
coord_df = pd.DataFrame({'Latitude': latitude, 'Longitude': longitude})
coord_df.head()

Unnamed: 0,Latitude,Longitude
0,43.75245,-79.32991
1,43.73057,-79.31306
2,43.65512,-79.36264
3,43.72327,-79.45042
4,43.66253,-79.39188


Adding the coordinates dataframe to the original dataframe. `pd.concat` aligns rows on index labels, and `df_new` still carries the gapped index left over from filtering, so its index is reset first to pair each postal code with its own coordinates.

In [45]:
df_cnd = pd.concat([df_new.reset_index(drop=True), coord_df], axis=1)
df_cnd.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188


Dropping any rows with missing values (none are expected once the indexes are aligned)

In [49]:
df_cnd.dropna(inplace=True)
df_cnd.shape
df_cnd.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
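
`pd.concat(..., axis=1)` pairs rows by index label, not by position, so a filtered dataframe only lines up with a newly built one when their indexes agree. A small sketch with made-up values shows the difference; the names `left`, `right`, `misaligned`, and `aligned` are illustrative only.

```python
import pandas as pd

# Toy frame whose index has gaps, like a dataframe after filtering (values are made up).
left = pd.DataFrame({'Postal Code': ['M3A', 'M4A']}, index=[3, 4])
# Freshly built frame with the default 0..n-1 index, like a frame built from lists.
right = pd.DataFrame({'Latitude': [43.75, 43.73]})

# Labels 3, 4 and 0, 1 never match, so every row ends up half empty.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index gives both frames labels 0..n-1, so rows pair by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned)
print(aligned)
```

Calling `dropna()` on the misaligned result would discard every row that lacks a match, and any rows whose labels happen to coincide would be paired with the wrong values, which is why aligning the indexes before concatenating is the safer order.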
