# First Data Scrape of Wikipedia

In this notebook I am learning how to scrape data from a website (hopefully anyway).

Website to be scraped: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

This page lists neighborhoods by postal code within the city of Toronto, Canada. The aim is to scrape the Wikipedia page, create a dataframe containing the data in tabular format, and then store the data in a .csv file for further analysis.

Ultimately, these neighborhoods will be used to look up GPS coordinates, which in turn will be used to pull venue data from the Foursquare API and group those venues into clusters.

In [1]:
# import necessary libraries

import numpy as np
import pandas as pd

from urllib.request import urlopen
from bs4 import BeautifulSoup

In [2]:
# define a function that takes the desired url,
# reads its contents, and returns them as parsed html

def grab_html_contents(url):
    # the 'with' block closes the connection even if read() fails
    with urlopen(url) as response:
        html_page = response.read()
    soup = BeautifulSoup(html_page, 'html.parser')
    return soup
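
To try the parsing step without making a network request, the same `BeautifulSoup(..., 'html.parser')` call can be run on an inline string. The sketch below is only an illustration: `SAMPLE_HTML`, `sample_soup`, and `table_summaries` are hypothetical names, and the markup is a made-up two-table page that imitates the structure of the Wikipedia page rather than a copy of it. It lists each table's class attribute and header text, which is one way to spot the table of interest.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: one data table and one unrelated table
SAMPLE_HTML = """
<html><body>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td>
      <td><a href="/wiki/Parkwoods">Parkwoods</a></td></tr>
</table>
<table class="navbox"><tr><td>navigation links</td></tr></table>
</body></html>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Collect each table's class list and the text of its header cells
table_summaries = []
for table in sample_soup.find_all('table'):
    classes = table.get('class', [])
    header_cells = [th.get_text(strip=True) for th in table.find_all('th')]
    table_summaries.append((classes, header_cells))

for classes, header_cells in table_summaries:
    print(classes, header_cells)
```

A short summary like this can be easier to scan than the full markup of every table when a page contains many of them.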

In [3]:
# Look at html page contents of desired page
# and parse through html tables to find the table desired

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

html_content = grab_html_contents(url)

tables = html_content.find_all('table')
for table in tables:
    print(table.prettify())

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postcode
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Parkwoods" title="Parkwoods">
     Parkwoods
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Victoria_Village" title="Victoria Village">
     Victoria Village
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
     Downtown Toronto
    </a>
   </td>
   <td>
    <a href="

In [4]:
# After inspecting the html, we see that 'wikitable sortable'
# is the table we need, so now we'll loop over its rows
# and perform the scrape

import csv

table = html_content.find('table',
                          {'class': 'wikitable sortable'})

rows = table.find_all('tr')

# create the .csv file the data will be saved to

file_name = 'toronto_postal_data.csv'
headers = ['PostalCode', 'Borough', 'Neighborhood']

# csv.writer quotes any field that contains a comma, and the
# 'with' block closes the file even if the loop raises an error
with open(file_name, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(headers)

    for index, row in enumerate(rows):
        cells = row.find_all('td')
        # the header row has only <th> cells, so it is skipped here
        if len(cells) >= 3:
            postal_data = cells[0].text.strip()
            borough_data = cells[1].text.strip()
            neighborhood_data = cells[2].text.strip()
            print('Observation {}'.format(index))
            print('Postal Code: ' + postal_data)
            print('Borough: ' + borough_data)
            print('Neighborhood: ' + neighborhood_data)
            writer.writerow([postal_data, borough_data, neighborhood_data])

Observation 1
Postal Code: M1A
Borough: Not assigned
Neighborhood: Not assigned
Observation 2
Postal Code: M2A
Borough: Not assigned
Neighborhood: Not assigned
Observation 3
Postal Code: M3A
Borough: North York
Neighborhood: Parkwoods
Observation 4
Postal Code: M4A
Borough: North York
Neighborhood: Victoria Village
Observation 5
Postal Code: M5A
Borough: Downtown Toronto
Neighborhood: Harbourfront
Observation 6
Postal Code: M6A
Borough: North York
Neighborhood: Lawrence Heights
Observation 7
Postal Code: M6A
Borough: North York
Neighborhood: Lawrence Manor
Observation 8
Postal Code: M7A
Borough: Downtown Toronto
Neighborhood: Queen's Park
Observation 9
Postal Code: M8A
Borough: Not assigned
Neighborhood: Not assigned
Observation 10
Postal Code: M9A
Borough: Etobicoke
Neighborhood: Islington Avenue
Observation 11
Postal Code: M1B
Borough: Scarborough
Neighborhood: Rouge
Observation 12
Postal Code: M1B
Borough: Scarborough
Neighborhood: Malvern
Observation 13
Postal Code: M2B
Borough: No
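
One thing to watch when saving scraped text to a .csv file is a field that itself contains a comma, since joining fields by hand would split it into extra columns. The sketch below, using only the standard library `csv` module, shows how such a field is quoted on the way out and recovered on the way back in. The rows are hypothetical: `M9Z`, `Example Borough`, and the comma-containing neighborhood name are invented for illustration and are not taken from the scraped page.

```python
import csv
import io

# Hypothetical rows; the second neighborhood name contains a comma
sample_rows = [
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M9Z', 'Example Borough', 'First Place, Second Place'),
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['PostalCode', 'Borough', 'Neighborhood'])
writer.writerows(sample_rows)

csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back; each row should still have three fields
buffer.seek(0)
parsed = list(csv.reader(buffer))
for record in parsed:
    print(len(record), record)
```

The header here is written without spaces after the commas, so a later `pd.read_csv` call would see column names such as `Borough` rather than ` Borough`.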

In [9]:
toronto_df = pd.read_csv('toronto_postal_data.csv')
print('Number of observations: {} \n'.format(toronto_df.shape[0]))
print('Number of features: {} \n'.format(toronto_df.shape[1]))
toronto_df.head()

Number of observations: 287 

Number of features: 3 



  PostalCode           Borough      Neighborhood
0        M1A      Not assigned      Not assigned
1        M2A      Not assigned      Not assigned
2        M3A        North York         Parkwoods
3        M4A        North York  Victoria Village
4        M5A  Downtown Toronto      Harbourfront
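
The first rows already show the placeholder text `Not assigned`, so a quick check of how many rows carry it is a natural first look before any cleanup. The sketch below is a rough illustration, not the cleaning approach used later: it rebuilds only the five rows shown above as an inline string, and `SAMPLE_CSV`, `sample_df`, `unassigned`, and `assigned_df` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical sample: the five rows displayed above, as CSV text
SAMPLE_CSV = """PostalCode,Borough,Neighborhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
"""

sample_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Boolean mask marking rows whose borough is the placeholder text
unassigned = sample_df['Borough'] == 'Not assigned'
n_unassigned = int(unassigned.sum())

# Keep only the rows with a real borough and renumber the index
assigned_df = sample_df[~unassigned].reset_index(drop=True)

print('Unassigned rows: {}'.format(n_unassigned))
print(assigned_df)
```

The same mask could be applied to `toronto_df` once the full file is loaded, assuming its column is named `Borough` exactly.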


Success! Data scraped and stored as a dataframe! Now... time to clean it up. I will do that in the next notebook.