First, install the Beautiful Soup package, which is used for web scraping.

(Note that we install the latest Beautiful Soup release, version 4, together with the lxml parser.)

In [1]:
!conda install -c conda-forge  beautifulsoup4  --yes

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    beautifulsoup4: 4.6.0-py35h442a8c9_1 --> 4.6.3-py35_0 conda-forge

beautifulsoup4 100% |################################| Time: 0:00:00   1.44 MB/s


In [2]:
!conda install -c conda-forge  lxml  --yes

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    libxml2: 2.9.4-h6b072ca_5     --> 2.9.8-h422b904_2     conda-forge
    libxslt: 1.1.29-hcf9102b_5    --> 1.1.32-h88dbc4e_2    conda-forge
    lxml:    4.1.0-py35ha401a81_0 --> 4.2.5-py35hc9114bc_0 conda-forge

libxml2-2.9.8- 100% |################################| Time: 0:00:00   7.88 MB/s
libxslt-1.1.32 100% |################################| Time: 0:00:00  58.45 MB/s
lxml-4.2.5-py3 100% |################################| Time: 0:00:00  57.42 MB/s


Now import the necessary Python libraries.

In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [4]:
# Create Pandas dataframe to store Toronto neighborhood data
# Only have three columns: PostalCode, Borough, and Neighborhood

# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood']

# instantiate the dataframe
df_neighborhoods = pd.DataFrame(columns=column_names)

# take a look at the empty dataframe, to check that columns are correctly named
df_neighborhoods

,PostalCode,Borough,Neighborhood


Now use the Python requests library to read the contents of the Wikipedia page as a string of HTML.

This HTML string is then parsed using the Beautiful Soup library (with the lxml parser).

In [5]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
#print(source)
soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())


We now use the structure of the HTML to find the postal code, borough and neighborhood data.

Note that it is necessary to examine the HTML to see how it should be parsed to find this data.

In [6]:
# now search for the PostalCode, Borough and Neighborhood data in the HTML data
body = soup.find('body')
#print(body.prettify())
table = body.find('table', class_='wikitable sortable')
#print(table.prettify())
table_data = table.tbody.find_all('tr')
# skip first occurrence, as that is just header data
for i in range(1, len(table_data)):
    data = table_data[i].text.split('\n')
    postcode = data[1]
    borough = data[2]
    neighborhood = data[3]
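
Splitting the row text on newlines depends on the exact whitespace layout of the page. As a hedged alternative, the sketch below reads the `td` cells of each row directly. The inline `sample_html` is a small illustrative stand-in for the Wikipedia table, `extract_rows` is a hypothetical helper name, and the built-in `html.parser` is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the Wikipedia table (not the real page content).
sample_html = """
<table class="wikitable sortable">
<tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3M</td><td>North York</td><td>Downsview Central</td></tr>
<tr><td>M4J</td><td>East York</td><td>East Toronto</td></tr>
</tbody>
</table>
"""

def extract_rows(html):
    """Return (postcode, borough, neighborhood) tuples from the first wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        # The header row contains only <th> cells, so it yields no <td> cells.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 3:
            rows.append(tuple(cells))
    return rows

rows = extract_rows(sample_html)
for postcode, borough, neighborhood in rows:
    print(postcode, borough, neighborhood, sep=" | ")
```

Because the header row is skipped by its lack of `td` cells, no index arithmetic is needed to start from the second row.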

Now that we have parsed this data from the HTML of the web page, it will be used to populate the Pandas dataframe.

Note that if a borough is not assigned, the row is skipped; if a neighborhood is not assigned, it takes the borough name.

In [7]:
# Now read through the table data and collect it for the dataframe.
# A dictionary is needed to build up the list of neighborhoods for each postal code:
# the key is the postal code and the value is the list of neighborhoods.
neighborhood_dict = {}
borough_dict = {}
for i in range(1, len(table_data)):
    data = table_data[i].text.split('\n')
    postcode = str(data[1])
    borough = str(data[2])
    neighborhood = str(data[3])
    if borough == 'Not assigned':
        continue
    elif neighborhood == 'Not assigned':
        neighborhood = borough
    if postcode not in neighborhood_dict:
        neighborhood_dict[postcode] = []
    if neighborhood not in neighborhood_dict[postcode]:
        neighborhood_dict[postcode].append(neighborhood)
    if postcode not in borough_dict:
        borough_dict[postcode] = ""
    if len(borough) > 0:
        borough_dict[postcode] = borough
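
The two cleaning rules and the grouping step can be tried in isolation on a few hand-written rows. This is a minimal sketch using only the standard library; `group_neighborhoods` is a hypothetical helper name, and the `M9Z` row is made up purely to exercise the "neighborhood not assigned" rule.

```python
def group_neighborhoods(rows):
    """Group (postcode, borough, neighborhood) rows into one record per postcode."""
    neighborhoods = {}
    boroughs = {}
    for postcode, borough, neighborhood in rows:
        # Rule 1: rows without an assigned borough are skipped.
        if borough == "Not assigned":
            continue
        # Rule 2: an unassigned neighborhood takes the borough name.
        if neighborhood == "Not assigned":
            neighborhood = borough
        names = neighborhoods.setdefault(postcode, [])
        if neighborhood not in names:
            names.append(neighborhood)
        boroughs[postcode] = borough
    return [
        {"PostalCode": pc, "Borough": boroughs[pc], "Neighborhood": ", ".join(names)}
        for pc, names in neighborhoods.items()
    ]

# Illustrative rows; "M9Z" / "Example Borough" are invented for this example.
sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M4B", "East York", "Woodbine Gardens"),
    ("M4B", "East York", "Parkview Hill"),
    ("M9Z", "Example Borough", "Not assigned"),
]
records = group_neighborhoods(sample_rows)
for record in records:
    print(record)
```

A list of dictionaries in this shape can be passed straight to `pd.DataFrame`.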

Now add this information to the pandas dataframe, converting each list of neighborhood names into a comma-separated string.

In [8]:
# now add this data to the dataframe
data_list = []
for postcode, neighborhoods in neighborhood_dict.items():
    data_dict = {}
    data_dict['PostalCode'] = postcode
    data_dict['Borough'] = borough_dict[postcode]
    # convert the list of neighborhood names into a single comma-separated string
    # (join also handles a one-element list, returning that element unchanged)
    data_dict['Neighborhood'] = ", ".join(neighborhoods)
    data_list.append(data_dict)

df_neighborhoods = pd.DataFrame(data_list)
df_neighborhoods = df_neighborhoods[['PostalCode', 'Borough', 'Neighborhood']]
df_neighborhoods.head(10)

,PostalCode,Borough,Neighborhood
0,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station"
1,M4B,East York,"Woodbine Gardens, Parkview Hill"
2,M4J,East York,East Toronto
3,M6J,West Toronto,"Little Portugal, Trinity"
4,M3M,North York,Downsview Central
5,M6L,North York,"Maple Leaf Park, North Park, Upwood Park"
6,M9L,North York,Humber Summit
7,M4M,East Toronto,Studio District
8,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
9,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel"


Next, display the number of rows in our pandas dataframe.

In [9]:
# let's find the number of rows in our pandas dataframe
print("The number of rows in our pandas dataframe is:", df_neighborhoods.shape[0])

The number of rows in our pandas dataframe is: 103


Now let's add Latitude and Longitude columns into the pandas dataframe...

In [10]:
df_neighborhoods.insert(3, 'Latitude', '')
df_neighborhoods.insert(4, 'Longitude', '')
df_neighborhoods.head()

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",,
1,M4B,East York,"Woodbine Gardens, Parkview Hill",,
2,M4J,East York,East Toronto,,
3,M6J,West Toronto,"Little Portugal, Trinity",,
4,M3M,North York,Downsview Central,,


Note that we use a CSV file containing latitude and longitude values for the Toronto area.

Although much time was spent working with Geocoder, it could not provide this data for us.

In [11]:
# read in latitude and longitude values from CSV file
!wget  -q -O 'Toronto_Lat_Long.csv'  https://cocl.us/Geospatial_data
print("Geospatial Data Successfully downloaded...")


Geospatial Data Successfully downloaded...


Now read the Latitude/Longitude data from the CSV file into our Pandas dataframe...

In [12]:
import csv

df_neighborhoods.set_index('PostalCode', inplace=True)

with open('Toronto_Lat_Long.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    # skip the first row, since it is just header information
    next(csv_reader)
    for row in csv_reader:
        postalCode = str(row[0])
        df_neighborhoods.loc[[postalCode], ['Latitude']] = str(row[1])
        df_neighborhoods.loc[[postalCode], ['Longitude']] = str(row[2])
# no explicit close is needed: the with block closes the file
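
As a possible alternative to the row-by-row loop, the same lookup can be written as a pandas merge on the postal code. The sketch below is self-contained and uses two rows taken from the output in this notebook. The CSV header names (`Postal Code`, `Latitude`, `Longitude`) are an assumption about the downloaded file, so adjust the rename if the real header differs.

```python
import io
import pandas as pd

# Two rows from the neighborhood dataframe, before coordinates are added.
neighborhoods = pd.DataFrame({
    "PostalCode": ["M4J", "M3M"],
    "Borough": ["East York", "North York"],
    "Neighborhood": ["East Toronto", "Downsview Central"],
})

# In-memory stand-in for Toronto_Lat_Long.csv; the header names are assumed.
geo_csv = io.StringIO(
    "Postal Code,Latitude,Longitude\n"
    "M3M,43.7284964,-79.4956974\n"
    "M4J,43.685347,-79.3381065\n"
)
geo = pd.read_csv(geo_csv).rename(columns={"Postal Code": "PostalCode"})

# A left merge keeps every neighborhood row, in its original order,
# and leaves NaN where no coordinates are found.
merged = neighborhoods.merge(geo, on="PostalCode", how="left")
print(merged)
```

Unlike the loop, this leaves the coordinates as floating-point numbers rather than strings, which is usually more convenient for later mapping or distance calculations.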
            
        

Look at the first few rows of our dataframe, to check that the Latitude/Longitude data is available...

In [13]:
df_neighborhoods.head(10)

PostalCode,Borough,Neighborhood,Latitude,Longitude
M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.6408157,-79.3817523
M4B,East York,"Woodbine Gardens, Parkview Hill",43.7063972,-79.309937
M4J,East York,East Toronto,43.685347,-79.3381065
M6J,West Toronto,"Little Portugal, Trinity",43.6479267,-79.4197497
M3M,North York,Downsview Central,43.7284964,-79.4956974
M6L,North York,"Maple Leaf Park, North Park, Upwood Park",43.7137562,-79.4900738
M9L,North York,Humber Summit,43.7563033,-79.5659633
M4M,East Toronto,Studio District,43.6595255,-79.340923
M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.7394164,-79.5884369
M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.6481985,-79.3798169
