# Toronto postal codes and neighbourhoods retrieval

This notebook scrapes the postal codes and neighbourhoods of Toronto from the <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">Wikipedia list of postal codes of Canada: M</a>.


## I. Library preparation

In [2]:
# Install the scraping dependencies (IPython shell commands)
!conda install -c conda-forge bs4 --yes
!conda install -c conda-forge lxml --yes
!conda install -c conda-forge requests --yes

import csv  # library to handle csv files

import numpy as np  # library to handle data in a vectorized manner
import pandas as pd  # library for data analysis
import lxml  # optional parser backend for BeautifulSoup
import requests  # library to fetch web pages
from bs4 import BeautifulSoup  # library to parse HTML

print("Done!")


## II. Web-data scraping

In [3]:
#----------------
# Web data extraction
#----------------
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
#soup = BeautifulSoup(requests.get(url).content, "lxml")
#print(soup.prettify())


In [4]:
#----------------
# Reading table
#----------------
table=soup.find('table', class_="wikitable sortable")
#print(table.prettify())

#------------------------------------------------------------------------------
# Scanning through the table and extracting row info into a list (table_data)
#------------------------------------------------------------------------------
table_rows=table.find_all('tr')
table_data=[]
for tr in table_rows:
    td = tr.find_all('td')
    entry = [i.text for i in td]
    table_data.append(entry)
    #print(entry)

#------------------------------------------------
# Creating a df based on row info in table_data
#------------------------------------------------
table_df = pd.DataFrame(table_data, columns=["PostalCode", "Borough", "Neighbourhood"])
print(table_df.shape)
table_df.head()

(181, 3)


|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 |  |  |  |
| 1 | M1A\n | Not assigned\n | \n |
| 2 | M2A\n | Not assigned\n | \n |
| 3 | M3A\n | North York\n | Parkwoods\n |
| 4 | M4A\n | North York\n | Victoria Village\n |
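
The empty first row appears because the header row of the table holds only `<th>` cells, so `find_all('td')` returns an empty list for it. The sketch below shows one possible way to skip such rows and strip the trailing newlines while parsing. The inline HTML is a tiny stand-in for the Wikipedia table, and the helper name `rows_from_table` is a hypothetical name chosen for this example.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia table (illustrative data only)
SAMPLE_HTML = (
    '<table class="wikitable sortable">'
    "<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>"
    "<tr><td>M1A\n</td><td>Not assigned\n</td><td>\n</td></tr>"
    "<tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>"
    "</table>"
)


def rows_from_table(html):
    """Return one list of stripped cell texts per data row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:  # header rows hold <th> cells only, so skip them
            continue
        # get_text(strip=True) removes the trailing newline of each cell
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


sample_rows = rows_from_table(SAMPLE_HTML)
print(sample_rows)
```

Parsing this way would make the later newline removal and the `dropna` call for the header row unnecessary, at the cost of a slightly longer loop.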


## III. Data manipulation and preparation

In this section, the data stored in the dataframe is cleaned to meet the following criteria from the assignment:
* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
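
The criteria above can be sketched on a toy dataframe as follows. This is a minimal illustration that assumes the raw table lists one neighbourhood per row, as in the M5A example. The sample rows and the helper name `clean_postal_table` are assumptions made for this example.

```python
import pandas as pd


def clean_postal_table(raw):
    """Apply the assignment criteria to a raw postal-code table (hypothetical helper)."""
    # 1. Keep only rows with an assigned borough
    df = raw[raw["Borough"] != "Not assigned"].copy()
    # 2. A "Not assigned" neighbourhood falls back to the borough name
    unassigned = df["Neighbourhood"] == "Not assigned"
    df.loc[unassigned, "Neighbourhood"] = df.loc[unassigned, "Borough"]
    # 3. Combine rows sharing a postal code into one comma-separated entry
    return (
        df.groupby(["PostalCode", "Borough"], sort=False)["Neighbourhood"]
        .agg(", ".join)
        .reset_index()
    )


# Illustrative sample rows, not the real Wikipedia data
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)
clean = clean_postal_table(raw)
print(clean)
print(clean.shape)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result without needing a separate aggregation for it.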

In [5]:
#----------------------------------
# DF cleaning and re-organisation
#----------------------------------
# Removing the trailing \n from every cell
table_df.replace(to_replace='\n', value='', regex=True, inplace=True)

# Discarding rows without an assigned borough (Not assigned / None)
table_df.drop(table_df[table_df['Borough'] == 'Not assigned'].index, inplace=True)
table_df.dropna(inplace=True)

# Changing separators from " /" to ","
table_df['Neighbourhood'] = table_df['Neighbourhood'].str.replace(' /', ',', regex=False)

# Updating boroughs with missing neighbourhoods: reuse the borough name
table_df['Neighbourhood'] = np.where(table_df['Neighbourhood'] == 'Not assigned', table_df['Borough'], table_df['Neighbourhood'])

# Resetting the index
table_df.reset_index(inplace=True, drop=True)
table_df.head(10)

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


## IV. Results

This section shows the result of the previous steps.

In [6]:
table_df.shape

(103, 3)
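
Beyond the shape, a few quick checks can confirm that the cleaning rules held. The sketch below runs them on a small stand-in dataframe. The helper name `check_clean_table` and the sample rows are assumptions for illustration only.

```python
import pandas as pd


def check_clean_table(df):
    """Return simple sanity checks for a cleaned postal-code table (hypothetical helper)."""
    has_newline = df.apply(lambda col: col.str.contains("\n", regex=False))
    return {
        "unique_postal_codes": bool(df["PostalCode"].is_unique),
        "no_unassigned_borough": not (df["Borough"] == "Not assigned").any(),
        "no_unassigned_neighbourhood": not (df["Neighbourhood"] == "Not assigned").any(),
        "no_stray_newlines": not has_newline.any().any(),
    }


# Stand-in for the cleaned dataframe (first rows of the output above)
sample = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M5A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
        "Neighbourhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
    }
)
checks = check_clean_table(sample)
print(checks)
```

In the notebook itself, the same helper could be called as `check_clean_table(table_df)`.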

# Toronto Neighbourhood Coordinates

In this section, we find the coordinates for each postal code in the previous dataframe. The original plan was to use the `geocoder` package. Because this package proved unreliable, a .csv file with the coordinates is available as a fallback.

My initial approach was to query the geocoder for every postal code. Due to the instability experienced while retrieving the coordinates, this method was eventually discarded. The attempted code is kept here for reference:

```python
!conda install -c conda-forge geocoder --yes
import geocoder  # geocoding client

MAX_TRIES = 10  # stop retrying after this many failed lookups
lat_lon = []
for value in table_df['PostalCode']:
    lat_lng_coords = None
    tries = 0
    # Retry until coordinates come back or the attempts run out;
    # both None and an empty result count as a failure
    while not lat_lng_coords and tries < MAX_TRIES:
        g = geocoder.google('{}, Toronto, Ontario, Canada'.format(value))
        lat_lng_coords = g.latlng
        tries += 1
    # Keep (None, None) for postal codes that could not be resolved
    lat_lon.append(tuple(lat_lng_coords) if lat_lng_coords else (None, None))

# Split the (lat, lon) pairs into two new columns
table_df['Latitude'] = [lat for lat, _ in lat_lon]
table_df['Longitude'] = [lon for _, lon in lat_lon]

table_df.head(20)
```
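
Since lookups against an external geocoding service can fail repeatedly, one option is to wrap the call in a helper with a bounded number of retries and a pause between attempts. The sketch below is a minimal illustration under stated assumptions: `geocode_with_retry` and `flaky_lookup` are hypothetical names, and the lookup is a stand-in function so the example runs without network access.

```python
import time


def geocode_with_retry(lookup, query, max_tries=5, pause=0.0):
    """Call lookup(query) until it returns coordinates or max_tries is reached.

    `lookup` is any callable returning a (lat, lon) pair, or None/empty on failure.
    """
    for attempt in range(max_tries):
        coords = lookup(query)
        if coords:
            return tuple(coords)
        time.sleep(pause * (attempt + 1))  # wait a little longer after each failure
    return None  # the caller decides how to handle an unresolved postal code


# Stand-in for a flaky service: fails twice, then answers (made-up behaviour)
calls = {"n": 0}


def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.65, -79.38]


result = geocode_with_retry(flaky_lookup, "M5A, Toronto, Ontario, Canada")
print(result, calls["n"])
```

In the notebook, `lookup` could be a small wrapper such as `lambda q: geocoder.google(q).latlng`, with a non-zero `pause` to avoid hammering the service.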

The next approach was to read the coordinates directly from the csv file. This was the result:

In [18]:
# Methodology via csv file
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data

df_Geocoords = pd.read_csv('Geospatial_Coordinates.csv')

# reset_index keeps the csv row number as the extra "index" column seen in the output
df_Geocoords.reset_index(inplace=True)
df_Geocoords.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
df_Geocoords.set_index('PostalCode', inplace=True)

# Join the two tables on their shared PostalCode index
table_df.set_index('PostalCode', inplace=True)
joined_df = pd.merge(table_df, df_Geocoords, on='PostalCode')
table_df.reset_index(inplace=True)
joined_df.reset_index(inplace=True)

joined_df.head(11)

|   | PostalCode | Borough | Neighbourhood | index | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 25 | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 34 | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 53 | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 71 | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 85 | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue | 93 | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 0 | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 26 | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 35 | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 54 | 43.657162 | -79.378937 |
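
`pd.merge` performs an inner join by default, so any postal code missing from the csv file would silently disappear from the result. A possible check is a left join with `indicator=True`, sketched below on toy data. The postal code `M9Z` is made up to force a mismatch, and the variable names are assumptions for this example.

```python
import pandas as pd

# Toy stand-ins for table_df and the coordinates csv (M9Z is a made-up code)
neighbourhoods = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M9Z"],
        "Borough": ["North York", "North York", "Etobicoke"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# Left join keeps every neighbourhood row; _merge records where each row matched
merged = neighbourhoods.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",
    indicator=True,
)
missing = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

An empty `missing` list would indicate that every postal code received coordinates, which is what the matching row counts of the two tables suggest for the real data.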
