# Capstone project -- week 3 -- geocoding neighborhoods table

This notebook addresses the second part of the week 3 assignment: adding geospatial information (i.e. longitude and latitude) to the postal codes. Geocoding was initially attempted through geocoder, but the API produced no reliable results. Therefore, the <a href="http://cocl.us/Geospatial_data">file</a> provided in the instructions was used to add geospatial information to each row of the Toronto postal code data frame.

This work follows on from the first notebook, which is referenced in the submission and available <a href="https://github.com/AnikaC-git/Coursera_Capstone/blob/master/notebooks/Capstone_week3_parsing_Wikipedia_page.ipynb">here</a>.


In [1]:
import pandas as pd

INPUT_FILE = r"../data/postal_codes_toronto.csv" # this file was generated with the previous notebook
INPUT_GEO_URL = r"http://cocl.us/Geospatial_data" # geospatial data as provided in instructions
OUTPUT_FILE = r"../data/postal_codes_toronto_geocoded.csv"

In [2]:
# read data previously downloaded from Wikipedia into dataframe
df_toronto = pd.read_csv(INPUT_FILE)
print(df_toronto.head())
print(df_toronto.shape)

  Postal Code           Borough                                 Neighborhood
0         M3A        North York                                    Parkwoods
1         M4A        North York                             Victoria Village
2         M5A  Downtown Toronto                    Regent Park, Harbourfront
3         M6A        North York             Lawrence Manor, Lawrence Heights
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government
(103, 3)


In [3]:
# add two columns for longitude and latitude to the data frame
df_toronto['Longitude'] = None
df_toronto['Latitude'] = None
print(df_toronto.head())

  Postal Code           Borough                                 Neighborhood  \
0         M3A        North York                                    Parkwoods   
1         M4A        North York                             Victoria Village   
2         M5A  Downtown Toronto                    Regent Park, Harbourfront   
3         M6A        North York             Lawrence Manor, Lawrence Heights   
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government   

  Longitude Latitude  
0      None     None  
1      None     None  
2      None     None  
3      None     None  
4      None     None  


In [4]:
# read geospatial data from data file provided in instructions
df_geo = pd.read_csv(INPUT_GEO_URL)
print(df_geo.head())

  Postal Code   Latitude  Longitude
0         M1B  43.806686 -79.194353
1         M1C  43.784535 -79.160497
2         M1E  43.763573 -79.188711
3         M1G  43.770992 -79.216917
4         M1H  43.773136 -79.239476


In [5]:
# merging both data sets together
for ind, row in df_toronto.iterrows():
    # retrieving entry from geospatial data set
    res = df_geo[df_geo['Postal Code'] == row['Postal Code']].index
    # assigning lat and long value to the neighborhoods' table
    df_toronto.loc[ind, 'Latitude'] = df_geo.loc[res[0], 'Latitude']
    df_toronto.loc[ind, 'Longitude'] = df_geo.loc[res[0], 'Longitude']
print(df_toronto.head()) 

  Postal Code           Borough                                 Neighborhood  \
0         M3A        North York                                    Parkwoods   
1         M4A        North York                             Victoria Village   
2         M5A  Downtown Toronto                    Regent Park, Harbourfront   
3         M6A        North York             Lawrence Manor, Lawrence Heights   
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government   

  Longitude Latitude  
0  -79.3297  43.7533  
1  -79.3156  43.7259  
2  -79.3606  43.6543  
3  -79.4648  43.7185  
4  -79.3895  43.6623  
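
The row-by-row lookup above could also be expressed as a single join on the postal code. The following is only a sketch under assumptions, not the method used in this notebook: the frames `neighborhoods` and `geo` are small hypothetical stand-ins for `df_toronto` and `df_geo`, the coordinates are rounded from the output above, and the code `M9Z` is made up to show what happens when a postal code has no match.

```python
import pandas as pd

# Hypothetical miniature stand-ins for df_toronto and df_geo
neighborhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is a made-up code with no geo entry
    "Borough": ["North York", "North York", "Unknown"],
})
geo = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.7259, 43.7533],
    "Longitude": [-79.3156, -79.3297],
})

# A left join keeps every neighborhood row in its original order;
# validate="many_to_one" raises if a postal code occurs twice in geo
merged = neighborhoods.merge(geo, on="Postal Code", how="left", validate="many_to_one")

# Postal codes without a match end up as NaN instead of raising an IndexError
unmatched = merged[merged["Latitude"].isna()]

print(merged)
print(unmatched["Postal Code"].tolist())
```

With a join, no placeholder columns are needed beforehand, and the coordinate columns get a numeric dtype rather than `object`.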


In [6]:
# checking that there are no missing values for Longitude and Latitude
# (isna() is used because an element-wise comparison with None is False for every row)
print(df_toronto[df_toronto['Latitude'].isna()])
print(df_toronto[df_toronto['Longitude'].isna()])

Empty DataFrame
Columns: [Postal Code, Borough, Neighborhood, Longitude, Latitude]
Index: []
Empty DataFrame
Columns: [Postal Code, Borough, Neighborhood, Longitude, Latitude]
Index: []


Both queries return an empty data frame, so all the placeholder values have been replaced with entries from the geospatial data set provided. The data set can now be saved for the third part of the assignment.
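
A check for missing values in pandas is usually written with `isna()`, because comparing a column to `None` with `==` evaluates to `False` for every row, including rows that really hold `None`. The sketch below illustrates the difference on a made-up two-row frame; the name `demo` and the postal code `M9Z` are hypothetical.

```python
import pandas as pd

# Made-up frame: the second row still holds its None placeholder
demo = pd.DataFrame({
    "Postal Code": ["M3A", "M9Z"],
    "Latitude": pd.Series([43.7533, None], dtype=object),
})

# Element-wise equality with None never matches, so this selects no rows
rows_by_equality = demo[demo["Latitude"] == None]  # noqa: E711

# isna() flags None (and NaN) entries, so this selects the second row
rows_by_isna = demo[demo["Latitude"].isna()]

print(len(rows_by_equality))
print(len(rows_by_isna))
```

Here the equality filter selects no rows even though one value is missing, while the `isna()` filter selects that row.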

In [7]:
df_toronto.to_csv(OUTPUT_FILE, index=False)