# Capstone project
This notebook is for the capstone project, in which I use location data to solve a specific problem.

## Part 1: Creating the dataframe from the Wikipedia page
### Creating the empty dataframe

In [51]:
import pandas as pd
import numpy as np

# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
df = pd.DataFrame(columns=column_names)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood


### Filling the empty dataframe with the Wikipedia table values

In [52]:
from bs4 import BeautifulSoup
import requests

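def fillDataframe(wiki_tab_values):
    """Append one row per postal code to the global df, skipping unassigned boroughs."""
    row_number = 0
    # the table cells come in groups of three: postal code, borough, neighborhood
    for i in range(0, len(wiki_tab_values) - 2, 3):
        # get_text also copes with cells that contain nested tags such as links
        postal_code = wiki_tab_values[i].get_text(strip=True)
        borough = wiki_tab_values[i+1].get_text(strip=True)
        neighborhood = wiki_tab_values[i+2].get_text(strip=True)
        # ignore postal codes with no assigned borough
        if borough != 'Not assigned':
            # fall back to the borough name when the neighborhood is unassigned
            if neighborhood == 'Not assigned':
                neighborhood = borough
            df.loc[row_number] = [postal_code, borough, neighborhood]
            row_number += 1


# download the page and extract every cell of the postal code table
response = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
response.raise_for_status()  # stop early if the download failed
soup = BeautifulSoup(response.text, 'html.parser')
tab = soup.find("table", {"class": "wikitable sortable"})
wiki_tab_values = tab.find_all('td')

fillDataframe(wiki_tab_values)
df.head(10)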

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
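
The filling rules used above (skip rows whose borough is 'Not assigned', and reuse the borough name when the neighborhood is 'Not assigned') can also be sketched as a function that collects the rows in a list and builds the dataframe once at the end. This is only an illustrative variant, not the code that produced the output above: `build_rows` and `sample_cells` are hypothetical names, the cells are assumed to be already extracted as plain strings, and the `M9Z` row is made up to exercise the fallback rule.

```python
import pandas as pd

def build_rows(cell_texts):
    """Group a flat list of cell texts into (postal code, borough, neighborhood) rows."""
    rows = []
    for i in range(0, len(cell_texts) - 2, 3):
        postal_code, borough, neighborhood = (t.strip() for t in cell_texts[i:i + 3])
        if borough == 'Not assigned':
            continue  # no borough: drop the row
        if neighborhood == 'Not assigned':
            neighborhood = borough  # fall back to the borough name
        rows.append((postal_code, borough, neighborhood))
    return rows

# small hand-written sample standing in for the scraped table cells
sample_cells = [
    'M1A\n', 'Not assigned\n', 'Not assigned\n',
    'M3A\n', 'North York\n', 'Parkwoods\n',
    'M9Z\n', 'Etobicoke\n', 'Not assigned\n',  # hypothetical row
]

sample_df = pd.DataFrame(build_rows(sample_cells),
                         columns=['PostalCode', 'Borough', 'Neighborhood'])
print(sample_df)
```

Collecting rows first avoids growing the dataframe one `df.loc` assignment at a time and keeps the function free of global state, which makes the rules easy to test on a small sample.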


In [53]:
df.shape

(103, 3)

## Part 2: Adding latitude and longitude columns for each neighborhood
Read the CSV file to create a dataframe holding the geospatial data (latitude and longitude) for each postal code.

In [59]:
geodf = pd.read_csv("http://cocl.us/Geospatial_data")
geodf = geodf.rename(columns={'Postal Code': 'PostalCode'})
geodf.head()
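
The rename matters because the scraped dataframe calls its key column `PostalCode` while the CSV calls it `Postal Code`. The effect can be sketched offline with an in-memory CSV. The snippet below assumes the file has exactly the three columns `Postal Code`, `Latitude` and `Longitude`; `csv_text` and `geo_sample` are hypothetical names, and the two data rows are copied from the merged output shown in this notebook.

```python
import io
import pandas as pd

# in-memory stand-in for the downloaded CSV file (assumed column layout)
csv_text = (
    "Postal Code,Latitude,Longitude\n"
    "M1B,43.806686,-79.194353\n"
    "M3A,43.753259,-79.329656\n"
)

geo_sample = pd.read_csv(io.StringIO(csv_text))
# align the key column name with the scraped dataframe
geo_sample = geo_sample.rename(columns={'Postal Code': 'PostalCode'})
print(list(geo_sample.columns))
```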

I then use a left outer join to merge both tables into one on their common key, PostalCode.
The resulting table contains the latitude and longitude of each postal code.

In [62]:
result = pd.merge(df, geodf, how='left', on=['PostalCode'])
result.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
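
A left join keeps every row of the left table even when no coordinates are found, so it can be worth checking for unmatched postal codes. The following is a minimal sketch on two small sample frames, not a check that was run in this notebook: `left`, `right`, `checked` and `unmatched` are hypothetical names, and `M0X` is a made-up postal code added to show what a missing match looks like. It relies on the `indicator` and `validate` parameters of `pd.merge`.

```python
import pandas as pd

left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0X'],  # 'M0X' is made up and has no coordinates
    'Borough': ['North York', 'North York', 'Hypothetical'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# indicator=True adds a '_merge' column telling where each row came from;
# validate raises an error if a postal code appears more than once on either side
checked = pd.merge(left, right, how='left', on='PostalCode',
                   indicator=True, validate='one_to_one')

# rows present only in the left table received NaN coordinates
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['PostalCode'].tolist())
```

An empty list from the same check on the real tables would mean that every postal code received a latitude and a longitude.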
