Toronto Neighborhoods
====
This notebook loads and cleans data about the neighborhoods of the city of Toronto:
1. Get the Toronto neighborhood data: scrape the Wikipedia page, then wrangle and clean the data.
1. Read the Toronto neighborhood data into a pandas dataframe.
1. Get the latitude and longitude coordinates of each neighborhood.


In [10]:
#!pip install bs4 lxml


In [37]:
# Import required modules
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
from bs4 import BeautifulSoup
import lxml

I - Get Toronto neighborhood data
----

In [38]:
# Create a variable with the url
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# Use requests to get the contents
r = requests.get(url)

# Get the text of the contents
html_content = r.text

# Convert the HTML content into a Beautiful Soup object, naming the parser explicitly
soup = BeautifulSoup(html_content, 'lxml')

soup.title

<title>List of postal codes of Canada: M - Wikipedia</title>

In [39]:
# (1) Fill the dataframe with the Toronto neighborhood data (first table on the page)
table = soup.find_all('table')
df = pd.read_html(str(table))[0]

df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
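
The scraping step above can be tried offline on a small inline page. This is a minimal sketch, not the notebook's method: `SAMPLE_HTML` and `first_table_to_frame` are hypothetical names, the two rows mimic the Wikipedia table, and the standard-library `html.parser` backend is assumed in place of `lxml` so that no extra parser is required.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: same column names, two rows.
SAMPLE_HTML = """
<html><head><title>Sample</title></head><body>
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
</body></html>
"""


def first_table_to_frame(html):
    """Parse the first <table> in `html` into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # One list of cell texts per table row; header cells are <th>, data cells <td>.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


sample_df = first_table_to_frame(SAMPLE_HTML)
print(sample_df)
```

Unlike `pd.read_html`, this sketch leaves empty cells as empty strings rather than NaN, so a missing neighborhood would have to be detected with `== ""`.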


In [40]:
# (2) Drop the rows whose borough is 'Not assigned'
df = df.loc[df['Borough'] != 'Not assigned']
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
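
The filter above relies on a boolean mask. A minimal sketch on a toy frame (the `toy`, `assigned` and `kept` names are illustrative, not from the notebook) shows the mask explicitly and adds `.copy()`, which is one common way to make sure that later column assignments act on an independent frame instead of a view of the original.

```python
import pandas as pd

# Toy data shaped like the scraped table.
toy = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": [None, "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for rows whose borough is known.
assigned = toy["Borough"] != "Not assigned"

# .copy() returns an independent frame; the original `toy` is left untouched.
kept = toy.loc[assigned].copy()
print(kept)
```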


In [41]:
# (3) Separate multiple neighborhoods with a comma instead of ' /'
df = df.replace(to_replace=' /', value=',', regex=True)
# Check a postal code that has several neighborhoods
df.loc[ df['Postal code'] == 'M5A']

Unnamed: 0,Postal code,Borough,Neighborhood
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
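
`DataFrame.replace` with `regex=True` scans every column. If only the neighborhood names should change, the replacement can be scoped to that column. The following is a sketch on toy data, assuming the separator is always the literal substring `" /"`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal code": ["M5A", "M4A"],
    "Neighborhood": ["Regent Park / Harbourfront", "Victoria Village"],
})

# regex=False treats " /" as a literal substring, not as a pattern,
# and only the Neighborhood column is touched.
toy["Neighborhood"] = toy["Neighborhood"].str.replace(" /", ",", regex=False)
print(toy)
```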


In [42]:
# (4) If a cell has a borough but no assigned neighborhood, the neighborhood is set to the borough.

# Return Borough when Neighborhood is NaN, otherwise return Neighborhood
def check_Neighborhood(Neighborhood, Borough):
    if pd.isna(Neighborhood):
        return Borough
    return Neighborhood

df['Neighborhood'] = df.apply(lambda x: check_Neighborhood(x['Neighborhood'], x['Borough']), axis=1)

# (5) Reset the index so that it starts at 0 again after the 'Not assigned' rows were dropped
df.reset_index(drop=True,inplace=True)
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
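
The row-wise helper in step (4) can also be expressed as a single vectorized call. This is an alternative sketch on toy data, not the code the notebook ran: `Series.fillna` accepts another Series and fills each missing value from the row with the same index.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["Downtown Toronto", "North York"],
    "Neighborhood": [np.nan, "Parkwoods"],
})

# Missing neighborhoods take the borough name from the same row;
# existing neighborhoods are left unchanged.
toy["Neighborhood"] = toy["Neighborhood"].fillna(toy["Borough"])
print(toy)
```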


In [43]:
# (6) Print the number of rows of the dataframe using the .shape attribute
df.shape[0]

103
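
`shape` is an attribute that holds a `(rows, columns)` tuple, so `shape[0]` is the row count. A short sketch on toy data also checks that the postal codes are unique, which is worth knowing before using the column as a merge key (the `toy` frame is illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})

# shape is a tuple, not a method call: (number of rows, number of columns).
n_rows, n_cols = toy.shape

# True when no postal code appears more than once.
codes_unique = toy["Postal code"].is_unique
print(n_rows, n_cols, codes_unique)
```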

II - Get the latitude and longitude coordinates of each neighborhood
----

In [44]:
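# Load the geospatial data (latitude and longitude for each postal code)
path = 'http://cocl.us/Geospatial_data'
df_Geospatial = pd.read_csv(path)
# Rename the key column so that it matches the neighborhood dataframe
df_Geospatial.rename(columns = {'Postal Code':'Postal code'}, inplace = True)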

In [45]:
# Merge the geospatial data into the dataframe on the shared 'Postal code' column
df = pd.merge(df, df_Geospatial, on='Postal code')

In [46]:
# Check the coordinates of one postal code
df.loc[ df['Postal code'] == 'M2H']

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
27,M2H,North York,Hillcrest Village,43.803762,-79.363452
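
`pd.merge` defaults to an inner join, so a postal code without coordinates would silently disappear from the result. The sketch below, on toy data, shows one way to surface such rows with a left join. The `M2H` coordinates are the ones shown above; `M0X` is a hypothetical code added only to demonstrate a missing match.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "Postal code": ["M2H", "M0X"],  # "M0X" is hypothetical
    "Borough": ["North York", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal code": ["M2H"],
    "Latitude": [43.803762],
    "Longitude": [-79.363452],
})

# how="left" keeps every neighborhood row, validate raises an error if a
# postal code is duplicated on either side, and indicator adds a "_merge"
# column that records where each row came from.
merged = pd.merge(
    neighborhoods,
    coords,
    on="Postal code",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Postal codes that found no coordinates.
missing = merged.loc[merged["_merge"] == "left_only", "Postal code"].tolist()
print(missing)
```

An empty `missing` list would mean that every neighborhood received coordinates, in which case the inner and left joins give the same rows.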
