## Task 1: Scraping of the Wikipedia page


__Task 1:__ Start by creating a new Notebook for this assignment. 

__Task 2:__ Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

In [75]:
import pandas as pd
import numpy as np

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df_can = pd.read_html(url, header=0)[0]
df_can.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


# Data Cleansing according to the instructions

 3. To create the above dataframe:
 
__Task 3a:__ The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [76]:
#Rename Column Postal Code
df_can.rename(columns={"Postal Code": "PostalCode"}, inplace = True)
df_can.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


__Task 3b:__ Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [77]:
#Drop all rows that have "Not assigned" in the Borough column
df_can.drop(df_can.index[df_can.Borough == 'Not assigned'], inplace = True)
df_can.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


_Note: I also checked the shape before and after dropping the rows using df_can.shape. Before dropping, there were 180 rows; after dropping, 103 rows are left. That means 77 rows were dropped because they had "Not assigned" in the Borough column._
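The row-count check described in the note can be sketched as follows. This is a minimal illustration on a made-up five-row frame (the `toy` data is hypothetical, not the Wikipedia table), and it uses boolean filtering instead of `drop`, which should give the same rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the scraped table
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
})

rows_before = toy.shape[0]
# Keep only the rows whose borough is assigned
toy_assigned = toy[toy["Borough"] != "Not assigned"]
rows_after = toy_assigned.shape[0]

print("Rows before: {}, after: {}, dropped: {}".format(
    rows_before, rows_after, rows_before - rows_after))
```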

__Task 3c:__ More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [78]:
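# Note: the result of this groupby is not assigned, so df_can is left unchanged.
# The scraped table already has one row per postal code (e.g. M5A above), so nothing needs merging here.
df_can.groupby("PostalCode")['Neighborhood'].apply(lambda tags: ','.join(tags))
df_can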

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
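If the scraped table did list a postal code on several rows, the combining step from Task 3c could look roughly like the sketch below. The `toy` frame is hypothetical and only mimics the M5A example from the task text; grouping by both `PostalCode` and `Borough` is an assumption made here so that the borough column survives the aggregation.

```python
import pandas as pd

# Hypothetical toy data: M5A appears on two rows
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Regent Park", "Harbourfront"],
})

# Join the neighborhoods of each postal code with a comma and
# assign the result, so the combined rows are actually kept
combined = (
    toy.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```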


__Task 3d:__ If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [79]:
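# Where the neighborhood is missing or "Not assigned", fall back to the borough
missing = df_can['Neighborhood'].isna() | (df_can['Neighborhood'] == 'Not assigned')
df_can['Neighborhood'] = df_can['Neighborhood'].mask(missing, df_can['Borough'])
df_can.head()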

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


__Task 3e:__ Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.


__Task 3f:__ In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [80]:
df_can.shape
print("The shape of the Canada Dataframe is {} ".format(df_can.shape))

The shape of the Canada Dataframe is (103, 3) 


## Task 2: Get latitude and longitude of each neighborhood

The Geocoder Python package (https://geocoder.readthedocs.io/index.html) was used with a while loop for each postal code to get the coordinates of all neighborhoods. _geocoder.arcgis_ was used because _geocoder.google_ did not work (the while loop never ended). The ArcGIS provider is documented at https://geocoder.readthedocs.io/.

The following steps were done: 
1. Install geocoder & import library
2. Define a function that takes a postal code and a borough as input parameters and returns the latitude and longitude for that postal code and borough
3. Reset the index of the Canada dataframe (so that every row can be looped over)
4. Loop over every row of the dataframe, read the postal code and borough of that row using loc, call the function defined in step 2, and save the latitude and longitude in new columns of the dataframe

__Step 1:__ Install geocoder & import library

__Step 2:__ Define a function that takes a postal code and a borough as input parameters and returns the latitude and longitude for that postal code and borough

In [81]:
#Step 1
#conda install -c conda-forge geocoder
import geocoder

##Step 2: 
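def get_cordinates(postalcode, borogh):
    lat_lng_coords = None
    # Keep querying until the provider returns a result
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, {}'.format(postalcode, borogh))
        lat_lng_coords = g.latlng
    # Unpack only after a non-None result, so a failed lookup does not raise a TypeError
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude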

__Step 3:__ Reset the index of the Canada dataframe (so that every row can be looped over)

__Step 4:__ Loop over every row of the dataframe, read the postal code and borough of that row using loc, call the function defined in step 2, and save the latitude and longitude in new columns of the dataframe

In [86]:
#Step 3: 
df_can.reset_index(drop=True, inplace=True)

#Step 4: 
for rownr in range(0,len(df_can)):
    postalcode = df_can.loc[rownr, 'PostalCode'] 
    borogh = df_can.loc[rownr, 'Borough'] 
    lat, lon = get_cordinates(postalcode, borogh)
    df_can.loc[df_can.index[rownr], 'Latitude'] = lat
    df_can.loc[df_can.index[rownr], 'Longitude'] = lon
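As an alternative to writing one cell at a time with loc, the coordinates can be collected first and assigned as whole columns. This is only a sketch: `fake_coordinates` is a hypothetical stand-in for the geocoding function, with made-up values, so the example runs without calling a provider.

```python
import pandas as pd

def fake_coordinates(postalcode, borough):
    # Hypothetical stand-in for the geocoding function; returns fixed values
    lookup = {"M3A": (43.75, -79.33), "M4A": (43.73, -79.31)}
    return lookup[postalcode]

toy = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})

# One (lat, lon) pair per row, in row order
pairs = [fake_coordinates(pc, b)
         for pc, b in zip(toy["PostalCode"], toy["Borough"])]

# Assign both new columns in one step each
toy["Latitude"] = [lat for lat, _ in pairs]
toy["Longitude"] = [lon for _, lon in pairs]
print(toy)
```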

In [88]:
#check how the dataframe looks 
df_can.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65011,-79.3829
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.65011,-79.3829


## Task 3: Explore and cluster the neighborhoods in Toronto

Only boroughs that contain the word Toronto were used to replicate the same analysis that was done on the New York City data. It is up to you.
Just make sure:
1. to add enough Markdown cells to explain what you decided to do and to report any observations you make.
2. to generate maps to visualize your neighborhoods and how they cluster together.


The following steps were done: 
1. Create a new dataframe that contains only boroughs with "Toronto" in their name
2. Install geocoder & import libraries to get the geographical coordinates of Toronto, Ontario
3. Create a map of Toronto, Ontario with neighborhoods superimposed on top

__Step 1:__ Create a new dataframe that contains only boroughs with "Toronto" in their name

In [89]:
#Step 1
df_toronto = df_can[df_can['Borough'].str.contains("Toronto")]

print("There are {} rows in the Canada Dataframe. For the further analysis only those whose borough contains \
the word 'Toronto' are used, which are {}.".format(
        len(df_can['Borough']),
        len(df_toronto['Borough'])
    )
)
df_toronto.head()

There are 103 rows in the Canada Dataframe. For the further analysis only those whose borough contains the word 'Toronto' are used, which are 39.


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65011,-79.3829
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.65011,-79.3829
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.65011,-79.3829
15,M5C,Downtown Toronto,St. James Town,43.65011,-79.3829
19,M4E,East Toronto,The Beaches,40.478341,-80.735023
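The coordinates returned for M4E in the output above (40.478341, -80.735023) lie far from the other rows, which suggests that some lookups may have resolved to the wrong place. A rough sanity check could flag such rows. The sketch below copies three rows from the output; the bounding-box limits are approximate assumptions chosen for illustration, not authoritative city limits.

```python
import pandas as pd

# Rows copied from the output above
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M7A", "M4E"],
    "Latitude": [43.65011, 43.65011, 40.478341],
    "Longitude": [-79.3829, -79.3829, -80.735023],
})

# Assumed, approximate bounding box around Toronto
LAT_MIN, LAT_MAX = 43.5, 43.9
LON_MIN, LON_MAX = -79.7, -79.1

inside = (sample["Latitude"].between(LAT_MIN, LAT_MAX)
          & sample["Longitude"].between(LON_MIN, LON_MAX))

# Rows outside the box are candidates for a second geocoding attempt
suspect = sample[~inside]
print(suspect)
```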


__Step 2:__ Install geocoder & import libraries to get the geographical coordinates of Toronto, Ontario

In [90]:
#Step 2
#!conda install -c conda-forge geocoder --yes
#!conda install -c conda-forge geopy --yes 

import geocoder # import geocoder
from geopy.geocoders import Nominatim

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Ontario are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Ontario are 43.6534817, -79.3839347.


__Step 3:__ Create a map of Toronto, Ontario with neighborhoods superimposed on top

In [74]:
#Step 3
#!conda install -c conda-forge folium=0.5.0 --yes 
import folium 

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto