### General - importing main libraries:

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform a JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


## First part: Getting data from Wikipedia page

I decided to follow the example in this [article](https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722) to scrape the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

I will use the BeautifulSoup package for this task.

In [2]:
# First step: import the requests library (BeautifulSoup is installed and imported in the next cells)
import requests

In [3]:
!conda install -c conda-forge beautifulsoup4 --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [4]:
from bs4 import BeautifulSoup

In [5]:
#Get the required link
wiki_url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikipedia_page= requests.get(wiki_url).text

# Create the BeautifulSoup object, naming the built-in parser explicitly.
soup = BeautifulSoup(wikipedia_page, 'html.parser')
#print(soup.prettify())


The first task is to find the table in the HTML document and extract it.

In [6]:
my_table = soup.find('table')

#my_table

In [76]:
# Iterate over the rows of the extracted table and store the cell text in a list
data = []
columns = []

for index, tr in enumerate(my_table.find_all('tr')):
    section = []
    for td in tr.find_all(['th', 'td']):
        section.append(td.text.rstrip())

    # The first row holds the header; every other row is data
    if index == 0:
        columns = section
    else:
        data.append(section)

# Convert the list into a data frame
df_toronto = pd.DataFrame(data=data, columns=columns)
# Show the first 5 rows to check the data frame
df_toronto.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
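
The row-by-row extraction can also be tried offline on a small inline HTML fragment, which makes it easier to see what each step returns. This is only a minimal sketch: the `sample_html` string and the `rows_from_table` helper are hypothetical stand-ins for the downloaded Wikipedia page and the loop above, and it assumes `beautifulsoup4` and `pandas` are installed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the layout of the Wikipedia table.
sample_html = """
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A </td><td>Not assigned </td><td> </td></tr>
  <tr><td>M3A </td><td>North York </td><td>Parkwoods </td></tr>
</table>
"""


def rows_from_table(table):
    """Return (header, rows); each row is a list of stripped cell texts."""
    parsed = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    # The first <tr> is treated as the header, the rest as data rows.
    return parsed[0], parsed[1:]


soup_sample = BeautifulSoup(sample_html, "html.parser")
header, rows = rows_from_table(soup_sample.find("table"))
df_sample = pd.DataFrame(rows, columns=header)
print(df_sample)
```

Using `get_text(strip=True)` removes whitespace on both sides of each cell, whereas `rstrip()` in the notebook only trims the right side.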


## Cleaning extracted table

- Ignore rows whose borough is Not assigned.

In [77]:
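# .copy() gives an independent frame, so later column assignments do not trigger SettingWithCopyWarning
df_toronto = df_toronto[df_toronto['Borough'] != 'Not assigned'].copy()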
df_toronto.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |


- More than one neighborhood can exist in one postal code area. For example, on the Wikipedia page M5A has two neighborhoods: Harbourfront and Regent Park. Neighborhoods that share a postal code are combined into one row, separated by a comma, as shown in row 2 of the table below.

In [78]:
# Join the neighborhoods of each postal code into one comma-separated string
df_toronto["Neighborhood"] = df_toronto.groupby("Postal code")["Neighborhood"].transform(lambda neigh: ','.join(neigh))
# Replace the '/' separators that already appear inside a cell with commas
df_toronto["Neighborhood"] = df_toronto["Neighborhood"].str.replace('/', ',', regex=False)
# Reset the index
df_toronto.reset_index(drop=True, inplace=True)
# Show the data frame
df_toronto.head(11)

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern , Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill , Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
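
Note that `transform` returns one value per input row, so a postal code that was spread over several rows would still occupy several rows after the join. If the source table did list a code more than once, one way to collapse it into a single row is `groupby(...).agg`. The sketch below runs on made-up toy data (`df_demo` is hypothetical and is not the scraped table):

```python
import pandas as pd

# Hypothetical toy data: M5A is deliberately listed on two rows.
df_demo = pd.DataFrame({
    "Postal code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One output row per (postal code, borough); neighborhoods joined with ", ".
# sort=False keeps the groups in order of first appearance.
df_combined = (
    df_demo
    .groupby(["Postal code", "Borough"], sort=False, as_index=False)
    .agg({"Neighborhood": ", ".join})
)
print(df_combined)
```

With this toy input the result has two rows, and the M5A row carries both neighborhoods in a single cell.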


- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [79]:
#Checking whether it is applicable

if df_toronto['Neighborhood'].str.contains('Not assigned').any():
    print ("There are not assigned neighborhoods")
else:
    print ("There are no not assigned neighborhoods")


There are no not assigned neighborhoods


Result: no neighborhood is marked as Not assigned.
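
Had the check found any such rows, the rule could be applied with `Series.mask`, which replaces values wherever a boolean condition holds. The following is a minimal sketch on hypothetical rows (`df_rule` is invented for illustration); treating empty strings as unassigned as well is an assumption, not something the assignment states.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no assigned neighborhood.
df_rule = pd.DataFrame({
    "Postal code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Assumption: an empty cell counts as unassigned too.
unassigned = df_rule["Neighborhood"].str.strip().isin(["Not assigned", ""])

# Where the condition is True, take the borough name instead.
df_rule["Neighborhood"] = df_rule["Neighborhood"].mask(unassigned, df_rule["Borough"])
print(df_rule)
```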

### Print the number of rows of your dataframe

In [80]:
df_toronto.shape

(103, 3)

## Second part: Get geospatial data

In [82]:
df_geospatial = pd.read_csv('http://cocl.us/Geospatial_data')
df_geospatial.columns = ['Postal code', 'Latitude', 'Longitude']

Create combined data frame:

In [83]:
merged_df = pd.merge(df_toronto, df_geospatial, on=['Postal code'], how='inner')
merged_df.head(11)

|   | Postal code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern , Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill , Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
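
An inner merge silently drops any postal code that is missing from either table. One way to check for that, sketched here on hypothetical frames (`left_demo`, `right_demo` and the postal code `M9Z` are made up for illustration), is a left merge with `indicator=True`, plus `validate` to confirm that each postal code appears only once on both sides:

```python
import pandas as pd

# Hypothetical frames; "M9Z" is an invented code with no coordinates.
left_demo = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Made-up borough"],
})
right_demo = pd.DataFrame({
    "Postal code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises MergeError if a postal code is duplicated on either side;
# indicator adds a "_merge" column telling where each row came from.
checked = left_demo.merge(
    right_demo, on="Postal code", how="left",
    validate="one_to_one", indicator=True,
)

# Postal codes that found no coordinates in the geospatial table.
missing = checked.loc[checked["_merge"] == "left_only", "Postal code"].tolist()
print(missing)
```

An empty `missing` list would indicate that the inner merge used above loses no rows.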


## Third part: Explore and cluster the neighborhoods in Toronto

#### Create a map of Toronto with neighborhoods superimposed on top.