# Introduction

This notebook scrapes and clusters Toronto neighborhoods for the Capstone Project. It has three parts:

## Part 1:
Scrape Toronto Neighborhood Data from Wikipedia
## Part 2:
Read the Latitude and Longitude Coordinates from a csv file and update the dataframe
## Part 3:
Cluster Using K-Means Clustering

## Part 1:
Scrape Toronto Neighborhood Data from Wikipedia

We will now install and import the first set of libraries needed, including Beautiful Soup and the html5lib parser for web scraping.

In [4]:
!pip install html5lib;
!pip install requests;
!pip install beautifulsoup4;



In [5]:
import pandas as pd;
import numpy as np;
import requests;
from bs4 import BeautifulSoup;

Retrieve the website content with an HTTP GET request

In [6]:
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M";
r = requests.get(URL);
print(r)

<Response [200]>


Initialize an empty neighbourhood pandas dataframe with the needed columns

In [7]:
column_names = ['PostalCode','Borough', 'Neighborhood']; 
neigbourhood_df = pd.DataFrame(columns=column_names);
neigbourhood_df

Unnamed: 0,PostalCode,Borough,Neighborhood


Create a BeautifulSoup object from the response content and select the body of the first table on the page

In [8]:
soup = BeautifulSoup(r.content, 'html5lib')

table = soup.body.table.tbody
#print(soup.prettify());

Loop through the HTML table rows and populate the dataframe, skipping rows whose borough is not assigned

In [9]:
# Collect rows in a list; DataFrame.append was removed in pandas 2.0.
rows = []
for row in table.find_all('tr'):
    tdRow = row.find_all('td')
    if len(tdRow) < 3:
        continue  # header rows hold <th> cells, so they have no <td>
    # get_text(strip=True) drops the tags and the trailing newline
    postalCode = tdRow[0].get_text(strip=True)
    borough = tdRow[1].get_text(strip=True)
    neighbourhood = tdRow[2].get_text(strip=True)
    if borough.startswith('Not'):
        continue  # skip postal codes whose borough is 'Not assigned'
    rows.append({'PostalCode': postalCode,
                 'Borough': borough,
                 'Neighborhood': neighbourhood})

neigbourhood_df = pd.DataFrame(rows, columns=column_names)

neigbourhood_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
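
The row-parsing logic above can be tried offline on a small inline table. The following is a minimal sketch, not the notebook's own cell: `SAMPLE_HTML` and `parse_rows` are hypothetical names introduced for illustration, the sample rows only mimic the layout of the Wikipedia table, and the standard-library `html.parser` backend is swapped in for `html5lib` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the layout of the Wikipedia table.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront
</td></tr>
</table>
"""


def parse_rows(html):
    """Return one dict per data row, skipping header and 'Not assigned' rows."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 3:
            continue  # the header row contains <th> cells only
        # get_text(strip=True) removes the tags and surrounding whitespace
        postal_code, borough, neighbourhood = (
            cell.get_text(strip=True) for cell in cells[:3]
        )
        if borough.startswith("Not"):
            continue  # no borough assigned to this postal code
        records.append(
            {
                "PostalCode": postal_code,
                "Borough": borough,
                "Neighborhood": neighbourhood,
            }
        )
    return records


records = parse_rows(SAMPLE_HTML)
print(records)
```

Passing `records` to `pd.DataFrame` would then give a frame with the same three columns as `neigbourhood_df`.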


In [10]:
neigbourhood_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Print out the shape

In [11]:
neigbourhood_df.shape

(103, 3)

## Part 2:
Read the Latitude and Longitude Coordinates from a csv file and update the dataframe

Next, we load the latitude and longitude for each postal code in neigbourhood_df from a csv file

In [13]:

neig_lat_long_df = pd.read_csv("https://cocl.us/Geospatial_data");


Display the data

In [14]:
neig_lat_long_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Loop through neigbourhood_df, take each postal code, and use it to retrieve the corresponding coordinates from neig_lat_long_df

In [30]:
for i in range(len(neigbourhood_df)):
    postalCode = neigbourhood_df.loc[i,'PostalCode'];
    # Rows of the coordinates table that match this postal code
    postal_df = neig_lat_long_df[neig_lat_long_df['Postal Code']==postalCode]
    # .values[0] assumes a match exists; it raises IndexError otherwise
    neigbourhood_df.loc[i,'Latitude'] = postal_df['Latitude'].values[0];
    neigbourhood_df.loc[i,'Longitude'] = postal_df['Longitude'].values[0];

Display the resulting, updated neigbourhood_df

In [31]:
neigbourhood_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
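
The row-by-row lookup above can also be expressed as a single join. The following is a minimal sketch on small stand-in frames, assuming the same column names as `neigbourhood_df` and `neig_lat_long_df`. The frame names `neighbourhoods` and `coordinates`, the postal code `M9Z`, and the borough `Nowhere` are hypothetical and exist only to show what happens when a postal code has no coordinates.

```python
import pandas as pd

# Hypothetical stand-in for neigbourhood_df; 'M9Z' is a made-up code with no match.
neighbourhoods = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M9Z"],
        "Borough": ["North York", "North York", "Nowhere"],
    }
)

# Hypothetical stand-in for neig_lat_long_df, deliberately in a different row order.
coordinates = pd.DataFrame(
    {
        "Postal Code": ["M4A", "M3A"],
        "Latitude": [43.725882, 43.753259],
        "Longitude": [-79.315572, -79.329656],
    }
)

merged = neighbourhoods.merge(
    coordinates,
    how="left",               # keep every neighbourhood row, matched or not
    left_on="PostalCode",
    right_on="Postal Code",
    validate="many_to_one",   # fail if a postal code appears twice in coordinates
    indicator=True,           # adds a '_merge' column describing each match
).drop(columns="Postal Code")

# Postal codes that found no coordinates end up with NaN latitude and longitude.
missing = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()

print(merged)
print("Postal codes without coordinates:", missing)
```

A left join keeps unmatched rows with `NaN` coordinates instead of raising an error partway through a loop, so gaps in the coordinates file can be inspected before clustering.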
