<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

In this assignment, we will explore, segment, and cluster the neighborhoods in the city of Toronto.

The neighborhood data comes from a Wikipedia page that lists Toronto postal codes together with their boroughs and neighborhoods. We will scrape that page, clean the data, and load it into a pandas dataframe. Once the data is in a structured format, we will explore and cluster it using the K-Means algorithm.

## Part 1 - Extracting data from the web

In [133]:
# import the library we use to open URLs
import urllib.request

In [134]:
# specify which URL/web page we are going to be scraping
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [135]:
# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)

In [136]:
# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup

In [137]:
# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")

In [138]:
#look at the HTML
#print(soup.prettify())

In [139]:
# locate the postal code table by its CSS classes
right_table=soup.find('table', class_='wikitable sortable')
#right_table

In [140]:
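# collect the three table columns: postal code (A), borough (B), neighborhood (C)
A=[]
B=[]
C=[]

for row in right_table.find_all('tr'):
    cells=row.find_all('td')
    # keep only rows that have exactly three data cells
    if len(cells)==3:
        A.append(cells[0].get_text(strip=True))
        B.append(cells[1].get_text(strip=True))
        # replace the '/' separator between neighborhoods with a comma
        C.append(cells[2].get_text(strip=True).replace('/', ','))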
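
To see what the three-cell check does, we can try the same loop on a small hand-written table. This is only a sketch: the HTML snippet below is made up for illustration (it is not the actual Wikipedia markup), and it uses Python's built-in `html.parser` backend instead of `lxml` so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, written by hand for this illustration only.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park / Harbourfront</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")

rows = []
for row in sample_table.find_all("tr"):
    cells = row.find_all("td")
    # The header row holds only <th> cells, so it yields zero <td> cells
    # and is skipped by this length check.
    if len(cells) == 3:
        rows.append(tuple(cell.get_text(strip=True) for cell in cells))

print(rows)
```

In this sample the header row is dropped and the two data rows are kept, including the one whose neighborhood cell is empty.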

In [141]:
import pandas as pd
df=pd.DataFrame(A,columns=['Postal code'])
df['Borough']=B
df['Neighborhood']=C
df.head(10)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"
5,M6A,North York,"Lawrence Manor , Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government"
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern , Rouge"


In [142]:
# remove all rows whose "Borough" is "Not assigned"
not_assigned = df['Borough']=='Not assigned'
df=df[~not_assigned]
df.head(10)

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park , Harbourfront"
5,M6A,North York,"Lawrence Manor , Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government"
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern , Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill , Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
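
The filtering step relies on a boolean mask and the `~` operator. A minimal sketch on a toy dataframe (the three rows are chosen for illustration) shows the mechanics. The `reset_index(drop=True)` call is an optional extra that the notebook does not use; it renumbers the rows from 0 after filtering.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
})

# True where the borough is missing, False elsewhere.
not_assigned = toy["Borough"] == "Not assigned"

# ~ inverts the mask, so only rows with an assigned borough are kept.
kept = toy[~not_assigned].reset_index(drop=True)
print(kept)
```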


In [143]:
# check whether there are any null values
df.isna().sum()

Postal code     0
Borough         0
Neighborhood    0
dtype: int64

In [144]:
# check whether there are any "Not assigned" values in the "Neighborhood" feature
isnull= df['Neighborhood']=='Not assigned'
df[isnull]

Unnamed: 0,Postal code,Borough,Neighborhood


In [145]:
# check whether there are any empty values in the "Neighborhood" feature
isnull= df['Neighborhood']==''
df[isnull]

Unnamed: 0,Postal code,Borough,Neighborhood
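
The three checks above (null, "Not assigned", and empty string) can also be combined into one pass. The helper below is a sketch, not part of the original notebook: `count_missing` and its `placeholders` parameter are hypothetical names, and treating whitespace-only strings as missing is an added assumption.

```python
import pandas as pd

def count_missing(frame, column, placeholders=("", "Not assigned")):
    """Count NaN, whitespace-only, and placeholder values in one column."""
    values = frame[column]
    is_nan = values.isna()
    # Strip whitespace so that "  " is treated like an empty string.
    is_placeholder = values.astype(str).str.strip().isin(placeholders)
    return int((is_nan | is_placeholder).sum())

# Toy data for illustration: one valid name and four kinds of missing value.
toy = pd.DataFrame({"Neighborhood": ["Parkwoods", "", None, "Not assigned", "  "]})
print(count_missing(toy, "Neighborhood"))
```

Applied to the notebook's dataframe, `count_missing(df, 'Neighborhood')` should return 0 if the three separate checks all come back empty.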


In [146]:
df.shape

(103, 3)

## Part 2 - Get the coordinates

In [158]:
#get the csv data with coordinates
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data
print('Data downloaded!')

^C
Data downloaded!
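
Note that the `print` in the cell above runs whether or not `wget` succeeded, so the message alone does not confirm that the file was written (the `^C` in the output suggests the command may have been interrupted). As a hedged alternative, the sketch below downloads with the standard library and reports success only after a non-empty file exists. `download_file` is a hypothetical helper name, and the 30-second timeout is an arbitrary choice.

```python
import shutil
import urllib.request
from pathlib import Path

def download_file(url, dest, timeout=30):
    """Download url to dest and return the number of bytes written.

    Hypothetical helper for illustration; an exception such as
    urllib.error.URLError propagates if the request fails.
    """
    dest = Path(dest)
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    size = dest.stat().st_size
    if size == 0:
        raise ValueError(f"Downloaded file {dest} is empty")
    return size

# Usage in the notebook (requires network access):
# size = download_file("https://cocl.us/Geospatial_data", "Geospatial_Coordinates.csv")
# print(f"Data downloaded: {size} bytes")
```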


In [148]:
geo_df = pd.read_csv('Geospatial_Coordinates.csv')
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [153]:
# merge the neighborhood data with the coordinates on the postal code
result = pd.merge(df,
                 geo_df,
                 left_on='Postal code',
                 right_on='Postal Code')
result.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M3A,North York,Parkwoods,M3A,43.753259,-79.329656
1,M4A,North York,Victoria Village,M4A,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",M5A,43.65426,-79.360636
3,M6A,North York,"Lawrence Manor , Lawrence Heights",M6A,43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government",M7A,43.662301,-79.389494
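
By default `pd.merge` performs an inner join, so any postal code without a match in the coordinates file would be dropped silently. One way to check for this, sketched below on toy data, is a left join with `indicator=True` and `validate="one_to_one"`. The frames `left` and `right` and the postal code "M9Z" are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M9Z"],  # "M9Z" is a made-up code with no match
    "Borough": ["North York", "North York", "Etobicoke"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

checked = pd.merge(
    left,
    right,
    left_on="Postal code",
    right_on="Postal Code",
    how="left",             # keep every row of the left frame
    indicator=True,         # adds a "_merge" column describing the match
    validate="one_to_one",  # raises MergeError if a key is duplicated
)

# Rows marked "left_only" found no coordinates in the right frame.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal code"].tolist()
print(unmatched)
```

If the same check on the real data returns an empty list, the inner join used in the notebook loses no rows.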


In [156]:
# drop the duplicate key column left over from the merge
result.drop(['Postal Code'], axis=1, inplace=True)

In [157]:
result

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park , Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway , Montgomery Road , Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,Business reply mail Processing CentrE,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South , King's Mill Park , Sunnylea ,...",43.636258,-79.498509
