                          

<h1 align=center><font size = 5>Segmenting and Clustering the Neighborhoods in Toronto</font></h1>

## Introduction

This project segments and clusters the neighborhoods of the city of Toronto. The neighborhood names are extracted from the Wikipedia page "List of postal codes of Canada: M" (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).
The Foursquare API is used to retrieve information about the venues in each postcode area. K-Means clustering then groups the postcode areas by venue density.


## Table of Contents

1. <a href="#item1">Extract the List of Neighborhoods and Process the Data </a>
2. <a href="#item2">Get the Latitude and Longitude of the Neighborhoods</a>  
3. <a href="#item3">Clustering of the Neighborhoods and Conclusion</a>  

## Part 1: Extract the List of Neighborhoods and Process the Data

In [31]:
# import packages

from bs4 import BeautifulSoup
import requests
import pandas as pd


In [32]:
# scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
# store the content of the webpage into a string

url_can = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
str_can = requests.get(url_can).text 
type(str_can)

str

In [33]:
# parse the HTML string into a BeautifulSoup object

html_can = BeautifulSoup(str_can, 'lxml') 
type(html_can)

bs4.BeautifulSoup








Step 2: Extract and save targeted data into a dataframe 


In [34]:

# locate the table that holds the neighborhood data

neighborhoods = html_can.find('table', class_ = 'wikitable')

In [35]:
# collect every <tr> (table row) tag inside the table

neigh = neighborhoods.find_all('tr')

In [36]:
# the header row of the table contains:
#<th>Postcode</th>
#<th>Borough</th>
#<th>Neighbourhood
#</th>

# column names for the new DataFrame
heads = ['Postcode','Borough','Neighborhood']

# extract each row for ('Postcode', 'Borough', 'Neighbourhood') from the table,
# split the row text into its three fields,
# and collect the fields in a list (DataFrame.append was removed in pandas 2.0)
rows = []
for row in neigh:
    info = row.text.split('\n')[1:-1]
    rows.append({'Postcode': info[0], 'Borough': info[1], 'Neighborhood': info[2]})

# build the DataFrame once from the collected rows
df_can = pd.DataFrame(rows, columns=heads)

df_can.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
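
The row extraction above can also be sketched by reading the `<td>` cells directly instead of splitting the row text on newlines. This is a minimal sketch, not the notebook's method: `sample_html` is a hypothetical stand-in for the Wikipedia table, and the built-in `html.parser` is assumed in place of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample that mimics the structure of the Wikipedia table.
sample_html = """
<table class="wikitable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").find("table", class_="wikitable")

records = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    # The header row holds <th> cells only, so it yields no <td> and is skipped.
    if len(cells) != 3:
        continue
    records.append([cell.get_text(strip=True) for cell in cells])

sample_df = pd.DataFrame(records, columns=["Postcode", "Borough", "Neighborhood"])
print(sample_df)
```

Because the header row is skipped while parsing, a table built this way would not need a separate step to drop it afterwards.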


In [37]:
#drop the first row, which is not real data but just the original column heads on the website

df_can = df_can.iloc[1:]
df_can.head()

Unnamed: 0,Postcode,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront





Step 3: Data Cleaning

For this part, I carried out three procedures:
first, delete all the rows without a Borough;
second, for the remaining rows without a Neighborhood, assign the Borough name to Neighborhood;
third, combine the rows that share the same postcode.
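
The three procedures can be tried out on a handful of rows before running them on the full table. The sketch below assumes a hypothetical `toy` DataFrame with the same columns as `df_can`; the row values are illustrative only.

```python
import pandas as pd

# Hypothetical rows in the same shape as df_can.
toy = pd.DataFrame(
    {
        "Postcode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)

# 1. Drop the rows that have no borough.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# 2. Where the neighborhood is missing, fall back to the borough name.
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]

# 3. Keep one row per postcode, joining its neighborhoods with commas.
toy_clean = (
    toy.groupby(["Postcode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(toy_clean)
```

Here `M1A` disappears in step 1, `M7A` takes its borough name in step 2, and the two `M5A` rows collapse into one in step 3.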

In [38]:
# drop those data with Borough = "Not assigned"

no_br_index = df_can.index[df_can['Borough'] == 'Not assigned'] # extract the index of rows without "Borough"
df_can.drop(no_br_index, inplace=True) # filter the Dataframe

df_can.shape

(211, 3)

In [39]:
# assign "Borough" to "Neighborhood" if "Neighborhood" was not assigned

# select the affected rows with a boolean mask and assign through .loc,
# which writes the new values into df_can itself
no_nbhd = df_can['Neighborhood'] == 'Not assigned'
df_can.loc[no_nbhd, 'Neighborhood'] = df_can.loc[no_nbhd, 'Borough']
        
df_can.shape

(211, 3)

In [40]:
# combine the rows that share a postcode, joining their neighborhood names with commas
df = df_can.groupby(['Postcode', 'Borough'])['Neighborhood'].apply(', '.join).to_frame().reset_index()

#df.columns = ['Postcode', 'Borough', 'Neighborhood']

In [41]:
df.head(20)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [42]:
print('The shape of the processed dataframe is: ',  df.shape)

The shape of the processed dataframe is:  (103, 3)


## Part 2: Get the Latitude and Longitude of the Neighborhoods

In [46]:
# import package to work with streams
import io

# extract the data containing latitude and longitude from http://cocl.us/Geospatial_data
url_geo="http://cocl.us/Geospatial_data"

# download the file content as bytes
str_geo=requests.get(url_geo).content

# convert the data into dataframe
geo=pd.read_csv(io.StringIO(str_geo.decode('utf-8')))
geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
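
As a side note on the download step, `requests` exposes the response body as bytes in `.content`, and `pd.read_csv` accepts a binary file-like object. The sketch below assumes a small hypothetical payload, `raw`, in place of the real download, and reads it through `io.BytesIO` without a separate decode step.

```python
import io
import pandas as pd

# Hypothetical bytes standing in for the body of the HTTP response.
raw = (
    b"Postal Code,Latitude,Longitude\n"
    b"M1B,43.806686,-79.194353\n"
    b"M1C,43.784535,-79.160497\n"
)

# read_csv parses the bytes directly from an in-memory binary buffer.
coords = pd.read_csv(io.BytesIO(raw))
print(coords.dtypes)
```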


In [47]:
# rename the column 'Postal Code' to 'Postcode'
geo.rename(columns={'Postal Code':'Postcode'}, inplace=True)

# merge the dataframe based on the column 'Postcode'
geo = pd.merge(geo, df, on='Postcode')
geo.head()

Unnamed: 0,Postcode,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,Scarborough,"Rouge, Malvern"
1,M1C,43.784535,-79.160497,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,43.770992,-79.216917,Scarborough,Woburn
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae
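
`pd.merge` performs an inner join by default, so a postcode that appears in only one of the two tables is dropped without any message. One way to check for this, sketched here on hypothetical data (the postcode `M9Z` is made up for illustration), is a left join with `indicator=True` and `validate="one_to_one"`.

```python
import pandas as pd

# Hypothetical coordinate table; M9Z has no matching neighborhood row.
left = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C", "M9Z"],
        "Latitude": [43.806686, 43.784535, 43.0],
        "Longitude": [-79.194353, -79.160497, -79.0],
    }
)
right = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighborhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
    }
)

# indicator=True adds a "_merge" column; validate raises if a postcode repeats.
checked = pd.merge(
    left, right, on="Postcode", how="left", indicator=True, validate="one_to_one"
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postcode with coordinates also has a borough and neighborhood entry.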


In [48]:
# reorder the columns of the merged dataframe and show the result
neigh_geo = geo[['Postcode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']]
neigh_geo.head(20)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
