# Segmenting and Clustering Neighborhoods in Toronto
In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike the New York data, the Toronto neighborhood data is not readily available on the internet in a ready-to-use form. Each data science project is challenging in its own way, so you need to be agile and able to pick up new libraries and tools quickly as the project demands.

In [7]:
# import libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup as bs

print("Libraries imported")

Libraries imported


**Scrape the Wikipedia page for Toronto neighborhood data and wrangle it into a dataframe**

In [15]:
# grab the page from Wikipedia and locate the first table in it
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(wiki_url).text
canada_xml = bs(source, 'lxml')
table = canada_xml.find('table')


In [16]:
# create an empty dataframe with the columns Postalcode, Borough and Neighborhood
column_names = ['Postalcode', 'Borough', 'Neighborhood']
df_canada = pd.DataFrame(columns=column_names)


In [29]:
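# build the table: keep every row that has exactly three <td> cells
# (note: re-running this cell appends the same rows to df_canada again)
for tr_cell in table.find_all('tr'):
    row_data = []
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data) == 3:
        df_canada.loc[len(df_canada)] = row_data

df_canada.shape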

(206, 3)
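
The row-extraction loop above can be tried offline on a small inline table. The following is a minimal sketch, not the notebook's own code: `SAMPLE_HTML` and the helper `table_to_frame` are hypothetical names introduced for illustration, and the built-in `html.parser` is used instead of `lxml` so that no extra parser is required. It collects the rows in a list and builds the dataframe once, rather than appending with `.loc` row by row.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the Wikipedia table (not real scraped data).
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""

def table_to_frame(html, column_names):
    """Return a dataframe built from the <td> rows of the first table in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        # The header row only has <th> cells, so it yields no <td> and is skipped.
        cells = [td.text.strip() for td in tr.find_all("td")]
        if len(cells) == len(column_names):
            rows.append(cells)
    return pd.DataFrame(rows, columns=column_names)

sample_df = table_to_frame(SAMPLE_HTML, ["Postalcode", "Borough", "Neighborhood"])
print(sample_df.shape)
```

Because the helper builds a fresh dataframe on every call, running it twice does not accumulate duplicate rows.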

**Let's clean up the data**

In [42]:
# Drop rows whose borough is "Not assigned", then combine the neighborhoods
# that share a postal code and borough into one comma-separated string
df_boroughs = df_canada[df_canada.Borough != "Not assigned"].reset_index()
df_boroughs = df_boroughs.groupby(['Postalcode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
# Are any neighborhoods still "Not assigned"? (evaluates to False here)
(df_boroughs['Neighborhood'] == "Not assigned").any()
df_boroughs.head()

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge, Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek, Rouge ... |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill, Guildwood, ... |
| 3 | M1G | Scarborough | Woburn, Woburn |
| 4 | M1H | Scarborough | Cedarbrae, Cedarbrae |


**We don't have any "Not assigned" neighborhoods, since the check above evaluates to False.**
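
The filtering, grouping, and "Not assigned" check above can be reproduced on a toy frame. This is a minimal sketch on made-up rows, not the scraped data. The repeated names in the output (for example "Malvern, Rouge, Malvern, Rouge") suggest that `df_canada` may contain duplicate rows, possibly because the table-building cell was run more than once. That is an assumption, and the `drop_duplicates()` step below is an added precaution that is not part of the original cell.

```python
import pandas as pd

# Made-up rows that mimic the shape of df_canada, including repeated rows.
toy = pd.DataFrame({
    "Postalcode": ["M1B", "M1B", "M1B", "M1B", "M5A", "M7A"],
    "Borough": ["Scarborough"] * 4 + ["Downtown Toronto", "Not assigned"],
    "Neighborhood": ["Malvern", "Rouge", "Malvern", "Rouge",
                     "Regent Park", "Not assigned"],
})

# Keep only rows with an assigned borough.
assigned = toy[toy["Borough"] != "Not assigned"]

# Assumed extra step: remove exact duplicate rows before joining names.
deduped = assigned.drop_duplicates()

# One row per (Postalcode, Borough), neighborhoods joined with ", ".
grouped = (
    deduped.groupby(["Postalcode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)

# True if any grouped neighborhood is still the literal "Not assigned".
has_unassigned = bool((grouped["Neighborhood"] == "Not assigned").any())

print(grouped)
print(has_unassigned)
```

Wrapping the check in `print` matters in a notebook, because only the last expression of a cell is displayed automatically.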

In [43]:
df_boroughs.shape

(103, 3)

**We have 103 rows and three columns.**