# Segmenting and Clustering Neighborhoods in Toronto

### Importing Libraries
1. BeautifulSoup is imported for web scraping.
2. requests is imported to retrieve the HTML of the Wikipedia page.
3. pandas is imported to convert the data into a DataFrame.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

### Web Scraping
The requests module fetches the page, and the text of the response is assigned to the source variable. That text is then passed to BeautifulSoup to find the table with class = "wikitable sortable", and the table's text content is printed.

In [2]:
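# download the raw HTML of the Wikipedia page
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')  # the 'lxml' parser must be installed
# keep only the text content of the postal code table
table = soup.find("table", class_="wikitable sortable").text
print(table)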



Postcode
Borough
Neighbourhood


M1A
Not assigned
Not assigned


M2A
Not assigned
Not assigned


M3A
North York
Parkwoods


M4A
North York
Victoria Village


M5A
Downtown Toronto
Harbourfront


M6A
North York
Lawrence Heights


M6A
North York
Lawrence Manor


M7A
Downtown Toronto
Queen's Park


M8A
Not assigned
Not assigned


M9A
Queen's Park
Not assigned


M1B
Scarborough
Rouge


M1B
Scarborough
Malvern


M2B
Not assigned
Not assigned


M3B
North York
Don Mills North


M4B
East York
Woodbine Gardens


M4B
East York
Parkview Hill


M5B
Downtown Toronto
Ryerson


M5B
Downtown Toronto
Garden District


M6B
North York
Glencairn


M7B
Not assigned
Not assigned


M8B
Not assigned
Not assigned


M9B
Etobicoke
Cloverdale


M9B
Etobicoke
Islington


M9B
Etobicoke
Martin Grove


M9B
Etobicoke
Princess Gardens


M9B
Etobicoke
West Deane Park


M1C
Scarborough
Highland Creek


M1C
Scarborough
Rouge Hill


M1C
Scarborough
Port Union


M2C
Not assigned
Not assigned


M3C
North York
Flemingdon Par
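
Printing the table's `.text` flattens it into one cell per line, which is why the values have to be regrouped during data wrangling. As a hedged alternative, the sketch below walks a table row by row instead. It runs on a small hand-written HTML snippet (`sample_html`, not the real Wikipedia page) and uses Python's built-in `html.parser` rather than `lxml`; the helper name `table_rows` is hypothetical and not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (an assumption, not the real page).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def table_rows(html):
    """Return the table as a list of rows, each a list of stripped cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable sortable")
    rows = []
    for tr in table.find_all("tr"):
        # header cells are <th>, data cells are <td>
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


rows = table_rows(sample_html)
print(rows)
```

Because each row keeps its own cells together, this approach would not depend on every row having exactly three non-empty values.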

### Data Wrangling
As our table is still a single string, we have to strip out the blank lines. First, we convert the string into a list of elements by splitting on __\n__ and remove the empty elements from the raw table_list. Each row of the table has 3 elements, so we chunk the list in steps of 3 elements and convert the chunked data into a pandas DataFrame. We then drop the first row, because it only repeats the column names we have already passed as the header. Lastly, we remove the rows for which Borough is Not assigned and reset the index.

In [3]:
table_list = table.split("\n")
table_list[:] = [x for x in table_list if x]  # remove empty elements
# group the flat list into rows of 3 cells (Postcode, Borough, Neighbourhood)
chunked = [table_list[i:i + 3] for i in range(0, len(table_list), 3)]
column = table_list[0:3]  # column names
df = pd.DataFrame(chunked, columns=column)
df.drop([0], inplace=True)  # the first row repeats the column names
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
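
The split, filter, and chunk steps can also be traced on a tiny input without pandas. The sketch below is only an illustration: the string `raw` is a shortened, hand-written imitation of the printed table text, and `chunk_table_text` is a hypothetical helper. It also makes one assumption of the approach explicit: every cell must be non-empty, otherwise dropping blanks would shift the columns, so the helper raises an error when the cell count is not a multiple of the row width.

```python
# Shortened imitation of the printed table text (assumed sample, not the full page).
raw = (
    "\n\nPostcode\nBorough\nNeighbourhood\n\n\n"
    "M1A\nNot assigned\nNot assigned\n\n\n"
    "M3A\nNorth York\nParkwoods\n"
)


def chunk_table_text(text, width=3):
    """Split text on newlines, drop blank entries, and group cells into rows."""
    cells = [cell for cell in text.split("\n") if cell]
    if len(cells) % width != 0:
        raise ValueError("cell count is not a multiple of the row width")
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# The first chunk is the header; the remaining chunks are data rows.
header, *records = chunk_table_text(raw)

# Keep only the rows whose Borough (second cell) is assigned.
kept = [row for row in records if row[1] != "Not assigned"]
print(header)
print(kept)
```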


#### Combining neighborhoods that have the same postcode and separating them with a comma

In [4]:
df = df.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood'].aggregate(lambda x: ', '.join(x)).reset_index()
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
101,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So..."
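
To see what the grouping does in isolation, here is a minimal sketch on four rows taken from the table above. The frame name `sample` is an assumption for illustration, and `.agg(", ".join)` is used as a shorter equivalent of the lambda in the cell above. Passing `sort=False` keeps the groups in order of first appearance rather than sorting them by key.

```python
import pandas as pd

# Four rows copied from the scraped table; M6A appears twice.
sample = pd.DataFrame(
    {
        "Postcode": ["M3A", "M6A", "M6A", "M7A"],
        "Borough": ["North York", "North York", "North York", "Downtown Toronto"],
        "Neighbourhood": ["Parkwoods", "Lawrence Heights", "Lawrence Manor", "Queen's Park"],
    }
)

# One output row per (Postcode, Borough) pair, neighbourhoods joined by ", ".
merged = (
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```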


#### Assigning the borough to the neighbourhood when the neighbourhood is not assigned

In [5]:
df.loc[df['Neighbourhood']=="Not assigned",'Neighbourhood']=df['Borough']
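
The assignment above relies on pandas aligning the right-hand side by index, so each selected row receives the Borough value from the same row. A minimal sketch on two rows from the table (the names `sample` and `mask` are assumptions for illustration) makes the selection explicit on both sides:

```python
import pandas as pd

# Two rows from the scraped table; M9A has a borough but no neighbourhood.
sample = pd.DataFrame(
    {
        "Postcode": ["M7A", "M9A"],
        "Borough": ["Downtown Toronto", "Queen's Park"],
        "Neighbourhood": ["Queen's Park", "Not assigned"],
    }
)

# Boolean mask of the rows whose neighbourhood is missing.
mask = sample["Neighbourhood"] == "Not assigned"

# Copy the borough into the neighbourhood column for those rows only.
sample.loc[mask, "Neighbourhood"] = sample.loc[mask, "Borough"]
print(sample)
```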

In [6]:
df.shape

(103, 3)