# Toronto - The City Of Neighborhoods 

The strength and vitality of the many neighbourhoods that make up Toronto, Ontario, Canada, have earned the city its unofficial nickname, "the city of neighbourhoods". The City of Toronto officially recognizes over 140 neighbourhoods. This project explores, segments, and clusters Toronto by neighbourhood, using data science techniques to find similarities and dissimilarities among them.

## Install Dependencies

In [None]:
!pip3 install bs4
!pip3 install requests
!pip3 install html5lib

## Import Dependencies

We import BeautifulSoup to scrape the Wikipedia page, requests to make HTTP calls, html5lib as the HTML parser that BeautifulSoup uses, and pandas to work with the extracted data as a DataFrame.

In [116]:
from bs4 import BeautifulSoup
import requests
import html5lib
import pandas as pd

## Data Collection - Scrape Files

In [138]:
scraping_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html_file = requests.get(scraping_url).text
soup = BeautifulSoup(html_file, "html5lib")


## Data Preprocessing - Convert Files into DataFrame

The scraped HTML is converted with pandas into a DataFrame with three columns: PostalCode, Borough, and Neighborhood.

In [139]:
# Create a list of neighborhoods
neighborhoods = []
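for row in soup.find("table").find_all("td"):
    # Skip postal codes that have no borough assigned
    if row.span.text == "Not assigned":
        continue
    data = {}
    # The first three characters of the cell are the postal code prefix
    data["PostalCode"] = row.p.text[:3]
    # The span text has the form "Borough(Neighborhood / Neighborhood)"
    data["Borough"] = row.span.text.split("(")[0]
    data["Neighborhood"] = row.span.text.split("(")[1].strip(")").replace("/", ",").replace(")", " ").strip(" ")
    neighborhoods.append(data)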

# create dataframe
df = pd.DataFrame(neighborhoods)

# replace outlying formats for boroughs
df['Borough']=df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
    'EtobicokeNorthwest':'Etobicoke Northwest',
    'East YorkEast Toronto':'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

# replace not assigned neighborhoods with the borough name
# (assign through df.loc so the change is written to df itself, not to a copy)
not_assigned = df["Neighborhood"] == "Not assigned"
df.loc[not_assigned, "Neighborhood"] = df.loc[not_assigned, "Borough"]
df.sort_values(["PostalCode"],ascending=True, inplace=True)
df.reset_index(drop=True,inplace=True)
df.head(10)
    

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern , Rouge |
| 1 | M1C | Scarborough | Rouge Hill , Port Union , Highland Creek |
| 2 | M1E | Scarborough | Guildwood , Morningside , West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Kennedy Park , Ionview , East Birchmount Park |
| 7 | M1L | Scarborough | Golden Mile , Clairlea , Oakridge |
| 8 | M1M | Scarborough | Cliffside , Cliffcrest , Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff , Cliffside West |
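
The string handling in the cell above can be hard to follow as a single chained expression. A minimal sketch of the same idea as a standalone helper follows, assuming each cell's text has the form "Borough(Name / Name)". The function name `split_borough_and_neighborhoods` is hypothetical and not part of the notebook. Unlike the cell above, this version also trims the space before each comma, so it yields "Malvern, Rouge" rather than "Malvern , Rouge".

```python
def split_borough_and_neighborhoods(cell_text):
    """Split 'Borough(Name / Name)' into (borough, 'Name, Name').

    Hypothetical helper: it assumes the borough precedes the first "("
    and that neighborhood names are separated by "/".
    """
    borough, _, rest = cell_text.partition("(")
    # Drop the closing parenthesis, then split the names on "/"
    names = [name.strip() for name in rest.rstrip(") ").split("/")]
    # Discard empty pieces so text without parentheses gives an empty string
    neighborhoods = ", ".join(name for name in names if name)
    return borough.strip(), neighborhoods


borough, neighborhoods = split_borough_and_neighborhoods("Scarborough(Malvern / Rouge)")
print(borough, "|", neighborhoods)
```

Keeping the parsing in one small function makes it possible to check edge cases, such as text with no parentheses, without re-running the scrape.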


## Data Exploration

In [123]:

df.shape

(103, 3)
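
Beyond the shape, a few quick checks can confirm that the table is structured as expected. The sketch below is illustrative only: it runs on a small hand-built frame (the name `sample` is an assumption) that copies three of the rows displayed earlier, not on the full scraped data. In the notebook, the same calls would be applied to `df`.

```python
import pandas as pd

# Hand-built sample that mirrors three rows of the scraped table
sample = pd.DataFrame(
    [
        {"PostalCode": "M1B", "Borough": "Scarborough", "Neighborhood": "Malvern , Rouge"},
        {"PostalCode": "M1G", "Borough": "Scarborough", "Neighborhood": "Woburn"},
        {"PostalCode": "M1H", "Borough": "Scarborough", "Neighborhood": "Cedarbrae"},
    ]
)

# Number of rows and columns, as reported by .shape
n_rows, n_cols = sample.shape

# Each postal code prefix should appear only once
codes_are_unique = sample["PostalCode"].is_unique

# Number of postal code rows per borough
rows_per_borough = sample["Borough"].value_counts()

print(n_rows, n_cols, codes_are_unique)
print(rows_per_borough)
```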