# **Segmenting and Clustering Neighborhoods in Toronto**

In this assignment, we will explore, segment, and cluster the neighborhoods in the city of Toronto based on postal code and borough information.

# **PART 1 Web Scraping**


The required libraries are installed and imported.

In [256]:
!pip install bs4
!pip install html5lib  # parser passed to BeautifulSoup below
#!pip install pandas
#!pip install requests

In [257]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [258]:
# The URL below contains an HTML table listing the postal codes, boroughs, and neighbourhoods of Toronto
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# Get the contents of the web page as text and store them in a variable called html_data
html_data = requests.get(url).text

A BeautifulSoup object is created from the downloaded HTML.

In [259]:
soup = BeautifulSoup(html_data,"html5lib")

Each table cell is parsed into a postal code, a borough, and a neighborhood, and the results are loaded into a dataframe.

In [260]:
table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
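
The string handling inside the loop can be easier to follow in isolation. The sketch below assumes each cell's span text has the form `Borough(Name / Name)`, which is what the parsing code above implies. The helper name `split_cell_text` and the sample string are hypothetical and are not part of the original notebook.

```python
def split_cell_text(text):
    """Split 'Borough(Name / Name)' into (borough, 'Name, Name').

    Assumes the borough comes first and the neighborhoods follow in
    parentheses, separated by ' /'. Returns None for 'Not assigned'.
    """
    if text.strip() == 'Not assigned':
        return None
    borough, _, rest = text.partition('(')
    # Drop the closing parenthesis and turn ' /' separators into commas
    neighborhood = rest.rstrip(')').replace(' /', ',').strip()
    return borough.strip(), neighborhood


# Hypothetical sample input, modelled on the third row of the table above
print(split_cell_text('Downtown Toronto(Regent Park / Harbourfront)'))
```

Keeping the parsing in a small function like this makes it possible to check the edge cases (unassigned cells, a single neighborhood) without downloading the page again.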


The dimensions of the dataframe are checked.

In [261]:
df.shape

(103, 3)

# **PART 2 Use geopy library** 

The geopy library is used to get the latitude and longitude of each Toronto neighborhood.

In [262]:
!pip install geopy



To create an instance of the geocoder, a user_agent must be defined. The agent is named Course_Project, as shown below.

In [263]:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="Course_Project")

In [264]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


In [265]:
df.tail()

Unnamed: 0,PostalCode,Borough,Neighborhood
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
102,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [266]:
# Only the first neighborhood listed in each row is used for the coordinate lookup,
# so the Neighborhood column is reduced to that first name
df['Neighborhood'] = df['Neighborhood'].str.split(',').str[0]

# Lists that will store the latitude and longitude of each row
latitude = []
longitude = []

# The latitude and longitude of each row are obtained
for neigh, borough in zip(df['Neighborhood'], df['Borough']):
    # First try the neighborhood together with the borough. geocode() returns
    # None when nothing is found (and may raise on a network error), so in
    # either case fall back to a query that uses only the borough.
    try:
        location = geolocator.geocode(f"{neigh}, {borough}, Toronto, Canada")
    except Exception:
        location = None
    if location is None:
        location = geolocator.geocode(f"{borough}, Toronto, Canada")
    latitude.append(location.latitude)
    longitude.append(location.longitude)

df.shape
len(longitude)
len(latitude)

103
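
The "try the specific query, then fall back to a broader one" idea can also be written as a small helper that does not depend on any particular geocoding service. The following is a minimal sketch, not the notebook's own method: the function name `first_match`, the stand-in geocoder, and its coordinates are all made up for illustration.

```python
from collections import namedtuple


def first_match(geocode, queries):
    """Return (latitude, longitude) for the first query that resolves.

    `geocode` is any callable that takes a query string and returns an
    object with .latitude and .longitude, or None when nothing is found.
    """
    for query in queries:
        try:
            location = geocode(query)
        except Exception:
            # Treat a failed request like a miss and try the next query
            location = None
        if location is not None:
            return location.latitude, location.longitude
    return None  # no query resolved; the caller decides what to do


# Stand-in geocoder with made-up coordinates, for illustration only
Point = namedtuple('Point', ['latitude', 'longitude'])
known = {'Etobicoke, Toronto, Canada': Point(1.0, 2.0)}

result = first_match(known.get, ['Nowhere, Etobicoke, Toronto, Canada',
                                 'Etobicoke, Toronto, Canada'])
print(result)
```

In the notebook, `geolocator.geocode` could be passed as the `geocode` argument. Since the public Nominatim service asks for at most one request per second, wrapping it with geopy's `RateLimiter` (from `geopy.extra.rate_limiter`, with `min_delay_seconds=1`) is probably worth considering for a loop over 103 rows.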

Columns for latitude and longitude are added to the dataframe.

In [267]:
df['Latitude'] = latitude
df['Longitude'] = longitude

df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.758800,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,Regent Park,43.660706,-79.360457
3,M6A,North York,Lawrence Manor,43.722079,-79.437507
4,M7A,Queen's Park,Ontario Provincial Government,43.659659,-79.390340
...,...,...,...,...,...
98,M8X,Etobicoke,The Kingsway,43.647381,-79.511333
99,M4Y,Downtown Toronto,Church and Wellesley,43.658124,-79.375609
100,M7Y,East Toronto Business,Enclave of M4L,43.721789,-79.374027
101,M8Y,Etobicoke,Old Mill South,43.649826,-79.494334
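
Because the lookup falls back to a borough-level query when the neighborhood is not found, some rows may receive coarse or misplaced coordinates. One way to spot obvious outliers is a bounding-box check. The sketch below is only an illustration: the box limits are rough, assumed values for the City of Toronto, and the name `in_toronto_box` is hypothetical. The two sample rows are copied from the output above.

```python
# Approximate bounding box for the City of Toronto (assumed, rounded values)
LAT_MIN, LAT_MAX = 43.58, 43.86
LON_MIN, LON_MAX = -79.64, -79.11


def in_toronto_box(lat, lon):
    """Return True when the point lies inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


# Two rows taken from the dataframe output above: (PostalCode, Latitude, Longitude)
rows = [('M3A', 43.758800, -79.320197), ('M8X', 43.647381, -79.511333)]

# Postal codes whose coordinates fall outside the box and deserve a second look
outside = [code for code, lat, lon in rows if not in_toronto_box(lat, lon)]
print(outside)
```

With the full dataframe, the same check could be run over `zip(df['PostalCode'], df['Latitude'], df['Longitude'])`. A row flagged this way is not necessarily wrong (the table also contains a Mississauga entry), but it is a reasonable candidate for manual review.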


# **Analyze Each Neighborhood**

The first rows of the dataframe are shown, followed by the mean latitude and longitude for each neighborhood.

In [268]:
df.shape
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7588,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,Regent Park,43.660706,-79.360457
3,M6A,North York,Lawrence Manor,43.722079,-79.437507
4,M7A,Queen's Park,Ontario Provincial Government,43.659659,-79.39034


In [269]:
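# Average only the numeric coordinate columns; recent pandas versions raise an error if mean() is applied to text columns
df.groupby('Neighborhood')[['Latitude', 'Longitude']].mean().reset_index()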

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Agincourt,43.785353,-79.278549
1,Alderwood,43.601717,-79.545232
2,Bathurst Manor,43.763893,-79.456367
3,Bayview Village,43.769197,-79.376662
4,Bedford Park,43.737388,-79.410925
...,...,...,...
96,Willowdale West,43.761510,-79.410923
97,Woburn,43.759824,-79.225291
98,Woodbine Heights,43.699920,-79.319279
99,York Mills,43.744039,-79.406657


In [270]:
print('There are {} unique neighborhoods.'.format(df['Neighborhood'].nunique()))

There are 101 unique neighborhoods.


# **Create a map of Toronto with neighborhoods**

In [271]:
import folium # map rendering library

# Geocode the city itself to center the map
address = 'Toronto, Ontario'
location_toronto = geolocator.geocode(address)
latitude_toronto = location_toronto.latitude
longitude_toronto = location_toronto.longitude

map1 = folium.Map(
    location=[latitude_toronto, longitude_toronto],
    tiles='cartodbpositron',
    zoom_start=11
)

# Add one circle marker per row of the dataframe
for lat, lng in zip(df['Latitude'], df['Longitude']):
    folium.CircleMarker(
        location=[lat, lng],
        radius=6,
        fill=True,
        fill_opacity=0.7
    ).add_to(map1)

map1