# Toronto Neighbourhoods: Segmentation & Clustering

In [4]:
!pip install beautifulsoup4
import pandas as pd
import requests
from bs4 import BeautifulSoup



### First let's get the HTML response

In [5]:
# request the Wikipedia page listing Toronto postal codes (those starting with M)
wikiurl = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(wikiurl)
print(response.status_code)  # 200 means the request succeeded

200


### And parse it into a BeautifulSoup object

In [7]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(response.text, 'html.parser')
to_codes=soup.find('table',{'class':"wikitable sortable"})
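
To make the table lookup above easier to follow, here is a minimal offline sketch. The inline HTML is a hypothetical stand-in for the Wikipedia page, and the names `by_attrs`, `by_css` and `headers` are illustrative only. It shows the same `find` call next to a CSS-selector alternative that matches both classes regardless of their order.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page: two tables with different classes.
html = """
<table class="navbox"><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Same style of lookup as in the notebook: match on the class attribute string.
by_attrs = soup.find("table", {"class": "wikitable sortable"})

# CSS selector alternative: requires both classes, in any order.
by_css = soup.select_one("table.wikitable.sortable")

# Pull the header cells out of the matched table.
headers = [th.get_text(strip=True) for th in by_attrs.find_all("th")]
print(headers)
```

Either lookup returns `None` when no table matches, so checking the result before passing it on is a cheap safeguard.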


### Great! Now we can read it into a dataframe

In [15]:
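from io import StringIO

# read_html returns a list of DataFrames, one per <table> it finds;
# wrapping the HTML in StringIO avoids the deprecation warning that
# recent pandas versions raise for literal HTML strings
tables = pd.read_html(StringIO(str(to_codes)))

# we passed in a single table, so take the first (and only) entry
df = tables[0]
df.head()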

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [14]:
print(df.shape) 

(180, 3)


### And remove the rows whose borough is 'Not assigned'

In [26]:
df1=df[df['Borough'] != "Not assigned"]
df1.reset_index(drop=True, inplace=True)
df1.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [18]:
print(df1.shape) 

(103, 3)
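
As a rough illustration of the filtering step, the sketch below applies the same boolean mask to a small hypothetical sample (`raw` and `assigned` are made-up names, not the notebook's variables). It assumes you want an independent DataFrame afterwards, so it chains `reset_index` and `copy` instead of modifying the filtered slice in place.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the scraped table.
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for rows whose borough has been assigned.
mask = raw["Borough"] != "Not assigned"

# Keep the assigned rows, renumber them from 0, and take an independent copy
# so that later column assignments do not act on a slice of `raw`.
assigned = raw[mask].reset_index(drop=True).copy()

print(assigned.shape)
```

Comparing `mask.sum()` with `len(raw)` is a quick way to see how many rows the filter drops.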


## OK! Now we need to get those latitudes and longitudes.

### Let's use the CSV file. We'll load it, match the postal codes in our dataframe against those in the CSV, and add the latitude and longitude of each match to our dataframe.

In [41]:
geo_csv = pd.read_csv('https://cocl.us/Geospatial_data')

# df1 was created by filtering df, so take an explicit copy before adding
# columns; this avoids pandas' SettingWithCopyWarning
df1 = df1.copy()

# look up each postal code in the CSV and take the matching coordinates
df1['Latitude'] = df1['Postal Code'].apply(lambda x: geo_csv[geo_csv['Postal Code']==x]['Latitude'].values[0])

df1['Longitude'] = df1['Postal Code'].apply(lambda x: geo_csv[geo_csv['Postal Code']==x]['Longitude'].values[0])
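
The row-by-row lookup above could also be expressed as a single join. The sketch below is one possible alternative, not the approach this notebook uses: `codes` and `coords` are hypothetical miniature versions of the two tables, with coordinates copied from the first rows of the output.

```python
import pandas as pd

# Hypothetical miniature versions of the notebook's two tables.
codes = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
coords = pd.DataFrame({
    "Postal Code": ["M5A", "M3A", "M4A"],
    "Latitude": [43.654260, 43.753259, 43.725882],
    "Longitude": [-79.360636, -79.329656, -79.315572],
})

# A left join keeps every row of `codes`, in its original order, and adds the
# coordinate columns wherever the postal code matches.
merged = codes.merge(coords, on="Postal Code", how="left")
print(merged)
```

With a left join, a postal code missing from the coordinates table yields `NaN` values, whereas the `.values[0]` lookup would raise an `IndexError` for it.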


In [40]:
df1.head(10)

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |


# Done and dusted!