# Segmenting and Clustering Neighborhoods in Toronto 
###### John Filak

In this notebook, I scrape information about the neighborhoods in Toronto, Canada. The data is scraped into a pandas DataFrame and then cleaned so that neighborhoods sharing the same postal code appear in the same row, while entries with missing data are filtered out. Because the Geocoder API was not working, I instead used the geospatial coordinates CSV file to get the location values for the neighborhoods.

## Scraping info 

Scrape the Wikipedia page and store the postal code data for Toronto: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [1]:
import pandas as pd 
import numpy as np 


In [2]:
from bs4 import BeautifulSoup
import requests


In [3]:
page_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

page_response = requests.get(page_link,timeout=10)

page_content = BeautifulSoup(page_response.content, 'html.parser')



#### Unfiltered Wiki Page Content 

In [34]:
#print(page_content)

In [35]:
match = page_content.find('table')
#print(match)
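
To make the table lookup easier to follow, here is a minimal sketch of how `find('table')` and `find_all('tr')` / `find_all('td')` pull cell text out of a table. The `sample_html` string is a hypothetical stand-in for the Wikipedia page (an assumed three-column layout), not the real markup, so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (assumed layout: three <td> cells per row)
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table')  # first <table> element in the document

rows = []
for tr in table.find_all('tr'):
    # Header rows contain only <th> cells, so this list is empty for them
    cells = [td.get_text().strip() for td in tr.find_all('td')]
    if cells:
        rows.append(cells)

print(rows)
```

Each inner list holds the postal code, borough, and neighborhood text for one data row; the header row is skipped because it has no `td` cells.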

#### Creating Data Frame to Store Desired Table Data 

In [6]:
columns = ['PostalCode', 'Borough', 'Neighborhood']
df = pd.DataFrame(columns=columns)

In [7]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood


In [36]:
for row in match.find_all('tr'): 
        value = row
        #print(value)
        

#### Filtering through the data in the page and placing each row into a list, which pandas then inserts as a row of the data frame.

In [37]:
for row in match.find_all('tr'):
    # Collect the text of every cell in this table row
    new_tbl = [td.text.rstrip("\n") for td in row.find_all('td')]

    # Skip header rows (no <td> cells) and rows whose borough is 'Not assigned'
    if len(new_tbl) > 1 and new_tbl[1] != 'Not assigned':
        if len(df.loc[df.PostalCode == new_tbl[0]]) == 0:
            # First time this postal code appears: add a new row.
            # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
            new_row = pd.DataFrame([new_tbl], columns=['PostalCode', 'Borough', 'Neighborhood'])
            df = pd.concat([df, new_row], ignore_index=True)
        else:
            # Postal code already present: add this neighborhood to the existing row
            df.loc[df.PostalCode == new_tbl[0], 'Neighborhood'] += ", " + new_tbl[2]
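
The same combining step can also be written without a row-by-row loop. The sketch below is an alternative rather than what this notebook ran: it assumes the scraped rows are already in a DataFrame (the `raw` frame here is a small hand-made sample) and uses `groupby` to join neighborhoods that share a postal code.

```python
import pandas as pd

# Small hand-made sample standing in for the scraped rows (an assumption for illustration)
raw = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M3A', 'North York', 'Parkwoods'],
        ['M6A', 'North York', 'Lawrence Heights'],
        ['M6A', 'North York', 'Lawrence Manor'],
    ],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

# Drop rows whose borough is missing
assigned = raw[raw['Borough'] != 'Not assigned']

# Join the neighborhoods of each postal code into one comma-separated string
grouped = (
    assigned
    .groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

print(grouped)
```

Passing `sort=False` keeps the postal codes in the order they first appear, which matches the ordering produced by the loop.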

In [10]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Not assigned


#### Dropping boroughs with missing values ('Not assigned'). This step was needed before I added the conditional statement that keeps such rows out of the data frame.

In [11]:
index = df[df['Borough'] == ('Not assigned')].index
df.drop(index, inplace=True)


#### Another task was to give a missing neighborhood its borough's name. Since there is only one such value, it can easily be replaced by filtering through the data frame.

In [12]:
df.loc[df.Neighborhood == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
4,M7A,Queen's Park,Not assigned


In [13]:
df = df.replace(to_replace = 'Not assigned', value = 'Queen\'s Park')


In [14]:
df.loc[df.Neighborhood == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighborhood
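
The replacement above hard-codes the borough name, which works because only one row is affected. If more rows were missing a neighborhood, a more general approach would copy each row's own borough. The following is a minimal sketch on a hand-made `sample` frame (an assumption for illustration), not the code this notebook ran.

```python
import pandas as pd

# Hand-made sample with one missing neighborhood (illustrative data)
sample = pd.DataFrame({
    'PostalCode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Lawrence Heights, Lawrence Manor', 'Not assigned'],
})

# Boolean mask of rows whose neighborhood is missing
missing = sample['Neighborhood'] == 'Not assigned'

# Copy the borough name into the neighborhood column for those rows only
sample.loc[missing, 'Neighborhood'] = sample.loc[missing, 'Borough']

print(sample)
```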


In [15]:
df.shape

(103, 3)

In [16]:
df.describe()

Unnamed: 0,PostalCode,Borough,Neighborhood
count,103,103,103
unique,103,11,102
top,M4M,North York,Queen's Park
freq,1,24,2


#### The Geocoder API was not working, so I instead used a CSV file and joined the tables to get the latitude and longitude data.

In [17]:
geo_codes = pd.read_csv('https://cocl.us/Geospatial_data')

In [18]:
geo_codes.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [19]:
geo_codes.dtypes


Postal Code     object
Latitude       float64
Longitude      float64
dtype: object

In [20]:
result = df.merge(geo_codes, left_on='PostalCode', right_on='Postal Code', how='left' )
result.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M3A,North York,Parkwoods,M3A,43.753259,-79.329656
1,M4A,North York,Victoria Village,M4A,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,M5A,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",M6A,43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,M7A,43.662301,-79.389494
5,M9A,Downtown Toronto,Queen's Park,M9A,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353
7,M3B,North York,Don Mills North,M3B,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",M4B,43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",M5B,43.657162,-79.378937
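
A left join silently leaves `NaN` coordinates for any postal code missing from the CSV, so it can help to check the join. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on two tiny hand-made frames; the `M0X` row is a hypothetical code added only to show an unmatched key.

```python
import pandas as pd

# Tiny hand-made frames; 'M0X' is a hypothetical postal code with no coordinates
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0X'],
    'Borough': ['North York', 'North York', 'Hypothetical'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

merged = left.merge(
    right,
    left_on='PostalCode',
    right_on='Postal Code',
    how='left',
    validate='one_to_one',  # raise if a postal code is duplicated on either side
    indicator=True,         # add a '_merge' column describing where each row matched
)

# Postal codes that found no match in the coordinates table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)

# Drop the helper columns once the check is done
merged = merged.drop(columns=['Postal Code', '_merge'])
```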


In [21]:
result = result.drop(['Postal Code'], axis=1)

In [22]:
result.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Downtown Toronto,Queen's Park,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937


In [23]:
result.Borough.describe()

count            103
unique            11
top       North York
freq              24
Name: Borough, dtype: object

In [24]:
result['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           11
Central Toronto      9
West Toronto         6
East York            5
East Toronto         5
York                 5
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

#### From this we can expect North York, with the most postal codes (24), to be the largest part of Toronto, and Queen's Park and Mississauga, with one postal code each, to be the smallest. I will attempt to visualize the data to get a better understanding of whether this is correct.
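
Note that `value_counts()` counts rows, meaning postal codes, and a single row can hold several neighborhoods. As a rough cross-check, one could also count the neighborhood names in each borough. The sketch below assumes the names are separated by `", "` as in the table above and uses a three-row hand-made sample.

```python
import pandas as pd

# Three rows copied from the table above, used as a small sample
sample = pd.DataFrame({
    'Borough': ['North York', 'North York', 'Scarborough'],
    'Neighborhood': ['Parkwoods', 'Lawrence Heights, Lawrence Manor', 'Rouge, Malvern'],
})

# Rows (postal codes) per borough
postal_codes_per_borough = sample['Borough'].value_counts()

# Number of comma-separated names in each row, summed per borough
names_per_row = sample['Neighborhood'].str.split(', ').str.len()
neighborhoods_per_borough = names_per_row.groupby(sample['Borough']).sum()

print(postal_codes_per_borough)
print(neighborhoods_per_borough)
```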

In [25]:
!pip install folium
import folium 



In [27]:
to_map = folium.Map(location=[43.7, -79.4], zoom_start=10)
to_map

From the data in our data frame, I will make a map with markers showing the location of each neighborhood. Using folium, I will superimpose the locations onto the map with a feature group.

In [28]:
colorMap = {
    'North York': 'red',
    'Downtown Toronto': 'blue',
    'Scarborough': 'green',
    'Etobicoke': 'yellow',
    'Central Toronto': 'purple',
    'West Toronto': 'pink',
    'East Toronto': 'black',
    'York': 'orange',
    'East York': 'lightred',
    "Queen's Park": 'beige',
    'Mississauga': 'lightgray',
}

In [29]:
colorMap

{'North York': 'red',
 'Downtown Toronto': 'blue',
 'Scarborough': 'green',
 'Etobicoke': 'yellow',
 'Central Toronto': 'purple',
 'West Toronto': 'pink',
 'East Toronto': 'black',
 'York': 'orange',
 'East York': 'lightred',
 "Queen's Park": 'beige',
 'Mississauga': 'lightgray'}

In [30]:
colorMap['North York']

'red'

In [31]:
from folium.plugins import MarkerCluster

In [32]:
neighborhoods = folium.map.FeatureGroup()

for lat, lng, boro in zip(result.Latitude, result.Longitude, result.Borough): 
    neighborhoods.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius = 5, 
            color = colorMap[boro], 
            fill = True, 
            #fill_color = 'yellow', 
            fill_opacity = 1
        )
    )
to_map.add_child(neighborhoods)
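
The loop above indexes `colorMap[boro]` directly, which raises a `KeyError` if a borough is missing from the dictionary. One possible safeguard is a small helper that builds the marker options with a fallback color. The name `marker_style`, the `'gray'` default, and the two-entry `color_map` are assumptions for illustration; the returned keys are keyword arguments that `folium.CircleMarker` accepts.

```python
# Two entries copied from colorMap, enough to illustrate the lookup
color_map = {'North York': 'red', 'Scarborough': 'green'}


def marker_style(borough, colors, default='gray'):
    """Return keyword arguments for a circle marker (hypothetical helper)."""
    return {
        'radius': 5,
        'color': colors.get(borough, default),  # fall back instead of raising KeyError
        'fill': True,
        'fill_opacity': 1,
        'tooltip': borough,  # show the borough name on hover
    }


print(marker_style('North York', color_map))
print(marker_style('Unlisted Borough', color_map))
```

Inside the loop, the marker could then be created with `folium.CircleMarker([lat, lng], **marker_style(boro, colorMap))`.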

In [33]:
colorMap

{'North York': 'red',
 'Downtown Toronto': 'blue',
 'Scarborough': 'green',
 'Etobicoke': 'yellow',
 'Central Toronto': 'purple',
 'West Toronto': 'pink',
 'East Toronto': 'black',
 'York': 'orange',
 'East York': 'lightred',
 "Queen's Park": 'beige',
 'Mississauga': 'lightgray'}

### Now we have a better visual representation of the different neighborhoods based on what borough they belong to.