# Segmenting and Clustering Neighborhoods in Toronto

## 1. Start by creating a new Notebook for this assignment.

## 2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data in the table of postal codes and transform it into a pandas dataframe like the one shown below:

In [1]:
#Import the libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [44]:
link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(link).text
soup = BeautifulSoup(page, 'lxml')

#Extract the table: it is a <table> tag whose class is 'wikitable'
#The tag and class can be found by reading the page's HTML, e.g. by pressing F12 in Google Chrome or by any other way of viewing the source
table = soup.find('table', class_= 'wikitable')

#Extract the rows
rows = table.find_all('tr')
print("Total numbers of rows: ", len(rows))

#Extract the columns
columns = [v.text for v in rows[0].find_all('th')]
print("Original Columns: ", columns)

#Delete the '\n' symbols in columns
columns = [v.text.replace('\n', '') for v in rows[0].find_all('th')]
print("Modified Columns: ", columns)

Total numbers of rows:  289
Original Columns:  ['Postcode', 'Borough', 'Neighbourhood\n']
Modified Columns:  ['Postcode', 'Borough', 'Neighbourhood']


## 3. Create the dataframe:

In [66]:
#Import the information into the pandas dataframe
df = pd.DataFrame(columns = columns)
print(df, '\n')

#Let's extract one row to see if everything is okay
row = [v.text for v in rows[1].find_all('td')]
print ("Original Row: ", row)

#Again, delete the '\n' symbols in the row
row = [v.text.replace('\n', '') for v in rows[1].find_all('td')]
print ("Modified Row: ", row, '\n')
print ("The type of a row is: ", type(row), '\n')

#Now, insert all row information into the dataframe
for i in range(1, len(rows)): #Skip the first row because it holds the column names
    row_i = [v.text.replace('\n', '') for v in rows[i].find_all('td')]
    #row_i is a list of cell strings; add it as a new row at the end of the dataframe
    #(DataFrame.append was removed in pandas 2.0, so assign with .loc instead)
    df.loc[len(df)] = row_i

print(df.head())

Empty DataFrame
Columns: [Postcode, Borough, Neighbourhood]
Index: [] 

Original Row:  ['M1A', 'Not assigned', 'Not assigned\n']
Modified Row:  ['M1A', 'Not assigned', 'Not assigned'] 

The type of a row is:  <class 'list'> 

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront
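
Growing a dataframe one row at a time inside a loop gets slow for large tables. A common alternative is to collect the cleaned rows in a plain list and construct the dataframe once. The sketch below assumes a hard-coded list named `scraped_rows` (a hypothetical stand-in for the cell text extracted from `rows`), filled with the five rows printed above.

```python
import pandas as pd

# Column names as scraped from the table header
columns = ['Postcode', 'Borough', 'Neighbourhood']

# Hypothetical stand-in for the cleaned cell text of each <tr>;
# the values are the first five rows printed above
scraped_rows = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M2A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M4A', 'North York', 'Victoria Village'],
    ['M5A', 'Downtown Toronto', 'Harbourfront'],
]

# Build the dataframe in a single call instead of adding rows in a loop
sample_df = pd.DataFrame(scraped_rows, columns=columns)
print(sample_df)
```

In the notebook itself, the list could be filled by the same list comprehension used in the loop, one entry per table row.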


In [82]:
#Find the rows with Not assigned value in Borough and drop them
NA_borough = df[df['Borough'] == 'Not assigned'].index
df.drop(NA_borough, inplace = True)
df.reset_index(drop = True, inplace = True)
print(df.head(10))

  Postcode           Borough     Neighbourhood
0      M3A        North York         Parkwoods
1      M4A        North York  Victoria Village
2      M5A  Downtown Toronto      Harbourfront
3      M5A  Downtown Toronto       Regent Park
4      M6A        North York  Lawrence Heights
5      M6A        North York    Lawrence Manor
6      M7A      Queen's Park      Queen's Park
7      M9A         Etobicoke  Islington Avenue
8      M1B       Scarborough             Rouge
9      M1B       Scarborough           Malvern
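
The same filtering can also be written with a boolean mask, which avoids computing the index labels first. This is a minimal sketch on a small hand-made sample (`sample_df` and `assigned` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Small sample with the same columns as the scraped table
sample_df = pd.DataFrame({
    'Postcode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0
assigned = sample_df[sample_df['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```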


In [85]:
#Replace Not assigned neighbourhood with borough value
NA_Neigh = df[df['Neighbourhood'] == 'Not assigned'].index
print("NA_Neigh's information: ", NA_Neigh, '\n')

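#On a first run NA_Neigh has only one value, 6, so a simple replace would be enough
#(if the cell is re-run after the replacement, NA_Neigh is empty)
#df.loc[6, 'Neighbourhood'] = df.loc[6, 'Borough']

#Of course, it is better to use a loop; .loc avoids chained assignment
for i in NA_Neigh:
    df.loc[i, 'Neighbourhood'] = df.loc[i, 'Borough']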
    
#Check if row index 6 is modified
print(df.head(7))

NA_Neigh's information:  Int64Index([], dtype='int64') 

  Postcode           Borough     Neighbourhood
0      M3A        North York         Parkwoods
1      M4A        North York  Victoria Village
2      M5A  Downtown Toronto      Harbourfront
3      M5A  Downtown Toronto       Regent Park
4      M6A        North York  Lawrence Heights
5      M6A        North York    Lawrence Manor
6      M7A      Queen's Park      Queen's Park
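
The replacement can also be done without a loop by selecting the affected rows with a boolean mask. The sketch below uses a two-row sample in which the M7A row is deliberately given a 'Not assigned' neighbourhood for illustration; `sample_df` and `mask` are assumed names.

```python
import pandas as pd

# Hypothetical sample: the M7A neighbourhood is set to 'Not assigned' on purpose
sample_df = pd.DataFrame({
    'Postcode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Lawrence Heights', 'Not assigned'],
})

# True for every row whose neighbourhood is missing
mask = sample_df['Neighbourhood'] == 'Not assigned'

# Copy the borough into the neighbourhood column for those rows only
sample_df.loc[mask, 'Neighbourhood'] = sample_df.loc[mask, 'Borough']
print(sample_df)
```

If no row matches, the mask is all False and the assignment changes nothing, so the cell can be re-run safely.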


In [101]:
#Combine the neighbourhoods that share a postal code into one comma-separated row
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(list).apply(lambda x:', '.join(x)).to_frame().reset_index()
print(df.head(), '\nThe size of the dataframe is: ', df.shape)

  Postcode      Borough                           Neighbourhood
0      M1B  Scarborough                          Rouge, Malvern
1      M1C  Scarborough  Highland Creek, Rouge Hill, Port Union
2      M1E  Scarborough       Guildwood, Morningside, West Hill
3      M1G  Scarborough                                  Woburn
4      M1H  Scarborough                               Cedarbrae 
The size of the dataframe is:  (103, 3)
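
As a worked example of the grouping step, the sketch below runs a shorter variant on three sample rows: `agg(', '.join)` joins the strings of each group directly, without the intermediate list. The names `sample_df` and `combined` are illustrative.

```python
import pandas as pd

# Two rows share the postal code M1B, one row has its own code
sample_df = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Group by postal code and borough, then join each group's neighbourhoods
combined = (
    sample_df
    .groupby(['Postcode', 'Borough'])['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
print(combined)

# After grouping, every postal code should appear exactly once
print(combined['Postcode'].is_unique)
```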


## 4. Submit a link to your Notebook on your Github repository. (End of Part 1)

## 5. Download the Geocoder csv file and join the dataframes

In [108]:
#First, read the csv file
Geodf = pd.read_csv("http://cocl.us/Geospatial_data")
Geodf.head()

  Postal Code   Latitude  Longitude
0         M1B  43.806686 -79.194353
1         M1C  43.784535 -79.160497
2         M1E  43.763573 -79.188711
3         M1G  43.770992 -79.216917
4         M1H  43.773136 -79.239476


In [110]:
#Join the two dataframes
df = df.join(Geodf.set_index('Postal Code'), on = 'Postcode')
df.head()

  Postcode      Borough                           Neighbourhood   Latitude  Longitude
0      M1B  Scarborough                          Rouge, Malvern  43.806686 -79.194353
1      M1C  Scarborough  Highland Creek, Rouge Hill, Port Union  43.784535 -79.160497
2      M1E  Scarborough       Guildwood, Morningside, West Hill  43.763573 -79.188711
3      M1G  Scarborough                                  Woburn  43.770992 -79.216917
4      M1H  Scarborough                               Cedarbrae  43.773136 -79.239476
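
After a join it can be worth checking that every postal code actually received coordinates. The sketch below is one possible way to do this with `DataFrame.merge`, using two small hand-made frames (`neigh` and `geo` are assumed names) that hold values shown above.

```python
import pandas as pd

# Sample of the neighbourhood table
neigh = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})

# Sample of the coordinates table
geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left merge keeps every neighbourhood row, even if no coordinates match
merged = neigh.merge(geo, how='left', left_on='Postcode', right_on='Postal Code')
merged = merged.drop(columns='Postal Code')

# Rows without a match would have NaN in the coordinate columns
missing = merged['Latitude'].isna().sum()
print(merged)
print('Postal codes without coordinates:', missing)
```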


## 6. Submit a link to your Notebook on your Github repository. (End of Part 2)

## 7. Create a map for clustering

In [114]:
#First, import the required libraries for map visualization and clustering analysis
from sklearn.cluster import KMeans
!pip -q install folium
import folium 
import matplotlib.pyplot as plt

In [120]:
#Create a map of Toronto
map_tor = folium.Map(location = [43.6532, -79.3832], zoom_start = 11)

for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighbourhood']):
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'green',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7,
        parse_html=False).add_to(map_tor)
    
map_tor
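
The map above is centred on hard-coded coordinates. As an alternative, the centre could be derived from the data by averaging the latitude and longitude columns. This is only a sketch: `coords` and `center` are assumed names, and the five sample points are the Scarborough rows shown earlier, so the result lies east of downtown Toronto.

```python
import pandas as pd

# Coordinates of the five postal codes printed earlier
coords = pd.DataFrame({
    'Latitude': [43.806686, 43.784535, 43.763573, 43.770992, 43.773136],
    'Longitude': [-79.194353, -79.160497, -79.188711, -79.216917, -79.239476],
})

# Average position of all points, in the [lat, lng] order folium expects
center = [coords['Latitude'].mean(), coords['Longitude'].mean()]
print(center)

# The result could then be passed as folium.Map(location=center, zoom_start=11)
```

With the full dataframe, the same two `mean()` calls would average over every postal code rather than this sample.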
