<h1 align=center><font size = 6>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

## Introduction

In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.
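
As a rough illustration of the scraping step, the sketch below parses a small inline HTML table with Python's standard `html.parser` module and loads the rows into a pandas dataframe. The `TableParser` class and the `sample_html` string are hypothetical stand-ins, not the actual Wikipedia markup; a real page would have to be downloaded first and would likely need more cleaning.

```python
from html.parser import HTMLParser

import pandas as pd


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._in_cell = True
            self._row.append('')

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._in_cell = False
        elif tag == 'tr' and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


# Hypothetical excerpt shaped like the postal code table (assumption)
sample_html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)

# First parsed row holds the headers, the remaining rows hold the data
scraped = pd.DataFrame(parser.rows[1:], columns=parser.rows[0])
print(scraped)
```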

In [1]:
import random # library for random number generation
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes

import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 

from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs

print('Libraries imported.')

Libraries imported.


# 1st. Part of the Assignment
### 1. Download and Explore Dataset

The source file is hosted in my GitHub repository, so the next cell reads it directly from its raw URL.

In [2]:
url = 'https://raw.githubusercontent.com/Aero09/coursera_capstone_IBM_DSP/master/TorontoPostalCodes.csv'
# Skip malformed rows (pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv(url, on_bad_lines='skip')
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor


### 2. Transform the dataframe as required

In [3]:
# Rename the column headers
headers = ["PostalCode","Borough","Neighborhood"]
df.columns = headers

In [4]:
# New dataframe without the rows where Borough is 'Not assigned';
# .copy() avoids a SettingWithCopyWarning when df1 is modified later
df1 = df[df.Borough != 'Not assigned'].copy()
df1.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
9,M9A,Downtown Toronto,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


In [5]:
# Function to get all the positions (indexes) that match a value
def getIndexes(dfObj, value):
    ''' Get index positions of value in dataframe '''
    listOfPos = list()
    # Get bool dataframe with True at positions where the given value exists
    result = dfObj.isin([value])
    # Get list of columns that contains the value
    seriesObj = result.any()
    columnNames = list(seriesObj[seriesObj == True].index)
    # Iterate over list of columns and fetch the rows indexes where value exists
    for col in columnNames:
        rows = list(result[col][result[col] == True].index)
        for row in rows:
            listOfPos.append((row, col))
    # Return a list of tuples indicating the positions of value in the dataframe
    return listOfPos
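
The same lookup can be written more compactly. The sketch below uses `numpy.where` on a small hypothetical frame (`sample`, which only mirrors the structure of `df1`) to find every (row index, column name) pair holding a given value. It is an alternative under those assumptions, not a replacement for the function above, and it returns matches in row order rather than column order.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample mirroring the structure of df1 (assumption)
sample = pd.DataFrame(
    {
        'PostalCode': ['M6A', 'M7A', 'M9A'],
        'Borough': ['North York', "Queen's Park", 'Downtown Toronto'],
        'Neighborhood': ['Lawrence Manor', 'Not assigned', "Queen's Park"],
    },
    index=[6, 7, 9],
)

# np.where returns the integer row/column positions of every match
rows, cols = np.where(sample.to_numpy() == 'Not assigned')

# Translate integer positions back to index labels and column names
positions = [(sample.index[r], sample.columns[c]) for r, c in zip(rows, cols)]
print(positions)
```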

In [6]:
# Get list of index positions of all occurrences of "Not assigned" in the dataframe
listOfPositions = getIndexes(df1, 'Not assigned')
 
print('Index positions of "Not assigned" in Dataframe df1: ')
for i in range(len(listOfPositions)):
    print('Position ', i, ' (Row index , Column Name) : ', listOfPositions[i])

Index positions of "Not assigned" in Dataframe df1: 
Position  0  (Row index , Column Name) :  (7, 'Neighborhood')


In [7]:
# Get bool dataframe with True at positions where value is "Not assigned"
result = df1.isin(['Not assigned'])
print('Bool Dataframe representing existence of value "Not assigned" as True')
print(result.head(12))

Bool Dataframe representing existence of value "Not assigned" as True
    PostalCode  Borough  Neighborhood
2        False    False         False
3        False    False         False
4        False    False         False
5        False    False         False
6        False    False         False
7        False    False          True
9        False    False         False
10       False    False         False
11       False    False         False
13       False    False         False
14       False    False         False
15       False    False         False


In [8]:
# Get list of columns that contains the value "Not assigned"
seriesObj = result.any()
columnNames = list(seriesObj[seriesObj == True].index)
print('Names of columns which contains "Not assigned":', columnNames)

Names of columns which contains "Not assigned": ['Neighborhood']


In [9]:
# Iterate over each column and get the rows and columns that match the value "Not assigned"
for col in columnNames:
    rows = list(result[col][result[col] == True].index)
    for row in rows:
        print('Index : ', row, ' Col : ', col)

Index :  7  Col :  Neighborhood


Wherever the value "Not assigned" appears in the Neighborhood column, the value of Borough is copied into it.

In [10]:
# For each matching row, overwrite Neighborhood with that row's Borough value
for col in columnNames:
    rows = list(result[col][result[col] == True].index)
    for row in rows:
        df1.at[row,'Neighborhood'] = df1.at[row,'Borough']
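
The loop above can also be expressed as a single vectorized assignment. The sketch below shows the idea with a boolean mask and `.loc` on a small hypothetical frame (`sample` is an assumption standing in for `df1`).

```python
import pandas as pd

# Hypothetical sample standing in for df1 (assumption)
sample = pd.DataFrame({
    'PostalCode': ['M6A', 'M7A', 'M9A'],
    'Borough': ['North York', "Queen's Park", 'Downtown Toronto'],
    'Neighborhood': ['Lawrence Manor', 'Not assigned', "Queen's Park"],
})

# True for the rows whose Neighborhood still needs a value
mask = sample['Neighborhood'] == 'Not assigned'

# Copy Borough into Neighborhood for those rows only
sample.loc[mask, 'Neighborhood'] = sample.loc[mask, 'Borough']
print(sample)
```

Using `.loc` with a mask avoids iterating over row labels and modifies the frame in one step.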

In [11]:
df1.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Queen's Park
9,M9A,Downtown Toronto,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


In [12]:
# Define PostalCode column as Index
df1.set_index('PostalCode', inplace=True)
df1.columns

Index(['Borough', 'Neighborhood'], dtype='object')

In [13]:
df1.head(12)

PostalCode,Borough,Neighborhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M6A,North York,Lawrence Heights
M6A,North York,Lawrence Manor
M7A,Queen's Park,Queen's Park
M9A,Downtown Toronto,Queen's Park
M1B,Scarborough,Rouge
M1B,Scarborough,Malvern
M3B,North York,Don Mills North


In [14]:
# Group rows that share a postal code, joining their neighborhoods into one comma-separated cell
slted_df1 = df1.groupby(["PostalCode","Borough"])['Neighborhood'].apply(lambda tags: ','.join(tags))

In [15]:
slted_df1

PostalCode  Borough    
M1B         Scarborough                                        Rouge,Malvern
M1C         Scarborough                 Highland Creek,Rouge Hill,Port Union
M1E         Scarborough                      Guildwood,Morningside,West Hill
M1G         Scarborough                                               Woburn
M1H         Scarborough                                            Cedarbrae
                                                 ...                        
M9N         York                                                      Weston
M9P         Etobicoke                                              Westmount
M9R         Etobicoke      Kingsview Village,Martin Grove Gardens,Richvie...
M9V         Etobicoke      Albion Gardens,Beaumond Heights,Humbergate,Jam...
M9W         Etobicoke                                              Northwest
Name: Neighborhood, Length: 103, dtype: object
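
To make the grouping step easier to follow, here is a minimal sketch on a hypothetical three-row frame (`sample` is an assumption, not the real data). It joins the neighborhoods per postal code and returns a flat dataframe in one chain by calling `reset_index()` on the grouped result.

```python
import pandas as pd

# Hypothetical rows: M1B appears twice, M1G once (assumption)
sample = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

grouped = (
    sample.groupby(['PostalCode', 'Borough'])['Neighborhood']
    .agg(','.join)   # concatenate the neighborhoods of each group
    .reset_index()   # turn the two index levels back into columns
)
print(grouped)
```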

The result of the previous cell is a Series indexed by PostalCode and Borough, so it is now converted to a dataframe.

In [16]:
# Convert the grouped Series to a dataframe
df2 = slted_df1.to_frame()
df2

PostalCode,Borough,Neighborhood
M1B,Scarborough,"Rouge,Malvern"
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
M1E,Scarborough,"Guildwood,Morningside,West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
...,...,...
M9N,York,Weston
M9P,Etobicoke,Westmount
M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie..."
M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam..."


In [44]:
# Preview with both index levels moved back to regular columns (not in place; df2 is unchanged)
df2.reset_index(level=['PostalCode','Borough'])

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie..."
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam..."


In [50]:
# Create a new dataframe with that flat structure
df3 = df2.reset_index()
df3

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie..."
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam..."


# 2nd. Part of the Assignment
### 1. Get the geographical coordinates of the neighborhoods

The coordinates file is also hosted in my GitHub repository, so the next cell reads it directly from its raw URL.

In [51]:
url = 'https://raw.githubusercontent.com/Aero09/coursera_capstone_IBM_DSP/master/Geospatial_Coordinates.csv'
# Skip malformed rows (pandas >= 1.3; older versions used error_bad_lines=False)
df4 = pd.read_csv(url, on_bad_lines='skip')

# Assign the required column names to the new dataframe
headers = ["PostalCode","Latitude","Longitude"]
df4.columns = headers
df4

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [59]:
# Add the Latitude and Longitude columns to the final dataframe.
# This row-by-row assignment assumes df3 and df4 list the postal codes in the same order.
df3['Latitude'] = df4['Latitude']
df3['Longitude'] = df4['Longitude']
df3

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village,Martin Grove Gardens,Richvie...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",43.739416,-79.588437
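
Copying the columns across works here because both dataframes happen to be sorted by postal code. A more defensive option, sketched below on hypothetical frames (`neigh` and `coords` are assumptions, with coordinate values taken from the output above), is to join on the `PostalCode` key with `DataFrame.merge` so that row order no longer matters.

```python
import pandas as pd

# Hypothetical neighborhood frame (assumption)
neigh = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
})

# Hypothetical coordinate frame, deliberately in a different row order
coords = pd.DataFrame({
    'PostalCode': ['M1E', 'M1B', 'M1C'],
    'Latitude': [43.763573, 43.806686, 43.784535],
    'Longitude': [-79.188711, -79.194353, -79.160497],
})

# Left join keeps every neighborhood row; validate raises an error
# if a postal code is duplicated on either side
merged = neigh.merge(coords, on='PostalCode', how='left', validate='one_to_one')
print(merged)
```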


# 3rd. Part of the Assignment
### 1. Explore and cluster the neighborhoods in Toronto

#### Use geopy library to get the latitude and longitude values of Toronto.

In [70]:
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim
# map rendering library
import folium 

In [71]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
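
Note that `geolocator.geocode` returns `None` when the address cannot be resolved, in which case reading `location.latitude` fails. The sketch below shows one way to guard against that. The `coords_or_default` helper and the `TORONTO_FALLBACK` constant are hypothetical names introduced only for illustration, and a `SimpleNamespace` stands in for a real geocoder result so that the example runs without network access.

```python
from types import SimpleNamespace


def coords_or_default(location, default):
    """Return (latitude, longitude) from a geocoder result,
    or `default` when the lookup returned None."""
    if location is None:
        return default
    return (location.latitude, location.longitude)


# Fallback taken from the coordinates printed above (assumption: good enough as a map center)
TORONTO_FALLBACK = (43.653963, -79.387207)

# Stand-in for a successful geocoder result (hypothetical object)
found = SimpleNamespace(latitude=43.6563221, longitude=-79.3809161)

print(coords_or_default(found, TORONTO_FALLBACK))
print(coords_or_default(None, TORONTO_FALLBACK))
```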


In [72]:
neighborhoods = df3

#### Create a map of Toronto with neighborhoods superimposed on top.

In [73]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Let's slice the original dataframe and create a new dataframe of the Downtown Toronto data.

In [74]:
downtownToronto_data = neighborhoods[neighborhoods['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
downtownToronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown,St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937


Let's get the geographical coordinates of Downtown Toronto.

In [76]:
address = 'Downtown Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


Let's visualize the Downtown Toronto neighborhoods on a map.

In [78]:
# create map of Downtown Toronto using latitude and longitude values
map_DowntownToronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(downtownToronto_data['Latitude'], downtownToronto_data['Longitude'], downtownToronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_DowntownToronto)  
    
map_DowntownToronto