## Segmenting and Clustering Neighborhoods in Toronto

## Introduction

We will convert addresses into their equivalent latitude and longitude values and use the Foursquare API to explore neighborhoods in Toronto. We will use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters with the *k*-means clustering algorithm. Finally, we will use the Folium library to visualize the neighborhoods in Toronto.

First we will get the data and start exploring it. For that we need to import all the dependencies (libraries) that we will use.

In [77]:
# Import the libraries used for web scraping and data handling

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

##### Fetch the page source and assign it to the variable source, then initialize a BeautifulSoup object named soup

In [78]:
response = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
response.raise_for_status()  # Stop early if the request failed
source = response.text
soup = BeautifulSoup(source, 'lxml')
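
To see what this step produces without hitting the live page, here is a minimal sketch that parses a small inline HTML fragment instead. The fragment and its values are made up for illustration, and it uses Python's built-in `html.parser` rather than `lxml` so that nothing beyond `beautifulsoup4` is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: one header row and two data rows.
html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
<tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='wikitable sortable')

parsed_rows = []
for row in table.find_all('tr'):
    cells = row.find_all('td')  # The header row uses <th>, so this list is empty for it
    if not cells:
        continue
    # get_text(strip=True) removes the surrounding whitespace that .text keeps
    parsed_rows.append([cell.get_text(strip=True) for cell in cells])

# For comparison: the raw .text of a cell keeps its trailing newline
raw_last_cell = table.find_all('tr')[1].find_all('td')[2].text

print(parsed_rows)
print(repr(raw_last_cell))
```

The raw `.text` value is worth checking because any trailing whitespace it carries will affect later string comparisons.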

##### Initialize the csv_writer object and write the column names

In [79]:
csv_file = open('toronto_postal_codes.csv', 'w', newline='')  # newline='' avoids blank lines between rows on some platforms
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Postcode', 'Borough', 'Neighbourhood'])

32
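
The file opened above stays open until `csv_file.close()` is called in a later cell. As an alternative sketch, a `with` block closes the file automatically, and `newline=''` is what the `csv` module documentation recommends when opening a file for writing. The helper name `write_rows` and the sample row are illustrative only, and an in-memory `io.StringIO` buffer stands in for the file so the example has no side effects.

```python
import csv
import io

def write_rows(file_obj, rows):
    """Write a header row followed by the given data rows as CSV."""
    writer = csv.writer(file_obj)
    writer.writerow(['Postcode', 'Borough', 'Neighbourhood'])
    writer.writerows(rows)

# With a real file this would look like:
#     with open('toronto_postal_codes.csv', 'w', newline='') as csv_file:
#         write_rows(csv_file, rows)
buffer = io.StringIO(newline='')
write_rows(buffer, [['M3A', 'North York', 'Parkwoods']])
csv_text = buffer.getvalue()
print(csv_text)
```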

##### Scraping the page to extract data

In [80]:
table = soup.find('table', class_ = 'wikitable sortable') # Gets the table from the webpage
rows = table.find_all('tr') # for table rows

postcodes = [] # Initializes the raw postcodes list
boroughs = [] # Initializes the raw boroughs list
neighbourhoods = [] # Initializes the raw neighbourhoods list

for row in rows:    
    columns = row.find_all('td')
    try :
        if columns[1].text != 'Not assigned':  # To skip if the borough name is 'Not Assigned'
            
            postcode = columns[0].text
            postcodes.append(postcode)
            
            borough = columns[1].text
            boroughs.append(borough)
            
            neighbourhood = columns[2].text.split('\n')[0] # Removing the newline character at the end     
            
            if neighbourhood == 'Not assigned': # Assigning the same name to neighbourhood if it is 'Not Assigned'
                neighbourhood = borough            
                
            neighbourhoods.append(neighbourhood)
             
    except IndexError:  # Skip the first row, which contains the column names
        pass
    
postcode_explored = [] # Initializing the list of explored postcodes
for index_i, postcode_i in enumerate(postcodes) :   
    if postcode_i not in postcode_explored :
        nbds = neighbourhoods[index_i]
        for index_f, postcode_f in enumerate(postcodes) :
            if postcode_i == postcode_f and index_i != index_f:
                nbds = nbds + ', ' + neighbourhoods[index_f] # Concatenating the neighbourhood names
        csv_writer.writerow([postcode_i, boroughs[index_i], nbds]) # Writing the rows in the csv file
        postcode_explored.append(postcode_i)
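
The nested loop above compares every postcode with every other one. A dictionary keyed by postcode gives the same grouping in a single pass. The sketch below assumes three parallel lists like the ones built above and uses a few sample values; the function name `group_neighbourhoods` is hypothetical.

```python
def group_neighbourhoods(postcodes, boroughs, neighbourhoods):
    """Merge rows that share a postcode, joining neighbourhood names with ', '."""
    grouped = {}  # postcode -> [borough of first occurrence, list of neighbourhood names]
    for postcode, borough, neighbourhood in zip(postcodes, boroughs, neighbourhoods):
        if postcode not in grouped:
            grouped[postcode] = [borough, []]
        grouped[postcode][1].append(neighbourhood)
    # Dicts preserve insertion order, so postcodes come out in first-seen order
    return [[postcode, borough, ', '.join(names)]
            for postcode, (borough, names) in grouped.items()]

sample_rows = group_neighbourhoods(
    ['M5A', 'M3A', 'M5A'],
    ['Downtown Toronto', 'North York', 'Downtown Toronto'],
    ['Regent Park', 'Parkwoods', 'Harbourfront'],
)
print(sample_rows)
```

Each returned row could then be passed to `csv_writer.writerow` in place of the inner loop.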

##### Closing the csv file

In [81]:
csv_file.close()

##### Creating the pandas dataframe

In [82]:
df = pd.read_csv('toronto_postal_codes.csv')

##### Shape of the dataframe

In [83]:
df.shape

(180, 3)

##### Examine the resulting dataframe

In [84]:
df.head()

| Unnamed: 0 | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A\n | Not assigned\n | Not assigned\n |
| 1 | M2A\n | Not assigned\n | Not assigned\n |
| 2 | M3A\n | North York\n | Parkwoods |
| 3 | M4A\n | North York\n | Victoria Village |
| 4 | M5A\n | Downtown Toronto\n | Regent Park, Harbourfront |
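
The `\n` suffixes in the output above suggest that the scraped cell text still carries a trailing newline, which would also explain why the `'Not assigned'` comparison did not filter those rows out. As a sketch of one possible cleanup (using a small hand-built frame with the same columns, not the real data), the string columns can be stripped and the unassigned boroughs dropped afterwards:

```python
import pandas as pd

# Hand-built sample that mimics the rows shown above, including the stray '\n'
sample = pd.DataFrame({
    'Postcode': ['M1A\n', 'M3A\n', 'M5A\n'],
    'Borough': ['Not assigned\n', 'North York\n', 'Downtown Toronto\n'],
    'Neighbourhood': ['Not assigned\n', 'Parkwoods', 'Regent Park, Harbourfront'],
})

# Strip surrounding whitespace from every column (all columns here hold strings)
cleaned = sample.apply(lambda column: column.str.strip())

# Drop rows whose borough is unassigned, then renumber the index
cleaned = cleaned[cleaned['Borough'] != 'Not assigned'].reset_index(drop=True)
print(cleaned)
```

Calling `.strip()` on each cell's text during scraping would be another option, so that the comparison sees clean values before anything is written to the CSV file.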
