# Capstone assignment 2 - Segmenting and Clustering Neighborhoods in Toronto

## Introduction

This assignment explores, segments, and clusters the neighborhoods in the city of Toronto. The Toronto neighborhood data was scraped from a Wikipedia page, structured into a pandas DataFrame, and analyzed to explore the segmentation and clustering.

The Foursquare API was used to search for a specific type of venue, a particular venue, a Foursquare user, a geographical location, and trending venues around a location. The visualization library Folium was also used to visualize the results.

### Import necessary Libraries

In [1]:
import requests
import lxml.html as lh
import pandas as pd

In [2]:
# Assign url to the Wikipedia page
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')


In [3]:
#Check the length of the first 5 rows
[len(T) for T in tr_elements[:5]]

[3, 3, 3, 3, 3]
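
The row-length check above can be tried offline. The following is a minimal sketch that assumes a small hand-written HTML table (hypothetical, standing in for the downloaded Wikipedia page) and uses the same `lxml.html` calls as the notebook:

```python
import lxml.html as lh

# Hypothetical inline HTML standing in for page.content
html = """<html><body>
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
</body></html>"""

doc = lh.fromstring(html)
rows = doc.xpath('//tr')

# len() of a <tr> element is its number of child cells
row_lengths = [len(r) for r in rows]

# The first row holds the header cells
header = [cell.text_content() for cell in rows[0]]
```

Here `row_lengths` has one entry per table row, which is what makes it a quick sanity check that every row has the expected three cells.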

In [4]:
#Create empty list
col=[]
#For each header cell in the first row, store its name and an empty list
for t in tr_elements[0]:
    name=t.text_content()
    col.append((name,[]))

#print (tr_elements[:5])
len(tr_elements)

294

In [5]:
#Since our first row is the header, data is stored from the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]

    #If the row does not have 3 cells (this table has 3 columns), the //tr data is not from our table
    if len(T)!=3:
        break

    #i is the index of our column
    i=0

    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content()
        #Leave the first column (the postcode) as text
        if i>0:
            #Convert any numerical value to an integer; leave other text unchanged
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

In [6]:
[len(C) for (title,C) in col]

[288, 288, 288]
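
The cell-by-cell loop builds one list per column. As a compact illustration of the same idea, the sketch below transposes rows into columns with `zip`. It assumes the cell text has already been extracted into plain Python lists (the `rows` data is hypothetical), and it skips malformed rows rather than stopping at the first one, which differs from the loop above:

```python
# Hypothetical rows, already extracted as lists of cell text (header first)
rows = [
    ["Postcode", "Borough", "Neighbourhood"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
    ["M5A", "Downtown Toronto", "Harbourfront"],
]

header, *body = rows

# Keep only rows with the expected number of cells
body = [r for r in body if len(r) == len(header)]

# zip(*body) transposes the rows into one tuple per column
columns = {title: list(values) for title, values in zip(header, zip(*body))}

# Every column should end up with the same number of entries
column_lengths = [len(values) for values in columns.values()]
```

A dictionary of equal-length lists like `columns` can be passed directly to `pd.DataFrame`.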

In [7]:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

In [8]:
# Write dataframe to a file
df.to_csv('List_of_postal_codes_of_Canada 1.csv',header=1,index=False)
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned\n |
| 1 | M2A | Not assigned | Not assigned\n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Harbourfront\n |


In [9]:
df.shape

(288, 3)

In [11]:
df['Neighbourhood'] = df.Neighbourhood.str.replace('\n','') # Remove the "\n" character at the end of each Neighbourhood value
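
An alternative way to clean the column is `str.strip()`, which removes any leading or trailing whitespace and not only the newline. This is a minimal sketch on a hypothetical three-row frame (`df_demo` is not part of the notebook):

```python
import pandas as pd

# Hypothetical sample mimicking the scraped column, with trailing newlines
df_demo = pd.DataFrame(
    {"Neighbourhood": ["Parkwoods\n", "Victoria Village\n", " Harbourfront \n"]}
)

# str.strip() removes leading and trailing whitespace, including "\n"
df_demo["Neighbourhood"] = df_demo["Neighbourhood"].str.strip()
cleaned = df_demo["Neighbourhood"].tolist()
```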

In [12]:
df.drop(df.loc[df['Borough']=="Not assigned"].index, inplace=True) # Drop rows whose Borough value is "Not assigned"

In [13]:
df.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |


In [14]:
df.shape  # Much reduced number of rows after elimination of rows with "Not assigned" value for Borough

(211, 3)
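
The same filtering can be written with a boolean mask instead of an in-place `drop`. The sketch below assumes a hypothetical three-row frame (`df_demo`) and returns a new DataFrame, leaving the original untouched:

```python
import pandas as pd

# Hypothetical sample with one unassigned borough
df_demo = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose Borough is assigned; df_demo itself is not modified
assigned = df_demo[df_demo["Borough"] != "Not assigned"].reset_index(drop=True)
```

The `reset_index(drop=True)` call renumbers the rows from 0. It is optional; the notebook keeps the original index labels, which is why its output starts at row 2.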

In [15]:
# df.to_csv('List_of_postal_codes_of_Canada 2.csv',header=1,index=False)  #To output a csv file for the dataframe

In [16]:
# If a row has a borough but a "Not assigned" neighbourhood, the neighbourhood is set to the borough name.
# See row 8 in the output below.
df['Neighbourhood'] = df['Neighbourhood'].replace("Not assigned",df['Borough']) 
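
A more explicit way to express this fill is a boolean mask with `.loc`, which makes it clear that only the "Not assigned" rows are touched. This is a sketch on a hypothetical two-row frame (`df_demo`), not the notebook's own code:

```python
import pandas as pd

# Hypothetical sample: the second row has a borough but no neighbourhood
df_demo = pd.DataFrame({
    "Postcode": ["M6A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Lawrence Heights", "Not assigned"],
})

# Rows whose neighbourhood is missing
missing = df_demo["Neighbourhood"] == "Not assigned"

# Copy the borough name into those rows only
df_demo.loc[missing, "Neighbourhood"] = df_demo.loc[missing, "Borough"]
filled = df_demo["Neighbourhood"].tolist()
```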

In [17]:
df.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Queen's Park |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |


In [None]:
# More than one neighborhood can exist in one postal code area. 
# Such rows will be combined into one row with the neighborhoods separated with a comma.
df_group=df.groupby(['Postcode','Borough'],as_index=False)['Neighbourhood'].agg(','.join)
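
To see what the grouping does on a small scale, the sketch below applies the same `groupby` and `agg` pattern to a hypothetical three-row frame (`df_demo`). It joins with `", "` instead of `","` purely for readability; that separator is a variation, not what the notebook uses:

```python
import pandas as pd

# Hypothetical sample: postal code M1B covers two neighbourhoods
df_demo = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Group on postal code and borough, then join the neighbourhood names per group
grouped = (
    df_demo.groupby(["Postcode", "Borough"], as_index=False)["Neighbourhood"]
    .agg(", ".join)
)
names = grouped["Neighbourhood"].tolist()
```

With `as_index=False`, the grouping keys stay as ordinary columns, so the result has one row per postal code.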

# Below is the result for answer #1 of the Capstone Week 3 project

In [19]:
df_group.head(15)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park,Ionview,Kennedy Park |
| 7 | M1L | Scarborough | Clairlea,Golden Mile,Oakridge |
| 8 | M1M | Scarborough | Cliffcrest,Cliffside,Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff,Cliffside West |
