# Coursera Project On Toronto Neighborhood Segmentation

### Showcases web scraping and geospatial clustering

In [32]:
#importing libraries that will be used
import pandas as pd #data analysis
#ensure that we can print out an entire dataset if desired
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
import requests #handle web requests
from bs4 import BeautifulSoup #parse the resultant html

In [8]:
wikiurl = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' #link containing the data to be analyzed
table_class = "wikitable sortable jquery-tablesorter" #full class string of the expected table; the lookup below matches on "wikitable" instead
response = requests.get(wikiurl) #retrieve html data from target link
print(response.status_code) #HTTP status code, 200 is expected as that indicates successful retrieval

200


In [13]:
#convert the html data into a beautifulsoup object
soup = BeautifulSoup(response.text, 'html.parser')
#specify table object from html response
toronto_table=soup.find('table', {'class':"wikitable"})
toronto_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td

The output of the previous cell shows that the lookup found the table holding the Toronto neighborhood data we want.
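To make the structure of that table concrete, here is a minimal sketch that walks the rows by hand. It runs on a small inline HTML fragment shaped like the output above rather than on the live page, and the names `demo_html`, `demo_table`, `header`, and `records` are illustrative only. It also shows why searching for `wikitable` alone is enough: BeautifulSoup compares the requested class against each class on the tag separately.

```python
from bs4 import BeautifulSoup

# A small fragment shaped like the table printed above (header plus one data row)
demo_html = """
<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
</tbody></table>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# "wikitable" matches even though the tag's class attribute is "wikitable sortable"
demo_table = demo_soup.find("table", class_="wikitable")

# Collect the stripped text of every header and data cell, one list per row
rows = []
for tr in demo_table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, records = rows[0], rows[1:]
print(header)
print(records)
```

This is only meant for inspection; the notebook itself hands the table to Pandas in the next cell.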

In [16]:
#convert BeautifulSoup object to Pandas DataFrame
from io import StringIO #wrap the html string in a file-like object, since newer Pandas versions warn on literal html strings
df_toronto = pd.read_html(StringIO(str(toronto_table)))[0] #read_html returns a list of DataFrames; the first entry is our table
print(df_toronto.head()) #verify

  Postal Code           Borough              Neighbourhood
0         M1A      Not assigned               Not assigned
1         M2A      Not assigned               Not assigned
2         M3A        North York                  Parkwoods
3         M4A        North York           Victoria Village
4         M5A  Downtown Toronto  Regent Park, Harbourfront


The output of the cell above shows that the table contains the features we are looking for (Postal Code, Borough, Neighbourhood); however, it also contains several "Not assigned" entries, so we will need to clean the data.

Since our goal is to segment Toronto neighborhoods, entries with no neighborhood are not useful to us; however, we can substitute the borough in cases where a neighborhood is not assigned. If no borough is available either, it will be necessary to drop the entry.
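The rule just described can be sketched on a tiny hand-made frame. The rows below are hypothetical (the postal code `M9Z` in particular is invented for illustration) and are chosen to cover the three cases: no borough, borough and neighborhood both present, and borough present with no neighborhood.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the scraped table
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Drop rows that have no borough at all
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# Where the neighbourhood is missing, fall back to the borough name
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

The `.copy()` keeps the later assignment from touching a view of `sample`, which avoids Pandas' chained-assignment warning.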

In [18]:
#Begin cleaning data
NoBoroughIndexes = df_toronto[df_toronto['Borough'] == 'Not assigned'].index #establish indexes where no borough is assigned
df_toronto_bor = df_toronto.drop(NoBoroughIndexes) #drop the indexes obtained
df_toronto_bor.head() #verify successful drop

  Postal Code           Borough                                Neighbourhood
2         M3A        North York                                    Parkwoods
3         M4A        North York                             Victoria Village
4         M5A  Downtown Toronto                    Regent Park, Harbourfront
5         M6A        North York             Lawrence Manor, Lawrence Heights
6         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government


In the output above, we can see that we have successfully removed the rows with no borough assigned. However, we should still check whether any rows have a borough but no neighborhood.

In [26]:
print(df_toronto_bor[df_toronto_bor['Neighbourhood'] == 'Not assigned'])

Empty DataFrame
Columns: [Postal Code, Borough, Neighbourhood]
Index: []


We can see above that filtering the neighborhood column for 'Not assigned' entries returns nothing. However, we can double-check that no rows mark an unassigned neighborhood with a different placeholder.
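One way to run that double check without reading every value by eye is to normalize the text and compare it against a set of likely placeholders. The sketch below uses hypothetical sample values, and the `placeholders` set is an assumed list, not something derived from the Wikipedia table.

```python
import pandas as pd

# Hypothetical sample values, chosen to include placeholders an exact match would miss
neigh = pd.Series(["Parkwoods", "not assigned", "  ", None, "N/A", "Malvern, Rouge"])

# Placeholder spellings to treat as missing (an assumed list; extend as needed)
placeholders = {"", "not assigned", "n/a", "none", "unknown"}

# Ignore surrounding whitespace and letter case before comparing
normalized = neigh.str.strip().str.lower()

# A value counts as missing if it is null or matches a known placeholder
is_missing = neigh.isna() | normalized.isin(placeholders)

print(neigh[is_missing])
```

Applied to the notebook's data, the same mask could be built from `df_toronto_bor['Neighbourhood']`; an empty result would support the conclusion drawn from the value counts.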

In [34]:
df_neighborhoods = pd.DataFrame(df_toronto_bor['Neighbourhood'].value_counts())
df_neighborhoods

Count  Neighbourhood
    4  Downsview
    2  Don Mills
    1  Leaside
    1  Forest Hill North & West, Forest Hill Road Park
    1  Cedarbrae
    1  Kennedy Park, Ionview, East Birchmount Park
    1  CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
    1  Davisville North
    1  Bayview Village
    1  Moore Park, Summerhill East


Looking at how many times each neighborhood appears, we see no entries without an assigned neighborhood.
We can also see that many postal areas contain multiple neighborhoods, separated by commas in the table above.
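If a later step needs one row per neighborhood rather than one row per postal code, the comma-separated values can be split apart. This is a minimal sketch of one option, not a step the notebook performs; it uses two rows copied from the cleaned table above, and the names `postal` and `long_form` are illustrative.

```python
import pandas as pd

# Two rows copied from the cleaned table shown earlier
postal = pd.DataFrame({
    "Postal Code": ["M3A", "M5A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Regent Park, Harbourfront"],
})

# Split each cell on commas into a list, then give every list element its own row
long_form = (
    postal.assign(Neighbourhood=postal["Neighbourhood"].str.split(","))
    .explode("Neighbourhood")
    .reset_index(drop=True)
)

# Remove the space left behind after each comma
long_form["Neighbourhood"] = long_form["Neighbourhood"].str.strip()

print(long_form)
```

Whether to keep the combined form or the exploded form depends on the clustering step: grouping by postal code suits the combined form, while per-neighborhood lookups suit the exploded one.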

In [38]:
rows, cols = df_toronto_bor.shape #shape is a (rows, columns) tuple
print(f"There are {rows} rows and {cols} columns in our cleaned data set")

There are 103 rows and 3 columns in our cleaned data set
