<h1>Applied Data Science Capstone Project</h1>

<h2>Assignment #2:  Segmenting and Clustering Neighborhoods in Toronto</h2>

<h3>1. Import required libraries</h3>

In [1]:
from bs4 import BeautifulSoup
import requests, csv, os, sys
import pandas as pd

<h4>Check the working directory, where the CSV file will be saved</h4>

In [2]:
print(os.getcwd())

/home/dsxuser/work


<h3>2. Collect and prepare the data</h3>

<h4>Retrieve the data:</h4>

In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(source.content, 'lxml')

<h4>Prepare the csv file: </h4>

In [4]:
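# newline='' is what the csv module recommends when opening files for writing;
# an explicit encoding keeps non-ASCII place names intact
csv_file = open('neighborhoods.csv', 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file)
# Header row of the output file
csv_writer.writerow(['PostalCode', 'Borough', 'Neighborhood'])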

33
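<p>The <code>33</code> printed above is the return value of <code>writerow</code>, which passes through whatever the underlying file's <code>write</code> method returns (here, the number of characters written, including the <code>\r\n</code> line terminator). A minimal sketch of this, using an in-memory buffer instead of the real file (the names <code>buffer</code> and <code>chars_written</code> are illustrative only):</p>

```python
import csv
import io

# In-memory stand-in for the real file, so the sketch leaves nothing on disk
buffer = io.StringIO()
writer = csv.writer(buffer)

# writerow returns what buffer.write returned: the number of characters written
chars_written = writer.writerow(['PostalCode', 'Borough', 'Neighborhood'])
header_line = buffer.getvalue()

print(chars_written)
print(repr(header_line))
```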

<h4>There are several tables on the page, so we need to find the exact one:</h4>

In [5]:
table = soup.find('table', class_='wikitable sortable')

<h4>Find all rows and columns of the table and write the data to the CSV file:</h4>
<p>A row made up only of <em>[th]</em> header cells contains no <em>[td]</em> cells, so indexing into its empty result list raises an <code>IndexError</code>; the loop therefore catches that exception.</p>

In [6]:
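# Skip the header row; every remaining row is expected to hold three <td> cells
for row in table.find_all('tr')[1:]:
    entries = row.find_all('td')
    try:
        postal_code = entries[0].get_text(strip=True)
        borough = entries[1].get_text(strip=True)
        neighborhood = entries[2].get_text(strip=True)
    except IndexError:
        # Fewer than three data cells: skip the row instead of
        # writing the previous row's values a second time
        continue
    csv_writer.writerow([postal_code, borough, neighborhood])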

csv_file.close()
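<p>The row-skipping logic can also be written without <code>try/except</code>, by checking the number of <em>[td]</em> cells up front. The sketch below is only an illustration: it runs against a small hand-written HTML snippet (not the live Wikipedia page) and uses the built-in <code>html.parser</code> backend instead of <code>lxml</code>; <code>sample_html</code> and <code>rows</code> are hypothetical names.</p>

```python
from bs4 import BeautifulSoup

# Hand-written sample with the same shape as the table on the page (an assumption)
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table')

rows = []
for tr in sample_table.find_all('tr'):
    cells = tr.find_all('td')
    # Header rows contain only <th> cells, so they yield no <td> cells here
    if len(cells) < 3:
        continue
    rows.append([cell.get_text(strip=True) for cell in cells[:3]])

print(rows)
```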

<h4>Create the dataframe</h4>
<p>Although <code>df</code> is the conventional name for a dataframe, I prefer a more meaningful name because it makes the code easier to read.</p>

In [7]:
neighborhoods = pd.read_csv('neighborhoods.csv')
neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


<h3>3. Work on the data</h3>

<h4>1. Exclude rows that don't have an assigned borough</h4>

In [8]:
neighborhoods = neighborhoods[neighborhoods.Borough != 'Not assigned']
neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
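<p>Filtering with a boolean mask keeps the original index labels, which is why the output above jumps from 8 to 10. A small sketch on made-up sample rows, assuming a fresh 0-based index is wanted afterwards (<code>sample</code> and <code>assigned</code> are illustrative names):</p>

```python
import pandas as pd

# Tiny stand-in for the scraped data
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep rows with an assigned borough, then renumber the index from 0
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)

print(assigned)
```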


<h4>2. If a postal code contains several neighborhoods, combine them into one comma-separated row</h4>

In [9]:
neighborhoods = neighborhoods.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
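<p>To see what the grouping does on a small scale, here is a sketch with three made-up rows. It uses <code>agg</code> where the cell above uses <code>apply</code>; for a plain string join both should give the same result. The names <code>sample</code> and <code>combined</code> are illustrative.</p>

```python
import pandas as pd

# Two neighborhoods share postal code M5A; M3A has only one
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per (PostalCode, Borough); neighborhoods are joined in their original order
combined = (
    sample.groupby(['PostalCode', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

print(combined)
```

<p>Note that <code>groupby</code> sorts the result by the group keys by default, which is why the real output above starts at M1B rather than M3A.</p>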


<h4>3. Change unassigned neighborhood to the same value as borough</h4>

<p><b><em><font color="purple">Before:</font></em></b></p>

In [10]:
neighborhoods.loc[neighborhoods['PostalCode']=='M7A']

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Not assigned


<p><b><em><font color="purple">Changing the value:</font></em></b></p>

In [11]:
neighborhoods.loc[neighborhoods['Neighborhood'] == 'Not assigned', 'Neighborhood'] = neighborhoods['Borough']

<p><b><em><font color="purple">After:</font></em></b></p>

In [12]:
neighborhoods.loc[neighborhoods['PostalCode']=='M7A']

Unnamed: 0,PostalCode,Borough,Neighborhood
85,M7A,Queen's Park,Queen's Park
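<p>The same replacement can be expressed with <code>Series.mask</code>, which swaps in values from another series wherever a condition is true. This is an alternative sketch on made-up rows, not the method used above; <code>sample</code> and <code>unassigned</code> are illustrative names.</p>

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# True for the rows whose neighborhood is missing
unassigned = sample['Neighborhood'] == 'Not assigned'

# Where the condition holds, take the borough name; otherwise keep the neighborhood
sample['Neighborhood'] = sample['Neighborhood'].mask(unassigned, sample['Borough'])

print(sample)
```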


<h4>4. Print the shape (number of rows and columns) of the dataframe</h4>

In [13]:
neighborhoods.shape

(103, 3)