# IBM Applied Data Science Capstone Project

## Segmenting and Clustering Neighborhoods in Toronto
### <font color='lightblue'> Peer Graded Assignment (Week 3) </font>

Explore, segment, and cluster the neighborhoods in the city of Toronto based on postal code and borough information.

<div class="alert alert-block alert-info">
    A blue divider marks the start of each part
    <br> 
    <br>Part 1 - Web scraping: Get Toronto postal code data from Wikipedia
    <br>
    <br>
</div>

Toronto postal codes start with "M", so the data comes from this page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [1]:
# Import necessary packages
import pandas as pd
import requests
from bs4 import BeautifulSoup

#### 1.1 Download the html data from the Wikipedia page

In [2]:
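url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the page could not be downloaded
html_data = response.text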

#### 1.2 Create soup object with the extracted html data

In [3]:
soup = BeautifulSoup(html_data, 'html.parser')

#### 1.3 Get index of the postal code table

In [4]:
# Find the index of the postal code table:
# scan every table and stop at the first one whose markup contains the "M1A" postal code
tables = soup.find_all('table')
table_index = None  # stays None if no table matches
for i, t in enumerate(tables):
    if "M1A" in str(t):
        table_index = i
        break

# Index of the postal codes table
print('Index of postal codes table: ', table_index)

Index of postal codes table:  0
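
The lookup above can also be wrapped in a small helper that fails loudly when no table contains the marker. This is only a sketch: `find_table_index` and the inline HTML are hypothetical, made up for illustration, and the real Wikipedia markup may differ.

```python
from bs4 import BeautifulSoup


def find_table_index(soup, marker):
    """Return the index of the first <table> whose markup contains `marker`."""
    for i, table in enumerate(soup.find_all('table')):
        if marker in str(table):
            return i
    raise ValueError(f'No table contains {marker!r}')


# Tiny made-up page: only the second table holds the postal code we search for
sample_html = """
<table><tr><td>Navigation</td></tr></table>
<table><tr><td><p><b>M1A</b><br><span>Not assigned</span></p></td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_index = find_table_index(sample_soup, 'M1A')
print('Index of postal codes table: ', sample_index)
```

Raising an error instead of leaving the index undefined makes a changed page layout show up at this step rather than in a later cell.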


#### 1.4 Extract postal codes from html table

In [5]:
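table_contents = []
# The first three characters of each cell's <p> text are the postal code;
# the <span> text has the form "Borough(Neighborhood / Neighborhood)"
for td in tables[table_index].find_all('td'):
    if td.span.text == 'Not assigned':
        continue  # skip postal codes that have no borough
    cell = {}
    cell['PostalCode'] = td.p.text[:3]
    cell['Borough'] = td.span.text.split('(')[0]
    # Keep the text inside the parentheses and turn " /" separators into commas
    neighborhoods = td.span.text.split('(')[1]
    cell['Neighborhood'] = neighborhoods.strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    table_contents.append(cell)
df_postal_codes = pd.DataFrame(table_contents)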
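
The string handling in the cell above can be tried on its own, without scraping. The sketch below assumes cell text of the form "Borough(Neighborhood / Neighborhood)". `parse_cell` is a hypothetical helper, not part of the assignment, and it uses `str.partition` so that text without parentheses gives an empty neighborhood instead of an `IndexError`.

```python
def parse_cell(postal_text, span_text):
    """Split scraped cell text into postal code, borough and neighborhoods."""
    borough, _, rest = span_text.partition('(')
    # Drop the closing parenthesis and turn " /" separators into commas
    neighborhoods = rest.strip(')').replace(' /', ',').strip()
    return {
        'PostalCode': postal_text[:3],
        'Borough': borough.strip(),
        'Neighborhood': neighborhoods,
    }


# Sample text in the shape the scraping loop expects
example = parse_cell('M5A', 'Downtown Toronto(Regent Park / Harbourfront)')
print(example)
```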

#### 1.5 Clean and verify data

In [6]:
df_postal_codes['Borough'] = df_postal_codes['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga',
})

In [7]:
# Validate data: list rows with an empty or "Not assigned" borough or neighborhood
df_postal_codes.query('Borough == "" | Neighborhood == "" | Borough == "Not assigned" | Neighborhood == "Not assigned"')

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
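
The same check can be written with a boolean mask, which is easy to extend to more columns or more placeholder values. This is a minimal sketch on made-up rows; `sample`, `bad_values`, and the postal codes `M2X` and `M3Y` are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows: the second has an empty neighborhood, the third is unassigned
sample = pd.DataFrame({
    'PostalCode': ['M1B', 'M2X', 'M3Y'],
    'Borough': ['Scarborough', 'North York', 'Not assigned'],
    'Neighborhood': ['Malvern, Rouge', '', 'Not assigned'],
})

# Flag rows where either column is empty or "Not assigned"
bad_values = ['', 'Not assigned']
invalid_mask = sample['Borough'].isin(bad_values) | sample['Neighborhood'].isin(bad_values)
invalid_rows = sample[invalid_mask]
print(invalid_rows)
```

An empty `invalid_rows` frame corresponds to the empty result shown for the notebook data.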


In [8]:
df_test_codes = pd.DataFrame({'PostalCode':['M5G', 'M2H', 'M4B', 'M1J', 'M4G', 'M4M', 'M1R', 'M9V', 'M9L', 'M5V', 'M1B', 'M5A']})
df_test_codes.merge(df_postal_codes, on='PostalCode', how='left')

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M5G | Downtown Toronto | Central Bay Street |
| 1 | M2H | North York | Hillcrest Village |
| 2 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 3 | M1J | Scarborough | Scarborough Village |
| 4 | M4G | East York | Leaside |
| 5 | M4M | East Toronto | Studio District |
| 6 | M1R | Scarborough | Wexford, Maryvale |
| 7 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... |
| 8 | M9L | North York | Humber Summit |
| 9 | M5V | Downtown Toronto | CN Tower, King and Spadina, Railway Lands, Har... |


#### <font color='red'> ..... Part 1 output </font>

In [9]:
df_postal_codes.shape

(103, 3)

#### <font color='red'> ***** End of Part 1 *****</font>

<div class="alert alert-block alert-info">
    <br>Part 2 - Latitude and longitude of the postal codes: Get coordinates from a CSV file and add them to our postal codes data frame
    <br>
    <br>
</div>

#### 2.1 Download the coordinates CSV file from the Coursera link

In [10]:
!wget -q -O 'Geospatial_Coordinates.csv' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv
print('Geospatial Coordinates downloaded!')

Geospatial Coordinates downloaded!


#### 2.2 Read coordinates to a data frame

In [11]:
geo_coordinates = pd.read_csv('Geospatial_Coordinates.csv')

#### 2.3 Rename the "Postal Code" column to "PostalCode" to match our original data frame, which has no space in the name

In [12]:
geo_coordinates.rename(columns={'Postal Code' : "PostalCode"}, inplace=True)
geo_coordinates.head()

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


#### 2.4 Join postal codes and coordinates data frames into a new data frame

In [13]:
# Join both data frames
df_postal_codes_with_ll = df_postal_codes.merge(geo_coordinates, on='PostalCode', how='left')
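
A left join keeps every postal code even when no coordinates match, so missing matches show up only as `NaN` values. One way to surface them is the `indicator` and `validate` options of `DataFrame.merge`. The sketch below uses small made-up frames; `codes`, `coords`, and the postal code `M9Z` are hypothetical.

```python
import pandas as pd

# Made-up frames: 'M9Z' has no coordinates on purpose
codes = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                      'Borough': ['Scarborough', 'Scarborough', 'Example']})
coords = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

# validate raises MergeError if a postal code repeats on either side;
# indicator adds a "_merge" column showing where each row was found
merged = codes.merge(coords, on='PostalCode', how='left',
                     validate='one_to_one', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print('Postal codes without coordinates:', unmatched)
```

Applied to the notebook's frames, an empty `unmatched` list would suggest that every scraped postal code received a latitude and longitude.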

#### <font color='red'> ..... Part 2 output </font>

In [14]:
# Test data: verify that the result matches the sample in the assignment question
df_test_codes = pd.DataFrame({'PostalCode':['M5G', 'M2H', 'M4B', 'M1J', 'M4G', 'M4M', 'M1R', 'M9V', 'M9L', 'M5V', 'M1B', 'M5A']})
df_test_codes.merge(df_postal_codes_with_ll, on='PostalCode', how='left')

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
| 1 | M2H | North York | Hillcrest Village | 43.803762 | -79.363452 |
| 2 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 3 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 4 | M4G | East York | Leaside | 43.70906 | -79.363452 |
| 5 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 6 | M1R | Scarborough | Wexford, Maryvale | 43.750071 | -79.295849 |
| 7 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... | 43.739416 | -79.588437 |
| 8 | M9L | North York | Humber Summit | 43.756303 | -79.565963 |
| 9 | M5V | Downtown Toronto | CN Tower, King and Spadina, Railway Lands, Har... | 43.628947 | -79.39442 |


#### <font color='red'> ***** End of Part 2 *****</font>