# Assignment: Segmenting and Clustering Neighborhoods in Toronto

# Problem Statement

## Brief
>1. Acquire the data.
2. Explore the data.
3. Segment and cluster the neighborhoods in the city of Toronto based on postal code and borough information.

### 1. Part I - Data Acquisition

>The Toronto neighborhood data is available on a Wikipedia page that contains all the information needed to explore and cluster the neighborhoods in Toronto. We scrape the page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format.

Create a dataframe consisting of three columns, PostalCode, Borough, and Neighborhood, from `'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'`:
- Only process the cells that have an assigned borough.
- Ignore cells whose borough is `Not assigned`.
- More than one neighborhood can exist in one postal code area. Combine such rows into a single row whose `'Neighborhood'` value lists the neighborhoods separated by commas.
- If a cell has a borough but a `Not assigned` neighborhood, the neighborhood is the same as the borough.
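
The rules above can be sketched in pandas on a small hand-made frame. This is only an illustration under stated assumptions: the rows in `raw` are invented for the example (they are not the scraped data), and the names `raw`, `assigned`, and `combined` are hypothetical.

```python
import pandas as pd

# Toy rows invented for illustration; they are not the scraped Wikipedia data.
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
    }
)

# Keep only rows whose borough is assigned.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# An unassigned neighborhood takes the name of its borough.
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# One row per postal code, with its neighborhoods joined by ", ".
combined = (
    assigned.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Grouping on both `PostalCode` and `Borough` keeps the borough as a column in the result; it assumes a postal code never spans two boroughs.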



### 1.1 Method

#### Import the required Libraries

In [1]:
!pip install folium
import folium
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import requests # this module helps us to download a web page
from bs4 import BeautifulSoup # this module helps with web scraping

print('Libraries imported.')

Libraries imported.


#### 1.1.1 Get Data from Wiki.
Define the URL and download the page with `requests`.

In [2]:
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" #Wiki url
req = requests.get(URL)
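
The cell above does not check whether the download succeeded. A minimal sketch of a more defensive fetch, assuming a hypothetical helper named `fetch_html`; the `timeout` default and the optional `headers` argument are illustrative choices, not part of the original notebook:

```python
import requests


def fetch_html(url, timeout=10.0, headers=None):
    """Return the body of `url` as bytes, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout, headers=headers)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content


# Usage (performs a network request, so it is left commented out here):
# html = fetch_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
```

Failing early on a bad status code avoids passing an error page to the parser in the next step.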

#### 1.1.2 Pass to BeautifulSoup.
Beautiful Soup first converts the document to Unicode and turns HTML entities into Unicode characters.
It then parses the HTML document into a tree of Python objects that can be searched and navigated.

In [3]:
soup = BeautifulSoup(req.content, 'html5lib') 
print('Page Scraped.') 

Page Scraped.
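
To see what the parsed tree looks like, the sketch below runs Beautiful Soup on a hand-written HTML fragment that imitates two cells of the table. The fragment is an assumption made for illustration, not a copy of the real page, and it uses the built-in `html.parser` rather than the `html5lib` parser used in the notebook.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that imitates the layout of two table cells; not the real page.
html = """
<table>
  <tr>
    <td><p><b>M1A</b><br/><span><i>Not assigned</i></span></p></td>
    <td><p><b>M1B</b><br/><span>Scarborough(Malvern / Rouge)</span></p></td>
  </tr>
</table>
"""

# "html.parser" ships with Python; the notebook uses the third-party "html5lib" parser.
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")    # first <table> element in the tree
cells = table.find_all("td")  # every <td> nested under that table

for cell in cells:
    code = cell.p.text[:3]    # text of the first <p>, truncated to the postal code
    label = cell.span.text    # text of the first <span>, with nested tags flattened
    print(code, "->", label)
```

Attribute access such as `cell.p` and `cell.span` returns the first matching descendant tag, which is why the notebook can read the postal code and the borough text directly from each cell.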


#### 1.1.3 Create a dataframe from the scraped data
>1. Create a list to store the scraped data
2. Find the 1st table in the scraped data represented by the tag \<table\>
3. Find all \<td\> tags in the table
4. Create a dictionary to store the data of each cell
5. Ignore cells whose value is Not assigned.
6. Take the postal code as the first 3 characters of the \<p\> tag text
7. Split Borough from the \<span\> tag
8. Split Neighborhood from the \<span\> tag
9. Append to list
10. Change list to dataframe
11. Standardize Borough
12. Reset the index
13. Group by postal code so that each code has a single row (multiple neighborhoods can exist in one postal code area)
14. Check the health of the dataframe
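
Steps 6 to 8 amount to plain string handling on the text of each cell. A minimal sketch, assuming the cell text has the form `Borough(Neighborhood / Neighborhood)`; the helper name `parse_cell` is hypothetical, and unlike the notebook code it falls back to the borough when no parenthesized part exists:

```python
def parse_cell(span_text):
    """Split 'Borough(Name / Name)' into (borough, 'Name, Name')."""
    # partition() never raises: with no '(' the remainder is an empty string.
    borough, _, rest = span_text.partition("(")
    borough = borough.strip()
    # Drop the closing parenthesis and turn ' /' separators into commas.
    neighborhood = rest.rstrip(")").replace(" /", ",").strip()
    if not neighborhood:
        # Assumption for this sketch: reuse the borough when nothing is listed.
        neighborhood = borough
    return borough, neighborhood


print(parse_cell("Scarborough(Malvern / Rouge)"))
print(parse_cell("Mississauga"))
```

Keeping the parsing in a small function makes it easy to try against individual cell strings before running it over the whole table.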


In [4]:
table_contents = []  # list that collects one dictionary per table cell
table = soup.find('table')  # first <table> in the scraped page
for row in table.find_all('td'):  # each <td> cell holds one postal code entry
    cell = {}  # dictionary for the fields of this cell
    if row.span.text == 'Not assigned':
        pass  # skip cells without an assigned borough
    else:
        cell['PostalCode'] = row.p.text[:3]  # first 3 characters of the <p> text
        cell['Borough'] = row.span.text.split('(')[0]  # text before the first '(' in the <span>
        # text after the first '(' in the <span>, with ' /' separators turned into commas
        cell['Neighborhood'] = row.span.text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
        table_contents.append(cell)  # append to list

df = pd.DataFrame(table_contents)  # change list to dataframe

# Standardize borough names that were fused with address text during scraping
df['Borough'] = df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
                                       'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
                                       'EtobicokeNorthwest': 'Etobicoke Northwest', 'East YorkEast Toronto': 'East York/East Toronto',
                                       'MississaugaCanada Post Gateway Processing Centre': 'Mississauga'})
df.reset_index(inplace=True)  # keep the original row position as an 'index' column
# first() keeps the first row for each postal code; neighborhoods that share a code
# were already comma-separated while parsing the <span> text above
df = df.groupby(['PostalCode']).first()
df.head()

| PostalCode | index | Borough | Neighborhood |
|---|---|---|---|
| M1B | 6 | Scarborough | Malvern, Rouge |
| M1C | 12 | Scarborough | Rouge Hill, Port Union, Highland Creek |
| M1E | 18 | Scarborough | Guildwood, Morningside, West Hill |
| M1G | 22 | Scarborough | Woburn |
| M1H | 26 | Scarborough | Cedarbrae |


In [5]:
print('Number of unique Postal Codes is {}'.format(df.index.nunique()))  # PostalCode is the index after the groupby

Number of unique Postal Codes is 103


In [6]:
print("Number of boroughs marked 'Not assigned' is {}".format((df['Borough'] == 'Not assigned').sum()))

Number of boroughs marked 'Not assigned' is 0


In [7]:
df.shape

(103, 3)
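
The three checks above can also be gathered into one helper. A minimal sketch, assuming a hypothetical function named `health_report` and a two-row frame built from values shown in the `df.head()` output, with `PostalCode` as the index:

```python
import pandas as pd

# Two rows taken from the df.head() output, with PostalCode as the index.
check = pd.DataFrame(
    {
        "Borough": ["Scarborough", "Scarborough"],
        "Neighborhood": ["Malvern, Rouge", "Woburn"],
    },
    index=pd.Index(["M1B", "M1G"], name="PostalCode"),
)


def health_report(frame):
    """Summarize row count, index uniqueness, and leftover placeholder values."""
    return {
        "rows": len(frame),
        "unique_postal_codes": frame.index.nunique(),
        "unassigned_boroughs": int((frame["Borough"] == "Not assigned").sum()),
        "missing_values": int(frame.isna().sum().sum()),
    }


report = health_report(check)
print(report)
```

If `rows` and `unique_postal_codes` differ, some postal code still occupies more than one row and the grouping step needs another look.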

### 2. Part II - Data Exploration

### 3. Part III - Data Processing