## IBM Data Science Professional Certificate
#### Module: Capstone, week 3, Project Title:
### Clustering Toronto Neighbourhoods
##### author: Rafal Radecki
***

<p style="color:green;"><i>
In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto based on the postal code and borough information. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.
For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.
Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.
</i></p>

### 1. Find information about Toronto on the web
After some browsing, and following the suggestion above, three links have been identified:<br>
>    a. https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto#Table  <br>
>    b. https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M  <br>
>    c. https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=926287641 <br>

The table at link **a.** seems more structured but does not contain the postcode information, which would make it harder to place the neighbourhoods on a map. The table at link **b.** is formatted in a way that is very difficult to scrape. Thus link **c.**, an earlier revision of the same page as **b.**, will be used.

<i>Also found this example notebook: https://github.com/crismag/Coursera_Capstone/blob/master/SegmentingAndClusteringNeighborhoods-Toronto.PART_1.ipynb </i>


### 2. Importing required packages for the analysis

In [1]:
!pip install bs4
!pip install html5lib  # parser passed to BeautifulSoup in section 3
!pip install requests
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us download a web page




### 3. Download Data and store in an object

In [2]:
# Use get to download the contents of the webpage in text format and store in a variable called data:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=926287641" # defines the source as described in section 1.
data  = requests.get(url).text
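
The download above assumes the request succeeds. A minimal sketch of a more defensive variant follows; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return response.text


# Usage (requires network access):
# data = fetch_html(url)
```

With this variant a failed download stops the notebook at the request step rather than later, when the table cannot be found.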

In [3]:
# Create a BeautifulSoup object using the BeautifulSoup constructor
soup = BeautifulSoup(data,"html5lib")

#### Check what we've got:

In [4]:
# Page title:
soup.title.string

'List of postal codes of Canada: M - Wikipedia'

### 4. Scrape data from HTML table

In [5]:
import pandas as pd

In [6]:
# Find the right table
table = soup.find("table",class_="wikitable")
table

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a class="mw-redirect" href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></t

The table's headings are within 'th' tags:

     <th>Postcode</th>
     <th>Borough</th>
     <th>Neighbourhood
     </th>

The information we want is in rows marked by 'tr', and each cell within a row is marked by 'td':

     <tr>
     <td>M3A</td>
     <td><a href="/wiki/North_York" title="North York">North York</a></td>
     <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
     </td></tr>
     <tr>
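
The tag structure described above can be tried out on a small inline snippet. This is only a sketch: the HTML below is a shortened copy of the table, the names `soup_demo` and `table_demo` are made up for the illustration, and Python's built-in `html.parser` is used here instead of `html5lib` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the Wikipedia table, for illustration only.
html = """
<table class="wikitable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, "html.parser")
table_demo = soup_demo.find("table", class_="wikitable")

# Header cells live in 'th' tags.
headers = [th.get_text(strip=True) for th in table_demo.find_all("th")]

# Data cells live in 'td' tags; the header row has none and is skipped.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table_demo.find_all("tr")
    if tr.find_all("td")
]

print(headers)
print(rows)
```

Note that `get_text(strip=True)` drops the trailing newline inside each cell, whereas the `.text` attribute keeps it.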

In [7]:
# Collect one dict per table row, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0).
records = []
for row in table.tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # skip the header row, which has only 'th' cells
        records.append({
            "PostalCode": col[0].text,
            "Borough": col[1].text,
            "Neighbourhood": col[2].text,
        })

df = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighbourhood"])

df.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned\n |
| 1 | M2A | Not assigned | Not assigned\n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Harbourfront\n |


### 5. Data Wrangling
#### Clean the data and add geolocation based on the postcode.

<p style="color:green;"><i>The table above contains unwanted characters and rows with missing data. From Wikipedia we know there are 140 neighbourhoods, and more than one can share the same postcode. Because the geolocation will be based on the postcode, neighbourhoods that share a postcode will have to be merged into one row.</i></p>

In [8]:
print("Table size: ",df.shape)
print("Number of unique postcodes: ", df.PostalCode.nunique())
print("Number of unique Boroughs: ",df.Borough.nunique())
print("Number of unique Neighbourhoods: ",df.Neighbourhood.nunique())

Table size:  (288, 3)
Number of unique postcodes:  180
Number of unique Boroughs:  12
Number of unique Neighbourhoods:  209


In [9]:
print("Number of 'Not assigned' values in Borough column: ", df['Borough'].eq('Not assigned').sum())
print("Number of 'Not assigned' values in Neighbourhood column: ", df['Neighbourhood'].eq('Not assigned\n').sum())

Number of 'Not assigned' values in Borough column:  77
Number of 'Not assigned' values in Neighbourhood column:  78
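
The two counts above differ by one, which suggests that at least one row has a borough but no neighbourhood. A sketch of how such rows could be listed is shown below on made-up data; the postcode `M9Z` and the borough name `Example Borough` are hypothetical, and `str.strip()` is used so the trailing `\n` does not need to be spelled out in the comparison.

```python
import pandas as pd

# Toy data mimicking the scraped table; the last row is invented.
demo = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Example Borough"],
    "Neighbourhood": ["Not assigned\n", "Parkwoods\n", "Not assigned\n"],
})

borough_missing = demo["Borough"].str.strip().eq("Not assigned")
neighbourhood_missing = demo["Neighbourhood"].str.strip().eq("Not assigned")

# Rows with an assigned borough but an unassigned neighbourhood.
partial = demo[~borough_missing & neighbourhood_missing]
print(partial)
```

Running the same masks on `df` would show which rows account for the difference before they are dropped.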


Remove rows where the Neighbourhood is not assigned:

In [16]:
# identify partial string to look for
discard = ["assigned"]

# drop rows that contain the partial string "assigned" in the Neighbourhood column
df = df[~df.Neighbourhood.str.contains('|'.join(discard))]
# Reset the index:
df.reset_index(drop=True, inplace=True)
df.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
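
The substring filter above would also drop any neighbourhood whose name happens to contain "assigned". A stricter alternative, sketched here on toy data, is to compare the stripped value with the exact placeholder text; this is an assumption about what is wanted, not the approach used in the notebook.

```python
import pandas as pd

# Toy data with the same columns as the scraped table.
demo = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned\n", "Parkwoods\n", "Victoria Village\n"],
})

# Keep only rows whose neighbourhood is not exactly the placeholder.
keep = demo["Neighbourhood"].str.strip().ne("Not assigned")
cleaned = demo[keep].reset_index(drop=True)
print(cleaned)
```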


In [17]:
#Check the new data frame:
print("Table size: ",df.shape)
print("Number of unique postcodes: ", df.PostalCode.nunique())
print("Number of unique Boroughs: ",df.Borough.nunique())
print("Number of unique Neighbourhoods: ",df.Neighbourhood.nunique())
print("Number of 'Not assigned' values in Borough column: ", df['Borough'].eq('Not assigned').sum())
print("Number of 'Not assigned' values in Neighbourhood column: ", df['Neighbourhood'].eq('Not assigned\n').sum())

Table size:  (210, 3)
Number of unique postcodes:  102
Number of unique Boroughs:  10
Number of unique Neighbourhoods:  208
Number of 'Not assigned' values in Borough column:  0
Number of 'Not assigned' values in Neighbourhood column:  0


In [18]:
# Remove the '\n' string from the Neighbourhood column:
df.replace(to_replace=r'\n', value='', regex=True, inplace=True)
df.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
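
The regex replacement above acts on every column of the data frame. If only the Neighbourhood column needs cleaning, a column-level `str.strip()` is a possible alternative; the sketch below uses toy data and also removes leading and trailing spaces, which may or may not be desired.

```python
import pandas as pd

# Toy column with a trailing newline and a leading space.
demo = pd.DataFrame({"Neighbourhood": ["Parkwoods\n", " Victoria Village\n"]})

# Strip whitespace (including '\n') from both ends of each value.
demo["Neighbourhood"] = demo["Neighbourhood"].str.strip()
print(demo)
```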


<p style="color:green;"><i>Because there are fewer postcodes than neighbourhoods, we combine the neighbourhoods that share a postcode into a single row.</i></p>

In [32]:
# Group by postcode:
df1 = df.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df1.head(12)

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
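
The grouping step can be illustrated on a few rows. The sketch below uses toy data taken from the rows shown earlier, checks that each postcode belongs to a single borough (the assumption behind grouping on both columns), and then joins the neighbourhood names; `agg` with `as_index=False` is used here as an equivalent of `apply(...).reset_index()`.

```python
import pandas as pd

# Toy data: two neighbourhoods share postcode M5A.
demo = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Sanity check: every postcode maps to exactly one borough.
boroughs_per_postcode = demo.groupby("PostalCode")["Borough"].nunique()
one_borough_each = bool((boroughs_per_postcode == 1).all())

# One row per (PostalCode, Borough), neighbourhoods joined with ", ".
grouped = (
    demo.groupby(["PostalCode", "Borough"], as_index=False)["Neighbourhood"]
        .agg(", ".join)
)
print(grouped)
```

If the sanity check failed on the real data, grouping on both columns would produce more than one row for some postcodes.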


In [29]:
# Save the results as the end of the Part 1 task
df1.to_csv(r'toronto-neighbourhoods.csv', index=False)
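
Because the joined neighbourhood names contain commas, it can be worth confirming that the CSV round-trips cleanly. A minimal sketch, assuming an in-memory buffer in place of the real file and toy data in place of `df1`:

```python
import io

import pandas as pd

# Toy data; the first neighbourhood value contains a comma.
demo = pd.DataFrame({
    "PostalCode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge, Malvern", "Woburn"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Read the CSV text back and compare with the original.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

pandas quotes fields that contain the delimiter when writing, so the comma-separated names are read back as a single column.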

In [31]:
df1.shape

(102, 3)