# Exploration and clustering of Totonto neighbourhoods  

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#ref1">Data Collection</a></li>
        <!-- <li><a href="#ref2">Preprocessing</a></li> -->
        <!-- <li><a href="#ref3">Collaborative Filtering</a></li> -->
    </ol>
</div>
<br>
<hr>

<a id="ref1"></a>
# Data Collection  

### Important necessary libraries to collect necessary dataset by doing web scraping and prepare the expected DataFrame

In [16]:
import numpy as np
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

### Perform a little web scraping to get the data  
I will use BeautifulSoup to scrape the given website,
and save the scraped row data into a CSV file

In [17]:
# To get source code of the website
main_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(main_url)
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text)
# print(soup.prettify())
# Noticed that expected information is located within 'table' tag
# Therefore make my soup only with the table
soup = soup.find('table')
#
# Open the CSV file
csv_file = open('Toronto_Postcode_Borough_Neighbourhood.csv', 'w')
# Create the CSV writer object to write contents into CSV file
csv_writer = csv.writer(csv_file)
for i, soup in enumerate(soup.find_all('tr')):
    # print(soup.prettify())
    soup_splitted = list(soup.stripped_strings)
     # ignore those cells with 'Not assigned' entries for Borough as required in the assignment
    if soup_splitted[1]!='Not assigned':
        # if 'Neighbourhood' is 'Not assigned', then assign the 'Borough' value to it
        if soup_splitted[2]=='Not assigned':
            csv_writer.writerow([soup_splitted[0], soup_splitted[1], soup_splitted[1]])
        else:
            csv_writer.writerow([soup_splitted[0], soup_splitted[1], soup_splitted[2]])
    else:
        pass
# Close the CSV file
csv_file.close()

### Create the Pandas dataframe from above scraped row data

In [18]:
# Define the dataframe from the above collected 'csv' file
df = pd.read_csv('Toronto_Postcode_Borough_Neighbourhood.csv')
# Look at first few entries
df.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Queen's Park,Queen's Park
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North


Check how many entries we have

In [19]:
df.shape

(210, 3)

In [20]:
'''
This function will emerge all Neighbourhoods for same Postcodes
leav unchanged if there is no repeated Postcodes for Neighbourhoods
'''
def merge_Neighbourhoods(df):
    # find repeated Postcode and number of occurances
    unique, counts = np.unique(df.iloc[:, 0].values, return_counts=True)
    print(f'There are {len(unique)} of unique Postcodes while {df.shape[0]} of Neighbourhoods.\n So there are many Neighbourhoods are having the same Postcodes.')
    # find the index of Postcodes which are occured more than once
    repeat_index = np.argwhere(counts>1)
    # Postcodes occured more than once
    pc_list = unique[repeat_index][:,0]
    Ent_list = []
    for pc in unique:
        df_pc = df[df['Postcode']==pc]
        if pc in pc_list:
            df_pc_join = ', '.join(df_pc.iloc[:, 2].to_list())
            entry = [df_pc.iloc[0, :3][0], df_pc.iloc[0, :3][1], df_pc_join]
            Ent_list.append(entry)
        else:
            entry = [df_pc.iloc[0, :3][0], df_pc.iloc[0, :3][1], df_pc.iloc[0, :3][1]]
            Ent_list.append(entry)
    return Ent_list

Apply above prepared function

In [21]:
new_data_list = merge_Neighbourhoods(df)

There are 103 of unique Postcodes while 210 of Neighbourhoods.
 So there are many Neighbourhoods are having the same Postcodes.


Check if the function worked as expected

In [22]:
np.shape(new_data_list)

(103, 3)

So, it seems the function has done his job  

###  Now prepare wanted DataFrame

In [23]:
new_df = pd.DataFrame(new_data_list, columns =['Postcode', 'Borough', 'Neighbourhood']) 

Check the data frame

In [24]:
new_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Scarborough
4,M1H,Scarborough,Scarborough


In [25]:
new_df.shape

(103, 3)