## Segmenting and Clustering Neighborhoods in Toronto Assignment

### Install BeautifulSoup and other libraries
<br />First we install the libraries this notebook depends on.

In [1]:
!pip install beautifulsoup4
#!pip install lxml
#!pip install html5lib
!pip install requests



### Import BeautifulSoup and others
<br />Next we import the libraries we are going to use in this notebook.

In [7]:
from bs4 import BeautifulSoup
import requests
#import urllib.request, urllib.error, urllib.parse
import pandas as pd

### Reading the data table from wiki
<br />Now we define the URL of the Wikipedia page that contains the table we want to analyze.
<br />After defining the URL, we fetch the page and parse its HTML into a BeautifulSoup object.

In [5]:
# Wikipedia page listing the Canadian postal codes that start with M

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')


Next we fetch all the tables from this web page and print the first 5 rows of each one, so we can see which table we want to use.

In [8]:
# Fetch every table on the page; the first row of each table is used as its header
df_wiki = pd.read_html(url, header=0)

# Print the first 5 rows of each table on the requested web page
for n, table in enumerate(df_wiki, start=1):
    print('_' * 50)
    print('This is table #' + str(n) + ' on the requested web page:')
    print('_' * 50 + '\n')
    print(table.head())
    print('\n\n')



__________________________________________________
This is table #1 on the requested web page:
__________________________________________________

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront



__________________________________________________
This is table #2 on the requested web page:
__________________________________________________

                                          Unnamed: 0  \
0  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...   
1                                                 NL   
2                                                  A   

                               Canadian postal codes  \
0  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...   
1                                                 NS   
2                           

We can see that the table we want to use is the first table on the requested web page.
<br />So we now assign the first table to our dataframe.

In [9]:
# Set the first table to our dataframe.
pre_df = df_wiki[0]
pre_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
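
Selecting the table by position works for the page as it was scraped here, but it would silently pick the wrong table if the page layout changed. Below is a minimal sketch of selecting the table by its column names instead. `find_table` is a hypothetical helper (not part of pandas), and the two small frames are stand-ins for the list returned by `pd.read_html`.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names.

    Hypothetical helper for illustration; not part of pandas.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table contains the columns {sorted(required)}")


# Stand-in data: in the notebook, `tables` would be df_wiki
tables = [
    pd.DataFrame({"Unnamed: 0": ["NL", "A"], "Canadian postal codes": ["NS", "B"]}),
    pd.DataFrame({
        "Postcode": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

postcodes = find_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(postcodes.shape)
```

Raising an error when no table matches makes a changed page fail loudly instead of producing a dataframe with unexpected columns.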


### Checking the shape of our table

In [10]:
pre_df.shape

(287, 3)

### Preparation of the dataframe
<br /> We will start by cleaning up the table and removing every row that does not have an assigned borough.

In [11]:
# Check how many rows do not have their borough specified
pre_df['Borough'].value_counts()

Not assigned        77
Etobicoke           44
North York          38
Downtown Toronto    37
Scarborough         37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

### We can see that we need to drop 77 rows from our dataframe.

In [12]:
# Keep only the rows that have a borough assigned to them
df = pre_df[pre_df['Borough'] != 'Not assigned'].copy()

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [13]:
df.shape

(210, 3)
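
The shapes above can be cross-checked against the value counts: 287 rows minus the 77 "Not assigned" rows leaves 210. Below is a minimal sketch of the same check on toy data. The frame `toy` is an illustrative stand-in that only assumes the same column names as the scraped table.

```python
import pandas as pd

# Toy stand-in for pre_df with the same column names as the scraped table
toy = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods",
                      "Victoria Village", "Harbourfront"],
})

unassigned = toy["Borough"] == "Not assigned"   # boolean mask, one entry per row
kept = toy[~unassigned].reset_index(drop=True)  # rows that have a borough

# Rows before, minus rows dropped, should equal rows kept
assert len(toy) - unassigned.sum() == len(kept)
print(len(toy), int(unassigned.sum()), len(kept))
```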

In [14]:
# Reset the index numbers
df.reset_index(drop=True, inplace = True)
df.head()


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


### Now we will change the Neighbourhood value to the Borough value if the Neighbourhood value is "Not assigned".

In [15]:
# Find the rows whose Neighbourhood is "Not assigned" and replace that value with the row's Borough value
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']
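
As an alternative formulation of the replacement step above, `Series.where` keeps a value where a condition holds and takes it from another series otherwise. Below is a minimal sketch on toy data. The two rows are illustrative and only assume the same column names as our dataframe.

```python
import pandas as pd

# Toy data with the same column names as our dataframe
toy = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Keep the neighbourhood where it is assigned; otherwise fall back to the borough
assigned = toy["Neighbourhood"] != "Not assigned"
toy["Neighbourhood"] = toy["Neighbourhood"].where(assigned, toy["Borough"])
print(toy)
```

Assigning the whole column at once avoids chained indexing such as `df['Neighbourhood'][i] = ...`, which pandas may apply to a temporary copy rather than to the dataframe itself.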


In [16]:
# Group all neighbourhoods that share a Postcode into one row and separate them with commas
grouped = df.groupby(['Postcode', 'Borough']).Neighbourhood.agg([('Neighbourhood', ', '.join)])

# Reset the index of the new dataframe
grouped.reset_index(drop=False, inplace = True)

grouped.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [17]:
grouped.shape

(103, 3)
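
To make the grouping step easier to follow, here is a minimal sketch of the same idea on toy data, followed by a check that every postcode ends up in exactly one row. The frame `toy` is illustrative, and the sketch passes `', '.join` directly to `agg` rather than using the list-of-tuples form from the cell above.

```python
import pandas as pd

# Toy data: one postcode with two neighbourhoods, one with a single neighbourhood
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# Join the neighbourhoods of each (Postcode, Borough) pair into one string
merged = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

# Each postcode should now appear exactly once
assert merged["Postcode"].is_unique
print(merged)
```

Running the same `is_unique` check on `grouped['Postcode']` would confirm that the 103 rows correspond to 103 distinct postcodes.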