# <span style="color:blue">Segmenting and Clustering Neighborhoods in Toronto Assignment</span>
<br /><br />

## <span style="color:blue">Part 1 - Preparation of our dataframe</span>

### <span style="color:blue">Install BeautifulSoup and other libraries</span>
<br />
<span style="color:blue">We will first download some libraries to make sure we have all the tools we need for the work in this notebook.</span>

In [1]:
!pip install beautifulsoup4
#!pip install lxml
#!pip install html5lib
!pip install requests



### <span style="color:blue">Import BeautifulSoup and others</span>
<br />
<span style="color:blue">Next we import the libraries we are going to use in this notebook.</span>

In [2]:
from bs4 import BeautifulSoup
import requests
#import urllib.request, urllib.error, urllib.parse
import pandas as pd

### <span style="color:blue">Reading the data table from wiki</span>
<br />
<span style="color:blue">Now we will define the URL of the Wikipedia page that contains the table we want to analyze.</span>
<br />
<span style="color:blue">After defining the URL, we will download the page and parse its content into an HTML object.</span>

In [3]:
# Open Canada information link

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

<span style="color:blue">Next we are going to fetch all the tables from this web page and print the first 5 rows of each table so we can see which one we want to use.</span>

In [4]:
# Fetch the table with the data
df_wiki = pd.read_html(url,header=0)

# Print out all tables on the requested web page (first 5 rows of each table)
for n, table in enumerate(df_wiki, start=1):
    print('_' * 50)
    print('This is table #' + str(n) + ' on the requested web page:')
    print('_' * 50 + '\n')
    print(table.head())
    print('\n\n')

__________________________________________________
This is table #1 on the requested web page:
__________________________________________________

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront



__________________________________________________
This is table #2 on the requested web page:
__________________________________________________

                                          Unnamed: 0  \
0  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...   
1                                                 NL   
2                                                  A   

                               Canadian postal codes  \
0  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...   
1                                                 NS   
2                           

<span style="color:blue">We can see that the table we want to use is the first table on the requested web page.</span>
<br />
<span style="color:blue">So now we must set the first table to our dataframe.</span>

In [5]:
# Set the first table to our dataframe.
pre_df = df_wiki[0]

pre_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
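
<span style="color:blue">Picking the table by position works here, but it depends on the page layout staying the same. As a minimal sketch, the table could instead be chosen by its column names. The helper `pick_table` and the toy dataframes below are hypothetical and only illustrate the idea; they are not part of the original notebook.</span>

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first dataframe that contains every required column (hypothetical helper)."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError('No table with columns: ' + ', '.join(required_columns))

# Toy stand-ins for the list returned by pd.read_html (assumed data, not the real page)
toy_tables = [
    pd.DataFrame({'Unnamed: 0': ['NL', 'A']}),
    pd.DataFrame({
        'Postcode': ['M3A'],
        'Borough': ['North York'],
        'Neighbourhood': ['Parkwoods'],
    }),
]

picked = pick_table(toy_tables, ['Postcode', 'Borough', 'Neighbourhood'])
print(picked)
```

<span style="color:blue">With the real data, the same call would presumably be `pick_table(df_wiki, ['Postcode', 'Borough', 'Neighbourhood'])`.</span>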


### <span style="color:blue">Checking the shape of our table</span>

In [6]:
pre_df.shape

(287, 3)

### <span style="color:blue">Preparation of the dataframe</span>
<br />
<span style="color:blue">We will start by cleaning up the table and removing any row that does not have an assigned borough.</span>

In [7]:
# Check how many rows do not have their borough specified
pre_df['Borough'].value_counts()

Not assigned        77
Etobicoke           44
North York          38
Downtown Toronto    37
Scarborough         37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Queen's Park         1
Mississauga          1
Name: Borough, dtype: int64

### <span style="color:blue">We can see that we need to drop 77 rows from our dataframe.</span>

In [8]:
# Delete all rows that do not have a borough assigned to them
# (a boolean mask keeps only the rows whose Borough is assigned)
df = pre_df[pre_df['Borough'] != 'Not assigned'].copy()

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [9]:
df.shape

(210, 3)
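
<span style="color:blue">The row counts are consistent: 287 rows minus the 77 "Not assigned" rows leaves 210. The same sanity check can be sketched on a small toy dataframe; the data below is made up for illustration only.</span>

```python
import pandas as pd

# Toy data (assumed for illustration, not the real table)
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Count the rows to drop, then filter them out with a boolean mask
n_unassigned = int((toy['Borough'] == 'Not assigned').sum())
kept = toy[toy['Borough'] != 'Not assigned']

# Rows before minus rows dropped should equal rows after
rows_match = (len(toy) - n_unassigned) == len(kept)
print(rows_match)
```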

In [38]:
# Reset the index numbers
df.reset_index(drop = True, inplace = True)

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


### <span style="color:blue">Now we will change the Neighbourhood value to the Borough value if the Neighbourhood value is "Not assigned".</span>

In [11]:
# Find the rows whose Neighbourhood is "Not assigned" and replace that value with the row's Borough value
# (.loc avoids chained assignment, which can silently fail to update the dataframe)
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']
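
<span style="color:blue">To make the effect of this step concrete, here is a minimal sketch on toy data. It uses `Series.where` as one possible way to fall back to the borough name; the toy rows are assumptions for illustration, not taken from the real table.</span>

```python
import pandas as pd

# Toy data (assumed for illustration)
toy = pd.DataFrame({
    'Postcode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Keep the neighbourhood where it is assigned; otherwise take the borough
assigned = toy['Neighbourhood'] != 'Not assigned'
toy['Neighbourhood'] = toy['Neighbourhood'].where(assigned, toy['Borough'])
print(toy)
```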

In [39]:
# Group all neighbourhoods that share a Postcode into one row and separate them by commas
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(', '.join)

# Reset the index of the new dataframe (turns Postcode and Borough back into columns)
df = df.reset_index()

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [41]:
df.shape

(103, 3)
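
<span style="color:blue">As a small illustration of what the grouping does, the sketch below collapses two toy rows that share a postcode into one and then checks that every postcode appears only once. The toy rows are assumptions for illustration.</span>

```python
import pandas as pd

# Toy data (assumed for illustration): M6A appears twice
toy = pd.DataFrame({
    'Postcode': ['M6A', 'M6A', 'M3A'],
    'Borough': ['North York', 'North York', 'North York'],
    'Neighbourhood': ['Lawrence Heights', 'Lawrence Manor', 'Parkwoods'],
})

# Join the neighbourhood names of each (Postcode, Borough) group with commas
grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
print(grouped)

# After grouping, each postcode should appear exactly once
codes_unique = grouped['Postcode'].is_unique
print(codes_unique)
```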

### <span style="color:blue">The last line of code concludes Part 1 of this project, and we can see that our dataframe has 103 rows and 3 columns.</span>
### <span style="color:blue">-----------------------------------------------------------------------------------------------------------------------------------------</span>
<br /><br /><br />

## <span style="color:blue">Part 2 - Adding the coordinates data into our dataframe</span>

### <span style="color:blue">Let's create a dataframe with the postal codes from the published CSV file.</span>

In [47]:
# Download the postal codes coordinates
df_postal = pd.read_csv('https://cocl.us/Geospatial_data')
df_postal.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [50]:
df_postal.shape
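
<span style="color:blue">Equal row counts alone do not guarantee that both dataframes hold the same postal codes. A minimal sketch of a stricter check, using toy dataframes (assumed data) with the same column names as `df` and `df_postal`:</span>

```python
import pandas as pd

# Toy stand-ins for df and df_postal (assumed data, same column names)
left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M1E']})
right = pd.DataFrame({
    'Postal Code': ['M1E', 'M1B', 'M1C'],
    'Latitude': [43.763573, 43.806686, 43.784535],
    'Longitude': [-79.188711, -79.194353, -79.160497],
})

# Same number of rows, no repeated codes, and the same set of codes on both sides
same_length = len(left) == len(right)
keys_unique = bool(left['Postcode'].is_unique and right['Postal Code'].is_unique)
same_keys = set(left['Postcode']) == set(right['Postal Code'])
print(same_length, keys_unique, same_keys)
```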

### <span style="color:blue">We can see that we have the same number of rows in both of our dataframes, so now let's join them together using the postal codes (as they are unique per row)</span>

In [55]:
# Create a merged dataframe that includes the PostalCode, Borough, Neighbourhood, Latitude and Longitude columns
df_final = df.set_index('Postcode').join(df_postal.set_index('Postal Code'))

# Reset the index of the new dataframe
df_final.reset_index(drop = False, inplace = True)

# Rename column Postcode to PostalCode
df_final = df_final.rename(columns={'Postcode': 'PostalCode'})

df_final.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [56]:
df_final.shape

(103, 5)
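
<span style="color:blue">An alternative to the index-based join is `pd.merge`, which can also verify the join as it runs. The sketch below is only an illustration on toy data: `validate='one_to_one'` raises an error if a postal code repeats on either side, and `indicator=True` adds a `_merge` column that shows whether each row found a match.</span>

```python
import pandas as pd

# Toy stand-ins for df and df_postal (assumed data, same column names)
left = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})
right = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

merged = pd.merge(
    left, right,
    left_on='Postcode', right_on='Postal Code',
    how='left',
    validate='one_to_one',  # raises MergeError if a code repeats on either side
    indicator=True,         # adds a _merge column: 'both', 'left_only' or 'right_only'
)

# Count rows that did not find coordinates
unmatched = int((merged['_merge'] != 'both').sum())
print(unmatched)

# Drop the helper columns and rename the key, as in the notebook
merged = merged.drop(columns=['Postal Code', '_merge']).rename(columns={'Postcode': 'PostalCode'})
print(merged)
```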

### <span style="color:blue">Now we have our final dataframe stored as "df_final"</span>