# <span style="color:blue">Segmenting and Clustering Neighborhoods in Toronto Assignment</span>
<br /><br />

## <span style="color:blue">Part 1 - Preparation of our dataframe</span>

### <span style="color:blue">Install BeautifulSoup and other libraries</span>
<br />
<span style="color:blue">We will first download some libraries to make sure we have all the tools we need for the work in this notebook.</span>

In [1]:
!pip install beautifulsoup4
#!pip install lxml
#!pip install html5lib
!pip install requests

### <span style="color:blue">Import BeautifulSoup and others</span>
<br />
<span style="color:blue">Next we import the libraries we are going to use in this notebook.</span>

In [2]:
from bs4 import BeautifulSoup
import requests
#import urllib.request, urllib.error, urllib.parse
import pandas as pd

### <span style="color:blue">Reading the data table from wiki</span>
<br />
<span style="color:blue">Now we will define the URL of the Wikipedia page that contains the table we want to analyze.</span>
<br />
<span style="color:blue">After defining our URL we will fetch the page and parse its HTML into a BeautifulSoup object.</span>

In [3]:
# Open Canada information link

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
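
The parsed `soup` object can be used to pull a table out by hand. A minimal sketch, assuming a small inline HTML snippet (made up here to stand in for the Wikipedia page) and the built-in `html.parser` backend:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's HTML; the real page is much larger.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Collect the text of every cell, row by row.
rows = []
for tr in sample_soup.find('table').find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    rows.append(cells)

# The first row holds the column names, the rest hold the records.
header, records = rows[0], rows[1:]
print(header)
print(records)
```

The `header` and `records` lists could then be passed to `pd.DataFrame(records, columns=header)` if a dataframe is needed.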

<span style="color:blue">Next we are going to fetch all the tables from this web page and print the first 5 rows of each one, so we can see which table we want to use.</span>

In [4]:
# Fetch every table on the page (the first row of each is used as the header)
df_wiki = pd.read_html(url, header=0)

# Print out all tables on the requested web page (first 5 rows of each table)
for n, table in enumerate(df_wiki, start=1):
    print('_' * 50)
    print('This is table #' + str(n) + ' on the requested web page:')
    print('_' * 50 + '\n')
    print(table.head())
    print('\n\n')

__________________________________________________
This is table #1 on the requested web page:
__________________________________________________

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront



__________________________________________________
This is table #2 on the requested web page:
__________________________________________________

                                          Unnamed: 0  \
0  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...   
1                                                 NL   
2                                                  A   

                               Canadian postal codes  \
0  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...   
1                                                 NS   
2                           

<span style="color:blue">We can see that the table we want to use is the first table on the requested web page.</span>
<br />
<span style="color:blue">So now we assign the first table to our dataframe.</span>

In [5]:
# Use the first table as our dataframe.
pre_df = df_wiki[0]

pre_df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
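
Picking the table by eye works here, but the position of a table on a live page can change. A small sketch of selecting it by its column names instead; `find_table` is a hypothetical helper, not part of pandas, and the two toy frames stand in for the list returned by `pd.read_html`:

```python
import pandas as pd

def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError('No table with columns {}'.format(required_columns))

# Toy stand-ins for the tables on the page (made-up, shortened data).
tables = [
    pd.DataFrame({'Unnamed: 0': ['NL'], 'Canadian postal codes': ['NS']}),
    pd.DataFrame({'Postcode': ['M3A'], 'Borough': ['North York'],
                  'Neighbourhood': ['Parkwoods']}),
]

picked = find_table(tables, ['Postcode', 'Borough', 'Neighbourhood'])
print(picked)
```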


### <span style="color:blue">Checking the shape of our table</span>

In [6]:
pre_df.shape

(287, 3)

### <span style="color:blue">Preparation of the dataframe</span>
<br /> 
<span style="color:blue">We will start by cleaning up the table and removing every row that does not have an assigned borough.</span>

In [7]:
# Check how many rows do not have their borough specified
pre_df['Borough'].value_counts()

Not assigned        77
Etobicoke           44
North York          38
Downtown Toronto    37
Scarborough         37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Queen's Park         1
Mississauga          1
Name: Borough, dtype: int64

### <span style="color:blue">We can see that we need to drop 77 rows from our dataframe.</span>

In [8]:
# Keep only the rows that have a borough assigned to them
# (.copy() gives us an independent dataframe that is safe to modify later)
df = pre_df[pre_df['Borough'] != 'Not assigned'].copy()

df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |


In [9]:
df.shape

(210, 3)
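
The drop in row count from 287 to 210 matches the 77 unassigned boroughs counted earlier. The same consistency check can be sketched on a toy frame (the data below is made up for illustration), filtering with a boolean mask:

```python
import pandas as pd

# Made-up sample with one unassigned borough.
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village', 'Not assigned'],
})

# Count the rows to be removed, then filter them out in one step.
n_unassigned = (toy['Borough'] == 'Not assigned').sum()
filtered = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# The number of remaining rows should equal the original count minus the removed rows.
assert len(filtered) == len(toy) - n_unassigned
print(filtered)
```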

In [10]:
# Reset the index numbers
df.reset_index(drop = True, inplace = True)

df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights |
| 4 | M6A | North York | Lawrence Manor |


### <span style="color:blue">Now we will change the Neighbourhood value to the Borough value if the Neighbourhood value is "Not assigned".</span>

In [11]:
# Where the Neighbourhood is "Not assigned", use the row's Borough value instead.
# Assigning through .loc avoids chained indexing, which may silently modify a copy.
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']
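
The same fallback can also be expressed with `Series.mask`, which replaces values wherever a condition holds. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up sample: the second row has no neighbourhood assigned.
toy = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Where the condition is True, take the value from the Borough column instead.
toy['Neighbourhood'] = toy['Neighbourhood'].mask(
    toy['Neighbourhood'] == 'Not assigned', toy['Borough'])
print(toy)
```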

In [12]:
# Group all neighbourhoods that share a Postcode into one row, separated by commas,
# then turn the group keys back into regular columns
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()

df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [13]:
df.shape

(103, 3)
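
To see what the grouping does on a small scale, here is a sketch on three rows taken from the tables above: the two neighbourhoods that share `M1B` collapse into one comma-separated row.

```python
import pandas as pd

# Three ungrouped rows; M1B appears twice.
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Join the neighbourhood names within each (Postcode, Borough) group.
grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)

# After grouping, every postal code should appear exactly once.
assert grouped['Postcode'].is_unique
print(grouped)
```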

### <span style="color:blue">The last line of code concludes Part 1 of this project, and we can see that our dataframe has 103 rows and 3 columns.</span>
### <span style="color:blue">-----------------------------------------------------------------------------------------------------------------------------------------</span>
<br /><br /><br />

## <span style="color:blue">Part 2 - Adding the coordinate data to our dataframe</span>

### <span style="color:blue">Let's create a dataframe with the postal codes from the published CSV file.</span>

In [14]:
# Download the postal code coordinates
df_postal = pd.read_csv('https://cocl.us/Geospatial_data')
df_postal.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [15]:
df_postal.shape

(103, 3)

### <span style="color:blue">We can see that both of our dataframes have the same number of rows, so now let's join them together using the postal codes (as they are unique per row).</span>

In [16]:
# Create a merged dataframe that includes the PostalCode, Borough, Neighbourhood, Latitude and Longitude columns
df_final = df.set_index('Postcode').join(df_postal.set_index('Postal Code'))

# Reset the index of the new dataframe
df_final.reset_index(drop = False, inplace = True)

# Rename column Postcode to PostalCode
df_final = df_final.rename(columns={'Postcode': 'PostalCode'})

# Print a summary message
print('The dataframe has {} postal codes and neighborhood groups, and a total of {} boroughs.'.format(
    df_final.shape[0],    
    len(df_final['Borough'].unique())
    )
)

df_final.head()

The dataframe has 103 postal codes and neighborhood groups, and a total of 11 boroughs.


|   | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


In [17]:
df_final.shape

(103, 5)
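
Equal row counts suggest that the two frames line up, but they do not prove that every postal code found a match. A sketch of a stricter check with `DataFrame.merge`, on two rows copied from the tables above; `validate` raises an error if a key is duplicated, and `indicator` adds a `_merge` column that flags unmatched rows:

```python
import pandas as pd

left = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                     'Borough': ['Scarborough', 'Scarborough']})
right = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# Left join on the postal code, insisting that keys are unique on both sides.
merged = left.merge(right, left_on='Postcode', right_on='Postal Code',
                    how='left', validate='one_to_one', indicator=True)

# Rows flagged 'left_only' would have no coordinates.
unmatched = merged[merged['_merge'] != 'both']
print('Rows without coordinates:', len(unmatched))
```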

### <span style="color:blue">Now we have our final dataframe stored as "df_final"</span>
<br /><br /><br />

## <span style="color:blue">Part 3 - Exploring the data</span>

### <span style="color:blue">Let's start by installing geopy and Folium.</span>

In [19]:
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes
from geopy.geocoders import Nominatim
import folium


### <span style="color:blue">Now let's get the coordinates of Toronto.</span>

In [20]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


### <span style="color:blue">Now we will visualize our data on a map.</span>

In [29]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_final['Latitude'], df_final['Longitude'], df_final['Borough'], df_final['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='#4B0082',
        fill=True,
        fill_color='#9400D3',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### <span style="color:blue">We will focus on the boroughs whose names include "Toronto", so let's create a dataframe that contains only those boroughs.</span>

In [22]:
# Create a dataframe that focuses on boroughs containing "Toronto" in their name
# (na=False treats a missing borough name as a non-match)
toronto_data = df_final[df_final['Borough'].str.contains('Toronto', na=False)]

# Print a summary message
print('The dataframe has {} postal codes and neighborhood groups, and a total of {} boroughs.'.format(
    toronto_data.shape[0],    
    len(toronto_data['Borough'].unique())
    )
)

toronto_data.head()

The dataframe has 39 postal codes and neighborhood groups, and a total of 4 boroughs.


|   | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 37 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 41 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 42 | M4L | East Toronto | The Beaches West, India Bazaar | 43.668999 | -79.315572 |
| 43 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 44 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |


In [23]:
toronto_data.shape

(39, 5)
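
`str.contains` yields a missing value, not `False`, for rows where the borough name itself is missing. A minimal sketch on made-up data of passing `na=False`, so that the result is a plain boolean mask that can index the frame directly:

```python
import pandas as pd

# Made-up sample that includes one missing borough name.
toy = pd.DataFrame({'Borough': ['East Toronto', 'North York', None, 'Central Toronto']})

# na=False treats the missing name as a non-match.
mask = toy['Borough'].str.contains('Toronto', na=False)
print(toy[mask])
```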

### <span style="color:blue">Now we will visualize our new dataframe.</span>

In [26]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='#4B0082',
        fill=True,
        fill_color='#9400D3',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto
