
# <center>Applied Data Science Capstone Week 3 </center>

<h3>1. Web Scraping and creating dataframe for Toronto city neighbourhoods.</h3>

Import <b>Beautiful Soup</b> to scrape the Wikipedia page.

Many sites take precautions to fend off scrapers. The first thing we can do to get around this is to send browser-like headers, in particular a User-Agent, along with our requests so that the scraper looks like a legitimate browser:

In [1]:
import requests
from bs4 import BeautifulSoup


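headers = {
    # The Access-Control-* entries are CORS headers that are normally set by
    # servers in responses, not by clients; the User-Agent is the part that
    # makes the request look like it comes from a browser.
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}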

In [2]:
# Wikipedia page "List of postal codes of Canada: M" (Toronto postal codes begin with the letter M)
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
req = requests.get(url, headers=headers)          # pass headers by keyword; the second positional argument is params
soup = BeautifulSoup(req.content, 'html.parser')  # create a soup object
#print(soup.prettify())
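
A detail of the `requests` API that matters for this step: the second positional parameter of `requests.get` is `params` (the query string), so a headers dictionary only reaches the server as headers when it is passed as `headers=...`. The sketch below shows the difference without touching the network, by preparing requests instead of sending them. The URL and the `demo-agent/1.0` string are placeholders chosen for illustration, not values used elsewhere in this notebook.

```python
import requests

demo_url = "https://example.org/page"            # placeholder URL
demo_headers = {"User-Agent": "demo-agent/1.0"}  # placeholder header value

# Passed by keyword as headers: the value ends up in the HTTP headers.
as_headers = requests.Request("GET", demo_url, headers=demo_headers).prepare()

# Passed as params (which is what a bare second positional argument to
# requests.get means): the value is appended to the URL as a query string.
as_params = requests.Request("GET", demo_url, params=demo_headers).prepare()

print(as_headers.headers.get("User-Agent"))
print(as_headers.url)
print(as_params.url)
```

Comparing the two printed URLs makes it easy to spot when a headers dictionary has been sent as query parameters by mistake.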

<b>After retrieving the page and creating a Beautiful Soup object:</b>

1. First, create an empty list.
2. After finding the table and its data cells, create a dictionary called cell with three keys: PostalCode, Borough and Neighborhood.
3. The postal code is the first 3 characters of the cell text, so extract it with row.p.text[:3].
4. Use the split, strip and replace functions to get the Borough and Neighborhood information.
5. Append the dictionary to the list.
6. Create a dataframe from the list.

In [3]:
table_contents = []                        # one dictionary per assigned postal code
table = soup.find('table')                 # first table on the page
for row in table.find_all('td'):           # each <td> holds one postal code
    #print(row)
    if row.span.text == 'Not assigned':    # skip postal codes without a borough
        continue
    span_text = row.span.text
    cell = {}                              # dictionary for this postal code
    cell['PostalCode'] = row.p.text[:3]
    cell['Borough'] = span_text.split('(')[0]
    # text inside the parentheses, with ' /' separators turned into commas
    inner = span_text.split('(')[1]
    cell['Neighborhood'] = inner.strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    table_contents.append(cell)            # append the dictionary to the list
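
The string handling in the loop above is easier to follow on a single value. The sketch below applies the same split, strip and replace chain to one hand-written string. The sample text is an assumption about what `row.span.text` looks like for one cell (borough name followed by the neighbourhoods in parentheses), not output captured from the page.

```python
# Assumed shape of row.span.text for one table cell (hand-written sample).
sample = "Downtown Toronto(Regent Park / Harbourfront)"

# Everything before the first '(' is the borough.
borough = sample.split("(")[0]

# Everything after it is the neighbourhood list: drop the closing ')',
# turn ' /' separators into commas, and trim stray spaces.
inner = sample.split("(")[1]
neighborhood = inner.strip(")").replace(" /", ",").replace(")", " ").strip(" ")

print(borough)
print(neighborhood)
```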

<b>Now we import the pandas library to create a dataframe from the data we scraped from Wikipedia with Beautiful Soup.</b>

In [4]:
import pandas as pd                             # import pandas module...

In [5]:
# print(table_contents)
df=pd.DataFrame(table_contents)                # create a dataframe containing extracted list...
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})


In [6]:
df.shape                                     # number of rows and columns in the dataframe...
#df.head()

(103, 3)
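
The `replace` call above relies on the fact that `Series.replace` with a dictionary matches whole cell values, not substrings. A minimal sketch on a made-up three-row series shows the effect: only the exact key is rewritten, and a value that merely starts with the same text is left alone.

```python
import pandas as pd

# Made-up sample values; only the first one matches a dictionary key exactly.
boroughs = pd.Series(["EtobicokeNorthwest", "North York", "Etobicoke"])

cleaned = boroughs.replace({"EtobicokeNorthwest": "Etobicoke Northwest"})

print(cleaned.tolist())
```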

<h3>2. Get the latitude and the longitude coordinates of each neighborhood.</h3>

We will use the <b>Geocoder</b> Python package to find the coordinates of each neighbourhood.

In [7]:
#pip install geocoder

In [8]:
# import geocoder                                                  # import geocoder package...

# # initialize your variable to None
# lat_lng_coords = None

# # loop until you get the coordinates
# while(lat_lng_coords is None):
#   g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#   lat_lng_coords = g.latlng

# latitude = lat_lng_coords[0]
# longitude = lat_lng_coords[1]
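
One thing to note about the commented-out loop above: it keeps retrying for as long as the lookup returns `None`, so it never finishes if the service keeps failing. A bounded variant is sketched below. The helper `geocode_with_retry` and the stand-in `fake_lookup` are hypothetical names introduced for illustration, and the coordinates are made up; in real use the `lookup` argument would wrap whatever geocoding call is available.

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # wait before the next attempt
    return None  # give up instead of looping forever


# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
_responses = iter([None, None, [43.65, -79.38]])


def fake_lookup(query):
    return next(_responses)


result = geocode_with_retry(fake_lookup, "M5A, Toronto, Ontario")
gave_up = geocode_with_retry(lambda query: None, "M5A, Toronto, Ontario", max_attempts=3)

print(result)
print(gave_up)
```

Returning `None` after the last attempt lets the caller decide what to do next, for example falling back to a prepared file of coordinates.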

The geocoder package did not work for me, and I was not able to solve the problem, so let's use another method: a CSV file of the coordinates.

In [9]:
df_location = pd.read_csv('https://cocl.us/Geospatial_data')                    # read data from CSV file...
df_location = df_location.rename(columns={"Postal Code":"PostalCode"})  # rename column 'Postal Code' to 'PostalCode'...
#df_location.head()

In [10]:
df_final = pd.merge(df, df_location, on='PostalCode')    # merge the two dataframes on their common 'PostalCode' column
df_final

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto Business,Enclave of M4L,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
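
`pd.merge` performs an inner join by default, so any postal code that has no match in the coordinates file is dropped without a message. A quick way to check for that is a left join with `indicator=True`. The sketch below uses two small hand-made frames; `M9Z` and `Example Borough` are invented values standing in for an unmatched row.

```python
import pandas as pd

codes = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],  # M9Z is invented for the example
    "Borough": ["North York", "North York", "Example Borough"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Default inner join: the unmatched row disappears.
inner = pd.merge(codes, coords, on="PostalCode")

# Left join with an indicator column: unmatched rows are kept and flagged.
checked = pd.merge(codes, coords, on="PostalCode", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()

print(len(inner))
print(missing)
```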


<h3>3. Explore and Cluster the Toronto neighbourhood data.</h3>

Before we get the data and start exploring it, let's download all the dependencies that we will need.


In [11]:
import json # library to handle JSON files

# !conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<b>We segment and cluster only the neighborhoods in Toronto, so let's slice the original dataframe and create a new dataframe of the Toronto data.</b>

In [12]:
# keep the rows whose Borough contains the word 'Toronto'; copy the slice so that
# a column can be added to it later without a SettingWithCopyWarning
toronto_data = df_final[df_final['Borough'].str.contains('Toronto', regex=False)].copy()
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
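
The filter above is a plain substring test, so it keeps every borough whose name contains "Toronto" anywhere, including combined names. A small sketch on made-up rows shows which values pass. It also takes a `.copy()` of the slice, which is one way to get an independent dataframe that can be modified later without pandas warning about writing to a slice.

```python
import pandas as pd

sample = pd.DataFrame({
    "Borough": ["Downtown Toronto", "North York",
                "East York/East Toronto", "Etobicoke"],
})

# regex=False treats 'Toronto' as literal text rather than a pattern.
mask = sample["Borough"].str.contains("Toronto", regex=False)
subset = sample[mask].copy()

print(subset["Borough"].tolist())
```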


Now let's visualize all the neighbourhoods of the above dataframe using <b>Folium</b>.

Let's get the geographical coordinates of Toronto.


In [13]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


<b>Create a map of Toronto with neighborhoods superimposed on top.</b>

In [14]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<b>Use k-means clustering to cluster the neighbourhoods.</b>

In [15]:
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [16]:
k = 5
# cluster on the coordinates only, so drop the descriptive columns
toronto_clustering = toronto_data.drop(columns=['PostalCode', 'Borough', 'Neighborhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
toronto_data.insert(0, 'Cluster Labels', kmeans.labels_)   # add each row's cluster label as the first column



In [17]:
toronto_data.head()

Unnamed: 0,Cluster Labels,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
9,4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,4,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,1,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
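
To make concrete what the values in the Cluster Labels column mean: k-means minimises squared Euclidean distance, and each row's label is the index of the centroid closest to it. The sketch below reproduces only that assignment step with NumPy. It is not scikit-learn's implementation, and the points and centroids are made-up coordinates chosen for illustration.

```python
import numpy as np

# Made-up (latitude, longitude) points and two made-up centroids.
points = np.array([[43.65, -79.36],
                   [43.66, -79.38],
                   [43.68, -79.29]])
centroids = np.array([[43.655, -79.37],
                      [43.68, -79.30]])

# distances[i, j] is the Euclidean distance from point i to centroid j.
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point is labelled with the index of its nearest centroid.
labels = distances.argmin(axis=1)

print(labels.tolist())
```

The label numbers themselves carry no order or meaning; they only say which rows were grouped together.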


<b>Now, visualise the clusters on a map using Folium, with marker colours generated by matplotlib.</b>

In [18]:
import numpy as np

In [19]:
# create map
map_clusters = folium.Map(location=[43.6534817,-79.3839347],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighbourhood, cluster in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighborhood'], toronto_data['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],          # labels run from 0 to k-1, so they index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
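
In the colour-scheme code above, `ys` is used only for its length, which equals `k`. A shorter way to build the same kind of palette is sketched below: sample the rainbow colormap at `k` evenly spaced positions and convert each RGBA value to a hex string that Folium accepts. The name `palette` is introduced here for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # number of clusters, as in the clustering step

# One colour per cluster, evenly spaced across the rainbow colormap.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

print(palette)
```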