<h1 align=center><font size = 5>Coursera Data Science and Machine Learning Capstone Project</font></h1>
<h1 align=center><font size = 5>Segmenting Toronto Neighbourhoods, Part 1</font></h1>

This is the week 3 assignment of the IBM Data Science capstone project on Coursera. In this project, we cluster the neighbourhoods of the city of Toronto using K-means clustering. In this part, we scrape a web page to obtain the data.

### Retrieving the data

First, we need the list of neighbourhoods and the latitude and longitude of each one. To get it, we parse the web page https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto with bs4 to obtain the neighbourhood list, and then use geopy to look up the coordinates.

#### Import the libraries

In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup 
import lxml
from geopy.geocoders import Nominatim
from geopy.geocoders import ArcGIS
from geopy.extra.rate_limiter import RateLimiter

#### Parsing the web page and getting the list of Toronto's neighbourhoods

In [2]:
# set the webpage address
my_url = 'https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto'

# request the webpage HTML source
page = requests.get(my_url).text

# parse the webpage
# set the BeautifulSoup var 
soup = BeautifulSoup(page, 'lxml')

# get the first six <h3> headings, which hold the district names
district_h3 = soup.find_all("h3", limit=6)

# get the first six tables, which hold the neighbourhood lists
neigh_table = soup.find_all("table", limit=6)
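
The two `find_all(..., limit=6)` calls above assume that the first six `<h3>` headings and the first six tables line up one-to-one. A minimal sketch of a more defensive alternative pairs each heading with the next table in document order using `find_next`. This assumes every district heading is followed by its own table. The HTML snippet and the names `sample_html`, `sample_soup` and `pairs` are made up for illustration; the built-in `html.parser` is used so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the Wikipedia page (illustrative only).
sample_html = """
<h3><a>Old Toronto</a></h3>
<table><tr><td><ul><li><a>The Annex</a></li><li><a>Corktown</a></li></ul></td></tr></table>
<h3><a>East York</a></h3>
<table><tr><td><ul><li><a>Bermondsey</a></li></ul></td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Map each district heading to the <li> items of the table that follows it.
pairs = {}
for heading in sample_soup.find_all("h3"):
    table = heading.find_next("table")
    if table is None:
        continue  # heading without a following table: skip it
    pairs[heading.get_text(strip=True)] = [
        li.get_text(strip=True) for li in table.find_all("li")
    ]

print(pairs)
```

Whether this pairing holds on the live page depends on its current layout, so it is worth checking the resulting district names before relying on it.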

In [3]:
# define a function for saving html text from parsing result
def save_soup_text(soup_var):
    soup_text_list = []
    for i in soup_var:
        soup_text_list.append(i.find("a").text)
    return soup_text_list
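
Note that `i.find("a")` returns `None` when an element has no link, and `.text` then raises an `AttributeError`. A minimal sketch of a tolerant variant follows. The name `save_link_texts` and the demo HTML are hypothetical, not part of the original notebook.

```python
from bs4 import BeautifulSoup

def save_link_texts(elements):
    """Return the text of the first <a> in each element, skipping elements without one."""
    texts = []
    for element in elements:
        link = element.find("a")
        if link is not None:
            texts.append(link.get_text(strip=True))
    return texts

# Made-up input: the second heading has no link and is skipped.
demo_soup = BeautifulSoup(
    "<h3><a>York</a></h3><h3>No link here</h3><h3><a> Etobicoke </a></h3>",
    "html.parser",
)
link_texts = save_link_texts(demo_soup.find_all("h3"))
print(link_texts)
```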

In [4]:
# save district name in list
district_list = save_soup_text(district_h3)
print(district_list)

['Old Toronto', 'East York', 'Etobicoke', 'North York', 'Scarborough', 'York']


In [5]:
# prepare empty dataframe to save the list of the neighbourhoods and its district name
df = pd.DataFrame(columns = ['District', 'Neighbourhood'])

In [6]:
# filling the dataframe with 'District', 'Neighbourhood' names

for district_name, table in zip(district_list, neigh_table):
    temp_list = []
    
    # Getting all neighbourhood list(<li>) from the table
    neigh_li = table.find_all("li") 
    
    for alist in neigh_li:
        
        # Getting neighbourhood name, save into list var in the form of pairs of district and Neighbourhood name
        temp_list.append([district_name, alist.find("a").text])
        
    # create a temp dataframe of 'District', 'Neighbourhood' for the current district_name
    temp_df = pd.DataFrame(temp_list, columns = ['District', 'Neighbourhood'])
    
    # add the temp dataframe to the main dataframe
    # (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
    df = pd.concat([df, temp_df], ignore_index=True)

In [7]:
# check the dataframe
df.head()

Unnamed: 0,District,Neighbourhood
0,Old Toronto,Alexandra Park
1,Old Toronto,The Annex
2,Old Toronto,Baldwin Village
3,Old Toronto,Cabbagetown
4,Old Toronto,CityPlace
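
The loop in cell In [6] grows the dataframe one district at a time. A common alternative, sketched here on made-up data, is to collect all rows in a plain list first and build the dataframe once at the end. The `parsed` dictionary is a hypothetical stand-in for the scraped result.

```python
import pandas as pd

# Hypothetical parsed result: district name -> neighbourhood names.
parsed = {
    "Old Toronto": ["Alexandra Park", "The Annex"],
    "East York": ["Bermondsey"],
}

# One dict per (district, neighbourhood) pair.
rows = [
    {"District": district, "Neighbourhood": name}
    for district, names in parsed.items()
    for name in names
]

# Build the dataframe in a single step.
demo_df = pd.DataFrame(rows, columns=["District", "Neighbourhood"])
print(demo_df)
```

Building the frame once avoids repeatedly copying the accumulated data, which matters little for about 200 rows but scales better.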


In [8]:
# check the shape of dataframe
df.shape

(212, 2)

The number of neighbourhoods in the list is 212.

#### Retrieve the coordinates of each neighbourhood

In [9]:
# set the geolocator 
geolocator_Nominatim = Nominatim(user_agent="coursera-capstone-project")

# wrap the geocode function so that calls are at least one second apart and failed calls are retried
geocode_Nominatim = RateLimiter(geolocator_Nominatim.geocode, min_delay_seconds=1, max_retries=5)

# set another geolocator in case some coordinates cannot be retrieved with the Nominatim geolocator
geolocator_ArcGIS = ArcGIS()
geocode_ArcGIS = RateLimiter(geolocator_ArcGIS.geocode, min_delay_seconds=1, max_retries=5)
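
To make the idea behind the wrapper concrete, here is a simplified sketch of a rate-limited, retrying wrapper written with the standard library only. It is an illustration of the general pattern, not geopy's implementation: the function `rate_limited`, its `sleep` and `clock` parameters, and the fake geocoder are all hypothetical.

```python
import time

def rate_limited(func, min_delay_seconds=1.0, max_retries=2,
                 sleep=time.sleep, clock=time.monotonic):
    """Wrap func so calls are spaced out and failed calls are retried.

    Simplified illustration only; this is not how geopy implements RateLimiter.
    """
    last_call = [None]  # time of the most recent call, shared across calls

    def wrapper(*args, **kwargs):
        for attempt in range(max_retries + 1):
            # Wait until at least min_delay_seconds have passed since the last call.
            if last_call[0] is not None:
                wait = min_delay_seconds - (clock() - last_call[0])
                if wait > 0:
                    sleep(wait)
            last_call[0] = clock()
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: propagate the error

    return wrapper

# Fake geocoder that fails once and then succeeds (no network involved).
calls = []
def flaky_geocode(query):
    calls.append(query)
    if len(calls) == 1:
        raise TimeoutError("simulated timeout")
    return (43.65, -79.38)

# Inject a fake sleep and a frozen clock so the demo runs instantly.
slept = []
limited = rate_limited(flaky_geocode, sleep=slept.append, clock=lambda: 0.0)
result = limited("Toronto")
print(result, len(calls), slept)
```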

In [10]:
# add a column of address strings to pass as queries to the geolocator
df['Location'] = df['Neighbourhood'] + ', ' + df['District'] + ', ' + 'Toronto, CANADA'

# get the coordinate and put it into a new column in dataframe 
loc_data = df['Location'].apply(geocode_Nominatim)
df['Latitude'] = loc_data.apply(lambda loc: loc.latitude  if loc else np.nan)
df['Longitude'] = loc_data.apply(lambda loc: loc.longitude  if loc else np.nan)

In [12]:
# check the dataframe
df.head()

Unnamed: 0,District,Neighbourhood,Location,Latitude,Longitude
0,Old Toronto,Alexandra Park,"Alexandra Park, Old Toronto, Toronto, CANADA",43.650758,-79.404298
1,Old Toronto,The Annex,"The Annex, Old Toronto, Toronto, CANADA",43.670338,-79.407117
2,Old Toronto,Baldwin Village,"Baldwin Village, Old Toronto, Toronto, CANADA",,
3,Old Toronto,Cabbagetown,"Cabbagetown, Old Toronto, Toronto, CANADA",43.664473,-79.366986
4,Old Toronto,CityPlace,"CityPlace, Old Toronto, Toronto, CANADA",43.639248,-79.396387


There seem to be null values in the Latitude and Longitude columns. Let's check how many coordinates are missing.

In [13]:
# create mask to select all rows with null value in Latitude or Longitude column
check_nan_coor = np.logical_or(df['Latitude'].isnull(),df['Longitude'].isnull())

# check the number of missing value of coordinate
print (check_nan_coor.value_counts())
print("True : missing value")

False    193
True      19
Name: Latitude, dtype: int64
True : missing value
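
The same check can be written with pandas alone. The sketch below, on made-up data, flags a row as missing when either coordinate is null and counts the flagged rows directly.

```python
import numpy as np
import pandas as pd

# Made-up data: row B lacks both coordinates, row C lacks only the longitude.
demo = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "Latitude": [43.65, np.nan, 43.66],
    "Longitude": [-79.40, np.nan, np.nan],
})

# True where at least one of the two coordinate columns is null.
missing_mask = demo[["Latitude", "Longitude"]].isna().any(axis=1)
n_missing = int(missing_mask.sum())
print(n_missing)
```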


Coordinates are missing for 19 neighbourhoods. Let's try to retrieve them using the ArcGIS geolocator.

In [14]:
# get the missing coordinates using the ArcGIS geolocator
loc_data = df.loc[check_nan_coor, 'Location'].apply(geocode_ArcGIS)

# assign through .loc so the values are written to df itself, not to a temporary copy
df.loc[check_nan_coor, 'Latitude'] = loc_data.apply(lambda loc: loc.latitude if loc else np.nan)
df.loc[check_nan_coor, 'Longitude'] = loc_data.apply(lambda loc: loc.longitude if loc else np.nan)


In [15]:
df.head()

Unnamed: 0,District,Neighbourhood,Location,Latitude,Longitude
0,Old Toronto,Alexandra Park,"Alexandra Park, Old Toronto, Toronto, CANADA",43.650758,-79.404298
1,Old Toronto,The Annex,"The Annex, Old Toronto, Toronto, CANADA",43.670338,-79.407117
2,Old Toronto,Baldwin Village,"Baldwin Village, Old Toronto, Toronto, CANADA",43.655185,-79.397399
3,Old Toronto,Cabbagetown,"Cabbagetown, Old Toronto, Toronto, CANADA",43.664473,-79.366986
4,Old Toronto,CityPlace,"CityPlace, Old Toronto, Toronto, CANADA",43.639248,-79.396387
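
The two-step lookup above (a primary geocoder, then a fallback for the rows still missing) can be sketched without any network access. The functions `primary` and `fallback` are hypothetical stand-ins for the rate-limited geocoders; they return an object with `latitude` and `longitude` attributes, or `None` when nothing is found.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

# Stand-in geocoders (hypothetical): return a location-like object or None.
def primary(query):
    known = {"The Annex, Toronto": SimpleNamespace(latitude=43.670338, longitude=-79.407117)}
    return known.get(query)

def fallback(query):
    known = {"Baldwin Village, Toronto": SimpleNamespace(latitude=43.655185, longitude=-79.397399)}
    return known.get(query)

demo = pd.DataFrame({"Location": ["The Annex, Toronto", "Baldwin Village, Toronto"]})

# First pass with the primary geocoder.
found = demo["Location"].apply(primary)
demo["Latitude"] = found.apply(lambda loc: loc.latitude if loc else np.nan)
demo["Longitude"] = found.apply(lambda loc: loc.longitude if loc else np.nan)

# Second pass: only the rows that are still missing go to the fallback.
missing = demo["Latitude"].isna() | demo["Longitude"].isna()
retry = demo.loc[missing, "Location"].apply(fallback)
demo.loc[missing, "Latitude"] = retry.apply(lambda loc: loc.latitude if loc else np.nan)
demo.loc[missing, "Longitude"] = retry.apply(lambda loc: loc.longitude if loc else np.nan)
print(demo)
```

Writing through `demo.loc[mask, column]` updates the dataframe in place and avoids pandas' chained-assignment warning.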


Let's count the missing values in the coordinate columns again.

In [16]:
# check the number of missing value of coordinate 
# True means the number of the missing values
missing_data= np.logical_or(df['Latitude'].isnull(),df['Longitude'].isnull())
print (missing_data.value_counts())

False    212
Name: Latitude, dtype: int64


No neighbourhood is missing its coordinates any more. Before saving the dataframe to a CSV file, let's check for neighbourhoods that share a name and remove any repeated entry that also has the same coordinates.

In [17]:
# group dataframe by neighbourhood
grouped_df = df.groupby('Neighbourhood').count()

# check the neighbourhood with same name
grouped_df[grouped_df['Location']>1]

Neighbourhood,District,Location,Latitude,Longitude
Bermondsey,2,2,2,2
East Danforth,2,2,2,2
Queen Street West,2,2,2,2


Let's look at the rows for each of these neighbourhoods in the dataframe.

In [18]:
df[df['Neighbourhood'] == 'Bermondsey']

Unnamed: 0,District,Neighbourhood,Location,Latitude,Longitude
95,East York,Bermondsey,"Bermondsey, East York, Toronto, CANADA",43.713824,-79.310678
133,North York,Bermondsey,"Bermondsey, North York, Toronto, CANADA",43.725405,-79.313693


In [19]:
df[df['Neighbourhood'] == 'East Danforth']

Unnamed: 0,District,Neighbourhood,Location,Latitude,Longitude
34,Old Toronto,East Danforth,"East Danforth, Old Toronto, Toronto, CANADA",43.68636,-79.300316
92,East York,East Danforth,"East Danforth, East York, Toronto, CANADA",43.68636,-79.300316


In [20]:
df[df['Neighbourhood'] == 'Queen Street West']

Unnamed: 0,District,Neighbourhood,Location,Latitude,Longitude
23,Old Toronto,Queen Street West,"Queen Street West, Old Toronto, Toronto, CANADA",43.649852,-79.391175
83,Old Toronto,Queen Street West,"Queen Street West, Old Toronto, Toronto, CANADA",43.649852,-79.391175


Both East Danforth rows share the same coordinates, as do both Queen Street West rows, so let's remove one row of each from the dataframe. The two Bermondsey rows have different coordinates, so both are kept.

In [21]:
df.drop([83, 92], inplace = True)
df.reset_index(drop = True, inplace =True)
df

Unnamed: 0,District,Neighbourhood,Location,Latitude,Longitude
0,Old Toronto,Alexandra Park,"Alexandra Park, Old Toronto, Toronto, CANADA",43.650758,-79.404298
1,Old Toronto,The Annex,"The Annex, Old Toronto, Toronto, CANADA",43.670338,-79.407117
2,Old Toronto,Baldwin Village,"Baldwin Village, Old Toronto, Toronto, CANADA",43.655185,-79.397399
3,Old Toronto,Cabbagetown,"Cabbagetown, Old Toronto, Toronto, CANADA",43.664473,-79.366986
4,Old Toronto,CityPlace,"CityPlace, Old Toronto, Toronto, CANADA",43.639248,-79.396387
5,Old Toronto,Chinatown,"Chinatown, Old Toronto, Toronto, CANADA",43.652924,-79.398032
6,Old Toronto,Church and Wellesley,"Church and Wellesley, Old Toronto, Toronto, CA...",43.665524,-79.383801
7,Old Toronto,Corktown,"Corktown, Old Toronto, Toronto, CANADA",43.657371,-79.356519
8,Old Toronto,Discovery District,"Discovery District, Old Toronto, Toronto, CANADA",43.657556,-79.389480
9,Old Toronto,Distillery District,"Distillery District, Old Toronto, Toronto, CANADA",43.650295,-79.359540
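
Dropping rows by hard-coded index works for this particular scrape, but the indices change if the page changes. A minimal sketch of an index-free alternative, on made-up data, uses `drop_duplicates` on the name and coordinate columns. This is a suggested variation, not what the notebook above does.

```python
import pandas as pd

demo = pd.DataFrame({
    "District": ["Old Toronto", "East York", "East York", "North York"],
    "Neighbourhood": ["East Danforth", "East Danforth", "Bermondsey", "Bermondsey"],
    "Latitude": [43.68636, 43.68636, 43.713824, 43.725405],
    "Longitude": [-79.300316, -79.300316, -79.310678, -79.313693],
})

# Keep the first row of each (name, latitude, longitude) combination.
deduped = (
    demo.drop_duplicates(subset=["Neighbourhood", "Latitude", "Longitude"], keep="first")
        .reset_index(drop=True)
)
print(deduped)
```

Here the second East Danforth row is removed, while both Bermondsey rows survive because their coordinates differ.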


Let's save the dataframe to a CSV file so that we can use it in part 2.

In [22]:
path = "Toronto_neighbourhood_list.csv"
df.to_csv(path, index=False)
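
A quick way to confirm that the saved file can be used in part 2 is to read it back and compare its shape and columns. The sketch below does the round trip through an in-memory buffer instead of a real file, on a one-row made-up frame.

```python
import io

import pandas as pd

demo = pd.DataFrame({
    "District": ["Old Toronto"],
    "Neighbourhood": ["The Annex"],
    "Latitude": [43.670338],
    "Longitude": [-79.407117],
})

# Write to an in-memory buffer rather than to disk, then read it back.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape, list(restored.columns))
```

Because `index=False` is used when writing, the restored frame has no extra index column.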

***<font size = 3>Author : Hadi Muhshi</font>*** 