# Segmenting and Clustering Neighborhoods in Toronto
### Week 3. Graded Lab

In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format that is easy to work with.

Student: Norma López-Sancho

In [1]:
# Importing Libraries
import numpy as np 
import pandas as pd 

In [2]:
# Installing BeautifulSoup and the HTML parsers for web scraping, in case they are needed
!pip install lxml html5lib beautifulsoup4
print ('You are good to go')

You are good to go


In [3]:
# Reading URL through pandas
tnt = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [4]:
# Checking how many tables are within the specified URL
print(len(tnt))

3


In [5]:
# Checking that the table I want is the first contained in the web page
tnt[0].head()

Unnamed: 0,0,1,2
0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [6]:
# Since I have confirmed the first table [0] is the one I want, I put it in a new dataframe
tnt_df = pd.DataFrame(data=tnt[0])
tnt_df.head()

Unnamed: 0,0,1,2
0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [7]:
# Using the first row as column names in my dataframe, as that is where they are contained
tnt_df.columns = tnt_df.iloc[0]
tnt_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [8]:
# And now dropping row 0, as it contains the column names, which have already been used
tnt_df.drop([0], inplace = True)
tnt_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
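
The two steps above (use row 0 as the header, then drop it) can also be done in one pass. This is a minimal sketch on a small hand-made frame that stands in for `tnt[0]`; the names `raw` and `clean` are only for illustration:

```python
import pandas as pd

# Hand-made stand-in for the scraped table: the header sits in row 0
raw = pd.DataFrame([
    ['Postal Code', 'Borough', 'Neighbourhood'],
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
])

# Keep every row after the first and use the first row as column names
clean = raw.iloc[1:].copy()
clean.columns = raw.iloc[0].tolist()
clean = clean.reset_index(drop=True)

print(clean)
```

`pd.read_html` also accepts a `header` argument (for example `header=0`), which may make this step unnecessary, depending on how the page marks up its header row.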


In [9]:
# Removing rows whose Borough is 'Not assigned'
tnt_df = tnt_df[~tnt_df.Borough.str.contains('Not assigned')]
tnt_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [10]:
# Checking if the code has indeed worked by searching for "Not assigned" string in the column Borough
tnt_df[tnt_df['Borough'].str.match('Not assigned')]

Unnamed: 0,Postal Code,Borough,Neighbourhood
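
`str.contains` searches for the text anywhere inside the value, which works here. An exact comparison is a slightly stricter alternative. A sketch on toy rows (made up for illustration, not taken from the real table):

```python
import pandas as pd

toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
})

# Keep only rows whose Borough is not exactly 'Not assigned'
assigned = toy[toy['Borough'] != 'Not assigned']

# Count how many 'Not assigned' boroughs are left after filtering
remaining = int((assigned['Borough'] == 'Not assigned').sum())

print(assigned)
print(remaining)
```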


#### The lab mentions that there are repeated Postal Codes with different neighbourhoods assigned, and uses M5A as an example

#### Let's make a first check to see if the statement is true:

In [11]:
# Getting the rows that contain M5A
tnt_df[tnt_df['Postal Code'].str.match('M5A')]

Unnamed: 0,Postal Code,Borough,Neighbourhood
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"


#### Well, it seems it is not, at least for M5A as proposed in the exercise

#### So let's double-check the whole set, in case other Postal Codes are repeated:

In [12]:
# Counting values in the Postal Code column to see if any count is greater than 1
check = tnt_df['Postal Code'].value_counts()
check[check>1]


Series([], Name: Postal Code, dtype: int64)

#### Nothing is repeated, as shown by the check above, but if there were, we could use the code below to merge the data in the Neighbourhood column:

<code> tnt_df = tnt_df.groupby(['Postal Code','Borough'])['Neighbourhood'].apply(', '.join).reset_index()</code>
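
To see what that `groupby` would do, here is a sketch on made-up rows in which M5A appears twice (the real table has no such duplicates, so the frame `dup` is purely hypothetical):

```python
import pandas as pd

# Made-up rows: the same postal code listed once per neighbourhood
dup = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# One row per (Postal Code, Borough), neighbourhoods joined with commas
merged = (
    dup.groupby(['Postal Code', 'Borough'])['Neighbourhood']
       .apply(', '.join)
       .reset_index()
)

print(merged)
```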

#### The exercise also requests that if a Neighbourhood is Not assigned, the name of the Borough be used instead.
#### Let's check if any row matches that criterion:

In [13]:
tnt_df[tnt_df['Neighbourhood'].str.match('Not assigned')]

Unnamed: 0,Postal Code,Borough,Neighbourhood


#### Well, we can see that all Neighbourhoods are assigned, but if they weren't, we could use the code below to perform the requested task

<code> tnt_df.loc[tnt_df['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = tnt_df['Borough'] </code>
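
The replacement can be tried on a toy frame. This sketch uses `.loc` with a boolean mask, which avoids chained assignment; the rows are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Rows whose Neighbourhood is missing take the Borough name instead
mask = toy['Neighbourhood'] == 'Not assigned'
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']

print(toy)
```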


In [14]:
tnt_df.shape

(103, 3)

#### First I loaded the Geospatial_Coordinates file into my Docker container. Now let's read it into a pandas DataFrame:

In [15]:
# Reading the csv from my docker container
geo = pd.read_csv('Geospatial_Coordinates.csv')

In [16]:
# Checking the first 5 rows of the dataframe
geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [17]:
#Checking the shape
geo.shape

(103, 3)

#### We can see the geo file has the same number of rows as our final tnt_df, so it should contain all the Lat/Lon values that we need
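
An equal row count suggests the two tables line up, but it does not prove that they hold the same postal codes. A stricter check compares the codes themselves. The sketch below uses two short made-up series in place of the real `Postal Code` columns:

```python
import pandas as pd

# Made-up stand-ins for tnt_df['Postal Code'] and geo['Postal Code']
left = pd.Series(['M3A', 'M4A', 'M5A'], name='Postal Code')
right = pd.Series(['M5A', 'M3A', 'M9Z'], name='Postal Code')

# Codes present on one side only; both lists empty means a full match
missing_in_right = sorted(set(left) - set(right))
missing_in_left = sorted(set(right) - set(left))

print(missing_in_right, missing_in_left)
```

Both lists being empty would confirm that an inner join drops no rows.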

#### Let's now join both dataframes together, as required in the lab:

In [18]:
#First we set Postal Code on tnt_df as index
tnt_df = tnt_df.set_index(['Postal Code'])
tnt_df.head()

Postal Code,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [19]:
# Now we set Postal Code on geo as index as well
geo = geo.set_index(['Postal Code'])
geo.head()

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [20]:
# Now concatenating using an inner join and resetting the index
tnt_geo = pd.concat([tnt_df, geo], axis=1, join='inner').reset_index()
tnt_geo.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
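
The same join can be written without setting any index, using `merge` on the shared column. This is an alternative sketch, not the approach used above; it reuses two rows from the output above as toy data:

```python
import pandas as pd

hoods = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village'],
})
coords = pd.DataFrame({
    'Postal Code': ['M4A', 'M3A'],
    'Latitude': [43.725882, 43.753259],
    'Longitude': [-79.315572, -79.329656],
})

# Inner join on the key column; validate raises if a code is duplicated
joined = hoods.merge(coords, on='Postal Code', how='inner',
                     validate='one_to_one')

print(joined)
```

The `validate='one_to_one'` argument is optional. It makes pandas raise an error if a postal code appears more than once on either side.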


#### Another way suggested by the lab is geocoder, in case you don't have a usable file with the latitudes and longitudes. The code would be as follows:

<code>!pip install geocoder</code>
<code>import geocoder</code>

#collect the coordinates of every postal code in two lists

<code>latitude = []
longitude = []</code>

#loop until you get the coordinates of each postal code

<code>for code in tnt_geo['Postal Code']:
    lat_lng_coords = None  # reset before each lookup
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
        lat_lng_coords = g.latlng
    latitude.append(lat_lng_coords[0])
    longitude.append(lat_lng_coords[1])</code>
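
A loop that retries until the coordinates arrive can run for a long time if a lookup keeps failing. One option is to cap the number of attempts. The sketch below is an assumption-laden illustration: `fetch_latlng` and `fake_lookup` are hypothetical helpers, and the lookup is passed in as a function so the example runs without network access.

```python
def fetch_latlng(code, lookup, max_tries=3):
    """Return [lat, lng] for a postal code, or None after max_tries failures."""
    for _ in range(max_tries):
        coords = lookup('{}, Toronto, Ontario'.format(code))
        if coords is not None:
            return coords
    return None


# Stand-in lookup that fails once and then answers; with geocoder installed
# a real one could be `lambda q: geocoder.arcgis(q).latlng`
calls = {'n': 0}


def fake_lookup(query):
    calls['n'] += 1
    return None if calls['n'] == 1 else [43.65426, -79.360636]


result = fetch_latlng('M5A', fake_lookup)
print(result)
```

Postal codes that still return `None` after the last attempt can then be collected and handled separately.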