# This is the notebook I will use to analyze Toronto's neighborhoods

### I don't know in advance how long it will take; it looks kind of scary.

The first attempt used the Beautiful Soup library, but pandas turned out to be much faster, so I switched to that.
For the record, I left the unused cells in place.

In [1]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import re

* Now we will open the Wikipedia page with Beautiful Soup
* This cell is not actually used for the final analysis

In [2]:
# the following opens the website using the requests.get method
html_source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# using beautifulsoup, we parse the file with the lxml library
soup = bs(html_source, 'lxml')

#print(soup.prettify())

# with find, we look for the table we're interested in. From inspecting the page source, we already know it is a <table> element with the class 'wikitable sortable'
table_match = soup.find('table', class_ = 'wikitable sortable')

In [3]:
headers_ = table_match.tbody.text
#headers_.text.rstrip()
#print(headers_)
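
For reference, here is a minimal sketch of how the Beautiful Soup route could be finished: walk the `<tr>` rows of the table and collect the cell text into a dataframe. The HTML snippet is a made-up miniature of the Wikipedia table (so the example runs offline), the names `sample_html`, `sample_table` and `bs_df` are illustrative, and `html.parser` is used instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
import pandas as pd

# made-up miniature of the Wikipedia table, so the sketch runs without network access
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table', class_='wikitable sortable')

# collect the text of every header/data cell, one list per table row
rows = []
for tr in sample_table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if cells:
        rows.append(cells)

# the first row holds the headers, the rest is data
bs_df = pd.DataFrame(rows[1:], columns=rows[0])
print(bs_df)
```

This is noticeably more code than the single `pd.read_html` call used in the next cells, which is why the pandas route was kept.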

### Actually, you know what? It seems this is way simpler in pandas than with Beautiful Soup. Let's give it a try:

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

df = pd.read_html(url)

### It turns out `read_html` already returns the page's tables as a list of dataframes! BeautifulPandas!

In [5]:
#since there is more than one table on that page, let's just use the first one
df1 = df[0]
df1.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


**Now we need to clean the dataset up. Next cell will filter the not assigned postal codes out**

In [6]:
# this drops the rows whose Borough value is 'Not assigned'
df2 = df1[df1['Borough'] != 'Not assigned'].reset_index(drop=True)

### It appears no remaining borough has a 'Not assigned' neighborhood, so we can skip that part

In [7]:
df2.head(10)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,Malvern / Rouge
7,M3B,North York,Don Mills
8,M4B,East York,Parkview Hill / Woodbine Gardens
9,M5B,Downtown Toronto,"Garden District, Ryerson"
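
If a later version of the table did contain rows with a borough but no neighborhood, one way to handle them would be to fall back to the borough name. The sketch below uses a small made-up dataframe (`sample`) rather than `df2`, and assumes missing neighborhoods show up either as `NaN` or as the text 'Not assigned'.

```python
import pandas as pd

# made-up rows: one complete, one with a missing value, one with the placeholder text
sample = pd.DataFrame({
    'Postal code': ['M3A', 'M7A', 'M9A'],
    'Borough': ['North York', "Queen's Park", 'Etobicoke'],
    'Neighborhood': ['Parkwoods', None, 'Not assigned'],
})

# treat both NaN and the literal 'Not assigned' as missing
missing = sample['Neighborhood'].isna() | (sample['Neighborhood'] == 'Not assigned')
print(missing.sum(), 'rows without a neighborhood')

# fall back to the borough name for those rows
sample.loc[missing, 'Neighborhood'] = sample.loc[missing, 'Borough']
print(sample)
```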


### Also, neighborhoods that share a postal code are already merged into one row, separated by a "/" instead of a comma, so we go ahead and replace those slashes with commas

In [8]:
regespr = re.compile(r' /')

In [9]:
# this will replace the " /" with just a ",", as required by the instructions
df3 = df2.replace(regespr,',')

df3.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
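
`DataFrame.replace` with a regular expression scans every string column. A narrower alternative, sketched here on a made-up series called `names`, is to apply `str.replace` to the `Neighborhood` column only; `regex=False` treats `' /'` as literal text.

```python
import pandas as pd

# made-up neighborhood names in the same "A / B" style as the Wikipedia table
names = pd.Series(
    ['Parkwoods', 'Regent Park / Harbourfront', 'Malvern / Rouge'],
    name='Neighborhood',
)

# replace the literal " /" separator with a comma; the space after the slash is kept
cleaned = names.str.replace(' /', ',', regex=False)
print(cleaned.tolist())
```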


## And here is the shape, as requested by the exercise

In [10]:
df3.shape

(103, 3)

### Here we would install the geocoder package; it turns out it does not do its job, so we will just download the csv file later

In [11]:
#!conda install -c conda-forge geocoder --yes

In [12]:
#import geocoder
#print('geocoder installed!')

### Unfortunately, the geocoder package doesn't seem to work for me

Anyway, the solution would probably involve using something like this:

df3.append({'Latitude': latitude, 'Longitude': longitude}, ignore_index=True)
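
Note that appending a dict adds a new row rather than new columns, and `DataFrame.append` was removed in pandas 2.0. A possible shape for a per-postal-code lookup is sketched below. `lookup_coords` is a hypothetical stand-in for the geocoder call (here it only reads a small hard-coded dict, with values taken from the CSV used later in this notebook), and the retry loop is capped so that a failing service cannot hang the notebook.

```python
import pandas as pd

# hypothetical stand-in for a geocoding service; returns None when nothing is found
FAKE_SERVICE = {'M5A': (43.65426, -79.360636), 'M7A': (43.662301, -79.389494)}

def lookup_coords(postal_code):
    return FAKE_SERVICE.get(postal_code)

def coords_with_retry(postal_code, max_attempts=3):
    """Try the lookup a few times, then give up with (None, None)."""
    for _ in range(max_attempts):
        result = lookup_coords(postal_code)
        if result is not None:
            return result
    return (None, None)

# 'M0X' is a made-up code that the fake service does not know
codes = pd.DataFrame({'Postal code': ['M5A', 'M7A', 'M0X']})

# build the two columns from the lookup results instead of appending rows
found = [coords_with_retry(code) for code in codes['Postal code']]
codes['Latitude'] = [lat for lat, _ in found]
codes['Longitude'] = [lon for _, lon in found]
print(codes)
```

Postal codes that cannot be resolved end up with missing values in both columns, so they are easy to spot afterwards.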

In [13]:
"""print('geocode_test')
# initialize your variable to None
lat_lng_coords = None
postal_code = 'M5A'
# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
print(latitude, longitude)
print('hello there')"""

### So we move on and download the already arranged csv!

In [14]:
url1 = 'http://cocl.us/Geospatial_data'
df4 = pd.read_csv(url1)

### In order to correctly merge together the two dataframes, we will slightly change one column name to match the other one

In [15]:
df4.rename(columns = {'Postal Code':'Postal code'}, inplace = True) 
df4.head()

Unnamed: 0,Postal code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### The following cell merges together the two dataframes, based on the postal code column we just renamed

In [16]:
df5 = pd.merge(df3, df4, on='Postal code', how='inner')
df5.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
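
An inner merge silently drops postal codes that appear in only one of the two tables. As a sanity check, the sketch below (on made-up frames `left` and `right`, not `df3` and `df4`; the `M9Z` row is invented) uses the `indicator` and `validate` options of `pd.merge` to surface unmatched keys and to check that each postal code appears only once per side.

```python
import pandas as pd

left = pd.DataFrame({
    'Postal code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
right = pd.DataFrame({
    'Postal code': ['M3A', 'M4A', 'M9Z'],  # 'M9Z' is an invented code
    'Latitude': [43.753259, 43.725882, 0.0],
    'Longitude': [-79.329656, -79.315572, 0.0],
})

# an outer merge with indicator=True records which side each row came from;
# validate='one_to_one' raises MergeError if a postal code is duplicated on either side
check = pd.merge(left, right, on='Postal code', how='outer',
                 indicator=True, validate='one_to_one')

# rows present in only one of the two frames
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Postal code', '_merge']])
```

If `unmatched` comes back empty on the real data, the inner merge did not lose any rows.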


# Now on to the cluster analysis!

**First, we need to filter the boroughs to keep only those with Toronto in their name.
This proved to be quite difficult for my limited skills, so I built a list of the boroughs that contain 'Toronto', and then filtered the dataframe with the .isin() method**

In [17]:
regespr1 = re.compile(r'toronto', re.IGNORECASE) # matches "Toronto" whether or not the t is capitalized :-D

In [18]:
# collect the unique borough names, then keep those that match the regex
borough_list = df5.Borough.unique().tolist()
borough_list_filtered = list()
for item in borough_list:
    result = regespr1.search(item)
    if result is not None:
        borough_list_filtered.append(item)

print(borough_list_filtered)


['Downtown Toronto', 'East Toronto', 'West Toronto', 'Central Toronto']


### Below, I filter the dataframe to create a new one with the .isin() method

In [19]:
df6 = df5[df5.Borough.isin(borough_list_filtered)].reset_index(drop=True)
df6

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
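
The same filter can also be written in one step with the vectorized `str.contains` method, which skips the intermediate list. A minimal sketch on a made-up frame called `boroughs` (`case=False` makes the match case-insensitive):

```python
import pandas as pd

boroughs = pd.DataFrame({
    'Borough': ['North York', 'Downtown Toronto', 'East Toronto', 'Scarborough'],
    'Neighborhood': ['Parkwoods', 'Regent Park, Harbourfront', 'The Beaches', 'Malvern, Rouge'],
})

# boolean mask: True where the borough name contains "toronto" in any casing
mask = boroughs['Borough'].str.contains('toronto', case=False, regex=False)
toronto_only = boroughs[mask].reset_index(drop=True)
print(toronto_only)
```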


In [23]:
print('importation: begin!')
import numpy as np # library to handle data in a vectorized manner

import json # library to handle JSON files

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium # -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

importation: begin!
Collecting folium
  Downloading https://files.pythonhosted.org/packages/fd/a0/ccb3094026649cda4acd55bf2c3822bb8c277eb11446d13d384e5be35257/folium-0.10.1-py2.py3-none-any.whl (91kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/81/6d/31c83485189a2521a75b4130f1fee5364f772a0375f81afff619004e5237/branca-0.4.0-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.0 folium-0.10.1
Libraries imported.


### Following the Manhattan exercise, we will run the clustering on latitude and longitude only, since there are no other numerical values there

In [32]:
# set number of clusters
kclusters = 5

# the only useful columns for the clustering are latitude and longitude
df7 = df6[['Latitude', 'Longitude']]
df7.head()


Unnamed: 0,Latitude,Longitude
0,43.65426,-79.360636
1,43.662301,-79.389494
2,43.657162,-79.378937
3,43.651494,-79.375418
4,43.676357,-79.293031


**The cells below will create the clusters' labels**

In [33]:
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df7)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:100]

array([0, 0, 0, 0, 4, 0, 0, 3, 0, 1, 0, 3, 4, 0, 3, 4, 0, 4, 2, 2, 2, 2,
       1, 2, 3, 1, 2, 3, 1, 2, 3, 2, 0, 0, 0, 0, 0, 0, 4], dtype=int32)
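
The value `kclusters = 5` is a fixed choice here rather than something derived from the data. One common heuristic for picking it is the elbow method: fit k-means for several values of k and look at where the inertia (the within-cluster sum of squared distances) stops dropping quickly. The sketch below uses synthetic points around three made-up centres rather than `df7`, so the numbers say nothing about Toronto.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# three synthetic blobs of 20 points each (made-up data, not the Toronto coordinates)
centres = np.array([[43.65, -79.38], [43.70, -79.40], [43.67, -79.30]])
points = np.vstack([c + rng.normal(scale=0.005, size=(20, 2)) for c in centres])

# fit k-means for k = 1..6 and record the inertia of each fit
k_values = range(1, 7)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 5))
```

With well-separated groups like these, the inertia falls sharply up to the true number of groups and flattens afterwards; on real coordinates the elbow is usually less clear-cut.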

**The following cell adds each cluster label to the corresponding row of the dataframe**

In [34]:
# add clustering labels
df6.insert(0, 'Cluster Labels', kmeans.labels_)

df6.head()

Unnamed: 0,Cluster Labels,Postal code,Borough,Neighborhood,Latitude,Longitude
0,0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,0,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,0,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,4,M4E,East Toronto,The Beaches,43.676357,-79.293031
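
A quick way to inspect the result is to group by the new label column and look at the size and mean position of each cluster. A minimal sketch on a made-up frame called `labelled`, using a few rows with the same column names as `df6`:

```python
import pandas as pd

labelled = pd.DataFrame({
    'Cluster Labels': [0, 0, 4, 0],
    'Neighborhood': ['Regent Park, Harbourfront', 'St. James Town', 'The Beaches', 'Berczy Park'],
    'Latitude': [43.65426, 43.651494, 43.676357, 43.644771],
    'Longitude': [-79.360636, -79.375418, -79.293031, -79.373306],
})

# one row per cluster: number of neighborhoods and mean coordinates
summary = labelled.groupby('Cluster Labels').agg(
    neighborhoods=('Neighborhood', 'count'),
    mean_latitude=('Latitude', 'mean'),
    mean_longitude=('Longitude', 'mean'),
)
print(summary)
```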


In [36]:
latitude = 43.65
longitude = -79.36

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df6['Latitude'], df6['Longitude'], df6['Neighborhood'], df6['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters