## Introduction

This notebook analyzes the neighborhoods of Toronto using postal code data from Wikipedia. Several techniques are used to retrieve the data, clean it, and put it into a usable shape. Beautiful Soup is used to extract the postal code data from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. The Foursquare API is also used to obtain venue data in order to analyze the neighborhoods and cluster them.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>
  
</font>
</div>

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import json  # library to handle JSON files
import ssl
import urllib.request, urllib.parse, urllib.error  # fetch the Wikipedia page

import numpy as np  # library to handle data in a vectorized manner

import pandas as pd  # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
from pandas import json_normalize  # transform JSON into a pandas dataframe (top-level import needs pandas >= 1.0)

import requests  # library to handle requests
from bs4 import BeautifulSoup  # HTML parsing

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium  # map rendering library

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset

We use the Beautiful Soup package to parse the HTML of the page.

Notice that all the relevant data is in the *td* elements (table cells), which together list each postal code with its borough and neighborhood.

In [2]:
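# Ignore SSL certificate errors (this disables certificate verification,
# so it is only suitable for a quick experiment)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the table cells (<td> tags)
tags = soup('td')

# Concatenate the cell texts, comma-separated, until the first empty cell
# (taken here as the end of the postal code table).
i = 0
post = ''
while i < len(tags) and tags[i].text != '':
    post = post + ',' + tags[i].text
    i = i + 1

# Split into rows on '\n', then into fields on ','. Each row starts with an
# empty field because of the leading comma.
postal = [row.split(',') for row in post.split('\n') if row != '']

len(postal)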

289
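The cell above joins every cell into one string and splits it again, which breaks if a cell's text ever contains a comma. As a minimal sketch of a row-wise alternative, the snippet below walks the `<tr>` rows directly. The `sample_html` string and the `parse_rows` helper are hypothetical stand-ins for illustration; the layout of the real Wikipedia page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real page may differ.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

def parse_rows(html):
    """Return one [postal code, borough, neighborhood] list per table row."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        # get_text(strip=True) removes the trailing newline in the last cell
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == 3:  # header rows use <th>, so they yield no <td> cells
            rows.append(cells)
    return rows

rows = parse_rows(sample_html)
print(rows)
```

Because each row is read cell by cell, no empty leading field is produced and no column has to be dropped afterwards.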

#### Transform the data into a *pandas* dataframe

The next task is to transform this nested Python list of lists into a *pandas* dataframe.  

In [3]:
df = pd.DataFrame(postal)
df = df.drop(columns=0)  # column 0 is the empty field created by the leading comma
df.columns = ['PostalCode', 'Borough', 'Neighborhood']
print(df.head(3))
print(df.shape)

  PostalCode       Borough  Neighborhood
0        M1A  Not assigned  Not assigned
1        M2A  Not assigned  Not assigned
2        M3A    North York     Parkwoods
(289, 3)
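The same construction can be sketched on a couple of rows. The `sample_rows` list below is hypothetical data in the shape produced by the parsing step (including the empty leading field); the column names are passed at construction time instead of being assigned afterwards.

```python
import pandas as pd

# Hypothetical sample rows, each starting with the empty leading field.
sample_rows = [
    ['', 'M1A', 'Not assigned', 'Not assigned'],
    ['', 'M3A', 'North York', 'Parkwoods'],
]
columns = ['PostalCode', 'Borough', 'Neighborhood']

# Slice off the empty first field of each row, then name the columns directly.
df_sample = pd.DataFrame([row[1:] for row in sample_rows], columns=columns)
print(df_sample.shape)
```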


We remove all rows whose borough is not assigned.

In [4]:
df = df[df['Borough'] != 'Not assigned'].copy()  # copy so later edits do not touch a slice of the original
print('A total of', df.shape[0], 'rows have an assigned borough')

A total of 212 rows have an assigned borough
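Note that `df.shape[0]` counts rows, and a borough usually spans several rows. A small sketch on hypothetical sample data shows the difference between the number of remaining rows and the number of distinct boroughs:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village', 'Harbourfront'],
})

# Boolean mask: keep only the rows with an assigned borough.
assigned = df_sample[df_sample['Borough'] != 'Not assigned']

n_rows = len(assigned)                       # rows that survive the filter
n_boroughs = assigned['Borough'].nunique()   # distinct borough names among them
print(n_rows, n_boroughs)
```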


In this section, each neighborhood that is not assigned takes the name of its borough.  

In [5]:
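# Where the neighborhood is 'Not assigned', use the borough name instead.
# Writing to the rows yielded by iterrows() is not guaranteed to update df,
# so the column is rebuilt in one vectorized step.
df = df.assign(
    Neighborhood=df['Neighborhood'].where(df['Neighborhood'] != 'Not assigned', df['Borough'])
)
df.head(8)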

|    | PostalCode | Borough          | Neighborhood     |
|----|------------|------------------|------------------|
| 2  | M3A        | North York       | Parkwoods        |
| 3  | M4A        | North York       | Victoria Village |
| 4  | M5A        | Downtown Toronto | Harbourfront     |
| 5  | M5A        | Downtown Toronto | Regent Park      |
| 6  | M6A        | North York       | Lawrence Heights |
| 7  | M6A        | North York       | Lawrence Manor   |
| 8  | M7A        | Queen's Park     | Queen's Park     |
| 10 | M9A        | Etobicoke        | Islington Avenue |
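One common way to express this kind of conditional replacement is a boolean mask with `.loc`. The sketch below uses hypothetical sample data and is only one possible formulation:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Select the rows to fix, then copy the borough name into them.
mask = df_sample['Neighborhood'] == 'Not assigned'
df_sample.loc[mask, 'Neighborhood'] = df_sample.loc[mask, 'Borough']

print(df_sample['Neighborhood'].tolist())
```

The assignment through `.loc` writes into the dataframe itself, so it does not depend on whether a row iterator returns a view or a copy.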


In this part, we combine all the neighborhoods that share a postal code into a single row.

In [6]:
# Join the neighborhoods that share a postal code into one comma-separated string
# (transform broadcasts the joined string back to every row of the group),
# then keep only the first row for each postal code.
joined = df.groupby('PostalCode')['Neighborhood'].transform(lambda s: ','.join(s))
dfclean = df.assign(Neighborhood=joined).drop_duplicates(subset='PostalCode')
dfclean.head(4)

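|   | PostalCode | Borough          | Neighborhood                    |
|---|------------|------------------|---------------------------------|
| 2 | M3A        | North York       | Parkwoods                       |
| 3 | M4A        | North York       | Victoria Village                |
| 4 | M5A        | Downtown Toronto | Harbourfront,Regent Park        |
| 6 | M6A        | North York       | Lawrence Heights,Lawrence Manor |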
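The same grouping can also be written as a `groupby` aggregation, which returns one row per postal code with a fresh 0-based index. This is a sketch on hypothetical sample data, assuming every postal code maps to a single borough:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# sort=False keeps the groups in order of first appearance.
merged = (
    df_sample.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(','.join)     # one comma-separated string per group
    .reset_index()     # turn the group keys back into columns
)
print(merged)
```

Unlike the cell above, this variant does not preserve the original row labels, which matters only if later steps rely on them.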


In [8]:
dfclean.shape

(103, 3)

The resulting dataset has 103 rows, one for each distinct postal code.
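A quick check can confirm that each postal code appears only once. The helper name `check_one_row_per_code` and the sample frame below are hypothetical; this is a sketch of one way to guard the result, not part of the original analysis.

```python
import pandas as pd

def check_one_row_per_code(frame, column='PostalCode'):
    """Return the row count, or raise if any value in `column` repeats."""
    repeated = frame.loc[frame[column].duplicated(keep=False), column].unique().tolist()
    if repeated:
        raise ValueError(f'Repeated values in {column}: {repeated}')
    return len(frame)

# Hypothetical sample data for illustration only.
clean_sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Harbourfront,Regent Park'],
})
n_codes = check_one_row_per_code(clean_sample)
print(n_codes)
```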