# Segmentation and Clustering

## Part 1: Get the Toronto neighborhood data

We will get the Toronto neighborhood information from Wikipedia, using the URL below:

url = https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Next, we will wrangle the data, clean it, and read it into a pandas dataframe.

We need the following two libraries to complete this task:
* [urllib](https://docs.python.org/2/library/urllib.html)
* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [1]:
import pandas as pd
import numpy as np

import sys
if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    from urllib import urlopen
from bs4 import BeautifulSoup

import folium
from geopy.geocoders import Nominatim
import requests

from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

### Let's save the URL in a variable named link

In [2]:
link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

### Then use the 'urlopen' function to open the page

In [3]:
wikipage = urlopen(link)

In [4]:
if sys.version_info[0] == 3:
    wikipage = wikipage.read()

### Let's read the page by creating a BeautifulSoup object

In [5]:
soup = BeautifulSoup(wikipage, "lxml")

### Let's check the title of the page

In [6]:
soup.title.string

'List of postal codes of Canada: M - Wikipedia'

### To get the table, we will find the first table tag with the class 'wikitable sortable'

In [7]:
tables = soup.find('table', class_='wikitable sortable')

### Now, we will extract the first table row (the header row), identified by the 'tr' tag

In [8]:
rows = tables.find('tr')
print(rows)

<tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>


### Using the .findAll function, we will extract the column headers from that row

In [9]:
headers = []
header_rows = rows.findAll('th')
for header in header_rows:
    headers.append(header.find(text = True).strip())
headers

['Postal Code', 'Borough', 'Neighborhood']

### Now let's extract each column

In [10]:
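# One list per column of the Wikipedia table
Postal_Code = []
Borough = []
Neighborhood = []

# Walk every table row; the header row has no 'td' cells and is skipped
for row in tables.findAll("tr"):
    cell = row.findAll('td')
    if len(cell) > 0:
        # Keep the first text node of each cell, without surrounding whitespace
        Postal_Code.append(cell[0].find(text=True).strip())
        Borough.append(cell[1].find(text=True).strip())
        Neighborhood.append(cell[2].find(text=True).strip())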
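
The extraction loop above can be tried offline on a tiny hand-written table. The sketch below is a minimal illustration, not the notebook's own code: `sample_html`, `sample_headers`, and `sample_records` are hypothetical names, the HTML is made up to mirror the page layout, and it swaps in `get_text(strip=True)` and Python's built-in `html.parser` for `find(text=True)` and `lxml`. One reason to consider `get_text` is that `find(text=True)` returns only the first text node of a cell, so a cell that mixes plain text and links would be cut short.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M5A</td><td><a href="#">Downtown Toronto</a></td>
      <td>Regent Park, <a href="#">Harbourfront</a></td></tr>
</table>
"""

# "html.parser" ships with Python, so this sketch does not need lxml.
sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")

# Column names come from the <th> cells of the header row.
sample_headers = [th.get_text(strip=True) for th in sample_table.find_all("th")]

# Data rows are the <tr> elements that contain <td> cells.
sample_records = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row has only <th> cells and is skipped
        # get_text joins every text node in the cell, including link text.
        sample_records.append([td.get_text(" ", strip=True) for td in cells])

print(sample_headers)
print(sample_records)
```

Collecting whole rows in one list, rather than three parallel lists, keeps each postal code next to its borough and neighborhood, which makes a misaligned column harder to produce by accident.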

### Combine all the lists into a dataframe

In [11]:
df_toronto = pd.DataFrame()
df_toronto[headers[0]] = Postal_Code
df_toronto[headers[1]] = Borough
df_toronto[headers[2]] = Neighborhood
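
As an alternative to assigning the columns one at a time, the dataframe can be built in a single call. This is a minimal sketch on made-up stand-in lists; `sample_headers`, `sample_postal`, `sample_borough`, `sample_neighborhood`, and `df_sample` are hypothetical names, not variables from this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the header list and the three scraped lists.
sample_headers = ["Postal Code", "Borough", "Neighborhood"]
sample_postal = ["M3A", "M4A"]
sample_borough = ["North York", "North York"]
sample_neighborhood = ["Parkwoods", "Victoria Village"]

# Pair each header with its list, then build the dataframe in one step.
df_sample = pd.DataFrame(
    dict(zip(sample_headers, [sample_postal, sample_borough, sample_neighborhood]))
)

print(df_sample)
```

Because the dict preserves insertion order, the columns appear in the same order as the headers.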

### Let's check our dataframe

In [12]:
df_toronto

|  | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


### Now we will remove all rows whose Borough is 'Not assigned'

In [13]:
df_toronto = df_toronto[df_toronto.Borough != 'Not assigned'].dropna()

In [14]:
df_toronto.reset_index(inplace = True)

In [15]:
del df_toronto['index']
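
The three cells above (filter, reset the index, delete the leftover 'index' column) can also be written as one chain, because `reset_index(drop=True)` discards the old index instead of keeping it as a column. A minimal sketch on made-up rows follows; `df_sample` and `df_clean` are hypothetical names.

```python
import pandas as pd

# Hypothetical three-row stand-in for the scraped dataframe.
df_sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

df_clean = (
    df_sample[df_sample["Borough"] != "Not assigned"]  # drop unassigned boroughs
    .dropna()                                           # drop rows with missing values
    .reset_index(drop=True)                             # renumber from 0, no 'index' column
)

print(df_clean)
```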

In [16]:
df_toronto

|  | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


In [17]:
df_toronto.shape

(103, 3)

## Part 2: Get the location coordinates from the CSV file Geospatial_Coordinates

link: http://cocl.us/Geospatial_data

In [19]:
df_geoloc = pd.read_csv('Geospatial_Coordinates.csv')
df_geoloc

|  | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| ... | ... | ... | ... |
| 98 | M9N | 43.706876 | -79.518188 |
| 99 | M9P | 43.696319 | -79.532242 |
| 100 | M9R | 43.688905 | -79.554724 |
| 101 | M9V | 43.739416 | -79.588437 |


### Merge the dataframes df_toronto and df_geoloc on the column 'Postal Code'

In [20]:
df_toronto_geoloc = pd.merge(df_toronto, df_geoloc, on="Postal Code")
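
`pd.merge` performs an inner join by default, so a postal code with no match in the coordinates file would be dropped without any warning. One way to check for that, sketched here on made-up frames, is a left join with `indicator=True`, plus `validate="one_to_one"` to raise an error if a postal code is duplicated on either side. The names `left`, `right`, `checked`, and `unmatched` are hypothetical, and the code `M9Z` is invented to force a miss.

```python
import pandas as pd

# Hypothetical stand-ins for df_toronto and df_geoloc.
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is made up and has no coordinates
    "Borough": ["North York", "North York", "Example Borough"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

checked = pd.merge(
    left,
    right,
    on="Postal Code",
    how="left",              # keep every row of the left frame
    validate="one_to_one",   # raise if a postal code repeats on either side
    indicator=True,          # adds a "_merge" column: both / left_only / right_only
)

# Postal codes that found no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

If `unmatched` is empty, the default inner join and this left join keep the same rows.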

In [21]:
df_toronto_geoloc

|  | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.654260 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... | 43.662744 | -79.321558 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 |
