<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
<font size = 3>

1. <a href="#item1">Import needed libraries</a>

2. <a href="#item2">Download and Explore Dataset from Wiki</a>

</font>
</div>


## 1. Import needed libraries


In [190]:
import numpy as np # library to handle data in a vectorized manner

!pip3 install lxml
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# import libraries for scraping the Wikipedia page (requests is already imported above)
import urllib.request
import time
!pip install beautifulsoup4
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


## 2. Download and Explore Dataset from Wiki


In [191]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

html = urlopen(url) 

soup = BeautifulSoup(html, 'html.parser')

# Take the first table with class "wikitable"; it is assumed to hold the postal code list
table = soup.find('table', {"class": "wikitable"})

In [192]:
#Create lists to hold the data we extract
postalCodes = []
boroughs = []
neighbourhoods = []

rows = table.find_all('tr')

for row in rows:
    cells = row.find_all('td')

    # Skip header rows (no <td> cells) and any row without all three expected cells
    if len(cells) >= 3:
        postalCodes.append(cells[0].text.strip())
        boroughs.append(cells[1].text.strip())
        neighbourhoods.append(cells[2].text.strip())

data = {'PostalCode': postalCodes, 'Borough': boroughs, 'Neighbourhood': neighbourhoods}
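
The row-by-row extraction can be tried offline on a small inline table. This is a minimal sketch, not the notebook's actual run: `SAMPLE_HTML` and `parse_postal_table` are hypothetical names introduced for illustration, and the sample rows only mimic the shape of the Wikipedia table.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not the real page content.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M9Z</td></tr>
</table>
"""


def parse_postal_table(html):
    """Return one dict per table row that has all three expected cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable"})
    records = []
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) < 3:
            continue  # header row (only <th> cells) or an incomplete row
        records.append(
            {"PostalCode": cells[0], "Borough": cells[1], "Neighbourhood": cells[2]}
        )
    return records


records = parse_postal_table(SAMPLE_HTML)
print(records)
```

Requiring three cells before indexing `cells[2]` keeps a short or malformed row from raising an `IndexError`.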

In [193]:
#Transform the data into a pandas dataframe
df1 = pd.DataFrame.from_dict(data)
df1.head()
#df1["PostalCode"].count()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [194]:
not_assigned = "Not assigned"

In [195]:
#Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.         
indexRows = df1[ df1['Borough'] == not_assigned ].index

#print(indexRows)

df1.drop(indexRows , inplace=True)
df1.head()
#df1["PostalCode"].count()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
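
The same filtering can also be written as a boolean mask, which avoids `inplace=True` and makes it easy to renumber the index. A minimal sketch on toy rows (the `toy` and `assigned` names and the values are illustrative assumptions, not the scraped data):

```python
import pandas as pd

# Toy rows shaped like the scraped table; values are illustrative only.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Whether to reset the index is a design choice: the notebook keeps the original row labels (2, 3, 4, ...), while this sketch starts again from 0.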


In [196]:
#More than one neighborhood can exist in one postal code area. If a postal code such as M5A
#is listed in more than one row (for example, once for Harbourfront and once for Regent Park),
#those rows are combined into one row with the neighborhoods separated by a comma,
#matching the format of row 4 (M5A) in the table above.
df2 = df1.groupby(['PostalCode', 'Borough'], as_index = False).agg({'Neighbourhood': ','.join})
df2.head()
#df2["PostalCode"].count()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
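
To see the grouping step merge rows, here is a small sketch on toy data where M5A really does appear twice. The frame is made up for illustration, and the separator `", "` (comma plus space) is an assumption chosen to match the combined cells shown in the outputs; the notebook cell itself joins with `','`.

```python
import pandas as pd

# Toy rows in a one-row-per-neighbourhood layout (illustrative only).
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M1G"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Woburn"],
})

# One output row per (PostalCode, Borough); neighbourhood names are joined in order.
merged = toy.groupby(["PostalCode", "Borough"], as_index=False).agg(
    {"Neighbourhood": ", ".join}
)
print(merged)
```

Postal codes that already occupy a single row, like M1G here, pass through unchanged.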


In [197]:
#If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.
df2['Neighbourhood'] = np.where((df2.Neighbourhood == not_assigned),df2.Borough,df2.Neighbourhood)
df2.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
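
None of the five rows shown above has a "Not assigned" neighborhood, so the effect of the fallback is not visible there. A minimal sketch on toy data (the values are assumptions for illustration) shows what the `np.where` call does when such a row exists:

```python
import numpy as np
import pandas as pd

# Toy rows: the first has a borough but no assigned neighbourhood (illustrative only).
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighbourhood": ["Not assigned", "Woburn"],
})

# Where the neighbourhood is "Not assigned", take the borough name; otherwise keep it.
toy["Neighbourhood"] = np.where(
    toy["Neighbourhood"] == "Not assigned", toy["Borough"], toy["Neighbourhood"]
)
print(toy)
```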


In [199]:
df2.shape

(103, 3)