<a href="https://colab.research.google.com/github/MarufAnsari/Coursera_Capstone/blob/master/Segmenting_and_Clustering_Neighborhoods_in_the_city_of_Toronto%2C_Canada.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Segmenting and Clustering Neighborhoods in the city of Toronto, Canada**

**1. Importing Libraries**

In [23]:
import numpy as np                # library to handle data in a vectorized manner
import pandas as pd               # library for data analysis
import requests                   # library to handle requests
from bs4 import BeautifulSoup     # library to parse HTML and XML documents

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json                                   # library to handle JSON files
from geopy.geocoders import Nominatim         # convert an address into latitude and longitude values
from pandas import json_normalize             # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# map rendering library
import folium 

print('Libraries imported.')

Libraries imported.


**2. Scrape data from the Wikipedia page into a DataFrame**

Get the HTML of the Wikipedia page that lists the postal codes of Toronto (those beginning with M).

We parse the HTML with BeautifulSoup, read each table row into lists, and build a DataFrame from those lists.

Rows whose borough is "Not assigned" are removed in the next step.

In [30]:
# request the HTML page from the URL
wiki_page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(wiki_page, 'html.parser')

# lists to store the table data
PostalCode_ = []
Borough_ = []
Neighborhood_ = []

# append the cells of each data row to the lists
# (assumes the first table holds one postal code per row; header rows have no <td> cells)
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 0:
        PostalCode_.append(cells[0].text.rstrip('\n'))
        Borough_.append(cells[1].text.rstrip('\n'))
        Neighborhood_.append(cells[2].text.rstrip('\n'))

# create the DataFrame from the lists
toronto_df = pd.DataFrame({"PostalCode": PostalCode_, "Borough": Borough_, "Neighborhood": Neighborhood_})

toronto_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
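
The row-parsing loop can be tried offline on a small inline table. This is a minimal sketch, not the notebook's actual scrape: the HTML snippet and the `sample_` names are made up for illustration, and it assumes one postal code, borough, and neighborhood per table row.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content)
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for row in sample_soup.find("table").find_all("tr"):
    cells = row.find_all("td")
    if cells:  # the header row has only <th> cells, so it is skipped
        # strip() removes the newline that ends the last cell of each row
        records.append([cell.text.strip() for cell in cells[:3]])

sample_df = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```

Collecting each row as one list keeps the three columns aligned, which is an alternative to maintaining three separate lists.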


**3. Dropping the rows with a borough that is "Not assigned"**



In [33]:
# dropping rows with a "Not assigned" borough
# (a boolean mask avoids the chained inplace replace on a single column)
toronto_df = toronto_df[toronto_df['Borough'] != 'Not assigned']
toronto_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
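
Filtering with a boolean mask is one way to drop these rows. The sketch below runs on toy data with values taken from the output above; the names `sample_df` and `assigned_df` are illustrative only.

```python
import pandas as pd

# Toy data in the same shape as toronto_df
sample_df = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose borough is assigned; the original index labels survive
assigned_df = sample_df[sample_df["Borough"] != "Not assigned"]
print(assigned_df)

# reset_index(drop=True) renumbers the rows from 0 if a clean index is preferred
print(assigned_df.reset_index(drop=True))
```

The surviving rows keep their original labels (2 and 3 here), which matches the index shown in the output above.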


**4. Group neighborhoods that share a postal code and borough**

In [43]:
# grouping by postal code and borough, joining the neighborhood names with commas
toronto_df_grp = toronto_df.groupby(['PostalCode','Borough'], as_index=False).agg(lambda x: ','.join(x))
toronto_df_grp.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park |
| 7 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge |
| 8 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
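
In the scraped table each postal code already sits on a single row, so the grouping changes little here. The sketch below shows the case where it matters, using a hypothetical layout with one neighborhood per row; `sample_df` and `grouped_df` are illustrative names.

```python
import pandas as pd

# Hypothetical layout: the same postal code repeated, one neighborhood per row
sample_df = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern", "Rouge", "Woburn"],
})

# Group on the key columns and join the neighborhood names of each group
grouped_df = (
    sample_df
    .groupby(["PostalCode", "Borough"], as_index=False)
    .agg({"Neighborhood": ", ".join})
)
print(grouped_df)
```

Passing `as_index=False` keeps `PostalCode` and `Borough` as ordinary columns instead of moving them into the index.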


**5. Replacing a "Not assigned" neighborhood with the respective borough name**

In [44]:
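# replacing "Not assigned" neighborhoods with the borough name
# (rows yielded by iterrows() may be copies, so the update is written through .loc instead)
not_assigned = toronto_df_grp['Neighborhood'] == 'Not assigned'
toronto_df_grp.loc[not_assigned, 'Neighborhood'] = toronto_df_grp.loc[not_assigned, 'Borough']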

toronto_df_grp.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park |
| 7 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge |
| 8 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
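
One vectorized way to make this replacement is `Series.mask`, which swaps in values from another Series wherever a condition holds. The rows below are made up for illustration and do not come from the scraped table.

```python
import pandas as pd

# Hypothetical rows: the first has no neighborhood assigned
sample_df = pd.DataFrame({
    "PostalCode": ["M0X", "M1G"],
    "Borough": ["Example Borough", "Scarborough"],
    "Neighborhood": ["Not assigned", "Woburn"],
})

# Where the condition is True, take the value from the Borough column instead
is_unassigned = sample_df["Neighborhood"] == "Not assigned"
sample_df["Neighborhood"] = sample_df["Neighborhood"].mask(is_unassigned, sample_df["Borough"])
print(sample_df)
```

Rows that already have a neighborhood are left untouched, so the operation is safe to run even when nothing needs replacing.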


**6. Matching the table shown in the assignment question**

In [52]:
postcode_list = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

# DataFrame.append was removed in pandas 2.0, so collect the matching rows
# in list order and concatenate them once
matches = [toronto_df_grp[toronto_df_grp["PostalCode"] == postcode] for postcode in postcode_list]
test_df = pd.concat(matches, ignore_index=True)
    
test_df

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M5G | Downtown Toronto | Central Bay Street |
| 1 | M2H | North York | Hillcrest Village |
| 2 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 3 | M1J | Scarborough | Scarborough Village |
| 4 | M4G | East York | Leaside |
| 5 | M4M | East Toronto | Studio District |
| 6 | M1R | Scarborough | Wexford, Maryvale |
| 7 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... |
| 8 | M9L | North York | Humber Summit |
| 9 | M5V | Downtown Toronto | CN Tower, King and Spadina, Railway Lands, Har... |
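
Another way to pull rows in a chosen order is to index by postal code and select with `.loc`. This is a sketch on toy data; the `wanted` list is a hypothetical lookup order, and `.loc` raises a `KeyError` if any requested code is missing from the table.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "PostalCode": ["M1B", "M1G", "M1J"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Woburn", "Scarborough Village"],
})

wanted = ["M1J", "M1B"]  # hypothetical lookup order

# Selecting with a list of labels returns the rows in the order of that list
ordered_df = sample_df.set_index("PostalCode").loc[wanted].reset_index()
print(ordered_df)
```

This relies on each postal code appearing only once, which holds after the grouping step.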


In [46]:
toronto_df_grp.shape

(103, 3)
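
Beyond the shape, a few quick checks can confirm that the cleaning steps took effect. The helper below is a hypothetical sketch (the name `check_postal_table` is not part of the notebook) and is demonstrated on toy data rather than on the scraped table.

```python
import pandas as pd

def check_postal_table(df):
    """Hypothetical helper: raise AssertionError if the cleaned table looks wrong."""
    assert list(df.columns) == ["PostalCode", "Borough", "Neighborhood"]
    assert df["PostalCode"].is_unique, "each postal code should appear once"
    assert not (df["Borough"] == "Not assigned").any(), "unassigned boroughs remain"
    assert not (df["Neighborhood"] == "Not assigned").any(), "unassigned neighborhoods remain"
    return df.shape

# Toy data standing in for toronto_df_grp
sample_df = pd.DataFrame({
    "PostalCode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Woburn"],
})

print(check_postal_table(sample_df))
```

Calling the helper on `toronto_df_grp` would be expected to return the shape shown above, provided the earlier steps ran as intended.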