# Segmentation and Clustering
In this assignment, I will segment and cluster the neighborhoods in the city of Toronto based on postal code and borough information.

## Sourcing Data & Data Wrangling
The neighborhood data is not readily available as a dataset on the internet; however, a [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) has all the information we need to explore and cluster the neighborhoods in Toronto. \
I will use [Requests](https://realpython.com/python-requests/#the-get-request), the de facto standard HTTP library for Python, to fetch the Wikipedia page, then read the table into a pandas dataframe and clean it so that the data is in a structured format.

In [1]:
# Import necessary libraries to create the dataframe
import pandas as pd
import numpy as np
import requests 

The GET method retrieves data from a specified resource; here it fetches the Wikipedia page.

In [9]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# Send a single GET request for the page
toronto_postal_codes = requests.get(url)

# Check whether the request was successful
if toronto_postal_codes.status_code == 200:
    print('Success!')
elif toronto_postal_codes.status_code == 404:
    print('Not Found.')

Success!
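
The check above only covers the 200 and 404 cases, so any other status code prints nothing. As a minimal sketch of a broader check, the hypothetical helper `describe_status` below (not part of the original notebook) uses the standard library's `http.HTTPStatus` to label any code. Requests also provides `raise_for_status()` on the response object, which raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe_status(status_code: int) -> str:
    """Return a short, human-readable summary of an HTTP status code.

    Hypothetical helper for illustration; treats any 2xx code as success.
    """
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # The code is not a registered HTTP status
        phrase = "Unknown status"
    outcome = "Success" if 200 <= status_code < 300 else "Failure"
    return f"{outcome}: {status_code} {phrase}"


print(describe_status(200))  # Success: 200 OK
print(describe_status(404))  # Failure: 404 Not Found
print(describe_status(503))  # Failure: 503 Service Unavailable
```

In the notebook, this could be called as `describe_status(toronto_postal_codes.status_code)`.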


Now that we have the page, let's define the column names and read the table into a pandas dataframe.

In [3]:
# Expected columns, kept for reference; read_html takes the actual headers from the table's first row
column_names = ["Postal Code", "Borough", "Neighbourhood"]

# Parse the HTML tables in the response; read_html returns a list of dataframes
toronto_data = pd.read_html(toronto_postal_codes.text, header=0)
# The postal code table is the first table on the page
toronto_data = toronto_data[0]

toronto_data.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
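
To make the parsing step less of a black box, here is a rough, standard-library-only sketch of how an HTML table can be turned into rows. This is not how `pd.read_html` is implemented; the `TableRows` class and the inline HTML snippet are illustrative assumptions, with the sample rows taken from the output above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Small hand-written table that mimics the structure of the Wikipedia table
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

parser = TableRows()
parser.feed(html)

# The first row holds the headers, which is what header=0 selects in read_html
header, *records = parser.rows
print(header)
print(records)
```

In practice `pd.read_html` handles nested markup, multiple tables, and type inference, so the sketch is only meant to show the idea of reading the header row first and the data rows after it.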


This step deals with missing information. Where the 'Neighbourhood' is 'Not assigned', it is replaced by the 'Borough' value. Rows where the 'Borough' is 'Not assigned' are dropped.

In [4]:
# Replacing a missing 'Neighbourhood' value with the 'Borough' value
toronto_data['Neighbourhood'] = np.where(toronto_data['Neighbourhood'] == 'Not assigned', toronto_data['Borough'], toronto_data['Neighbourhood'])

# Dropping rows where the 'Borough' is unassigned
not_assigned = toronto_data[toronto_data["Borough"] == 'Not assigned'].index

toronto_data.drop(not_assigned, axis= 0, inplace=True)
toronto_data.reset_index(drop = True, inplace = True)

toronto_data

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
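
The two cleaning rules can be checked on a small hand-made table. The sketch below is an alternative formulation using boolean masks and `.loc` rather than `np.where` and `inplace=True`. The `M9Z` row and its "Example Borough" value are hypothetical and exist only to exercise the fallback rule.

```python
import pandas as pd

# Tiny sample: one row to drop, one complete row, one row needing the fallback
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Example Borough"],  # last value is made up
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: drop rows whose borough is unassigned
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighbourhood is unassigned, fall back to the borough name
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

# Renumber the rows from zero after dropping
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Dropping first and filling second gives the same result as the order used in the notebook, because a row with an unassigned borough is removed either way.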


Finally, all neighbourhoods that share the same 'Postal Code' are combined into a single comma-separated entry under that code.

In [5]:
# Combining neighborhoods with the same postal codes 
toronto_data = toronto_data.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
toronto_data.reset_index(inplace=True)

toronto_data.head()


|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
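
In the output above each postal code already appears once, so the effect of the grouping is not visible. The sketch below shows it on a hypothetical input in which the two `M5A` neighbourhoods are split across separate rows; that row layout is assumed purely for illustration.

```python
import pandas as pd

# Hypothetical layout: one neighbourhood per row, so M5A appears twice
rows = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group by code and borough, then join the neighbourhood names with ", ".
# sort=False keeps the groups in order of first appearance.
combined = (
    rows.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Selecting the `Neighbourhood` column before aggregating makes it explicit which column is being joined, which helps if other string columns are added to the dataframe later.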


In [7]:
toronto_data.shape

(103, 3)