<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto City (part 1 by Levent Bingol)</font></h1>

## Introduction
In this project, we will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. 
For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. We will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, we will replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

## Table of Contents

### Part 1

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Webscrape the Data from Wikipedia page</a>
    
2. <a href="#item2">Preprocess and Explore the Dataset</a>
    
</font>
</div>

### Import necessary libraries for part 1
Before we get the data and start exploring it, let's import all the dependencies that we will need.
In this project we rely in particular on the BeautifulSoup and requests libraries for web scraping.

In [31]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

# import web scraping libraries
from bs4 import BeautifulSoup
import requests # library to handle requests
#from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
print('Libraries imported.')

Libraries imported.


## 1. Webscrape the Data from Wikipedia page into Dataframe
There are two different methods to scrape the data, as shown below. We can use either of them to get the necessary HTML and data from the Wikipedia page.
    
    a. Pandas method
    
    b. BeautifulSoup method


## a. Pandas method
In this method we fetch the page with the requests library and let pandas parse the tables in the returned HTML text directly into dataframes.


In [32]:
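# method 1: fetch the page and let pandas parse its tables
from io import StringIO  # recent pandas versions warn when read_html is given a raw HTML string

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
torontoSource = requests.get(url).text
tables = pd.read_html(StringIO(torontoSource))

# result of the first method: the first table on the page holds the postal codes
neighborhoods1 = tables[0]
neighborhoods1.columns = ['PostalCode', 'Borough', 'Neighborhood']
neighborhoods1.head()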

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## b. BeautifulSoup method:

We can download the web page as an HTML file and then parse it with the lxml parser. The download can be done either with the !wget shell command or with the requests library (get method); BeautifulSoup then handles the parsing.

In [33]:
# There are two alternatives for getting the HTML; in both cases BeautifulSoup parses it with the lxml parser.
# alternative 1: download the HTML file with wget and parse the saved file

!wget -q -O 'canadapost.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('Data downloaded!')
with open('canadapost.html') as can_html_file:
    soup=BeautifulSoup(can_html_file,'lxml')
    
# alternative 2: fetch the HTML with requests and parse the response text
canadasource= requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup=BeautifulSoup(canadasource,'lxml')

Data downloaded!


Now, with the following code we extract the table from the HTML into the tableCanada element using BeautifulSoup. In the following steps we will turn this information into a pandas dataframe. You can prettify the element and examine it to see the hierarchy of the data.

In [34]:
tableCanada=soup.find('table',class_='wikitable sortable')
#type(tableCanada)
#print(tableCanada.prettify())

When we examine tableCanada we see that the information we need sits inside the <td> elements, which appear in repeating groups of postal code, borough, and neighborhood. To pull these cells out of the HTML element we use the find_all method of the BeautifulSoup library.

In [35]:
# table content as a flat list; cells come in groups of three: postal code, borough, neighborhood
tableContent = tableCanada.find_all('td')
tableContent[0:9]

[<td>M1A</td>, <td>Not assigned</td>, <td>Not assigned
 </td>, <td>M2A</td>, <td>Not assigned</td>, <td>Not assigned
 </td>, <td>M3A</td>, <td><a href="/wiki/North_York" title="North York">North York</a></td>, <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td>]

Now we have a list, and with the following code we can explore the data inside it and see how to get the postal code, borough, and neighborhood information.

In [36]:
#get the postalcode, borough and neighborhood
postalcode=tableContent[0].text
borough=tableContent[1].text
neighborhood=tableContent[2].text
print(postalcode,borough,neighborhood)
print('The length of tableContent list: ',  len(tableContent))

M1A Not assigned Not assigned

The length of tableContent list:  867

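The flat list of `<td>` cells only works while every table row has exactly three cells. As a hedged alternative, the sketch below walks the table row by row through its `<tr>` elements. The HTML string is a hypothetical miniature of the Wikipedia table, not the real page content, and the standard-library `html.parser` is used in place of `lxml` so the sketch needs nothing beyond BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table (illustration only).
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")

sample_rows = []
for tr in sample_table.find_all("tr"):
    # get_text(strip=True) drops the trailing newline inside the last cell
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # skips the header row, which only has <th> cells
        sample_rows.append(cells)

print(sample_rows)
```

Reading one `<tr>` at a time keeps each postal code aligned with its own borough and neighborhood even if a row is malformed, at the cost of a slightly longer loop.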

Next we get ready to load the data into a dataframe. Let's define the columns of our neighborhoods dataframe, which will hold the Toronto postal code, borough, and neighborhood information.

In [37]:
# define the dataframe columns
column_names = ['PostalCode','Borough', 'Neighborhood'] 

# instantiate the dataframe
neighborhoods= pd.DataFrame(columns=column_names)
neighborhoods

Unnamed: 0,PostalCode,Borough,Neighborhood


Now we can fill the dataframe by parsing the information from the tableContent list with the following loop.

In [38]:
# fill the dataframe: every three consecutive cells form one row
# (DataFrame.append was removed in pandas 2.0, so rows are collected in a list first)
rows = []
for i in range(0, len(tableContent) - 2, 3):
    rows.append({'PostalCode': tableContent[i].text,
                 'Borough': tableContent[i+1].text,
                 'Neighborhood': tableContent[i+2].text.split('\n')[0]})
neighborhoods = pd.DataFrame(rows, columns=column_names)


Now we can take a look at the current shape and the content of our dataframe. Then we will do some preprocessing and cleaning in the following section.

In [39]:
neighborhoods.head()
#neighborhoods.shape

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [40]:
# comparing the two solutions, both methods end with the same dataframe
neighborhoods1.head()
#neighborhoods.shape

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


## 2. Preprocess and Explore the Dataset <a id='ref2'></a>

Our dataframe now consists of three columns: PostalCode, Borough, and Neighborhood. We only process the cells that have an assigned borough and ignore cells whose borough is Not assigned.

In [41]:
# drop the rows whose Borough is "Not assigned"
neighborhoods.drop(neighborhoods[neighborhoods['Borough']=="Not assigned"].index,axis=0, inplace=True)
neighborhoods.head(5)
# an alternative is to replace "Not assigned" with NaN and then drop those rows:
# neighborhoods.replace("Not assigned", np.nan, inplace = True)
# neighborhoods.dropna(subset=["Borough"], axis=0, inplace=True)
# neighborhoods.reset_index(drop=True, inplace=True)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights

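The same filtering can also be written as a boolean mask, which avoids the in-place drop. This is a minimal sketch on a hypothetical four-row sample that mirrors the scraped columns; the names `sample_df` and `assigned_df` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical sample rows mirroring the scraped table.
sample_df = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Harbourfront", "Regent Park"],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
assigned_df = sample_df[sample_df["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned_df)
```

Because the mask returns a new dataframe, the original `sample_df` stays untouched, which makes the step safe to re-run in a notebook.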

In [42]:
neighborhoods.shape

(212, 3)

The number of unique postal codes can be calculated as follows:

In [43]:
len(neighborhoods['PostalCode'].unique())

103

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [44]:
# note: set() removes duplicate values within a group but does not preserve their original order
TorontoDF=neighborhoods.groupby('PostalCode',as_index=False).agg(lambda x:','.join(set(x)))
TorontoDF.shape

(103, 3)

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. 

In [45]:
TorontoDF.loc[TorontoDF['Neighborhood']=='Not assigned','Neighborhood']=TorontoDF.loc[TorontoDF['Neighborhood']=='Not assigned','Borough']

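The two steps above (combining rows per postal code and filling unassigned neighborhoods) can also be sketched in a way that keeps the neighborhoods in their original order. This is an illustrative variant, not the notebook's own code: it fills the neighborhood before grouping, uses named aggregation, and joins with `", "` instead of `","`. The sample rows and the names `demo_df`, `unassigned`, and `combined_df` are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows, including a borough with an unassigned neighborhood.
demo_df = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M7A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Not assigned", "Parkwoods"],
})

# Where the neighborhood is unassigned, take the borough name instead.
unassigned = demo_df["Neighborhood"] == "Not assigned"
demo_df["Neighborhood"] = demo_df["Neighborhood"].mask(unassigned, demo_df["Borough"])

# One row per postal code: keep the first borough, join neighborhoods in row order.
combined_df = demo_df.groupby("PostalCode", as_index=False).agg(
    Borough=("Borough", "first"),
    Neighborhood=("Neighborhood", ", ".join),
)
print(combined_df)
```

Joining the group values directly, rather than through `set()`, gives the same row order on every run, which makes the printed table reproducible.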

In [46]:
TorontoDF.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern,Rouge"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Morningside,Guildwood,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Ionview,Kennedy Park,East Birchmount Park"
7,M1L,Scarborough,"Golden Mile,Clairlea,Oakridge"
8,M1M,Scarborough,"Cliffside,Scarborough Village West,Cliffcrest"
9,M1N,Scarborough,"Cliffside West,Birch Cliff"


In [47]:
TorontoDF.to_csv('TorontoDFpart1.csv', encoding='utf-8', index=False)

In [48]:
TorontoDF.shape

(103, 3)

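Since the CSV file is the hand-off to the next part, a quick round-trip check can confirm that comma-separated neighborhood names survive the export. The sketch below writes to an in-memory buffer rather than to disk; the sample rows and the names `export_df` and `restored_df` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample with a comma inside one field, as in the combined rows.
export_df = pd.DataFrame({
    "PostalCode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern,Rouge", "Woburn"],
})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
export_df.to_csv(buffer, index=False)

# Rewind and read the CSV back; pandas quotes fields that contain commas.
buffer.seek(0)
restored_df = pd.read_csv(buffer)
print(restored_df.shape)
```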
## This is the end of part 1, which answers the first question.