# Segmenting and Clustering Neighborhoods in Toronto
## Week 3 Applied Data Science Capstone - Question 1
This assignment explores, segments, and clusters the neighborhoods of the city of Toronto.

### Importing the required libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

The Toronto neighborhood data comes from a Wikipedia page that lists the postal codes, boroughs, and neighborhoods needed to explore and cluster the city.

The page is scraped, and the data is then wrangled, cleaned, and read into a pandas dataframe so that it is in a structured format.

### Scraping the data and finding/matching the required table

In [2]:
page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(page)
soup = BeautifulSoup(r.text, 'lxml')
#print(soup.prettify())

# soup.table returns the first <table> element on the page
match = soup.table
#print(match)

### Defining a function to turn the data into a pandas dataframe

In [3]:
def html_table_dataframe(table):
    """Convert a BeautifulSoup <table> element into a pandas dataframe."""
    nb_columns = 0
    nb_rows = 0
    column_names = []

    # First pass: count the data rows and columns and collect the column titles
    for row in table.find_all('tr'):
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            nb_rows+=1
            if nb_columns == 0:
                nb_columns = len(td_tags)
        th_tags = row.find_all('th') 
        if len(th_tags) > 0 and len(column_names) == 0:
            for th in th_tags:
                column_names.append(th.get_text())
       
    # Second pass: build an empty dataframe of the right size, then fill it cell by cell
    columns = column_names if len(column_names) > 0 else range(0, nb_columns)
    df = pd.DataFrame(columns=columns, index=range(0, nb_rows))
    row_marker = 0
    for row in table.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            df.iat[row_marker,column_marker] = column.get_text()
            column_marker += 1
        if len(columns) > 0:
            row_marker += 1
            
    return df
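
As a point of comparison, the same table-to-dataframe idea can be sketched by collecting the cell text into plain lists and building the dataframe once at the end. This is a minimal sketch, not the function used in this notebook: the name `table_to_dataframe` and the `SAMPLE_HTML` markup are made up for illustration, the built-in `'html.parser'` is used instead of `'lxml'` so the snippet has no extra dependency, and `get_text(strip=True)` removes the trailing newlines that the function above keeps.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample markup standing in for the Wikipedia table
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def table_to_dataframe(table):
    """Collect header and cell text into lists, then build the dataframe once."""
    header = []
    rows = []
    for tr in table.find_all('tr'):
        th_cells = tr.find_all('th')
        td_cells = tr.find_all('td')
        # Use the first row that contains <th> cells as the column titles
        if th_cells and not header:
            header = [th.get_text(strip=True) for th in th_cells]
        # Every row that contains <td> cells becomes one data row
        if td_cells:
            rows.append([td.get_text(strip=True) for td in td_cells])
    return pd.DataFrame(rows, columns=header or None)

sample_table = BeautifulSoup(SAMPLE_HTML, 'html.parser').table
sample_df = table_to_dataframe(sample_table)
print(sample_df)
```

Building the dataframe from a list of rows avoids having to know the number of rows in advance, which is why this variant needs only one pass over the table.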

In [4]:
df=html_table_dataframe(match)
df.head()


| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned\n |
| 1 | M2A | Not assigned | Not assigned\n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Harbourfront\n |


### Cleaning the data: removing rows with no assigned borough, filling unassigned neighborhoods with the borough name, and grouping neighborhoods by postcode and borough

In [5]:
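df.columns = ['Postcode', 'Borough', 'Neighborhood']
# Strip the trailing newline left over from the HTML cells
df['Neighborhood'] = df['Neighborhood'].str.replace("\n", "")
# Drop rows without an assigned borough; copy to avoid chained-assignment warnings
df = df[df.Borough != 'Not assigned'].copy()
# Where the neighborhood is not assigned, use the borough name instead
df.loc[df.Neighborhood == 'Not assigned', 'Neighborhood'] = df['Borough']
# Combine neighborhoods that share a postcode into one comma-separated string
df = df.groupby(['Postcode', 'Borough'], sort=False).agg(lambda x: ','.join(x))
df.reset_index(level=['Postcode', 'Borough'], inplace=True)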
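
To see what each cleaning step does, it can help to run the same sequence on a few hand-written rows. The sketch below assumes a small hypothetical dataframe `raw` that mimics the scraped table; the values are illustrative and are not taken from the full dataset.

```python
import pandas as pd

# Hypothetical sample rows mimicking the scraped table
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned\n', 'Harbourfront\n', 'Regent Park\n', 'Not assigned\n'],
})

cleaned = raw.copy()
# Remove the trailing newline from each neighborhood name
cleaned['Neighborhood'] = cleaned['Neighborhood'].str.strip()
# Drop rows whose borough is not assigned
cleaned = cleaned[cleaned['Borough'] != 'Not assigned'].copy()
# Fall back to the borough name where the neighborhood is not assigned
unassigned = cleaned['Neighborhood'] == 'Not assigned'
cleaned.loc[unassigned, 'Neighborhood'] = cleaned.loc[unassigned, 'Borough']
# Merge neighborhoods that share a postcode and borough into one row
cleaned = (
    cleaned.groupby(['Postcode', 'Borough'], sort=False)['Neighborhood']
    .agg(lambda names: ','.join(names))
    .reset_index()
)
print(cleaned)
```

The `M1A` row is dropped, the two `M5A` rows collapse into a single row, and the `M7A` row takes its borough name as its neighborhood.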

### The first 10 rows of the dataframe

In [6]:
df.head(10)

| | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront,Regent Park |
| 3 | M6A | North York | Lawrence Heights,Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge,Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens,Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson,Garden District |


### Saving the dataframe to a CSV file

In [7]:
df.to_csv('Q1.csv',index= True)
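
Because the file is written with `index=True`, the row index is stored as an extra, unnamed first column. The sketch below illustrates what that means when the file is read back. It uses a hypothetical two-row dataframe and an in-memory buffer instead of `Q1.csv`, so nothing is written to disk.

```python
import io
import pandas as pd

# Hypothetical two-row dataframe standing in for the cleaned data
frame = pd.DataFrame({
    'Postcode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})

# Write to an in-memory buffer with the index included, as in the cell above
buffer = io.StringIO()
frame.to_csv(buffer, index=True)

# Reading back without index_col turns the saved index into a regular column
buffer.seek(0)
with_extra_column = pd.read_csv(buffer)

# Passing index_col=0 restores the first column as the index
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)

print(list(with_extra_column.columns))
print(list(with_index.columns))
```

Whichever option is used when saving, reading the file with a matching `index_col` setting keeps the three data columns unchanged.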

### Using the .shape attribute to show the number of rows and columns of the dataframe

In [8]:
df.shape


(103, 3)