# Week 3 Assignment: Segmenting and Clustering Neighborhoods in Toronto

First, the imports:
- Pandas to hold and manipulate the data in a dataframe
- Requests to download the webpage
- BeautifulSoup to navigate the HTML

In [7]:
import pandas as pd
pd.options.display.max_rows = 200
import requests
from bs4 import BeautifulSoup

The web page used claims to contain a table with every postal code in Toronto, making it perfect for our needs.
*Unfortunately, I had no easy way to verify its accuracy, so the following lab assumes the Wikipedia article is accurate.*

In [8]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

html_data = requests.get(url).text

toronto_soup = BeautifulSoup(html_data,"html5lib")

The page is downloaded with ```requests.get``` and parsed into a BeautifulSoup object, which makes it possible to find the tables in the HTML using ```soup.find_all('table')```.

In [9]:
toronto_tables = toronto_soup.find_all('table')
len(toronto_tables)

3

Since there were only 3 tables, finding the correct one manually was easier than writing a loop to search for it. This was done by skimming the results of:  

```print(toronto_tables[n].prettify())``` for ```0```, ```1```, and ```2```  

Table 0 contains the neighborhood data.
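
For a page with many more tables, the same selection could be automated. The following is a minimal sketch, assuming the right table can be recognised by a known postal code appearing in its text. The helper `find_table_index`, the marker `'M3A'`, and the inline sample HTML are illustrative and not part of the original notebook; the sample is parsed with Python's built-in `html.parser` so it runs without a network connection.

```python
from bs4 import BeautifulSoup

def find_table_index(tables, marker):
    """Return the index of the first table whose text contains marker, or None."""
    for i, table in enumerate(tables):
        if marker in table.get_text():
            return i
    return None

# Tiny stand-in for the real page (hypothetical markup)
sample_html = """
<table><tr><td>Navigation links</td></tr></table>
<table><tr><td>M3A North York (Parkwoods)</td></tr></table>
"""
sample_tables = BeautifulSoup(sample_html, "html.parser").find_all("table")
table_index = find_table_index(sample_tables, "M3A")
print(table_index)
```

In this notebook the equivalent call would be `find_table_index(toronto_tables, 'M3A')`, which should agree with the manual choice of table 0 as long as the page layout is unchanged.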

In [10]:
neigh_table = toronto_tables[0]

### Creating and Cleaning the Dataframe
Now that we have the right table, the following cells load its data into a Pandas dataframe in the desired form. This notebook assumes that the first 3 non-whitespace characters of every cell are the postal code, and that an opening parenthesis '(' always separates the borough from the neighborhoods.

***Further details regarding the reformatting are explained in comments in the code below***

In [11]:
# Collect one dict per assigned cell; the dataframe is built once at the end
# (DataFrame.append was removed in pandas 2.0, and appending row by row is slow)
rows = []

# Loop through all of the data cells in the table
for cell in neigh_table.find_all('td'):
    text = cell.text.strip()
    # Skip any cells that aren't assigned
    if 'Not assigned' not in text:
        # The postal code is always the first 3 characters of the cell, so slicing splits it off
        postalcode = text[0:3]

        # The remainder is split at the opening parenthesis: the borough comes before it,
        # and the neighborhoods after it have their ' /' separators turned into commas
        other = text[3:].split('(')
        borough = other[0]
        neighborhood = other[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
        rows.append({'PostalCode': postalcode,
                     'Borough': borough,
                     'Neighborhood': neighborhood})

# Build the dataframe with the named columns
neigh_df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])
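
The parsing rule used in the loop above can also be isolated and checked on sample strings without touching the web page. This is a sketch only: the function name `parse_cell` and the two sample strings are made up for illustration, and the extra guard for a missing parenthesis is an addition that the notebook's loop does not have.

```python
def parse_cell(text):
    """Split one table cell into (postal code, borough, neighborhood).

    Assumes the first three characters are the postal code and that an
    opening parenthesis separates the borough from the neighborhoods.
    Returns None for unassigned cells or cells without a parenthesis.
    """
    text = text.strip()
    if 'Not assigned' in text or '(' not in text:
        return None
    postalcode = text[:3]
    parts = text[3:].split('(')
    borough = parts[0]
    neighborhood = parts[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    return postalcode, borough, neighborhood

# Illustrative inputs shaped like the cell text the notebook expects
assigned = parse_cell('M5ADowntown Toronto(Regent Park / Harbourfront)')
unassigned = parse_cell('M1ANot assigned')
print(assigned)
print(unassigned)
```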


In [12]:
neigh_df['Borough'].value_counts()

North York                                                      24
Downtown Toronto                                                17
Scarborough                                                     17
Etobicoke                                                       11
Central Toronto                                                  9
West Toronto                                                     6
York                                                             5
East Toronto                                                     4
East York                                                        4
Downtown TorontoStn A PO Boxes25 The Esplanade                   1
East YorkEast Toronto                                            1
MississaugaCanada Post Gateway Processing Centre                 1
EtobicokeNorthwest                                               1
East TorontoBusiness reply mail Processing Centre969 Eastern     1
Queen's Park                                                     1
Name: Borough, dtype: int64

In [13]:
# There are a handful of boroughs that didn't get processed properly, so let's fix them
neigh_df['Borough']=neigh_df['Borough'].replace({'MississaugaCanada Post Gateway Processing Centre':'Mississauga',
                                                 'EtobicokeNorthwest':'Etobicoke Northwest',
                                                 'East YorkEast Toronto':'East York/East Toronto',
                                                 'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                                 'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                                 })
neigh_df['Borough'].value_counts()

North York                24
Downtown Toronto          17
Scarborough               17
Etobicoke                 11
Central Toronto            9
West Toronto               6
York                       5
East Toronto               4
East York                  4
Mississauga                1
East York/East Toronto     1
Etobicoke Northwest        1
Downtown Toronto Stn A     1
Queen's Park               1
East Toronto Business      1
Name: Borough, dtype: int64
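
The fused names seen earlier (for example `EtobicokeNorthwest`) are consistent with `cell.text` joining a cell's text fragments with nothing in between. As an alternative to patching values by hand, BeautifulSoup's `get_text` accepts a separator string. The sketch below shows the difference on a made-up cell; its markup only approximates the real page and is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: line breaks separate the fragments, as they might on the real page
cell_html = (
    "<table><tr><td><b>M9W</b><br/>Etobicoke<br/>Northwest<br/>"
    "(Clairville / Humberwood)</td></tr></table>"
)
cell = BeautifulSoup(cell_html, "html.parser").td

fused = cell.text                         # fragments joined with no separator
spaced = cell.get_text(" ", strip=True)   # fragments stripped and joined with a space
print(fused)
print(spaced)
```

If the loop used the spaced form, the postal code slice would still work, but the borough and neighborhood strings would need an extra `.strip()` to drop the added spaces.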

### The Dataframe should now be complete! Let's take a look

In [14]:
neigh_df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |


In [16]:
neigh_df.shape

(103, 3)