## Neighborhoods in Toronto: Data Preparation

### Import Required Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Fetch Data
Fetch the Wikipedia table of Canadian postal codes that begin with the letter M (the Toronto area).

In [2]:
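URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(URL)
response.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(response.text, 'html.parser')
# The postal code data sits in the first sortable wikitable on the page
table = soup.find('table', {'class': 'wikitable sortable'}).tbody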

Get the rows of the table.

In [3]:
rows = table.find_all('tr')
rows[:2]

[<tr>
 <th>Postal code
 </th>
 <th>Borough
 </th>
 <th>Neighborhood
 </th></tr>,
 <tr>
 <td>M1A
 </td>
 <td>Not assigned
 </td>
 <td>
 </td></tr>]

Get the list of column names from the header row.

In [4]:
columns = [v.text.replace('\n','') for v in rows[0].find_all('th')]
columns

['Postal code', 'Borough', 'Neighborhood']

## Clean and Prep Data

Clean the data and keep only the relevant rows, applying these rules:
1. Ignore cells with a borough that is Not assigned.
2. If a cell has a borough but a Not assigned neighborhood, the neighborhood is set to the borough.
3. If more than one neighborhood exists in one postal code area, combine them into one row with the neighborhoods separated by commas.

In [5]:
data = []
for row in rows[1:]:  # skip the header row
    # Remove the newline character from each cell
    temp = [td.text.replace('\n', '') for td in row.find_all('td')]
    if temp[1] != 'Not assigned':  # Rule 1: skip unassigned boroughs
        if temp[2] == 'Not assigned':  # Rule 2: fall back to the borough
            temp[2] = temp[1]
        temp[2] = temp[2].replace(' /', ',')  # Rule 3: comma-separate neighborhoods
        data.append(temp)

In [6]:
data[:3]

[['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Regent Park, Harbourfront']]
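
The three cleaning rules can also be isolated into a small helper, which makes them easy to test without fetching the page. The sketch below is only an illustration: the `clean_row` function, the `NOT_ASSIGNED` constant, and the sample rows are hypothetical names and made-up inputs, not part of the original notebook.

```python
NOT_ASSIGNED = "Not assigned"  # placeholder text used in the source table


def clean_row(cells):
    """Apply the three rules to one [postal_code, borough, neighborhood] row.

    Returns the cleaned row, or None when the row should be dropped.
    """
    postal_code, borough, neighborhood = (cell.strip() for cell in cells)
    if borough == NOT_ASSIGNED:  # Rule 1: skip unassigned boroughs
        return None
    if neighborhood == NOT_ASSIGNED:  # Rule 2: fall back to the borough
        neighborhood = borough
    # Rule 3: turn "A / B" into "A, B"
    neighborhood = ", ".join(part.strip() for part in neighborhood.split("/"))
    return [postal_code, borough, neighborhood]


# Illustrative sample rows, shaped like the raw cell text (trailing newlines included)
sample_rows = [
    ["M1A\n", "Not assigned\n", "\n"],
    ["M3A\n", "North York\n", "Parkwoods\n"],
    ["M5A\n", "Downtown Toronto\n", "Regent Park / Harbourfront\n"],
    ["M7A\n", "Queen's Park\n", "Not assigned\n"],
]

cleaned = [row for row in map(clean_row, sample_rows) if row is not None]
print(cleaned)
```

Returning `None` for dropped rows keeps the filtering decision in one place, so the calling loop only has to discard the `None` values.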

Create a DataFrame from the cleaned list.

In [7]:
postal_df = pd.DataFrame(data, columns=columns)
postal_df.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


Print the shape of the DataFrame.

In [8]:
postal_df.shape

(103, 3)
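
The shape only confirms the row and column counts. A few extra checks on the cleaned rows can catch problems the shape would miss, such as duplicate postal codes or leftover "Not assigned" values. The following is a minimal sketch under assumptions: `check_rows` and `POSTAL_CODE` are hypothetical names, and the pattern assumes every code is the letter M followed by a digit and an uppercase letter.

```python
import re

# Assumed format for the codes in this table: "M", one digit, one uppercase letter
POSTAL_CODE = re.compile(r"M\d[A-Z]")


def check_rows(rows, expected_columns=3):
    """Return a list of human-readable problems found in the cleaned rows."""
    problems = []
    seen = set()
    for row in rows:
        if len(row) != expected_columns:
            problems.append(f"wrong column count: {row!r}")
            continue
        code, borough, neighborhood = row
        if not POSTAL_CODE.fullmatch(code):
            problems.append(f"unexpected postal code: {code!r}")
        if code in seen:
            problems.append(f"duplicate postal code: {code!r}")
        seen.add(code)
        if "Not assigned" in (borough, neighborhood):
            problems.append(f"unassigned value left in row: {row!r}")
    return problems


# Illustrative rows taken from the output shown earlier in the notebook
good_rows = [
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
]
print(check_rows(good_rows))
```

In the notebook, the same function could be called on `data` before the DataFrame is built; an empty list means none of these checks failed.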