## Web Scraping: Toronto Postal Codes

- Scrape Toronto's postal codes from Wikipedia: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

#### Objective:
- Build a table of Toronto postal codes together with their boroughs and neighbourhoods.

#### 1. Importing libraries

In [3]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
from tqdm import tqdm
import requests

In [5]:
# Download the page and parse it with the lxml parser
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page_html = requests.get(url).text
soup = bs(page_html, 'lxml')

- For each table cell, extract the postal code, borough, and neighbourhood.

In [40]:
# Find the first table on the page; each cell holds one postal code
main_list = []
main_table = soup.find('table')
all_tr = main_table.find_all('tr')
for tr in all_tr:
    all_td = tr.find_all('td')
    for td in all_td:
        text = td.find('p').text
        details = {}
        # The first three characters are the postal code
        details['PostalCode'] = text[0:3]
        # The borough precedes the optional "(neighbourhood)" part
        parts = text[3:].strip().split('(')
        details['Borough'] = parts[0].strip()
        # Cells without parentheses list no neighbourhood
        if len(parts) > 1:
            details['Neighbourhood'] = parts[1].replace(')', '').strip()
        else:
            details['Neighbourhood'] = '-'

        main_list.append(details)
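
The string handling in the loop above can be isolated into a small helper so it can be checked without touching the network. The sketch below uses a hypothetical `parse_cell` function (not part of the original notebook) and assumes each cell's text is a three-character postal code followed by the borough and an optional parenthesised neighbourhood. It uses `str.partition`, which splits on the first `(` only, as one possible alternative to `split('(')`.

```python
def parse_cell(text):
    """Split one table cell's text into postal code, borough and neighbourhood.

    Hypothetical helper: assumes the first three characters are the postal code.
    """
    code = text[:3]
    rest = text[3:].strip()
    # partition returns (before, separator, after); 'after' is '' if no '(' exists
    borough, _, neighbourhood = rest.partition('(')
    neighbourhood = neighbourhood.rstrip(')').strip()
    return {
        'PostalCode': code,
        'Borough': borough.strip(),
        'Neighbourhood': neighbourhood or '-',  # same placeholder as the notebook
    }


# Sample strings written by hand to mimic the cell text
sample = parse_cell('M5A\nDowntown Toronto(Regent Park / Harbourfront)\n')
unassigned = parse_cell('M1ANot assigned\n')
```

Inside the loop, the call would be `parse_cell(td.find('p').text)`.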


In [46]:
df = pd.DataFrame(main_list)
df = df[df['Borough'] != 'Not assigned']
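
Filtering with a boolean mask keeps the original row labels, so the filtered frame's index no longer starts at 0. A minimal sketch on a hand-written frame (the `toy` rows are illustrative, not the scraped data) shows the effect and how `reset_index(drop=True)` renumbers the rows if a contiguous index is preferred.

```python
import pandas as pd

# Illustrative rows only; the real data comes from the scraped table
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['-', '-', 'Parkwoods', 'Victoria Village'],
})

# Same filter as in the notebook: drop unassigned postal codes
assigned = toy[toy['Borough'] != 'Not assigned']
kept_labels = assigned.index.tolist()    # original labels survive the filter

# Optional: renumber the rows from 0 and discard the old labels
renumbered = assigned.reset_index(drop=True)
new_labels = renumbered.index.tolist()
```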

In [47]:
df.head()

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Queen's Park | Ontario Provincial Government |


In [52]:
print('The shape of our dataframe is {} with the following details:\n'
      '- {} rows\n- {} columns\n- {} unique postal codes\n- {} unique boroughs'
      .format(df.shape, df.shape[0], df.shape[1],
              df['PostalCode'].nunique(), df['Borough'].nunique()))

The shape of our dataframe is (103, 3) with the following details:
- 103 rows
- 3 columns
- 103 unique postal codes
- 15 unique boroughs
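
The row count matching the unique postal code count suggests one row per code. A quick sanity check, sketched here with the standard library on a hand-written list (the `codes` values and the `check_codes` helper are illustrative), is to confirm that every code has the `M` + digit + letter shape and that none repeats.

```python
import re
from collections import Counter

# Toronto codes in this table are assumed to look like M3A: 'M', a digit, a letter
FSA_PATTERN = re.compile(r'M\d[A-Z]')


def check_codes(codes):
    """Return (malformed codes, duplicated codes) for a list of strings."""
    malformed = [c for c in codes if not FSA_PATTERN.fullmatch(c)]
    duplicates = [c for c, n in Counter(codes).items() if n > 1]
    return malformed, duplicates


codes = ['M3A', 'M4A', 'M5A', 'M6A', 'M7A']
malformed, duplicates = check_codes(codes)
```

With the dataframe from this notebook, the call would be `check_codes(df['PostalCode'].tolist())`; two empty lists mean both checks passed.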


In [54]:
# Save the dataframe to a CSV file without writing the index column
df.to_csv('/Users/mac/Desktop/DataScience/Pojects_ds/coffee_shop/dataset/toronto_postal_codes.csv', index=False)
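
The absolute path above is specific to one machine. A more portable sketch builds the output path with `pathlib`, creates the folder if it is missing, and reads the file back to confirm the round trip. The `save_postal_codes` helper and the `dataset` folder name are assumptions for illustration, and a temporary directory stands in for the project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_postal_codes(frame, out_dir, filename='toronto_postal_codes.csv'):
    """Write frame to out_dir/filename without the index and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create missing folders
    out_path = out_dir / filename
    frame.to_csv(out_path, index=False)
    return out_path


# One illustrative row; in the notebook this would be the full df
demo = pd.DataFrame({
    'PostalCode': ['M3A'],
    'Borough': ['North York'],
    'Neighbourhood': ['Parkwoods'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = save_postal_codes(demo, Path(tmp) / 'dataset')
    reloaded = pd.read_csv(path)  # read back while the file still exists
```

In the notebook, `save_postal_codes(df, 'dataset')` would write the file relative to the current working directory.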