<h1>Segmenting and Clustering Neighbourhoods in Toronto</h1> 
<h3>Part-1</h3> 
<h3>Web Scraping and Converting into a pandas DataFrame</h3>

<h4>Web Scraping</h4>

In [1]:
#importing libraries
import pandas as pd 
import requests
from bs4 import BeautifulSoup  # for web scraping

In [2]:
# requesting the page and inspecting the response status and headers

URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(URL)

print(page.status_code)
print(page.headers)


200
{'Date': 'Thu, 14 May 2020 04:20:16 GMT', 'Vary': 'Accept-Encoding,Cookie,Authorization', 'Server': 'ATS/8.0.7', 'Content-Type': 'text/html; charset=UTF-8', 'X-Content-Type-Options': 'nosniff', 'P3P': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'Content-language': 'en', 'Last-Modified': 'Thu, 07 May 2020 17:47:29 GMT', 'Content-Encoding': 'gzip', 'Age': '18537', 'X-Cache': 'cp5008 miss, cp5009 hit/10', 'X-Cache-Status': 'hit-front', 'Server-Timing': 'cache;desc="hit-front"', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload', 'Set-Cookie': 'WMF-Last-Access=14-May-2020;Path=/;HttpOnly;secure;Expires=Mon, 15 Jun 2020 00:00:00 GMT, WMF-Last-Access-Global=14-May-2020;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Mon, 15 Jun 2020 00:00:00 GMT, GeoIP=IN:AS:Guwahati:26.18:91.75:v4; Path=/; secure; Domain=.wikipedia.org', 'X-Client-IP': '27.56.59.226', 'Cache-Control': 'private, s-maxage=0, max-age=0, must-revalidat
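
The status code is printed above but nothing acts on it. A minimal sketch of how the request could be made to fail loudly instead, assuming that behaviour is wanted: `fetch_page` is a hypothetical helper name and the 10-second timeout is an arbitrary choice, neither of which comes from the notebook.

```python
import requests


def fetch_page(url, timeout=10):
    """Hypothetical helper: fetch `url` and raise on an HTTP error status."""
    # timeout stops the call from hanging indefinitely on a stalled connection
    response = requests.get(url, timeout=timeout)
    # raises requests.HTTPError for 4xx and 5xx responses, does nothing otherwise
    response.raise_for_status()
    return response
```

It would be called as `page = fetch_page(URL)` in place of the bare `requests.get(URL)`.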

In [3]:
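# Scraping the data: parse the HTML and take the first table on the page
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table')
trs = table.find_all('tr')
rows = []
for tr in trs:
    cells = tr.find_all('td')
    # rows without <td> cells (such as a header row) are skipped
    if cells:
        rows.append(cells)

lst = []
for row in rows:
    # strip surrounding whitespace, including the trailing newline of each cell
    postalcode = row[0].text.strip()
    borough = row[1].text.strip()
    neighborhood = row[2].text.strip()
    # drop postal codes that have no borough assigned
    if borough != 'Not assigned':
        # fall back to the borough name when no neighbourhood is assigned
        if neighborhood == 'Not assigned':
            neighborhood = borough
        lst.append([postalcode, borough, neighborhood])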
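
The filtering rules in the loop above can be tried without any network access. The following is a small sketch under stated assumptions: `clean_rows` is a hypothetical helper name, and the three sample rows are made up for illustration rather than taken from the scraped page.

```python
def clean_rows(raw_rows):
    """Apply the same rules as the loop above to (postal code, borough, neighbourhood) text."""
    cleaned = []
    for postalcode, borough, neighborhood in raw_rows:
        postalcode = postalcode.strip()
        borough = borough.strip()
        neighborhood = neighborhood.strip()
        if borough == 'Not assigned':
            continue  # no borough: the row is dropped
        if neighborhood == 'Not assigned':
            neighborhood = borough  # no neighbourhood: reuse the borough name
        cleaned.append([postalcode, borough, neighborhood])
    return cleaned


# Hypothetical sample rows with the trailing newlines that scraped cell text often carries
sample_rows = [
    ('M1A\n', 'Not assigned\n', 'Not assigned\n'),
    ('M3A\n', 'North York\n', 'Parkwoods\n'),
    ('M7A\n', "Queen's Park\n", 'Not assigned\n'),
]
cleaned_rows = clean_rows(sample_rows)
print(cleaned_rows)
```

The first sample row is dropped, the second passes through unchanged, and the third takes its borough name as the neighbourhood.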

<h4>Converting into a pandas DataFrame</h4>

In [4]:
cols = ['PostalCode', 'Borough', 'Neighborhood']
df = pd.DataFrame(lst, columns=cols)
df

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [5]:
# summary statistics of the dataframe (count, unique, top, freq per column)
df.describe()

,PostalCode,Borough,Neighborhood
count,103,103,103
unique,103,10,98
top,M1B,North York,Downsview
freq,1,24,4


In [6]:
# combining multiple neighbourhoods that share a postal code into one row, separated by commas
df = df.groupby('PostalCode').agg(
    {
        'Borough': 'first',
        'Neighborhood': ', '.join,
    }
).reset_index()
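
In the scraped table every postal code already appears once (the summary above reports 103 unique values out of 103), so this aggregation does not merge any rows here. A minimal sketch of what it does when a postal code is repeated, using a hypothetical long-format frame named `long_df` that is not part of the notebook:

```python
import pandas as pd

# Hypothetical input: one row per (postal code, neighbourhood) pair
long_df = pd.DataFrame(
    [
        ['M5A', 'Downtown Toronto', 'Regent Park'],
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M3A', 'North York', 'Parkwoods'],
    ],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

# Same aggregation as above: keep the first borough, join the neighbourhood names
combined = long_df.groupby('PostalCode').agg(
    {'Borough': 'first', 'Neighborhood': ', '.join}
).reset_index()
print(combined)
```

`groupby` sorts by the grouping key by default, which is why the row order (and the index of a given postal code) changes after this step.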


In [7]:
# checking postal code M5A, whose neighbourhoods share a single row, separated by commas
df.loc[df['PostalCode'] == 'M5A']

,PostalCode,Borough,Neighborhood
53,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [8]:
# counting boroughs and rows; df.shape[0] is the number of rows (one per postal code)
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.
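
The second number printed above is the row count, one row per postal code, while several rows list more than one neighbourhood. A rough sketch of how distinct neighbourhood names could be counted instead, using three rows copied from the output shown earlier; the variable names are illustrative only, and the split assumes names are separated by a comma and a space.

```python
import pandas as pd

# Three rows copied from the dataframe output shown earlier
sample = pd.DataFrame(
    {
        'PostalCode': ['M3A', 'M5A', 'M6A'],
        'Borough': ['North York', 'Downtown Toronto', 'North York'],
        'Neighborhood': [
            'Parkwoods',
            'Regent Park, Harbourfront',
            'Lawrence Manor, Lawrence Heights',
        ],
    }
)

# Split each comma-separated string into a list, then give every name its own row
names = sample['Neighborhood'].str.split(', ').explode()

n_rows = sample.shape[0]   # rows, one per postal code
n_names = names.nunique()  # distinct neighbourhood names
print(n_rows, n_names)
```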


In [9]:
#Finding out the number of rows and columns of the dataframe
df.shape

(103, 3)

In [10]:
#storing the contents of the dataframe "df" to toronto.csv file
df.to_csv('toronto.csv', index=False)
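
A quick way to check the export is to read the file back and compare it with the original. The sketch below does this for a two-row sample in a temporary directory; the file name `toronto_sample.csv` and the variable names are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the dataframe output shown earlier
sample = pd.DataFrame(
    [
        ['M3A', 'North York', 'Parkwoods'],
        ['M5A', 'Downtown Toronto', 'Regent Park, Harbourfront'],
    ],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toronto_sample.csv')  # hypothetical file name
    sample.to_csv(path, index=False)  # index=False keeps the row index out of the file
    restored = pd.read_csv(path)

# Compare the restored rows with the original ones
round_trip_ok = restored.to_dict('records') == sample.to_dict('records')
print(list(restored.columns), round_trip_ok)
```

Writing with `index=False` matters here: without it, the index is saved as an extra unnamed column that reappears as `Unnamed: 0` when the file is read back.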