## 1. Scraping data from Wikipedia and creating a dataframe of neighborhoods in Toronto

##### Using the BeautifulSoup package to scrape the table of postal codes for Toronto, Canada from Wikipedia, then using pandas to read the table into a dataframe (df)

#### Importing libraries

In [1]:
# Importing libraries for webscraping (BeautifulSoup) and dataframe (Pandas)

import requests
from bs4 import BeautifulSoup
import pandas as pd

print('All done! Needed libraries imported!')

All done! Needed libraries imported!


#### Scraping from Wikipedia

In [2]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(page.content, 'html.parser')

In [3]:
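# Take the first <tbody> on the page, which here is the postal code table
table = soup.find('tbody')
# Collect every table row and keep its text, with cells separated by newlines
rows = table.select('tr')
row = [r.get_text() for r in rows]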

#### Data preprocessing

In [4]:
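# One line of raw text per table row
df = pd.DataFrame(row)
# Split each row's text on newlines into one column per piece
df1 = df[0].str.split('\n', expand=True)
# Promote the first row to column names, then drop it from the data
df2 = df1.rename(columns=df1.iloc[0])
df3 = df2.drop(df2.index[0])
df3.head()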

Unnamed: 0,Unnamed: 1,Postcode,Borough,Neighbourhood,Unnamed: 5
1,,M1A,Not assigned,Not assigned,
2,,M2A,Not assigned,Not assigned,
3,,M3A,North York,Parkwoods,
4,,M4A,North York,Victoria Village,
5,,M5A,Downtown Toronto,Harbourfront,
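
Splitting each row's text on newlines is what produces the empty leading and trailing columns in the output above. A minimal alternative sketch, assuming the table has one header row of `th` cells followed by rows of `td` cells, reads each cell separately. The inline HTML is a made-up stand-in for the Wikipedia page so the example runs offline, and `sample_html` and `parse_table` are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up stand-in for the scraped page, so the sketch runs without network access
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def parse_table(html):
    """Return the first table in `html` as a DataFrame, one column per cell."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find('table').find_all('tr')
    # The first row holds the header cells; the remaining rows hold the data
    header = [cell.get_text(strip=True) for cell in rows[0].find_all('th')]
    records = [[cell.get_text(strip=True) for cell in r.find_all('td')]
               for r in rows[1:]]
    return pd.DataFrame(records, columns=header)

sample_df = parse_table(sample_html)
print(sample_df)
```

Because every cell is stripped individually, no empty columns appear and the header row never ends up in the data.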


#### Rename Postcode to PostalCode

In [5]:
df4 = df3.rename(columns={'Postcode': 'PostalCode'})
df4.head()

Unnamed: 0,Unnamed: 1,PostalCode,Borough,Neighbourhood,Unnamed: 5
1,,M1A,Not assigned,Not assigned,
2,,M2A,Not assigned,Not assigned,
3,,M3A,North York,Parkwoods,
4,,M4A,North York,Victoria Village,
5,,M5A,Downtown Toronto,Harbourfront,


#### Keeping only the rows that have an assigned borough

In [6]:
df5 = df4[df4.Borough != 'Not assigned']
df5.head()

Unnamed: 0,Unnamed: 1,PostalCode,Borough,Neighbourhood,Unnamed: 5
3,,M3A,North York,Parkwoods,
4,,M4A,North York,Victoria Village,
5,,M5A,Downtown Toronto,Harbourfront,
6,,M6A,North York,Lawrence Heights,
7,,M6A,North York,Lawrence Manor,


#### Combining neighborhoods that share the same postal code area

In [7]:
df6 = df5.groupby(['PostalCode', 'Borough'], sort=False).agg(','.join)
df6.reset_index(inplace=True)
df6.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
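
The grouping step can be sketched on a few toy rows modelled on the output above. This version selects the `Neighbourhood` column before aggregating, so that only that column is joined; `toy` and `combined` are hypothetical names, and the rows are a hand-copied sample rather than the scraped data.

```python
import pandas as pd

# Toy rows modelled on the dataframe shown above
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M6A', 'M6A'],
    'Borough': ['Downtown Toronto', 'North York', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Lawrence Heights', 'Lawrence Manor'],
})

# Group rows that share a postal code and borough, then join their
# neighbourhood names into one comma-separated string
combined = (toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
               .agg(','.join)
               .reset_index())
print(combined)
```

Passing `sort=False` keeps the groups in the order they first appear, which is why the postal codes stay in table order.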


#### Replacing any remaining "Not assigned" values with Queen's Park

In [8]:
df7 = df6.replace("Not assigned", "Queen's Park")
df7.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
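
The replacement above hard-codes one name. A more general sketch, assuming the goal is for an unassigned neighbourhood to take the name of its borough, copies the value row by row. The toy rows below are made up for illustration (including the `Queen's Park` borough value), and `filled` and `unassigned` are hypothetical names.

```python
import pandas as pd

# Made-up rows: the second one has a borough but no assigned neighbourhood
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront', 'Not assigned'],
})

filled = toy.copy()
# Boolean mask of the rows whose neighbourhood is missing
unassigned = filled['Neighbourhood'] == 'Not assigned'
# Copy the borough name into the neighbourhood for those rows only
filled.loc[unassigned, 'Neighbourhood'] = filled.loc[unassigned, 'Borough']
print(filled)
```

This leaves assigned neighbourhoods untouched and works for any borough name, not just one.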


## 2. Latitude and Longitude of Neighborhoods

#### Load the csv file from http://cocl.us/Geospatial_data and rename Postal Code to PostalCode to match the first dataframe

In [9]:
data = "http://cocl.us/Geospatial_data"
df8 = pd.read_csv(data)
df8.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
df8.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### Merge the first and second dataframes into one on PostalCode

In [10]:
df9 = pd.merge(df7, df8, on='PostalCode')
df9.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
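
By default `pd.merge` performs an inner join, so a postal code without coordinates would be dropped silently. A small sketch of how one might check for that, using toy data with a made-up postal code `M9Z` that has no match; `left`, `right`, `checked` and `unmatched` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for the two dataframes; 'M9Z' is a made-up code with no coordinates
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Made-up Borough']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# A left join keeps every row of the first dataframe; indicator=True adds a
# '_merge' column, and validate raises if a postal code is duplicated on either side
checked = pd.merge(left, right, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

# Postal codes present in the first dataframe but missing from the second
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner join used above loses no rows.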


## 3. Explore and cluster the neighborhoods in Toronto

#### Exploring and clustering of the neighborhoods in Toronto, using only boroughs that contain the word Toronto

##### Importing libraries and packages

In [11]:
conda update -n base -c defaults conda


Collecting package metadata: ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\ProgramData\Anaconda3

  added / updated specs:
    - conda


The following NEW packages will be INSTALLED:

  conda-package-han~ pkgs/main/win-64::conda-package-handling-1.3.11-py37_0

The following packages will be UPDATED:

  conda                                       4.6.11-py37_0 --> 4.8.2-py37_0


Preparing transaction: ...working... done
Verifying transaction: ...working... failed

Note: you may need to restart the kernel to use updated packages.



EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
  environment location: C:\ProgramData\Anaconda3




In [12]:
# Install the packages needed for the project (geopy for geocoding, folium for maps)
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

print('All done! Needed libraries imported!')

Collecting package metadata: ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\ProgramData\Anaconda3

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.8.2                |           py37_0         3.0 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.0 MB

The following NEW packages will be INSTALLED:

  conda-package-han~ conda-forge/win-64::conda-package-handling-1.6.0-py37h2fa13f4_1
  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-1.21.0-py_0

The following packages will be UPDATED:

  conda                      pkgs/main::conda-4.6.11-py37_0 --> conda-forge::conda-4.8.2-py37_0

The following packages will be SUPERSEDED by a higher-



  current version: 4.6.11
  latest version: 4.8.2

Please update conda by running

    $ conda update -n base -c defaults conda



EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
  environment location: C:\ProgramData\Anaconda3




Collecting package metadata: ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\ProgramData\Anaconda3

  added / updated specs:
    - folium=0.5.0


The following NEW packages will be INSTALLED:

  altair             conda-forge/noarch::altair-4.0.1-py_0
  branca             conda-forge/noarch::branca-0.3.1-py_0
  conda-package-han~ conda-forge/win-64::conda-package-handling-1.6.0-py37h2fa13f4_1
  folium             conda-forge/noarch::folium-0.5.0-py_0
  vincent            conda-forge/noarch::vincent-0.4.4-py_1

The following packages will be UPDATED:

  conda                      pkgs/main::conda-4.6.11-py37_0 --> conda-forge::conda-4.8.2-py37_0

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi                                         pkgs/main --> conda-forge


Preparing transaction: ...working... done
Verifying transaction: ...working... failed
All done! Needed libraries imported!





In [15]:
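# Show every column and row when displaying dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
# Geocoding: convert an address into latitude and longitude
from geopy.geocoders import Nominatim
# Colour maps for plotting
import matplotlib.cm as cm
import matplotlib.colors as colors
# k-means clustering
from sklearn.cluster import KMeans
# Map rendering
import folium

print('All done! Needed libraries imported!')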

All done! Needed libraries imported!
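
The install cells above ended with `EnvironmentNotWritableError`, yet these imports succeed, which suggests the packages were already available in the environment. A minimal sketch for checking which packages are importable before attempting an install; `missing_packages` is a hypothetical helper, and the package list is taken from the imports in this section.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level packages imported in this section
needed = ['numpy', 'geopy', 'matplotlib', 'sklearn', 'folium']
print('Missing packages:', missing_packages(needed))
```

An empty list means nothing needs installing, so the install cell can be skipped.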


#### Dataframe of boroughs whose name contains the word Toronto

In [20]:
Toronto = df9[df9['Borough'].str.contains('Toronto')]
Toronto

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Adelaide,King,Richmond",43.650571,-79.384568
31,M6H,West Toronto,"Dovercourt Village,Dufferin",43.669005,-79.442259


#### Visualization of the neighborhoods on a map

In [21]:
# Geocode the city name to get coordinates for centring the map
address = 'Toronto'
geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# Base map centred on Toronto
Toronto_map = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add one circle marker per postal code area, labelled with neighborhood and borough
for lat, lng, borough, neighborhood in zip(Toronto['Latitude'], Toronto['Longitude'],
                                           Toronto['Borough'], Toronto['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(Toronto_map)

Toronto_map
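
The map above plots every neighborhood in the same colour. As a rough sketch of the clustering step, the example below runs `KMeans` on the latitude and longitude of the ten rows shown earlier. Clustering on raw coordinates and the choice of `k = 3` are assumptions made only for illustration; the notebook does not state which features or how many clusters it uses. `coords`, `k` and `center` are hypothetical names.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Coordinates copied from the Toronto dataframe output shown earlier
coords = pd.DataFrame({
    'PostalCode': ['M5A', 'M7A', 'M5B', 'M5C', 'M4E',
                   'M5E', 'M5G', 'M6G', 'M5H', 'M6H'],
    'Latitude': [43.65426, 43.662301, 43.657162, 43.651494, 43.676357,
                 43.644771, 43.657952, 43.669542, 43.650571, 43.669005],
    'Longitude': [-79.360636, -79.389494, -79.378937, -79.375418, -79.293031,
                  -79.373306, -79.387383, -79.422564, -79.384568, -79.442259],
})

k = 3  # assumed number of clusters, chosen only for illustration
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
# Assign each postal code area to one of the k clusters
coords['Cluster'] = kmeans.fit_predict(coords[['Latitude', 'Longitude']])

# The mean coordinates give a map centre without a geocoding request
center = [coords['Latitude'].mean(), coords['Longitude'].mean()]
print(coords[['PostalCode', 'Cluster']])
```

The `Cluster` column could then pick the marker colour in the plotting loop, and `center` could replace the geocoded location if the geocoding request fails.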