# Wikipedia Webscraping

For this exercise I chose BeautifulSoup as the web scraper, so the first step is to install the package.

In [1]:
! pip install beautifulsoup4



The next step is to parse the HTML of the wiki page with bs4. Using the inspect function in my web browser, I located the class of the table, then stored references to the table rows and the column names in similarly named variables.

In [2]:
from bs4 import BeautifulSoup
import requests

source = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text

soup = BeautifulSoup(source, 'html.parser')

table = soup.find('table', {'class': "wikitable sortable"}).tbody
rows = table.find_all('tr')
columns = [v.text.replace('\n', '') for v in rows[0].find_all('th')]
print(columns)

['Postal Code', 'Borough', 'Neighborhood']
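
The same row and header extraction can be tried offline on a small inline table. This is a minimal sketch, assuming a hypothetical three-row HTML snippet that only imitates the layout of the Wikipedia table; it uses `get_text(strip=True)` instead of `replace('\n', '')`, which is a swapped-in way of trimming the trailing newlines. Because `html.parser` does not add a `tbody` element that is missing from the markup, the sketch searches the `table` element directly.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the wiki table, for illustration only.
html = """
<table class="wikitable sortable">
<tr><th>Postal Code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# A class string containing a space matches the exact value of the class attribute.
table = soup.find("table", {"class": "wikitable sortable"})
rows = table.find_all("tr")

# The first row holds the <th> header cells; the remaining rows hold <td> data cells.
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]

print(header)
print(records)
```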


Now that we have the data we want, we can load it into a pandas dataframe by populating it with the rows of the wikitable. I also exported a CSV to check the full result list.

In [3]:
import pandas as pds

# Collect the text of the first three cells of each data row (rows[0] is the header).
records = []
for row in rows[1:]:
    tds = row.find_all('td')
    records.append([td.text.replace('\n', '') for td in tds[:3]])

# Build the dataframe in one step; DataFrame.append was removed in pandas 2.0.
df = pds.DataFrame(records, columns=columns)

# df.to_csv(r'C:\Users\Docherty.ATGCH\Desktop\Coding101' + 'hello.csv', index=False)


I then started cleaning the dataframe by removing every row where the Borough was not assigned.

In [4]:
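# .copy() makes the filtered frame independent, so later assignments do not act on a view.
df = df[df['Borough'] != 'Not assigned'].copy()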
print(df)

    Postal Code           Borough  \
2           M3A        North York   
3           M4A        North York   
4           M5A  Downtown Toronto   
5           M6A        North York   
6           M7A  Downtown Toronto   
..          ...               ...   
160         M8X         Etobicoke   
165         M4Y  Downtown Toronto   
168         M7Y      East Toronto   
169         M8Y         Etobicoke   
178         M8Z         Etobicoke   

                                          Neighborhood  
2                                            Parkwoods  
3                                     Victoria Village  
4                            Regent Park, Harbourfront  
5                     Lawrence Manor, Lawrence Heights  
6          Queen's Park, Ontario Provincial Government  
..                                                 ...  
160      The Kingsway, Montgomery Road, Old Mill North  
165                               Church and Wellesley  
168  Business reply mail Processing Centre

I then filled in every unassigned neighborhood with its borough value.

In [5]:
df.loc[df['Neighborhood']=='Not assigned','Neighborhood']=df['Borough']

print(df)

df.to_csv(r'C:\Users\Docherty.ATGCH\Desktop\Coding101' + 'hello.csv', index=False)

    Postal Code           Borough  \
2           M3A        North York   
3           M4A        North York   
4           M5A  Downtown Toronto   
5           M6A        North York   
6           M7A  Downtown Toronto   
..          ...               ...   
160         M8X         Etobicoke   
165         M4Y  Downtown Toronto   
168         M7Y      East Toronto   
169         M8Y         Etobicoke   
178         M8Z         Etobicoke   

                                          Neighborhood  
2                                            Parkwoods  
3                                     Victoria Village  
4                            Regent Park, Harbourfront  
5                     Lawrence Manor, Lawrence Heights  
6          Queen's Park, Ontario Provincial Government  
..                                                 ...  
160      The Kingsway, Montgomery Road, Old Mill North  
165                               Church and Wellesley  
168  Business reply mail Processing Centre
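
The two cleaning steps above can be checked on a toy frame. This is a minimal sketch, assuming hypothetical postal codes (`A1A`, `B2B`, `C3C`) and made-up rows rather than the scraped data; the names `toy`, `cleaned` and `mask` are illustrative only.

```python
import pandas as pd

# Hypothetical rows, for illustration only (not taken from the wiki table).
toy = pd.DataFrame({
    "Postal Code": ["A1A", "B2B", "C3C"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Step 1: drop rows whose borough is unassigned; .copy() gives an independent frame.
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Step 2: where the neighborhood is unassigned, reuse the borough name.
mask = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[mask, "Neighborhood"] = cleaned.loc[mask, "Borough"]

print(cleaned)
```

Selecting the boroughs with the same mask on the right-hand side makes the row alignment explicit instead of relying on pandas index alignment.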

#### Here is a snapshot of the table

In [6]:
df

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 160 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 165 | M4Y | Downtown Toronto | Church and Wellesley |
| 168 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 169 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


#### And here is the shape of the dataframe

In [7]:
df.shape

(103, 3)

# Geolocation of Postcodes

In [8]:
import pandas as pd

urlcsv = 'http://cocl.us/Geospatial_data'

df2 = pd.read_csv(urlcsv)

df2

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| ... | ... | ... | ... |
| 98 | M9N | 43.706876 | -79.518188 |
| 99 | M9P | 43.696319 | -79.532242 |
| 100 | M9R | 43.688905 | -79.554724 |
| 101 | M9V | 43.739416 | -79.588437 |


In [9]:
result_table = pd.merge(df,df2,on=["Postal Code"])
result_table

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.654260 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... | 43.662744 | -79.321558 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 |
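
`pd.merge` defaults to an inner join, which silently drops any postal code that has no coordinates. The sketch below is one possible way to surface such rows, assuming two small hypothetical frames (`boroughs`, `coords`) in which the coordinates for `M5A` are deliberately left out; `how`, `indicator` and `validate` are standard `pd.merge` parameters.

```python
import pandas as pd

# Small hypothetical frames; the coordinates for M5A are omitted on purpose.
boroughs = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every borough row; indicator=True adds a "_merge" column;
# validate raises if a postal code appears more than once on either side.
merged = pd.merge(boroughs, coords, on="Postal Code", how="left",
                  indicator=True, validate="one_to_one")

# Rows marked "left_only" found no matching coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

If `unmatched` is empty for the real data, the inner join used above loses nothing.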


# Now we map out the neighbourhoods
### First we import the libraries, and fix the coordinates of Toronto

In [10]:
#result_table.to_csv(r'C:\Users\Docherty.ATGCH\Desktop\Coding101' + 'helloAgain.csv', index=False)
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes
import folium
from geopy.geocoders import Nominatim

address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
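
The geocoder call needs network access. If it is unavailable, an approximate map centre can be computed from the coordinates already in the table. This is a minimal sketch, not part of the original workflow: `mean_centre` is a hypothetical helper and `sample` holds only the first three coordinate pairs from the merged table. A plain arithmetic mean is assumed to be adequate for an area as small as a single city.

```python
def mean_centre(points):
    """Return the arithmetic mean (lat, lng) of an iterable of (lat, lng) pairs."""
    points = list(points)
    if not points:
        raise ValueError("need at least one coordinate pair")
    lat = sum(p[0] for p in points) / len(points)
    lng = sum(p[1] for p in points) / len(points)
    return lat, lng


# First three coordinate pairs from the merged table, used as sample input.
sample = [
    (43.753259, -79.329656),
    (43.725882, -79.315572),
    (43.654260, -79.360636),
]

centre_lat, centre_lng = mean_centre(sample)
print(centre_lat, centre_lng)
```

With the full table, the same helper could be called as `mean_centre(zip(result_table['Latitude'], result_table['Longitude']))`.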


### Now we place a circle marker over the centre point of each neighbourhood, and display the map

In [11]:
# Base map centred on the Toronto coordinates found above.
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(result_table['Latitude'], result_table['Longitude'], result_table['Borough'], result_table['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto