
# Web scraping with Beautiful Soup

In this notebook we explore how to use web scraping with Beautiful Soup to obtain data about jobs on indeed.com. This notebook and its example materials were developed by [Nova Institute](https://novainstitute.ca) and are released under the [MIT license](https://github.com/NovaMaja/webscraping/blob/master/LICENSE).

## Imports
### requests
We will use the **requests** library to get the raw HTML from a webpage. **requests** provides a simple interface for HTTP methods such as GET, POST, PUT, DELETE, HEAD and OPTIONS, and the library includes many other useful functions. See http://docs.python-requests.org/ for more info on the **requests** library.

### Beautiful Soup
**Beautiful Soup** is a library for parsing web pages. It builds a parse tree of a webpage based on its HTML structure. With a parse tree it is easy to navigate through the contents of the webpage and get the information we are looking for. The **Beautiful Soup** [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) has a great quickstart section to get you started.
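
As a small illustration of the parse tree idea, the sketch below parses a made-up HTML snippet (not taken from indeed.com) and reads a few values from it. The snippet and the variable names are assumptions for demonstration only.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page used only for illustration.
html = """
<html>
  <head><title>Example jobs</title></head>
  <body>
    <div class="job"><a href="/job/1">Data Scientist</a></div>
    <div class="job"><a href="/job/2">Data Analyst</a></div>
  </body>
</html>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# Attribute-style access walks down the tree to the first matching tag.
page_title = demo_soup.title.string

# find() returns the first match; tag attributes are read like dict keys.
first_link = demo_soup.find('a')
first_href = first_link['href']

# find_all() returns every match as a list.
job_titles = [a.get_text() for a in demo_soup.find_all('a')]

print(page_title)
print(first_href)
print(job_titles)
```

The same three calls (attribute access, `find()` and `find_all()`) cover most of what is needed to pull values out of a real page.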

### Pandas
**Pandas** is a library for organizing data into series and dataframes (kind of like tables). **pandas** has a lot of built-in functions that make sorting and manipulating data a breeze. Read the Pandas [documentation](https://pandas.pydata.org/pandas-docs/stable/) for more info.

In [0]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Get and parse the web page
The first thing we will do is go to indeed.com in our browser (we recommend [Chrome](https://www.google.com/chrome/)) and search for a job we are interested in. In our case we searched for Data Scientist in Toronto, ON.

Once we have a search we are happy with, we copy the URL from the browser and paste it as an argument to the requests.get() function.
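
Instead of copying the URL by hand, the query string can also be built in code. The sketch below uses the standard library's `urllib.parse.urlencode`. The parameter names `q` and `l` are read off the search URL used in this notebook, and `build_search_url` is a hypothetical helper name introduced here.

```python
from urllib.parse import urlencode

def build_search_url(query, location, base='https://www.indeed.ca/jobs'):
    """Return a search URL with the query and location URL-encoded."""
    # urlencode turns spaces into '+' and escapes characters such as ','.
    params = urlencode({'q': query, 'l': location})
    return f'{base}?{params}'

search_url = build_search_url('data scientist', 'Toronto, ON')
print(search_url)
```

The resulting string can be passed to requests.get() in place of the pasted URL. requests.get() also accepts a `params` dictionary and performs this encoding itself.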

After we have loaded the webpage into our page variable, we pass it to BeautifulSoup, which turns it into a parse tree using *html.parser*.

In [0]:
page = requests.get('https://www.indeed.ca/jobs?q=data+scientist&l=Toronto%2C+ON')
soup = BeautifulSoup(page.text, 'html.parser')

The soup object has a method called prettify() that makes the parse tree more human-readable. We will use it to print out the page we retrieved.

In [3]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <script src="/s/1069dac/en_CA.js" type="text/javascript">
  </script>
  <link href="/s/97464e7/jobsearch_all.css" rel="stylesheet" type="text/css"/>
  <link href="http://www.indeed.ca/rss?q=data+scientist&amp;l=Toronto%2C+ON" rel="alternate" title="Data Scientist Jobs in Toronto, ON" type="application/rss+xml"/>
  <link href="/m/jobs?q=data+scientist&amp;l=Toronto%2C+ON&amp;limit=20" media="only screen and (max-width: 640px)" rel="alternate"/>
  <link href="/m/jobs?q=data+scientist&amp;l=Toronto%2C+ON&amp;limit=20" media="handheld" rel="alternate"/>
  <script type="text/javascript">
   if (typeof window['closureReadyCallbacks'] == 'undefined') {
        window['closureReadyCallbacks'] = [];
    }

    function call_when_jsall_loaded(cb) {
        if (window['closureReady']) {
            cb();
        } else {
            window['closureReadyCallbacks'].push(cb);
 

This is far too much information to be useful. We need a way to filter out exactly what we are looking for. A good way to do that is to look at the web page using **inspect** in the browser. If we right-click one of the job listings and choose **inspect**, Chrome shows the relevant HTML for that element of the page. After a bit of digging we can see that all the jobs in the result list share a common class, *jobsearch-SerpJobCard*. We can use that to get only the information on the cards.

In [0]:
jobcards = soup.find_all(class_ = 'jobsearch-SerpJobCard')

In [0]:
print(jobcards)

In [0]:
len(jobcards)

26

Using the same method we can single out specific information in each job card. We will grab these details and collect them in a list with one row per job (named jobDict in the code).

In [0]:
# Despite its name, jobDict is a list of [company, title, location] rows.
jobDict = []

for card in jobcards:
  # The job title is stored in the title attribute of the link tagged 'jobTitle'.
  links = card.find_all('a')
  jobtitle = None
  for link in links:
    if link.get('data-tn-element') == 'jobTitle':
      jobtitle = link.get('title')
  print(jobtitle)
  # The company name is either inside a link or directly in the company element.
  company = card.find(class_='company')
  companylink = company.find('a')
  if companylink:
    companyName = companylink.contents[0].strip()
  else:
    companyName = company.contents[0].strip()
  print(companyName)
  location = card.find(class_='location').contents[0].strip()
  print(location)
  jobDict.append([companyName, jobtitle, location])

Data Scientist
Loblaw Digital
Toronto, ON
Data Scientist, Algorithms & Data Systems
Boxy Charm
Toronto, ON
Senior Data Scientist
Loblaw Digital
Toronto, ON
Lead Data Scientist - App & Partner Platform
Shopify
Toronto, ON
Data Scientist - Toronto, ON
Scotiabank
Toronto, ON
Data Scientist
Canadian Tire Corporation
Toronto, ON
Data Scientist
Knowtions Research
Toronto, ON
Data Scientist
Agility consulting Inc
Richmond Hill, ON
Junior Data Scientist
PUSH
Toronto, ON
Data Scientist
RBC
Toronto, ON
Data Scientist, Data Quality Labs - Toronto, ON
Scotiabank
Toronto, ON
Data Scientist
BrainStation
Toronto, ON
Data Scientist
EllisDon Corporation
Mississauga, ON
Data Scientist
FOUND PEOPLE INC.
Toronto, ON
Data Scientist / Actuary
Munich Re
Toronto, ON
Data Scientist I
Ingram Micro
Toronto, ON
Data Scientist
Anova Ltd.
Toronto, ON
Data Mining Scientist
Huawei Canada
Markham, ON
Scientist 1
Thermo Fisher Scientific
Mississauga, ON
Jr Data Scientist
FOUND PEOPLE INC.
Toronto, ON
Senior Data Scient
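
The extraction loop above assumes every card contains a company and a location element. If one is missing, `card.find(...)` returns `None` and the next attribute access raises an error. A minimal sketch of a more defensive version follows, assuming the same class names as above. `text_or_none` is a hypothetical helper, and the HTML is a made-up stand-in for real job cards.

```python
from bs4 import BeautifulSoup

def text_or_none(card, class_name):
    """Return the stripped text of the first element with class_name, or None."""
    element = card.find(class_=class_name)
    if element is None:
        return None
    # get_text() collects the text of the element and all of its children.
    return element.get_text(strip=True)

# Made-up cards: the second one has no location element.
sample_html = """
<div class="jobsearch-SerpJobCard">
  <span class="company"><a href="#"> Example Corp </a></span>
  <span class="location">Toronto, ON</span>
</div>
<div class="jobsearch-SerpJobCard">
  <span class="company">Another Co</span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
rows = []
for card in sample_soup.find_all(class_='jobsearch-SerpJobCard'):
    rows.append([text_or_none(card, 'company'), text_or_none(card, 'location')])

print(rows)
```

Because `get_text()` reads through nested tags, the helper does not need a separate branch for company names that are wrapped in a link.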

Now we can load the list into a pandas DataFrame to display it in an orderly fashion. A DataFrame also makes it easy to do further analysis on the data.

In [0]:
jobTable = pd.DataFrame(jobDict, columns=['Company', 'JobTitle', 'Location'])
jobTable

| | Company | JobTitle | Location |
|---|---|---|---|
| 0 | Loblaw Digital | Data Scientist | Toronto, ON |
| 1 | Boxy Charm | Data Scientist, Algorithms & Data Systems | Toronto, ON |
| 2 | Loblaw Digital | Senior Data Scientist | Toronto, ON |
| 3 | Shopify | Lead Data Scientist - App & Partner Platform | Toronto, ON |
| 4 | Scotiabank | Data Scientist - Toronto, ON | Toronto, ON |
| 5 | Canadian Tire Corporation | Data Scientist | Toronto, ON |
| 6 | Knowtions Research | Data Scientist | Toronto, ON |
| 7 | Agility consulting Inc | Data Scientist | Richmond Hill, ON |
| 8 | PUSH | Junior Data Scientist | Toronto, ON |
| 9 | RBC | Data Scientist | Toronto, ON |


We can use the DataFrame to keep only the jobs in the result list whose title is exactly **Data Scientist**.

In [0]:
jobTable.loc[jobTable['JobTitle']=='Data Scientist']

| | Company | JobTitle | Location |
|---|---|---|---|
| 0 | Loblaw Digital | Data Scientist | Toronto, ON |
| 5 | Canadian Tire Corporation | Data Scientist | Toronto, ON |
| 6 | Knowtions Research | Data Scientist | Toronto, ON |
| 7 | Agility consulting Inc | Data Scientist | Richmond Hill, ON |
| 9 | RBC | Data Scientist | Toronto, ON |
| 11 | BrainStation | Data Scientist | Toronto, ON |
| 12 | EllisDon Corporation | Data Scientist | Mississauga, ON |
| 13 | FOUND PEOPLE INC. | Data Scientist | Toronto, ON |
| 16 | Anova Ltd. | Data Scientist | Toronto, ON |
| 22 | Quartic.ai | Data Scientist | Oakville, ON |
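
The exact match above skips titles such as "Senior Data Scientist". As a sketch of two common follow-up steps, the example below uses a substring match and counts listings per location. The small table is rebuilt from a few of the rows printed earlier so that the block runs on its own. `demo_table` and the other variable names are introduced here for illustration.

```python
import pandas as pd

# A few rows copied from the scraped results, so the example is self-contained.
demo_table = pd.DataFrame(
    [
        ['Loblaw Digital', 'Data Scientist', 'Toronto, ON'],
        ['Loblaw Digital', 'Senior Data Scientist', 'Toronto, ON'],
        ['Agility consulting Inc', 'Data Scientist', 'Richmond Hill, ON'],
        ['PUSH', 'Junior Data Scientist', 'Toronto, ON'],
        ['Thermo Fisher Scientific', 'Scientist 1', 'Mississauga, ON'],
    ],
    columns=['Company', 'JobTitle', 'Location'],
)

# Exact match, as in the cell above.
exact_ds = demo_table.loc[demo_table['JobTitle'] == 'Data Scientist']

# Substring match: keeps every title that contains "Data Scientist".
mask = demo_table['JobTitle'].str.contains('Data Scientist', regex=False)
contains_ds = demo_table[mask]

# Count how many listings each location has.
per_location = demo_table['Location'].value_counts()

print(len(exact_ds), len(contains_ds))
print(per_location)
```

Passing `regex=False` treats the search text as a plain string, which avoids surprises if a title contains characters that are special in regular expressions.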
