
# Web Scraping with Beautiful Soup

In this notebook we will explore how to use web scraping with Beautiful Soup to obtain data about jobs on indeed.com. This notebook and its example materials were developed by [Nova Institute](https://novainstitute.ca) and are released under the [MIT license](https://github.com/NovaMaja/webscraping/blob/master/LICENSE).

## Imports
### requests
We will use the **requests** library to get the raw HTML from a web page. **requests** keeps HTTP requests simple: it supports the GET, POST, PUT, DELETE, HEAD and OPTIONS methods and includes many other useful functions. See http://docs.python-requests.org/ for more information on the **requests** library.

### BeautifulSoup
**Beautiful Soup** is a library for parsing web pages. It builds a parse tree of a web page based on its HTML structure. With a parse tree it is easy to navigate through the contents of the page and get the information we are looking for. The **Beautiful Soup** [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) has a great quick start section to get you started.
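
As a small illustration of navigating a parse tree, the sketch below parses a short HTML string instead of a live page. The markup and the variable names are made up for this example and are not taken from any real site.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (hypothetical, for illustration only)
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Jobs</h1>
    <a href="/job/1">Data Scientist</a>
    <a href="/job/2">Data Analyst</a>
  </body>
</html>
"""

example_soup = BeautifulSoup(html, 'html.parser')

# Tags can be reached as attributes of the tree
page_title = example_soup.title.string

# find() returns the first match, find_all() returns every match
first_link = example_soup.find('a')
all_hrefs = [a.get('href') for a in example_soup.find_all('a')]

print(page_title)
print(first_link.get_text())
print(all_hrefs)
```

The same find() and find_all() calls work on a tree built from a downloaded page; only the input string changes.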

### Pandas
**pandas** is a library for organizing data into Series and DataFrames (a DataFrame is much like a table). **pandas** has many built-in functions that make sorting and manipulating data a breeze. Read the pandas [documentation](https://pandas.pydata.org/pandas-docs/stable/) for more info.

In [0]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Get and parse the web page
The first thing we will do is go to indeed.com in our browser (we recommend [Chrome](https://www.google.com/chrome/)) and search for a job we are interested in. In our case we searched for Data Scientist in Toronto, ON.

Once we have a search we are happy with, we copy the URL from the browser and paste it as an argument to the requests.get() function.

After loading the web page into our page variable, we pass its text to BeautifulSoup, which parses it into a parse tree using *html.parser*.

In [0]:
page = requests.get('https://www.indeed.ca/jobs?q=data+scientist&l=Toronto%2C+ON')
soup = BeautifulSoup(page.text, 'html.parser')
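
The query string in the URL above encodes the search terms. As a rough sketch, the same URL can be assembled from its parts with the standard library, which makes it easier to change the job title or location. The names `base_url`, `params` and `search_url` are introduced here only for illustration.

```python
from urllib.parse import urlencode

# Base address and search terms, taken from the URL used above
base_url = 'https://www.indeed.ca/jobs'
params = {'q': 'data scientist', 'l': 'Toronto, ON'}

# urlencode() turns spaces into '+' and the comma into '%2C'
search_url = base_url + '?' + urlencode(params)
print(search_url)
```

The resulting string can be passed to requests.get() in place of the pasted URL. requests.get() also accepts a `params` dictionary and performs this encoding itself.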

The soup object has a method called prettify() that makes the parse tree more human-readable. We will use it to print out the HTML we retrieved.

In [0]:
print(soup.prettify())
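
For a full page the output of prettify() can run to thousands of lines. One possible way to keep it manageable is to print only the first few lines, sketched here on a tiny made-up snippet rather than the real page.

```python
from bs4 import BeautifulSoup

# A made-up snippet standing in for a downloaded page
snippet = BeautifulSoup('<div><p>Hello</p></div>', 'html.parser')

# prettify() returns a string with one tag or text node per line
pretty = snippet.prettify()

# Keep only the first three lines for a quick preview
preview = '\n'.join(pretty.splitlines()[:3])
print(preview)
```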

This is far too much information to be useful as it stands, so we need a way to filter out exactly what we are looking for. A good way to do that is to examine the web page using **Inspect** in the browser. If we right-click on one of the job listings and choose **Inspect**, Chrome shows the HTML for that element of the page. After a bit of digging around we can see that all the jobs in the result list share a common class, *jobsearch-SerpJobCard*. We can pass that class to find_all() to get only the job cards.

In [0]:
jobcards = soup.find_all(class_ = 'jobsearch-SerpJobCard')

In [0]:
print(jobcards)

In [0]:
len(jobcards)
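
Because the live page can change at any time, it can help to try the class filter on a small, fixed piece of HTML first. The markup below is a simplified, made-up imitation of a results page; only the class name and the `data-tn-element` attribute come from this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two job cards and one unrelated element
sample_html = """
<div class="jobsearch-SerpJobCard row result">
  <a data-tn-element="jobTitle" title="Data Scientist">Data Scientist</a>
</div>
<div class="jobsearch-SerpJobCard row result">
  <a data-tn-element="jobTitle" title="Senior Data Scientist">Senior Data Scientist</a>
</div>
<div class="sidebar">Not a job card</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ matches an element if any one of its classes equals the given name
sample_cards = sample_soup.find_all(class_='jobsearch-SerpJobCard')
print(len(sample_cards))

# Other attributes can be matched with the attrs dictionary
title_links = sample_soup.find_all('a', attrs={'data-tn-element': 'jobTitle'})
titles = [link.get('title') for link in title_links]
print(titles)
```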

Using the same method we can single out specific information in each job card. We will grab these details and collect them in a list (named `jobDict` in the code below), with one row of company, job title and location per card.

In [0]:
# Despite its name, jobDict is a list holding one [company, title, location] row per card
jobDict = []

for card in jobcards:
  # The job title is stored in the 'title' attribute of the link marked as jobTitle
  links = card.find_all('a')
  jobtitle = None
  for link in links:
    if link.get('data-tn-element') == 'jobTitle':
      jobtitle = link.get('title')
  print(jobtitle)
  # The company name sits either inside a link or directly in the company element
  company = card.find(class_='company')
  companylink = company.find('a')
  if companylink:
    companyName = companylink.contents[0].strip()
  else:
    companyName = company.contents[0].strip()
  print(companyName)
  location = card.find(class_='location').contents[0].strip()
  print(location)
  jobDict.append([companyName, jobtitle, location])
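
The loop above assumes that every card contains a company and a location element. If one is missing, find() returns None and the attribute access that follows raises an AttributeError. A more defensive variant might look like the sketch below. The helper names `text_or_none` and `parse_card`, the sample markup and the company names are all made up for illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, class_name):
    """Return the stripped text of the first element with this class, or None."""
    element = parent.find(class_=class_name)
    return element.get_text(strip=True) if element else None


def parse_card(card):
    """Extract company, title and location from one card, tolerating gaps."""
    title_link = card.find('a', attrs={'data-tn-element': 'jobTitle'})
    return {
        'Company': text_or_none(card, 'company'),
        'JobTitle': title_link.get('title') if title_link else None,
        'Location': text_or_none(card, 'location'),
    }


# Hypothetical markup: the second card has no location element
sample_html = """
<div class="jobsearch-SerpJobCard">
  <a data-tn-element="jobTitle" title="Data Scientist">Data Scientist</a>
  <span class="company"><a href="/cmp/acme"> Acme Analytics </a></span>
  <span class="location">Toronto, ON</span>
</div>
<div class="jobsearch-SerpJobCard">
  <a data-tn-element="jobTitle" title="Data Analyst">Data Analyst</a>
  <span class="company"> Example Corp </span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
rows = [parse_card(card)
        for card in sample_soup.find_all(class_='jobsearch-SerpJobCard')]

for row in rows:
    print(row)
```

Using get_text(strip=True) covers both cases handled by the if/else in the loop, since it collects the text whether or not it is wrapped in a link. A list of dictionaries like `rows` can be passed to pd.DataFrame() directly.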

Now we can load the list into a pandas DataFrame to display it in an orderly fashion. A DataFrame also makes it easy to do further analysis on the data.

In [0]:
jobTable = pd.DataFrame(jobDict, columns=['Company', 'JobTitle', 'Location'])
jobTable

We can use the pandas DataFrame to find only the jobs in the result list whose title is exactly **Data Scientist**.

In [0]:
jobTable.loc[jobTable['JobTitle']=='Data Scientist']
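
The comparison above keeps only exact matches, so a title such as "Senior Data Scientist" is left out. If partial, case-insensitive matches are wanted, one option is `str.contains`, sketched here on a small made-up table so that it runs without the scraped data.

```python
import pandas as pd

# Made-up rows in the same column layout as jobTable
sample_jobs = pd.DataFrame(
    [
        ['Acme Analytics', 'Data Scientist', 'Toronto, ON'],
        ['Example Corp', 'Senior Data Scientist', 'Toronto, ON'],
        ['Sample Inc', 'Data Analyst', 'Toronto, ON'],
        ['Demo Ltd', None, 'Toronto, ON'],
    ],
    columns=['Company', 'JobTitle', 'Location'],
)

# Exact match, as in the cell above
exact = sample_jobs.loc[sample_jobs['JobTitle'] == 'Data Scientist']

# Partial match: case=False ignores case, na=False treats missing titles as no match
mask = sample_jobs['JobTitle'].str.contains('data scientist', case=False, na=False)
partial = sample_jobs.loc[mask]

print(len(exact), len(partial))
```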