<a href="https://colab.research.google.com/github/NovaMaja/webscraping/blob/master/webscraping_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic web scraping with Beautiful Soup
This notebook shows the basic principles of web scraping using a very simple example page: https://novainstitute.ca/examples/examplePage.html

Before we start, please open the [example page](https://novainstitute.ca/examples/examplePage.html) in your web browser (we recommend [Chrome](https://www.google.com/chrome/)).

*This notebook and its example materials were developed by [Nova Institute](https://novainstitute.ca) and are released under the [MIT license](https://github.com/NovaMaja/webscraping/blob/master/LICENSE).*


## Imports
### requests
We will use the **requests** library to get the raw HTML from a web page. **requests** makes HTTP requests simple: it supports GET, POST, PUT, DELETE, HEAD and OPTIONS, and the library includes many other useful functions. See http://docs.python-requests.org/ for more information on the **requests** library.

### BeautifulSoup
**Beautiful Soup** is a library for parsing web pages. It builds a parse tree from the HTML structure of a page. With a parse tree it is easy to navigate through the contents of the page and get the information we are looking for. The **Beautiful Soup** [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) has a great quick-start section to get you started.
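
To make the idea of a parse tree more concrete, here is a minimal sketch that parses a small, made-up HTML string (not the example page) and moves down and up the tree by tag name. The HTML content and the variable names are assumptions chosen only for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p>Hello <b>world</b></p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Move down the tree by tag name: soup -> <title>
title_text = soup.title.string

# Move up the tree: the parent of the <b> tag is the <p> tag
bold_parent = soup.b.parent.name

# List the names of all tags nested inside the first <p>
child_tags = [tag.name for tag in soup.p.find_all(True)]

print(title_text, bold_parent, child_tags)
```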


In [0]:
import requests
from bs4 import BeautifulSoup


## Get and parse the web page
The first thing we will do is pass the URL of the example page as an argument to the requests.get() function. To scrape a different page, open it in your browser, copy the URL from the address bar, and use that URL instead.

Once the web page is loaded into our page variable, we pass its text to BeautifulSoup, which turns it into a parse tree for us using html.parser.

In [0]:
page = requests.get('https://novainstitute.ca/examples/examplePage.html')
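
A request can fail (for example, the page may not exist or the server may be down), and requests.get() does not raise an error for a 404 or 500 response on its own. The sketch below is one possible way to guard against that; the helper name fetch_html and the 10 second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text at `url`, raising an exception on failure.

    `fetch_html` is a hypothetical helper; the timeout value is an
    arbitrary choice.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# html = fetch_html('https://novainstitute.ca/examples/examplePage.html')
```

Failing early like this means a missing page shows up as a clear error instead of being parsed as if it were the page we wanted.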

In [0]:
soup = BeautifulSoup(page.text, 'html.parser')

The soup object has a method called prettify() that makes the parse tree more human-readable. We will use it to print out the HTML we retrieved.

In [0]:
print(soup.prettify())

We can extract elements based on their tags using our parse tree. For example, this is how to display only the title of the web page:

In [0]:
title = soup.title.contents[0]
title = title.strip()
print(title)
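
As a variation on the cell above, Beautiful Soup's get_text() method can return the text of a tag with the surrounding whitespace removed in one step. The snippet below is a sketch on a made-up HTML string, so it runs without downloading anything; the title text is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet with extra whitespace around the title text
html = "<html><head><title>\n   Example Page  \n</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

# strip=True removes leading and trailing whitespace from the text
title = soup.title.get_text(strip=True)
print(title)
```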

If we want just the text of the first paragraph (the first p element) in the body section, we can get that too:

In [0]:
bodytext = soup.p.contents[0]
bodytext = bodytext.strip()
print(bodytext)
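
The cell above reads only the first piece of content in the first paragraph. If a page has several paragraphs, or a paragraph contains nested tags such as links, one possible approach is to loop over find_all('p') and call get_text() on each result. This is a sketch on a made-up HTML string; the paragraph contents are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up body with two paragraphs, one of which contains a nested link
html = """
<body>
  <p> First paragraph. </p>
  <p>Second paragraph with a <a href="https://example.com">link</a>.</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text() joins the text of a tag and everything nested inside it
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]

for text in paragraphs:
    print(text)
```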

You can even find all the links on the page:

In [0]:
# Find every anchor (<a>) tag and print the URL it points to
websites = soup.find_all('a')
for website in websites:
  print(website["href"])
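
On other pages, some anchor tags may have no href attribute, and some links may be relative (for example other.html rather than a full URL). The sketch below shows one way to handle both cases, using find_all with href=True and urljoin from the standard library. The HTML string and the relative link in it are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://novainstitute.ca/examples/examplePage.html"

# Made-up anchors: one absolute link, one relative link, one without href
html = (
    '<a href="https://novainstitute.ca">Home</a>'
    '<a href="other.html">Other</a>'
    '<a name="top">No href here</a>'
)
soup = BeautifulSoup(html, "html.parser")

links = []
# href=True keeps only the anchors that actually have an href attribute
for anchor in soup.find_all("a", href=True):
    # urljoin resolves relative links against the URL of the page
    links.append(urljoin(base_url, anchor["href"]))

print(links)
```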

## Downloading images
We can get the URL for an image in much the same way as we get the URL for a link, but here we want to download the image itself. For this we will write a small function using requests.get().

In [0]:
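def downloadImage(url, filename):
  # Fetch the image; img.content holds the raw bytes of the response
  img = requests.get(url)

  # Write the bytes to disk in binary mode
  with open(filename, "wb") as image_file:
    image_file.write(img.content)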

Now we can get the URLs for the images and use the function we just defined to download them.

In [0]:
images = soup.find_all('img')
# enumerate() supplies the counter used in each file name
for i, image in enumerate(images):
  image_url = image["src"]
  print(image_url)
  downloadImage(image_url, "img{}.png".format(i))
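
The loop above saves every image with a .png extension, whatever its real format. As a possible refinement, the file extension can be taken from the image URL itself. The helper name filename_for and its default_ext parameter are hypothetical, and the URLs in the usage lines are made up for illustration.

```python
import posixpath
from urllib.parse import urlparse


def filename_for(url, index, default_ext=".png"):
    """Build a file name like img0.jpg from an image URL.

    Hypothetical helper: falls back to `default_ext` when the URL path
    has no extension.
    """
    # Use only the path part, so query strings such as ?size=large are ignored
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1]
    return "img{}{}".format(index, ext or default_ext)


print(filename_for("https://example.com/pics/cat.jpg?size=large", 0))
print(filename_for("https://example.com/image", 1))
```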




If you are using Google Colab to run this notebook, you should find a file called img0.png under the Files tab in the menu on the left (you may need to click the little arrow button to see the menu).

If you are running the notebook locally, you should find the file in the same directory as your notebook.