# Webscraping

***Scraping the Internet Programmatically***

# Getting Started

The libraries needed for reading in HTML from sites:

- **requests:** Make GET requests to the internet to go out and get the HTML for a site

- **bs4:** Parse through a set of HTML

In [1]:
import requests
from bs4 import BeautifulSoup

## Loading In a Webpage

While the get function has several extra parameters, the main two are the url and params.

`requests.get(url, params=None)`

- **url:** The link to retrieve the HTML from

- **params:** A dictionary of parameters to add as a *query string* on top of the link. If it's None, nothing is added to the url

    - *query string:* Extra piece of the link to specify further what you want from site (addition looks like **"?key1=value1&key2=value2"**)
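
To see what **params** actually does to a link, here is a small sketch that builds a request without sending it and prints the final URL. The URL and the keys in the dictionary are made up for illustration; they are not part of this workshop's site.

```python
import requests

# Hypothetical URL and parameters, used only for illustration.
base_url = "https://example.com/search"
params = {"key1": "value1", "key2": "value2"}

# Build the request without sending it, so the final URL can be inspected
# without touching the network.
prepared = requests.Request("GET", base_url, params=params).prepare()

# The dictionary is appended to the link as a query string.
print(prepared.url)
```

Calling `requests.get(base_url, params=params)` would request that same URL, so this is a handy way to check a query string before sending anything.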

You can then see the HTML by reading **.text** from the response object that `requests.get` returns.

In [2]:
url = "http://webcode.me"
my_request = requests.get(url)
print(my_request.text)


## Getting Your Soup

The bs4 library takes this long HTML string and turns it into a BeautifulSoup object that has several functions that make navigating the HTML to the data you want easier.

In [None]:
soup = BeautifulSoup(my_request.text, 'html.parser')
print(soup)

# Basic Navigation

Your *soup* object holds all of the HTML. If you know the name of the tag you want, you can navigate to it as an attribute of the soup. Let's try to get the *title* tag for this HTML page.

In [None]:
title = soup.title
print(title)

Nice! If you want the text contained within a single element, you can get it from the **.text** of that BeautifulSoup object.

In [None]:
print(title.text)
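
The same dotted navigation can be chained to walk down through nested tags. Here is a minimal sketch on a small, made-up HTML string (not the page loaded above), so it runs without a network connection:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for a downloaded page.
html = """
<html>
  <head><title>My Page</title></head>
  <body>
    <h1>Welcome</h1>
    <h1>Second heading</h1>
  </body>
</html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Dotted access returns the first tag with that name nested below.
heading = demo_soup.body.h1
print(heading)       # the whole tag, markup included
print(heading.text)  # only the text inside the tag

# Asking for a tag that is not in the HTML gives None instead of an error.
print(demo_soup.table)
```

Note that dotted access only ever gives you the first matching tag, so the second *h1* is not reachable this way.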

## Your Turn

How can we get the text from the first p tag that we see in the HTML?

# Searching by Tags

While navigating tag by tag works, on big webpages it would take a ton of steps to reach the information you want. Luckily, the bs4 library has some methods that let us get to the tags we want faster:

`BeautifulSoup.find_all(name, attrs, recursive, string, limit)`

This method gets all the tags that fit the arguments you give it:

- **name:** The tag type (things like *div, p, button*)

- **attrs:** Tags often come with inner information (attributes). We can use this as a way of filtering our search. We pass a dictionary of {attribute_name: attribute_value} to attrs to filter our search

    - example of attributes: `<div class="my-special-div">` (here **class** is the attribute name and **"my-special-div"** is its value)
    
- **recursive:** This parameter is True by default. If True, the function searches all tags nested at any depth inside your current BeautifulSoup object. If False, it only looks at the direct children

- **limit:** If there's a maximum number of elements that you want, you can specify this with the limit parameter

- **string:** Filters the search to tags that have the given string as the text inside.
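
Here is a short sketch of how each of these arguments changes what *find_all* returns. The HTML string and its class names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: three divs, each wrapping one p tag.
html = """
<div class="card"><p>Alpha</p></div>
<div class="card"><p>Beta</p></div>
<div class="note"><p>Gamma</p></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# name + attrs: only the divs whose class is "card".
cards = demo_soup.find_all("div", attrs={"class": "card"})
print(len(cards))

# limit: stop after the first match.
first_card_only = demo_soup.find_all("div", attrs={"class": "card"}, limit=1)
print(len(first_card_only))

# string: only the p tags whose text is exactly "Beta".
beta = demo_soup.find_all("p", string="Beta")
print(beta)

# recursive=False: only look at direct children of the soup.
# Every p tag here sits inside a div, so nothing is found.
top_level_p = demo_soup.find_all("p", recursive=False)
print(top_level_p)
```

The result always behaves like a list, so an empty result means nothing matched.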

`BeautifulSoup.find(name, attrs, recursive, string)`

This method searches the same way as *find_all*, but returns only the first matching tag instead of a list, and returns *None* if no tag is found.
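
A quick sketch of the difference between the two methods, again on a made-up HTML string:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two matching list items.
html = '<ul><li class="item">First</li><li class="item">Second</li></ul>'
demo_soup = BeautifulSoup(html, "html.parser")

# find gives back a single tag: the first one that matches.
first_item = demo_soup.find("li", {"class": "item"})
print(first_item.text)

# When nothing matches, find gives None while find_all gives an empty list.
missing = demo_soup.find("table")
print(missing)
print(demo_soup.find_all("table"))

# Check for None before using the result, since None has no .text.
if missing is None:
    print("No table on this page")
```

That *None* check is worth keeping in mind when scraping real sites, since a page can change its layout at any time.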

Let's try to get information from the Premier League soccer standings.

In [None]:
premier_league_html_string = requests.get("https://www.premierleague.com/tables").text
soup2 = BeautifulSoup(premier_league_html_string, "html.parser")
print("This HTML has " + str(len(premier_league_html_string)) + " characters!")

That's a lot of HTML to navigate. Let's try using find_all to get the tags that have the names of the teams instead:

In [None]:
all_teams = soup2.find_all("td", {"class": "team"})
first_place = all_teams[0]
print(first_place)

That's pretty close! From here we can get the name of the team by searching inside this tag:

In [None]:
first_place_name = first_place.find("span", {"class": "long"})
print(first_place_name)
print(first_place_name.text)

## Your Turn

Write a little code to go through all teams and print out the standings:

# Conclusion

And that's it! You now know the basics of scraping websites, and you can apply this mentality of following the tags to any website to pull information out of it programmatically. Enjoy scraping!

For more information on the bs4 library: https://beautiful-soup-4.readthedocs.io/en/latest/#id12

---

This notebook was created by Jonathan Keane for MSOE's GDSC webscraping workshop.