# Web Scraping - HTML and Beautiful Soup

> One of the most common ways to obtain data is _web scraping_. As the name suggests, web scraping means pulling information from websites programmatically, because copying and pasting by hand would take far too much effort, especially for large amounts of data.

## The Challenge

Let's say we want to build a model that predicts house prices from some features, such as location, number of bedrooms and number of bathrooms. We need some way of obtaining this data: both the features and the response (target) variable.

To introduce you to the concept of web scraping, let's try and extract data for 100 houses:
- **Sale Price**: Our response variable
- Number of bedrooms
- Square footage
- Description
- Address
    
[This URL shows houses listed for sale in London](https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list). Let's take a look at where the information that we want to extract is on the webpage.

Before we look at solving this challenge, let's take a look at what websites and HTML actually are.

## Websites

### What format does information on a website exist in?

- Websites don't just serve data in a tidy CSV or JSON format.
- They contain content, such as text, images and buttons, laid out so that it makes sense to a human reader.
- This content is defined in an HTML file.
- They also have styling, which controls how that content looks.

#### What is HTML?

> HTML stands for HyperText Markup Language. It consists of a tree structure of different types of web elements, like buttons, page divisions, images and more. 

This means that HTML is used to define what **content** is rendered on any webpage that you visit.

HTML markup is made up of elements (tags), which may themselves contain other elements (tags).
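
To make the nesting concrete, here is a minimal sketch that walks a small HTML snippet and prints one indented line per opening tag. The snippet and the `TreePrinter` class are made up for illustration, and the sketch uses Python's built-in `html.parser` module so that it needs nothing installed.

```python
from html.parser import HTMLParser

# Hypothetical snippet, made up for illustration:
# a <div> that contains a heading and a list.
SAMPLE_HTML = """
<div>
  <h1>Houses</h1>
  <ul>
    <li>Two-bed flat</li>
    <li>Three-bed house</li>
  </ul>
</div>
"""


class TreePrinter(HTMLParser):
    """Collects one indented line per opening tag to show how elements nest."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many open tags we are currently inside
        self.lines = []  # the indented outline, one entry per opening tag

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one level of nesting: the `<li>` tags sit inside the `<ul>`, which sits inside the `<div>`.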


[Let's play around with some HTML](https://code.sololearn.com/WoNr8gIeKYDr/)

### How can we get the website HTML, which contains data that we want?

When you enter a URL in a browser, here's what happens:
- Your browser makes a **GET request** to the computer (server) that handles requests for that URL.
- The server knows what web content to send back, so it sends it in a response to the request. This response includes the HTML of the page that you want to view.
- Your browser receives the HTML and knows how to present it to you (it renders the webpage).

The point is that you can get the HTML, which defines the content of a page, by making a GET request to that page's URL.

Let's try that!

We can use the `requests` library to get the HTML from a website.

In [None]:
import requests # import the requests library
r = requests.get('http://pythonscraping.com/pages/page3.html') # make an HTTP GET request to this website
html_string = r.text # the text attribute of this response is the HTML as a string
print(html_string)
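
The request/response exchange described earlier can also be reproduced entirely on your own machine, which shows both sides of it. The following is only a sketch under some assumptions: the page content, the `PageHandler` class and the variable names are made up, and it uses the standard library's `http.server` and `urllib.request` in place of `requests` so that it runs without a network connection.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page content, made up for this sketch.
PAGE = b"<html><body><h1>Hello, scraper!</h1></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    """Plays the role of the server: answers every GET request with PAGE."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), PageHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The server is local, so build an opener that ignores any configured proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

try:
    # Play the role of the browser: send a GET request and read the response.
    with opener.open(f"http://127.0.0.1:{port}/", timeout=5) as response:
        status = response.status
        html_text = response.read().decode("utf-8")
finally:
    server.shutdown()
    server.server_close()

print(status)
print(html_text)
```

The status code `200` means the request succeeded, and the body of the response is exactly the HTML that the server chose to send back.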

## BeautifulSoup

What we saw above only gives us the HTML as one long string, and searching that string for specific data is a pain. We can use the **BeautifulSoup** library to extract data from the HTML by looking for specific tags and their attributes.

In [None]:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://pythonscraping.com/pages/page3.html')
html = page.text # Get the HTML of the webpage as a string
soup = BeautifulSoup(html, 'html.parser') # Parse it into a BeautifulSoup object, which provides methods that make searching for tags easier
print(soup.prettify())

Let's work through an example using the same page as above: [`http://pythonscraping.com/pages/page3.html`](http://pythonscraping.com/pages/page3.html).

On that webpage you will find a small table of items. Let's say that you want to extract the data for the Fish Painting.

<p align="center">
  <img src='images/BS4_1.png' width=500>
  <figcaption align="center"><cite>Sample Website</cite></figcaption>
</p>


In your browser, right-click on the Fish Painting row and select **Inspect** (called **Inspect Element** in some browsers). The developer tools will show the HTML for the page: the visible content sits inside the `<body>` tag, and the Fish Painting is in a `<tr>` tag.

<p align="center">
  <img src='images/BS4_2.png' width=500>
  <figcaption align="center"><cite>HTML Inspect Element Output</cite></figcaption>
</p>

You can find that tag using the `find` method, which accepts the tag name and the attributes of the tag.

In [None]:
fish = soup.find(name='tr', attrs={'id': 'gift3', 'class': 'gift'}) # If it doesn't find anything it returns None

print(fish)
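
Because `find` returns `None` when nothing matches, it is worth checking the result before using it. The sketch below illustrates this on a small inline table; the HTML is a simplified stand-in for the sample page (an assumption, not its real markup), and names such as `sample_soup` are chosen so that they don't overwrite the variables used in this notebook.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the table on the sample page.
# The markup and values are made up for illustration.
SAMPLE_HTML = """
<table id="giftList">
  <tr id="gift3" class="gift"><td>Fish Painting</td><td>A painting of a fish</td><td>$10.00</td></tr>
  <tr id="gift4" class="gift"><td>Dead Parrot</td><td>An ex-parrot</td><td>$0.50</td></tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

found = sample_soup.find("tr", attrs={"id": "gift3"})
missing = sample_soup.find("tr", attrs={"id": "gift99"})  # no such row exists

for label, row in [("gift3", found), ("gift99", missing)]:
    if row is None:
        # Calling row.text here would raise an AttributeError, so handle it first.
        print(f"{label}: not found")
    else:
        # row.td is shorthand for the first <td> inside the row.
        print(f"{label}: {row.td.text}")
```

Checking for `None` up front gives a clear message when a page's layout changes, instead of an `AttributeError` further down the script.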

Inside the `tr` tag, you will find different `<td>` tags. You can find all the `<td>` tags using the method `find_all` that accepts the tag name and the attributes of said tag.

In [None]:
fish_row = fish.find_all('td') # This returns a list where each item corresponds to each td tag 

You now have a list in which each element corresponds to one column of the row, so you can index the list to get the data you want.

In [None]:
title = fish_row[0].text
description = fish_row[1].text
price = fish_row[2].text

print(title)
print(description)
print(price)
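
Text pulled out of a page often carries surrounding whitespace, and a price arrives as a string rather than a number. As a rough sketch of how this might be cleaned up for a model, the example below uses `get_text(strip=True)` and then converts the price to a `float`. The row HTML, the dollar formatting and the variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical row with whitespace deliberately left around the values.
ROW_HTML = """
<table><tr>
  <td>
    Fish Painting
  </td>
  <td>
    A painting of a fish
  </td>
  <td>
    $1,250.50
  </td>
</tr></table>
"""

cells = BeautifulSoup(ROW_HTML, "html.parser").find_all("td")

raw_price = cells[2].text                    # still includes newlines and spaces
clean_title = cells[0].get_text(strip=True)  # whitespace removed from both ends
clean_price = cells[2].get_text(strip=True)

# Drop the currency symbol and thousands separator before converting.
price_value = float(clean_price.replace("$", "").replace(",", ""))

print(repr(raw_price))
print(clean_title)
print(price_value)
```

The same idea applies to the house-price challenge: the sale price needs to become a number before it can serve as the response variable.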

You can keep navigating the tree to find more data. For example, you can reach the parrot row by using the fact that it is the next sibling of the fish row.

In [None]:
parrot = fish.find_next_sibling()

You can also find the tags nested inside the parrot row using the `findChildren` method, which is an older alias of `find_all`:

In [None]:
parrot_children = parrot.findChildren()
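
Rather than navigating to one row at a time, you can also loop over every matching row and collect the cells into a list of dictionaries, which is closer to what the house-price challenge needs. This is a minimal sketch on an inline table; the markup is a simplified stand-in for the sample page with made-up values, and the dictionary keys are hypothetical names.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the sample page's table (values are made up).
SAMPLE_HTML = """
<table id="giftList">
  <tr><th>Item Title</th><th>Description</th><th>Cost</th></tr>
  <tr id="gift1" class="gift"><td>Vegetable Basket</td><td>Fresh vegetables</td><td>$15.00</td></tr>
  <tr id="gift3" class="gift"><td>Fish Painting</td><td>A painting of a fish</td><td>$10.00</td></tr>
  <tr id="gift4" class="gift"><td>Dead Parrot</td><td>An ex-parrot</td><td>$0.50</td></tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

records = []
# class_="gift" skips the header row, which has no class attribute.
for row in sample_soup.find_all("tr", class_="gift"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    records.append(
        {"title": cells[0], "description": cells[1], "price": cells[2]}
    )

for record in records:
    print(record)
```

A list of dictionaries like `records` can be passed straight to a tool such as `pandas.DataFrame` if you want the data in tabular form.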

## Key Takeaways

- _Web scraping_ is one of the most popular techniques used in industry to obtain data. It is about pulling information from websites in a programmatic manner.
- A _website_ contains content stored typically in an HTML file format. Websites also have styling and useful interface items like buttons to make it easier for humans to interact with the data.
- HTML stands for HyperText Markup Language. It is the standard format in which websites store content.
- Browsers obtain information from a website by sending an HTTP `GET` request. The response usually contains HTML.
- _BeautifulSoup_ is a Python library that is used to extract data from the HTML that is returned to our computer in response to a request.
- HTML information is stored in _tags_ such as `<body>` and `<tr>`. BeautifulSoup can parse these HTML tags and extract data based on the parameters we provide.