# Beautiful Soup

### Requests

To get a website's HTML, we first need to make a request for the webpage's content. To learn more about requests in a general sense, you can check out this article.

Python's requests library (a third-party package) makes getting content really easy. All we have to do is import the library and pass in the URL we want to GET:
```python
import requests

webpage = requests.get('https://www.codecademy.com/articles/http-requests')
print(webpage.text)
```
This code will print out the HTML of the page.
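
In practice a request can fail or hang, so it can help to check the response before using it. The following is a minimal sketch, not part of the original lesson: `fetch_html` is a hypothetical helper name, and the 10-second default timeout is an arbitrary assumption. It relies on the `timeout` argument of `requests.get` and on `Response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the decoded HTML of `url`, raising if the request fails.

    Hypothetical helper for illustration; the default timeout is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx status codes instead of
    # silently returning an error page.
    response.raise_for_status()
    # .text is the body decoded to str; .content would be the raw bytes.
    return response.text
```

Calling `fetch_html('https://www.codecademy.com/articles/http-requests')` would then return the same HTML that the snippet above prints, provided the request succeeds.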

We don’t want to unleash a bunch of requests on any one website in this lesson, so for the rest of the lesson we will scrape a single practice page, shellter.html, hosted at content.codecademy.com.

```python
import requests

webpage_response = requests.get('https://content.codecademy.com/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content

print(webpage)
```

### The BeautifulSoup Object

When we printed out all of that HTML from our request, it seemed pretty long and messy. How could we pull out the relevant information from that long string?

BeautifulSoup is a Python library that makes it easy for us to traverse an HTML page and pull out the parts we’re interested in. We can import it by using the line:
```python
from bs4 import BeautifulSoup
```

Then, all we have to do is convert the HTML document to a BeautifulSoup object!

If this is our HTML file, rainbow.html:
``` html
<body>
  <div>red</div>
  <div>orange</div>
  <div>yellow</div>
  <div>green</div>
  <div>blue</div>
  <div>indigo</div>
  <div>violet</div>
</body>
```
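we can open the file and pass it to BeautifulSoup, which parses the markup it is given rather than a file path:
```python
with open("rainbow.html") as html_file:
    soup = BeautifulSoup(html_file, "html.parser")
```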
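
As a self-contained sketch of loading a file into BeautifulSoup: the constructor parses the markup it is handed, which can be a string or an open file handle, so the file needs to be opened first. The example below writes the rainbow markup to a temporary file only so that it runs anywhere; the temporary directory and the variable names are assumptions for illustration.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

RAINBOW_HTML = """<body>
  <div>red</div>
  <div>orange</div>
  <div>yellow</div>
  <div>green</div>
  <div>blue</div>
  <div>indigo</div>
  <div>violet</div>
</body>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a stand-in for rainbow.html so the example is runnable.
    html_path = Path(tmp_dir) / "rainbow.html"
    html_path.write_text(RAINBOW_HTML, encoding="utf-8")

    # Pass the open file handle (not the path string) to BeautifulSoup.
    with open(html_path, encoding="utf-8") as html_file:
        soup = BeautifulSoup(html_file, "html.parser")

# The parsed document stays available after the file is closed.
first_color = soup.div.string
print(first_color)  # red
```
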
"html.parser" is one option for parsers we could use. There are other options, like "lxml" and "html5lib" that have different advantages and disadvantages, but for our purposes we will be using "html.parser" throughout.

With the requests skills we just learned, we can use a website hosted online as that HTML:
```python
webpage = requests.get("http://rainbow.com/rainbow.html")
soup = BeautifulSoup(webpage.content, "html.parser")
```
When we use BeautifulSoup in combination with pandas, we can turn websites into DataFrames that are easy to manipulate and gain insights from.

#### Task

1. Import the BeautifulSoup package.


2. Create a BeautifulSoup object out of the webpage content and call it soup. Use "html.parser" as the parser. Print out soup! Look at how it contains all of the HTML of the page! We will learn how to traverse this content and find what we need in the next exercises.

```python
import requests
from bs4 import BeautifulSoup

webpage_response = requests.get('https://content.codecademy.com/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content

soup = BeautifulSoup(webpage, "html.parser")

print(soup)
```

### Object Types
BeautifulSoup breaks the HTML page into several types of objects.

#### Tags
A Tag corresponds to an HTML tag in the original document. These lines of code:
```python
soup = BeautifulSoup('<div id="example">An example div</div><p>An example p tag</p>', "html.parser")
print(soup.div)
```
would produce output that looks like:
```text
<div id="example">An example div</div>
```
Accessing a tag from the BeautifulSoup object in this way will get the first tag of that type on the page.

You can get the name of the tag using `.name` and a dictionary representing the attributes of the tag using `.attrs`:
```python
print(soup.div.name)
print(soup.div.attrs)
```
```text
div
{'id': 'example'}
```
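
Because `.attrs` is a dictionary, individual attributes can also be read directly from the tag. The sketch below assumes a slightly extended div: the `class` attribute is added here purely for illustration and is not part of the example above. It shows dictionary-style access, the list that BeautifulSoup returns for the multi-valued `class` attribute, and `.get()` for an attribute that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class attribute is an addition for illustration.
markup = '<div id="example" class="box highlighted">An example div</div>'
soup = BeautifulSoup(markup, "html.parser")
first_div = soup.div

div_id = first_div["id"]           # dictionary-style lookup
div_classes = first_div["class"]   # class is multi-valued, so this is a list
div_title = first_div.get("title") # .get() returns None when the attribute is missing

print(div_id)       # example
print(div_classes)  # ['box', 'highlighted']
print(div_title)    # None
```

Using `.get()` avoids the `KeyError` that `first_div["title"]` would raise for a missing attribute.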

#### NavigableStrings
NavigableStrings are the pieces of text that are in the HTML tags on the page. You can get the string inside of the tag by calling `.string`:
``` python
print(soup.div.string)
```
```text
An example div
```
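
One caveat worth sketching: `.string` only returns a value when the tag has a single child. In the example below, the nested `<b>` tag is an assumption added for illustration; with it, the p tag has several children, so `.string` is `None`, while `.get_text()` still returns the combined text.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: the <b> tag inside <p> is added for illustration.
markup = '<div id="example">An example div</div><p>An <b>example</b> p tag</p>'
soup = BeautifulSoup(markup, "html.parser")

div_string = soup.div.string
print(isinstance(div_string, NavigableString))  # True
print(div_string)                               # An example div

# The p tag contains text, a <b> tag, and more text, so .string is ambiguous.
p_string = soup.p.string
p_text = soup.p.get_text()
print(p_string)  # None
print(p_text)    # An example p tag
```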

#### Tasks
1. Print out the first p tag on the shellter.html page.

2. Print out the string associated with the first p tag on the shellter.html page.

```python
import requests
from bs4 import BeautifulSoup

webpage_response = requests.get('https://content.codecademy.com/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

print(soup.p)
print(soup.p.string)
```
```text
<p class="text">Click to learn more about each turtle</p>
Click to learn more about each turtle
```
