# Web Scraping with BeautifulSoup: Codecademy Course

## Introduction

Before we get started, a quick note on prerequisites: this course requires knowledge of Python. Some familiarity with the Python library Pandas will also be helpful later in the lesson, but it isn’t strictly necessary. If you haven’t already, check out those courses before taking this one. Okay, let’s get scraping!

In Data Science, we can do a lot of exciting work with the right dataset. Once we have interesting data, we can use Pandas or Matplotlib to analyze or visualize trends. But how do we get that data in the first place?

If it’s provided to us in a well-organized CSV or JSON file, we’re lucky! Most of the time, though, we need to go out and search for it ourselves.

Often you’ll find the perfect website that has all the data you need, but there’s no way to download it. This is where BeautifulSoup comes in handy: it lets us scrape the HTML. If we find the data we want to analyze online, we can use BeautifulSoup to grab it and turn it into a structure we can understand. This Python library, which takes its name from a song in Alice in Wonderland, allows us to easily and quickly take information from a website and put it into a DataFrame.

### Instructions

1. We’ve used BeautifulSoup to take the turtle data from the Shellter website in the browser and put it into a DataFrame.

Explore the website a bit. Then, print the DataFrame turtles to see how this data is organized.

In [None]:
from preprocess import turtles
print(turtles)

                            0  ...                             4
Aesop        AGE: 7 Years Old  ...    SOURCE: found in Lake Erie
Caesar       AGE: 2 Years Old  ...      SOURCE: hatched in house
Sulla         AGE: 1 Year Old  ...    SOURCE: found in Lake Erie
Spyro        AGE: 6 Years Old  ...      SOURCE: hatched in house
Zelda        AGE: 3 Years Old  ...  SOURCE: surrendered by owner
Bandicoot    AGE: 2 Years Old  ...      SOURCE: hatched in house
Hal           AGE: 1 Year Old  ...  SOURCE: surrendered by owner
Mock        AGE: 10 Years Old  ...  SOURCE: surrendered by owner
Sparrow    AGE: 1.5 Years Old  ...    SOURCE: found in Lake Erie

[9 rows x 5 columns]
 

## Rules of Scraping

When we scrape websites, we have to make sure we are following some guidelines so that we are treating the websites and their owners with respect.

Always check a website’s Terms and Conditions before scraping. Read the statement on the legal use of data. Usually, the data you scrape should not be used for commercial purposes.

Do not spam the website with a ton of requests. A large number of requests can break a website that is unprepared for that level of traffic. As a general rule of good practice, make no more than one request per second to any one website.

If the layout of the website changes, you will have to change your scraping code to follow the new structure of the site.
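
The one-request-per-second guideline above can be sketched as a small helper that pauses between fetches. This is only an illustration: `fetch_politely` and its `delay` parameter are hypothetical names, and `fetch` stands for whatever function actually retrieves a page (for example `requests.get`).

```python
import time


def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, waiting `delay` seconds between calls.

    `fetch` is any callable that takes a URL and returns a response.
    (Hypothetical helper, not part of requests or BeautifulSoup.)
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request after the first
        results.append(fetch(url))
    return results
```

With the requests library imported, a call might look like `fetch_politely(list_of_urls, requests.get)`, which spaces the requests roughly one second apart.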

## Requests

In order to get the HTML of the website, we need to make a request for the content of the webpage. To learn more about requests in a general sense, you can check out this article: https://www.codecademy.com/articles/http-requests

Python has a *requests* library that makes getting content really easy. All we have to do is import the library, and then feed in the URL we want to *GET*:

In [None]:
import requests
 
webpage = requests.get('https://www.codecademy.com/articles/http-requests')
print(webpage.text)

This code will print out the HTML of the page.
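
A request can also fail (for example with a 404 or 500 status), in which case the response body is an error page rather than the HTML we wanted. Below is a minimal sketch of a more defensive fetch. The function name `fetch_html` and the 10-second timeout are illustrative choices, not part of the course.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as a string, raising on HTTP errors.

    Hypothetical helper; the timeout value is an arbitrary assumption.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text  # body decoded to str; .content would give the raw bytes
```

Calling `fetch_html('https://www.codecademy.com/articles/http-requests')` should return the same HTML that the snippet above prints, provided the request succeeds.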

We don’t want to unleash a bunch of requests on any one website in this lesson, so for the rest of this lesson we will be scraping a local HTML file and pretending it’s an HTML file hosted online.

### Instructions

1. Import the requests library.

2. Make a GET request to the URL containing the turtle adoption website:

https://content.codecademy.com/courses/beautifulsoup/shellter.html

Store the result of your request in a variable called webpage_response.

3. Store the content of the response in a variable called webpage by using .content.

Print webpage out to see the content of this HTML.

In [None]:
import requests #import the requests library
webpage_response = requests.get('https://content.codecademy.com/courses/beautifulsoup/shellter.html') #make a get request
webpage = webpage_response.content #store the content of the HTML
print(webpage) #print the result

## The BeautifulSoup Object

When we printed out all of that HTML from our request, it seemed pretty long and messy. How could we pull out the relevant information from that long string?

BeautifulSoup is a Python library that makes it easy for us to traverse an HTML page and pull out the parts we’re interested in. We can import it by using the line:

*from bs4 import BeautifulSoup*

Then, all we have to do is convert the HTML document to a BeautifulSoup object!

If this is our HTML file, rainbow.html:

<body>
  <div>red</div>
  <div>orange</div>
  <div>yellow</div>
  <div>green</div>
  <div>blue</div>
  <div>indigo</div>
  <div>violet</div>
</body>

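we can create a BeautifulSoup object by passing the open file to the constructor (passing only the file name would parse the name itself as literal text):

*soup = BeautifulSoup(open("rainbow.html"), "html.parser")*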
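
As a rough sketch of the two usual ways to build the object, the snippet below parses the rainbow markup once from a string and once from an open file. The temporary directory is only there to make the example self-contained, and the variable names are illustrative.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# The markup of rainbow.html, inlined as a string for this sketch
rainbow_html = """<body>
  <div>red</div>
  <div>orange</div>
  <div>yellow</div>
  <div>green</div>
  <div>blue</div>
  <div>indigo</div>
  <div>violet</div>
</body>"""

# 1) Parse markup that is already held in a string
soup_from_string = BeautifulSoup(rainbow_html, "html.parser")

# 2) Parse markup from an open file handle (written to a temp directory here)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "rainbow.html"
    path.write_text(rainbow_html, encoding="utf-8")
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

# Both routes should produce the same parsed document
print(str(soup_from_string) == str(soup_from_file))
```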

"html.parser" is one option for parsers we could use. There are other options, like "lxml" and "html5lib" that have different advantages and disadvantages, but for our purposes we will be using "html.parser" throughout.

With the requests skills we just learned, we can use a website hosted online as that HTML:

In [None]:
import requests
from bs4 import BeautifulSoup as bs
webpage = requests.get("http://rainbow.com/rainbow.html") #the parser is not a requests argument
soup = bs(webpage.content, "html.parser") #the parser is passed to BeautifulSoup

#When we use BeautifulSoup in combination with pandas, we can turn websites into DataFrames 
#that are easy to manipulate and gain insights from.
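
To make the BeautifulSoup-plus-pandas idea concrete, here is a small sketch under a few assumptions: it uses a shortened version of the rainbow markup as an inline string rather than a live website, it relies on `find_all`, a BeautifulSoup method that returns every matching tag, and the column name `color` is made up for the example.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened rainbow markup, inlined so the example needs no network access
html = """<body>
  <div>red</div>
  <div>orange</div>
  <div>yellow</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <div> into a plain Python list
colors = [div.get_text() for div in soup.find_all("div")]

# Wrap the list in a one-column DataFrame (the column name is an assumption)
df = pd.DataFrame({"color": colors})
print(df)
```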

### Instructions

1. Import the BeautifulSoup package.

2. Create a BeautifulSoup object out of the webpage content and call it soup. Use "html.parser" as the parser.

Print out soup! Look at how it contains all of the HTML of the page! We will learn how to traverse this content and find what we need in the next exercises.

In [None]:
#we import the needed libraries
import requests
from bs4 import BeautifulSoup as bs

#then we use *get* from *requests* to pull the website html content into *webpage_response*
webpage_response = requests.get('https://content.codecademy.com/courses/beautifulsoup/shellter.html')
#we store that content in the variable webpage
webpage = webpage_response.content
#then we create a soup object to store the parsed html of the page
soup = bs(webpage,'html.parser')
print(soup)

## Object Types

BeautifulSoup breaks the HTML page into several types of objects.

## Tags

A Tag corresponds to an HTML tag in the original document. These lines of code:

soup = BeautifulSoup('<div id="example">An example div</div><p>An example p tag</p>', "html.parser")
print(soup.div)

would produce output that looks like:

<div id="example">An example div</div>

Accessing a tag from the BeautifulSoup object in this way will get the first tag of that type on the page.

You can get the name of the tag using .name and a dictionary representing the attributes of the tag using .attrs:

print(soup.div.name)
print(soup.div.attrs)

div
{'id': 'example'}

In [None]:
#we try the code explained in the previous paragraph
soup = bs('<div id="example">An example div</div><p>An example p tag</p>', 'html.parser')
print(soup.div)
print(soup.p)

In [None]:
print(soup.div.name) #div
print(soup.div.attrs) #{'id': 'example'}
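
Besides `.name` and `.attrs`, a tag also supports dictionary-style access to a single attribute. The sketch below assumes a slightly extended version of the example div: the `class` attribute and its values are added purely for illustration.

```python
from bs4 import BeautifulSoup

# Example markup; the class attribute is a made-up addition for this sketch
soup = BeautifulSoup(
    '<div id="example" class="box wide">An example div</div>', "html.parser"
)
tag = soup.div

tag_id = tag["id"]            # 'example'
missing = tag.get("title")    # None: .get() avoids a KeyError for absent attributes
classes = tag["class"]        # multi-valued attributes come back as a list

print(tag_id, missing, classes)
```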

## NavigableStrings

NavigableStrings are the pieces of text inside the HTML tags on the page. You can get the string inside of a tag by accessing its .string attribute:

In [None]:
print(soup.div.string) #An example div
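
One caveat worth sketching: `.string` only returns text when a tag has a single child. For a tag with mixed content it returns `None`, and `get_text()` can be used instead. The `<p>` markup below is a made-up example.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(
    '<div id="example">An example div</div><p>Some <b>bold</b> text</p>',
    "html.parser",
)

div_text = soup.div.string    # a NavigableString, which behaves like a str
p_string = soup.p.string      # None, because <p> has more than one child
p_text = soup.p.get_text()    # concatenates all text inside <p>

print(type(div_text).__name__)
print(p_string)
print(p_text)
```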

### Instructions

1. Print out the first p tag on the shellter.html page.

2. Print out the string associated with the first p tag on the shellter.html page.

In [None]:
#First we import the needed libraries
import requests
from bs4 import BeautifulSoup as bs

#then we request to access a webpage content, and save it
webpage_response = requests.get('https://content.codecademy.com/courses/beautifulsoup/shellter.html')
webpage = webpage_response.content

#then we create the bs object
soup = bs(webpage, 'html.parser')

#now we solve exercise 1
print(soup.p) #output --> <p class="text">Click to learn more about each turtle</p>

#and we solve exercise 2
print(soup.p.string) #output --> Click to learn more about each turtle


In [None]:
#visualizing the webpage using parser
soup = bs(webpage, 'html.parser')
print(soup)

In [None]:
#visualizing the website without naming a parser: BeautifulSoup picks one itself and
#emits a warning; here the output barely differs
soup2 = bs(webpage)
print(soup2)

In [None]:
#prettify() returns the HTML indented to show its hierarchy
soup3 = bs(webpage, 'html.parser')
print(soup3.prettify())

#### CHECK THE BeautifulSoup library documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
#### CHECK THE CHEATSHEET FOR THE CODECADEMY BEAUTIFULSOUP COURSE

