## Using BeautifulSoup

**Task:** From the [SPICED Academy homepage](https://www.spiced-academy.com/), extract the following paragraph using web scraping:
- "Unlock the power of data with our immersive Data Science bootcamps."

*Tip:*
- Use a browser that lets you view the HTML code side by side with the rendered webpage.
    - For example, in Google Chrome, you can right-click (or ctrl+click) a webpage and click '*View Page Source*'.
    - You can also click '*Inspect*' to see interactively which part of the HTML code corresponds to the section of the website you're interested in.

*JSON vs. HTML:*
- JSON -- data structure typical of APIs
    - cleaner data, easier to parse
- HTML -- markup language used to structure web pages
    - messier data, requires specialized tools to parse
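To make the contrast concrete, here is a minimal sketch that reads the same information from a JSON string and from an HTML string. Both payloads, including the course name and the number of weeks, are made up for illustration and are not taken from any real API or page.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical payloads carrying the same information in both formats.
json_payload = '{"course": "Data Science", "weeks": 12}'
html_payload = '<div class="course"><h2>Data Science</h2><p>12 weeks</p></div>'

# JSON maps directly onto Python dicts and lists.
record = json.loads(json_payload)
print(record["course"], record["weeks"])

# HTML has to be parsed into a tree and searched for the right tags.
soup = BeautifulSoup(html_payload, "html.parser")
course = soup.find("h2").text
weeks = int(soup.find("p").text.split()[0])
print(course, weeks)
```

The JSON values come out already typed, while the HTML values arrive as text that still has to be located, split, and converted.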

### Step 1. Get the HTML text from a website

There are multiple ways to get to the solution!

``.find()`` returns the first match for your 'query', or ``None`` if nothing matches.

``.find_all()`` returns a list-like object (called a "ResultSet") that contains every matching result; it is empty if nothing matches.

``.text`` returns the part of the tag that is outside of the < angle brackets > (i.e. the text).

#### Download the HTML from a web page using the requests module

In [10]:
import requests
import re
import pandas as pd

response = requests.get('https://www.spiced-academy.com/')
spiced_html = response.text
print(type(spiced_html), response)


<class 'str'> <Response [200]>
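A `<Response [200]>` means the request succeeded. As a hedged sketch, the download can be wrapped in a small helper that fails loudly on any other outcome. The name `fetch_html` and the 10-second timeout are assumptions chosen for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising if the request fails."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# spiced_html = fetch_html('https://www.spiced-academy.com/')
```

Setting a timeout keeps the notebook from hanging indefinitely if the server does not answer.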


In [11]:
print(spiced_html)

<!DOCTYPE html>
<html lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#" dir="ltr">

<head>
    <title>Your new career starts here | Spiced Academy</title>
    <meta name="description" content="Kickstart your new career with our intensive, on-site tech programs in Web Development and Data Science.">
    
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <link href="https://fonts.googleapis.com/css?family=Poppins:300,400,600&display=swap" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css?family=IBM+Plex+Mono:400,500&display=swap" rel="stylesheet">
    <link rel='stylesheet' href='/css/main.css'>
    <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png?v=3">
    <link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png?v=3">
    <link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png?v=3">
    <link rel="mask-icon" href="/safari-pinned-tab.svg" color

### Step 2. Convert the raw HTML string to a BeautifulSoup object, so that we can parse the data.

Did you notice from the definition of HTML that there are apparently both markup and Markdown?

For the shortest explanation for now, see: https://www.quora.com/What-is-the-difference-between-markup-and-markdown

In [12]:
from bs4 import BeautifulSoup
spiced_soup = BeautifulSoup(spiced_html, 'html.parser')
print(f"""{type(spiced_soup)} \n
{spiced_soup}""")


<class 'bs4.BeautifulSoup'> 

<!DOCTYPE html>

<html dir="ltr" lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">
<head>
<title>Your new career starts here | Spiced Academy</title>
<meta content="Kickstart your new career with our intensive, on-site tech programs in Web Development and Data Science." name="description"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport">
<link href="https://fonts.googleapis.com/css?family=Poppins:300,400,600&amp;display=swap" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=IBM+Plex+Mono:400,500&amp;display=swap" rel="stylesheet"/>
<link href="/css/main.css" rel="stylesheet"/>
<link href="/apple-touch-icon.png?v=3" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png?v=3" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png?v=3" rel="icon" sizes="16x16" type="image/png"/>
<link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-
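Once the string is a BeautifulSoup object, individual tags can be reached directly instead of printing the whole document. The sketch below uses a shortened copy of the `<head>` shown above, so it runs without a network connection; the names `sample_html` and `sample_soup` are only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the page head printed above.
sample_html = """<!DOCTYPE html>
<html lang="en">
<head>
    <title>Your new career starts here | Spiced Academy</title>
    <meta name="description" content="Kickstart your new career with our intensive, on-site tech programs in Web Development and Data Science.">
</head>
<body></body>
</html>"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tags are reachable as attributes of the soup object.
page_title = sample_soup.title.string

# Tag attributes behave like dictionary keys.
description = sample_soup.find("meta", attrs={"name": "description"})["content"]

print(page_title)
print(description)
```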

For background on the `div` element and the `class` attribute, see: https://www.w3schools.com/html/html_classes.asp

In [13]:
# Take the first div with class 'city-cohort-dates', then its third <p> tag
spiced_soup.find_all('div', attrs={'class': 'city-cohort-dates'})[0].find_all('p')[2]


<p>Unlock the power of data with our immersive Data Science bootcamps.</p>
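The same lookup can be written in other ways. The sketch below shows two alternatives, the `class_` keyword and a CSS selector via `.select()`, on a made-up snippet that only imitates the structure queried above. The first two paragraphs are placeholders, not text from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure queried above.
sample_html = """
<div class="city-cohort-dates">
  <p>First paragraph</p>
  <p>Second paragraph</p>
  <p>Unlock the power of data with our immersive Data Science bootcamps.</p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Same lookup as above, using the class_ keyword instead of attrs.
via_find = sample_soup.find("div", class_="city-cohort-dates").find_all("p")[2]

# CSS selector: all <p> tags inside div.city-cohort-dates.
via_select = sample_soup.select("div.city-cohort-dates p")[2]

print(via_find.text)
print(via_select.text)
```

Indexing with `[2]` depends on the order of paragraphs on the page, so it can break if the site layout changes.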

### Step 3. Use the BeautifulSoup object to parse the HTML document tree down to the tag that contains the data you want.
- There are multiple ways to get to the solution!
- .find() returns the first match for your "query", or None if nothing matches.
- .find_all() returns a list-like object (called a "ResultSet") that contains every matching result; it is empty if nothing matches.
- .text returns the part of the tag that is outside of the **< angle brackets >** (i.e. the text).
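The behaviour of these three tools can be checked on a tiny, made-up document. This is a minimal sketch; the list markup is hypothetical.

```python
from bs4 import BeautifulSoup

sample_html = "<ul><li>first</li><li>second</li></ul>"  # hypothetical markup
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_item = sample_soup.find("li")      # a single Tag
all_items = sample_soup.find_all("li")   # a ResultSet of Tags

print(first_item.text)
print([item.text for item in all_items])

# No match: find() gives None, find_all() gives an empty ResultSet.
print(sample_soup.find("table"))
print(len(sample_soup.find_all("table")))
```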

#### Let's check the h1 level headers instead

In [14]:
spiced_soup.find_all('h1', attrs={'class': 'main-heading'})


[<h1 class="main-heading main-heading--future">
             
                     Your future in tech
             </h1>,
 <h1 class="main-heading main-heading--space-bottom main-heading--secondary main-heading--space-bottom">
                 Your new career starts here
             
             </h1>,
 <h1 class="main-heading">
                 
                         Our Programs
                 </h1>,
 <h1 class="main-heading main-heading--off-center-left main-heading--space-bottom">Career <br/> Services</h1>,
 <h1 class="main-heading main-heading--space-bottom main-heading--primary">
                 What our students say:
             
             </h1>,
 <h1 class="main-heading main-heading--space-bottom main-heading--primary">
                 What our students say:
             
             </h1>,
 <h1 class="main-heading main-heading--space-bottom main-heading--primary">
                 What our students say:
             
             </h1>,
 <h1 class="main-heading 

**Which index has the career services?**

In [15]:
spiced_soup.find_all('h1')[:4]

[<h1 class="main-heading main-heading--future">
             
                     Your future in tech
             </h1>,
 <h1 class="main-heading main-heading--space-bottom main-heading--secondary main-heading--space-bottom">
                 Your new career starts here
             
             </h1>,
 <h1 class="main-heading">
                 
                         Our Programs
                 </h1>,
 <h1 class="main-heading main-heading--off-center-left main-heading--space-bottom">Career <br/> Services</h1>]

**It looks like it's the fourth header, i.e. index 3.**

Access the text of a single element with .text.

In [16]:
spiced_soup.find_all('h1')[3]#.text

<h1 class="main-heading main-heading--off-center-left main-heading--space-bottom">Career <br/> Services</h1>
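Because this header contains a `<br/>` tag, `.text` leaves two spaces between the words. As a small sketch, `get_text()` with a separator and `strip=True` gives a cleaner string. The class list is abbreviated here for readability.

```python
from bs4 import BeautifulSoup

# The header from above, with its class list abbreviated.
heading_html = '<h1 class="main-heading">Career <br/> Services</h1>'
heading = BeautifulSoup(heading_html, "html.parser").find("h1")

raw_text = heading.text
clean_text = heading.get_text(" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```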

**Now go back to the paragraph found at the end of Step 2 above and extract its text without the HTML tags.**

### EXERCISE

Check the error you get with the command:

``spiced_soup.find_all('h1')[:3].text``

Write the command for finding the text in all of the h1 headers (use a for loop).

In [44]:
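header_list = []
for element in spiced_soup.find_all('h1'):
    # Remove line breaks; the surrounding spaces are still kept
    header = element.text.replace('\n', '')
    print(element.text)  # prints the raw text, including line breaks
    header_list.append(header)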


            
                    Your future in tech
            

                Your new career starts here
            
            

                
                        Our Programs
                
Career  Services

                What our students say:
            
            

                What our students say:
            
            

                What our students say:
            
            

                What our students say:
            
            

                What our students say:
            
            


In [45]:
header_list

['                                Your future in tech            ',
 '                Your new career starts here                        ',
 '                                        Our Programs                ',
 'Career  Services',
 '                What our students say:                        ',
 '                What our students say:                        ',
 '                What our students say:                        ',
 '                What our students say:                        ',
 '                What our students say:                        ']
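The headers still carry the indentation of the original HTML. One possible cleanup, sketched here on three of the strings from the list above, is to split each string on whitespace and join the pieces with single spaces. The names `raw_headers` and `clean_headers` are only for illustration.

```python
# Three of the strings from header_list above.
raw_headers = [
    '                                Your future in tech            ',
    'Career  Services',
    '                What our students say:                        ',
]

# str.split() with no arguments splits on any run of whitespace,
# so joining with a single space also collapses the double space.
clean_headers = [" ".join(header.split()) for header in raw_headers]
print(clean_headers)
```

Unlike `.strip()`, this also normalizes whitespace inside the string, which matters for the "Career  Services" header.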