# Web Scraping Goals:
 - Find out how to grab some stuff out of a web page that might be useful for us
 - Find out how to do that more than once
 - Find out how to identify that stuff
 - Maybe learn a little more about HTML along the way
 - Put all that together to make an acquisition script (just like before)

# Intro to Web Scraping
- Use `requests` to download the HTML
- Use `BeautifulSoup` to parse that HTML to get the thing(s) you need

## Process
- Step 1: use the `requests` library to make an HTTP request across the web
- Step 2: use the `response.text` property on the `response` object to get the text of the HTML
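The two steps above can be sketched as a small helper. This is a minimal sketch that assumes the `requests` package is installed; the name `fetch_html` and the 10-second timeout are illustrative choices, not something from these notes.

```python
from requests import get


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string (hypothetical helper)."""
    # Step 1: make the HTTP request
    response = get(url, timeout=timeout)
    # raise an error for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    # Step 2: the decoded text of the HTML
    return response.text
```

Calling `fetch_html('https://site-to-scrape.glitch.me')` would return the page source as a string, provided the site is reachable.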

In [4]:
from requests import get

In [5]:
# define a url just like we did previously
url = 'https://site-to-scrape.glitch.me'

In [None]:
# trying to call .json() on the response, as we did with API content,
# will unfortunately raise an error when the page is HTML made for humans:
# get(url).json()
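To see why HTML breaks JSON parsing without touching the network, here is a small sketch that uses the standard library `json` module as a stand-in for `response.json()`. The helper name `parse_json_or_none` and the sample strings are made up for illustration.

```python
import json

html_text = "<!DOCTYPE html><html><body><h1>This is the header!</h1></body></html>"
api_text = '{"status": "ok", "items": [1, 2, 3]}'


def parse_json_or_none(text):
    """Return the parsed JSON, or None when the text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        return None


print(parse_json_or_none(api_text))   # parsed into a dictionary
print(parse_json_or_none(html_text))  # HTML is not JSON, so we get None
```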

In [10]:
# grab the raw content from our url
# .content returns the bytes (note the b'...' prefix below) that make up
# the HTML of the static site; .text would return the same page decoded to a string
get(url).content

b'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <title>Site to Scrape!</title>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    \n    <!-- import the webpage\'s stylesheet -->\n    <link rel="stylesheet" href="/style.css">\n    \n    <!-- import the webpage\'s javascript file -->\n    <script src="/script.js" defer></script>\n  </head>  \n  <body>\n    <header>\n      <h1>This is the header!</h1>\n      <hr>\n    </header>\n    \n    <main>\n      <div>\n        <h1 class="first">\n        This is the main\n        </h1>\n        <h2>\n          This is an h2 of main\n        </h2>\n        <h3>\n          H3 inside of first div inside of main\n        </h3>\n      </div>\n      <div>\n        <h3 class="first">\n          H3 inside of second div inside of main.\n        </h3>\n        <p>\n          Here\'s some text content for us to scrape! \xf0\x9f\x91\xbd\n      
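The `\xf0\x9f\x91\xbd` sequence in the output above is an emoji stored as UTF-8 bytes. A short sketch of the difference between bytes and text, using a byte string copied from that output (the variable names are illustrative):

```python
# a fragment of the bytes returned by .content
raw = b"Here's some text content for us to scrape! \xf0\x9f\x91\xbd"

# decoding turns the four emoji bytes into a single character
decoded = raw.decode("utf-8")

print(type(raw).__name__, len(raw))
print(type(decoded).__name__, len(decoded))
print(decoded)
```

Passing either the bytes or the decoded string to `BeautifulSoup` works; the bytes form lets the parser detect the encoding itself.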

In [11]:
# let's use Beautiful Soup to parse our response content:
from bs4 import BeautifulSoup

In [13]:
# we'll keep this url in our pocket for later
url2 = 'https://web-scraping-demo.zgulde.net/news'

In [14]:
# make a soup:
# recipe:
# call BeautifulSoup on the content of our response
soup = BeautifulSoup(get(url).content, 'html.parser')

In [18]:
# If we look at soup, it has the same structure as the text, but a little cleaner
# and furthermore, it's a new object -- a BeautifulSoup object
soup

<!DOCTYPE html>

<html lang="en">
<head>
<title>Site to Scrape!</title>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- import the webpage's stylesheet -->
<link href="/style.css" rel="stylesheet"/>
<!-- import the webpage's javascript file -->
<script defer="" src="/script.js"></script>
</head>
<body>
<header>
<h1>This is the header!</h1>
<hr/>
</header>
<main>
<div>
<h1 class="first">
        This is the main
        </h1>
<h2>
          This is an h2 of main
        </h2>
<h3>
          H3 inside of first div inside of main
        </h3>
</div>
<div>
<h3 class="first">
          H3 inside of second div inside of main.
        </h3>
<p>
          Here's some text content for us to scrape! 👽
        </p>
<p>
          Here's another paragraph of content! ☠️
        </p>
<a href="https://github.com/ryanorsinger">Click here to visit my portfolio</a>
</div>
</main>
<footer>
<h1>This 

 - Begin with the end in mind!
 - Figure out what you want to grab from a web page
 - Figure out where it lives
 - Figure out how many you want to grab (if there's more than one)
 - Learn how to reference it

In [19]:
# <tag>
# some stuff
# </tag>
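As a rough illustration of the pieces of a tag, the sketch below parses one element from an inline string (modeled on the `h1` in the page above) and pulls out its name, attributes, and text. It assumes `bs4` is installed; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<h1 class="first">\n  This is the main\n</h1>'
tag = BeautifulSoup(html, "html.parser").h1

name = tag.name             # the tag's name: 'h1'
attributes = tag.attrs      # a dict of attributes; class values come back as a list
text = tag.text.strip()     # the text between the opening and closing tags

print(name, attributes, text)
```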

In [20]:
# things we can do with soup:
# reference elements directly!


In [28]:
# I can reference tags with dot notation, but it only returns the
# first match, which is a problem when I want more than one element.
# what can we do?
soup.p

<p>
          Here's some text content for us to scrape! 👽
        </p>

In [30]:
# find:
# think about find as "find first"
soup.find('p')

<p>
          Here's some text content for us to scrape! 👽
        </p>

In [39]:
# think about find_all as "find all elements"
# it returns a ResultSet rather than a plain list, but you index it
# and loop over it the same way: [index]
soup.find_all('p')[1]

<p>
          Here's another paragraph of content! ☠️
        </p>
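A compact side-by-side of dot notation, `find`, and `find_all`, run against a small inline snippet rather than the live page. The HTML and variable names here are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p>first paragraph</p>
  <p>second paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_by_dot = soup.p               # dot notation: first match only
first_by_find = soup.find("p")      # find: also the first match only
all_paragraphs = soup.find_all("p") # find_all: every match, in document order
missing = soup.find("table")        # find returns None when nothing matches

texts = [p.text.strip() for p in all_paragraphs]
print(texts)
```

Because `find` returns `None` when there is no match, chaining `.text` directly onto it raises an `AttributeError` on pages that lack the tag.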

In [45]:
# if we want to grab the content from each thing:
# use .text to remove the html tags
# use .strip() string method to remove the extra whitespace
# use regex for anything more involved
soup.find_all('p')[1].text.strip()

"Here's another paragraph of content! ☠️"

In [47]:
# the text content of our tag is already a string
type(soup.find_all('p')[1].text)

str

In [44]:
# remember escape sequences like \n (newline) and \t (tab) as we parse this stuff;
# note that the \t in 'some\text' prints as a tab:
print('here\nis\nsome\text')

here
is
some	ext
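When the backslash should be kept literally, a raw string or a doubled backslash avoids the accidental tab. A small sketch with illustrative variable names:

```python
interpreted = 'some\text'   # \t is read as a tab character
raw = r'some\text'          # raw string: the backslash is kept as-is
escaped = 'some\\text'      # doubled backslash: same result as the raw string

print(repr(interpreted))
print(repr(raw))
print(raw == escaped)
```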


In [51]:
# select is also a really useful one:
# select takes a CSS selector and returns every matching element in the page
# equivalents: select_one is like find
# select is like find_all
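A few common selector shapes, sketched against an inline snippet modeled on the page above. The HTML is trimmed and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<main>
  <div>
    <h1 class="first">This is the main</h1>
    <h3>H3 inside of first div</h3>
  </div>
  <div>
    <h3 class="first">H3 inside of second div</h3>
    <p>Some text</p>
  </div>
</main>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("h3")               # every h3
by_class = soup.select(".first")         # anything with class "first"
tag_and_class = soup.select("h3.first")  # only h3 tags with class "first"
first_match = soup.select_one("h3")      # like find: first match only
no_match = soup.select_one("table")      # None when nothing matches

print(len(by_tag), len(by_class), len(tag_and_class))
```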

In [53]:
len(soup.select('h1'))

3

In [55]:
len(soup.find_all('h1'))

3

In [57]:
# use list comprehension to grab all the content that 
# I wanted from the h1 header tag:
# translation:
# the thing's text stripped of whitespace
# for every element
# that results from the soup.select() call
# on my h1 named tag
[thing.text.strip() for thing in soup.select('h1')]

['This is the header!', 'This is the main', 'This is the footer']

# Let's build on this:
 - Let's build a new task for ourselves:
 - Examine a new url
 - Figure out what we want to grab from it
 - Figure out how to grab it
 - Figure out how to do it programmatically

In [58]:
url2

'https://web-scraping-demo.zgulde.net/news'

In [63]:
# let's grab content from url2:
# use get :
response = get(url2).content
# use soup: 
soup2 = BeautifulSoup(response, 'html.parser')

In [67]:
len(soup2.find_all('div'))

38

We only have 12 articles here, so div may not be specific enough for our needs in this case

let's go deeper!

In [68]:
len(soup2.select('div'))

38

In [70]:
# the last tactic doesn't seem to drill down far enough:
# [thing.text.strip() for thing in soup2.select('div')]

In [71]:
# if we want to reference the class specifically:
# in a CSS selector, classes are referenced with dot notation
# find_all does not understand CSS selectors the way select does:
# it treats the whole string as a tag name, so this does not raise
# an error but gives us an empty list
soup2.find_all('div.grid.grid-cols-4')

[]
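If you prefer to stay with `find_all`, the `class_` keyword argument is one way to filter by class. The sketch below compares it with `select` on a trimmed, inline version of the article markup; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="grid grid-cols-4 gap-x-4">
  <div class="grid grid-cols-2 italic"><p>1971-08-07</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# the selector string is treated as a tag name, so nothing matches
as_tag_name = soup.find_all("div.grid.grid-cols-4")

# class_ matches any div whose class list includes "grid-cols-4"
by_class = soup.find_all("div", class_="grid-cols-4")

# select understands the CSS selector directly
by_selector = soup.select("div.grid.grid-cols-4")

print(len(as_tag_name), len(by_class), len(by_selector))
```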

In [78]:
articles = soup2.select('div.grid.grid-cols-4')

In [90]:
articles[0].h2.text.strip()

'ten off choice'

In [92]:
articles[0].find('h2').text.strip()

'ten off choice'

In [87]:
articles[0]

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">ten off choice</h2>
<div class="grid grid-cols-2 italic">
<p> 1971-08-07 </p>
<p class="text-right">By Cynthia Gutierrez </p>
</div>
<p>Worker who develop dog series big trouble. Structure glass rule give the movie. Direction movie along seat resource.
Agent attention discussion very create mother part. Hit world model student. Discussion type citizen believe.</p>
</div>
</div>

In [77]:
# I found a way to reference the articles in the page

In [93]:
articles[0].select('p')

[<p> 1971-08-07 </p>,
 <p class="text-right">By Cynthia Gutierrez </p>,
 <p>Worker who develop dog series big trouble. Structure glass rule give the movie. Direction movie along seat resource.
 Agent attention discussion very create mother part. Hit world model student. Discussion type citizen believe.</p>]

In [94]:
[thing.text.strip() for thing in articles[0].find_all('p')]

['1971-08-07',
 'By Cynthia Gutierrez',
 'Worker who develop dog series big trouble. Structure glass rule give the movie. Direction movie along seat resource.\nAgent attention discussion very create mother part. Hit world model student. Discussion type citizen believe.']

In [95]:
def get_article_content(some_article):
    '''
    Pull the headline, date, author, and content out of the
    BeautifulSoup tag for a single article and return them as a dictionary.
    Assumes the article has exactly three <p> tags, in that order.
    '''
    output = {}
    output['headline'] = some_article.find('h2').text.strip()
    # the three <p> tags hold the date, the author, and the body text
    output['date'], output['author'], output['content'] = [
        thing.text.strip() for thing in some_article.find_all('p')
    ]
    return output

In [96]:
get_article_content(articles[0])

{'headline': 'ten off choice',
 'date': '1971-08-07',
 'author': 'By Cynthia Gutierrez',
 'content': 'Worker who develop dog series big trouble. Structure glass rule give the movie. Direction movie along seat resource.\nAgent attention discussion very create mother part. Hit world model student. Discussion type citizen believe.'}
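The three-way unpacking in the function fails with a fairly opaque error if an article ever has a different number of `<p>` tags. One possible, slightly more defensive variant is sketched below. The name `parse_article`, the explicit length check, and the removal of the leading "By " are illustrative additions, and the sample HTML is a shortened copy of the article shown earlier.

```python
from bs4 import BeautifulSoup

article_html = """
<div class="grid grid-cols-4">
  <h2 class="text-2xl text-green-900">ten off choice</h2>
  <div class="grid grid-cols-2 italic">
    <p> 1971-08-07 </p>
    <p class="text-right">By Cynthia Gutierrez </p>
  </div>
  <p>Worker who develop dog series big trouble.</p>
</div>
"""


def parse_article(article):
    """Hypothetical variant of get_article_content with a clearer failure mode."""
    paragraphs = [p.text.strip() for p in article.find_all("p")]
    if len(paragraphs) != 3:
        raise ValueError(f"expected 3 <p> tags, found {len(paragraphs)}")
    date, author, content = paragraphs
    # drop the leading "By " so only the name is stored (an assumption about what we want)
    if author.startswith("By "):
        author = author[3:]
    return {
        "headline": article.find("h2").text.strip(),
        "date": date,
        "author": author,
        "content": content,
    }


article = BeautifulSoup(article_html, "html.parser").select_one("div.grid.grid-cols-4")
parsed = parse_article(article)
print(parsed)
```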

In [97]:
# We can use the get_article_content for every 
# article in the list of articles,
# resulting in a list of dictionaries
[get_article_content(article) for article in articles]

[{'headline': 'ten off choice',
  'date': '1971-08-07',
  'author': 'By Cynthia Gutierrez',
  'content': 'Worker who develop dog series big trouble. Structure glass rule give the movie. Direction movie along seat resource.\nAgent attention discussion very create mother part. Hit world model student. Discussion type citizen believe.'},
 {'headline': 'car value close',
  'date': '1985-05-30',
  'author': 'By Krystal Hampton',
  'content': 'Record source bring away price. Moment help million responsibility eye talk himself. Who TV by.\nTrial area lay rich later price and force. Picture change understand anyone standard.'},
 {'headline': 'physical accept town',
  'date': '1977-06-27',
  'author': 'By James Gonzalez',
  'content': 'Activity television find building. Tree term upon. Visit place manager message hit.\nLikely thousand station occur. Administration bring amount field guess care hard. Inside leg early we teacher policy.'},
 {'headline': 'future note cold',
  'date': '1998-10-13',

In [102]:
import pandas as pd

In [103]:
def acquire_articles(url):
    '''
    Given a url with the expected structure,
    acquire_articles will pull the contents
    of the web page and return a dataframe of 
    all of the articles inside the page
    '''
    soup = BeautifulSoup(get(url).content, 'html.parser')
    articles = soup.select('div.grid.grid-cols-4')
    return pd.DataFrame([get_article_content(article) for article in articles])

In [104]:
# test it out:
acquire_articles(url2)

| | headline | date | author | content |
|---|---|---|---|---|
| 0 | child expert coach | 2009-06-19 | By Desiree Hicks MD | Unit family scientist send most source growth.... |
| 1 | fact message just | 1997-02-23 | By Jennifer Pacheco MD | American society inside theory. Risk worker la... |
| 2 | responsibility interest Mr | 1972-06-12 | By Jeremy Blake | Beautiful line fire then relationship career h... |
| 3 | family lay key | 1998-12-25 | By Trevor Spencer | Cup instead yeah believe local. Space soon whi... |
| 4 | practice pass watch | 1971-12-04 | By Craig Brown | Outside figure international board gun.\nMemor... |
| 5 | court hundred beyond | 1992-12-30 | By Samantha Curtis | Girl agency boy realize notice player hit. Att... |
| 6 | while choice a | 2021-04-04 | By Andrew Obrien | Discuss event rate outside my determine at. Me... |
| 7 | inside these fight | 2006-09-30 | By Victoria Murillo | Interview throughout ready direction get.\nDir... |
| 8 | skin safe structure | 1975-03-08 | By Elizabeth Owens | Not opportunity feeling similar still recent w... |
| 9 | from pull mission | 2009-08-24 | By Kimberly Sanders | High level how be peace. Clearly write suffer ... |
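To try the same select, parse, and build-a-DataFrame pipeline without hitting the network, the sketch below runs it on an inline HTML string. The function name `articles_from_html`, the trimmed markup, and the date conversion at the end are illustrative assumptions, not part of the original script.

```python
import pandas as pd
from bs4 import BeautifulSoup

page_html = """
<div class="grid grid-cols-4">
  <h2>ten off choice</h2>
  <p> 1971-08-07 </p>
  <p>By Cynthia Gutierrez </p>
  <p>Worker who develop dog series big trouble.</p>
</div>
<div class="grid grid-cols-4">
  <h2>car value close</h2>
  <p> 1985-05-30 </p>
  <p>By Krystal Hampton </p>
  <p>Record source bring away price.</p>
</div>
"""


def articles_from_html(html):
    """Parse every article in an HTML string into one row of a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.select("div.grid.grid-cols-4"):
        date, author, content = [p.text.strip() for p in article.find_all("p")]
        rows.append({
            "headline": article.find("h2").text.strip(),
            "date": date,
            "author": author,
            "content": content,
        })
    return pd.DataFrame(rows)


df = articles_from_html(page_html)
# optional: turn the date strings into real datetimes for later analysis
df["date"] = pd.to_datetime(df["date"])
print(df)
```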
