# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

If the `bs4` import fails, install Beautiful Soup first: `$ pip install beautifulsoup4`

## Soup Methods

- `soup.select('.class')`: returns a list of all the elements with class `class`
- `soup.select_one('.class')`: returns the first element with class `class`
- `soup.h2`: returns the first `h2` element
- `soup.find_all('h2')`: returns a list of all the `h2` elements
- `soup('h2')`: shorthand for `soup.find_all('h2')`
- `soup.find('h2')`: returns the first `h2` element, same as `soup.h2`
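
The methods listed above can be compared side by side on a small document. This is a minimal sketch: the HTML string and the `card` and `note` class names are made up for illustration, and the parser is named explicitly as `html.parser`.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical document, inlined so the example runs without a network call.
html = """
<div class="card"><h2>first heading</h2><p class="note">one</p></div>
<div class="card"><h2>second heading</h2><p class="note">two</p></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# CSS selectors: select returns a list, select_one returns the first match.
notes = [p.text for p in soup.select('.note')]
first_note = soup.select_one('.note').text

# Tag-name lookups: soup.h2 and soup.find('h2') both return the first h2.
first_h2 = soup.h2.text
same_first_h2 = soup.find('h2').text

# find_all and calling the soup directly both return every h2.
all_h2 = [h.text for h in soup.find_all('h2')]
also_all_h2 = [h.text for h in soup('h2')]

print(notes, first_note, first_h2, all_h2)
```

Note that `select_one`, `find`, and the attribute form return `None` when nothing matches, while `select` and `find_all` return an empty list.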

First, make the request. The response body is the page's raw HTML, returned as one long string.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>News Example Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">News!</h1>\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n<div class="grid gap-y-12">\n    \n    <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">\n        <img src="/static/placeholder.png" />\n        <d
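
A request can fail or hang, so it may help to wrap the fetch in a small helper. The sketch below is one possible shape, not part of the original notebook: `fetch_html` is a hypothetical name, and the 10 second timeout and default user agent string are assumptions.

```python
import requests

def fetch_html(url, user_agent='Codeup DS Hopper', timeout=10):
    """Return the HTML text at url, raising if the request fails."""
    response = requests.get(url, headers={'user-agent': user_agent}, timeout=timeout)
    # raise_for_status raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text

# Usage (makes a real network request, so it is left commented out):
# html = fetch_html('https://web-scraping-demo.zgulde.net/news')
```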

We can make more sense of that HTML with the Beautiful Soup library, which parses the string into a tree of elements we can search.

In [3]:
# name the parser explicitly so Beautiful Soup does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here we can switch back and forth between the browser and Python, trying out different ways of selecting parts of the HTML document.

In Google Chrome, right-click an element on the page and choose "Inspect" to open the developer tools. The element inspector shows the tags and class names we can use to write selectors.

In [4]:
# Use beautifulsoup methods to extract necessary content from an article

In [5]:
articles = soup.select('.grid-cols-4')
articles

[<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">thank any discuss</h2>
 <div class="grid grid-cols-2 italic">
 <p> 2002-08-16 </p>
 <p class="text-right">By Kristy Turner </p>
 </div>
 <p>Law nation over great Mr cause. His opportunity building could charge.
 Look finally stock adult oil various. Writer doctor manage commercial after quickly. Very development claim green challenge they sense.</p>
 </div>
 </div>,
 <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">owner tree organization</h2>
 <div class="grid grid-cols-2 italic">
 <p> 2015-11-06 </p>
 <p class="text-right">By Karen Lopez </p>
 </div>
 <p>House worker finish everybody o

In [6]:
article = articles[0]
article

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">thank any discuss</h2>
<div class="grid grid-cols-2 italic">
<p> 2002-08-16 </p>
<p class="text-right">By Kristy Turner </p>
</div>
<p>Law nation over great Mr cause. His opportunity building could charge.
Look finally stock adult oil various. Writer doctor manage commercial after quickly. Very development claim green challenge they sense.</p>
</div>
</div>

In [7]:
headline = article.h2.text
headline

'thank any discuss'

In [8]:
date = article.p.text.strip()
date

'2002-08-16'

In [9]:
author = article.select('.text-right')[0].text.strip()[3:]
author

'Kristy Turner'

In [10]:
content = article.select('p')[-1].text
content

'Law nation over great Mr cause. His opportunity building could charge.\nLook finally stock adult oil various. Writer doctor manage commercial after quickly. Very development claim green challenge they sense.'

Bringing it all together: wrap the four extraction steps in a function.

In [11]:
def parse_news(article):
    headline = article.h2.text
    date = article.p.text.strip()
    author = article.select('.text-right')[0].text.strip()[3:]
    content = article.select('p')[-1].text
   
    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }

In [12]:
parse_news(article)

{'headline': 'thank any discuss',
 'date': '2002-08-16',
 'author': 'Kristy Turner',
 'content': 'Law nation over great Mr cause. His opportunity building could charge.\nLook finally stock adult oil various. Writer doctor manage commercial after quickly. Very development claim green challenge they sense.'}
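
`parse_news` assumes every article card contains a headline, a date, a byline, and a body paragraph. If an element is missing, `article.h2.text` raises `AttributeError` and `select(...)[0]` raises `IndexError`. The sketch below shows one defensive variant. The name `parse_news_safe` and the incomplete sample card are hypothetical, and returning `None` for a missing field is an assumption about what is wanted.

```python
from bs4 import BeautifulSoup

def parse_news_safe(article):
    """Like parse_news, but returns None for any field whose element is missing."""
    def text_of(element):
        return element.text.strip() if element is not None else None

    author = text_of(article.select_one('.text-right'))
    # Drop the leading "By " only when it is actually there.
    if author is not None and author.startswith('By '):
        author = author[3:]

    paragraphs = article.select('p')
    return {
        'headline': text_of(article.h2),
        'date': text_of(article.p),
        'author': author,
        'content': paragraphs[-1].text if paragraphs else None,
    }

# Hypothetical article card with no headline and no byline.
incomplete = BeautifulSoup(
    '<div><div class="grid grid-cols-2 italic"><p> 2002-08-16 </p></div>'
    '<p>Law nation over great Mr cause.</p></div>',
    'html.parser',
).div
print(parse_news_safe(incomplete))
```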

In [13]:
# loop through all the articles
pd.DataFrame([parse_news(article) for article in articles])

| | headline | date | author | content |
|---|---|---|---|---|
| 0 | thank any discuss | 2002-08-16 | Kristy Turner | Law nation over great Mr cause. His opportunit... |
| 1 | owner tree organization | 2015-11-06 | Karen Lopez | House worker finish everybody oil finally. Gla... |
| 2 | either next market | 2012-02-06 | Ryan Harris | Mother data day weight possible dark instituti... |
| 3 | picture clearly civil | 2018-12-30 | Virginia Moss | Pick agree score later car industry. Line spee... |
| 4 | back case car | 1987-08-25 | Bruce Mason | To cut serious company. Already miss personal ... |
| 5 | drive card operation | 1995-03-21 | Bryan Martin | Whether hold once treat. Traditional agree alw... |
| 6 | week soon matter | 1980-01-05 | Christine Wolfe | City fire majority seat. Indeed sign successfu... |
| 7 | book actually everything | 1982-03-23 | Kayla Gonzalez | Former against theory form group fill. Run ret... |
| 8 | break early fear | 2018-12-24 | Sara Medina | Concern significant fall better lawyer help. I... |
| 9 | reflect north newspaper | 2021-05-05 | Jamie Fuentes | Say across want. Light once black imagine quic... |
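
Every scraped value arrives as a string, including the dates. As a possible next step (not part of the original notebook), the `date` column can be converted with `pd.to_datetime` so that it sorts and filters as dates. The two records below are copied from the output above, and `newest_first` is a hypothetical name.

```python
import pandas as pd

# Two rows copied from the output above; in the notebook this would be the
# full list produced by parse_news.
records = [
    {'headline': 'thank any discuss', 'date': '2002-08-16', 'author': 'Kristy Turner'},
    {'headline': 'owner tree organization', 'date': '2015-11-06', 'author': 'Karen Lopez'},
]
df = pd.DataFrame(records)

# Convert the string dates to datetime64 values, then sort newest first.
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
newest_first = df.sort_values('date', ascending=False).reset_index(drop=True)
print(newest_first)
```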


## Scraping People

In [30]:
response = requests.get(
    'https://web-scraping-demo.zgulde.net/people',
    headers={'user-agent': 'Codeup DS Hopper'},
)
soup = BeautifulSoup(response.text, 'html.parser')

In [34]:
people = soup.select('.person')

In [35]:
person = people[0]
person


<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Cassandra Simon</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Customizable national infrastructure"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">fjackson@obrien-garcia.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">+1-725-734-4868</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                1908 Nixon Locks Suite 078 <br/>
                Coreychester, KY 29045
            </p>
</div>
</div>

In [45]:
# the card has exactly one h2, so person.h2.text would return the same name
name = person.select('h2')[-1].text
name

'Cassandra Simon'

In [49]:
quote = person.select('p')[0].text.strip()
quote

'"Customizable national infrastructure"'

In [58]:
email = person.select('p')[1].text.strip()
email

'fjackson@obrien-garcia.com'

In [60]:
phone = person.select('p')[2].text.strip()
phone

'+1-725-734-4868'

In [72]:
address = person.select('p')[-1].text.replace('\n', '').replace('        ','').strip()
address

'1908 Nixon Locks Suite 078 Coreychester, KY 29045'

In [76]:
# without the second replace, the indentation from the HTML source stays in the string
address = person.select('p')[-1].text.replace('\n', '').strip()
address

'1908 Nixon Locks Suite 078                 Coreychester, KY 29045'
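
Replacing a fixed run of eight spaces works for this page but depends on how the HTML happens to be indented. A more general way to collapse whitespace, sketched below as an alternative and not as the notebook's method, is to split on any whitespace and rejoin with single spaces. The `raw` string reproduces the text of the address paragraph shown in the person card.

```python
# The raw text of the address paragraph, including its newlines and indentation.
raw = (
    '\n                1908 Nixon Locks Suite 078 '
    '\n                Coreychester, KY 29045\n            '
)

# str.split() with no arguments splits on any run of whitespace (spaces, tabs,
# newlines) and drops empty strings, so joining with one space normalizes it.
address = ' '.join(raw.split())
print(address)  # 1908 Nixon Locks Suite 078 Coreychester, KY 29045
```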

In [73]:
def parse_person(person):
    # the name is the card's only h2
    name = person.select('h2')[-1].text
    # the p elements appear in a fixed order: quote, email, phone, address
    quote = person.select('p')[0].text.strip()
    email = person.select('p')[1].text.strip()
    phone = person.select('p')[2].text.strip()
    # drop the newlines and the source indentation inside the address paragraph
    address = person.select('p')[-1].text.replace('\n', '').replace('        ','').strip()

    return {
        'name': name, 'quote': quote, 'email': email,
        'phone': phone,
        'address': address
    }
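
`parse_person` relies on the order of the `p` elements, which breaks if the page adds or reorders a paragraph. The person card shown earlier also carries descriptive class names (`name`, `quote`, `email`, `phone`, `address`), so a variant can select by class instead. This is a sketch under that assumption: `parse_person_by_class` is a hypothetical name, and the sample card is a trimmed copy of the first card with the styling classes removed.

```python
from bs4 import BeautifulSoup

def parse_person_by_class(person):
    """Same fields as parse_person, looked up by class name instead of position."""
    return {
        'name': person.select_one('.name').text.strip(),
        'quote': person.select_one('.quote').text.strip(),
        'email': person.select_one('.email').text.strip(),
        'phone': person.select_one('.phone').text.strip(),
        # The address text sits in a <p> nested inside the .address div.
        'address': ' '.join(person.select_one('.address p').text.split()),
    }

# A trimmed copy of the first person card shown earlier.
card = BeautifulSoup('''
<div class="person">
<h2 class="name">Cassandra Simon</h2>
<p class="quote">
    "Customizable national infrastructure"
</p>
<div>
<p class="email">fjackson@obrien-garcia.com</p>
<p class="phone">+1-725-734-4868</p>
</div>
<div class="address">
<p>
    1908 Nixon Locks Suite 078 <br/>
    Coreychester, KY 29045
</p>
</div>
</div>
''', 'html.parser').select_one('.person')

print(parse_person_by_class(card))
```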

In [75]:
# loop through all the people
pd.DataFrame([parse_person(person) for person in people])

| | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | Cassandra Simon | "Customizable national infrastructure" | fjackson@obrien-garcia.com | +1-725-734-4868 | 1908 Nixon Locks Suite 078 Coreychester, KY 29045 |
| 1 | Dustin Mclaughlin | "Open-architected systematic Local Area Network" | thomas02@hotmail.com | (306)312-0939 | 596 Griffin Greens Christopherview, PA 38210 |
| 2 | Michelle Harris | "Phased secondary budgetary management" | beckjohn@miller.com | (344)258-6687x547 | 29200 Patrick Road Suite 110 Port Jennifer, SD... |
| 3 | Christian Perez | "Profound attitude-oriented emulation" | patricia26@davis.biz | +1-116-463-5393 | 295 Fitzgerald Dam Suite 338 Wagnerchester, MI... |
| 4 | Jennifer Austin | "Object-based needs-based database" | stevenmurray@gmail.com | (435)087-9456 | 6755 Ochoa Fields Apt. 879 Jordanburgh, NE 74954 |
| 5 | Kyle Ward | "Grass-roots client-driven neural-net" | wanda97@guerrero-boyd.com | 001-842-049-5807 | 371 Sutton Spurs South Debra, ME 40862 |
| 6 | Emily Lee | "Extended tangible customer loyalty" | brandy46@yahoo.com | (419)183-1708x292 | 53404 Smith Key Sarahstad, NY 47876 |
| 7 | Jennifer Hernandez | "Focused radical software" | byu@johnson.org | 258-317-1295 | 2547 Ryan Prairie Suite 540 East Matthew, TX 9... |
| 8 | Sara Armstrong | "Ameliorated 6thgeneration functionalities" | annjones@fleming.com | 810-375-3703x3872 | 124 Scott Summit West Michael, SC 39350 |
| 9 | Robert Baxter | "Robust coherent help-desk" | hmills@atkins.info | 363.555.3470x55274 | 970 Ruiz Viaduct Apt. 890 Barnettbury, MI 70359 |


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [Codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
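
Both points of etiquette can be checked in code before any request is sent. The sketch below uses the standard library's `urllib.robotparser` to ask whether a user agent may fetch a URL. The robots.txt rules are a made-up sample, inlined so the example runs offline, and `allowed` is a hypothetical helper name. For a live site, `RobotFileParser.set_url` followed by `read` would download the real file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# The descriptive user agent we would also send in the request headers.
USER_AGENT = 'codeup data science germain cohort'

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url):
    """Return True if the sample rules permit USER_AGENT to fetch url."""
    return parser.can_fetch(USER_AGENT, url)

print(allowed('http://example.com/blog/'))           # True
print(allowed('http://example.com/private/report'))  # False
```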

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that returns a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

In [77]:
response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')