# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

First, make the request. The response body is the page's raw HTML, returned as one long string.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>News Example Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">News!</h1>\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n<div class="grid gap-y-12">\n    \n    <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">\n        <img src="/static/placeholder.png" />\n        <d
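Before parsing, it can help to confirm that the request actually succeeded. The sketch below wraps the request in a hypothetical helper, `fetch_html` (not part of the original notebook), which sets a timeout and raises on 4xx/5xx responses. The default user agent string and the timeout value are placeholders chosen for illustration.

```python
import requests


def fetch_html(url, user_agent="example-scraper (placeholder)", timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses.

    The helper name and its defaults are illustrative assumptions.
    """
    response = requests.get(url, headers={"user-agent": user_agent}, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this in place, `html = fetch_html('https://web-scraping-demo.zgulde.net/news')` would either return the page source or fail loudly instead of handing an error page to the parser.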

We can make more sense of that HTML with the Beautiful Soup library, which parses the string into a tree of elements we can search.

In [3]:
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>News Example Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">News!</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid gap-y-12">
<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">key team word</h

From here we can switch between the browser and Python, trying out different ways of selecting parts of the HTML document.

Chrome's developer tools help here: right-click an element on the page and choose "Inspect" to see its tag, its classes, and where it sits in the document. That information tells us which selectors to write.
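As a rough illustration of how a class list seen in the inspector turns into a selector, the sketch below uses a small inline document rather than the live page. The markup is a simplified stand-in; only the class names are borrowed from the demo site.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document; the class names mimic the demo page's cards.
html = """
<div class="grid gap-y-12">
  <div class="grid grid-cols-4 border"><h2>first</h2></div>
  <div class="grid grid-cols-4 border"><h2>second</h2></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ".grid" alone also matches the outer wrapper ...
all_grids = soup.select(".grid")
# ... while chaining classes requires every listed class on the same element.
cards = soup.select(".grid.grid-cols-4.border")
titles = [card.h2.text for card in cards]
```

Chaining several classes with no spaces between them narrows the match to elements that carry all of them, which is why a longer selector can isolate the cards from their wrapper.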

In [4]:
articles = soup.select('.grid.grid-cols-4.gap-x-4.border')
articles[0].select('.italic')[0].select('p')

[<p> 2001-04-28 </p>, <p class="text-right">By Jermaine Martinez </p>]

In [5]:
article = articles[0]

In [6]:
article.find_all('p')[-1].text

'Each there worry. Center approach play receive.\nBeautiful specific mouth happen. At pull very image model.'

Bringing it all together:

In [7]:
def process_article(article):
    # Read the date and author from the article passed in, not from articles[0]
    date, author = article.select('.italic')[0].select('p')
    return {
        'title': article.h2.text,
        'date': date.text.strip(),
        'author': author.text.strip()
    }

pd.DataFrame([process_article(article) for article in articles])

|   | title | date | author |
|---|---|---|---|
| 0 | key team word | 2001-04-28 | By Jermaine Martinez |
| 1 | statement represent series | ... | ... |
| 2 | foreign wish any | ... | ... |
| 3 | bad as change | ... | ... |
| 4 | market son owner | ... | ... |
| 5 | speech ok serve | ... | ... |
| 6 | task card hear | ... | ... |
| 7 | bill perform new | ... | ... |
| 8 | soldier various red | ... | ... |
| 9 | develop within throw | ... | ... |

The first row is shown in full; the date and author values for the remaining rows are elided.
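To check that each row picks up its own article's values, the same extraction can be run against a small inline document. This is a minimal sketch: the `article` class, the titles, dates, and authors are all made up, and `process_article_sketch` is a hypothetical name used only here.

```python
from bs4 import BeautifulSoup

# Inline stand-in for two article cards; all values are invented.
html = """
<div class="article"><h2>first title</h2>
  <div class="italic"><p> 2001-04-28 </p><p>By Author One </p></div></div>
<div class="article"><h2>second title</h2>
  <div class="italic"><p> 2002-05-01 </p><p>By Author Two </p></div></div>
"""
soup = BeautifulSoup(html, "html.parser")


def process_article_sketch(article):
    # Unpacking works because the .italic block holds exactly two <p> elements.
    date, author = article.select_one(".italic").select("p")
    return {
        "title": article.h2.text,
        "date": date.text.strip(),
        "author": author.text.strip(),
    }


rows = [process_article_sketch(a) for a in soup.select(".article")]
```

Passing `rows` to `pd.DataFrame` would give one row per article, each with a different date and author.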


## Soup Methods

- `soup.select('.class')`: to get all the elements with class `class`
- `soup.select_one('.class')`: to get the first element with class `class`
- `soup.h2`: to get the first `h2` element
- `soup.find_all('p')`: to get all the elements with tag name of `p`
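The four calls above can be tried side by side on a small inline document. This is only a sketch; the markup, names, and email addresses are invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<div class="person"><h2>Ada</h2><p class="email">ada@example.com</p></div>
<div class="person"><h2>Grace</h2><p class="email">grace@example.com</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

people = soup.select(".person")       # list of every element with class "person"
first = soup.select_one(".person")    # first match only
first_heading = soup.h2               # first <h2> in the document
paragraphs = soup.find_all("p")       # list of every <p> element
missing = soup.select_one(".phone")   # nothing has this class, so this is None
```

Note that `select` always returns a list (possibly empty), while `select_one` returns a single element or `None`, so the latter needs a check before calling `.text` when an element might be absent.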

## Scraping People

In [8]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Germain'})
soup = BeautifulSoup(response.text, 'html.parser')

In [9]:
people = soup.select('.person')
people

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Frank Mata</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Total disintermediate migration"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">larry86@ray.com</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">460.289.4834</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 654 Jennifer Oval Apt. 084 <br/>
                 West Maryburgh, IA 40824
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full b

In [10]:
person = people[0]
person

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Frank Mata</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Total disintermediate migration"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">larry86@ray.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">460.289.4834</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                654 Jennifer Oval Apt. 084 <br/>
                West Maryburgh, IA 40824
            </p>
</div>
</div>

In [11]:
import re


def parse_person(person):
    name = person.h2.text
    # person.p is the first <p> element inside the card, which holds the quote
    quote = person.p.text.strip()
    email = person.select('.email')[0].text
    phone = person.select('.phone')[0].text
    # the address spans several indented lines; collapse each whitespace run to one space
    address = person.select('.address')[0].text.strip()
    address = re.sub(r'\s{2,}', ' ', address)

    return {'name': name, 'quote': quote, 'email': email, 'phone': phone, 'address': address}
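The address cleanup is easier to follow in isolation. The sketch below applies the same two steps to a string shaped like the `.text` of the address block shown earlier; the exact amount of surrounding whitespace is an assumption.

```python
import re

# Text roughly as it comes out of the .address block: the indentation and
# newlines from the HTML source survive in .text.
raw = (
    "\n\n                 654 Jennifer Oval Apt. 084 \n"
    "                 West Maryburgh, IA 40824\n             \n"
)

stripped = raw.strip()
# Collapse each run of two or more whitespace characters into one space.
address = re.sub(r"\s{2,}", " ", stripped)

# An alternative with the same result here: split on any whitespace and rejoin.
address_alt = " ".join(raw.split())
```

The regex leaves single spaces alone and only rewrites longer runs, while the `split`/`join` variant normalizes every whitespace run, including single tabs or newlines.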

In [12]:
pd.DataFrame([parse_person(person) for person in people])

|   | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | Frank Mata | "Total disintermediate migration" | larry86@ray.com | 460.289.4834 | 654 Jennifer Oval Apt. 084 West Maryburgh, IA ... |
| 1 | Sara Jackson | "User-friendly attitude-oriented Internet solu... | haynesgary@hotmail.com | (663)144-8293 | 432 Schneider Lodge Apt. 059 Michaelshire, VA ... |
| 2 | Tammy Hawkins | "Managed fresh-thinking initiative" | patrickgardner@hotmail.com | 001-733-850-9077x577 | 1406 Autumn Court Suite 578 Walkermouth, VT 87491 |
| 3 | Mark Cox | "Compatible modular portal" | nchavez@yahoo.com | 946-948-8842x4976 | 408 Moore Coves Suite 237 Lake Eric, SD 53465 |
| 4 | Stacy Lam | "Grass-roots encompassing complexity" | acurtis@sullivan.info | 964-092-6572 | 21323 Oneal Bridge Brucehaven, KS 76496 |
| 5 | Sergio Jackson | "Reactive intermediate infrastructure" | rsimmons@nunez.com | 604-551-0252 | 66855 Ronald Heights Foleyfurt, WY 40661 |
| 6 | Kathleen Moody | "Virtual regional budgetary management" | edwardsjose@yahoo.com | 655-998-0529x434 | 43653 Curtis Station Apt. 823 Garciaview, NM 7... |
| 7 | Antonio Merritt | "Centralized optimizing software" | stephanie20@yahoo.com | +1-126-268-0484 | 8788 Lee Hills Rebeccachester, CO 57181 |
| 8 | Robin Powell | "Switchable even-keeled database" | leemarie@mora-wagner.biz | 218-391-0243 | 35756 Eric Glen Suite 117 East Codyburgh, NM 4... |
| 9 | Olivia Wilson | "Fully-configurable object-oriented conglomera... | rcook@gmail.com | 001-167-566-8941 | 755 Heather Crossing Suite 186 South Christoph... |


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [Codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
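One way to respect `robots.txt` from code is the standard library's `urllib.robotparser`. The sketch below parses a hypothetical set of rules from a list of lines, so it makes no network request; the rules and the `example.com` URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, supplied inline so no request is made.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = "codeup data science germain cohort"
# True when the rules allow this user agent to fetch the URL.
can_scrape_news = parser.can_fetch(user_agent, "http://example.com/news")
can_scrape_private = parser.can_fetch(user_agent, "http://example.com/private/data")
# Seconds to wait between requests, or None if the file sets no Crawl-delay.
delay = parser.crawl_delay(user_agent)
```

Against a real site, `parser.set_url(...)` followed by `parser.read()` would load the rules from the site's own `robots.txt` instead of the inline list, and `time.sleep(delay)` between requests would honor any crawl delay.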