# Data Acquisition with Web Scraping

In [92]:
import requests
from bs4 import BeautifulSoup 
import pandas as pd

First make the request. The response is a bunch of html.

In [93]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>News Example Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">News!</h1>\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n<div class="grid gap-y-12">\n    \n    <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">\n        <img src="/static/placeholder.png" />\n        <d

We can make more sense of that html with the beautiful soup library.

In [94]:
soup = BeautifulSoup(html)
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>News Example Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">News!</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid gap-y-12">
<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">star those white

From here we can switch between the browser and python and try out different ways of getting different parts of the html document.

We can leverage Google Chrome's developer tools by right clicking and choosing "Inspect". We can then use this html document inspector to help us with our web scraping.

In [95]:
articles = soup.select('.grid.grid-cols-4.gap-x-4.border')
date, author = articles[0].select('.italic')[0].select('p')

In [96]:
date

<p> 2007-07-19 </p>

In [97]:
author

<p class="text-right">By Gina Lynch </p>

Bringing it all together:

In [52]:
def process_article(article):
    date, author = articles[0].select('.italic')[0].select('p')
    return {
        'title': article.h2.text,
        'date': date.text,
        'author': author.text
    }

pd.DataFrame([process_article(article) for article in articles])

Unnamed: 0,title,date,author
0,play only work,1973-01-31,By Mr. Justin Cabrera
1,back support pass,1973-01-31,By Mr. Justin Cabrera
2,institution still goal,1973-01-31,By Mr. Justin Cabrera
3,sort right customer,1973-01-31,By Mr. Justin Cabrera
4,decade throughout represent,1973-01-31,By Mr. Justin Cabrera
5,theory these role,1973-01-31,By Mr. Justin Cabrera
6,many Congress and,1973-01-31,By Mr. Justin Cabrera
7,author television medical,1973-01-31,By Mr. Justin Cabrera
8,wife road politics,1973-01-31,By Mr. Justin Cabrera
9,could vote ago,1973-01-31,By Mr. Justin Cabrera


In [143]:
response = requests.get('https://web-scraping-demo.zgulde.net/people')

In [144]:
html = response.text

In [145]:
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>Example People Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">People</h1>\n\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n\n<div id="people" class="grid grid-cols-2 gap-x-12 gap-y-16">\n    \n    <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">\n    

In [146]:
soup = BeautifulSoup(html)
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Example People Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">People</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Jonathan Navarr

In [147]:
people = soup.select('.grid.grid-cols-2.gap-x-3')
people

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Jonathan Navarro</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Programmable coherent initiative"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">fieldsdawn@sharp-kelly.org</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">880.717.0266x88570</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 4484 Adam Turnpike Suite 440 <br/>
                 West Jodishire, WY 86721
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purpl

In [148]:
people[0]

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Jonathan Navarro</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Programmable coherent initiative"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">fieldsdawn@sharp-kelly.org</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">880.717.0266x88570</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                4484 Adam Turnpike Suite 440 <br/>
                West Jodishire, WY 86721
            </p>
</div>
</div>

In [149]:
import re

In [154]:
def process_person(person):
    quote = person.select('p')[0]
    email = person.select('p')[1]
    phone = person.select('p')[2]
    address = person.select('p')[3]
    address = address.text.strip('\n')
    address = re.sub(r'\s{2,}',' ',address)
    return {
        'quote': quote.text.strip(),
        'name' : person.h2.text,
        'email': email.text,
        'phone': phone.text,
        'address': address
    }

df = pd.DataFrame([process_person(person) for person in people])

In [155]:
df

Unnamed: 0,quote,name,email,phone,address
0,"""Programmable coherent initiative""",Jonathan Navarro,fieldsdawn@sharp-kelly.org,880.717.0266x88570,"4484 Adam Turnpike Suite 440 West Jodishire, ..."
1,"""Universal incremental hierarchy""",David Best,santosjohn@jones-williams.com,001-284-690-6112x321,"73168 Derek Wall Greenton, UT 10027"
2,"""Re-contextualized executive hardware""",Kathryn King,bensonjanet@garrett-ali.com,001-875-579-3305x371,"8975 Jon Estate Richardbury, NJ 17289"
3,"""Re-engineered zero tolerance interface""",Stephanie Dixon,apierce@walker.com,+1-673-068-0431x6708,"9608 Bernard Canyon Suite 997 Anthonyborough,..."
4,"""Robust interactive definition""",Michael Guzman,ksalazar@yahoo.com,293-853-7578x6846,"1215 Karen Garden Apt. 471 Hintonside, VT 19792"
5,"""Managed clear-thinking artificial intelligence""",Dennis Mcdaniel,ylin@yahoo.com,113-369-0876x85056,"821 Sharon Flat Apt. 695 North Randallfort, I..."
6,"""Upgradable asynchronous hub""",Jessica Cruz,albert71@yahoo.com,(644)553-5701,"77961 Newman Estate Suite 998 New Darin, WI 6..."
7,"""Decentralized coherent utilization""",Jennifer Bishop,hcontreras@yahoo.com,517-744-2527x6406,"66709 Shane Hollow Apt. 403 Lindseyhaven, AR ..."
8,"""Stand-alone bi-directional matrix""",Claudia Stone,rhonda04@benson.com,864.198.5483,41758 Christopher Station Suite 084 New Chris...
9,"""Compatible empowering paradigm""",Carrie Hernandez,sharon69@garcia.org,001-902-876-1187,"97238 Brian Villages Cathyfurt, IN 13490"


In [136]:
df.address

0                    74643 Roy Cliff \n            ...
1                    50689 Gutierrez Locks Apt. 973...
2                    34130 Morris Courts Suite 365 ...
3                    81390 Powers Cliff \n         ...
4                    39606 Thomas Forest Suite 284 ...
5                    981 Collins Divide \n         ...
6                    34012 Cummings Meadows Suite 8...
7                    1105 Clark Island \n          ...
8                    11134 Rivera Motorway Suite 17...
9                    8317 Houston Well Suite 854 \n...
Name: address, dtype: object