# Data Acquisition with Web Scraping

In [8]:
import requests
from bs4 import BeautifulSoup 
import pandas as pd
import os

First make the request. The response is a bunch of html.

In [9]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>News Example Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">News!</h1>\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n<div class="grid gap-y-12">\n    \n    <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">\n        <img src="/static/placeholder.png" />\n        <d

We can make more sense of that html with the beautiful soup library.

In [10]:
soup = BeautifulSoup(html)
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>News Example Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">News!</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid gap-y-12">
<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">employee detail 

From here we can switch between the browser and python and try out different ways of getting different parts of the html document.

We can leverage Google Chrome's developer tools by right clicking and choosing "Inspect". We can then use this html document inspector to help us with our web scraping.

In [11]:
articles = soup.select('.grid.grid-cols-4.gap-x-4.border')
articles[0].select('.italic')[0].select('p')

[<p> 1977-03-08 </p>, <p class="text-right">By Jessica King </p>]

In [132]:
articles[0]

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">employee detail federal</h2>
<div class="grid grid-cols-2 italic">
<p> 1977-03-08 </p>
<p class="text-right">By Jessica King </p>
</div>
<p>Find leader month phone describe scene wait someone. Indeed rich off tonight fine and party.
Through task interest. Detail real human camera.</p>
</div>
</div>

In [155]:
articles[0].select('.italic')[0].select('p')[1].text

'By Jessica King '

Bringing it all together:

In [12]:
def process_article(article):
    date, author = articles[0].select('.italic')[0].select('p')
    return {
        'title': article.h2.text,
        'date': date.text,
        'author': author.text
    }

pd.DataFrame([process_article(article) for article in articles])

Unnamed: 0,title,date,author
0,employee detail federal,1977-03-08,By Jessica King
1,blue process spend,1977-03-08,By Jessica King
2,list standard throw,1977-03-08,By Jessica King
3,during lawyer building,1977-03-08,By Jessica King
4,station treat design,1977-03-08,By Jessica King
5,cause coach easy,1977-03-08,By Jessica King
6,official reflect ground,1977-03-08,By Jessica King
7,add have why,1977-03-08,By Jessica King
8,goal picture himself,1977-03-08,By Jessica King
9,nature civil manager,1977-03-08,By Jessica King


---

### People

In [92]:
response = requests.get('https://web-scraping-demo.zgulde.net/people')
people = response.text
people

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>Example People Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">People</h1>\n\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n\n<div id="people" class="grid grid-cols-2 gap-x-12 gap-y-16">\n    \n    <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">\n    

In [93]:
soup = BeautifulSoup(people)
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Example People Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">People</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Brenda Sherman<

In [94]:
soup.find_all("div", {"class": "person"})[0]

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Brenda Sherman</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Customer-focused grid-enabled Graphical User Interface"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">tracirobertson@brown.info</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">744-276-9026</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                3147 Mathis Meadow <br/>
                Port Connie, MO 78830
            </p>
</div>
</div>

In [180]:
personal_info = soup.select('.person.border.rounded')
personal_info

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Brenda Sherman</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Customer-focused grid-enabled Graphical User Interface"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">tracirobertson@brown.info</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">744-276-9026</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 3147 Mathis Meadow <br/>
                 Port Connie, MO 78830
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purpl

In [182]:
# name
personal_info = soup.select('.person.border')
personal_info[0].select('.name')[0].text

'Brenda Sherman'

In [171]:
# quote
personal_info[0].select('.quote')[0].text.strip()

'"Customer-focused grid-enabled Graphical User Interface"'

In [165]:
# email
personal_info[0].select('.email')[0].text

'tracirobertson@brown.info'

In [167]:
# phone
personal_info[0].select('.phone')[0].text

'744-276-9026'

In [176]:
# address
personal_info[0].select('.address')[0].text.strip()

'3147 Mathis Meadow \n                Port Connie, MO 78830'

In [315]:
info = pd.DataFrame(columns={"name","quote","email","phone"})

In [316]:
info

Unnamed: 0,name,phone,email,quote


In [326]:
def people_df(info):
    name = personal_info.select('.name')[0].text
    quote = personal_info.select('.quote')[0].text.strip()
    email = personal_info.select('.email')[0].text
    phone = personal_info.select('.phone')[0].text
    info.name = name
    info.quote = quote 
    info.email = email 
    info.phone = phone
    return info

In [327]:
people_df(info)

AttributeError: ResultSet object has no attribute 'select'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [313]:
for person in personal_info:
    name = person.select('.name')[0].text
    quote = person.select('.quote')[0].text.strip()
    email = person.select('.email')[0].text
    phone = person.select('.phone')[0].text

In [314]:
for person in personal_info:
    info += person.select('.name')[0].text

In [311]:
info

'Melissa LloydBrenda ShermanPaula ByrdKimberly RodriguezLori JenkinsAlicia JohnsonJames JacksonTimothy MontesMatthew PittmanWendy PerryMelissa Lloyd'

In [218]:
for person in personal_info:
    print(person.select('.name')[0].text)

Brenda Sherman
Paula Byrd
Kimberly Rodriguez
Lori Jenkins
Alicia Johnson
James Jackson
Timothy Montes
Matthew Pittman
Wendy Perry
Melissa Lloyd


In [239]:
for person in personal_info:
    print(person.select('.quote')[0].text.strip())

"Customer-focused grid-enabled Graphical User Interface"
"Devolved fresh-thinking capability"
"Multi-channeled bottom-line data-warehouse"
"Operative even-keeled matrices"
"Stand-alone client-driven Graphic Interface"
"Pre-emptive intermediate application"
"Synergistic regional Internet solution"
"Configurable attitude-oriented initiative"
"Cross-group next generation alliance"
"Configurable holistic attitude"
