# Data Acquisition with Web Scraping

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

`$ pip install beautifulsoup4`

In [3]:
# !conda install -c anaconda beautifulsoup4


## Soup Methods

- `soup.select('.class')`: to get all the elements with class `class`
- `soup.select_one('.class')`: to get the first element with class `class`
- `soup.h2`: to get the first `h2` element
- `soup.find_all('h2')`: to get all the elements with tag name of `h2`
- `soup('h2')` : same as `find_all` method above
- `soup.find('h2')`: finds the first matching element
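As a quick illustration of the methods listed above, here is a minimal sketch run against a small hand-written HTML string. The markup, class names, and variable names are made up for this example and are not taken from the demo site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example
sample_html = """
<div class="card"><h2>First</h2><p class="note">one</p></div>
<div class="card"><h2>Second</h2><p class="note">two</p></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

all_cards = sample_soup.select('.card')       # list of every element with class "card"
first_note = sample_soup.select_one('.note')  # first element with class "note"
first_h2 = sample_soup.h2                     # first h2 tag
all_h2 = sample_soup.find_all('h2')           # list of every h2 tag
same_h2 = sample_soup('h2')                   # calling the soup is shorthand for find_all
found_h2 = sample_soup.find('h2')             # first h2 tag, same as sample_soup.h2

print(len(all_cards), first_note.text, first_h2.text, [h.text for h in all_h2])
```

Note that `select` and `find_all` always return a list (possibly empty), while `select_one`, `find`, and attribute access return a single element or `None`.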

First, make the request. The response body is the page's raw HTML as one long string.

In [31]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>News Example Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">News!</h1>\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n<div class="grid gap-y-12">\n    \n    <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">\n        <img src="/static/placeholder.png" />\n        <d
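Before parsing, it can help to confirm that the request actually succeeded. The sketch below wraps the request in a small helper; the name `fetch_html`, the default user agent string, and the 10 second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests

def fetch_html(url, user_agent='example scraper (hypothetical)', timeout=10):
    """Return the response body as text, raising on HTTP errors."""
    response = requests.get(
        url,
        headers={'user-agent': user_agent},
        timeout=timeout,  # give up instead of hanging indefinitely
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text

# Example call (performs a real network request, so it is left commented out):
# html = fetch_html('https://web-scraping-demo.zgulde.net/news')
```

Failing loudly on a bad status code avoids silently parsing an error page as if it were the content we wanted.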

We can make more sense of that HTML by parsing it with the Beautiful Soup library.

In [32]:
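# Name the parser explicitly; without it, Beautiful Soup guesses one and emits a warning
soup = BeautifulSoup(html, 'html.parser')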
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here, we can switch back and forth between the browser and Python, trying out different ways of selecting parts of the HTML document.

Google Chrome's developer tools help here: right-click an element on the page and choose "Inspect". The element inspector shows the tags and classes we need to target when scraping.

In [6]:
# Attribute access returns the first matching element
soup.h1

<h1 class="my-5 text-4xl text-center">News!</h1>

In [12]:
# find() also returns the first matching element, same as above
e1 = soup.find('h1')
e1.text

'News!'

In [13]:
# search for first p tag
soup.p

<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>

In [16]:
soup.find('p')

<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>

In [15]:
soup.find_all('p')

[<p>
 <i class="bi bi-exclamation-circle text-xl"></i>
         All data on this page is strictly for demonstration purposes and fake.
     </p>,
 <p> 1999-08-12 </p>,
 <p class="text-right">By Breanna Green </p>,
 <p>Same defense main do happen. Sit could enough technology with everything.
 Bar investment sort civil style whom thing professional. Year help stay cultural two. Summer professor far born order leg right.</p>,
 <p> 2003-07-04 </p>,
 <p class="text-right">By Christina Griffin </p>,
 <p>Ability thus bank seem. Resource perform whose keep garden.
 Smile write color enter across can. Part campaign artist west that north. Protect think surface who.</p>,
 <p> 2000-09-25 </p>,
 <p class="text-right">By Frances Moody </p>,
 <p>Whose individual develop sell business side opportunity. Decade result hospital over. Page under answer.
 Probably read contain arm break somebody. Truth yourself common process likely. People a action less treatment finish. Manager onto this citizen.</p>,
 

In [17]:
# List comprehension to pull the text out of every p tag
[p.text for p in soup.find_all('p')]

['\n\n        All data on this page is strictly for demonstration purposes and fake.\n    ',
 ' 1999-08-12 ',
 'By Breanna Green ',
 'Same defense main do happen. Sit could enough technology with everything.\nBar investment sort civil style whom thing professional. Year help stay cultural two. Summer professor far born order leg right.',
 ' 2003-07-04 ',
 'By Christina Griffin ',
 'Ability thus bank seem. Resource perform whose keep garden.\nSmile write color enter across can. Part campaign artist west that north. Protect think surface who.',
 ' 2000-09-25 ',
 'By Frances Moody ',
 'Whose individual develop sell business side opportunity. Decade result hospital over. Page under answer.\nProbably read contain arm break somebody. Truth yourself common process likely. People a action less treatment finish. Manager onto this citizen.',
 ' 1971-03-15 ',
 'By Desiree Palmer ',
 'Community it manage possible total these. Despite none future another enter arrive. Great the while ask dream foot

In [27]:
def extract_links_and_text(el):
    return dict(link=el.attrs['href'], text=el.text)

In [10]:
type(soup)

bs4.BeautifulSoup

In [14]:
soup.find_all('h1')

[<h1 class="my-5 text-4xl text-center">News!</h1>]

In [34]:
# select returns a list of every element matching the CSS selector (here, the class grid-cols-4)
articles = soup.select('.grid-cols-4')

In [36]:
article = articles[0]

In [40]:
headline = article.h2.text
headline

'father my PM'

In [44]:
date = article.p.text.strip()
date

'1993-08-27'

In [48]:
# select returns a list, so [0] takes the first match
# (a class selector needs a leading dot)
# strip removes the surrounding whitespace
# [3:] drops the leading "By "
author = article.select('.text-right')[0].text.strip()[3:]
author

'Matthew Hunt'
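Slicing with `[3:]` assumes every byline starts with "By ". A slightly more defensive sketch uses `select_one` and only drops the prefix when it is present. The snippet below runs on a byline copied from the page output above, wrapped in a `div` for the example; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Byline markup copied from the output above, wrapped in a div for this example
snippet = '<div><p class="text-right">By Breanna Green </p></div>'
byline = BeautifulSoup(snippet, 'html.parser').select_one('.text-right')

byline_text = byline.text.strip()
# Only remove "By " when the text really starts with it
author_name = byline_text[3:] if byline_text.startswith('By ') else byline_text
print(author_name)
```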

In [54]:
# The body is the last p tag inside the article:
# select every p tag, then take the last one
body = article.select('p')[-1].text
body

'Three him per year letter safe room. You raise pressure century compare former population.\nProfessional material notice drug north full. Many may happy time human. Name successful government art.'

In [None]:
# A class selector is the class name prefixed with a dot, e.g. '.text-right'
soup.select('.text-right')

In [None]:
# Use beautifulsoup methods to extract necessary content from an article

Bringing it all together: Make a function

In [55]:
def parse_news(article):
    headline = article.h2.text
    date = article.p.text.strip()
    author = article.select('.text-right')[0].text.strip()[3:]
    content = article.select('p')[-1].text
    
    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }

In [57]:
parse_news(article)

{'headline': 'father my PM',
 'date': '1993-08-27',
 'author': 'Matthew Hunt',
 'content': 'Three him per year letter safe room. You raise pressure century compare former population.\nProfessional material notice drug north full. Many may happy time human. Name successful government art.'}

In [58]:
# loop through all the articles
pd.DataFrame([parse_news(article) for article in articles])


| | headline | date | author | content |
|---|---|---|---|---|
| 0 | father my PM | 1993-08-27 | Matthew Hunt | Three him per year letter safe room. You raise... |
| 1 | alone with instead | 1993-05-05 | Sara Ramirez | Including human in product moment think. Write... |
| 2 | if program up | 2010-09-02 | Grant Martinez | Everyone among mind exist administration. Just... |
| 3 | successful upon Mr | 1982-03-17 | Stacie Richardson | Probably draw likely. Data the leave science b... |
| 4 | or third upon | 1972-06-28 | Jennifer Martin | Public anyone there know suggest. Voice social... |
| 5 | stage manage thought | 1985-06-15 | Victoria Lopez | Wind owner suddenly stand soldier let. What pe... |
| 6 | maybe positive land | 1972-12-26 | Mary Irwin | Daughter follow operation describe hit. Travel... |
| 7 | toward animal your | 1975-10-25 | Daniel Lee | Federal approach candidate. Society avoid seri... |
| 8 | can career past | 2017-12-19 | Melody Rogers | Foot adult price. Trade cover explain deep her... |
| 9 | eye begin edge | 1979-05-19 | Brian Howard | Laugh morning machine table. Chance art themse... |
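The whole parse step can also be wrapped so that it goes from an HTML string to a list of dictionaries in one call. The sketch below repeats `parse_news` so that it runs on its own and adds a hypothetical helper, `parse_news_page`. The sample markup is a simplified, hand-written imitation of the page structure (the headlines and body text are invented), so it runs without a network request.

```python
from bs4 import BeautifulSoup

def parse_news(article):
    # Same extraction logic as the function defined above
    headline = article.h2.text
    date = article.p.text.strip()
    author = article.select('.text-right')[0].text.strip()[3:]
    content = article.select('p')[-1].text
    return {'headline': headline, 'date': date, 'author': author, 'content': content}

def parse_news_page(html):
    """Parse every article card in the page's HTML into a list of dicts."""
    page = BeautifulSoup(html, 'html.parser')
    return [parse_news(card) for card in page.select('.grid-cols-4')]

# Simplified imitation of the news page, not the real markup
sample_html = """
<div class="grid gap-y-12">
  <div class="grid grid-cols-4">
    <img src="/static/placeholder.png" />
    <div class="col-span-3">
      <h2>first headline</h2>
      <p> 1999-08-12 </p>
      <p class="text-right">By Breanna Green </p>
      <p>First body.</p>
    </div>
  </div>
  <div class="grid grid-cols-4">
    <img src="/static/placeholder.png" />
    <div class="col-span-3">
      <h2>second headline</h2>
      <p> 2003-07-04 </p>
      <p class="text-right">By Christina Griffin </p>
      <p>Second body.</p>
    </div>
  </div>
</div>
"""
rows = parse_news_page(sample_html)
print(rows)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame`, as in the cell above.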


## Scraping People

In [59]:
response = requests.get(
    'https://web-scraping-demo.zgulde.net/people',
    headers={'user-agent': 'Codeup DS Hopper'},
)
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
# def parse_person(person):
#     name = 
#     quote = 
#     email = 
#     phone = 
#     address = 

    
#     return {
#         'name': name, 'quote': quote, 'email': email,
#         'phone': phone,
#         'address': address
#     }

In [None]:
# loop through all the persons


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [Codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
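One way to check `robots.txt` rules from Python is the standard library's `urllib.robotparser`. The sketch below parses a hypothetical set of rules written inline so that it runs offline; the rules and URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, written inline so the example runs offline
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = 'codeup data science germain cohort'
allowed = parser.can_fetch(user_agent, 'http://example.com/news')
blocked = parser.can_fetch(user_agent, 'http://example.com/private/page')
print(allowed, blocked)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the real file instead of the inline rules.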

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs of at least 5 distinct blog posts. For each post, scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that returns a list of dictionaries, with each dictionary representing one article. Each dictionary should hold at least the article's title and content.

In [66]:
response = requests.get(
    'https://web-scraping-demo.zgulde.net/people',
    headers={'user-agent': 'Codeup DS Hopper'},
)
soup = BeautifulSoup(response.text, 'html.parser')

In [67]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Example People Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   People
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
   <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
    <h2 class="text-2xl text-purp

In [69]:
people = soup.select('.person')
people

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Micheal Garcia</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Re-engineered bi-directional core"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">william71@nunez-garrett.com</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">+1-036-840-8069</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 50941 Joshua Station <br/>
                 Hansenstad, NY 55234
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-

In [71]:
person = people[0]
person

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Micheal Garcia</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Re-engineered bi-directional core"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">william71@nunez-garrett.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">+1-036-840-8069</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                50941 Joshua Station <br/>
                Hansenstad, NY 55234
            </p>
</div>
</div>

In [74]:
name = person.h2.text
name

'Micheal Garcia'

In [80]:
person.select('p')[0].text.strip()

'"Re-engineered bi-directional core"'

In [88]:
quote = person.select('.quote')[0].text.strip()
quote

'"Re-engineered bi-directional core"'

In [89]:
email = person.select('.email')[0].text.strip()
email

'william71@nunez-garrett.com'

In [90]:
phone = person.select('.phone')[0].text.strip()
phone

'+1-036-840-8069'

In [104]:
address = person.select('.address')[0].text.replace('\n', '').replace('        ','').strip()
address

'50941 Joshua Station Hansenstad, NY 55234'
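The chained `replace` calls depend on the exact indentation in the page source. A sketch of a less fragile alternative is to split the text on any whitespace and rejoin it with single spaces. The markup below is copied from the person element shown above; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Address markup copied from the person element shown above
address_html = """
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                50941 Joshua Station <br/>
                Hansenstad, NY 55234
            </p>
</div>
"""
address_el = BeautifulSoup(address_html, 'html.parser').select_one('.address')

# str.split() with no arguments splits on any run of whitespace, so joining
# the pieces with single spaces collapses both newlines and indentation
clean_address = ' '.join(address_el.text.split())
print(clean_address)
```

This gives the same result as the `replace` chain here, and it keeps working if the page's indentation changes.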

In [106]:
def parse_people(person):
    name = person.h2.text
    quote = person.select('.quote')[0].text.strip()
    email = person.select('.email')[0].text.strip()
    phone = person.select('.phone')[0].text.strip()
    # Remove the newlines and indentation inside the address block
    address = person.select('.address')[0].text.replace('\n', '').replace('        ', '').strip()

    return {
        'name': name,
        'quote': quote,
        'email': email,
        'phone': phone,
        'address': address,
    }

In [107]:
parse_people(person)

{'name': 'Micheal Garcia',
 'quote': '"Re-engineered bi-directional core"',
 'email': 'william71@nunez-garrett.com',
 'phone': '+1-036-840-8069',
 'address': '50941 Joshua Station Hansenstad, NY 55234'}

In [108]:
pd.DataFrame([parse_people(person) for person in people])


| | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | Micheal Garcia | "Re-engineered bi-directional core" | william71@nunez-garrett.com | +1-036-840-8069 | 50941 Joshua Station Hansenstad, NY 55234 |
| 1 | Suzanne Blanchard | "Managed contextually-based hub" | laura10@yahoo.com | 001-713-075-9081x17798 | 273 Michael Roads Harperland, MA 61401 |
| 2 | Taylor Yang | "Streamlined non-volatile implementation" | davidhernandez@blake-thompson.org | 001-973-787-6452x65347 | 497 Baker Fork Suite 820 Williamfurt, DE 14915 |
| 3 | Joseph Bennett | "Programmable background knowledgebase" | uchan@yahoo.com | 239-208-6791 | 1782 Hopkins Prairie Apt. 507 West Julie, NM 8... |
| 4 | John Thomas | "De-engineered even-keeled paradigm" | jenniferarcher@yahoo.com | +1-870-617-1607x3753 | 1726 Hodges Pike Apt. 821 Orozcotown, GA 80459 |
| 5 | Carol Phillips | "Organized neutral infrastructure" | robertsonmax@jackson-collins.com | 1280443578 | 92314 Mcdowell Drives Apt. 879 Port Cynthia, L... |
| 6 | Bethany Freeman | "Vision-oriented radical contingency" | riverajuan@perkins-vega.org | 8255313357 | 61043 Castaneda Meadow East Mary, SC 81413 |
| 7 | Jessica Payne | "Seamless cohesive forecast" | calebho@hotmail.com | 032.046.6784x2193 | 5672 Bailey Springs Jessicaburgh, MD 29237 |
| 8 | Heather Scott | "Synergistic bi-directional time-frame" | penny13@lee-cook.info | 728-527-1422 | 1443 Leonard Knoll Apt. 794 Edwardsberg, HI 05088 |
| 9 | Dr. Paul Green | "Universal 24/7 system engine" | dsanford@yahoo.com | +1-558-100-2400x8536 | 79927 Paul Union Port Leslie, NV 02885 |
