# Web Scraping

In [1]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
url = 'https://web-scraping-demo.zgulde.net/news'
response = get(url)
response

<Response [200]>
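
A `<Response [200]>` confirms that this request succeeded, but nothing above stops the notebook from carrying on when a request fails. The following is a minimal sketch of a fetch helper, not part of the original notebook. The name `fetch_html`, the `http_get` parameter, the user-agent string, and the 10-second timeout are all assumptions chosen for illustration.

```python
def fetch_html(url, http_get, user_agent="example-scraper", timeout=10):
    """Fetch a page and return its HTML, raising if the request failed.

    `http_get` is any callable with the same interface as `requests.get`
    (hypothetical parameter, so the helper can be exercised without a network).
    """
    response = http_get(url, headers={"user-agent": user_agent}, timeout=timeout)
    # raise_for_status() raises an HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With the imports from the first cell, `fetch_html(url, get)` would return the page's HTML as a string, ready to pass to `BeautifulSoup`.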

In [3]:
print(response.text[:400])

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>News Example Page</title>
    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap


In [4]:
# Parse the response HTML into a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

In [5]:
articles = soup.select('div.grid.grid-cols-4')

In [6]:
article = articles[0]
article

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">traditional family early</h2>
<div class="grid grid-cols-2 italic">
<p> 1972-12-02 </p>
<p class="text-right">By Scott Mcguire </p>
</div>
<p>Nearly them painting statement somebody. Operation yeah hold.
Matter international indeed respond kid and name. Have dog report.</p>
</div>
</div>

In [7]:
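def parse_news_article(article):
    """Return the headline, date, byline, and description of one article card."""
    output = {}
    output['headline'] = article.find('h2').text
    # The card shown above has three <p> tags, in this order: date, byline, description
    output['date'], output['byline'], output['description'] = [p.text for p in article.find_all('p')]
    return output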

In [8]:
pd.DataFrame([parse_news_article(article) for article in articles])

,headline,date,byline,description
0,traditional family early,1972-12-02,By Scott Mcguire,Nearly them painting statement somebody. Opera...
1,seek easy tree,1990-05-16,By Tracy Snyder,Final theory bring fall appear law defense. Co...
2,successful network put,1974-08-31,By Lindsey Hernandez,Until trouble you action. Better area human ne...
3,second group explain,2016-08-23,By Jackson Caldwell,Carry particular ready lay catch opportunity w...
4,official view will,1971-10-24,By Diana Fields,Artist degree per final thought many according...
5,cell store smile,2001-06-15,By Alexander Golden,Then later account four too painting. Rise exp...
6,thank understand attack,1979-11-21,By Hannah Parker,Indicate foot serious. Push grow manager my re...
7,prepare strong once,2017-01-12,By Robert Cruz,Official usually between. Spend base trial age...
8,company star actually,2013-02-22,By Douglas Vega,Evening step forward standard. Member so tree ...
9,student resource respond,1993-12-07,By Justin Foster,Suddenly control hot someone including safe tr...
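
The tuple unpacking in `parse_news_article` raises a `ValueError` if a card ever has fewer or more than three `<p>` tags. The following is a hedged sketch of a more tolerant variant; the name `parse_news_article_safe` and the sample markup are illustrative assumptions, modeled on the card printed earlier. Unlike the original function, it also strips surrounding whitespace.

```python
from bs4 import BeautifulSoup

def parse_news_article_safe(article):
    """Parse one article card, tolerating missing or extra <p> tags."""
    headline = article.find('h2')
    paragraphs = [p.text.strip() for p in article.find_all('p')]
    # Pad with None and truncate so the unpacking always sees three values
    date, byline, description = (paragraphs + [None] * 3)[:3]
    return {
        'headline': headline.text.strip() if headline else None,
        'date': date,
        'byline': byline,
        'description': description,
    }

# Sample card (assumption): same shape as the real one, description removed
sample_html = """
<div class="grid grid-cols-4">
  <h2>traditional family early</h2>
  <div><p> 1972-12-02 </p><p class="text-right">By Scott Mcguire </p></div>
</div>
"""
card = BeautifulSoup(sample_html, 'html.parser').select_one('div.grid.grid-cols-4')
print(parse_news_article_safe(card))
```

A missing field shows up as `None` in the resulting DataFrame instead of stopping the whole scrape.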


People exercise

In [9]:
url = 'https://web-scraping-demo.zgulde.net/people'
response = get(url)
response

<Response [200]>

In [10]:
response.text[:400]

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>Example People Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstr'

In [11]:
# Parse the response HTML into a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

In [12]:
people = soup.select('div.person')
people

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Leroy Miles</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Upgradable neutral productivity"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">gcoleman@hotmail.com</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">296.400.2632</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 289 Alicia Village <br/>
                 North Russell, TX 60587
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full bord

In [13]:
person = people[0]
person

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Leroy Miles</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Upgradable neutral productivity"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">gcoleman@hotmail.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">296.400.2632</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                289 Alicia Village <br/>
                North Russell, TX 60587
            </p>
</div>
</div>

In [14]:
person.find_all('p')

[<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Upgradable neutral productivity"
         </p>,
 <p class="email col-span-8">gcoleman@hotmail.com</p>,
 <p class="phone col-span-8">296.400.2632</p>,
 <p class="col-span-8">
                 289 Alicia Village <br/>
                 North Russell, TX 60587
             </p>]

In [15]:
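def parse_people(person):
    """Return the name, quote, email, phone, and address of one person card."""
    output = {}
    output['name'] = person.find('h2').text
    # The four <p> tags appear in this order: quote, email, phone, address
    output['quote'], output['email'], output['phone'], output['address'] = [p.text.strip() for p in person.find_all('p')]
    return output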

In [16]:
parse_people(person)

{'name': 'Leroy Miles',
 'quote': '"Upgradable neutral productivity"',
 'email': 'gcoleman@hotmail.com',
 'phone': '296.400.2632',
 'address': '289 Alicia Village \n                North Russell, TX 60587'}
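
The address above still contains a newline and a run of indentation spaces, because `.strip()` only trims the ends of a string. One possible cleanup, sketched here with a hypothetical helper name `normalize_whitespace`, collapses every internal run of whitespace into a single space:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    # str.split() with no arguments splits on any whitespace run
    return ' '.join(text.split())

address = '289 Alicia Village \n                North Russell, TX 60587'
print(normalize_whitespace(address))
```

This relies only on `str.split` and `str.join`, so it can be applied to any of the scraped fields.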

In [17]:
pd.DataFrame([parse_people(person) for person in people])

,name,quote,email,phone,address
0,Leroy Miles,"""Upgradable neutral productivity""",gcoleman@hotmail.com,296.400.2632,289 Alicia Village \n North Rus...
1,Terry Watson,"""Re-engineered background data-warehouse""",shafferjoe@pugh.biz,+1-639-642-0739x681,"2016 Patel Curve \n Nelsonview,..."
2,Christina Anderson,"""Function-based homogeneous migration""",elee@gmail.com,(693)780-5919x3660,3907 Patricia Mountain Apt. 340 \n ...
3,Donna Black,"""Compatible analyzing implementation""",tammie49@mitchell-nichols.biz,001-196-315-4763,82219 Hardy Spurs \n South Lind...
4,Timothy Willis,"""Distributed responsive workforce""",kingbrooke@becker-schmidt.com,001-873-759-6898x66290,89754 Tiffany Grove Apt. 291 \n ...
5,James Howard,"""Automated well-modulated extranet""",dhartman@gmail.com,9709512314,77813 Cervantes Club \n North D...
6,Jennifer Ellis,"""Customer-focused bottom-line forecast""",olsonphillip@peters.com,(119)479-2783x79206,24806 Williams Circle \n Crysta...
7,Madeline Martinez,"""Programmable disintermediate alliance""",jeremymalone@hicks.org,719.927.8157,"268 Tyler Island \n Yorkshire, ..."
8,Joseph Goodwin,"""Team-oriented high-level parallelism""",thompsonethan@yahoo.com,8754940766,44296 Gonzales Turnpike \n Sara...
9,Donald Lewis,"""Virtual methodical capability""",smithjessica@chambers.org,610.901.8518x73218,"702 Lisa Mall \n Kevinshire, WY..."
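
The markup for each person carries descriptive class names (`name`, `quote`, `email`, `phone`, `address`), so the fields could also be looked up by class instead of by the position of each `<p>` tag. The following sketch is an alternative to `parse_people`, not a replacement from the original notebook; the function name and the trimmed-down sample markup are assumptions.

```python
from bs4 import BeautifulSoup

def parse_person_by_class(person):
    """Look up each field by its CSS class instead of by <p> position."""
    def text_of(selector):
        tag = person.select_one(selector)
        # get_text(' ', strip=True) strips each text fragment and joins with a space
        return tag.get_text(' ', strip=True) if tag else None

    return {
        'name': text_of('.name'),
        'quote': text_of('.quote'),
        'email': text_of('.email'),
        'phone': text_of('.phone'),
        'address': text_of('.address p'),
    }

# Sample markup (assumption): a simplified copy of the first person card
sample_html = """
<div class="person">
  <h2 class="name">Leroy Miles</h2>
  <p class="quote"> "Upgradable neutral productivity" </p>
  <div><p class="email">gcoleman@hotmail.com</p><p class="phone">296.400.2632</p></div>
  <div class="address"><p>289 Alicia Village <br/> North Russell, TX 60587</p></div>
</div>
"""
sample_person = BeautifulSoup(sample_html, 'html.parser').select_one('div.person')
print(parse_person_by_class(sample_person))
```

Selecting by class keeps working if the page reorders its fields, which positional unpacking does not.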


# Exercises
---

1. Codeup Blog Articles
 - Visit Codeup's Blog and record the URLs for at least 5 distinct blog posts.
 - For each post, scrape at least the following:
    - title
    - content
 - Encapsulate your work in a function named get_blog_articles that returns a list of dictionaries:
    - each dictionary represents one article
    - each dictionary should look like this:
        
        {
        'title': 'the title of the article',
        'content': 'the full text content of the article'
        }


In [18]:
headers = {'user-agent': 'Innis Data Science Cohort'}
url = 'https://codeup.com/blog/'
response = get(url, headers=headers)
response

<Response [200]>

In [19]:
response.text[:400]

'<!DOCTYPE html>\n<html lang="en-US">\n<head>\n\t<meta charset="UTF-8" />\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n\t<link rel="pingback" href="https://codeup.com/xmlrpc.php" />\n\n\t<script type="text/javascript">\n\t\tdocument.documentElement.className = \'js\';\n\t</script>\n\t\n\t<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=wi'

In [20]:
soup = BeautifulSoup(response.text, 'html.parser')

In [115]:
blogs = soup.find_all('article')[:16]

In [116]:
blog = blogs[0]

In [117]:
blog.find('h2').text

'From Bootcamp to Bootcamp | A Military Appreciation Panel'

In [118]:
blog.find('p').text[:12]

'Apr 27, 2022'

In [119]:
blog.find('p').text[14:]

' Alumni Stories, Dallas, Events, Featured, Military, San Antonio, Veterans, Virtual, Workshops'

In [120]:
blog.find_all('p')[1].text

'In honor of Military Appreciation Month, join us for a discussion with Codeup Alumni who are also Military Veterans!...'

In [121]:
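def parse_blog(blog):
    """Return the title, date, tags, and preview text of one blog card."""
    output = {}
    output['title'] = blog.find('h2').text
    # The first <p> holds the date followed by the tags; the fixed-width
    # slices below assume a 12-character date such as 'Apr 27, 2022'
    output['date'] = blog.find('p').text[:12]
    output['tags'] = blog.find('p').text[14:]
    # The second <p> holds the preview text
    output['content'] = blog.find_all('p')[1].text
    return output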

In [122]:
parse_blog(blog)

{'title': 'From Bootcamp to Bootcamp | A Military Appreciation Panel',
 'date': 'Apr 27, 2022',
 'tags': ' Alumni Stories, Dallas, Events, Featured, Military, San Antonio, Veterans, Virtual, Workshops',
 'content': 'In honor of Military Appreciation Month, join us for a discussion with Codeup Alumni who are also Military Veterans!...'}
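
The fixed-width slices `[:12]` and `[14:]` fit a date like 'Apr 27, 2022', but a shorter date such as 'Apr 1, 2022' shifts every character by one, and the tags keep a leading space. The following sketch splits the same string with a regular expression instead. The helper name `split_post_meta` is hypothetical, and the ` | ` separator in the sample strings is an assumption, since the notebook output does not show the characters between the date and the tags.

```python
import re
from datetime import datetime

# Matches dates shaped like 'Apr 27, 2022' or 'Apr 1, 2022'
DATE_PATTERN = re.compile(r'[A-Z][a-z]{2} \d{1,2}, \d{4}')

def split_post_meta(meta_text):
    """Split a 'date <separator> tag, tag, ...' string into (date, tags)."""
    match = DATE_PATTERN.search(meta_text)
    if match is None:
        return None, []
    date = datetime.strptime(match.group(), '%b %d, %Y').date()
    # Drop whatever separator follows the date, then split the tags on commas
    remainder = re.sub(r'^\W+', '', meta_text[match.end():])
    tags = [tag.strip() for tag in remainder.split(',') if tag.strip()]
    return date, tags

# The ' | ' separator in these samples is an assumption
print(split_post_meta('Apr 27, 2022 | Alumni Stories, Dallas, Events'))
print(split_post_meta('Apr 1, 2022 | Virtual, Workshops'))
```

Returning a real `date` object and a list of tags also makes the resulting DataFrame easier to sort and filter than raw strings.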

In [123]:
blogs[5].find_all('p')[1].text

"On this International Women's Day 2022 we wanted to tell stories about women in tech. What better way to do that than..."

In [124]:
pd.DataFrame([parse_blog(blog) for blog in blogs])

,title,date,tags,content
0,From Bootcamp to Bootcamp | A Military Appreci...,"Apr 27, 2022","Alumni Stories, Dallas, Events, Featured, Mil...","In honor of Military Appreciation Month, join ..."
1,Our Acquisition of the Rackspace Cloud Academy...,"Apr 14, 2022","Codeup News, Featured, IT Training","Just about a year ago on April 16th, 2021 we a..."
2,Learn to Code: HTML & CSS on 4/30,"Apr 1, 2022","Virtual, Workshops",HTML & CSS are the design building blocks of a...
3,Learn to Code: Python Workshop on 4/23,"Mar 31, 2022","Events, Virtual, Workshops","According to LinkedIn, the ""#1 Most Promising ..."
4,Coming Soon: Cloud Administration,"Mar 17, 2022",Codeup News,We're launching a new program out of San Anton...
5,5 Books Every Woman In Tech Should Read,"Mar 8, 2022",Featured,On this International Women's Day 2022 we want...
6,Codeup Start Dates for March 2022,"Jan 26, 2022",Codeup News,As we approach the end of January we wanted to...
7,VET TEC Funding Now Available For Dallas Veterans,"Jan 7, 2022","Codeup News, Dallas Newsletter, Featured, Tips...",We are so happy to announce that VET TEC benef...
8,Dallas Campus Re-opens With New Grant Partner,"Dec 30, 2021","Codeup News, Featured",We are happy to announce that our Dallas campu...
9,Codeup’s Placement Team Continues Setting Records,"Nov 19, 2021","Codeup News, Employers",Our Placement Team is simply defined as a grou...
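
The exercise asks for a function named `get_blog_articles`, which the cells above do not yet define. The sketch below wraps the same approach (one `<article>` tag per post, `<h2>` for the title, second `<p>` for the text) into that function. It is an assumption-laden sketch: it takes the listing page's HTML as input, and it returns only the preview text shown on that page, not the full article content the exercise describes.

```python
from bs4 import BeautifulSoup

def get_blog_articles(html):
    """Return a list of {'title', 'content'} dicts, one per <article> tag."""
    soup = BeautifulSoup(html, 'html.parser')
    articles = []
    for blog in soup.find_all('article'):
        title = blog.find('h2')
        paragraphs = blog.find_all('p')
        # Skip entries that lack a title or a second <p> with the preview text
        if title is None or len(paragraphs) < 2:
            continue
        articles.append({
            'title': title.text.strip(),
            'content': paragraphs[1].text.strip(),
        })
    return articles

# Sample markup (assumption): two posts, the second one incomplete
sample_html = """
<article><h2>First post</h2><p>Apr 27, 2022</p><p>Preview text...</p></article>
<article><h2>Second post</h2><p>Apr 14, 2022</p></article>
"""
print(get_blog_articles(sample_html))
```

To collect the full text, each post's own URL would need to be fetched and parsed separately, which depends on markup not shown here.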


---
2. News Articles
 - We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.
 - Write a function that scrapes the news articles for the following topics:
     - Business
     - Sports
     - Technology
     - Entertainment
 - The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:
 
    {
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
    }

In [129]:
url = 'https://inshorts.com/en/read/business'
response = get(url)
response

<Response [200]>

In [130]:
response.text[:500]

'<!doctype html>\n<html lang="en">\n\n<head>\n  <meta charset="utf-8" />\n  <style>\n    /* The Modal (background) */\n    .modal_contact {\n        display: none; /* Hidden by default */\n        position: fixed; /* Stay in place */\n        z-index: 8; /* Sit on top */\n        left: 0;\n        top: 0;\n        width: 100%; /* Full width */\n        height: 100%;\n        overflow: auto; /* Enable scroll if needed */\n        background-color: rgb(0,0,0); /* Fallback color */\n        background-color: rgba(0,'

In [131]:
soup = BeautifulSoup(response.text, 'html.parser')

In [158]:
news = soup.select('.news-card')

In [166]:
news[0].find('div')

<div class="news-card-image" style="background-image: url('https://static.inshorts.com/inshorts/images/v1/variants/jpg/m/2022/05_may/6_fri/img_1651833498385_383.jpg?')">
</div>
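
The notebook stops after locating the `.news-card` elements. The following is one hedged sketch of how `get_news_articles` might be completed. Several parts are assumptions rather than facts from the cells above: the `itemprop` selectors for the title and body, the idea that every category shares the URL pattern of the business page, and the `fetch` parameter, which stands in for whatever function downloads a page.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://inshorts.com/en/read/'
CATEGORIES = ['business', 'sports', 'technology', 'entertainment']

def parse_news_cards(html, category,
                     title_selector='[itemprop="headline"]',
                     content_selector='[itemprop="articleBody"]'):
    """Turn one category page into a list of article dictionaries.

    Both default selectors are assumptions; inspect the live page and
    adjust them as needed.
    """
    soup = BeautifulSoup(html, 'html.parser')
    articles = []
    for card in soup.select('.news-card'):
        title = card.select_one(title_selector)
        content = card.select_one(content_selector)
        if title is None or content is None:
            continue  # skip cards that do not match the assumed markup
        articles.append({
            'title': title.get_text(strip=True),
            'content': content.get_text(strip=True),
            'category': category,
        })
    return articles

def get_news_articles(fetch, categories=CATEGORIES):
    """Scrape every category; `fetch` is any callable mapping a URL to HTML."""
    articles = []
    for category in categories:
        articles.extend(parse_news_cards(fetch(BASE_URL + category), category))
    return articles
```

With the notebook's imports, `get_news_articles(lambda url: get(url).text)` would run this against the live site, and the result could be passed to `pd.DataFrame` as in the earlier sections.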