# Web Scraping Acquire Notebook
By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions should be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).


In [1]:
from requests import get
from bs4 import BeautifulSoup
import os

### 1. Codeup Blog Articles

Scrape the article text from the following pages:

- https://codeup.com/codeups-data-science-career-accelerator-is-here/
- https://codeup.com/data-science-myths/
- https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
- https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
- https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

```
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
```
Plus any additional properties you think might be helpful.

In [2]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [3]:
response

<Response [200]>

In [4]:
response.status_code

200

In [5]:
print(response.text[:400])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />
<style type="text/css" id="nab-alternative-loader-style"></style>
<script type="text/javascript" id="nelio-ab-testing-kickoff">/*nelio-ab-testing-kick


In [6]:
# Make a soup variable holding the response content
# name the parser explicitly; otherwise BeautifulSoup picks the best one installed and warns
soup = BeautifulSoup(response.content, 'html.parser')


In [7]:
soup.find('h1')

<h1 class="jupiterx-post-title" itemprop="headline">Codeup’s Data Science Career Accelerator is Here!</h1>

In [8]:
articlebody = soup.find('div', itemprop= 'articleBody')

In [9]:
articlebody.text

# replace \n and \xa0

# potential regex to remove the formatting: r'[\n\xa0]'

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Student
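
The `\n` and `\xa0` characters visible in the output above could be normalized with a small helper. This is a minimal sketch; `clean_text` is a hypothetical name that does not appear in the notebook, and the sample string is a short excerpt of the scraped text.

```python
import re

def clean_text(text):
    """Replace newlines and non-breaking spaces, then collapse repeated whitespace."""
    # [\s\xa0]+ matches any run of whitespace, including \n and \xa0
    return re.sub(r'[\s\xa0]+', ' ', text).strip()

sample = 'land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method'
cleaned = clean_text(sample)
print(cleaned)
```

Collapsing every whitespace run to a single space also removes paragraph breaks, so keep the raw text if paragraph structure matters later.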

In [10]:
article = soup.find('div', class_='jupiterx-post-content')
article.text

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Student

In [11]:
with open('article.txt', 'w') as f:
    f.write(article.text)

In [12]:
def get_article_text():
    # if we already have the data, read it locally
    # (os is imported at the top of the notebook, so use os.path.exists)
    if os.path.exists('article.txt'):
        with open('article.txt') as f:
            return f.read()

    # otherwise go fetch the data
    url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    # name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(response.text, 'html.parser')
    article = soup.find('div', class_='jupiterx-post-content')

    # save it for next time
    with open('article.txt', 'w') as f:
        f.write(article.text)

    return article.text
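
The read-from-disk-or-fetch pattern above can be pulled out into a reusable helper. The sketch below is one possible shape, not part of the exercise: `cached_text` and `fake_fetch` are hypothetical names, and the fetch step is passed in as a function so the caching logic can be tried without a network request.

```python
import os
import tempfile

def cached_text(filename, fetch):
    """Return the text stored in filename, calling fetch() only if the file is missing."""
    if os.path.exists(filename):
        with open(filename, encoding='utf-8') as f:
            return f.read()
    text = fetch()
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(text)
    return text

# Stand-in for the real request; records how many times it is called
calls = []
def fake_fetch():
    calls.append(1)
    return 'article body'

with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, 'article.txt')
    first = cached_text(cache_file, fake_fetch)   # fetches and writes the file
    second = cached_text(cache_file, fake_fetch)  # reads the file, no fetch

print(first == second, len(calls))
```

In the notebook, `fetch` would be a function that performs the request and returns `article.text`.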

In [13]:
website_list = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
                'https://codeup.com/data-science-myths/',
                'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
                'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
                'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

# all titles are under jupiterx-post-header (header class)

title_finder = 'jupiterx-post-header'

# all body text is under jupiterx-post-content (div class)

body_finder = 'jupiterx-post-content'

In [14]:
# # initialize empty list for the dictionaries
# article_list = []

# # set up headers
# headers = {'User-Agent': 'Codeup Data Science'} 

# # loop through list of websites
# for website in website_list: 
    
#     # get response
#     response = get(website, headers=headers)

#     # create soup object
#     soup = BeautifulSoup(response.text)
#     # find title
#     title = soup.find('header', class_=title_finder).text

#     # find body
#     body = soup.find('div', class_=body_finder).text
    
#     # create dictionary
#     dictionary = {'title': title,
#                  'content': body}
    
#     # add dictionary to list of dictionaries
#     article_list.append(dictionary)



In [15]:
soup.get_text(strip=True)

soup.title.string

'Codeup’s Data Science Career Accelerator is Here! - Codeup'

In [16]:
# have list of websites and appropriate finders

def get_blog_articles(websites, title_finder, body_finder):
    '''
    This function takes in a list of website urls, 
    the title finder and body finder (must be the same for each article)
    And returns a list of dictionaries with title text and body text in dictionaries
    Keys in dictionaries are 'title' and 'content'
    Note: title_finder is currently unused because the title is read from the page's <title> tag
    '''
    
    # initialize empty list for the dictionaries
    article_list = []
    
    # set up headers
    headers = {'User-Agent': 'Codeup Data Science'} 
    
    # loop through the list of websites passed in (not the global website_list)
    for website in websites: 
        
        # get response
        response = get(website, headers=headers)
    
        # create soup object, naming the parser explicitly
        soup = BeautifulSoup(response.text, 'html.parser')
        # find title
        #title = soup.find('header', class_=title_finder).text
        title = soup.title.string
    
        # find body
        body = soup.find('div', class_=body_finder).get_text(strip=True)
        #body = soup.get_text(strip=True)
        
        # create dictionary
        dictionary = {'title': title,
                     'content': body}
        
        # add dictionary to list of dictionaries
        article_list.append(dictionary)
        
    return article_list

In [17]:
website_list = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
                'https://codeup.com/data-science-myths/',
                'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
                'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
                'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

# all titles are under jupiterx-post-header (header class)

title_finder = 'jupiterx-post-header'

# all body text is under jupiterx-post-content (div class)

body_finder = 'jupiterx-post-content'

articles_list = get_blog_articles(website_list, title_finder, body_finder)

In [20]:
import pandas as pd
pd.DataFrame(articles_list)

| | title | content |
|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is He... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths - Codeup | By Dimitri Antoniou and Maggie GiustData Scien... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | By Dimitri AntoniouA week ago, Codeuplaunched ... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair - ... | SA Tech Job FairThe third bi-annualSan Antonio... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | Competitor Bootcamps Are Closing. Is the Model... |
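
Two details stand out in the output above. The titles carry a " - Codeup" suffix because they come from the `<title>` tag, and the content has words run together ("Codeuplaunched", "bi-annualSan Antonio") because `get_text(strip=True)` joins text fragments with no separator. The sketch below shows one way to address both. `parse_article` is a hypothetical helper, and the HTML is a made-up fragment that only mimics the class names seen earlier.

```python
from bs4 import BeautifulSoup

def parse_article(html, body_class='jupiterx-post-content'):
    """Return {'title', 'content'} parsed from one article page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    # Use the <h1> headline rather than <title>, which carries the site suffix
    title = soup.find('h1').get_text(strip=True)
    # separator=' ' keeps a space between text fragments from adjacent tags
    body = soup.find('div', class_=body_class).get_text(separator=' ', strip=True)
    return {'title': title, 'content': body}

# Hypothetical HTML standing in for a downloaded page
sample_html = '''
<html><head><title>Data Science Myths - Codeup</title></head>
<body>
<h1 class="jupiterx-post-title">Data Science Myths</h1>
<div class="jupiterx-post-content"><p>By Dimitri Antoniou</p><p>A week ago, Codeup <a href="#">launched</a> a program.</p></div>
</body></html>
'''
article = parse_article(sample_html)
print(article)
```

Because the function takes HTML rather than a URL, it can be checked against saved pages without re-requesting them. One trade-off: a space separator can leave a stray space before punctuation that directly follows an inline tag.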


### 2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

```
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

Hints:

- a. Start by inspecting the website in your browser. Figure out which elements will be useful.
- b. Start by creating a function that handles a single article and produces a dictionary like the one above.
- c. Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- d. Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.


In [21]:
# get response 

url = 'https://inshorts.com/en/read/'
headers = {'User-Agent': 'Codeup Data Science'} 
response = get(url, headers=headers)

#### Finders
Headline: 'span' itemprop = 'headline'

```html
<span itemprop="headline">Amazon job posting fuels speculations about plan to accept payments in crypto</span>
```

Body Text: 'div' itemprop = 'articleBody'

```html
<div itemprop="articleBody">A new job posting by Amazon has fuelled speculations that the e-commerce major may begin accepting Bitcoin, Ether and other cryptocurrencies as a form of payment. According to the job posting, Amazon's Payments Acceptance &amp; Experience team is hiring a 'Digital Currency and Blockchain Product Lead'. Following the speculations around Amazon's plan, Bitcoin surged near $40,000 on Monday.</div>
```
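
Using the two finders above, a page-level parser could be sketched as follows. `parse_news_page` is a hypothetical name, and the sample HTML is invented for illustration, so the real page layout may differ. The function takes HTML text rather than a URL, so it can be tried without a request.

```python
from bs4 import BeautifulSoup

def parse_news_page(html, category):
    """Return one dict per article found in a category page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    headlines = soup.find_all('span', itemprop='headline')
    bodies = soup.find_all('div', itemprop='articleBody')
    # zip stops at the shorter list instead of raising an IndexError
    return [
        {'title': headline.get_text(strip=True),
         'content': body.get_text(strip=True),
         'category': category}
        for headline, body in zip(headlines, bodies)
    ]

# Invented fragment that reuses the itemprop attributes listed above
sample_html = '''
<div><span itemprop="headline">First headline</span>
<div itemprop="articleBody">First body.</div></div>
<div><span itemprop="headline">Second headline</span>
<div itemprop="articleBody">Second body.</div></div>
'''
articles = parse_news_page(sample_html, 'business')
print(articles)
```

This pairs headlines and bodies by position, which assumes every article on the page has both elements in the same order.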

In [22]:
soup = BeautifulSoup(response.content, 'html.parser')

In [23]:
soup.find_all('span', itemprop = 'headline')

[<span itemprop="headline">'Were told to delete porn from servers', claim Raj Kundra's employees: Report</span>,
 <span itemprop="headline">JEE Advanced 2021 exam for admission to IITs to be held on October 3: Govt</span>,
 <span itemprop="headline">What is India's schedule at Tokyo Olympics for tomorrow?</span>,
 <span itemprop="headline">You gave your best: PM on Bhavani's 'sorry' tweet after crashing out of Olympics</span>,
 <span itemprop="headline">Delta COVID-19 variant is now dominant in most European countries: WHO</span>,
 <span itemprop="headline">I believe in Karma: U'khand CM on shifting to official 'jinxed' residence</span>,
 <span itemprop="headline">1972 study's prediction of human society's collapse may come true: New research</span>,
 <span itemprop="headline">China's 23-yr-old TikTok star falls 160 ft from crane to her death during livestream</span>,
 <span itemprop="headline">UK High Court declares Vijay Mallya 'bankrupt' for Indian banks to realise debt</span>,
 <sp

In [24]:
soup.find_all('span', itemprop = 'headline')[0]

<span itemprop="headline">'Were told to delete porn from servers', claim Raj Kundra's employees: Report</span>

In [25]:
soup.find_all('span', itemprop = 'headline')[0].get_text()

"'Were told to delete porn from servers', claim Raj Kundra's employees: Report"

In [26]:
soup.find_all('div', itemprop = 'articleBody')[0].get_text()

"An employee at Raj Kundra's Viaan Industries has told the police that they were instructed to delete all the pornographic content from their servers, The Indian Express reported citing police sources. The employee said that the instructions came soon after an FIR was registered in February. The police have reportedly added sections of the destruction of evidence against the accused."

In [27]:
# set up base url and categories list
base_url = 'https://inshorts.com/en/read'
categories = ['sports', 'entertainment', 'business', 'technology']

In [28]:
# figure out how to use the categories to add to base url
base_url + '/' + categories[0]

'https://inshorts.com/en/read/sports'

In [29]:
# make function to create a list of urls 
# probably could just call this line in the main function but it was fun

def create_urls(base_url, categories):
    '''
    This function takes in a baseurl and list of categories
    It will create a new list with the base url a / and each category
    
    This is for scraping info from the inshorts website
    '''
    
    website_list = [base_url + '/' + category for category in categories]
    
    return website_list

In [30]:
create_urls(base_url, categories)

['https://inshorts.com/en/read/sports',
 'https://inshorts.com/en/read/entertainment',
 'https://inshorts.com/en/read/business',
 'https://inshorts.com/en/read/technology']
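
The notebook writes the base URL both with a trailing slash (`'https://inshorts.com/en/read/'`) and without one (`'https://inshorts.com/en/read'`). A variant that tolerates either form, and keeps each category next to its URL, might look like the sketch below. `build_category_urls` is a hypothetical name.

```python
def build_category_urls(base_url, categories):
    """Map each category to its page URL, ignoring any trailing slash on base_url."""
    base = base_url.rstrip('/')
    return {category: f'{base}/{category}' for category in categories}

urls = build_category_urls('https://inshorts.com/en/read/', ['sports', 'business'])
print(urls)
```

Returning a dictionary means the category and URL can be looped over together with `.items()` rather than zipping two separate lists.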

In [33]:
def get_blog_articles2(base_url, categories):
    '''
    This function takes in the inshorts base url and a list of categories,
    scrapes the headline and body of every article on each category page,
    and returns a dataframe with the columns 'title', 'content', and 'category'
    '''
    
    # initialize empty list for the dictionaries
    article_list = []
    
    # set up headers
    headers = {'User-Agent': 'Codeup Data Science'} 
    
    # create list of websites using the categories
    website_list = create_urls(base_url, categories)
    
    # loop through list of websites and category list
    for website, category in zip(website_list, categories): 
        
        # get response
        response = get(website, headers=headers)
    
        # create soup object, naming the parser explicitly
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # find titles
        headlines = soup.find_all('span', itemprop='headline')
        
        # find bodies 
        bodies = soup.find_all('div', itemprop='articleBody')
        
        # pair each headline with its body; zip stops at the shorter list
        for headline, body in zip(headlines, bodies):
            
            # create dictionary
            dictionary = {'title': headline.get_text(),
                         'content': body.get_text(),
                         'category': category}
            
            # add dictionary to list of dictionaries
            article_list.append(dictionary)
        
    return pd.DataFrame(article_list)

In [34]:
# set up base url and categories list
base_url = 'https://inshorts.com/en/read'
categories = ['sports', 'entertainment', 'business', 'technology']

# test function
get_blog_articles2(base_url, categories)

| | title | content | category |
|---|---|---|---|
| 0 | How does the medal tally look like after Monda... | China have slipped to third position in medal ... | sports |
| 1 | Manipur govt appoints Mirabai Chanu as Additio... | Manipur government has appointed weightlifter ... | sports |
| 2 | Video going viral with claim that Surya Namask... | A video that has recently gone viral with the ... | sports |
| 3 | You gave your best: PM on Bhavani's 'sorry' tw... | PM Narendra Modi tweeted "You gave your best a... | sports |
| 4 | India lose 0-2 to Germany in women's hockey at... | India lost 0-2 to Germany in a Pool A women's ... | sports |
| ... | ... | ... | ... |
| 92 | 12-year-old UK boy sells 3,350 NFTs for over $... | Benyamin Ahmed, a 12-year-old boy from UK, sol... | technology |
| 93 | Team Great Britain becomes 1st Olympic team to... | Team Great Britain has become the first Olympi... | technology |
| 94 | Dealing with misinformation is like fighting c... | Facebook CEO Mark Zuckerberg while discussing ... | technology |
| 95 | 37.8% of 10-year-olds have Facebook accounts a... | About 37.8% and 24.3% of 10-year-old children ... | technology |
