## Imports

In [1]:
import os  # needed by get_news_articles to check for the cached file

from requests import get
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

## Exercises

By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the end function should be present in acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).



### 1. Codeup Blog Articles

Visit Codeup's blog and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

In [2]:
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}


{'title': 'the title of the article',
 'content': 'the full text content of the article'}

Plus any additional properties you think might be helpful.
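A quick way to confirm that each scraped dictionary has the required shape is to check its keys. This is a minimal sketch, not part of the exercise; the helper name `missing_keys` and the constant `REQUIRED_KEYS` are made up for illustration.

```python
# Hypothetical helper: report which required keys an article dictionary lacks.
REQUIRED_KEYS = ("title", "content")


def missing_keys(article, required=REQUIRED_KEYS):
    """Return a sorted list of required keys that are absent from `article`."""
    return sorted(set(required) - set(article))


complete = {"title": "the title of the article",
            "content": "the full text content of the article",
            "published": "Jul 7, 2022"}  # extra keys are allowed
incomplete = {"title": "the title of the article"}

print(missing_keys(complete))    # []
print(missing_keys(incomplete))  # ['content']
```

Extra properties pass the check, which matches the note above that additional fields are welcome.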

---

### Set URLs

In [3]:
# get urls 
url1 = "https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/"
url2 = "https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/"
url3 = "https://codeup.com/workshops/san-antonio/in-person-workshop-learn-to-code-javascript-on-7-26/"
url4 = "https://codeup.com/workshops/in-person-workshop-learn-to-code-python-on-7-19/"
url5 = "https://codeup.com/tips-for-prospective-students/is-our-cloud-administration-program-right-for-you/"

### URL 1

In [4]:
# set headers and response
headers = {'User-Agent': 'CodeUp Data Science'}
response = get(url1, headers=headers)

In [5]:
# sanity check
print(response.text[:400])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=wi


In [6]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

In [7]:
soup.select('div.et_post_meta_wrapper')

[<div class="et_post_meta_wrapper">
 <h1 class="entry-title">What Jobs Can You Get After a Coding Bootcamp? Part 1: Data Science</h1>
 <p class="post-meta"><span class="published">Jul 7, 2022</span> | <a href="https://codeup.com/category/data-science/" rel="category tag">Data Science</a>, <a href="https://codeup.com/category/featured/" rel="category tag">Featured</a>, <a href="https://codeup.com/category/tips-for-prospective-students/" rel="category tag">Tips for Prospective Students</a></p><img alt="Data Science Biog Header" class="" height="675" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 1080px, 100vw" src="https://199lj33nqk3p88xz03dvn481-wpengine.netdna-ssl.com/wp-content/uploads/2022/07/CA-Blog-Header-1.png" srcset="https://199lj33nqk3p88xz03dvn481-wpengine.netdna-ssl.com/wp-content/uploads/2022/07/CA-Blog-Header-1.png 1080w, https://199lj33nqk3p88xz03dvn481-wpengine.netdna-ssl.com/wp-content/uploads/2022/07/CA-Blog-Header-1-480x279.png 480w" width="1

### With the post meta wrapper I can view
- The entry-title
- The span class published
- Category Tag
- entry-content

In [8]:
# get the title of the article
title = soup.find('h1', class_ = 'entry-title').text
title

'What Jobs Can You Get After a Coding Bootcamp? Part 1: Data Science'

In [9]:
# get the date published
published = soup.find('span', class_='published').text
published

'Jul 7, 2022'
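The published date comes back as plain text. If a sortable date is more useful, it could be converted with the standard library, as in the sketch below. The helper name `to_iso_date` is an assumption, and `%b` relies on an English locale for month abbreviations such as "Jul".

```python
from datetime import datetime


def to_iso_date(raw, fmt="%b %d, %Y"):
    """Parse a scraped date string and return it as YYYY-MM-DD.

    Assumes English month abbreviations (locale dependent).
    """
    return datetime.strptime(raw.strip(), fmt).date().isoformat()


# Format used by the blog's `published` span
print(to_iso_date("Jul 7, 2022"))                  # 2022-07-07
# A day-first format such as '18 Jul 2022' needs a different pattern
print(to_iso_date("18 Jul 2022", fmt="%d %b %Y"))  # 2022-07-18
```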

In [10]:
# get article category
category = soup.find('a', rel='category tag').text
category

'Data Science'

In [11]:
#get article content
content = soup.find('div', class_='entry-content').text.strip().replace('\n', ' ')
content

'If you are interested in embarking on a career in tech, you’re probably wondering what your new job title could be, and even what your salary might look like.*\xa0In this mini-series, we will take each of our programs here at Codeup: Data Science, Web Development, and Cloud Administration, and outline respectively potential job titles, as well as entry-level salaries.\xa0Today we will be diving into our Data Science program, with four potential job titles you could take on! Program Overview\xa0 During this 20-week program, you will have the opportunity to take your career to new heights with data science being one of the most needed jobs in tech. You’ll gather data, then clean it, explore it for trends, and apply machine learning models to make predictions. Upon completing this program, you will know how to turn insights into actionable recommendations. You’ll be a huge asset to any company, having all the technical skills to become a data scientist with projects upon projects of expe
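The content above still contains non-breaking spaces (`\xa0`) and, after the newline replacement, runs of repeated spaces. One possible cleanup, sketched here with a hypothetical helper `clean_text`, relies on `str.split()` with no arguments, which splits on any Unicode whitespace, including `\xa0`.

```python
def clean_text(text):
    """Collapse every run of whitespace (spaces, newlines, non-breaking spaces) into one space."""
    return " ".join(text.split())


sample = "salaries.\xa0Today we\n\nwill   dive in "
print(repr(clean_text(sample)))  # 'salaries. Today we will dive in'
```

This could stand in for the `.strip().replace('\n', ' ')` chain if fully normalized spacing is wanted.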

In [12]:
def parse_blog_articles(url):
    """
    Fetch a single Codeup blog post and return its title, published date,
    first category, and content as a dictionary.
    Used together with the get_blog_articles() function.
    """
    # establish header and request the page
    headers = {'User-Agent':'CodeUp Data Science'}
    response = get(url, headers=headers)
    
    # create soup variable containing response content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # create a dictionary that holds the fields scraped from this url
    output = {}
    output['title'] = soup.find('h1', class_ = 'entry-title').text
    output['published'] = soup.find('span', class_='published').text
    output['category'] = soup.find('a', rel='category tag').text
    output['content'] = soup.find('div', class_='entry-content').text.strip().replace('\n', ' ')
    
    return output

In [13]:
def get_blog_articles(urls):
    """
    This function takes in a list of urls for Codeup blog articles.
    Used with parse_blog_articles, it collects the title, published date,
    category, and content of each article and returns them as a list of dictionaries.
    """
    output = []
    
    for url in urls:
        output.append(parse_blog_articles(url))
        
    return output
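A loop like this stops at the first page that fails to download or parse. The sketch below shows one way to keep going and record the failures instead. The names `collect_articles`, `delay`, and `fake_parse` are hypothetical; the parser is passed in as an argument so the example runs without network access.

```python
import time


def collect_articles(urls, parse, delay=0.0):
    """Apply `parse` to each url; return (articles, failures).

    `failures` is a list of (url, error description) pairs.
    `delay` is an optional pause in seconds between requests.
    """
    articles, failures = [], []
    for url in urls:
        try:
            articles.append(parse(url))
        except Exception as error:  # broad on purpose: one bad page should not end the run
            failures.append((url, repr(error)))
        if delay:
            time.sleep(delay)
    return articles, failures


# Stand-in parser for demonstration only (no network access)
def fake_parse(url):
    if "broken" in url:
        raise ValueError("no entry-title found")
    return {"title": url.upper()}


articles, failures = collect_articles(["post-a", "broken-post", "post-b"], fake_parse)
print(len(articles), len(failures))  # 2 1
```

With the notebook's own function, the call would look like `collect_articles(url_list, parse_blog_articles, delay=1)`.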

In [14]:
# create a list of urls
url_list = [url1, url2, url3, url4, url5]

In [15]:
get_blog_articles(url_list)

[{'title': 'What Jobs Can You Get After a Coding Bootcamp? Part 1: Data Science',
  'published': 'Jul 7, 2022',
  'category': 'Data Science',
  'content': 'If you are interested in embarking on a career in tech, you’re probably wondering what your new job title could be, and even what your salary might look like.*\xa0In this mini-series, we will take each of our programs here at Codeup: Data Science, Web Development, and Cloud Administration, and outline respectively potential job titles, as well as entry-level salaries.\xa0Today we will be diving into our Data Science program, with four potential job titles you could take on! Program Overview\xa0 During this 20-week program, you will have the opportunity to take your career to new heights with data science being one of the most needed jobs in tech. You’ll gather data, then clean it, explore it for trends, and apply machine learning models to make predictions. Upon completing this program, you will know how to turn insights into action

---

### 2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

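- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary represents one article. In this notebook each dictionary holds the category, title, author, date, and content.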

In [16]:
#Define individual news category urls

url1 = 'https://inshorts.com/en/read/business'
url2 = 'https://inshorts.com/en/read/sports'
url3 = 'https://inshorts.com/en/read/technology'
url4 = 'https://inshorts.com/en/read/entertainment'

#make a list of all these urls
urls = [url1, url2, url3, url4]
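Since the four URLs differ only in their last path segment, they could also be built from a list of category names. This is a small sketch; `BASE_URL` and `category_urls` are illustrative names.

```python
# Build one URL per category from a shared prefix
BASE_URL = "https://inshorts.com/en/read/"
categories = ["business", "sports", "technology", "entertainment"]

# Map each category name to its page so the label travels with the URL
category_urls = {category: BASE_URL + category for category in categories}

for category, url in category_urls.items():
    print(category, url)
```

Keeping the category name next to its URL makes it easy to tag every scraped article with the topic it came from.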

### URL 1

In [17]:
# set headers
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url1, headers=headers)
response

<Response [200]>

In [18]:
soup = BeautifulSoup(response.content, 'html.parser')

In [19]:
#Select the card stack, collect every news card in it, and keep the first one
cards = soup.find('div', class_ = 'card-stack')

articles = cards.find_all('div', class_ = 'news-card')
article = articles[0]

In [20]:
#Find the title of the first article on the page
title = article.find('span', itemprop = 'headline').text
title

'Rupee hits record low of 79.97 against US dollar'

In [21]:
#Find the author of the first article
author = article.find('span', class_ = 'author').text
author

'Ridham Gambhir'

In [22]:
#Find the publication date (the keyword `clas` is kept as written: the lookup matched on it, as the output below shows)
published = article.find('span', clas = 'date').text.split(',')[0]
published

'18 Jul 2022'

In [23]:
content = article.find('div', itemprop = 'articleBody').text
content

"The rupee hit a record low of 79.97 against the US dollar on Monday after opening at 79.76. The Finance Ministry, while speaking about the matter said that global factors such as the Russia-Ukraine war, soaring crude oil prices and tightening of global financial conditions are the major reasons for the rupee's weakening."

In [24]:
# Create a function that builds a dictionary out of a single article on an inshorts.com page

def parse_news_article(article, category):
    """
    Take one news-card element and its category name, and return a dictionary
    with the category, title, author, date, and content of the article.
    """
    output = {}

    output['category'] = category
    output['title'] = article.find('span', itemprop = 'headline').text.strip()
    output['author'] = article.find('span', class_ = 'author').text
    # `clas` is kept as written; this lookup returned the date in the cells above
    output['date'] = article.find('span', clas = 'date').text.split(',')[0]
    output['content'] = article.find('div', itemprop = 'articleBody').text

    return output
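Each `.find(...)` call returns `None` when nothing matches, so chaining `.text` onto it raises an `AttributeError` for a card that lacks a field. A small guard, sketched below with a hypothetical helper `text_or_default`, avoids that. The demonstration uses `SimpleNamespace` objects as stand-ins for parsed elements so it runs without downloading a page.

```python
from types import SimpleNamespace


def text_or_default(node, default=None):
    """Return the stripped text of a parsed element, or `default` if the element is missing."""
    if node is None:
        return default
    return node.text.strip()


# Stand-ins for what `article.find(...)` might return
found = SimpleNamespace(text="  Ridham Gambhir\n")
not_found = None

print(text_or_default(found))                 # Ridham Gambhir
print(text_or_default(not_found, "unknown"))  # unknown
```

Inside the parser this would read, for example, `text_or_default(article.find('span', class_='author'))`.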

In [25]:
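def parse_news_page(category):
    """Scrape every news card on the inshorts page for one category."""
    url = 'https://inshorts.com/en/read/' + category
    # send the same User-Agent header used in the cells above
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    # name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(response.text, 'html.parser')

    cards = soup.select('.news-card')
    articles = []

    for card in cards:
        articles.append(parse_news_article(card, category))

    return articles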

In [26]:
parse_news_page('business')

[{'category': 'business',
  'title': 'Rupee hits record low of 79.97 against US dollar',
  'author': 'Ridham Gambhir',
  'date': '18 Jul 2022',
  'content': "The rupee hit a record low of 79.97 against the US dollar on Monday after opening at 79.76. The Finance Ministry, while speaking about the matter said that global factors such as the Russia-Ukraine war, soaring crude oil prices and tightening of global financial conditions are the major reasons for the rupee's weakening."},
 {'category': 'business',
  'title': 'Rupee closes at an all-time low of 79.98 against US dollar',
  'author': 'Ridham Gambhir',
  'date': '18 Jul 2022',
  'content': 'The rupee on Monday hit a fresh record low as it ended closer to the 80-mark to close at 79.98 against the US dollar. This was the seventh consecutive session when the rupee weakened. So far this year, the currency has weakened 7.05% against the US dollar. Meanwhile, BSE Sensex closed 760 points higher at 54,521 on Monday.'},
 {'category': 'busin

In [27]:
# cache the data, and turn it into a dataframe (function)
def get_news_articles(use_cache=True):
    if os.path.exists('news_articles.json') and use_cache:
        return pd.read_json('news_articles.json')

    categories = ['business', 'sports', 'technology', 'entertainment']

    articles = []

    for category in categories:
        print(f'Getting {category} articles')
        articles.extend(parse_news_page(category))

    df = pd.DataFrame(articles)
    df.to_json('news_articles.json', orient='records')
    return df
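The same cache-or-fetch idea can be written without pandas when a plain list of dictionaries is wanted, as the exercise describes. The sketch below uses only the standard library. `load_or_fetch` and `fake_fetch` are hypothetical names, and the fetch step is passed in as a function so the example runs offline.

```python
import json
import os
import tempfile


def load_or_fetch(path, fetch, use_cache=True):
    """Return records from `path` if cached; otherwise call `fetch()` and save the result."""
    if use_cache and os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    records = fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records


# Demonstration with a stand-in fetch function and a temporary directory
def fake_fetch():
    print("fetching...")
    return [{"category": "business", "title": "example"}]


with tempfile.TemporaryDirectory() as folder:
    cache_path = os.path.join(folder, "news_articles.json")
    first = load_or_fetch(cache_path, fake_fetch)   # prints "fetching..." and writes the file
    second = load_or_fetch(cache_path, fake_fetch)  # reads the file, no fetch
    print(first == second)  # True
```

Passing `use_cache=False` forces a fresh scrape and overwrites the cached file.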