# Data Acquisition

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os

import acquire_codeup_blog
import acquire_news_articles

## Codeup Blog Articles

### Scrape the article text from the following pages:

* https://codeup.com/codeups-data-science-career-accelerator-is-here/
* https://codeup.com/data-science-myths/
* https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
* https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
* https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

### Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

`{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}`

#### First, let's find the info we want from just one link; then we can write a loop that grabs it for each article

#### Let's start with the text of the article

In [2]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

In [3]:
article_text = soup.find('div', class_='jupiterx-post-content clearfix').text
article_text

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Students wi
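The raw text above still contains non-breaking spaces (`\xa0`). One possible way to tidy it is sketched below; `clean_text` is a hypothetical helper, not part of the original notebook, and it only normalizes whitespace.

```python
def clean_text(text):
    """Collapse every run of whitespace, including non-breaking spaces, into one space."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which covers non-breaking spaces, tabs, and newlines.
    return ' '.join(text.split())


sample = 'land a job in\xa0Glassdoor’s #1 Best Job in America.'
print(clean_text(sample))
```

This does not restore the spaces that are missing between paragraphs (for example `America.Data`). BeautifulSoup's `get_text` accepts a separator argument that may help with that.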

#### Great, we got the text. Now let's get the title

In [4]:
article_title = soup.find('h1', class_='jupiterx-post-title').text
article_title

'Codeup’s Data Science Career Accelerator is Here!'

### Now that we know how to get each element, let's loop through the articles and get everything into a list of dictionaries

In [5]:
links = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
         'https://codeup.com/data-science-myths/',
         'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
         'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
         'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

articles = []

for link in links:
    url = link
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    article_text = soup.find('div', class_='jupiterx-post-content clearfix').text
    article_title = soup.find('h1', class_='jupiterx-post-title').text
    
    article_dict = {'title': article_title,
                    'content': article_text}
    articles.append(article_dict)

articles[0:2]

[{'title': 'Codeup’s Data Science Career Accelerator is Here!',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, a
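The loop above assumes every page contains both elements. `find` returns `None` when nothing matches, so a page with a different layout would raise an `AttributeError` on `.text`. A minimal sketch of a more defensive version follows; `parse_article` is a hypothetical helper, and the sample HTML is hand-written for illustration rather than real Codeup markup.

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Return a {'title', 'content'} dict for one page, or None if either piece is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1', class_='jupiterx-post-title')
    body = soup.find('div', class_='jupiterx-post-content clearfix')
    # find() returns None when nothing matches, so check before reading .text
    if title is None or body is None:
        return None
    return {'title': title.text, 'content': body.text}


# Tiny hand-written page for illustration only
page = ('<h1 class="jupiterx-post-title">Hello</h1>'
        '<div class="jupiterx-post-content clearfix">Body text</div>')
print(parse_article(page))
print(parse_article('<p>no article here</p>'))
```

Inside the loop, a `None` result could simply be skipped instead of appended.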

### Now let's put all of this into a separate .py file, acquire_codeup_blog.py.

In [6]:
acquire_codeup_blog.get_blog_articles()[0:2]

[{'title': 'Codeup’s Data Science Career Accelerator is Here!',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, a
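The contents of `acquire_codeup_blog` are not shown here, so it is not known whether it caches anything. If each call re-downloads every page, one option is to wrap the call in a small cache. The sketch below assumes a hypothetical helper `load_or_fetch` and a JSON file on disk; neither comes from the original module.

```python
import json
import os


def load_or_fetch(path, fetch):
    """Return cached articles from `path` if it exists; otherwise call `fetch` and cache the result."""
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    articles = fetch()  # fetch is any zero-argument function returning a list of dicts
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(articles, f, ensure_ascii=False)
    return articles
```

With this in place, something like `load_or_fetch('codeup_blog.json', acquire_codeup_blog.get_blog_articles)` would hit the network only on the first run; the file name is just an example.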

## News Articles

### We will now be scraping text data from [inshorts](https://inshorts.com/), a website that provides a brief overview of many different topics.

### Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment

### The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

`{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}`

#### So first we will start with the home page

In [7]:
url = 'https://inshorts.com/en/read'
response = get(url)
soup = BeautifulSoup(response.content, 'html.parser')

### Let's get the headlines for the articles first

In [8]:
headlines = soup.find_all('span', itemprop='headline')
headlines

[<span itemprop="headline">Students who moved due to lockdown can write Board exam from new location: Govt</span>,
 <span itemprop="headline">Cong-ruled states imposed it 1st: Ravi Shankar after Rahul says lockdown failed</span>,
 <span itemprop="headline">COVID-19 could lead to 'lockdown generation' as 1 in 6 youths stop work: ILO</span>,
 <span itemprop="headline">Team India should embrace split coaching: Ex-Australia head coach Darren Lehmann</span>,
 <span itemprop="headline">This is insulting: Manoj Tiwary on KKR not tagging him in IPL 2012 throwback post</span>,
 <span itemprop="headline">Pablo Escobar's brother sues Apple for $2.6 bn over alleged iPhone X hack</span>,
 <span itemprop="headline">Trump threatens to 'close down' social media after Twitter fact-checks his tweets</span>,
 <span itemprop="headline">Who are Doug Hurley &amp; Bob Behnken, NASA astronauts to be launched by SpaceX?</span>,
 <span itemprop="headline">Situation at border with India controllable: China amid 

#### Now if we want to get just one element we can pull it from the list

In [9]:
headlines[0].text

'Students who moved due to lockdown can write Board exam from new location: Govt'

#### Now let's get the content for each article

In [10]:
contents = soup.find_all('div', itemprop='articleBody')
contents[:2]

[<div itemprop="articleBody">CBSE Class 10 and 12 students who have moved to different states or to their home districts due to the lockdown can appear for the pending Board Exams from there, HRD Minister Ramesh Pokhriyal Nishank said. "The CBSE will issue a notification in this regard and modalities for registration of such requests," he added.</div>,
 <div itemprop="articleBody">After Rahul Gandhi said the nationwide lockdown failed, Union Minister Ravi Shankar Prasad said that Congress-ruled states imposed it first. Prasad said, "The first state to announce a lockdown was Punjab followed by Rajasthan. And now Maharashtra and Punjab were the first ones to extend the lockdown till May 31, even before the meeting of chief ministers with the PM."</div>]

#### Again, we can pull out each one individually

In [11]:
contents[0].text

'CBSE Class 10 and 12 students who have moved to different states or to their home districts due to the lockdown can appear for the pending Board Exams from there, HRD Minister Ramesh Pokhriyal Nishank said. "The CBSE will issue a notification in this regard and modalities for registration of such requests," he added.'

### Now we should be able to build a loop that combines each piece

In [12]:
articles = []

# Pair each headline with its content; the key is spelled 'category' to match the required shape
for headline, content in zip(headlines, contents):
    article = {'title': headline.text,
               'content': content.text,
               'category': 'home'}
    articles.append(article)

articles[:2]

[{'title': 'Students who moved due to lockdown can write Board exam from new location: Govt',
  'content': 'CBSE Class 10 and 12 students who have moved to different states or to their home districts due to the lockdown can appear for the pending Board Exams from there, HRD Minister Ramesh Pokhriyal Nishank said. "The CBSE will issue a notification in this regard and modalities for registration of such requests," he added.',
  'category': 'home'},
 {'title': 'Cong-ruled states imposed it 1st: Ravi Shankar after Rahul says lockdown failed',
  'content': 'After Rahul Gandhi said the nationwide lockdown failed, Union Minister Ravi Shankar Prasad said that Congress-ruled states imposed it first. Prasad said, "The first state to announce a lockdown was Punjab followed by Rajasthan. And now Maharashtra and Punjab were the first ones to extend the lockdown till May 31, even before the meeting of chief ministers with the PM."',
  'category': 'home'}]

### Now we need to collect the articles from four different pages, so we will change the url for each page.

In [13]:
pages = ['/business', '/sports', '/technology', '/entertainment']

articles = []

for page in pages:
    url = 'https://inshorts.com/en/read' + page
    response = get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    headlines = soup.find_all('span', itemprop='headline')
    contents = soup.find_all('div', itemprop='articleBody')
    
    # Pair each headline with its content and tag it with the page's category
    for headline, content in zip(headlines, contents):
        article = {'title': headline.text,
                   'content': content.text,
                   'category': page[1:]}  # drop the leading '/'
        articles.append(article)

articles[:2]

[{'title': 'Firm whose stock surged 1000% in 2020 starts human trials of Covid-19 vaccine',
  'content': "US biotech company Novavax said it has started Phase 1 clinical trial of its experimental coronavirus vaccine and has enrolled 130 volunteers. Novavax's valuation has surged by over 1,000% this year to about $2.7 billion despite having no products on the market. The trial is being supported by $388 million in funding from the Coalition for Epidemic Preparedness Innovations (CEPI).",
  'category': 'business'},
 {'title': "India's economic growth seen at 1.2% in Q4 FY20: SBI report",
  'content': "India's economy is estimated to have grown at 1.2% for the quarter ended March 2020, said the SBI's Ecowrap report. It said the economic activity in the last seven days of March was completely suspended due to lockdown, causing an estimated loss of around ₹1.4 lakh crore. Subsequently, it projected the annual GDP growth for 2019-20 to be around 4.2%.",
  'category': 'business'}]
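The loop above stores each page as a path fragment and slices off the leading slash with `page[1:]` to recover the category name. An alternative, sketched here with a hypothetical helper `category_urls`, is to keep bare category names and build the URL from them, so no slicing is needed.

```python
BASE_URL = 'https://inshorts.com/en/read'


def category_urls(categories, base_url=BASE_URL):
    """Map each category name to the URL of its page."""
    return {category: f'{base_url}/{category}' for category in categories}


topics = ['business', 'sports', 'technology', 'entertainment']
for category, url in category_urls(topics).items():
    print(category, url)
```

The scraping loop could then iterate over `category_urls(topics).items()` and use `category` directly as the dictionary value.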

### Now let's put it into a .py file

In [14]:
acquire_news_articles.get_news_articles()[:2]

[{'title': 'Firm whose stock surged 1000% in 2020 starts human trials of Covid-19 vaccine',
  'content': "US biotech company Novavax said it has started Phase 1 clinical trial of its experimental coronavirus vaccine and has enrolled 130 volunteers. Novavax's valuation has surged by over 1,000% this year to about $2.7 billion despite having no products on the market. The trial is being supported by $388 million in funding from the Coalition for Epidemic Preparedness Innovations (CEPI).",
  'catagory': 'business'},
 {'title': "India's economic growth seen at 1.2% in Q4 FY20: SBI report",
  'content': "India's economy is estimated to have grown at 1.2% for the quarter ended March 2020, said the SBI's Ecowrap report. It said the economic activity in the last seven days of March was completely suspended due to lockdown, causing an estimated loss of around ₹1.4 lakh crore. Subsequently, it projected the annual GDP growth for 2019-20 to be around 4.2%.",
  'catagory': 'business'}]

## Bonus:

### Scrape the text of all the articles linked on Codeup's blog page.

In [15]:
url = 'https://codeup.com/resources/#blog'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

links = []

for a in soup.find_all('a', class_='jet-listing-dynamic-link__link', href=True):
    links.append(a['href'])
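Whether the listing page repeats links or uses relative paths is not known from the notebook. If it does, the collected `links` could be cleaned before scraping. The sketch below assumes a hypothetical helper `unique_links`; the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin


def unique_links(hrefs, base_url):
    """Resolve hrefs against base_url and drop repeats, keeping first-seen order."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict keys preserve insertion order, so this de-duplicates stably
    return list(dict.fromkeys(absolute))


hrefs = ['/data-science-myths/',
         'https://codeup.com/data-science-myths/',
         '/data-science-myths/']
print(unique_links(hrefs, 'https://codeup.com/resources/'))
```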

In [16]:
articles = []

for link in links:
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(link, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    article_text = soup.find('div', class_='jupiterx-post-content clearfix').text
    article_title = soup.find('h1', class_='jupiterx-post-title').text

    article_dict = {'title': article_title,
                    'content': article_text}
    
    articles.append(article_dict)

articles[:2]

[{'title': 'From Bootcamp to Bootcamp: Two Military Veterans Discuss Their Transition Into Tech',
  'content': 'Are you a veteran or active-duty military member considering your next steps? Our alumni have been in your boots. In a recent virtual panel, two vets discussed their transition into technology careers with Codeup: Benny Fields III, a retired Air Force Master Sergeant turned Full Stack Web Developer, and Jeffery Roeder, a Navy Intelligence Analyst turned Data Scientist. Whether you’re interested in Data Science or Web Development, here are some key takeaways from the event.\xa0Why Codeup?“The GI Bill was a huge plus, but the icing on the cake was the placement program.” – Benny FieldsAfter retiring from the Air Force, Benny Fields took a job as a technical writer, but he quickly became more interested in the software he was writing about than the writing itself. His friend suggested looking into a coding bootcamp, which he did. He liked that Codeup accepts the GI Bill and the 
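Scraping every linked article sends many requests in a row, and a single failing page would stop the loop above. A minimal sketch of a more forgiving loop is given below. `fetch_all` and its `fetch` argument are hypothetical names, not part of the original notebook; `fetch` stands for any function that turns one link into an article dictionary.

```python
import time


def fetch_all(links, fetch, delay=1.0):
    """Call fetch(link) for each link, pausing between requests and skipping failures."""
    results = []
    for i, link in enumerate(links):
        if i > 0:
            time.sleep(delay)  # space out consecutive requests
        try:
            results.append(fetch(link))
        except Exception as error:  # broad on purpose for a sketch; narrow it in real code
            print(f'skipping {link}: {error}')
    return results


# Stand-in fetcher for illustration; a real one would download and parse the page
def demo_fetch(link):
    if 'bad' in link:
        raise ValueError('could not parse page')
    return {'title': link, 'content': ''}


print(fetch_all(['first', 'bad-link', 'second'], demo_fetch, delay=0))
```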