# Data Acquisition

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os

import acquire_codeup_blog
import acquire_news_articles

## Codeup Blog Articles

### Scrape the article text from the following pages:

* https://codeup.com/codeups-data-science-career-accelerator-is-here/
* https://codeup.com/data-science-myths/
* https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
* https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
* https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

### Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

`{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}`

#### First, let's find the info we want from just one link; then we can write a loop that grabs it for each article

#### Let's start with the text of the article

In [2]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

In [3]:
article_text = soup.find('div', class_='jupiterx-post-content clearfix').text
article_text

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Students wi
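The output above still contains `\xa0` (non-breaking space) characters. A minimal sketch of one way to normalize whitespace before storing the text follows; the helper name `clean_text` is hypothetical and not part of the original notebook.

```python
import re


def clean_text(text):
    # In Python 3, \s in a str pattern matches all Unicode whitespace,
    # including the non-breaking space (\xa0), so one substitution
    # handles newlines, repeated spaces, and \xa0 together.
    return re.sub(r'\s+', ' ', text).strip()


sample = 'land a job in\xa0Glassdoor’s #1 Best Job in America.\n'
print(clean_text(sample))
```

Cleaning could also be left to a later preparation step; the point here is only that the raw `.text` value is not yet tidy.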

#### Great, we got the text. Now let's get the title

In [4]:
article_title = soup.find('h1', class_='jupiterx-post-title').text
article_title

'Codeup’s Data Science Career Accelerator is Here!'
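Both lookups above chain `.text` directly onto `soup.find(...)`. `find` returns `None` when nothing matches, so a page with a different layout would raise an `AttributeError`. The sketch below shows one defensive variant; `extract_text` is a hypothetical helper, and the HTML is a made-up snippet that only mimics the class names used above. Passing `separator=' '` to `get_text` also keeps text from adjacent tags apart, which is probably why the raw output runs sentences together ("America.Data Science").

```python
from bs4 import BeautifulSoup


def extract_text(soup, name, class_name):
    """Return the stripped text of the first matching tag, or None."""
    tag = soup.find(name, class_=class_name)
    if tag is None:
        return None
    # separator=' ' joins the text of child tags with a space
    return tag.get_text(separator=' ', strip=True)


# Made-up HTML for illustration only
html = '''
<h1 class="jupiterx-post-title">Example Title</h1>
<div class="jupiterx-post-content clearfix"><p>First.</p><p>Second.</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(extract_text(soup, 'h1', 'jupiterx-post-title'))
print(extract_text(soup, 'div', 'jupiterx-post-content'))
print(extract_text(soup, 'div', 'no-such-class'))
```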

### Now that we know how to get each element, let's loop through the articles and get everything into a list of dictionaries

In [5]:
links = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
'https://codeup.com/data-science-myths/',
'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

articles = []

# The same headers work for every request, so define them once
headers = {'User-Agent': 'Codeup Data Science'}

for link in links:
    response = get(link, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Pull the body text and the title with the selectors found above
    article_text = soup.find('div', class_='jupiterx-post-content clearfix').text
    article_title = soup.find('h1', class_='jupiterx-post-title').text

    articles.append({'title': article_title,
                     'content': article_text})

articles[0:2]

[{'title': 'Codeup’s Data Science Career Accelerator is Here!',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, a

### Now let's put this all into another .py file.

In [6]:
acquire_codeup_blog.get_blog_articles()[0:2]

[{'title': 'Codeup’s Data Science Career Accelerator is Here!',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, a
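The source of `acquire_codeup_blog` is not shown in this notebook. The following is only a sketch of how such a module could be organized, assuming the same selectors as above. The names `fetch_html` and `parse_article`, the `urls` and `fetch` parameters, and the 10-second timeout are all assumptions; the real `get_blog_articles` is called without arguments.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Codeup Data Science'}


def fetch_html(url):
    """Download a page and return its HTML (needs network access)."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_article(html):
    """Turn one blog page into a {'title': ..., 'content': ...} dict."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1', class_='jupiterx-post-title').text
    content = soup.find('div', class_='jupiterx-post-content clearfix').text
    return {'title': title, 'content': content}


def get_blog_articles(urls, fetch=fetch_html):
    """Return a list of article dictionaries, one per URL."""
    return [parse_article(fetch(url)) for url in urls]
```

Keeping the download and the parsing in separate functions means the parsing step can be checked against a saved HTML string without touching the network.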

## News Articles

### We will now be scraping text data from [inshorts](https://inshorts.com/), a website that provides a brief overview of many different topics.

### Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment

### The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

`{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}`

#### So first we will start with the home page

In [7]:
url = 'https://inshorts.com/en/read'
response = get(url)
soup = BeautifulSoup(response.content, 'html.parser')

### Let's get the headlines for the articles first

In [8]:
headlines = soup.find_all('span', itemprop='headline')
headlines

[<span itemprop="headline">Ravi Mohan Saini, who won ₹1 cr in KBC Junior at 14, becomes SP of Porbandar</span>,
 <span itemprop="headline">BookMyShow fires, sends staff on leave without pay; 270 employees impacted</span>,
 <span itemprop="headline">Karnataka government bans flights from 5 states amid rising coronavirus cases</span>,
 <span itemprop="headline">36-day-old baby recovers from COVID-19 in Mumbai; CMO shares video</span>,
 <span itemprop="headline">23-yr-old eats 40 rotis, 10-plate rice at quarantine centre; cook says he's tired</span>,
 <span itemprop="headline">Coronavirus vaccines being developed by 30 groups in India, 20 in good pace: Govt</span>,
 <span itemprop="headline">This will be a Big Day for Social Media and Fairness: Trump</span>,
 <span itemprop="headline">Delhi govt issues advisory to prevent probable locust attack</span>,
 <span itemprop="headline">BJP spokesperson Sambit Patra hospitalised with coronavirus symptoms: Reports</span>,
 <span itemprop="headline

#### Now, if we want just one element, we can pull it from the list by index

In [9]:
headlines[0].text

'Ravi Mohan Saini, who won ₹1 cr in KBC Junior at 14, becomes SP of Porbandar'

#### Now let's get the content for each article

In [10]:
contents = soup.find_all('div', itemprop='articleBody')
contents[:2]

[<div itemprop="articleBody">IPS officer Ravi Mohan Saini, who won ₹1 crore in KBC Junior when he was 14 years old, took charge as Superintendent of Police, Porbandar, Gujarat on Tuesday. Saini, who is now 33 years old, qualified for Indian Police Service in 2014 with AIR 461. A native of Rajasthan's Alwar, Saini is the son of a retired Navy officer.</div>,
 <div itemprop="articleBody">BookMyShow's Founder and CEO Ashish Hemrajani in an e-mail to employees said the company is firing and furloughing (leave without pay) 270 employees. "To those leaving us, I'm truly sorry for this decision," Hemrajani said. Fired employees will get severance equivalent to a minimum of 2 months' salary irrespective of their tenure or as per notice period, whichever is higher.</div>]

#### Again, we can pull out each one individually

In [11]:
contents[0].text

"IPS officer Ravi Mohan Saini, who won ₹1 crore in KBC Junior when he was 14 years old, took charge as Superintendent of Police, Porbandar, Gujarat on Tuesday. Saini, who is now 33 years old, qualified for Indian Police Service in 2014 with AIR 461. A native of Rajasthan's Alwar, Saini is the son of a retired Navy officer."

### Now we should be able to build a loop that combines each piece

In [12]:
articles = []

# Pair each headline with the article body at the same position
for headline, content in zip(headlines, contents):
    article = {'title': headline.text,
               'content': content.text,
               'category': 'home'}
    articles.append(article)

articles[:2]

[{'title': 'Ravi Mohan Saini, who won ₹1 cr in KBC Junior at 14, becomes SP of Porbandar',
  'content': "IPS officer Ravi Mohan Saini, who won ₹1 crore in KBC Junior when he was 14 years old, took charge as Superintendent of Police, Porbandar, Gujarat on Tuesday. Saini, who is now 33 years old, qualified for Indian Police Service in 2014 with AIR 461. A native of Rajasthan's Alwar, Saini is the son of a retired Navy officer.",
  'category': 'home'},
 {'title': 'BookMyShow fires, sends staff on leave without pay; 270 employees impacted',
  'content': 'BookMyShow\'s Founder and CEO Ashish Hemrajani in an e-mail to employees said the company is firing and furloughing (leave without pay) 270 employees. "To those leaving us, I\'m truly sorry for this decision," Hemrajani said. Fired employees will get severance equivalent to a minimum of 2 months\' salary irrespective of their tenure or as per notice period, whichever is higher.',
  'category': 'home'}]
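Pairing headlines and bodies by position assumes both lists have the same length. A minimal sketch that checks this assumption explicitly is shown below; `build_articles` is a hypothetical helper, and plain strings stand in for the `.text` of each scraped tag.

```python
def build_articles(titles, bodies, category):
    """Pair titles with bodies and tag every article with a category."""
    if len(titles) != len(bodies):
        raise ValueError(
            f'{len(titles)} titles but {len(bodies)} bodies; '
            'the page layout may have changed'
        )
    return [
        {'title': title, 'content': body, 'category': category}
        for title, body in zip(titles, bodies)
    ]


# Plain strings stand in for the text of each scraped tag
titles = ['Headline one', 'Headline two']
bodies = ['Body one', 'Body two']
print(build_articles(titles, bodies, 'home'))
```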

### Now we have to collect the articles from four different category pages, so we will change the URL for each page.

In [13]:
pages = ['/business', '/sports', '/technology','/entertainment']

articles = []

for page in pages:
    url = 'https://inshorts.com/en/read' + page
    response = get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    headlines = soup.find_all('span', itemprop='headline')
    contents = soup.find_all('div', itemprop='articleBody')

    # page[1:] drops the leading slash, e.g. '/business' -> 'business'
    for headline, content in zip(headlines, contents):
        article = {'title': headline.text,
                   'content': content.text,
                   'category': page[1:]}
        articles.append(article)

articles[:2]

[{'title': 'Twitter CEO donates $10M to project giving $1,000 cash to COVID-19 hit families',
  'content': "Twitter's billionaire CEO Jack Dorsey has donated $10 million to Project 100 which will give $1,000 in cash to American families who have been affected by the COVID-19 pandemic. Other donors to Project 100 include Alphabet and Google CEO Sundar Pichai, Microsoft Co-founder Bill Gates and others. Dorsey also donated $10 million this month to help US prison fight COVID-19.",
  'category': 'business'},
 {'title': "US firm buys Serum Institute parent's Czech unit to make Covid-19 vaccine",
  'content': "US biotech firm Novavax has announced it's buying Czech Republic-based Praha Vaccines, a unit of Cyrus Poonawalla Group, which also owns Serum Institute of India, for ₹1,270 crore. The facility is expected to provide an annual capacity of over 1 billion doses of antigen starting in 2021 for Novavax's Covid-19 vaccine candidate. Novavax currently has no product on the market.",
  'cata
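The loop above can be wrapped into a function that matches the requested shape, using the `'category'` key from the specification. This is a sketch under assumptions, not the contents of `acquire_news_articles` (which is not shown): `fetch_html`, `parse_news`, the `categories` and `fetch` parameters, and the timeout are all hypothetical.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://inshorts.com/en/read'
CATEGORIES = ('business', 'sports', 'technology', 'entertainment')


def fetch_html(url):
    """Download a page and return its HTML (needs network access)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_news(html, category):
    """Extract every article on one category page."""
    soup = BeautifulSoup(html, 'html.parser')
    headlines = soup.find_all('span', itemprop='headline')
    contents = soup.find_all('div', itemprop='articleBody')
    return [
        {'title': h.text, 'content': c.text, 'category': category}
        for h, c in zip(headlines, contents)
    ]


def get_news_articles(categories=CATEGORIES, fetch=fetch_html):
    """Return a list of article dictionaries across all categories."""
    articles = []
    for category in categories:
        articles.extend(parse_news(fetch(f'{BASE_URL}/{category}'), category))
    return articles
```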

### Now let's put it into a .py file

In [14]:
acquire_news_articles.get_news_articles()

Unnamed: 0.1,Unnamed: 0,title,content,catagory
0,0,"Twitter CEO donates $10M to project giving $1,...",Twitter's billionaire CEO Jack Dorsey has dona...,business
1,1,US firm buys Serum Institute parent's Czech un...,US biotech firm Novavax has announced it's buy...,business
2,2,25-year-old Anant Ambani joins $65 billion Jio...,Asia's richest person Mukesh Ambani's 25-year-...,business
3,3,Google in talks to buy 5% stake in Vodafone Id...,Google is exploring an investment in Vodafone ...,business
4,4,Microsoft in talks to buy 2.5% stake in Jio fo...,Microsoft is in talks with Mukesh Ambani-led R...,business
...,...,...,...,...
95,95,Digital rights of 'Laxmmi Bomb' sold for ₹125 ...,The makers of Akshay Kumar's 'Laxmmi Bomb' hav...,entertainment
96,96,"Neither good nor bad times last, even coronavi...","Commenting on COVID-19 pandemic, singer Asha B...",entertainment
97,97,Complaint against cinematographer Shyam K Naid...,A complaint of cheating has been filed against...,entertainment
98,98,SRK's KKR announces relief packages for cities...,Shah Rukh Khan's IPL team Kolkata Knight Rider...,entertainment
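The `Unnamed: 0` and `Unnamed: 0.1` columns in the output above look like leftover index columns. pandas typically produces them when a DataFrame is saved to CSV together with its index and then read back without `index_col`; since the module's source is not shown, this is a guess about the cause. If that is what happened, passing `index=False` to `to_csv` would avoid it. As an alternative, the sketch below caches a list of dictionaries as JSON, which round-trips without extra columns. The helper `cached` and any file name passed to it are hypothetical.

```python
import json
import os


def cached(path, scrape):
    """Return articles from `path` if it exists, else scrape and save them.

    `scrape` is any zero-argument function returning a list of dicts.
    """
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    articles = scrape()
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps characters such as ₹ readable in the file
        json.dump(articles, f, ensure_ascii=False, indent=2)
    return articles
```

The first call runs the scraper and writes the file; later calls read the file and skip the network entirely.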


## Bonus:

Scrape the text of all the articles linked on Codeup's blog page.

In [15]:
url = 'https://codeup.com/resources/#blog'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

links = []

for a in soup.find_all('a', class_='jet-listing-dynamic-link__link', href=True):
    links.append(a['href'])

links[:2]

['https://codeup.com/bootcamp-to-bootcamp/',
 'https://codeup.com/how-to-get-started-on-a-programming-exercise/']
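A listing page can link to the same article more than once or use relative `href` values; whether this page does either has not been checked here. A small sketch of normalizing the collected links before scraping them is given below, with a hypothetical helper `normalize_links` and a made-up relative link for illustration.

```python
from urllib.parse import urljoin


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop duplicates, keeping order."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict keys preserve insertion order, so this de-duplicates stably
    return list(dict.fromkeys(absolute))


hrefs = [
    'https://codeup.com/bootcamp-to-bootcamp/',
    '/how-to-get-started-on-a-programming-exercise/',  # made-up relative form
    'https://codeup.com/bootcamp-to-bootcamp/',        # duplicate
]
print(normalize_links('https://codeup.com/resources/', hrefs))
```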

In [16]:
articles = []

# The same headers work for every request, so define them once
headers = {'User-Agent': 'Codeup Data Science'}

for link in links:
    response = get(link, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Same selectors as for the individual articles earlier
    article_text = soup.find('div', class_='jupiterx-post-content clearfix').text
    article_title = soup.find('h1', class_='jupiterx-post-title').text

    articles.append({'title': article_title,
                     'content': article_text})

articles[:2]

[{'title': 'From Bootcamp to Bootcamp: Two Military Veterans Discuss Their Transition Into Tech',
  'content': 'Are you a veteran or active-duty military member considering your next steps? Our alumni have been in your boots. In a recent virtual panel, two vets discussed their transition into technology careers with Codeup: Benny Fields III, a retired Air Force Master Sergeant turned Full Stack Web Developer, and Jeffery Roeder, a Navy Intelligence Analyst turned Data Scientist. Whether you’re interested in Data Science or Web Development, here are some key takeaways from the event.\xa0Why Codeup?“The GI Bill was a huge plus, but the icing on the cake was the placement program.” – Benny FieldsAfter retiring from the Air Force, Benny Fields took a job as a technical writer, but he quickly became more interested in the software he was writing about than the writing itself. His friend suggested looking into a coding bootcamp, which he did. He liked that Codeup accepts the GI Bill and the 
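The loop above sends one request per link with no pause, and a single failed request or a page with a different layout stops the whole run. The sketch below shows one way to space out requests and collect failures instead of stopping. `scrape_all`, its `fetch_article` parameter, and the one-second default delay are assumptions made for illustration.

```python
import time


def scrape_all(links, fetch_article, delay=1.0):
    """Scrape each link, pausing between requests and skipping failures.

    fetch_article is any callable that takes a URL and returns an
    article dictionary (hypothetical; not defined in the notebook).
    """
    articles, failed = [], []
    for i, link in enumerate(links):
        if i > 0:
            time.sleep(delay)  # space out the requests
        try:
            articles.append(fetch_article(link))
        except Exception as error:  # e.g. a network error or a missing tag
            failed.append((link, repr(error)))
    return articles, failed
```

Returning the failed links alongside the articles makes it easy to inspect or retry them afterwards.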

In [17]:
acquire_codeup_blog.get_all_blog_articles()

Unnamed: 0.1,Unnamed: 0,title,content
0,0,From Bootcamp to Bootcamp: Two Military Vetera...,Are you a veteran or active-duty military memb...
1,1,How to Get Started On Any Programming Exercise,Programming is hard. Whether you’re just begin...
2,2,The Best Path to a Career in Data Science,"In our blog, “The Best Path To A Career In Sof..."
3,3,Getting Hired in a Remote Environment,As a career accelerator with a tuition refund ...
4,4,The Remote Codeup Student Experience,Communities across Texas have now lived in a r...
...,...,...,...
94,94,Press Release: Free Learn to Code Bootcamp for...,Press Release: Free Learn to Code Bootcamp for...
95,95,What The SA Tech Job Fair Says About San Antonio,What The SA Tech Job Fair Says About San Anton...
96,96,Why Choose Codeup?,Why Choose Codeup?Prospective students sometim...
97,97,Use Your Texas Unemployment Benefits at Codeup,Use Your Texas Unemployment Benefits at Codeup...
