# Data Acquisition

## Imports

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os
import pandas as pd

## Codeup Blog Articles

Visit Codeup's Blog and record the URLs for at least 5 distinct blog posts. For each post, scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that returns a list of dictionaries, with each dictionary representing one article. Each dictionary should have this shape:

```python
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
```

You may also add any other properties you think might be helpful.

In [2]:
url = 'https://codeup.edu/blog/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

In [3]:
soup = BeautifulSoup(response.content, 'html.parser')

### Selecting `class='more-link'` is one way to reach each article link

In [4]:
soup.select('.more-link')  # same result here as soup.find_all('a', class_='more-link')

[<a class="more-link" href="https://codeup.edu/featured/apida-heritage-month/">read more</a>,
 <a class="more-link" href="https://codeup.edu/featured/women-in-tech-panelist-spotlight/">read more</a>,
 <a class="more-link" href="https://codeup.edu/featured/women-in-tech-rachel-robbins-mayhill/">read more</a>,
 <a class="more-link" href="https://codeup.edu/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/">read more</a>,
 <a class="more-link" href="https://codeup.edu/events/women-in-tech-madeleine/">read more</a>,
 <a class="more-link" href="https://codeup.edu/codeup-news/panelist-spotlight-4/">read more</a>]

In [5]:
soup.select('.more-link')[0]

<a class="more-link" href="https://codeup.edu/featured/apida-heritage-month/">read more</a>

In [6]:
soup.select('.more-link')[0]['href']

'https://codeup.edu/featured/apida-heritage-month/'

### Use a list comprehension to extract all the links

In [7]:
links = [link['href'] for link in soup.select('.more-link')]
links

['https://codeup.edu/featured/apida-heritage-month/',
 'https://codeup.edu/featured/women-in-tech-panelist-spotlight/',
 'https://codeup.edu/featured/women-in-tech-rachel-robbins-mayhill/',
 'https://codeup.edu/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/',
 'https://codeup.edu/events/women-in-tech-madeleine/',
 'https://codeup.edu/codeup-news/panelist-spotlight-4/']
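
The comment in cell 4 mentions `find_all` as an alternative to `select`. A small offline sketch of how the two compare is shown here. The HTML snippet and the `example.com` URLs are made up for illustration and are not taken from the Codeup page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a blog index page
html = '''
<div>
  <a class="more-link" href="https://example.com/post-1/">read more</a>
  <a class="other" href="https://example.com/about/">about</a>
  <a class="more-link" href="https://example.com/post-2/">read more</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# CSS selector: any tag carrying the class "more-link"
links_css = [a['href'] for a in soup.select('.more-link')]

# find_all: only <a> tags carrying the class "more-link"
links_find = [a['href'] for a in soup.find_all('a', class_='more-link')]

print(links_css == links_find)
```

The two calls differ only when a non-`<a>` tag also carries the class: `.more-link` would match it, while `find_all('a', ...)` would not.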

### Get title and content from article

In [8]:
url = links[0]
response = get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [9]:
soup.find('h1', class_='entry-title').text

'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa'

In [10]:
soup.find('div', class_='entry-content').text.strip()

'May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.\n\nIn an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers.\nArbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen.\nAt Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-American individuals. Hence, we will now use the term Asian Pacific Islander Desi American (APIDA).\nHere is how the rest

### Put it together

In [11]:
url = 'https://codeup.edu/blog/'
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')

links = [link['href'] for link in soup.select('.more-link')]

articles = []

for url in links:
    
    # Fetch and parse the article page; name the parser explicitly to avoid a warning
    url_response = get(url, headers=headers)
    soup = BeautifulSoup(url_response.text, 'html.parser')
    
    title = soup.find('h1', class_='entry-title').text
    content = soup.find('div', class_='entry-content').text.strip()
    
    article_dict = {
        'title': title,
        'content': content
    }
    
    articles.append(article_dict)
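
The exercise asks for a function named `get_blog_articles`, which the cells here do not define. A minimal sketch of one way to wrap the loop above is shown here. The helper names `fetch_html` and `parse_blog_article`, the `fetch` parameter, the 10-second timeout, and the `None` fallback for missing elements are assumptions added for illustration; the tag and class names are the ones used in the cells above.

```python
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Codeup Data Science'}


def fetch_html(url):
    """Download a page and return its HTML text (needs network access)."""
    # Imported here so the parsing code can be exercised without requests
    from requests import get
    response = get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_blog_article(html):
    """Extract the title and content from one article page."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1', class_='entry-title')
    content = soup.find('div', class_='entry-content')
    # find() returns None when nothing matches, so guard before reading .text
    return {
        'title': title.text.strip() if title else None,
        'content': content.text.strip() if content else None,
    }


def get_blog_articles(urls, fetch=fetch_html):
    """Return a list of dictionaries, one per article URL."""
    return [parse_blog_article(fetch(url)) for url in urls]
```

With the `links` list built earlier, `get_blog_articles(links)` would return the same shape as `articles`. Passing a different `fetch` callable makes it possible to try the parser on saved HTML without hitting the site.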

In [12]:
articles[0:5]

[{'title': 'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa',
  'content': 'May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.\n\nIn an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers.\nArbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen.\nAt Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-America

### Put in df

In [13]:
blog_article_df = pd.DataFrame(articles)
blog_article_df

| | title | content |
|---|---|---|
| 0 | Spotlight on APIDA Voices: Celebrating Heritag... | May is traditionally known as Asian American a... |
| 1 | Women in tech: Panelist Spotlight – Magdalena ... | Women in tech: Panelist Spotlight – Magdalena ... |
| 2 | Women in tech: Panelist Spotlight – Rachel Rob... | Women in tech: Panelist Spotlight – Rachel Rob... |
| 3 | Women in Tech: Panelist Spotlight – Sarah Mellor | Women in tech: Panelist Spotlight – Sarah Mell... |
| 4 | Women in Tech: Panelist Spotlight – Madeleine ... | Women in tech: Panelist Spotlight – Madeleine ... |
| 5 | Black Excellence in Tech: Panelist Spotlight –... | Black excellence in tech: Panelist Spotlight –... |


In [14]:
blog_article_df.to_csv('blog_articles.csv', index=False)
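
Re-running the scrape on every notebook execution is slow and sends repeated requests to the site. A minimal sketch of a file cache is shown here, assuming the articles are a list of flat dictionaries. The function name `load_or_scrape` and its parameters are hypothetical. It uses the standard-library `csv` module rather than pandas so that it returns a list of dictionaries, and the file it writes should still be readable with `pd.read_csv`.

```python
import csv
import os


def load_or_scrape(path, scrape, fieldnames=('title', 'content')):
    """Return cached articles from `path`, or call `scrape()` and cache the result.

    `scrape` is any zero-argument callable returning a list of dictionaries
    whose keys match `fieldnames`.
    """
    if os.path.exists(path):
        # newline='' lets the csv module handle line breaks inside article text
        with open(path, newline='', encoding='utf-8') as f:
            return list(csv.DictReader(f))

    articles = scrape()
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(articles)
    return articles
```

Deleting the file forces a fresh scrape on the next call. Values read back from the cache are always strings, which is fine for title and content text.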

## News Articles

We will now scrape text data from inshorts, a website that publishes brief summaries of news across many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment

The end product should be a function named `get_news_articles` that returns a list of dictionaries, where each dictionary has this shape:

```python
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

In [15]:
url = 'https://inshorts.com/en/read'
response = get(url)
soup = BeautifulSoup(response.content, 'html.parser')

### Get title

In [16]:
soup.find_all('span', itemprop='headline')[0].text

'Supreme Court verdict on same-sex marriage likely tomorrow'

### Get content

In [17]:
soup.find_all('div', itemprop='articleBody')[0].text

'The Supreme Court on Tuesday will likely deliver its verdict on whether same-sex marriages should be legally recognised in India, reports said. In May, the court reserved its verdict on petitions made by several LGBTQIA+ couples in the matter. The petitions argued that marriage brings with it several rights, privileges, and obligations that are "bestowed and protected by the law".'

### Put it together

In [18]:
categories = ['business', 'sports', 'technology', 'entertainment']

inshorts = []

for category in categories:
    # Each category has its own page under /en/read/
    url = f'https://inshorts.com/en/read/{category}'
    response = get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    titles = [span.text for span in soup.find_all('span', itemprop='headline')]
    contents = [div.text for div in soup.find_all('div', itemprop='articleBody')]

    # Pair each headline with its body; zip stops at the shorter list,
    # so a page with a missing body cannot raise an IndexError
    for title, content in zip(titles, contents):
        inshorts.append({
            'title': title,
            'content': content,
            'category': category,
        })

In [19]:
inshorts[0:5]

[{'title': 'Some working overtime to harm us: Adani Group amid Mahua Moitra bribery allegations',
  'content': 'Adani Group has reacted amid allegations of TMC MP Mahua Moitra \'accepting bribes from Hiranandani Group\' for asking questions \'targeting\' Gautam Adani in Parliament. "This development corroborates our [earlier] statement...that some groups and individuals have been working overtime to harm our name, goodwill and market standing," said an Adani spokesperson. Moitra has denied the allegations.',
  'category': 'business'},
 {'title': 'Oil prices steady above $90 as investors assess Israel-Hamas war',
  'content': 'Brent oil prices steadied above $90 (over ₹7,492) per barrel on Monday as investors are trying to assess the impact of the ongoing Israel-Hamas war. This comes amid concerns of potential escalation involving Iran. Palestinian militant group Hamas had launched rockets at Israel in a surprise attack earlier this month. ',
  'category': 'business'},
 {'title': "SC re

In [20]:
inshorts_article_df = pd.DataFrame(inshorts)
inshorts_article_df

| | title | content | category |
|---|---|---|---|
| 0 | Some working overtime to harm us: Adani Group ... | Adani Group has reacted amid allegations of TM... | business |
| 1 | Oil prices steady above $90 as investors asses... | Brent oil prices steadied above $90 (over ₹7,4... | business |
| 2 | SC rejects telcos' plea to see licence fee as ... | The Supreme Court on Monday rejected a request... | business |
| 3 | SpiceJet stock dip amid 'Gangwal not intereste... | SpiceJet's shares tanked 11% on Monday after a... | business |
| 4 | HDFC Bank's Q2 profit jumps 50% to ₹15,976 crore | HDFC Bank on Monday reported a net profit of o... | business |
| 5 | What is the TCS bribes-for-jobs scandal? | The bribes-for-jobs scandal at Tata Consultanc... | business |
| 6 | Former Bank of China Chairman Liu arrested ove... | Liu Liange, who resigned as the Chairman of Ba... | business |
| 7 | BioNTech warns of write-off of up to ₹7,888 cr... | Germany's BioNTech flagged write-downs of up t... | business |
| 8 | Rupee hits 1-year low of 83.28 against US dollar | The Indian Rupee hit a one-year low of 83.28 a... | business |
| 9 | Activision Blizzard CEO to leave firm with $40... | Activision Blizzard CEO Bobby Kotick will leav... | business |


In [21]:
inshorts_article_df.to_csv('news_articles.csv', index=False)
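
The exercise asks for a function named `get_news_articles`, which the cells above do not define. A minimal sketch of one way to wrap the category loop is shown here. The helper names `fetch_html` and `parse_news_page`, the `fetch` parameter, and the 10-second timeout are assumptions added for illustration; the URL pattern and `itemprop` values are the ones used in the cells above.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://inshorts.com/en/read'
CATEGORIES = ('business', 'sports', 'technology', 'entertainment')


def fetch_html(url):
    """Download a page and return its HTML text (needs network access)."""
    # Imported here so the parsing code can be exercised without requests
    from requests import get
    response = get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_news_page(html, category):
    """Return one dictionary per article found on a category page."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = [span.text for span in soup.find_all('span', itemprop='headline')]
    contents = [div.text for div in soup.find_all('div', itemprop='articleBody')]
    # zip pairs headlines with bodies and stops at the shorter list
    return [
        {'title': title, 'content': content, 'category': category}
        for title, content in zip(titles, contents)
    ]


def get_news_articles(categories=CATEGORIES, fetch=fetch_html):
    """Scrape every category page and return a flat list of article dictionaries."""
    articles = []
    for category in categories:
        html = fetch(f'{BASE_URL}/{category}')
        articles.extend(parse_news_page(html, category))
    return articles
```

Calling `pd.DataFrame(get_news_articles())` would then produce a frame with the same three columns as `inshorts_article_df`.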