# Exercises

In [1]:
# imports
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Codeup Blog Articles

### Visit Codeup's Blog and record the urls for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

### Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}

In [2]:
soup = BeautifulSoup((requests.get('https://codeup.edu/blog/', 
                                   headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0"})).text, 'html.parser')

In [3]:
soup.select('.more-link')

[<a class="more-link" href="https://codeup.edu/featured/apida-heritage-month/">read more</a>,
 <a class="more-link" href="https://codeup.edu/featured/women-in-tech-panelist-spotlight/">read more</a>,
 <a class="more-link" href="https://codeup.edu/featured/women-in-tech-rachel-robbins-mayhill/">read more</a>,
 <a class="more-link" href="https://codeup.edu/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/">read more</a>,
 <a class="more-link" href="https://codeup.edu/events/women-in-tech-madeleine/">read more</a>,
 <a class="more-link" href="https://codeup.edu/codeup-news/panelist-spotlight-4/">read more</a>]

In [4]:
blog_links = [element['href'] for element in soup.find_all('a', class_='more-link')]
blog_links

['https://codeup.edu/featured/apida-heritage-month/',
 'https://codeup.edu/featured/women-in-tech-panelist-spotlight/',
 'https://codeup.edu/featured/women-in-tech-rachel-robbins-mayhill/',
 'https://codeup.edu/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/',
 'https://codeup.edu/events/women-in-tech-madeleine/',
 'https://codeup.edu/codeup-news/panelist-spotlight-4/']

In [10]:
# Browser-like User-Agent sent with every request
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0"}
# Collect the "read more" links from the blog landing page
soup = BeautifulSoup(requests.get('https://codeup.edu/blog/', headers=headers).text, 'html.parser')
blog_links = [element['href'] for element in soup.find_all('a', class_='more-link')]
def get_blog_articles():
    # Scrape the title and body text of every post in blog_links.
    # Note: returns a DataFrame (one row per article), not a list of dictionaries.
    all_blogs = []
    for link in blog_links:
        soup = BeautifulSoup(requests.get(link, headers=headers).text, 'html.parser')
        title = soup.find('h1', class_='entry-title').text
        # Strip surrounding whitespace and collapse the body onto a single line
        body = soup.find('div', class_='entry-content').text.strip().replace('\n', ' ')
        row = {'title': title, 'article': body}
        all_blogs.append(row)
    return pd.DataFrame(all_blogs)

In [11]:
articles = get_blog_articles()
articles

| | title | article |
|---|---|---|
| 0 | Spotlight on APIDA Voices: Celebrating Heritag... | May is traditionally known as Asian American a... |
| 1 | Women in tech: Panelist Spotlight – Magdalena ... | Women in tech: Panelist Spotlight – Magdalena ... |
| 2 | Women in tech: Panelist Spotlight – Rachel Rob... | Women in tech: Panelist Spotlight – Rachel Rob... |
| 3 | Women in Tech: Panelist Spotlight – Sarah Mellor | Women in tech: Panelist Spotlight – Sarah Mell... |
| 4 | Women in Tech: Panelist Spotlight – Madeleine ... | Women in tech: Panelist Spotlight – Madeleine ... |
| 5 | Black Excellence in Tech: Panelist Spotlight –... | Black excellence in tech: Panelist Spotlight –... |
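
The exercise asks for a list of dictionaries keyed by `title` and `content`, while the function above returns a DataFrame with an `article` column. One possible way to bridge the two is to rename the column and call `to_dict('records')`. The sketch below runs on a small stand-in frame; the `sample` rows and the `to_records` helper are placeholders for illustration, not scraped data or part of the original notebook.

```python
import pandas as pd

# Placeholder rows standing in for the DataFrame returned by get_blog_articles()
sample = pd.DataFrame([
    {'title': 'First post', 'article': 'Body of the first post'},
    {'title': 'Second post', 'article': 'Body of the second post'},
])

def to_records(df):
    # Hypothetical helper: rename 'article' to 'content',
    # then emit one dictionary per row
    return df.rename(columns={'article': 'content'}).to_dict('records')

records = to_records(sample)
```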


## News Articles

### We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

### Write a function that scrapes the news articles for the following topics:
- Business
- Sports
- Technology
- Entertainment

### The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

Hints:
- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
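
The three hints above can be sketched as three small functions. This is one possible layout, not the only one: the `itemprop` selectors are the ones found during the exploration further down, while the helper names `parse_article`, `parse_page`, and `fetch_html`, the `fetch` parameter, and the assumption that headlines and bodies appear in the same order on the page are all illustrative choices.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://inshorts.com/en/read/'

def parse_article(headline_tag, body_tag, category):
    # Step 1: turn one headline/body pair into a dictionary
    return {
        'title': headline_tag.text.strip(),
        'content': body_tag.text.strip(),
        'category': category,
    }

def parse_page(html, category):
    # Step 2: find every article on one page and hand each to parse_article
    soup = BeautifulSoup(html, 'html.parser')
    headlines = soup.find_all('span', itemprop='headline')
    bodies = soup.find_all('div', itemprop='articleBody')
    # Assumption: headlines and bodies are listed in matching order
    return [parse_article(h, b, category) for h, b in zip(headlines, bodies)]

def fetch_html(url):
    # Download one page and return its HTML text
    return requests.get(url).text

def get_news_articles(categories=('business', 'sports', 'technology', 'entertainment'),
                      fetch=fetch_html):
    # Step 3: scrape every category page and combine the results
    articles = []
    for category in categories:
        articles.extend(parse_page(fetch(BASE_URL + category), category))
    return articles
```

Passing `fetch` as a parameter keeps the parsing logic testable against a saved HTML string without touching the network.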

#### Inshorts

In [7]:
base_url = 'https://inshorts.com/en/read/'
categories = [
    'business',
    'entertainment',
    'technology',
    'sports'
]

In [8]:
# Confirm that each category page responds before scraping it
for cat in categories:
    print(requests.get(base_url + cat))

<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>

In [9]:
test_soup = BeautifulSoup(requests.get(base_url+categories[0]).text, 'html.parser')

In [10]:
test_soup.find_all('div', itemprop='articleBody')[0].text

'Victoria\'s Secret ex-CEO Leslie Wexner\'s foundation announced it\'s cutting its "financial and programmatic" ties with Harvard. "We are stunned and sickened at the dismal failure of Harvard\'s leadership to take a clear...stand against the barbaric murders of innocent Israeli civilians," The Wexner Foundation said. Harvard\'s leaders were "tiptoeing" over Hamas\' attacks against Israel, it added.'

In [11]:
test_soup.find_all('span', itemprop='headline')[0].text

"Victoria's Secret ex-CEO cuts Harvard ties for not backing Israel"

In [6]:
def get_news_articles():
    # Scrape headline and body text for each inshorts category into one DataFrame
    base_url = 'https://inshorts.com/en/read/'
    categories = [
        'business',
        'entertainment',
        'technology',
        'sports'
    ]
    frames = []
    for category in categories:
        category_url = base_url + category
        cont = requests.get(category_url).text
        soup = BeautifulSoup(cont, 'html.parser')
        title = [element.text for element in soup.find_all('span', itemprop='headline')]
        body = [element.text for element in soup.find_all('div', itemprop='articleBody')]
        # The scalar category is broadcast to every row of this page's frame
        frames.append(pd.DataFrame({'title': title, 'body': body, 'category': category}))
    # Concatenate once at the end instead of growing a DataFrame inside the loop
    return pd.concat(frames, axis=0, ignore_index=True)

In [7]:
news = get_news_articles()
news

| | title | body | category |
|---|---|---|---|
| 0 | Victoria's Secret ex-CEO cuts Harvard ties for... | Victoria's Secret ex-CEO Leslie Wexner's found... | business |
| 1 | HDFC Bank's Vigil Aunty ad gets criticism for ... | HDFC Bank's latest advertisement featuring Vig... | business |
| 2 | IMEC big opportunity for investors to partner ... | PM Narendra Modi at the Global Maritime India ... | business |
| 3 | ICICI Bank fined ₹12 crore, Kotak ₹3.95 crore ... | The Reserve Bank of India (RBI) has imposed a ... | business |
| 4 | 35 lakh weddings in 23 days to generate record... | Traders' body Confederation of All India Trade... | business |
| 5 | Mahadev app key accused arrives from Dubai, he... | Mrugank Mishra, a key accused in Mahadev betti... | business |
| 6 | Bankman-Fried's trial delay request over Adder... | A US court denied FTX Founder Sam Bankman-Frie... | business |
| 7 | Indian wheat prices hit 8-month high | Prices of wheat in India have reached an eight... | business |
| 8 | Dabur gets ₹321 crore GST demand notice | Dabur India has received a notice to pay Goods... | business |
| 9 | 'Kill list' of LinkedIn staff being fired was ... | A list with names of about 500 employees was l... | business |


## Bonus: cache the data

Write your code so that the acquired data is saved locally in some form. The functions that retrieve the data should prefer to read the local copy instead of making all the requests every time they are called. Include a boolean flag in each function that forces the data to be acquired "fresh" from the actual sources, overwriting the local cache.
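
A minimal sketch of that caching pattern, using only the standard library. The helper name `load_or_scrape`, its parameters, and the JSON file format are assumptions made for illustration; it also assumes the scraping function returns a list of dictionaries (the shape the exercises ask for). A function that returns a DataFrame would need `to_csv`/`read_csv` or similar instead.

```python
import json
from pathlib import Path

def load_or_scrape(scrape, cache_path, fresh=False):
    # scrape: zero-argument callable returning a list of dictionaries
    # cache_path: where the JSON copy lives on disk
    # fresh: when True, ignore any cached copy and overwrite it
    cache_file = Path(cache_path)
    if cache_file.exists() and not fresh:
        # Prefer the local copy: no network requests are made
        return json.loads(cache_file.read_text(encoding='utf-8'))
    # Cache is missing or a fresh pull was requested: scrape and rewrite the cache
    articles = scrape()
    cache_file.write_text(
        json.dumps(articles, ensure_ascii=False, indent=2),
        encoding='utf-8',
    )
    return articles
```

With a list-returning scraper, a call might look like `load_or_scrape(get_news_articles, 'news_articles.json')`, and `fresh=True` would force a re-scrape.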