In [1]:
import pandas as pd
import re

from requests import get
from bs4 import BeautifulSoup

# 1. Codeup Blog Articles

## Visit Codeup's Blog and record the urls for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

## Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:


{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
## Plus any additional properties you think might be helpful.

In [2]:
def get_blog_links():
    
    url = 'https://codeup.edu/blog/'

    # identify the scraper with a User-Agent header; some sites reject requests that lack one
    headers = {'User-Agent': 'Codeup Data Science'}

    response = get(url, headers=headers)
    
    soup = BeautifulSoup(response.content, 'html.parser')
    
    links = soup.find_all("h2")
        
    new_links = []

    for article in links:
        
        if article.find("a"):

            new_links.append(article.find("a").get("href")) 
        
    return links, new_links

In [3]:
links, new_links = get_blog_links()
new_links

['https://codeup.edu/featured/apida-heritage-month/',
 'https://codeup.edu/featured/women-in-tech-panelist-spotlight/',
 'https://codeup.edu/featured/women-in-tech-rachel-robbins-mayhill/',
 'https://codeup.edu/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/',
 'https://codeup.edu/events/women-in-tech-madeleine/',
 'https://codeup.edu/codeup-news/panelist-spotlight-4/']

In [41]:
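def blog_titles():
    
    links, new_links = get_blog_links()
    
    titles = []
    
    for article in links:
        
        # keep only headings that link to a post, so the titles line up
        # one-to-one with new_links; this skips the unlinked
        # 'Git Codeupdates' heading without a hard-coded remove()
        if article.find("a"):
            
            titles.append(article.get_text())
        
    return titles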

In [42]:
blog_titles()

['Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa',
 'Women in tech: Panelist Spotlight – Magdalena Rahn',
 'Women in tech: Panelist Spotlight – Rachel Robbins-Mayhill',
 'Women in Tech: Panelist Spotlight – Sarah Mellor',
 'Women in Tech: Panelist Spotlight – Madeleine Capper',
 'Black Excellence in Tech: Panelist Spotlight – Wilmarie De La Cruz Mejia']

In [33]:
def blog_content():
    
    links, new_links = get_blog_links()
    
    headers = {'User-Agent': 'Codeup Data Science'}
    
    clean_content = []

    for url in new_links:
        
        response = get(url, headers=headers)
    
        soup = BeautifulSoup(response.content, 'html.parser')

        # the post body is the first element with class "entry-content"
        body = soup.select_one(".entry-content")

        # if a page has no such element, use an empty list rather than raising an IndexError
        paragraphs = body.find_all("p") if body is not None else []

        # keep the text of each <p> tag, one string per paragraph
        clean_content.append([p.get_text() for p in paragraphs])
    
    return clean_content

In [37]:
blog_content()

[['May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.',
  '',
  'In an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers.',
  'Arbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen.',
  'At Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-American individuals. Hence, we will now use the term Asian Pacific Islander Desi American (APIDA).',
 

In [26]:
# First attempt: a single dictionary holding a list of titles and a list of contents
# (this is not yet one dictionary per article)
article_data = {
    'title': blog_titles(),
    'content': blog_content()
}

article_data

{'title': ['Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa',
  'Women in tech: Panelist Spotlight – Magdalena Rahn',
  'Women in tech: Panelist Spotlight – Rachel Robbins-Mayhill',
  'Women in Tech: Panelist Spotlight – Sarah Mellor',
  'Women in Tech: Panelist Spotlight – Madeleine Capper',
  'Black Excellence in Tech: Panelist Spotlight – Wilmarie De La Cruz Mejia',
  'Git Codeupdates'],
 'content': ['May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.',
  '',
  'In an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers.',
  'Arbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988

In [44]:
list_of_dicts = []

for title, content_list in zip(blog_titles(), blog_content()):
    entry = {'title': title, 'content': content_list}
    list_of_dicts.append(entry)
    
list_of_dicts

[{'title': 'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa',
  'content': ['May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.',
   '',
   'In an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers.',
   'Arbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen.',
   'At Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI 
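
The target shape asks for each article's 'content' to be the full text, while the entries above still hold a list of paragraphs. A minimal sketch of that last step follows, assuming titles and paragraph lists shaped like the ones returned by blog_titles() and blog_content(). The helper name build_article_dicts and the sample inputs are hypothetical and only serve as illustration.

```python
def build_article_dicts(titles, paragraph_lists):
    """Pair each title with its paragraphs and join them into one string.

    build_article_dicts is a hypothetical helper name, not part of the
    notebook above.
    """
    articles = []
    for title, paragraphs in zip(titles, paragraph_lists):
        # drop empty or whitespace-only paragraphs, then join the rest
        # with newlines so 'content' is a single string
        content = "\n".join(p.strip() for p in paragraphs if p.strip())
        articles.append({"title": title.strip(), "content": content})
    return articles


# made-up inputs standing in for blog_titles() and blog_content()
sample_titles = ["First post", "Second post"]
sample_paragraphs = [["Hello.", "", "World."], ["Only paragraph."]]

articles = build_article_dicts(sample_titles, sample_paragraphs)
```

Under these assumptions, get_blog_articles could simply return build_article_dicts(blog_titles(), blog_content()).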

# 2. News Articles

## We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

## Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment
## The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:


{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
## Hints:

* Start by inspecting the website in your browser. Figure out which elements will be useful.
* Start by creating a function that handles a single article and produces a dictionary like the one above.
* Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
* Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
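
The three hints above can be sketched as three small functions. This is only an outline under stated assumptions: the URL pattern, the "news-card" class, and the itemprop selectors are guesses about the site's markup and should be checked in the browser inspector first. The names parse_article and parse_page are hypothetical.

```python
from bs4 import BeautifulSoup
from requests import get

# ASSUMPTION: URL pattern and selectors are guesses about inshorts' markup;
# verify them in the browser before relying on this sketch.
BASE_URL = "https://inshorts.com/en/read/"
CATEGORIES = ["business", "sports", "technology", "entertainment"]


def parse_article(card, category):
    """Turn one article element into a dictionary."""
    title = card.select_one('[itemprop="headline"]')        # assumed selector
    body = card.select_one('[itemprop="articleBody"]')      # assumed selector
    return {
        "title": title.get_text(" ", strip=True) if title is not None else "",
        "content": body.get_text(" ", strip=True) if body is not None else "",
        "category": category,
    }


def parse_page(html, category):
    """Find every article on one page and parse each of them."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select(".news-card")                        # assumed selector
    return [parse_article(card, category) for card in cards]


def get_news_articles(categories=CATEGORIES):
    """Scrape each category page and return one flat list of dictionaries."""
    headers = {"User-Agent": "Codeup Data Science"}
    articles = []
    for category in categories:
        response = get(BASE_URL + category, headers=headers)
        response.raise_for_status()  # stop early on a failed request
        articles.extend(parse_page(response.text, category))
    return articles
```

Keeping the parsing separate from the request makes it possible to try parse_page on a saved HTML string before touching the network.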