## Acquire Exercises - Web Scraping

In [1]:
# Imports

import warnings
warnings.filterwarnings("ignore")

import requests
from requests import get
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

### 1. Codeup Blog Articles

Scrape the article text from the following pages:

- https://codeup.com/codeup-news/codeup-launches-first-podcast-hire-tech/ 
    
- https://codeup.com/tips-for-prospective-students/why-should-i-become-a-system-administrator/
    
- https://codeup.com/codeup-news/codeup-candidate-for-accreditation/
    
- https://codeup.com/codeup-news/codeup-takes-over-more-of-the-historic-vogue-building/
    
- https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article.

In [2]:

def codeup_blog_urls():
    
    """ return list of URLs for codeup blogs for exercise """
    
    url1 = 'https://codeup.com/codeup-news/codeup-launches-first-podcast-hire-tech/' 

    url2 ='https://codeup.com/tips-for-prospective-students/why-should-i-become-a-system-administrator/'
    
    url3 ='https://codeup.com/codeup-news/codeup-candidate-for-accreditation/'
    
    url4 ='https://codeup.com/codeup-news/codeup-takes-over-more-of-the-historic-vogue-building/'
    
    url5 ='https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/'
    
    return [url1, url2, url3, url4, url5]

def acquire_codeup_blog(url):
    
    """ returns dict of one codeup blog's title, date, category, and content """
    
    # set agent
    agent = 'codeup ds germain'
    
    # query
    response = requests.get(url, headers={'User-Agent': agent})
    
    # parse the HTML, naming the parser so results don't depend on which parsers are installed
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # get title
    title = soup.select('.entry-title')[0].text
    
    # get date
    date = soup.select('.published')[0].text
    
    # get category
    category = soup.find_all('a', {'rel':'category tag'})[0].text
    
    # grab all unformatted paragraphs
    paragraphs = soup.find_all('div', {'class':'et_pb_module et_pb_post_content et_pb_post_content_0_tb_body'})[0]\
    .find_all('p')
    
    # create list for formatted paragraphs
    paragraph_list = []
    
    # iterate paragraphs
    for paragraph in paragraphs:
        
        # add to list
        paragraph_list.append(paragraph.text)
        
    # join paragraphs and replace non-breaking spaces (\xa0) with regular spaces
    content = " ".join(paragraph_list).replace('\xa0', ' ')
    
    # create dict
    blog_info_dict = {'title':title, 'date':date, 'category':category, 'content':content}
    
    # return dict
    return blog_info_dict

def get_blogs():
    
    """ scrapes each codeup blog URL, returns a dataframe with one row per article (title, date, category, content) """
    
    list_of_blog_dicts = []
    for url in codeup_blog_urls():
        list_of_blog_dicts.append(acquire_codeup_blog(url))
    return pd.DataFrame(list_of_blog_dicts)
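
The join-and-replace step in acquire_codeup_blog only handles non-breaking spaces. A minimal sketch of a slightly more general cleanup follows. It assumes a hypothetical helper named clean_paragraphs (not part of the original notebook) that also collapses runs of whitespace and drops empty paragraphs:

```python
import re


def clean_paragraphs(paragraph_texts):
    """Hypothetical helper: join paragraph strings into one clean string."""
    cleaned = []
    for text in paragraph_texts:
        # Replace non-breaking spaces, then collapse any run of whitespace
        text = re.sub(r"\s+", " ", text.replace("\xa0", " ")).strip()
        # Skip paragraphs that are empty after cleaning
        if text:
            cleaned.append(text)
    return " ".join(cleaned)


print(clean_paragraphs(["Happy\xa0Pride Month!", "  ", "Pride Month is\n a dedicated time."]))
```

Inside acquire_codeup_blog, this could be applied to the list of paragraph.text values in place of the plain join.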



In [3]:
acquire_codeup_blog('https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/')

{'title': 'Inclusion at Codeup During Pride Month (and Always)',
 'date': 'Jun 4, 2021',
 'category': 'Codeup News',
 'content': 'Happy Pride Month! Pride Month is a dedicated time to celebrate and support the LGBTQIA+ community. At Codeup, one of our core values is Cultivating Inclusive Growth, something that takes on many shapes, sizes, forms, and colors. From representation in tech to empowering and supporting all, let’s reflect on how we live out this core value for our LGBTQIA+ community, not just during Pride Month, but always. We’re firm believers that the people making tech should look like the people using it, which is everyone. We’re proud to offer Pride Scholarships year round, which aim to increase, support, and promote representation of the LGBTQIA+ community in tech. However, representation is only one part of cultivating inclusive growth. We want to help create a thriving tech community where everyone feels represented, but also safe and empowered. In a 2019 survey condu

In [4]:
get_blogs()

| | title | date | category | content |
|---|---|---|---|---|
| 0 | Codeup Launches First Podcast: Hire Tech | Aug 25, 2021 | Codeup News | Any podcast enthusiasts out there? We are plea... |
| 1 | Why Should I Become a System Administrator? | Aug 23, 2021 | Tips for Prospective Students | With so many tech careers in demand, why choos... |
| 2 | Announcing our Candidacy for Accreditation! | Jun 30, 2021 | Codeup News | Did you know that even though we’re an indepen... |
| 3 | Codeup Takes Over More of the Historic Vogue B... | Jun 21, 2021 | Codeup News | Codeup is moving into another floor of our His... |
| 4 | Inclusion at Codeup During Pride Month (and Al... | Jun 4, 2021 | Codeup News | Happy Pride Month! Pride Month is a dedicated ... |
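
The exercise asks for a function named get_blog_articles that returns a list of dictionaries, while get_blogs returns a dataframe. One possible sketch of the list-returning variant follows. The fetch parameter and the fake_fetch stand-in are assumptions introduced here so the example runs without network access; in the notebook, fetch would be acquire_codeup_blog.

```python
def get_blog_articles(urls, fetch):
    """
    Return a list of dictionaries, one per article.

    urls  : iterable of article URLs
    fetch : callable that takes a URL and returns a dict for that article
            (an assumption of this sketch; acquire_codeup_blog has this shape)
    """
    return [fetch(url) for url in urls]


def fake_fetch(url):
    """Stand-in scraper for illustration only; makes no HTTP request."""
    return {"title": url.rstrip("/").split("/")[-1], "content": ""}


articles = get_blog_articles(
    ["https://codeup.com/codeup-news/codeup-candidate-for-accreditation/"],
    fake_fetch,
)
print(articles)
```

In the notebook, get_blog_articles(codeup_blog_urls(), acquire_codeup_blog) would produce the list, and passing that list to pd.DataFrame would give the same dataframe that get_blogs returns.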


### 2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment


The end product should be a function named get_news_articles that returns a list of dictionaries.

In [5]:
categories = ["business", "sports", "technology", "entertainment", "science", "world"]

base_url = 'https://inshorts.com/en/read/'
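
Each category page is addressed by appending the category name to base_url. As a small sketch, the standard library's urljoin can build those URLs. This relies on base_url ending in a slash, and the helper name category_url is hypothetical:

```python
from urllib.parse import urljoin

base_url = "https://inshorts.com/en/read/"
categories = ["business", "sports", "technology", "entertainment", "science", "world"]


def category_url(category, base=base_url):
    """Hypothetical helper: build the page URL for one category."""
    # urljoin appends the relative part because the base ends with "/"
    return urljoin(base, category)


for category in categories:
    print(category_url(category))
```

Plain string concatenation gives the same result here; urljoin mainly helps if the base or the category ever arrives with unexpected slashes.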

In [6]:
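def get_article(article, category):
    """
    Takes in the soup for a single news card and the category it came from
    Returns a dictionary with the article's title, content, and category
    """
    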
    # Attribute selector
    title = article.select("[itemprop='headline']")[0].text
    
    # article body
    content = article.select("[itemprop='articleBody']")[0].text
    
    output = {}
    output["title"] = title
    output["content"] = content
    output["category"] = category
    
    return output
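
get_article indexes the first match of each selector, which raises an IndexError when a card lacks that element. Below is a minimal sketch of a more forgiving variant that uses BeautifulSoup's select_one, which returns None when nothing matches. The helper name get_article_safe and the sample markup are assumptions for illustration, not the real inshorts page:

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the attributes the selectors rely on
sample_html = """
<div class="news-card">
  <span itemprop="headline">Sample headline</span>
  <div itemprop="articleBody">Sample body text.</div>
</div>
<div class="news-card">
  <span itemprop="headline">Card without a body</span>
</div>
"""


def get_article_safe(article, category):
    """Hypothetical variant of get_article that tolerates missing elements."""
    title = article.select_one("[itemprop='headline']")
    content = article.select_one("[itemprop='articleBody']")
    return {
        "title": title.text if title is not None else None,
        "content": content.text if content is not None else None,
        "category": category,
    }


soup = BeautifulSoup(sample_html, "html.parser")
cards = [get_article_safe(card, "business") for card in soup.select(".news-card")]
print(cards)
```

Whether to keep or drop cards with a missing field is a design choice; returning None keeps the row so the gap is visible in the dataframe later.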

In [7]:
def get_articles(category, base="https://inshorts.com/en/read/"):
    """
    This function takes in a category as a string. Category must be an available category in inshorts
    Returns a list of dictionaries where each dictionary represents a single inshort article
    """
    
    # Concatenate the base URL with the category
    url = base + category
    
    # Set the headers
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)

    # Make soup out of the raw html, naming the parser explicitly
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Ignore everything, focusing only on the news cards
    articles = soup.select(".news-card")
    
    output = []
    
    # Iterate through every article tag/soup 
    for article in articles:
        
        # Returns a dictionary of the article's title, body, and category
        article_data = get_article(article, category) 
        
        # Append the dictionary to the list
        output.append(article_data)
    
    # Return the list of dictionaries
    return output

In [8]:
news_articles = get_articles('business', base="https://inshorts.com/en/read/")
news_articles[0]

{'title': 'Navi Mutual Fund offers SIPs starting at ₹500',
 'content': 'Navi Mutual Fund is offering investors the option to start Systematic Investment Plans (SIPs) with ₹500 on its app. “With the recent spike in demand for mutual funds, Navi promises to offer a hassle-free digital experience to users and it is emerging as a preferred platform for direct investments”, the company said. Users can start investing via the Navi app.',
 'category': 'business'}

In [9]:
def get_all_news_articles(categories):
    """
    Takes in a list of categories where the category is part of the URL pattern on inshorts
    Returns a dataframe of every article from every category listed
    Each row in the dataframe is a single article
    """
    all_inshorts = []

    for category in categories:
        all_category_articles = get_articles(category)
        all_inshorts = all_inshorts + all_category_articles

    df = pd.DataFrame(all_inshorts)
    return df
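
get_all_news_articles rebuilds the list with + on every pass through the loop. Below is a sketch of the same loop using list.extend, with an optional pause between requests. The scrape and delay parameters and the fake_scrape stand-in are assumptions added here so the example can run offline; in the notebook, scrape would be get_articles:

```python
import time


def collect_articles(categories, scrape, delay=0.0):
    """
    Hypothetical helper: gather article dicts from every category.

    scrape : callable taking a category and returning a list of dicts
    delay  : seconds to wait between categories (an assumed courtesy pause)
    """
    all_articles = []
    for i, category in enumerate(categories):
        # Pause before every request except the first
        if i > 0 and delay > 0:
            time.sleep(delay)
        all_articles.extend(scrape(category))
    return all_articles


def fake_scrape(category):
    """Stand-in for a real scraper; makes no HTTP request."""
    return [{"title": f"{category} story", "content": "", "category": category}]


rows = collect_articles(["business", "sports"], fake_scrape)
print(len(rows), rows[0]["category"], rows[1]["category"])
```

Passing the returned list to pd.DataFrame gives one row per article, matching the shape that get_all_news_articles returns.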

In [10]:
df = get_all_news_articles(categories)

df

| | title | content | category |
|---|---|---|---|
| 0 | Navi Mutual Fund offers SIPs starting at ₹500 | Navi Mutual Fund is offering investors the opt... | business |
| 1 | Elon Musk and Jeff Bezos are now worth nearly ... | The combined net worth of the world's two rich... | business |
| 2 | Cognizant had to choose clients to serve: CEO ... | Cognizant CEO Brian Humphries has said the fir... | business |
| 3 | 'Man who takes 6 months parental leave is a lo... | Several Twitter users criticised US-based Pala... | business |
| 4 | Delhi HC notice to RBI, SBI over banning UPI p... | The Delhi High Court on Thursday issued notice... | business |
| ... | ... | ... | ... |
| 145 | Afghanistan is confronting an epic humanitaria... | During a regional conference of Afghanistan's ... | world |
| 146 | US General Mark Milley confirms China's hypers... | US General Mark Milley has confirmed that Chin... | world |
| 147 | Civil war has spread throughout Myanmar: UN envoy | Christine Schraner Burgener, the outgoing UN e... | world |
| 148 | China to build outpost for Tajikistan special ... | China will finance the construction of an outp... | world |
