## Web Scraping

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os

import pandas as pd
import numpy as np

### 1. Codeup Blog Articles

Scrape the article text from the following pages:

 - https://codeup.com/codeups-data-science-career-accelerator-is-here/
 - https://codeup.com/data-science-myths/
 - https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
 - https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
 - https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article.

In [2]:
def get_blog_articles(url):
    '''
    Takes in the url of a Codeup blog post, pulls the title and body text
    off the page, and returns them in a dictionary.
    '''

    # fetch the data, sending a custom User-Agent header
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # pull the elements from the soup
    article = soup.find('div', class_='jupiterx-post-content') # body of the post
    title = soup.find('h1', class_='jupiterx-post-title') # title of the post
    
    # build the dictionary from the text of each element
    blog_dict = {'title': title.text,
                 'content': article.text}
    
    # return dictionary
    return blog_dict

In [4]:
# create my list of urls
url_list = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
           'https://codeup.com/data-science-myths/',
           'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
           'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
           'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

# create an empty list
list_of_blogs=[]

# create a for loop for all the urls in the list to pull elements from and return a dict
for url in url_list:
    list_of_blogs.append(get_blog_articles(url))
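
The exercise asks for a single call that returns a list of dictionaries, so the loop above could be folded into a helper. What follows is only a sketch: `collect_articles` and `fake_scrape` are hypothetical names that do not appear in the notebook, and the stand-in scraper exists only so the example runs without network access. In practice `get_blog_articles` would be passed as `scrape_one`.

```python
def collect_articles(urls, scrape_one):
    """Apply scrape_one to each URL and return a list of article dictionaries.

    scrape_one is any callable that maps a URL to a dictionary.
    URLs whose scrape raises an exception are reported and skipped.
    """
    articles = []
    for url in urls:
        try:
            articles.append(scrape_one(url))
        except Exception as error:  # e.g. AttributeError when a tag is not found
            print(f"Skipping {url}: {error}")
    return articles


# Hypothetical stand-in scraper so the sketch runs offline
def fake_scrape(url):
    if "broken" in url:
        raise AttributeError("'NoneType' object has no attribute 'text'")
    return {"title": url.rstrip("/").split("/")[-1], "content": "..."}


demo = collect_articles(
    ["https://example.com/first-post/", "https://example.com/broken/"],
    fake_scrape,
)
```

Skipping a failed page rather than stopping is a design choice; for a short, fixed list of URLs it may be preferable to let the error surface instead.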

In [6]:
# take a look at the first entry in the list
list_of_blogs[0]

{'title': 'Codeup’s Data Science Career Accelerator is Here!',
 'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace,
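
The scraped content above still contains non-breaking spaces (`\xa0`) and newline characters. One possible cleanup, assuming paragraph breaks do not need to be preserved, is to collapse every run of whitespace into a single space. The helper name `clean_text` is an assumption and is not part of the notebook.

```python
import re


def clean_text(text):
    """Collapse runs of whitespace into single spaces and trim the ends.

    For str patterns, \\s matches Unicode whitespace, so the non-breaking
    space (\\xa0) and newlines are both replaced.
    """
    return re.sub(r"\s+", " ", text).strip()


# A fragment taken from the scraped output shown above
sample = "land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method"
cleaned = clean_text(sample)
```

If paragraph structure matters later on, replacing only `\xa0` with a space and leaving `\n` alone would be the gentler option.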

### 2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment


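The end product of this should be a function named get_news_articles that returns a list of dictionaries. In this notebook, get_articles returns that list for a single category, and get_all_news_articles combines every category into a dataframe.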

In [7]:
categories = ["business", "sports", "technology", "entertainment", "science", "world"]

base_url = 'https://inshorts.com/en/read/'

In [12]:
def get_article(article, category):
    # Attribute selector
    title = article.select("[itemprop='headline']")[0].text
    
    # article body
    content = article.select("[itemprop='articleBody']")[0].text
    
    output = {}
    output["title"] = title
    output["content"] = content
    output["category"] = category
    
    return output
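
To see how the attribute selectors behave without hitting the site, they can be tried on a small inline snippet. The markup below is made up to imitate the structure the selectors expect and is not copied from inshorts. `parse_card` is a hypothetical variant that uses `select_one`, which returns `None` when nothing matches, so a card with a missing element does not raise an `IndexError`.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the structure the selectors expect
html = """
<div class="news-card">
  <span itemprop="headline">Example headline</span>
  <div itemprop="articleBody">Example body text.</div>
</div>
<div class="news-card">
  <span itemprop="headline">Card without a body</span>
</div>
"""


def parse_card(card, category):
    """Return the title, content, and category of one card, using None for missing parts."""
    headline = card.select_one("[itemprop='headline']")
    body = card.select_one("[itemprop='articleBody']")
    return {
        "title": headline.get_text(strip=True) if headline else None,
        "content": body.get_text(strip=True) if body else None,
        "category": category,
    }


soup = BeautifulSoup(html, "html.parser")
cards = [parse_card(card, "business") for card in soup.select(".news-card")]
```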

In [13]:
def get_articles(category, base="https://inshorts.com/en/read/"):
    """
    This function takes in a category as a string. The category must be one that is available on inshorts.
    Returns a list of dictionaries where each dictionary represents a single inshorts article.
    """
    
    # Concatenate the base url with the category
    url = base + category
    
    # Set the headers
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)

    # Make soup out of the raw html, naming the parser explicitly
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Ignore everything, focusing only on the news cards
    articles = soup.select(".news-card")
    
    output = []
    
    # Iterate through every article tag/soup 
    for article in articles:
        
        # Returns a dictionary of the article's title, body, and category
        article_data = get_article(article, category) 
        
        # Append the dictionary to the list
        output.append(article_data)
    
    # Return the list of dictionaries
    return output

In [15]:
news_articles = get_articles('business', base="https://inshorts.com/en/read/")
news_articles[0]

{'title': "China's ex-teacher turned billionaire no more a billionaire as shares fall 98%",
 'content': "China's Larry Chen, a former teacher who became a billionaire with edtech company Gaotu Techedu, lost his billionaire status after his company's shares fell 98%. Chen, Gaotu Techedu's Founder and CEO, is now worth $336 million according to Bloomberg. The development comes as China's new regulations banned companies teaching school curriculums from making profits, raising capital or going public.",
 'category': 'business'}

In [16]:
def get_all_news_articles(categories):
    """
    Takes in a list of categories where the category is part of the URL pattern on inshorts
    Returns a dataframe of every article from every category listed
    Each row in the dataframe is a single article
    """
    all_inshorts = []

    for category in categories:
        all_category_articles = get_articles(category)
        all_inshorts = all_inshorts + all_category_articles

    df = pd.DataFrame(all_inshorts)
    return df

In [18]:
df = get_all_news_articles(categories)

df

| | title | content | category |
|---|---|---|---|
| 0 | Amazon job posting fuels speculations about pl... | A new job posting by Amazon has fuelled specul... | business |
| 1 | China's ex-teacher turned billionaire no more ... | China's Larry Chen, a former teacher who becam... | business |
| 2 | Musk takes a jibe at rival car companies, says... | Tesla CEO and the world's second-richest perso... | business |
| 3 | Govt paid Infosys ₹164.5 crore for new Income ... | The government paid ₹164.5 crore to Infosys to... | business |
| 4 | Unemployment rate rises in both urban, rural a... | India's unemployment rate soared to 7.14% in t... | business |
| ... | ... | ... | ... |
| 142 | Afghan Army chief postpones India visit amid T... | Afghan Army chief General Wali Mohammad Ahmadz... | world |
| 143 | 46 Afghan soldiers flee to Pakistan in retreat... | The Pakistani Army on Monday said that 46 Afgh... | world |
| 144 | UAE extends ban on passenger flights from Indi... | The UAE has extended a ban on passenger flight... | world |
| 145 | Man accused of trying to kill Mali's interim P... | A man accused of trying to kill Mali's interim... | world |
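
Because every run of the scraper re-requests each page, it can be convenient to cache the scraped articles on disk and reuse them. The following is a minimal sketch and not something the notebook itself does: `load_or_scrape`, `fake_scrape`, and the file name are all hypothetical, and the stand-in scraper is there only so the example runs offline. In practice the `scrape` argument could be something like `lambda: get_articles('business')`.

```python
import json
import os
import tempfile


def load_or_scrape(path, scrape):
    """Return cached articles from path if it exists; otherwise call scrape() and cache the result."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    articles = scrape()
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as ₹ readable in the file
        json.dump(articles, f, ensure_ascii=False, indent=2)
    return articles


calls = []


# Hypothetical stand-in scraper that records how often it is called
def fake_scrape():
    calls.append(1)
    return [{"title": "Example", "content": "Body", "category": "business"}]


cache_path = os.path.join(tempfile.mkdtemp(), "news_articles.json")
first = load_or_scrape(cache_path, fake_scrape)   # scrapes and writes the cache
second = load_or_scrape(cache_path, fake_scrape)  # reads the cache, no scrape
```

Deleting the cache file forces a fresh scrape, which matters for a news site whose front page changes throughout the day.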


In [None]:
# Soup.title.string