## Data Acquisition Scratchpad

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os
import pandas as pd

#### At a high level, we'll go about web scraping through this process:
1. Manually explore the site in a web browser, and identify the relevant HTML elements.
2. Use the requests module to obtain the HTML from the page.
3. Use BeautifulSoup to parse the HTML and obtain the text/data that we want.
4. (Maybe) Script the process of requesting another page and parsing the data from it as well.
5. Take this data further down the data science pipeline.

#### Steps
1. Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
2. Assign the address of the web page to a variable named url.
3. Request the content of the web page from the server using get(), and store the server’s response in the variable response.
4. Print the response text to ensure you have an HTML page.
5. Take a look at the actual web page contents and inspect the source to understand the structure a bit.
6. Use BeautifulSoup to parse the HTML into a variable ('soup').
7. Identify the key tags you need to extract the data you are looking for.
8. Create a dataframe of the data desired.
9. Run some summary stats and inspect the data to ensure you have what you wanted.
10. Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
11. Create a corpus of the column with the text you want to analyze.
12. Store that corpus for use in a future notebook.
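Steps 8 through 11 can be sketched roughly as follows. This is a minimal illustration, assuming the scraped data has already been collected into a list of dictionaries; the `records` values and the `text` column name are made up for the example and are not real scraped data.

```python
import pandas as pd

# Hypothetical records standing in for scraped articles (not real data)
records = [
    {"title": "First post", "content": "Data science is fun."},
    {"title": "Second post", "content": "Web scraping takes patience."},
]

# Step 8: create a dataframe of the data desired
df = pd.DataFrame(records)

# Step 9: quick inspection of what we collected
print(df.shape)
print(df["content"].str.len().describe())

# Step 10: one column that holds all the text we want to analyze
df["text"] = df["title"] + ". " + df["content"]

# Step 11: the corpus is simply the list of documents in that column
corpus = df["text"].tolist()
print(corpus)
```

Keeping the corpus as a plain list of strings makes it easy to write to disk later in whatever format the next notebook expects.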

### Codeup Blog Articles

https://codeup.com/codeups-data-science-career-accelerator-is-here/

In [2]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

After making the request, we'll perform a quick sanity check to make sure what we are looking at is indeed HTML data.

In [3]:
print(response.text[:400])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />
<style type="text/css" id="nab-alternative-loader-style"></style>
<script type="text/javascript" id="nelio-ab-testing-kickoff">/*nelio-ab-testing-kick
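Eyeballing the first 400 characters works fine in a notebook, but the same check can be automated. The helper below is a hypothetical sketch (the name `looks_like_html` is not part of requests or Beautiful Soup); it only inspects the start of the text. With a real response, `response.status_code` and `response.raise_for_status()` are also worth checking before parsing.

```python
def looks_like_html(text):
    """Return True if the start of `text` resembles an HTML document."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or "<html" in head


# Hypothetical response bodies, not real server output
html_body = '<!DOCTYPE html><html lang="en-US"><head></head></html>'
json_body = '{"error": "forbidden"}'

print(looks_like_html(html_body))
print(looks_like_html(json_body))
```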


We'll use the Beautiful Soup library to work with HTML data in Python.

In [4]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

In [5]:
# see also `soup.find_all`
#
# beautiful soup uses `class_` as the keyword argument for searching
# for a class because `class` is a reserved word in python
# we'll use the class name that we identified from looking in the inspector in chrome
article = soup.find('div', class_='jupiterx-post-content')
article.text

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Student
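To make the difference between `find` and `find_all` concrete, here is a small sketch on a hand-written HTML snippet. The markup and the class names `post-content` and `sidebar` are invented for the example; they are not taken from the Codeup page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a blog page
html = """
<div class="post-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="sidebar"><p>Unrelated text.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag, or None when nothing matches
first_match = soup.find("div", class_="post-content")
missing = soup.find("div", class_="does-not-exist")

# find_all returns a list of every matching tag
paragraphs = first_match.find_all("p")

# get_text with a separator and strip=True joins the pieces cleanly
article_text = first_match.get_text(" ", strip=True)
print(article_text)
```

Because `find` returns `None` on a miss, calling `.text` on the result of a failed search raises an `AttributeError`, which is usually the first sign that a class name has changed on the site.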

Now that we have some text to process, we can store it for future use:

In [6]:
with open('article.txt', 'w') as f:
    f.write(article.text)
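The article text contains non-ASCII characters such as curly apostrophes and non-breaking spaces, and the default file encoding depends on the platform. A cautious variant passes `encoding='utf-8'` explicitly when writing and reading. The sketch below uses a temporary directory and a short sample string (an excerpt-style stand-in, not the full article) so it does not touch `article.txt`.

```python
import os
import tempfile

# Short sample with a curly apostrophe and a non-breaking space
text = "Glassdoor’s #1 Best Job in\xa0America."

# Write to a temporary location so the real article.txt is left alone
path = os.path.join(tempfile.mkdtemp(), "article.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(round_trip == text)
```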

We can now package all of our code up in a nice function that we can use later:

In [9]:
def get_article_text():
    # if we already have the data, read it locally
    if os.path.exists('article.txt'):
        with open('article.txt') as f:
            return f.read()

    # otherwise go fetch the data
    url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    article = soup.find('div', class_='jupiterx-post-content')

    # save it for next time
    with open('article.txt', 'w') as f:
        f.write(article.text)

    return article.text

In [10]:
get_article_text()

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Student
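The read-it-locally-or-fetch pattern in `get_article_text` can be pulled out into a reusable helper. This is a sketch under assumptions: `cached_text` and `fake_fetch` are hypothetical names, and the fetch step is passed in as a function so the example runs without any network access.

```python
import os
import tempfile


def cached_text(path, fetch):
    """Return the text stored at `path`, calling `fetch()` only on a cache miss."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()

    text = fetch()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


# Stand-in for the real request + parse step; records how often it is called
calls = []

def fake_fetch():
    calls.append(1)
    return "article body"


path = os.path.join(tempfile.mkdtemp(), "article.txt")
first = cached_text(path, fake_fetch)   # cache miss: fetch runs and the file is written
second = cached_text(path, fake_fetch)  # cache hit: the file is read, fetch is skipped
print(len(calls))
```

One thing to keep in mind with this pattern: the cached file never expires, so deleting it is the only way to pick up changes on the site.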

In [12]:
url_list = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/', 
           'https://codeup.com/data-science-myths/',
           'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
           'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
           'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

In [20]:
def get_title_content(url):
    '''
    Takes a single URL as a string and returns a dictionary with the
    'title' and 'content' of the article at that URL.
    '''
    
    # the header tells the website who is pulling the data
    headers = {'User-Agent': 'Codeup Data Science'}
    
    # response is 'get'ting the data from the website
    response = get(url, headers=headers)
    
    # make the soup (parse the HTML)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # find the tag that holds the article content
    content = soup.find('div', class_='jupiterx-post-content')
    
    # find the tag that holds the title
    title = soup.find('h1', class_='jupiterx-post-title')
    
    # add title and content to dictionary
    blog_dict = {'title': title.text,
                'content': content.text}

    return blog_dict
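If a page is missing either element, `soup.find` returns `None` and the `.text` access raises an `AttributeError`. One way to guard against that is a small helper like the one below. `text_or_none` is a hypothetical name, and the HTML is a hand-written snippet that deliberately omits the content div.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, name, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


# Hypothetical page that has a title but no content div
html = '<h1 class="jupiterx-post-title"> A title </h1>'
soup = BeautifulSoup(html, "html.parser")

title = text_or_none(soup, "h1", "jupiterx-post-title")
content = text_or_none(soup, "div", "jupiterx-post-content")
print(title, content)
```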

In [21]:
def get_blog_articles(url_list):
    '''
    Takes in a list of URLs from the Codeup blog and returns a list of dictionaries,
    one per article, each holding the article's 'title' and 'content'.
    '''
    
    # create an empty list
    list_of_blogs = []
    
    # cycle through each url in the list
    for url in url_list:
        # call on the 'get_title_content' function to create a dictionary for each url
        # append each dictionary
        list_of_blogs.append(get_title_content(url))
        
    return list_of_blogs
    

In [23]:
get_blog_articles(url_list)

[{'title': 'Codeup’s Data Science Career Accelerator is Here!',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspac
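The scraped content contains `\xa0` characters, which are non-breaking spaces carried over from the HTML. A light cleanup pass, sketched below, swaps them for regular spaces while leaving the paragraph newlines alone. `clean_text` is a hypothetical helper and the sample string is a short excerpt in the style of the output above.

```python
def clean_text(text):
    """Replace non-breaking spaces with regular spaces and trim the ends."""
    return text.replace("\xa0", " ").strip()


# Short excerpt-style sample containing a non-breaking space and a newline
sample = "land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method"
cleaned = clean_text(sample)
print(repr(cleaned))
```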

## News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

- Write a function that scrapes the news articles for the following topics:
    - Business
    - Sports
    - Technology
    - Entertainment

- The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

In [25]:
# {
#     'title': 'The article title',
#     'content': 'The article content',
#     'category': 'business' # for example
# }

#### Hints:
- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.

In [26]:
# categories of news articles
categories = ["business", "sports", "technology", "entertainment", "science", "world"]

In [27]:
# base url for all articles
base_url = 'https://inshorts.com/en/read/'

In [28]:
# grab the first category from the list
first_cat = categories[0]

In [29]:
# url for the first category
first_page = base_url + first_cat

In [32]:
# let's check our work so far...
print(f'first page: {first_page}')
print(f'headers: {headers}')

first page: https://inshorts.com/en/read/business
headers: {'User-Agent': 'Codeup Data Science'}
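String concatenation works here because the base URL ends with a slash. As an alternative sketch, `urllib.parse.urljoin` from the standard library builds the same URLs and makes the role of that trailing slash explicit. The `category_urls` name is introduced only for this example.

```python
from urllib.parse import urljoin

base_url = "https://inshorts.com/en/read/"
categories = ["business", "sports", "technology", "entertainment"]

# One URL per category, keyed by category name
category_urls = {category: urljoin(base_url, category) for category in categories}
print(category_urls["business"])

# Without the trailing slash, urljoin replaces the last path segment instead
no_slash = urljoin("https://inshorts.com/en/read", "business")
print(no_slash)
```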


In [33]:
# step one: get our content
response = get(first_page, headers=headers)

In [34]:
# did it work?
response.text[:400]

'<!doctype html>\n<html lang="en">\n\n<head>\n  <meta charset="utf-8" />\n  <style>\n    /* The Modal (background) */\n    .modal_contact {\n        display: none; /* Hidden by default */\n        position: fixed; /* Stay in place */\n        z-index: 8; /* Sit on top */\n        left: 0;\n        top: 0;\n        width: 100%; /* Full width */\n        height: 100%;\n        overflow: auto; /* Enable scroll if ne'

In [35]:
# make our soup
soup = BeautifulSoup(response.text, 'html.parser')

In [38]:
# select every element with the class "news-card" (one per article)
articles = soup.select('.news-card')

In [39]:
# what does the first article look like?
articles[0]

<div class="news-card z-depth-1" itemscope="" itemtype="http://schema.org/NewsArticle">
<span content="" itemid="https://inshorts.com/en/news/chinas-exteacher-turned-billionaire-no-more-a-billionaire-as-shares-fall-98-1627290782038" itemprop="mainEntityOfPage" itemscope="" itemtype="https://schema.org/WebPage"></span>
<span itemprop="author" itemscope="itemscope" itemtype="https://schema.org/Person">
<span content="Pragya Swastik" itemprop="name"></span>
</span>
<span content="China's ex-teacher turned billionaire no more a billionaire as shares fall 98%" itemprop="description"></span>
<span itemprop="image" itemscope="" itemtype="https://schema.org/ImageObject">
<meta content="https://static.inshorts.com/inshorts/images/v1/variants/jpg/m/2021/07_jul/26_mon/img_1627289502940_772.jpg?" itemprop="url"/>
<meta content="864" itemprop="width"/>
<meta content="483" itemprop="height"/>
</span>
<span itemprop="publisher" itemscope="itemscope" itemtype="https://schema.org/Organization">
<span c

In [40]:
# within the second card, use an attribute selector to grab the article body
articles[1].select("[itemprop='articleBody']")[0]

<div itemprop="articleBody">A new job posting by Amazon has fuelled speculations that the e-commerce major may begin accepting Bitcoin, Ether and other cryptocurrencies as a form of payment. According to the job posting, Amazon's Payments Acceptance &amp; Experience team is hiring a 'Digital Currency and Blockchain Product Lead'. Following the speculations around Amazon's plan, Bitcoin surged near $40,000 on Monday.</div>
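The two selectors used so far are ordinary CSS selectors: `.news-card` matches by class, and `[itemprop='articleBody']` matches by attribute value. The sketch below shows both on a hand-written snippet that loosely imitates the card structure; the headlines and bodies are invented. It also uses `select_one`, which returns the first match or `None` instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modeled on the news cards (not real inshorts data)
html = """
<div class="news-card z-depth-1">
  <span itemprop="headline">Hypothetical headline one</span>
  <div itemprop="articleBody">Body one.</div>
</div>
<div class="news-card z-depth-1">
  <span itemprop="headline">Hypothetical headline two</span>
  <div itemprop="articleBody">Body two.</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Class selector: returns a list with one tag per card
cards = soup.select(".news-card")

# Attribute selector scoped to a single card
body = cards[1].select_one("[itemprop='articleBody']")

# select_one returns None when nothing matches, rather than raising
missing = cards[1].select_one("[itemprop='author']")

print(len(cards), body.text, missing)
```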

In [42]:
def get_article(article, category):
    '''
    This function takes in a single article and category, and returns a dictionary with the
    article 'title', 'content', and 'category'.
    '''
    # Attribute selector (grabbing the title)
    title = article.select("[itemprop='headline']")[0].text
    
    # article body (grabbing the content)
    content = article.select("[itemprop='articleBody']")[0].text
    
    # create the empty dictionary
    output = {}
    
    # add each variable to the dictionary
    output["title"] = title
    output["content"] = content
    output["category"] = category
    
    return output

In [43]:
def get_articles(category, base="https://inshorts.com/en/read/"):
    """
    Takes in a category as a string; it must be a category available on inshorts.
    Returns a list of dictionaries, where each dictionary represents a single inshorts article.
    """
    
    # We concatenate our base_url with the category
    url = base + category
    
    # Set the headers
    headers = {'User-Agent': 'Codeup Data Science'}

    # Get the http response object from the server
    response = get(url, headers=headers)

    # Make soup out of the raw html
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Ignore everything, focusing only on the news cards
    articles = soup.select(".news-card")
    
    # create an empty list
    output = []
    
    # Iterate through every article tag/soup 
    for article in articles:
        
        # Returns a dictionary of the article's title, body, and category
        article_data = get_article(article, category) 
        
        # Append the dictionary to the list
        output.append(article_data)
    
    # Return the list of dictionaries
    return output

In [44]:
def get_all_news_articles(categories):
    """
    Takes in a list of categories where the category is part of the URL pattern on inshorts
    Returns a dataframe of every article from every category listed
    Each row in the dataframe is a single article
    """
    # Create a list for all inshorts articles
    all_inshorts = []
    
    # loop through each category
    for category in categories:
        # grab each article from a particular category
        all_category_articles = get_articles(category)
        # add each list of articles/category to the all_inshorts list
        all_inshorts.extend(all_category_articles)

    # make it a dataframe
    df = pd.DataFrame(all_inshorts)
    return df
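Once the articles are in a dataframe, a couple of quick checks help confirm the scrape did what we wanted. The sketch below uses made-up records in the same title/content/category shape. It assumes, without having verified it against the site, that the same story might show up under more than one category, so it also shows one way to drop repeats.

```python
import pandas as pd

# Hypothetical scraped records, not real inshorts data
business = [
    {"title": "Story A", "content": "Text A", "category": "business"},
    {"title": "Story B", "content": "Text B", "category": "business"},
]
technology = [
    {"title": "Story B", "content": "Text B", "category": "technology"},
    {"title": "Story C", "content": "Text C", "category": "technology"},
]

# Combine the per-category lists into one flat list of dictionaries
all_articles = []
for category_articles in (business, technology):
    all_articles.extend(category_articles)

df = pd.DataFrame(all_articles)

# How many articles did each category contribute?
per_category = df["category"].value_counts()
print(per_category)

# Keep the first occurrence of any story that appears in several categories
deduped = df.drop_duplicates(subset=["title", "content"]).reset_index(drop=True)
print(deduped)
```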