# Acquire Data through Web Scraping

When the data you need is not available through CSV files, APIs, SQL databases, or other structured sources, there is another option: web scraping, which means extracting the data directly from a site's HTML pages.

!!!caution "Web Scraping Ethics"
    Make sure the website's terms of use allow web scraping. Most sites publish a terms of service page, and you can also look at `example.com/robots.txt` to find the site's policy for automated visitors.
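
Checking `robots.txt` can also be done in code with the standard library's `urllib.robotparser`. The sketch below is a minimal example, not a complete policy check: the `robots.txt` contents and the `example.com` URLs are hypothetical and are parsed from a string so that the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, inlined so the example runs offline.
# For a real site, call parser.set_url("https://example.com/robots.txt")
# followed by parser.read() instead of parser.parse(...).
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = "Codeup Data Science"
blog_allowed = parser.can_fetch(user_agent, "https://example.com/blog/")
private_allowed = parser.can_fetch(user_agent, "https://example.com/private/page")

print(blog_allowed)     # True: no rule disallows /blog/
print(private_allowed)  # False: /private/ is disallowed for every user agent
```

A passing `can_fetch` check only reflects `robots.txt`; the site's terms of service still apply.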

At a high level, we'll go about web scraping through this process:

1. Manually explore the site in a web browser, and identify the relevant HTML elements.
1. Use the `requests` module to obtain the HTML from the page.
1. Use `BeautifulSoup` to parse the HTML and obtain the text/data that we want.
1. (Maybe) Script the process of requesting another page and parsing the data from it as well.
1. Take this data further down the data science pipeline.
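
The optional scripting step above can be sketched as a loop over several pages. This is a minimal sketch under stated assumptions: `scrape_titles`, `fetch`, `delay_seconds`, and the `example.com` pages are hypothetical names introduced for illustration, and the pages are faked with a dictionary so the example runs without network access.

```python
import time

from bs4 import BeautifulSoup


def scrape_titles(urls, fetch, delay_seconds=1.0):
    """Fetch each URL with `fetch` and return a {url: page title} dict."""
    titles = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause between requests to be polite
        soup = BeautifulSoup(fetch(url), "html.parser")
        titles[url] = soup.title.string if soup.title else None
    return titles


# Hypothetical pages standing in for real responses. With requests, `fetch`
# could be: lambda url: get(url, headers=headers).text
fake_pages = {
    "https://example.com/a": "<html><head><title>Page A</title></head></html>",
    "https://example.com/b": "<html><head><title>Page B</title></head></html>",
}

titles = scrape_titles(fake_pages, fetch=fake_pages.get, delay_seconds=0)
print(titles)
```

Passing the fetching function in as an argument keeps the parsing logic testable without making real requests.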

### Steps 

1. Import the `get()` function from the `requests` module, `BeautifulSoup` from `bs4`, and `pandas`.
2. Assign the address of the web page to a variable named `url`.
3. Request the content of the web page from the server by using `get()`, and store the server's response in the variable `response`.
4. Print the response text to ensure you have an HTML page.
5. Take a look at the actual web page contents and inspect the source to understand the structure a bit.
6. Use `BeautifulSoup` to parse the HTML into a variable (`soup`).
7. Identify the key tags you need to extract the data you are looking for.
8. Create a dataframe of the data desired.
9. Run some summary stats and inspect the data to ensure you have what you wanted.
10. Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
11. Create a corpus of the column with the text you want to analyze.
12. Store that corpus for use in a future notebook.
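
The later steps (parsing, building a dataframe, and creating a corpus) might look like the following. This is a minimal sketch, assuming a small inline HTML snippet in place of a real response; the column names `title`, `body`, and `text` and the file name in the final comment are illustrative choices, not part of the lesson's data.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML standing in for response.text so the example runs offline.
html = """
<section id="posts">
  <article class="blog_post"><h3>Hello World</h3><p>This is the first post!</p></article>
  <article class="blog_post"><h3>HTML Is Awesome</h3><p>It's the language and structure for the web!</p></article>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# One dictionary per article becomes one row in the dataframe.
rows = [
    {"title": post.find("h3").get_text(), "body": post.find("p").get_text()}
    for post in soup.find_all("article", class_="blog_post")
]
df = pd.DataFrame(rows)
print(df.describe())  # quick summary to confirm we got what we expected

# Put all the text we want to analyze in a single column, then build the corpus.
df["text"] = df["title"] + ". " + df["body"]
corpus = df["text"].tolist()
print(corpus)

# To store the result for a later notebook (hypothetical file name):
# df.to_csv("posts.csv", index=False)
```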


```python
from requests import get
from bs4 import BeautifulSoup
import os
```

For this lesson, we'll take a look at an article from Codeup's blog.

```python
url = 'https://codeup.com/data-science/math-in-data-science/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)
```

After making the request, we'll perform a quick sanity check to make sure what we are looking at is indeed HTML data.

```python
print(response.text[:400])
```

```
<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=wi
```


Now we will take a look at the actual web page contents and inspect the source to understand the structure a bit.

As we see from the first line of the response, the server sent us an HTML document. This document describes the overall structure of that web page, along with its specific content (which is what makes that particular page unique).
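
Besides reading the first few characters, the same sanity check can be made programmatically by looking at the status code and the `Content-Type` header. The following is a minimal sketch: `looks_like_html` is a hypothetical helper, and the `Response` objects are built by hand so that the example runs without a network connection.

```python
import requests


def looks_like_html(response):
    """Return True if the request succeeded and the server declared HTML."""
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx statuses
    content_type = response.headers.get("Content-Type", "")
    return "text/html" in content_type.lower()


# Hand-built responses standing in for real ones returned by get().
ok = requests.Response()
ok.status_code = 200
ok.headers["Content-Type"] = "text/html; charset=UTF-8"

missing = requests.Response()
missing.status_code = 404

print(looks_like_html(ok))
try:
    looks_like_html(missing)
except requests.HTTPError as error:
    print("request failed:", error)
```

In the lesson's code, the equivalent call would be `looks_like_html(response)` right after the request.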

For the most part, all of the pages from a single website will have the same (or very similar) overall structure. To write our script, we will need to understand the HTML structure of one page, and we will use the browser’s Developer Tools to do that. 

- `command + option + u` will let you view the source of a page in Chrome (on macOS).
- `command + option + i` will open the Chrome dev tools page inspector (on macOS).
- Right-clicking on specific text in the page and selecting 'Inspect' will take you right to the HTML of that text.

In general, we'll be looking for HTML tags, and using a couple properties of those tags to identify the content that we want. Two element properties are important to us:

- `class`: This is a list of the class(es) applied to an element. Classes can be used to target certain elements, but they are not guaranteed to be unique.
- `id`: This is an identifier that should be unique to a single element on a page.
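
To make the difference concrete, here is a minimal sketch that searches a small, made-up HTML snippet by `id` and by `class` using Beautiful Soup (introduced in the next step). The `featured` class is a hypothetical addition that shows an element carrying more than one class.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one element with an id, two elements sharing a class.
html = """
<section id="posts">
  <article class="blog_post featured"><h3>Hello World</h3></article>
  <article class="blog_post"><h3>HTML Is Awesome</h3></article>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

posts_section = soup.find(id="posts")            # an id should match one element
blog_posts = soup.find_all(class_="blog_post")   # a class can match many elements

print(posts_section.name)      # section
print(len(blog_posts))         # 2
print(blog_posts[0]["class"])  # ['blog_post', 'featured']
```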

We'll use the Beautiful Soup library to work with HTML data in Python.

```python
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')
```

## Beautiful Soup Methods and Properties
- `soup.title.string` gets the page's title (the same text shown in the browser tab for a page; this is the `<title>` element).
- `soup.prettify()` returns the HTML as an indented string, which is useful to print when you want to read the HTML.
- `soup.find_all("a")` finds all the anchor tags, or whatever tag name is given as the argument.
- `soup.find("h1")` finds the first matching element.
- `soup.get_text()` gets the text from within a piece of soup/HTML, with the tags removed.
- `soup.select()` takes a CSS selector as a string and returns all matching elements. **super useful**
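
The methods above can be tried on a small document. This is a minimal sketch that uses a made-up HTML string instead of the live page, so the element contents and link targets are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up HTML so the example runs without a request.
html = """
<html>
  <head><title>Example Blog</title></head>
  <body>
    <h1>Welcome to the blog!</h1>
    <p>Read <a href="/first">the first post</a> or <a href="/second">the second</a>.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                          # text of the <title> element
first_heading = soup.find("h1").get_text()              # first matching element's text
links = [a["href"] for a in soup.find_all("a")]         # attribute of every <a> tag
selected = [a.get_text() for a in soup.select("p a")]   # CSS selector: <a> inside <p>

print(page_title)     # Example Blog
print(first_heading)  # Welcome to the blog!
print(links)          # ['/first', '/second']
print(selected)       # ['the first post', 'the second']
```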

```python
# see also `soup.find_all`
#
# here we search by the id that we identified from looking in the inspector in chrome
# to search by class instead, beautiful soup uses `class_` as the keyword argument
# because `class` is a reserved word in python
article = soup.find('div', id='main-content')
article.text
```

```
'\n\n\n\n\n\n\n\nWhat are the Math and Stats Principles You Need for Data Science?Oct 21, 2020 | Data Science\n\n\nComing into our Data Science program, you will need to know some math and stats. However, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! Data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. But what “skills” do we mean, exactly? Just what exactly are the data science math and stats principles you need to know?\nWhat are the main math principles you need to know to get into Codeup’s Data Science program?\n\n\nAlgebra\nDo you know PEMDAS and can you solve for x? You will need to be or become comfortable with the following:\xa0\n\nVariables (x, y, n, etc.)\nFormulas, functions, and variable manipulations (e.g. x^2 = x + 6, solve for x).\nOrder of evaluation: PEMDAS: parentheses, exponents, then multipl
```

Now that we have some text to process, we can store it for future use:

```python
with open('article.txt', 'w') as f:
    f.write(article.text)
```

We can now package all of our code up in a nice function that we can use later:

```python
def get_article_text():
    # if we already have the data, read it locally
    if os.path.exists('article.txt'):
        with open('article.txt') as f:
            return f.read()

    # otherwise go fetch the data
    url = 'https://codeup.com/data-science/math-in-data-science/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    # name the parser explicitly so beautiful soup doesn't have to guess one
    soup = BeautifulSoup(response.text, 'html.parser')
    article = soup.find('div', id='main-content')

    # save it for next time
    with open('article.txt', 'w') as f:
        f.write(article.text)

    return article.text
```

## HTML and CSS Crash Course

HTML is the language for content and structure on the web. This means that HTML specifies what content is what: text, images, links, tables, containers, etc.

CSS is the language for styling and presentation. This means CSS specifies color, background, texture, position, etc.

### HTML Basics

HTML consists of elements denoted by tags. These tags are contained in angle brackets like `<main>`. Notice how there are opening and closing tags that contain other elements.

HTML tags nest inside of other HTML tags, just like directories and files are nested in other directories.

[Further reading on HTML Elements](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)

````html
<html>
    <head>
        <title>This is the title of the page</title>
    </head>
    <body>
        <header>
            <h1>Welcome to the blog!</h1>
            <p>Blog is short for "back-log"</p>
        </header>
        <main>
            <h2>Read your way to insight!</h2>
            <section id="posts">
                <article class="blog_post">
                    <h3>Hello World</h3>
                    <p>This is the first post!</p>
                </article>
                <article class="blog_post">
                    <h3>HTML Is Awesome</h3>
                    <p>It's the language and structure for the web!</p>
                </article>
                <article class="blog_post">
                    <h3>CSS Is Totally Rad</h3>
                    <p>CSS Selectors are super powerful</p>
                </article>
            </section>
        </main>
        <footer>
            <p>All rights reserved.</p>
        </footer>
    </body>
</html>
````
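
The directory-like nesting can be explored in Beautiful Soup by walking up to an element's ancestors or down to its children. The sketch below is a minimal illustration that uses a trimmed-down copy of the sample page.

```python
from bs4 import BeautifulSoup

# A trimmed-down copy of the sample page above.
html = """
<main>
  <section id="posts">
    <article class="blog_post">
      <h3>Hello World</h3>
      <p>This is the first post!</p>
    </article>
  </section>
</main>
"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h3")

# Walk upward: each tag that contains the <h3>, innermost first.
ancestors = [tag.name for tag in heading.parents]
print(ancestors)

# Walk downward: the tags nested directly inside the <article>.
article = heading.parent
child_tags = [child.name for child in article.find_all(recursive=False)]
print(child_tags)
```

The last entry in `ancestors` is the soup object itself, which sits at the top of the tree much like a root directory.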



### CSS Selectors

- The name of the element itself is a selector. For example, `soup.select("p")` will select every paragraph tag, and `soup.select("footer")` selects the footer element (and everything inside it).
- The id selector is denoted with a `#`. For example, `soup.select("#posts")` will return the HTML element that has the `id="posts"` attribute.
- The class selector is denoted with a `.` symbol before the class name. For example, `soup.select(".blog_post")` returns all of the elements that have that class name.
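
The three selector types can be tried against a condensed version of the sample page. This is a minimal sketch; the final selector combines an id, a class, and an element name, which goes one step beyond the three basic forms listed above.

```python
from bs4 import BeautifulSoup

# Condensed version of the sample page from the HTML Basics section.
html = """
<body>
  <header><h1>Welcome to the blog!</h1></header>
  <section id="posts">
    <article class="blog_post"><h3>Hello World</h3><p>This is the first post!</p></article>
    <article class="blog_post"><h3>HTML Is Awesome</h3><p>It's the language and structure for the web!</p></article>
  </section>
  <footer><p>All rights reserved.</p></footer>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")            # element selector
posts_section = soup.select("#posts")    # id selector
blog_posts = soup.select(".blog_post")   # class selector

# Selectors can be combined: <h3> elements inside .blog_post inside #posts.
post_titles = [h3.get_text() for h3 in soup.select("#posts .blog_post h3")]

print(len(paragraphs))     # 3
print(len(posts_section))  # 1
print(len(blog_posts))     # 2
print(post_titles)         # ['Hello World', 'HTML Is Awesome']
```

Note that `select` always returns a list, even when only one element matches.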

[Further reading on CSS Selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors)

## Further Reading

- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) 
- [Web Scraping with Beautiful Soup Tutorial](https://www.dataquest.io/blog/web-scraping-beautifulsoup/)
- [Practitioner's Guide to Understanding Text](https://www.kdnuggets.com/2018/07/practitioners-guide-processing-understanding-text-1.html)

## Exercises

By the end of this exercise, you should have a file named `acquire.py` that
contains the specified functions. If you wish, you may break your work into
separate files for each website (e.g. `acquire_codeup_blog.py` and
`acquire_news_articles.py`), but the end function should be present in
`acquire.py` (that is, `acquire.py` should import `get_blog_articles` from
the `acquire_codeup_blog` module).

1. Codeup Blog Articles

    Visit [Codeup's Blog](https://codeup.com/blog/) and record the urls for at
    least 5 distinct blog posts. For each post, you should scrape at least the
    post's title and content.

    Encapsulate your work in a function named `get_blog_articles` that will return a
    list of dictionaries, with each dictionary representing one article. The shape
    of each dictionary should look like this:

    ```python
    {
        'title': 'the title of the article',
        'content': 'the full text content of the article'
    }
    ```

    Plus any additional properties you think might be helpful.

    **Bonus:** Scrape the text of **_all_** the articles linked on [Codeup's blog page](https://codeup.com/blog/).

1. News Articles

    We will now be scraping text data from [inshorts](https://inshorts.com/), a
    website that provides a brief overview of many different topics.

    Write a function that scrapes the news articles for the following topics:

    - Business
    - Sports
    - Technology
    - Entertainment

    The end product of this should be a function named `get_news_articles` that
    returns a list of dictionaries, where each dictionary has this shape:

    ```python
    {
        'title': 'The article title',
        'content': 'The article content',
        'category': 'business' # for example
    }
    ```

    Hints:

    1. Start by inspecting the website in your browser. Figure out which
       elements will be useful.
    1. Start by creating a function that handles a single article and produces a
       dictionary like the one above.
    1. Next create a function that will find all the articles on a single page
       and call the function you created in the last step for every article on
       the page.
    1. Now create a function that will use the previous two functions to scrape
       the articles from all the pages that you need, and do any additional
       processing that needs to be done.

1. Bonus: cache the data

    Write your code such that the acquired data is saved locally in some form or
    fashion. Your functions that retrieve the data should prefer to read the
    local data instead of having to make all the requests every time the function
    is called. Include a boolean flag in the functions to allow the data to be
    acquired "fresh" from the actual sources (re-writing your local cache).