In [1]:
from requests import get
from bs4 import BeautifulSoup
import os

In [3]:
url = 'https://codeup.com/data-science/math-in-data-science/'
headers = {'User-Agent': 'Codeup Data Science'}  # Some websites reject the default python-requests user agent
response = get(url, headers=headers)

### Confirming that what we are scraping is HTML data

In [4]:
print(response.text[:400])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.edu/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=wi


In [6]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

## Beautiful Soup Methods and Properties  

- **`soup.title.string`** gets the page's title (the text shown in the browser tab), which comes from the `<title>` element.

- **`soup.prettify()`** returns the HTML as an indented string, which is useful to print when you want to inspect the markup.

- **`soup.find_all("a")`** finds all the anchor tags, or all tags matching whatever argument is specified.

- **`soup.find("h1")`** finds the first matching element.

- **`soup.get_text()`** gets the text from within a matching piece of soup/HTML.

- The **`soup.select()`** method takes a CSS selector as a string and returns all matching elements, which makes it especially useful.
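
To make the methods above concrete, here is a minimal sketch that runs them against a small, made-up HTML string rather than a live page. The sample markup and the variable names are assumptions chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for this illustration
html = """
<html>
<head><title>Sample Page</title></head>
<body>
<h1>First heading</h1>
<h1>Second heading</h1>
<p class="intro">Read the <a href="/docs">docs</a> or the <a href="/blog">blog</a>.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

page_title = soup.title.string                          # text of the <title> element
links = [a['href'] for a in soup.find_all('a')]         # every anchor tag
first_h1 = soup.find('h1').get_text()                   # only the first <h1>
intro_text = soup.find('p', class_='intro').get_text()  # `class_`, since `class` is reserved
selected = soup.select('p.intro a')                     # CSS selector: anchors inside p.intro

print(page_title)     # Sample Page
print(links)          # ['/docs', '/blog']
print(first_h1)       # First heading
print(intro_text)     # Read the docs or the blog.
print(len(selected))  # 2
```

Note that `find` returns a single element while `find_all` and `select` return lists, so the latter two need indexing or a loop before calling `get_text()`.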

In [11]:
# see also `soup.find_all`
#
# beautiful soup uses `class_` as the keyword argument for searching
# for a class because `class` is a reserved word in python;
# here we search by the id that we identified from looking in the inspector in chrome
article = soup.find('div', id='main-content')
article.text

'\n\n\n\n\n\nWhat are the Math and Stats Principles You Need for Data Science?\nOct 21, 2020 | Data Science\n\n\nComing into our Data Science program, you will need to know some math and stats. However, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! Data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. But what “skills” do we mean, exactly? Just what exactly are the data science math and stats principles you need to know?\nWhat are the main math principles you need to know to get into Codeup’s Data Science program?\n\n\nAlgebra\nDo you know PEMDAS and can you solve for x? You will need to be or become comfortable with the following:\xa0\n\nVariables (x, y, n, etc.)\nFormulas, functions, and variable manipulations (e.g. x^2 = x + 6, solve for x).\nOrder of evaluation: PEMDAS: parentheses, exponents, then multiplic

### Now that we have some text to process, we can store it for future use:

In [12]:
with open('article.txt', 'w') as f:
    f.write(article.text)

### We can now package all of our code up in a nice function that we can use later:

In [13]:
def get_article_text():
    # if we already have the data, read it locally
    if os.path.exists('article.txt'):
        with open('article.txt') as f:
            return f.read()

    # otherwise go fetch the data
    url = 'https://codeup.com/data-science/math-in-data-science/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    article = soup.find('div', id='main-content')

    # save it for next time
    with open('article.txt', 'w') as f:
        f.write(article.text)

    return article.text
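
The read-from-disk-or-fetch pattern in the function above can also be tried without touching the network. The sketch below is one possible way to do that, not part of the original notebook: `read_or_fetch` and `fake_fetch` are hypothetical names, and the fetch step is passed in as a callable so that a stand-in can replace the real request.

```python
import os
import tempfile


def read_or_fetch(filename, fetch):
    """Return the cached text in `filename`, calling `fetch()` only if the file is missing."""
    if os.path.exists(filename):
        with open(filename, encoding='utf-8') as f:
            return f.read()

    # No cached copy yet: fetch the text and save it for next time
    text = fetch()
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(text)
    return text


calls = []


def fake_fetch():
    # Stand-in for the real request-and-parse step; records each call
    calls.append(1)
    return 'article text'


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, 'article.txt')
    first = read_or_fetch(cache_file, fake_fetch)   # file missing: fetches and saves
    second = read_or_fetch(cache_file, fake_fetch)  # file present: reads from disk

print(first == second, len(calls))  # True 1
```

The second call never reaches `fake_fetch`, which is the point of caching: repeated runs do not hit the website again.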

## HTML and CSS Crash Course

- HTML is the language for content and structure on the web. This means that HTML specifies what content is what: text, images, links, tables, containers, etc.

- CSS is the language for styling and presentation. This means CSS specifies color, background, texture, position, etc.

### HTML Basics

- HTML consists of elements denoted by tags. Tags are written in angle brackets, like **`<main>`**. Most elements have an opening tag and a closing tag (**`<main>`** and **`</main>`**), and other elements can sit between them.

- HTML tags nest inside of other HTML tags, just like directories and files are nested in other directories.
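
The directory analogy can be sketched with Beautiful Soup's navigation attributes. This is a minimal illustration on a made-up fragment; the markup and variable names are assumptions, not taken from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, made up for this illustration
html = '<main><section id="posts"><article><h3>Hello World</h3><p>First post</p></article></section></main>'
soup = BeautifulSoup(html, 'html.parser')

h3 = soup.find('h3')

# Walk upward from the <h3>, like walking from a file up toward the root directory
ancestors = [parent.name for parent in h3.parents]

# List the tags directly inside <article>, like listing a directory's contents
article = soup.find('article')
children = [child.name for child in article.children]

print(ancestors)  # starts with 'article', 'section', 'main'
print(children)   # ['h3', 'p']
```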

**Further reading on HTML Elements: https://developer.mozilla.org/en-US/docs/Web/HTML/Element**

In [19]:
# HTML code for the output below, shown in comments

# <html>
#     <head>
#         <title>This is the title of the page</title>
#     </head>
#     <body>
#         <heading>
#             <h1>Welcome to the blog!</h1>
#             <p>Blog is short for "back-log"</p>
#         </heading>
#         <main>
#             <h2>Read your way to insight!</h2>
#             <section id="posts">
#                 <article class="blog_post">
#                     <h3>Hello World</h3>
#                     <p>This is the first post!</p>
#                 </article>
#                 <article class="blog_post">
#                     <h3>HTML Is Awesome</h3>
#                     <p>It's the language and structure for the web!</p>
#                 </article>
#                 <article class="blog_post">
#                     <h3>CSS Is Totally Rad</h3>
#                     <p>CSS Selectors are super powerful</p>
#                 </article>
#             </section>
#         </main>
#         <footer>
#             <p>All rights reserved.</p>
#         </footer>
#     </body>
# </html>

## OUTPUT:

<html>
    <head>
        <title>This is the title of the page</title>
    </head>
    <body>
        <heading>
            <h1>Welcome to the blog!</h1>
            <p>Blog is short for "back-log"</p>
        </heading>
        <main>
            <h2>Read your way to insight!</h2>
            <section id="posts">
                <article class="blog_post">
                    <h3>Hello World</h3>
                    <p>This is the first post!</p>
                </article>
                <article class="blog_post">
                    <h3>HTML Is Awesome</h3>
                    <p>It's the language and structure for the web!</p>
                </article>
                <article class="blog_post">
                    <h3>CSS Is Totally Rad</h3>
                    <p>CSS Selectors are super powerful</p>
                </article>
            </section>
        </main>
        <footer>
            <p>All rights reserved.</p>
        </footer>
    </body>
</html>


### CSS Selectors

- The name of the element itself is a selector. For example, **`soup.select("p")`** selects every paragraph tag and **`soup.select("footer")`** selects the footer element (and everything inside it).

- The id selector is denoted with a **`#`** before the id. For example, **`soup.select("#posts")`** returns a list containing the HTML element that has the **`id="posts"`** attribute.

- The class selector is denoted with a **`.`** symbol before the class name. For example, **`soup.select(".blog_post")`** returns all of the elements that have that class name.
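
The three selector types can be tried on a shortened copy of the sample blog page from the HTML Basics section. This is a minimal sketch; the last line also uses a descendant selector (`.blog_post h3`), which is not covered in the list above and is included only as a small extra.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample blog page shown earlier
html = """
<body>
    <main>
        <section id="posts">
            <article class="blog_post"><h3>Hello World</h3><p>This is the first post!</p></article>
            <article class="blog_post"><h3>HTML Is Awesome</h3><p>It's the language and structure for the web!</p></article>
        </section>
    </main>
    <footer><p>All rights reserved.</p></footer>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')

paragraphs = soup.select('p')          # element selector: every <p>
posts = soup.select('#posts')          # id selector: a list holding the one matching element
articles = soup.select('.blog_post')   # class selector: every element with that class
titles = [h3.get_text() for h3 in soup.select('.blog_post h3')]  # <h3> tags inside .blog_post

print(len(paragraphs))  # 3
print(posts[0].name)    # section
print(len(articles))    # 2
print(titles)           # ['Hello World', 'HTML Is Awesome']
```

Because `select` always returns a list, even an id selector needs `[0]` (or `soup.select_one`) to get at the single element.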

Further reading on CSS Selectors: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors