# Web Scraping Fundamentals


This notebook covers web scraping using Beautiful Soup.

Data is abundant on the web. It may be published as a downloadable file (for example, a csv, xlsx, or txt file), _or_ it can be __scraped from the web__ by extracting it from a page's HTML.

## Web scraping preparation
1. Visit the website you want to scrape.
2. Look at the HTML of the page to understand its structure.
3. Copy the link to the webpage.

In [2]:
# Import libraries for web scraping
from requests import get
from bs4 import BeautifulSoup
import sys
import os

## Making the initial request: GET

1. Copy and paste the webpage URL into the Jupyter notebook as a string. Assign the string to a variable named `url`.
2. Specify the headers as `{'User-Agent': 'Codeup Data Science'}`.
- If a `User-Agent` header is _not_ specified, the server may refuse the request and respond with:<br>
>__403 Error: Forbidden client error status response code indicates that the server understood the request but refuses to authorize it__. Or as Digital Ocean puts it "4XX - Client Error (you messed up)."
3. Use the `get()` function. Pass `url` as the first argument and `headers` as the `headers` keyword argument. Assign the return value to a variable named `response` to store the result, the web page.

In [3]:
# 1 : The web page we want to scrape
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'

# 2 : Identify ourselves to the server with a User-Agent header.
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent

# 3 : Store the web page returned from the url request into a variable named response.
response = get(url, headers=headers)

### Best Practices: Check the Response object
1. Check the status code of the Response object.
- If the code is 200, the request was successful.
- For anything else, look up the code and try again.
2. Check the Response object content using `.text`.
- The web page contents will be displayed on screen, formatted as HTML.
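
One way to "look up the code" without leaving the notebook is Python's built-in `http.HTTPStatus` enum. The sketch below is only an illustration; the helper name `describe_status` is hypothetical and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '403 Forbidden' for a status code."""
    # HTTPStatus(code) raises ValueError for codes it does not know about.
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f'{code} Unknown status code'
    return f'{status.value} {status.phrase}'


print(describe_status(200))
print(describe_status(403))
```

With a real request you would pass `response.status_code` to the helper. If you would rather have a failed request stop the notebook, `response.raise_for_status()` raises an exception for 4XX and 5XX responses.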

In [4]:
# Check the status code to see if our request was fulfilled. 200 == Success.
response.status_code

200

In [5]:
# The webpage we acquired is in HTML format.
print(response.text[:400])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	
	<!-- This site is optimized with the Yoast SEO plugin v15.2.1 - https://yoast.com/wordpress/plugins/seo/ -->
	<title>Codeup’s Data Science Career Accelerator is Here! - Codeup</title>
	<meta name="description" content="The rumors are true! The time has arriv


## Transforming the Response: Beautiful Soup

Beautiful Soup parses the raw HTML text into a tree of objects that we can search and navigate.
1. Transform the Response object into a Beautiful Soup object using `BeautifulSoup()`.
- Pass the name of the parser as the second argument, in this case `html.parser`.

Beautiful Soup documentation:<br>
```python
soup?
'''
:param markup: A string or a file-like object representing
 markup to be parsed.

:param features: Desirable features of the parser to be
 used. This may be the name of a specific parser ("lxml",
 "lxml-xml", "html.parser", or "html5lib") or it may be the
 type of markup to be used ("html", "html5", "xml"). It's
 recommended that you name a specific parser, so that
 Beautiful Soup gives you the same results across platforms
 and virtual environments.'''
```

In [6]:
# Transform the Response object into a Beautiful Soup object.

# BeautifulSoup(markup=response.content, features='html.parser')
soup = BeautifulSoup(response.content, 'html.parser')
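
To experiment with `BeautifulSoup()` without making a request, you can parse a small HTML string. The markup and the name `sample_soup` below are made up for illustration and are not taken from the Codeup page.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (an assumption for illustration only).
html = '<html><head><title>Sample Page</title></head><body><p>Hello</p></body></html>'

# Name the parser explicitly so results are consistent across environments.
sample_soup = BeautifulSoup(html, 'html.parser')

print(type(sample_soup))
print(sample_soup.title.string)
```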

## Accessing the contents of soup

In [19]:
# soup.children returns an iterator over the direct children of the document.
# Wrap it in list() to index into it; the first child here is the doctype.
list(soup.children)[0]

'html'

In [16]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.Tag]

In [23]:
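# `sibling` is not a Beautiful Soup navigation attribute, so `soup.sibling`
# is treated as a search for a <sibling> tag and returns None.
list(soup.sibling)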

TypeError: 'NoneType' object is not iterable
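
The attributes Beautiful Soup provides for moving sideways in the tree are `.next_sibling` and `.previous_sibling`, along with the iterators `.next_siblings` and `.previous_siblings`. A minimal sketch on a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# Made-up markup; no whitespace between tags, so the siblings are the <li> tags themselves.
html = '<ul><li>first</li><li>second</li><li>third</li></ul>'
sample_soup = BeautifulSoup(html, 'html.parser')

first_item = sample_soup.find('li')

# .next_sibling is the node that follows at the same level of the tree.
print(first_item.next_sibling)

# .next_siblings is an iterator over every following sibling.
print([item.string for item in first_item.next_siblings])

# The first <li> has nothing before it, so .previous_sibling is None.
print(first_item.previous_sibling)
```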

### Finding Titles
Using Beautiful Soup's `.find()` method you can access the first matching HTML element contained in a Beautiful Soup object.

In [6]:
soup.find('title')

<title>Codeup’s Data Science Career Accelerator is Here! - Codeup</title>

In [7]:
soup.title.string

'Codeup’s Data Science Career Accelerator is Here! - Codeup'

In [8]:
soup.find('title').text

'Codeup’s Data Science Career Accelerator is Here! - Codeup'

In [9]:
soup.find('title').get_text()

'Codeup’s Data Science Career Accelerator is Here! - Codeup'

In [10]:
soup.find('title').string

'Codeup’s Data Science Career Accelerator is Here! - Codeup'

### Finding Headings

The same `.find()` method works for headings: pass the tag name, and optionally a class, to get the first matching element.

In [11]:
soup.find('h1')

<h1 class="jupiterx-post-title" itemprop="headline">Codeup’s Data Science Career Accelerator is Here!</h1>

In [12]:
soup.h1.string

'Codeup’s Data Science Career Accelerator is Here!'

In [13]:
soup.find('h1', class_='jupiterx-post-title').text

'Codeup’s Data Science Career Accelerator is Here!'

In [14]:
soup.find('h1').get_text()

'Codeup’s Data Science Career Accelerator is Here!'

In [15]:
soup.find('h1').string

'Codeup’s Data Science Career Accelerator is Here!'

In [16]:
soup.h1.string == soup.find('h1').text == soup.find('h1').string

True
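
The accessors agree here because the `<h1>` contains a single string. They can differ when a tag has nested children; the sketch below uses a made-up heading to show this.

```python
from bs4 import BeautifulSoup

# Made-up heading with a nested <em> tag.
html = '<h1>Data <em>Science</em> Fundamentals</h1>'
heading = BeautifulSoup(html, 'html.parser').h1

# .string is None because the tag has more than one child.
print(heading.string)

# .text and .get_text() join the strings of all descendants.
print(heading.get_text())
```

For that reason `.get_text()` (or `.text`) is usually the safer choice when a tag may contain other tags.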

### Article
`soup.find()` is especially useful when a tag appears many times. An HTML page usually contains many \<div> elements, so pass `class_` to select the one with the class you want.

In [17]:
article = soup.find('div', class_='jupiterx-post-content')

# Access the text stored in article.
print(article.text)

The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in Glassdoor’s #1 Best Job in America.
Data Science is a method of providing actionable intelligence from data. The data revolution has hit San Antonio, resulting in an explosion in Data Scientist positions across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen UTSA invest $70 M for a Cybersecurity Center and School of Data Science. We built a program to specifically meet the growing demands of this industry.
Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Students will work with real

```python
# Write the text stored in `article` to a file named `article.txt`.
with open('article.txt', 'w') as f:
    f.write(article.text)
```

In [37]:
def get_article_text():
    '''
    This function retrieves data scraped from:
    
    https://codeup.com/codeups-data-science-career-accelerator-is-here/
    
    Returns the contents of tag: <div class="jupiterx-post-content"> as a string.
    '''
    # Path of the local cache file, used for both reading and writing.
    filename = 'article.txt'

    # Check to see if the cache file exists. If it does, open it
    # and return the contents as a string.
    if os.path.exists(filename):
        with open(filename) as f:
            return f.read()

    # Otherwise fetch the data.
    url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    article = soup.find('div', class_='jupiterx-post-content')

    # Cache the data locally.
    with open(filename, 'w') as f:
        f.write(article.text)

    # Return the contents of article as a string.
    return article.text
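
The same cache-then-fetch pattern can be written so that it works for any page. The following is a sketch rather than the notebook's method: `load_or_fetch` and its `fetch` parameter are hypothetical names, and the fetch step is passed in as a function so the caching logic can be tried without a network connection.

```python
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return cached text if cache_path exists; otherwise call fetch() and cache it."""
    cache_file = Path(cache_path)

    # Reuse the cached copy when one is available.
    if cache_file.exists():
        return cache_file.read_text(encoding='utf-8')

    # Otherwise get fresh text and save it for next time.
    text = fetch()
    cache_file.write_text(text, encoding='utf-8')
    return text
```

In this notebook, `fetch` could be a function that requests the page and returns `article.text`.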

## Appendix

### Checking the Response object headers

In [20]:
# Inspect the headers the server sent back with the response.
print(response.headers)

{'Server': 'nginx', 'Date': 'Fri, 13 Nov 2020 23:29:38 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Keep-Alive': 'timeout=20', 'Vary': 'Accept-Encoding, Accept-Encoding, Accept-Encoding,Cookie', 'Link': '<https://codeup.com/wp-json/>; rel="https://api.w.org/", <https://codeup.com/wp-json/wp/v2/posts/619>; rel="alternate"; type="application/json", <https://codeup.com/?p=619>; rel=shortlink', 'X-Powered-By': 'WP Engine', 'X-Cacheable': 'SHORT', 'Cache-Control': 'max-age=600, must-revalidate', 'X-Cache': 'HIT: 3', 'X-Cache-Group': 'normal', 'Content-Encoding': 'gzip'}


In [21]:
# response.content holds the raw body of the web page.
# Unlike response.text (a str), it is a bytes object.
type(response.content)

bytes
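
The relationship between the raw bytes in `response.content` and the decoded string in `response.text` can be illustrated with plain Python. The byte string below is a made-up stand-in for a real response body, and UTF-8 is assumed as the encoding.

```python
# Made-up stand-in for response.content: UTF-8 encoded bytes.
raw_content = '<title>Codeup’s blog</title>'.encode('utf-8')
print(type(raw_content))

# Decoding the bytes gives a str, similar to what response.text provides.
decoded_text = raw_content.decode('utf-8')
print(type(decoded_text))

# The curly apostrophe takes three bytes in UTF-8, so the lengths differ.
print(len(raw_content), len(decoded_text))
```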

### Beautiful Soup Methods

In [22]:
# Find every <h2> element on the page.
print(soup.find_all('h2'))

[<h2 class="jupiterx-post-related-label">Recommended Posts</h2>]
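
`find_all()` returns a list of every match, whereas `find()` returns only the first. A small sketch on made-up markup, which also shows the `class_` filter; the class names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up markup with several <h2> tags (class names are illustrative).
html = '''
<h2 class="section-title">Intro</h2>
<h2 class="section-title">Methods</h2>
<h2 class="related-label">Recommended Posts</h2>
'''
sample_soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag only.
print(sample_soup.find('h2').string)

# find_all() returns every matching tag in document order.
print([h2.string for h2 in sample_soup.find_all('h2')])

# class_ narrows the search to tags with that CSS class.
print([h2.string for h2 in sample_soup.find_all('h2', class_='section-title')])
```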


```python
# Look up the documentation for the extract() method of a div tag.
soup.div.extract?
```

Note that `extract()` is destructive:
```python
'''
Signature: soup.div.extract()
Docstring:
Destructively rips this element out of the tree.

:return: `self`, no longer part of the tree.'''
```
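
What "destructively" means in practice can be seen on a made-up snippet: the extracted tag is returned, and the tree no longer contains it.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = '<body><div class="ad">Buy now</div><p>Article text</p></body>'
sample_soup = BeautifulSoup(html, 'html.parser')

# extract() removes the tag from the tree and returns it.
removed = sample_soup.div.extract()
print(removed)

# The <div> is gone from the soup; only the <p> remains inside <body>.
print(sample_soup.find('div'))
print(sample_soup.body)
```

Because the change is made in place, it may be worth re-creating the soup from the original response if you need the removed element again.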