# An Introduction to Web Scraping


Derived from:
- [https://gist.github.com/jonathanmorgan/1373860ed08fe8ecc319](https://gist.github.com/jonathanmorgan/1373860ed08fe8ecc319).
- [https://github.com/dssg/hitchhikers-guide/blob/master/curriculum/basic-web-scraping/Scraping.ipynb](https://github.com/dssg/hitchhikers-guide/blob/master/curriculum/basic-web-scraping/Scraping.ipynb)
- Ryan Mitchell's [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do)

# Table of Contents
- [Introduction](#Introduction)

    - [Learning Objectives](#Learning-Objectives)
    - [Topics](#Topics)

- [Setup - Load Python packages](#Setup---Load-Python-packages)
- [HTML](#HTML)

    - [HTML and web page structure](#HTML-and-web-page-structure)
    - [Useful HTML tags](#Useful-HTML-tags)
    - [Jupyter Notebooks and %%HTML](#Jupyter-Notebooks-and-%%HTML)
    - [Useful HTML attributes](#Useful-HTML-attributes)

- [Getting, parsing and interacting with HTML](#Getting,-parsing-and-interacting-with-HTML)

    - [Using Python and `requests` to get HTML for a page](#Using-Python-and-requests-to-get-a-webpage's-HTML)
    - [HTTP Methods with Requests](#HTTP-Methods-with-Requests)
    - [Finding Data to Scrape](#Finding-Data-to-Scrape)
    - [Response Object](#Response-Object)
    - [Tag Soup](#Tag-Soup)
    - [Exploring and Inspecting a Webpage](#Exploring-and-Inspecting-a-Webpage)
    - [Building Our First Parser](#Building-Our-First-Parser)
        
- [Appendix](#Appendix)

    - [Note on HTML Parsers](#Note-on-HTML-Parsers)
    - [Exploring HTML in a web page](#Exploring-HTML-in-a-web-page)
    - [Submitting a form](#Submitting-a-form)
    - [HTTP Response Example](#HTTP-Response-Example)
    - [More Examples](#More-Examples)

# Introduction

- Back to [Table of Contents](#Table-of-Contents)

APIs are great, but lots of useful information on the Internet is not neatly packaged in an API.  Useful information and information sources are embedded in HTML all over the Internet, and you can use Python to discover and collect data directly from web sites.  HTML, the language of the world wide web, is much less precise and well-defined than a good API, however, and so the programs you write to do this have the potential to be significantly more complicated than API calls.

Below, we introduce you to making network requests using HTTP, then we show how to use HTTP to scrape data from web pages using Python's requests and beautifulsoup modules.

_Note: Before we begin, be advised that before you scrape a web site, you should read its terms of service and make sure that scraping is permitted.  Some sites will not want you to scrape their content, and when a site makes that clear in its terms of service, you should respect their wishes.  In addition, even if a site is OK with you scraping their pages, it is polite to spread your requests at least a few seconds apart, so that you don't put too much load on their servers._
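
The advice about spacing out requests can be sketched as a small helper. This is only an illustration: `fetch_politely`, its parameters, and the stand-in `fetch` function are hypothetical names that do not appear elsewhere in this notebook, and the URLs are placeholders.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every request except the first one
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Example with a stand-in fetch function, so nothing is sent over the network.
# With the requests library, fetch could be requests.get instead.
pages = fetch_politely(
    ["https://example.gov/a", "https://example.gov/b"],
    fetch=lambda url: "fetched " + url,
    delay_seconds=0.1,
)
print(pages)
```

Passing the fetch function in as an argument keeps the pacing logic separate from the HTTP library, which also makes it easy to try out without touching a real server.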

## Learning Objectives

- Back to [Table of Contents](#Table-of-Contents)

**Learning objectives:**

- **Become familiar with the basics of web scraping.**  Understand the structure of websites that require scraping (HTML, CSS, and javascript) and the tools Python provides (requests, beautifulsoup, selenium).
- **Learn the tools used to interact with network-based resources.** Explore the tools for talking directly with servers over an HTTP connection, and then understand how to choose a tool.
- **Become familiar with methods for different types of web scraping.** Understand some general classes of scraping one often has to do, and how to go about each.

## Topics

- Back to [Table of Contents](#Table-of-Contents)

Outline of topics covered in this notebook:

- Brief overview of technical details of web technologies one uses when scraping (HTML, CSS, Javascript).
- Intro to Python scraping tools (`requests`, BeautifulSoup, Selenium, webkit etc.)
- Common scraping tasks and how you accomplish them.

# Setup - Load Python packages

- Back to [Table of Contents](#Table-of-Contents)

In [None]:
# interacting with websites and web-APIs
import re # regular expressions module
import requests # easy way to interact with web sites and services
import json # read/write JavaScript Object Notation (JSON)
from bs4 import BeautifulSoup

import pandas as pd # to write out our final dataset

# from selenium import webdriver
# browser = webdriver.Firefox()

In [None]:
print("Package versions")
print("requests: {}".format(requests.__version__))
print("json: {}".format(json.__version__))

# HTML

- Back to [Table of Contents](#Table-of-Contents)

HTML is one of the three languages in which the World Wide Web is written, and contains most of the information about the structure and organization of a webpage. This means that when we scrape websites, we are most concerned with examining the HTML, rather than the CSS (language for style and aesthetics) or the Javascript (language for interaction). HTML is made up of elements or tags (like `<table>`, `<p>`, and `<h1>`), some of which have attributes (`<a href="http://nytimes.com">Link to the New York Times</a>`).

When you scrape web pages for data, you'll be requesting and processing HTML documents.  An HTML document generally has a standard structure, and a set of common tags that have specific, well-known uses.  HTML has also been around a long time and used by a lot of people, however, and so it tends to only be uniform to a point.  Where a good API is simple and consistent, HTML on the web can be all over the place (sometimes called 'tag soup'), from hand-coded documents from the early Internet that aren't even internally consistent to modern sites that don't have HTML at all, just javascript that builds the web page as it runs in the browser.

There is substantial variety, but most HTML shares a common structure and uses a small set of tags to hold key information.

### HTML and web page structure

- Back to [Table of Contents](#Table-of-Contents)

_Includes information from: Mozilla's Introduction to HTML: [https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Introduction](https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Introduction)._

An HTML document has a standard structure that has remained largely consistent since the early days of the Web:

    <!DOCTYPE html>
    <html lang="en">
        <head>
            <title>Test Website!</title>
            <meta name="author" content="Alex Engler"/>
            <meta name="description" content="This is a simple example of an HTML website"/>
        </head>
        <body>
            <h1>Main heading in my document</h1>
            <!-- Note that it is "h" + "1", not "h" + the letters "one" --> 
            <p>Look Ma, I am coding <abbr title="Hyper Text Markup Language">HTML</abbr>.</p>
        </body>
    </html>
    
If you were to copy the above code into a text file with the suffix `.html`, you could open it in a web browser and you would see the image below.
   
![Rendered HTML Examples](03-images/test-website.png)

    
You might be able to see some connections between the image above and the HTML code. Here's an explanation of the tags in the code and what they do.

- The document always starts with a DOCTYPE element that tells the browser what version of HTML the document uses.
- The outer element is always `<html>`, and can contain a "lang" attribute that specifies the language of the page's text. Note that the page starts with an opening `<html>` tag and ends with a closing `</html>` tag. Many HTML tags follow this opening and closing pattern, though not all.
- There are always two elements inside `<html>`:
    - `<head>`, where meta-information about the document is stored, such as:
        - the `<title>` of a page, which you can see displayed in the image on the tab at the top of the browser.
        - `<meta>` tags which include information about the webpage, in this case the author and a description.
        - references to the Javascript (JS) and Cascading Style Sheet (CSS) files used by the page, though we have none in our simple example.
    
    - `<body>`, where the (generally visible) contents of the document are stored. The `<body>` element contains the actual HTML turned by a browser into the web page you see.  It can contain as little or as much HTML as is needed to output a web page's data (from the above simple example to thousands of lines of code).
        - the `<h1>` opening tag and `</h1>` closing tag convert the text within it into the bolded header.
        - the `<p>` opening tag and the `</p>` closing tag convert the text within them into the paragraph text.
        - Comments start with `<!-- `, end with ` -->`, can span multiple lines, and can be located anywhere in a document.
        - Indentation can be helpful for understanding HTML, but it is not required, nor is it required to be consistent as in Python.
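
To make the nesting described above concrete, here is a minimal sketch that walks a shortened copy of the example document with Python's built-in `html.parser` module and prints each opening tag at its nesting depth. The `OutlineParser` class is a hypothetical helper written for this illustration, and its simple depth counter assumes every tag is either closed or written in self-closing form (as `<meta ... />` is here).

```python
from html.parser import HTMLParser

EXAMPLE_HTML = """<!DOCTYPE html>
<html lang="en">
    <head>
        <title>Test Website!</title>
        <meta name="author" content="Alex Engler"/>
    </head>
    <body>
        <h1>Main heading in my document</h1>
        <p>Look Ma, I am coding <abbr title="Hyper Text Markup Language">HTML</abbr>.</p>
    </body>
</html>"""


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = OutlineParser()
parser.feed(EXAMPLE_HTML)
parser.close()

# Print one tag per line, indented by how deeply it is nested
for depth, tag in parser.outline:
    print("    " * depth + "<" + tag + ">")
```

The printed outline shows `<head>` and `<body>` one level inside `<html>`, with `<abbr>` nested one level deeper than the `<p>` that contains it.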

### Useful HTML tags

- Back to [Table of Contents](#Table-of-Contents)

The following attributes and tags tend to be particularly useful when one is scraping or crawling web pages.

_Note: It is a best practice to always make HTML tag names all lower case, but you often will find older pages where the case is mixed or all caps._

- ***`<div>`*** and ***`<span>`*** - `<div>` and `<span>` tags are the most common containers in modern web pages.  They often have "name", "id" and "class" attributes that allow one to easily retrieve them from an HTML document.   Example:
    
        <div name="portant_stuff" id="portant_stuff" class="portant_stuff">
            <!-- Place cat picture here. -->
        </div>


- ***h***eading elements (***`<h1>`***, ***`<h2>`***, ***`<h3>`***, ...) - Heading elements are often used to hold header text at the top of sections of a web page, sometimes just inside `<div>` tags that wrap different sections of a document.  This makes them useful for targeting a parser to a specific region of an HTML document.
- ***`<p>`*** - ***p***aragraph tags ( `<p>` ) generally hold body text, but in the past have also been used like a `<div>` or `<span>`, wrapping logical sections of a page, not just discrete paragraphs.
- ***`<a>`*** - ***a***nchor tags are used to create links within a web page.  Any time you see a clickable link on a page, in the HTML source, that link is wrapped in an anchor tag.  Example:

        <a href="http://data.jrn.cas.msu.edu/sourcenet/admin">sourcenet admin</a>
    
- ***`<table>`, `<tr>`, and `<td>`*** - HTML tables are a common way to structure data.  A `<table>` is made up of rows (`<tr>`), and rows are made up of cells (`<td>`), one per column.  Example:

        <table>
            <tr>
                <td>Column 1!</td>
                <td>Column 2!</td>
            </tr>
            <tr>
                <td>Value 1!</td>
                <td>Value 2!</td>
            </tr>
        </table>

- ***`<ul>`, `<ol>`, and `<li>`*** - lists, both unordered (`<ul>` - bulleted) and ordered (`<ol>` - with leading numbers), are made up of list items (`<li>`).
- ***`<form>`***, ***`<input>`***, ***`<select>`***, and ***`<textarea>`*** elements - `<form>`, `<input>`, `<select>`, and `<textarea>` elements are used to create web forms that users can enter information into, then submit to a web page.  For sites that dole out piecemeal access to information via a search or filter page, you can often figure out how to talk with the server by creating and submitting form submissions.  You'll still need to parse the information you want from the HTML response, but it is often better than having to grab all the data by hand.
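
As a small preview of why these tags matter for scraping, the sketch below pulls the link text and `href` out of every anchor tag in a short snippet, using BeautifulSoup (introduced properly later in this notebook). The snippet is made up for illustration, apart from the sourcenet link taken from the example above.

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li><a href="http://data.jrn.cas.msu.edu/sourcenet/admin">sourcenet admin</a></li>
    <li><a href="/about">About</a></li>
    <li><a>Anchor without a link</a></li>
</ul>
"""  # the second and third items are made up for illustration

soup = BeautifulSoup(html, "html.parser")

# tag.get("href") returns None when the attribute is missing
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(links)
```

Using `.get("href")` rather than `a["href"]` avoids a `KeyError` on anchors that have no `href` attribute, which do turn up on real pages.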

### Jupyter Notebooks and %%HTML

Jupyter Notebooks can render HTML if you start the cell with the `%%HTML` cell magic. Thus you can test your understanding of simple HTML tags without leaving this environment. For instance, see if you can change the list below from an unordered list to an ordered list. Can you add another list item? Can you bold or italicize one item?

In [None]:
%%HTML

<h2>Testing HTML in Jupyter</h2>
<ul>
    <li>Item One</li>
    <li>Item Two</li>
</ul>

### Useful HTML attributes

- Back to [Table of Contents](#Table-of-Contents)

It is a best practice to make attribute names all lower case, and to always place quotation marks around attribute values, but it is not required by the HTML specification.  In older pages you might see mixed-case or all caps attribute names, and attribute values that are not enclosed in quotation marks.  Attributes that can be used to find particular elements within an HTML document:

- ***`id`***, ***`name`*** - `id` and `name` attributes can be applied to any element, and a given value is often (but not always) only applied to one tag within a page, making them good for finding specific tags within a document.
- ***`class`*** - The `class` attribute is often used to group similar elements together. A web developer could then refer to this class in order to select or make changes to all of the elements with that class attribute.

In [None]:
%%HTML
<style>
    p.important {color: #1976d2}
</style>

<p>Here is some text.</p>
<p class="important"> Here is text we want to emphasize.</p>
<p>More text, but it is not a big deal.</p>
<p class="important">This text is important again.</p>
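
The same attributes that a stylesheet uses to select elements can be used by a scraper. The sketch below, which previews BeautifulSoup (covered later in this notebook), finds paragraphs by `class` and a container by `id`. The wrapping `<div id="notes">` is an assumption added for this illustration and is not part of the cell above.

```python
from bs4 import BeautifulSoup

html = """
<div id="notes">
    <p>Here is some text.</p>
    <p class="important">Here is text we want to emphasize.</p>
    <p>More text, but it is not a big deal.</p>
    <p class="important">This text is important again.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, so BeautifulSoup uses class_ as the keyword
important = [p.get_text(strip=True) for p in soup.find_all("p", class_="important")]
print(important)

# An id is usually unique, so find() (first match only) is enough
notes = soup.find(id="notes")
print(len(notes.find_all("p")))
```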

## Getting, parsing and interacting with HTML

- Back to [Table of Contents](#Table-of-Contents)

In order to scrape or crawl a web site, you will need to do the following:

- Make an HTTP request to the server for the resource you want to work with.  If you are trying to interact with a form, this will include parameters to match the inputs of the form you are trying to interact with.
- Take the body of the HTTP response (the HTML for the page you want to scrape or crawl) and use a python library to parse the HTML into a form that is easy to search, filter, and interact with.
- Interact with the parsed HTML to retrieve the information you care about, then do with it what you will.

### Using Python and `requests` to get a webpage's HTML

- Back to [Table of Contents](#Table-of-Contents)

There are a number of different ways to retrieve HTML pages for scraping or crawling.  For this exercise, we'll be using the `requests` Python library.  `requests` is a good balance between ease of use and ability to deal with complicated HTML.  It can deal with almost any HTML you throw at it and supports sessions and cookies, but it doesn't support Javascript, so extremely dynamic web pages retrieved by it will not end up giving you the HTML you'd see in a browser.

For more complicated pages, you have a couple of options that are beyond the scope of this notebook:

- ***`PhantomJS (webkit)`*** - `Webkit` is the browser engine that Apple's OS X and iOS Safari browsers use.  `PhantomJS` is a headless implementation of a webkit browser that can be installed and used inside Python (More information: _PhantomJS home_ - [http://phantomjs.org/](http://phantomjs.org/)).

- ***`selenium`*** - `Selenium` lets you write Python code that controls an actual browser on your computer - Firefox, Chrome, IE, or Opera.  Selenium is powerful, but it can be complicated to set up and use.  For complex, javascript-based sites, however, it is often your only straightforward option (More information: _Selenium Home_ - [http://docs.seleniumhq.org/](http://docs.seleniumhq.org/)).


### HTTP Methods with Requests

- Back to [Table of Contents](#Table-of-Contents)

Python's `requests` exposes methods for each of the HTTP request methods:

Use the HTTP GET method to read a resource. For a web page the response body is HTML; for an API it is usually JSON or XML.
- GET = `requests.get('https://example.gov/to_get')`

Use the HTTP POST method to create a new resource:
- POST = `requests.post('https://example.gov/to_post', data={'key':'value'})`

Use the HTTP PUT method to update a resource:
- PUT = `requests.put('https://example.gov/to_put', data={'key':'value'})`

Use the HTTP DELETE method to delete a resource:
- DELETE = `requests.delete('https://example.gov/to_delete')`

At its most basic, `requests.get` lets you submit a request by passing a URL string, and it returns a response object that makes it easy to get at the status code, the header variables, and the body of the response. Let's explore a practical example.
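
To see what a GET and a POST would actually send, without contacting any server, you can build the request objects and stop before sending them. This is a sketch under assumptions: the `example.gov` URLs are placeholders and the `q` parameter is made up for illustration.

```python
import requests

# Build the requests without sending them, so this runs offline.
get_request = requests.Request(
    "GET", "https://example.gov/to_get", params={"q": "jobs"}
).prepare()
post_request = requests.Request(
    "POST", "https://example.gov/to_post", data={"key": "value"}
).prepare()

# GET parameters end up in the URL's query string
print(get_request.method, get_request.url)

# POST data ends up form-encoded in the request body
print(post_request.method, post_request.url, post_request.body)
```

The difference matters when scraping forms: values sent with GET are visible in the URL, while values sent with POST travel in the body.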

## Finding Data to Scrape

- Back to [Table of Contents](#Table-of-Contents)

Intuitively, you are going to be able to scrape any data that has some consistent structure within an HTML page. An easy first example of this is grabbing data within an HTML table (`<table>` elements). However, as you'll see, the concepts that work here extend to any data stored within a recognizably consistent pattern of HTML tags. To get started, we're going to grab the below table of data from the Chicago Workforce Board.

![Image of Workforce Board Site](03-images/site.png)

In [None]:
# We'll start by just copying the URL from our web browser and saving it as a variable:

## The real world version of this website can be found here:
# url = "http://www.workforceboard.org/job-seekers/"

## In our development environment, it is here:
url = "http://deepdish.adrf.info/contrib/chicagojobs.html"

response = requests.get(url)
print(type(response))

## If HTML doesn't load entirely:
# browser.get(url)
# soup = BeautifulSoup(browser.page_source)

## Response Object

- Back to [Table of Contents](#Table-of-Contents)

This returns a [response object](http://docs.python-requests.org/en/v1.0.0/api/#requests.Response), which has a number of attributes that tell us about the success of the HTTP request and the content of the server's response.

For instance, we can check the [status code](https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html) of the HTTP request. Codes in the 200's generally mean a successful request, whereas codes in the 400's and 500's typically indicate an error. We can also ask about exactly what was returned and how that [data is encoded](http://kunststube.net/encoding/).

In [None]:
# Check the status code (we use str() since this returns an int):
print("Status code " + str(response.status_code) )
## Returns a status of 200 - that's good.

# Header - Content Type
print("Content type " + response.headers['content-type'])
## We were expecting HTML, so that's good too
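
The status code ranges described above can be turned into a small lookup. The sketch below uses the standard library's `http.HTTPStatus` to get the official phrase for a code; `describe_status` is a hypothetical helper written for this illustration, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)  # raises ValueError for codes Python does not know
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational"
    return "{} {} ({})".format(code, status.phrase, category)


for code in (200, 404, 503):
    print(describe_status(code))
```

In a scraper you would typically pass `response.status_code` to a check like this. `requests` also provides `response.raise_for_status()`, which raises an exception for codes in the 400's and 500's.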

In [None]:
# We can also print out all the text from the response:
print(response.text)

## Tag Soup

- Back to [Table of Contents](#Table-of-Contents)

The messy code above is HTML as it is often found in the wild, which has inspired the term 'tag soup.' Since we only want a small part of this code (the part that contains the workforce center data), we can use Python's excellent HTML parsing package, [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/).

<blockquote>
You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.
</blockquote>

We can use the `BeautifulSoup` constructor to transform the response object's text into a BeautifulSoup object. Note that you may get a warning about parsers. You can ignore it for now, but there's more information [in the appendix](#Note-on-HTML-Parsers) at the end of this notebook.

In [None]:
soup = BeautifulSoup(response.text)
print(type(soup))

In [None]:
## Right off the bat, this gives us new methods, like prettify, that make our HTML a lot easier to work with.
print(soup.prettify())

In [None]:
# We can also ask our BeautifulSoup object for specific tags, like the title:
soup.title

In [None]:
# Notice that this included the tag itself, but we could get just the text with:
soup.title.text

In [None]:
## Can clean this up with the .replace method:
print(soup.title.text.replace("\r","").replace("\n","").replace("\t",""))

In [None]:
# Or alternatively, return just the tag name:
soup.title.name

In [None]:
# This works well for <title> since there should only be one for any webpage.
# For more common tags, we can use the "find" method to grab the first tag of a certain type (and its contents):
soup.find("a")

In [None]:
# Alternatively, you can find all the tags of a certain type with "find_all":
soup.find_all("ul")

In [None]:
# To get more specific, you can find HTML tags by both their type and attributes, such as their id:
soup.find("div", {"id": "sidebarcol"})

## Exploring and Inspecting a Webpage

- Back to [Table of Contents](#Table-of-Contents)

The methods available within the `BeautifulSoup` object let us intelligently search our scraped HTML. These tools will let us quickly grab all the data on this webpage once we know what we are looking for. Sometimes, the easiest way to find what you want to scrape is to start with a manual inspection. In this case, in our web browser, we can right click on the elements we are curious about and select `inspect`.

![Using Inspect](03-images/inspect.png)

After doing this, we will be able to see the HTML code associated with the visual part of the webpage.

![HTML Table Code](03-images/table.png)

From this, we can see that these rows of data are in fact in an HTML `<table>` tag. That's enough to get started.

In [None]:
table = soup.find("tbody")
print(table)

In [None]:
# Within that table, we can search further for each table row <tr>:
table.find_all("tr")

In [None]:
## Instead of printing that out, let's save it as a variable called rows:
rows = table.find_all("tr")
type(rows)

In [None]:
# And we can look within each row (below, just the first row) for table elements <td>:
rows[0].find_all("td")

In [None]:
# Alternatively, we can use the 'findChildren' method.
print(rows[0].findChildren('td'))
print(rows[1].findChildren('td'))
print(rows[2].findChildren('td'))
print(rows[3].findChildren('td'))
print(rows[4].findChildren('td'))
# Note that 'child' is a relative term, referring to a tag within a tag. 
# The container tag is called the 'parent' tag, likewise relative to the child tag.
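
Rather than printing one row at a time, the same row and cell lookups can be combined in a nested list comprehension. The sketch below uses a tiny made-up table so that it runs on its own; the rows are invented and are not the real workforce data.

```python
from bs4 import BeautifulSoup

html = """
<table>
    <tbody>
        <tr><td>1</td><td>Center A</td><td>100 Main St</td></tr>
        <tr><td>2</td><td>Center B</td><td>200 Oak Ave</td></tr>
    </tbody>
</table>
"""  # made-up rows, not the real workforce data

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("tbody").find_all("tr")

# One inner list per row, one string per cell
cells = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]
print(cells)

# parent walks back up: each <td> sits inside its <tr>
first_td = rows[0].find("td")
print(first_td.parent.name)
```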

## Building Our First Parser

- Back to [Table of Contents](#Table-of-Contents)

With just the Python we've learned so far, we can parse our scraped data. We'll need a list object to fill with results, a for loop to iterate over all the `<tr>` elements, BeautifulSoup's `find_all` method, and the list's `append` method. 

In [None]:
addresses = [] # create an empty list to store the addresses


for row in rows: # Loop over every row in the table

    address = row.find_all("td")[2].text # grab text within the third <td> tag (index 2)
    
    addresses.append(address) # Add this address to our list
    
# And now we have a list of all the addresses of the Chicago Workforce Centers
print(addresses)

## Finish the Scraper:

Use what you've learned so far to finish the scraper, making new lists of `center_names` and `phone_numbers`.

In [None]:
## Write your scraping code here.













## Then move onto saving the data to a pandas dataframe.

In [None]:
## Save our scraped data to a new pandas dataframe
# Recall we imported pandas as pd

# Create a python dictionary (list of key-value pairs)
d = {"center_name" : pd.Series(center_names),
    "address" : pd.Series(addresses),
    "phone_number" : pd.Series(phone_numbers)}

# Easy to convert to recognizable pandas dataframe (tabular data):
df = pd.DataFrame(d)

print(type(d))
print(type(df))
print(df.shape)
df[:10]

In [None]:
## Save our scraped data as a csv:
df.to_csv("chicago-workforce-centers.csv", encoding="UTF-8")

In [None]:
## The correct answer to the web scraping exercise is below:

# Use requests to grab the HTML page for Chicago Workforce Centers
url = "http://deepdish.adrf.info/contrib/chicagojobs.html"
response = requests.get(url)

# Create BeautifulSoup object and pull out the table rows:
soup = BeautifulSoup(response.text)
table = soup.find("table")
rows = table.find_all("tr")

# Create lists to hold our scraped data
centers = []
addresses = []
phone_numbers = []

rows = rows[1:] #Skip the header row
for row in rows:
    
    name_td = row.find_all("td")[1]
    if name_td.find("a"):
        center_name = name_td.find("a").text
    else:        
        center_name = name_td.text

    centers.append(center_name)
    addresses.append(row.find_all("td")[2].text)
    phone_numbers.append(row.find_all("td")[3].text)
    
## Create pandas dataframe:
centers_df = pd.DataFrame({"center_name" : pd.Series(centers),
    "address" : pd.Series(addresses),
    "phone_number" : pd.Series(phone_numbers)})

## A little cleanup to remove extraneous tags:
centers_df["center_name"] = centers_df["center_name"].str.replace("<td>", "")
centers_df["center_name"] = centers_df["center_name"].str.replace("<br/>", "")

centers_df[:]

# Appendix

## Note on HTML Parsers

- Back to [Table of Contents](#Table-of-Contents)

If HTML is particularly malformed or invalid, different HTML parsers will render that HTML differently, some better than others.  For example, when the built-in Python HTML parser (`html.parser`) finds bad HTML, it sometimes simply stops parsing, lumping the remainder of the document into a big string blob that is all but useless for scraping.  On the same document, however, `html5lib` will not get confused at all.

For cases where a particular parser doesn't work for a given document, BeautifulSoup supports explicitly choosing from multiple parsers, selected by passing an optional second argument to the BeautifulSoup constructor:

- ***html.parser*** - Python’s html.parser - fast, doesn't require additional packages, but not able to deal with truly gnarly HTML. Usage:
   
        # built-in Python parser for HTML
        BeautifulSoup( response_html, "html.parser" )

- ***lxml*** - fast, lenient parser that can process both HTML and XML.  Must be installed (`conda install lxml` or `pip install lxml`). Best for XML documents and clean HTML.  Usage:
   
        # lxml for HTML
        BeautifulSoup( response_html, "lxml" )

        # lxml for XML
        BeautifulSoup( response_html, [ "lxml", "xml" ] )
        BeautifulSoup( response_html, "xml" )

- ***html5lib*** - extremely lenient, parses the way a web browser does, and makes valid HTML 5.  But, slow compared to lxml and Python's built-in HTML parser, and must be installed (`conda install html5lib` or `pip install html5lib`).  Usage:
   
        # html5lib for HTML
        BeautifulSoup( response_html, "html5lib" )
            
- More information: [http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser)
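
Since `lxml` and `html5lib` may or may not be installed on a given machine, one option is to try parsers in order of preference and fall back to the built-in one. This is a minimal sketch, not a recommendation from the BeautifulSoup documentation: `make_soup` and its default ordering are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, parsers=("lxml", "html5lib", "html.parser")):
    """Parse html with the first parser in `parsers` that is installed."""
    for parser in parsers:
        try:
            return BeautifulSoup(html, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately malformed: neither <p> nor <b> is closed
soup, used = make_soup("<p>Unclosed paragraph<b>bold text")
print(used)
print(soup.find("b").get_text())
```

Returning the parser name alongside the soup makes it easier to notice when results differ between machines because a different parser was picked up.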

### Exploring HTML in a web page

- Back to [Table of Contents](#Table-of-Contents)

Once you've targeted a web page and pulled down its HTML source, you'll need to examine the HTML to try to figure out how to most easily get at the information you seek.  You'll need to learn how to view, filter, and search inside the HTML result of any in-browser rendering so you can figure out what you need to target.  Some useful tools in Firefox:

- ***View selection source*** command - this command, available when you select text in a web page, then right click on the selection, will pull up just the HTML for the selected area of the web page.  This doesn't give you context, but it can be extremely useful if you are trying to get an `id`, `class`, or `name` for a particular element.

- ***Firefox's built-in developer tools*** - [https://developer.mozilla.org/en-US/docs/Tools](https://developer.mozilla.org/en-US/docs/Tools) - Firefox's built in developer tools are impressive, and they've recently pulled in all the functionality that the Firebug plugin provides, as well, allowing for inspection and searching within a document.  A full accounting is beyond the scope of this notebook, but these are worth learning more about if you scrape a lot.

- ***Web Developer*** toolbar - [https://addons.mozilla.org/en-US/firefox/addon/web-developer/](https://addons.mozilla.org/en-US/firefox/addon/web-developer/) - The web developer toolbar provides many tools that can be helpful in analyzing a page you want to scrape or crawl through.  In particular, a few essential features:

    - In the "View Source" menu, "View Generated Source" will present for you the source of the page once any in-browser rendering is done.  This is the source that you are looking at, not necessarily the HTML that was initially sent over to your browser.  You can use this source to plan how you'll interact with the final result of rendering the page.
    - In the "Forms" menu, the command "Display Form Details" will show you details on each form on the page, including all the `<input>` elements for each form.  This goes a long way toward helping you figure out how a form you want to try to interact with works.
    - The "Outline" menu contains numerous tools that allow you to see more information about a part of the current web page when you hover your mouse over it.

- For more information, see: [http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)

## Submitting a form

- Back to [Table of Contents](#Table-of-Contents)

Steps for submitting a web form:

- Load the page in a browser and look it over, then identify the form you want to submit programmatically.
- Get source for page.
- In source, find the form.  Options:

    - use "web developer" toolbar in Firefox, go to "Forms"-->"View Form Information".
    - use the "web developer" toolbar --> "View Source" --> "View Generated Source", then do a text search for text associated with the form.
    - Enable Firebug and all firebug panels, load the page you are interested in, then open up Firebug window, go to the HTML tab,  then search for terms you are looking for.

- Once you find the form:

    - look for the `<form>` element so you can figure out if the form is expecting a GET or a POST request, and the URL where the form should be submitted.
    
        Example:
        
            <form action="action_page.php" method="GET">
            
        Where:
        
        - action = URL of page where form should be submitted.
        - method = type of request to make (usually will be either "get" or "post").

    - then, look for all the `<input>`s, `<textarea>`s, and `<select>`s in the form, so you can get their names and figure out what information you have to pass to the form in each parameter to get results back.
    
        Example:
        
            First name:<br>
            <input type="text" name="firstname">
            <br>
            Last name:<br>
            <input type="text" name="lastname">

- Once you know names and values you need to pass, then build the code to submit requests to the FORM.
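
The steps above can be sketched in code using the two example fragments. This is an illustration under assumptions: the page URL and the submitted values are made up, the form HTML is assembled from the fragments shown above, and the request is built but not sent, so the sketch runs offline.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = "https://example.gov/search/"  # hypothetical page that holds the form
page_html = """
<form action="action_page.php" method="GET">
    First name:<br>
    <input type="text" name="firstname">
    <br>
    Last name:<br>
    <input type="text" name="lastname">
</form>
"""

form = BeautifulSoup(page_html, "html.parser").find("form")

# The action may be relative, so resolve it against the page URL
submit_url = urljoin(page_url, form.get("action", ""))
method = form.get("method", "get").upper()

# Collect the names of every field the form expects
field_names = [
    field["name"]
    for field in form.find_all(["input", "textarea", "select"])
    if field.has_attr("name")
]

values = {"firstname": "Ada", "lastname": "Lovelace"}  # made-up values
payload = {name: values.get(name, "") for name in field_names}

# GET forms send fields in the query string; POST forms send them in the body
if method == "GET":
    request = requests.Request(method, submit_url, params=payload)
else:
    request = requests.Request(method, submit_url, data=payload)
prepared = request.prepare()  # built but not sent
print(prepared.method, prepared.url)
```

To actually submit the form, the prepared request could be sent with `requests.Session().send(prepared)`, after checking that the site permits automated submissions.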

## HTTP Response Example

- Back to [Table of Contents](#Table-of-Contents)

**Example:**

    HTTP/1.1 200 OK
    Date: Mon, 23 May 2005 22:38:34 GMT
    Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
    Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
    ETag: "3f80f-1b6-3e1cb03b"
    Content-Type: text/html; charset=UTF-8
    Content-Length: 131
    Accept-Ranges: bytes
    Connection: close

    <html>
    <head>
      <title>An Example Page</title>
    </head>
    <body>
      Hello World, this is a very simple HTML document.
    </body>
    </html>

An HTTP response like the one above contains:

- a text **status line** that includes the following, in this order, with each separated by a single space:

    - the specific version of HTTP you are using.
    - the status code for the request.  Common status codes:
    
        - 200 - OK
        - 404 - Not Found
        - 500 - Internal Server Error
        - 503 - Service Unavailable

    - a status message for the request.

- a **header block** that contains one or more header variables, name-value pairs with the name separated from the value by a colon and a space.

    - example: `Content-Type: text/html`
    
- a blank line
- the body of the response, which could contain just about anything, depending on what you've requested.  For a request from a web browser to a web server, for example, the response will contain the HTML for the page, which the browser will render.  For an API request, the response body could contain data in any number of formats (JSON, XML, etc.).
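
The layout described above is simple enough to take apart with plain string operations. The sketch below splits a shortened version of the example response into its parts; `split_response` is a hypothetical helper for illustration only, and in practice `requests` does this work for you. Lines in a real HTTP response end with a carriage return and a newline (`\r\n`), which the sample text spells out.

```python
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Connection: close\r\n"
    "\r\n"
    "<html><body>Hello World</body></html>"
)


def split_response(raw):
    """Split raw HTTP response text into status parts, headers, and body."""
    # The first blank line separates the status line and headers from the body
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    # Status line: version, code, message, separated by single spaces
    version, code, message = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        # Each header is a name and a value separated by a colon and a space
        name, _, value = line.partition(": ")
        headers[name] = value
    return version, int(code), message, headers, body


version, code, message, headers, body = split_response(RAW_RESPONSE)
print(version, code, message)
print(headers["Content-Type"])
print(body)
```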

## More Examples

- Back to [Table of Contents](#Table-of-Contents)

Examples:

- [http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/)
- [http://blog.miguelgrinberg.com/post/easy-web-scraping-with-python](http://blog.miguelgrinberg.com/post/easy-web-scraping-with-python)