# So you want to scrape.

## Observe, orient, decide, act

Before you start any scraping task, invest some time in figuring out what you actually have to work with. Time you spend on the front end can save you from having to start completely over when you're already well into the job.

So, let's look at what we have to work with. Because you still have that **0-start web server** tab running, you have a little web server going. Click this link to open it in your browser:

http://localhost/www.tdcj.texas.gov/death_row/dr_executed_offenders.html

This is a localized partial copy of a Texas Department of Criminal Justice site, showing "Executed offenders" -- people put to death by the state. They seem to be numbered in order, most recent at the top, and the page appears to show 558 executions in about 36 years.

There's no pagination. Each person has a link for more information, and a link to the final statement. 

![Executed offender page](support/mainpage.png "Main executed offender page")

On the main page, you see things like first name, last name, race, gender, age and TDCJ number. TDCJ stands for Texas Department of Criminal Justice, and the TDCJ number is an inmate number -- a unique identifier.

The newest executee is listed as Braziel, Jr., Alvin. Hover your cursor over the "Offender Information" link for Braziel and your browser will show you, in the bottom-left corner, that it points to http://localhost/www.tdcj.texas.gov/death_row/dr_info/brazielalvin.html . This is a good sign; it's not _javascript:something_. You can work with this. Let's hit that link.

What do you want to scrape? Probably all of it.

![Stephen Colbert wants it all](support/giveittome.gif)

How hard will it be to grab that big bunch of narrative stuff? In most browsers, you can right-click (maybe around the "Name" entry in the biography) and left-click on Inspect. You should see something like this:

![Output of HTML inspector](support/tableinspect.png)

At first blush, this all looks good. **(Hint: It won't be.)** On the right side we see the main part of this demographic information is all in an HTML table: *&lt;table class="table\_deathrow indent"&gt;*. That's eminently scrapable.

Flip back to that initial web page, and let's get Python started.

In [25]:
import requests   # external dependency
from pyquery import PyQuery as pq   # external dependency

import csv
from collections import OrderedDict

# Requests is a great way to get web pages into Python.
# PyQuery is a great way to actually rip apart those pages.
# And the built-in CSV module lets you save your data so you can do stuff with it later.
# And OrderedDict is a great storage mechanism. Ignore it for now.

In [26]:
# What are we scraping? Well, we start at the main page.
hosturl = "http://localhost/www.tdcj.texas.gov/death_row/dr_executed_offenders.html"
r = requests.get(hosturl)
html = r.content

### Planning your scrape

So it looks like all the stuff we want is inside this *table* tag, with a class of *tdcj\_table*.

Our HTML has been loaded in as a variable called *html*.

The most basic PyQuery syntax:

* The first set of parentheses holds the hunk of HTML you're trying to parse.

* The second set of parentheses holds a selector for the piece you're trying to extract, such as a tag. You can add on specifics, like *p#somename* or *div.someclass*.

So let's isolate the table first. How?

In [27]:
# You get the table here ... ?



# Now, we have the table. Your actual data is inside *tr* tags, table rows.
# But the first table row, row 0, contains header stuff, not actual data. Let's pull just the first real row of data,
# which is row 1. How?

# Slice it off.
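If slicing is rusty, here is a minimal sketch using a plain Python list as a stand-in for the table rows. The strings are made up; the indexing works the same way on a PyQuery result.

```python
# Stand-in for the rows of a table: a header row, then data rows.
rows = ["header", "row 558", "row 557", "row 556"]

first_data_row = rows[1]          # index 0 is the header, so real data starts at 1
all_data_rows = rows[1:]          # everything after the header
first_two_data_rows = rows[1:3]   # the stop index itself is not included

print(first_data_row)
print(all_data_rows)
print(first_two_data_rows)
```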

In [28]:
# This I'll show you. Let's look:

row = pq(table)("tr")[1]
pq(row).html()

'\n    <td style="text-align: center">558</td>\n    <td style="text-align: center"><a href="dr_info/brazielalvin.html" title="Offender Information for Joseph Garcia">Offender Information</a></td>\n    <td style="text-align: center"><a href="dr_info/brazielalvinlast.html" title="Last Statement of Joseph Garcia">Last Statement</a></td>\n    <td style="text-align: center">Braziel, Jr.</td>\n    <td style="text-align: center">Alvin</td>\n    <td style="text-align: center">999393</td>\n    <td style="text-align: center">43</td>\n    <td style="text-align: center">12/11/2018</td>\n    <td style="text-align: center">Black</td>\n    <td style="text-align: center">Dallas</td>\n  '

In [29]:
# How do we get at that first nugget of information, the execution number?
# What about the rest? Let's pull it together a little:

for row in pq(table)("tr")[1:6]:      # Look only at the first five rows of actual usable data
    ExecutionNo = pq(row)("td")[0].text
    print(ExecutionNo)

# Now scrape the rest of the stuff. Easy, right?
# What data structure are you putting this into?

558
557
556
555
554
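One possible answer to the data structure question, sketched with nothing but the standard library: an OrderedDict per row, collected in a list, then handed to the CSV module. The column names are my own labels (an assumption, not anything the page provides), and the cell values are typed in by hand from the first data row shown earlier rather than scraped.

```python
import csv
import io
from collections import OrderedDict

# Hypothetical column labels, in the same order the cells appear in a row.
fieldnames = ["ExecutionNo", "InfoLink", "StatementLink", "LastName",
              "FirstName", "TDCJNumber", "Age", "Date", "Race", "County"]

# Values copied by hand from the first data row of the table.
cells = ["558", "dr_info/brazielalvin.html", "dr_info/brazielalvinlast.html",
         "Braziel, Jr.", "Alvin", "999393", "43", "12/11/2018", "Black", "Dallas"]

# One OrderedDict per row keeps the columns in a predictable order.
line = OrderedDict(zip(fieldnames, cells))
masterlist = [line]          # in a real scrape, append one of these per row

# Write to an in-memory buffer here; swap in open("somefile.csv", "w", newline="")
# when you want an actual file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(masterlist)
csv_text = buffer.getvalue()

print(csv_text)
```

Note the last name has a comma in it. The CSV module quotes that field for you, which is one good reason not to glue the line together by hand.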


In [31]:
# Let's take a look at those URLs now:
for row in pq(table)("tr")[1:6]:      # Look only at the first five rows of actual usable data
    pass