## Scraping with Python

Python is a handy programming language, with some excellent tools for handling scraping projects and also something called BeautifulSoup, which will make you feel like a moron every time you have to type stuff like soup.beautify. We'll try a different tool here and show you how to get started.

We're going to assume you're using some flavor of Python 3 here. If you've downloaded this from https://github.com/PalmBeachPost/nicar19scraping , you'll want to make sure you have dependencies -- the modules Python needs to do everything here -- installed.

If you're in NICAR ***************************************************************************************

If you're using the git repo and looking at this:
_pip install -r requirements.txt_ should be reasonably safe.

Then:
_jupyter notebook_

Your web browser should open up now, and you'll see a list of files. The complete tutorial is available locally as *************. The starter shell is ************. And soon you'll see what page to open.

## Observe, orient, decide, act

Before you start any scraping task, you'll want to invest a chunk of time into figuring out what you actually have to work with. 

A simple story: I thought I was going to have to write a scraper, but then the search page I was looking at had an export button that made a nice file that Excel opened up. It looked great, all the same text that was on my screen. I figured out how to dynamically generate that export button link so I could run it on a regular basis and schedule it.

Problem was, it really had all the __text__ on my screen -- like the date of a case, the name of the person, the outcome of the investigation. What the export didn't have was critical and not immediately obvious: a link from the case number to the actual paperwork supporting the case. The exported file was missing something that was absolutely critical.

So, let's look at what we have to work with. Because of that earlier command -- _python -m http.server_ -- you should have a little web server running already. Let's go to a particular file. Click this link to open it in your browser.

http://localhost/www.tdcj.texas.gov/death_row/dr_executed_offenders.html



This is a localized partial copy of a Texas Department of Criminal Justice site, showing "Executed offenders" -- people put to death by the state. They seem to be numbered in order, with the newest executions first, and 558 total. If you scroll down, you'll see these go back to very late 1982, meaning we're looking at about 36 years to get 558 executions.

Another thing to notice: There's no pagination. There's no "page 1 of 26" here. That makes life much easier.

When you actually scrape your data, you can start running your own analysis. As you look, though, maybe scribble some notes on things you want to think about analyzing from this.

![Executed offender page](support/mainpage.png "Main executed offender page")

On the main page, you see things like last name, first name, age, race, county and TDCJ number. TDCJ is the Texas Department of Criminal Justice, and the TDCJ number is an inmate number -- a unique identifier.

The newest executee is listed as Braziel, Jr., Alvin. Leave your cursor over the "Offender Information" link for Braziel and your browser, in the bottom-left corner, will show you it's http://localhost/www.tdcj.texas.gov/death_row/dr_info/brazielalvin.html . This is a good sign; it's not _javascript:something_. You can work with this. Let's hit that link.

Here there's a bit more of a biography, more demographic details, and a description of an awful crime. Important to note: On the previous page he was "Braziel, Jr." and "Alvin". Here, we see the name is no longer broken up but is formatted differently with more information: "Braziel, Alvin Avon Jr." all at once.

### What are we trying to scrape here?
###### Where are we going, and why are we in this handbasket?

Well, you may not always know what you're going to want initially. With almost anything involving death penalty cases, you probably want the demographic information. And you probably want the history of the case. And you probably want the final statement. And Texas may keep the last meal here. So let's consult with Stephen Colbert about where to start:

![Stephen Colbert wants it all](support/giveittome.gif)

How hard will it be to grab that big bunch of narrative stuff? In most browsers, you can right-click (maybe around the "Name" entry in the biography) and left-click on Inspect. You should see something like this.

![Output of HTML inspector](support/tableinspect.png)

On the right side we see the main part of this demographic information is all in an HTML table: *&lt;table class="table\_deathrow indent"&gt;*

This is good. We can work with this. Move your cursor through the inspection area and you'll see different rows highlight. Every row of the table -- *&lt;tr&gt;* -- is a row in what you're seeing. The two sides of that table are separated by *&lt;td&gt;* tags, or table data tags, the HTML marker for a cell. This is very good.

As long as all your scraping projects can go this well, you'll be fortunate. Will you?

![Image of hand showing sticker of Stop and Pray. Source: https://www.pexels.com/photo/photography-of-a-persons-hand-with-stop-signage-823301/](support/stopandpray.jpg)


Yeah, no.

Let's get reoriented: The stuff off the main page is an index to the more detailed subpages. So we'll need to scrape that main page first, get some of that information, and then start scraping the subpages and get more. And along the way, we'll probably need to look at other pages including the final statement and download the photos.

So where do we start? We start at the beginning.

## Basic scraper setup

So when you're scraping web pages, you need something to download web pages. Here, we'll use the great _requests_ library:

_import requests_

We'll need something that can parse the web pages, or break them into understandable chunks that you can maneuver through. We'll use the splendid _PyQuery_ library. This one's a little odd to set up; it breaks Python tradition by having mixed case for the main module, and that's also annoying to type. You can run a raw import statement on it, but each time you'd have to type _pyquery.PyQuery(somethingsomething)_ and at that point you might as well just try to cuddle a honey badger. Let's get in the habit of a different import statement that will mean we need a lot less typing. In fact, let's just use _pq_ to mean _pyquery.PyQuery_. If you use PyQuery, just copy-paste this line every time until you have it memorized. After that, it's so much easier.

_from pyquery import PyQuery as pq_

We'll want to do **something** with our scraped data. Chances are, even if you keep processing it directly in Python, you'll probably want to save some snapshots to disk. And chances are, the CSV format is the one you'll want. So, let's add one more module.

_import csv_

You'll almost certainly need more modules as you go on -- maybe something to change the formatting of the dates, say. But you can add what you need later.
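As one example of a module you might add later: dates on this site look like 12/11/2018, and the standard library's _datetime_ can turn those into sortable ISO dates. This is only a sketch; it assumes every date is month/day/year, and the function name *to_iso* is made up for illustration.

```python
from datetime import datetime


def to_iso(us_date):
    """Turn a month/day/year string such as '9/27/2018' into '2018-09-27'."""
    return datetime.strptime(us_date, "%m/%d/%Y").date().isoformat()


print(to_iso("12/11/2018"))
print(to_iso("9/27/2018"))
```

ISO dates sort correctly as plain text, which is handy once the data lands in a spreadsheet.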

You can have your own style. I tend to put external dependencies in their own section at the very top of the file, and built-in module statements following a blank line, like this:


In [1]:
import requests   # external dependency
from pyquery import PyQuery as pq   # external dependency

import csv

We can now start scraping web pages. Where do we start? Well, we know what URL we're starting with. And we know requests is used to get stuff ...

In [2]:
hosturl = "http://localhost/www.tdcj.texas.gov/death_row/dr_executed_offenders.html"
r = requests.get(hosturl)

In [3]:
# Let's see what we've downloaded, just part of it
r.content[:2000]
# Yep, we have a web page!

b'<!doctype html>\n<html lang="en-US"><!-- InstanceBegin template="/Templates/generic_inside.dwt" codeOutsideHTMLIsLocked="false" -->\n<head>\n<meta charset="utf-8">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<!-- stylesheet: global -->\n<link rel="stylesheet" href="../stylesheets/global.css">\n<!-- stylesheet: page-specific -->\n<link rel="stylesheet" href="../stylesheets/content.css">\n<link rel="stylesheet" href="../stylesheets/menu_style.css">\n<!-- InstanceBeginEditable name="stylesheets" -->\n\n<!-- InstanceEndEditable -->\n<!-- jQuery library (if CDN fails, use local copy) -->\n<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>\n<script type="text/javascript"> window.jQuery || document.write(\'<script src="../javascripts/jquery.min.js"><\\/script>\') </script>\n<!-- javascripts -->\n<script type="text/javascript" src="../javascripts/google_analytics.js"></script>\n<script type="text/javascri

In [4]:
# Now, let's get the actual stuff, and we'll put it in a variable called html.
html = r.content

### Planning the first scrape
Let's take a look at the main page code in our browser, either through the right-click-Left-click-on-Inspect or View Source(1). The code we really want to find is in here a bit:

*(1) Right-click-left-click-on-inspect will bite you at some point. View Source is much safer because JavaScript, but less convenient.*

What can we tell from this? There's a couple headers, whatever. Fine. But there's a table with a class of *tdcj\_table* where the good stuff is. The first row of the table is its headers, with everything inside *th* cells. Beneath that is our first row of real data, contained in *td* cells. Sometimes we just want the text, like *Alvin*. Sometimes we're looking for the URLs that aren't actually text, like the link.

So what's a sensible data structure here? We know we're going to have to roll up at least one subpage to get more details. So we need a data structure that lets us access the same stuff easily, and here's where Python is handy with something called a dictionary. It provides access by name to some data, often a single value but sometimes a whole thing. For example, you might see a dictionary pointing to just a single text string, or a list of things:

*presidents[1] = "George Washington"*

*food['fruits'] = ['apple', 'orange', 'lemon', 'lime', 'tomato']*

When I'm scraping, I'm almost always putting each row of data into its own dictionary, with a particular type called OrderedDict, which keeps the order. So in this case, we're looking for, like:

*mydictionary = {'first name': "Alvin", 'last name': "Braziel, Jr."}*

It would make sense to put each person's dictionary of data into one big dictionary that holds them. But to do that, we need a key, a unique value to refer to one person.

As we've seen, the names of Texas' executed inmates don't even match from the index page to the subpage, and it's entirely possible Texas, given enough time, would execute several people named Robert Smith. So don't go by name. However, that inmate identifier, the "TDCJ Number," is assigned to a single person, and presumably even Texas would never execute a person more than once.

Presumably.
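To see why the key matters, here's a small sketch with two invented people who share a name. The inmate numbers 111 and 222 are made up, and so are the variable names *by_name* and *by_number*.

```python
from collections import OrderedDict

# Two made-up people who share a name but not an inmate number
people = [
    {"InmateNo": "111", "LastName": "Smith", "FirstName": "Robert"},
    {"InmateNo": "222", "LastName": "Smith", "FirstName": "Robert"},
]

by_name = OrderedDict()
by_number = OrderedDict()
for person in people:
    # Keyed by name, the second Robert Smith silently overwrites the first
    by_name[person["LastName"] + ", " + person["FirstName"]] = person
    # Keyed by inmate number, each person keeps a slot
    by_number[person["InmateNo"]] = person

print(len(by_name))
print(len(by_number))
```

Python doesn't warn you when a key gets reused; the old value just disappears. That's the whole argument for picking a key you trust to be unique.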

In [5]:
from collections import OrderedDict

masterdict = OrderedDict()

Remember before? ... *&lt;table class="tdcj_table indent" ...* That's the good stuff. Let's try getting it.

In PyQuery, the first thing you put in is the first, big hunk of HTML. After that, you're looking for a tag, like "table" or "td" or "a". You can use a dot to show a class, like *p.narrative*, to find the code wrapped in something like *&lt;p class="narrative"&gt;*. You can use a # to show an id, like *p#beginning*, which would look for *&lt;p id="beginning"&gt;*.


In [6]:
table = pq(html)("table.tdcj_table")   # Pick out the table with the good stuff

In [7]:
# Let's take a look at the top:
table.html()[:2000]
# Yep, we have our table.

'\n    <caption>Executed Offenders</caption>\n  <tr>\n    <th style="text-align: center" scope="col">Execution</th>\n    <th style="text-align: center; width: 16%" scope="col">Link</th>\n    <th style="text-align: center; width: 13%" scope="col">Link</th>\n    <th style="text-align: center" scope="col">Last Name</th>\n    <th style="text-align: center" scope="col">First Name</th>\n    <th style="text-align: center; width: 7%" scope="col">TDCJ<br/>Number</th>\n    <th style="text-align: center" scope="col">Age</th>\n    <th style="text-align: center" scope="col">Date</th>\n    <th style="text-align: center" scope="col">Race</th>\n    <th style="text-align: center" scope="col">County</th>\n  </tr>\n  <tr>\n    <td style="text-align: center">558</td>\n    <td style="text-align: center"><a href="dr_info/brazielalvin.html" title="Offender Information for Joseph Garcia">Offender Information</a></td>\n    <td style="text-align: center"><a href="dr_info/brazielalvinlast.html" title="Last State

OK, now let's flip back to that web page again and just ... look. In Python, you start counting from 0. So table row 0 is going to be the headers; our good data begins at row 1. Our first column of real data, the execution number, will be in table data cell 0, and we want the text ... You know what? Let's just sketch this out.
* cell 0, text, to ExecutionNo
* cell 1, URL, to BioURL
* cell 2, URL, to Statement
* cell 3, text, to LastName
* cell 4, text, to FirstName
* cell 5, text, to InmateNumber
* cell 6, text, to Age
* cell 7, text, to ExecutionDate
* cell 8, text, to Race
* cell 9, text, to County

We should maybe fiddle with the order. Maybe something like this.

* cell 5, text, to InmateNumber
* cell 0, text, to ExecutionNo
* cell 3, text, to LastName
* cell 4, text, to FirstName
* cell 6, text, to Age
* cell 7, text, to ExecutionDate
* cell 8, text, to Race
* cell 9, text, to County
* cell 1, URL, to BioURL
* cell 2, URL, to Statement

The URLs are going to be very important to us for the scrape, but they'll be near-worthless to us in the spreadsheet we want to build.
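That plan for the text columns can also be written down as data instead of as one line of code per field. This is a rough sketch, not the approach the notebook takes: *TEXT_COLUMNS* and *cells_to_line* are invented names, and a plain list of cell text stands in for a real table row.

```python
from collections import OrderedDict

# (field name, cell index) pairs, in the order we want them in the spreadsheet
TEXT_COLUMNS = [
    ("InmateNo", 5),
    ("ExecutionNo", 0),
    ("LastName", 3),
    ("FirstName", 4),
    ("Age", 6),
    ("ExecutionDate", 7),
    ("Race", 8),
    ("County", 9),
]


def cells_to_line(cells):
    """Build one ordered line from a list of cell texts."""
    line = OrderedDict()
    for fieldname, index in TEXT_COLUMNS:
        line[fieldname] = cells[index]
    return line


# Stand-in for one table row: the text of each cell, with None for the two link cells
sample_cells = ["554", None, None, "Clark", "Troy", "999351", "51", "9/26/2018", "White", "Smith"]
print(cells_to_line(sample_cells))
```

The two URL columns are left out on purpose; they need the link's address rather than the cell's text, so they get handled separately.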

In [8]:
# Let's try getting the execution number. Our HTML table is stored in the cunningly named "table" variable.
# And we want to skip the first row, right? Let's just iterate beginning at the second row, which in Python is 1.
# We can cheat and try like 5 rows at a time.

for row in pq(table)("tr")[1:6]:  # Pick up first five rows of real data, skipping header row
    print(pq(row)("td")[0].text)


558
557
556
555
554


In [9]:
# OK, we have our execution numbers. You know what? I really don't want to type that "pq" stuff eight times over.
# So we can  build a stupid lil function. But ...
# If I were a dreamer, but then again, no.
# Let's save the routine for later and just copy paste. And we'll want to start storing stuff in masterdict.
# Key it to the inmate number. Keep it in an OrderedDict. OK, fine.
# My standard has been that a **row** comes in, and a **line** goes out. No real rhyme or reason, but ...
# let's keep with it.

for row in pq(table)("tr")[1:6]:   # Pick off first five lines of real data
    line = OrderedDict()
    line['InmateNo'] = pq(row)("td")[5].text
    line['ExecutionNo'] = pq(row)("td")[0].text
    line['LastName'] = pq(row)("td")[3].text
    line['FirstName'] = pq(row)("td")[4].text
    line['Age'] = pq(row)("td")[6].text
    line['ExecutionDate'] = pq(row)("td")[7].text
    line['Race'] = pq(row)("td")[8].text
    line['County'] = pq(row)("td")[9].text
    print(line)


OrderedDict([('InmateNo', '999393'), ('ExecutionNo', '558'), ('LastName', 'Braziel, Jr.'), ('FirstName', 'Alvin'), ('Age', '43'), ('ExecutionDate', '12/11/2018'), ('Race', 'Black'), ('County', 'Dallas')])
OrderedDict([('InmateNo', '999441'), ('ExecutionNo', '557'), ('LastName', 'Garcia'), ('FirstName', 'Joseph'), ('Age', '47'), ('ExecutionDate', '12/04/2018'), ('Race', 'Hispanic'), ('County', 'Dallas')])
OrderedDict([('InmateNo', '999062'), ('ExecutionNo', '556'), ('LastName', 'Ramos'), ('FirstName', 'Robert'), ('Age', '64'), ('ExecutionDate', '11/14/2018'), ('Race', 'Hispanic'), ('County', 'Hidalgo')])
OrderedDict([('InmateNo', '999381'), ('ExecutionNo', '555'), ('LastName', 'Acker'), ('FirstName', 'Daniel'), ('Age', '46'), ('ExecutionDate', '9/27/2018'), ('Race', 'White'), ('County', 'Hopkins')])
OrderedDict([('InmateNo', '999351'), ('ExecutionNo', '554'), ('LastName', 'Clark'), ('FirstName', 'Troy'), ('Age', '51'), ('ExecutionDate', '9/26/2018'), ('Race', 'White'), ('County', 'Smith

In [10]:
# OK, we're much of the way there. We still have a leash on to show only the first five rows of data,
# that Python slicing of [1:6].
# We're missing the critical URLs, though. Let's look at the last row we pulled:

In [11]:
print(pq(row))

<tr>
    <td style="text-align: center">554</td>
    <td style="text-align: center"><a href="dr_info/clarktroy.html" title="Offender Information for Troy Clark">Offender Information</a></td>
    <td style="text-align: center"><a href="dr_info/clarktroylast.html" title="Last Statement of Troy Clark">Last Statement</a></td>
    <td style="text-align: center">Clark</td>
    <td style="text-align: center">Troy</td>
    <td style="text-align: center">999351</td>
    <td style="text-align: center">51</td>
    <td style="text-align: center">9/26/2018</td>
    <td style="text-align: center">White</td>
    <td style="text-align: center">Smith</td>
  </tr>
  


In [12]:
# If we grab just the text of this thing, we're not going to get much:
print(pq(row)("td")[1].text)

None


In [13]:
# Wait, why none? Because there's no raw text in the cell. What text there is is wrapped inside the *a* tag.
# Let's feed it into another round of PyQuery:
print(pq(pq(row)("td")[1])("a").text)


<bound method PyQuery.text of [<a>]>


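#### Well, crap.

"Bound method" means Python handed you the method itself instead of its result. The *text* here is something like a function, so throw in some ()s to actually call it.

Is this a predictable behavior in PyQuery? Sort of. When you pick one item out with something like *[1]*, you get a raw element whose *.text* is a plain value. When you select with *pq(something)("a")*, you get a PyQuery object, and its *.text()* is a method. Either way, if you see "bound method," try extra parenthesesesies. Keep working your problem.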
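The "bound method" message is plain Python, not anything special to scraping. Here's a tiny sketch with a made-up class called *Cell* that shows the same thing:

```python
class Cell:
    """Made-up stand-in for an object whose text comes from a method."""

    def text(self):
        return "Offender Information"


cell = Cell()
print(cell.text)     # no parentheses: prints a "bound method" description
print(cell.text())   # with parentheses: the method runs and returns the string
```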

![Clown wondering how stuff works](support/magnets.gif)

In [14]:
print(pq(pq(row)("td")[1])("a").text())


Offender Information


But we don't actually want that text. What we want is the URL, which is stored as an **attribute** of the *a* tag. Right?

*&lt;a href="something"&gt;*


Other attributes include things like *class* and *id*. But here's how to grab that *href* attribute of the *a* tag:

In [15]:
print(pq(pq(row)("td")[1])("a").attr("href"))

dr_info/clarktroy.html
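The difference between a cell's own text, the text inside a child tag, and an attribute shows up outside PyQuery, too. This sketch swaps in the standard library's *xml.etree.ElementTree* so it runs with nothing extra installed; PyQuery sits on top of lxml, whose elements follow much the same conventions. The HTML is one trimmed-down cell from the index page.

```python
import xml.etree.ElementTree as ET

# One cell from the index page, trimmed down
cell = ET.fromstring('<td><a href="dr_info/clarktroy.html">Offender Information</a></td>')
link = cell.find("a")

print(cell.text)          # None: the cell has no text of its own
print(link.text)          # the words inside the a tag
print(link.get("href"))   # the href attribute
```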


**dr_info/clarktroy.html**? What the hell kind of URL is that? Well, it's a relative URL. Starting from the directory you're in now, it's going to look inside *dr_info* for a file named clarktroy.html. We started off here:
http://localhost/www.tdcj.texas.gov/death_row/dr_executed_offenders.html

So *dr\_executed\_offenders.html* is a page inside *death\_row* ... so let's start gluing this together.

http://localhost/www.tdcj.texas.gov/death_row/ + dr_info/clarktroy.html ... becomes:

http://localhost/www.tdcj.texas.gov/death_row/dr_info/clarktroy.html

Open it. We're in business!

Let's set a variable we can work with.

In [16]:
biopagebase = "http://localhost/www.tdcj.texas.gov/death_row/"
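Gluing strings together works fine here. As an alternative sketch, the standard library's *urllib.parse.urljoin* does the same resolving a browser does, given the page you were on and the relative link you found. The variable name *indexurl* is just for illustration.

```python
from urllib.parse import urljoin

indexurl = "http://localhost/www.tdcj.texas.gov/death_row/dr_executed_offenders.html"

# urljoin drops the file name from the first URL and resolves the relative link against what's left
print(urljoin(indexurl, "dr_info/clarktroy.html"))
```

The upside is that you don't have to work out the base folder by hand; the downside is one more import to remember.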

### How much did we just skip around?

Well, not too much. We were figuring out how to start getting at our subpages, but we're not going to scrape them just yet. Let's take a look. We have masterdict. We have almost the entirety of the main page scraper. Let's finish it.

Before, we were limiting Python to rows 1 through 5 with *[1:6]* (the slice stops just short of index 6). That skipped the header row and then got us five rows of actual usable data. Let's take that off.

We also now know we need to set that variable to make sense of our partial URLs.

We also know we need to get a proper URL for all the last statements.

So if this gets us the URL from the second column:

*print(pq(pq(row)("td")[1])("a").attr("href"))*

maybe this will get us the URL for the third column:

*print(pq(pq(row)("td")[2])("a").attr("href"))*



In [17]:
print(pq(pq(row)("td")[2])("a").attr("href"))

dr_info/clarktroylast.html


In [18]:
# Promising. Let's check out if that same URL prefix still works:
print(biopagebase + pq(pq(row)("td")[2])("a").attr("href"))

http://localhost/www.tdcj.texas.gov/death_row/dr_info/clarktroylast.html


#### So, now what?

![Guy waiting](support/bride.gif)

From a few sections up, we've ... figured out what we need to scrape the first section.

Except we also forgot to actually, you know, save this data somewhere. So let's save it all in *masterdict*, with the key being the inmate number. The value will be the entire row/line of data.

We also need to scrape the full URLs for the subpages, so ... Let's just piece it together.

Remember, before we were looking at rows 1 through 5, skipping the header row. Now, we want to look from row 1 to the end. Scrape it all! That syntax will be *[1:]*
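If slicing is new to you, here's a quick sketch on a plain list. The row names are placeholders, not real data.

```python
rows = ["header", "row 1", "row 2", "row 3", "row 4", "row 5", "row 6", "row 7"]

print(rows[1:6])   # skips the header, stops before index 6: five rows
print(rows[1:])    # skips the header, keeps everything else
```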

In [19]:
biopagebase = "http://localhost/www.tdcj.texas.gov/death_row/"
for row in pq(table)("tr")[1:]:   # Skip that header row, get everything else
    line = OrderedDict()
    line['InmateNo'] = pq(row)("td")[5].text
    line['ExecutionNo'] = pq(row)("td")[0].text
    line['LastName'] = pq(row)("td")[3].text
    line['FirstName'] = pq(row)("td")[4].text
    line['Age'] = pq(row)("td")[6].text
    line['ExecutionDate'] = pq(row)("td")[7].text
    line['Race'] = pq(row)("td")[8].text
    line['County'] = pq(row)("td")[9].text
    line['BioURL'] = biopagebase + pq(pq(row)("td")[1])("a").attr("href")
    line['StatementURL'] = biopagebase + pq(pq(row)("td")[2])("a").attr("href")
    masterdict[line['InmateNo']] = line

In [20]:
# Wait, did that just work?
print(f"Lines scraped: {len(masterdict)}")

Lines scraped: 558


In [21]:
# Did we just get it all?
print(line)

OrderedDict([('InmateNo', '592'), ('ExecutionNo', '1'), ('LastName', 'Brooks, Jr.'), ('FirstName', 'Charlie'), ('Age', '40'), ('ExecutionDate', '12/07/1982'), ('Race', 'Black'), ('County', 'Tarrant'), ('BioURL', 'http://localhost/www.tdcj.texas.gov/death_row/dr_info/brookscharlie.html'), ('StatementURL', 'http://localhost/www.tdcj.texas.gov/death_row/dr_info/brookscharlielast.html')])
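This is a decent moment to save a snapshot, since every line has the same fields so far. Here's a minimal sketch using *csv.DictWriter*. It writes to an in-memory buffer so you can see the result; the function name *write_lines*, the stand-in *sample* data and the file name *executions.csv* in the comment are all assumptions for illustration.

```python
import csv
import io
from collections import OrderedDict

# Stand-in for masterdict, with two trimmed entries from the earlier output
sample = OrderedDict()
sample["999393"] = OrderedDict([("InmateNo", "999393"), ("LastName", "Braziel, Jr."), ("FirstName", "Alvin")])
sample["999441"] = OrderedDict([("InmateNo", "999441"), ("LastName", "Garcia"), ("FirstName", "Joseph")])


def write_lines(lines, outfile):
    """Write a dictionary of row dictionaries as CSV, taking the header from the first row."""
    rows = list(lines.values())
    writer = csv.DictWriter(outfile, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()   # swap in open("executions.csv", "w", newline="") to write a real file
write_lines(sample, buffer)
print(buffer.getvalue())
```

Note the quotes the csv module adds around "Braziel, Jr." so that the comma in the name doesn't split the column. Taking the header from the first row only works while every row has the same fields.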


### Now for the subpages

Yeah, we're not done yet. We want to at least figure out how to scrape those subpages. So ... Let's do it. We can traverse through *masterdict* and pick out *BioURL* from each of the entries inside. Then we just open up those URLs and start scraping.

Except ... We don't need to scrape 558 of these things all at once while testing. Let's just keep using our last *line* entry and see what we've got.

Let's take a look at that last person's BioURL:
http://localhost/www.tdcj.texas.gov/death_row/dr_info/brookscharlie.html

At first blush, the biography seems to match the newest guy. Missing a photo. Some stuff is recorded kind of strangely. But ... Let's give it a try.

In [22]:
BioURL = line['BioURL']
r = requests.get(BioURL)
html = r.content    # We can recycle "html" as our main variable because we're all done scraping the initial page.
                    # If we weren't, this would be ... bad.
    


#### Here there be dragons

Above, we just recycled variables called *r* and *html*. We can do this because we've completed our scrape of the main page. But if we went row-by-row on the main page and then scraped the BioURL pages in each row, our code would break in horrible ways. So ... mind the gap.

Let's go take a look at what we've got. From that bio page, it looks like the main table is somewhere around here:

`...
    <h1>Death Row Information</h1>
    <h2>Offender Information</h2>
    <table class="table_deathrow indent">
      <tr>
        <td rowspan="7" style="vertical-align: top">Photo not available</td>
        <td style="vertical-align: top" class="table_deathrow_bold_align_right">Name</td>
        <td style="vertical-align: top" class="table_deathrow_align_left">Brooks, Charlie Jr.</td>
      </tr>
      <tr>
        <td style="vertical-align: top" class="table_deathrow_bold_align_right">TDCJ Number</td>
        <td style="vertical-align: top" class="table_deathrow_align_left">592</td>
      </tr>
      <tr>
        <td style="vertical-align: top" class="table_deathrow_bold_align_right">Date of Birth</td>
        <td style="vertical-align: top" class="table_deathrow_align_left">9/1/1942</td>
    ...
`

So, we can recycle our *table* variable again. We're looking for a tag of *table* with a class of *table_deathrow*.

And here the layout is kind of ... weird, right? We have no header row.

Our first row has a bunch of elements in it; there's one table data cell that should hold the photo, and it's set with *rowspan="7"* to stretch down across seven rows. Then there's the description of "Name" and the actual name, for a total of three cells. The next row shows ... two cells, because the photo cell is still hanging down from above.

We could sit here and write special handling for the first row, but then we're repeating. What we can say safely is really simple:

The description of the data we want is always in the second-to-last cell of its row.

The actual data we want is in the last cell. So let's just proceed that way, and see what we can come up with.
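Counting from the end is what negative indexes are for. A quick sketch, with plain lists standing in for the cell text of the first two rows:

```python
# Stand-ins for the cell text in the first two rows of the biography table
first_row = ["Photo not available", "Name", "Brooks, Charlie Jr."]
second_row = ["TDCJ Number", "592"]

for cells in (first_row, second_row):
    description = cells[-2]   # second-to-last cell
    data = cells[-1]          # last cell
    print(f"{description}: {data}")
```

The same two indexes work whether the row has three cells or two, so the first row needs no special handling.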



In [23]:
table = pq(html)("table.table_deathrow")

In [24]:
for row in pq(table)("tr"):
    description = pq(row)("td")[-2].text
    data = pq(row)("td")[-1].text
    print(f"{description}: {data}")

Name: Brooks, Charlie Jr.
TDCJ Number: 592
Date of Birth: 9/1/1942
Date Received: 4/25/1978
Age (when    Received): 35
Education Level (Highest Grade Completed): 12
Date of Offense: 12/14/1976
Age (at the time of Offense): 34
County: Tarrant
Race: Black
Gender: Male
Hair Color: Black
Height: 5' 9"
Weight: 150
Eye Color: mar (according to DPS records)
Native County: Tarrant
Native State: Texas


In [25]:
# now what? Well, ... Let's add it to our data. We can actually just use their field descriptions as our data identifier.
for row in pq(table)("tr"):
    description = pq(row)("td")[-2].text
    data = pq(row)("td")[-1].text
    line[description] = data
print(line)

OrderedDict([('InmateNo', '592'), ('ExecutionNo', '1'), ('LastName', 'Brooks, Jr.'), ('FirstName', 'Charlie'), ('Age', '40'), ('ExecutionDate', '12/07/1982'), ('Race', 'Black'), ('County', 'Tarrant'), ('BioURL', 'http://localhost/www.tdcj.texas.gov/death_row/dr_info/brookscharlie.html'), ('StatementURL', 'http://localhost/www.tdcj.texas.gov/death_row/dr_info/brookscharlielast.html'), ('Name', 'Brooks, Charlie Jr.'), ('TDCJ Number', '592'), ('Date of Birth', '9/1/1942'), ('Date Received', '4/25/1978'), ('Age (when    Received)', '35'), ('Education Level (Highest Grade Completed)', '12'), ('Date of Offense', '12/14/1976'), ('Age (at the time of Offense)', '34'), ('Gender', 'Male'), ('Hair Color', 'Black'), ('Height', '5\' 9"'), ('Weight', '150'), ('Eye Color', 'mar (according to DPS records)'), ('Native County', 'Tarrant'), ('Native State', 'Texas')])


### The road is clear. Move out!

We've now got something that should get us most of the biographical data from these subpages. Let's bring it all together, and scrape a few hundred pages. And we just start writing and ...

<code>
for InmateNo in masterdict:
    line = masterdict[InmateNo]  # Get that whole person-level dictionary
    r = requests.get(line['BioURL'])   # Get their biography page
    html = r.content
    table = pq(html)("table.table_deathrow")
    for row in pq(table)("tr"):
        description = pq(row)("td")[-2].text
        data = pq(row)("td")[-1].text
        line[description] = data
    # Now let's write the line back to masterdict to save our changes
    masterdict[InmateNo] = line
</code>    


### NO. No. No. Just, no.

OK, it was that easy to write the rest of that part of the scraper, but we're leaving a bunch of great data on the table.

Remember the first person we looked at?

http://localhost/www.tdcj.texas.gov/death_row/dr_info/brazielalvin.html

Photo. We didn't write anything to handle photo. And while that biobox looks the same -- maybe, there's a lot to keep track of -- there's a bunch of stuff below that whole biobox. So let's chill for a second and look at Alvin Braziel.

`<table class="table_deathrow indent">
  <tr>
    <td rowspan="7" style="vertical-align: top"><img src="brazielalvin2.jpg" alt="Picture of Offender" /></td>`
    
But we saw with the other person:

`        <td rowspan="7" style="vertical-align: top">Photo not available</td>
`

So ... not everyone will have an image. And we have that fragmentary URL again. Maybe we can take that old *biopagebase* URL fragment and tack it on and see if that works for the photo:

http://localhost/www.tdcj.texas.gov/death_row/brazielalvin2.jpg

Nope, that doesn't work. But remember when we got the local BioURL we were seeing addresses like *dr_info/clarktroylast.html*

That means when we're actually looking at the page, we're already in the *dr_info* folder. If we just have a reference to *brazielalvin2.jpg*, it's in that same folder.

http://localhost/www.tdcj.texas.gov/death_row/dr_info/brazielalvin2.jpg

And **that** works. So let's set this:

photobase = "http://localhost/www.tdcj.texas.gov/death_row/dr_info/"

Now we've figured out the URL scheme. How do we get at the photo URL? Well, it's going to be in the first *td* tag of *table.table_deathrow*.

Let's try:

In [26]:
r = requests.get("http://localhost/www.tdcj.texas.gov/death_row/dr_info/brazielalvin.html")
html = r.content
photobase = "http://localhost/www.tdcj.texas.gov/death_row/dr_info/"
photocell = pq(pq(html)("table.table_deathrow"))("td")[0]
print(pq(photocell))



<td rowspan="7" style="vertical-align: top"><img src="brazielalvin2.jpg" alt="Picture of Offender"/></td>
    


### Success! Ish. 

So we want to look for an img tag inside of all that mess.

And if we find the image tag ...

We want to extract the SRC attribute

And if we have that SRC attribute, we want to prepend that "photobase" URL before it.

Feel like you're going in circles? Try this: https://en.wikipedia.org/wiki/If_You_Give_a_Mouse_a_Cookie#Plot

But we can do this. Put it all together.

In [27]:
photocellURL = pq(pq(pq(html)("table.table_deathrow"))("td")[0])("img").attr("src")

In [28]:
if photocellURL:   # If we've found that img src tag ...
    photocellURL = photobase + photocellURL
else:
    photocellURL = ""
line['PhotoURL'] = photocellURL

### But wait! There's more!

That detail page has a bunch more stuff:

`...
    <p><span class="bold">Prior Occupation</span><br />
    Laborer </p>
    <p> <span class="bold">Prior Prison Record</span><br />
    #792374  on a 5 year sentence from Dallas   County for 1 count of  sexual assault of a child. (Current offense was committed prior to the offender  being incarcerated for the sexual assault conviction.) </p>
    <p> <span class="bold">Summary of Incident</span><br />
    On  9/21/1993 at 9:00 p.m. in Mesquite,  Braziel approached a newlywed couple walking on a jogging trail of a community  college. Braziel demanded money. When it was discovered that neither of the two  had any money in their possession, Braziel shot the 27 year old white male,  resulting in his death. Braziel then sexually assaulted the 23 year old white  female. Braziel linked to the crime in January 2001 when his DNA was found to  match the DNA taken from the female victim. </p>
    <p> <span class="bold">Co-Defendants</span><br />
    None </p>
    <p> <span class="bold">Race and Gender of Victim</span><br />
    White  male</p>
...
`

So ... Let's grab it. In this case, we're **really** lucky -- these are the only *p* tags in the entire page. So inside of the *p* tag, there's a *span* tag with a description as text; and the data we want is directly part of the text of the *p* tag.

In [29]:
for graf in pq(html)("p"):
    data = pq(graf)("p").text().strip()
    description = pq(graf)("span")[0].text.strip()
    print(description + ": " + data)
    # line[description] = data

Prior Occupation: Prior Occupation
Laborer
Prior Prison Record: Prior Prison Record
#792374 on a 5 year sentence from Dallas County for 1 count of sexual assault of a child. (Current offense was committed prior to the offender being incarcerated for the sexual assault conviction.)
Summary of Incident: Summary of Incident
On 9/21/1993 at 9:00 p.m. in Mesquite, Braziel approached a newlywed couple walking on a jogging trail of a community college. Braziel demanded money. When it was discovered that neither of the two had any money in their possession, Braziel shot the 27 year old white male, resulting in his death. Braziel then sexually assaulted the 23 year old white female. Braziel linked to the crime in January 2001 when his DNA was found to match the DNA taken from the female victim.
Co-Defendants: Co-Defendants
None
Race and Gender of Victim: Race and Gender of Victim
White male


### Well, shoot.

We're getting the description as part of our data. OK. So PyQuery does have a way around this; we can ignore stuff in **child elements** like the *span* tag by looking only at *outerHtml*.

In [30]:
for graf in pq(html)("p"):
    data = pq(pq(graf)("p")).text().strip()
    description = pq(graf)("span")[0].text.strip()
    print(description + ": " + data)

Prior Occupation: Prior Occupation
Laborer
Prior Prison Record: Prior Prison Record
#792374 on a 5 year sentence from Dallas County for 1 count of sexual assault of a child. (Current offense was committed prior to the offender being incarcerated for the sexual assault conviction.)
Summary of Incident: Summary of Incident
On 9/21/1993 at 9:00 p.m. in Mesquite, Braziel approached a newlywed couple walking on a jogging trail of a community college. Braziel demanded money. When it was discovered that neither of the two had any money in their possession, Braziel shot the 27 year old white male, resulting in his death. Braziel then sexually assaulted the 23 year old white female. Braziel linked to the crime in January 2001 when his DNA was found to match the DNA taken from the female victim.
Co-Defendants: Co-Defendants
None
Race and Gender of Victim: Race and Gender of Victim
White male


### Seriously?

We're no closer: our "data" section still contains our "description" from within the *span*. And this is getting annoying.

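So we can back this out slightly for a really stupid fix. The loop above left *graf* pointing at the last *p* element, and the lxml element underneath often has a text_content() method (it's there when lxml parsed the page as HTML).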



In [31]:
graf.text_content()

' Race and Gender of Victim\n  White  male'

And there we have it. The line break that follows the *br* tag in the page source comes through as *\n*, a new line. So let's stick with just the text, split on the *\n*, take the second half of this thing, strip off some white space, and this is ... just gonna look ugly.

In [32]:
for graf in pq(html)("p"):
    data = graf.text_content().split("\n")[1].strip()
    description = pq(graf)("span")[0].text.strip()
    print(description + ": " + data)

Prior Occupation: Laborer
Prior Prison Record: #792374  on a 5 year sentence from Dallas   County for 1 count of  sexual assault of a child. (Current offense was committed prior to the offender  being incarcerated for the sexual assault conviction.)
Summary of Incident: On  9/21/1993 at 9:00 p.m. in Mesquite,  Braziel approached a newlywed couple walking on a jogging trail of a community  college. Braziel demanded money. When it was discovered that neither of the two  had any money in their possession, Braziel shot the 27 year old white male,  resulting in his death. Braziel then sexually assaulted the 23 year old white  female. Braziel linked to the crime in January 2001 when his DNA was found to  match the DNA taken from the female victim.
Co-Defendants: None
Race and Gender of Victim: White  male
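
For what it's worth, there may be a slightly less ugly route. lxml elements follow the ElementTree model, where text that comes after a child element is stored on that child as *tail*. The sketch below is not the tutorial's method; it uses the standard library's xml.etree.ElementTree on a single paragraph copied from the page, assuming the snippet is well-formed enough to parse as XML. On a real page you'd reach the same *text* and *tail* attributes through the elements PyQuery hands back.

```python
import xml.etree.ElementTree as ET

# One paragraph copied from the detail page. It happens to be well-formed
# XML, which is the only reason the standard-library parser can read it.
snippet = '<p><span class="bold">Prior Occupation</span><br />\n    Laborer </p>'

graf = ET.fromstring(snippet)

# The label is the text inside the span.
description = graf.find("span").text.strip()

# Text that follows a child element is stored on that child as .tail,
# so the value we want hangs off the <br/> element.
raw = graf.find("br").tail or ""

# split() with no arguments drops newlines and runs of spaces.
data = " ".join(raw.split())

print(description + ": " + data)
```

Because the whole tail is kept, a value that wraps across several source lines would come through intact, and the doubled spaces get collapsed along the way.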


### Ugly wins over broken!

Now we can scrape everything from the main page and the bio subpage.

You're free to scrape the last statement page on your own. The techniques you've already used here will help you.

Without further ado, let's piece this (almost) together:

In [33]:
for InmateNo in masterdict:
    line = masterdict[InmateNo]  # Get that whole person-level dictionary
    r = requests.get(line['BioURL'])   # Get their biography page
    html = r.content
    table = pq(html)("table.table_deathrow")
    for row in pq(table)("tr"):
        description = pq(row)("td")[-2].text
        data = pq(row)("td")[-1].text
        line[description] = data
    # Next, look for a photo in the first cell of that table
    try:
        photocellURL = pq(pq(pq(html)("table.table_deathrow"))("td")[0])("img").attr("src")
        if photocellURL:   # If we've found that img src tag ...
            photocellURL = photobase + photocellURL
        else:
            photocellURL = ""
    except:
        photocellURL = ""
    line['PhotoURL'] = photocellURL
    for graf in pq(html)("p"):
        data = graf.text_content().split("\n")[1].strip()
        description = pq(graf)("span")[0].text.strip()
        line[description] = data
    masterdict[InmateNo] = line    

IndexError: list index out of range

In [34]:
html[:500]

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x00\x00\x00d\x00d\x00\x00\xff\xfe\x00\x1fLEAD Technologies Inc. V1.01\x00\xff\xdb\x00C\x00\x06\x04\x04\x05\x04\x04\x06\x05\x05\x05\x07\x06\x06\x07\t\x10\n\t\x08\x08\t\x13\x0e\x0e\x0b\x10\x17\x14\x18\x18\x16\x14\x16\x16\x19\x1c$\x1f\x19\x1b"\x1b\x16\x16 + "&\')))\x18\x1e-0,(0$()\'\xff\xc4\x00\xd2\x00\x00\x01\x05\x01\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x10\x00\x02\x01\x03\x03\x02\x04\x03\x05\x05\x04\x04\x00\x00\x01}\x01\x02\x03\x00\x04\x11\x05\x12!1A\x06\x13Qa\x07"q\x142\x81\x91\xa1\x08#B\xb1\xc1\x15R\xd1\xf0$3br\x82\t\n\x16\x17\x18\x19\x1a%&\'()*456789:CDEFGHIJSTUVWXYZcdefghijstuvwxyz\x83\x84\x85\x86\x87\x88\x89\x8a\x92\x93\x94\x95\x96\x97\x98\x99\x9a\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xff\xc0\x00\x0b\x08\x0

In [35]:
# This does not look like a useful HTML file, huh? Let's find out what's happening:
print(line['BioURL'])

http://localhost/www.tdcj.texas.gov/death_row/dr_info/_ramos.jpg


![You gotta be shitting me! image](support/ygbsm.gif)

#### Yes, some of the information pages are ... pictures of an information page.

Now what? Let's circle back. We're not parsing someone's typewritten form right now. So ... let's not. If there's no "html" in the filename, let's not try to parse the guts.
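
A rough sketch of two ways to spot these non-pages, using only the standard library. The function names are made up for illustration. The first checks the file extension in the URL; the second checks whether the downloaded bytes start with the JPEG signature, which is what the *\xff\xd8\xff* at the front of that output was.

```python
from urllib.parse import urlparse
import posixpath

JPEG_MAGIC = b"\xff\xd8\xff"   # JPEG files start with these three bytes

def has_html_extension(url):
    """True if the path part of the URL ends in .html or .htm."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in (".html", ".htm")

def looks_like_jpeg(content):
    """True if the downloaded bytes start with the JPEG signature."""
    return content[:3] == JPEG_MAGIC

print(has_html_extension("http://localhost/www.tdcj.texas.gov/death_row/dr_info/_ramos.jpg"))
print(looks_like_jpeg(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
```

The extension check costs nothing because it runs before the request. The byte check only works after the download, but it would still catch an image hiding behind a misleading filename.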

In [40]:
for InmateNo in masterdict:
    line = masterdict[InmateNo]  # Get that whole person-level dictionary
    bioURL = line['BioURL']
    if "html" in bioURL:
        r = requests.get(bioURL)   # Get their biography page
        html = r.content
        table = pq(html)("table.table_deathrow")
        for row in pq(table)("tr"):
            description = pq(row)("td")[-2].text
            data = pq(row)("td")[-1].text
            line[description] = data
        # Next, look for a photo in the first cell of that table
        try:
            photocellURL = pq(pq(pq(html)("table.table_deathrow"))("td")[0])("img").attr("src")
            if photocellURL:   # If we've found that img src tag ...
                photocellURL = photobase + photocellURL
            else:
                photocellURL = ""
        except:
            photocellURL = ""
        line['PhotoURL'] = photocellURL
        for graf in pq(html)("p"):
            data = graf.text_content().split("\n")[1].strip()
            description = pq(graf)("span")[0].text.strip()
            line[description] = data
        masterdict[InmateNo] = line    

IndexError: list index out of range

In [None]:
print(bioURL)

OK, well, huh. That didn't quite work out; some of these supplementary sections have more than one paragraph in the description.

`...
    <p><span class="bold">Prior Occupation</span><br />
    Wrecker  Driver/General Construction/Lineman/Laborer </p>
    <p><span class="bold">Prior Prison Record</span><br />
    None </p>
    <p><span class="bold">Summary of Incident</span><br />
    On  09/26/1986, in Harris County, Texas, AWFUL THING </p>
    <p> On  04/16/1992 in Harris County, Texas, MORE AWFUL</p>
    <p> On  10/19/1993, LET'S JUST STOP HERE</p>
    ...
    `

OK, we can figure out how to serialize this, then, if we really want to. But we're also getting crazy into complexity. And maybe these details aren't quite so important.

So what if we say, here's a paragraph tag. If there's no *span* inside of it, let's append something showing the record is incomplete and just not try to scrape the rest of it right now. If that's a bad move, we can rescrape later.
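
If you did want to serialize those extra paragraphs, here's one possible shape for it. This is a sketch of a different choice from the one this tutorial makes. It assumes you've already boiled each *p* tag down to a (label, text) pair, with a label of None when there was no *span*; that input format and the function name are made up for illustration.

```python
def merge_paragraphs(paragraphs):
    """paragraphs: list of (label, text) pairs in page order; label is None
    for a paragraph that had no span. Returns {label: joined text}."""
    fields = {}
    current = None               # Label of the most recent labeled paragraph
    for label, text in paragraphs:
        if label is not None:
            current = label
            fields[current] = text
        elif current is not None:
            # Continuation: glue it onto the field we saw last
            fields[current] += " " + text
        # An unlabeled paragraph before any label is skipped
    return fields

sample = [
    ("Prior Prison Record", "None"),
    ("Summary of Incident", "On 09/26/1986, in Harris County, Texas, AWFUL THING"),
    (None, "On 04/16/1992 in Harris County, Texas, MORE AWFUL"),
    (None, "On 10/19/1993, LET'S JUST STOP HERE"),
]
merged = merge_paragraphs(sample)
print(merged["Summary of Incident"])
```

The trade-off is that this trusts every unlabeled paragraph to belong to whatever came before it, which may not hold on every page.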


In [46]:

for InmateNo in masterdict:
    line = masterdict[InmateNo]  # Get that whole person-level dictionary
    bioURL = line['BioURL']
    if "html" in bioURL and "no_info_available" not in bioURL:  # !!!!!!!!!!!!!!!!!!!!!!! This broke too
        r = requests.get(bioURL)   # Get their biography page
        html = r.content
        table = pq(html)("table.table_deathrow")
        for row in pq(table)("tr"):
            description = pq(row)("td")[-2].text
            data = pq(row)("td")[-1].text
            line[description] = data
        # Next, look for a photo in the first cell of that table
        try:
            photocellURL = pq(pq(pq(html)("table.table_deathrow"))("td")[0])("img").attr("src")
            if photocellURL:   # If we've found that img src tag ...
                photocellURL = photobase + photocellURL
            else:
                photocellURL = ""
        except:
            photocellURL = ""
        line['PhotoURL'] = photocellURL
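        description = None   # No labeled paragraph seen yet on this page
        for graf in pq(html)("p"):
            spanhere = pq(graf)("span")
            if spanhere:   # If this is a complete description + initial tag:
                try:   # Yes, something else broke !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                    data = graf.text_content().split("\n")[1].strip()
                    description = pq(graf)("span")[0].text.strip()
                    line[description] = data   # Save only when both pieces parsed
                except Exception:
                    pass
            elif description in line:   # No span: a continuation of the last field
                # Flag the record; nothing overwrites the flag afterward
                line[description] += " ***INCOMPLETE**** "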
        masterdict[InmateNo] = line    

![This is fine meme](support/thisisfine.gif)

The original code above broke, then broke again. Scraping of some pages goes smoothly. Sometimes, not so much, and your best-laid plans crumble. The more time you spend observing your data, the better your scrape will go.
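
One pattern that can take some of the sting out of this: instead of letting the first bad page kill the whole run, collect the failures and look at them afterward. The sketch below is an assumption-laden illustration, not the code used above. *scrape_all* and *fake_parse* are made-up names, and the fake parser stands in for the real request-and-parse step.

```python
def scrape_all(urls, parse_page):
    """Run parse_page on each URL; return (results, failures)."""
    results = {}
    failures = []                 # (url, error message) pairs to inspect later
    for url in urls:
        try:
            results[url] = parse_page(url)
        except Exception as error:   # Keep going, but remember what broke
            failures.append((url, type(error).__name__ + ": " + str(error)))
    return results, failures

def fake_parse(url):
    """Made-up parser: chokes on anything that is not an .html page."""
    if not url.endswith(".html"):
        raise IndexError("list index out of range")
    return {"BioURL": url}

results, failures = scrape_all(["a.html", "_ramos.jpg"], fake_parse)
for url, message in failures:
    print(url, "->", message)
```

The failures list doubles as a to-do list of pages to go observe by hand.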

## But wait, there's more!

We now have most of our scrapable data in one spot, but it's inconsistent; some records have those details from the biography page, and some do not. What we have to do is standardize, standardize, standardize.


In [49]:
# First, get a list of our keys, like our column headers.
headers = []
for InmateNo in masterdict:
    row = masterdict[InmateNo]
    for key in row.keys():
        if key not in headers:
            headers.append(key)
print(headers)

['InmateNo', 'ExecutionNo', 'LastName', 'FirstName', 'Age', 'ExecutionDate', 'Race', 'County', 'BioURL', 'StatementURL', 'Name', 'TDCJ Number', 'Date of Birth', 'Date Received', 'Age (when    Received)', 'Education Level (Highest Grade Completed)', 'Date of Offense', 'Age (at the time of Offense)', 'Gender', 'Hair Color', 'Height', 'Weight', 'Eye Color', 'Native County', 'Native State', 'PhotoURL', 'Prior Occupation', 'Prior Prison Record', 'Summary of Incident', 'Co-Defendants', 'Race and Gender of Victim', 'Age (when Received)', 'Summary of incident', 'Native Country']
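
A close look at that list turns up a few near-twins: 'Age (when    Received)' and 'Age (when Received)' differ only in spacing, and 'Summary of Incident' and 'Summary of incident' differ only in case. Here's an optional cleanup sketch with made-up helper names; the tutorial carries on with the headers as they are. It can't settle 'Native County' versus 'Native Country', which may be a typo or may be two real fields; only looking at the pages will tell.

```python
def clean_key(key):
    """Collapse runs of whitespace inside a header name."""
    return " ".join(key.split())

def merge_headers(headers):
    """Return cleaned headers, keeping the first spelling of each name
    and treating names that differ only by case as the same column."""
    seen = {}                     # casefolded name -> spelling we keep
    for key in headers:
        cleaned = clean_key(key)
        seen.setdefault(cleaned.casefold(), cleaned)
    return list(seen.values())

sample = ['Age (when    Received)', 'Summary of Incident',
          'Age (when Received)', 'Summary of incident',
          'Native County', 'Native Country']
print(merge_headers(sample))
```

To actually use this, the keys inside each person's dictionary would need the same renaming before the standardizing step, otherwise the cleaned headers won't match anything.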


In [50]:
from collections import OrderedDict   # Harmless if it was already imported earlier
# That looks reasonable. Now, let's cycle through again and standardize:
for InmateNo in masterdict:
    row = masterdict[InmateNo]
    line = OrderedDict()
    for key in headers:
        if key in row:
            line[key] = row[key]
        else:
            line[key] = ""   # Give it a blank entry
    
    # Our data is now standardized but has not been saved.
    masterdict[InmateNo] = line   # Copy over the new data.

In [53]:
import csv
with open('report.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    for InmateNo in masterdict:
        line = masterdict[InmateNo]
        writer.writerow(line)
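
A quick sanity check never hurts: read the file back and make sure the rows and columns are what you expect. The sketch below uses two made-up rows and an in-memory buffer instead of report.csv so it can run on its own. With the real file you'd open report.csv for reading and hand that to csv.DictReader.

```python
import csv
import io

# Two made-up rows standing in for masterdict's standardized values
headers = ["InmateNo", "LastName", "PhotoURL"]
rows = [
    {"InmateNo": "1", "LastName": "Example", "PhotoURL": ""},
    {"InmateNo": "2", "LastName": "Sample", "PhotoURL": "http://localhost/photo.jpg"},
]

# Write to an in-memory buffer the same way the cell above writes to a file
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()
writer.writerows(rows)

# Rewind and read everything back as dictionaries keyed by the header row
buffer.seek(0)
readback = list(csv.DictReader(buffer))
print(len(readback), "rows;", "columns match:", list(readback[0].keys()) == headers)
```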