## Scraping with Python

Python is a handy programming language, with some excellent tools for handling scraping projects and also something called BeautifulSoup, which will make you feel like a moron every time you have to type stuff like soup.beautify. We'll try a different tool here and show you how to get started.

We're going to assume you're using some flavor of Python 3 here. If you've downloaded this from https://github.com/PalmBeachPost/nicar19scraping , you'll want to make sure you have dependencies -- the modules Python needs to do everything here -- installed.

If you're in NICAR ***************************************************************************************

If you're using the git repo and looking at this:
_pip install -r requirements.txt_ should be reasonably safe.

Then:
_jupyter notebook_

Your web browser should open up now, and you'll see a list of files. The complete tutorial is available locally as *************. The starter shell is ************. And soon you'll see what page to open.

## Observe, orient, decide, act

Before you start any scraping task, you'll want to invest a chunk of time into figuring out what you actually have to work with. 

A simple story: I thought I was going to have to write a scraper, but then the search page I was looking at had an export button that made a nice file that Excel opened up. It looked great, all the same text that was on my screen. I figured out how to dynamically generate that export button link so I could run it on a regular basis and schedule it.

Problem was, it really had all the __text__ on my screen -- like the date of a case, the name of the person, the outcome of the investigation. What the export didn't have was critical and not immediately obvious: a link from the case number to the actual paperwork supporting the case. The exported file was missing something that was absolutely critical.

So, let's look at what we have to work with. Because of that earlier command -- _python -m http.server_ -- you should have a little web server running already. Let's go to a particular file. Click this link to open it in your browser.

http://localhost/www.tdcj.texas.gov/death_row/dr_executed_offenders.html



This is a localized partial copy of a Texas Department of Criminal Justice site, showing "Executed offenders" -- people put to death by the state. They seem to be numbered in order, with the newest executions first, and 558 total. If you scroll down, you'll see these go back to very late 1982, meaning we're looking at about 36 years to get 558 executions.

Another thing to notice: There's no pagination. There's no "page 1 of 26" here. That makes life much easier.

When you actually scrape your data, you can start running your own analysis. As you look, though, maybe scribble some notes on things you'll want to analyze from this.

![Executed offender page](support/mainpage.png "Main executed offender page")

On the main page, you see things like last name, first name, TDCJ number, age, execution date, race and county. TDCJ is the Texas Department of Criminal Justice, which runs the page. The TDCJ number is an inmate number -- a unique identifier.

The newest executee is listed as Braziel, Jr., Alvin. Leave your cursor over the "Offender Information" link for Braziel and your browser, in the bottom-left corner, will show you it's http://localhost/www.tdcj.texas.gov/death_row/dr_info/brazielalvin.html . This is a good sign; it's not _javascript:something_. You can work with this. Let's hit that link.

Here there's a bit more of a biography, more demographic details, and a description of an awful crime. Important to note: On the previous page he was "Braziel, Jr." and "Alvin". Here, we see the name is no longer broken up but is formatted differently with more information: "Braziel, Alvin Avon Jr." all at once.

### What are we trying to scrape here?
###### Where are we going, and why are we in this handbasket?

Well, you may not always know what you're going to want initially. With almost anything involving death penalty cases, you probably want the demographic information. And you probably want the history of the case. And you probably want the final statement. And Texas may keep the last meal here. So let's consult with Stephen Colbert about where to start:

![Stephen Colbert wants it all](support/giveittome.gif)

How hard will it be to grab that big bunch of narrative stuff? In most browsers, you can right-click (maybe around the "Name" entry in the biography) and left-click on Inspect. You should see something like this.

![Output of HTML inspector](support/tableinspect.png)

On the right side we see the main part of this demographic information is all in an HTML table: *&lt;table class="table\_deathrow indent"&gt;*

This is good. We can work with this. Move your cursor through the inspection area and you'll see different rows highlight. Every row of the table -- *&lt;tr&gt;* -- is a row in what you're seeing. The two sides of that table are separated by *&lt;td&gt;* tags, or table data tags, the HTML marker for a cell. This is very good.

As long as all your scraping projects can go this well, you'll be fortunate. Will you?

![Image of hand showing sticker of Stop and Pray. Source: https://www.pexels.com/photo/photography-of-a-persons-hand-with-stop-signage-823301/](support/stopandpray.jpg)


Yeah, no.

Let's get reoriented: The stuff off the main page is an index to the more detailed subpages. So we'll need to scrape that main page first, get some of that information, and then start scraping the subpages and get more. And along the way, we'll probably need to look at other pages including the final statement and download the photos.

So where do we start? We start at the beginning.

## Basic scraper setup

So when you're scraping web pages, you need something to download web pages. Here, we'll use the great _requests_ library:

_import requests_

We'll need something that can parse the web pages, or break them into understandable chunks that you can maneuver through. We'll use the splendid _PyQuery_ library. This one's a little odd to set up; it breaks Python tradition by having mixed case for the main module, and that's also annoying to type. You can run a raw import statement on it, but each time you'd have to type _pyquery.PyQuery(somethingsomething)_ and at that point you might as well just try to cuddle a honey badger. Let's get in the habit of a different import statement that will mean we need a lot less typing. In fact, let's just use _pq_ to mean _pyquery.PyQuery_. If you use PyQuery, just copy-paste this line every time until you have it memorized. After that, it's so much easier.

_from pyquery import PyQuery as pq_

We'll want to do **something** with our scraped data. Chances are, even if you keep processing it directly in Python, you'll probably want to save some snapshots to disk. And chances are, the CSV format is the one you'll want. So, let's add one more module.

_import csv_

You'll almost certainly need more modules as you go on -- maybe something to change the formatting of the dates, say. But you can add what you need later.
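For instance, dates on the index page look like 12/11/2018. Here's a minimal sketch of reformatting them with the built-in _datetime_ module, assuming the dates are always month/day/year. The function name _to\_iso_ is made up for illustration.

```python
from datetime import datetime


def to_iso(us_date):
    """Convert a month/day/year string such as '9/27/2018' to '2018-09-27'."""
    # Assumption: dates are month/day/year, the way they appear on the index page.
    return datetime.strptime(us_date, "%m/%d/%Y").strftime("%Y-%m-%d")


print(to_iso("12/11/2018"))
print(to_iso("9/27/2018"))
```

Dates in year-month-day form sort correctly as plain text, which can be handy once the data lands in a spreadsheet.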

You can have your own style. I tend to put external dependencies in their own section at the very top of the file, and built-in module statements following a blank line, like this:


In [7]:
import requests   # external dependency
from pyquery import PyQuery as pq   # external dependency

import csv

We can now start scraping web pages. Where do we start? Well, we know what URL we're starting with. And we know requests is used to get stuff ...

In [8]:
hosturl = "http://localhost/www.tdcj.texas.gov/death_row/dr_executed_offenders.html"
r = requests.get(hosturl)

In [9]:
# Let's see what we've downloaded, just part of it
r.content[:2000]
# Yep, we have a web page!

b'<!doctype html>\n<html lang="en-US"><!-- InstanceBegin template="/Templates/generic_inside.dwt" codeOutsideHTMLIsLocked="false" -->\n<head>\n<meta charset="utf-8">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<!-- stylesheet: global -->\n<link rel="stylesheet" href="../stylesheets/global.css">\n<!-- stylesheet: page-specific -->\n<link rel="stylesheet" href="../stylesheets/content.css">\n<link rel="stylesheet" href="../stylesheets/menu_style.css">\n<!-- InstanceBeginEditable name="stylesheets" -->\n\n<!-- InstanceEndEditable -->\n<!-- jQuery library (if CDN fails, use local copy) -->\n<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>\n<script type="text/javascript"> window.jQuery || document.write(\'<script src="../javascripts/jquery.min.js"><\\/script>\') </script>\n<!-- javascripts -->\n<script type="text/javascript" src="../javascripts/google_analytics.js"></script>\n<script type="text/javascri

In [10]:
# Now, let's get the actual stuff, and we'll put it in a variable called html.
html = r.content

### Planning the first scrape
Let's take a look at the main page code in our browser, either through the right-click-Left-click-on-Inspect or View Source. The code we really want to find is in here a bit:

What can we tell from this? There's a couple headers, whatever. Fine. But there's a table with a class of *tdcj\_table* where the good stuff is. The first row of the table is its headers, with everything inside *th* cells. Beneath that is our first row of real data, contained in *td* cells. Sometimes we just want the text, like *Alvin*. Sometimes we're looking for the URLs that aren't actually text, like the link.

So what's a sensible data structure here? We know we're going to have to roll up at least one subpage to get more details. So we need a data structure that lets us access the same stuff easily, and here's where Python is handy with something called a dictionary. It provides access by name to some data, often a single value but sometimes a whole thing. For example, you might see a dictionary pointing to just a single text string, or a list of things:

*presidents[1] = "George Washington"*

*food['fruits'] = ['apple', 'orange', 'lemon', 'lime', 'tomato']*
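Those two lines can be tried out as-is once the dictionaries exist. A small sketch, with the dictionaries created empty first:

```python
# A dictionary maps a key (a name) to a value.
presidents = {}
presidents[1] = "George Washington"   # the key 1 points to a single text string

food = {}
food['fruits'] = ['apple', 'orange', 'lemon', 'lime', 'tomato']   # this key points to a whole list

print(presidents[1])           # look up a value by its key
print(food['fruits'][0])       # first item of the list stored under 'fruits'
print('vegetables' in food)    # check whether a key exists before using it
```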

When I'm scraping, I'm almost always putting each row of data into its own dictionary, with a particular type called OrderedDict, which keeps the order. So in this case, we're looking for, like:

*mydictionary = {'first name': "Alvin", 'last name': "Braziel Jr."}*

It would make sense to put each person's dictionary of data into one big dictionary that holds them. But to do that, we need a key, a unique value to refer to one person.

As we've seen, the names of Texas' executed inmates don't even match from the index page to the subpage, and it's entirely possible Texas, given enough time, would execute several people named Robert Smith. So don't go by name. However, that inmate identifier, the "TDCJ Number," is assigned to a single person, and presumably even Texas would never execute a person more than once.

Presumably.

In [11]:
from collections import OrderedDict

masterdict = OrderedDict()
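Here's a rough sketch of how one person's data could sit inside that big dictionary, keyed on the inmate number. The values are typed in by hand rather than scraped, and the field names are only a guess at what we'll settle on.

```python
from collections import OrderedDict

masterdict = OrderedDict()

# One row of data, in the order we want the columns to come out.
# Assumption: these values are hand-typed stand-ins for scraped data.
line = OrderedDict()
line['InmateNo'] = '999393'
line['LastName'] = 'Braziel, Jr.'
line['FirstName'] = 'Alvin'

# Key the big dictionary on the inmate number, not the name.
masterdict[line['InmateNo']] = line

print(masterdict['999393']['FirstName'])
print(list(masterdict['999393'].keys()))
```

Later, when a subpage turns up more details for the same person, they can be added to _masterdict['999393']_ without touching anyone else's entry.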

Remember before? ... *&lt;table class="tdcj_table indent" ...* That's the good stuff. Let's try getting it.

In PyQuery, the first thing you put in is the first, big hunk of HTML. After that, you're looking for a tag, like "table" or "td" or "a". You can use a dot to show a class, like *p.narrative*, to find the code wrapped in something like *&lt;p class="narrative"&gt;*. You can use a # to show an id, like *p#beginning*, which would look for *&lt;p id="beginning"&gt;*.


In [13]:
table = pq(html)("table.tdcj_table")   # Pick out the table with the good stuff

In [18]:
# Let's take a look at the top:
table.html()[:2000]
# Yep, we have our table.

'\n    <caption>Executed Offenders</caption>\n  <tr>\n    <th style="text-align: center" scope="col">Execution</th>\n    <th style="text-align: center; width: 16%" scope="col">Link</th>\n    <th style="text-align: center; width: 13%" scope="col">Link</th>\n    <th style="text-align: center" scope="col">Last Name</th>\n    <th style="text-align: center" scope="col">First Name</th>\n    <th style="text-align: center; width: 7%" scope="col">TDCJ<br/>Number</th>\n    <th style="text-align: center" scope="col">Age</th>\n    <th style="text-align: center" scope="col">Date</th>\n    <th style="text-align: center" scope="col">Race</th>\n    <th style="text-align: center" scope="col">County</th>\n  </tr>\n  <tr>\n    <td style="text-align: center">558</td>\n    <td style="text-align: center"><a href="dr_info/brazielalvin.html" title="Offender Information for Joseph Garcia">Offender Information</a></td>\n    <td style="text-align: center"><a href="dr_info/brazielalvinlast.html" title="Last State

OK, now let's flip back to that web page again and just ... look. In Python, you start counting from 0. So table row 0 is going to be the headers; our good data begins at row 1. Our first column of real data, the execution number, will be in table data cell 0, and we want the text ... You know what? Let's just sketch this out.
* cell 0, text, to ExecutionNo
* cell 1, URL, to BioURL
* cell 2, URL, to Statement
* cell 3, text, to LastName
* cell 4, text, to FirstName
* cell 5, text, to InmateNumber
* cell 6, text, to Age
* cell 7, text, to ExecutionDate
* cell 8, text, to Race
* cell 9, text, to County

We should maybe fiddle with the order. Maybe something like this.

* cell 5, text, to InmateNumber
* cell 0, text, to ExecutionNo
* cell 3, text, to LastName
* cell 4, text, to FirstName
* cell 6, text, to Age
* cell 7, text, to ExecutionDate
* cell 8, text, to Race
* cell 9, text, to County
* cell 1, URL, to BioURL
* cell 2, URL, to Statement

The URLs are going to be very important to us for the scrape, but they'll be near-worthless to us in the spreadsheet we want to build.
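One way to keep the URLs around for scraping but out of the spreadsheet is to tell the _csv_ module which columns to write. This is only a sketch: the row is typed in by hand, and an in-memory _io.StringIO_ stands in for a real file.

```python
import csv
import io
from collections import OrderedDict

# Assumption: one scraped row, with values typed in by hand for illustration.
line = OrderedDict([
    ('InmateNo', '999393'), ('ExecutionNo', '558'),
    ('LastName', 'Braziel, Jr.'), ('FirstName', 'Alvin'),
    ('BioURL', 'dr_info/brazielalvin.html'),
    ('Statement', 'dr_info/brazielalvinlast.html'),
])

# Columns for the spreadsheet: everything except the two URL fields.
headers = [key for key in line if key not in ('BioURL', 'Statement')]

outfile = io.StringIO()   # stand-in for open("somefile.csv", "w", newline="")
# extrasaction='ignore' tells DictWriter to skip keys that aren't in headers.
writer = csv.DictWriter(outfile, fieldnames=headers, extrasaction='ignore')
writer.writeheader()
writer.writerow(line)

print(outfile.getvalue())
```

Notice the last name has a comma in it; the _csv_ module wraps it in quotes so it stays in one column.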

In [19]:
# Let's try getting the execution order. Our HTML table is stored in the cunningly named "table" variable.
# And we want to skip the first row, right? Let's just iterate beginning at the second row, which in Python is 1.
# We can cheat and try like 5 rows at a time.

for row in pq(table)("tr")[1:6]:  # Pick up first five rows of real data, skipping header row
    print(pq(row)("td")[0].text)


558
557
556
555
554


In [24]:
# OK, we have our execution numbers. You know what? I really don't want to type that "pq" stuff eight times over.
# So we can  build a stupid lil function. But ...
# If I were a dreamer, but then again, no.
# Let's save the routine for later and just copy paste. And we'll want to start storing stuff in masterdict.
# Key it to the inmate number. Keep it in an OrderedDict. OK, fine.
# My standard has been that a **row** comes in, and a **line** goes out. No real rhyme or reason, but ...
# let's keep with it.

for row in pq(table)("tr")[1:6]:   # Pick off first five lines of real data
    line = OrderedDict()
    line['InmateNo'] = pq(row)("td")[5].text
    line['ExecutionNo'] = pq(row)("td")[0].text
    line['LastName'] = pq(row)("td")[3].text
    line['FirstName'] = pq(row)("td")[4].text
    line['Age'] = pq(row)("td")[6].text
    line['ExecutionDate'] = pq(row)("td")[7].text
    line['Race'] = pq(row)("td")[8].text
    line['County'] = pq(row)("td")[9].text
    print(line)


OrderedDict([('InmateNo', '999393'), ('ExecutionNo', '558'), ('LastName', 'Braziel, Jr.'), ('FirstName', 'Alvin'), ('Age', '43'), ('ExecutionDate', '12/11/2018'), ('Race', 'Black'), ('County', 'Dallas')])
OrderedDict([('InmateNo', '999441'), ('ExecutionNo', '557'), ('LastName', 'Garcia'), ('FirstName', 'Joseph'), ('Age', '47'), ('ExecutionDate', '12/04/2018'), ('Race', 'Hispanic'), ('County', 'Dallas')])
OrderedDict([('InmateNo', '999062'), ('ExecutionNo', '556'), ('LastName', 'Ramos'), ('FirstName', 'Robert'), ('Age', '64'), ('ExecutionDate', '11/14/2018'), ('Race', 'Hispanic'), ('County', 'Hidalgo')])
OrderedDict([('InmateNo', '999381'), ('ExecutionNo', '555'), ('LastName', 'Acker'), ('FirstName', 'Daniel'), ('Age', '46'), ('ExecutionDate', '9/27/2018'), ('Race', 'White'), ('County', 'Hopkins')])
OrderedDict([('InmateNo', '999351'), ('ExecutionNo', '554'), ('LastName', 'Clark'), ('FirstName', 'Troy'), ('Age', '51'), ('ExecutionDate', '9/26/2018'), ('Race', 'White'), ('County', 'Smith')])
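About that "stupid lil function" we skipped: here's one rough way it might look. The names _FIELD\_CELLS_ and _build\_line_ are made up for this sketch, and it works on a plain list of cell texts so it can run without a web page.

```python
from collections import OrderedDict

# Output field name -> position of the cell in the table row, in output order.
FIELD_CELLS = OrderedDict([
    ('InmateNo', 5), ('ExecutionNo', 0), ('LastName', 3), ('FirstName', 4),
    ('Age', 6), ('ExecutionDate', 7), ('Race', 8), ('County', 9),
])


def build_line(celltexts):
    """Turn a list of cell texts from one table row into an ordered line."""
    line = OrderedDict()
    for fieldname, position in FIELD_CELLS.items():
        line[fieldname] = celltexts[position]
    return line


# Assumption: the cell texts were already pulled out of a row. Cells 1 and 2
# hold links rather than plain text, so they show up as None here.
cells = ['554', None, None, 'Clark', 'Troy', '999351', '51', '9/26/2018', 'White', 'Smith']
line = build_line(cells)
print(line['InmateNo'], line['LastName'], line['County'])
```

With a real row, the list could probably be built with something like _[cell.text for cell in pq(row)("td")]_, and reordering the columns then means editing _FIELD\_CELLS_ in one place.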

In [None]:
# OK, we're much of the way there. We still have a leash on to show only the first five rows of data.
# We're missing the critical URLs, though. Let's look at the last row we pulled:

In [29]:
print(pq(row))

<tr>
    <td style="text-align: center">554</td>
    <td style="text-align: center"><a href="dr_info/clarktroy.html" title="Offender Information for Troy Clark">Offender Information</a></td>
    <td style="text-align: center"><a href="dr_info/clarktroylast.html" title="Last Statement of Troy Clark">Last Statement</a></td>
    <td style="text-align: center">Clark</td>
    <td style="text-align: center">Troy</td>
    <td style="text-align: center">999351</td>
    <td style="text-align: center">51</td>
    <td style="text-align: center">9/26/2018</td>
    <td style="text-align: center">White</td>
    <td style="text-align: center">Smith</td>
  </tr>
  


In [30]:
# If we grab just the text of this thing, we're not going to get much:
print(pq(row)("td")[1].text)

None


In [34]:
# Wait, why None? Because there's no raw text in the cell. What text there is is wrapped inside the *a* tag.
# Let's feed it into another round of PyQuery:
print(pq(pq(row)("td"))("a").text)


<bound method PyQuery.text of [<a>, <a>]>


In [35]:
# Well, crap. "Bound method" means it's looking for something like a function, so throw in some ()s.

print(pq(pq(row)("td"))("a").text())

# See how PyQuery now wants *text()* instead of *text*? Yeah. It's fun, and sometimes unpredictable.
# Keep working it over.

Offender Information Last Statement


But we don't actually want that text. What we want is the URL, which is stored as an **attribute** of the *a* tag.

Other attributes include things like *class* and *id*. But here's how to grab the URL:

In [36]:
print(pq(row)("td")("a").attr("href"))

dr_info/clarktroy.html


**dr_info/clarktroy.html**? What the hell kind of URL is that? Well, it's a relative URL. Starting from the directory you're in now, it's going to look inside *dr_info* for a file named clarktroy.html. We started off here:
http://localhost/www.tdcj.texas.gov/death_row/dr_executed_offenders.html

So *dr\_executed\_offenders.html* is a page inside *death\_row* ... so let's start gluing this together.

http://localhost/www.tdcj.texas.gov/death_row/ + dr_info/clarktroy.html ... becomes:

http://localhost/www.tdcj.texas.gov/death_row/dr_info/clarktroy.html

Open it. We're in business!
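Gluing by hand works fine here. As an alternative sketch, Python's built-in _urllib.parse.urljoin_ can do the same gluing from the page's own URL, which may help if a site mixes relative and absolute links. The variable names _relativeurl_ and _fullurl_ are just for illustration.

```python
from urllib.parse import urljoin

hosturl = "http://localhost/www.tdcj.texas.gov/death_row/dr_executed_offenders.html"
relativeurl = "dr_info/clarktroy.html"

# urljoin drops the file name from the page URL and appends the relative path.
fullurl = urljoin(hosturl, relativeurl)
print(fullurl)
```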

In [None]:
biopagebase = "http://localhost/www.tdcj.texas.gov/death_row/"