# Scraping with Python

Python is a handy programming language, with some excellent tools for handling scraping projects and also something called BeautifulSoup, which will make you feel like a moron every time you have to type stuff like soup.beautify. We'll try a different tool here and show you how to get started.

We're going to assume you're using some flavor of Python 3 here. If you've downloaded this from https://github.com/PalmBeachPost/nicar19scraping , you'll want to make sure you have dependencies -- the modules Python needs to do everything here -- installed. Typically this is with

_pip install -r requirements.txt_

From the project folder -- the root of the git repo if you've downloaded it that way, or ??????????????? at NICAR, you'll probably want to open two command prompts. In one, you'll want to start a web server:

_python -m http.server_

And in the other, you'll launch Jupyter Notebook, a handy way to build Python programs in small, useful chunks:

_jupyter notebook_


## Observe, orient, decide, act

Before you start any scraping task, you'll want to invest a chunk of time into figuring out what you actually have to work with. 

A simple story: I thought I was going to have to write a scraper, but then the search page I was looking at had an export button that made a nice file that Excel opened up. It looked great, all the same text that was on my screen. I figured out how to dynamically generate that export button link so I could run it on a regular basis and schedule it.

Problem was, it really had all the __text__ on my screen -- like the date of a case, the name of the person, the outcome of the investigation. What the export didn't have was critical and not immediately obvious: a link from the case number to the actual paperwork supporting the case.

So, let's look at what we have to work with. Because of that earlier command -- _python -m http.server_ -- you should have a little web server running already. Let's go to a particular file. Click this link to open it in your browser.

http://localhost:8888/tree/www.tdcj.texas.gov/death_row/dr_executed_offenders.html



This is a localized partial copy of a Texas Department of Criminal Justice site, showing "Executed offenders" -- people put to death by the state. They seem to be numbered in order, with the newest executions first, and 558 total. If you scroll down, you'll see these go back to very late 1982, meaning we're looking at about 36 years to get 558 executions.

Another thing to notice: There's no pagination. There's no "page 1 of 26" here. That makes life much easier.

Once you actually scrape your data, you can start running your own analysis. As you look, though, maybe scribble some notes on things you'll want to analyze.

![Executed offender page](support/mainpage.png "Main executed offender page")

On the main page, you see things like first name, last name, race, gender, age and TDCJ number. TDCJ is the Texas Department of Criminal Justice, and the TDCJ number is an inmate number -- a unique identifier.

The newest executee is listed as Braziel, Jr., Alvin. Leave your cursor over the "Offender Information" link for Braziel and your browser, in the bottom-left corner, will show you it's http://localhost:8888/tree/www.tdcj.texas.gov/death_row/dr_info/brazielalvin.html . This is a good sign; it's not _javascript:something_. You can work with this. Let's hit that link.

Here there's a bit more of a biography, more demographic details, and a description of an awful crime. Important to note: On the previous page he was "Braziel, Jr." and "Alvin". Here, we see the name is no longer broken up but is formatted differently with more information: "Braziel, Alvin Avon Jr." all at once.

### What are we trying to scrape here?
###### Where are we going, and why are we in this handbasket?

Well, you may not always know what you're going to want initially. With almost anything involving death penalty cases, you probably want the demographic information. And you probably want the history of the case. And you probably want the final statement. And Texas may keep the last meal here. So let's consult with Stephen Colbert about where to start:

![Stephen Colbert wants it all](support/giveittome.gif)

How hard will it be to grab that big bunch of narrative stuff? In most browsers, you can right-click (maybe around the "Name" entry in the biography) and left-click on Inspect. You should see something like this.

![Output of HTML inspector](support/tableinspect.png)

On the right side we see the main part of this demographic information is all in an HTML table: *&lt;table class="table\_deathrow indent"&gt;*

This is good. We can work with this. Move your cursor through the inspection area and you'll see different rows highlight. Every row of the table -- *&lt;tr&gt;* -- is a row in what you're seeing. The two sides of that table are separated by *&lt;td&gt;* tags, or table data tags, the HTML marker for a cell. This is very good.

If all your scraping projects go this well, you'll be fortunate. Will they?

![Image of hand showing sticker of Stop and Pray. Source: https://www.pexels.com/photo/photography-of-a-persons-hand-with-stop-signage-823301/](support/stopandpray.jpg)


Yeah, no.

Let's get reoriented: The stuff off the main page is an index to the more detailed subpages. So we'll need to scrape that main page first, get some of that information, and then start scraping the subpages and get more. And along the way, we'll probably need to look at other pages including the final statement and download the photos.
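One small piece of that plan can be sketched already: turning a link found on the index page into a full subpage URL. This assumes the links on the index are relative (like `dr_info/brazielalvin.html`), which is a guess based on the address the browser showed earlier; `index_url` and `relative_links` are names made up for this example.

```python
from urllib.parse import urljoin

# The index page URL used in this walkthrough.
index_url = "http://localhost:8888/tree/www.tdcj.texas.gov/death_row/dr_executed_offenders.html"

# Assumed href values, as they might appear inside the index page's links.
relative_links = ["dr_info/brazielalvin.html"]

# urljoin resolves each href against the page it was found on.
subpage_urls = [urljoin(index_url, href) for href in relative_links]
print(subpage_urls[0])
```

If the links turn out to be absolute already, `urljoin` simply hands them back unchanged, so it's a safe habit either way.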

So where do we start? We start at the beginning.

## Basic scraper setup

So when you're scraping web pages, you need something to download web pages. Here, we'll use the great _requests_ library:

_import requests_

We'll need something that can parse the web pages, or break them into understandable chunks that you can maneuver through. We'll use the splendid _PyQuery_ library. This one's a little odd to set up; it breaks Python tradition by having mixed case for the main module, and that's also annoying to type. You can run a raw import statement on it, but each time you'd have to type _pyquery.PyQuery(somethingsomething)_ and at that point you might as well just try to cuddle a honey badger. Let's get in the habit of a different import statement that will mean we need a lot less typing. In fact, let's just use _pq_ to mean _pyquery.PyQuery_. If you use PyQuery, just copy-paste this line every time until you have it memorized. After that, it's so much easier.

_from pyquery import PyQuery as pq_

We'll want to do **something** with our scraped data. Chances are, even if you keep processing it directly in Python, you'll probably want to save some snapshots to disk. And chances are, the CSV format is the one you'll want. So, let's add one more module.

_import csv_

You'll almost certainly need more modules as you go on -- maybe something to change the formatting of the dates, say. But you can add what you need later.
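As a rough idea of where the _csv_ module comes in, here's a minimal sketch that writes a list of dictionaries out as CSV. The rows are made up, and it writes to an in-memory buffer so nothing lands on disk; the file name in the comment is only a suggestion.

```python
import csv
import io

# Made-up rows standing in for scraped data.
rows = [
    {"last name": "Doe", "first name": "John", "age": "40"},
    {"last name": "Roe", "first name": "Richard", "age": "35"},
]

# For a real file, swap this for: open("snapshot.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()     # first line: the column names
writer.writerows(rows)   # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

`DictWriter` pairs nicely with the one-dictionary-per-row approach used later on, since the dictionary keys double as column headers.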

You can have your own style. I tend to put external dependencies in their own section at the very top of the file, and built-in module statements following a blank line, like this:


    import requests   # external dependency
    from pyquery import PyQuery as pq   # external dependency

    import csv   # built-in module

We can now start scraping web pages. Where do we start? Well, we know what URL we're starting with. And we know requests is used to get stuff ...

    hosturl = "http://localhost:8888/tree/www.tdcj.texas.gov/death_row/dr_executed_offenders.html"
    r = requests.get(hosturl)

    # Let's see what we've downloaded, just part of it
    r.content[:200]
    # Yep, we have a web page!

    b'<!DOCTYPE HTML>\n<html>\n\n<head>\n    <meta charset="utf-8">\n\n    <title>Jupyter Notebook</title>\n    <link rel="shortcut icon" type="image/x-icon" href="/static/base/images/favicon.ico?v=97c6417ed01bdc0'

    # Now, let's feed that into PyQuery.
    html = pq(r.content)

### Planning the first scrape
Let's take a look at the main page code in our browser, either through right-click and then left-click on Inspect, or through View Source. The code we really want is a little way into the page.

What can we tell from this? There are a couple of headers, whatever. Fine. But there's a table with a class of *tdcj\_table* where the good stuff is. The first row of the table is its headers, with everything inside *th* cells. Beneath that is our first row of real data, contained in *td* cells. Sometimes we just want the text, like *Alvin*. Sometimes we're looking for the URLs that aren't actually text, like the link.

So what's a sensible data structure here? We know we're going to have to roll up at least one subpage to get more details. So we need a data structure that lets us access the same stuff easily, and here's where Python is handy with something called a dictionary. It provides access by name to some data, often a single value but sometimes a whole thing. For example, you might see a dictionary pointing to just a single text string, or a list of things:

*presidents[1] = "George Washington"*

*food['fruits'] = ['apple', 'orange', 'lemon', 'lime', 'tomato']*

When I'm scraping, I'm almost always putting each row of data into its own dictionary, with a particular type called OrderedDict, which keeps the order. So in this case, we're looking for, like:

*mydictionary = {'first name': "Alvin", 'last name': "Braziel Jr."}*
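As a quick sketch of what one row looks like in practice, here is that same dictionary built as an OrderedDict. The variable name `row` is just for illustration. (Since Python 3.7, plain dictionaries also keep insertion order, so OrderedDict is more a statement of intent than a necessity.)

```python
from collections import OrderedDict

# One scraped row, built up field by field.
row = OrderedDict()
row["first name"] = "Alvin"
row["last name"] = "Braziel Jr."

print(row["last name"])    # look up a value by its name
print(list(row.keys()))    # keys come back in the order they were added
```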

It would make sense to put each person's dictionary of data into one big dictionary that holds them. But to do that, we need a key, a unique value to refer to one person.

As we've seen, the names of Texas' executed inmates don't even match from the index page to the subpage, and it's entirely possible Texas, given enough time, would execute several people named Robert Smith. So don't go by name. However, that inmate identifier, the "TDCJ Number," is assigned to a single person, and presumably even Texas would never execute a person more than once.

    from collections import OrderedDict

    masterdict = OrderedDict()
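To show why the key matters, here's a minimal sketch of filling a dictionary like that one, keyed by TDCJ number. The rows and the numbers in them are placeholders invented for the example, not real identifiers, and `scraped_rows` is a made-up name.

```python
from collections import OrderedDict

masterdict = OrderedDict()

# Made-up rows; the TDCJ numbers here are placeholders, not real identifiers.
scraped_rows = [
    {"tdcj number": "000001", "first name": "John", "last name": "Doe"},
    {"tdcj number": "000002", "first name": "Robert", "last name": "Smith"},
    {"tdcj number": "000003", "first name": "Robert", "last name": "Smith"},
]

for row in scraped_rows:
    key = row["tdcj number"]
    if key in masterdict:
        # A repeated key would silently overwrite a person, so stop loudly instead.
        raise ValueError(f"Duplicate TDCJ number: {key}")
    masterdict[key] = row

print(len(masterdict))
print(masterdict["000003"]["last name"])
```

Two people named Robert Smith coexist without trouble because each sits under his own number. Keyed by name, the second would have replaced the first.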

### What NOT to scrape

