# Scraping with Python

Python is a handy programming language with some excellent tools for scraping projects, plus something called BeautifulSoup, which will make you feel like a moron every time you have to type stuff like soup.prettify. We'll try a different tool here and show you how to get started.

We're going to assume you're using some flavor of Python 3 here. If you've downloaded this from https://github.com/PalmBeachPost/nicar19scraping , you'll want to make sure you have the dependencies -- the modules Python needs to do everything here -- installed. Typically you do that with

_pip install -r requirements.txt_

From the project folder -- the root of the git repo if you've downloaded it that way, or ??????????????? at NICAR -- you'll probably want to open two command prompts. In one, you'll want to start a web server:

_python -m http.server_

And in the other, you'll launch Jupyter Notebook, a handy way to build Python programs in small, useful chunks:

_jupyter notebook_


## Observe, orient, decide, act

Before you start any scraping task, you'll want to invest a chunk of time into figuring out what you actually have to work with. 

A simple story: I thought I was going to have to write a scraper, but then the search page I was looking at had an export button that made a nice file that Excel opened up. It looked great, all the same text that was on my screen. I figured out how to dynamically generate that export button link so I could run it on a regular basis and schedule it.

Problem was, it really had all the __text__ on my screen -- like the date of a case, the name of the person, the outcome of the investigation. What the export didn't have was critical and not immediately obvious: a link from the case number to the actual paperwork supporting the case.

So, let's look at what we have to work with. Because of that earlier command -- _python -m http.server_ -- you should have a little web server running already. Let's go to a particular file. Click this link to open it in your browser.

http://localhost:8000/www.tdcj.texas.gov/death_row/dr_executed_offenders.html
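If the page doesn't load, the web server probably isn't running. Here's a minimal sketch of how you might check from Python instead of the browser, using only the standard library. The function name `page_is_reachable` is made up for this example; it isn't part of the project.

```python
from urllib.request import urlopen

PAGE_URL = "http://localhost:8000/www.tdcj.texas.gov/death_row/dr_executed_offenders.html"


def page_is_reachable(url, timeout=5):
    """Return True if the URL answers with HTTP 200, False on a network or HTTP error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are both OSError subclasses,
        # so this covers a stopped server, a 404 and a timeout.
        return False


if page_is_reachable(PAGE_URL):
    print("Local server is up; ready to scrape.")
else:
    print("No response. Is `python -m http.server` running in the project folder?")
```

A 404 here usually means the server was started from the wrong folder, since _python -m http.server_ serves whatever directory you launched it in.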

The page you just opened is a localized partial copy of a Texas Department of Criminal Justice site, showing "Executed offenders" -- people put to death by the state. They seem to be numbered in order, with the newest executions first, and 558 total. If you scroll down, you'll see these go back to very late 1982, meaning we're looking at about 36 years to get 558 executions.

Another thing to notice: There's no pagination. There's no "page 1 of 26" here. That makes life much easier.

When you actually scrape your data, you can start running your own analysis. As you look, though, maybe scribble some notes on things you might want to analyze from this.

![Executed offender page](support/mainpage.png "Main executed offender page")

On the main page, you see things like first name, last name, race, gender, age and TDCJ number. The page is from the Texas Department of Criminal Justice, or TDCJ, so the TDCJ number is an inmate number -- a unique identifier.

The newest executee is listed as Braziel, Jr., Alvin. Leave your cursor over the "Offender Information" link for Braziel and your browser, in the bottom-left corner, will show you it's http://localhost:8000/www.tdcj.texas.gov/death_row/dr_info/brazielalvin.html . This is a good sign; it's not _javascript:something_. You can work with this. Let's hit that link.
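That hover test can be sketched in code, too. This is a rough illustration using only the standard library, not the tool this tutorial goes on to use. The HTML snippet is invented for the example (including the assumption that the page uses a relative link like `dr_info/brazielalvin.html`), and `LinkCollector` and `followable_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def followable_links(html, base_url):
    """Return absolute http(s) URLs, skipping javascript: pseudo-links."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Resolve relative links against the page they came from.
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


# Hypothetical snippet, not copied from the real page.
SAMPLE_HTML = """
<a href="dr_info/brazielalvin.html">Offender Information</a>
<a href="javascript:showDetails(1)">Details</a>
"""
BASE_URL = "http://localhost:8000/www.tdcj.texas.gov/death_row/dr_executed_offenders.html"

print(followable_links(SAMPLE_HTML, BASE_URL))
```

Only the first link survives the filter. A _javascript:_ link has nothing a scraper can simply request, which is why a plain URL in the status bar is such a good sign.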

Here there's a bit more of a biography, more demographic details, and a description of an awful crime. Important to note: On the previous page he was "Braziel, Jr." and "Alvin". Here, we see the name is no longer broken up but is formatted differently with more information: "Braziel, Alvin Avon Jr." all at once.

### What are we trying to scrape here?
###### Where are we going, and why are we in this handbasket?

Well, you may not always know what you're going to want initially. With almost anything involving death penalty cases, you probably want the demographic information. And you probably want the history of the case. And you probably want the final statement. And Texas may keep the last meal here. So let's consult with Stephen Colbert about where to start:

![Stephen Colbert wants it all](support/giveittome.gif)

How hard will it be to grab that big bunch of narrative stuff? In most browsers, you can right-click (maybe around the "Name" entry in the biography) and left-click on Inspect. You should see something like this.

![Output of HTML inspector](support/tableinspect.png)

On the right side we see the main part of this demographic information is all in an HTML table: *&lt;table class="table\_deathrow indent"&gt;*

This is good. We can work with this. Move your cursor through the inspection area and you'll see different rows highlight. Every row of the table -- *&lt;tr&gt;* -- is a row in what you're seeing. The two sides of that table are separated by *&lt;td&gt;* tags, or table data tags, the HTML marker for a cell. This is very good.
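To see why a tidy table is such good news, here's a small sketch that turns rows and cells into a Python dictionary. It uses the standard library's `html.parser` rather than the tool this tutorial goes on to use. The sample HTML is a simplified stand-in for the real page, the TDCJ number is a placeholder, and the `TableRows` class is a hypothetical helper.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Simplified, hypothetical markup; the real page has more rows.
SAMPLE_HTML = """
<table class="table_deathrow indent">
  <tr><td>Name</td><td>Braziel, Alvin Avon Jr.</td></tr>
  <tr><td>TDCJ Number</td><td>000000</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)

# Treat each two-cell row as a label and its value.
record = {row[0]: row[1] for row in parser.rows if len(row) == 2}
print(record)
```

Each two-cell row becomes one label and value pair, which is about as close to a spreadsheet row as a web page gets.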

If all your scraping projects go this well, you'll be fortunate. Will they?

![Image of hand showing sticker of Stop and Pray. Source: https://www.pexels.com/photo/photography-of-a-persons-hand-with-stop-signage-823301/](support/stopandpray.jpg)


Yeah, no.

Let's get reoriented: The stuff on the main page is an index to the more detailed subpages. So we'll need to scrape that main page first, get some of that information, and then start scraping the subpages to get more. And along the way, we'll probably need to look at other pages, including the final statement, and download the photos.

So where do we start? We start at the beginning.

## Basic scraper setup