### Notebook 3: Scraping All Prisons

This notebook tests and organizes the function that pulls the main-page information for every inmate at every prison (108 units).

The final version of this function is saved in `inmate_scrape.py` and was run on an AWS server, where it ran for over 72 hours. Ultimately, I realized that I did not need the main-page information, because everything is embedded in each individual inmate's page. Still, building the function and acquiring the data through AWS was very much part of the workflow.

This notebook is mainly a reference that documents the workflow. I do not recommend running it end to end: the function takes days to get through all of the prisons.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
from time import sleep

In [20]:
# the first creates a dictionary
# the second goes all the way to a list of URL strings (haven't tested it in the wild)
from inmate_scrape import scrape_for_inmate_hrefs, inmate_hrefs_list_of_urls

In [21]:
# basic is the two-column table of information
# incl_priors takes the top rows as well and creates a column for each prior and each section of it
from inmate_scrape import individual_scrape_incl_priors, individual_scrape_basic

In [22]:
#takes main table info from one prison
#one puts into df the other into dictionary
from inmate_scrape import one_prison_superficial_df, oneprison_to_dict_list

### Some testing. 
Getting it running for one or two prisons here, and then getting an instance running on AWS.

**Starting by creating the lists I'll need, or at least the lists I think I'll need.**

In [5]:
data_url = 'https://www.texastribune.org/library/data/texas-prisons/units/'
res = requests.get(data_url)
soup = BeautifulSoup(res.content)

tbody = soup.find('tbody')

List of `individual prison links`:

In [6]:
prison_links = []
for row in tbody.find_all('tr'):
    
    # find the first <a> tag that has an href and keep its attributes
    name = row.find('a', href=True).attrs
    
    prison_links.append(name)
    
links = pd.DataFrame(prison_links)

prison_link_list = links['href']

prison_link_list[:10]

0            /library/data/texas-prisons/units/allred/
1              /library/data/texas-prisons/units/beto/
2              /library/data/texas-prisons/units/boyd/
3          /library/data/texas-prisons/units/bradshaw/
4        /library/data/texas-prisons/units/bridgeport/
5           /library/data/texas-prisons/units/briscoe/
6              /library/data/texas-prisons/units/byrd/
7    /library/data/texas-prisons/units/chase-field-...
8           /library/data/texas-prisons/units/clemens/
9          /library/data/texas-prisons/units/clements/
Name: href, dtype: object

In [92]:
links[links.duplicated()] #no dups here 

Unnamed: 0,href


List of the `names of the prisons` (I want to remember this code when making my `all_prisons_scrape` function, to ensure I get the full name of each prison in the cell).

In [7]:
#this will be useful when creating the larger function to ensure I correctly pull in the names
#my previous version was messed up and only captured the first word regardless
prison_name_list = []
for row in tbody.find_all('tr'):
    
    name = row.find('td', {'data-title': 'Name'}).text
    
    prison_name_list.append(name)
prison_name_list[:10]

['Allred',
 'Beto',
 'Boyd',
 'Bradshaw',
 'Bridgeport',
 'Briscoe',
 'Byrd',
 'Chase Field Wilderness',
 'Clemens',
 'Clements']

Creating a list of page-number slugs to reach the individual pages for each prison. I did not know how to make the loop run for exactly as many pages as a prison has, so I hard-coded a range larger than any prison's page count. That makes `try` with `except: break` essential when scraping: without it, the loop runs past the last real page and fails with a `NoneType` error.

In [8]:
page_number_slugs = []

for i in range(1, 140): #because 140 is more pages than any prison has
    slug = '?page='+str(i)
    
    page_number_slugs.append(slug)
    
page_number_slugs[:10]

['?page=1',
 '?page=2',
 '?page=3',
 '?page=4',
 '?page=5',
 '?page=6',
 '?page=7',
 '?page=8',
 '?page=9',
 '?page=10']
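One way to avoid guessing an upper bound on the page count is to keep requesting pages until one comes back without any table rows. The sketch below is a minimal illustration of that idea, not the code in `inmate_scrape.py`: `iter_pages`, `fetch_rows`, and `max_pages` are hypothetical names, and `fetch_rows` stands in for whatever function requests a page and returns its list of rows (or an empty list or `None` when the page has no table).

```python
def iter_pages(fetch_rows, max_pages=1000):
    """Yield (slug, rows) for ?page=1, ?page=2, ... until a page has no rows.

    fetch_rows is a hypothetical callable: it takes a slug such as '?page=3'
    and returns the list of table rows on that page, or an empty list / None
    when the page does not exist.
    """
    for page in range(1, max_pages + 1):  # max_pages is only a safety cap
        slug = f'?page={page}'
        rows = fetch_rows(slug)
        if not rows:  # no rows on this page: we ran past the last one
            break
        yield slug, rows


# Stand-in for a real request: pretend this prison has exactly three pages.
fake_site = {'?page=1': ['a', 'b'], '?page=2': ['c'], '?page=3': ['d']}
pages = list(iter_pages(fake_site.get))
print([slug for slug, _ in pages])
```

With this shape the stopping condition is explicit, so the loop no longer depends on an exception to end.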

### Currently, I have the following lists:
1. `prison_link_list`: the URL path (slug) for each prison, in alphabetical order. Each one needs the base URL prepended (`base_url + ...`) when scraping.

2. `prison_name_list`: simply an alphabetical list of prison names. I'm not sure I'll need the list itself, but the code should be useful for creating the prison column in the DataFrame for the big function.

3. `page_number_slugs`: the page-number slugs. Each one gets appended to a prison URL (`prison_url + ...`) when scraping. The loop also needs `try`/`except: break`, because the list has more slugs than any prison has pages.
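The `base_url +` and `prison_url +` steps described in the list can be sketched as below. This is a minimal illustration: `BASE_URL` and `build_page_url` are names made up for this example, while the path and slug values are copied from the outputs earlier in this notebook. `urljoin` from the standard library is swapped in for plain string concatenation on the first step.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.texastribune.org'  # hypothetical constant name


def build_page_url(prison_path, page_slug):
    """Combine the site root, a prison path, and a page slug into one URL."""
    prison_url = urljoin(BASE_URL, prison_path)  # base_url + prison path
    return prison_url + page_slug                # prison_url + '?page=N'


print(build_page_url('/library/data/texas-prisons/units/allred/', '?page=2'))
```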

Running the existing function on an individual prison, in preparation for building it out to all prisons.

In [9]:
bridgeport_url = 'https://www.texastribune.org/library/data/texas-prisons/units/bridgeport/'

In [10]:
brid_res = requests.get(bridgeport_url + page_number_slugs[5])
soup = BeautifulSoup(brid_res.content)

In [11]:
#works, think i could do it more efficiently
soup.find({'head' :'title'}).text.replace('\n', '').split(' |')[0]

'Bridgeport Unit'

In [12]:
#better
head = soup.find('head')

prison_name = head.find('title').text.split(' |')[0]

prison_name

'Bridgeport Unit'

Testing it out in a for loop before embedding into the function.

In [13]:
#works
test_dict_list = []
for slug in page_number_slugs[:3]:
    sleep(1)
    res = requests.get(bridgeport_url + slug)
    soup = BeautifulSoup(res.content)

    print(f'{slug}')

    tbody = soup.find('tbody')
    head = soup.find('head')
    prison_name = head.find('title').text.split(' |')[0]
    



    for row in tbody.find_all('tr'):
        inmates = {}
        inmates['name'] = row.find('td').text.strip()
        inmates['prison'] = prison_name
        


        test_dict_list.append(inmates)

 
    print('Finished.')

?page=1
Finished.
?page=2
Finished.
?page=3
Finished.


In [14]:
#works
pd.DataFrame(test_dict_list)[:5]

Unnamed: 0,name,prison
0,"Virgle Ellis, III",Bridgeport Unit
1,Bruce Allan Jones,Bridgeport Unit
2,Mark Dolph,Bridgeport Unit
3,Lonel Hart,Bridgeport Unit
4,Darryl Cortez Thomas,Bridgeport Unit


In [15]:
#works - it works for one nicely
###PLAN TO EMBED THIS INTO THE BIG FUNCTION
bridgeport_dict = oneprison_to_dict_list(bridgeport_url, page_number_slugs)

?page=1
?page=2
?page=3
?page=4
?page=5
?page=6
?page=7
?page=8
?page=9
?page=10
?page=11
?page=12
?page=13
?page=14
?page=15
?page=16
?page=17
?page=18
?page=19
No more pages.
Finished.


In [16]:
#works
pd.DataFrame(bridgeport_dict)[:5]

Unnamed: 0,age,crime_location,entered_on,home_county,main_crime,name,prison,term
0,60,Tarrant,2/2/2012,Tarrant,AGG ROBBERY,"Virgle Ellis, III",Bridgeport Unit,85 years
1,63,Harris,6/18/1985,Harris,AGG SEX ASLT,Bruce Allan Jones,Bridgeport Unit,60 years
2,50,Bowie,3/13/2019,Bowie,UNL POSS FIREARM BY FELON W/2,Mark Dolph,Bridgeport Unit,58 years
3,47,Gregg,1/19/2017,Gregg,RETALIATION,Lonel Hart,Bridgeport Unit,55 years
4,54,Harris,7/18/2013,Harris,AGG ROBBERY,Darryl Cortez Thomas,Bridgeport Unit,50 years


In [17]:
cleveland_url = 'https://www.texastribune.org/library/data/texas-prisons/units/cleveland/'

In [23]:
cleveland_df = one_prison_superficial_df(cleveland_url, page_number_slugs)

?page=1
?page=2
?page=3
?page=4
?page=5
?page=6
?page=7
?page=8
?page=9
?page=10
?page=11
?page=12
?page=13
?page=14
?page=15
?page=16
?page=17
?page=18
No more pages.
Finished.


In [24]:
cleveland_df[:5]

Unnamed: 0,name,age,main_crime,entered_on,term,crime_location,home_county,prison
0,Ernest Gutierrez,55,ATT CAP MURDER,7/10/2014,40 years,Bexar,Bexar,Cleveland Unit
1,Gary Gerard Horn,56,UNAUTH USE MTR VEH,8/11/2011,40 years,Harris,Harris,Cleveland Unit
2,Roy Lee Cain,64,AGG SEX ASLT,6/10/2016,36 years,Smith,Smith,Cleveland Unit
3,Ricky Daranell Foster,59,BURG HABIT,11/22/2011,35 years,Dallas,Dallas,Cleveland Unit
4,Jesse Jermaine Jackson,38,AGG ROBBERY,12/29/2005,30 years,Dallas,Dallas,Cleveland Unit


### Building out the superficial function for all prisons.

**1. Remember to practice with only one or two prisons first, and then put it on AWS to run.**
- I now have `full_prison_link_list`, which holds the full URL for each prison in alphabetical order
- I'll test on `full_prison_link_list[14:16]`, because those two prisons have relatively small inmate populations

**2. Once that works, start creating the individual details function for AWS.**

In [30]:
full_prison_link_list = []

main_url = 'https://www.texastribune.org'
for prison in prison_link_list:
    
    link = main_url+prison
    
    full_prison_link_list.append(link)

In [33]:
num = 0
for prison in full_prison_link_list[14:16]:
    print(f'{num}: {prison}')
    num += 1

0: https://www.texastribune.org/library/data/texas-prisons/units/cotulla/
1: https://www.texastribune.org/library/data/texas-prisons/units/crain/


In [52]:
#works - since it does it for each prison, they get added into the same list, dont have to build a new list
dict_list = []

for prison in full_prison_link_list[14:16]:

    for slug in page_number_slugs:
        try:
            sleep(1)
            res = requests.get(prison + slug)
            soup = BeautifulSoup(res.content, features='lxml')

            print(f'{slug}')

            tbody = soup.find('tbody')
            head = soup.find('head')
            prison_name = head.find('title').text.split(' |')[0]



            for row in tbody.find_all('tr'):
                inmates = {}
                inmates['name'] = row.find('td').text.strip()
                inmates['age'] = row.find('td', {'data-title': 'Age'}).text
                inmates['main_crime'] = row.find('td', {'data-title': 'Main Crime'}).text
                inmates['entered_on'] = row.find('td', {'data-title': 'Entered On'}).text
                inmates['term'] = row.find('td', {'data-title': 'Term'}).text
                inmates['crime_location'] = row.find('td', {'data-title': 'Crime Location'}).text
                inmates['home_county'] = row.find('td', {'data-title': 'Home County'}).text
                inmates['prison'] = prison_name
                
                

                dict_list.append(inmates)
                
                
                

        except Exception:  # unlike a bare except, this lets Ctrl-C stop the run
            print('No more pages.')
            break



    print('Finished.')

?page=1
?page=2
?page=3
?page=4
?page=5
?page=6
?page=7
?page=8
?page=9
?page=10
?page=11
?page=12
?page=13
?page=14
?page=15
?page=16
?page=17
?page=18
?page=19
?page=20
?page=21
No more pages.
Finished.
?page=1
?page=2
?page=3
?page=4
?page=5
?page=6
?page=7
?page=8
?page=9
?page=10
?page=11
?page=12
?page=13
?page=14
?page=15
?page=16
?page=17
?page=18
?page=19
?page=20
?page=21
?page=22
?page=23
?page=24
?page=25
?page=26
?page=27
?page=28
?page=29
?page=30
?page=31
?page=32
?page=33
?page=34
?page=35
?page=36
?page=37
?page=38
?page=39
?page=40
?page=41
?page=42
?page=43
?page=44
?page=45
?page=46
?page=47
?page=48
?page=49
?page=50
?page=51
?page=52
?page=53
?page=54
?page=55
?page=56
?page=57
No more pages.
Finished.


In [54]:
tester = pd.DataFrame(dict_list)

tester['prison'].unique()

array(['Cotulla Unit', 'Crain Unit'], dtype=object)

In [62]:
for prison in full_prison_link_list[:3]:
    print(prison.split('/')[-2])

allred
beto
boyd


In [66]:
for slug in page_number_slugs[:3]:
    print(slug[1:])

page=1
page=2
page=3
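The two snippets above (the prison slug taken from the URL, and the page slug without its `?`) are the pieces of a progress message. A small sketch of combining them into one line, assuming a hypothetical helper named `progress_label` and a `total` count supplied by the caller:

```python
def progress_label(prison_url, page_slug, num, total):
    """Build a one-line progress message for a long scraping run."""
    prison = prison_url.rstrip('/').split('/')[-1]  # '.../units/beto/' -> 'beto'
    page = page_slug.lstrip('?')                    # '?page=2' -> 'page=2'
    return f'prison {num}/{total} ({prison}), {page}'


print(progress_label(
    'https://www.texastribune.org/library/data/texas-prisons/units/beto/',
    '?page=2', 2, 107))
```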


In [89]:
#takes almost 2 hours to run
#at the time of running, there were 107 prisons and a total of 138,647 inmates
#print statements show the prison number out of 107.
def entire_system_sup_dict_lxml(full_prison_link_list, page_number_slugs):
    dict_list = []
    num = 1
    for prison in full_prison_link_list:

        for slug in page_number_slugs:
            try:
                sleep(1)
                res = requests.get(prison + slug)
                soup = BeautifulSoup(res.content, features = 'lxml')
                
                
                call_it = prison.split('/')[-2]

                print(f"prison {call_it} is prison {num}")
                print(f'{slug[1:]}')
                
                

                tbody = soup.find('tbody')
                head = soup.find('head')
                prison_name = head.find('title').text.split(' |')[0]
                

                counter = 1



                for row in tbody.find_all('tr'):
                    inmates = {}
                    try:
                        inmates['name'] = row.find('td').text.strip()
                    except:
                        inmates['name'] = None
                    try:
                        inmates['age'] = row.find('td', {'data-title': 'Age'}).text
                    except:
                        inmates['age'] = None
                    try:
                        inmates['main_crime'] = row.find('td', {'data-title': 'Main Crime'}).text
                    except:
                        inmates['main_crime'] = None
                    try:
                        inmates['entered_on'] = row.find('td', {'data-title': 'Entered On'}).text
                    except:
                        inmates['entered_on'] = None
                    try:
                        inmates['term'] = row.find('td', {'data-title': 'Term'}).text
                    except:
                        inmates['term'] = None
                    try:
                        inmates['crime_location'] = row.find('td', {'data-title': 'Crime Location'}).text
                    except:
                        inmates['crime_location'] = None
                    try:
                        inmates['home_county'] = row.find('td', {'data-title': 'Home County'}).text
                    except:
                        inmates['home_county'] = None
                    try:
                        inmates['prison'] = prison_name
                    except:
                        inmates['prison'] = None 



                    dict_list.append(inmates)


                    print(counter)
                    counter += 1


            except Exception:  # unlike a bare except, this lets Ctrl-C stop the run
                print('No more pages.')
                print(' ')
                break
                
        num += 1

    # print before returning; a statement placed after return never runs
    print('Finished.')

    return dict_list
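
The repeated `try`/`except` pairs in the function above all do the same thing: look up one cell and fall back to `None`. A possible refactor, offered only as a sketch, moves that into a helper. `FIELDS`, `cell_text`, and `row_to_dict` are hypothetical names that do not appear in `inmate_scrape.py`; the code only assumes that `row` has a `find` method that returns `None` on a miss, as a BeautifulSoup `Tag` does. Unlike the original, it strips whitespace from every field.

```python
# Column name -> value of the cell's data-title attribute (taken from the loop above).
FIELDS = {
    'age': 'Age',
    'main_crime': 'Main Crime',
    'entered_on': 'Entered On',
    'term': 'Term',
    'crime_location': 'Crime Location',
    'home_county': 'Home County',
}


def cell_text(row, data_title):
    """Return the stripped text of the matching <td>, or None if it is missing."""
    cell = row.find('td', {'data-title': data_title})
    return cell.text.strip() if cell is not None else None


def row_to_dict(row, prison_name):
    """Turn one table row into the same dictionary the scraping loop builds."""
    first_cell = row.find('td')  # the name is the first <td> in the row
    inmate = {'name': first_cell.text.strip() if first_cell is not None else None}
    for column, data_title in FIELDS.items():
        inmate[column] = cell_text(row, data_title)
    inmate['prison'] = prison_name
    return inmate
```

Inside the page loop, this would replace the block of `try`/`except` statements with a single `dict_list.append(row_to_dict(row, prison_name))`.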


In [90]:
tester_lxml = entire_system_sup_dict_lxml(full_prison_link_list[14:16], page_number_slugs[2:15])

prison cotulla is prison 1
page=3
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=4
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=5
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=6
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=7
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=8
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=9
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=10
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=11
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

In [81]:
for prison in full_prison_link_list[:4]:
    print(prison.split('/')[-2])

allred
beto
boyd
bradshaw


In [None]:
count = 1 
for prison in full_prison_link_list[:4]:

    for slug in page_number_slugs[3:10]:


        print(prison)
        print(f'{slug[1:]}')

In [86]:
# below has solid print statements. it's clean and could be complete
# started running just before 2:30
# still going strong at 3:42 - no idea where I am because I ran it with poor print statements

def entire_system_sup_dict(full_prison_link_list, page_number_slugs):
    dict_list = []
    num = 1
    for prison in full_prison_link_list:

        for slug in page_number_slugs:
            try:
                sleep(1)
                res = requests.get(prison + slug)
                soup = BeautifulSoup(res.content)
                
                
                call_it = prison.split('/')[-2]

                print(f"prison {call_it} is prison {num}")
                print(f'{slug[1:]}')
                
                

                tbody = soup.find('tbody')
                head = soup.find('head')
                prison_name = head.find('title').text.split(' |')[0]
                

                counter = 1



                for row in tbody.find_all('tr'):
                    inmates = {}
                    try:
                        inmates['name'] = row.find('td').text.strip()
                    except:
                        inmates['name'] = None
                    try:
                        inmates['age'] = row.find('td', {'data-title': 'Age'}).text
                    except:
                        inmates['age'] = None
                    try:
                        inmates['main_crime'] = row.find('td', {'data-title': 'Main Crime'}).text
                    except:
                        inmates['main_crime'] = None
                    try:
                        inmates['entered_on'] = row.find('td', {'data-title': 'Entered On'}).text
                    except:
                        inmates['entered_on'] = None
                    try:
                        inmates['term'] = row.find('td', {'data-title': 'Term'}).text
                    except:
                        inmates['term'] = None
                    try:
                        inmates['crime_location'] = row.find('td', {'data-title': 'Crime Location'}).text
                    except:
                        inmates['crime_location'] = None
                    try:
                        inmates['home_county'] = row.find('td', {'data-title': 'Home County'}).text
                    except:
                        inmates['home_county'] = None
                    try:
                        inmates['prison'] = prison_name
                    except:
                        inmates['prison'] = None 



                    dict_list.append(inmates)


                    print(counter)
                    counter += 1


            except Exception:  # unlike a bare except, this lets Ctrl-C stop the run
                print('No more pages.')
                print(' ')
                break
                
        num += 1

    # print before returning; a statement placed after return never runs
    print('Finished.')

    return dict_list


In [87]:
tester_dict = entire_system_sup_dict(full_prison_link_list[14:16], page_number_slugs[2:15])

prison cotulla is prison 1
page=3
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=4
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=5
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=6
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=7
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=8
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=9
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=10
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
prison cotulla is prison 1
page=11
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

In [88]:
pd.DataFrame(tester_dict)

Unnamed: 0,age,crime_location,entered_on,home_county,main_crime,name,prison,term
0,27,Harris,10/16/2018,Harris,POSS WID METH 400G,Manuel Camacho-Hernandez,Cotulla Unit,15 years
1,40,Kerr,5/4/2018,Kerr,POSS CS PG2 ACETYLPSI 4-400G W,Zachariah Joseph Taylor,Cotulla Unit,15 years
2,34,Hidalgo,6/27/2018,Hidalgo,AGG SEX ASLT CHILD,Erik Gomez Delafuente,Cotulla Unit,15 years
3,39,Bandera,8/31/2018,Bandera,ASLT PUB SERVANT,Lloyd Alan Maxwell,Cotulla Unit,15 years
4,23,Harris,12/8/2017,Harris,AGG SEX ASLT CHILD U/14,Victor Ramirez,Cotulla Unit,15 years
5,28,Kerr,11/3/2017,Kerr,SEXUAL ASLT/CHILD,Shawn Stephen Schwab,Cotulla Unit,15 years
6,48,Harris,6/5/2018,San Patricio,AGG ROBBERY-DWPN,Archie Goff,Cotulla Unit,15 years
7,61,Harris,10/10/2018,Harris,AGG SEX ASLT CHILD U/14,Sergio Nieto,Cotulla Unit,15 years
8,25,Harris,3/5/2019,Harris,POSS WIT MAN DEL CONT SUB 400G,Jose Rodriguez,Cotulla Unit,15 years
9,57,Travis,5/15/2018,Travis,AGG SEX ASLT CHILD,Roel Salvat,Cotulla Unit,15 years
