# Vignette: Using the nys_parole_scraper package

### MDS Final Project: Kellyann Hayes

##### Read the Docs: https://nys-parole-scraper.readthedocs.io/en/latest/
##### TestPyPi: https://test.pypi.org/project/nys-parole-scraper/

As described in the ReadMe documentation, given a CSV file containing identifying information for individuals (NYSID (New York State ID), or name and DOB), this package searches for them via the [New York State DOCCS parolee lookup](https://publicapps.doccs.ny.gov/ParoleeLookup/default) and returns all parole information, plus summary statistics, for the individuals who are found in the system.

The aim of this package is to aid the data collection efforts of organizations serving justice-involved populations in New York State. As such, it prioritizes accessibility for non-programmers by keeping things simple and exporting data directly as CSV and Excel files to a directory on your computer. However, the package can also return the data as Python objects for further manipulation and analysis.

Organizations working with justice-involved people in New York State may need parole data for multiple reasons, including program evaluation; program planning and design; understanding the parole needs of their participants; and reporting to funders. Organizations with a large number of clients who are or may be on parole can use this scraper to establish which of their clients are on parole; details about the parole, including current status, release date, and charge information; information about the parole officer, which may help organizations in providing advocacy services; etc. The package also returns some basic summary statistics and frequency tables to give an overview of the data. For the full list of data collected by the scraper, please visit the package ReadMe.

This vignette runs through a brief example of using parole_scraper on a sample of names collected randomly from the DOCCS parolee lookup (mostly John Does, Jane Does, etc.). Note that the scraper can search for individuals in two ways:

- NYSID, if provided and if valid 
- Name and Year of Birth, if first name, last name, and DOB are all provided

(See the ReadMe for more detailed information on how this data is provided in a CSV input file.)
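
The two search routes above can be pictured with a small sketch that decides, row by row, which lookup an input record supports. This is only an illustration and is not the package's own code: the column names (`NYSID`, `First Name`, `Last Name`, `DOB`), the sample rows, and the helper `search_method` are all assumptions made for this example, and the NYSID check is a placeholder that only tests for a non-empty value. The real column layout and NYSID validation are described in the ReadMe.

```python
import csv
import io

# Hypothetical column names and made-up rows, for illustration only.
# The actual input format is documented in the package ReadMe.
SAMPLE_CSV = "\n".join([
    "NYSID,First Name,Last Name,DOB",
    "FAKE-ID-1,,,",
    ",John,Doe,1986-01-10",
    ",Jane,,1990-05-02",
])


def search_method(row):
    """Return 'nysid', 'name_dob', or 'skip' for one input row."""
    if (row.get("NYSID") or "").strip():
        # Placeholder check: the package applies its own NYSID validation.
        return "nysid"
    required = ("First Name", "Last Name", "DOB")
    if all((row.get(key) or "").strip() for key in required):
        return "name_dob"
    # Neither a NYSID nor a complete name-and-DOB record is present.
    return "skip"


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
methods = [search_method(row) for row in rows]
print(methods)
```

Here the third row is skipped because its last name is missing, which mirrors the requirement that first name, last name, and DOB all be provided for a name-based search.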

Because this is a random sample taken from the parole lookup tool itself, and NYSID is not available in any of the DOCCS lookup tools, I was unable to search by NYSID for this example (though I did include some fake input data with NYSIDs to demonstrate that the scraper is able to search by NYSID and exclude invalid NYSIDs). The scraper has also been tested with "real" data for which I had NYSIDs (which I will not use in this example, to protect confidentiality), and it is effective in that usage as well.

First, I import the package and assign the path of my csv file containing my test data (see the ReadMe for detailed information about the csv file format).
(Please note that on my personal computer, dependencies that are already installed are not being recognized during pip install, which is why I installed using --no-deps after confirming that everything is indeed installed. Usually, the package can be installed using the command below.)
```pip install -i https://test.pypi.org/simple/ nys-parole-scraper```

In [4]:
!pip install --no-deps -i https://test.pypi.org/simple/ nys-parole-scraper

Looking in indexes: https://test.pypi.org/simple/
Collecting nys-parole-scraper
  Using cached https://test-files.pythonhosted.org/packages/01/88/9b2b4a6432721ffa63978f0c401936e520134ba51b716776e34e972dbc69/nys_parole_scraper-0.1.0-py3-none-any.whl (14 kB)
Installing collected packages: nys-parole-scraper
Successfully installed nys-parole-scraper-0.1.0


In [6]:
from nys_parole_scraper import nys_parole_scraper

path = "C:/Users/khaye/OneDrive/Documents/Parole_Scraping/DATA_FILES/test_data.csv"

Next, assign the folder in which you would like your output to be created. The output will be a CSV file containing the scraped parole data and an Excel file containing summary statistics (see the ReadMe/docstrings for more information).


In [7]:
dir1 = "C:/Users/khaye/OneDrive/Documents/Parole_Scraping"

Finally, run the package using the path and dir1 variables as parameters.

In [8]:
nys_parole_scraper.parole_scraper(path, dir1)

  lambda x: x.str.contains(conv ,case=False)).any(axis=1).astype(int)


(      ID     DIN:          Name: Date of birth:  Age: Race / ethnicity:  \
 0   3470  22R1450       John Doe     1986-01-10  36.0          HISPANIC   
 0   6803  20A0756       John Doe     1993-01-01  29.0             BLACK   
 0   4467  17A2608       John Doe     1984-01-01  38.0             BLACK   
 0   8598  12A2901       John Doe     1981-04-30  41.0          HISPANIC   
 0   9928  14A0871       John Doe     1982-02-20  40.0          HISPANIC   
 ..   ...      ...            ...            ...   ...               ...   
 0    103  08B1910   John H Brown     1941-08-17  81.0             BLACK   
 0    474  07R3152   John L Brown     1959-04-29  63.0             BLACK   
 0    174  09B2869   John W Brown     1979-11-10  43.0             WHITE   
 0    508  00R5812  Johnnie Brown     1964-04-24  58.0             BLACK   
 0    186  12A1672   Johnny Brown     1960-09-17  62.0             BLACK   
 
    Release to parole supervision:  Months Since Release: Parole status:  \
 0        

The output above shows partial data from the scraped output, and the summary statistics. The data has been exported to my directory as shown: 


    - Documents
        - Parole_Scraping
            - Output_YYYYMMdd_HH.MM.SS
                - parole_full_output_YYYYMMdd_HH.MM.SS.csv
                - summary_statistics_YYYYMMdd_HH.MM.SS.xlsx

The summary statistics tables are found in individual sheets in the summary_statistics Excel file. The text "YYYYMMdd_HH.MM.SS" is a unique date and time key that will be appended to all file names.
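
As a rough illustration of the naming scheme described above, the sketch below builds a date and time key in the same "YYYYMMdd_HH.MM.SS" shape using the standard library. The helper name `output_paths` is hypothetical, and this is not necessarily how the package builds its file names internally; it only reproduces the pattern shown in the directory listing.

```python
from datetime import datetime
from pathlib import Path


def output_paths(base_dir, now=None):
    """Build folder and file paths that follow the naming pattern shown above.

    Hypothetical helper for illustration; not part of nys_parole_scraper.
    """
    now = now or datetime.now()
    # Year, month, day, then hour.minute.second separated by dots.
    key = now.strftime("%Y%m%d_%H.%M.%S")
    out_dir = Path(base_dir) / f"Output_{key}"
    csv_path = out_dir / f"parole_full_output_{key}.csv"
    xlsx_path = out_dir / f"summary_statistics_{key}.xlsx"
    return out_dir, csv_path, xlsx_path


out_dir, csv_path, xlsx_path = output_paths(
    "Parole_Scraping", datetime(2022, 12, 16, 9, 5, 3)
)
print(csv_path.name)  # parole_full_output_20221216_09.05.03.csv
```

Because the key includes seconds, repeated runs normally land in separate folders rather than overwriting earlier output.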

If you would like this data returned as pandas DataFrames in addition to being exported to your directory, you can assign two variables to the function call, with the first variable storing the full scraped output and the second storing a list of summary statistics DataFrames. See below:

In [9]:
full_output, sum_stats = nys_parole_scraper.parole_scraper(path, dir1)

  lambda x: x.str.contains(conv ,case=False)).any(axis=1).astype(int)


Now, the output has once again been exported to my directory as CSV and Excel files, but it is also available as Python objects, with full_output being the scraped data as a pandas DataFrame and sum_stats being a list of pandas DataFrames that contain the summary statistics tables.

In [10]:
full_output.head()

Unnamed: 0,ID,DIN:,Name:,Date of birth:,Age:,Race / ethnicity:,Release to parole supervision:,Months Since Release:,Parole status:,Effective date:,...,County 2,County 3,County 4,County 5,County 6,County 7,County 8,County 9,County 10,Date Info Scraped:
0,3470,22R1450,John Doe,1986-01-10,36.0,HISPANIC,2022-10-27,1.6,In Custody,2022-10-27,...,,,,,,,,,,2022-12-16
0,6803,20A0756,John Doe,1993-01-01,29.0,BLACK,2020-04-03,32.4,Absconded,NaT,...,,,,,,,,,,2022-12-16
0,4467,17A2608,John Doe,1984-01-01,38.0,BLACK,2017-07-27,64.7,Discharged,2018-06-17,...,,,,,,,,,,2022-12-16
0,8598,12A2901,John Doe,1981-04-30,41.0,HISPANIC,2018-08-02,52.5,Discharged,2022-04-14,...,,,,,,,,,,2022-12-16
0,9928,14A0871,John Doe,1982-02-20,40.0,HISPANIC,2015-03-10,93.2,Discharged,2016-03-10,...,,,,,,,,,,2022-12-16
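
Once the scraped data is available as a DataFrame, ordinary pandas operations apply. The sketch below uses a small stand-in DataFrame built from three of the columns and the five rows shown above, so it runs without the scraper; with real output you would use `full_output` in place of `sample`. The variable names are illustrative only.

```python
import pandas as pd

# Stand-in for full_output, copied from the five rows displayed above.
sample = pd.DataFrame(
    {
        "DIN:": ["22R1450", "20A0756", "17A2608", "12A2901", "14A0871"],
        "Parole status:": [
            "In Custody", "Absconded", "Discharged", "Discharged", "Discharged",
        ],
        "Months Since Release:": [1.6, 32.4, 64.7, 52.5, 93.2],
    },
    index=[0, 0, 0, 0, 0],  # the displayed output shows index 0 on every row
)

# Give each row a distinct index so label-based selection is unambiguous.
sample = sample.reset_index(drop=True)

# Keep only discharged individuals and summarize time since release.
discharged = sample[sample["Parole status:"] == "Discharged"]
status_counts = sample["Parole status:"].value_counts()
mean_months = discharged["Months Since Release:"].mean()

print(status_counts)
print(round(mean_months, 1))
```

Note that the column names include the trailing colons exactly as they appear in the scraped output.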


In [11]:
sum_stats

[             Age:  Months Since Release:  Total Convictions
 count  104.000000             104.000000         104.000000
 min     26.000000               1.600000           1.000000
 mean    56.346154             229.463462           1.298077
 max     83.000000             537.700000           4.000000,
   Race/Ethnicity  Count      %
 0          BLACK     82  78.8%
 1          WHITE     12  11.5%
 2       HISPANIC     10   9.6%,
      Age  Count      %
 0   60 +     45  43.3%
 1  51-60     27  26.0%
 2  41-50     14  13.5%
 3  31-40     13  12.5%
 4  21-30      5   4.8%,
   Parole Status  Count      %
 0    Discharged     85  81.7%
 1       Revoked      7   6.7%
 2        Active      4   3.8%
 3      Deceased      4   3.8%
 4    In Custody      2   1.9%
 5     Absconded      2   1.9%,
    Counties All Convictions Count      %
 0                    Albany     4   3.8%
 1                     Bronx     7   6.7%
 2                    Cayuga     1   1.0%
 3                Chautauqua     1

While it is not the primary use of the nys_parole_scraper package, there is another function available for use on the full_output DataFrame: freq_table(), available in the scraper_functions module of the nys_parole_scraper package. This function allows for quick generation of a frequency table for any categorical variable in the output data.

Parameters: freq_table(df, column, col_name)
- df: the pandas DataFrame containing the scraped parole output. In the example above, this is "full_output"
- column: a string; the name of the categorical column you would like to create a frequency table for
- index_name: a string; what you would like to name the column of categories

For example, if I were interested in seeing the frequency of different top charge classes, I could run the code below:

In [12]:
from nys_parole_scraper import scraper_functions as sf

sf.freq_table(full_output, 'Class 1', 'Conviction 1 Classes')

Unnamed: 0,Conviction 1 Classes,Count,%
0,D,34,32.7%
1,C,24,23.1%
2,B,21,20.2%
3,E,18,17.3%
4,A,7,6.7%
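
To make the shape of this output concrete, here is a minimal pandas sketch that produces a table with the same three columns (category, Count, %). It is an approximation written for this vignette, not the package's implementation of freq_table(); the function name `freq_table_sketch` and the five-row sample data are made up for illustration.

```python
import pandas as pd


def freq_table_sketch(df, column, col_name):
    """Approximate a frequency table: category, count, and percent of rows.

    Illustrative stand-in only; see scraper_functions.freq_table for the real one.
    """
    counts = df[column].value_counts()  # sorted from most to least frequent
    table = counts.rename_axis(col_name).reset_index(name="Count")
    share = table["Count"] / table["Count"].sum() * 100
    table["%"] = share.map(lambda p: f"{p:.1f}%")
    return table


# Made-up sample data using the 'Class 1' column name from the example above.
demo = pd.DataFrame({"Class 1": ["D", "D", "D", "C", "C"]})
table = freq_table_sketch(demo, "Class 1", "Conviction 1 Classes")
print(table)
```

In this sketch, missing values are left out of both the counts and the percentages, since `value_counts` drops them by default; the package function may handle them differently.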
