In [115]:
import csv
import pandas as pd
import os
import datetime

## Report 710 Preprocessing
**Input**: A folder that contains the Report 710, in CSV format, for each administrative region in AZ, run on the current date.

**Output**: Raw data in the format of `validation_incarceration_population_person_level`. See the raw data config for that file [here](https://github.com/Recidiviz/pulse-data/blob/main/recidiviz/ingest/direct/regions/us_az/raw_data/us_az_validation_incarceration_population_person_level.yaml).


### Prework

See instructions for how to access and download these reports in the [ACIS Reports Cheat Sheet](https://docs.google.com/document/d/1D_ZsDS7FQMbzychIjRBD6BTi2h31_PLcMNogmz8Ys3Y/edit?usp=sharing) at go/arizona. 

You should start this process with a folder of 16 CSV files with this structure: 

### Process

**Action Required:** Enter the path to the directory where the reports are saved in the `directory_in_str` variable.

In [None]:
directory_in_str = "/Users/elisegonzalez/Downloads/Report 710 Downloaded 2024-11-11"
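Before combining the files, it can help to confirm that the folder holds the 16 CSV files described in the prework. The sketch below is one possible check. `list_csv_files`, `check_report_folder`, and `EXPECTED_FILE_COUNT` are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
import os

# Assumption: the expected count comes from the prework note ("16 CSV files").
EXPECTED_FILE_COUNT = 16


def list_csv_files(directory: str) -> list[str]:
    """Return the sorted names of the .csv files directly inside `directory`."""
    return sorted(name for name in os.listdir(directory) if name.endswith(".csv"))


def check_report_folder(directory: str, expected: int = EXPECTED_FILE_COUNT) -> bool:
    """Print a short summary and return True when the CSV count matches."""
    csv_files = list_csv_files(directory)
    print(f"Found {len(csv_files)} CSV file(s); expected {expected}.")
    return len(csv_files) == expected
```

Calling `check_report_folder(directory_in_str)` before the next cell gives an early warning if a report is missing or was saved in another format.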

Combine all CSV files in the input directory into a single DataFrame. Remove unnecessary information and improper formatting.

In [None]:
all_rows = []
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".csv"):
        file_path = os.path.join(directory_in_str, filename)
        # newline="" is the mode the csv module recommends for reading files.
        with open(file_path, newline="") as f:
            reader = csv.reader(f)
            # Skip the first line of each file.
            next(reader, None)
            for row in reader:
                all_rows.append(row)
    else:
        print("Skipped processing of " + filename)
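`os.listdir` returns entries in arbitrary order, so the order of the rows in `all_rows` can differ between runs. If a stable order matters, for example because a later cell drops the first collected row, the loop could be wrapped in a function that sorts the filenames first. This is a sketch rather than the notebook's own method, and `collect_rows` is a hypothetical helper.

```python
import csv
import os


def collect_rows(directory: str) -> list[list[str]]:
    """Read every .csv in `directory` in name order, skipping each file's first line."""
    rows: list[list[str]] = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".csv"):
            print("Skipped processing of " + filename)
            continue
        with open(os.path.join(directory, filename), newline="") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the first line of the file
            rows.extend(reader)
    return rows
```

With this helper, `all_rows = collect_rows(directory_in_str)` would produce the same rows as the loop above, in a repeatable order.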

In [None]:
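# Drop the first collected row, then keep only columns 27-39 (by position),
# which hold the 13 fields named below.
processed = pd.DataFrame(all_rows[1:]).iloc[:, 27:40]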
processed.columns = [
    "ADC_NUMBER",
    "NAME",
    "CU",
    "IR",
    "MD",
    "MH",
    "ED",
    "WB",
    "WK",
    "SA",
    "SX",
    "DU",
    "LOC",
]
processed.set_index("ADC_NUMBER", inplace=True)
processed["REPORT_DATE"] = datetime.date.today()

**Action Required:** Make sure the processed table is formatted as expected.

In [None]:
processed.head()
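Beyond eyeballing `processed.head()`, a few programmatic checks can catch common formatting problems. The sketch below counts blank and duplicate `ADC_NUMBER` values and compares the columns against the names assigned earlier. `summarize_processed` and `EXPECTED_COLUMNS` are hypothetical names, and whether duplicates or blanks are actually errors in Report 710 is an assumption to confirm against the report itself.

```python
import pandas as pd

# Column names assigned earlier in this notebook, minus the ADC_NUMBER index.
EXPECTED_COLUMNS = [
    "NAME", "CU", "IR", "MD", "MH", "ED", "WB", "WK", "SA", "SX", "DU", "LOC",
    "REPORT_DATE",
]


def summarize_processed(df: pd.DataFrame) -> dict:
    """Return simple counts describing the shape and index quality of `df`."""
    # Treat missing values and whitespace-only strings as blank identifiers.
    index_as_text = (
        pd.Series(df.index, dtype="object").fillna("").astype(str).str.strip()
    )
    return {
        "row_count": len(df),
        "missing_columns": [c for c in EXPECTED_COLUMNS if c not in df.columns],
        "unexpected_columns": [c for c in df.columns if c not in EXPECTED_COLUMNS],
        "blank_adc_numbers": int((index_as_text == "").sum()),
        "duplicate_adc_numbers": int(df.index.duplicated().sum()),
    }
```

Running `summarize_processed(processed)` returns a dictionary that can be inspected alongside the table preview.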

Export the processed data to a CSV file in the input directory. 

In [97]:
processed.to_csv(
    os.path.join(
        directory_in_str, "validation_incarceration_population_person_level.csv"
    )
)

### Test and Upload as Raw Data

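**Action Required:** Copy the commands below into a terminal to upload the processed data to a scratch bucket and test the import to a sandbox. Before running the upload command, replace `os.path.join(...)` with the full path to the exported CSV and `datetime.date.today()` with the current date; both are Python expressions that the shell will not evaluate.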

In [None]:
python -m recidiviz.tools.ingest.operations.upload_raw_state_files_to_ingest_bucket_with_date os.path.join(directory_in_str, 'validation_incarceration_population_person_level.csv') --region us_az --project-id recidiviz-staging --date datetime.date.today() --destination-bucket recidiviz-staging-us-az-test --dry-run False

In [None]:
python -m recidiviz.tools.ingest.operations.import_raw_files_to_sandbox --state-code US_AZ --sandbox-dataset-prefix arizona --source-bucket recidiviz-staging-us-az-test --file-tag-filter-regex validation_incarceration_population_person_level --infra-type legacy
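The upload command contains Python expressions (`os.path.join(...)` and `datetime.date.today()`) that a terminal will not evaluate. One way to avoid editing the command by hand is to build the final string in a notebook cell and copy the printed result. The sketch below does that, using only the flags that appear in this notebook. `build_upload_command` is a hypothetical helper, and the ISO `YYYY-MM-DD` date format is an assumption based on how `datetime.date.today()` prints.

```python
import datetime
import shlex

UPLOAD_MODULE = (
    "recidiviz.tools.ingest.operations."
    "upload_raw_state_files_to_ingest_bucket_with_date"
)


def build_upload_command(csv_path, project_id, destination_bucket, date=None):
    """Return a shell-ready upload command with the path and date filled in."""
    date = date or datetime.date.today()
    args = [
        "python", "-m", UPLOAD_MODULE,
        csv_path,
        "--region", "us_az",
        "--project-id", project_id,
        "--date", date.isoformat(),  # assumption: the script accepts YYYY-MM-DD
        "--destination-bucket", destination_bucket,
        "--dry-run", "False",
    ]
    # shlex.join quotes arguments that contain spaces, such as the download path.
    return shlex.join(args)
```

For example, `print(build_upload_command(os.path.join(directory_in_str, "validation_incarceration_population_person_level.csv"), "recidiviz-staging", "recidiviz-staging-us-az-test"))` prints a command for the scratch bucket; the same helper can be reused with the staging and production project IDs and buckets listed in this notebook.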

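**Action Required:** Copy the commands below into a terminal to upload the processed data to the AZ ingest bucket. In each command, replace `os.path.join(...)` with the full path to the exported CSV and `datetime.date.today()` with the current date, since the shell will not evaluate these Python expressions.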

Staging:

In [None]:
python -m recidiviz.tools.ingest.operations.upload_raw_state_files_to_ingest_bucket_with_date os.path.join(directory_in_str, 'validation_incarceration_population_person_level.csv') --region us_az --project-id recidiviz-staging --date datetime.date.today() --destination-bucket recidiviz-staging-direct-ingest-state-us-az --dry-run False

Production:

In [None]:
python -m recidiviz.tools.ingest.operations.upload_raw_state_files_to_ingest_bucket_with_date os.path.join(directory_in_str, 'validation_incarceration_population_person_level.csv') --region us_az --project-id recidiviz-123 --date datetime.date.today() --destination-bucket recidiviz-123-direct-ingest-state-us-az --dry-run False