Distiller

Distiller provides easier access to audit data. What was a multi-day process for some agencies has been reduced to a few minutes: the Distiller helps grant managers sort through the Federal Audit Clearinghouse to identify the specific audits they need to know have been completed and confirm whether action needs to be taken by themselves or others. The tool is also helpful to auditors, agency CFOs, and grantees who want to stay aware of audit activity by federal agency and sub-agency.

This codebase spun out of the 10x Federal Grant Reporting Project, which is exploring simpler, faster, and better resolution of single audit findings by agencies and grantees alike.

Features of the Distiller:

  • quickly match Assistance Listings (CFDA #s) from beta.sam.gov to parent and sub-agencies
  • filter by agency/sub-agency, audits with findings, cognizant and oversight agencies, and findings by agency
  • sort by grantee name, number of findings, fiscal year end, audit accepted date, name of cognizant or oversight agency, and whether or not an audit has questioned costs
  • view only the findings for which your agency/sub-agency is the cognizant or oversight agency and is responsible for resolving
  • copy findings text and corrective action plans directly from the Distiller
  • download a CSV of your search results so you can process the data further or upload it to an agency's case management system

Visit the Distiller prototype

Audit PDF Extraction

This is a proof-of-concept module that is a companion to the Distiller. Download any audit PDF from the Distiller and run it through this module, which reads Single Audit PDFs page by page and extracts findings text and corrective action plans that can be viewed as a CSV.

This saves time for grants managers searching for audit findings prior to fiscal year 2019 (before this information was available in the Federal Audit Clearinghouse's Data Collection Form) and for anyone who wants to copy findings and corrective action plan text into another place.

This reduces errors. We've heard that copying and pasting from audit PDFs is unreliable due to PDF formatting, so agency staff have resorted to retyping this information into their grants management tracking systems (which vary from spreadsheets to more robust case management systems). Retyping is prone to error, in addition to taking time away from other tasks. Auditors and grantees who need to copy the text of findings and corrective action plans into the Federal Audit Clearinghouse's Data Collection Form can more easily copy text from a CSV instead of a poorly formatted or image-based PDF.

Results will vary with this tool: the text in these PDFs is nowhere near standardized, and some older PDFs contain images instead of readable text, which will limit any effort to extract text from them. We recommend updating the policy on single audit format requirements to require findings to be written in a standardized format, making them easier and more reliable to read with natural language processing and to compare across audits for risk management.
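To give a sense of the approach, here is a minimal, illustrative sketch of page-by-page extraction. It assumes pdfminer.six (the actual module may use a different PDF library), and the keyword heuristics are hypothetical:

# Illustrative sketch only; the extraction module in this repository has its
# own implementation. Assumes pdfminer.six; keyword heuristics are hypothetical.
import csv
from pdfminer.high_level import extract_text

def extract_findings(pdf_path, csv_path):
    # pdfminer separates pages with form-feed characters.
    pages = extract_text(pdf_path).split("\f")
    rows = [
        {"page": number, "text": text.strip()}
        for number, text in enumerate(pages, start=1)
        if "Finding" in text or "Corrective Action Plan" in text
    ]
    with open(csv_path, "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=["page", "text"])
        writer.writeheader()
        writer.writerows(rows)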

See the corresponding module for more details.

How you can help

Visit the Distiller prototype on cloud.gov or test out the Single Audit PDF Extraction module and let us know if it's useful or if something isn't working as expected. Email the team at federal-grant-reporting@gsa.gov or file an issue and assign it to @bpdesigns.

Local development

To develop against a Docker-based Postgres DB and a local virtual environment, follow these steps.

Dependencies

  1. Install [Docker](https://www.docker.com/). If you're on OS X, install Docker for Mac. If you're on Windows, install Docker for Windows.

  2. Install Python dependencies into a Pipenv-managed virtual environment

pipenv install --dev

NOTE: Python 3.7 is required.

Local settings

Optionally, create a local settings file in distiller/settings/local.py:

import os  # explicit import, in case the wildcard import below doesn't provide os

from .development import *

CHROME_DRIVER_LOCATION = os.path.join(BASE_DIR, 'chromedriver')

Database server

To start a database server, run one of these commands:

# Run in the foreground
docker-compose up db
# Run in the background
docker-compose up -d db

Initialize database

Create database tables:

docker-compose run app python manage.py migrate

Create a Django admin user to access the admin interface:

docker-compose run app python manage.py createsuperuser

Start a development webserver

pipenv run python manage.py runserver

Visit http://localhost:8000/ directly to access the site.

You can access the admin panel at http://localhost:8000/admin by logging in with the super user credentials you created in the step above.

Running ETLs

This application relies on external data sources. To populate the database with required data, run these ETL jobs.

Download source tables

To load, first download the source table dumps. By default, these tables will be placed in timestamped directories under /imports. In production, these files will be placed in an S3 bucket.

You may download specific tables, or all tables at once.

pipenv run python manage.py download_table --all
pipenv run python manage.py download_table --audit
pipenv run python manage.py download_table --cfda
pipenv run python manage.py download_table --finding
pipenv run python manage.py download_table --findingtext
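These are standard Django management commands. A minimal sketch of how a command like download_table might wire up these flags is shown below; the option handling and helper are illustrative, not the repository's actual implementation:

# Illustrative sketch of a Django management command exposing --all/--audit/etc.
# flags; the real download_table command differs.
from django.core.management.base import BaseCommand

TABLES = ("audit", "cfda", "finding", "findingtext")

class Command(BaseCommand):
    help = "Download FAC source table dumps into a timestamped import directory."

    def add_arguments(self, parser):
        parser.add_argument("--all", action="store_true", help="Download every table.")
        for table in TABLES:
            parser.add_argument(f"--{table}", action="store_true",
                                help=f"Download the {table} table.")

    def handle(self, *args, **options):
        selected = TABLES if options["all"] else [t for t in TABLES if options[t]]
        for table in selected:
            self.stdout.write(f"Downloading {table}...")
            # A hypothetical download_dump(table) helper would fetch the dump here.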

Load tables

pipenv run python manage.py load_table --all
pipenv run python manage.py load_table --audit
pipenv run python manage.py load_table --cfda
pipenv run python manage.py load_table --finding
pipenv run python manage.py load_table --findingtext

Cloud.gov jobs

In the deployed environment, django-apscheduler is used to refresh all tables daily at 12:00 AM EST.

At 1:00 AM EST, a complete crawl of the 2019 documents is performed, and the results are then refreshed in the database.

NOTE: django-apscheduler is not an optimal choice for a cloud.gov deployment, because jobs are run in the same container as the web server. Usage of cf run-task would be preferred. Future work should involve scheduling tasks with cf run-task in an outside environment, such as a CircleCI periodic job.
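For reference, a minimal sketch of scheduling such a daily refresh with django-apscheduler follows; the refresh_tables helper and trigger details are assumptions, not the project's actual code:

# Illustrative sketch, assuming django-apscheduler; refresh_tables is hypothetical.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger
from django_apscheduler.jobstores import DjangoJobStore

def refresh_tables():
    """Hypothetical helper that re-runs the download_table/load_table ETL jobs."""

scheduler = BackgroundScheduler(timezone="America/New_York")
scheduler.add_jobstore(DjangoJobStore(), "default")
scheduler.add_job(
    refresh_tables,
    trigger=CronTrigger(hour=0, minute=0),  # daily at midnight Eastern
    id="refresh_fac_tables",
    replace_existing=True,
)
scheduler.start()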

requirements.txt

To install the spaCy English language pack (en_core_web_sm) on cloud.gov, we must explicitly specify this dependency as a file download in requirements.txt (see the related spaCy GitHub issue). Unfortunately, this also means maintaining both Pipenv's Pipfile and a requirements file.

When updating package versions, requirements.txt can be regenerated by running pipenv run pip freeze > requirements.txt, removing the en_core_web_sm package line, and then re-adding the en_core_web_sm HTTPS release link so that cloud.gov can install the language pack correctly.
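For reference, the resulting requirements.txt entry might look like the line below; the pinned model version is illustrative, not necessarily the one this project uses:

# Replace the bare en-core-web-sm line from pip freeze with the direct release URL
# (version shown is illustrative):
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz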

cf-service-connect

For local debugging, install the cf service-connect plugin and run ./bin/tunnel_cf_db to access the postgres database on cloud.gov.

Scraping FAC documents

Scrapy is used to download documents from the Federal Audit Clearinghouse website.

Here are some example crawls:

# Crawl all documents processed by FAC between 9/1/2019 and 9/5/2019
pipenv run scrapy crawl fac -a date_processed_from=09/01/2019 -a date_processed_to=09/05/2019
# Crawl all documents from prior year with CFDA `11.*`
pipenv run scrapy crawl fac -a cfda=11

# Crawl all documents from prior year with CFDA `11.2*`
pipenv run scrapy crawl fac -a cfda=11.2

# Crawl all documents from prior year with CFDA `11.123`
pipenv run scrapy crawl fac -a cfda=11.123

# Crawl all documents from prior year with CFDA `11.123`
# Also, for debugging, open a copy of each search results page in a browser.
pipenv run scrapy crawl fac -a cfda=11.123 -a open_pages=1

In production usage, metadata on these documents should be output to a file so it can be loaded into the Distiller database. Use the -t and -o Scrapy options to specify a format and target file name:

pipenv run scrapy crawl fac -a cfda=11.123 -t json -o test.json
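For context, Scrapy passes each -a name=value pair to the spider as a constructor keyword argument. A minimal sketch of how the fac spider likely receives these options follows; the attribute handling is an assumption, not the actual implementation:

# Illustrative sketch of how `scrapy crawl fac -a ...` arguments reach the spider;
# the real spider's logic differs.
import scrapy

class FacSpider(scrapy.Spider):
    name = "fac"  # matches `scrapy crawl fac`

    def __init__(self, cfda=None, date_processed_from=None,
                 date_processed_to=None, open_pages=None, **kwargs):
        super().__init__(**kwargs)
        # Each -a option arrives as a string keyword argument (assumed handling).
        self.cfda = cfda
        self.date_processed_from = date_processed_from
        self.date_processed_to = date_processed_to
        self.open_pages = bool(open_pages)

    def start_requests(self):
        # The real spider builds Federal Audit Clearinghouse search requests from
        # these parameters; this placeholder only illustrates the entry point.
        return []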

There are two scripts checked into the repository that will assist in refreshing a subset of CFDA prefixes on a cloud.gov deployment. To recrawl the entire FAC:

cf run-task demo-fac-distiller --command "/home/vcap/app/bin/crawl" -m 2G

To crawl all documents since the last FAC load, using the "date processed" FAC field:

cf run-task demo-fac-distiller --command "/home/vcap/app/bin/crawl_update" -m 2G

Deployment to cloud.gov

To expedite new deploys to cloud.gov, we provide a deployment script at bin/setup-new-cf-instance.sh. Please note that you must be logged in to cloud.gov (Cloud Foundry) with the appropriate permissions to create an app and its associated services.

Running tests

To run the test suite with pytest:

pipenv run pytest

Contributing

See CONTRIBUTING for additional information.

Public domain

This project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.
