US-Congress-Corpora-Builder

A set of Python tools that download U.S. Senate and House transcripts and convert them to plain text.

Usage

sh setup.sh
sh build-corpera.sh

The text transcripts are written to transcripts-txt/, with each file named by chamber of Congress and date.

Roadmap

  • Downloading PDFs by date range
  • Converting them into usable text
  • Separating the text by speaker and eliminating non-spoken text (see SeperateSpeeches.py)
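
To illustrate the last roadmap item, here is a minimal sketch of splitting a transcript into speaker turns. It is not the code in SeperateSpeeches.py. It assumes each turn begins at the start of a line with a label such as "Mr. SMITH.", "Ms. JONES of New York.", or "The PRESIDING OFFICER.", which is how turns commonly appear in the Congressional Record. The names `SPEAKER_RE` and `split_speeches` are hypothetical, and the pattern does not try to strip non-spoken text such as procedural notes.

```python
import re

# Assumed label format (not taken from this repository): a title plus an
# upper-case surname, optionally followed by "of <State>", or a role such
# as "The PRESIDING OFFICER", ending with a period at the start of a line.
SPEAKER_RE = re.compile(
    r"^(?P<speaker>(?:Mr|Mrs|Ms)\. [A-Z][A-Za-z]*[A-Z]"
    r"(?: of [A-Z][a-z]+(?: [A-Z][a-z]+)*)?"
    r"|The [A-Z]{2,}(?: [A-Z]{2,})*)\.\s+",
    re.MULTILINE,
)


def split_speeches(text):
    """Return a list of (speaker, speech) tuples in document order."""
    matches = list(SPEAKER_RE.finditer(text))
    speeches = []
    for i, match in enumerate(matches):
        # A speech runs from the end of its label to the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse line breaks and repeated spaces inside the speech.
        body = " ".join(text[match.end():end].split())
        speeches.append((match.group("speaker"), body))
    return speeches


if __name__ == "__main__":
    sample = (
        "Mr. SMITH. I rise today in support of the bill.\n"
        "It is long overdue.\n"
        "The PRESIDING OFFICER. Without objection, it is so ordered.\n"
    )
    for speaker, speech in split_speeches(sample):
        print(f"{speaker}: {speech}")
```

Text that appears before the first recognized label is dropped, which is a simple way to skip page headers but would also discard any speech whose label the pattern misses.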