Home
TenK is an R package aimed at simplifying the collection of SEC 10-K annual reports. It contains the following features:
- Robust scraping and parsing of reports using the rvest package
- Resolves FTP URLs to their HTML counterparts, which speeds up document retrieval and adds useful metadata.
- Cleans and returns either full reports or just the business description for each report.
This document introduces basic usage of the TenK package.
A copy of this documentation is available via R in PDF format. To view it, execute vignette("TenK") in your R console.
- Package name: TenK
- Version: 0.01
- Documentation
- Report an issue
TenK can correctly scrape approximately 90% of all business descriptions. When scraping fails, it is usually for one of the following reasons:
- The business description has been omitted
- The business description is located somewhere at the end of the document
- The report uses unconventional paragraph styles (this will result in the program being unable to find the description and returning "NA").
The main function in this package, TenK_process, takes as its input a URL belonging to a 10-K report. The URL can point to either the FTP or the HTML version of the report. If the user passes an FTP URL, TenK_process automatically determines the HTML version and collects useful metadata. If the user passes an HTML URL, TenK_process also collects metadata and returns the scraped text. Currently, TenK_process returns either the full 10-K report or only the business description section.
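To make the FTP-to-HTML resolution described above more concrete, here is a minimal sketch in base R. It is not TenK's actual implementation. The helper name ftp_to_html_index is hypothetical, the CIK and accession number in the example are made up, and both URL layouts (the FTP path and the HTML index path under /Archives/edgar/data/) are assumptions about how EDGAR organizes filings.

```r
# Hypothetical helper: map an EDGAR FTP filing URL to its HTML index page.
# Illustrative only; this is NOT how TenK_process is known to do it.
ftp_to_html_index <- function(ftp_url) {
  # Assumed FTP layout: ftp://ftp.sec.gov/edgar/data/<CIK>/<accession>.txt
  pattern <- paste0(
    "^ftp://ftp\\.sec\\.gov/edgar/data/",
    "([0-9]+)/",                               # group 1: CIK
    "([0-9]{10}-[0-9]{2}-[0-9]{6})\\.txt$"     # group 2: accession number
  )
  if (!grepl(pattern, ftp_url)) {
    stop("URL does not look like an EDGAR FTP filing URL: ", ftp_url)
  }

  cik       <- sub(pattern, "\\1", ftp_url)
  accession <- sub(pattern, "\\2", ftp_url)

  # Assumed HTML layout: the folder name is the accession number without dashes.
  folder <- gsub("-", "", accession, fixed = TRUE)

  paste0(
    "https://www.sec.gov/Archives/edgar/data/",
    cik, "/", folder, "/", accession, "-index.htm"
  )
}

# Made-up CIK and accession number, used only to show the string rewrite.
example_url <- "ftp://ftp.sec.gov/edgar/data/1234567/0001234567-15-000001.txt"
html_index  <- ftp_to_html_index(example_url)
print(html_index)
```

The sketch only rewrites strings and never contacts the network. Under these assumptions, the metadata collection would be a separate step that reads the resulting index page.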
The figure below schematically outlines this process.