Jasper Ginn edited this page Jul 18, 2016 · 9 revisions

TenK is an R package aimed at simplifying the collection of SEC 10-K annual reports. It contains the following features:

  1. Robust scraping and parsing of reports using the rvest package.
  2. Resolution of FTP URLs to their HTML counterparts, which speeds up document retrieval and provides useful additional metadata.
  3. Cleaning of the scraped text, returning either the full report or just its business description.

This document introduces basic usage of the TenK package.

A copy of this documentation is available via R in PDF format. To view it, execute vignette("TenK") in your R console.

1. Package information

1.1 Known issues

TenK correctly scrapes approximately 90% of all business descriptions. When extraction fails, the cause is usually one of the following:

  1. The business description has been omitted
  2. The business description is located somewhere at the end of the document
  3. The report uses unconventional paragraph styles (in this case the program cannot find the description and returns "NA").
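
Because a failed extraction shows up as "NA", it can be useful to check a batch of results before analyzing them. The following is a minimal sketch in base R, assuming the business descriptions have already been collected into a character vector. The `descriptions` vector and the `is_missing_description` helper are hypothetical and not part of TenK. Since it is not stated here whether the result is a true missing value or the literal string "NA", the sketch treats both as missing.

```r
# Hypothetical helper (not part of TenK): flags entries where no business
# description was found. Assumes a failure shows up either as a missing
# value (NA) or as the literal string "NA".
is_missing_description <- function(x) {
  is.na(x) | trimws(x) == "NA"
}

# Made-up example data standing in for scraped business descriptions
descriptions <- c(
  "We design and sell widgets.",
  NA,
  "NA",
  "We operate retail stores."
)

missing <- is_missing_description(descriptions)

# Keep only the usable descriptions and report how many were dropped
usable <- descriptions[!missing]
message(sum(missing), " of ", length(descriptions), " descriptions are missing")
```

Reports flagged this way are candidates for manual inspection against the three causes listed above.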

1.2 How does TenK work?

The main function in this package, TenK_process, takes as its input the URL of a 10-K report. The URL can point to either the FTP or the HTML version of the report. If the user passes an FTP URL, TenK_process automatically determines the corresponding HTML version and collects useful metadata. If the user passes an HTML URL, TenK_process likewise collects the metadata and returns the scraped text. Currently, TenK_process returns either the full 10-K report or only the business description section.
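
The first decision described above (is the input an FTP or an HTML URL?) can be sketched in base R. This only illustrates the branching and is not TenK's actual implementation: the `classify_report_url` helper and the example URLs are hypothetical, and the sketch assumes that the two versions can be told apart by URL scheme alone.

```r
# Hypothetical helper (not part of TenK): decide whether a report URL
# points to the FTP or the HTML version, based on its scheme.
# Assumption: FTP versions start with ftp://, HTML versions with http(s)://.
classify_report_url <- function(url) {
  if (grepl("^ftp://", url, ignore.case = TRUE)) {
    "ftp"
  } else if (grepl("^https?://", url, ignore.case = TRUE)) {
    "html"
  } else {
    stop("Unsupported URL scheme: ", url)
  }
}

# Placeholder URLs for illustration only
ftp_kind  <- classify_report_url("ftp://example.com/path/to/report.txt")
html_kind <- classify_report_url("https://example.com/path/to/report.htm")
```

In this sketch, an "ftp" result would correspond to the branch in which the HTML version is resolved first, and an "html" result to the branch that proceeds directly to scraping.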

The figure below schematically outlines this process.

[Figure: Scraping]
