
This script downloads 10-K filing text (.htm) through the SEC EDGAR API and scrapes specific sections (items).


Theling/10-K-scraper


10-K-scraper

This script downloads 10-K filing text (.htm) through the SEC EDGAR API, scrapes specific sections, and saves them to .txt files. You are welcome to modify the script.

Platform & Dependency:

  • Python 3.6 and standard libraries
  • BeautifulSoup

Introduction:

  • TenKDownloader(CIK, start_time, end_time) returns a TenKDownloader object. CIK can be a single string or a list of strings. See https://www.sec.gov/Archives/edgar/cik-lookup-data.txt to look up the CIKs of the companies you need. In some cases a ticker symbol can be passed instead of a CIK. start_time and end_time are strings in the format %Y%m%d (for example, '20150101').

    Attributes:

    Method:

    • download(path='./data') downloads the corresponding 10-K filings to ./data/<CIK>/date.htm (for example, ./data/6281/20171122.htm). It uses BeautifulSoup to scrape the web page and retrieve each .htm file.

    Data

    • all_url is a Python dictionary (key: CIK; value: list of tuple (date, filing url)).
  • TenKScraper(section, next_section) returns a TenKScraper object. section is the heading where scraping starts, such as 'item 1', and next_section is the heading where it stops. For example, to scrape section 'item 2', create TenKScraper('item 2', 'item 3').

    Attributes:

    Method:

    • scrape(htm_file, txt_file) scrapes the section, writes the text to txt_file, and also returns the text as a string. The implementation is based on the work at http://community.mis.temple.edu/zuyinzheng/pythonworkshop/ and uses regular expressions to recognize bold tags. You can customize the patterns, which are named p1-p13 in the code. Note that the output directory must already exist, but the .txt file itself does not need to.
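
To illustrate the idea of cutting a filing between two item headings, here is a minimal sketch using only the standard library. It is not the p1-p13 patterns from this repository: extract_section is a hypothetical helper, and it makes the simplifying assumption that the real section is the longest span between the two headings (the table of contents usually lists both headings too, but with very little text between them).

```python
import html
import re


def extract_section(raw_html, section, next_section):
    """Return the text from the `section` heading up to `next_section`.

    Hypothetical helper for illustration; not part of the TenK module.
    """
    # Drop tags and decode entities so headings can be matched as plain text.
    text = re.sub(r"<[^>]+>", " ", raw_html)
    text = html.unescape(text)
    text = re.sub(r"\s+", " ", text)

    def heading(name):
        # "Item 1" must not match "Item 1A" or "Item 10", hence the lookahead.
        words = [re.escape(word) for word in name.split()]
        return re.compile(r"\s+".join(words) + r"(?![0-9A-Za-z])", re.IGNORECASE)

    start_re, end_re = heading(section), heading(next_section)
    best = ""
    for start in start_re.finditer(text):
        end = end_re.search(text, start.end())
        if end is None:
            continue
        candidate = text[start.start():end.start()].strip()
        # Assumption: the table-of-contents entry is short, the real section long.
        if len(candidate) > len(best):
            best = candidate
    return best
```

A cross-reference such as "see Item 1B" inside the section body would cut the result short, which is one reason real filings need several patterns rather than one.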

Example

from TenK import TenKDownloader, TenKScraper

company_CIK = ['6281', '6769']
downloader = TenKDownloader(company_CIK, '20150101', '20181101')
downloader.download()

scraper = TenKScraper('Item 1A', 'Item 1B')  # scrape text starting at Item 1A and stopping at Item 1B
scraper2 = TenKScraper('Item 7', 'Item 8')
scraper.scrape('./data/6281/20171122.htm', './data/txt/test.txt') # make sure ./data/txt exists
scraper2.scrape('./data/6769/20180223.htm', './data/txt/test2.txt')
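
When many filings have been downloaded, the input and output paths can be derived from a dictionary shaped like all_url instead of being typed by hand. The sketch below is a hypothetical helper, not part of the TenK module. It assumes that each date in all_url is a %Y%m%d string and that files are saved as ./data/<CIK>/date.htm; the <CIK>_<date>.txt output naming is an arbitrary choice made for this example. It also creates the output directory, since scrape requires it to exist.

```python
import os
from datetime import datetime


def plan_jobs(all_url, data_dir="./data", txt_dir="./data/txt"):
    """Return (htm_path, txt_path) pairs for every filing listed in all_url.

    Hypothetical helper; assumes all_url maps CIK -> list of (date, url)
    tuples with dates written as %Y%m%d strings.
    """
    # Create the output directory up front so writing the .txt files cannot
    # fail because the directory is missing.
    os.makedirs(txt_dir, exist_ok=True)
    jobs = []
    for cik, filings in all_url.items():
        for date, _url in filings:
            # Raises ValueError if the date is not in %Y%m%d format.
            datetime.strptime(date, "%Y%m%d")
            htm_path = os.path.join(data_dir, cik, date + ".htm")
            txt_path = os.path.join(txt_dir, "{}_{}.txt".format(cik, date))
            jobs.append((htm_path, txt_path))
    return jobs
```

Each resulting pair could then be passed to scraper.scrape(htm_path, txt_path) in a loop.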
