# SEC EDGAR database

## What is EDGAR?

[EDGAR](https://www.sec.gov/edgar.shtml) is the Electronic Data Gathering, Analysis, and Retrieval system used at the U.S. Securities and Exchange Commission (SEC). EDGAR is the primary system for submissions by companies and others who are required by law to file information with the SEC. 

Containing millions of company and individual filings, EDGAR benefits investors, corporations, and the U.S. economy overall by increasing the efficiency, transparency, and fairness of the securities markets. The system processes about 3,000 filings per day, serves up 3,000 terabytes of data to the public annually, and accommodates 40,000 new filers per year on average.

## Who has access to EDGAR’s information?

Access to EDGAR’s public database is **free**—allowing you to research, for example, a public company’s financial information and operations by reviewing the filings the company makes with the SEC. You can also research information provided by mutual funds (including money market funds), exchange-traded funds (ETFs), variable annuities, and individuals.

## What was the goal of this project?

Even though the database is comprehensive and companies have been required by law to submit their filings electronically since 1996, it can be hard to download information in bulk. For one part of my PhD project, however, it was necessary to do exactly that.
> I was interested in information from the 10-K and 10-Q filings of US corporations from 1996 through the end of 2019. Fortunately, there is an R package that facilitates this task: `edgar`.

The `edgar` package can be installed and loaded like any other R package:

In [None]:
install.packages('edgar')
library(edgar)

# it is also always a good idea to load the tidyverse
library(tidyverse)

## Getting the master indexes from the database

The filings on the SEC website are organized according to a master index file for each quarter. The code below creates a vector for the sample period and downloads all master indexes to your working directory. It stores them as separate .Rda files for each year.

In [None]:
# Create a vector for the sample period
period <- 1996:2019

# Get the EDGAR master indexes
# (downloads the quarterly master index files for the sample period)
getMasterIndex(period)

## Convert the Master Indexes to a file with all 10-Q and 10-K filings (URLs and Header information)

The separate .Rda files can be combined into a single data frame. The code below starts by creating an empty data frame `out.file`, which is used to store the combined master indexes. Next, the file names of all master indexes are stored in the `file.names` vector. The for loop then loads each master index in turn and appends it to `out.file`.

In [None]:
# Combine the yearly master indexes into one data frame
out.file <- data.frame()
# Skip master.Rda itself in case this cell is run more than once
file.names <- setdiff(Sys.glob("*.Rda"), "master.Rda")
for (file.name in file.names) {
  load(file.name)  # each file provides the object `year.master`
  out.file <- rbind(out.file, year.master)
}

save(out.file, file = "master.Rda")
# this file contains the index of all EDGAR filings from 1996 to 2019 (around 19 million rows)
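
The same combining pattern can be tried on toy data before touching the real index files. The sketch below is only an illustration under stated assumptions: the temporary folder, the file names (`1996master.Rda`, `1997master.Rda`), and the helper `read.master` are hypothetical, and the rows are made up. It loads each file into its own environment, so the loaded object never overwrites anything in the workspace, and then binds all pieces in one call.

```r
# Toy setup: write two small yearly index files into a temporary folder.
# File names and contents are made up for illustration.
tmp.dir <- tempfile("master_demo_")
dir.create(tmp.dir)
for (yr in c(1996, 1997)) {
  year.master <- data.frame(
    cik        = c(1L, 2L),
    form.type  = c("10-K", "10-Q"),
    date.filed = paste0(yr, "-03-31"),
    stringsAsFactors = FALSE
  )
  save(year.master, file = file.path(tmp.dir, paste0(yr, "master.Rda")))
}

# Hypothetical helper: load one file into a private environment and
# return the `year.master` object stored in it.
read.master <- function(path) {
  env <- new.env()
  load(path, envir = env)
  get("year.master", envir = env)
}

# Collect the yearly files and bind them row-wise in a single step.
file.names <- list.files(tmp.dir, pattern = "^[0-9]{4}master\\.Rda$",
                         full.names = TRUE)
combined <- do.call(rbind, lapply(file.names, read.master))
print(nrow(combined))
```

Binding once at the end avoids growing the data frame inside a loop, which may matter when the combined index has millions of rows.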

The approximately 19 million observations in the resulting file represent all filings made during the sample period. However, my research interest focused on the 10-K and 10-Q forms, which are the annual and quarterly reports, respectively. For more information on the different form types, see [Descriptions of SEC forms](https://www.sec.gov/info/edgar/forms/edgform.pdf).

In [None]:
# New data frame with only the 10-Q and 10-K filings
data <- out.file %>% filter(form.type %in% c("10-K", "10-Q"))
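
One detail of this filter is worth seeing on a small example: an exact comparison keeps only the form types spelled exactly `10-K` and `10-Q`, so variants such as amendments (`10-K/A`) are dropped. The sketch below uses a made-up toy index and base R subsetting instead of `dplyr`, purely so that it runs without extra packages. Whether variants belong in the sample is a research decision, not something this example settles.

```r
# Toy master index with made-up rows; the column names follow the code above.
toy.index <- data.frame(
  cik       = c(1L, 1L, 2L, 2L, 3L),
  form.type = c("10-K", "10-K/A", "10-Q", "8-K", "10-K405"),
  stringsAsFactors = FALSE
)

# Exact matching keeps only the two plain form types.
exact <- toy.index[toy.index$form.type %in% c("10-K", "10-Q"), ]

# A prefix match would also keep variants such as amendments.
prefix <- toy.index[grepl("^10-[KQ]", toy.index$form.type), ]

print(exact$form.type)
print(prefix$form.type)
```

Running `table(out.file$form.type)` on the real index before filtering is a quick way to see which variants actually occur.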

Furthermore, in my specific research project, I was only interested in the filings of companies that were also part of the Compustat database. The identifier available to me in both datasets was the company's CIK code. However, the Compustat database is proprietary, so I cannot share the file with the companies' CIK codes. At this point, you should limit your dataset to the companies that matter for your specific problem.

In [None]:
# Read the file with the CIK codes from the quarterly Compustat file
CIK <- read_csv("../CIK.txt", col_names = FALSE)
CIK <- as.integer(unlist(CIK[1, ]))  # the codes are expected in the first row of the file

# Keep only observations from the quarterly Compustat file
data <- data %>% filter(cik %in% CIK)

##### at this point we have a data frame with 10-K and 10-Q filings for the Compustat quarterly universe (incl. URLs) #####
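
When matching on CIK codes, it helps to check that both sides store the code in the same format. The sketch below assumes, for illustration only, that one source stores CIKs as zero-padded strings while the index stores them as integers. All values and the names `compustat.cik` and `toy.index` are made up. In that situation `%in%` silently matches nothing until the codes are converted.

```r
# Hypothetical CIK list stored as zero-padded strings.
compustat.cik <- c("0000320193", "0000789019")

# Toy index with integer CIKs; all rows are made up.
toy.index <- data.frame(
  cik       = c(320193L, 1018724L, 789019L),
  form.type = c("10-K", "10-K", "10-Q"),
  stringsAsFactors = FALSE
)

# Naive match: the integers are compared as text ("320193" vs "0000320193").
naive <- toy.index[toy.index$cik %in% compustat.cik, ]

# Converting the strings to integers drops the leading zeros first.
clean <- toy.index[toy.index$cik %in% as.integer(compustat.cik), ]

print(nrow(naive))
print(nrow(clean))
```

Comparing the number of distinct CIKs before and after the filter is a cheap sanity check that the match behaved as intended.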

The next step in my PhD project involves a text search through all the filings in my data frame. Therefore, I have to download the filings as .txt files from the database. In my case, the number of distinct companies was around 11,000 over the whole sample period. These companies produced more than 364,000 separate 10-K and 10-Q filings. The final folder with all downloads had an approximate size of 1 TB and took me 4 days to download. If your project is similar in spirit, I highly recommend downloading to an external drive or NAS ;-)

In [None]:
# Download all filings and extract the information from the filing headers
# (change the working directory to an external drive or NAS first)
period <- 1996:2019
header.df <- getFilingHeader(cik.no = CIK, form.type = c('10-K', '10-Q'), filing.year = period)
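
Because a download of this size runs for days, one option is to process the CIK codes in batches and save each batch result to disk, so that an interrupted run can be resumed. The sketch below is not part of the `edgar` package: `run.in.batches` is a hypothetical helper, and `fake.fetch` is a stand-in for the real download so that the example runs offline.

```r
# Hypothetical helper: apply `fetch` to batches of ids, store each batch
# result as an .rds file, and skip batches already on disk.
run.in.batches <- function(ids, batch.size, fetch, out.dir) {
  batches <- split(ids, ceiling(seq_along(ids) / batch.size))
  for (b in names(batches)) {
    out.path <- file.path(out.dir, paste0("batch_", b, ".rds"))
    if (file.exists(out.path)) next  # finished in an earlier run
    saveRDS(fetch(batches[[b]]), out.path)
  }
  # Read all batch results back and combine them into one data frame.
  files <- file.path(out.dir, paste0("batch_", names(batches), ".rds"))
  do.call(rbind, lapply(files, readRDS))
}

# Stand-in for the real download step (made-up output).
fake.fetch <- function(ids) data.frame(cik = ids, n.filings = 1L)

out.dir <- tempfile("batches_")
dir.create(out.dir)
result <- run.in.batches(1:7, batch.size = 3, fetch = fake.fetch,
                         out.dir = out.dir)
print(nrow(result))
```

In the real setting, `fetch` could be a small wrapper such as `function(ids) getFilingHeader(cik.no = ids, form.type = c('10-K', '10-Q'), filing.year = period)`, assuming the header data frames returned for different batches share the same columns.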