# Scrape EDGAR advanced search
### Scrape the full-text search results from EDGAR to a local file

This code is best suited to exploring and extracting a small sample, where you have specific targets and a refined keyword.

For bulk execution, it is better to download the filing index and search the content of each filing one by one. A great reference is http://kaichen.work/?p=681 from Professor Kai Chen. He is a great coding mentor for academic research, and I learned a lot from him.
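As a rough illustration of the bulk route, the sketch below only builds the URLs of EDGAR's quarterly index files for a date range; it does not download anything. The function name `quarterly_index_urls` and the default `form.idx` are my own choices for this example, not something taken from Professor Chen's post.

```python
from datetime import date

# Root folder of the EDGAR quarterly full index
FULL_INDEX_ROOT = "https://www.sec.gov/Archives/edgar/full-index"


def quarterly_index_urls(start, end, index_name="form.idx"):
    """Return one index URL per calendar quarter that overlaps [start, end].

    Hypothetical helper for illustration only.
    """
    urls = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    last = (end.year, (end.month - 1) // 3 + 1)
    while (year, quarter) <= last:
        urls.append("{}/{}/QTR{}/{}".format(FULL_INDEX_ROOT, year, quarter, index_name))
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return urls


urls = quarterly_index_urls(date(2020, 10, 1), date(2021, 3, 31))
for u in urls:
    print(u)
```

Each index file lists the filings of that quarter, so you can filter by form type locally before fetching any filing content.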

#### Webpage: https://www.sec.gov/edgar/search/

There is another SEC search engine at https://www.sec.gov/cgi-bin/srch-edgar.
Its advantage is that it displays all search results. Its drawback is that you cannot select a specific filing type.

#### Last updated on: 03/02/2021
#### Created by: Lydia Lu Tong

In [1]:
# Import packages
# Make sure these packages are already installed
import pandas as pd
import numpy as np
import re
import time
import io
from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager #this ensures that you have the latest driver exe

In [2]:
# Function to build the search URL for the selected filing types and date range
def search_main(filetype, datefrom="2019-01-01", dateto="2019-12-31"):

    daterange = "dateRange=custom&category=custom&startdt={}&enddt={}".format(datefrom, dateto)

    if len(filetype) == 0:
        filerange = ""
    else:
        # Several filing types are joined with an encoded comma; a single type is left as is
        filerange = "&forms=" + "%252C".join(filetype)
    # No leading space here, otherwise the URL is malformed
    root = "https://www.sec.gov/edgar/search/#/"
    main = root + daterange + filerange

    return main

In [3]:
# Function to type the keyword or phrase you want to search and submit the search
def search_text(yourtext):
    # find_elements with the "xpath" locator string works in both Selenium 3 and Selenium 4
    input_text = driver.find_elements("xpath", "//div/input[@class='company form-control border-onfocus hide-on-short-form text-black']")
    input_text[0].send_keys(yourtext)
    button_search = driver.find_elements("xpath", "//div/button[@id='search']")
    button_search[0].click()

### Main

In [4]:
### Set up 
adsh = []
file = []
name = []
time_file = []
time_report = []
list_cik = []
list_entity = []
driver = webdriver.Chrome(ChromeDriverManager().install())



Current google-chrome version is 90.0.4430
Get LATEST driver version for 90.0.4430
Driver [C:\Users\ltong\.wdm\drivers\chromedriver\win32\90.0.4430.24\chromedriver.exe] found in cache


In [5]:
### Set up the selection
filetype = ['20-F']
datefrom = "2020-10-01"
dateto = "2020-12-31"
###Set up your search keyword below
yourtext = r''


main = search_main(filetype,datefrom,dateto)
driver.get(main)
time.sleep(1)
search_text(yourtext)

#### You can also manually adjust the search criteria in the browser here and then continue to the next code block

#### The search returns at most 1000 results. Please narrow your time span and filing types to stay within the limit. If the number of search results exceeds 1000, the website displays only 1000 of them, so this code captures only those 1000 observations.
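One way to stay under the limit is to cut a long date range into shorter windows and run the search once per window. The sketch below is only a suggestion: `split_date_range` is a hypothetical helper of mine, and the right window length depends on how many filings your form types produce.

```python
from datetime import date, timedelta


def split_date_range(datefrom, dateto, days=30):
    """Split [datefrom, dateto] into consecutive windows of at most `days` days.

    Dates are "YYYY-MM-DD" strings, the same format used for datefrom and dateto.
    """
    start = date.fromisoformat(datefrom)
    end = date.fromisoformat(dateto)
    windows = []
    while start <= end:
        # Each window covers `days` calendar days, clipped to the overall end date
        stop = min(start + timedelta(days=days - 1), end)
        windows.append((start.isoformat(), stop.isoformat()))
        start = stop + timedelta(days=1)
    return windows


windows = split_date_range("2020-10-01", "2020-12-31", days=31)
for w_from, w_to in windows:
    print(w_from, w_to)
```

Each pair can then be passed to `search_main(filetype, w_from, w_to)`. If a window still reports more than 1000 results, shorten `days` and try again.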

In [6]:
### Get the total number of result pages
elements = driver.find_elements("xpath", "//ul/li/a[@class='page-link']")
temp = []
for i in elements:
    try:
        t = int(i.text)
    except ValueError:
        # Non-numeric pagination links carry no page number
        t = 0
    temp.append(t)
# If no pagination links are found at all, fall back to a single page
maxpage = max(temp, default=0)
if maxpage == 0:
    maxpage = 1
print(maxpage)

1


In [7]:
### Start scraping
mainnow = driver.current_url
for page_num in range(1, maxpage + 1):
    page = mainnow + "&page={}".format(page_num)
    driver.get(page)
    ### Select additional columns you need
    driver.execute_script("window.scrollTo(0, 750)")
    time.sleep(1)
    col_cik = driver.find_elements("xpath", "//div[@class='form-check form-check-inline show-columns-checkbox']/input[@id='col-cik']")
    # Tick the CIK box only if it is not ticked yet, so that later pages do not untick it
    if not col_cik[0].is_selected():
        col_cik[0].click()

    filing = driver.find_elements("xpath", "//table/tbody/tr/td[@class='filetype']/a")
    filed = driver.find_elements("xpath", "//table/tbody/tr/td[@class='filed']")
    report = driver.find_elements("xpath", "//table/tbody/tr/td[@class='enddate']")
    entity = driver.find_elements("xpath", "//table/tbody/tr/td[@class='entity-name']")
    cik = driver.find_elements("xpath", "//table/tbody/tr/td[@class='cik']")

    # Use a separate row index so the page counter is not shadowed
    for row in range(len(filed)):
        time_file.append(filed[row].text)
        time_report.append(report[row].text)
        list_entity.append(entity[row].text)
        list_cik.append(cik[row].text.replace('CIK ', ''))
        data_adsh = filing[row].get_attribute('data-adsh')
        data_file = filing[row].get_attribute('data-file-name')
        adsh.append(data_adsh)
        file.append(data_file)
        name.append(filing[row].text)

In [8]:
print(len(list_entity))

56


In [9]:
df = pd.DataFrame({'entity': list_entity, 'cik': list_cik, 
                   'time_filed': time_file,'time_report':time_report,
                   'file_name':name,'file_link':file,
                  'file_num':adsh})
df.head()

| | entity | cik | time_filed | time_report | file_name | file_link | file_num |
|---|---|---|---|---|---|---|---|
| 0 | ATIF Holdings Ltd (ATIF) | 1755058 | 2020-12-31 | 2020-07-31 | 20-F (Annual report - foreign issuer) | tm2038402d1_20f.htm | 0001104659-20-140671 |
| 1 | OneSmart International Education Group Ltd (ONE) | 1722380 | 2020-12-31 | 2020-08-31 | 20-F (Annual report - foreign issuer) | one-20200831x20f.htm | 0001104659-20-140691 |
| 2 | Vision Marine Technologies Inc. (VMAR) | 1813783 | 2020-12-30 | 2020-08-31 | 20-F (Annual report - foreign issuer) | tm2039477d1_20f.htm | 0001104659-20-140616 |
| 3 | Bright Scholar Education Holdings Ltd (BEDU) | 1696355 | 2020-12-23 | 2020-08-31 | 20-F (Annual report - foreign issuer) | bedu-20200831x20f.htm | 0001104659-20-139121 |
| 4 | Borqs Technologies, Inc. (BRQS, BRQSW) | 1650575 | 2020-12-18 | 2019-12-31 | 20-F/A (Annual report - foreign issuer) | f20f2019a1_borqstechnologies.htm | 0001213900-20-043529 |


In [10]:
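# Replace "/" in filing types such as 20-F/A so that the file name stays valid
df.to_excel("{}_{}_{}.xlsx".format("-".join(filetype).replace("/", "-"), datefrom, dateto), index=False)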

You can access the FULL filing content by using the CIK number and the file number.

SEC filings have a fixed weblink structure.

Example filing links that show the pattern:

https://www.sec.gov/Archives/edgar/data/0001701261/000110465919077298/0001104659-19-077298-index.html

https://www.sec.gov/Archives/edgar/data/1404935/000147793219001558/filename1.htm
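
The pattern in the two links above can be turned into small helpers. This is a minimal sketch based only on those two examples: the function names are hypothetical, `accession` corresponds to the `file_num` column and `file_name` to the `file_link` column of the table saved earlier. The examples show the CIK both with and without leading zeros; this sketch assumes the form without them.

```python
# Root folder taken from the example links
ARCHIVE_ROOT = "https://www.sec.gov/Archives/edgar/data"


def filing_folder(cik, accession):
    """Folder of one filing: CIK, then the accession number without dashes."""
    return "{}/{}/{}".format(ARCHIVE_ROOT, int(cik), accession.replace("-", ""))


def filing_index_url(cik, accession):
    """Index page that lists every document in the filing."""
    return "{}/{}-index.html".format(filing_folder(cik, accession), accession)


def filing_document_url(cik, accession, file_name):
    """Direct link to one document inside the filing."""
    return "{}/{}".format(filing_folder(cik, accession), file_name)


# First row of the table above (ATIF Holdings Ltd)
index_url = filing_index_url("1755058", "0001104659-20-140671")
doc_url = filing_document_url("1755058", "0001104659-20-140671", "tm2038402d1_20f.htm")
print(index_url)
print(doc_url)
```

If you download these links in a script, note that the SEC asks automated clients to identify themselves with a descriptive User-Agent header and to keep the request rate low.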