## Web Scraping the SEC query page

Previously, we retrieved filings from the EDGAR archive by manipulating URLs directly.

This time, our goal is to use the search tool of the EDGAR database.
It lets us run a query and filter the filings by:
- form type
- company name 
- date

We need to keep the scope of the search results in mind: several companies can share the same name, so a name-based query may match more than one of them.

Links:
- Information about filings: https://www.sec.gov/oiea/Article/edgarguide.html
- Search query page: https://www.sec.gov/edgar/searchedgar/companysearch

In [2]:
# importing libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

### Defining parameters for the Search

We need to manually build a URL that encodes the search we have defined.

Here is a list of potentially important parameters:
- action
- CIK/ticker
- type
- dateb
- owner
- start
- output
- count
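As a small aid for the `dateb` parameter: the value used for it in this section is `'20220101'`, which looks like a date written as `YYYYMMDD`. Assuming that format is what the search expects, a date object can be converted as sketched here. The helper name `to_dateb` is hypothetical and not part of any library.

```python
from datetime import date


def to_dateb(day: date) -> str:
    """Format a date as an eight-digit YYYYMMDD string.

    Assumption: this is the format the `dateb` parameter expects,
    inferred from the value '20220101' used in this section.
    """
    return day.strftime("%Y%m%d")


print(to_dateb(date(2022, 1, 1)))
```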

After building the URL, we can send a request to the endpoint.

In [14]:
# base URL
endpoint = r"https://www.sec.gov/cgi-bin/browse-edgar"

# define a parameters dictionary
param_dict = {'action':'getcompany',
              'CIK':'META',
              'type':'10-k',
              'dateb':'20220101',
              'owner':'exclude',
              'start':'',
              'output':'',
              'count':'100',
             }

# build the query string by joining each "key=value" pair with "&"
param_url = "&".join(key + "=" + value for key, value in param_dict.items())

# define headers; the SEC asks automated clients to identify themselves
# in the User-Agent header, so use a descriptive value here
headers = {'user-agent':'William',
          'Host':'www.sec.gov'}

# request the url content
request_url = endpoint + "?" + param_url
content = requests.get(request_url, headers=headers)
soup = BeautifulSoup(content.content, 'html.parser')

# stop on a failed request, then let the user know it succeeded
content.raise_for_status()
print('Request successful.')
print(request_url)

Request successful.
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=META&type=10-k&dateb=20220101&owner=exclude&start=&output=&count=100
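Concatenating `key=value` pairs by hand works for the values used here, but it would break for values containing spaces or `&`. As a hedged alternative, the standard library's `urllib.parse.urlencode` percent-encodes each value. The helper name `build_search_url` is hypothetical. The endpoint and parameters are the ones from the cell above.

```python
from urllib.parse import urlencode

ENDPOINT = "https://www.sec.gov/cgi-bin/browse-edgar"


def build_search_url(params: dict) -> str:
    """Return the endpoint followed by a percent-encoded query string."""
    return ENDPOINT + "?" + urlencode(params)


# same parameters as in the notebook cell
url = build_search_url({
    "action": "getcompany",
    "CIK": "META",
    "type": "10-k",
    "dateb": "20220101",
    "owner": "exclude",
    "start": "",
    "output": "",
    "count": "100",
})
print(url)
```

For these parameters the result is the same URL as the hand-built one, since none of the values need escaping.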


### Time to parse the soup

We now have an HTML page containing the results of our query.
We can retrieve the details of each filing by parsing the results table in that page.

In [21]:
# find the document table with data
table = soup.find_all('table', class_='tableFile2')

# define a base url for building the link of each filing
base_url = r"https://www.sec.gov"

master_list = []

# loop through each row of the first matching table
for row in table[0].find_all('tr'):
    
    # get columns
    cols = row.find_all('td')
    
    # skip rows without <td> cells (e.g. the header row)
    if len(cols) != 0:
        
        # save important information
        filing_type = cols[0].text.strip()
        filing_date = cols[3].text.strip()
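        # this cell holds the file number followed by the film number,
        # so .text returns them joined together (e.g. 001-35551 + 21561789)
        filing_numb = cols[4].text.strip()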
        filing_doc_href = cols[1].find('a', {'href':True, 'id':'documentsbutton'})
        filing_num_href = cols[4].find('a')
        filing_int_href = cols[1].find('a', {'href':True, 'id':'interactiveDataBtn'})
        
        if filing_doc_href is not None:
            filing_doc_url = base_url + filing_doc_href['href']
        else:
            filing_doc_url = 'no link'
            
        if filing_int_href is not None:
            filing_int_url = base_url + filing_int_href['href']
        else:
            filing_int_url = 'no link'
            
        if filing_num_href is not None:
            filing_num_url = base_url + filing_num_href['href']
        else:
            filing_num_url = 'no link'
        
        # now let's save all this data into a dictionary
        filing_dict = {}
        filing_dict['type'] = filing_type
        filing_dict['date'] = filing_date
        filing_dict['numb'] = filing_numb
        filing_dict['links'] = {}
        filing_dict['links']['doc'] = filing_doc_url
        filing_dict['links']['int'] = filing_int_url
        filing_dict['links']['num'] = filing_num_url
        
        # print out some information for the user to see
        print('-' * 100)
        print('Filing Type: ' + filing_type)
        print('Filing_Date: ' + filing_date)
        print('Filing_Numb: ' + filing_numb)
        print('Document links: ' + filing_doc_url)
        print('Interactive links: ' + filing_int_url)
        print('Filing Number links: ' + filing_num_url)
        
        # add the dictionary to the master list
        master_list.append(filing_dict)

----------------------------------------------------------------------------------------------------
Filing Type: 10-K
Filing_Date: 2021-01-28
Filing_Numb: 001-3555121561789
Document links: https://www.sec.gov/Archives/edgar/data/1326801/000132680121000014/0001326801-21-000014-index.htm
Interactive links: https://www.sec.gov/cgi-bin/viewer?action=view&cik=1326801&accession_number=0001326801-21-000014&xbrl_type=v
Filing Number links: https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&filenum=001-35551&owner=exclude&count=100
----------------------------------------------------------------------------------------------------
Filing Type: 10-K
Filing_Date: 2020-01-30
Filing_Numb: 001-3555120559618
Document links: https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/0001326801-20-000013-index.htm
Interactive links: https://www.sec.gov/cgi-bin/viewer?action=view&cik=1326801&accession_number=0001326801-20-000013&xbrl_type=v
Filing Number links: https://www.sec.gov/cgi-b
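Note that `Filing_Numb` in the output above runs two values together: `001-3555121561789` starts with the file number `001-35551` (the same value as the `filenum` parameter of the filing number link), followed by what appears to be the film number. A minimal sketch of separating them, assuming the link always carries `filenum`, could look like this. The helper name `split_file_and_film` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def split_file_and_film(filing_numb: str, filing_num_url: str):
    """Split a joined 'file number + film number' string into its two parts.

    The file number is read from the `filenum` query parameter of the
    filing number link. Whatever follows it in `filing_numb` is treated
    as the film number (an assumption based on the printed output).
    """
    query = parse_qs(urlparse(filing_num_url).query)
    file_number = query.get("filenum", [""])[0]
    if file_number and filing_numb.startswith(file_number):
        return file_number, filing_numb[len(file_number):].strip()
    # fall back to the unchanged text when no file number can be matched
    return filing_numb, ""


print(split_file_and_film(
    "001-3555121561789",
    "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany"
    "&filenum=001-35551&owner=exclude&count=100",
))
```

Rows whose link is the `'no link'` placeholder fall through to the last branch and are returned unchanged.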

### Parsing the information of the master list

We can now pull the desired URLs out of the master list with a simple for loop.


In [24]:
for report in master_list[0:3]:
    
    print('-' * 100)
    print(report['links']['doc'])
    print(report['links']['int'])
    print(report['links']['num'])

----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/1326801/000132680121000014/0001326801-21-000014-index.htm
https://www.sec.gov/cgi-bin/viewer?action=view&cik=1326801&accession_number=0001326801-21-000014&xbrl_type=v
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&filenum=001-35551&owner=exclude&count=100
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/0001326801-20-000013-index.htm
https://www.sec.gov/cgi-bin/viewer?action=view&cik=1326801&accession_number=0001326801-20-000013&xbrl_type=v
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&filenum=001-35551&owner=exclude&count=100
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/1326801/000132680119000009/0001326801-19-
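The nested `links` dictionary is handy for lookups like the ones above, but it is awkward for tabular work. As a rough sketch, each entry of `master_list` can be flattened into a single-level dictionary first. The function name `flatten_filing` and the `link_` prefix are illustrative choices, and the sample entry reuses the values printed for the first filing.

```python
def flatten_filing(filing: dict) -> dict:
    """Return a flat copy of one filing dict, with each link as a top-level key."""
    flat = {key: value for key, value in filing.items() if key != "links"}
    for name, url in filing.get("links", {}).items():
        flat["link_" + name] = url
    return flat


# one entry shaped like those in master_list (values from the first filing)
sample = {
    "type": "10-K",
    "date": "2021-01-28",
    "numb": "001-3555121561789",
    "links": {
        "doc": "https://www.sec.gov/Archives/edgar/data/1326801/000132680121000014/0001326801-21-000014-index.htm",
        "int": "https://www.sec.gov/cgi-bin/viewer?action=view&cik=1326801&accession_number=0001326801-21-000014&xbrl_type=v",
        "num": "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&filenum=001-35551&owner=exclude&count=100",
    },
}

rows = [flatten_filing(sample)]
print(list(rows[0].keys()))
```

A list of flat dictionaries like `rows` can then be passed to `pd.DataFrame(rows)` to get one row per filing.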