<a href="https://colab.research.google.com/github/Maggiey01/Rights-Colab-YH/blob/main/download_10_K.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

This notebook downloads and scrapes Form 10-K filings from the SEC's EDGAR system. More specifically, we:
* download the list of 10-K submissions by a specified submission quarter (e.g. /2021/QTR1/)
* download each 10-K submission in .txt format
* scrape and clean text data from each 10-K submission
* save the result into a .txt file named "Ticker_FormType_FiscalYearEnd_AccessionNumber.txt", where AccessionNumber is the unique identifier that the EDGAR Filer System automatically assigns to each accepted submission.

All results are saved in [this Google Drive Folder](https://drive.google.com/drive/u/1/folders/1FYyn78joGMX1rrPKBdijP9NDqydmDE5v) organized by submission quarters.
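
The naming convention described above can be sketched as follows. This is a minimal illustration: the helper name `build_filename` and the sample values are hypothetical and not taken from the downloaded data.

```python
def build_filename(ticker: str, form_type: str, fiscal_year_end: str, accession_number: str) -> str:
    """Assemble the output file name from its four components."""
    return f"{ticker}_{form_type}_{fiscal_year_end}_{accession_number}.txt"


# Illustrative values only (not a real filing)
example = build_filename("ABC", "10-K", "20201231", "0001234567-21-000001")
print(example)
```

In the code further down, the fiscal year end is read from the filing's `CONFORMED PERIOD OF REPORT` header line, and filings whose ticker cannot be resolved use `_UNK` as the ticker.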

**Before working with this notebook, please make a copy of this template onto your own Google Drive.** 

This process generally takes 1-5 seconds per submission, depending on its size. For example:
* For submission period of 2020Q1, it took 306 minutes to download 6061 submissions.
* For submission period of 2021Q1, it took 229 minutes to download 7005 submissions.

As such, it is a good idea to restart the runtime in the Colab environment to avoid timeout.
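
As a quick check of the per-submission figure, the two timings quoted above can be converted to seconds per submission. This is only arithmetic on the numbers already given; the variable names are illustrative.

```python
# Timings quoted above: (minutes elapsed, number of submissions)
runs = {"2020Q1": (306, 6061), "2021Q1": (229, 7005)}

seconds_per_submission = {
    period: minutes * 60 / count for period, (minutes, count) in runs.items()
}

for period, secs in seconds_per_submission.items():
    print(f"{period}: {secs:.2f} s per submission")
```

Both runs average between 2 and 3 seconds per submission, which is consistent with the 1-5 second range.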


## Step 1: Install required packages

In [None]:
pip install ftfy

Collecting ftfy
  Downloading ftfy-6.0.3.tar.gz (64 kB)
Building wheels for collected packages: ftfy
  Building wheel for ftfy (setup.py) ... done
  Created wheel for ftfy: filename=ftfy-6.0.3-py3-none-any.whl size=41933 sha256=d7a7711b0ec6d2008dbaba6810e9e4a4a667eefdb143c2c6380c769b4fea6e51
  Stored in directory: /root/.cache/pip/wheels/19/f5/38/273eb3b5e76dfd850619312f693716ac4518b498f5ffb6f56d
Successfully built ftfy
Installing collected packages: ftfy
Successfully installed ftfy-6.0.3


In [None]:
pip install python-edgar

Collecting python-edgar
  Downloading python_edgar-3.1.3-py3-none-any.whl (8.6 kB)
Installing collected packages: python-edgar
Successfully installed python-edgar-3.1.3


In [None]:
import edgar

## Step 2: Download and scrape 10-Ks (go to the next section for proxy statements)

**Please update the following BEFORE executing the code.**
* `FORM_TYPES` - This should be the form type(s) that we are looking to download.
* `GET_QUARTERS` - This should be the submission year and quarter in the form of "YYYY/QTR#".
* `PLAIN_TEXT_PATH` - This should be the destination folder for the results.
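
Because a mistyped quarter only fails once the download starts, it can help to validate the value first. The sketch below assumes the index root used in this notebook (`www.sec.gov/Archives/edgar/full-index/`); `quarter_index_url` is a hypothetical helper, not part of the notebook or of any library.

```python
import re

INDEX_ROOT = "https://www.sec.gov/Archives/edgar/full-index"


def quarter_index_url(quarter: str, index_file: str = "company.idx") -> str:
    """Build the index URL for a quarter given as 'YYYY/QTR#' (surrounding slashes optional)."""
    cleaned = quarter.strip("/")
    if not re.fullmatch(r"\d{4}/QTR[1-4]", cleaned):
        raise ValueError(f"expected 'YYYY/QTR#', got {quarter!r}")
    return f"{INDEX_ROOT}/{cleaned}/{index_file}"


print(quarter_index_url("/2020/QTR1/"))
```

A value such as `'2020Q1'` raises a `ValueError` instead of producing a broken URL.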

In [None]:
import os
import re
import numpy as np
import pandas as pd
import requests
import lxml
import ftfy
from bs4 import BeautifulSoup
from datetime import datetime
from time import sleep
import urllib.request
import logging
import sys

##### UPDATE ME! #####
FORM_TYPES = ['10-K']

##### UPDATE ME! #####
GET_QUARTERS = ['/2020/QTR1/']
#GET_QUARTERS = ['/2021/QTR1/']

# CIK_TICKER_LOOKUP_PATH = '/content/drive/MyDrive/data_for_good/ticker.txt'
#from google.colab import drive
#drive.mount('/content/drive')
#cd "drive/MyDrive/Rights Colab YH/"

##### UPDATE ME! #####
# make sure the destination folder matches GET_QUARTERS (the value below points at 2021Q1)
PLAIN_TEXT_PATH = '/content/drive/MyDrive/DFG Cost of Human Rights Violations/Datasets/10k_clean_text/_0.0 downloaded/2021Q1/'
#PLAIN_TEXT_PATH = '/content/drive/MyDrive/Rights Colab YH/'


INDEX_PATH = 'www.sec.gov/Archives/edgar/full-index/'


def get_logger():
    t = datetime.now().strftime('%m%d-%H%M')
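    logger = logging.getLogger()
    # re-running this cell would otherwise attach duplicate handlers and repeat every log line
    logger.handlers.clear()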
    ch = logging.StreamHandler()
    fh = logging.FileHandler(f'log_{t}.txt')
    formatter = logging.Formatter('%(asctime)s: %(message)s', datefmt='%H:%M:%S')
    ch.setFormatter(formatter)
    fh.setFormatter(formatter)
    logger.addHandler(ch)
    logger.addHandler(fh)
    logger.setLevel(logging.INFO)
    return logger


def parse_company_index(req):
    lines = req.text.split('\n')
    HEADER_LINE = 8
    COL_BREAKS = [62, 74, 86, 98, 120]
    ret = []
    for i in range(HEADER_LINE+2, len(lines)-1):
        line_list = []
        line_list.append(lines[i][:COL_BREAKS[0]].rstrip())
        line_list.append(lines[i][COL_BREAKS[0]:COL_BREAKS[1]].rstrip())
        line_list.append(lines[i][COL_BREAKS[1]:COL_BREAKS[2]].rstrip())
        line_list.append(lines[i][COL_BREAKS[2]:COL_BREAKS[3]].rstrip())
        line_list.append(lines[i][COL_BREAKS[3]:].rstrip())
        ret.append(line_list)
    columns = ['company', 'form_type', 'cik', 'date_filed', 'link']
    return pd.DataFrame(ret, columns=columns)


def get_annual_filings_df():
    logger.info('retrieving filings index...')
    out = []
    for quarter in GET_QUARTERS:
        logger.info(quarter)
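        #'https://www.sec.gov/Archives/edgar/full-index/2021/QTR3/'
        # parse_company_index() slices fixed-width columns, which is assumed here to
        # match the layout of company.idx rather than xbrl.idx
        #r = requests.get('https://' + INDEX_PATH + quarter + 'xbrl.idx')
        # the SEC expects a User-Agent that identifies the requester; replace the placeholder below
        r = requests.get('https://' + INDEX_PATH + quarter + 'company.idx',
                         headers={'User-Agent': 'Your Name your.email@example.com'})
        r.raise_for_status()  # fail early instead of parsing an error page
        idx_df = parse_company_index(r)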
        #print(idx_df)
        out.append(idx_df)
        #print("get: ", out)
    full_company_index = pd.concat(out, axis=0).reset_index(drop=True)
    #print(full_company_index.iloc[0])
    full_company_index.cik = full_company_index.cik.astype(np.int64)
    full_company_index['accession_number'] = full_company_index.link.apply(lambda x: x.split('/')[-1].split('.')[0])

    # https://www.sec.gov/include/ticker.txt
    CIK_TICKER_LOOKUP_PATH = 'https://www.sec.gov/include/ticker.txt'
    cik_ticker = pd.read_csv(CIK_TICKER_LOOKUP_PATH, sep='\t', header=None, names=['ticker', 'cik'])
    # sics = pd.read_csv('./data/sics_new.csv',
    #                    usecols=['company_ticker', 'company_name', 'primary_industry_id', 'scope', 'is_active'])
    # sics = sics[(sics.scope == 'US') & (sics.is_active == 'Y')]

    full_company_index = pd.merge(full_company_index, cik_ticker, left_on="cik", right_on="cik", how='left')
    full_company_index.ticker = full_company_index.ticker.fillna('_UNK')
    full_company_index.ticker = full_company_index.ticker.str.upper()

    # full_company_index = pd.merge(full_company_index, sics, left_on="ticker", right_on="company_ticker", how='left')
    # full_company_index.drop(['company_ticker', 'company_name'], axis=1, inplace=True)

    annual_filings = full_company_index[full_company_index.form_type.isin(FORM_TYPES)]
    # , 'DEFA14A', 'PRE 14A', 'DEFM14A'
    # annual_filings = annual_filings[annual_filings.primary_industry_id.notna()]
    logger.info(f'annual filings: {len(annual_filings)}')
    return annual_filings


def scrape_filing(link, ticker, form_type):
    url_prefix = "https://www.sec.gov/Archives/"
    # print(url_prefix + link)
    # sleep(30)
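    file = None
    while file is None:
      try:
        # the SEC expects a User-Agent that identifies the requester; replace the placeholder below
        req = urllib.request.Request(url_prefix + link,
                                     headers={'User-Agent': 'Your Name your.email@example.com'})
        file = urllib.request.urlopen(req)
      except Exception:
        print('Opening', link, 'failed. Retrying...')
        file = None
        sleep(1)  # brief pause so retries do not hammer the server
    out = ''
    por = ''
    in_doc = False
    while True:
        raw = file.readline()
        if not raw:  # end of file without a closing </TEXT>: stop instead of looping forever
            break
        line = raw.decode('utf-8', 'ignore')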
        if line.startswith("CONFORMED PERIOD OF REPORT"):
            por = line.split(':')[-1].strip()
        if line.startswith("<TEXT>"):
            in_doc = True
        if in_doc:
            # bs4/lxml handles <br>s by simply removing them, which squashes words
            # replace <br>s with spaces before passing to bs4
            # the pattern also covers <br/> and <br /> in any letter case
            cleanline = re.sub(r'<br\s*/?>', ' ', line, flags=re.IGNORECASE)
            out += cleanline + ' '
        if line.startswith("</TEXT>"):
            break
    return BeautifulSoup(out, 'lxml'), por


def parse_soup(soup, base_element='p', is_xbrl=False):
    out = []
    base_els = soup.find_all(base_element)

    # for iXBRL, the first <div> is a large metadata block, so skip it
    if is_xbrl:
        base_els = base_els[1:]

    n_base_els = len(base_els)

    i = 0
    in_table = False
    while i < n_base_els:
        el = base_els[i]

        # skip divs that contain other divs or tables to avoid recursion
        # i.e. divs that contain divs would otherwise appear twice
        descendants = [d.name for d in el.descendants]
        if base_element in descendants or 'table' in descendants:
            i += 1
            continue

        if el.parent.name != 'td': # ordinary line
            if in_table:
                out.append('[END TABLE]')
                out.append('\n')
                in_table = False
            # remove line breaks inside elements (iXBRL filings)
            out.append(el.text.replace('\n', ''))
            i += 1
            continue

        # loop through tables row-wise
        elif el.parent.name == 'td':
            if not in_table:
                out.append('[BEGIN TABLE]')
                in_table = True
            row_el = el.parent.parent
            # handling for poorly-formed table markup
            if row_el.name != 'tr': break

            # sometimes text is contained directly in <td>s without <div>s or <p>s inside
            # so search on <td> instead
            row_tds = row_el.find_all('td')

            n_tds = len(row_tds)
            row_text = ''
            for el in row_tds:
                # Tables in most annual filings contain tds with a single text element.
                # Some tables have <td>s containing multiple <div>s, which would otherwise
                # become squashed into a single string without spaces between words...
                if len(el.find_all('div')) >1:
                    row_text += ' '.join([e.text for e in el.find_all('div')])
                else:
                    # iXBRL filings often contain extra line breaks in text elements:
                    row_text += el.text.replace('\n', ' ') + ' '
                # since the row-wise loop is searching for <tr> elements, we only increment
                # if the <tr> contains the base_element (i.e. <div> or <p>), in order to keep
                # the counter i in sync with the base_els iterable
                if base_element in [e_.name for e_ in el.children]:
                    i += 1
            out.append(row_text)

    return ('\n').join(out)


def get_filing_text(soup):
    n_p = len(soup.find_all('p'))
    n_div = len(soup.find_all('div'))
    n_span = len(soup.find_all('span'))

    # if there are <span>s, the file is probably iXBRL.
    if n_span > n_p: return parse_soup(soup, 'div', is_xbrl=True)

    # if not iXBRL, use <p> or <div>, whichever is more abundant in the markup
    elif n_p > n_div: return parse_soup(soup, 'p', is_xbrl=False)
    else: return parse_soup(soup, 'div', is_xbrl=False)


def get_clean_text(str_in):
    ret = str_in
    ret = re.sub(r'\x9f', '•', ret) # used as bullet in 0000004904-19-000009
    ret = ftfy.fix_text(ret)
    ret = re.sub(r'\xa0', ' ', ret) # remove \xa0

    # remove page breaks
    ret = re.sub(r'\s+(-\s*\d+\s*-\s*)+\s*(Table of Contents\s*)+\s*\n', '\n', ret) # e.g. 0001178879-19-000024
    ret = re.sub(r'\s*\d+(\s*Table of Contents\n)+\n', ' ', ret) # e.g. 0000764180-19-000023
    ret = re.sub(r'\s*(\d+\n)+\s*\n\n', ' ', ret) # e.g. 0000824142-19-000040
    ret = re.sub(r'\n\s*\d+\s*\n', ' ', ret) # base case: digits separated by line breaks

    # U+2022, U+00B7, U+25AA, U+25CF, U+25C6
    ret = re.sub(r'([•·▪●◆])\s*\n', r'\1 ', ret) # combine orphaned bullet chars separated from text by newline
    ret = re.sub(r'([•·▪●◆])(\w)', r'\1 \2', ret) # separate bullets squished next to text
    ret = re.sub(r'[ ]*([•·▪●◆])[ ]*', r'\1 ', ret) # consolidate whitespace around bullets

    # not active in order to preserve visual structure of multi-level bulleted lists
    #ret = re.sub(r'[•·▪●◆]', '•', ret) # use single bullet char

    # join orphaned sentences (line starts with a lower-case word)
    ret = re.sub(r'([a-z\,])\s*\n\s*([a-z])', r'\1 \2', ret)

    # fix table delimiters not separated by newline
    ret = re.sub(r'\[END TABLE\](.)', r'[END TABLE]\n\1', ret)
    ret = re.sub(r'(.)\[BEGIN TABLE\]', r'\1\n[BEGIN TABLE]', ret)

    # remove empty tables
    ret = re.sub(r'\[BEGIN TABLE\]\s*\n\s*\[END TABLE\]', r'\n', ret)

    # remove table delimiters if table has only one row
    #ret = re.sub(r'\[BEGIN TABLE\]\s*\n([^\n]+)\n\s*\[END TABLE\]', r'\1', ret)

    ret = re.sub(r'(\d)\s*\)', r'\1)', ret)
    ret = re.sub(r'\s*%', r'%', ret)

    ret = re.sub(r'\n\s*\n', r'\n', ret) # remove empty lines

    return ret


def convert_to_plain_text(df):
    # scrape and parse filings in the DataFrame; save as plain text
    n_rows = df.shape[0]
    for i, row in enumerate(df.iterrows()):
        ticker = row[1].ticker
        accession_num = row[1].accession_number
        form_type = row[1].form_type
        soup, por = scrape_filing(row[1].link, ticker, form_type)
        text = get_filing_text(soup)
        text = get_clean_text(text)
        new_fname = f'{ticker}_{form_type}_{por}_{accession_num}.txt'
        try:
          # the with-block closes the file automatically
          with open(PLAIN_TEXT_PATH + new_fname, 'w', encoding='utf-8') as f:
              f.write(text)
          logger.info(f'{i+1}/{n_rows} {new_fname}')
        except Exception:
          print("Skipped:", sys.exc_info()[0], "occurred.")
          
if __name__ == '__main__':
    logger = get_logger()
    annual_filings = get_annual_filings_df()
    convert_to_plain_text(annual_filings)

16:10:08: retrieving filings index...
16:10:08: retrieving filings index...
16:10:08: /2020/QTR1/
16:10:08: /2020/QTR1/


ValueError: ignored
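
`parse_company_index` above slices each index line at fixed column offsets, so it only works on an index file laid out in those columns. A minimal sketch of the same slicing on a synthetic line follows. The sample line is constructed for illustration and padded to the offsets used above; it is not copied from EDGAR, and `split_fixed_width` is a hypothetical helper.

```python
COL_BREAKS = [62, 74, 86, 98]
FIELDS = ["company", "form_type", "cik", "date_filed", "link"]


def split_fixed_width(line: str) -> dict:
    """Cut one index line at the fixed offsets and strip the column padding."""
    bounds = [0] + COL_BREAKS + [len(line)]
    return {
        name: line[start:end].strip()
        for name, start, end in zip(FIELDS, bounds, bounds[1:])
    }


# Synthetic line padded to the same column widths
sample = (
    "EXAMPLE CORP".ljust(62)
    + "10-K".ljust(12)
    + "1234567".ljust(12)
    + "2020-02-14".ljust(12)
    + "edgar/data/1234567/0001234567-20-000001.txt"
)
row = split_fixed_width(sample)
print(row)
```

If the input is not in this layout (for example a delimiter-separated file or an HTML error page), the `cik` slice will not be numeric, and the later conversion to `np.int64` will fail.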

In [None]:
print(full_company_index)
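
`get_clean_text` above relies on backreferences such as `\1` in the replacement strings. These only work when the replacement is a raw string (or the backslash is escaped); in an ordinary string, `'\1'` is the control character U+0001. A small stand-alone check of the difference, using an illustrative input:

```python
import re

BULLETS = r"([•·▪●◆])"
text = "•\nFirst item\n•Second item"

# Raw-string replacement: \1 refers to the captured bullet character
joined = re.sub(BULLETS + r"\s*\n", r"\1 ", text)

# Non-raw replacement: '\1' is the control character U+0001, so the bullet is lost
broken = re.sub(BULLETS + r"\s*\n", "\1 ", text)

print(repr(joined))
print(repr(broken))
```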

## Try python-edgar for Proxy Statement

```
This section tests the package. Go to the next section for the complete function.
```

In [None]:
pip install python-edgar



In [None]:
import edgar

In [None]:
pip install ftfy



In [None]:
import tempfile
import edgar
import os
import re
import numpy as np
import pandas as pd
import requests
import lxml
#import ftfy
from bs4 import BeautifulSoup
from datetime import datetime
from time import sleep
import urllib.request
import logging
import sys

In [None]:
START_YEAR = 2020
END_YEAR = 2021
QTRS = ['QTR1','QTR2','QTR3','QTR4']
tmpdirname = '/content/drive/MyDrive/DFG Cost of Human Rights Violations/Notebooks to share/Proxy_statement/new_proxy/'
#tmpdirname = '/content/drive/MyDrive/Rights Colab YH/'
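# python-edgar's download_index is assumed to take a User-Agent as its third argument
# (as in the earlier commented-out call); the value below is a placeholder
user_agent = 'Your Name your.email@example.com'
edgar.download_index(tmpdirname, START_YEAR, user_agent, skip_all_present_except_last=False)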
whole_df = pd.DataFrame()
try: 
  for Year in range(START_YEAR, END_YEAR+1):
    for QTR in QTRS:
      file_name = tmpdirname + "/"+str(Year)+"-"+QTR+".tsv"
      print("find: ", Year, QTR)
      with open(file_name, "r", encoding="utf-8") as f:
        #tmp_df = pd.read_csv (f, sep = '\t', delimiter = '|', names=["cik","company","form_type","date","txt_name","link"])
        tmp_df = pd.read_csv (f, sep = '[\t|]', names=["cik","company","form_type","date","txt_name","link"])
        #tmp_df = pd.read_csv (f, delimiter = '|', names=["cik","company","form_type","date","txt_name","link"])
        if whole_df.shape[0] == 0:
          whole_df = tmp_df.copy()
        else:
          whole_df = pd.concat([whole_df, tmp_df])
except OSError:
  pass


TypeError: ignored

In [None]:
print(whole_df.shape)
whole_df.head()

(2213375, 6)


Unnamed: 0,cik,company,form_type,date,txt_name,link
0,1000045,NICHOLAS FINANCIAL INC,10-Q,2020-02-14,edgar/data/1000045/0001564590-20-004703.txt,edgar/data/1000045/0001564590-20-004703-index....
1,1000045,NICHOLAS FINANCIAL INC,4,2020-02-11,edgar/data/1000045/0001794162-20-000001.txt,edgar/data/1000045/0001794162-20-000001-index....
2,1000045,NICHOLAS FINANCIAL INC,4,2020-03-02,edgar/data/1000045/0001794162-20-000002.txt,edgar/data/1000045/0001794162-20-000002-index....
3,1000045,NICHOLAS FINANCIAL INC,4,2020-03-03,edgar/data/1000045/0001398344-20-005055.txt,edgar/data/1000045/0001398344-20-005055-index....
4,1000045,NICHOLAS FINANCIAL INC,4,2020-03-06,edgar/data/1000045/0001398344-20-005566.txt,edgar/data/1000045/0001398344-20-005566-index....
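
The index rows loaded above can also be parsed without pandas. This is a minimal sketch, assuming the `.tsv` files written by python-edgar hold six pipe-separated fields per line, in the column order used above. The sample row reuses values printed in this notebook, and `read_index` is a hypothetical helper.

```python
import csv
import io

COLUMNS = ["cik", "company", "form_type", "date", "txt_name", "link"]

# One row in the assumed pipe-separated layout
sample = (
    "99203|FPA NEW INCOME INC|24F-2NT|2021-12-27|"
    "edgar/data/99203/0001410368-21-000530.txt|"
    "edgar/data/99203/0001410368-21-000530-index.html\n"
)


def read_index(handle):
    """Return one dict per well-formed row; rows with a different field count are skipped."""
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    return [dict(zip(COLUMNS, fields)) for fields in reader if len(fields) == len(COLUMNS)]


rows = read_index(io.StringIO(sample))
print(rows[0]["form_type"], rows[0]["txt_name"])
```

Skipping rows with an unexpected number of fields is a design choice for this sketch; it keeps a malformed line from shifting values into the wrong columns.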


## Try readability of top 10 UK companies

| Company | Ticker | CIK |
| --- | --- | --- |
| Unilever | ul | 217410 |
| AstraZeneca | azn | 901832 |
| HSBC | hsbc | 1089113 |
| GlaxoSmithKline | gsk | 1131399 |
| Rio Tinto | rtntf | 887028 |
| British American Tobacco | bti | 1303523 |
| Royal Dutch Shell | rds-a | 1306965 |

In [None]:
# NOTE: the 'ticker' column is only added by the merge cell further down; run that cell first
df_20f = whole_df.loc[whole_df.form_type=='20-F']
df_20f_uk_top10 = df_20f.loc[df_20f.ticker.isin(['UL','AZN','HSBC','GSK','RTNTF','BTI','RDS-A'])]
df_20f_uk_top10.link.iloc[0]

'edgar/data/1089113/0001628280-20-001784-index.html'

In [None]:
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1089113/0001628280-20-001784-index.html'
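# the SEC expects a User-Agent that identifies the requester; replace the placeholder below
req = urllib.request.Request(url, headers={'User-Agent': 'Your Name your.email@example.com'})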
con = urllib.request.urlopen(req).read()

HTTPError: ignored
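
One possible cause of an `HTTPError` like the one above is the request header: the SEC asks automated clients to declare a User-Agent that identifies the requester, and a generic browser string may be rejected. The sketch below builds such a request without sending it. The contact string is a placeholder and `build_request` is a hypothetical helper.

```python
import urllib.request

ARCHIVES_PREFIX = "https://www.sec.gov/Archives/"
USER_AGENT = "Your Name your.email@example.com"  # placeholder: use your own name and contact email


def build_request(link: str) -> urllib.request.Request:
    """Prepare (but do not send) a request for a path relative to the EDGAR archives root."""
    return urllib.request.Request(ARCHIVES_PREFIX + link, headers={"User-Agent": USER_AGENT})


req = build_request("edgar/data/1089113/0001628280-20-001784-index.html")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would then perform the download.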

In [None]:
whole_df.form_type.unique()

array(['10-Q', '4', '8-K', 'SC 13D/A', 'SC 13G/A', 'SC 13G', '13F-HR',
       'FOCUSN', 'X-17A-5', '6-K', '20-F', 'IRANNOTICE', '24F-2NT',
       '485APOS', '497', 'CORRESP', 'N-4', 'N-CEN', '10-K', '8-K/A',
       'NT 10-K', '5/A', '5', 'DEF 14A', 'DEFA14A', 'PRE 14A', 'N-CSR',
       'NPORT-P', '10-D', '424B2', '424B3', 'ABS-15G', 'F-N/A', 'FWP',
       '40-17G', '485BPOS', '497J', '8-A12B', 'APP WD', 'CERT', '3',
       '4/A', '424B5', 'EFFECT', 'S-3ASR', 'UPLOAD', '3/A', 'TA-2', 'S-8',
       '13F-HR/A', '40-F', 'F-10/A', 'F-10', 'F-X', 'SC 13D', 'SC 13E3/A',
       'SUPPL', 'D/A', 'CT ORDER', '10-K/A', 'NT 10-Q', '1-A/A', '1-A',
       'F-6EF', '497K', '40-17G/A', 'S-1', '13F-NT', '8-A12B/A',
       'DEF 14C', 'N-CEN/A', '15-12B', 'S-8 POS', '497AD', 'N-30B-2',
       '424H/A', '424H', 'ABS-EE', '25', 'DEFC14A', 'DFAN14A', 'PREN14A',
       'PX14A6G', 'S-3', 'PRE 14C', 'PRER14A', '485BXT', 'X-17A-5/A',
       '425', 'SC TO-I/A', 'SC TO-I', 'MA-A', 'DEL AM', 'N-14/A', 'N-14',
     

In [None]:
whole_df.cik = whole_df.cik.astype(np.int64)

In [None]:
whole_df['accession_number'] = whole_df.link.apply(lambda x: x.split('/')[-1].split('.')[0])
whole_df.head()

Unnamed: 0,cik,company,form_type,date,txt_name,link,accession_number
0,1000045,NICHOLAS FINANCIAL INC,10-Q,2020-02-14,edgar/data/1000045/0001564590-20-004703.txt,edgar/data/1000045/0001564590-20-004703-index....,0001564590-20-004703-index
1,1000045,NICHOLAS FINANCIAL INC,4,2020-02-11,edgar/data/1000045/0001794162-20-000001.txt,edgar/data/1000045/0001794162-20-000001-index....,0001794162-20-000001-index
2,1000045,NICHOLAS FINANCIAL INC,4,2020-03-02,edgar/data/1000045/0001794162-20-000002.txt,edgar/data/1000045/0001794162-20-000002-index....,0001794162-20-000002-index
3,1000045,NICHOLAS FINANCIAL INC,4,2020-03-03,edgar/data/1000045/0001398344-20-005055.txt,edgar/data/1000045/0001398344-20-005055-index....,0001398344-20-005055-index
4,1000045,NICHOLAS FINANCIAL INC,4,2020-03-06,edgar/data/1000045/0001398344-20-005566.txt,edgar/data/1000045/0001398344-20-005566-index....,0001398344-20-005566-index
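
Splitting the `link` column on the last `.` leaves the `-index` suffix in `accession_number`, as the output above shows. Below is a hedged sketch of extracting only the accession number with a regular expression. It assumes the 10-2-6 digit pattern seen in the paths above, and `accession_number` here is a hypothetical helper.

```python
import re

# Accession numbers in the paths above follow a 10-2-6 digit pattern
ACCESSION_RE = re.compile(r"(\d{10}-\d{2}-\d{6})")


def accession_number(path: str):
    """Return the accession number found in the last path component, or None."""
    match = ACCESSION_RE.search(path.rsplit("/", 1)[-1])
    return match.group(1) if match else None


print(accession_number("edgar/data/99203/0001410368-21-000530-index.html"))
print(accession_number("edgar/data/99203/0001410368-21-000530.txt"))
```

The same helper works on both the `link` and the `txt_name` columns, so either can be used as the source.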


In [None]:
CIK_TICKER_LOOKUP_PATH = 'https://www.sec.gov/include/ticker.txt'
cik_ticker = pd.read_csv(CIK_TICKER_LOOKUP_PATH, sep='\t', header=None, names=['ticker', 'cik'])
cik_ticker.head()

Unnamed: 0,ticker,cik
0,aapl,320193
1,msft,789019
2,googl,1652044
3,amzn,1018724
4,tsla,1318605


In [None]:
whole_df = pd.merge(whole_df, cik_ticker, left_on="cik", right_on="cik", how='left')
whole_df.ticker = whole_df.ticker.fillna('_UNK')
whole_df.ticker = whole_df.ticker.str.upper()
whole_df[whole_df.ticker == '_UNK']

Unnamed: 0,cik,company,form_type,date,txt_name,link,accession_number,ticker
15,1000097,"KINGDON CAPITAL MANAGEMENT, L.L.C.",13F-HR,2020-02-14,edgar/data/1000097/0001000097-20-000004.txt,edgar/data/1000097/0001000097-20-000004-index....,0001000097-20-000004-index,_UNK
16,1000097,"KINGDON CAPITAL MANAGEMENT, L.L.C.",SC 13G/A,2020-02-10,edgar/data/1000097/0000919574-20-000875.txt,edgar/data/1000097/0000919574-20-000875-index....,0000919574-20-000875-index,_UNK
17,1000097,"KINGDON CAPITAL MANAGEMENT, L.L.C.",SC 13G/A,2020-02-10,edgar/data/1000097/0000919574-20-000877.txt,edgar/data/1000097/0000919574-20-000877-index....,0000919574-20-000877-index,_UNK
18,1000097,"KINGDON CAPITAL MANAGEMENT, L.L.C.",SC 13G/A,2020-02-10,edgar/data/1000097/0000919574-20-000879.txt,edgar/data/1000097/0000919574-20-000879-index....,0000919574-20-000879-index,_UNK
19,1000152,"WESTERN INTERNATIONAL SECURITIES, INC.",FOCUSN,2020-03-20,edgar/data/1000152/9999999997-20-003463.txt,edgar/data/1000152/9999999997-20-003463-index....,9999999997-20-003463-index,_UNK
...,...,...,...,...,...,...,...,...
4199182,99203,FPA NEW INCOME INC,24F-2NT,2021-12-27,edgar/data/99203/0001410368-21-000530.txt,edgar/data/99203/0001410368-21-000530-index.html,0001410368-21-000530-index,_UNK
4199183,99203,FPA NEW INCOME INC,N-CEN,2021-12-08,edgar/data/99203/0001752724-21-265584.txt,edgar/data/99203/0001752724-21-265584-index.html,0001752724-21-265584-index,_UNK
4199184,99203,FPA NEW INCOME INC,N-CSR,2021-12-02,edgar/data/99203/0001104659-21-145977.txt,edgar/data/99203/0001104659-21-145977-index.html,0001104659-21-145977-index,_UNK
4199185,99203,FPA NEW INCOME INC,NPORT-P,2021-11-24,edgar/data/99203/0001752724-21-257219.txt,edgar/data/99203/0001752724-21-257219-index.html,0001752724-21-257219-index,_UNK
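
The merge above attaches a ticker to each filing and falls back to `_UNK` when the CIK is missing from the lookup. The same logic can be sketched with plain dictionaries. The two lookup entries are taken from the `cik_ticker.head()` output above; the filings list and `attach_tickers` are illustrative.

```python
# Lookup in the same shape as ticker.txt: one lower-case ticker per CIK
cik_to_ticker = {320193: "aapl", 789019: "msft"}

# Hypothetical filings; the second CIK has no entry in the lookup
filings = [
    {"cik": 320193, "form_type": "10-K"},
    {"cik": 1000097, "form_type": "13F-HR"},
]


def attach_tickers(rows, lookup, unknown="_UNK"):
    """Return copies of rows with an upper-case ticker, or the fallback if the CIK is unknown."""
    return [{**row, "ticker": lookup.get(row["cik"], unknown).upper()} for row in rows]


labelled = attach_tickers(filings, cik_to_ticker)
print([row["ticker"] for row in labelled])
```

One difference worth keeping in mind: if a CIK appears more than once in the lookup file, a pandas left merge repeats the filing once per matching ticker, whereas a dictionary keeps a single ticker per CIK.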


In [None]:
annual_filings = whole_df[whole_df.form_type.isin(['20-F'])]
annual_filings.shape

(2127, 8)

In [None]:
annual_filings.link.iloc[0]

'edgar/data/1000184/0001104659-20-025681-index.html'

## Convert to function

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
pip install python-edgar



In [None]:
pip install ftfy



**Create function**

In [None]:
import os
import re
import numpy as np
import pandas as pd
import requests
import lxml
import edgar
import ftfy
from bs4 import BeautifulSoup
from datetime import datetime
from time import sleep
import urllib.request
import logging
import sys

##### UPDATE ME! #####
#FORM_TYPES = ['10-K']
FORM_TYPES = ['DEF 14A']


##### UPDATE ME! #####
GET_QUARTERS = ['QTR1','QTR2','QTR3','QTR4']
START_YEAR = 2020
END_YEAR = 2021
#GET_QUARTERS = ['/2021/QTR1/']

# CIK_TICKER_LOOKUP_PATH = '/content/drive/MyDrive/data_for_good/ticker.txt'

##### UPDATE ME! #####
#PLAIN_TEXT_PATH = '/content/drive/MyDrive/DFG Cost of Human Rights Violations/Datasets/10k_clean_text/_0.0 downloaded/2021Q1/'
PLAIN_TEXT_PATH = '/content/drive/MyDrive/DFG Cost of Human Rights Violations/Notebooks to share/Proxy_statement/new_proxy/'

INDEX_PATH = 'www.sec.gov/Archives/edgar/full-index/'


def get_logger():
    t = datetime.now().strftime('%m%d-%H%M')
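    logger = logging.getLogger()
    # re-running this cell would otherwise attach duplicate handlers and repeat every log line
    logger.handlers.clear()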
    ch = logging.StreamHandler()
    fh = logging.FileHandler(f'log_{t}.txt')
    formatter = logging.Formatter('%(asctime)s: %(message)s', datefmt='%H:%M:%S')
    ch.setFormatter(formatter)
    fh.setFormatter(formatter)
    logger.addHandler(ch)
    logger.addHandler(fh)
    logger.setLevel(logging.INFO)
    return logger


def parse_company_index(req):
    lines = req.text.split('\n')
    HEADER_LINE = 8
    COL_BREAKS = [62, 74, 86, 98, 120]
    ret = []
    for i in range(HEADER_LINE+2, len(lines)-1):
        line_list = []
        line_list.append(lines[i][:COL_BREAKS[0]].rstrip())
        line_list.append(lines[i][COL_BREAKS[0]:COL_BREAKS[1]].rstrip())
        line_list.append(lines[i][COL_BREAKS[1]:COL_BREAKS[2]].rstrip())
        line_list.append(lines[i][COL_BREAKS[2]:COL_BREAKS[3]].rstrip())
        line_list.append(lines[i][COL_BREAKS[3]:].rstrip())
        ret.append(line_list)
    columns = ['company', 'form_type', 'cik', 'date_filed', 'link']
    return pd.DataFrame(ret, columns=columns)

def get_annual_filings_df():
    logger.info('retrieving filings index...')
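    #out = []
    # python-edgar's download_index is assumed to take a User-Agent as its third argument;
    # the value below is a placeholder
    user_agent = 'Your Name your.email@example.com'
    edgar.download_index(PLAIN_TEXT_PATH, START_YEAR, user_agent, skip_all_present_except_last=False)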
    full_company_index = pd.DataFrame()
    try: 
      for Year in range(START_YEAR, END_YEAR+1):
        for QTR in GET_QUARTERS:
          logger.info(QTR)
          file_name = PLAIN_TEXT_PATH + "/" +str(Year)+"-"+QTR+".tsv"
          with open(file_name, "r", encoding="utf-8") as f:
            #tmp_df = pd.read_csv(f, sep = '\t', delimiter = '|', names=["cik","company","form_type","date_filed","txt_name","link"])
            tmp_df = pd.read_csv(f, sep = '[\t|]', names=["cik","company","form_type","date_filed","txt_name","link"])
            #idx_df = parse_company_index(tmp_df)
            if full_company_index.shape[0] == 0:
              full_company_index = tmp_df.copy()
            else:
              full_company_index = pd.concat([full_company_index, tmp_df])
    except OSError:
      pass

    full_company_index.cik = full_company_index.cik.astype(np.int64)
    # full_company_index['accession_number'] = full_company_index.link.apply(lambda x: x.split('/')[-1].split('.')[0])
    full_company_index['accession_number'] = full_company_index.txt_name.apply(lambda x: x.split('/')[-1].split('.')[0])
    #####
    # https://www.sec.gov/include/ticker.txt
    CIK_TICKER_LOOKUP_PATH = 'https://www.sec.gov/include/ticker.txt'
    cik_ticker = pd.read_csv(CIK_TICKER_LOOKUP_PATH, sep='\t', header=None, names=['ticker', 'cik'])
    # sics = pd.read_csv('./data/sics_new.csv',
    #                    usecols=['company_ticker', 'company_name', 'primary_industry_id', 'scope', 'is_active'])
    # sics = sics[(sics.scope == 'US') & (sics.is_active == 'Y')]

    full_company_index = pd.merge(full_company_index, cik_ticker, left_on="cik", right_on="cik", how='left')
    full_company_index.ticker = full_company_index.ticker.fillna('_UNK')
    full_company_index.ticker = full_company_index.ticker.str.upper()

    # full_company_index = pd.merge(full_company_index, sics, left_on="ticker", right_on="company_ticker", how='left')
    # full_company_index.drop(['company_ticker', 'company_name'], axis=1, inplace=True)
    
    

    annual_filings = full_company_index[full_company_index.form_type.isin(FORM_TYPES)]
    # , 'DEFA14A', 'PRE 14A', 'DEFM14A'
    # annual_filings = annual_filings[annual_filings.primary_industry_id.notna()]

    
    logger.info(f'annual filings: {len(annual_filings)}')
    return annual_filings



def scrape_filing(link, ticker, form_type):
    url_prefix = "https://www.sec.gov/Archives/"
    # print(url_prefix + link)
    # sleep(30)
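    file = None
    while file is None:
      try:
        # the SEC asks for a User-Agent that identifies the requester (name and contact email);
        # consider replacing the generic value below
        req = urllib.request.Request(url_prefix + link, headers={'User-Agent' : "Magic Browser"})
        file = urllib.request.urlopen(req)
      except Exception:
        print('Opening', link, 'failed. Retrying...')
        file = None
        sleep(1)  # brief pause so retries do not hammer the server
    out = ''
    por = ''
    in_doc = False
    while True:
        raw = file.readline()
        if not raw:  # end of file without a closing </TEXT>: stop instead of looping forever
            break
        line = raw.decode('utf-8', 'ignore')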
        if line.startswith("CONFORMED PERIOD OF REPORT"):
            por = line.split(':')[-1].strip()
        if line.startswith("<TEXT>"):
            in_doc = True
        if in_doc:
            # bs4/lxml handles <br>s by simply removing them, which squashes words
            # replace <br>s with spaces before passing to bs4
            # the pattern also covers <br/> and <br /> in any letter case
            cleanline = re.sub(r'<br\s*/?>', ' ', line, flags=re.IGNORECASE)
            out += cleanline + ' '
        if line.startswith("</TEXT>"):
            break
    return BeautifulSoup(out, 'lxml'), por


def parse_soup(soup, base_element='p', is_xbrl=False):
    out = []
    base_els = soup.find_all(base_element)

    # for iXBRL, the first <div> is a large metadata block, so skip it
    if is_xbrl:
        base_els = base_els[1:]

    n_base_els = len(base_els)

    i = 0
    in_table = False
    while i < n_base_els:
        el = base_els[i]

        # skip divs that contain other divs or tables to avoid recursion
        # i.e. divs that contain divs would otherwise appear twice
        descendants = [d.name for d in el.descendants]
        if base_element in descendants or 'table' in descendants:
            i += 1
            continue

        if el.parent.name != 'td': # ordinary line
            if in_table:
                out.append('[END TABLE]')
                out.append('\n')
                in_table = False
            # remove line breaks inside elements (iXBRL filings)
            out.append(el.text.replace('\n', ''))
            i += 1
            continue

        # loop through tables row-wise
        elif el.parent.name == 'td':
            if not in_table:
                out.append('[BEGIN TABLE]')
                in_table = True
            row_el = el.parent.parent
            # handling for poorly-formed table markup
            if row_el.name != 'tr': break

            # sometimes text is contained directly in <td>s without <div>s or <p>s inside
            # so search on <td> instead
            row_tds = row_el.find_all('td')

            n_tds = len(row_tds)
            row_text = ''
            for el in row_tds:
                # Tables in most annual filings contain tds with a single text element.
                # Some tables have <td>s containing multiple <div>s, which would otherwise
                # become squashed into a single string without spaces between words...
                if len(el.find_all('div')) >1:
                    row_text += ' '.join([e.text for e in el.find_all('div')])
                else:
                    # iXBRL filings often contain extra line breaks in text elements:
                    row_text += el.text.replace('\n', ' ') + ' '
                # since the row-wise loop is searching for <tr> elements, we only increment
                # if the <tr> contains the base_element (i.e. <div> or <p>), in order to keep
                # the counter i in sync with the base_els iterable
                if base_element in [e_.name for e_ in el.children]:
                    i += 1
            out.append(row_text)

    return ('\n').join(out)


def get_filing_text(soup):
    n_p = len(soup.find_all('p'))
    n_div = len(soup.find_all('div'))
    n_span = len(soup.find_all('span'))

    # if there are <span>s, the file is probably iXBRL.
    if n_span > n_p: return parse_soup(soup, 'div', is_xbrl=True)

    # if not iXBRL, use <p> or <div>, whichever is more abundant in the markup
    elif n_p > n_div: return parse_soup(soup, 'p', is_xbrl=False)
    else: return parse_soup(soup, 'div', is_xbrl=False)


def get_clean_text(str_in):
    ret = str_in
    ret = re.sub(r'\x9f', '•', ret) # used as bullet in 0000004904-19-000009
    ret = ftfy.fix_text(ret)
    ret = re.sub(r'\xa0', ' ', ret) # remove \xa0

    # remove page breaks
    ret = re.sub(r'\s+(-\s*\d+\s*-\s*)+\s*(Table of Contents\s*)+\s*\n', '\n', ret) # e.g. 0001178879-19-000024
    ret = re.sub(r'\s*\d+(\s*Table of Contents\n)+\n', ' ', ret) # e.g. 0000764180-19-000023
    ret = re.sub(r'\s*(\d+\n)+\s*\n\n', ' ', ret) # e.g. 0000824142-19-000040
    ret = re.sub(r'\n\s*\d+\s*\n', ' ', ret) # base case: digits separated by line breaks

    # U+2022, U+00B7, U+25AA, U+25CF, U+25C6
    ret = re.sub(r'([•·▪●◆])\s*\n', r'\1 ', ret) # combine orphaned bullet chars separated from text by newline
    ret = re.sub(r'([•·▪●◆])(\w)', r'\1 \2', ret) # separate bullets squished next to text
    ret = re.sub(r'[ ]*([•·▪●◆])[ ]*', r'\1 ', ret) # consolidate whitespace around bullets

    # not active in order to preserve visual structure of multi-level bulleted lists
    #ret = re.sub(r'[•·▪●◆]', '•', ret) # use single bullet char

    # join orphaned sentences (line starts with a lower-case word)
    ret = re.sub(r'([a-z\,])\s*\n\s*([a-z])', r'\1 \2', ret)

    # fix table delimiters not separated by newline
    ret = re.sub(r'\[END TABLE\](.)', r'[END TABLE]\n\1', ret)
    ret = re.sub(r'(.)\[BEGIN TABLE\]', r'\1\n[BEGIN TABLE]', ret)

    # remove empty tables
    ret = re.sub(r'\[BEGIN TABLE\]\s*\n\s*\[END TABLE\]', r'\n', ret)

    # remove table delimiters if table has only one row
#    ret = re.sub(r'\[BEGIN TABLE\]\s*\n([^\n]+)\n\s*\[END TABLE\]', r'\1', ret)

    ret = re.sub(r'(\d)\s*\)', r'\1)', ret)
    ret = re.sub(r'\s*%', r'%', ret)

    ret = re.sub(r'\n\s*\n', r'\n', ret) # remove empty lines

    return ret


def convert_to_plain_text(df):
    # scrape and parse filings in the DataFrame; save as plain text
    n_rows = df.shape[0]
    for i, row in enumerate(df.iterrows()):
        ticker = row[1].ticker
        accession_num = row[1].accession_number
        form_type = row[1].form_type
        #soup, por = scrape_filing(row[1].link, ticker, form_type)
        soup, por = scrape_filing(row[1].txt_name, ticker, form_type)
        text = get_filing_text(soup)
        text = get_clean_text(text)
        #new_fname = f'{ticker}_{form_type}_{por}_{accession_num}.txt'
        new_fname = f'{ticker}_{form_type}_{por}_{accession_num}'
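        try:
            # the with-block closes the file, so no explicit close() is needed
            with open(PLAIN_TEXT_PATH + new_fname, 'w', encoding='utf-8') as f:
                f.write(text)
            logger.info(f'{i+1}/{n_rows} {new_fname}')
        except Exception as e:
            # report the failure and move on to the next filing
            print('Skipped:', type(e).__name__, 'occurred.')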
          
if __name__ == '__main__':
    logger = get_logger()
    annual_filings = get_annual_filings_df()
    convert_to_plain_text(annual_filings)

Streaming output truncated to the last 5000 lines.
Opening edgar/data/1089113/0001628280-20-001784.txt failed. Retrying...
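
One detail of `get_clean_text` is worth illustrating: a backreference such as `\1` in a `re.sub` replacement only works as intended when the replacement is a raw string. In an ordinary string literal, Python reads `'\1'` as the control character U+0001 before `re` ever sees it. The following is a minimal sketch with a made-up input string (not taken from any filing) that shows the difference:

```python
import re

# Same pattern as the orphaned-bullet rule in get_clean_text
BULLETS = r'([•·▪●◆])\s*\n'

# Hypothetical input, for illustration only
sample = '•\nRevenue increased'

# Non-raw replacement: '\1' is the single character chr(1), so the bullet is lost
lost = re.sub(BULLETS, '\1 ', sample)

# Raw replacement: r'\1' is a backreference to the captured bullet
kept = re.sub(BULLETS, r'\1 ', sample)

print(repr(lost))
print(repr(kept))
```

The other bullet rules in `get_clean_text` already use raw replacement strings, so writing every replacement as a raw string keeps the substitutions consistent.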

In [None]:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1000229/0001564590-20-011974-index.html'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
con = urllib.request.urlopen(req).read()

In [None]:
url = "https://www.sec.gov/Archives/edgar/data/1000229/0001564590-20-011974.txt"
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
con = urllib.request.urlopen(req).read()
print(con)

b'<SEC-DOCUMENT>0001564590-20-011974.txt : 20200320\n<SEC-HEADER>0001564590-20-011974.hdr.sgml : 20200320\n<ACCEPTANCE-DATETIME>20200320135958\nACCESSION NUMBER:\t\t0001564590-20-011974\nCONFORMED SUBMISSION TYPE:\tDEF 14A\nPUBLIC DOCUMENT COUNT:\t\t12\nCONFORMED PERIOD OF REPORT:\t20200520\nFILED AS OF DATE:\t\t20200320\nDATE AS OF CHANGE:\t\t20200320\nEFFECTIVENESS DATE:\t\t20200320\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tCORE LABORATORIES N V\n\t\tCENTRAL INDEX KEY:\t\t\t0001000229\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tOIL, GAS FIELD SERVICES, NBC [1389]\n\t\tIRS NUMBER:\t\t\t\t000000000\n\t\tSTATE OF INCORPORATION:\t\t\tP7\n\t\tFISCAL YEAR END:\t\t\t1231\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\tDEF 14A\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-14273\n\t\tFILM NUMBER:\t\t20731430\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\tSTRAWINSKYLAAN 913\n\t\tSTREET 2:\t\tTOWER A, LEVEL 9\n\t\tCITY:\t\t\t1077 XX AMSTERDAM\n\t\tSTATE:\t\t\tP7\n\t\tZIP:\t\t\t10

In [None]:
print(con)

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<meta http-equiv="Last-Modified" content="Fri, 20 Mar 2020 17:59:58 GMT" />\n<title>EDGAR Filing Documents for 0001564590-20-011974</title>\n<link rel="stylesheet" type="text/css" href="/include/interactive.css" />\n</head>\n<body style="margin: 0">\n<!-- SEC Web Analytics - For information please visit: https://www.sec.gov/privacy.htm#collectedinfo -->\n<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-TD3BKV"\nheight="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>\n<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\nnew Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\nj=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\n\'//www.googletagmanager.com/gtm.

In [None]:
from bs4 import BeautifulSoup
from requests import get
url = 'https://www.sec.gov/Archives/edgar/data/1000229/0001564590-20-011974-index.html'
page = get(url)
soup = BeautifulSoup(page.content, "html.parser")

In [None]:
print(page.text[:500])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>SEC.gov | Request Rate Threshold Exceeded</title>
<style>
html {height: 100%}
body {height: 100%; margin:0; padding:0;}
#header {background-color:#003968; color:#fff; padding:15px 20px 10px 20px;font-family:Arial, Helvetica, sans-serif; font-size:20p


In [None]:
#base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=1000045&type=DEF_14A&dateb=20200214"
base_url = "https://www.sec.gov/cgi-bin/srch-edgar?text=FORM-TYPE=DEF14A&first=2021&last=2021"
edgar_resp = requests.get(base_url)

In [None]:
print(edgar_resp.content)

b'\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n<html lang="ENG">\n\n<head>\n<title>Search Historical SEC EDGAR Archives</title>\n<!-- BEGIN HEADER -->\n<script language="JavaScript" type="text/javascript" src="/include/sec.js"></script>\n</head>\n\n<body topmargin="0" leftmargin="0" marginwidth="0" marginheight="0">\n<!-- SEC Web Analytics - For information please visit: http://www.sec.gov/privacy.htm#collectedinfo -->\n<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-TD3BKV"\nheight="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>\n<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\nnew Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\nj=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\n\'//www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);\n})(window,document,\'script\',\'dataLayer\',\'GTM-TD3BKV\');</script>\n<!-- End SEC Web Analyti

In [None]:
contents = edgar_resp.text  # .text is a property, not a method
print(contents)

In [None]:
print(soup.title)

<title>SEC.gov | Request Rate Threshold Exceeded</title>
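
The title printed above shows that SEC.gov returned its "Request Rate Threshold Exceeded" page rather than the filing index, which would explain why the lookups on `soup` in the following cells fail. A minimal sketch of a guard that could run before parsing is below. The helper name `is_rate_limited` is hypothetical, and the marker string is simply the title text seen in this output, so it is an assumption that the wording stays the same.

```python
import re

# Marker text copied from the <title> in the output above (assumes SEC.gov
# keeps using this wording for its rate-limit page).
RATE_LIMIT_MARKER = 'Request Rate Threshold Exceeded'


def is_rate_limited(html: str) -> bool:
    """Return True if the page <title> contains the rate-limit marker."""
    match = re.search(r'<title>(.*?)</title>', html, flags=re.IGNORECASE | re.DOTALL)
    return bool(match) and RATE_LIMIT_MARKER in match.group(1)


# Two small hand-written pages, for illustration only
blocked = '<html><head><title>SEC.gov | Request Rate Threshold Exceeded</title></head></html>'
normal = '<html><head><title>EDGAR Filing Documents for 0001564590-20-011974</title></head></html>'

print(is_rate_limited(blocked))
print(is_rate_limited(normal))
```

In the notebook, a check such as `is_rate_limited(page.text)` before building the soup would make the failure explicit. The earlier `urllib` cells send a `User-Agent` header and received the real index page, while the `get(url)` call sends none, which may be why this request was rejected.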


In [None]:
results = soup.find(id="contentDiv")

In [None]:
print(results.prettify())

AttributeError: ignored

In [None]:
job_elements = results.find_all("div", class_="tableFile")

AttributeError: ignored

In [None]:
for job_element in job_elements:
    print(job_element, end="\n"*2)

 <td scope="row">DEF 14A</td>\n            <td scope="row"><a href="/Archives/edgar/data/1000229/000156459020011974/clb-def14a_20200520.htm">clb-def14a_20200520.htm</a></td>\n            <td scope="row">DEF 14A</td>\n

##### DO NOT USE

In [None]:
# check the number of files in the directory

import os
import glob
import pandas as pd

# FOLDER_PATH = '/content/drive/MyDrive/data_for_good/10k_clean_text_tmp/'
FOLDER_PATH = '/content/drive/MyDrive/DFG Cost of Human Rights Violations/Datasets/10k_clean_text/'
PERIOD = '*Q*'

# Get a list of all file paths ending in .txt under the specified directory
fileList = glob.glob(FOLDER_PATH+PERIOD+'/*.txt')


df_list = pd.DataFrame(fileList, columns=['filename'])

# treat the folder path as a literal string, not a regular expression
df_list.filename = df_list.filename.str.replace(FOLDER_PATH, '', regex=False)
df_list[['period', 'filename']] = df_list.filename.str.split(pat='/',expand=True)
df_list.filename = df_list.filename.str.replace('_UNK','UNK')
df_list[['ticker', 'form_type', 'filing_period', 'filename']] = df_list.filename.str.split(pat='_',expand=True)
# df_list = df_list[df_list.ticker == 'UNK']
df_list = df_list[df_list.filing_period != 'log']
df_list = df_list[df_list.form_type == '10-K']
df_list = df_list.sort_values('ticker')
df_list['year'] = df_list['period'].str[:4]
# print("# of files: ", len(fileList))
# print("# of tickers: ", len(df_list.ticker))
# print("# of unique tickers: ", len(set(df_list.ticker)))

df_list.to_csv('/content/drive/MyDrive/data_for_good/list_of_raw_files.csv')

# Iterate over the list of filepaths & remove each file.
# for filePath in fileList:
#     try:
#         # os.remove(filePath)
#         # print("removed: ", filePath)
#     except:
#         print("Error while deleting file : ", filePath)