<a href="https://colab.research.google.com/github/Maggiey01/Rights-Colab-YH/blob/main/Download_20_F.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

This notebook downloads and scrapes Form 20-F filings from the SEC's EDGAR system. More specifically, we:
* download the index of 20-F submissions for a specified submission year and quarter (e.g. 2021, QTR1)
* download each 20-F submission in .txt format
* scrape and clean the text of each 20-F submission
* save the result to a text file named "Ticker_FormType_FiscalYearEnd_AccessionNumber.txt", where AccessionNumber is the unique identifier that EDGAR assigns automatically to each accepted submission. (The current code omits the FiscalYearEnd part because the period-of-report lookup is commented out.)
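
The naming convention can be sketched as a small helper. This is an illustration only: `build_output_name` is a hypothetical name that does not appear in the notebook, and the fiscal year end value `20201231` is an assumed example in `YYYYMMDD` form.

```python
def build_output_name(ticker, form_type, accession_number, fiscal_year_end=None):
    """Join the filename parts with underscores; skip the fiscal year end if unknown."""
    parts = [ticker, form_type]
    if fiscal_year_end:
        parts.append(fiscal_year_end)
    parts.append(accession_number)
    return "_".join(parts) + ".txt"


# Accession number taken from the HSBC filing used later in this notebook.
print(build_output_name("HSBC", "20-F", "0001628280-21-003046"))
# Assumed fiscal year end, for illustration only.
print(build_output_name("HSBC", "20-F", "0001628280-21-003046", "20201231"))
```

Making the fiscal year end optional keeps the helper usable while that field is not extracted.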

All results are saved in [this Google Drive Folder](https://drive.google.com/drive/folders/1hZ8vrUYZL0bIk6uNpVmbIJJHKmOzzriH?usp=sharing) organized by submission quarters.

* Overall, the output text for the 20-F filings of the 7 top UK companies is readable and clean.

**Before working with this notebook, please make a copy of this template onto your own Google Drive.** 

Please note that this process generally takes 1-5 seconds per submission, depending on its size. For example:

* For the 2021 submission period, it took 1 minute to download the 20-F submissions of the top 7 UK firms.

Because longer runs can take a while, it is a good idea to restart the runtime in the Colab environment beforehand to avoid a timeout.


## Step 1: Install the required packages

In [18]:
pip install ftfy



In [19]:
pip install python-edgar



In [20]:
import edgar

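## Step 2: Download and scrape 20-Fs
**Please update the following BEFORE executing the code.**
* `FORM_TYPES`: the form type(s) to download, e.g. `['20-F']`. Note that `get_annual_filings_df` currently filters on a hard-coded `'20-F'` and ticker list rather than on `FORM_TYPES`.
* `START_YEAR`, `END_YEAR`: the first and last submission years to process.
* `GET_QUARTERS`: the submission quarters to process within each year, e.g. `['QTR1']`.
* `PLAIN_TEXT_PATH`: the destination folder for the index files and the results.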
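
As a rough sketch of how these settings combine: the code below reads one index file per year and quarter, named `<year>-<quarter>.tsv`, from the destination folder. `index_file_paths` is a hypothetical helper written for this illustration; the values mirror the configuration cell further down.

```python
def index_file_paths(dest_dir, start_year, end_year, quarters):
    """List the index files the notebook expects, one per (year, quarter) pair."""
    return [
        f"{dest_dir}/{year}-{qtr}.tsv"
        for year in range(start_year, end_year + 1)  # END_YEAR is inclusive
        for qtr in quarters
    ]


paths = index_file_paths("/drive/MyDrive/Rights Colab YH/20F", 2020, 2021, ["QTR1"])
for p in paths:
    print(p)
```

Listing the expected paths before a long run is a cheap way to check that the year range and quarters are what you intended.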

## Try python-edgar for 20-F

```
This section tests the python-edgar package. Skip to the next section if you want the complete pipeline.
```

In [21]:
pip install python-edgar



In [22]:
import edgar

In [23]:
pip install ftfy



In [46]:
import tempfile
import edgar
import os
import re
import numpy as np
import pandas as pd
import requests
import lxml
#import ftfy
from bs4 import BeautifulSoup
from datetime import datetime
from time import sleep
import urllib.request
import logging
import sys

In [47]:
def get_annual_filings_df():
    logger.info('retrieving filings index...')
    #out = []
    # python-edgar writes one '<year>-<quarter>.tsv' index file per quarter into PLAIN_TEXT_PATH
    edgar.download_index(PLAIN_TEXT_PATH, START_YEAR, "yh3395@columbia.edu", skip_all_present_except_last=False)
    index_columns = ["cik", "company", "form_type", "date_filed", "txt_name", "link"]
    index_frames = []
    for year in range(START_YEAR, END_YEAR + 1):
        for qtr in GET_QUARTERS:
            file_name = f"{PLAIN_TEXT_PATH}/{year}-{qtr}.tsv"
            logger.info(f'reading {file_name}')
            try:
                # a regex separator needs the python parser engine; the files have no header row
                tmp_df = pd.read_csv(file_name, sep='[\t|]', engine='python',
                                     encoding='utf-8', names=index_columns)
            except OSError:
                # skip a missing quarter instead of silently abandoning all remaining ones
                logger.warning(f'index file not found: {file_name}')
                continue
            index_frames.append(tmp_df)
    if index_frames:
        full_company_index = pd.concat(index_frames)
    else:
        full_company_index = pd.DataFrame(columns=index_columns)

    full_company_index.cik = full_company_index.cik.astype(np.int64)
    # full_company_index['accession_number'] = full_company_index.link.apply(lambda x: x.split('/')[-1].split('.')[0])
    full_company_index['accession_number'] = full_company_index.txt_name.apply(lambda x: x.split('/')[-1].split('.')[0])
    #####
    # https://www.sec.gov/include/ticker.txt
    CIK_TICKER_LOOKUP_PATH = 'https://www.sec.gov/include/ticker.txt'
    cik_ticker = pd.read_csv(CIK_TICKER_LOOKUP_PATH, sep='\t', header=None, names=['ticker', 'cik'])
    # sics = pd.read_csv('./data/sics_new.csv',
    #                    usecols=['company_ticker', 'company_name', 'primary_industry_id', 'scope', 'is_active'])
    # sics = sics[(sics.scope == 'US') & (sics.is_active == 'Y')]

    full_company_index = pd.merge(full_company_index, cik_ticker, left_on="cik", right_on="cik", how='left')
    full_company_index.ticker = full_company_index.ticker.fillna('_UNK')
    full_company_index.ticker = full_company_index.ticker.str.upper()
    
    df_20f = full_company_index.loc[full_company_index.form_type=='20-F']
    # keep the seven UK companies of interest, matched by ticker
    df_20f_uk_top10 = df_20f.loc[df_20f.ticker.isin(['UL','AZN','HSBC','GSK','RTNTF','BTI','RDS-A'])]
    df_20f_uk_top10_21 = df_20f_uk_top10.loc[df_20f_uk_top10.date_filed>'2021-01-01']
    demo_uk20f = df_20f_uk_top10_21

    # full_company_index = pd.merge(full_company_index, sics, left_on="ticker", right_on="company_ticker", how='left')
    # full_company_index.drop(['company_ticker', 'company_name'], axis=1, inplace=True)

    #annual_filings = full_company_index[full_company_index.form_type.isin(FORM_TYPES)]
    annual_filings = demo_uk20f
    # , 'DEFA14A', 'PRE 14A', 'DEFM14A'
    # annual_filings = annual_filings[annual_filings.primary_industry_id.notna()]
    logger.info(f'annual filings: {len(annual_filings)}')
    return annual_filings

In [48]:
ukannual_filings_top10 = get_annual_filings_df()
print(ukannual_filings_top10.shape)
ukannual_filings_top10

18:45:08: retrieving filings index...
18:45:08: 9 index files to retrieve
18:45:09: > downloaded https://www.sec.gov/Archives/edgar/full-index/2022/QTR1/master.zip to /drive/MyDrive/Rights Colab YH/20F/2022-QTR1.tsv

(7, 8)


Unnamed: 0,cik,company,form_type,date_filed,txt_name,link,accession_number,ticker
595779,1089113,HSBC HOLDINGS PLC,20-F,2021-02-24,edgar/data/1089113/0001628280-21-003046.txt,edgar/data/1089113/0001628280-21-003046-index....,0001628280-21-003046,HSBC
706872,1131399,GLAXOSMITHKLINE PLC,20-F,2021-03-12,edgar/data/1131399/0001193125-21-079417.txt,edgar/data/1131399/0001193125-21-079417-index....,0001193125-21-079417,GSK
739528,1303523,British American Tobacco p.l.c.,20-F,2021-03-09,edgar/data/1303523/0001193125-21-073992.txt,edgar/data/1303523/0001193125-21-073992-index....,0001193125-21-073992,BTI
740160,1306965,Royal Dutch Shell plc,20-F,2021-03-11,edgar/data/1306965/0001306965-21-000025.txt,edgar/data/1306965/0001306965-21-000025-index....,0001306965-21-000025,RDS-A
1004707,217410,UNILEVER PLC,20-F,2021-03-10,edgar/data/217410/0001193125-21-075770.txt,edgar/data/217410/0001193125-21-075770-index.html,0001193125-21-075770,UL
1177272,887028,RIO TINTO LTD,20-F,2021-03-02,edgar/data/887028/0001628280-21-003713.txt,edgar/data/887028/0001628280-21-003713-index.html,0001628280-21-003713,RTNTF
1200493,901832,ASTRAZENECA PLC,20-F,2021-02-16,edgar/data/901832/0001104659-21-022456.txt,edgar/data/901832/0001104659-21-022456-index.html,0001104659-21-022456,AZN
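
The `accession_number` column in the table above is derived from `txt_name` by taking the file name and dropping its extension. A minimal standalone sketch of that step, using the HSBC path from the table (`accession_from_path` is a hypothetical name used only here):

```python
def accession_from_path(txt_name):
    # 'edgar/data/<cik>/<accession>.txt' -> '<accession>'
    return txt_name.split('/')[-1].split('.')[0]


print(accession_from_path('edgar/data/1089113/0001628280-21-003046.txt'))
# prints 0001628280-21-003046
```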


## Convert to function

In [39]:
pwd

'/content'

In [40]:
from google.colab import drive
#drive.mount('/content/drive')
drive.mount('/drive')

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


In [41]:
pip install python-edgar



In [42]:
pip install ftfy



### get logger
### parse_company_index
### get annual filings df - ukannual_filings
Top 7 UK firms' 20-Fs

In [49]:
import os
import re
import numpy as np
import pandas as pd
import requests
import lxml
import edgar
import ftfy
from bs4 import BeautifulSoup
from datetime import datetime
from time import sleep
import urllib.request
import logging
import sys

##### UPDATE ME! #####
#FORM_TYPES = ['10-K']
FORM_TYPES = ['20-F']

##### UPDATE ME! #####
#GET_QUARTERS = ['QTR1','QTR2','QTR3','QTR4']
GET_QUARTERS = ['QTR1']
START_YEAR = 2020
END_YEAR = 2021
#GET_QUARTERS = ['/2021/QTR1/']

# CIK_TICKER_LOOKUP_PATH = '/content/drive/MyDrive/data_for_good/ticker.txt'

##### UPDATE ME! #####
#PLAIN_TEXT_PATH = '/content/drive/MyDrive/DFG Cost of Human Rights Violations/Datasets/10k_clean_text/_0.0 downloaded/2021Q1/'
#PLAIN_TEXT_PATH = '/content/drive/MyDrive/DFG Cost of Human Rights Violations/Notebooks to share/Proxy_statement/new_proxy/''
PLAIN_TEXT_PATH = '/drive/MyDrive/Rights Colab YH/20F'



INDEX_PATH = 'www.sec.gov/Archives/edgar/full-index/'


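def get_logger():
    t = datetime.now().strftime('%m%d-%H%M')
    logger = logging.getLogger()
    # Attach the handlers only once. Re-running this cell would otherwise add another
    # pair of handlers each time, and every message would be printed once per handler.
    if not getattr(logger, '_edgar_handlers_added', False):
        ch = logging.StreamHandler()
        fh = logging.FileHandler(f'log_{t}.txt')
        formatter = logging.Formatter('%(asctime)s: %(message)s', datefmt='%H:%M:%S')
        ch.setFormatter(formatter)
        fh.setFormatter(formatter)
        logger.addHandler(ch)
        logger.addHandler(fh)
        logger._edgar_handlers_added = True
    logger.setLevel(logging.INFO)
    return logger
logger = get_logger()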

def parse_company_index(req):
    lines = req.text.split('\n')
    HEADER_LINE = 8
    COL_BREAKS = [62, 74, 86, 98, 120]
    ret = []
    for i in range(HEADER_LINE+2, len(lines)-1):
        line_list = []
        line_list.append(lines[i][:COL_BREAKS[0]].rstrip())
        line_list.append(lines[i][COL_BREAKS[0]:COL_BREAKS[1]].rstrip())
        line_list.append(lines[i][COL_BREAKS[1]:COL_BREAKS[2]].rstrip())
        line_list.append(lines[i][COL_BREAKS[2]:COL_BREAKS[3]].rstrip())
        line_list.append(lines[i][COL_BREAKS[3]:].rstrip())
        ret.append(line_list)
    columns = ['company', 'form_type', 'cik', 'date_filed', 'link']
    return pd.DataFrame(ret, columns=columns)

def get_annual_filings_df():
    logger.info('retrieving filings index...')
    #out = []
    # python-edgar writes one '<year>-<quarter>.tsv' index file per quarter into PLAIN_TEXT_PATH
    edgar.download_index(PLAIN_TEXT_PATH, START_YEAR, "yh3395@columbia.edu", skip_all_present_except_last=False)
    index_columns = ["cik", "company", "form_type", "date_filed", "txt_name", "link"]
    index_frames = []
    for year in range(START_YEAR, END_YEAR + 1):
        for qtr in GET_QUARTERS:
            file_name = f"{PLAIN_TEXT_PATH}/{year}-{qtr}.tsv"
            logger.info(f'reading {file_name}')
            try:
                # a regex separator needs the python parser engine; the files have no header row
                tmp_df = pd.read_csv(file_name, sep='[\t|]', engine='python',
                                     encoding='utf-8', names=index_columns)
            except OSError:
                # skip a missing quarter instead of silently abandoning all remaining ones
                logger.warning(f'index file not found: {file_name}')
                continue
            index_frames.append(tmp_df)
    if index_frames:
        full_company_index = pd.concat(index_frames)
    else:
        full_company_index = pd.DataFrame(columns=index_columns)

    full_company_index.cik = full_company_index.cik.astype(np.int64)
    # full_company_index['accession_number'] = full_company_index.link.apply(lambda x: x.split('/')[-1].split('.')[0])
    full_company_index['accession_number'] = full_company_index.txt_name.apply(lambda x: x.split('/')[-1].split('.')[0])
    #####
    # https://www.sec.gov/include/ticker.txt
    CIK_TICKER_LOOKUP_PATH = 'https://www.sec.gov/include/ticker.txt'
    cik_ticker = pd.read_csv(CIK_TICKER_LOOKUP_PATH, sep='\t', header=None, names=['ticker', 'cik'])
    # sics = pd.read_csv('./data/sics_new.csv',
    #                    usecols=['company_ticker', 'company_name', 'primary_industry_id', 'scope', 'is_active'])
    # sics = sics[(sics.scope == 'US') & (sics.is_active == 'Y')]

    full_company_index = pd.merge(full_company_index, cik_ticker, left_on="cik", right_on="cik", how='left')
    full_company_index.ticker = full_company_index.ticker.fillna('_UNK')
    full_company_index.ticker = full_company_index.ticker.str.upper()
    
    df_20f = full_company_index.loc[full_company_index.form_type=='20-F']
    df_20f_uk_top10 = df_20f.loc[df_20f.ticker.isin(['UL','AZN','HSBC','GSK','RTNTF','BTI','RDS-A'])]
    df_20f_uk_top10_21 = df_20f_uk_top10.loc[df_20f_uk_top10.date_filed>'2021-01-01']
    demo_uk20f = df_20f_uk_top10_21

    # full_company_index = pd.merge(full_company_index, sics, left_on="ticker", right_on="company_ticker", how='left')
    # full_company_index.drop(['company_ticker', 'company_name'], axis=1, inplace=True)

    #annual_filings = full_company_index[full_company_index.form_type.isin(FORM_TYPES)]
    annual_filings = demo_uk20f
    # , 'DEFA14A', 'PRE 14A', 'DEFM14A'
    # annual_filings = annual_filings[annual_filings.primary_industry_id.notna()]
    logger.info(f'annual filings: {len(annual_filings)}')
    return annual_filings
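
The CIK-to-ticker step in `get_annual_filings_df` is a left join with a `'_UNK'` fallback and upper-cased tickers. The sketch below shows the same lookup with a plain dictionary, assuming the tab-separated `ticker<TAB>cik` layout that the code reads from `ticker.txt`. The sample rows are illustrative, built from CIKs in the table above, and `load_cik_to_ticker` is a hypothetical helper.

```python
import csv
import io

# Illustrative rows in the 'ticker<TAB>cik' layout; not a real excerpt of ticker.txt.
SAMPLE = "hsbc\t1089113\ngsk\t1131399\nrds-a\t1306965\nrds-b\t1306965\n"


def load_cik_to_ticker(text):
    """Map each CIK to its first listed ticker, upper-cased."""
    lookup = {}
    for row in csv.reader(io.StringIO(text), delimiter='\t'):
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        ticker, cik = row
        # keep the first ticker seen for a CIK
        lookup.setdefault(int(cik), ticker.upper())
    return lookup


lookup = load_cik_to_ticker(SAMPLE)
print(lookup.get(1089113, '_UNK'))  # known CIK
print(lookup.get(42, '_UNK'))       # unknown CIK falls back to '_UNK'
```

One difference worth knowing: when a CIK has several tickers, a pandas left merge produces one row per ticker for the same filing, whereas this dictionary keeps only the first ticker.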



### scrape_filing
### parse_soup
### get_filing_text
### get_clean_text
### convert_to_plain_text
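
Before parsing, `scrape_filing` keeps only the first `<TEXT>` ... `</TEXT>` block of the raw submission file, which is the main document and excludes later exhibits. The following is a simplified offline sketch of that idea on a made-up sample string. `first_text_block` and `SAMPLE` are hypothetical and nothing is fetched from EDGAR.

```python
import io
import re


def first_text_block(lines):
    """Collect lines from the first <TEXT> tag through its closing </TEXT>."""
    out = []
    in_doc = False
    for line in lines:  # the loop ends at end of input even if </TEXT> is missing
        if line.startswith("<TEXT>"):
            in_doc = True
        if in_doc:
            # replace <br> variants with spaces so adjacent words are not squashed
            out.append(re.sub(r"<br\s*/?>", " ", line, flags=re.IGNORECASE))
        if line.startswith("</TEXT>"):
            break
    return "".join(out)


# Made-up miniature submission with a main document and one exhibit.
SAMPLE = (
    "<SEC-DOCUMENT>0000000000-00-000000.txt : 20210101\n"
    "<DOCUMENT>\n<TYPE>20-F\n<TEXT>\n"
    "<p>Annual<br>report</p>\n"
    "</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TEXT>\nexhibit\n</TEXT>\n</DOCUMENT>\n"
)

text = first_text_block(io.StringIO(SAMPLE))
print(text)
```

Iterating over the lines, rather than calling `readline()` in a `while True` loop, guarantees termination when a file has no closing `</TEXT>` tag.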

In [50]:
def scrape_filing(link, ticker, form_type, max_retries=5):
    url_prefix = "https://www.sec.gov/Archives/"
    file = None
    for attempt in range(1, max_retries + 1):
        try:
            # identify the client to the SEC through the User-Agent header
            req = urllib.request.Request(url_prefix + link, headers={'User-Agent' : "yh3395@columbia.edu"})
            file = urllib.request.urlopen(req)
            break
        except OSError:  # urllib.error.URLError is a subclass of OSError
            print('Opening', link, f'failed (attempt {attempt}/{max_retries}). Retrying...')
            sleep(attempt)  # wait a little longer after each failure
    if file is None:
        raise RuntimeError(f'could not open {url_prefix + link}')
    out = ''
    in_doc = False
    while True:
        line = file.readline().decode('utf-8', 'ignore')
        if not line:  # end of file reached without a closing </TEXT>
            break
        #if line.startswith("CONFORMED PERIOD OF REPORT"):
            #por = line.split(':')[-1].strip()
        if line.startswith("<TEXT>"):
            in_doc = True
        if in_doc:
            # bs4/lxml handles <br>s by simply removing them, which squashes words
            # replace <br>s with spaces before passing to bs4
            cleanline = re.sub(r'<br>|<BR>', ' ', line)
            out += cleanline + ' '
        if line.startswith("</TEXT>"):
            break
    return BeautifulSoup(out, 'lxml')

def parse_soup(soup, base_element='p', is_xbrl=False):
    out = []
    base_els = soup.find_all(base_element)

     # for iXBRL, the first <div> is a large metadata block, so skip it
    if is_xbrl: base_els = base_els[1:]

    n_base_els = len(base_els)

    i = 0
    in_table = False
    while i < n_base_els:
        el = base_els[i]

        # skip divs that contain other divs or tables to avoid recursion
        # i.e. divs that contain divs would otherwise appear twice
        descendants = [d.name for d in el.descendants]
        if base_element in descendants or 'table' in descendants:
            i += 1
            continue

        if el.parent.name != 'td': # ordinary line
            if in_table:
                out.append('[END TABLE]')
                out.append('\n')
                in_table = False
            # remove line breaks inside elements (iXBRL filings)
            out.append(el.text.replace('\n', ''))
            i += 1
            continue

        # loop through tables row-wise
        elif el.parent.name == 'td':
            if not in_table:
                out.append('[BEGIN TABLE]')
                in_table = True
            row_el = el.parent.parent
            # handling for poorly-formed table markup
            if row_el.name != 'tr': break

            # sometimes text is contained directly in <td>s without <div>s or <p>s inside
            # so search on <td> instead
            row_tds = row_el.find_all('td')

            n_tds = len(row_tds)
            row_text = ''
            for el in row_tds:
                # Tables in most annual filings contain tds with a single text element.
                # Some tables have <td>s containing multiple <div>s, which would otherwise
                # become squashed into a single string without spaces between words...
                if len(el.find_all('div')) >1:
                    row_text += ' '.join([e.text for e in el.find_all('div')])
                else:
                    # iXBRL filings often contain extra line breaks in text elements:
                    row_text += el.text.replace('\n', ' ') + ' '
                # since the row-wise loop is searching for <tr> elements, we only increment
                # if the <tr> contains the base_element (i.e. <div> or <p>), in order to keep
                # the counter i in sync with the base_els iterable
                if base_element in [e_.name for e_ in el.children]:
                    i += 1
            out.append(row_text)

    return ('\n').join(out)



def get_filing_text(soup):
    n_p = len(soup.find_all('p'))
    n_div = len(soup.find_all('div'))
    n_span = len(soup.find_all('span'))

    # if there are <span>s, the file is probably iXBRL.
    if n_span > n_p: return parse_soup(soup, 'div', is_xbrl=True)
    #if n_span > n_p: return parse_soup1
    # if not iXBRL, use <p> or <div>, whichever is more abundant in the markup
    elif n_p > n_div: return parse_soup(soup, 'p', is_xbrl=False)
    #elif n_p > n_div: return parse_soup2
    else: return parse_soup(soup, 'div', is_xbrl=False)
    #else: return parse_soup3


def get_clean_text(str_in):
    ret = str_in
    ret = re.sub(r'\x9f', '•', ret) # used as bullet in 0000004904-19-000009
    ret = ftfy.fix_text(ret)
    ret = re.sub(r'\xa0', ' ', ret) # remove \xa0

    # remove page breaks
    ret = re.sub(r'\s+(-\s*\d+\s*-\s*)+\s*(Table of Contents\s*)+\s*\n', '\n', ret) # e.g. 0001178879-19-000024
    ret = re.sub(r'\s*\d+(\s*Table of Contents\n)+\n', ' ', ret) # e.g. 0000764180-19-000023
    ret = re.sub(r'\s*(\d+\n)+\s*\n\n', ' ', ret) # e.g. 0000824142-19-000040
    ret = re.sub(r'\n\s*\d+\s*\n', ' ', ret) # base case: digits separated by line breaks

    # U+2022, U+00B7, U+25AA, U+25CF, U+25C6
    # the replacement must be a raw string: a plain '\1' is the control character \x01
    ret = re.sub(r'([•·▪●◆])\s*\n', r'\1 ', ret) # combine orphaned bullet chars separated from text by newline
    ret = re.sub(r'([•·▪●◆])(\w)', r'\1 \2', ret) # separate bullets squished next to text
    ret = re.sub(r'[ ]*([•·▪●◆])[ ]*', r'\1 ', ret) # consolidate whitespace around bullets

    # not active in order to preserve visual structure of multi-level bulleted lists
    #ret = re.sub(r'[•·▪●◆]', '•', ret) # use single bullet char

    # join orphaned sentences (line starts with a lower-case word)
    ret = re.sub(r'([a-z\,])\s*\n\s*([a-z])', r'\1 \2', ret)

    # fix table delimiters not separated by newline
    ret = re.sub(r'\[END TABLE\](.)', r'[END TABLE]\n\1', ret)
    ret = re.sub(r'(.)\[BEGIN TABLE\]', r'\1\n[BEGIN TABLE]', ret)

    # remove empty tables
    ret = re.sub(r'\[BEGIN TABLE\]\s*\n\s*\[END TABLE\]', r'\n', ret)

    # remove table delimiters if table has only one row
#    ret = re.sub(r'\[BEGIN TABLE\]\s*\n([^\n]+)\n\s*\[END TABLE\]', r'\1', ret)

    ret = re.sub(r'(\d)\s*\)', r'\1)', ret)
    ret = re.sub(r'\s*%', r'%', ret)

    ret = re.sub(r'\n\s*\n', r'\n', ret) # remove empty lines

    return ret


def convert_to_plain_text(df):
    # scrape and parse filings in the DataFrame; save as plain text
    n_rows = df.shape[0]
    for i, (_, row) in enumerate(df.iterrows()):
        ticker = row.ticker
        accession_num = row.accession_number
        form_type = row.form_type
        #soup, por = scrape_filing(row.link, ticker, form_type)
        soup = scrape_filing(row.txt_name, ticker, form_type)
        text = get_filing_text(soup)
        text = get_clean_text(text)
        #new_fname = f'{ticker}_{form_type}_{por}_{accession_num}.txt'
        new_fname = f'{ticker}_{form_type}_{accession_num}'
        try:
            # PLAIN_TEXT_PATH has no trailing slash, so plain string concatenation would
            # write '20FHSBC_...' next to the target folder; os.path.join avoids that
            with open(os.path.join(PLAIN_TEXT_PATH, new_fname + '.txt'), 'w', encoding='utf-8') as f:
                f.write(text)
            logger.info(f'{i+1}/{n_rows} {new_fname}')
        except OSError as err:
            print("Skipped:", type(err).__name__, "occurred.")


if __name__ == '__main__':
    logger = get_logger()
    annual_filings = get_annual_filings_df()
    convert_to_plain_text(annual_filings)

18:48:28: retrieving filings index...
18:48:28: 9 index files to retrieve
18:48:29: > downloaded https://www.sec.gov/Archives/edgar/full-index/2022/QTR1/master.zip to /drive/MyDrive/Righ
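
One detail in `get_clean_text` is easy to get wrong: in `re.sub`, a backreference in the replacement should be written as a raw string. Without the `r` prefix, Python reads `'\1'` as the control character `\x01` before `re` ever sees it. A small self-contained demonstration on a made-up string:

```python
import re

text = "•\nfirst item"

# Non-raw replacement: '\1' is an octal escape, i.e. the character \x01.
wrong = re.sub(r'([•·▪●◆])\s*\n', '\1 ', text)
# Raw replacement: r'\1' refers to capture group 1, the bullet itself.
right = re.sub(r'([•·▪●◆])\s*\n', r'\1 ', text)

print(repr(wrong))  # '\x01 first item'
print(repr(right))  # '• first item'
```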

## Do not use
Cells for testing and debugging the functions  
parse_soup  
get_filing_text  
get_clean_text  
using the HSBC 20-F as a demo.

In [None]:
parse_soup1 = parse_soup(soup, 'div', is_xbrl=True)
parse_soup2 = parse_soup(soup, 'p', is_xbrl=False)
parse_soup3 = parse_soup(soup, 'div', is_xbrl=False)

def get_filing_text(soup):
    n_p = len(soup.find_all('p'))
    n_div = len(soup.find_all('div'))
    n_span = len(soup.find_all('span'))

    # if there are <span>s, the file is probably iXBRL.
    #if n_span > n_p: return parse_soup(soup, 'div', is_xbrl=True)
    if n_span > n_p: return parse_soup1
    # if not iXBRL, use <p> or <div>, whichever is more abundant in the markup
    #elif n_p > n_div: return parse_soup(soup, 'p', is_xbrl=False)
    elif n_p > n_div: return parse_soup2
    #else: return parse_soup(soup, 'div', is_xbrl=False)
    else: return parse_soup3

In [None]:
text = get_filing_text(soup)

In [None]:
def get_clean_text(str_in):
    ret = str_in
    ret = re.sub(r'\x9f', '•', ret) # used as bullet in 0000004904-19-000009
    ret = ftfy.fix_text(ret)
    ret = re.sub(r'\xa0', ' ', ret) # remove \xa0

    # remove page breaks
    ret = re.sub(r'\s+(-\s*\d+\s*-\s*)+\s*(Table of Contents\s*)+\s*\n', '\n', ret) # e.g. 0001178879-19-000024
    ret = re.sub(r'\s*\d+(\s*Table of Contents\n)+\n', ' ', ret) # e.g. 0000764180-19-000023
    ret = re.sub(r'\s*(\d+\n)+\s*\n\n', ' ', ret) # e.g. 0000824142-19-000040
    ret = re.sub(r'\n\s*\d+\s*\n', ' ', ret) # base case: digits separated by line breaks

    # U+2022, U+00B7, U+25AA, U+25CF, U+25C6
    # the replacement must be a raw string: a plain '\1' is the control character \x01
    ret = re.sub(r'([•·▪●◆])\s*\n', r'\1 ', ret) # combine orphaned bullet chars separated from text by newline
    ret = re.sub(r'([•·▪●◆])(\w)', r'\1 \2', ret) # separate bullets squished next to text
    ret = re.sub(r'[ ]*([•·▪●◆])[ ]*', r'\1 ', ret) # consolidate whitespace around bullets

    # not active in order to preserve visual structure of multi-level bulleted lists
    #ret = re.sub(r'[•·▪●◆]', '•', ret) # use single bullet char

    # join orphaned sentences (line starts with a lower-case word)
    ret = re.sub(r'([a-z\,])\s*\n\s*([a-z])', r'\1 \2', ret)

    # fix table delimiters not separated by newline
    ret = re.sub(r'\[END TABLE\](.)', r'[END TABLE]\n\1', ret)
    ret = re.sub(r'(.)\[BEGIN TABLE\]', r'\1\n[BEGIN TABLE]', ret)

    # remove empty tables
    ret = re.sub(r'\[BEGIN TABLE\]\s*\n\s*\[END TABLE\]', r'\n', ret)

    # remove table delimiters if table has only one row
#    ret = re.sub(r'\[BEGIN TABLE\]\s*\n([^\n]+)\n\s*\[END TABLE\]', r'\1', ret)

    ret = re.sub(r'(\d)\s*\)', r'\1)', ret)
    ret = re.sub(r'\s*%', r'%', ret)

    ret = re.sub(r'\n\s*\n', r'\n', ret) # remove empty lines

    return ret

In [None]:
def convert_to_plain_text(df):
    # scrape and parse filings in the DataFrame; save as plain text
    n_rows = df.shape[0]
    for i, row in enumerate(df.iterrows()):
        ticker = row[1].ticker
        accession_num = row[1].accession_number
        form_type = row[1].form_type
        #soup, por = scrape_filing(row[1].link, ticker, form_type)
        #soup, por = scrape_filing(row[1].txt_name, ticker, form_type)
        text = get_filing_text(soup)
        text = get_clean_text(text)
        #new_fname = f'{ticker}_{form_type}_{por}_{accession_num}.txt'
        new_fname = f'{ticker}_{form_type}_{accession_num}'
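        try:
          # PLAIN_TEXT_PATH has no trailing slash, so join the parts with os.path.join
          with open(os.path.join(PLAIN_TEXT_PATH, new_fname + '.txt'), 'w', encoding='utf-8') as f:
              f.write(text)
          logger.info(f'{i+1}/{n_rows} {new_fname}')
        except OSError as err:
          print("Skipped:", type(err).__name__, "occurred.")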

In [None]:
if __name__ == '__main__':
    logger = get_logger()
    #annual_filings = get_annual_filings_df()
    convert_to_plain_text(annual_filings)

INFO:root:1/7 HSBC_20-F_0001628280-21-003046
17:26:38: 1/7 HSBC_20-F_0001628280-21-003046
INFO:root:2/7 GSK_20-F_0001193125-21-079417
17:26:41: 2/7 GSK_20-F_0001193125-21-079417
INFO:root:3/7 BTI_20-F_0001193125-21-073992
17:26:44: 3/7 BTI_20-F_0001193125-21-073992

##### DO NOT USE

In [None]:
# check the number of files in the directory

import os
import glob
import pandas as pd

# FOLDER_PATH = '/content/drive/MyDrive/data_for_good/10k_clean_text_tmp/'
FOLDER_PATH = '/content/drive/MyDrive/DFG Cost of Human Rights Violations/Datasets/10k_clean_text/'
PERIOD = '*Q*'

# Get a list of all file paths ending in .txt in the specified directory
fileList = glob.glob(FOLDER_PATH+PERIOD+'/*.txt')


df_list = pd.DataFrame(fileList, columns=['filename'])

df_list.filename = df_list.filename.str.replace(FOLDER_PATH,'')
df_list[['period', 'filename']] = df_list.filename.str.split(pat='/',expand=True)
df_list.filename = df_list.filename.str.replace('_UNK','UNK')
df_list[['ticker', 'form_type', 'filing_period', 'filename']] = df_list.filename.str.split(pat='_',expand=True)
# df_list = df_list[df_list.ticker == 'UNK']
df_list = df_list[df_list.filing_period != 'log']
df_list = df_list[df_list.form_type == '10-K']
df_list = df_list.sort_values('ticker')
df_list['year'] = df_list['period'].str[:4]
# print("# of files: ", len(fileList))
# print("# of tickers: ", len(df_list.ticker))
# print("# of unique tickers: ", len(set(df_list.ticker)))

df_list.to_csv('/content/drive/MyDrive/data_for_good/list_of_raw_files.csv')

# Iterate over the list of filepaths & remove each file.
# for filePath in fileList:
#     try:
#         # os.remove(filePath)
#         # print("removed: ", filePath)
#     except:
#         print("Error while deleting file : ", filePath)

### Do not use
Try the HSBC 20-F as a demo

In [None]:
url_prefix = "https://www.sec.gov/Archives/"

In [None]:
demolink = annual_filings.iloc[0,:].txt_name
#demolink = annual_filings.iloc[0,:].link
demoticker = annual_filings.iloc[0,:].ticker
demoft = annual_filings.iloc[0,:].form_type
print(url_prefix+demolink)
print(demoticker)
print(demoft)

https://www.sec.gov/Archives/edgar/data/1089113/0001628280-21-003046.txt
HSBC
20-F


In [None]:
demolink = annual_filings.iloc[1,:].txt_name

demoticker = annual_filings.iloc[1,:].ticker
demoft = annual_filings.iloc[1,:].form_type
print(url_prefix+demolink)
print(demoticker)
print(demoft)

https://www.sec.gov/Archives/edgar/data/1131399/0001193125-21-079417.txt
GSK
20-F


In [None]:
soup = scrape_filing(link=demolink, ticker=demoticker, form_type=demoft)

In [None]:
url_prefix = "https://www.sec.gov/Archives/"
print(url_prefix + demolink)
req = urllib.request.Request(url_prefix + demolink, headers={'User-Agent' : "yh3395@columbia.edu"}) 
        # file = urllib.request.urlopen(url_prefix + link)
file = urllib.request.urlopen(req)
file.readline().decode('utf-8', 'ignore')

https://www.sec.gov/Archives/edgar/data/1131399/0001193125-21-079417.txt


'<SEC-DOCUMENT>0001193125-21-079417.txt : 20210312\n'

In [None]:
req = requests.get(url_prefix + demolink, headers={'User-Agent' : "yh3395@columbia.edu"})
soup = BeautifulSoup(req.content, 'html.parser')

In [None]:
url = 'https://www.sec.gov/'
url = url+docurl
url
req = requests.get(url, headers={'User-Agent' : "yh3395@columbia.edu"})
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <!-- Required meta tags -->
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <!-- Check Browser, as well as Hostname and set JS, CSS, Favicon path based on that-->
  <!-- Favicon -->
  <!-- cpcommit -->
  <!-- <link rel="shortcut icon" type="image/x-icon" href="https://www.sec.gov/favicon.ico"> -->
  <script>
   (function() {
		var hostName = window.location.hostname;

		var browserAccepted = function() {

			var ua = window.navigator.userAgent;
			var trident = ua.indexOf('Trident/'); //IE 11;
			var msie = ua.indexOf('MSIE '); // IE 10 or older
			if (msie > 0) {
				return false;
			}
			return true;
		}();

		var pathToLibs = './js';
		var pathToError = '/';
		if (hostName === 'www-test.sec.gov' || hostName === 'www.sec.gov') {
			pathToLibs = './ixviewer/js';
			pathToError = '/ixviewer/';
		}
		if (browserAccepted) {

			// css
			document
					.