<a href="https://colab.research.google.com/github/Maggiey01/Rights-Colab-YH/blob/main/Download_20_F.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

The purpose of this notebook is to download and scrape Form 20-F filings from the SEC's EDGAR system. More specifically, we:
* download the list of 20-F submissions for a specified submission year (e.g. 2021)
* download each 20-F submission in .txt format
* scrape and clean the text data from each 20-F submission
* save the result to a .txt file named "Ticker_FormType_FiscalYearEnd_AccessionNumber.txt", where AccessionNumber is the unique identifier that the EDGAR Filer System automatically assigns to an accepted submission. (The code in this notebook currently leaves the FiscalYearEnd part out of the name.)
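
As an illustration of the naming convention above, here is a minimal sketch. The helper names `build_filename` and `split_filename` are hypothetical (they do not appear elsewhere in this notebook), and the fiscal-year-end value is only an example.

```python
def build_filename(ticker, form_type, fiscal_year_end, accession_number):
    """Join the four parts with underscores and add the .txt extension."""
    parts = [ticker.upper(), form_type, fiscal_year_end, accession_number]
    return "_".join(parts) + ".txt"


def split_filename(fname):
    """Recover the four parts from a name produced by build_filename.

    rsplit is used so that a ticker starting with an underscore
    (such as the '_UNK' placeholder) stays in one piece.
    """
    stem = fname[:-len(".txt")] if fname.endswith(".txt") else fname
    ticker, form_type, fiscal_year_end, accession_number = stem.rsplit("_", 3)
    return ticker, form_type, fiscal_year_end, accession_number


# Illustrative values only; "20201231" is an assumed fiscal year end.
name = build_filename("SAP", "20-F", "20201231", "0001104659-21-031847")
print(name)
print(split_filename(name))
```

Splitting from the right keeps the parse unambiguous as long as the form type, date and accession number contain no underscores.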

All results are saved in [this Google Drive Folder](https://drive.google.com/drive/folders/1hZ8vrUYZL0bIk6uNpVmbIJJHKmOzzriH?usp=sharing) organized by submission quarters.

Demo: 7 UK firms
* Get clean, formatted annual 20-F filings for 7 UK companies, selected by ticker.
* Overall, the output text of the 7 top UK companies' 20-Fs is readable and clean (all in .txt format).
* Formatting is standardized across the 7 UK companies' 20-Fs (a similar function can be used to parse all the 20-Fs).


**Before working with this notebook, please make a copy of this template onto your own Google Drive.** 

Please note that this process generally takes 1-5 seconds per submission, depending on its size. As an example:

* For the 2021 submission period, it took 1 h 47 min to download 1108 20-F submissions (roughly 5.8 seconds each, slightly above the typical range).


As such, it is a good idea to restart the runtime in the Colab environment to avoid a timeout.
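
Because a full run takes well over an hour, it can help to skip filings that were already saved before a restart. The sketch below is one possible way to do that; `pending_filings` is a hypothetical helper, and it assumes output files are named `Ticker_FormType_AccessionNumber.txt`.

```python
import os
import tempfile


def pending_filings(rows, out_dir):
    """Return the rows whose output file is not yet present in out_dir.

    rows: iterable of (ticker, form_type, accession_number) tuples.
    """
    done = set(os.listdir(out_dir)) if os.path.isdir(out_dir) else set()
    todo = []
    for ticker, form_type, accession in rows:
        fname = f"{ticker}_{form_type}_{accession}.txt"  # assumed naming scheme
        if fname not in done:
            todo.append((ticker, form_type, accession))
    return todo


# Small demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    rows = [
        ("SAP", "20-F", "0001104659-21-031847"),
        ("HSBC", "20-F", "0001628280-21-003046"),
    ]
    # Pretend the first filing was saved during an earlier session.
    open(os.path.join(tmp, "SAP_20-F_0001104659-21-031847.txt"), "w").close()
    remaining = pending_filings(rows, tmp)

print(remaining)
```

Only the rows returned by the helper would then need to be downloaded and parsed again.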



For country and industry
* 1108 annual 20-F filings in 2021, downloaded in 1 h 47 min.
* The sec-edgar API documentation does not list an industry or country parameter, so we need to find a dataset or API that can group tickers/companies by industry, for example the SIC code list:
https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list

* Use NLP to retrieve country information by looking for a common pattern in the corpus:
  1. the text labelled "(Jurisdiction of incorporation or organization)", or
  2. if that is missing, the text labelled "(Address of principal executive offices)".

public trade in
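
One candidate source for industry codes is the per-company JSON served from `data.sec.gov/submissions/`. The sketch below assumes that payload exposes `sic` and `sicDescription` fields; please verify this against the current EDGAR documentation before relying on it. The helper names are hypothetical and the sample payload is hand-written, not a real response.

```python
import json
import urllib.request


def submissions_url(cik):
    """URL of the per-company JSON (CIK zero-padded to 10 digits)."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def extract_industry(payload):
    """Pick the (assumed) SIC fields out of an already-parsed payload."""
    return {
        "sic": payload.get("sic"),
        "sic_description": payload.get("sicDescription"),
    }


def fetch_industry(cik, user_agent):
    """Network call, not executed in this example.

    user_agent should identify the requester, as in the rest of the notebook.
    """
    req = urllib.request.Request(
        submissions_url(cik), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(req) as resp:
        return extract_industry(json.load(resp))


# Hand-written stand-in for a response, used only to show the parsing step.
sample = {"cik": "1000184", "sic": "7372", "sicDescription": "Services-Prepackaged Software"}
print(submissions_url(1000184))
print(extract_industry(sample))
```

The resulting SIC code could then be matched against the SIC code list linked above.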


## Step 1: Install package required

In [None]:
pip install ftfy

Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Installing collected packages: ftfy
Successfully installed ftfy-6.1.1


In [None]:
pip install python-edgar

Collecting python-edgar
  Downloading python_edgar-3.1.3-py3-none-any.whl (8.6 kB)
Installing collected packages: python-edgar
Successfully installed python-edgar-3.1.3


In [None]:
import edgar

## Step 2: Download and scrape 20-Fs 
**Please update the following BEFORE executing the code.**
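* `START_YEAR` - The first submission year to process.
* `END_YEAR` - The last submission year to process (inclusive).
* `GET_QUARTERS` - The submission quarters to process, as a list of labels of the form "QTR#" (e.g. `['QTR1','QTR2']`). The year comes from `START_YEAR` and `END_YEAR`.
* `PLAIN_TEXT_PATH` - This should be the destination folder for the results.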
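
To make the effect of these settings concrete, the sketch below lists the index files that the loader further down would look for. `expected_index_files` is a hypothetical helper; the `<year>-<quarter>.tsv` pattern mirrors the file names built in `get_annual_filings_df`.

```python
import os


def expected_index_files(base_dir, start_year, end_year, quarters):
    """List the index files the loader will look for, in reading order."""
    files = []
    for year in range(start_year, end_year + 1):
        for qtr in quarters:
            # Catch labels such as "2021/QTR1" early; only "QTR1".."QTR4" are valid.
            if qtr not in ("QTR1", "QTR2", "QTR3", "QTR4"):
                raise ValueError(f"unexpected quarter label: {qtr!r}")
            files.append(os.path.join(base_dir, f"{year}-{qtr}.tsv"))
    return files


files = expected_index_files(
    "/drive/MyDrive/Rights Colab YH/20F21", 2021, 2021, ["QTR1", "QTR2"]
)
for path in files:
    print(path)
```

Running a check like this before the download makes a mistyped quarter label fail immediately rather than being silently skipped.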

## Try python-edgar for 20-F

```
This section only tests the package. Go to the next section ("Convert to function") if you want the whole function.
```

In [None]:
import tempfile
import edgar
import os
import re
import numpy as np
import pandas as pd
import requests
import lxml
#import ftfy
from bs4 import BeautifulSoup
from datetime import datetime
from time import sleep
import urllib.request
import logging
import sys

In [None]:
PLAIN_TEXT_PATH = '/drive/MyDrive/Rights Colab YH/20F21'
START_YEAR = 2021
END_YEAR = 2021
GET_QUARTERS = ['QTR1','QTR2','QTR3','QTR4']

In [None]:
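def get_logger():
    t = datetime.now().strftime('%m%d-%H%M')
    logger = logging.getLogger()
    # Re-running this cell would otherwise stack extra handlers on the root
    # logger and print every message several times, so drop the old ones first.
    for h in list(logger.handlers):
        logger.removeHandler(h)
        h.close()
    ch = logging.StreamHandler()
    fh = logging.FileHandler(f'log_{t}.txt')
    formatter = logging.Formatter('%(asctime)s: %(message)s', datefmt='%H:%M:%S')
    ch.setFormatter(formatter)
    fh.setFormatter(formatter)
    logger.addHandler(ch)
    logger.addHandler(fh)
    logger.setLevel(logging.INFO)
    return logger
logger = get_logger()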

def get_annual_filings_df():
    logger.info('retrieving filings index...')
    #out = []
    edgar.download_index(PLAIN_TEXT_PATH, START_YEAR, "yh3395@columbia.edu",skip_all_present_except_last=False)
    full_company_index = pd.DataFrame()
    try: 
      for Year in range(START_YEAR, END_YEAR+1):
        for QTR in GET_QUARTERS:
          logger.info(QTR)
          file_name = PLAIN_TEXT_PATH + "/"+str(Year)+"-"+QTR+".tsv"
          with open(file_name, "r", encoding="utf-8") as f:
            #tmp_df = pd.read_csv(f, sep = '\t', delimiter = '|', names=["cik","company","form_type","date_filed","txt_name","link"])
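            # a regex separator needs the python engine; naming it explicitly avoids a ParserWarning
            tmp_df = pd.read_csv(f, sep = r'[\t|]', engine='python', names=["cik","company","form_type","date_filed","txt_name","link"])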
            #idx_df = parse_company_index(tmp_df)
            if full_company_index.shape[0] == 0:
              full_company_index = tmp_df.copy()
            else:
              full_company_index = pd.concat([full_company_index, tmp_df])
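    except OSError as e:
      # a missing or unreadable index file stops the loop; report it instead of failing silently
      logger.info(f'could not read index file: {e}')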

    full_company_index.cik = full_company_index.cik.astype(np.int64)
    # full_company_index['accession_number'] = full_company_index.link.apply(lambda x: x.split('/')[-1].split('.')[0])
    full_company_index['accession_number'] = full_company_index.txt_name.apply(lambda x: x.split('/')[-1].split('.')[0])
    #####
    # https://www.sec.gov/include/ticker.txt
    CIK_TICKER_LOOKUP_PATH = 'https://www.sec.gov/include/ticker.txt'
    cik_ticker = pd.read_csv(CIK_TICKER_LOOKUP_PATH, sep='\t', header=None, names=['ticker', 'cik'])
    # sics = pd.read_csv('./data/sics_new.csv',
    #                    usecols=['company_ticker', 'company_name', 'primary_industry_id', 'scope', 'is_active'])
    # sics = sics[(sics.scope == 'US') & (sics.is_active == 'Y')]

    full_company_index = pd.merge(full_company_index, cik_ticker, left_on="cik", right_on="cik", how='left')
    full_company_index.ticker = full_company_index.ticker.fillna('_UNK')
    full_company_index.ticker = full_company_index.ticker.str.upper()
    
    df_20f = full_company_index.loc[full_company_index.form_type=='20-F']
    # find uk top10 companies with ticker
    #df_20f_uk_top10 = df_20f.loc[df_20f.ticker.isin(['UL','AZN','HSBC','GSK','RTNTF','BTI','RDS-A'])]
    #df_20f_uk_top10_21 = df_20f_uk_top10.loc[df_20f_uk_top10.date_filed>'2021-01-01']
    #demo_uk20f = df_20f_uk_top10_21
    df_20f_21 = df_20f.loc[df_20f.date_filed>'2021-01-01']
    # full_company_index = pd.merge(full_company_index, sics, left_on="ticker", right_on="company_ticker", how='left')
    # full_company_index.drop(['company_ticker', 'company_name'], axis=1, inplace=True)

    #annual_filings = full_company_index[full_company_index.form_type.isin(FORM_TYPES)]
    #annual_filings = demo_uk20f
    annual_filings = df_20f_21
    # , 'DEFA14A', 'PRE 14A', 'DEFM14A'
    # annual_filings = annual_filings[annual_filings.primary_industry_id.notna()]
    logger.info(f'annual filings: {len(annual_filings)}')
    return annual_filings
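
The core of `get_annual_filings_df` can be tried on a tiny in-memory index. The sketch below assumes the index files are pipe-delimited; every row and the ticker lookup are made up for illustration and do not come from EDGAR.

```python
import io

import pandas as pd

COLUMNS = ["cik", "company", "form_type", "date_filed", "txt_name", "link"]

# Hand-written rows in the assumed pipe-delimited layout (not real filings).
sample_index = io.StringIO(
    "1111111|EXAMPLE HOLDINGS PLC|20-F|2021-03-04|edgar/data/1111111/0001111111-21-000001.txt|edgar/data/1111111/0001111111-21-000001-index.html\n"
    "2222222|SAMPLE CORP|10-K|2021-02-01|edgar/data/2222222/0002222222-21-000002.txt|edgar/data/2222222/0002222222-21-000002-index.html\n"
    "3333333|NO TICKER LTD|20-F|2021-04-30|edgar/data/3333333/0003333333-21-000003.txt|edgar/data/3333333/0003333333-21-000003-index.html\n"
)
index_df = pd.read_csv(sample_index, sep="|", names=COLUMNS, dtype={"cik": "int64"})

# The accession number is the file name of the .txt path without its extension.
index_df["accession_number"] = (
    index_df["txt_name"].str.split("/").str[-1].str.split(".").str[0]
)

# Hypothetical ticker lookup standing in for the SEC ticker file.
tickers = pd.DataFrame({"ticker": ["exmp"], "cik": [1111111]})

# A left join keeps filings without a known ticker; they get the '_UNK' placeholder.
merged = index_df.merge(tickers, on="cik", how="left")
merged["ticker"] = merged["ticker"].fillna("_UNK").str.upper()

annual = merged[merged["form_type"] == "20-F"]
print(annual[["cik", "ticker", "accession_number"]])
```

The left join is what keeps unmatched companies in the result, which is why the notebook's output can contain `_UNK` tickers.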

In [None]:
df_20f_21 = get_annual_filings_df()
print(df_20f_21.shape)


23:09:39: retrieving filings index...
23:09:39: 5 index files to retrieve
23:09:41: > downloaded https://www.sec.gov/Archives/edgar/full-index/2022/QTR1/master.zip to /drive/MyDrive/Rights Colab YH/20F21/2022-QTR1.tsv
23:09:43: > downloaded https://www.sec.gov/Archives/edgar/full-index/2021/QTR4/master.zip to /drive/MyDrive/Rights Colab YH/20F21/2021-QTR4.tsv
23:09:46: > downloaded https://www.sec.gov/Archives/edgar/full-index/2021/QTR3/master.zip to /drive/MyDrive/Rights Colab YH/20F21/2021-QTR3.tsv
23:09:49: > downloaded https://www.sec.gov/Archives/edgar/full-index/2021/QTR2/master.zip to /drive/MyDrive/Rights Colab YH/20F21/2021-QTR2.tsv
23:09:52: > downloaded https://www.sec.gov/Archives/edgar/full-index/2021/QTR1/master.zip to /drive/MyDrive/Rights Colab YH/20F21/2021-QTR1.tsv
23:09:52: complete
23:09:52: QTR1
23:09:59: QTR2
23:10:02: QTR3
23:10:05: QTR4
23:10:12: annual filings: 1094


(1094, 8)


In [None]:
df_20f_21.iloc[0]

cik                                                           1000184
company                                                        SAP SE
form_type                                                        20-F
date_filed                                                 2021-03-04
txt_name                  edgar/data/1000184/0001104659-21-031847.txt
link                edgar/data/1000184/0001104659-21-031847-index....
accession_number                                 0001104659-21-031847
ticker                                                            SAP
Name: 13, dtype: object
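
The `txt_name` field shown above is a path relative to the EDGAR archive root. A minimal sketch of how it maps to a download URL and an accession number follows; the helper names are hypothetical, and the prefix is the same one `scrape_filing` uses later in this notebook.

```python
from urllib.parse import urljoin

ARCHIVES = "https://www.sec.gov/Archives/"


def filing_url(txt_name):
    """Full URL of the submission's .txt file."""
    return urljoin(ARCHIVES, txt_name)


def accession_from_path(txt_name):
    """File name without directory or extension, e.g. 0001104659-21-031847."""
    return txt_name.split("/")[-1].split(".")[0]


txt_name = "edgar/data/1000184/0001104659-21-031847.txt"
print(filing_url(txt_name))
print(accession_from_path(txt_name))
```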

In [None]:
ukannual_filings_top10 = get_annual_filings_df()
print(ukannual_filings_top10.shape)


18:45:08: retrieving filings index...
18:45:08: 9 index files to retrieve
18:45:09: > downloaded https://www.sec.gov/Archives/edgar/full-index/2022/QTR1/master.zip to /drive/MyDrive/Rights Colab YH/20F/2022-QTR1.tsv

(7, 8)


Unnamed: 0,cik,company,form_type,date_filed,txt_name,link,accession_number,ticker
595779,1089113,HSBC HOLDINGS PLC,20-F,2021-02-24,edgar/data/1089113/0001628280-21-003046.txt,edgar/data/1089113/0001628280-21-003046-index....,0001628280-21-003046,HSBC
706872,1131399,GLAXOSMITHKLINE PLC,20-F,2021-03-12,edgar/data/1131399/0001193125-21-079417.txt,edgar/data/1131399/0001193125-21-079417-index....,0001193125-21-079417,GSK
739528,1303523,British American Tobacco p.l.c.,20-F,2021-03-09,edgar/data/1303523/0001193125-21-073992.txt,edgar/data/1303523/0001193125-21-073992-index....,0001193125-21-073992,BTI
740160,1306965,Royal Dutch Shell plc,20-F,2021-03-11,edgar/data/1306965/0001306965-21-000025.txt,edgar/data/1306965/0001306965-21-000025-index....,0001306965-21-000025,RDS-A
1004707,217410,UNILEVER PLC,20-F,2021-03-10,edgar/data/217410/0001193125-21-075770.txt,edgar/data/217410/0001193125-21-075770-index.html,0001193125-21-075770,UL
1177272,887028,RIO TINTO LTD,20-F,2021-03-02,edgar/data/887028/0001628280-21-003713.txt,edgar/data/887028/0001628280-21-003713-index.html,0001628280-21-003713,RTNTF
1200493,901832,ASTRAZENECA PLC,20-F,2021-02-16,edgar/data/901832/0001104659-21-022456.txt,edgar/data/901832/0001104659-21-022456-index.html,0001104659-21-022456,AZN
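
The seven-firm demo table above comes from filtering the 20-F rows by ticker. A minimal sketch of that selection is shown below; `select_demo_filings` is a hypothetical helper and the sample rows are illustrative rather than taken from the index.

```python
import pandas as pd

# Ticker list used for the UK demo in this notebook.
UK_DEMO_TICKERS = ['UL', 'AZN', 'HSBC', 'GSK', 'RTNTF', 'BTI', 'RDS-A']


def select_demo_filings(df, tickers, form_type="20-F", filed_after="2021-01-01"):
    """Keep rows of one form type, for the given tickers, filed after a date.

    Dates are compared as strings, which works because they are in
    ISO format (YYYY-MM-DD).
    """
    mask = (
        (df["form_type"] == form_type)
        & df["ticker"].isin(tickers)
        & (df["date_filed"] > filed_after)
    )
    return df.loc[mask]


# Illustrative rows only.
sample = pd.DataFrame({
    "ticker": ["HSBC", "SAP", "GSK", "HSBC"],
    "form_type": ["20-F", "20-F", "6-K", "20-F"],
    "date_filed": ["2021-02-24", "2021-03-04", "2021-03-12", "2020-02-19"],
})
demo = select_demo_filings(sample, UK_DEMO_TICKERS)
print(demo)
```

Only the first row survives: the others fail the ticker, form-type or date condition.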


## Convert to function

In [None]:
pwd

'/content'

In [None]:
from google.colab import drive
#drive.mount('/content/drive')
drive.mount('/drive')

Mounted at /drive


In [None]:
pip install python-edgar



In [None]:
pip install ftfy



### get_logger
### parse_company_index
### get_annual_filings_df - ukannual_filings
Top 7 UK firms' 20-F filings

In [None]:
import os
import re
import numpy as np
import pandas as pd
import requests
import lxml
import edgar
import ftfy
from bs4 import BeautifulSoup
from datetime import datetime
from time import sleep
import urllib.request
import logging
import sys

##### UPDATE ME! #####
#FORM_TYPES = ['10-K']
FORM_TYPES = ['20-F']

##### UPDATE ME! #####
GET_QUARTERS = ['QTR1','QTR2','QTR3','QTR4']
#GET_QUARTERS = ['QTR1']
START_YEAR = 2021
END_YEAR = 2021
#GET_QUARTERS = ['/2021/QTR1/']

# CIK_TICKER_LOOKUP_PATH = '/content/drive/MyDrive/data_for_good/ticker.txt'

##### UPDATE ME! #####
#PLAIN_TEXT_PATH = '/content/drive/MyDrive/DFG Cost of Human Rights Violations/Datasets/10k_clean_text/_0.0 downloaded/2021Q1/'
#PLAIN_TEXT_PATH = '/content/drive/MyDrive/DFG Cost of Human Rights Violations/Notebooks to share/Proxy_statement/new_proxy/''
PLAIN_TEXT_PATH = '/drive/MyDrive/Rights Colab YH/20F'



INDEX_PATH = 'www.sec.gov/Archives/edgar/full-index/'


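def get_logger():
    t = datetime.now().strftime('%m%d-%H%M')
    logger = logging.getLogger()
    # Re-running this cell would otherwise stack extra handlers on the root
    # logger and print every message several times, so drop the old ones first.
    for h in list(logger.handlers):
        logger.removeHandler(h)
        h.close()
    ch = logging.StreamHandler()
    fh = logging.FileHandler(f'log_{t}.txt')
    formatter = logging.Formatter('%(asctime)s: %(message)s', datefmt='%H:%M:%S')
    ch.setFormatter(formatter)
    fh.setFormatter(formatter)
    logger.addHandler(ch)
    logger.addHandler(fh)
    logger.setLevel(logging.INFO)
    return logger
logger = get_logger()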

def parse_company_index(req):
    lines = req.text.split('\n')
    HEADER_LINE = 8
    COL_BREAKS = [62, 74, 86, 98, 120]
    ret = []
    for i in range(HEADER_LINE+2, len(lines)-1):
        line_list = []
        line_list.append(lines[i][:COL_BREAKS[0]].rstrip())
        line_list.append(lines[i][COL_BREAKS[0]:COL_BREAKS[1]].rstrip())
        line_list.append(lines[i][COL_BREAKS[1]:COL_BREAKS[2]].rstrip())
        line_list.append(lines[i][COL_BREAKS[2]:COL_BREAKS[3]].rstrip())
        line_list.append(lines[i][COL_BREAKS[3]:].rstrip())
        ret.append(line_list)
    columns = ['company', 'form_type', 'cik', 'date_filed', 'link']
    return pd.DataFrame(ret, columns=columns)


def get_annual_filings_df():
    logger.info('retrieving filings index...')
    #out = []
    edgar.download_index(PLAIN_TEXT_PATH, START_YEAR, "yh3395@columbia.edu",skip_all_present_except_last=False)
    full_company_index = pd.DataFrame()
    try: 
      for Year in range(START_YEAR, END_YEAR+1):
        for QTR in GET_QUARTERS:
          logger.info(QTR)
          file_name = PLAIN_TEXT_PATH + "/"+str(Year)+"-"+QTR+".tsv"
          with open(file_name, "r", encoding="utf-8") as f:
            #tmp_df = pd.read_csv(f, sep = '\t', delimiter = '|', names=["cik","company","form_type","date_filed","txt_name","link"])
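            # a regex separator needs the python engine; naming it explicitly avoids a ParserWarning
            tmp_df = pd.read_csv(f, sep = r'[\t|]', engine='python', names=["cik","company","form_type","date_filed","txt_name","link"])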
            #idx_df = parse_company_index(tmp_df)
            if full_company_index.shape[0] == 0:
              full_company_index = tmp_df.copy()
            else:
              full_company_index = pd.concat([full_company_index, tmp_df])
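    except OSError as e:
      # a missing or unreadable index file stops the loop; report it instead of failing silently
      logger.info(f'could not read index file: {e}')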

    full_company_index.cik = full_company_index.cik.astype(np.int64)
    # full_company_index['accession_number'] = full_company_index.link.apply(lambda x: x.split('/')[-1].split('.')[0])
    full_company_index['accession_number'] = full_company_index.txt_name.apply(lambda x: x.split('/')[-1].split('.')[0])
    #####
    # https://www.sec.gov/include/ticker.txt
    CIK_TICKER_LOOKUP_PATH = 'https://www.sec.gov/include/ticker.txt'
    cik_ticker = pd.read_csv(CIK_TICKER_LOOKUP_PATH, sep='\t', header=None, names=['ticker', 'cik'])
    # sics = pd.read_csv('./data/sics_new.csv',
    #                    usecols=['company_ticker', 'company_name', 'primary_industry_id', 'scope', 'is_active'])
    # sics = sics[(sics.scope == 'US') & (sics.is_active == 'Y')]

    full_company_index = pd.merge(full_company_index, cik_ticker, left_on="cik", right_on="cik", how='left')
    full_company_index.ticker = full_company_index.ticker.fillna('_UNK')
    full_company_index.ticker = full_company_index.ticker.str.upper()
    
    df_20f = full_company_index.loc[full_company_index.form_type=='20-F']
    # find uk top10 companies with ticker
    #df_20f_uk_top10 = df_20f.loc[df_20f.ticker.isin(['UL','AZN','HSBC','GSK','RTNTF','BTI','RDS-A'])]
    #df_20f_uk_top10_21 = df_20f_uk_top10.loc[df_20f_uk_top10.date_filed>'2021-01-01']
    #demo_uk20f = df_20f_uk_top10_21
    df_20f_21 = df_20f.loc[df_20f.date_filed>'2021-01-01']
    # full_company_index = pd.merge(full_company_index, sics, left_on="ticker", right_on="company_ticker", how='left')
    # full_company_index.drop(['company_ticker', 'company_name'], axis=1, inplace=True)

    #annual_filings = full_company_index[full_company_index.form_type.isin(FORM_TYPES)]
    #annual_filings = demo_uk20f
    annual_filings = df_20f_21
    # , 'DEFA14A', 'PRE 14A', 'DEFM14A'
    # annual_filings = annual_filings[annual_filings.primary_industry_id.notna()]
    logger.info(f'annual filings: {len(annual_filings)}')
    return annual_filings
# UK demo variant: to restrict the result to the seven UK firms, uncomment the
# three `df_20f_uk_top10...` / `demo_uk20f` lines inside get_annual_filings_df()
# above and set `annual_filings = demo_uk20f`.

### scrape_filing
### parse_soup
### get_filing_text
### get_clean_text
### convert_to_plain_text

In [None]:
def scrape_filing(link, ticker, form_type):
    url_prefix = "https://www.sec.gov/Archives/"
    # print(url_prefix + link)
    # sleep(30)
    file = None
    while file is None:
      try:
        #req = urllib.request.Request(url_prefix + link, headers={'User-Agent' : "Magic Browser"}) 
        req = urllib.request.Request(url_prefix + link, headers={'User-Agent' : "yh3395@columbia.edu"}) 
        # file = urllib.request.urlopen(url_prefix + link)
        file = urllib.request.urlopen(req)
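      except Exception as e:
        # a bare except would also swallow KeyboardInterrupt; pause briefly before retrying
        print('Opening', link, 'failed:', e, '- retrying...')
        sleep(1)
        file = None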
    out = ''
    por = ''
    in_doc = False
    i = 0
    while True:
        i += 1
        line = file.readline().decode('utf-8', 'ignore')
        if not line:  # end of file without a closing </TEXT>; stop instead of looping forever
            break
        #if line.startswith("CONFORMED PERIOD OF REPORT"):
            #por = line.split(':')[-1].strip()
        if line.startswith("<TEXT>"):
            in_doc = True
        if in_doc:
            # bs4/lxml handles <br>s by simply removing them, which squashes words
            # replace <br>s with spaces before passing to bs4
            cleanline = re.sub(r'<br>|<BR>', ' ', line)
            out += cleanline + ' '
        if line.startswith("</TEXT>"):
            break
    return BeautifulSoup(out, 'lxml')

def parse_soup(soup, base_element='p', is_xbrl=False):
    out = []
    base_els = soup.find_all(base_element)

    # for iXBRL, the first <div> is a large metadata block, so skip it
    if is_xbrl: base_els = base_els[1:]

    n_base_els = len(base_els)

    i = 0
    in_table = False
    while i < n_base_els:
        el = base_els[i]

        # skip divs that contain other divs or tables to avoid recursion
        # i.e. divs that contain divs would otherwise appear twice
        descendants = [d.name for d in el.descendants]
        if base_element in descendants or 'table' in descendants:
            i += 1
            continue

        if el.parent.name != 'td': # ordinary line
            if in_table:
                out.append('[END TABLE]')
                out.append('\n')
                in_table = False
            # remove line breaks inside elements (iXBRL filings)
            out.append(el.text.replace('\n', ''))
            i += 1
            continue

        # loop through tables row-wise
        elif el.parent.name == 'td':
            if not in_table:
                out.append('[BEGIN TABLE]')
                in_table = True
            row_el = el.parent.parent
            # handling for poorly-formed table markup
            if row_el.name != 'tr': break

            # sometimes text is contained directly in <td>s without <div>s or <p>s inside
            # so search on <td> instead
            row_tds = row_el.find_all('td')

            n_tds = len(row_tds)
            row_text = ''
            for el in row_tds:
                # Tables in most annual filings contain tds with a single text element.
                # Some tables have <td>s containing multiple <div>s, which would otherwise
                # become squashed into a single string without spaces between words...
                if len(el.find_all('div')) >1:
                    row_text += ' '.join([e.text for e in el.find_all('div')])
                else:
                    # iXBRL filings often contain extra line breaks in text elements:
                    row_text += el.text.replace('\n', ' ') + ' '
                # since the row-wise loop is searching for <tr> elements, we only increment
                # if the <tr> contains the base_element (i.e. <div> or <p>), in order to keep
                # the counter i in sync with the base_els iterable
                if base_element in [e_.name for e_ in el.children]:
                    i += 1
            out.append(row_text)

    return ('\n').join(out)



def get_filing_text(soup):
    n_p = len(soup.find_all('p'))
    n_div = len(soup.find_all('div'))
    n_span = len(soup.find_all('span'))

    # if there are <span>s, the file is probably iXBRL.
    if n_span > n_p: return parse_soup(soup, 'div', is_xbrl=True)
    #if n_span > n_p: return parse_soup1
    # if not iXBRL, use <p> or <div>, whichever is more abundant in the markup
    elif n_p > n_div: return parse_soup(soup, 'p', is_xbrl=False)
    #elif n_p > n_div: return parse_soup2
    else: return parse_soup(soup, 'div', is_xbrl=False)
    #else: return parse_soup3


def get_clean_text(str_in):
    ret = str_in
    ret = re.sub(r'\x9f', '•', ret) # used as bullet in 0000004904-19-000009
    ret = ftfy.fix_text(ret)
    ret = re.sub(r'\xa0', ' ', ret) # remove \xa0

    # remove page breaks
    ret = re.sub(r'\s+(-\s*\d+\s*-\s*)+\s*(Table of Contents\s*)+\s*\n', '\n', ret) # e.g. 0001178879-19-000024
    ret = re.sub(r'\s*\d+(\s*Table of Contents\n)+\n', ' ', ret) # e.g. 0000764180-19-000023
    ret = re.sub(r'\s*(\d+\n)+\s*\n\n', ' ', ret) # e.g. 0000824142-19-000040
    ret = re.sub(r'\n\s*\d+\s*\n', ' ', ret) # base case: digits separated by line breaks

    # U+2022, U+00B7, U+25AA, U+25CF, U+25C6
    ret = re.sub(r'([•·▪●◆])\s*\n', r'\1 ', ret) # combine orphaned bullet chars separated from text by newline (raw string so \1 is a backreference)
    ret = re.sub(r'([•·▪●◆])(\w)', r'\1 \2', ret) # separate bullets squished next to text
    ret = re.sub(r'[ ]*([•·▪●◆])[ ]*', r'\1 ', ret) # consolidate whitespace around bullets

    # not active in order to preserve visual structure of multi-level bulleted lists
    #ret = re.sub(r'[•·▪●◆]', '•', ret) # use single bullet char

    # join orphaned sentences (line starts with a lower-case word)
    ret = re.sub(r'([a-z\,])\s*\n\s*([a-z])', r'\1 \2', ret)

    # fix table delimiters not separated by newline
    ret = re.sub(r'\[END TABLE\](.)', r'[END TABLE]\n\1', ret)
    ret = re.sub(r'(.)\[BEGIN TABLE\]', r'\1\n[BEGIN TABLE]', ret)

    # remove empty tables
    ret = re.sub(r'\[BEGIN TABLE\]\s*\n\s*\[END TABLE\]', r'\n', ret)

    # remove table delimiters if table has only one row
#    ret = re.sub(r'\[BEGIN TABLE\]\s*\n([^\n]+)\n\s*\[END TABLE\]', r'\1', ret)

    ret = re.sub(r'(\d)\s*\)', r'\1)', ret)
    ret = re.sub(r'\s*%', r'%', ret)

    ret = re.sub(r'\n\s*\n', r'\n', ret) # remove empty lines

    return ret


def convert_to_plain_text(df):
    # scrape and parse filings in the DataFrame; save each one as a plain-text file
    n_rows = df.shape[0]
    for i, (_, row) in enumerate(df.iterrows()):
        ticker = row.ticker
        accession_num = row.accession_number
        form_type = row.form_type
        #soup, por = scrape_filing(row.link, ticker, form_type)
        soup = scrape_filing(row.txt_name, ticker, form_type)
        text = get_filing_text(soup)
        text = get_clean_text(text)
        #new_fname = f'{ticker}_{form_type}_{por}_{accession_num}.txt'
        new_fname = f'{ticker}_{form_type}_{accession_num}.txt'
        try:
          # os.path.join adds the separator that PLAIN_TEXT_PATH lacks, so the
          # file lands inside the destination folder rather than next to it
          with open(os.path.join(PLAIN_TEXT_PATH, new_fname), 'w', encoding='utf-8') as f:
              f.write(text)
          logger.info(f'{i+1}/{n_rows} {new_fname}')
        except Exception as e:
          print("Skipped:", type(e).__name__, "occurred.")


if __name__ == '__main__':
    logger = get_logger()
    annual_filings = get_annual_filings_df()
    convert_to_plain_text(annual_filings)

04:12:13: retrieving filings index...
04:12:13: 5 index files to retrieve
04:12:15: > downloaded https://www.sec.gov/Archives/edgar/full-index/2022/QTR1/master.zip to /drive/MyDrive/Rights Colab YH/20F/2022-QTR1.tsv
04:12:17: > downloaded https://www.sec.gov/Archives/edgar/full-index/2021/QTR4/master.zip to /drive/MyDrive/Rights Colab YH/20F/2021-QTR4.tsv
04:12:20: > downloaded https://www.sec.gov/Archives/edgar/full-index/2021/QTR3/master.zip to /drive/MyDrive/Rights Colab YH/20F/2021-QTR3.tsv
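
The saved text marks tables with `[BEGIN TABLE]` and `[END TABLE]` lines. As a rough illustration of how a downstream step might use those markers, the sketch below separates table rows from prose. `split_tables` is a hypothetical helper, and it assumes every table has both markers on their own lines; a table at the very end of a filing may lack the closing marker and would then be left in the prose.

```python
import re

# Non-greedy match of everything between one pair of markers.
TABLE_RE = re.compile(r"\[BEGIN TABLE\]\n(.*?)\n\[END TABLE\]", re.DOTALL)


def split_tables(text):
    """Return (prose_without_tables, list_of_tables).

    Each table is a list of row strings, one per line between the markers.
    """
    tables = [m.group(1).split("\n") for m in TABLE_RE.finditer(text)]
    prose = TABLE_RE.sub("", text)
    prose = re.sub(r"\n\s*\n", "\n", prose).strip()  # drop the empty lines left behind
    return prose, tables


# Hand-written sample in the shape produced by parse_soup / get_clean_text.
sample = "Revenue grew.\n[BEGIN TABLE]\n2021 100 \n2020 90 \n[END TABLE]\nCosts fell."
prose, tables = split_tables(sample)
print(prose)
print(tables)
```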


## Get country

In [None]:
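# get country - state
def get_state(tmp_f):
    import re
    # rough heuristic: take the first word that follows a comma, optionally
    # followed by the "Jurisdiction of incorporation" label
    reg_country = r', (\w*)(Jurisdiction of incorporation)*'
    # the re module has no find(); search() returns the first match or None
    match = re.search(reg_country, tmp_f)
    country = match.group(1) if match else None
    return country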
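
The two-step idea from the overview (look for the jurisdiction label first, then fall back to the address label) could be sketched as below. This is an assumption-laden illustration: `guess_country` is a hypothetical helper, it assumes the value sits on the line directly above the label, and the sample cover-page text is made up. Real filings may be laid out differently.

```python
import re

# The value is assumed to be on the line directly above each label.
JURISDICTION_RE = re.compile(
    r"^[ \t]*(\S[^\n]*?)[ \t]*\n\s*\(\s*Jurisdiction of incorporation",
    re.IGNORECASE | re.MULTILINE,
)
ADDRESS_RE = re.compile(
    r"^[ \t]*(\S[^\n]*?)[ \t]*\n\s*\(\s*Address of principal executive offices",
    re.IGNORECASE | re.MULTILINE,
)


def guess_country(text):
    """Return (value, source), where source says which label was used."""
    m = JURISDICTION_RE.search(text)
    if m:
        return m.group(1).strip(), "jurisdiction"
    m = ADDRESS_RE.search(text)
    if m:
        # Fall back to the last comma-separated part of the address line.
        return m.group(1).split(",")[-1].strip(), "address"
    return None, None


# Made-up cover-page text for illustration.
cover = (
    "Example Holdings plc\n"
    "England and Wales\n"
    "(Jurisdiction of incorporation or organization)\n"
    "1 Example Street, London, United Kingdom\n"
    "(Address of principal executive offices)\n"
)
address_only = (
    "1 Example Street, London, United Kingdom\n"
    "(Address of principal executive offices)\n"
)
print(guess_country(cover))
print(guess_country(address_only))
```

Returning the source alongside the value makes it easy to review the fallback cases separately, since an address yields a country while the jurisdiction line may name a region or state.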

In [None]:
def file_seeker(path_in, sw):
    import os
    import pandas as pd
    import nltk
    # the word list must be downloaded once with nltk.download('words');
    # a set makes the per-word membership test fast
    dictionary = set(nltk.corpus.words.words("en"))
    rows = []
    for dirName, subdirList, fileList in os.walk(path_in):
        tmp_dir_name = os.path.basename(dirName)
        try:
            for fname in fileList:
                #print(fname)
                with open(os.path.join(dirName, fname), "r", encoding="ISO-8859-1") as tmp:
                    tmp_f = tmp.read()
                if sw == "check_words":
                    # keep only dictionary words (split the text, not the file handle)
                    tmp_f = ' '.join(w for w in tmp_f.split() if w in dictionary)

                # DataFrame.append was removed in pandas 2.0, so collect rows in a list
                rows.append({'label': tmp_dir_name,
                             #'path': os.path.join(dirName, fname),
                             'body': tmp_f,
                             'state': get_state(tmp_f)})
        except Exception as e:
            print(e)
    return pd.DataFrame(rows)



## Get industry - use map

In [None]:
from google.colab import drive

drive.mount('drive')

Mounted at drive


In [None]:
path = '/content/drive/MyDrive/DFG/' 
df_map = pd.read_csv(path+'usable_10ks_with_sics.csv')
df_map.head()

Unnamed: 0,filename,filepath,ticker,type,fy,doc_id,id,company_ticker,company_name,isin,naics_code,country,exchange,market_cap,market_cap_category,revenue,primary_industry_id,secondary_industry_id,tertiary_industry_id,is_active,last_modified_date,scope,ticker_display
0,STAG_10-K_20121231_0001047469-13-002284.txt,/content/drive/MyDrive/DFG Cost of Human Right...,STAG,10-K,20121231,0001047469-13-002284.txt,135671,STAG,"Stag Industrial, Inc.",US85254J1025,,US,New York,3575124000.0,Medium Cap,,IF-RE,,,Y,2019-12-13 00:08:25,US,STAG
1,STAG_10-K_20131231_0001047469-14-001398.txt,/content/drive/MyDrive/DFG Cost of Human Right...,STAG,10-K,20131231,0001047469-14-001398.txt,135671,STAG,"Stag Industrial, Inc.",US85254J1025,,US,New York,3575124000.0,Medium Cap,,IF-RE,,,Y,2019-12-13 00:08:25,US,STAG
2,STAG_10-K_20141231_0001558370-15-000135.txt,/content/drive/MyDrive/DFG Cost of Human Right...,STAG,10-K,20141231,0001558370-15-000135.txt,135671,STAG,"Stag Industrial, Inc.",US85254J1025,,US,New York,3575124000.0,Medium Cap,,IF-RE,,,Y,2019-12-13 00:08:25,US,STAG
3,STAG_10-K_20151231_0001479094-16-000006.txt,/content/drive/MyDrive/DFG Cost of Human Right...,STAG,10-K,20151231,0001479094-16-000006.txt,135671,STAG,"Stag Industrial, Inc.",US85254J1025,,US,New York,3575124000.0,Medium Cap,,IF-RE,,,Y,2019-12-13 00:08:25,US,STAG
4,STAG_10-K_20161231_0001479094-17-000005.txt,/content/drive/MyDrive/DFG Cost of Human Right...,STAG,10-K,20161231,0001479094-17-000005.txt,135671,STAG,"Stag Industrial, Inc.",US85254J1025,,US,New York,3575124000.0,Medium Cap,,IF-RE,,,Y,2019-12-13 00:08:25,US,STAG


In [None]:
list1 = list(df_20f_21['ticker'])
list2 = list(df_map['ticker'])

sametic = []

for tick in list1:
    if tick in list2:
        sametic.append(tick)
print(len(sametic))
print(sametic)

11
['NESR', 'UMEWF', 'SPI', 'AKTX', 'ANY', 'BRQS', 'KXIN', 'UMEWF', 'UMEWF', 'RCON', 'RGC']


In [None]:
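# df_20f_21_c is not defined anywhere above; it is assumed here to be a working copy of df_20f_21
df_20f_21_c = df_20f_21.copy()
map_tickers = set(df_map['ticker'])  # build the lookup once instead of once per row
for tick in list(df_20f_21_c['ticker']):
  if tick in map_tickers:
    df_20f_21_c.loc[df_20f_21_c['ticker']==tick,['industry_id']] = df_map.loc[df_map['ticker']==tick,['primary_industry_id']].values[0]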
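
The same ticker-to-industry lookup can also be written without a Python loop. The sketch below is one possible alternative, not the method used above; `add_industry` is a hypothetical helper and the two small frames are stand-ins for `df_20f_21_c` and `df_map`.

```python
import pandas as pd


def add_industry(filings, mapping):
    """Return a copy of filings with an industry_id column looked up by ticker."""
    lookup = (
        mapping.dropna(subset=["primary_industry_id"])
        .drop_duplicates(subset="ticker")  # one industry per ticker
        .set_index("ticker")["primary_industry_id"]
    )
    out = filings.copy()
    out["industry_id"] = out["ticker"].map(lookup)  # NaN where the ticker is unknown
    return out


# Small stand-in frames for illustration.
filings = pd.DataFrame({"ticker": ["NESR", "SAP", "NESR"]})
mapping = pd.DataFrame({
    "ticker": ["NESR", "NESR", "STAG"],
    "primary_industry_id": ["FN-AC", "FN-AC", "IF-RE"],
})
result = add_industry(filings, mapping)
print(result)
```

Tickers that are missing from the mapping keep a missing `industry_id`, which makes the unmatched filings easy to count afterwards.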

In [None]:
sametic
df_20f_21_c.loc[df_20f_21_c['ticker'].isin(sametic) ,:]

Unnamed: 0,cik,company,form_type,date_filed,txt_name,link,accession_number,ticker,industry_id
316461,1698514,National Energy Services Reunited Corp.,20-F,2021-03-24,edgar/data/1698514/0001493152-21-006706.txt,edgar/data/1698514/0001493152-21-006706-index....,0001493152-21-006706,NESR,FN-AC
801142,1114936,UMeWorld Ltd,20-F,2021-06-04,edgar/data/1114936/0001477932-21-003808.txt,edgar/data/1114936/0001477932-21-003808-index....,0001477932-21-003808,UMEWF,TC-IM
816529,1210618,"SPI Energy Co., Ltd.",20-F,2021-04-29,edgar/data/1210618/0001683168-21-001594.txt,edgar/data/1210618/0001683168-21-001594-index....,0001683168-21-001594,SPI,RR-ST
885886,1541157,Akari Therapeutics Plc,20-F,2021-04-21,edgar/data/1541157/0001104659-21-052672.txt,edgar/data/1541157/0001104659-21-052672-index....,0001104659-21-052672,AKTX,HC-BP
900096,1591956,Sphere 3D Corp,20-F,2021-04-09,edgar/data/1591956/0001591956-21-000010.txt,edgar/data/1591956/0001591956-21-000010-index....,0001591956-21-000010,ANY,TC-SI
916465,1650575,"Borqs Technologies, Inc.",20-F,2021-04-26,edgar/data/1650575/0001213900-21-022875.txt,edgar/data/1650575/0001213900-21-022875-index....,0001213900-21-022875,BRQS,TC-SI
940776,1713539,Kaixin Auto Holdings,20-F,2021-05-14,edgar/data/1713539/0001104659-21-067017.txt,edgar/data/1713539/0001104659-21-067017-index....,0001104659-21-067017,KXIN,CG-MR
1382669,1114936,UMeWorld Ltd,20-F,2021-07-16,edgar/data/1114936/0001477932-21-004699.txt,edgar/data/1114936/0001477932-21-004699-index....,0001477932-21-004699,UMEWF,TC-IM
1382670,1114936,UMeWorld Ltd,20-F,2021-09-27,edgar/data/1114936/0001477932-21-006655.txt,edgar/data/1114936/0001477932-21-006655-index....,0001477932-21-006655,UMEWF,TC-IM
1956144,1442620,"Recon Technology, Ltd",20-F,2021-11-15,edgar/data/1442620/0001104659-21-139057.txt,edgar/data/1442620/0001104659-21-139057-index....,0001104659-21-139057,RCON,EM-SV


In [None]:
df_20f_21_c.loc[df_20f_21_c['ticker'].isin(sametic) ,:].shape

(11, 9)