**Table of contents**<a id='toc0_'></a>    
- [Stock Screener Module](#toc1_)    
  - [Preliminary Imports and ENV variable definitions](#toc1_1_)    
  - [Downloading/Scraping necessary information for the screening process](#toc1_2_)    
    - [Obtaining the ETF holdings](#toc1_2_1_)    
    - [Obtaining the failstodeliver information.](#toc1_2_2_)    
    - [Obtaining 13f Forms](#toc1_2_3_)    
  - [Creation of the main dataset](#toc1_3_)    
    - [Collecting data from the submission table](#toc1_3_1_)    
    - [Converting the Accession Numbers to Name of Issuers](#toc1_3_2_)    
    - [Converting CUSIP to Ticker Symbols](#toc1_3_3_)    
    - [Adding the remainder of the ticker symbols to the dataset](#toc1_3_4_)    
  - [Optimizing the dataset and creating subset b](#toc1_4_)    
  - [Creation of the subset c](#toc1_5_)    
  - [Storing the ticker symbols into the cfs module](#toc1_6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Stock Screener Module](#toc0_)
A software module that generates a subset (b) from a superset (a) of stock names based on screening criteria. Further sub-subsets (c), etc. are created by applying additional criteria (fundamental analysis, etc.) until one final subset remains (the target stock list).


## <a id='toc1_1_'></a>[Preliminary Imports and ENV variable definitions](#toc0_)
Perform the necessary imports and define the constants.
`CIK_IDENTIFIERS` is a curated list of CIK IDs.

In [None]:

import csv
import os
import yfinance as yf

from dotenv import load_dotenv
from pprint import pprint

load_dotenv()
FILE_PATH = r"./dataset/" 


CIK_IDENTIFIERS = [
    '0001720792',
    '0001099281',
    '0001079114',
    '0001112520',
    '0001641864',
    '0000846222',
    '0001709323',
    '0000732905',
    '0000883965',
    '0001067983',
    '0001061768',
]

## <a id='toc1_2_'></a>[Downloading/Scraping necessary information for the screening process](#toc0_)
In this section we download the required data and store it in the /dataset folder for use by the ssm. First we take the current date in order to select the proper month, quarter and year for the downloads below. We also define the headers for the requests to the SEC.

In [None]:
import requests
import zipfile
from io import BytesIO
from datetime import datetime

# Empty out directory
files = [filename for filename in os.listdir(FILE_PATH) if not filename.startswith("README")]
for file in files:
    os.remove(FILE_PATH+file)

month = datetime.now().month
# Most recently completed quarter: Jan-Mar -> 4 (of last year), Apr-Jun -> 1, Jul-Sep -> 2, Oct-Dec -> 3
quarter = (month - 1) // 3 or 4
print(quarter)
year = datetime.now().year

headers = {
    'Host': 'www.sec.gov', 'Connection': 'close',
    'Accept': 'application/json, text/javascript, */*; q=0.01', 'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
}       

### <a id='toc1_2_1_'></a>[Obtaining the ETF holdings](#toc0_)
In this section we obtain the ETF holdings from the specified urls, then write them into csv files located in /dataset.

In [None]:
url_high_div_etf = 'https://www.blackrock.com/us/individual/products/239563/ishares-high-dividend-etf/1464253357814.ajax?fileType=csv&fileName=HDV_holdings&dataType=fund'
url_core_div_etf = 'https://www.ishares.com/us/products/291387/fund/1467271812596.ajax?fileType=csv&fileName=DIVB_holdings&dataType=fund'

r = requests.get(url_high_div_etf, allow_redirects=True)
with open(FILE_PATH + 'HDV_holdings.csv', 'wb+') as f:
    f.write(r.content)

r = requests.get(url_core_div_etf, allow_redirects=True)
with open(FILE_PATH + 'DIVB_holdings.csv', 'wb+') as f:
    f.write(r.content)

with open(FILE_PATH + 'HDV_holdings.csv', 'r', encoding='utf-8-sig') as fp:
    lines = fp.readlines()

with open(FILE_PATH + 'HDV_holdings.csv', 'w', encoding='utf-8-sig') as fp:
    for i, line in enumerate(lines):
        if i<9: continue
        fp.write(line)

with open(FILE_PATH + 'DIVB_holdings.csv', 'r', encoding='utf-8-sig') as fp:
    lines = fp.readlines()

with open(FILE_PATH + 'DIVB_holdings.csv', 'w', encoding='utf-8-sig') as fp:
    for i, line in enumerate(lines):
        if i<9: continue
        fp.write(line)

### <a id='toc1_2_2_'></a>[Obtaining the failstodeliver information.](#toc0_)
Next we get the fails-to-deliver data. We use it to map the CUSIPs, which are extracted later from the 13Fs, to ticker symbols. The URLs of the files share one format and vary only in year and month. The code below fetches the first-half (`a`) file for each of the previous three months, so if today were February 1st 2022 the first URL would be:
`https://www.sec.gov/files/data/fails-deliver-data/cnsfails202201a.zip`
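
The month rollover behind these URLs can be sketched as a small helper. This is a minimal sketch, not the notebook's own code: `previous_months` and `fails_deliver_urls` are hypothetical names, and only the first-half (`a`) file of each month is assumed, as in the URL above.

```python
from datetime import date

BASE_URL = "https://www.sec.gov/files/data/fails-deliver-data"

def previous_months(today, count=3):
    """Return (year, month) pairs for the `count` months before `today`, newest first."""
    pairs = []
    year, month = today.year, today.month
    for _ in range(count):
        month -= 1
        if month == 0:  # stepping back from January lands in December of the prior year
            year, month = year - 1, 12
        pairs.append((year, month))
    return pairs

def fails_deliver_urls(today, count=3):
    """Build the first-half ('a') fails-to-deliver URLs for the previous months."""
    return [
        f"{BASE_URL}/cnsfails{year}{month:02d}a.zip"
        for year, month in previous_months(today, count)
    ]

# Hypothetical run date, chosen so that the year boundary is crossed.
urls = fails_deliver_urls(date(2022, 2, 1))
```

The `:02d` format specifier takes care of zero-padding single-digit months, so no separate string branch is needed.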

In [None]:
fmonth = month
fyear = year
for i in range(3):
    fmonth = int(fmonth)-1
    fyear = fyear if fmonth != 0 else fyear-1
    if fmonth == 0:
        str_month = '12'
        fmonth = 12
    elif fmonth < 10:
        str_month = '0' + str(fmonth)
    else:
        str_month = str(fmonth)

    print(f"month {str_month} year {fyear}")
    fails_deliver_url = f'https://www.sec.gov/files/data/fails-deliver-data/cnsfails{fyear}{str_month}a.zip'
    r = requests.get(fails_deliver_url, headers=headers, allow_redirects=True)
    z = zipfile.ZipFile(BytesIO(r.content))
    z.extract(f'cnsfails{fyear}{str_month}a', FILE_PATH)

### <a id='toc1_2_3_'></a>[Obtaining 13f Forms](#toc0_)
Finally we get the 13F data sets for the 3 most recent quarters.

As with the fails-to-deliver data, the URLs follow a fixed format:

https://www.sec.gov/files/structureddata/data/form-13f-data-sets/{year}q{quarter}_form13f.zip

So we first need to determine the year and convert the month to a quarter.

We then download the 3 most recent 13F data sets and extract the 2 files we need (INFOTABLE and SUBMISSION) from each zip.
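
The quarter arithmetic can be sketched in isolation. This is a sketch under assumptions: `last_completed_quarter`, `recent_quarters` and `form13f_url` are hypothetical helpers, and it assumes the intent is to start from the most recently completed quarter and step backwards.

```python
def last_completed_quarter(year, month):
    """Map a calendar month to the most recently completed (year, quarter)."""
    quarter = (month - 1) // 3  # 0 for Jan-Mar, 1 for Apr-Jun, 2 for Jul-Sep, 3 for Oct-Dec
    if quarter == 0:            # in Q1 the last completed quarter is Q4 of the previous year
        return year - 1, 4
    return year, quarter

def recent_quarters(year, quarter, count=3):
    """Return `count` (year, quarter) pairs, starting at the given one and stepping back."""
    pairs = []
    for _ in range(count):
        pairs.append((year, quarter))
        quarter -= 1
        if quarter == 0:
            year, quarter = year - 1, 4
    return pairs

def form13f_url(year, quarter):
    """Fill the URL template quoted above."""
    return (
        "https://www.sec.gov/files/structureddata/data/form-13f-data-sets/"
        f"{year}q{quarter}_form13f.zip"
    )

# Hypothetical run date: May 2022.
start_year, start_quarter = last_completed_quarter(2022, 5)
pairs = recent_quarters(start_year, start_quarter)
urls = [form13f_url(y, q) for y, q in pairs]
```

A data set for the newest quarter may not be published yet, so a real download loop still needs a fallback to the next older quarter.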

In [None]:
if quarter == 4: # q4 comes out in the new year so most recent data will be in last year for the 13f data.
    year -= 1

files = 3 # Last {files} quarters of 13f data
i = 0 # iterator not pythonic but might be mutated
while i < files:
    
    if i > 0:
        quarter -= 1
        if quarter <= 0:
            quarter = 4
            year -= 1
            
    url = f'https://www.sec.gov/files/structureddata/data/form-13f-data-sets/{year}q{quarter}_form13f.zip'
    try:
        r = requests.get(url, headers=headers, allow_redirects=True)
        z = zipfile.ZipFile(BytesIO(r.content))
        zipinfos = z.infolist()
        for zipinfo in zipinfos:
            if "INFOTABLE" in zipinfo.filename:
                zipinfo.filename = f'INFOTABLE_{year}_q{quarter}.tsv'
                z.extract(zipinfo, FILE_PATH)
            elif "SUBMISSION" in zipinfo.filename:
                zipinfo.filename = f'SUBMISSION_{year}_q{quarter}.tsv'
                z.extract(zipinfo, FILE_PATH)
    except zipfile.BadZipFile:
        print("13f of specific quarter not present") 
        files += 1 # Get next most recent
    i+=1

## <a id='toc1_3_'></a>[Creation of the main dataset](#toc0_)
We will now compile the data we just obtained from the 13Fs and the ETF holdings into one main dataset. To do this we must first gather the holdings from the 13Fs and convert them into ticker symbols.

### <a id='toc1_3_1_'></a>[Collecting data from the submission table](#toc0_)
From the SUBMISSION table we fetch a list of ACCESSION_NUMBER(s) using the CIK identifiers defined at the start of the program.


In [None]:
picked_submissions = []

prefixed = [filename for filename in os.listdir(FILE_PATH) if filename.startswith("SUBMISSION")]
print(prefixed)

for file in prefixed:
    with open(FILE_PATH + file, 'r', encoding='utf-8') as q:
        for submission in csv.DictReader(q, delimiter="\t"):
            if submission["CIK"] in CIK_IDENTIFIERS:
                picked_submissions.append(submission["ACCESSION_NUMBER"])

pprint(len(picked_submissions))


### <a id='toc1_3_2_'></a>[Converting the Accession Numbers to Name of Issuers](#toc0_)
From the INFOTABLE we collect the CUSIP of every holding whose ACCESSION_NUMBER was picked in the previous cell. We use the CUSIP rather than NAMEOFISSUER to map between sources because it is unique, whereas names differ slightly. Note that the set is still called `names_of_issuers` in the code.

In [None]:
names_of_issuers = set()  # holds CUSIPs, despite the name
picked = set(picked_submissions)  # set lookups avoid scanning the list for every row

prefixed = [filename for filename in os.listdir(FILE_PATH) if filename.startswith("INFOTABLE")]
print(prefixed)

for file in prefixed:
    with open(FILE_PATH + file, 'r', encoding='utf-8') as q:
        for entry in csv.DictReader(q, delimiter="\t"):
            if entry["ACCESSION_NUMBER"] in picked:
                names_of_issuers.add(entry["CUSIP"].upper())

### <a id='toc1_3_3_'></a>[Converting CUSIP to Ticker Symbols](#toc0_)

Now we need to convert the CUSIPs to tickers. We do this using the fails-to-deliver (cnsfails) files, which list each security's SYMBOL alongside its CUSIP.

We lose some data in this step because not every CUSIP can be converted to a ticker.
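
The lookup and the resulting data loss can be illustrated on a tiny in-memory sample. This is a sketch, not real data: the rows and CUSIPs are made up, only the `CUSIP` and `SYMBOL` column names are taken from the code in this notebook, and `cusip_to_ticker` and `unmatched` are hypothetical names.

```python
import csv
import io

# Made-up, pipe-delimited sample in the shape the notebook expects.
sample = io.StringIO(
    "SETTLEMENT DATE|CUSIP|SYMBOL\n"
    "20220103|00000AAA1|AAA\n"
    "20220104|00000AAA1|AAA\n"
    "20220103|00000BBB2|BBB\n"
)

# CUSIPs we want to resolve; the second one never appears in the sample.
wanted = {"00000AAA1", "00000CCC3"}

cusip_to_ticker = {}
for row in csv.DictReader(sample, delimiter="|"):
    if row["CUSIP"] in wanted:
        cusip_to_ticker[row["CUSIP"]] = row["SYMBOL"]

# Anything left over has no fails-to-deliver entry and is dropped from the dataset.
unmatched = wanted - set(cusip_to_ticker)
```

Keeping the unmatched CUSIPs in a separate set makes it easy to see how much of the dataset is lost at this step.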

In [None]:
tickers = set()

prefixed = [filename for filename in os.listdir(FILE_PATH) if filename.startswith("cnsfail")]
print(prefixed)

for file in prefixed:
    with open(FILE_PATH + file,'r') as f:
        for entry in csv.DictReader(f, delimiter="|"):
            if entry['CUSIP'] in names_of_issuers: 
                tickers.add(entry['SYMBOL'])
                names_of_issuers.remove(entry['CUSIP'])
    
pprint(tickers)

### <a id='toc1_3_4_'></a>[Adding the remainder of the ticker symbols to the dataset](#toc0_)
We now add the ticker symbols from the ETF holdings to the set.

The first 9 lines of the csv files {DIVB_holdings, HDV_holdings} were already removed in the download step, because they broke the parsing for DictReader.
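
Why the leading lines matter can be seen on a small in-memory example. This is a sketch with made-up content: the preamble lines, the sample rows and the names `SKIP_LINES` and `sample_tickers` are assumptions, and only the `Ticker` column name comes from the notebook.

```python
import csv
import io

SKIP_LINES = 9  # number of leading lines dropped before parsing

# Made-up file content: nine preamble lines followed by a normal CSV table.
preamble = [f"metadata line {n}" for n in range(1, SKIP_LINES + 1)]
table = ["Ticker,Name,Weight (%)", "AAA,Example Corp,5.1", "BBB,Sample Inc,4.2"]
raw = "\n".join(preamble + table) + "\n"

# Without the slice, DictReader would take "metadata line 1" as the header row.
lines = io.StringIO(raw).readlines()[SKIP_LINES:]
rows = list(csv.DictReader(lines))
sample_tickers = {row["Ticker"] for row in rows}
```

`csv.DictReader` accepts any iterable of lines, so the slice can be parsed directly without writing the file back to disk.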

In [None]:
prefixed = [filename for filename in os.listdir(FILE_PATH) if "holdings" in filename]
print(prefixed)

for file in prefixed:
    with open(FILE_PATH + file,'r', encoding='utf-8-sig') as f:
        for entry in csv.DictReader(f, delimiter=","):
            tickers.add(entry["Ticker"])
        
pprint(tickers)

## <a id='toc1_4_'></a>[Optimizing the dataset and creating subset b](#toc0_)
At this point set (a) is complete. Before creating further subsets we trim it to speed up the later steps: we remove every ticker that does not pay a dividend, which we detect by checking whether `dividendRate` is present in the ticker info. The data is pulled with yfinance.

We call this subset of (a) subset (b).
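
The check can be tried without network access on plain dictionaries standing in for `ticker.info`. This is a sketch under assumptions: the sample dictionaries are made up, `pays_dividend` is a hypothetical helper, and it is slightly stricter than a pure key check because it also rejects a `dividendRate` that is present but `None`.

```python
def pays_dividend(info):
    """True when the info mapping carries a usable dividend rate."""
    return info.get("dividendRate") is not None

# Made-up stand-ins for the dictionaries returned by ticker.info.
sample_infos = [
    {"symbol": "AAA", "dividendRate": 1.2},
    {"symbol": "BBB"},                        # key missing entirely
    {"symbol": "CCC", "dividendRate": None},  # key present but empty
]

subset_b = [info["symbol"] for info in sample_infos if pays_dividend(info)]
```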

In [None]:
from requests import HTTPError

arr_A = list(tickers)

ticker_objs = list(yf.Tickers(arr_A).tickers.values())
arr_B = []
for ticker in ticker_objs:
    try:
        if 'dividendRate' in ticker.info.keys():
            arr_B.append(ticker.info["symbol"])
    except HTTPError:
        print("Ticker not found, removed from subset.")
        continue

pprint(arr_B)

## <a id='toc1_5_'></a>[Creation of the subset c](#toc0_)

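From subset (b) we remove all names with a high business risk, which we define as a debt-to-equity ratio greater than 1.5. The ratio is computed as total liabilities divided by equity (total assets minus total liabilities). The result is sub-subset (c).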
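
The screening rule can be written as a pure function and checked with round numbers. This is a minimal sketch: `liabilities_to_equity` and `passes_screen` are hypothetical helpers and the figures are made up.

```python
def liabilities_to_equity(total_liabilities, total_assets):
    """Return |liabilities / equity|, or None when equity is exactly zero."""
    equity = total_assets - total_liabilities
    if equity == 0:
        return None  # ratio is undefined, treat the name as unscreenable
    return abs(total_liabilities / equity)

def passes_screen(total_liabilities, total_assets, limit=1.5):
    """True when the ratio is defined and does not exceed `limit`."""
    ratio = liabilities_to_equity(total_liabilities, total_assets)
    return ratio is not None and ratio <= limit

# 60 of liabilities against 100 of assets leaves 40 of equity: 60 / 40 = 1.5
borderline = liabilities_to_equity(60, 100)
```

Returning `None` for zero equity keeps the undefined case explicit instead of relying on an exception inside the loop.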

In [None]:
ticker_objs = list(yf.Tickers(arr_B).tickers.values())


arr_C = []

for ticker in ticker_objs:
    try:
        balance_sheet = list(ticker.balancesheet.to_dict().values())[0] # get most recent data
        liabilities = balance_sheet['Total Liabilities Net Minority Interest']
        assets = balance_sheet['Total Assets']
        debtToEquity = abs( liabilities / (assets - liabilities) )
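    except ZeroDivisionError:
        print(ticker.info["symbol"]) # equity is 0, so the debt to equity ratio is undefined
        continue                     # So we skip it.
    except (KeyError, IndexError):   # missing fields or an empty balance sheet
        print('Missing Balance Sheet Info')
        print(ticker.info["symbol"])
        continue  # without this, debtToEquity would be undefined or stale below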
    if debtToEquity <= 1.5:
        arr_C.append(ticker.info["symbol"])

print(len(arr_C))
pprint(arr_C)
    



## <a id='toc1_6_'></a>[Storing the ticker symbols into the cfs module](#toc0_)
Store subset (c) in a file for the cfs module to reference.

In [None]:
with open(r'../cfs_module/subset_c.txt', 'w') as f:
    f.write('\n'.join(arr_C))