In [2]:
# Please list the names of your group members in the list named 'members'

members = ["Dhairya Dhamani", "Marisa Guerra", "Kristen Cheung", "Holt Boink", "Hammad Hassan"]

### Please create a group of no more than 5 group members to solve Projects 2 and 3

# Project 2

<b>Your job in this project is to calculate a number of statistics for real-life firms using the SEC EDGAR system. To get started, you will want to look at the examples in the Topic 2.1 notebook on Canvas. There is nothing about the questions that I am asking that limits us to filings from 1998 through 2000. However, doing so minimizes the likelihood that you'll need to remove html tags from the 10-K filings that you process in question 3.</b>

- <b>Question 1a.</b> Use master.idx available on EDGAR to create a list of CIKs for firms that file a 10-K in the fourth quarters of 1998, 1999, and 2000. This is obviously a subset of the firms that file a 10-K in each of these calendar years.
- <b>Question 1b</b>. Use the SEC's current mapping from tickers to CIKs (provided below) to determine how many of the CIKs in question 1a currently have a ticker. (I am not asking you to verify that the CIK belongs to the same firm today that it did in 1998-2000; I don't know the SEC's policy on re-using CIKs.)
- <b>Question 2a</b>. For the CIKs from question 1a, how many 8-Ks were filed during each month of 1999? How many 8-Ks were filed on each day of the week during 1999?
- <b>Question 2b.</b> Calculate summary statistics based on the number of 8-Ks filed during 1999 by the CIKs from question 1a.
- <b>Question 3.</b> Pick one of the CIKs from question 1a at random and measure the sentiment of item 1 of its 10-Ks from 1998, 1999, and 2000.

<i>Hint #1: Although my solutions rely heavily on dictionaries, some questions are easier to answer using numpy and/or pandas. You are welcome to use any combination of approaches/libraries.</i>

<i>Hint #2: Start as soon as possible! Before you start coding, write down the steps you need to undertake to answer each question.</i>

<i>Hint #3: When downloading files from SEC EDGAR, please include statements like 'if not os.path.exists(destfile)', so that you do not download the same file over and over again.</i>
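
As a minimal sketch of the download-once idea in Hint #3, the helper below skips the download when the destination file already exists. The name `download_once` and its `fetch` parameter are hypothetical and are not part of the assignment; `fetch` defaults to `urllib.request.urlretrieve` and is only there so the caching logic can be tried offline with a stand-in.

```python
import os
import tempfile
import urllib.request


def download_once(url, destfile, fetch=urllib.request.urlretrieve):
    """Download url to destfile only if destfile does not already exist.

    `fetch` is a hypothetical hook: any callable taking (url, destfile).
    """
    if not os.path.exists(destfile):
        fetch(url, destfile)
    return destfile


# Offline demonstration: a stand-in for the real download that records each call.
calls = []


def fake_fetch(url, destfile):
    calls.append(url)
    with open(destfile, "w") as f:
        f.write("placeholder contents")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "master_1998_Q4.txt")
    # The URL is a placeholder; fake_fetch never touches the network.
    download_once("https://example.invalid/master.idx", target, fetch=fake_fetch)
    download_once("https://example.invalid/master.idx", target, fetch=fake_fetch)

print(len(calls))  # prints 1: the second call found the file and skipped the download
```
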

In [3]:
import os
import re
import random
import requests      # my preferred library for downloading files from SEC EDGAR; see Topic 2.1
import datetime      # useful when I ask you to convert dates into days of the week (e.g. Monday)
import statistics
import numpy as np   # for some questions, I will present alternative solutions using numpy 
import pandas as pd  # just in case you want to use pandas
import urllib.request # used for accessing websites

### <b>Question 1a (5 points).</b> 

Determine the CIK of every firm that filed a 10-K in the fourth quarters of 1998, 1999, and 2000. You should focus on filing dates, which are available in master.idx, as opposed to effective dates, which are not. (You should exclude the CIK of any firm that files multiple 10-Ks in a given quarter or that fails to file a 10-K in each of the three calendar years.) Create a list named `unique_ciks` that contains this list of CIKs. Print the length of this list. 

<i>Hint: You need to start by downloading the master.idx files for 1998/QTR4, 1999/QTR4, and 2000/QTR4.</i>
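
One possible way to organize this step, sketched here on made-up index rows rather than real EDGAR data, is to count 10-Ks per CIK within each quarter and then intersect the three sets of CIKs that filed exactly one. The helper name `single_10k_ciks` and the sample rows are assumptions for illustration; the only thing taken from master.idx is the pipe-delimited layout `CIK|Company Name|Form Type|Date Filed|Filename`.

```python
from collections import Counter


def single_10k_ciks(lines):
    """Return the set of CIKs that appear with exactly one 10-K in the given rows."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split("|")
        # Data rows have five pipe-delimited fields; anything else is skipped.
        if len(fields) == 5 and fields[2] == "10-K":
            counts[fields[0]] += 1
    return {cik for cik, n in counts.items() if n == 1}


# Made-up sample rows (not real filings), one list per quarter.
q4_1998 = [
    "1001|ALPHA CORP|10-K|1998-10-05|edgar/data/1001/a.txt",
    "1002|BETA INC|10-K|1998-11-12|edgar/data/1002/b.txt",
    "1002|BETA INC|10-K|1998-12-01|edgar/data/1002/c.txt",
    "1003|GAMMA LLC|8-K|1998-12-15|edgar/data/1003/d.txt",
]
q4_1999 = [
    "1001|ALPHA CORP|10-K|1999-10-07|edgar/data/1001/e.txt",
    "1002|BETA INC|10-K|1999-11-10|edgar/data/1002/f.txt",
]
q4_2000 = [
    "1001|ALPHA CORP|10-K|2000-10-03|edgar/data/1001/g.txt",
    "1004|DELTA CO|10-K|2000-12-20|edgar/data/1004/h.txt",
]

# Keep only CIKs with exactly one 10-K in every one of the three quarters.
sample_unique = sorted(
    single_10k_ciks(q4_1998) & single_10k_ciks(q4_1999) & single_10k_ciks(q4_2000)
)
print(sample_unique)  # prints ['1001']
```

Counting once per quarter and intersecting sets avoids calling `list.count` inside a loop, which can become slow on full index files.
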

In [4]:
# Download a file from a URL (unless it was downloaded already) and return its local path
def fromfile(url, filename):
    # Destination path: the given filename inside the current working directory
    destfile = os.path.join(os.getcwd(), filename)
    # Download only if the file does not already exist locally
    if not os.path.exists(destfile):
        urllib.request.urlretrieve(url, destfile)
    # Return destfile to use later
    return destfile

In [5]:
# Function to read a text file and return its lines as a list
def read_lines(file):
    # The with-statement closes the file even if reading fails
    with open(file, 'r') as f:
        lines = f.readlines()
    return lines

In [6]:
# Creating the agent to open websites under the username and email
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Dhairya Dhamani dhamani@bc.edu)')]
urllib.request.install_opener(opener)

# Accessing Q4 1998 data
destfile1 = fromfile('https://www.sec.gov/Archives/edgar/full-index/1998/QTR4/master.idx','master_1998_Q4.txt')
# Accessing Q4 1999 data
destfile2 = fromfile('https://www.sec.gov/Archives/edgar/full-index/1999/QTR4/master.idx','master_1999_Q4.txt')
# Accessing Q4 2000 data
destfile3 = fromfile('https://www.sec.gov/Archives/edgar/full-index/2000/QTR4/master.idx','master_2000_Q4.txt')

# Open Q4 1998 data as list
lines1 = read_lines(destfile1)
# Open Q4 1999 data as list
lines2 = read_lines(destfile2)
# Open Q4 2000 data as list
lines3 = read_lines(destfile3)

# Strip element to remove \n elements
lines1 = [element.strip() for element in lines1]
lines2 = [element.strip() for element in lines2]
lines3 = [element.strip() for element in lines3]
# Remove the first 11 rows (the file header), which contain no filing data
lines1 = lines1[11:]
lines2 = lines2[11:]
lines3 = lines3[11:]
# Splitting on | to retrieve CIK and filing type
list_1 = [(element.split('|')[0], element.split('|')[2]) for element in lines1]
list_2 = [(element.split('|')[0], element.split('|')[2]) for element in lines2]
list_3 = [(element.split('|')[0], element.split('|')[2]) for element in lines3]
# Filtering to only include '10-K' filing types and storing corresponding CIKs
list_1 = [element[0] for element in list_1 if element[1] == '10-K']
list_2 = [element[0] for element in list_2 if element[1] == '10-K']
list_3 = [element[0] for element in list_3 if element[1] == '10-K']
# Removing duplicates by making sure count = 1 for each cik
list_1 = [cik for cik in list_1 if list_1.count(cik) == 1]
list_2 = [cik for cik in list_2 if list_2.count(cik) == 1]
list_3 = [cik for cik in list_3 if list_3.count(cik) == 1]

unique_ciks = []
# Iterate through list_1 (1998 data)
for i in list_1:
    # Check if element in list_2 and list_3 (meaning that CIK in all 3 years)
    if i in list_2 and i in list_3:
        # Appending CIK to unique_ciks list
        unique_ciks.append(i)

In [7]:
print(f"The number of unique CIK is {len(unique_ciks)}.")

The number of unique CIK is 261.


### <b>Question 1b (2 points).</b> 

Use the mapping between tickers and CIK at https://www.sec.gov/include/ticker.txt to create a dictionary `cik_ticker` that maps each of the CIK in `unique_ciks` to a ticker (replacing lowercase letters with uppercase letters). Please use a regular expression to convert <i>ticker.txt</i> into a list of tuples containing the ticker and CIK. If the CIK is missing from <i>ticker.txt</i>, then the corresponding dictionary value should be set equal to "". If a CIK in <i>ticker.txt</i> is missing from `unique_ciks`, then it should also be missing from the dictionary `cik_ticker`. Print the 'total' number of CIKs and the number for which a ticker is 'missing'.
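
The mapping step can be sketched on a small made-up string, assuming each line of <i>ticker.txt</i> holds a ticker, whitespace, and a numeric CIK. The sample text, the CIK values, and the variable names below are illustrative assumptions, not data from the SEC file.

```python
import re

# Made-up stand-in for the contents of ticker.txt (ticker, tab, CIK per line).
sample_text = "abc\t1001\nxy-z\t1002\nqrs\t1005\n"

# One (ticker, cik) tuple per line; MULTILINE lets ^ and $ match at each line.
pairs = re.findall(r"^(\S+)\s+(\d+)$", sample_text, flags=re.MULTILINE)

# Keep the first ticker seen for each CIK, converted to uppercase.
first_ticker = {}
for ticker, cik in pairs:
    first_ticker.setdefault(cik, ticker.upper())

# Hypothetical CIKs of interest; '1003' has no ticker in the sample text.
sample_ciks = ["1001", "1002", "1003"]
sample_cik_ticker = {cik: first_ticker.get(cik, "") for cik in sample_ciks}
sample_missing = sum(1 for t in sample_cik_ticker.values() if t == "")

print(sample_cik_ticker)
print(sample_missing)  # prints 1
```

Building a CIK-to-ticker dictionary first means each lookup is a single `get` call, and CIKs that appear only in the ticker file (here `1005`) never enter the result.
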

In [8]:
# Accessing ticker data from SEC website
destfile4 = fromfile('https://www.sec.gov/include/ticker.txt','ticker.txt')

# Open ticker data as string
f4 = open(destfile4,'r')
lines4 = f4.read()
f4.close()

# Regex to get list of tuples (ff) for ticker and matching ciks from data
ff = re.findall(r'([a-z-?]+)\s+(\d*)', lines4)

cik_ticker = {}
# Extract ticker from ff
ticker = [element[0] for element in ff]
# Extract cik from ff
cik = [element[1] for element in ff]
# Iterate through unique_ciks
for CIK in unique_ciks:
    # Check if CIK from unique_ciks is not in ticker data: if not, then ticker value = "" for CIK
    if CIK not in cik:
        cik_ticker[CIK] = ""
    # else, CIK is in ticker data
    else:
        # Enumerate to get index, value of cik from ticker data
        for index, c in enumerate(cik):
            # If CIK from unique_ciks = cik from ticker data, then assign CIK with corresponding ticker (uppercased)
            if CIK == c:
                cik_ticker[CIK] = ticker[index].upper()
                # Break from for loop since found a match
                break
        
# Calculate total length of dictionary
total = len(cik_ticker)
# Calculate the total missing values from dictionary
missing = sum([1 if c == "" else 0 for c in cik_ticker.values()])

In [9]:
print(f'Of the {total} CIK in the cik_ticker dictionary, {missing} are missing a ticker.')

Of the 261 CIK in the cik_ticker dictionary, 210 are missing a ticker.


<b>Discussion question for class: I was unable to locate a ticker for many of the CIK in `unique_ciks`. Why do you think that might be the case?</b>

### <b>Question 2a (4 points).</b> 

Use the filing dates in the master.idx files for 1999/QTR1, 1999/QTR2, 1999/QTR3, and 1999/QTR4 and the set of CIK in `unique_ciks` to determine the total number of 8-Ks filed during:

- Each month of 1999 (JAN, FEB, ..., DEC)
- Each day of the week during 1999 (MON, TUE, ..., SUN)

(You can learn more about form 8-K here: https://www.sec.gov/answers/form8k.htm)
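
A minimal sketch of the tallying step, using made-up (CIK, filing date) pairs: `date.weekday()` returns 0 for Monday through 6 for Sunday, so it can index a weekday counter directly. The sample filings and the `wanted` set are assumptions for illustration.

```python
from collections import Counter
from datetime import date

# Made-up (CIK, filing date) pairs for 8-K filings; not real EDGAR data.
sample_filings = [
    ("1001", "1999-01-04"),  # a Monday
    ("1001", "1999-01-08"),  # a Friday
    ("1002", "1999-02-03"),  # a Wednesday
    ("9999", "1999-02-05"),  # CIK outside the set of interest
]
wanted = {"1001", "1002"}  # a set makes the membership test fast

month_counts = Counter()
weekday_counts = Counter()
for cik, filed in sample_filings:
    if cik not in wanted:
        continue
    d = date.fromisoformat(filed)      # parses 'YYYY-MM-DD'
    month_counts[d.month] += 1         # 1 = January ... 12 = December
    weekday_counts[d.weekday()] += 1   # 0 = Monday ... 6 = Sunday

print(month_counts[1], month_counts[2])  # prints 2 1
print([weekday_counts[i] for i in range(7)])  # prints [1, 0, 1, 0, 1, 0, 0]
```
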

In [10]:
from datetime import date

In [11]:
destfile5 = fromfile('https://www.sec.gov/Archives/edgar/full-index/1999/QTR1/master.idx', 'master_1999_Q1.txt')
destfile6 = fromfile('https://www.sec.gov/Archives/edgar/full-index/1999/QTR2/master.idx', 'master_1999_Q2.txt')
destfile7 = fromfile('https://www.sec.gov/Archives/edgar/full-index/1999/QTR3/master.idx', 'master_1999_Q3.txt')

# Open Q1 1999 data as list
lines5 = read_lines(destfile5)
# Open Q2 1999 data as list
lines6 = read_lines(destfile6)
# Open Q3 1999 data as list
lines7 = read_lines(destfile7)
# Since Q4 1999 is already downloaded, we will just read_lines
lines8 = read_lines(destfile2)

# Strip element to remove \n elements
lines5 = [element.strip() for element in lines5]
lines6 = [element.strip() for element in lines6]
lines7 = [element.strip() for element in lines7]
lines8 = [element.strip() for element in lines8]
# Remove the first 11 rows (the file header), which contain no filing data
lines5 = lines5[11:]
lines6 = lines6[11:]
lines7 = lines7[11:]
lines8 = lines8[11:]
# Splitting on | to retrieve CIK, date, and filing type
lines2a = lines5 + lines6 + lines7 + lines8
list_2a = [(element.split('|')[0], element.split('|')[2], element.split('|')[3]) for element in lines2a]
list_2a = [(element[0], element[2]) for element in list_2a if element[1] == '8-K']

# Create empty list to track filings for month and day of 1999
month = [0]*12
day = [0]*7

In [12]:
# Loop through list_2a containing CIK and date data
for element in list_2a:
    # Store CIK and date in distinct variables
    cik_string = element[0]
    date_string = element[1]
    # Separate date_string to year, month, date and convert to int
    y,m,d = date_string.split('-')
    y = int(y)
    m = int(m)
    d = int(d)
    # Use date function from datetime library to create date object to find weekday
    date_obj = date(y,m,d)
    d = date_obj.weekday()
    # Check if cik is in unique_ciks list from previous problem
    if cik_string in unique_ciks:
        # Increase month and day counter using previous m and d
        month[m-1] = month[m-1] + 1
        day[d] = day[d] + 1

In [13]:
# List containing headers for print statements for month and day
months_in_year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
days_in_week = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

print('Month in 1999')
print('-------------')
# Loop through list for month in year and print corresponding value for listing
for index1, mon in enumerate(months_in_year):
    print(f'{mon.upper()}: {month[index1]}')
print()    
print('Day in 1999')
print('-----------')
# Loop through list for day of week and print corresponding value for listing
for index2, days in enumerate(days_in_week):
    print(f'{days.upper()}: {day[index2]}')

Month in 1999
-------------
JAN: 29
FEB: 22
MAR: 24
APR: 47
MAY: 47
JUN: 30
JUL: 55
AUG: 22
SEP: 41
OCT: 35
NOV: 33
DEC: 25

Day in 1999
-----------
MON: 59
TUE: 72
WED: 85
THU: 85
FRI: 109
SAT: 0
SUN: 0


### <b>Question 2b (3 points).</b> 

Calculate the total number of 8-Ks filed by each CIK in `unique_ciks` during 1999. (You should have one total per CIK. For CIKs that filed no 8-Ks, the total should be equal to zero.) Then, calculate and print the minimum, maximum, mean, median, and standard deviation of the number of 8-Ks filed by CIKs during 1999.
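
The per-CIK totals can be sketched with a `Counter`, which returns zero for CIKs that never appear. The sample data below is made up, and note that `statistics.stdev` is the sample standard deviation (dividing by n - 1); `statistics.pstdev` would give the population version, and which one is intended is not stated here.

```python
import statistics
from collections import Counter

# Made-up inputs: CIKs of interest and the CIK of each 8-K filed in the year.
sample_ciks = ["1001", "1002", "1003"]
sample_8k_ciks = ["1001", "1002", "1001", "9999"]

filed = Counter(sample_8k_ciks)
# One total per CIK of interest; Counter yields 0 for CIKs with no 8-Ks.
totals = {cik: filed[cik] for cik in sample_ciks}
values = list(totals.values())

summary = {
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),  # sample standard deviation
}
for name, value in summary.items():
    print(f"{name:7s}{value:5.2f}")
```
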

In [14]:
# Create a dictionary to track CIKs and number of 8-Ks filed during 1999
tot_8K = {}

# Loop through all cik in unique_ciks list
for cik in unique_ciks:
    # Add cik into dictionary with initial count = 0
    tot_8K[cik] = 0
    # Loop through list_2a, which contains the CIK and date of 8-K filing
    for element in list_2a:
        # If CIK matches with 8-K filing dictionary, then add count
        if element[0] == cik:
            tot_8K[cik] = tot_8K[cik] + 1 
            
tot_8K_values = list(tot_8K.values())

In [15]:
# Summary stats of minimum, maximum, mean, median, and std dev for total 8-K values
print('Summary statistics of 8-Ks filed by CIKs during 1999')
print(f'Minimum: {min(tot_8K_values):5.2f}')
print(f'Maximum: {max(tot_8K_values):5.2f}')
print(f'Mean:    {statistics.mean(tot_8K_values):5.2f}')
print(f'Median:  {statistics.median(tot_8K_values):5.2f}')
print(f'St Dev:  {statistics.stdev(tot_8K_values):5.2f}')

Summary statistics of 8-Ks filed by CIKs during 1999
Minimum:  0.00
Maximum: 13.00
Mean:     1.57
Median:   1.00
St Dev:   2.31


### <b>Question 3 (6 points).</b> 

Pick one CIK at random from `unique_ciks` and download the firm's 10-Ks for 1998, 1999, and 2000. (You will find the file names in the master.idx for 1998/QTR4, 1999/QTR4, and 2000/QTR4, which you downloaded for question 1. The general structure of 10-Ks is described here: https://www.sec.gov/answers/reada10k.htm.)

Use a <b>regular expression</b> to extract Section 1 (describing the business) from each 10-K as a long string. Because 10-Ks are not perfectly standardized, even within a firm across years, you may need to tweak the regular expression to make it work with all three filings. Worst case, you can use different regular expressions on the different years. <i>Hint: Before attempting to extract multiple lines of text, it is helpful to replace the '\n' line returns with '|' or some other delimiter.</i>
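
The delimiter trick in the hint can be illustrated on a short made-up filing. By default `.` does not match a newline, so flattening the text first lets a single pattern span what used to be several lines. The sample text and the pattern are assumptions; real filings may also mention "ITEM 1" in a table of contents, so the pattern may need adjusting.

```python
import re

# Made-up miniature filing; real 10-Ks are far longer and less regular.
sample_filing = (
    "PART I\n"
    "ITEM 1. BUSINESS\n"
    "We make widgets.\n"
    "Demand may decline.\n"
    "ITEM 2. PROPERTIES\n"
    "We lease an office.\n"
)

# Replace line returns with '|' so that '.' can match across former line breaks.
flat = sample_filing.replace("\n", "|")

# Non-greedy (.*?) stops at the first 'ITEM 2.' that follows 'ITEM 1.'.
match = re.search(r"ITEM 1\.(.*?)ITEM 2\.", flat)
section = match.group(1).replace("|", " ").strip()

print(section)  # prints BUSINESS We make widgets. Demand may decline.
```

An alternative with the same effect is to leave the newlines in place and pass `flags=re.DOTALL` to `re.search`.
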

Then, use the four word lists from Topic 1.9 to create a table with three rows and four columns that reports the fraction of words in Section 1 that are classified as negative, positive, uncertain, and litigious (these are the columns) in the 1998, 1999, and 2000 filings (these are the rows). Please format the numbers so that one percent appears as 1.00%. The table should also include row and column labels. 

<i>Hint #1: If you pick a firm for whom the 10-Ks are reported in html format, you can either adopt the regex code from class to remove html, or pick a different firm.</i>

<i>Hint #2: As we discussed in class, you need to convert the 10-Ks into lists of uppercase words before comparing them to the four lists of words used in sentiment analysis.</i>
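
A minimal sketch of the word-fraction calculation, using tiny made-up word sets in place of the four Topic 1.9 lists: the text is split into uppercase words, each word is tested against each set, and the share is formatted with `.2%` so that one percent prints as 1.00%. All names and sample data here are illustrative assumptions.

```python
import re

# Tiny made-up stand-ins for the four word lists (not the actual Topic 1.9 lists).
word_sets = {
    "Neg.": {"DECLINE"},
    "Pos.": {"BENEFIT", "STRONG"},
    "Unc.": {"MAY"},
    "Lit.": {"LITIGATION"},
}

# Made-up Section 1 text, converted into a list of uppercase words.
sample_text = "Demand may decline. We may benefit from strong growth."
words = [w.upper() for w in re.findall(r"[A-Za-z]+", sample_text)]


def fraction_in(words, word_set):
    """Share of words that belong to word_set (0.0 for an empty word list)."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in word_set) / len(words)


fractions = {label: fraction_in(words, s) for label, s in word_sets.items()}
row = " ".join(f"{fractions[label]:>7.2%}" for label in word_sets)

print("Year " + " ".join(f"{label:>7s}" for label in word_sets))
print("1998 " + row)
```

Storing each word list as a set keeps the membership test fast even when Section 1 runs to tens of thousands of words.
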



In [16]:
# Imports the four words lists

def Input(filename):
    f = open(filename, 'r')
    lines = [l.strip() for l in f.readlines()]
    f.close()
    return lines

list_neg = Input('1.9_LM_negative.txt')
list_pos = Input('1.9_LM_positive.txt')
list_unc = Input('1.9_LM_uncertainty.txt')
list_lit = Input('1.9_LM_litigious.txt')

In [17]:
# Picks your CIK at random. 
# my_cik = random.sample(unique_ciks,1)[0]
# print(my_cik)

# Please replace '000000' with your CIK, so that it does
# not change when you run this cell again
my_cik = '894490'

In [18]:
# Accessing 10-K file for my_cik
# 1998
destfile9 = fromfile('https://www.sec.gov/Archives/edgar/data/894490/0000950170-98-002420.txt', 'cik_1998.txt')
# 1999
destfile10 = fromfile('https://www.sec.gov/Archives/edgar/data/894490/0000950170-99-001955.txt', 'cik_1999.txt')   
# 2000
destfile11 = fromfile('https://www.sec.gov/Archives/edgar/data/894490/0000950170-00-002076.txt', 'cik_2000.txt')

# Open data as a string
f9 = open(destfile9,'r')
lines9 = f9.read()
f9.close()

f10 = open(destfile10,'r')
lines10 = f10.read()
f10.close()

f11 = open(destfile11,'r')
lines11 = f11.read()
f11.close()

In [22]:
# Replacing all \n with | to work with regex
lines9 = lines9.replace('\n','|')
# Finding all characters between Item 1 and Item 2 (this covers the whole business section of 10-K)
sect1_1998 = re.findall('ITEM 1(.*)ITEM 2.', lines9)
# Capturing only words from the previous list
sect1_1998 = re.findall('[A-Za-z]+', sect1_1998[0])
# Making all words upper-case and removing any empty characters ('')
sect1_1998 = [element.upper() for element in sect1_1998 if element != '']
# Removing 'S' to account for all possessive nouns in the list that had apostrophe removed (')
while 'S' in sect1_1998:
    sect1_1998.remove('S')
# Removing 'C', which is left over from the <C> column tags in the filing's table markup (removing 'S' and 'C' prevents inflation of count)
while 'C' in sect1_1998:
    sect1_1998.remove('C')

# Repeat for 1999 and 2000 data
lines10 = lines10.replace('\n','|')
sect1_1999 = re.findall('ITEM 1(.*)ITEM 2.', lines10)
sect1_1999 = re.findall('[A-Za-z]+', sect1_1999[0])
sect1_1999 = [element.upper() for element in sect1_1999 if element != '']
while 'S' in sect1_1999:
    sect1_1999.remove('S')
while 'C' in sect1_1999:
    sect1_1999.remove('C')
    
lines11 = lines11.replace('\n','|')
sect1_2000 = re.findall('Item 1.(.*)Item 2.', lines11)
sect1_2000 = re.findall('([A-Za-z]+)', sect1_2000[0])
sect1_2000 = [element.upper() for element in sect1_2000 if element != '']
while 'S' in sect1_2000:
    sect1_2000.remove('S')
while 'C' in sect1_2000:
    sect1_2000.remove('C')
    
# Create an array counter to store values in a table format
counter = np.zeros((3,4))
# Create array length to store length of Section 1 lists
length = np.zeros((3,))
# Create a list of list for words (neg, pos, unc, lit) and section 1 (1998, 1999, 2000)
# This allows us to make a more effective for loop instead of writing a for loop for each year and each list
list_words = [list_neg, list_pos, list_unc, list_lit]
list_sect1 = [sect1_1998, sect1_1999, sect1_2000]

# Use the Section 1 lists as rows
for row, sect1 in enumerate(list_sect1):
    # Loop through each kind of word list (neg, pos, unc, lit) to see if words match
    for col, words in enumerate(list_words):
        # Loop through each word in Section 1 list to check
        for w in sect1:
            # If word from Section 1 list is in Word list (neg, pos, unc, lit), then add the corresponding counter in table
            if w in words:
                counter[row, col] = counter[row,col] + 1

# For loop for calculating the length of each Section 1 list
for c, l in enumerate(list_sect1):
    length[c] = len(l)

In [23]:
print_list = []
# Print list created to help calculate the fraction of words in a Section 1 that were hit by a particular word list
for r, l in enumerate(counter):
    print_list.append(np.round(l / length[r] * 100,2))

In [24]:
# List to create row header
year_list = ['1998', '1999', '2000']
# Printing column header
print(f'Year  Neg.  Pos.  Unc.  Lit.')
# For loop to print each row/column from table counter
for i in range(3):
    print(f'{year_list[i]}: {print_list[i][0]:0.2f}% {print_list[i][1]:0.2f}% {print_list[i][2]:0.2f}% {print_list[i][3]:0.2f}%')

Year  Neg.  Pos.  Unc.  Lit.
1998: 1.42% 0.55% 1.15% 0.94%
1999: 1.53% 0.59% 1.01% 0.90%
2000: 1.55% 0.55% 1.06% 1.05%
