## Preliminaries
Initial setup

In [39]:
import requests  # library for making HTTP requests
import csv # library to read/write/parse CSV files
from bs4 import BeautifulSoup # web-scraping library

acceptMime = 'text/html'
cikList = []
cikPath = 'cik.txt'


Open the file containing the list of CIK codes, read its lines, and build a list of the codes with surrounding whitespace stripped.

In [40]:
with open(cikPath, newline='') as cikFileObject:  # the with block closes the file when done
    cikRows = cikFileObject.readlines()

for cik in cikRows:
    cikList.append(cik.strip())
print(cikList)

['0001085917', '0000105598', '0000034088']


## Searching for 10-K forms
Create a list to hold one dictionary for each matching result

In [41]:
resultsList = []

Build the search URL, using a query string worked out by experimenting with the EDGAR search online

In [42]:
cik = cikList[2] # in the final script, this will loop through all of the CIK codes. (elements 0 and 1 don't produce any results)
# this query string selects 10-K forms, but also retrieves forms whose codes start with 10-K
baseUri = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK='+cik+'&type=10-K&dateb=&owner=exclude&start=0&count=40&output=atom'
print(baseUri)


https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000034088&type=10-K&dateb=&owner=exclude&start=0&count=40&output=atom
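The same query string can also be assembled from a dictionary of parameters rather than by string concatenation. The sketch below uses urlencode from the standard library; the helper name build_search_uri and its default arguments are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urlencode

EDGAR_BROWSE = 'https://www.sec.gov/cgi-bin/browse-edgar'

def build_search_uri(cik, form_type='10-K', count=40):
    """Hypothetical helper: build the Atom search URL for one CIK."""
    params = {
        'action': 'getcompany',
        'CIK': cik,
        'type': form_type,
        'dateb': '',          # left empty, as in the hand-built URL
        'owner': 'exclude',
        'start': 0,
        'count': count,
        'output': 'atom',
    }
    # urlencode joins the pairs with & and escapes any unsafe characters
    return EDGAR_BROWSE + '?' + urlencode(params)

print(build_search_uri('0000034088'))
```

Keeping the parameters in a dictionary makes it easier to change the form type or page size later, once the script loops over every CIK code.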


Retrieve the XML document and turn it into a Beautiful Soup object, a parsed tree that can be searched by tag name or CSS selector

In [43]:
r = requests.get(baseUri, headers={'Accept' : 'application/xml'})
soup = BeautifulSoup(r.text,features="html5lib")
print(soup)

<!--?xml version="1.0" encoding="ISO-8859-1" ?--><html><head></head><body><feed xmlns="http://www.w3.org/2005/Atom">
    <author>
      <email>webmaster@sec.gov</email>
      <name>Webmaster</name>
    </author>
    <company-info>
      <addresses>
        <address type="mailing">
          <city>IRVING</city>
          <state>TX</state>
          <street1>5959 LAS COLINAS BLVD</street1>
          <zip>75039-2298</zip>
        </address>
        <address type="business">
          <city>IRVING</city>
          <phone>9729406000</phone>
          <state>TX</state>
          <street1>5959 LAS COLINAS BLVD</street1>
          <zip>75039-2298</zip>
        </address>
      </addresses>
      <assigned-sic>2911</assigned-sic>
      <assigned-sic-desc>PETROLEUM REFINING</assigned-sic-desc>
      <assigned-sic-href>http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&amp;SIC=2911&amp;owner=exclude&amp;count=40</assigned-sic-href>
      <assitant-director>4</assitant-director>
      <cik>

The selector category[term="10-K"] limits results to category elements whose term attribute is exactly equal to "10-K".

The select method returns a list of soup objects, each of which can be searched in turn.

In [44]:
for cat in soup.select('category[term="10-K"]'):
    # can't use cat.filing-href because hyphen in tag is interpreted by Python as a minus
    # also, couldn't get .strings to work, so used first child element (the string content of the tag)
    date = cat.find('filing-date').contents[0]
    year = date[:4] # the year is the first four characters of the date string
    print(year)
    # create a dictionary of an individual result
    searchResults = {'cik':cik,'year':year,'uri':cat.find('filing-href').contents[0]}
    if year == "2016" or year == "2014":
        # append the dictionary to the list of results
        resultsList.append(searchResults)

2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1996
1995
1994
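
The difference between the search's prefix matching and the exact match on the term attribute can be sketched offline. The example below uses xml.etree.ElementTree from the standard library instead of Beautiful Soup, and runs on an invented two-entry feed; the element layout, dates, and example.com URLs are assumptions for illustration, not data retrieved from EDGAR.

```python
import xml.etree.ElementTree as ET

NS = {'atom': 'http://www.w3.org/2005/Atom'}

# Invented sample feed (an assumption, not real EDGAR output)
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <category term="10-K"/>
    <content type="text/xml">
      <filing-date>2016-01-15</filing-date>
      <filing-href>http://example.com/filing-a-index.htm</filing-href>
    </content>
  </entry>
  <entry>
    <category term="10-K405"/>
    <content type="text/xml">
      <filing-date>2001-01-15</filing-date>
      <filing-href>http://example.com/filing-b-index.htm</filing-href>
    </content>
  </entry>
</feed>"""

def exact_form_entries(feed_xml, form_type):
    """Hypothetical helper: keep only entries whose term equals form_type."""
    root = ET.fromstring(feed_xml)
    hits = []
    for entry in root.findall('atom:entry', NS):
        category = entry.find('atom:category', NS)
        if category is None or category.get('term') != form_type:
            continue  # skips prefix matches such as 10-K405
        date = entry.findtext('atom:content/atom:filing-date', namespaces=NS)
        href = entry.findtext('atom:content/atom:filing-href', namespaces=NS)
        hits.append({'year': date[:4], 'uri': href})
    return hits

print(exact_form_entries(SAMPLE_FEED, '10-K'))
```

Only the first entry survives the filter, which is the same effect the exact-match attribute selector has in the notebook cell.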


The loop is done, so now show the results

In [45]:
print(resultsList)

[{'cik': '0000034088', 'year': '2016', 'uri': 'http://www.sec.gov/Archives/edgar/data/34088/000003408816000065/0000034088-16-000065-index.htm'}, {'cik': '0000034088', 'year': '2014', 'uri': 'http://www.sec.gov/Archives/edgar/data/34088/000003408814000012/0000034088-14-000012-index.htm'}]


## Searching for the components of an individual 10-K filing

Start by showing the URL to be retrieved

In [46]:
form10kList = [] # create an empty list to put the results in
hitNumber = 0  # in the final script, loop through the resultsList.  Here, just do the first result.
# for hitNumber in range(0,len(resultsList)):
print(resultsList[hitNumber]['uri'])

http://www.sec.gov/Archives/edgar/data/34088/000003408816000065/0000034088-16-000065-index.htm


Retrieve the HTML and turn it into a cleaned-up soup object

In [47]:
r = requests.get(resultsList[hitNumber]['uri'], headers={'Accept' : 'text/html'})
soup = BeautifulSoup(r.text,features="html5lib")
print(soup)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>EDGAR Filing Documents for 0000034088-16-000065</title>
<link href="/include/interactive.css" rel="stylesheet" type="text/css"/>
</head>
<body style="margin: 0">
<!-- SEC Web Analytics - For information please visit: https://www.sec.gov/privacy.htm#collectedinfo -->
<noscript><iframe height="0" src="//www.googletagmanager.com/ns.html?id=GTM-TD3BKV" style="display:none;visibility:hidden" width="0"></iframe></noscript>
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-TD3BKV');</script

Select the tr elements, which yields a list of soup objects, one per table row

In [48]:
trArray = soup.select('tr')
print(trArray)

[<tr>
            <th scope="col" style="width: 5%;"><acronym title="Sequence Number">Seq</acronym></th>
            <th scope="col" style="width: 40%;">Description</th>
            <th scope="col" style="width: 20%;">Document</th>
            <th scope="col" style="width: 10%;">Type</th>
            <th scope="col">Size</th>
         </tr>, <tr>
            <td scope="row">1</td>
            <td scope="row">FORM 10-K</td>
            <td scope="row"><a href="/Archives/edgar/data/34088/000003408816000065/xom10k2015.htm">xom10k2015.htm</a></td>
            <td scope="row">10-K</td>
            <td scope="row">7694659</td>
         </tr>, <tr class="blueRow">
            <td scope="row">2</td>
            <td scope="row">RESTATED CERTIFICATE OF INCORPORATION</td>
            <td scope="row"><a href="/Archives/edgar/data/34088/000003408816000065/xomexhibit3i.htm">xomexhibit3i.htm</a></td>
            <td scope="row">EX-3.(I)</td>
            <td scope="row">260323</td>
         </tr>, <tr

Loop through each of the tr elements and check whether it has a td element that contains "10-K".  If so, add the value of the row's href attribute to the results list.  Note: the href values are relative, so 'http://www.sec.gov' must be prepended to make each one an absolute URL.

In [49]:
for row in trArray:
    is10k = False
    for cell in row.select('td'):
        try:
            if cell.contents[0] == "10-K":
                is10k = True
        except IndexError:  # handle error cases where the cell doesn't have contents
            pass
    if is10k:
        form10kList.append('http://www.sec.gov' + row.a.get('href'))


Print the resulting list

In [50]:
print(form10kList)

['http://www.sec.gov/Archives/edgar/data/34088/000003408816000065/xom10k2015.htm']
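
As an alternative to prepending the host by hand, the relative href can be resolved against the URL of the index page it came from. This is a minimal sketch using urljoin from the standard library; the variable names are illustrative, and the two URLs are the ones printed earlier in this notebook.

```python
from urllib.parse import urljoin

# URL of the filing index page that was retrieved
index_uri = 'http://www.sec.gov/Archives/edgar/data/34088/000003408816000065/0000034088-16-000065-index.htm'
# relative href as found in the table row's a element
relative_href = '/Archives/edgar/data/34088/000003408816000065/xom10k2015.htm'

# urljoin keeps the scheme and host of the base and swaps in the new path
absolute_uri = urljoin(index_uri, relative_href)
print(absolute_uri)
```

Resolving against the page URL avoids hard-coding the host name and also handles hrefs that are already absolute.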


## Retrieve the actual 10-K page and pull out the signatory names
This will eventually be a loop, but for now, just do the first result

In [51]:
form10kNumber = 0
# for form10kNumber in range(0,len(form10kList)):
print(form10kList[form10kNumber])

http://www.sec.gov/Archives/edgar/data/34088/000003408816000065/xom10k2015.htm


Retrieve the HTML for the web page and turn it into a Beautiful Soup object

In [52]:
r = requests.get(form10kList[form10kNumber], headers={'Accept' : 'text/html'})
soup = BeautifulSoup(r.text,features="html5lib")

Select the tr elements, which yields a list of soup objects, one per table row

In [53]:
trArray = soup.select('tr')
print(trArray)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Search through the tr element soup objects and select out the font elements.  These are the elements that contain the final names and titles that we want.

Only some of the font elements have signatures.  The signatures are indicated by "/s/", so we can ignore font elements that don't have this.  Also, there are other signature locations in the document, but we should pay attention only to the ones in the final table (which have at least 5 font elements per tr).

In [54]:
for row in soup.select('tr'):
    hasSlashS = False
    for cell in row.select('font'):
        try:
            if "/s/" in cell.contents[0]:
                hasSlashS = True
        except IndexError:  # handle error cases where the cell doesn't have contents
            pass
    if hasSlashS:
        tableItems = row.select('font')
        if len(tableItems)>=5:
            noLeft = tableItems[2].contents[0].replace('(','')
            cleanName = noLeft.replace(')','')
            print(cleanName)
            print(tableItems[4].contents[0])

Rex W. Tillerson
Chairman of the Board
Michael J. Boskin
Director
Peter Brabeck-Letmathe
Director
Ursula M. Burns
Director
Larry R. Faulkner
Director
Jay S. Fishman
Director
 
 
Kenneth C. Frazier
Director
Douglas R. Oberhelman
Director
Samuel J. Palmisano
Director
Steven S Reinemund
Director
William C. Weldon
Director
Darren W. Woods
Director
Andrew P. Swiger
Senior Vice President
David S. Rosenthal
Vice President and Controller
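
The name cleanup, and the blank rows visible in the output above, could be handled in one small step before the results are stored. The sketch below is an assumption about how that might look: the helper names are hypothetical, and the sample input only imitates the presumed shape of the extracted cells (a name wrapped in parentheses, or a cell holding just a non-breaking space).

```python
def clean_name(raw):
    """Remove the parentheses around a typed name and trim whitespace."""
    return raw.replace('(', '').replace(')', '').strip()

def pair_signatories(rows):
    """Turn (raw name, raw title) pairs into dictionaries, skipping blank rows."""
    signatories = []
    for raw_name, raw_title in rows:
        name = clean_name(raw_name)
        if not name:  # blank placeholder row, e.g. only a non-breaking space
            continue
        signatories.append({'name': name, 'title': raw_title.strip()})
    return signatories

# Invented input imitating the extracted cells (the raw format is an assumption)
sample_rows = [
    ('(Rex W. Tillerson)', 'Chairman of the Board'),
    ('\xa0', '\xa0'),
    ('(Michael J. Boskin)', 'Director'),
]
print(pair_signatories(sample_rows))
```

Returning dictionaries rather than printing keeps the names and titles together, which should make it simpler to write them out with the csv module later.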
