### Final Project Requirements/notes: https://docs.google.com/document/d/1mwYbYJHkB7kpx4tNflKh54jN9_oOscw3p4k5fsmn3bc/edit

### Link with all Data: https://www.sec.gov/dera/data/financial-statement-and-notes-data-set.html
- using NUM file only for now (data set of all numeric XBRL facts presented on the primary financial statements)

In [1]:
import pandas as pd
q414numbers = pd.read_table('2014q4_notes/num.tsv', encoding='latin1')



### From "Financial Statement and Notes Data Sets" Readme:
These fields comprise a unique compound key:

1) **adsh** - EDGAR accession number: a unique identifier assigned automatically to an accepted submission by the EDGAR Filer System. The first set of numbers (0001193125) is the CIK of the entity submitting the filing. The next 2 numbers (18) represent the year. The last series of numbers represent a sequential count of submitted filings from that CIK. The count is usually, but not always, reset to 0 at the start of each calendar year.
- **TODO**: separate these numbers to identify a company or a financial filing; there were 6,492 individual filings.

2) **tag** - tag used by the filer 
- **TODO**: may have to separate out first word from tag to identify broader groups such as revenue

3) **version** - if a standard tag, the taxonomy of origin, otherwise equal to adsh.

4) **ddate** - period end date

5) **qtrs** - duration in number of quarters

6) **uom** - unit of measure

7) **dimh** - 16-byte dimensional qualifier

8) **iprx** - a sequential integer used to distinguish otherwise identical facts

9) **coreg** - If specified, indicates a specific co-registrant, the parent company, or other entity (e.g., guarantor).  NULL indicates the consolidated entity.  Note that this value is a function of the dimension segments.

10) **durp** - The difference between the reported fact duration and the quarter duration (qtrs), expressed as a fraction of 1.  For example, a fact with duration of 120 days rounded to a 91-day quarter has a durp value of 29/91 = +0.3187.

11) **datp** - The difference between the reported fact date and the month-end rounded date (ddate), expressed as a fraction of 1.  For example, a fact reported for 29/Dec, with ddate rounded to 31/Dec, has a datp value of minus 2/31 = -0.0645.
 
12) **dcml** - The value of the fact "decimals" attribute, with INF represented by 32767.
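
A quick way to check that the fields listed above really behave as a compound key is to count rows that repeat a key combination. This is a minimal sketch on made-up rows, not real filing data. `KEY_COLUMNS`, `count_key_collisions`, and the toy values are names and data invented for this illustration, and the helper only uses the key columns that are still present in the frame.

```python
import pandas as pd

# Key columns as listed in the readme excerpt above.
KEY_COLUMNS = ['adsh', 'tag', 'version', 'ddate', 'qtrs', 'uom',
               'dimh', 'iprx', 'coreg', 'durp', 'datp', 'dcml']

def count_key_collisions(df, key_columns=KEY_COLUMNS):
    """Count rows whose key combination already appeared in an earlier row."""
    present = [col for col in key_columns if col in df.columns]
    return int(df.duplicated(subset=present).sum())

# Hypothetical toy rows: the third row repeats the key of the first.
base = {'adsh': '0000000000-14-000001', 'tag': 'Assets',
        'version': 'us-gaap/2014', 'ddate': 20140930, 'qtrs': 0,
        'uom': 'USD', 'dimh': '0x00000000', 'iprx': 0, 'coreg': None,
        'durp': 0.0, 'datp': 0.0, 'dcml': -3}
toy = pd.DataFrame([base, {**base, 'ddate': 20130930}, base])

collisions = count_key_collisions(toy)
print(collisions)
```

On the real table, a nonzero count would suggest that the key is incomplete or that rows were duplicated during loading.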

#### The footnote and coreg columns are null in the majority of rows, so these columns are dropped for now

In [2]:
q414numbers = q414numbers.drop(columns=['footnote','coreg'])

In [3]:
q414numbers = q414numbers.dropna()

In [4]:
q414numbers.isnull().sum()

adsh       0
tag        0
version    0
ddate      0
qtrs       0
uom        0
dimh       0
iprx       0
value      0
footlen    0
dimn       0
durp       0
datp       0
dcml       0
dtype: int64

In [5]:
q414numbers.describe(include='all') # still 5 million+ rows after dropping nulls

Unnamed: 0,adsh,tag,version,ddate,qtrs,uom,dimh,iprx,value,footlen,dimn,durp,datp,dcml
count,5538862,5538862,5538862,5538862.0,5538862.0,5538862,5538862,5538862.0,5538862.0,5538862.0,5538862.0,5538862.0,5538862.0,5538862.0
unique,7655,239779,7368,,,2740,299276,,,,,,,
top,0001193125-14-405655,StockholdersEquity,us-gaap/2014,,,USD,0x00000000,,,,,,,
freq,11560,55620,3298152,,,4822236,2458846,,,,,,,
mean,,,,20135430.0,1.407349,,,0.0009821512,5974044000.0,0.8742092,0.7856128,0.004730011,0.1031539,3041.128
std,,,,11837.28,2.497928,,,0.03728668,2137439000000.0,22.77343,0.8838732,0.03858588,1.725226,9512.136
min,,,,19681230.0,0.0,,,0.0,-30155000000000.0,0.0,0.0,-0.4986305,-15.0,-12.0
25%,,,,20130930.0,0.0,,,0.0,0.22,0.0,0.0,0.0,0.0,-3.0
50%,,,,20140630.0,1.0,,,0.0,900000.0,0.0,1.0,0.002740025,0.0,-3.0
75%,,,,20140930.0,3.0,,,0.0,24000000.0,0.0,1.0,0.01917911,0.0,0.0


In [6]:
# break out adsh into the CIK prefix and the remainder (two-digit year plus sequence number)
s = q414numbers['adsh'].str.split('-', n = 1, expand = True)
q414numbers['entity_CIK'] = s[0]
q414numbers['filing_number'] = s[1]
q414numbers.head()

# From the readme: the first set of numbers (0001193125) is the CIK of the entity submitting the filing.
# The next 2 numbers (18) represent the year. 
# The last series of numbers represent a sequential count of submitted filings from that CIK. 
# The count is usually, but not always, reset to 0 at the start of each calendar year.

Unnamed: 0,adsh,tag,version,ddate,qtrs,uom,dimh,iprx,value,footlen,dimn,durp,datp,dcml,entity_CIK,filing_number
0,0001171843-14-005353,FederalHomeLoanBankStockDividends,0001171843-14-005353,20140930,1,USD,0x00000000,0.0,463000.0,0,0,0.00274,0.0,-3.0,1171843,14-005353
1,0001171843-14-005353,FederalHomeLoanBankStockDividends,0001171843-14-005353,20130930,1,USD,0x00000000,0.0,399000.0,0,0,0.00274,0.0,-3.0,1171843,14-005353
2,0001171843-14-005353,FederalHomeLoanBankStockDividends,0001171843-14-005353,20140930,3,USD,0x00000000,0.0,1444000.0,0,0,0.019179,0.0,-3.0,1171843,14-005353
3,0001171843-14-005353,FederalHomeLoanBankStockDividends,0001171843-14-005353,20130930,3,USD,0x00000000,0.0,1214000.0,0,0,0.019179,0.0,-3.0,1171843,14-005353
4,0001171843-14-005353,ShareBasedCompensationArrangementByShareBasedP...,0001171843-14-005353,20140930,0,shares,0x00000000,0.0,217227.0,0,0,0.0,0.0,32767.0,1171843,14-005353
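
The readme describes three parts in the accession number, while the split above keeps the year and the sequence number together in `filing_number`. A possible way to separate all three is sketched below on two accession numbers taken from the outputs in this notebook. `split_adsh`, `filing_year`, and `filing_seq` are names made up for this sketch.

```python
import pandas as pd

def split_adsh(adsh):
    """Split accession numbers into CIK prefix, two-digit year, and sequence."""
    parts = adsh.str.split('-', n=2, expand=True)
    parts.columns = ['entity_CIK', 'filing_year', 'filing_seq']
    return parts

# Two accession numbers that appear in the outputs above.
sample = pd.Series(['0001171843-14-005353', '0001193125-14-405655'], name='adsh')
parts = split_adsh(sample)
print(parts)
```

All three parts stay strings, which preserves the leading zeros of the CIK prefix.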


In [7]:
len(set(q414numbers.entity_CIK))

2385

## come back to this: need to see if this company had multiple filings

In [27]:
# len() of the boolean mask gave the total row count (5538862); sum() counts the rows for this CIK
(q414numbers.entity_CIK == '0001171843').sum()
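
To see whether a company had multiple filings, one option is to count distinct accession numbers per CIK prefix. The sketch below uses a hypothetical toy frame with the same `adsh` and `entity_CIK` columns. The values are invented, and `filings_per_cik` and `multi_filers` are illustrative names.

```python
import pandas as pd

# Hypothetical rows: the first CIK has two distinct filings, the second has one.
toy = pd.DataFrame({
    'adsh': ['0000000001-14-000001', '0000000001-14-000001',
             '0000000001-14-000002', '0000000002-14-000001'],
    'entity_CIK': ['0000000001', '0000000001', '0000000001', '0000000002'],
})

# Number of distinct accession numbers under each CIK prefix.
filings_per_cik = toy.groupby('entity_CIK')['adsh'].nunique()

# Keep only the CIK prefixes with more than one filing.
multi_filers = filings_per_cik[filings_per_cik > 1]
print(multi_filers)
```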

In [8]:
adsh_grouped = q414numbers[['entity_CIK','value']].groupby('entity_CIK')
adsh_grouped.groups


{'0000002178': Int64Index([  91142,   91143,   91144,   91145,   91146,   91147,   91148,
               91149,   91150,   91151,
             ...
             2657391, 2658762, 2658763, 2660719, 2660720, 2661795, 2661796,
             5004092, 5004093, 5004094],
            dtype='int64', length=479),
 '0000003146': Int64Index([  91152,   91153,   91154,   91155,   91156,   91157,   91158,
               91159,   91160,   91161,
             ...
             5004237, 5004238, 5004239, 5004240, 5004241, 5004242, 5004243,
             5004244, 5004245, 5004246],
            dtype='int64', length=1874),
 '0000003499': Int64Index([  91196,   91197,   91198,   91199,   91200,   91201,   91216,
               91217,   91218,   91219,
             ...
             5004260, 5004261, 5004262, 5004263, 5004264, 5004265, 5004266,
             5004267, 5004268, 5004269],
            dtype='int64', length=397),
 '0000003570': Int64Index([  91202,   91203,   91204,   91205,   91206,   91207,   9120

In [9]:
len(adsh_grouped.groups)

2385

In [10]:
def no_negs(number):
    return list(set(number.abs()))

In [11]:
grouped_abs_val = adsh_grouped.agg(no_negs)

In [12]:
grouped_abs_val # value is now a list of unique absolute values per filer (grouped by entity_CIK, not by individual filing)

Unnamed: 0_level_0,value
entity_CIK,Unnamed: 1_level_1
0000002178,"[0.0, 960000.0, 1.7, 3.13, 1.72, 5.1, 5.2, 1.0..."
0000003146,"[0.0, 51200000.0, 1.0, 2.0, 4.0, 5.0, 6400000...."
0000003499,"[0.0, 1.0, 2.71, 3.46, 2.75, 3.25, 13824000.0,..."
0000003570,"[0.0, 151808000.0, 480000000.0, 1.71, 4.0, 0.0..."
0000004127,"[6400000.0, 0.0, 2.38, 1.33, 0.25, 1.0, 256000..."
0000004187,"[0.0, 44032000.0, 0.5, 2.0, 2304000.0, 1.0, 4...."
0000004457,"[0.0, 0.25, 3.0, 7.98, 10453000.0, 5141717000...."
0000004904,"[2432000000.0, 0.0, 0.5, 1.5, 1.0, 3.5, 6.5, 4..."
0000004977,"[0.0, 1.5, 1.0, 2.0, 4.93, 5.31, 5.0, 25600000..."
0000005272,"[0.0, 1.0, 2.0, 2.5, 4.75, 4.82, 4.77, 0.125, ..."
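
The same per-filer lists of unique absolute values can probably be built without a custom aggregation function, by taking absolute values first and then calling `unique()` on the grouped Series. This is a sketch on invented numbers, and `unique_abs` is an illustrative name.

```python
import pandas as pd

# Hypothetical values: -5.0 and 5.0 collapse to a single absolute value.
toy = pd.DataFrame({
    'entity_CIK': ['0000000001', '0000000001', '0000000001', '0000000002'],
    'value': [-5.0, 5.0, 2.0, -7.0],
})

# Absolute values first, then the unique values within each CIK group.
unique_abs = toy['value'].abs().groupby(toy['entity_CIK']).unique()
print(unique_abs)
```

Each entry is a NumPy array rather than a Python list, and the values keep their order of first appearance.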


In [13]:
# for adsh in adsh_grouped.adsh:
#     numbers = list(set(adsh_grouped.adsh.value.abs()))

In [14]:
# for adsh in adsh_grouped.adsh:
#     unique = q414numbers.loc[q414numbers.adsh == adsh]
#     numbers = list(set(q414numbers.unique.value.abs()))
    

In [15]:
grouped_abs_val.value[0:10]

entity_CIK
0000002178    [0.0, 960000.0, 1.7, 3.13, 1.72, 5.1, 5.2, 1.0...
0000003146    [0.0, 51200000.0, 1.0, 2.0, 4.0, 5.0, 6400000....
0000003499    [0.0, 1.0, 2.71, 3.46, 2.75, 3.25, 13824000.0,...
0000003570    [0.0, 151808000.0, 480000000.0, 1.71, 4.0, 0.0...
0000004127    [6400000.0, 0.0, 2.38, 1.33, 0.25, 1.0, 256000...
0000004187    [0.0, 44032000.0, 0.5, 2.0, 2304000.0, 1.0, 4....
0000004457    [0.0, 0.25, 3.0, 7.98, 10453000.0, 5141717000....
0000004904    [2432000000.0, 0.0, 0.5, 1.5, 1.0, 3.5, 6.5, 4...
0000004977    [0.0, 1.5, 1.0, 2.0, 4.93, 5.31, 5.0, 25600000...
0000005272    [0.0, 1.0, 2.0, 2.5, 4.75, 4.82, 4.77, 0.125, ...
Name: value, dtype: object

In [16]:
type(grouped_abs_val.value)

pandas.core.series.Series

In [17]:
test = grouped_abs_val.value.iloc[0]  # positional access; the index holds CIK strings
print(len(test),test)

310 [0.0, 960000.0, 1.7, 3.13, 1.72, 5.1, 5.2, 1.0, 405000.0, 149000.0, 293397000.0, 0.5, 3.0, 4.0, 4906000.0, 142570000.0, 11178000.0, 426000.0, 1049194000.0, 42000.0, 22655000.0, 1855000.0, 93631000.0, 127000.0, 4479000.0, 1026004000.0, 4500000.0, 6228000.0, 4393000.0, 1129000.0, 1039337000.0, 297000.0, 272169000.0, 105000.0, 41000.0, 26921000.0, 510000.0, 7230000.0, 152979000.0, 16147000.0, 467000.0, 63955000.0, 51795000.0, 33875000.0, 1939000.0, 0.22, 18664000.0, 168000.0, 1896000.0, 9512000.0, 360000.0, 154685000.0, 5373000.0, 60733000.0, 21501000.0, 448082000.0, 82000.0, 1173970000.0, 3283090000.0, 2450000.0, 13394000.0, 103000.0, 21927000.0, 9767000.0, 999000.0, 275239000.0, 2272000.0, 2300000.0, 20220000.0, 252000.0, 226300000.0, 572000.0, 124000.0, 94673000.0, 212433000.0, 2977873000.0, 358000.0, 165094000.0, 38000.0, 1958000.0, 16294000.0, 422000.0, 2598000.0, 34406000.0, 123000.0, 1787000.0, 2000000.0, 0.44, 464000.0, 356453000.0, 229000.0, 549000.0, 37093000.0, 1338000.0, 2

In [18]:
type(test)

list

In [19]:
# for number in test:
#     benford = int(str(number)[0])
#     print(benford)

In [20]:
for val in grouped_abs_val.value:
    print(val)

[0.0, 960000.0, 1.7, 3.13, 1.72, 5.1, 5.2, 1.0, 405000.0, 149000.0, 293397000.0, 0.5, 3.0, 4.0, 4906000.0, 142570000.0, 11178000.0, 426000.0, 1049194000.0, 42000.0, 22655000.0, 1855000.0, 93631000.0, 127000.0, 4479000.0, 1026004000.0, 4500000.0, 6228000.0, 4393000.0, 1129000.0, 1039337000.0, 297000.0, 272169000.0, 105000.0, 41000.0, 26921000.0, 510000.0, 7230000.0, 152979000.0, 16147000.0, 467000.0, 63955000.0, 51795000.0, 33875000.0, 1939000.0, 0.22, 18664000.0, 168000.0, 1896000.0, 9512000.0, 360000.0, 154685000.0, 5373000.0, 60733000.0, 21501000.0, 448082000.0, 82000.0, 1173970000.0, 3283090000.0, 2450000.0, 13394000.0, 103000.0, 21927000.0, 9767000.0, 999000.0, 275239000.0, 2272000.0, 2300000.0, 20220000.0, 252000.0, 226300000.0, 572000.0, 124000.0, 94673000.0, 212433000.0, 2977873000.0, 358000.0, 165094000.0, 38000.0, 1958000.0, 16294000.0, 422000.0, 2598000.0, 34406000.0, 123000.0, 1787000.0, 2000000.0, 0.44, 464000.0, 356453000.0, 229000.0, 549000.0, 37093000.0, 1338000.0, 24900




In [21]:
len(grouped_abs_val.value)


2385

In [81]:
def first_digit(number): 
    # First character of the decimal string, so values below 1 (e.g. 0.22) return 0
    return int(str(number)[0])
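
Benford's law concerns the first significant digit, so a value such as 0.22 would normally count as a 2 rather than a 0. One possible variant, sketched here under the assumption that inputs are finite numbers, formats the value in scientific notation and reads the first character of the mantissa. `leading_digit` is a name introduced for this sketch and is not used elsewhere in the notebook.

```python
def leading_digit(number):
    """Return the first significant digit of a finite number (0 for zero)."""
    # Scientific notation puts the first significant digit in front,
    # e.g. 0.22 -> '2.200000000000000e-01'.
    mantissa = f"{abs(number):.15e}"
    return int(mantissa[0])

examples = [0.22, 960000.0, -3.13, 0.5, 0.0]
digits = [leading_digit(x) for x in examples]
print(digits)
```

NaN and infinite values would raise a `ValueError` here, so they need to be filtered out beforehand.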

In [91]:
# first_digit(grouped_abs_val.value[0][11])

0

In [88]:
all_nums = [v for v in grouped_abs_val.value[0]]
all_nums

[0.0,
 960000003040.403,
 18778.521716462383,
 32561.88123974383,
 18178.791131381367,
 51799.730585081,
 52676.16257526919,
 10585.718664474636,
 4050000516.640476,
 1490000462.1506548,
 293397000.0,
 0.5,
 3.0,
 4.0,
 4906000.0,
 142570000.0,
 11178000.0,
 426000.0,
 1049194000.0,
 42000.0,
 22655000.0,
 1855000.0,
 93631000.0,
 127000.0,
 4479000.0,
 1026004000.0,
 4500000.0,
 6228000.0,
 4393000.0,
 1129000.0,
 1039337000.0,
 297000.0,
 272169000.0,
 105000.0,
 41000.0,
 26921000.0,
 510000.0,
 7230000.0,
 152979000.0,
 16147000.0,
 467000.0,
 63955000.0,
 51795000.0,
 33875000.0,
 1939000.0,
 0.22,
 18664000.0,
 168000.0,
 1896000.0,
 9512000.0,
 360000.0,
 154685000.0,
 5373000.0,
 60733000.0,
 21501000.0,
 448082000.0,
 82000.0,
 1173970000.0,
 3283090000.0,
 2450000.0,
 13394000.0,
 103000.0,
 21927000.0,
 9767000.0,
 999000.0,
 275239000.0,
 2272000.0,
 2300000.0,
 20220000.0,
 252000.0,
 226300000.0,
 572000.0,
 124000.0,
 94673000.0,
 212433000.0,
 2977873000.0,
 358000.0,
 

In [101]:
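# Note: fds is reassigned on every pass, so after the loop it holds only the last filer's first digits
for v in grouped_abs_val.value:
    fds = [first_digit(num) for num in v]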
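
If the goal is to keep the first digits of every filer instead of only the last one, a dictionary keyed by CIK is one option. The sketch below repeats `first_digit` so that it runs on its own. `first_digits_by_group` and the toy mapping are made up for illustration. A pandas Series such as `grouped_abs_val.value` also provides `.items()`, so it could be passed in the same way.

```python
def first_digit(number):
    # Same helper as in the notebook: first character of the decimal string.
    return int(str(number)[0])

def first_digits_by_group(values_by_group):
    """Map each group key to the list of first digits of its values."""
    return {key: [first_digit(num) for num in values]
            for key, values in values_by_group.items()}

# Hypothetical per-filer value lists.
toy = {'0000000001': [960000.0, 1.7, 0.0],
       '0000000002': [405000.0, 3.13]}
digits_by_filer = first_digits_by_group(toy)
print(digits_by_filer)
```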

In [122]:
def frequencies(first_digits):
    counts = [0]*10
    for x in first_digits:
        if x > 0:
            counts[x] += 1 
    total = sum(counts)
    freq = [count/total for count in counts]
    print(sum(freq)) #frequencies should sum up to 1
    return freq[1:] #same as going from element 1 through the end (i.e. 9)

In [123]:
frequencies(fds)

1.0


[0.26791808873720135,
 0.22525597269624573,
 0.12457337883959044,
 0.1075085324232082,
 0.08873720136518772,
 0.052901023890784986,
 0.042662116040955635,
 0.042662116040955635,
 0.04778156996587031]

In [None]:
# def frequencies(first_digits):
#     counts = [0]*10
#     for x in first_digits:
#         counts[x] += 1 
#     total = sum(counts)
#     freq = [count/total for count in counts]
#     print(sum(freq)) #frequencies should sum up to 1
#     return freq[1:] #same as going from element 1 through the end (i.e. 9)

In [165]:
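import math
def kl_divergence(freq):
    """KL divergence of the observed first-digit frequencies (P) from Benford's law (Q)."""
    kl_div = 0.0
    for d in range(1, 10):
        Q = (math.log(d+1) - math.log(d)) / math.log(10) # theoretical Benford frequency for digit d
        P = freq[d-1]
        if P > 0: # a digit that never occurs contributes nothing to the sum
            kl_div += math.log(P / Q) * P
    return kl_div # keep the float: int() truncated every small divergence to 0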

In [166]:
kl_divergence(benfords_law)

0

In [127]:

benfords_law = [math.log10(1 + 1/digit) for digit in range(1, 10)]
benfords_law

[0.3010299956639812,
 0.17609125905568124,
 0.12493873660829993,
 0.09691001300805642,
 0.07918124604762482,
 0.06694678963061322,
 0.05799194697768673,
 0.05115252244738129,
 0.04575749056067514]

In [128]:
sum(benfords_law) 

1.0
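
For reference, the quantities used in these cells can be written out as follows. This is the standard textbook form, with $Q$ for the Benford frequencies and $P$ for the observed first-digit frequencies, matching the variable names in `kl_divergence`. The middle line shows why the Benford frequencies sum to 1: the logarithms telescope.

$$
Q(d) = \log_{10}\left(1 + \frac{1}{d}\right) = \frac{\ln(d+1) - \ln(d)}{\ln 10}, \qquad d = 1, \dots, 9
$$

$$
\sum_{d=1}^{9} Q(d) = \log_{10}(10) - \log_{10}(1) = 1
$$

$$
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{d=1}^{9} P(d) \, \ln\frac{P(d)}{Q(d)}
$$

The divergence is 0 when the observed frequencies match Benford's law exactly and grows as they move away from it.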

In [171]:
fds[0][1]

TypeError: 'int' object is not subscriptable

In [168]:
# fds is a flat list of ints, so fds[0] is a single digit; pass the digit frequencies instead
kl_divergence(frequencies(fds))
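
Putting the pieces together, a per-filer Benford score could look roughly like the sketch below. It is an assumption about where the analysis is heading, not the notebook's final method. `leading_digit` and `benford_kl` are names invented here, the toy values are not real filing data, and zeros are skipped because they have no leading digit.

```python
import math
import pandas as pd

def leading_digit(number):
    """First significant digit of a finite, nonzero number."""
    return int(f"{abs(number):.15e}"[0])

def benford_kl(values):
    """KL divergence of observed first-digit frequencies from Benford's law."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        return float('nan')  # no usable values in this group
    kl = 0.0
    for d in range(1, 10):
        p = digits.count(d) / len(digits)  # observed frequency of digit d
        q = math.log10(1 + 1 / d)          # Benford frequency of digit d
        if p > 0:
            kl += p * math.log(p / q)
    return kl

# Hypothetical values: every value of filer A starts with the digit 1.
toy = pd.DataFrame({
    'entity_CIK': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'value': [1.0, 12.0, 150.0, -1900.0, 0.22, 3.0, 0.0, 45.0],
})
scores = toy.groupby('entity_CIK')['value'].apply(benford_kl)
print(scores)
```

Larger scores indicate first-digit distributions further from Benford's law. With only a handful of values per group, as in this toy frame, the score says little on its own.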