### Final Project Requirements/notes: https://docs.google.com/document/d/1mwYbYJHkB7kpx4tNflKh54jN9_oOscw3p4k5fsmn3bc/edit

### Link with all Data: https://www.sec.gov/dera/data/financial-statement-and-notes-data-set.html
- using NUM file only for now (data set of all numeric XBRL facts presented on the primary financial statements)

In [1]:
import pandas as pd
q414numbers = pd.read_table('2014q4_notes/num.tsv', encoding='latin1')



### From "Financial Statement and Notes Data Sets" Readme:
These fields comprise a unique compound key:

1) **adsh** - EDGAR accession number: a unique identifier assigned automatically to an accepted submission by the EDGAR Filer System. The first set of numbers (0001193125) is the CIK of the entity submitting the filing. The next 2 numbers (18) represent the year. The last series of numbers represents a sequential count of submitted filings from that CIK. The count is usually, but not always, reset to 0 at the start of each calendar year.
- **TODO**: separate these numbers to identify a company or an individual filing; there were 6,492 individual filings

2) **tag** - tag used by the filer 
- **TODO**: may have to separate out first word from tag to identify broader groups such as revenue

3) **version** - if a standard tag, the taxonomy of origin, otherwise equal to adsh.

4) **ddate** - period end date

5) **qtrs** - duration in number of quarters

6) **uom** - unit of measure

7) **dimh** - 16-byte dimensional qualifier

8) **iprx** - a sequential integer used to distinguish otherwise identical facts

9) **coreg** - If specified, indicates a specific co-registrant, the parent company, or other entity (e.g., guarantor).  NULL indicates the consolidated entity.  Note that this value is a function of the dimension segments.

10) **durp** - The difference between the reported fact duration and the quarter duration (qtrs), expressed as a fraction of 1.  For example, a fact with duration of 120 days rounded to a 91-day quarter has a durp value of 29/91 = +0.3187.

11) **datp** - The difference between the reported fact date and the month-end rounded date (ddate), expressed as a fraction of 1.  For example, a fact reported for 29/Dec, with ddate rounded to 31/Dec, has a datp value of minus 2/31 = -0.0645.
 
12) **dcml** - The value of the fact "decimals" attribute, with INF represented by 32767.
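
For the **tag** TODO above, one possible way to pull out the first word of a CamelCase tag is a small regular expression. This is only a sketch: the helper name `first_word` is hypothetical, and the rule assumes tags start with a capitalized word, so a tag that begins with an acronym would be cut down to its first letter.

```python
import re

def first_word(tag: str) -> str:
    """Return the leading CamelCase word of a tag name (hypothetical helper)."""
    # One uppercase letter followed by any run of lowercase letters or digits
    match = re.match(r"[A-Z][a-z0-9]*", tag)
    # Fall back to the full tag when it does not start with an uppercase letter
    return match.group(0) if match else tag

# Tag names taken from the data shown in this notebook, plus one assumed example
tags = ["StockholdersEquity", "FederalHomeLoanBankStockDividends", "Revenues"]
groups = [first_word(t) for t in tags]
print(groups)
```

Grouping on the first word alone is coarse (for example, `StockholdersEquity` would be grouped under `Stockholders`), so a hand-built mapping of prefixes to broader categories may still be needed.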

#### Footnotes and co-registrants are null in the majority of rows, so these columns are removed for now

In [2]:
q414numbers = q414numbers.drop(columns=['footnote','coreg'])

In [3]:
q414numbers = q414numbers.dropna()

In [4]:
q414numbers.isnull().sum()

adsh       0
tag        0
version    0
ddate      0
qtrs       0
uom        0
dimh       0
iprx       0
value      0
footlen    0
dimn       0
durp       0
datp       0
dcml       0
dtype: int64
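
The order of the two cleaning steps matters: calling `dropna()` before dropping the mostly empty columns would discard almost every row. A minimal sketch of checking the null share per column first, using a toy frame rather than the real NUM file (the 0.5 cutoff and the toy values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NUM table; values are made up for illustration
toy = pd.DataFrame({
    "adsh": ["a", "a", "b", "b"],
    "value": [1.0, np.nan, 3.0, 4.0],
    "footnote": [np.nan, np.nan, np.nan, "see note 2"],
    "coreg": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values in each column
null_share = toy.isnull().mean()

# Hypothetical cutoff: drop columns that are more than half empty
threshold = 0.5
sparse_cols = null_share[null_share > threshold].index.tolist()

# Drop sparse columns first, then drop the remaining incomplete rows
cleaned = toy.drop(columns=sparse_cols).dropna()
print(sparse_cols, len(cleaned))
```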

In [5]:
q414numbers.describe(include='all')  # still 5.5 million+ rows after dropping nulls

Unnamed: 0,adsh,tag,version,ddate,qtrs,uom,dimh,iprx,value,footlen,dimn,durp,datp,dcml
count,5538862,5538862,5538862,5538862.0,5538862.0,5538862,5538862,5538862.0,5538862.0,5538862.0,5538862.0,5538862.0,5538862.0,5538862.0
unique,7655,239779,7368,,,2740,299276,,,,,,,
top,0001193125-14-405655,StockholdersEquity,us-gaap/2014,,,USD,0x00000000,,,,,,,
freq,11560,55620,3298152,,,4822236,2458846,,,,,,,
mean,,,,20135430.0,1.407349,,,0.0009821512,5974044000.0,0.8742092,0.7856128,0.004730011,0.1031539,3041.128
std,,,,11837.28,2.497928,,,0.03728668,2137439000000.0,22.77343,0.8838732,0.03858588,1.725226,9512.136
min,,,,19681230.0,0.0,,,0.0,-30155000000000.0,0.0,0.0,-0.4986305,-15.0,-12.0
25%,,,,20130930.0,0.0,,,0.0,0.22,0.0,0.0,0.0,0.0,-3.0
50%,,,,20140630.0,1.0,,,0.0,900000.0,0.0,1.0,0.002740025,0.0,-3.0
75%,,,,20140930.0,3.0,,,0.0,24000000.0,0.0,1.0,0.01917911,0.0,0.0


In [6]:
# Split adsh at the first hyphen into the filer CIK and the rest of the accession number
s = q414numbers['adsh'].str.split('-', n=1, expand=True)
q414numbers['entity_CIK'] = s[0]
q414numbers['filing_number'] = s[1]  # two-digit year plus sequence, e.g. '14-005353'
q414numbers.head()

# From the readme:
# The first set of numbers (0001193125) is the CIK of the entity submitting the filing.
# The next 2 numbers (18) represent the year.
# The last series of numbers represents a sequential count of submitted filings from that CIK.
# The count is usually, but not always, reset to 0 at the start of each calendar year.

Unnamed: 0,adsh,tag,version,ddate,qtrs,uom,dimh,iprx,value,footlen,dimn,durp,datp,dcml,entity_CIK,filing_number
0,0001171843-14-005353,FederalHomeLoanBankStockDividends,0001171843-14-005353,20140930,1,USD,0x00000000,0.0,463000.0,0,0,0.00274,0.0,-3.0,1171843,14-005353
1,0001171843-14-005353,FederalHomeLoanBankStockDividends,0001171843-14-005353,20130930,1,USD,0x00000000,0.0,399000.0,0,0,0.00274,0.0,-3.0,1171843,14-005353
2,0001171843-14-005353,FederalHomeLoanBankStockDividends,0001171843-14-005353,20140930,3,USD,0x00000000,0.0,1444000.0,0,0,0.019179,0.0,-3.0,1171843,14-005353
3,0001171843-14-005353,FederalHomeLoanBankStockDividends,0001171843-14-005353,20130930,3,USD,0x00000000,0.0,1214000.0,0,0,0.019179,0.0,-3.0,1171843,14-005353
4,0001171843-14-005353,ShareBasedCompensationArrangementByShareBasedP...,0001171843-14-005353,20140930,0,shares,0x00000000,0.0,217227.0,0,0,0.0,0.0,32767.0,1171843,14-005353
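
The split above keeps the year and the sequence number together in `filing_number`. A sketch of separating all three parts of the accession number follows; the column names `filer_id`, `filing_year` and `filing_seq` are hypothetical, and adding 2000 to the two-digit year is an assumption that only holds for filings from 2000 onward.

```python
import pandas as pd

# Two accession numbers that appear in this notebook's output
toy = pd.DataFrame({"adsh": ["0001171843-14-005353", "0001193125-14-405655"]})

# Split on every hyphen: filer ID, two-digit year, sequence number
parts = toy["adsh"].str.split("-", expand=True)
parts.columns = ["filer_id", "filing_year", "filing_seq"]

# Keep filer_id as a string so the leading zeros survive
parts["filing_year"] = 2000 + parts["filing_year"].astype(int)  # assumes years >= 2000
parts["filing_seq"] = parts["filing_seq"].astype(int)

result = toy.join(parts)
print(result)
```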


In [102]:
len(set(q414numbers.entity_CIK))

2385

In [30]:
adsh_grouped = q414numbers[['adsh','value']].groupby('adsh')
adsh_grouped.groups


{'0000002178-14-000064': Int64Index([  91142,   91143,   91144,   91145,   91146,   91147,   91148,
               91149,   91150,   91151,
             ...
             2657391, 2658762, 2658763, 2660719, 2660720, 2661795, 2661796,
             5004092, 5004093, 5004094],
            dtype='int64', length=479),
 '0000003146-14-000006': Int64Index([  91152,   91153,   91154,   91155,   91156,   91157,   91158,
               91159,   91160,   91161,
             ...
             5004228, 5004229, 5004230, 5004231, 5004232, 5004237, 5004239,
             5004240, 5004243, 5004244],
            dtype='int64', length=937),
 '0000003146-14-000009': Int64Index([  91182,   91183,   91184,   91185,   91186,   91187,   91188,
               91189,   91190,   91191,
             ...
             5004226, 5004233, 5004234, 5004235, 5004236, 5004238, 5004241,
             5004242, 5004245, 5004246],
            dtype='int64', length=937),
 '0000003499-14-000018': Int64Index([  91196,   91197,   9

In [35]:
len(adsh_grouped.groups)

7655

In [32]:
def no_negs(number):
    return list(set(number.abs()))

In [45]:
grouped_abs_val = adsh_grouped.agg(no_negs)

In [46]:
grouped_abs_val.head() #value is now a list of positive unique numbers, split up by filer and that specific filing

Unnamed: 0_level_0,value
adsh,Unnamed: 1_level_1
0000002178-14-000064,"[0.0, 960000.0, 1.7, 3.13, 1.72, 5.1, 5.2, 1.0..."
0000003146-14-000006,"[0.0, 1.0, 51200000.0, 2.0, 4.0, 5.0, 6400000...."
0000003146-14-000009,"[0.0, 51200000.0, 1.0, 2.0, 4.0, 5.0, 6400000...."
0000003499-14-000018,"[0.0, 1.0, 2.71, 3.46, 2.75, 3.25, 13824000.0,..."
0000003570-14-000268,"[0.0, 151808000.0, 480000000.0, 1.71, 4.0, 0.0..."
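
The same grouping can be tried on a toy frame to see what the aggregation produces. This sketch uses `groupby(...).apply` with pandas' own `unique()` instead of the `no_negs` helper; the toy values are made up, and sorting is added only to make the result easy to read.

```python
import pandas as pd

# Toy data: two filings with repeated and negative values
toy = pd.DataFrame({
    "adsh": ["A", "A", "A", "B", "B"],
    "value": [-5.0, 5.0, 2.0, 0.0, -3.0],
})

# For each filing, collect the distinct absolute values as a sorted list
unique_abs = toy.groupby("adsh")["value"].apply(lambda s: sorted(s.abs().unique()))
print(unique_abs)
```

Note that de-duplicating counts each distinct amount once per filing, so a value reported many times carries the same weight as a value reported once in any later digit count.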


In [84]:
# for adsh in adsh_grouped.adsh:
#     numbers = list(set(adsh_grouped.adsh.value.abs()))

In [None]:
# for adsh in adsh_grouped.adsh:
#     unique = q414numbers.loc[q414numbers.adsh == adsh]
#     numbers = list(set(q414numbers.unique.value.abs()))
    

In [51]:
grouped_abs_val.value[0:10]

adsh
0000002178-14-000064    [0.0, 960000.0, 1.7, 3.13, 1.72, 5.1, 5.2, 1.0...
0000003146-14-000006    [0.0, 1.0, 51200000.0, 2.0, 4.0, 5.0, 6400000....
0000003146-14-000009    [0.0, 51200000.0, 1.0, 2.0, 4.0, 5.0, 6400000....
0000003499-14-000018    [0.0, 1.0, 2.71, 3.46, 2.75, 3.25, 13824000.0,...
0000003570-14-000268    [0.0, 151808000.0, 480000000.0, 1.71, 4.0, 0.0...
0000004127-14-000046    [6400000.0, 0.0, 2.38, 1.33, 0.25, 1.0, 256000...
0000004187-14-000043    [0.0, 44032000.0, 0.5, 2.0, 2304000.0, 1.0, 4....
0000004457-14-000051    [0.0, 0.25, 3.0, 7.98, 10453000.0, 5141717000....
0000004904-14-000097    [2432000000.0, 0.0, 0.5, 1.5, 1.0, 3.5, 6.5, 4...
0000004977-14-000138    [0.0, 1.5, 1.0, 2.0, 4.93, 5.31, 5.0, 25600000...
Name: value, dtype: object

In [52]:
type(grouped_abs_val.value)

pandas.core.series.Series

In [98]:
test = grouped_abs_val.value[1]
print(len(test),test)

408 [0.0, 1.0, 51200000.0, 2.0, 4.0, 5.0, 6400000.0, 1600000000.0, 9.0, 30.0, 2900000.0, 201300000.0, 79700000.0, 60500000.0, 28500000.0, 98900000.0, 34900000.0, 50.0, 18600000.0, 0.21, 143100000.0, 21500000.0, 8700000.0, 2300000.0, 123900000.0, 885500000.0, 27900000.0, 50000000.0, 331600000.0, 0.0126, 5200000.0, 18000000.0, 0.1315, 8100000.0, 14500000.0, 84900000.0, 1700000.0, 52900000.0, 11000000.0, 62200000.0, 49400000.0, 43000000.0, 75000000.0, 17400000.0, 4600000.0, 39500000.0, 1100000.0, 58700000.0, 0.013, 0.38, 266400000.0, 4000000.0, 23200000.0, 29600000.0, 16800000.0, 119200000.0, 0.13, 141300000.0, 32500000.0, 500000.0, 19700000.0, 13300000.0, 253000000.0, 73800000.0, 163400000.0, 265800000.0, 35400000.0, 169800000.0, 3400000.0, 358300000.0, 31900000.0, 6300000.0, 150000000.0, 60400000.0, 0.34, 22000000.0, 258800000.0, 2800000.0, 15600000.0, 34800000.0, 86000000.0, 0.1365, 111000.0, 18500000.0, 0.0359, 12100000.0, 0.05, 24900000.0, 159300000.0, 27800000.0, 469400000.0, 860000

In [86]:
type(test)

list

In [88]:
# for number in test:
#     benford = int(str(number)[0])
#     print(benford)

In [91]:
for val in grouped_abs_val.value:
    print(val)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)




[0.51, 0.0, 0.77, 0.69, 1.53, 3036677.0, 37892.0, 0.56, 22536.0, 312824.0, 23050.0, 25611.0, 1227787.0, 260613.0, 1541132.0, 91663.0, 185360.0, 183314.0, 43027.0, 399892.0, 333333.0, 223257.0, 1144858.0, 30750.0, 0.5, 20000.0, 434209.0, 671782.0, 103976.0, 3113002.0, 30763.0, 517.0, 96304.0, 129072.0, 1756720.0, 50739.0, 50.0, 93239.0, 516667.0, 560700.0, 9240640.0, 52288.0, 1000000.0, 0.23, 40000.0, 132165.0, 125000.0, 16974.0, 168015.0, 80.0, 50702418.0, 166996.0, 892500.0, 10385498.0, 1029212.0, 70238.0, 83554.0, 2836578.0, 1901666.0, 42085.0, 89192.0, 495215.0, 50739312.0, 68722.0, 1084028.0, 19071.0, 0.43, 129.0, 2000000.0, 992899.0, 570500.0, 0.2, 4142214.0, 1341575.0, 80000.0, 84105.0, 63619.0, 387211.0, 14597262.0, 122000.0, 3450000.0, 0.18, 137884.0, 52287645.0, 100000.0, 102050.0, 2884771.0, 1532069.0, 6822.0, 3184809.0, 533677.0, 193710.0, 11950.0, 84144.0, 280750.0, 3250.0, 699572.0, 3116214.0, 35512.0, 188601.0, 774840.0, 9338045.0, 16353982.0, 75000000.0, 2250435.0, 5731

[0.0, 1792000.0, 1.0, 3.0, 19200000.0, 256000.0, 0.25, 0.5, 16085000.0, 213000.0, 8.0, 11.0, 40917000.0, 20949000.0, 0.75, 1.75, 1450000.0, 43434000.0, 18.49, 4.25, 23.0, 75647000.0, 2175000.0, 26.0, 24.2, 28.4, 29.33, 37.96, 208937000.0, 136233000.0, 979000.0, 61.42, 2216000.0, 64.6, 125000.0, 11090000.0, 2898000.0, 18770000.0, 86.85, 551000.0, 44583000.0, 39000.0, 92.02, 0.077, 252000.0, 2300000.0, 99.32, 100.0, 99.33, 7121000.0, 0.327, 218448000.0, 122960000.0, 143492.0, 2511000.0, 975000.0, 3023000.0, 19620000.0, 0.1075, 633000.0, 207225000.0, 331897000.0, 9550000.0, 192590000.0, 43043000.0, 2339000.0, 1528000.0, 0.17, 4088000.0, 8397000.0, 3490000.0, 631000.0, 719479000.0, 31564000.0, 225.0, 4684000.0, 1100000.0, 33000.0, 8737000.0, 234.0, 2550000.0, 184310000.0, 451318000.0, 0.005, 715000.0, 5024000.0, 1696000.0, 36000000.0, 55200000.0, 0.13, 12960000.0, 2208000.0, 108405000.0, 12874000.0, 147530000.0, 42570000.0, 8991000.0, 105759000.0, 12831000.0, 184052000.0, 0.106, 12276000.

[0.0, 400000000.0, 7360000.0, 200000000.0, 4000000000.0, 1.0, 1.9063, 9280000.0, 195541000.0, 31637000.0, 5973000.0, 2908885000.0, 8597000.0, 0.84, 7.0, 2.0, 36330000.0, 8554000.0, 69674000.0, 50090000.0, 15146000.0, 113983000.0, 36287000.0, 7999000.0, 23807000.0, 25.0, 6143000.0, 26.0, 2946196000.0, 2107988000.0, 7700000.0, 399400000.0, 90601000.0, 2557865000.0, 3625000.0, 489000.0, 297000.0, 0.0405, 0.057, 138003000.0, 19091000.0, 659000.0, 170191000.0, 168000.0, 2536000.0, 25000000.0, 5352000.0, 1897512000.0, 46248000.0, 1896000.0, 185000000.0, 187005000.0, 2141821000.0, 2173000.0, 25234000.0, 9746000.0, 73426000.0, 1426000.0, 79773000.0, 274000.0, 0.0225, 1234000.0, 196071000.0, 5287000.0, 1959000.0, 19388000.0, 8956000.0, 516284000.0, 5500000.0, 179900000.0, 25233000.0, 95014000.0, 25062000.0, 74278000.0, 8102000.0, 3.15, 0.0186, 61371000.0, 1275000.0, 9467000.0, 290000000.0, 38544000.0, 2253968000.0, 2154064000.0, 458000000.0, 490000000.0, 130000000.0, 1250000000.0, 357000.0, 84

In [76]:
len(grouped_abs_val.value)


7655

In [92]:
# x = [i for i in range(0,len(grouped_abs_val.value))]
# 

In [97]:
def first_digit(number):
    # first character of the decimal string, so any value below 1 (e.g. 0.21) yields 0
    return int(str(number)[0])

def frequencies(first_digits):
    counts = [0]*10
    for x in first_digits:
        counts[x] += 1
    total = sum(counts)  # includes the zeros counted in counts[0]
    freq = [count/total for count in counts]
    return freq[1:]  # frequencies for digits 1 through 9; the zero bucket is dropped

# first-digit frequencies for the first ten filings
for v in grouped_abs_val.value[0:10]:
    first_digits = [first_digit(num) for num in v]
    print(frequencies(first_digits))

[0.3, 0.17096774193548386, 0.10967741935483871, 0.13225806451612904, 0.0967741935483871, 0.04516129032258064, 0.035483870967741936, 0.02258064516129032, 0.05161290322580645]
[0.23284313725490197, 0.17401960784313725, 0.13480392156862744, 0.09313725490196079, 0.06372549019607843, 0.061274509803921566, 0.03431372549019608, 0.0392156862745098, 0.031862745098039214]
[0.23284313725490197, 0.17401960784313725, 0.13480392156862744, 0.09313725490196079, 0.06372549019607843, 0.061274509803921566, 0.03431372549019608, 0.0392156862745098, 0.031862745098039214]
[0.2446043165467626, 0.15467625899280577, 0.18345323741007194, 0.10071942446043165, 0.10431654676258993, 0.02877697841726619, 0.05755395683453238, 0.03237410071942446, 0.02877697841726619]
[0.2875457875457875, 0.22344322344322345, 0.10622710622710622, 0.07326007326007326, 0.06227106227106227, 0.0641025641025641, 0.0347985347985348, 0.054945054945054944, 0.03296703296703297]
[0.30174081237911027, 0.17794970986460348, 0.08704061895551257, 0.0
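
Because `first_digit` reads the first character of the string, values below 1 (such as 0.21 or 0.0126) land in the zero bucket, and the nine returned frequencies do not sum to 1. Benford's law is usually stated for the first significant digit, so one alternative is sketched below. The helper names `leading_digit` and `benford_frequencies` are hypothetical, and skipping zeros entirely is an assumption about how they should be handled.

```python
import math

def leading_digit(number):
    """Return the first significant digit (1-9), or None for zero, NaN or infinity."""
    x = abs(float(number))
    if x == 0 or not math.isfinite(x):
        return None
    # Scientific notation always starts with the first significant digit
    return int(f"{x:.15e}"[0])

def benford_frequencies(values):
    """Relative frequency of leading digits 1-9, ignoring values without one."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        raise ValueError("no non-zero finite values to count")
    counts = [0] * 10
    for d in digits:
        counts[d] += 1
    return [c / len(digits) for c in counts[1:]]

# A few values taken from the filing printed earlier in this notebook
sample = [0.0, 1.0, 51200000.0, 0.21, 0.0126, 0.38]
freqs_sample = benford_frequencies(sample)
print(freqs_sample)
```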

In [96]:
freqs

[0.2709739633558341,
 0.15043394406943106,
 0.12825458052073288,
 0.10318225650916105,
 0.08389585342333655,
 0.05785920925747348,
 0.052073288331726135,
 0.04339440694310511,
 0.02892960462873674]

In [None]:
for adsh, rows in q414numbers.groupby('adsh'):
    # unique absolute values reported in this filing
    numbers = list(set(rows['value'].abs()))
    

In [64]:
digits

<function __main__.<lambda>(index)>

In [44]:
grouped_abs_val = grouped_abs_val.groupby('adsh')
grouped_abs_val.groups

AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy' objects, try using the 'apply' method

In [40]:
def first_digit(number):
    return int(str(number)[0])    

In [41]:
grouped_benfords = grouped_abs_val.agg(first_digit)

In [42]:
grouped_benfords

Unnamed: 0_level_0,value
adsh,Unnamed: 1_level_1
0000002178-14-000064,0
0000003146-14-000006,0
0000003146-14-000009,0
0000003499-14-000018,0
0000003570-14-000268,0
0000004127-14-000046,0
0000004187-14-000043,0
0000004457-14-000051,0
0000004904-14-000097,0
0000004977-14-000138,0


In [40]:
benford_valeant = [first_digit(num) for num in valeant_nums] 
len(benford_valeant)

890

In [41]:
def frequencies(first_digits):
    counts = [0]*10
    for x in first_digits:
        counts[x] += 1 
    total = sum(counts)
    freq = [count/total for count in counts]
    return freq[1:] #same as going from element 1 through the end (i.e. 9)

In [42]:
frequencies(benford_valeant)

[0.2910112359550562,
 0.16741573033707866,
 0.13595505617977527,
 0.09550561797752809,
 0.08426966292134831,
 0.06629213483146068,
 0.048314606741573035,
 0.04269662921348315,
 0.043820224719101124]

In [45]:
import math
# Benford's law: P(d) = log10(1 + 1/d) for leading digits d = 1..9
benfords_law = [math.log10(1 + 1/digit) for digit in range(1, 10)]
benfords_law

[0.3010299956639812,
 0.17609125905568124,
 0.12493873660829993,
 0.09691001300805642,
 0.07918124604762482,
 0.06694678963061322,
 0.05799194697768673,
 0.05115252244738129,
 0.04575749056067514]

In [47]:
import plotly.offline

plotly.offline.init_notebook_mode(connected=True)
# one trace per series; 'name' labels each line in the legend
Benfords_Law = {'type': 'scatter', 'name': "Benford's law", 'x': list(range(1, 10)), 'y': benfords_law}
Valeant = {'type': 'scatter', 'name': 'Valeant', 'x': list(range(1, 10)), 'y': frequencies(benford_valeant)}

plotly.offline.iplot([Benfords_Law, Valeant])
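
As a numeric complement to the plot, the observed digit counts can be compared with Benford's law through a mean absolute deviation and a Pearson chi-square statistic. This is a sketch under stated assumptions: the counts below are reconstructed by multiplying the Valeant frequencies printed above by the 890 values, leading zeros are excluded, and 15.507 is the standard 5% critical value for 8 degrees of freedom.

```python
import math

# Expected Benford probabilities for leading digits 1..9
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Counts reconstructed from the printed frequencies times 890 (assumption)
observed_counts = [259, 149, 121, 85, 75, 59, 43, 38, 39]
n = sum(observed_counts)  # values whose first digit is 1-9

# Mean absolute deviation between observed and expected frequencies
observed_freq = [c / n for c in observed_counts]
mad = sum(abs(o - e) for o, e in zip(observed_freq, benford)) / 9

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((c - n * p) ** 2 / (n * p) for c, p in zip(observed_counts, benford))

CRITICAL_95_DF8 = 15.507  # chi-square critical value, 8 degrees of freedom, alpha = 0.05
rejects = chi2 > CRITICAL_95_DF8
print(f"MAD={mad:.4f}, chi2={chi2:.2f}, reject Benford at 5%: {rejects}")
```

The chi-square test grows more sensitive as the number of values increases, so for filings with thousands of values a small deviation can still be flagged; the mean absolute deviation does not depend on sample size in that way.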

In [14]:
# q414numbers.join(s.apply(lambda x: x.split('-')))

In [15]:
# q414numbers= q414numbers.drop('adsh', axis=1).join(s.reset_index(drop=True, level=1).rename(['0'],['1'],['2']))


In [16]:
# q414numbers.head()

In [59]:
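import math
import numpy as np

def p_value(first_digits, N=10000):
    """Monte Carlo p-value of the observed first-digit counts under Benford's law."""
    ps = [(math.log(d+1) - math.log(d)) / math.log(10) for d in range(1, 10)]

    # observed counts of leading digits 1-9 (a leading digit of 0 is ignored)
    ks_obs = np.bincount(first_digits, minlength=10)[1:10]
    n = int(ks_obs.sum())

    def ll(ks): # log-likelihood (omitting the multinomial coefficient)
        return sum(k * math.log(p) for (k, p) in zip(ks, ps))

    ll_obs = ll(ks_obs)
    P = 0
    for i in range(N):
        ks = np.random.multinomial(n, ps)  # counts simulated under Benford's law
        if ll(ks) <= ll_obs: P += 1
    # share of simulated samples at least as unlikely as the observed one
    return P / N

In [60]:
p_value(benford_valeant)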

In [50]:
ks_obs = freq[1:]

NameError: name 'freq' is not defined