# Data Preparation

My models need prepared data to work with, so I plan on doing that preparation here within this notebook.

In [2]:
# Pandas is a huge tool for this stuff
import pandas as pd

## Visualization

My first step here is to get an idea of the shape of the data I am working with.

In [7]:
# this ran for like 4 minutes without doing anything
# frame = pd.read_csv('us-stock-fundamentals.csv', header=None, error_bad_lines=False)

# there's a possibility the files are too large to do anything worthwhile with. I'll try taking a chunk.
frame = pd.read_csv('us-stock-fundamentals.csv', on_bad_lines='skip', sep=';')

frame

Unnamed: 0,Indicator,Quarter,Latest NAICS Industry Sector Name,Latest Name,SEC ID,Value
0,Comprehensive Income Net Of Tax,2012/Q4,Administrative and Support and Waste Managemen...,"Nxt-Id, Inc.",1566826,
1,Comprehensive Income Net Of Tax,2012/Q3,Administrative and Support and Waste Managemen...,"Nxt-Id, Inc.",1566826,
2,Comprehensive Income Net Of Tax,2012/Q1,Administrative and Support and Waste Managemen...,"Nxt-Id, Inc.",1566826,
3,Comprehensive Income Net Of Tax,2013/Q3,Administrative and Support and Waste Managemen...,"Nxt-Id, Inc.",1566826,
4,Comprehensive Income Net Of Tax,2011/Q2,Administrative and Support and Waste Managemen...,"Nxt-Id, Inc.",1566826,
...,...,...,...,...,...,...
3549953,Weighted Average Number Of Diluted Shares Outs...,2011/Q1,Manufacturing,Terra Tech Corp.,1451512,
3549954,Weighted Average Number Of Diluted Shares Outs...,2014/Q1,Manufacturing,Technical Communications Corp,96699,1838907.0
3549955,Weighted Average Number Of Diluted Shares Outs...,2015/Q1,Manufacturing,Technical Communications Corp,96699,1846399.0
3549956,Weighted Average Number Of Diluted Shares Outs...,2015/Q2,Manufacturing,Technical Communications Corp,96699,1839520.0
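
Since the first attempt ran for minutes, one option is to walk the file in pieces instead of loading it whole. This is a minimal sketch, assuming the same `;` separator. `peek_in_chunks` and the sample text are made up for illustration; `chunksize` is the standard `pd.read_csv` parameter that yields DataFrames of at most that many rows.

```python
import io
import pandas as pd

def peek_in_chunks(source, chunksize, sep=';'):
    """Return (total_rows, first_chunk) without holding the whole file in memory."""
    total_rows = 0
    first_chunk = None
    for chunk in pd.read_csv(source, sep=sep, chunksize=chunksize):
        if first_chunk is None:
            first_chunk = chunk
        total_rows += len(chunk)
    return total_rows, first_chunk

# hypothetical stand-in for 'us-stock-fundamentals.csv' (values are made up)
sample = io.StringIO(
    "Indicator;Quarter;SEC ID;Value\n"
    "Assets;2012/Q4;1;10.0\n"
    "Assets;2013/Q1;1;11.0\n"
    "Assets;2013/Q2;1;\n"
)
rows, head = peek_in_chunks(sample, chunksize=2)
print(rows, list(head.columns))
```

With the real file, a path would take the place of the `StringIO` object and a much larger `chunksize` would make sense.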


## US-Stock-Fundamentals

This file actually seems to have very little useful data for us to work with, but it is interesting that ids seem to be common between this and the other CSV we were provided. The **SEC ID** seems to identify a company consistently, which will be nice when we need to corroborate or consolidate data between files. It also seems that these fundamental indicators are reported once a quarter, so four times a year.

In [8]:
frame = pd.read_csv('indicators_by_company.csv')

frame

Unnamed: 0,company_id,indicator_id,2010,2011,2012,2013,2014,2015,2016
0,1000045,AccountsPayableAndAccruedLiabilitiesCurrentAnd...,,6612429.0,7405579.0,8924919.0,7841070.0,5839000.0,
1,1000045,AccumulatedDepreciationDepletionAndAmortizatio...,,,2111343.0,2242703.0,2236449.0,2462000.0,
2,1000045,AdjustmentForAmortization,,,-11482251.0,-13490892.0,-13852305.0,-13811000.0,
3,1000045,Assets,,257236034.0,263835468.0,283429579.0,302528591.0,325309000.0,
4,1000045,AssetsHeldForSaleAtCarryingValue,,1373001.0,1203664.0,1696330.0,,,
...,...,...,...,...,...,...,...,...,...
1907878,9984,UnrecognizedTaxBenefitsInterestOnIncomeTaxesAc...,,0.0,0.0,1031000.0,1031000.0,1923000.0,
1907879,9984,UnrecognizedTaxBenefitsReductionsResultingFrom...,,177000.0,0.0,0.0,,215000.0,
1907880,9984,ValuationAllowanceDeferredTaxAssetChangeInAmount,,-13355000.0,8255000.0,-6063000.0,,,
1907881,9984,WeightedAverageNumberOfDilutedSharesOutstanding,,55931882.0,55224457.0,54973344.0,55723267.0,55513219.0,


## Indicators-By-Company

This is a nicer data set that seems to come with a more useful amount of information. The shape is interesting, however. It has only 9 columns, which works nicely with something like pandas, but that leaves us with multiple rows for a single company, one for each indicator. It's also worth mentioning that this data is yearly and seems to span 5 years at best for a lot of these companies.
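
To make the shape concrete, here is a small sketch of reshaping this kind of table so that each company-year becomes one row with a column per indicator. The toy frame and its values are made up; only the column layout mirrors the file.

```python
import pandas as pd

# toy frame shaped like indicators_by_company.csv (values are made up)
wide = pd.DataFrame({
    'company_id': [1, 1],
    'indicator_id': ['Assets', 'Liabilities'],
    '2011': [100.0, 40.0],
    '2012': [110.0, None],
})

# one row per (company, indicator, year); missing years are dropped
long = wide.melt(
    id_vars=['company_id', 'indicator_id'],
    var_name='year',
    value_name='value',
).dropna(subset=['value'])

# one row per (company, year), one column per indicator
per_year = long.set_index(['company_id', 'year', 'indicator_id'])['value'].unstack('indicator_id')
print(per_year)
```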

It might be worthwhile to look into how many of these indicators exist for different companies.

In [16]:
# luckily, single companies seem to be grouped by their ids which makes this a little easier to do
indicator_counts = dict()

# I want to see what the indicator numbers look like for each one of these companies throughout the entire thing.
unique_inds = frame['indicator_id'].unique()

for ind in range(len(unique_inds)):
    print('{:.2f}%'.format(ind / len(unique_inds) * 100), end='\r')
    indicator_counts[unique_inds[ind]] = len(frame[frame['indicator_id'] == unique_inds[ind]])


99.99%
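
The loop above filters the whole frame once per indicator, which is why it needs a progress readout. A sketch of the same count in one pass, using a made-up toy frame in place of `frame`:

```python
import pandas as pd

# toy stand-in for the indicators frame
toy = pd.DataFrame({'indicator_id': ['Assets', 'Assets', 'NetIncomeLoss', 'Assets']})

# rows per indicator, computed in a single pass over the column
toy_counts = toy['indicator_id'].value_counts().to_dict()
print(toy_counts)
```

Applied to the real data, `frame['indicator_id'].value_counts().to_dict()` should give the same mapping as `indicator_counts`.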

In [28]:
# it may be useful to have a sorted array of the indicators by the amount of data points we have for them

count = [(-n, name) for name, n in indicator_counts.items()]
count.sort()

indicators_by_name = list(indicator_counts.keys())
indicators_by_name.sort()

# the with-block closes the file even if a write fails
with open("Indicator-Names.txt", "w") as indicators:
    for ind in indicators_by_name:
        indicators.write(ind + '\n')

## Indicators

I need to find the data indicators that are of particular interest to me. These are values that can be used to calculate heuristics for fundamental analysis to plug into the system _and_ have high enough counts to present a large enough training set for the data I am using. 

Soon I may have to decide how to treat incomplete records within the data set I am building, but for now I need to find out whether they will be enough of an issue to matter later on.
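
One way to size up the incomplete-record problem is to count, per indicator, how many values are present in each year and what share of rows have every year filled in. This is a sketch on a made-up toy frame; the real year columns would run from 2010 to 2016.

```python
import pandas as pd

# toy frame with the same layout as the indicators file (values are made up)
toy = pd.DataFrame({
    'company_id': [1, 2, 1],
    'indicator_id': ['Assets', 'Assets', 'EarningsPerShareBasic'],
    '2011': [1.0, None, 0.5],
    '2012': [2.0, 3.0, None],
})
year_cols = ['2011', '2012']

# number of non-missing values per indicator and year
coverage = toy.groupby('indicator_id')[year_cols].count()

# share of rows per indicator that have every year filled in
complete = toy[year_cols].notna().all(axis=1).groupby(toy['indicator_id']).mean()
print(coverage)
print(complete)
```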

---
### Indicator-Names.txt
This is an alphabetical list of all of the indicators present within the data set I am currently looking at.

## Important Indicators

Here I will keep a list of indicators of interest to me, along with the calculations they may be pertinent to.

|Indicator|Indicator Count|
|--|--|
|EarningsPerShareBasic|5501|
|EarningsPerShareDiluted|5084|
|Price|??|

It has come to my attention that I need access to prices to do many of these calculations. I will look through the data set for values I might be able to use to calculate price, or see if it has any prices directly. If it doesn't, I will need to bring in other files.

In [54]:
print(indicator_counts['EarningsPerShareBasic'])
print(indicator_counts['EarningsPerShareDiluted'])
print(indicator_counts['SharesIssuedPricePerShare'])
print(indicator_counts['FairValueInputsLongTermRevenueGrowthRate'])
print(indicator_counts['CommonStockSharesOutstanding'])


print(count[:40])

5501
5084
182
44
7571
[(-9469, 'LiabilitiesAndStockholdersEquity'), (-9460, 'Assets'), (-8926, 'NetIncomeLoss'), (-8843, 'StockholdersEquity'), (-8735, 'CashAndCashEquivalentsAtCarryingValue'), (-8412, 'CashAndCashEquivalentsPeriodIncreaseDecrease'), (-8374, 'RetainedEarningsAccumulatedDeficit'), (-8303, 'CommonStockSharesAuthorized'), (-8192, 'NetCashProvidedByUsedInOperatingActivities'), (-8093, 'CommonStockValue'), (-8092, 'CommonStockSharesIssued'), (-8078, 'NetCashProvidedByUsedInFinancingActivities'), (-7848, 'CommonStockParOrStatedValuePerShare'), (-7571, 'CommonStockSharesOutstanding'), (-7512, 'Liabilities'), (-7435, 'LiabilitiesCurrent'), (-7432, 'AssetsCurrent'), (-7430, 'PropertyPlantAndEquipmentNet'), (-7412, 'NetCashProvidedByUsedInInvestingActivities'), (-7205, 'OperatingIncomeLoss'), (-7031, 'IncomeTaxExpenseBenefit'), (-6738, 'InterestExpense'), (-6551, 'AccumulatedDepreciationDepletionAndAmortizationPropertyPlantAndEquipment'), (-6484, 'ShareBasedCompensation'), (-603

In [71]:
frame[(frame['indicator_id'] == 'StockholdersEquity')]

Unnamed: 0,company_id,indicator_id,2010,2011,2012,2013,2014,2015,2016
116,1000045,StockholdersEquity,,,1.269651e+08,1.419376e+08,8.988794e+07,1.028490e+08,
499,1000180,StockholdersEquity,5.782624e+09,7.064358e+09,7.263901e+09,6.967872e+09,6.528059e+09,5.738924e+09,
796,1000228,StockholdersEquity,,2.432222e+09,2.613585e+09,2.785197e+09,2.813594e+09,2.884256e+09,
1133,1000229,StockholdersEquity,,1.779030e+08,1.822300e+08,1.633230e+08,8.757300e+07,-2.906400e+07,
1357,1000230,StockholdersEquity,,2.820917e+07,3.064436e+07,3.019867e+07,3.100657e+07,2.663061e+07,
...,...,...,...,...,...,...,...,...,...
1906624,99302,StockholdersEquity,,2.737800e+07,3.165000e+07,3.008300e+07,3.431800e+07,3.891100e+07,
1906835,99359,StockholdersEquity,,3.815200e+07,4.307200e+07,5.048400e+07,6.711500e+07,,
1907168,99771,StockholdersEquity,,1.179690e+08,,8.871000e+07,8.100300e+07,,
1907596,99780,StockholdersEquity,,1.863800e+09,2.053000e+09,2.402100e+09,2.995900e+09,3.653900e+09,


In [72]:
frame[(frame['indicator_id'] == 'WeightedAverageNumberOfSharesOutstandingBasic')]

Unnamed: 0,company_id,indicator_id,2010,2011,2012,2013,2014,2015,2016
123,1000045,WeightedAverageNumberOfSharesOutstandingBasic,,,11977174.0,12096000.0,12012765.0,7622000.0,
521,1000180,WeightedAverageNumberOfSharesOutstandingBasic,232531000.0,239484000.0,242076000.0,234886000.0,222714000.0,205443000.0,
829,1000228,WeightedAverageNumberOfSharesOutstandingBasic,,90120000.0,87499000.0,85926000.0,84265000.0,82844000.0,
1152,1000229,WeightedAverageNumberOfSharesOutstandingBasic,,46286000.0,47211000.0,45692000.0,44362000.0,42747000.0,
1370,1000230,WeightedAverageNumberOfSharesOutstandingBasic,,,6455817.0,,,,
...,...,...,...,...,...,...,...,...,...
1906635,99302,WeightedAverageNumberOfSharesOutstandingBasic,,7309000.0,7404000.0,7080000.0,6798000.0,6887000.0,
1906845,99359,WeightedAverageNumberOfSharesOutstandingBasic,,9473000.0,9511000.0,9643000.0,9778000.0,,
1907200,99771,WeightedAverageNumberOfSharesOutstandingBasic,,,,6449726.0,6452557.0,,
1907617,99780,WeightedAverageNumberOfSharesOutstandingBasic,,77500000.0,77300000.0,76400000.0,151000000.0,150200000.0,


## New Data Set Needed - Prices

My worst fears have been realized. I don't seem to have any consistent, reliable access to historical price data for these companies. I will have to use another data set. The problem here is, many free data sets that include prices for many stocks are full of daily stock information, and I need about five days worth of information, which means I may have to use a data set to build my own.

## TODO - Fundamental Analysis API?

There's a chance I have been going about this all wrong. I have been looking for free online data sets to use, but the other good option for data collection is API usage. Many of the data sets I found that weren't behind a paywall were difficult to process, and the leftover ones were sparse and not much to my liking. I'm going to take a look into APIs.

In [199]:
# I actually didn't have the original modules needed to do this installed and I 
# am not good at managing the different python environments on my pc so this will have
# to do for now
import sys
!{sys.executable} -m pip install fundamentalanalysis
!{sys.executable} -m pip install requests

import fundamentalanalysis as fa

# let's play around a little bit with the fundamental analysis api
# on my repo this file won't exist
with open('apikey.txt', 'r') as apikey:
    key = apikey.readline().strip()  # strip the trailing newline so it doesn't end up in requests




## API Information

I have a free plan, currently. I may buy one month's worth of data if I really find it worthwhile, but as of now I am rate limited to something like 300 requests per day. (This is like... obscenely low unless I can get a huge batch of info from a single request. We'll see how this turns out.)

[Link To API documentation](https://pypi.org/project/fundamentalanalysis/)

I ended up dropping about $20 on the lowest tier for a month of data collection. It's frustrating, but the data set is thorough and it will give me good information to work with. Without it, I was limited to using the same endpoint a maximum of 5 times.
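
With request limits like these, it can help to keep a copy of every response on disk so that re-running a cell does not spend another request. This is a minimal sketch of that idea. `cached_fetch` and `fake_fetch` are hypothetical helpers, the returned values are made up, and a real version would wrap an actual API call and handle whatever the library returns (for example a DataFrame rather than a dict).

```python
import json
import os
import tempfile

def cached_fetch(name, fetch, cache_dir):
    """Return fetch() for `name`, reusing a JSON copy on disk if one exists."""
    path = os.path.join(cache_dir, name + '.json')
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = fetch()
    with open(path, 'w') as f:
        json.dump(result, f)
    return result

# stand-in for a real API call; counts how often it is actually hit
calls = {'n': 0}

def fake_fetch():
    calls['n'] += 1
    return {'ticker': 'INTC', 'peRatio': 10.0}  # made-up values

cache_dir = tempfile.mkdtemp()
first = cached_fetch('INTC_key_metrics', fake_fetch, cache_dir)
second = cached_fetch('INTC_key_metrics', fake_fetch, cache_dir)
print(calls['n'])
```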

In [84]:
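# Now I have to build my data set, so I need an example of what metrics they have in their key metrics endpoint
# (the output below shows annual "FY" periods, so 'quarterly' does not seem to select quarterly data here;
# later cells pass period='quarter' instead)
intc_vals = fa.key_metrics('INTC', key, 'quarterly')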
intc_vals

Unnamed: 0,2021,2020,2019,2018,2017,2016,2015,2014,2013,2012,...,2003,2002,2001,2000,1999,1998,1997,1996,1995,1994
period,FY,FY,FY,FY,FY,FY,FY,FY,FY,FY,...,FY,FY,FY,FY,FY,FY,FY,FY,FY,FY
revenuePerShare,19.468835,18.544177,16.292733,15.364997,13.350564,12.555391,11.673345,11.399714,10.605231,10.676741,...,4.617895,4.024057,3.951608,5.026979,4.420728,3.9378,3.833333,2.934544,2.291007,1.649094
netIncomePerShare,4.894802,4.977137,4.765225,4.565821,2.042331,2.180973,2.408267,2.388084,1.935614,2.202762,...,0.864256,0.468651,0.192228,1.570279,1.100181,0.909472,1.061927,0.725929,0.504242,0.3275
operatingCashFlowPerShare,7.388766,8.426768,7.503962,6.382997,4.703255,4.610571,4.010333,4.166089,4.180282,3.779824,...,1.76421,1.372576,1.288565,1.911909,1.705024,1.377548,1.530275,1.230715,0.569287,0.426695
freeCashFlowPerShare,2.380389,4.984758,3.833371,3.090653,2.19783,2.575687,2.44011,2.085493,2.017907,1.409528,...,1.204075,0.665464,0.200268,0.917126,1.193141,0.708633,0.842049,0.805039,0.067308,0.077295
cashPerShare,7.0,5.690641,2.971021,2.526567,2.978515,3.615011,5.338043,2.867578,4.04165,3.635308,...,2.476482,1.892497,1.719774,2.060367,1.773165,1.142986,1.51789,1.125282,0.347568,0.344963
bookValuePerShare,23.501109,19.299357,17.546751,16.170679,14.68177,14.001268,12.881695,11.398694,11.721529,10.248799,...,5.798376,5.332732,5.335021,5.562975,4.893953,3.503747,2.950306,2.375,1.716629,1.326461
tangibleBookValuePerShare,15.067258,10.726602,9.181798,8.378443,6.966816,9.199789,9.85175,8.275454,8.57002,7.057246,...,5.230734,4.681702,4.57162,4.677448,4.151775,3.48711,2.950306,2.375,1.716629,1.326461
shareholdersEquityPerShare,23.501109,19.299357,17.546751,16.170679,14.68177,14.001268,12.881695,11.398694,11.721529,10.248799,...,5.798376,5.332732,5.335021,5.562975,4.893953,3.503747,2.950306,2.375,1.716629,1.326461
interestDebtPerShare,9.533875,8.818766,6.676477,5.818044,5.841098,5.500211,4.85175,2.826974,2.754527,2.709768,...,0.177723,0.205232,0.217242,0.161723,0.143652,0.105216,0.068502,0.102477,0.056561,0.130113


In [115]:
# Getting prices
curVal = fa.stock_data('INTC', start='1999-9-01', end='2022-08-01', interval='1mo')
print(curVal['close'])

1999-09-01    37.156250
1999-10-01    38.718750
1999-11-01    38.343750
1999-12-01    41.156250
2000-01-01    49.468750
                ...    
2022-04-01    43.590000
2022-05-01    44.419998
2022-06-01    37.410000
2022-07-01    36.310001
2022-08-01    31.920000
Name: close, Length: 276, dtype: float64


In [116]:
# Getting key ratios
keyRat = fa.key_metrics('INTC', key, period='quarter', limit=(23 * 4))
keyRat.transpose()


Unnamed: 0,period,revenuePerShare,netIncomePerShare,operatingCashFlowPerShare,freeCashFlowPerShare,cashPerShare,bookValuePerShare,tangibleBookValuePerShare,shareholdersEquityPerShare,interestDebtPerShare,...,averagePayables,averageInventory,daysSalesOutstanding,daysPayablesOutstanding,daysOfInventoryOnHand,receivablesTurnover,payablesTurnover,inventoryTurnover,roe,capexPerShare
2022-07,Q2,3.736829,-0.110732,0.197317,-1.572195,6.596098,24.687317,16.39122,24.687317,8.668049,...,7577500000,12054500000,35.615821,73.45901,112.560099,2.526967,1.225173,0.799573,-0.004485,-1.769512
2022-04,Q1,4.510445,1.993856,1.447776,0.268862,9.509953,25.346768,17.034161,25.346768,9.39887,...,7210000000,11935000000,34.689697,71.237238,117.921836,2.59443,1.263384,0.763217,0.078663,-1.178914
2021-12,Q4,5.044974,1.136151,1.424674,-0.450971,6.982797,23.443352,15.030229,23.443352,9.396658,...,6478500000,11355500000,41.461906,54.33659,101.884652,2.170667,1.656342,0.883352,0.048464,-1.875645
2021-09,Q3,4.72593,1.680128,2.437823,1.34425,8.528688,22.183452,13.695395,22.183452,9.960108,...,7001000000,10866500000,39.391413,72.375089,104.40682,2.284762,1.243522,0.862013,0.075738,-1.093573
2021-06,Q2,4.848358,1.249938,2.16004,1.204742,6.139047,21.043961,12.452704,21.043961,8.776982,...,6563500000,10376000000,34.201009,63.208309,94.187537,2.631501,1.423863,0.95554,0.059397,-0.955298
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2000-09,Q3,1.299449,0.373419,0.448281,0.16178,2.084685,5.611698,4.694449,5.611698,0.140795,...,4785000000,6935500000,47.149238,67.47141,55.349428,1.908833,1.333898,1.626033,0.066543,-0.286501
2000-07,Q2,1.23696,0.467511,0.470343,0.283756,2.033383,5.457526,4.527571,5.457526,0.129657,...,4589000000,6771000000,46.984337,53.950655,44.054219,1.915532,1.668191,2.042937,0.085664,-0.186587
2000-04,Q1,1.196199,0.403472,0.491021,0.33029,1.678539,5.403023,4.508381,5.403023,0.129901,...,4485000000,6736000000,41.413737,52.994312,46.279692,2.173192,1.698295,1.944697,0.074675,-0.16073
1999-12,Q4,1.17853,0.302526,0.491533,0.5,1.691734,4.669202,3.961108,4.669202,0.137055,...,4290000000,6706500000,40.550414,38.822418,41.882872,2.219459,2.318248,2.14885,0.064792,0


In [9]:
import fundamentalanalysis as fa
from datetime import date
import numpy as np
import pandas as pd

with open('apikey.txt', 'r') as keyfile:
    key = keyfile.readline().strip()  # strip the trailing newline

data = fa.stock_data('SPY',  start='1999-01-01', end='2022-01-01', interval='1mo')
data.index.array


data['dates'] = pd.to_datetime(data.index)
data['months'] = data['dates'].dt.month
data = data[data.months.isin([1, 4, 7, 10])]
data[['close', 'months']]

def calc_quarter(date):
    # months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4
    return '{:d}Q{:d}'.format(date.year, (date.month - 1) // 3 + 1)


def get_quarter_closes(ticker, start, end):
    allData = fa.stock_data(ticker, start=start, end=end, interval='1mo')
    allData['dates'] = pd.to_datetime(allData.index)
    allData['months'] = allData['dates'].dt.month
    allData['ticker'] = ticker
    allData['quarter'] = allData['dates'].map(calc_quarter)
    # keep the first month of each quarter; copy so the new column is not set on a view
    allData = allData[allData.months.isin([1, 4, 7, 10])].copy()
    # 'price' is the average of that month's open, high, low and close
    allData['price'] = (allData['high'] + allData['low'] + allData['open'] + allData['close']) / 4
    allData.drop(['high', 'volume', 'low', 'open', 'close', 'adjclose', 'dates', 'months'], axis=1, inplace=True)
    # rows stay indexed by date, in the order the API returned them;
    # later cells read the 'quarter' column rather than the index
    return allData

spy_closes = get_quarter_closes('SPY', '1999-01-01', '2022-01-01')
closes = get_quarter_closes('INTC', '1999-01-01', '2022-01-01')
closes


Unnamed: 0,ticker,quarter,price
1999-01-01,INTC,1999Q1,32.765625
1999-04-01,INTC,1999Q2,30.363281
1999-07-01,INTC,1999Q3,32.468750
1999-10-01,INTC,1999Q4,37.062500
2000-01-01,INTC,2000Q1,45.775391
...,...,...,...
2021-01-01,INTC,2021Q1,54.670000
2021-04-01,INTC,2021Q2,61.952499
2021-07-01,INTC,2021Q3,54.905001
2021-10-01,INTC,2021Q4,51.647500


In [10]:
fa.stock_data('INTC', start='1999-01-01', end='2022-01-01', interval='1mo')[-20:]

Unnamed: 0,volume,high,low,open,close,adjclose
2020-06-01,570572500,65.110001,56.759998,62.490002,59.830002,56.057896
2020-07-01,808654500,61.93,46.970001,59.91,47.73,44.720764
2020-08-01,667321500,51.5,47.700001,48.27,50.950001,47.737762
2020-09-01,680429800,52.68,48.419998,50.91,51.779999,48.844917
2020-10-01,738437500,56.23,43.610001,52.400002,44.279999,41.77005
2020-11-01,676942800,48.5,44.240002,44.959999,48.349998,45.609344
2020-12-01,883810300,52.650002,45.240002,48.75,49.82,47.337845
2021-01-01,962555600,63.950001,49.330002,49.889999,55.509998,52.744358
2021-02-01,511339300,63.540001,55.709999,55.950001,60.779999,57.751789
2021-03-01,770803800,67.440002,57.91,61.720001,64.0,61.180485


Excellent, the dates in these values come as datetime objects. This works well!
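
Since the dates are real datetimes, pandas can also produce the quarter labels directly. A small sketch, using made-up dates, of getting the same `1999Q1`-style strings from `to_period('Q')` (calendar quarters ending in December):

```python
import pandas as pd

dates = pd.to_datetime(['1999-01-01', '1999-03-01', '1999-12-01'])

# each timestamp becomes a quarterly Period; str() gives the 'YYYYQn' form
quarters = [str(p) for p in dates.to_period('Q')]
print(quarters)
```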

In [11]:
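# now I need to turn these quarterly closes into quarterly increases and decreases
def proc_losses_and_gains(quarter_closes):
    closes = np.array(quarter_closes['price'])
    # offset[i] is the next quarter's price (the last entry wraps around and is discarded below)
    offset = np.roll(closes, -1)

    # fractional change from each quarter to the next
    close_vals = ((offset - closes) / closes)[:-1]
    # the last quarter has no following quarter, so drop it (this modifies the caller's frame)
    quarter_closes.drop(quarter_closes.tail(1).index, inplace=True)

    quarter_closes['change'] = close_vals

    return close_vals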

print(proc_losses_and_gains(spy_closes))
spy_closes

[ 0.05732325  0.03027078 -0.03397223  0.08554117  0.01885192  0.00018799
 -0.0323005  -0.04946921 -0.10415527  0.01186577 -0.12551007  0.07154976
 -0.02388493 -0.17124058 -0.07751076  0.04011082  0.00552537  0.1129991
  0.04238296  0.09895316 -0.0046635  -0.00586225  0.00399822  0.06309375
 -0.02230826  0.04045462 -0.00571912  0.04802304  0.02941643 -0.02993751
  0.07348606  0.05046231  0.019899    0.02652432  0.02724566 -0.09217027
 -0.02172782 -0.07757117 -0.17941217 -0.15665543 -0.04160798  0.13405341
  0.10815386  0.05553171  0.07624627 -0.10295695  0.09272796  0.09652244
  0.04786486 -0.02020824 -0.09569788  0.09281023  0.07625598 -0.02211358
  0.0499698   0.02975129  0.06614957  0.04846397  0.03926973  0.0554106
  0.03171357  0.0460278   0.00099833  0.03728424  0.02536793  0.00442392
 -0.04422395 -0.02684797  0.06119954  0.03214539  0.00596808  0.05825196
  0.04387745  0.03390351  0.04072639  0.08444517 -0.04596973  0.04883052
  0.01027661 -0.07685069  0.12379984  0.02859436 -0.0

Unnamed: 0,ticker,quarter,price,change
1999-01-01,SPY,1999Q1,124.976562,0.057323
1999-04-01,SPY,1999Q2,132.140625,0.030271
1999-07-01,SPY,1999Q3,136.140625,-0.033972
1999-10-01,SPY,1999Q4,131.515625,0.085541
2000-01-01,SPY,2000Q1,142.765625,0.018852
...,...,...,...,...
2020-10-01,SPY,2020Q4,335.212502,0.115747
2021-01-01,SPY,2021Q1,374.012505,0.092611
2021-04-01,SPY,2021Q2,408.649994,0.059066
2021-07-01,SPY,2021Q3,432.787498,0.025994
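
The same forward-looking change can be written with a shifted Series, which keeps the quarter index attached to each value. This is a sketch on made-up prices, not a replacement for the function above:

```python
import pandas as pd

# made-up quarterly prices
prices = pd.Series([100.0, 110.0, 99.0], index=['1999Q1', '1999Q2', '1999Q3'])

# change for a quarter = (next quarter's price - this quarter's price) / this quarter's price
forward_change = ((prices.shift(-1) - prices) / prices).dropna()
print(forward_change)
```

The last quarter drops out because it has no following price, which matches the row that `proc_losses_and_gains` removes.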


In [12]:
# we need to be able to figure out where our range starts for our labels of interest
def find_offset(myQuar, start='1998Q4'):
    sYear = int(start[:4])
    sQ = int(start[-1])
    
    year = int(myQuar[:4])
    q = int(myQuar[-1])

    dQ = q - sQ # this is the difference in quarters
    dYear = year - sYear

    return dYear * 4 + dQ

find_offset('2021Q3')


91
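
The arithmetic behind that 91 can be checked by turning each label into a running quarter count and subtracting. `quarter_index` is a hypothetical helper used only for this check:

```python
def quarter_index(label):
    """Map 'YYYYQn' to a running count of quarters."""
    year, q = label.split('Q')
    return int(year) * 4 + (int(q) - 1)

# 2021Q3 -> 2021 * 4 + 2 = 8086, 1998Q4 -> 1998 * 4 + 3 = 7995
offset = quarter_index('2021Q3') - quarter_index('1998Q4')
print(offset)
```

Note that with the default start of `1998Q4`, the first quarter of price data (`1999Q1`) has offset 1, not 0.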

In [13]:
# Now I am able to get information that I can use to make labels for these stocks. 
# I have to make the labels by creating this indicator function

def indicator(diff):
    if diff > 0: #S&P better
        return 0
    else: # this better
        return 1

# I want this to return a dataframe shaped like |ticker|quarter|indicator|
spy_closes = get_quarter_closes('SPY', '1999-01-01', '2022-01-01')
spy_changes = proc_losses_and_gains(spy_closes)
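def create_label(ticker, start='1999-01-01', end='2022-01-01', spy=spy_changes):
    # first we need to get the closes
    tick_closes = get_quarter_closes(ticker, start, end)
    tick_changes = proc_losses_and_gains(tick_closes)

    # then we need to find where our closes start
    start_q = tick_closes['quarter'].iloc[0]

    # we need the differences between the S&P changes and the ticker's changes
    # NOTE: spy[find_offset(start_q)] is a single element, not a slice, so every
    # quarter is compared against that one S&P change (spy[1] when the ticker
    # starts in 1999Q1) rather than against the S&P change for the same quarter
    tick_closes['diff'] = spy[find_offset(start_q)] - tick_changes
    tick_closes['label'] = tick_closes['diff'].map(indicator)

    return tick_closes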

create_label('AAPL')    

Unnamed: 0,ticker,quarter,price,change,diff,label
1999-01-01,AAPL,1999Q1,0.374302,-0.029817,0.060088,0
1999-04-01,AAPL,1999Q2,0.363142,0.252785,-0.222514,1
1999-07-01,AAPL,1999Q3,0.454939,0.387612,-0.357341,1
1999-10-01,AAPL,1999Q4,0.631278,0.473149,-0.442878,1
2000-01-01,AAPL,2000Q1,0.929966,0.209571,-0.179301,1
...,...,...,...,...,...,...
2020-10-01,AAPL,2020Q4,114.902500,0.168273,-0.138002,1
2021-01-01,AAPL,2021Q1,134.237501,-0.041475,0.071746,0
2021-04-01,AAPL,2021Q2,128.670004,0.104026,-0.073755,1
2021-07-01,AAPL,2021Q3,142.055000,0.026257,0.004013,0
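
In the output above, every `diff` equals 0.030271 minus that row's `change`, and 0.030271 is the S&P change for 1999Q2. If the intent is to compare each quarter against the S&P change for that same quarter, one option is to join the two frames on `quarter` instead of indexing into the array. This is a sketch on made-up numbers; the frames stand in for the `spy_closes` and `tick_closes` frames after their `change` column has been added.

```python
import pandas as pd

# made-up changes; spy covers more quarters than the ticker does
spy = pd.DataFrame({'quarter': ['1999Q1', '1999Q2', '1999Q3'],
                    'change': [0.05, 0.03, -0.03]})
tick = pd.DataFrame({'quarter': ['1999Q2', '1999Q3'],
                     'change': [0.25, -0.10]})

# pair each ticker quarter with the S&P row for the same quarter
aligned = tick.merge(spy, on='quarter', suffixes=('', '_spy'))
aligned['diff'] = aligned['change_spy'] - aligned['change']

# same rule as indicator(): 0 when the S&P did better, 1 otherwise
aligned['label'] = (aligned['diff'] <= 0).astype(int)
print(aligned)
```

Joining on the quarter string also removes the need for `find_offset` when a ticker's history starts later than the S&P data.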


# Creating actual data points

So far we have been able to create labels; now we need a way to create data points for specific tickers. A large portion of the difficult part should theoretically be handled already. We need to fetch the values in our feature list for a stock and attach them to the rest of the information that makes up a data point.

In [14]:
features = ['peRatio', 'netIncomePerShare', 'freeCashFlowPerShare', 'pbRatio', 'roe', 'payoutRatio', 'dividendYield', 'debtToEquity', 'priceToSalesRatio', 'quarter', 'period']

# this is the quarter for which the stats were collected
# meaning it should be associated with the labels from the
# next quarter
def calc_metrics_quarter(date):
    # report months 2-4 -> Q1, 5-7 -> Q2, 8-10 -> Q3, 11-1 -> Q4
    quarter = ((date.month % 12) // 3 - 1) % 4 + 1
    yearMod = 0
    if quarter == 4 and date.month < 6:
        # a January report belongs to Q4 of the previous year
        yearMod = -1
    return str(date.year + yearMod) + 'Q' + str(quarter)


# this will get our metrics which will be used as part of our
# feature set for our stocks
def get_metrics(ticker, limit=95, feature_set_values=features):
    metrics = fa.key_metrics(ticker, key, period='quarter', limit=limit).transpose()
    metrics['date'] = pd.to_datetime(metrics.index)
    metrics['quarter'] = metrics['date'].map(calc_metrics_quarter)
    metrics.drop(metrics.columns.difference(feature_set_values), axis=1, inplace=True)

    # reverse so the oldest quarter comes first
    return metrics[::-1]


metrics = get_metrics('AAPL')

metrics

Unnamed: 0,period,netIncomePerShare,freeCashFlowPerShare,peRatio,priceToSalesRatio,pbRatio,debtToEquity,dividendYield,payoutRatio,roe,quarter
1998-12,Q1,0.010033,0.014389,9.094006,3.233424,2.875276,0.4961,,0.0,0.079043,1998Q4
1999-03,Q2,0.008839,0.016368,8.396918,2.963618,2.084752,0.43908,,0.0,0.062069,1999Q1
1999-06,Q3,0.012582,0.00502,7.484331,3.900691,2.046221,0.10101,,0.0,0.06835,1999Q2
1999-09,Q4,0.006154,0.0112,23.552856,7.827446,3.36903,0.096649,,0.0,0.03576,1999Q3
2000-01,Q1,0.010146,0.018574,22.618616,7.066507,3.75098,0.067966,,0.0,0.041459,1999Q4
...,...,...,...,...,...,...,...,...,...,...,...
2021-06,Q3,1.307566,1.142617,25.449958,27.181958,34.435837,1.645177,0.001702,0.173243,0.33827,2021Q2
2021-09,Q4,1.246488,1.029713,29.466787,29.058155,38.394164,1.729371,0.001503,0.17712,0.325741,2021Q3
2021-12,Q1,2.112651,2.694225,21.339298,23.848639,41.093249,1.482358,0.001263,0.107768,0.481427,2021Q4
2022-03,Q2,1.52577,1.564936,28.628169,29.441005,42.492649,1.533005,0.001255,0.143743,0.371074,2022Q1


In [15]:
labels = create_label('AAPL')
labels

Unnamed: 0,ticker,quarter,price,change,diff,label
1999-01-01,AAPL,1999Q1,0.374302,-0.029817,0.060088,0
1999-04-01,AAPL,1999Q2,0.363142,0.252785,-0.222514,1
1999-07-01,AAPL,1999Q3,0.454939,0.387612,-0.357341,1
1999-10-01,AAPL,1999Q4,0.631278,0.473149,-0.442878,1
2000-01-01,AAPL,2000Q1,0.929966,0.209571,-0.179301,1
...,...,...,...,...,...,...
2020-10-01,AAPL,2020Q4,114.902500,0.168273,-0.138002,1
2021-01-01,AAPL,2021Q1,134.237501,-0.041475,0.071746,0
2021-04-01,AAPL,2021Q2,128.670004,0.104026,-0.073755,1
2021-07-01,AAPL,2021Q3,142.055000,0.026257,0.004013,0


In [16]:
metrics = get_metrics('')
metrics

ValueError: This endpoint is only for premium members. Please visit the subscription page to upgrade the plan (Starter or higher) at https://financialmodelingprep.com/developer/docs/pricing

In [17]:
# Now we need to come up with a clean way to combine our labels with our features
# It's important to properly pair quarterly values with labels from the following quarter
# that way we ensure we are using info to make a buy decision that we would have in real life
# (i.e., we have the fundamental stats before buying)

def build_data_points(labels, metrics):

    # we need to make sure we have the right number of metrics
    # hopefully we can assume the price data starts at the same time the metric data does
    num_labels = len(labels)
    num_metrics = len(metrics)

    if num_labels != num_metrics:
        # when metrics has more rows than labels, this trims the extra rows off its end
        metrics = metrics.drop(metrics.index[num_labels - num_metrics:])

    # .iloc makes the first/last lookups positional regardless of the index type
    assert find_offset(labels['quarter'].iloc[0], metrics['quarter'].iloc[0]) == 1, 'Labels and features have incorrect start offset'
    assert find_offset(labels['quarter'].iloc[-1], metrics['quarter'].iloc[-1]) == 1, 'Labels and features have incorrect end offset'

    # now that we have proper offsets, we need to merge these into a table of data points
    # 'quarter' is the only column the two frames share, so it is the join key
    points = pd.merge(metrics, labels, how='left', on='quarter')
    points = points.drop(['diff', 'price', 'change', 'quarter', 'period'], axis=1)
    points = points.dropna()

    return points

labels = create_label('AAPL')
metrics = get_metrics('AAPL')
points = build_data_points(labels, metrics)
points



Unnamed: 0,netIncomePerShare,freeCashFlowPerShare,peRatio,priceToSalesRatio,pbRatio,debtToEquity,dividendYield,payoutRatio,roe,ticker,label
56,0.497458,0.79769,9.146318,8.77719,3.757183,0.0,0.00521,0.190625,0.102697,AAPL,0.0
57,0.362871,0.387881,10.55348,9.242857,2.974509,0.0,0.006181,0.26092,0.070463,AAPL,0.0
58,0.26826,0.225961,13.197812,10.31225,2.952961,0.137474,0.007717,0.407391,0.055937,AAPL,0.0
59,0.296723,0.300278,14.526248,11.648289,3.532887,0.137273,0.006344,0.36861,0.060802,AAPL,1.0
60,0.521004,0.82208,9.598397,8.714119,3.87003,0.130787,0.005517,0.211827,0.100799,AAPL,1.0
61,0.417381,0.492055,11.484447,10.28835,3.907688,0.141139,0.005666,0.260295,0.085065,AAPL,1.0
62,0.322155,0.325315,17.844676,14.774583,4.572864,0.240036,0.005184,0.370031,0.064065,AAPL,1.0
63,0.356725,0.395949,17.65191,14.1926,5.359489,0.259864,0.004732,0.334121,0.075905,AAPL,1.0
64,0.771168,1.303122,9.238417,8.928443,5.400663,0.263557,0.004205,0.155404,0.146147,AAPL,1.0
65,0.585497,0.716499,13.156565,12.309701,5.535291,0.310621,0.003841,0.202152,0.105181,AAPL,1.0


In [18]:
# now we need to make a method to do all of this for a single stock

def generate_data(ticker):
    return build_data_points(create_label(ticker), get_metrics(ticker))

generate_data('CLB')

Unnamed: 0,netIncomePerShare,freeCashFlowPerShare,peRatio,priceToSalesRatio,pbRatio,debtToEquity,dividendYield,payoutRatio,roe,ticker,label
40,0.756989,0.766108,9.884553,6.917222,8.799991,1.509032,0.001655,0.065426,0.222569,CLB,0.0
41,0.635438,1.031062,14.391659,9.394694,8.129047,0.958302,0.00137,0.078857,0.141211,CLB,1.0
42,0.649249,0.728142,16.779014,11.960447,8.519775,0.859031,0.00127,0.085248,0.126941,CLB,1.0
43,0.674561,1.174039,19.103154,14.11112,9.554969,0.828748,0.008244,0.629921,0.125044,CLB,1.0
44,0.514106,0.644673,28.719761,14.940051,9.711557,0.748518,0.000852,0.097896,0.084537,CLB,1.0
45,0.68774,1.198783,23.77351,16.261434,13.655016,0.0,0.00088,0.083727,0.143595,CLB,1.0
46,0.76594,0.949475,24.089672,16.568462,14.207233,0.094289,0.000868,0.083626,0.147441,CLB,1.0
47,0.86284,1.134791,25.508792,19.771875,16.063353,0.06566,0.008054,0.821762,0.15743,CLB,1.0
48,-1.808943,0.584187,-12.306911,21.043517,15.134357,0.030619,0.000615,-0.030292,-0.307436,CLB,1.0
49,1.02377,1.185473,24.94945,22.354321,16.486382,0.016138,0.002498,0.24933,0.165198,CLB,1.0


In [19]:
# Now for a cheap trick. We'll build sets and append them to a csv and ignore all ones that give us errors.
def build_dataset(filename, error_file, start_at=0):
    companies = fa.available_companies(key)[start_at + 1:]
    companies.reset_index(inplace=True)
    tickers = companies['symbol'].array
    wrote_header = False

    for counter, ticker in enumerate(tickers, start=1):
        data = None
        try:
            data = generate_data(ticker)
        except Exception:
            # log the ticker that failed and move on
            with open(error_file, 'a') as errFile:
                errFile.write(ticker + '\n')

        if data is not None:
            # write the header only once, with the first ticker that succeeds
            data.to_csv(filename, mode='a', index=False, header=not wrote_header)
            wrote_header = True

        print('Latest Ticker: {:s}. {:4f}% completed'.format(ticker, 100 * counter / len(tickers)), end='\r')


build_dataset('dataset2.csv', 'error2.log')

Latest Ticker: PRME. 100.000000% completed