# Working with stock data, I'm not that surprised my original data isn't "normal".
In this segment, I work with the small group of stocks I identified previously. My goals are to impute missing data from SEC filings and to compute percentage changes between one filing and the next. Perhaps the percentage changes will be closer to normally distributed than the raw figures. If not, I will need to rely entirely on nonparametric machine learning methods in future steps.
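
As a rough sketch of the second goal, percentage changes between consecutive filings can be computed within each ticker with a grouped `pct_change`. The tickers `AAA` and `BBB` and all numbers below are made up for illustration; only the column names `ticker`, `date`, and `revenues` come from the real data.

```python
import pandas as pd

# Toy filings for two hypothetical tickers (made-up numbers, not the real data)
toy = pd.DataFrame({
    'ticker': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'date': pd.to_datetime(['2016-01-01', '2016-04-01', '2016-07-01'] * 2),
    'revenues': [100.0, 110.0, 99.0, 200.0, 250.0, 300.0],
})

# Sort so that "previous row" really means "previous filing" for each ticker
toy = toy.sort_values(['ticker', 'date']).reset_index(drop=True)

# Percentage change from one filing to the next, never crossing tickers
toy['revenues_pct'] = toy.groupby('ticker')['revenues'].pct_change()
print(toy)
```

The first filing of each ticker has no predecessor, so its percentage change is NaN and would need to be dropped or handled before checking the distribution.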

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('hf-3-day-prices.csv', parse_dates=['date', 'date_minus1', 'date_plus1'])

### Preliminary work from EDA:
1. Collect "tech stock" tickers with the most complete "revenue" column
2. Drop any completely null columns
3. Drop the "split coefficient" columns, since they are always 1 in this group of data

In [3]:
# most complete revenues data
tech_revenues = ['CSCO', 'FB', 'GOOGL', 'HPQ', 'IBM', 'ORCL']
df_tech_rev = df[df['ticker'].isin(tech_revenues)]
# drop completely null columns
df_tech_rev = df_tech_rev.dropna(axis='columns', how='all')
# drop split coefficients
dropcols = ['split_coefficient', 'split_coef_minus1', 'split_coef_plus1']
df_tech_rev = df_tech_rev.drop(dropcols, axis=1)
df_tech_rev.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 713 to 1983
Data columns (total 47 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   ticker                            199 non-null    object        
 1   date                              199 non-null    datetime64[ns]
 2   commonstocksharesissued           160 non-null    float64       
 3   assetscurrent                     199 non-null    float64       
 4   accountspayablecurrent            199 non-null    float64       
 5   commonstockvalue                  19 non-null     float64       
 6   liabilities                       122 non-null    float64       
 7   liabilitiesandstockholdersequity  199 non-null    float64       
 8   stockholdersequity                199 non-null    float64       
 9   earningspersharebasic             199 non-null    float64       
 10  netincomeloss                     199 non-null 

#### Imputing considerations:
1. There are a bunch of columns with really sparse data. I wouldn't be surprised if only one ticker reports some of these figures.
2. I need to do any imputing within tickers, and be sure not to mix data from one ticker with the next.
3. The most logical approach, although complicated, seems to be filling each gap along the slope between one known point and the next (that is, linear interpolation).

In [5]:
sparsecols = ['commonstockvalue', 'cash', 'preferredstockvalue', 'operatingexpenses', 'land', 'sharesissued',
              'commercialpaper', 'salariesandwages']
for col in sparsecols:
    print(col + ': ' + df_tech_rev[df_tech_rev[col].notnull()].ticker.unique())

['commonstockvalue: GOOGL' 'commonstockvalue: ORCL']
['cash: FB']
['preferredstockvalue: FB']
['operatingexpenses: CSCO' 'operatingexpenses: HPQ']
['land: ORCL']
['sharesissued: CSCO' 'sharesissued: GOOGL']
['commercialpaper: CSCO' 'commercialpaper: GOOGL' 'commercialpaper: IBM'
 'commercialpaper: ORCL']
['salariesandwages: CSCO']
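
The output above is a little hard to read because a string is being added to a whole array of tickers. A tidier alternative, sketched here on made-up data (the tickers `AAA`, `BBB`, `CCC` and the values are hypothetical), is to count non-null values per ticker and keep the tickers with at least one report.

```python
import numpy as np
import pandas as pd

# Toy frame: each sparse column is reported by only one hypothetical ticker
toy = pd.DataFrame({
    'ticker': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC'],
    'cash': [1.0, np.nan, np.nan, np.nan, np.nan],
    'land': [np.nan, np.nan, 5.0, 6.0, np.nan],
})
sparse = ['cash', 'land']

# Number of non-null values per (ticker, column)
counts = toy.groupby('ticker')[sparse].count()

# For each column, list the tickers that reported it at least once
reporters = {col: counts.index[counts[col] > 0].tolist() for col in sparse}
print(reporters)
```

The `counts` table also shows how many filings each ticker reported per column, which helps judge whether there is enough data to impute from.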


#### So, I can probably impute data for tickers that have made some reports, but how will I adjust for companies that don't report?

In [7]:
# credit: https://stackoverflow.com/questions/19124601/pretty-print-an-entire-pandas-series-dataframe
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_tech_rev[df_tech_rev['ticker'] == 'GOOGL'])

Unnamed: 0,ticker,date,commonstocksharesissued,assetscurrent,accountspayablecurrent,commonstockvalue,liabilities,liabilitiesandstockholdersequity,stockholdersequity,earningspersharebasic,netincomeloss,profitloss,costofgoodssold,costsandexpenses,cash,preferredstockvalue,depreciation,operatingexpenses,revenues,land,deferredrevenue,grossprofit,sharesissued,commercialpaper,costofservices,debtcurrent,salariesandwages,open,high,low,close,close_adjusted,volume,date_minus1,open_minus1,high_minus1,low_minus1,close_minus1,close_adj_minus1,volume_minus1,date_plus1,open_plus1,high_plus1,low_plus1,close_plus1,close_adj_plus1,volume_plus1
1160,GOOGL,2015-10-29,50990000.0,88103000000.0,1549000000.0,51000.0,26178000000.0,144281000000.0,116241000000.0,15.95,831000000.0,-208000000.0,,39680000000.0,,,,,23998000000.0,,,,687348000.0,2000000000.0,,3237000000.0,,736.63,746.79,735.16,744.85,744.85,1825957.0,2015-10-28,733.0,737.37,727.27,736.92,736.92,1980154.0,2015-10-30,745.56,746.31,736.53,737.39,737.39,1999161.0
1161,GOOGL,2016-02-11,50295000.0,90114000000.0,1931000000.0,50000.0,26178000000.0,147461000000.0,89223000000.0,23.11,16348000000.0,-208000000.0,,55629000000.0,,,,,33112000000.0,,,,687348000.0,2000000000.0,,3225000000.0,,696.34,712.315,691.19,706.36,706.36,3250206.0,2016-02-10,711.79,723.22,705.39,706.85,706.85,3015733.0,2016-02-12,712.21,716.0,701.58,706.89,706.89,2326889.0
1162,GOOGL,2016-05-03,49536000.0,90955000000.0,1667000000.0,,26178000000.0,149747000000.0,-1891000000.0,6.12,4207000000.0,-169000000.0,,14915000000.0,,,,,8955000000.0,,,,691293000.0,,,3221000000.0,,712.5,713.37,707.33,708.44,708.44,1931040.0,2016-05-02,711.92,715.41,706.36,714.41,714.41,1673820.0,2016-05-04,706.77,715.05,704.05,711.37,711.37,1708609.0
1163,GOOGL,2016-08-04,48921000.0,94238000000.0,1716000000.0,,26413000000.0,154292000000.0,-2010000000.0,13.23,656000000.0,-183000000.0,,30447000000.0,,,,,18506000000.0,,,,691293000.0,,,2219000000.0,,798.24,800.2,793.92,797.25,797.25,1076031.0,2016-08-03,796.47,799.54,793.02,798.92,798.92,1461025.0,2016-08-05,800.11,807.22,797.81,806.93,806.93,1807271.0
1164,GOOGL,2016-11-03,48105000.0,98546000000.0,2175000000.0,,25845000000.0,159948000000.0,-1881000000.0,20.59,1013000000.0,-137000000.0,,47131000000.0,,,,,28418000000.0,,,,691293000.0,,,,,784.5,790.0,778.63,782.19,782.19,2175216.0,2016-11-02,806.76,806.76,785.0,788.42,788.42,2350736.0,2016-11-04,771.3,788.48,771.0043,781.1,781.1,1970603.0
1165,GOOGL,2017-02-03,47437000.0,105408000000.0,2041000000.0,,28461000000.0,167497000000.0,105131000000.0,28.32,19478000000.0,,,66556000000.0,,,,,39704000000.0,,,,691293000.0,,,,,823.13,826.13,819.35,820.13,820.13,1528095.0,2017-02-02,815.0,824.56,812.05,818.26,818.26,1689179.0,NaT,,,,,,
1166,GOOGL,2017-05-02,47164000.0,108794000000.0,2306000000.0,,27807000000.0,172756000000.0,-2195000000.0,7.85,-25000000.0,,,18182000000.0,,,,,8091000000.0,,,,,,,,,933.27,942.99,931.0,937.09,937.09,1745453.0,2017-05-01,924.15,935.82,920.8,932.82,932.82,2294856.0,2017-05-03,936.05,950.2,935.21,948.45,948.45,1792847.0
1167,GOOGL,2017-07-25,47101000.0,112386000000.0,2488000000.0,,30335000000.0,178621000000.0,-1630000000.0,12.94,-51000000.0,,,40060000000.0,,,,,16636000000.0,,,,,,,,,970.7,976.73,963.8,969.03,969.03,5793414.0,2017-07-24,994.1,1006.19,990.27,998.31,998.31,3053176.0,2017-07-26,972.78,973.95,960.23,965.31,965.31,2166225.0
1168,GOOGL,2017-10-27,47054000.0,119345000000.0,2674000000.0,,32436000000.0,189536000000.0,-1189000000.0,22.65,-98000000.0,,,60050000000.0,,,,,25733000000.0,,,,,,,,,1030.99,1063.62,1026.85,1033.67,1033.67,5139945.0,2017-10-26,998.47,1006.51,990.47,991.42,991.42,1827682.0,NaT,,,,,,
1169,GOOGL,2018-02-06,46972000.0,124308000000.0,3137000000.0,51000.0,44793000000.0,197295000000.0,113247000000.0,18.27,12662000000.0,-208000000.0,,84709000000.0,,,,,36046000000.0,,,,694783000.0,1300000000.0,,1329000000.0,,1033.98,1087.38,1030.01,1084.43,1084.43,3732527.0,2018-02-05,1100.61,1114.99,1056.74,1062.39,1062.39,3742469.0,2018-02-07,1084.97,1086.53,1054.62,1055.41,1055.41,2544683.0


I growl greatly at incomplete reporting. In Google's case, there are a couple of columns (commonstockvalue and profitloss) where I am tempted to just take the "most common" value. But what about commercialpaper? Or debtcurrent? I know imputing data is basically guessing, and I'm guessing without much benefit of outside information. Also, how do I automate my guessing so that I don't have to look through all this data when there are more than a handful of tickers? Since this is only a handful of tickers, I suppose it's worth doing more looking...
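
One possible way to automate the looking, sketched on made-up data: compute the fraction of missing values for every ticker and column, then sort the pairs into "never reported" and "partly reported". The tickers, values, and the names `needs_imputing` and `never_reported` are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ticker': ['AAA'] * 4 + ['BBB'] * 4,
    'profitloss': [-208.0, -208.0, np.nan, -208.0, np.nan, np.nan, np.nan, np.nan],
    'revenues': [1.0, 2.0, 3.0, 4.0, 5.0, np.nan, 7.0, 8.0],
})
value_cols = ['profitloss', 'revenues']

# Fraction of missing values for every (ticker, column) pair
missing = toy[value_cols].isna().groupby(toy['ticker']).mean()
print(missing)

# Rough triage: partly missing columns are candidates for imputing,
# fully missing ones were never reported by that ticker
needs_imputing = (missing > 0) & (missing < 1)
never_reported = missing == 1
print(needs_imputing)
print(never_reported)
```

This does not decide how to guess, but it narrows down where guessing is even possible without reading every row.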

In [8]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_tech_rev[df_tech_rev['ticker'] == 'FB'])

Unnamed: 0,ticker,date,commonstocksharesissued,assetscurrent,accountspayablecurrent,commonstockvalue,liabilities,liabilitiesandstockholdersequity,stockholdersequity,earningspersharebasic,netincomeloss,profitloss,costofgoodssold,costsandexpenses,cash,preferredstockvalue,depreciation,operatingexpenses,revenues,land,deferredrevenue,grossprofit,sharesissued,commercialpaper,costofservices,debtcurrent,salariesandwages,open,high,low,close,close_adjusted,volume,date_minus1,open_minus1,high_minus1,low_minus1,close_minus1,close_adj_minus1,volume_minus1,date_plus1,open_plus1,high_plus1,low_plus1,close_plus1,close_adj_plus1,volume_plus1
980,FB,2012-07-31,117000000.0,4604000000.0,63000000.0,,1432000000.0,6331000000.0,4899000000.0,-0.08,-31000000.0,,,1927000000.0,1513000000.0,615000000.0,566000000.0,,596000000.0,,,,,,,,,23.37,23.37,21.61,21.71,21.71,56179400.0,2012-07-30,23.995,24.04,23.03,23.15,23.15,29285900.0,2012-08-01,21.5,21.58,20.84,20.88,20.88,44604400.0
981,FB,2012-10-24,180000000.0,12285000000.0,63000000.0,,1432000000.0,6331000000.0,14174000000.0,-0.02,-3000000.0,,,1371000000.0,1513000000.0,,566000000.0,,543000000.0,,,,,,,,,24.13,24.25,22.85,23.2299,23.2299,228949900.0,2012-10-23,19.25,19.8,19.1,19.5,19.5,78381200.0,2012-10-25,23.29,23.31,22.47,22.56,22.56,76142000.0
982,FB,2013-02-01,117000000.0,11267000000.0,63000000.0,,1432000000.0,6331000000.0,2162000000.0,0.02,35000000.0,,,942000000.0,1513000000.0,,566000000.0,,1644000000.0,,,,,,,,,31.01,31.02,29.63,29.73,29.73,85856700.0,2013-01-31,29.15,31.47,28.74,30.981,30.981,190744900.0,NaT,,,,,,
983,FB,2013-05-02,1671000000.0,11042000000.0,65000000.0,,3348000000.0,15103000000.0,11824000000.0,0.1,156000000.0,,,1085000000.0,1166000000.0,,566000000.0,,522000000.0,,,,,,,,,28.0099,29.02,27.98,28.97,28.97,104257000.0,2013-05-01,27.85,27.915,27.31,27.43,27.43,64567600.0,2013-05-03,29.04,29.07,28.15,28.311,28.311,58506400.0
984,FB,2013-07-25,701000000.0,11421000000.0,65000000.0,,3348000000.0,15103000000.0,12349000000.0,-0.08,-126000000.0,,,1251000000.0,1020000000.0,,566000000.0,,1124000000.0,,,,,,,,,33.545,34.88,32.75,34.359,34.359,365457900.0,2013-07-24,26.32,26.53,26.05,26.51,26.51,82635600.0,2013-07-26,33.77,34.73,33.56,34.01,34.01,136028900.0
985,FB,2013-11-01,701000000.0,10549000000.0,65000000.0,,3348000000.0,15103000000.0,13048000000.0,0.17,260000000.0,,,1280000000.0,898000000.0,,566000000.0,,1789000000.0,,,,,,,,,50.85,52.09,49.72,49.75,49.75,95033000.0,2013-10-31,47.155,52.0,46.5,50.205,50.205,248809000.0,NaT,,,,,,
986,FB,2014-01-31,701000000.0,13070000000.0,87000000.0,,3348000000.0,15103000000.0,-6000000.0,0.52,35000000.0,,,5068000000.0,1044000000.0,,566000000.0,,1644000000.0,,,,,,,,,60.47,63.37,60.17,62.57,62.57,87794600.0,2014-01-30,62.12,62.5,60.46,61.08,61.08,150178900.0,NaT,,,,,,
987,FB,2014-04-25,1970000000.0,14060000000.0,85000000.0,,2425000000.0,19028000000.0,16737000000.0,0.09,144000000.0,,,1427000000.0,1044000000.0,,923000000.0,,777000000.0,,,,,,,,,59.97,60.01,57.57,57.71,57.71,92502000.0,2014-04-24,63.6,63.65,59.77,60.87,60.87,138769000.0,NaT,,,,,,
988,FB,2014-07-24,2013000000.0,15557000000.0,146000000.0,,2425000000.0,20769000000.0,18346000000.0,0.31,317000000.0,,,2948000000.0,1044000000.0,,923000000.0,,3023000000.0,,,,,,,,,75.96,76.74,74.51,74.98,74.98,124006900.0,2014-07-23,69.74,71.33,69.61,71.29,71.29,77435900.0,2014-07-25,74.99,75.67,74.662,75.19,75.19,45823100.0
989,FB,2014-10-30,564000000.0,16115000000.0,87000000.0,,2425000000.0,24188000000.0,15470000000.0,0.4,632000000.0,,,4754000000.0,1344000000.0,,923000000.0,,4758000000.0,,,,,,,,,75.05,75.35,72.9,74.11,74.11,83269554.0,2014-10-29,75.45,76.88,74.78,75.86,75.86,106119520.0,2014-10-31,74.93,75.7,74.45,74.99,74.99,44544325.0


In [9]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_tech_rev[df_tech_rev['ticker'] == 'ORCL'])

Unnamed: 0,ticker,date,commonstocksharesissued,assetscurrent,accountspayablecurrent,commonstockvalue,liabilities,liabilitiesandstockholdersequity,stockholdersequity,earningspersharebasic,netincomeloss,profitloss,costofgoodssold,costsandexpenses,cash,preferredstockvalue,depreciation,operatingexpenses,revenues,land,deferredrevenue,grossprofit,sharesissued,commercialpaper,costofservices,debtcurrent,salariesandwages,open,high,low,close,close_adjusted,volume,date_minus1,open_minus1,high_minus1,low_minus1,close_minus1,close_adj_minus1,volume_minus1,date_plus1,open_plus1,high_plus1,low_plus1,close_plus1,close_adj_plus1,volume_plus1
1945,ORCL,2009-09-21,,18581000000.0,271000000.0,14648000000.0,,52998000000.0,26143000000.0,0.21,9452000000.0,11235000000.0,,3810000000.0,,,64000000.0,,5331000000.0,868000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2500000000.0,,21.56,21.82,21.5,21.57,19.5155,25487400.0,NaT,,,,,,,2009-09-22,21.59,21.7075,21.35,21.41,19.3708,34159300.0
1946,ORCL,2009-12-22,,25235000000.0,255000000.0,14648000000.0,,53833000000.0,27531000000.0,0.46,9452000000.0,11235000000.0,,7442000000.0,,,123000000.0,,10938000000.0,868000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2500000000.0,,24.51,24.63,24.24,24.46,22.1836,23847200.0,2009-12-21,24.39,24.57,24.22,24.43,22.1564,26935500.0,2009-12-23,24.46,24.75,24.36,24.73,22.4285,19257800.0
1947,ORCL,2010-03-29,,23979000000.0,616000000.0,14648000000.0,,59386000000.0,28476000000.0,0.73,9452000000.0,11235000000.0,,10955000000.0,,,196000000.0,,16391000000.0,868000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2500000000.0,,25.645,25.85,25.41,25.57,23.2361,28975300.0,NaT,,,,,,,2010-03-30,25.48,25.58,25.22,25.54,23.2088,29819400.0
1948,ORCL,2010-07-01,,18581000000.0,271000000.0,14648000000.0,,61578000000.0,25090000000.0,1.08,6230000000.0,5593000000.0,,14586000000.0,,,298000000.0,,22430000000.0,757000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2950000000.0,,21.46,21.68,21.24,21.55,19.6204,38318200.0,2010-06-30,21.64,21.96,21.39,21.46,19.5384,35301600.0,2010-07-02,21.71,22.03,21.49,21.83,19.8753,31784000.0
1949,ORCL,2010-12-21,,27004000000.0,775000000.0,14648000000.0,,61578000000.0,30798000000.0,0.52,9981000000.0,5593000000.0,,6993000000.0,,,123000000.0,,16084000000.0,757000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2950000000.0,,31.65,32.0,31.59,31.76,29.0318,20002800.0,2010-12-20,31.54,31.94,31.11,31.675,28.9541,33568100.0,2010-12-22,31.68,31.8825,31.56,31.66,28.9404,14019200.0
1950,ORCL,2011-03-29,,27004000000.0,775000000.0,14648000000.0,,61578000000.0,30798000000.0,0.75,9981000000.0,5593000000.0,,11553000000.0,,,196000000.0,,24847000000.0,757000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2950000000.0,,32.4,33.16,32.36,33.16,30.3601,29950300.0,2011-03-28,32.83,32.89,32.4,32.555,29.8062,31399500.0,2011-03-30,33.27,33.43,33.0,33.05,30.2594,25718200.0
1951,ORCL,2011-06-28,,39174000000.0,701000000.0,,,73535000000.0,39776000000.0,1.69,9981000000.0,5593000000.0,,23589000000.0,,,298000000.0,,23252000000.0,757000000.0,7118000000.0,10431000000.0,,881000000.0,1259000000.0,2950000000.0,,31.69,32.37,31.37,32.34,29.6619,32746800.0,2011-06-27,30.98,31.83,30.86,31.58,28.9648,29585600.0,2011-06-29,32.4,32.68,32.21,32.43,29.7445,27305700.0
1952,ORCL,2011-09-23,,39174000000.0,701000000.0,,,73535000000.0,39776000000.0,0.27,9981000000.0,6230000000.0,,5585000000.0,,,99000000.0,,8374000000.0,692000000.0,7118000000.0,10431000000.0,,,896000000.0,2950000000.0,,28.1,29.08,27.81,28.9,26.5548,43991700.0,2011-09-22,28.74,29.03,27.8309,28.34,26.0402,61371400.0,NaT,,,,,,
1953,ORCL,2011-12-23,,38482000000.0,445000000.0,,,72910000000.0,41920000000.0,0.43,9981000000.0,6230000000.0,,5681000000.0,,,226000000.0,,8792000000.0,692000000.0,6360000000.0,10431000000.0,,,929000000.0,2950000000.0,,25.8,26.08,25.75,26.06,23.9933,32292800.0,2011-12-22,25.8601,25.87,25.38,25.69,23.6526,44203700.0,NaT,,,,,,
1954,ORCL,2012-03-23,,37538000000.0,442000000.0,,,74361000000.0,42873000000.0,0.5,9981000000.0,6230000000.0,,5722000000.0,,,283000000.0,,9039000000.0,692000000.0,6591000000.0,10431000000.0,,,922000000.0,2950000000.0,,28.69,28.89,28.52,28.55,26.3441,36696300.0,2012-03-22,29.33,29.33,28.56,28.63,26.418,59763200.0,NaT,,,,,,


Maybe I should just start "simple" and impute the mean for the SEC filing columns. The day-plus/day-minus price information needs a different decision, though: when those values are missing, I should probably just set them equal to the day-of prices.
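
For reference, the "simple" mean option could look roughly like this if done within each ticker. This is a sketch on made-up data, not what the cells below actually do; the tickers and values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ticker': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'liabilities': [10.0, np.nan, 30.0, np.nan, np.nan, np.nan],
})

# Mean of each ticker's own reported values, repeated on every row of that ticker
ticker_mean = toy.groupby('ticker')['liabilities'].transform('mean')

# Fill gaps with the ticker's own mean; a ticker with no reports stays NaN
toy['liabilities_filled'] = toy['liabilities'].fillna(ticker_mean)
print(toy)
```

A ticker that never reports a figure has no mean to borrow, so its rows stay empty rather than picking up another company's numbers.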

In [12]:
# assign the result back; a chained inplace fillna may not update the frame in newer pandas
df_tech_rev['date_minus1'] = df_tech_rev['date_minus1'].fillna(df_tech_rev['date'])
df_tech_rev.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 713 to 1983
Data columns (total 47 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   ticker                            199 non-null    object        
 1   date                              199 non-null    datetime64[ns]
 2   commonstocksharesissued           160 non-null    float64       
 3   assetscurrent                     199 non-null    float64       
 4   accountspayablecurrent            199 non-null    float64       
 5   commonstockvalue                  19 non-null     float64       
 6   liabilities                       122 non-null    float64       
 7   liabilitiesandstockholdersequity  199 non-null    float64       
 8   stockholdersequity                199 non-null    float64       
 9   earningspersharebasic             199 non-null    float64       
 10  netincomeloss                     199 non-null 

In [13]:
# fill missing day-minus/day-plus values from the filing-day columns
# (assigning back, since a chained inplace fillna may not update the frame in newer pandas)
df_tech_rev['date_plus1'] = df_tech_rev['date_plus1'].fillna(df_tech_rev['date'])
# prefix of the _minus1/_plus1 columns -> matching day-of column
day_of = {'open': 'open', 'high': 'high', 'low': 'low', 'close': 'close',
          'close_adj': 'close_adjusted', 'volume': 'volume'}
for prefix, source in day_of.items():
    for suffix in ('_minus1', '_plus1'):
        df_tech_rev[prefix + suffix] = df_tech_rev[prefix + suffix].fillna(df_tech_rev[source])
df_tech_rev.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 713 to 1983
Data columns (total 47 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   ticker                            199 non-null    object        
 1   date                              199 non-null    datetime64[ns]
 2   commonstocksharesissued           160 non-null    float64       
 3   assetscurrent                     199 non-null    float64       
 4   accountspayablecurrent            199 non-null    float64       
 5   commonstockvalue                  19 non-null     float64       
 6   liabilities                       122 non-null    float64       
 7   liabilitiesandstockholdersequity  199 non-null    float64       
 8   stockholdersequity                199 non-null    float64       
 9   earningspersharebasic             199 non-null    float64       
 10  netincomeloss                     199 non-null 

In [15]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_tech_rev[df_tech_rev['open'].isnull()])

Unnamed: 0,ticker,date,commonstocksharesissued,assetscurrent,accountspayablecurrent,commonstockvalue,liabilities,liabilitiesandstockholdersequity,stockholdersequity,earningspersharebasic,netincomeloss,profitloss,costofgoodssold,costsandexpenses,cash,preferredstockvalue,depreciation,operatingexpenses,revenues,land,deferredrevenue,grossprofit,sharesissued,commercialpaper,costofservices,debtcurrent,salariesandwages,open,high,low,close,close_adjusted,volume,date_minus1,open_minus1,high_minus1,low_minus1,close_minus1,close_adj_minus1,volume_minus1,date_plus1,open_plus1,high_plus1,low_plus1,close_plus1,close_adj_plus1,volume_plus1
1375,IBM,2012-10-30,2182470000.0,48141000000.0,7085000000.0,,94112000000.0,115778000000.0,21541000000.0,9.38,15000000.0,10771000000.0,10003000000.0,,,,2572000000.0,,563000000.0,,670000000.0,35131000000.0,,2458000000.0,29285000000.0,9334000000.0,,,,,,,,2012-10-30,,,,,,,2012-10-31,194.8,196.41,193.63,194.53,166.5844,6052300.0


In [19]:
# sigh, fixing the ONE DARN ROW...
# date_minus1 had already been fillna'd, so overwrite it explicitly
df_tech_rev.loc[1375, 'date_minus1'] = df_tech_rev.loc[1375, 'date_plus1']
# prefix of the _minus1/_plus1 columns -> matching day-of column
day_of = {'open': 'open', 'high': 'high', 'low': 'low', 'close': 'close',
          'close_adj': 'close_adjusted', 'volume': 'volume'}
# where day-of and day-minus values are both missing, fall back to the day-plus values
for prefix, source in day_of.items():
    plus1 = df_tech_rev[prefix + '_plus1']
    df_tech_rev[prefix + '_minus1'] = df_tech_rev[prefix + '_minus1'].fillna(plus1)
    df_tech_rev[source] = df_tech_rev[source].fillna(plus1)
df_tech_rev.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 713 to 1983
Data columns (total 47 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   ticker                            199 non-null    object        
 1   date                              199 non-null    datetime64[ns]
 2   commonstocksharesissued           160 non-null    float64       
 3   assetscurrent                     199 non-null    float64       
 4   accountspayablecurrent            199 non-null    float64       
 5   commonstockvalue                  19 non-null     float64       
 6   liabilities                       122 non-null    float64       
 7   liabilitiesandstockholdersequity  199 non-null    float64       
 8   stockholdersequity                199 non-null    float64       
 9   earningspersharebasic             199 non-null    float64       
 10  netincomeloss                     199 non-null 

Whee! Now to try the magic of Pandas "interpolate"!

In [25]:
# credit: https://stackoverflow.com/questions/37057187/pandas-interpolate-within-a-groupby
df_tech_rev_interp = df_tech_rev.groupby('ticker').transform(pd.DataFrame.interpolate)
df_tech_rev_interp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 713 to 1983
Data columns (total 46 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   date                              199 non-null    datetime64[ns]
 1   commonstocksharesissued           160 non-null    float64       
 2   assetscurrent                     199 non-null    float64       
 3   accountspayablecurrent            199 non-null    float64       
 4   commonstockvalue                  53 non-null     float64       
 5   liabilities                       122 non-null    float64       
 6   liabilitiesandstockholdersequity  199 non-null    float64       
 7   stockholdersequity                199 non-null    float64       
 8   earningspersharebasic             199 non-null    float64       
 9   netincomeloss                     199 non-null    float64       
 10  profitloss                        171 non-null 

Well... this looks like it probably did what I hoped for, except it's made the ticker info disappear.
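
The ticker column goes missing because `transform` returns only the non-grouping columns, aligned to the original rows. One way to keep it, sketched here on made-up data with hypothetical tickers and values, is to transform just the numeric columns and write the result back into a copy of the frame.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ticker': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'cash': [1.0, np.nan, 3.0, 100.0, np.nan, 300.0],
    'debt': [np.nan, np.nan, np.nan, 5.0, np.nan, np.nan],
})
num_cols = toy.select_dtypes('number').columns.tolist()

filled = toy.copy()
# Interpolate the numeric columns within each ticker, then write the result back
filled[num_cols] = toy.groupby('ticker')[num_cols].transform(lambda g: g.interpolate())
print(filled)
```

In this toy example AAA's gap is filled from AAA's own values only, and a column that a ticker never reports stays empty for that ticker.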

In [26]:
df_tech_rev_interp = df_tech_rev.groupby('ticker').apply(pd.DataFrame.interpolate)
df_tech_rev_interp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 713 to 1983
Data columns (total 47 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   ticker                            199 non-null    object        
 1   date                              199 non-null    datetime64[ns]
 2   commonstocksharesissued           160 non-null    float64       
 3   assetscurrent                     199 non-null    float64       
 4   accountspayablecurrent            199 non-null    float64       
 5   commonstockvalue                  53 non-null     float64       
 6   liabilities                       122 non-null    float64       
 7   liabilitiesandstockholdersequity  199 non-null    float64       
 8   stockholdersequity                199 non-null    float64       
 9   earningspersharebasic             199 non-null    float64       
 10  netincomeloss                     199 non-null 

That's odd. Wonder what's different between "apply" interpolate and "transform" interpolate. Let's see what it did.

In [27]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_tech_rev_interp[df_tech_rev_interp['ticker'] == 'GOOGL'])

Unnamed: 0,ticker,date,commonstocksharesissued,assetscurrent,accountspayablecurrent,commonstockvalue,liabilities,liabilitiesandstockholdersequity,stockholdersequity,earningspersharebasic,netincomeloss,profitloss,costofgoodssold,costsandexpenses,cash,preferredstockvalue,depreciation,operatingexpenses,revenues,land,deferredrevenue,grossprofit,sharesissued,commercialpaper,costofservices,debtcurrent,salariesandwages,open,high,low,close,close_adjusted,volume,date_minus1,open_minus1,high_minus1,low_minus1,close_minus1,close_adj_minus1,volume_minus1,date_plus1,open_plus1,high_plus1,low_plus1,close_plus1,close_adj_plus1,volume_plus1
1160,GOOGL,2015-10-29,50990000.0,88103000000.0,1549000000.0,51000.0,26178000000.0,144281000000.0,116241000000.0,15.95,831000000.0,-208000000.0,,39680000000.0,,,,,23998000000.0,,,,687348000.0,2000000000.0,,3237000000.0,,736.63,746.79,735.16,744.85,744.85,1825957.0,2015-10-28,733.0,737.37,727.27,736.92,736.92,1980154.0,2015-10-30,745.56,746.31,736.53,737.39,737.39,1999161.0
1161,GOOGL,2016-02-11,50295000.0,90114000000.0,1931000000.0,50000.0,26178000000.0,147461000000.0,89223000000.0,23.11,16348000000.0,-208000000.0,,55629000000.0,,,,,33112000000.0,,,,687348000.0,2000000000.0,,3225000000.0,,696.34,712.315,691.19,706.36,706.36,3250206.0,2016-02-10,711.79,723.22,705.39,706.85,706.85,3015733.0,2016-02-12,712.21,716.0,701.58,706.89,706.89,2326889.0
1162,GOOGL,2016-05-03,49536000.0,90955000000.0,1667000000.0,50125.0,26178000000.0,149747000000.0,-1891000000.0,6.12,4207000000.0,-169000000.0,,14915000000.0,,,,,8955000000.0,,,,691293000.0,1912500000.0,,3221000000.0,,712.5,713.37,707.33,708.44,708.44,1931040.0,2016-05-02,711.92,715.41,706.36,714.41,714.41,1673820.0,2016-05-04,706.77,715.05,704.05,711.37,711.37,1708609.0
1163,GOOGL,2016-08-04,48921000.0,94238000000.0,1716000000.0,50250.0,26413000000.0,154292000000.0,-2010000000.0,13.23,656000000.0,-183000000.0,,30447000000.0,,,,,18506000000.0,,,,691293000.0,1825000000.0,,2219000000.0,,798.24,800.2,793.92,797.25,797.25,1076031.0,2016-08-03,796.47,799.54,793.02,798.92,798.92,1461025.0,2016-08-05,800.11,807.22,797.81,806.93,806.93,1807271.0
1164,GOOGL,2016-11-03,48105000.0,98546000000.0,2175000000.0,50375.0,25845000000.0,159948000000.0,-1881000000.0,20.59,1013000000.0,-137000000.0,,47131000000.0,,,,,28418000000.0,,,,691293000.0,1737500000.0,,2070667000.0,,784.5,790.0,778.63,782.19,782.19,2175216.0,2016-11-02,806.76,806.76,785.0,788.42,788.42,2350736.0,2016-11-04,771.3,788.48,771.0043,781.1,781.1,1970603.0
1165,GOOGL,2017-02-03,47437000.0,105408000000.0,2041000000.0,50500.0,28461000000.0,167497000000.0,105131000000.0,28.32,19478000000.0,-151200000.0,,66556000000.0,,,,,39704000000.0,,,,691293000.0,1650000000.0,,1922333000.0,,823.13,826.13,819.35,820.13,820.13,1528095.0,2017-02-02,815.0,824.56,812.05,818.26,818.26,1689179.0,2017-02-03,823.13,826.13,819.35,820.13,820.13,1528095.0
1166,GOOGL,2017-05-02,47164000.0,108794000000.0,2306000000.0,50625.0,27807000000.0,172756000000.0,-2195000000.0,7.85,-25000000.0,-165400000.0,,18182000000.0,,,,,8091000000.0,,,,692165500.0,1562500000.0,,1774000000.0,,933.27,942.99,931.0,937.09,937.09,1745453.0,2017-05-01,924.15,935.82,920.8,932.82,932.82,2294856.0,2017-05-03,936.05,950.2,935.21,948.45,948.45,1792847.0
1167,GOOGL,2017-07-25,47101000.0,112386000000.0,2488000000.0,50750.0,30335000000.0,178621000000.0,-1630000000.0,12.94,-51000000.0,-179600000.0,,40060000000.0,,,,,16636000000.0,,,,693038000.0,1475000000.0,,1625667000.0,,970.7,976.73,963.8,969.03,969.03,5793414.0,2017-07-24,994.1,1006.19,990.27,998.31,998.31,3053176.0,2017-07-26,972.78,973.95,960.23,965.31,965.31,2166225.0
1168,GOOGL,2017-10-27,47054000.0,119345000000.0,2674000000.0,50875.0,32436000000.0,189536000000.0,-1189000000.0,22.65,-98000000.0,-193800000.0,,60050000000.0,,,,,25733000000.0,,,,693910500.0,1387500000.0,,1477333000.0,,1030.99,1063.62,1026.85,1033.67,1033.67,5139945.0,2017-10-26,998.47,1006.51,990.47,991.42,991.42,1827682.0,2017-10-27,1030.99,1063.62,1026.85,1033.67,1033.67,5139945.0
1169,GOOGL,2018-02-06,46972000.0,124308000000.0,3137000000.0,51000.0,44793000000.0,197295000000.0,113247000000.0,18.27,12662000000.0,-208000000.0,,84709000000.0,,,,,36046000000.0,,,,694783000.0,1300000000.0,,1329000000.0,,1033.98,1087.38,1030.01,1084.43,1084.43,3732527.0,2018-02-05,1100.61,1114.99,1056.74,1062.39,1062.39,3742469.0,2018-02-07,1084.97,1086.53,1054.62,1055.41,1055.41,2544683.0
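
Two things stand out in these results. The default `method='linear'` ignores the index and treats rows as equally spaced, so GOOGL's commonstockvalue climbs in equal steps from 50000 to 51000 regardless of the filing dates. Also, by default a value with nothing known after it is carried forward rather than left empty. The sketch below reproduces both behaviors on small standalone series; the 615.0 series is made up, and `limit_area='inside'` is shown only as one possible way to avoid the carry-forward.

```python
import numpy as np
import pandas as pd

# Same shape as GOOGL's commonstockvalue: two known filings with seven gaps between them
s = pd.Series([50000.0] + [np.nan] * 7 + [51000.0])
stepped = s.interpolate()
# Equal steps of (51000 - 50000) / 8 = 125 per row
print(stepped.tolist())

# A value reported once and never again (made-up number)
t = pd.Series([615.0, np.nan, np.nan])
# Default behavior: the last known value is carried forward
carried = t.interpolate()
# limit_area='inside' only fills gaps that have known values on both sides
inside_only = t.interpolate(limit_area='inside')
print(carried.tolist())
print(inside_only.tolist())
```

Whether carrying a single old value forward is acceptable probably depends on the column, so it may be worth choosing per column instead of relying on the default.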


In [28]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_tech_rev_interp[df_tech_rev_interp['ticker'] == 'FB'])

Unnamed: 0,ticker,date,commonstocksharesissued,assetscurrent,accountspayablecurrent,commonstockvalue,liabilities,liabilitiesandstockholdersequity,stockholdersequity,earningspersharebasic,netincomeloss,profitloss,costofgoodssold,costsandexpenses,cash,preferredstockvalue,depreciation,operatingexpenses,revenues,land,deferredrevenue,grossprofit,sharesissued,commercialpaper,costofservices,debtcurrent,salariesandwages,open,high,low,close,close_adjusted,volume,date_minus1,open_minus1,high_minus1,low_minus1,close_minus1,close_adj_minus1,volume_minus1,date_plus1,open_plus1,high_plus1,low_plus1,close_plus1,close_adj_plus1,volume_plus1
980,FB,2012-07-31,117000000.0,4604000000.0,63000000.0,,1432000000.0,6331000000.0,4899000000.0,-0.08,-31000000.0,,,1927000000.0,1513000000.0,615000000.0,566000000.0,,596000000.0,,,,,,,,,23.37,23.37,21.61,21.71,21.71,56179400.0,2012-07-30,23.995,24.04,23.03,23.15,23.15,29285900.0,2012-08-01,21.5,21.58,20.84,20.88,20.88,44604400.0
981,FB,2012-10-24,180000000.0,12285000000.0,63000000.0,,1432000000.0,6331000000.0,14174000000.0,-0.02,-3000000.0,,,1371000000.0,1513000000.0,615000000.0,566000000.0,,543000000.0,,,,,,,,,24.13,24.25,22.85,23.2299,23.2299,228949900.0,2012-10-23,19.25,19.8,19.1,19.5,19.5,78381200.0,2012-10-25,23.29,23.31,22.47,22.56,22.56,76142000.0
982,FB,2013-02-01,117000000.0,11267000000.0,63000000.0,,1432000000.0,6331000000.0,2162000000.0,0.02,35000000.0,,,942000000.0,1513000000.0,615000000.0,566000000.0,,1644000000.0,,,,,,,,,31.01,31.02,29.63,29.73,29.73,85856700.0,2013-01-31,29.15,31.47,28.74,30.981,30.981,190744900.0,2013-02-01,31.01,31.02,29.63,29.73,29.73,85856700.0
983,FB,2013-05-02,1671000000.0,11042000000.0,65000000.0,,3348000000.0,15103000000.0,11824000000.0,0.1,156000000.0,,,1085000000.0,1166000000.0,615000000.0,566000000.0,,522000000.0,,,,,,,,,28.0099,29.02,27.98,28.97,28.97,104257000.0,2013-05-01,27.85,27.915,27.31,27.43,27.43,64567600.0,2013-05-03,29.04,29.07,28.15,28.311,28.311,58506400.0
984,FB,2013-07-25,701000000.0,11421000000.0,65000000.0,,3348000000.0,15103000000.0,12349000000.0,-0.08,-126000000.0,,,1251000000.0,1020000000.0,615000000.0,566000000.0,,1124000000.0,,,,,,,,,33.545,34.88,32.75,34.359,34.359,365457900.0,2013-07-24,26.32,26.53,26.05,26.51,26.51,82635600.0,2013-07-26,33.77,34.73,33.56,34.01,34.01,136028900.0
985,FB,2013-11-01,701000000.0,10549000000.0,65000000.0,,3348000000.0,15103000000.0,13048000000.0,0.17,260000000.0,,,1280000000.0,898000000.0,615000000.0,566000000.0,,1789000000.0,,,,,,,,,50.85,52.09,49.72,49.75,49.75,95033000.0,2013-10-31,47.155,52.0,46.5,50.205,50.205,248809000.0,2013-11-01,50.85,52.09,49.72,49.75,49.75,95033000.0
986,FB,2014-01-31,701000000.0,13070000000.0,87000000.0,,3348000000.0,15103000000.0,-6000000.0,0.52,35000000.0,,,5068000000.0,1044000000.0,615000000.0,566000000.0,,1644000000.0,,,,,,,,,60.47,63.37,60.17,62.57,62.57,87794600.0,2014-01-30,62.12,62.5,60.46,61.08,61.08,150178900.0,2014-01-31,60.47,63.37,60.17,62.57,62.57,87794600.0
987,FB,2014-04-25,1970000000.0,14060000000.0,85000000.0,,2425000000.0,19028000000.0,16737000000.0,0.09,144000000.0,,,1427000000.0,1044000000.0,615000000.0,923000000.0,,777000000.0,,,,,,,,,59.97,60.01,57.57,57.71,57.71,92502000.0,2014-04-24,63.6,63.65,59.77,60.87,60.87,138769000.0,2014-04-25,59.97,60.01,57.57,57.71,57.71,92502000.0
988,FB,2014-07-24,2013000000.0,15557000000.0,146000000.0,,2425000000.0,20769000000.0,18346000000.0,0.31,317000000.0,,,2948000000.0,1044000000.0,615000000.0,923000000.0,,3023000000.0,,,,,,,,,75.96,76.74,74.51,74.98,74.98,124006900.0,2014-07-23,69.74,71.33,69.61,71.29,71.29,77435900.0,2014-07-25,74.99,75.67,74.662,75.19,75.19,45823100.0
989,FB,2014-10-30,564000000.0,16115000000.0,87000000.0,,2425000000.0,24188000000.0,15470000000.0,0.4,632000000.0,,,4754000000.0,1344000000.0,615000000.0,923000000.0,,4758000000.0,,,,,,,,,75.05,75.35,72.9,74.11,74.11,83269554.0,2014-10-29,75.45,76.88,74.78,75.86,75.86,106119520.0,2014-10-31,74.93,75.7,74.45,74.99,74.99,44544325.0


In [29]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_tech_rev_interp[df_tech_rev_interp['ticker'] == 'ORCL'])

Unnamed: 0,ticker,date,commonstocksharesissued,assetscurrent,accountspayablecurrent,commonstockvalue,liabilities,liabilitiesandstockholdersequity,stockholdersequity,earningspersharebasic,netincomeloss,profitloss,costofgoodssold,costsandexpenses,cash,preferredstockvalue,depreciation,operatingexpenses,revenues,land,deferredrevenue,grossprofit,sharesissued,commercialpaper,costofservices,debtcurrent,salariesandwages,open,high,low,close,close_adjusted,volume,date_minus1,open_minus1,high_minus1,low_minus1,close_minus1,close_adj_minus1,volume_minus1,date_plus1,open_plus1,high_plus1,low_plus1,close_plus1,close_adj_plus1,volume_plus1
1945,ORCL,2009-09-21,,18581000000.0,271000000.0,14648000000.0,,52998000000.0,26143000000.0,0.21,9452000000.0,11235000000.0,,3810000000.0,,,64000000.0,,5331000000.0,868000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2500000000.0,,21.56,21.82,21.5,21.57,19.5155,25487400.0,2009-09-21,21.56,21.82,21.5,21.57,19.5155,25487400.0,2009-09-22,21.59,21.7075,21.35,21.41,19.3708,34159300.0
1946,ORCL,2009-12-22,,25235000000.0,255000000.0,14648000000.0,,53833000000.0,27531000000.0,0.46,9452000000.0,11235000000.0,,7442000000.0,,,123000000.0,,10938000000.0,868000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2500000000.0,,24.51,24.63,24.24,24.46,22.1836,23847200.0,2009-12-21,24.39,24.57,24.22,24.43,22.1564,26935500.0,2009-12-23,24.46,24.75,24.36,24.73,22.4285,19257800.0
1947,ORCL,2010-03-29,,23979000000.0,616000000.0,14648000000.0,,59386000000.0,28476000000.0,0.73,9452000000.0,11235000000.0,,10955000000.0,,,196000000.0,,16391000000.0,868000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2500000000.0,,25.645,25.85,25.41,25.57,23.2361,28975300.0,2010-03-29,25.645,25.85,25.41,25.57,23.2361,28975300.0,2010-03-30,25.48,25.58,25.22,25.54,23.2088,29819400.0
1948,ORCL,2010-07-01,,18581000000.0,271000000.0,14648000000.0,,61578000000.0,25090000000.0,1.08,6230000000.0,5593000000.0,,14586000000.0,,,298000000.0,,22430000000.0,757000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2950000000.0,,21.46,21.68,21.24,21.55,19.6204,38318200.0,2010-06-30,21.64,21.96,21.39,21.46,19.5384,35301600.0,2010-07-02,21.71,22.03,21.49,21.83,19.8753,31784000.0
1949,ORCL,2010-12-21,,27004000000.0,775000000.0,14648000000.0,,61578000000.0,30798000000.0,0.52,9981000000.0,5593000000.0,,6993000000.0,,,123000000.0,,16084000000.0,757000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2950000000.0,,31.65,32.0,31.59,31.76,29.0318,20002800.0,2010-12-20,31.54,31.94,31.11,31.675,28.9541,33568100.0,2010-12-22,31.68,31.8825,31.56,31.66,28.9404,14019200.0
1950,ORCL,2011-03-29,,27004000000.0,775000000.0,14648000000.0,,61578000000.0,30798000000.0,0.75,9981000000.0,5593000000.0,,11553000000.0,,,196000000.0,,24847000000.0,757000000.0,6288000000.0,10431000000.0,,881000000.0,116000000.0,2950000000.0,,32.4,33.16,32.36,33.16,30.3601,29950300.0,2011-03-28,32.83,32.89,32.4,32.555,29.8062,31399500.0,2011-03-30,33.27,33.43,33.0,33.05,30.2594,25718200.0
1951,ORCL,2011-06-28,,39174000000.0,701000000.0,14648000000.0,,73535000000.0,39776000000.0,1.69,9981000000.0,5593000000.0,,23589000000.0,,,298000000.0,,23252000000.0,757000000.0,7118000000.0,10431000000.0,,881000000.0,1259000000.0,2950000000.0,,31.69,32.37,31.37,32.34,29.6619,32746800.0,2011-06-27,30.98,31.83,30.86,31.58,28.9648,29585600.0,2011-06-29,32.4,32.68,32.21,32.43,29.7445,27305700.0
1952,ORCL,2011-09-23,,39174000000.0,701000000.0,14648000000.0,,73535000000.0,39776000000.0,0.27,9981000000.0,6230000000.0,,5585000000.0,,,99000000.0,,8374000000.0,692000000.0,7118000000.0,10431000000.0,,881000000.0,896000000.0,2950000000.0,,28.1,29.08,27.81,28.9,26.5548,43991700.0,2011-09-22,28.74,29.03,27.8309,28.34,26.0402,61371400.0,2011-09-23,28.1,29.08,27.81,28.9,26.5548,43991700.0
1953,ORCL,2011-12-23,,38482000000.0,445000000.0,14648000000.0,,72910000000.0,41920000000.0,0.43,9981000000.0,6230000000.0,,5681000000.0,,,226000000.0,,8792000000.0,692000000.0,6360000000.0,10431000000.0,,881000000.0,929000000.0,2950000000.0,,25.8,26.08,25.75,26.06,23.9933,32292800.0,2011-12-22,25.8601,25.87,25.38,25.69,23.6526,44203700.0,2011-12-23,25.8,26.08,25.75,26.06,23.9933,32292800.0
1954,ORCL,2012-03-23,,37538000000.0,442000000.0,14648000000.0,,74361000000.0,42873000000.0,0.5,9981000000.0,6230000000.0,,5722000000.0,,,283000000.0,,9039000000.0,692000000.0,6591000000.0,10431000000.0,,881000000.0,922000000.0,2950000000.0,,28.69,28.89,28.52,28.55,26.3441,36696300.0,2012-03-22,29.33,29.33,28.56,28.63,26.418,59763200.0,2012-03-23,28.69,28.89,28.52,28.55,26.3441,36696300.0


Last choice, I think: do I fill the remaining NaNs with zero, or some other dummy constant, to keep the ML models happy later? Or is there a better way to deal with some companies reporting a given field while others never report it at all?
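One option worth trying is to fill with a constant but also keep a 0/1 flag per column recording where the value was originally missing, so a model can tell "reported as zero" apart from "never reported". Below is a minimal sketch of that idea, not a settled decision. The helper name `fill_with_indicators`, the `_missing` suffix, and the toy frame (made-up tickers and values standing in for `df_tech_rev_interp`) are all my own assumptions for illustration.

```python
import numpy as np
import pandas as pd


def fill_with_indicators(frame, cols, fill_value=0.0):
    """Return a copy of `frame` with NaNs in `cols` replaced by `fill_value`,
    plus one `<col>_missing` column (1 = was NaN, 0 = was reported)."""
    out = frame.copy()
    for col in cols:
        # Record missingness before overwriting the NaNs
        out[f"{col}_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(fill_value)
    return out


# Toy data: hypothetical tickers and values, only to show the shape of the result
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
    "cash": [1.0, 2.0, np.nan, np.nan],
    "land": [np.nan, np.nan, 5.0, 6.0],
})

filled = fill_with_indicators(toy, ["cash", "land"])
print(filled)
```

On the real data, the same call would presumably be made on `df_tech_rev_interp` with the list of financial columns that still contain NaNs. Whether zero is a sensible constant depends on the column: for something like `land` it might be defensible, while for `revenues` a zero would be misleading without the flag.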