Bikeshed name for kwarg that preserves dates consumed by computation #47

Closed
milktrader opened this issue Mar 12, 2015 · 9 comments

Comments

@milktrader
Member

A 10-period moving average cannot be computed over fewer than 10 observations, so when this computation is performed on a data structure, the first 9 rows have no reasonable value. This package consumes these dates, throwing them into the black hole of time-space where they belong. The resulting data structure has 9 fewer rows than the original. This is what happens when you do these sorts of things on data.
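The row consumption described above can be sketched on a plain vector. This is only an illustration: `sma_consume` is a hypothetical helper, not the MarketTechnicals implementation, and the input is made-up data chosen to match the 500-row example further down.

```julia
# Hypothetical helper (an assumption for illustration, not the package's code):
# a simple moving average that "consumes" the first n-1 positions.
function sma_consume(x::AbstractVector{<:Real}, n::Integer)
    1 <= n <= length(x) || throw(ArgumentError("n must be between 1 and length(x)"))
    # One output per complete window of n values; incomplete windows are dropped.
    return [sum(view(x, i-n+1:i)) / n for i in n:length(x)]
end

prices = collect(1.0:500.0)       # 500 made-up observations
out = sma_consume(prices, 10)
length(out)                       # 491, nine fewer than the input
```

The first window covers positions 1 through 10, so the first result lines up with the tenth timestamp, which is why the output starts nine rows late.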

Alas, many researchers don't like this behavior. Comparable tools in R and pandas, for example, don't do this. They fill the slots consumed by computation with NA in the case of R, or with the sentinel NaN in the case of pandas.

The most compelling reason to allow this padding behavior is that there are times when you'd like to merge or combine different data structures, and if a transformed column has been cut short, you lose meaningful data from the other data structure.

For example, suppose you have a need for a TimeArray that includes closing prices and their 10-period moving average.

julia> using MarketData, MarketTechnicals

julia> cl
500x1 TimeSeries.TimeArray{Float64,1,DataType} 2000-01-03 to 2001-12-31

             Close     
2000-01-03 | 111.94    
2000-01-04 | 102.5     
2000-01-05 | 104.0     
2000-01-06 | 95.0      

2001-12-26 | 21.49     
2001-12-27 | 22.07     
2001-12-28 | 22.43     
2001-12-31 | 21.9      

julia> sma(cl, 10)
491x1 TimeSeries.TimeArray{Float64,1,DataType} 2000-01-14 to 2001-12-31

             sma10     
2000-01-14 | 98.782    
2000-01-18 | 97.982    
2000-01-19 | 98.388    
2000-01-20 | 99.338    

2001-12-26 | 21.065    
2001-12-27 | 21.123    
2001-12-28 | 21.266    
2001-12-31 | 21.417    

julia> merge(ans, cl)
491x2 TimeSeries.TimeArray{Float64,2,DataType} 2000-01-14 to 2001-12-31

             sma10     Close     
2000-01-14 | 98.782    100.44    
2000-01-18 | 97.982    103.94    
2000-01-19 | 98.388    106.56    
2000-01-20 | 99.338    113.5     

2001-12-26 | 21.065    21.49     
2001-12-27 | 21.123    22.07     
2001-12-28 | 21.266    22.43     
2001-12-31 | 21.417    21.9      

The argument that the sma10 column shouldn't represent values where none make sense is reasonable, but the merge has now dropped the first 9 meaningful values from the Close column as well.

Solution with Nullable{Float64}

Now that Nullable{Float64} is available, we can let consumed timestamps hold the Julian sentinel in this slot. This package (or, more likely, TimeSeries) can define a show method that displays it as NA. That is preferred because it's terse and everyone knows what NA means (not really, but we all think we do).

Bike shed

What should the name be for a kwarg that allows consumed dates to be represented with Julian sentinels?

consume = true
preserve_dates = false
sentinalize = false (is that even a word?)
NA = false

@multidis

Is Nullable available in version 0.4 only, or is there a package that provides it for Julia 0.3? In any case, a kwarg for choosing this behavior is really necessary, and it could be set to a suitable default if the Nullable type is not available. Perhaps allow_nullable?

@milktrader
Member Author

allow_nullable is descriptive but hard to remember and easy to misspell.

Yes, Nullable is a new feature in v0.4. It won't be much longer before that becomes the stable version so I'd rather just reserve this functionality for that.

I suppose you could just use NaN in the meantime as the sentinel and build out the method signatures. Then it would be easy enough to s/NaN/Nullable{Float64}/g in the future.
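A minimal sketch of the NaN stopgap mentioned above, on a plain vector. The name `sma_nan` is hypothetical and the data is made up; it only shows that the output keeps the input's length when the consumed slots hold a sentinel.

```julia
# Hypothetical helper (an assumption for illustration): same moving average,
# but the first n-1 slots are kept and filled with NaN instead of being dropped.
function sma_nan(x::AbstractVector{<:Real}, n::Integer)
    1 <= n <= length(x) || throw(ArgumentError("n must be between 1 and length(x)"))
    out = fill(NaN, length(x))            # every slot starts as the sentinel
    for i in n:length(x)
        out[i] = sum(view(x, i-n+1:i)) / n
    end
    return out
end

prices = collect(1.0:500.0)       # 500 made-up observations
padded = sma_nan(prices, 10)
length(padded)                    # 500, same as the input
count(isnan, padded)              # 9 sentinel slots
```

Because the result has one slot per input row, it can sit next to the original column without dropping any of that column's values.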

@multidis

I did not know v0.4 is coming soon and could not find much info on that. Any pointers on the expected time frame? Could you share the links?

@milktrader
Member Author

keep
keep_dates
removeNA

@milktrader
Member Author

Well, it's been 9 months, and I feel everyone's getting ready to finalize what must be done first and what can wait until 0.4.1.

https://groups.google.com/forum/#!topic/julia-dev/Q1K3MvNYvJo

@milktrader
Member Author

DataFrames uses removeNA.

I like that, quite frankly. It does conflate NA (which doesn't exist in Base) with Nullable (which is implemented), though.

removeNullable = true ?

Edit: that's too many letters and easy to misspell, I guess ;)

@milktrader
Member Author

How about using a new placeholder? I was thinking of using CBC, for "consumed by calculation". It would be clear why the value is missing in this case. Underneath, it can be implemented as Nullable.

@milktrader
Member Author

In this case, the kwarg would be keep_all_dates = false maybe?

@milktrader
Member Author

padding=true is what TimeArrays use so this will be consistent over here.
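A rough sketch of how a `padding` kwarg could switch between the two behaviors discussed in this thread. Everything here is an assumption for illustration: `sma_sketch` is a hypothetical name, it works on a plain vector rather than a TimeArray, and it uses `missing` as the placeholder because Nullable was later removed from Base Julia in favor of `missing`.

```julia
# Hypothetical sketch, not the package's API: a `padding` kwarg that either
# consumes the first n-1 rows (default) or keeps them as `missing`.
function sma_sketch(x::AbstractVector{<:Real}, n::Integer; padding::Bool=false)
    1 <= n <= length(x) || throw(ArgumentError("n must be between 1 and length(x)"))
    vals = [sum(view(x, i-n+1:i)) / n for i in n:length(x)]
    padding || return vals                    # consume: shorter output
    return vcat(fill(missing, n - 1), vals)   # pad: same length as the input
end

prices = collect(1.0:500.0)                   # 500 made-up observations
consumed = sma_sketch(prices, 10)
padded   = sma_sketch(prices, 10; padding=true)
length(consumed), length(padded)
count(ismissing, padded)
```

Defaulting to `padding=false` keeps the existing consuming behavior, so only callers who ask for padding get placeholder rows.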


2 participants