TimeSeries and TimeModels #6
Comments
My current thinking:
|
Hey, I'm about to come into a bunch of free time and would be interested in working on this (probably starting in 1-2 weeks). I do want to review the source code for zoo/xts in R and Series in Pandas for a bit first, though. Here are my thoughts on the roadmap:
Like I said, I'd like to dig deeper into this starting in the next week or so, but these are my initial impressions. |
First, let me say that I think we need more than a single implementation of time series data. In my opinion, at the very core we should have a package that suits a fairly general set of needs and fits neatly into the framework provided by existing packages. Maybe this place should simply be taken by your current design of
|
@carljv cool that you have some time to think about this. I'm most familiar with
My Series.jl package is a sort of awkward attempt to approach this implementation a bit differently. Here is the type:
immutable SeriesPair{T, V} <: AbstractSeriesPair
    index::T
    value::V
end
So essentially it represents a row of data. Methods are provided to work with an array of these instances: sorting, working with time indexes, and performing transformations (log returns, etc.). To play with it, you can clone the
Pkg.clone("https://github.com/milktrader/Series.jl.git")
Pkg.clone("https://github.com/JuliaQuant/MarketData.jl.git")
Once you get those installed, you can play around with some 3-year (cl) or 65-year (Cl) SPX daily closing prices.
julia> using Series, MarketData
julia> byyear(Cl, 1968) |> x -> bymonth(x,12) |> x -> byday(x,24)
1-element Array{SeriesPair{Date{ISOCalendar},Float64},1}:
1968-12-24 105.0400
julia> Cl[date(1968,12,24)]
1968-12-24 105.0400
julia> ans.value = 105.00
ERROR: type SeriesPair is immutable
#cannot change the closing price on Dec 24, 1968
julia> mean(value(Cl))
436.97211140781224
#value() simply operates on the value element of the SeriesPairs in the array
julia> maximum(index(Cl))
2013-12-31
julia> Hi - Lo;
julia> ans[12345:12348]
4-element Array{SeriesPair{Date{ISOCalendar},Float64},1}:
1999-01-25 14.5200
1999-01-26 19.2700
1999-01-27 19.7900
1999-01-28 23.2300
And there is more. I haven't updated the README yet (it's admittedly a mess), but I have left a trail in the pull request history. @carljv I'll attempt to give you access to TimeSeries and TimeModels so you can push up some branches and play around if you like. |
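For readers on current Julia, the SeriesPair idea described above can be sketched roughly as follows. This is a minimal illustration, not the Series.jl source. It assumes `struct` (the modern spelling of `immutable`) and the `Dates` standard library in place of Datetime.jl. The bodies of `index`, `value`, and `byyear` are guesses that only mirror the names used in the session above, and apart from the 1968-12-24 close the prices are placeholder values.

```julia
using Dates

abstract type AbstractSeriesPair end

# One row of a series: an index (here a Date) paired with a value.
# `struct` is immutable, so fields cannot be reassigned after construction.
struct SeriesPair{T,V} <: AbstractSeriesPair
    index::T
    value::V
end

# Hypothetical helpers operating on an array of pairs (bodies are assumptions).
index(sa::Vector{<:SeriesPair}) = [p.index for p in sa]
value(sa::Vector{<:SeriesPair}) = [p.value for p in sa]
byyear(sa::Vector{<:SeriesPair}, y::Integer) = filter(p -> year(p.index) == y, sa)

prices = [
    SeriesPair(Date(1968, 12, 23), 104.0),   # placeholder value
    SeriesPair(Date(1968, 12, 24), 105.04),  # close quoted in the thread
    SeriesPair(Date(1969, 1, 2), 103.0),     # placeholder value
]

in_1968 = byyear(prices, 1968)    # keeps the two 1968 rows
latest  = maximum(index(prices))  # most recent date in the series
closes  = value(in_1968)          # plain Vector{Float64} of the values
```

Because the helpers return ordinary arrays, anything that works on a `Vector{Float64}` or `Vector{Date}` can be applied to their results without further wrapping.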
As to the name of the package and whether to keep TimeSeries or change it, here is a closed issue for some additional background: JuliaLang/METADATA.jl#472
Essentially, Series is a no-go. It connotes something completely different to mathematicians, and we have quite a few in the community. In fact, a new package named PowerSeries has just recently been registered.
The two most reasonable options are TimeSeries or DataSeries. Prepending with Time suggests that the index is a Date type, to the exclusion of more general indexable types such as Integers. I like that the pandas
And of course, a time-type data structure doesn't need to be hard-coded to time either. In fact, you may have your own time type that you'd prefer to index with. This opens the door to indexing with integers. I don't think it would be a stretch for someone interested in indexing with Integers to say to themselves, "hey, I think I'll use TimeSeries and simply substitute the time index with integers".
Using DataSeries as a name appears to solve some of these semantic issues. It does suggest a close affinity to the DataFrames/DataArrays data structure, which is where I start to get a little wary. I think the package should stand on its own and have as small a list of packages as possible in the REQUIRE file (as in zero, ideally). |
Thoughts about removing the dependency on DataFrames/DataArrays. The data structure for serialized data is different enough from the table-centered DataFrames/DataArrays that it should stand alone. This does lose the advantage of leveraging existing code, but I feel this distinction is important enough to forgo that benefit.
There is also the important issue of a bloated REQUIRE file. What if I simply want to use TimeSeries but not DataFrames? With the dependency in REQUIRE in place, I'd be required to keep up to date not just on the TimeSeries package, but also on DataFrames and DataArrays. And I would unnecessarily be bringing all that code into my project.
An example of how this affects packages downstream is the MarketTechnicals package, a library of technical analysis methods. Ideally, this package would only use a TimeSeries type for its methods. Sure, if you prefer to use a DataFrame there should certainly be at the very least a
To convert between a Time/DataSeries data structure and the DataFrames/DataArrays structure, I think a separate package would be the best option. This gives users who want this criss-cross the option, but doesn't force it upon those who don't. |
Just thinking aloud here. How about structuring TimeSeries with two separate branches? One branch named
Whichever branch floats to the top as being used the most gets the honor of being named |
@cgroll I share your concern that a DataFrame with a first column Date type is not robust enough for time series. It was an early stage solution. I'm not really convinced that most Julia users who work with time series prefer it either. It's just that there isn't an alternative yet. Once an alternative data structure is available, I suspect that using time series with DataFrames will be a fringe case. |
FWIW, when we originally designed the Julia DataFrames, the goal was very
Other thoughts: Are you thinking the same or different data structures for
|
Hey @milktrader, I may be wrong, but I believe your REQUIRE file is meant to keep only those dependencies for the current/master version of your package. Dependencies for older versions of the package can be handled through the METADATA versions folder, where you specify the SHA1 and the specific dependencies for that release version. |
@HarlanH currently, I've only braved a single column of data associated with the I think Datetime will sort out issues related to irregular rows. For example, suppose you have daily data. You'd like to collapse it to weekly data and have the last day of the week as the date element and the highest value during the week as the value element. Further suppose that you have one week that ends on a Thursday, unlike others that end on a Friday. The |
@karbarcca yes indeed, I've conflated REQUIRE with calls to |
ie, there are 6 rows in |
Here, this might be better:
julia> clw[dayofweek(index(clw)).==4]
6-element Array{SeriesPair{Date{ISOCalendar},Float64},1}:
1980-04-03 102.1500
1980-07-03 117.4600
1981-04-16 134.7000
1981-07-02 128.6400
1981-12-24 122.5400
1981-12-31 122.5500 |
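The daily-to-weekly collapse described above (last trading day of the week as the date, highest value of the week as the value) can be sketched in current Julia as below. This is only an illustration under stated assumptions: `Obs` and `collapse_weekly` are hypothetical names, the `Dates` standard library stands in for Datetime.jl, weeks are taken to start on Monday, and the prices are placeholder values. It is not how Series.jl implements the operation.

```julia
using Dates

# Hypothetical stand-in for a (date, value) row; not the Series.jl type.
struct Obs
    index::Date
    value::Float64
end

# Collapse daily rows to one row per Monday-based week. Each weekly row
# carries the last date observed in that week and the week's highest value,
# so a week that ends on a Thursday keeps its Thursday date.
function collapse_weekly(obs::Vector{Obs})
    isempty(obs) && return Obs[]
    sorted = sort(obs, by = o -> o.index)
    out = Obs[]
    current = firstdayofweek(sorted[1].index)  # Monday of the first week
    last_date = sorted[1].index
    high = sorted[1].value
    for o in sorted[2:end]
        wk = firstdayofweek(o.index)
        if wk != current
            # A new week has started: emit the finished week.
            push!(out, Obs(last_date, high))
            current = wk
            high = o.value
        else
            high = max(high, o.value)
        end
        last_date = o.index
    end
    push!(out, Obs(last_date, high))  # emit the final week
    return out
end

# Placeholder prices. The first week ends on Thursday 1980-04-03
# because there is no row for Friday 1980-04-04.
daily = [
    Obs(Date(1980, 3, 31), 100.0),
    Obs(Date(1980, 4, 1), 103.0),
    Obs(Date(1980, 4, 2), 101.0),
    Obs(Date(1980, 4, 3), 102.0),
    Obs(Date(1980, 4, 7), 99.0),
    Obs(Date(1980, 4, 8), 104.0),
    Obs(Date(1980, 4, 11), 103.5),
]

weekly = collapse_weekly(daily)
thursdays = filter(o -> dayofweek(o.index) == 4, weekly)
```

Filtering the result on `dayofweek(...) == 4` is the same check as the `clw[dayofweek(index(clw)).==4]` query above: it picks out the weeks whose last observation fell on a Thursday.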
I should note that the ideal TimeSeries type would be a Julian array whose index is not an Integer, but rather a time type. The chances that Base would do this are likely near zero (though I haven't proposed the idea yet). For this to happen in Base, there would need to be either a backdoor to modify the row index type or a new TimeArray type. Theoretically one could take the C code that Array is written in and modify it to accept a time type as an alternate index to rows. I'm fairly certain this is what
I'm not sure how this would work in a package. Now that Datetime is getting integrated into Base it might be worth an effort to explore this in a branch, calling it |
From a data structures and performance point of view, I'm not sure that
I've thought about using B+ trees for persistent DataFrames, with the key
|
Yes, I need to do more than a wiki review of this topic. It sounds very interesting. Any texts that you would recommend? |
I'm not up-to-date enough on this to recommend texts, sorry! Thinking through something like an AbstractTimeSeries and its possible
|
Also. After writing up some sort of AbstractTimeSeries spec, and analyzing
|
I've pushed the current Series.jl repo to the |
If you want to index arrays using objects of an arbitrary type, then I think what you want is a |
I can't really keep up with this conversation (so feel free to ignore what I'm saying), but I would not use a I think @HarlanH's original point was dead on: only worry about desired behaviors in the first pass and then get advice from the broader community about implementation. It's very easy to get the implementation wrong if you think you're going to support behaviors you don't need. |
I think I understand why
julia> using Datetime, NamedArrays
julia> n = NamedArray(rand(2,4));
julia> dates = [today(), today()+days(1)]
2-element Array{Date{ISOCalendar},1}:
2014-01-27
2014-01-28
julia> setnames!(n, dates, 1)
[2014-01-27=>1,2014-01-28=>2]
Is it because there is a mapping between |
My point was much simpler (and potentially way less important): hashing costs a lot more than indexing using something with a trivial transformation into numbers. For constant interval time series, you can probably compute indices using something like
The only point I'm sure of is that implementation shouldn't be the main concern: start with behaviors. |
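The hashing-versus-arithmetic point above can be illustrated with a small sketch. The expression the comment had in mind is not preserved in the thread, so everything here is an assumption: `RegularDailySeries` and `offset_index` are hypothetical names, the series is assumed to have exactly one value per calendar day, and the `Dates` standard library is used in place of Datetime.jl.

```julia
using Dates

# Hypothetical container for a constant-interval (daily) series:
# only the start date is stored, positions are derived from it.
struct RegularDailySeries
    start::Date
    values::Vector{Float64}
end

# Turn a date into a 1-based position with plain subtraction, no hashing.
function offset_index(s::RegularDailySeries, d::Date)
    i = Dates.value(d - s.start) + 1  # whole days since start, plus one
    1 <= i <= length(s.values) || throw(KeyError(d))
    return i
end

Base.getindex(s::RegularDailySeries, d::Date) = s.values[offset_index(s, d)]

start = Date(2014, 1, 27)
s = RegularDailySeries(start, [1.0, 2.0, 3.0, 4.0])

# The hash-based alternative: an explicit date => position mapping,
# like the one NamedArrays printed in the earlier comment.
dates = [start + Day(k) for k in 0:3]
lookup = Dict(d => i for (i, d) in enumerate(dates))

d = Date(2014, 1, 29)
by_arithmetic = s[d]               # position computed from the date
by_hash = s.values[lookup[d]]      # position looked up in the Dict
```

Both routes return the same element. The arithmetic route only works when the spacing is truly constant, which is why irregular series (for example, trading days with holidays) would still need a lookup structure or a search over a sorted index.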
Good point. #7 |
I'd like to check off the question of whether TimeSeries and TimeModels should be one or two packages, and mark it as settled: they should be separate. Early R packages combined the two, but as the language matured that approach was abandoned in favor of bifurcation. @carljv has a good argument in favor of this at the beginning of the issue. I'll check it off as settled in a few days, unless we have some objections. |
Also on the short list of check offs is the name TimeSeries. I think I'm about the only one that was reticent about this name and I'm no longer. I'll let it percolate for a little while. |
Just some notes about naming things and how the pieces fit together (subject to change of course)
|
I'm going to close this as the checklist at the top is complete. It might be better to continue any ideas, thoughts, suggestions at the TimeSeries issues location. Thanks for all the input! |
Great idea to centralize issues around package goals.
NA
in new type => Yes, it should be supported{ISOCalendar}
for now