# Time Series I/O Library
## Guided Tour

The Time Series I/O package provides a basic time series model, represented by the **TimeSeries** class. A TimeSeries object holds three things: a name, a *dict* of "attributes" (metadata about the time series), and a pandas.Series of (date, value) pairs. Here is an example of creating a TimeSeries:

In [None]:
import pandas as pd
from tsio import TimeSeries

attributes = {
             "TYPE": "TEST",
             "CODE": "000001",
             "ISSUE_DATE": pd.to_datetime("2018-01-01")
             }
dates = list(map(pd.to_datetime, ["2018-01-01", "2018-02-01", "2018-03-01"]))
values = [1, 2, 3]
dated_values = pd.Series(index=dates, data=values)

timeseries = TimeSeries("Example Time Series")
timeseries.update_attributes(attributes)
timeseries.update_values(dated_values)

In [None]:
# Check content of the Time Series.
print(timeseries.ts_name)
print("\n" + "*"*20 + "\n")
print(timeseries.ts_attributes)
print("\n" + "*"*20 + "\n")
print(timeseries.ts_values)
print("\n" + "*"*20 + "\n")

# When the whole Time Series is printed, it shows all of its content.
print(timeseries)

There is not much more to say about the TimeSeries object: it is just a data holder. The purpose of this library is to make it easy to read TimeSeries from, and write them to, a MongoDB collection. To set up a local MongoDB instance, make sure you have MongoDB installed (```pacman -S mongodb``` on Arch Linux).

To start a MongoDB instance:
```mongod --dbpath <path/to/db_directory>```

Then we make use of the **DBIO** class:

In [None]:
from tsio import DBIO

db = DBIO(host_address="localhost", db_name="test", collection_name="test")

# Writing to the database.
db.write(timeseries)

In [None]:
# Selecting from database based on attributes.
ts_selection = db.select(ISSUE_DATE=pd.to_datetime("2018-01-01"))
print(ts_selection)
print("Type of the returned object: " + str(type(ts_selection)))

The 'select' operation above actually returns a TimeSeriesCollection object. This object is basically an
OrderedDict of TimeSeries objects, indexed by names (the ts_name attribute of each TimeSeries).

**Note**: This means that two TimeSeries with the same ts_name should not be added to the same TimeSeriesCollection,
unless you want the new one to replace the existing TimeSeries.
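
The replacement behaviour follows directly from the mapping semantics described above. The snippet below is a minimal sketch that uses a plain `OrderedDict` and a hypothetical `FakeSeries` stand-in, not the library's TimeSeriesCollection itself, to show what happens when two entries share a name.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class FakeSeries:
    """Hypothetical stand-in for TimeSeries; only the name matters here."""
    ts_name: str
    values: list = field(default_factory=list)


def add_to_collection(collection, series):
    # Keyed by name: a second series with the same name overwrites the first.
    collection[series.ts_name] = series


collection = OrderedDict()
add_to_collection(collection, FakeSeries("A", [1, 2, 3]))
add_to_collection(collection, FakeSeries("B", [4, 5, 6]))
add_to_collection(collection, FakeSeries("A", [7, 8, 9]))  # Replaces the first "A".

print(list(collection))        # Names, ordered by first insertion.
print(collection["A"].values)  # Values of the series that was added last.
```

Only two entries remain afterwards, and the entry named "A" keeps its original position while holding the most recently added values.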

In [None]:
# A TimeSeriesCollection behaves like an iterable of TimeSeries:
for ts in ts_selection:
    print(ts)

### Note:

In the result above, you can see that the "select" method returns a time series collection, but the time series
contained in it are empty. It retrieves the names of the time series, but not their content. To retrieve the content, we must
**read** the time series collection.

In [None]:
# Reading the selection.
db.read(ts_selection)
for ts in ts_selection:
    print(ts)

### Note:

If you run the cell above twice, you may notice that a "LAST_USE" attribute has been added to the time series. This happens by default, because some of the plugins we have implemented rely on it. The "LAST_USE" attribute records the last date (GMT) on which the time series was read or written. We may add a switch to disable this behaviour in the future.
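
To make the idea concrete, here is a minimal sketch of how such a timestamp could be stamped on each read or write. The `touch` helper and the plain-dict attributes are assumptions made for illustration; the library's actual implementation may differ.

```python
from datetime import datetime, timezone


def touch(attributes, now=None):
    """Return a copy of `attributes` with LAST_USE set to a UTC timestamp.

    `touch` is a hypothetical helper, not part of tsio. The optional `now`
    argument exists only to make the function easy to test.
    """
    stamped = dict(attributes)
    stamped["LAST_USE"] = now if now is not None else datetime.now(timezone.utc)
    return stamped


attributes = {"TYPE": "TEST", "CODE": "000001"}
stamped = touch(attributes)

print(sorted(stamped))             # LAST_USE now appears among the keys.
print(stamped["LAST_USE"].tzinfo)  # The timestamp is timezone-aware (UTC).
```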

In [None]:
# Now let us add some other time series to our database.
timeseries2 = TimeSeries("Second Example Time Series")
timeseries2.update_attributes(
    {"TYPE": "TEST",
     "CODE": "000002"
    }
)
timeseries2.update_values(
    pd.Series(index=dates,  # The dates we used when defining the first timeseries.
             data=[10, 11, 12])
)

timeseries3 = TimeSeries("Third Example Time Series")
timeseries3.update_attributes(
    {"TYPE": "DIFFERENT_TYPE",  # Here we use a different type, to test the 'select' method below.
     "CODE": "000003"
    }
)
timeseries3.update_values(
    pd.Series(index=dates,  # The dates we used when defining the first timeseries.
             data=[20, 21, 22])
)

db.write([timeseries2, timeseries3])

In [None]:
# Selecting all time series in the database.
entire_db = db.select()
print(entire_db)

In [None]:
# Again, all retrieved time series are empty. We need to read them if we want their content.
for ts in entire_db:
    print(ts)

In [None]:
# Reading the time series.
db.read(entire_db)
for ts in entire_db:
    print(ts)

In [None]:
# We can also select time series by their attributes.
selection = db.select(TYPE='TEST')
# Notice that we selected only the time series with the "TYPE" attribute equal to "TEST".
db.read(selection)
for ts in selection:
    print(ts)

In [None]:
# We can also select based on multiple attribute values!
selection = db.select(TYPE=["TEST", "DIFFERENT_TYPE"])
db.read(selection)
for ts in selection:
    print(ts)
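
One plausible way to express such selections against MongoDB is to map each scalar keyword argument to an equality match and each list of values to the `$in` query operator. The `build_filter` function below is a hypothetical sketch of that mapping, and it assumes attributes are stored as top-level document fields; it is not necessarily how `select` is implemented.

```python
def build_filter(**attributes):
    """Translate keyword arguments into a MongoDB-style filter document.

    Hypothetical helper: scalars become equality matches, while lists,
    tuples and sets become `$in` clauses.
    """
    query = {}
    for key, value in attributes.items():
        if isinstance(value, (list, tuple, set)):
            query[key] = {"$in": list(value)}
        else:
            query[key] = value
    return query


print(build_filter(TYPE="TEST"))
print(build_filter(TYPE=["TEST", "DIFFERENT_TYPE"], SOURCE="YAHOO"))
```

With pymongo, a filter document of this shape would typically be passed to `Collection.find`.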

**Note:** You may be wondering whether it is possible to query only part of the TimeSeries data (e.g. only the (date, value) pairs from 2018-02-01 onwards), or to specify the granularity of the (date, value) pairs (e.g. only yearly data). Unfortunately, this is not possible for now. This library was designed for daily data stored in local databases, and reading the whole time series each time has imposed no significant overhead in our use cases. However, we do have some internal workarounds for huge datasets (intraday, millisecond data), which we intend to use to extend this library in the near future. You can expect the **read** method above to support more sophisticated queries while keeping backward compatibility.

## Generalized Reading Interface

Another feature of this library is the *Generalized Reading Interface* (**GenIO**). The **GenIO** class extends the **DBIO** class with external reading capabilities. Imagine a reading interface which can read from our database, but can also read from arbitrary data sources.

As an example, let us step into the financial world. We will build a reading interface that reads stock prices from our database and, if desired, supplements our price set with data from Yahoo Finance.

To do this, we need to implement a YahooReader class that will be passed as an argument to our **GenIO** class.

We will need the **pandas-datareader** package.

In [None]:
import pandas_datareader as pdr
from datetime import datetime

class YahooReader:
    # The time series that will be read with this class will have a "SOURCE" attribute with the value "YAHOO"
    # and a "TICKER" attribute with the value of its corresponding Yahoo Finance ticker.
    def __init__(self, dbio):
        # We give a dbio instance for the YahooReader so it can read first from the database, and use Yahoo only
        # to complete missing (recent) data.
        self.dbio = dbio
    
    def is_member(self, ts):
        # A time series belongs to this reader if its "SOURCE" attribute is "YAHOO".
        return ts.get_attribute("SOURCE") == "YAHOO"
    
    def read_attributes(self, ts_collection, **kwargs):
        # Let's skip reading attributes, for now.
        pass
    
    def read_values(self, ts_collection, **kwargs):
        # You can really configure your behaviour here. Before reading from the external source, we read from the
        # database. Then you can configure how you want to complete the database data with the external source. 
        # In this example, we take the lower bound of last quote dates for all the required
        # time series and use it as the start_date for the Yahoo query.
        self.dbio.read(ts_collection)
        for ts in ts_collection:
            print(ts)
        tickers = ts_collection.get_attributes("TICKER")
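        # last_valid_index() returns None for a series with no valid values, so we
        # fall back to a fixed start date if any series is still empty.
        last_dates = [ts.ts_values.last_valid_index() for ts in ts_collection]
        if last_dates and all(date is not None for date in last_dates):
            last_quote_date_lower_bound = min(last_dates)
        else:
            last_quote_date_lower_bound = datetime(2018, 1, 1)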
        data = pdr.get_data_yahoo(symbols=tickers, start=last_quote_date_lower_bound, end=datetime.today())
        for ts in ts_collection:
            ts.update_values(data["Adj Close"][ts.get_attribute("TICKER")])
            
        
    def read(self, ts_collection, **kwargs):
        self.read_values(ts_collection)
        

# Now we will instantiate a GenIO object using the YahooReader class above.
from tsio import GenIO
yahoo_reader = YahooReader(dbio=db)  # Give our dbio instance to the YahooReader class.
genio = GenIO(host_address='localhost', db_name='test', collection_name='test', external_interfaces=[yahoo_reader])


# Almost there. Let's add some stocks to our database.
shopify = TimeSeries("SHOPIFY")
shopify.update_attributes({"TYPE": "STOCK", "SOURCE": "YAHOO", "TICKER": "SHOP"})

baozun = TimeSeries("BAOZUN")
baozun.update_attributes({"TYPE": "STOCK", "SOURCE": "YAHOO", "TICKER": "BZUN"})

amazon = TimeSeries("AMAZON")
amazon.update_attributes({"TYPE": "STOCK", "SOURCE": "YAHOO", "TICKER": "AMZN"})

genio.write([shopify, baozun, amazon])
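
The methods on `YahooReader` suggest how a generalized interface can route work: each time series is offered to the external readers through `is_member`, and anything left unclaimed falls back to the database. The sketch below illustrates that routing with hypothetical stand-ins (`FakeSeries`, `SourceReader`, `dispatch`); it is an assumption about the general pattern, not a reproduction of the internals of **GenIO**.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSeries:
    """Hypothetical stand-in for TimeSeries."""
    ts_name: str
    attributes: dict = field(default_factory=dict)

    def get_attribute(self, key):
        return self.attributes.get(key)


class SourceReader:
    """Hypothetical reader that claims series whose SOURCE matches."""

    def __init__(self, source):
        self.source = source

    def is_member(self, ts):
        return ts.get_attribute("SOURCE") == self.source


def dispatch(series_list, readers):
    """Group series names by the first reader that claims them.

    Series that no reader claims are grouped under "DATABASE".
    """
    routed = {}
    for ts in series_list:
        owner = next((r.source for r in readers if r.is_member(ts)), "DATABASE")
        routed.setdefault(owner, []).append(ts.ts_name)
    return routed


series = [
    FakeSeries("SHOPIFY", {"SOURCE": "YAHOO"}),
    FakeSeries("Example Time Series", {"TYPE": "TEST"}),
]
routed = dispatch(series, [SourceReader("YAHOO")])
print(routed)
```

Keeping the membership test on the reader itself means new data sources can be added without changing the routing code.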

In [None]:
# Now let's read from the database + Yahoo.
# First we select our stocks.

stocks = genio.select(TYPE="STOCK")
genio.read(stocks)
for stock in stocks:
    print(stock)
    
# Let's write the new data to the database.
genio.write(stocks)

In [None]:
# The next time you read from the database, the data will be there.
# Let's use the pure database reader, dbio, again.

stocks = db.select(TYPE="STOCK")
db.read(stocks)
for stock in stocks:
    print(stock)

In [None]:
# Now you can do whatever you want with your time series.
# As a very simple example, you can plot their performances.
# Note: We need the matplotlib package here.

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

for stock in stocks:
    plt.plot(stock.ts_values.index, stock.ts_values.values/stock.ts_values.values[0] - 1, label=stock.ts_name)

plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y))) 
plt.gcf().set_size_inches(15, 7.5)
plt.legend()
plt.show()
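
The quantity plotted above is each series' cumulative return relative to its first observation, that is `value / first_value - 1`, rendered as a percentage by the `'{:.0%}'` format. Here is a small standalone sketch of the same calculation; the prices are made-up numbers for illustration, not market data.

```python
def cumulative_returns(prices):
    """Return each price's relative change versus the first price."""
    first = prices[0]
    return [price / first - 1 for price in prices]


prices = [100.0, 110.0, 99.0]  # Made-up prices for illustration only.
returns = cumulative_returns(prices)
labels = ["{:.0%}".format(r) for r in returns]
print(labels)
```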

### Check the project's [Documentation](https://lanxdev.github.io/tsio/index) for more details.