## Time Series Extractor and Slicer Introduction

This notebook shows the kinds of queries you can run with the TimeSeriesExtractor class that soam provides, such as:
- A simple query that extracts all the data.
- Temporal data filters or conditions.
- Categorical data filters or conditions.
- Aggregated fields.

It also gives a brief introduction to soam's Slicer class, which generates slices of your DataFrame.



In [1]:
from soam.workflow.time_series_extractor import TimeSeriesExtractor
from muttlib.dbconn import get_client

In [2]:
sqlite_cfg = {
    "db_type": "sqlite",
    "database": "soam_quickstart.db"
}

sqlite_client = get_client(sqlite_cfg)[1]

In [3]:
extractor = TimeSeriesExtractor(db=sqlite_client, table_name='stock')
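If the quickstart database is not at hand, a small stand-in `stock` table can be useful for experimenting with the SQL ideas in this notebook. The sketch below is an assumption-laden helper, not part of soam or muttlib: `build_stand_in_db` is a hypothetical name, the rows are copied from the outputs shown in this notebook, and the `id` values are auto-assigned by SQLite, so they will not match the real table.

```python
import sqlite3

# Rows copied from the outputs in this notebook:
# (date, symbol, avg_num_trades, avg_price)
SAMPLE_ROWS = [
    ("2021-03-01", "AAPL", 80000.0, 125.0),
    ("2021-03-02", "AAPL", 70000.0, 126.0),
    ("2021-03-03", "AAPL", 80000.0, 123.0),
    ("2021-03-04", "AAPL", 70000.0, 121.0),
    ("2021-03-05", "AAPL", 80000.0, 119.0),
    ("2021-03-01", "TSLA", 60000.0, 105.0),
    ("2021-03-02", "TSLA", 62000.0, 104.0),
    ("2021-03-03", "TSLA", 64000.0, 101.0),
    ("2021-03-04", "TSLA", 69000.0, 108.0),
    ("2021-03-05", "TSLA", 80000.0, 115.0),
]


def build_stand_in_db(path=":memory:"):
    """Create a small `stock` table (hypothetical helper, not part of soam)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE stock ("
        "id INTEGER PRIMARY KEY, date TEXT, symbol TEXT, "
        "avg_num_trades REAL, avg_price REAL)"
    )
    # `id` is left out so SQLite assigns it automatically.
    conn.executemany(
        "INSERT INTO stock (date, symbol, avg_num_trades, avg_price) "
        "VALUES (?, ?, ?, ?)",
        SAMPLE_ROWS,
    )
    conn.commit()
    return conn


conn = build_stand_in_db()
row_count = conn.execute("SELECT COUNT(*) FROM stock").fetchone()[0]
print(row_count)
```

Passing a file path instead of the default `:memory:` would persist the table to disk.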

### Query 1

A simple query that retrieves all the data from the table.

Query shape:
- build_query_kwargs: dict of {str: obj}
    - Configuration of the query used for the extraction.

In [5]:
query = {
    'columns': '*'  # select every column
}

In [6]:
df = extractor.run(build_query_kwargs=query)

df.head()

Unnamed: 0,id,date,symbol,avg_num_trades,avg_price
0,1,2021-03-01,AAPL,80000.0,125.0
1,2,2021-03-02,AAPL,70000.0,126.0
2,3,2021-03-03,AAPL,80000.0,123.0
3,4,2021-03-04,AAPL,70000.0,121.0
4,5,2021-03-05,AAPL,80000.0,119.0


### Query 2
Adding some extra conditions:
- Filter the data to retrieve only Apple's stock valuations.
- Query only a subset of the columns.
- Rename some columns with aliases.

In [7]:
query = {
    # Subset of columns; avg_price is renamed with an alias
    'columns': ['date', 'symbol', 'avg_price AS Valuation'],
    # Keep only Apple rows
    'extra_where_conditions': ["symbol = 'AAPL'"]
}

In [8]:
df = extractor.run(build_query_kwargs=query)
df.head()

Unnamed: 0,date,symbol,Valuation
0,2021-03-01,AAPL,125.0
1,2021-03-02,AAPL,126.0
2,2021-03-03,AAPL,123.0
3,2021-03-04,AAPL,121.0
4,2021-03-05,AAPL,119.0
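To make the shape of Query 2 concrete, here is a hand-written SQL approximation run with the standard-library `sqlite3` module against a tiny in-memory table. This is only a sketch: the SQL that soam actually generates may differ, and the table holds just three rows taken from the outputs in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (date TEXT, symbol TEXT, avg_price REAL)")
conn.executemany(
    "INSERT INTO stock VALUES (?, ?, ?)",
    [
        ("2021-03-01", "AAPL", 125.0),
        ("2021-03-02", "AAPL", 126.0),
        ("2021-03-01", "TSLA", 105.0),
    ],
)

# Hand-written approximation of Query 2 (an assumption, not soam's output).
sql = """
    SELECT date, symbol, avg_price AS Valuation
    FROM stock
    WHERE symbol = 'AAPL'
"""
cursor = conn.execute(sql)
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)  # the alias replaces avg_price in the result header
print(rows)          # only AAPL rows remain
```

The alias shows up as the column name in the result, which is why the DataFrame above has a Valuation column instead of avg_price.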


### Query 3
Adding some extra conditions:
- Filter the data to a range of dates.
- Order the results by date.

In [9]:
query = {
    'columns': ['date', 'symbol', 'avg_price AS Valuation'],
    'timestamp_col': 'date',  # column the date range is applied to
    'start_date': "2021-03-01",
    'end_date': "2021-03-20",
    'extra_where_conditions': ["symbol = 'AAPL'"],
    'order_by': ["date ASC"]  # oldest rows first
}

In [10]:
df = extractor.run(build_query_kwargs=query)
df.head()

Unnamed: 0,date,symbol,Valuation
0,2021-03-01,AAPL,125.0
1,2021-03-02,AAPL,126.0
2,2021-03-03,AAPL,123.0
3,2021-03-04,AAPL,121.0
4,2021-03-05,AAPL,119.0
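The date filter and ordering in Query 3 can be sketched in plain SQL as well. The block below is an approximation under stated assumptions: the upper bound is treated as inclusive (this notebook does not say how soam handles `end_date`), and a shorter end date than the notebook's is used so that the filter visibly excludes a row. Dates stored as ISO `YYYY-MM-DD` text compare correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (date TEXT, symbol TEXT, avg_price REAL)")
# Inserted out of order on purpose, so ORDER BY has something to do.
conn.executemany(
    "INSERT INTO stock VALUES (?, ?, ?)",
    [
        ("2021-03-05", "AAPL", 119.0),
        ("2021-03-01", "AAPL", 125.0),
        ("2021-03-01", "TSLA", 105.0),
        ("2021-03-02", "AAPL", 126.0),
    ],
)

# Inclusive bounds on both sides are an assumption of this sketch.
sql = """
    SELECT date, symbol, avg_price AS Valuation
    FROM stock
    WHERE symbol = 'AAPL'
      AND date >= :start_date
      AND date <= :end_date
    ORDER BY date ASC
"""
params = {"start_date": "2021-03-01", "end_date": "2021-03-04"}
rows = conn.execute(sql, params).fetchall()

for row in rows:
    print(row)
```

With these bounds the 2021-03-05 row is dropped and the remaining AAPL rows come back oldest first.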


### Query 4

Adding some aggregated data:
- Multiply the average valuation by the number of trades to obtain the transactional volume of the day.
- Group by date and symbol. The class handles the grouping implicitly, so you don't need to write it yourself.
- Filter by a certain level of volume with a having condition.

In [11]:
query = {
    # Volume is a computed column: trades times price
    'columns': ['date', 'symbol', 'avg_num_trades * avg_price AS Volume'],
    'dimensions': ['date', 'symbol'],  # columns to group by
    'timestamp_col': 'date',
    'start_date': "2021-03-01",
    'end_date': "2021-03-20",
    'order_by': ["date ASC"],
    'extra_having_conditions': ['Volume > 1000000']  # filter on the computed column
}

In [12]:
df = extractor.run(build_query_kwargs=query)
df.head()

Unnamed: 0,date,symbol,Volume
0,2021-03-01,AAPL,10000000.0
1,2021-03-01,TSLA,6300000.0
2,2021-03-02,AAPL,8820000.0
3,2021-03-02,TSLA,6448000.0
4,2021-03-03,AAPL,9840000.0
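As a rough illustration of what Query 4 computes, the sketch below runs a hand-written SQL equivalent on six rows taken from this notebook's outputs. It is not the SQL soam generates. Two details are assumptions of the sketch: the secondary `symbol ASC` ordering is added only to make ties on the same date deterministic, and referencing the `Volume` alias inside `HAVING` relies on SQLite accepting result-column aliases there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock "
    "(date TEXT, symbol TEXT, avg_num_trades REAL, avg_price REAL)"
)
conn.executemany(
    "INSERT INTO stock VALUES (?, ?, ?, ?)",
    [
        ("2021-03-01", "AAPL", 80000.0, 125.0),
        ("2021-03-02", "AAPL", 70000.0, 126.0),
        ("2021-03-03", "AAPL", 80000.0, 123.0),
        ("2021-03-01", "TSLA", 60000.0, 105.0),
        ("2021-03-02", "TSLA", 62000.0, 104.0),
        ("2021-03-03", "TSLA", 64000.0, 101.0),
    ],
)

# Hand-written approximation of Query 4 (an assumption, not soam's output).
sql = """
    SELECT date, symbol, avg_num_trades * avg_price AS Volume
    FROM stock
    WHERE date >= '2021-03-01' AND date <= '2021-03-20'
    GROUP BY date, symbol
    HAVING Volume > 1000000
    ORDER BY date ASC, symbol ASC
"""
rows = conn.execute(sql).fetchall()

for row in rows:
    print(row)
```

The first rows match the output above: 80000 trades at 125.0 give a volume of 10000000.0 for AAPL on 2021-03-01, and 60000 trades at 105.0 give 6300000.0 for TSLA.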


### Query 5

Adding an aggregate function:
- Retrieve the day with the largest transactional volume for each company.

In [13]:
query = {
    # max() picks the largest daily volume within each group
    'columns': ['date', 'symbol', 'max(avg_num_trades * avg_price) AS Max_Volume'],
    'dimensions': ['symbol']  # one row per company
}

In [14]:
df = extractor.run(build_query_kwargs=query)
df.head()

Unnamed: 0,date,symbol,Max_Volume
0,2021-03-22,AAPL,21300000.0
1,2021-03-08,TSLA,10324000.0
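Query 5 selects `date` without grouping by it. Assuming the query groups only by `symbol`, the result depends on a SQLite-specific rule: when a query contains a single `max()` or `min()` aggregate, ungrouped ("bare") columns take their values from the row where the maximum or minimum was found. Other databases may reject such a query or return an arbitrary date. The sketch below demonstrates the rule on the first five rows per symbol shown in this notebook, so its results differ from the full-table output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock "
    "(date TEXT, symbol TEXT, avg_num_trades REAL, avg_price REAL)"
)
conn.executemany(
    "INSERT INTO stock VALUES (?, ?, ?, ?)",
    [
        ("2021-03-01", "AAPL", 80000.0, 125.0),
        ("2021-03-02", "AAPL", 70000.0, 126.0),
        ("2021-03-03", "AAPL", 80000.0, 123.0),
        ("2021-03-04", "AAPL", 70000.0, 121.0),
        ("2021-03-05", "AAPL", 80000.0, 119.0),
        ("2021-03-01", "TSLA", 60000.0, 105.0),
        ("2021-03-02", "TSLA", 62000.0, 104.0),
        ("2021-03-03", "TSLA", 64000.0, 101.0),
        ("2021-03-04", "TSLA", 69000.0, 108.0),
        ("2021-03-05", "TSLA", 80000.0, 115.0),
    ],
)

# `date` is a bare column: SQLite fills it from the row holding the max.
sql = """
    SELECT date, symbol, max(avg_num_trades * avg_price) AS Max_Volume
    FROM stock
    GROUP BY symbol
    ORDER BY symbol ASC
"""
rows = conn.execute(sql).fetchall()

for row in rows:
    print(row)
```

On this subset, AAPL peaks on 2021-03-01 and TSLA on 2021-03-05, and each returned date belongs to the row that produced the maximum.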


## Slicer

#### First instantiate the class:

Slice the incoming data along the given dimensions.

        Parameters
        ----------
        dimensions:
            str or list of str labels of categorical columns to slice
        metrics:
            str or list of str labels of metric columns to slice
        ds_col:
            str of datetime column
        keeps:
            str or list of str labels of columns to keep.

#### Then call the run method:

Slice the given DataFrame along the configured dimensions.

        Parameters
        ----------
        raw_df
            A pandas DataFrame containing the raw data to slice

        Returns
        -------
        list[pd.DataFrame]
            List containing the sliced DataFrames.

In [15]:
from soam.workflow.slicer import Slicer

In [16]:
query = {
    'columns': '*'
}
df = extractor.run(build_query_kwargs=query)

df.head()

Unnamed: 0,id,date,symbol,avg_num_trades,avg_price
0,1,2021-03-01,AAPL,80000.0,125.0
1,2,2021-03-02,AAPL,70000.0,126.0
2,3,2021-03-03,AAPL,80000.0,123.0
3,4,2021-03-04,AAPL,70000.0,121.0
4,5,2021-03-05,AAPL,80000.0,119.0


In [17]:
slicer = Slicer(metrics=["avg_num_trades", "avg_price"], ds_col="date", dimensions=["symbol"])

In [27]:
apple_trades, apple_price, tesla_trades, tesla_price = slicer.run(df)

In [28]:
apple_trades.head()

Unnamed: 0,date,symbol,avg_num_trades
0,2021-03-01,AAPL,80000.0
1,2021-03-02,AAPL,70000.0
2,2021-03-03,AAPL,80000.0
3,2021-03-04,AAPL,70000.0
4,2021-03-05,AAPL,80000.0


In [29]:
apple_price.head()

Unnamed: 0,date,symbol,avg_price
0,2021-03-01,AAPL,125.0
1,2021-03-02,AAPL,126.0
2,2021-03-03,AAPL,123.0
3,2021-03-04,AAPL,121.0
4,2021-03-05,AAPL,119.0


In [30]:
tesla_trades.head()

Unnamed: 0,date,symbol,avg_num_trades
22,2021-03-01,TSLA,60000.0
23,2021-03-02,TSLA,62000.0
24,2021-03-03,TSLA,64000.0
25,2021-03-04,TSLA,69000.0
26,2021-03-05,TSLA,80000.0


In [31]:
tesla_price.head()

Unnamed: 0,date,symbol,avg_price
22,2021-03-01,TSLA,105.0
23,2021-03-02,TSLA,104.0
24,2021-03-03,TSLA,101.0
25,2021-03-04,TSLA,108.0
26,2021-03-05,TSLA,115.0
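The four slices above come back in a particular order: for each value of the dimension, one slice per metric. The sketch below mimics that idea in plain Python, with lists of dicts standing in for DataFrames. `slice_rows` is a hypothetical function written for illustration; it is not soam's implementation, and the ordering is inferred only from the unpacking shown in this notebook.

```python
from itertools import product

# A few rows from the outputs above, as dicts instead of a DataFrame.
rows = [
    {"date": "2021-03-01", "symbol": "AAPL", "avg_num_trades": 80000.0, "avg_price": 125.0},
    {"date": "2021-03-02", "symbol": "AAPL", "avg_num_trades": 70000.0, "avg_price": 126.0},
    {"date": "2021-03-01", "symbol": "TSLA", "avg_num_trades": 60000.0, "avg_price": 105.0},
    {"date": "2021-03-02", "symbol": "TSLA", "avg_num_trades": 62000.0, "avg_price": 104.0},
]


def slice_rows(rows, dimension, metrics, ds_col):
    """Hypothetical stand-in for the slicing idea; not soam's implementation."""
    # Keep dimension values in first-seen order (AAPL before TSLA here).
    values = list(dict.fromkeys(row[dimension] for row in rows))
    slices = []
    # One slice per (dimension value, metric) pair.
    for value, metric in product(values, metrics):
        slices.append(
            [
                {ds_col: row[ds_col], dimension: row[dimension], metric: row[metric]}
                for row in rows
                if row[dimension] == value
            ]
        )
    return slices


apple_trades, apple_price, tesla_trades, tesla_price = slice_rows(
    rows, dimension="symbol", metrics=["avg_num_trades", "avg_price"], ds_col="date"
)
print(apple_price)
print(tesla_trades)
```

Each slice keeps only the date column, the dimension column, and a single metric, which mirrors the three-column outputs shown above.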
