# SOAM Quickstart
How to build an end-to-end project using SOAM modules and tools.

![soam_pipeline](documentation/images/SoaM_diagram.png)


This library's pipeline supports any data source.
The process is structured in stages:
* Extraction: manages the connection with the database, the time granularity and the aggregation level of the input data.
* Preprocessing: lets you select among out-of-the-box tools to perform standard tasks such as normalization or filling NaN values.
* Forecasting: fits a model and predicts results.
* Postprocessing: modifies the results based on business or real-world information, or builds analyses on the predicted values, such as anomaly detection.


## Extraction

This stage extracts data from the needed sources to build the condensed dataset for the next steps. This tends to be project dependent.

### Establish the connection with the database

In [32]:
from soam.workflow.time_series_extractor import TimeSeriesExtractor
from muttlib.dbconn import get_client

Set up the Postgres configuration:

In [67]:
pg_cfg = {
    "host": "localhost",
    "port": 5432,
    "db_type": "postgres",
    "username": "mutt",
    "password": "mutt",
    "database": "sqlalchemy"
}

In [68]:
pg_client = get_client(pg_cfg)[1]

In [70]:
pg_client

<muttlib.dbconn.postgres.PgClient at 0x7f1969940040>

In [171]:
extractor = TimeSeriesExtractor(db=pg_client, table_name='stocks_valuation')

#### Set the time range and aggregation level
The extractor then converts the full dataset to the desired time granularity and aggregation level, grouping by one or more categorical attributes, and returns it as a pandas DataFrame.
In this case we define the following:
- Time range:
    - Start date: 2021-03-01
    - End date: 2021-03-20
- Aggregation level:
    - Keep only Apple's (AAPL) stock information.

In [314]:
build_query_kwargs={
    'columns': '*',
    'timestamp_col': 'date',
    'start_date': "2021-03-01",
    'end_date': "2021-03-20",
    'extra_where_conditions': ["symbol = 'AAPL'"]
}

In [318]:
extractor.run(build_query_kwargs=build_query_kwargs).head()

|   | index | date | symbol | avg_num_trades | avg_price |
|---|---|---|---|---|---|
| 0 | 0 | 2021-03-18 | AAPL | 84353.996528 | 121.75 |
| 1 | 1 | 2021-03-17 | AAPL | 77730.997222 | 124.09795 |
| 2 | 2 | 2021-03-16 | AAPL | 80019.4 | 125.9675 |
| 3 | 3 | 2021-03-15 | AAPL | 64298.996528 | 122.21 |
| 4 | 4 | 2021-03-12 | AAPL | 61184.0625 | 120.165 |
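
To make the filters concrete, the sketch below shows one plain-SQL reading of `build_query_kwargs`, run against an in-memory SQLite table instead of Postgres. It is not SoaM's query builder: the `sketch_query` helper, the inclusive date bounds, the reduced column set and the toy rows (two AAPL prices copied from the output above plus two made-up rows that should be filtered out) are all assumptions for illustration.

```python
import sqlite3


def sketch_query(table_name, columns, timestamp_col, start_date, end_date,
                 extra_where_conditions=()):
    """Hypothetical helper: build a SELECT with a date range and extra filters.

    Identifiers and extra conditions are interpolated as-is, so pass trusted values only.
    """
    where = [f"{timestamp_col} >= ?", f"{timestamp_col} <= ?", *extra_where_conditions]
    sql = f"SELECT {columns} FROM {table_name} WHERE " + " AND ".join(where)
    return sql, (start_date, end_date)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks_valuation (date TEXT, symbol TEXT, avg_price REAL)")
conn.executemany(
    "INSERT INTO stocks_valuation VALUES (?, ?, ?)",
    [
        ("2021-03-18", "AAPL", 121.75),
        ("2021-03-17", "AAPL", 124.09795),
        ("2021-02-26", "AAPL", 0.0),  # toy row outside the date range
        ("2021-03-18", "MSFT", 0.0),  # toy row for another symbol
    ],
)

sql, params = sketch_query(
    table_name="stocks_valuation",
    columns="*",
    timestamp_col="date",
    start_date="2021-03-01",
    end_date="2021-03-20",
    extra_where_conditions=["symbol = 'AAPL'"],
)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Only the two in-range AAPL rows come back. How the real extractor treats the boundary dates is not shown in this notebook, so check it before relying on either endpoint.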


Store the query result in a **pandas DataFrame** to make data manipulation easier.

In [330]:
import pandas as pd
df = extractor.run(build_query_kwargs = build_query_kwargs)

df.head()

|   | index | date | symbol | avg_num_trades | avg_price |
|---|---|---|---|---|---|
| 0 | 0 | 2021-03-18 | AAPL | 84353.996528 | 121.75 |
| 1 | 1 | 2021-03-17 | AAPL | 77730.997222 | 124.09795 |
| 2 | 2 | 2021-03-16 | AAPL | 80019.4 | 125.9675 |
| 3 | 3 | 2021-03-15 | AAPL | 64298.996528 | 122.21 |
| 4 | 4 | 2021-03-12 | AAPL | 61184.0625 | 120.165 |


## Preprocessing

In [331]:
from soam.workflow import Transformer

Import the MinMaxScaler from Scikit-Learn

In [332]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

Create the Transformer object and pass the scaler as the transformer parameter.

In [333]:
ts = Transformer(transformer = scaler)

We want to normalize the average price values.

The scaler expects a 2D array of shape (n_samples, n_features), so we convert the column to an array and swap the axes to get a single-column array.

In [334]:
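import numpy as np
# Wrapping the column in a list gives shape (1, n); swapping the axes gives (n, 1).
data = np.array([df.avg_price])
data = np.swapaxes(data, 0, 1)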
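
The reshape step can be checked in isolation. This sketch uses the first three `avg_price` values from the extraction output as a plain list (an assumption; the notebook passes a pandas Series) and suggests that the swap-axes approach yields the same single-column array as the more common `reshape(-1, 1)`.

```python
import numpy as np

# First three avg_price values from the extraction output, typed in by hand
prices = [121.75, 124.09795, 125.9675]

row = np.array([prices])              # shape (1, 3): one row, three columns
column = np.swapaxes(row, 0, 1)       # shape (3, 1): three rows, one column
same = np.array(prices).reshape(-1, 1)

print(row.shape, column.shape)
print(np.array_equal(column, same))
```

Either form works as scaler input; `reshape(-1, 1)` simply skips the intermediate row array.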

Run the SoaM Transformer object. It returns a tuple with the transformed array and the fitted transformer.

In [335]:
ts.run(data)

(array([[0.38075061],
        [0.66500605],
        [0.89134383],
        [0.43644068],
        [0.18886199],
        [0.43946731],
        [0.26694915],
        [0.22033898],
        [0.        ],
        [0.13892252],
        [0.30326877],
        [0.62590799],
        [1.        ],
        [0.81779661]]),
 MinMaxScaler())
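
With its default `feature_range` of (0, 1), `MinMaxScaler` maps each value x to (x - min) / (max - min), which is why the smallest price becomes 0 and the largest becomes 1 in the output above. A minimal pure-Python sketch of that formula follows. The `min_max_scale` helper is hypothetical and the input values are made up, not taken from the stock data.

```python
def min_max_scale(values):
    """Scale values linearly to [0, 1] (hypothetical helper, not part of SoaM)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


toy_prices = [10.0, 20.0, 15.0, 30.0]  # illustrative values only
scaled = min_max_scale(toy_prices)
print(scaled)  # [0.0, 0.5, 0.25, 1.0]
```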

Replace the avg_price values with the scaled ones.

In [336]:
df.avg_price = ts.run(data)[0]
df.head()

|   | index | date | symbol | avg_num_trades | avg_price |
|---|---|---|---|---|---|
| 0 | 0 | 2021-03-18 | AAPL | 84353.996528 | 0.380751 |
| 1 | 1 | 2021-03-17 | AAPL | 77730.997222 | 0.665006 |
| 2 | 2 | 2021-03-16 | AAPL | 80019.4 | 0.891344 |
| 3 | 3 | 2021-03-15 | AAPL | 64298.996528 | 0.436441 |
| 4 | 4 | 2021-03-12 | AAPL | 61184.0625 | 0.188862 |


We drop the unnecessary columns and rename the remaining ones to `ds` and `y`, the column names Prophet expects for forecasting.

In [337]:
# Select and rename in one step instead of renaming a slice in place.
df = df[['date', 'avg_price']].rename(columns={
    'date': 'ds',
    'avg_price': 'y'})
df.head()

|   | ds | y |
|---|---|---|
| 0 | 2021-03-18 | 0.380751 |
| 1 | 2021-03-17 | 0.665006 |
| 2 | 2021-03-16 | 0.891344 |
| 3 | 2021-03-15 | 0.436441 |
| 4 | 2021-03-12 | 0.188862 |
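
Prophet reads the timestamp from `ds` and the target from `y`. The sketch below rebuilds the first rows of the table above by hand (an assumption, so that it runs without the database), renames the columns, parses `ds` as datetimes and sorts chronologically. The parsing and sorting steps are optional tidying added here, not part of the original notebook.

```python
import pandas as pd

# First rows of the scaled data, typed in by hand for illustration
raw = pd.DataFrame({
    "date": ["2021-03-18", "2021-03-17", "2021-03-16"],
    "avg_price": [0.380751, 0.665006, 0.891344],
})

prophet_df = (
    raw.rename(columns={"date": "ds", "avg_price": "y"})  # names Prophet expects
       .assign(ds=lambda d: pd.to_datetime(d["ds"]))      # strings -> datetimes
       .sort_values("ds")                                 # oldest observation first
       .reset_index(drop=True)
)
print(prophet_df)
```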


## SoaMFlow

Putting it all together with SoaMFlow.

### WORK IN PROGRESS...

In [279]:
from soam.core import SoamFlow
from prefect import task

In [283]:
@task
def load_df(df):
    df.to_csv("hola.csv")

In [293]:
with SoamFlow(name="test") as test:
    df = extractor(build_query_kwargs)
    df = ts(data)
    load_df(df)

In [294]:
test.run()

[2021-03-25 18:50:23-0300] INFO - prefect.FlowRunner | Beginning Flow run for 'test'
[2021-03-25 18:50:23-0300] INFO - prefect.TaskRunner | Task 'TimeSeriesExtractor': Starting task run...
[2021-03-25 18:50:23-0300] INFO - prefect.TaskRunner | Task 'TimeSeriesExtractor': Finished task run for task with final state: 'Success'
[2021-03-25 18:50:23-0300] INFO - prefect.TaskRunner | Task 'Transformer': Starting task run...
[2021-03-25 18:50:23-0300] INFO - prefect.TaskRunner | Task 'Transformer': Finished task run for task with final state: 'Success'
[2021-03-25 18:50:23-0300] INFO - prefect.TaskRunner | Task 'load_df': Starting task run...
[2021-03-25 18:50:23-0300] ERROR - prefect.TaskRunner | Unexpected error: AttributeError("'tuple' object has no attribute 'to_csv'")
Traceback (most recent call last):
  File "/home/scafati98/MUTT/soam/quickstart_env/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/hom

<Failed: "Some reference tasks failed.">
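
The `AttributeError` in the log is consistent with the `Transformer` output seen earlier: `ts.run(data)` returned a tuple of the scaled array and the fitted scaler, and `load_df` calls `to_csv` on that tuple. One possible fix, sketched here in plain Python without Prefect or SoaM, is to unpack the tuple before saving. The `save_transformed` helper, the `y` column name and the stand-in tuple are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd


def save_transformed(result, path, column="y"):
    """Hypothetical helper: write the array half of an (array, transformer) tuple to CSV."""
    values, _fitted_transformer = result  # keep the values, ignore the fitted transformer
    frame = pd.DataFrame(values, columns=[column])
    frame.to_csv(path, index=False)
    return frame


# Stand-in for the Transformer result: (scaled values, fitted transformer)
fake_result = ([[0.38075061], [0.66500605]], "fitted-scaler-placeholder")

out_path = os.path.join(tempfile.mkdtemp(), "scaled.csv")
saved = save_transformed(fake_result, out_path)
print(pd.read_csv(out_path))
```

Inside the flow, the same unpacking would presumably live in the `load_df` task, assuming the task receives the tuple unchanged.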