# SOAM Quickstart
How to build an end-to-end project using SOAM modules and tools.

![soam_pipeline](documentation/images/SoaM_diagram.png)


The library's pipeline supports any data source.
The process is structured in four stages:
* Extraction: manages the connection to the database, the time granularity, and the aggregation level of the input data.
* Preprocessing: offers out-of-the-box tools for standard tasks such as normalization or filling NaN values.
* Forecasting: fits a model and predicts results.
* Postprocessing: modifies the results based on business or real-world information, or builds analyses on the predicted values,
 such as anomaly detection.
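
To make the flow between stages concrete, here is a minimal sketch that chains four plain pandas functions in the same order. It does not use SOAM's API: the function names, the toy series, the interpolation step, the naive mean forecast, and the floor rule are all illustrative assumptions.

```python
import pandas as pd


def extract() -> pd.DataFrame:
    # Stand-in for a database query: a made-up daily series with one gap.
    return pd.DataFrame(
        {
            "ds": pd.date_range("2021-03-01", periods=6, freq="D"),
            "y": [10.0, 12.0, None, 11.0, 13.0, 12.0],
        }
    )


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Fill the missing value by linear interpolation.
    out = df.copy()
    out["y"] = out["y"].interpolate()
    return out


def forecast(df: pd.DataFrame, horizon: int = 2) -> pd.DataFrame:
    # Naive model: repeat the mean of the last three observations.
    start = df["ds"].iloc[-1] + pd.Timedelta(days=1)
    future = pd.date_range(start, periods=horizon, freq="D")
    return pd.DataFrame({"ds": future, "yhat": df["y"].tail(3).mean()})


def postprocess(pred: pd.DataFrame, floor: float = 0.0) -> pd.DataFrame:
    # Hypothetical business rule: predictions cannot fall below a floor.
    out = pred.copy()
    out["yhat"] = out["yhat"].clip(lower=floor)
    return out


result = postprocess(forecast(preprocess(extract())))
print(result)
```

Each stage takes a data frame and returns a new one, so any single stage can be swapped out without touching the others.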


## Extraction

This stage extracts data from the required sources to build the condensed dataset used by the later stages. The details tend to be project-dependent.

### Establish the connection with the database

In [32]:
from soam.workflow.time_series_extractor import TimeSeriesExtractor
from muttlib.dbconn import get_client

In [67]:
pg_cfg = {
    "host": "localhost",
    "port": 5432,
    "db_type": "postgres",
    "username": "mutt",
    "password": "mutt",
    "database": "sqlalchemy"
}
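
Hard-coded credentials are fine for a local demo, but outside a quickstart you may prefer to read them from the environment. The sketch below builds the same dictionary that way. The environment variable names (`PG_HOST`, `PG_PORT`, and so on) are hypothetical, and the fallback values are the local defaults used in this quickstart.

```python
import os

# Read connection settings from the environment, falling back to the
# local defaults of this quickstart. The variable names are hypothetical;
# use whatever names your deployment defines.
pg_cfg = {
    "host": os.environ.get("PG_HOST", "localhost"),
    "port": int(os.environ.get("PG_PORT", "5432")),
    "db_type": "postgres",
    "username": os.environ.get("PG_USER", "mutt"),
    "password": os.environ.get("PG_PASSWORD", "mutt"),
    "database": os.environ.get("PG_DATABASE", "sqlalchemy"),
}
```

The keys match the configuration above, so the resulting dictionary can be passed to `get_client` in the same way.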

In [68]:
# get_client returns a tuple; index 1 holds the client instance
pg_client = get_client(pg_cfg)[1]

In [69]:
pg_client

<muttlib.dbconn.postgres.PgClient at 0x7f1969940040>

In [71]:
Extractor = TimeSeriesExtractor(db=pg_client, table_name='stocks_valuation')

#### Define the time granularity and aggregation level

The extractor then converts the full dataset to the desired time granularity, aggregates it by one or more categorical attributes, and returns it as a pandas data frame.
In this case we define the following:
- Time granularity:
    - Start date: 2021-03-01
    - End date: 2021-03-20
- Aggregation level:
    - Keep only Apple's (AAPL) stock information.

In [94]:
Extractor.extract(build_query_kwargs={
    'columns': '*',
    'timestamp_col': 'date',
    'start_date': "2021-03-01",
    'end_date': "2021-03-20",
    'extra_where_conditions': ["symbol = 'AAPL'"]
})

| Unnamed: 0 | index | date | symbol | avg_num_trades | avg_price |
|---|---|---|---|---|---|
| 0 | 0 | 2021-03-18 | AAPL | 84353.996528 | 121.75 |
| 1 | 1 | 2021-03-17 | AAPL | 77730.997222 | 124.09795 |
| 2 | 2 | 2021-03-16 | AAPL | 80019.4 | 125.9675 |
| 3 | 3 | 2021-03-15 | AAPL | 64298.996528 | 122.21 |
| 4 | 4 | 2021-03-12 | AAPL | 61184.0625 | 120.165 |
| 5 | 5 | 2021-03-11 | AAPL | 71546.190278 | 122.235 |
| 6 | 6 | 2021-03-10 | AAPL | 77738.420833 | 120.81 |
| 7 | 7 | 2021-03-09 | AAPL | 89948.458333 | 120.425 |
| 8 | 8 | 2021-03-08 | AAPL | 107205.979167 | 118.605 |
| 9 | 9 | 2021-03-05 | AAPL | 106782.361806 | 119.7525 |
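
To see what these keyword arguments amount to, the sketch below runs a comparable filter against an in-memory SQLite table. The SQL that the extractor actually generates is not shown in this quickstart, so the query text, the inclusive date bounds, and the descending sort are assumptions. The AAPL rows are copied from the output above; the `XYZ` row is made up so that the symbol filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stocks_valuation "
    "(date TEXT, symbol TEXT, avg_num_trades REAL, avg_price REAL)"
)
rows = [
    ("2021-03-18", "AAPL", 84353.996528, 121.75),
    ("2021-03-17", "AAPL", 77730.997222, 124.09795),
    ("2021-03-16", "AAPL", 80019.4, 125.9675),
    ("2021-03-17", "XYZ", 1.0, 1.0),  # hypothetical row, not real data
]
conn.executemany("INSERT INTO stocks_valuation VALUES (?, ?, ?, ?)", rows)

# Assumed shape of the query: date range on the timestamp column plus
# the extra WHERE condition on the symbol.
query = """
    SELECT *
    FROM stocks_valuation
    WHERE date >= ? AND date <= ?
      AND symbol = 'AAPL'
    ORDER BY date DESC
"""
result = conn.execute(query, ("2021-03-01", "2021-03-20")).fetchall()
conn.close()

for row in result:
    print(row)
```

ISO-formatted date strings sort correctly as text, which is why the range comparison works here without a dedicated date type.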


Store the query result in a <b>pandas dataframe</b> to make the data easier to manipulate.

In [98]:
import pandas as pd
df = Extractor.extract(build_query_kwargs={
    'columns': '*',
    'timestamp_col': 'date',
    'start_date': "2021-03-01",
    'end_date': "2021-03-20",
    'extra_where_conditions': ["symbol = 'AAPL'"]
})

df.head()

| Unnamed: 0 | index | date | symbol | avg_num_trades | avg_price |
|---|---|---|---|---|---|
| 0 | 0 | 2021-03-18 | AAPL | 84353.996528 | 121.75 |
| 1 | 1 | 2021-03-17 | AAPL | 77730.997222 | 124.09795 |
| 2 | 2 | 2021-03-16 | AAPL | 80019.4 | 125.9675 |
| 3 | 3 | 2021-03-15 | AAPL | 64298.996528 | 122.21 |
| 4 | 4 | 2021-03-12 | AAPL | 61184.0625 | 120.165 |
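
The rows come back in descending date order and carry two index-like columns. Before moving on, it can help to tidy the frame: drop those columns, parse the dates, and sort chronologically. The sketch below does this on the five rows shown above, rebuilt by hand so it runs on its own. Whether `Unnamed: 0` and `index` are real columns in your result is an assumption, so the drop ignores missing names.

```python
import pandas as pd

# The five rows printed by df.head(), rebuilt by hand.
df = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2, 3, 4],
        "index": [0, 1, 2, 3, 4],
        "date": ["2021-03-18", "2021-03-17", "2021-03-16", "2021-03-15", "2021-03-12"],
        "symbol": ["AAPL"] * 5,
        "avg_num_trades": [84353.996528, 77730.997222, 80019.4, 64298.996528, 61184.0625],
        "avg_price": [121.75, 124.09795, 125.9675, 122.21, 120.165],
    }
)

tidy = (
    # Drop leftover index-like columns if they are present.
    df.drop(columns=["Unnamed: 0", "index"], errors="ignore")
    # Parse the date strings into timestamps.
    .assign(date=lambda d: pd.to_datetime(d["date"]))
    # Sort oldest first and use the date as the index.
    .sort_values("date")
    .set_index("date")
)
print(tidy)
```

A chronologically sorted datetime index is what most time series tooling expects as input.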


## Preprocessing