# Analytical Dataset (ADS)

An `ADS` is a table created for a specific analytic purpose. The idea is to merge different data sources so that all available information about the objects of interest (most often the clients) is in one place. This data is then condensed using sliding time windows.

1. The core of an `ADS` is a sliding window for each time period (e.g. 1 week).
2. The `ADS` contains one row per observation per time period (here, one row per observation per week).
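
The two points above can be illustrated with a minimal sketch. The client ids and window end dates below are made-up placeholders (not values from any real data source), and the names `clients`, `window_ends` and `ads_skeleton` are hypothetical. The sketch only shows the shape of the table: one row for every combination of window and observation.

```python
from itertools import product

# Hypothetical toy inputs: two clients and three weekly window end dates.
clients = ["client_a", "client_b"]
window_ends = ["2024-01-08", "2024-01-15", "2024-01-22"]

# One row per (window, client) pair; feature columns would be added later.
ads_skeleton = [
    {"end_obs_date": end, "client_id": client}
    for end, client in product(window_ends, clients)
]

for row in ads_skeleton:
    print(row)
```

With 3 windows and 2 clients the skeleton has 6 rows, whether or not a client was active in a given window.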

![graph](https://i.imgur.com/ojRMfB0.png)

The advantages of an `ADS` are as follows:
- it combines all data sources into one table
- future ML models can be based on one table
- time slices (weekly, monthly) smooth out short-term fluctuations in the data
- it serves as an aggregation layer for reporting
- batch scoring (weekly, monthly) is easy to implement
- new data sources can simply be added in the future using joins
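
As an illustration of the last point, here is a rough sketch of attaching a new source to an existing `ADS` with a `LEFT JOIN`. Everything in it is an assumption made for the example: the in-memory tables `ads` and `complaints`, the column `no_of_complaints`, and the toy values are hypothetical and are not part of the `northwind` database.

```python
import sqlite3

# In-memory toy database; table names and values are illustrative only.
mem = sqlite3.connect(":memory:")
mem.executescript("""
CREATE TABLE ads (end_obs_date TEXT, customerid TEXT, total_price REAL);
INSERT INTO ads VALUES
  ('1997-01-01', 'ALFKI', 100.0),
  ('1997-01-01', 'ANATR', 50.0);

-- hypothetical new data source, keyed by the same window and customer
CREATE TABLE complaints (end_obs_date TEXT, customerid TEXT, no_of_complaints INTEGER);
INSERT INTO complaints VALUES ('1997-01-01', 'ALFKI', 2);
""")

# LEFT JOIN keeps every ADS row; customers missing from the new source get 0.
joined = mem.execute("""
SELECT a.end_obs_date, a.customerid, a.total_price,
       COALESCE(c.no_of_complaints, 0) AS no_of_complaints
FROM ads AS a
LEFT JOIN complaints AS c
  ON a.end_obs_date = c.end_obs_date
 AND a.customerid = c.customerid
ORDER BY a.customerid
""").fetchall()

for row in joined:
    print(row)
```

Joining on both the window label and the customer id keeps the one-row-per-observation-per-window shape intact.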

![graph](https://i.imgur.com/zXsvJ25.png)

## Connecting to the `northwind` database

In [87]:
import sqlite3
from sqlite3 import Error

In [88]:
def create_connection(path):
  con = None
  try:
    con = sqlite3.connect(database=path)
    print('Connection to SQLite DB successful.')
  except Error as e:
    print(f'The error \'{e}\' occurred.')
  
  return con

In [89]:
con = create_connection('./_data/northwind.db')

Connection to SQLite DB successful.


In [90]:
def execute_query(connection, query):
  cur = connection.cursor()
  result = None
  try:
    cur.execute(query)
    result = cur.fetchall()
  except Error as e:
    print(f'The error \'{e}\' occurred.')
  # result stays None if the query failed
  return result

In [91]:
def execute_commit(connection, commit):
  cur = connection.cursor()
  try:
    cur.execute(commit)
    connection.commit()
    print('Query executed successfully.')
  except Error as e:
    print(f'The error \'{e}\' occurred.')

In [92]:
query_count = """ 
SELECT COUNT(*) FROM orders
"""

query_min = """ 
SELECT MIN(orderdate) FROM orders
"""

query_max = """ 
SELECT MAX(orderdate) FROM orders
"""

In [93]:
order_count = execute_query(con, query_count)
min_orderdate = execute_query(con, query_min)
max_orderdate = execute_query(con, query_max)
print(f'order count: {order_count[0][0]}\nmin order date: {min_orderdate[0][0]}\nmax order date: {max_orderdate[0][0]}')

order count: 830
min order date: 1996-07-04
max order date: 1998-05-06


There are 830 orders ranging from `1996-07-04` to `1998-05-06`, so an `ADS` aggregated by month can be built from them. It is also possible to aggregate by day or week, but for this example monthly windows are sufficient.

The right window size depends on the industry. For traditional banking, 1 month may be enough. For telecommunications, 1 week can be appropriate. Industries such as e-commerce may need to aggregate per day.

In this tutorial, orders will be aggregated by month and labeled with a column called `end_obs_date` (end observation date), which is the first day of the month following the order date.

Example:
- order date: 1996-12-12 --> `end_obs_date`: 1997-01-01
- order date: 1997-01-31 --> `end_obs_date`: 1997-02-01
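
The mapping in the example can be sketched in plain Python. The function name `end_obs_date_for` is hypothetical, and the handling of an order placed exactly on the first of a month (it is assigned to the following month's label) is an assumption, since the examples do not cover that boundary.

```python
from datetime import date

def end_obs_date_for(order_date: date) -> date:
    """Return the first day of the month after order_date."""
    if order_date.month == 12:
        # December rolls over into January of the next year.
        return date(order_date.year + 1, 1, 1)
    return date(order_date.year, order_date.month + 1, 1)

print(end_obs_date_for(date(1996, 12, 12)))  # 1997-01-01
print(end_obs_date_for(date(1997, 1, 31)))   # 1997-02-01
```

In SQLite the same label can be computed with the built-in date modifiers, for example `date(orderdate, 'start of month', '+1 month')`.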


In [94]:
drop_endobsdate = """
DROP TABLE if exists end_obs_dates
"""

In [95]:
# create the 'end_obs_dates' table
create_endobsdate = """ 
CREATE TABLE end_obs_dates
AS

WITH RECURSIVE
  cnt(x) AS (
    -- count begins at 0
    SELECT 0
    -- combine with value below, including duplicates
    UNION ALL
    -- count iterates by +1 for every recursion
    SELECT x+1 FROM cnt
    -- recursion ends once the number of rows reaches the limit below:
    -- the difference between end and start date in days, divided by 30 to approximate the number of months, plus 1, then rounded
    LIMIT (SELECT ROUND(((julianday('1998-06-01') - julianday('1996-08-01'))/30) + 1))
    -- each x becomes one row: the start date shifted by x months, stored in column 'end_obs_date' of table 'end_obs_dates'
    ) SELECT date('1996-08-01', '+' || x || ' month') AS end_obs_date FROM cnt
"""

In [96]:
execute_commit(con, drop_endobsdate) # dropping the table first lets the CREATE TABLE below be re-run without an 'already exists' error
execute_commit(con, create_endobsdate)

Query executed successfully.
Query executed successfully.


In [101]:
test_query_endobsdate = """ 
SELECT * FROM end_obs_dates
"""

dates = execute_query(con, test_query_endobsdate)
for date in dates:
  print(date)

('1996-08-01',)
('1996-09-01',)
('1996-10-01',)
('1996-11-01',)
('1996-12-01',)
('1997-01-01',)
('1997-02-01',)
('1997-03-01',)
('1997-04-01',)
('1997-05-01',)
('1997-06-01',)
('1997-07-01',)
('1997-08-01',)
('1997-09-01',)
('1997-10-01',)
('1997-11-01',)
('1997-12-01',)
('1998-01-01',)
('1998-02-01',)
('1998-03-01',)
('1998-04-01',)
('1998-05-01',)
('1998-06-01',)
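
The `LIMIT` above relies on the approximation that a month has 30 days, which works for this range but may drift over longer ones. As an alternative sketch (not the approach used in this notebook), the recursion can step by one month and stop when the end date is reached. The CTE name `months`, the column `d` and the connection `mem` are hypothetical names chosen for the example.

```python
import sqlite3

# Recursive CTE that adds one month per step until the end date is reached.
month_series = """
WITH RECURSIVE months(d) AS (
  SELECT date(:start)
  UNION ALL
  SELECT date(d, '+1 month') FROM months WHERE d < date(:end)
)
SELECT d FROM months
"""

mem = sqlite3.connect(":memory:")
params = {"start": "1996-08-01", "end": "1998-06-01"}
series = [row[0] for row in mem.execute(month_series, params)]

print(len(series))            # 23
print(series[0], series[-1])  # 1996-08-01 1998-06-01
```

Because ISO-formatted dates sort correctly as text, the `d < date(:end)` comparison is enough to end the recursion on the last month.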


In [102]:
drop_adspopulationhist = """ 
DROP TABLE if exists ads_population_hist
"""

In [103]:
create_adspopulationhist = """ 
CREATE TABLE ads_population_hist
AS
SELECT
  A.*,
  B.*
FROM end_obs_dates AS A
CROSS JOIN (
  SELECT DISTINCT customerid FROM customers) AS B
"""

In [104]:
execute_commit(con, drop_adspopulationhist)
execute_commit(con, create_adspopulationhist)

Query executed successfully.
Query executed successfully.


In [105]:
test_query_adspopulationhist = """ 
SELECT * FROM ads_population_hist
"""

hists = execute_query(con, test_query_adspopulationhist)
for hist in hists:
  print(hist)

('1996-08-01', 'ALFKI')
('1996-08-01', 'ANATR')
('1996-08-01', 'ANTON')
('1996-08-01', 'AROUT')
('1996-08-01', 'BERGS')
('1996-08-01', 'BLAUS')
('1996-08-01', 'BLONP')
('1996-08-01', 'BOLID')
('1996-08-01', 'BONAP')
('1996-08-01', 'BOTTM')
('1996-08-01', 'BSBEV')
('1996-08-01', 'CACTU')
('1996-08-01', 'CENTC')
('1996-08-01', 'CHOPS')
('1996-08-01', 'COMMI')
('1996-08-01', 'CONSH')
('1996-08-01', 'DRACD')
('1996-08-01', 'DUMON')
('1996-08-01', 'EASTC')
('1996-08-01', 'ERNSH')
('1996-08-01', 'FAMIA')
('1996-08-01', 'FISSA')
('1996-08-01', 'FOLIG')
('1996-08-01', 'FOLKO')
('1996-08-01', 'FRANK')
('1996-08-01', 'FRANR')
('1996-08-01', 'FRANS')
('1996-08-01', 'FURIB')
('1996-08-01', 'GALED')
('1996-08-01', 'GODOS')
('1996-08-01', 'GOURL')
('1996-08-01', 'GREAL')
('1996-08-01', 'GROSR')
('1996-08-01', 'HANAR')
('1996-08-01', 'HILAA')
('1996-08-01', 'HUNGC')
('1996-08-01', 'HUNGO')
('1996-08-01', 'ISLAT')
('1996-08-01', 'KOENE')
('1996-08-01', 'LACOR')
('1996-08-01', 'LAMAI')
...
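
Printing every row of the cross join produces a long listing, so a row count can be a quicker sanity check: the table should contain exactly (number of dates) x (number of distinct customers) rows. The sketch below runs against a small in-memory stand-in with three dates and two customers, which is an assumption made to keep the example self-contained. The same three `COUNT` queries could be run against the real connection.

```python
import sqlite3

# Toy stand-in for the real tables: 3 dates and 2 customers.
mem = sqlite3.connect(":memory:")
mem.executescript("""
CREATE TABLE end_obs_dates (end_obs_date TEXT);
INSERT INTO end_obs_dates VALUES ('1996-08-01'), ('1996-09-01'), ('1996-10-01');

CREATE TABLE customers (customerid TEXT);
INSERT INTO customers VALUES ('ALFKI'), ('ANATR');

CREATE TABLE ads_population_hist AS
SELECT A.*, B.*
FROM end_obs_dates AS A
CROSS JOIN (SELECT DISTINCT customerid FROM customers) AS B;
""")

def scalar(connection, query):
    """Return the single value produced by a one-row, one-column query."""
    return connection.execute(query).fetchone()[0]

n_dates = scalar(mem, "SELECT COUNT(*) FROM end_obs_dates")
n_customers = scalar(mem, "SELECT COUNT(DISTINCT customerid) FROM customers")
n_rows = scalar(mem, "SELECT COUNT(*) FROM ads_population_hist")

print(n_dates, n_customers, n_rows)  # 3 2 6
```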

The primary goal is to create a table that holds all important information about the clients.
To do this, the following variables will be created, each aggregated monthly:
- noofitems
- noofdistinct_orders
- total_price

First, compute the additional attribute `totalprice_for_product` as `unitprice * quantity`.

In [107]:
query_order_details = """
SELECT *, unitprice*quantity AS totalprice_for_product
FROM "Order Details"
LIMIT 20
"""

execute_query(con, query_order_details)

[(10248, 11, 14.0, 12, 0.0, 168.0),
 (10248, 42, 9.8, 10, 0.0, 98.0),
 (10248, 72, 34.8, 5, 0.0, 174.0),
 (10249, 14, 18.6, 9, 0.0, 167.4),
 (10249, 51, 42.4, 40, 0.0, 1696.0),
 (10250, 41, 7.7, 10, 0.0, 77.0),
 (10250, 51, 42.4, 35, 0.15, 1484.0),
 (10250, 65, 16.8, 15, 0.15, 252.0),
 (10251, 22, 16.8, 6, 0.05, 100.80000000000001),
 (10251, 57, 15.6, 15, 0.05, 234.0),
 (10251, 65, 16.8, 20, 0.0, 336.0),
 (10252, 20, 64.8, 40, 0.05, 2592.0),
 (10252, 33, 2.0, 25, 0.05, 50.0),
 (10252, 60, 27.2, 40, 0.0, 1088.0),
 (10253, 31, 10.0, 20, 0.0, 200.0),
 (10253, 39, 14.4, 42, 0.0, 604.8000000000001),
 (10253, 49, 16.0, 40, 0.0, 640.0),
 (10254, 24, 3.6, 15, 0.15, 54.0),
 (10254, 55, 19.2, 21, 0.15, 403.2),
 (10254, 74, 8.0, 21, 0.0, 168.0)]
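
Building on `totalprice_for_product`, one possible shape of the monthly aggregation is sketched below. It is an illustration under several assumptions rather than the final query: the customer ids `CUST1` and `CUST2` and the order dates are placeholders, `noofitems` is taken to mean the summed quantity, and discounts are ignored. The order detail rows reuse the first values from the output above.

```python
import sqlite3

mem = sqlite3.connect(":memory:")
mem.executescript("""
CREATE TABLE orders (orderid INTEGER, customerid TEXT, orderdate TEXT);
-- customer ids and dates are placeholders for illustration
INSERT INTO orders VALUES
  (10248, 'CUST1', '1996-07-04'),
  (10249, 'CUST1', '1996-07-05'),
  (10250, 'CUST2', '1996-07-08');

CREATE TABLE "Order Details" (
  orderid INTEGER, productid INTEGER, unitprice REAL, quantity INTEGER, discount REAL);
INSERT INTO "Order Details" VALUES
  (10248, 11, 14.0, 12, 0.0), (10248, 42, 9.8, 10, 0.0), (10248, 72, 34.8, 5, 0.0),
  (10249, 14, 18.6, 9, 0.0),  (10249, 51, 42.4, 40, 0.0),
  (10250, 41, 7.7, 10, 0.0),  (10250, 51, 42.4, 35, 0.15), (10250, 65, 16.8, 15, 0.15);
""")

# Label each order with the first day of the following month, then aggregate.
monthly = mem.execute("""
SELECT
  o.customerid,
  date(o.orderdate, 'start of month', '+1 month') AS end_obs_date,
  SUM(d.quantity)                        AS noofitems,
  COUNT(DISTINCT o.orderid)              AS noofdistinct_orders,
  ROUND(SUM(d.unitprice * d.quantity), 2) AS total_price
FROM orders AS o
JOIN "Order Details" AS d ON o.orderid = d.orderid
GROUP BY o.customerid, end_obs_date
ORDER BY o.customerid, end_obs_date
""").fetchall()

for row in monthly:
    print(row)
```

The result has one row per customer and `end_obs_date`, which is the grain needed to join it onto `ads_population_hist`.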