# Time Series & Historical Query Analysis

Welcome to this example where we'll demonstrate how to work with large datasets in kdb+ to analyze time-series data. 

One of the key features of kdb+ is its ability to handle huge volumes of data with exceptional speed and efficiency. Whether it's reading massive datasets, performing time-based aggregations, or joining data from different sources, kdb+ excels at time-series analysis. By the end of this example, you'll have a clear understanding of how to create, manipulate, store, and analyze data using kdb+/q. Along the way, we'll introduce several key concepts that are fundamental to working with kdb+/q.


Here, we'll cover:
- Creating a large time-series dataset from scratch
- Saving this data to a database on disk
- Streamlining ingestion and saving to disk using functions
- Performing time-based aggregations to analyze trends over time
- Using asof joins (aj) to combine time-series data (e.g., matching trades to quotes)

## 1. Prerequisites

1. For setup instructions and prerequisites, please refer to the [README](README.md).
2. Ensure PyKX is properly initialized by running the cell below.<br/>
   <b>Note</b>: This is a Python cell that will enable the kernel to execute q code as the default language for all later cells.

In [1]:
import pykx as kx
kx.util.jupyter_qfirst_enable()

PyKX now running in 'jupyter_qfirst' mode. All cells by default will be run as q code. 
Include '%%py' at the beginning of each cell to run as python code. 


In [2]:
.Q.w[]

used| 2076560
heap| 67108864
peak| 67108864
wmax| 0
mmap| 0
mphy| 67436519424
syms| 3712
symw| 188782


In [3]:
system"df -mh ."

"Filesystem      Size  Used Avail Use% Mounted on"
"/dev/sdh        9.8G  792K  9.8G   1% /home/jovyan"


## 2. Create the Time Series Dataset

Let’s start by creating a sample dataset to work with. This dataset will simulate trade data over a single day, with random values for price, size, and symbols. We’ll generate 10 million rows of trade data.

In [4]:
n:10000000
day:2025.01.01
trade:([] 
    time:asc (`timestamp$day) + n?24:00:00;              / Start from midnight, spread across 24h
    sym:n?`AAPL`MSFT`GOOG`AMZN;                          / Random stock tickers
    price:n?100f;                                        / Random trade prices
    size:n?1000                                          / Random trade sizes
 )

Here's a breakdown of what's happening:
- `n:10000000` sets the number of rows we want to generate
- We define a new table with table notation `([] col1:<values>; col2:<values>; ...)`
- We use `?` to generate random values for 4 columns:
    - `time` is populated with timestamps starting from midnight and increasing across a 24-hour period, with a random offset to simulate a spread of trades.
    - `sym` is populated with random symbols like AAPL, MSFT, etc., selected from a list.
    - `price` and trade `size` are randomly generated

This table is now available in memory to investigate and query. Let's take a quick look at the row count with [`count`](https://code.kx.com/q/ref/count/), the schema details with [`meta`](https://code.kx.com/q/ref/meta/) and the first 10 rows using [`sublist`](https://code.kx.com/q/ref/sublist/).

These simple commands are essential when exploring your data quickly in kdb+/q.

In [5]:
count trade              / get row count

10000000


In [6]:
meta trade               / get table schema details - datatypes, column names etc

c    | t f a
-----| -----
time | p   s
sym  | s    
price| f    
size | j    


The following columns are produced when we run `meta`:
- c: column name
- t: <a href="https://code.kx.com/q/ref/#datatypes" target="_blank">column type</a>
- f: <a href="https://code.kx.com/q4m3/8_Tables/#85-foreign-keys-and-virtual-columns" target="_blank">foreign keys</a>
- a: <a href="https://code.kx.com/q/ref/set-attribute/" target="_blank">attributes</a> (modifiers applied for performance optimisation)

In [7]:
10 sublist trade         / get first 10 rows 

time                          sym  price    size
------------------------------------------------
2025.01.01D00:00:00.000000000 AMZN 4.037237 655 
2025.01.01D00:00:00.000000000 AMZN 13.76677 649 
2025.01.01D00:00:00.000000000 MSFT 35.25838 934 
2025.01.01D00:00:00.000000000 AMZN 22.24729 348 
2025.01.01D00:00:00.000000000 MSFT 24.44719 593 
2025.01.01D00:00:00.000000000 GOOG 60.64019 20  
2025.01.01D00:00:00.000000000 AAPL 12.23558 825 
2025.01.01D00:00:00.000000000 MSFT 31.32998 628 
2025.01.01D00:00:00.000000000 AMZN 21.48323 500 
2025.01.01D00:00:00.000000000 AMZN 17.12322 402 


## 3.  Save Data to Disk

Once the data is generated, you’ll likely want to save it to disk for persistent storage.

Because we want the ability to scale, partitioning by date will be a good approach for this dataset. Without partitioning, queries that span large time periods would require scanning entire datasets, which can be very slow and resource-intensive. By partitioning data, kdb+ can limit the query scope to the relevant partitions, significantly speeding up the process.

To partition by date we can use the inbuilt function [`.Q.dpft`](https://code.kx.com/q/ref/dotq/#dpft-save-table).


In [8]:
homeDir:getenv[`HOME]                    / Get the home directory for edu.kx.com
dbDir:homeDir,"/data"                    / Define database location as string
dbPath:hsym `$dbDir                      / Database location as hsym for file I/O

In [9]:
.z.zd:(17;2;6)                           / Compress files written to disk by default: 2^17-byte logical blocks, gzip (algorithm 2), level 6

In [10]:
.Q.dpft[dbPath;day;`sym;`trade]          / Save data as a partitioned database

trade


In the above:
- <a href="https://code.kx.com/q/ref/hsym/" target="_blank">hsym</a>: This function prefixes the directory location with a colon to make it a file handle
- <a href="https://code.kx.com/q/ref/dotq/#dpft-save-table" target="_blank">.Q.dpft[d;p;f;t]</a>: This command saves data to a <b>(d)</b>atabase location, targeting a particular <b>(p)</b>artition and indexes the data on a chosen <b>(f)</b>ield for the specified <b>(t)</b>able.
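
To make the result of `.Q.dpft` more concrete, here is a rough Python sketch of the files we would expect to find on disk after the save. It is an illustration only and does not touch the database: the helper name `partition_files` is hypothetical, and the database root `/home/jovyan/data` assumes that `HOME` is `/home/jovyan` in this environment. It relies on the fact that a partitioned table is stored as one directory per date, one file per column, a `.d` file recording column order (with the parted field first), and a shared `sym` enumeration file at the database root.

```python
from pathlib import PurePosixPath

def partition_files(db_root, date, table, columns, parted_field):
    """Return the file paths expected for one date partition of a table.

    Hypothetical helper for illustration; it only builds path strings.
    """
    part_dir = PurePosixPath(db_root) / date / table
    # The parted field is stored first, followed by the remaining columns
    ordered = [parted_field] + [c for c in columns if c != parted_field]
    # .d records the column order; each column is saved as its own file
    files = [part_dir / ".d"] + [part_dir / c for c in ordered]
    # Symbol columns are enumerated against a shared sym file at the root
    files.append(PurePosixPath(db_root) / "sym")
    return [str(f) for f in files]

# Assumed database root; adjust to match dbDir in your environment
paths = partition_files("/home/jovyan/data", "2025.01.01", "trade",
                        ["time", "sym", "price", "size"], "sym")
for p in paths:
    print(p)
```

Because each date lives in its own directory, a query restricted to one date only needs to open the column files under that directory.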

Once persisted, the table name is returned. We can test that it has worked as expected by deleting the `trade` table we have in memory and reloading the database from disk.

In [11]:
delete trade from `.                     / Delete in memory table
system"l ",dbDir                         / Load the partitioned database
meta trade                               / Check it exists

.
c    | t f a
-----| -----
date | d    
sym  | s   p
time | p    
price| f    
size | j    


kdb+ actually offers a number of different methods to store tables which will allow for efficient storage and querying for different sized datasets: flat, splayed, partitioned and segmented.

A general rule of thumb around which format to choose depends on three things:

- Will the table continue to grow at a fast rate?
- Am I working in a RAM/memory constrained environment?
- What level of performance do I want?

To learn more about these types and when to choose which <a href="https://code.kx.com/q/database/" target="_blank">see here</a>.

## 4. Scaling Data Ingestion with Functions

If you want to scale the ingestion of data to many days, it’s helpful to create a reusable function. Let’s create a function `createTrade` that generates trade data for specific dates and saves it to the database.

In [12]:
.Q.w[]

used| 2079888
heap| 1543503872
peak| 1543503872
wmax| 0
mmap| 0
mphy| 67436519424
syms| 3753
symw| 190598


In [16]:
\c 100 1000

In [18]:
createTrade:{[date]                                            / Start of function definition and input parameters
    trade::([]                                                 / Start of table definition
              time:asc (`timestamp$date) + n?24:00:00;         / Start from midnight, spread across 24h
              sym:n?`AAPL`MSFT`GOOG`AMZN;                      / Random stock symbols
              price:n?100f;                                    / Random trade prices
              size:n?1000                                      / Random trade sizes
        );                                                     / End of table definition
    .Q.dpft[dbPath;date;`sym;`trade]                           / Save data as a partitioned database
 }                                                             / End of function definition

days:day + 1 + til 30                                          / Generate a list of the next 30 dates
\t createTrade each days                                       / Execute the function for each date in the list, timing the run

In [17]:
.Q.dpft

k){[d;p;f;t;s]if[` in f,c:!+r:`. . `\:t;'`domain];if[~f in c;'f];i:<t f;r:+enxs[$;d;r;s];{[d;t;i;u;x]@[d;x;:;u t[x]i]}[d:par[d;p;t];r;i;]'[(::;`p#)f=c;c];@[d;`.d;:;f,c@&~f=c];t}[;;;;`sym]


In [16]:
.Q.w[]

used| 538954784
heap| 1811939328
peak| 1811939328
wmax| 0
mmap| 0
mphy| 67436519424
syms| 4175
symw| 216253


In [17]:
system"df -mh ."

"Filesystem      Size  Used Avail Use% Mounted on"
"/dev/sdh        9.8G  2.4G  7.5G  24% /home/jovyan"


In the above:
- The function `createTrade` generates trade data for a given date, and then saves it to disk.
- We generate data for the 30 days following our first date (2025.01.02 to 2025.01.31), using the [`til`](https://code.kx.com/q/ref/til/) keyword as a quick, handy way to generate a list of dates.
- Then we loop over the dates using [`each`](https://code.kx.com/q/wp/iterators/#map-iterators)

> **📌 Iterators** like each are the primary means of iteration in q, and in almost all cases the most efficient way to iterate. Loops are rare in q programs and are almost always candidates for optimization.

After running this function, the data will be partitioned and stored for each specific day. Again, let's delete our in-memory `trade` table and reload our database to pick up these new additions.

In [18]:
delete trade from `.                               / Delete in memory table
system"l ",dbDir                                   / Load the partitioned database
select count i by date from trade                  / Select number of records per date within the trade table

.
date      | x       
----------| --------
2025.01.01| 10000000
2025.01.02| 10000000
2025.01.03| 10000000
2025.01.04| 10000000
2025.01.05| 10000000
2025.01.06| 10000000
2025.01.07| 10000000
2025.01.08| 10000000
2025.01.09| 10000000
2025.01.10| 10000000
2025.01.11| 10000000
2025.01.12| 10000000
2025.01.13| 10000000
2025.01.14| 10000000
2025.01.15| 10000000
2025.01.16| 10000000
2025.01.17| 10000000
2025.01.18| 10000000
2025.01.19| 10000000
2025.01.20| 10000000
..


## 5. Time Series Analytics

Now that we have some data, let's dive into some basic time-series analytics.

### Total Trade Volume Every Hour for AAPL

In [19]:
select sum size 
    by date,
       60 xbar time.minute 
    from trade 
    where sym=`AAPL

date       minute| size    
-----------------| --------
2025.01.01 00:00 | 52228952
2025.01.01 01:00 | 51872751
2025.01.01 02:00 | 52064693
2025.01.01 03:00 | 52071918
2025.01.01 04:00 | 51904116
2025.01.01 05:00 | 51830218
2025.01.01 06:00 | 52034075
2025.01.01 07:00 | 52037456
2025.01.01 08:00 | 52182458
2025.01.01 09:00 | 52070061
2025.01.01 10:00 | 52082665
2025.01.01 11:00 | 51789446
2025.01.01 12:00 | 52009891
2025.01.01 13:00 | 52153188
2025.01.01 14:00 | 52155671
2025.01.01 15:00 | 51896148
2025.01.01 16:00 | 52020846
2025.01.01 17:00 | 51932273
2025.01.01 18:00 | 51828075
2025.01.01 19:00 | 52103248
..


#### qSQL & Temporal Arithmetic
Here we are using <a href="https://code.kx.com/q/basics/qsql/" target="_blank">qSQL</a>, the inbuilt table query language in kdb+. If you have used SQL, you will find the syntax of qSQL queries very similar.
- Just as in SQL, table results are retrieved using `select` and `from`, and can be filtered by expressions following a `where`
- Multiple filter criteria, separated by `,`, are evaluated starting from the left
- To group similar values together we can use the `by` clause. This is particularly useful in combination with an aggregation like `sum`, `max`, `min` etc.

kdb+/q supports several temporal types and arithmetic between them. See here for a summary of <a href="https://code.kx.com/q/ref/#datatypes" target="_blank">datatypes</a>.

In this example:
- The `time` column in the data has a type of timestamp, which includes both date and time values.
- We convert the `time` values to their minute values (including hours and minutes)
- We then aggregate further on time by using <a href="https://code.kx.com/q/ref/xbar/" target="_blank">xbar</a> to bucket the minutes into hours (60-unit buckets)
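
For readers who think more easily in Python, the bucketing step can be sketched as follows. This is a minimal illustration of the rounding rule behind `xbar` for whole numbers (round down to the nearest multiple of the bucket size), not a description of how kdb+ implements it. The sample trades are made up for the example.

```python
from collections import defaultdict

def xbar(n, x):
    """Round x down to the nearest multiple of n (integer inputs)."""
    return n * (x // n)

# Hypothetical sample data: (minute of the day, trade size)
trades = [(0, 100), (59, 50), (60, 200), (125, 10), (179, 5)]

# Equivalent in spirit to: select sum size by 60 xbar minute
volume_by_hour = defaultdict(int)
for minute, size in trades:
    volume_by_hour[xbar(60, minute)] += size

for bucket in sorted(volume_by_hour):
    print(f"{bucket // 60:02d}:{bucket % 60:02d}", volume_by_hour[bucket])
```

Minutes 0 and 59 land in the `00:00` bucket, minute 60 in `01:00`, and minutes 125 and 179 in `02:00`, which mirrors the hourly rows in the q output above.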

### Weighted Average Price and Last Trade Price Every 15 Minutes for MSFT

In [20]:
select lastPx:last price, 
       vwapPx:size wavg price
 by date, 15 xbar time.minute 
 from trade 
 where sym=`MSFT

date       minute| lastPx   vwapPx  
-----------------| -----------------
2025.01.01 00:00 | 62.99824 49.80043
2025.01.01 00:15 | 7.181633 49.96954
2025.01.01 00:30 | 53.54667 50.10013
2025.01.01 00:45 | 4.95421  49.88338
2025.01.01 01:00 | 84.04677 50.34475
2025.01.01 01:15 | 34.85096 49.42832
2025.01.01 01:30 | 6.610401 50.0637 
2025.01.01 01:45 | 90.43545 50.0949 
2025.01.01 02:00 | 5.894815 49.91792
2025.01.01 02:15 | 65.51916 50.01561
2025.01.01 02:30 | 17.56833 49.91298
2025.01.01 02:45 | 43.00724 49.83422
2025.01.01 03:00 | 25.91682 50.45375
2025.01.01 03:15 | 69.40958 49.89365
2025.01.01 03:30 | 32.29328 50.21242
2025.01.01 03:45 | 34.90826 50.23979
2025.01.01 04:00 | 63.70649 49.54437
2025.01.01 04:15 | 51.32217 50.06754
2025.01.01 04:30 | 41.98754 50.14077
2025.01.01 04:45 | 54.61311 49.98737
..


This is similar to the previous analytic, but this time we make use of the built in `wavg` function to find out the weighted average over time intervals. 

In finance, volume-weighted averages give a more accurate reflection of a stock’s price movement by incorporating trading volume at different price levels. This can be especially useful in understanding whether a price move is supported by strong market participation or is just a result of a few trades.
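
The arithmetic behind a volume-weighted average can be sketched in a few lines of Python. This is only an illustration of the formula (sum of price times size, divided by total size) with made-up numbers; the function name mirrors `wavg` for readability, and the sketch assumes the total size is non-zero.

```python
def wavg(weights, values):
    """Weighted average: sum(w * v) / sum(w). Assumes sum(weights) != 0."""
    total_weight = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total_weight

# Hypothetical trades: two small trades at low prices, one large trade at 30
prices = [10.0, 20.0, 30.0]
sizes = [100, 100, 800]

vwap = wavg(sizes, prices)                 # (1000 + 2000 + 24000) / 1000
simple_mean = sum(prices) / len(prices)    # ignores trade size

print(vwap, simple_mean)
```

Here the simple mean is 20.0 while the volume-weighted price is 27.0, because most of the traded volume changed hands at the highest price.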

Let's time this analytic with `\t` to see how long it takes, in milliseconds, to run against the full 310-million-row trade table.

In [21]:
\t select lastPx:last price, 
       vwapPx:size wavg price
   by date, 15 xbar time.minute 
   from trade 
   where sym=`MSFT

3652


The query completed in 3652 ms (about 3.7 seconds) against a trade table of 310 million records, aggregating last price (`lastPx`) and volume-weighted average price (`vwapPx`) for the MSFT trades. The `by date, 15 xbar time.minute` clause expresses the 15-minute bucketing in a single step, and the parted `sym` column within each date partition means only the MSFT rows need to be read. This demonstrates the power of kdb+/q for high-speed time-series analytics.

### SQL Comparison

A SQL version of this query above would look something like:

```

SELECT 
    (array_agg(price ORDER BY time DESC))[1] AS lastPx,
    SUM(price * size) / NULLIF(SUM(size), 0) AS vwapPx,
    DATE_TRUNC('day', time),                                            
    TRUNC(time, 'MI') + (FLOOR(TO_NUMBER(TO_CHAR(time, 'MI')) / 15) * INTERVAL '15' MINUTE) 
FROM 
    trade
WHERE 
    sym = 'MSFT'
GROUP BY 
    DATE_TRUNC('day', time), 
    TRUNC(time, 'MI') + (FLOOR(TO_NUMBER(TO_CHAR(time, 'MI')) / 15) * INTERVAL '15' MINUTE)
ORDER BY 
    DATE_TRUNC('day', time), 
    TRUNC(time, 'MI') + (FLOOR(TO_NUMBER(TO_CHAR(time, 'MI')) / 15) * INTERVAL '15' MINUTE);

```

SQL is more complex due to several factors:
- **Time-series Calculations**: The SQL version requires custom logic for common time-series calculations such as volume-weighted averages. In the qSQL version these are built in, and the syntax is more concise when working with vectors. The SQL equivalent is more verbose, leaving more room for error.
- **Grouping and Aggregation**: In the qSQL version, grouping by date and a 15-minute window is done with a single, simple expression, which is an efficient and intuitive way to express time bucketing. In SQL, similar behavior requires explicitly defining how time intervals are handled and aggregating the results using GROUP BY with custom time expressions, which are often repeated throughout the query.
- **Temporal Formatting**: SQL queries often require repetitive conversions to handle timestamp formats, which is more cumbersome than qSQL, where time-based operations like `xbar` (interval-based bucketing) can be applied directly. Temporal primitives also make it easy to convert a nanosecond timestamp to its equivalent minute using dot notation, e.g. `time.minute`.
- **Data Transformation**: The q language is optimized for high-performance, in-memory, columnar data transformations, which allows for more compact expressions on vectors of data. SQL, on the other hand, is typically too general purpose for even simple transformations on time-series data. This comes down to design: kdb+/q operations execute on ordered lists, whereas SQL (based on set theory) treats data as unordered records rather than columns. Selecting the last value in a series, or computing deltas against prior states, therefore requires explicitly re-ordering the data in SQL.
- **Performance Considerations**: qSQL is designed for high-performance analytics on large datasets, and many operations that would require complex SQL expressions can be done efficiently with qSQL syntax. In SQL, complex operations require workarounds such as additional processing with temporary tables, sub-expressions, re-indexing, changing data models, or heavily leveraging partitions and window functions.

Thus, while the core logic of the query is similar in both languages, the SQL version requires much more overhead in terms of complexity and verbosity. This inefficiency will also become more pronounced with large datasets, leading to challenges with query performance.

While these are just basic analytics, they highlight kdb+/q’s ability to store and analyse large-scale time-series datasets quickly.

## 6. Asof Join – Matching Trades with Quotes

One of the most powerful features in kdb+/q is the asof join (`aj`), which is designed to match records from two tables based on the most recent timestamp. Unlike a standard SQL join, where records must match exactly on a key, an asof join finds the most recent match.

#### Why Use Asof Joins?

In time-series data, we often deal with information arriving at different intervals. For example:
- Trade and Quote Data: A trade occurs at a given time, and we want to match it with the latest available quote.
- Sensor Data: A sensor records temperature every second, while another logs environmental data every 10 seconds—matching the closest reading is crucial.

> **📌** kdb+/q optimizes asof joins to handle large datasets efficiently, making it a key tool in real-time analytics and historical data analysis.

#### Generate synthetic quote data for one day

In [23]:
n:2000000
today:last days
quote:([] 
    time:asc (`timestamp$today) + n?86400000000000;  / Random timestamps
    sym:n?`AAPL`MSFT`GOOG`AMZN;                      / Random stock tickers
    bid:n?100f;                                      / Random bid prices
    ask:n?100f                                       / Random ask prices
 )

Because we're keeping this table in memory, we need to perform one extra step before joining: applying the parted (p#) attribute to the `sym` column of the quote table. Our trade table on disk already has the parted attribute on the `sym` column, as we can see in column `a` when we run `meta trade`.

In [24]:
meta trade

c    | t f a
-----| -----
date | d    
sym  | s   p
time | p    
price| f    
size | j    


This is crucial for optimizing asof joins, as it ensures faster lookups when performing symbol-based joins. Before applying parted to quote, we first sort the table by sym using [`xasc`](https://code.kx.com/q/ref/asc/), as the parted attribute requires all rows for a given symbol to be stored contiguously, which sorting guarantees.

In [25]:
quote:`sym xasc quote                  / Sorting sym in ascending order
quote:update `p#sym from quote         / Apply the parted attribute on sym

In the above:
- `xasc` Sorts the quote table by sym in ascending order
- `#`  Applies the parted attribute to sym, optimizing symbol-based lookups.
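
To see why contiguous symbols help, here is a conceptual Python sketch of the idea behind a parted column. It is not kdb+'s internal implementation; the helper `parted_index` and the sample column are hypothetical. The point is that once every symbol occupies one contiguous run, a small lookup of symbol to row range replaces a scan of the whole column.

```python
def parted_index(column):
    """Map each distinct value to the [start, end) range of its single run.

    Raises ValueError if a value appears in more than one run, i.e. the
    column is not grouped contiguously.
    """
    index = {}
    start = 0
    for i in range(1, len(column) + 1):
        # A run ends at the end of the column or when the value changes
        if i == len(column) or column[i] != column[start]:
            if column[start] in index:
                raise ValueError(f"{column[start]!r} is not contiguous")
            index[column[start]] = (start, i)
            start = i
    return index

# Hypothetical sym column after sorting by sym
syms = ["AAPL", "AAPL", "GOOG", "MSFT", "MSFT", "MSFT"]
index = parted_index(syms)

start, end = index["MSFT"]
print(index)
print(syms[start:end])   # all MSFT rows, found without scanning the column
```

Sorting by `sym` first is what guarantees the single-run property that the sketch checks for.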

#### Perform Asof Join

We now match each trade with the most recent available quote for today's date using [`aj`](https://code.kx.com/q/ref/aj/).


In [26]:
tradequote:aj[`sym`time; select from trade where date=today; quote]
tradequote

date       sym  time                          price    size bid ask
-------------------------------------------------------------------
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 61.19055 427         
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 10.30378 701         
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 51.7924  908         
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 83.30336 700         
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 22.94161 326         
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 49.01219 12          
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 9.120412 393         
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 91.51694 500         
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 26.40567 197         
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 26.96222 564         
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 96.5675  345         
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 38.72911 817         
2025.01.31 AAPL 2025.01.31D00:00:00.000000000 97

In the above:
- `aj` performs an asof join on the `sym` and `time` columns
- Each trade record gets matched with the latest available quote at or before the trade’s timestamp.
- We can see this means the first few `bid` and `ask` values are empty because there was no quote data prior to those trades.

This approach ensures that for every trade, we have the best available quote information, allowing traders to analyze trade execution relative to the prevailing bid/ask spread at the time.
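
The matching rule itself can be sketched in plain Python. This is a simplified illustration of asof-join semantics using a binary search per trade, with made-up data and a hypothetical `asof_join` helper; it says nothing about how `aj` is implemented inside kdb+. Times are plain integers here to keep the example short.

```python
from bisect import bisect_right

def asof_join(trades, quotes):
    """For each trade, attach the latest quote for the same sym with
    quote time <= trade time, or (None, None) if no such quote exists.

    trades: list of (sym, time, price)
    quotes: list of (sym, time, bid, ask)
    """
    # Group quotes by sym, keeping each group's times in ascending order
    by_sym = {}
    for sym, t, bid, ask in sorted(quotes, key=lambda q: (q[0], q[1])):
        times, rows = by_sym.setdefault(sym, ([], []))
        times.append(t)
        rows.append((bid, ask))

    joined = []
    for sym, t, price in trades:
        times, rows = by_sym.get(sym, ([], []))
        i = bisect_right(times, t)          # count of quotes with time <= t
        bid, ask = rows[i - 1] if i else (None, None)
        joined.append((sym, t, price, bid, ask))
    return joined

# Hypothetical sample data
quotes = [("AAPL", 5, 99.0, 101.0), ("AAPL", 10, 100.0, 102.0),
          ("MSFT", 7, 49.0, 51.0)]
trades = [("AAPL", 3, 100.5), ("AAPL", 10, 101.0),
          ("AAPL", 12, 101.5), ("MSFT", 8, 50.0)]

result = asof_join(trades, quotes)
for row in result:
    print(row)
```

The first AAPL trade at time 3 has no earlier quote, so its bid and ask are empty, just like the first rows of the `aj` output above. The trade at time 10 picks up the quote stamped at exactly time 10, and the trade at time 12 reuses that same quote.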

## Next Steps

Try [Example2](Example2.html) on Real-Time Ingestion & Streaming Analytics.


In [27]:
.Q.w[]

used| 404740272
heap| 1946157056
peak| 1946157056
wmax| 0
mmap| 320004328
mphy| 67436519424
syms| 4242
symw| 218937


In [28]:
system"df -mh ."

"Filesystem      Size  Used Avail Use% Mounted on"
"/dev/sdh        9.8G  9.3G  502M  95% /home/jovyan"


In [29]:
// peak diff in GB 
((.Q.w[]`peak)-67108864)%1073741824

1.75


In [30]:
count trade

310000000
