# Time Series & Historical Query Analysis

Welcome to this example where we'll demonstrate how to work with large datasets in kdb+ to analyze time-series data. 

One of the key features of kdb+ is its ability to handle huge volumes of data with exceptional speed and efficiency. Whether it's reading massive datasets, performing time-based aggregations, or joining data from different sources, kdb+ excels at time-series analysis. By the end of this example, you'll have a clear understanding of how to create, manipulate, store, and analyze data using q/kdb+. Along the way, we'll introduce several key concepts that are fundamental to working with q/kdb+.


Here, we'll cover:
- Creating a large time-series dataset from scratch
- Saving this data to a database on disk
- Streamlining ingestion and saving with reusable functions
- Performing time-based aggregations to analyze trends over time
- Using asof joins (aj) to combine time-series data (e.g., matching trades to quotes)

## 1. Prerequisites

1. For setup instructions and prerequisites, please refer to the [README](README.md).
2. Ensure PyKX is properly initialized and qfirst mode is enabled by running the below.

In [1]:
import pykx as kx
kx.util.jupyter_qfirst_enable()

PyKX now running in 'jupyter_qfirst' mode. All cells by default will be run as q code. 
Include '%%python' at the beginning of each cell to run as python code. 


## 2. Create the Time Series Dataset

Let’s start by creating a sample dataset to work with. This dataset will simulate trade data over a period of time, with random values for price, size, and symbols. We’ll generate 5 million rows of trade data.

In [2]:
n:5000000
day:2025.01.02
trade:([] 
    time:asc (`timestamp$day) + n?24:00:00.000000000;    / Start from midnight, spread across 24h
    sym:n?`AAPL`MSFT`GOOG`AMZN;                          / Random symbols
    price:n?100f;                                        / Random prices
    size:n?1000                                          / Random trade sizes
 )

Here's a breakdown of what's happening:
- `n: 5000000` sets the number of rows we want to generate
- We define a new table with table notation `([] col1:<values>; col2:<values>; ....)`
- We use `?` to generate random values for 4 columns:
    - `time` is populated with timestamps starting from midnight and increasing across a 24-hour period, with a random offset to simulate a spread of trades.
    - `sym` is populated with random symbols like AAPL, MSFT, etc., selected from a list.
    - `price` and trade `size` are randomly generated
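
The two building blocks described above, table notation and the `?` (roll) operator, can be tried on a small scale first. This is a minimal sketch; the names `m` and `t` are illustrative and not part of the original example.

```q
m:5                                    / hypothetical small row count
10?3                                   / ten random longs drawn from 0 1 2
m?`AAPL`MSFT                           / m random picks from a symbol list
m?100f                                 / m random floats below 100
t:([] sym:m?`AAPL`MSFT; price:m?100f)  / table notation: one list per column
count t                                / 5
```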

This table is now available in memory to investigate and query. Let's take a quick look at the row count with [`count`](https://code.kx.com/q/ref/count/), the schema with [`meta`](https://code.kx.com/q/ref/meta/) and the first 10 rows with [`sublist`](https://code.kx.com/q/ref/sublist/).

These simple commands are essential when exploring your data quickly in q/kdb+.

In [3]:
count trade       / get row count

5000000


In [4]:

meta trade        / get table schema details - datatypes, column names etc

c    | t f a
-----| -----
time | p   s
sym  | s    
price| f    
size | j    


The following columns are produced when we run `meta`:
- c: column name
- t: [column type](https://code.kx.com/q/ref/#datatypes)
- f: [foreign keys](https://code.kx.com/q4m3/8_Tables/#85-foreign-keys-and-virtual-columns)
- a: [attributes](https://code.kx.com/q/ref/#attributes): modifiers applied for performance characteristics
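
The same information can also be pulled out programmatically, which may help when checking a schema inside a script. The sketch below uses a small hypothetical table `tab` rather than the 5-million-row `trade` table.

```q
tab:([] time:asc 5?24:00:00.000000000; sym:5?`a`b)  / asc marks the time list as sorted
cols tab                / `time`sym
attr tab`time           / `s, the sorted attribute reported in the a column
exec t from meta tab    / "ns": the type characters for timespan and symbol
```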

In [5]:
10 sublist trade  / get first 10 rows 

time                          sym  price     size
-------------------------------------------------
2025.01.02D00:00:00.011406093 AAPL 5.286875  908 
2025.01.02D00:00:00.014765560 AMZN 47.93312  360 
2025.01.02D00:00:00.038664042 GOOG 17.13715  522 
2025.01.02D00:00:00.046268105 AMZN 70.1903   257 
2025.01.02D00:00:00.050713866 AMZN 23.25251  858 
2025.01.02D00:00:00.073526054 AMZN 18.48452  585 
2025.01.02D00:00:00.099858641 AAPL 66.48997  90  
2025.01.02D00:00:00.120478123 AAPL 29.23461  683 
2025.01.02D00:00:00.156366080 GOOG 0.8593363 90  
2025.01.02D00:00:00.165257602 GOOG 75.44551  869 


## 3. Save Data to Disk

Once the data is generated, you’ll likely want to save it to disk for persistent storage.

Because we want the ability to scale, partitioning by date will be a good approach for this dataset. Without partitioning, queries that span large time periods would require scanning entire datasets, which can be very slow and resource-intensive. By partitioning data, kdb+ can limit the query scope to the relevant partitions, significantly speeding up the process.

To partition by date we can use the built-in function [`.Q.dpft`](https://code.kx.com/q/ref/dotq/#dpft-save-table).


In [6]:
dbDir:"/home/your-dir/data"          / Define database location
dbPath:hsym `$dbDir
.Q.dpft[dbPath;day;`sym;`trade]            / Save data as a partitioned database

trade


In the above:
- [`hsym`](https://code.kx.com/q/ref/hsym/): This function prefixes the directory location with a colon to make it a file handle
- `.Q.dpft[d;p;f;t]`: This command saves the (t)able, passed by name, to (p)artition `p` of the (d)atabase directory `d`. The table is sorted on the (f)ield `f`, here `sym`, and the parted attribute is applied to that column.
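
To see what `.Q.dpft` wrote, the directory can be inspected from q. This is a sketch that assumes `dbPath` and `day` from the cell above and that the save has already run; the variable `p` is illustrative, and the exact listing depends on your environment.

```q
key dbPath                           / top level: typically the date directory and the sym enumeration file
p:` sv dbPath,(`$string day),`trade  / path of the trade table inside the day partition
key p                                / one file per column, plus .d
get ` sv p,`.d                       / column names in stored order
```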

Once persisted, the table name is returned. We can test it's worked as expected by deleting the `trade` table we have in memory and reloading the database from disk.

In [7]:
delete trade from `.                     / Delete in memory table
system"l ",dbDir                         / Load the partitioned database
meta trade                               / Check it exists

.
c    | t f a
-----| -----
date | d    
sym  | s   p
time | p    
price| f    
size | j    


kdb+ actually offers a number of different methods to store tables which will allow for efficient storage and querying for different sized datasets: flat, splayed, partitioned and segmented.

As a general rule of thumb, the choice of format depends on three things:

- Will the table continue to grow at a fast rate?
- Am I working in a RAM constrained environment?
- What level of performance do I want?

To learn more about these types and when to choose which [see here](https://code.kx.com/q/database/).
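
As a rough illustration of the two simplest formats, the sketch below writes a small table both flat and splayed. The `/tmp/kdb_demo` location and the table `t` are hypothetical. The table deliberately has no symbol columns, because splayed symbol columns must first be enumerated, which `.Q.dpft` handles for us in the partitioned case.

```q
t:([] time:asc 5?24:00:00.000000000; price:5?100f)  / small demo table, no symbol columns

`:/tmp/kdb_demo/flat set t       / flat: the whole table goes into a single file
`:/tmp/kdb_demo/splay/ set t     / splayed: the trailing slash writes one file per column

count get `:/tmp/kdb_demo/flat   / 5
key `:/tmp/kdb_demo/splay        / directory listing: .d plus the column files
```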

## 4. Scaling Data Ingestion with Functions

If you want to scale the ingestion of data to many days, it’s helpful to create a reusable function. Let’s create a function `createTrade` that generates trade data for specific dates and saves it to the database.

In [8]:
createTrade:{[date]
    trade::([] time:asc (`timestamp$date) + n?24:00:00.000000000; / Start from midnight, spread across 24h
              sym:n?`AAPL`MSFT`GOOG`AMZN;                         / Random symbols
              price:n?100f;                                       / Random prices
              size:n?1000);                                       / Random trade sizes
    .Q.dpft[dbPath;date;`sym;`trade]                              / Save data as a partitioned database
 }

days:2025.02.01 + til 5
createTrade each days

`trade`trade`trade`trade`trade


In the above:
- The function `createTrade` generates trade data for a given date, and then saves it to disk. It assigns the table to the global `trade` with `::` because `.Q.dpft` takes the table by name.
- We generate data for multiple days (2025.02.01 to 2025.02.05), using [`til`](https://code.kx.com/q/ref/til/) as a quick way to generate a list of dates.
- Then we apply the function to each date using [`each`](https://code.kx.com/q/wp/iterators/#map-iterators)

> **📌 Iterators** like each are the primary means of iteration in q, and in almost all cases the most efficient way to iterate. Loops are rare in q programs and are almost always candidates for optimization.
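
A few small expressions show how `til` and `each` behave before they are applied to a whole database. These are illustrative only and independent of the `trade` data.

```q
2025.02.01+til 3          / 2025.02.01 2025.02.02 2025.02.03
count each (1 2;3 4 5)    / 2 3
sum each (1 2;3 4 5)      / 3 12
{x*x} each 1 2 3          / 1 4 9 (x*x alone is already vectorised; each is shown for illustration)
```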

After running this function, the data will be partitioned and stored for each specific day. Again, let's delete our in-memory `trade` table and reload our database to pick up these new additions.

In [9]:
delete trade from `.                     / Delete in memory table
system"l ",dbDir                         / Load the partitioned database
select count i by date from trade        / Count num rows by date after partitioning 5 days of data

.
date      | x      
----------| -------
2025.01.02| 5000000
2025.02.01| 5000000
2025.02.02| 5000000
2025.02.03| 5000000
2025.02.04| 5000000
2025.02.05| 5000000
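
With several partitions on disk, the benefit of partitioning described in section 3 becomes visible: a constraint on `date` placed first in the `where` clause limits the query to the matching partitions. The queries below are a sketch that assumes the database loaded in the cell above.

```q
/ only the 2025.02.03 partition is read
select count i by sym from trade where date=2025.02.03

/ three partitions are read, then the sym filter is applied within each
select count i from trade where date within 2025.02.02 2025.02.04, sym=`GOOG
```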


## 5. Time Series Analytics

Now that we have some data, let's dive into some basic time-series analytics.

### Total Trade Volume Every Hour for AAPL

In [10]:
select sum size 
    by date,
       60 xbar time.minute 
    from trade 
    where sym=`AAPL

date       minute| size    
-----------------| --------
2025.01.02 00:00 | 25891446
2025.01.02 01:00 | 26050097
2025.01.02 02:00 | 26147341
2025.01.02 03:00 | 26037462
2025.01.02 04:00 | 26043744
2025.01.02 05:00 | 26024765
2025.01.02 06:00 | 25993388
2025.01.02 07:00 | 25923493
2025.01.02 08:00 | 25926785
2025.01.02 09:00 | 25889219
2025.01.02 10:00 | 26145203
2025.01.02 11:00 | 25802003
2025.01.02 12:00 | 26088725
2025.01.02 13:00 | 26144928
2025.01.02 14:00 | 26120175
2025.01.02 15:00 | 26083944
2025.01.02 16:00 | 26226585
2025.01.02 17:00 | 25882417
2025.01.02 18:00 | 26146353
2025.01.02 19:00 | 25890825
..


#### qSQL & Temporal Arithmetic
Here we are using [qSQL](https://code.kx.com/q/basics/qsql/), the built-in table query language in kdb+. If you have used SQL, you will find the syntax of qSQL queries very similar.
- Just as in SQL, results are retrieved using `select` and `from`, and can be filtered by expressions following `where`
- Multiple filter criteria, separated by `,`, are evaluated starting from the left
- To group similar values together we can use the `by` clause. This is particularly useful in combination with an aggregation such as `sum`, `max` or `min`.

q/kdb+ supports several temporal types and arithmetic between them. See here for a summary of [datatypes](https://code.kx.com/q/ref/#datatypes).
In this example:
- The `time` column in the data has a type of timestamp, which includes both date and time values.
- We convert the `time` values to their minute values (including hours and minutes)
- We then aggregate further on time by using [`xbar`](https://code.kx.com/q/ref/xbar/) to bucket the minutes into hours (60-unit buckets)
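
The bucketing steps above can be checked on a handful of values. In this sketch `ts` is a hypothetical pair of timestamps; the cast `` `minute$ `` does the same conversion as `time.minute` in the query.

```q
5 xbar 0 3 7 12 14       / 0 0 5 10 10: each value is rounded down to a multiple of 5
ts:2025.01.02D09:41:30 2025.01.02D10:05:00
`minute$ts               / 09:41 10:05
60 xbar `minute$ts       / 09:00 10:00, hourly buckets
15 xbar `minute$ts       / 09:30 10:00, 15-minute buckets
```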

### Weighted Average Price and Last Trade Price Every 15 Minutes for MSFT

In [11]:
select LastPrice:last price, 
       WeightedPrice:size wavg price
 by date,15 xbar time.minute 
 from trade 
 where sym=`MSFT

date       minute| LastPrice WeightedPrice
-----------------| -----------------------
2025.01.02 00:00 | 11.43895  49.69477     
2025.01.02 00:15 | 42.51047  49.91692     
2025.01.02 00:30 | 69.80891  50.01703     
2025.01.02 00:45 | 63.15883  49.95182     
2025.01.02 01:00 | 5.080531  50.30521     
2025.01.02 01:15 | 73.5871   49.87864     
2025.01.02 01:30 | 89.19987  50.40955     
2025.01.02 01:45 | 47.07693  49.92108     
2025.01.02 02:00 | 13.39698  49.89728     
2025.01.02 02:15 | 16.23821  50.17888     
2025.01.02 02:30 | 87.14231  49.3321      
2025.01.02 02:45 | 75.71376  50.28474     
2025.01.02 03:00 | 12.01796  50.28201     
2025.01.02 03:15 | 87.38662  50.45316     
2025.01.02 03:30 | 28.95476  50.30487     
2025.01.02 03:45 | 33.23928  50.07296     
2025.01.02 04:00 | 63.04001  50.16875     
2025.01.02 04:15 | 37.40805  50.05399     
2025.01.02 04:30 | 78.51536  50.04083     
2025.01.02 04:45 | 24.88957  50.66176     
..


This is similar to the previous analytic, but this time we make use of the built in `wavg` function to find out the weighted average over time intervals. 

In finance, volume-weighted averages give a more accurate reflection of a stock’s price movement by incorporating trading volume at different price levels. This can be especially useful in understanding whether a price move is supported by strong market participation or is just a result of a few trades.
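
To make the weighting concrete, here is a small worked example with hypothetical weights `w` (standing in for trade sizes) and values `p` (standing in for prices). It shows that `wavg` matches the formula sum(w*p)/sum(w).

```q
w:2 3 4                / weights, e.g. trade sizes
p:1 2 4f               / values, e.g. prices
w wavg p               / 2.666667
(sum w*p)%sum w        / 2.666667, the same result computed by hand
avg p                  / 2.333333, the unweighted mean for comparison
```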

Let's time this analytic with `\t` to see how long it takes in milliseconds to crunch through 30 million records.

In [13]:
\t select LastPrice:last price, 
       WeightedPrice:size wavg price
 by date,15 xbar time.minute 
 from trade 
 where sym=`MSFT

147


In this run, the query worked through the 30-million-row table in 147 ms, aggregating LastPrice and WeightedPrice for MSFT trades in every 15-minute bucket. The parted attribute on `sym` helps here, since kdb+ can locate the MSFT rows in each partition without scanning the whole column. Timings will vary with hardware and disk caching, but the result demonstrates the power of kdb+/q for high-speed time-series analytics.
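
A single timing can be noisy, so it may be worth repeating the measurement. The sketch below assumes the database loaded earlier and uses two variants of the timing command: `\t:n`, which reports the total time for n runs, and `\ts`, which reports time and memory used. It also assumes these system commands behave in your notebook as `\t` does above.

```q
/ total milliseconds for 10 runs; divide by 10 for an average
\t:10 select size wavg price by date,15 xbar time.minute from trade where sym=`MSFT

/ milliseconds and bytes used for a single run
\ts select size wavg price by date,15 xbar time.minute from trade where sym=`MSFT
```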

### SQL Comparison

A SQL version of this query above would look something like:

```
CREATE OR REPLACE function wavg(v,p) AS sum(v * p)/sum(v);
CREATE OR REPLACE MACRO xbarTime(n,x) AS n*(((date('hour', x::Time)*60) + date('minute', x::Time))//n);
select sym, wavg("size", "price") AS wavg FROM trade GROUP BY ALL ORDER BY sym;
select min("time") AS time, last("price") as LastPrice,wavg("size", "price") AS WeightedPrice 
    FROM trade 
    WHERE sym='MSFT' GROUP BY xbarTime(15,time) ORDER BY time;
```

SQL is more complex to write due to several factors:
- **Custom Functions**: The SQL version involves the creation of custom functions and macros such as `wavg(v, p)` for weighted averages and `xbarTime(n, x)` for time bucketing. In the q version, these functions are built in, and the syntax is more concise. The SQL equivalent requires explicit definitions and can be more verbose.
- **Grouping and Aggregation**: In the q version, grouping by `date` and `15 xbar time.minute` is done with a single, simple syntax, which is efficient and easy to express. In SQL, similar behavior requires explicitly defining how time intervals are handled and aggregating the results using GROUP BY and custom time expressions.
- **Time Formatting**: SQL queries often require conversion or handling of time formats, which is more cumbersome compared to q, where time-based operations like xbar (interval-based bucketing) can be done directly in a more streamlined manner.
- **Data Transformation**: The q language is optimized for high-performance, in-memory, columnar data transformations, which allows for more compact expressions. SQL, on the other hand, typically requires a more rigid structure for achieving the same results, often relying on the use of subqueries or joining intermediate results.
- **Performance Considerations**: q is designed for high-performance analytics on large datasets, and many operations that would require more complex SQL expressions can be done more efficiently with q syntax. In SQL, complex operations may require additional processing, such as temporary tables, indexing, or window functions.

Thus, while the core logic of the query is similar in both languages, the SQL version requires more manual setup (e.g., custom function creation, complex time transformations, and explicit grouping), leading to a more verbose and complex query.

While these are just basic analytics, they showcase q/kdb+’s ability to handle large-scale time-series data and perform aggregations quickly.

## 6. Asof Join – Matching Trades with Quotes

One of the most powerful features in q/kdb+ is the asof join (`aj`), which is designed to match records from two tables based on the most recent timestamp. Unlike a standard SQL join, where records must match exactly on a key, an asof join finds the most recent match.

#### Why Use Asof Joins?

In time-series data, we often deal with information arriving at different intervals. For example:
- Trade and Quote Data: A trade occurs at a given time, and we want to match it with the latest available quote.
- Sensor Data: A sensor records temperature every second, while another logs environmental data every 10 seconds—matching the closest reading is crucial.

> **📌** q/kdb+ optimizes asof joins to handle large datasets efficiently, making it a key tool in real-time analytics and historical data analysis.
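
Before running the join on millions of rows, the matching rule can be seen on two tiny hypothetical tables, `tr` (trades) and `qt` (quotes). Each trade picks up the latest quote for the same `sym` at or before its `time`.

```q
tr:([] sym:`a`a`b; time:09:00:01 09:00:05 09:00:03; price:10 11 20f)
qt:([] sym:`a`a`b; time:09:00:00 09:00:04 09:00:04; bid:9.9 10.9 19.9)
aj[`sym`time;tr;qt]
/ sym time     price bid
/ a   09:00:01 10    9.9
/ a   09:00:05 11    10.9
/ b   09:00:03 20         (null: the only b quote arrives later, at 09:00:04)
```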

#### Generate synthetic quote data for one day

In [14]:
n:2000000
today:last days
quote:([] 
    time:asc (`timestamp$today) + n?86400000000000;  / Random timestamps
    sym:n?`AAPL`MSFT`GOOG`AMZN;                     / Symbols
    bid:n?100f;                                     / Random bid prices
    ask:n?100f                                      / Random ask prices
 )

As we're keeping this table in memory, we need to perform one extra step before joining: applying the parted (`p#`) attribute to the sym column of the quote table. Our trade table on disk already has the parted attribute on the sym column, as shown in column `a` when we run `meta trade`.

In [15]:
meta trade

c    | t f a
-----| -----
date | d    
sym  | s   p
time | p    
price| f    
size | j    


This is crucial for optimizing asof joins, as it ensures faster lookups when performing symbol-based joins. Before applying parted to quote, we first sort the table by sym using [`xasc`](https://code.kx.com/q/ref/asc/), because the parted attribute can only be applied when all rows for each symbol are stored contiguously. The sort keeps rows in time order within each symbol, which `aj` relies on.

In [16]:
quote:`sym xasc quote           / sort by sym in ascending order
quote:update `p#sym from quote  / apply parted attribute on sym

In the above:
- `xasc` Sorts the quote table by sym in ascending order
- `` `p# `` applies the parted attribute to sym, optimizing symbol-based lookups.
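
The contiguity requirement can be checked on a plain list. In this sketch the list `s` is illustrative, and the second expression uses protected evaluation so that the expected failure is returned as a string instead of halting the cell.

```q
s:`p#`AAPL`AAPL`GOOG`GOOG`MSFT            / equal values are contiguous, so p# is accepted
attr s                                    / `p
@[{`p#x};`AAPL`GOOG`AAPL;{"error: ",x}]   / AAPL is split into two runs, so p# is rejected
```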

#### Perform Asof Join

We now match each trade with the most recent available quote for today's date using [`aj`](https://code.kx.com/q/ref/aj/).


In [17]:
tradequote:aj[`sym`time; 
              select from trade where date=today;
              quote]
tradequote

date       sym  time                          price    size bid      ask     
-----------------------------------------------------------------------------
2025.02.05 AAPL 2025.02.05D00:00:00.041379779 68.3932  935                   
2025.02.05 AAPL 2025.02.05D00:00:00.062924623 60.90381 405                   
2025.02.05 AAPL 2025.02.05D00:00:00.173867493 16.86426 495  40.66565 73.38496
2025.02.05 AAPL 2025.02.05D00:00:00.233070552 50.05196 816  40.66565 73.38496
2025.02.05 AAPL 2025.02.05D00:00:00.338360667 30.29596 67   40.66565 73.38496
2025.02.05 AAPL 2025.02.05D00:00:00.349666178 76.59111 689  40.66565 73.38496
2025.02.05 AAPL 2025.02.05D00:00:00.431198626 20.13091 740  94.77276 3.518068
2025.02.05 AAPL 2025.02.05D00:00:00.515386462 33.70869 306  94.77276 3.518068
2025.02.05 AAPL 2025.02.05D00:00:00.765958428 22.29313 60   51.94371 83.79661
2025.02.05 AAPL 2025.02.05D00:00:00.777746737 29.0045  976  51.94371 83.79661
2025.02.05 AAPL 2025.02.05D00:00:00.904098898 57.4872  201  59.5



In the above:
- `aj` performs an asof join on the `sym` and `time` columns
- Each trade record gets matched with the latest available quote at or before the trade’s timestamp.
- We can see this means the first few `bid` and `ask` values are empty because there was no quote data prior to those trades.

This approach ensures that for every trade, we have the best available quote information, allowing traders to analyze trade execution relative to the prevailing bid/ask spread at the time.
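
As one possible next step, the joined table can be used to compare each trade price with the prevailing quote. The sketch below assumes `tradequote` from the asof join above; the column names `mid` and `spread` are illustrative. Because the bid and ask values here are random, the spread will often be negative, which would not be expected in real market data.

```q
select sym,time,price,bid,ask,
    mid:0.5*bid+ask,          / evaluated right to left, so this is 0.5*(bid+ask)
    spread:ask-bid
    from tradequote
    where not null bid        / drop trades with no earlier quote
```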

## Next Steps

Try [Example2](Example2.html) on Real-Time Ingestion & Streaming Analytics.
