In [1]:
#;.pykx.disableJupyter()

In [2]:
# https://code.kx.com/pykx/3.0/examples/jupyter-integration.html#q-first-mode
import pykx as kx
kx.util.jupyter_qfirst_enable()

PyKX now running in 'jupyter_qfirst' mode. All cells by default will be run as q code. 
Include '%%py' at the beginning of each cell to run as python code. 


##### Initialization Code 

In [3]:
system"l init.q"

Creating local segmented database in :/home/jovyan/course-advanced/.hidden/db/segmentedDBRoot for use within the current section.
Finished segmented database creation.


**Learning Outcomes**

To understand: 
* Why use a segmented database?
* The structure of a segmented database
* How to use par.txt
* Saving data to segments
* Creating a segmented database

# Introduction
For large time-series databases, kdb+ supports segmentation: data partitions are stored in multiple locations outside the root directory. These locations are known as segments. This is an extension of the partitioned database we discussed previously; each segment of the database holds some number of the data partitions.

With this structure, kdb+ can access and retrieve large amounts of data across segments in parallel. Let's first look at the structure of the database below:

<table><tr>
<td> <img src="../images/SegmentedDisk0.png" alt="Drawing" style="width: 230px;"/> </td>
<td> <img src="../images/SegmentedDisk1.png" alt="Drawing" style="width: 300px;"/> </td>
<td> <img src="../images/SegmentedDisk2.png" alt="Drawing" style="width: 300px;"/> </td>
</tr></table>

## Why segment a database? 

There are three main reasons why databases are segmented: 

1. For **Storage Capacity** reasons: databases on the order of petabytes cannot fit on a single storage device, so they must be split across many devices. 
2. For **Performance** reasons: when data is retrieved from a disk, read and write performance is limited by the I/O capacity of that disk (see [IOPS](https://en.wikipedia.org/wiki/IOPS)). Distributing intensive queries across multiple disks can significantly reduce the I/O bottleneck. 
3. For **Cost** reasons: data often has a recency bias in terms of how often it is accessed. Newer data is accessed frequently, while older data is accessed far less. High-performance (costly) disks are therefore often used for recent data (in finance, for example, the last 3-6 months), while older, less frequently accessed data is offloaded onto cheaper storage (e.g. spinning hard disks).

## Differences between Partitioned and Segmented databases

|  | Partitioned Table | Segmented Table |
| --- | --- | --- |
| Record location | All partitions (and hence all records) reside under the root directory. | None of the segments (and hence no records) reside under the root. |
| I/O channels | All partitions (and hence all records) reside on a single I/O channel. | The segments (and hence the records) should reside on multiple I/O channels. |
| Processing | Partitions are loaded and processed sequentially in aggregation queries. | Given appropriate secondary threads and cores, aggregate queries load segments in parallel and process them concurrently. |
| Symbols | Cannot partition on a symbol column. | Can segment along a symbol column. |
| Virtual Column | Partition column not stored. Virtual column values are inferred from directory names. | No special column associated with segmentation (the virtual column from the underlying partition is still present). |
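
The parallelism in the Processing row depends on secondary threads, which are set when q starts (for example `q -s 4`). As a small check, assuming nothing beyond a running q session, the current setting can be read back:

```q
// number of secondary threads available to this process
// 0i means the process was started without -s, so queries run on the main thread only
system"s"
```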

# Segmented Database Structure

## Segmented Database Home 
The only items that remain in the root directory are the sym file, the par.txt file, and any flat or splayed reference tables. 

The **par.txt** file tells q where each segment is located. Each line in **par.txt** points to a specific segment. The local segmented database has been created in the `segmentedDBRoot` directory: 

In [4]:
key `:segmentedDBRoot

`s#`daily`depth`mas`par.txt`sym


We can see a number of files in this directory - our familiar `sym` file, and three flat tables:

In [5]:
3 sublist get `:segmentedDBRoot/daily  //note we don't have our `sym file in memory yet as we haven't loaded the db
3 sublist get `:segmentedDBRoot/depth       //hence why the symbol values are currently showing as integers
3 sublist get `:segmentedDBRoot/mas

date       sym  open  high  low   close price   size 
-----------------------------------------------------
2020.01.02 AAPL 83.9  86.52 82.87 86.22 4480705 52948
2020.01.02 AIG  26.99 29.07 26.69 29    1508183 53858
2020.01.02 AMD  33.01 34    32.89 33.93 1794876 53829
date       time         sym  price size side ex
-----------------------------------------------
2020.01.31 09:30:01.068 TXN  19.74 100  S    O 
2020.01.31 09:30:20.507 INTC 53.09 400  B    N 
2020.01.31 09:30:41.953 INTC 53.3  300  B    N 
sym  name                     
------------------------------
AMD  "ADVANCED MICRO DEVICES" 
AIG  "AMERICAN INTL GROUP INC"
AAPL "APPLE INC COM STK"      


The new file that is specific to the segmented database structure is the `par.txt` file. It is this file that the kdb+/q process looks for in order to determine if the database is in fact segmented. 
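
A minimal sketch of that check, assuming the `segmentedDBRoot` directory used above. `isSegmented` is a hypothetical helper name, not a built-in:

```q
// hypothetical helper: treat a database root as segmented when it contains par.txt
isSegmented:{[root] `par.txt in key root}

isSegmented `:segmentedDBRoot
```

For the directory listing shown earlier this returns `1b`; for a directory without par.txt (or one that does not exist) it returns `0b`.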

## Database Segments - The par.txt file

Looking at our local segmented database, `par.txt` sits in the segmented database root. 

In [6]:
read0 `:segmentedDBRoot/par.txt

"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d0"
"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d1"
"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d2"
"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d3"


These four file paths indicate the four different locations where the data in our segmented database is stored. In this instance the four locations are local and on the same disk, though in practical usage these are usually different disk mounts. 

Let's look at the corresponding data in these locations: 

In [7]:
show segmentDict:segments!key each segments:hsym `$ read0 `:segmentedDBRoot/par.txt

:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d0| `s#`2020.01.08`2020.01.16`2020.01.20`2020.01.24`2020.01.28
:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d1| `s#`2020.01.09`2020.01.13`2020.01.17`2020.01.21`2020.01.29
:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d2| `s#`2020.01.02`2020.01.06`2020.01.10`2020.01.14`2020.01.22`2020.01.30
:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d3| `s#`2020.01.03`2020.01.07`2020.01.15`2020.01.23`2020.01.27`2020.01.31


In [8]:
sum count each  segmentDict  //how many dates do we have in our database

22


We can see that each of the segments contains a number of dates within our database and can build the full file path for each of our partitions: 

In [None]:
//pair each mount dir with each of its date partitions, then join each pair into the full path
show segPaths:` sv/: raze key[segmentDict](,/:)'value[segmentDict] 
show path:first 1?segPaths    //choose a random path 
key path                      //what's in this directory

We can now see that our date partitions contain the TAQ data we have worked with before. 
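
To see what the each-right/each combination in the path-building expression is doing, here is the same expression on a small hand-made dictionary. The mount names and dates are hypothetical and nothing needs to exist on disk:

```q
// hypothetical mounts, each holding a list of partition names
toy:`:/mnt/a`:/mnt/b!(`2020.01.02`2020.01.03;enlist `2020.01.06)

// pair each mount with every one of its own partitions, then flatten to one list of pairs
pairs:raze key[toy](,/:)'value toy

// join each (mount;partition) pair into a single file path
` sv/: pairs
```

This yields three paths: two under `/mnt/a` and one under `/mnt/b`.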

## Loading a Segmented Database 

Loading a segmented database is relatively straightforward: load the root directory that contains the par.txt file.

For example, let's load the pre-created local segmented database we have been inspecting: 

In [9]:
\l segmentedDBRoot

In [10]:
tables[]

`daily`depth`mas`nbbo`quote`td`trade


After loading this directory, we have in memory the flat tables within our root directory, and our usual `sym` and `date` variables we know from working with partitioned databases: 

In [14]:
key `:.

`s#`daily`depth`mas`par.txt`sym


In [15]:
sym 
date

`AAPL`AIG`AMD`DELL`DOW`GOOG`HPQ`IBM`INTC`MSFT`ORCL`PEP`PRU`SBUX`TXN
2020.01.02 2020.01.03 2020.01.06 2020.01.07 2020.01.08 2020.01.09 2020.01.10 2020.01.13 2020.01.14 2020.01.15 2020.01.16 2020.01.17 2020.01.20 2020.01.21 2020.01.22 2020.01.23 2020.01.24 2020.01.27..


The `trade`, `quote` and `nbbo` tables are reconstructed from the partitions in each of our segments, however this is abstracted from us as users and we can continue to query our tables as we have previously: 

In [16]:
select count i by date from trade 

date      | x    
----------| -----
2020.01.02| 14754
2020.01.03| 15087
2020.01.06| 14687
2020.01.07| 14049
2020.01.08| 14436
2020.01.09| 13939
2020.01.10| 13579
2020.01.13| 14684
2020.01.14| 14560
2020.01.15| 14739
2020.01.16| 15082
2020.01.17| 14984
2020.01.20| 14556
2020.01.21| 14445
2020.01.22| 14583
..


In [17]:
select open: first price, high:max price, low: min price, close: last price 
    by date, sym
    from trade 

date       sym | open  high  low   close
---------------| -----------------------
2020.01.02 AAPL| 83.9  86.52 82.87 86.22
2020.01.02 AIG | 26.99 29.07 26.69 29   
2020.01.02 AMD | 33.01 34    32.89 33.93
2020.01.02 DELL| 12    12.22 11.82 12.07
2020.01.02 DOW | 20    20.63 19.82 20.44
2020.01.02 GOOG| 72.02 72.54 70.2  71.04
2020.01.02 HPQ | 35.98 36.04 33.93 34.57
2020.01.02 IBM | 42    42.2  40.75 41.43
2020.01.02 INTC| 51.04 51.07 48.22 49.44
2020.01.02 MSFT| 29    29.24 28.04 28.74
2020.01.02 ORCL| 35    35.53 34.62 34.82
2020.01.02 PEP | 21.99 22.46 21.67 22.44
2020.01.02 PRU | 58.95 61.2  58.81 59.66
2020.01.02 SBUX| 63    63.19 61.93 63.04
2020.01.02 TXN | 18    18.37 17.74 18.07
..
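
Because each date lives in exactly one segment, placing the date constraint first in the `where` clause keeps the work to the relevant partitions. A small sketch against the `trade` table loaded above (the column aliases `trades` and `volume` are illustrative):

```q
// restrict to a single date first, then filter within that partition
select trades:count i, volume:sum size by sym from trade where date=2020.01.02, sym in `AAPL`MSFT
```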


# Segmented Database Utilities 

There are some [`.Q` namespace](https://code.kx.com/q/ref/dotq/) utilities and variables which are useful when working with segmented databases. 

## `.Q.P` and `.Q.D`

In segmented databases:
* [`.Q.P`](https://code.kx.com/q/ref/dotq/#qp-segments) returns a list of the segments (i.e. the contents of par.txt).
* [`.Q.D`](https://code.kx.com/q/ref/dotq/#qd-partitions) contains a list of the partitions – conformant to `.Q.P` – that are present in each segment.

In [18]:
.Q.P       //the path to each of our mounts
.Q.D       //the partitions (in our case dates) held in each segment
.Q.P!.Q.D  //we can recreate our segment dictionary from earlier!

`:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d0`:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d1`:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d2`:/home/jov..
2020.01.08 2020.01.16 2020.01.20 2020.01.24 2020.01.28
2020.01.09 2020.01.13 2020.01.17 2020.01.21 2020.01.29
2020.01.02 2020.01.06 2020.01.10 2020.01.14 2020.01.22 2020.01.30
2020.01.03 2020.01.07 2020.01.15 2020.01.23 2020.01.27 2020.01.31
:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d0| 2020.01.08 2020.01.16 2020.01.20 2020.01.24 2020.01.28
:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d1| 2020.01.09 2020.01.13 2020.01.17 2020.01.21 2020.01.29
:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d2| 2020.01.02 2020.01.06 2020.01.10 2020.01.14 2020.01.22 2020.01.30
:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d3| 2020.01.03 2020.01.07 2020.01.15 2020.01.23 2020.01.27 2020.01.31
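
As a quick sanity check on how evenly the data is spread, the same dictionary can be summarised. This is a sketch that assumes the segmented database is loaded, so that `.Q.P` and `.Q.D` are populated:

```q
// number of date partitions held in each segment
count each .Q.P!.Q.D

// total partitions across all segments
count raze .Q.D
```

The total should agree with the 22 dates counted earlier from `segmentDict`.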


And we can build all our segment paths making use of `.Q.P` and `.Q.D`:

In [19]:
` sv' raze .Q.P(,/:)'`$string .Q.D  
count ` sv' raze .Q.P(,/:)'`$string .Q.D  

`:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d0/2020.01.08`:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d0/2020.01.16`:/home/jovyan/course-advanced/.hidden/db/segmented..
22


## `.Q.par`

The most commonly used function when working with segmented databases is [`.Q.par`](https://code.kx.com/q/ref/dotq/#qpar-locate-partition). This built-in function returns the full path of a table for a given partition, which is helpful when working with large HDBs.

It uses the syntax `.Q.par[directory as a file path;part as a date;table]`

In [20]:
.Q.par[`:.;2020.01.02;`trade]     //full path to this table on that date, given the database in this current directory 

:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d2/2020.01.02/trade


In [21]:
get .Q.par[`:.;2020.01.02;`trade] // running get just displays the table at that location

sym  time         price size stop cond ex
-----------------------------------------
AAPL 09:30:00.434 83.9  13   0    C    N 
AAPL 09:30:00.785 83.8  65   0    Z    N 
AAPL 09:30:01.273 83.76 75   0    Z    N 
AAPL 09:30:05.311 83.69 35   0    W    N 
AAPL 09:30:05.760 83.66 61   0    K    N 
AAPL 09:30:07.334 83.73 97   0    9    N 
AAPL 09:30:11.544 83.7  71   0    E    N 
AAPL 09:30:11.976 83.62 27   0    W    N 
AAPL 09:30:12.435 83.6  78   0    G    N 
AAPL 09:30:13.205 83.53 49   0    W    N 
AAPL 09:30:13.875 83.47 92   0    Z    N 
AAPL 09:30:14.789 83.52 14   0    B    N 
AAPL 09:30:17.346 83.54 62   0    J    N 
AAPL 09:30:18.556 83.65 35   0    L    N 
AAPL 09:30:24.985 83.7  86   0    A    N 
..


<img src="../images/qbies.png" style="width: 50px;padding-right:5px;padding-top:12px;padding-left:5px;" align="left"/>
<p style='color:#273a6e'><i> <code>.Q.par</code> builds the path on the assumption that the partitions are distributed in a modulo fashion (explained in Section 4) depending on the date, e.g. <code>2020.01.02 mod 4</code> is 2, so this date is assumed to be in the mount at index 2 of par.txt (<code>d2</code>, the third line, since indexing starts at 0). </i></p>
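
The arithmetic in the note can be reproduced by hand. This sketch assumes the segmented database is loaded, so that `.Q.P` holds the four mounts from par.txt:

```q
d:2020.01.02
d mod count .Q.P          // 2i with four segments: the index into .Q.P
.Q.P d mod count .Q.P     // the mount at that index, d2 in this database
```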

## `.Q.PV` and `.Q.PD`

If the database is not structured in modulo fashion, we can't rely on `.Q.par` to build our file paths. In this case we rely on [`.Q.PV`](https://code.kx.com/q/ref/dotq/#qpv-modified-partition-values) and [`.Q.PD`](https://code.kx.com/q/ref/dotq/#qpd-partition-locations).

`.Q.PD` contains a list of partition locations, conformant to `.Q.PV`, giving the location of each partition.

`.Q.PV` contains a list of partition values, conformant to `.Q.PD`, giving the value of each partition.

`.Q.PV!.Q.PD` can be used to create a dictionary of partition-to-location information.

In [22]:
.Q.PD                   // list of partition locations
.Q.PV                   // list of partition values
3 sublist .Q.PV!.Q.PD   // partition-to-location information

`:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d2`:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d3`:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d2`:/home/jov..
2020.01.02 2020.01.03 2020.01.06 2020.01.07 2020.01.08 2020.01.09 2020.01.10 2020.01.13 2020.01.14 2020.01.15 2020.01.16 2020.01.17 2020.01.20 2020.01.21 2020.01.22 2020.01.23 2020.01.24 2020.01.27..
2020.01.02| :/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d2
2020.01.03| :/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d3
2020.01.06| :/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d2
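
Going the other way, the same dictionary can be inverted to list the partitions held at each location. A sketch, again assuming the database is loaded:

```q
// group on a dictionary collects the keys (dates) that share each value (location)
group .Q.PV!.Q.PD
```

The result carries the same information as `.Q.P!.Q.D`, though the locations may appear in a different order.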


##### Exercise

Create a function called `getPath` using `.Q.PD` and `.Q.PV`. This function will take two inputs, the date and the tablename and return the full filepath to that table - don't use `.Q.par`.

In [None]:
3 sublist .Q.PV!.Q.PD    //date to mount dictionary 
getPath:{[dt;tabname] mnt:(.Q.PV!.Q.PD)[dt]; 
                ` sv mnt,$[`;string dt],tabname}
getPath[2020.01.02;`trade]
get getPath[2020.01.02;`trade]

In [23]:
//your answer here 
getPath:{[dt;tabname] mnt:(.Q.PV!.Q.PD)[dt]; 
                ` sv mnt,$[`;string dt],tabname}

# Saving data to segments 

To get the most benefit from segmentation, data should be distributed evenly across all mounts, and consideration should be given to common query patterns. Our segmented database has four segments named `d0`, `d1`, `d2` and `d3`. As a reminder, here is our par.txt: 

In [24]:
read0 `:par.txt

"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d0"
"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d1"
"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d2"
"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d3"


If we have four date partitions and we want to store them in these segments, the common round-robin method would store them as follows:

2016.01.01 ---- segmentedDBmounts/d0

2016.01.02 ---- segmentedDBmounts/d1

2016.01.03 ---- segmentedDBmounts/d2

2016.01.04 ---- segmentedDBmounts/d3 

The most common way to do this is to use the mod function in conjunction with the partition itself - e.g. `2016.01.01 mod 4` returns `0`, so we allocate to our first mount and so on.

Consider the first partition above:

In [25]:
2016.01.01 mod 4    //since we have four mounts - therefore allocated to d0

0i


We can abstract this more generally: 

In [26]:
d:2016.01.01;           //the partition for this date needs to go in the first segment
show r:read0`:par.txt   //r is a list of our mounts
count r                 //how many mounts do we have? 
r d mod count r         //getting the modulo versus the mount count - then indexing to get the corresponding mount

4
/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d0
"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d0"
"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d1"
"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d2"
"/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d3"


<img src="../images/qbies.png" style="width: 50px;padding-right:5px;padding-top:2px;padding-left:5px;" align="left"/>
<p style='color:#273a6e'><i> A date in q is stored as an integer representing the number of days since 2000.01.01, so we can apply <code>mod</code> to dates.</i></p>
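
The integer behind a date can be inspected directly with a cast, which makes the modulo result easy to verify by hand:

```q
`int$2000.01.01      // 0i: the epoch
`int$2016.01.01      // 5844i: days since 2000.01.01
5844 mod 4           // 0, matching 2016.01.01 mod 4
2000.01.01+5844      // back to 2016.01.01
```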

Note that the names of the segments are unimportant. The `mod` function takes the date and the `count` of lines in the par.txt file as its two arguments. The result is then used to index into the result of `read0` and extract the correct segment.  
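
To illustrate that only the order of the lines matters, here is the same allocation with three hypothetical mount names held in memory rather than read from par.txt:

```q
// hypothetical mount names: only their position in the list matters
mounts:("/mnt/fast";"/mnt/slow";"/mnt/archive")
dts:2016.01.01+til 6

// round-robin allocation: each date maps to the mount at index (date mod number of mounts)
dts!mounts dts mod count mounts
```

Consecutive dates cycle through the three mounts in list order, whatever the mounts are called.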

##### Exercise 

Create a utility function - `savePath` - that takes a date and tablename and returns the save path for the new date. This should allocate using modulo on the date and round robin between the mounts. 

Assume that the par.txt file is in the current directory. 

Verify that this returns the same result as `.Q.par`.

In [None]:
savePath:{[dt;tabName] 
               mounts:read0 `:par.txt; 
               numMounts:count mounts;
               allocateTo:mounts dt mod numMounts; //determine allocation
               ` sv hsym[`$allocateTo],(`$string dt),tabName
 }

In [None]:
savePath[2020.01.01;`tab]

In [None]:
.Q.par[`:.;2020.01.01;`tab]

In [27]:
//your answer here 
savePath:{[dt;tabName] 
               mounts:read0 `:par.txt; 
               numMounts:count mounts;
               allocateTo:mounts dt mod numMounts; //determine allocation
               ` sv hsym[`$allocateTo],(`$string dt),tabName
 }

savePath[2020.01.01;`tab]

.Q.par[`:.;2020.01.01;`tab]

:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d1/2020.01.01/tab
:/home/jovyan/course-advanced/.hidden/db/segmentedDBmounts/d1/2020.01.01/tab


# Creating Segmented Tables

There is no one-size-fits-all utility to create segments. Instead, you write a q program that places a subset of each partition slice into a segment.
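
As one possible shape for such a program, the sketch below writes a single date's data to the segment chosen by the modulo rule from the previous section. `saveToSeg` and its argument names are hypothetical, and it assumes `root` already contains a par.txt whose mount directories are writable:

```q
// hypothetical helper: splay one date's table into the segment picked by date mod mount count
saveToSeg:{[root;dt;tabName;data]
  mounts:hsym `$read0 ` sv root,`par.txt;     // segment paths listed in par.txt
  mnt:mounts dt mod count mounts;             // round-robin choice of segment
  path:` sv mnt,(`$string dt),tabName,`;      // trailing ` gives a trailing slash, so set splays
  path set .Q.en[root;data]}                  // enumerate symbols against root/sym, then write
```

Other allocation schemes work equally well as long as every partition ends up in exactly one segment; with a non-modulo layout, paths should be looked up through `.Q.PV` and `.Q.PD` rather than `.Q.par`.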

Let's now create our own segmented database.

We will use the directories `...06 Segmented Databases/1` and `...06 Segmented Databases/2` as our segments, and `...06 Segmented Databases/seg_root` as our segmented database root, where we write our **par.txt** file.

First let's navigate back to our module home directory (`...06 Segmented Databases`): 

In [28]:
\cd ../               

In [29]:
show dir: hsym `$d:first system"pwd"
d   //module root string

`:/home/jovyan/course-advanced/.hidden/db
/home/jovyan/course-advanced/.hidden/db


Let's build the path to our segmented database root directory: 

In [30]:
show segRoot:` sv  dir,`seg_root

`:/home/jovyan/course-advanced/.hidden/db/seg_root


Firstly we'll put some data in each of our "mounts": 

In [31]:
(` sv (dir,`$"1/2021.01.02/seg/")) set .Q.en[segRoot;] ([] ti:09:30:00 09:31:00; s:`ibm`t; p:101 17f)
(` sv (dir,`$"2/2021.01.03/seg/")) set .Q.en[segRoot;] ([] ti:09:30:00 09:31:00; s:`ibm`t; p:101.5 17.5)
(` sv (dir,`$"1/2021.01.04/seg/")) set .Q.en[segRoot;] ([] ti:09:30:00 09:31:00; s:`ibm`t; p:103 16.5f)
(` sv (dir,`$"2/2021.01.05/seg/")) set .Q.en[segRoot;] ([] ti:09:30:00 09:31:00; s:`ibm`t; p:102 17f)

:/home/jovyan/course-advanced/.hidden/db/1/2021.01.02/seg/
:/home/jovyan/course-advanced/.hidden/db/2/2021.01.03/seg/
:/home/jovyan/course-advanced/.hidden/db/1/2021.01.04/seg/
:/home/jovyan/course-advanced/.hidden/db/2/2021.01.05/seg/


Next we write our par.txt file, containing the path to each of our data mounts:

In [32]:
(` sv (segRoot,`par.txt)) 0: (d,"/1"; d,"/2")   //absolute paths to each of our two mounts
read0 ` sv (segRoot,`par.txt)

:/home/jovyan/course-advanced/.hidden/db/seg_root/par.txt
"/home/jovyan/course-advanced/.hidden/db/1"
"/home/jovyan/course-advanced/.hidden/db/2"


We've now made our segmented database! 

Let's take a second to make sure everything is in order: 

In [33]:
show mounts: hsym `$read0 ` sv (segRoot,`par.txt)  //the list of "mounts" for our segmented database
key each mounts                                    //the partitions exist for each of our mounts

2021.01.02 2021.01.04
2021.01.03 2021.01.05
`:/home/jovyan/course-advanced/.hidden/db/1`:/home/jovyan/course-advanced/.hidden/db/2


In [34]:
key segRoot          //our home directory has our par.txt and sym file - all looks good! 

`s#`par.txt`sym


Our final test is now to load the directory: 

In [35]:
1_ string segRoot                 //the directory path - dropping the : at the beginning
system"cd ",1_ string segRoot     //moving to the directory 
\pwd                              //confirming correct
\l .                              //loading dir 

/home/jovyan/course-advanced/.hidden/db/seg_root
"/home/jovyan/course-advanced/.hidden/db/seg_root"


Loading a directory (using `\l dir`) automatically moves you into that directory. We can now see which tables are available: 

In [37]:
tables[]
5#select from seg

`daily`depth`mas`nbbo`quote`seg`td`trade
date       ti       s   p    
-----------------------------
2021.01.02 09:30:00 ibm 101  
2021.01.02 09:31:00 t   17   
2021.01.03 09:30:00 ibm 101.5
2021.01.03 09:31:00 t   17.5 
2021.01.04 09:30:00 ibm 103  


<img src="../images/qbies.png" style="width: 50px;padding-right:5px;padding-top:12px;padding-left:5px;" align="left"/>

<p style='color:#273a6e'><i>Remember to ensure the segments conform and are complete. Overlapping segments will result in duplicate records in query results and an incomplete decomposition will result in dropped records!</i></p>
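
One way to spot overlapping segments is to look for partitions that appear more than once across `.Q.D`. This is a sketch, assuming the segmented database is loaded; `allParts` is an illustrative name:

```q
allParts:raze .Q.D

// 1b when no partition appears in more than one segment
count[allParts]=count distinct allParts

// any partitions that do appear more than once
where 1<count each group allParts
```

Completeness (every expected date being present somewhere) still has to be checked against your own list of expected partitions.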

##### Quiz Time!
Work through the Segmented Exercises to practise writing a segmented database yourself!