In [1]:
#;.pykx.disableJupyter()

In [2]:
# https://code.kx.com/pykx/3.0/examples/jupyter-integration.html#q-first-mode
import pykx as kx
kx.util.jupyter_qfirst_enable()

PyKX now running in 'jupyter_qfirst' mode. All cells by default will be run as q code. 
Include '%%py' at the beginning of each cell to run as python code. 


**Learning Outcomes**

To understand: 
* What is an enumeration?
* How to create enumeration?
* Operations on enumerations
* The sym file
* Foreign keys in tables

# Introduction

Let's revise what we've previously learned about textual data in kdb+/q. We have two data types, symbols and strings, which we can use to store textual data. Here we'll recap them and discuss what to consider when planning to persist data to disk.

## Recap on textual datatypes 

Behaviour: 
* Symbols are more performant with queries (as they're atomic) but can be more expensive in storage 
* Strings are easier to manipulate and parse but introduce extra complexity when querying as they are treated as lists in kdb+

Data usage: 
* Symbols should be used for highly repetitive data e.g. sym 
* Strings should be used for variable data e.g. orderIDs

## Considerations

We need to determine which of these types is best to use for data we want to store on disk. 

When making this choice, we have to think of both: 
1. **Storage**: What disk resources do we have available to us when storing these massive tables 
2. **Performance**: What time/memory performance requirements do we have when we query these tables

At first this seems like a trade-off: strings are less expensive to store, but symbols are much more efficient to query. Luckily, this is where [enumerations](https://en.wikipedia.org/wiki/Enumerated_type) come in to save the day!

## Enumeration as a concept 

[Enumeration](https://code.kx.com/q/basics/enumerations/) is a method of associating a data set with a set of distinct values, commonly referred to as an enumeration domain. It is a method of data normalization as well as a technique to improve performance and save space when storing data on disk.

As an example, let's look at the below list of fruit listed with their indices:

| apple | peach| banana| peach| banana| banana| apple|
|--|--|---|--|--|--|--|
|0|1|2|3|4|5|6|

Another way to store this would be to keep a distinct list of the fruit and then map each item of the original list to its index in that distinct list.

| apple| peach| banana|
|----|----|---|
| 0| 1| 2|

|0|1|2|1|2|2|0|
|-|-|-|-|-|-|-|
|0|1|2|3|4|5|6|

With this second method we only need to store the name of each fruit once!

Fortunately kdb+/q already provides a method to achieve this enumeration without us having to go through all these steps! 

# Creating an Enumeration

Within kdb+/q, enumerations can be created either as direct, strict mappings or as extensible mappings. Enumeration is only possible when working with symbols. 

The two methods differ in the kdb+/q operator that is used: 

* [Enumerate - `$`](https://code.kx.com/q/ref/enumerate/) - all items in list to be enumerated must be in the enumeration domain, if not the enumeration will fail.
* [Enum extend - `?`](https://code.kx.com/q/ref/enum-extend/) - any items not in the enumeration domain will be added to the enumeration domain 

Syntax: 
    
    <enumeration domain list as a symbol>$<list to enumerate> 
    <enumeration domain list as a symbol>?<list to enumerate> 
        
We will look at each individually. 

## Enumerate - $
Let's work with the fruit example from above. We have a list of repetitive symbols and a small list of the unique symbol values: 

In [3]:
show fruitbowl:`apple`peach`banana`peach`apple`banana`peach`apple`banana`apple    //creating a list of symbols 
show s:distinct fruitbowl       //getting our distinct values
show ref:s?fruitbowl            //getting our references from the list to the unique values

`apple`peach`banana`peach`apple`banana`peach`apple`banana`apple
`apple`peach`banana
0 1 2 1 0 2 1 0 2 0

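As a quick check that the distinct values plus the references carry the same information as the original list, the sketch below indexes the domain with the references and compares the result with the input. The names `fruit`, `dom` and `idx` are illustrative only and are not used elsewhere in this notebook.

```q
fruit:`apple`peach`banana`peach`banana`banana`apple
dom:distinct fruit      // `apple`peach`banana
idx:dom?fruit           // 0 1 2 1 2 2 0 (Find returns the position of each item in dom)
fruit~dom idx           // 1b - indexing the domain with the references rebuilds the list
```

This is the same bookkeeping that the `$` and `?` enumeration operators take care of for us.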

We can form an enumeration between these two lists by doing the following: 

In [4]:
show enumFruitbowl:`s$fruitbowl    //notice how the output starts with `s$
fruitbowl                          //we haven't directly modified fruitbowl itself 

`apple`peach`banana`peach`apple`banana`peach`apple`banana`apple
`s$`apple`peach`banana`peach`apple`banana`peach`apple`banana`apple


Let's see what happens to our lists when we modify our enumeration domain (`s`): 

In [5]:
//s
s[0]:`orange  //changing our first item in the list 
s 

`orange`peach`banana


In [6]:
enumFruitbowl  //updated the first item with the new enumeration domain value 
fruitbowl      //unchanged 

`s$`orange`peach`banana`peach`orange`banana`peach`orange`banana`orange
`apple`peach`banana`peach`apple`banana`peach`apple`banana`apple


<img src="../images/qbies.png" style="width: 50px;padding-right:5px;padding-top:10px;padding-left:5px;" align="left"/>

<p style='color:#273a6e'><i>Since changing our enumeration domain changes every matching value in our enumerated data, we can leverage this to keep our data in sync when a value needs to change.</i></p>

Now let's see what happens when we try to add a fruit to the fruitbowl that does not exist in our enumeration domain `s`:

In [None]:
`s$`mango,fruitbowl //returns a cast error

<img src="../images/qbies.png" style="width: 50px;padding-right:5px;padding-top:10px;padding-left:5px;" align="left"/>

<p style='color:#273a6e'><i>Enumerating using <code>$</code> won't deal with unforeseen values - namely new items not already in your enumeration domain will throw an error. This is good for instances where we don't want to extend that domain or if we want to be alerted to inconsistencies. </i></p>

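If we would rather detect the offending values than hit the error, one possible approach is to compare the incoming symbols with the domain first, or to trap the error with protected evaluation. The sketch below assumes an illustrative domain `dom` and input list `incoming`; neither name appears elsewhere in this notebook.

```q
dom:`apple`peach`banana
incoming:`mango`apple`peach
incoming except dom                                // the symbols missing from the domain
all incoming in dom                                // 0b - not safe to enumerate with $
@[{`dom$x};incoming;{"enumeration failed: ",x}]    // traps the error and returns a message instead
```

The membership check is useful when we want to report exactly which values are new before deciding whether the domain should grow.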
As you can see, Enumerate (`$`) only accepts symbols that already exist in the enumeration domain. For more flexibility, we use Enum extend (`?`).

## Enum extend - ?
This method will let you expand the domain and dynamically add new unique values. Let's look at an example below:

In [7]:
fruitbowl                   //our original (un-enumerated) list
`s?`mango,fruitbowl         //adding our new fruit to the bowl 
s                           //any values missing from the domain (mango, and apple since we replaced it with orange) have been appended

`apple`peach`banana`peach`apple`banana`peach`apple`banana`apple
`s$`mango`apple`peach`banana`peach`apple`banana`peach`apple`banana`apple
`orange`peach`banana`mango`apple


Another useful feature of Enum extend is that we don't need to have an enumeration domain defined already - we can create one while performing the enumeration:

In [10]:
`ourNewEnum in key `.           //check if this variable already exists in our workspace
`ourNewEnum?`mango,fruitbowl    //enumerating against a new variable creates it 
ourNewEnum                      //here it is! unique items in order of appearance

1b
`ourNewEnum$`mango`apple`peach`banana`peach`apple`banana`peach`apple`banana`a..
`mango`apple`peach`banana


<img src="../images/qbies.png" style="width: 50px;padding-right:5px;padding-top:2px;padding-left:5px;" align="left"/>

<p style='color:#273a6e'><i>Enumeration extend <code>?</code> is frequently used with market data as new companies are constantly beginning to trade on exchanges.</i></p>

If we specify a file handle rather than a variable (i.e. <code>\`:ourNewEnum</code> rather than <code>\`ourNewEnum</code>) this enumeration is not only created, but it is written to disk: 

In [12]:
`ourNewEnum2 in key `.           //check if this variable already exists in our workspace
`:ourNewEnum2?`mango,fruitbowl  //enumerating against a file handle creates this on disk and also as a variable 
ourNewEnum
ourNewEnum2

1b
`ourNewEnum2$`mango`apple`peach`banana`peach`apple`banana`peach`apple`banana`..
`mango`apple`peach`banana
`mango`apple`peach`banana


If we look at our local directory in Jupyter, we will now see this file stored there! Alternatively, we can use `key` to list the contents of the current directory as follows: 

In [14]:
last key `:.   //taking only the last item so the directory listing isn't truncated in the display

ourNewEnum2


<img src="../images/qbies.png" style="width: 50px;padding-right:5px;padding-top:2px;padding-left:5px;" align="left"/>

<p style='color:#273a6e'><i>Enum extend <code>?</code> preserves the attributes of the list, whereas <code>$</code> doesn't. </i></p>

In [15]:
`ourNewEnum2?`g#`mango,fruitbowl  //preserves the group attribute
`ourNewEnum2$`g#`mango,fruitbowl

`g#`ourNewEnum2$`g#`mango`apple`peach`banana`peach`apple`banana`peach`apple`b..
`ourNewEnum2$`mango`apple`peach`banana`peach`apple`banana`peach`apple`banana`..


## $ Enumeration versus ? Enumeration

Now that we have gone through both methods of enumeration, what are the main differences between them?

* `$` is used for fixed enumerated lists. `?` is used when we expect the enumerated list to grow/change
* `?` preserves the attribute of the list, `$` does not
* `?` will create an enumeration domain if it does not already exist

When should we use `$` versus `?`?

* Use `$` when the list of symbols in the database will be fixed and unchanging. This will provide the added benefit of preventing any unwanted symbols being added to the database. 
* Use `?` for lists/databases containing data which will have new symbols added all the time, for example market data.

##### Exercise 

Create a 1000 item list called `quarantineActivities` drawn from the following list: <code>\`books\`music\`books\`netflix\`news\`internet\`music\`hulu\`eat\`eat\`eat\`snack</code> (repetition intended). 

Create an enumeration domain called `activities` with the unique activities and a new list called `qActivities` which is the enumeration of `quarantineActivities` using the `activities` domain.

*(ASIDE: what happens if we use the original list as our enumeration domain, i.e.
<code>\`books\`music\`books\`netflix\`news\`internet\`music\`hulu\`eat\`eat\`eat\`snack</code>?)*

In [20]:
show quarantineActivities: 1000?`books`music`books`netflix`news`internet`music`hulu`eat`eat`eat`snack 
//getting the unique items - from our original list, or from quarantineActivities
show activities: distinct quarantineActivities

`hulu`music`snack`music`news`eat`hulu`eat`books`eat`books`eat`hulu`internet`b..
`hulu`music`snack`news`eat`books`internet`netflix


In [21]:
//enumerating using the activities domain
show qActivities: `activities$quarantineActivities

`activities$`hulu`music`snack`music`news`eat`hulu`eat`books`eat`books`eat`hul..


In [None]:
//ASIDE
//what if you used the original list as activities? 
activities: `books`music`books`netflix`news`internet`music`hulu`eat`eat`eat`snack 
show qA:`activities$quarantineActivities     //this still works, but it's bad practice since we have repetition in  
                                                //our list, and some indexes will never be used (recall 1 3 2 3 1?3)
/delete activities from `.                    //removing activities from our current namespace 
distinct qA                                  //you see we are missing some indexes from our list!

In [19]:
//your answer here 
quarantineActivities: 1000?`books`music`books`netflix`news`internet`music`hulu`eat`eat`eat`snack
activities:distinct quarantineActivities
show qActivities: `activities$quarantineActivities

`activities$`snack`eat`news`eat`hulu`music`music`music`snack`hulu`eat`eat`eat..


Oh no! We forgot to exercise! Quick - let's pretend that all that time we spent snacking, we were instead exercising before someone finds our activity log and judges us. 

Update the values in `qActivities` by modifying `activities`. As a refresher (and to cover our trail), modify the values in `quarantineActivities` too.

In [None]:
activities                             //let's look at our domain first 
activities[activities?`snack]:`exercise  //now let's update the list
activities                             //updated activities 

In [None]:
qActivities                            //phew!

In [None]:
//refresher - modifying quarantineActivities 
quarantineActivities[where quarantineActivities = `snack]: `exercise 
quarantineActivities                   //judgement averted!

In [25]:
//your answer here
activities[2]:`exercise
qActivities
quarantineActivities[where quarantineActivities = `snack]:`exercise
quarantineActivities

`activities$`hulu`music`exercise`music`news`eat`hulu`eat`books`eat`books`eat`..
`hulu`music`exercise`music`news`eat`hulu`eat`books`eat`books`eat`hulu`interne..


Add a new activity `zooming` to our `activities` list using the Enum extend method:

In [None]:
activities
`activities?`zooming   //the act of enumerating this symbol adds it 
activities

In [26]:
//your answer here
`activities?`zooming
activities

`activities$`zooming
`hulu`music`exercise`news`eat`books`internet`netflix`zooming


Apply the new `activities` enumeration to the following index listing: `0 8 0 2 3 3 1 5`

In [28]:
`activities!0 8 0 2 3 3 1 5   //using ! we can apply an enumeration to an index listing

`activities$`hulu`zooming`hulu`exercise`news`news`music`books


In [27]:
//your answer here
indexListing: 0 8 0 2 3 3 1 5 
`activities!indexListing

`activities$`hulu`zooming`hulu`exercise`news`news`music`books


# Enumeration as a Datatype 
All enumerations have a datatype value 20 (kdb+ v3.6+, previously 20-76).

In [29]:
type s               //our Enumeration domain is a symbol list  
type enumFruitbowl     //our Enumeration itself is type 20
type `newS?fruitbowl   //a new Enumeration - also type 20

11h
20h
20h

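Behind the type 20h, an enumeration is held as a list of indexes into its domain. As a hedged sketch (the names `dom`, `e` and `idx` are illustrative only), casting an enumeration to long should expose those indexes, and applying `!` with the domain name turns such an index list back into an enumeration:

```q
dom:`apple`peach`banana
e:`dom$`peach`banana`apple`peach
type e            // 20h
idx:`long$e       // 1 2 0 1 - positions of each item within dom
`dom!idx          // rebuilds an enumeration over dom from the indexes
```

This is also why the enumerated form is compact on disk: only the integers are stored alongside a single copy of the domain.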

There was a [change in kdb+ v3.6](https://code.kx.com/q/releases/ChangesIn3.6/#64-bit-enumerations) so that all 64-bit enums are type 20h regardless of their domain and there is no practical limit to the number of enumerations that can be in operation (in older versions the type was between 20-76 so users were limited to a max of 57 enumerations). 

# Operations on Enumerations

An enumerated list behaves the same as a symbol list when we perform operations on it, so it does not require any special code modification.

In [33]:
enumFruitbowl                 //our enumerated list 
//operations on enumerations work the same as if they were operating on the actual symbol list
//where enumFruitbowl= `peach   
enumFruitbowl =/: `peach`orange

`s$`orange`peach`banana`peach`orange`banana`peach`orange`banana`orange
0101001000b
1000100101b


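A few more everyday operations behave in the same way. This is a small sketch using an illustrative domain `dom` and enumerated list `e`, neither of which is defined elsewhere in this notebook:

```q
dom:`AAPL`IBM`JPM
e:`dom$`IBM`AAPL`IBM`JPM
e=`IBM              // 1010b
e in `AAPL`JPM      // 0101b
where e=`IBM        // 0 2
```

In each case we compare against plain symbols and never need to mention the domain.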
## Un-enumerating

There are some situations where we only need the underlying, non-enumerated values. One example is converting from one enumeration domain to another, which happens when copying from one kdb+ database to another or when merging two databases. 

We can use `value` or `get` to unenumerate our data:

In [34]:
show sym:`AAPL`IBM`JPM    //our enumeration domain
show L:100000?sym         //creating the (un-enumerated) list that we will enumerate

`AAPL`IBM`JPM
`AAPL`JPM`AAPL`JPM`AAPL`IBM`JPM`JPM`JPM`JPM`AAPL`AAPL`JPM`JPM`JPM`IBM`JPM`IBM..


In [35]:
show enumL:`sym$L        //making our new enumerated list 
show value enumL         //unenumerating our list - notice the leading enumeration domain is gone 

`sym$`AAPL`JPM`AAPL`JPM`AAPL`IBM`JPM`JPM`JPM`JPM`AAPL`AAPL`JPM`JPM`JPM`IBM`JP..
`AAPL`JPM`AAPL`JPM`AAPL`IBM`JPM`JPM`JPM`JPM`AAPL`AAPL`JPM`JPM`JPM`IBM`JPM`IBM..


In [None]:
L~value enumL            //unenumerating our enumerated list is the same as our original list
get[enumL]~value enumL   //value or get both work to unenumerate

<img src="../images/qbies.png" style="width: 50px;padding-right:5px;padding-top:2px;padding-left:5px;" align="left"/>

<p style='color:#273a6e'><i> Actually value and get are the same "under the hood"!</i></p>

In [None]:
value 
get 

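The domain conversion mentioned at the start of this section can be sketched as follows: un-enumerate with `value`, then enumerate against the second domain with Enum extend so that any missing symbols are appended. The names `domA`, `domB`, `a` and `b` are illustrative assumptions.

```q
domA:`AAPL`IBM`JPM
a:`domA$`IBM`JPM`IBM      // enumerated over the first domain
domB:`MSFT`IBM            // a second, different domain
b:`domB?value a           // un-enumerate, then enumerate against domB
domB                      // `MSFT`IBM`JPM - JPM was appended
value[a]~value b          // 1b - the underlying symbols are unchanged
```

Using `?` rather than `$` here avoids a cast error when the second domain lacks some of the symbols.
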
# The `sym` file 
The default name for the enumeration used within kdb+/q tick capture systems is `sym`.

<img src="../images/qbies.png" style="width: 50px;padding-right:5px;padding-top:10px;padding-left:5px;" align="left"/>

<p style='color:#273a6e'><i> In real-world applications, the ticker symbol column in the trade and quote tables (aka TAQ data) is always enumerated, generally against a file called <code>sym</code>. There are three reasons for this:</i></p>

1. The ticker symbol list is a finite list that only rarely changes
2. Searching a list of variable-length values is time-consuming; searching a list of integers is much faster.
3. Reading and writing the index list is a fast operation.

The sym file is central to all kdb+/q systems, and we will discuss it in both the practical guidance and later in the course. First, let's look at how we save an enumeration domain to disk using [?](https://code.kx.com/q/ref/enum-extend/#filepath):

In [36]:
seasons:`summer`winter 
`:seasons?`autumn`summer`winter`summer`autumn`winter`winter

`seasons$`autumn`summer`winter`summer`autumn`winter`winter


Let's check what `seasons` looks like now:

In [37]:
seasons //autumn is now part of the seasons variable

`autumn`summer`winter

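To confirm that the domain really lives on disk, we can read the file back with `get` and compare it with the in-memory variable. This is a minimal sketch that writes a file called `demoSeasons` to the current directory; the file name is an assumption chosen so it does not clash with anything above.

```q
`:demoSeasons?`autumn`summer`winter`summer   // creates or extends the file and the variable demoSeasons
demoSeasons~get `:demoSeasons                // 1b - the variable matches what is stored on disk
hdel `:demoSeasons                           // optional clean-up of the demo file
```

If the file already exists from an earlier run, Enum extend loads it and appends only the symbols that are missing.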

# Foreign Keys in tables

A [foreign key](https://code.kx.com/q4m3/8_Tables/#85-foreign-keys-and-virtual-columns) defines a mapping from the rows of the table in which it is defined to the rows of the table with the corresponding primary key. This foreign key relationship is achieved via the application of an enumeration, specifically `$` - this is how foreign keys provide referential integrity. 


Since this utilizes `$`, every value must already be in the enumeration domain, so any attempt to insert a foreign key value that is not in the primary key will fail. The foreign key relationship is established by enumerating a column in one table against the (unique) key of a second table.

This will be shown using two examples.

## Explicit Foreign key 
This method enforces referential integrity. 

In the first example we will define a foreign key explicitly on initialization:

In [38]:
//Example 1
show sector:([symb:`IBM`MSFT`FDP]ex:`N`CME`N;MC:1000 250 5000) 
show t:([]sym:`sector$`IBM`FDP`FDP`FDP`IBM`MSFT;price:6?1f)

symb| ex  MC  
----| --------
IBM | N   1000
MSFT| CME 250 
FDP | N   5000
sym  price    
--------------
IBM  0.4331538
FDP  0.2017667
FDP  0.8032723
FDP  0.1444925
IBM  0.9366088
MSFT 0.2369792


This relationship can be seen in the table's meta information, specifically the `f` column, which indicates foreign key relationships:

In [39]:
meta t         //here t has a foreign key relation to sector on the sym column
meta sector

c    | t f      a
-----| ----------
sym  | s sector  
price| f         
c   | t f a
----| -----
symb| s    
ex  | s    
MC  | j    


[fkeys](https://code.kx.com/q/ref/fkeys/) returns a dictionary that maps each foreign key column of a table to the table it references. 

In [40]:
fkeys t 

sym| sector


The neat thing is that we can now use this foreign key relationship within `t` to pull in data from `sector`: 

In [41]:
select from t where sym.ex=`N        //notice we only return `IBM`FDP which correspond to ex `N in sector
select sym, price, sym.MC from t     //we can use these values in any part of our qSQL statement
select count i by sym.ex from t 

sym price    
-------------
IBM 0.4331538
FDP 0.2017667
FDP 0.8032723
FDP 0.1444925
IBM 0.9366088
sym  price     MC  
-------------------
IBM  0.4331538 1000
FDP  0.2017667 5000
FDP  0.8032723 5000
FDP  0.1444925 5000
IBM  0.9366088 1000
MSFT 0.2369792 250 
ex | x
---| -
CME| 1
N  | 5


It is important to note that the `sym` column is now an enumeration over the key of the `sector` table.
The general notation for querying through a predefined foreign key is:
    
    select a.b from c
        a is the foreign key column (here sym)
        b is a column in the primary key table (here ex or MC)
        c is the table holding the foreign key (here t)

We also cannot insert data into our table that would violate our foreign key relationship: 

In [43]:
`t insert (`IBM;0.4)   //this works because `IBM is in sector 
t
//`t insert (`NEW;1f)    //this returns a cast error

,7
sym  price    
--------------
IBM  0.4331538
FDP  0.2017667
FDP  0.8032723
FDP  0.1444925
IBM  0.9366088
MSFT 0.2369792
IBM  0.4      
IBM  0.4      


In [44]:
`sector upsert (`NEW;`CME;200)
`t insert (`NEW;1f)    //this now works, because we have defined the sym within sector
t

sector
,8
sym  price    
--------------
IBM  0.4331538
FDP  0.2017667
FDP  0.8032723
FDP  0.1444925
IBM  0.9366088
MSFT 0.2369792
IBM  0.4      
IBM  0.4      
NEW  1        

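In a feed or loader we may prefer to trap the failed insert rather than let it signal. The sketch below shows one way to do this with protected evaluation; the tables `kt` and `ft` are illustrative stand-ins for `sector` and `t` and are not defined elsewhere in this notebook.

```q
kt:([k:`a`b`c] v:10 20 30)                     // primary key table
ft:([] k:`kt$`c`a`c; q:1 2 3)                  // k is a foreign key into kt
@[{`ft insert x};(`z;4);{"rejected: ",x}]      // `z is not a key of kt, so the error is trapped
`kt upsert (`z;40)                             // add the missing key first
`ft insert (`z;4)                              // the same insert now succeeds
```

Trapping the error keeps the process running while still refusing the row that would break the relationship.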

This is how foreign keys ensure referential integrity. 

As with enumerations, we can resolve a foreign key by applying `value` to the column, which retrieves the actual values:

In [45]:
meta update value sym from t //applying value to the enumerated column
meta update get sym from t   //can also use get 

c    | t f a
-----| -----
sym  | s    
price| f    
c    | t f a
-----| -----
sym  | s    
price| f    
