In [1]:
import numpy as np, scipy, pandas as pd
from pyDbs.__init__ import *

# Documentation for ```pyDbs.adj```

In [2]:
pd.set_option('display.max_rows', 5)

A helper class used to adjust pandas objects (series, dataframes, indices). The class only defines static methods. Here, we discuss the main method, which is frequently used in other classes: ```adj.rc_pd```. This method adjusts pandas objects in several ways, all based on domains defined as ```pd.Index``` instances. In general, the method allows three types of adjustments to a given symbol:
1. Adjustment to the symbol indices through the ```lag``` specification. This lags specified index levels in the symbol.
2. Adjustment to the symbol domains through the ```alias``` specification. This maps the domain names of pandas indices to new ones.
3. Adjustment to the symbol through conditions (```c```). Conditions can be specified quite flexibly, but they ultimately rely on matching the underlying pandas indices of the condition and the symbol.

Here, we go through each of them in turn and expand a bit on how it works.

*Define some pandas series, dataframes, and indices:*

In [3]:
idx1 = pd.Index(range(10), name = 'a')
idx2 = pd.Index(range(11,15), name = 'b')
mIdx1 = pd.MultiIndex.from_product([idx1,idx2])
mIdx2 = pd.MultiIndex.from_arrays([idx1[0:4], idx2[0:4]])
s1 = pd.Series(range(len(idx1)), index = idx1, name = 's1')
s2 = pd.Series(range(len(idx2[2:])), index = idx2[2:], name = 's2')
s3 = pd.Series(range(len(mIdx1[0:20])), index = mIdx1[0:20], name = 's3')
s4 = pd.Series(range(len(mIdx2)), index = mIdx2, name = 's4')
df = pd.DataFrame(np.vstack([range(len(mIdx1)), range(len(mIdx1))]).T, index = mIdx1, columns = ['c1','c2'])

## 1. Lag

The method ```adj.rc_pd(s, c = None, alias = None, lag = None, pm = True, **kwargs)``` starts by adjusting the underlying index of a symbol ```s``` using the ```lag``` statement. If ```lag = None```, no adjustments are made. To lag relevant index levels, the ```lag``` argument should be a dictionary with key = the domain of the symbol that we want to lag and value = the number of lags. Note that the lag value is added to the index: with value = 1, each entry of the lagged level increases by 1.

**Indices:**

*Lag 1d index and compare to before*

In [4]:
pd.Series(adj.rc_pd(idx1, lag = {'a': 1}), index= idx1) # index is the original one; the values in the series represent the new index values

a
0     1
1     2
     ..
8     9
9    10
Name: a, Length: 10, dtype: int64

*Lag 2d index and compare:*

In [5]:
pd.Series(adj.rc_pd(mIdx1, lag = {'a': 1}).values, index= mIdx1) # index is the original one; the values in the series represent the new index values

a  b 
0  11     (1, 11)
   12     (1, 12)
           ...   
9  13    (10, 13)
   14    (10, 14)
Length: 40, dtype: object

*Lag both levels in a multiindex and compare:*

In [6]:
pd.Series(adj.rc_pd(mIdx2, lag = {'a': 1, 'b': -1}).values, index= mIdx2) # index is the original one; the values in the series represent the new index values

a  b 
0  11    (1, 10)
1  12    (2, 11)
2  13    (3, 12)
3  14    (4, 13)
dtype: object

**Series and dataframes:** Other pandas objects are adjusted in a similar way: it is their underlying ```pd.Index``` instance that is passed to the lag methods. So we'll just provide one example of each:

In [7]:
adj.rc_pd(s1, lag = {'a':1}) # this returns a series where the value represents s1[a-1]; thus when a = 1 in the index, the value represents s1[0]

a
1     0
2     1
     ..
9     8
10    9
Name: s1, Length: 10, dtype: int64

In [8]:
adj.rc_pd(df, lag = {'b':-1})

      c1  c2
a b         
0 10   0   0
  11   1   1
...   ..  ..
9 12  38  38
  13  39  39
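
*A plain pandas sketch of the lag adjustment:* The lag examples above can be reproduced without ```pyDbs```. The helper ```lag_index``` below is hypothetical (it is not part of ```pyDbs``` and not necessarily how ```adj.rc_pd``` is implemented); it only assumes that a lag adds the specified value to the named index level, as the outputs above suggest.

```python
import pandas as pd

def lag_index(idx, lag):
    """Hypothetical helper (not part of pyDbs): add lag[name] to each named level."""
    if isinstance(idx, pd.MultiIndex):
        for name, shift in lag.items():
            level_values = idx.levels[idx.names.index(name)]
            idx = idx.set_levels(level_values + shift, level=name)
        return idx
    # 1d index: shift the values and keep the domain name
    return pd.Index(idx + lag.get(idx.name, 0), name=idx.name)

idx_a = pd.Index(range(10), name='a')
m_idx = pd.MultiIndex.from_arrays([list(range(4)), list(range(11, 15))], names=['a', 'b'])

lagged_1d = lag_index(idx_a, {'a': 1})           # values 1, ..., 10
lagged_2d = lag_index(m_idx, {'a': 1, 'b': -1})  # (1, 10), (2, 11), ...

# For a series, only the index changes; the values stay in place.
s = pd.Series(range(10), index=idx_a, name='s1')
s_lagged = s.copy()
s_lagged.index = lag_index(s.index, {'a': 1})
```
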


## 2. Alias

After potentially adjusting the index with lags, the alias statement maps to new domain names. This is straightforwardly done with a dictionary mapping from old to new names. If ```alias = None``` no adjustments are made in the ```rc_pd``` method.

**Indices:**

*Adjust name of 1d index to 'aa' instead of 'a':*

In [9]:
adj.rc_pd(idx1, alias = {'a':'aa'}).name

'aa'

*Adjust domains of 2d index:*

In [10]:
adj.rc_pd(mIdx1, alias = {'b':'bb'}).names

FrozenList(['a', 'bb'])

*Adjust both names*

In [11]:
adj.rc_pd(mIdx1, alias = {'a':'mrPoxycat', 'b': 'bb'}).names

FrozenList(['mrPoxycat', 'bb'])

**Series and dataframes:** Adjustments are made to the underlying indices; the syntax is identical to the one used for indices.
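
*A plain pandas sketch of the alias adjustment:* The renaming can be mimicked with ```Index.set_names```. The helper ```alias_index``` is hypothetical (not part of ```pyDbs```) and only assumes that domains missing from the ```alias``` dictionary keep their names, as in the examples above.

```python
import pandas as pd

def alias_index(idx, alias):
    """Hypothetical helper (not part of pyDbs): rename the domains found in alias."""
    return idx.set_names([alias.get(name, name) for name in idx.names])

idx_a = pd.Index(range(10), name='a')
m_idx = pd.MultiIndex.from_product([list(range(10)), list(range(11, 15))], names=['a', 'b'])

renamed_1d = alias_index(idx_a, {'a': 'aa'})
renamed_2d = alias_index(m_idx, {'b': 'bb'})

# For a series, the same mapping is applied to its index.
s = pd.Series(range(10), index=idx_a, name='s1')
s_alias = s.copy()
s_alias.index = alias_index(s.index, {'a': 'aa'})
```
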

## 3. Conditions

The final and most important feature is the conditions used to subset a symbol. The conditions (```c``` in the call ```adj.rc_pd(s, c = None, alias = None, lag = None, pm = True, **kwargs)```) can be of different input types that can also be nested in different ways. Ultimately, however, the condition structure boils down to a search for overlaps between the index of the symbol ```s``` and the conditions that we pass.

### A. Passing a single pandas index, series, or dataframe

If ```c``` is a pandas object (one of the three above), the method simply compares the index of the symbol ```s``` (or the symbol itself, if it is an index) with the index of the condition (or the condition itself, if it is an index). The ```pm``` argument specifies whether or not to look for *partial matches* when comparing the domains. The simplest way to illustrate its effect is to look at two indices defined over different domains ('a' and 'b' respectively): if ```pm=True```, we subset the symbol ```s``` only on the domains in the condition index that overlap with the domains of ```s```. Thus, when two symbols are defined over non-overlapping domains, we have:

*Partial match with non-overlapping domains:*

In [12]:
adj.rc_pd(idx1, idx2, pm = True) # idx2 has no overlap in domains --> return full index

Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64', name='a')

*Non partial match with non-overlapping domains:*

In [13]:
adj.rc_pd(idx1, idx2, pm = False) # idx2 has no overlap in domains --> return empty index

Index([], dtype='int64', name='a')

Now let us consider a case where the domains actually overlap. For instance, ```s1``` is defined over the full index 'a', and the index ```mIdx2``` is defined over only some of the values of 'a', but also over another domain 'b'. In this case we get:

*With ```pm = True``` this returns the elements where index domain 'a' overlaps in the two symbols:*

In [18]:
adj.rc_pd(s1, mIdx2, pm = True)

a
0    0
1    1
2    2
3    3
Name: s1, dtype: int64

With ```pm = False``` this returns an empty series, because the condition (```mIdx2```) also contains the domain 'b', which it cannot find any matches for in ```s1```:

In [19]:
adj.rc_pd(s1, mIdx2, pm = False)

Series([], Name: s1, dtype: int64)

The ```pm``` argument has no effect if the domains of the condition are all in the symbol domains:

In [22]:
all(adj.rc_pd(s1, idx1, pm = True) == adj.rc_pd(s1, idx1, pm = False))

True

Once again, all methods are implemented such that they equally accept pandas series, dataframes, or indices; in the end, they only look at the index of the given object (which refers to the index itself if we pass an index).

*The conditions can also be passed as series or dataframe instances, and the symbol can also be any of the three types:*

In [23]:
adj.rc_pd(mIdx2, df) # returns the part of the multiindex that the dataframe is defined over as well.

MultiIndex([(0, 11),
            (1, 12),
            (2, 13),
            (3, 14)],
           names=['a', 'b'])
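
*A plain pandas sketch of the matching logic:* The behaviour of ```pm``` described in this subsection can be written as a boolean mask over the symbol's index. The function ```overlap_mask``` and its helpers are hypothetical (not part of ```pyDbs```); the sketch assumes only the rules stated above and is not necessarily how ```adj.rc_pd``` is implemented.

```python
import numpy as np
import pandas as pd

def _index(x):
    # Series and dataframes are represented by their index.
    return x if isinstance(x, pd.Index) else x.index

def _tuples(idx, names):
    # Project an index onto the given domains, one tuple per row.
    return list(idx.to_frame(index=False)[names].itertuples(index=False, name=None))

def overlap_mask(s, c, pm=True):
    """Hypothetical helper: True where the index of s overlaps with the index of c."""
    s_idx, c_idx = _index(s), _index(c)
    shared = [n for n in c_idx.names if n in s_idx.names]
    if not pm and len(shared) < len(c_idx.names):
        return np.zeros(len(s_idx), dtype=bool)  # some condition domain is missing in s
    if not shared:
        return np.ones(len(s_idx), dtype=bool)   # nothing to match on: keep everything
    allowed = set(_tuples(c_idx, shared))
    return np.array([t in allowed for t in _tuples(s_idx, shared)], dtype=bool)

idx_a = pd.Index(range(10), name='a')
idx_b = pd.Index(range(11, 15), name='b')
m_diag = pd.MultiIndex.from_arrays([list(range(4)), list(range(11, 15))], names=['a', 'b'])
s = pd.Series(range(10), index=idx_a, name='s1')

partial = s[overlap_mask(s, m_diag, pm=True)]   # keeps a = 0, 1, 2, 3
strict = s[overlap_mask(s, m_diag, pm=False)]   # empty: 'b' is not a domain of s
```
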

### B. Nested conditions - tuples

We can also nest conditions by passing tuples. We allow for three kinds of tuples to be passed:
* Negation: Added using the tuple structure ```t = ('not', c)``` where ```c``` itself is a condition (or further nested conditions). This simply negates the condition ```c```.
* And tuples: Added using the tuple structure ```t = ('and', [c1, ... ,cN])```, where each of ```c1,..., cN``` is a condition (or further nested conditions). This requires that all conditions in the list hold.
* Or tuples: Added using the tuple structure ```t = ('or', [c1,...,cN])```, where each of ```c1,...,cN``` is a condition (or further nested conditions). In this case, we require only that at least one of the conditions in the list holds.

The tuple structure allows us to combine searches for domain overlaps in a generally nested fashion. The following provides some examples.

*Simple negation:*

In [29]:
adj.rc_pd(s3, ('not', mIdx2)) # get the elements in s3 that are not in the multiindex

a  b 
0  12     1
   13     2
         ..
4  13    18
   14    19
Name: s3, Length: 16, dtype: int64

*Negation of a combination of two indices (nested); the conditions do not have to share domains:*

In [30]:
adj.rc_pd(s3, ('not', ('or', [mIdx2, idx2[:2]])))

a  b 
0  13     2
   14     3
         ..
4  13    18
   14    19
Name: s3, Length: 8, dtype: int64

*Three-level nesting, just to show the idea:*

In [31]:
# Takes the elements for which "in mIdx2 or not in idx2[:2]" is false, i.e. elements that are in idx2[:2] but not in mIdx2
adj.rc_pd(df, ('not', ('or', [mIdx2, ('not', idx2[:2])])))

      c1  c2
a b         
0 12   1   1
1 11   4   4
...   ..  ..
9 11  36  36
  12  37  37
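
*A plain pandas sketch of nested conditions:* The tuple structure can be read as a recursive evaluation of boolean masks: ```'not'``` inverts a mask, ```'and'``` and ```'or'``` combine the masks of the listed conditions. The functions ```overlap_mask``` and ```condition_mask``` below are hypothetical (not part of ```pyDbs```), and the sketch assumes that ```pm``` is applied to every leaf condition; it reproduces the lengths and first rows of the three examples above, but is not necessarily how ```adj.rc_pd``` is implemented.

```python
import numpy as np
import pandas as pd

def _index(x):
    return x if isinstance(x, pd.Index) else x.index

def _tuples(idx, names):
    return list(idx.to_frame(index=False)[names].itertuples(index=False, name=None))

def overlap_mask(s, c, pm=True):
    """Hypothetical leaf condition: True where the index of s overlaps with c."""
    s_idx, c_idx = _index(s), _index(c)
    shared = [n for n in c_idx.names if n in s_idx.names]
    if not pm and len(shared) < len(c_idx.names):
        return np.zeros(len(s_idx), dtype=bool)
    if not shared:
        return np.ones(len(s_idx), dtype=bool)
    allowed = set(_tuples(c_idx, shared))
    return np.array([t in allowed for t in _tuples(s_idx, shared)], dtype=bool)

def condition_mask(s, c, pm=True):
    """Hypothetical evaluator for ('not', c), ('and', [...]) and ('or', [...])."""
    if not isinstance(c, tuple):
        return overlap_mask(s, c, pm)
    op, arg = c
    if op == 'not':
        return ~condition_mask(s, arg, pm)
    masks = [condition_mask(s, ci, pm) for ci in arg]
    if op == 'and':
        return np.logical_and.reduce(masks)
    if op == 'or':
        return np.logical_or.reduce(masks)
    raise ValueError(f"unknown operator: {op!r}")

idx_b = pd.Index(range(11, 15), name='b')
m_full = pd.MultiIndex.from_product([list(range(10)), list(range(11, 15))], names=['a', 'b'])
m_diag = pd.MultiIndex.from_arrays([list(range(4)), list(range(11, 15))], names=['a', 'b'])
s = pd.Series(range(20), index=m_full[:20], name='s3')
frame = pd.DataFrame({'c1': range(40), 'c2': range(40)}, index=m_full)

negated = s[condition_mask(s, ('not', m_diag))]
negated_or = s[condition_mask(s, ('not', ('or', [m_diag, idx_b[:2]])))]
three_level = frame[condition_mask(frame, ('not', ('or', [m_diag, ('not', idx_b[:2])])))]
```
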
