View key clean-up #67

Closed
jamesrkg opened this issue Sep 23, 2015 · 4 comments
@jamesrkg
Contributor

There are some problems with the current view key notation that need to be cleaned up.

method colon-delimiting

The method part of the view key needs to be colon-delimitable so that it can describe the effect of different methods acting on x and y. Where only one method is named and both x and y are present, the same method should be assumed to act on both.

The general rule should be that method_a:method_b|x:y means the intersection of method_a(x) by method_b(y), a more concrete example being frequency:mean|x:y means the intersection of frequency(x) by mean(y).

By extension, though, this makes what is currently frequency|x:y incorrect as the key for a column base row, because it should mean the intersection of frequency(x) by frequency(y), or in plain terms the cell where the row and column bases intersect (e.g. the number of cases in x and y).

As a consequence, the correct key for a column base row should be simply frequency|x and for a row base column frequency|y. Incidentally this is perfectly in keeping with the fundamental meaning of frequency| as basic counts, since the mention of either x or y is an implied collapse of all their values, respectively.
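The "one method named means the same method on both" rule could be sketched as follows. This is only an illustration: `expand_methods` is a hypothetical helper, not an existing function in this project.

```python
def expand_methods(method_part):
    """Return the (x_method, y_method) pair named by a method part.

    Hypothetical helper illustrating the proposed rule only.
    """
    methods = method_part.split(":")
    if len(methods) == 1:
        # Only one method named: assume it acts on both x and y.
        return methods[0], methods[0]
    if len(methods) == 2:
        # Left of the colon acts on x, right of the colon acts on y.
        return methods[0], methods[1]
    raise ValueError("method part may contain at most one colon: %r" % method_part)


print(expand_methods("frequency"))       # ('frequency', 'frequency')
print(expand_methods("frequency:mean"))  # ('frequency', 'mean')
```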

More examples (assuming x and y each have 3 possible values):

x|frequency||||counts                      
x|frequency||y||c%                          
x|frequency||x||r%                  
x|frequency|x|||cbase                      # same as x|frequency|x[(1,2,3)]|||net1-3
x|frequency|y|||rbase                      # same as x|frequency|y[(1,2,3)]|||net1-3
x|frequency|x[(1,2)]|||xnet1-2
x|frequency|y[(1,2)]|||ynet1-2
x|frequency|x:y|||cbase*rbase              # same as x|frequency|x[(1,2,3)]:y[(1,2,3)]|||cbase*rbase

... and so on.

Another important change that should be made is to use the conventional curly brace for set notation, so logic descriptors should be written as x[{1,2}]: instead of x[(1,2)]:. Currently the curly brace is used for answer count, but the two uses should be swapped. In this way one answer from codes 1 or 2 would be written as x[{1,2}(1)]:.
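As a rough sketch of how the swapped notation might be read back, the following parses a single logic descriptor into its axis, code set and optional answer count. The regular expression and the name `parse_descriptor` are assumptions made for illustration; the trailing colon is treated as a delimiter and is not part of the descriptor, and arithmetic forms are not handled.

```python
import re

# Assumed shape of the proposed form: axis[{codes}(count)], with (count) optional.
DESCRIPTOR = re.compile(
    r"^(?P<axis>[xy]\d*)\[\{(?P<codes>\d+(?:,\d+)*)\}(?:\((?P<count>\d+)\))?\]$"
)


def parse_descriptor(text):
    """Return (axis, codes, count) for a descriptor such as 'x[{1,2}(1)]'."""
    match = DESCRIPTOR.match(text)
    if match is None:
        raise ValueError("not a logic descriptor: %r" % text)
    codes = frozenset(int(code) for code in match.group("codes").split(","))
    count = match.group("count")
    return match.group("axis"), codes, (int(count) if count is not None else None)


print(parse_descriptor("x[{1,2}]") == ("x", frozenset([1, 2]), None))   # True
print(parse_descriptor("x[{1,2}(1)]") == ("x", frozenset([1, 2]), 1))   # True
```

Curly braces map naturally onto a set type here, which is part of the appeal of reserving them for code sets.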

Because the method part of the view key must be delimitable, it may be prudent to put in place some truncation rules that method names must adhere to. For example, instead of frequency perhaps simply f will suffice, especially given that it's so common. For other methods a 6-character limit per sub/method-part (to allow for needed abbreviations like stddev, stderr and so on) would help condense the overall key length and improve readability.

To avoid ambiguity, what is currently the relation part of the view key must always include a colon.

|:| means no conditions placed on either x or y
|x:| collapsed x, no conditions placed on y
|:y| collapsed y, no conditions placed on x
|x:y| collapsed x and y

The new convention means you should never see something like |y:x| because the left-hand side will always describe x and the right-hand side will always describe y.
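A minimal sketch of the colon rule, assuming a hypothetical helper named `split_conditions`: the part must contain exactly one colon, the left side may only describe x, the right side may only describe y, and an empty side means no conditions were placed on that axis.

```python
def split_conditions(condition_part):
    """Split a condition part such as 'x:y' into (x_side, y_side).

    An empty side is returned as None, meaning no conditions on that axis.
    Hypothetical helper, for illustration only.
    """
    if condition_part.count(":") != 1:
        raise ValueError(
            "condition part must contain exactly one colon: %r" % condition_part
        )
    x_side, y_side = condition_part.split(":")
    if "y" in x_side or "x" in y_side:
        raise ValueError(
            "left side describes x, right side describes y: %r" % condition_part
        )
    return (x_side or None), (y_side or None)


print(split_conditions(":"))     # (None, None)
print(split_conditions("x:"))    # ('x', None)
print(split_conditions(":y"))    # (None, 'y')
print(split_conditions("x:y"))   # ('x', 'y')
```

With this check, both a colon-less part like `x` and a reversed part like `y:x` are rejected rather than silently misread.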

In accordance with all of these proposed changes, the above view keys would become:

x|f|:|||counts                      
x|f|:|y||c%                          
x|f|:|x||r%                  
x|f|x:|||cbase                      # same as x|f|x[{1,2,3}]:|||net1-3
x|f|:y|||rbase                      # same as x|f|:y[{1,2,3}]|||net1-3
x|f|x:y|||cbase*rbase               # same as x|f|x[{1,2,3}]:y[{1,2,3}]|||cbase*rbase

However, all of these examples use the same method on x and y, which will often not be the case. Where a different method is used on each, both methods must be named and must be colon-delimited.

In conjunction with the need for descriptive stats to be named using sub-methods, this leads to:

x|d.mean:f|x:|||cmean                # column mean
x|f:d.mean|:y|||rmean                # row mean

Including the change for set notation, block nets also need to appear in discrete x/y-blocks delimited with a comma, meaning they will change from |x[(1,2),(3,4),(5,6)]:y to |x[{1,2}],x[{3,4}],x[{5,6}]: This both removes the ambiguity with complex logic and provides a comma-delimited relationship between the multiple methods and x/y.

Given the likely eventuality of other block methods the conventions should be similarly lazy, where f|x[{1,2}],x[{3,4}],x[{5,6}]: is effectively shorthand for f,f,f:f|x[{1,2}],x[{3,4}],x[{5,6}]:.

This is more relevant when imagining the needs of a block of descriptive stats, in which case d.mean,d.stddev,d.stderr:f|x: is more meaningful. In any case, parts that are not mentioned explicitly imply uniform application, so as to prevent the need for something like d.mean,d.stddev,d.stderr:f|x,x,x:.
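The lazy expansion for one axis could look something like the sketch below. It assumes the method part and condition part have already been split on the colon, so only the x side (or only the y side) is passed in. The names `split_blocks` and `broadcast` are hypothetical.

```python
def split_blocks(text):
    """Split on commas that sit outside square brackets."""
    parts, depth, current = [], 0, ""
    for char in text:
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
        if char == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += char
    parts.append(current)
    return parts


def broadcast(methods, conditions):
    """Pair methods with conditions on one axis, repeating a lone entry."""
    methods = split_blocks(methods)
    conditions = split_blocks(conditions)
    size = max(len(methods), len(conditions))
    if len(methods) == 1:
        methods = methods * size
    if len(conditions) == 1:
        conditions = conditions * size
    if len(methods) != len(conditions):
        raise ValueError("block lengths do not match")
    return list(zip(methods, conditions))


print(broadcast("f", "x[{1,2}],x[{3,4}],x[{5,6}]"))
# [('f', 'x[{1,2}]'), ('f', 'x[{3,4}]'), ('f', 'x[{5,6}]')]
print(broadcast("d.mean,d.stddev,d.stderr", "x"))
# [('d.mean', 'x'), ('d.stddev', 'x'), ('d.stderr', 'x')]
```

Tracking bracket depth keeps the commas inside `{1,2}` from being mistaken for block delimiters, which is the ambiguity the discrete-block form is meant to avoid.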

effective base

Effective base view keys should indicate a sub-method of frequency and must name a weight-part. What is currently x|frequency|x:y|||ebase should become x|f.eff:f|x:||weight|ecbase. Similarly, an effective row base would be x|f:f.eff|:y||weight|erbase.

@jamesrkg jamesrkg self-assigned this Sep 23, 2015
@jamesrkg jamesrkg added this to the RG-11 milestone Sep 23, 2015
@jamesrkg
Contributor Author

The more I think about this the more I wonder if the meaning of relationship as we've understood it until now is defunct, because the colon becomes the link between x/y and their respective methods, especially given examples like x|f|:y|||rbase, in which x doesn't need to be mentioned at all but the relationship between x and y is still described.

One option would be instead of:

x-position | method | relationship | relative to | weight | shortname

We could move to:

x-position | method/s | condition/s | relative to | weight | shortname

Since the third part of the view key actually describes the conditions placed on x and/or y as they are fed into their respective methods.
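For illustration, the six pipe-delimited parts under the renamed scheme could be unpacked like this. The field names are my own spellings of the part names listed above, and `parse_view_key` is a hypothetical helper.

```python
from collections import namedtuple

# Field names follow the proposed part names; the exact spellings are assumptions.
ViewKey = namedtuple(
    "ViewKey", "x_position methods conditions relative_to weight shortname"
)


def parse_view_key(key):
    """Split a view key into its six pipe-delimited parts."""
    parts = key.split("|")
    if len(parts) != 6:
        raise ValueError("expected 6 parts, got %d: %r" % (len(parts), key))
    return ViewKey(*parts)


key = parse_view_key("x|f.eff:f|x:||weight|ecbase")
print(key.methods)      # f.eff:f
print(key.conditions)   # x:
print(key.weight)       # weight
```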

@jamesrkg
Contributor Author

Examples of frequency-only keys:

############################ Counts
x|f|:|||counts              

   1  2  3  4  5
1  1  7  3  2  7
2  4  3  5  4  6
3  6  2  4  6  3

############################ Column base
x|f|x:|||cbase  

       1   2   3   4   5
cbase  11  12  12  12  16

############################ Column base percentages
x|f|:|y||counts             

    a   b   c   d   e
x   9   58  25  17  44
y   36  25  42  33  38
z   55  17  33  50  19

############################ Row base
x|f|:y|||rbase  

   rbase
1  20
2  22
3  21

############################ Row base percentages
x|f|:|x||counts             

    a   b   c   d   e
x   5   35  15  10  35
y   18  14  23  18  27
z   29  10  19  29  14

############################ Intersection base
x|f|x:y|||base  

       rbase
cbase  63

############################ Intersection base percentage
x|f|:|xy||counts                

    a   b   c   d   e
x   2   11  5   3   11
y   6   5   8   6   10
z   10  3   6   10  5

############################ Unfiltered column base percentages
x|f|:|y@||counts                

    a   b   c   d   e
x   9   58  25  17  44
y   36  25  42  33  38
z   55  17  33  50  19

############################ Unfiltered y base percentages
x|f|:|@y||counts                

    a   b   c   d   e
x   9   58  25  17  44
y   36  25  42  33  38
z   55  17  33  50  19

############################ Unfiltered row base percentages
x|f|:|x@||counts                

    a   b   c   d   e
x   5   35  15  10  35
y   18  14  23  18  27
z   29  10  19  29  14

############################ Unfiltered x base percentages
x|f|:|@x||counts                

    a   b   c   d   e
x   5   35  15  10  35
y   18  14  23  18  27
z   29  10  19  29  14

############################ Unfiltered total N percentages
x|f|:|N||counts             

    a   b   c   d   e
x   1   7   3   2   7
y   4   3   5   4   6
z   6   2   4   6   3

############################ Column logic
x|f|x[{1,2}]:|||clogic

         1   2   3   4   5
clogic   5   10  8   6   13

############################ Column count logic
x|f|x[{1,2}(1)]:|||cclogic

          1   2   3   4   5
cclogic   3   4   2   3   4

############################ Column arithmetic logic
x|f.math:f|x[{1,2}-{3}]:|||calogic

          1    2    3    4    5
calogic   -1   8    4    0    10

############################ Row logic
x|f|:y[{3,4}]|||rlogic  

   rlogic
1  5
2  9
3  10

############################ Row count logic
x|f|:y[{3,4}(1)]|||rclogic

   rclogic
1  3
2  5
3  4

############################ Row arithmetic logic
x|f:f.math|:y[{3,4}-{5}]|||ralogic  

   ralogic
1  -2
2  3
3  7

############################ Intersection logic
x|f|x[{1,2}]:y[{3,4}]|||base    

         rlogic
clogic   14

############################ Block logic rows
x|f|x[{1,2}],x[{2,3}]:|||clogic 

         1   2   3   4   5
clogic1  5   10  8   6   13
clogic2  10  5   9   10  9

############################ Block logic columns
x|f|:y[{3,4}],y[{4,5}]|||rlogic 

   rlogic1 rlogic2
1  5       9
2  9       10
3  10      9

############################ Intersection block logic
x|f|x[{1,2}],x[{2,3}]:y[{3,4}],y[{4,5}]|||base  

         rlogic1 rlogic2
clogic1  14      19
clogic2  19      19

############################ Effective column base
x|f.eff:f|x:||weight|ecbase 

       1   2   3   4   5
ecbase 11  12  12  12  16

############################ Effective row base
x|f:f.eff|:y||weight|erbase

   erbase
1  5
2  9
3  10

############################ Effective intersection base
x|f.eff|x:y||weight|base    

       erbase
ecbase 63

These examples include something we're not planning to support for a while yet:

############################ Unfiltered column base percentages
x|f|:|y@||counts                

############################ Unfiltered y base percentages
x|f|:|@y||counts                

############################ Unfiltered row base percentages
x|f|:|x@||counts                

############################ Unfiltered x base percentages
x|f|:|@x||counts        

############################ Unfiltered total N percentages
x|f|:|N||counts     

In these cases:

  • y@ means column percentages based on the frequency of y-values independent of x, and vice-versa for x@
  • @y means percentages based on the base of y independent of x, and vice-versa for @x
  • N means percentages based on the total sample size (N) of the source data
  • These conventions could potentially support percentages based on any arbitrary variable base by using something like @q5, but that's certainly not required in the foreseeable future!
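As a sanity check on the frequency-only examples earlier in this comment, the bases, percentages and simple logic results can be recomputed from the counts table. This sketch assumes a net is the plain sum of the named category counts (which only holds when each case gives a single answer) and that percentages are rounded to whole numbers. The function names are hypothetical.

```python
# Counts table from the x|f|:|||counts example
# (rows are x codes 1-3, columns are y codes 1-5).
counts = [
    [1, 7, 3, 2, 7],
    [4, 3, 5, 4, 6],
    [6, 2, 4, 6, 3],
]

cbase = [sum(col) for col in zip(*counts)]  # x|f|x:|||cbase
rbase = [sum(row) for row in counts]        # x|f|:y|||rbase
base = sum(rbase)                           # x|f|x:y|||base


def pct(value, total):
    """Whole-number percentage, rounding halves up (inputs are non-negative)."""
    return int(100.0 * value / total + 0.5)


col_pct = [[pct(v, cbase[j]) for j, v in enumerate(row)] for row in counts]
row_pct = [[pct(v, rbase[i]) for v in row] for i, row in enumerate(counts)]
tot_pct = [[pct(v, base) for v in row] for row in counts]


def x_net(codes):
    """Sum the rows named by codes within each column, e.g. x[{1,2}]:"""
    return [sum(counts[c - 1][j] for c in codes) for j in range(len(cbase))]


def y_net(codes):
    """Sum the columns named by codes within each row, e.g. :y[{3,4}]"""
    return [sum(row[c - 1] for c in codes) for row in counts]


def xy_net(x_codes, y_codes):
    """Sum the cells where the named x codes and y codes intersect."""
    return sum(counts[r - 1][c - 1] for r in x_codes for c in y_codes)


print(cbase, rbase, base)        # [11, 12, 12, 12, 16] [20, 22, 21] 63
print(col_pct[0])                # [9, 58, 25, 17, 44]
print(row_pct[1])                # [18, 14, 23, 18, 27]
print(tot_pct[2])                # [10, 3, 6, 10, 5]
print(x_net((1, 2)))             # [5, 10, 8, 6, 13]
print(y_net((3, 4)))             # [5, 9, 10]
print(xy_net((1, 2), (3, 4)))    # 14
```

Under these assumptions the intersection of x[{1,2}] and y[{3,4}] comes out at 14, which agrees with the clogic1/rlogic1 cell of the intersection block logic example.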

@jamesrkg
Contributor Author

Nested notation

x and y are themselves lazy-notations for x0 and y0, which become explicitly required when the axes are nested.

Following is an example of notation describing column logic on the 2nd x-level filtered by those who answered each of the values in the column on the 1st x-level.

Nested notation also requires the presence of >-delimiters to identify each nested level. The use of > will be identical to how it appears in the x or y keys of the link.

As with the absence of x/y in non-nested view keys, an "unattended" > indicates that no special conditions were placed on the preceding level, as in the following example:

x|f|>x1[{1,2}]:|||cnlogic

x0    x1         1   2   3   4   5
1     clogic     2   4   9   8   3
2     clogic     3   2   1   2   4
3     clogic     6   7   3   4   7
4     clogic     3   3   5   1   5

As with the presence of x/y in non-nested view keys, an "attended" > indicates that a full-collapse or partial-conditioning has been applied, as in the following example:

x|f|x0>x1[{1,2}]:|||cnlogic

x0        x1         1   2   3   4   5
cbase     clogic     2   4   9   8   3
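The attended/unattended distinction could be read mechanically along these lines, where an empty string before a > is taken to mean no conditions on that level. `split_levels` is a hypothetical helper and works on one side of the condition part only.

```python
def split_levels(axis_conditions):
    """Split one side of the condition part into per-level conditions.

    None marks an "unattended" level with no conditions placed on it.
    """
    return [level or None for level in axis_conditions.split(">")]


print(split_levels(">x1[{1,2}]"))    # [None, 'x1[{1,2}]']
print(split_levels("x0>x1[{1,2}]"))  # ['x0', 'x1[{1,2}]']
```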

Other than the explicit x0/x1 notation and the addition of >, all the same rules apply, so the row base for this relationship would be:

x|f|>x1[{1,2}]:y|||cnlogic

x0    x1         rbase
1     clogic     26
2     clogic     12
3     clogic     27
4     clogic     17

Nested notation also applies to relative notation. Let's assume the y-axis is also nested and we want percentages based on the 1st y-level rather than the 2nd. In this case the percentages for the first two columns are based on y0=1, and those for the third and fourth columns on y0=2.

x|f|>x1[{1,2}]:|y0||cnlogic

      y0         1     1     2     2
      y1         1     2     1     2
x0    x1         
1     clogic     34    42    74    45
2     clogic     22    75    63    23
3     clogic     58    87    22    36
4     clogic     63    63    15    17

In any case, a name without explicit level notation will always be interpreted as referring to the last level. So if the y-axis had 2 nested levels, relative to y should be interpreted as relative to y1. The same is true for the conditional part of the key notation, where in the above example :y should be interpreted as :y1 (if y was nested).
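A small sketch of this defaulting rule, assuming a hypothetical helper `resolve_level` that is told how many nested levels the axis has:

```python
import re


def resolve_level(name, depth):
    """Expand a bare 'x' or 'y' to its explicit last-level form.

    depth is the number of nested levels on that axis (1 if not nested).
    """
    match = re.match(r"^([xy])(\d*)$", name)
    if match is None:
        raise ValueError("not an axis name: %r" % name)
    axis, level = match.groups()
    if level == "":
        # No explicit level: interpret as the last level.
        level = str(depth - 1)
    elif int(level) >= depth:
        raise ValueError("axis %s has only %d level(s)" % (axis, depth))
    return axis + level


print(resolve_level("y", 2))    # y1
print(resolve_level("y0", 2))   # y0
print(resolve_level("x", 1))    # x0
```

The non-nested case falls out of the same rule: with a single level, the last level is level 0, so x and y resolve to x0 and y0.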

@jamesrkg jamesrkg modified the milestones: RG-12, RG-11, RG-14 Oct 9, 2015
@jamesrkg jamesrkg assigned alextanski and unassigned jamesrkg Oct 20, 2015
@jamesrkg jamesrkg modified the milestones: RG-17, RG-14, RG-20 Nov 6, 2015
@jamesrkg jamesrkg modified the milestones: RG-22, RG-20 Dec 11, 2015
@jamesrkg
Contributor Author

This will be resolved by #290.
