From bb59d2d9eb129ddd0b2c8ea9ee06c1b13be6e8ce Mon Sep 17 00:00:00 2001
From: Symbolics
Date: Thu, 1 Feb 2024 16:40:37 +0800
Subject: [PATCH] Document the new sample function

---
 content/en/docs/Manuals/data-frame.md | 15 ++++----
 content/en/docs/Manuals/select.md     | 55 ++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/content/en/docs/Manuals/data-frame.md b/content/en/docs/Manuals/data-frame.md
index 272d90b..93fc8b5 100644
--- a/content/en/docs/Manuals/data-frame.md
+++ b/content/en/docs/Manuals/data-frame.md
@@ -2248,13 +2248,13 @@ but did not here so you can see the values that were replaced.
 
 ## Sampling
 
-You can take a random sample of the rows of a data-frame with the `random-sample` function:
+You can take a random sample of the rows of a data-frame with the `select:sample` function:
 
 ```lisp
 LS-USER> mtcars
 #
 
-LS-USER> (random-sample mtcars 3)
+LS-USER> (sample mtcars 3 :skip-unselected t)
 #
 
 LS-USER> (print-data *)
@@ -2264,16 +2264,17 @@ LS-USER> (print-data *)
 ;; 2 Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
 ```
 
-You can also take random samples from CL sequences.
+You can also take random samples from CL sequences and arrays, with or without replacement and in various proportions. For further information see [sampling](/docs/manuals/select/#sampling) in the [select system manual](/docs/manuals/select/).
 
 Uses [Vitter's Algorithm
 D](http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf) to
 efficiently select the rows. Sometimes you may want to use the
-algorithm at a lower level. You don’t want the sample itself; you only
-want the indices. In this case, you can directly use `map-random-below`,
-which simply calls a provided function on each index.
+algorithm at a lower level. If you don’t want the sample itself, say you
+only want the indices, you can directly use `map-random-below`, which
+simply calls a provided function on each index.
 
-This is a port to standard common lisp of ruricolist's
+This is an enhancement and port to standard Common Lisp of
+ruricolist's
 [random-sample](https://github.com/ruricolist/random-sample/tree/master).
 It also removes the dependency on Trivia, which has a restrictive
 license (LLGPL).
diff --git a/content/en/docs/Manuals/select.md b/content/en/docs/Manuals/select.md
index 3056526..597cdaf 100644
--- a/content/en/docs/Manuals/select.md
+++ b/content/en/docs/Manuals/select.md
@@ -23,7 +23,7 @@ Select provides:
 3. A set of utility functions for traversing selections in array-like
    objects.
 
-It combines the functionality of dplyr's _slice_ and _select_ methods.
+It combines the functionality of dplyr's _slice_, _select_ and _sample_ methods.
 
 ## Basic Usage {#Using}
 
@@ -164,6 +164,59 @@ as well:
 ; 75 78 79 80 81 82 84 86 88 91 93 98 100 103 107 108 109 112 113 116 117 120)
 ```
 
+## Sampling
+You may sample sequences, arrays and data frames with the `sample` generic function, and extend it for your own objects. The function signature is:
+
+```lisp
+(defgeneric sample (data n &key
+                    with-replacement
+                    skip-unselected))
+```
+In Common Lisp, `&key` arguments that are not provided default to `nil`, so you need to turn them _on_ explicitly if you want them.
+
+`:skip-unselected t` means do _not_ return the elements of the object that were left out of the sample. This is off by default because a common use case is splitting a data set into training and test groups: the unselected elements are returned as a second value, which Common Lisp ignores unless you ask for it. The `let-plus` package, imported by default in `select`, makes it easy to destructure the result into training and test sets. This example is from the tests included with select:
+
+```lisp
+(let+ ((*random-state* state)
+       ((&values train test) (sample arr35 2))
+  ...
+```
+
+Note the binding of `*random-state*`. Use this pattern of binding `*random-state*` to a saved random state any time you need reproducible results (for example, in tests).
+
+The size of the sample is determined by the value of `n`, which must be between 0 and the number of rows (for an `array`) or the length (for a `sequence`). If `(< n 1)`, then `n` is the _proportion_ of the data to sample, e.g. 2/3 (values of `n` less than one may be `rational` or `float`). For example, let's take a training sample of 2/3 of the rows in the `mtcars` dataset:
+
+```lisp
+LS-USER> (sample mtcars 2/3)
+
+#
+#
+
+LS-USER> (dims mtcars)
+(32 12)
+```
+You can see that `mtcars` has 32 rows, and has been divided into 2/3 and 1/3 for training and test.
+
+You can also take samples of sequences (lists and vectors), for example using the `DATA` variable defined above:
+
+```lisp
+LS-USER> (length data)
+121
+LS-USER> (sample data 10 :skip-unselected t)
+#(43 117 42 29 41 105 116 27 133 58)
+LS-USER> (sample data 1/10 :skip-unselected t)
+#(119 116 7 53 27 114 31 23 121 109 42 125)
+```
+
+`list` objects can also be sampled:
+```lisp
+(sample '(a b c d e f g) 0.5)
+(A E G B)
+(F D C)
+```
+Note that when a proportion is requested and the number of elements is odd, `n` is rounded up: here 0.5 of 7 elements gives a sample of 4.
+
+
 ## Extensions {#extensions}
 
 The previous section describes the core functionality. The semantics