Document the new sample function
Symbolics committed Feb 1, 2024
1 parent 7b917e1 commit bb59d2d
Showing 2 changed files with 62 additions and 8 deletions.
15 changes: 8 additions & 7 deletions content/en/docs/Manuals/data-frame.md
@@ -2248,13 +2248,13 @@ but did not here so you can see the values that were replaced.

## Sampling

You can take a random sample of the rows of a data-frame with the `random-sample` function:
You can take a random sample of the rows of a data-frame with the `select:sample` function:

```lisp
LS-USER> mtcars
#<DATA-FRAME (32 observations of 12 variables)
Motor Trend Car Road Tests>
LS-USER> (random-sample mtcars 3)
LS-USER> (sample mtcars 3 :skip-unselected t)
#<DATA-FRAME (3 observations of 12 variables)>
LS-USER> (print-data *)
@@ -2264,16 +2264,17 @@ LS-USER> (print-data *)
;; 2 Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
```

You can also take random samples from CL sequences.
You can also take random samples from CL sequences and arrays, with or without replacement and in various proportions. For further information see [sampling](/docs/manuals/select/#sampling) in the [select system manual](/docs/manuals/select/).

Uses [Vitter's Algorithm
D](http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf) to
efficiently select the rows. Sometimes you may want to use the
algorithm at a lower level. You don’t want the sample itself; you only
want the indices. In this case, you can directly use `map-random-below`,
which simply calls a provided function on each index.
algorithm at a lower level. If you don’t want the sample itself, say you
only want the indices, you can directly use `map-random-below`, which
simply calls a provided function on each index.
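
To illustrate the "indices only" idea, here is a minimal sketch in standard Common Lisp. It is not Vitter's Algorithm D and it is not the library's `map-random-below`; it uses the simpler selection-sampling technique (Knuth's Algorithm S), which visits every index rather than skipping ahead. The name `map-sample-indices` and its argument order are hypothetical, chosen only for this example.

```lisp
;; Hypothetical helper, not part of the library.
(defun map-sample-indices (fn n len &optional (state *random-state*))
  "Call FN on N distinct indices below LEN, in ascending order."
  (assert (<= 0 n len))
  (let ((needed n))
    (dotimes (i len)
      ;; Stop early once the sample is complete.
      (when (zerop needed) (return))
      ;; Select index I with probability needed / remaining.
      (when (< (random (- len i) state) needed)
        (funcall fn i)
        (decf needed)))))

;; Usage: collect 3 distinct indices below 10 without building the sample.
(let ((indices '()))
  (map-sample-indices (lambda (i) (push i indices)) 3 10)
  (nreverse indices))
```

Each index is chosen with probability equal to the number of items still needed divided by the number of indices remaining, so exactly `n` indices are always produced.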

This is a port to standard common lisp of ruricolist's
This is an enhancement and port to standard common lisp of
ruricolist's
[random-sample](https://github.com/ruricolist/random-sample/tree/master).
It also removes the dependency on Trivia, which has a restrictive
license (LLGPL).
55 changes: 54 additions & 1 deletion content/en/docs/Manuals/select.md
@@ -23,7 +23,7 @@ Select provides:
3. A set of utility functions for traversing selections in
array-like objects.

It combines the functionality of dplyr's _slice_ and _select_ methods.
It combines the functionality of dplyr's _slice_, _select_ and _sample_ methods.

## Basic Usage {#Using}

@@ -164,6 +164,59 @@ as well:
; 75 78 79 80 81 82 84 86 88 91 93 98 100 103 107 108 109 112 113 116 117 120)
```

## Sampling
You may sample sequences, arrays and data frames with the `sample` generic function, and extend it for your own objects. The function signature is:

```lisp
(defgeneric sample (data n &key
                             with-replacement
                             skip-unselected))
```
By default in Common Lisp, keyword arguments that are not provided are `nil`, so you need to turn them _on_ if you want them.
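
As a small illustration of that default, the following sketch uses a hypothetical stand-in function, `show-keys`, which only echoes its keyword arguments. It is not `sample` itself.

```lisp
;; Hypothetical function that reuses the two keyword names from SAMPLE.
(defun show-keys (&key with-replacement skip-unselected)
  "Return the keyword arguments as a list so the defaults are visible."
  (list with-replacement skip-unselected))

(show-keys)                    ; => (NIL NIL)
(show-keys :skip-unselected t) ; => (NIL T)
```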

`:skip-unselected t` means to _not_ return the elements of the object that were not part of the sample. This is turned off by default, so `sample` returns two values: the sample and the unselected remainder. A common use case is splitting a data set into training and test groups, and callers that only want the sample lose nothing because the second value is ignored by default in Common Lisp. The `let-plus` package, imported by default in `select`, makes it easy to destructure the two values into training and test sets. This example is from the tests included with select:

```lisp
(let+ ((*random-state* state)
((&values train test) (sample arr35 2))
...
```
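
The "second value is ignored" behaviour is plain Common Lisp and can be seen without the library. The sketch below uses a hypothetical `split-at` function that returns two values, and the standard `multiple-value-bind` in place of `let+`.

```lisp
;; Hypothetical helper that returns two values, like a train/test split.
(defun split-at (list k)
  "Return the first K elements of LIST and the remainder as two values."
  (values (subseq list 0 k) (nthcdr k list)))

;; Used as an ordinary argument, only the primary value is kept.
(list (split-at '(a b c d e) 2))        ; => ((A B))

;; MULTIPLE-VALUE-BIND captures both values.
(multiple-value-bind (train test) (split-at '(a b c d e) 2)
  (list train test))                    ; => ((A B) (C D E))
```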

Note the setting of `*random-state*`. You should use this pattern of setting `*random-state*` to a saved seed anytime you need reproducible results (like in a testing scenario).
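
A minimal sketch of that pattern using only standard Common Lisp follows. The names `draw-three` and `*saved-state*` are hypothetical; the point is that binding `*random-state*` to a copy of a saved state replays the same pseudo-random sequence.

```lisp
;; Hypothetical helper: draws with *RANDOM-STATE* bound to a copy of STATE,
;; so STATE itself is never advanced.
(defun draw-three (state)
  "Return three pseudo-random integers below 100, starting from STATE."
  (let ((*random-state* (make-random-state state)))
    (list (random 100) (random 100) (random 100))))

;; Save a freshly seeded state once, then reuse it.
(defparameter *saved-state* (make-random-state t))

(equal (draw-three *saved-state*)
       (draw-three *saved-state*))  ; => T
```

Copying the state with `make-random-state` before each run keeps the saved state unchanged, so every run starts from the same point.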

The size of the sample is determined by the value of `n`, which must be between 0 and the number of rows (for an `array`) or the length (for a `sequence`). If `(< n 1)`, then `n` is the _proportion_ of the data to sample, e.g. 2/3; values of `n` less than one may be `rational` or `float`. For example, let's take a training sample of 2/3 of the rows in the `mtcars` dataset:

```lisp
LS-USER> (sample mtcars 2/3)
#<DATA-FRAME (21 observations of 12 variables)>
#<DATA-FRAME (11 observations of 12 variables)>
LS-USER> (dims mtcars)
(32 12)
```
You can see that `mtcars` has 32 rows and has been divided into roughly 2/3 (21 rows) and 1/3 (11 rows) for training / test.

You can also take samples of sequences (lists and vectors), for example using the `DATA` variable defined above:

```lisp
LS-USER> (length data)
121
LS-USER> (sample data 10 :skip-unselected t)
#(43 117 42 29 41 105 116 27 133 58)
LS-USER> (sample data 1/10 :skip-unselected t)
#(119 116 7 53 27 114 31 23 121 109 42 125)
```

`list` objects can also be sampled:
```lisp
(sample '(a b c d e f g) 0.5)
(A E G B)
(F D C)
```
Note that `n` is rounded up when the number of elements is odd and a proportional number is requested.
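
The way a proportion turns into a count can be sketched as follows. This is an assumption, not the library's code: `sample-size` is a hypothetical helper, and it uses the standard `round`, which happens to reproduce the sizes in the examples above (21 of 32, 12 of 121, 4 of 7). Note that `round` sends exact ties to the nearest even integer, so the library may differ on other tie cases.

```lisp
;; Hypothetical helper, not part of the library.
(defun sample-size (n total)
  "Return how many of TOTAL items to draw.
N below 1 is read as a proportion, otherwise as an absolute count.
Assumption: proportions are converted with ROUND."
  (assert (<= 0 n total))
  (if (< n 1)
      (values (round (* n total)))  ; keep only the rounded quotient
      n))

(sample-size 2/3 32)   ; => 21
(sample-size 1/10 121) ; => 12
(sample-size 0.5 7)    ; => 4
(sample-size 10 121)   ; => 10
```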


## Extensions {#extensions}

The previous section describes the core functionality. The semantics
