Commit

Merge branch 'master' into LiorKogan-patch-2
LiorKogan committed Nov 29, 2023
2 parents b2bf219 + 73447c3 commit 4d75a93
Showing 6 changed files with 140 additions and 170 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/mariner2.yml
@@ -45,7 +45,9 @@ jobs:
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: "us-east-1"
- name: Upload artifacts to s3 - staging
run: make upload-artifacts SHOW=1 STAGING=1 VERBOSE=1
run: |
make upload-artifacts SHOW=1 VERBOSE=1
make upload-release SHOW=1 STAGING=1 VERBOSE=1
- name: Upload artifacts to s3 - release # todo: trigger this manually instead
if: ${{ github.ref != 'refs/heads/master' }}
run: make upload-artifacts SHOW=1 VERBOSE=1
run: make upload-release SHOW=1 VERBOSE=1
60 changes: 17 additions & 43 deletions docs/docs/bloom-filter.md
@@ -29,14 +29,14 @@ Use one Bloom filter per user, checked for every transaction. Provide an extreme
Using Redis Stack's Bloom filter for this type of application provides these benefits:

- Fast transaction completion
- Decreased possibility for transaction to brake in case of network partitions (connection needs to be kept open for a shorter time)
- Decreased possibility for transaction to break in case of network partitions (connection needs to be kept open for a shorter time)
- Extra layer of security for both credit card owners and retailers

Other questions a Bloom filter can help answer in the finance industry are:

- Has the user ever made purchases in this category of products/services?
- Do I need to skip some security steps when the user is buying with a vetted online shop (big retailers like Amazon, Apple app store...)?
- Has this credit card been reported as lost/stolen? An additional benefit of using a bloom filter in the last case is that financial organizations can exchange their lists of stolen/blocked credit card numbers without revealing the numbers themselves.
- Has this credit card been reported as lost/stolen? An additional benefit of using a Bloom filter in the last case is that financial organizations can exchange their lists of stolen/blocked credit card numbers without revealing the numbers themselves.

**Ad placement (retail, advertising)**

@@ -71,58 +71,32 @@ Using Redis Stack's Bloom filter for this type of application provides these ben
- Very fast and efficient way to do a common operation
- No need to invest in expensive infrastructure

## Examples:
## Example

* Adding new items to the filter
Consider a bike manufacturer that makes a million different kinds of bikes, and suppose you'd like to avoid using a duplicate model name for new models. A Bloom filter can be used to detect duplicates. In the example that follows, you'll create a filter with space for a million entries and a 0.1% error rate. Add one model name and check if it exists. Then add multiple model names and check if they exist.

> A new filter is created for you if it does not yet exist

```
> BF.ADD newFilter foo
{{< clients-example bf_tutorial bloom >}}
> BF.RESERVE bikes:models 0.001 1000000
OK
> BF.ADD bikes:models "Smoky Mountain Striker"
(integer) 1
```

* Checking if an item exists in the filter

```
> BF.EXISTS newFilter foo
> BF.EXISTS bikes:models "Smoky Mountain Striker"
(integer) 1
```

```
> BF.EXISTS newFilter notpresent
(integer) 0
```

* Adding and checking multiple items

```
> BF.MADD myFilter foo bar baz
> BF.MADD bikes:models "Rocky Mountain Racer" "Cloudy City Cruiser" "Windy City Wippet"
1) (integer) 1
2) (integer) 1
3) (integer) 1
```

```
> BF.MEXISTS myFilter foo nonexist bar
> BF.MEXISTS bikes:models "Rocky Mountain Racer" "Cloudy City Cruiser" "Windy City Wippet"
1) (integer) 1
2) (integer) 0
2) (integer) 1
3) (integer) 1
```

* Creating a new filter with custom properties
{{< /clients-example >}}

```
> BF.RESERVE customFilter 0.0001 600000
OK
```

```
> BF.MADD customFilter foo bar baz
```
Note: even with just a few items, there is always a chance of a false positive, meaning an item could "exist" even though it has not been explicitly added to the Bloom filter. For a more in-depth understanding of the probabilistic nature of a Bloom filter, check out the blog posts linked at the bottom of this page.

## Sizing Bloom filters
With Redis Stack's bloom filters most of the sizing work is done for you:
## Reserving Bloom filters
With Redis Stack's Bloom filters, most of the sizing work is done for you:

```
BF.RESERVE {key} {error_rate} {capacity} [EXPANSION expansion] [NONSCALING]
@@ -135,7 +109,7 @@ The rate is a decimal value between 0 and 1. For example, for a desired false po
This is the total number of elements you expect to have in your filter. It is trivial to determine for a static set, but it becomes more challenging when your set grows over time. It's important to get the number right: if you **oversize**, you'll end up wasting memory; if you **undersize**, the filter will fill up and a new one will have to be stacked on top of it (sub-filter stacking). When a filter consists of multiple sub-filters stacked on top of each other, latency for adds stays the same, but the latency for presence checks increases. The reason is the way checks work: a regular check is first performed on the top (latest) sub-filter, and if a negative answer is returned, the next one is checked, and so on. That's where the added latency comes from.
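
If you want to check whether a filter has started stacking sub-filters, you can inspect it with `BF.INFO`. A minimal sketch using the `bikes:models` filter reserved in the example above; the size and counts shown are illustrative:

```
> BF.INFO bikes:models
 1) Capacity
 2) (integer) 1000000
 3) Size
 4) (integer) 1797760
 5) Number of filters
 6) (integer) 1
 7) Number of items inserted
 8) (integer) 4
 9) Expansion rate
10) (integer) 2
```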

#### 3. Scaling (`EXPANSION`)
Adding an element to a Bloom filter never fails due to the data structure "filling up". Instead the error rate starts to grow. To keep the error close to the one set on filter initialisation - the bloom filter will auto-scale, meaning when capacity is reached an additional sub-filter will be created.
Adding an element to a Bloom filter never fails due to the data structure "filling up". Instead, the error rate starts to grow. To keep the error close to the one set at filter initialization, the Bloom filter will auto-scale, meaning that when capacity is reached, an additional sub-filter is created.
The size of the new sub-filter is the size of the last sub-filter multiplied by `EXPANSION`. If the number of elements to be stored in the filter is unknown, we recommend that you use an expansion of 2 or more to reduce the number of sub-filters. Otherwise, we recommend that you use an expansion of 1 to reduce memory consumption. The default expansion value is 2.
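
As a rough sketch, here is how you might reserve a filter for a set whose size is known up front, with a 0.1% error rate, room for a million items, and an expansion of 1 to save memory (the key name `bikes:models:2023` is only an illustration):

```
> BF.RESERVE bikes:models:2023 0.001 1000000 EXPANSION 1
OK
```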

The filter keeps adding more hash functions for every new sub-filter in order to maintain your desired error rate.
30 changes: 24 additions & 6 deletions docs/docs/count-min-sketch.md
@@ -24,11 +24,29 @@ This application answers this question: What was the sales volume (on a certain
Use one Count-Min sketch per day (period). Every product sale goes into the CMS. The CMS gives reasonably accurate results for the products that contribute the most toward the sales. Products with a low percentage of the total sales are ignored.

## Examples
Let's say we choose an error of 0.1%(`0.001`) with certainty of 99.8%(`0.998`) (thus probability of error 0.02% (`0.002`)). The resulting sketch will try to keep the error within 0.1% of the sum of counts of **ALL** elements that have been added to the sketch and the probability for this error to be higher than that (a collision of an element below the threshold with an element above the threshold) will be 0.02%.

```
> CMS.INITBYPROB key 0.001 0.002
```
Assume you select an error rate of 0.1% (0.001) with a certainty of 99.8% (0.998). This means you have an error probability of 0.2% (0.002). Your sketch strives to keep the error within 0.1% of the total count of all elements you've added. There's a 0.2% chance the error might exceed this, for example when an element below the threshold collides with one above it. When you add just a few items to the CMS and evaluate their frequencies, remember that in such a small sample collisions are rare, as with other probabilistic data structures.

{{< clients-example cms_tutorial cms >}}
> CMS.INITBYPROB bikes:profit 0.001 0.002
OK
> CMS.INCRBY bikes:profit "Smokey Mountain Striker" 100
(integer) 100
> CMS.INCRBY bikes:profit "Rocky Mountain Racer" 200 "Cloudy City Cruiser" 150
1) (integer) 200
2) (integer) 150
> CMS.QUERY bikes:profit "Smokey Mountain Striker" "Rocky Mountain Racer" "Cloudy City Cruiser" "Terrible Bike Name"
1) (integer) 100
2) (integer) 200
3) (integer) 150
4) (integer) 0
> CMS.INFO bikes:profit
1) width
2) (integer) 2000
3) depth
4) (integer) 9
5) count
6) (integer) 450
{{< /clients-example >}}

##### Example 1:
If we had a uniform distribution of 1000 elements, where each has a count of around 500, the threshold would be 500:
@@ -93,4 +111,4 @@ Adding, updating and querying for elements in a CMS has a time complexity O(1).
- [An Improved Data Stream Summary: The Count-Min Sketch and its Applications](http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf)

## References
- [Count-Min Sketch: The Art and Science of Estimating Stuff](https://redis.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/)
- [Count-Min Sketch: The Art and Science of Estimating Stuff](https://redis.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/)
41 changes: 9 additions & 32 deletions docs/docs/cuckoo-filter.md
@@ -39,43 +39,20 @@ Note> In addition to these two cases, Cuckoo filters serve very well all the Blo

## Examples

* Cuckoo: Adding new items to a filter
> You'll learn how to create an empty cuckoo filter with an initial capacity of 1,000,000 items, add items, check their existence, and remove them. Even though the `CF.ADD` command can create a new filter if one isn't present, it might not be optimally sized for your needs. It's better to use the `CF.RESERVE` command to set up a filter with your preferred capacity.

> Create an empty cuckoo filter with an initial capacity (of 1000 items)
```
> CF.RESERVE newCuckooFilter 1000
{{< clients-example cuckoo_tutorial cuckoo >}}
> CF.RESERVE bikes:models 1000000
OK
> CF.ADD bikes:models "Smoky Mountain Striker"
(integer) 1
```

> A new filter is created for you if it does not yet exist
```
> CF.ADD newCuckooFilter foo
> CF.EXISTS bikes:models "Smoky Mountain Striker"
(integer) 1
```

You can add the item multiple times. The filter will attempt to count it.
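
To see how many times an item may have been added, you can use `CF.COUNT`. A minimal sketch, assuming the same model was added a second time; the reply is an estimate rather than an exact count:

```
> CF.ADD bikes:models "Smoky Mountain Striker"
(integer) 1
> CF.COUNT bikes:models "Smoky Mountain Striker"
(integer) 2
```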

* Cuckoo: Checking whether item exists

```
> CF.EXISTS newCuckooFilter foo
(integer) 1
```

```
> CF.EXISTS newCuckooFilter notpresent
> CF.EXISTS bikes:models "Terrible Bike Name"
(integer) 0
```

* Cuckoo: Deleting item from filter

```
> CF.DEL newCuckooFilter foo
> CF.DEL bikes:models "Smoky Mountain Striker"
(integer) 1
```
{{< /clients-example >}}

## Bloom vs. Cuckoo filters
Bloom filters typically exhibit better performance and scalability when inserting
91 changes: 45 additions & 46 deletions docs/docs/t-digest.md
@@ -62,62 +62,56 @@ You measure the IP packets transferred over your network each second and try to

## Examples

#### Creating a t-digest
In the following example, you'll create a t-digest with a compression of 100 and add items to it. The `COMPRESSION` argument is used to specify the tradeoff between accuracy and memory consumption. The default value is 100. Higher values mean more accuracy. Note: unlike some of the other probabilistic data structures, the `TDIGEST.ADD` command will not create a new structure if the key does not exist.

```
> TDIGEST.CREATE my-tdigest COMPRESSION 100
```
{{< clients-example tdigest_tutorial tdig_start >}}
> TDIGEST.CREATE bikes:sales COMPRESSION 100
OK
> TDIGEST.ADD bikes:sales 21
OK
> TDIGEST.ADD bikes:sales 150 95 75 34
OK
{{< /clients-example >}}

The `COMPRESSION` argument is used to specify the tradeoff between accuracy and memory consumption. The default is 100. Higher values mean more accuracy.

#### Adding a single element to the t-digest:
```
> TDIGEST.ADD my-tdigest 20.9
```

#### Adding multiple elements to the t-digest:
```
> TDIGEST.ADD my-tdigest 308 315.9
```

You can call [TDIGEST.ADD](https://redis.io/commands/tdigest.add/) repeatedly whenever new observations are available.

#### Estimating fractions or ranks by values

Another helpful feature in t-digest is CDF (the cumulative distribution function), which gives us the fraction of observations smaller than or equal to a certain value. This command is very useful for answering questions like "*What's the percentage of observations with a value lower than or equal to X?*".

>More precisely, `TDIGEST.CDF` will return the estimated fraction of observations in the sketch that are smaller than X plus half the number of observations that are equal to X
Let's illustrate this with an example: if we have a set of observations of people's age with gaussian distribution, we can ask a question like "What's the percentage of people younger than 50 years?"
>More precisely, `TDIGEST.CDF` will return the estimated fraction of observations in the sketch that are smaller than X plus half the number of observations that are equal to X. We can also use the `TDIGEST.RANK` command, which is very similar. Instead of returning a fraction, it returns the **estimated** rank of a value. The `TDIGEST.RANK` command is also variadic, meaning you can use a single command to retrieve estimations for one or more values.
```
> TDIGEST.ADD my-tdigest 45.88 44.2 58.03 19.76 39.84 69.28 50.97 25.41 19.27 85.71 42.63
Here's an example. Given a set of bike racers' ages, you can ask a question like "What's the percentage of bike racers who are younger than 50 years?"

> TDIGEST.CDF my-tdigest 50
```
{{< clients-example tdigest_tutorial tdig_cdf >}}
> TDIGEST.CREATE racer_ages
OK
> TDIGEST.ADD racer_ages 45.88 44.2 58.03 19.76 39.84 69.28 50.97 25.41 19.27 85.71 42.63
OK
> TDIGEST.CDF racer_ages 50
1) "0.63636363636363635"
> TDIGEST.RANK racer_ages 50
1) (integer) 7
> TDIGEST.RANK racer_ages 50 40
1) (integer) 7
2) (integer) 4
{{< /clients-example >}}

The `TDIGEST.RANK` command is very similar to `TDIGEST.CDF` but instead of returning a fraction, it returns the **number** of observations in the sketch that are smaller than X plus half the number of observations that are equal to X, or in other words - the estimated rank of a value.

```
> TDIGEST.RANK my-tdigest 50
```

And lastly, `TDIGEST.REVRANK key value...` is similar to [TDIGEST.RANK](https://redis.io/commands/tdigest.rank/), but for each input value it returns an estimation of the number of observations larger than the given value plus half the observations equal to it.
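
As a rough sketch using the `racer_ages` data added above (output shown for illustration; with this small, exact sample the estimate matches the true reverse rank):

```
> TDIGEST.REVRANK racer_ages 50
1) (integer) 4
```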


#### Estimating values by fractions or ranks

`TDIGEST.QUANTILE key fraction...` returns, for each input fraction, an estimation of the value (floating point) that is smaller than the given fraction of observations.

```
> TDIGEST.QUANTILE my-tdigest 0.5
```

`TDIGEST.BYRANK key rank...` returns, for each input rank, an estimation of the value (floating point) with that rank.
`TDIGEST.QUANTILE key fraction...` returns, for each input fraction, an estimation of the value (floating point) that is smaller than the given fraction of observations. `TDIGEST.BYRANK key rank...` returns, for each input rank, an estimation of the value (floating point) with that rank.

```
> TDIGEST.BYRANK my-tdigest 4
```
{{< clients-example tdigest_tutorial tdig_quant >}}
> TDIGEST.QUANTILE racer_ages .5
1) "44.200000000000003"
> TDIGEST.BYRANK racer_ages 4
1) "42.630000000000003"
{{< /clients-example >}}

`TDIGEST.BYREVRANK key rank...` returns, for each input **reverse rank**, an estimation of the **value** (floating point) with that reverse rank.
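
A minimal sketch with the same `racer_ages` data; reverse rank 0 is the largest observation, so reverse rank 1 is the second largest (the printed floating-point precision may differ slightly):

```
> TDIGEST.BYREVRANK racer_ages 1
1) "69.28"
```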

@@ -141,23 +135,28 @@ If `destKey` is an existing sketch, its values are merged with the values of the

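As an illustration only (the `racer_ages_2024` and `all_racer_ages` key names are hypothetical), merging a second sketch into a combined destination sketch could look like this:

```
> TDIGEST.CREATE racer_ages_2024
OK
> TDIGEST.ADD racer_ages_2024 27.5 33.1
OK
> TDIGEST.MERGE all_racer_ages 2 racer_ages racer_ages_2024
OK
```
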
Use `TDIGEST.MIN` and `TDIGEST.MAX` to retrieve the minimal and maximal values in the sketch, respectively.

```
> TDIGEST.MIN my-tdigest
> TDIGEST.MAX my-tdigest
```
{{< clients-example tdigest_tutorial tdig_min >}}
> TDIGEST.MIN racer_ages
"19.27"
> TDIGEST.MAX racer_ages
"85.709999999999994"
{{< /clients-example >}}

Both return nan when the sketch is empty.
Both return `nan` when the sketch is empty.

Both commands return accurate results and are equivalent to `TDIGEST.BYRANK my-tdigest 0` and `TDIGEST.BYREVRANK my-tdigest 0` respectively.
Both commands return accurate results and are equivalent to `TDIGEST.BYRANK racer_ages 0` and `TDIGEST.BYREVRANK racer_ages 0`, respectively.

Use `TDIGEST.INFO my-tdigest` to retrieve some additional information about the sketch.
Use `TDIGEST.INFO racer_ages` to retrieve some additional information about the sketch.

#### Resetting a sketch

`TDIGEST.RESET my-tdigest`
{{< clients-example tdigest_tutorial tdig_reset >}}
> TDIGEST.RESET racer_ages
OK
{{< /clients-example >}}

## Academic sources
- [The _t_-digest: Efficient estimates of distributions](https://www.sciencedirect.com/science/article/pii/S2665963820300403)

## References
- [t-digest: A New Probabilistic Data Structure in Redis Stack](https://redis.com/blog/t-digest-in-redis-stack/)
- [t-digest: A New Probabilistic Data Structure in Redis Stack](https://redis.com/blog/t-digest-in-redis-stack/)