# Redis

*Data Structures and Information Retrieval in Python*

Copyright 2021 Allen Downey

License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)

In [1]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
# download('https://github.com/AllenDowney/DSIRP/raw/main/utils.py')

[Click here to run this chapter on Colab](https://colab.research.google.com/github/AllenDowney/DSIRP/blob/main/chapters/chap01.ipynb)

## Persistence

Data stored only in the memory of a running program is called "volatile", because it disappears when the program ends.

Data that still exists after the program that created it ends is called
"persistent". In general, files stored in a file system are persistent,
as well as data stored in databases.

A simple way to make data persistent is to store it in a file. For example, before the program ends, it could translate its data structures into a format like [JSON](https://en.wikipedia.org/wiki/JSON) and then write them into a file.
When it starts again, it could read the file and rebuild the data
structures.
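As a minimal sketch of this idea, the following functions write a data structure to a JSON file and read it back. The names `save_state` and `load_state`, the example dictionary, and the file name are made up for illustration.

```python
import json
import os
import tempfile

def save_state(data, filename):
    """Write a JSON-serializable object to a file."""
    with open(filename, 'w') as f:
        json.dump(data, f)

def load_state(filename):
    """Read the file back and rebuild the Python object."""
    with open(filename) as f:
        return json.load(f)

# Round trip through a file in the system's temporary directory.
index = {'python': ['page1', 'page2'], 'redis': ['page2']}
path = os.path.join(tempfile.gettempdir(), 'index_sketch.json')
save_state(index, path)
restored = load_state(path)
```

Note that each call to `save_state` rewrites the whole file, which is the source of the problems listed next.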

But there are several problems with this solution:

1.  Reading and writing large data structures (like a Web index) would
    be slow.

2.  The entire data structure might not fit into the memory of a single
    running program.

3.  If a program ends unexpectedly (for example, due to a power outage),
    any changes made since the program last started would be lost.

A better alternative is a database that provides persistent storage and
the ability to read and write parts of the database without reading and
writing the whole thing.

There are many kinds of [database management systems](https://en.wikipedia.org/wiki/Database) (DBMS) that provide
these capabilities.

The database we'll use is Redis, which organizes data in structures that are similar to Python data structures.
Among others, it provides lists, hashes (similar to Python dictionaries), and sets.

Redis is a "key-value database", which means that it represents a mapping from keys to values.
In Redis, the keys are strings and the values can be one of several types.

## Redis clients and servers

Redis is usually run as a remote service; in fact, the name stands for
"REmote DIctionary Server". To use Redis, you have to run the Redis
server somewhere and then connect to it using a Redis client. 

To get started, we'll run the Redis server on the same machine where we run the Jupyter server.
This will let us get started quickly, but if we are running Jupyter on Colab, the database lives in a Colab runtime environment, which disappears when we shut down the notebook.
So it's not really persistent.

Later we will use [RedisToGo](http://thinkdast.com/redistogo), which runs Redis in the cloud.
Databases on RedisToGo are persistent.

The following command starts the Redis server and, with the `daemonize` option, runs it in the background so the Jupyter server can resume.

In [133]:
!redis-server --daemonize yes

237877:C 24 Oct 2021 11:28:53.646 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
237877:C 24 Oct 2021 11:28:53.646 # Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=237877, just started
237877:C 24 Oct 2021 11:28:53.646 # Configuration loaded


## redis-py

To talk to the Redis server, we'll use [redis-py](https://redis-py.readthedocs.io/en/stable/index.html).
Here's how we use it to connect to the Redis server.

In [134]:
import redis

r = redis.Redis()

The `set` method adds a key-value pair to the database.
In the following example, the key and value are both strings.

In [135]:
r.set('key', 'value')

True

The `get` method looks up a key and returns the corresponding value.

In [136]:
r.get('key')

b'value'

The result is not actually a string; it is a [bytes object](https://stackoverflow.com/questions/6224052/what-is-the-difference-between-a-string-and-a-byte-string), also called a bytestring.

For many purposes, a bytestring behaves like a string, so for now we will treat it like a string and deal with differences as they arise.
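One difference that comes up quickly is comparison: a `bytes` object never compares equal to a `str`. The following sketch starts from the literal value shown above and converts it with `decode`; the choice of UTF-8 is an assumption about how the value was encoded.

```python
raw = b'value'            # the same value r.get('key') returned above

# bytes and str are different types, so they never compare equal.
same = (raw == 'value')   # False

# decode converts bytes to str; UTF-8 is assumed here.
text = raw.decode('utf-8')
```

If you would rather get strings back directly, `redis.Redis` also accepts a `decode_responses=True` argument, but the examples in this chapter use the default behavior.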

The values can be integers or floating-point numbers.

In [142]:
r.set('x', 5)

True

And Redis provides some functions that understand numbers, like `incr`.

In [143]:
r.incr('x')

6

But if you `get` a numeric value, the result is a bytestring.

In [144]:
value = r.get('x')
value

b'6'

If you want to do math with it, you have to convert it back to a number.

In [145]:
int(value)

6

If you want to set more than one value at a time, you can pass a dictionary to `mset`.

In [7]:
d = dict(x=5, y='string', z=1.23)
r.mset(d)

True

In [9]:
r.get('y')

b'string'

In [10]:
r.get('z')

b'1.23'

If you try to store any other type in a Redis database, you get an error.

In [16]:
from redis import DataError

t = [1, 2, 3]

try:
    r.set('t', t)
except DataError as e:
    print(e)

Invalid input of type: 'list'. Convert to a bytes, string, int or float first.


We could use the `repr` function to create a string representation of a list, but that representation is Python-specific.
It would be better to make a database that can work with any language.
To do that, we can use JSON to create a string representation.

The `json` module provides a function, `dumps`, that creates a language-independent representation of most Python objects.

In [146]:
import json

t = [1, 2, 3]
s = json.dumps(t)
s

'[1, 2, 3]'

When we read one of these strings back, we can use `loads` to convert it back to a Python object.

In [148]:
t = json.loads(s)
t

[1, 2, 3]

## Redis Data Types

JSON can represent most Python objects, so we could use it to store arbitrary data structures in Redis. But in that case Redis only knows that they are strings; it can't work with them as data structures. For example, if we store a data structure in JSON, the only way to modify it would be to:

1. Get the entire structure, which might be large,

2. Load it back into a Python structure,

3. Modify the Python structure,

4. Dump it back into a JSON string, and

5. Replace the old value in the database with the new value.

That's not very efficient. A better alternative is to use the data types Redis provides, which you can read about in the
[Redis Data Types Intro](https://redis.io/topics/data-types-intro).
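For comparison, here is a sketch of the five-step cycle described above, written as a function that appends one item to a JSON-encoded list. The name `append_via_json` is made up; `client` is assumed to be a connected `redis.Redis` object, or anything else that provides `get` and `set`.

```python
import json

def append_via_json(client, key, item):
    """Append item to a JSON-encoded list stored as a Redis string."""
    raw = client.get(key)                            # 1. get the entire structure
    t = json.loads(raw) if raw is not None else []   # 2. load it into a Python list
    t.append(item)                                   # 3. modify the Python list
    s = json.dumps(t)                                # 4. dump it back to JSON
    client.set(key, s)                               # 5. replace the old value
    return t
```

Every call transfers the whole list in both directions, even though only one element changes.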

## Lists

The `rpush` method adds new elements to the end of a list (the `r` is for the right-hand side of the list).

In [152]:
r.rpush('t', 1, 2, 3)

3

You don't have to do anything special to create a list; if it doesn't exist, Redis creates it.

`llen` returns the length of the list.

In [153]:
r.llen('t')

3

`lrange` gets elements from a list. With the indices `0` and `-1`, it gets all of the elements.

In [154]:
r.lrange('t', 0, -1)

[b'1', b'2', b'3']

The result is a Python list, but the elements are bytestrings.

`rpop` removes the last element of the list and returns it.

In [155]:
r.rpop('t')

b'3'

You can read more about the other list methods in the [Redis documentation](https://redis.io/commands#list).

And you can read about the [redis-py API here](https://redis-py.readthedocs.io/en/stable/index.html#redis.Redis.rpush).

In general, the documentation of Redis is very good; the documentation of `redis-py` is a little rough around the edges.

## Hash

A [Redis hash](https://redis.io/commands#hash) is similar to a Python dictionary, but just to make things confusing the nomenclature is a little different.

What we would call a "key" in a Python dictionary is called a "field" in a Redis hash.

The `hset` method sets a field-value pair in a hash:

In [157]:
r.hset('h', 'field', 'value')

1

The `hget` method looks up a field and returns the corresponding value.

In [158]:
r.hget('h', 'field')

b'value'

`hset` can also take a Python dictionary, passed as the `mapping` keyword argument (see the [redis-py source](https://github.com/redis/redis-py/blob/cf5c5865bb9947498f3810b028628f3d2ab14030/redis/commands.py)):

In [159]:
d = dict(a=1, b=2, c=3)
r.hset('h', mapping=d)

3

To iterate the elements of a hash, we can use `hscan_iter`:

In [160]:
for field, value in r.hscan_iter('h'):
    print(field, value)

b'field' b'value'
b'a' b'1'
b'b' b'2'
b'c' b'3'


The results are bytestrings for both the fields and values.
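If the fields should be strings and the values integers, we can convert them as we build a Python dictionary. This is a small sketch with a made-up helper name, `decode_hash`; it assumes UTF-8 fields and numeric values, so it uses only the numeric pairs from the output above.

```python
def decode_hash(pairs):
    """Convert (field, value) bytestring pairs to a dict of str -> int."""
    return {field.decode('utf-8'): int(value) for field, value in pairs}

# The same kind of pairs hscan_iter yields, written as literals.
pairs = [(b'a', b'1'), (b'b', b'2'), (b'c', b'3')]
d = decode_hash(pairs)
```

With a live connection, the argument could be `r.hscan_iter('h')`, provided every value in the hash is numeric.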

## Deleting

Before we go on, let's clean up the database by deleting all of the key-value pairs.

In [74]:
for key in r.keys():
    r.delete(key)

## Anagrams (again!)

In a previous notebook, we made sets of words that are anagrams of each other by building a dictionary where the keys are sorted strings of letters and the values are lists of words.

We'll start by solving this problem again using Python data structures; then we'll translate it into Redis.

The following cell downloads a file that contains the list of words.

In [75]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
download('https://github.com/AllenDowney/DSIRP/raw/main/american-english')

And here's a generator function that reads the words in the file and yields them one at a time.

In [91]:
def iterate_words(filename):
    """Read lines from a file and yield the words one at a time."""
    # The with statement closes the file when the generator finishes.
    with open(filename) as f:
        for line in f:
            for word in line.split():
                yield word.strip()

The "signature" of a word is a string that contains the letters of the word in sorted order.
So if two words are anagrams, they have the same signature.

In [76]:
def signature(word):
    return ''.join(sorted(word))

The following loop makes a dictionary of anagram lists.

In [93]:
# Map from each signature to the list of words that share it.
anagram_dict = {}
for word in iterate_words('american-english'):
    key = signature(word)
    anagram_dict.setdefault(key, []).append(word)

The following loop prints all anagram lists with 6 or more words.

In [95]:
for v in anagram_dict.values():
    if len(v) >= 6:
        print(len(v), v)

6 ['abets', 'baste', 'bates', 'beast', 'beats', 'betas']
6 ['aster', 'rates', 'stare', 'tares', 'taser', 'tears']
6 ['caret', 'cater', 'crate', 'react', 'recta', 'trace']
7 ['carets', 'caster', 'caters', 'crates', 'reacts', 'recast', 'traces']
6 ['drapes', 'padres', 'parsed', 'rasped', 'spared', 'spread']
6 ['lapse', 'leaps', 'pales', 'peals', 'pleas', 'sepal']
6 ['least', 'slate', 'stale', 'steal', 'tales', 'teals']
6 ['opts', 'post', 'pots', 'spot', 'stop', 'tops']
6 ['palest', 'pastel', 'petals', 'plates', 'pleats', 'staple']
7 ['pares', 'parse', 'pears', 'rapes', 'reaps', 'spare', 'spear']


Now, to do the same thing in Redis, we have two options:

* We can store the anagram lists using Redis lists, using the signatures as keys.

* We can store the whole data structure in a Redis hash.

A problem with the first option is that the keys in a Redis database are like global variables. If we create a large number of keys, we are likely to run into name conflicts.
We can mitigate this problem by giving each key a prefix that identifies its purpose.

The following loop implements the first option, using "Anagram" as a prefix for the keys.

In [104]:
for word in iterate_words('american-english'):
    key = f'Anagram:{signature(word)}'
    r.lpush(key, word)

An advantage of this option is that it makes good use of Redis lists. A drawback is that it makes many small database transactions, so it is relatively slow.
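One way to reduce the number of round trips is a redis-py pipeline, which queues commands on the client and sends them together when we call `execute`. The following is a sketch, not a measured improvement: the function name `push_anagrams` is made up, and `client` is assumed to be a connected `redis.Redis` object.

```python
def push_anagrams(client, words, prefix='Anagram'):
    """Queue one lpush per word and send them to the server in one batch."""
    pipe = client.pipeline()
    for word in words:
        # Same key format as the loop above: prefix, colon, sorted letters.
        key = f"{prefix}:{''.join(sorted(word))}"
        pipe.lpush(key, word)
    # execute sends the queued commands and returns one result per command.
    return pipe.execute()
```

With the word list from this section, the call would be `push_anagrams(r, iterate_words('american-english'))`. For a very large input, it might be better to call `execute` every few thousand words rather than once at the end.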

We can use `keys` to get a list of all keys with a given prefix.

In [105]:
keys = r.keys('Anagram*')
len(keys)

96936

The following loop prints all lists with 6 or more elements.

In [106]:
for key in keys:
    if r.llen(key) >= 6:
        print(r.lrange(key, 0, -1))

[b'staple', b'pleats', b'plates', b'petals', b'pastel', b'palest']
[b'tops', b'stop', b'spot', b'pots', b'post', b'opts']
[b'tears', b'taser', b'tares', b'stare', b'rates', b'aster']
[b'spread', b'spared', b'rasped', b'parsed', b'padres', b'drapes']
[b'trace', b'recta', b'react', b'crate', b'cater', b'caret']
[b'teals', b'tales', b'steal', b'stale', b'slate', b'least']
[b'spear', b'spare', b'reaps', b'rapes', b'pears', b'parse', b'pares']
[b'sepal', b'pleas', b'peals', b'pales', b'leaps', b'lapse']
[b'traces', b'recast', b'reacts', b'crates', b'caters', b'caster', b'carets']
[b'betas', b'beats', b'beast', b'bates', b'baste', b'abets']


Before we go on, we can delete the keys from the database like this.

In [102]:
r.delete(*keys)

96936

The second option is to compute the dictionary of anagram lists locally and then store it as a Redis hash.

The following loop uses `dumps` to convert lists to strings that can be stored as values in a Redis hash.


In [116]:
hash_key = 'AnagramHash'
for field, t in anagram_dict.items():
    value = json.dumps(t)
    r.hset(hash_key, field, value)

We can do the same thing faster if we convert all of the lists to JSON locally and store all of the field-value pairs with one `hset` command.

In [124]:
r.delete(hash_key)

1

In [125]:
d = {key:json.dumps(t) for key, t in anagram_dict.items()}

In [126]:
r.hset(hash_key, mapping=d)

96936

The following loop iterates through the field-value pairs, converts each value back to a Python list, and prints the lists with 6 or more elements.

In [127]:
for field, value in r.hscan_iter(hash_key):
    t = json.loads(value)
    if len(t) >= 6:
        print(t)

['aster', 'rates', 'stare', 'tares', 'taser', 'tears']
['caret', 'cater', 'crate', 'react', 'recta', 'trace']
['least', 'slate', 'stale', 'steal', 'tales', 'teals']
['carets', 'caster', 'caters', 'crates', 'reacts', 'recast', 'traces']
['pares', 'parse', 'pears', 'rapes', 'reaps', 'spare', 'spear']
['abets', 'baste', 'bates', 'beast', 'beats', 'betas']
['drapes', 'padres', 'parsed', 'rasped', 'spared', 'spread']
['palest', 'pastel', 'petals', 'plates', 'pleats', 'staple']
['opts', 'post', 'pots', 'spot', 'stop', 'tops']
['lapse', 'leaps', 'pales', 'peals', 'pleas', 'sepal']


In [128]:
!killall redis-server

## Redis data types

Redis is basically a map from keys, which are strings, to values, which
can be one of several data types. The most basic Redis data type is a
*string*. I will write Redis types in italics to distinguish them from
Java types.

To add a *string* to the database, use `jedis.set`, which is similar to
`Map.put`; the parameters are the new key and the corresponding value.
To look up a key and get its value, use `jedis.get`:

In [None]:
jedis.set("mykey", "myvalue");
String value = jedis.get("mykey");

In this example, the key is "mykey" and the value is "myvalue".

Redis provides a *set* structure, which is similar to a Java
`Set<String>`. To add elements to a Redis *set*, you choose a key to
identify the *set* and then use `jedis.sadd`:

In [None]:
jedis.sadd("myset", "element1", "element2", "element3");
boolean flag = jedis.sismember("myset", "element2");

You don't have to create the *set* as a separate step. If it doesn't
exist, Redis creates it. In this case, it creates a *set* named `myset`
that contains three elements.

The method `jedis.sismember` checks whether an element is in a *set*.
Adding elements and checking membership are constant time operations.
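In redis-py, the corresponding methods are `sadd` and `sismember`. Here is a minimal sketch; the function name `add_and_check` is made up, and `client` is assumed to be a connected `redis.Redis` object.

```python
def add_and_check(client, key, elements, candidate):
    """Add elements to the Redis set at key, then test membership."""
    client.sadd(key, *elements)
    # sismember reports whether candidate is in the set; bool() normalizes
    # the result in case the client returns 0 or 1.
    return bool(client.sismember(key, candidate))
```

For example, `add_and_check(r, 'myset', ['element1', 'element2', 'element3'], 'element2')` mirrors the Java example above.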

Redis also provides a *list* structure, which is similar to a Java
`List<String>`. The method `jedis.rpush` adds elements to the end (right
side) of a *list*:

In [None]:
jedis.rpush("mylist", "element1", "element2", "element3");
String element = jedis.lindex("mylist", 1);

Again, you don't have to create the structure before you start adding
elements. This example creates a *list* named "mylist" that contains
three elements.

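The method `jedis.lindex` takes an integer index and returns the
indicated element of a *list*. Adding elements at either end of a
*list* is a constant time operation. Accessing an element by index is
linear in general, but constant time for the first or last element.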

Finally, Redis provides a hash structure, which is similar to a Java
`Map<String, String>`. The method `jedis.hset` adds a new entry to the
hash:

In [None]:
jedis.hset("myhash", "word1", Integer.toString(2));
String value = jedis.hget("myhash", "word1");

This example creates a hash named `myhash` that contains one entry,
which maps from the key `word1` to the value "2".

The keys and values are *string*s, so if we want to store an `Integer`,
we have to convert it to a `String` before we call `hset`. And when we
look up the value using `hget`, the result is a `String`, so we might
have to convert it back to `Integer`.

Working with Redis hashes can be confusing, because we use a key to
identify which hash we want, and then another key to identify a value
in the hash. In the context of Redis, the second key is called a
"field", which might help keep things straight. So a "key" like `myhash`
identifies a particular hash, and then a "field" like `word1`
identifies a value in the hash.

For many applications, the values in a Redis hash are integers, so
Redis provides a few special methods, like `hincrBy`, that treat the
values as numbers:

In [None]:
jedis.hincrBy("myhash", "word2", 1);

This method accesses `myhash`, gets the current value associated with
`word2` (or 0 if it doesn't already exist), increments it by 1, and
writes the result back to the hash.

Setting, getting, and incrementing entries in a hash are constant time
operations.
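The redis-py version of this method is `hincrby`. As a sketch of how it might be used to count terms, the following function increments one counter per term; the name `count_terms` is made up, and `client` is assumed to be a connected `redis.Redis` object.

```python
def count_terms(client, key, terms):
    """Increment a per-term counter in the hash stored at key."""
    for term in terms:
        # A missing field is treated as 0 before the increment.
        client.hincrby(key, term, 1)
```

After `count_terms(r, 'myhash', ['the', 'cat', 'the'])`, the field `the` would hold 2 and `cat` would hold 1, assuming the hash started out empty.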

You can read more about Redis data types at
<http://thinkdast.com/redistypes>.

## Exercise 11

At this point you have the information you need to make a web search
index that stores results in a Redis database.

Now run `ant JedisIndexTest`. It should fail, because you have some work
to do!

`JedisIndexTest` tests these methods:

-   `JedisIndex`, which is the constructor that takes a `Jedis` object
    as a parameter.

-   `indexPage`, which adds a Web page to the index; it takes a `String`
    URL and a jsoup `Elements` object that contains the elements of the
    page that should be indexed.

-   `getCounts`, which takes a search term and returns a
    `Map<String, Integer>` that maps from each URL that contains the
    search term to the number of times it appears on that page.

Here's an example of how these methods are used:

In [None]:
WikiFetcher wf = new WikiFetcher();
String url1 = 
    "http://en.wikipedia.org/wiki/Java_(programming_language)";
Elements paragraphs = wf.readWikipedia(url1);

Jedis jedis = JedisMaker.make();
JedisIndex index = new JedisIndex(jedis);
index.indexPage(url1, paragraphs);
Map<String, Integer> map = index.getCounts("the");

If we look up `url1` in the result, `map`, we should get 339, which is
the number of times the word "the" appears on the Java Wikipedia page
(that is, the version we saved).

If we index the same page again, the new results should replace the old
ones.

One suggestion for translating data structures from Java to Redis:
remember that each object in a Redis database is identified by a unique
key, which is a *string*. If you have two kinds of objects in the same
database, you might want to add a prefix to the keys to distinguish
between them. For example, in our solution, we have two kinds of
objects:

-   We define a `URLSet` to be a Redis *set* that contains the URLs that
    contain a given search term. The key for each `URLSet` starts with
    "URLSet:", so to get the URLs that contain the word "the", we
    access the *set* with the key "URLSet:the".

-   We define a `TermCounter` to be a Redis hash that maps from each
    term that appears on a page to the number of times it appears. The
    key for each `TermCounter` starts with "TermCounter:" and ends
    with the URL of the page we're looking up.

In my implementation, there is one `URLSet` for each term and one
`TermCounter` for each indexed page. I provide two helper methods,
`urlSetKey` and `termCounterKey`, to assemble these keys.
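In Python, helpers like these could be as simple as the following sketch. The snake_case names are hypothetical counterparts of the Java methods, and the key formats follow the description above.

```python
def url_set_key(term):
    """Return the Redis key of the URLSet for a search term."""
    return f'URLSet:{term}'

def term_counter_key(url):
    """Return the Redis key of the TermCounter for a page."""
    return f'TermCounter:{url}'
```

Putting the key format in one place makes it less likely that two parts of a program will build slightly different keys for the same object.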

## More suggestions if you want them

At this point you have all the information you need to do the exercise,
so you can get started if you are ready. But I have a few suggestions
you might want to read first:

-   For this exercise I provide less guidance than in previous
    exercises. You will have to make some design decisions; in
    particular, you will have to figure out how to divide the problem
    into pieces that you can test one at a time, and then assemble the
    pieces into a complete solution. If you try to write the whole thing
    at once, without testing smaller pieces, it might take a very long
    time to debug.

-   One of the challenges of working with persistent data is that it is
    persistent. The structures stored in the database might change every
    time you run the program. If you mess something up in the database,
    you will have to fix it or start over before you can proceed. To
    help you keep things under control, I've provided methods called
    `deleteURLSets`, `deleteTermCounters`, and `deleteAllKeys`, which
    you can use to clean out the database and start fresh. You can also
    use `printIndex` to print the contents of the index.

-   Each time you invoke a `Jedis` method, your client sends a message
    to the server, then the server performs the action you requested and
    sends back a message. If you perform many small operations, it will
    probably take a long time. You can improve performance by grouping a
    series of operations into a `Transaction`.

For example, here's a simple version of `deleteAllKeys`:

In [None]:
public void deleteAllKeys() {
    Set<String> keys = jedis.keys("*");
    for (String key: keys) {
        jedis.del(key);
    }
}

Each invocation of `del` requires a round-trip from the client to the
server and back. If the index contains more than a few pages, this
method would take a long time to run. We can speed it up with a
`Transaction` object:

In [None]:
public void deleteAllKeys() {
    Set<String> keys = jedis.keys("*");
    Transaction t = jedis.multi();
    for (String key: keys) {
        t.del(key);
    }
    t.exec();
}

`jedis.multi` returns a `Transaction` object, which provides all the
methods of a `Jedis` object. But when you invoke a method on a
`Transaction`, it doesn't run the operation immediately, and it doesn't
communicate with the server. It saves up a batch of operations until you
invoke `exec`. Then it sends all of the saved operations to the server
at the same time, which is usually much faster.
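In redis-py, a pipeline plays a similar role: by default it queues commands and runs them as a single transaction when `execute` is called. Here is a sketch of a Python counterpart to `deleteAllKeys`; the name `delete_all_keys` is made up, and `client` is assumed to be a connected `redis.Redis` object.

```python
def delete_all_keys(client, pattern='*'):
    """Queue a delete for each matching key, then run them as one batch."""
    pipe = client.pipeline()
    for key in client.keys(pattern):
        pipe.delete(key)
    # Each queued delete reports how many keys it removed (0 or 1).
    return sum(pipe.execute())
```

The return value is the number of keys that were actually deleted.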

## A few design hints

Now you *really* have all the information you need; you should start
working on the exercise. But if you get stuck, or if you really don't
know how to get started, you can come back for a few more hints.

**Don't read the following until you have run the test code, tried out
some basic Redis commands, and written a few methods in
`JedisIndex.java`**.

OK, if you are really stuck, here are some methods you might want to
work on:

In [None]:
/**
 * Adds a URL to the set associated with term.
 */
public void add(String term, TermCounter tc) {}

/**
 * Looks up a search term and returns a set of URLs.
 */
public Set<String> getURLs(String term) {}

/**
 * Returns the number of times the given term appears at the given URL.
 */
public Integer getCount(String url, String term) {}

/**
 * Pushes the contents of the TermCounter to Redis.
 */
public List<Object> pushTermCounterToRedis(TermCounter tc) {}

These are the methods I used in my solution, but they are certainly not
the only way to divide things up. So please take these suggestions if
they help, but ignore them if they don't.

For each method, consider writing the tests first. When you figure out
how to test a method, you often get ideas about how to write it.

Good luck!

Here are more detailed instructions to help you get started:

-   Create an account on RedisToGo, at <http://thinkdast.com/redissign>,
    and select the plan you want (probably the free plan to get
    started).

-   Create an "instance", which is a virtual machine running the Redis
    server. If you click on the "Instances" tab, you should see your new
    instance, identified by a host name and a port number. For example,
    I have an instance named "dory-10534".

-   Click on the instance name to get the configuration page. Make a
    note of the URL near the top of the page, which looks like this:

In [None]:
redis://redistogo:1234567feedfacebeefa1e1234567@dory.redistogo.com:10534

This URL contains the server's host name, `dory.redistogo.com`, the port
number, `10534`, and the password you will need to connect to the
server, which is the long string of letters and numbers in the middle.
You will need this information for the next step.
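To see how the pieces fit together, here is a sketch that takes the example URL apart with Python's standard `urllib.parse` module. The URL is the placeholder shown above, not a working server.

```python
from urllib.parse import urlparse

url = 'redis://redistogo:1234567feedfacebeefa1e1234567@dory.redistogo.com:10534'
parts = urlparse(url)

host = parts.hostname      # 'dory.redistogo.com'
port = parts.port          # 10534, as an integer
password = parts.password  # the string between ':' and '@'
```

If you connect from Python rather than Java, redis-py can also take the whole URL at once with `redis.Redis.from_url(url)`.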

In [None]:
`redis://redistogo:1234567feedfacebeefa1e1234567@dory.redistogo.com:10534`

Because this file contains the password for your Redis server, you
should not put this file in a public repository. To help you avoid doing
that by accident, the repository contains a `.gitignore` file that makes
it harder (but not impossible) to put this file in your repo.

