In [None]:
import os
from elasticsearch import Elasticsearch

# Getting data
Before we start our search and similarity experiments, we need data. 

We're going to copy the Wellcome Collection image dataset from the production cluster to our local machine. Some of the Elasticsearch concepts in this notebook might be a bit unclear as we run through these steps, but they should become clearer over the next few notebooks, where we'll have a populated data store to experiment in.

We'll follow [this guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/reindex-upgrade-remote.html) for reindexing data from a remote index to a local one, mirroring its instructions with the [Python Elasticsearch library](https://elasticsearch-py.readthedocs.io/en/latest/index.html).

## Connecting to a remote cluster
First we want to check that our source index exists and that the cluster it's running in is healthy. To connect to the cluster, we'll use the variables we defined in `.env`.

In [None]:
remote_es = Elasticsearch(
    hosts=os.environ['REMOTE_HOST'],
    http_auth=(
        os.environ['REMOTE_USER'],
        os.environ['REMOTE_PASS']
    )
)

In [None]:
response = remote_es.search(
    body={"query": {"match_all": {}}},
    index=os.environ['INDEX_NAME']
)
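
The search above confirms that the index responds to queries. To check cluster health and index existence more directly, something like the following sketch could be used. It assumes the client's `cluster.health` and `indices.exists` methods; `check_source` is a hypothetical helper name introduced here for illustration.

```python
def check_source(es, index):
    """Return the cluster health status and whether `index` exists."""
    health = es.cluster.health()  # status is "green", "yellow" or "red"
    exists = es.indices.exists(index=index)
    return {"status": health["status"], "index_exists": bool(exists)}

# Usage (needs a live cluster, so it is left commented out):
# check_source(remote_es, os.environ['INDEX_NAME'])
```

A `yellow` status usually just means some replica shards are unassigned, which is fine for reading data; `red` means at least one primary shard is unavailable.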

## Connecting to our local cluster
This is our cluster running in Docker, which is connected to this Jupyter container through a [bridge network](https://docs.docker.com/network/bridge/).

The `LOCAL_HOST` variable should be something like `http://elasticsearch:9200`. Instead of using `http://localhost:9200` as we normally would when running outside of Docker, we replace `localhost` with `elasticsearch`, the name we gave to the container in which our local cluster is running.

In [None]:
local_es = Elasticsearch(
    hosts=os.environ['LOCAL_HOST'],
    http_auth=(
        os.environ['LOCAL_USER'],
        os.environ['LOCAL_PASS']
    )
)

There's no data in here (yet) so there's nothing for us to look at. Let's print a list of the existing indexes in the cluster, just to make sure.

In [None]:
local_es.indices.get_alias(index="*")

See, nothing.

We should create an index with the appropriate name to reindex our remote data into, and modify a couple of settings to make the reindex run faster.

In [None]:
local_es.indices.create(
    index=os.environ['INDEX_NAME'], 
    body={
        "settings" : {
            "refresh_interval": -1,
            "number_of_replicas": 0
        },
    }
)


Let's verify that the index now exists:

In [None]:
local_es.indices.get_alias(index="*")

Great, now we have a remote and local index to work with.

We're going to kick off a reindex from our remote cluster into our little local one. Normally, this command would run for a few seconds before timing out and cancelling the reindex. To stop that from happening, we'll set `wait_for_completion=False` so that the operation will run in the background without timing out. The command will then return a response containing a task ID, which we can use to monitor the progress of the reindex.

In [None]:
response = local_es.reindex(
    body={
        "source": {
            "remote": {
                "host": os.environ['REMOTE_HOST'],
                "username": os.environ['REMOTE_USER'],
                "password": os.environ['REMOTE_PASS']
            },
            "index": os.environ['INDEX_NAME'],
        },
        "dest":{
            "index": os.environ['INDEX_NAME']
        },
    },
    wait_for_completion=False
)

response

## Monitoring the state of the reindex

In [None]:
local_es.tasks.get(task_id=response['task'])
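
Rather than re-running that cell by hand, the task can be polled in a loop. The sketch below assumes the task response contains a `completed` flag and a `task.status` object with `created` and `total` counters, as the tasks API returns for a reindex; `wait_for_task` and its parameters are hypothetical names introduced here.

```python
import time

def wait_for_task(es, task_id, poll_seconds=30, sleep=time.sleep):
    """Poll the tasks API until the task completes; return the final response."""
    while True:
        info = es.tasks.get(task_id=task_id)
        status = info["task"]["status"]
        # For a reindex, status carries counters such as `created` and `total`
        print(f"{status.get('created', 0)} / {status.get('total', 0)} documents created")
        if info["completed"]:
            return info
        sleep(poll_seconds)

# Usage (needs the live cluster and the reindex response from above):
# wait_for_task(local_es, response['task'])
```

The `sleep` argument exists only so the wait can be swapped out or shortened; in normal use the default is fine.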

If you need to cancel the reindex at any point, run:
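
A minimal sketch of the cancellation, assuming the task ID returned by the reindex call is still available in `response['task']`. It uses the client's `tasks.cancel` method; `cancel_reindex` is a hypothetical helper name, not part of the library.

```python
def cancel_reindex(es, task_id):
    """Ask the cluster to cancel the task with the given ID."""
    return es.tasks.cancel(task_id=task_id)

# Usage (needs the live cluster and the reindex response from above):
# cancel_reindex(local_es, response['task'])
```

Documents that were already copied stay in the local index after a cancellation, so a later reindex into the same index will overwrite them rather than start from an empty index.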

## When the reindex finishes

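Finally, when all the data has been copied over to our local index, we need to modify the index settings again. We set a custom `refresh_interval` and `number_of_replicas` when we created the index so that the reindex would run quickly. Now we should re-enable periodic refreshes and restore a replica so that the data can actually be searched. A `number_of_replicas` of 1 matches the Elasticsearch default, while the `30s` refresh interval used here is longer than the default of `1s`.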

In [None]:
local_es.indices.put_settings(
    index=os.environ['INDEX_NAME'],
    body={
        "settings" : {
            "refresh_interval": "30s",
            "number_of_replicas": 1
        },
    }
)
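
As a final sanity check, the document counts in the two indexes can be compared. This is a sketch assuming both clients from earlier in the notebook and the client's `count` and `indices.refresh` methods; `compare_counts` is a hypothetical helper name.

```python
def compare_counts(source_es, dest_es, index):
    """Refresh the destination index, then compare document counts."""
    # A refresh makes recently indexed documents visible to count and search
    dest_es.indices.refresh(index=index)
    source_count = source_es.count(index=index)["count"]
    dest_count = dest_es.count(index=index)["count"]
    return {
        "source": source_count,
        "dest": dest_count,
        "match": source_count == dest_count,
    }

# Usage (needs both live clusters):
# compare_counts(remote_es, local_es, os.environ['INDEX_NAME'])
```

The counts can legitimately differ if the production index has changed since the reindex started.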

Now we can start working seriously on search and similarity!