# Upgrade an index to use ELSER with the Reindex API

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/06-upgrading-index-elser.ipynb)

The Elasticsearch [Reindex API](https://elasticsearch-py.readthedocs.io/en/stable/api.html#elasticsearch.Elasticsearch.reindex) copies documents from one index to another. It is useful when you want to move data between indices, change the mapping of an index, or update the data in your index.

In this notebook we will see how to migrate an index to use the ELSER model with the [Reindex API and an ingestion pipeline](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#reindex-with-an-ingest-pipeline).

The scenarios we will cover in this notebook are:

1. Migrating an index that doesn't have a generated [`text_expansion`](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-text-expansion-query.html) field to the ELSER model `.elser_model_2`
2. Upgrading an existing index that uses `.elser_model_1` to use `.elser_model_2`
3. Upgrading an index that uses a different model to use ELSER
 


# 🧰 Requirements

For this example, you will need:

- An Elastic deployment with a minimum **4GB machine learning node**
   - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook))

   

# Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page
   - Under **Advanced settings**, go to **Machine Learning instances**
   - You'll need at least **4GB** RAM per zone for this tutorial
   - Select **Create deployment**

# Setup ELSER

ELSER is a model trained by Elastic that helps you perform semantic search on your data and retrieve results based on the contextual meaning of a query.

To use ELSER, you must have the [appropriate subscription]() level or the trial period activated.

Elasticsearch versions before 8.11 support `.elser_model_1`. From 8.11 onward, Elasticsearch supports `.elser_model_2`, which offers improved retrieval accuracy and faster indexing.
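
The version rule above can be sketched as a small helper. This is only an illustration: the function name `elser_model_for_version` is hypothetical, and it assumes the cluster is recent enough to support ELSER at all.

```python
def elser_model_for_version(version: str) -> str:
    """Return the ELSER model ID this notebook pairs with an Elasticsearch version."""
    # Keep only the numeric major and minor parts, e.g. "8.11.0" -> (8, 11).
    major, minor = (int(part) for part in version.split(".")[:2])
    return ".elser_model_2" if (major, minor) >= (8, 11) else ".elser_model_1"


print(elser_model_for_version("8.10.4"))  # .elser_model_1
print(elser_model_for_version("8.11.0"))  # .elser_model_2
```

Once a client is connected, the version string can be read from `client.info()["version"]["number"]` and passed to this helper.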


# Install packages and connect with Elasticsearch Client

To get started, we will need to connect to our Elastic deployment using the Python client. As we are using an Elastic Cloud deployment, we will use the **Cloud ID** to identify our deployment. To find your **Cloud ID**, go to https://cloud.elastic.co/deployments and select your deployment.

Next, we will install the `elasticsearch` package using `pip`.

In [1]:
!pip install elasticsearch -qU

Next, we will import all the modules that we need. 

In [2]:
from elasticsearch import Elasticsearch, helpers
from urllib.request import urlopen
import getpass
import json

Now we will instantiate the Python Elasticsearch client. When prompted, enter your `Cloud ID` and `password`; these are used to authenticate and create the `Elasticsearch` client instance.


In [77]:
# Found in the 'Manage Deployment' page
CLOUD_ID = getpass.getpass('Enter Elastic Cloud ID:  ')

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = getpass.getpass('Enter Elastic password:  ')

# Create the client instance
client = Elasticsearch(
    cloud_id=CLOUD_ID,
    basic_auth=("elastic", ELASTIC_PASSWORD)
)

# Case 1: Migrate an index with no `text_expansion` field

In this example we will take an index that has only a simple [ingestion pipeline](https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html) configured and upgrade it to use the ELSER model `.elser_model_2`.

# Create Ingestion pipeline 

We will create a simple pipeline to convert title field values to lowercase and use this ingestion pipeline on our index. 

In [None]:

client.ingest.put_pipeline(
    id="ingest-pipeline-lowercase", 
    description="Ingest pipeline to change title to lowercase",
    processors=[
    {
      "lowercase": {
        "field": "title"
      }
    }
  ]
)
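
Before attaching the pipeline to an index, it can be tried out with the simulate pipeline API, which runs documents through the processors without indexing anything. The sketch below is an optional check, not part of the original notebook; the helper name `simulate_titles` is hypothetical, and it assumes `client` is the connected `Elasticsearch` instance created earlier.

```python
def simulate_titles(client, pipeline_id, titles):
    """Run titles through an ingest pipeline without indexing anything."""
    response = client.ingest.simulate(
        id=pipeline_id,
        docs=[{"_source": {"title": title}} for title in titles],
    )
    # Each simulated result carries the transformed document under "doc".
    return [result["doc"]["_source"]["title"] for result in response["docs"]]


# Usage against a live cluster (not run here):
# simulate_titles(client, "ingest-pipeline-lowercase", ["The Matrix", "Se7en"])
```

With the lowercase processor defined above, the returned titles should come back in lowercase.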

# Create index with mappings

Next, we will create an index `movies` whose default pipeline is `ingest-pipeline-lowercase`, which we created in the previous step.

In [None]:
client.indices.create(
  index="movies",
  settings={
      "index": {
          "number_of_shards": 1,
          "number_of_replicas": 1,
          "default_pipeline": "ingest-pipeline-lowercase"
      }
  },
  mappings={
    "properties": {
      "plot": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
    }
  }
)

# Insert Documents
We are now ready to insert a sample dataset of 12 movies into our index `movies`.

In [None]:
url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/movies.json"
response = urlopen(url)

# Load the response data into a JSON object
data_json = json.loads(response.read())

# Prepare the documents to be indexed
documents = []
for doc in data_json:
    documents.append({
        "_index": "movies",
        "_source": doc,
    })

# Use helpers.bulk to index
helpers.bulk(client, documents)

print("Done indexing documents into `movies` index!")

# Upgrade index `movies` to use ELSER model

**`Note:`** Before you begin upgrading the index, make sure your cloud deployment is running version 8.11. You can follow these [instructions](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-ELSER.html#trained-model) to download and deploy the trained model in the Kibana UI or using the Dev Tools **Console**.

We are ready to reindex `movies` into a new index that uses the ELSER model `.elser_model_2`. As a first step, we have to create a new ingestion pipeline and a new index for the ELSER model.
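
Reindexing through an inference pipeline will fail if the model is not deployed, so it can help to check the deployment first. The following is a minimal sketch, assuming the trained model statistics response lists each model under `trained_model_stats` and reports a `deployment_stats.state` of `started` once the deployment is running. The helper name `elser_is_started` is hypothetical.

```python
def elser_is_started(stats_response, model_id=".elser_model_2"):
    """Return True if the trained-model stats report a started deployment."""
    for stats in stats_response.get("trained_model_stats", []):
        if stats.get("model_id") != model_id:
            continue
        # "deployment_stats" is assumed to be present only for deployed models.
        state = stats.get("deployment_stats", {}).get("state")
        if state == "started":
            return True
    return False


# Usage against a live cluster (not run here):
# stats = client.ml.get_trained_models_stats(model_id=".elser_model_2")
# print(elser_is_started(stats.body))
```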

# Create a new pipeline with ELSER 
Let's create a new ingestion pipeline with the ELSER model `.elser_model_2`.

In [None]:
client.ingest.put_pipeline(
    id="elser-ingest-pipeline", 
    description="Ingest pipeline for ELSER",
    processors=[
    {
      "inference": {
        "model_id": ".elser_model_2",
        "target_field": "ml",
        "field_map": {
          "plot": "text_field"
        },
        "inference_config": {
          "text_expansion": {
            "results_field": "tokens"
          }
        }
      }
    }
  ]
)
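
The same inference processor is reused in every case in this notebook, with only the source field changing (`plot` here, `title` for the blogs index later). As an optional convenience, the processor list can be built by a small function. The name `elser_processors` is hypothetical; the structure mirrors the pipeline above.

```python
def elser_processors(source_field, model_id=".elser_model_2"):
    """Build the processor list for an ELSER inference pipeline."""
    return [
        {
            "inference": {
                "model_id": model_id,
                # Inference results are written under the "ml" field.
                "target_field": "ml",
                # Map the document field to the input name the model expects.
                "field_map": {source_field: "text_field"},
                "inference_config": {
                    "text_expansion": {"results_field": "tokens"}
                },
            }
        }
    ]


# Usage against a live cluster (not run here):
# client.ingest.put_pipeline(
#     id="elser-ingest-pipeline",
#     description="Ingest pipeline for ELSER",
#     processors=elser_processors("plot"),
# )
```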

# Create an index with mappings

Next, create an index whose mapping includes a [`rank_features`](https://www.elastic.co/guide/en/elasticsearch/reference/current/rank-features.html) field to store the tokens that ELSER generates. This is the field that our [`text_expansion`](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-text-expansion-query.html) queries will search.



In [None]:
client.indices.create(
  index="elser-movies",
  mappings={
    "properties": {
      "plot": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "ml.tokens": {
        "type": "rank_features"
      },
    }
  }
)

# Reindex with updated pipeline 

With the help of the [Reindex API](https://elasticsearch-py.readthedocs.io/en/stable/api.html#elasticsearch.Elasticsearch.reindex), we can copy data from the old index `movies` to the new index `elser-movies` with the ingestion pipeline set to `elser-ingest-pipeline`. On success, each document in `elser-movies` contains the tokens that ELSER generated from the field you targeted for inference, ready to be searched with `text_expansion` queries.

In [None]:
client.reindex(source={
    "index": "movies"
  }, dest={
    "index": "elser-movies",
    "pipeline":  "elser-ingest-pipeline"
  })
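
Because every document passes through the ELSER model, reindexing a large index can take longer than a single HTTP request is allowed to wait. One option is to start the reindex with `wait_for_completion=False` and poll the task API until it finishes. The sketch below assumes that approach; the helper name `wait_for_task` and its default polling values are hypothetical.

```python
import time


def wait_for_task(client, task_id, poll_seconds=5.0, max_polls=120):
    """Poll the tasks API until the given task reports completion."""
    for _ in range(max_polls):
        status = client.tasks.get(task_id=task_id)
        if status["completed"]:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Task {task_id} did not finish after {max_polls} polls")


# Usage against a live cluster (not run here):
# task = client.reindex(
#     source={"index": "movies"},
#     dest={"index": "elser-movies", "pipeline": "elser-ingest-pipeline"},
#     wait_for_completion=False,
# )
# result = wait_for_task(client, task["task"])
# print(result["response"]["created"])
```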

Once the reindex is complete, inspect a document in the index `elser-movies` and notice that it now has an additional field `"ml": {"tokens":...}` containing the terms that our `text_expansion` query will search.

Also note that you can now delete the old index `movies` if you don't need it anymore.
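
To make the inspection step concrete, the sketch below lists the highest-weighted tokens from a document's `ml.tokens` field. The helper name `top_tokens` is hypothetical, and the sample document uses made-up weights purely for illustration; real documents will contain many more tokens with different values.

```python
def top_tokens(source, n=5):
    """Return the n highest-weighted ELSER tokens from a document's _source."""
    tokens = source.get("ml", {}).get("tokens", {})
    return sorted(tokens.items(), key=lambda item: item[1], reverse=True)[:n]


# Illustrative document with invented weights, not real model output.
sample_source = {
    "title": "se7en",
    "ml": {"tokens": {"detective": 1.9, "killer": 2.3, "sin": 1.1}},
}
print(top_tokens(sample_source, n=2))  # [('killer', 2.3), ('detective', 1.9)]

# Usage against a live cluster (not run here):
# hit = client.search(index="elser-movies", size=1)["hits"]["hits"][0]
# print(top_tokens(hit["_source"]))
# client.indices.delete(index="movies")  # only if the old index is no longer needed
```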

# Querying documents with ELSER 

Let's try a semantic search on our index with the ELSER model `.elser_model_2`.

In [78]:
response = client.search(
    index='elser-movies', 
    size=3,
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id":".elser_model_2",
                "model_text":"investigation"
            }
        }
    }
)

for hit in response['hits']['hits']:
    doc_id = hit['_id']
    score = hit['_score']
    title = hit['_source']['title']
    plot = hit['_source']['plot']
    print(f"Score: {score}\nTitle: {title}\nPlot: {plot}\n")

Score: 6.403748
Title: se7en
Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.

Score: 3.6703482
Title: the departed
Plot: An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.

Score: 2.9359207
Title: the usual suspects
Plot: A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.



# Case 2: Upgrade index with ELSER model to `.elser_model_2`

If you already have an index that uses the ELSER model `.elser_model_1` and would like to upgrade to `.elser_model_2`, you can use the Reindex API with an ingestion pipeline that uses the `.elser_model_2` model.

**`Note:`** Before we begin, ensure that you are on Elasticsearch version 8.11 and that the ELSER model `.elser_model_2` is deployed. You can follow these [instructions](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-ELSER.html#trained-model) to download and deploy the trained model in the Kibana UI or using the Dev Tools **Console**.


# Create a new ingestion pipeline

We will create a pipeline that uses `.elser_model_2`, which we will apply while reindexing.

In [None]:
client.ingest.put_pipeline(
    id="elser-pipeline-upgrade-demo", 
    description="Ingest pipeline for ELSER upgrade demo",
    processors=[
    {
      "inference": {
        "model_id": ".elser_model_2",
        "target_field": "ml",
        "field_map": {
          "plot": "text_field"
        },
        "inference_config": {
          "text_expansion": {
            "results_field": "tokens"
          }
        }
      }
    }
  ]
)

# Create a new index with mappings
We will create a new index with the mappings required to support ELSER.

In [None]:
client.indices.create(
  index="elser-upgrade-index-demo",
  mappings={
    "properties": {
      "plot": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "ml.tokens": {
        "type": "rank_features"
      },
    }
  }
)

# Use Reindex API
We will use the [Reindex API](https://elasticsearch-py.readthedocs.io/en/stable/api.html#elasticsearch.Elasticsearch.reindex) to move data from the old index to the new index `elser-upgrade-index-demo`. We will exclude the target field `ml` from the old index and instead generate new `ml` tokens with `.elser_model_2` while reindexing.

**`Note:`** Make sure to replace `my-index` with your index name that you intend to upgrade.



In [None]:
client.reindex(source={
    "index": "my-index", # replace with your index name
    "_source": {
      "excludes": ["ml"]
    }}, 
    dest={
    "index": "elser-upgrade-index-demo",
    "pipeline":  "elser-pipeline-upgrade-demo"
  })
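
The reindex calls in this notebook differ only in the index names, the pipeline, and the fields to exclude. As an optional sketch, those arguments can be assembled by a helper and unpacked into `client.reindex`. The function name `reindex_args` is hypothetical.

```python
def reindex_args(source_index, dest_index, pipeline, excludes=()):
    """Build keyword arguments for client.reindex with optional field exclusions."""
    source = {"index": source_index}
    if excludes:
        # Drop fields produced by the old model so they are not copied over.
        source["_source"] = {"excludes": list(excludes)}
    return {
        "source": source,
        "dest": {"index": dest_index, "pipeline": pipeline},
    }


args = reindex_args(
    "my-index",  # replace with your index name
    "elser-upgrade-index-demo",
    "elser-pipeline-upgrade-demo",
    excludes=["ml"],
)
print(args)

# Usage against a live cluster (not run here):
# client.reindex(**args)
```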

# Querying your data

Once reindexing is complete, you are ready to query your data and perform semantic search.

In [75]:
response = client.search(
    index='elser-upgrade-index-demo', 
    size=3,
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id":".elser_model_2",
                "model_text":"child toy"
            }
        }
    }
)

for hit in response['hits']['hits']:
    doc_id = hit['_id']
    score = hit['_score']
    title = hit['_source']['title']
    plot = hit['_source']['plot']
    print(f"Score: {score}\nTitle: {title}\nPlot: {plot}\n")


Score: 3.3168378
Title: Fight Club
Plot: An insomniac office worker and a devil-may-care soapmaker form an underground fight club that evolves into something much, much more.

Score: 1.5777297
Title: The Godfather
Plot: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

Score: 1.1162646
Title: The Matrix
Plot: A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.



# Case 3: Upgrade an index with a different model to ELSER

Now we will see how to migrate an index that already has an `embedding` generated by a different model.

Let's consider the index `blogs`, which has a `text_embedding` field generated using the NLP model `sentence-transformers__all-minilm-l6-v2`. If you would like to know more about how to load an NLP model and use it with an index, follow the steps from our notebook [loading-model-from-hugging-face.ipynb](../integrations/hugging-face/loading-model-from-hugging-face.ipynb)

We will follow a procedure similar to the previous cases:
1. Create an ingestion pipeline with the ELSER model `.elser_model_2`
2. Create an index with mappings for the ELSER tokens
3. Reindex using the pipeline from step 1, excluding the field with the generated embedding from the `blogs` index

Before we begin, let's take a look at our index `blogs` and see the mappings.

In [17]:
client.indices.get(index="blogs")

ObjectApiResponse({'blogs': {'aliases': {}, 'mappings': {'properties': {'text_embedding': {'properties': {'is_truncated': {'type': 'boolean'}, 'model_id': {'type': 'text', 'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}, 'predicted_value': {'type': 'dense_vector', 'dims': 384, 'index': True, 'similarity': 'l2_norm'}}}, 'title': {'type': 'text', 'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}}, 'settings': {'index': {'routing': {'allocation': {'include': {'_tier_preference': 'data_content'}}}, 'number_of_shards': '1', 'provided_name': 'blogs', 'default_pipeline': 'vectorize_blogs', 'creation_date': '1697033266359', 'number_of_replicas': '1', 'uuid': 'e_SkUcPXT06ZMs1ZbsZfQw', 'version': {'created': '8500003'}}}}})

Notice the field `text_embedding`. We will have to exclude this field from our new index and instead generate `text_expansion` tokens from the field `title` of the `blogs` index.
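
For indices with larger mappings, spotting the embedding fields by eye is harder. The sketch below walks a mapping's `properties` and lists every field of a given type. The helper name `find_fields_of_type` is hypothetical, and the sample mapping is an abbreviated copy of the `blogs` mapping shown above.

```python
def find_fields_of_type(properties, field_type, prefix=""):
    """Return dotted paths of all fields with the given type in a mapping."""
    found = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == field_type:
            found.append(path)
        # Object fields nest their children under "properties".
        found.extend(
            find_fields_of_type(spec.get("properties", {}), field_type, f"{path}.")
        )
    return found


# Abbreviated from the `blogs` mapping printed above.
blogs_properties = {
    "text_embedding": {
        "properties": {
            "is_truncated": {"type": "boolean"},
            "model_id": {"type": "text"},
            "predicted_value": {"type": "dense_vector", "dims": 384},
        }
    },
    "title": {"type": "text"},
}
print(find_fields_of_type(blogs_properties, "dense_vector"))
# ['text_embedding.predicted_value']
```

Against a live cluster, the `properties` dictionary can be read from `client.indices.get_mapping(index="blogs")["blogs"]["mappings"]["properties"]`.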

# Create ingestion pipeline


In [None]:
client.ingest.put_pipeline(
    id="elser-pipeline-blogs", 
    description="Ingest pipeline for ELSER upgrade",
    processors=[
    {
      "inference": {
        "model_id": ".elser_model_2",
        "target_field": "ml",
        "field_map": {
          "title": "text_field"
        },
        "inference_config": {
          "text_expansion": {
            "results_field": "tokens"
          }
        }
      }
    }
  ]
)

# Create index with mappings

Let's create an index `elser-blogs` with mappings.

In [42]:
client.indices.create(
  index="elser-blogs",
  mappings={
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "ml.tokens": {
        "type": "rank_features"
      },
    }
  }
)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'elser-blogs'})

# Reindex API

We will use the [Reindex API](https://elasticsearch-py.readthedocs.io/en/stable/api.html#elasticsearch.Elasticsearch.reindex) to copy the data into our new index `elser-blogs` and generate `text_expansion` tokens along the way.

In [None]:
client.reindex(source={
    "index": "blogs",
    "_source": {
      "excludes": ["text_embedding"]
    }
  }, dest={
    "index": "elser-blogs",
    "pipeline":  "elser-pipeline-blogs"
  })
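
A quick sanity check after reindexing is to compare document counts between the source and destination indices. This is a minimal sketch rather than a full validation; the helper name `counts_match` is hypothetical, and it assumes `client` is the connected `Elasticsearch` instance.

```python
def counts_match(client, source_index, dest_index):
    """Return True if both indices report the same number of documents."""
    # Refresh so newly reindexed documents are visible to the count API.
    client.indices.refresh(index=dest_index)
    source_count = client.count(index=source_index)["count"]
    dest_count = client.count(index=dest_index)["count"]
    return source_count == dest_count


# Usage against a live cluster (not run here):
# print(counts_match(client, "blogs", "elser-blogs"))
```

A mismatch may indicate that some documents failed inference, in which case the `failures` list in the reindex response is a good place to look.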

# Querying your data
Success! Now we can query data on the index `elser-blogs`.

In [46]:
response = client.search(
    index='elser-blogs', 
    size=3,
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id":".elser_model_2",
                "model_text":"Track network connections"
            }
        }
    }
)

for hit in response['hits']['hits']:
    doc_id = hit['_id']
    score = hit['_score']
    title = hit['_source']['title']
    print(f"Score: {score}\nTitle: {title}")


Score: 27.618645
Title: Brewing in Beats: Track network connections
Score: 3.8143802
Title: Machine Learning for Nginx Logs - Identifying Operational Issues with Your Website
Score: 3.3623078
Title: Data Visualization For Machine Learning
