# Create your Index for Similarity Search

![Converting our Plain Text Docs into chunked docs in an opensearch index](../img/txt-doc-to-os-docs.png)

To ingest our transcriptions, we need to prepare an OpenSearch index to store our data.

In this workshop, we're ingesting only [our transcription example](../transcripts/transcription_example.txt), but our OpenSearch index will have hundreds of documents, and our final RAG application will have tens of thousands.

---

🔍 Let's examine the metadata of our document

```yaml
description: "Do you have a grip on productivity? Are you worried that external factors could disrupt what you’re doing at any second? Time to put things in a VICE!"
pub_date: "March 10th, 2022"
title: "18: Putting External Factors in a VICE Grip \U0001F5DC"
url: https://relay.fm/conduit/18
```

This information, along with our `content`, needs to be mapped into an index.

Although every metadata value is a string, we want to set up the mapping to fit our needs, which means `pub_date` should be a `date` value.

Let's start by loading our environment variables and importing our libraries. Then we'll establish our connection with our OpenSearch®️ service.

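```python
import os

from dotenv import load_dotenv
from opensearchpy import OpenSearch

# Read OPENSEARCH_SERVICE_URI (and any other settings) from the .env file
load_dotenv()

connection_string = os.getenv("OPENSEARCH_SERVICE_URI")
client = OpenSearch(connection_string, use_ssl=True, timeout=100)

# Confirm the connection by requesting basic cluster information
client.info()
```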

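A connection problem is easier to diagnose when the URI has been checked before the client is created. The following is a minimal sketch of such a pre-check; the helper name `check_service_uri` and the placeholder URI are assumptions for illustration and are not part of the workshop code.

```python
from urllib.parse import urlparse


def check_service_uri(uri):
    """Return the URI if it looks usable, otherwise raise ValueError.

    Hypothetical helper: it only inspects the string and never contacts
    the service.
    """
    if not uri:
        raise ValueError("OPENSEARCH_SERVICE_URI is not set; check your .env file")
    parsed = urlparse(uri)
    if parsed.scheme != "https" or not parsed.hostname:
        raise ValueError("expected an https:// URI that includes a hostname")
    return uri


# Placeholder URI for illustration only, not a real service
checked = check_service_uri("https://user:secret@example.com:9200")
```

In the notebook, this could wrap `os.getenv("OPENSEARCH_SERVICE_URI")` before the value is handed to `OpenSearch(...)`.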

Next, let's define our mapping for this index. We know that our index will use _K-Nearest Neighbors_. This means that we need to enable it in the settings.

We'll also describe the vectors that we'll create. The `dimension` of the `knn_vector` field must match the embedding size of [the model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) we're using, which is 768.

Finally, we'll map `pub_date` as a `date`. A value like "March 10th, 2022" has to be converted to a format that OpenSearch®️ can parse before it is ingested.

```python
index_settings = {
    "settings": {
        "index": {
            # Enable k-NN search on this index
            "knn": True,
        },
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "description": {"type": "text"},
            "url": {"type": "keyword"},
            "content": {"type": "text"},
            "content_vector": {
                "type": "knn_vector",
                # Must match the embedding size of the model
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "faiss",
                },
            },
            "pub_date": {"type": "date"},
        },
    },
}
```
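The `pub_date` in our metadata ("March 10th, 2022") is not in a form that a `date` field parses by default, so it needs to be normalized before ingestion. Here is a minimal sketch of one way to do that with the standard library. The helper name `to_iso_date` and the regular expression are assumptions for illustration, and `%B` relies on an English locale for month names.

```python
import re
from datetime import datetime

# Matches an ordinal suffix only when it directly follows a digit,
# so the "st" in "August" is left alone.
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def to_iso_date(pub_date):
    """Convert a string such as 'March 10th, 2022' to 'YYYY-MM-DD'.

    Hypothetical helper, not part of the workshop code.
    """
    cleaned = ORDINAL_SUFFIX.sub("", pub_date)
    return datetime.strptime(cleaned, "%B %d, %Y").date().isoformat()


iso_date = to_iso_date("March 10th, 2022")
print(iso_date)  # 2022-03-10
```

An ISO 8601 date string like this is accepted by the `date` type without declaring a custom `format` in the mapping.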

We'll wrap up by defining our index name and adding it to our `.env` file.

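```python
!echo INDEX_NAME="embedded_transcripts" >> .env
```

```python
# Reload the .env file so the new INDEX_NAME value is available
load_dotenv()
index_name = os.getenv("INDEX_NAME")

# ignore=400 suppresses the error raised when the index already exists
client.indices.create(index=index_name, body=index_settings, ignore=400)
```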
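To show how this mapping will be used, here is a minimal sketch of a k-NN query body aimed at the `content_vector` field. The helper name `build_knn_query` and the all-zero vector are assumptions for illustration; in practice the vector would come from the embedding model.

```python
EMBEDDING_DIMENSION = 768  # same value as "dimension" in the mapping


def build_knn_query(vector, k=5):
    """Build a k-NN search body for the content_vector field.

    Hypothetical helper: it only builds the request body and does not
    contact OpenSearch.
    """
    if len(vector) != EMBEDDING_DIMENSION:
        raise ValueError(
            f"expected {EMBEDDING_DIMENSION} values, got {len(vector)}"
        )
    return {
        "size": k,
        "query": {
            "knn": {
                "content_vector": {"vector": list(vector), "k": k},
            },
        },
    }


# Placeholder vector for illustration; a real one comes from the model
query = build_knn_query([0.0] * EMBEDDING_DIMENSION, k=3)
```

With a live connection and an ingested index, this body could be passed to `client.search(index=index_name, body=query)`.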


In this notebook, we created our OpenSearch®️ index. We looked at the document metadata and mapped each field to a type that fits our needs.

In the next notebook we'll split our documents to fit our vectorization model and generate embeddings.

Move on to the [next notebook](2-chunk-segment-ingest.ipynb) or push the button below.

[![Chunk and Ingest your Data](https://img.shields.io/badge/2-Chunk%20and%20Ingest%20Your%20Docs-153a5a?style=for-the-badge&labelColor=ec6147)](2-chunk-segment-ingest.ipynb)
