# README

Upload data to OpenSearch through LangChain, then query it.

**References**
* [OpenSearch Langchain](https://python.langchain.com/v0.2/docs/integrations/vectorstores/opensearch/)

# Install & Import

In [179]:
%pip install --upgrade --quiet  opensearch-py langchain-community python-dotenv langchain_openai
%pip install --upgrade --quiet transformers sentence-transformers pandas

Note: you may need to restart the kernel to use updated packages.




By default, `similarity_search` performs approximate `k-NN` search, which uses one of several engines (`lucene`, `nmslib`, `faiss`) and is recommended for large datasets. For brute-force (exact) search, two other methods are available: `Script Scoring` and `Painless Scripting`. See the OpenSearch k-NN documentation for more details.
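
To make the contrast concrete, here is a minimal pure-Python sketch of what a brute-force search does conceptually: it scores every stored vector against the query and keeps the best `k`. The helper names (`cosine_similarity`, `exact_knn`) and the toy vectors are illustrative assumptions; this is not how OpenSearch implements script scoring internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_knn(query_vec, doc_vecs, k=4):
    """Score every document against the query and return the top-k (index, score) pairs."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" (hypothetical values, for illustration only)
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = exact_knn([1.0, 0.1], toy_docs, k=2)
top_indices = [i for i, _ in top]
```

Because every vector is scored, the cost grows linearly with the number of documents, which is why approximate search is the usual choice for large indexes.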

In [128]:
import os
import textwrap
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import CharacterTextSplitter

load_dotenv()

True

In [129]:
import getpass
import os

# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

**state_of_the_union.txt**

Download the text file from this link: [state_of_the_union.txt](https://github.com/hwchase17/chroma-langchain/blob/master/state_of_the_union.txt)



LangChain examples often use `state_of_the_union.txt` as a sample document for demonstrating text processing, document loading, and embedding. The file contains the transcript of a U.S. president's State of the Union address, an annual speech that outlines the condition of the country and the administration's legislative agenda.

**Embeddings**

Hugging Face embeddings are free to use, so we use them here instead of the OpenAI embeddings.

* [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)


# Start

In [165]:
loader = TextLoader("./state_of_the_union.txt")
loader

<langchain_community.document_loaders.text.TextLoader at 0x7e8b4a729e70>

In [166]:
documents = loader.load()
documents

[Document(metadata={'source': './state_of_the_union.txt'}, page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determ

In [167]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
text_splitter

<langchain_text_splitters.character.CharacterTextSplitter at 0x7e8b4a72a290>

In [168]:
docs = text_splitter.split_documents(documents)
docs[0]

Document(metadata={'source': './state_of_the_union.txt'}, page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determi

In [169]:
# embeddings = OpenAIEmbeddings()
# embeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


# [similarity_search using Approximate k-NN](https://python.langchain.com/v0.2/docs/integrations/vectorstores/opensearch/#similarity_search-using-approximate-k-nn)

In [135]:
OPENSEARCH_USERNAME = os.environ.get('OPENSEARCH_USERNAME')
OPENSEARCH_PASSWORD = os.environ.get('OPENSEARCH_PASSWORD')

In [136]:
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="https://localhost:9200",
    http_auth=(OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD),
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
)

In [137]:
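query = "What did the president say about Ketanji Brown Jackson"
# Note: this rebinds `docs` from the split chunks to the search results,
# so a later `from_documents(docs, ...)` call indexes only these results
# when the notebook is run top to bottom.
docs = docsearch.similarity_search(query, k=10)
docs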

[Document(metadata={'source': './state_of_the_union.txt'}, page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'),
 Document(metadata={'source': './state_of_the_union.txt'}, page_content='Madam Speaker, Madam Vice President, our First Lady and S

In [138]:
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


In [139]:

docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="https://localhost:9200",
    http_auth=(OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD),
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
    engine="faiss",
    space_type="innerproduct",
    ef_construction=256,
    m=48,
)

In [140]:
query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)

In [141]:
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


# [similarity_search using Script Scoring](https://python.langchain.com/v0.2/docs/integrations/vectorstores/opensearch/#similarity_search-using-script-scoring)

`similarity_search` using `Script Scoring` with Custom Parameters

In [142]:
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="https://localhost:9200",
    http_auth=(OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD),
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
    is_appx_search=False,
)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(
    query,
    k=1,
    search_type="script_scoring",
)

In [143]:
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


# [similarity_search using Painless Scripting](https://python.langchain.com/v0.2/docs/integrations/vectorstores/opensearch/#similarity_search-using-painless-scripting)

In [152]:
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="https://localhost:9200",
    http_auth=(OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD),
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
    is_appx_search=False,
)


In [153]:
# Keep only chunks whose text contains the term "smuggling".
# Named `pre_filter` to avoid shadowing the built-in `filter`.
pre_filter = {"bool": {"filter": {"term": {"text": "smuggling"}}}}
query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(
    query,
    search_type="painless_scripting",
    space_type="cosineSimilarity",
    pre_filter=pre_filter,
)

In [154]:
print(docs[0].page_content)

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. 

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. 

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.  

We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. 

We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.


# [Maximum marginal relevance search (MMR)](https://python.langchain.com/v0.2/docs/integrations/vectorstores/opensearch/#maximum-marginal-relevance-search-mmr)

If you want documents that are similar to the query but also want diverse results, consider maximal marginal relevance (MMR). MMR optimizes for similarity to the query and for diversity among the selected documents.
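
The trade-off can be sketched in pure Python as a greedy selection: each step picks the candidate with the best balance of relevance to the query and dissimilarity to what has already been chosen. The function names and toy vectors below are illustrative assumptions; this is a sketch of the general MMR idea, not a copy of LangChain's implementation.

```python
import math


def _cos(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, cand_vecs, k=2, lambda_param=0.5):
    """Greedy MMR: return indices of up to k candidates, balancing relevance and diversity."""
    relevance = [_cos(query_vec, v) for v in cand_vecs]
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            # Redundancy: highest similarity to any already-selected candidate
            redundancy = max((_cos(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0)
            score = lambda_param * relevance[i] - (1 - lambda_param) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Toy vectors (hypothetical): candidates 0 and 1 are near-duplicates, candidate 2 differs
query_vec = [1.0, 0.0, 0.0]
candidates = [[1.0, 0.5, 0.0], [1.0, 0.52, 0.0], [1.0, 0.0, 0.8]]
diverse = mmr_select(query_vec, candidates, k=2, lambda_param=0.5)
relevance_only = mmr_select(query_vec, candidates, k=2, lambda_param=1.0)
```

With `lambda_param=0.5` the second pick skips the near-duplicate in favor of the different candidate; with `lambda_param=1.0` the diversity term vanishes and the two most relevant candidates are returned.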

In [None]:
query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.max_marginal_relevance_search(query, k=2, fetch_k=10, lambda_param=0.5)

In [None]:
print(docs[0].page_content)

IndexError: list index out of range

In this run the MMR search returned an empty list, so indexing `docs[0]` raised the error above.

In [170]:
docsearch = OpenSearchVectorSearch(
    index_name="index-*",
    embedding_function=embeddings,
    opensearch_url="http://localhost:9200",
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
    is_appx_search=False,
)

In [171]:
# you can specify custom field names to match the fields you're using to store your embedding, document text value, and metadata
docs = docsearch.similarity_search(
    "Who was asking about getting lunch today?",
    search_type="script_scoring",
    space_type="cosinesimil",
    vector_field="message_embedding",
    text_field="message",
    metadata_field="message_metadata",
)

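ConnectionError: ConnectionError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))) caused by: ProtocolError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')))

One likely cause (an assumption, not verified here): this cell connects over `http://` without `http_auth`, while the earlier cells that succeeded use `https://` with credentials.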

# Upload Data into OpenSearch

In [172]:
from opensearchpy import OpenSearch

## Step 1: Connect to OpenSearch

In [175]:
username = os.environ.get("OPENSEARCH_USERNAME")
password = os.environ.get("OPENSEARCH_PASSWORD")

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=(username, password),  # Basic Auth
    use_ssl=True,  # Enable SSL if needed
    verify_certs=False  # Disable SSL verification if using self-signed certs
)



In [176]:
file_path = "temp/frappe_framework_v1.csv"  # This dataset is not included in the git repository; use your own CSV

## Step 2: Create or check if an index exists

In [177]:
index_name = "frappe_framework_v1"
if not client.indices.exists(index=index_name):
    client.indices.create(index=index_name)

## Step 3: Prepare data (example with a pandas DataFrame)

In [180]:
import pandas as pd
df = pd.read_csv(file_path)

In [182]:
# Convert DataFrame to a list of dictionaries (JSON-like)
documents = df.to_dict(orient="records")

## Step 4: Upload data (indexing each document)

In [233]:
for i, doc in enumerate(documents):
    # Index each row under its positional ID; re-running overwrites the same IDs
    response = client.index(index=index_name, id=i, body=doc)
print(f"Pushed {len(documents)} documents to OpenSearch")

Note: *Optional: Use the Bulk API for Large Data*

For larger datasets, you can use the bulk API to upload data in batches, which is more efficient than uploading one document at a time.
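
As a rough sketch of that approach, the helper below builds one bulk action per record in the shape that opensearch-py's `helpers.bulk` accepts. The function name `build_bulk_actions`, the sample rows, and the `chunk_size` value are assumptions for illustration; the actual upload call is left commented out because it needs a live cluster and the `client` created earlier.

```python
def build_bulk_actions(documents, index_name):
    """Yield one bulk action per document, using the row position as the document ID."""
    for i, doc in enumerate(documents):
        yield {"_index": index_name, "_id": i, "_source": doc}


# Hypothetical sample rows standing in for the CSV records
sample_documents = [
    {"user": "question 1", "assistant": "answer 1"},
    {"user": "question 2", "assistant": "answer 2"},
]
actions = list(build_bulk_actions(sample_documents, "frappe_framework_v1"))

# With a running cluster, send the actions in batches (not executed here):
# from opensearchpy import helpers
# success_count, errors = helpers.bulk(client, actions, chunk_size=500)
```

Using the row position as `_id` mirrors the per-document loop above, so re-running the upload overwrites existing documents instead of duplicating them.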

# Retrieve Information

In [196]:
index_name = 'frappe_framework_v1'

## Retrieving All Documents

In [205]:
query = {
    "query": {
        "match_all": {}  # Retrieves all documents (you can modify this for specific queries)
    }
}

In [203]:
response = client.search(index=index_name, body=query)
print(f"total documents {response['hits']['total']['value']}")

total documents 1359


## Example of a Specific Query

In [221]:
query = {
    "query": {
        "match": {
            "assistant": "frappe.call"  # Replace with the field and value you're searching for
        }
    }
}

In [223]:
response = client.search(index=index_name, body=query)

print(f"total matching value {response['hits']['total']['value']}")

total matching value 9


## Example with Pagination (Retrieve More Documents)

In [228]:
# Define the pagination parameters
size = 5  # Number of documents to retrieve per request
from_ = 0  # Starting point for pagination

query = {
    "query": {
        "match_all": {}  # This retrieves all documents
    },
    "size": size,
    "from": from_  # Start from document 0
}

# Perform the search with pagination
response = client.search(index=index_name, body=query)

print(f"total matching value {response['hits']['total']['value']} & size for a page {len(response['hits']['hits'])}")

total matching value 1359 & size for a page 5


**Move to next page**

In [231]:
size = 5
from_ = 0
total_documents = 1359  # Total hit count reported by the search above; adjust for your data

while from_ < total_documents:
    query = {
        "query": {
            "match_all": {}
        },
        "size": size,
        "from": from_
    }

    response = client.search(index=index_name, body=query)

    # Process and print the results
    for hit in response['hits']['hits']:
        pass
        # print(f"ID: {hit['_id']}, User: {hit['_source']['user']}, Assistant: {hit['_source']['assistant']}")

    from_ += size  # Move to the next batch
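
The same `from`/`size` stepping can be factored into a small generator, sketched here without a cluster. The name `page_queries` and the toy totals are assumptions for illustration.

```python
def page_queries(total_documents, size):
    """Yield one match_all query body per page, advancing `from` by `size`."""
    for offset in range(0, total_documents, size):
        yield {"query": {"match_all": {}}, "size": size, "from": offset}


# Toy numbers: 12 documents in pages of 5 give three requests
pages = list(page_queries(total_documents=12, size=5))
offsets = [page["from"] for page in pages]

# With a running cluster, each body would be passed to the client (not executed here):
# for body in page_queries(total_documents, size):
#     response = client.search(index=index_name, body=body)
```

Note that `from` plus `size` is limited by the index setting `index.max_result_window`, which defaults to 10,000, so deeper pagination needs the scroll API or `search_after`.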
