# Elasticsearch

> [Elasticsearch](https://www.elastic.co/elasticsearch/) is a distributed, RESTful search and analytics engine. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and `schema-free` JSON documents.

This notebook shows how to use functionality related to the `Elasticsearch` database.

# ElasticVectorSearch class

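## Installation

See the [Elasticsearch installation instructions](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html).

To connect to an Elasticsearch instance that does not require login credentials, pass the Elasticsearch URL, the index name, and the embedding object to the constructor.

Example: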
```python
from langchain import ElasticVectorSearch
from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings()
elastic_vector_search = ElasticVectorSearch(
    elasticsearch_url="http://localhost:9200",
    index_name="test_index",
    embedding=embedding
)
```

To connect to an Elasticsearch instance that requires login credentials, such as `Elastic Cloud`, use the Elasticsearch URL format `https://username:password@es_host:9243`.

Example:
```python
from langchain import ElasticVectorSearch
from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

elastic_host = "cluster_id.region_id.gcp.cloud.es.io"
elasticsearch_url = f"https://username:password@{elastic_host}:9243"
elastic_vector_search = ElasticVectorSearch(
    elasticsearch_url=elasticsearch_url,
    index_name="test_index",
    embedding=embedding
)
```
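
Because the credentials are embedded in the URL, a username or password containing reserved characters such as `@`, `:` or `/` has to be percent-encoded first. The following is a minimal sketch of one way to do that with the standard library; `build_es_url` and the sample credentials are hypothetical names used only for illustration and are not part of LangChain.

```python
from urllib.parse import quote


def build_es_url(username: str, password: str, host: str, port: int = 9243) -> str:
    """Build an https URL with percent-encoded basic-auth credentials."""
    # safe="" forces reserved characters such as '@', ':' and '/' to be encoded.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"https://{user}:{secret}@{host}:{port}"


# Hypothetical credentials; the host is the placeholder used in the example above.
url = build_es_url("elastic", "p@ss:word/1", "cluster_id.region_id.gcp.cloud.es.io")
print(url)
```

The resulting string can then be passed as `elasticsearch_url`.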

## Example

In [2]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch
from langchain.document_loaders import TextLoader

In [5]:
loader = TextLoader('/data/项目/01-产业链图谱-节点识别/knowledge/氢能源/氢能源知识文档.md', encoding='utf-8')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

Created a chunk of size 167, which is longer than the specified 50
Created a chunk of size 281, which is longer than the specified 50
Created a chunk of size 202, which is longer than the specified 50
Created a chunk of size 219, which is longer than the specified 50
Created a chunk of size 213, which is longer than the specified 50
Created a chunk of size 619, which is longer than the specified 50
Created a chunk of size 195, which is longer than the specified 50
Created a chunk of size 815, which is longer than the specified 50
Created a chunk of size 126, which is longer than the specified 50
Created a chunk of size 868, which is longer than the specified 50
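
These warnings appear because `CharacterTextSplitter` splits on a separator (a blank line by default) and only merges the resulting pieces; it does not cut inside a piece, so any paragraph longer than `chunk_size` is kept whole. The sketch below is a simplified stand-in for that behaviour written for illustration, not LangChain's implementation; `split_on_separator` is a hypothetical name.

```python
import warnings
from typing import List


def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for piece in (p for p in text.split(separator) if p):
        # Cost of appending this piece to the current chunk, including a separator.
        extra = len(piece) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The piece does not fit, so close the current chunk and start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(piece)
        if len(piece) > chunk_size:
            # A single piece is never cut by this strategy, so it is kept whole.
            warnings.warn(
                f"Created a chunk of size {len(piece)}, "
                f"which is longer than the specified {chunk_size}"
            )
        current.append(piece)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "short one\n\nshort two\n\n" + "x" * 80
for chunk in split_on_separator(sample, chunk_size=50):
    print(len(chunk))
```

If chunks really must stay under the limit, a splitter that falls back to finer separators (or a larger `chunk_size`) is the usual remedy.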


In [7]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
path = "/data/models/ernie-3.0-nano-zh"
embeddings = HuggingFaceEmbeddings(model_name=path)

No sentence-transformers model found with name /data/models/ernie-3.0-nano-zh. Creating a new one with MEAN pooling.
Some weights of ErnieModel were not initialized from the model checkpoint at /data/models/ernie-3.0-nano-zh and are newly initialized: ['ernie.pooler.dense.weight', 'ernie.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
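
The first message says that the checkpoint is a plain transformer rather than a sentence-transformers model, so a MEAN pooling layer is added to turn per-token vectors into one sentence vector. As a rough illustration of what mean pooling computes, here is a sketch in plain Python; `mean_pool` and the toy vectors are hypothetical, and real implementations work on tensors. The pooler weights named in the second warning are presumably not used on this path.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding positions."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # the last row stands for padding
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```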


In [19]:
len(docs)

10

In [27]:
db = ElasticVectorSearch(elasticsearch_url="http://localhost:9200", index_name='my_docs', embedding=embeddings)

In [28]:
# Fails: `from_documents` is a classmethod, so the URL stored on `db` is not used.
db.from_documents(docs, embedding=embeddings)

ValueError: Did not find elasticsearch_url, please add an environment variable `ELASTICSEARCH_URL` which contains it, or pass  `elasticsearch_url` as a named parameter.

In [29]:
db = ElasticVectorSearch.from_documents(
    docs, embeddings, elasticsearch_url="http://localhost:9200",index_name='my_docs')
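
The error in cell 28 and the working call in cell 29 follow from `from_documents` being a classmethod: it receives the class rather than the instance, so connection settings have to be passed again or come from the environment. The sketch below reproduces the pattern with a hypothetical `TinyStore` class and a hypothetical `TINYSTORE_URL` variable; it is not LangChain's code.

```python
import os


class TinyStore:
    """Hypothetical stand-in for a vector store; not LangChain's class."""

    def __init__(self, url: str, index_name: str):
        self.url = url
        self.index_name = index_name
        self.docs = []

    @classmethod
    def from_documents(cls, docs, **kwargs):
        # A classmethod gets the class, not the instance, so it cannot read
        # self.url; the URL must come from kwargs or the environment.
        url = kwargs.get("url") or os.environ.get("TINYSTORE_URL")
        if url is None:
            raise ValueError("Did not find url; pass `url` as a named parameter.")
        store = cls(url, kwargs.get("index_name", "default"))
        store.docs = list(docs)
        return store


db = TinyStore("http://localhost:9200", "my_docs")
try:
    db.from_documents(["a", "b"])  # the URL held by `db` is ignored
except ValueError as err:
    print(f"failed: {err}")

# Calling on the class with explicit settings works and returns a new store.
db = TinyStore.from_documents(["a", "b"], url="http://localhost:9200", index_name="my_docs")
print(db.index_name)
```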

In [30]:
db.index_name

'my_docs'

In [13]:
query = "上游"  # "upstream"
docs = db.similarity_search(query, k=10)

In [16]:
for doc in docs:
    print(doc.page_content)

<split>

> 上游产业链划分：能源端
<split>

> 下游产业链划分：应用端
**氢能源总体产业链划分**：
<split>

> 国际产业链所属发展阶段

### 1.3 氢能源技术原理
<split>

### 1.2 产业链发展

> 我国产业链所属发展阶段
<split>

> 中游产业链划分：中游氢燃料电池关键零部件端
<split>

## 2. 氢能源产业链划分

> 产业链总体划分
**氢能源上游产业链划分**：制氢、存储、运输、加氢。
# 氢能源知识文档

## 1. 氢能源简介

### 1.1 制氢能力
<split>

## 3. 其他

## 4. 参考文件

1. 

<split>


# ElasticKnnSearch class
The `ElasticKnnSearch` class stores vectors and documents in Elasticsearch for use with approximate [kNN search](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html).
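
For intuition, kNN search returns the `k` stored vectors closest to a query vector. The sketch below does this exactly by scoring every vector, whereas Elasticsearch's approximate kNN uses an HNSW graph to avoid the full scan. The function names, the toy index, and the choice of cosine similarity are assumptions made for illustration; the actual metric is set in the index mapping.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_knn(query: Sequence[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for doc_id, score in exact_knn([1.0, 0.1], index, k=2):
    print(doc_id, round(score, 3))
```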

In [None]:
!pip install langchain elasticsearch

In [None]:
from langchain.vectorstores.elastic_vector_search import ElasticKnnSearch
from langchain.embeddings import ElasticsearchEmbeddings
from elasticsearch import Elasticsearch

In [None]:
# Initialize ElasticsearchEmbeddings
model_id = "<model_id_from_es>"
dims = dim_count  # placeholder: set to your model's embedding dimension (an integer)
es_cloud_id = "ESS_CLOUD_ID"
es_user = "es_user"
es_password = "es_pass"
test_index = "<index_name>"
#input_field = "your_input_field" # if different from 'text_field'

In [None]:
# Generate embedding object
embeddings = ElasticsearchEmbeddings.from_credentials(
    model_id,
    #input_field=input_field,
    es_cloud_id=es_cloud_id,
    es_user=es_user,
    es_password=es_password,
)

In [None]:
# Initialize ElasticKnnSearch
knn_search = ElasticKnnSearch(
    es_cloud_id=es_cloud_id,
    es_user=es_user,
    es_password=es_password,
    index_name=test_index,
    embedding=embeddings
)

## Test adding vectors

In [None]:
# Test `add_texts` method
texts = ["Hello, world!", "Machine learning is fun.", "I love Python."]
knn_search.add_texts(texts)

# Test `from_texts` method
new_texts = ["This is a new text.", "Elasticsearch is powerful.", "Python is great for data analysis."]
knn_search.from_texts(new_texts, dims=dims)

## Test kNN search using query vector builder

In [None]:
# Test `knn_search` method with model_id and query_text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2)
print(f"kNN search results for query '{query}': {knn_result}")
print(f"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'")

# Test `knn_hybrid_search` method
query = "Hello"
hybrid_result = knn_search.knn_hybrid_search(query=query, model_id=model_id, k=2)
print(f"Hybrid search results for query '{query}': {hybrid_result}")
print(f"The 'text' field value from the top hit is: '{hybrid_result['hits']['hits'][0]['_source']['text']}'")
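
With a query vector builder, the query text is embedded inside Elasticsearch by the deployed model, so the client sends text and a `model_id` instead of a vector. The wrapper's internals are not shown in this notebook; the sketch below only illustrates the kind of `_search` request body that Elasticsearch accepts for such a query. `build_knn_body`, the field names `vector` and `text`, and the `num_candidates` default are assumptions.

```python
import json
from typing import Any, Dict


def build_knn_body(
    query_text: str,
    model_id: str,
    k: int = 10,
    num_candidates: int = 50,
    vector_field: str = "vector",  # assumed name of the dense_vector field
    text_field: str = "text",      # assumed name of the text field
    hybrid: bool = False,
) -> Dict[str, Any]:
    """Build a search body that lets Elasticsearch embed the query text itself."""
    body: Dict[str, Any] = {
        "knn": {
            "field": vector_field,
            "k": k,
            "num_candidates": num_candidates,
            "query_vector_builder": {
                "text_embedding": {"model_id": model_id, "model_text": query_text}
            },
        }
    }
    if hybrid:
        # A hybrid request adds a lexical match query next to the kNN clause.
        body["query"] = {"match": {text_field: {"query": query_text}}}
    return body


print(json.dumps(build_knn_body("Hello", "<model_id_from_es>", k=2, hybrid=True), indent=2))
```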

## Test kNN search using a pre-generated vector

In [None]:
# Generate embedding for tests
query_text = 'Hello'
query_embedding = embeddings.embed_query(query_text)
print(f"Length of embedding: {len(query_embedding)}\nFirst two items in embedding: {query_embedding[:2]}")

# Test kNN search
knn_result = knn_search.knn_search(query_vector=query_embedding, k=2)
print(f"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'")

# Test hybrid search - requires both query_text and query_vector
hybrid_result = knn_search.knn_hybrid_search(query_vector=query_embedding, query=query_text, k=2)
print(f"The 'text' field value from the top hit is: '{hybrid_result['hits']['hits'][0]['_source']['text']}'")

## Test source option

In [None]:
# Test `knn_search` method with model_id and query_text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2, source=False)
assert '_source' not in knn_result['hits']['hits'][0].keys()

# Test `knn_hybrid_search` method
query = "Hello"
hybrid_result = knn_search.knn_hybrid_search(query=query, model_id=model_id, k=2, source=False)
assert '_source' not in hybrid_result['hits']['hits'][0].keys()

## Test fields option 

In [None]:
# Test `knn_search` method with model_id and query_text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2, fields=['text'])
assert 'text' in knn_result['hits']['hits'][0]['fields'].keys()

# Test `knn_hybrid_search` method
query = "Hello"
hybrid_result = knn_search.knn_hybrid_search(query=query, model_id=model_id, k=2, fields=['text'])
assert 'text' in hybrid_result['hits']['hits'][0]['fields'].keys()
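
The `source` and `fields` options change where a hit carries its text: under `_source` as a plain value, or under `fields` as a list (Elasticsearch returns every `fields` value as an array). Indexing `['hits']['hits'][0]` also raises `IndexError` when nothing matches. The helpers below are a small sketch for reading a hit defensively; `hit_text`, `top_hit_text`, and the sample responses are hypothetical.

```python
from typing import Any, Dict, Optional


def hit_text(hit: Dict[str, Any], field: str = "text") -> Optional[str]:
    """Read a field from a hit, preferring `_source` and falling back to `fields`."""
    source = hit.get("_source")
    if isinstance(source, dict) and field in source:
        return source[field]
    # Values under `fields` are lists, even for single-valued fields.
    values = hit.get("fields", {}).get(field)
    return values[0] if values else None


def top_hit_text(response: Dict[str, Any], field: str = "text") -> Optional[str]:
    """Return the field from the first hit, or None when there are no hits."""
    hits = response.get("hits", {}).get("hits", [])
    return hit_text(hits[0], field) if hits else None


# Hand-written sample responses, shaped like the ones used in the cells above.
with_source = {"hits": {"hits": [{"_source": {"text": "Hello, world!"}}]}}
with_fields = {"hits": {"hits": [{"fields": {"text": ["Hello, world!"]}}]}}
empty = {"hits": {"hits": []}}

print(top_hit_text(with_source), top_hit_text(with_fields), top_hit_text(empty))
```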

## Test with an Elasticsearch client connection instead of cloud_id

In [None]:
# Create Elasticsearch connection
es_connection = Elasticsearch(
    hosts=['https://es_cluster_url:port'], 
    basic_auth=('user', 'password')
)

In [None]:
# Instantiate ElasticsearchEmbeddings using es_connection
embeddings = ElasticsearchEmbeddings.from_es_connection(
    model_id,
    es_connection,
)

In [None]:
# Initialize ElasticKnnSearch
knn_search = ElasticKnnSearch(
    es_connection=es_connection,
    index_name=test_index,
    embedding=embeddings
)

In [None]:
# Test `knn_search` method with model_id and query_text
query = "Hello"
knn_result = knn_search.knn_search(query=query, model_id=model_id, k=2)
print(f"kNN search results for query '{query}': {knn_result}")
print(f"The 'text' field value from the top hit is: '{knn_result['hits']['hits'][0]['_source']['text']}'")
