# Hybrid search
- What is Hybrid Search?
    - Hybrid search performs semantic (dense) and keyword (sparse) search over the data in a single query and combines the results for better relevance.
    - Here is the workflow:
        - Create dense vectors using an external embedding model.
        - Create sparse vectors using an external model.
        - Create an index that supports sparse-dense vectors (s1 or p1 with the dotproduct metric).
        - Upsert dense and sparse vectors to your index.
        - Search the index using sparse-dense vectors.
        - Pinecone returns sparse-dense vectors.
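
The search step above relies on weighting the dense and sparse parts of the query against each other. Below is a minimal sketch of that idea, assuming the convex weighting described in Pinecone's hybrid search docs: the dense vector is scaled by `alpha` and the sparse values by `1 - alpha`. The function name `hybrid_convex_scale` and the sample vectors are illustrative and not part of this notebook.

```python
def hybrid_convex_scale(dense, sparse, alpha):
    """Scale a dense vector by alpha and a sparse vector by (1 - alpha).

    alpha = 1.0 means pure semantic (dense) search,
    alpha = 0.0 means pure keyword (sparse) search.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


# Made-up query vectors, for illustration only
dense_query = [0.5, 0.25, 1.0]
sparse_query = {"indices": [10, 45], "values": [2.0, 4.0]}

scaled_dense, scaled_sparse = hybrid_convex_scale(dense_query, sparse_query, alpha=0.75)
print(scaled_dense)   # [0.375, 0.1875, 0.75]
print(scaled_sparse)  # {'indices': [10, 45], 'values': [0.5, 1.0]}
```

With the dotproduct metric, scaling both parts this way shifts the balance of the final score toward the semantic or the keyword side.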
        
### Read further
- https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/pinecone_hybrid_search.html
- https://docs.pinecone.io/docs/hybrid-search

## Prepare environment

In [None]:
%pip install python-dotenv

In [23]:
import os
from pathlib import Path
path_dir_script = Path(os.getcwd())
path_dir_root = path_dir_script.parent

In [None]:
from dotenv import load_dotenv, find_dotenv
path_file_dotenv = Path(path_dir_root, '.env')
load_dotenv(path_file_dotenv)

True

## Load data

In [None]:
%pip install GitPython

In [None]:
from langchain.document_loaders import GitLoader # https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/git.html
import os

In [None]:
loader = GitLoader(
    repo_path=os.environ['PATH_CODE_REPO_1'],
    branch="master", 
    file_filter=lambda file_path: file_path.endswith(".swift") or file_path.endswith(".h") or file_path.endswith(".m")
)

In [None]:
data = loader.load()

In [None]:
len(data)

## Split text

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

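def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    # Split documents into chunks of at most chunk_size characters,
    # with chunk_overlap characters shared between neighbouring chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    texts = text_splitter.split_documents(documents)
    return texts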

texts = split_docs(data)
print(len(texts))
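
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch using a fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut on separators such as newlines); `sliding_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not just a repeat of the overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a statement that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.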

## Create index

In [None]:
import pinecone

In [None]:
pinecone.init(
    api_key=os.getenv('PINECODE_API_KEY'),
    environment=os.getenv('PINECODE_ENVIRONMENT')
)

In [None]:
# Check existing indexes
print(pinecone.list_indexes())

In [None]:
index_name = "ai-repo-reader-1"

In [None]:
# Clean up previous index before creating new one (skip if it does not exist yet)
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

In [None]:
# Note: make sure the "context" field that holds the document text in the metadata is not indexed.
# Currently you must explicitly list the fields you do want to index; see Pinecone's docs for details.
pinecone.create_index(
    name = index_name, 
    dimension = 1536, # dimensionality of dense model
    metric = "dotproduct", # sparse values are supported only for dotproduct
    pod_type = "s1",
    metadata_config={"indexed": []}
)

In [None]:
index = pinecone.Index(index_name)

## Get embeddings

In [None]:
import os
from langchain.embeddings import OpenAIEmbeddings

In [None]:
embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])

In [None]:
%pip install pinecone_text

In [None]:
from pinecone_text.sparse import BM25Encoder
# or from pinecone_text.sparse import SpladeEncoder if you wish to work with SPLADE
# use default tf-idf values
bm25_encoder = BM25Encoder().default()

In [None]:
corpus = [t.page_content for t in texts]

# fit tf-idf values on your corpus
bm25_encoder.fit(corpus)

# store the values to a json file
bm25_encoder.dump("bm25_values.json")

# load to your BM25Encoder object
bm25_encoder = BM25Encoder().load("bm25_values.json")
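
For orientation, a sparse vector is a pair of parallel lists, `indices` and `values`, which is the shape that gets upserted to Pinecone next to the dense vector. The sketch below builds such a vector with a toy term-frequency encoding. It is not BM25 and does not reproduce the ids that `BM25Encoder` emits; `toy_sparse_encode` and the CRC32-based token ids are assumptions made for illustration.

```python
import zlib
from collections import Counter


def toy_sparse_encode(text):
    """Map each token to a hashed integer id and use its count as the value."""
    counts = Counter(text.lower().split())
    # CRC32 gives a deterministic id per token; distinct tokens can collide in theory
    pairs = sorted(
        (zlib.crc32(token.encode("utf-8")), float(count))
        for token, count in counts.items()
    )
    return {
        "indices": [idx for idx, _ in pairs],
        "values": [val for _, val in pairs],
    }


sparse_vec = toy_sparse_encode("frequency tracker frequency")
print(sparse_vec)
```

Only tokens that occur in the text get an entry, which is what makes the vector sparse: the token that appears twice carries the value 2.0 and the other one 1.0.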

## Load Retriever

In [None]:
from langchain.retrievers import PineconeHybridSearchRetriever
retriever = PineconeHybridSearchRetriever(embeddings=embeddings, sparse_encoder=bm25_encoder, index=index)

In [None]:
retriever.add_texts(corpus)

## Use Retriever

In [None]:
result = retriever.get_relevant_documents("Where is frequency measured?")

In [None]:
print(result[0].page_content)

In [None]:
result = retriever.get_relevant_documents("Where is AudioKit instance used?")

In [None]:
print(result[0].page_content)

In [None]:
print(result[1].page_content)