# RAG Chat Bot for Hybrid Search

This is the accompanying notebook for the Oct 19 (2023) RAG for Hybrid Search meetup.

---

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/hybrid-search/rag-for-hyrbid/rag-for-hybrid-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/hybrid-search/rag-for-hyrbid/rag-for-hybrid-search.ipynb)

Quick notes:
- You will need an OpenAI API Key
- You will need a Pinecone account
- If running locally, this notebook is meant to run on a Macbook Pro (m2)
- Cells that preview data are commented out, so that users can more easily navigate the notebook on Github. Run the notebook locally with these cells un-commented to see data previews.

---



Make sure you have [docker](https://docs.docker.com/engine/install/) and [homebrew](https://brew.sh/) installed for the above steps. 

---

[Read more about how unstructured partitions PDFs & why the fact that our PDFs have columns determined our use of the `ocr_only` strategy. 
](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-pdf)

[Run the docker container](https://unstructured-io.github.io/unstructured/api.html#using-docker-images) from `Unstructured.`


In [2]:
# !docker pull quay.io/unstructured-io/unstructured-api:latest
# !docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0

In [23]:
# Brew install packages (poppler==23.10.0 and tesseract==5.3.3)

!brew install \
  poppler \  
  tesseract

In [25]:
# Pip install libraries

!pip3 install -qU \
  pinecone-client==2.2.4 \
  pinecone-text==0.5.4 \
  unstructured==0.10.24 \
  sentence-transformers==2.2.2 \
  langchain==0.0.327

Imports

In [3]:
import pinecone
import os
import re
from uuid import uuid4
from typing import IO, Any, Dict, List, Tuple
from copy import deepcopy

from unstructured.partition.pdf import partition_pdf
from unstructured.documents import elements
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from sentence_transformers import SentenceTransformer
import pinecone
import openai
from pinecone.core.client.model.query_response import QueryResponse

from pinecone_text.sparse import BM25Encoder

  from tqdm.autonotebook import tqdm


Set up the environment variables we'll need. We recommend using `dotenv`. It's a super simple way to keep your variables safe, but accessible. Simply create a `.env` file with your secrets in it, and use the Python `dotenv` and `os` libraries to load them.

In [4]:
%load_ext dotenv
from dotenv import load_dotenv
import os

In [5]:
# Make sure dotenv is in our kernel environment & working

load_dotenv()

True

In [6]:
pinecone_api_key = os.getenv('PINECONE_API_KEY')  # You can get your Pinecone api key and env (e.g. "us-east-1") at app.pinecone.io
pinecone_env = os.getenv('PINECONE_ENV')
openai_api_key = os.getenv('OPENAI_API_KEY')


In [7]:
# Let's make sure our dotenv secrets loaded correctly

assert len(pinecone_api_key) > 0
assert len(pinecone_env) > 0
assert len(openai_api_key) > 0

# Download some articles we're interested in learning more about. 

Remember, hybrid search is best for knowledge that contains a lot of unique keywords that you'd like to search for, along with concepts you'd like clarity on, etc. Data that works best for this type of thing include medical data, most types of research data, data with lots of entities in it, etc.

We'll be using Arxiv.org articles about different vector search algorithms for this demo. They've got lots of jargon and concepts that'll work great for hybrid search!

In [8]:
freshdisk = os.path.join("/Users/audrey.lorberfeld/Downloads/freshdiskann_paper.pdf")
hnsw = os.path.join("/Users/audrey.lorberfeld/Downloads/hnsw_paper.pdf")
ivfpq = os.path.join("/Users/audrey.lorberfeld/Downloads/ivfpq_paper.pdf")

# Partitioning & Cleaning our PDFs

This step is optional. Partitioning simply uses ML to break a document up into pages, paragraphs, the title, etc. It's a nice-to-have that allows you to exclude certain elements you might not want to index, such as an article's bibliography (although we'll keep that since it could be useful information). 

If you want to skip this step, you can just read the PDFs into text or json, etc. and make your chunks straight from that object(s). 

Note: this notebook assumes you have partitioned your PDF. If you want to run this notebook from start to finish as-is, you'll need to run this step.

In [9]:
# Let's partition all of our PDFs and store their partitions in a dictionary for easy retrieval & inspection later

# Note: This takes ~2 mins to run

partitioned_files = {
    "freshdisk": partition_pdf(freshdisk, url=None, strategy = 'ocr_only'),
    "hnsw": partition_pdf(hnsw, url=None, strategy = 'ocr_only'),
    "ivfpq": partition_pdf(ivfpq, url=None, strategy = 'ocr_only'),            
}


In [10]:
# Let's make an archived copy of partitioned_files dict so if we mess it up while cleaning, we don't have to re-ocr our PDFs:

partitioned_files_copy = deepcopy(partitioned_files)

In [12]:
# partitioned_files.get('freshdisk')

You can see in the preview above that each of our PDFs now has elements classifying different parts of the text, such as `Text`, `Title`, and `EmailAddress`.

Data cleaning matters a lot when it comes to hybrid search, because for the keyword-search part we care about each individual token (word).

Let's filter out all of the email addresses to start with, since we don't need those for any reason.

In [13]:
def remove_unwanted_categories(elements: Dict[str, List[elements]], unwanted_cat: str) -> None: 
    """
    Remove partitions containing an unwanted category.
    
    :parameter elements: Partitioned pieces of our documents.
    :parameter unwanted_cat: The name of the category we'd like filtered out.
    """
    for key, value in elements.items():
        elements[key] = [i for i in value if not i.category == unwanted_cat]
        

In [14]:
# Remove unwanted EmailAddress category from dictionary of partitioned PDFs

remove_unwanted_categories(partitioned_files, 'EmailAddress')

No more `EmailAddress` elements!:

In [16]:
# partitioned_files.get('freshdisk')

To actually see what our elements are, we can call the `.text` attribute of each object:

In [18]:
# Text preview of what's actually in one of our dictionary items:

# [i.text for i in partitioned_files.get('freshdisk')]

![Screenshot 2023-10-18 at 12.51.55 PM.png](attachment:4d164df8-187c-4d6a-879a-09c712c1e31e.png)

![Screenshot 2023-10-18 at 12.32.39 PM.png](attachment:0aa6e277-09cf-4ac5-81c4-755c6fb9657e.png)

You can see there are weird things like blank spaces, single letters, etc. as their own partitions. We don't want these either, so let's get rid of them. 

You can also see where some page breaks were that spanned single words -- these are identifiable by a word ending with a `- `. For these, we want to get rid of the `- ` and squish the word back together, so it makes sense.

(You can also see that not all of the email addresses were caught by Unstructured's ML. It's too cumbersome to go through each doc and weed those out by hand, so we'll just have to leave them for now)

In [19]:
# Remove empty spaces & single-letter/-digit partitions:

def remove_space_and_single_partitions(elements: Dict[str, List[elements]]) -> None: 
    """
    Remove empty partitions & partitions with lengths of 1.
    
    :parameter elements: Partitioned pieces of our documents.
    """
    for key, value in elements.items():
        elements[key] = [i for i in value if len(i.text.strip()) > 1 ]

In [20]:
remove_space_and_single_partitions(partitioned_files)

No more single-character partitions or partitions with only whitespace, perfect!

In [22]:
# [i.text for i in partitioned_files.get('freshdisk')]

Let's now get rid of those strange words that have been split across page breaks (e.g. `funda- mental`):

In [23]:
# Note: this function transforms our elemenets into their text representations

def rejoin_split_words(elements: Dict[str, List[elements]]) -> None: 
    """
    Rejoing words that are split over pagebreaks.
    
    :parameter elements: Partitioned pieces of our documents.
    """
    for key, value in elements.items():
        elements[key] = [i.text.replace('- ', '') for i in value if '- ' in i.text]



In [24]:
rejoin_split_words(partitioned_files)

In [29]:
# partitioned_files.get('freshdisk')

You can see now that we've sewn those split words back together:

![Screenshot 2023-10-18 at 1.57.35 PM.png](attachment:a3bd85fa-d8cb-4c64-996b-009ccc52e11b.png)

![Screenshot 2023-10-18 at 1.57.22 PM.png](attachment:b7683d0c-c77c-4c1e-a82c-7d540b32a66b.png)

The last cleaning step we'll want to take is removing the inline citations, e.g. `[6, 9, 11, 16, 32, 35, 38, 43, 59]` and `[12]`.

In [26]:
def remove_inline_citation_numbers(elements: Dict[str, List[elements]]) -> None: 
    """
    Remove inline citation numbers from partitions.
    
    :parameter elements: Partitioned pieces of our documents.
    """
    for key, value in elements.items():
        pattern = re.compile(r'\[\s*(\d+\s*,\s*)*\d+\s*\]')
        elements[key] = [pattern.sub('', i) for i in value]



In [27]:
remove_inline_citation_numbers(partitioned_files)

We've still got some weird numbers in there, but it's pretty good!

In [30]:
# partitioned_files.get('freshdisk')

Now that we've cleaned our data, we can zip all the partitions (per PDF) back together so we're starting our chunking from a single, coherent text object.

In [31]:
# Sew our partitions back together, per PDF:

def stitch_partitions_back_together(elements: Dict[str, List[elements]]) -> None: 
    """
    Stitch partitions back into single string object.
    
    :parameter elements:  Partitioned pieces of our documents.
    """
    for key, value in elements.items():
        elements[key] = ' '.join(value)

In [32]:
stitch_partitions_back_together(partitioned_files)

Good to go! All of our PDFs are now cleaned and single globs of text data

In [34]:
# partitioned_files

In [35]:
# Let's save our cleaned files to a new variable that makes more sense w/the current state 

cleaned_files = partitioned_files

# Chunking our PDF content

Chunking is integral to achieving great relevance with vector search, whether that's sparse vector search, dense vector search, or hybrid vector search.

From our [chunking strategy post](https://www.pinecone.io/learn/chunking-strategies/):

> The main reason for chunking is to ensure we’re embedding a piece of content with as little noise as possible that is still semantically relevant . . . For example, in semantic search, we index a corpus of documents, with each document containing valuable information on a specific topic. By applying an effective chunking strategy, we can ensure our search results accurately capture the essence of the user’s query. If our chunks are too small or too large, it may lead to imprecise search results or missed opportunities to surface relevant content. As a rule of thumb, if the chunk of text makes sense without the surrounding context to a human, it will make sense to the language model as well. Therefore, finding the optimal chunk size for the documents in the corpus is crucial to ensuring that the search results are accurate and relevant.

We need to chunk our PDFs' (text) data into sizable chunks that are semantically coherent and dense with contextual information. 

We'll use LangChain's `RecusiveCharacterTextSplitter` since it's a super easy utility that makes chunking quick and customizable. You should experiment with different chunk sizes and overlap values to see how the resulting chunks differ. You want each chunk to make a reasonable amount of sense as a stand-alone data object. After some experimentation on our end, we will choose a `chunk_size` of `512` and a `chunk_overlap` of `35` (characters).

In [37]:
def generate_chunks(doc: str, chunk_size: int = 512, chunk_overlap: int = 35) -> List[Document]:
    """
    Generate chunks of a certain size and token overlap. 

    :param doc: Document we want to turn into chunks.
    :param chunk_size: Desired size of our chunks, in tokens (words).
    :param chunk_overlap: Desired # of tokens (words) that will overlap across chunks.

    :return: Chunks representations of the given document.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size = chunk_size,
        chunk_overlap = chunk_overlap
    )

    return splitter.create_documents([doc])



In [38]:
def chunk_documents(docs: Dict[str, List[elements]],  chunk_size: int = 512, chunk_overlap: int = 35) -> None: 
    """
    Iterate over documents and chunk each one.
    
    :parameter docs: The documents we want to chunk.
    :param chunk_size: Desired size of our chunks, in tokens (words).
    :param chunk_overlap: Desired # of tokens (words) that will overlap across chunks.
    """
    for key, value in docs.items():
        chunks = generate_chunks(value)
        docs[key] = [c.page_content for c in chunks]  # Grab the text representation of the chunks via the `page_content` attribute


In [39]:
chunk_documents(cleaned_files)

In [40]:
chunked_files = cleaned_files

Check out our chunks!

In [43]:
# chunked_files

# Create Dense Embeddings of our Chunks

Hybrid search needs both dense embeddings and sparse embeddings of the same content in order to work. Let's start with dense embeddings.

We'll use the `'all-MiniLM-L12-v2'` [model](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) hosted by HuggingFace to create our dense embeddings. It's currently high on their [MTEB (Massive Text Embedding Benchmark) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) (Reranking section), so it's a pretty safe bet. This will output dense vectors of 384 dimensions.

Note: if you're playing around with this notebook, make sure to save your chunks and embeddings (both sparse and dense) in `pkl` [files](https://stackoverflow.com/questions/11218477/how-can-i-use-pickle-to-save-a-dict-or-any-other-python-object), so that you don't have to wait for the embeddings to generate again if you want to rerun any steps in this notebook.

We'll have to create a dense embedding of each of our PDFs' chunks:

In [44]:
def produce_embeddings(chunks: List[str]) -> List[str]:
    """
    Produce dense embeddings for each chunk.

    :param chunks: The chunks we want to create dense embeddings of.

    :return: Dense embeddings produced by our SentenceTransformer model `all-MiniLM-L12-v2`.
    """
    model = SentenceTransformer('all-MiniLM-L12-v2')
    embeddings = []
    for c in chunks:
        embedding = model.encode(c)
        embeddings.append(embedding)
    return embeddings
    

In [45]:
freshdisk_dembeddings = produce_embeddings(chunked_files.get('freshdisk'))

In [46]:
hnsw_dembeddings = produce_embeddings(chunked_files.get('hnsw'))

In [47]:
ivfpq_dembeddings = produce_embeddings(chunked_files.get('ivfpq'))

In [48]:
# We can confirm the shape of each our dense embeddings is 384:

# Make binary lists to keep track of any shapes that are *not* 384
freshdisk_assertion = [0 for i in freshdisk_dembeddings if i.shape == 384]
hnsw_assertion = [0 for i in hnsw_dembeddings if i.shape == 384]
ivfpq_assertion = [0 for i in ivfpq_dembeddings if i.shape == 384]

# Sum up our lists. If there are any embeddings that are not of shape 384, these sums will be > 0
assert sum(freshdisk_assertion) == 0
assert sum(hnsw_assertion) == 0
assert sum(ivfpq_assertion) == 0

# Create Sparse Embeddings of our Chunks

Now we can create our sparse embeddings. We will use the BM25 algorithm to create our sparse embeddings. The resulting vector will represent an inverted index of the tokens in our chunks, constrained by things like chunk length. 

Pinecone has an awesome [text library](https://github.com/pinecone-io/pinecone-text) that makes generating these vectors super easy. We also have [a great notebook](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/sparse/bm25/bm25-vector-generation.ipynb) all about BM25 encodings.

Since we're using a ML-implemented version of BM25, we need to "fit" the model to our corpus. To do this, we'll combine all 3 of our PDFs together, so that the BM25 model can compute all the token frequencies etc correctly. We'll then encode each of our documents with our "fitted" model.

In [49]:
# Join the content of all our PDFs together into 1 large corpus

corpus = ""

for i, v in chunked_files.items():
    corpus += ' '.join(v)

In [51]:
len(corpus)  # Awesome, we've got lots o' tokens here for our BM25 model to learn :)

127590

In [52]:
# Initialize BM25 and fit the corpus

bm25 = BM25Encoder()
bm25.fit(corpus)

100%|████████████████████████████████| 127590/127590 [00:02<00:00, 58007.82it/s]


<pinecone_text.sparse.bm25_encoder.BM25Encoder at 0x2b7110e10>

In [53]:
# Create embeddings for each chunk
freshdisk_sembeddings = [bm25.encode_documents(i) for i in chunked_files.get('freshdisk')]

In [54]:
hnsw_sembeddings = [bm25.encode_documents(i) for i in chunked_files.get('hnsw')]

In [55]:
ivfpq_sembeddings = [bm25.encode_documents(i) for i in chunked_files.get('ivfpq')]

Let's look at the sparse embeddings for one of our PDFs.

You'll see that each PDF's chunks has now transformed into a dictionary with `indices` and `values` keys. 

In [57]:
# freshdisk_sembeddings

In [58]:
# We want the # of chunks per PDF to be equal to the # of sparse embeddings we've generated. Let's check that:

assert len(freshdisk_sembeddings) == len(chunked_files.get('freshdisk'))
assert len(hnsw_sembeddings) == len(chunked_files.get('hnsw'))
assert len(ivfpq_sembeddings) == len(chunked_files.get('ivfpq'))

# Getting Our Embeddings into Pinecone

Now that we have made our sparse and dense embeddings, it's time to index them into our Pinecone index.

One thing to note is that only [p1 and s1 pods support hybrid search](https://docs.pinecone.io/docs/indexes). Since we're not concerned about high throughput for a demo, we'll go with s1, which is optimized for storage over throughput.

Hybrid search inherently also needs `"dotproduct"` as the similarity `metric`.

In [59]:
pinecone.init(
   api_key=pinecone_api_key,
   environment=pinecone_env
)
# choose a name for your index
index_name = "hybrid-search-demo-oct23"
 
# create the index
pinecone.create_index(
   name = index_name,
   dimension = 384,  # dimensionality of our vectors
   metric = "dotproduct",
   pod_type = "s1"
)

In [60]:
# Let's confirm everything looks good with our index

pinecone.describe_index('hybrid-search-demo-oct23')

IndexDescription(name='hybrid-search-demo-oct23', metric='dotproduct', replicas=1, dimension=384.0, shards=1, pods=1, pod_type='s1.x1', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

We'll create an index object out of the index we just made. We'll make this with Pinecone's [GRPC client](https://docs.pinecone.io/docs/performance-tuning#using-the-grpc-client-to-get-higher-upsert-speeds), since it's a little faster for upserts:


In [61]:
index = pinecone.GRPCIndex("hybrid-search-demo-oct23")


We'll need to make unique IDs for all of our objects, which is easy with the `uuid` library in Python:

In [62]:
def create_ids(chunks: str) -> List[str]:
    """
    Create unique IDs for each document (chunk) in our index.

    :param chunks: Chunks of our PDF file.

    :return: Unique IDs for chunks.
    """
    return [str(uuid4()) for _ in range(len(chunks))]

In [63]:
freshdisk_ids = create_ids(chunked_files.get('freshdisk'))
hnsw_ids = create_ids(chunked_files.get('hnsw'))
ivfpq_ids = create_ids(chunked_files.get('ivfpq'))


In [64]:
# Let's preview one of our IDs:

freshdisk_ids[0]

'4c4a6ec8-a345-46cd-8c79-b8fb1a4c2a34'

In [65]:
# Let's make sure we have the same # of IDs as there are chunks:

assert len(freshdisk_ids) == len(chunked_files.get('freshdisk'))
assert len(hnsw_ids) == len(chunked_files.get('hnsw'))
assert len(ivfpq_ids) == len(chunked_files.get('ivfpq'))

Now that we have our IDs, we can make our composite sparse-dense objects that we'll index into Pinecone. These will take 4 components: 
- Our IDs
- Our sparse embeddings
- Our dense embeddings
- Our chunks

We'll use the actual text content of our PDFs (stored in our chunks) as metadata. This allows the end user to see the content of what's being returned by their search instead of just the sparse/dense vectors. In order to store our chunks' textual data in digestible metadata object for Pinecone, we'll want to turn each chunk into a dict that has a `'text'` key to hold the chunk value.

In [66]:
def create_metadata_objs(doc: List[str]) -> List[dict[str]]:
    """
    Create objects to store as metadata alongside our sparse and dense vectors in our hybird Pinecone index.

    :param doc: Chunks of a document we'd like to use while creating metadata objects.

    :return: Metadata objects with a "text" key and a value that points to the text content of each chunk. 
    """
    return [{'text': d} for d in doc]

In [68]:
freshdisk_metadata = create_metadata_objs(chunked_files.get('freshdisk'))
hnsw_metadata = create_metadata_objs(chunked_files.get('hnsw'))
ivfpq_metadata = create_metadata_objs(chunked_files.get('ivfpq'))

In [69]:
# Preview

freshdisk_metadata[0]

{'text': 'Approximate nearest neighbor search (ANNS) is a fundamental building block in information retrieval with graphbased indices being the current state-of-the-art  and widely used in the industry. Recent advances  in graph-based indices have made it possible to index and search billion-point datasets with high recall and millisecond-level latency on a single commodity machine with an SSD. In this paper, we present the first graph-based ANNS index that reflects corpus updates into the index in real-time without'}

In [70]:
def create_composite_objs(ids: str, sembeddings: List[Dict[str, List[Any]]], dembeddings: List[float], metadata: Dict[str, str]) -> List[Dict[str, Any]]:
    """
    Create objects for indexing into Pinecone. Each object contains a document ID (which corresponds to the chunk, not the larger document), 
    the chunk's sparse embedding, the chunk's dense embedding, and the chunk's corresponding metadata object.

    :param ids: Unique ID of a chunk we want to index.
    :param sembeddings: Sparse embedding representation of a chunk we want to index.
    :param dembeddings: Dense embedding representation of a chunk we want to index.
    :param metadata: Metadata objects with a "text" key and a value that points to the text content of each chunk. 

    :return: Composite objects in the correct format for ingest into Pinecone.
    """
    to_index = []

    for i in range(len(metadata)):
        to_index_obj = {
                'id': ids[i],
                'sparse_values': sembeddings[i],
                'values': dembeddings[i],
                'metadata': metadata[i]
            }
        to_index.append(to_index_obj)
    return to_index

In [71]:
freshdisk_com_objs = create_composite_objs(freshdisk_ids, freshdisk_sembeddings, freshdisk_dembeddings, freshdisk_metadata)
hnsw_com_objs = create_composite_objs(hnsw_ids, hnsw_sembeddings, hnsw_dembeddings, hnsw_metadata)
ivfpq_com_objs = create_composite_objs(ivfpq_ids, ivfpq_sembeddings, ivfpq_dembeddings, ivfpq_metadata)

In [73]:
# freshdisk_com_objs[0]

Now we can index ("upsert") our objects into our Pinecone index!

In [74]:
index.upsert(freshdisk_com_objs)
index.upsert(hnsw_com_objs)
index.upsert(ivfpq_com_objs)


In [75]:
# Woo we have our vectors (253) in our index!

index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 253}},
 'total_vector_count': 253}

# Query Our Hybrid Docs

Now that we have all of our hybrid vector objects in our Pinecone index, we can issue some queries!

Since issuing a query to a vector index requires the query to be vectorized in the same way as the objects in the index are vectorized (so they can match up in vector space), for hybrid queries we'll have to vectorize the query *twice*! Once as a sparse vector and once as a dense vector. We then send both of those vectors to Pinecone to get items back.

In [76]:
query = "What are nearest neighbors?" 


Create sparse embedding from query

Note: do *not* refit the bm25 model here. We want to keep the token frequencies etc from when we fit it to the text from our PDFs!

You might be wondering how the model gets "refit" when the corpus changes, the answer is a little complicated, but essentially this is a special implementation of BM25 (which usually runs online) that has precomputed frequencies for English words, based off the MSMarco dataset. So, when you add new docs to the corpus, you don't have to "refit" the BM25 model, it just finds the word frequencies in the MSMarco dataset.

More here: https://github.com/pinecone-io/pinecone-text/blob/main/pinecone_text/sparse/bm25_encoder.py#L255



In [77]:
query_sembedding = bm25.encode_documents(query)

In [78]:
# Cool! We can see there are only two values in here, because BM25 automatically removed stop word like "what" and "is"

query_sembedding  

{'indices': [3650065742, 1196854555],
 'values': [0.3225806451612903, 0.3225806451612903]}

In [79]:
# Create dense embedding

query_dembedding = produce_embeddings([query])

In [81]:
# query_dembedding

Pinecone vector search has a cool user feature where you can weight the sparse vectors higher or lower (i.e. of more or less importance) than the dense vectors. This is controlled by the `alpha` parameter. An `alpha` of 0 means you're doing a totally keyword-based search (i.e. only over sparse vectors), while an `alpha` of 1 means you're doing a totally semantic search (i.e. only over dense vectors).

Let's make a function that'll let us weight our vectors by alpha.

(We'll also include `k`, which is the number of docs we want to retrieve)

In [82]:
# Integrate alpha and top-k

def weight_by_alpha(sparse_embedding: Dict[str, List[Any]], dense_embedding: List[float], alpha: float) -> Tuple[Dict[str, List[Any]], List[float]]:
    """
    Weight the values of our sparse and dense embeddings by the parameter alpha (0-1).

    :param sparse_embedding: Sparse embedding representation of one of our documents (or chunks).
    :param dense_embedding: Dense embedding representation of one of our documents (or chunks).
    :param alpha: Weighting parameter between 0-1 that controls the impact of sparse or dense embeddings on the retrieval and ranking
        of returned docs (chunks) in our index.

    :return: Weighted sparse and dense embeddings for one of our documents (chunks).
    """
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    hsparse = {
        'indices': sparse_embedding['indices'],
        'values':  [v * (1 - alpha) for v in sparse_embedding['values']]
    }
    hdense = [v * alpha for v in dense_embedding]
    return hsparse, hdense

Now let's make a function that'll query our Pinecone index while taking into account whatever `alpha` and `k` values we want to pass:

In [84]:
# Note this doesn't have any genAI in it yet


def issue_hybrid_query(sparse_embedding: Dict[str, List[Any]], dense_embedding: List[float], alpha: float, top_k: int) -> QueryResponse:
    """
    Send properly formatted hybrid search query to Pinecone index and get back `k` ranked results (ranked by dot product similarity, as 
        defined when we made our index).

    :param sparse_embedding: Sparse embedding representation of one of our documents (or chunks).
    :param dense_embedding: Dense embedding representation of one of our documents (or chunks).
    :param alpha: Weighting parameter between 0-1 that controls the impact of sparse or dense embeddings on the retrieval and ranking
        of returned docs (chunks) in our index.
    :param top_k: The number of documents (chunks) we want back from Pinecone.

    :return: QueryResponse object from Pinecone containing top-k results.
    """
    scaled_sparse, scaled_dense = weight_by_alpha(sparse_embedding, dense_embedding, alpha)

    result = index.query(
        vector=scaled_dense,
        sparse_vector=scaled_sparse,
        top_k=top_k,
        include_metadata=True
    )
    return result

Let's issue a pure semantic search:

In [85]:
# Note, for our dense embedding (`query_dembedding`), we need to grab the 1st value [0] since Pinecone expects a Numpy array when queried:

issue_hybrid_query(query_sembedding, query_dembedding[0], 0.0, 5)

{'matches': [{'id': '5856cf9d-8e3c-4707-bfea-66f290dd8576',
              'metadata': {'text': 'to the closest neighbors in a k-NN graph '
                                   'serve as a simple approximation of the '
                                   'Delaunay graph  (a graph which guranties '
                                   'that the result of a basic greedy graph '
                                   'traversal is always the nearest neighbor). '
                                   'Unfortunately, Delaunay graph cannot be '
                                   'efficiently constructed without prior '
                                   'information about the structure of a space '
                                   ', but its approximation by the nearest '
                                   'neighbors can be done by using only '
                                   'distances between the stored elements. It '
                                   'was shown that proximity graph approaches '


And now a pure keyword search. You can see how many more domain-specific words are in these results:

In [86]:
issue_hybrid_query(query_sembedding, query_dembedding[0], 1.0, 5)

{'matches': [{'id': 'e3fa7bd5-e76b-4126-b997-78c29b443683',
              'metadata': {'text': 'Inc., San Francisco, CA, USA. Piotr Indyk '
                                   'and Rajeev Motwani. 1998. Approximate '
                                   'Nearest Neighbors: Towards Removing the '
                                   'Curse of Dimensionality. In Proceedings of '
                                   'the Thirtieth Annual ACM Symposium on '
                                   'Theory of Computing (Dallas, Texas, USA) '
                                   '(STOC ’98). ACM, New York, NY, USA, '
                                   '604-613. M Iwasaki. [n.d.]. '
                                   '_https://github.com/yahoojapan/NGT/wiki '
                                   'Masajiro Iwasaki and Daisuke Miyazaki. '
                                   '2018. Optimization of Indexing Based on '
                                   'k-Nearest Neighbor Graph for Proximity '
                    

You can see the differences above: when we issue a purely semantic search, our search results are about what the idea of "nearest neighbors" is; in our keyword search, the vast majority of our search results are just exact-word matches for the tokens "nearest" and "neighbors". Most of them are just citations from the HNSW article's bibliography!

Can we get the best of both worlds? In an ideal world, my search results would both tell me "about" the concept of nearest neighbors and contain things like citations that I could read more about later.

Let's see if we can get a combination of semantic and keyword search by toggling our `alpha` value:

In [88]:
issue_hybrid_query(query_sembedding, query_dembedding[0], 0.3, 5)  # closer to 1.0 = closer to pure keyword search

{'matches': [{'id': 'e3fa7bd5-e76b-4126-b997-78c29b443683',
              'metadata': {'text': 'Inc., San Francisco, CA, USA. Piotr Indyk '
                                   'and Rajeev Motwani. 1998. Approximate '
                                   'Nearest Neighbors: Towards Removing the '
                                   'Curse of Dimensionality. In Proceedings of '
                                   'the Thirtieth Annual ACM Symposium on '
                                   'Theory of Computing (Dallas, Texas, USA) '
                                   '(STOC ’98). ACM, New York, NY, USA, '
                                   '604-613. M Iwasaki. [n.d.]. '
                                   '_https://github.com/yahoojapan/NGT/wiki '
                                   'Masajiro Iwasaki and Daisuke Miyazaki. '
                                   '2018. Optimization of Indexing Based on '
                                   'k-Nearest Neighbor Graph for Proximity '
                    

Amazing! You can see that our first couple search results are not very different than our pure keyword search. But when you get further down the results list, you'll see that we get an equation we can use to calculate KNN. That's a bit more useful than #3 in our pure keyword search, which is a bibliography entry. That's likely because we have semantic search in there too -- Pinecone knows we want to know "about" KNN, so it fetches items with lots of domain-specific terms (keyword search), but also items that demonstrate the "aboutness" of KNN (semantic search).


# Let's take a closer look. For science! 

In [89]:
pure_keyword = issue_hybrid_query(query_sembedding, query_dembedding[0], 1.0, 5)
pure_semantic = issue_hybrid_query(query_sembedding, query_dembedding[0], 0.0, 5)
hybrid_1 = issue_hybrid_query(query_sembedding, query_dembedding[0], 0.1, 5)
hybrid_2 = issue_hybrid_query(query_sembedding, query_dembedding[0], 0.2, 5)
hybrid_3 = issue_hybrid_query(query_sembedding, query_dembedding[0], 0.3, 5)
hybrid_4 = issue_hybrid_query(query_sembedding, query_dembedding[0], 0.4, 5)
hybrid_5 = issue_hybrid_query(query_sembedding, query_dembedding[0], 0.5, 5)

In [90]:
# Let's turn these all into dataframes and see the different rankings
# Feel free to skip this part (it's just an interesting side journey)

import pandas as pd

df = pd.concat([
    pd.DataFrame([(i['metadata']['text'], i['score'], 'keyword') for i in pure_keyword.get('matches')]),
    pd.DataFrame([(i['metadata']['text'], i['score'], 'semantic') for i in pure_semantic.get('matches')]),
    pd.DataFrame([(i['metadata']['text'], i['score'], 'hybrid_1') for i in hybrid_1.get('matches')]),
    pd.DataFrame([(i['metadata']['text'], i['score'], 'hybrid_2') for i in hybrid_2.get('matches')]),
    pd.DataFrame([(i['metadata']['text'], i['score'], 'hybrid_3') for i in hybrid_3.get('matches')]),
    pd.DataFrame([(i['metadata']['text'], i['score'], 'hybrid_4') for i in hybrid_4.get('matches')]),
    pd.DataFrame([(i['metadata']['text'], i['score'], 'hybrid_5') for i in hybrid_5.get('matches')]),
]).rename(columns={0: 'document', 1: 'score', 2: 'search_type'})

# Note: don't pay too much attention to the "score" column. This really only matters within the same type of search, for ranking docs.
# Don't use it to compare *across* different search types (e.g. keyword search isn't inherently more relevant simply because it has higher
# scores overall)

In [91]:
df.head()

Unnamed: 0,document,score,search_type
0,"Inc., San Francisco, CA, USA. Piotr Indyk and ...",0.721957,keyword
1,for nearest neighbors using a greedy search al...,0.708882,keyword
2,the 24th International Conference on Artificia...,0.644924,keyword
3,k-nearest neighbor graphs to solve nearest nei...,0.639628,keyword
4,(it can be random or supplied by a separate al...,0.628335,keyword


In [94]:
# Let's give each document a label so that it's easier to see their ranking differences per search type

from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
df['document_encoded'] = label_encoder.fit_transform(df['document'])

df.head().sort_values(['document_encoded'])

Unnamed: 0,document,score,search_type,document_encoded
4,(it can be random or supplied by a separate al...,0.628335,keyword,0
0,"Inc., San Francisco, CA, USA. Piotr Indyk and ...",0.721957,keyword,1
1,for nearest neighbors using a greedy search al...,0.708882,keyword,3
3,k-nearest neighbor graphs to solve nearest nei...,0.639628,keyword,4
2,the 24th International Conference on Artificia...,0.644924,keyword,8


In [95]:
for i, v in df.groupby(['search_type']):
    print(v[['search_type', 'document_encoded', 'score']])
    

  search_type  document_encoded     score
0    hybrid_1                10  0.091840
1    hybrid_1                 6  0.090954
2    hybrid_1                 1  0.088548
3    hybrid_1                 7  0.085497
4    hybrid_1                 2  0.084330
  search_type  document_encoded     score
0    hybrid_2                 1  0.158927
1    hybrid_2                 3  0.151589
2    hybrid_2                 6  0.148853
3    hybrid_2                 4  0.144768
4    hybrid_2                10  0.144045
  search_type  document_encoded     score
0    hybrid_3                 1  0.229306
1    hybrid_3                 3  0.221250
2    hybrid_3                 6  0.206752
3    hybrid_3                 4  0.206626
4    hybrid_3                 0  0.204291
  search_type  document_encoded     score
0    hybrid_4                 1  0.299685
1    hybrid_4                 3  0.290912
2    hybrid_4                 4  0.268483
3    hybrid_4                 8  0.265589
4    hybrid_4                 0  0

Above, you can see the subtle ranking differences across each search type. For the most part, `document 1` and `document 3` are the top two documents, except in `hybrid_1` and `semantic`. In those two searche types, document 10 is the first-ranked document, while `document 6` is the 2nd-ranked document in `hybrid_1` and `document 9` is the 2nd-ranked document in `semantic`.

It's up to you and your stakeholders to find the ideal `alpha` for your use case(s).

Directly, for our use case, it seems anything >= `alpha=0.2` gets us similar results, so the impact of `alpha` is most discernable between `0.0-0.2`.

Cool!

# Incorporating GenAI

Now, hybrid search is cool enough, but what if you don't want to spend time sifting through your index's search results? What if you just want a single answer to a query?

That's where GenAI comes in. 

We will make a retrieval augmented generation (RAG) pipeline that will make this happen.

Since large language models (LLMs) do not know a ton of specific information (they are trained on the general Internet), especially if the information is from PDFs that it would have to download to have access to (like what are in our index), we need to give it this information!

We do this by first sending our query to our Pinecone index and grabbing some search  results. We then attach these search results to our original query and send *both* to the LLM. That way, the LLM both knows what we want to ask it & can pull from its general knowledge store *and* has a specialized knowledge store (our Pinecone search results so that it can get us extra specific information.

Let's try it out:

In [96]:
# Let's grab the textual metadata from our search results:

hybrid_context = [i.get('metadata').get('text') for i in hybrid_2.get('matches')]
pure_keyword_context = [i.get('metadata').get('text') for i in pure_keyword.get('matches')]
pure_semantic_context = [i.get('metadata').get('text') for i in pure_semantic.get('matches')]

In [97]:
# We are then going to combine this "context" with our original query in a format that our LLM likes:

hybrid_augmented_query = "\n\n---\n\n".join(hybrid_context)+"\n\n-----\n\n"+query
pure_keyword_augmented_query = "\n\n---\n\n".join(pure_keyword_context)+"\n\n-----\n\n"+query
pure_semantic_augmented_query = "\n\n---\n\n".join(pure_keyword_context)+"\n\n-----\n\n"+query

In [98]:
print(hybrid_augmented_query)

Inc., San Francisco, CA, USA. Piotr Indyk and Rajeev Motwani. 1998. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (Dallas, Texas, USA) (STOC ’98). ACM, New York, NY, USA, 604-613. M Iwasaki. [n.d.]. _https://github.com/yahoojapan/NGT/wiki Masajiro Iwasaki and Daisuke Miyazaki. 2018. Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data. Herve Jegou,

---

for nearest neighbors using a greedy search algorithm. The greedy search algorithm traverses the graph starting at a designated navigating or start node s € P. The search iterates by greedily walking from the current node u to a node v € Nout(u) that minimizes the distance to the query, and terminates when it reaches a locally-optimal node, say p*, that has the property d(p*,q) < d(p,q) Vp € Nou(p*). Greedy search cannot improve distance to the query point by navigating out of p*

In [99]:
# We are then going to give our LLM some instructions for how to act:

primer = f"""You are Q&A bot. A highly intelligent system that answers
user questions based on the information provided by the user above
each question. If the information can not be found in the information
provided by the user you truthfully say "I don't know".
"""

In [100]:
# Now we query our LLM with our augmented query & our primer!
# Our hybrid query:

openai.api_key = openai_api_key


hybrid_res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": hybrid_augmented_query}
    ]
)

hybrid_res

<OpenAIObject chat.completion id=chatcmpl-8DzKLEmZkLOASkAcbzPa9me2c0nCb at 0x2b64aa450> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Nearest neighbors is a term used in computer science, specifically in the field of data mining and information retrieval. It refers to algorithms or methods that find the data points in a given dataset that are closest or most similar to a given point or query. Nearest neighbor methods are often used to classify objects, predict values, or find patterns within data.",
        "role": "assistant"
      }
    }
  ],
  "created": 1698344697,
  "id": "chatcmpl-8DzKLEmZkLOASkAcbzPa9me2c0nCb",
  "model": "gpt-4-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 70,
    "prompt_tokens": 693,
    "total_tokens": 763
  }
}

In [103]:
# Our pure_keyword query:

pure_keyword_res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": pure_keyword_augmented_query}
    ]
)

pure_keyword_res

<OpenAIObject chat.completion id=chatcmpl-8DzLUYjwPimrqTJc1k3oWlHdgmglo at 0x2b64a9f70> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Nearest neighbors is a type of algorithm in data mining and machine learning. It's used to find the most similar items to a given item. In other words, it finds the 'neighbors' in a dataset that are 'nearest' to the target item based on some measure of distance (like Euclidean distance). This algorithm is often used in recommendation systems, data clustering, and anomaly detection.",
        "role": "assistant"
      }
    }
  ],
  "created": 1698344768,
  "id": "chatcmpl-8DzLUYjwPimrqTJc1k3oWlHdgmglo",
  "model": "gpt-4-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 79,
    "prompt_tokens": 743,
    "total_tokens": 822
  }
}

In [102]:
# Our pure_semantic query:

pure_semantic_res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": pure_semantic_augmented_query}
    ]
)

pure_semantic_res

<OpenAIObject chat.completion id=chatcmpl-8DzKnx9HxFaTUkuGYMY2gHzTcv530 at 0x2b7deb5f0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Nearest neighbors, in the context of the provided text, refers to a method in data mining and machine learning, used for classification and regression. The concept is based on the principle that data points which are similar to each other fall into similar categories. In other words, the 'nearest neighbor' of a data point is one that has the most similar characteristics to it. This method is often used in pattern recognition, recommendation systems, and optimizing search results by considering the proximity of data points in a high-dimensional space.",
        "role": "assistant"
      }
    }
  ],
  "created": 1698344725,
  "id": "chatcmpl-8DzKnx9HxFaTUkuGYMY2gHzTcv530",
  "model": "gpt-4-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 103,
    "prompt_tok

You can see across the different results above that our hybrid result is likely the most helpful.

Our `pure_keyword_res` seems to contain more domain-specific words than our `pure_semantic_res` (e.g. `Euclidean distance`). 

Our `pure_semantic_res`, on the other hand, seems to contain fewer implementation details and more conceptual details (e.g. `The concept is based on the principle that data points . . . `).

# What if we take our our Pinecone vectors altogether??

In [105]:
# What if we issue our original query without our Pinecone vectors as context?

res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": query}
    ]
)

res

<OpenAIObject chat.completion id=chatcmpl-8DzMREfDHewszdDS0QmHMhKJRaMTh at 0x2b64abc50> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Based on the information provided by the user, I don't know what nearest neighbors refers to.",
        "role": "assistant"
      }
    }
  ],
  "created": 1698344827,
  "id": "chatcmpl-8DzMREfDHewszdDS0QmHMhKJRaMTh",
  "model": "gpt-4-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 19,
    "prompt_tokens": 69,
    "total_tokens": 88
  }
}

We can see that RAG really does have a huge impact! Without our PDFs, ChatGPT doesn't know much helpful detail at all! Nor can it give us bibliographic data for articles we might want to look up later!

# All finished!

Check out [our documentation on hybrid search](https://docs.pinecone.io/docs/hybrid-search-and-sparse-vectors) and keep building awesome things!