# Pinecone

- Author: [ro__o_jun](https://github.com/ro-jun)
- Design: []()
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/08-Embeeding/01-OpenAIEmbeddings.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/08-Embeeding/01-OpenAIEmbeddings.ipynb)

## Overview

This tutorial provides a comprehensive guide to integrating `Pinecone` with `LangChain` for creating and managing high-performance vector databases.  
It explains how to set up Pinecone, `preprocess documents`, `handle stop words`, and use Pinecone's APIs for vector indexing and `document retrieval`.  
Additionally, it demonstrates advanced features such as `hybrid search` using dense and `sparse embeddings`, `metadata filtering`, and `dynamic reranking` to build efficient and scalable search systems.  

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [What is Pinecone?](#what-is-pinecone)
- [Pinecone setup guide](#pinecone-setup-guide)
- [Handling Stop Words](#handling-stop-words)
- [Data preprocessing](#data-preprocessing)
- [Pinecone and LangChain Integration Guide: Step by Step](#pinecone-and-langchain-integration-guide-step-by-step)
- [Create Retriever](#create-retriever)
- [License Information](#license-information)



### References

- [Pinecone-LangChain](https://python.langchain.com/api_reference/pinecone/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html)
- [Pinecone-official-website](https://docs.pinecone.io/integrations/langchain)
- [Automated Summarisation and Evaluation Framework - arXiv](https://arxiv.org/abs/2310.02759)
- [Pinecone Rerank Document](https://docs.pinecone.io/guides/inference/rerank)
- [Model and fee structure](https://docs.pinecone.io/models/overview)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. 
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain-pinecone",
        "pinecone-client",
        "nltk",
        "langchain_community",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "PINECONE_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Pinecone",
    },
)

Environment variables have been set successfully.


[Note] If you are using a `.env` file, proceed as follows.

In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True
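
Before moving on, it can help to confirm that the keys the rest of the tutorial relies on are actually set. The snippet below is a minimal sketch using only the standard library; the helper name `missing_env_vars` and the `REQUIRED_KEYS` list are illustrative assumptions, not part of `langchain-opentutorial`.

```python
import os

# Keys this tutorial reads later (assumed list, adjust to your setup).
REQUIRED_KEYS = ["OPENAI_API_KEY", "PINECONE_API_KEY"]


def missing_env_vars(keys, env=None):
    """Return the keys that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [key for key in keys if not env.get(key)]


missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

An empty string counts as missing here, which catches the case where a key was left blank in `set_env` or in the `.env` file.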

## What is Pinecone?

`Pinecone` is a **cloud-based**, high-performance vector database for **efficient vector storage and retrieval** in AI and machine learning applications.

**Features** :
1. **Supports SDKs** for Python, Node.js, Java, and Go.
2. **Fully managed** : Reduces the burden of infrastructure management.
3. **Real-time updates** : Supports real-time insertion, updates, and deletions.

**Advantages** :
1. Scalability for large datasets.
2. Real-time data processing.
3. High availability with cloud infrastructure.

**Disadvantages** :
1. Relatively higher cost compared to other vector databases.
2. Limited customization options.

## Pinecone setup guide

This section explains how to set up Pinecone, including API key creation.

**[steps]**

1. Log in to [Pinecone](https://www.pinecone.io/)
2. Create an API key under the `API Keys` tab.

![example](./assets/04-pinecone-api-key.png)  
![example](./assets/04-pinecone-api-key-name.png)  

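### Note

The helper functions used below are **custom implementations** imported from the `langchain-teddynote` package, so install or update the libraries below before proceeding (uncomment the `langchain-teddynote` line if the package is not yet installed).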

In [None]:
%pip install pymupdf
# %pip install -U langchain-teddynote

Collecting pymupdf
  Using cached pymupdf-1.25.1-cp39-abi3-win_amd64.whl.metadata (3.4 kB)
Using cached pymupdf-1.25.1-cp39-abi3-win_amd64.whl (16.6 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.25.1
Note: you may need to restart the kernel to use updated packages.


## Handling Stop Words
- Process stopwords before vectorizing text data to improve the quality of embeddings and focus on meaningful words.

In [5]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\thdgh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\thdgh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
print("Number of stop words :", len(stop_words))
print("First 10 stop words :", stop_words[:10])

Number of stop words : 179
First 10 stop words : ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
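
To make the effect of stop word handling concrete, here is a minimal sketch that filters stop words out of a sentence. It uses a small hand-picked subset of the list above rather than NLTK itself, and `remove_stop_words` is a hypothetical helper introduced only for illustration.

```python
import re

# A tiny hand-picked subset of the English stop word list, for illustration only.
SAMPLE_STOP_WORDS = {"i", "me", "my", "we", "our", "you", "the", "a", "an", "is", "of", "to", "with"}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stop_words]


filtered = remove_stop_words("We summarize the PDF with LangChain", SAMPLE_STOP_WORDS)
print(filtered)  # ['summarize', 'pdf', 'langchain']
```

Only the content-bearing tokens remain, which is what a keyword-based sparse encoder is meant to score.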

## Data preprocessing

Below is the preprocessing process for general documents.  
It reads all `.pdf` files under the `data/` directory, splits them into chunks, and stores the chunks in `split_docs`.

In [21]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import glob

# Split text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)

split_docs = []

# Load each PDF file and split it into List[Document] format
files = sorted(glob.glob("data/*.pdf"))

for file in files:
    loader = PyMuPDFLoader(file)
    split_docs.extend(loader.load_and_split(text_splitter))

# Check document count
len(split_docs)

85
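
The `chunk_size` and `chunk_overlap` settings control how much text each chunk holds and how much neighbouring chunks share. The sketch below shows the idea with a plain fixed-size sliding window. This is a simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraphs and newlines, which this hypothetical `sliding_chunks` helper does not attempt.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.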

In [22]:
split_docs[0].page_content

'Comparative Study and Framework for Automated Summariser\nEvaluation: LangChain and Hybrid Algorithms\nBagiya Lakshmi S (bagiyalakshmi59@gmail.com), Sanjjushri Varshini R\n(sanjjushrivarshini@gmail.com), Rohith Mahadevan (rohithmahadev30@gmail.com), Raja CSP\nRaman (raja.csp@gmail.com)\nAbstract'

In [23]:
split_docs[0].metadata

{'source': 'data\\final-Research-Paper-5.pdf',
 'file_path': 'data\\final-Research-Paper-5.pdf',
 'page': 0,
 'total_pages': 9,
 'format': 'PDF 1.4',
 'title': 'final-Research-Paper-5',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'Skia/PDF m119 Google Docs Renderer',
 'creationDate': '',
 'modDate': '',
 'trapped': ''}

Next, process the documents so they can be stored in Pinecone. You can choose which metadata to keep with `metadata_keys` during this step.

You can also tag additional metadata; if needed, add and process it ahead of time in a separate preprocessing task.

- `split_docs` : List[Document] containing the results of document splitting.
- `metadata_keys` : List of metadata keys to keep for each document.
- `min_length` : Minimum document length. Documents shorter than this are excluded.
- `use_basename` : Whether to reduce the source path to its file name. The default is `False` .

**Preprocessing of documents**

- Extracts the required `metadata` information.
- Keeps only documents that are at least the minimum length.
- Optionally replaces the source path with its `basename` (`use_basename`, default `False`).
- Here, `basename` refers to the last component of the file path.
- For example, `/data/final-Research-Paper-5.pdf` becomes `final-Research-Paper-5.pdf`.


In [25]:
from langchain_teddynote.community.pinecone import preprocess_documents

contents, metadatas = preprocess_documents(
    split_docs=split_docs,
    metadata_keys=["source", "page", "author"],
    min_length=5,
    use_basename=True,
)

  0%|          | 0/85 [00:00<?, ?it/s]

In [1]:
from tqdm import tqdm
import os

# Input data: split_docs
# Example split_docs: data in List[Document] form
# split_docs = [...]  # Insert your data here.

# Settings
metadata_keys = ["source", "page", "author"]
min_length = 5
use_basename = True

# Initialize the variables that store the results
contents = []
metadatas = {key: [] for key in metadata_keys}

# Preprocess the documents
for doc in tqdm(split_docs):
    content = doc.page_content.strip()  # Strip surrounding whitespace
    if content and len(content) >= min_length:  # Keep non-empty content of at least min_length
        contents.append(content)  # Store the content
        for k in metadata_keys:
            value = doc.metadata.get(k)  # Look up the metadata value
            if k == "source" and use_basename:  # Apply use_basename
                value = os.path.basename(value)
            try:
                metadatas[k].append(int(value))  # Convert to int when possible
            except (ValueError, TypeError):
                metadatas[k].append(value)  # Otherwise keep the original value

# Check the results
print("Processed contents:", contents[:5])
print("------------------------------------")
print("Processed metadatas keys:", metadatas.keys())
print("------------------------------------")
print("Source metadata examples:", metadatas["source"][:5])



In [33]:
# Check documents to be saved in VectorStore
contents[:5]

['Comparative Study and Framework for Automated Summariser\nEvaluation: LangChain and Hybrid Algorithms\nBagiya Lakshmi S (bagiyalakshmi59@gmail.com), Sanjjushri Varshini R\n(sanjjushrivarshini@gmail.com), Rohith Mahadevan (rohithmahadev30@gmail.com), Raja CSP\nRaman (raja.csp@gmail.com)\nAbstract',
 'Raman (raja.csp@gmail.com)\nAbstract\nAutomated Essay Score (AES) is proven to be one of the cutting-edge technologies. Scoring\ntechniques are used for various purposes. Reliable scores are calculated based on influential',
 "variables. Such variables can be computed by different methods based on the domain. The\nresearch is concentrated on the user's understanding of a given topic. The analysis is based on\na scoring index by using Large Language Models. The user can then compare and contrast the",
 "understanding of a topic that they recently learned. The results are then contributed towards\nlearning analytics and progression is made for enhancing the learning ability. In this researc

In [34]:
# Check metadata to be saved in VectorStore
metadatas.keys()

dict_keys(['source', 'page', 'author'])

In [35]:
# Check the source in metadata.
metadatas["source"][:5]

['final-Research-Paper-5.pdf',
 'final-Research-Paper-5.pdf',
 'final-Research-Paper-5.pdf',
 'final-Research-Paper-5.pdf',
 'final-Research-Paper-5.pdf']

In [36]:
# Check number of documents, check number of sources, check number of pages
len(contents), len(metadatas["source"]), len(metadatas["page"])

(85, 85, 85)

## Pinecone and LangChain Integration Guide: Step by Step

This guide outlines the integration of Pinecone and LangChain to set up and utilize a vector database. 

Below are the key steps to complete the integration.

### Pinecone client initialization and vector database setup

The provided code performs the initialization of a Pinecone client, sets up an index in Pinecone, and defines a vector database to store embeddings.

**[caution]**    
If you plan to use hybrid search, set the metric to `dotproduct`.

**Pinecone index settings**

In [28]:
import os
from langchain_teddynote.community.pinecone import create_index

# Initializes a connection to Pinecone using an API key from environment variables.
pc = create_index(
    api_key=os.environ.get("PINECONE_API_KEY"),
    index_name="langchain-opentutorial-index",  # change if desired
    dimension=3072,  # Must match the embedding dimension (text-embedding-3-large: 3072; OpenAIEmbeddings default: 1536, UpstageEmbeddings: 4096)
    metric="dotproduct",
)

[create_index]
{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-open-tutorial-02': {'vector_count': 85}},
 'total_vector_count': 85}


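**Setting up an index using paid pods**

Pod-based indexes require a paid plan. On the free plan, the call below fails with the `403 Forbidden` response shown in the output.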

In [18]:
import os
from langchain_teddynote.community.pinecone import create_index
from pinecone import PodSpec

# Initializes a connection to Pinecone using an API key from environment variables.
pc = create_index(
    api_key=os.environ.get("PINECONE_API_KEY"),
    index_name="langchain-opentutorial-index-2",  # change if desired
    dimension=3072,  # Must match the embedding dimension (text-embedding-3-large: 3072; OpenAIEmbeddings default: 1536, UpstageEmbeddings: 4096)
    metric="dotproduct",
    pod_spec=PodSpec(
        environment="us-west1-gcp", pod_type="p1.x1", pods=1
    ),  # Use Paid Pods
)

ForbiddenException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'x-pinecone-api-version': '2024-07', 'X-Cloud-Trace-Context': '003dc97ea1faba8a2bb7756d85da15b3', 'Date': 'Wed, 08 Jan 2025 14:38:14 GMT', 'Server': 'Google Frontend', 'Content-Length': '193', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"FORBIDDEN","message":"Request failed. You've reach the max pod-based indexes allowed in project Q&A Chat (0). To add more pod-based indexes, upgrade your plan."},"status":403}


### Create Sparse Encoder

- Create a sparse encoder.

- Perform stopword processing.

- Fit the sparse encoder on `contents`. The fitted encoder is used to create sparse vectors when storing documents in the vector store.


In [19]:
from langchain_teddynote.community.pinecone import (
    create_sparse_encoder,
    fit_sparse_encoder,
)

sparse_encoder = create_sparse_encoder(stopwords.words("english"), mode="english")

Fit the sparse encoder on the corpus.

- `save_path` : Path where the sparse encoder is saved. The encoder is stored in pickle format and loaded later to embed queries, so specify a path you can reuse.

In [20]:
# Fit the sparse encoder on the document contents
saved_path = fit_sparse_encoder(
    sparse_encoder=sparse_encoder, contents=contents, save_path="./sparse_encoder.pkl"
)

  0%|          | 0/85 [00:00<?, ?it/s]

[fit_sparse_encoder]
Saved Sparse Encoder to: ./sparse_encoder.pkl
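
To give a feel for what "fitting" a sparse encoder involves, here is a toy BM25-style encoder written with the standard library only. It is a sketch under stated assumptions: the class `TinySparseEncoder`, its vocabulary indexing, and the default `k1` and `b` values are illustrative and are not the implementation behind `create_sparse_encoder` or `fit_sparse_encoder`.

```python
import re
from collections import Counter


def tokenize(text, stop_words=frozenset()):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stop_words]


class TinySparseEncoder:
    """Toy BM25-style encoder: fit() learns corpus statistics, encode_document() uses them."""

    def __init__(self, stop_words=frozenset(), k1=1.2, b=0.75):
        self.stop_words = stop_words
        self.k1 = k1  # term-frequency saturation (assumed default)
        self.b = b    # document-length normalization strength (assumed default)
        self.vocab = {}  # token -> integer index
        self.avg_doc_len = 0.0

    def fit(self, corpus):
        tokenized = [tokenize(doc, self.stop_words) for doc in corpus]
        for tokens in tokenized:
            for token in tokens:
                self.vocab.setdefault(token, len(self.vocab))
        self.avg_doc_len = sum(len(t) for t in tokenized) / max(len(tokenized), 1)
        return self

    def encode_document(self, text):
        """Return a sparse vector as parallel lists of indices and weights."""
        tokens = [t for t in tokenize(text, self.stop_words) if t in self.vocab]
        counts = Counter(tokens)
        norm = self.k1 * (1 - self.b + self.b * len(tokens) / max(self.avg_doc_len, 1e-9))
        return {
            "indices": [self.vocab[t] for t in counts],
            "values": [tf / (tf + norm) for tf in counts.values()],
        }


encoder = TinySparseEncoder(stop_words={"the", "a", "with"}).fit(
    ["Summarize the PDF with LangChain", "LangChain is a framework"]
)
vec = encoder.encode_document("LangChain PDF PDF")
print(vec["indices"], vec["values"])
```

Only the terms that occur in a document get an entry, which is why the result is stored as `indices` and `values` instead of a full-length vector. The fitted statistics (vocabulary and average document length) are what must be saved and reloaded so that queries are encoded consistently with the stored documents.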


[Optional]  
Below is the code to use when you need to reload the learned and saved Sparse Encoder later.

In [21]:
from langchain_teddynote.community.pinecone import load_sparse_encoder

# Load the previously fitted sparse encoder.
sparse_encoder = load_sparse_encoder("./sparse_encoder.pkl")

[load_sparse_encoder]
Loaded Sparse Encoder from: ./sparse_encoder.pkl

### Pinecone: Add to DB Index (Upsert)

![04-pinecone-upsert-data](./assets/04-pinecone-upsert-data.png)

- `context`: This is the content of the document.
- `page` : The page number of the document.
- `source` : This is the source of the document.
- `values` : This is an embedding of a document obtained through Embedder.
- `sparse values` : This is an embedding of a document obtained through Sparse Encoder.
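
The fields above can be pictured as one record per document chunk, sent to the index in batches. The following sketch assembles such records from parallel lists and splits them into batches. The helper names `build_records` and `batched` and the `doc-{i}` ID scheme are hypothetical, and the dummy vectors stand in for real embeddings; how `upsert_documents` builds its records internally is not shown in this tutorial.

```python
def build_records(contents, metadatas, dense_vectors, sparse_vectors):
    """Zip parallel lists into upsert-ready record dictionaries."""
    records = []
    for i, content in enumerate(contents):
        metadata = {key: values[i] for key, values in metadatas.items()}
        metadata["context"] = content  # keep the chunk text alongside its metadata
        records.append(
            {
                "id": f"doc-{i}",  # hypothetical ID scheme
                "values": dense_vectors[i],
                "sparse_values": sparse_vectors[i],
                "metadata": metadata,
            }
        )
    return records


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Dummy data with the same shape as the tutorial's 85 chunks.
n = 85
contents_demo = [f"chunk {i}" for i in range(n)]
metadatas_demo = {"source": ["paper.pdf"] * n, "page": list(range(n))}
dense_demo = [[0.1, 0.2, 0.3]] * n
sparse_demo = [{"indices": [0], "values": [1.0]}] * n

records = build_records(contents_demo, metadatas_demo, dense_demo, sparse_demo)
batches = list(batched(records, batch_size=32))
print(len(batches), [len(b) for b in batches])  # 3 [32, 32, 21]
```

With 85 chunks and `batch_size=32`, the data is sent in three batches, which matches the `0/3` progress bars shown for the upsert calls in this section.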

In [22]:
from langchain_openai import OpenAIEmbeddings

openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

Upsert documents in batches without parallel processing.
Use this method when the number of documents is small.

In [29]:
%%time
from langchain_teddynote.community.pinecone import upsert_documents

upsert_documents(
    index=pc,  # Pinecone Index
    namespace="langchain-open-tutorial-01",  # Pinecone namespace
    contents=contents,  # Previously preprocessed document content
    metadatas=metadatas,  # Previously preprocessed document metadata
    sparse_encoder=sparse_encoder,  # Sparse encoder
    embedder=openai_embeddings,
    batch_size=32,
)

  0%|          | 0/3 [00:00<?, ?it/s]


The function below upserts in parallel to speed up large uploads. Use it when you have many documents.

In [30]:
%%time
from langchain_teddynote.community.pinecone import upsert_documents_parallel

upsert_documents_parallel(
    index=pc,  # Pinecone Index
    namespace="langchain-open-tutorial-02",  # Pinecone namespace
    contents=contents,  # Previously preprocessed document content
    metadatas=metadatas,  # Previously preprocessed document metadata
    sparse_encoder=sparse_encoder,  # Sparse encoder
    embedder=openai_embeddings,
    batch_size=32,
)

Upserting documents:   0%|          | 0/3 [00:00<?, ?it/s]


### Index inquiry/delete

The `describe_index_stats` method provides statistical information about the contents of an index. This method allows you to obtain information such as the number of vectors and dimensions per namespace.

**Parameters**
- `filter` (Optional[Dict[str, Union[str, float, int, bool, List, dict]]]): A filter that returns statistics only for vectors that meet certain conditions. Default is None.
- `**kwargs`: Additional keyword arguments.

**Return value**
- `DescribeIndexStatsResponse`: Object containing statistical information about the index.

**Usage example**
- Default usage: `index.describe_index_stats()`
- Apply filter: `index.describe_index_stats(filter={'key': 'value'})`

**[Note]** Metadata filtering is only available to paid users.

In [31]:
# Index lookup
pc.describe_index_stats()

{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-open-tutorial-02': {'vector_count': 85}},
 'total_vector_count': 85}
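
The statistics are most useful for checking which namespaces exist before upserting, searching, or deleting. The sketch below works on a plain dictionary copied from the output above; whether the real `DescribeIndexStatsResponse` object can be read with the same dictionary-style access is an assumption, and `summarize_stats` and `namespace_exists` are hypothetical helpers.

```python
def summarize_stats(stats):
    """Return (total_vector_count, {namespace: vector_count}) from a stats-like mapping."""
    per_namespace = {
        name: info["vector_count"] for name, info in stats.get("namespaces", {}).items()
    }
    return stats.get("total_vector_count", 0), per_namespace


def namespace_exists(stats, namespace):
    """Check whether a namespace appears in the stats mapping."""
    return namespace in stats.get("namespaces", {})


# Plain-dict copy of the output shown above.
stats = {
    "dimension": 3072,
    "index_fullness": 0.0,
    "namespaces": {"langchain-open-tutorial-02": {"vector_count": 85}},
    "total_vector_count": 85,
}

total, per_namespace = summarize_stats(stats)
print(total, per_namespace)
print(namespace_exists(stats, "langchain-open-tutorial-01"))  # False
```

A check like this explains messages such as "Namespace ... does not exist" when a delete targets a namespace that holds no vectors.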

**Delete namespace**

In [32]:
from langchain_teddynote.community.pinecone import delete_namespace

delete_namespace(
    pinecone_index=pc,
    namespace="langchain-open-tutorial-01",
)

Namespace 'langchain-open-tutorial-01' does not exist.


In [33]:
pc.describe_index_stats()

{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-open-tutorial-02': {'vector_count': 85}},
 'total_vector_count': 85}

The feature below is exclusive to paid users: deleting vectors with metadata filtering.

In [34]:
from langchain_teddynote.community.pinecone import delete_by_filter

# Delete with metadata filtering (paid feature)
delete_by_filter(
    pinecone_index=pc,
    namespace="langchain-open-tutorial-02",
    filter={"source": {"$eq": "final-Research-Paper-5.pdf"}},
)
pc.describe_index_stats()

Error occurred while deleting with filter:
UNKNOWN:Error received from peer  {created_time:"2025-01-08T14:42:48.8544859+00:00", grpc_status:3, grpc_message:"Invalid request."}


{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-open-tutorial-02': {'vector_count': 85}},
 'total_vector_count': 85}

## Create Retriever

**PineconeKiwiHybridRetriever initialization parameter settings**

The `init_pinecone_index` function and the `PineconeKiwiHybridRetriever` class implement a hybrid search system using Pinecone. This system combines dense and sparse vectors to perform effective document retrieval.

**Pinecone index initialization**

The `init_pinecone_index` function initializes the Pinecone index and sets up the necessary components.

**Parameters**
- `index_name` (str): Pinecone index name
- `namespace` (str): Namespace to use
- `api_key` (str): Pinecone API key
- `sparse_encoder_path` (str): Path to the sparse encoder pickle file
- `stopwords` (List[str]): List of stop words
- `tokenizer` (str): Tokenizer to use (default: "kiwi")
- `embeddings` (Embeddings): Embedding model
- `top_k` (int): Maximum number of documents to return (default: 10)
- `alpha` (float): Parameter that adjusts the weighting between dense and sparse vectors (default: 0.5)

**Main features**
1. Pinecone index initialization and statistical information output
2. Sparse encoder (BM25) loading and tokenizer settings
3. Specify namespace


In [35]:
from langchain_teddynote.community.pinecone import init_pinecone_index

pinecone_params = init_pinecone_index(
    index_name="langchain-opentutorial-index",  # Pinecone index name
    namespace="langchain-open-tutorial-02",  # Pinecone namespace
    api_key=os.environ["PINECONE_API_KEY"],  # Pinecone API key
    sparse_encoder_path="./sparse_encoder.pkl",  # Path where the sparse encoder was saved (save_path)
    stopwords=stopwords.words("english"),  # Stop word list
    tokenizer="english",
    embeddings=OpenAIEmbeddings(model="text-embedding-3-large"),  # Dense embedder
    top_k=5,  # Number of top-K documents to return
    alpha=0.5,  # e.g. alpha=0.75 weights dense embeddings 0.75 and sparse embeddings 0.25
)

[init_pinecone_index]
{'dimension': 3072,
 'index_fullness': 0.0,
 'namespaces': {'langchain-open-tutorial-02': {'vector_count': 85}},
 'total_vector_count': 85}


**PineconeKiwiHybridRetriever**

The class `PineconeKiwiHybridRetriever` implements a hybrid retriever combining Pinecone and Kiwi.

**Main properties**
- `embeddings`: Embedding model for dense vector transformations
- `sparse_encoder`: Encoder for sparse vector transformations
- `index`: Pinecone index object
- `top_k`: Maximum number of documents to return
- `alpha`: Weight adjustment parameter for dense and sparse vectors
- `namespace`: Namespace within the Pinecone index

**Features**
- Hybrid search retriever combining dense and sparse vectors
- Search strategy can be tuned through weight adjustment
- Dynamic metadata filtering and other options can be applied through `search_kwargs` (`filter`, `k`, `rerank`, `rerank_model`, `top_n`, etc.)

**Usage example**
1. Initialize the required components with the `init_pinecone_index` function.
2. Create a `PineconeKiwiHybridRetriever` instance with the initialized components.
3. Perform a hybrid search using the created retriever.

In [37]:
from langchain_teddynote.community.pinecone import PineconeKiwiHybridRetriever

# Create the retriever
pinecone_retriever = PineconeKiwiHybridRetriever(**pinecone_params)

**General search**

In [39]:
# execution result
search_results = pinecone_retriever.invoke("Summarize PDF with LangChain")
for result in search_results:
    print(result.page_content)
    print(result.metadata)
    print("\n====================\n")

framework designed for summarization tasks. LangChain is employed to summarize PDF
content, creating a condensed version of the information that retains key insights and context.
3.3 Summarizing Content and User's Understanding Comparison
{'context': "framework designed for summarization tasks. LangChain is employed to summarize PDF\ncontent, creating a condensed version of the information that retains key insights and context.\n3.3 Summarizing Content and User's Understanding Comparison", 'page': 4.0, 'author': '', 'source': 'final-Research-Paper-5.pdf'}


The process involves utilizing a Langchain tool to summarize the PDF and extract the essential
information. By employing this technique, the research aims to determine how well the user
comprehends the summarized content.
{'context': 'The process involves utilizing a Langchain tool to summarize the PDF and extract the essential\ninformation. By employing this technique, the research aims to determine how well the user\ncomprehends t

Using dynamic `search_kwargs` - `k`: specify the maximum number of documents to return

In [40]:
# execution result
search_results = pinecone_retriever.invoke(
    "Summarize PDF with LangChain", search_kwargs={"k": 1}
)
for result in search_results:
    print(result.page_content)
    print(result.metadata)
    print("\n====================\n")

framework designed for summarization tasks. LangChain is employed to summarize PDF
content, creating a condensed version of the information that retains key insights and context.
3.3 Summarizing Content and User's Understanding Comparison
{'context': "framework designed for summarization tasks. LangChain is employed to summarize PDF\ncontent, creating a condensed version of the information that retains key insights and context.\n3.3 Summarizing Content and User's Understanding Comparison", 'page': 4.0, 'author': '', 'source': 'final-Research-Paper-5.pdf'}





Using dynamic `search_kwargs` - `alpha`: weight adjustment parameter for dense and sparse vectors. Specify a value between 0 and 1. The default is `0.5`; the closer the value is to 1, the higher the weight of the dense vector.
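
One common way to apply such a weight is a convex combination: scale the dense query vector by `alpha` and the sparse query vector by `1 - alpha` before the dot-product search. The sketch below illustrates that scheme. It is an assumption that `PineconeKiwiHybridRetriever` applies `alpha` in exactly this way, and `hybrid_scale` is a hypothetical helper.

```python
def hybrid_scale(dense, sparse, alpha):
    """Scale a dense vector by alpha and a sparse vector by (1 - alpha)."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


dense_query = [0.5, 0.25]
sparse_query = {"indices": [3, 7], "values": [2.0, 4.0]}

# alpha=0.75 keeps 75% of the dense signal and 25% of the sparse signal.
d, s = hybrid_scale(dense_query, sparse_query, alpha=0.75)
print(d, s)

# alpha=1 removes the sparse contribution; alpha=0 removes the dense one.
_, only_dense = hybrid_scale(dense_query, sparse_query, alpha=1)
only_sparse, _ = hybrid_scale(dense_query, sparse_query, alpha=0)
print(only_dense["values"], only_sparse)
```

Under this scheme, `alpha=1` behaves like a pure semantic search and `alpha=0` like a pure keyword search, which is consistent with the two settings returning different top documents for the same query.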

In [41]:
# execution result
search_results = pinecone_retriever.invoke(
    "Langchain and pdf", search_kwargs={"alpha": 1, "k": 1}
)
for result in search_results:
    print(result.page_content)
    print(result.metadata)
    print("\n====================\n")

Langchain. It quantifies the congruence between these two sources of information, affording
users a tangible score that reflects their grasp of the topic. Furthermore, the research delves
even deeper by comparing the user's understanding with the original PDF content, affording yet
{'context': "Langchain. It quantifies the congruence between these two sources of information, affording\nusers a tangible score that reflects their grasp of the topic. Furthermore, the research delves\neven deeper by comparing the user's understanding with the original PDF content, affording yet", 'page': 1.0, 'author': '', 'source': 'final-Research-Paper-5.pdf'}




In [42]:
# execution result
search_results = pinecone_retriever.invoke(
    "Langchain and pdf", search_kwargs={"alpha": 0, "k": 1}
)
for result in search_results:
    print(result.page_content)
    print(result.metadata)
    print("\n====================\n")

The process involves utilizing a Langchain tool to summarize the PDF and extract the essential
information. By employing this technique, the research aims to determine how well the user
comprehends the summarized content.
{'context': 'The process involves utilizing a Langchain tool to summarize the PDF and extract the essential\ninformation. By employing this technique, the research aims to determine how well the user\ncomprehends the summarized content.', 'page': 0.0, 'author': '', 'source': 'final-Research-Paper-5.pdf'}




**Metadata filtering**

![example](./assets/04-pinecone-metadata.png)

Using dynamic `search_kwargs` - `filter`: apply metadata filtering

(Example) Search only documents whose `page` is less than 5.

In [43]:
# execution result
search_results = pinecone_retriever.invoke(
    "Summarize PDF with LangChain",
    search_kwargs={"filter": {"page": {"$lt": 5}}, "k": 2},
)
for result in search_results:
    print(result.page_content)
    print(result.metadata)
    print("\n====================\n")

framework designed for summarization tasks. LangChain is employed to summarize PDF
content, creating a condensed version of the information that retains key insights and context.
3.3 Summarizing Content and User's Understanding Comparison
{'context': "framework designed for summarization tasks. LangChain is employed to summarize PDF\ncontent, creating a condensed version of the information that retains key insights and context.\n3.3 Summarizing Content and User's Understanding Comparison", 'page': 4.0, 'author': '', 'source': 'final-Research-Paper-5.pdf'}


The process involves utilizing a Langchain tool to summarize the PDF and extract the essential
information. By employing this technique, the research aims to determine how well the user
comprehends the summarized content.
{'context': 'The process involves utilizing a Langchain tool to summarize the PDF and extract the essential\ninformation. By employing this technique, the research aims to determine how well the user\ncomprehends t
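
To make the filter syntax concrete, the sketch below evaluates a filter of the same shape locally against metadata dictionaries. It covers only simple comparison operators and is meant to illustrate the semantics of expressions such as `{"page": {"$lt": 5}}`; it is not how Pinecone evaluates filters, and `matches` is a hypothetical helper.

```python
import operator

# Comparison operators supported by this illustration.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
}


def matches(metadata, filter_spec):
    """Return True if every field condition in filter_spec holds for metadata."""
    for field, condition in filter_spec.items():
        if field not in metadata:
            return False
        for op_name, expected in condition.items():
            if not OPERATORS[op_name](metadata[field], expected):
                return False
    return True


docs = [
    {"page": 4.0, "source": "final-Research-Paper-5.pdf"},
    {"page": 0.0, "source": "final-Research-Paper-5.pdf"},
    {"page": 7.0, "source": "final-Research-Paper-5.pdf"},
]

kept = [doc for doc in docs if matches(doc, {"page": {"$lt": 5}})]
print([doc["page"] for doc in kept])  # [4.0, 0.0]
```

Conditions on several fields are combined with a logical AND in this sketch, so adding `"source": {"$eq": "final-Research-Paper-5.pdf"}` to the filter narrows the result further.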

**Reranking applied**

You can add a **reranking** step after retrieval simply by passing `search_kwargs`.

(Note that the reranker is a paid feature, so check the fee structure in advance.)
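
Conceptually, reranking takes the documents returned by the first-stage search, scores each one against the query with a second model, and keeps the best `top_n`. The sketch below shows that flow with a toy token-overlap scorer standing in for a real reranking model; `overlap_score` and `rerank` are hypothetical helpers and do not reflect how Pinecone's hosted rerankers compute scores.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def rerank(query, candidates, top_n, score_fn=overlap_score):
    """Score every candidate against the query and return the top_n (score, text) pairs."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


candidates = [
    "vector databases store embeddings",
    "LangChain can summarize a PDF",
    "PDF files",
]
top = rerank("pdf langchain", candidates, top_n=2)
for score, text in top:
    print(score, text)
```

The first-stage order of `candidates` does not matter to the final ranking; only the second-stage score does, which is why reranked results can come back in a different order than the plain retrieval results.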

In [44]:
# reranker not used
retrieval_results = pinecone_retriever.invoke(
    "pdf and langchain",
)

# Use BGE-reranker-v2-m3 model
reranked_results = pinecone_retriever.invoke(
    "pdf and langchain",
    search_kwargs={"rerank": True, "rerank_model": "bge-reranker-v2-m3", "top_n": 3},
)

In [45]:
# Compare retrieval_results and reranked_results.
for res1, res2 in zip(retrieval_results, reranked_results):
    print("[Retrieval]")
    print(res1.page_content)
    print("\n------------------\n")
    print("[Reranked] rerank_score: ", res2.metadata["rerank_score"])
    print(res2.page_content)

[Retrieval]
The process involves utilizing a Langchain tool to summarize the PDF and extract the essential
information. By employing this technique, the research aims to determine how well the user
comprehends the summarized content.

------------------

[Reranked] rerank_score:  0.9884919
framework designed for summarization tasks. LangChain is employed to summarize PDF
content, creating a condensed version of the information that retains key insights and context.
3.3 Summarizing Content and User's Understanding Comparison
[Retrieval]
framework designed for summarization tasks. LangChain is employed to summarize PDF
content, creating a condensed version of the information that retains key insights and context.
3.3 Summarizing Content and User's Understanding Comparison

------------------

[Reranked] rerank_score:  0.97903574
The process involves utilizing a Langchain tool to summarize the PDF and extract the essential
information. By employing this technique, the research aims to det

## License Information

This document includes content from external sources licensed under **Creative Commons Attribution 4.0 International (CC BY 4.0)**. Proper attribution is provided as follows:

- Title: *Automated Summarisation and Evaluation Framework*  
- Authors: Bagiya Lakshmi S, Sanjjushri Varshini R, Rohith Mahadevan, Raja CSP Raman  
- Source: [arXiv](https://arxiv.org/abs/2310.02759)  
- License: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)  

The license permits sharing and adaptation with proper attribution. For further details, visit [Creative Commons](https://creativecommons.org/licenses/by/4.0/).