<div class="alert alert-block alert-warning">
<b>Note:</b> This notebook uses KDB.AI 0.0.9, which has a temporary version of the Python Client API that will change significantly in the final version for release in September 2023.
    
</div>

# Semantic Search on PDF Documents with KDB.AI

This example demonstrates how to use KDB.AI to run semantic search on unstructured text documents. 

Semantic search allows users to perform searches based on the meaning or similarity of the data rather than exact matches. It works by converting the query into a vector representation and then finding similar vectors in the database. This way, even if the query and the data in the database are not identical, the system can identify and retrieve the most relevant results based on their semantic meaning.
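
To make the idea concrete, here is a minimal sketch of vector-based search in plain Python. It uses hand-made 3-dimensional vectors in place of real embeddings; the vectors, the example texts, and the `cosine_similarity` and `most_similar` helpers are illustrative assumptions and are not part of KDB.AI or Sentence Transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vector, vectors, k=1):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Hypothetical toy "embeddings"; real models produce hundreds of dimensions
texts = ["stars and galaxies", "cooking pasta", "planets orbiting stars"]
vectors = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]

query_vector = [1.0, 0.2, 0.0]  # stands in for an embedded astronomy query
top = most_similar(query_vector, vectors, k=2)
print([texts[i] for i in top])
```

The two astronomy texts rank highest even though the query vector matches neither of them exactly, which is the behaviour the rest of this notebook relies on at a larger scale.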

## Aim
In this tutorial, we'll walk you through the process of performing semantic search on documents, taking PDFs as an example and using KDB.AI as the vector store. We will cover the following topics:

- How to create vector embeddings using Sentence Transformers
- How to store those embeddings in KDB.AI
- How to search with a query using KDB.AI

## 1. Load and Split Document

### Install Dependencies:

We first need to install some libraries:
- [PyPDF2](https://pypi.org/project/PyPDF2/), a library for handling PDFs in Python
- [spaCy](https://spacy.io/), a natural language processing library that will help identify sentences in the PDF


In [1]:
%pip install PyPDF2 -q
%pip install spacy -q

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
!python -m spacy download en_core_web_sm -q

[0m[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


### Load and Split PDF into Sentences

We use PyPDF2 for PDF processing and spaCy for natural language processing. The code below extracts the text from each page of the PDF and splits it into sentences.

The PDF we are using is [this research paper](https://arxiv.org/pdf/2308.05801.pdf) presenting information on the formation of Interstellar Objects in the Milky Way.

In [3]:
import PyPDF2
import spacy

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

def split_pdf_into_sentences(pdf_path):
    # Open the PDF file
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        
        # Extract text from each page and concatenate
        full_text = ""
        for page_number in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_number]
            full_text += page.extract_text()
        
        # Process the text using spaCy for sentence tokenization
        doc = nlp(full_text)
        sentences = [sent.text for sent in doc.sents]
        
        return sentences

# Define PDF path
pdf_path = 'research_paper.pdf'

# Split the PDF into sentences
pdf_sentences = split_pdf_into_sentences(pdf_path)
len(pdf_sentences)

393

In [4]:
pdf_sentences[0]

'Draft version August 14, 2023\nTypeset using L ATEX default style in AASTeX631\nThe Galactic Interstellar Object Population: A Framework for Prediction and Inference\nMatthew J. Hopkins\n ,1Chris Lintott\n ,1Michele T. Bannister\n ,'
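
The first sentence above shows that the extracted text keeps the line breaks from the PDF layout. The notebook embeds the sentences as they are. If you want tidier text for display, one option is to collapse whitespace first. This is a sketch only: `clean_sentence` is a hypothetical helper and not part of PyPDF2 or spaCy.

```python
import re

def clean_sentence(sentence):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", sentence).strip()

sample = "Draft version August 14, 2023\nTypeset using L ATEX default style in AASTeX631\n"
print(clean_sentence(sample))
```

This does not rejoin words that the PDF hyphenated across lines; doing that safely would need more care, since some hyphens are genuine.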

## 2. Create Vector Embeddings 

Next, we use the Sentence Transformers library to create embeddings for our collection of sentences.

### Installation

In [5]:
%pip install -qU sentence-transformers

Note: you may need to restart the kernel to use updated packages.


### Selecting a Sentence Transformer model

There are over 100 Sentence Transformers models available; see [HuggingFace](https://huggingface.co/sentence-transformers) for the full list. They differ mainly in the data they were trained on. When selecting a model, look for one whose training domain and task closely match your own, while also considering the benefits of models trained on larger datasets.

This tutorial uses the pre-trained `all-MiniLM-L6-v2` model. It creates sentence and document embeddings that can be used for a wide variety of tasks, including semantic search, which makes it a good choice for our needs.

In [6]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained embedding model (downloaded on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")

### Generate embeddings

In [7]:
import numpy as np 

# Create embeddings
embeddings = model.encode(np.array(pdf_sentences))
embeddings.shape

(393, 384)

<div class="alert alert-block alert-info">
<b>Important:</b> Note that the dimension of our embeddings is 384, the second value returned by <code>shape</code>. This must match the dimensions we set in the KDB.AI index in the next step.
</div>
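
One way to avoid a mismatch is to read the dimension from the data rather than typing it by hand. The sketch below assumes nothing about KDB.AI: `embedding_dims` is a hypothetical helper that accepts a NumPy array or a plain list of vectors, and the toy data stands in for the real embeddings.

```python
def embedding_dims(vectors):
    """Return the shared vector length, or raise if rows differ in length."""
    lengths = {len(v) for v in vectors}
    if len(lengths) != 1:
        raise ValueError(f"expected one vector length, found {sorted(lengths)}")
    return lengths.pop()

# Toy stand-in for the (393, 384) embeddings array
toy_embeddings = [[0.0] * 384 for _ in range(5)]
dims = embedding_dims(toy_embeddings)
print(dims)
```

With the notebook's array, `embedding_dims(embeddings)` would give the value to use for the index dimensions.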

## 3. Store Embeddings in KDB.AI

With the embeddings created, we need to store them in a vector database to enable efficient searching. KDB.AI is a vector database designed for this kind of similarity search.

### Create a vector index

With KDB.AI we have the choice between HNSW (Hierarchical Navigable Small World) and IVFPQ (Inverted File with Product Quantization) indexing methods. Generally, for semantic search of documents, the HNSW indexing method might be more suitable. Here's why:

- **Search Speed and Approximation**: HNSW is designed for fast approximate nearest neighbor searches. It can efficiently handle high-dimensional data, which is common in natural language processing tasks involving text documents.
- **Semantic Representation**: The Sentence Transformers library, used in this example, generates embeddings that capture semantic meaning. HNSW is well-suited for indexing such embeddings and performing semantic searches.
- **Scalability**: HNSW is scalable and can handle large datasets effectively, making it suitable for applications with a vast number of documents.

HNSW provides approximate search results, meaning that the returned neighbors might not be the exact nearest matches but are close in terms of similarity. IVFPQ is also an approximate method: it compresses vectors with product quantization, which typically lowers memory usage at some cost in accuracy, so it is mainly worth considering when the dataset is too large to index uncompressed.
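
To see what "approximate" means in practice, it can help to compare against an exact search. The sketch below is a brute-force nearest-neighbour search under L2 distance in plain Python. It is not how KDB.AI searches, and `exact_l2_search` and `recall_at_k` are hypothetical helper names. Exact search compares the query with every stored vector, which is the cost that approximate indexes avoid.

```python
import math

def exact_l2_search(query, vectors, k):
    """Return (distances, indices) of the k nearest vectors by L2 distance."""
    scored = sorted((math.dist(query, v), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

def recall_at_k(approx_indices, exact_indices):
    """Fraction of the exact neighbours that the approximate result found."""
    return len(set(approx_indices) & set(exact_indices)) / len(exact_indices)

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = exact_l2_search([0.9, 0.1], vectors, k=2)
print(indices)

# Pretend an approximate index returned [1, 2] for the same query
print(recall_at_k([1, 2], indices))
```

On a small collection such as the 393 sentences here, a brute-force pass like this is cheap enough to use as a reference when judging the quality of an approximate index.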

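The command below creates an HNSW index named `myHNSW` for 384-dimensional vectors, matching the dimensions of our embeddings identified in the previous step. It uses L2 (Euclidean) distance as the similarity metric. In HNSW, `M` sets how many links each node keeps in the graph and `efConstruction` sets how many candidates are considered while building it; larger values generally improve accuracy at the cost of build time and memory.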

In [8]:
import kdbai
index = kdbai.Index('myHNSW', dict(type='hnsw', metric='L2', dims=384, efConstruction=8, M=8))

### Add embeddings to index

In [9]:
index.insert(embeddings)

## 4. Searching with a Query using KDB.AI

Now that the embeddings are stored in KDB.AI, we can perform semantic search using `search`. 

First, we embed our search term using the Sentence Transformer model as before. Then we search our index to return the 3 most similar vectors.

In [10]:
search_term = 'number of interstellar objects in the milky way'

search_term_vector = model.encode(search_term)

# search returns a tuple; element [1] is used to index back into the sentences
results = index.search(np.array(search_term_vector), 3)
np.array(pdf_sentences)[results[1]]

array([['2J. Ted Mackereth\n ,3, 4, 5, ∗and\nJohn C. Forbes\n2\n1Department of Physics, University of Oxford, Denys Wilkinson Building, Keble Road, Oxford, OX1 3RH, UK\n2School of Physical and Chemical Sciences—Te Kura Mat¯ u, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand\n3Just Group plc, Enterprise House, Bancroft road, Reigate, Surrey RH2 7RP, UK\n4Canadian Institute for Theoretical Astrophysics, University of Toronto, 60 St. George Street, Toronto, ON, M5S 3H8, Canada\n5Dunlap Institute for Astronomy and Astrophysics, University of Toronto, 50 St. George Street, Toronto, ON M5S 3H4, Canada\nABSTRACT\nThe Milky Way is thought to host a huge population of interstellar objects (ISOs), numbering\napproximately 1015pc−3around the Sun, which are formed and shaped by a diverse set of processes\nranging from planet formation to galactic dynamics.',
        'In this work, we develop\nthis method and apply it to the stellar population of the Milky Way, estimated 

In [11]:
search_term = 'how does planet formation occur'

search_term_vector = model.encode(search_term)

# search returns a tuple; element [1] is used to index back into the sentences
results = index.search(np.array(search_term_vector), 3)
np.array(pdf_sentences)[results[1]]

array([['The pop-\nulation’s dominant dynamical formation mechanisms would preferentially harvest more distant, ice-rich planetesimals\nfrom the disks of the source systems.',
        'A protoplanetary disk has to first order the same composition as the star it forms around,\nsince they both form from the same molecular cloud core.',
        'While in reality, stars will each produce a distribution of ISOs that\nformed at different positions in their protoplanetary disk and thus have a range of compositions, this simplification\nof only modelling planetesimals which form exterior to the water ice line is justified by the proportionally greater\nreservoir of snowline-exterior planetesimals, and the higher efficiencies of formation mechanisms dynamically stripping\nthem into the interstellar population (Fitzsimmons et al. 2023).']],
      dtype='<U5531')
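
The raw array output is hard to read because the sentences still contain PDF line breaks. The search code above uses the returned positions to index into `pdf_sentences`, so the same positions can be turned into a ranked list. The following is a sketch with made-up sentences and indices; `show_results` is a hypothetical helper and not part of the KDB.AI client.

```python
def show_results(sentences, indices, width=60):
    """Return ranked, single-line previews for the given sentence positions."""
    lines = []
    for rank, i in enumerate(indices, start=1):
        preview = " ".join(sentences[i].split())  # flatten PDF line breaks
        if len(preview) > width:
            preview = preview[: width - 3] + "..."
        lines.append(f"{rank}. [{i}] {preview}")
    return lines

toy_sentences = ["First\nsentence.", "Second sentence.", "Third\nsentence here."]
for line in show_results(toy_sentences, [2, 0]):
    print(line)
```

The helper expects a flat sequence of integer positions. The nested output above suggests the search returns a two-dimensional result, so it would likely need flattening before being passed in.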