<a href="https://colab.research.google.com/github/Nanditha-V/longchain/blob/master/text_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Semantic search has become a key technique in information retrieval. It lets us search for and retrieve documents based on their meaning or concepts rather than on keyword matching alone.

In the first approach, we use OpenAI embeddings, LangChain (a framework that provides tools and APIs for building applications powered by LLMs), and
FAISS, a library designed for fast nearest-neighbor retrieval in high-dimensional spaces, which keeps semantic nearest-neighbor search quick even at large scale.

In the second approach, we use Sentence Transformers, a library of deep learning models that generate dense vector representations of sentences and capture their semantic meaning. We use the "paraphrase-MiniLM-L6-v2" model because it maps sentences and paragraphs to a 384-dimensional dense vector space, which suits semantic search.

Steps followed in both approaches, in short:
1. Read the PDF (with a library such as PyPDF2, pdfminer, or pdfplumber) and chunk it into paragraphs (or pages).
2. Convert the text into embeddings.
3. Use the query text to search the document.
4. Score the matches with a similarity measure (such as cosine similarity, Levenshtein distance, or Jaccard distance).
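
To make step 4 concrete, here is a minimal sketch of two of the measures named above, written in plain Python. The function names and the toy vectors are illustrative assumptions, not part of the notebook. Cosine similarity works on embedding vectors, while Jaccard similarity (one minus the Jaccard distance) works on the raw words.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def jaccard_similarity(text_a, text_b):
    """Shared-word ratio: size of the intersection over size of the union."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

# Toy 3-dimensional "embeddings" (made-up values, for illustration only)
query_vec = [1.0, 0.0, 1.0]
chunk_vec = [2.0, 0.0, 2.0]
print(cosine_similarity(query_vec, chunk_vec))
print(jaccard_similarity("sheep slaughter", "sheep slaughter practices"))
```

Cosine similarity ignores vector length and compares direction only, which is why it is the usual choice for embeddings. Word-overlap measures such as Jaccard cannot see that two differently worded passages mean the same thing.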

First Approach: Using OpenAI Embeddings, FAISS, and LangChain

In [None]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "-----"

In [None]:
#Read the PDF
pdfreader = PdfReader('/content/Animal Welfare Report 2021.pdf')

In [None]:
#PDF Parsing: concatenate the text of every page that has extractable text
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

In [None]:
# Split the text with CharacterTextSplitter so that each chunk stays within the model's token limit
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)
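
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of overlapping fixed-size windows. It is not LangChain's implementation: `CharacterTextSplitter` splits on the separator first and then merges the pieces, whereas this hypothetical `sliding_chunks` helper ignores separators entirely and only shows how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=200):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghij" * 3  # 30 characters of stand-in text
for chunk in sliding_chunks(sample, chunk_size=12, chunk_overlap=4):
    print(len(chunk), chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.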

In [None]:
# Create the OpenAI embeddings client (it reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()

In [None]:
# Embed every chunk and build a FAISS index over the resulting vectors
document_search = FAISS.from_texts(texts, embeddings)

In [None]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [None]:
chain = load_qa_chain(OpenAI(), chain_type="stuff")
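
The "stuff" chain type places all of the retrieved chunks into a single prompt and sends that to the LLM in one call. The sketch below shows the idea only. The `build_stuffed_prompt` helper and the prompt wording are assumptions made for illustration and are not LangChain's actual template.

```python
def build_stuffed_prompt(chunks, question):
    """Join every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Made-up chunks standing in for the results of a similarity search
retrieved = ["First retrieved passage.", "Second retrieved passage."]
print(build_stuffed_prompt(retrieved, "sheep slaughter"))
```

Because everything goes into one prompt, this strategy only works while the combined chunks fit in the model's context window, which is one reason the chunk size is kept small.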

In [None]:
query = "sheep slaughter"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' 100% of the sheep slaughtered are stunned prior to slaughter, excluding religious slaughter, and 79% are stunned even with those cases taken into consideration.'

In [None]:
query = "chicken"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' It is known that 100% of the chicken meat we purchase comes from animals raised in cage-free conditions; we are enhancing our monitoring in order to increase the answers given by some of suppliers (31.82% of the mapped chain). It is also known that the chicken meat we purchase complied with the transportation limit of up to eight hours in the case of 99.04% of the animals (38.02% of the mapped chain). We also know that 100% of the chicken meat we purchase came from animals stunned prior to slaughter with the effectiveness of stunning being on average 99.04% of cases.'

**Second Approach: Using Sentence Transformers**

In [None]:
import fitz  # PyMuPDF
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import re
import string

In [None]:
#Using model "paraphrase-MiniLM-L6-v2"
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [None]:
#Process the PDF and Create Chunks
def preprocess_text(text):
    # Strip punctuation (escaped so regex metacharacters are treated literally) and lowercase
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = text.lower()
    return text

def chunk_pdf(pdf_path):
    pdf_document = fitz.open(pdf_path)
    chunks = []
    for page_num in range(pdf_document.page_count):
        page = pdf_document[page_num]
        text = page.get_text()
        # Split each page on blank lines and skip whitespace-only pieces
        paragraphs = [p for p in text.split('\n\n') if p.strip()]
        chunks.extend(paragraphs)
    return chunks

In [None]:
#Search PDF
def search_pdf(pdf_chunks, query):
    # Embed the cleaned query and every chunk with the same model
    query = preprocess_text(query)
    query_embedding = model.encode([query])[0]
    chunk_embeddings = model.encode(pdf_chunks)

    # Cosine similarity between the query and each chunk
    similarities = cosine_similarity([query_embedding], chunk_embeddings)[0]
    # Chunk indices ordered from most to least similar
    sorted_indices = sorted(range(len(similarities)), key=lambda k: similarities[k], reverse=True)

    results = [(pdf_chunks[i], similarities[i]) for i in sorted_indices]
    return results
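
`search_pdf` returns every chunk, ranked. In practice only the best few matches are usually of interest, so here is a minimal sketch of keeping the top k. The `top_k` helper and the example data are hypothetical and not part of the notebook.

```python
import heapq

def top_k(chunks, scores, k=3):
    """Return the k (chunk, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, zip(chunks, scores), key=lambda pair: pair[1])

# Made-up chunks and scores, for illustration only
example_chunks = ["intro", "sheep section", "chicken section", "appendix"]
example_scores = [0.10, 0.63, 0.41, 0.05]
for chunk, score in top_k(example_chunks, example_scores, k=2):
    print(f"{score:.2f}  {chunk}")
```

`heapq.nlargest` avoids sorting the full list when k is small compared with the number of chunks. Slicing the sorted output of `search_pdf` gives the same ranking.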



In [None]:
#Main

pdf_path = '/content/Animal Welfare Report 2021.pdf'

query = 'sheep slaughtering in USA'

pdf_chunks = chunk_pdf(pdf_path)
search_results = search_pdf(pdf_chunks, query)

for result, similarity in search_results:
    print(f"Similarity: {similarity:.2f}\n{result}\n")


Similarity: 0.63
 
 
 
Marfrig Animal Welfare Report 
 
 
36 
Sheep account for 0.0033% of the operations of Marfrig Global. The operations consist 
exclusively of sheep slaughtering at the company’s Patagonia slaughterhouse in Chile.  
Raising sheep 
Of the total number of sheep we purchase for slaughter, 100% are kept from birth up until 
slaughter on extensive grazing and fed on natural pasturage. Of all the animals, 100% 
are raised in a stocking density of 1 sheep / ha.  
All sheep are purchased from Chilean farmers. All must comply with the regulations of 
the Chilean Agricultural and Livestock Service and meet the clauses regarding Good 
Animal Welfare Practices, and ensure the quality of the product. We also request that 
100% possess PABCO certification.  
We thus ensure that 100% of the sheep involved in our operations are free from restrictive 
confinement and are raised in enriched environments. 
Regarding practices in the field: it is known that at least 46% of the animals

conclusion: