# Advanced Text Analysis and Retrieval System
## Introduction
This notebook builds a text analysis and retrieval system with `langchain`, `pymupdf`, `cohere`, `pinecone-client`, `PyPDF2`, `openai`, `datasets`, and `ragas`. It uses external API services from Cohere, Pinecone, and OpenAI for embedding, indexing, and model inference, and shows how these pieces fit together to process and search textual data.
## Main Functionality Overview
- **Environment Setup:** Install the necessary libraries and configure API keys for the Cohere, Pinecone, and OpenAI services.
- **Vector Database Integration:** Create a serverless index in Pinecone, a vector database designed for large-scale vector search, to support fast text retrieval.
- **Embedding Generation and PDF Processing:** Use `CohereEmbeddings` to generate text embeddings and `PdfReader` to extract text from PDF documents. The multilingual embedding model lets the system handle documents in several languages (the example uses Hebrew).
- **Data Indexing and Retrieval:** Split the text into manageable chunks, embed the chunks, and index them in Pinecone so that relevant passages can be retrieved quickly from a large dataset.
- **Question Generation and Evaluation:** Generate test questions from sampled chunks, answer them with Gemini, and score the results with Ragas metrics (faithfulness, answer relevancy, context recall, and context precision).

Together, these steps form a guide to building a text analysis and retrieval system on top of current natural language processing and vector database tooling.
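
The chunk, embed, index, and retrieve flow described above can be sketched without any external service. This is a toy illustration only: the word-count `embed` function and the in-memory list are hypothetical stand-ins for Cohere embeddings and a Pinecone index, and none of the function names come from the notebook.

```python
import math
from collections import Counter


def chunk_text(text: str, chunk_size: int) -> list:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list) -> list:
    """In-memory 'index': each chunk is stored next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def search(index: list, query: str, k: int = 1) -> list:
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

A real system replaces `embed` with a learned embedding model and the list with a vector database, but the order of the steps stays the same.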

<h1> Install and import libraries </h1>

<h3> This section installs and imports the libraries the project needs, <br>
so that every tool used in the analysis and retrieval steps is available. </h3>
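
Before running the imports, it can help to check which modules are actually installed. The following is a minimal sketch using only the standard library; the `missing_modules` helper and the `REQUIRED` list are illustrative assumptions, and the list holds import names rather than pip package names.

```python
from importlib.util import find_spec


def missing_modules(module_names: list) -> list:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Top-level import names used in this notebook (assumed list, adjust as needed).
REQUIRED = [
    "langchain", "pinecone", "PyPDF2", "cohere", "datasets",
    "ragas", "dotenv", "langchain_openai", "langchain_google_genai",
]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```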

In [None]:
# pip install langchain
# pip install pymupdf
# pip install cohere
# pip install pinecone-client
# pip install PyPDF2
# pip install openai
# pip install datasets
# pip install ragas
# pip install --upgrade --quiet  langchain-google-genai pillow
# pip install python-dotenv
# pip install langchain-openai

import os
import random
from pinecone import Pinecone, ServerlessSpec
from langchain.vectorstores import Pinecone as PineconeStore
from PyPDF2 import PdfReader
from langchain.embeddings import CohereEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from datasets import Dataset
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv
from ragas import evaluate
from langchain_openai.chat_models import AzureChatOpenAI
import google.generativeai as genai
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)


<h1> Create Pinecone index and load PDF </h1>

<h3> Here, we initialize a Pinecone index for efficient vector search and load a PDF document for analysis. <br>
This step sets up our data storage and retrieval system. </h3>
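
To make the chunking parameters used in this section concrete, here is a simplified sketch of fixed-size splitting with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the `split_with_overlap` function is a hypothetical helper that only illustrates what `chunk_size` and `chunk_overlap` mean.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With `chunk_overlap=0`, as in this notebook, the chunks are disjoint; a positive overlap repeats the tail of each chunk at the head of the next, which can help when an answer spans a chunk boundary.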

In [4]:
load_dotenv('keys.env')

PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
COHERE_API_KEY = os.getenv('COHERE_API_KEY')
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

INDEX_NAME = "quickstart"

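# Create the Pinecone client
pc = Pinecone(api_key=PINECONE_API_KEY)

embeddings = CohereEmbeddings(model = "embed-multilingual-v3.0", cohere_api_key=COHERE_API_KEY)

# Start from a clean slate: drop the index only if it already exists
# (delete_index raises an error when the index is missing)
if INDEX_NAME in [index.name for index in pc.list_indexes()]:
    pc.delete_index(INDEX_NAME)

# Create a serverless index if it does not exist yet
# "dimension" must match the dimension of the vectors you upsert
# (embed-multilingual-v3.0 produces 1024-dimensional vectors)
if INDEX_NAME not in [index.name for index in pc.list_indexes()]:
    pc.create_index(
        name=INDEX_NAME,
        dimension=1024,
        spec=ServerlessSpec(cloud='aws', region='us-west-2'),
    )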
    # Load the PDF (the example document contains Hebrew text)
    with open('example.pdf', 'rb') as pdf_file:  # Open the PDF in binary mode
        reader = PdfReader(pdf_file)  # Create a PdfReader object

        # Concatenate the text of all pages into one large string
        pages = ""
        for page in reader.pages:
            pages += page.extract_text()

    # Split the text into chunks of at most 1000 characters, without overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_text(pages)

    # Embed the chunks with Cohere and upsert them into the Pinecone index
    docsearch = PineconeStore.from_texts(texts, embeddings, index_name=INDEX_NAME)
else:
    # Reuse the existing index through LangChain's vector store wrapper
    text_field = "text"  # Metadata field that holds the chunk text
    index = pc.Index(INDEX_NAME)
    docsearch = PineconeStore(index, embeddings, text_field)



<h1> Create a large language model client using Azure </h1>
<h3> In this part, we connect to a chat model deployed on Azure OpenAI. <br>
This model generates the questions used to test retrieval and is also passed to the evaluation step. </h3>
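
Because `os.getenv` returns `None` for an unset variable, a missing key may only surface later as a confusing authentication error. The sketch below fails fast instead; `require_env` is a hypothetical helper, not part of any library used here.

```python
import os


def require_env(names: list) -> dict:
    """Return the values of the given environment variables.

    Raises RuntimeError if any of them is unset or empty.
    """
    values = {name: os.getenv(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values


# Example usage (commented out so the sketch runs without credentials):
# azure_settings = require_env(["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"])
```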

In [5]:
GPT_DEPLOYMENT_NAME="chatgpt_16k"

AZURE_OPENAI_API_KEY = os.getenv('AZURE_OPENAI_API_KEY')
AZURE_OPENAI_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT')


llm = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    azure_deployment=GPT_DEPLOYMENT_NAME,
    model='azure',
    validate_base_url=False
)

<h1> Retrieve k random documents from the index </h1>
<h3> For each document retrieved, we generate a question whose answer lies within the document's content and store these questions in a list named `questions`. <br>
This exercise tests the retrieval accuracy and relevance of the indexed data. </h3>
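
When picking the k chunks, drawing without replacement avoids generating two questions from the same chunk. A minimal sketch, assuming the chunks are held in a plain list; `pick_distinct` and its `seed` parameter are illustrative names, not part of the notebook.

```python
import random
from typing import Optional


def pick_distinct(items: list, k: int, seed: Optional[int] = None) -> list:
    """Pick up to k distinct items at random (sampling without replacement)."""
    rng = random.Random(seed)  # A local generator keeps runs reproducible when seed is set
    k = min(k, len(items))     # Never ask for more items than are available
    return rng.sample(items, k)
```

Passing a fixed `seed` makes the selected chunks, and therefore the generated questions, repeatable between runs.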

In [6]:
retriever = docsearch.as_retriever()
random_documents = random.sample(texts, k=2)  # Sample without replacement so the two chunks are distinct

questions = []
documents = []

# Create k questions whose answers are in the k docs, respectively
for doc in random_documents:
  template = "Generate a question in Hebrew whose answer is within the following text: {doc}"
  prompt = ChatPromptTemplate.from_template(template)

  # Setup RAG pipeline
  rag_chain = (
      {"context": retriever,  "doc": RunnablePassthrough()}
      | prompt
      | llm
      | StrOutputParser()
  )

  questions.append(rag_chain.invoke(doc))
  documents.append(doc)


<h1> Create a chat prompt using Gemini. </h1>
<h3>Using the document as context, we send prompts requesting answers to each question, respectively, through Gemini. 
The responses are saved in a list called `answers`, showcasing the interaction with the model and its understanding of the context. </h3>
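
The prompt for each question can be built by a small function, which keeps the template in one place and avoids carrying source-code indentation into the text sent to the model. This is a sketch under assumptions: `build_prompt` is a hypothetical helper, and its wording paraphrases the template used in this section.

```python
from textwrap import dedent


def build_prompt(question: str, context: str) -> str:
    """Fill the question-answering template for one (question, context) pair."""
    # dedent() runs on the template before substitution, so multi-line
    # context text cannot interfere with the indentation stripping.
    template = dedent(
        """\
        You are an assistant for question-answering tasks.
        Use the retrieved context below to answer the question in Hebrew.
        Use two sentences maximum and keep the answer concise.
        Question: {question}
        Context: {context}
        Answer:"""
    )
    return template.format(question=question, context=context)
```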

In [7]:
model = "gemini-pro"
genai.configure(api_key=GOOGLE_API_KEY)
generation_config = {
    "temperature": 1,
    "top_p": 1,
    "top_k": 1,
    "max_output_tokens": 1024,
}
safety_settings = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_ONLY_HIGH"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_ONLY_HIGH"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_ONLY_HIGH"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_ONLY_HIGH"},
]

gemini = genai.GenerativeModel(
    model_name=model,
    safety_settings=safety_settings,
    generation_config=generation_config,
)

answers = []
chat = gemini.start_chat()

# Answer each question using the document it was generated from as context
for question, document in zip(questions, documents):

  # Define prompt template
  template = f"""You are an assistant for question-answering tasks.
  Use the following pieces of retrieved context and question below to answer the question in Hebrew.
  Use two sentences maximum and keep the answer concise.
  Question: {question}
  Context: {document}
  Answer:
  """

  response = chat.send_message(template)
  answers.append(response.text)


<h1>Prepare a dataset to be used with metrics evaluation.</h1>
<h3> This section involves preparing a dataset that will be used to evaluate the effectiveness of our retrieval and question-answering system. <br>
It sets the stage for assessing performance and accuracy. </h3>
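
A dataset built from parallel lists is only meaningful when every column has one entry per question. The sketch below checks this before the dataset is created; `build_eval_columns` is a hypothetical helper, and the column names simply mirror the ones used in this notebook.

```python
def build_eval_columns(questions: list, answers: list, contexts: list, ground_truths: list) -> dict:
    """Assemble the evaluation columns, verifying they all have the same length."""
    columns = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,            # One list of retrieved passages per question
        "ground_truths": ground_truths,  # One list of reference answers per question
    }
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: " + str(lengths))
    return columns
```

The returned dictionary can then be passed to `Dataset.from_dict`.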

In [None]:
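# Treat the Gemini answers generated from the source documents as ground truth.
# Each entry is a one-element list (avoids shadowing the built-in `list`).
ground_truths = [[answer] for answer in answers]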

print(len(questions))
print(len(answers))

contexts = []
answersb = []

# Inference: retrieve contexts for each question and ask Gemini the bare question
for query in questions:
    contexts.append([doc.page_content for doc in retriever.get_relevant_documents(query)])
    response = chat.send_message(query)
    answersb.append(response.text)

# To dict
data = {
    "question": questions,
    "answer": answersb,
    "contexts": contexts,
    "ground_truths": ground_truths
}

# Convert dict to dataset
dataset = Dataset.from_dict(data)

<h1> Evaluate metrics using dataset </h1>
<h3> Finally, we evaluate several metrics using the prepared dataset. <br>
This evaluation helps us understand the strengths and weaknesses of our system and points to areas for improvement.</h3>
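
Once per-question scores are available, averaging each metric gives a quick view of where the system is weakest. This is a minimal sketch using only the standard library; `summarize_scores` is a hypothetical helper, it assumes the scores have been exported as a list of row dictionaries (for example with `df.to_dict("records")`), and the numbers in the example are made-up placeholders, not results from this notebook.

```python
from statistics import mean


def summarize_scores(rows: list, metric_names: list) -> dict:
    """Average each metric over all evaluated questions."""
    return {name: mean(row[name] for row in rows) for name in metric_names}


# Placeholder scores for illustration only (not real evaluation output).
example_rows = [
    {"faithfulness": 1.0, "context_recall": 0.5},
    {"faithfulness": 0.5, "context_recall": 0.5},
]
summary = summarize_scores(example_rows, ["faithfulness", "context_recall"])
weakest = min(summary, key=summary.get)  # Metric with the lowest average score
print(summary, "weakest:", weakest)
```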

In [None]:
from IPython.display import display

result = evaluate(
    llm=llm,
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        answer_relevancy,
        faithfulness
    ],
)

df = result.to_pandas()
display(df)