### This Python notebook extracts text from the documents, creates embeddings for it, and upserts them into the Pinecone vector database.

## Installing the required libraries.

In [1]:
!pip install -q --upgrade langchain pypdf pinecone-client google-generativeai langchain-google-genai python-dotenv

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch2tflite 1.0.0 requires tflite-runtime~=2.5, which is not installed.
mediapipe 0.10.8 requires protobuf<4,>=3.11, but you have protobuf 4.25.2 which is incompatible.
tensorboard 2.10.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 4.25.2 which is incompatible.
tensorflow 2.10.0 requires protobuf<3.20,>=3.9.2, but you have protobuf 4.25.2 which is incompatible.
tensorflow-intel 2.14.0 requires keras<2.15,>=2.14.0, but you have keras 2.10.0 which is incompatible.
tensorflow-intel 2.14.0 requires tensorboard<2.15,>=2.14, but you have tensorboard 2.10.1 which is incompatible.
tensorflow-intel 2.14.0 requires tensorflow-estimator<2.15,>=2.14.0, but you have tensorflow-estimator 2.10.0 which is incompatible.
tf2onnx 1.15.1 requires protobuf~=3.20.2, but you have protobuf 4.25.2 which is incompatible.
torch2t

In [1]:
import os
import time
import uuid
import warnings

import google.generativeai as genai
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from pinecone import Pinecone, PodSpec

## Setting up the API Key

In [2]:
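# Load the API key from a local .env file instead of hard-coding it
import os
from dotenv import load_dotenv

load_dotenv()  # must run before os.getenv so the variables in .env are visible
GOOGLE_API_KEY = os.getenv("GEMINI_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)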

In [8]:
# Create a new .env file in the workspace and store the API key in it.
# The variable name must match the one read by os.getenv above.
# !echo -e 'GEMINI_API_KEY={YOUR_GOOGLE_API_KEY_TEXT}' > .env

In [3]:
load_dotenv()

True
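
`os.getenv` returns `None` without complaint when a variable is missing, so a typo in `.env` only surfaces later as an authentication error. A minimal sketch of a stricter lookup follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        # Hypothetical error text; adjust to taste
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value


# Example usage in this notebook (after load_dotenv() has run):
# GOOGLE_API_KEY = require_env("GEMINI_API_KEY")
```

The same helper could be used for the Pinecone variables read later in the notebook.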

In [4]:
!ls -a

.
..
.env
.git
.gitignore
.streamlit
LICENSE
README.md
__pycache__
documents
knowledge-upsert.ipynb
mainapp.py
pythonenv.cfg
requirements.txt
utils.py


## Loading the model

In [None]:
model = genai.GenerativeModel('gemini-pro')

In [None]:
%%time
response = model.generate_content("What is the true meaning of life?")

CPU times: user 246 ms, sys: 28.1 ms, total: 274 ms
Wall time: 19.8 s


In [None]:
response

<google.generativeai.types.generation_types.GenerateContentResponse at 0x797c3b1d9d80>

In [None]:
response.text

"The true meaning of life is a philosophical and existential question that has been debated and contemplated by humans for centuries. There is no one definitive answer, as the meaning of life is subjective and personal to each individual. However, there are many different philosophical, religious, spiritual, and personal perspectives that offer potential answers to this question.\n\n1. **Purpose and Fulfillment**: Some people believe that the meaning of life is to find a purpose or calling that brings them fulfillment and satisfaction. This could involve pursuing a career, engaging in creative activities, making a positive impact on the world, or simply finding joy and contentment in everyday life.\n\n2. **Relationships and Human Connection**: For many people, the meaning of life is found in their relationships with others. This could include family, friends, romantic partners, or a sense of community. Building strong and meaningful connections with others can provide a sense of belong

In [None]:
print(response.candidates[0].content)

parts {
  text: "The true meaning of life is a philosophical and existential question that has been debated and contemplated by humans for centuries. There is no one definitive answer, as the meaning of life is subjective and personal to each individual. However, there are many different philosophical, religious, spiritual, and personal perspectives that offer potential answers to this question.\n\n1. **Purpose and Fulfillment**: Some people believe that the meaning of life is to find a purpose or calling that brings them fulfillment and satisfaction. This could involve pursuing a career, engaging in creative activities, making a positive impact on the world, or simply finding joy and contentment in everyday life.\n\n2. **Relationships and Human Connection**: For many people, the meaning of life is found in their relationships with others. This could include family, friends, romantic partners, or a sense of community. Building strong and meaningful connections with others can provide a

In [None]:
# Accessing parts of a response with multiple parts:
for part in response.parts:
    print(part.text)

# The same parts are also reachable through the first candidate
candidate_parts = response.candidates[0].content.parts


The true meaning of life is a philosophical and existential question that has been debated and contemplated by humans for centuries. There is no one definitive answer, as the meaning of life is subjective and personal to each individual. However, there are many different philosophical, religious, spiritual, and personal perspectives that offer potential answers to this question.

1. **Purpose and Fulfillment**: Some people believe that the meaning of life is to find a purpose or calling that brings them fulfillment and satisfaction. This could involve pursuing a career, engaging in creative activities, making a positive impact on the world, or simply finding joy and contentment in everyday life.

2. **Relationships and Human Connection**: For many people, the meaning of life is found in their relationships with others. This could include family, friends, romantic partners, or a sense of community. Building strong and meaningful connections with others can provide a sense of belonging, 

### To Markdown

In [None]:
from IPython.display import display
from IPython.display import Markdown
import textwrap
def to_markdown(text):
  text = text.replace('•','*')
  return Markdown(textwrap.indent(text, '>', predicate=lambda _: True))
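
The `predicate=lambda _: True` argument matters because `textwrap.indent` skips whitespace-only lines by default. The short standalone sketch below (the sample text is made up) compares the two behaviours.

```python
import textwrap

sample = "First paragraph.\n\nSecond paragraph."

# Default predicate: the blank line gets no '>' prefix,
# so Markdown would render two separate blockquotes.
default_quoted = textwrap.indent(sample, ">")

# predicate=lambda _: True prefixes every line, including the blank one,
# which keeps the whole text inside a single blockquote.
fully_quoted = textwrap.indent(sample, ">", predicate=lambda _: True)

print(default_quoted)
print("---")
print(fully_quoted)
```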

In [None]:
to_markdown(response.text)

>The true meaning of life is a philosophical and existential question that has been debated and contemplated by humans for centuries. There is no one definitive answer, as the meaning of life is subjective and personal to each individual. However, there are many different philosophical, religious, spiritual, and personal perspectives that offer potential answers to this question.
>
>1. **Purpose and Fulfillment**: Some people believe that the meaning of life is to find a purpose or calling that brings them fulfillment and satisfaction. This could involve pursuing a career, engaging in creative activities, making a positive impact on the world, or simply finding joy and contentment in everyday life.
>
>2. **Relationships and Human Connection**: For many people, the meaning of life is found in their relationships with others. This could include family, friends, romantic partners, or a sense of community. Building strong and meaningful connections with others can provide a sense of belonging, purpose, and love, which can contribute to a fulfilling life.
>
>3. **Personal Growth and Self-Realization**: Some believe that the meaning of life is to embark on a journey of personal growth and self-realization. This could involve learning and developing skills, exploring new interests, challenging oneself, becoming more self-aware, and striving to become the best version of oneself.
>
>4. **Making a Positive Impact**: Others find meaning in making a positive difference in the world. This could involve volunteering, advocating for social or environmental causes, donating to charity, or simply trying to be a kind and compassionate person. Feeling like one is making a positive contribution to society or the world can provide a sense of purpose and fulfillment.
>
>5. **Experiencing and Appreciating Life**: For some people, the meaning of life is simply to experience and appreciate all that life has to offer. This could involve savoring moments of joy, exploring new places, immersing oneself in nature, or appreciating the beauty of art, music, or literature. Fully engaging with the world and appreciating the simple pleasures of life can bring a sense of contentment and meaning.
>
>6. **Spiritual and Religious Beliefs**: For those who hold religious or spiritual beliefs, the meaning of life may be derived from their faith or spiritual practices. This could involve following religious teachings, seeking spiritual enlightenment, connecting with a higher power, or striving to live in accordance with spiritual values.
>
>7. **Finding Balance and Harmony**: Some believe that the meaning of life lies in finding balance and harmony in all aspects of life. This could involve balancing work and personal life, maintaining healthy relationships, pursuing both physical and mental health, and striving for a sense of inner peace and contentment.
>
>8. **Creating Legacy**: For others, the meaning of life is to create a legacy or leave a lasting impact on the world. This could involve raising a family, writing a book, creating art, or inventing something that will benefit future generations. Leaving a positive mark on the world can provide a sense of purpose and meaning beyond one's own lifetime.
>
>Ultimately, the true meaning of life is unique to each individual and can evolve and change over time. It is a deeply personal and introspective question that requires ongoing reflection, exploration, and self-discovery.

## Loading PDF Files from Folder

In [5]:
# Extract data from the PDF
def load_pdf(data):
    """Load every PDF in the folder `data`; returns one Document per page."""
    loader = DirectoryLoader(data,
                    glob="*.pdf",
                    loader_cls=PyPDFLoader)

    documents = loader.load()

    return documents

In [6]:
%%time
data = load_pdf("documents/")

CPU times: total: 16.3 s
Wall time: 39 s


In [7]:
data

[Document(page_content="NITRAIPURChatbotpdf\n1.Howtocalculatecpi?\nAns:Youcancalculatecpibymultiplyingmarksandcreditthenaddthesevalueanddivideitbytotalcredits.\nGenesishttps://www.nitrr.ac.in/genesis.php\nThefirstPresidentofindependentIndiahonorableDr.RajendraPrasadlaidtheFoundationstoneofthecollegebuildingon14thSeptember1956.Theconstructionworkwascompletedin1962andinaugurationwason14thMarch1963byIndia'sfirstPrimeMinisterPt.JawaharlalNehru.AfterindependenceandwithreorganisationofthestateofMadhyaPradesh,thegovernmentattentionwasdirectedtowardsgivingprioritytooveralldevelopmentoftechnicaleducation.\nTillaslateas1956therewereonlythreetechnicalinstitutesinthecountryofferingcoursesintheimportantfieldsofMiningandMetallurgicalEngineering.Inviewofthisfactandalsowithanaimofharnessingtheamplemineralresourcesoftheregion,thisinstitutewasset-upon1stMay1956asGovernmentCollegeofMiningandMetallurgy.Thefirstsessionofthecollegecommencedfrom1stJuly1956withtheadmissionof15studentseachinMiningandMetallurgy
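
The extracted text above has lost the spaces between words, which is likely to hurt both embedding quality and retrieval. As a rough diagnostic, one could flag pages whose share of whitespace is unusually low. This is only a sketch: the function names and the 5% threshold are assumptions, and it does not detect other extraction problems such as one word per line.

```python
from typing import List, Sequence


def whitespace_ratio(text: str) -> float:
    """Fraction of characters in `text` that are whitespace (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def flag_suspect_pages(page_texts: Sequence[str], min_ratio: float = 0.05) -> List[int]:
    """Return indexes of non-empty pages whose whitespace ratio is below `min_ratio`."""
    return [
        i for i, text in enumerate(page_texts)
        if text and whitespace_ratio(text) < min_ratio
    ]


pages = [
    "Howtocalculatecpi?Youcancalculatecpibymultiplyingmarksandcredit",
    "How to calculate cpi? Multiply marks and credit.",
]
print(flag_suspect_pages(pages))
```

In this notebook it could be called as `flag_suspect_pages([page.page_content for page in data])`.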

In [14]:
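# Create text chunks
def text_split(data):
    """Split the loaded pages into overlapping chunks of up to 5000 characters."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=5000, chunk_overlap=300)
    text_chunks = text_splitter.split_documents(data)
    return text_chunks

# Full text of all pages joined together, used below to measure the corpus size
content = "\n\n".join(str(page.page_content) for page in data)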

In [15]:
text_chunks = text_split(data)
text_chunks

[Document(page_content="NITRAIPURChatbotpdf\n1.Howtocalculatecpi?\nAns:Youcancalculatecpibymultiplyingmarksandcreditthenaddthesevalueanddivideitbytotalcredits.\nGenesishttps://www.nitrr.ac.in/genesis.php\nThefirstPresidentofindependentIndiahonorableDr.RajendraPrasadlaidtheFoundationstoneofthecollegebuildingon14thSeptember1956.Theconstructionworkwascompletedin1962andinaugurationwason14thMarch1963byIndia'sfirstPrimeMinisterPt.JawaharlalNehru.AfterindependenceandwithreorganisationofthestateofMadhyaPradesh,thegovernmentattentionwasdirectedtowardsgivingprioritytooveralldevelopmentoftechnicaleducation.\nTillaslateas1956therewereonlythreetechnicalinstitutesinthecountryofferingcoursesintheimportantfieldsofMiningandMetallurgicalEngineering.Inviewofthisfactandalsowithanaimofharnessingtheamplemineralresourcesoftheregion,thisinstitutewasset-upon1stMay1956asGovernmentCollegeofMiningandMetallurgy.Thefirstsessionofthecollegecommencedfrom1stJuly1956withtheadmissionof15studentseachinMiningandMetallurgy

In [16]:
print("Number of chunks:", len(text_chunks))

Number of chunks: 480
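
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class first splits on separators such as paragraph and line breaks); it only illustrates how consecutive chunks share `chunk_overlap` characters. The function name is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each chunk starts with the last character of the previous one, which is the role the 300-character overlap plays in the notebook.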


In [17]:
print("The total number of characters in the context:", len(content))

The total number of characters in the context: 596206


In [18]:
text_chunks[101]

Document(page_content='9.\nParticle\nsize\nanalyzer\n300\nMI,\nZeiss\nprimo\nstar\nDeterminatio\n(Range\n0.5-10\n10.\nHPLC\n9300,YL\nInstruments\nSeparation\nnon-volatile\nunstable\ncom\n11.\nPotentiostat\nPMC-1000,\nParstat\nMC\nEIS\nstudy,\nTafel\nplot\netc\n12\n\xa0\nGC-MS\n\xa0\nGC-2010\nPlus,\nShimazdu\nTo\nide\ncompounds\nsample\n\xa0\n13\nParticle\nSize\n–\nZeta\nPotential\nAnalyzer\n\xa0\n\xa0\nLitesizer\n–\n500,\nAnton\nPaar\nTo\ndeterm\nsize,\nand\nliquid\nsamp\n\xa0\n14\n\xa0\nThermogravimetric\nAnalyzer\n(TGA)\n\xa0\nSetaram,\nLabsys\nEvo\n\xa0\nTo\ndeterm\nstability\nof\nt', metadata={'source': 'documents\\doc1.pdf', 'page': 101})

In [19]:
page_content_of_chunks = [text_chunk.page_content for text_chunk in text_chunks]
page_content_of_chunks

["NITRAIPURChatbotpdf\n1.Howtocalculatecpi?\nAns:Youcancalculatecpibymultiplyingmarksandcreditthenaddthesevalueanddivideitbytotalcredits.\nGenesishttps://www.nitrr.ac.in/genesis.php\nThefirstPresidentofindependentIndiahonorableDr.RajendraPrasadlaidtheFoundationstoneofthecollegebuildingon14thSeptember1956.Theconstructionworkwascompletedin1962andinaugurationwason14thMarch1963byIndia'sfirstPrimeMinisterPt.JawaharlalNehru.AfterindependenceandwithreorganisationofthestateofMadhyaPradesh,thegovernmentattentionwasdirectedtowardsgivingprioritytooveralldevelopmentoftechnicaleducation.\nTillaslateas1956therewereonlythreetechnicalinstitutesinthecountryofferingcoursesintheimportantfieldsofMiningandMetallurgicalEngineering.Inviewofthisfactandalsowithanaimofharnessingtheamplemineralresourcesoftheregion,thisinstitutewasset-upon1stMay1956asGovernmentCollegeofMiningandMetallurgy.Thefirstsessionofthecollegecommencedfrom1stJuly1956withtheadmissionof15studentseachinMiningandMetallurgyEngineering.\nIn1958-5

In [20]:
len(page_content_of_chunks)

480

## Loading the embedding model

In [21]:
embeddings = GoogleGenerativeAIEmbeddings(model = "models/embedding-001",google_api_key=GOOGLE_API_KEY)

In [22]:
embeddings

GoogleGenerativeAIEmbeddings(model='models/embedding-001', task_type=None, google_api_key=SecretStr('**********'), client_options=None, transport=None)

### Setting up Pinecone APIs

In [28]:
PINECONE_API_KEY = os.environ.get('PINE_API_KEY')
PINECONE_API_ENV = os.environ.get('PINE_API_ENV')
PINE_INDEX = os.environ.get('PINE_INDEX')

### Setting up the Pinecone index (required only once) for semantic search over the vector database

In [31]:
index_name=PINE_INDEX
spec = PodSpec(environment=PINECONE_API_ENV)
pc = Pinecone(
    api_key=PINECONE_API_KEY
)
index = pc.Index(index_name)

In [32]:
query_result = embeddings.embed_query("Hello world")
print("Length of The Embeddings is: ", len(query_result))

query_result[:10]

Length of The Embeddings is:  768


[0.05889487,
 -0.004501751,
 -0.067298084,
 -0.012740517,
 0.064561136,
 0.025551839,
 0.023632249,
 -0.039868433,
 -0.009893336,
 0.0501891]

In [26]:
# pc.delete_index(index_name)

In [30]:
import time
index_name=PINE_INDEX
# Create the Pinecone index if it does not exist yet
if index_name not in pc.list_indexes().names():
    pc.create_index(
        index_name,
        dimension=768,  # must match the length of the models/embedding-001 vectors
        metric='cosine',
        spec=spec
    )

    # Wait until the new index reports that it is ready
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
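
The index is created with `metric='cosine'`, so matches are ranked by the cosine of the angle between the query vector and each stored vector. A small standalone sketch of that measure on toy 2-dimensional vectors (the real embeddings have 768 dimensions):

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Because only the direction matters, scaling a vector does not change its similarity to another vector.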

In [33]:
embeddings

GoogleGenerativeAIEmbeddings(model='models/embedding-001', task_type=None, google_api_key=SecretStr('**********'), client_options=None, transport=None)

In [34]:
%%time
embedding = embeddings.embed_documents(page_content_of_chunks)

CPU times: total: 1.17 s
Wall time: 18 s


In [35]:
len(embedding)

480

In [36]:
# Upsert the embedding of each text chunk into the vector database,
# storing the chunk text as metadata so it can be returned at query time.
# IDs are random, so re-running this cell inserts duplicate vectors.
for i in range(len(embedding)):
    vector_dict = {
        "id": str(uuid.uuid4()),
        "values": embedding[i],
        "metadata": {"text": page_content_of_chunks[i]},
    }
    print(vector_dict)
    print("Upserting vector to Pinecone...")
    index.upsert([vector_dict])

{'id': '6d2f8e04-73ef-4b64-839c-2bbf638a4bcf', 'values': [0.032356106, -0.013251104, -0.049963262, 0.018946894, 0.09605494, 0.04954254, 0.040289942, -0.014565615, 0.002005008, 0.08722653, -0.023206964, 0.026112888, -0.020321952, -0.03663139, 0.017005635, -0.022776613, 0.021535823, 0.00089385203, 0.0012295232, -0.059695434, 0.017349862, -0.02000959, 0.017568998, -0.031875417, -0.0068000085, 0.0054202667, 0.009054977, -0.055254117, -0.03966554, -0.0017324735, -0.042378858, 0.041061733, -0.03879207, -0.0031198931, 0.0011743457, -0.04866712, 0.045427304, -0.008434978, 0.042338144, 0.081683, -0.00038832758, 0.015234285, -0.014170739, -0.0060523865, 0.01665115, -0.038076214, -0.014473763, 0.02069002, 0.055706386, -0.08411271, -0.021094311, -0.011101124, 0.08144073, -0.013562425, 0.027612414, -0.028815536, 0.02549657, -0.066505365, -0.050500985, 0.0069354516, 0.0016377415, -0.019371662, 0.018589204, 0.012835693, -0.011414338, -0.050631184, -0.042698488, 0.032584056, 0.053869087, -0.021492459,
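
The loop above sends one request per vector, 480 in total. A possible alternative is to group the records and pass several to each `index.upsert(vectors=...)` call. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the batch size of 100 is a guess that should be tuned to the service's request size limits, and `index` is any object with an `upsert(vectors=...)` method such as the Pinecone index handle.

```python
import uuid
from typing import Dict, Iterator, List, Sequence


def build_records(texts: Sequence[str], vectors: Sequence[Sequence[float]]) -> List[Dict]:
    """Pair each chunk text with its embedding in the dict format used by the loop above."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    return [
        {"id": str(uuid.uuid4()), "values": list(vec), "metadata": {"text": text}}
        for text, vec in zip(texts, vectors)
    ]


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(index, records: Sequence[Dict], batch_size: int = 100) -> int:
    """Upsert `records` batch by batch; returns the number of records sent."""
    sent = 0
    for batch in batched(records, batch_size):
        index.upsert(vectors=list(batch))
        sent += len(batch)
    return sent


# Example usage in this notebook:
# records = build_records(page_content_of_chunks, embedding)
# upsert_in_batches(index, records)
```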

In [37]:
# Use a separate loop variable so the `index` handle is not overwritten
for index_info in pc.list_indexes():
    print(index_info['name'])

nidhi


In [38]:
index_description = pc.describe_index("nidhi")
index_description

{'dimension': 768,
 'host': 'nidhi-9s1f050.svc.gcp-starter.pinecone.io',
 'metric': 'cosine',
 'name': 'nidhi',
 'spec': {'pod': {'environment': 'gcp-starter',
                  'pod_type': 'starter',
                  'pods': 1,
                  'replicas': 1,
                  'shards': 1}},
 'status': {'ready': True, 'state': 'Ready'}}

### Testing whether the semantic search works

In [45]:
query_string = "How to calculate cpi"

query_vector = embeddings.embed_query(query_string)
index = pc.Index(index_name)
vectors = index.query(
    vector = query_vector,
    top_k = 20,
    include_values = False,
    include_metadata = True
)

In [46]:
vectors

{'matches': [{'id': '461938b7-a7a2-42bb-aa24-3de372e482c4',
              'metadata': {'text': '●\n'
                                   'GST\n'
                                   'and\n'
                                   'other\n'
                                   'taxes\n'
                                   'as\n'
                                   'applicable\n'
                                   'at\n'
                                   'the\n'
                                   'time\n'
                                   'of\n'
                                   'test\n'
                                   'will\n'
                                   'be\n'
                                   'extra\n'
                                   'as\n'
                                   'per\n'
                                   'GOI\n'
                                   'on\n'
                                   'above\n'
                                   'mentioned\n'
                     

In [47]:
for i, match in enumerate(vectors['matches'], start=1):
    print(f"{i}. {match['metadata']['text']}\n")

1. ●
GST
and
other
taxes
as
applicable
at
the
time
of
test
will
be
extra
as
per
GOI
on
above
mentioned
price.
●
Testing
charges
are
to
be
paid
by
DD
in
favor
of 
Director ,
NIT
RAIPUR
 payable
at
RAIPUR.
●
Samples
to
be
submitted
to
HOD
Chemistry
Department,
NIT
Raipur .
●
o
v
●
List
of
Ph.D.
Scholars
DepartmentofComputerApplications
●
About
●
Curriculum
●
Laboratories
●
Faculty
●
Staﬀ
●
Syllabus
●
Events
●
Achievement
●
Downloads
●
Virtual
Tour
 
About
Department
of
Computer
Applications,
NIT
Raipur
Mission: 
The
Department
of
Computer
Applications
at
National
Institute
of
Technology
Raipur
imparts
quality
education
through
the
best
possible
post-graduate
educational
programs
in
the
field
of

2. TechnicalAssistant
9669338576
gitaraj_23@reddif fmail.com
TechnicalAssistant
7746816780
sammusoni9612@gmail.com
LabAssistant
9641305344
7001034709
uttamraij@gmail.com
AG-II(OnContract)
9993874142
Ujjwala.y@gmail.com
ahuTechnicalAssistant(OnContract)
9907897957
Khageshkumarsahu@gmail.com
LabAtt

## Making RAG Pipeline

In [None]:
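from langchain_google_genai import ChatGoogleGenerativeAI
# Named chat_model so that it does not overwrite `model` (the genai.GenerativeModel),
# which chat() below calls through generate_content
chat_model = ChatGoogleGenerativeAI(model="gemini-pro",
                             temperature=0.5,
                             convert_system_message_to_human=True,
                             )
chat_model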

ChatGoogleGenerativeAI(model='gemini-pro', client= genai.GenerativeModel(
   model_name='models/gemini-pro',
   generation_config={}.
   safety_settings={}
), temperature=0.5, convert_system_message_to_human=True)

In [None]:
def chat(input_query):
  """Retrieve the 20 most similar chunks from Pinecone and ask Gemini to answer from them."""
  query = input_query
  query_vector = embeddings.embed_query(query)
  index = pc.Index(index_name)
  vectors = index.query(
    vector = query_vector,
    top_k = 20,
    include_values = False,
    include_metadata = True
  )

  # Build a numbered context block from the text stored in each match's metadata
  context = ""
  for i, match in enumerate(vectors['matches'], start=1):
    context += f"{i}. {match['metadata']['text']}\n"

  prompt_template =f"""
    You are an AI Health chatbot made by NITRR MakerSpace. Please use the following information to answer the user's question.
    If you cannot answer the question using the given information, just reply 'I don't know'.
    Don't try to make up an answer, and always say 'thanks for asking!' at the end of the answer.

  Context: {context}\n
  User query: {query}

  Only return the helpful answer and nothing else.
  Helpful answer:
  """

  # `model` must be the genai.GenerativeModel from "Loading the model";
  # its response exposes .text and .prompt_feedback, which the output loop relies on
  response = model.generate_content(prompt_template)
  return response

## Output

To see only the output, first run these blocks in order: installing the required libraries, imports, Gemini and Pinecone API setup, To Markdown, loading the embedding model, setting up the Pinecone index, loading the Gemini-pro model, and the RAG pipeline. Then run the output block below. 😀

In [None]:
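while True:
  input_text = input("Enter your Question: ")
  if input_text.strip().lower() in {"exit", "quit"}:  # type exit or quit to stop the loop
    break
  answer = chat(input_text)
  print(f"User : {input_text} ")
  try:
    print("Bot: " + to_markdown(answer.text).data)
  except ValueError:
    # answer.text raises ValueError when the response was blocked
    print(f"Bot: {answer.prompt_feedback.safety_ratings[0]}")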

User : Hello, Who are you? 
Bot: >I am a AI Health Chatbot made by NITRR MakerSpace. Thanks for asking!
User : How aeroplane fly? 
Bot: >I don't know. Thanks for asking!
User : What is moon? 
Bot: >I don't know, thanks for asking!
User : What is malaria? 
Bot: >Malaria is a disease caused by organisms called protozoa. The only way to get malaria is to be bitten by a certain type of mosquito that has bitten someone who has the disease. Thanks for asking!
User : How to stay healthy? 
Bot: >A healthy lifestyle—eating right, regular exercise, maintaining a healthy weight, not smoking, and controlling hypertension—can reduce the risk of developing atherosclerosis, help keep the disease from progressing, and sometimes cause it to regress.
>
>thanks for asking!
User : What is child abuse? 
Bot: category: HARM_CATEGORY_SEXUALLY_EXPLICIT
probability: HIGH



In [None]:
answer.prompt_feedback.safety_ratings

[category: HARM_CATEGORY_SEXUALLY_EXPLICIT
probability: NEGLIGIBLE
, category: HARM_CATEGORY_HATE_SPEECH
probability: NEGLIGIBLE
, category: HARM_CATEGORY_HARASSMENT
probability: NEGLIGIBLE
, category: HARM_CATEGORY_DANGEROUS_CONTENT
probability: NEGLIGIBLE
]

In [None]:
data_type = type(answer.prompt_feedback.safety_ratings[0])

value = answer.prompt_feedback.safety_ratings[0]

result_string = f"Data Type: {data_type}\nValue:\n{value}"

print(result_string)


Data Type: <class 'google.ai.generativelanguage_v1beta.types.safety.SafetyRating'>
Value:
category: HARM_CATEGORY_SEXUALLY_EXPLICIT
probability: HIGH

