<a href="https://colab.research.google.com/github/SinaRampe/applications-with-LangChain/blob/main/PDF_Query_chain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.157-py3-none-any.whl (727 kB)
Collecting async-timeout<5.0.0,>=4.0.0
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7
  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)
Collecting aiohttp<4.0.0,>=3.8.3
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting openapi-schema-pydantic<2.0,>=1.2
  Downloading openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)

In [2]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS  # only FAISS is used in this notebook

In [28]:
# An OpenAI API key is required; create an OpenAI account to obtain one.
# Account and billing page: https://platform.openai.com/account/billing/overview
import os
os.environ["OPENAI_API_KEY"] = "none"  # placeholder: replace with your own key
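
Because the cell above ships with a placeholder value, a small guard can fail early with a clear message instead of a later authentication error. The following is only a sketch: the helper name `require_api_key` and the set of placeholder strings are assumptions made for illustration, not part of LangChain or the OpenAI client.

```python
import os

# Hypothetical placeholder values that should never be sent to the API.
PLACEHOLDERS = {"", "none", "your-api-key"}


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or a placeholder."""
    key = env.get(name, "").strip()
    if key.lower() in PLACEHOLDERS:
        raise RuntimeError(f"{name} is not set to a real key")
    return key
```

Calling `require_api_key()` right after setting the environment variable would surface a forgotten placeholder before any embedding or chat request is made.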

In [4]:
# connect your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"

Mounted at /content/gdrive


In [5]:
# Path to the PDF file to query.
reader = PdfReader('/content/gdrive/My Drive/data/Personalised prescribing_main report_1_0.pdf')

In [6]:
reader

<PyPDF2._reader.PdfReader at 0x7f5a00f4dc60>

In [7]:
# Read the text of every page and concatenate it into raw_text.
# Pages that yield no text (for example, scanned images) are skipped.
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text
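
The loop above silently skips pages without extractable text. A variant that also records which pages were skipped can help explain missing answers later. This is a minimal sketch: `extract_pages` and `FakePage` are hypothetical names, and `FakePage` only imitates the `extract_text()` method of a PyPDF2 page so the example runs without a PDF.

```python
def extract_pages(pages):
    """Return (joined_text, empty_page_numbers) for objects exposing extract_text()."""
    parts, empty = [], []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text()
        if text:
            parts.append(text)
        else:
            empty.append(number)  # remember pages that produced no text
    return "".join(parts), empty


class FakePage:
    """Hypothetical stand-in for a PyPDF2 page, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


text, empty = extract_pages([FakePage("a"), FakePage(""), FakePage("b")])
```

With the real reader, the same function could be called as `extract_pages(reader.pages)`.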

In [None]:
# raw_text

In [8]:
raw_text[:100]

'1\n© Royal College of Physicians and British Pharmacological Society 2022\nUsing pharmacogenomics to  '

In [9]:
# Split the text into smaller chunks so that the passages passed to the model
# during information retrieval stay within its token limit.
# With length_function=len, chunk_size and chunk_overlap are measured in characters, not tokens.

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
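
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified, pure-Python sketch of separator-based splitting with overlap. It illustrates the idea only and is not LangChain's implementation; the function name `split_with_overlap` is hypothetical, and a single piece longer than `chunk_size` is emitted as an oversized chunk.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into chunks that share a tail."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits the
            # overlap budget and leaves room for the new piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=10, chunk_overlap=5)
```

In the demo, each chunk repeats the last line of the previous one, which is the same effect visible between `texts[0]` and `texts[1]` in the notebook output.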

In [10]:
len(texts)

155

In [11]:
texts[0]

'1\n© Royal College of Physicians and British Pharmacological Society 2022\nUsing pharmacogenomics to  \nimprove patient outcomes\nA report from the Royal College of Physicians and  \nBritish Pharmacological Society joint working party2\n© Royal College of Physicians and British Pharmacological Society 2022\nRoyal College of Physicians\nThe Royal College of Physicians (RCP) plays a leading \nrole in the delivery of high-quality patient care by \nsetting standards of medical practice and promoting \nclinical excellence. The RCP provides physicians in over \n30 medical specialties with education, training and \nsupport throughout their careers. As an independent \ncharity representing more than 40,000 fellows and \nmembers worldwide, the RCP advises and works with \ngovernment, patients, allied healthcare professionals \nand the public to improve health and healthcare.\nBritish Pharmacological \nSociety\nThe British Pharmacological Society (BPS) is a \ncollaborative, global community at 

In [12]:
texts[1]

'and the public to improve health and healthcare.\nBritish Pharmacological \nSociety\nThe British Pharmacological Society (BPS) is a \ncollaborative, global community at the heart of \npharmacology. Founded in 1931, the BPS represents \nmore than 4,000 members from over 60 countries. As a \nvibrant science that studies drug action, pharmacology \nlies at the heart of biomedical science, linking together \nchemistry, physiology and pathology. Pharmacologists, \nboth clinical and non-clinical, work closely with a wide \nvariety of other disciplines that make up modern \nbiomedical science, including but not limited to clinical \nmedicine, biochemistry, neuroscience, molecular and \ncell biology, genetics, immunology and cancer biology. \nClinical pharmacology and therapeutics is a medical \nspecialty recognised by the RCP.Citation for this document \nRoyal College of Physicians and British Pharmacological \nSociety. Personalised prescribing: using \npharmacogenomics to improve patient ou

In [13]:
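# Create the OpenAI embeddings client (it reads OPENAI_API_KEY from the environment).
# No embeddings are computed yet; that happens when the index is built.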
embeddings = OpenAIEmbeddings()

In [14]:
# Embed every chunk and store the vectors in an in-memory FAISS index.
docsearch = FAISS.from_texts(texts, embeddings)
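
Conceptually, a vector store keeps one vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below shows that idea with a brute-force search over made-up two-dimensional vectors. The function names, the toy data, and the choice of Euclidean distance are assumptions for illustration; real embeddings have far more dimensions, and FAISS uses optimized index structures rather than a sorted list.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=4):
    """indexed: list of (text, vector); return the k texts closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy index with invented vectors, standing in for embedded chunks.
toy_index = [
    ("chunk about authors", [0.9, 0.1]),
    ("chunk about dosing", [0.1, 0.9]),
    ("chunk about the working party", [0.8, 0.3]),
]
nearest = top_k([1.0, 0.0], toy_index, k=2)
```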

In [15]:
docsearch

<langchain.vectorstores.faiss.FAISS at 0x7f59fa4d05b0>

In [24]:
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI

In [25]:
llm = ChatOpenAI(temperature=0, model_name="gpt-4")

In [26]:
chain = load_qa_chain(llm, chain_type="stuff")
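
The "stuff" chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea in plain Python. The function name `build_stuff_prompt` and the prompt wording are hypothetical and do not reproduce the template LangChain actually uses.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "who are the authors of the article?",
)
```

Because every chunk ends up in one prompt, the combined length of the retrieved chunks plus the question has to fit within the model's context window, which is why the text was split into small chunks earlier.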

In [27]:
query = "who are the authors of the article?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The authors of the report are not explicitly mentioned, but the members of the working party who contributed to the development of the report include:\n\n1. Professor Sir Munir Pirmohamed (co-chair) - British Pharmacological Society (BPS)\n2. Professor Donal O’Donoghue* (co-chair) - Royal College of Physicians (RCP)\n3. Dr Richard Turner (co-secretary) - BPS\n4. Dr Emma Magavern (co-secretary) - BPS\n5. Deborah Roebuck - RCP Patient and Carer Network\n6. Dr Paul Ross - Oncology\n7. Professor Bernard Keavney - Cardiology\n8. Professor Claire Shovlin - Respiratory medicine\n9. Dr Joyce Popoola - Renal medicine\n10. Dr Shuaib Nasser - Allergy and immunology\n11. Dr Meriel McEntagart - Clinical genetics\n12. Sonali Sanghvi - NHS England\n13. Dr Anneke Seller - Health Education England\n14. Dr Michelle Bishop - Health Education England\n15. Professor Sir Mark Caulfield - Genomics England\n16. Ravi Sharma - Royal Pharmaceutical Society\n17. Dr Imran Rafi - Royal College of General Practitio

In [19]:
query = "What is personalized medicine?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

" Personalized medicine is a form of medicine that takes into account a person's individual genetic makeup to guide treatment decisions and make them more precise and effective. It includes the use of pharmacogenomics to study how genes affect a person's response to drugs, as well as other approaches such as accounting for demographic, health, drug-food interactions and drug-drug interactions. It has the potential to improve patient outcomes and reduce preventable health conditions and costs to the NHS."