<a href="https://colab.research.google.com/github/SinaRampe/applications-with-LangChain/blob/main/2_GPT-4_for_pdf_QandA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [45]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [46]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

In [74]:
# You need an OpenAI account and an API key to run this notebook.
# Account and billing overview: https://platform.openai.com/account/billing/overview
import os
os.environ["OPENAI_API_KEY"] = "" # insert openai api key
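
Pasting the key into a cell makes it easy to leak when the notebook is shared. As a minimal sketch, assuming a hypothetical helper named `require_api_key` (not part of openai or langchain), the check below fails early when the key is missing instead of failing later inside an API call:

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for illustration; `env` defaults to os.environ and
    can be replaced by a plain dict when testing.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; set it before creating the model")
    return value
```

In Colab, one option is to fill the variable with `getpass.getpass("OpenAI API key: ")` so the key never appears in the saved notebook, then call `require_api_key()` before building the embeddings or the chat model.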

In [48]:
# connect your Google Drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"

Mounted at /content/gdrive


In [49]:
# Path to the PDF file on the mounted drive.
reader = PdfReader('/content/gdrive/My Drive/data/Personalised prescribing_main report_1_0.pdf')

In [50]:
reader

<PyPDF2._reader.PdfReader at 0x7f59fa780160>

In [51]:
# Extract the text of every page and concatenate it into raw_text.
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:  # skip pages with no extractable text
        raw_text += text

In [52]:
# raw_text

In [53]:
raw_text[:100]

'1\n© Royal College of Physicians and British Pharmacological Society 2022\nUsing pharmacogenomics to  '

In [54]:
# Split the text into smaller chunks so that retrieval does not hit the model's token limit.
# chunk_size and chunk_overlap are measured in characters because length_function is len.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=4000,
    chunk_overlap=800,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
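
To see how `chunk_size` and `chunk_overlap` interact, here is a rough sketch of a sliding character window. It is an assumption-laden simplification: the hypothetical `split_with_overlap` below ignores separators and cuts purely by character count, whereas `CharacterTextSplitter` splits on `"\n"` first and then merges the pieces.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Slide a fixed-size character window over `text`.

    Hypothetical helper for illustration only; it is not how
    CharacterTextSplitter is implemented.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With the settings used in this notebook, neighbouring chunks can share up to 800 characters, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.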

In [55]:
len(texts)

40

In [56]:
texts[0]

'1\n© Royal College of Physicians and British Pharmacological Society 2022\nUsing pharmacogenomics to  \nimprove patient outcomes\nA report from the Royal College of Physicians and  \nBritish Pharmacological Society joint working party2\n© Royal College of Physicians and British Pharmacological Society 2022\nRoyal College of Physicians\nThe Royal College of Physicians (RCP) plays a leading \nrole in the delivery of high-quality patient care by \nsetting standards of medical practice and promoting \nclinical excellence. The RCP provides physicians in over \n30 medical specialties with education, training and \nsupport throughout their careers. As an independent \ncharity representing more than 40,000 fellows and \nmembers worldwide, the RCP advises and works with \ngovernment, patients, allied healthcare professionals \nand the public to improve health and healthcare.\nBritish Pharmacological \nSociety\nThe British Pharmacological Society (BPS) is a \ncollaborative, global community at 

In [58]:
# Create the OpenAI embeddings client; the vectors themselves are computed when the texts are indexed.
embeddings = OpenAIEmbeddings()

In [59]:
docsearch = FAISS.from_texts(texts, embeddings)
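
`FAISS.from_texts` embeds every chunk and stores the vectors in an index, and `similarity_search` later embeds the query and returns the closest chunks. The sketch below illustrates that idea only: the 3-dimensional vectors are made up, and a brute-force Euclidean distance in the hypothetical `nearest_texts` stands in for the FAISS index.

```python
import math


def nearest_texts(query_vec, indexed, k=2):
    """Return the k texts whose vectors lie closest to query_vec.

    `indexed` is a list of (text, vector) pairs. Brute-force Euclidean
    distance is used here in place of a real FAISS index.
    """
    ranked = sorted(indexed, key=lambda pair: math.dist(query_vec, pair[1]))
    return [text for text, _ in ranked[:k]]


# Made-up 3-dimensional vectors standing in for real embeddings.
toy_index = [
    ("single gene testing", [0.9, 0.1, 0.0]),
    ("panel testing", [0.8, 0.2, 0.1]),
    ("college history", [0.0, 0.1, 0.9]),
]
hits = nearest_texts([1.0, 0.0, 0.0], toy_index, k=2)
print(hits)  # ['single gene testing', 'panel testing']
```

Real embeddings have far more dimensions, and an index avoids comparing the query against every stored vector one by one, but the ranking principle is the same.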

In [60]:
docsearch

<langchain.vectorstores.faiss.FAISS at 0x7f59f590dd50>

In [61]:
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI

In [62]:
llm = ChatOpenAI(temperature=0, model_name="gpt-4")

In [63]:
chain = load_qa_chain(llm, chain_type="stuff")
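
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. A minimal sketch of that idea follows; the hypothetical `build_stuff_prompt` and its prompt wording are invented for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block followed by the question.

    The wording of the prompt is made up for illustration.
    """
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    ["Chunk about single gene testing.", "Chunk about panel testing."],
    "The main types of genetic testing approaches?",
)
print(prompt)
```

Because every chunk goes into one request, the retrieved chunks plus the question must fit within the model's context window, which is one reason the text was split into bounded chunks earlier.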

In [70]:
query = "The main types of genetic testing approaches?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The main types of genetic testing approaches are:\n\n1. Single gene testing: This approach focuses on testing a specific gene, such as HLA-B*57:01 (abacavir) and DPYD (fluoropyrimidines).\n\n2. Testing a panel of pharmacogenes in one test: This method involves testing multiple pharmacogenes simultaneously using various techniques, including sequencing, genome-wide arrays, or mass spectrometry.\n\n3. Whole-exome sequencing (WES) and whole-genome sequencing (WGS): These approaches involve sequencing the entire exome (protein-coding regions of the genome) or the whole genome, respectively. This provides comprehensive genetic information, including pharmacogenomic data.'

In [72]:
docs

[Document(page_content='based on European ancestry populations, and \ntherefore it is vital that implementation considers the \ndiversity of our population to ensure that we do not \nexacerbate health and race inequalities. \n5.5 Genotyping and laboratory \nconsiderations\nThe main types of genetic testing approaches \navailable are: \n>  single gene testing, as employed for HLA-B*57:01 \n(abacavir) and DPYD  (fluoropyrimidines) \n>  testing a panel of pharmacogenes in one test \n(which could be based on a number of approaches \nincluding sequencing, genome-wide arrays or mass \nspectrometry)\n>  whole-exome sequencing (WES) and whole-genome \nsequencing (WGS).At present, since our understanding of clinically \nactionable pharmacogenomics is limited to relatively \nfew genes, the best options available are either single \ngene testing or panel testing. The former is simpler and \ncan also include POC testing. However, it requires the \nHCP to request the test (‘a reactive approach’), a

In [65]:
query = "What is personalized medicine?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

"Personalized medicine, also known as precision medicine or individualized medicine, is a medical approach that takes into account an individual's unique genetic makeup, lifestyle, and environmental factors to tailor prevention, diagnosis, and treatment strategies. This approach aims to provide more effective and safer healthcare by considering the specific characteristics of each person, rather than using a one-size-fits-all approach. Personalized medicine often involves the use of pharmacogenomics, which is the study of how a person's genes affect their response to drugs, to guide the choice and dosage of medications for optimal effectiveness and minimal side effects."

In [66]:
! pip install anvil-uplink

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting argparse
  Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


In [67]:
import anvil.server

In [76]:
anvil.server.connect("")  # insert your Anvil uplink key

Disconnecting from previous connection first...
Connecting to wss://anvil.works/uplink
Anvil websocket open
Connected to "Default environment" as SERVER


In [71]:
# Register this function with Anvil so the web app can call it through the uplink.
@anvil.server.callable
def answer_questions(prompt):
    # Retrieve the chunks most similar to the prompt, then let the QA chain answer from them.
    docs = docsearch.similarity_search(prompt)
    return chain.run(input_documents=docs, question=prompt)