**Querying "Biophysics: Searching for Principles" by William Bialek, which is freely available on the Princeton website**

In [None]:
!pip -q install tiktoken 
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip -q install sentencepiece Xformers einops
!pip -q install unstructured pandoc

In [11]:
!pip install langchain chromadb pypdf sentence_transformers auto-gptq


In [12]:
import os
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader
from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings

In [13]:
loader = PyPDFLoader("/content/WB_biophysics110918.pdf")
pages = loader.load_and_split()
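
`load_and_split()` reads the PDF and cuts it into chunks, and those chunks are the units the retriever returns later. As a rough illustration of the idea only (this is not LangChain's splitter, which as far as I know splits recursively on separators), here is a minimal fixed-size chunker with overlap. The names `chunk_text`, `chunk_size`, and `overlap` are made up for this sketch.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input: one sentence repeated, just to have something to split
sample = "Our understanding of ion channels goes back to Hodgkin and Huxley. " * 5
chunks = chunk_text(sample, chunk_size=120, overlap=30)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.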

In [14]:
pages[6]

Document(page_content='imental details of particular systems, yet still be deriv-\nable from succinct and abstract principles that transcend\nthese details? For me, the answer to all of these ques-\ntions is an enthusiastic “yes,” and I hope that this book\nwill succeed in conveying both my enthusiasm and the\nreasons that lie behind it.\nI have emphasized that, in the physics tradition, our\nsubject should be deﬁned by the kinds of questions we\nask, but I haven’t given you a list of these questions.\nWorse yet, this emphasis on questions and concepts\nmight leave us ﬂoating, disconnected from the data. It\nis, after all, the phenomena of life which are so dra-\nmatic and which demand our attention, so we should\nstart there. There are so many beautiful things about\nlife, however, that is can be diﬃcult to choose a concrete\nstarting point. Before explaining the choices I made in\nwriting this book, I want to emphasize that there are\nmany equally good choices. Indeed, if we choose a

In [15]:
model_name = "intfloat/e5-large-v2"

hf = HuggingFaceEmbeddings(model_name=model_name)
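
One detail worth checking here: the model card for `intfloat/e5-large-v2` says each input should start with `query: ` or `passage: `. The cell above passes raw text, so retrieval quality may be affected. Below is a minimal sketch of the prefixing step. The helper names `as_passage` and `as_query` are hypothetical, and wiring them into `HuggingFaceEmbeddings` is not shown because it depends on the LangChain version.

```python
def as_passage(text):
    """Prefix a document chunk the way the E5 model card describes."""
    return "passage: " + text.strip()


def as_query(text):
    """Prefix a user question the way the E5 model card describes."""
    return "query: " + text.strip()


print(as_query("What is umbrella sampling?"))
print(as_passage("Our understanding of ion channels goes back to ..."))
```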


In [16]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

## Here are the new embeddings being used
embedding = hf #instructor_embeddings

vectordb = Chroma.from_documents(documents=pages, 
                                 embedding=embedding,
                                 persist_directory=persist_directory)

In [17]:
retriever = vectordb.as_retriever()

In [18]:
docs = retriever.get_relevant_documents("What is umbrella sampling?")

In [19]:
len(docs)

4

In [20]:
docs[0]

Document(page_content='17\nthe expected distribution of currents as\nP(i)=∞/summationdisplay\nn=0P(i|n)P(n) (17)\n=∞/summationdisplay\nn=0P(i|n)e−¯n¯nn\nn!(18)\n=∞/summationdisplay\nn=0¯nn\nn!e−¯n\n/radicalbig\n2π(σ2\n0+nσ2\n1)exp/bracketleftbigg\n−(i−ni1)2\n2(σ2\n0+nσ2\n1)/bracketrightbigg\n.(19)\nIn Fig 4, we see that this really gives a very good descrip-\ntion of the distribution that we observe when we sample\nthe currents in response to a large number of ﬂashes.\nProblem 9: Exploring the sampling problem. The data\nthat we see in Fig 4 are not a perfect ﬁt to our model. On the other\nhand, there are only 350 samples that we are using to estimate\nthe shape of the underlying probability distribution. This is an\nexample of a problem that you will meet many times in comparing\ntheory and experiment; perhaps you have some experience from\nphysics lab courses which is relevant here. We will return to these\nissues of sampling and ﬁtting nearer the end of the course, when\nwe have som

In [21]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [22]:
retriever.search_type

'similarity'
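
With `search_type` set to `'similarity'` and `k=2`, the retriever embeds the question and returns the two stored chunks whose vectors are closest to it. Here is a small pure-Python sketch of that ranking step. It uses cosine similarity for illustration; Chroma may use a different distance metric by default, and the toy vectors below are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k best (score, index) pairs, highest score first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real ones have on the order of a thousand dimensions
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.05, 0.0]
for score, idx in top_k(toy_query, toy_docs, k=2):
    print(idx, round(score, 3))
```

Note that the ranking always returns `k` chunks, even when none of them is actually about the question. That matters for the off-topic queries tested further down.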

In [23]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from langchain.llms import HuggingFacePipeline
import torch

model_name = "TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
quantize_config = BaseQuantizeConfig.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_quantized(model_name,
                                           use_safetensors=True,
                                           model_basename="Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order",
                                           device="cuda:0",
                                           use_triton=False, # True or False
                                           quantize_config=quantize_config)


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from langchain.llms import HuggingFacePipeline
import torch

model_name = "TheBloke/Nous-Hermes-13B-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
quantize_config = BaseQuantizeConfig.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_quantized(model_name,
                                           use_safetensors=True,
                                           model_basename="nous-hermes-13b-GPTQ-4bit-128g.no-act.order",
                                           device="cuda:0",
                                           use_triton=False, # True or False
                                           quantize_config=quantize_config)

In [24]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
import torch

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=2500,          # cap on prompt tokens plus generated tokens
    do_sample=False,          # greedy decoding; temperature/top_p only apply when sampling
    repetition_penalty=1.15
)

local_llm = HuggingFacePipeline(pipeline=pipe)

The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', ...

In [40]:
qa_chain = RetrievalQA.from_chain_type(llm=local_llm, 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True)

In [41]:
qa_chain.combine_documents_chain.llm_chain.prompt.template = '''
Your only source is the provided PDF.
Use the following pieces of context to answer the user's question.
Let's think step by step.
If you don't know the answer, just say that you don't know; don't try to make up an answer.
Example: What do you think of the color blue?
Answer: This is not available in the provided book
Example: What is your height?
Answer: This is not available in the provided book
If the query is not in the provided PDF, that is, the vector database, DO NOT ANSWER!
----------------
{context}

Question: {question}
Helpful Answer:'''
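
For intuition, the "stuff" chain type pastes the retrieved chunks into the `{context}` slot and the user's query into `{question}`, then sends the whole string to the LLM in one call. Below is a simplified stand-in for that assembly step, not LangChain's code. `TEMPLATE`, `build_stuff_prompt`, and the `separator` default are assumptions made for this sketch.

```python
# A shortened template with the same two placeholders as the notebook's prompt
TEMPLATE = """Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know.
----------------
{context}

Question: {question}
Helpful Answer:"""


def build_stuff_prompt(chunks, question, template=TEMPLATE, separator="\n\n"):
    """Join all retrieved chunks and fill the template in a single pass."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["Chunk one text.", "Chunk two text."],  # placeholder chunks
    "Where does our understanding of ion channels trace back to?",
)
print(prompt)
```

Because everything goes into one prompt, the chunk count `k` and the chunk size together have to fit inside the model's context window along with the answer.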

In [42]:
## Cite sources

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    temp_resp = wrap_text_preserve_newlines(llm_response['result'])
    #temp_resp = trim_string(temp_resp)
    print(temp_resp)
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
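
Since every chunk comes from the same PDF, printing only `metadata['source']` repeats the same path for each hit. `PyPDFLoader` also stores a `page` entry in the metadata, so a variant could show page numbers and drop duplicates. The sketch below assumes `page` is zero-based; `FakeDocument` is a stand-in for LangChain's `Document` so the example runs on its own, and the path and page values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Minimal stand-in exposing only the .metadata attribute used here."""
    metadata: dict = field(default_factory=dict)


def format_sources(source_documents):
    """Return unique 'path (page N)' labels in first-seen order."""
    seen = set()
    lines = []
    for doc in source_documents:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        # Assumption: page is zero-based, so add 1 for a human-readable number
        label = source if page is None else f"{source} (page {page + 1})"
        if label not in seen:
            seen.add(label)
            lines.append(label)
    return lines


toy_docs = [
    FakeDocument({"source": "example.pdf", "page": 16}),
    FakeDocument({"source": "example.pdf", "page": 16}),
    FakeDocument({"source": "example.pdf", "page": 41}),
]
for line in format_sources(toy_docs):
    print(line)
```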

In [32]:
# full example
query = "What is your favorite color?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 That is not available in the provided book


Sources:
/content/WB_biophysics110918.pdf
/content/WB_biophysics110918.pdf


In [35]:
# full example
query = "Where does our understanding of ion channels trace back to?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Our current understanding of ion channels comes from decades of research starting with the seminal works of
Alan Hodgkin and Andrew Huxley who won the Nobel Prize in Physiology or Medicine in 1963 for their discoveries
concerning the fundamental mechanism of nerve impulse transmission. They developed a mathematical model called
the "Hodgkin-Huxley" model which described how action potentials were generated and propagated along axons.
Since then, numerous studies have expanded upon this initial framework and led to the development of new
theories and techniques for studying ion channels.


Sources:
/content/WB_biophysics110918.pdf
/content/WB_biophysics110918.pdf


Part of the actual text: Our understanding of ion channels goes back to the classic work of Hodgkin and Huxley in the 1940s and 50s. They studied the giant axon, a single cell, visible to the naked eye, which runs along the length of a squid’s body, and along which action potentials are propagated to trigger the squid’s escape reflex. Passing a conducting wire through the interior of the long axon, they short–circuited the propagation, insuring that the voltage across the membrane was spatially uniform, as in our idealization above.
**So the LLM works!**

In [36]:
# full example
query = "How many children were in the movie, the sound of music?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 I am sorry, but there is no information regarding the number of children in the movie "The Sound of Music".


Sources:
/content/WB_biophysics110918.pdf
/content/WB_biophysics110918.pdf


In [38]:
# full example
query = "What kind of channels are in the stomatogastric ganglion of crabs and lobsters"
llm_response = qa_chain(query)
process_llm_response(llm_response)



 The stomatogastric ganglion in crustaceans contains several types of ion channels, including sodium channels,
potassium channels, calcium channels, and chloride channels. These channels play crucial roles in regulating
the electrical activity of the neurons in the ganglion, which controls various aspects of the animals' feeding
behavior.


Sources:
/content/WB_biophysics110918.pdf
/content/WB_biophysics110918.pdf


Part of the text: An important feature of this cell, shared by many other cells, is the presence of voltage–gated calcium channels. This means that, as action potentials occur, they trigger calcium flux into the cell. Because there are also channels which are directly affected by the calcium concentration, a complete model must include a description of the calcium buffering or pumping that counterbalances this flux.

In [43]:
# full example
query = "Why did the Roman empire fail?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 There is no straightforward answer to this question, as historians continue to debate the causes of Rome's
decline and fall. Some argue that internal weaknesses like corruption, political instability, and economic
stagnation contributed to its collapse, while others blame external pressures such as invasions by barbarian
tribes and environmental factors like climate change. Ultimately, it was likely a complex combination of
multiple factors that led to Rome's demise.


Sources:
/content/WB_biophysics110918.pdf
/content/WB_biophysics110918.pdf


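The model is not perfect: it fails the final test, answering the off-topic question about the Roman Empire from its own knowledge instead of declining. But when we know exactly which questions we need answered from the PDF, it does a pretty good job.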
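
One possible way to reduce this kind of failure, offered only as a sketch and not tested in this notebook, is to refuse before calling the LLM whenever no retrieved chunk is similar enough to the question. Everything here is hypothetical: `answer_or_refuse`, `fake_llm`, the `min_score` value, and the scores in the demo calls. A real threshold would have to be tuned for the embedding model and distance metric in use.

```python
REFUSAL = "This is not available in the provided book"


def answer_or_refuse(scored_chunks, answer_fn, min_score=0.8):
    """scored_chunks is a list of (similarity, text); higher means more relevant.

    Only chunks at or above min_score are passed on. If none qualify,
    the LLM is never called and a fixed refusal is returned instead.
    """
    relevant = [text for score, text in scored_chunks if score >= min_score]
    if not relevant:
        return REFUSAL
    return answer_fn(relevant)


def fake_llm(chunks):
    """Stand-in for the real chain so the sketch runs without a model."""
    return f"(answer written from {len(chunks)} chunk(s))"


# Invented scores: one on-topic hit in the first call, none in the second
print(answer_or_refuse([(0.91, "ion channels ..."), (0.62, "photon counting ...")], fake_llm))
print(answer_or_refuse([(0.55, "ion channels ..."), (0.48, "photon counting ...")], fake_llm))
```

This moves the "is it in the book?" decision out of the prompt, where the model can ignore it, and into ordinary code.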