### Import the libraries

In [1]:
import os
import gc

import torch
from dotenv import load_dotenv
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM  
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

In [2]:
import torch
print(f"CUDA version: {torch.version.cuda}")

CUDA version: 12.1


In [3]:
pip show faiss-cpu

Note: you may need to restart the kernel to use updated packages.
Name: faiss-cpu
Version: 1.12.0
Summary: A library for efficient similarity search and clustering of dense vectors.
Home-page: 
Author: 
Author-email: Kota Yamaguchi <yamaguchi_kota@cyberagent.co.jp>
License: 
Location: c:\users\subhi.gupta\appdata\local\anaconda3\envs\torchenv\lib\site-packages
Requires: numpy, packaging
Required-by: 





### Configure Hugging Face Hub download settings

In [4]:
# make Hub downloads resilient on slower links
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "180"
os.environ["HF_HUB_DOWNLOAD_RETRY"]   = "20"

### Read the research papers

In [5]:
dataset_path = r"D:\Intelligent QA AI\research_docs"
all_docs = []

for file in os.listdir(dataset_path):
    if file.endswith('.pdf'): 
        
        file_path = os.path.join(dataset_path, file)
        loader = PyPDFLoader(file_path, mode="single")
        docs = loader.load()
        
        all_docs.append(docs[0])

In [6]:
print(len(all_docs))

2
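The loading loop above relies on `os.listdir`, whose ordering is not guaranteed, and on a case-sensitive `.endswith('.pdf')` check. A minimal sketch of a more predictable file listing follows; `list_pdfs` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return PDF files in `folder`, sorted by name, matching the extension case-insensitively."""
    return sorted(
        (
            p
            for p in Path(folder).iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        ),
        key=lambda p: p.name.lower(),
    )
```

Each returned path could then be passed to the loader as `PyPDFLoader(str(pdf_path), mode="single")`, which keeps `all_docs` in a stable order across runs.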


In [7]:
doc = all_docs[0]
doc.page_content

"Hybrid modeling for\nbiopharmaceutical processes:\nadvantages, opportunities, and\nimplementation\nHarini Narayanan1, Moritz von Stosch2, Fabian Feidl2,\nMichael Sokolov2, Massimo Morbidelli2 and Alessandro Butté2*\n1Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA,\nUnited States,2DataHow AG, Zurich, Switzerland\nProcess models are mathematical formulations (essentially a set of equations) that\ntry to represent the real system/process in a digital or virtual form. These are\nderived either based on fundamental physical laws often combined with empirical\nassumptions or learned based on data. The former has been existing for several\ndecades in chemical and process engineering while the latter has recently\nreceived a lot of attention with the emergence of several artiﬁcial intelligence/\nmachine learning techniques. Hybrid modeling is an emerging modeling paradigm\nthat explores the synergy between existing these two paradigms, tak

In [8]:
doc = all_docs[1]
doc.page_content



### Split the text into chunks

In [9]:
text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=900,
        chunk_overlap=100,
        length_function=len
    )

chunks = text_splitter.split_documents(all_docs)

In [10]:
len(chunks)

318

In [11]:
# Print the first nine chunks to inspect the split and the overlap
for chunk in chunks[:9]:
    print(chunk.page_content)
    print("\n")

Hybrid modeling for
biopharmaceutical processes:
advantages, opportunities, and
implementation
Harini Narayanan1, Moritz von Stosch2, Fabian Feidl2,
Michael Sokolov2, Massimo Morbidelli2 and Alessandro Butté2*
1Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA,
United States,2DataHow AG, Zurich, Switzerland
Process models are mathematical formulations (essentially a set of equations) that
try to represent the real system/process in a digital or virtual form. These are
derived either based on fundamental physical laws often combined with empirical
assumptions or learned based on data. The former has been existing for several
decades in chemical and process engineering while the latter has recently
received a lot of attention with the emergence of several artiﬁcial intelligence/


received a lot of attention with the emergence of several artiﬁcial intelligence/
machine learning techniques. Hybrid modeling is an emerging modeling paradigm
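The repeated line at the start of the second chunk above is the effect of `chunk_overlap`. The sketch below illustrates the general idea of packing separator-delimited pieces into size-limited chunks with overlap. It is a simplified, pure-Python approximation written for illustration, not LangChain's actual implementation, and `split_lines_with_overlap` is a hypothetical name.

```python
def split_lines_with_overlap(text, chunk_size=900, chunk_overlap=100, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    `chunk_size` characters, carrying whole trailing pieces (up to
    `chunk_overlap` characters) over into the next chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Drop leading pieces until the remainder fits the overlap budget
            # and leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks
```

In this sketch a single piece longer than `chunk_size` becomes its own oversized chunk, so a separator that actually occurs in the text matters as much as the size settings.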

In [12]:
gc.collect()

13439

### Create vector embeddings and store them in a vector database

In [13]:
embedding = HuggingFaceEmbeddings(
    model_name="NeuML/pubmedbert-base-embeddings"
)

In [14]:
vector_store = FAISS.from_documents(chunks, embedding)



In [15]:
vector_store.save_local("faiss_index")

In [16]:
# Load from local storage
vector_store = FAISS.load_local("faiss_index", embedding, allow_dangerous_deserialization=True)

In [17]:
gc.collect()

0

### Find similar chunks in the database

In [18]:
question = "What is hybrid modeling approach?"
searchDocs = vector_store.similarity_search(question, k=3)

for i in range(len(searchDocs)):
    print(searchDocs[i].page_content)
    print("\n")

depend heavily on the quality of the data on which they are trained.
Despite their flexibility, the effectiveness of ML models is closely tied
to acquiring high-quality data, a task complicated by noise and dis-
turbances in real-world processes. This dependency on data quality
challenges the creation of robust models that can reliably interpret and
predict based on underlying process data. Hybrid modeling addresses
these challenges by combining the strengths of both approaches. It
integrates the broad applicability and interpretability of FPMs, which
are based on system-independent physics laws, with the adaptability
of ML models to leverage system-specific process data. Compared to
FPMs, hybrid models offer superior extrapolation capabilities beyond
the data range, although the conditions required to guarantee their
extrapolation accuracy require further research.


detailed predictions of complex properties, such as reaction kinetics
and material behaviors, under varying conditions 
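Conceptually, `similarity_search` embeds the question and returns the `k` stored chunks whose vectors lie closest to it. The sketch below shows that idea as an exhaustive nearest-neighbour search in plain Python. It assumes Euclidean (L2) distance, which is, as far as I know, the default for the LangChain FAISS wrapper; treat that as an assumption. `top_k_by_l2` is a hypothetical helper.

```python
import math


def top_k_by_l2(query_vec, doc_vecs, k=3):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, math.dist(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    # Smaller distance means a closer match, so sort ascending.
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]
```

Note that a nearest-neighbour search always returns `k` results, even when none of the stored chunks is actually related to the question.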

### Load the tokenizer and count the tokens

In [19]:
model_id = "TheBloke/PMC_LLAMA-7B-GPTQ"  # repo ID on the Hub: underscore in "PMC_LLAMA", dashes elsewhere

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


In [20]:
# Count the tokens in the retrieved chunks only; the instruction text and
# the question added later in the prompt are not included in this total.
total_tokens = 0
for doc in searchDocs:
    total_tokens += len(tokenizer(doc.page_content)["input_ids"])
print("Number of tokens in input prompt:", total_tokens)

Number of tokens in input prompt: 578
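The count matters because the prompt is later tokenized with `max_length=2048` and `truncation=True`, so anything beyond that limit is silently cut. A minimal sketch of keeping retrieved chunks within a token budget follows; `select_within_budget` and its parameters are hypothetical names, and the counting function is passed in so the helper does not depend on a particular tokenizer.

```python
def select_within_budget(texts, count_tokens, max_tokens):
    """Keep texts in their given order until adding the next one would
    exceed max_tokens. Returns the kept texts and the tokens used."""
    selected, used = [], 0
    for text in texts:
        n = count_tokens(text)
        if used + n > max_tokens:
            break
        selected.append(text)
        used += n
    return selected, used
```

With the tokenizer loaded above, the counter could be `lambda t: len(tokenizer(t)["input_ids"])`, and the budget would be the model limit minus the instruction text, the question, and `max_new_tokens`.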


### Load LLM model

In [21]:
gc.collect()

0

In [22]:
os.makedirs("./model_offload", exist_ok=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    max_memory={0: "5GB", "cpu": "14GB"},  # Adjust based on your system
    offload_folder="./model_offload",
    use_safetensors=True,
    trust_remote_code=True
)


The following generation flags are not valid and may be ignored: ['pad_token_id']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
INFO - The layer lm_head is not quantized.
Some weights of the model checkpoint at C:\Users\subhi.gupta\.cache\huggingface\hub\models--TheBloke--PMC_LLAMA-7B-GPTQ\snapshots\7739ce0d4d7057bf5faf0efa19601dcd5640b346\model.safetensors were not used when initializing LlamaForCausalLM: {'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.la

In [23]:
model

LlamaGPTQForCausalLM(
  (model): LlamaForCausalLM(
    (model): LlamaModel(
      (embed_tokens): Embedding(32000, 4096, padding_idx=31999)
      (layers): ModuleList(
        (0-31): 32 x LlamaDecoderLayer(
          (self_attn): LlamaAttention(
            (q_proj): QuantLinear()
            (k_proj): QuantLinear()
            (v_proj): QuantLinear()
            (o_proj): QuantLinear()
          )
          (mlp): LlamaMLP(
            (gate_proj): QuantLinear()
            (up_proj): QuantLinear()
            (down_proj): QuantLinear()
            (act_fn): SiLU()
          )
          (input_layernorm): LlamaRMSNorm((4096,), eps=1e-06)
          (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-06)
        )
      )
      (norm): LlamaRMSNorm((4096,), eps=1e-06)
      (rotary_emb): LlamaRotaryEmbedding()
    )
    (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
  )
)

### Make inference through LLM by providing the context

In [24]:
context_text = "\n\n".join([doc.page_content for doc in searchDocs])

question = "What is hybrid modeling?"

# Create the prompt
prompt = f"""Based on the following context, please answer the question. Answer descriptively in at least 4-5 lines.

Context: {context_text}

Question: {question}

Answer:"""


In [25]:
# Generate answer
inputs = tokenizer(prompt, return_tensors="pt", max_length=2048, truncation=True)

# Move inputs to the same device as the model
inputs = inputs.to(model.device)  # or inputs.to("cuda") if you know it's on GPU

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    min_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
    stop_strings=["\n\nQuestion:", "\nQuestion:", "Question:"],
    tokenizer=tokenizer
)

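# Extract just the answer: decode only the newly generated tokens, so the
# result does not depend on the decoded text matching the prompt string exactly
prompt_len = inputs["input_ids"].shape[1]
answer = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()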
print("Question:", question)
print("Answer:", answer)


Question: What is hybrid modeling?
Answer: Hybrid modeling is a method for developing process models that combines

knowledge from physics-based process models (FPMs) and machine learning-based

process models (MLPs). FPMs are deterministic models that provide 

accurate predictions of process behaviors. MLPs are probabilistic 

models that can leverage data to make predictions with a high level of 

accuracy, precision, and robustness.


### Post-process the answer

In [26]:
# Stop at various unwanted patterns
stop_patterns = [
    "\nContext:",
    "\nQuestion:", 
    "\n\nQuestion:",
    "\nQ:",
    "Context:",
    "Question:",
    "\n\n\n"
]

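# Cut at the earliest occurrence of any pattern, regardless of list order
cut_positions = [answer.find(pattern) for pattern in stop_patterns if pattern in answer]
if cut_positions:
    answer = answer[:min(cut_positions)].strip()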

print("Question:", question)
print("Answer:", answer)

Question: What is hybrid modeling?
Answer: Hybrid modeling is a method for developing process models that combines

knowledge from physics-based process models (FPMs) and machine learning-based

process models (MLPs). FPMs are deterministic models that provide 

accurate predictions of process behaviors. MLPs are probabilistic 

models that can leverage data to make predictions with a high level of 

accuracy, precision, and robustness.
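The answer above still carries the blank lines the model produced between sentences. If a single flowing paragraph is preferred, a small whitespace cleanup can be applied after the stop-pattern cut. This is an optional sketch, and `tidy_answer` is a hypothetical helper name.

```python
import re


def tidy_answer(text):
    """Collapse runs of whitespace, including blank lines, into single spaces."""
    return re.sub(r"\s+", " ", text).strip()
```

It should run after the stop-pattern step, since collapsing newlines first would remove the `"\n\n\n"` marker that step looks for.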


In [27]:
gc.collect()

15

In [52]:
question = "Who is prerak?"

In [56]:
question

'Who is prerak?'

In [58]:
prompt = f"""You MUST answer ONLY based on the provided context below. DO NOT use any external knowledge.

Context: {context_text}

Question: {question}

IMPORTANT RULES:
- If the context does NOT contain information to answer the question, respond EXACTLY: "I don't have enough information in the provided context to answer this question."
- If the context contains relevant information, provide a 4-5 line descriptive answer.
- DO NOT make up or guess any information.

Answer:"""


In [59]:
# Generate answer
inputs = tokenizer(prompt, return_tensors="pt", max_length=2048, truncation=True)

# Move inputs to the same device as the model
inputs = inputs.to(model.device)  # or inputs.to("cuda") if you know it's on GPU

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    min_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
    stop_strings=["\n\nQuestion:", "\nQuestion:", "Question:"],
    tokenizer=tokenizer
)

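# Extract just the answer: decode only the newly generated tokens, so the
# result does not depend on the decoded text matching the prompt string exactly
prompt_len = inputs["input_ids"].shape[1]
answer = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()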
print("Question:", question)
print("Answer:", answer)


Question: Who is prerak?
Answer: -The data provided is for a chemical process. The data is a vector x(t) with the elements x j (t) representing the concentration of the j th component in the system.

-The question is: At what time is the concentration of component 3 equal to 30?

-The context states that the system is a "chemical process". The context also states that the process is a "three-dimensional chemical system."

-The
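The model did not follow the refusal rule here. Two things in the cells shown may contribute, though neither is confirmed: `context_text` is not rebuilt for the new question, so it may still hold the chunks retrieved for the hybrid modeling question, and `min_new_tokens=50` keeps generation going for at least 50 tokens, which is longer than the refusal sentence. One possible mitigation is to decide in code, before calling the model, whether the retrieved chunks are close enough to the question. The sketch below assumes scored results in the form `(text, distance)` with lower meaning closer, as could be built from the vector store's `similarity_search_with_score`. The names `answer_or_refuse`, `max_distance`, and `generate` are hypothetical, and a suitable threshold would have to be found empirically.

```python
REFUSAL = (
    "I don't have enough information in the provided context "
    "to answer this question."
)


def answer_or_refuse(scored_chunks, max_distance, generate):
    """scored_chunks: (text, distance) pairs, lower distance = closer match.
    Calls generate(context) only if at least one chunk is close enough;
    otherwise returns the fixed refusal without calling the model."""
    relevant = [text for text, dist in scored_chunks if dist <= max_distance]
    if not relevant:
        return REFUSAL
    return generate("\n\n".join(relevant))
```

Here `generate` would wrap the prompt construction and the `model.generate` call from the cells above, so the refusal path never depends on the model obeying the instruction.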
