In [7]:
!pip install PyPDF2



In [8]:
!pip install sentence_transformers



In [9]:
!pip install faiss-cpu



Creating embeddings of the relevant parts of the papers

In [32]:
import re
import PyPDF2
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import os

Extracting text from the papers and creating embeddings using BioBERT

In [33]:
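def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of every page in the PDF."""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # Guard against pages that yield no text (e.g. scanned images)
            text += page.extract_text() or ""
    return text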

In [34]:
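def remove_sections(text, sections):
    # Truncate the text at the first case-insensitive occurrence of each
    # section name, dropping everything from that point on.
    for section in sections:
        text = re.split(re.escape(section), text, flags=re.IGNORECASE)[0]
    return text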
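
Note that `remove_sections` cuts at the first place a section name appears anywhere in the text, so a sentence such as "see the references" in the body would also trigger the cut. A minimal sketch of a stricter variant, assuming the heading sits on a line of its own in the extracted text (which PDF extraction does not guarantee); `remove_trailing_sections` is a hypothetical helper, not part of the notebook above:

```python
import re

def remove_trailing_sections(text, headings):
    """Cut text at the earliest line that consists solely of one of the headings."""
    alternatives = "|".join(re.escape(h) for h in headings)
    # With re.MULTILINE, ^ and $ match at the start and end of each line
    pattern = re.compile(
        rf"^[ \t]*(?:{alternatives})[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(text)
    return text[:match.start()] if match else text

sample = (
    "Results are given in the references cited.\n"
    "More body text.\n"
    "References\n"
    "1. Some citation"
)
# The in-sentence word "references" on the first line is left alone
print(remove_trailing_sections(sample, ["References", "Appendix"]))
```

If a paper puts its heading on the same line as other text, this variant will not find it and returns the text unchanged, so it is worth checking the result per PDF.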

In [35]:
def process_pdfs(pdf_paths, sections_to_remove):
    all_text = ""
    for pdf_path in pdf_paths:
        pdf_text = extract_text_from_pdf(pdf_path)
        cleaned_text = remove_sections(pdf_text, sections_to_remove)
        all_text += cleaned_text + " "  # Combine texts from all PDFs
    return all_text

In [36]:
pdf_paths = ["/content/24993349.pdf", "/content/Cancer - 1 January 1981 - Miller - Reporting results of cancer treatment.pdf",
             "/content/cancers-03-03279-v2.pdf","/content/dunn2004.pdf","/content/zugazagoitia2016.pdf"]

In [37]:
sections_to_remove = ["References", "Acknowledgements", "Appendix", "LITERATURE CITATIONS"]


In [38]:
combined_text = process_pdfs(pdf_paths, sections_to_remove)
# Naive sentence split: every '.' ends a chunk, including those in initials and abbreviations
chunks = combined_text.split('.')

In [39]:
model = SentenceTransformer('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')
embeddings = model.encode(chunks)

## Store embeddings using FAISS for fast retrieval

In [40]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)

In [41]:
index.add(np.asarray(embeddings, dtype='float32'))  # FAISS expects float32 vectors
chunk_mapping = {i: chunk for i, chunk in enumerate(chunks)}
print("Cleaned Text Chunks:", chunks[:5])

Cleaned Text Chunks: ['How Cancer Arises\nAuthor(s): Robert A', ' Weinberg\nSource: Scientific American , Vol', ' 275, No', ' 3, SPECIAL ISSUE: WHAT YOU NEED TO KNOW \nABOUT CANCER (SEPTEMBER 1996), pp', ' 62-70\nPublished by: Scientific American, a division of Nature America, Inc']
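
The chunks printed above show the cost of splitting on every period: "Robert A" and " Weinberg" end up in separate chunks, and line-break hyphenation such as "encom-" / "passing" survives into the text. One common alternative is to clean the text first and then cut it into overlapping word windows. The sketch below is only an illustration; `normalize_pdf_text`, `chunk_words`, and the window sizes are assumptions, not what the notebook uses:

```python
import re

def normalize_pdf_text(text):
    """Join words hyphenated across line breaks and collapse whitespace."""
    # Caveat: a genuine compound that happens to break at a hyphen is also joined
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

raw = (
    "genes seem to account for the\n"
    "bulk of familial breast cancers, encom-\n"
    "passing as many as 20 percent"
)
clean = normalize_pdf_text(raw)
for chunk in chunk_words(clean, size=8, overlap=2):
    print(repr(chunk))
```

The overlap keeps a sentence that straddles a window boundary at least partly intact in both neighbouring chunks, at the price of some duplicated text in the index.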


## Passing the user query to the LLM

Retrieving the most relevant chunks

In [43]:
query = "What is breast cancer?"

query_embedding = model.encode([query])
k = 3  # Number of top chunks to retrieve
distances, indices = index.search(query_embedding, k)

retrieved_chunks = [chunk_mapping[idx] for idx in indices[0]]
print("Relevant chunks:", retrieved_chunks)

Relevant chunks: [', skin and subcutaneous metastases, inflammatory \nbreast cancer, intraoral lesions, or recurrent rectal \ncancer)', 'And the recently isolated BRCA1 and\nBRCA2 genes seem to account for the\nbulk of familial breast cancers, encom-\npassing as many as 20 percent of all pre-menopausal breast cancers in this coun-\ntry and a substantial proportion of fa-\nmilial ovarian cancers as well', ' Involved in \nglioblastoma (a brain cancer) and breast cancer\nerb-B2 Also called HER-2 or neu']
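
For intuition, `IndexFlatL2` performs an exhaustive search: it compares the query against every stored vector and returns the `k` smallest distances, which FAISS reports as squared Euclidean distances. The pure-Python sketch below mimics that behaviour on toy 2-D vectors; `flat_l2_search` is a hypothetical stand-in for illustration and is far slower than FAISS:

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors nearest to query."""
    scored = [(squared_l2(vec, query), i) for i, vec in enumerate(vectors)]
    nearest = heapq.nsmallest(k, scored)  # ties are broken by the lower index
    return [d for d, _ in nearest], [i for _, i in nearest]

stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 0.0]]
distances, indices = flat_l2_search(stored, [1.0, 0.5], k=2)
print(distances, indices)  # [0.25, 0.25] [1, 3]
```

A smaller distance means a closer match, so the first index returned corresponds to the chunk whose embedding is nearest to the query embedding.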


In [44]:
context = " ".join(retrieved_chunks)

prompt = f"Context: {context} \n\nQuestion: {query}"

print("Generated Prompt for LLM:\n", prompt)

Generated Prompt for LLM:
 Context: , skin and subcutaneous metastases, inflammatory 
breast cancer, intraoral lesions, or recurrent rectal 
cancer) And the recently isolated BRCA1 and
BRCA2 genes seem to account for the
bulk of familial breast cancers, encom-
passing as many as 20 percent of all pre-menopausal breast cancers in this coun-
try and a substantial proportion of fa-
milial ovarian cancers as well  Involved in 
glioblastoma (a brain cancer) and breast cancer
erb-B2 Also called HER-2 or neu 

Question: What is breast cancer?
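
The prompt above ends with the question and gives the model no cue to start answering. A base model such as GPT-2 only continues text, so it can help to add a short instruction, separate the retrieved chunks, and finish with "Answer:". The following is a sketch under those assumptions; `build_prompt`, its wording, and the `max_context_chars` budget are illustrative choices, not part of the original pipeline:

```python
def build_prompt(chunks, query, max_context_chars=1500):
    """Assemble an instruction, the retrieved context, the question and an answer cue."""
    context_parts = []
    used = 0
    for chunk in chunks:
        piece = " ".join(chunk.split())  # collapse newlines and repeated spaces
        if not piece:
            continue
        if used + len(piece) > max_context_chars:
            break  # stop before the context grows past the character budget
        context_parts.append(piece)
        used += len(piece)
    context = "\n".join(f"- {part}" for part in context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

example_chunks = [
    " Involved in \nglioblastoma (a brain cancer) and breast cancer",
    "erb-B2 Also called HER-2 or neu",
]
print(build_prompt(example_chunks, "What is breast cancer?"))
```

The character budget is a rough proxy for GPT-2's limited context window; counting tokens with the model's tokenizer would be more precise.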


## RAP for GPT-2

In [45]:
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')

response = generator(prompt, max_new_tokens=100, num_return_sequences=1, truncation=True)

print("LLM Response:", response[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LLM Response: Context: , skin and subcutaneous metastases, inflammatory 
breast cancer, intraoral lesions, or recurrent rectal 
cancer) And the recently isolated BRCA1 and
BRCA2 genes seem to account for the
bulk of familial breast cancers, encom-
passing as many as 20 percent of all pre-menopausal breast cancers in this coun-
try and a substantial proportion of fa-
milial ovarian cancers as well  Involved in 
glioblastoma (a brain cancer) and breast cancer
erb-B2 Also called HER-2 or neu 

Question: What is breast cancer? Answer: breast cancer is

a brain cancer. As well As breast cancer

has been reported in women's breast milk and breast

milk products (i.e. breast implants), breast cancer remains a

concerning chronic disease. It is a brain cancer. As well I

have mentioned it first of all in relation to   encom (dysenteritis ),  as

an early brain cancer, that's

the same as
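
The response above repeats the whole prompt before the generated continuation, because the pipeline returns the full text by default. A small helper can strip the echoed prompt so that only the model's answer is shown; `extract_answer` is a hypothetical name used for illustration. The `text-generation` pipeline also accepts `return_full_text=False`, which should have a similar effect.

```python
def extract_answer(generated_text, prompt):
    """Return only the text the model produced after the prompt."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()

demo_prompt = "Context: some context\n\nQuestion: What is breast cancer?"
demo_output = demo_prompt + " Answer: generated continuation"
print(extract_answer(demo_output, demo_prompt))  # Answer: generated continuation
```

In the notebook this would be called as `extract_answer(response[0]['generated_text'], prompt)`.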
