This Google Colab notebook builds a small retrieval-augmented generation (RAG) pipeline with LangChain and Hugging Face tools. The steps are: install the necessary libraries, upload and load a PDF file, split it into chunks, embed the chunks into a FAISS vector store, and answer a question about the document with a Hugging Face text-generation model.

Install necessary libraries:

In [None]:
%pip install --upgrade --quiet langchain langchain-community langchainhub langchain-google-genai langchain-chroma bs4 faiss-cpu


In [None]:
%pip install --upgrade --quiet langchain-huggingface text-generation transformers google-search-results numexpr sentencepiece jinja2


In [None]:
!pip install transformers[torch] -U




In [None]:
!pip install pypdf




In [None]:
!pip install --upgrade huggingface_hub




Import necessary libraries and modules:

In [None]:
import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS




Set Google API Key environment variable:

In [None]:
# Set Google API Key environment variable
import getpass
import os

os.environ["GOOGLE_API_KEY"] = getpass.getpass()


··········
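Re-running the cell above prompts for the key every time. A minimal sketch of a guard that only prompts when the variable is still unset; `ensure_env` and its `prompt_fn` parameter are hypothetical names introduced here for illustration, not part of the notebook:

```python
import getpass
import os


def ensure_env(name, prompt_fn=getpass.getpass):
    """Return os.environ[name], prompting for it only if it is not set yet."""
    if not os.environ.get(name):
        # prompt_fn is injectable so the helper can be exercised without a terminal
        os.environ[name] = prompt_fn(f"{name}: ")
    return os.environ[name]
```

In the notebook this would be called as `ensure_env("GOOGLE_API_KEY")`.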


Upload a PDF file:

In [None]:
from google.colab import files
uploaded = files.upload()


Saving IJPRBS343.pdf to IJPRBS343 (1).pdf


Load the PDF file using LangChain:

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("IJPRBS343.pdf")
pages = loader.load()


Split documents into chunks for processing:

In [None]:
# Split documents into chunks for processing
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)


In [None]:
splits = text_splitter.split_documents(pages)
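To build intuition for `chunk_size=1000` and `chunk_overlap=200`, here is a simplified pure-Python sketch of fixed-size windows that overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break at separators such as paragraph and line breaks, so its chunks are usually shorter than `chunk_size`. The `chunk_text` helper is a hypothetical name.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The last 200 characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still seen whole in at least one chunk.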


Generate embeddings for document chunks:

In [None]:
# Generate embeddings for document chunks
vectorstore = FAISS.from_documents(documents=splits, embedding=GoogleGenerativeAIEmbeddings(model="models/embedding-001"))


Set model and task for text generation:

In [None]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
task="text-generation"


Create a retriever from the vector store:

In [None]:
retriever = vectorstore.as_retriever()
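Conceptually, a retriever embeds the question and returns the stored chunks whose vectors lie closest to it. The toy sketch below uses cosine similarity over made-up three-dimensional vectors for readability. The metric and the number of results that FAISS actually uses depend on how the store and retriever are configured, and the texts and vectors here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
indexed = [
    ("thyroid and breast cancer", [0.9, 0.1, 0.0]),
    ("ultrasound evaluation", [0.6, 0.4, 0.1]),
    ("journal masthead", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.0, 0.0], indexed)
print(results)  # ['thyroid and breast cancer', 'ultrasound evaluation']
```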



Get Hugging Face token and log in:

In [None]:
# Use the module form so the earlier `import getpass` is not shadowed
import getpass
hf_token = getpass.getpass("Hugging Face Key: ")


Hugging Face Key: ··········


In [None]:
!huggingface-cli login --token $hf_token

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Set up the device for processing (CUDA or CPU):

In [None]:

import torch
import os
import numpy as np
from torch import cuda, bfloat16


# On Apple Silicon Macs, use the 'mps' device instead:
# device = torch.device('mps')
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
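The manual comment toggle for Apple Silicon can be folded into a single helper. This is a sketch under two assumptions: falling back to the CPU is acceptable when PyTorch is not installed, and `torch.backends.mps` may be absent in older PyTorch builds. The name `pick_device` is hypothetical.

```python
def pick_device():
    """Return 'cuda:<index>', 'mps', or 'cpu', whichever is available first."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption: fall back to CPU when PyTorch is missing
    if torch.cuda.is_available():
        return f"cuda:{torch.cuda.current_device()}"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```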



In [None]:
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM


Initialize Hugging Face model and tokenizer:

In [None]:
# Initialize the Hugging Face config and model; both calls need the auth token.
# `token` replaces the deprecated `use_auth_token` argument.
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=hf_token
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    device_map='auto',  # let Accelerate place the weights on the available devices
    token=hf_token
)
model.eval()  # inference mode: disables dropout
print(f"Model loaded on {device}")




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Model loaded on cuda:0


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id,
                                          token=hf_token)

Set up the pipeline and LLM:

In [None]:
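# HuggingFacePipeline now lives in the langchain-huggingface package installed above
from langchain_huggingface import HuggingFacePipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,       # cap on generated tokens; long answers get cut off
    do_sample=True,           # sample instead of greedy decoding, so runs differ
    repetition_penalty=1.1,   # mildly discourage repeated tokens
    return_full_text=True,    # output includes the prompt as well as the answer
    device_map='auto'
)

# Wrap the Transformers pipeline so it can be used as a LangChain LLM
llm = HuggingFacePipeline(pipeline=pipe)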
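`repetition_penalty=1.1` lowers the score of tokens that have already been generated. The pure-Python sketch below shows the usual rule, assuming it matches what the Transformers logits processor does: a positive score is divided by the penalty and a negative score is multiplied by it. The function name and the toy scores are illustrative only.

```python
def apply_repetition_penalty(logits, seen_tokens, penalty=1.1):
    """Lower the score of every token that already appeared in the output."""
    adjusted = dict(logits)
    for token in seen_tokens:
        if token not in adjusted:
            continue
        score = adjusted[token]
        # Dividing a positive score or multiplying a negative one both make
        # the token less likely when penalty > 1.
        adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted


# Toy scores for three candidate next tokens (made-up values)
logits = {"thyroid": 2.2, "cancer": -1.1, "breast": 0.5}
adjusted = apply_repetition_penalty(logits, seen_tokens={"thyroid", "cancer"})
print(adjusted)
```

A penalty of 1.0 leaves the scores unchanged. Values much larger than 1 can push the model away from words it legitimately needs to repeat, such as the subject of the question.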


Pull the RAG prompt from the LangChain Hub:

In [None]:
prompt = hub.pull("rlm/rag-prompt")


Define a function to format documents, then set up the RAG chain:

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
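The `|` operators hide a simple data flow: retrieve chunks, join them into one context string, fill the prompt template, and call the model. A plain-Python sketch of that flow follows. The template text is copied from the output printed further down in this notebook (without the `Human:` prefix), while `fake_retriever` and `fake_llm` are hypothetical stand-ins for the FAISS retriever and the Hugging Face pipeline.

```python
# Prompt text as it appears in the notebook's output for "rlm/rag-prompt".
TEMPLATE = (
    "You are an assistant for question-answering tasks. Use the following "
    "pieces of retrieved context to answer the question. If you don't know "
    "the answer, just say that you don't know. Use three sentences maximum "
    "and keep the answer concise.\n"
    "Question: {question} \nContext: {context} \nAnswer:"
)


def fake_retriever(question):
    # Hypothetical stand-in for vectorstore.as_retriever()
    return ["Chunk one about thyroid disease.", "Chunk two about breast cancer."]


def fake_llm(prompt_text):
    # Hypothetical stand-in for the pipeline: echoes the prompt and appends
    # a canned completion, as a pipeline with return_full_text=True would.
    return prompt_text + " The context links the two conditions."


def rag_answer(question):
    # retriever | format_docs
    context = "\n\n".join(fake_retriever(question))
    # | prompt
    prompt_text = TEMPLATE.format(context=context, question=question)
    # | llm | StrOutputParser()
    return fake_llm(prompt_text)


result = rag_answer("How are they related?")
print(result)
```

Because the stand-in model echoes the prompt, the returned string contains the instructions, the question, and the context before the answer, which mirrors the raw chain output in this notebook.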


In [None]:
rag_chain.invoke("RELATION BETWEEN THYROIDISM AND BREAST CANCER?")


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


"Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: RELATION BETWEEN THYROIDISM AND BREAST CANCER? \nContext: Review Article                                                                                                                    ISSN: 2277-8713                                           \nMounika B, IJPRBS, 2013; Volume 2(3): 197-214                                                                         IJPRBS \n                                                 Av ailable Online At www.ijprbs.com  \n assessment of potential fetal risk; on the \nbasis of clinical judgment, the \nendocrinologist can have this study done \n[17]. \nRELATION BETWEEN THYROIDISM AND \nBREAST CANCER \n Ultrasonographic evaluation of the thyroid \ngland was conducted by the same \nradiologist using a

Process the response and print the answer:

In [None]:
response = rag_chain.invoke("RELATION BETWEEN THYROIDISM AND BREAST CANCER?")

# Check if the response is a string
if isinstance(response, str):
    # Split the response by 'Answer:' and take the part after it
    if 'Answer:' in response:
        answer_only = response.split('Answer:')[1].strip()
    else:
        answer_only = response.strip()
else:
    # Handle other types of responses if necessary
    answer_only = str(response)

print(answer_only)
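The cell above keeps the text between the first and second `Answer:` markers, which goes wrong if a retrieved chunk happens to contain that marker. A small reusable variant that takes the text after the last marker instead is sketched below; it assumes the model's own completion does not contain `Answer:`, and `extract_answer` is a hypothetical helper name.

```python
def extract_answer(response, marker="Answer:"):
    """Return the text after the last marker, or the whole text if absent."""
    text = response if isinstance(response, str) else str(response)
    _, found, tail = text.rpartition(marker)
    return tail.strip() if found else text.strip()


demo = "Question: Q? \nContext: Answer: quoted in a chunk \nAnswer: They are linked. "
print(extract_answer(demo))  # They are linked.
```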
