# LLM Creation with LangChain and PDF Processing

In this notebook, we explore how to build a Retrieval-Augmented Generation (RAG) system using **LangChain**, by leveraging PDFs as a source of knowledge.

The notebook is structured into three main parts:

**1. PDF Preprocessing and Vectorization**

In the first part, we preprocess a collection of PDF files by:
- Cleaning the text and removing unnecessary elements such as images.
- Splitting the content into manageable chunks.
- Embedding the chunks using a transformer-based embedding model.
- Storing the resulting embeddings in a **ChromaDB** vector database for efficient semantic search.

**2. LLM Integration and Simple RAG Function**

In the second part, we:
- Import a pre-trained LLM (e.g., **Mistral**).
- Store additional PDFs in the ChromaDB using the same embedding process.
- Build a basic **RAG function**: the function retrieves relevant chunks from ChromaDB based on a user query and combines them with the question to form a context-rich prompt. The LLM then responds based on this augmented context.

**3. End-to-End QA Pipeline with LangChain**

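Finally, we use **LangChain's RetrievalQA** chain to build a more structured and modular RAG system: a HuggingFace `text-generation` pipeline is wrapped as a LangChain LLM, connected to the ChromaDB retriever, and queried in an interactive loop.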

## 1. Importing libraries

In [16]:
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# The LangChain integrations used here live in the langchain_community package
# (the loader output further down is printed from that package).
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.chains import RetrievalQA

In [1]:
#!pip install pdfminer.six
#!pip install "unstructured[pdf]"
#!pip install chromadb
#!pip install -U langchain-community


## 2. Logging in to huggingface_hub

In [4]:
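import os
from huggingface_hub import login

# Read the access token from the environment instead of hardcoding it in the notebook.
# HF_TOKEN is an assumed variable name; set it before running this cell.
login(token=os.environ["HF_TOKEN"])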

## 3. PDF Preprocessing and Vectorization

In [5]:
# Folder containing the source PDFs.
pdf_folder = "/kaggle/input/pdfsint/docsPDFS"
# Accumulates the Document objects extracted from every PDF.
docs = []

# Load each PDF in the folder and collect its contents.
for filename in os.listdir(pdf_folder):
    if filename.endswith(".pdf"):
        loader = UnstructuredPDFLoader(os.path.join(pdf_folder, filename))
        print(loader)
        docs.extend(loader.load())

<langchain_community.document_loaders.pdf.UnstructuredPDFLoader object at 0x7aabb88267a0>
<langchain_community.document_loaders.pdf.UnstructuredPDFLoader object at 0x7aab9611fca0>
<langchain_community.document_loaders.pdf.UnstructuredPDFLoader object at 0x7aab9000d480>
<langchain_community.document_loaders.pdf.UnstructuredPDFLoader object at 0x7aab904bb340>
<langchain_community.document_loaders.pdf.UnstructuredPDFLoader object at 0x7aab91bbc9a0>
<langchain_community.document_loaders.pdf.UnstructuredPDFLoader object at 0x7aab907469e0>
<langchain_community.document_loaders.pdf.UnstructuredPDFLoader object at 0x7aab90285390>
<langchain_community.document_loaders.pdf.UnstructuredPDFLoader object at 0x7aab90647490>
<langchain_community.document_loaders.pdf.UnstructuredPDFLoader object at 0x7aab90502410>
<langchain_community.document_loaders.pdf.UnstructuredPDFLoader object at 0x7aab915a1ea0>
<langchain_community.document_loaders.pdf.UnstructuredPDFLoader object at 0x7aab911bb6d0>
<langchain
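The loop above selects files with `filename.endswith(".pdf")`, which skips files named with an upper-case extension and visits them in whatever order `os.listdir` returns. As a minimal sketch of a stricter listing step (the helper name `list_pdfs` and the temporary demo files are illustrative assumptions, not part of the notebook):

```python
import tempfile
from pathlib import Path

def list_pdfs(folder):
    """Return the PDF files in `folder`, sorted, matching .pdf in any letter case."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Demo on a throwaway folder with placeholder files (not real PDFs).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "B.PDF", "notes.txt"):
        (Path(tmp) / name).write_text("placeholder")
    found = [p.name for p in list_pdfs(tmp)]
```

Each returned path could then be passed to `UnstructuredPDFLoader(str(path))` in place of the `os.path.join(...)` call.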

In [6]:
# Split the documents into chunks of up to 1000 characters, with a 200-character overlap between consecutive chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = splitter.split_documents(docs)
print(len(documents))

5212
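To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window chunker. It is only an illustration under stated assumptions: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this sketch does not do, and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# 2500 characters of dummy text.
sample = "".join(str(i % 10) for i in range(2500))
pieces = sliding_chunks(sample)
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still seen whole in at least one chunk.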


In [7]:
# Load the embedding model from Hugging Face.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Embed the chunks and store them in a Chroma vector store persisted on disk.
vectorstore = Chroma.from_documents(documents, embedding_model, persist_directory="./chroma_db")
vectorstore.persist()
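Conceptually, a vector store answers a query by embedding it and returning the stored chunks whose vectors are closest to the query vector. The sketch below shows that idea with hand-written 3-dimensional vectors and cosine similarity. The vectors, chunk labels, and function names are made up for illustration, and the distance metric a real store uses depends on its configuration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=2):
    """`store` maps chunk text to its embedding; return the k most similar chunk texts."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vec, store[text]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings standing in for the vectors produced by the embedding model.
store = {
    "chunk about supervised learning": [0.9, 0.1, 0.0],
    "chunk about deep learning": [0.6, 0.8, 0.0],
    "chunk about page footers": [0.0, 0.1, 0.9],
}
nearest = top_k([1.0, 0.2, 0.0], store, k=2)
```

In the notebook, the embedding and the nearest-neighbour search are both handled by Chroma through `vectorstore.as_retriever()`.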


## 4. LLM Integration and Simple RAG Function

In [8]:
# Generate an answer for a prompt (tokenizer and model are loaded in a later cell, before the first call):
def generate_answer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [9]:
# Retriever that searches the vector store by embedding similarity.
retriever = vectorstore.as_retriever()

# RAG pipeline: retrieve context for the query, build the prompt, and generate an answer.
def rag_pipeline(query):
    retrieved_docs = retriever.get_relevant_documents(query)
    context = "\n".join([doc.page_content for doc in retrieved_docs[:3]])  # keep the top 3 chunks
    # The prompt is written in French: "Answer the following question based on the given context".
    prompt = f"Réponds à la question suivante en te basant sur le contexte donné :\n\nContexte:\n{context}\n\nQuestion: {query}"
    return generate_answer(prompt)
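The retrieve-then-prompt flow can be tried without a model or a vector store. The sketch below swaps in a toy keyword-overlap retriever for the Chroma retriever and only builds the prompt. The function names, the English prompt wording, and the sample chunks are assumptions made for illustration.

```python
def keyword_retriever(query, chunks, k=3):
    """Rank chunks by how many query words they share; a toy stand-in for embedding search."""
    query_words = set(query.lower().split())

    def overlap(chunk):
        return len(query_words & set(chunk.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:k]

def build_prompt(query, chunks, k=3):
    """Join the top-k retrieved chunks into a context block and append the question."""
    context = "\n".join(keyword_retriever(query, chunks, k))
    return (
        "Answer the following question based on the given context:\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

chunks = [
    "Supervised learning trains a model on labelled data.",
    "ChromaDB stores embeddings for semantic search.",
    "Deep learning uses multi-layer neural networks.",
]
prompt = build_prompt("what is supervised learning", chunks, k=1)
```

Keeping prompt construction in its own function makes it easy to inspect exactly what the LLM receives before any generation cost is paid.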

In [10]:
# Load the tokenizer and the Mistral model from Hugging Face:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

### Testing the pipeline

In [11]:
# Ask a test question through the RAG pipeline.
question = "what's the supervised learning ?"
answer = rag_pipeline(question)
print(answer)


Réponds à la question suivante en te basant sur le contexte donné :

Contexte:
Supervised learning: Supervised learning is a process where our machines are designed to learn with the feeding of labelled data. In this process our machine

is being trained by giving it access to a huge amount of data and training the machine to analyze it. For instance, the machine is given a number of images of dogs taken from many different angles with colour variations, breeds and many more diversity. So that, the machine learns to analyze data from these diverse images of dogs and the “insight” of machines keep increasing and soon the machine can predict if it’s a dog from a whole different picture which was not even a part of the labelled data set of dog images the machine was fed earlier.
into

Supervised learning includes training a machine learning model on labeled data, which has already been categorized with the correct answers. The machine learning algorithm uses this labeled data to learn how
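The printed answer starts with the prompt itself because, for a causal language model, `generate` returns the prompt tokens followed by the newly generated ones, and `generate_answer` decodes the whole sequence. One possible way to show only the new text is to drop the prompt tokens before decoding. The sketch below demonstrates the slicing on plain lists of made-up token ids; the helper name is hypothetical.

```python
def new_token_ids(output_ids, prompt_length):
    """Drop the first `prompt_length` ids, which are the prompt echoed back by generation."""
    return output_ids[prompt_length:]

# Toy ids standing in for real tokenizer output.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]
answer_ids = new_token_ids(output_ids, len(prompt_ids))

# In the notebook, the equivalent slice would be
# outputs[0][inputs["input_ids"].shape[1]:]
# passed to tokenizer.decode(..., skip_special_tokens=True).
```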

## 5. End-to-End QA Pipeline with LangChain

In this block, we build a complete question-answering (QA) pipeline by combining a HuggingFace text-generation model with LangChain. First, a `text-generation` pipeline is created from the model and tokenizer loaded earlier, capped at 512 new tokens per answer. The pipeline is then wrapped in `HuggingFacePipeline` so that LangChain can call it as an LLM. Next, `RetrievalQA` connects this LLM to a retriever built from `vectorstore`, so each question is answered from the most relevant retrieved chunks. Finally, an interactive `while` loop reads questions from the user and prints the generated answers until the user types `exit` or `quit`.

In [15]:
# Create the text-generation pipeline:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512)
# Wrap the pipeline so LangChain can use it as an LLM:
llm = HuggingFacePipeline(pipeline=pipe)

Device set to use cpu


In [18]:
# connect the language model to a document retriever
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

In [19]:
# Simple chat loop using the language model and the document retriever:
while True:
    query = input("\nquestion ? : ")
    if query.lower() in ["exit", "quit"]:
        break
    response = qa_chain.run(query)
    print(f"\nresponse : {response}")

\question ? :  what's deep learning ?





response : Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

169 | P a g e

NOVATEUR PUBLICATIONS INTERNATIONAL JOURNAL OF INNOVATIONS IN ENGINEERING RESEARCH AND TECHNOLOGY [IJIERT] ISSN: 2394-3696 VOLUME 7, ISSUE 6, June-2020

DEEP LEARNING Deep learning is a function of Artificial Intelligence that copy's how the human brain works in processing data and pattern creation that are vital in making strategic decisions. Deep learning is also known as a deep neural network since it has systems capable of learning unsupervised data from unstructured data (Hargrave, 2019). Deep knowledge helps to gain massive amounts of unstructured data that makes it strenuous for humans to process and understand (Hargrave, 2019). Deep learning uses a hierarchical level of artificial neural networks that makes the system undergo the process of machine learning (Hargrave, 2019). In general, dee

\question ? :  exit
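An interactive loop like the one above is hard to test because it blocks on `input` and needs the full model. A minimal sketch of a testable variant passes the answering, reading, and writing functions in as parameters. The names `chat_loop`, `answer_fn`, `read_fn`, and `write_fn` are assumptions, and the stub answer function stands in for `qa_chain.run`.

```python
def chat_loop(answer_fn, read_fn=input, write_fn=print, stop_words=("exit", "quit")):
    """Read questions until a stop word is entered; return the number of answered questions."""
    turns = 0
    while True:
        query = read_fn("\nquestion ? : ").strip()
        if query.lower() in stop_words:
            return turns
        if not query:
            continue  # ignore empty input instead of querying the model
        write_fn(f"\nresponse : {answer_fn(query)}")
        turns += 1

# Scripted run: no keyboard input and no model required.
scripted = iter(["what's deep learning ?", "", "EXIT"])
printed = []
turns = chat_loop(
    answer_fn=lambda q: f"(stub answer to: {q})",
    read_fn=lambda prompt: next(scripted),
    write_fn=printed.append,
)
```

In the notebook, calling `chat_loop(qa_chain.run)` would reproduce the interactive behaviour with the real chain.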


**@MOHAMED AMHAL**