Retrieval-Augmented Generation (RAG) is an AI architecture that extends Large Language Models (LLMs) by integrating external knowledge sources into the generation process. Traditional LLMs, while powerful, are limited to the information present in their training data, which can be outdated or insufficient for specific queries. RAG addresses this limitation by retrieving relevant information from external databases or documents at query time, which helps keep generated responses accurate and up to date.

In a typical RAG system, when a user poses a question, the system first retrieves pertinent documents or data from an external source. The retrieved information is then added to the model's input, so the generated response can draw on both the retrieved text and the model's internal knowledge. This approach can improve the quality of AI-generated content and helps mitigate "hallucinations," where models produce plausible-sounding but incorrect information.
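
The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy illustration, not a production retriever: the helper names (`tokenize`, `retrieve`, `build_prompt`), the sample documents, and the word-overlap scoring are all assumptions made for this example. Real systems usually rank passages with embeddings or BM25 and pass the assembled prompt to an LLM.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=2):
    """Rank documents by how many query words they share; return the top_k."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc)), doc) for doc in documents]
    # Keep only documents that share at least one word with the query
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def build_prompt(query, passages):
    """Place the retrieved passages ahead of the question in a single prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical mini knowledge base, used only for illustration
documents = [
    "RAG retrieves documents before generating an answer.",
    "DistilBERT is a distilled version of BERT.",
    "Chunking splits long text into overlapping segments.",
]
query = "What does RAG do before generating an answer?"
passages = retrieve(query, documents, top_k=1)
prompt = build_prompt(query, passages)
print(prompt)
```

The resulting `prompt` string is what would be sent to the generator model; swapping the word-overlap score for a stronger ranking function leaves the rest of the flow unchanged.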

**Example 1: Using Hugging Face's `pipeline` with DistilBERT**

In this example, we use Hugging Face's `pipeline` for question answering with the `distilbert-base-uncased-distilled-squad` model. This model is a distilled version of BERT fine-tuned on SQuAD for extractive question answering; it is smaller and faster than BERT while retaining most of its performance.

```python
from transformers import pipeline

# Initialize the question-answering pipeline
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

# Define your question
question = "What is the main point of the article?"

# `extracted_text` is assumed to already hold the text extracted from the PDF
# Ensure the extracted text is not empty
if extracted_text:
    # Prepare the input for the model
    qa_input = {
        'question': question,
        'context': extracted_text
    }
    # Get the answer
    answer = qa_pipeline(qa_input)
    print(f"Question: {question}")
    print(f"Answer: {answer['answer']}")
else:
    print("Error: No text extracted from the PDF.")
```

In this script, the `pipeline` is initialized for question answering with the specified model. The `question` variable holds the query, and `extracted_text` holds the text extracted from the PDF, from which the answer is derived. Because the model is extractive, it returns the span of the context that it scores as the most probable answer rather than generating new text.
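
Besides `answer`, the dictionary returned by the question-answering pipeline carries a `score` and the `start` and `end` character offsets of the answer within the context. The sketch below uses those offsets to show the answer with some surrounding text. The helper `answer_with_context` and the sample `result` are hypothetical; the dictionary is written by hand to mimic the pipeline's output shape and is not real model output.

```python
def answer_with_context(context, result, window=30):
    """Return the answer span with up to `window` characters on either side."""
    start = max(0, result["start"] - window)
    end = min(len(context), result["end"] + window)
    return context[start:end]

# Hand-written stand-in for a pipeline result (not real model output)
context = "The museum opened in 1999 and moved to its current building in 2010."
result = {"score": 0.9, "start": 21, "end": 25, "answer": "1999"}

# The offsets index directly into the context string
assert context[result["start"]:result["end"]] == result["answer"]
print(answer_with_context(context, result, window=10))
```

Seeing the answer in its surrounding sentence makes it easier to judge whether a short extracted span actually addresses the question.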

**Example 2: Handling Longer Texts with Chunking**

When dealing with lengthy documents, it's essential to manage the input size to fit within the model's maximum token limit. One common approach is to split the text into manageable chunks with some overlap to ensure context continuity.

```python
from transformers import pipeline

# Initialize the question-answering pipeline
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

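def chunk_text(text, max_length, overlap):
    """
    Splits the text into chunks of max_length words with a specified overlap.

    Args:
        text: The input text to be chunked.
        max_length: Maximum number of whitespace-separated words per chunk.
        overlap: Number of words shared between consecutive chunks.

    Returns:
        A list of text chunks.
    """
    if not 0 <= overlap < max_length:
        # Otherwise the window would never advance and the loop would not end
        raise ValueError("overlap must be non-negative and smaller than max_length")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_length, len(words))
        chunk = ' '.join(words[start:end])
        chunks.append(chunk)
        if end == len(words):
            break  # Last chunk reached; avoid a redundant tail chunk
        start += max_length - overlap
    return chunks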

# Define your question
question = "What is the main point of the article?"

# Parameters
max_chunk_length = 450  # Measured in words; one word can map to several tokens, so lower this if chunks exceed the model's token limit
overlap_length = 50     # Number of overlapping words

# Split the extracted text into chunks
text_chunks = chunk_text(extracted_text, max_chunk_length, overlap_length)

# Iterate over chunks and get answers
answers = []
for chunk in text_chunks:
    qa_input = {
        'question': question,
        'context': chunk
    }
    answer = qa_pipeline(qa_input)
    answers.append(answer['answer'])

# Combine or select the most appropriate answer
# For simplicity, we'll just print all answers here
for idx, ans in enumerate(answers):
    print(f"Answer from chunk {idx + 1}: {ans}")
```

In this script, the `chunk_text` function divides the `extracted_text` into smaller segments, each with a specified maximum length and overlap. Because the length is counted in whitespace-separated words rather than model tokens, the word limit should be set conservatively so that each chunk stays within the model's token limit. The script then iterates over these chunks, applies the question-answering pipeline to each, and collects the answers. Finally, it prints the answers obtained from each chunk.
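
Printing every per-chunk answer leaves the final selection to the reader. One simple option, sketched here under stated assumptions, is to keep the full result dictionaries (rather than only `answer['answer']`) and pick the one with the highest `score`. The helper `select_best_answer` and the sample `results` are hypothetical, and the scores are hand-written rather than real model output. Scores are computed independently for each chunk, so comparing them across chunks is a heuristic, not a guarantee.

```python
def select_best_answer(results, min_score=0.0):
    """Return the highest-scoring result dict, or None if none reach min_score."""
    candidates = [r for r in results if r["score"] >= min_score]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["score"])

# Hand-written stand-ins for per-chunk pipeline results (not real model output)
results = [
    {"answer": "game design", "score": 0.12, "start": 10, "end": 21},
    {"answer": "storytelling", "score": 0.57, "start": 40, "end": 52},
    {"answer": "weapons", "score": 0.03, "start": 5, "end": 12},
]
best = select_best_answer(results, min_score=0.1)
print(best["answer"])
```

Setting `min_score` above zero filters out low-confidence spans, which extractive models tend to produce for chunks that do not contain the answer.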

Techniques like chunking make it possible to handle texts longer than the model's input limit, and they can improve the accuracy of AI-generated responses, especially when combined with RAG architectures that provide access to external, up-to-date information.



In [1]:
! pip install  PyPDF2 transformers

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 232.6/232.6 kB 17.4 MB/s eta 0:00:00
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [2]:
import PyPDF2
from google.colab import files
from transformers import pipeline


In [3]:
import PyPDF2

def extract_text_from_pdf(file_path: str) -> str:
    """
    Extracts text from a PDF file.

    Args:
        file_path: The path to the PDF file.

    Returns:
        The extracted text as a string.
    """
    text = ""
    try:
        with open(file_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            for page_num, page in enumerate(reader.pages):
                page_text = page.extract_text()
                if page_text:
                    text += page_text
                else:
                    print(f"Warning: No text extracted from page {page_num + 1}")
    except Exception as e:
        print(f"An error occurred: {str(e)}")
    return text


In [4]:
# Upload the PDF file
from google.colab import files
uploaded = files.upload()
file_path = list(uploaded.keys())[0]

# Extract text from the PDF
extracted_text = extract_text_from_pdf(file_path)

# Output the extracted text length and a snippet
print(f"Extracted text length: {len(extracted_text)}")
print(f"Extracted text snippet: {extracted_text[:500]}")


Saving 1001_Nights_ICIDS_final.pdf to 1001_Nights_ICIDS_final.pdf
Extracted text length: 49866
Extracted text snippet: See discussions, st ats, and author pr ofiles f or this public ation at : https://www .researchgate.ne t/public ation/365931544
Bringing Stories to Life in 1001 Nights: A Co-creative Text Adventu re Game
Using a Story Generation Model
Conf erence Paper    in  Lecture Not es in Comput er Scienc e · Dec ember 2022
DOI: 10.1007/978-3-031-22298-6_42
CITATIONS
18READS
1,889
6 author s, including:
Yuqian Sun
Royal Colle ge of Art
14 PUBLICA TIONS    89 CITATIONS    
SEE PROFILE
Chang Hee L ee
Korea Ad


In [5]:
from transformers import pipeline

# Initialize the question-answering pipeline
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

# Define your question
question = "What is the main point of the article?"

# Ensure the extracted text is not empty
if extracted_text:
    # Prepare the input for the model
    qa_input = {
        'question': question,
        'context': extracted_text
    }
    # Get the answer
    answer = qa_pipeline(qa_input)
    print(f"Question: {question}")
    print(f"Answer: {answer['answer']}")
else:
    print("Error: No text extracted from the PDF.")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0


Question: What is the main point of the article?
Answer: bringing
storytelling to real life


In [6]:
from transformers import pipeline

# Initialize the question-answering pipeline
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

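def chunk_text(text, max_length, overlap):
    """
    Splits the text into chunks of max_length words with a specified overlap.

    Args:
        text: The input text to be chunked.
        max_length: Maximum number of whitespace-separated words per chunk.
        overlap: Number of words shared between consecutive chunks.

    Returns:
        A list of text chunks.
    """
    if not 0 <= overlap < max_length:
        # Otherwise the window would never advance and the loop would not end
        raise ValueError("overlap must be non-negative and smaller than max_length")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_length, len(words))
        chunk = ' '.join(words[start:end])
        chunks.append(chunk)
        if end == len(words):
            break  # Last chunk reached; avoid a redundant tail chunk
        start += max_length - overlap
    return chunks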

# Define your question
question = "What is the main point of the article?"

# Parameters
max_chunk_length = 450  # Measured in words; one word can map to several tokens, so lower this if chunks exceed the model's token limit
overlap_length = 50     # Number of overlapping words

# Split the extracted text into chunks
text_chunks = chunk_text(extracted_text, max_chunk_length, overlap_length)

# Iterate over chunks and get answers
answers = []
for chunk in text_chunks:
    qa_input = {
        'question': question,
        'context': chunk
    }
    answer = qa_pipeline(qa_input)
    answers.append(answer['answer'])

# Combine or select the most appropriate answer
# For simplicity, we'll just print all answers here
for idx, ans in enumerate(answers):
    print(f"Answer from chunk {idx + 1}: {ans}")


Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Answer from chunk 1: game design
Answer from chunk 2: puts an end to his heinous crimes
Answer from chunk 3: puts an end to his heinous crimes
Answer from chunk 4: to enhance the game playing experience by providing an immersive and engaging experience
Answer from chunk 5: weapons
Answer from chunk 6: lead the King to tell more stories that contain keywords and collect weapons
Answer from chunk 7: Fig. 3
Answer from chunk 8: daughters of Snaxen
Answer from chunk 9: dreamily
Answer from chunk 10: improve public engagement
Answer from chunk 11: compare this with their achievements in the game
Answer from chunk 12: investigate the impact of engagement in storytelling
Answer from chunk 13: storytelling
Answer from chunk 14: described the plot in detail
Answer from chunk 15: I love this world, I also want to create valuable works
Answer from chunk 16: 422) who won the game
Answer from chunk 17: game environment
Answer from chunk 18: Talk to the ghost: The Storybox methodology
Answer from ch

In [7]:
!lscpu

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) CPU @ 2.20GHz
    CPU family:           6
    Model:                79
    Thread(s) per core:   2
    Core(s) per socket:   1
    Socket(s):            1
    Stepping:             0
    BogoMIPS:             4399.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 cl
                          flush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc re
                          p_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3
                           fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand
                           hypervisor lahf_lm abm 3dnowprefetch i

In [8]:
import tensorflow as tf

if tf.config.list_physical_devices('GPU'):
    !nvidia-smi
else:
    print("No GPU found")


Mon Jan 13 09:05:28 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P0              28W /  70W |    699MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [9]:
!free -h


               total        used        free      shared  buff/cache   available
Mem:            12Gi       2.0Gi       5.8Gi        15Mi       4.9Gi        10Gi
Swap:             0B          0B          0B


In [10]:
# List all installed Python packages with their versions
!pip list


Package                            Version
---------------------------------- ------------------
absl-py                            1.4.0
accelerate                         1.2.1
aiohappyeyeballs                   2.4.4
aiohttp                            3.11.11
aiosignal                          1.3.2
alabaster                          1.0.0
albucore                           0.0.19
albumentations                     1.4.20
altair                             5.5.0
annotated-types                    0.7.0
anyio                              3.7.1
argon2-cffi                        23.1.0
argon2-cffi-bindings               21.2.0
array_record                       0.6.0
arviz                              0.20.0
astropy                            6.1.7
astropy-iers-data                  0.2025.1.6.0.33.42
astunparse                         1.6.3
async-timeout                      4.0.3
atpublic                           4.1.0
attrs                              24.3.0
audioread            

In [12]:
!pip list > installed_packages.txt


In [13]:
!df -h


Filesystem      Size  Used Avail Use% Mounted on
overlay         113G   33G   80G  30% /
tmpfs            64M     0   64M   0% /dev
shm             5.7G  4.0K  5.7G   1% /dev/shm
/dev/root       2.0G  1.2G  820M  59% /usr/sbin/docker-init
/dev/sda1        68G   36G   33G  52% /opt/bin/.nvidia
tmpfs           6.4G  368K  6.4G   1% /var/colab
tmpfs           6.4G     0  6.4G   0% /proc/acpi
tmpfs           6.4G     0  6.4G   0% /proc/scsi
tmpfs           6.4G     0  6.4G   0% /sys/firmware


In [15]:
import os

for key, value in os.environ.items():
    print(f"{key}: {value}")


SHELL: /bin/bash
NV_LIBCUBLAS_VERSION: 12.2.5.6-1
NVIDIA_VISIBLE_DEVICES: all
COLAB_JUPYTER_TRANSPORT: ipc
NV_NVML_DEV_VERSION: 12.2.140-1
NV_CUDNN_PACKAGE_NAME: libcudnn8
CGROUP_MEMORY_EVENTS: /sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events
NV_LIBNCCL_DEV_PACKAGE: libnccl-dev=2.19.3-1+cuda12.2
NV_LIBNCCL_DEV_PACKAGE_VERSION: 2.19.3-1
VM_GCE_METADATA_HOST: 169.254.169.253
HOSTNAME: 23398c4b22c0
LANGUAGE: en_US
TBE_RUNTIME_ADDR: 172.28.0.1:8011
COLAB_TPU_1VM: 
GCE_METADATA_TIMEOUT: 3
NVIDIA_REQUIRE_CUDA: cuda>=12.2 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=

In [16]:
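try:
    import torch_xla
    import torch_xla.core.xla_model as xm
    # A successful import only shows that torch_xla is installed, not that a TPU is attached
    print("torch_xla is installed; a TPU runtime may be available")
except ImportError:
    print("TPU is not available")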


TPU is not available


In [17]:
import subprocess

import tensorflow as tf  # Used below for GPU detection; also imported in an earlier cell

def run_command(command):
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout

# Gather system information
report = "### System Configuration Report\n\n"
report += "**CPU Information:**\n" + run_command("lscpu") + "\n"
report += "**GPU Information:**\n" + (run_command("nvidia-smi") if tf.config.list_physical_devices('GPU') else "No GPU found") + "\n"
report += "**Memory Information:**\n" + run_command("free -h") + "\n"
report += "**Disk Space:**\n" + run_command("df -h") + "\n"
report += "**Python Version:**\n" + run_command("python --version") + "\n"
report += "**Installed Packages:**\n" + run_command("pip list") + "\n"

# Save to a text file
with open("system_configuration_report.txt", "w") as file:
    file.write(report)

print("System configuration report saved to 'system_configuration_report.txt'")


System configuration report saved to 'system_configuration_report.txt'


In [18]:
from google.colab import files

files.download("system_configuration_report.txt")

