# The Gemma Weaver: weaving knowledge from different languages

## Introduction: Weaving Knowledge with Gemma 2B: A Multilingual RAG System

This notebook details the creation of a Retrieval-Augmented Generation (RAG) system, leveraging the power of Google's Gemma 2B language model, fine-tuned on a multilingual dataset from Project Gutenberg. This approach combines the generative capabilities of LLMs with the ability to retrieve relevant information from a knowledge base, resulting in more accurate and contextually rich responses.

The notebook is divided into three key parts:

1. **Fine-tuning:** In this section, we adapt the pre-trained Gemma 2B model to better understand and generate text from the multilingual Gutenberg dataset. We utilize Parameter-Efficient Fine-Tuning (PEFT) techniques, specifically Low-Rank Adaptation (LoRA), to efficiently train the model. Quantization is also employed to manage memory usage during training.

2. **Making the RAG Database:** Here, we construct a vector database from the same Gutenberg dataset. This involves chunking the text data, generating sentence embeddings using a transformer model, and indexing these embeddings using FAISS for efficient similarity search. This database will serve as the knowledge source for our RAG system.

3. **Putting it Together:**  This final section integrates the fine-tuned Gemma 2B model with the RAG database. When a query is posed, the system first retrieves relevant text chunks from the database based on semantic similarity. This retrieved context is then used to augment the prompt given to the language model, allowing it to generate more informed and grounded answers.

Each cell in the notebook is explained in detail to provide a clear understanding of the code and the underlying processes involved in building this multilingual RAG system. By following this notebook, you will gain a comprehensive understanding of how to build a RAG system from scratch using state-of-the-art models and techniques.


## Part 1: Fine-tuning a Language Model

This section focuses on fine-tuning a pre-trained language model, specifically Google's Gemma 2B, on a multilingual text dataset. The goal is to adapt the model to better understand and generate text relevant to the content it's trained on.

In [None]:
!pip install -q datasets trl accelerate peft bitsandbytes kagglehub sentence-transformers faiss-gpu

This cell installs the necessary Python libraries for fine-tuning and working with large language models. Here's a breakdown:

*   `datasets`: Provides access to various datasets, including those from Hugging Face.
*   `trl` (Transformer Reinforcement Learning): Contains tools and utilities for training language models, including supervised fine-tuning.
*   `accelerate`: Facilitates distributed training and handles hardware abstraction.
*   `peft` (Parameter-Efficient Fine-Tuning): Enables efficient fine-tuning of large models by only training a small subset of parameters, like LoRA (Low-Rank Adaptation).
*   `bitsandbytes`: Allows for efficient memory usage by quantizing model weights (e.g., loading in 4-bit).
*   `kagglehub`: Enables interaction with Kaggle models and datasets.
*   `sentence-transformers`: A library for creating sentence embeddings. While not directly used in the fine-tuning, it's used later for the RAG database.
*   `faiss-gpu`: A library for efficient similarity search in high-dimensional spaces, used later for the RAG database. The `-gpu` version indicates GPU support.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer
from transformers import GemmaTokenizerFast
from peft import LoraConfig, get_peft_model, TaskType, PeftModel, PeftConfig
from accelerate import Accelerator
from datasets import load_dataset
import numpy as np
import os
import pickle
import re
from tqdm import tqdm
import concurrent.futures
import cProfile
import pandas as pd
import shutil
import kagglehub

This cell imports the specific modules and classes needed from the installed libraries. Key imports include:

*   `torch`: The fundamental PyTorch library for tensor operations and neural networks.
*   `transformers`: Hugging Face's core library for working with pre-trained language models, including tokenizers and model classes.
*   `trl`: Specifically `SFTConfig` for setting up supervised fine-tuning configurations and `SFTTrainer` for performing the fine-tuning.
*   `peft`: For using LoRA (`LoraConfig`, `get_peft_model`) and managing PEFT models.
*   `accelerate`: For using the `Accelerator` class, although it's not explicitly used in this snippet.
*   `datasets`: To load the training dataset.
*   Standard-library modules such as `os`, `pickle`, `re`, and `shutil` for file operations, data serialization, and regular expressions, plus the third-party `tqdm` for progress bars.


In [None]:
# Set the environment variables for Kaggle and Hugging Face.
# from kaggle_secrets import UserSecretsClient  (if you use Kaggle)
# from google.colab import userdata  (if you use Google Colab)
# import getpass  (if you use a Jupyter notebook)
os.environ["KAGGLE_USERNAME"] = "your-username"  # or UserSecretsClient().get_secret("KAGGLE_USERNAME") or userdata.get("KAGGLE_USERNAME") or getpass.getpass("Enter your KAGGLE_USERNAME: ")
os.environ["KAGGLE_KEY"] = "kaggle-api-key"  # or UserSecretsClient().get_secret("KAGGLE_KEY") or userdata.get("KAGGLE_KEY") or getpass.getpass("Enter your KAGGLE_KEY: ")
os.environ["HF_TOKEN"] = "huggingface-api-key"  # or UserSecretsClient().get_secret("HF_TOKEN") or userdata.get("HF_TOKEN") or getpass.getpass("Enter your HF_TOKEN: ")

This cell sets environment variables required for authentication with Kaggle and Hugging Face.

*   `KAGGLE_USERNAME` and `KAGGLE_KEY`: Your Kaggle API credentials, needed for uploading the fine-tuned model later. **Remember to replace `"your-username"` and `"kaggle-api-key"` with your actual credentials.** The commented-out code shows alternative ways to securely manage secrets depending on your environment (Kaggle Secrets, Google Colab userdata, or direct input).
*   `HF_TOKEN`: Your Hugging Face API token, which might be required for downloading certain models or pushing to the Hugging Face Hub. **Replace `"huggingface-api-key"` with your actual token.**
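
The alternatives listed in the code comments can be folded into one small helper. This is a minimal sketch and not part of the original notebook: `get_secret` and its `prompt` parameter are hypothetical names, and only the environment-variable and interactive-prompt paths are covered (Kaggle Secrets or Colab `userdata` lookups would slot in the same way).

```python
import getpass
import os


def get_secret(name, prompt=getpass.getpass):
    """Return the secret `name` from the environment, prompting only if it is unset.

    `get_secret` is a hypothetical helper; `prompt` is injectable so the
    function can be exercised without an interactive terminal.
    """
    value = os.environ.get(name)
    if not value:
        value = prompt(f"Enter your {name}: ")
        os.environ[name] = value  # cache for libraries that read the environment
    return value


# Example: an already-exported variable is returned without prompting.
os.environ["DEMO_TOKEN"] = "abc123"
print(get_secret("DEMO_TOKEN"))
```

Reading from the environment first keeps real credentials out of the notebook source, which matters if the notebook is shared publicly.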

In [None]:
dataset = load_dataset("sedthh/gutenberg_multilang", split="train")
dataset = dataset.shuffle(seed=65).select(range(3000)) # Only use 3000 samples for quick demo
dataset = dataset.rename_column('TEXT', 'text')

This cell loads and prepares the dataset for fine-tuning.

*   `load_dataset("sedthh/gutenberg_multilang", split="train")`: Loads the "sedthh/gutenberg_multilang" dataset from Hugging Face Datasets. This dataset contains multilingual text extracted from Project Gutenberg. The `split="train"` argument specifies that we want the training portion of the dataset.
*   `.shuffle(seed=65)`: Shuffles the dataset with a fixed random seed (65) for reproducibility.
*   `.select(range(3000))`: Selects the first 3000 samples from the shuffled dataset. This is done for demonstration purposes to speed up the fine-tuning process. In a real-world scenario, you would likely use a larger portion of the dataset.
*   `.rename_column('TEXT', 'text')`: Renames the column containing the text data from 'TEXT' to 'text'. This is often necessary as the `SFTTrainer` expects the text column to be named 'text' by default.

For more information, see the dataset [page](https://huggingface.co/datasets/sedthh/gutenberg_multilang).

We used 7 languages for fine-tuning, which are listed on the dataset page.

In [None]:
model_name = "google/gemma-2-2b-it" 

quantizationConfig = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantizationConfig
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = 'right' # To avoid warnings
lora_config = LoraConfig(
    r=4,  # Adjust this value to control the number of trainable parameters
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],  # Specify target modules to apply LoRA(linear)
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
print("Model loaded")
model.enable_input_require_grads()  # Required for training; without it you will run into errors.

This cell loads the pre-trained Gemma 2B model and configures it for fine-tuning with LoRA and quantization.

*   `model_name = "google/gemma-2-2b-it"`: Specifies the pre-trained model to use, in this case, the instruction-tuned version of Gemma 2B from Google.
*   `quantizationConfig = BitsAndBytesConfig(...)`: Configures the `bitsandbytes` library for 4-bit quantization.
    *   `load_in_4bit=True`: Loads the model weights in 4-bit precision, reducing memory usage.
    *   `bnb_4bit_compute_dtype=torch.float16`: Specifies the data type for computation within the 4-bit quantized layers (using half-precision floating-point for potential speedups).
    *   `bnb_4bit_quant_type="nf4"`: Uses the Normal Float 4 (NF4) quantization type.
*   `model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantizationConfig)`: Loads the pre-trained Gemma model with the specified quantization configuration. `AutoModelForCausalLM` is used because Gemma is a causal language model (predicting the next token).
*   `tokenizer = AutoTokenizer.from_pretrained(model_name)`: Loads the tokenizer associated with the Gemma model. Tokenizers are used to convert text into numerical representations that the model can understand.
*   `tokenizer.padding_side = 'right'`: Sets the padding side for tokenization. Padding is used to make sequences in a batch have the same length. Setting it to 'right' is a common practice to avoid issues with certain models.
*   `lora_config = LoraConfig(...)`: Configures the LoRA (Low-Rank Adaptation) method for parameter-efficient fine-tuning.
    *   `r=4`: The rank of the low-rank matrices used in LoRA. A smaller value reduces the number of trainable parameters.
    *   `lora_alpha=8`: A scaling factor for the LoRA updates. The low-rank update is scaled by `lora_alpha / r`, which is 2 with these settings.
    *   `target_modules=["q_proj", "v_proj"]`: Specifies the linear layers within the transformer blocks where LoRA will be applied (query and value projection layers in the attention mechanism).
    *   `lora_dropout=0.1`: Dropout probability for the LoRA layers.
    *   `bias="none"`: Keeps all bias parameters frozen, so only the LoRA matrices are trained.
    *   `task_type="CAUSAL_LM"`: Indicates that the task is causal language modeling.
*   `model = get_peft_model(model, lora_config)`: Wraps the base Gemma model with the LoRA adapters, making only the LoRA parameters trainable.
*   `print("Model loaded")`: Prints a confirmation message.
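*   `model.enable_input_require_grads()`: Makes the output of the input embedding layer require gradients. Because the base model's weights are frozen and gradient checkpointing is enabled in the training configuration, this keeps gradients flowing back to the LoRA adapters. Without it, you will encounter errors during training.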
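
The effect of `r` and `lora_alpha` is easier to see with a little arithmetic. The sketch below is illustrative only: the layer size is an assumed placeholder rather than Gemma's actual dimensions, and `lora_param_count` and `lora_scaling` are hypothetical helper names.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    LoRA learns two matrices, A (r x d_in) and B (d_out x r).
    """
    return r * d_in + d_out * r


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A (standard LoRA scaling)."""
    return lora_alpha / r


# Illustrative layer size (an assumption, not Gemma's real dimensions).
d_in, d_out = 2048, 2048
full = d_in * d_out
added = lora_param_count(d_in, d_out, r=4)

print(f"full layer weights : {full}")
print(f"LoRA parameters    : {added}")
print(f"fraction trained   : {added / full:.4%}")
print(f"update scaling     : {lora_scaling(lora_alpha=8, r=4)}")
```

Doubling `r` doubles the number of trainable parameters per adapted layer, while `lora_alpha` only changes how strongly the learned update is applied.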


In [None]:
training_args = SFTConfig(
    per_device_train_batch_size=1,
    torch_empty_cache_steps=5,
    max_steps=500,
    warmup_steps=200,
    logging_steps=1,
    save_strategy="no",
    gradient_checkpointing=True,
    max_seq_length=512,
    output_dir="gemma2_2b",
    report_to="none",
)

In [None]:
trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        processing_class=tokenizer,
    )

These cells set up the training configuration and initialize the `SFTTrainer`.

*   `training_args = SFTConfig(...)`: Creates a configuration object for supervised fine-tuning.
    *   `per_device_train_batch_size=1`: The batch size for training on each GPU or CPU.
    *   `torch_empty_cache_steps=5`: How often to empty the PyTorch CUDA cache to free up GPU memory.
    *   `max_steps=500`: The total number of training steps.
    *   `warmup_steps=200`: The number of steps for the learning rate warmup phase.
    *   `logging_steps=1`: How often to log training information.
    *   `save_strategy="no"`: Disables saving checkpoints during training.
    *   `gradient_checkpointing=True`: Enables gradient checkpointing to reduce memory usage during backpropagation, potentially at the cost of some computation time.
    *   `max_seq_length=512`: The maximum length of input sequences the model will process.
    *   `output_dir="gemma2_2b"`: The directory where training outputs (like logs) will be saved.
    *   `report_to="none"`: Disables reporting training metrics to platforms like Weights & Biases.
*   `trainer = SFTTrainer(...)`: Initializes the `SFTTrainer` with the configured settings.
    *   `model=model`: The model to fine-tune (Gemma wrapped with LoRA adapters).
    *   `args=training_args`: The training configuration.
    *   `train_dataset=dataset`: The dataset to use for training.
    *   `processing_class=tokenizer`: The tokenizer to use for preprocessing the data.
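
With `warmup_steps=200` and `max_steps=500`, the learning rate is still ramping up for the first 40% of the run. The sketch below shows the shape of such a schedule under two assumptions that the notebook does not state: that warmup is followed by linear decay to zero, and that the peak learning rate is the placeholder `base_lr`. The function name `linear_schedule_factor` is hypothetical.

```python
def linear_schedule_factor(step, warmup_steps, max_steps):
    """Learning-rate multiplier for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = max_steps - step
    return max(0.0, remaining / max(1, max_steps - warmup_steps))


base_lr = 2e-4  # assumed placeholder; the notebook keeps the library default
for step in (0, 100, 200, 350, 500):
    factor = linear_schedule_factor(step, warmup_steps=200, max_steps=500)
    print(f"step {step:3d}: lr = {base_lr * factor:.2e}")
```

If the loss is still noisy late in such a short run, shortening the warmup is one of the first settings worth revisiting.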

In [None]:
trainer.train()

This cell starts the fine-tuning process. The `trainer.train()` method will iterate through the training dataset, update the model's weights (specifically the LoRA adapters in this case), and log the training progress according to the `training_args`.

In [None]:
trainer.save_model("/kaggle/working/gemma2_2b")

In [None]:
# 1. Load the base model
model_name_or_path = "/kaggle/input/gemma-2/transformers/gemma-2-2b-it/2"  # or "google/gemma-2-2b-it"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

# 2. Load the PEFT adapter
adapter_model_name_or_path = "/kaggle/working/gemma2_2b"
model = PeftModel.from_pretrained(model, adapter_model_name_or_path)

# 3. Merge the adapter into the base model
merged_model = model.merge_and_unload()

In [None]:
os.makedirs("/kaggle/working/gemma2-2b-gutenberg-merged", exist_ok=True)
merged_model.save_pretrained("/kaggle/working/gemma2-2b-gutenberg-merged")
print("Merged model saved.")

These cells save the fine-tuned LoRA adapters and then merge them back into the original base model.

*   `trainer.save_model("/kaggle/working/gemma2_2b")`: Saves the trained LoRA adapters to the specified directory. This saves the changes made during fine-tuning.
*   The subsequent lines load the base Gemma model again:
    *   `model_name_or_path = ...`: Specifies the path to the original Gemma model. Note that it's loading from a Kaggle input directory, indicating the base model was likely downloaded and placed there previously.
    *   `model = AutoModelForCausalLM.from_pretrained(model_name_or_path)`: Loads the base model.
*   Then, it loads the saved LoRA adapters:
    *   `adapter_model_name_or_path = "/kaggle/working/gemma2_2b"`: Specifies the directory where the LoRA adapters were saved.
    *   `model = PeftModel.from_pretrained(model, adapter_model_name_or_path)`: Loads the LoRA adapters and applies them to the base model.
*   Finally, it merges the adapters into the base model:
    *   `merged_model = model.merge_and_unload()`: Merges the LoRA weights into the base model's weights and returns the base model with the adapter layers removed, giving a standalone fine-tuned model.
    *   `os.makedirs(...)`: Creates the directory to save the merged model if it doesn't exist.
    *   `merged_model.save_pretrained(...)`: Saves the fully merged fine-tuned model to the specified directory.
    *   `print("Merged model saved.")`: Prints a confirmation message.
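
To see why merging does not change the model's outputs, here is a toy sketch in plain Python. It assumes the standard LoRA formulation, in which the adapted layer computes `W x + scaling * B (A x)`. All matrices are made-up values, and `matmul` and `merge_lora` are hypothetical helpers, not PEFT functions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, scaling):
    """Return W + scaling * (B @ A), the merged weight matrix."""
    delta = matmul(B, A)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy sizes: a 2x3 base weight with a rank-1 adapter (all values are made up).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, -1.0, 0.0]]     # r x d_in
B = [[2.0], [1.0]]         # d_out x r
scaling = 2.0              # lora_alpha / r = 8 / 4 in the notebook's config
x = [[1.0], [2.0], [3.0]]  # one input column vector

# Adapter kept separate: base output plus the scaled low-rank path.
low_rank = matmul(B, matmul(A, x))
separate = [[base[0] + scaling * lr[0]] for base, lr in zip(matmul(W, x), low_rank)]

# Adapter merged into the weight first.
merged = matmul(merge_lora(W, A, B, scaling), x)

print(separate, merged)
```

Both paths give the same result, which is why the merged model can be saved and loaded like an ordinary checkpoint with no dependency on the adapter files.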

In [None]:
# List of files to copy
files_to_copy = ['/kaggle/working/gemma2_2b/tokenizer_config.json',
                '/kaggle/working/gemma2_2b/tokenizer.model', '/kaggle/working/gemma2_2b/tokenizer.json',
                '/kaggle/working/gemma2_2b/special_tokens_map.json']
destination_directory = '/kaggle/working/gemma2-2b-gutenberg-merged'

# Ensure the destination directory exists
if not os.path.exists(destination_directory):
    os.makedirs(destination_directory)

# Copy each file
for file in files_to_copy:
    shutil.copy(file, destination_directory)
    print(f"File '{file}' copied to '{destination_directory}'.")

This cell copies necessary tokenizer files from the LoRA adapter's directory to the merged model's directory. The tokenizer configuration is essential for correctly processing text with the fine-tuned model.

*   `files_to_copy = [...]`: Lists the specific tokenizer files needed. These files contain information about the vocabulary, tokenization rules, and special tokens used by the tokenizer.
*   `destination_directory = ...`: Specifies the directory where the merged model is saved.
*   The code then checks if the destination directory exists and creates it if it doesn't.
*   Finally, it iterates through the `files_to_copy` list and uses `shutil.copy()` to copy each file to the destination directory.


In [None]:
if "KAGGLE_USERNAME" not in os.environ or "KAGGLE_KEY" not in os.environ:
    kagglehub.login()

model_version = 1
kaggle_username = kagglehub.whoami()["username"]
fine_tuned_model_name = "gemma2_2b_gutneberg"
handle = f'{kaggle_username}/gemma2/transformers/{fine_tuned_model_name}'
print(f"Handle: {handle}\n")
local_model_dir = "/kaggle/working/gemma2-2b-gutenberg-merged"
kagglehub.model_upload(handle, local_model_dir)
print("Done!")

This cell uploads the fine-tuned model to Kaggle Models.

*   `if "KAGGLE_USERNAME" not in os.environ or "KAGGLE_KEY" not in os.environ:`: Checks if the Kaggle API credentials are set as environment variables. If not, it calls `kagglehub.login()` to prompt for login.
*   `model_version = 1`: Stores a version number for the model. Note that this variable is not passed to `kagglehub.model_upload` in this cell.
*   `kaggle_username = kagglehub.whoami()["username"]`: Retrieves your Kaggle username.
*   `fine_tuned_model_name = "gemma2_2b_gutneberg"`: Defines the name you want to give to your fine-tuned model on Kaggle.
*   `handle = f'{kaggle_username}/gemma2/transformers/{fine_tuned_model_name}'`: Creates the unique identifier (handle) for your model on Kaggle. The format is `owner/model/framework/variation`.
*   `print(f"Handle: {handle}\n")`: Prints the model handle.
*   `local_model_dir = "/kaggle/working/gemma2-2b-gutenberg-merged"`: Specifies the local directory where the merged fine-tuned model is saved.
*   `kagglehub.model_upload(handle, local_model_dir)`: Uploads the model from the local directory to Kaggle Models using the specified handle.
*   `print("Done!")`: Prints a confirmation message.
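
The handle is four slash-separated parts, so a stray slash or an empty part produces an invalid handle. A small sketch of building it defensively follows; `build_model_handle` is a hypothetical helper and `"alice"` is a placeholder owner.

```python
def build_model_handle(owner, model, framework, variation):
    """Join the four handle parts, rejecting empty parts or embedded slashes."""
    parts = [owner, model, framework, variation]
    for part in parts:
        if not part or "/" in part:
            raise ValueError(f"invalid handle part: {part!r}")
    return "/".join(parts)


# "alice" is a placeholder owner; the other parts mirror the cell above.
print(build_model_handle("alice", "gemma2", "transformers", "gemma2_2b_gutneberg"))
```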


## Part 2: Making the RAG Database

This section focuses on creating a vector database that will be used for retrieving relevant information to augment the language model's responses. This involves processing a collection of text documents, creating embeddings for text chunks, and indexing these embeddings for efficient similarity search.

In [None]:
import os
import re
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import faiss
import numpy as np
import pandas as pd
from typing import List, Tuple, Optional, Dict
import textwrap
import json
import pickle
import kagglehub
import concurrent.futures
import gc

This cell imports the necessary libraries for building the RAG database. Many of these libraries were also used in the fine-tuning section, but some are specific to this part:

*   `sentence_transformers`: Used to create embeddings of text chunks.
*   `faiss`: The library for efficient similarity search in high-dimensional spaces, which will be used to build and query the vector database.
*   `typing`: For type hinting, making the code more readable and maintainable.
*   `concurrent.futures`: For parallel processing of files.
*   `gc`: For garbage collection, useful for managing memory when dealing with large datasets.

In [None]:
paths = []
for dirname, _, filenames in os.walk('/kaggle/input/gutenberg-over-70000'):
    for filename in filenames:
        paths.append(os.path.join(dirname, filename))

    # Stop once more than 150 paths are collected (checked once per directory).
    # This keeps the demo quick; the full dataset takes a long time to process.
    if len(paths) > 150:
        break

print(f"data length: {len(paths)}")

In [None]:
meta_path = paths[0]#this is a .csv file so we can exclude it
paths = paths[1:]

In [None]:
def unpickle_file(path):
    """Unpickle a single file and process its content."""
    with open(path, "rb") as f:
        try:
            file = pickle.load(f)
        except (pickle.UnpicklingError, EOFError, ImportError) as e:
            print(f"Error unpickling file at {path}: {e}")
            return None

    # Ensure file is a string
    if isinstance(file, str):
        file = ' '.join(file.split())
        pattern = r'[^a-zA-Z0-9 ]'
        file = re.sub(pattern, ' ', file)
        file = re.sub(r'\s+', ' ', file).strip()
        return file  # Return the processed file
    else:
        print(f"Warning: The file at {path} did not contain a string.")
        return None

def process_files(paths):
    """Process files in parallel using a generator."""
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for processed_file in tqdm(executor.map(unpickle_file, paths), desc="Processing files", unit="file"):
            if processed_file is not None:
                yield processed_file  # Yield each processed file

In [None]:
processed_files = list(process_files(paths))
print("Books are ready...")

These cells load and preprocess the text data from the Gutenberg dataset.

*   The first cell uses `os.walk` to recursively traverse `/kaggle/input/gutenberg-over-70000` and collect the paths of the files in it. To keep the demo quick, it stops walking once more than 150 paths have been collected. Because this check runs once per directory, after all of that directory's files have been added, the final count can exceed 150.
*   The `print` call reports the number of paths collected.
*   `meta_path = paths[0]`: Assumes the first file is a metadata file (likely a CSV) and stores its path.
*   `paths = paths[1:]`: Removes the metadata file path from the list of text file paths.
*   The `unpickle_file` function takes a file path, attempts to unpickle its contents (assuming the files are pickled objects), and then preprocesses the text if it's a string. Preprocessing includes:
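    *   Replacing every character other than ASCII letters, digits, and spaces with a space. Note that this also strips accented and non-Latin characters, so non-English text loses information at this step.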
    *   Removing extra whitespace.
*   The `process_files` function uses `concurrent.futures.ProcessPoolExecutor` to process the files in parallel, speeding up the process. It uses a generator to yield processed files one by one, which is memory-efficient.
*   `processed_files = list(process_files(paths))`: Executes the parallel processing and stores the processed text content of the books in a list.
*   `print("Books are ready...")`: Prints a confirmation message.

For more information about the dataset, see the dataset [page](https://www.kaggle.com/datasets/jasonheesanglee/gutenberg-over-70000).
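
To make the cleaning steps concrete, the sketch below applies the same three operations as `unpickle_file` to two sample strings. `clean_text` is a hypothetical name for this standalone copy of the logic.

```python
import re


def clean_text(text):
    """Apply the same three cleaning steps as `unpickle_file`."""
    text = " ".join(text.split())               # collapse runs of whitespace
    text = re.sub(r"[^a-zA-Z0-9 ]", " ", text)  # keep only ASCII letters, digits, spaces
    return re.sub(r"\s+", " ", text).strip()    # tidy the spaces left behind


print(clean_text("Hello,   world!\nChapter 1."))
print(clean_text("Café au lait, s'il vous plaît!"))
```

The second call prints `Caf au lait s il vous pla t`: the accented letters are replaced along with the punctuation. For a multilingual corpus, a Unicode-aware pattern may be worth considering instead.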

In [None]:
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> List[str]:
    """Chunk text with overlap."""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks

def create_embeddings(text_list: List[str]) -> np.ndarray:
    """Create embeddings."""
    embeddings = embedding_model.encode(text_list)
    return np.array(embeddings).astype('float32')

def create_faiss_index(embeddings: np.ndarray) -> faiss.IndexFlatL2:
    """Create FAISS index."""
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index

def initialize_rag_system(full_text_list: List[str], 
                         chunk_size: int = 256, 
                         overlap: int = 32) -> Tuple[faiss.IndexFlatL2, List[str], Dict[int, int]]:
    """
    Initialize the RAG database by chunking the documents, embedding the chunks,
    and building the FAISS index. Should be called once at startup.
    
    Args:
        full_text_list: List of all text documents
        chunk_size: Size of text chunks, in characters
        overlap: Overlap between consecutive chunks, in characters
        
    Returns:
        Tuple containing (faiss_index, all_chunks, chunk_to_doc_map)
    """

    # Process all documents
    all_chunks = []
    chunk_to_doc_map = {}  # Maps chunk index to original document index
    
    for doc_idx, text in enumerate(full_text_list):
        chunks = chunk_text(text, chunk_size, overlap)
        for chunk in chunks:
            chunk_to_doc_map[len(all_chunks)] = doc_idx
            all_chunks.append(chunk)
    print("Chunks loaded successfully...")
    print(f"Number of chunks: {len(all_chunks)}")
    # Create embeddings for all chunks
    chunk_embeddings = create_embeddings(all_chunks)
    print("Embeddings created successfully...")
    # Create FAISS index
    index = create_faiss_index(chunk_embeddings)
    print("Vector database created successfully...")
    
    return index, all_chunks, chunk_to_doc_map

def save_rag_system(faiss_index: faiss.IndexFlatL2, 
                   all_chunks: list, 
                   chunk_to_doc_map: dict,
                   save_dir: str = "rag_system"):
    """
    Save the RAG system components to disk.
    
    Args:
        faiss_index: The FAISS index
        all_chunks: List of text chunks
        chunk_to_doc_map: Mapping from chunk index to document index
        save_dir: Directory to save the components
    """
    # Create directory if it doesn't exist
    os.makedirs(save_dir, exist_ok=True)
    
    # Save FAISS index
    faiss.write_index(faiss_index, f"{save_dir}/faiss_index.bin")
    
    # Save chunks and mapping
    with open(os.path.join(save_dir, "chunks_and_mapping.pkl"), "wb") as f:
        pickle.dump({
            "all_chunks": all_chunks,
            "chunk_to_doc_map": chunk_to_doc_map
        }, f)
    
    print(f"RAG system saved to {save_dir}")

def load_rag_system(save_dir: str = "rag_system") -> tuple:
    """
    Load the RAG system components from disk.
    
    Args:
        save_dir: Directory containing the saved components
        
    Returns:
        tuple: (faiss_index, all_chunks, chunk_to_doc_map)
    """
    # Load FAISS index
    faiss_index = faiss.read_index(f"{save_dir}/faiss_index.bin")
    
    # Load chunks and mapping
    with open(os.path.join(save_dir, "chunks_and_mapping.pkl"), "rb") as f:
        data = pickle.load(f)
        all_chunks = data["all_chunks"]
        chunk_to_doc_map = data["chunk_to_doc_map"]
    
    print(f"RAG system loaded from {save_dir}")
    return faiss_index, all_chunks, chunk_to_doc_map



def process_rag_query(
    query: str,
    faiss_index: faiss.IndexFlatL2,
    all_chunks: List[str],
    chunk_to_doc_map: Dict[int, int],
    model,
    tokenizer: AutoTokenizer,
    embedding_model,
    top_k: int = 5
) -> Dict:
    """
    Process a query using the pre-initialized RAG system with memory optimization.
    """
    # Get query embedding
    query_embedding = create_embeddings([query])[0].reshape(1, -1)
    
    # Search for relevant chunks
    distances, chunk_indices = faiss_index.search(query_embedding, top_k)
    
    # Clean up query embedding for memory
    del query_embedding
    gc.collect()  
    
    # Retrieve relevant chunks and their source documents
    relevant_chunks = [all_chunks[idx] for idx in chunk_indices[0]]
    source_docs = [chunk_to_doc_map[idx] for idx in chunk_indices[0]]
    
    # Create augmented prompt with context
    context = "\n\n".join(relevant_chunks)
    augmented_prompt = f"""Based on the following context, please answer the question:

Context:
{context}

Question: {query}"""
    
    # Get model response
    response = get_llm_response(augmented_prompt, model, tokenizer)
    
    # Clear any remaining temporary variables
    #del context, augmented_prompt, relevant_chunks, query, source_docs, distances, chunk_indices
    gc.collect()
    torch.cuda.empty_cache()
    
    return {
        "query": query,
        "response": response,
        "relevant_chunks": relevant_chunks,
        "source_documents": source_docs,
        "distances": distances[0].tolist()
    }

def get_llm_response(prompt: str, model, tokenizer) -> str:
    """
    Get response from the language model.
    """
    system_instructions = "You are an intelligent assistant. With access to the provided content, try to answer using it as well. Note: Try to answer in the same language as the user's message."
    message = f"{system_instructions} {prompt}"
    messages = [
        {"role": "user", "content": message},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    #print(text)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

This cell defines several functions that are crucial for creating and using the RAG system.

*   `chunk_text(text, chunk_size=256, overlap=32)`: Splits a given text into chunks of `chunk_size` characters, with `overlap` characters shared between consecutive chunks. This overlap helps to maintain context across chunks.
*   `create_embeddings(text_list)`: Takes a list of text chunks and generates their embeddings using the `embedding_model` (loaded later). It returns the embeddings as a NumPy array of type float32.
*   `create_faiss_index(embeddings)`: Creates a FAISS index from the given embeddings. `faiss.IndexFlatL2` is a flat index that performs exact (brute-force) search by L2 distance. The index is used for nearest neighbor search.
*   `initialize_rag_system(full_text_list, chunk_size=256, overlap=32)`: This function orchestrates the initial setup of the RAG system.
    *   It takes a list of full text documents and the chunking parameters. It relies on the global `embedding_model` (loaded in a later cell) and does not load the language model itself.
    *   It iterates through each document, chunks it using `chunk_text`, and keeps track of which chunk belongs to which original document using `chunk_to_doc_map`.
    *   It then calls `create_embeddings` to generate embeddings for all the chunks and `create_faiss_index` to build the vector database.
    *   It returns the FAISS index, the list of all chunks, and the mapping from chunk index to document index.
*   `save_rag_system(faiss_index, all_chunks, chunk_to_doc_map, save_dir="rag_system")`: Saves the created RAG components (FAISS index, text chunks, and the chunk-to-document mapping) to disk using `faiss.write_index` and `pickle.dump`.
*   `load_rag_system(save_dir="rag_system")`: Loads the saved RAG components from disk.
*   `process_rag_query(query, faiss_index, all_chunks, chunk_to_doc_map, model, tokenizer, embedding_model, top_k=5)`: This function processes a user query by:
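    *   Generating an embedding for the query with `create_embeddings`, which uses the global `embedding_model`. The `embedding_model` argument of this function is accepted but not used.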
    *   Searching the FAISS index for the `top_k` most similar chunk embeddings.
    *   Retrieving the corresponding text chunks and the indices of their source documents.
    *   Constructing an augmented prompt by including the retrieved context along with the original query.
    *   Generating a response from the language model using the augmented prompt.
    *   It includes memory management techniques like deleting the query embedding and running garbage collection.
*   `get_llm_response(prompt, model, tokenizer)`: This function takes a prompt and generates a response from the language model. It formats the prompt as a chat message and uses the tokenizer to prepare the input for the model. It then generates text and decodes the output tokens back into a string.
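
The chunking and bookkeeping steps can be tried on tiny inputs. This sketch reuses the notebook's character-based chunking and adds a guard on `overlap`, which is an addition, not part of the original function. `build_chunks` is a hypothetical helper that mirrors the loop inside `initialize_rag_system`.

```python
from typing import Dict, List, Tuple


def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> List[str]:
    """Character-based chunking with overlap, as in the notebook, plus a guard."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def build_chunks(docs: List[str], chunk_size: int, overlap: int) -> Tuple[List[str], Dict[int, int]]:
    """Chunk every document and record which document each chunk came from."""
    all_chunks: List[str] = []
    chunk_to_doc: Dict[int, int] = {}
    for doc_idx, text in enumerate(docs):
        for chunk in chunk_text(text, chunk_size, overlap):
            chunk_to_doc[len(all_chunks)] = doc_idx
            all_chunks.append(chunk)
    return all_chunks, chunk_to_doc


chunks, mapping = build_chunks(["abcdefghij", "klmno"], chunk_size=4, overlap=1)
print(chunks)
print(mapping)
```

With `chunk_size=4` and `overlap=1`, each chunk starts 3 characters after the previous one, so the last character of one chunk reappears as the first character of the next. Without the guard, an `overlap` equal to `chunk_size` makes `range` raise an error, and a larger `overlap` silently produces no chunks.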

In [None]:
# Load the sentence transformer model used to embed the chunks and queries
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
print("Embedding model loaded...")

In [None]:
faiss_index, all_chunks, chunk_to_doc_map = initialize_rag_system(processed_files)

In [None]:
#Saving the database and chunks so we don't have to do this again.
save_rag_system(faiss_index, all_chunks, chunk_to_doc_map, save_dir="/kaggle/temp/rag_system")

These cells instantiate the sentence transformer model, then build and save the RAG database.

*   `embedding_model = SentenceTransformer("all-MiniLM-L6-v2")`: Loads the `all-MiniLM-L6-v2` pre-trained sentence embedding model from the `sentence-transformers` library. This model will be used to create vector representations of the text chunks.
*   `print("Embedding model loaded...")`: Prints a confirmation message.
*   `faiss_index, all_chunks, chunk_to_doc_map = initialize_rag_system(processed_files)`: Calls the `initialize_rag_system` function, passing the list of processed book texts. This creates the text chunks, generates their embeddings, and builds the FAISS index. The returned values are the FAISS index, the list of all text chunks, and the mapping from chunk index to the original document index.
*   `save_rag_system(faiss_index, all_chunks, chunk_to_doc_map, save_dir="/kaggle/temp/rag_system")`: Calls the `save_rag_system` function to save the created FAISS index, the list of text chunks, and the chunk-to-document mapping to the `/kaggle/temp/rag_system` directory. This allows you to load the database later without recomputing everything.
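
To make the "build once, load later" idea concrete, here is a small sketch of a load-or-build pattern using only the standard library. It is an assumption-laden illustration: `load_or_build`, `build_chunks`, and the file name `chunks.pkl` are hypothetical and are not the names used by `save_rag_system`. Only the pickled parts are shown; the FAISS index itself would be written and read with `faiss.write_index` and `faiss.read_index` rather than `pickle`.

```python
import os
import pickle
import tempfile


def load_or_build(save_dir, build_fn, filename="chunks.pkl"):
    """Load pickled data from save_dir if it exists; otherwise build and save it.

    Returns (data, was_cached).
    """
    path = os.path.join(save_dir, filename)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), True
    data = build_fn()
    os.makedirs(save_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data, False


def build_chunks():
    """Hypothetical stand-in for the expensive chunking step."""
    return {"all_chunks": ["chunk a", "chunk b"], "chunk_to_doc_map": [0, 0]}


with tempfile.TemporaryDirectory() as tmp:
    save_dir = os.path.join(tmp, "rag_system")
    first, first_cached = load_or_build(save_dir, build_chunks)    # builds and saves
    second, second_cached = load_or_build(save_dir, build_chunks)  # loads from disk

print(first_cached, second_cached)
```

Note that `pickle` files should only be loaded from sources you trust, since unpickling can execute arbitrary code.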


In [None]:
kaggle_username = kagglehub.whoami()["username"]
handle = f'{kaggle_username}/Sample_Gutenverg_vector_database_for_Rag'
local_dataset_dir = '/kaggle/temp/rag_system'

# Create a new dataset
kagglehub.dataset_upload(handle, local_dataset_dir)

print("Database uploaded to kaggle datasets...")

This cell uploads the created RAG database to Kaggle Datasets.

*   `kaggle_username = kagglehub.whoami()["username"]`: Retrieves your Kaggle username using the `kagglehub` library.
*   `handle = f'{kaggle_username}/Sample_Gutenverg_vector_database_for_Rag'`: Defines the handle (unique identifier) for your dataset on Kaggle. The format is typically `username/dataset_name`.
*   `local_dataset_dir = '/kaggle/temp/rag_system'`: Specifies the local directory where the saved RAG database components are located.
*   `kagglehub.dataset_upload(handle, local_dataset_dir)`: Uploads the contents of the local directory to Kaggle Datasets under the specified handle. This makes the RAG database accessible to others or for use in other Kaggle notebooks.
*   `print("Database uploaded to kaggle datasets...")`: Prints a confirmation message.

## Part 3: Putting It Together

This section demonstrates how to combine the fine-tuned language model and the RAG database to answer a query. It loads the fine-tuned model, and then uses the RAG system to retrieve relevant context and generate an informed response.

In [None]:
model_name = "/kaggle/input/gemma2/transformers/gemma2_2b_gutneberg/2" #The fine-tuned model, I fine-tuned it in another notebook and pushed it to kaggle
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

This cell loads the previously fine-tuned Gemma model and its tokenizer.

*   `model_name = "/kaggle/input/gemma2/transformers/gemma2_2b_gutneberg/2"`: Specifies the path to the directory containing the saved fine-tuned Gemma model. The comment indicates that this model was fine-tuned in another notebook and uploaded to Kaggle.
*   `model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")`: Loads the fine-tuned Gemma model.
    *   `torch_dtype="auto"`: Loads the weights in the data type recorded in the saved checkpoint (for example float16 or bfloat16) instead of defaulting to float32.
    *   `device_map="auto"`: Distributes the model's layers across the available devices, filling GPU memory first and falling back to CPU if the model does not fit. This option relies on the `accelerate` library.
*   `tokenizer = AutoTokenizer.from_pretrained(model_name)`: Loads the tokenizer associated with the fine-tuned Gemma model. It's crucial to use the same tokenizer that was used during fine-tuning.

In [None]:
query = "Descrivi l'importanza della famiglia nella cultura italiana."  # Describe the importance of family in Italian culture.

In [None]:
results = process_rag_query(
    query=query,
    faiss_index=faiss_index,
    all_chunks=all_chunks,
    chunk_to_doc_map=chunk_to_doc_map,
    model=model,
    tokenizer=tokenizer,
    embedding_model=embedding_model,
)
response = results["response"]
relevant_chunks = results["relevant_chunks"]
source_documents = results["source_documents"]

In [None]:
print(f"response:{response}")

These cells define a query, run it through the RAG system, and print the generated response.

*   `query = "Descrivi l'importanza della famiglia nella cultura italiana."`: Defines the query in Italian (with an English translation in the comment).
*   `results = process_rag_query(...)`: Calls the `process_rag_query` function with the following arguments:
    *   `query`: The user's question.
    *   `faiss_index`: The FAISS index containing the embeddings of the text chunks (built above, or restored with `load_rag_system`).
    *   `all_chunks`: The list of all text chunks extracted from the Gutenberg books.
    *   `chunk_to_doc_map`: The mapping from chunk index to the original document index.
    *   `model`: The loaded fine-tuned Gemma model.
    *   `tokenizer`: The tokenizer for the Gemma model.
    *   `embedding_model`: The loaded sentence transformer model used to create the embeddings.
*   `response = results["response"]`: Extracts the generated answer from the `results` dictionary.
*   `relevant_chunks = results["relevant_chunks"]`: Extracts the list of text chunks that were deemed most relevant to the query.
*   `source_documents = results["source_documents"]`: Extracts the indices of the original documents from which the relevant chunks were retrieved.
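
Beyond printing the response, it can be useful to see which chunks grounded it. The sketch below pairs each retrieved chunk with its source document index. It assumes that `relevant_chunks` and `source_documents` are lists of equal length in the same order, which the notebook's description suggests but does not state outright. The helper `format_sources` and the mock values are hypothetical.

```python
def format_sources(results, max_chars=80):
    """Pair each retrieved chunk with its source document index for display."""
    lines = []
    pairs = zip(results["relevant_chunks"], results["source_documents"])
    for rank, (chunk, doc_idx) in enumerate(pairs, start=1):
        # Truncate long chunks and flatten newlines so each entry fits on one line.
        preview = chunk[:max_chars].replace("\n", " ")
        lines.append(f"[{rank}] doc {doc_idx}: {preview}")
    return "\n".join(lines)


# Mock dictionary with the same keys the notebook reads; the values are invented.
mock_results = {
    "response": "La famiglia è ...",
    "relevant_chunks": [
        "La famiglia è al centro\ndella vita.",
        "La nonna cucina ogni domenica.",
    ],
    "source_documents": [3, 7],
}

report = format_sources(mock_results)
print(report)
```

With the real `results` dictionary, calling `print(format_sources(results))` would list the retrieved passages next to the index of the book each one came from.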

## Future Path

*   **Improving the RAG system:**
    *   Experimenting with different embedding models for better semantic representation.
    *   Trying different FAISS index types for improved search efficiency or accuracy.
    *   Implementing more sophisticated chunking strategies.
    *   Adding mechanisms for re-ranking retrieved chunks.
*   **Enhancing the fine-tuning process:**
    *   Training for more steps or epochs.
    *   Using a larger and more diverse dataset for fine-tuning.
    *   Experimenting with different LoRA configurations or other PEFT methods.
*   **Improving the prompt engineering:**
    *   Trying different prompt formats or instructions to guide the model's response.
*   **Adding evaluation metrics:**
    *   Implementing metrics to quantitatively assess the quality of the generated responses.
*   **Deploying the system:**
    *   Packaging the model and RAG database for deployment in a web application or API.
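
As one possible starting point for the evaluation item above, the sketch below computes a token-level F1 score between a generated answer and a reference answer. This is only one of many candidate metrics, and it assumes a set of reference answers exists, which this notebook does not provide. The function name `token_f1` and the whitespace tokenization are illustrative choices.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("la famiglia è importante", "la famiglia è molto importante")
print(round(score, 3))
```

Lexical overlap rewards matching wording rather than correct meaning, so a score like this is best treated as a rough signal alongside manual inspection of the responses.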

## Conclusion

This notebook set out to build a multilingual RAG system around a fine-tuned Gemma 2B model. The key takeaways are:

*   The notebook successfully demonstrates the creation of a Retrieval-Augmented Generation (RAG) system using a fine-tuned Gemma 2B model and a vector database built from Project Gutenberg texts.
*   The fine-tuning process adapts the base model to better understand and generate text related to the training data.
*   The RAG database allows the model to access and incorporate relevant information from a large corpus of text, leading to more informed and contextually appropriate responses.
*   The system can be further improved by exploring different techniques in embedding, indexing, and fine-tuning.
*   This type of system can be valuable for various applications, such as question answering, information retrieval, and content generation.