<a href="https://colab.research.google.com/github/JapiKredi/RAG_HF_Transfomers_pretrained_facebook_RAG/blob/main/RAG_HF_Transfomers_pretrained_facebook_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning RAG Models for Custom Content Generation



## Introduction to RAG Fine-Tuning

Fine-tuning a RAG model involves adjusting the model’s parameters on a specific dataset to enhance its performance for a particular task — in our case, generating content on climate change. This process improves the model’s ability to retrieve relevant information and generate coherent, contextually appropriate content.


# Setting Up the Environment

Before we begin, ensure you have Python and the necessary libraries installed.
We’ll use the Hugging Face transformers and datasets libraries, which provide access to pre-trained RAG models and a convenient API for fine-tuning and data handling.

In [1]:
pip install transformers datasets torch


In [None]:
!pip install accelerate -U



# Step 1: Preparing the Dataset

For fine-tuning, you’ll need a dataset of documents related to climate change. This dataset should be structured with fields for the “document” (the content to retrieve) and the “query” (the prompt for generation), along with the “answer” (the expected output).
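As a rough illustration of that structure, the sketch below writes a small CSV with the three columns and reads it back to confirm none are missing. It uses only the standard library; the file name `climate_qa_example.csv` and the second row are hypothetical, and the first row reuses the example data point shown in the loading cell.

```python
import csv
import os
import tempfile

FIELDS = ["query", "document", "answer"]

# Example rows; the second one is made up purely for illustration
rows = [
    {
        "query": "What is climate change?",
        "document": "Climate change refers to...",
        "answer": "Climate change is...",
    },
    {
        "query": "Why do polar ice caps shrink?",
        "document": "Polar ice caps shrink as...",
        "answer": "They shrink because...",
    },
]

# Hypothetical file name, written to the system temp directory
csv_path = os.path.join(tempfile.gettempdir(), "climate_qa_example.csv")

with open(csv_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back and check that every expected column is present
with open(csv_path, newline="", encoding="utf-8") as handle:
    loaded = list(csv.DictReader(handle))

missing = [name for name in FIELDS if name not in loaded[0]]
print("rows:", len(loaded), "missing columns:", missing)
```

A file shaped like this could then be passed to `load_dataset("csv", data_files=csv_path)`.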


In [9]:
from datasets import load_dataset
# Assuming you have a dataset in CSV format with "query", "document", and "answer" columns
# (a raw table without these columns would need to be converted first)
climate_change_dataset = load_dataset("csv", data_files='/content/GlobalLandTemperaturesByCity.csv')
# Example structure of each data point: {'query': 'What is climate change?', 'document': 'Climate change refers to...', 'answer': 'Climate change is...'}

Generating train split: 0 examples [00:00, ? examples/s]

# Step 2: Loading the RAG Model and Tokenizer

We’ll use the Hugging Face transformers library to load a pre-trained RAG model and its corresponding tokenizer. The RAG model pairs a retriever (a DPR question encoder plus a document index) with a BART sequence-to-sequence generator that conditions on the retrieved passages.
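To make the retrieve-then-generate idea concrete, here is a toy sketch in plain Python. It is not how the Hugging Face model works internally: the bag-of-words `embed`, the cosine `score`, the three-sentence corpus, and the template-based `generate` are all stand-ins invented for illustration, where a real RAG model would use dense encoders and a learned generator.

```python
import math
from collections import Counter

# Toy corpus standing in for the retriever's document index (hypothetical data)
DOCUMENTS = [
    "Climate change refers to long-term shifts in temperatures and weather patterns.",
    "Polar ice caps shrink as global temperatures rise.",
    "Solar panels convert sunlight into electricity.",
]


def embed(text):
    """Bag-of-words term counts; a crude stand-in for a dense encoder."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def score(query_vec, doc_vec):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * doc_vec[term] for term, count in query_vec.items())
    query_norm = math.sqrt(sum(c * c for c in query_vec.values()))
    doc_norm = math.sqrt(sum(c * c for c in doc_vec.values()))
    norm = query_norm * doc_norm
    return dot / norm if norm else 0.0


def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(DOCUMENTS, key=lambda doc: score(query_vec, embed(doc)), reverse=True)
    return ranked[:k]


def generate(query, passages):
    """Placeholder generator: a real RAG model would decode new text here."""
    return f"Q: {query}\nContext: {' '.join(passages)}"


query = "What happens to polar ice caps?"
top_passages = retrieve(query, k=1)
print(generate(query, top_passages))
```

The point of the sketch is the data flow: the query selects passages first, and only then does the generator see both the query and the selected passages.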

In [10]:
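from transformers import RagRetriever, RagTokenizer, RagTokenForGeneration
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
# The model needs a retriever; the dummy index keeps the download small (requires faiss-cpu)
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)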

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.

# Step 3: Fine-Tuning the RAG Model

Fine-tuning adjusts the model’s weights based on your specific dataset. The goal is to enhance the model’s ability to generate accurate and relevant responses to queries about climate change.

In [12]:
from transformers import Trainer, TrainingArguments

In [17]:
# Quote the requirement so the shell does not treat ">" as output redirection
!pip install "accelerate>=0.21.0"

In [21]:
!pip install "transformers[torch]"
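Before building the training arguments, it can help to confirm which versions are actually installed in the running environment. The sketch below uses only the standard library; the helper names `installed_version` and `as_tuple` are hypothetical, and the simple version parsing ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def as_tuple(version_string):
    """Convert '0.30.1' to (0, 30, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


MINIMUM_ACCELERATE = "0.21.0"  # minimum version named in the install cell above

for package in ("transformers", "datasets", "torch", "accelerate"):
    found = installed_version(package)
    print(f"{package}: {found or 'not installed'}")

accelerate_version = installed_version("accelerate")
accelerate_ok = (
    accelerate_version is not None
    and as_tuple(accelerate_version) >= as_tuple(MINIMUM_ACCELERATE)
)
print("accelerate meets the Trainer requirement:", accelerate_ok)
```

If a package was upgraded after it had already been imported, the running session may still hold the old version until the runtime is restarted.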



In [22]:
training_args = TrainingArguments(
    output_dir="/content/rag_finetuned/",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs/",
    logging_steps=10,
)

Note: if this cell raises "ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.21.0`", install `accelerate` with one of the cells above and restart the runtime so the updated package is picked up, then re-run the cell.
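To get a feel for what these settings imply, the arithmetic below estimates the number of optimizer steps. It is a sketch under stated assumptions: a single device, no gradient accumulation, and a hypothetical training set of 10,000 examples. The batch size, epoch count, warmup steps, and logging interval are the values from the cell above.

```python
import math

num_train_examples = 10_000       # hypothetical dataset size
per_device_train_batch_size = 4   # from TrainingArguments
num_train_epochs = 3              # from TrainingArguments
warmup_steps = 500                # from TrainingArguments
logging_steps = 10                # from TrainingArguments

# One optimizer step per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Share of training spent ramping the learning rate up
warmup_fraction = warmup_steps / total_steps

# Number of times the training loss is logged
log_events = total_steps // logging_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup fraction: {warmup_fraction:.3f}")
print(f"log events:      {log_events}")
```

On a small dataset, 500 warmup steps can cover a large share of training, so it is worth rerunning this estimate with the real dataset size.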

In [14]:
# Define a function to process the data for training
def preprocess_function(examples):
    # With batched=True, `examples` is a dict mapping each column name to a list of values
    sep = tokenizer.question_encoder.sep_token
    inputs = [q + sep + d for q, d in zip(examples["query"], examples["document"])]
    # Encode inputs with the question-encoder tokenizer; pad so examples can be stacked
    model_inputs = tokenizer.question_encoder(inputs, max_length=512, truncation=True, padding="max_length")
    # Prepare labels with the generator tokenizer
    labels = tokenizer.generator(examples["answer"], max_length=512, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [15]:
# Process the dataset
tokenized_datasets = climate_change_dataset.map(preprocess_function, batched=True)
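With `batched=True`, `map` hands the function a dict of columns rather than a list of rows. Iterating over that dict yields the column names, so indexing one of those strings with `"query"` raises `TypeError: string indices must be integers`. The plain-Python sketch below mimics that shape; the batch contents are hypothetical and `" [SEP] "` is only a placeholder separator.

```python
# A batch as `map(..., batched=True)` would pass it: column name -> list of values
batch = {
    "query": ["What is climate change?", "Why do ice caps melt?"],
    "document": ["Climate change refers to...", "Warmer air and oceans..."],
    "answer": ["Climate change is...", "Rising temperatures..."],
}

# Iterating over the dict yields column names, not rows
keys_seen = [item for item in batch]
print(keys_seen)

# Zipping the columns recovers one tuple per example
rows = [
    {"query": q, "document": d, "answer": a}
    for q, d, a in zip(batch["query"], batch["document"], batch["answer"])
]
print(len(rows), "rows")

# Building the combined model inputs row by row
combined = [q + " [SEP] " + d for q, d in zip(batch["query"], batch["document"])]
print(combined[0])
```

The same `zip` pattern applies inside any preprocessing function that is mapped in batched mode.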

In [16]:
# Initialize the Trainer
# A single CSV loads as one "train" split, so hold out 10% for evaluation
split_datasets = tokenized_datasets["train"].train_test_split(test_size=0.1, seed=42)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_datasets["train"],
    eval_dataset=split_datasets["test"],
)

In [None]:
# Start fine-tuning
trainer.train()
# Save the final weights to output_dir so they can be reloaded in Step 4
trainer.save_model()

# Step 4: Generating Content with the Fine-Tuned Model

After fine-tuning, the RAG model should be better equipped to handle queries related to climate change: it retrieves relevant passages from its document index and conditions the generated text on them.

In [None]:
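from transformers import RagRetriever, RagTokenizer, RagTokenForGeneration
# Load the tokenizer, a retriever, and the fine-tuned model
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
# The dummy index is a small sample; swap in your own index to retrieve climate documents
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained("./rag_finetuned/", retriever=retriever)
# Generating content
query = "Explain the impact of global warming on polar ice caps."
input_ids = tokenizer(query, return_tensors="pt").input_ids
# Generate the answer
generated_ids = model.generate(input_ids=input_ids)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))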

# Conclusion

Fine-tuning a RAG model on a domain-specific dataset, such as one about climate change, can improve both what the model retrieves and what it generates for that domain, making it a useful tool for producing factual, relevant content. This approach offers a practical way to apply retrieval-augmented generation to disseminating knowledge and raising awareness about critical global issues.