### Installing Required Libraries

This code block installs the necessary Python libraries to work with HuggingFace Transformers, datasets, and tools for fine-tuning large language models. 

### Libraries Overview:
- **Transformers**: Provides APIs and tools for loading pre-trained models and fine-tuning them on custom datasets.
- **Datasets**: Simplifies dataset handling for machine learning tasks.
- **Accelerate**: Optimizes and speeds up distributed model training.
- **PEFT (Parameter-Efficient Fine-Tuning)**: Allows fine-tuning large models efficiently.
- **TRL (Transformer Reinforcement Learning)**: Tools for RL with Transformers.
- **Bitsandbytes**: Enables 8-bit model optimization for faster training and inference.
- **Weights & Biases (wandb)**: A tool for experiment tracking and visualization.


In [2]:
%%capture
%pip install -U transformers 
%pip install -U datasets 
%pip install -U accelerate 
%pip install -U peft 
%pip install -U trl 
%pip install -U bitsandbytes 
%pip install -U wandb

### Setting Up and Preparing for Model Fine-Tuning

#### Importing Required Libraries
First we import all necessary libraries for model loading, fine-tuning, and training. This includes HuggingFace Transformers, PEFT (Parameter-Efficient Fine-Tuning), and other utility libraries like `wandb`, `torch`, and `datasets`.
#### Authenticating with HuggingFace Hub
Then we log into HuggingFace Hub using a token retrieved from environment variables for downloading our base model (`NousResearch/Llama-2-7b-chat-hf`). Ensure the `HUGGINGFACE_TOKEN` environment variable is set with your personal HuggingFace access token.
#### Configuring and Loading the Model
The next block sets up QLoRA (Quantized Low-Rank Adaptation) for efficient fine-tuning using `BitsAndBytesConfig`. It loads the base model with 4-bit quantization and the tokenizer for text generation tasks.
#### Defining LoRA Fine-Tuning Parameters
Finally, we define the parameters for PEFT using the LoraConfig class. These parameters include `lora_alpha`, `lora_dropout`, `r`, and the `task type` (CAUSAL_LM). These settings optimize the fine-tuning process for causal language modeling tasks by efficiently adapting the pre-trained model for the new dataset.

In [3]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer, setup_chat_format

In [4]:
import os
from huggingface_hub import login

# Get the token from environment variables
hf_token = os.getenv("HUGGINGFACE_TOKEN")

if hf_token:
    login(token=hf_token)
else:
    print("HuggingFace token not found. Please set the HUGGINGFACE_TOKEN environment variable.")

In [None]:
# Model from Hugging Face hub
base_model = "NousResearch/Llama-2-7b-chat-hf"

In [None]:
# CODE TO CHECK FOR GPU FOR RUNNIGNG THE CODE

# Set torch dtype and attention implementation
if torch.cuda.get_device_capability()[0] >= 8:
    !pip install -qqq flash-attn
    torch_dtype = torch.bfloat16
    attn_implementation = "flash_attention_2"
else:
    torch_dtype = torch.float16
    attn_implementation = "eager"


print("GPU Available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))

In [None]:
from transformers import BitsAndBytesConfig

# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    llm_int8_enable_fp32_cpu_offload=True,
    bnb_4bit_use_double_quant=True,
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation 
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [12]:
peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

### Preprocessing Data for Fine-Tuning

This section prepares the dataset for fine-tuning a chat-based AI model:

1. **Data Cleaning**:
   - Loaded the dataset (`IFND.csv`) and standardized the `Label` column (`TRUE` → 1, `FALSE` → 0).
   
2. **Balancing**:
   - Sampled equal numbers of `TRUE` and `FALSE` labels for training (1000 each) and validation (200 each).

3. **Formatting**:
   - Converted the data into a Hugging Face Dataset.
   - Defined a chat-based template with system instructions and user queries for classification.

4. **Final Output**:
   - Processed datasets are ready for tokenization and fine-tuning.


In [13]:
import pandas as pd

# Replace 'ISO-8859-1' with the detected encoding
df = pd.read_csv('IFND.csv', encoding='ISO-8859-1')
df = df.iloc[:, :3]

# Display the first few rows
df.head()

  df = pd.read_csv('/kaggle/input/infd-dataset-final/IFND.csv', encoding='ISO-8859-1')


Unnamed: 0,Statement,Category,Label
0,"WHO praises India's Aarogya Setu app, says it ...",COVID-19,True
1,"In Delhi, Deputy US Secretary of State Stephen...",VIOLENCE,True
2,LAC tensions: China's strategy behind delibera...,TERROR,True
3,India has signed 250 documents on Space cooper...,COVID-19,True
4,Tamil Nadu chief minister's mother passes away...,ELECTION,True


In [14]:
df['Label'] = df['Label'].astype(str)
# Replace "true" with "TRUE" in the 'label' column
df['Label'] = df['Label'].replace({'True': 'TRUE', 'False': 'FALSE','Fake':'FALSE'})

In [15]:
df['Statement'] = df['Statement'].astype(str)
df['Label'] = df['Label'].map({'TRUE': 1, 'FALSE': 0})

In [16]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

# Sample true and false labels
true_rows = df[df['Label'] == 1].sample(1000)
false_rows = df[df['Label'] == 0].sample(1000)

# Combine the two datasets
subset_df = pd.concat([true_rows, false_rows])

# Convert DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(subset_df, split="train")

# Define instruction
instruction = """You are an AI assistant trained to classify news articles based on their content.
Your task is to analyze the content of a given news article and determine if the article is factually TRUE or FALSE.
Provide accurate and well-reasoned classifications."""

# Define the formatting function
def format_chat_template(row):
    row_json = [
        {"role": "system", "content": instruction},
        {"role": "user", "content": f"Please classify the following news article: {row['Statement']}"},
        {"role": "assistant", "content": f"The news article is classified as: {row['Label']}"}
    ]
    # Flatten the row into a format suitable for tokenization
    formatted_input = "\n".join([entry["content"] for entry in row_json])
    # Tokenize the input manually
    encoding = tokenizer(formatted_input, truncation=True, padding='max_length', max_length=512)
    return encoding

# Apply the formatting function to the dataset
formatted_dataset = dataset.map(
    format_chat_template,
    num_proc=4,  # Use multiple processors for speed
)


  self.pid = os.fork()


Map (num_proc=4):   0%|          | 0/2000 [00:00<?, ? examples/s]

  self.pid = os.fork()


In [19]:
formatted_dataset

Dataset({
    features: ['Statement', 'Category', 'Label', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 2000
})

In [20]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

# Example: Load dataset into DataFrame

# Sample true and false labels
true_rows = df[df['Label'] == 1].sample(200)
false_rows = df[df['Label'] == 0].sample(200)

# Combine the two datasets
subset_df = pd.concat([true_rows, false_rows])

# Convert DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(subset_df, split="train")

# Load tokenizer (ensure you're using a model that supports chat functionality)

# Define instruction
instruction = """You are an AI assistant trained to classify news articles based on their content.
Your task is to analyze the content of a given news article and determine if the article is factually TRUE or FALSE.
Provide accurate and well-reasoned classifications."""

# Define the formatting function
def format_chat_template(row):
    row_json = [
        {"role": "system", "content": instruction},
        {"role": "user", "content": f"Please classify the following news article: {row['Statement']}"},
        {"role": "assistant", "content": f"The news article is classified as: {row['Label']}"}
    ]
    # Flatten the row into a format suitable for tokenization
    formatted_input = "\n".join([entry["content"] for entry in row_json])
    # Tokenize the input manually
    encoding = tokenizer(formatted_input, truncation=True, padding='max_length', max_length=512)
    return encoding

# Apply the formatting function to the dataset
formatted_val_dataset = dataset.map(
    format_chat_template,
    num_proc=4,  # Use multiple processors for speed
)

# Save formatted dataset to a JSONL file for fine-tuning
formatted_val_dataset

Map (num_proc=4):   0%|          | 0/400 [00:00<?, ? examples/s]

Dataset({
    features: ['Statement', 'Category', 'Label', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 400
})

### Fine-Tuning and Saving the Model

1. **Gradient Checkpointing**: Enabled gradient checkpointing for efficient memory usage during training.
2. **Training Configuration**: Defined training parameters using `TrainingArguments`, including batch size, learning rate, gradient accumulation, and optimizer.
3. **Training**: Used `SFTTrainer` to fine-tune the model with the processed dataset and PEFT configuration.
4. **Saving Results**: Saved the fine-tuned model and tokenizer to the `./results` directory.
5. **Download Link**: Provided a direct link to download the results as a ZIP file.


In [4]:
import torch.utils.checkpoint as checkpoint

def checkpointed_model(*inputs):
    return checkpoint.checkpoint(model, *inputs)

In [37]:
model.to("cuda")
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=500,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
trainer = SFTTrainer(
    model=model,
    train_dataset=formatted_dataset,
    peft_config=peft_params,
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)
trainer.train()



Step,Training Loss
25,1.9478
50,1.245
75,1.0075
100,1.0771
125,1.2709
150,1.0472
175,1.0736
200,1.0524
225,1.1613
250,0.9551


TrainOutput(global_step=2000, training_loss=1.0461262998580934, metrics={'train_runtime': 5261.7761, 'train_samples_per_second': 0.38, 'train_steps_per_second': 0.38, 'total_flos': 4.0801677606912e+16, 'train_loss': 1.0461262998580934, 'epoch': 1.0})

In [38]:
trainer.save_model("./results")
tokenizer.save_pretrained("./results")

('./results/tokenizer_config.json',
 './results/special_tokens_map.json',
 './results/tokenizer.model',
 './results/added_tokens.json',
 './results/tokenizer.json')

In [40]:
from IPython.display import FileLink
FileLink(r'results.zip')

### Testing the fine-tuned model is a random headline

In [30]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from transformers import BitsAndBytesConfig

# Paths
adapter_dir = "/kaggle/input/llama-fine-tuned1/pytorch/default/1"
base_model_dir = "NousResearch/Llama-2-7b-chat-hf"

# Quantization Config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_enable_fp32_cpu_offload=True,
    bnb_4bit_use_double_quant=True,
)

# Load Model
tokenizer = AutoTokenizer.from_pretrained(base_model_dir)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_dir,
    quantization_config=bnb_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_dir)

# Move Model to GPU/CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Input Prompt
input_text = (
    "You are a news analyzer. Given the headline, determine if it's true or false, and provide a detailed explanation. "
    "Also, classify the news into a theme like politics, sports, education, etc.\n"
    "Headline: 'was ramnath kovind ever been the president of india'\n"
)

inputs = tokenizer(
    input_text,
    return_tensors="pt",
    truncation=True,
    padding="max_length",
)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate Response
# Generate Response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=250,   # Limits generated tokens only
        num_beams=5,          # Enhance quality with beam search
        temperature=0.7,      # Balance randomness
        top_k=40,             # Limit to top-k tokens
        top_p=0.9,            # Nucleus sampling
        repetition_penalty=1.2  # Reduce repetitive outputs
    )

# Decode and Post-process
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
response = response.strip()

# Print Response
print("Generated Response:")
print(response)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Generated Response:
You are a news analyzer. Given the headline, determine if it's true or false, and provide a detailed explanation. Also, classify the news into a theme like politics, sports, education, etc.
Headline: 'was ramnath kovind ever been the president of india'
The news is false.Ram Nath Kovind is the 14th President of India, serving since 2017. He was born on October 1, 1945, in Paralakhemundi, Odisha. He is a member of the Bharatiya Janata Party (BJP) and served as the Governor of Bihar from 2015 to 2017 before being elected as the President of India in 2017.
The news is false.Ram Nath Kovind is the 14th President of India, serving since 2017. He was born on October 1, 1945, in Paralakhemundi, Odisha. He is a member of the Bharatiya Janata Party (BJP) and served as the Governor of Bihar from 2015 to 2017 before being elected as the President of India in 2017.
The news is false.Ram Nath Kovind is the 14th President of India, serving since 2017.
