### This notebook fine-tunes a distilbert-base-uncased model on a small subset of the IMDb dataset using Low-Rank Adaptation (LoRA) with the Hugging Face Transformers and PEFT libraries. Here is a step-by-step explanation of what each part of the code does:

In [1]:
!pip install torch transformers datasets accelerate peft

from IPython.display import clear_output
clear_output()

#### requests: Used to fetch data from a URL.
#### datasets: A Hugging Face library for loading and processing datasets.
#### transformers: A Hugging Face library for working with transformer models.
#### peft: A Hugging Face library that provides parameter-efficient fine-tuning methods such as LoRA.
#### This section sends an HTTP GET request to the Hugging Face datasets-server API to fetch the first 100 rows of the IMDb train split, returned as JSON.

In [2]:
import requests
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType

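# Fetch the first 100 rows of the IMDb train split from the datasets-server API
url = "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp%2Fimdb&config=plain_text&split=train&offset=0&length=100"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error body
data = response.json()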

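The request URL above is written out by hand, including the percent-encoded `/` in the dataset id. As a minimal sketch, the same URL can be assembled from its parts with the standard library. The helper name `build_rows_url` is hypothetical and not part of any Hugging Face library; the parameter names are taken from the URL in the cell above.

```python
from urllib.parse import urlencode

ROWS_ENDPOINT = "https://datasets-server.huggingface.co/rows"


def build_rows_url(dataset, config, split, offset=0, length=100):
    """Build a rows-endpoint URL (hypothetical helper).

    urlencode percent-escapes the '/' in the dataset id as %2F.
    """
    query = urlencode({
        "dataset": dataset,
        "config": config,
        "split": split,
        "offset": offset,
        "length": length,
    })
    return f"{ROWS_ENDPOINT}?{query}"


url = build_rows_url("stanfordnlp/imdb", "plain_text", "train")
print(url)
```

Changing `offset` is then a one-argument edit, which helps when a different slice of the split is wanted.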


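#### texts: Extracts the movie reviews.
#### labels: Extracts the sentiment labels as binary integers (1 for positive, 0 for negative). The rows endpoint returns the IMDb label as an integer, so the code accepts both the integer and the string forms.

In [3]:
# Extract text and labels from the fetched data
texts = [row['row']['text'] for row in data['rows']]
# Accept the integer encoding (1) as well as string spellings of the positive class
labels = [1 if row['row']['label'] in (1, "pos", "positive") else 0 for row in data['rows']]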
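Before training, it can help to check how the labels are actually distributed in the fetched slice. The sketch below uses a hypothetical three-row payload shaped like the structure the cell above indexes into (`data['rows'][i]['row']`); the reviews and the helper `to_binary_label` are made up for illustration.

```python
from collections import Counter

# Hypothetical payload in the shape the notebook indexes into.
sample_data = {
    "rows": [
        {"row_idx": 0, "row": {"text": "Dull and far too long.", "label": 0}},
        {"row_idx": 1, "row": {"text": "A waste of two hours.", "label": 0}},
        {"row_idx": 2, "row": {"text": "Funny, warm, and well acted.", "label": 1}},
    ]
}


def to_binary_label(raw):
    """Map an integer or string label to 0/1 and raise on anything unexpected."""
    if raw in (1, "1", "pos", "positive"):
        return 1
    if raw in (0, "0", "neg", "negative"):
        return 0
    raise ValueError(f"unexpected label: {raw!r}")


sample_labels = [to_binary_label(r["row"]["label"]) for r in sample_data["rows"]]
counts = Counter(sample_labels)
print(counts)
```

Raising on unknown values makes an encoding mismatch visible immediately instead of silently assigning every review to one class. The IMDb train split appears to be ordered by label, so the first 100 rows may all belong to a single class; a `Counter` over the real `labels` list would confirm or rule that out.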

#### Creates a Dataset object from the fetched data, which can be used for further processing.

In [4]:
# Create a Dataset object
dataset = Dataset.from_dict({'text': texts, 'label': labels})

#### AutoTokenizer: Loads the tokenizer associated with distilbert-base-uncased.
#### tokenize_function: Defines a function to tokenize the text, ensuring all inputs are of the same length (max_length) and truncating if necessary.
#### tokenized_dataset: Applies the tokenizer to the entire dataset.

In [5]:
# Tokenization
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(example):
    return tokenizer(example['text'], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

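To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the real tokenizer also keeps the special tokens in place when it truncates, and the token ids below are illustrative only.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)         # padding: fill the remainder
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

When no `max_length` is passed, the tokenizer pads to the model's maximum length, which is 512 tokens for distilbert-base-uncased. Passing a smaller value, for example `max_length=256`, is one way to reduce memory and compute if shorter inputs are acceptable.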

#### Loads the validation dataset (test split) and tokenizes it similarly to the training dataset.

In [6]:
# IMDb has no separate validation split, so the test split is used for evaluation
validation_dataset = load_dataset("stanfordnlp/imdb", split='test')
tokenized_validation_dataset = validation_dataset.map(tokenize_function, batched=True)

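The full test split has 25,000 reviews, which is large next to 100 training examples and is evaluated once per epoch. One option, sketched below under the assumption that a smaller class-balanced evaluation set is acceptable, is to pick a fixed number of indices per label. The helper `balanced_subset_indices` is hypothetical and uses only the standard library.

```python
import random


def balanced_subset_indices(labels, per_class, seed=42):
    """Pick up to per_class indices for each label value, reproducibly."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    chosen = []
    for label in sorted(by_label):
        pool = by_label[label]
        chosen.extend(rng.sample(pool, min(per_class, len(pool))))
    return sorted(chosen)


toy_labels = [0] * 10 + [1] * 10
subset = balanced_subset_indices(toy_labels, per_class=3)
print(subset)
```

With the real data, the indices could be computed from `validation_dataset["label"]` and applied with `validation_dataset.select(indices)` before tokenizing.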

#### AutoModelForSequenceClassification: Loads the distilbert-base-uncased model pre-configured for sequence classification with 2 output labels (positive/negative).

In [7]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### target_modules: Specifies the part of the model to which LoRA will be applied. Here, it targets the classifier layer.
#### LoraConfig: Configures the LoRA parameters:

####    task_type=TaskType.SEQ_CLS: Specifies that the task is sequence classification.
####    inference_mode=False: Indicates that the model is in training mode.
####    r, lora_alpha, lora_dropout: LoRA hyperparameters: the rank of the update matrices, the scaling factor, and the dropout applied to the LoRA layers.

In [8]:
# Configure LoRA
target_modules = ["classifier"]
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=target_modules
)
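To see what `r` and `lora_alpha` mean in numbers, the sketch below counts the extra parameters LoRA adds to a linear layer and computes the scaling factor. It assumes the standard LoRA formulation, where a weight `W` of shape `(d_out, d_in)` receives an update `(lora_alpha / r) * B @ A` with `A` of shape `(r, d_in)` and `B` of shape `(d_out, r)`, and it uses DistilBERT's hidden size of 768. The helper names are hypothetical.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


hidden = 768        # DistilBERT hidden size
num_labels = 2
r, lora_alpha = 8, 32

# Classifier head: Linear(768 -> 2)
classifier_full = hidden * num_labels
classifier_lora = lora_extra_params(hidden, num_labels, r)

# An attention projection: Linear(768 -> 768)
attention_full = hidden * hidden
attention_lora = lora_extra_params(hidden, hidden, r)

scaling = lora_scaling(lora_alpha, r)

print(f"classifier: {classifier_full} weights, {classifier_lora} LoRA parameters")
print(f"attention projection: {attention_full} weights, {attention_lora} LoRA parameters")
print(f"scaling factor: {scaling}")
```

On a 768-to-2 classifier, rank-8 factors hold more parameters than the weight matrix itself, so LoRA saves nothing there. The savings appear on large square layers, which is why LoRA setups for DistilBERT often target attention projections such as `q_lin` and `v_lin` instead. That would be a different configuration from the one in this notebook.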

#### get_peft_model: Applies the LoRA configuration to the pre-trained model, allowing efficient fine-tuning.

#### TrainingArguments: Configures the training process:

####    output_dir: Directory where model checkpoints will be saved.
####    evaluation_strategy="epoch": Evaluate the model at the end of each epoch.
####    learning_rate, per_device_train_batch_size, num_train_epochs: Key hyperparameters for training.
####    weight_decay: Regularization term to prevent overfitting.
####    save_total_limit: Limits the number of saved checkpoints.
####    logging_dir: Directory where logs will be saved.

In [9]:
# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Training Arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Evaluate during training
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    logging_dir='./logs',
)
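These arguments determine how many optimizer steps the run takes, which in turn explains the "No log" entries under Training Loss in the results at the end of the notebook. The sketch below works through the arithmetic, assuming no gradient accumulation and the `Trainer` default of `logging_steps=500`. The device count is an assumption, since the notebook does not state it.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Return (steps per epoch, total steps) with no gradient accumulation."""
    effective_batch = per_device_batch_size * num_devices
    per_epoch = math.ceil(num_examples / effective_batch)
    return per_epoch, per_epoch * num_epochs


DEFAULT_LOGGING_STEPS = 500  # Trainer default when logging_steps is not set

single = optimizer_steps(100, 8, 3, num_devices=1)
dual = optimizer_steps(100, 8, 3, num_devices=2)

print("one device:", single)
print("two devices:", dual)
print("training loss gets logged:", single[1] >= DEFAULT_LOGGING_STEPS)
```

Either way the run finishes well before step 500, so no training loss is ever recorded. Passing a smaller value, such as `logging_steps=5`, to `TrainingArguments` would make the training loss show up in the table.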



#### Trainer: A high-level API for training and evaluation.

####    Takes the model, training arguments, tokenized datasets, and tokenizer as inputs.
#### trainer.train(): Starts the training process using the specified arguments and datasets.
#### save_pretrained: Saves the LoRA adapter weights and the tokenizer to ./lora-finetuned-model. For a PEFT model, only the adapter is saved, not the full base model.

In [10]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_validation_dataset,  # Provide the evaluation dataset
    tokenizer=tokenizer,
)

# Train the model
trainer.train()

# Save the trained model
model.save_pretrained("./lora-finetuned-model")
tokenizer.save_pretrained("./lora-finetuned-model")

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Epoch,Training Loss,Validation Loss
1,No log,0.705666
2,No log,0.725957
3,No log,0.735333
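A useful reference point for these numbers is the loss of a classifier that has learned nothing. For binary cross-entropy, always predicting a probability of 0.5 gives a loss of ln 2 regardless of the labels. The short check below compares the reported validation losses with that baseline.

```python
import math

# Cross-entropy of a binary classifier that always predicts 0.5
chance_loss = -math.log(0.5)

# Validation losses copied from the results table above
reported = {1: 0.705666, 2: 0.725957, 3: 0.735333}

for epoch, loss in reported.items():
    print(f"epoch {epoch}: {loss:.6f} (chance level {chance_loss:.6f})")

all_above_chance = all(loss > chance_loss for loss in reported.values())
rising = reported[1] < reported[2] < reported[3]
print("all above chance level:", all_above_chance)
print("rising every epoch:", rising)
```

All three values sit above the chance level and increase with each epoch, which suggests the model is not learning a useful decision boundary from these 100 examples. Likely contributors, stated here as assumptions rather than confirmed causes, are a training slice that contains only one class and the small number of optimizer steps.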


('./lora-finetuned-model/tokenizer_config.json',
 './lora-finetuned-model/special_tokens_map.json',
 './lora-finetuned-model/vocab.txt',
 './lora-finetuned-model/added_tokens.json',
 './lora-finetuned-model/tokenizer.json')