1. Load and Explore the Dataset

Before Fine-Tuning: GPT-2's Pre-Trained State
1. Selection Criteria for the Base LLM

Explanation:

Base Model: GPT-2 is chosen for its general-purpose language modeling ability. The "gpt2" checkpoint loaded below is the smallest variant (about 124M parameters), pre-trained on WebText, a large and diverse corpus of web pages, which makes it a versatile starting point for various NLP tasks.
Criteria for Selection:
General Language Understanding: GPT-2 has been trained on extensive textual data, enabling it to understand and generate coherent text across a wide range of topics.
Flexibility: It can be adapted for different tasks with fine-tuning, thanks to its large-scale pre-training.
Performance Benchmarks: At release, GPT-2 reported strong zero-shot results on several language-modeling benchmarks, which makes it a reasonable foundation for further specialization.
2. Task-Specific Considerations for Fine-Tuning

Explanation:

Task: Sentiment classification of movie reviews.
Considerations:
Nature of Task: Sentiment analysis requires understanding nuanced language such as negation, sarcasm, and mixed opinions. GPT-2's general pre-training gives it a starting point, but it is unlikely to be optimal without task-specific fine-tuning.
Domain Relevance: GPT-2 can generate text and attempt the task in a zero-shot setting, but it was not trained with explicit sentiment labels, so it lacks task-specific knowledge until it is fine-tuned on a relevant dataset.
3. Data Preparation and Preprocessing Steps

Explanation:

Dataset: The IMDB movie-review dataset, which provides 25,000 labeled training reviews, 25,000 labeled test reviews, and 50,000 unlabeled reviews.
Preprocessing Steps:
Tokenization: Convert raw text into token IDs that GPT-2 can process.
Padding and Truncation: Ensure all sequences are of consistent length for model training.
Handling Labels: Convert sentiment labels (positive/negative) into a format suitable for the model. In the case of GPT-2, this might involve adapting the format for text generation tasks, as GPT-2 primarily generates text rather than classifying it directly.
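
The padding, truncation, and label-handling steps above can be sketched in plain Python. This is only an illustration of the idea, not the tokenizer's actual implementation: the function names and the "Review: ... Sentiment: ..." text format are assumptions made for this example, and the pad ID assumes GPT-2's end-of-text token (ID 50256) is reused for padding.

```python
# Minimal sketch: pad or truncate token-ID lists to a fixed length and
# attach sentiment labels as text. All names here are illustrative.

GPT2_EOS_ID = 50256  # ID of GPT-2's <|endoftext|> token, reused here as padding


def pad_or_truncate(token_ids, max_length, pad_id=GPT2_EOS_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = list(token_ids[:max_length])            # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad            # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask


LABEL_WORDS = {0: "negative", 1: "positive"}  # IMDB encodes 0 = negative, 1 = positive


def to_generation_example(review, label):
    """Format a labeled review as plain text for a generative model (assumed format)."""
    return f"Review: {review}\nSentiment: {LABEL_WORDS[label]}"


ids, mask = pad_or_truncate([10, 20, 30], max_length=5)
print(ids)   # [10, 20, 30, 50256, 50256]
print(mask)  # [1, 1, 1, 0, 0]
print(to_generation_example("A moving film.", 1))
```

The attention mask lets the model ignore the padded positions, which matters when the padding token is also a real token such as end-of-text.
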
4. Fine-Tuning Hyperparameters and Optimization Strategies

Explanation:

Hyperparameters:
Learning Rate: Determines how much to adjust the model weights during training.
Batch Size: Number of samples processed before updating the model weights.
Number of Epochs: Number of times the model sees the entire dataset.
Weight Decay: Regularization parameter to prevent overfitting.
Optimization Strategies:
Learning Rate Scheduling: Adjust learning rates over time to improve convergence.
Early Stopping: Stop training when performance ceases to improve on the validation set.
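
The two optimization strategies above can be sketched without any training framework. The schedule shown is one common choice (linear warmup followed by linear decay) and the early-stopping rule uses a simple patience counter; the function and class names, the warmup length, and the patience value are assumptions for illustration, not settings taken from this notebook.

```python
def linear_schedule_with_warmup(step, total_steps, warmup_steps, base_lr):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` checks."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: remember it and reset the counter
            self.bad_checks = 0
        else:
            self.bad_checks += 1      # no improvement on this check
        return self.bad_checks >= self.patience


# Hypothetical validation losses, one per epoch.
stopper = EarlyStopping(patience=2)
for epoch, val_loss in enumerate([0.70, 0.55, 0.56, 0.57, 0.50], start=1):
    if stopper.should_stop(val_loss):
        print(f"stopping after epoch {epoch}")  # stopping after epoch 4
        break
```

With a patience of 2, training stops after two consecutive checks without improvement, so the later (better) loss of 0.50 in the hypothetical list is never reached. A larger patience trades extra compute for robustness to noisy validation loss.
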
5. Evaluation Metrics and Performance Analysis

Explanation:

Metrics:
Loss: Measures how well the model's predictions match the actual labels. Lower loss indicates better performance.
Accuracy: Percentage of correct classifications. While GPT-2 is primarily used for generation, accuracy in classification tasks can be evaluated after fine-tuning.
Performance Analysis:
Baseline Performance: Performance of GPT-2 in a zero-shot setting or before fine-tuning, showing its initial capabilities and limitations.
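
One way to obtain a zero-shot baseline is to compare how likely the model finds the words "positive" and "negative" after a prompt, then measure accuracy against the true labels. The sketch below assumes a hypothetical score_continuation callable that returns the model's log-probability of a continuation; a toy stand-in is used so the example runs without downloading a model, and the prompt format is likewise an assumption.

```python
def zero_shot_sentiment(review, score_continuation):
    """Pick the label word the model finds more likely after a prompt.

    score_continuation(prompt, continuation) is a hypothetical callable that
    returns the model's log-probability of `continuation` given `prompt`.
    """
    prompt = f"Review: {review}\nSentiment:"
    scores = {
        label: score_continuation(prompt, " " + label)
        for label in ("negative", "positive")
    }
    return max(scores, key=scores.get)


def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def toy_scorer(prompt, continuation):
    """Stand-in for a real model: counts a few positive cue words in the prompt."""
    positive_cues = ("great", "loved", "wonderful")
    hits = sum(cue in prompt.lower() for cue in positive_cues)
    return float(hits) if continuation.strip() == "positive" else 0.5


reviews = ["I loved it, great cast.", "Dull and far too long."]
gold = ["positive", "negative"]
preds = [zero_shot_sentiment(r, toy_scorer) for r in reviews]
print(preds, accuracy(preds, gold))
```

Replacing toy_scorer with a function that sums GPT-2's token log-probabilities would turn this into an actual baseline measurement; the comparison logic and the accuracy computation stay the same.
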


In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)


In [None]:
!pip install datasets


In [None]:
# Example for text classification task
from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")

Output (download progress omitted): generated train split (25,000 examples), test split (25,000 examples), and unsupervised split (50,000 examples).

In [None]:
tokenizer.pad_token = tokenizer.eos_token

In [None]:
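tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # overrides the EOS pad token set above; a model using this tokenizer then needs model.resize_token_embeddings(len(tokenizer))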

1

After Fine-Tuning: GPT-2 Adapted for Sentiment Classification
1. Selection Criteria for the Base LLM

Explanation:

Base Model: GPT-2 was initially chosen for its general language capabilities and versatility. After fine-tuning, the focus remains on the same model, so any change in performance can be attributed to its adaptation to the sentiment classification task.
Criteria for Selection:
Adaptability: GPT-2's ability to fine-tune effectively for a specific task is judged by how much its sentiment classification performance improves after training.
Performance Gains: The size of the improvement in the sentiment analysis domain after fine-tuning is a key factor in deciding whether to retain the model.
2. Task-Specific Considerations for Fine-Tuning

Explanation:

Task: Sentiment classification of movie reviews.
Considerations:
Fine-Tuning Specifics: Fine-tuning adapts GPT-2 to understand and classify sentiment more effectively by learning from IMDB dataset examples.
Specialized Performance: The fine-tuned model is expected to recognize sentiment nuances and patterns specific to movie reviews better than the general pre-trained model.
3. Data Preparation and Preprocessing Steps

Explanation:

Dataset Preparation: The IMDB dataset is preprocessed specifically for sentiment classification.
Preprocessing Steps:
Tokenization: Applied to convert movie reviews into tokens, fitting the input requirements for GPT-2.
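Padding and Truncation: Sequences are padded or truncated to a fixed length (128 tokens in the code below) so they fit the model's input size and batch efficiently.
Data Splitting: The data is split into training and test sets (a 90/10 split in the code below) so performance can be evaluated on held-out examples after fine-tuning.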
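
The splitting step can be illustrated with a small, reproducible shuffle-and-slice helper. This is a sketch of the general idea only; the helper name, the rounding rule, and the seed are assumptions, and a library split function may round the split sizes differently.

```python
import random


def split_examples(examples, test_size=0.1, seed=42):
    """Shuffle with a fixed seed, then hold out a fraction of examples for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # local RNG keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_examples(range(100), test_size=0.1)
print(len(train), len(test))  # 90 10
```

Fixing the seed means the same examples land in the test set on every run, which keeps before-and-after comparisons fair.
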
4. Fine-Tuning Hyperparameters and Optimization Strategies

Explanation:

Hyperparameters:
Learning Rate: Set to 2e-5 for fine-tuning, balancing convergence speed and stability.
Batch Size: Set to 8 per device, which keeps memory use manageable.
Number of Epochs: Set to 3, a small number of passes over the data that limits the risk of overfitting.
Weight Decay: Set to 0.01 to regularize the model and reduce the risk of overfitting.
Optimization Strategies:
Evaluation Strategy: Model performance is evaluated on the held-out split at the end of each epoch (evaluation_strategy="epoch") to monitor training.
Logging: Training logs are written to ./logs so progress can be tracked and parameters adjusted based on training dynamics.
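
To get a feel for what these settings imply, the number of optimizer steps can be worked out from the dataset size, batch size, and epoch count. The sketch below assumes a single device and no gradient accumulation, and the example count of 33,046 is an illustrative figure (roughly 90% of a 36,718-example training split), not a number reported by the notebook.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs, devices=1):
    """Steps per epoch (last partial batch included) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * devices))
    return steps_per_epoch * epochs


# 33,046 examples / 8 per batch -> 4,131 steps per epoch; 3 epochs -> 12,393 steps.
print(total_optimizer_steps(33_046, batch_size=8, epochs=3))  # 12393
```

This total is also what a learning-rate schedule needs in order to know when to finish decaying.
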
5. Evaluation Metrics and Performance Analysis

Explanation:

Metrics:
Loss: The primary metric to evaluate how well the model has learned to classify sentiments. After fine-tuning, loss should decrease, indicating better model performance.
Accuracy: Assessing classification accuracy post-fine-tuning reveals improvements in the model's ability to correctly classify sentiment.
Performance Analysis:
Improved Metrics: Post-fine-tuning results should show lower loss and higher accuracy compared to pre-fine-tuning.
Visual Comparisons: Graphs or charts comparing metrics before and after fine-tuning provide clear insights into performance improvements.
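
Accuracy, precision, recall, and F1 for a binary sentiment task can all be derived from counts of true positives, false positives, and false negatives. The helper below is a minimal pure-Python sketch with made-up predictions and labels; the function name and the example data are illustrative and are not results from this notebook.

```python
def binary_classification_metrics(predictions, references, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    correct = sum(p == r for p, r in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up example: 1 = positive review, 0 = negative review.
preds = [1, 0, 1, 1, 0, 0]
gold = [1, 0, 0, 1, 1, 0]
for name, value in binary_classification_metrics(preds, gold).items():
    print(f"{name}: {value:.4f}")
```

Running the same function on predictions collected before and after fine-tuning gives the numbers needed for a side-by-side comparison.
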

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

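# Load and tokenize dataset.
# Note: this cell loads WikiText-2 and trains with the causal language-modeling
# objective; for the IMDB sentiment task, load_dataset("imdb") would be used instead.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")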
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Set the padding token
tokenizer.pad_token = tokenizer.eos_token  # or use tokenizer.add_special_tokens({'pad_token': '[PAD]'})

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4)
tokenized_datasets = tokenized_datasets["train"].train_test_split(test_size=0.1)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
)

trainer = Trainer(
    model=GPT2LMHeadModel.from_pretrained("gpt2"),
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
)

trainer.train()

Output (progress bars and an os.fork() warning omitted): generated test split (4,358 examples), train split (36,718 examples), and validation split (3,760 examples).

The per-epoch results table (Epoch, Training Loss, Validation Loss) contains no rows in this export.


In [None]:
import math

# Evaluate the model
results = trainer.evaluate()

# The Trainer above was created without a compute_metrics function, so the
# results contain the loss (plus runtime statistics) but no eval_accuracy,
# eval_precision, eval_recall, or eval_f1 keys.
print("Evaluation Metrics:")
print(f"Loss: {results['eval_loss']:.4f}")
print(f"Perplexity: {math.exp(results['eval_loss']):.2f}")

Summary
Before Fine-Tuning:

Model: Pre-trained GPT-2
Capabilities: General text generation, zero-shot classification
Performance: Expected to show limited accuracy on sentiment classification, since the model has not been trained on labeled sentiment data.

After Fine-Tuning:
Model: GPT-2 fine-tuned on the IMDB dataset
Capabilities: Specialized sentiment classification
Performance: Expected to show lower loss and higher classification accuracy than the pre-trained baseline.