In [1]:
!pip install tensorflow-datasets transformers tensorflow



Question 1: Sentiment Analysis with Transformers

In [2]:
# Imports
import tensorflow_datasets as tfds
import tensorflow as tf
import numpy as np
import torch
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, AutoModelForCausalLM
import os
import time # To measure training time for comparison

In [3]:
print("--- 1. Data Loading ---")
tfds.disable_progress_bar()

try:
    dataset, info = tfds.load(
        "imdb_reviews",
        with_info=True,
        as_supervised=True,
        download=True
    )
    train_data, test_data = dataset["train"], dataset["test"]
    print(f"IMDB Dataset Loaded. Train examples: {info.splits['train'].num_examples}, Test examples: {info.splits['test'].num_examples}")
except Exception as e:
    print(f"ERROR: Failed to load IMDB dataset. Details: {e}")
    raise  # Re-raise so later cells do not run with train_data/test_data undefined

--- 1. Data Loading ---




Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
IMDB Dataset Loaded. Train examples: 25000, Test examples: 25000


In [4]:
# --- TOKENIZATION FUNCTION ---
MAX_LENGTH = 128
BATCH_SIZE = 32

# Generic tokenization wrapper (will be reused for both models)
def create_processed_datasets(model_name, train_data, test_data):
    """Tokenizes and prepares TF Datasets for a given model."""
    print(f"\nProcessing data for {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_function(text, label):
        text_str = text.numpy().decode('utf-8')
        encoded = tokenizer(
            text_str,
            max_length=MAX_LENGTH,
            truncation=True,
            padding='max_length',
            return_tensors='tf'
        )
        return encoded['input_ids'], encoded['attention_mask'], np.int64(label.numpy())

    def tf_tokenize(text, label):
        input_ids, attention_mask, label = tf.py_function(
            tokenize_function,
            [text, label],
            (tf.int32, tf.int32, tf.int64)
        )

        # return_tensors='tf' adds a leading batch dimension of 1;
        # squeeze it so each example has shape (MAX_LENGTH,)
        input_ids = tf.squeeze(input_ids, axis=0)
        attention_mask = tf.squeeze(attention_mask, axis=0)

        # Set the shapes for the TF graph
        input_ids.set_shape([MAX_LENGTH])
        attention_mask.set_shape([MAX_LENGTH])
        label.set_shape([])

        return {'input_ids': input_ids, 'attention_mask': attention_mask}, label

    # Apply the mapping, shuffle, batch, and prefetch the datasets
    ds_train_processed = train_data.map(tf_tokenize, num_parallel_calls=tf.data.AUTOTUNE) \
                                   .shuffle(10000) \
                                   .batch(BATCH_SIZE) \
                                   .prefetch(tf.data.AUTOTUNE)

    ds_test_processed = test_data.map(tf_tokenize, num_parallel_calls=tf.data.AUTOTUNE) \
                                 .batch(BATCH_SIZE) \
                                 .prefetch(tf.data.AUTOTUNE)

    print(f"Data preparation for {model_name} complete.")
    return ds_train_processed, ds_test_processed
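
The tokenizer call above relies on `truncation=True` and `padding='max_length'` to give every review the same length. A minimal sketch of that idea, using plain lists and no tokenizer, might look like the following. The helper name `pad_or_truncate`, the `PAD_ID` value, and the token ids are hypothetical and chosen only for illustration; a real BERT-style tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
# Toy illustration of padding/truncation to a fixed length.
# PAD_ID and the token ids below are hypothetical, not real tokenizer output.
PAD_ID = 0


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # truncate long sequences
    n_pad = max_length - len(kept)         # number of padding slots needed
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model distinguish the padded tail from real tokens, which is why the dataset returns it alongside `input_ids`.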


In [5]:
# --- MODEL 1: DISTILBERT FINE-TUNING ---

In [6]:
MODEL_NAME_DISTILBERT = 'distilbert-base-uncased'

ds_train_distilbert, ds_test_distilbert = create_processed_datasets(
    MODEL_NAME_DISTILBERT, train_data, test_data
)

print(f"\n--- Starting Training for {MODEL_NAME_DISTILBERT} ---")
start_time_distilbert = time.time()


Processing data for distilbert-base-uncased...


Data preparation for distilbert-base-uncased complete.

--- Starting Training for distilbert-base-uncased ---


In [7]:
# Load the model
model_distilbert = TFAutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME_DISTILBERT,
    num_labels=2,
    use_safetensors=False
)


TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_projector', 'vocab_transform', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-

In [8]:
# Compile and train
model_distilbert.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

history_distilbert = model_distilbert.fit(
    ds_train_distilbert,
    epochs=3,
    validation_data=ds_test_distilbert
)

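end_time_distilbert = time.time()
# start_time_distilbert was set before the model was loaded, so this figure
# also includes checkpoint download and model initialization, not only fit()
training_time_distilbert = end_time_distilbert - start_time_distilbert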

Epoch 1/3




Epoch 2/3
Epoch 3/3
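
The compile step uses `SparseCategoricalCrossentropy(from_logits=True)` with an accuracy metric. As a rough sketch of what those two quantities compute, the pure-Python functions below evaluate the per-example loss as the negative log-softmax of the true class and the accuracy as the fraction of argmax predictions that match the labels. The function names and the toy logits are assumptions made for illustration and are not taken from the notebook run.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def accuracy(batch_logits, labels):
    """Fraction of examples whose argmax logit equals the integer label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical two-class logits (negative, positive) for three reviews
batch_logits = [[2.0, -1.0], [0.5, 1.5], [0.0, 0.0]]
labels = [0, 1, 1]

losses = [sparse_ce_from_logits(z, y) for z, y in zip(batch_logits, labels)]
mean_loss = sum(losses) / len(losses)
batch_accuracy = accuracy(batch_logits, labels)
print(f"mean loss: {mean_loss:.4f}, accuracy: {batch_accuracy:.4f}")
```

Because the model outputs raw logits rather than probabilities, `from_logits=True` is needed so that the softmax is applied inside the loss.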


In [9]:
# Evaluation
distilbert_metrics = model_distilbert.evaluate(ds_test_distilbert)
distilbert_accuracy = distilbert_metrics[1]
print("\n--- DistilBERT Final Evaluation ---")
print(f"Test Accuracy: {distilbert_accuracy:.4f}")
print(f"Total Training Time: {training_time_distilbert:.2f} seconds")


--- DistilBERT Final Evaluation ---
Test Accuracy: 0.8691
Total Training Time: 1572.42 seconds
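
Note that the timer in this notebook starts in cell 6 and stops after `fit` in cell 8, so the reported figure also covers model loading in cell 7. One way to time a single step in isolation is a small context manager, sketched below. The `timed` helper and the `timings` dictionary are hypothetical names, and the summation is only a stand-in for the real `model.fit(...)` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("fit", timings):
    # Stand-in workload; in the notebook this would wrap model.fit(...)
    sum(i * i for i in range(100_000))

print(f"fit took {timings['fit']:.4f} seconds")
```

Timing each phase separately (tokenization, loading, fitting) would make the DistilBERT and BERT figures easier to compare, since download time depends on the network rather than on the model.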


In [10]:
# --- MODEL 2: BERT FINE-TUNING (COMPARISON) ---
MODEL_NAME_BERT = 'bert-base-uncased'

# Note: The data processing must be run again, as the tokenizer is different.
ds_train_bert, ds_test_bert = create_processed_datasets(
    MODEL_NAME_BERT, train_data, test_data
)

print(f"\n--- Starting Training for {MODEL_NAME_BERT} ---")
start_time_bert = time.time()


Processing data for bert-base-uncased...


Data preparation for bert-base-uncased complete.

--- Starting Training for bert-base-uncased ---


In [11]:
# Load the model
model_bert = TFAutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME_BERT,
    num_labels=2,
    use_safetensors=False
)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# Compile and train
model_bert.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

history_bert = model_bert.fit(
    ds_train_bert, # Use the BERT-tokenized dataset
    epochs=3,
    validation_data=ds_test_bert
)

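end_time_bert = time.time()
# start_time_bert was set before the model was loaded, so this figure
# also includes checkpoint download and model initialization, not only fit()
training_time_bert = end_time_bert - start_time_bert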

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [13]:
# Evaluation
bert_metrics = model_bert.evaluate(ds_test_bert)
bert_accuracy = bert_metrics[1]
print("\n--- BERT Final Evaluation ---")
print(f"Test Accuracy: {bert_accuracy:.4f}")
print(f"Total Training Time: {training_time_bert:.2f} seconds")


--- BERT Final Evaluation ---
Test Accuracy: 0.8753
Total Training Time: 2937.96 seconds


In [17]:
print("Sentiment Analysis Model Comparison")
print("Model Name                            ", "Final Test Accuracy        ", "Training Time (Approx.)  ")
print("BERT (bert-base-uncased)              ", "0.8753 (87.53%)            ", "2937.96 seconds (~49.0 min)")
print("DistilBERT (distilbert-base-uncased)  ", "0.8691 (86.91%)            ", "1572.42 seconds (~26.2 min)")


Sentiment Analysis Model Comparison
Model Name                             Final Test Accuracy         Training Time (Approx.)  
BERT (bert-base-uncased)               0.8753 (87.53%)             2937.96 seconds (~49.0 min)
DistilBERT (distilbert-base-uncased)   0.8691 (86.91%)             1572.42 seconds (~26.2 min)
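
The trade-off in the table can be made explicit with a little arithmetic. The sketch below uses only the accuracy and timing figures reported above; the variable names are illustrative, and since each model was trained once, the gap should be read as indicative rather than statistically established.

```python
# Figures copied from the evaluation outputs above (single run per model)
results = {
    "bert-base-uncased": {"accuracy": 0.8753, "seconds": 2937.96},
    "distilbert-base-uncased": {"accuracy": 0.8691, "seconds": 1572.42},
}

bert = results["bert-base-uncased"]
distil = results["distilbert-base-uncased"]

# Accuracy difference in percentage points
accuracy_gap_points = (bert["accuracy"] - distil["accuracy"]) * 100
# How many times longer BERT took than DistilBERT
speedup = bert["seconds"] / distil["seconds"]
# Share of BERT's wall-clock time saved by using DistilBERT
time_saved_pct = (1 - distil["seconds"] / bert["seconds"]) * 100

print(f"Accuracy gap: {accuracy_gap_points:.2f} percentage points")
print(f"BERT took {speedup:.2f}x as long as DistilBERT")
print(f"DistilBERT saved {time_saved_pct:.1f}% of the time")
```

In this run, DistilBERT gives up well under one percentage point of accuracy while finishing in a little more than half the time.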


Question 2: Text Generation with GPT-2


In [18]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

In [20]:
# --- 1. MODEL LOADING ---
MODEL_NAME_GPT = 'gpt2'
tokenizer_gpt = AutoTokenizer.from_pretrained(MODEL_NAME_GPT)
model_gpt = AutoModelForCausalLM.from_pretrained(MODEL_NAME_GPT)
tokenizer_gpt.pad_token = tokenizer_gpt.eos_token


In [21]:
# --- 2. GENERATION SETUP AND EXECUTION ---
prompt = "In a distant future, humanity has discovered"
input_ids = tokenizer_gpt.encode(prompt, return_tensors='pt')

print(f"\nPrompt: \"{prompt}\"")
print("\n--- Generated Story using Nucleus Sampling (Top-P) ---")



Prompt: "In a distant future, humanity has discovered"

--- Generated Story using Nucleus Sampling (Top-P) ---


In [22]:
# Generate the story with sampling: top-k and nucleus (top-p) filtering plus temperature
story_output = model_gpt.generate(
    input_ids,
    attention_mask=torch.ones_like(input_ids),  # Single unpadded prompt: attend to every token
    max_length=150,           # Total length in tokens, prompt included
    do_sample=True,
    top_k=50,                 # Limits tokens to the 50 most probable
    top_p=0.95,               # Nucleus Sampling threshold
    temperature=0.8,          # Values below 1 sharpen the distribution (less random)
    num_return_sequences=1,
    pad_token_id=tokenizer_gpt.eos_token_id
)



In [24]:
# Decode and display the generated story
generated_text = tokenizer_gpt.decode(story_output[0], skip_special_tokens=True)
print("\n" + generated_text)


In a distant future, humanity has discovered how to destroy their enemies. The planet Serenity, which is a planet of immense size and importance, is surrounded by a massive array of nuclear warheads, and no one could possibly know how to stop it.

Serenity's existence is a mystery because of a series of mysterious and strange events. One of these events has been referred to as "sunken" and "dark." The next is called "unearthly" and "warped." These are events that have come to be known as "dungeons." However, in the future, the world Serenity has become part of is actually inhabited by a species of extraterrestrial beings that are known as the "
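
The generation call combines three controls: `temperature`, `top_k`, and `top_p`. The sketch below shows, on a toy four-token distribution, how such filtering can narrow the candidates before one token is sampled. It is a simplified illustration in plain Python and not the library's implementation; the function name `filter_distribution` and the toy probabilities are assumptions.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # Temperature: dividing logits by a value below 1 sharpens the distribution
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens (0 disables the filter)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


# Toy vocabulary of four tokens with probabilities 0.5, 0.3, 0.15, 0.05
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
filtered = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8)

rng = random.Random(0)  # fixed seed so the draw is repeatable
next_token = rng.choices(list(filtered), weights=list(filtered.values()))[0]
print(filtered, next_token)
```

Here top-k first drops the least likely token, and top-p then keeps only the two most likely ones, so the sample is drawn from a renormalized two-token distribution.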
