# Text-to-Code Generation with TensorFlow, HuggingFace & MBPP

## Overview
Text-to-code generation automatically converts natural language descriptions into executable code. This guide demonstrates using CodeT5 (by Salesforce) with TensorFlow and the Mostly Basic Python Programming (MBPP) benchmark.

## Key Components

### CodeT5
- Built on T5's encoder-decoder architecture
- Pre-trained on CodeSearchNet data (multiple programming languages)
- Specialized for code understanding and generation
- Uses code-specific tokenizer with 32,000 vocabulary size

### T5 (Text-to-Text Transfer Transformer)
- Unified framework for NLP tasks
- Uses task-specific prefixes for different operations
- Pre-trained on C4 (Colossal Clean Crawled Corpus)
- Available in multiple sizes (small to 11B parameters)

### MBPP Dataset
- 1,000 Python programming problems
- Entry-level coding tasks
- Includes:
  - Task descriptions
  - Code solutions
  - Automated test cases

## Implementation Steps

1. **Setup**
   - Import TensorFlow and HuggingFace libraries
   - Configure mixed precision training
   - Set up distributed training if needed

2. **Data Processing**
   - Load MBPP dataset
   - Preprocess text and code pairs
   - Prepare input/output sequences

3. **Training**
   - Fine-tune CodeT5 model
   - Use teacher forcing approach
   - Monitor training metrics

4. **Inference**
   - Generate code from descriptions
   - Validate outputs
   - Test with custom inputs

## Key Considerations
- Learning rate: Use 1e-4 to 3e-4 with AdamW optimizer
- Padding: Replace padding tokens with -100 in loss calculation
- Task prefixes: Important for multi-task training
- Validation: Test generated code functionality

In [1]:
import os
import time
import math
import random
import datetime
from pathlib import Path

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"  # reduce the amount of console output from TF
import tensorflow as tf

from transformers import *
%pip install -q datasets # install HF datasets library
from datasets import load_dataset

logging.set_verbosity_warning()
logging.set_verbosity_error()

import logging

GroupViT models are not usable since `tensorflow_probability` can't be loaded. It seems you have `tensorflow_probability` installed with the wrong tensorflow version.Please try to reinstall it following the instructions here: https://github.com/tensorflow/probability.
TAPAS models are not usable since `tensorflow_probability` can't be loaded. It seems you have `tensorflow_probability` installed with the wrong tensorflow version. Please try to reinstall it following the instructions here: https://github.com/tensorflow/probability.
  pid, fd = os.forkpty()


Note: you may need to restart the kernel to use updated packages.


In [2]:
print('TF version',tf.__version__)
#print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU'))) # check GPU available

TF version 2.17.1


# Machine Learning Training Optimization Strategies

## Mixed Precision Training

Mixed precision training combines different floating-point formats to optimize model training performance while maintaining accuracy.

### How It Works
Mixed precision uses both float32 and lower precision formats (float16 or bfloat16):
- Most transformer operations use float16/bfloat16
- Critical operations (like softmax) remain in float32
- Reduces memory usage and enables larger batch sizes
- Accelerates matrix operations on supported hardware

### Hardware Considerations
- NVIDIA GPUs (Compute Capability 7.0+): Use float16/float32 mix
- Cloud TPUs: Use bfloat16/float32 mix
- Tensor Cores provide additional acceleration for float16 operations

## XLA (Accelerated Linear Algebra)

XLA is a specialized compiler that optimizes TensorFlow models without requiring code changes.

### Key Benefits
- Fuses multiple operations into single kernels
- Reduces memory transfers
- Optimizes computation graphs automatically

### Example Optimization
```python
# Original operations:
def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)

# XLA fuses these into a single kernel operation
# maintaining values in GPU registers
```

## Distribution Strategy

TensorFlow's distribution API enables training across multiple devices with minimal code changes.

### Training Approaches
- Synchronous: Workers train on different data slices, aggregate gradients
- Asynchronous: Workers train independently, update variables asynchronously

### Implementation Options
- Single GPU: Use `tf.distribute.OneDeviceStrategy`
- Multiple GPUs: Supports data parallelism
- TPUs/Multiple Machines: Scales across different hardware configurations

### Key Features
- Strategy-aware components (variables, layers, models, optimizers)
- Supports eager and graph execution
- Flexible device placement
- Automatic input distribution
- Built-in checkpoint management

### Best Practices
- Use graph execution (`tf.function`) for optimal performance
- Consider data pipeline optimization for distributed training
- Monitor device utilization and memory usage
- Implement appropriate batch size scaling

In [3]:
def setup_strategy(xla, fp16, no_cuda):
    print(" Tensorflow: setting up strategy")
    
    # setup xla
    if xla:
        print(" XLA Enabled")
        tf.config.optimizer.set_jit(True)
    
    # setup mixed precision training
    if fp16:
        # Set to float16 at first
        print(" Mixed Precision Training Enabled")
        policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16")
        tf.keras.mixed_precision.experimental.set_policy(policy)
    
    # setup distribution strategy
    gpus = tf.config.list_physical_devices("GPU")
    if no_cuda:
        strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")
    else:
        if len(gpus) == 0:
            print(" One Device Strategy [CPU] Enabled")
            strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")
        elif len(gpus) == 1:
            print(" One Device Strategy [GPU] Enabled")
            strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
        elif len(gpus) > 1:
            print(" Mirrored Strategy Enabled")
            # If only want to use a specific subset of GPUs use CUDA_VISIBLE_DEVICES=0`
            strategy = tf.distribute.MirroredStrategy()
        else:
            strategy = tf.distribute.get_strategy()

    return strategy

def n_replicas(strategy):
    # return number of devices
    return strategy.num_replicas_in_sync

# note: 
# huggingface TF-T5 implementation has issues when mixed precision is enabled
# we will disable FP16 for this but can be used for training any other model
strategy = setup_strategy(xla=True, fp16=False, no_cuda=False)

 Tensorflow: setting up strategy
 XLA Enabled
 One Device Strategy [GPU] Enabled


# Understanding and Processing the MBPP Dataset

## About MBPP (Mostly Basic Python Problems)

The MBPP dataset, introduced in "Program Synthesis with Large Language Models" (2021), serves as a benchmark for code generation tasks. It provides a structured collection of Python programming problems suitable for testing both human programmers and machine learning models.

### Dataset Composition
- 974 total Python functions
- 426 hand-verified questions (edited dataset)
- Each entry contains:
  - Function description in natural language
  - Python implementation
  - Test cases for verification

### Quality Control
The edited subset (426 problems) underwent rigorous verification to ensure:
- Standard Python function signatures
- Unambiguous problem descriptions
- Accurate test cases matching descriptions
- Clear evaluation criteria

## Dataset Processing Pipeline

### Step 1: Data Acquisition
```python
def download_dataset():
    """
    Downloads and loads the MBPP dataset using HuggingFace datasets
    Returns: Dataset object containing training examples
    """
    return load_dataset('mbpp')
```

### Step 2: Feature Creation
```python
def convert_examples_to_features(examples):
    """
    Converts raw examples into model-ready features
    Creates: input_ids, attention_mask, labels
    Uses: tokenizer.batch_encode_plus() for processing
    """
    features = {
        'input_ids': [],
        'attention_mask': [],
        'labels': []
    }
    # Processing logic here
    return features
```

### Step 3: Data Pipeline Construction
Our pipeline uses TensorFlow's tf.data API for efficient data handling:

1. Train/Test Split
   - 90% training data
   - 10% validation data

2. TensorFlow Dataset Creation
   ```python
   def get_train_tfdataset():
       """
       Creates training dataset with shuffling and prefetching
       Returns: tf.data.Dataset object
       """
       return tf.data.Dataset.from_generator(...)
           .shuffle(buffer_size)
           .batch(batch_size)
           .prefetch(tf.data.AUTOTUNE)
   ```

### Pipeline Optimizations

1. **Prefetching**
   - Overlaps data preprocessing with model execution
   - Reduces step time by parallel processing
   - Uses tf.data.AUTOTUNE for dynamic optimization

2. **Batching Strategy**
   - Applies batching before repeat for clear epoch boundaries
   - Maintains consistent batch sizes across epochs

3. **Shuffling**
   - Uses fixed-size buffer for memory efficiency
   - Applied only to training data
   - Maintains randomization across epochs

4. **Memory Management**
   - Efficient buffer usage for large datasets
   - Dynamic prefetch sizing based on available resources

In [4]:
def download_dataset(cache_dir):
    # download data using a keras utility
    _url = "https://raw.githubusercontent.com/google-research/google-research/master/mbpp/mbpp.jsonl" # download mbpp dataset
    dataset_path = tf.keras.utils.get_file("mbpp.jsonl", origin=_url, cache_dir=cache_dir, cache_subdir=cache_dir)
    return dataset_path
# Specify your desired directory
cache_dir = "D:/Projects/Projects/Deep_Learning/Code Generation Using T5"
dataset_path = download_dataset(cache_dir)

print(f"Dataset saved at: {dataset_path}")

def convert_examples_to_features(examples, tokenizer, args):
    # encode text-code pairs
    texts = examples['text']
    codes = examples['code']
    # tests = [" ".join(test) for test in examples['test_list']] # convert list of test cases to single string
    
    # encode texts by prepending the task for input sequence
    inputs = [args.prefix + text for text in texts]
    model_inputs = tokenizer(inputs, max_length=args.max_input_length, padding="max_length", truncation=True)
    
    # encode texts by prepending the task for input sequence and appending the test sequence
    # inputs = [args.prefix + text + " " + test for text, test in zip(texts, tests)]
    # model_inputs = tokenizer(inputs, max_length=args.max_input_length, padding="max_length", truncation=True)
    
    # encode texts by prepending the task for input sequence
    labels = tokenizer(codes, max_length=args.max_target_length, padding="max_length", truncation=True).input_ids
    
    # we need to replace the index of the padding tokens by -100
    # such that they are not taken into account by the CrossEntropyLoss
    labels_with_ignore_index = []
    for labels_example in labels:
        labels_example = [label if label != 0 else -100 for label in labels_example]
        labels_with_ignore_index.append(labels_example)
    model_inputs["labels"] = labels_with_ignore_index
    
    # return features
    return model_inputs


def get_train_tfdataset(train_dataset, num_train_examples, args):
    # select feature columns
    columns = ['input_ids', 'attention_mask', 'labels'] 
    # set to tensorflow format
    train_dataset.set_format(type='tensorflow', columns=columns) 
    
    # specify return types
    return_types = {'input_ids':tf.int32, 'attention_mask':tf.int32, 'labels':tf.int32} 
    # specify return shapes
    return_shapes = {'input_ids': tf.TensorShape([None]),'attention_mask': tf.TensorShape([None]), 'labels': tf.TensorShape([None])} 
    # initialize dataset 
    tf_dataset = tf.data.Dataset.from_generator(lambda : train_dataset, return_types, return_shapes) 
    
    # turn off auto-sharding
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF
    tf_dataset = tf_dataset.with_options(options)
    
    # repeat, shuffle, batch, prefetch
    ds = (
        tf_dataset.repeat()
        .shuffle(num_train_examples, seed=args.seed)
        .batch(args.train_batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )
    
    # distribute dataset to devices
    return strategy.experimental_distribute_dataset(ds)

def get_validation_tfdataset(eval_dataset, num_validation_examples, args):
    # select feature columns
    columns = ['input_ids', 'attention_mask', 'labels'] 
    # set to tensorflow format
    eval_dataset.set_format(type='tensorflow', columns=columns) 
    
    # specify return types
    return_types = {'input_ids':tf.int32, 'attention_mask':tf.int32, 'labels':tf.int32} 
    # specify return shapes
    return_shapes = {'input_ids': tf.TensorShape([None]),'attention_mask': tf.TensorShape([None]), 'labels': tf.TensorShape([None])} 
    # initialize dataset 
    tf_dataset = tf.data.Dataset.from_generator(lambda : eval_dataset, return_types, return_shapes) 
    
    # turn off auto-sharding
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF
    tf_dataset = tf_dataset.with_options(options)
    
    # repeat, batch, prefetch
    ds = (
        tf_dataset.repeat()
        .batch(args.validation_batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )
    
    # distribute dataset to devices
    return strategy.experimental_distribute_dataset(ds)

Dataset saved at: /tmp/.keras/D:/Projects/Projects/Deep_Learning/Code Generation Using T5/mbpp.jsonl


# Utility Functions / Class

- *fix_all_seeds()* - sets the random seed for deterministic results.
- *init_logger()* - initialize logger for tracking events.
- *ProgressBar()* - custom progress bar to display metrics.

In [5]:
def fix_all_seeds(seed):
    # set random seed
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    
def init_logger(log_file=None, log_file_level=logging.NOTSET):
    # initialize logger for tracking events and save in file
    if isinstance(log_file, Path):
        log_file = str(log_file)
    log_format = logging.Formatter(
        fmt='%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
        datefmt='%m/%d/%Y %H:%M:%S'
    )
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    console_handler = logging.StreamHandler()
    console_handler.setFormatter(log_format)
    logger.handlers = [console_handler]
    if log_file and log_file != '':
        file_handler = logging.FileHandler(log_file)
        file_handler.setLevel(log_file_level)
        # file_handler.setFormatter(log_format)
        logger.addHandler(file_handler)
    return logger

class ProgressBar(object):
    # custom progress bar
    def __init__(self, n_total,width=30,desc = 'Training'):
        self.width = width
        self.n_total = n_total
        self.start_time = time.time()
        self.desc = desc

    def __call__(self, step, info={}):
        now = time.time()
        current = step + 1
        recv_per = current / self.n_total
        bar = f'[{self.desc}] {current}/{self.n_total} ['
        if recv_per >= 1:
            recv_per = 1
        prog_width = int(self.width * recv_per)
        if prog_width > 0:
            bar += '=' * (prog_width - 1)
            if current< self.n_total:
                bar += ">"
            else:
                bar += '='
        bar += '.' * (self.width - prog_width)
        bar += ']'
        show_bar = f"\r{bar}"
        time_per_unit = (now - self.start_time) / current
        if current < self.n_total:
            eta = time_per_unit * (self.n_total - current)
            if eta > 3600:
                eta_format = ('%d:%02d:%02d' %
                              (eta // 3600, (eta % 3600) // 60, eta % 60))
            elif eta > 60:
                eta_format = '%d:%02d' % (eta // 60, eta % 60)
            else:
                eta_format = '%ds' % eta
            time_info = f' - ETA: {eta_format}'
        else:
            if time_per_unit >= 1:
                time_info = f' {time_per_unit:.1f}s/step'
            elif time_per_unit >= 1e-3:
                time_info = f' {time_per_unit * 1e3:.1f}ms/step'
            else:
                time_info = f' {time_per_unit * 1e6:.1f}us/step'

        show_bar += time_info
        if len(info) != 0:
            show_info = f'{show_bar} ' + \
                        "-".join([f' {key}: {value:.4f} ' if key != "learning_rate" else f' {key}: {value:.8f} ' for key, value in info.items()])
            print(show_info, end='')
        else:
            print(show_bar, end='')

# Custom Training Loops in TensorFlow: A Comprehensive Guide

Custom training loops provide fine-grained control over the training process and make debugging easier. Let's understand how they work and how to implement them effectively.

## Core Concepts

### What is a Training Loop?
A training loop is the iterative process where a model learns from data. Think of it like practicing a skill - you try something, learn from your mistakes, and gradually improve. In machine learning, this happens through systematic steps:

1. Process a batch of data
2. Make predictions
3. Calculate how wrong those predictions were (loss)
4. Adjust the model to do better next time (optimization)

### Why Custom Training?
While TensorFlow provides high-level APIs for training, custom loops offer several advantages:
- Complete control over the training process
- Easier debugging and monitoring
- Flexibility to implement complex training strategies
- Better understanding of what's happening "under the hood"

## Implementation Guide

Here's how to build a robust custom training loop:

```python
class Trainer:
    def __init__(self, strategy):
        self.strategy = strategy
        with strategy.scope():
            self.model = create_model()
            self.optimizer = create_optimizer()

    @tf.function
    def train_step(self, batch):
        with tf.GradientTape() as tape:
            # Forward pass
            predictions = self.model(batch['inputs'], training=True)
            # Calculate loss
            loss = compute_loss(predictions, batch['labels'])
            
        # Calculate gradients and update model
        gradients = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
        return loss

    def train(self, train_dataset, epochs):
        for epoch in range(epochs):
            total_loss = 0
            # Distribute dataset across devices
            for batch in train_dataset:
                # Run training step on all replicas
                per_replica_loss = self.strategy.run(
                    self.train_step, args=(batch,)
                )
                # Aggregate losses from all replicas
                total_loss += self.strategy.reduce(
                    tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None
                )
```

## Key Components Explained

### Distribution Strategy
The training loop integrates with TensorFlow's distribution strategies for multi-device training:
- Creates model and optimizer within strategy scope
- Distributes data across available devices
- Aggregates results from all devices

### Performance Optimization
Several techniques ensure efficient training:

1. `@tf.function` Decoration
   - Converts Python code to TensorFlow graphs
   - Improves execution speed
   - Required for SavedModel export

2. Gradient Tape
   - Records operations for automatic differentiation
   - Efficiently computes gradients
   - Manages memory usage during backpropagation

3. Loss Reduction
   - Uses strategy.reduce() for proper loss aggregation
   - Handles varying batch sizes across replicas
   - Maintains consistent scaling

## Best Practices

1. Model and Optimizer Creation
   ```python
   with strategy.scope():
       model = create_model()
       optimizer = create_optimizer()
   ```
   Always create these objects within strategy scope for proper distribution.

2. Checkpoint Management
   ```python
   checkpoint = tf.train.Checkpoint(
       model=model,
       optimizer=optimizer
   )
   ```
   Regular checkpointing helps resume training and save progress.

3. Progress Monitoring
   - Track metrics across epochs
   - Monitor learning rate changes
   - Log gradients for debugging

Remember: The goal of a custom training loop is not just to train the model, but to give you insights and control over the training process. Take time to add logging and visualization code to help you understand what's happening during training.

In [6]:
class Trainer:
    def __init__(
        self, model, args, train_dataset, validation_dataset, 
        num_train_examples, num_validation_examples
    ):
        self.model = model
        self.args = args
        
        self.train_dataset = train_dataset
        self.num_train_examples = num_train_examples
        
        self.validation_dataset = validation_dataset
        self.num_validation_examples = num_validation_examples
        
        self.global_step = 0
        self.eval_loss = tf.keras.metrics.Sum()
        
    def create_optimizer_and_scheduler(self, num_training_steps):
        # creates an optimizer with a learning rate schedule using a warmup phase followed by a linear decay.
        num_warmup_steps = math.ceil(num_training_steps * self.args.warmup_ratio)
        self.optimizer, self.lr_scheduler = create_optimizer(
            init_lr=self.args.learning_rate,
            num_train_steps=num_training_steps,
            num_warmup_steps=num_warmup_steps,
            weight_decay_rate=self.args.weight_decay,
            adam_epsilon=self.args.adam_epsilon
        )
    
    def evaluation_step(self, features, labels, nb_instances_in_global_batch):
        # forward pass
        outputs = self.model(input_ids=features['input_ids'], attention_mask=features['attention_mask'], labels=labels, training=False)[:2]
        loss, logits = outputs[:2]
        # loss scaling
        scaled_loss = loss / tf.cast(nb_instances_in_global_batch, dtype=loss.dtype)
        # add current batch loss
        self.eval_loss.update_state(scaled_loss)
    
    @tf.function
    def distributed_evaluation_steps(self, batch):
        features = {k: v for k, v in batch.items() if 'labels' not in k}
        labels = batch['labels']
        nb_instances = tf.reduce_sum(tf.cast(labels != -100, dtype=tf.int32))
        # strategy.run() expects args to be a list or tuple
        inputs = (features, labels, nb_instances)
        # `run` replicates the provided computation and runs with the distributed input
        strategy.run(self.evaluation_step, inputs)

    def evaluate(self):
        # calculate total validation steps
        steps = math.ceil(self.num_validation_examples / self.args.validation_batch_size)
        # reset eval loss after every epoch
        self.eval_loss.reset_states()
        logs = {}
        pbar = ProgressBar(n_total=steps, desc='Evaluating')
        # iterate over validation dataset
        for step, batch in enumerate(self.validation_dataset): 
            # distributed evaluation step
            self.distributed_evaluation_steps(batch) 
            logs["eval_loss"] = self.eval_loss.result() / (step + 1)
            pbar(step=step, info=logs)
            if step == steps - 1:
                break
        print("\n------------- validation result -----------------")
        
    def apply_gradients(self, features, labels, nb_instances_in_global_batch):
        # forward pass
        outputs = self.model(input_ids=features['input_ids'], attention_mask=features['attention_mask'], labels=labels, training=True)[:2] 
        loss, logits = outputs[:2]
        # loss scaling
        scaled_loss = loss / tf.cast(nb_instances_in_global_batch, dtype=loss.dtype) 
        # calculate gradients
        gradients = tf.gradients(scaled_loss, self.model.trainable_variables) 
        # convert gradients with nan value
        gradients = [g if g is not None else tf.zeros_like(v) for g, v in zip(gradients, self.model.trainable_variables)] 
        # optimize the model
        self.optimizer.apply_gradients(list(zip(gradients, self.model.trainable_variables))) 
        # add current batch loss
        self.train_loss.update_state(scaled_loss) 
    
    @tf.function
    def distributed_training_steps(self, batch):
        with strategy.scope():
            features = {k: v for k, v in batch.items() if 'labels' not in k}
            labels = batch['labels']
            nb_instances = tf.reduce_sum(tf.cast(labels != -100, dtype=tf.int32))
            # strategy.run() expects args to be a list or tuple
            inputs = (features, labels, nb_instances)
            # `run` replicates the provided computation and runs with the distributed input.
            strategy.run(self.apply_gradients, inputs)
    
    def train(self):
        # calculate total training steps
        num_updates_per_epoch = self.num_train_examples // args.train_batch_size 
        self.steps_per_epoch = num_updates_per_epoch
        t_total = self.steps_per_epoch * self.args.epochs
        
        with strategy.scope():
            # optimizer, and checkpoint must be created under `strategy.scope`
            # create optimizer and scheduler
            self.create_optimizer_and_scheduler(num_training_steps=t_total) 
            
            # create checkpoint manager
            folder = os.path.join(self.args.output_dir, self.args.checkpoint_dir)
            ckpt = tf.train.Checkpoint(optimizer=self.optimizer, model=self.model) 
            self.model.ckpt_manager = tf.train.CheckpointManager(ckpt, folder, max_to_keep=1)
            iterations = self.optimizer.iterations
            
            logger.info("***** Running training *****")
            logger.info(f"  Num examples = {self.num_train_examples}")
            logger.info(f"  Num Epochs = {self.args.epochs}")
            logger.info(f"  Total train batch size (w. parallel & distributed) = {self.args.train_batch_size * n_replicas(strategy)}")
            logger.info(f"  Steps per epoch = {self.steps_per_epoch}")
            logger.info(f"  Total optimization steps = {t_total}")
            
            self.train_loss = tf.keras.metrics.Sum(name="training_loss")
            start_time = datetime.datetime.now()
            for epoch_iter in range(self.args.epochs):
                # training loop
                logger.info(f"Epoch {epoch_iter + 1}/{self.args.epochs}")
                
                pbar = ProgressBar(n_total=self.steps_per_epoch, desc='Training')
                # iterate over training dataset
                for step, batch in enumerate(self.train_dataset):    
                    # distributed training step
                    self.distributed_training_steps(batch) 
                    
                    self.global_step = iterations.numpy()
                    training_loss = self.train_loss.result() / (step + 1)
                    
                    logs = {}
                    logs["training_loss"] = training_loss.numpy()
                    logs["learning_rate"] = self.lr_scheduler(self.global_step).numpy()
                    pbar(step=step, info=logs)
                    
                    if self.global_step % self.steps_per_epoch == 0:
                        print("\n------------- train result -----------------")
                        # call to evaluation loop
                        self.evaluate()
                        # save checkpoint
                        ckpt_save_path = self.model.ckpt_manager.save()
                        logger.info(f"Saving checkpoint at {ckpt_save_path}")
                        break
                
                # reset train loss after every epoch
                self.train_loss.reset_states()
            end_time = datetime.datetime.now()
            logger.info(f"Training took: {str(end_time - start_time)}")

# Run

The `run()` function defines our execution process. We download, load and preprocess and convert our data into `tf.data.Dataset` format. We initialize tokenizer and model. The model needs to be created under `strategy.scope()`. We create instance of our `Trainer` and pass everything to `.train()` method for running our custom training loop. In the end we save our model and tokenizer using `.save_pretrained()` method.

In [7]:
def run(args):
    logger.info(" Starting training / evaluation")
    
    logger.info(" Downloading Data Files")
    dataset_path = download_dataset(args.cache_dir) 

    logger.info(" Loading Data Files")
    dataset = load_dataset('json', data_files=dataset_path) 
    # train test split
    dataset = dataset['train'].train_test_split(0.1, shuffle=False) 
        
    logger.info(" Initializing Tokenizer")
    tokenizer = RobertaTokenizer.from_pretrained(args.tokenizer_name) 
    
    logger.info(" Preparing Features")
    dataset = dataset.map(convert_examples_to_features, batched=True, fn_kwargs={"tokenizer":tokenizer, "args":args})

    logger.info(" Intializing training and validation dataset ")
    train_dataset = dataset['train']
    num_train_examples = len(dataset['train'])
    # create tf train dataset
    tf_train_dataset = get_train_tfdataset(train_dataset, num_train_examples, args) 
    
    validation_dataset = dataset['test']
    num_validation_examples = len(dataset['test'])
    # create tf validation dataset
    tf_validation_dataset = get_validation_tfdataset(train_dataset, num_validation_examples, args) 
    
    logger.info(f' Intializing model | {args.model_type.upper()} ')
    with strategy.scope():
        # model must be created under `strategy.scope`
        model = TFT5ForConditionalGeneration.from_pretrained(args.model_name_or_path, from_pt=True)
    
    # custom training loop
    trainer = Trainer(model, args, tf_train_dataset, tf_validation_dataset, num_train_examples, num_validation_examples) 
    trainer.train()
    
    # save pretrained model and tokenizer
    logger.info(f" Saving model in {args.save_dir}")
    trainer.model.save_pretrained(args.save_dir)
    tokenizer.save_pretrained(args.save_dir)

# Execute

Below we define our training arguments - model, data, optimizer, training and initialize directories. Initialize logger for logging and tracking metrics. We call `fix_all_seeds()` to set the global seed. Then finally we execute our `run()` method by passing our training `args`. 

In [8]:
class Args:
    # define training arguments
    
    # MODEL
    model_type = 't5'
    tokenizer_name = 'Salesforce/codet5-base'
    model_name_or_path = 'Salesforce/codet5-base'
    
    # DATA
    train_batch_size = 12
    validation_batch_size = 6
    max_input_length = 64
    max_target_length = 256
    prefix = "Generate Python: "    

    # OPTIMIZER
    learning_rate = 3e-4
    weight_decay = 1e-4
    warmup_ratio = 0.2
    adam_epsilon = 1e-8

    # TRAINING
    seed = 222
    epochs = 20

    # DIRECTORIES
    output_dir = "runs/"
    logging_dir = f"{output_dir}/logs/"
    checkpoint_dir = f"checkpoint"
    save_dir = f"{output_dir}/saved_model/"
    cache_dir = '../working/'
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    Path(logging_dir).mkdir(parents=True, exist_ok=True)
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    

# initialize training arguments
args = Args()
# initialize logger
logger = init_logger(log_file=os.path.join(args.logging_dir, f"{args.model_type}-{time.strftime('%Y-%m-%d-%H-%M-%S', time.localtime())}.log"))
# fix all seeds
fix_all_seeds(args.seed)

if __name__ == "__main__":
    # run training and evaluation
    dataset = run(args)

01/19/2025 12:55:56 - INFO - root -    Starting training / evaluation
01/19/2025 12:55:56 - INFO - root -    Downloading Data Files
01/19/2025 12:55:56 - INFO - root -    Loading Data Files
01/19/2025 12:55:56 - INFO - root -    Initializing Tokenizer
01/19/2025 12:55:56 - INFO - root -    Preparing Features


Map:   0%|          | 0/876 [00:00<?, ? examples/s]

Map:   0%|          | 0/98 [00:00<?, ? examples/s]

01/19/2025 12:55:59 - INFO - root -    Intializing training and validation dataset 
01/19/2025 12:56:00 - INFO - root -    Intializing model | T5 
01/19/2025 12:56:05 - INFO - root -   ***** Running training *****
01/19/2025 12:56:05 - INFO - root -     Num examples = 876
01/19/2025 12:56:05 - INFO - root -     Num Epochs = 20
01/19/2025 12:56:05 - INFO - root -     Total train batch size (w. parallel & distributed) = 12
01/19/2025 12:56:05 - INFO - root -     Steps per epoch = 73
01/19/2025 12:56:05 - INFO - root -     Total optimization steps = 1460
01/19/2025 12:56:05 - INFO - root -   Epoch 1/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 12:57:48 - INFO - absl -   Sharding callback duration: 1215
01/19/2025 12:57:51 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-1
01/19/2025 12:57:51 - INFO - root -   Epoch 2/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 12:58:31 - INFO - absl -   Sharding callback duration: 484
01/19/2025 12:58:34 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-2
01/19/2025 12:58:34 - INFO - root -   Epoch 3/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 12:59:15 - INFO - absl -   Sharding callback duration: 495
01/19/2025 12:59:18 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-3
01/19/2025 12:59:18 - INFO - root -   Epoch 4/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 12:59:59 - INFO - absl -   Sharding callback duration: 421
01/19/2025 13:00:03 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-4
01/19/2025 13:00:03 - INFO - root -   Epoch 5/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:00:43 - INFO - absl -   Sharding callback duration: 433
01/19/2025 13:00:46 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-5
01/19/2025 13:00:46 - INFO - root -   Epoch 6/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:01:27 - INFO - absl -   Sharding callback duration: 325
01/19/2025 13:01:30 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-6
01/19/2025 13:01:30 - INFO - root -   Epoch 7/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:02:11 - INFO - absl -   Sharding callback duration: 454
01/19/2025 13:02:15 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-7
01/19/2025 13:02:15 - INFO - root -   Epoch 8/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:02:55 - INFO - absl -   Sharding callback duration: 404
01/19/2025 13:02:59 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-8
01/19/2025 13:02:59 - INFO - root -   Epoch 9/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:03:39 - INFO - absl -   Sharding callback duration: 322
01/19/2025 13:03:42 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-9
01/19/2025 13:03:42 - INFO - root -   Epoch 10/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:04:22 - INFO - absl -   Sharding callback duration: 445
01/19/2025 13:04:26 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-10
01/19/2025 13:04:26 - INFO - root -   Epoch 11/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:05:06 - INFO - absl -   Sharding callback duration: 316
01/19/2025 13:05:09 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-11
01/19/2025 13:05:09 - INFO - root -   Epoch 12/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:05:50 - INFO - absl -   Sharding callback duration: 414
01/19/2025 13:05:53 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-12
01/19/2025 13:05:53 - INFO - root -   Epoch 13/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:06:33 - INFO - absl -   Sharding callback duration: 1313
01/19/2025 13:06:37 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-13
01/19/2025 13:06:37 - INFO - root -   Epoch 14/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:07:18 - INFO - absl -   Sharding callback duration: 607
01/19/2025 13:07:21 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-14
01/19/2025 13:07:21 - INFO - root -   Epoch 15/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:08:02 - INFO - absl -   Sharding callback duration: 513
01/19/2025 13:08:05 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-15
01/19/2025 13:08:05 - INFO - root -   Epoch 16/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:08:46 - INFO - absl -   Sharding callback duration: 512
01/19/2025 13:08:50 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-16
01/19/2025 13:08:50 - INFO - root -   Epoch 17/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:09:30 - INFO - absl -   Sharding callback duration: 372
01/19/2025 13:09:33 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-17
01/19/2025 13:09:33 - INFO - root -   Epoch 18/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:10:14 - INFO - absl -   Sharding callback duration: 564
01/19/2025 13:10:17 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-18
01/19/2025 13:10:17 - INFO - root -   Epoch 19/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:10:57 - INFO - absl -   Sharding callback duration: 604
01/19/2025 13:11:01 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-19
01/19/2025 13:11:01 - INFO - root -   Epoch 20/20


------------- train result -----------------
------------- validation result -----------------


01/19/2025 13:11:41 - INFO - absl -   Sharding callback duration: 497
01/19/2025 13:11:45 - INFO - root -   Saving checkpoint at runs/checkpoint/ckpt-20
01/19/2025 13:11:45 - INFO - root -   Training took: 0:15:40.219991
01/19/2025 13:11:45 - INFO - root -    Saving model in runs//saved_model/


## Prediction/Inference

Prediction is performed using `predict_from_dataset()` (for test set texts) or `predict_from_text()` (for custom input). The core prediction logic resides in `run_predict()`, which uses the model's `generate()` method and a decoding technique (currently Top-p/nucleus sampling).

`predict_from_dataset()` samples a random text from the test set for each call.

**Key Decoding Parameters:**

*   **Top-p (Nucleus Sampling):** Selects from the smallest set of words whose cumulative probability exceeds `p` (0 < `top_p` < 1). This dynamically adjusts the selection set size.
*   **Top-K:** Can be combined with Top-p to further restrict the selection to the top K most likely words.
*   **Repetition Penalty:** Penalizes repeated words or words present in the context.
*   **`num_return_sequences`:** Generates multiple independent output samples (if > 1).


In [9]:
def run_predict(args, text):
    # load saved finetuned model
    model = TFT5ForConditionalGeneration.from_pretrained(args.save_dir)
    # load saved tokenizer
    tokenizer = RobertaTokenizer.from_pretrained(args.save_dir) 
    
     # encode texts by prepending the task for input sequence and appending the test sequence
    query = args.prefix + text 
    encoded_text = tokenizer(query, return_tensors='tf', padding='max_length', truncation=True, max_length=args.max_input_length)
    
    # inference
    generated_code = model.generate(
        encoded_text["input_ids"], attention_mask=encoded_text["attention_mask"], 
        max_length=args.max_target_length, top_p=0.95, top_k=50, repetition_penalty=2.0, num_return_sequences=1
    )
    
    # decode generated tokens
    decoded_code = tokenizer.decode(generated_code.numpy()[0], skip_special_tokens=True)
    return decoded_code

def predict_from_dataset(args):
    # load using hf datasets
    dataset = load_dataset('json', data_files='../working/mbpp.jsonl') 
    # train test split
    dataset = dataset['train'].train_test_split(0.1, shuffle=False) 
    test_dataset = dataset['test']
    
    # randomly select an index from the validation dataset
    index = random.randint(0, len(test_dataset))
    text = test_dataset[index]['text']
    code = test_dataset[index]['code']
    
    # run-predict on text
    decoded_code = run_predict(args, text)
    
    print("#" * 25); print("QUERY: ", text); 
    print()
    print('#' * 25); print("ORIGINAL: "); print("\n", code);
    print()
    print('#' * 25); print("GENERATED: "); print("\n", decoded_code);
    
def predict_from_text(args, text):
    # run-predict on text
    decoded_code = run_predict(args, text)
    print("#" * 25); print("QUERY: ", text); 
    print()
    print('#' * 25); print("GENERATED: "); print("\n", decoded_code);

# Predict from Dataset

In [10]:
# example 1
predict_from_dataset(args)
# example 2
predict_from_dataset(args)
# example 3
predict_from_dataset(args)

Generating train split: 0 examples [00:00, ? examples/s]



#########################
QUERY:  Write a function to find the sum of first even and odd number of a given list.

#########################
ORIGINAL: 

 def sum_even_odd(list1):
    first_even = next((el for el in list1 if el%2==0),-1)
    first_odd = next((el for el in list1 if el%2!=0),-1)
    return (first_even+first_odd)

#########################
GENERATED: 

 def sum_evenodd(list1):
    first = next((el for el in list 1 if El%2==0),-ord('a') + 2) 
        return (first[i] % 11 == 0))
#########################
QUERY:  Write a function to count the elements in a list until an element is a tuple.

#########################
ORIGINAL: 

 def count_elim(num):
  count_elim = 0
  for n in num:
    if isinstance(n, tuple):
        break
    count_elim += 1
  return count_elim

#########################
GENERATED: 

 def count_elements(list1, element):
  ctr = 0 
    for x in list:  
      if not isinstance (x , tuple or len([y]), type('tuple')) :    
        continue      
          else:

# Predict from Text

In [11]:
# example 1
predict_from_text(args, "Write a function to add two random numbers"); print()
# example 2
predict_from_text(args, "Write a function to find the frequency of items in a list"); print()
# example 3
predict_from_text(args, "Write a function to concatenate two dictionary"); print()

#########################
QUERY:  Write a function to add two random numbers

#########################
GENERATED: 

 def add_random(a,b):
    if a + b in range (0,'2' and c: 
        return -1  
     else::NOQA
         for iinrange(_min=A), _max = B;
       yield from [i+j] import math
           as jadd


#########################
QUERY:  Write a function to find the frequency of items in a list

#########################
GENERATED: 

 import collections
def freq_count(list1):freq = Counter([item for item in list 1 if not isinstance((x, dict))]) 
    return frequency

#########################
QUERY:  Write a function to concatenate two dictionary

#########################
GENERATED: 

 def concatenate_dict(d1, d2):
    result = {k: v for (key , val) in dict.items() if not isinstance([val], list))} 
     returnResult



This notebook demonstrates fine-tuning a T5 model for text-to-code generation using TensorFlow, Hugging Face Transformers, and the MBPP dataset. Results are constrained by the limited training data and the base T5 model size. The methodology can be applied to other CodeXGLUE benchmarks (code understanding and generation tasks). Potential applications for AI coding assistants include:

*   **Text-to-code:** Generating code from natural language.
*   **Code autocompletion:** Completing function implementations.
*   **Code summarization:** Generating natural language summaries of code.