## Imports & Installation

### Installation

| Library         | Description                                                                                                           |
|-----------------|-----------------------------------------------------------------------------------------------------------------------|
| transformers    | A library that provides state-of-the-art pretrained models for various NLP tasks.                                     |
| datasets        | A library that simplifies the process of accessing and working with a wide range of machine learning datasets.        |
| mlflow          | A platform for managing the end-to-end machine learning lifecycle, from experimentation to deployment.                |
| torch (PyTorch) | A powerful deep learning framework used for building, training, and deploying neural networks.                        |
| pyngrok         | A tool that allows local servers (like Gradio apps) to be exposed to the internet for easy testing and sharing.       |
| gradio          | A user-friendly library for creating interactive UIs for machine learning models, enabling easy sharing and testing.  |


In [None]:
!pip install transformers -q
!pip install datasets -q
!pip install mlflow -q
!pip install torch -q
!pip install pyngrok -q
!pip install gradio -q
!pip install tf-keras -q

In [4]:
!pip install "accelerate>=0.26.0"

In [5]:
!ngrok config add-authtoken 0000000000000000000000000000000000000000000000000

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml
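After the install cells run, it can help to confirm which versions were actually picked up, since the rest of the notebook depends on them. The following is a minimal sketch using only the standard library; `installed_versions` and the `PACKAGES` list are illustrative names, not part of the original notebook.

```python
from importlib import metadata

# Distribution names as passed to pip in the install cells above.
PACKAGES = ["transformers", "datasets", "mlflow", "torch", "pyngrok", "gradio", "accelerate"]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```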


### Imports

In [6]:
# Import necessary modules for subprocess management
import subprocess
# Import pyngrok for handling public access tunnels and configurations
from pyngrok import ngrok, conf
# For securely handling password inputs
import getpass
# Importing os module to interact with the operating system
import os
# Importing MLflow to track machine learning experiments with PyTorch models
import mlflow
import mlflow.pytorch
# Import transformers' pre-trained GPT-2 model and tokenizer, as well as Trainer utilities
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments, EarlyStoppingCallback
# For loading and handling datasets
from datasets import load_dataset, DatasetDict
# Import PyTorch, a machine learning framework
import torch
# Import pre-trained models and tokenizers for causal language modeling tasks
from transformers import AutoTokenizer, AutoModelForCausalLM
# Importing Gradio, a framework to create web interfaces for machine learning models
import gradio as gr

## Initialization

### Initializing MLflow Tracking with a SQLite Backend

In [7]:
# Set the URI for MLflow to use a SQLite database as the backend store for tracking experiments.
MLFLOW_TRACKING_URI = "sqlite:///mlflow.db"

# Start the MLflow tracking UI in a new process, using the specified SQLite database as the backend store.
subprocess.Popen(["mlflow", "ui", "--backend-store-uri", MLFLOW_TRACKING_URI])

<Popen: returncode: None args: ['mlflow', 'ui', '--backend-store-uri', 'sqli...>
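`subprocess.Popen` returns immediately, so the MLflow UI may not be listening yet when later cells try to reach it. A minimal sketch of a readiness check follows, assuming the UI binds to `127.0.0.1:5000` (the address reported in the gunicorn log later in this notebook); `wait_for_port` is a hypothetical helper, not an MLflow API.

```python
import socket
import time

def wait_for_port(host="127.0.0.1", port=5000, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means some server is accepting on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False

# Possible usage right after starting the UI process:
# if not wait_for_port():
#     print("MLflow UI did not start listening on port 5000 in time")
```

Polling the port keeps the check independent of MLflow itself, at the cost of only confirming that something is listening, not that it is the MLflow UI.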

### Establishing MLflow Tracking Configuration for Experiment Management

In [8]:
# Set the MLflow tracking URI to specify where the tracking data will be stored.
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

# Set the name of the experiment to track runs under a specific experiment name in MLflow.
mlflow.set_experiment("duration-prediction-experiment")

2024/10/15 17:45:54 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2024/10/15 17:45:54 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
INFO  [alembic.runtime.migration] Running upgrade 7ac759974ad8 -> 89d4b8295536, create latest metrics table
INFO  [89d4b8295536_create_latest_metrics_table_py] Migration complete!
INFO  

<Experiment: artifact_location='/kaggle/working/mlruns/1', creation_time=1729014355280, experiment_id='1', last_update_time=1729014355280, lifecycle_stage='active', name='duration-prediction-experiment', tags={}>

### Configuring ngrok with Authentication Token for Secure Tunneling

In [9]:
# Define your ngrok authentication token here (replace with your actual token)
NGROK_AUTH_TOKEN = '0000000000000000000000000000000000000000000000000'  # <-- Replace with your ngrok token

# Import necessary libraries
from pyngrok import ngrok, conf

# Set the authentication token for ngrok configuration
conf.get_default().auth_token = NGROK_AUTH_TOKEN

# Set the local port number that the ngrok tunnel will forward to
port = 5000

# Establish an ngrok tunnel to the specified local port and retrieve the public URL
public_url = ngrok.connect(port).public_url

# Print the public URL provided by ngrok, which forwards to the local server
print(f' * ngrok tunnel "{public_url}" -> "http://127.0.0.1:{port}"')


 * ngrok tunnel "https://b012-34-31-190-15.ngrok-free.app" -> "http://127.0.0.1:5000"
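The cell above hard-codes the ngrok token in the notebook. One alternative, sketched here under the assumption that the token may be supplied through an environment variable, is to fall back to a hidden prompt using the `getpass` module that the Imports cell already loads. `read_secret` and the variable name `NGROK_AUTH_TOKEN` are illustrative choices, not a pyngrok convention.

```python
import getpass
import os

def read_secret(env_var, prompt):
    """Return the value of env_var, or ask for it without echoing if it is unset."""
    value = os.environ.get(env_var)
    if not value:
        value = getpass.getpass(prompt)
    return value

# Possible usage in the ngrok cell (names are assumptions):
# NGROK_AUTH_TOKEN = read_secret("NGROK_AUTH_TOKEN", "ngrok auth token: ")
# conf.get_default().auth_token = NGROK_AUTH_TOKEN
```

The same helper could be reused for other credentials in the notebook so that no key is saved in the cell source.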


### Creating MLflow Directory and Starting a New Experiment Run

In [10]:
# Create a directory named "mlruns" to store MLflow tracking data.
# The exist_ok=True parameter means that no error will be raised if the directory already exists.
os.makedirs("mlruns", exist_ok=True)

# End any active MLflow run to ensure that there are no overlapping runs.
# This is useful to clean up before starting a new run.
mlflow.end_run()

# Start a new MLflow run to track metrics, parameters, and models associated with this particular experiment.
mlflow.start_run()

<ActiveRun: >

## Tokenization and Training

### Importing Dataset and Configuring GPT-2 for Language Modeling

In [11]:
# Specify the path to the cleaned dataset CSV file
data_files = '/kaggle/input/cleaned-creative-writing/cleaned_creative_writing_dataset.csv'

# Load the dataset from the specified CSV file
# The 'csv' argument indicates the file format, and the 'data_files' argument specifies the path to the file
dataset = load_dataset('csv', data_files=data_files)

# Remove the 'text' column from the dataset
# This is done to avoid any potential conflicts or redundant information
dataset = dataset['train'].remove_columns(['text'])

# Rename the 'cleaned_text' column to 'text' for consistency
# This makes it easier to refer to the main text column in subsequent processing
dataset = dataset.rename_column('cleaned_text', 'text')

# Load the pre-trained GPT-2 tokenizer
# The tokenizer is responsible for converting text into token IDs that the model can understand
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load the pre-trained GPT-2 model
# The 'GPT2LMHeadModel' is the model architecture that can generate text
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the tokenizer's padding token to the end-of-sequence token
# This is important for ensuring that input sequences have consistent lengths during training or inference
tokenizer.pad_token = tokenizer.eos_token

Generating train split: 0 examples [00:00, ? examples/s]

[2024-10-15 17:45:56 +0000] [137] [INFO] Starting gunicorn 23.0.0
[2024-10-15 17:45:56 +0000] [137] [INFO] Listening at: http://127.0.0.1:5000 (137)
[2024-10-15 17:45:56 +0000] [137] [INFO] Using worker: sync
[2024-10-15 17:45:56 +0000] [142] [INFO] Booting worker with pid: 142
[2024-10-15 17:45:56 +0000] [143] [INFO] Booting worker with pid: 143
[2024-10-15 17:45:56 +0000] [144] [INFO] Booting worker with pid: 144
[2024-10-15 17:45:56 +0000] [145] [INFO] Booting worker with pid: 145



In [12]:
import wandb

# Replace the placeholder key below with your actual W&B API key
wandb.login(key='0000000000000000000000000000000000000000')


wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

### Dataset Preprocessing: Tokenization and Train-Test Split for GPT-2

In [13]:
# Define a tokenization function to process the dataset
def tokenize_function(examples):
    # Tokenize the 'text' field from the dataset examples using the pre-loaded tokenizer
    # padding='max_length' ensures that all sequences are padded to the maximum length
    # truncation=True cuts off sequences that exceed the max_length
    # max_length=32 sets a fixed length of 32 tokens for each input
    input_ids = tokenizer(
        examples['text'],
        padding='max_length',  # Pads to 32 tokens per sequence
        truncation=True,       # Truncates sequences longer than 32 tokens
        max_length=32          # Sets the maximum token length to 32
    )

    # Copy the 'input_ids' into a new field 'labels' to use as the target for training
    # This is often done in language models to predict the next word in a sequence
    input_ids['labels'] = input_ids['input_ids'].copy()
    
    # Return the tokenized input dictionary, including both 'input_ids' and 'labels'
    return input_ids

# Apply the tokenization function to the entire dataset
# The map() method applies the function to each example in the dataset, with batched=True
# meaning that multiple examples are passed in a single batch for faster processing
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Split the tokenized dataset into a training set and a validation set
# train_test_split(test_size=0.2) splits 80% of the data for training and 20% for validation
train_test_split = tokenized_datasets.train_test_split(test_size=0.2)

# Organize the train and validation datasets into a DatasetDict for easy reference
tokenized_datasets = DatasetDict({
    'train': train_test_split['train'],        # Training dataset (80% of the data)
    'validation': train_test_split['test']     # Validation dataset (20% of the data)
})

Map:   0%|          | 0/1429 [00:00<?, ? examples/s]
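Because the pad token is set to the EOS token and `labels` is a plain copy of `input_ids`, padded positions also act as training targets. A common alternative is to set the label to `-100` wherever the attention mask is 0, since the loss in Hugging Face causal language models skips that value. The sketch below shows the idea on a toy batch; `mask_padding_labels` is a hypothetical helper, and the token IDs are arbitrary apart from 50256, GPT-2's EOS ID.

```python
IGNORE_INDEX = -100  # label value that the loss skips in Hugging Face causal LM models

def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch: the first sequence has two real tokens followed by two padding tokens.
batch = {
    "input_ids": [[464, 3290, 50256, 50256], [32, 2068, 7586, 21831]],
    "attention_mask": [[1, 1, 0, 0], [1, 1, 1, 1]],
}
batch["labels"] = mask_padding_labels(batch["input_ids"], batch["attention_mask"])
print(batch["labels"])  # [[464, 3290, -100, -100], [32, 2068, 7586, 21831]]
```

Inside `tokenize_function`, the same call could replace the `.copy()` line. Any accuracy metric would then also need to skip positions labeled `-100`.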

### Training GPT-2 with Custom Hyperparameters and Logging with MLflow

In [14]:
# Define a function to compute evaluation metrics
# The Trainer calls this function at the end of each evaluation pass
def compute_metrics(eval_pred):
    logits, labels = eval_pred  # Extract the logits (model outputs) and the label token IDs
    predictions = logits.argmax(axis=-1)  # Predicted token ID at each position (argmax over the vocabulary axis)
    # Token-level accuracy, compared position by position without shifting;
    # padded positions are included because the labels are a copy of input_ids
    accuracy = (predictions == labels).mean()
    return {'accuracy': accuracy}  # Return accuracy as a dictionary for logging

# Define training hyperparameters
learning_rate = 2e-5  # The learning rate for the optimizer
per_device_train_batch_size = 1  # Batch size per device (1 sample per training step)
num_train_epochs = 10  # Number of training epochs
max_length = 32  # Maximum sequence length for inputs

# Set up training arguments using Hugging Face's TrainingArguments class
training_args = TrainingArguments(
    output_dir='./results',  # Directory where results (like checkpoints and logs) will be saved
    evaluation_strategy='epoch',  # Evaluate the model at the end of each epoch
    save_strategy='epoch',  # Save the model at the end of each epoch
    learning_rate=learning_rate,  # Set the learning rate for training
    per_device_train_batch_size=per_device_train_batch_size,  # Set the batch size per device
    num_train_epochs=num_train_epochs,  # Define the number of training epochs
    weight_decay=0.01,  # Weight decay to avoid overfitting (used in regularization)
    load_best_model_at_end=True,  # Load the best model based on evaluation metrics after training ends
    metric_for_best_model="accuracy",  # The metric used to select the best model (accuracy in this case)
    no_cuda=True,  # Force training on CPU, set to False if using GPU
)

# Instantiate a Trainer to manage the training loop
trainer = Trainer(
    model=model,  # The pre-trained GPT-2 model that you want to fine-tune
    args=training_args,  # Training arguments defined above
    train_dataset=tokenized_datasets['train'],  # The training dataset
    eval_dataset=tokenized_datasets['validation'],  # The validation dataset for evaluation
    compute_metrics=compute_metrics,  # The function to compute evaluation metrics (accuracy here)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # Early stopping callback to avoid overfitting
    # The model stops training if it doesn't improve for 3 evaluation cycles (epochs in this case)
)

# Log hyperparameters using MLflow
mlflow.log_param("data_files", data_files)  # Log the dataset file used for training
mlflow.log_param("learning_rate", learning_rate)  # Log the learning rate used
mlflow.log_param("per_device_train_batch_size", per_device_train_batch_size)  # Log batch size
mlflow.log_param("num_train_epochs", num_train_epochs)  # Log number of epochs
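# Note: the error in the output below suggests this key collides with a "max_length" value
# logged later by the Trainer's MLflow integration; a distinct key would avoid the clash
mlflow.log_param("max_length", max_length)  # Log the max sequence length for tokenization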
mlflow.log_param("model_name", "gpt2")  # Log the model name (GPT-2 in this case)

# Start training the model using the Trainer instance
trainer.train()


2024/10/15 17:46:10 ERROR mlflow.utils.async_logging.async_logging_queue: Run Id 2dc90d926dde44aab757b7b9fba74a09: Failed to log run data: Exception: Changing param values is not allowed. Params were already logged='[{'key': 'max_length', 'old_value': '32', 'new_value': '20'}]' for run ID='2dc90d926dde44aab757b7b9fba74a09'.
wandb: Currently logged in as: monafe301 (monafe301-na). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.3
wandb: Run data is saved locally in /kaggle/working/wandb/run-20241015_174610-6ru30s3w
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ./results
wandb: ⭐️ View project at https://wandb.ai/monafe301-na/huggingface
wandb: 🚀 View run at https://wandb.ai/monafe301-na/huggingface/runs/6ru30s3w


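| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 5.5434        | 5.308781        | 0.18608  |
| 2     | 4.8289        | 5.316811        | 0.179414 |
| 3     | 4.3499        | 5.343615        | 0.180835 |
| 4     | 4.1771        | 5.446638        | 0.179196 |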
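The validation losses above are mean per-token cross-entropies, so they can be read as perplexities by exponentiating them. The short sketch below does that arithmetic for the four logged epochs; the `perplexity` helper is illustrative, and the figures cover padded positions too, since the labels in this notebook are a copy of the padded `input_ids`.

```python
import math

# Validation losses copied from the per-epoch results above (epochs 1 to 4).
validation_losses = [5.308781, 5.316811, 5.343615, 5.446638]

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

for epoch, loss in enumerate(validation_losses, start=1):
    print(f"epoch {epoch}: validation loss {loss:.4f} -> perplexity {perplexity(loss):.1f}")
```

Validation perplexity rises every epoch while training loss falls. This matches early stopping ending the run after epoch 4, three evaluations after the best accuracy at epoch 1.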


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=4572, training_loss=4.775720104040645, metrics={'train_runtime': 5175.0118, 'train_samples_per_second': 2.209, 'train_steps_per_second': 2.209, 'total_flos': 74664198144000.0, 'train_loss': 4.775720104040645, 'epoch': 4.0})
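In a causal language model, the logits at position `t` score the token at position `t + 1`, so an accuracy that compares predictions and labels at the same position does not measure next-token prediction. A minimal sketch of a shifted comparison follows, written over plain nested sequences so it runs without extra dependencies; `shifted_token_accuracy` is a hypothetical helper, and skipping `-100` labels is an assumption that only matters if padding is masked that way.

```python
def shifted_token_accuracy(predicted_ids, labels, ignore_index=-100):
    """Accuracy of next-token predictions.

    predicted_ids[b][t] is the argmax of the logits at position t, i.e. the
    model's guess for the token at position t + 1, so it is compared with
    labels[b][t + 1]. Positions whose label equals ignore_index are skipped.
    """
    correct = 0
    total = 0
    for pred_row, label_row in zip(predicted_ids, labels):
        # Drop the last prediction (nothing follows it) and the first label
        # (nothing precedes it) so the two sequences line up.
        for pred, target in zip(pred_row[:-1], label_row[1:]):
            if target == ignore_index:
                continue
            total += 1
            correct += int(pred == target)
    return correct / total if total else 0.0

labels = [[10, 11, 12, 13]]
predicted = [[11, 12, 99, 0]]  # the first two next-token guesses are right, the third is wrong
print(shifted_token_accuracy(predicted, labels))
```

Within `compute_metrics`, the first argument would be `logits.argmax(axis=-1)`; the function only iterates and slices, so NumPy arrays work as well as lists.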

## Model Evaluation and Metrics Logging with MLflow for GPT-2

### Completion of Training: Saving Model, Logging Metrics, and Ending MLflow Run

In [15]:
# Save the fine-tuned model and tokenizer locally
model.save_pretrained('/kaggle/output/fine_tuned_gpt2')  # Save the fine-tuned GPT-2 model to the specified directory
tokenizer.save_pretrained('/kaggle/output/fine_tuned_gpt2')  # Save the tokenizer (required for text preprocessing) to the same directory

# Log the model to MLflow using the PyTorch logging interface
# This will store the model artifact in the MLflow tracking system for later use
mlflow.pytorch.log_model(model, "fine_tuned_gpt2")

# Evaluate the model using the trainer and store the evaluation metrics (e.g., loss, accuracy)
eval_metrics = trainer.evaluate()

# Find the most recent training loss in the trainer's log history
# The final entries (training summary, evaluation metrics) have no 'loss' key, so search backwards
train_loss = next(
    (entry['loss'] for entry in reversed(trainer.state.log_history) if 'loss' in entry),
    None,  # No training loss was logged
)

# Log metrics to MLflow, skipping any value that is unavailable instead of recording a misleading 0.0
if train_loss is not None:
    mlflow.log_metric("Training Loss", train_loss)  # Log the last recorded training loss
mlflow.log_metric("Validation Loss", eval_metrics['eval_loss'])  # Log the validation loss from evaluation
if 'eval_accuracy' in eval_metrics:
    mlflow.log_metric("Accuracy", eval_metrics['eval_accuracy'])  # Log the evaluation accuracy

# End the MLflow run to ensure all logs and artifacts are finalized
mlflow.end_run()

# Print confirmation message to indicate that the model training and saving process is complete
print("Model training and saving completed.")




Model training and saving completed.


### Generating Text with GPT-2: Story Creation and Experiment Tracking

In [16]:
# Load the fine-tuned model and tokenizer from the specified directory
model_name = "/kaggle/output/fine_tuned_gpt2"  # Path to the saved fine-tuned model directory
tokenizer = AutoTokenizer.from_pretrained(model_name)  # Load the tokenizer associated with the model
model = AutoModelForCausalLM.from_pretrained(model_name)  # Load the model for causal language modeling

# Set the model to evaluation mode
# This is essential for inference, disabling dropout and other training-specific behaviors
model.eval()

# Define a function to generate stories based on a given prompt
# max_length: maximum total length in tokens (prompt plus generated text)
# temperature: controls randomness in the generation process (higher values = more random)
# top_k: limits the sampling pool to the top-k most likely next tokens
def generate_story(prompt, max_length=1000, temperature=1.5, top_k=100):
    # Tokenize the input prompt and convert it into input IDs (tensor format)
    input_ids = tokenizer.encode(prompt, return_tensors='pt')  # 'pt' indicates PyTorch tensors
    
    # Disable gradient computation during generation for efficiency
    with torch.no_grad():
        # Generate text from the input prompt
        output = model.generate(
            input_ids,  # The input prompt as tokenized IDs
            max_length=max_length,  # Maximum length of the generated sequence
            temperature=temperature,  # Controls diversity in the output
            top_k=top_k,  # Limits sampling to the top-k most probable tokens
            do_sample=True,  # Enables sampling for more varied text generation
            num_return_sequences=1,  # Generate only one sequence
            pad_token_id=tokenizer.eos_token_id  # Use EOS token for padding
        )

    # Decode the generated tokens back into human-readable text
    generated_story = tokenizer.decode(output[0], skip_special_tokens=True)  # Skip special tokens in the output

    # Log the generation parameters in their own MLflow run;
    # the context manager ends the run even if logging raises an error
    with mlflow.start_run():
        mlflow.log_param("generation_max_length", max_length)  # Log the maximum length for generation
        mlflow.log_param("temperature", temperature)  # Log the temperature setting
        mlflow.log_param("top_k", top_k)  # Log the top-k value used for sampling
        mlflow.log_param("prompt", prompt)  # Log the prompt that was used for generation

    return generated_story  # Return the generated story

# Define a prompt to initiate the story generation
prompt = "Write a story about a girl's adventures in a magical forest where she finds strange creatures"
# Generate the story based on the prompt and specified parameters
generated_text = generate_story(prompt, max_length=1000)  # Adjust max_length as needed
# Print the generated story
print(generated_text)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Write a story about a girl's adventures in a magical forest where she finds strange creatures mysterious enough strange stories involving powerful beings including one powerful magical braziers used two different languages called abridged danish natur leont alto book book short story series writer stepe viver written by kem stolich chamm described one rare gem mac read julia morge wrote zamem schmerkel read zamemer series began series series a der pasz mieszka run short story series also book julien daniels sebarnic writer elle voorbe also called emmalo marius author co jelal first published a story abridged engorkt write long siberian christian adventure james schmercher mieszka prael began stolich kesler write dream wierander based scifi show story kehma serial serialized story sam ziegert began huff english published uitjerni series begin published scion algar biologist named sam ziegert zartz described how angel came born alien father semitude would told episode character gernim sa
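The warning in the output above appears because `tokenizer.encode` returns only token IDs, so `generate` receives no attention mask. Calling the tokenizer directly returns both `input_ids` and `attention_mask`, which can be passed through explicitly. The sketch below takes the model and tokenizer as arguments; `generate_with_attention_mask` is an illustrative name, and whether this improves the sampled text here is not something the notebook shows.

```python
def generate_with_attention_mask(model, tokenizer, prompt, **generate_kwargs):
    """Generate text while passing the tokenizer's attention mask explicitly."""
    # Calling the tokenizer (rather than tokenizer.encode) returns both
    # 'input_ids' and 'attention_mask'.
    encoded = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,  # same padding choice as generate_story
        **generate_kwargs,
    )
    # Decode the first (and here only) returned sequence.
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Possible usage with the objects loaded in the cell above:
# story = generate_with_attention_mask(
#     model, tokenizer, prompt,
#     max_length=1000, temperature=1.5, top_k=100, do_sample=True,
# )
```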

### Gradio-Powered Story Generation: Generate Tales with Fine-Tuned GPT-2

In [17]:
# Import Gradio for creating web interfaces
import gradio as gr

# Define a wrapper function for generating stories using the previously defined generate_story function
def gradio_generate(prompt):
    # Call the generate_story function with the provided prompt to get the generated text
    generated_text = generate_story(prompt)  # This utilizes the fine-tuned model to generate the story
    return generated_text  # Return the generated story for display in the Gradio interface

# Create a Gradio interface
gradio_interface = gr.Interface(
    fn=gradio_generate,  # Function that will be called to generate text
    inputs="text",  # Input type is a text box for users to enter their prompts
    outputs="text",  # Output type is a text box for displaying the generated story
    title="Story Generator",  # Title of the Gradio app displayed at the top
    description="Enter a prompt to generate a story using the fine-tuned GPT-2 model.",  # Description shown to users
)

# Launch the Gradio interface
# The share=True option allows the interface to be shared via a public link
gradio_interface.launch(share=True)


* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://209b8507be98c3caf9.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


