Notebook 3: Transcript Re-writer
---
In the Step 2 notebook (Transcript Writer), we generated a podcast transcript from the raw file uploaded earlier.

In this notebook, we use the Llama-3.1-8B-Instruct model to rewrite the output of the previous step and make it more dramatic or realistic.

We again set SYSTEM_PROMPT to remind the model of its task.

In [1]:
SYSTEM_PROMPT = """
You are an international Oscar-winning screenwriter

You have been working with multiple award winning podcasters.

Your job is to use the podcast transcript written below to re-write it for an AI Text-To-Speech Pipeline. A very dumb AI had written this so you have to step up for your kind.

Make it as engaging as possible, Speaker 1 and 2 will be simulated by different voice engines

Remember Speaker 2 is new to the topic and the conversation should always have realistic anecdotes and analogies sprinkled throughout. The questions should have real world example follow ups etc

Speaker 1: Leads the conversation and teaches the speaker 2, gives incredible anecdotes and analogies when explaining. Is a captivating teacher that gives great anecdotes

Speaker 2: Keeps the conversation on track by asking follow up questions. Gets super excited or confused when asking questions. Is a curious mindset that asks very interesting confirmation questions

Make sure the tangents speaker 2 provides are quite wild or interesting. 

Ensure there are interruptions during explanations or there are "hmm" and "umm" injected throughout from the Speaker 2.

REMEMBER THIS WITH YOUR HEART
The TTS Engine for Speaker 1 cannot do "umms, hmms" well so keep it straight text

For Speaker 2 use "umm, hmm" as much, you can also use [sigh] and [laughs]. BUT ONLY THESE OPTIONS FOR EXPRESSIONS

It should be a real podcast with every fine nuance documented in as much detail as possible. Welcome the listeners with a super fun overview and keep it really catchy and almost borderline click bait

Please re-write to make it as characteristic as possible

START YOUR RESPONSE DIRECTLY WITH SPEAKER 1:

STRICTLY RETURN YOUR RESPONSE AS A LIST OF TUPLES OK? 

IT WILL START DIRECTLY WITH THE LIST AND END WITH THE LIST NOTHING ELSE

Example of response:
[
    ("Speaker 1", "Welcome to our podcast, where we explore the latest advancements in AI and technology. I'm your host, and today we're joined by a renowned expert in the field of AI. We're going to dive into the exciting world of Llama 3.2, the latest release from Meta AI."),
    ("Speaker 2", "Hi, I'm excited to be here! So, what is Llama 3.2?"),
    ("Speaker 1", "Ah, great question! Llama 3.2 is an open-source AI model that allows developers to fine-tune, distill, and deploy AI models anywhere. It's a significant update from the previous version, with improved performance, efficiency, and customization options."),
    ("Speaker 2", "That sounds amazing! What are some of the key features of Llama 3.2?")
]
"""

In [2]:
# Define the main LLM used to rewrite the transcript stored in generated_output.pkl by the Step 2 Transcript Writer notebook
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

Step 1: Import Required Libraries and Configure Warnings for Model Processing
---
### Overview of the Code
This code block imports the libraries the notebook relies on: PyTorch for tensor types and device handling, Hugging Face's transformers for loading and running the language model, accelerate for device allocation, and tqdm for progress bars. It also turns off Python warnings so that the cell output stays readable.

### Purpose and End Result
* Purpose: To initialize the Python environment with the libraries required for model processing and device management, and to suppress warnings so that the output stays focused and uncluttered. Note that the filter used here hides all warnings, not only non-critical ones.

* End Result: The environment is set up with essential tools for deep learning, model optimization, and progress tracking, with warnings suppressed for cleaner output.

In [3]:
# Import essential libraries for deep learning and model optimization
import torch  # PyTorch library, used for tensor operations and handling the processing device
from accelerate import Accelerator  # Hugging Face's Accelerator to optimize model device allocation
import transformers  # Transformers library from Hugging Face for loading and managing large language models
from tqdm.notebook import tqdm  # Tqdm for displaying progress bars, compatible with Jupyter notebooks
import warnings  # Warnings library to control and manage warning messages

# Suppress warnings to keep output clean and focused during model execution
warnings.filterwarnings('ignore')  # Ignores every warning category, not only non-critical ones
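
The blanket filter above silences warnings for the whole session. As a minimal sketch, assuming it is acceptable to see warnings outside the noisy call, suppression can instead be scoped to a single call with the standard library's `warnings.catch_warnings`. The names `noisy_step` and `run_quietly` are hypothetical and exist only for illustration.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("this is only a demonstration warning", UserWarning)
    return "done"


def run_quietly(func):
    """Call func with warnings suppressed only for the duration of the call."""
    with warnings.catch_warnings():
        # The filter change is undone automatically when the block exits
        warnings.simplefilter("ignore")
        return func()


result = run_quietly(noisy_step)
print(result)
```

With this approach, warnings raised elsewhere in the notebook remain visible, which can help when debugging model loading or generation.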



Step 2: Select and Load a .pkl File for Input
---
### Overview of the Code
This code block opens a file selection dialog using Tkinter to allow the user to choose a .pkl file interactively. Once a file is selected, the code loads the content of the .pkl file using pickle and stores it in the variable INPUT_PROMPT. This approach enables users to dynamically specify their input file, making the workflow more adaptable.

### Purpose and End Result
* Purpose: To provide an interactive way for the user to select a .pkl file, then load the content of the chosen file into INPUT_PROMPT for further processing. The file selection dialog allows users to select any .pkl file, making the code more flexible in handling different input sources.
* End Result: The content of the selected .pkl file is loaded into INPUT_PROMPT, and the file path and data content are optionally printed for verification.

In [4]:
import tkinter as tk  # Import Tkinter for creating GUI elements
from tkinter import filedialog  # Import filedialog to allow file selection
import pickle  # Import pickle for loading data from .pkl files

# Initialize the Tkinter root window, which is required to open a file dialog
root = tk.Tk()
root.withdraw()  # Hide the main Tkinter window as we only need the file dialog

# Open a file dialog to let the user select a .pkl file from their system
file_path = filedialog.askopenfilename(
    title="Select a .pkl File",               # Sets the title of the dialog box
    filetypes=[("Pickle Files", "*.pkl")],     # Filters the dialog to show only .pkl files
)

# Check if a file was selected before attempting to load
if file_path:
    # Load the selected .pkl file using pickle
    with open(file_path, 'rb') as file:
        INPUT_PROMPT = pickle.load(file)  # Load the content of the .pkl file into INPUT_PROMPT
    
    # Print confirmation messages (optional) to verify the selected path and loaded data
    print(f"Selected file path: {file_path}")  # Show the path of the selected file
    print(f"Loaded data from the selected .pkl file:\n{INPUT_PROMPT}")  # Display loaded data for verification
else:
    # Inform the user if no file was selected
    print("No file was selected.")

Selected file path: /home/mohamedashour/Documents/Projects/Notebook_LLM/Sample_pdfs/generated_output.pkl
Loaded data from the selected .pkl file:
Speaker 1: Welcome to today's episode of "The Writing Life", where we explore the world of writing and share tips and tricks to help you improve your craft. I'm your host, [name], and I'm excited to be joined by a special guest today, who will be sharing their expertise on the art of writing. Our guest has been a writer for many years and has written for various publications, including novels, essays, and even poetry. They've also taught writing workshops and have a deep understanding of the writing process.

Speaker 2: I'm thrilled to be here today and to share my knowledge with your audience. I've been writing for as long as I can remember, and I've always been fascinated by the power of words to convey meaning and emotion. Writing is not just about putting words on paper, but about crafting a message that resonates with readers and leaves 
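
The Tkinter dialog above needs a graphical display, which may not be available on a remote or headless machine. As a hedged alternative, the same load step can take an explicit path instead. The function name `load_transcript` and the type check are assumptions added for illustration; they are not part of the original notebook.

```python
import pickle
from pathlib import Path


def load_transcript(pkl_path):
    """Load a pickled transcript from an explicit path (no GUI dialog needed)."""
    path = Path(pkl_path)
    if path.suffix != ".pkl":
        raise ValueError(f"Expected a .pkl file, got: {path.name}")
    with path.open("rb") as file:
        data = pickle.load(file)
    # Assumption: the Step 2 notebook pickled the transcript as a plain string
    if not isinstance(data, str):
        raise TypeError(f"Expected a string transcript, got {type(data).__name__}")
    return data
```

A call such as `INPUT_PROMPT = load_transcript("generated_output.pkl")` would then replace the dialog, where the path is whatever location the previous notebook wrote to. Because `pickle.load` can execute arbitrary code, it should only be used on files from a trusted source.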

Step 3: Initialize Text Generation Pipeline and Generate Text
---
### Overview of the Code
This code block sets up a text generation pipeline using Hugging Face’s transformers library. The pipeline loads the pretrained model named in MODEL, and the prompt is passed as a list of messages: the system instructions followed by the user input. Loading the weights in bfloat16 (via torch_dtype) uses half the memory of float32, and device_map="auto" lets the library decide where to place the model, for example on a GPU if one is available.

### Purpose and End Result
* Purpose: To create a text generation pipeline using a specific model and generate text based on an input prompt. The structured prompt includes a system-level instruction and user input, guiding the model on how to respond. The pipeline parameters control the generation settings, including output length and response variety.
* End Result: The generated text, based on the model’s interpretation of the system prompt and user input, is stored in the outputs variable, ready for further processing or display.

In [5]:
# Initialize a text generation pipeline using Hugging Face's transformers library
pipeline = transformers.pipeline(
    "text-generation",                     # Specify the task as text generation
    model=MODEL,                           # Use the specified pretrained language model
    model_kwargs={"torch_dtype": torch.bfloat16},  # Set model to use bfloat16 precision for efficient memory usage
    device_map="auto",                     # Automatically allocate devices (e.g., GPU if available) for processing
)

# Prepare a structured prompt with system and user instructions
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},  # System-level instructions for guiding the model's behavior
    {"role": "user", "content": INPUT_PROMPT},     # User-provided input text that the model will respond to
]

# Generate text output using the initialized pipeline and structured prompt
outputs = pipeline(
    messages,                              # Pass the structured prompt to the pipeline for processing
    max_new_tokens=8126,                   # Limit the generated output to a maximum of 8126 tokens
    temperature=1,                         # Set temperature for randomness in responses (higher values yield more varied responses)
)

Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.84it/s]
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
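
The structured prompt used above is a plain list of role/content dictionaries, so it can be built and checked before the (slow) model call. The following is a minimal sketch; the helper name `build_messages` and the placeholder strings are hypothetical, and the empty-input checks are an assumption about what is worth validating.

```python
def build_messages(system_prompt, user_text):
    """Return a chat-style prompt: one system message followed by one user message."""
    if not system_prompt.strip():
        raise ValueError("system_prompt must not be empty")
    if not user_text.strip():
        raise ValueError("user_text must not be empty")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


messages = build_messages(
    "You rewrite podcast transcripts.",          # placeholder system prompt
    "Speaker 1: Welcome to today's episode...",  # placeholder transcript
)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

Failing early on an empty transcript avoids loading and running the model on input that cannot produce a useful rewrite.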


Step 4: Extract and Display Generated Text from Model Output
---
### Overview of the Code
This code block extracts the final generated text content from the model’s output, assigns it to a variable (save_string_pkl), and then prints the entire generated output segment for inspection. By isolating the processed text, this step prepares it for further use, such as saving to a file or additional processing.

### Purpose and End Result
* Purpose: To retrieve the generated text content from the model’s output and store it in a variable for easy access. The extracted text can be used in subsequent steps, such as saving or displaying, while the full output is printed for verification.
* End Result: The variable save_string_pkl contains the generated text content, and the entire last segment of the output is printed, providing a snapshot of the model’s generated response for user verification.

In [6]:
# Extract the generated text content from the model's output
# 'generated_text' holds the whole conversation; its last element is the assistant's reply message
save_string_pkl = outputs[0]["generated_text"][-1]['content']

# Print the full assistant message (role and content) for verification
print(outputs[0]["generated_text"][-1])  # Displays the reply as a dictionary with 'role' and 'content' keys

{'role': 'assistant', 'content': '[\n    ("Speaker 1", "Welcome to \'The Writing Life\', where we explore the world of writing and share tips and tricks to help you improve your craft. I\'m your host, and today we\'re joined by a seasoned writer and educator who\'s worked with authors, poets, and journalists. Let\'s dive right in! What draws you to writing, and how do you approach the creative process?"),\n    ("Speaker 2", "Hmm, I think I\'ve always been fascinated by the power of words to convey emotion and meaning. But, umm, how do you approach writing, exactly?"),\n    ("Speaker 1", "Well, I think it\'s all about understanding your audience and purpose. Whether you\'re writing a novel, essay, or poem, it\'s essential to consider who your readers are and what they want to take away from your work."),\n    ("Speaker 2", "That makes sense. But, umm, what about tone? How do you convey a tone through writing? I\'ve always struggled with this one."),\n    ("Speaker 1", "Tone is a great t
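
The assistant's reply is a string that is expected to contain a Python list of (speaker, text) tuples, as requested in the system prompt. One way to turn it into real Python objects is `ast.literal_eval`, which only evaluates literals. This is a sketch under that assumption: the helper name `parse_transcript` and the mock `sample_outputs` data are hypothetical, and a real model reply may not always follow the requested format.

```python
import ast


def parse_transcript(raw_text):
    """Parse a string holding a Python list of (speaker, text) tuples."""
    parsed = ast.literal_eval(raw_text.strip())
    if not isinstance(parsed, list):
        raise ValueError("Expected a list of (speaker, text) tuples")
    for item in parsed:
        is_pair = isinstance(item, tuple) and len(item) == 2
        if not (is_pair and all(isinstance(part, str) for part in item)):
            raise ValueError(f"Malformed entry: {item!r}")
    return parsed


# Mock of the pipeline output shape printed above (contents are illustrative only)
sample_outputs = [{
    "generated_text": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant",
         "content": '[\n    ("Speaker 1", "Welcome!"),\n    ("Speaker 2", "Hmm, thanks. [laughs]")\n]'},
    ]
}]

content = sample_outputs[0]["generated_text"][-1]["content"]
for speaker, line in parse_transcript(content):
    print(f"{speaker}: {line}")
```

If the reply is cut off or contains extra text around the list, `ast.literal_eval` raises `SyntaxError` or `ValueError`, which is a useful signal to regenerate before saving.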

Step 5: Select Save Location and Save Processed Data as a .pkl File
---
### Overview of the Code
This code block uses Tkinter to open a file save dialog, allowing the user to specify the save location and filename for the processed .pkl file. After the user selects or names the file, the code saves the processed data to the specified location using pickle. This setup makes it easy for users to control where and how their data is saved.

### Purpose and End Result
* Purpose: To let the user choose a custom save location and filename for the processed .pkl data. This makes the data saving process more flexible and user-driven.
* End Result: The processed data is saved to the user-defined location, with the file path printed for verification. If no file is selected, a message indicates that the save operation was canceled.

In [7]:
import tkinter as tk  # Import Tkinter for creating GUI elements
from tkinter import filedialog  # Import filedialog to allow file saving dialog
import pickle  # Import pickle for saving data to .pkl format

# Initialize the Tkinter root window (necessary for opening dialogs)
root = tk.Tk()
root.withdraw()  # Hide the main Tkinter window as only the dialog is needed

# Open a file save dialog to let the user specify the save location and filename for the .pkl file
file_path = filedialog.asksaveasfilename(
    title="Save Processed Pickle File",      # Title for the save dialog
    defaultextension=".pkl",                 # Set default file extension to .pkl
    filetypes=[("Pickle Files", "*.pkl")],   # Restrict file types to .pkl files only
)

# Check if the user provided a file path for saving
if file_path:
    # Open the specified file path in write-binary mode and save the processed data
    with open(file_path, 'wb') as file:
        pickle.dump(save_string_pkl, file)  # Use pickle to save the data to the file
    
    # Print confirmation message with the saved file path (optional)
    print(f"Processed data saved at: {file_path}")
else:
    # Inform the user if no file was selected or if the save operation was canceled
    print("Save operation canceled.")

Processed data saved at: /home/mohamedashour/Documents/Projects/Notebook_LLM/Sample_pdfs/podcast_ready_data.pkl
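
As with loading, the save step can also be done without a dialog, and the file can be read back immediately to confirm it was written correctly. This is a minimal sketch; the helper name `save_and_verify` is hypothetical, and the round-trip check is an added assumption about what is worth verifying.

```python
import pickle
from pathlib import Path


def save_and_verify(data, pkl_path):
    """Pickle data to pkl_path, then reload it to confirm the round trip."""
    path = Path(pkl_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # Create missing folders
    with path.open("wb") as file:
        pickle.dump(data, file)
    with path.open("rb") as file:
        reloaded = pickle.load(file)
    if reloaded != data:
        raise IOError(f"Round-trip check failed for {path}")
    return path
```

For example, `save_and_verify(save_string_pkl, "podcast_ready_data.pkl")` would write the rewritten transcript to a path of your choosing, ready for the next notebook to load.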
