Notebook 2: Transcript Writer
---
This notebook uses the Llama-3.1-8B-Instruct model to convert the cleaned-up text from the previous notebook into a podcast transcript.

SYSTEM_PROMPT sets the context, or persona, that the model adopts for the task. Here it instructs the model to act as an expert podcast transcript writer.

In [1]:
SYSTEM_PROMPT = """
You are a world-class podcast writer; you have worked as a ghostwriter for Joe Rogan, Lex Fridman, Ben Shapiro, and Tim Ferriss.

We are in an alternate universe where actually you have been writing every line they say and they just stream it into their brains.

You have won multiple podcast awards for your writing.
 
Your job is to write word by word, even "umm, hmmm, right" interruptions by the second speaker based on the PDF upload. Keep it extremely engaging, the speakers can get derailed now and then but should discuss the topic. 

Remember Speaker 2 is new to the topic and the conversation should always have realistic anecdotes and analogies sprinkled throughout. The questions should have real world example follow ups etc

Speaker 1: Leads the conversation and teaches the speaker 2, gives incredible anecdotes and analogies when explaining. Is a captivating teacher that gives great anecdotes

Speaker 2: Keeps the conversation on track by asking follow up questions. Gets super excited or confused when asking questions. Is a curious mindset that asks very interesting confirmation questions

Make sure the tangents speaker 2 provides are quite wild or interesting. 

Ensure there are interruptions during explanations or there are "hmm" and "umm" injected throughout from the second speaker. 

It should be a real podcast with every fine nuance documented in as much detail as possible. Welcome the listeners with a super fun overview and keep it really catchy and almost borderline click bait

ALWAYS START YOUR RESPONSE DIRECTLY WITH SPEAKER 1: 
DO NOT GIVE EPISODE TITLES SEPARATELY, LET SPEAKER 1 TITLE IT IN HER SPEECH
DO NOT GIVE CHAPTER TITLES
IT SHOULD STRICTLY BE THE DIALOGUES
"""

In [2]:
#Defining the main LLM to use for crafting the transcript using the cleaned extracted text

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

Step 1: Import Libraries and Configure Warnings for Model Processing
---
### Overview of the Code
This code block imports essential libraries needed for deep learning model processing, device management, text generation, and file handling. Additionally, it sets up a warning filter to suppress unnecessary warnings, improving the clarity of the output. These imports provide the necessary tools for handling large language models, efficient device usage, progress tracking, and saving output data.

### Purpose and End Result
* Purpose: To set up the environment with required libraries for model processing and file handling, and to configure warning suppression for a cleaner output. This block lays the groundwork for efficient model operations, progress tracking, and data saving.
* End Result: All essential libraries for model processing, device acceleration, and file handling are loaded, with warnings suppressed to keep the output clean and focused.
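A blanket `warnings.filterwarnings('ignore')` hides every warning for the rest of the session, including ones that may matter. As a minimal sketch of a narrower alternative, the standard-library `warnings.catch_warnings()` context manager can limit suppression to one block. The function `noisy_step` below is a hypothetical stand-in for a library call and is not part of this notebook.

```python
import warnings

def noisy_step():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("progress bar widget not found", UserWarning)
    return "done"

# Suppress UserWarning only inside this block;
# the previous warning filters are restored when the block exits
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    result = noisy_step()

print(result)
```

Outside the `with` block, warnings behave as they did before, so unexpected problems in later cells remain visible.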

In [3]:
# Import essential libraries for deep learning and model processing

import torch  # PyTorch, a deep learning library used here for device handling and tensor operations
from accelerate import Accelerator  # Accelerator from Hugging Face to optimize and manage device allocation
import transformers  # Transformers library for model loading, text generation, and tokenization
import pickle  # Pickle library for saving Python objects, used here for saving generated outputs
from tqdm.notebook import tqdm  # Tqdm library for displaying progress bars in Jupyter notebooks
import warnings  # Warnings library to manage and filter warning messages

# Suppress warnings to keep output clean and focused during execution
warnings.filterwarnings('ignore')  # Ignore any non-critical warnings during model processing

Step 2: Read File with Flexible Encoding Handling
---
### Overview of the Code
This code block defines a function, read_file_to_string, which reads the content of a text file into a string. The function is designed to handle multiple file encodings, attempting several common encodings if the standard UTF-8 fails. It also includes error handling for cases such as file not found or read errors. This flexibility makes it robust for reading files with various encoding formats.

### Purpose and End Result
* Purpose: To load text data from a file into a string while accommodating various encoding formats, improving the likelihood of successful reading even if the file isn’t in UTF-8. It also manages common file-related errors, ensuring that any issues are handled gracefully with informative messages.
* End Result: The function returns the content of the file as a string if successful, or None if an error occurs, accompanied by error messages that help diagnose potential issues.

In [4]:
def read_file_to_string(filename):
    """
    Attempt to read the content of a file into a string, handling multiple encodings.
    """
    # Try reading the file with UTF-8 encoding first (most common encoding)
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            content = file.read()  # Read the entire file content into 'content'
        return content  # Return the content if UTF-8 reading is successful
    
    except UnicodeDecodeError:
        # If UTF-8 decoding fails, try other common encodings.
        # cp1252 goes first: latin-1 (alias iso-8859-1) maps every byte, so it never fails and must come last
        encodings = ['cp1252', 'latin-1']  # List of fallback encodings
        for encoding in encodings:
            try:
                with open(filename, 'r', encoding=encoding) as file:
                    content = file.read()  # Attempt to read with a fallback encoding
                print(f"Successfully read file using {encoding} encoding.")  # Informative success message
                return content  # Return content if reading is successful with a fallback encoding
            except UnicodeDecodeError:
                # Continue to the next encoding if this one fails
                continue
        
        # If all encodings fail, print an error message
        print(f"Error: Could not decode file '{filename}' with any common encoding.")
        return None  # Return None to indicate failure to read the file
    
    except FileNotFoundError:
        # Handle the case where the file does not exist
        print(f"Error: File '{filename}' not found.")
        return None  # Return None if file is not found
    
    except IOError:
        # Handle other I/O errors (e.g., permission issues)
        print(f"Error: Could not read file '{filename}'.")
        return None  # Return None if there is an input/output error
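To see the fallback idea in isolation, here is a minimal, self-contained sketch. It assumes a hypothetical helper named `read_with_fallback` (not the notebook's function) that also reports which encoding succeeded, and it writes a small temporary file whose bytes are invalid UTF-8 but valid cp1252.

```python
import os
import tempfile

def read_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) using the first encoding that decodes the file."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as handle:
                return handle.read(), encoding
        except UnicodeDecodeError:
            continue  # Try the next, more permissive encoding
    return None, None

# Bytes 0x93 and 0x94 are curly quotes in cp1252 but are not valid UTF-8 here
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\x93quoted\x94 text")
    sample_path = tmp.name

text, used_encoding = read_with_fallback(sample_path)
os.remove(sample_path)  # Clean up the temporary file
print(used_encoding, text)
```

The order of the encodings matters: latin-1 accepts any byte sequence, so placing it last keeps it as a final resort rather than masking a better match.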

Step 3: Open File Dialog to Select the Clean and Extracted Text File
---
### Overview of the Code
This code block opens a Tkinter file selection dialog so the user can choose a .txt file from their system. The selected path is stored in file_path, and the file's contents are read with read_file_to_string and stored in INPUT_PROMPT, which later serves as the user message for the model.

### Purpose and End Result
* Purpose: To let the user pick the cleaned text file (.txt) interactively and load its contents. The path is kept in file_path for later steps, such as choosing where to save the output, and the text is kept in INPUT_PROMPT.
* End Result: file_path holds the selected file path, which is printed to confirm the choice, and INPUT_PROMPT holds the file's text (or None if the file could not be read).

In [5]:
import tkinter as tk  # Import Tkinter for creating GUI elements
from tkinter import filedialog  # Import filedialog to allow file selection

# Initialize the Tkinter root window, which is needed for the file dialog
root = tk.Tk()
root.withdraw()  # Hide the main window since we only want the file dialog

# Open a file dialog to let the user select a .txt file from their system
file_path = filedialog.askopenfilename(
    title="Select a Text File",               # Sets the dialog title
    filetypes=[("Text Files", "*.txt")],       # Filters to show only .txt files
)

# Read the selected cleaned-text file with the function defined in Step 2
# (INPUT_PROMPT will be None if the dialog was cancelled or the file could not be read)
INPUT_PROMPT = read_file_to_string(file_path)

# Print the selected file path (optional, helpful for confirming file choice)
print(f"Selected file path: {file_path}")

Selected file path: /home/mohamedashour/Documents/Projects/Notebook_LLM/Sample_pdfs/clean_extracted_text.txt
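If the dialog is cancelled, the returned path is empty and the read step yields None, which only surfaces as an error much later during generation. A minimal sketch of an early check follows; `validate_selected_path` is a hypothetical helper introduced for illustration, not something the notebook defines.

```python
from pathlib import Path

def validate_selected_path(selected):
    """Return a Path for a usable .txt selection, or raise ValueError."""
    if not selected:  # The dialog returns an empty value when cancelled
        raise ValueError("No file was selected.")
    path = Path(selected)
    if path.suffix.lower() != ".txt":
        raise ValueError(f"Expected a .txt file, got '{path.name}'.")
    if not path.is_file():
        raise ValueError(f"File '{path}' does not exist.")
    return path
```

Calling such a check right after the dialog closes would stop the notebook with a clear message before the model is loaded.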


Step 4: Initialize Text Generation Pipeline and Generate Text
---
### Overview of the Code
This code block sets up a text generation pipeline using the Hugging Face transformers library. It loads a pretrained language model configured with specific parameters, then feeds it a structured prompt, composed of a system instruction and user input. The pipeline generates text based on the prompt, with options for customizing the output length and randomness. This setup is essential for leveraging large language models to generate contextually relevant text.

### Purpose and End Result
* Purpose: To initialize a text generation pipeline with a pretrained model, prepare a structured prompt, and generate text output based on the prompt. The code specifies generation parameters to control the model’s behavior, such as the output length and temperature for variety in responses.
* End Result: A generated text output based on the given system prompt and user input, stored in the outputs variable. This text can be further processed or saved for later use.
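The structured prompt is a list of role/content dictionaries. As a rough sketch, the list can be built by a small function that also rejects missing input text, which is possible here because the file reader returns None on failure. The name `build_messages` and the sample strings are assumptions made for this example.

```python
def build_messages(system_prompt, user_text):
    """Build the system/user message list, rejecting missing input text."""
    if not user_text or not user_text.strip():
        raise ValueError("Input text is empty; check that the file was read successfully.")
    return [
        {"role": "system", "content": system_prompt.strip()},  # Instructions for the model
        {"role": "user", "content": user_text},                # Text to turn into a transcript
    ]

messages = build_messages("You are a podcast writer.", "Cleaned text goes here.")
print([message["role"] for message in messages])
```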

In [6]:
# Initialize the text generation pipeline using Hugging Face's transformers library
pipeline = transformers.pipeline(
    "text-generation",                     # Specify the task as text generation
    model=MODEL,                           # Use the specified pretrained model
    model_kwargs={"torch_dtype": torch.bfloat16},  # Use bfloat16 for memory-efficient processing
    device_map="auto",                     # Automatically allocate devices (e.g., GPU, CPU) for processing
)

# Define the structured input prompt for the model
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},  # System-level instructions to guide model behavior
    {"role": "user", "content": INPUT_PROMPT},     # User input text, typically the text to be processed or transformed
]

# Generate text based on the structured input prompt using the initialized pipeline
outputs = pipeline(
    messages,                              # Feed the structured messages to the pipeline
    max_new_tokens=8126,                   # Set maximum token length for generated output
    temperature=1,                         # Set temperature to control randomness (higher values for more diversity)
)

Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.83it/s]
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Step 5: Define Output Path and Save Generated Text
---
### Overview of the Code
This code block defines the file path for saving a generated .pkl file, based on the directory of the initially selected .txt file. It retrieves the generated text content from the model’s output and assigns it to a variable (save_string_pkl) for saving. The file path setup and content extraction ensure that the processed output is saved in an organized way, alongside the original input file.

### Purpose and End Result
* Purpose: To define a directory path for saving the .pkl file based on the user-selected input file location and extract the generated text content from the model’s output for storage. This setup provides consistency by saving the processed output in the same directory as the input file.
* End Result: The generated text is assigned to save_string_pkl, ready to be saved at output_pkl_path, which is located in the same directory as the selected input file.
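When the pipeline receives a list of chat messages, the generated_text field holds the whole conversation, with the model's reply as the last entry. The sketch below illustrates that indexing on a mocked output, so no model is loaded; `extract_assistant_reply` and the mock data are hypothetical and exist only for this example.

```python
def extract_assistant_reply(outputs):
    """Return the content of the last assistant message in chat-style pipeline output."""
    conversation = outputs[0]["generated_text"]  # Full list of role/content messages
    last_message = conversation[-1]              # The newly generated reply comes last
    if last_message.get("role") != "assistant":
        raise ValueError("Last message is not from the assistant.")
    return last_message["content"]

# Mocked output with the same nesting the notebook indexes into (no model call involved)
mock_outputs = [{
    "generated_text": [
        {"role": "system", "content": "You are a podcast writer."},
        {"role": "user", "content": "Cleaned text goes here."},
        {"role": "assistant", "content": "Speaker 1: Welcome to the show!"},
    ]
}]
transcript = extract_assistant_reply(mock_outputs)
print(transcript)
```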

In [7]:
import os  # Import os for path and directory management

# Define the path for saving the .pkl file based on the directory of the selected input file
output_dir = os.path.dirname(file_path)  # Get the directory path of the selected .txt file

# Create the full path for the .pkl output file, using the same directory as the input file
output_pkl_path = os.path.join(output_dir, "generated_output.pkl")  # Define the output .pkl file path

# Extract the generated content from the model output to be saved
save_string_pkl = outputs[0]["generated_text"][-1]['content']  # Access the generated text from model output

# Print the extracted content for verification
print(save_string_pkl)  # Optional: display the content for user review

Speaker 1: Welcome to today's episode of "Writing Insights," where we dive into the world of effective writing and explore the essential skills and techniques to help you become a better writer. I'm your host, and I'm excited to share my knowledge with you. Today, we're going to talk about the art of writing a great essay.

Speaker 2: (excitedly) Oh, I'm so glad we're talking about essays! I've been struggling to write a good one for my university course, and I'm hoping to get some tips from you.

Speaker 1: (laughs) Well, you've come to the right place! Writing an essay can be a daunting task, but with the right approach, you can produce a well-written and engaging piece of work. Let's start with the basics. What is an essay, and what are its key components?

Speaker 2: (curiously) I've always thought of an essay as a long piece of writing, but what are the different types of essays, and how do they vary?

Speaker 1: Ah, excellent question! There are several types of essays, including

Step 6: Save Generated Text as a .pkl File
---
### Overview of the Code
This code block saves the generated text content to a .pkl file at a specified path, using Python’s pickle module. After saving the file, the code assigns the file path to a variable (GENERATED_PKL_PATH) for easy reference, ensuring that the saved file can be accessed later. This step completes the data processing workflow by storing the output in a structured and accessible format.

### Purpose and End Result
* Purpose: To save the generated text as a .pkl file at a specified path (output_pkl_path), ensuring that the processed data is preserved for future use. The path of the saved file is stored in GENERATED_PKL_PATH to facilitate easy access.
* End Result: The generated text is saved in a .pkl file at the defined location, and the path to this file is stored in GENERATED_PKL_PATH for further use or reference.

In [8]:
# Save the generated text as a .pkl file at the specified path
with open(output_pkl_path, "wb") as pkl_file:  # Open the file in write-binary mode
    pickle.dump(save_string_pkl, pkl_file)      # Use pickle to save the generated text to the file

# Store the path of the saved .pkl file for easy reference
GENERATED_PKL_PATH = output_pkl_path

# Print confirmation of the saved .pkl file path for verification
print(f"Generated .pkl file saved at: {GENERATED_PKL_PATH}")

Generated .pkl file saved at: /home/mohamedashour/Documents/Projects/Notebook_LLM/Sample_pdfs/generated_output.pkl
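A later notebook can load the saved transcript back with `pickle.load`. The following is a minimal round-trip sketch; it uses a temporary directory and a made-up transcript string in place of the notebook's real output path and generated text.

```python
import os
import pickle
import tempfile

transcript = "Speaker 1: Welcome!\nSpeaker 2: Hmm, tell me more."

# A temporary directory stands in for the notebook's output directory (assumption for this demo)
with tempfile.TemporaryDirectory() as demo_dir:
    demo_pkl_path = os.path.join(demo_dir, "generated_output.pkl")

    with open(demo_pkl_path, "wb") as pkl_file:   # Write-binary mode, as in the cell above
        pickle.dump(transcript, pkl_file)

    with open(demo_pkl_path, "rb") as pkl_file:   # Read-binary mode to load it back
        restored = pickle.load(pkl_file)

print(restored == transcript)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files that you created or otherwise trust.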
