In [1]:
!python --version

Python 3.9.1


In [2]:
import os
import openai
import tiktoken

- Load the `.txt` file that contains your OpenAI API key.

In [3]:
with open('Data/Input/api-key.txt', 'r') as file:
    api_key = file.read().strip()  # drop a trailing newline, if any

os.environ["OPENAI_API_KEY"] = api_key
openai.api_key = os.getenv("OPENAI_API_KEY")
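Key files often end with a trailing newline, which a plain `file.read()` preserves. The following is a minimal sketch of a loader that strips surrounding whitespace and fails early on an empty file. The helper name `load_api_key` and the fake key are assumptions made for illustration only.

```python
import os
import tempfile


def load_api_key(path: str) -> str:
    """Read an API key from a text file, dropping surrounding whitespace."""
    with open(path, "r", encoding="utf-8") as file:
        key = file.read().strip()
    if not key:
        raise ValueError(f"No API key found in {path}")
    return key


# Demonstration with a throwaway file and a fake key (not a real credential).
with tempfile.TemporaryDirectory() as tmp:
    key_path = os.path.join(tmp, "api-key.txt")
    with open(key_path, "w", encoding="utf-8") as f:
        f.write("sk-fake-key\n")
    demo_key = load_api_key(key_path)

print(repr(demo_key))
```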

# Part II: Parent Classes

- Building on the first part, we now create the Python classes `OpenAI_Transcriber` and `OpenAI_Summarizer`, which use the speech-to-text (`whisper-1`) and chat completion (`gpt-3.5-turbo`) models, respectively.
- A class-based approach to the OpenAI APIs provides organization, modularity, and flexibility for media transcription and summarization, two tasks that go hand in hand in note-taking.
- In a typical note-taking routine, these classes are used consecutively.
- Later, these classes will serve as the parent classes of `OpenAI_NoteTaker`, which streamlines the whole note-taking routine.
- ChatGPT was used extensively to draft the docstrings for each function, making documentation much easier and more bearable.

## A. `OpenAI_Transcriber`

Recall that we have previously stated that `OpenAI_Transcriber` must have the following functions:
- Convert input video to audio, in order to save memory.
- Get filesize and duration of input file.
- Get estimated transcription price.
- Transcribe the audio using `openai.Audio.transcribe()`.
- Save transcription output as `.txt` file.
 
### Methods

In response, we have created the following functions, with ease-of-use and cost tracking in mind:
 
- `__init__`: Initializes an instance of the `OpenAI_Transcriber` class with the input file path, the model name, and the price per minute, as listed on [OpenAI's pricing website](https://openai.com/pricing). The `self.filetype` attribute is determined using the [`magic`](https://pypi.org/project/python-magic/) library, which identifies the file type from the file's contents rather than from its extension. This matters because the extension may be inaccurate or missing altogether, and the later processing steps depend on the file being identified correctly.

- `to_mp3()`: Converts the input file to MP3. If the detected MIME type is already `audio/mpeg`, the method prints a message and leaves the file untouched. Otherwise, it loads the file with Pydub's `AudioSegment.from_file()` and writes an MP3 with `export()`, either next to the input file or at `export_dir`. The conversion runs inside a try-except block, so any failure (for example, an unsupported format) is caught and printed rather than raised. After a successful conversion, `input_dir`, `audio_file`, and `filetype` all point to the new MP3 file.
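To illustrate why content-based detection is more reliable than the file extension, here is a small sketch that inspects the first bytes of a file. It is not how `python-magic` is implemented and covers only three signatures; the function name `sniff_audio_type` and the returned MIME strings are assumptions chosen for this example.

```python
import mimetypes
import os
import tempfile


def sniff_audio_type(path: str) -> str:
    """Guess a MIME type from the first bytes of a file (illustrative only)."""
    with open(path, "rb") as f:
        header = f.read(12)
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "audio/x-wav"
    if header[:3] == b"ID3":  # MP3 file that starts with an ID3v2 tag
        return "audio/mpeg"
    if header[:4] == b"fLaC":
        return "audio/flac"
    return "application/octet-stream"


# A WAV header saved under a misleading .mp3 name.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mislabeled.mp3")
    with open(path, "wb") as f:
        f.write(b"RIFF" + (36).to_bytes(4, "little") + b"WAVE")
    by_extension = mimetypes.guess_type(path)[0]  # looks at the name only
    by_content = sniff_audio_type(path)           # looks at the bytes

print("extension says:", by_extension)
print("content says:", by_content)
```

The extension-based guess follows the file name, while the content-based check recognizes the RIFF/WAVE signature regardless of what the file is called.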


To install both `pydub` and `magic` in this notebook, simply convert the raw-nbconvert cell below into a code block.

- Install `pydub` and `python-magic`.

- `get_filesize()`: Computes the size of the input file in megabytes, stores it in `self.input_filesize`, and prints it, so users can check the file against the upload limit before transcribing.

- `get_duration()`: Computes the duration of the input file in seconds, stores it in `self.duration`, and prints it. The duration is combined with the price per minute to estimate the transcription cost.

- `get_price()`: Calculates the estimated transcription cost from the duration of the input file and the price per minute charged by OpenAI, and stores it in `self.total_price`.

- `transcribe_audio()`: Calls `openai.Audio.transcribe()` on the input audio file with the specified model, and stores the response in `self.transcript` and its text in `self.transcript_text`.

- `save_txt()`: Saves the transcribed text to a `.txt` file for easy access and later editing.
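The cost estimate described above is simple arithmetic: seconds multiplied by the per-second rate. Here is a standalone sketch, assuming the default rate of 0.006 USD per minute and a hypothetical 390-second recording. The function name is illustrative and not part of the class.

```python
def estimate_transcription_price(duration_s: float,
                                 usd_per_min: float = 0.006) -> float:
    """Estimated cost in USD: seconds times the per-second rate."""
    return duration_s * (usd_per_min / 60.0)


# Hypothetical 390-second (6.5-minute) recording.
price = estimate_transcription_price(390.0)
print(f"Estimated price: {price:.4f} USD")
```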

In [4]:
import magic
from pydub import AudioSegment
import wave

class OpenAI_Transcriber:
    """
    This class provides methods for transcribing an audio file using OpenAI's
    transcription service. The class supports automatic conversion of input files
    to MP3 format, retrieval of file size and duration, estimation of the cost
    of transcription, transcription of the audio file, and saving the transcript
    to a text file.
    
    -----------
    Parameters:
    -----------
    
    - input_dir (str): The file path to the audio file to be transcribed.
    - transcriber_model (str): The name of the transcriber model to use. Defaults to "whisper-1".
    - USD_per_min (float): The cost per minute in USD for using the transcriber service. Defaults to 0.006.
    
    --------
    Methods:
    --------
    
    - to_mp3(export_dir=None): Converts the input audio file to MP3 format if necessary.
    - get_filesize(): Returns the file size of the input audio file in megabytes.
    - get_duration(): Returns the duration of the input audio file in seconds.
    - get_price(): Calculates the total price of the transcription service based on the duration of the input audio file.
    - transcribe_audio(show_output=False): Transcribes the input audio file using the specified transcriber model.
    - save_txt(export_dir=None): Saves the transcription output to a text file.
    
    -----------
    Attributes:
    -----------
    
    - input_dir (str): The file path to the input audio file.
    - transcriber_model (str): The name of the transcriber model used.
    - USD_per_min (float): The cost per minute in USD for using the transcriber service.
    - filepath_mp3 (str): The file path to the MP3 version of the input audio file.
    - input_filesize (float): The file size of the input audio file in megabytes.
    - duration (float): The duration of the input audio file in seconds.
    - transcript (dict): The output of the transcription service; with the default response format it contains the transcription text.
    - transcript_text (str): The transcription text output only.
    - filepath_txt (str): The file path to the saved text file containing the transcription output.
    
    ---------
    Examples:
    ---------
    
    # Create an instance of the OpenAI_Transcriber class
    transcriber = OpenAI_Transcriber(input_dir="audio_file.wav")
    
    # Convert the input file to MP3 format
    transcriber.to_mp3()
    
    # Get the duration of the input file
    transcriber.get_duration()
    
    # Get the price of the transcription service
    transcriber.get_price()
    
    # Transcribe the input file
    transcriber.transcribe_audio()
    
    # Save the transcription output to a text file
    transcriber.save_txt()

    """
    def __init__(self, 
                 input_dir:str, 
                 transcriber_model:str = "whisper-1", 
                 USD_per_min:float = 0.006):
        """
        Initializes the OpenAI_Transcriber class.

        Parameters:
        ----------
        input_dir: str
            The directory path of the input audio file.
        transcriber_model: str, optional (default="whisper-1")
            The OpenAI transcriber model to use. Default is "whisper-1".
        USD_per_min: float, optional (default=0.006)
            The price per minute for transcribing audio. Default is 0.006 USD per minute.

        Returns:
        -------
        None
        """
        self.input_dir = input_dir
        self.transcriber_model = transcriber_model
        self.USD_per_min = USD_per_min
        
        self.audio_file = open(self.input_dir, "rb")
        self.filetype = magic.Magic(mime=True).from_file(self.input_dir)
        
    def to_mp3(self, export_dir=None):
        """
        Convert the input audio file to MP3 format using ffmpeg.

        Args:
            export_dir (str, optional): If specified, the path to save the MP3 file.
                                        If None, the MP3 file is saved in the same directory
                                        as the input file with the same name but with .mp3 extension.
                                        Defaults to None.

        Returns:
            None

        Raises:
            Exception: Raised if an error occurs during the conversion process.

        """
        if self.filetype == "audio/mpeg":
            print("File is already in mp3 format.")
            
        else:
            try:
                # Let ffmpeg detect the input format; MIME subtypes such as
                # "x-wav" are not valid ffmpeg format names.
                self.audiosegment = AudioSegment.from_file(self.input_dir)
        
                if export_dir is None:
                    # Swap only the extension, not other matches in the path.
                    self.filepath_mp3 = os.path.splitext(self.input_dir)[0] + '.mp3'
                else:
                    self.filepath_mp3 = export_dir

                self.audiosegment.export(self.filepath_mp3, format="mp3")

                # Release the handle on the original file before reopening.
                self.audio_file.close()
                self.audio_file = open(self.filepath_mp3, "rb")
                self.filetype = magic.Magic(mime=True).from_file(self.filepath_mp3)
                
                print(f"Output .mp3 file saved to {self.filepath_mp3}")
                print("Conversion to mp3 successful.")
                
                self.input_dir = self.filepath_mp3
                
            except Exception as e:
                print("Error converting file to mp3.")
                print(e)
        
                
    def get_filesize(self):
        """
        Get the file size of the input audio file.
        
        Returns
        -------
        None
        
        Prints
        ------
        input_filesize : float
            The size of the input audio file in MB.
        """
        self.input_filesize = os.stat(self.input_dir).st_size / (1024 * 1024)
        print(f"Input file size: {self.input_filesize:.2} MB.")
        
            
    def get_duration(self):
        """
        Gets the duration of the input audio file.
        
        Raises:
            OSError: If the file cannot be opened or read.
            TypeError: If the file type is not supported.
        
        Prints:
            The duration of the audio file in seconds.
        """
        try:
            if self.filetype == "audio/wav" or self.filetype == "audio/x-wav":
                with wave.open(self.input_dir, 'r') as f:
                    frames = f.getnframes()
                    rate = f.getframerate()
                    self.duration = frames / float(rate)
                    print(f'Duration: {self.duration:.2} s')
            else:
                audio = AudioSegment.from_file(self.input_dir)
                self.duration = audio.duration_seconds
                print(f'Duration: {self.duration:.2} s')
                
        except Exception as e:
            # Report the cause instead of silently swallowing every error.
            print("Error getting file length.")
            print(e)
    
    def get_price(self):
        """
        Calculates the total cost of transcribing the audio based on its duration and the given USD per minute rate.

        Raises:
            None

        Returns:
            None
        """
        self.total_price = self.duration * (self.USD_per_min/60.0)
        
    def transcribe_audio(self, 
                         show_output=False):
        """
        Transcribes the audio file using the specified transcriber model.

        Args:
            show_output (bool, optional): If True, prints the transcribed text. Defaults to False.

        Returns:
            None
        """
        self.transcript = openai.Audio.transcribe(self.transcriber_model, 
                                                  self.audio_file)
        
        self.transcript_text = self.transcript['text']
        
        if show_output:
            print(self.transcript_text)
            
    def save_txt(self, export_dir=None): 
        """
        Saves the transcript text to a text file.
        
        Parameters:
        -----------
        export_dir: str, optional
            The export directory of the output .txt file. If None, 
            the directory will be the same as the input audio file.
        
        Returns:
        --------
        None
        """
        if export_dir is None:
            # Swap only the extension; the old code produced "name. txt".
            self.filepath_txt = os.path.splitext(self.input_dir)[0] + '.txt'
        else:
            self.filepath_txt = f"{export_dir}.txt"
        
        # The with-block closes the file on exit.
        with open(self.filepath_txt, 'w', encoding="utf-8") as f:
            f.write(self.transcript_text)
        
        print(f"Transcript saved at: {self.filepath_txt}")
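Both `to_mp3` and `save_txt` derive an output path from the input path by swapping the extension. Doing this with `str.replace` can also alter other parts of the path that happen to contain the same text. The sketch below contrasts that with `os.path.splitext`; the helper name `swap_extension` and the example path are hypothetical.

```python
import os


def swap_extension(path: str, new_ext: str) -> str:
    """Replace only the final extension of a path."""
    root, _ = os.path.splitext(path)
    return f"{root}.{new_ext.lstrip('.')}"


example = "wav_files/talk.wav"

# str.replace touches every occurrence of "wav", including the folder name.
naive = example.replace(example.split(".")[-1], "txt")

# splitext changes only the part after the last dot.
safe = swap_extension(example, "txt")

print(naive)
print(safe)
```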

### Demonstration

- Define a variable `test_mp3`, an `OpenAI_Transcriber` object to initialize transcription.

In [5]:
test_mp3 = OpenAI_Transcriber('Data/Input/DE question converted.mp3')
test_mp3.filetype

'audio/mpeg'

- Check the file size. This should be below 25 MB, the upload limit of the Whisper API.

In [6]:
test_mp3.get_filesize()

Input file size: 6.0 MB.
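The size check can also be automated rather than read off the printed value. Here is a minimal sketch, assuming a 25 MB limit; the names `MAX_UPLOAD_MB`, `filesize_mb`, and `fits_upload_limit` are illustrative and not part of the class.

```python
import os
import tempfile

MAX_UPLOAD_MB = 25  # assumed upload limit, in megabytes


def filesize_mb(path: str) -> float:
    """File size in megabytes (1 MB taken as 1024 * 1024 bytes)."""
    return os.stat(path).st_size / (1024 * 1024)


def fits_upload_limit(path: str, limit_mb: float = MAX_UPLOAD_MB) -> bool:
    """True if the file is smaller than the limit."""
    return filesize_mb(path) < limit_mb


# Demonstration with a throwaway 2 MB file of zero bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy.bin")
    with open(path, "wb") as f:
        f.write(b"\0" * (2 * 1024 * 1024))
    size = filesize_mb(path)
    ok = fits_upload_limit(path)

print(f"{size:.2f} MB, fits: {ok}")
```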


- Get the duration and the price, then print the latter to two significant figures.

In [7]:
test_mp3.get_duration()
test_mp3.get_price()

print(f"Total Transcription Price: {test_mp3.total_price:.2} USD")

Duration: 3.9e+02 s
Total Transcription Price: 0.039 USD
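The scientific notation in the duration comes from the format specification: `.2` without a type letter means two significant digits, whereas `.2f` means two digits after the decimal point. A short sketch with hypothetical round values close to the ones printed above:

```python
duration_s = 390.0   # hypothetical duration in seconds
price_usd = 0.039    # hypothetical price in USD

# ".2" keeps two significant digits and may switch to exponent notation.
duration_sig = f"{duration_s:.2}"
price_sig = f"{price_usd:.2}"

# ".2f" keeps two digits after the decimal point.
duration_fixed = f"{duration_s:.2f}"
price_fixed = f"{price_usd:.2f}"

print(duration_sig, duration_fixed)  # 3.9e+02 390.00
print(price_sig, price_fixed)        # 0.039 0.04
```

For durations, `.2f` is usually the more readable choice; for sub-cent prices, significant digits preserve more information.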


- Run the transcription then preview the output.

In [8]:
%%time
test_mp3.transcribe_audio(show_output=True)

transcript_text = test_mp3.transcript_text

Question would be what advice can you give to a college student or a graduate student who wants to pursue a career as a data engineer? How should he she prepare in order to have the adequate skillsets that would enable an easy transition from the academic to the industry? And can data bricks be part of that preparation? Thank you. So, I just want to repeat the question that we got. I think the question was from the University of the Philippines, right? Is what I heard. And I think the question is around what advice would we give to students, right? Entering into kind of the space and then number two, what support can Databricks provide, right? Is that correct? Ah yes, how could Databricks be part of the preparation if someone... I can take the second one. You want to cover all of them first? What would you advise me to someone studying at university today but entering into the state NAL, what advice would you be giving them if that's the journey they want to go on? So, thanks for the q

- Save output as `.txt` file.

In [9]:
test_mp3.save_txt('Data/Output/DE question (raw transcription)')

Transcript saved at: Data/Output/DE question (raw transcription).txt


## B. `OpenAI_Summarizer`

Recall also that we require `OpenAI_Summarizer` to have the following functions:
- Compute input tokens from input transcript.
- Summarize transcript using `openai.ChatCompletion.create()`
- Save output summary as `.txt` file.
- Produce a dictionary like `usage_dict` to indicate summarization cost.
- Compute output tokens from output summary.

### Methods

- `num_tokens_from_input_string()`: Counts the tokens in the input transcript using the specified encoding and stores the count in `self.input_num_tokens`.
- `summarize_text()`: Sends a system prompt and the input transcript to the chat completion model (`gpt-3.5-turbo` by default), asks for a summary in `n_items` bullet points, and stores the result in `self.summarized_text`. Setting `show_notes=True` also prints the summary.
    - If `n_items=None`, the prompt contains the literal text `None`, so the model picks the number of points itself.
- `save_txt()`: Saves the summarized text to a text file at the specified export path and prints a confirmation message with the file's location.
- `get_price()`: Calculates the cost of the summary from the number of tokens used and stores it as a dictionary in `self.output_price_dict`.
- `num_tokens_from_output_string()`: Counts the tokens in the output summary using the specified encoding and stores the count in `self.output_num_tokens`.
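The request built by `summarize_text()` is a two-message list: a system message carrying the role prompt and a user message carrying the instruction plus the transcript. The sketch below builds such a list on its own and, as a variation on the class's behavior, drops the count from the instruction when `n_items` is `None`. The helper `build_messages` is hypothetical and is not part of `OpenAI_Summarizer`.

```python
from typing import Dict, List, Optional


def build_messages(system_prompt: str,
                   transcript_text: str,
                   n_items: Optional[int] = None) -> List[Dict[str, str]]:
    """Build a chat message list for a bullet-point summary request."""
    if n_items is None:
        # No count requested: leave the number of points to the model.
        instruction = "Summarize the following transcript into key bullet points:"
    else:
        instruction = (f"Summarize the following transcript into "
                       f"{n_items} key bullet points:")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{instruction}\n{transcript_text}"},
    ]


messages = build_messages("You are a careful note-taker.",
                          "This is an example transcript.",
                          n_items=3)
open_ended = build_messages("You are a careful note-taker.",
                            "This is an example transcript.")

for message in messages:
    print(message["role"], "->", message["content"])
```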

In [10]:
import os
import openai

class OpenAI_Summarizer:
    """
    This class uses an OpenAI chat model (gpt-3.5-turbo by default) to summarize a transcript into bullet points. It can calculate the number of tokens in the input and output text, estimate the price of generating the summary based on token usage, save the summary to a text file, and display notes if required.
    
    -----------
    Parameters:
    -----------
    
    transcript_text : str
        The input transcript that needs to be summarized.

    summarizer_model : str, optional (default="gpt-3.5-turbo")
        The OpenAI model to use for summarizing the transcript.

    USD_per_1k : float, optional (default=0.002)
        The price charged per 1,000 tokens used.

    encoding_name : str, optional (default="cl100k_base")
        The encoding type to use for encoding the input and output text.

    --------
    Methods:
    --------
    num_tokens_from_input_string() -> int:
        Calculates the number of tokens in the input text using the specified encoding and returns it.

    summarize_text(system_prompt:str, n_items:int=None, model:str="gpt-3.5-turbo", show_notes:bool=False) -> str:
        Summarizes the input text into a bulleted list of n-items using gpt-3.5-turbo by default, given a system prompt, and stores the summarized text in self.summarized_text.

    save_txt(export_dir):
        Saves the summarized text to a text file with the specified export directory.

    get_price() -> dict:
        Calculates the price for generating the summarized text based on the number of tokens used, and returns it as a dictionary.

    num_tokens_from_output_string(encoding_name:str="cl100k_base") -> int:
        Calculates the number of tokens from the output string.
    
    -----------
    Attributes:
    -----------
    
    transcript_text : str
        The input transcript that needs to be summarized.

    summarizer_model : str
        The OpenAI model to use for summarizing the transcript.

    USD_per_1k : float
        The price charged per 1,000 tokens used.

    encoding_name : str
        The encoding type to use for encoding the input and output text.

    input_encoding : tiktoken.Encoding
        The encoding used to encode the input text.

    input_num_tokens : int
        The number of tokens in the input text.

    response : openai.openai_object.OpenAIObject
        The response object returned by the OpenAI API after generating the summarized text.

    summarized_text : str
        The summarized text in bullet-point form.

    output_usage_dict : dict
        A dictionary containing the token usage of the OpenAI API after generating the summarized text.

    output_tokens_count : int
        The total number of tokens used by the OpenAI API after generating the summarized text.

    output_price_dict : dict
        A dictionary containing the cost of generating the summarized text based on token usage.

    ---------
    Examples:
    ---------
    
    # Create an instance of the OpenAI_Summarizer class
    summarizer = OpenAI_Summarizer(transcript_text="This is an example transcript.")

    # Calculate the number of input tokens
    num_input_tokens = summarizer.num_tokens_from_input_string()
    
    # Define role_txt for system_prompt
    role_txt = "You are a graduating SHS student, excellent at summarizing notes in layman's terms."

    # Summarize the transcript into 3 bullet points
    summarized_text = summarizer.summarize_text(system_prompt=role_txt, n_items=3)

    # Save the summarized text to a text file
    summarizer.save_txt(export_dir="example_summarized_text")

    # Calculate the cost of generating the summary
    output_price = summarizer.get_price()

    # Calculate the number of output tokens
    num_output_tokens = summarizer.num_tokens_from_output_string()
    """
    def __init__(self, 
                 transcript_text:str, 
                 summarizer_model:str = "gpt-3.5-turbo", 
                 USD_per_1k:float = 0.002, 
                 encoding_name:str = "cl100k_base"):
        """
        Initializes the instance of the OpenAI_Summarizer class with the input text, model, USD_per_1k, and encoding_name parameters.
    
        Parameters:
        -----------
        transcript_text : str
            The input text to be summarized.
        summarizer_model : str, optional
            The name of the OpenAI language model to be used for summarization. Default is "gpt-3.5-turbo".
        USD_per_1k : float, optional
            The cost of 1000 tokens in USD. Default is 0.002.
        encoding_name : str, optional
            The name of the encoding to be used for tokenization. Default is "cl100k_base".

        Returns:
        --------
        None

        """
        self.transcript_text = transcript_text
        self.summarizer_model = summarizer_model
        self.USD_per_1k = USD_per_1k
        self.encoding_name = encoding_name
        
    def num_tokens_from_input_string(self) -> int:
    
        """
        Calculates the number of tokens in the input text using the specified encoding and returns it.

        Returns:
            An integer representing the number of tokens in the input text.
        """
    
        self.input_encoding = tiktoken.get_encoding(self.encoding_name)
        self.input_num_tokens = len(self.input_encoding.encode(self.transcript_text))
        return self.input_num_tokens

    def summarize_text(self, 
                       system_prompt:str, 
                       n_items:int = None, 
                       model:str = "gpt-3.5-turbo", 
                       show_notes:bool=False) -> None:
        """
        Summarizes the input text into a bulleted list of n-items using the model in 
        self.summarizer_model, given a system prompt, and stores the summarized text.

        Parameters:
        -----------
        system_prompt: str
            A prompt to be fed into the OpenAI API model.
        
        n_items: int
            The number of bullet points the summarized text should contain.
        
        model: str
            Currently unused; the request is sent with self.summarizer_model.
        
        show_notes: bool
            If True, it prints the summarized text.

        Returns:
        --------
        None
            The summarized text is stored in self.summarized_text.

        Prints:
        -------
        If show_notes=True, it prints the summarized text.

        Exceptions:
        -----------
        Raises an OpenAI API Exception if there is an issue with the OpenAI API authentication.
        """
        self.response = openai.ChatCompletion.create(
            model=self.summarizer_model,
            messages=[
                {"role":"system", 
                 "content": system_prompt},
                
                {"role":"user", 
                 "content": f"Summarize the following transcript into {n_items} key bullet points: '\n{self.transcript_text}'"}
            ])

        self.summarized_text = self.response['choices'][0]['message']['content']
        self.output_usage_dict = dict(self.response['usage'])
        self.output_tokens_count = self.output_usage_dict['total_tokens']

        if show_notes:
            print(self.summarized_text)
    
    def save_txt(self, export_dir): 
        """
        Saves the summarized text to a text file with the specified export directory.
        
        Parameters:
        -----------
        export_dir : str
            The directory path where the summarized text will be saved as a .txt file.
        
        Returns:
        --------
        None
        
        Prints:
        -------
        A message indicating where the summarized note was saved.
        """
        # The with-block closes the file on exit.
        with open(f"{export_dir}.txt", 'w', encoding="utf-8") as f:
            f.write(self.summarized_text)
            
        print(f"Summarized note saved at: {export_dir}.txt")
        

    def get_price(self) -> dict: 
        """
        Calculates the price for generating the summarized text based on the number of tokens used, and returns it as a dictionary.
        
        Parameters:
        None
        
        Returns:
        output_price_dict (dict): a dictionary containing the cost of generating the summarized text based on the number of tokens used.
        
        Prints:
        None
        
        Raises:
        None
        """
        self.output_price_dict = {k: v*(self.USD_per_1k/1000.0) for (k, v) in self.output_usage_dict.items()}
        return self.output_price_dict

    
    def num_tokens_from_output_string(self, 
                                      encoding_name:str = "cl100k_base") -> int:
        """
        Calculates the number of tokens in the output string using the specified encoding and returns it.
        
        Parameters:
        -----------
        encoding_name: str
            The name of the encoding to use in the tokenization process.
            Default is 'cl100k_base'.
            
        Returns:
        --------
        output_num_tokens: int
            The number of tokens in the summarized output text.
            
        Raises:
        -------
        None
        """
        self.output_encoding = tiktoken.get_encoding(self.encoding_name)
        self.output_num_tokens = len(self.output_encoding.encode(self.summarized_text))
        return self.output_num_tokens

### Demonstration

- Initialize `OpenAI_Summarizer` using `transcript_text`, by defining `summarizer`.

In [11]:
summarizer = OpenAI_Summarizer(transcript_text)

- Preview price per 1000 tokens.

In [12]:
summarizer.USD_per_1k

0.002

- Get token count.
- Make sure it is below 4,096 tokens, the context window of `gpt-3.5-turbo`, which the prompt and the completion share.

In [13]:
summarizer.num_tokens_from_input_string()
summarizer.input_num_tokens

1329
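Because the context window is shared between the prompt and the completion, it helps to leave headroom for the summary and for the prompt text around the transcript. Below is a rough sketch of such a check. The 4,096-token limit, the overhead allowance, and the reserved output size are all assumptions to adjust for the model in use.

```python
CONTEXT_WINDOW = 4096  # assumed context size, in tokens


def fits_context(input_tokens: int,
                 reserved_output_tokens: int = 500,
                 prompt_overhead: int = 100,
                 limit: int = CONTEXT_WINDOW) -> bool:
    """True if transcript + prompt overhead + reserved output fit the window."""
    return input_tokens + prompt_overhead + reserved_output_tokens <= limit


short_ok = fits_context(1329)  # 1329 + 100 + 500 = 1929 tokens
long_ok = fits_context(3800)   # 3800 + 100 + 500 = 4400 tokens

print(short_ok, long_ok)
```

A transcript that fails the check would need to be split into chunks or sent to a model with a larger context window.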

- Preview input token price.

In [14]:
(summarizer.USD_per_1k/1000) * summarizer.input_num_tokens

0.0026579999999999998

- Define `role_txt`, the system prompt.

In [15]:
role_txt = "You are a detail-oriented data science student from the Philippines, who can easily transcribe text to pure English."

- Start summarization into 7 points.
- Show output.

In [16]:
%%time

summarizer.summarize_text(role_txt, n_items=7, show_notes=True)

- The question is about advice for college or graduate students interested in pursuing a data engineering career and how to prepare for the transition to industry.
- There are many online resources available, but it can be tricky to learn data engineering without working with actual data and a business case that provides value.
- One piece of advice is to look for internships that expose students to that kind of environment.
- Attitude, work ethic, and a drive to learn are important factors to consider when hiring data engineers.
- Having a foundation in software development or other related fields can also be helpful in pursuing a career as a data engineer.
- Building a portfolio and demonstrating that one can solve tough data and AI problems in a team-like manner is important.
- Databricks has a lot of open, free learning and runs training programs, and they have a university alliance program where they offer resources to universities to train people on Databricks.
CPU times: total: 

- Preview output tokens.

In [17]:
summarizer.num_tokens_from_output_string()
summarizer.output_num_tokens

185

## C. Determine total job price

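- Check the tokens used by `gpt-3.5-turbo`. The prompt count is higher than the transcript's 1329 tokens because it also covers the system prompt, the summarization instruction, and per-message formatting overhead.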

In [18]:
summarizer.output_usage_dict

{'prompt_tokens': 1378, 'completion_tokens': 185, 'total_tokens': 1563}

- Calculate estimated price based on current token price of USD 0.002 per 1k tokens.

In [19]:
summarizer.get_price()

print("Summarization Output Price (in USD): \n", summarizer.output_price_dict)

Summarization Output Price (in USD): 
 {'prompt_tokens': 0.0027559999999999998, 'completion_tokens': 0.00037, 'total_tokens': 0.003126}
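The price dictionary is the usage dictionary with every count multiplied by the per-token rate. Here is a standalone sketch using the usage figures shown above and the flat rate of 0.002 USD per 1,000 tokens. The function name `price_from_usage` and the rounding to six decimals (which hides floating-point noise such as `0.0027559999999999998`) are choices made for this example.

```python
from typing import Dict


def price_from_usage(usage: Dict[str, int],
                     usd_per_1k: float = 0.002) -> Dict[str, float]:
    """Convert token counts to USD at a flat per-1k-token rate."""
    rate = usd_per_1k / 1000.0
    return {key: round(count * rate, 6) for key, count in usage.items()}


usage = {"prompt_tokens": 1378, "completion_tokens": 185, "total_tokens": 1563}
prices = price_from_usage(usage)

for key, value in prices.items():
    print(f"{key}: {value} USD")
```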


- Save output text as `QnA_summarized.txt`.

In [20]:
summarizer.save_txt('Data/Output/QnA_summarized')

Summarized note saved at: Data/Output/QnA_summarized.txt


## D. Preview Pricing

- Preview and pretty print transcription price from using Whisper.

In [21]:
{'transcription_price' : f"{test_mp3.total_price:.2} USD"}

{'transcription_price': '0.039 USD'}

- Preview the dictionary of chat completion token prices.

In [22]:
summarizer.output_price_dict

{'prompt_tokens': 0.0027559999999999998,
 'completion_tokens': 0.00037,
 'total_tokens': 0.003126}

- Preview and pretty print the total chat completion price from using `gpt-3.5-turbo`.

In [23]:
{'summarization_price' : f"{summarizer.output_price_dict['total_tokens']:.2} USD"}

{'summarization_price': '0.0031 USD'}

- Combine these price dictionaries.

In [24]:
{'transcription_price':f"{test_mp3.total_price:.2} USD"} |\
{'summarization_price':f"{summarizer.output_price_dict['total_tokens']:.2} USD"} |\
{'total_price': f"{(test_mp3.total_price + summarizer.output_price_dict['total_tokens']):.2} USD"}

{'transcription_price': '0.039 USD',
 'summarization_price': '0.0031 USD',
 'total_price': '0.042 USD'}
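The `|` operator used to merge dictionaries requires Python 3.9 or later. An alternative is to keep the prices numeric until the end and format them once, as in this sketch. The helper `combine_prices` is hypothetical, and the input values are rounded stand-ins for `test_mp3.total_price` and `summarizer.output_price_dict['total_tokens']`.

```python
from typing import Dict


def combine_prices(transcription_usd: float,
                   summarization_usd: float) -> Dict[str, str]:
    """Build a price report, formatting each value to two significant digits."""
    total = transcription_usd + summarization_usd
    return {
        "transcription_price": f"{transcription_usd:.2} USD",
        "summarization_price": f"{summarization_usd:.2} USD",
        "total_price": f"{total:.2} USD",
    }


# Rounded stand-ins for the values computed in the cells above.
report = combine_prices(0.039, 0.003126)
print(report)
```

Summing the numeric values before formatting avoids adding up already rounded strings.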

# Next Steps

While the classes `OpenAI_Transcriber` and `OpenAI_Summarizer` are already easy to use on their own, the next step is to combine them in a child class called `OpenAI_NoteTaker`:
- A note-taking routine always runs the transcription and summarization jobs together, so it is logical to combine them.
- `OpenAI_NoteTaker` must have functions that:
    - take notes
    - save notes
    - preview the total job price via dictionaries.
    
**These will be done in the next part.**