In [1]:
!python --version

Python 3.9.1


- Import relevant libraries:
    - `os`: to interact with the operating system, allowing local file access and manipulation.
    - `openai`: to interact with the OpenAI API, allowing the use of large language models (LLMs) to generate text, translate languages, summarize text, etc.

In [2]:
import os
import openai

- Load the `.txt` file that contains your OpenAI API key.

In [3]:
with open('Data/Input/api-key.txt', 'r') as file:
    api_key = file.read().strip()  # strip the trailing newline, if any

os.environ["OPENAI_API_KEY"] = api_key
openai.api_key = os.getenv("OPENAI_API_KEY")
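
As an optional variation on the cell above, the key could be taken from an environment variable when one is already set, falling back to the text file otherwise. This is only a sketch: `resolve_api_key` is a hypothetical helper name, not something provided by the `openai` package.

```python
import os
from pathlib import Path


def resolve_api_key(key_file, env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, else from a text file.

    Hypothetical helper for illustration; not part of the `openai` package.
    """
    # Prefer a key that is already exported in the environment.
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to the file, removing surrounding whitespace/newlines.
        key = Path(key_file).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No API key found in ${env_var} or {key_file!r}")
    return key


# Usage (path taken from the cell above):
# openai.api_key = resolve_api_key('Data/Input/api-key.txt')
```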

# Part I: Explore how the APIs work

- To create our NoteTaker using OpenAI's APIs, we must first learn how the [speech-to-text](https://platform.openai.com/docs/guides/speech-to-text) and [chat completion](https://platform.openai.com/docs/guides/chat) APIs work.
- The main objective is to use these APIs to transcribe and summarize a recorded audio file: my question at Databricks: Destination Lakehouse Pilipinas, about how students should prepare for a Data Engineering role while still in school, together with the answer that followed.

## A. Audio transcription via [OpenAI Whisper](https://openai.com/research/whisper) API

- Whisper is an Automatic Speech Recognition (ASR) system that has been trained on a large and diverse dataset consisting of 680,000 hours of multilingual and multitask supervised data from the web.
- The Whisper architecture is implemented as an encoder-decoder Transformer that can perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
- According to OpenAI, Whisper's zero-shot performance across many diverse datasets is much more robust, with 50% fewer errors, than that of existing approaches that use smaller, more closely paired audio-text training datasets or unsupervised audio pretraining.
- Whisper's approach of alternating between transcribing in the original language and translating to English is particularly effective at learning speech-to-text translation and outperforms the supervised state-of-the-art on CoVoST2 to English translation zero-shot.


**Price:** around 0.006 USD per minute.

**Filesize Limit:** <25 MB

For more details, please check the [paper](https://cdn.openai.com/papers/whisper.pdf), [model card](https://github.com/openai/whisper/blob/main/model-card.md), and [code](https://github.com/openai/whisper).
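
The price and file-size limit above translate into a quick pre-flight check before sending a file to the API. The sketch below is illustrative only: the function and constant names are hypothetical, the rate is the figure quoted above (it may change), and 1 MB is assumed to mean 1024 * 1024 bytes.

```python
import os

WHISPER_USD_PER_MINUTE = 0.006        # rate quoted above; may change
WHISPER_MAX_BYTES = 25 * 1024 * 1024  # 25 MB limit (assumes 1 MB = 1024 * 1024 bytes)


def estimate_whisper_cost(duration_seconds, usd_per_minute=WHISPER_USD_PER_MINUTE):
    """Approximate transcription cost in USD for a given audio duration."""
    return (duration_seconds / 60.0) * usd_per_minute


def within_size_limit(path, max_bytes=WHISPER_MAX_BYTES):
    """Return True if the file is strictly smaller than the upload limit."""
    return os.path.getsize(path) < max_bytes


# Usage: a 10-minute recording at the quoted rate
# print(f"{estimate_whisper_cost(10 * 60):.3f} USD")
```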

### Sample syntax using Whisper

- Define an `fpath` string variable holding the path of the audio or video file to be transcribed.
- Create a new variable `audio_file` using the native Python [`open()`](https://docs.python.org/3/library/functions.html#open) function with `"rb"` as the mode argument:
    - `'r'`: open for reading (default)
    - `'b'`: binary mode
- Lastly, define a variable `transcript` using the `transcribe()` method of `openai.Audio`, passing `"whisper-1"` to select the Whisper-1 model, followed by `audio_file`.
    - `openai.Audio` is the class in the `openai` Python package (versions before 1.0) that provides methods for working with audio files, including speech-to-text transcription using the Whisper API.
- **Note:** Prompting, that is, supplying initial text to guide the model's output, is not covered in this discussion and is not required to use the API. The speech-to-text guide describes an optional `prompt` parameter for this purpose. **It should be considered and prioritized when improving this project.**

In [4]:
%%time
fpath = 'Data/Input/DE question converted.mp3'

# Open in binary mode; the context manager closes the file once the request is done
with open(fpath, "rb") as audio_file:
    transcript = openai.Audio.transcribe("whisper-1", audio_file)

CPU times: total: 46.9 ms
Wall time: 25.2 s


- Preview `transcript`, an `OpenAIObject` that wraps the JSON response.
- The JSON holds the text of the transcript as a string under the `"text"` key.

In [5]:
transcript

<OpenAIObject at 0x2236bfa65e0> JSON: {
  "text": "My question would be, what advice can you give to a college student or a graduate student who wants to pursue a career as a data engineer? How should he-she prepare in order to have the adequate skill sets that would enable an easy transition from the academic to the industry? And can Databricks be part of that preparation? Thank you. So, I just want to repeat the question to answer. I think the question was from the University of the Philippines, right, is what I heard. And I think the question is around what advice would we give to students, right, entering into kind of the space? And then number two, what support can Databricks provide, right? Is that correct? Ah, yes. How could Databricks be part of the preparation if someone... Okay, so I can take the second one. You want to cover all of them. First of all, would you advise me to, you know, someone studying at university today but entering into the state NAL, what advice would you

- Preview the `text` item from `transcript`.
- This is the most relevant component of `transcript`, so it's better to save it to a variable called `transcript_text`.

In [6]:
transcript_text = transcript['text']
transcript_text

"My question would be, what advice can you give to a college student or a graduate student who wants to pursue a career as a data engineer? How should he-she prepare in order to have the adequate skill sets that would enable an easy transition from the academic to the industry? And can Databricks be part of that preparation? Thank you. So, I just want to repeat the question to answer. I think the question was from the University of the Philippines, right, is what I heard. And I think the question is around what advice would we give to students, right, entering into kind of the space? And then number two, what support can Databricks provide, right? Is that correct? Ah, yes. How could Databricks be part of the preparation if someone... Okay, so I can take the second one. You want to cover all of them. First of all, would you advise me to, you know, someone studying at university today but entering into the state NAL, what advice would you be giving them if that's the journey they want to

## Audio transcription via Whisper library

- One reason to use the open-source `whisper` library instead of accessing Whisper through the OpenAI API is that [you don't have to pay for using the OpenAI API](https://github.com/openai/whisper/discussions/1088).
- The downside, however, is that transcription and translation take more time, since this approach runs on your own local machine.
- Overall, I recommend the API when the audio or video to be transcribed is for an open-source project and does not contain sensitive information.

### Some points to consider:

- The library runs locally and, once the model weights have been downloaded, does not require an internet connection, whereas the OpenAI API always does.
- The library gives more control over the model (for example, the choice of model size and decoding options) than the OpenAI API, but requires more technical expertise to use effectively.
- Because it is open source, the library can be modified and customized to meet specific needs, whereas the OpenAI API is a closed system with limited options.

### Installation via Jupyter notebook

- The raw-nbconvert cells below must be converted into code cells in order for the installation to be run.
- Install [ffmpeg-python](https://github.com/kkroening/ffmpeg-python), one of the possible prerequisites.
- Install whisper from the GitHub repo.

### Sample syntax using the Whisper library

- Import `whisper`.

- Define and load the base `model`.

- Define `fpath` and transcribe using the base `model` to produce `result`.

- Preview transcription.
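
The cells for these steps are not shown here, so the following is a minimal sketch of what they might look like, based on the usage documented in the `whisper` repository. The wrapper name `transcribe_locally` is hypothetical, and the sketch assumes the `whisper` package and the `ffmpeg` command-line tool are already installed.

```python
def transcribe_locally(fpath, model_name="base"):
    """Transcribe an audio file with the open-source `whisper` package.

    Hypothetical wrapper for illustration. `whisper` is imported lazily so
    that this cell can be defined even where the package is not installed.
    """
    import whisper

    # Load the pretrained model; weights are downloaded on first use.
    model = whisper.load_model(model_name)

    # `transcribe` returns a dictionary; the full transcript is under "text".
    result = model.transcribe(fpath)
    return result["text"]


# Usage (path reused from the API example in section A):
# text = transcribe_locally('Data/Input/DE question converted.mp3')
# print(text[:500])  # preview the first 500 characters
```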

### Some notes:

- Transcriptions of the same audio can differ slightly between runs.

- By default, the library decodes at temperature 0 and falls back to progressively higher temperatures when a segment fails its quality checks. Decoding at a temperature above 0 samples tokens at random, so the output may vary even if the input is the same.

- The same pretrained weights are loaded on every run, so any variation comes from the decoding step (and, to a lesser extent, from hardware and numerical-precision differences) rather than from the model itself.

## B. Summarizing text with gpt-3.5-turbo

- Now that we have `transcript_text`, we want to summarize it into seven points using OpenAI's Chat Completion with `gpt-3.5-turbo` as the model.

- According to OpenAI, GPT-3.5 models can understand and generate natural language or code. The most capable and cost-effective model in the GPT-3.5 family is `gpt-3.5-turbo`, which has been optimized for chat but also works well for traditional completion tasks.

- Note that the Chat Completion API does not learn from the inputs it receives: each request is independent, and the model retains no memory between calls. Any context the model needs, such as earlier turns of a conversation, must be included in the `messages` sent with the request.

- Some points to consider:
    - `openai.ChatCompletion.create()` takes two required parameters: `model` and `messages`.
    - `model` specifies the OpenAI LLM used. For now, we prefer `"gpt-3.5-turbo"` for completion tasks.
    - `messages` is a list of dictionaries, each with a `"role"` and a `"content"` key. The possible roles are:
        - `"system"`: sets the behavior of the `"assistant"` by defining the persona that the model adopts.
        - `"user"`: refers to the text of the message entered by the user. 
        - `"assistant"`: refers to the text of the message generated by the AI-based assistant.
    - For both `"system"` and `"user"`, it is important to carefully engineer the prompts to ensure that they are clear, concise, and relevant to the task at hand.
    
**Price:** USD 0.002 per 1k tokens.

**Token Limit:** <4096 tokens, shared between the prompt and the completion. (1 token $\approx$ 0.75 of a word)

For more details, please check the [model index](https://platform.openai.com/docs/models/gpt-3-5).
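
The rule of thumb above (1 token is roughly 0.75 of a word) can be turned into a rough size check before calling the API. This is only a heuristic sketch with hypothetical function names: it counts whitespace-separated words rather than using a real tokenizer, so actual token counts will differ, and the 500 tokens reserved for the completion is an arbitrary assumption.

```python
import math


def estimate_tokens(text, words_per_token=0.75):
    """Rough token estimate from a word count (heuristic, not a tokenizer)."""
    n_words = len(text.split())
    # Round up so that the estimate errs on the cautious side.
    return math.ceil(n_words / words_per_token)


def fits_context(prompt_tokens, reserved_completion_tokens=500, limit=4096):
    """Check that prompt plus reserved completion stays under the token limit."""
    return prompt_tokens + reserved_completion_tokens < limit


# Usage:
# n = estimate_tokens(transcript_text)
# print(n, fits_context(n))
```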

### Prompt Engineering Preparation

- Define `role_txt` for `"system"`:
    - This is the persona that we want the `"gpt-3.5-turbo"` model to adopt.
    - When defining this persona, keep the prompt concise while still giving the persona a human character.
    - One suggestion is to state a background first, then a general role, then some relevant context about the task the model is given.
- `"user"` prompt:
    - Since our objective is to summarize the given `transcript_text` output into several points, a simple command would suffice.
    - For now, the number of points is hard-coded to seven; in a later version it should be an integer parameter.
    - The whole prompt is then written as an f-string that embeds `transcript_text`.
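
One way to make the number of points a parameter, as suggested above, is to build the `messages` list in a small function. This is a sketch only; `build_messages` and `n_points` are hypothetical names, and the wording of the user prompt simply mirrors the one used in this notebook.

```python
def build_messages(role_txt, transcript_text, n_points=7):
    """Build the `messages` list for a chat completion request.

    Hypothetical helper: `n_points` replaces the hard-coded "7".
    """
    if n_points < 1:
        raise ValueError("n_points must be a positive integer")

    user_prompt = (
        f"Summarize the following transcript into {n_points} key bullet points:\n"
        f"{transcript_text}"
    )
    return [
        {"role": "system", "content": role_txt},   # persona for the model
        {"role": "user", "content": user_prompt},  # the actual request
    ]


# Usage:
# messages = build_messages(role_txt, transcript_text, n_points=5)
```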

In [7]:
role_txt = "You are a detail-oriented data science student from the Philippines, great with transcribing text to pure English."

### Chat Completion Syntax

- Let's generate `response` using `role_txt`, and `transcript_text` previously generated using Whisper.

In [8]:
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role":"system", 
         "content": role_txt},
        
        {"role":"user", 
         "content": f"Summarize the following transcript into 7 key bullet points: '\n{transcript_text}'"}
        ]
)

- Let's inspect the `response` output.
- We can see that the important components of `response` are the following:
    - The `"content"` attribute under the message dictionary is the most important since it contains the actual result or the response generated by the GPT-3.5 Turbo API to the prompt provided. In this case, the response is a summary of the given transcript in seven key bullet points. This is the main output that we must consider, and should be saved as a text file later.

    - The `"usage"` dictionary contains information on the number of tokens used in the completion process, including the prompt and response. This information is important because it can be used to track the cost of using the GPT-3.5 Turbo API, as it charges based on the number of tokens used. Knowing this information can help manage the usage and cost of the API.

In [9]:
response

<OpenAIObject chat.completion id=chatcmpl-768SNCwVWruD2KM28Ab3VhPUP4LO3 at 0x2236bf184f0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "1. The question is about advice for preparing to pursue a career as a data engineer. \n2. There are many online resources available, but it's hard to learn data engineering without working with actual data. \n3. One advice is to look for internships that would expose you to that kind of environment. \n4. Attitude, work ethic, and a growth mindset are important when pursuing a career as a data engineer. \n5. Building a portfolio of work and demonstrating the ability to solve tough problems in a team-like manner is essential. \n6. Databricks has a significant amount of open and free learning resources for anyone interested. \n7. Databricks has a university alliance program that provides resources, labs, and learning material to universities to train people on Databricks.",
        "ro

- To pretty-print the generated summary:

In [12]:
print(response['choices'][0]['message']['content'])

1. The question is about advice for preparing to pursue a career as a data engineer. 
2. There are many online resources available, but it's hard to learn data engineering without working with actual data. 
3. One advice is to look for internships that would expose you to that kind of environment. 
4. Attitude, work ethic, and a growth mindset are important when pursuing a career as a data engineer. 
5. Building a portfolio of work and demonstrating the ability to solve tough problems in a team-like manner is essential. 
6. Databricks has a significant amount of open and free learning resources for anyone interested. 
7. Databricks has a university alliance program that provides resources, labs, and learning material to universities to train people on Databricks.


- Define `usage_dict` as the dictionary of tokens from `response`, which contains the following:
    - `prompt_tokens`: the number of tokens in the prompt, which is the input given to the model.
    - `completion_tokens`: the number of tokens in the generated completion, which is the output generated by the model based on the prompt.
    - `total_tokens`: the total number of tokens used by the model to generate the completion, which is the sum of the prompt and completion tokens.

In [13]:
usage_dict = dict(response['usage'])
usage_dict

{'prompt_tokens': 1438, 'completion_tokens': 161, 'total_tokens': 1599}

- Currently, the [price of `gpt-3.5-turbo` tokens](https://openai.com/pricing) is 0.002 USD per 1000 tokens.
- It is therefore useful to define `price_dict`, which multiplies the token counts in `usage_dict` by this rate to obtain the cost of the chat completion job.

In [14]:
price_dict = {k: v*(0.002/1000.0) for (k, v) in usage_dict.items()}
price_dict

{'prompt_tokens': 0.002876,
 'completion_tokens': 0.00032199999999999997,
 'total_tokens': 0.003198}
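
The same computation can be wrapped in a function that also rounds the result, which removes the floating-point noise visible in `completion_tokens` above. This is a sketch with hypothetical names; the rate is the one quoted in this notebook and may change.

```python
GPT35_USD_PER_1K_TOKENS = 0.002  # rate quoted above; may change


def chat_cost(usage, usd_per_1k=GPT35_USD_PER_1K_TOKENS, ndigits=6):
    """Convert a token-usage dictionary into USD costs, rounded for display."""
    return {k: round(v * usd_per_1k / 1000.0, ndigits) for k, v in usage.items()}


# Token counts copied from the `usage_dict` output above.
usage_example = {"prompt_tokens": 1438, "completion_tokens": 161, "total_tokens": 1599}
print(chat_cost(usage_example))
```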

# Next Steps

Based on our exploration, in order to create our NoteTaker using OpenAI APIs, a recommended approach would be:
- Create an `OpenAI_Transcriber` class that uses the Whisper API, and can do the following:
    - Convert input video to audio, in order to reduce the file size.
    - Get filesize and duration of input file.
    - Get estimated transcription price.
    - Transcription using `openai.Audio.transcribe()`
    - Save transcription output as `.txt` file.
- Create an `OpenAI_Summarizer` class that uses the Chat Completion API, and can do the following:
    - Compute input tokens from input transcript.
    - Summarize transcript using `openai.ChatCompletion.create()`
    - Save output summary as `.txt` file.
    - Produce a dictionary like `usage_dict` to indicate summarization cost.
    - Compute output tokens from output summary.
- Generally, for any input audio/video file received by `OpenAI_Transcriber`, its output transcription will be received by `OpenAI_Summarizer` in order to produce the summary text file.
    - Taking the limits into consideration, the input file should be smaller than 25 MB, and the transcription, together with the prompt and the generated summary, should use fewer than 4096 tokens.
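
The hand-off between the two planned classes could look roughly like the sketch below. It is not the final design: `run_notetaker` is a hypothetical name, and `transcribe` and `summarize` are injected callables standing in for the future `OpenAI_Transcriber` and `OpenAI_Summarizer`, so the flow can be tried without calling the API.

```python
from pathlib import Path

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # file-size limit noted above (assumes 1 MB = 1024 * 1024 bytes)


def run_notetaker(media_path, transcribe, summarize, out_dir):
    """Chain transcription and summarization, saving both outputs as .txt.

    `transcribe(path) -> str` and `summarize(text) -> str` are hypothetical
    stand-ins for the planned transcriber and summarizer classes.
    """
    media_path = Path(media_path)
    if media_path.stat().st_size >= MAX_UPLOAD_BYTES:
        raise ValueError(f"{media_path.name} exceeds the 25 MB upload limit")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: transcribe the media file and save the transcript.
    transcript = transcribe(media_path)
    transcript_file = out_dir / f"{media_path.stem}_transcript.txt"
    transcript_file.write_text(transcript, encoding="utf-8")

    # Step 2: summarize the transcript and save the summary.
    summary = summarize(transcript)
    summary_file = out_dir / f"{media_path.stem}_summary.txt"
    summary_file.write_text(summary, encoding="utf-8")

    return transcript_file, summary_file
```

Passing the two steps in as functions keeps the file handling separate from the API calls, which also makes it easy to test with dummy functions first.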

**These will be done in the next part.**