# Whisper Notes

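This notebook explores OpenAI's Whisper speech-to-text model, through both the hosted API and the local Python package, and uses the resulting transcript to generate meeting notes.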

## Requirements

In [1]:
import openai
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from scipy import stats
from tabulate import tabulate
from dotenv import load_dotenv
import sys
import os
from pydub import AudioSegment
import whisper

def backtrace(trace: np.ndarray):
    pass  # body missing in the source


## Versions

In [2]:
versions = {
    'numpy': np.__version__,
    'pandas': pd.__version__,
    'matplotlib': matplotlib.__version__,
    'seaborn': sns.__version__,
    'scipy': scipy.__version__,
    'openai': openai.__version__,
    'whisper': whisper.__version__,
    'python': sys.version
}

table = tabulate(
    versions.items(),
    headers=["Module", "Version"],
    tablefmt="fancy_grid",
)

print(table)

╒════════════╤══════════════════════════════════════════════════════════════════════════════════╕
│ Module     │ Version                                                                          │
╞════════════╪══════════════════════════════════════════════════════════════════════════════════╡
│ numpy      │ 1.24.3                                                                           │
├────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ pandas     │ 2.0.2                                                                            │
├────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ matplotlib │ 3.7.1                                                                            │
├────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ seaborn    │ 0.12.2                                                                           │
├────────────┼──────

## Globals

### Import ENV

In [3]:
load_dotenv() # Loads .env file into env vars

True

### Assignments

In [10]:
openai_api_key = os.getenv('OPENAI_API_KEY')
audio_file_path = "G:\\My Drive\\03 - Work\\03.03 - Meetings\\03.03.05 - Inbox\\MSS - Keyhole - Sync Meeting with Steve over Ticketing V2 - 06.14.2023.flac"
openai.api_key = openai_api_key

## Investigate Audio File

In [11]:
audio = AudioSegment.from_file(audio_file_path, "flac")

In [12]:
print("audio file is " + str(audio.duration_seconds // 60) + " Minutes long")

audio file is 71.0 Minutes long
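Floor division by 60 drops the leftover seconds and prints a float (`71.0`). A minimal sketch of a formatter that keeps them is shown here; the helper name `format_duration` is hypothetical and not part of pydub, and it expects a value in seconds such as `audio.duration_seconds`.

```python
def format_duration(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS (hypothetical helper)."""
    total = int(seconds)  # drop fractional seconds
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Example with an assumed value; in the notebook this would be audio.duration_seconds
print(format_duration(4275.3))
```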


## OpenAI Web API Whisper

### Get the transcript

In [14]:
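# Close the file handle once the request finishes
with open(audio_file_path, "rb") as audio_file:
    # transcribe keeps the source language; translate would render the audio into English
    transcript = openai.Audio.transcribe("whisper-1", audio_file)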

KeyboardInterrupt: 
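The call above was interrupted before it returned. One possible factor is file size: the hosted endpoint documents an upload limit of 25 MB per request, and a 71-minute FLAC file may well exceed it. A rough sketch of one workaround is to cut the recording into fixed time windows and send each one separately. The helper `time_windows` and the 10-minute window length are assumptions for illustration, not something the notebook or the API prescribes.

```python
from typing import Iterator, Tuple


def time_windows(total_ms: int, window_ms: int = 10 * 60 * 1000) -> Iterator[Tuple[int, int]]:
    """Yield (start_ms, end_ms) pairs covering [0, total_ms) in order."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    for start in range(0, total_ms, window_ms):
        yield start, min(start + window_ms, total_ms)


# 71 minutes split into 10-minute windows (the last one is shorter)
windows = list(time_windows(71 * 60 * 1000))
print(len(windows), windows[-1])
```

Since pydub's `AudioSegment` supports slicing by milliseconds, each pair could be applied as `audio[start:end]` and the slice exported to its own file before upload. Cutting at fixed times can split a sentence in half, so transcription quality near the boundaries may suffer.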

In [None]:
print(transcript)

## Local Whisper Python API

### Load Model

Available models:
- tiny
- base
- small
- medium
- large

In [14]:
model = whisper.load_model("base")

### Load Audio & Transcribe

In [15]:
result = model.transcribe(audio_file_path)
print(result["text"])



 What you're saying about that because I looked into and file is a new Project for me too that it You know it kind of does want to wire at like you know and at an application love right like you're saying as a standalone How does it work well with integration? Right, you know just just as a library, you know, right? Yeah, we're just using it standalone I haven't used it in that capacity before I do think it's the best fit for Because it simplifies a lot if you're building like a standalone application like my connect like basically my connector is And so I don't what I what I what I hate doing is putting you guys in a box Right because then then we can end up with a fucked up solution and that's not that's not what I want Use the best tool for the for the job and there's other libraries And so don't be afraid to look at those just because we're we're I'm using files for what I'm doing It just happens to be the right tool for what I'm using using it sure sure and they're and they're ple

## Chunkify

In [63]:
import re

def split_text(text, maxsize=3000):
    """Yield chunks of whole sentences, each up to roughly `maxsize` characters.

    A single sentence longer than `maxsize` is still emitted as one oversized chunk.
    """
    # Split on spaces that follow sentence-ending punctuation and precede a capital letter
    sentences = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', text)

    current_chunk = ""

    # Add sentences to the current chunk until the next one would overflow it
    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= maxsize:
            current_chunk += sentence.strip() + " "
        else:
            # Avoid yielding an empty chunk when the very first sentence is too long
            if current_chunk.strip():
                yield current_chunk.strip()
            current_chunk = sentence.strip() + " "

    # Yield any remaining content as its own chunk (if there is any)
    if current_chunk.strip():
        yield current_chunk.strip()

In [64]:
chunks_of_text = list(split_text(result["text"]))

In [65]:
len(chunks_of_text)

10

In [72]:
len(" ".join(chunks_of_text))

26444
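The joined length is only a coarse check that the chunker kept the text. A slightly stricter sketch follows, using a hypothetical helper `check_chunks` that rejects empty chunks, compares the word sequence before and after chunking, and reports which chunks are longer than the limit.

```python
from typing import List


def check_chunks(chunks: List[str], original: str, maxsize: int = 3000) -> List[int]:
    """Return indexes of chunks longer than maxsize; raise if words were lost.

    Hypothetical helper for sanity-checking a chunker's output.
    """
    if any(not chunk.strip() for chunk in chunks):
        raise ValueError("found an empty chunk")
    # Compare word sequences so differences in whitespace are ignored
    if " ".join(chunks).split() != original.split():
        raise ValueError("chunks do not reproduce the original words")
    return [i for i, chunk in enumerate(chunks) if len(chunk) > maxsize]


sample = "First sentence here. Second sentence follows. Third one ends it."
sample_chunks = ["First sentence here.", "Second sentence follows. Third one ends it."]
print(check_chunks(sample_chunks, sample, maxsize=30))
```

In the notebook this could be called as `check_chunks(chunks_of_text, result["text"])`; an empty list would mean every chunk fits within the limit.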

## Convert to Notes

### Build Chunk Summaries

In [69]:
sum_results = []

# Summarize each transcript chunk separately, printing progress as we go
for count, chunk in enumerate(chunks_of_text, start=1):
    chatgptPrompt = '''Using the below meeting transcript chunk, create a summary of the meeting chunk:
{}
'''.format(chunk)

    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": chatgptPrompt}
        ]
    )

    sum_results.append(completion.choices[0].message.content)
    print(count)


1
2
3
4
5
6
7
8
9
10


In [71]:
len(" ".join(sum_results))

5302

In [74]:
sum_chunks_of_text = list(split_text(" ".join(sum_results)))

In [75]:
sum_results_second_pass = []

# Second pass: summarize the re-chunked first-pass summaries
for count, chunk in enumerate(sum_chunks_of_text, start=1):
    chatgptPrompt = '''Using the below meeting transcript chunk, create a summary of the meeting chunk:
{}
'''.format(chunk)

    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": chatgptPrompt}
        ]
    )

    sum_results_second_pass.append(completion.choices[0].message.content)
    print(count)


1
2


In [76]:
len(" ".join(sum_results_second_pass))

1644

In [77]:
chatgptPrompt = '''Using the below meeting transcript, create a succinct summary of the meeting:
{}
'''.format(" ".join(sum_results_second_pass))

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": chatgptPrompt}
    ]
)
print(completion.choices[0].message.content)

The speaker proposed a turnkey security solution for Fortress Energy with professional services and compliance with TSA SDZC directives, divided into four phases with set timelines. The team discussed the access and architecture of the cloud piece and set up a proof of concept for secure remote access. Michael offered to send more information for a meeting on Wednesday and deployment plans were discussed, aiming for SRA by July 1st and active directory integration & identity management by July 19th. Technical requirements, firewall rules, and port lists were discussed, and the group plans to coordinate project management resources and submit a proposal and a partnership agreement.
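The cells above run the same summarize-and-rejoin step by hand until the text is short enough for one prompt (26444 to 5302 to 1644 characters). A sketch of that loop as a single function follows. The name `summarize_until_fits` and its parameters are hypothetical; the splitter and the summarizer are passed in as callables, so the demo uses stand-ins instead of calling the API.

```python
from typing import Callable, List


def summarize_until_fits(
    text: str,
    split_fn: Callable[[str], List[str]],
    summarize_fn: Callable[[str], str],
    max_chars: int = 3000,
    max_passes: int = 5,
) -> str:
    """Summarize chunk by chunk, repeating until the text fits in one prompt."""
    for _ in range(max_passes):
        if len(text) <= max_chars:
            break
        summaries = [summarize_fn(chunk) for chunk in split_fn(text)]
        text = " ".join(summaries)
    # Final pass over the remaining text; if max_passes ran out first,
    # this text may still be longer than max_chars
    return summarize_fn(text)


def fixed_width_split(text: str, width: int = 100) -> List[str]:
    """Stand-in splitter: fixed-width slices instead of sentence chunks."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def fake_summarize(chunk: str) -> str:
    """Stand-in summarizer: keeps the first 10 characters."""
    return chunk[:10]


demo_summary = summarize_until_fits("x" * 1000, fixed_width_split, fake_summarize, max_chars=100)
print(len(demo_summary))
```

In the notebook, `split_fn` could be `lambda t: list(split_text(t))` and `summarize_fn` a small wrapper around the `openai.ChatCompletion.create` call. The `max_passes` cap is there so a summarizer that fails to shorten its input cannot loop forever.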


In [78]:
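# gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0
from gensim.summarization import summarize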

print(summarize(result["text"]))

ModuleNotFoundError: No module named 'gensim.summarization'
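The import fails because the `gensim.summarization` module no longer exists in gensim 4.0 and later. As a stand-in that needs no extra packages, here is a rough sketch of a frequency-based extractive summarizer. It is a different and much simpler technique than the TextRank-based method gensim used to provide; the function name `extractive_summary` is hypothetical, and no stop-word filtering is applied.

```python
import re
from collections import Counter


def extractive_summary(text: str, max_sentences: int = 3) -> str:
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indexes by score, then restore document order for the top ones
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


demo_text = (
    "The firewall rules need review. We like coffee. "
    "The firewall port list needs review too."
)
print(extractive_summary(demo_text, max_sentences=2))
```

On a conversational transcript like the one above, filler words would dominate the frequency counts, so removing common words first would likely be needed for useful results.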