# Making a script

One of the things that disappoints me most about OpenAI's Whisper is that the resulting text is just one long wall of text. Fortunately, we are using a Discord bot called Craig to record our DnD sessions. With Craig, each person's voice is recorded to their own audio file, which means that the audio is already diarized.

Therefore it stands to reason that in order to generate a text file in a script format, all I would need to do is process each file individually, then sort the segments chronologically. Each time the speaker changes, simply insert a line with that person's name.
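
As a rough sketch of that idea, here is how the merge could look on toy data. The `build_script` helper, the speaker names, and the segment contents are all made up for illustration; the only assumption about Whisper is that each segment is a dict with a `start` time in seconds and a `text` string.

```python
def build_script(segments_by_speaker, offsets=None):
    """Merge per-speaker segments into one chronological, script-style string.

    segments_by_speaker: {speaker: [{'start': seconds, 'text': str}, ...]}
    offsets: optional {speaker: seconds} shift applied to that speaker's file.
    """
    offsets = offsets or {}
    merged = []
    for speaker, segments in segments_by_speaker.items():
        shift = offsets.get(speaker, 0.0)
        for seg in segments:
            merged.append((seg['start'] + shift, speaker, seg['text'].strip()))

    # Chronological order across all speakers.
    merged.sort(key=lambda item: item[0])

    lines = []
    previous = None
    for _, speaker, text in merged:
        # Only emit the name when the speaker changes.
        if speaker != previous:
            lines.append(f"{speaker}:")
            previous = speaker
        lines.append(f"    {text}")
    return "\n".join(lines)


# Hypothetical toy data, not real session output.
toy = {
    'alice': [{'start': 0.0, 'text': ' Roll for initiative.'},
              {'start': 6.5, 'text': ' That hits.'}],
    'bob': [{'start': 2.0, 'text': ' I got a nineteen.'},
            {'start': 4.0, 'text': ' Plus two.'}],
}
script = build_script(toy)
print(script)
```

Sorting only by segment start time is the simplest choice; overlapping speech would still interleave line by line.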

In [None]:
import os, re
import whisper
import torch
import torchaudio as ta
import librosa
import matplotlib.pyplot as plt
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint
from IPython.display import Audio, display

print(whisper.__version__)
print(ta.__version__)
print(torch.__version__)
%matplotlib inline

## Parsing Audacity projects
The method we've been using with Craig is to download our recordings as an Audacity project. Audacity projects have an `.aup` project file, which is just a text file that tells the program how to arrange the audio files used in the project. Let's take a look.

In [None]:
data_dir = '../data/db_02-03-2023/'
aup_file = 'V8DcTChKFF.aup'

with open(os.path.join(data_dir, aup_file)) as f:
    aup = BeautifulSoup(f, features="lxml-xml")

# Project metadata
project = aup.find('project')
pprint(project.attrs)


`projname` tells us the directory in which the audio files are stored.

In [None]:
proj_data_dir = project.get('projname')

proj_data_dir

In [None]:
p = os.path.join(data_dir, proj_data_dir)
os.listdir(p)

There are four audio files and two data files.

In [None]:
proj_imports = project.find_all('import')
proj_imports

There are two attributes here that seem immediately useful. `filename` tells us where the audio is. `offset` tells us how each file aligns with the others in the project.

In [None]:
discord_name_pattern = r'.+-(.*)\..+'
proj_files = []

for item in proj_imports:
    filename, offset = item.get('filename'), float(item.get('offset'))
    username = re.search(discord_name_pattern, filename).group(1)
    out = {
        'filename': filename,
        'username': username,
        'offset': offset
    }
    proj_files.append(out)

pprint(proj_files)
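
One thing worth knowing about `discord_name_pattern`: the leading `.+-` is greedy, so it anchors on the last hyphen in the filename. The sketch below compares it with a hypothetical alternative that anchors on the first hyphen. The filenames are invented for illustration and the alternative pattern assumes a bare filename with no directory part.

```python
import re

# Pattern used above: greedy `.+-` backtracks to the LAST hyphen.
last_hyphen = re.compile(r'.+-(.*)\..+')
# Hypothetical alternative: `[^-]+-` stops at the FIRST hyphen.
first_hyphen = re.compile(r'^[^-]+-(.*)\.[^.]+$')

# Invented filenames, not taken from the real project.
samples = ['1-alice.ogg', '2-bob-the-bard.ogg']

extracted = {
    name: (last_hyphen.search(name).group(1),
           first_hyphen.search(name).group(1))
    for name in samples
}

for name, (from_last, from_first) in extracted.items():
    print(f"{name}: last-hyphen={from_last!r}, first-hyphen={from_first!r}")
```

If none of the usernames contain a hyphen, both patterns give the same result and the original is fine as it is.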

In [None]:
whisper_dir = '../models/whisper'
transcriber = whisper.load_model('small.en', download_root=whisper_dir)

In [None]:
for part in proj_files[:1]:
    filename = part.get('filename')
    username = part.get('username')
    offset = part.get('offset')

    path = os.path.join(data_dir, proj_data_dir, filename)
    wav = whisper.load_audio(path)

wav

In [None]:
# condition_on_previous_text=False made this run in 12 minutes.  Without the argument it took 21 minutes.
results = transcriber.transcribe(wav, condition_on_previous_text=False)

In [None]:
for thing in results['segments']:
    print(f"[{thing['start']} - {thing['end']}] {thing['text']}")

Hmm... so Whisper is generating a lot of repetitive text. This is likely due to the fact that the audio is very long with long sections of silence. However, I'm noticing that the timestamps presented are much more accurate than normal, likely another benefit of setting `condition_on_previous_text=False`.
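
To put a number on the repetition instead of eyeballing it, something like the following could count runs of identical consecutive segments. `repeated_runs` is a hypothetical helper and the segments are toy data; it only assumes each segment has a `text` key, as in `results['segments']`.

```python
from itertools import groupby


def repeated_runs(segments, min_run=2):
    """Return (text, run_length) for each run of identical consecutive texts."""
    texts = [seg['text'].strip().lower() for seg in segments]
    # groupby only groups adjacent equal items, which is what we want here.
    runs = [(text, len(list(group))) for text, group in groupby(texts)]
    return [(text, n) for text, n in runs if n >= min_run]


# Hypothetical toy segments, not real transcription output.
toy_segments = [
    {'text': ' Okay.'}, {'text': ' Okay.'}, {'text': ' Okay.'},
    {'text': ' I cast fireball.'},
    {'text': ' Thank you.'}, {'text': ' Thank you.'},
]
runs = repeated_runs(toy_segments)
print(runs)
```

Running the same count before and after a change to the pipeline gives a crude way to compare the two transcriptions.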

I'm wondering if using Silero VAD to strip out the non-speech will improve transcription accuracy.

In [None]:
silero_dir = '../models/silero-vad'
vad, utils = torch.hub.load(repo_or_dir=silero_dir,
                            source='local',
                            model='silero_vad',
                            force_reload=True,
                            onnx=False)

(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils

In [None]:
speech_timestamps = get_speech_timestamps(wav, vad)

In [None]:
len(speech_timestamps)

In [None]:
SAMPLING_RATE = 16000
speech_only = collect_chunks(speech_timestamps, torch.tensor(wav))

In [None]:
Audio(data=speech_only, rate=SAMPLING_RATE)

In [None]:
# 6m 46s!
results = transcriber.transcribe(speech_only, condition_on_previous_text=False)


In [None]:
print(results['text'])

In [None]:
for thing in results['segments']:
    print(f"[{thing['start']} - {thing['end']}] {thing['text']}")

In [None]:
len(results['segments'])

In [None]:
for segment in results['segments']:
    segment['start_frame'] = segment['start'] * SAMPLING_RATE
    segment['end_frame'] = segment['end'] * SAMPLING_RATE
    segment['length'] = segment['end_frame'] - segment['start_frame']

In [None]:
results['segments'][0:2]

In [None]:
speech_ts_df = pd.DataFrame(speech_timestamps)
speech_ts_df['length'] = speech_ts_df['end'] - speech_ts_df['start']
speech_ts_df.sort_values(by='start')

In [None]:
speech_timestamps[-1]

In [None]:
len(speech_timestamps)

In [None]:
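# Rebuild where each VAD chunk lands inside `speech_only`.
# collect_chunks concatenates wav[start:end] slices back to back (end is exclusive),
# so each chunk starts exactly where the previous one ended.
chunk_timestamps = []
current_frame = 0
for entry in speech_timestamps:
    speech_length = entry['end'] - entry['start']
    end_frame = current_frame + speech_length
    chunk_timestamps.append(
        {'start': current_frame,
         'end': end_frame
        }
        )
    current_frame = end_frame  # no +1, otherwise the boundaries drift by one sample per chunk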

len(chunk_timestamps)
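
The same mapping can be written as a cumulative sum. This is a minimal sketch on made-up numbers, assuming the speech-only audio is the chunks `wav[start:end]` concatenated back to back, so each chunk begins where the previous one ends.

```python
import numpy as np

# Hypothetical VAD output in samples, not real data.
toy_timestamps = [
    {'start': 100, 'end': 300},
    {'start': 1000, 'end': 1500},
    {'start': 4000, 'end': 4100},
]

lengths = np.array([ts['end'] - ts['start'] for ts in toy_timestamps])
chunk_ends = np.cumsum(lengths)     # exclusive end of each chunk in the trimmed audio
chunk_starts = chunk_ends - lengths  # start of each chunk in the trimmed audio

toy_chunks = [
    {'start': int(s), 'end': int(e)}
    for s, e in zip(chunk_starts, chunk_ends)
]
print(toy_chunks)
```

The last `end` should equal the total length of the trimmed audio, which makes for an easy sanity check against `len(speech_only)`.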

In [None]:
a = 1
b = speech_timestamps[a]
c = chunk_timestamps[a]
d,e = b['start'], b['end']
f,g = c['start'], c['end']

display(Audio(wav[d:e], rate=SAMPLING_RATE))
display(Audio(speech_only[f:g], rate=SAMPLING_RATE))

In [None]:
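a = 450
segment = results['segments'][a]
# Convert Whisper's seconds to sample indices in the speech-only audio.
start = round(segment['start'] * SAMPLING_RATE)
end = round(segment['end'] * SAMPLING_RATE)
print(start, segment['text'], end)
display(Audio(speech_only[start:end], rate=SAMPLING_RATE))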

In [None]:
len(results['segments'])

## Time to use pandas

The timestamps provided by Whisper become slightly out of sync with the timestamps provided by Silero. This means that I can't match Whisper segments directly with Silero segments.
However, the drift doesn't seem too bad and is likely caused by the way that Whisper converts frame timestamps to seconds, so it should be sufficient to just find the closest ones.

By putting both the long waveform timestamps and the truncated chunk timestamps into the same dataframe I can easily match the two together with a `segment_id` identifying them as the same segment of audio.  Additionally, having the timestamps sorted will make it easier to match Silero segments with Whisper segments.
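
One way the "find the closest one" step could look is `pd.merge_asof` with `direction='nearest'`. This is only a sketch on invented numbers: the column names mirror the ones used in this notebook, but the values, `toy_vad`, and `toy_whisper` are hypothetical.

```python
import pandas as pd

SAMPLING_RATE = 16000

# Hypothetical chunk table: 'start' is the position in the original recording,
# 'chunk_start' the position in the speech-only audio, both in samples.
toy_vad = pd.DataFrame({
    'start': [16000, 80000, 160000],
    'chunk_start': [0, 32000, 64000],
})
toy_vad['segment_id'] = toy_vad.index

# Hypothetical Whisper segments: seconds relative to the speech-only audio,
# slightly drifted from the exact chunk boundaries.
toy_whisper = pd.DataFrame({
    'start': [0.0, 2.04, 3.98],
    'text': [' Hello.', ' Roll a d20.', ' Nice.'],
})
toy_whisper['chunk_start'] = (
    (toy_whisper['start'] * SAMPLING_RATE).round().astype('int64')
)

# Both frames must be sorted on the key; 'nearest' picks the closest chunk start.
matched = pd.merge_asof(
    toy_whisper.sort_values('chunk_start'),
    toy_vad.sort_values('chunk_start'),
    on='chunk_start',
    direction='nearest',
    suffixes=('_whisper', '_original'),
)

# Position of the matched chunk in the original recording, in seconds.
matched['original_seconds'] = matched['start_original'] / SAMPLING_RATE
print(matched[['text', 'segment_id', 'original_seconds']])
```

Nearest-start matching assumes each Whisper segment begins close to a chunk boundary. A segment that spans several chunks would only be tied to the first one.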

In [None]:
vad_df = pd.DataFrame(speech_timestamps)
vad_df['length'] = vad_df['end'] - vad_df['start']
vad_df.head()

In [None]:
tdf = pd.DataFrame(chunk_timestamps)
tdf.columns = ['chunk_start','chunk_end']
tdf['chunk_length'] = tdf['chunk_end'] - tdf['chunk_start']
tdf.head()

In [None]:
vad_df = pd.concat([vad_df, tdf], axis=1)

In [None]:
vad_df.head()

In [None]:
w_df = pd.DataFrame(results['segments'])
w_df.head()