<a href="https://colab.research.google.com/github/BATspock/ML_Projects/blob/main/whisper/Whisper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Whisper performance on audio files and on audio converted from videos

- Whisper installation with PyTorch working in this notebook
- Compare performance of different Whisper models and create a report
- Detailed Whisper installation instructions at https://github.com/openai/whisper

In [None]:
import whisper
import timeit

### Working with video files from YouTube using youtube_dl

In [None]:
from __future__ import unicode_literals
import youtube_dl

### Function to create audio (mp3) files from YouTube videos 

In [None]:
def save_to_mp3(url):
    """
    Save a YouTube video URL to mp3.

    Args:
        url (str): A YouTube video URL.

    Returns:
        str: The filename of the mp3 file.
    """

    options = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
    }

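    with youtube_dl.YoutubeDL(options) as downloader:
        # Download once and reuse the returned metadata to derive the filename.
        info = downloader.extract_info(url, download=True)
        source_name = downloader.prepare_filename(info)

    # The postprocessor keeps the base name and swaps the extension for .mp3,
    # whatever the source container was (.m4a, .webm, ...).
    return source_name.rsplit(".", 1)[0] + ".mp3"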

#### Convert YouTube videos to mp3 and create audio files

In [None]:
# This url is British PM Liz Truss giving a speech
youtube_url = "https://www.youtube.com/watch?v=UFNRUuBARM4"
# additional links for testing purposes
# this link contains a video with a duration of 1:56 of French people speaking in English
# youtube_url = "https://www.youtube.com/watch?v=iwto58Wc2bg&ab_channel=Frenchly"
filename = save_to_mp3(youtube_url)

## Transcribe the saved audio file to text using Whisper

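### Trying different Whisper model sizes

| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |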

Performance of $\textbf{tiny.en}$ model

In [None]:
model = whisper.load_model("tiny.en")

### Transcribe text using transcribe function and check result

In [None]:
time = timeit.default_timer()
result = model.transcribe(filename, fp16=False)
#result['text']
print("Time taken to transcribe: ", timeit.default_timer() - time)

Performance of $\textbf{base.en}$ model

In [None]:
timer = timeit.default_timer()
model = whisper.load_model("base.en")
result = model.transcribe(filename, fp16=False)
#result['text']
print("Time taken to transcribe using base model: ", timeit.default_timer() - timer)


Performance of $\textbf{small.en}$ model

In [None]:
timer = timeit.default_timer()
model = whisper.load_model("small.en")
result = model.transcribe(filename, fp16=False)
#result['text']
print("Time taken to transcribe using small model: ", timeit.default_timer() - timer)

Performance of $\textbf{medium.en}$ model

In [None]:
timer = timeit.default_timer()
model = whisper.load_model("medium.en")
result = model.transcribe(filename, fp16=False)
#result['text']
print("Time taken to transcribe using medium model: ", timeit.default_timer() - timer)
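
The timing cells above repeat one pattern, with a small inconsistency: the tiny.en cell times only `transcribe`, while the other cells also include `load_model` in the measurement. A minimal sketch of a helper that times the two steps separately could look like this. The names `timed` and `benchmark_model` are hypothetical, and `load_fn` is a parameter that is assumed to behave like `whisper.load_model`.

```python
import timeit


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = timeit.default_timer()
    result = fn(*args, **kwargs)
    return result, timeit.default_timer() - start


def benchmark_model(load_fn, model_name, audio_path):
    """Time model loading and transcription separately.

    load_fn is assumed to behave like whisper.load_model: it takes a model
    name and returns an object with a transcribe(path, fp16=...) method
    whose result is a dict containing a "text" entry.
    """
    model, load_seconds = timed(load_fn, model_name)
    result, transcribe_seconds = timed(model.transcribe, audio_path, fp16=False)
    return {
        "model": model_name,
        "load_seconds": load_seconds,
        "transcribe_seconds": transcribe_seconds,
        "text": result["text"],
    }
```

In this notebook it would be called as `benchmark_model(whisper.load_model, "tiny.en", filename)`, once per model name, so that all four measurements cover the same steps.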

## Observations on English transcriptions

1. The tiny model transcribes much faster than the larger models
2. Whisper separates the speaker clearly from background noise
3. The base model explicitly writes "APPLUSES" in its transcription
4. Whisper also appears to capture semantics: every model transcribes "Norfolk turkey" as a food dish, except the base model, which reads it as two place names: "Norfolk, Turkey"


# Check performance on video where French speakers are speaking English

In [None]:
youtube_url_french = "https://www.youtube.com/watch?v=iwto58Wc2bg&ab_channel=Frenchly"
filename_french = save_to_mp3(youtube_url_french)

Performance of $\textbf{tiny.en}$ model

In [None]:
timer = timeit.default_timer()
model = whisper.load_model("tiny.en")
result = model.transcribe(filename_french, fp16=False)
#result['text']
print("Time taken to transcribe French speech using tiny model: ", timeit.default_timer() - timer)

Performance of $\textbf{base.en}$ model

In [None]:
timer = timeit.default_timer()
model = whisper.load_model("base.en")
result = model.transcribe(filename_french, fp16=False)
result['text']
print("Time taken to transcribe French speech using base model: ", timeit.default_timer() - timer)

Performance of $\textbf{small.en}$ model

In [None]:
timer = timeit.default_timer()
model = whisper.load_model("small.en")
result = model.transcribe(filename_french, fp16=False)
#result['text']
print("Time taken to transcribe French speech using small model: ", timeit.default_timer() - timer)

Performance of $\textbf{medium.en}$ model

In [None]:
model = whisper.load_model("medium.en")
result = model.transcribe(filename_french, fp16=False)
result['text']

# Observations
- The tiny model is the fastest and the least accurate
- Accuracy improves clearly as the models get larger
- The base model picks up French within the English audio, for example "to Tour-F-L. Yeah, it's F-L Tour" (for the Eiffel Tower) and French phrases from the speaker such as "Le Divan du Monde"
- $\textbf{Need to Investigate Further}$: The medium model seems to run out of memory on the French audio, which is unexpected because it worked on the English audio
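
One possible explanation for the memory issue, not confirmed in this notebook, is that the previously loaded model is still referenced by `model` while `whisper.load_model("medium.en")` runs, so two models are held in memory at once. A sketch of a cleanup step to try before loading the next model follows. The helper name `release_memory` is hypothetical, and clearing the CUDA cache only matters if the models were loaded on a GPU.

```python
import gc


def release_memory():
    """Run garbage collection and, if PyTorch with CUDA is present, clear its cache.

    Returns the number of unreachable objects found by gc.collect().
    """
    collected = gc.collect()
    try:
        import torch  # optional: only needed to clear the CUDA allocator cache
    except ImportError:
        return collected
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


# Intended use in the notebook (assumption: `model` holds the previous model):
# model = None          # drop the reference to the previous model first
# release_memory()
# model = whisper.load_model("medium.en")
```

If the error persists after this step, the cause lies elsewhere and still needs investigation.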


# Quantitative Analysis of Results

### Imports for the next steps



In [2]:
from Levenshtein import distance 
import moviepy.editor as mp
import whisper
import timeit
import xml.etree.ElementTree as ET
import os
import sys
import json
import matplotlib.pyplot as plt

### Create function to convert all video files to audio files

In [None]:
def convert_video_to_audio(video_file_path):
    #convert video mp4 to mp3 and move the result to the audio folder
    start_time = timeit.default_timer()

    audio_path = video_file_path.replace(".mp4",".mp3")
    clip = mp.VideoFileClip(video_file_path)
    clip.audio.write_audiofile(audio_path)
    clip.close()
    #move converted audio file to audio folder
    audio_file = os.path.basename(audio_path)
    os.replace(audio_path, os.path.join('audio', audio_file))
    print("Time taken to convert the video file:",timeit.default_timer()-start_time)

### Function to transcribe audio files and compare the result with the XML transcripts

In [1]:
def compare(audio_file_path, xml_path, model_type = 'tiny.en'):

    #print("Converting video to audio")
    
    print("audio file path:",'audio/'+ audio_file_path)
    print("xml file path:",xml_path)
    print('#'*50)

    #transcribe the audio file
    start_time = timeit.default_timer()
    
    model = whisper.load_model(model_type)
    result = model.transcribe('audio/'+ audio_file_path, fp16=False)

    transcription_time = timeit.default_timer() - start_time
    print("Time Taken to transcribe the audio file:",transcription_time)
    
    whisper_output = result['text'].split()
    #extract the text from the xml file

    start_time = timeit.default_timer()

    tree = ET.parse(xml_path)
    words = []
    for node in tree.findall('.//p/span'):
        words.append(node.text)

    #print("Time Taken to extract the text from the xml file:",timeit.default_timer() - start_time)

    #filter the words from xml file and whisper output

    start_timer = timeit.default_timer()

    filtered_words = []
    for word in words:
        #skip empty spans and any text containing '[' or ':'
        if word is None or '[' in word or ':' in word:
            continue
        filtered_words.append(word)

    #get sets of filtered words and whisper output
    set_whisper_output = set(whisper_output)
    set_filtered_words = set(filtered_words)
    #convert the lists to string
    whisper_output_str = ' '.join(whisper_output)
    filtered_words_str = ' '.join(filtered_words)
    print('$'*20)
    print()
    print("Number of words in the video:",len(set_whisper_output))
    print("Number of words in the xml file:",len(set_filtered_words))
    print('$'*20)
    print()
    #calculate unique words in the video and xml file
    print("Number of unique words in the video:",len(set_whisper_output - set_filtered_words))
    print("Number of unique words in the xml file:",len(set_filtered_words - set_whisper_output))
    print('$'*20)
    print()
    #print("Time Taken to process the string:",timeit.default_timer() - start_timer)

    #calculate the Levenshtein distance between the two strings
    start_timer = timeit.default_timer()
    print(filtered_words_str[:20])
    print(whisper_output_str[:20])
    #save the lowercased whisper transcript for later inspection
    with open(audio_file_path.split('/')[-1]+str(model_type)+'.txt', "w") as text_file:
        text_file.write(whisper_output_str.lower())
    #normalize the edit distance by the length of the reference (xml) text
    levenshtein_distance = distance(filtered_words_str.lower(),whisper_output_str.lower())/len(filtered_words_str)
    print("Levenshtein distance between xml words and whisper output:", levenshtein_distance)
    print('$'*20)
    print()
    print("Time Taken to calculate levenshtein distance:",timeit.default_timer() - start_timer)

    print('#'*50)

    return transcription_time, levenshtein_distance, len(set_filtered_words - set_whisper_output)
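
To make the metric returned by `compare` concrete, here is a minimal pure-Python sketch of a character-level Levenshtein distance normalized by the reference length. It is a stand-in for `Levenshtein.distance`, not that library's implementation. The helper names `edit_distance`, `normalized_distance`, and `normalize_tokens` are hypothetical. `normalize_tokens` is an optional extra, under the assumption that words such as "Davis." and "Davis" should count as the same word when the word sets are compared.

```python
import string


def edit_distance(a, b):
    """Levenshtein distance between two strings (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def normalized_distance(reference, hypothesis):
    """Edit distance on lowercased text, divided by the reference length."""
    if not reference:
        raise ValueError("reference text is empty")
    return edit_distance(reference.lower(), hypothesis.lower()) / len(reference)


def normalize_tokens(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(string.punctuation) for word in text.lower().split())
    return [token for token in tokens if token]
```

Because the distance is divided by the length of the reference text, a value of 0 means identical strings, and the value can exceed 1 when the hypothesis is much longer than the reference. Building the word sets from `normalize_tokens(...)` instead of a plain `split()` would keep punctuation and capitalization from inflating the unique-word counts.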

### Call function on all the files, report comparison stats and save graphs

In [None]:
#    video_files = []
#    for file in os.listdir('Data/'):
#        if file.endswith('.mp4'):
#            video_files.append(file)
#    video_files.sort()

xml_files = []
for file in os.listdir(os.getcwd()):
    if file.endswith('.xml'):
        xml_files.append(file)
xml_files.sort()
#print(video_files, xml_files)

#convert video mp4 to mp3
#for video_file in video_files:
#  convert_video_to_audio('Data/'+ video_file)

audio_files = []
for file in os.listdir('audio/'):
    if file.endswith('.mp3'):
        audio_files.append(file)
audio_files.sort()

print(xml_files, audio_files)


TIME = []
DISTANCE = []
WORDS = []
#    print(audio_files)
tiny_time = []
tiny_distance = []
tiny_words = []
for i in range(len(audio_files)):
    print("#### NEW VIDEO TINY####")
    time , dist, words = compare(audio_files[i],xml_files[i], "tiny.en")
    tiny_time.append(time)
    tiny_distance.append(dist)
    tiny_words.append(words)
TIME.append(tiny_time)
DISTANCE.append(tiny_distance)
WORDS.append(tiny_words)

base_time = []
base_distance = []
base_words = []
for i in range(len(audio_files)):
    print("#### NEW VIDEO BASE####")
    time , dist, words = compare(audio_files[i],xml_files[i], "base.en")
    base_time.append(time)
    base_distance.append(dist)
    base_words.append(words)
TIME.append(base_time)
DISTANCE.append(base_distance)
WORDS.append(base_words)

small_time = []
small_distance = []
small_words = []
for i in range(len(audio_files)):
    print("#### NEW VIDEO SMALL####")
    time , dist, words = compare(audio_files[i],xml_files[i], "small.en")
    small_time.append(time)
    small_distance.append(dist)
    small_words.append(words)
TIME.append(small_time)
DISTANCE.append(small_distance)
WORDS.append(small_words)

#    medium_time = []
#    medium_distance = []
#    medium_words = []
#    print("Medium Model")
#    for i in range(len(audio_files)):
#        print("#### NEW VIDEO MEDIUM ####")
#        time , dist, words = compare(audio_files[i],xml_files[i], "medium.en")
#        medium_time.append(time)
#        medium_distance.append(dist)
#        medium_words.append(words)
#    TIME.append(medium_time)
#    DISTANCE.append(medium_distance)
#    WORDS.append(medium_words)

#save TIME and DISTANCE in json format
with open('time.json', 'w') as f:
    json.dump(TIME, f)
with open('distance.json', 'w') as f:
    json.dump(DISTANCE, f)
with open('words.json', 'w') as f:
    json.dump(WORDS, f)

['21259.1.xml', '21259.2.xml', '43666.1.xml', '43666.2.xml', '5891.1.xml', '5891.5.xml', '9908.1.xml', '9908.2.xml'] ['21259-01-V01-5000000002053816.mp3', '21259-02-V01-5000000002053837.mp3', '43666-01-V01-5000000002836675.mp3', '43666-02-V01-5000000002836696.mp3', '5891-01-V01-1279244.mp3', '5891-05-V01-1277108.mp3', '9908-01-V01-1329158.mp3', '9908-02-V01-1328710.mp3']
#### NEW VIDEO TINY####
audio file path: audio/21259-01-V01-5000000002053816.mp3
xml file path: 21259.1.xml
##################################################
Time Taken to transcribe the audio file: 215.85463279800024
$$$$$$$$$$$$$$$$$$$$

Number of words in the video: 939
Number of words in the xml file: 1134
$$$$$$$$$$$$$$$$$$$$

Number of unique words in the video: 452
Number of unique words in the xml file: 647
$$$$$$$$$$$$$$$$$$$$

My name is Manuel Be
....................
Levenshtein distance between xml words and whisper output: 0.3683451700752924
$$$$$$$$$$$$$$$$$$$$

Time Taken to calculate levenshtein distan

Time Taken to transcribe the audio file: 237.0943705919999
$$$$$$$$$$$$$$$$$$$$

Number of words in the video: 857
Number of words in the xml file: 1006
$$$$$$$$$$$$$$$$$$$$

Number of unique words in the video: 271
Number of unique words in the xml file: 420
$$$$$$$$$$$$$$$$$$$$

Tape two, Ella Davis
Take two Ella Davis.
Levenshtein distance between xml words and whisper output: 0.20889480076340292
$$$$$$$$$$$$$$$$$$$$

Time Taken to calculate levenshtein distance: 0.501264064998395
##################################################
#### NEW VIDEO BASE####
audio file path: audio/5891-01-V01-1279244.mp3
xml file path: 5891.1.xml
##################################################
Time Taken to transcribe the audio file: 257.93086259400116
$$$$$$$$$$$$$$$$$$$$

Number of words in the video: 1232
Number of words in the xml file: 1390
$$$$$$$$$$$$$$$$$$$$

Number of unique words in the video: 341
Number of unique words in the xml file: 499
$$$$$$$$$$$$$$$$$$$$

1995. We're with a s
opa tel
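
The loops above pair `audio_files[i]` with `xml_files[i]`, which only works while both sorted lists happen to line up. A sketch of pairing by the numbers embedded in the file names is shown below. It assumes the naming patterns visible in the printed lists (`21259.1.xml` and `21259-01-V01-....mp3`), and the helper names `xml_key`, `audio_key`, and `pair_files` are hypothetical.

```python
import re


def xml_key(name):
    """Map a name like '21259.1.xml' to (21259, 1)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.xml", name)
    if match is None:
        raise ValueError(f"unexpected XML file name: {name}")
    return int(match.group(1)), int(match.group(2))


def audio_key(name):
    """Map a name like '21259-01-V01-5000000002053816.mp3' to (21259, 1)."""
    match = re.match(r"(\d+)-(\d+)-", name)
    if match is None:
        raise ValueError(f"unexpected audio file name: {name}")
    return int(match.group(1)), int(match.group(2))


def pair_files(audio_files, xml_files):
    """Return (audio, xml) pairs matched on the two leading numbers."""
    xml_by_key = {xml_key(name): name for name in xml_files}
    pairs = []
    for audio in sorted(audio_files, key=audio_key):
        key = audio_key(audio)
        if key not in xml_by_key:
            raise KeyError(f"no XML transcript for {audio}")
        pairs.append((audio, xml_by_key[key]))
    return pairs
```

With this, each loop could iterate over `pair_files(audio_files, xml_files)` and fail loudly on a missing transcript instead of silently comparing mismatched files.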

In [None]:
plt.figure(figsize=(5,5))
plt.scatter(TIME[0], DISTANCE[0], label = 'tiny')
plt.scatter(TIME[1], DISTANCE[1], label = 'base')
plt.scatter(TIME[2], DISTANCE[2], label = 'small')
#     plt.scatter(TIME[3], DISTANCE[3], label = 'medium')

plt.xlabel('Time')
plt.ylabel('Levenshtein Distance')
plt.title('Levenshtein Distance vs Time')
plt.legend()
plt.savefig('levenshtein_distance_vs_time.png')

plt.figure(figsize=(5,5))
plt.scatter(WORDS[0], DISTANCE[0], label = 'tiny')  
plt.scatter(WORDS[1], DISTANCE[1], label = 'base')
plt.scatter(WORDS[2], DISTANCE[2], label = 'small') 
#    plt.scatter(WORDS[3], DISTANCE[3], label = 'medium')

plt.xlabel('Number of words')
plt.ylabel('Levenshtein Distance')
plt.title('Levenshtein Distance vs Number of words')
plt.legend()
plt.savefig('levenshtein_distance_vs_words.png')
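
The scatter plots show one point per file. A per-model average can complement them when writing the report. The sketch below assumes the nested-list layout of `TIME` and `DISTANCE` built above (one inner list per model, in the order the models were run). The helper names `summarize` and `format_table` are hypothetical.

```python
from statistics import mean


def summarize(model_names, times, distances):
    """Return one (model, mean_time, mean_distance) row per model.

    times and distances are lists of per-file lists, in the same order as
    model_names, like TIME and DISTANCE in the cells above.
    """
    if not (len(model_names) == len(times) == len(distances)):
        raise ValueError("model_names, times and distances must align")
    return [
        (name, mean(model_times), mean(model_distances))
        for name, model_times, model_distances in zip(model_names, times, distances)
    ]


def format_table(rows):
    """Render the summary rows as aligned plain text."""
    lines = [f"{'model':<10}{'mean time (s)':>15}{'mean distance':>15}"]
    for name, mean_time, mean_distance in rows:
        lines.append(f"{name:<10}{mean_time:>15.2f}{mean_distance:>15.4f}")
    return "\n".join(lines)


# Intended use in the notebook:
# print(format_table(summarize(["tiny.en", "base.en", "small.en"], TIME, DISTANCE)))
```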
