# Transcribe Audio
In this notebook, we aim to:
1. Transcribe lead vocal audio files of a cappella songs into lyrics
2. Get phrase-level timestamps of the lyrics

We will use the OpenAI Whisper model to achieve this. The accuracy of the model will be evaluated using the Jaccard similarity score.

# Install required packages 

In [31]:
# !brew install ffmpeg
# !pip install setuptools-rust
# !pip install git+https://github.com/openai/whisper.git 



In [1]:
import warnings
warnings.filterwarnings('ignore')

In [35]:
import whisper

def transcribe_audio(audio_path, model="base"):
    """Transcribe an audio file; return the full text and phrase-level timestamps."""
    # `model` is a Whisper model name such as "base" or "large"
    whisper_model = whisper.load_model(model)
    result = whisper_model.transcribe(audio_path)
    # Keep only the start time, end time and text of each segment
    segments = [{"start": s["start"], "end": s["end"], "text": s["text"]} for s in result["segments"]]
    return {"text": result["text"], "segments": segments}
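
The phrase-level timestamps returned under `segments` can be rendered as readable lines, which is handy when checking them against the audio by ear. The sketch below is illustrative only: `format_timestamp` and `format_segments` are hypothetical helpers (not part of Whisper), and they assume each segment is a dict with `start` and `end` in seconds plus a `text` string, as built by `transcribe_audio`. The example segments are made-up values, not model output.

```python
def format_timestamp(seconds):
    """Convert seconds (float) to an MM:SS.cc string."""
    total_cs = int(round(seconds * 100))  # whole centiseconds, avoids float drift
    minutes, rem_cs = divmod(total_cs, 6000)
    secs, cs = divmod(rem_cs, 100)
    return f"{minutes:02d}:{secs:02d}.{cs:02d}"


def format_segments(segments):
    """Return one '[start --> end] text' line per segment."""
    return [
        f"[{format_timestamp(s['start'])} --> {format_timestamp(s['end'])}] {s['text'].strip()}"
        for s in segments
    ]


# Made-up segments for illustration (not real transcription results)
example_segments = [
    {"start": 0.0, "end": 7.5, "text": " first phrase"},
    {"start": 7.5, "end": 65.25, "text": " second phrase"},
]
for line in format_segments(example_segments):
    print(line)
```

In the notebook, the same helper could be applied to the `segments` list of any transcription result.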

# Test on the jaCappella file Harugakita

In [43]:
harugakita_path = "Dataset/jaCappella_v1.1.0/neutral/harugakita/lead_vocal.wav"
actual_harugakita_lyrics = "春が来た 春が来た どこに来た 山に来た 里に来た 野にも来た 鳥がなく 鳥がなく どこでなく 山でなく 里でなく 野でもなく 花がさく 花がさく どこにさく 山にさく 里にさく 野にもさく"

## Base model (~4.5s for a 50s Japanese wav audio file)

In [44]:
# Harugakita lyrics using base model
base_harugakita = transcribe_audio(harugakita_path, "base")

In [46]:
print(base_harugakita['segments'])

[{'start': 0.0, 'end': 13.14, 'text': 'ハルが来たハルが来たどこに来た'}, {'start': 13.14, 'end': 21.34, 'text': '山に来たさとに来たのにも来た'}, {'start': 21.34, 'end': 29.32, 'text': '花が咲く、花が咲くどこに咲く'}, {'start': 29.32, 'end': 38.0, 'text': '山に咲く、さとに咲くのにも咲く'}]


## Large model (~1min 30s for a 50s Japanese wav audio file)

In [47]:
# Harugakita lyrics using large model
large_harugakita = transcribe_audio(harugakita_path, "large")
print(large_harugakita['segments'])

[{'start': 0.0, 'end': 7.0, 'text': '春が来た 春が来た どこに来た'}, {'start': 7.0, 'end': 15.0, 'text': '山に来た 里に来た 野にも来た'}, {'start': 15.0, 'end': 22.0, 'text': '花が咲く 花が咲く どこに咲く'}, {'start': 22.0, 'end': 29.0, 'text': '山に咲く 里に咲く 野にも咲く'}]


# Test the accuracy of the models using Jaccard similarity
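
For reference, the Jaccard similarity of two sets $A$ and $B$ (here, the sets of character n-grams of the reference lyrics and of the transcript) is conventionally defined as:

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
$$

The score ranges from 0 (no shared n-grams) to 1 (identical sets). The second form follows from inclusion-exclusion and avoids building the union set explicitly.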

In [38]:
from nltk import ngrams

# Function to calculate Jaccard similarity over character n-grams
def jaccard_similarity(str1, str2, n=1):
    # A string is a sequence of characters, so ngrams() yields character n-grams
    set1 = set(ngrams(str1, n))
    set2 = set(ngrams(str2, n))

    # |A ∪ B| = |A| + |B| - |A ∩ B|
    intersection = len(set1.intersection(set2))
    union = len(set1) + len(set2) - intersection
    # Guard against division by zero when neither string yields an n-gram
    similarity = intersection / union if union else 0.0

    return similarity
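
To make the metric concrete, here is a small worked example on the first phrase of the song. It is a sketch that mirrors the character n-gram approach without depending on NLTK; `char_ngrams` and `char_jaccard` are hypothetical names introduced only for this illustration.

```python
def char_ngrams(text, n=1):
    """Return the set of character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def char_jaccard(str1, str2, n=1):
    """Jaccard similarity between the character n-gram sets of two strings."""
    set1, set2 = char_ngrams(str1, n), char_ngrams(str2, n)
    union = set1 | set2
    if not union:  # neither string is long enough to produce an n-gram
        return 0.0
    return len(set1 & set2) / len(union)


reference = "春が来た"
hypothesis = "ハルが来た"

# Unigrams: {春, が, 来, た} vs {ハ, ル, が, 来, た} share 3 of 6 distinct characters
print(char_jaccard(reference, hypothesis, n=1))  # 0.5
# Bigrams: {春が, が来, 来た} vs {ハル, ルが, が来, 来た} share 2 of 5 distinct bigrams
print(char_jaccard(reference, hypothesis, n=2))  # 0.4
```

Because the comparison is between sets, repeated phrases and word order do not affect the score, and a space or punctuation mark counts as a character like any other. A larger `n` makes the comparison stricter, since whole character sequences must match.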

In [50]:
print(f"Base model Jaccard similarity for jaCappella Harugakita audio file: {round(jaccard_similarity(actual_harugakita_lyrics, base_harugakita['text'], 1), 2)}")
print(f"Large model Jaccard similarity for jaCappella Harugakita audio file: {round(jaccard_similarity(actual_harugakita_lyrics, large_harugakita['text'], 1), 2)}")

Base model Jaccard similarity for jaCappella Harugakita audio file: 0.46
Large model Jaccard similarity for jaCappella Harugakita audio file: 0.74
