# Transcribe Audio
In this notebook, we aim to:
1. Transcribe lead vocal audio files of a cappella songs into lyrics
2. Get phrase-level timestamps of the lyrics

We use the OpenAI Whisper model to achieve this, and evaluate its transcriptions against the reference lyrics using the Jaccard similarity score.

# Install required packages 

In [5]:
# !brew install ffmpeg
# !pip install setuptools-rust
# !pip install git+https://github.com/openai/whisper.git 


In [6]:
import warnings
warnings.filterwarnings('ignore')

In [7]:
import whisper

# Library source code: https://github.com/openai/whisper
def transcribe_audio(audio_path, model_name="large"):
    # Returns a dictionary with the full text and the segments; the segments carry the phrase-level timestamps
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    extracted_data = [{"start": item["start"], "end": item["end"], "text": item["text"]} for item in result['segments']]
    return {'text': result['text'], 'segments': extracted_data}

# Test on jaCappella file Harugakita

In [8]:
harugakita_path = "Dataset/jaCappella_v1.1.0/neutral/harugakita/lead_vocal.wav"
actual_harugakita_lyrics = "春が来た 春が来た どこに来た 山に来た 里に来た 野にも来た 鳥がなく 鳥がなく どこでなく 山でなく 里でなく 野でもなく 花がさく 花がさく どこにさく 山にさく 里にさく 野にもさく"

## Base model (~4.5s for a 50s Japanese wav audio file)

In [9]:
# Harugakita lyrics using base model
base_harugakita = transcribe_audio(harugakita_path, "base")

In [46]:
print(base_harugakita['segments'])

[{'start': 0.0, 'end': 13.14, 'text': 'ハルが来たハルが来たどこに来た'}, {'start': 13.14, 'end': 21.34, 'text': '山に来たさとに来たのにも来た'}, {'start': 21.34, 'end': 29.32, 'text': '花が咲く、花が咲くどこに咲く'}, {'start': 29.32, 'end': 38.0, 'text': '山に咲く、さとに咲くのにも咲く'}]
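
The raw list of segment dictionaries is hard to scan. As a minimal sketch, the helper below renders each segment as one timestamped line. `to_timestamp` and `format_segments` are hypothetical names introduced here for illustration (they are not part of Whisper), and the sample segment is copied from the output above.

```python
def to_timestamp(seconds):
    # Work in whole centiseconds to avoid awkward float rounding (e.g. "60.00").
    total_cs = round(seconds * 100)
    minutes, cs = divmod(total_cs, 6000)
    return f"{minutes:02d}:{cs // 100:02d}.{cs % 100:02d}"


def format_segments(segments):
    # Each segment is assumed to be a dict with "start", "end" and "text" keys,
    # as returned by transcribe_audio in this notebook.
    lines = []
    for seg in segments:
        start = to_timestamp(seg["start"])
        end = to_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return "\n".join(lines)


sample = [{"start": 0.0, "end": 13.14, "text": "ハルが来たハルが来たどこに来た"}]
print(format_segments(sample))
# [00:00.00 -> 00:13.14] ハルが来たハルが来たどこに来た
```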


## Large model (~1min 30s for a 50s Japanese wav audio file)

In [40]:
# Harugakita lyrics using large model
large_harugakita = transcribe_audio(harugakita_path, "large")
print(large_harugakita['segments'])

[{'start': 0.0, 'end': 7.0, 'text': '春が来た 春が来た どこに来た'}, {'start': 7.0, 'end': 15.0, 'text': '山に来た 里に来た 野にも来た'}, {'start': 15.0, 'end': 22.0, 'text': '花が咲く 花が咲く どこに咲く'}, {'start': 22.0, 'end': 29.0, 'text': '山に咲く 里に咲く 野にも咲く'}]
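
The run times quoted in the two model headings can be reproduced with a small timing wrapper. This is only a sketch: `timed` is a hypothetical helper, and the original measurements may well have come from a cell magic such as `%%time` instead.

```python
import time


def timed(func, *args, **kwargs):
    # Call func and return its result together with the elapsed wall-clock seconds.
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the example runs without Whisper installed.
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.3f}s")
```

In this notebook the call would look like `timed(transcribe_audio, harugakita_path, "base")`. Because `transcribe_audio` loads the model on every call, a timing taken this way includes model loading as well as transcription.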


# Test the accuracy of the models using Jaccard similarity

In [37]:
from nltk import ngrams

# Function to calculate Jaccard similarity (as a percentage) over character n-grams
def jaccard_similarity(str1, str2, n=1):
    # Tokenize the strings into sets of character n-grams
    set1 = set(ngrams(str1, n))
    set2 = set(ngrams(str2, n))

    # Calculate Jaccard similarity: |intersection| / |union|
    intersection = len(set1.intersection(set2))
    union = len(set1) + len(set2) - intersection
    if union == 0:
        # Neither string is long enough to yield any n-grams
        return 0.0
    similarity = intersection / union

    return similarity * 100

In [41]:
print(f"Base model accuracy for Jacapella Harugakita audiofile: {round(jaccard_similarity(actual_harugakita_lyrics, base_harugakita['text'], 1),2)}%")
print(f"Large model accuracy for Jacapella Harugakita audiofile: {round(jaccard_similarity(actual_harugakita_lyrics, large_harugakita['text'], 1),2)}%")

Base model accuracy for Jacapella Harugakita audiofile: 45.83%
Large model accuracy for Jacapella Harugakita audiofile: 73.68%
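
To make these scores easier to interpret, here is a small worked example of character-level Jaccard similarity on the opening phrase. It is a sketch that uses plain Python sets instead of `nltk`, and `char_jaccard` is a hypothetical name. It assumes `n=1`, so each string is reduced to its set of distinct characters.

```python
def char_jaccard(str1, str2):
    # Sets of distinct characters: order and repetition are ignored.
    set1, set2 = set(str1), set(str2)
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / len(union) * 100


# Reference uses the kanji 春, the base model wrote it in katakana as ハル.
# Shared characters: が, 来, た (3); union: 春, ハ, ル, が, 来, た (6).
print(char_jaccard("春が来た", "ハルが来た"))  # 50.0

# Reordering the characters does not change the score.
print(char_jaccard("abc", "cba"))  # 100.0
```

The first case shows how a transcription that sounds correct is penalised when it is written in a different script. The second shows that the measure ignores word order, so it is best read as a rough overlap score and not as strict accuracy.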


In [1]:
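def wer(actual_lyrics, model_lyrics):
    # Split the lyrics into whitespace-separated words.
    # Note: Japanese is usually written without spaces, so these tokens are coarse.
    actual_words = actual_lyrics.split()
    model_words = model_lyrics.split()
    if not actual_words:
        raise ValueError("actual_lyrics must contain at least one word")

    # Word-level edit distance: the minimum number of insertions, deletions,
    # and substitutions, computed row by row with dynamic programming.
    previous = list(range(len(model_words) + 1))
    for i, actual_word in enumerate(actual_words, start=1):
        current = [i]
        for j, model_word in enumerate(model_words, start=1):
            cost = 0 if actual_word == model_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current

    # Calculate the word error rate as a percentage of the reference length.
    return previous[-1] / len(actual_words) * 100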
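
Because Japanese lyrics are mostly written without spaces, an error rate based on whitespace-split words sees only a handful of very long tokens. A character error rate (CER) may be more informative here. The following is a minimal sketch, not something the notebook itself computes. `edit_distance` and `cer` are hypothetical names, and dropping all whitespace before comparison is an assumption made for this example.

```python
def edit_distance(ref, hyp):
    # Levenshtein distance between two sequences, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    # Assumption: whitespace is not meaningful, so it is removed before scoring.
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref) * 100


# 春 -> ハ is one substitution and ル is one insertion: 2 edits over 4 characters.
print(cer("春が来た", "ハルが来た"))  # 50.0
```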

In [42]:
# Test on jaCappella file Kutsuganaru (popular subset)
kutsuganaru_path = "Dataset/jaCappella_v1.1.0/popular/kutsuganaru/lead_vocal.wav"
actual_kutsuganaru_lyrics = "お手（てて）つないで野道を行（ゆ）けばみんな可愛（かわ）い小鳥になつて 歌をうたへば靴が鳴る晴れたみ空に靴が鳴る 花をつんではお頭（つむ）にさせばみんな可愛（かわ）いうさぎになつてはねて踊れば靴が鳴る晴れたみ空に靴が鳴る"

In [43]:
base_kutsuganaru = transcribe_audio(kutsuganaru_path, "base")

In [44]:

print(base_kutsuganaru['text'])

漂いみんな可愛いことりになって歌を歌えば靴が鳴る晴れたみ空に靴が鳴る


In [45]:
large_kutsuganaru = transcribe_audio(kutsuganaru_path, "large")
print(large_kutsuganaru['text'])


おててつないでのみちをゆけばみんなかわいいことりになるおててつないでのみちをゆけばみんなかわいいことりになってうたをうたえばくつがなるはれたみそらにくつがなるたらたらたたらたたたたはー


In [46]:
# Jaccard similarity of both models on Kutsuganaru
print(f"Base model accuracy for Jacapella Kutsuganaru audiofile: {round(jaccard_similarity(actual_kutsuganaru_lyrics, base_kutsuganaru['text'], 1),2)}%")
print(f"Large model accuracy for Jacapella Kutsuganaru audiofile: {round(jaccard_similarity(actual_kutsuganaru_lyrics, large_kutsuganaru['text'], 1),2)}%")

Base model accuracy for Jacapella Kutsuganaru audiofile: 36.54%
Large model accuracy for Jacapella Kutsuganaru audiofile: 36.84%
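
The reference lyrics for Kutsuganaru contain readings in full-width parentheses, such as （てて）, which the model is not expected to transcribe. They probably lower both scores. As a hedged sketch of one possible clean-up step, the hypothetical helper `strip_readings` removes those annotations before scoring. It does not address the separate kanji-versus-kana mismatch visible in the large model's output.

```python
import re

# Matches a full-width opening parenthesis, any characters that are not a
# full-width closing parenthesis, then the closing parenthesis.
READING_PATTERN = re.compile(r"（[^）]*）")


def strip_readings(lyrics):
    # Remove parenthesised readings such as （てて） from the reference lyrics.
    return READING_PATTERN.sub("", lyrics)


print(strip_readings("お手（てて）つないで野道を行（ゆ）けば"))
# お手つないで野道を行けば
```

The cleaned string could then be passed to `jaccard_similarity` in place of `actual_kutsuganaru_lyrics`. The effect on the scores has not been measured here.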
