# **Karaoke Scoring System**

### **Overview:**
The Karaoke Scoring System is meticulously designed to evaluate a user's singing performance against an original track. Utilizing advanced audio processing techniques and alignment strategies, it delivers precise and insightful scoring, ensuring users gain deep insights into their performance.

### **KaraokeData:**
At the core of our system is the `KaraokeData` class, serving as the single access point for essential data for a particular song: the original singer's audio, the instrumental track, and synchronized lyrics. Beyond just storage, this class adeptly parses lyrics into a structured format, ensuring time-specific lyric extraction, which is paramount for aligning user feedback with distinct moments in the song.

#### **Utilization Within KaraokeData:**
- The **original singer's audio** sets the standard for user performance comparisons.
- The **instrumental track** is instrumental in audio preprocessing, aiding in identifying and attenuating background noises.
- **Synchronized lyrics** enhance the user experience, providing context to the feedback and ensuring precision in alignment.

### **AudioPreprocessor:**
The `AudioPreprocessor` class refines the user's audio through:
1. **Normalization**: Adjusting the audio to have zero mean and unit variance.
2. **Silence Trimming**: Removing any leading and trailing silences from the user's audio.
3. **Spectral Gate**: Filtering out frequencies below a threshold, significantly reducing low-level noise.
4. **Adaptive Noise Reduction**: Harnessing the instrumental track to pinpoint and eliminate background noise from the user's audio.
5. **Voice Activity Detection (VAD)**: Spotting segments where the user is actively singing, ensuring the vocal's prominence over potential background disturbances.

### **Scoring Mechanisms:**
Our system leverages diverse metrics to deliver a well-rounded evaluation:
1. **Linguistic Accuracy Score**: Employs Google's Speech Transcription service to transcribe the user's audio to text. This transcribed text is then matched with the original lyrics, determining pronunciation and word accuracy.
2. **Amplitude Matching Score**: Utilizes Dynamic Time Warping (DTW) to compare amplitude profiles between the user's audio and the original.
3. **Pitch Matching Score**: Investigates the fundamental frequency contours of both the user's and original audio, ensuring tonal alignment.
4. **Rhythm Score**: Contrasts onset patterns between the user's performance and the original, assessing synchronization and timing.

In [1]:
import os
import librosa
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


In [2]:
from audio_vis import AudioVis
from pipeline import Pipeline
from karaoke_data import KaraokeData
from audio_scorer import AudioScorer
from audio_preprocessor import AudioPreprocessor
from google_speech import GoogleSpeechTranscription
from whisper import WhisperSpeechTranscription

av = AudioVis()
warnings.filterwarnings("ignore")


## Load data

In [3]:
base_dir = "../data/KaraokeData/"
lyrics_dir = os.path.join(base_dir, "SongsLyrics", "Lyrics")
track_dir = os.path.join(base_dir, "SongsLyrics", "Track")
voice_dir = os.path.join(base_dir, "SongsLyrics", "Voice")

data_dict = {}

def add_to_data_dict(directory, key):
    for file in os.listdir(directory):
        if file.endswith(".wav") or file.endswith(".mp3") or file.endswith("_ko_lyrics.csv"):
            if directory == voice_dir:
                if "voice_1" in file:
                    song_id = file.split('_')[0]
                    key = "Original"
                elif "voice_2" in file:
                    song_id = file.split('_')[0]
                    key = "Original Second"
                else:
                    song_id = os.path.splitext(file)[0]
            elif directory == lyrics_dir:
                song_id = file.split('_ko_lyrics.csv')[0]
            else:
                song_id = os.path.splitext(file)[0]

            data_dict.setdefault(song_id, {})
            data_dict[song_id][key] = os.path.join(directory, file)

# Populate the dictionary
add_to_data_dict(base_dir, "Attempted")
add_to_data_dict(lyrics_dir, "Lyrics")
add_to_data_dict(track_dir, "Track")
add_to_data_dict(voice_dir, "Original")

# Print a sample
print(data_dict.get("42029", {}))


{'Attempted': '../data/KaraokeData/42029.wav', 'Lyrics': '../data/KaraokeData/SongsLyrics/Lyrics/42029_ko_lyrics.csv', 'Track': '../data/KaraokeData/SongsLyrics/Track/42029.mp3', 'Original': '../data/KaraokeData/SongsLyrics/Voice/42029.mp3'}


In [4]:
# Extract usable song IDs
required_keys = {"Attempted", "Lyrics", "Track", "Original"}
usable_ids = [song_id for song_id, data in data_dict.items() if required_keys.issubset(data.keys())]

# Print the results
print(f"Number of id's with all files: {len(usable_ids)}")
print(usable_ids)


Number of id's with all files: 13
['42029', '44924', '49032', '58659', '36520', '44957', '57730', '42113', '34302', '51837', '63172', '27256', '45000']


In [5]:
def get_song_data(song_id):
    """Load audio data for a given song ID."""

    song_data = data_dict.get(song_id, {})
    if "Attempted" in song_data and "Lyrics" in song_data and "Track" in song_data and ("Original" in song_data or "Original Second" in song_data):
        original_audio, sr = librosa.load(song_data['Original'], sr=None, mono=True)
        attempted_audio, sr = librosa.load(song_data['Attempted'], sr=None, mono=True)
        track_audio, sr = librosa.load(song_data['Track'], sr=None, mono=True)
        return original_audio, attempted_audio, track_audio, song_data['Lyrics'], sr
    else:
        return None

# Get the song data
song_data = get_song_data("44957")
print(song_data)


(array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), array([0.        , 0.        , 0.        , ..., 0.9999695 , 0.952301  ,
       0.90756226], dtype=float32), array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), '../data/KaraokeData/SongsLyrics/Lyrics/44957_ko_lyrics.csv', 44100)


In [6]:
original_audio, attempted_audio, track_audio, raw_lyrics_data, sr = get_song_data("27256")


In [7]:
# %% skip_cell

av.play_audio(original_audio, sr)
av.wav_plot(original_audio, sr, title="Original Audio")
av.plot_spectrogram(original_audio, sr, title="Original Audio")
av.plot_log_spectrogram(original_audio, sr, title="Original Audio")
av.plot_mfcc(original_audio, sr, title="Original Audio")
av.plot_psd(original_audio, sr, title="Original Audio")

av.play_audio(attempted_audio, sr)
av.wav_plot(attempted_audio, sr, title="Attempted Audio")
av.plot_spectrogram(attempted_audio, sr, title="Attempted Audio")
av.plot_log_spectrogram(attempted_audio, sr, title="Attempted Audio")
av.plot_mfcc(attempted_audio, sr, title="Attempted Audio")
av.plot_psd(attempted_audio, sr, title="Attempted Audio")

av.play_audio(track_audio, sr)
av.wav_plot(track_audio, sr, title="Track Audio")
av.plot_spectrogram(track_audio, sr, title="Track Audio")
av.plot_log_spectrogram(track_audio, sr, title="Track Audio")
av.plot_mfcc(track_audio, sr, title="Track Audio")
av.plot_psd(track_audio, sr, title="Track Audio")


In [7]:
# To simulate receiving audio in chunks, I have created split_into_chunks
def split_into_chunks(audio, num_chunks=5):
    """Splits the audio data into a specified number of chunks."""
    chunk_size = len(audio) // num_chunks
    chunks = [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]
    return chunks[:num_chunks]


## KaraokeData

In [8]:
# Initialize KaraokeData
karaoke_data = KaraokeData(original_audio=original_audio, track_audio=track_audio, raw_lyrics_data=raw_lyrics_data, sampling_rate=sr)


In [9]:
chunks = split_into_chunks(attempted_audio, 10)
chunk = chunks[0]


### Parsing Lyrics:

In [10]:
parsed_lyrics = karaoke_data.lyrics_data
for i in range(10):
  print(parsed_lyrics[i])


{'time': 13.76, 'lyrics': 'Ha'}
{'time': 14.16, 'lyrics': 'ppi'}
{'time': 14.36, 'lyrics': 'ness'}
{'time': 15.56, 'lyrics': 'hit'}
{'time': 15.96, 'lyrics': 'her'}
{'time': 17.16, 'lyrics': 'like'}
{'time': 17.55, 'lyrics': 'a'}
{'time': 17.96, 'lyrics': 'train'}
{'time': 18.76, 'lyrics': 'on'}
{'time': 18.96, 'lyrics': 'a'}


### Audio Alignment

In [11]:
karaoke_data.reset_alignment()  # Resetting any prior alignments
start_singing_position = karaoke_data.align_audio(chunk, method="start")
print(f"Position after start alignment: {karaoke_data.current_position}")


Position after start alignment: 0


Align Using Lyrics Data: This method uses the first entry in the parsed lyrics data to align the audio.

In [13]:
# %% skip_cell

karaoke_data.reset_alignment()  # Resetting any prior alignments
start_singing_position = karaoke_data.align_audio(chunk, method="lyrics_data")
print(f"Position after lyrics data alignment: {karaoke_data.current_position}")


Position after lyrics data alignment: 606816


Align Using Onset Detection:
This method aligns the audio by detecting onsets in both the original audio and the provided audio chunk. It then attempts to align the first onset of the chunk with the corresponding onset in the original.

In [14]:
# %% skip_cell

karaoke_data.reset_alignment()  # Resetting any prior alignments
start_singing_position = karaoke_data.align_audio(chunk, method="onset_detection")
print(f"Position after onset detection alignment: {karaoke_data.current_position}")


Position after onset detection alignment: 583680


### Audio Segment Retrieval:

In [12]:
segment_length = len(chunk)  # Using the length of the first audio chunk
retrieved_original_segment, retrieved_track_segment = karaoke_data.get_next_segment(segment_length)


In [16]:
# %% skip_cell

av.play_audio(chunk, sr)
av.wav_plot(chunk, sr, title="Chunk Audio")

av.play_audio(retrieved_original_segment, sr)
av.wav_plot(retrieved_original_segment, sr, title="Original Audio")

av.play_audio(retrieved_track_segment, sr)
av.wav_plot(retrieved_track_segment, sr, title="Track Audio")


In [13]:
segment_lyrics = karaoke_data.get_lyrics()
print(segment_lyrics)


Ha ppi ness hit her like a train on a track



## Preprocessing Audio Chunks

In [12]:
ap = AudioPreprocessor()


In [13]:
def demonstrate_effect(before, after, sr, effect_name, visualization_functions):
    """
    Demonstrates the effect of a preprocessing function by playing and visualizing:
    - The original audio
    - The processed audio
    - (Optional) The removed audio (difference between the original and processed audio)
    - Visualizations specified in visualization_functions for each of the audios
    """
    # Play original audio
    print(f"Original Audio ({effect_name}):")
    av.play_audio(before, sr)

    # Play processed audio
    print(f"\nTransformed Audio ({effect_name}):")
    av.play_audio(after, sr)

    same_length = len(before) == len(after)

    # If the lengths are the same, play the difference audio
    if same_length:
        difference = before - after
        print(f"\nRemoved Audio ({effect_name}):")
        av.play_audio(difference, sr)

    # Display visualizations
    for viz_func in visualization_functions:
        print(f"\nOriginal Audio - {effect_name}:")
        viz_func(before, sr)

        print(f"\nTransformed Audio - {effect_name}:")
        viz_func(after, sr)

        # If the lengths are the same, visualize the difference audio
        if same_length:
            print(f"\nDifference - {effect_name}:")
            viz_func(difference, sr)


### Trim Audio

Description: Trimming silences involves removing any leading or trailing silent parts from an audio signal. This can be useful to eliminate unnecessary silent portions which don't contribute to the actual content.

Implementation: The trim_audio function uses the librosa.effects.trim function to achieve this. The top_db parameter defines a threshold in decibels below which the audio is considered silent.

In [20]:
# vf = [av.wav_plot, av.plot_spectrogram, av.plot_mfcc]
vf = [av.wav_plot]
trimmed_chunk = ap.trim_audio(chunk)
demonstrate_effect(chunk, trimmed_chunk, sr, "Trimming", vf)


Original Audio (Trimming):



Transformed Audio (Trimming):



Original Audio - Trimming:



Transformed Audio - Trimming:


### Normalize Audio

Description: Normalization adjusts the audio amplitude so that its average amplitude is zero, and its standard deviation is one. This ensures that the audio's loudness is relatively consistent, which can be beneficial for further processing or analysis.

Implementation: The _normalize_segment function subtracts the mean from the audio segment and then divides by the standard deviation. The normalize_audio function can normalize the entire audio or perform segment-wise normalization if a segment_length is provided.

In [17]:
vf = [av.wav_plot]
normalized_chunk = ap.normalize_audio(chunk)
demonstrate_effect(chunk, normalized_chunk, sr, "Normalization", vf)


Original Audio (Normalization):



Transformed Audio (Normalization):



Removed Audio (Normalization):



Original Audio - Normalization:



Transformed Audio - Normalization:



Difference - Normalization:


### Spectral Gate

Description: This involves suppressing frequency components of the signal below a certain threshold. It helps in reducing noise or undesired frequencies.

Implementation: In the spectral_gate function, an STFT (Short-Time Fourier Transform) is performed, and any frequencies below the threshold are set to zero. The processed signal is then reconstructed using the inverse STFT.

In [22]:
spectral_gated_chunk = ap.spectral_gate(chunk, threshold=0.1)
demonstrate_effect(chunk, spectral_gated_chunk, sr, "Spectral Gating", vf)


Original Audio (Spectral Gating):



Transformed Audio (Spectral Gating):



Original Audio - Spectral Gating:



Transformed Audio - Spectral Gating:


### Adaptive Noise Reduction

Description: Adaptive noise reduction aims to reduce background noise from the user's audio using a reference (typically the instrumental track). By comparing the reference track with the user's audio, it identifies and subtracts common background elements, reducing interference or bleed from the instrumental.

Implementation: In the given code, the method named spectral_masking is used for this purpose. It calculates a mask based on the ratio of magnitudes of the user audio to the combined magnitudes of the user and reference audios. This mask, when applied to the user's audio STFT, emphasizes the parts where the user's audio is dominant (like vocals) and suppresses the parts that are common with the reference (like instrumental bleed).

In [23]:
adaptively_reduced_chunk = ap.adaptive_noise_reduction(chunk, retrieved_track_segment, sr)
demonstrate_effect(chunk, adaptively_reduced_chunk, sr, "Adaptive Noise Reduction", vf)


Original Audio (Adaptive Noise Reduction):



Transformed Audio (Adaptive Noise Reduction):



Original Audio - Adaptive Noise Reduction:



Transformed Audio - Adaptive Noise Reduction:


### Voice Activity Detection

Description: VAD is employed to detect when a person is speaking/singing in an audio clip. This is valuable when you want to separate or focus on vocal content and exclude long silences or background noise.

Implementation: The voice_activity_detection function uses the librosa.effects.split function, which identifies segments of the signal that are above a certain loudness threshold.

In [24]:
vad_chunk = ap.voice_activity_detection(chunk, sr, top_db=5)  # Adjust the top_db value as needed
demonstrate_effect(chunk, vad_chunk, sr, "Voice Activity Detection", vf)


Original Audio (Voice Activity Detection):



Transformed Audio (Voice Activity Detection):



Original Audio - Voice Activity Detection:



Transformed Audio - Voice Activity Detection:


### Source Separation

Description:
Source separation aims to distinguish different sources within an audio signal. In this method, Harmonic/Percussive Source Separation (HPSS) is employed. HPSS works by analyzing the audio and determining which parts of the signal are steady (harmonic) and which parts are transient (percussive). This method is particularly useful for separating melodic content (like vocals or instrumental solos) from rhythmic content (like drums or percussions).

Implementation:
In the source_separation method, Librosa's hpss function is used to separate the harmonic and percussive components of the input audio chunk. The harmonic component, which corresponds to the melodic content, is returned, thus effectively filtering out the rhythmic or percussive elements of the audio.

In [18]:
source_separated_chunk = ap.source_separation(chunk, sr)
demonstrate_effect(chunk, source_separated_chunk, sr, "Source Separation", vf)


Original Audio (Source Separation):



Transformed Audio (Source Separation):



Removed Audio (Source Separation):



Original Audio - Source Separation:



Transformed Audio - Source Separation:



Difference - Source Separation:


### Spectral Masking

Description: Spectral masking emphasizes certain frequency components based on a reference signal. This can help in reducing interference or background sounds.

Implementation: The spectral_masking function calculates a mask based on the ratio of magnitudes of the user audio to the sum of magnitudes of the user and reference audios. This mask is then applied to the user's audio STFT, and the processed audio is reconstructed.

In [29]:
masked_chunk = ap.spectral_masking(chunk, retrieved_track_segment)
demonstrate_effect(chunk, masked_chunk, sr, "Spectral Masking", vf)


Original Audio (Spectral Masking):



Transformed Audio (Spectral Masking):



Original Audio - Spectral Masking:



Transformed Audio - Spectral Masking:


### Pipeline

In [19]:
def demonstrate_pipeline(audio_chunk, pipeline, sr, **kwargs):
    """Demonstrates the effect of a preprocessing pipeline."""
    processed_audio = AudioPreprocessor.preprocess_audio(audio_chunk, pipeline, **kwargs)
    pipeline_name = " -> ".join(pipeline)
    vf = [av.wav_plot]
    demonstrate_effect(audio_chunk, processed_audio, sr, pipeline_name, vf)

# Define the pipelines
pipeline_1 = ["normalize"]
pipeline_2 = ["adaptive_noise_reduction", "normalize"]
pipeline_3 = ["adaptive_noise_reduction", "source_separation", "normalize"]

# Additional arguments for the pipelines
pipeline_args = {
    "reference_audio": retrieved_track_segment
    # You can add more arguments for other steps here as required
}

# Apply and demonstrate each pipeline
demonstrate_pipeline(chunk, pipeline_1, sr)
demonstrate_pipeline(chunk, pipeline_2, sr, **pipeline_args)
demonstrate_pipeline(chunk, pipeline_3, sr, **pipeline_args)


Original Audio (normalize):



Transformed Audio (normalize):



Removed Audio (normalize):



Original Audio - normalize:



Transformed Audio - normalize:



Difference - normalize:


Original Audio (adaptive_noise_reduction -> normalize):



Transformed Audio (adaptive_noise_reduction -> normalize):



Original Audio - adaptive_noise_reduction -> normalize:



Transformed Audio - adaptive_noise_reduction -> normalize:


Original Audio (adaptive_noise_reduction -> source_separation -> normalize):



Transformed Audio (adaptive_noise_reduction -> source_separation -> normalize):



Original Audio - adaptive_noise_reduction -> source_separation -> normalize:



Transformed Audio - adaptive_noise_reduction -> source_separation -> normalize:


## AudioScorer

**Linguistic Accuracy**: The transcription is used to determine how closely the song content matches the actual lyrics. This is a `qualitative measure`.

**Amplitude, Pitch, and Rhythm Matching**: These are `quantitative measures`. They compare the user's sung audio features with the reference (original) audio. 

In [14]:
# transcriber = GoogleSpeechTranscription()
transcriber = WhisperSpeechTranscription()

#fastdtw is suppose to be much faster but has bug
audio_scorer = AudioScorer(transcriber, 'dtaidistance_fast')


###  Linguistic Accuracy Score

In [15]:
print(karaoke_data.get_lyrics())


Ha ppi ness hit her like a train on a track



In [16]:
transcriber.transcribe(chunk, sr)


'🎵Happiness, hit her, like a train on a track🎵'

The problem here is because the audio is long, for short audio this will work fine

In [18]:
kwargs = {'sr': sr, 'actual_lyrics': segment_lyrics, 'from_file': True}
linguistic_score = audio_scorer.linguistic_accuracy_score(chunk, kwargs=kwargs)
print(f"Linguistic Accuracy Score: {linguistic_score:.2f}")


ERROR:root:Linguistic accuracy computation failed: unsupported operand type(s) for *: 'NoneType' and 'int'


Linguistic Accuracy Score: 0.00


### Rhythm Score:

**Explanation**: Rhythm score quantifies how closely the rhythm of a user's audio matches a reference audio. It can be computed using onset strength, which is a measure of the abruptness of sound changes.

**Implementation**: It compute onset strength for both user audio and reference audio using the `librosa.onset.onset_strength` function. It then computes the Dynamic Time Warping (DTW) similarity between these onset strength sequences to generate a rhythm score.

In [37]:
rhythm_score = audio_scorer.rhythm_score(np.array(chunk), retrieved_original_segment)
print("Rhythm Score:", rhythm_score)




Rhythm Score: 0.9905208911994058


###  Pitch Matching Score:

**Explanation**: Pitch matching score assesses how closely the pitch contour of a user's audio aligns with that of a reference audio. Pitch contour is the variation of pitch over time.

**Implementation**: Uses the `librosa.pyin` function to extract pitch sequences from the user audio and reference audio. It then computes the DTW similarity between these pitch sequences to yield the pitch matching score.

In [38]:
pitch_score = audio_scorer.pitch_matching_score(chunk, retrieved_original_segment)
print("Pitch Matching Score:", pitch_score)


Pitch Matching Score: 0.6381186807369086


### Amplitude Matching Score

**Explanation**: Amplitude matching score evaluates how well the amplitude envelope of a user's audio matches that of a reference audio.

**Implementation**: Flattens the multi-dimensional audio arrays to 1D using `numpy.flatten`, then computes the DTW similarity between these 1D amplitude sequences to derive the amplitude matching score.

In [39]:
amplitude_score = audio_scorer.amplitude_matching_score(chunk, retrieved_original_segment, sr)
print("Amplitude Matching Score:", amplitude_score)




Amplitude Matching Score: 0.9999227028007309


## Full Pipeline

In [26]:
pipelines = {
    "linguistic_accuracy_score": {
        "chunk": [],
        "original": []
    },
    "linguistic_similarity_score": {
        "chunk": [],
        "original": []
    },
    "amplitude_score": {
        "chunk": [],
        "original": []
    },
    "pitch_score": {
        "chunk": ["adaptive_noise_reduction", "spectral_gate", "normalize"],
        "original": ["spectral_gate", "normalize"]
    },
    "rhythm_score": {
        "chunk": ["adaptive_noise_reduction", "spectral_gate", "normalize"],
        "original": ["spectral_gate", "normalize"]
    },
}


In [18]:
ids = ['27256', '58659']

for song_id in ids:
    original_audio, attempted_audio, track_audio, raw_lyrics_data, sr = get_song_data(song_id)

    # Initialize the pipeline
    pipeline = Pipeline(original_audio=original_audio, track_audio=track_audio, raw_lyrics_data=raw_lyrics_data, sr=sr, pipelines=pipelines)

    # Assuming you want to process chunks of the attempted_audio
    chunk_size_seconds = 20
    chunk_size_samples = chunk_size_seconds * sr

    # Split attempted_audio into chunks and process
    for i in range(0, int(len(attempted_audio)/chunk_size_samples), 1):
        if i % 3 != 0 and i != 0:
            pipeline.karaoke_data.get_next_segment(chunk_size_samples)
            continue

        # ---------------------------------- Debugging ----------------------------------
        print("\n" + "🎵" + "🎶" * 18 + "🎵")
        print(f"🎤 Processing Audio Chunk {i+1} 🎤")
        print("🎵" + "🎶" * 18 + "🎵" + "\n")


        chunk = attempted_audio[i*chunk_size_samples:(i+1)*chunk_size_samples]
        scores = pipeline.process_and_score(chunk)
        scores, feedback = pipeline.process_and_score(chunk)
        print("\n" + "🌟" * 30)
        print(f"🎤 Scores for Chunk at {i/sr:.2f} seconds 🎤")
        print("🌟" + "─" * 28 + "🌟")
        for score_type, score_value in scores.items():
            print(f"🎵 {score_type.replace('_', ' ').title()}: {score_value:.2f}")
        print("\nFeedback: " + feedback)
        print("🌟" * 30 + "\n")
        print("\n" + "🔹" * 40 + "\n")
        # ---------------------------------- Debugging ----------------------------------

    # Get average scores
    average_scores = pipeline.get_average_scores()
    print("\n" + "🎉" * 30)
    print(f"🏆 Average Scores for Song {song_id} 🏆")
    print("🎉" + "─" * 28 + "🎉")
    for score_type, score_value in average_scores.items():
        print(f"🎶 {score_type.replace('_', ' ').title()}: {score_value:.2f}")
    print("🎉" * 30 + "\n")



🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵
🎤 Processing Audio Chunk 1 🎤
🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵





Scores: {'linguistic_accuracy_score': 0.8863636363636364, 'linguistic_similarity_score': 0.0, 'amplitude_score': 0.998670979538087, 'pitch_score': 0.15672890500563721, 'rhythm_score': 0.9939604878392587}




Scores: {'linguistic_accuracy_score': 0.2753623188405797, 'linguistic_similarity_score': 0.0, 'amplitude_score': 0.9981458507023967, 'pitch_score': 0.32203726238644603, 'rhythm_score': 0.9853766069539406}

🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟
🎤 Scores for Chunk at 0.00 seconds 🎤
🌟────────────────────────────🌟
🎵 Linguistic Accuracy Score: 0.28
🎵 Linguistic Similarity Score: 0.00
🎵 Amplitude Score: 1.00
🎵 Pitch Score: 0.32
🎵 Rhythm Score: 0.99

Feedback: 🎤 Hmm, your phrasing seems a bit different from the original. Listen closely to the original singer's style and try to emulate it! 🎶
🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟


🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹


🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵
🎤 Processing Audio Chunk 4 🎤
🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵





Scores: {'linguistic_accuracy_score': 0.2928870292887029, 'linguistic_similarity_score': 0.2931937172774869, 'amplitude_score': 0.9986744156349543, 'pitch_score': 0.5956255615904297, 'rhythm_score': 0.9867140461146159}




Scores: {'linguistic_accuracy_score': 0.24, 'linguistic_similarity_score': 0.0, 'amplitude_score': 0.9984067198028099, 'pitch_score': 0.36154666859041257, 'rhythm_score': 0.9765363606692578}

🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟
🎤 Scores for Chunk at 0.00 seconds 🎤
🌟────────────────────────────🌟
🎵 Linguistic Accuracy Score: 0.24
🎵 Linguistic Similarity Score: 0.00
🎵 Amplitude Score: 1.00
🎵 Pitch Score: 0.36
🎵 Rhythm Score: 0.98

Feedback: 🎤 Hmm, your phrasing seems a bit different from the original. Listen closely to the original singer's style and try to emulate it! 🎶
🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟


🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹


🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵
🎤 Processing Audio Chunk 7 🎤
🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵





Scores: {'linguistic_accuracy_score': 0.2417582417582418, 'linguistic_similarity_score': 0.23529411764705888, 'amplitude_score': 0.9985030640113264, 'pitch_score': 0.45221527411268797, 'rhythm_score': 0.9882244812059702}




Scores: {'linguistic_accuracy_score': 0.1145374449339207, 'linguistic_similarity_score': 0.0, 'amplitude_score': 0.9985269962314716, 'pitch_score': 0.36015975084982005, 'rhythm_score': 0.9884078853177477}

🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟
🎤 Scores for Chunk at 0.00 seconds 🎤
🌟────────────────────────────🌟
🎵 Linguistic Accuracy Score: 0.11
🎵 Linguistic Similarity Score: 0.00
🎵 Amplitude Score: 1.00
🎵 Pitch Score: 0.36
🎵 Rhythm Score: 0.99

Feedback: 🎤 Hmm, your phrasing seems a bit different from the original. Listen closely to the original singer's style and try to emulate it! 🎶
🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟


🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹


🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵
🎤 Processing Audio Chunk 10 🎤
🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵



ERROR:root:Linguistic accuracy computation failed: division by zero


ZeroDivisionError: division by zero

In [27]:
ids = ['27256', '58659']

for song_id in ids:
    original_audio, attempted_audio, track_audio, raw_lyrics_data, sr = get_song_data(song_id)

    # Initialize the pipeline
    pipeline = Pipeline(original_audio=original_audio, track_audio=track_audio, raw_lyrics_data=raw_lyrics_data, sr=sr, pipelines=pipelines)

    # Assuming you want to process chunks of the attempted_audio
    chunk_size_seconds = 20
    chunk_size_samples = chunk_size_seconds * sr

    # Split attempted_audio into chunks and process
    for i in range(0, int(len(attempted_audio)/chunk_size_samples), 1):
        if i % 3 != 0 and i != 0:
            pipeline.karaoke_data.get_next_segment(chunk_size_samples)
            continue

        # ---------------------------------- Debugging ----------------------------------
        print("\n" + "🎵" + "🎶" * 18 + "🎵")
        print(f"🎤 Processing Audio Chunk {i+1} 🎤")
        print("🎵" + "🎶" * 18 + "🎵" + "\n")


        chunk = attempted_audio[i*chunk_size_samples:(i+1)*chunk_size_samples]
        scores = pipeline.process_and_score(chunk)
        scores, feedback = pipeline.process_and_score(chunk)
        print("\n" + "🌟" * 30)
        print(f"🎤 Scores for Chunk at {i/sr:.2f} seconds 🎤")
        print("🌟" + "─" * 28 + "🌟")
        for score_type, score_value in scores.items():
            print(f"🎵 {score_type.replace('_', ' ').title()}: {score_value:.2f}")
        print("\nFeedback: " + feedback)
        print("🌟" * 30 + "\n")
        print("\n" + "🔹" * 40 + "\n")
        # ---------------------------------- Debugging ----------------------------------

    # Get average scores
    average_scores = pipeline.get_average_scores()
    print("\n" + "🎉" * 30)
    print(f"🏆 Average Scores for Song {song_id} 🏆")
    print("🎉" + "─" * 28 + "🎉")
    for score_type, score_value in average_scores.items():
        print(f"🎶 {score_type.replace('_', ' ').title()}: {score_value:.2f}")
    print("🎉" * 30 + "\n")



🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵
🎤 Processing Audio Chunk 1 🎤
🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵





Scores: {'linguistic_accuracy_score': 0.8863636363636364, 'linguistic_similarity_score': 0.0, 'amplitude_score': 0.9999448244110144, 'pitch_score': 0.15672890500563721, 'rhythm_score': 0.9939604878392587}




Scores: {'linguistic_accuracy_score': 0.2753623188405797, 'linguistic_similarity_score': 0.0, 'amplitude_score': 0.9999054876534426, 'pitch_score': 0.32203726238644603, 'rhythm_score': 0.9853766069539406}

🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟
🎤 Scores for Chunk at 0.00 seconds 🎤
🌟────────────────────────────🌟
🎵 Linguistic Accuracy Score: 0.28
🎵 Linguistic Similarity Score: 0.00
🎵 Amplitude Score: 1.00
🎵 Pitch Score: 0.32
🎵 Rhythm Score: 0.99

Feedback: 🎤 Hmm, your phrasing seems a bit different from the original. Listen closely to the original singer's style and try to emulate it! 🎶
🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟


🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹🔹


🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵
🎤 Processing Audio Chunk 4 🎤
🎵🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎵





SystemError: CPUDispatcher(<function _viterbi at 0x16de41ea0>) returned a result with an exception set