<a href="https://colab.research.google.com/github/JAZ-CO/3D/blob/main/Research.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Mispronunciation Detection and Diagnosis (MDD) for Arabic Learners using Generative AI**

**Outline:**


*  Explore AI tools for Arabic Speech Recognition:
    1. Azure
    2. Google
    3. OpenAI Whisper
*  Datasets:
    1. ASMDD: https://drive.google.com/drive/folders/1dhlp-L0n6_RAzoosVK4bRa7hxBnzebqs
    2. MGB-3: https://huggingface.co/datasets/MightyStudent/Egyptian-ASR-MGB-3/viewer?views%5B%5D=train
    3. QASR or any dataset from news
*  Benchmark each tool with metrics:
      1.   Word Error Rate (WER): Measures transcription errors
      2.   Character Error Rate (CER): Useful for languages with complex scripts
      3.   Speaker Diarization Error Rate (DER): Accuracy in differentiating speakers
      4.   Processing Speed (Latency): Time taken for transcription
      5.   Noise Handling: Performance in different noise environments
      6.   Accent and Dialect Support: Accuracy for diverse accents
      7.   CPU/GPU Usage: Efficiency in processing
      8.   Memory Consumption: RAM requirements
      9.   Cloud vs. Local Processing
      10.  Cost Comparison

*  Conclusion
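
WER and CER in the list above are both edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length (in words for WER, in characters for CER). The notebook relies on `jiwer` for this; the following is only a from-scratch sketch of the same idea, and the helper names `edit_distance`, `word_error_rate`, and `char_error_rate` are illustrative, not part of any library.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "السلام عليكم ورحمة الله وبركاته"
hypothesis = "السلام عليكم ورحمة الله"
print(word_error_rate(reference, hypothesis))  # one of five words is missing
print(char_error_rate(reference, hypothesis))
```

Because both values are ratios against the reference length, they are fractions rather than percentages, and they can exceed 1.0 when the hypothesis contains many insertions.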




In [None]:
# Install necessary libraries
!pip install azure-cognitiveservices-speech

Collecting azure-cognitiveservices-speech
  Downloading azure_cognitiveservices_speech-1.42.0-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Downloading azure_cognitiveservices_speech-1.42.0-py3-none-manylinux1_x86_64.whl (39.7 MB)
Installing collected packages: azure-cognitiveservices-speech
Successfully installed azure-cognitiveservices-speech-1.42.0


In [None]:
!pip install jiwer

Collecting jiwer
  Downloading jiwer-3.1.0-py3-none-any.whl.metadata (2.6 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.12.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading jiwer-3.1.0-py3-none-any.whl (22 kB)
Downloading rapidfuzz-3.12.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: rapidfuzz, jiwer
Successfully installed jiwer-3.1.0 rapidfuzz-3.12.2


In [None]:
import time

def get_system_usage():
    """Return a high-resolution timestamp in seconds.

    Despite its name, this helper only reads time.perf_counter(). Latency is
    obtained by subtracting two successive calls; CPU, GPU, and memory usage
    are not measured yet.
    """
    return time.perf_counter()

# Intended usage for latency:
# start = get_system_usage()
# ... run the transcription ...
# elapsed_seconds = get_system_usage() - start
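
The outline lists CPU usage and memory consumption as metrics, while the helper above only reads a timer. A minimal sketch of how those could be captured with the standard library alone is shown below. It swaps in `time.process_time` and `tracemalloc` rather than `psutil`, so it is an assumption about one possible approach, not the notebook's method; the `measure` context manager and its result keys are hypothetical names. Note the limits: `tracemalloc` only sees Python-level allocations (not native PyTorch buffers or GPU memory), and CPU time excludes time spent waiting on a cloud API.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label, results):
    """Record wall time, CPU time, and peak Python heap use for a code block."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[label] = {
            "wall_seconds": wall,
            "cpu_seconds": cpu,
            "peak_python_mib": peak / (1024 ** 2),
        }


usage = {}
with measure("dummy_workload", usage):
    # Stand-in for a transcription call
    data = [i * i for i in range(200_000)]

print(usage["dummy_workload"])
```

Wrapping each transcription call in `with measure("whisper", usage): ...` would give one record per tool; a large gap between wall time and CPU time is a hint that the call is mostly waiting on the network.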


In [None]:
# prompt: Given a set of Arabic reference texts and the corresponding transcriptions from an AI tool, implement these benchmarks:
#       1.   Word Error Rate (WER): Measures transcription errors
#       2.   Character Error Rate (CER): Useful for languages with complex scripts
#       4.   Processing Speed (Latency): Time taken for transcription
#       7.   CPU/GPU Usage: Efficiency in processing
#       8.   Memory Consumption: RAM requirements

import time
import jiwer
import psutil
import os

# ... (Your existing code for Azure, Google, and Whisper APIs) ...


def calculate_metrics(ground_truth, hypothesis):
    """Return (WER, CER) for one reference/hypothesis pair of strings."""

    # Word Error Rate (WER): pass the reference as a plain string, like the hypothesis
    wer = jiwer.wer(ground_truth, hypothesis)

    # Character Error Rate (CER)
    cer = jiwer.cer(ground_truth, hypothesis)

    return wer, cer


def benchmark(ground_truth_texts, hypothesis_texts, delta_time):
    """Print and return the average WER and CER, the latency, and the failure count."""
    number_of_texts = 0
    total_wer = 0
    total_cer = 0
    not_transcribed = 0
    for gt, hyp in zip(ground_truth_texts, hypothesis_texts):
        if isinstance(hyp, str):
            wer, cer = calculate_metrics(gt, hyp)

            number_of_texts += 1
            total_wer += wer
            total_cer += cer
        else:
            # The transcription function returned None (recognition failed)
            not_transcribed += 1

    if number_of_texts > 0:
        average_wer = total_wer / number_of_texts
        average_cer = total_cer / number_of_texts

        print(f"  Average WER: {average_wer:.4f}")
        print(f"  Average CER: {average_cer:.4f}")
        print(f"  Latency: {delta_time:.4f} seconds")
        print(f"  Not transcribed: {not_transcribed} clips")

        return average_wer, average_cer, delta_time, not_transcribed
    else:
        print("   Nothing was transcribed")
        # jiwer reports error rates as fractions, so a total failure is 1.0 (100%)
        return 1.0, 1.0, delta_time, not_transcribed

# # Example usage (replace with your actual data)
# ground_truth_texts = ["السلام عليكم ورحمة الله وبركاته", "كيف حالك؟"]
# hypothesis_texts = ["السلام عليكم ورحمة الله", "كيف حالك"]
# benchmark(ground_truth_texts, hypothesis_texts, delta_time=0.0)
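
In the example data above, the reference "كيف حالك؟" and the hypothesis "كيف حالك" differ only by a question mark, yet that difference still counts as an error. One common way to keep such surface differences out of WER and CER is to normalize both strings before scoring. The sketch below is an assumption about what such a step could look like, not something the notebook currently does: `normalize_arabic` is a hypothetical helper, and whether to drop Latin-script words or keep them is a judgment call that changes the scores.

```python
import re
import unicodedata

# Tashkeel (U+064B to U+0652), dagger alif (U+0670), and tatweel (U+0640)
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
LATIN_WORDS = re.compile(r"[A-Za-z]+")


def normalize_arabic(text):
    """Strip diacritics, Latin-script words, and punctuation; collapse spaces."""
    text = ARABIC_DIACRITICS.sub("", text)
    text = LATIN_WORDS.sub(" ", text)
    # Drop every character whose Unicode category is punctuation (P*),
    # which covers both "؟" and ASCII punctuation
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


print(normalize_arabic("كيف حالك؟"))
print(normalize_arabic("hello السلام  عليكم!"))
```

Applying the same function to both the reference and the hypothesis before calling `calculate_metrics` keeps the comparison symmetric.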

In [None]:
# prompt: Implement Azure AI Arabic Speech to Text using audio file

# Import necessary libraries
import os

import azure.cognitiveservices.speech as speechsdk

# Replace with the path to your Arabic audio file
# audio_file_path = "/content/audio (1).wav"  # Example: "/content/audio.wav"


def transcribe_arabic_azure(audio_file_path):
    """Performs speech-to-text recognition using Azure Speech service."""

    # Read the credentials from the environment instead of hard-coding them.
    # AZURE_SPEECH_KEY and AZURE_SPEECH_REGION are assumed variable names.
    speech_key = os.environ["AZURE_SPEECH_KEY"]
    service_region = os.environ.get("AZURE_SPEECH_REGION", "eastus")
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

    # Set the language to Arabic (adjust if needed)
    speech_config.speech_recognition_language = "ar-EG"  # Example: Egyptian Arabic
    audio_config = speechsdk.audio.AudioConfig(filename=audio_file_path)

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    # print("Recognizing speech from audio file...")

    # recognize_once_async() returns after the first recognized utterance,
    # so longer recordings are only partially transcribed.
    result = speech_recognizer.recognize_once_async().get()

    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        # print("Recognized: {}".format(result.text))
        return result.text
    elif result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(result.no_match_details))
    elif result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))

    # Signal failure explicitly so callers can count untranscribed clips
    return None

# Example usage
# transcribe_arabic_azure(audio_file_path)


In [None]:
!pip install SpeechRecognition

Collecting SpeechRecognition
  Downloading SpeechRecognition-3.14.1-py3-none-any.whl.metadata (31 kB)
Downloading SpeechRecognition-3.14.1-py3-none-any.whl (32.9 MB)
Installing collected packages: SpeechRecognition
Successfully installed SpeechRecognition-3.14.1


**Google Speech to Text**

In [None]:
# prompt: Use Google Speech to text for Arabic using audio file

import speech_recognition as sr

# Replace with the path to your Arabic audio file
audio_file_path = "/content/audio (1).wav"  # Example: "/content/audio.wav"

def transcribe_arabic_google(audio_file_path):
    """Performs speech-to-text recognition using Google Speech Recognition.

    recognize_google() calls the free Google Web Speech API through the
    SpeechRecognition package, not the Google Cloud Speech-to-Text service.
    Returns the recognized text, or None if recognition fails.
    """
    r = sr.Recognizer()
    with sr.AudioFile(audio_file_path) as source:
        audio = r.record(source)  # read the entire audio file

    try:
        # Use the Arabic language model
        text = r.recognize_google(audio, language="ar-EG")  # or another Arabic variant like "ar-SA"
        # print("Recognized Text:", text)
        return text
    except sr.UnknownValueError:
        print("   Google Speech Recognition could not understand audio")
    except sr.RequestError as e:
        print(f"   Could not request results from Google Speech Recognition service; {e}")
    return None

# Example usage
# transcribe_arabic_google(audio_file_path)


In [None]:
!pip install -U openai-whisper

Collecting openai-whisper
  Downloading openai-whisper-20240930.tar.gz (800 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->openai-whisper)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127

**Whisper AI**

In [None]:
# prompt: Use Whisper AI Speech to text for Arabic using audio file

import whisper


model = whisper.load_model("small")  # or choose another model size like "base", "medium", "large"

def transcribe_arabic_whisper(audio_file_path):
    """Transcribes Arabic speech using OpenAI Whisper."""
    try:
        result = model.transcribe(audio_file_path, language='ar') # Specify Arabic language
        # print("Recognized Text:", result["text"])
        return result["text"]
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage
# audio_file_path = "/content/audio (1).wav"  # Replace with your audio file path
# transcribe_arabic_whisper(audio_file_path)


100%|███████████████████████████████████████| 461M/461M [00:08<00:00, 53.8MiB/s]
  checkpoint = torch.load(fp, map_location=device)


**MGB-3 Dataset**

In [None]:
# prompt: just print the response from this api
# curl -X GET \
#      "https://datasets-server.huggingface.co/rows?dataset=MightyStudent%2FEgyptian-ASR-MGB-3&config=default&split=train&offset=0&length=100"

import requests
import tempfile

def download_audio(url):
    """Downloads an audio file from a URL and saves it temporarily."""
    response = requests.get(url)
    if response.status_code == 200:
        temp_audio = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
        temp_audio.write(response.content)
        temp_audio.close()
        print(temp_audio.name)
        return temp_audio.name
    else:
        raise Exception(f"Failed to download audio file, status code: {response.status_code}")



def api_response(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")

url = "https://datasets-server.huggingface.co/rows?dataset=MightyStudent%2FEgyptian-ASR-MGB-3&config=default&split=train&offset=0&length=100"
response = api_response(url)

# print(response)
# print(response["features"][0])
# print(response["rows"])
# print(response["rows"][0]["row"]["audio"][0]["src"])
# print(response["rows"][0]["row"]["sentence"])
# download_audio(response["rows"][0]["row"]["audio"][0]["src"])

ground_truth_texts = []
transcribed_text_google = []
transcribed_text_whisper = []
transcribed_text_azure = []

# Total transcription time per tool, accumulated over all clips
azure_time = 0
whisper_time = 0
google_time = 0

# Benchmark only the first five rows
for row in response["rows"][:5]:
    audio_url = row["row"]["audio"][0]["src"]
    sentence = row["row"]["sentence"]

    ground_truth_texts.append(sentence)

    audio_file_path = download_audio(audio_url)

    start = get_system_usage()
    transcribed_text_whisper.append(transcribe_arabic_whisper(audio_file_path))
    whisper_time += get_system_usage() - start

    start = get_system_usage()
    transcribed_text_google.append(transcribe_arabic_google(audio_file_path))
    google_time += get_system_usage() - start

    start = get_system_usage()
    transcribed_text_azure.append(transcribe_arabic_azure(audio_file_path))
    azure_time += get_system_usage() - start



In [None]:
print("Google Benchmark")
benchmark(ground_truth_texts, transcribed_text_google, google_time)
print("Whisper Benchmark")
benchmark(ground_truth_texts, transcribed_text_whisper, whisper_time)
print("Azure Benchmark")
benchmark(ground_truth_texts, transcribed_text_azure, azure_time)

Google Benchmark
Average WER: 0.7374
Average CER: 0.6444
Latency: 1.9629 seconds
Whisper Benchmark
Average WER: 0.8166
Average CER: 0.5585
Latency: 10.8000 seconds
Azure Benchmark
Average WER: 0.7845
Average CER: 0.6059
Latency: 2.8386 seconds


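**Results:**

The MGB-3 reference transcripts are themselves imperfect: some of them contain English words, which likely inflates the error rates of all three tools.

Comparing the three tools on the clips benchmarked above:

1. Google had the best overall results, with the lowest WER (0.7374) and the lowest latency (1.96 s), although its CER (0.6444) was the highest.

2. Azure came second, with a WER of 0.7845 and a better CER (0.6059) than Google.

3. Whisper, using the small model, had the highest WER (0.8166) and by far the highest latency (10.80 s), but the best CER (0.5585).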
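
The ranking can be checked mechanically from the figures in the recorded benchmark output. The snippet below is a small sketch that copies those numbers into a dictionary and sorts the tools per metric; the `recorded` dictionary and the `rank_by` helper are illustrative names, and lower is treated as better for all three metrics.

```python
# Figures copied from the recorded benchmark output above
recorded = {
    "Google":  {"wer": 0.7374, "cer": 0.6444, "latency_s": 1.9629},
    "Whisper": {"wer": 0.8166, "cer": 0.5585, "latency_s": 10.8000},
    "Azure":   {"wer": 0.7845, "cer": 0.6059, "latency_s": 2.8386},
}


def rank_by(metric):
    """Tool names ordered from best (lowest value) to worst for one metric."""
    return sorted(recorded, key=lambda tool: recorded[tool][metric])


for metric in ("wer", "cer", "latency_s"):
    print(metric, rank_by(metric))
```

The WER and latency orderings agree with each other, while the CER ordering is exactly reversed, which is why no single tool wins on every metric.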

In [None]:
import os
import zipfile
import re

# Define paths
zip_path = "/content/ASMDD.zip"  # Update with actual ZIP path
extract_path = "/content"  # Update with the desired extraction path

# Step 1: Extract the ZIP file
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

# Step 2: Get a sorted list of files
def extract_number(filename):
    """Extracts the first run of digits in a filename for sorting."""
    match = re.search(r'\d+', filename)
    return int(match.group()) if match else float('inf')  # If no number, put it at the end

# List all files in the extracted path
files = sorted(os.listdir(extract_path), key=lambda x: (extract_number(x), x))

# Print sorted files
print("\n".join(files))


**ASMDD Dataset**

In [None]:
import os
import re


# Step 3: Iterate through extracted folders and process `.wav` files

arabic_words = [
    "نعم", "رجل", "يخبر","شخص" ,"الوقت", "اليوم", "صحيح", "أستطيع", "شكرا", "الناس",
    "أعلم", "رائع", "مرحبا", "آسف", "تعال", "بالطبع", "العالم", "الحقيقة", "الليلة", "أمي",
    "الطريق", "عمل", "الجميع", "جيدة", "المال", "الذهاب", "أرجوك", "المنزل", "الحياة", "انتظر",
    "الرجال", "الله","الباب", "جميل", "الشرطة", "السيارة","عظيم", "الخير", "حالك", "للغاية", "فتاة",
    "كبيرة", "آسفة", "الأرض", "البيت", "صباح", "ألم", "لحظة", "بالضبط",
]

# "رقم", "طريق",
    # "المدينة", "الرئيس", "صديقي", "ساعة", "غرفة", "عام", "الأطفال", "سنة", "المدرسة", "الصباح",
    # "الماء", "التحدث", "الساعة", "الليل", "نهاية", "حياة", "الواقع", "الطفل", "دكتور", "الهاتف",
    # "الطعام","فريق", "الفتى","اللقاء", "نظرة","النساء", "العشاء","الأسبوع", "ولد", "رسالة", "عائلة", "القائد", "المرأة",
    # "المرأة","الطبيب", "اسم", "النقود", "الكلام", "مدينة", "مساء", "الشمس","ارجوك", "السماء","الزواج","أصدقاء",
    # "مكتب","البحر","الكتاب","الشارع",

ground_truth_texts = arabic_words

# Transcription functions defined in the cells above, keyed by display name
tools = {
    "Google": transcribe_arabic_google,
    "Whisper": transcribe_arabic_whisper,
    "Azure": transcribe_arabic_azure,
}

# Per-tool accumulators: scores, latencies, raw transcripts, and failure count
results = {
    name: {"wer": [], "cer": [], "latency": [], "texts": [], "failed": 0}
    for name in tools
}

# jiwer returns error rates as fractions, so a failed transcription is scored
# as 1.0 (100% error) rather than 100
FAILED_SCORE = 1.0
MAX_FILES_PER_FOLDER = 50

for root, dirs, files in os.walk(extract_path):
    wav_files = [f for f in files[:MAX_FILES_PER_FOLDER] if f.endswith(".wav")]

    for file in wav_files:
        # The first run of digits in the file name is used as the index of the
        # spoken word in arabic_words, as in the original lookup. Whether the
        # dataset numbers its files from 0 or 1 still needs to be verified.
        match = re.search(r'\d+', file)
        if match is None or int(match.group()) >= len(ground_truth_texts):
            print(f"Skipping {file}: no matching ground-truth word")
            continue
        reference = ground_truth_texts[int(match.group())]
        file_path = os.path.join(root, file)

        for name, transcribe in tools.items():
            start = get_system_usage()
            transcribed_text = transcribe(file_path)
            elapsed = get_system_usage() - start  # recorded even on failure

            stats = results[name]
            stats["texts"].append(transcribed_text)
            stats["latency"].append(elapsed)
            if transcribed_text is not None:
                wer, cer = calculate_metrics(reference, transcribed_text)
            else:
                stats["failed"] += 1
                wer, cer = FAILED_SCORE, FAILED_SCORE
            stats["wer"].append(wer)
            stats["cer"].append(cer)

# Print the summary once, after every folder has been processed
for name, stats in results.items():
    n = len(stats["wer"])
    print(f"{name} Benchmark")
    print(f"  Average WER: {sum(stats['wer']) / n if n else 0}")
    print(f"  Average CER: {sum(stats['cer']) / n if n else 0}")
    print(f"  Latency: {sum(stats['latency'])} seconds")
    print(f"  Not transcribed: {stats['failed']} words")

# Keep the per-tool transcript lists available under their original names
transcribed_text_google = results["Google"]["texts"]
transcribed_text_whisper = results["Whisper"]["texts"]
transcribed_text_azure = results["Azure"]["texts"]


In [None]:



# print("Google:",transcribed_text_google)
# print("Whisper:",transcribed_text_whisper)
# print("Azure:",transcribed_text_azure)

# print("Google match:",set(ground_truth_texts) & set(transcribed_text_google))
# print("Whisper match:",set(ground_truth_texts) & set(transcribed_text_whisper))
# print("Azure match:",set(ground_truth_texts) & set(transcribed_text_azure))