# INFO 5390 HW 2
**Ruth Rajcoomar (rr672)**

## Part A: Introduction

_Data Collection Process:_ I am using data from the Common Voice datasets, which are free datasets made available by the Mozilla Foundation (Mozilla). Volunteers write sentences, which are then read aloud by other volunteers. Separate volunteers then review the recordings and upvote or downvote them. Mozilla also asks volunteers to provide demographic data about themselves. Project decisions are made "by a diverse community of activists, linguists, data scientists, academics and software engineers from all over the world." Specifically, I am using part of the Common Voice Delta Segment 16.1 dataset, which was published on 1/4/2024. This segment consists only of the data added between the releases of Common Voice Corpus 15.0 and Common Voice Corpus 16.1. I am using data from the validated clips.

_Data Attributes:_ Attributes of this dataset include: client_id (an identifier for the person contributing their voice), path (the name of the audio file), sentence (the sequence of words spoken by the person; the 'ground truth'), up_votes (the number of people who reviewed a clip and could hear and understand the speaker), down_votes (the number of people who reviewed a clip and could not hear or understand the speaker), age (age group in buckets of tens), gender (male, female, or other), accents (any way of pronouncing words that may depend on factors like location or social class). 

_API Being Tested:_ I am testing the SpeechServices API from Microsoft Azure. Microsoft claims that this API can transcribe speech to text with high accuracy. However, they do note that if the audio contains ambient noise or domain-specific jargon, the base model may not be sufficient as is.

_Motivation:_ I chose this specific dataset primarily because the full corpus datasets are too large for me to download, given constraints such as limited free access to APIs and my internet speed. The dataset I did download was still large, so I took a sample from it that had an equal number of audio files from each age group, as I want to look at how age affects the API's performance. (See the appendix for more details.) The Delta Segment 16.1 is also a good choice because it is the most recent Delta Segment, which means it contains the most up-to-date data attributes and reflects the most up-to-date instructions Mozilla gave its volunteers. Therefore, this dataset is likely to be of high quality. Also, the validated clips are confirmed audible and understandable by human volunteers. I used Microsoft services for the API because the assignment instructions recommended it as not requiring credit card information. In addition, Microsoft is a large company with many resources, so I expect that they would put significant effort into making a high-quality API. The most relevant attributes for my analyses were: path, sentence, and age.

_Hypothesis:_ The API will perform well on younger people (specifically those in their teens, twenties, thirties, and forties) and underperform on older people (specifically those in their fifties, sixties, and seventies).

_Other Background Information:_ Attributes such as age, gender, and accents are self-reported. Age was easier to work with in terms of being measurable, as people chose from a list of prepared options without the ability to create their own option. (This is also the case for gender, which I have chosen not to look at for the purposes of this homework.) I originally wanted to work with accents, but found that since people could type in accents in addition to selecting from prepared options, it was harder to work with in terms of measurability.
My submitted materials are as follows: this Jupyter Notebook, a folder titled edited_data (which has the actual data I used for this assignment), and a folder titled cv-corpus-16.1-delta-2023-12-06 which is the original data from Mozilla.

## Part B: Generation

In [1]:
import azure.cognitiveservices.speech as speechsdk
import os
import pandas as pd
import numpy as np
import jiwer

In [3]:
# Initialize Microsoft Azure Speech Service API
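# Read the key from an environment variable rather than hard-coding it in the notebook
# (AZURE_SPEECH_KEY is an illustrative variable name, not something Azure requires)
subscription_key = os.environ["AZURE_SPEECH_KEY"]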
region = "eastus"

azure_api = speechsdk.SpeechConfig(subscription=subscription_key, region=region)

In [4]:
# Load data
voices_df = pd.read_csv('/Users/ruthrajcoomar/Documents/info5390/HW2_rr672/edited_data/common_voices_clean.csv')

In [5]:
# Store file names and generated transcriptions from API into dictionary
audio_folder = '/Users/ruthrajcoomar/Documents/info5390/HW2_rr672/edited_data/wav_files'
transcription_dict = {}
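for filename in os.listdir(audio_folder):
    # Skip anything that is not a wav file (e.g. hidden files such as .DS_Store on macOS)
    if not filename.lower().endswith('.wav'):
        continue
    # Call API to generate desired output
    audio_path = os.path.join(audio_folder, filename)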
    audio_config = speechsdk.audio.AudioConfig(filename=audio_path)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=azure_api, audio_config=audio_config)
    result = speech_recognizer.recognize_once()
    # Store result of API in dictionary
    transcription_dict[filename] = result.text

# Print the resulting dictionary
print(transcription_dict)

{'common_voice_en_38667654.wav': 'That place is facing erosion.', 'common_voice_en_38571925.wav': 'Several popular movies and television programs have been filmed in Ketanyang.', 'common_voice_en_38607162.wav': 'This in turn will produce radicals to destroy harmful contaminants.', 'common_voice_en_39175731.wav': 'Triang Walker coonhounds get along well with other dogs and with children.', 'common_voice_en_38607213.wav': 'How many people work here?', 'common_voice_en_38522174.wav': "The team's average attendance ranked 6th in the seven team league.", 'common_voice_en_38539226.wav': 'Several of these canals are still in use today.', 'common_voice_en_38916560.wav': 'The building was designed by Elliott Woods.', 'common_voice_en_38667569.wav': 'The river upstream of the Bay supports populations of salmon, steelhead and cutthroat trout.', 'common_voice_en_38778689.wav': 'Tsunami starred as Escamillo and Carmen Disrupted with Sharon Stone.', 'common_voice_en_39018435.wav': 'He then passes th

## Part C: Analysis

In [6]:
# Convert API output dictionary into one dataframe
transcription_df = pd.DataFrame.from_dict(transcription_dict, orient='index', columns=['transcription'])

# Merge to dataframe containing 'ground truth'
merged_df = pd.merge(voices_df, transcription_df, left_on="path", right_index=True, how="left")
# 'ground truths' are in the 'sentence' column of voices_df

# Compare API outputs to 'ground truth' for each row
# Using Word Error Rate (WER)
merged_df['wer'] = merged_df.apply(lambda row: jiwer.wer(row['sentence'], row['transcription']), axis=1)
# Using Character Error Rate (CER)
merged_df['cer'] = merged_df.apply(lambda row: jiwer.cer(row['sentence'], row['transcription']), axis=1)

# Aggregation statistics
age_order_mapping = {
    'teens': 1,
    'twenties': 2,
    'thirties': 3,
    'fourties': 4,
    'fifties': 5,
    'sixties': 6,
    'seventies': 7
}
# Overall WER and CER
mean_wer = merged_df['wer'].mean()
print(f"Overall Mean WER: {mean_wer:.2%}")
mean_cer = merged_df['cer'].mean()
print(f"Overall Mean CER: {mean_cer:.2%}")
print()
# WER by age
mean_wer_by_age = merged_df.groupby('age')['wer'].mean()
sorted_mean_wer_by_age = sorted(mean_wer_by_age.items(), key=lambda x: age_order_mapping.get(x[0]))
for age, mean_wer in sorted_mean_wer_by_age:
    print(f"Mean WER of people in their {age}: {mean_wer:.2%}")
print()
# CER by age
mean_cer_by_age = merged_df.groupby('age')['cer'].mean()
sorted_mean_cer_by_age = sorted(mean_cer_by_age.items(), key=lambda x: age_order_mapping.get(x[0]))
for age, mean_cer in sorted_mean_cer_by_age:
    print(f"Mean CER of people in their {age}: {mean_cer:.2%}")

Overall Mean WER: 32.17%
Overall Mean CER: 23.14%

Mean WER of people in their teens: 45.05%
Mean WER of people in their twenties: 42.40%
Mean WER of people in their thirties: 45.33%
Mean WER of people in their fourties: 20.00%
Mean WER of people in their fifties: 2.86%
Mean WER of people in their sixties: 34.16%
Mean WER of people in their seventies: 35.38%

Mean CER of people in their teens: 25.20%
Mean CER of people in their twenties: 26.26%
Mean CER of people in their thirties: 41.15%
Mean CER of people in their fourties: 20.00%
Mean CER of people in their fifties: 0.43%
Mean CER of people in their sixties: 23.88%
Mean CER of people in their seventies: 25.07%


## Part D: Observations

_Explanation of Analysis:_ I first merged the transcriptions (the results from the Microsoft Azure SpeechServices API) into my voices_df dataframe, which contained the ground truths (in the `sentence` column). For each audio file, I compared the API output to the ground truth using two metrics: Word Error Rate (WER) and Character Error Rate (CER). To present these metrics, I obtained the mean WER and CER across all audio files. I also presented the mean WER and CER for each age group accounted for in the dataset (teens, twenties, thirties, forties, fifties, sixties, seventies).

_Ground Truths:_ I did not choose the ground truths myself in this case. The relevant ground truths (i.e. the sequence of words spoken by the person) were provided by Mozilla as part of the dataset in the `sentence` column. I spot-checked by listening to a few of the audio files, and they matched the words in the `sentence` column. In addition, Mozilla has robust quality control guidelines, so I was not worried about the accuracy of their ground truths.

_Metrics:_ I chose Word Error Rate (WER) and Character Error Rate (CER) as they are popular and widely used in general transcription cases according to the literature. These sentences did not require any special treatment, as Mozilla intends for them to be natural and conversational. Using the mean of WER and CER allows for an overall, general assessment of the API's performance. It also provides a solid launching point for determining if further, more detailed analyses would be worth the time and effort.
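To make the two metrics concrete, here is a minimal sketch of how WER and CER can be computed from an edit distance. It is meant to illustrate the definitions only; the function names are hypothetical and it does not try to reproduce jiwer's exact preprocessing, so its numbers may differ slightly from the ones jiwer reports.

```python
import string


def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def normalize(text):
    """Illustrative normalization (an assumption, not part of the audit): lowercase, drop punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# One wrong word out of five reference words
print(word_error_rate("how many people work here", "how many people work there"))  # 0.2
# Case and punctuation differences count as errors unless the text is normalized first
print(word_error_rate("How many people work here?", "how many people work here"))  # 0.4
print(word_error_rate(normalize("How many people work here?"), "how many people work here"))  # 0.0
```

The last two lines show why the choice of normalization matters: without it, differences in capitalization or punctuation between the `sentence` column and the API output are scored as word errors.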

_Takeaways:_ My hypothesis, that the API would perform well on younger people (specifically those in their teens, twenties, thirties, and forties) and underperform on older people (specifically those in their fifties, sixties, and seventies), did not hold up after this audit.
- The mean WER for people in their teens, twenties, and thirties was higher than the overall mean WER.
- The mean WER for people in their forties and fifties was lower than the overall mean WER.
- The mean WER for people in their sixties and seventies was higher than the overall mean WER.
- The mean CER for people in their teens, twenties, and thirties was higher than the overall mean CER.
- The mean CER for people in their forties and fifties was lower than the overall mean CER.
- The mean CER for people in their sixties and seventies was higher than the overall mean CER.

Overall, the API performed better than the overall mean on people in their forties and fifties and worse than the overall mean for all other age groups, regardless of whether the chosen metric is WER or CER.
It's possible that the people sampled in their forties and fifties had accents that were better represented in the model's training data. It is also possible that the training data was biased towards these age groups. Cognitive factors such as difficulty in articulation could apply to those in their sixties or seventies, giving the API difficulty. Background noise could have been a distorting factor for any age group, though these audio clips were reviewed as being audible by humans. The big takeaway here is that it is important to have a dataset that is representative across many demographic features such as age in order to have fairer algorithms.
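Because the sample holds only five clips per age group (see the appendix), each group mean rests on very few observations. One way to keep that visible is to report the count and range next to each mean. The sketch below assumes a dataframe shaped like `merged_df` with `age` and `wer` columns; the helper name and the toy values are made up for illustration and are not results from this audit.

```python
import pandas as pd

# Age labels as they are spelled in the Common Voice data, in chronological order
AGE_ORDER = ["teens", "twenties", "thirties", "fourties",
             "fifties", "sixties", "seventies"]


def summarize_by_age(df, metric="wer"):
    """Return count, mean, min, and max of `metric` per age group, in age order.

    Rows whose age label is not in AGE_ORDER are left out of the summary.
    """
    ordered = df.assign(
        age=pd.Categorical(df["age"], categories=AGE_ORDER, ordered=True)
    )
    return ordered.groupby("age", observed=True)[metric].agg(
        ["count", "mean", "min", "max"]
    )


# Toy values for illustration only
toy_df = pd.DataFrame({
    "age": ["fifties", "teens", "teens", "fifties"],
    "wer": [0.0, 0.5, 0.1, 0.2],
})
print(summarize_by_age(toy_df, "wer"))
```

Calling `summarize_by_age(merged_df, "wer")` and `summarize_by_age(merged_df, "cer")` would give the same means as above, with the group sizes and spread alongside them.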

## Part E: Sources

### Introduction
- https://commonvoice.mozilla.org/en/datasets
- https://commonvoice.mozilla.org/en/about?tab=how-validate#playbook
- https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview
- https://discourse.mozilla.org/t/delta-releases/106567

### Setting Up and Initializing API

- https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speech-to-text?tabs=macos%2Cterminal&pivots=programming-language-python
- https://learn.microsoft.com/en-us/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig?view=azure-node-latest

### Generating Text Transcriptions
- https://learn.microsoft.com/en-us/azure/ai-services/speech-service/batch-transcription
- https://stackoverflow.com/questions/56884243/tutorial-for-azure-speech-to-text
- https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speech-to-text?tabs=macos%2Cterminal&pivots=programming-language-python

### Analyzing Output
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
- https://pypi.org/project/jiwer/

### Observations
- https://commonvoice.mozilla.org/en/guidelines?tab=sentence
- https://huggingface.co/learn/audio-course/en/chapter5/evaluation
- https://rechtsprechung-im-ostseeraum.archiv.uni-greifswald.de/word-error-rate-character-error-rate-how-to-evaluate-a-model/

### Appendix
- https://podcastle.ai/converter/mp3-to-wav

## Appendix

The code block below shows the process I utilized in order to obtain a sample from the original dataset, with the goal of having an equal number of audio files from each age group.

In [2]:
# Load the original TSV file
tsv_file_path = '/Users/ruthrajcoomar/Documents/info5390/HW2_rr672/cv-corpus-16.1-delta-2023-12-06/en/validated.tsv'
df = pd.read_csv(tsv_file_path, delimiter='\t')
# Group by age and randomly sample 5 clips from each age group
sample_df = df.groupby('age').sample(n=5, random_state=4)
# Change format given in string from mp3 to wav
def change_extension(filename):
    return filename.replace('.mp3', '.wav')
sample_df['path'] = sample_df['path'].apply(change_extension)
# Save sample as CSV file
sample_df.to_csv('common_voices_clean.csv')

I manually copied the mp3 files with names matching the ones in the common_voices_clean.csv file into the `needed_mp3_files` folder within the `edited_data` folder. I then converted each mp3 file to a wav file using podcastle.ai, and manually compiled the wav files into the `wav_files` folder within the `edited_data` folder. I researched ways of performing these steps programmatically, but ran into many errors. It got to the point where my time was better spent doing these actions manually so I could focus on being efficient with the code most directly relevant to the homework.
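For reference, the copy-and-convert steps could be scripted along the following lines. This is an untested sketch, not the process used for this homework: it assumes the `ffmpeg` command-line tool is installed and on the PATH, the function names are hypothetical, and the 16 kHz mono 16-bit PCM output is my assumption about a wav format the speech service accepts.

```python
import subprocess
from pathlib import Path


def plan_conversions(mp3_folder, wav_folder, wav_names):
    """Pair each needed wav file name with the mp3 file it should come from."""
    mp3_folder, wav_folder = Path(mp3_folder), Path(wav_folder)
    return [
        (mp3_folder / Path(name).with_suffix(".mp3").name,
         wav_folder / Path(name).name)
        for name in wav_names
    ]


def build_ffmpeg_command(mp3_path, wav_path, sample_rate=16000):
    """Build an ffmpeg call that writes mono 16-bit PCM wav at `sample_rate` Hz."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", str(mp3_path),       # input mp3
        "-ac", "1",                # one audio channel (mono)
        "-ar", str(sample_rate),   # output sample rate
        "-c:a", "pcm_s16le",       # 16-bit PCM audio
        str(wav_path),
    ]


def convert_all(mp3_folder, wav_folder, wav_names):
    """Convert every needed clip; raises CalledProcessError if ffmpeg fails."""
    Path(wav_folder).mkdir(parents=True, exist_ok=True)
    for mp3_path, wav_path in plan_conversions(mp3_folder, wav_folder, wav_names):
        subprocess.run(build_ffmpeg_command(mp3_path, wav_path), check=True)
```

With this in place, something like `convert_all(mp3_folder, wav_folder, sample_df['path'])` would read the mp3 files straight from the folder holding the original clips, which would also remove the need to copy them by hand first.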