# Automatic Speech Recognition Mini-Project
This project explores existing Hugging Face models for ASR and accent classification. It covers three tasks:
1. Use the Whisper (seq2seq architecture) English checkpoint to transcribe speech
2. Use wav2vec2 (CTC architecture) to extract phonemic transcriptions from English speech, and compare the results from 4 different models
3. Use an audio classification model for accent recognition to extract the accent group from speech

## Set up common code for all models

In [None]:
# Some models require logging in to Hugging Face
from huggingface_hub import notebook_login
notebook_login()

In [None]:
import librosa
import IPython.display as ipd
import torch

In [None]:
# Load a sample audio to test models
convo, sr = librosa.load("convo.wav", sr=16000)
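Whisper and wav2vec2 models work on 16 kHz audio, which is why `librosa.load` is called with `sr=16000` above. As a quick sanity check, a file's native sampling rate and duration can be read with the standard-library `wave` module. This is a minimal sketch: `wav_info` is a hypothetical helper name, and it assumes the file is an uncompressed PCM WAV (other encodings would need a different reader).

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()   # samples per second stored in the file header
        frames = wav.getnframes()   # number of frames (samples per channel)
    return rate, frames / rate
```

For example, `wav_info("convo.wav")` would show whether the file is already at 16 kHz or is being resampled on load.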

In [None]:
# ipd.Audio("convo.wav") # listen to file in notebook

In [None]:
from transformers import pipeline

## Whisper English Checkpoint to Transcribe speech

In [3]:
# Set up pipeline to do ASR using whisper-base.en
model = "openai/whisper-base.en"
# model = "openai/whisper-small.en" # try this one if I need better accuracy

# load the model in half-precision (float16) if running on a GPU to speed up inference
if torch.cuda.is_available():
    device = "cuda:0"
    torch_dtype = torch.float16
else:
    device = "cpu"
    torch_dtype = torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    torch_dtype=torch_dtype,
    device=device,
)

In [None]:
# Take a filepath for audio input. The pipeline loads the audio, resamples it,
# runs inference with the model, and returns the transcribed text.
def transcribe_speech(filepath):
    output = pipe(
        filepath,
        max_new_tokens=256,
        chunk_length_s=30,
        batch_size=8,
        # generate_kwargs={"task": "transcribe", "language": "en"},  # not needed for the English-only checkpoint
    )
    return output["text"]
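With `chunk_length_s=30`, recordings longer than 30 seconds are split into windows that the pipeline can transcribe in batches. The sketch below only illustrates the idea of overlapping windows. `chunk_bounds` and the 5-second `stride_s` are assumptions made for this example, and the pipeline's real implementation differs in its details.

```python
def chunk_bounds(n_samples, sr=16000, chunk_s=30.0, stride_s=5.0):
    """Return (start, end) sample indices of overlapping windows over the audio.

    Illustrative only: each window overlaps its neighbour by 2 * stride_s seconds.
    """
    chunk = int(chunk_s * sr)
    stride = int(stride_s * sr)
    step = chunk - 2 * stride
    if step <= 0:
        raise ValueError("chunk_s must be more than twice stride_s")

    bounds = []
    start = 0
    while True:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end >= n_samples:  # the last window reaches the end of the audio
            break
        start += step
    return bounds
```

Under these assumptions, a 70-second recording yields three windows, while anything up to 30 seconds yields a single window.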

In [None]:
# !pip install gradio

In [None]:
# Test with individual file
result = transcribe_speech('convo.wav')



In [None]:
result

" Hey, what do you want to do today? Could we go to the beach? Is the weather nice? Yeah, that's a great idea. Let's bring tortilla. It's sunny. Sounds good. She could use some exercise."

## wav2vec2 to extract phonemic transcription from English speech
Compare 4 wav2vec2 models that output phonemic transcriptions but were trained on different datasets

In [None]:
phoneme_models = [
    "vitouphy/wav2vec2-xls-r-300m-timit-phoneme", # Trained on DARPA TIMIT American English
    "mrrubino/wav2vec2-large-xlsr-53-l2-arctic-phoneme", # Trained on speakers of English as a second language
    "vitouphy/wav2vec2-xls-r-300m-phoneme", # Trained on unknown dataset
    "ct-vikramanantha/phoneme-scorer-v2-wav2vec2", # Trained on LJSpeech, which sounds like Americans reading text in English
]
pipes = []

In [1]:
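# Use separate names so the Whisper `pipe` used by transcribe_speech() is not overwritten
for model_name in phoneme_models:
    phoneme_pipe = pipeline("automatic-speech-recognition", model=model_name)
    pipes.append(phoneme_pipe)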

In [None]:
outputs = []
for phoneme_pipe in pipes:
    # Pass a copy of the 16 kHz array, in case a pipeline modifies its input
    output = phoneme_pipe(convo.copy())
    outputs.append(output)

In [None]:
for i, output in enumerate(outputs, start=1):
    print(f"{i}: {output['text']}")

1: heɪwəɾiwənɪ duɾɪdeɪkʊwiʊ tɪðə bi ʧɪzðəwɛðɝnaɪsjæðɛ szɪ gɹeɪɾaɪ diə lɛ s b ɹɪŋ tɝ tiə ɪ səni saʊn z gʊ ʃi kʊɾjusəmɛ k sɝsaɪz
2: hei wʌt ju wɑndʌ duɪ tʌ deɪ kʊd wi ɡoʊt tʌ ðʌ bit͡ʃ ɪz ðʌ wɛðɚnaɪsjæ ðæts ʌ ɡɹeɪd aɪdiʌ lɛts pɹɪŋ tʌ  diʌ ɪts sʌni sʌʊmz ɡɛt ʃi kɹud sʌm ɛksɚsaɪs
3: h#hheywahdywahn dahduwtihdeyh#kwiygow tahdhahbiychh#ihzdhahwehdhernaysyehdhaet sahg reytaydiyahh#leht s b rihngtertiyahh#iht sahniyh#sawn z gershiykeryuwz sahmehk sersayzh#
4: hay u n y oo w n i d t aw k uoh d w ee goht aw bth aw v eechi z bthohw e bth or nIs bth a t s bth g raytI ee aw l e t s b r i ng t or d ee  i t th u n ee  sown g e sh ee k uoh d y oo z s u m e k s or sIz


In [None]:
# These outputs differ quite a bit.
# Models 3-4 do not appear to output IPA (model 3's output resembles TIMIT-style phone labels,
# e.g. the "h#" silence marker), so they are probably not what I want: I'd like IPA phonemic transcriptions.
# Models 1-2 appear to use IPA but still differ from each other in spacing and symbol choices.
# Let's compare the output of models 1 and 2 more closely below.
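The IPA versus non-IPA split noted above can be checked with a rough heuristic: IPA transcriptions contain many letters outside ASCII (such as ɪ, ə, ʃ), while ARPAbet-like or respelled output is plain ASCII. This is only a sketch. The function names and the 0.1 threshold are assumptions for illustration, and the samples are short prefixes of the four outputs printed above.

```python
def non_ascii_letter_ratio(text):
    """Fraction of alphabetic characters that fall outside the ASCII range."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ord(ch) > 127 for ch in letters) / len(letters)


def looks_like_ipa(text, threshold=0.1):
    """Heuristic: treat text as IPA if enough of its letters are non-ASCII."""
    return non_ascii_letter_ratio(text) >= threshold


# Prefixes of the four model outputs shown above
samples = {
    1: "heɪwəɾiwənɪ duɾɪdeɪ",
    2: "hei wʌt ju wɑndʌ duɪ tʌ deɪ",
    3: "h#hheywahdywahn dahduwtihdeyh#",
    4: "hay u n y oo w n i d t aw k uoh d",
}

for number, text in samples.items():
    print(number, looks_like_ipa(text))
```

A heuristic like this would not separate two IPA-based models from each other, so it only helps with the first filtering step.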

In [None]:
# Changing spacing to line up sound with the transcript and expected phonemes
# 0 = transcript
# 1 = output of model 1 from audio input
# 2 = output of model 2 from audio input
# 3 = output from ChatGPT of expected phonemes given the text, assuming spoken in standard colloquial American English

# 0: Hey what do you wanna  do  today?
# 1: heɪ wəɾ  i      wənɪ   du  ɾɪdeɪ
# 2: hei wʌt  j  u   wɑndʌ  duɪ tʌdeɪ
# 3: heɪ wʌɾə jə     wɑnə   du  təˈdeɪ

# 0: Could we  go   to    the beach?
# 1: kʊ    wi  ʊ    tɪ    ðə  biʧ
# 2: kʊd   wi  ɡoʊ  t tʌ  ðʌ  bit͡ʃ
# 3: kʊd   wi  ɡoʊ  tə    ðə  biʧ

# 0: Is   the weather nice?
# 1: ɪz   ðə  wɛðɝ    naɪs
# 2: ɪz   ðʌ  wɛðɚ    naɪs
# 3: ɪz   ðə ˈwɛðɚ    naɪs

# 0: Yeah that's a  great  idea,
# 1: jæ   ðɛ szɪ    gɹeɪɾ  aɪdiə
# 2: jæ   ðæts   ʌ  ɡɹeɪd  aɪdiʌ
# 3: jɛ   ðæts   ə  ɡreɪɾ  aɪˈdiə

# 0: lets bring Tortilla. It's sunny.
# 1: lɛ s bɹɪŋ  tɝ tiə    ɪ    səni
# 2: lɛts pɹɪŋ  tʌ  diʌ   ɪts  sʌni
# 3: lɛts brɪŋ  tɔɹˈtiʝə  ɪts ˈsʌni

# 0: Sounds good, she could use some exercise.
# 1: saʊnz  gʊ    ʃi  kʊɾj  u   səm  ɛksɝsaɪz
# 2: sʌʊmz  ɡɛt   ʃi  kɹud      sʌm  ɛksɚsaɪs
# 3: saʊnz  ɡʊd   ʃi  kəd   juz səm ˈɛksɚˌsaɪz

# Overall, models 1-2 differ in a lot of sounds, and model 1 sometimes appears to miss
# sounds that model 2 picked up. To be fair, both speakers are New Englanders, who tend
# to soften certain consonants like 't' and 'd'.

# Since model 1 was trained on American accents, while model 2 was trained on speakers of English as a
# second language, we'd expect model 1 to be more accurate on this conversation between native Massachusetts speakers.
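One way to put a number on that expectation is a character-level error rate: the edit distance between a model's output and the expected phonemes, divided by the length of the expected string. The sketch below uses the "Could we go to the beach?" row from the alignment above. The helper names are hypothetical, and the reference is the ChatGPT-generated row, not a verified ground truth.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(ipa):
    """Drop spaces and stress marks so only the phoneme symbols are compared."""
    return "".join(ch for ch in ipa if ch not in " ˈˌ")


def error_rate(hypothesis, reference):
    """Edit distance to the reference, relative to the reference length."""
    hyp, ref = normalize(hypothesis), normalize(reference)
    if not ref:
        raise ValueError("reference must contain at least one symbol")
    return edit_distance(hyp, ref) / len(ref)


reference = "kʊd wi ɡoʊ tə ðə biʧ"     # row 3: expected phonemes
model_1 = "kʊ wi ʊ tɪ ðə biʧ"          # row 1
model_2 = "kʊd wi ɡoʊ t tʌ ðʌ bit͡ʃ"   # row 2

print(f"model 1: {error_rate(model_1, reference):.2f}")
print(f"model 2: {error_rate(model_2, reference):.2f}")
```

Differences in notation (for example `t͡ʃ` versus `ʧ`, or `ʌ` versus `ə` for unstressed vowels) count as errors here, so the symbol inventories would need to be mapped onto each other before the numbers say much about which model heard the audio better.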

# I think the best approach would be to run an accent classifier first: if the speaker sounds American,
# send the audio to model 1, but if the speaker sounds like English is their second language, send the audio
# to model 2. Both will output similar IPA phonemic transcripts, but model 2 might capture non-standard American
# pronunciation better.
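The routing idea can be sketched as a small function that maps an accent label to one of the two IPA models. The function name, the `"us"` label string, and the `min_score` cutoff are assumptions for illustration, and a low-confidence prediction falls back to the second-language model.

```python
# Model 1 and model 2 from the phoneme_models list above
NATIVE_US_MODEL = "vitouphy/wav2vec2-xls-r-300m-timit-phoneme"
L2_MODEL = "mrrubino/wav2vec2-large-xlsr-53-l2-arctic-phoneme"


def choose_phoneme_model(accent_label, score, min_score=0.6):
    """Pick a phoneme model from an accent prediction.

    accent_label: label returned by an accent classifier (assumed "us" for American)
    score: classifier confidence in [0, 1]
    min_score: hypothetical cutoff below which the prediction is not trusted
    """
    if accent_label == "us" and score >= min_score:
        return NATIVE_US_MODEL
    return L2_MODEL


print(choose_phoneme_model("us", 0.82))
print(choose_phoneme_model("indian", 0.68))
```

This routing is only as good as the accent classifier feeding it, so the cutoff would need tuning on real recordings.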

## Explore accent classifier for native-English speakers

Testing out a preexisting accent classification model trained on the SpeechBrain CommonAccent dataset, which covers 16 accents from recordings in English:

 - African
 - Australian
 - Bermudan
 - Canadian
 - English
 - Hong Kong
 - Indian
 - Ireland
 - Malaysian
 - New Zealand
 - Philippines
 - Scotland
 - Singapore
 - South Atlantic
 - US
 - Wales

*Note: English speakers whose first language is Spanish are notably missing, which is a major limitation of this dataset and related models for use in the US.

In [None]:
#!pip install speechbrain

In [None]:
import torchaudio
from speechbrain.pretrained import EncoderClassifier

In [2]:
classifier = EncoderClassifier.from_hparams(source="Jzuluaga/accent-id-commonaccent_ecapa", savedir="pretrained_models/accent-id-commonaccent_ecapa")

In [None]:
# test on a bunch of accents, some native English from outside US, some not
# american: clip from Reservation Dogs, a show with Indigenous/Native American actors
# irish: clip from Derry Girls, an Irish TV show
# indian: Abdul Bari, an Indian professor, on YouTube teaching algorithms
# mexican: Jaime Camil, Mexican actor from Jane the Virgin
# south_african: Trevor Noah, a US-based comedian born in South Africa
# chinese: Ronny Chieng, Chinese-American comedian
# nigerian: Daniel Etim Effiong and Tana Adelana in Dinner for Four, a Nigerian Film
accents = ['american', 'irish', 'indian', 'mexican', 'south_african', 'chinese', 'nigerian']
predicted_accents = []
likelihoods = []
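for accent in accents:
    # classify_file returns (out_prob, score, index, text_lab); score belongs to the top class
    out_prob, score, index, text_lab = classifier.classify_file(accent + '.wav')
    predicted_accents.append(text_lab[0])
    likelihoods.append(float(score[0] * 100))  # reported below as a percentage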

In [None]:
for i, accent in enumerate(accents):
  print(f"The {accent} recording was classified as {predicted_accents[i]} with probability {likelihoods[i]:.2f}%")

The american recording was classified as us with probability 81.86%
The irish recording was classified as england with probability 68.69%
The indian recording was classified as indian with probability 67.83%
The mexican recording was classified as australia with probability 58.15%
The south_african recording was classified as us with probability 56.03%
The chinese recording was classified as us with probability 62.65%
The nigerian recording was classified as england with probability 67.06%


Above, we can see the classifier is not very accurate: it labels only 2 of the 7 recordings correctly.

Correct: American and Indian

Incorrect (but expected given these speakers are not represented in training data): Mexican, Chinese

Incorrect unexpectedly: Irish, Nigerian

*Note: The South African speaker in this recording is Trevor Noah, who is from South Africa but has been in the US for a while, so we can give the model a pass on that one.
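The tally above can be reproduced from the printed predictions. In this sketch the `expected` mapping is an assumption: the label spellings `"ireland"` and `"african"` are guesses at the classifier's label names, and accents with no matching training category are marked `None`.

```python
# Predictions copied from the printed results above
predictions = {
    "american": "us",
    "irish": "england",
    "indian": "indian",
    "mexican": "australia",
    "south_african": "us",
    "chinese": "us",
    "nigerian": "england",
}

# Assumed target labels; None means the accent has no matching training category
expected = {
    "american": "us",
    "irish": "ireland",
    "indian": "indian",
    "mexican": None,
    "south_african": "african",
    "chinese": None,
    "nigerian": "african",
}


def tally(predictions, expected):
    """Return (correct, covered, total) counts for the recordings."""
    covered = [name for name, label in expected.items() if label is not None]
    correct = [name for name in covered if predictions[name] == expected[name]]
    return len(correct), len(covered), len(predictions)


correct, covered, total = tally(predictions, expected)
print(f"{correct} of {total} recordings correct; {correct} of {covered} among accents with a training category")
```

Separating the two denominators keeps the recordings the model could never get right from masking its errors on accents it was trained on.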