# A quick hands-on for Speech >> Text >> NER
# using 🤗 Hugging Face Transformers, wav2vec 2.0 and spaCy.

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installati

In [2]:
import librosa
import soundfile as sf
import torch
import warnings

from transformers import Wav2Vec2ForMaskedLM, Wav2Vec2Tokenizer

warnings.filterwarnings("ignore")

In [39]:

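# Load the wav2vec 2.0 tokenizer and model.
# Note: Transformers deprecates Wav2Vec2Tokenizer and Wav2Vec2ForMaskedLM
# in favor of Wav2Vec2Processor and Wav2Vec2ForCTC.
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

model = Wav2Vec2ForMaskedLM.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
# The "You should probably TRAIN this model" warning printed on load concerns only
# `masked_spec_embed`, which is used for masking during training. The checkpoint is
# already fine-tuned for English speech recognition, so inference works as-is.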


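# Define the speech-to-text function
def asr_transcript(audio_file):
    transcript = ""

    # Stream the file in blocks of 10 frames x 16,000 samples (160,000 samples).
    # That is 10 seconds only if the file is sampled at 16 kHz, the rate the
    # model expects; librosa.stream does not resample.
    stream = librosa.stream(
        audio_file, block_length=10, frame_length=16000, hop_length=16000, mono=True
    )

    for speech in stream:
        # mono=True (the librosa default) already downmixes multi-channel audio,
        # so each block arrives as a 1-D array.
        input_values = tokenizer(speech, return_tensors="pt").input_values

        # Inference only, so skip gradient tracking
        with torch.no_grad():
            logits = model(input_values).logits

        # Greedy decoding: pick the most likely token for every frame
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = tokenizer.batch_decode(predicted_ids)[0]
        transcript += transcription.lower() + " "

    return transcript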

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForMaskedLM were not initialized from the model checkpoint at facebook/wav2vec2-large-960h-lv60-self and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
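
The `torch.argmax` plus `batch_decode` step in `asr_transcript` is greedy CTC decoding. As a rough, library-free sketch of what that decoding does: merge consecutive repeated ids, then drop the blank symbol. The toy vocabulary, the `<pad>` blank, the `|` word delimiter, and the helper name `ctc_greedy_collapse` are illustrative assumptions here, not part of the Transformers API.

```python
from itertools import groupby

# Toy vocabulary for illustration only; the real checkpoint ships its own vocab.
VOCAB = ["<pad>", "|", "c", "a", "t", "s"]
BLANK_ID = 0      # assumed CTC blank symbol
DELIMITER = "|"   # assumed word-boundary symbol


def ctc_greedy_collapse(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse per-frame argmax ids into text (hypothetical helper)."""
    # 1) Merge runs of identical consecutive ids.
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # 2) Drop blanks; they only separate genuinely repeated characters.
    chars = [vocab[i] for i in merged if i != blank_id]
    # 3) Turn word delimiters into spaces.
    return "".join(chars).replace(DELIMITER, " ").strip()


# One id per audio frame, as torch.argmax(logits, dim=-1) would produce.
frame_ids = [0, 2, 2, 0, 3, 3, 3, 4, 0, 0, 1, 1, 3, 0, 4, 4]
print(ctc_greedy_collapse(frame_ids))
```

Because each frame is decoded independently, with no language model, a noisy or mismatched input tends to produce fragmented words rather than a fluent but wrong sentence.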


In [40]:
# Sample audio file (~4 min, two speakers in a dialog); replace it with your own file
text_output = asr_transcript("/content/4390.wav")
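
Since `librosa.stream` yields samples at the file's native rate, it may be worth checking that rate before transcribing. The following is a minimal sketch using only the standard-library `wave` module, assuming an uncompressed PCM WAV file. The helper names `wav_info` and `block_seconds` and the constant `MODEL_RATE` are hypothetical; the default arguments mirror the `block_length`, `frame_length` and `hop_length` values used in `asr_transcript`.

```python
import wave

MODEL_RATE = 16000  # sampling rate the wav2vec 2.0 checkpoint expects


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def block_seconds(rate, block_length=10, frame_length=16000, hop_length=16000):
    """Duration of one streamed block at the file's native sample rate."""
    samples = frame_length + (block_length - 1) * hop_length
    return samples / rate


# Example usage (path taken from the cell above; replace with your own file):
# rate, channels, duration = wav_info("/content/4390.wav")
# if rate != MODEL_RATE:
#     print(f"File is {rate} Hz; each block covers {block_seconds(rate):.2f} s, not 10 s")
```

If the rate is not 16 kHz, one option is to resample first, for example with `librosa.load(path, sr=16000)`. A mismatched rate is one possible cause of a poor transcript, though not the only one.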

In [41]:
text_output

"i loter are aread a lot atousem ber on genera i a  to o tavl cop on i   o ro  t tro for some trans  now run up a glass and crama bobb of water alon atra thats a rot me thin  catit i wane a i wate  deo on or while im a little a  oa a reberibran reeeoes  i seni tee o this bant foser an  fe oa rapto alarne her o tetebe iooon o av ro ibro o re don a a raena te a tousan thers a ar o   i i ll ha to th jar round o rond il whoe  t trave row bopo teres this buses otene dosnt a  readyner to snatch the craft and coten fings are relfon  ras an ta cattle e water  wate teton a telli a differt track sir so hate  round of thor for the priloters  as aba a mile a ltl rabat wi ther  meronan eigt eighteen hundred farout with ater car aatri tei trustin  h aha wild ei  or liv oreverier ewill ave to be let aof of thepama for several for butler divee ouseen o he rina bu at the bo  lo t a t  e o  a bo the confeent for in i rar ra e prt alon on boaran  diret o e fuo  aable a to read for the wo oma brme fon eea

In [26]:
import spacy

In [27]:
nlp = spacy.load('en_core_web_sm')
nlp.pipe_names

['tagger', 'parser', 'ner']

In [43]:
from spacy import displacy

doc = nlp(text_output)
# Render the named entities of the already-parsed doc (no need to run the pipeline twice)
displacy.render(doc, style='ent', jupyter=True)

In [None]:
# Short and Quick Code for (Speech ==> Text ==> NER) by Susant Achary