This is an alternative, and in my view better, approach to transcribing long audio clips. The code is adapted from a response by Lysandre Jik (of Hugging Face) to a GitHub issue about transcribing long audio clips.

With this approach you don't need to split up your audio files manually. Librosa streams the audio file in fixed-length chunks, and each chunk is transcribed one at a time.


# CHAIN-LINKING NLP TASKS WITH WAV2VEC2 & TRANSFORMERS

Take an audio clip in English, transcribe it, and finally apply a layer of sentiment analysis to the speech or try to summarise it. For a human, getting all these tasks done could take hours.

The output, be it the transcribed or translated text, still requires a certain amount of clean-up by a human user.

This notebook demos a simple workflow to:
 - transcribe a longish English speech (~24 minutes)
 - plot the 'sentiment structure' of the English speech.

I decided to use Biden's [first prime time speech](https://www.youtube.com/watch?v=JYBatFW-BP4) on Mar 11/12 2021. 

## REQUIREMENTS

- [transformers](https://pypi.org/project/transformers/) >= 4.3
- [librosa](https://pypi.org/project/librosa/)



In [18]:
import librosa
import matplotlib as mpl
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import re
import torch

from nltk.tokenize import sent_tokenize
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Tokenizer,
    MarianMTModel,
    MarianTokenizer,
    pipeline,
)

mpl.rcParams["figure.dpi"] = 300
%matplotlib inline
%config InlineBackend.figure_format ='retina'

# 1. TRANSCRIBE

More detailed notes on using Wav2Vec2 and my work-around can be found in notebooks 1 and 2. I'll skip over the details here.

Biden's speech is about 24 minutes long and is transcribed in 25-second chunks, so the raw transcript will look a little disjointed.

Another issue is Wav2Vec2's inability to generate punctuation, so longer clips would be tricky for downstream tasks like summarization. But more on that later.

## 1.1 DEFINE FUNCTIONS TO TRANSCRIBE AUDIO CLIP IN 25-SECOND CHUNKS

Technically speaking, you can set the "block_length" parameter (passed in as "clip_length" in the function below) to any value. But anything above a 60s block length results in considerable out-of-memory issues. Blocks of 20 to 30 seconds seem to make the most sense to me.

I decided to set the block_length to 25s in this notebook after a few trials.
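To make the chunk size concrete: in `librosa.stream`, "block_length" counts frames rather than seconds, and one full block holds `frame_length + (block_length - 1) * hop_length` samples. The sketch below works through that arithmetic for the settings used in this notebook. The helper names are my own (hypothetical), and the 16 kHz sample rate is an assumption about the audio file; at a different sample rate the same settings give blocks of a different duration.

```python
import math


def block_samples(block_length, frame_length, hop_length):
    """Number of samples in one full block yielded by librosa.stream."""
    return frame_length + (block_length - 1) * hop_length


def block_seconds(block_length, frame_length, hop_length, sample_rate):
    """Duration of one full block in seconds at the given sample rate."""
    return block_samples(block_length, frame_length, hop_length) / sample_rate


def n_blocks(total_seconds, seconds_per_block):
    """Blocks needed to cover a clip; the last block may be shorter."""
    return math.ceil(total_seconds / seconds_per_block)


# Settings used in this notebook, assuming the file is sampled at 16 kHz
samples = block_samples(block_length=25, frame_length=16000, hop_length=16000)
seconds = block_seconds(25, 16000, 16000, sample_rate=16000)

# The same settings on a 44.1 kHz file give much shorter blocks
seconds_44k = block_seconds(25, 16000, 16000, sample_rate=44100)

print(samples, seconds, round(seconds_44k, 2))
print(n_blocks(24 * 60, seconds))  # rough block count for a ~24-minute clip
```

If the blocks come out shorter or longer than expected, the file's sample rate is the first thing I'd check.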

There's currently no good way to deal with punctuation in Wav2Vec2. As a temporary work-around, I added ". " to the end of every transcribed clip. Not ideal, but this step is needed for the sentence tokenizer in the MarianMT translation in Section 2, and for splitting the transcript back into clips in Section 3.

In [19]:
# function adapted via: https://github.com/huggingface/transformers/issues/10366

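def asr_transcript(tokenizer, model, audio_file, clip_length):
    transcript = ""

    # Each block holds `clip_length` frames of 16000 samples, i.e. `clip_length`
    # seconds of audio, assuming the file is sampled at 16 kHz
    stream = librosa.stream(
        audio_file, block_length=clip_length, frame_length=16000, hop_length=16000
    )

    for speech in stream:
        # librosa yields multi-channel blocks as (channels, samples); mix down to mono
        if len(speech.shape) > 1:
            speech = librosa.to_mono(speech)

        input_values = tokenizer(speech, return_tensors="pt").input_values

        # inference only, so skip gradient tracking to save memory
        with torch.no_grad():
            logits = model(input_values).logits

        # greedy decoding: pick the most likely token at each time step
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = tokenizer.decode(predicted_ids[0])
        transcript += transcription.lower() + ". "  # this places an artificial full stop at the end of each clip

    return transcript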
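The `argmax` followed by `tokenizer.decode` in the function above is greedy CTC decoding. As a rough illustration of what happens to the per-frame predictions, here is a toy sketch in plain Python. The vocabulary and frame IDs are made up for the example, and the function is a simplification I wrote for illustration, not the tokenizer's actual implementation.

```python
def greedy_ctc_decode(ids, id_to_token, blank="<pad>", word_delimiter="|"):
    """Toy greedy CTC decode: collapse repeats, drop blanks, join characters."""
    tokens = [id_to_token[i] for i in ids]

    # 1) collapse consecutive repeats of the same token
    collapsed = [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]

    # 2) drop the blank token, which only separates genuine repeats
    chars = [t for t in collapsed if t != blank]

    # 3) turn the word delimiter into spaces
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary and per-frame argmax IDs
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1]

decoded = greedy_ctc_decode(frame_ids, vocab)
print(decoded)  # HELLO
```

The blank between the two runs of "L" is what lets the double letter survive the collapsing step.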

## 1.2 LOAD CHOICE OF MODEL-TOKENIZER, AUDIO FILE AND CHECK RESULTS

This took about 12 minutes to run on my late-2015 iMac (the run recorded below finished in just over 9 minutes of wall time). The transcript quality is patchy in parts, but still very impressive out of the box.

In [20]:
#load tokenizer and pre-trained model
tokenizer_transcribe = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

model_transcribe = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

audio_file = "../audio/biden.flac"

clip_length = 25 # Stream over 25-second chunks

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.

The class `Wav2Vec2Tokenizer` is deprecated and will be removed in version 5 of Transformers. Please use `Wav2Vec2Processor` or `Wav2Vec2CTCTokenizer` instead.

Some weights of the model checkpoint at facebook/wav2vec2-large-960h-lv60-self were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint

In [21]:
%%time
biden = asr_transcript(tokenizer_transcribe, model_transcribe, audio_file, clip_length)

CPU times: total: 1h 22min 29s
Wall time: 9min 14s


In [22]:
print(biden)

a year ago weere hit with a virus that was met with silence and spread unchecked denials for days weeks then months that led to more deaths more infections more stress and more loneliness photos and viteos from twenty nineteen feel like they were taken in another era. the last vacation the last birthday with friends the last holiday with the extended family while it was different for everyone we all lost something a collective suffering a collective sacrifice a year filled with the loss of life and the loss of living for all of us but in the loss. we saw how much there was to gain in appreciation respect and gratitude finding light in the darkness is a very american thing to do in fact it may be the most american thing we do and that's what we've done we've seen front line and te central workers risking their lives sometimes losing them. to save and help others researchers and sciencists racing for a vaccine and so many of you as hemingway wrote being strong in all the broken places i 

In [23]:
# Output the transcript to a text file if you wish.

#with open("../transcripts/biden_alt.txt", "w") as file:
#    file.write(biden)

# 3. SENTIMENT ANALYSIS

Hugging Face's pipeline has made it very easy to generate results for sentiment analysis, but that's just half the mission. Figuring out a good way to visualize the results for immediate and long-term analysis can be just as challenging.

I've been experimenting with these sentiment charts, using Plotly, to better understand the overall sentiment of a speech.

## 3.1 CREATE NEW DF FOR TRANSCRIBED SPEECH

We want to see the sentiment label and score for each transcribed clip, so it's best to hold them in a DataFrame (DF), one row per clip.

In [25]:
# one row per transcribed clip, split on the artificial full stops added in Section 1.1
df = pd.DataFrame({"Speech_Paras": biden.split(".")})

In [26]:
# filter out empty or whitespace-only rows

crit = df["Speech_Paras"].str.strip() == ""

df = df[~crit]
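To see why the filter is needed, here is a small sketch in plain Python of what splitting on "." does to a transcript built the way `asr_transcript` builds it. The three clips are shortened stand-ins, not the real transcript. Note that this only works because Wav2Vec2 emits no punctuation of its own; any other "." in the text would also cause a split.

```python
# Shortened stand-in clips (illustrative only)
clips = ["a year ago", "we saw how much", "to save and help others"]

# asr_transcript appends ". " after every clip
transcript = "".join(clip + ". " for clip in clips)

# Splitting on "." leaves a leading space on most pieces
# and a whitespace-only piece at the very end
pieces = transcript.split(".")
print(pieces)

# Keep only pieces with real content, trimmed of surrounding spaces
rows = [piece.strip() for piece in pieces if piece.strip()]
print(rows)
```

The trailing whitespace-only piece is the "empty row" that gets filtered out of the DF.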

In [27]:
df.tail()

Unnamed: 0,Speech_Paras
52,be on the mend our kids will be back in schoo...
53,we're coming through it and it's a shared exp...
54,prayer for our country that after all we've b...
55,history darkest we've ever known i promise yo...
56,do when we do it together so god bless you al...


## 3.2 USE HUGGING FACE'S PIPELINE FOR SENTIMENT ANALYSIS

In [28]:
%%time
corpus = list(df['Speech_Paras'].values)

nlp_sentiment = pipeline(
    "sentiment-analysis"
)


df["Sentiment"] = nlp_sentiment(corpus)

# The pipeline's sentiment analysis output consists of a label and a score
# I prefer to extract them into separate columns

df['Sentiment_Label'] = [x.get('label') for x in df['Sentiment']]

df['Sentiment_Score'] = [x.get('score') for x in df['Sentiment']]

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


CPU times: total: 22.9 s
Wall time: 4.41 s


In [29]:
df['Sentiment_Label'].value_counts()

Sentiment_Label
POSITIVE    29
NEGATIVE    28
Name: count, dtype: int64

## 3.3 PLOT SENTIMENT CHART WITH PLOTLY

In [30]:
# We won't need all the cols for the chart, so let's narrow down the selection

cols = ["Speech_Paras", "Sentiment_Label", "Sentiment_Score"]

df_sentiment = df[cols].copy()

In [31]:
# Tweaking the sentiment score column for visualisation
# Absolute value of the score is unchanged, merely the direction so that
# the resulting chart is clearer on a divergent axis

df_sentiment["Sentiment_Score"] = np.where(
    df_sentiment["Sentiment_Label"] == "NEGATIVE", -(df_sentiment["Sentiment_Score"]), df_sentiment["Sentiment_Score"]
)
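The sign flip is easier to see on a handful of values. This sketch uses mocked pipeline results in plain Python rather than the real DF; the labels and scores are invented for illustration and only mimic the shape of the pipeline's output (a list of dicts with "label" and "score").

```python
# Mocked pipeline output (illustrative values, not real results)
results = [
    {"label": "NEGATIVE", "score": 0.9977},
    {"label": "POSITIVE", "score": 0.9973},
    {"label": "NEGATIVE", "score": 0.6216},
]


def signed_score(result):
    """Negate the score for NEGATIVE labels; leave POSITIVE scores as they are."""
    if result["label"] == "NEGATIVE":
        return -result["score"]
    return result["score"]


signed = [signed_score(r) for r in results]
print(signed)  # [-0.9977, 0.9973, -0.6216]
```

The score is the model's confidence in its chosen label, so a weakly negative clip (0.62) ends up closer to the middle of the divergent axis than a strongly negative one (0.99).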

In [32]:
df_sentiment.head()

Unnamed: 0,Speech_Paras,Sentiment_Label,Sentiment_Score
0,a year ago weere hit with a virus that was met...,NEGATIVE,-0.997695
1,the last vacation the last birthday with frie...,NEGATIVE,-0.992579
2,we saw how much there was to gain in apprecia...,POSITIVE,0.997272
3,to save and help others researchers and scien...,NEGATIVE,-0.621613
4,with the number of americans who have died fr...,NEGATIVE,-0.999004


In [33]:
# Optional; output for a closer look at the sentiment label/scores if you wish

# df_sentiment.to_csv("../transcripts/biden_sentiment_alt.csv", index=False)

In [34]:
# I've experimented with various plots and settled on Plotly's Heatmap

fig = go.Figure(
    data=go.Heatmap(
        z=df_sentiment["Sentiment_Score"],
        x=df_sentiment.index,
        y=df_sentiment["Sentiment_Label"],
        colorscale=px.colors.sequential.RdBu,
    )
)

fig.update_layout(
    title=go.layout.Title(
        text="Sentiment Analysis of Biden's First Prime-Time Speech (2021)"
    ),
    autosize=False,
    width=1200,
    height=600,
)

#fig.update_layout(yaxis_autorange = "reversed")

fig.show()