# Build a Speech Recognition and Summarization System

In this project, we are going to use a system that recognizes speech such as python's vosk library and connect it to a summarization system. This will allow us to transcribe audio files and produce somewhat accuracy summaries of audio files.

In order to complete this project you will need the following libraries:
- vosk
- pydub
- torch
- transformers

## Downloading the Vosk Model

The model can be downloaded using the following link: https://alphacephei.com/vosk/models

In [1]:
from vosk import Model

In [2]:
FRAME_RATE = 16000 # Sampling rate of audio file in Hertz (indication of quality)
CHANNELS = 1 # Number of audio channels (speech recognition works best with 1 channel)

In [3]:
model = Model("vosk-model-small-en-us-0.15") # 40 MB small model of the English language for ~ 8 GB RAM machines

## Initialize a Recognizer

Next we are going to initialize a recognizer. We will need the KaldiRecognizer from vosk, which we will load next.

In [4]:
from vosk import KaldiRecognizer

In [5]:
rec = KaldiRecognizer(model, FRAME_RATE)
rec.SetWords(True)

## Loading an Audio File

Next we will explore how to load an audio file into our model.

The audio files can be downloaded from this link: https://github.com/dataquestio/project-walkthroughs/tree/master/speech_recognition

In [6]:
from pydub import AudioSegment

In [7]:
mp3 = AudioSegment.from_mp3("marketplace.mp3")

We are going to set the channels for this audio file to a single channel only. We will use the single channel variable we specified up top.

In [8]:
mp3 = mp3.set_channels(CHANNELS)

We are also going to adjust the frame rate to ensure best performance for speech recognition.

In [9]:
mp3 = mp3.set_frame_rate(FRAME_RATE)

## Audio File Transcription into Text

We are now going to extract the audio transcription as a text file using our recognizer and vosk model.

In [10]:
rec.AcceptWaveform(mp3.raw_data)

1

In [11]:
result = rec.Result()

In [12]:
import json

In [13]:
text = json.loads(result)["text"]
text

"the funny thing about the big economic news of the day the fed raising interest rates have a percentage point was that there was only really one bit of actual news in the news and the interest rate increase wasn't it you know it was common i know it was common wall street news common businesses knew it was common so on this fed day on this program something a little bit different j powell in his own words five of i'm his most used economic words from today's press conference where number one of course it's the biggie two percent inflation flesh and inflation inflation inflation place in english dealing with inflation bells big worry that thing keeping him up at night price stability is the feds whole ballgame right now pal basically said as much to day or number two"

We used the AcceptWaveform method with the raw data of our mp3 file to obtain the transcription. Getting access to the transcription can be done using the Result method. This result is outputted as a json file, so we need the json library from python to help convert this into a dictionary. The text key has the transcribed text as a value.

The first observed issue is that there is no punctuation in this transcript. Also the full result will output the confidence in each word. We could look at the output and decide to replace certain words for which the confidence is low.

In [14]:
json.loads(result)

{'result': [{'conf': 1.0, 'end': 0.15, 'start': 0.0, 'word': 'the'},
  {'conf': 1.0, 'end': 0.54, 'start': 0.15, 'word': 'funny'},
  {'conf': 1.0, 'end': 0.96, 'start': 0.54, 'word': 'thing'},
  {'conf': 1.0, 'end': 1.2, 'start': 0.96, 'word': 'about'},
  {'conf': 1.0, 'end': 1.29, 'start': 1.2, 'word': 'the'},
  {'conf': 1.0, 'end': 1.68, 'start': 1.29, 'word': 'big'},
  {'conf': 1.0, 'end': 2.22, 'start': 1.71, 'word': 'economic'},
  {'conf': 1.0, 'end': 2.46, 'start': 2.22, 'word': 'news'},
  {'conf': 1.0, 'end': 2.55, 'start': 2.46, 'word': 'of'},
  {'conf': 1.0, 'end': 2.64, 'start': 2.55, 'word': 'the'},
  {'conf': 1.0, 'end': 3.03, 'start': 2.64, 'word': 'day'},
  {'conf': 1.0, 'end': 3.72, 'start': 3.6, 'word': 'the'},
  {'conf': 1.0, 'end': 3.96, 'start': 3.72, 'word': 'fed'},
  {'conf': 1.0, 'end': 4.26, 'start': 3.96, 'word': 'raising'},
  {'conf': 1.0, 'end': 4.59, 'start': 4.26, 'word': 'interest'},
  {'conf': 1.0, 'end': 4.98, 'start': 4.59, 'word': 'rates'},
  {'conf': 0

## Adding Punctuation to the Transcript

As mentioned, punctuation is missing from the official output. We can add punctuation using the recasepunc library. Luckily for us, vosk has trained its models to already make use of recasepunc. Unfortunately, these models are at least 1 GB large. This project therefore needs to be done on a machine with at least 16 GB of RAM. We will skip this step since we currently only have access to 8 GB of RAM.

## Transcribing Longer Audio Files

We are now going to build a custom function that will split longer audio files in 45 second bits. Vosk does not work well with longer audio files since it will consume too much memory and inference becomes slow. 

In [15]:
def transcriber(file: str) -> str: 
    # Initializing a recognizer using the Vosk model and frame_rate
    rec = KaldiRecognizer(model, FRAME_RATE)
    rec.SetWords(True)

    # Loading an audio file
    mp3 = AudioSegment.from_mp3(f"{file}.mp3")
    mp3 = mp3.set_channels(CHANNELS)
    mp3 = mp3.set_frame_rate(FRAME_RATE)

    step = 45000
    transcript = ""
    for i in range(0, len(mp3), step):

        print(f"Progress: {round(i / len(mp3) * 100)}%")
        
        segment = mp3[i:(i+step)]
        rec.AcceptWaveform(segment.raw_data)
        result = rec.Result()
        text_segment = json.loads(result)["text"]

        transcript += text_segment

    return transcript
    

In [16]:
transcriber("marketplace")

Progress: 0%
Progress: 98%


"the funny thing about the big economic news of the day the fed raising interest rates have a percentage point was that there was only really one bit of actual news in the news and the interest rate increase wasn't it you know it was common i know it was common wall street news common businesses knew it was common so on this fed day on this program something a little bit different j powell in his own words five of i'm his most used economic words from today's press conference where number one of course it's the biggie two percent inflation flesh and inflation inflation inflation place in english dealing with inflation bells big worry that thing keeping him up at night price stability is the feds whole ballgame right now pal basically said as muchtoday or number two"

Running the function on the previous 45 second slice returns a similar transcript output as before and so we know that our function works well.

## Summarizing the Transcripts

Next we will look at the summarization of the transcripts using the HuggingFace pretrained transformer library. We initially imported transformers, which we will make use of now.

In [17]:
from transformers import pipeline

In [20]:
summarizer = pipeline("summarization", model="t5-small")

Device set to use cpu


In [24]:
split_text = text.split() # split the transcribed text on spaces

In [25]:
docs = []
for i in range(0, len(split_text), 850):
    selection = " ".join(split_text[i:(i+850)])
    docs.append(selection)

In [26]:
docs

["the funny thing about the big economic news of the day the fed raising interest rates have a percentage point was that there was only really one bit of actual news in the news and the interest rate increase wasn't it you know it was common i know it was common wall street news common businesses knew it was common so on this fed day on this program something a little bit different j powell in his own words five of i'm his most used economic words from today's press conference where number one of course it's the biggie two percent inflation flesh and inflation inflation inflation place in english dealing with inflation bells big worry that thing keeping him up at night price stability is the feds whole ballgame right now pal basically said as much to day or number two"]

We just split the small transcribed text and joined it together in a list. This method works better for larger texts which we will play around with after the small transcribed text.

In [27]:
summaries = summarizer(docs)

Your max_length is set to 200, but your input_length is only 163. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=81)


In [28]:
summaries

[{'summary_text': 'the biggie two percent inflation flesh and inflation inflation inflation place in english dealing with inflation bells big worry that thing keeping him up at night price stability is the feds whole ballgame right now pal basically said as much to day or number two .'}]

In [29]:
summary = "\n\n".join(d["summary_text"] for d in summaries)
summary

'the biggie two percent inflation flesh and inflation inflation inflation place in english dealing with inflation bells big worry that thing keeping him up at night price stability is the feds whole ballgame right now pal basically said as much to day or number two .'

We just used the huggingface summarizer tool to produce a summary of the small podcast. Next we will build a custom function to do this work automatically for us and then connect both custom functions together to build a pipeline of podcast to summary conversion.

## Building Custom Summarizer Function

In [36]:
def summarizer(transcript: str) -> str:
    split_transcript = transcript.split()

    docs = []
    step = 850
    for i in range(0, len(split_transcript), step):
        selection = " ".join(split_transcript[i:(i+step)])
        docs.append(selection)

    summarizer = pipeline("summarization", model="t5-small")
    summaries = summarizer(docs)

    summary = " ".join(d["summary_text"] for d in summaries)

    return summary

The above custom function should return a summary using the huggingface pretrained transformer. The function takes a transcript as input. We will now chain both functions together (transcriber and summarizer) to provide a single function to automatically return a summary from a transcribed podcast.

In [37]:
def transcriber_summarizer(file):

    return summarizer(transcriber(file))

Next we are going to test this function on the full marketplace podcast.

In [38]:
transcriber_summarizer("marketplace_full")

Progress: 0%
Progress: 3%
Progress: 5%
Progress: 8%
Progress: 11%
Progress: 13%
Progress: 16%
Progress: 19%
Progress: 21%
Progress: 24%
Progress: 27%
Progress: 29%
Progress: 32%
Progress: 35%
Progress: 37%
Progress: 40%
Progress: 43%
Progress: 45%
Progress: 48%
Progress: 51%
Progress: 53%
Progress: 56%
Progress: 59%
Progress: 61%
Progress: 64%
Progress: 67%
Progress: 69%
Progress: 72%
Progress: 75%
Progress: 77%
Progress: 80%
Progress: 83%
Progress: 85%
Progress: 88%
Progress: 91%
Progress: 93%
Progress: 96%
Progress: 99%


Device set to use cpu
Token indices sequence length is longer than the specified maximum sequence length for this model (1046 > 512). Running this sequence through the model will result in indexing errors


"elon musk has sealed the deal as of today lauren hirsch has been covering the story for the new york times thanks for having me set aside all marijuana jokes that many people paid with this price clearly he was serious . in i i think both people never that a deal would have happened and even if people thought they weren't a bea deal i don't think anyone thought it could happen deskquickly he gets all the money together they have these board meetings over i think we all need to rethink their ramifications on communication in a country lord hirsch covers business and most recently the twitter story new almost for the new york times or thanks for having me that introduction by the way tootwo hundred and forty five characters nailed it it's twitter shares on this monday up almost six percent still though a couple about shy of must offer of fifty four dollars twenty cents bees elsewhere major indices were up . the national association for business economics is out the new survey of economi

We have now obtained a summary attempt of a lengthy podcast. We can clearly tell than punctuation is missing, which would make the summary more legible. Using the larger vosk models and summarizer models would also improve the quality of the output. Other modifications that could be made to potentially improve the quality is the steps and the splitting of the tokens, although that will make the processing slower.