<a href="https://colab.research.google.com/github/Arajesh03/Speech-Recognition-and-summarization-system/blob/main/ASpeechRecognitionandSummarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

####Necessary Installs

In [None]:
!pip install vosk
!pip install pydub
!pip install transformers
!pip install torch -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


####Speech Recognition: Downloading the Model and Audio Files

- Vosk is easy to use and offers large and small per-language models trained on thousands of hours of speech data.

- Download the Vosk model and the audio files, then load the selected model, initialize a recognizer, and enable word-level output so that the model returns the individual words as well as the complete transcript.

In [None]:
from vosk import Model, KaldiRecognizer

In [None]:
FRAME_RATE = 16000
CHANNELS = 1

model = Model(model_name="vosk-model-en-us-0.22")
rec = KaldiRecognizer(model, FRAME_RATE)
rec.SetWords(True)
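With `SetWords(True)`, each recognizer result is a JSON string that carries a per-word list next to the full text. The sketch below parses such a string; the sample values are hypothetical and only illustrate the shape (`result` entries with `word`, `start`, `end`, `conf`, plus a top-level `text`), and `words_with_times` is a helper name introduced here for illustration.

```python
import json

# Hypothetical sample of a Vosk result with word details enabled;
# the values are illustrative, not taken from the audio in this notebook.
sample_result = '''{
  "result": [
    {"conf": 1.0, "start": 0.12, "end": 0.45, "word": "turns"},
    {"conf": 0.93, "start": 0.45, "end": 0.61, "word": "out"}
  ],
  "text": "turns out"
}'''

def words_with_times(result_json):
    """Return (word, start, end) tuples from a recognizer result string."""
    data = json.loads(result_json)
    # "result" may be absent when nothing was recognized, so default to an empty list
    return [(w["word"], w["start"], w["end"]) for w in data.get("result", [])]

word_times = words_with_times(sample_result)
print(word_times)
```

The same helper could be applied to the string returned by `rec.Result()` later in the notebook.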

####Speech Recognition: Loading an Audio File

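- Next, load the audio with the pydub library so it can be passed to the recognizer, which recognizes the text in the speech.
- pydub makes it easy to load and edit audio files.
- Set the number of channels and the frame rate to match the recognizer: an MP3 is typically stereo (2 channels) at 44,100 Hz, while the recognizer here is set up for mono at 16,000 Hz.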

In [None]:
from pydub import AudioSegment

In [None]:
mp3 = AudioSegment.from_mp3("marketplace_full (1).mp3")
mp3 = mp3.set_channels(CHANNELS)
mp3 = mp3.set_frame_rate(FRAME_RATE)
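As a sanity check on the conversion, the size of the raw PCM data relates directly to the audio duration. The sketch below assumes 16-bit samples (a sample width of 2 bytes), which is an assumption about the decoded audio rather than something stated in this notebook; `pcm_duration_seconds` is a hypothetical helper name.

```python
def pcm_duration_seconds(num_bytes, frame_rate=16000, channels=1, sample_width=2):
    """Duration of raw PCM audio, given its size in bytes."""
    # Each second holds frame_rate frames; each frame holds one sample per channel
    bytes_per_second = frame_rate * channels * sample_width
    return num_bytes / bytes_per_second

# One minute of 16 kHz mono 16-bit audio is 60 * 16000 * 1 * 2 bytes
one_minute_bytes = 60 * 16000 * 1 * 2
duration = pcm_duration_seconds(one_minute_bytes)
print(duration)  # 60.0
```

With pydub, `len(mp3.raw_data)` gives the byte count and `len(mp3) / 1000` the duration in seconds, so the two should roughly agree after `set_channels` and `set_frame_rate`.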

####Speech Recognition: Transcribing the Audio File into Text

- Pass the audio file into the speech recognition model to get the text transcript of the speech.


In [None]:
# AcceptWaveform() passes the raw audio data to the recognizer
rec.AcceptWaveform(mp3.raw_data)

1

In [None]:
#Extract the results of speech recognition from the recognizer
result = rec.Result()

In [None]:
#Parse the JSON result into a Python dictionary and extract only the text of the speech
import json
text = json.loads(result)["text"]

In [None]:
text

"turns out fifty four dollars and twenty cents was not a joke from american public media this is marketplace the in los angeles ca resident monday today i do believe the twenty fifth of april good as always to have you along everybody all right just for fun i am going to see if i can do this in two hundred and eighty characters which is of course twitter's limit starting right now after making a not very veiled marijuana reference in offering fifty four dollars twenty cents a share to buy twitter elon musk has sealed the deal as of today lauren hirsch has been covering the story for the new york times thanks for coming on thanks for having me setting aside all marijuana jokes that many people made with this price that musk offered and clearly he was serious and now this has happened in an unbelievably fast timeline right and we will be fast i tell you i was at a shower yesterday communicating like us to where it's kind of casually checking in at my source code i think there could be de
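An alternative to passing the whole file in one call is to feed the recognizer small chunks and collect the text as each utterance completes. The sketch below uses only `AcceptWaveform()`, `Result()` and `FinalResult()`; `transcribe_in_chunks`, the chunk size, and the `FakeRecognizer` stand-in are assumptions made for illustration so the example runs without Vosk installed.

```python
import json

def transcribe_in_chunks(rec, raw_data, chunk_bytes=8000):
    """Feed raw PCM bytes to a recognizer piece by piece and join the text."""
    pieces = []
    for start in range(0, len(raw_data), chunk_bytes):
        chunk = raw_data[start:start + chunk_bytes]
        # AcceptWaveform() returns a truthy value when an utterance has ended
        if rec.AcceptWaveform(chunk):
            pieces.append(json.loads(rec.Result())["text"])
    # FinalResult() flushes whatever audio is still buffered
    pieces.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in pieces if p)

class FakeRecognizer:
    """Stand-in exposing the same three methods, used only for this demo."""
    def __init__(self):
        self.seen = 0
    def AcceptWaveform(self, data):
        self.seen += 1
        return self.seen % 2 == 0  # pretend every second chunk ends an utterance
    def Result(self):
        return json.dumps({"text": f"utterance {self.seen // 2}"})
    def FinalResult(self):
        return json.dumps({"text": "tail"})

demo = transcribe_in_chunks(FakeRecognizer(), b"\x00" * 20000, chunk_bytes=5000)
print(demo)
```

With the real recognizer, the call would be `transcribe_in_chunks(rec, mp3.raw_data)`. An even chunk size keeps 16-bit samples from being split across chunks.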

####Adding Punctuation to the transcript

- Punctuation requires another library, recasepunc. Vosk provides its own models, trained with recasepunc, that add punctuation to Vosk output.

- On the Vosk website, go to the Punctuation models section of the models page, download the pre-trained model, and unzip the file. Two files are needed: checkpoint (the pre-trained model) and recasepunc (the script you'll use to run inference).

In [None]:
!pip install transformers
!pip install torch -f https://download.pytorch.org/whl/torch_stable.html


In [None]:
import subprocess
cased = subprocess.check_output('python recasepunc/recasepunc.py predict recasepunc/checkpoint', shell=True, text=True, input=text)


CalledProcessError: Command 'python recasepunc/recasepunc.py predict recasepunc/checkpoint' returned non-zero exit status 2.
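A non-zero exit status like this often means the interpreter could not open the script, so checking the expected paths is a reasonable first diagnostic. The sketch below assumes the layout used in the command above (`recasepunc/recasepunc.py` and `recasepunc/checkpoint`); `missing_recasepunc_files` is a hypothetical helper name.

```python
from pathlib import Path

def missing_recasepunc_files(base="recasepunc"):
    """List the expected recasepunc files that are not present under base."""
    base = Path(base)
    # Paths taken from the subprocess command: the inference script and the model checkpoint
    expected = [base / "recasepunc.py", base / "checkpoint"]
    return [str(p) for p in expected if not p.exists()]

missing = missing_recasepunc_files()
print(missing if missing else "All recasepunc files found")
```

An empty list suggests the paths are fine and the failure lies elsewhere, for example in a missing dependency.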

In [None]:
!pip install recasepunc

ERROR: Could not find a version that satisfies the requirement recasepunc (from versions: none)
ERROR: No matching distribution found for recasepunc

In [None]:
import subprocess
# Retry, invoking recasepunc as a module. This requires the recasepunc package to be importable
# and the checkpoint file to be at recasepunc/checkpoint (it is downloaded separately).
try:
    cased = subprocess.check_output('python -m recasepunc.recasepunc predict recasepunc/checkpoint', shell=True, text=True, input=text)
    print(cased)
except subprocess.CalledProcessError as e:
    print(f"Error running recasepunc: {e}")
    print(f"Stderr: {e.stderr}")
except FileNotFoundError:
    print("Error: recasepunc script or checkpoint not found. Please ensure recasepunc is installed and the checkpoint file is in the correct location.")

Error running recasepunc: Command 'python -m recasepunc.recasepunc predict recasepunc/checkpoint' returned non-zero exit status 1.
Stderr: None
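`Stderr: None` appears because `subprocess.check_output` captures only standard output, so `e.stderr` stays empty unless stderr is captured as well. A minimal sketch of one way to see the underlying error message, using `subprocess.run` with `capture_output=True`; `run_and_capture` is a hypothetical helper and the failing command is a harmless stand-in, not the recasepunc call itself.

```python
import subprocess
import sys

def run_and_capture(cmd, input_text=""):
    """Run cmd (a list of arguments) and return (ok, stdout, stderr)."""
    proc = subprocess.run(cmd, input=input_text, text=True, capture_output=True)
    return proc.returncode == 0, proc.stdout, proc.stderr

# Stand-in command that writes to stderr and exits with a non-zero status
failing = [sys.executable, "-c",
           "import sys; sys.stderr.write('no checkpoint'); sys.exit(1)"]
ok, out, err = run_and_capture(failing)
print(ok, repr(err))
```

Applied to the recasepunc command, the captured `stderr` would show the actual traceback or path error rather than `None`.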


####Defining a function to transcribe longer audio files




- To handle a long file, split it into pieces of about 45 seconds each, transcribe each piece into text, and concatenate the pieces so that punctuation can then be added to the full transcript.
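The splitting and joining steps can be sketched independently of the recognizer. Both helper names below, `chunk_bounds` and `join_pieces`, are hypothetical; the 45,000 ms step matches the 45-second pieces described above.

```python
def chunk_bounds(total_ms, step_ms=45000):
    """Return (start, end) millisecond pairs that cover the whole audio."""
    # The last pair is clipped to the audio length, so it may be shorter than step_ms
    return [(start, min(start + step_ms, total_ms))
            for start in range(0, total_ms, step_ms)]

def join_pieces(pieces):
    """Join per-segment transcripts with single spaces, skipping empty ones."""
    return " ".join(p.strip() for p in pieces if p.strip())

bounds = chunk_bounds(100000)
joined = join_pieces(["turns out", "", "this is marketplace"])
print(bounds)
print(joined)
```

Joining with a space matters because each segment's text has no trailing whitespace, so plain concatenation would fuse the last word of one segment with the first word of the next.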


In [None]:
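def voice_recognition(filename):
  # Load the model and create a recognizer with word-level output enabled
  model = Model(model_name="vosk-model-en-us-0.22")
  rec = KaldiRecognizer(model, FRAME_RATE)
  rec.SetWords(True)

  # Convert the audio to mono at the recognizer's frame rate
  mp3 = AudioSegment.from_mp3(filename)
  mp3 = mp3.set_channels(CHANNELS)
  mp3 = mp3.set_frame_rate(FRAME_RATE)

  step = 45000  # segment length in milliseconds (45 seconds)
  transcript = ""
  for i in range(0, len(mp3), step):
    print(f"Progress: {i/len(mp3)}")
    segment = mp3[i:i+step]
    rec.AcceptWaveform(segment.raw_data)
    result = rec.Result()
    text = json.loads(result)["text"]
    # Separate segments with a space so boundary words do not run together
    transcript += text + " "
  # Return only after every segment has been processed
  return transcript.strip()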


In [None]:
transcript = voice_recognition("marketplace_full (1).mp3")
