# Speech-to-Text conversion

Text-to-Speech (TTS) and Speech-to-Text (STT) are two of the most interesting Natural Language Processing tasks.

Text-to-Speech is a process in which input text is analyzed and processed, then converted to digital audio and spoken. It can be applied at the level of paragraphs, sentences, words, syllables, and letters ([ref](https://h2o.ai/wiki/speech-to-text/)).

Speech-to-Text recognizes spoken language and transcribes it into text through computational linguistics. It is also known as computer speech recognition. Linguistic algorithms sort auditory signals and convert them into words using Unicode characters ([ref](https://h2o.ai/wiki/speech-to-text/)).

This notebook is inspired by an *Analytics Vidhya* [tutorial](https://www.analyticsvidhya.com/blog/2022/01/speech-to-text-conversion-in-python-a-step-by-step-tutorial/) on Speech-to-Text conversion. I tried to record my own speech and convert it to text, but `sounddevice` does not work in Google Colab (most likely because its machines have no microphones).

Instead, I use two publicly available `wav` files downloaded from [*signalogic*](https://www.signalogic.com/index.pl?page=speech_codec_wav_samples): one of a male voice and one of a female voice.

The first step is to install the [`SpeechRecognition`](https://pypi.org/project/SpeechRecognition/) library. `libportaudio2` is also needed to make `SpeechRecognition` work in Colab.

### Install packages

In [1]:
!pip install SpeechRecognition

Collecting SpeechRecognition
  Downloading SpeechRecognition-3.8.1-py2.py3-none-any.whl (32.8 MB)
     32.8/32.8 MB 22.0 MB/s eta 0:00:00
Installing collected packages: SpeechRecognition
Successfully installed SpeechRecognition-3.8.1

In [2]:
!sudo apt-get install libportaudio2

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  libportaudio2
0 upgraded, 1 newly installed, 0 to remove and 36 not upgraded.
Need to get 65.4 kB of archives.
After this operation, 223 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 libportaudio2 amd64 19.6.0-1build1 [65.4 kB]
Fetched 65.4 kB in 1s (84.9 kB/s)        
Selecting previously unselected package libportaudio2:amd64.
(Reading database ... 106350 files and directories currently installed.)
Preparing to unpack .../libportaudio2_19.6.0-1build1_amd64.deb ...
Unpacking libportaudio2:amd64 (19.6.0-1build1) ...
Setting up libportaudio2:amd64 (19.6.0-1build1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.9) ...


### Imports

In [3]:
import speech_recognition as sr
import IPython.display  # explicit submodule import, needed for IPython.display.Audio

### Load audio files

The paths to both audio files are stored in variables and passed to `IPython.display.Audio`, which displays a player widget for the audio.

In [4]:
male_path = "../input/audio-files/male.wav"
IPython.display.Audio(male_path)

In [5]:
female_path = '../input/audio-files/female.wav'
IPython.display.Audio(female_path)
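
Before passing a file to the recognizer, it can help to check its basic properties. The following is a minimal sketch using Python's standard `wave` module. The helper names `describe_wav` and `write_silent_wav` are hypothetical (not part of the original notebook), and the demo runs on a generated silent file so that it works without the two sample recordings.

```python
import os
import tempfile
import wave


def write_silent_wav(path, seconds=1, sample_rate=8000):
    """Write a mono 16-bit PCM WAV file containing only silence (demo data)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * sample_rate * seconds)


def describe_wav(path):
    """Return channel count, sample width, sample rate and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


# Demo on a generated file; replace demo_path with male_path or female_path.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
write_silent_wav(demo_path, seconds=2, sample_rate=8000)
print(describe_wav(demo_path))
```

Calling `describe_wav(male_path)` in the notebook should report the same fields for the real recording, assuming it is an uncompressed PCM `wav` file.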

Next, the speech `Recognizer()` is initialized.

In [6]:
r = sr.Recognizer()

### Convert Speech to Text

The function below opens the audio file and listens to it with the speech `Recognizer()`. It uses Google speech recognition to transcribe the audio to text. If recognition fails, the function prints an error message and returns `None`.

In [7]:
def speech_to_text(audio_file):
    """
    Opens and listens to an audio file and transcribes it to text
    Args: audio_file: path to the audio file
    Returns: text of the transcribed audio, or None if recognition fails
    """
    with sr.AudioFile(audio_file) as source:
        # listen() records until a sufficiently long pause or the end of the file
        audio_data = r.listen(source)
    try:
        text = r.recognize_google(audio_data)
        print('Converting audio transcripts into text ...')
    except (sr.UnknownValueError, sr.RequestError):
        # Speech was unintelligible, or the recognition service could not be reached
        print('Sorry.. run again...')
        return None
    return text
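
The "run again" message suggests a manual retry. As a sketch, retrying could be automated with a small wrapper. This assumes a transcription callable that returns `None` when recognition fails; `transcribe_with_retry` and `flaky_transcribe` are hypothetical names, and the latter only simulates a recognizer that fails twice before succeeding.

```python
import time


def transcribe_with_retry(transcribe, audio_file, attempts=3, delay=1.0):
    """Call transcribe(audio_file) until it returns text or the attempts run out."""
    for attempt in range(1, attempts + 1):
        text = transcribe(audio_file)
        if text is not None:
            return text
        if attempt < attempts:
            time.sleep(delay)  # short pause before trying again
    return None


# Stand-in for a real recognizer: fails twice, then succeeds (assumption for the demo).
demo_calls = []


def flaky_transcribe(path):
    demo_calls.append(path)
    return None if len(demo_calls) < 3 else "hello world"


print(transcribe_with_retry(flaky_transcribe, "demo.wav", attempts=3, delay=0))
```

Retrying mainly helps with transient failures such as a dropped network connection. Audio that the recognizer cannot understand will most likely fail on every attempt.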

The transcripts of both recordings are printed below. Clearly spoken words are recognized and transcribed well by `SpeechRecognition`, but some unclear words are not captured correctly.
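
One common way to put a number on transcription quality is the word error rate (WER): the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below is an illustration only. The function name `word_error_rate` is hypothetical, the hypothesis is one phrase from the male transcript, and the reference sentence is an assumed guess at what was actually spoken, since the original notebook does not include reference transcripts.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Assumed reference (a guess, not taken from the source) vs. recognizer output.
reference = "the old shop adage still holds a good mechanic is usually a bad boss"
hypothesis = "the old shop at it still holds a good mechanic is usually a bad boss"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

A WER of 0 means a perfect match. Values can exceed 1 when the hypothesis contains many extra words.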

In [8]:
speech_to_text(male_path)

Converting audio transcripts into text ...


"what if somebody decides to break it be careful that you keep adequate coverage but look for places to save money baby it's taking longer to get things squared away than the bankers expected during the wife for once company may win her tax hated retirement income as helpful as our cost on the two naked bone when the title of this type of song is in question there's no dying or waxing or gassing needed maybe personalized leather hard place work on a flat surface and smooth out the simplest kind of separate system uses a single self-contained unit the old shop at it still holds a good mechanic is usually a bad boss both figures would go higher in later years doll houses at set"

In [9]:
speech_to_text(female_path)

Converting audio transcripts into text ...


"perhaps this is what gives the operating in his are there of dignity turbulent are dressed as much as 50 ft Jenna choreographer Mets arbitrate did not have ever settle back into acquiescence with things as they were around me enough in this instance such personal virtues were a luxury two other cases offer under advisement and he's a horse thief runs and edit Freightline with symbolize its uniqueness the circle its universality kill small hole and ball with Clay multiple implications and possible headaches be a marketing program manufacturer of the costs involved overlapping twisted or why do you always navigate like this"