# Speech to text (S2T) & text to speech (T2S)

## Contents
1. S2T - with Google (online)
2. S2T - simple script (offline)
3. S2T - with OpenAI's Whisper
4. T2S - with gTTS library (online)
5. T2S - with pyttsx3 (offline)
6. To do - Google's XTREME-S

## 1. SpeechRecognition (online)

<b>links</b>
http://people.csail.mit.edu/hubert/pyaudio/

This section uses two packages:
1. PyAudio, which gives Python access to the microphone. Install it with Conda: conda install -c conda-forge pyaudio
2. SpeechRecognition, which wraps several recognition services. Install it with pip.

For more info on Speech recognition, see: https://realpython.com/python-speech-recognition/

In [1]:
!pip install SpeechRecognition

Collecting SpeechRecognition
  Using cached SpeechRecognition-3.8.1-py2.py3-none-any.whl (32.8 MB)
Installing collected packages: SpeechRecognition
Successfully installed SpeechRecognition-3.8.1


In [3]:
import speech_recognition as sr

def main():
    r = sr.Recognizer()
    
    with sr.Microphone() as source:
        r.adjust_for_ambient_noise(source)
        
        print("Please say something")
        
        audio = r.listen(source)
        
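        try:
            # recognize_google sends the audio to Google's web API (needs internet)
            my_text = r.recognize_google(audio)
            print("You have said : \n" + my_text)

            # compare the transcript only when recognition succeeded;
            # outside the try block my_text could be undefined
            if my_text == 'hello everybody':
                print('start something')

        except Exception as e:
            print('Error: ' + str(e))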
        
if __name__ == '__main__':
    main()

Please say something
You have said : 
Lisa ice cake
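
The exact comparison `my_text == 'hello everybody'` only fires when the recognizer returns precisely that string, and the sample output above shows how much a transcript can drift. A more forgiving matcher could look like the sketch below. It is not part of the original notebook: `COMMANDS`, `normalize` and `match_command` are hypothetical names, and the substring test is a deliberately simple assumption.

```python
import string

# Hypothetical command table: spoken phrase -> action name.
COMMANDS = {
    "hello everybody": "start",
    "stop listening": "stop",
}

def normalize(text):
    """Lowercase, strip punctuation and collapse repeated whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())

def match_command(transcript, commands=COMMANDS):
    """Return the action whose phrase occurs in the transcript, else None."""
    normalized = normalize(transcript)
    for phrase, action in commands.items():
        if phrase in normalized:
            return action
    return None

print(match_command("Hello, everybody!"))
print(match_command("Lisa ice cake"))
```

In the cell above, `match_command(my_text)` could replace the `if my_text == ...` test inside the `try` block.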


## 2. Speech to text offline (.wav file) (English)
Based on: https://github.com/akashadhikari/wave2vec-speech-to-text/blob/main/Untitled.ipynb

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp39-cp39-win_amd64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.8.1 tokenizers-0.12.1 transformers-4.21.1


In [1]:
import torch
import librosa
import numpy as np
import soundfile as sf
from scipy.io import wavfile
from IPython.display import Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer, Wav2Vec2Processor #use huggingface's transformers

In [2]:
tokenizer = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")  # processor = feature extractor + tokenizer (used here instead of Wav2Vec2Tokenizer)
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
from glob import glob
my_wavs = glob('*.wav')
print(my_wavs)

['11k16bitpcm.wav', 'claxon_1m.wav', 'claxon_close.wav', 'claxon_freesound.wav', 'claxon_iphone.wav', 'claxon_michiel.wav', 'converted.wav', 'example.wav', 'file.wav', 'loudness.wav', 'miaow_16k.wav', 'my-audio.wav', 'my_test.wav', 'my_wav.wav', 'new_file.wav', 'noise_add.wav', 'out.wav', 'output.wav', 'piano_c.wav', 'robot0.wav', 'robot1.wav', 'robot2.wav', 'robot3.wav', 'Sample_audio.wav', 'silence.wav', 'speech_whistling2.wav', 'test.wav', 'test2.wav', 'test3.wav', 'tone_220.wav', 'tone_440.wav', 'welcome.wav']


In [3]:
from IPython.display import Audio
file_name = 'my-audio.wav'
Audio(file_name)
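
The wav2vec2-base-960h checkpoint was trained on 16 kHz speech, so it is worth checking a file's sample rate before transcribing it. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files. `wav_info`, `write_test_tone` and the demo file name are hypothetical helpers; the synthetic tone only exists so the example runs without a real recording.

```python
import math
import os
import struct
import tempfile
import wave

def wav_info(path):
    """Return (sample_rate_hz, n_channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate

def write_test_tone(path, rate=44100, seconds=1.0, freq=440.0):
    """Write a mono 16-bit sine tone as a stand-in for a real recording."""
    n_frames = int(rate * seconds)
    samples = (int(12000 * math.sin(2 * math.pi * freq * i / rate))
               for i in range(n_frames))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))

demo_path = os.path.join(tempfile.gettempdir(), "demo_tone.wav")
write_test_tone(demo_path)
rate, channels, duration = wav_info(demo_path)
print(rate, channels, round(duration, 2))
if rate != 16000:
    print("Resample to 16000 Hz before passing the audio to the model.")
```

Calling `wav_info(file_name)` on the notebook's own file would give the same three values for the real recording.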

In [4]:
# Inspect the original file, then resample to the 16 kHz the model expects
framerate, sounddata = wavfile.read(file_name)
print(framerate, sounddata)

TARGET_SR = 16000
input_audio, _ = librosa.load(file_name, sr=TARGET_SR)  # librosa resamples on load

# pass sampling_rate explicitly so a rate mismatch raises an error instead of failing silently
input_values = tokenizer(input_audio, sampling_rate=TARGET_SR, return_tensors="pt").input_values
with torch.no_grad():  # inference only, no gradients needed
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)  # most likely token per audio frame
transcription = tokenizer.batch_decode(predicted_ids)[0]  # CTC decoding: collapse repeats, drop blanks
print(transcription)

44100 [ 0  0  0 ... 41 47 41]


DEEP LEARNING IS AMAZING
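
The step from `predicted_ids` to text is greedy CTC decoding: take the most likely token per frame, merge consecutive duplicates, then drop the blank symbol. The sketch below illustrates that rule in plain Python. The five-token vocabulary is a toy assumption (the real model ships its own), and `ctc_greedy_decode` is a hypothetical helper, not the library's implementation.

```python
# Toy vocabulary for illustration only.
BLANK = "<pad>"      # wav2vec2 uses the padding token as the CTC blank
WORD_SEP = "|"       # word delimiter in the wav2vec2 character vocabulary
VOCAB = [BLANK, WORD_SEP, "H", "I", "O"]

def ctc_greedy_decode(ids, vocab=VOCAB, blank=BLANK, word_sep=WORD_SEP):
    """Collapse repeated ids, drop blanks, turn word separators into spaces."""
    chars = []
    previous = None
    for token_id in ids:
        if token_id != previous:            # 1. merge consecutive duplicates
            token = vocab[token_id]
            if token != blank:              # 2. remove the blank symbol
                chars.append(" " if token == word_sep else token)
        previous = token_id
    return "".join(chars).strip()

frame_ids = [0, 2, 2, 0, 3, 3, 1, 1, 0, 2, 4, 4, 0]
print(ctc_greedy_decode(frame_ids))
```

The blank is what lets the model emit a genuine double letter: `[2, 0, 2]` decodes to two separate `H` characters, while `[2, 2]` collapses to one.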


## 3. Speech2Text with Whisper

Whisper requires the ffmpeg command-line tool to be installed and available on the PATH.

In [1]:
!pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /private/var/folders/cj/qtbz9fvd3svc0x28yv2756mh0000gn/T/pip-req-build-26y2ah0u
  Running command git clone -q https://github.com/openai/whisper.git /private/var/folders/cj/qtbz9fvd3svc0x28yv2756mh0000gn/T/pip-req-build-26y2ah0u
  Resolved https://github.com/openai/whisper.git to commit 02b74308fff49aa0d5dd603faefa76d2edd8d56b
Collecting torch
  Downloading torch-1.12.1-cp39-none-macosx_10_9_x86_64.whl (133.8 MB)
Collecting more-itertools
  Downloading more_itertools-8.14.0-py3-none-any.whl (52 kB)
Collecting transformers>=4.19.0
  Downloading transformers-4.22.2-py3-none-any.whl (4.9 MB)

In [1]:
import whisper

model = whisper.load_model("base")
#result = model.transcribe("audio.mp3")
#print(result["text"])

In [15]:
from glob import glob
my_wavs = glob('*.wav')
my_mp4s = glob('*.mp4')
print(my_wavs)
print(my_mp4s)

['my-audio.wav', 'new_file.wav', 'out.wav', 'file.wav', 'loudness.wav', 'test2.wav', 'claxon_iphone.wav', 'test3.wav', 'my_wav.wav', 'mywav_reduced_noise2.wav', 'piano_c.wav', 'claxon_michiel.wav', 'claxon_close.wav', 'recordedFile.wav', 'miaow_16k.wav', 'tone_220.wav', 'welcome.wav', 'silence.wav', 'example.wav', '11k16bitpcm.wav', 'newsegment.wav', 'claxon_freesound.wav', 'robot0.wav', 'my_test.wav', 'speech_whistling2.wav', 'robot1.wav', 'mywav_reduced_noise.wav', 'robot3.wav', 'test.wav', 'noise_add.wav', 'robot2.wav', 'Sample_audio.wav', 'tone_440.wav', 'claxon_1m.wav', 'output.wav']
['Do. Or do not. There is no try.-BQ4yd2W50No.mp4', 'output.mp4', 'Jay in Kyiv - Russians having a rough time holding Kherson.-1564190523849691139.mp4', "youtube-dl test video ''_ä↭𝕐-BaW_jenozKc.mp4", 'Visegrád 24 - Ukrainian man finds a drunk Russian soldier.  Via @WarNewsPL1-1541178599918673927.mp4']


In [16]:
import IPython
IPython.display.Audio('Do. Or do not. There is no try.-BQ4yd2W50No.mp4')

In [19]:
result = model.transcribe('Do. Or do not. There is no try.-BQ4yd2W50No.mp4')
print(result['text'])



 I always with you cannot be done. Hear you nothing that I say. You must unlearn what you have learned. All right, I'll give it a try. No. Try not. Do. Or do not. There is no try.
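
Besides `result['text']`, the dictionary returned by `model.transcribe` also contains a `segments` list whose entries carry `start`, `end` and `text`. The sketch below turns such a list into timestamped lines. To stay runnable without Whisper it uses a hand-written stand-in dictionary: the timings in `fake_result` are invented for illustration, and `format_timestamp` and `segments_to_lines` are hypothetical helpers.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS.mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def segments_to_lines(result):
    """Return one '[start -> end] text' line per segment."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return lines

# Stand-in for a transcribe() result; the timings are made up.
fake_result = {
    "text": " Do. Or do not. There is no try.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Do. Or do not."},
        {"start": 2.5, "end": 4.0, "text": " There is no try."},
    ],
}

for line in segments_to_lines(fake_result):
    print(line)
```

With the real model, `segments_to_lines(result)` would be called on the dictionary from the cell above.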


In [21]:
result2 = model.transcribe('Jay in Kyiv - Russians having a rough time holding Kherson.-1564190523849691139.mp4')
print(result2['text'])



 Тропы хуярых церстов, что только можно прорвали первую линию обороны. Хуярят станков авиации артиллерии. Двадцать девять авиации.


## 4. T2S with Google's gTTS (online)

- Readthedocs: https://gtts.readthedocs.io/en/latest/index.html

In [5]:
!pip install gTTS

Collecting gTTS
  Downloading gTTS-2.2.4-py3-none-any.whl (26 kB)
Installing collected packages: gTTS
Successfully installed gTTS-2.2.4


In [6]:
from gtts import lang
languages = lang.tts_langs()  # dict: language code -> language name
if 'nl' in languages:
    print('Dutch is available')

Dutch is available


In [3]:
print(languages)

{'af': 'Afrikaans', 'ar': 'Arabic', 'bg': 'Bulgarian', 'bn': 'Bengali', 'bs': 'Bosnian', 'ca': 'Catalan', 'cs': 'Czech', 'cy': 'Welsh', 'da': 'Danish', 'de': 'German', 'el': 'Greek', 'en': 'English', 'eo': 'Esperanto', 'es': 'Spanish', 'et': 'Estonian', 'fi': 'Finnish', 'fr': 'French', 'gu': 'Gujarati', 'hi': 'Hindi', 'hr': 'Croatian', 'hu': 'Hungarian', 'hy': 'Armenian', 'id': 'Indonesian', 'is': 'Icelandic', 'it': 'Italian', 'iw': 'Hebrew', 'ja': 'Japanese', 'jw': 'Javanese', 'km': 'Khmer', 'kn': 'Kannada', 'ko': 'Korean', 'la': 'Latin', 'lv': 'Latvian', 'mk': 'Macedonian', 'ms': 'Malay', 'ml': 'Malayalam', 'mr': 'Marathi', 'my': 'Myanmar (Burmese)', 'ne': 'Nepali', 'nl': 'Dutch', 'no': 'Norwegian', 'pl': 'Polish', 'pt': 'Portuguese', 'ro': 'Romanian', 'ru': 'Russian', 'si': 'Sinhala', 'sk': 'Slovak', 'sq': 'Albanian', 'sr': 'Serbian', 'su': 'Sundanese', 'sv': 'Swedish', 'sw': 'Swahili', 'ta': 'Tamil', 'te': 'Telugu', 'th': 'Thai', 'tl': 'Filipino', 'tr': 'Turkish', 'uk': 'Ukrainian'
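
Because `tts_langs()` returns a plain dictionary mapping codes to names, finding the code for a language is a small lookup. The sketch below searches by code or by part of the English name. `LANGUAGES` is a short excerpt copied from the output above, and `find_language_codes` is a hypothetical helper, not a gTTS function.

```python
# Excerpt of the mapping printed above; the full dict comes from lang.tts_langs().
LANGUAGES = {
    "de": "German",
    "en": "English",
    "fr": "French",
    "nl": "Dutch",
    "my": "Myanmar (Burmese)",
}

def find_language_codes(query, languages=LANGUAGES):
    """Return sorted codes whose code equals, or whose name contains, the query."""
    q = query.strip().lower()
    return sorted(
        code for code, name in languages.items()
        if q == code.lower() or q in name.lower()
    )

print(find_language_codes("dutch"))
print(find_language_codes("bur"))
```

Passing the real dictionary as the second argument (`find_language_codes("dutch", languages)`) would search all supported languages.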

In [13]:
# Import the required module for text 
# to speech conversion
from gtts import gTTS
  
# This module is imported so that we can 
# play the converted audio
import os
  
# The text that you want to convert to audio
mytext = 'hallo allemaal lisa is gek op papa'
  
# Language in which you want to convert
language = 'nl'
  
# Passing the text and language to the engine, 
# here we have marked slow=False, which tells 
# the module that the converted audio should 
# be read at normal speed
myobj = gTTS(text=mytext, lang=language, slow=False)
  
# Saving the converted audio in an mp3 file named
# welcome (gTTS writes MP3 data, so use the .mp3 extension)
myobj.save("welcome.mp3")

In [14]:
from IPython import display
display.Audio('welcome.mp3')

In [15]:
#!/usr/bin/env python
#-*- coding: utf-8 -*-
#source: https://gist.github.com/tjoen/bd37bdb8795363e9940f959b2394c5e2

from gtts import gTTS
import os
import speech_recognition as sr
import tempfile
import time
import pyaudio
import wave

txt = "test"

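def speech( txt ):
    # synthesize txt as Dutch speech, play it, then clean up
    tts = gTTS(text=txt, lang="nl")
    testfile = "temp.mp3"
    tts.save(testfile)
    os.system("mpg123 %s" % testfile)   # requires the external mpg123 player
    os.remove(testfile)                 # portable replacement for "rm"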

def record():
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 22050
    CHUNK = 1024
    RECORD_SECONDS = 5
    WAVE_OUTPUT_FILENAME = "file.wav"

    audio = pyaudio.PyAudio()

    # start Recording
    stream = audio.open(format=FORMAT, channels=CHANNELS,
                    rate=RATE, input=True,
                    frames_per_buffer=CHUNK)
    print ("recording...")
    frames = []

    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)
    print ("finished recording")
   # stop Recording
    stream.stop_stream()
    stream.close()
    audio.terminate()

    waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    waveFile.setnchannels(CHANNELS)
    waveFile.setsampwidth(audio.get_sample_size(FORMAT))
    waveFile.setframerate(RATE)
    waveFile.writeframes(b''.join(frames))
    waveFile.close()

    r = sr.Recognizer()
    with sr.AudioFile(WAVE_OUTPUT_FILENAME) as source:   # sr.WavFile is the old name of sr.AudioFile
        audio = r.record(source)                         # extract audio data from the file

    # recognize speech using Google Speech Recognition
    try:
        # for testing purposes, we're just using the default API key
        # to use another API key, use `r.recognize_google(audio, key="GOOGLE_SPEECH_RECOGNITION_API_KEY")`
        # instead of `r.recognize_google(audio)`
        global txt
        txt = r.recognize_google(audio, key=None, language="nl-NL")  # BCP-47 tag for Dutch (Netherlands)
        print("Google Speech Recognition thinks you said " +txt )
        return txt
    except sr.UnknownValueError:
        print("Google Speech Recognition could not understand audio")
        return("Ik begrijp niet wat je zegt")
    except sr.RequestError as e:
        print("Could not request results from Google Speech Recognition service; {0}".format(e))
        return "Fout in de spraakherkenning service"

spreq = record()
spc = speech(spreq)   # speak back what record() returned (transcript or error message)


spreq = record()
spc = speech(spreq)

recording...
finished recording
Google Speech Recognition thinks you said harder
recording...
finished recording


KeyboardInterrupt: 
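
The recording loop in `record()` reads `int(RATE / CHUNK * RECORD_SECONDS)` buffers of `CHUNK` frames each. Because `int()` truncates, the recording ends up slightly shorter than the requested five seconds. The sketch below only reproduces that arithmetic, with no audio hardware involved; `recording_plan` is a hypothetical helper name.

```python
RATE = 22050          # frames per second, as in record()
CHUNK = 1024          # frames per buffer
RECORD_SECONDS = 5

def recording_plan(rate, chunk, seconds):
    """Return (number of buffers, total frames, actual duration in seconds)."""
    n_chunks = int(rate / chunk * seconds)   # same expression as the loop bound
    n_frames = n_chunks * chunk
    return n_chunks, n_frames, n_frames / rate

n_chunks, n_frames, actual_seconds = recording_plan(RATE, CHUNK, RECORD_SECONDS)
bytes_16bit_mono = n_frames * 2              # paInt16 = 2 bytes per frame, 1 channel

print(n_chunks, n_frames, round(actual_seconds, 3), bytes_16bit_mono)
```

If the full duration matters, rounding the loop bound up with `math.ceil` instead of `int` would record one extra buffer.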

## 5. T2S with pyttsx3 (offline)

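pyttsx3 is a text-to-speech conversion library for Python. Unlike online alternatives such as gTTS, it works offline by driving a speech engine installed on the operating system. Its PyPI description states that it is compatible with both Python 2 and 3.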

- Pypi: https://pypi.org/project/pyttsx3/
- Readthedocs: https://pyttsx3.readthedocs.io/en/latest/
- Github: https://github.com/nateshmbhat/pyttsx3

In [17]:
!pip install pyttsx3



In [8]:
#basic script
import pyttsx3
engine = pyttsx3.init()
engine.say("I like python")
engine.runAndWait()

In [9]:
#changing the voice
voices = engine.getProperty('voices')       #list of the voices installed on this system
engine.setProperty('voice', voices[1].id)   #index selects the voice; which index is male or female depends on the platform, and voices[1] fails if only one voice is installed
engine.say("I like python")
engine.runAndWait()
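
Picking a voice by index is fragile, because the order and number of installed voices differ per machine. Selecting by name is one alternative. The sketch below assumes only that each voice object exposes `id` and `name` attributes, as pyttsx3 voices do. `pick_voice` is a hypothetical helper, and the `demo_voices` entries are stand-ins so the example runs without a speech engine.

```python
from types import SimpleNamespace

def pick_voice(voices, keyword):
    """Return the id of the first voice whose name contains keyword, else None."""
    needle = keyword.lower()
    for voice in voices:
        if needle in (voice.name or "").lower():
            return voice.id
    return None

# Stand-in voice objects; real ones come from engine.getProperty('voices').
demo_voices = [
    SimpleNamespace(id="voice.one", name="Example Male Voice"),
    SimpleNamespace(id="voice.two", name="Example Female Voice"),
]

print(pick_voice(demo_voices, "female"))
print(pick_voice(demo_voices, "robot"))
```

With a real engine, the returned id would be passed to `engine.setProperty('voice', voice_id)` only when it is not `None`.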

## 6. To do: Google's XTREME-S
https://huggingface.co/datasets/google/xtreme_s