<a href="https://colab.research.google.com/github/sujhaan/Voice_assistance/blob/main/Voice_assistance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Model Selection and Libraries**

For this solution, we implemented a voice-to-text and text-to-speech pipeline using the following models and libraries:

**Whisper (OpenAI):**

Used for voice-to-text transcription.


**Parameters**

**Sampling Rate**: 16000 Hz (for audio processing).

**Language Model**: facebook/opt-125m for generating relevant text.

**Text Generation Limit**: Restricted to a maximum of 2 sentences to keep the output concise.

Whisper is an automatic speech recognition (ASR) system from OpenAI, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. According to OpenAI, such a large and diverse dataset improves robustness to accents, background noise, and technical language. It also enables transcription in multiple languages, as well as translation from those languages into English. OpenAI has open-sourced the models and inference code as a foundation for building applications and for further research on robust speech processing.


A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
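
To make the idea of task specifiers concrete, the sketch below assembles the kind of special-token prefix the decoder is conditioned on. The token strings follow the multitask format described for Whisper; the helper name `build_task_prefix` is hypothetical and is not part of the Whisper library.

```python
from typing import List

# Illustrative only: the helper name is hypothetical, not a Whisper API.
def build_task_prefix(language: str, task: str, timestamps: bool = False) -> List[str]:
    """Return the special tokens that tell the decoder which task to perform."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the decoder to skip timestamp tokens
        prefix.append("<|notimestamps|>")
    return prefix

print(build_task_prefix("en", "transcribe"))
```

Changing only the task token (for example from `transcribe` to `translate`) is what lets a single model cover several tasks.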

**Step 1: Install Required Libraries**

In [None]:
!pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-obf1qcnl
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-obf1qcnl
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20231117)
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->openai-whisper==20231117)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->openai-whisper==20231117)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-n

In [None]:
!pip install gtts

Collecting gtts
  Downloading gTTS-2.5.3-py3-none-any.whl.metadata (4.1 kB)
Downloading gTTS-2.5.3-py3-none-any.whl (29 kB)
Installing collected packages: gtts
Successfully installed gtts-2.5.3


**Step 1.1: Voice-to-Text Conversion**

**Model:** Whisper "base.en", a pre-trained English-only speech-to-text model.

**Library:** Whisper, a Python library for speech-to-text tasks.

**Parameters:** audio_file (input audio file). Whisper resamples the input to 16000 Hz internally, so no sampling rate is passed.

In [None]:
import whisper

# Load the pre-trained English model
model = whisper.load_model("base.en")

# Set the audio file path (raw string, so the Windows backslashes are not treated as escapes)
audio_file = r"C:\Users\Admin\Downloads\Design an End-to-End AI Voice Assistance Pipeline\assign_Audio.aac"

# Perform voice-to-text conversion; transcribe() does not take a sampling_rate argument
try:
    result = model.transcribe(audio_file)
    transcribed_text = result["text"]
    print(transcribed_text)
except Exception as e:
    print(f"Error: {str(e)}")

100%|████████████████████████████████████████| 139M/139M [00:00<00:00, 161MiB/s]


Error: Failed to load audio: ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-lib
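
The "Failed to load audio" error above is raised when Whisper cannot read the input through ffmpeg. The log is truncated, so the exact cause is not visible, but one plausible cause is that a Windows path does not exist on a Linux runtime such as Colab. A minimal pre-flight check, using only the standard library, might look like the sketch below. The function name `preflight` and the relative file path are assumptions for illustration.

```python
import shutil
from pathlib import Path

def preflight(audio_path):
    """Return a list of problems; an empty list means the input looks usable."""
    problems = []
    # Whisper shells out to ffmpeg to decode audio
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg executable not found on PATH")
    if not Path(audio_path).is_file():
        problems.append(f"audio file not found: {audio_path}")
    return problems

# Hypothetical path, relative to the notebook's working directory
for issue in preflight("assign_Audio.aac"):
    print("Problem:", issue)
```

Running a check like this before `model.transcribe` gives a short, readable message instead of the full ffmpeg banner.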

**Text generation** in deep learning refers to the process of automatically generating natural language text using artificial neural networks. This is typically achieved through the use of language models, which are trained on large amounts of text data to predict the next word in a sequence based on the previous words. The generated text can be used for various applications, such as content creation, language translation, and chatbots.
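
As a toy illustration of "predict the next word based on the previous words", the sketch below counts which word follows which in a tiny corpus and then generates text by always picking the most frequent follower. This is a bigram counter, not a neural network, and the corpus and function names are made up for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def generate(follows, start, max_words=5):
    """Greedy generation: repeatedly append the most frequent follower."""
    out = [start]
    for _ in range(max_words - 1):
        candidates = follows.get(out[-1])
        if not candidates:
            break  # no known continuation for the last word
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

corpus = "the cat sat on the mat and the cat sat on the rug"
bigram_model = train_bigrams(corpus)
print(generate(bigram_model, "the", max_words=4))  # the cat sat on
```

A neural language model replaces the count table with learned parameters and conditions on much more than one previous word, but the generation loop has the same shape.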

Transformers are widely used for text generation tasks due to several key advantages:

- Attention mechanism
- Parallel computation
- Transfer learning
- High accuracy
- Contextualization
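
The attention mechanism in the list above can be sketched in a few lines. The example computes scaled dot-product attention for a single query vector in plain Python: similarity scores between the query and each key are turned into weights with a softmax, and the output is the weighted sum of the value vectors. The vectors are toy values chosen for illustration.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

weights, output = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(weights, output)
```

The key that matches the query receives the larger weight, so its value dominates the output. Because every query can be scored against all keys independently, the computation parallelizes well.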

**Step 2: Language Model Generation**

**Model:** Facebook OPT-125M, a pre-trained language model.

**Library:** Transformers, a Python library for natural language processing tasks.

**Parameters:** input_text (transcribed text), max_length (50).

In [None]:
from transformers import pipeline

# Load the pre-trained language model
try:
    llm = pipeline('text-generation', model='facebook/opt-125m')
except Exception as e:
    print(f"Error: {str(e)}")
    raise  # stop here; the following steps need a loaded model

# Set the input text
transcribed_text = "This is an example input text."  # Define the input text here

# Generate a response using the language model
try:
    response = llm(transcribed_text, max_length=50)
    output_text = response[0]['generated_text']
    print(output_text)
except Exception as e:
    print(f"Error: {str(e)}")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


This is an example input text.

The output text is a text file.

The output text is a text file.

The output text is a text file.

The output text is a text file.

The
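
The output above repeats the same sentence several times. One simple way to tidy such text after generation is to drop repeated lines, as in the sketch below. The function name `drop_repeats` is hypothetical, and this is post-processing only: it does not change how the model generates text.

```python
def drop_repeats(text):
    """Keep each non-empty line only the first time it appears, preserving order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)

sample = (
    "This is an example input text.\n\n"
    "The output text is a text file.\n\n"
    "The output text is a text file.\n\n"
    "The"
)
print(drop_repeats(sample))
```

The trailing fragment "The" survives this step because it is a distinct line. Limiting the output to whole sentences is a separate concern.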


**gTTS:** Google Text-to-Speech (TTS) is a feature that converts written text into natural-sounding audio. It uses machine learning to produce synthesized speech that is easy to understand, and it can read text aloud in over 100 languages and variants. Applications include assistive technology for individuals with visual impairments, language learning, and smart home devices. Google TTS is available through various Google services, including Google Play, Google Assistant, and the Google Cloud Text-to-Speech API. The gTTS Python library used here is an interface to Google Translate's text-to-speech API.

**Step 3: Text-to-Speech Conversion**

**Model:** Google Text-to-Speech (gTTS) model.

**Library**: gTTS, a Python library for text-to-speech tasks.

**Parameters:** text (output text), lang (English).

In [None]:
from gtts import gTTS
import os

# Set the text to be converted to speech
output_text = "Hello, this is a test."

# Set the audio file path
audio_file = "output_audio.mp3"

# Perform text-to-speech conversion
tts = gTTS(text=output_text, lang='en')
tts.save(audio_file)

# Play the audio file (requires the mpg321 command-line player to be installed)
os.system("mpg321 " + audio_file)

32512
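
The value 32512 printed above is the return value of `os.system`, which on Unix is a wait status rather than a plain exit code. The sketch below decodes it, assuming a POSIX system, and checks whether the player is installed before calling it. The helper name `exit_code_from_status` is hypothetical.

```python
import shutil

def exit_code_from_status(status):
    """Extract the exit code from a POSIX wait status as returned by os.system."""
    return (status >> 8) & 0xFF

status = 32512  # value printed by the cell above
code = exit_code_from_status(status)
print(code)  # 127, which shells conventionally use for "command not found"

# Check for the player before calling it ("mpg321" is the command used above)
if shutil.which("mpg321") is None:
    print("mpg321 is not installed; install it or play the file another way")
```

An exit code of 127 suggests that the MP3 file was saved but never played, most likely because `mpg321` is not available in the runtime.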

**NLTK:** The Natural Language Toolkit (NLTK) library is used for sentence tokenization, which requires downloading its 'punkt' package first.

**Sentence Tokenization:** The nltk.sent_tokenize function splits the output text into sentences, and the first two sentences are kept as the final output.

NLTK, or the Natural Language Toolkit, is a powerful Python library used for working with human language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

In [None]:
import nltk

# Download the 'punkt' package
nltk.download('punkt')

# Restrict the output to 2 sentences
sentences = nltk.sent_tokenize(output_text)
output_text = " ".join(sentences[:2])
print(output_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


This is an example input text. The output text is a text file.
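
For intuition about what the sentence limit does, here is a simplified stand-in that uses a regular expression instead of NLTK. It splits after `.`, `!` or `?` followed by whitespace, so unlike `nltk.sent_tokenize` it will mis-split abbreviations such as "Dr.". The function name `first_n_sentences` is an assumption for this example.

```python
import re

def first_n_sentences(text, n=2):
    """Split after '.', '!' or '?' followed by whitespace and keep the first n pieces."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

generated = (
    "This is an example input text.\n\n"
    "The output text is a text file.\n\n"
    "The output text is a text file.\n\n"
    "The"
)
print(first_n_sentences(generated, 2))
```

On the generated text from Step 2, this keeps the same two sentences as the NLTK cell above.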


        Input Audio
             |
             v
        Whisper Model -> (Speech-to-Text)
             |
             v
        Language Model -> (Text Generation)
             |
             v
        gTTS Model -> (Text-to-Speech)
             |
             v
        Play Audio File -> (using mpg321)
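
The flow in the diagram can be sketched as one function that chains the stages. In this sketch each stage is passed in as a plain function, so the wiring can be run with stubs and without models or network access. All names here (`run_pipeline`, the stage parameters, `input.wav`) are hypothetical. In the notebook the stages would wrap `model.transcribe`, the `llm` pipeline, the sentence limit, and `gTTS`.

```python
def run_pipeline(audio_path, transcribe, generate, shorten, speak):
    """Chain the stages; each stage is supplied by the caller as a function."""
    text = transcribe(audio_path)  # speech-to-text
    reply = generate(text)         # text generation
    reply = shorten(reply)         # e.g. keep the first two sentences
    return speak(reply)            # text-to-speech; returns whatever speak returns

# Stub stages so the sketch runs on its own
result = run_pipeline(
    "input.wav",
    transcribe=lambda path: "hello",
    generate=lambda text: text + " world. Second. Third.",
    shorten=lambda text: ". ".join(text.split(". ")[:2]) + ".",
    speak=lambda text: f"spoken:{text}",
)
print(result)  # spoken:hello world. Second.
```

Keeping the stages separate makes it easy to swap any one of them, for example replacing the player, without touching the rest.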