## **Advanced Topics in Machine Learning**
#### Assignment 2
- Joona Kareinen, Premek Janda, Filippo Torrisi, Simone Mugnai


For our second assignment we decided to use SQuAD2.0, the Stanford Question Answering Dataset. As requested, we divide the task into four parts, from investigating the dataset to training and testing the models:

## Investigate Dataset
In this phase, we check what type of data we are working with and analyze it. Drawing inspiration from earlier tutorial notebooks, we engage with the dataset through practical exercises. This includes training a Word2Vec embedding to check intrinsic properties and optimizing document indexing for efficient keyword searches.

## Train the Models
In the training part we use pretrained models: given a question, each model receives a context passage that should contain the answer. We also explore the potential of pre-trained models available on the Hugging Face website.

## Add Voice Interactivity
We introduce voice interactivity to the most adept chatbot/question-answering system identified earlier. This entails the integration of text-to-speech and speech-to-text models. We also analyze multiple different speech models and compare their performance.

## Potential Extensions
We implemented a user-friendly interface for the chatbot.

#### Fetch the required files from our GitHub repository

In [None]:
!git clone -b working-app https://$GITHUB_AUTH@github.com/PremekJanda/ATML-project2.git

In [None]:
# Move to the folder
%cd ATML-project2/

#### Data loading

In this section we pre-process the training set provided by Stanford University. \\
To prepare the data for the subsequent tasks, we split it into contexts, questions and answers.

In [1]:
import json
import numpy as np

# Open the data file
file_path = './data/train-v2.0.json'
with open(file_path, 'rb') as f:
    # Load the data
    data_dict = json.load(f)


unique_contexts = []
contexts = []
pairs = []
for category in data_dict["data"]:
    for passage in category["paragraphs"]:
        context = passage["context"]
        unique_contexts.append(context)
        for qa in passage["qas"]:
            question = qa["question"]
            for answer in qa["answers"]:
                pairs.append([question, answer])
                contexts.append(context)


# Print some statistics
num_contexts = len(unique_contexts)
print(f"In the dataset there are {num_contexts} different contexts with a total of {len(pairs)} question/answer pairs.")

# Check that the data was loaded correctly
print(np.array(pairs[10:15]))

In the dataset there are 19035 different contexts with a total of 86821 question/answer pairs.
[['What was the first album Beyoncé released as a solo artist?'
  {'text': 'Dangerously in Love', 'answer_start': 505}]
 ['When did Beyoncé release Dangerously in Love?'
  {'text': '2003', 'answer_start': 526}]
 ['How many Grammy awards did Beyoncé win for her first solo album?'
  {'text': 'five', 'answer_start': 590}]
 ["What was Beyoncé's role in Destiny's Child?"
  {'text': 'lead singer', 'answer_start': 290}]
 ["What was the name of Beyoncé's first solo album?"
  {'text': 'Dangerously in Love', 'answer_start': 505}]]
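
SQuAD 2.0 also contains unanswerable questions, whose `answers` list is empty, so the loading loop above silently skips them. A minimal sketch of counting both kinds follows. It uses a tiny hand-made dictionary, `toy_data`, which is a hypothetical stand-in that only mimics the layout of `train-v2.0.json`; `count_questions` is likewise an illustrative helper and not part of the notebook.

```python
# Hypothetical miniature of the SQuAD 2.0 layout (not real dataset content).
toy_data = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The sky is blue.",
                    "qas": [
                        {"question": "What color is the sky?",
                         "is_impossible": False,
                         "answers": [{"text": "blue", "answer_start": 11}]},
                        {"question": "What color is the sea?",
                         "is_impossible": True,
                         "answers": []},
                    ],
                }
            ],
        }
    ]
}


def count_questions(data_dict):
    """Return (answerable, unanswerable) question counts."""
    answerable = unanswerable = 0
    for article in data_dict["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # An empty answer list marks an unanswerable question
                if qa["answers"]:
                    answerable += 1
                else:
                    unanswerable += 1
    return answerable, unanswerable


print(count_questions(toy_data))
```

Running the same function on `data_dict` would show how many questions are left out of `pairs`.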


## Text Preprocessing Pipeline:
We normalize and tokenize the questions and answers to build the list of token sequences used to train Word2Vec.

In [None]:
import re
import unicodedata

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

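# Lowercase, strip accents, remove e-mail addresses and non-word characters, then tokenize
def normalizeString(s, is_answer):
    # `is_answer` is currently unused; it is kept so existing calls keep working
    # Lowercase and strip accents
    s = unicodeToAscii(s.lower())
    # Remove e-mail addresses (raw strings avoid invalid-escape warnings)
    s = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '', s)
    # Replace non-word characters with spaces and split into tokens
    s = re.sub(r'\W', ' ', s).split()

    return s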

tokenized_sentences = []
for idx, pair in enumerate(pairs):
    s1 = normalizeString(pair[0], 0)
    s2 = normalizeString(pair[1]["text"], 1)
    if len(s1) > 1 and len(s2) > 1:
        tokenized_sentences.append(s1)
        tokenized_sentences.append(s2)

for sentence in tokenized_sentences[:10]:
    print(sentence)
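
To make the effect of the normalization concrete, here is a self-contained sketch that repeats the same steps on a single question. The names `strip_accents` and `normalize_demo` are hypothetical and only used for this illustration.

```python
import re
import unicodedata


def strip_accents(s):
    # Decompose characters (NFD) and drop the combining marks (category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


def normalize_demo(s):
    s = strip_accents(s.lower())
    # Remove e-mail-like tokens, then replace every non-word character by a space
    s = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '', s)
    return re.sub(r'\W', ' ', s).split()


tokens = normalize_demo("What was Beyoncé's role in Destiny's Child?")
print(tokens)
```

Note that the apostrophe is replaced by a space, so a possessive such as "Beyoncé's" ends up as the two tokens `beyonce` and `s`.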


## Exploratory Analysis:
We perform a simple exploratory analysis of the data: we count the occurrences of each unique word and plot the distribution of sentence lengths together with the distribution of word counts.

In [None]:
import matplotlib.pyplot as plt

# Analyzing tokenized sentences to track sentence lengths and word occurrences.
word2count = {}
sen_len = []
for sentence in tokenized_sentences:
    sen_len.append(len(sentence))
    for word in sentence:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1

# Sort the word counts in descending order
sorted_word2count = {k: v for k, v in sorted(word2count.items(), key=lambda item: item[1], reverse=True)}

# Plot histograms of the sentence lengths and of the word counts
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.hist(sen_len)
plt.title("Histogram of the sentence lengths")
plt.xlabel("Sentence length")
plt.ylabel("Number of sentences")

plt.subplot(1, 2, 2)
plt.hist(list(sorted_word2count.values()), bins=50, log=True)
plt.title("Histogram of the word counts")
plt.xlabel("Word count")
plt.ylabel("Number of words")

plt.show()
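
The counting loop can also be written with `collections.Counter` from the standard library. The sketch below uses two made-up sentences (`toy_sentences`) rather than the real `tokenized_sentences`, so it runs on its own.

```python
from collections import Counter

# Hypothetical toy input standing in for `tokenized_sentences`
toy_sentences = [["what", "is", "the", "capital"],
                 ["the", "capital", "is", "paris"]]

# Count every word across all sentences in one pass
word_counts = Counter(word for sentence in toy_sentences for word in sentence)

# Sentence lengths and their mean
sentence_lengths = [len(sentence) for sentence in toy_sentences]
mean_length = sum(sentence_lengths) / len(sentence_lengths)

print(word_counts["the"], word_counts["paris"], mean_length)
```

`Counter.most_common()` returns the words already sorted by frequency, which would replace the manual sorting step.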

## Word Embedding:

We train the embeddings with word2vec, a popular method developed at Google that represents words as numerical vectors capturing semantic relationships based on their context in a given corpus.
We use a vector size of 30, a minimum word count of 5 and a context window of 10.


In [None]:
from gensim.models.word2vec import Word2Vec

# Train the embeddings on the tokenized sentences
model = Word2Vec(tokenized_sentences, vector_size=30, min_count=5, window=10)

### Inspect the word2vec embeddings

In [None]:
# Size of the vocabulary
print(f"Length of the vocabulary: {len(model.wv)}")

term = 'beyonce'
model.wv[term]

Test some common words from our dataset and check which words are considered close to them.

In [None]:
term = 'beyonce'
print(f'Most similar embeddings to "{term}":')
print(np.array(model.wv.most_similar(term)))

In [None]:
term = 'bitumen'
print(f'Most similar embeddings to "{term}":')
print(np.array(model.wv.most_similar(term)))

In [None]:
term = 'coffee'
print(f'Most similar embeddings to "{term}":')
print(np.array(model.wv.most_similar(term)))
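
`most_similar` ranks words by the cosine similarity between their vectors. The following sketch reproduces that ranking by hand on three made-up 3-dimensional vectors; `toy_vectors` and both helper functions are hypothetical and unrelated to the trained model.

```python
import numpy as np


def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def toy_most_similar(query, vectors, topn=2):
    # Score every other word against the query and sort in descending order
    scores = {w: cosine_similarity(vectors[query], v)
              for w, v in vectors.items() if w != query}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]


# Made-up vectors: "coffee" and "tea" point in similar directions
toy_vectors = {
    "coffee": [1.0, 0.9, 0.0],
    "tea": [0.9, 1.0, 0.1],
    "bitumen": [0.0, 0.1, 1.0],
}

ranking = toy_most_similar("coffee", toy_vectors)
print(ranking)
```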

### Plot some embeddings
We can visualize the embeddings in 3D space using t-SNE.

We fit t-SNE on 500 randomly sampled terms and plot a subset of 100 for readability.

In [None]:
!pip install plotly

In [None]:
from sklearn.manifold import TSNE
import random

# Take 500 random samples from our list of embeddings
sample = random.sample(list(model.wv.key_to_index), 500)
word_vectors = model.wv[sample]

# Project the word vectors to 3D with t-SNE
# (note: recent scikit-learn releases rename `n_iter` to `max_iter`)
tsne = TSNE(n_components=3, n_iter=2000)
tsne_embedding = tsne.fit_transform(word_vectors)

x, y, z = np.transpose(tsne_embedding)

In [None]:
import plotly.express as px

fig = px.scatter_3d(x=x[:100],y=y[:100],z=z[:100],text=sample[:100])
fig.update_traces(marker=dict(size=3,line=dict(width=2)),textfont_size=10)
fig.show()

# 2. Train and evaluate models:

In [None]:
!pip install sentence_transformers

`semb_model` loads a pre-trained sentence-embedding model (a bi-encoder), while `xenc_model` loads a pre-trained cross-encoder, which scores a pair of sentences for semantic relatedness, a crucial task for effective question answering.


In [None]:
# sentence_transformers provides pre-trained embedding and cross-encoder models
from sentence_transformers import SentenceTransformer, CrossEncoder

semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

The next cell avoids recomputing the context embeddings by checking whether precomputed embeddings are available in a cache file.
If they are, it loads them; otherwise, it computes them and saves them to the cache for subsequent use.

In [None]:
import os
import pickle

# Define the embeddings cache path
embeddings_cache_path = './qa_embeddings_cache.pkl'

# Load cache if available
if os.path.exists(embeddings_cache_path):
    print('Loading embeddings cache')
    with open(embeddings_cache_path, 'rb') as f:
        corpus_embeddings = pickle.load(f)
# Else compute embeddings
else:
    print('Computing embeddings')
    corpus_embeddings = semb_model.encode(unique_contexts, convert_to_tensor=True, show_progress_bar=True)
    # Save the embeddings to a file for future loading
    print(f"Saving embeddings to: '{embeddings_cache_path}'")
    with open(embeddings_cache_path, 'wb') as f:
        pickle.dump(corpus_embeddings, f)
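
The same load-or-compute pattern can be wrapped in a small reusable helper. This is only a sketch: `load_or_compute` and `expensive` are hypothetical names, and a short list stands in for the real embeddings.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute_fn):
    """Load a pickled object from cache_path, or compute it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive():
    # Stand-in for the embedding computation; records how often it runs
    calls.append(1)
    return [0.1, 0.2, 0.3]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, 'cache.pkl')
    first = load_or_compute(cache_file, expensive)   # computes and saves
    second = load_or_compute(cache_file, expensive)  # loads from the cache

print(first == second, len(calls))
```

One caveat of this design is that the cache is never invalidated: if the contexts or the embedding model change, the cache file has to be deleted by hand.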

In [None]:
# fast approximate nearest neighbors search library
!pip install hnswlib

We then employ an efficient index based on hnswlib and cosine similarities for nearest neighbor search, with a mechanism to save and load the index.

In [None]:
import os
import hnswlib
import time
start = time.time()
# Create empty index
index = hnswlib.Index(space='cosine', dim=corpus_embeddings.size(1))

# Define hnswlib index path
index_path = './qa_hnswlib_100.index'

# Load index if available
if os.path.exists(index_path):
    print('Loading index...')
    index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print('Start creating HNSWLIB index')
    index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=100, M=64) # see https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md for parameter description
    # Compute the HNSWLIB index (it may take a while)
    index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print(f'Saving index to: {index_path}')
    index.save_index(index_path)

end = time.time()
elapsed = int(end - start)
print(f"Execution time: {elapsed // 60}:{elapsed % 60:02d} min:sec")
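
Because HNSW is an approximate search, it can be useful to compare its results against an exact search on a small sample. Below is a minimal brute-force baseline in NumPy using the same cosine distance (1 minus cosine similarity). `exact_cosine_knn` is a hypothetical helper and `toy_corpus` is random data, not the real embeddings.

```python
import numpy as np


def exact_cosine_knn(query, corpus, k=3):
    # Normalize rows so that the dot product equals the cosine similarity
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    # Cosine distance = 1 - cosine similarity
    distances = 1.0 - corpus_n @ query_n
    order = np.argsort(distances)[:k]
    return order, distances[order]


rng = np.random.default_rng(0)
toy_corpus = rng.normal(size=(100, 8))

# Querying with a corpus vector should return that vector first
ids, dists = exact_cosine_knn(toy_corpus[42], toy_corpus, k=3)
print(ids, dists)
```

Comparing these ids with the labels returned by `index.knn_query` for the same query gives a rough recall estimate for the chosen `M` and `ef_construction` values.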

In [None]:
!pip install accelerate

We set up the environment for a pre-trained T5 model: we select the device (GPU or CPU), load the T5 tokenizer, and load the T5 model with device mapping and half precision (`torch.float16`).


In [None]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)

We define a question-answering pipeline: \\
It takes a user question, ensures that it ends with "?", retrieves the most similar documents from the index built above, re-ranks them with the cross-encoder model, and generates the answer from the best-matching passage.

In [None]:
def qa_pipeline(
    question, print_flag,
    similarity_model=semb_model,
    embeddings_index=index,
    re_ranking_model=xenc_model,
    generative_model=model,
    device=device
):
    if not question.endswith('?'):
        question = question + '?'
    # Embed question
    question_embedding = similarity_model.encode(question, convert_to_tensor=True)
    # Search documents similar to question in index
    corpus_ids, distances = embeddings_index.knn_query(question_embedding.cpu(), k=64)
    # Re-rank results
    xenc_model_inputs = [(question, unique_contexts[idx]) for idx in corpus_ids[0]]
    cross_scores = re_ranking_model.predict(xenc_model_inputs)
    # Get best matching passage
    passage_idx = np.argsort(-cross_scores)[0]
    passage = unique_contexts[corpus_ids[0][passage_idx]]
    # Encode input
    input_text = f"Given the following passage, answer the related question.\n\nPassage:\n\n{passage}\n\nQ: {question}"
    if print_flag:
        print('INPUT TEXT:', input_text, "\n")
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
    # Generate output
    output_ids = generative_model.generate(input_ids, max_new_tokens=512)
    # Decode output
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Return result
    return output_text

In [None]:
question = input("Ask a question >>> ")
print()

print(qa_pipeline(question, True))

Ask a question >>> hello

INPUT TEXT: Given the following passage, answer the related question.

Passage:

Kanye Omari West (/ˈkɑːnjeɪ/; born June 8, 1977) is an American hip hop recording artist, record producer, rapper, fashion designer, and entrepreneur. He is among the most acclaimed musicians of the 21st century, attracting both praise and controversy for his work and his outspoken public persona.

Q: hello? 

Kanye Omari West
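
The re-ranking step in `qa_pipeline` amounts to sorting the retrieved candidates by cross-encoder score and keeping the best one. Here is a toy sketch with hand-written numbers; `toy_passages`, `toy_corpus_ids` and `toy_scores` are made up and no model is called.

```python
import numpy as np

# Made-up candidates: three retrieved passages with their corpus ids
toy_passages = ["passage about music", "passage about geology", "passage about coffee"]
toy_corpus_ids = np.array([[7, 2, 5]])      # shape (1, k), like the labels from knn_query
toy_scores = np.array([-3.2, 0.4, 5.1])     # one cross-encoder score per candidate

# Negating the scores makes argsort return the highest score first
best_rank = int(np.argsort(-toy_scores)[0])
best_corpus_id = int(toy_corpus_ids[0][best_rank])

print(best_rank, best_corpus_id, toy_passages[best_rank])
```

The best-scoring passage is always returned, whatever its absolute score. This is why an input such as "hello", which is not a question, still produces an answer in the run above.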


# 3. Add voice interactivity:

# Text-to-Speech using NVIDIA Tacotron2 and WaveGlow
The code loads the Tacotron2 and WaveGlow models from NVIDIA's DeepLearningExamples repository via Torch Hub. The models are moved to the GPU (the code below assumes that CUDA is available) and set to evaluation mode.
Utility functions for text-to-speech are loaded, including the text2speech function, which takes text input, converts it to a numeric sequence, generates a mel spectrogram using Tacotron2, and synthesizes audio using WaveGlow.

Text-to-Speech Function

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install numpy scipy  librosa unidecode inflect  openai-whisper

#### Load the tacotron and waveglow models from Torchhub

- Tacotron2 model produces mel spectrograms from text
- Waveglow model takes mel spectrogram and generates speech

Write some text and transform it to numeric sequence

### Define text2speech function using all the previous knowledge

The text2speech function is applied to the input text. It returns the audio signal and sampling rate. The audio is saved to a WAV file and played back using IPython's Audio display.



In [None]:
import torch
from IPython.display import Audio
from scipy.io.wavfile import write
import whisper

# Load tacotron
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()

# Load waveglow
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()

# Load utils
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')


<!-- Text-to-Speech Function -->

Converts text input into audio signals using Tacotron 2 and WaveGlow models:

1. **Input Preparation:**
   - Transforms the text into a numeric sequence.

2. **Tacotron2 Inference:**
   - Generates a mel spectrogram from the numeric sequence.

3. **WaveGlow Synthesis:**
   - Produces an audio signal from the mel spectrogram.

4. **Output:**
   - Returns the audio signal (NumPy array) with a sampling rate of 22050.


In [None]:
def text2speech(text):
    rate = 22050
    # Transform the text to a numeric sequence
    sequences, lengths = utils.prepare_input_sequence([text])

    with torch.no_grad():
        # Use the Tacotron2 model to create a mel spectrogram from the numeric sequence
        mel, _, _ = tacotron2.infer(sequences, lengths)
        # Use the WaveGlow model to produce an audio signal from the mel spectrogram
        audio = waveglow.infer(mel)

    # Return the audio signal as a NumPy array together with the sampling rate
    audio_numpy = audio[0].data.cpu().numpy()

    return audio_numpy, rate

In [None]:
text = "Hello, this is a text-to-speech test."

audio_numpy, rate = text2speech(text)

write("audio.wav", rate, audio_numpy)
Audio(audio_numpy, rate=rate)
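
Depending on the dtype of the array returned by the vocoder, it can be safer to convert the signal to 16-bit PCM before saving. This rests on an assumption: a half-precision model may return `float16` samples, which not every WAV writer accepts. The sketch below uses only NumPy and the standard-library `wave` module. `float_to_pcm16` and `save_wav` are hypothetical helper names, and a sine tone stands in for the synthesized speech.

```python
import os
import tempfile
import wave

import numpy as np


def float_to_pcm16(audio):
    # Work in float32, scale the peak to 1.0, then map to the int16 range
    audio = np.asarray(audio, dtype=np.float32)
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak  # note: this changes the loudness
    return (audio * 32767).astype(np.int16)


def save_wav(path, audio, rate=22050):
    pcm = float_to_pcm16(audio)
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes = 16 bit
        f.setframerate(rate)
        f.writeframes(pcm.astype('<i2').tobytes())  # little-endian samples


# One second of a 440 Hz tone, deliberately stored as float16
t = np.linspace(0, 1, 22050, endpoint=False)
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float16)

out_path = os.path.join(tempfile.gettempdir(), "tone_demo.wav")
save_wav(out_path, tone)
```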

### Define speech2text function
Finally, the Whisper ASR model is loaded, and the generated audio file is transcribed. The transcribed text is printed.



In [None]:
# Use a dedicated name so the T5 `model` loaded earlier is not overwritten
whisper_model = whisper.load_model("base")

result = whisper_model.transcribe("./audio.wav")
print(result["text"])

## Make it app-like

In [None]:
import random

pair = random.choice(pairs)
question = pair[0]
answer = pair[1]["text"]
print(f"Q: {question}")
print(f"A: {answer}")

output = qa_pipeline(question, 0)
print(f"Predicted: {output}")

audio_numpy, rate = text2speech(question)
print("question:")
Audio(audio_numpy, rate=rate, autoplay=True)



In [None]:
audio_numpy, rate = text2speech(output)
print("Predicted:")
Audio(audio_numpy, rate=rate)


# Hugging Face model


<!-- Question Answering with Hugging Face Transformers -->

Utilizes Hugging Face Transformers for question-answering:
- a) Uses a convenient pipeline for quick predictions.
- b) Loads the model and tokenizer separately for customization.


In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
}
res = nlp(QA_input)

# b) Load model & tokenizer separately
# (separate names keep the T5 `tokenizer` used by `qa_pipeline` intact)
qa_model = AutoModelForQuestionAnswering.from_pretrained(model_name)
qa_tokenizer = AutoTokenizer.from_pretrained(model_name)

<!-- Randomized Question-Answering Evaluation -->

Select a random question/answer pair from our data, send the question with its context to the model, and see how well the predicted answer matches the ground truth.


In [None]:
import random

# Choose random index
idx = random.randint(0,len(pairs)-1)
pair = pairs[idx]

# Get the question and the answer
question = pair[0]
answer = pair[1]["text"]
print(f"Q: {question}")
print(f"A: {answer}")

# Create a context for the model
QA_input = {
    'question': question,
    'context': contexts[idx]
}

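# Text to speech (display() is needed because only the last expression of a cell is shown automatically)
from IPython.display import display
audio_numpy, rate = text2speech(question)
print("question:")
display(Audio(audio_numpy, rate=rate, autoplay=True))

res = nlp(QA_input)
print(res["answer"])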
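
To turn "how well the answer matches the ground truth" into numbers, one option is to compute exact match and token-level F1. The sketch below is a simplified version written in the spirit of the SQuAD evaluation script, not a copy of it. All function names are illustrative, and the example strings are chosen by hand.

```python
import re
import string
from collections import Counter


def normalize_answer(s):
    # Lowercase, drop punctuation and English articles, collapse whitespace
    s = s.lower()
    s = ''.join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r'\b(a|an|the)\b', ' ', s)
    return ' '.join(s.split())


def exact_match(prediction, truth):
    return normalize_answer(prediction) == normalize_answer(truth)


def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    # Multiset intersection counts the shared tokens
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Dangerously in Love", "dangerously in love.")
f1 = f1_score("the album Dangerously in Love", "Dangerously in Love")
print(em, round(f1, 3))
```

Averaging both scores over many random pairs gives a more reliable picture than inspecting a single example.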

# Existing models summary

## Input definitions
A passage was taken from a book and recorded as a slow and a fast voice recording, to assess how the speech-to-text models cope with speaking speed.

In [None]:
path = "./data/"
# passage from `The Hitchhiker's Guide to the Galaxy`, used as text input
hitchhiker_text = "First, it is slightly cheaper; and secondly it has the words Don't Panic inscribed in large friendly letters on its cover."
# slow and fast recordings are provided in data directory
hitchhiker_slow = f"{path}hitchhiker_slow.mp3"
hitchhiker_fast = f"{path}hitchhiker_fast.mp3"
# speakers
original_speaker = f"{path}original-speaker.wav"
sample_speaker   = f"{path}sample-speaker.wav"
# display audio
from IPython.display import Audio

## coqui/xtts-v2
https://huggingface.co/coqui/XTTS-v2

**text2speech**

\+ works with user-provided voice recordings / imitates voices

\+ good results

\+ fast performance

In [None]:
!pip install TTS

from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

In [None]:
def coqui_text2speech(text, speaker_type):
  tts.tts_to_file(text=text,
                  file_path=f"{path}hitchiker-xtts-v2-{speaker_type}.wav",
                  speaker_wav=f"{path}{speaker_type}-speaker.wav",
                  language="en")

  return Audio(filename=f"{path}hitchiker-xtts-v2-{speaker_type}.wav")


In [None]:
# random sample speaker's voice
coqui_text2speech(hitchhiker_text, "sample")

 > Text splitted to sentences.
["First, it is slightly cheaper; and secondly it has the words Don't Panic inscribed in large friendly letters on its cover."]
 > Processing time: 4.544000148773193
 > Real-time factor: 0.4784322870370584


In [None]:
# original voice of one of the authors
coqui_text2speech(hitchhiker_text, "original")

 > Text splitted to sentences.
["First, it is slightly cheaper; and secondly it has the words Don't Panic inscribed in large friendly letters on its cover."]
 > Processing time: 4.2889018058776855
 > Real-time factor: 0.4879586230682065

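The log reports a real-time factor (RTF), commonly defined as processing time divided by the duration of the generated audio, so values below 1 mean that synthesis is faster than playback. Assuming this definition applies to the log lines above, the audio duration can be recovered from the two reported numbers. `audio_duration` is a hypothetical helper.

```python
def audio_duration(processing_time, real_time_factor):
    # RTF = processing_time / audio_duration
    # => audio_duration = processing_time / RTF
    return processing_time / real_time_factor


# Values copied from the two runs above
sample_duration = audio_duration(4.544000148773193, 0.4784322870370584)
original_duration = audio_duration(4.2889018058776855, 0.4879586230682065)

print(f"{sample_duration:.2f} s, {original_duration:.2f} s")
```

Under that assumption, both clips are roughly 9 seconds long and were generated in about half their playback time.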

## suno/bark
https://huggingface.co/suno/bark

**text2speech**

\+ fast and consistent over different text

\+ based on the provided voice (generated or selected from list)

\+ supports special tokens to change voice or add expressions etc.

\- slow performance

\- needs postprocessing to clear robotic voice

In [None]:
!pip install git+https://github.com/suno-ai/bark.git

from bark import SAMPLE_RATE, generate_audio, preload_models

# download and load all models
preload_models()

In [None]:
def bark_text2speech(text):
  # generate audio from text
  speech_array = generate_audio(text)

  # play text in notebook
  return Audio(speech_array, rate=SAMPLE_RATE)

In [None]:
bark_text2speech(hitchhiker_text)

100%|██████████| 562/562 [00:13<00:00, 40.82it/s]
100%|██████████| 29/29 [00:34<00:00,  1.18s/it]


## facebook/seamless-m4t-v2-large
https://huggingface.co/facebook/seamless-m4t-v2-large

**speech2text**, **speech2speech**, **text2speech**

\+ covers several tasks: STT, STS, TTS

\- slow performance

In [None]:
!pip install git+https://github.com/huggingface/transformers.git sentencepiece

from transformers import AutoProcessor, SeamlessM4Tv2Model
import torchaudio
from IPython.display import Audio

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

In [None]:
def facebook_speech2text(file):
  audio, orig_freq =  torchaudio.load(file)
  audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)
  audio_inputs = processor(audios=audio, return_tensors="pt")
  # Generate text tokens only; speech synthesis is skipped with generate_speech=False
  output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
  return processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

def facebook_speech2speech(file):
  audio, orig_freq =  torchaudio.load(file)
  audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)
  audio_inputs = processor(audios=audio, return_tensors="pt")
  audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="eng")[0].cpu().numpy().squeeze()
  return Audio(audio_array_from_audio, rate=model.config.sampling_rate)

def facebook_text2speech(text):
  text_inputs = processor(text=text, src_lang="eng", return_tensors="pt")
  audio_array_from_text = model.generate(**text_inputs, tgt_lang="eng")[0].cpu().numpy().squeeze()
  return Audio(audio_array_from_text, rate=model.config.sampling_rate)

In [None]:
facebook_speech2text(hitchhiker_slow)

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


"First it is slightly cheaper and secondly it has the words don't panic inscribed in large friendly letters on its cover."

In [None]:
facebook_speech2text(hitchhiker_fast)

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


'First it is slightly cheaper and secondly it has the word "panic" inscribed in large friendly letters on its cover.'

In [None]:
facebook_speech2speech(hitchhiker_slow)

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


In [None]:
facebook_speech2speech(hitchhiker_fast)

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


In [None]:
facebook_text2speech(hitchhiker_text)

## facebook/mms-tts-eng
https://huggingface.co/facebook/mms-tts-eng

**text2speech**

\+ very fast performance

\- makes weird pauses

\- semi-robotic voice



In [None]:
!pip install --upgrade transformers accelerate

from transformers import VitsModel, AutoTokenizer
import torch

model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

In [None]:
def facebook_mms_text2speech(text):
  inputs = tokenizer(text, return_tensors="pt")

  with torch.no_grad():
      output = model(**inputs).waveform

  return Audio(output.numpy(), rate=model.config.sampling_rate)

In [None]:
facebook_mms_text2speech(hitchhiker_text)

## nvidia/parakeet-rnnt-1.1b
https://huggingface.co/nvidia/parakeet-rnnt-1.1b

**speech2text**

\+ very fast speech to text

\- large model

\- incorrect transcription of the fast recording

In [None]:
!pip install nemo_toolkit['all']

import nemo.collections.asr as nemo_asr
model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="nvidia/parakeet-rnnt-1.1b")

def nvidia_parakeet_speech2text(file):
  return model.transcribe([file])

In [None]:
nvidia_parakeet_speech2text(hitchhiker_slow)

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

(["first it is slightly cheaper and secondly it has the words don't panic inscribed in large friendly letters on its cover"],
 ["first it is slightly cheaper and secondly it has the words don't panic inscribed in large friendly letters on its cover"])

In [None]:
nvidia_parakeet_speech2text(hitchhiker_fast)

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

(['first it is slightly cheaper and secondly it has the present panic inscribed in large friendly letters on its cover'],
 ['first it is slightly cheaper and secondly it has the present panic inscribed in large friendly letters on its cover'])

## distil-whisper/distil-large-v2
https://huggingface.co/distil-whisper/distil-large-v2

**speech2text**

\+ very fast

\+ nearly correct for the slow recording (one misrecognized word)

\- several errors on the fast recording

In [None]:
!pip install datasets

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)


In [None]:
pipe(hitchhiker_slow)["text"]

" First, it is slightly cheaper and secondly, it has the words don't panic inscribed in large flintre letters on its cover."

In [None]:
pipe(hitchhiker_fast)["text"]

' First, it is slightly cheaper and secondly it has the first to panic inscribe in large friendly letters on this cover.'
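
The transcriptions can be compared quantitatively with the word error rate (WER), i.e. the word-level edit distance divided by the number of reference words. Below is a minimal pure-Python sketch. `tokenize` and `word_error_rate` are hypothetical helpers, the reference is the original passage, and the hypothesis is the slow-recording output shown above.

```python
import re


def tokenize(text):
    # Lowercase and keep only letters, digits, apostrophes and spaces
    return re.sub(r"[^a-z0-9' ]", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    # Levenshtein distance over words, computed one row at a time
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


reference = ("First, it is slightly cheaper; and secondly it has the words "
             "Don't Panic inscribed in large friendly letters on its cover.")
slow_output = (" First, it is slightly cheaper and secondly, it has the words "
               "don't panic inscribed in large flintre letters on its cover.")

wer_slow = word_error_rate(reference, slow_output)
print(f"WER (slow recording): {wer_slow:.3f}")
```

Applying the same function to the outputs of the other speech-to-text models would put the plus/minus notes in this section on a common scale.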

## speechbrain
https://huggingface.co/speechbrain/tts-tacotron2-ljspeech

**text2speech**

\+ fast

\- robotic voice

In [None]:
!pip install speechbrain

import torchaudio
from speechbrain.pretrained import Tacotron2
from speechbrain.pretrained import HIFIGAN

# Initialize TTS (Tacotron2) and vocoder (HiFi-GAN)
tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir="tmpdir_tts")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="tmpdir_vocoder")

In [None]:
def speechbrain_text2speech(text):
  # Running the TTS
  mel_output, mel_length, alignment = tacotron2.encode_text(text)

  # Running Vocoder (spectrogram-to-waveform)
  waveforms = hifi_gan.decode_batch(mel_output)

  # Save the waveform
  torchaudio.save('hitchhiker-speachbrain.wav', waveforms.squeeze(1), 22050)

  return Audio(waveforms.squeeze(1), rate=22050)

In [None]:
speechbrain_text2speech(hitchhiker_text)