# Speech2Text

**Information**

The Speech2Text model was proposed in the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171).

FAIRSEQ S2T is a comprehensive toolkit for speech-to-text tasks, offering end-to-end workflows that streamline data preprocessing, model training, and inference. It supports a variety of advanced models, including RNN, Transformer, and Conformer architectures, and integrates with machine translation and language models for multi-task learning. Designed for scalability and efficiency, FAIRSEQ S2T includes tools for tokenization, mixed precision training, multi-GPU support, and error analysis, making it ideal for large-scale experiments in speech recognition and translation.

Remark: FAIRSEQ S2T was developed by a team at Meta's Fundamental AI Research (FAIR) lab.

***
**Coding sources**

* Hugging Face model page: https://huggingface.co/facebook/s2t-small-librispeech-asr
* Hugging Face documentation: https://huggingface.co/docs/transformers/main/en/model_doc/speech_to_text


***
**Aim of the code template**

Mimic Google's advanced Speech-to-Text API by (a) transcribing an audio file and (b) improving the transcription with two LLMs; see the Google API: https://cloud.google.com/speech-to-text/?hl=en

# Load your own audio file

## Environment Setup

Load necessary libraries:

In [19]:
import torch
import librosa
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
# pip install torchaudio sentencepiece

import textwrap
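Before running the cells below, it can help to confirm that the packages named in the setup cell are actually installed. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` list are illustrative assumptions, not part of the original notebook.

```python
from importlib.util import find_spec

# Packages imported or mentioned in the setup cell above (assumed list).
REQUIRED = ["torch", "librosa", "transformers", "torchaudio", "sentencepiece"]


def missing_packages(names):
    """Return the top-level package names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All required packages were found.")
```

The check only tests whether each package can be located; it does not verify versions or that the import itself succeeds.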

Code to transcribe your audio file:

In [20]:
# Load pre-trained Speech2Text model and processor from Hugging Face (model weights are downloaded automatically)
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

# Path to your local audio file (e.g., .wav file)
audio_file = "record.wav" # !!! change your audio file path / name here

# Load the audio file using librosa and resample to 16,000 Hz
audio_array, sampling_rate = librosa.load(audio_file, sr=16000)

# Process the audio data and prepare input features for the model
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
# print(inputs)

# Generate transcription using the model
generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])

# Decode the generated ids into a transcription text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)

# Output the transcription in a more readable format
print('\n'.join(textwrap.wrap(transcription[0], 100)))

Some weights of Speech2TextForConditionalGeneration were not initialized from the model checkpoint at facebook/s2t-small-librispeech-asr and are newly initialized: ['model.decoder.embed_positions.weights', 'model.encoder.embed_positions.weights']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


large language models are incredibly powerful tools that can understand and generate human like text
making them invaluable for tasks like writing translation and coating you will now
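The transcription cell resamples the recording to 16,000 Hz via `librosa.load(..., sr=16000)`. To see what the source file looks like before that step, a quick inspection with the standard-library `wave` module can be sketched as follows. The helpers `wav_info` and `describe` are hypothetical names introduced here, and the sketch assumes an uncompressed PCM `.wav` file, which is the case `wave` is designed for.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(rate)
    return rate, channels, duration


def describe(path, target_rate=16000):
    """Summarise a WAV file and say whether loading it will involve resampling."""
    rate, channels, duration = wav_info(path)
    note = "matches" if rate == target_rate else "will be resampled to"
    return f"{path}: {rate} Hz, {channels} channel(s), {duration:.2f} s ({note} {target_rate} Hz)"
```

Calling `print(describe("record.wav"))` before the transcription cell shows whether the recording is already at 16,000 Hz and how long it is, which is useful when a transcription comes back unexpectedly short.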


# Improve the transcription of your audio file

## Environment Setup

Load necessary libraries:

In [21]:
from huggingface_hub import InferenceClient
import textwrap

Get API key(s):

In [22]:
import os
import sys

# Assuming 'src' sits two directory levels above the current working directory
path_to_src = os.path.join('../..','src')  # Resolves to '../../src'

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
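As an alternative to importing a local `API_key` module, the token can be read from an environment variable so that it never lives in the repository. The sketch below is an assumption about how one might do this, not the notebook's own method; `load_hf_token` is a hypothetical helper, and `HF_TOKEN` is used as the default variable name.

```python
import os


def load_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from an environment variable.

    Raises RuntimeError with a short hint if the variable is unset or empty.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    return token
```

With this helper, the client further down could be created as `InferenceClient(token=load_hf_token())` instead of `InferenceClient(token=key.hugging_api_key)`.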

Code to improve your transcribed audio file:

@Julius: improve prompting

In [23]:
# Initialize client
client = InferenceClient(token=key.hugging_api_key)

# Create prompts
system_content = "You are a helpful assistant whose task is to improve the transcription of an audio file. Focus on possible spelling errors and missing words."
# batch_decode returns a list, so pass its first element rather than the list itself
user_content = f"""
    Check the transcription of the following audio file. Only provide the improved transcription:
    
    {transcription[0]}
"""

# Feed prompts into model
output = client.chat_completion(
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ],
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    max_tokens=500,
    stream=False,
    temperature=0.0,
)


# Accessing the text in the output object
text = output.choices[0].message.content

# Printing the output in a more readable format
print('\n'.join(textwrap.wrap(text, 100)))

'large language models are incredibly powerful tools that can understand and generate human-like
text, making them invaluable for tasks like writing, translation, and coding. You will now'
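To see how much the LLM actually changed, the raw and improved transcriptions can be compared word by word. The following is a minimal sketch of a word-level edit distance, using the two outputs shown above as sample inputs (without the surrounding quotes). The helpers `words` and `word_edit_distance` are illustrative names. Note that this counts changes made by the LLM; it is not an accuracy measure, since neither text is a verified reference.

```python
import re


def words(text):
    """Lowercase the text and split it into words, ignoring punctuation and hyphens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_edit_distance(before, after):
    """Minimum number of word insertions, deletions and substitutions between two texts."""
    a, b = words(before), words(after)
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(previous[j] + 1,          # delete word_a
                               current[j - 1] + 1,       # insert word_b
                               previous[j - 1] + cost))  # keep or substitute
        previous = current
    return previous[-1]


raw = ("large language models are incredibly powerful tools that can understand "
       "and generate human like text making them invaluable for tasks like "
       "writing translation and coating you will now")
improved = ("large language models are incredibly powerful tools that can understand "
            "and generate human-like text, making them invaluable for tasks like "
            "writing, translation, and coding. You will now")

changed = word_edit_distance(raw, improved)
print(f"{changed} of {len(words(raw))} words changed")
```

Because punctuation and hyphens are ignored, the only difference counted for this pair is the substitution of "coating" with "coding".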
