## Hugging Face pipelines

The SONAR project provides a set of pipelines that use Hugging Face models to convert text into embeddings and back. These pipelines are meant to simplify the use of SONAR.

With these pipelines, you can configure an encoder and a decoder, encode sentences, decode them, and obtain a _translation_ result using only a couple of classes.

In [3]:
#%pip install --quiet sonar-space seaborn pandas datasets

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
blis 1.0.1 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
thinc 8.3.2 requires numpy<2.1.0,>=2.0.0; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [4]:
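import sys
import os

# Local checkout of the SONAR repository; adjust this path for your machine.
sonar_repo_dir = os.path.abspath('/home/david/Documents/MLH_fellowship/SONAR')

# Add the checkout to the import path so the imports below resolve.
if sonar_repo_dir not in sys.path:
    sys.path.append(sonar_repo_dir)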
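
The hard-coded path above only works on one machine. A minimal sketch of one alternative is to read the checkout location from an environment variable and fall back to the current directory. The variable name `SONAR_REPO_DIR` and the helper `add_repo_to_path` are hypothetical, chosen for this illustration only; they are not part of SONAR.

```python
import os
import sys


def add_repo_to_path(env_var="SONAR_REPO_DIR", default="."):
    """Append the repository directory to sys.path if it is not already there.

    SONAR_REPO_DIR is a hypothetical variable name used only in this sketch.
    """
    repo_dir = os.path.abspath(os.environ.get(env_var, default))
    if repo_dir not in sys.path:
        sys.path.append(repo_dir)
    return repo_dir


repo_dir = add_repo_to_path()
```

Calling the helper twice leaves `sys.path` unchanged the second time, so the cell can be re-run safely.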

In [10]:
from huggingface_pipelines.text import (
    DatasetConfig,
    EmbeddingToTextPipelineConfig,
    HFEmbeddingToTextPipeline,
    HFTextToEmbeddingPipeline,
    TextToEmbeddingPipelineConfig,
)

from huggingface_pipelines.audio import (
    AudioDatasetConfig,
    AudioToEmbeddingPipelineFactory,
)


import torch
import numpy as np
from datasets import load_dataset

In [7]:
dataset = load_dataset("facebook/flores", "eng_Latn")

Downloading data: 100%|██████████| 25.6M/25.6M [00:02<00:00, 11.8MB/s]
Generating dev split: 997 examples [00:00, 24106.86 examples/s]
Generating devtest split: 1012 examples [00:00, 36711.31 examples/s]
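
Once the dataset is loaded, its sentences can be grouped into small batches before encoding. The sketch below uses a stand-in list rather than the real dataset. The helper `iter_text_batches` is hypothetical, the `sentence` column name mentioned in the comment is an assumption about the Flores schema, and whether `process_batch` expects a flat list per column or nested lists is not confirmed here.

```python
def iter_text_batches(sentences, batch_size):
    """Yield {"text": [...]} dicts holding at most batch_size sentences each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sentences), batch_size):
        yield {"text": sentences[start:start + batch_size]}


# Stand-in sentences; with the real dataset this might be
# dataset["dev"]["sentence"][:4] (the column name is an assumption).
sample_sentences = ["Hello world.", "This is a test.", "One more sentence."]
batches = list(iter_text_batches(sample_sentences, batch_size=2))
```

The last batch may be shorter than `batch_size` when the number of sentences is not a multiple of it.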


In [17]:
text2embedding_config = TextToEmbeddingPipelineConfig(
    encoder_model="text_sonar_basic_encoder",
    columns=["text"],
    output_column_suffix="embedding",
    batch_size=2,
    device="cpu",
    source_lang="eng_Latn",
    output_path="test",
)

embedding2text_config = EmbeddingToTextPipelineConfig(
    decoder_model="text_sonar_basic_decoder",
    columns=["embedding"],
    output_column_suffix="text",
    batch_size=2,
    device="cpu",
    target_lang="eng_Latn",
    output_path="test",
)

encoder = HFTextToEmbeddingPipeline(text2embedding_config)
# Get data from Flores here <--
batch = {"text": [["Hello", "World"], ["Test", "Sentence"]]}
encoded_result = encoder.process_batch(batch)

# "Observe" the embedding space

# Convert back to text
decode_batch = {"embedding": encoded_result["text_embedding"]}
decoder = HFEmbeddingToTextPipeline(embedding2text_config)
decoded_result = decoder.process_batch(decode_batch)

# Convert back to standard format (all lowercase...)

# Evaluate the "translation"

INFO:huggingface_pipelines.text:Initializing text to embedding model...
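
The placeholder comments in the cell above (observe the embedding space, convert to a standard format, evaluate the "translation") could be filled in along the following lines. This is a minimal sketch in plain Python, not SONAR's own evaluation code. The helper names are hypothetical, and the sample sentences and vectors are invented for illustration rather than taken from model output.

```python
import math
import string


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def normalize(sentence):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(sentence.lower().translate(table).split())


def exact_match_rate(originals, decoded):
    """Fraction of sentence pairs that are identical after normalization."""
    if len(originals) != len(decoded):
        raise ValueError("originals and decoded must have the same length")
    if not originals:
        return 0.0
    matches = sum(normalize(o) == normalize(d) for o, d in zip(originals, decoded))
    return matches / len(originals)


# Invented stand-in data, not real pipeline output.
originals = ["Hello, world!", "This is a test."]
decoded = ["hello world", "This is a text."]
rate = exact_match_rate(originals, decoded)
```

Exact match is a strict measure. For longer sentences, a token-overlap or BLEU-style score may be more informative, and comparing embeddings with `cosine_similarity` gives a view that does not depend on surface wording.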


## Audio pipeline

In [14]:
dataset_config = AudioDatasetConfig(
    dataset_name="librispeech_asr",  # Example dataset
    dataset_split="train.clean.100",  # Dataset split
    output_dir="/home/david/Documents/MLH_fellowship/SONAR/examples/data",      # Output directory for processed files
    config="clean",                    # Additional config for the dataset
    trust_remote_code=True,            # Trust remote code (if applicable)
    sampling_rate=16000,               # Set sampling rate for audio
    audio_column="audio"               # Column name containing audio data
)

audio_dataset = dataset_config.load_dataset()  # Load the configured dataset (this triggers the download)

pipeline_config = {
    "encoder_model": "sonar_speech_encoder_large",  # Model to use for embedding
    "fbank_dtype": torch.float16,                    # Data type for audio features
    "n_parallel": 4,                                 # Number of parallel processes
    "pad_idx": 0,                                    # Padding index
    "device": "cuda",                                # Device to run on (GPU)
    "batch_size": 32,                                # Batch size for processing
    "columns": ["audio"],                            # Columns to process
    "output_path": "/home/david/Documents/MLH_fellowship/SONAR/examples/data",                # Where to save output embeddings
    "output_column_suffix": "embedding"              # Suffix for output column
}

factory = AudioToEmbeddingPipelineFactory()
embedding_pipeline = factory.create_pipeline(pipeline_config)

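for i, batch in enumerate(audio_dataset):
    processed_batch = embedding_pipeline.process_batch(batch)

    embeddings = processed_batch["audio_embedding"]

    # Save each batch to its own file so later batches do not overwrite earlier ones.
    np.save(f"/path/to/output/embeddings_{i}.npy", embeddings)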


Downloading data:   9%|▊         | 549M/6.39G [11:03<1:57:37, 827kB/s]


KeyboardInterrupt: 

## Text pipeline