<a target="_blank" href="https://colab.research.google.com/github/Deep-unlearning/notebooks/blob/main/Moonshine_demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Speech-to-Text with Moonshine 🌙 (Transformers)

## Introduction

Moonshine improves upon Whisper's architecture:

- It uses SwiGLU activation instead of GELU in the decoder layers.
- Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This lets Moonshine process variable-length audio without padding, whereas Whisper pads or chunks its input into fixed 30-second windows.
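
To make the RoPE point concrete, here is a minimal sketch in plain Python. It is not Moonshine's actual implementation: the function name `rope_rotate`, the interleaved pairing of dimensions, and the base of 10000 are assumptions chosen for illustration. It shows the property that matters here: after rotation, the attention score between a query and a key depends only on their relative offset, not on their absolute positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`.

    Illustrative helper (hypothetical name), using the interleaved-pair
    formulation of rotary embeddings.
    """
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]   # toy query vector
k = [1.5, 0.3, -0.7, 1.0]    # toy key vector

# Same relative offset (2), very different absolute positions.
score_near = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_far = dot(rope_rotate(q, 1003), rope_rotate(k, 1001))

print(abs(score_near - score_far) < 1e-9)
```

Because the score depends only on the offset between positions, no learned table of absolute positions is needed, which is what removes the fixed input length.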

## Environment Setup

In [17]:
!pip install transformers==4.48.1 # Moonshine was released with transformers 4.48.0
!pip install datasets
!pip install torchaudio  # if needed
!pip install librosa     # helpful for audio manipulation
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


## Imports




In [22]:
from transformers import AutoProcessor, MoonshineForConditionalGeneration
from datasets import load_dataset
import evaluate
import librosa

## Loading Moonshine Model and Processor

In [19]:
processor = AutoProcessor.from_pretrained("UsefulSensors/moonshine-tiny")
model = MoonshineForConditionalGeneration.from_pretrained("UsefulSensors/moonshine-tiny").to("cuda")

In [20]:
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor([ds[i]["audio"]["array"] for i in range(10)], return_tensors="pt", padding=True, sampling_rate=16_000)
input_values = inputs.input_values.to("cuda")

# Cap the generated length based on audio duration to avoid hallucination loops
token_limit_factor = 6.5 / processor.feature_extractor.sampling_rate  # Maximum of 6.5 tokens per second
seq_lens = inputs.attention_mask.sum(dim=-1)
max_length = int((seq_lens * token_limit_factor).max().item())

generated_ids = model.generate(input_values, max_length=max_length)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
for t in transcription:
    print(t)

Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
Nor is Mr. Quilter's manner less interesting than his matter.
He tells us that at this festive season of the year, with Christmas and Rose beef looming before us, similes drawn from eating and its results occur most readily to the mind.
He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of Rocky Ithaca.
Linels' pictures are a sort of up-guards and Adam paintings, and Mason's exquisite idols are as national as a jingo poem. Mr. Burke fosters landscapes, smiling at one much in the same way that Mr. Carcor used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap in the back, before he says, like a shampoo or an attorcaged bath, next man,
It is obviously unnecessary for us to point out how luminous these criticisms are, how delicate in expression.
A general principles of art and Mr. Quilter writes with equal lucidit
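
The length cap used in the cell above is simple arithmetic: the number of audio samples, divided by the sampling rate, times 6.5 tokens per second. The sketch below reproduces it without the model so the numbers are easy to check. The helper name `max_tokens_for_audio` is hypothetical and not part of Transformers; the 6.5 tokens-per-second figure is the one used in this notebook.

```python
def max_tokens_for_audio(num_samples, sampling_rate=16_000, tokens_per_second=6.5):
    """Return the token budget for a clip of `num_samples` audio samples.

    Hypothetical helper mirroring the notebook's max_length computation.
    """
    duration_seconds = num_samples / sampling_rate
    return int(duration_seconds * tokens_per_second)

# A 10-second clip sampled at 16 kHz
print(max_tokens_for_audio(16_000 * 10))  # 65

# A 2-second clip gets a much smaller budget
print(max_tokens_for_audio(16_000 * 2))   # 13
```

In a padded batch, the notebook takes the maximum budget across all clips, so the longest clip determines `max_length` for every item.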

## Test with your own audio!

In [24]:
from google.colab import files

uploaded = files.upload()  # User picks a local audio file
for filename in uploaded.keys():
    audio, sr = librosa.load(filename, sr=16000)
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
    input_values = inputs.input_values.to("cuda")

    # Cap the output length using this clip's own duration, not the earlier batch's
    token_limit_factor = 6.5 / processor.feature_extractor.sampling_rate  # Maximum of 6.5 tokens per second
    seq_lens = inputs.attention_mask.sum(dim=-1)
    max_length = int((seq_lens * token_limit_factor).max().item())

    generated_ids = model.generate(input_values, max_length=max_length)
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
    print("Transcription:", transcription)


Saving 20090202-0900-PLENARY-9-en_20090202-17_20_18_2.wav to 20090202-0900-PLENARY-9-en_20090202-17_20_18_2 (2).wav
Transcription: ['It is in this same spirit that Article two of the European Convention of Human Rights declares the taking of life a flagrant violation.']


In [25]:
from IPython.display import Audio
Audio(audio, rate=sr)