In [1]:
# install bark (make sure you have torch>=2 for much faster flash-attention)
!pip install -qq git+https://github.com/suno-ai/bark.git

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.12.2 requires botocore<1.34.52,>=1.34.41, but you have botocore 1.29.165 which is incompatible.


Audio guides are a good way to see a city or a gallery at your own pace. The goal of this project is to create a guided audio tour for the [National Gallery of Ireland](<https://www.nationalgallery.ie/visit-us/self-guided-tours/through-the-lens-tour-highlights>).
 
This tour has been uploaded to [Google Drive](https://drive.google.com/file/d/1wemMPGUryrx__Wb4j1wB4dXY48QvB3pz/view?usp=sharing). Please feel free to visit the gallery and enjoy the tour.

**Text-to-speech (TTS)**

TTS technology has both advantages and disadvantages. On the positive side, TTS enables convenient and engaging communication, since speech is the primary mode of human communication. It is widely used in mobile devices, smart speakers, and voice assistants. TTS can also help language learners by exposing them to spoken material in a foreign language.

There are downsides, however. TTS systems can struggle to reproduce accents and intonation accurately.

Despite these limitations, TTS can still offer a better user experience than plain text, since short, succinct audio is easy to follow for listeners from diverse linguistic backgrounds.

**Model**

Bark is a fully generative text-to-audio model developed for research and demo purposes. It follows a GPT-style architecture similar to AudioLM and Vall-E and uses a quantized audio representation from EnCodec.

It is not a [conventional TTS model](https://github.com/suno-ai/bark/tree/main?tab=readme-ov-file#-usage-in-python): because it is fully generative, it can deviate in unexpected ways from any given script.

Unlike previous approaches, Bark converts the input text prompt directly to audio without using phonemes as an intermediate step. It can therefore generalize to arbitrary instructions beyond speech, such as music lyrics, sound effects, or other non-speech sounds.

Bark is licensed under the MIT License, so it is available for commercial use.

Below is a list of some known non-speech sounds and other prompt cues.
* [laughter]
* [laughs]
* [sighs]
* [music]
* [gasps]
* [clears throat]
* — or ... for hesitations
* ♪ for song lyrics
* CAPITALIZATION for emphasis of a word
* [MAN] and [WOMAN] to bias Bark toward male and female speakers, respectively
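
As a rough illustration of how these cues fit into a prompt, the sketch below assembles a prompt string in plain Python. The helper name `build_prompt` and the sample sentence are made up for this example; only the cue tokens themselves come from the list above. The resulting string is what would be passed as the text argument to `generate_audio`.

```python
def build_prompt(text, speaker_tag=None, emphasize=()):
    """Assemble a prompt string using the cues listed above (hypothetical helper)."""
    targets = {w.lower() for w in emphasize}
    words = []
    for word in text.split():
        core = word.strip(".,!?")  # ignore surrounding punctuation when matching
        if core and core.lower() in targets:
            word = word.replace(core, core.upper())  # CAPITALIZATION for emphasis
        words.append(word)
    prompt = " ".join(words)
    if speaker_tag:
        prompt = f"[{speaker_tag}] {prompt}"  # e.g. [MAN] or [WOMAN]
    return prompt


prompt = build_prompt(
    "Welcome to the gallery. [clears throat] Let us begin ... with the first painting.",
    speaker_tag="WOMAN",
    emphasize=["first"],
)
print(prompt)
```

Because Bark is generative, these cues bias the output rather than guarantee it, so it is worth listening to each generated clip.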


The full version of Bark requires around 12GB of VRAM to hold everything on GPU at the same time. 

To use a smaller version of the models, which should fit into 8GB VRAM, set the environment flag SUNO_USE_SMALL_MODELS=True. [Example Notebook](https://github.com/suno-ai/bark/blob/main/notebooks/memory_profiling_bark.ipynb)


**Bark has two ways to reduce GPU memory:**

1. Small models: a smaller version of each model. This is enabled with the environment variable SUNO_USE_SMALL_MODELS.
2. Offloading models to CPU: holding only one model at a time on the GPU and moving the others to the CPU between generation steps.
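
Both options can be switched on through environment variables. The sketch below assumes the flags are read when `bark` is first imported, so they are set beforehand; the values follow the `SUNO_USE_SMALL_MODELS=True` and `SUNO_OFFLOAD_CPU` examples used elsewhere in this notebook.

```python
import os

# Set the flags before importing bark (assumption: they are read at import time).
os.environ["SUNO_USE_SMALL_MODELS"] = "True"  # use the smaller model variants
os.environ["SUNO_OFFLOAD_CPU"] = "1"          # keep only one model on the GPU at a time

# Import bark only after the flags are in place, for example:
# from bark import generate_audio, preload_models
```

If `bark` has already been imported in the session, changing these variables may have no effect until the kernel is restarted.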

In [2]:
import os
import time

# https://github.com/suno-ai/bark/blob/main/notebooks/long_form_generation.ipynb
import nltk  # we'll use this to split into sentences
import numpy as np
import torch
from IPython.display import Audio

import bark.generation
from bark import SAMPLE_RATE, generate_audio
from bark.api import semantic_to_waveform
from bark.generation import generate_text_semantic, preload_models

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [24]:
script = """

... Berthe Morisot exhibited at the first Impressionist exhibition in 1874 
and at most of the group’s subsequent shows.

She painted and exhibited professionally throughout her life 
and often sold more paintings than the other male Impressionists, 
also sometimes for more money.

MORISOT’s paintings typically portray domestic scenes and the activities of middle-class women.

The type of subject matter that was considered appropriate for a woman artist of her class.

This also meant her paintings were more socially acceptable to purchase and display in a home compared to some of the other Impressionists, who painted people at bars and clubs.

Would you hang this painting in your home?

... Le Corsage Noir (the Black Bodice) is technically one of these domestic scenes. 
What do you see in the painting?

It shows a young woman dressed in evening attire. 
Morisot’s model was actually a professional model. 
The dress she wears belonged to ... Morisot.

As Impressionists were known for their depiction of light, 
how has Morisot used light in this painting?

How has Morisot applied her brushstrokes in the painting?

Are they loose or small and detailed? ... What colours has she used?

Even though the title of the work implies the dress is black, is it actually?

... If you look closely, you’ll see it's a dark blue pigment.

""".replace("\n", " ").strip()

In [25]:
# We split longer text into sentences using nltk and generate the sentences one by one.
sentences = nltk.sent_tokenize(script)

In [26]:
len(sentences)

19

**Hallucination**

Sometimes Bark will hallucinate a little extra audio at the end of the prompt. We can reduce this by lowering the threshold at which Bark stops generating, using the min_eos_p kwarg in generate_text_semantic.
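
To make the effect of `min_eos_p` concrete, here is a toy sketch in plain Python. It is not Bark's implementation: the function `stop_index` and the probability values are invented for illustration, and it assumes only that generation halts at the first step where the end-of-sequence (EOS) probability reaches the threshold.

```python
def stop_index(eos_probs, min_eos_p):
    """Return the first step whose EOS probability reaches min_eos_p.

    Returns len(eos_probs) if the threshold is never reached.
    """
    for step, p in enumerate(eos_probs):
        if p >= min_eos_p:
            return step
    return len(eos_probs)


# Invented per-step EOS probabilities: low while speaking, rising near the end.
eos_probs = [0.00, 0.01, 0.01, 0.02, 0.06, 0.15, 0.30]

later = stop_index(eos_probs, min_eos_p=0.2)     # higher threshold
earlier = stop_index(eos_probs, min_eos_p=0.05)  # lower threshold
print(f"threshold 0.2 stops at step {later}, threshold 0.05 stops at step {earlier}")
```

A lower threshold ends generation sooner, which trims trailing audio but can also cut a sentence short if it is set too low.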

In [27]:
global models

for offload_models in (True, False):
    # this setattr is needed to do on the fly
    # the easier way to do this is with `os.environ["SUNO_OFFLOAD_CPU"] = "1"`
    setattr(bark.generation, "OFFLOAD_CPU", offload_models)
    for use_small_models in (True, False):
        models = {}
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        preload_models(
            text_use_small=use_small_models,
            coarse_use_small=use_small_models,
            fine_use_small=use_small_models,
            force_reload=True,
        )
        t0 = time.time()
        GEN_TEMP = 0.6
        SPEAKER = "v2/en_speaker_6"
        silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

        pieces = []
        for sentence in sentences:
            semantic_tokens = generate_text_semantic(
                sentence,
                history_prompt=SPEAKER,
                temp=GEN_TEMP,
                min_eos_p=0.05,  # this controls how likely the generation is to end
            )

            audio_array = semantic_to_waveform(semantic_tokens, history_prompt=SPEAKER,)
            pieces += [audio_array, silence.copy()]

        #audio_array = generate_audio("madam I'm adam", history_prompt="v2/en_speaker_5", silent=True)
        dur = time.time() - t0
        max_utilization = torch.cuda.max_memory_allocated()
        print(f"Small models {use_small_models}, offloading to CPU: {offload_models}")
        print(f"\tmax memory usage = {max_utilization / 1024 / 1024:.0f}MB, time {dur:.0f}s\n")

Small models True, offloading to CPU: True
	max memory usage = 4256MB, time 329s



Small models False, offloading to CPU: True
	max memory usage = 1779MB, time 746s



Small models True, offloading to CPU: False
	max memory usage = 2972MB, time 269s



Small models False, offloading to CPU: False
	max memory usage = 7825MB, time 621s
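
For a side-by-side view, the sketch below tabulates the four results printed above and expresses each run time relative to the fastest configuration. The numbers are copied from the output; the dictionary layout is only an illustrative choice. Generation is sampled, so each run produces audio of a slightly different length and the timings should be read as indicative rather than exact.

```python
# Results copied from the four runs printed above.
runs = [
    {"small": True,  "offload": True,  "peak_mb": 4256, "seconds": 329},
    {"small": False, "offload": True,  "peak_mb": 1779, "seconds": 746},
    {"small": True,  "offload": False, "peak_mb": 2972, "seconds": 269},
    {"small": False, "offload": False, "peak_mb": 7825, "seconds": 621},
]

fastest = min(runs, key=lambda r: r["seconds"])

print(f"{'small':<7}{'offload':<9}{'peak MB':>8}{'time s':>8}{'vs fastest':>12}")
for r in sorted(runs, key=lambda r: r["seconds"]):
    ratio = r["seconds"] / fastest["seconds"]
    print(
        f"{str(r['small']):<7}{str(r['offload']):<9}"
        f"{r['peak_mb']:>8}{r['seconds']:>8}{ratio:>11.2f}x"
    )
```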



In [28]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)

**Post Production**

To clean up the audio, we used [Audacity](https://www.audacityteam.org/), an open-source audio editor that covers a wide range of editing needs.
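
Before the audio can be opened in an external editor, it has to be written to disk. The sketch below is one way to do that using only Python's standard `wave` module. It assumes the samples are floats in the range -1 to 1 and uses a 24 kHz rate for the demo signal; the function name `save_wav` and the file name are placeholders. In this notebook the samples would be `np.concatenate(pieces)` and the rate would be `SAMPLE_RATE`.

```python
import array
import math
import sys
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = array.array(
        "h",
        (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples),
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


# Stand-in signal: one second of a quiet 440 Hz tone.
rate = 24000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
save_wav("tour_segment.wav", tone, rate)
```

Clipping to the -1 to 1 range before conversion avoids integer overflow if any sample slightly exceeds full scale.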

