<a href="https://colab.research.google.com/github/Taladala/my-webpage/blob/main/notebooks/long_form_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [13]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from IPython.display import Audio
import nltk  # we'll use this to split into sentences
import numpy as np

from bark.generation import (
    generate_text_semantic,
    preload_models,
)
from bark.api import semantic_to_waveform
from bark import generate_audio, SAMPLE_RATE

In [None]:
from huggingface_hub import hf_hub_download
from bark.generation import CACHE_DIR # Assuming CACHE_DIR is accessible

# Download text tokenizer files
for filename in ["tokenizer.json", "vocab.json", "merges.txt"]:
    hf_hub_download(repo_id="suno/bark", filename=filename, cache_dir=CACHE_DIR)

preload_models()

In [6]:
!pip install torch==2.5.1 torchaudio==2.5.1 --force-reinstall



# Simple Long-Form Generation
We split longer text into sentences using `nltk` and generate the sentences one by one.

In [7]:
script = """
Hey, have you heard about this new text-to-audio model called "Bark"?
Apparently, it's the most realistic and natural-sounding text-to-audio model
out there right now. People are saying it sounds just like a real person speaking.
I think it uses advanced machine learning algorithms to analyze and understand the
nuances of human speech, and then replicates those nuances in its own speech output.
It's pretty impressive, and I bet it could be used for things like audiobooks or podcasts.
In fact, I heard that some publishers are already starting to use Bark to create audiobooks.
It would be like having your own personal voiceover artist. I really think Bark is going to
be a game-changer in the world of text-to-audio technology.
""".replace("\n", " ").strip()

In [8]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look up this resource in sent_tokenize
sentences = nltk.sent_tokenize(script)


In [None]:
SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    audio_array = generate_audio(sentence, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]


In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
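The pause between sentences is simply a block of zeros whose length is the pause duration times the sample rate. As a minimal sketch of that arithmetic, the hypothetical helper `join_with_silence` below (not part of Bark) inserts silence only between clips, so the result has no trailing pause. The value 24,000 is assumed to match `bark.SAMPLE_RATE`, and the sine tones are stand-ins for real Bark output.

```python
import numpy as np


def join_with_silence(clips, sample_rate, pause_s=0.25):
    """Concatenate 1-D audio clips, inserting pause_s seconds of silence between them."""
    silence = np.zeros(int(pause_s * sample_rate), dtype=np.float32)
    out = []
    for i, clip in enumerate(clips):
        if i > 0:
            out.append(silence)  # silence goes between clips, never after the last one
        out.append(np.asarray(clip, dtype=np.float32))
    return np.concatenate(out) if out else np.zeros(0, dtype=np.float32)


# Stand-in clips (one second of a 440 Hz tone each) in place of real Bark output.
sample_rate = 24_000  # assumed to match bark.SAMPLE_RATE
t = np.arange(sample_rate) / sample_rate
tone = 0.1 * np.sin(2 * np.pi * 440 * t)

audio = join_with_silence([tone, tone, tone], sample_rate)
print(len(audio) / sample_rate)  # prints 3.5 (three 1 s clips plus two 0.25 s pauses)
```

With real output, `join_with_silence(audio_arrays, SAMPLE_RATE)` could be passed to `Audio(..., rate=SAMPLE_RATE)` in the same way as the concatenated `pieces`.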


# Advanced Long-Form Generation
Sometimes Bark will hallucinate a little extra audio at the end of the prompt.
We can reduce this by lowering the end-of-sequence probability threshold at which Bark stops generating semantic tokens.
To do so, we pass the `min_eos_p` kwarg to `generate_text_semantic`.
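To make the effect of `min_eos_p` concrete, here is a toy sketch of a threshold-based stopping rule. It is a simplified illustration and not Bark's actual code: the function `should_stop`, the four-token distribution, and the choice of which index plays the end-of-sequence role are all assumptions made for the example. The comparison value 0.2 is what we believe the library default to be, so check your installed version.

```python
import numpy as np


def should_stop(probs, sampled_token, eos_index, min_eos_p):
    """Toy early-stop rule: stop if EOS was sampled or its probability reaches min_eos_p."""
    return bool(sampled_token == eos_index or probs[eos_index] >= min_eos_p)


# Hypothetical next-token distribution; the last entry plays the role of end-of-sequence.
probs = np.array([0.50, 0.30, 0.12, 0.08])
eos_index = 3

# With a threshold of 0.2, an EOS probability of 0.08 is too low, so generation continues.
print(should_stop(probs, sampled_token=0, eos_index=eos_index, min_eos_p=0.2))   # prints False
# With the threshold lowered to 0.05, the same step ends generation.
print(should_stop(probs, sampled_token=0, eos_index=eos_index, min_eos_p=0.05))  # prints True
```

A lower threshold therefore ends each sentence sooner, which leaves less room for extra audio but may occasionally clip the last word.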

In [9]:
GEN_TEMP = 0.6
SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    semantic_tokens = generate_text_semantic(
        sentence,
        history_prompt=SPEAKER,
        temp=GEN_TEMP,
        min_eos_p=0.05,  # this controls how likely the generation is to end
    )

    audio_array = semantic_to_waveform(semantic_tokens, history_prompt=SPEAKER,)
    pieces += [audio_array, silence.copy()]




In [10]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)



# Make a Long-Form Dialog with Bark

### Step 1: Format a script and speaker lookup

In [11]:
speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}

# Script generated by ChatGPT
script = """
Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?

John: No, I haven't. What's so special about it?

Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.

John: Wow, that sounds amazing. How does it work?

Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.

John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?

Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.

John: I can imagine. It would be like having your own personal voiceover artist.

Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audio technology."""
script = script.strip().split("\n")
script = [s.strip() for s in script if s]
script

['Samantha: Hey, have you heard about this new text-to-audio model called "Bark"?',
 "John: No, I haven't. What's so special about it?",
 "Samantha: Well, apparently it's the most realistic and natural-sounding text-to-audio model out there right now. People are saying it sounds just like a real person speaking.",
 'John: Wow, that sounds amazing. How does it work?',
 'Samantha: I think it uses advanced machine learning algorithms to analyze and understand the nuances of human speech, and then replicates those nuances in its own speech output.',
 "John: That's pretty impressive. Do you think it could be used for things like audiobooks or podcasts?",
 'Samantha: Definitely! In fact, I heard that some publishers are already starting to use Bark to create audiobooks. And I bet it would be great for podcasts too.',
 'John: I can imagine. It would be like having your own personal voiceover artist.',
 'Samantha: Exactly! I think Bark is going to be a game-changer in the world of text-to-audio technology.']
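Each script line has the form `Speaker: text`, and the text itself may contain a colon. As a minimal sketch, the hypothetical helper `parse_turn` below (not part of Bark) splits on the first `": "` only and fails loudly when a line has no speaker prefix or names a speaker with no voice preset.

```python
def parse_turn(line, speaker_lookup):
    """Split 'Speaker: text' on the first ': ' and return (voice_preset, text)."""
    speaker, sep, text = line.partition(": ")
    if not sep:
        raise ValueError(f"Missing 'Speaker: ' prefix in line: {line!r}")
    if speaker not in speaker_lookup:
        raise KeyError(f"No voice preset configured for speaker {speaker!r}")
    return speaker_lookup[speaker], text.strip()


lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}
print(parse_turn("John: Here is the plan: we record it twice.", lookup))
# prints ('v2/en_speaker_2', 'Here is the plan: we record it twice.')
```

Validating every line before generating any audio avoids discovering a malformed turn halfway through a long run.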

### Step 2: Generate the audio for every speaker turn

In [None]:
pieces = []
silence = np.zeros(int(0.5*SAMPLE_RATE))
for line in script:
    # Split on the first ": " only, so colons inside the spoken text are preserved
    speaker, text = line.split(": ", 1)
    audio_array = generate_audio(text, history_prompt=speaker_lookup[speaker])
    pieces += [audio_array, silence.copy()]

### Step 3: Concatenate all of the audio and play it

In [None]:
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
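To keep the result outside the notebook, the concatenated array can also be written to disk. The sketch below uses only NumPy and the standard-library `wave` module. The helper `save_wav` and the file name `dialog_demo.wav` are hypothetical, the input is assumed to be a mono float waveform in the range [-1, 1], and 24,000 is assumed to match `bark.SAMPLE_RATE`.

```python
import wave

import numpy as np


def save_wav(path, audio, sample_rate):
    """Write a mono float waveform in [-1, 1] to a 16-bit PCM WAV file."""
    clipped = np.clip(np.asarray(audio, dtype=np.float64), -1.0, 1.0)
    pcm = (clipped * 32767).astype("<i2")  # little-endian int16 samples
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


# Stand-in for np.concatenate(pieces): half a second of a 220 Hz tone.
sample_rate = 24_000  # assumed to match bark.SAMPLE_RATE
t = np.arange(sample_rate // 2) / sample_rate
save_wav("dialog_demo.wav", 0.2 * np.sin(2 * np.pi * 220 * t), sample_rate)
```

With real output, the call would be `save_wav("dialog.wav", np.concatenate(pieces), SAMPLE_RATE)`.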

In [12]:
# The PyPI package named "bark" is not maintained by Suno; install Bark from its GitHub repository instead.
!pip install git+https://github.com/suno-ai/bark.git