# **Speech Dataset Synthesis**

In this notebook, we feed 9,625 lu-encoded sentences into selected text-to-speech (TTS) models to create a synthetic speech recording of each sentence.

## **GCP credentials & data setup**

Because the models we use come from different sources and platforms, their dependencies conflict, and resolving those conflicts properly would take time. As a workaround, we use the library that would later raise the error before running the part it conflicts with. In practice, this means loading the data with `pandas` before installing `coqui-tts`; once this is done, no error occurs afterwards.

In [None]:
from google.colab import auth, drive
import os
import pandas as pd
import sys
from tqdm.notebook import tqdm

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
savedir = r"/content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech"

In [None]:
auth.authenticate_user()

In [None]:
# Use the environment variable if the user doesn't provide Project ID.

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

TTS_LOCATION = "global"

In [None]:
! gcloud config set project {PROJECT_ID}
! gcloud auth application-default set-quota-project {PROJECT_ID}
! gcloud auth application-default login -q

Load the data as a `pd.DataFrame` first to avoid a dependency error after installing `coqui-tts`.

In [None]:
# actual dataset
!wget https://raw.githubusercontent.com/TheLuBERTa/lu-encoded-speech/refs/heads/main/dataset/converted_lu_results.csv -O lu_dataset.csv

try:

  dataset = pd.read_csv("lu_dataset.csv")
  # or avoid using pandas after this step entirely by converting the dataset to `.jsonl`
  # dataset.to_json("lu_dataset.jsonl", orient="records", lines=True) # iterate on the file in synthesis step
  dataset.drop(columns=["index"], inplace=True)

  def generate_unique_id(df:pd.DataFrame, column_name:str, prefix:str="lu_") -> pd.DataFrame:
    df[column_name] = prefix + df.index.astype(str)
    return df

  dataset = generate_unique_id(dataset, "sent_id", prefix="lu_")

except Exception as e:
  print(f"{e}")

--2025-05-06 09:04:29--  https://raw.githubusercontent.com/TheLuBERTa/lu-encoded-speech/refs/heads/main/dataset/converted_lu_results.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7163868 (6.8M) [text/plain]
Saving to: ‘lu_dataset.csv’


2025-05-06 09:04:29 (56.4 MB/s) - ‘lu_dataset.csv’ saved [7163868/7163868]
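The commented-out JSONL alternative in the cell above can be sketched as follows. This is a minimal illustration using only the standard library; the sample records, the file name `lu_dataset_sample.jsonl`, and the helpers `write_jsonl` and `iter_jsonl` are assumptions made for the example and are not part of the original notebook.

```python
import json
import os
import tempfile

# Hypothetical sample records standing in for rows of the real dataset
records = [
    {"sent_id": "lu_0", "lu": ["ลีดู", "ลัยจุย"]},
    {"sent_id": "lu_1", "lu": ["ละคุ", "ซิกลุก"]},
]

def write_jsonl(rows, path):
    """Write one JSON object per line, keeping Thai characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def iter_jsonl(path):
    """Yield one record at a time so the file never has to be loaded whole."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

jsonl_path = os.path.join(tempfile.mkdtemp(), "lu_dataset_sample.jsonl")
write_jsonl(records, jsonl_path)

loaded = list(iter_jsonl(jsonl_path))
for record in loaded:
    print(record["sent_id"], " ".join(record["lu"]))
```

Iterating over the file this way would keep `pandas` out of the synthesis step entirely, at the cost of losing DataFrame conveniences such as `isin` filtering.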



In [None]:
display(dataset)

Unnamed: 0,original,lu,sent_id
0,ก็แรงปกตินี่ล่ะครับไม่มีอะไรมากอยากไปก็ไปเงินม...,"['ล่อกู้', 'แซงรูง', 'หลกปุก', 'หละกุ', 'หลิตุ...",lu_0
1,คลิกเพื่อดูข้อความที่ซ่อนไว้,"['ละคุ', 'ซิกลุก', 'เลื่อพู่', 'ลูดี', 'ล่อคู่...",lu_1
2,บัตรเครดิตหรือดีคะ,"['หลัดบุด', 'เลครู', 'หลิดดุด', 'สือหรู', 'ลีด...",lu_2
3,คิดว่าเป็นนิสัยส่วนตัวก็ส่วนหนึ่งนะคะเราเคยเห็...,"['ลิดคุด', 'ล่าวู่', 'เล็นปุน', 'ลินุ', 'หลัยส...",lu_3
4,อยากบอกว่าก๋วยเตี๋ยวต้มยำเจ้านี้เด็ดมากครับเพร...,"['หลากหยูก', 'หลอกบูก', 'ล่าวู่', 'หลวยกู๋ย', ...",lu_4
...,...,...,...
9620,สอบถามเรื่องบัตรค่ะ,"['หลอบสูบ', 'หลามถูม', 'เซื่องรู่ง', 'หลัดบุด'...",lu_9620
9621,ดีใจด้วยนะครับผมก็หวังว่าจะเจอผญดีๆแบบจขกทอ่าน...,"['ลีดู', 'ลัยจุย', 'ล่วยดู้ย', 'ละนุ', 'ลับครุ...",lu_9621
9622,คุณถามคําถามแบบนี้ไม่ได้คับบริบทระยะทางการเดิน...,"['ลุนคิน', 'หลามถูม', 'ลัมคุม', 'หลามถูม', 'แห...",lu_9622
9623,ส่งของขวัญให้แฟนเก่าดีมั้ย,"['หล่งสุ่ง', 'หลองขูง', 'หลันขวุน', 'ลั่ยฮุ่ย'...",lu_9623
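As the table shows, each cell of the `lu` column is a Python list serialised as a string. A minimal sketch of turning such a cell into TTS input is given below; the sample string reuses only the first four syllables visible for `lu_1`, and `ast.literal_eval` is used here as a safer stand-in for `eval`, which is an assumption about how one might parse the column rather than what the notebook itself does.

```python
import ast

# Shortened sample in the same shape as one cell of the `lu` column
lu_cell = "['ละคุ', 'ซิกลุก', 'เลื่อพู่', 'ลูดี']"

# literal_eval only accepts Python literals, so no code in the cell is executed
syllables = ast.literal_eval(lu_cell)

# Space-separated text, the form in which the sentence is handed to the TTS models
tts_input = " ".join(syllables)
print(len(syllables), tts_input)

# Anything that is not a plain literal is rejected instead of being run
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print("non-literal input rejected:", rejected)
```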


## **Synthesis**

### **Setting up TTS models**

In [None]:
!pip uninstall -q numpy thinc spacy -y
!pip install -q numpy==1.26.4
!pip install -q pythaitts coqui-tts
!pip install --upgrade --quiet google-cloud-texttospeech

# The KhanomTan model does not work through the PyThaiTTS interface here, so we wrote a script that uses the model with Coqui TTS directly.
!wget https://raw.githubusercontent.com/TheLuBERTa/lu-encoded-speech/refs/heads/main/khanomtan.py -O khanomtan.py

--2025-05-06 09:05:05--  https://raw.githubusercontent.com/TheLuBERTa/lu-encoded-speech/refs/heads/main/khanomtan.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4580 (4.5K) [text/plain]
Saving to: ‘khanomtan.py’


2025-05-06 09:05:05 (17.3 MB/s) - ‘khanomtan.py’ saved [4580/4580]



In [None]:
DEVICE = "cpu" #@param ["cpu", "cuda"]

In [None]:
from google.api_core.client_options import ClientOptions
from google.colab import files
from google.cloud import texttospeech_v1beta1 as texttospeech
from IPython.display import Audio, display
import json
from khanomtan import KhanomTanTTS
from pythaitts import TTS
from random import randint
import shutil
from tqdm.notebook import tqdm

# Here we initialise text-to-speech models to use in this task
# Equivalent to `khanomtan = TTS(pretrained="khanomtan", mode="best_model", version="1.0", device="cpu")`,
# but going through the PyThaiTTS code ended up with a numpy error.
khanomtan = KhanomTanTTS(version="1.0", device=DEVICE)

lunarlist = TTS(pretrained="lunarlist_onnx", device="cpu") # lunarlist_onnx is only usable on cpu

API_ENDPOINT = (
    f"{TTS_LOCATION}-texttospeech.googleapis.com"
    if TTS_LOCATION != "global"
    else "texttospeech.googleapis.com"
)

client = texttospeech.TextToSpeechClient(
    client_options=ClientOptions(api_endpoint=API_ENDPOINT)
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

def chirp_synthesize(text:str, filename:str, speaker_idx:str, verbose:bool=True):
  try:
    voice = texttospeech.VoiceSelectionParams(
        language_code="th-TH",
        name=speaker_idx,
    )

    input_text = texttospeech.SynthesisInput(text=text)

    # Perform the text-to-speech request
    response = client.synthesize_speech(
        request={"input": input_text, "voice": voice, "audio_config": audio_config}
    )

    # Write the response audio content to a file
    with open(filename, "wb") as out:
      if verbose:
        print(f"writing audio....")
      out.write(response.audio_content)

    if verbose:
      print(f"Audio content written to file {filename}")

  except Exception as e:
    print(f"{e}")

tacotron2encoder-th.onnx:   0%|          | 0.00/22.5M [00:00<?, ?B/s]

tacotron2decoder-th.onnx:   0%|          | 0.00/72.8M [00:00<?, ?B/s]

tacotron2postnet-th.onnx:   0%|          | 0.00/17.4M [00:00<?, ?B/s]

vocoder.onnx:   0%|          | 0.00/55.8M [00:00<?, ?B/s]

### **Getting into synthesising the speech data**

In [None]:
possible_speaker = ['th-TH-Chirp3-HD-Achernar',
                    'th-TH-Chirp3-HD-Achird',
                    'th-TH-Chirp3-HD-Algenib',
                    'th-TH-Chirp3-HD-Algieba',
                    'th-TH-Chirp3-HD-Alnilam',
                    'th-TH-Chirp3-HD-Aoede',
                    'th-TH-Chirp3-HD-Autonoe',
                    'th-TH-Chirp3-HD-Callirrhoe',
                    'th-TH-Chirp3-HD-Charon',
                    'th-TH-Chirp3-HD-Despina',
                    'th-TH-Chirp3-HD-Enceladus',
                    'th-TH-Chirp3-HD-Erinome',
                    'th-TH-Chirp3-HD-Fenrir',
                    'th-TH-Chirp3-HD-Gacrux',
                    'th-TH-Chirp3-HD-Iapetus',
                    'th-TH-Chirp3-HD-Kore',
                    'th-TH-Chirp3-HD-Laomedeia',
                    'th-TH-Chirp3-HD-Leda',
                    'th-TH-Chirp3-HD-Orus',
                    'th-TH-Chirp3-HD-Puck',
                    'th-TH-Chirp3-HD-Pulcherrima',
                    'th-TH-Chirp3-HD-Rasalgethi',
                    'th-TH-Chirp3-HD-Sadachbia',
                    'th-TH-Chirp3-HD-Sadaltager',
                    'th-TH-Chirp3-HD-Schedar',
                    'th-TH-Chirp3-HD-Sulafat',
                    'th-TH-Chirp3-HD-Umbriel',
                    'th-TH-Chirp3-HD-Vindemiatrix',
                    'th-TH-Chirp3-HD-Zephyr',
                    'th-TH-Chirp3-HD-Zubenelgenubi',
                    'Tsyncone',
                    'Tsynctwo',
                    'lunarlist'] # chosen speaker

speech_count = {speaker : 0 for speaker in possible_speaker}

print(len(possible_speaker))

In [None]:
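import math

# Round up so that the combined quota of all speakers always covers the whole dataset
limit_per_speaker = math.ceil(len(dataset) / len(possible_speaker))
print(limit_per_speaker)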
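As a quick check of the quota arithmetic, the sketch below recomputes the limit from the 9,625 sentences and the 33 entries of `possible_speaker` (30 Chirp 3 HD voices plus `Tsyncone`, `Tsynctwo` and `lunarlist`). The variable names are illustrative only. The point is that the per-speaker limit multiplied by the number of speakers must be at least the dataset size, otherwise some sentences could never be assigned.

```python
import math

n_sentences = 9625   # rows in the dataset
n_speakers = 33      # entries in `possible_speaker`

quota_round = round(n_sentences / n_speakers)       # 291.67 rounds to 292
quota_ceil = math.ceil(n_sentences / n_speakers)    # rounding up gives the same value here
quota_floor = n_sentences // n_speakers             # rounding down gives 291

print("round:", quota_round, "capacity:", quota_round * n_speakers)
print("ceil: ", quota_ceil, "capacity:", quota_ceil * n_speakers)
print("floor:", quota_floor, "capacity:", quota_floor * n_speakers)
```

With a limit of 292 the total capacity is 9,636, which covers all 9,625 sentences; a limit of 291 would only cover 9,603.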

In [None]:
def limit_aware_speaker_random_pick(possible_speaker:list, speech_count:dict, limit_per_speaker:int) -> str:
  # Only consider speakers that are still below their quota. Picking from this list directly
  # avoids the unbounded recursion that retrying on a full speaker could cause.
  available = [speaker for speaker in possible_speaker if speech_count[speaker] < limit_per_speaker]
  if not available:
    raise ValueError("every speaker has already reached limit_per_speaker")
  # The caller increments speech_count once the audio has been written
  return available[randint(0, len(available) - 1)]
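To see the quota logic in isolation, the following sketch distributes a handful of items over three placeholder speakers. It is a simplified stand-in, not the notebook's function: `pick_speaker`, the speaker names and the limit of 4 are all hypothetical, and the random generator is seeded only to make the demo repeatable.

```python
import random

def pick_speaker(speakers, counts, limit, rng=random):
    """Hypothetical helper: choose uniformly among speakers still under their quota."""
    available = [s for s in speakers if counts[s] < limit]
    if not available:
        raise ValueError("all speakers have reached their quota")
    return rng.choice(available)

speakers = ["voice_a", "voice_b", "voice_c"]   # placeholder names
counts = {s: 0 for s in speakers}
limit = 4
rng = random.Random(0)                         # seeded for a repeatable demo

for _ in range(10):                            # 10 items fit into 3 * 4 = 12 slots
    chosen = pick_speaker(speakers, counts, limit, rng)
    counts[chosen] += 1                        # the caller is responsible for counting

print(counts)
```

Whatever the random draws are, no speaker ends up above the limit, and the counts always sum to the number of items processed.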

In [None]:
import ast  # ast.literal_eval parses the stringified `lu` list without executing arbitrary code

for _, row in tqdm(dataset.iterrows(), desc="synthesising speech", total=dataset.shape[0]):

  random_speaker = limit_aware_speaker_random_pick(possible_speaker, speech_count, limit_per_speaker)

  sent_id = row["sent_id"]

  savename = f"{sent_id}"
  full_path = os.path.join(savedir, savename)

  # print(f"synthesising speech to {full_path}")

  # Skip sentences that already have an audio file, so the loop can be resumed safely
  if os.path.exists(f"{full_path}.mp3") or os.path.exists(f"{full_path}.wav"):
    print(f"skipping {savename} as it already exists")
    continue

  if random_speaker.startswith("th-TH-Chirp3-HD-"):

    try:
      # Note: chirp_synthesize catches and prints its own errors, so the two lines below also run when the request fails
      chirp_synthesize(text=" ".join(ast.literal_eval(row["lu"])), filename=f"{full_path}.mp3", speaker_idx=random_speaker, verbose=False)
      speech_count[random_speaker] += 1
      print(f"saved to {full_path}.wav") # accidentally printed as `.wav` too, but files created in this case are actually `.mp3`
    except Exception as e:
      print(f"{e}")

  elif random_speaker == "lunarlist":

    try:
      save_file = lunarlist.tts(text=" ".join(ast.literal_eval(row["lu"])), filename=f"{full_path}.wav")
      speech_count[random_speaker] += 1
      print(f"saved to {full_path}.wav")
    except Exception as e:
      print(f"{e}")

  else:

    # Remaining speakers (`Tsyncone`, `Tsynctwo`) belong to the KhanomTan model
    try:
      khanomtan(text=" ".join(ast.literal_eval(row["lu"])), speaker_idx=random_speaker, file_path=f"{full_path}.wav", verbose=False)
      speech_count[random_speaker] += 1
      print(f"saved to {full_path}.wav")
    except Exception as e:
      print(f"{e}")

synthesising speech:   0%|          | 0/9625 [00:00<?, ?it/s]

Streaming output truncated to the last 5000 lines.
400 This request contains sentences that are too long. Consider splitting up long sentences with sentence ending punctuation e.g. periods.
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_5465.wav
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_5466.wav
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_5467.wav
400 This request contains sentences that are too long. Consider splitting up long sentences with sentence ending punctuation e.g. periods.
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_5468.wav
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_5469.wav
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_5470.wav
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_5471.wav
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025

## **Addressing sentence-too-long error**

After the synthesis step, another error emerged: `400 This request contains sentences that are too long. Consider splitting up long sentences with sentence ending punctuation e.g. periods.`. We mitigate it by splitting each sentence whose audio file was not created in the earlier step into four parts, synthesising each part separately, and then concatenating the audio into a single `.mp3` file.

In [None]:
!pip install -q pydub

To avoid iterating over the full dataset of almost 10,000 sentences again while synthesising, we first build a separate dataset of the sentences that the models failed to synthesise.

In [None]:
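toolongsent = []
# Collect the file names without extension; a set keeps each membership test fast
created = {os.path.splitext(name)[0] for name in os.listdir(savedir)}
for sent_id in tqdm(dataset["sent_id"].tolist(), desc="finding uncreated sentences", total=dataset.shape[0]):
  if sent_id not in created:
    toolongsent.append(sent_id)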

finding uncreated sentences:   0%|          | 0/9625 [00:00<?, ?it/s]
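The lookup of sentences without an audio file can be tried out on a throwaway directory. The sketch below is a self-contained illustration; `find_missing_ids`, the temporary directory and the four sample IDs are hypothetical and only mimic the layout of the real output folder.

```python
import os
import tempfile

def find_missing_ids(all_ids, directory):
    """Hypothetical helper: return IDs that have no file of any extension in `directory`."""
    existing = {os.path.splitext(name)[0] for name in os.listdir(directory)}
    return [i for i in all_ids if i not in existing]

# Build a small stand-in for the output folder with one .mp3 and one .wav file
demo_dir = tempfile.mkdtemp()
for name in ("lu_0.mp3", "lu_2.wav"):
    open(os.path.join(demo_dir, name), "wb").close()

missing = find_missing_ids(["lu_0", "lu_1", "lu_2", "lu_3"], demo_dir)
print(missing)  # ['lu_1', 'lu_3']
```

Using `os.path.splitext` instead of slicing off the last four characters also works for extensions that are not exactly three letters long.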

In [None]:
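def four_segment(lu_list:list) -> list:
  # Split the syllable list into four roughly equal parts and join each part with spaces
  fourth = max(1, round(len(lu_list)/4))
  segments = [lu_list[:fourth], lu_list[fourth:fourth*2], lu_list[fourth*2:fourth*3], lu_list[fourth*3:]]
  # Drop empty parts (possible for short lists) so that no empty string is sent to the TTS API
  return [" ".join(segment) for segment in segments if segment]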
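A variant of the same idea that works for any number of parts is sketched below. The helper `split_into_chunks` and the placeholder syllables `s0` to `s9` are hypothetical; the sketch uses `divmod` to spread the remainder over the first chunks, which is one possible design rather than the approach used in the notebook.

```python
def split_into_chunks(items, n_chunks=4):
    """Hypothetical helper: split `items` into up to `n_chunks` contiguous, near-equal parts."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)   # the first `extra` chunks get one more item
        if size:                                # skip empty chunks for very short inputs
            chunks.append(items[start:start + size])
        start += size
    return chunks

syllables = [f"s{i}" for i in range(10)]        # placeholder syllables
parts = [" ".join(chunk) for chunk in split_into_chunks(syllables)]
print(parts)  # ['s0 s1 s2', 's3 s4 s5', 's6 s7', 's8 s9']
```

Because every chunk is non-empty and the chunks are contiguous, joining the parts back together with spaces reproduces the original sentence.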

In [None]:
remaining_dataset = dataset[dataset["sent_id"].isin(toolongsent)]

In [None]:
import io
from pydub import AudioSegment

# Set your Google Cloud Project ID
# os.environ["GOOGLE_CLOUD_PROJECT"] = "your-project-id" # Uncomment and set if needed

def synthesize_list_to_single_mp3(text_list, speaker_idx:str, output_filename="combined_audio.mp3"):
    """
    Synthesizes a list of text strings into speech (MP3) and combines them
    into a single MP3 audio file using pydub.

    Args:
        text_list: A list of strings, where each string is a piece of text to synthesize.
                   (Ensure individual strings adhere to API limits like sentence length).
        speaker_idx: The name of the voice to synthesize with (e.g. a Chirp 3 HD voice name).
        output_filename: The name for the final combined audio file (should end with .mp3).

    Raises:
        Exception: If any segment fails to synthesize.
    """
    client = texttospeech.TextToSpeechClient()

    # --- Configuration for the synthesis ---
    # `speaker_idx` is the full voice name, e.g. one of the `th-TH-Chirp3-HD-*` entries in `possible_speaker`
    voice = texttospeech.VoiceSelectionParams(
        language_code="th-TH",
        name=speaker_idx,
    )

    # Select the audio format as MP3
    # Note: For MP3, sample_rate_hertz might be less critical for concatenation
    # compared to LINEAR16, but it's still good practice to set.
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        # sample_rate_hertz=24000, # Optional for MP3, API might handle this
    )
    # --------------------------------------

    combined_audio_segment = AudioSegment.empty() # Start with an empty audio segment

    # print(f"Starting synthesis for {len(text_list)} text segments...")

    for i, text in enumerate(text_list):
        synthesis_input = texttospeech.SynthesisInput(text=text)

        try:
            # print(f"Synthesizing segment {i+1}/{len(text_list)}...")
            response = client.synthesize_speech(
                input=synthesis_input, voice=voice, audio_config=audio_config
            )
            # Get the MP3 audio bytes
            audio_bytes = response.audio_content

            # Load the MP3 segment using pydub from bytes
            audio_segment = AudioSegment.from_file(io.BytesIO(audio_bytes), format="mp3")

            # Append this segment to the combined audio segment
            combined_audio_segment += audio_segment

            # print(f"Successfully synthesized and added segment {i+1}.")

        except Exception as e:
            # Abort on the first failed segment so that an incomplete recording is never exported
            print(f"Error synthesizing segment {i+1}: {text[:50]}... Error: {e}")
            raise Exception(f"Error synthesizing segment {i+1}: {text[:50]}... Error: {e}") from e


    # Export the combined audio segment to a single MP3 file
    try:
        # print(f"Exporting combined audio to {output_filename}...")
        combined_audio_segment.export(output_filename, format="mp3")
        # print(f"Combined audio saved successfully to {output_filename}")
    except Exception as e:
         print(f"Error exporting combined audio file: {e}")

In [None]:
import ast  # ast.literal_eval parses the stringified `lu` list without executing arbitrary code

for _, row in tqdm(remaining_dataset.iterrows(), desc="synthesising speech for remaining sentences", total=remaining_dataset.shape[0]):

  random_speaker = limit_aware_speaker_random_pick(possible_speaker, speech_count, limit_per_speaker)

  sent_id = row["sent_id"]

  savename = f"{sent_id}"
  full_path = os.path.join(savedir, savename)

  # print(f"synthesising speech to {full_path}")

  if os.path.exists(f"{full_path}.mp3") or os.path.exists(f"{full_path}.wav"):
    print(f"skipping {savename} as it already exists")
    continue

  if random_speaker.startswith("th-TH-Chirp3-HD-"):

    try:
      # Synthesise the sentence in four shorter segments and merge them into one file
      synthesize_list_to_single_mp3(text_list=four_segment(ast.literal_eval(row["lu"])), speaker_idx=random_speaker, output_filename=f"{full_path}.mp3")
      speech_count[random_speaker] += 1
      print(f"saved to {full_path}.mp3")
    except Exception as e:
      print(f"{e}")

  elif random_speaker == "lunarlist":

    try:
      save_file = lunarlist.tts(text=" ".join(ast.literal_eval(row["lu"])), filename=f"{full_path}.wav")
      speech_count[random_speaker] += 1
      print(f"saved to {full_path}.wav")

    except Exception as e:
      print(f"{e}")

  else:

    # KhanomTan speakers, handled as in the first synthesis loop so that these rows are not silently skipped
    try:
      khanomtan(text=" ".join(ast.literal_eval(row["lu"])), speaker_idx=random_speaker, file_path=f"{full_path}.wav", verbose=False)
      speech_count[random_speaker] += 1
      print(f"saved to {full_path}.wav")
    except Exception as e:
      print(f"{e}")

synthesising speech for remaining sentences:   0%|          | 0/1333 [00:00<?, ?it/s]

saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_3.mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_4.mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_8.mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_21.mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_48.mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_51.mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_71.mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_73.mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_96.mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_113.mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_115.mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_

In [None]:
len(os.listdir(savedir))

9625

## **Finishing touch**

To wrap up, we convert the remaining `.wav` files to `.mp3` so that the whole dataset uses a single audio format.

In [None]:
for f in tqdm(os.listdir(savedir), desc="converting to mp3"):
  if f.endswith(".wav"):
    print(f"converting {f} to mp3")
    wav_path = os.path.join(savedir, f)
    mp3_path = os.path.join(savedir, f[:-4] + ".mp3")
    try:
      sound = AudioSegment.from_wav(wav_path)
      sound.export(mp3_path, format="mp3")
      os.remove(wav_path)
      print(f"saved to {mp3_path}")
    except Exception as e:
      print(f"{e}")

converting to mp3:   0%|          | 0/9625 [00:00<?, ?it/s]

converting lu_357.wav to mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_357.mp3
converting lu_461.wav to mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_461.mp3
converting lu_486.wav to mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_486.mp3
converting lu_648.wav to mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_648.mp3
converting lu_782.wav to mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_782.mp3
converting lu_875.wav to mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_875.mp3
converting lu_1223.wav to mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_1223.mp3
converting lu_1383.wav to mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech/lu_1383.mp3
converting lu_1814.wav to mp3
saved to /content/drive/MyDrive/Colab Notebooks/NLP202

In [None]:
# get size of the directory
total_size = 0
for dirpath, dirnames, filenames in tqdm(os.walk(savedir)):
    for f in filenames:
        fp = os.path.join(dirpath, f)
        file_size = os.path.getsize(fp)
        file_kb = file_size / 1024
        print(f"{f} : {file_kb} kB")
        total_size += file_size

size_gb = total_size / (1024 * 1024 * 1024)
print(f"Size of the directory: {size_gb:.2f} GB")

0it [00:00, ?it/s]

Streaming output truncated to the last 5000 lines.
lu_5086.mp3 : 87.84375 kB
lu_5088.mp3 : 65.0625 kB
lu_5089.mp3 : 16.03125 kB
lu_5091.mp3 : 28.3125 kB
lu_5092.mp3 : 111.0 kB
lu_5093.mp3 : 43.5 kB
lu_5096.mp3 : 87.75 kB
lu_5097.mp3 : 52.21875 kB
lu_5100.mp3 : 51.9375 kB
lu_5103.mp3 : 25.21875 kB
lu_5104.mp3 : 69.0 kB
lu_5105.mp3 : 25.6875 kB
lu_5106.mp3 : 114.28125 kB
lu_5108.mp3 : 91.3125 kB
lu_5110.mp3 : 36.9375 kB
lu_5111.mp3 : 50.25 kB
lu_5113.mp3 : 59.90625 kB
lu_5114.mp3 : 51.0 kB
lu_5116.mp3 : 55.96875 kB
lu_5117.mp3 : 45.84375 kB
lu_5120.mp3 : 101.625 kB
lu_5121.mp3 : 67.875 kB
lu_5122.mp3 : 62.0625 kB
lu_5123.mp3 : 55.03125 kB
lu_5124.mp3 : 57.75 kB
lu_5127.mp3 : 43.78125 kB
lu_5128.mp3 : 88.5 kB
lu_5130.mp3 : 51.46875 kB
lu_5132.mp3 : 39.9375 kB
lu_5134.mp3 : 44.90625 kB
lu_5135.mp3 : 35.8125 kB
lu_5136.mp3 : 15.375 kB
lu_5139.mp3 : 34.125 kB
lu_5140.mp3 : 28.21875 kB
lu_5142.mp3 : 73.03125 kB
lu_5143.mp3 : 26.34375 kB
lu_5145.mp3 : 16.125 kB
lu_5146.mp3 : 49.3

In [None]:
# Define the directory to zip and the desired name for the zip file (without .zip extension)
directory_to_zip = savedir # Replace with the path to your directory
output_zip_name = 'Lu_Speech' # Desired name for the output zip file (without .zip)

# Create the zip archive
# The first argument is the base name of the archive
# The second argument is the archive format ('zip')
# The third argument is the root directory to start archiving from
shutil.make_archive(output_zip_name, 'zip', directory_to_zip)

print(f'Directory "{directory_to_zip}" zipped successfully to "{output_zip_name}.zip"')

# Optional: Download the zip file
files.download(f'{output_zip_name}.zip')

Directory "/content/drive/MyDrive/Colab Notebooks/NLP2025/LuBERTa/Lu_Speech" zipped successfully to "Lu_Speech.zip"

