## 📚 1. Installing Required Libraries

This notebook focuses on an audio task, so we need to install the necessary libraries to handle audio data in addition to the standard `transformers` package.

- **`transformers`**: The core Hugging Face library.
- **`sentencepiece` & `sacremoses`**: Common tokenization dependencies.
- **`datasets[audio]`**: This is the key install target. It installs the Hugging Face `datasets` library together with its `audio` extra, which pulls in the additional dependencies required for loading and processing audio datasets (for example, audio decoding libraries built on `libsndfile`).

In [None]:
# Upgrade pip first, then install everything in one command.
# The quotes keep shells such as zsh from interpreting the square brackets.
# !pip install --upgrade pip
# !pip install --upgrade transformers sentencepiece sacremoses "datasets[audio]"
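
After installing, it can be useful to confirm which versions ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the package list simply mirrors the libraries used in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as published on PyPI.
for name in ("transformers", "sentencepiece", "sacremoses", "datasets", "soundfile", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Any package reported as `not installed` needs to be added before running the cells below.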

## 📂 2. Setting a Custom Cache Directory (Optional)

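As with other models, Text-to-Speech models can be large. This commented-out block shows how to point the `HF_HOME` environment variable at a custom directory where Hugging Face stores downloaded models, which is useful for managing disk space. Set it before importing `transformers` or `datasets` so that both libraries pick up the new location.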

In [None]:
# import os
# # Raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences.
# new_cache_dir = r"X:\AI-learin\courss\Fine-Tuning-LLM-with-HuggingFace-main\models"
# os.environ['HF_HOME'] = new_cache_dir
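
A portable variant of the same idea can be sketched with `pathlib`. The helper name `use_hf_cache` and the `hf_cache` folder are hypothetical, chosen only for illustration; the one thing taken from the cell above is that the cache location is communicated through `HF_HOME`.

```python
import os
from pathlib import Path


def use_hf_cache(directory):
    """Create the directory if needed and point HF_HOME at it."""
    cache_dir = Path(directory).expanduser().resolve()
    cache_dir.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(cache_dir)
    return cache_dir


# Hypothetical location: a folder named "hf_cache" next to the notebook.
cache_dir = use_hf_cache(Path.cwd() / "hf_cache")
print(os.environ["HF_HOME"])
```

Run this before the import cell; changing the variable after the Hugging Face libraries are already imported may have no effect in the current session.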

## 📦 3. Importing Necessary Libraries

Here, we import all the tools we'll need for this task.

- **`pipeline` from `transformers`**: Our main high-level API for inference.
- **`load_dataset` from `datasets`**: A powerful function to download and use datasets directly from the Hugging Face Hub. We'll use this to get voice data.
- **`soundfile` as `sf`**: A library for reading and writing audio files. We'll use it to save our generated speech as a `.wav` file.
- **`torch`**: The PyTorch library. The underlying models run on PyTorch, and we'll use it to handle the numerical data for the voice embeddings.

In [None]:
from transformers import pipeline
from datasets import load_dataset
import soundfile as sf
import torch

## ✍️ 4. Defining the Input Text

This is the text that we want our model to convert into spoken words. We define it here as a simple string.

In [None]:
text="""Sam Altman on Wednesday returned to OpenAI as the chief executive officer (CEO) and sacked the Board that had fired him last week. However, the only remaining member in the Board team is Adam D'Angelo, CEO of Quora.\nEx-Salesforce co-CEO Bret Taylor and former US Treasury Secretary and president of Harvard University, Larry Summers will join D'Angelo."""

## 🗣️ 5. Performing Text-to-Speech Synthesis

This is the core of the notebook, where we generate speech from our text. The process has four steps.

1.  **Create the Pipeline**: We create a `pipeline` for **`"text-to-speech"`** using the **`"microsoft/speecht5_tts"`** model. The call also passes **`trust_remote_code=True`**, which allows `transformers` to execute custom Python code shipped in a model repository. SpeechT5 is implemented in the `transformers` library itself, so the flag is likely unnecessary here. Because it relaxes a security safeguard rather than adding one, enable it only for repositories you trust.

2.  **Load a Speaker's Voice (Embedding)**: The model alone does not determine what the generated voice sounds like. To control it, we load a dataset of **speaker embeddings** (also known as x-vectors). Think of these as numerical "fingerprints" that represent the unique characteristics of a person's voice.

3.  **Select a Specific Voice**: From the thousands of embeddings in the dataset, we pick one (at index `7306`). We convert this voice fingerprint into a `torch.tensor` and call `.unsqueeze(0)` to add a leading batch dimension, because the model expects a batch of embeddings rather than a single vector.

4.  **Synthesize the Speech**: We call our `synthesiser` with the input `text`. Critically, we pass our chosen `speaker_embedding` to the model using `forward_params`. This tells the model to generate the speech using the characteristics of that specific voice.

In [None]:
model = "microsoft/speecht5_tts"
synthesiser = pipeline("text-to-speech", model=model, trust_remote_code=True)

# Load the x-vector dataset and pick one speaker embedding, adding a batch dimension.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Generate speech for the text using the chosen voice.
speech = synthesiser(text, forward_params={"speaker_embeddings": speaker_embedding})
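
The reshaping in step 3 can be checked in isolation without downloading anything. This is a small sketch that uses a stand-in vector of zeros; the length of 512 is an assumption about the size of the x-vectors in this dataset, not something the cell above states.

```python
import torch

# Stand-in for one x-vector. The length 512 is an assumed embedding size.
xvector = [0.0] * 512

embedding = torch.tensor(xvector)   # 1-D tensor, one value per embedding dimension
batched = embedding.unsqueeze(0)    # adds a leading batch dimension of size 1

print(embedding.shape)
print(batched.shape)
```

The second shape has one extra leading dimension of size 1, which is the batch axis added by `.unsqueeze(0)`.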

## 💾 6. Saving the Generated Audio

The output from the pipeline is a dictionary containing the raw audio data (as a NumPy array) and the correct sampling rate.

We use the `soundfile.write()` function to save this data as a standard `.wav` file. We provide a filename (`"speech.wav"`), the audio data (`speech["audio"]`), and the sampling rate (`speech["sampling_rate"]`). This sampling rate is crucial for ensuring the audio plays back at the correct speed and pitch.

In [None]:

sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
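
To see why the sampling rate matters, recall that playback duration is the number of samples divided by the sampling rate. The sketch below uses made-up numbers (48,000 samples, and 16,000 Hz as an illustrative rate) and a hypothetical helper named `duration_seconds`.

```python
def duration_seconds(num_samples, sampling_rate):
    """Playback length in seconds for a mono signal."""
    return num_samples / sampling_rate


num_samples = 48_000  # illustrative value, not taken from the generated audio

# Saved with the rate the audio was generated at: plays at normal speed.
print(duration_seconds(num_samples, 16_000))  # 3.0

# Saved with double the rate: the same samples play back in half the time, at a higher pitch.
print(duration_seconds(num_samples, 32_000))  # 1.5
```

Passing `speech["sampling_rate"]` to `sf.write` instead of a hard-coded number avoids this mismatch.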