# Configuring the Environment

Steps

* ensure a working environment
* convert remote `.wav` and `.mp3` files into a dataset with metadata, [ref](https://huggingface.co/docs/datasets/audio_load)
  - create `metadata.csv`
```
samples/
├── README.md
├── loader.py
├── metadata.csv
├── raw/      #.wav, .mp3
└── data/     #.tar.gz
```
* create a loading script
  - per the docs: "Audio datasets are commonly stored in `tar.gz` archives which requires a particular approach to support streaming mode"
* stream examples through the pipeline one at a time, [ref](https://huggingface.co/docs/datasets/stream)
```
>>> from datasets import load_dataset
>>> # avoided because it is slow:
>>> # dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True)
>>> dataset = load_dataset("food101", split="train")
>>> iterable_dataset = dataset.to_iterable_dataset()
>>> print(next(iter(iterable_dataset)))
```
* run the pipeline on the dataset
  - resample the audio to the target sampling rate, [ref](https://huggingface.co/docs/datasets/audio_process)
  - preprocess with `map()`
  - run speaker diarization
  - extract data as a timeline
  - apply classification models
* format each result as an individual `.pdf`
* format as vdi workspace
* ???
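
The first steps above (creating `metadata.csv` and packing audio into a `tar.gz` that can be read as a stream) can be sketched with the standard library alone. This is a minimal sketch, not the loading script itself: the function names and the archive name `audio.tar.gz` are hypothetical, and it assumes the `samples/` layout shown earlier, with the `file_name` column being the one the `datasets` audio folder loader looks for.

```python
import csv
import tarfile
from pathlib import Path

AUDIO_SUFFIXES = {".wav", ".mp3"}


def write_metadata(root: Path) -> Path:
    """Write root/metadata.csv with one row per audio file under root/raw."""
    names = sorted(
        p.relative_to(root).as_posix()
        for p in (root / "raw").rglob("*")
        if p.suffix.lower() in AUDIO_SUFFIXES
    )
    out = root / "metadata.csv"
    with out.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file_name"])  # paths are relative to metadata.csv
        writer.writerows([name] for name in names)
    return out


def pack_archive(root: Path, name: str = "audio.tar.gz") -> Path:
    """Pack root/raw into root/data/<name> (the archive name is hypothetical)."""
    archive = root / "data" / name
    archive.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(root / "raw", arcname="raw")
    return archive


def iter_archive(archive: Path):
    """Yield (member path, bytes) pairs in archive order, without extracting to disk."""
    # "r|gz" opens the archive as a forward-only stream, so members
    # have to be consumed sequentially; this is what makes streaming special.
    with tarfile.open(archive, "r|gz") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()
```

A loading script would do the equivalent of `iter_archive` when streaming, yielding each audio file's path and bytes as it passes through the archive instead of unpacking everything first.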

## Working with Model Files

### Manual steps with internet

With internet access, download the pretrained model and tokenizer and save them to a local directory. You can then zip that directory and download it for later offline use.


In [1]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset


In [2]:
processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

In [3]:
model.config.forced_decoder_ids = None

In [None]:
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

In [8]:
ds[0]

{'file': '/home/vscode/.cache/huggingface/datasets/downloads/extracted/69d899cdf280bc629c9d8609fa18cea800a77bd64686d98ea020dddf62fd77a3/dev_clean/1272/128104/1272-128104-0000.flac',
 'audio': {'path': '/home/vscode/.cache/huggingface/datasets/downloads/extracted/69d899cdf280bc629c9d8609fa18cea800a77bd64686d98ea020dddf62fd77a3/dev_clean/1272/128104/1272-128104-0000.flac',
  'array': array([0.00238037, 0.0020752 , 0.00198364, ..., 0.00042725, 0.00057983,
         0.0010376 ]),
  'sampling_rate': 16000},
 'text': 'MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL',
 'speaker_id': 1272,
 'chapter_id': 128104,
 'id': '1272-128104-0000'}

In [4]:
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 


In [6]:
# generate token ids
predicted_ids = model.generate(input_features)
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
transcription

['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|endoftext|>']

In [7]:
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
transcription

[' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.']

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import warnings, logging
# silence library warnings to keep the notebook output readable
warnings.simplefilter('ignore')
logging.disable(logging.WARNING)

Download and load the model from the Hugging Face Hub

In [None]:
model_nm = 'microsoft/deberta-v3-small'
model = AutoModelForSequenceClassification.from_pretrained(model_nm, return_dict=True)
tokenizer = AutoTokenizer.from_pretrained(model_nm)


Save to a local directory

In [None]:
save_path = 'deberta_v3_small_pretrained_model_pytorch'

In [None]:
!mkdir -p {save_path}

In [None]:
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

('deberta_v3_small_pretrained_model_pytorch/tokenizer_config.json',
 'deberta_v3_small_pretrained_model_pytorch/special_tokens_map.json',
 'deberta_v3_small_pretrained_model_pytorch/spm.model',
 'deberta_v3_small_pretrained_model_pytorch/added_tokens.json',
 'deberta_v3_small_pretrained_model_pytorch/tokenizer.json')

In [None]:
!ls {save_path}

added_tokens.json  special_tokens_map.json  tokenizer_config.json
config.json	   spm.model
pytorch_model.bin  tokenizer.json


Load from the saved path

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(save_path, return_dict=True)


In [None]:
tokenizer = AutoTokenizer.from_pretrained(save_path)
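
Before loading from a saved path on a machine without internet, it can help to confirm that the directory is complete. The sketch below is one possible check, not part of the original notebook: `missing_files` and `load_from_saved_path` are hypothetical helper names, and the expected file names are taken from the listings in this document, so other architectures may ship different tokenizer files. `local_files_only=True` is a `from_pretrained` parameter that prevents any attempt to reach the Hub.

```python
from pathlib import Path
from typing import List

# File names taken from the listings in this document (assumption:
# other model families may use different tokenizer files).
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def missing_files(model_dir: str) -> List[str]:
    """Return the expected files that are absent from model_dir."""
    d = Path(model_dir)
    missing = []
    if not (d / "config.json").is_file():
        missing.append("config.json")
    if not any((d / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    if not (d / "tokenizer_config.json").is_file():
        missing.append("tokenizer_config.json")
    return missing


def load_from_saved_path(model_dir: str):
    """Load model and tokenizer from disk only, failing early if files are missing."""
    problems = missing_files(model_dir)
    if problems:
        raise FileNotFoundError(f"{model_dir} is missing: {', '.join(problems)}")
    # Imported lazily so the file check works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_dir, local_files_only=True
    )
    return model, tokenizer
```

Calling `load_from_saved_path(save_path)` would then replace the two separate `from_pretrained` calls above.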

Add the directory to an archive for local download

In [None]:
!zip -r debertav3_small.zip {save_path}

  adding: deberta_v3_small_pretrained_model_pytorch/ (stored 0%)
  adding: deberta_v3_small_pretrained_model_pytorch/tokenizer.json (deflated 77%)
  adding: deberta_v3_small_pretrained_model_pytorch/spm.model (deflated 50%)
  adding: deberta_v3_small_pretrained_model_pytorch/config.json (deflated 53%)
  adding: deberta_v3_small_pretrained_model_pytorch/special_tokens_map.json (deflated 54%)
  adding: deberta_v3_small_pretrained_model_pytorch/pytorch_model.bin (deflated 42%)
  adding: deberta_v3_small_pretrained_model_pytorch/added_tokens.json (stored 0%)
  adding: deberta_v3_small_pretrained_model_pytorch/tokenizer_config.json (deflated 45%)


In [None]:
# remove the unpacked directory now that it is archived
!rm -rf {save_path}
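
The shell cells above (`!zip -r` and `!rm -rf`) only work where those commands are available. A rough standard-library equivalent is sketched below; the function names are hypothetical and the sketch assumes the model was already written to disk with `save_pretrained`.

```python
import shutil
from pathlib import Path


def archive_model_dir(save_path: str, archive_stem: str) -> Path:
    """Zip save_path, keeping its folder name inside the archive, and return the .zip path."""
    src = Path(save_path).resolve()
    stem = str(Path(archive_stem).resolve())
    # root_dir/base_dir keep the top-level folder in the archive, like `zip -r`.
    zip_path = shutil.make_archive(stem, "zip", root_dir=src.parent, base_dir=src.name)
    return Path(zip_path)


def remove_model_dir(save_path: str) -> None:
    """Delete the unpacked directory once the archive exists."""
    shutil.rmtree(save_path)
```

For example, `archive_model_dir(save_path, "debertav3_small")` would produce `debertav3_small.zip` in the current directory, after which `remove_model_dir(save_path)` cleans up.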

### Manual steps without internet

* Search for a model you like on https://huggingface.co/spaces/huggingface-projects/diffusers-gallery
* The `Files and versions` tab of a Hugging Face repo lists the files required to run the model: weights, tokenizers, configurations, etc.
* In `Files and versions`, look for a file ending in `.ckpt` or `.safetensors` and click the down arrow to download it. Then place it in the `models/Stable-diffusion` folder, just as you would with a download from Civitai
  - The [`Safetensors`](https://github.com/huggingface/safetensors) format is a relatively new serialization format developed by Hugging Face. Its advantages over the `.ckpt` format include:
    + Faster loading times in various ML applications (on both CPU and GPU)
    + Cross-platform compatibility (it is not tied to Python the way pickle is)
    + Safety (it does not use pickle serialization, which can execute arbitrary code when a file is loaded)
  - There is no practical difference between `.ckpt` and `.pth`: both are typically pickle-based PyTorch checkpoints, and the extension is only a naming convention
    + A CKPT file is a checkpoint file created by PyTorch Lightning, a PyTorch research framework. It contains a dump of a PyTorch Lightning machine learning model. Developers create CKPT files to preserve earlier states of a model while training it to its final state.
    + [pytorch lightning](https://github.com/Lightning-AI/pytorch-lightning)
* If the repo has neither a `.safetensors` nor a `.ckpt` file, the model is only available in diffusers format (you can convert it to `ckpt`)
  - This lengthy video covers converting to `ckpt`: https://www.youtube.com/watch?v=-6CA18MS0pY
  - IIRC, ShivamShrirao's DreamBooth Colab also has a section for converting diffusers weights to `ckpt`
* Save the files to the cache (`~/.cache/huggingface/hub`); you can read more about it [here](https://huggingface.co/docs/transformers/main/en/installation#cache-setup)
* The model is now available for loading
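
To illustrate why a `.safetensors` file is safer to inspect than a pickle-based checkpoint, the sketch below reads only the file's header: an 8-byte little-endian length followed by that many bytes of JSON describing each tensor. Nothing is unpickled and no tensor data is loaded. The function names are hypothetical, and this is a minimal sketch rather than a replacement for the `safetensors` library.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as fh:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", fh.read(8))
        return json.loads(fh.read(header_len).decode("utf-8"))


def list_tensors(path):
    """Return (name, dtype, shape) for every tensor described in the header."""
    header = read_safetensors_header(path)
    return [
        (name, info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"  # optional free-form metadata entry
    ]
```

Running `list_tensors` on a downloaded file is a quick way to confirm that the download is a real weights file and to see which layers it contains before placing it in the model folder.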

## References

* [Use .safetensors Model Files In Stable Diffusion WebUI](https://techtactician.com/ckpt-vs-safetensors-in-stable-diffusion/)
* [Convert Stable Diffusion Diffusers (.bin Weights) & Dreambooth Models to CKPT File](https://www.youtube.com/watch?v=-6CA18MS0pY)
* [Discussion board](https://www.reddit.com/r/StableDiffusion/comments/12djqlh/please_help_an_idiot_understand_how_to_download/)
* [Save HuggingFace model to local for no internet](https://www.kaggle.com/code/shravankumar147/save-huggingface-model-to-local-for-no-internet)