# Creating your Dataset and loading to Hugging Face 🤗

Supported formats
- CSV
- JSON
- JSON lines
- text
- Parquet
- Compressed files (GZ, BZ2, LZ4, LZMA or ZSTD)
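
As a rough sketch of how these formats map onto the builder names that `load_dataset` accepts (`"csv"`, `"json"`, `"text"`, `"parquet"`), the helper below picks a name from a file's extension. The function `builder_name` and both extension tables are hypothetical and only illustrate the mapping; the compressed-file extensions in particular are an assumption.

```python
from pathlib import Path

# Hypothetical lookup tables, for illustration only.
BUILDER_BY_EXTENSION = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",  # JSON Lines files use the same "json" builder
    ".txt": "text",
    ".parquet": "parquet",
}
# Assumed extensions for GZ, BZ2, LZ4, LZMA and ZSTD archives.
COMPRESSION_EXTENSIONS = {".gz", ".bz2", ".lz4", ".xz", ".zst"}


def builder_name(filename):
    """Return the builder name to pass to load_dataset for a data file."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Ignore a trailing compression suffix such as ".gz".
    if suffixes and suffixes[-1] in COMPRESSION_EXTENSIONS:
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in BUILDER_BY_EXTENSION:
        raise ValueError(f"No known builder for {filename!r}")
    return BUILDER_BY_EXTENSION[suffixes[-1]]


names = [builder_name(n) for n in ("ds_salaries.csv", "train.jsonl.gz", "corpus.txt")]
```

The returned name would then be used as the first argument, for example `load_dataset("csv", data_files="ds_salaries.csv")`.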

## [HfApi 🌐](https://huggingface.co/docs/huggingface_hub/guides/overview)

```bash
pip install --upgrade huggingface_hub
```

#### Git vs HTTP paradigm
[**Git**](https://huggingface.co/docs/datasets/v2.13.0/share)

Maintain a local copy of the entire repository on your machine. Useful when:
 - You are training a model on your machine and pushing regular updates
 - You need to manually edit large files

**HfApi**

Offers the same functionality as the Git-based approach, but without the need for a local folder. Useful for:
 - Managing repos
 - Downloading files using caching
 - Searching the Hub for repos and metadata
 - Accessing community features 
 - Configuring Spaces hardware and secrets

> **Note:** Each repo with large files (>5GB) needs a custom transfer agent for Git LFS. Install it by running the following inside the local repo:
>```bash
>huggingface-cli lfs-enable-largefiles .
>```

Login (in a notebook, `login()` displays a widget that needs `ipywidgets`):
```bash
pip install ipywidgets
```

In [None]:
from huggingface_hub import login
login()

In [None]:
from huggingface_hub import create_repo
create_repo("Einstellung/demo-salaries", repo_type="dataset")

In [None]:
from huggingface_hub import HfApi

api = HfApi()

In [None]:

api.upload_file(
    path_or_fileobj="dataset/ds_salaries.csv",
    path_in_repo="ds_salaries.csv",
    repo_id="Einstellung/demo-salaries",
    repo_type="dataset",
    commit_message="my first commit in HG"
)

## README.md 📄

- [DataCard Metadata](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1)
- [DataCard Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md)
- [DataCard Template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md?plain=1)
- [Valid Licenses](https://huggingface.co/docs/hub/repositories-licenses)
- [Valid Tasks/Sub-Tasks](https://github.com/huggingface/hub-docs/blob/main/js/src/lib/interfaces/Types.ts)

In [None]:

api.upload_file(
    path_or_fileobj="README.md",
    path_in_repo="README.md",
    repo_id="Einstellung/demo-salaries",
    repo_type="dataset",
)

In [None]:
api.delete_file(path_in_repo='README.md', repo_id='Einstellung/demo-salaries', repo_type='dataset')

### [Folder-Based builder](https://huggingface.co/docs/datasets/create_dataset#folderbased-builders)
- ImageFolder (jpeg, png ...)
- AudioFolder (wav, mp3 ...)

!['builder'](img/folder-based-builder.png)

#### metadata.csv 
The metadata file needs a `file_name` column that links each row to its image or audio file.
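
As a minimal sketch of that layout, the snippet below writes a `metadata.csv` next to the files it describes, using only the standard library. The folder, the file names, the helper `write_metadata`, and the `caption` column are all hypothetical; only the `file_name` column comes from the requirement above.

```python
import csv
import tempfile
from pathlib import Path


def write_metadata(folder, rows):
    """Write metadata.csv in `folder`; every row must carry a file_name key."""
    folder = Path(folder)
    fieldnames = ["file_name", "caption"]  # "caption" is a hypothetical extra column
    # Fail early if a row points at a file that is not in the folder.
    for row in rows:
        if not (folder / row["file_name"]).exists():
            raise FileNotFoundError(f"{row['file_name']} is missing from {folder}")
    path = folder / "metadata.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


# Demo in a throwaway folder with two empty placeholder image files.
demo_dir = Path(tempfile.mkdtemp())
for name in ("0001.png", "0002.png"):
    (demo_dir / name).touch()

metadata_path = write_metadata(
    demo_dir,
    [
        {"file_name": "0001.png", "caption": "first example"},
        {"file_name": "0002.png", "caption": "second example"},
    ],
)
metadata_lines = metadata_path.read_text(encoding="utf-8").splitlines()
```

With real images in place of the placeholders, a folder like this should be loadable with `load_dataset("imagefolder", data_dir=...)`.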

## [Loading Script 🐍](https://huggingface.co/docs/datasets/v2.12.0/en/about_dataset_load#build-and-load)
[Script Template](https://github.com/huggingface/datasets/blob/main/templates/new_dataset_script.py)

!['builder'](img/dataset_classes.png)

| Attribute | Description |
|---|---|
| name | Short name of the dataset |
| version | Dataset version identifier |
| data_dir | Stores the path to a local folder containing the data files |
| data_files | Stores paths to local data files |
| description | Description of the dataset |

!['builder'](img/dataset_builder.png)

- **DatasetBuilder._info** defines the dataset attributes and Features. `dataset.info` returns the information stored here.

- **DatasetBuilder._split_generators** downloads the data files, splits them, and defines the arguments for the generation process.
    -   It receives a DownloadManager, which downloads files or fetches them from your local filesystem.
    -   DownloadManager.download_and_extract() accepts a single URL or path, or a list/dictionary of URLs or paths.

    Each SplitGenerator contains the name of the split and any keyword arguments passed to the DatasetBuilder._generate_examples method. These should include at least the local path to the data.

- **DatasetBuilder._generate_examples** reads and parses the data files for a split, then yields dataset examples in the format specified by the features from DatasetBuilder._info(). Its input is the filepath provided in the keyword arguments by DatasetBuilder._split_generators.
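
The generation step can be sketched without the library. The stand-alone function below does not subclass DatasetBuilder; it only mirrors the shape of a `_generate_examples` body, yielding `(key, example)` pairs one row at a time. The CSV file and its `title` and `year` columns are hypothetical. In a real loading script, `filepath` would arrive through the `gen_kwargs` of a SplitGenerator.

```python
import csv
import tempfile
from pathlib import Path


def generate_examples(filepath):
    """Yield (key, example) pairs lazily, one CSV row at a time."""
    with open(filepath, newline="", encoding="utf-8") as f:
        for idx, row in enumerate(csv.DictReader(f)):
            # Each example should match the features declared in _info().
            yield idx, {"title": row["title"], "year": int(row["year"])}


# Hypothetical two-row file standing in for a downloaded split.
sample = Path(tempfile.mkdtemp()) / "train.csv"
sample.write_text("title,year\nStarry Night,1889\nGuernica,1937\n", encoding="utf-8")

examples = list(generate_examples(sample))
```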

The dataset is generated with a Python generator, which doesn’t load all the data in memory. As a result, the generator can handle large datasets. However, before the generated samples are flushed to the dataset file on disk, they are stored in an ArrowWriter buffer. This means the generated samples are written in batches.

> **Note:** If your samples are large (images, audio), set the DEFAULT_WRITER_BATCH_SIZE attribute low enough that one batch does not exceed 200 MB
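
To make the buffering idea concrete, here is a conceptual sketch of batch-wise writing. It is not the ArrowWriter implementation; the function `write_in_batches` and its `flush` callback are hypothetical stand-ins for "hold samples in a buffer, then write the whole batch".

```python
def write_in_batches(examples, batch_size, flush):
    """Buffer examples from a generator and flush them batch_size at a time."""
    buffer = []
    for example in examples:
        buffer.append(example)
        if len(buffer) >= batch_size:
            flush(buffer)
            buffer = []
    # Flush whatever is left once the generator is exhausted.
    if buffer:
        flush(buffer)


flushed = []
write_in_batches(
    ({"id": i} for i in range(5)),  # generator: only one batch is held in memory
    batch_size=2,
    flush=lambda batch: flushed.append(list(batch)),
)
batch_sizes = [len(batch) for batch in flushed]
```

A smaller batch size means fewer samples are held in memory between flushes, which is why large samples call for a lower value.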

### [Wiki-Art](https://www.kaggle.com/datasets/antoinegruson/-wikiart-all-images-120k-link)

In [None]:
create_repo("Einstellung/wiki_art", repo_type="dataset")

In [None]:
api.upload_file(
    path_or_fileobj="wiki_art.py",
    path_in_repo="wiki_art.py",
    repo_id="Einstellung/wiki_art",
    repo_type="dataset",
)

In [None]:
api.delete_file(path_in_repo='wiki_art.py', repo_id='Einstellung/wiki_art', repo_type='dataset')

[Medellín AI - Meetup](https://linktr.ee/colombia_ai?utm_source=linktree_profile_share&ltsid=4da78a52-278a-45b6-9cd8-3e69c51aa19d)