## Prerequisites

- Install the libraries

In [1]:
!pip install datasets==2.19.0
!pip install transformers==4.40.1



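## Creating a dataset

- Creating a Dataset object
    * The constructor depends on the type of the source object; the example below uses a dict
        * dict → Dataset.from_dict
        * list → Dataset.from_list
        * json → Dataset.from_json

- Creating a DatasetDict object
    * Groups several datasets in a dictionary-like object, keyed by split name
    * e.g., DatasetDict({"train": t_dataset})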
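
As a rough illustration of the input shapes listed above, the sketch below converts row-oriented records (the shape `Dataset.from_list` takes) into column-oriented data (the shape `Dataset.from_dict` takes). The helpers `rows_to_columns` and `build_dataset_dict` and the sample records are hypothetical and not part of the original notebook. `build_dataset_dict` imports `datasets` lazily so that the pure-Python helper can be run on its own.

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Turn a list of row dicts into a dict of column lists.

    Assumes every row has the same keys as the first row.
    """
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


def build_dataset_dict(train_rows, test_rows):
    """Wrap two lists of rows into a DatasetDict with train/test splits."""
    # Imported here so the pure-Python helper above works without `datasets`.
    from datasets import Dataset, DatasetDict

    return DatasetDict({
        # Column-oriented input goes through from_dict ...
        "train": Dataset.from_dict(rows_to_columns(train_rows)),
        # ... while row-oriented input can be passed to from_list directly.
        "test": Dataset.from_list(test_rows),
    })


# Hypothetical sample records, used only for illustration.
train_rows = [{"text": "good", "label": 1}, {"text": "bad", "label": 0}]
print(rows_to_columns(train_rows))
# {'text': ['good', 'bad'], 'label': [1, 0]}
```

Both constructors describe the same table; which one is more convenient depends on whether the data arrives as records or as columns.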

In [2]:
from datasets import Dataset

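# Create the Dataset object (train)
# Note: each value is a string, which from_dict reads as a sequence of characters (one row per character)
sample_train = {'first': '1', 'second': '2', 'third': '3'}
sample_train_dataset = Dataset.from_dict(sample_train)

# Check the type and contents of the dataset
print(type(sample_train_dataset))
print(sample_train_dataset)

# Create the Dataset object (test)
# The two-character strings produce two rows each, hence num_rows: 2 in the output below
sample_test = {'first': '11', 'second': '22', 'third': '33'}
sample_test_dataset = Dataset.from_dict(sample_test)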


<class 'datasets.arrow_dataset.Dataset'>
Dataset({
    features: ['first', 'second', 'third'],
    num_rows: 1
})


In [3]:
from datasets import DatasetDict

# Create the DatasetDict object
sample_datasetDict = DatasetDict({"train": sample_train_dataset, "test": sample_test_dataset})

print(type(sample_datasetDict))
print(sample_datasetDict)

<class 'datasets.dataset_dict.DatasetDict'>
DatasetDict({
    train: Dataset({
        features: ['first', 'second', 'third'],
        num_rows: 1
    })
    test: Dataset({
        features: ['first', 'second', 'third'],
        num_rows: 2
    })
})
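
The `num_rows: 2` reported for the test split appears to follow from how the sample dicts are written: the values are bare strings, and a Python string is a sequence of characters, so `'11'` is read as two one-character rows. The sketch below shows the effect in plain Python. The helper `as_columns` is hypothetical and simply wraps scalar values in lists so that each key yields exactly one row.

```python
def as_columns(record: dict) -> dict:
    """Wrap every non-list value in a one-element list (one row per record)."""
    return {key: value if isinstance(value, list) else [value]
            for key, value in record.items()}


sample_test = {'first': '11', 'second': '22', 'third': '33'}

# A string is a sequence of characters, so its length is the character count.
print(len(sample_test['first']))   # 2
print(list(sample_test['first']))  # ['1', '1']

# After wrapping, every column holds exactly one value.
fixed = as_columns(sample_test)
print(fixed['first'])              # ['11']
print(len(fixed['first']))         # 1
```

Passing `as_columns(sample_test)` to `Dataset.from_dict` should then give a single row holding `'11'` instead of two rows holding `'1'`.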


## Uploading the dataset to Hugging Face

- Log in to Hugging Face (create a token and log in with it, [token page](https://huggingface.co/settings/tokens))
- Upload the dataset to Hugging Face
- Check that the upload succeeded with load_dataset
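
The upload and verification steps above can be sketched as one function. This is a minimal sketch, assuming the token is stored in an environment variable named `HF_TOKEN` instead of being pasted into the notebook. `read_token` and `upload_and_verify` are hypothetical helpers, and `repo_id` stands for a repository you have write access to.

```python
import os


def read_token(env=os.environ, name: str = "HF_TOKEN") -> str:
    """Return the Hub token from the environment, or raise if it looks wrong."""
    token = env.get(name, "")
    # Tokens start with "hf_"; anything else is likely a paste error.
    if not token.startswith("hf_") or len(token) <= 3:
        raise ValueError(f"{name} is missing or does not look like a Hugging Face token")
    return token


def upload_and_verify(dataset_dict, repo_id: str):
    """Push a DatasetDict to the Hub, then reload it to confirm the upload."""
    # Imported lazily so read_token can be used without `datasets` installed.
    from datasets import load_dataset

    token = read_token()
    dataset_dict.push_to_hub(repo_id, token=token)
    # The token is also passed on reload in case the repository is private.
    return load_dataset(repo_id, token=token)
```

With the token exported beforehand, a call such as `upload_and_verify(sample_datasetDict, 'giliit/upload_dataset')` would cover the upload and the check in one step, and the token never appears in the notebook source.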

In [4]:
# Enter your Write token, which starts with hf_ (Hugging Face login)
!huggingface-cli login --token hf_

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [5]:
# Enter the target repo and your Write token (both are required)
sample_datasetDict.push_to_hub('giliit/upload_dataset', token="hf_")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/giliit/upload_dataset/commit/747001120bc8fb7e451771893f8962c562bb3595', commit_message='Upload dataset', commit_description='', oid='747001120bc8fb7e451771893f8962c562bb3595', pr_url=None, pr_revision=None, pr_num=None)

In [6]:
# Verify the uploaded dataset
from datasets import load_dataset

dataset = load_dataset('giliit/upload_dataset')

print(dataset)

Downloading readme:   0%|          | 0.00/419 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['first', 'second', 'third'],
        num_rows: 1
    })
    test: Dataset({
        features: ['first', 'second', 'third'],
        num_rows: 2
    })
})
