# Uploading the Dataset to and Loading It from Hugging Face

This notebook performs both tasks. First, we focus on loading our CSV file and uploading it to Hugging Face as a dataset.

Afterwards, we examine how to load the data back into our working directory for training.

In [1]:
# IMPORTS
from datasets import load_dataset, Dataset, DatasetDict

## Upload CSV dataset to Hugging Face

### Load the CSV dataset into an HF Dataset object

We can view the data in tabular form through the object's methods.

In [2]:
csv_path = "datasets/gpt4_dataset.csv"
dataset = load_dataset("csv", data_files=csv_path)

Generating train split: 0 examples [00:00, ? examples/s]

In [16]:
# iterate over rows directly instead of re-reading the whole column on every step
for i, row in enumerate(dataset["train"]):
    print(f"Entry {i}:")
    print(f"chat_history: {row['chat_history']}\n")


Entry 0:

chat_history: Meeting at 3 PM.
Entry 1:

chat_history: The report needs final review.
Entry 2:

chat_history: Flight leaves at 8 AM.
Entry 3:

chat_history: We need to be there at 12 AM.
Entry 4:

chat_history: The project deadline is next Friday.
Entry 5:

chat_history: Client call rescheduled to 2 PM.
Entry 6:

chat_history: Tasks: Submit report, update spreadsheet.
Entry 7:

chat_history: Discussed budget updates.
Entry 8:

chat_history: I Had a quick chat with Sarah.
Entry 9:

chat_history: Finalized event schedule.
Entry 10:

chat_history: Finished the quarterly report. Sent it to the team for review. Also drafted the email for stakeholders.
Entry 11:

chat_history: The client presentation is scheduled for next Monday. We still need to finalize the slides and confirm attendance.
Entry 12:

chat_history: Team meeting covered progress updates. Discussed blockers and next steps. The new roadmap was introduced as well.
Entry 13:

chat_history: Spoke with Jake about the upcom

### Splitting the dataset (optional)

In [None]:
# split csv dataset into train and temp (80% train, 20% temp)
train_test_split = dataset["train"].train_test_split(test_size=0.2, seed=42)

# Further split temp into validation and test (50% each → 10% of total dataset each)
valid_test_split = train_test_split["test"].train_test_split(test_size=0.5, seed=42)

# Create the final DatasetDict
dataset = DatasetDict({
    "train": train_test_split["train"],
    "valid": valid_test_split["train"],
    "test": valid_test_split["test"]
})

# Verify the splits
dataset
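The resulting split sizes can be estimated before running the cell. The sketch below is plain Python and does not call the `datasets` library. It assumes each held-out portion is rounded up to a whole example, which should be checked against the library's documentation. The function name `expected_split_sizes` and its parameters are hypothetical, introduced only for illustration.

```python
import math


def expected_split_sizes(n_examples, holdout_fraction=0.2, test_fraction=0.5):
    """Estimate train/valid/test sizes for the two-stage split.

    Assumption: each held-out portion is rounded up to a whole example.
    """
    # First stage: hold out a fraction of all examples (valid + test combined).
    n_holdout = math.ceil(holdout_fraction * n_examples)
    n_train = n_examples - n_holdout
    # Second stage: divide the held-out examples between test and validation.
    n_test = math.ceil(test_fraction * n_holdout)
    n_valid = n_holdout - n_test
    return {"train": n_train, "valid": n_valid, "test": n_test}


sizes = expected_split_sizes(100)
print(sizes)  # {'train': 80, 'valid': 10, 'test': 10}
```

For small datasets the validation and test splits contain only a handful of rows, so it is worth comparing this estimate with the `num_rows` values shown when the `DatasetDict` is displayed.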

### Upload to HF

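`load_dataset` already returns an HF `DatasetDict`, so no conversion is needed: `push_to_hub` uploads it to the HF Hub directly. Pushing requires being authenticated with a Hugging Face token that has write access.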

In [None]:
# destination
hf_username = ""  # set this to your Hugging Face username or organization before pushing
dataset_name = "text-edit-dataset-gpt4"

# upload to hub
dataset.push_to_hub(f"{hf_username}/{dataset_name}")
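If `hf_username` is left empty, the f-string above evaluates to `/text-edit-dataset-gpt4`, which lacks a namespace. One way to catch this before any network call is a small guard such as the sketch below. `build_repo_id` is a hypothetical helper, not part of the `datasets` library, and `your_hf_username` is a placeholder.

```python
def build_repo_id(username, dataset_name):
    """Join a namespace and a dataset name into a '<namespace>/<name>' repo ID.

    Hypothetical helper for illustration; it only checks for empty parts.
    """
    username = username.strip()
    dataset_name = dataset_name.strip()
    if not username or not dataset_name:
        raise ValueError("Both the username and the dataset name must be non-empty.")
    return f"{username}/{dataset_name}"


repo_id = build_repo_id("your_hf_username", "text-edit-dataset-gpt4")
print(repo_id)  # your_hf_username/text-edit-dataset-gpt4
```

The returned string can then be passed to `dataset.push_to_hub(repo_id)` in place of the inline f-string.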

## Load the dataset from HF

The loaded dataset can be used for training, or modified and pushed again to overwrite the copy that already exists on the HF Hub.

In [None]:
# load dataset from HF Hub (replace the placeholder with your own "<username>/<dataset_name>" repo ID)
dataset = load_dataset("your_hf_username/text_editing_dataset")

In [None]:
# split into train (80%) and a held-out remainder (20%)
train_rest = dataset["train"].train_test_split(test_size=0.2)

# split the remainder evenly into validation and test (10% of the data each)
valid_test = train_rest["test"].train_test_split(test_size=0.5)

# collect the three splits in a plain Python dict
dataset_dict = {
    "train": train_rest["train"],
    "valid": valid_test["train"],
    "test": valid_test["test"],
}

In [None]:
# convert to DatasetDict object (suitable for training)
dataset = DatasetDict(dataset_dict)