In this notebook we show some methods that can be applied to a HuggingFace dataset to perform different data transformations.

* Load your own custom dataset using the HuggingFace load_dataset (as a HuggingFace DatasetDict)

In [None]:
from datasets import load_dataset
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

* Load a dataset from the HuggingFace dataset Hub

In [1]:
from datasets import load_dataset

squad = load_dataset("squad", split='train')
squad

Reusing dataset squad (C:\Users\loriz\.cache\huggingface\datasets\squad\plain_text\1.0.0\d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

* Use the features attribute to get an overview of the columns and their types.

In [64]:
squad.features

{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}

In [70]:
squad.features['answers'].feature

{'text': Value(dtype='string', id=None),
 'answer_start': Value(dtype='int32', id=None)}

* Using the shuffle method we can shuffle the dataset. Setting a seed value makes the result reproducible.

In [6]:
squad_shuffled = squad.shuffle(seed=666)
squad_shuffled[0]

{'id': '5727cc873acd2414000deca9',
 'title': 'Oklahoma',
 'context': 'Oklahoma is the 20th largest state in the United States, covering an area of 69,898 square miles (181,035 km2), with 68,667 square miles (177847 km2) of land and 1,281 square miles (3,188 km2) of water. It is one of six states on the Frontier Strip and lies partly in the Great Plains near the geographical center of the 48 contiguous states. It is bounded on the east by Arkansas and Missouri, on the north by Kansas, on the northwest by Colorado, on the far west by New Mexico, and on the south and near-west by Texas.',
 'question': 'Where does Oklahoma rank by land area?',
 'answers': {'text': ['20th'], 'answer_start': [16]}}

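* We can split the data into train and test sets; with shuffle=True the examples are assigned to the splits at random. Older versions of this method have no parameter for a stratified split, in which case the train_test_split function of scikit-learn can be used instead. Recent versions of datasets accept a stratify_by_column argument for columns of type ClassLabel.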

In [7]:
dataset = squad.train_test_split(test_size=0.1, shuffle=True, seed=10)
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 78839
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 8760
    })
})

* Using the select method we can pick particular examples/rows from the dataset by index.

In [8]:
# select examples with these indices
indices = [0, 10, 20, 40, 80]
examples = squad.select(indices)
examples

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 5
})

In [9]:
# select first 2 examples
examples = squad.select(range(0,2))
examples

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 2
})

* Usually we combine shuffle with select to draw a random sample of examples from the dataset.

In [10]:
sample = squad.shuffle().select(range(5))
sample

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 5
})

* Using the filter method we can select the examples that satisfy a condition. It applies a function that we define to each example of each split and keeps the examples for which it returns True.

In [11]:
squad_filtered = squad.filter(lambda x : x["title"].startswith("L"))
squad_filtered[0]

  0%|          | 0/88 [00:00<?, ?ba/s]

{'id': '56de0fef4396321400ee2583',
 'title': 'Lighting',
 'context': 'Lighting or illumination is the deliberate use of light to achieve a practical or aesthetic effect. Lighting includes the use of both artificial light sources like lamps and light fixtures, as well as natural illumination by capturing daylight. Daylighting (using windows, skylights, or light shelves) is sometimes used as the main source of light during daytime in buildings. This can save energy in place of using artificial lighting, which represents a major component of energy consumption in buildings. Proper lighting can enhance task performance, improve the appearance of an area, or have positive psychological effects on occupants.',
 'question': 'What is used a main source of light for a building during the day?',
 'answers': {'text': ['Daylighting'], 'answer_start': [245]}}

* The rename_column method can be used to give a column a new name.

In [14]:
new_squad = squad.rename_column("context", "passages")
new_squad

Dataset({
    features: ['id', 'title', 'passages', 'question', 'answers'],
    num_rows: 87599
})

* With the remove_columns method we can remove specific columns.

In [17]:
n_squad = squad.remove_columns(["id", "title"])
n_squad

Dataset({
    features: ['context', 'question', 'answers'],
    num_rows: 87599
})

* We may have nested columns in our data (nested dictionaries). Using flatten we can convert each nested field into its own top-level column, named parent.child.

In [20]:
squad

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

In [19]:
squad['answers'][:3]

[{'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]},
 {'text': ['a copper statue of Christ'], 'answer_start': [188]},
 {'text': ['the Main Building'], 'answer_start': [279]}]

In [21]:
fl_squad = squad.flatten()
fl_squad

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
    num_rows: 87599
})

* Using the map method we can apply a function (built in or our own) to each example/row of each split (train/test/validation). The function returns a dictionary for each example; its keys become new columns in the dataset, or overwrite existing columns that have the same name.

In [22]:
def lowercase_title(example):
    return {"title": example["title"].lower()}

squad_lowercase = squad.map(lowercase_title)
# Peek at random sample
squad_lowercase.shuffle(seed=42)["title"][:5]

  0%|          | 0/87599 [00:00<?, ?ex/s]

['egypt',
 'ann_arbor,_michigan',
 'rule_of_law',
 'samurai',
 'group_(mathematics)']

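* With batched=True, map passes batches of examples (batch_size examples at a time) to the function, so the function receives a dictionary of lists instead of a single example. This is usually much faster, especially with fast tokenizers. Running on several processes in parallel is controlled separately by the num_proc parameter.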

In [23]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_title(example):
    return tokenizer(example["title"])

tokenized_data = squad.map(tokenize_title, batched=True, batch_size=500)
tokenized_data

  0%|          | 0/176 [00:00<?, ?ba/s]

Dataset({
    features: ['answers', 'attention_mask', 'context', 'id', 'input_ids', 'question', 'title'],
    num_rows: 87599
})

* We can convert a HuggingFace Dataset into a pandas DataFrame to perform more advanced operations and visualization. The set_format('pandas') method changes what indexing the dataset returns, while the to_pandas method returns the whole dataset as a DataFrame.

In [39]:
from datasets import load_dataset

squad = load_dataset("squad", split='train')
squad

Reusing dataset squad (C:\Users\loriz\.cache\huggingface\datasets\squad\plain_text\1.0.0\d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

In [40]:
# Convert the output format to pandas.DataFrame
squad.set_format("pandas")
squad[0]

Unnamed: 0,id,title,context,question,answers
0,5733be284776f41900661182,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,"{'text': ['Saint Bernadette Soubirous'], 'answ..."


In [41]:
squad.__getitem__(0)

squad.set_format("pandas")

squad.__getitem__(0)

Unnamed: 0,id,title,context,question,answers
0,5733be284776f41900661182,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,"{'text': ['Saint Bernadette Soubirous'], 'answ..."


In [42]:
df = squad.to_pandas()
df.head()

Unnamed: 0,id,title,context,question,answers
0,5733be284776f41900661182,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,"{'text': ['Saint Bernadette Soubirous'], 'answ..."
1,5733be284776f4190066117f,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,"{'text': ['a copper statue of Christ'], 'answe..."
2,5733be284776f41900661180,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,"{'text': ['the Main Building'], 'answer_start'..."
3,5733be284776f41900661181,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,{'text': ['a Marian place of prayer and reflec...
4,5733be284776f4190066117e,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,{'text': ['a golden statue of the Virgin Mary'...


Now we can perform different operations using pandas on our pandas dataframe.

In [44]:
# How many examples does each article title have?
df['title'].value_counts()

New_York_City            817
American_Idol            802
Beyoncé                  758
Frédéric_Chopin          697
Queen_Victoria           680
                        ... 
Great_Plains              47
Tristan_da_Cunha          44
Pitch_(music)             36
Matter                    24
Myocardial_infarction     22
Name: title, Length: 442, dtype: int64

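* But we should make sure to switch back to the default Dataset format before tokenizing the data, since the tokenizer expects plain Python values. Otherwise the function passed to map may receive pandas objects, which can cause errors depending on the library version (the run below happened to succeed).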

In [47]:
from transformers import AutoTokenizer

# Load a pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Tokenize the `title` column
squad.map(lambda x : tokenizer(x["title"]))

  0%|          | 0/87599 [00:00<?, ?ex/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 87599
})

* To switch the output format from pandas back to the default we use the reset_format method.

In [48]:
# Reset back to Arrow format
squad.reset_format()
# Now we can tokenize!
squad.map(lambda x : tokenizer(x["title"]))

  0%|          | 0/87599 [00:00<?, ?ex/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 87599
})

* Using Dataset.from_pandas we can convert a pandas DataFrame into a HuggingFace Dataset.
* Using datasets.DatasetDict({"train": train_dataset, "validation": val_dataset}) we can build a DatasetDict from Dataset objects.

In [59]:
import pandas as pd
from datasets import Dataset

train_df = pd.DataFrame({"a": [1, 2, 3]})
train_dataset = Dataset.from_pandas(train_df)
train_dataset

Dataset({
    features: ['a'],
    num_rows: 3
})

In [60]:
val_df = pd.DataFrame({"a": [4, 5, 6]})
val_dataset = Dataset.from_pandas(val_df)
val_dataset

Dataset({
    features: ['a'],
    num_rows: 3
})

In [62]:
import datasets

dataset_dict = datasets.DatasetDict({"train":train_dataset,"validation":val_dataset})
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['a'],
        num_rows: 3
    })
    validation: Dataset({
        features: ['a'],
        num_rows: 3
    })
})

* When we download a dataset, it is stored locally in a cache directory to avoid downloading it again. Using cache_files we can get the file paths for each split. The .arrow extension shows that the data is stored as Arrow tables.

In [49]:
from datasets import load_dataset

raw_datasets = load_dataset("allocine")
raw_datasets.cache_files

Downloading:   0%|          | 0.00/1.22k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/866 [00:00<?, ?B/s]

Downloading and preparing dataset allocine_dataset/allocine (download: 63.54 MiB, generated: 109.12 MiB, post-processed: Unknown size, total: 172.66 MiB) to C:\Users\loriz\.cache\huggingface\datasets\allocine_dataset\allocine\1.0.0\91f700d606838c22c5c370846746e60503219d0c1f16ed96bfd1fa19a73458eb...


Downloading:   0%|          | 0.00/66.6M [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset allocine_dataset downloaded and prepared to C:\Users\loriz\.cache\huggingface\datasets\allocine_dataset\allocine\1.0.0\91f700d606838c22c5c370846746e60503219d0c1f16ed96bfd1fa19a73458eb. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

{'train': [{'filename': 'C:\\Users\\loriz\\.cache\\huggingface\\datasets\\allocine_dataset\\allocine\\1.0.0\\91f700d606838c22c5c370846746e60503219d0c1f16ed96bfd1fa19a73458eb\\allocine_dataset-train.arrow'}],
 'validation': [{'filename': 'C:\\Users\\loriz\\.cache\\huggingface\\datasets\\allocine_dataset\\allocine\\1.0.0\\91f700d606838c22c5c370846746e60503219d0c1f16ed96bfd1fa19a73458eb\\allocine_dataset-validation.arrow'}],
 'test': [{'filename': 'C:\\Users\\loriz\\.cache\\huggingface\\datasets\\allocine_dataset\\allocine\\1.0.0\\91f700d606838c22c5c370846746e60503219d0c1f16ed96bfd1fa19a73458eb\\allocine_dataset-test.arrow'}]}

* If the dataset is small we can save it in JSON or CSV format. If it is huge we should save it in Arrow or Parquet format. Arrow is a good choice if we plan to reload the data in the near future, while Parquet is designed for long-term storage.

* When we save in Arrow format we use the save_to_disk method. Each split and its metadata are stored in a separate directory. We should specify the directory name.
* When we reload the data, we use the load_from_disk function.

In [50]:
raw_datasets.save_to_disk("my-arrow-datasets")

In [51]:
from datasets import load_from_disk

arrow_datasets_reloaded = load_from_disk("my-arrow-datasets")
arrow_datasets_reloaded

DatasetDict({
    train: Dataset({
        features: ['review', 'label'],
        num_rows: 160000
    })
    validation: Dataset({
        features: ['review', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['review', 'label'],
        num_rows: 20000
    })
})

* If we want to store the data in CSV format, we use the to_csv method. We loop over the splits and save each one as its own CSV file.

In [52]:
for split, dataset in raw_datasets.items():
    dataset.to_csv(f"my-dataset-{split}.csv", index=None)

Creating CSV from Arrow format:   0%|          | 0/16 [00:00<?, ?ba/s]

Creating CSV from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Creating CSV from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

In [53]:
# load the saved csv data 
data_files = {
    "train": "my-dataset-train.csv",
    "validation": "my-dataset-validation.csv",
    "test": "my-dataset-test.csv",
}

csv_datasets_reloaded = load_dataset("csv", data_files=data_files)
csv_datasets_reloaded

Using custom data configuration default-5139732793baa2f3


Downloading and preparing dataset csv/default to C:\Users\loriz\.cache\huggingface\datasets\csv\default-5139732793baa2f3\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e...


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to C:\Users\loriz\.cache\huggingface\datasets\csv\default-5139732793baa2f3\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['review', 'label'],
        num_rows: 160000
    })
    validation: Dataset({
        features: ['review', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['review', 'label'],
        num_rows: 20000
    })
})

* To save data in JSON or Parquet format, we loop over the splits and save each one to its own file.

In [54]:
# Save in JSON Lines format
for split, dataset in raw_datasets.items():
    dataset.to_json(f"my-dataset-{split}.jsonl")

# Save in Parquet format
for split, dataset in raw_datasets.items():
    dataset.to_parquet(f"my-dataset-{split}.parquet")

Creating json from Arrow format:   0%|          | 0/16 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

In [55]:
# Reload the saved data

json_data_files = {
    "train": "my-dataset-train.jsonl",
    "validation": "my-dataset-validation.jsonl",
    "test": "my-dataset-test.jsonl",
}

parquet_data_files = {
    "train": "my-dataset-train.parquet",
    "validation": "my-dataset-validation.parquet",
    "test": "my-dataset-test.parquet",
}

# Reload with the `json` script
json_datasets_reloaded = load_dataset("json", data_files=json_data_files)
# Reload with the `parquet` script
parquet_datasets_reloaded = load_dataset("parquet", data_files=parquet_data_files)

Using custom data configuration default-c03988952e6b6818


Downloading and preparing dataset json/default to C:\Users\loriz\.cache\huggingface\datasets\json\default-c03988952e6b6818\0.0.0\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde...


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Dataset json downloaded and prepared to C:\Users\loriz\.cache\huggingface\datasets\json\default-c03988952e6b6818\0.0.0\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

Using custom data configuration default-f1d8613dd327676c


Downloading and preparing dataset parquet/default to C:\Users\loriz\.cache\huggingface\datasets\parquet\default-f1d8613dd327676c\0.0.0\1638526fd0e8d960534e2155dc54fdff8dce73851f21f031d2fb9c2cf757c121...


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Dataset parquet downloaded and prepared to C:\Users\loriz\.cache\huggingface\datasets\parquet\default-f1d8613dd327676c\0.0.0\1638526fd0e8d960534e2155dc54fdff8dce73851f21f031d2fb9c2cf757c121. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [56]:
json_datasets_reloaded

DatasetDict({
    train: Dataset({
        features: ['review', 'label'],
        num_rows: 160000
    })
    validation: Dataset({
        features: ['review', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['review', 'label'],
        num_rows: 20000
    })
})

In [57]:
parquet_datasets_reloaded

DatasetDict({
    train: Dataset({
        features: ['review', 'label'],
        num_rows: 160000
    })
    validation: Dataset({
        features: ['review', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['review', 'label'],
        num_rows: 20000
    })
})