# **Dataset Library**

In this chapter, you'll delve deeper into the capabilities of the 🤗 Datasets library. Here are some of the key topics you'll explore:

1. Handling datasets that are not available on the Hugging Face Hub.
2. Techniques for slicing, dicing, and working with datasets, including using Pandas.
3. Handling large datasets that might overwhelm your system's RAM.
4. Understanding concepts like memory mapping and Apache Arrow.
5. Creating custom datasets and contributing them to the Hugging Face Hub.

Let's embark on this journey to enhance your understanding of 🤗 Datasets!

## What should I do if my dataset isn't available on the Hugging Face Hub?

You've learned how to utilize the Hugging Face Hub to fetch datasets, but there will be instances where you need to work with data stored locally on your laptop or on a remote server. In this section, we'll explore how 🤗 Datasets can be employed to load datasets that aren't accessible on the Hugging Face Hub.

### Working with local and remote datasets

🤗 Datasets simplifies the loading of local and remote datasets by providing loading scripts for various common data formats. Here are examples of loading scripts for different data formats:

- CSV & TSV: `load_dataset("csv", data_files="my_file.csv")`
- Text files: `load_dataset("text", data_files="my_file.txt")`
- JSON & JSON Lines: `load_dataset("json", data_files="my_file.jsonl")`
- Pickled DataFrames: `load_dataset("pandas", data_files="my_dataframe.pkl")`

The above illustrates that for each data format, specifying the type of loading script in the `load_dataset()` function is sufficient. Additionally, the `data_files` argument is used to provide the path to one or more files. Let's begin by loading a dataset from local files, and subsequently, we'll explore how to achieve the same with remote files.

### Loading a local dataset

For this example we’ll use the SQuAD-it dataset, which is a large-scale dataset for question answering in Italian.

The training and test splits are hosted on GitHub, so we can download them using the links below:

https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz

https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

Once you've downloaded them, decompress the files. The archives contain SQuAD_it-train.json and SQuAD_it-test.json, and the data is stored in the JSON format.

Loading a JSON file using the `load_dataset()` function involves specifying whether the dataset is in standard JSON format (resembling a nested dictionary) or JSON Lines format (JSON separated by lines). In datasets like SQuAD-it, the information is stored in a nested structure, often with text contained within a specific field. To load this dataset correctly, we'd specify the `field` argument to accommodate this structure:

In [None]:
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")
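To make the distinction between nested JSON and JSON Lines concrete, here is a minimal plain-Python sketch. It uses only the standard library and a made-up miniature record set (the titles and the empty `paragraphs` lists are placeholders), so it illustrates the idea behind the `field` argument rather than how the library implements it.

```python
import json

# Hypothetical miniature of a SQuAD-style file: metadata at the top level,
# with the actual records nested under the "data" key.
nested_json = json.dumps(
    {
        "version": "1.1",
        "data": [
            {"title": "Terremoto del Sichuan", "paragraphs": []},
            {"title": "Roma", "paragraphs": []},
        ],
    }
)

# Conceptually, field="data" selects the list stored under that key.
records = json.loads(nested_json)["data"]

# The same records as JSON Lines: one JSON object per line, no wrapper object.
json_lines = "\n".join(json.dumps(record) for record in records)
parsed_lines = [json.loads(line) for line in json_lines.splitlines()]

print(len(records), parsed_lines == records)
```

With JSON Lines there is no wrapper object, which is why no `field` argument is needed for that format.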

When loading datasets from local files, the default behavior is to generate a DatasetDict object containing at least one split, typically the train split. To verify this, you can inspect the `squad_it_dataset` object:

In [None]:
squad_it_dataset

The output displays the count of rows along with the column names present in the training set. You can explore individual examples by selecting one from the train split.

In [None]:
squad_it_dataset["train"][0]

This works, but it only loads the training split. What we really want is both the train and test splits in a single `DatasetDict` object. To get that, pass the `data_files` argument a dictionary that maps each split name to its file:

In [None]:
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

Having both splits within a unified object facilitates uniform preprocessing across the entire dataset, ensuring consistency in the applied transformations or cleaning methods.

🤗 Datasets also handles file decompression automatically, so compressed files can be passed directly to the `data_files` argument. This means we could have skipped decompressing the files manually and pointed `data_files` straight at the compressed files:

In [None]:
data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

Whether it's ZIP, TAR, or other common compression formats, 🤗 Datasets conveniently handles the decompression process upon loading, ensuring ease of use when working with compressed files directly in the `data_files` argument.
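For comparison, the sketch below shows the manual route that automatic decompression spares you: reading a gzip-compressed JSON file with Python's standard library. The file name `sample.json.gz` and its contents are placeholders created in a temporary directory, not the real SQuAD-it files.

```python
import gzip
import json
import tempfile
from pathlib import Path

payload = {"data": [{"title": "example"}]}  # hypothetical stand-in content

with tempfile.TemporaryDirectory() as tmp_dir:
    archive_path = Path(tmp_dir) / "sample.json.gz"  # hypothetical file name

    # Write a gzip-compressed JSON file, similar in shape to SQuAD_it-*.json.gz.
    with gzip.open(archive_path, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)

    # Manual route: decompress while reading, then parse the JSON.
    with gzip.open(archive_path, "rt", encoding="utf-8") as handle:
        restored = json.load(handle)

print(restored == payload)
```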

### Loading a remote dataset

When handling remote datasets, the process remains straightforward. Instead of directing the `data_files` argument to local paths, you simply assign it the URLs where the remote files are located. For instance, in the case of the SQuAD-it dataset residing on GitHub, the `data_files` parameter can point directly to the URLs hosting the SQuAD_it-*.json.gz files.

In [None]:
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

This retrieves the identical DatasetDict object we previously obtained, eliminating the need for manual downloading and decompression of the SQuAD_it-*.json.gz files. With this dataset at hand, let's delve into diverse data manipulation techniques!

### Toying with data subsets

In this section, we'll explore various techniques for slicing and dicing data using the 🤗 Datasets library. We'll cover operations like selecting specific columns, filtering rows based on conditions, and shuffling the dataset.

Beyond Dataset.map(), 🤗 Datasets offers a range of methods to manage datasets. These functions empower you to filter rows, select columns, shuffle data, and more. Let's explore some of these to enhance our dataset manipulations.

In this instance, we'll work with the Drug Review Dataset available on the UC Irvine Machine Learning Repository. It includes patient reviews concerning different drugs, along with the treated condition and a 10-star rating reflecting patient satisfaction.

First, we need to download and extract the data:

https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip


In [1]:
!curl -O "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
# Extract with Python's built-in zipfile CLI, which works the same on every platform
!python -m zipfile -e drugsCom_raw.zip .



TSV (Tab-Separated Values) functions similarly to CSV (Comma-Separated Values) but employs tabs instead of commas as the separator. To load TSV files, you can use the csv loading script and indicate the delimiter argument within the load_dataset() function, like this:

In [None]:
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

One effective practice during data analysis is to extract a small random sample to gain a preliminary understanding of the data structure. In 🤗 Datasets, generating a random sample involves combining the `Dataset.shuffle()` and `Dataset.select()` functions in a sequence:

In [None]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]
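The shuffle-then-select pattern can be pictured in plain Python as a seeded shuffle of the row indices followed by keeping the first few. This is only a sketch of the idea: `sample_indices` is a hypothetical helper, and the standard-library random generator used here is not the one 🤗 Datasets uses, so the indices it picks will differ from the library's.

```python
import random


def sample_indices(num_rows, sample_size, seed):
    """Return `sample_size` distinct row indices drawn by a seeded shuffle."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # shuffle step, reproducible via the seed
    return indices[:sample_size]  # select step: keep the first few shuffled indices


first_draw = sample_indices(num_rows=20, sample_size=5, seed=42)
second_draw = sample_indices(num_rows=20, sample_size=5, seed=42)

# The same seed gives the same sample on every run
print(first_draw == second_draw)
```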

To ensure consistent results, we've fixed the seed in the `Dataset.shuffle()` function. Since `Dataset.select()` requires an iterable of indices, we've passed `range(1000)` to extract the first 1,000 samples from the shuffled dataset. This initial sample reveals a few peculiarities in our dataset:

- The `Unnamed: 0` column appears to be an anonymized patient ID.
- The `condition` column contains a combination of uppercase and lowercase labels.
- The reviews vary in length and include a mix of Python line separators (`\r\n`) and HTML character codes like `&#039;`.

To confirm our hypothesis that the `Unnamed: 0` column represents anonymized patient IDs, we can use the `Dataset.unique()` function to check if the number of IDs matches the number of rows in each split.

In [None]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

Our hypothesis about the `Unnamed: 0` column being anonymized patient IDs seems valid. Let's improve the dataset's clarity by renaming the `Unnamed: 0` column to something more meaningful. We can use the `DatasetDict.rename_column()` function to modify the column name in both splits simultaneously:

In [None]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

Next, let's normalize all the `condition` labels using `Dataset.map()`. As with tokenization in Chapter 3, we can define a simple function that is applied to every row in each split of `drug_dataset`:

In [None]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


drug_dataset.map(lowercase_condition)

Unfortunately, we've encountered an issue with our mapping function. The error indicates that some values in the `condition` column are `None`, which cannot be lowercased because they are not strings. To handle this, we can remove these rows using `Dataset.filter()`, which works like `Dataset.map()` and expects a function that receives a single example from the dataset and returns `True` for the rows to keep. We could write an explicit function like:

In [None]:
def filter_nones(x):
    return x["condition"] is not None

Instead of writing an explicit function like `filter_nones` and then calling `drug_dataset.filter(filter_nones)`, we can accomplish the same task in a single line using a lambda function. In Python, lambda functions are concise functions that can be defined without explicitly naming them. They follow the general structure:

        lambda <arguments> : <expression>


The `lambda` keyword is a special term in Python that introduces anonymous functions. The `<arguments>` section is a comma-separated list of parameter names that represent the function's inputs. The `<expression>` part specifies the operation to perform, and its value is what the function returns. For instance, we can define a simple lambda function that squares a number using the following code:

In [4]:
lambda x : x * x

<function __main__.<lambda>(x)>

To utilize this function for a given input, it needs to be enclosed in parentheses along with the input itself:

In [3]:
(lambda x: x * x)(3)

9

In a similar manner, lambda functions can accommodate multiple arguments by separating them with commas. For instance, we can calculate the area of a triangle using the following lambda function:

In [5]:
(lambda base, height: 0.5 * base * height)(4, 8)

16.0

Lambda functions prove useful when you need to create concise, disposable functions (for more details, we recommend reading Andre Burgaud's exceptional Real Python tutorial). Within the context of 🤗 Datasets, lambda functions allow us to define straightforward map and filter operations. Let's leverage this approach to remove the `None` entries from our dataset:

In [None]:
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)
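To see what this predicate does without loading the dataset, here is a small sketch that applies the same lambda to a plain list of dictionaries. The three rows are invented for illustration, and the list comprehension only stands in for the row selection that `Dataset.filter()` performs.

```python
# Hypothetical rows standing in for dataset examples
rows = [
    {"drugName": "A", "condition": "Acne"},
    {"drugName": "B", "condition": None},
    {"drugName": "C", "condition": "Pain"},
]

# The same predicate as in the cell above: keep rows whose condition is set
keep_row = lambda x: x["condition"] is not None

# A row survives only if the predicate returns True for it
filtered_rows = [row for row in rows if keep_row(row)]

print([row["drugName"] for row in filtered_rows])
```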

Once the `None` entries have been eliminated, we can proceed with normalizing our `condition` column:

In [None]:
drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

Excellent! After normalizing the labels, let's turn our attention to cleaning up the reviews themselves.

#### Creating new columns

When dealing with customer reviews, it's advisable to examine the number of words in each review. A review could range from a single word like "Great!" to a lengthy essay spanning thousands of words. Depending on the specific application, you may need to handle these extremes differently. To determine the number of words in each review, we'll employ a basic approach that involves splitting each text by whitespace.

Let's define a straightforward function that calculates the word count for each review:

In [None]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

Unlike our `lowercase_condition()` function, `compute_review_length()` yields a dictionary whose key doesn't match any of the column names in the dataset. Consequently, when `compute_review_length()` is passed to `Dataset.map()`, it will be applied to every row in the dataset, generating a new `review_length` column:

In [None]:
drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]
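One way to picture the difference between updating a column and creating one is to merge the returned dictionary into a single row by hand. The sketch below does this with a made-up row; the dictionary merge is an illustration of the observable behavior, not the library's internal code.

```python
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


def compute_review_length(example):
    return {"review_length": len(example["review"].split())}


# Hypothetical single row
row = {"condition": "Acne", "review": "Worked well for me"}

# Returned key matches an existing column: the value is overwritten
lowercased = {**row, **lowercase_condition(row)}

# Returned key is new: the row gains a column
with_length = {**row, **compute_review_length(row)}

print(lowercased)
print(with_length)
```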

As anticipated, our training set now includes a `review_length` column. To examine the extreme values of this new column, we can sort it using `Dataset.sort()`:

In [None]:
drug_dataset["train"].sort("review_length")[:3]

Our suspicions were confirmed. Some reviews consist of only a single word, which, while acceptable for sentiment analysis, would not provide sufficient information for the task of predicting the condition.

An alternative method for adding new columns to a dataset is the `Dataset.add_column()` function. This function enables you to supply the column as a Python list or NumPy array, making it useful in scenarios where `Dataset.map()` is not the optimal choice for your analysis.

Let's use the `Dataset.filter()` function to remove reviews that contain 30 words or fewer. As with the `condition` column, we can filter out very short reviews by requiring that the review length exceed this threshold:

In [None]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)

As evident, this process has eliminated approximately 15% of the reviews from our original training and test sets.

The final step involves addressing the presence of HTML character codes in our reviews. Python's `html` module provides a convenient tool for unescaping these characters, as demonstrated here:

In [6]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

We'll leverage `Dataset.map()` to unescape all HTML characters within our corpus:

In [None]:
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

#### The transformative power of the `map()` method

The `Dataset.map()` method offers a `batched` argument that, when set to `True`, instructs it to send a batch of examples to the map function simultaneously. The batch size is configurable with a default value of 1,000. For instance, the previous map function that unescaped all HTML characters took a noticeable amount of time to execute (the progress bars display the elapsed time). We can expedite this process by processing multiple elements concurrently using a list comprehension.

When `batched=True`, the function receives a dictionary containing the dataset's fields, but each value is now a list of values rather than a single value. The return value of the function should follow the same shape: a dictionary with the fields we want to update or add to our dataset, each mapped to a list of values. Here's an alternative method for unescaping all HTML characters using `batched=True`:


In [None]:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

If you're running this code in a notebook, you'll observe that this command executes significantly faster than the previous one. This improvement is not due to the reviews already being HTML-unescaped; re-running the instruction from the previous section (without `batched=True`) will yield the same execution time as before. The reason for this performance gain is that list comprehensions are typically faster than executing the same code in a `for` loop. Additionally, accessing numerous elements simultaneously rather than one at a time contributes to the performance improvement.
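The calling convention for a batched function can be sketched in plain Python. The helper `map_batched` below is hypothetical and deliberately simplified: it slices a dictionary of equal-length lists into batches, calls the function once per batch, and merges the returned lists back in. It assumes the function returns as many values as it received, which is not required by the real `Dataset.map()`.

```python
import html


def map_batched(columns, function, batch_size=1000):
    """Apply `function` to successive slices of a dict of equal-length lists."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict of lists, one list per column
        batch = {name: values[start : start + batch_size] for name, values in columns.items()}
        updates = function(batch)
        # Returned columns replace or extend the ones in the batch
        for name, values in {**batch, **updates}.items():
            output.setdefault(name, []).extend(values)
    return output


# Hypothetical miniature dataset
data = {"review": ["I&#039;m fine", "It&#039;s ok", "No change"], "rating": [9, 7, 5]}

result = map_batched(
    data,
    lambda x: {"review": [html.unescape(o) for o in x["review"]]},
    batch_size=2,
)
print(result["review"])
```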

Employing `Dataset.map()` with `batched=True` will prove crucial in harnessing the speed of the "fast" tokenizers, which excel at tokenizing large text collections efficiently. For instance, to tokenize all drug reviews using a fast tokenizer, we could utilize a function like this:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

As you learned in Chapter 3, the tokenizer accepts one example or a list of examples, so we can use this function with or without `batched=True`. Let's take this opportunity to compare the performance of the two approaches. In a notebook, you can time a single-line instruction by adding `%time` before the line of code you want to measure, or time an entire cell by placing `%%time` at the beginning of the cell:

In [None]:
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)
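The `%time` magic is only available in IPython and Jupyter. In a plain Python script, a rough equivalent is to measure wall-clock time with `time.perf_counter()`. The helper `timed` below is a hypothetical convenience wrapper, shown on a trivial computation rather than on the tokenization itself.

```python
import time


def timed(function, *args, **kwargs):
    """Call `function` and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = function(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Trivial stand-in workload; replace with the call you want to measure
total, seconds = timed(sum, range(100_000))
print(f"sum={total}, took {seconds:.4f}s")
```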


Here are the results we obtained with and without batching, with a fast and a slow tokenizer:

![](2023-11-23-13-28-09.png)

This implies that employing a fast tokenizer with the `batched=True` option is approximately 30 times faster than its slow counterpart without batching—an incredible performance gain! This remarkable speedup stems from the use of Rust, a language that facilitates code parallelization, in the background tokenization process. Owing to these advantages, fast tokenizers are the default choice when using `AutoTokenizer` (and the reason behind their name).

Parallelization is also the driving force behind the significant speedup, nearly 6 times faster, achieved by the fast tokenizer with batching. Parallelizing a single tokenization operation is not feasible, but when tokenizing multiple texts simultaneously, the execution can be distributed across multiple processes, each handling its assigned texts. This parallel execution model enables the impressive performance gain observed.

The `Dataset.map()` function also has parallelization capabilities of its own. Because they aren't backed by Rust, they won't let a slow tokenizer match the performance of a fast one, but they can still help, particularly when using a tokenizer that lacks a fast version. To enable multiprocessing, use the `num_proc` argument and specify the desired number of processes in your `Dataset.map()` call:

In [None]:
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)


def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)


tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

You can experiment with timing to determine the optimal number of processes to use; in our case, 8 seemed to yield the most significant speed gain. Here's a comparison of the performance with and without multiprocessing:

![](2023-11-23-13-25-10.png)


While multiprocessing significantly improved the performance of the slow tokenizer, it also enhanced the fast tokenizer's performance. However, this improvement is not always guaranteed. For values of `num_proc` other than 8, our tests revealed that using `batched=True` without multiprocessing was faster. As a general recommendation, we advise against using Python multiprocessing for fast tokenizers with `batched=True`.

The versatility of the `Dataset.map()` method with `batched=True` is truly remarkable. It not only simplifies processing large datasets but also enables modifying the number of elements in the dataset. This capability proves particularly valuable in scenarios where you want to generate multiple training features from a single example.

In machine learning, an "example" typically refers to the collection of "features" provided to the model for training. In certain contexts, these features correspond to the columns in a `Dataset`. However, in other scenarios, such as here and in question answering, multiple features can be derived from a single example and reside within a single column.

Let’s have a look at how it works! Here we will tokenize our examples and truncate them to a maximum length of 128, but we will ask the tokenizer to return all the chunks of the texts instead of just the first one. This can be done with `return_overflowing_tokens=True`:

In [None]:
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

Let’s test this on one example before using Dataset.map() on the whole dataset:

In [None]:
result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

Consequently, our initial example in the training set was divided into two features due to exceeding the specified maximum token length during tokenization. The resulting features have lengths of 128 and 49 tokens, respectively. Let's now apply this process to all elements of the dataset!

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)

Oops! Something went wrong. The error message reports a mismatch in the lengths of two columns: one has length 1,463 and the other 1,000. If you've looked at the `Dataset.map()` documentation, you may recall that 1,000 is the number of samples passed to the mapping function in each batch. Here, those 1,000 examples produced 1,463 new features, which causes the shape error.
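The arithmetic behind this growth can be sketched without a tokenizer. The function below is a hypothetical model of chunking: it assumes every chunk carries two special tokens (as BERT does with `[CLS]` and `[SEP]`), that chunks do not overlap, and that each text has at least one token. The token counts in `batch` are invented.

```python
def chunk_lengths(num_content_tokens, max_length=128, num_special_tokens=2):
    """Lengths of the chunks produced when a non-empty text is split without overlap."""
    room = max_length - num_special_tokens  # content tokens that fit in one chunk
    lengths = []
    remaining = num_content_tokens
    while remaining > 0:
        taken = min(room, remaining)
        lengths.append(taken + num_special_tokens)
        remaining -= taken
    return lengths


# Hypothetical content-token counts for a batch of three reviews
batch = [173, 40, 300]
features = [length for count in batch for length in chunk_lengths(count)]

# Three input examples, but more output features
print(len(batch), len(features), features)
```

A review with 173 content tokens yields chunks of 128 and 49 tokens under these assumptions, which matches the lengths seen for the first training example.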

The underlying issue lies in the attempt to combine two datasets of different sizes. The `drug_dataset` columns will have a specific number of examples (in our case, 1,000, as indicated by the error message), while the `tokenized_dataset` we're constructing will have more examples (1,463, as mentioned in the error message; this exceeds 1,000 because we're splitting lengthy reviews into multiple examples using `return_overflowing_tokens=True`). This mismatch is incompatible with a `Dataset`, so we either need to remove the columns from the original dataset or ensure they have the same size as in the new dataset. The former approach can be achieved using the `remove_columns` argument:

In [None]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

After removing the columns, the tokenization process proceeds without errors. We can verify that the new dataset contains significantly more elements than the original dataset by comparing their respective lengths:

In [None]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

As we mentioned earlier, addressing the mismatched length issue can also be achieved by adjusting the size of the old columns to match that of the new ones. To accomplish this, we'll utilize the `overflow_to_sample_mapping` field that the tokenizer returns when `return_overflowing_tokens=True` is set. This field provides a mapping from a new feature index to the index of the sample it originated from. Leveraging this mapping, we can associate each key in our original dataset with a list of values of the appropriate size by replicating the values of each example as many times as it produces new features:

In [None]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result

We can see it works with Dataset.map() without us needing to remove the old columns:

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

This approach yields the same number of training features as the previous method, but it preserves all the original fields. If you require these fields for post-processing steps after applying your model, this approach might be preferable.
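The replication step inside `tokenize_and_split()` can be seen in isolation with a tiny example. The values below are made up: `sample_map` plays the role of `overflow_to_sample_mapping` for a batch in which the first example produced two features and the second produced one.

```python
# Hypothetical batch of two examples (dict of lists, as with batched=True)
examples = {"patient_id": [101, 102], "rating": [9.0, 4.0]}

# Hypothetical mapping: features 0 and 1 come from example 0, feature 2 from example 1
sample_map = [0, 0, 1]

# Repeat each original value once per feature generated from its example
replicated = {key: [values[i] for i in sample_map] for key, values in examples.items()}

print(replicated)
```

Every column now has three entries, so it lines up with the three tokenized features.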

As you have witnessed, 🤗 Datasets offers a versatile toolkit for preprocessing datasets in various ways. While the processing functions provided by 🤗 Datasets will address most of your model training needs, there may be instances where you need to transition to Pandas to access more advanced features, such as `DataFrame.groupby()` or high-level visualization APIs. Fortunately, 🤗 Datasets is designed for seamless interoperability with libraries like Pandas, NumPy, PyTorch, TensorFlow, and JAX. Let's explore how this works.

#### From Datasets to DataFrames and back

To facilitate conversion between various third-party libraries, 🤗 Datasets offers the `Dataset.set_format()` function. This function only changes the output format of the dataset, so you can switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in place. To illustrate this, let's convert our dataset to Pandas:

In [None]:
drug_dataset.set_format("pandas")

Now when we access elements of the dataset we get a pandas.DataFrame instead of a dictionary:

In [None]:
drug_dataset["train"][:3]

Let’s create a pandas.DataFrame for the whole training set by selecting all the elements of drug_dataset["train"]:

In [None]:
train_df = drug_dataset["train"][:]

At the technical level, `Dataset.set_format()` alters the return format for the dataset's `__getitem__()` method. This implies that when attempting to create a new object like `train_df` from a `Dataset` in the `"pandas"` format, the entire dataset must be sliced to obtain a `pandas.DataFrame`. You can independently verify that the type of `drug_dataset["train"]` remains `Dataset`, regardless of the output format.
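The idea that only `__getitem__()` changes, while the stored data stays put, can be illustrated with a toy class. `ToyDataset` and its `"rows"` format are invented for this sketch and support slices only; the class is not how 🤗 Datasets is implemented.

```python
class ToyDataset:
    """Toy stand-in that stores columns as lists and formats them on access."""

    def __init__(self, columns):
        self._columns = columns  # underlying storage, never modified by formatting
        self._format = "dict"

    def set_format(self, format_name):
        self._format = format_name  # only affects what __getitem__ returns

    def reset_format(self):
        self._format = "dict"

    def __getitem__(self, key):
        # `key` is expected to be a slice in this sketch
        selected = {name: values[key] for name, values in self._columns.items()}
        if self._format == "rows":
            names = list(selected)
            return [dict(zip(names, row)) for row in zip(*selected.values())]
        return selected


toy = ToyDataset({"condition": ["acne", "pain"], "rating": [9, 4]})
as_dict = toy[:2]

toy.set_format("rows")
as_rows = toy[:2]

print(type(toy).__name__, as_dict, as_rows)
```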

Once the dataset is converted to the "pandas" format, you can leverage the full range of Pandas functionalities. For instance, you can employ elegant chaining to calculate the class distribution within the "condition" entries:

In [None]:
frequencies = (
    train_df["condition"]
    .value_counts()
    # Name the index and the counts explicitly so the result does not
    # depend on the default column names of the installed pandas version
    .rename_axis("condition")
    .reset_index(name="frequency")
)
frequencies.head()

After completing your Pandas analysis, you can seamlessly convert the modified DataFrame back into a `Dataset` object using the `Dataset.from_pandas()` function:

In [None]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

This concludes our exploration of the diverse preprocessing capabilities offered by 🤗 Datasets. To finalize this section, let's establish a validation set to prepare the dataset for training a classifier. Before proceeding, we'll revert the output format of `drug_dataset` from `"pandas"` to `"arrow"`:

In [None]:
drug_dataset.reset_format()

#### Creating a validation set

While we have access to a test set for evaluation, it's a prudent practice to preserve its integrity and create a distinct validation set during the development phase. Once you're satisfied with the performance of your models on the validation set, you can perform a final sanity check on the test set. This approach helps alleviate the risk of overfitting to the test set and deploying a model that performs poorly on real-world data.

🤗 Datasets offers the `Dataset.train_test_split()` function, which draws inspiration from the popular scikit-learn functionality. Let's utilize this function to divide our training set into "train" and "validation" splits (the `seed` argument is set for reproducibility):

In [None]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean
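Conceptually, the split is a seeded shuffle of the row indices followed by a cut at the requested fraction. The sketch below shows that idea with a hypothetical helper, `split_indices`; it uses the standard-library random generator, so it will not reproduce the exact rows that `Dataset.train_test_split()` selects.

```python
import random


def split_indices(num_rows, train_size, seed):
    """Shuffle row indices with a fixed seed and cut them into two groups."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    cutoff = int(num_rows * train_size)
    return indices[:cutoff], indices[cutoff:]


train_idx, validation_idx = split_indices(num_rows=10, train_size=0.8, seed=42)

# Every row lands in exactly one of the two groups
print(len(train_idx), len(validation_idx))
```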

#### Saving a dataset

While 🤗 Datasets automatically caches every downloaded dataset and the transformations applied to it, there may be instances where you need to explicitly save a dataset to disk (for example, to prevent data loss in case the cache is cleared). As illustrated in the table below, 🤗 Datasets offers three primary functions for saving datasets in various formats:

![](2023-11-23-15-31-26.png)

For example, let’s save our cleaned dataset in the Arrow format:

In [None]:
drug_dataset_clean.save_to_disk("drug-reviews")

This will create a directory with the following structure:

            drug-reviews/
            ├── dataset_dict.json
            ├── test
            │   ├── dataset.arrow
            │   ├── dataset_info.json
            │   └── state.json
            ├── train
            │   ├── dataset.arrow
            │   ├── dataset_info.json
            │   ├── indices.arrow
            │   └── state.json
            └── validation
                ├── dataset.arrow
                ├── dataset_info.json
                ├── indices.arrow
                └── state.json


Each split is linked to its own `dataset.arrow` table, along with some metadata stored in `dataset_info.json` and `state.json`. The Arrow format can be conceptualized as an advanced table of columns and rows, optimized for developing high-performance applications that handle and transfer large datasets.

To load a saved dataset, utilize the `load_from_disk()` function:

In [None]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

When saving datasets in CSV or JSON formats, each split must be stored as an individual file. One approach to achieve this is to iterate through the keys and values in the `DatasetDict` object:

In [None]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

This saves each split in JSON Lines format, where each row in the dataset is stored as a single line of JSON. Here’s what the first example looks like:

In [None]:
!head -n 1 drug-reviews-train.jsonl

We can then use the techniques shown earlier for loading local files to load the JSON files as follows:

In [None]:
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)


Excellent! We have successfully explored data wrangling techniques with 🤗 Datasets. 