### Tutorial on using the Hugging Face `datasets` library

This notebook covers the core workflow of the `datasets` library for NLP projects: loading a dataset from the Hub, inspecting its splits and features, converting it to and from pandas, tokenizing the text, and formatting the result for PyTorch training.

In [1]:
# This tutorial demonstrates key functionalities of the Hugging Face `datasets` library.
# It is designed for those building NLP applications with the `transformers` library.

import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict
from transformers import AutoTokenizer



In [2]:
# Loading a Dataset from the Hugging Face Hub
# The easiest way to get started is to load a pre-existing dataset.
# The `load_dataset` function automatically downloads and caches the data.
print("Step 2: Loading the 'imdb' dataset from the Hugging Face Hub.")
raw_datasets = load_dataset("imdb")

# Check the structure of the loaded dataset. It's a `DatasetDict` object.
print("\nDataset structure:")
print(raw_datasets)

# Access a specific split, like 'train'
train_dataset = raw_datasets["train"]

# Print the features (columns) and the number of rows
print(f"\nTraining dataset features: {train_dataset.features}")
print(f"Number of rows in training dataset: {len(train_dataset)}")

# Inspect a single example
print("\nFirst example in the training dataset:")
print(train_dataset[0])




Step 2: Loading the 'imdb' dataset from the Hugging Face Hub.

Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

Training dataset features: {'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}
Number of rows in training dataset: 25000

First example in the training dataset:
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything sh

In [3]:
# Converting a Dataset to and from a Pandas DataFrame
# `Dataset.to_pandas()` and `Dataset.from_pandas()` convert in both directions.

# To Pandas DataFrame
print("\nStep 3: Converting a Hugging Face Dataset to a Pandas DataFrame.")
df = train_dataset.to_pandas()
print("\nPandas DataFrame head:")
print(df.head())

# From Pandas DataFrame
print("\nCreating a Hugging Face Dataset from the Pandas DataFrame.")
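# You can convert the entire DataFrame or a portion of it.
# Note: `sample` keeps the original row labels as a non-default index, which
# `from_pandas` stores as an extra `__index_level_0__` column (visible in the
# output). Pass `preserve_index=False` to `from_pandas` to drop it.
df_small = df.sample(100)
new_dataset_from_pandas = Dataset.from_pandas(df_small)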
print("\nNew Dataset created from Pandas DataFrame:")
print(new_dataset_from_pandas)





Step 3: Converting a Hugging Face Dataset to a Pandas DataFrame.

Pandas DataFrame head:
                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0

Creating a Hugging Face Dataset from the Pandas DataFrame.

New Dataset created from Pandas DataFrame:
Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 100
})


In [4]:
# Preprocessing Data for a Transformer Model
# This is a critical step for preparing your data for a model.
# We'll use a tokenizer to convert text into numerical token IDs.

# Load a pre-trained tokenizer. This example uses the tokenizer for BERT base (uncased).
print("\nStep 4: Tokenizing the dataset for a transformer model.")
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

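# Create a preprocessing function that tokenizes a batch of examples.
# With `batched=True`, `map` passes a dict of lists, so the tokenizer
# processes many texts per call instead of one at a time.
# Every example is padded or truncated to exactly 128 tokens.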
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Apply the preprocessing function to every split in the `DatasetDict`.
# The `map` function is the core of data preprocessing in `datasets`.
# It applies a function to all examples in the dataset and caches the result.
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

print("\nTokenized dataset features (note the new 'input_ids' and other keys):")
print(tokenized_datasets["train"].features)

print("\nFirst example of the tokenized dataset:")
print(tokenized_datasets["train"][0])




Step 4: Tokenizing the dataset for a transformer model.


Map: 100%|██████████████████████| 50000/50000 [00:10<00:00, 4716.13 examples/s]


Tokenized dataset features (note the new 'input_ids' and other keys):
{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos']), 'input_ids': List(Value('int32')), 'token_type_ids': List(Value('int8')), 'attention_mask': List(Value('int8'))}

First example of the tokenized dataset:
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians
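
The tokenizer arguments `padding="max_length"`, `truncation=True`, and `max_length=128` give every example the same length. The sketch below imitates that effect on plain lists of token IDs in pure Python. It is a simplification, not the tokenizer's implementation: the helper name `pad_or_truncate_batch` is hypothetical, the pad ID of 0 is an assumption, and a real tokenizer also keeps special tokens in place when it truncates.

```python
from typing import Dict, List


def pad_or_truncate_batch(
    batch_ids: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Cut or right-pad each ID list to max_length and build attention masks."""
    input_ids = []
    attention_mask = []
    for ids in batch_ids:
        kept = ids[:max_length]            # truncation: drop anything past max_length
        n_pad = max_length - len(kept)     # number of padding positions to add
        input_ids.append(kept + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs: one short example and one that exceeds max_length.
batch = pad_or_truncate_batch(
    [[101, 7592, 102], [101, 1, 2, 3, 4, 5, 102]], max_length=5
)
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 1, 2, 3, 4]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The helper takes and returns lists with one entry per example, which is the same shape a function receives and returns when it is passed to `map` with `batched=True`.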




In [5]:
# Preparing for Training
# Select and rename columns, and set the format for your deep learning framework.

# We'll assume a training setup where the model expects 'input_ids', 'attention_mask',
# and 'labels' (not 'text' or 'label').
print("\nStep 5: Preparing the dataset for PyTorch training.")

# Remove original columns that the model doesn't need
tokenized_datasets = tokenized_datasets.remove_columns(["text"])

# Rename the 'label' column to 'labels' as many models expect this name
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# Set the format to 'torch' to return PyTorch tensors instead of standard Python lists
tokenized_datasets.set_format("torch")

print("\nInput IDs of the prepared training set:")
print(tokenized_datasets["train"]['input_ids'])



Step 5: Preparing the dataset for PyTorch training.

Input IDs of the prepared training set:
Column([tensor([  101,  1045, 12524,  1045,  2572,  8025,  1011,  3756,  2013,  2026,
         2678,  3573,  2138,  1997,  2035,  1996,  6704,  2008,  5129,  2009,
         2043,  2009,  2001,  2034,  2207,  1999,  3476,  1012,  1045,  2036,
         2657,  2008,  2012,  2034,  2009,  2001,  8243,  2011,  1057,  1012,
         1055,  1012,  8205,  2065,  2009,  2412,  2699,  2000,  4607,  2023,
         2406,  1010,  3568,  2108,  1037,  5470,  1997,  3152,  2641,  1000,
         6801,  1000,  1045,  2428,  2018,  2000,  2156,  2023,  2005,  2870,
         1012,  1026,  7987,  1013,  1028,  1026,  7987,  1013,  1028,  1996,
         5436,  2003,  8857,  2105,  1037,  2402,  4467,  3689,  3076,  2315,
        14229,  2040,  4122,  2000,  4553,  2673,  2016,  2064,  2055,  2166,
         1012,  1999,  3327,  2016,  4122,  2000,  3579,  2014,  3086,  2015,
         2000,  2437,  2070,  4066,  1997,  45