# Datasets

In [1]:
!pip install datasets



# Loading a Dataset

In [2]:
import datasets

# Working with IMDB Dataset

## Dataset Overview
- Access a comprehensive movie review dataset from the [HuggingFace Hub](https://huggingface.co/datasets)
- Dataset identifier: `stanfordnlp/imdb`

## Efficient Data Management
- The first call downloads the dataset and caches it locally
- Subsequent loads read from the local cache (by default under `~/.cache/huggingface/datasets`)
- Uses the [memory-mapped](https://huggingface.co/docs/datasets/en/about_arrow) Apache Arrow columnar format for:
    - Storage efficiency
    - Fast data iteration
    - Optimized memory usage

## Usage
We can load the dataset using the `load_dataset()` function to begin our analysis.


In [3]:
from datasets import load_dataset

imdb_dataset = load_dataset('stanfordnlp/imdb')
print(imdb_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


- `load_dataset()` returns a `DatasetDict`, which behaves like a Python dictionary keyed by split name.
- Each split is a `Dataset` object. Its printed summary shows the column names (`features`) and the number of rows (`num_rows`).

We can also extract only the train split.

In [4]:
imdb_train_split = imdb_dataset['train']
print(imdb_train_split)

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})


Indexing by split name returns a single `Dataset` object.

- We can also remove any unwanted splits from the dataset in the same way we remove items from a dictionary.

In [5]:
_ = imdb_dataset.pop('unsupervised')
print(imdb_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})


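We can also load only the `train` split by passing the `split` argument. In that case `load_dataset()` returns a `Dataset` directly rather than a `DatasetDict`.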

In [6]:
train_split = load_dataset('stanfordnlp/imdb', split='train')
train_split

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

A `Dataset` can be split into `train` and `test` subsets with `train_test_split()`, much like the scikit-learn function of the same name.

In [7]:
small_ds = train_split.train_test_split(test_size=0.2)
print(small_ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})


In [8]:
train_file = small_ds['train']
test_file = small_ds['test']

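We can also save each split to disk in various formats, for example CSV. `to_csv()` returns the number of bytes written, which is the number printed after the progress bars.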

In [9]:
import os
os.makedirs('data', exist_ok=True)  # make sure the target directory exists
train_file.to_csv('data/train.csv')
test_file.to_csv('data/test.csv')

Creating CSV from Arrow format:   0%|          | 0/20 [00:00<?, ?ba/s]

Creating CSV from Arrow format:   0%|          | 0/5 [00:00<?, ?ba/s]

6761756

# Loading Local Dataset

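Now we have two CSV files inside the `data` directory. Note that all the files we load together should contain the same set of columns. When `data_files` is a plain list, every file is loaded into a single `train` split, which is why the output below reports 25,000 rows.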

In [10]:
data_files = ['data/train.csv', 'data/test.csv']
local_dataset = load_dataset('csv', data_files=data_files)
print(local_dataset)

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})


[Here](https://huggingface.co/docs/datasets/en/tabular_load#csv-files) are the other formats supported by the `datasets` library.

- `train_test_split()` is a method of `Dataset`, not of `DatasetDict`, so we call it on the `train` split.

In [11]:
train_test_splits = local_dataset['train'].train_test_split(test_size=0.2)
train_test_splits

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

By default, `save_to_disk()` writes the dataset in Arrow format (`.arrow` files), alongside JSON metadata files.

In [12]:
train_test_splits.save_to_disk('pyarrow_dataset/movie_review')

Saving the dataset (0/1 shards):   0%|          | 0/20000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/5000 [00:00<?, ? examples/s]

```
📁 pyarrow_dataset/
├── 📁 movie_review/
│   ├── 📁 train/
│   │   ├── data-00000-of-00001.arrow
│   │   ├── dataset_info.json
│   │   └── state.json
│   └── 📁 test/
│       ├── data-00000-of-00001.arrow
│       ├── dataset_info.json
│       └── state.json
```

Now we can load the dataset back from the local Arrow files with `load_from_disk()`.

In [14]:
from datasets import load_from_disk

raw_dataset_from_disk = load_from_disk('pyarrow_dataset/movie_review')
print(raw_dataset_from_disk)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})


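Note that `load_dataset` and `load_from_disk` are not interchangeable: `load_dataset` loads from the Hub or from raw files such as CSV, while `load_from_disk` reads a directory previously written by `save_to_disk`.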

# Accessing the samples