# Summary

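This notebook builds the IMDB dataset for sentiment classification.  The examples are tokenized but not padded here; the data collator pads each batch on the fly during training.  We also don't condense the examples into longer sequences, because each review carries its own label.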
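
To make "pad on the fly" concrete, here is a minimal pure-Python sketch of what a padding collator does to one batch. It is only an illustration: `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 1 is an assumption based on the `roberta-base` vocabulary. In `transformers`, `DataCollatorWithPadding` plays this role.

```python
# Hypothetical stand-in for a padding data collator; not the transformers API.
PAD_TOKEN_ID = 1  # assumption: roberta-base uses 1 as its <pad> token id


def pad_batch(examples, pad_token_id=PAD_TOKEN_ID):
    """Pad every example to the length of the longest one in this batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        # Real tokens keep attention 1, padding positions get attention 0
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(ex["attention_mask"] + [0] * n_pad)
        batch["labels"].append(ex["label"])
    return batch


# Two illustrative tokenized examples of different lengths
examples = [
    {"input_ids": [0, 100, 802, 2], "attention_mask": [1, 1, 1, 1], "label": 1},
    {"input_ids": [0, 713, 2], "attention_mask": [1, 1, 1], "label": 0},
]
batch = pad_batch(examples)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because padding depends only on the longest example in each batch, storing the dataset unpadded avoids padding every review to a single global length.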

In [1]:
import os

def create_dir(dir_path):
    """Create dir_path (and any missing parents) if it does not already exist."""
    os.makedirs(dir_path, exist_ok=True)

In [2]:
create_dir('datasets/classifier/imdb')

In [3]:
!curl -Lo train.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/imdb/train.jsonl --output-dir 'datasets/classifier/imdb'
!curl -Lo dev.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/imdb/dev.jsonl --output-dir 'datasets/classifier/imdb'
!curl -Lo test.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/imdb/test.jsonl --output-dir 'datasets/classifier/imdb'


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 25.6M  100 25.6M    0     0  4326k      0  0:00:06  0:00:06 --:--:-- 4154k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6651k  100 6651k    0     0  4710k      0  0:00:01  0:00:01 --:--:-- 4713k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31.4M  100 31.4M    0     0  11.2M      0  0:00:02  0:00:02 --:--:-- 11.2M


In [4]:
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files={
        "train": "datasets/classifier/imdb/train.jsonl",
        "test": "datasets/classifier/imdb/test.jsonl",
        "dev": "datasets/classifier/imdb/dev.jsonl",
    },
)

Generating train split: 20000 examples [00:00, 480007.32 examples/s]
Generating test split: 25000 examples [00:00, 664189.57 examples/s]
Generating dev split: 5000 examples [00:00, 738850.06 examples/s]


In [15]:
dataset["train"][4]

{'id': 'train_2136',
 'text': "I thought this film was just about perfect. The descriptions/summaries you'll read about this movie don't do it justice. The plot just does not sound very interesting, BUT IT IS. Just rent it and you will not be sorry!!",
 'label': 1}

In [16]:
# Not used in this notebook because the labels are already integers, but useful when configuring a model for training
id2label = {0: "negative", 1: "positive"}
label2id = {"negative": 0, "positive": 1}
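
As a small sketch of how these mappings might be used later, the snippet below derives `label2id` from `id2label` so the two cannot drift apart, and turns an integer label back into its name. `label_name` is a hypothetical helper, and the example record only mirrors the shape of the rows shown above. In training, mappings like these are typically passed to the model configuration (for example as the `id2label` and `label2id` arguments of `AutoModelForSequenceClassification.from_pretrained`), which is an assumption about the downstream notebook rather than something done here.

```python
id2label = {0: "negative", 1: "positive"}
# Derive the inverse mapping instead of writing it by hand
label2id = {name: idx for idx, name in id2label.items()}


def label_name(label_id):
    """Return the human-readable name for an integer label."""
    return id2label[label_id]


# Record shaped like the dataset rows (text omitted for brevity)
example = {"id": "train_2136", "label": 1}
print(label_name(example["label"]))
```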

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

In [18]:
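def preprocess_function(examples):
    # Tokenize a batch of reviews. Truncation caps each sequence at the
    # tokenizer's maximum length; no padding is applied at this stage.
    tokens = tokenizer(
        examples["text"],
        truncation=True,
    )
    return tokens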

In [19]:
# This removes the text and id columns from the dataset as they are not needed
dataset_tokens = dataset.map(preprocess_function, batched=True, remove_columns=["id", "text"])
dataset_tokens


Map: 100%|██████████| 20000/20000 [00:03<00:00, 6107.53 examples/s]
Map: 100%|██████████| 25000/25000 [00:03<00:00, 6525.01 examples/s]
Map: 100%|██████████| 5000/5000 [00:00<00:00, 6392.43 examples/s]


DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    dev: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
})

In [39]:
dataset_tokens["train"][4]["input_ids"][:15]


[0, 100, 802, 42, 822, 21, 95, 59, 1969, 4, 20, 24173, 73, 29, 16598]

In [27]:
decoded_string = tokenizer.decode(dataset_tokens["train"][4]["input_ids"])
original_string = dataset["train"][4]["text"]
print(decoded_string)
print(original_string)

<s>I thought this film was just about perfect. The descriptions/summaries you'll read about this movie don't do it justice. The plot just does not sound very interesting, BUT IT IS. Just rent it and you will not be sorry!!</s>
I thought this film was just about perfect. The descriptions/summaries you'll read about this movie don't do it justice. The plot just does not sound very interesting, BUT IT IS. Just rent it and you will not be sorry!!
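
The two printed strings differ only by the `<s>` and `</s>` markers the tokenizer adds. A minimal sketch of turning that visual comparison into a check, using the strings printed above: `strip_special_tokens` is a hypothetical helper written for this illustration. With the real tokenizer, passing `skip_special_tokens=True` to `tokenizer.decode` should serve the same purpose.

```python
original_string = (
    "I thought this film was just about perfect. "
    "The descriptions/summaries you'll read about this movie don't do it justice. "
    "The plot just does not sound very interesting, BUT IT IS. "
    "Just rent it and you will not be sorry!!"
)
# Reconstructed from the decoded output printed above
decoded_string = "<s>" + original_string + "</s>"


def strip_special_tokens(text, bos="<s>", eos="</s>"):
    """Remove a leading BOS marker and a trailing EOS marker, if present."""
    if text.startswith(bos):
        text = text[len(bos):]
    if text.endswith(eos):
        text = text[:-len(eos)]
    return text


# True when tokenizing and decoding reproduces the original review
round_trip_ok = strip_special_tokens(decoded_string) == original_string
print(round_trip_ok)
```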


In [40]:
dataset_tokens.push_to_hub("imdb_sentiment_dataset")

Creating parquet from Arrow format: 100%|██████████| 20/20 [00:00<00:00, 117.67ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:01<00:00,  1.17s/it]
Creating parquet from Arrow format: 100%|██████████| 25/25 [00:00<00:00, 123.81ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:01<00:00,  1.72s/it]
Creating parquet from Arrow format: 100%|██████████| 5/5 [00:00<00:00, 95.74ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:01<00:00,  1.07s/it]


CommitInfo(commit_url='https://huggingface.co/datasets/BigTMiami/imdb_sentiment_dataset/commit/aefb7cb57b743624ce7e22d9fa56843f22387405', commit_message='Upload dataset', commit_description='', oid='aefb7cb57b743624ce7e22d9fa56843f22387405', pr_url=None, pr_revision=None, pr_num=None)