Loads the OpenSubtitles v2018 dataset without having to load everything into memory at once. Works well with PyTorch.

MiniXC/opensubtitles-dataloader

opensubtitles-dataloader

pip install opensubtitles-dataloader

Download, preprocess and use sentences from the OpenSubtitles v2018 dataset without ever needing to load all of it into memory.
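To give an intuition for how sentences can be served without holding the corpus in memory, here is a minimal sketch of one common technique: scan the file once to record where each line starts, then seek directly to the requested line. This is an illustration only and is not taken from this package's source; `build_line_offsets` and `read_line` are hypothetical names.

```python
import os
import tempfile


def build_line_offsets(path):
    """Scan the file once and record the byte offset where each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for line in handle:
            offsets.append(position)
            position += len(line)
    return offsets


def read_line(path, offsets, index):
    """Seek straight to one line and decode it, without reading the rest."""
    with open(path, "rb") as handle:
        handle.seek(offsets[index])
        return handle.readline().decode("utf-8").rstrip("\n")


# Small demonstration on a temporary file standing in for the corpus.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"Hello there.\nHow are you?\nFine, thanks.\n")
    demo_path = tmp.name

demo_offsets = build_line_offsets(demo_path)
second_line = read_line(demo_path, demo_offsets, 1)
os.remove(demo_path)
```

Only the list of integer offsets stays in memory, so the cost grows with the number of lines rather than with the size of the text.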

Download

See possible languages here.

opensubtitles-download en

Download the tokenized version.

opensubtitles-download en --token

Use in Python

Load

opensubtites_dataset = OpenSubtitlesDataset('en')

Load only the first 1 million lines.

opensubtites_dataset = OpenSubtitlesDataset('en', first_n_lines=1_000_000)

Group sentences into groups of 5.

opensubtites_dataset = OpenSubtitlesDataset('en', 5)

Group sentences into groups of between 2 and 5 sentences.

opensubtites_dataset = OpenSubtitlesDataset('en', (2,5))

Split sentences using "\n".

opensubtites_dataset = OpenSubtitlesDataset('en', delimiter="\n")
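The grouping and delimiter options above can be pictured with the following sketch. It is not the package's implementation: `group_sentences` is a hypothetical helper, and it assumes that a tuple means an inclusive size range sampled per group, that the default delimiter is a space, and that the final group may be shorter than the requested size.

```python
import random


def group_sentences(sentences, group_size, delimiter=" ", seed=None):
    """Join consecutive sentences into groups.

    group_size is an int (fixed size) or a (low, high) tuple, in which case
    each group's size is drawn uniformly from that inclusive range.
    The last group may be shorter if the sentences run out.
    """
    rng = random.Random(seed)
    groups = []
    start = 0
    while start < len(sentences):
        if isinstance(group_size, tuple):
            size = rng.randint(group_size[0], group_size[1])
        else:
            size = group_size
        groups.append(delimiter.join(sentences[start:start + size]))
        start += size
    return groups


lines = ["s%d" % i for i in range(10)]
fixed = group_sentences(lines, 5)
varied = group_sentences(lines, (2, 5), seed=0)
newline_joined = group_sentences(lines, 5, delimiter="\n")
```

With a fixed size of 5, the ten stand-in sentences become two groups; with `(2, 5)`, the number of groups depends on the sizes drawn.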

Do preprocessing.

opensubtites_dataset = OpenSubtitlesDataset('en', preprocess_function=my_preprocessing_function)
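As an example of what `my_preprocessing_function` might look like, here is a small sketch. It assumes the function receives one string and returns one string; the README does not state the exact signature, so treat that as an assumption to check against the package.

```python
import re


def my_preprocessing_function(text):
    """Lowercase, drop a leading dialogue dash, and collapse whitespace."""
    text = text.lower()
    # Subtitle lines often start with "- " to mark a change of speaker.
    text = re.sub(r"^\s*-\s*", "", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


cleaned = my_preprocessing_function("  - Where ARE   you going? ")
```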

Split for Training

train, valid, test = opensubtites_dataset.split()

Set the fraction of the original dataset assigned to each split.

train, valid, test = opensubtites_dataset.split([0.7, 0.15, 0.15])

Use a seed.

train, valid, test = opensubtites_dataset.split(seed=42)
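To illustrate how fractions and a seed can interact, the sketch below splits a range of indices reproducibly. It is a generic illustration, not the package's code: `split_indices` and its default fractions are hypothetical, and the package may split in a different way (for example, without shuffling individual lines).

```python
import random


def split_indices(n_items, fractions=(0.8, 0.1, 0.1), seed=None):
    """Shuffle indices with a seeded generator and cut them into three parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(n_items * fractions[0])
    n_valid = round(n_items * fractions[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    # The test split takes whatever remains, so no index is lost to rounding.
    test = indices[n_train + n_valid:]
    return train, valid, test


train_idx, valid_idx, test_idx = split_indices(100, (0.7, 0.15, 0.15), seed=42)
repeat = split_indices(100, (0.7, 0.15, 0.15), seed=42)
```

Calling the function twice with the same seed yields identical splits, which is what makes experiments comparable across runs.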

Access

By index.

train, valid, test = OpenSubtitlesDataset('en').splits()
train[20_000]

With a PyTorch DataLoader.

from torch.utils.data import DataLoader
train, valid, test = OpenSubtitlesDataset('en').splits()
train_loader = DataLoader(train, batch_size=16)
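Text samples in a batch usually differ in length, so a custom collate function is often useful. The sketch below assumes each dataset item is a plain string, which the README does not confirm; `pad_collate` and the `<pad>` token are hypothetical choices for illustration.

```python
def pad_collate(batch, pad_token="<pad>"):
    """Whitespace-tokenize each sample and pad all of them to the same length."""
    tokenized = [sample.split() for sample in batch]
    longest = max(len(tokens) for tokens in tokenized)
    return [tokens + [pad_token] * (longest - len(tokens)) for tokens in tokenized]


padded = pad_collate(["how are you", "fine"])
```

A function like this can be passed to the loader through its `collate_fn` argument, for example `DataLoader(train, batch_size=16, collate_fn=pad_collate)`.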
