This repository has been archived by the owner on Jul 4, 2023. It is now read-only.

Commit: Merge pull request #84 from PetrochukM/index_to_token
PyTorch-NLP 0.5.0
PetrochukM committed Nov 4, 2019
2 parents 49bb1d7 + 7f82397 commit 86a44fd
Showing 74 changed files with 1,124 additions and 855 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -75,7 +75,7 @@ target/
logs/**

# Coverage
coverage/**
cover/

# Data
@@ -94,3 +94,6 @@ data/**

# ReadTheDocs build files
docs/_build

# Python's virtual env
venv
6 changes: 6 additions & 0 deletions .travis.yml
@@ -1,12 +1,18 @@
language: python
matrix:
  include:
    - python: 3.5
      dist: xenial
      sudo: true
      env: RUN_DOCTEXT=false # Python 3.5 prints differently from Python 3.6
    - python: 3.6
      dist: xenial
      sudo: true
      env: RUN_DOCTEXT=true
    - python: 3.7
      dist: xenial
      sudo: true
      env: RUN_DOCTEXT=true

cache: pip
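For context on why doctests are skipped on Python 3.5: doctest compares printed output verbatim, and one plausible source of 3.5-vs-3.6 differences (an illustration, not taken from this repo) is dict key ordering, which CPython only began preserving in 3.6:

```python
# Hypothetical illustration: doctest compares printed output character by
# character, so any repr that changed between 3.5 and 3.6 breaks the test.
record = {'text': 'For a movie that gets..', 'sentiment': 'pos'}

# CPython 3.6+ preserves insertion order and prints:
#   {'text': 'For a movie that gets..', 'sentiment': 'pos'}
# CPython 3.5 may print the keys in either order, failing the comparison.
print(record)
```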

168 changes: 126 additions & 42 deletions README.md
@@ -1,43 +1,45 @@
<p align="center"><img width="55%" src="docs/_static/img/logo.svg" /></p>

-<h3 align="center">Supporting Rapid Prototyping with a Deep Learning NLP Toolkit&nbsp;&nbsp;
-<a href="https://twitter.com/intent/tweet?text=Supporting%20rapid%20prototyping%20for%20research,%20PyTorch-NLP%20has%20LAUNCHED,%20a%20deep%20learning%20natural%20language%20processing%20(NLP)%20toolkit!%20&url=https://github.com/PetrochukM/PyTorch-NLP&hashtags=pytorch,nlp,research">
-<img style='vertical-align: text-bottom !important;' src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social" alt="Tweet">
-</a>
-</h3>
<h3 align="center">Basic Utilities for PyTorch NLP Software</h3>

-PyTorch-NLP, or torchnlp for short, is a library of neural network layers, text processing modules and datasets designed to accelerate Natural Language Processing (NLP) research.

-Join our community, add datasets and neural network layers! Chat with us on [Gitter](https://gitter.im/PyTorch-NLP/Lobby) and join the [Google Group](https://groups.google.com/forum/#!forum/pytorch-nlp), we're eager to collaborate with you.
PyTorch-NLP, or `torchnlp` for short, is a library of basic utilities for PyTorch
Natural Language Processing (NLP). `torchnlp` extends PyTorch to provide you with
basic text data processing functions.

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pytorch-nlp.svg?style=flat-square)
[![Codecov](https://img.shields.io/codecov/c/github/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://codecov.io/gh/PetrochukM/PyTorch-NLP)
[![Downloads](http://pepy.tech/badge/pytorch-nlp)](http://pepy.tech/project/pytorch-nlp)
-[![Documentation Status]( https://img.shields.io/readthedocs/pytorchnlp/latest.svg?style=flat-square)](http://pytorchnlp.readthedocs.io/en/latest/?badge=latest&style=flat-square)
[![Documentation Status](https://img.shields.io/readthedocs/pytorchnlp/latest.svg?style=flat-square)](http://pytorchnlp.readthedocs.io/en/latest/?badge=latest&style=flat-square)
[![Build Status](https://img.shields.io/travis/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://travis-ci.org/PetrochukM/PyTorch-NLP)
[![Twitter: PetrochukM](https://img.shields.io/twitter/follow/MPetrochuk.svg?style=social)](https://twitter.com/MPetrochuk)

-_Logo by [Chloe Yeo](http://www.yeochloe.com/)_
_Logo by [Chloe Yeo](http://www.yeochloe.com/), Corporate Sponsorship by [WellSaid Labs](https://wellsaidlabs.com/)_

-## Installation
## Installation 🐾

-Make sure you have Python 3.6+ and PyTorch 1.0+. You can then install `pytorch-nlp` using
Make sure you have Python 3.5+ and PyTorch 1.0+. You can then install `pytorch-nlp` using
pip:

-    pip install pytorch-nlp

```bash
pip install pytorch-nlp
```

Or to install the latest code via:

-    pip install git+https://github.com/PetrochukM/PyTorch-NLP.git

```bash
pip install git+https://github.com/PetrochukM/PyTorch-NLP.git
```

-## Docs 📖
## Docs

-The complete documentation for PyTorch-NLP is available via [our ReadTheDocs website](https://pytorchnlp.readthedocs.io).
The complete documentation for PyTorch-NLP is available
via [our ReadTheDocs website](https://pytorchnlp.readthedocs.io).

-## Basics
## Get Started

-Add PyTorch-NLP to your project by following one of the common use cases:
Within an NLP data pipeline, you'll want to implement these basic steps:

-### Load a [Dataset](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html)
### Load Your Data 🐿

Load the IMDB dataset, for example:

@@ -49,51 +51,133 @@
train = imdb_dataset(train=True)
train[0] # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}
```

-### Apply [Neural Networks](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.nn.html) Layers
Load a custom dataset, for example:

```python
from pathlib import Path

from torchnlp.download import download_file_maybe_extract

directory_path = Path('data/')
train_file_path = Path('trees/train.txt')

download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])

open(directory_path / train_file_path)
```

Don't worry, we'll handle caching for you!

-For example, from the neural network package, apply state-of-the-art LockedDropout:
### Text To Tensor

Tokenize and encode your text as a tensor. For example, a `WhitespaceEncoder` breaks
text into terms whenever it encounters a whitespace character.

```python
from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
```
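As a usage note, the encoder also exposes its vocabulary and a `decode` method that reverses `encode` (both appear in the encode/decode example removed further down); the token ids below are illustrative:

```python
from torchnlp.encoders.text import WhitespaceEncoder

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

tokens = encoder.encode("now this ain't funny")
print(encoder.vocab_size)      # number of known tokens, including reserved ones
print(encoder.decode(tokens))  # "now this ain't funny"
```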

### Tensor To Batch

With your loaded and encoded data in hand, you'll want to batch your dataset.

```python
import torch
-from torchnlp.nn import LockedDropout
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

-input_ = torch.randn(6, 3, 10)
-dropout = LockedDropout(0.5)
encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

-# Apply a LockedDropout to `input_`
-dropout(input_)  # RETURNS: torch.FloatTensor (6x3x10)
train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]
```

-### [Encode Text](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.encoders.text.html)
PyTorch-NLP builds on top of PyTorch's existing `torch.utils.data.sampler`, `torch.stack`
and `default_collate` to support sequential inputs of varying lengths!
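To make the padding concrete, here is a minimal sketch; it assumes `stack_and_pad_tensors` zero-pads to the longest sequence and returns the stacked tensor together with the original lengths, which matches its tuple-unpacked usage in `examples/snli/util.py` below:

```python
import torch
from torchnlp.encoders.text import stack_and_pad_tensors

# Two sequences of different lengths are padded to a common length and stacked.
batch = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
padded, lengths = stack_and_pad_tensors(batch)

print(padded)   # tensor([[1, 2, 3], [4, 5, 0]])
print(lengths)  # tensor([3, 2])
```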

### You're Good To Go!

With your batch in hand, you can use PyTorch to develop and train your model using gradient descent.
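For illustration only, a minimal gradient-descent loop might look like the sketch below; the model, features, and targets are hypothetical stand-ins rather than part of PyTorch-NLP:

```python
import torch

model = torch.nn.Linear(10, 2)                            # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

features = torch.randn(8, 10)                             # stand-in batch of 8 examples
targets = torch.randint(0, 2, (8,))                       # stand-in labels

for epoch in range(3):
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    optimizer.step()                                      # one gradient descent step
```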

### Last But Not Least

-Tokenize and encode text as a tensor. For example, a `WhitespaceEncoder` breaks text into terms whenever it encounters a whitespace character.
PyTorch-NLP has a couple more NLP-focused utility packages to support you! 🤗

#### Deterministic Functions

Now that you've set up your pipeline, you may want to ensure that some functions run
deterministically. Wrap any code that's random with `fork_rng`, and you'll be good to go, like so:

```python
import random
import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
```

This will always print:

```text
Random: 224899943
Numpy: 843828735
Torch: 843828736
```

#### Pre-Trained Word Vectors

Now that you've computed your vocabulary, you may want to make use of
pre-trained word vectors, like so:

```python
import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe

# Create a `WhitespaceEncoder` with a corpus of text
encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

-# Encode and decode phrases
-encoder.encode("this ain't funny.")  # RETURNS: torch.Tensor([6, 7, 1])
-encoder.decode(encoder.encode("This ain't funny."))  # RETURNS: "this ain't funny."
vocab = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
```
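A common follow-up, not shown in the README, is to load these weights into an embedding layer; `torch.nn.Embedding.from_pretrained` (available since PyTorch 1.0) accepts the weight matrix built above:

```python
import torch

# Assumes `embedding_weights` from the snippet above: one row per vocab token.
embedding = torch.nn.Embedding.from_pretrained(embedding_weights, freeze=True)

# Token ids produced by the encoder now map directly to GloVe vectors.
print(embedding(torch.tensor([0, 1, 2])).shape)  # torch.Size([3, 100])
```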

-### Load [Word Vectors](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.word_to_vector.html)
#### Neural Network Layers

-For example, load FastText, state-of-the-art English word vectors:
For example, from the neural network package, apply the state-of-the-art `LockedDropout`:

```python
-from torchnlp.word_to_vector import FastText
import torch
from torchnlp.nn import LockedDropout

input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)

-vectors = FastText()
-# Load vectors for any word as a `torch.FloatTensor`
-vectors['hello']  # RETURNS: [torch.FloatTensor of size 300]
# Apply a LockedDropout to `input_`
dropout(input_) # RETURNS: torch.FloatTensor (6x3x10)
```
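What makes `LockedDropout` "locked" is that a single dropout mask is reused across the whole sequence instead of being resampled at every timestep. A quick check, assuming the input layout is `(seq_len, batch, features)` with the mask shared across the first dimension:

```python
import torch
from torchnlp.nn import LockedDropout

dropout = LockedDropout(0.5)
out = dropout(torch.ones(6, 3, 10))

# Every timestep is zeroed at the same positions, i.e. the mask is "locked".
print(torch.equal(out[0], out[5]))  # True
```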

-### Compute [Metrics](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.metrics.html)
#### Metrics

-Finally, compute common metrics such as the BLEU score.
Compute common NLP metrics such as the BLEU score.

```python
from torchnlp.metrics import get_moses_multi_bleu
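# NOTE: the body of this example is collapsed in the diff view; the lines
# below are an illustrative sketch, not the original snippet. They assume
# `get_moses_multi_bleu` takes lists of hypothesis/reference strings and
# runs the Moses multi-bleu.perl script (perl required at runtime).
hypotheses = ["The brown fox jumps over the dog"]
references = ["The quick brown fox jumps over the lazy dog"]

score = get_moses_multi_bleu(hypotheses, references, lowercase=True)  # corpus-level BLEU as a float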
```

@@ -131,8 +215,8 @@
AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.

## Authors

-* [Michael Petrochuk](https://github.com/PetrochukM/) — Developer
-* [Chloe Yeo](http://www.yeochloe.com/) — Logo Design
- [Michael Petrochuk](https://github.com/PetrochukM/) — Developer
- [Chloe Yeo](http://www.yeochloe.com/) — Logo Design

## Citing

2 changes: 2 additions & 0 deletions build_tools/travis/install.sh
@@ -27,6 +27,8 @@
pip install -r requirements.txt --progress-bar off
pip install spacy --progress-bar off
pip install nltk --progress-bar off
pip install sacremoses --progress-bar off
pip install pandas --progress-bar off
pip install requests --progress-bar off

# SpaCy English web model
python -m spacy download en
Expand Down
5 changes: 4 additions & 1 deletion build_tools/travis/test_script.sh
@@ -21,10 +21,13 @@
if [[ "$RUN_FLAKE8" == "true" ]]; then
fi

run_tests() {
-    TEST_CMD="python -m pytest tests/ torchnlp/ --verbose --durations=20 --cov=torchnlp --doctest-modules"
    TEST_CMD="python -m pytest tests/ torchnlp/ -c /dev/null --verbose --durations=10 --cov=torchnlp"
    if [[ "$RUN_SLOW" == "true" ]]; then
        TEST_CMD="$TEST_CMD --runslow"
    fi
    if [[ "$RUN_DOCTEXT" == "true" ]]; then
        TEST_CMD="$TEST_CMD --doctest-modules"
    fi
    $TEST_CMD
}

28 changes: 16 additions & 12 deletions examples/snli/train.py
@@ -1,18 +1,19 @@
from functools import partial

import glob
import itertools
import os
import time
-import glob

from torch.utils.data import DataLoader
from torch.utils.data.sampler import SequentialSampler

import torch
import torch.optim as optim
import torch.nn as nn

from torchnlp.samplers import BucketBatchSampler
from torchnlp.datasets import snli_dataset
-from torchnlp.utils import datasets_iterator
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.encoders import LabelEncoder
from torchnlp import word_to_vector
@@ -29,20 +30,20 @@
train, dev, test = snli_dataset(train=True, dev=True, test=True)

# Preprocess
-for row in datasets_iterator(train, dev, test):
for row in itertools.chain(train, dev, test):
    row['premise'] = row['premise'].lower()
    row['hypothesis'] = row['hypothesis'].lower()

# Make Encoders
-sentence_corpus = [row['premise'] for row in datasets_iterator(train, dev, test)]
-sentence_corpus += [row['hypothesis'] for row in datasets_iterator(train, dev, test)]
sentence_corpus = [row['premise'] for row in itertools.chain(train, dev, test)]
sentence_corpus += [row['hypothesis'] for row in itertools.chain(train, dev, test)]
sentence_encoder = WhitespaceEncoder(sentence_corpus)

-label_corpus = [row['label'] for row in datasets_iterator(train, dev, test)]
label_corpus = [row['label'] for row in itertools.chain(train, dev, test)]
label_encoder = LabelEncoder(label_corpus)

# Encode
-for row in datasets_iterator(train, dev, test):
for row in itertools.chain(train, dev, test):
    row['premise'] = sentence_encoder.encode(row['premise'])
    row['hypothesis'] = sentence_encoder.encode(row['hypothesis'])
    row['label'] = label_encoder.encode(row['label'])
@@ -88,11 +89,12 @@
for epoch in range(args.epochs):
    n_correct, n_total = 0, 0

-   train_sampler = BucketBatchSampler(
-       train, args.batch_size, True, sort_key=lambda r: len(row['premise']))
    train_sampler = SequentialSampler(train)
    train_batch_sampler = BucketBatchSampler(
        train_sampler, args.batch_size, True, sort_key=lambda i: len(train[i]['premise']))
    train_iterator = DataLoader(
        train,
        batch_sampler=train_batch_sampler,
        collate_fn=collate_fn,
        pin_memory=torch.cuda.is_available(),
        num_workers=0)
@@ -139,11 +141,13 @@

# calculate accuracy on validation set
n_dev_correct, dev_loss = 0, 0
dev_sampler = BucketBatchSampler(

dev_sampler = SequentialSampler(train)
dev_batch_sampler = BucketBatchSampler(
dev, args.batch_size, True, sort_key=lambda r: len(row['premise']))
dev_iterator = DataLoader(
dev,
batch_sampler=dev_sampler,
batch_sampler=dev_batch_sampler,
collate_fn=partial(collate_fn, train=False),
pin_memory=torch.cuda.is_available(),
num_workers=0)
6 changes: 3 additions & 3 deletions examples/snli/util.py
@@ -4,7 +4,7 @@

import torch

-from torchnlp.encoders.text import pad_batch
from torchnlp.encoders.text import stack_and_pad_tensors


def makedirs(name):
@@ -55,8 +55,8 @@ def get_args():

def collate_fn(batch, train=True):
    """ list of tensors to a batch tensors """
-   premise_batch, _ = pad_batch([row['premise'] for row in batch])
-   hypothesis_batch, _ = pad_batch([row['hypothesis'] for row in batch])
    premise_batch, _ = stack_and_pad_tensors([row['premise'] for row in batch])
    hypothesis_batch, _ = stack_and_pad_tensors([row['hypothesis'] for row in batch])
    label_batch = torch.stack([row['label'] for row in batch])

    # PyTorch RNN requires batches to be transposed for speed and integration with CUDA
