This repository has been archived by the owner on Jul 4, 2023. It is now read-only.

Commit: Merge pull request #84 from PetrochukM/index_to_token
PyTorch-NLP 0.5.0
PetrochukM committed Nov 4, 2019
2 parents 49bb1d7 + 7f82397 commit 86a44fd
Showing 74 changed files with 1,124 additions and 855 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -75,7 +75,7 @@ target/
logs/**

# Coverage
coverage/**
cover/

# Data
@@ -94,3 +94,6 @@ data/**

# ReadTheDocs build files
docs/_build

# Python's virtual env
venv
6 changes: 6 additions & 0 deletions .travis.yml
@@ -1,12 +1,18 @@
language: python
matrix:
  include:
    - python: 3.5
      dist: xenial
      sudo: true
      env: RUN_DOCTEXT=false # Python 3.5 prints differently from Python 3.6
    - python: 3.6
      dist: xenial
      sudo: true
      env: RUN_DOCTEXT=true
    - python: 3.7
      dist: xenial
      sudo: true
      env: RUN_DOCTEXT=true

cache: pip
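For context on why doctests are skipped on Python 3.5: doctest compares printed output verbatim, and one plausible source of 3.5-vs-3.6 differences (an illustration, not taken from this repo) is dict key ordering, which CPython only began preserving in 3.6:

```python
# Hypothetical illustration: doctest compares printed output character by
# character, so any repr that changed between 3.5 and 3.6 breaks the test.
record = {'text': 'For a movie that gets..', 'sentiment': 'pos'}

# CPython 3.6+ preserves insertion order and prints:
#   {'text': 'For a movie that gets..', 'sentiment': 'pos'}
# CPython 3.5 may print the keys in either order, failing the comparison.
print(record)
```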

168 changes: 126 additions & 42 deletions README.md
@@ -1,43 +1,45 @@
<p align="center"><img width="55%" src="docs/_static/img/logo.svg" /></p>

-<h3 align="center">Supporting Rapid Prototyping with a Deep Learning NLP Toolkit&nbsp;&nbsp;
-<a href="https://twitter.com/intent/tweet?text=Supporting%20rapid%20prototyping%20for%20research,%20PyTorch-NLP%20has%20LAUNCHED,%20a%20deep%20learning%20natural%20language%20processing%20(NLP)%20toolkit!%20&url=https://github.com/PetrochukM/PyTorch-NLP&hashtags=pytorch,nlp,research">
-<img style='vertical-align: text-bottom !important;' src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social" alt="Tweet">
-</a>
-</h3>
<h3 align="center">Basic Utilities for PyTorch NLP Software</h3>

-PyTorch-NLP, or torchnlp for short, is a library of neural network layers, text processing modules and datasets designed to accelerate Natural Language Processing (NLP) research.

-Join our community, add datasets and neural network layers! Chat with us on [Gitter](https://gitter.im/PyTorch-NLP/Lobby) and join the [Google Group](https://groups.google.com/forum/#!forum/pytorch-nlp), we're eager to collaborate with you.
PyTorch-NLP, or `torchnlp` for short, is a library of basic utilities for PyTorch
Natural Language Processing (NLP). `torchnlp` extends PyTorch to provide you with
basic text data processing functions.

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pytorch-nlp.svg?style=flat-square)
[![Codecov](https://img.shields.io/codecov/c/github/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://codecov.io/gh/PetrochukM/PyTorch-NLP)
[![Downloads](http://pepy.tech/badge/pytorch-nlp)](http://pepy.tech/project/pytorch-nlp)
-[![Documentation Status]( https://img.shields.io/readthedocs/pytorchnlp/latest.svg?style=flat-square)](http://pytorchnlp.readthedocs.io/en/latest/?badge=latest&style=flat-square)
[![Documentation Status](https://img.shields.io/readthedocs/pytorchnlp/latest.svg?style=flat-square)](http://pytorchnlp.readthedocs.io/en/latest/?badge=latest&style=flat-square)
[![Build Status](https://img.shields.io/travis/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://travis-ci.org/PetrochukM/PyTorch-NLP)
[![Twitter: PetrochukM](https://img.shields.io/twitter/follow/MPetrochuk.svg?style=social)](https://twitter.com/MPetrochuk)

-_Logo by [Chloe Yeo](http://www.yeochloe.com/)_
_Logo by [Chloe Yeo](http://www.yeochloe.com/), Corporate Sponsorship by [WellSaid Labs](https://wellsaidlabs.com/)_

-## Installation
## Installation 🐾

-Make sure you have Python 3.6+ and PyTorch 1.0+. You can then install `pytorch-nlp` using
Make sure you have Python 3.5+ and PyTorch 1.0+. You can then install `pytorch-nlp` using
pip:

-    pip install pytorch-nlp

```bash
pip install pytorch-nlp
```

Or to install the latest code via:

-    pip install git+https://github.com/PetrochukM/PyTorch-NLP.git

```bash
pip install git+https://github.com/PetrochukM/PyTorch-NLP.git
```

-## Docs 📖
## Docs

-The complete documentation for PyTorch-NLP is available via [our ReadTheDocs website](https://pytorchnlp.readthedocs.io).
The complete documentation for PyTorch-NLP is available
via [our ReadTheDocs website](https://pytorchnlp.readthedocs.io).

-## Basics
## Get Started

-Add PyTorch-NLP to your project by following one of the common use cases:
Within an NLP data pipeline, you'll want to implement these basic steps:

-### Load a [Dataset](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html)
### Load Your Data 🐿

Load the IMDB dataset, for example:

@@ -49,51 +51,133 @@
train = imdb_dataset(train=True)
train[0] # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}
```

-### Apply [Neural Networks](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.nn.html) Layers
Load a custom dataset, for example:

```python
from pathlib import Path

from torchnlp.download import download_file_maybe_extract

directory_path = Path('data/')
train_file_path = Path('trees/train.txt')

download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])

open(directory_path / train_file_path)
```

Don't worry, we'll handle caching for you!

-For example, from the neural network package, apply state-of-the-art LockedDropout:
### Text To Tensor

Tokenize and encode your text as a tensor. For example, a `WhitespaceEncoder` breaks
text into terms whenever it encounters a whitespace character.

```python
from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
```
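As a usage note, the encoder also exposes its vocabulary and a `decode` method that reverses `encode` (both appear in the encode/decode example removed further down); the token ids below are illustrative:

```python
from torchnlp.encoders.text import WhitespaceEncoder

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

tokens = encoder.encode("now this ain't funny")
print(encoder.vocab_size)      # number of known tokens, including reserved ones
print(encoder.decode(tokens))  # "now this ain't funny"
```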

### Tensor To Batch

With your loaded and encoded data in hand, you'll want to batch your dataset.

```python
import torch
-from torchnlp.nn import LockedDropout
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

-input_ = torch.randn(6, 3, 10)
-dropout = LockedDropout(0.5)
encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

-# Apply a LockedDropout to `input_`
-dropout(input_)  # RETURNS: torch.FloatTensor (6x3x10)
train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]
```

-### [Encode Text](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.encoders.text.html)
PyTorch-NLP builds on top of PyTorch's existing `torch.utils.data.sampler`, `torch.stack`
and `default_collate` to support sequential inputs of varying lengths!
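To make the padding concrete, here is a minimal sketch; it assumes `stack_and_pad_tensors` zero-pads to the longest sequence and returns the stacked tensor together with the original lengths, which matches its tuple-unpacked usage in `examples/snli/util.py` below:

```python
import torch
from torchnlp.encoders.text import stack_and_pad_tensors

# Two sequences of different lengths are padded to a common length and stacked.
batch = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
padded, lengths = stack_and_pad_tensors(batch)

print(padded)   # tensor([[1, 2, 3], [4, 5, 0]])
print(lengths)  # tensor([3, 2])
```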

### You're Good To Go!

With your batch in hand, you can use PyTorch to develop and train your model using gradient descent.
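For illustration only, a minimal gradient-descent loop might look like the sketch below; the model, features, and targets are hypothetical stand-ins rather than part of PyTorch-NLP:

```python
import torch

model = torch.nn.Linear(10, 2)                            # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

features = torch.randn(8, 10)                             # stand-in batch of 8 examples
targets = torch.randint(0, 2, (8,))                       # stand-in labels

for epoch in range(3):
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    optimizer.step()                                      # one gradient descent step
```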

### Last But Not Least

-Tokenize and encode text as a tensor. For example, a `WhitespaceEncoder` breaks text into terms whenever it encounters a whitespace character.
PyTorch-NLP has a couple more NLP-focused utility packages to support you! 🤗

#### Deterministic Functions

Now that you've set up your pipeline, you may want to ensure that some functions run
deterministically. Wrap any code that's random with `fork_rng`, and you'll be good to go, like so:

```python
import random
import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
```

This will always print:

```text
Random: 224899943
Numpy: 843828735
Torch: 843828736
```

#### Pre-Trained Word Vectors

Now that you've computed your vocabulary, you may want to make use of
pre-trained word vectors, like so:

```python
import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe

# Create a `WhitespaceEncoder` with a corpus of text
encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

-# Encode and decode phrases
-encoder.encode("this ain't funny.")  # RETURNS: torch.Tensor([6, 7, 1])
-encoder.decode(encoder.encode("This ain't funny."))  # RETURNS: "this ain't funny."
vocab = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
```
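A common follow-up, not shown in the README, is to load these weights into an embedding layer; `torch.nn.Embedding.from_pretrained` (available since PyTorch 1.0) accepts the weight matrix built above:

```python
import torch

# Assumes `embedding_weights` from the snippet above: one row per vocab token.
embedding = torch.nn.Embedding.from_pretrained(embedding_weights, freeze=True)

# Token ids produced by the encoder now map directly to GloVe vectors.
print(embedding(torch.tensor([0, 1, 2])).shape)  # torch.Size([3, 100])
```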

-### Load [Word Vectors](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.word_to_vector.html)
#### Neural Network Layers

-For example, load FastText, state-of-the-art English word vectors:
For example, from the neural network package, apply the state-of-the-art `LockedDropout`:

```python
-from torchnlp.word_to_vector import FastText
import torch
from torchnlp.nn import LockedDropout

input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)

-vectors = FastText()
-# Load vectors for any word as a `torch.FloatTensor`
-vectors['hello']  # RETURNS: [torch.FloatTensor of size 300]
# Apply a LockedDropout to `input_`
dropout(input_) # RETURNS: torch.FloatTensor (6x3x10)
```
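What makes `LockedDropout` "locked" is that a single dropout mask is reused across the whole sequence instead of being resampled at every timestep. A quick check, assuming the input layout is `(seq_len, batch, features)` with the mask shared across the first dimension:

```python
import torch
from torchnlp.nn import LockedDropout

dropout = LockedDropout(0.5)
out = dropout(torch.ones(6, 3, 10))

# Every timestep is zeroed at the same positions, i.e. the mask is "locked".
print(torch.equal(out[0], out[5]))  # True
```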

-### Compute [Metrics](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.metrics.html)
#### Metrics

-Finally, compute common metrics such as the BLEU score.
Compute common NLP metrics such as the BLEU score.

```python
from torchnlp.metrics import get_moses_multi_bleu
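# NOTE: the body of this example is collapsed in the diff view; the lines
# below are an illustrative sketch, not the original snippet. They assume
# `get_moses_multi_bleu` takes lists of hypothesis/reference strings and
# runs the Moses multi-bleu.perl script (perl required at runtime).
hypotheses = ["The brown fox jumps over the dog"]
references = ["The quick brown fox jumps over the lazy dog"]

score = get_moses_multi_bleu(hypotheses, references, lowercase=True)  # corpus-level BLEU as a float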
```

@@ -131,8 +215,8 @@
AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.

## Authors

-* [Michael Petrochuk](https://github.com/PetrochukM/) — Developer
-* [Chloe Yeo](http://www.yeochloe.com/) — Logo Design
- [Michael Petrochuk](https://github.com/PetrochukM/) — Developer
- [Chloe Yeo](http://www.yeochloe.com/) — Logo Design

## Citing

2 changes: 2 additions & 0 deletions build_tools/travis/install.sh
@@ -27,6 +27,8 @@
pip install -r requirements.txt --progress-bar off
pip install spacy --progress-bar off
pip install nltk --progress-bar off
pip install sacremoses --progress-bar off
pip install pandas --progress-bar off
pip install requests --progress-bar off

# SpaCy English web model
python -m spacy download en
Expand Down
5 changes: 4 additions & 1 deletion build_tools/travis/test_script.sh
@@ -21,10 +21,13 @@
if [[ "$RUN_FLAKE8" == "true" ]]; then
fi

run_tests() {
-    TEST_CMD="python -m pytest tests/ torchnlp/ --verbose --durations=20 --cov=torchnlp --doctest-modules"
    TEST_CMD="python -m pytest tests/ torchnlp/ -c /dev/null --verbose --durations=10 --cov=torchnlp"
    if [[ "$RUN_SLOW" == "true" ]]; then
        TEST_CMD="$TEST_CMD --runslow"
    fi
    if [[ "$RUN_DOCTEXT" == "true" ]]; then
        TEST_CMD="$TEST_CMD --doctest-modules"
    fi
    $TEST_CMD
}

28 changes: 16 additions & 12 deletions examples/snli/train.py
@@ -1,18 +1,19 @@
from functools import partial

import glob
import itertools
import os
import time
-import glob

from torch.utils.data import DataLoader
from torch.utils.data.sampler import SequentialSampler

import torch
import torch.optim as optim
import torch.nn as nn

from torchnlp.samplers import BucketBatchSampler
from torchnlp.datasets import snli_dataset
-from torchnlp.utils import datasets_iterator
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.encoders import LabelEncoder
from torchnlp import word_to_vector
@@ -29,20 +30,20 @@
train, dev, test = snli_dataset(train=True, dev=True, test=True)

# Preprocess
-for row in datasets_iterator(train, dev, test):
for row in itertools.chain(train, dev, test):
    row['premise'] = row['premise'].lower()
    row['hypothesis'] = row['hypothesis'].lower()

# Make Encoders
-sentence_corpus = [row['premise'] for row in datasets_iterator(train, dev, test)]
-sentence_corpus += [row['hypothesis'] for row in datasets_iterator(train, dev, test)]
sentence_corpus = [row['premise'] for row in itertools.chain(train, dev, test)]
sentence_corpus += [row['hypothesis'] for row in itertools.chain(train, dev, test)]
sentence_encoder = WhitespaceEncoder(sentence_corpus)

-label_corpus = [row['label'] for row in datasets_iterator(train, dev, test)]
label_corpus = [row['label'] for row in itertools.chain(train, dev, test)]
label_encoder = LabelEncoder(label_corpus)

# Encode
-for row in datasets_iterator(train, dev, test):
for row in itertools.chain(train, dev, test):
    row['premise'] = sentence_encoder.encode(row['premise'])
    row['hypothesis'] = sentence_encoder.encode(row['hypothesis'])
    row['label'] = label_encoder.encode(row['label'])
@@ -88,11 +89,12 @@
for epoch in range(args.epochs):
    n_correct, n_total = 0, 0

-   train_sampler = BucketBatchSampler(
-       train, args.batch_size, True, sort_key=lambda r: len(row['premise']))
    train_sampler = SequentialSampler(train)
    train_batch_sampler = BucketBatchSampler(
        train_sampler, args.batch_size, True, sort_key=lambda i: len(train[i]['premise']))
    train_iterator = DataLoader(
        train,
        batch_sampler=train_batch_sampler,
        collate_fn=collate_fn,
        pin_memory=torch.cuda.is_available(),
        num_workers=0)
@@ -139,11 +141,13 @@

# calculate accuracy on validation set
n_dev_correct, dev_loss = 0, 0
dev_sampler = BucketBatchSampler(

dev_sampler = SequentialSampler(train)
dev_batch_sampler = BucketBatchSampler(
dev, args.batch_size, True, sort_key=lambda r: len(row['premise']))
dev_iterator = DataLoader(
dev,
batch_sampler=dev_sampler,
batch_sampler=dev_batch_sampler,
collate_fn=partial(collate_fn, train=False),
pin_memory=torch.cuda.is_available(),
num_workers=0)
6 changes: 3 additions & 3 deletions examples/snli/util.py
@@ -4,7 +4,7 @@

import torch

-from torchnlp.encoders.text import pad_batch
from torchnlp.encoders.text import stack_and_pad_tensors


def makedirs(name):
@@ -55,8 +55,8 @@ def get_args():

def collate_fn(batch, train=True):
    """ list of tensors to a batch tensors """
-   premise_batch, _ = pad_batch([row['premise'] for row in batch])
-   hypothesis_batch, _ = pad_batch([row['hypothesis'] for row in batch])
    premise_batch, _ = stack_and_pad_tensors([row['premise'] for row in batch])
    hypothesis_batch, _ = stack_and_pad_tensors([row['hypothesis'] for row in batch])
    label_batch = torch.stack([row['label'] for row in batch])

    # PyTorch RNN requires batches to be transposed for speed and integration with CUDA
