Release candidate v0.2.0 (#10)

* Added some context for EvNN
* AdamW, Simpler Thresholds (#5)
* Switched to AdamW optimizer, simplified threshold parameterization, slight changes to the training of thresholds
* removed wandb from training script
* fixed inference script and updated README.md
* Improved setup and install (#8)
* improve setup and remove makefiles
* remove makefile

---------

authored-by: KhaleelKhan <khaleelulla.khan@tu-dresden.de>

* bump up version, update readme
* include required files in distributed archive
* only require nvcc to compile cuda kernels
* cleaned LM code from pruning attempts
* update changelog and prepare merge

---------

Co-authored-by: Anand <anandtrex@users.noreply.github.com>
Co-authored-by: Mark Schoene <mark.schoene@tu-dresden.de>
Co-authored-by: KhaleelKhan <khaleelulla.khan@tu-dresden.de>
Efficient-Scalable-Machine-Learning · May 24, 2024 · commit eaf293a
1 parent 1c49161

Showing 17 changed files with 339 additions and 506 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,14 @@
 # ChangeLog
 
+## 0.2.0-egru (2024-05-24)
+### Changed
+- Simplified install and removed makefile
+- CUDA compute capability is automatically detected
+- Update Readme with the setup instruction
+- Update Dockerfile
+- Cleaned LM pruning code
+
+
 ## 0.1.0-egru (2022-03-01)
 ### Changed
 - Project forked from original

diff --git a/build/MANIFEST.in b/MANIFEST.in
@@ -1,4 +1,3 @@
-include Makefile
 include frameworks/pytorch/*.h
 include frameworks/pytorch/*.cc
 include lib/*.cc

diff --git a/Makefile b/Makefile
diff --git a/README.md b/README.md
@@ -30,36 +30,41 @@ Here's what you'll need to get started:
 - a [CUDA Compute Capability](https://developer.nvidia.com/cuda-gpus) 3.7+ GPU (required only if using GPU)
 - [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) 11.0+ (required only if using GPU)
 - [PyTorch](https://pytorch.org) 1.3+ for PyTorch integration (GPU optional)
-- [BLAS](https://netlib.org/blas/) or any BLAS-like library for CPU computation.
-- [Eigen 3](http://eigen.tuxfamily.org/) to build the C++ examples (optional)
+- [OpenBLAS](https://www.openblas.net/) or any BLAS-like library for CPU computation.
 
 Once you have the prerequisites, you can install with pip or by building the source code.
 
-<!-- ### Using pip
+### Using pip
 ```
 pip install evnn_pytorch
-``` -->
+```
 
 ### Building from source
 > **Note**
 > 
 > Currently supported only on Linux; use Docker for building on Windows.
 
+Build and install it with `pip`:
 ```bash
-make evnn_pytorch # Build PyTorch API
+pip install .
 ```
+### Building in Docker
 
-If you built the PyTorch API, install it with `pip`:
+Build docker image:
 ```bash
-pip install evnn_pytorch-*.whl
+docker build -t evnn -f docker/Dockerfile .
 ```
 
-If the CUDA Toolkit that you're building against is not in `/usr/local/cuda`, you must specify the
-`$CUDA_HOME` environment variable before running make:
+Example usage:
 ```bash
-CUDA_HOME=/usr/local/cuda-10.2 make
+docker run --rm --gpus=all evnn python -m unittest discover -p "*_test.py" -s /evnn_src/validation -v
 ```
 
+> **Note**
+> 
+> The build script tries to automatically detect the GPU compute capability. If no GPU is available during compilation, for example when building with Docker or when compiling on the login nodes of a compute cluster, use the environment variable `EVNN_CUDA_COMPUTE` to set the required compute capability.
+> Example: for CUDA compute capability 8.0, use ```export EVNN_CUDA_COMPUTE=80```
+
 ## Performance
 
 Code for the experiments and benchmarks presented in the paper are published in ``benchmarks`` directory.
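
The note added to the README describes a fallback: detect the GPU compute capability automatically, and rely on `EVNN_CUDA_COMPUTE` when no GPU is visible at build time. The following is a minimal sketch of that idea, not the project's actual build script. The function names `resolve_compute_capability` and `detect_with_torch` are hypothetical, and giving the environment variable precedence over detection is an assumption made for illustration.

```python
import os


def detect_with_torch():
    """Return (major, minor) of the first visible GPU, or None if unavailable."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


def resolve_compute_capability(env=None, detect=detect_with_torch):
    """Return a compute capability string such as "80", or None if unknown.

    Assumption: an explicit EVNN_CUDA_COMPUTE value wins over auto-detection.
    """
    env = os.environ if env is None else env
    override = env.get("EVNN_CUDA_COMPUTE", "").strip()
    if override:
        return override
    capability = detect()
    if capability is None:
        return None
    major, minor = capability
    return f"{major}{minor}"
```

On a login node without a GPU, `resolve_compute_capability({"EVNN_CUDA_COMPUTE": "80"})` would return `"80"` in this sketch, matching the `export EVNN_CUDA_COMPUTE=80` example in the note.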

diff --git a/benchmarks/lm/README.md b/benchmarks/lm/README.md
@@ -4,18 +4,30 @@ To run the language modeling experiments, first download the data
 
     ./getdata <data_dir>
 
-Then run Penn Treebank experiments with EGRU (1350 units)
+We [provide checkpoints for EGRU](https://cloudstore.zih.tu-dresden.de/index.php/s/NPQ9pLnpZnTsM5X) with 3 layers of hidden size (1350, 1350, 750)
 
-    python lm/train.py --data path_to_your_data --scratch ./log --dataset PTB --epochs 2500 --rnn_type egru --layers 3 --hidden_dim 1350 --batch_size=64 --bptt=68 --dropout_connect=0.6788113982442464 --dropout_emb=0.7069992062976298 --dropout_forward=0.2641540030663871 --dropout_words=0.05460274136214911 --emb_dim=788 --learning_rate=0.00044406742918918466 --pseudo_derivative_width=2.179414375864446 --thr_init_mean=-3.76855645544185 --weight_decay=9.005509348932795e-06 --seed 12008
+# Penn Treebank
+To train EGRU on Penn Treebank word-level language modeling, run
 
-or EGRU (2000 units)
+    python benchmarks/lm/train.py --data=/path/to/data --scratch=/your/scratch/directory/Experiments --dataset=PTB --epochs=1000 --batch_size=64 --rnn_type=egru --layer=3 --bptt=70 --scheduler=cosine --weight_decay=0.10 --learning_rate=0.0012 --learning_rate_thresholds 0.0 --emb_dim=750 --dropout_emb=0.6 --dropout_words=0.1 --dropout_forward=0.25 --grad_clip=0.1 --thr_init_mean=0.01 --dropout_connect=0.7 --hidden_dim=1350 --pseudo_derivative_width=3.6 --scheduler_start=700 --seed=9612
 
-    python lm/train.py --data path_to_your_data --scratch ./log --dataset PTB --epochs 2500 --rnn_type egru --layers 3 --hidden_dim 2000 --batch_size=128 --bptt=67 --dropout_connect=0.621405385527356 --dropout_emb=0.7651296208061924 --dropout_forward=0.24131807369801447 --dropout_words=0.14942681962154375 --emb_dim=786 --learning_rate=0.000494172266064804 --pseudo_derivative_width=2.35216907207571 --thr_init_mean=-3.4957794302256007 --weight_decay=6.6878095661652755e-06 --seed 52798
+For inference with the [provided checkpoint](https://cloudstore.zih.tu-dresden.de/index.php/s/NPQ9pLnpZnTsM5X), run
+
+    python benchmarks/lm/infer.py --data /path/to/data --dataset PTB --datasplit test --batch_size 1 --directory /path/to/checkpoint
+
+# Wikitext-2
+To train EGRU on Wikitext-2, run
+
+    python benchmarks/lm/train.py --data=/your/data/directory --scratch=/your/scratch/directory/Experiments --dataset=WT2 --epochs=800 --batch_size=128 --rnn_type=egru --layer=3 --bptt=70 --scheduler=cosine --weight_decay=0.12 --learning_rate=0.001 --learning_rate_thresholds 0.0 --emb_dim=750 --dropout_emb=0.7 --dropout_words=0.1 --dropout_forward=0.25 --grad_clip=0.1 --thr_init_mean=0.01 --dropout_connect=0.7 --hidden_dim=1350 --pseudo_derivative_width=3.6 --scheduler_start=400 --seed=913420
+
+For inference with the [provided checkpoint](https://cloudstore.zih.tu-dresden.de/index.php/s/NPQ9pLnpZnTsM5X), run
+
+    python benchmarks/lm/infer.py --data /path/to/data --dataset WT2 --datasplit test --batch_size 1 --directory /path/to/checkpoint
 
 Various flags can be passed to change the default parameters.
 See "train.py" for a list of all available arguments.
 
-This code was tested with PyTorch >= 1.9.0
+This code was tested with PyTorch >= 1.9.0, CUDA 11.
 
 A large batch of code stems from Salesforce AWD-LSTM implementation:
 https://github.com/salesforce/awd-lstm-lm
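
The long training commands in the updated README are easy to mistype. One possible way to keep them manageable is to hold the hyperparameters in a dictionary and assemble the command line from it. The sketch below uses only the flags and values from the Penn Treebank command in this diff; the helper `build_command` and the dictionary name `PTB_ARGS` are hypothetical. It writes every flag in `--name=value` form, which `argparse` accepts for long options.

```python
import shlex

# Values copied verbatim from the Penn Treebank training command in the README diff.
PTB_ARGS = {
    "data": "/path/to/data",
    "scratch": "/your/scratch/directory/Experiments",
    "dataset": "PTB",
    "epochs": "1000",
    "batch_size": "64",
    "rnn_type": "egru",
    "layer": "3",
    "bptt": "70",
    "scheduler": "cosine",
    "weight_decay": "0.10",
    "learning_rate": "0.0012",
    "learning_rate_thresholds": "0.0",
    "emb_dim": "750",
    "dropout_emb": "0.6",
    "dropout_words": "0.1",
    "dropout_forward": "0.25",
    "grad_clip": "0.1",
    "thr_init_mean": "0.01",
    "dropout_connect": "0.7",
    "hidden_dim": "1350",
    "pseudo_derivative_width": "3.6",
    "scheduler_start": "700",
    "seed": "9612",
}


def build_command(script, args):
    """Return an argument list suitable for subprocess.run."""
    return ["python", script] + [f"--{name}={value}" for name, value in args.items()]


if __name__ == "__main__":
    # Print a shell-quoted command that can be pasted into a terminal.
    print(shlex.join(build_command("benchmarks/lm/train.py", PTB_ARGS)))
```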
diff --git a/benchmarks/lm/eval.py b/benchmarks/lm/eval.py
@@ -14,16 +14,15 @@
 # ==============================================================================
 
 import torch
-import lm.data as d
+import data as d
 
 
-def evaluate(model, eval_data, criterion, batch_size, bptt, ntokens, device, return_hidden=False):
+def evaluate(model, eval_data, criterion, batch_size, bptt, ntokens, device, hidden_dims, return_hidden=False):
     # turn on evaluation mode
     model.eval()
 
     # initialize evaluation metrics
     iter_range = range(0, eval_data.size(0) - 1, bptt)
-    hidden_dims = [rnn.hidden_size for rnn in model.rnns]
 
     total_loss = 0.
     mean_activities = torch.zeros(len(iter_range), dtype=torch.float16, device=device)

diff --git a/benchmarks/lm/infer.py b/benchmarks/lm/infer.py
@@ -23,9 +23,9 @@
 import torch
 import torch.nn
 
-import lm.data as d
-from lm.models import LanguageModel
-from lm.eval import evaluate
+import data as d
+from models import LanguageModel
+from eval import evaluate
 
 
 def get_args():
@@ -37,7 +37,6 @@ def get_args():
     argparser.add_argument('--batch_size', type=int, default=80)
     argparser.add_argument('--directory', type=str, required=False, help='model directory for checkpoints and config')
     argparser.add_argument('--hidden', action='store_true', help='returns the hidden states of the whole dataset to perform analysis')
-    argparser.add_argument('--prune', type=float, default=0.0)
 
     return argparser.parse_args()
 
@@ -85,14 +84,19 @@ def main(args):
         model = LanguageModel(**model_args).to(device)
     elif config['rnn_type'] == 'egru':
         model = LanguageModel(**model_args,
-                              dampening_factor=config['damp_factor'],
+                              dampening_factor=config['pseudo_derivative_width'],
                               pseudo_derivative_support=config['pseudo_derivative_width']).to(device)
     else:
         raise RuntimeError("Unknown RNN type: %s" % config['rnn_type'])
 
     best_model_path = os.path.join(args.directory, 'checkpoints', f"{config['rnn_type'].upper()}_best_model.cpt")
     model.load_state_dict(torch.load(best_model_path, map_location=device))
 
+    if model_args['rnn_type'] == 'egru':
+        hidden_dims = [rnn.hidden_size for rnn in model.rnns]
+    else:
+        hidden_dims = [rnn.module.hidden_size if args.dropout_connect > 0 else rnn.hidden_size for rnn in model.rnns]
+
     criterion = torch.nn.CrossEntropyLoss()
 
     if args.hidden:
@@ -104,6 +108,7 @@ def main(args):
                      bptt=config['bptt'],
                      ntokens=vocab_size,
                      device=device,
+                     hidden_dims=hidden_dims,
                      return_hidden=True)
         save_file = os.path.join(args.directory, f'hidden_states_{args.datasplit}.hdf')
         with h5py.File(save_file, 'w') as f:
@@ -121,6 +126,7 @@ def main(args):
                      bptt=config['bptt'],
                      ntokens=vocab_size,
                      device=device,
+                     hidden_dims=hidden_dims,
                      return_hidden=False)
 
     test_ppl = math.exp(test_loss)
@@ -131,58 +137,6 @@ def main(args):
     print(f'Layerwise activity {test_layerwise_activity_mean.tolist()} +- {test_layerwise_activity_std.tolist()}')
     print('=' * 89)
 
-    if args.prune > 0.0 and args.hidden:
-        print(f"Model Parameter Count: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
-        input_indices = torch.arange(model.rnns[0].input_size).to(device)
-        for i in range(model.nlayers):
-            if i < model.nlayers - 1:
-                # get event frequencies
-                hid_dim = all_hiddens[i].shape[2]
-                hid_cells = all_hiddens[i].reshape(-1, hid_dim)
-                seq_len = hid_cells.shape[0]
-                spike_frequency = torch.sum(hid_cells != 0, dim=0) / seq_len
-                print(
-                    f"Layer {i + 1}: "
-                    f"less than 1/100: {torch.sum(spike_frequency < 0.01)} / {spike_frequency.shape} "
-                    f"// never: {torch.sum(hid_cells.sum(dim=0) == 0)} / {spike_frequency.shape}")
-
-                # compute remaining indicies from spike frequencies
-                topk = int(model.rnns[i].hidden_size * (1 - args.prune))
-                hidden_indices, _ = torch.sort(torch.argsort(spike_frequency, descending=True)[:topk], descending=False)
-                hidden_indices = hidden_indices.to(device)
-            else:
-                hidden_indices = torch.arange(model.rnns[i].hidden_size).to(device)
-            model.rnns[i].prune_units(input_indices, hidden_indices)
-            input_indices = hidden_indices
-
-        print(f"Model Parameter Count: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
-
-        test_loss, test_activity, test_layerwise_activity_mean, test_layerwise_activity_std, centered_cell_states, all_hiddens = \
-            evaluate(model=model,
-                     eval_data=test_data,
-                     criterion=criterion,
-                     batch_size=args.batch_size,
-                     bptt=config['bptt'],
-                     ntokens=vocab_size,
-                     device=device,
-                     return_hidden=True)
-        for i in range(model.nlayers - 1):
-            # get event frequencies
-            hid_dim = all_hiddens[i].shape[2]
-            hid_cells = all_hiddens[i].reshape(-1, hid_dim)
-            seq_len = hid_cells.shape[0]
-            spike_frequency = torch.sum(hid_cells != 0, dim=0) / seq_len
-            print(
-                f"less than 1/100: {torch.sum(spike_frequency < 0.01)} / {spike_frequency.shape} "
-                f"// never: {torch.sum(hid_cells.sum(dim=0) == 0)} / {spike_frequency.shape}")
-        test_ppl = math.exp(test_loss)
-        print('=' * 89)
-        print(f'| Inference | test loss {test_loss:5.2f} | '
-              f'test ppl {test_ppl:8.2f} | '
-              f'test mean activity {test_activity}')
-        print(f'Layerwise activity {test_layerwise_activity_mean.tolist()} +- {test_layerwise_activity_std.tolist()}')
-        print('=' * 89)
-
 
 if __name__ == "__main__":
     args = get_args()
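
With this change, `evaluate` no longer derives `hidden_dims` from the model; the caller computes the list and passes it in. The diff distinguishes layers that expose `hidden_size` directly from layers wrapped for weight dropout, where the size lives on `rnn.module`. The sketch below shows one way to collect the sizes in both cases. It is an illustration rather than the repository's code: the helper `get_hidden_dims` is hypothetical, and it checks for a `module` attribute instead of testing `dropout_connect > 0` as the diff does. The stand-in objects only mimic the attributes involved.

```python
from types import SimpleNamespace


def get_hidden_dims(rnns):
    """Collect the hidden size of each recurrent layer.

    Assumption: a weight-dropout wrapper exposes the wrapped layer as `.module`,
    as suggested by `rnn.module.hidden_size` in the diff.
    """
    return [getattr(rnn, "module", rnn).hidden_size for rnn in rnns]


# Stand-ins for real layers: two plain layers and one wrapped layer.
plain_layer = SimpleNamespace(hidden_size=1350)
wrapped_layer = SimpleNamespace(module=SimpleNamespace(hidden_size=750))
example_dims = get_hidden_dims([plain_layer, plain_layer, wrapped_layer])
```

The resulting list can then be passed as the `hidden_dims` keyword argument that `evaluate` now expects.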

diff --git a/benchmarks/lm/models.py b/benchmarks/lm/models.py
@@ -17,8 +17,8 @@
 import torch.nn as nn
 import torch.nn.functional as F
 import evnn_pytorch as evnn
-from lm.modules import VariationalDropout, WeightDrop
-from lm.embedding_dropout import embedded_dropout
+from modules import VariationalDropout, WeightDrop
+from embedding_dropout import embedded_dropout
 from typing import Union
 
 
@@ -64,9 +64,8 @@ def forward(self, x):
         bs, seq_len, ninp = x.shape
         if self.project:
             x = x.view(-1, ninp)
-            x = F.relu(self.projection(x))
+            x = self.projection(x)
             x = x.view(bs, seq_len, self.nemb)
-            x = self.variational_dropout(x, self.dropout)
         x = x.view(-1, self.nemb)
         x = self.decoder(x)
         return x
@@ -155,57 +154,6 @@ def __init__(self,
 
         self.backward_sparsity = torch.zeros(len(self.rnns))
 
-    def prune_embeddings(self, index):
-        device = next(self.parameters()).device
-        self.embeddings.weight = nn.Parameter(
-            self.embeddings.weight[:, index]).to(device)
-        self.emb_dim = self.embeddings.weight.shape[1]
-        self.decoder = Decoder(ninp=self.hidden_dim if self.projection else self.emb_dim, ntokens=self.vocab_size,
-                               project=self.projection, nemb=self.emb_dim,
-                               dropout=self.dropout_forward).to(device)
-        self.decoder.decoder.weight = self.embeddings.weight
-
-    def prune(self, fractions, hiddens, device):
-        # calculate new hidden dimensions
-        indicies = [torch.arange(self.rnns[0].input_size).to(device)]
-
-        for i in range(self.nlayers):
-            if isinstance(fractions, float):
-                frac = fractions
-            elif isinstance(fractions, tuple) or isinstance(fractions, list):
-                frac = fractions[i]
-            else:
-                raise NotImplementedError(
-                    f"data type {type(fractions)} not implemented. Use float, tuple or list")
-
-            # get event frequencies
-            hid_dim = hiddens[i].shape[2]
-            hid_cells = hiddens[i].reshape(-1, hid_dim)
-            seq_len = hid_cells.shape[0]
-            spike_frequency = torch.sum(hid_cells != 0, dim=0) / seq_len
-            print(
-                f"Layer {i + 1}: "
-                f"less than 1/100: {torch.sum(spike_frequency < 0.01)} / {spike_frequency.shape} "
-                f"// never: {torch.sum(hid_cells.sum(dim=0) == 0)} / {spike_frequency.shape}")
-
-            # compute remaining indicies from spike frequencies
-            topk = int(self.rnns[i].hidden_size * (1 - frac))
-            hidden_indices, _ = torch.sort(torch.argsort(
-                spike_frequency, descending=True)[:topk], descending=False)
-            hidden_indices = hidden_indices.to(device)
-            indicies.append(hidden_indices)
-
-        # input dimension equals embedding dimension for tied weights
-        indicies[0] = indicies[-1]
-
-        # prune weights
-        for i in range(self.nlayers):
-            self.rnns[i].prune_units(indicies[i], indicies[i+1])
-
-        self.prune_embeddings(indicies[-1])
-        print(
-            f"Final model hidden size: {[rnn.hidden_size for rnn in self.rnns]}")
-
     def init_embedding(self, initrange):
         nn.init.uniform_(self.embeddings.weight, -initrange, initrange)