NVIDIA AI Workbench: Introduction

This is an NVIDIA AI Workbench example Project that demonstrates how to train a Large Language Model to annotate large sections of text with realistic punctuation and capitalization using the NeMo Framework. We will conduct data preprocessing, model configuration, model training, and model inference on new text to evaluate performance. Users who have installed AI Workbench can get up and running with this project in minutes.

Have questions? Please direct any issues, fixes, suggestions, and discussion on this project to the DevZone Members Only Forum thread here.

Project Description

Automatic Speech Recognition (ASR) systems typically generate text without punctuation or capitalization. This tutorial explains how to implement a model in NeMo that predicts punctuation and capitalization for each word in a sentence, which makes ASR output more readable and can improve the performance of downstream named entity recognition, machine translation, or text-to-speech models. We'll show how to train a model for this task using a pre-trained BERT model. For every word in our training dataset, we're going to predict:

- the punctuation mark that should follow the word, and
- whether the word should be capitalized
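To make the two per-word predictions concrete, here is a minimal Python sketch of how a punctuated sentence could be turned into training labels and how predicted labels could be applied back to raw text. It assumes the two-character label convention described in the NeMo documentation (first character: the punctuation mark that follows the word, or `O` for none; second character: `U` for capitalized, `O` otherwise). The function names `make_labels` and `apply_labels` and the restricted punctuation set are illustrative choices, not part of this project's notebook.

```python
# Assumed label set: comma, period, question mark. "O" means "no punctuation".
PUNCT_MARKS = {",", ".", "?"}


def make_labels(sentence: str) -> tuple[str, str]:
    """Turn a punctuated sentence into (lowercased plain text, label string)."""
    words, labels = [], []
    for token in sentence.split():
        # First label character: punctuation that follows the word.
        punct = token[-1] if token[-1] in PUNCT_MARKS else "O"
        core = token.rstrip("".join(PUNCT_MARKS))
        if not core:
            continue  # the token was punctuation only, nothing to label
        # Second label character: whether the word starts with a capital.
        capit = "U" if core[0].isupper() else "O"
        words.append(core.lower())
        labels.append(punct + capit)
    return " ".join(words), " ".join(labels)


def apply_labels(text: str, labels: str) -> str:
    """Restore punctuation and capitalization from per-word labels."""
    restored = []
    for word, label in zip(text.split(), labels.split()):
        punct, capit = label[0], label[1]
        if capit == "U":
            word = word[0].upper() + word[1:]
        if punct != "O":
            word += punct
        restored.append(word)
    return " ".join(restored)


text, labels = make_labels("When is the next flight to New York?")
print(text)    # when is the next flight to new york
print(labels)  # OU OO OO OO OO OO OU ?U
print(apply_labels(text, labels))
```

In this sketch `U` is treated as "capitalize the first letter"; words that are fully upper-cased in the source (acronyms, for example) would need separate handling.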

System Requirements:

Operating System: Ubuntu 22.04
CPU requirements: None, tested with Intel® Xeon® Gold 6240R CPU @ 2.40GHz
GPU requirements: Any NVIDIA training GPU, tested with NVIDIA A100-40GB
NVIDIA driver requirements: Latest driver version
Storage requirements: 40GB
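Before cloning, it may help to confirm that the target machine roughly meets these requirements. The following is a small sketch using only the Python standard library. It assumes the 40GB figure refers to free disk space at the chosen path and uses the presence of `nvidia-smi` on the `PATH` as a rough proxy for an installed NVIDIA driver. The names `free_space_gb` and `preflight` are hypothetical helpers, not part of AI Workbench.

```python
import shutil

REQUIRED_FREE_GB = 40  # storage figure taken from the requirements list


def free_space_gb(path: str = ".") -> float:
    """Free disk space at `path`, in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def preflight(path: str = ".", required_gb: float = REQUIRED_FREE_GB) -> dict:
    """Return a small report of requirement checks for this machine."""
    return {
        # Assumption: the requirement means free space where the project lives.
        "storage_ok": free_space_gb(path) >= required_gb,
        # Assumption: nvidia-smi on PATH implies an NVIDIA driver is installed.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


for check, ok in preflight().items():
    print(f"{check}: {'ok' if ok else 'not satisfied'}")
```

This does not verify the GPU model or driver version; it only flags the most common setup gaps early.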

Quickstart

The notebook(s) in this project were adapted from the NVIDIA NeMo GitHub repository, which can be found here.

If you have NVIDIA AI Workbench already installed, you can use this Project in AI Workbench on your choice of machine by:

Forking this Project to your own GitHub namespace and copying the clone link

https://github.com/<your_namespace>/<project_name>.git
Opening a shell and activating the Context you want to clone into by running
```
$ nvwb list contexts

$ nvwb activate <desired_context>
```
Cloning this Project onto your desired machine by running
```
$ nvwb clone project <your_project_url>
```

Opening the Project by running
```
$ nvwb list projects

$ nvwb open <project_name>
```

Starting JupyterLab by
```
$ nvwb start jupyterlab
```
Navigate to the code directory of the project. Then, open the notebook titled Punctuation_and_Capitalization.ipynb and get started. Happy coding!

Tip: Use nvwb help to see a full list of commands.

Tested On

This notebook has been tested with an NVIDIA A100-40GB GPU and the following version of NVIDIA AI Workbench: nvwb 0.2.66 (internal; linux; amd64; go1.18.10; Tue Sep 12 18:50:21 UTC 2023)

License

This NVIDIA AI Workbench example project is licensed under the Apache 2.0 License.

NVIDIA/workbench-example-nemo-punctuation

Folders and files

Latest commit

History

Repository files navigation

NVIDIA AI Workbench: Introduction

Project Description

System Requirements:

Quickstart

Tested On

License

About

Resources

License

Stars

Watchers

Forks

Languages