Dialogue Dataset Generator

Description and Purpose

This repository provides a script to generate synthetic dialogue datasets using a schema-first approach. The pipeline processes input configurations to produce human-readable dialogues annotated in the MultiWoZ 2.2 format. The script main.py, its dependencies and auxiliary modules can be found in the DatasetGenerator folder.
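For reference, a single annotated user turn in the MultiWoZ 2.2 style looks roughly like the sketch below. The field names follow the public MultiWoZ 2.2 schema; the exact keys emitted by this script may differ, and the dialogue content shown is purely illustrative.

{
  "dialogue_id": "synthetic_hotel_0001",
  "services": ["hotel"],
  "turns": [
    {
      "turn_id": "0",
      "speaker": "USER",
      "utterance": "I need a cheap hotel in the centre.",
      "frames": [
        {
          "service": "hotel",
          "state": {
            "active_intent": "find_hotel",
            "requested_slots": [],
            "slot_values": {
              "hotel-pricerange": ["cheap"],
              "hotel-area": ["centre"]
            }
          }
        }
      ]
    }
  ]
}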

Requirements

Auxiliary

  • Python 3.10+
  • A valid OpenAI API key

Python Libraries

  • pyyaml 6.0.1
  • openai 1.9.0
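Both libraries can be installed with pip, pinned to the versions above:

pip install pyyaml==6.0.1 openai==1.9.0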

How to Run

The main.py script can be run from any CLI terminal.

The user can specify the domains and the number of dialogues to generate, either through a YAML configuration file or directly as a list of domains on the command line.

Command Line Arguments

-y, --yaml: Path to the YAML dataset configuration file.
-l, --list: List of domains to generate datasets for (e.g., hotel train).
-r, --repetitions: Number of dialogues to generate for the dataset (default is 1).

Example Usage

Using a YAML Configuration File

python main.py -y path/to/config.yaml

Using a List of Domains

python main.py -l hotel train -r 10
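For reference, the interface above can be reproduced with a short argparse setup. The sketch below is illustrative of the documented argument semantics, not the actual code in main.py:

import argparse

# Illustrative CLI definition matching the documented arguments.
parser = argparse.ArgumentParser(description="Generate synthetic dialogue datasets.")
parser.add_argument("-y", "--yaml", help="Path to the YAML dataset configuration file.")
parser.add_argument("-l", "--list", nargs="+", help="List of domains to generate datasets for.")
parser.add_argument("-r", "--repetitions", type=int, default=1,
                    help="Number of dialogues to generate (default: 1).")
args = parser.parse_args()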

YAML Configuration File

The YAML configuration file should list the domains for which datasets need to be generated. Here is an example configuration:

Example YAML File

domains:
  - hotel
  - train

Save the above content in a file, e.g., config.yaml, and provide its path using the -y argument.
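Internally, such a file can be read with pyyaml. A minimal sketch (the helper name below is hypothetical, not taken from the repository):

import yaml

def load_domains(path):
    # Hypothetical helper: read the list of domains from a YAML config file.
    with open(path, "r") as f:
        config = yaml.safe_load(f)
    return config.get("domains", [])

# load_domains("config.yaml") -> ["hotel", "train"]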

Supported Domains

The script currently supports the following domains:

  • hotel
  • train

If an unsupported domain is specified, the script will print a warning and continue processing.
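A minimal sketch of this behaviour, assuming a hard-coded set of supported domains (the helper name is illustrative, not from the repository):

SUPPORTED_DOMAINS = {"hotel", "train"}

def filter_domains(requested):
    # Illustrative helper: keep supported domains, warn about the rest.
    kept = []
    for domain in requested:
        if domain in SUPPORTED_DOMAINS:
            kept.append(domain)
        else:
            print(f"Warning: unsupported domain '{domain}', skipping.")
    return kept

# filter_domains(["hotel", "flight"]) prints a warning and returns ["hotel"]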

Error Handling

The script includes retry logic to handle errors during dialogue generation. If an error occurs, it will retry up to three times before moving to the next step.
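A minimal sketch of such retry logic (the wrapper below is illustrative; the generation call is a placeholder, not the repository's actual function):

MAX_RETRIES = 3

def generate_with_retries(generate_fn, *args):
    # Illustrative wrapper: call generate_fn, retrying up to MAX_RETRIES times.
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return generate_fn(*args)
        except Exception as exc:
            print(f"Attempt {attempt}/{MAX_RETRIES} failed: {exc}")
    return None  # give up so the caller can move on to the next step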

Output

Generated dialogues are saved in JSON format in the IntermediaryFiles directory. The files are named synthetic_dataset_<domain>.json for each domain.
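Given that naming convention, writing the output for one domain can be sketched as follows (an illustrative helper, assuming the generated dialogues are held in a JSON-serialisable structure):

import json
import os

def save_dataset(domain, dialogues, out_dir="IntermediaryFiles"):
    # Illustrative helper: write dialogues to synthetic_dataset_<domain>.json.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"synthetic_dataset_{domain}.json")
    with open(path, "w") as f:
        json.dump(dialogues, f, indent=2)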

Experiments

Requirements

The following Python libraries must be installed in order to run the experiments:

  • pandas 2.1.1
  • transformers 4.40.2
  • torch 2.3.0
  • tqdm 4.66.4
  • numpy 1.26.3
  • scikit-learn 1.2.1
  • plotly 5.20.0
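These can be installed with pip, pinned to the versions above:

pip install pandas==2.1.1 transformers==4.40.2 torch==2.3.0 tqdm==4.66.4 numpy==1.26.3 scikit-learn==1.2.1 plotly==5.20.0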

Synthetic-Trained Models

The code for training the SVM, Naïve Bayes and BERT models on the synthetic data can be found within the directory Experiments/SynthDataset/.

Before using any new synthetic dataset, the files must be processed by running gpt_classifier.ipynb and json_processing.ipynb to ensure that the annotation is aligned with the MultiWoZ 2.2 standard.

Both notebooks can be found within Experiments/SynthDataset/ and must be pointed to the new synthetic dataset's path.

The files containing the experiments are:

For SVM and Naïve Bayes:

  • domain_synth.ipynb
  • intent_synth.ipynb
  • slot_synth.ipynb

For BERT:

  • BERT_Synth_Domain.ipynb
  • BERT_Synth_Intent.ipynb
  • BERT_Synth_Slot.ipynb

The notebooks can be run as provided, as long as the paths to the JSON files containing the synthetic dialogue data are specified.

MultiWoZ-Trained Models and A/B Testing

The code for training the SVM, Naïve Bayes and BERT models on the MultiWoZ data can be found within the directory Experiments/MultiWozDataset/code.

These notebooks can be used both to train the MultiWoZ models and to test loaded synthetic-trained models, provided the necessary paths are added to the files.

The files containing the experiments are:

For SVM and Naïve Bayes:

  • multiwoz_active_intent.ipynb
  • multiwoz_domain.ipynb
  • multiwoz_slot_values.ipynb

For BERT:

  • BERT_MultiWoZ_Domain.ipynb
  • BERT_MultiWoZ_Intent.ipynb
  • BERT_MultiWoZ_Slot.ipynb
