
Helping Avoid Chatbot Deception by Detecting User Questions About Human or Non-Human Identity; ACL 2021



Code for the R-U-A-Robot Dataset.

We use Python 3.8.5, but the code will probably work on Python 3.6+. Start by installing the requirements in requirements.txt.

The Data

  • data/1.0.0/..csv - main data files
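The main data files are plain CSVs, so they can be loaded with the standard library alone. A minimal sketch (the column names "text" and "label", and the label values, are placeholders here, not necessarily the actual header of the files under data/1.0.0/):

```python
import csv
import io

# Stand-in for one of the data/1.0.0/ files; the real files and their
# column names may differ, so treat "text" / "label" as placeholders.
sample_csv = io.StringIO(
    "text,label\n"
    "are you a robot?,positive\n"
    "do you like robots?,negative\n"
)

rows = list(csv.DictReader(sample_csv))
labels = [row["label"] for row in rows]
```

Swapping `io.StringIO(...)` for `open("data/1.0.0/<file>.csv")` gives the same row-of-dicts access on the real data.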

Auxiliary data:

  • data/RUARobot_CodeGuide_v1.4.1.pdf - The various categories of utterances we encountered while labeling data and the label we settled on for each. As mentioned in the paper, there is definitely room to debate a lot of these (though before getting too focused on a specific category, note that most of the categories are pretty rare and only affect a few of the 6000+ examples). We think it is pretty good for v1.0 of the dataset, and discussion is welcome on how to improve future releases.
  • data/auxdata/part1_survey_data.csv - The data collected from mechanical turk for expanding grammar
  • data/auxdata/existing_sys.v2fill.csv - Categorizing existing system responses
  • data/auxdata/randsample_labeldata.csv - Labeling data sampled randomly from the dataset
  • data/auxdata/survey_test_r.csv - The additional test split
  • data/auxdata/tfidf_labeldata.csv - Labeling data sampled with TF-IDF sampling
  • data/auxdata/codingguide_table_examples.csv - An export of the coding guide in CSV form for programmatic checking
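The exact TF-IDF sampling procedure isn't spelled out here. As an illustrative sketch of one plausible version, using only the standard library: score each utterance by the mean TF-IDF weight of its tokens, so unusually phrased examples rank highest, and take the top-k for labeling. The function name and scoring details are made up for this sketch:

```python
import math
from collections import Counter

def tfidf_topk(utterances, k):
    """Rank utterances by mean TF-IDF of their tokens (rare phrasings
    bubble up) and return the k highest-scoring ones. Illustrative
    stand-in, not the repo's actual sampling code."""
    docs = [u.lower().split() for u in utterances]
    df = Counter(tok for d in docs for tok in set(d))  # document freq
    n = len(docs)

    def score(d):
        tf = Counter(d)
        return sum(
            (tf[t] / len(d)) * math.log(n / df[t]) for t in tf
        ) / len(tf)

    return sorted(
        utterances, key=lambda u: score(u.lower().split()), reverse=True
    )[:k]

sample = [
    "are you a robot",
    "are you a human",
    "are you a robot or a human",
    "art thou an automaton",
]
picked = tfidf_topk(sample, k=1)
```

The archaic phrasing wins here because every one of its tokens is rare in the pool.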

The auxdata is what went into making the dataset. Notably, it includes justification categories for most of the trickier examples. Currently these justification labels are not linked up with the main dataset files (since the data goes through the grammar first). Aspirationally we want to fix this at some point, as while we tried to be reasonable and consistent, some labels on the edge cases might be confusing without the justification label.

Data Bias Notes: While efforts were made to make the dataset comprehensive, users of the dataset should be aware of potential biases. Most example review and grammar development was done by one individual, which could induce biases in the topics covered. We collected crowd-sourced examples to try to reduce individual bias in phrasings or topics, but that data comes from US-based Amazon Mechanical Turk workers, who might also represent a specific, biased demographic. Additionally, the dataset is English-only, which potentially perpetuates an English bias in NLP systems (translation PRs welcome :) ).

Src Highlights:

  • datatoy/ - The main code for sampling the grammar and making the splits
  • baselines/ - Used to run the ML and grammar baselines. It prints a LaTeX table at the end. Note that seeds aren't fixed, so expect some variance in the underlying numbers if you rerun it; the ranking between models should be pretty stable though.
  • templates/ - Grammar for positive examples. Executing it should print 100 samples.
  • templates/ - Grammar for hard negatives
  • templates/ - Grammar for AIC examples
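The actual template grammars aren't reproduced here. As an illustration of how such a grammar expands into utterances, here is a toy context-free sampler; every rule and terminal below is invented for the sketch, not taken from the repo's templates/ files:

```python
import random

# Toy grammar in the spirit of the positive-example templates.
# Uppercase keys are nonterminals; anything else is a terminal.
GRAMMAR = {
    "S": [["ARE_YOU", "A", "BOT"], ["ARE_YOU", "HUMAN"]],
    "ARE_YOU": [["are you"], ["r u"]],
    "A": [["a"], ["an actual"]],
    "BOT": [["robot"], ["machine"], ["chatbot"]],
    "HUMAN": [["human"], ["a real person"]],
}

def sample(symbol="S", rng=random):
    """Recursively expand `symbol`; strings not in GRAMMAR are terminals."""
    if symbol not in GRAMMAR:
        return symbol
    return " ".join(sample(s, rng) for s in rng.choice(GRAMMAR[symbol]))

utterances = {sample() for _ in range(100)}
```

The real grammar adds features on top of this basic expansion, such as intra-rule partitioning (so a production's expansions don't leak across train/test splits) and modifiers.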

Other somewhat useful files:

  • classify_text_plz - The actual code for the ML model baselines. Useful if you really want to dig into the exact details of hyperparameters. We used fairly default/typical settings.
  • datatoy/ - Used to test whether the grammar fully covers the data we collected from the Turk, TF-IDF, and random samples.
  • templates/ - The backend logic of the Context Free Grammar
  • datatoy/ - The modifier that helps with augmenting/allowing typos in the context-free grammar
  • othersurvey/ - How we made the questions for the "good response" surveys
  • tests - About 35 automated tests for the CFG, utilities, and fancy within-rule partitioning stuff.
  • datatoy/ - How we convert our classifier. See also:
    • templates/ - Converts the grammar into EBNF form for use in an off-the-shelf parser (Lark). The GramRecognizer is also the backend of the classifier and includes some of the heuristics we mention in the paper.
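The typo modifier's implementation isn't shown here. A minimal sketch of one way such an augmenter could work, using random adjacent-character swaps (the function name and swap probability are made up for this sketch):

```python
import random

def with_typos(text, rng, p=0.1):
    """Illustrative typo augmenter: with probability p at each position,
    swap a character with its right neighbor. Not the repo's actual
    modifier, just one plausible perturbation scheme."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
variants = {with_typos("are you a robot", rng) for _ in range(20)}
```

Because swaps only reorder characters, every variant keeps the same multiset of characters as the original, which makes the perturbations easy to sanity-check.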

In the data files we provide the existing system outputs (Blender, Alexa, Google), but here's how we got them:

  • baselines/blender_baseline - Code for getting Blender responses to positive examples
  • baselines/googleassistant/ - Get Google Assistant results
  • baselines/googleassistant/audioresults - The audio results, since some results in the spreadsheet were truncated transcriptions. Unfortunately, collecting the Alexa responses was a manual process in the Alexa simulator.

Other packages

Our surveys/evaluation are done using a separate module.

Certain components of this source might be independently useful beyond the dataset. This includes the tooling for creating the context-free grammar, with features like intra-rule partitioning and modifiers. Also, the ML code was put into classify_text_plz, which was intended to be buildable into a few-lines-of-code solution for easily benchmarking NLP classification tasks. We will eventually separate these into decoupled packages.
