What do you think is the most important part of building an AI chatbot?
I would say the most important part is the quality of the conversation data used for training, not just its quantity.

"Garbage in, garbage out" is a well-known principle in machine learning: to ensure that an AI chatbot performs well, it is critical that the dataset used to train it contains as much high-quality data as possible.
Are there any tools to filter out low-quality conversation data automatically, or do I have to do this manually?
Yes, there are several tools available to automatically filter out low-quality conversation data for training natural language models. These tools use techniques such as language models, neural networks, and heuristics to identify and remove low-quality data. Here are some examples:

1. Snorkel: Snorkel is a framework for building machine learning models with weak supervision. It includes tools for programmatically labeling, transforming, and filtering data. Snorkel can be used to filter out low-quality data by applying labeling functions that identify patterns of low-quality conversations.

2. BERTScore: BERTScore is a metric for evaluating the quality of generated text based on how well it matches human-written reference text. It can be used to compare the quality of different training datasets and to filter out low-quality data.

3. Dialogue Quality Control Toolkit (DQCT): DQCT is a toolkit for filtering out low-quality data in conversational datasets. It includes tools for identifying and correcting errors in transcripts, detecting low-quality conversations, and evaluating the overall quality of the dataset.

4. TextBlob: TextBlob is a Python library for processing textual data. It includes tools for sentiment analysis, part-of-speech tagging, and noun phrase extraction. It can be used to filter out low-quality conversations by flagging those with extreme sentiment or poor grammar.
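Under the hood, heuristic-style filtering like the kind these tools support often boils down to simple predicate functions applied to each reply. Here is a minimal pure-Python sketch of that idea; the rules, thresholds, and the `is_low_quality` helper are illustrative assumptions for this answer, not any tool's actual API:

```python
# Illustrative heuristic filter for conversation data. The rules and
# thresholds below are made-up examples of the kind of checks a
# labeling function might encode; they are not from a specific library.

def too_short(reply: str) -> bool:
    # Very short replies ("ok", "idk") rarely carry training signal.
    return len(reply.split()) < 3

def mostly_shouting(reply: str) -> bool:
    # A reply written almost entirely in capitals is often noise.
    letters = [c for c in reply if c.isalpha()]
    return bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.8

def is_low_quality(reply: str) -> bool:
    # A reply is dropped if any heuristic flags it.
    return too_short(reply) or mostly_shouting(reply)

conversations = [
    "ok",
    "WHY IS THIS NOT WORKING",
    "Sure, here is a step-by-step explanation of the setup.",
]
kept = [c for c in conversations if not is_low_quality(c)]
print(kept)  # only the third reply survives the filter
```

In a framework like Snorkel, functions of this shape would be registered as labeling functions and applied across the whole dataset, with the framework combining their (possibly conflicting) votes into a final label.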

While these tools can be useful for filtering out low-quality conversation data automatically, they are not perfect and may still make mistakes. It is therefore a good idea to manually review a sample of the filtered data to confirm that the tool is accurately identifying low-quality conversations.
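A simple way to do that spot-check is to pull a random sample of the conversations the tool removed and review them by hand. Assuming the filtered-out data sits in a plain Python list (the `filtered_out` data below is a hypothetical placeholder), the standard library is enough:

```python
import random

# Hypothetical list of conversations a filtering tool flagged as low quality.
filtered_out = [f"conversation {i}" for i in range(1000)]

# Review a fixed-size random sample rather than the whole set.
random.seed(0)  # fixed seed so the review sample is reproducible
sample = random.sample(filtered_out, k=20)

for conv in sample:
    # In practice you would display the full conversation here and record
    # whether the tool's low-quality judgement was correct.
    print(conv)
```

If the error rate in the sample is high, that is a signal to tune the filtering rules or thresholds before trusting the tool on the full dataset.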