# Convert the Ubuntu Dialogue Corpus into Q&A pairs

Select only the short Q&A exchanges from the Ubuntu Dialogue Corpus.

**Acknowledgments**
https://www.kaggle.com/datasets/rtatman/ubuntu-dialogue-corpus

This dataset was originally collected by Ryan Lowe, Nissan Pow, Iulian V. Serban and Joelle Pineau. It is made available here under the Apache License, 2.0. If you use this data in your work, please include the following citation:

Ryan Lowe, Nissan Pow, Iulian V. Serban and Joelle Pineau, "The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems", SIGDial 2015. URL: http://www.sigdial.org/workshops/conference16/proceedings/pdf/SIGDIAL40.pdf

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/data/datasets/ubuntu_dialogue/ubuntu_parser.ipynb)

In [1]:
# uncomment and run the lines below to set up when running in Colab
# !git clone https://github.com/LAION-AI/Open-Assistant.git
# %cd Open-Assistant/data/datasets/ubuntu_dialogue/
# !pip install -r requirements.txt

In [4]:
# download the data; you can get your kaggle.json file from your account page https://www.kaggle.com/me/account
import kaggle

kaggle.api.dataset_download_files("rtatman/ubuntu-dialogue-corpus", unzip=True)

In [5]:
# global settings

FOLDER = "Ubuntu-dialogue-corpus"  # input folder containing ubuntu dialogue csv files
SOURCE = "ubuntu-dialogue"  # source to use in the parquet for each row

In [6]:
import os
import re
import json

from tqdm import tqdm

import numpy as np
import pandas as pd

In [7]:
def load(file):
    data = pd.read_csv(file)
    data["date"] = pd.to_datetime(data["date"])
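    # build a unique dialogue id such as "3_126125"; access by label, since positional x[0] on a labeled row is deprecated
    data["id"] = data[["folder", "dialogueID"]].apply(lambda x: f"{x['folder']}_{x['dialogueID'].split('.tsv')[0]}", axis=1)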
    data.drop(columns=["folder", "dialogueID"], inplace=True)
    return data


data = None
for file in tqdm(os.listdir(FOLDER)):
    data = pd.concat([data, load(os.path.join(FOLDER, file))])

data.head()

100%|███████████████████████████████████████████████████████████████████████████████████| 3/3 [07:34<00:00, 151.36s/it]


Unnamed: 0,date,from,to,text,id
0,2008-04-23 14:55:00+00:00,bad_image,,"Hello folks, please help me a bit with the fol...",3_126125
1,2008-04-23 14:56:00+00:00,bad_image,,Did I choose a bad channel? I ask because you ...,3_126125
2,2008-04-23 14:57:00+00:00,lordleemo,bad_image,the second sentence is better english and we...,3_126125
3,2009-08-01 06:22:00+00:00,mechtech,,Sock Puppe?t,3_64545
4,2009-08-01 06:22:00+00:00,mechtech,,WTF?,3_64545
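
The row-wise `apply` in `load` runs a Python function once per message, which is likely a noticeable part of the load time shown above. As a minimal sketch, assuming `folder` is numeric and `dialogueID` is a string ending in `.tsv` (the toy values below are made up), the same ids can be built with column-wide string operations:

```python
import pandas as pd

# Toy frame with the two columns the loader combines (values are illustrative only).
toy = pd.DataFrame({"folder": [3, 3, 10], "dialogueID": ["126125.tsv", "64545.tsv", "7.tsv"]})

# Row-wise version, equivalent to the loader: keep everything before the first ".tsv".
row_wise = toy.apply(lambda r: f"{r['folder']}_{r['dialogueID'].split('.tsv')[0]}", axis=1)

# Vectorized version: str.partition splits on the first literal ".tsv",
# and column 0 of the result holds the part before it.
vectorized = toy["folder"].astype(str) + "_" + toy["dialogueID"].str.partition(".tsv")[0]

print(vectorized.tolist())
```

Both versions keep the text before the first `.tsv`, so they should agree on any input where `dialogueID` is a non-null string.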


In [8]:
# clean up the df, remove duplicates and answers that are way too short, etc.
clean = {col: [] for col in ["INSTRUCTION", "RESPONSE", "SOURCE", "METADATA"]}

for name, group in tqdm(data.groupby("id")):
    if len(group) < 3 or len(group) > 5:  # keep only dialogues with 3 to 5 messages
        continue  # longer back and forth will most likely not be parsed correctly

    # sort a copy rather than modifying the groupby slice in place
    group = group.sort_values(by=["date"], ascending=True)
    instruction = str(group["text"].values[0]).strip()
    instruction_user = group["from"].values[0]
    if not instruction or pd.isna(instruction) or len(instruction) < 12:
        continue
    if not re.findall(
        r"(?i)(?:\?|what|who|where|why|when|how|whose|explain|tell|does|way|can|know|able|best|recommend)", instruction
    ):
        continue  # probably not a question

    all_recipients = "|".join(
        [re.escape(item) for item in set(group["to"].tolist() + group["from"].tolist()) if pd.notna(item)]
    )
    response = None
    response_user = None
    for _, row in group.iterrows():
        if row["to"] == instruction_user:
            candidate = str(row["text"]).strip()
            if (
                not row["text"]
                or pd.isna(row["text"])
                or re.findall(r"(?i)^(yes|yep|yeah|no|nah|nope|sure|yes\s*sir)\W*$", candidate)
            ):
                continue  # answer is not expressive
            if len(candidate) < 3:
                continue  # too short
            if re.findall(r"(?i)(?:wrong|o[nf].*?topic|else\s*where|ask.+?in|\#\w+|google|you.+?mean)", candidate):
                continue  # probably off topic
            if re.findall(r"\b(" + all_recipients + r")\b", candidate):
                continue  # answer includes user name(s)
            response = candidate
            response_user = row["from"]
        elif response_user is not None and row["to"] == response_user and row["from"] == instruction_user:
            # the asker replied to the answerer with a confirmation, so keep the pair
            if re.findall(r"(?i)(?:thank|thx|works|working|great)", str(row["text"])):
                clean["INSTRUCTION"].append(instruction)
                clean["RESPONSE"].append(response)
                clean["SOURCE"].append(SOURCE)
                clean["METADATA"].append(json.dumps({"user_question": instruction_user, "user_answer": response_user}))
                break

100%|██████████████████████████████████████████████████████████████████████| 1852868/1852868 [08:17<00:00, 3725.97it/s]
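
To make the heuristics in the filtering cell easier to inspect, here is a small sketch that applies the same regular expressions to single strings. The helper names (`looks_like_question`, `is_usable_answer`, `is_confirmation`) are hypothetical and not part of the notebook, and the user-name check is left out because it needs the whole dialogue.

```python
import re

# Patterns copied from the filtering cell.
QUESTION_RE = re.compile(
    r"(?i)(?:\?|what|who|where|why|when|how|whose|explain|tell|does|way|can|know|able|best|recommend)"
)
TRIVIAL_RE = re.compile(r"(?i)^(yes|yep|yeah|no|nah|nope|sure|yes\s*sir)\W*$")
OFF_TOPIC_RE = re.compile(r"(?i)(?:wrong|o[nf].*?topic|else\s*where|ask.+?in|\#\w+|google|you.+?mean)")
THANKS_RE = re.compile(r"(?i)(?:thank|thx|works|working|great)")


def looks_like_question(text):
    """Hypothetical helper: at least 12 characters and one question cue."""
    text = text.strip()
    return len(text) >= 12 and bool(QUESTION_RE.search(text))


def is_usable_answer(text):
    """Hypothetical helper: reject very short, bare yes/no, and redirecting replies."""
    text = text.strip()
    if len(text) < 3 or TRIVIAL_RE.search(text):
        return False
    return not OFF_TOPIC_RE.search(text)


def is_confirmation(text):
    """Hypothetical helper: the asker signals that the answer helped."""
    return bool(THANKS_RE.search(text))


for sample in ["does ubuntu come with a firewall by default?", "I always scan my disk first"]:
    print(sample, "->", looks_like_question(sample))
```

The cue words are matched as substrings, so "I always scan my disk first" passes the question test through "way" in "always" and "can" in "scan". Wrapping the cues in `\b` word boundaries would be one way to tighten this, at the cost of retrieving fewer pairs.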


In [9]:
clean = pd.DataFrame(clean)
clean.sort_values(by="RESPONSE", key=lambda x: x.str.len(), inplace=True, ascending=False)
clean.drop_duplicates(subset=["INSTRUCTION"], inplace=True)
clean.sort_index(inplace=True)
clean.head()

Unnamed: 0,INSTRUCTION,RESPONSE,SOURCE,METADATA
0,"hi, is there a CLI command to roll back any up...",your recourse is to re-install fresh the older...,ubuntu-dialogue,"{""user_question"": ""edd"", ""user_answer"": ""n8tus..."
1,A LiveCD iso can be burned to a DVD-R and run ...,"I hope so, or the custom DVDs I've done are wo...",ubuntu-dialogue,"{""user_question"": ""usrl"", ""user_answer"": ""Ghos..."
2,"hello, is there a way to adjust gamma settings...",for me i have my nvidia settings manager and i...,ubuntu-dialogue,"{""user_question"": ""nucco_"", ""user_answer"": ""sp..."
4,does ubuntu come with a firewall by default?,no iptables rule is loaded by deault on ubuntu,ubuntu-dialogue,"{""user_question"": ""aeleon"", ""user_answer"": ""er..."
5,Can someone tell me howto get rid of Google Ch...,sudo dpkg -l |grep -i chrom ----> sudo apt-get...,ubuntu-dialogue,"{""user_question"": ""frold"", ""user_answer"": ""shi..."
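
The sort, drop and re-sort sequence above keeps one row per question: the one with the longest response. A minimal sketch on made-up rows shows the effect, including why the index in the preview has gaps (row 3 is missing above):

```python
import pandas as pd

# Toy data (illustrative only): the first two rows share the same question.
toy = pd.DataFrame(
    {
        "INSTRUCTION": ["how do I list files?", "how do I list files?", "what is apt?"],
        "RESPONSE": ["ls", "use ls -la to include hidden files", "the package manager"],
    }
)

longest = (
    toy.sort_values(by="RESPONSE", key=lambda s: s.str.len(), ascending=False)  # longest answers first
    .drop_duplicates(subset=["INSTRUCTION"])  # keeps the first (longest) answer per question
    .sort_index()  # restore the original row order
)

print(longest)
```

Row 0 is dropped because row 1 answers the same question at greater length, so the surviving index labels are 1 and 2. Calling `reset_index(drop=True)` afterwards would renumber the rows if contiguous labels are preferred.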


In [10]:
print(f"Retrieved {len(clean) / data['id'].nunique() * 100:.2f}% of all questions ({len(clean)})")

Retrieved 0.87% of all questions (16173)


In [11]:
for index, row in clean.iterrows():
    print("Q >", row["INSTRUCTION"])
    print("A >", row["RESPONSE"])
    print()
    if index > 100:
        break

Q > hi, is there a CLI command to roll back any updates/upgrades I made recently?
A > your recourse is to re-install fresh the older version

Q > A LiveCD iso can be burned to a DVD-R and run with no problems, right?
A > I hope so, or the custom DVDs I've done are worthless. ;)

Q > hello, is there a way to adjust gamma settings in totem? my videos aren't playing with the correct colours
A > for me i have my nvidia settings manager and i change the video gamma settings from there...

Q > does ubuntu come with a firewall by default?
A > no iptables rule is loaded by deault on ubuntu

Q > Can someone tell me howto get rid of Google Chrome? Im not able to uninstall it...
A > sudo dpkg -l |grep -i chrom ----> sudo apt-get remove 'on what appears'

Q > wow. for the life of me i can never remember this command. whats the command that outputs your ati hardare information? shows if you have direct rendering?
A > glxinfo | grep dri ?

Q > ack!  what the heck kind of Linux distro doesn't install