<h1> DATA PREPROCESSING </h1>

**1. LOADING THE DATASET**

**We import the dataset from the referenced open-source repository.**

In [2]:
# We log in to the Hugging Face Hub to import the dataset
from huggingface_hub import login
login(new_session=False,
write_permission=True,
token='...',  # Add a valid HuggingFace token in this row to import the dataset
add_to_git_credential=True)

from datasets import load_dataset
# The dataset is available on Hugging Face in the Salesforce/dialogstudio repository
dataset = load_dataset("Salesforce/dialogstudio", "TweetSumm")
dataset  # We show the content of the dataset

Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful


The repository for Salesforce/dialogstudio contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/Salesforce/dialogstudio.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


❤️Attention❤️: Dataset download may take some time. We appreciate your patience!


DatasetDict({
    train: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt'],
        num_rows: 879
    })
    validation: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt'],
        num_rows: 110
    })
    test: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt'],
        num_rows: 110
    })
})

**2. DATA CLEANING**

**The data comes from Twitter conversations and therefore contains elements such as usernames preceded by @, URLs, and other contextual markers. These elements can be identified by inspecting a few samples and then removed.**

In [3]:
import re

def clean_data(data):
    data = re.sub(r'https?://\S+', '', data)  # Deleting URLs (http and https)
    data = re.sub(r"@[^\s]+", "", data)  # Deleting Twitter usernames preceded by @
    data = re.sub('_', ' ', data)  # Replacing underscores with spaces
    data = re.sub(r"\^[^ ]+", "", data)  # Deleting names and initials preceded by ^
    return data
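
As a quick illustration of what these substitutions do, here is a minimal sketch that applies the same regular expressions to a made-up tweet. The sample text and the name `clean_text` are hypothetical and are not taken from the dataset. The final whitespace normalisation is an optional extra step that the notebook itself does not perform.

```python
import re

def clean_text(text: str) -> str:
    """Apply the same four substitutions as the cleaning step above."""
    text = re.sub(r"https?://\S+", "", text)  # URLs
    text = re.sub(r"@[^\s]+", "", text)       # @usernames
    text = re.sub("_", " ", text)             # underscores become spaces
    text = re.sub(r"\^[^ ]+", "", text)       # ^initials signatures
    return text

# Hypothetical sample, invented for illustration only
sample = "@AppleSupport my_phone is broken, see https://t.co/abc123 ^JK"
cleaned = clean_text(sample)

# The substitutions leave stray spaces behind; collapsing them is optional
normalised = " ".join(cleaned.split())

print(repr(cleaned))
print(repr(normalised))
```

The removed tokens leave leading and doubled spaces in the text, which is why some dialog lines printed later in the notebook start with a space after `user:` or `agent:`.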

**3. FORMATTING THE DATASET**

**The dataset contains a lot of information that will not be used during training, so we adjust its content and format to match our needs.**

In [4]:
# We use the Python library pandas to visualise the training, validation, and test datasets
import pandas as pd

training_data = pd.DataFrame(dataset["train"]) # The training set
validation_data = pd.DataFrame(dataset["validation"]) # The validation set
testing_data = pd.DataFrame(dataset["test"]) # The test set

In [5]:
# Only the "log" column, which contains the conversation, and the "original dialog info" column, which contains the summaries, will be used, so we remove all other columns from the dataset.

training_data = training_data.drop(columns=['original dialog id', 'new dialog id','prompt','dialog index']) # Removing unnecessary columns from the training data
validation_data = validation_data.drop(columns=['original dialog id', 'new dialog id','prompt','dialog index']) # Removing unnecessary columns from the validation data
testing_data = testing_data.drop(columns=['original dialog id', 'new dialog id','prompt','dialog index']) # Removing unnecessary columns from the testing data

summary_column = training_data["original dialog info"]


**The log column contains several fields per turn. We keep only the two we need: "user utterance", which contains the user message, and "system response", which contains the agent response. We then group them into a single column to obtain the full dialog between user and agent.**

In [6]:
# We create a helper function to update the content of the "log" column.
def formatting_log(column,data,index):
    column[index] = data

**We create a function that takes the "log" column of the dataset as a parameter and replaces the content of every row with the text dialog, using the previous helper function.**

In [7]:
# Function that will be used to format the dialog column of the training, validation, and testing data
import re

def format_column(column):
    i = 0
    for row in column:
        text = ""
        for turn in row: # Each turn holds one user utterance and one system response
            user = clean_data(turn["user utterance"]) # We select the user utterance
            agent = clean_data(turn["system response"]) # We select system response
            text += f"user:{user}\nagent:{agent}\n" # We concatenate them, and add them to the previous ones if any, which results in the full dialog
        formatting_log(column, text, i) # We change the content of the column to the text dialog
        i += 1
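
To make the resulting transcript format concrete, here is a minimal sketch that builds the dialog string for a single toy conversation without modifying any column in place. The function name `build_dialog` and the toy turns are hypothetical; only the two keys `"user utterance"` and `"system response"` come from the dataset. The cleaning step is left out to keep the example short.

```python
from typing import Dict, List

def build_dialog(turns: List[Dict[str, str]]) -> str:
    """Join the turns of one conversation into a 'user:/agent:' transcript."""
    text = ""
    for turn in turns:
        user = turn["user utterance"]
        agent = turn["system response"]
        text += f"user:{user}\nagent:{agent}\n"
    return text

# Toy conversation, invented for illustration only
toy_log = [
    {"user utterance": "My watch stopped counting steps.",
     "system response": "Have you restarted it?"},
    {"user utterance": "Yes, twice.",
     "system response": "Please send us a DM."},
]

dialog = build_dialog(toy_log)
print(dialog)
```

A function that returns the string instead of writing it back could also be used with pandas as `training_data["log"].apply(build_dialog)`, which avoids assigning into the column row by row.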

In [8]:
#Formatting dialog training text using the previous function
column_log = training_data["log"]
format_column(column_log)
print(column_log[0])

user:So neither my iPhone nor my Apple Watch are recording my steps/activity, and Health doesn’t recognise either source anymore for some reason. Any ideas?   please read the above.
agent: Let’s investigate this together. To start, can you tell us the software versions your iPhone and Apple Watch are running currently?
user: My iPhone is on 11.1.2, and my watch is on 4.1.
agent: Thank you. Have you tried restarting both devices since this started happening?
user: I’ve restarted both, also un-paired then re-paired the watch.
agent: Got it. When did you first notice that the two devices were not talking to each other. Do the two devices communicate through other apps such as Messages?
user: Yes, everything seems fine, it’s just Health and activity.
agent: Let’s move to DM and look into this a bit more. When reaching out in DM, let us know when this first started happening please. For example, did it start after an update or after installing a certain app? 



In [9]:
#Formatting dialog validation text using the previous function
validation_dialog = validation_data["log"]
format_column(validation_dialog)

In [10]:
#Formatting dialog testing text using the previous function
testing_dialog = testing_data["log"]
format_column(testing_dialog)

In [11]:
#We can now print the first dialog of the validation dataset to make sure that its format is correct.
print(validation_dialog[0])

user: hey, any explanation why the "Create similar playlist" function doesn't work anymore for me? MacBook, v1.0.64.399.g4637b02a.
agent: Hi there, the cavalry's here! Does logging out, restarting your device, and logging back into Spotify help? Keep us in the loop /JI
user: no, it didn't :( tried everything but I still can't create the playlist. it's not even greyed out but nothing happens after clicking on it.
agent: Okay. Can we have you try reinstalling the app? To do so, just follow the steps at 
user: i tried and it's still the same... moreover, my song history is always empty, so I can't find songs from previous Discover playlists :(
agent: Does restarting your computer help at all? Also, is the song history you're referring to the History tab on your Play Queue? /MT
user: no, I tried that as well and just reinstalled again - didn't help. yes, that's what I mean.
agent: Could you DM us your account's email address or username? We'll take a look backstage /MT 



In [12]:
# We create a function used to format the summary columns of the training, validation, and testing datasets
import json

def format_column_summary(column):
    i=0
    for row in column:
        data = column[i]
        text = json.loads(data)
        text = text["summaries"]["abstractive_summaries"][0]
        text = " ".join(text)
        column[i] = text
        i+=1
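
The nesting this function relies on can be sketched on a toy record. The JSON below is a hypothetical stand-in that only reproduces the path used above (`summaries` → `abstractive_summaries` → first entry, itself a list of sentences); the real records presumably carry additional fields. The helper name `first_abstractive_summary` is also an assumption.

```python
import json

def first_abstractive_summary(raw_info: str) -> str:
    """Parse the JSON string and join the sentences of the first abstractive summary."""
    info = json.loads(raw_info)
    sentences = info["summaries"]["abstractive_summaries"][0]
    return " ".join(sentences)

# Toy record, invented for illustration only
toy_info = json.dumps({
    "summaries": {
        "abstractive_summaries": [
            ["Customer reports a sync issue.", "Agent asks to continue in DM."],
            ["An alternative summary written by another annotator."],
        ]
    }
})

summary = first_abstractive_summary(toy_info)
print(summary)
```

Only the first abstractive summary is kept, so any further summaries available for the same dialog are discarded by this step.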

In [13]:
# We use the previously created function to update the content of the "original dialog info" column in the training data
summary_training = training_data["original dialog info"]
format_column_summary(summary_training)

# We print the first row of the training column as a verification
print(summary_training[0])

Customer enquired about his Iphone and Apple watch which is not showing his any steps/activity and health activities. Agent is asking to move to DM and look into it.


In [14]:
# We do the same for the "original dialog info" column in the validation set
summary_validation = validation_data["original dialog info"]
format_column_summary(summary_validation)

# We print the first row of the validation column as a verification
print(summary_validation[0])

Customer is complaining about unable to create similar playlist so that  function does not  work anymore. Agent says could DM the account's email address or username so that they look backstage.


In [15]:
# We do the same for the "original dialog info" column in the testing set
summary_testing = testing_data["original dialog info"]
format_column_summary(summary_testing)

# We print the first row of the testing column as a verification
print(summary_testing[0])

Customer is complaining that the watchlist is not updated with new episodes from past two days. Agent informed that the team is working hard to investigate to show new episodes on page.


In [16]:
#We rename the current columns of the dataset to "dialog" and "summary" for better understandability

training_data.rename(columns={'original dialog info': 'summary', 'log': 'dialog'}, inplace=True)
training_data = training_data[['dialog', 'summary']]

validation_data.rename(columns={'original dialog info': 'summary', 'log': 'dialog'}, inplace=True)
validation_data = validation_data[['dialog', 'summary']]

testing_data.rename(columns={'original dialog info': 'summary', 'log': 'dialog'}, inplace=True)
testing_data = testing_data[['dialog', 'summary']]

**We create a dataset dictionary containing all three sets so that we can push it to the Hugging Face Hub and use it in the evaluation part.**

In [17]:
# We convert each pandas DataFrame into a Hugging Face Dataset so the sets can be grouped and pushed to the Hub

from datasets import DatasetDict, Dataset

dataset_training = Dataset.from_pandas(training_data)
dataset_validation = Dataset.from_pandas(validation_data)
dataset_testing = Dataset.from_pandas(testing_data)

In [18]:
# We create a dataset dictionary where we store the training, validation, and testing sets
from datasets import DatasetDict, Dataset

final_dataset = DatasetDict({
    'training': dataset_training,
    'validation' : dataset_validation,
    'testing' : dataset_testing
    })

In [19]:
# We visualise the resulting dataset
final_dataset

DatasetDict({
    training: Dataset({
        features: ['dialog', 'summary'],
        num_rows: 879
    })
    validation: Dataset({
        features: ['dialog', 'summary'],
        num_rows: 110
    })
    testing: Dataset({
        features: ['dialog', 'summary'],
        num_rows: 110
    })
})

In [20]:
# We store the current dataset on the Hugging Face Hub to use it for the model evaluation
final_dataset.push_to_hub("Dialog-Summarization-Dataset", token="...")

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/Marouane50/Dialog-Summarization-Dataset/commit/25db6c8fb4473cf1c509f5f62bd5ceadb331a83f', commit_message='Upload dataset', commit_description='', oid='25db6c8fb4473cf1c509f5f62bd5ceadb331a83f', pr_url=None, pr_revision=None, pr_num=None)

**4. CONVERTING THE DATASET INTO THE REQUIRED INSTRUCTION FORMAT**

**To fine-tune the Llama2-7b-chat model, we need to convert the dataset into the instruction format the model requires. Each row in the instruction dataset should follow this format:**

[INST] <<SYS>> {{system_prompt}} <</SYS>> {{input}} [/INST] {{summary}} 

**Where {{ system_prompt }} represents the default prompt used in the dataset, {{ input }} represents the dialog to be summarised, and {{ summary }} represents the corresponding summary of the dialog.**
    
**So we need to reformat the current dataset to match this specific format.**
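
As a minimal sketch of this template, the helper below assembles one instruction row from its three parts. The function name `to_llama2_instruction` and the toy strings are hypothetical. The `<s>` and `</s>` wrappers are included because the conversion cells in this notebook add them around each row.

```python
def to_llama2_instruction(system_prompt: str, dialog: str, summary: str) -> str:
    """Fill the Llama 2 chat template with a system prompt, a dialog, and its summary."""
    return (
        f"<s> [INST] <<SYS>> {system_prompt} <</SYS>> "
        f"{dialog} [/INST] {summary} </s>"
    )

# Toy values, invented for illustration only
example = to_llama2_instruction(
    system_prompt="Write a summary of the conversation.",
    dialog="user:My app crashes.\nagent:Please reinstall it.\n",
    summary="Customer reports a crash. Agent suggests reinstalling.",
)
print(example)
```

Keeping the template in a single function means the training and validation sets are guaranteed to share exactly the same markers.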

In [21]:
# We create a default_prompt that will be used as a system prompt in the instruction dataset
default_prompt = "The following text is a conversation between a user and an AI agent. Write a summary of the conversation."

In [22]:
# We convert the training set into the instruction format for the Llama2 chat model
import pandas as pd

df_training = pd.DataFrame(dataset_training)

df_training['text'] = df_training.apply(
    lambda row: f"""<s> [INST] <<SYS>> {default_prompt} <</SYS>> {row['dialog']} [/INST] {row['summary']} </s>""",axis=1
)

dataset_training = df_training[['text']]

In [23]:
# We convert the validation set into the instruction format for the Llama2 chat model
df_validation = pd.DataFrame(dataset_validation)

df_validation['text'] = df_validation.apply(
    lambda row: f"""<s> [INST] <<SYS>> {default_prompt} <</SYS>> {row['dialog']} [/INST] {row['summary']} </s>""",axis=1
)

dataset_validation = df_validation[['text']]
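
Because the template is typed by hand for each split, a small format check can catch a marker that is misspelled in one of them. The sketch below is one possible check, not part of the original pipeline: the regular expression and the name `malformed_rows` are assumptions, and the two rows are invented examples.

```python
import re

# Expected shape of one instruction row; DOTALL lets the dialog span several lines
TEMPLATE = re.compile(
    r"^<s> \[INST\] <<SYS>> .+? <</SYS>> .*? \[/INST\] .+ </s>$",
    re.DOTALL,
)

def malformed_rows(texts):
    """Return the indices of rows that do not match the expected template."""
    return [i for i, text in enumerate(texts) if not TEMPLATE.match(text)]

# Invented rows: the second one has a stray space inside the opening marker
rows = [
    "<s> [INST] <<SYS>> Summarise. <</SYS>> user:hi\nagent:hello\n [/INST] Greeting. </s>",
    "<s> [ INST] <<SYS>> Summarise. <</SYS>> user:hi\nagent:hello\n [/INST] Greeting. </s>",
]

bad = malformed_rows(rows)
print(bad)
```

With the DataFrames built above, the same check could be run as `malformed_rows(df_validation['text'])` before pushing the dataset.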


In [24]:
# We convert the pandas DataFrames into the Dataset format used by the Hugging Face Hub; the testing set will not be used during training
dataset_training_formatted = Dataset.from_pandas(dataset_training)
dataset_validation_formatted = Dataset.from_pandas(dataset_validation)

In [25]:
# We create a dataset dictionary where we store the training and validation sets
from datasets import DatasetDict, Dataset

final_dataset_formatted = DatasetDict({
    'training': dataset_training_formatted,
    'validation' : dataset_validation_formatted
    })


In [26]:
# We visualise content of the dataset dictionary 
final_dataset_formatted

DatasetDict({
    training: Dataset({
        features: ['text'],
        num_rows: 879
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 110
    })
})

**Our final dataset is now ready, and we can push it to the Hugging Face Hub.**

In [27]:
# We store the final formatted dataset in Hugging Face Hub to use it for finetuning.
final_dataset_formatted.push_to_hub("Dialog-Summarization-Dataset-Formatted", token="...")

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/Marouane50/Dialog-Summarization-Dataset-Formatted/commit/24e6007483f48ae8253a4f8d2e69001d65ec1919', commit_message='Upload dataset', commit_description='', oid='24e6007483f48ae8253a4f8d2e69001d65ec1919', pr_url=None, pr_revision=None, pr_num=None)

Reference: TweetSum Dataset https://huggingface.co/datasets/Salesforce/dialogstudio/tree/main/dialogue_summarization/TweetSumm