# <center>**data cleaning** 🧹

This notebook is a pipeline that cleans and pre-processes the training data (***data normalization with REGEX***)...

---

In [1]:
import pandas as pd
import re

In [22]:
# read CSV file
data = pd.read_csv("intents-data.csv")

### single-question dataset

In [23]:
sq_data = data[5833:].copy()

input_, label = "text", "sentiment"
cols = [input_, label]
sq_data = sq_data[cols]

# take a look
print(sq_data.shape)
display(sq_data.sample(5))

(637, 2)


Unnamed: 0,text,sentiment
5966,The Rowdy watch for men with a stainless steel...,postSale
5904,I'm searching for a pair of unisex snowboard g...,inventory
6287,Who is the Pope?,irrelevant
5989,I'm looking for a bag to transport the snowboard.,inventory
6069,Long ethnic floral print dress,inventory


In [24]:
# strip surrounding whitespace (if any) and prefix each text with "Question 1: "
sq_data.text = sq_data.text.map(lambda x: "Question 1: "+x.strip())
display(sq_data.sample(5))

Unnamed: 0,text,sentiment
6144,Question 1: Can customers track their orders i...,conversational
6374,Question 1: I bought two pairs of shoes in siz...,postSale
6292,Question 1: Is there a waiting list for the vi...,inventory
6437,"Question 1: I ordered a sweater in size S, and...",postSale
6028,Question 1: I have purchased a set of ski clot...,postSale
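
The cell above assumes every `text` value is a string, since `x.strip()` fails on a missing value. As a minimal sketch, assuming some rows could have no text (the toy frame below is made up and is not the real dataset), the same step can be written with the vectorized `.str` accessor after dropping empty rows:

```python
import pandas as pd

# Toy data standing in for the single-question rows (hypothetical values).
toy = pd.DataFrame({
    "text": ["  Who is the Pope?  ", None, "Long ethnic floral print dress"],
    "sentiment": ["irrelevant", "inventory", "inventory"],
})

# Drop rows without text, then strip whitespace and add the prefix.
toy = toy.dropna(subset=["text"]).copy()
toy["text"] = "Question 1: " + toy["text"].str.strip()

print(toy["text"].tolist())
# ['Question 1: Who is the Pope?', 'Question 1: Long ethnic floral print dress']
```

Whether dropping such rows is the right choice depends on the data; here it only illustrates where the guard would go.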


### multi-question dataset

In [25]:
# multi-question rows: the first 5833 rows of the CSV
mq_data = data[:5833].copy()

input_, label = "text", "sentiment"
cols = [input_, label]
mq_data = mq_data[cols]

# take a look
print(mq_data.shape)
display(mq_data.sample(5))

(5833, 2)


Unnamed: 0,text,sentiment
2757,Client: I'm just so frustrated because I've be...,inventory
1045,"Client: Hello, do you have the Burton Snowboar...",storefront
50,Client: Got any snow boots in size 9? \nAssis...,inventory
3800,"Client: I bought a jacket from your store, and...",conversational
2961,Client: I must say your website is an absolute...,postSale


First, we separate the multi-question dataset into a **question-format dataset** and a **GPT dataset**...

In [26]:
# strip blank spaces (if any)
mq_data.text = mq_data.text.map(lambda x: x.strip())

# separate mq_data into qf_data and gpt_data
qf_data = mq_data[mq_data.text.str.startswith("Question")].copy()  # Question 1:... Question 2:...
gpt_data = mq_data[mq_data.text.str.startswith("Client")].copy()  # Client: ... \nAssistant:... \nClient:...

print(qf_data.shape)
print(gpt_data.shape)
print(len(mq_data)-len(gpt_data)-len(qf_data))

(173, 2)
(5645, 2)
15
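
The split above can be reproduced on a few toy rows to see how the three groups (question format, GPT format, no match) come out. This is a minimal sketch with made-up texts; the names `toy`, `is_qf`, `is_gpt` and `rest_toy` are hypothetical and do not appear in the notebook:

```python
import pandas as pd

# Three made-up rows, one per group.
toy = pd.DataFrame({"text": [
    "Question 1: Do you ship abroad?\nQuestion 2: How long does it take?",
    "Client: Got any snow boots in size 9?\nAssistant: Yes, we do.",
    "Hello, welcome to Rowdy! How can I assist you today?",
]})

# Boolean masks based on the leading prefix of each text.
is_qf = toy["text"].str.startswith("Question")
is_gpt = toy["text"].str.startswith("Client")

qf_toy = toy[is_qf]
gpt_toy = toy[is_gpt]
rest_toy = toy[~(is_qf | is_gpt)]  # rows that match neither prefix

print(len(qf_toy), len(gpt_toy), len(rest_toy))  # 1 1 1
```

Keeping the two masks around makes the "no match" group a one-liner instead of a length difference.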


Second, we identify the **instances that did not match** the criteria of either dataset...

In [27]:
no_match = mq_data[~mq_data.index.isin(qf_data.index) & ~mq_data.index.isin(gpt_data.index)].copy()
display(no_match)
# NOTES:
# case A: 419, 849, 931, 990, 1220, 2298, 2728, 3784, 4755 and 5739 are missing the leading "Client:" (...\nAssistant:... or ...)
# case B: 58 and 4081 start with assistant text (...Client:...\nAssistant:...); we only need to delete that part
# case C: 526, 3550 and 3734 contain only assistant text, so we drop them

Unnamed: 0,text,sentiment
58,"Hello, welcome to Rowdy! How can I assist you ...",conversational
419,"Hey, I just wanted to say that I really like t...",storefront
526,"Hello, welcome to Rowdy! How can I assist you ...",conversational
849,"Hello, I was browsing online for some winter s...",conversational
931,"Hey there, can you tell me more about Rowdy an...",conversational
990,Hello.\nAssistant: Welcome to Rowdy! How can I...,conversational
1220,"Hey, I'm looking for some cozy winter sweaters...",inventory
2298,"Hello, how are you today?\nI'm here to help yo...",conversational
2728,"Hey there, Rowdy! I love shopping for winter g...",conversational
3550,"Hello, welcome to Rowdy! How can I assist you ...",conversational


Third, we **fix those instances** and add them to their corresponding dataset...

In [28]:
def remove_text_before_keyword(text, keyword):
    # Find the position of the keyword in the text
    keyword_pos = text.find(keyword)
    # If the keyword is not found, return the original text
    if keyword_pos == -1:
        return text
    # Slice the text from the keyword position to the end
    return text[keyword_pos:]

In [29]:
# CASE A: add "Client:" at the beginning
idx_A = [419, 849, 931, 990, 1220, 2298, 2728, 3784, 4755, 5739]
case_A = no_match.loc[idx_A,:].copy() 
case_A.text = case_A.text.map(lambda x: "Client: "+x) 

# CASE B: delete text before "Client"
idx_B = [58, 4081] # these cases can also be found with: no_match['text'].str.contains('Client')
case_B = no_match.loc[idx_B,:].copy() 
case_B.text = case_B.text.map(lambda x: remove_text_before_keyword(x,"Client"))

# CASE C: nothing to fix, these rows are dropped
# (we lose 3 instances)

# concat fixed data to gpt data
gpt_fixed_data = pd.concat([gpt_data, case_A, case_B])
gpt_fixed_data.sort_index(inplace=True)
len(gpt_fixed_data) == len(gpt_data) + len(case_A) + len(case_B)

True
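
To make the two repairs concrete, here is a small sketch on made-up strings (not rows from the dataset). It uses a condensed copy of `remove_text_before_keyword`; passing `"Client:"` with the colon as the keyword is an assumption of this sketch, chosen so that the word "Client" inside an assistant greeting would not be picked up by mistake:

```python
def remove_text_before_keyword(text, keyword):
    # Condensed copy of the notebook helper: keep the text from the keyword on.
    pos = text.find(keyword)
    return text if pos == -1 else text[pos:]

# Made-up examples of the two repairable cases.
case_a = "Hello.\nAssistant: Welcome to Rowdy! How can I help?"
case_b = "Hello, welcome to Rowdy!\nClient: Do you sell gloves?\nAssistant: Yes."

fixed_a = "Client: " + case_a                              # case A: add the missing prefix
fixed_b = remove_text_before_keyword(case_b, "Client:")   # case B: drop the leading assistant text

print(repr(fixed_a))  # 'Client: Hello.\nAssistant: Welcome to Rowdy! How can I help?'
print(repr(fixed_b))  # 'Client: Do you sell gloves?\nAssistant: Yes.'
```

After either repair the text starts with `Client`, so it satisfies the same prefix test used to build `gpt_data`.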

In [30]:
# take a look
print(gpt_fixed_data.shape)
display(gpt_fixed_data.sample(5))

(5657, 2)


Unnamed: 0,text,sentiment
1997,"Client: Hello, I'm looking for a warm winter c...",conversational
4297,"Client: Hey, do you even have that ski jacket ...",storefront
2964,Client: Hello there! I hope you're having a fa...,postSale
4998,Client: Hey there! Excited for winter to come?...,conversational
4739,"Client: Hello, do you have the black ski jacke...",inventory


Now, we **format the GPT-dataset**...

In [31]:
def extract_client(text):
    # split the text on the 'Client:' prefix and drop empty parts
    client_parts = [part.strip() for part in text.split('Client:') if part]
    
    # keep only the client text of each part (drop everything from 'Assistant:' on)
    client_texts = []
    for part in client_parts:
        if 'Assistant:' in part:
            client_texts.append(part.split('Assistant:')[0].strip())
        else:
            client_texts.append(part.strip())
    return client_texts

def format_text(texts):
    # Normalize the client texts: Client: ...\nClient: ... into the desired format: Question 1:...\nQuestion 2:... 
    normalized_texts = [f"Question {i+1}: {text}" for i, text in enumerate(texts)]
    return '\n'.join(normalized_texts)
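
As a quick trace of what the two helpers do, the sketch below runs condensed copies of them on one made-up conversation. The conversation text is invented for illustration; the logic is meant to match the functions defined above:

```python
def extract_client(text):
    # Split on 'Client:' and drop the empty string before the first prefix.
    parts = [part.strip() for part in text.split("Client:") if part]
    # Keep only what precedes the assistant's reply in each part.
    return [part.split("Assistant:")[0].strip() for part in parts]

def format_text(texts):
    # Number the client turns as "Question 1: ...", "Question 2: ...".
    return "\n".join(f"Question {i+1}: {t}" for i, t in enumerate(texts))

conversation = (
    "Client: Got any snow boots in size 9?\n"
    "Assistant: Yes, we do.\n"
    "Client: How much are they?\n"
    "Assistant: They cost 80 dollars."
)

turns = extract_client(conversation)
print(turns)               # ['Got any snow boots in size 9?', 'How much are they?']
print(format_text(turns))
# Question 1: Got any snow boots in size 9?
# Question 2: How much are they?
```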

In [32]:
gpt_formatted_data = gpt_fixed_data.copy()
gpt_formatted_data.text = gpt_formatted_data.text.apply(extract_client)
gpt_formatted_data.text = gpt_formatted_data.text.apply(format_text)

print(gpt_formatted_data.shape)
display(gpt_formatted_data.sample(5))

(5657, 2)


Unnamed: 0,text,sentiment
1312,"Question 1: Hi, do you have the Roxy snow pant...",irrelevant
803,Question 1: Hello! I hope you're doing well to...,conversational
1518,Question 1: Hey there! I'm looking for a warm ...,inventory
1448,Question 1: The jacket I bought is too small.,postSale
2550,"Question 1: Hello, I was wondering what paymen...",irrelevant
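
The notebook imports `re`, but the helpers above rely on `str.split`. For comparison, here is a regex-based variant of the extraction step. It is an alternative sketch, not the method used in this notebook, and the names `CLIENT_TURN` and `extract_client_re` are hypothetical:

```python
import re

# Capture the text after each 'Client:' up to the next speaker tag or the end of the string.
CLIENT_TURN = re.compile(r"Client:\s*(.*?)\s*(?=Assistant:|Client:|\Z)", flags=re.DOTALL)

def extract_client_re(text):
    # findall returns the captured group for every client turn, in order.
    return CLIENT_TURN.findall(text)

conversation = (
    "Client: Got any snow boots in size 9?\n"
    "Assistant: Yes, we do.\n"
    "Client: How much are they?\n"
    "Assistant: They cost 80 dollars."
)

print(extract_client_re(conversation))
# ['Got any snow boots in size 9?', 'How much are they?']
```

The lazy group together with the lookahead stops each match right before the next `Assistant:` or `Client:` tag, which mirrors the split-based logic on well-formed conversations.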


Finally, we concatenate the GPT-formatted data and the question-format data to obtain the **final multi-question dataset**...

In [33]:
# concat gpt formatted data and question format data
mq_data_final = pd.concat([gpt_formatted_data, qf_data])
mq_data_final.sort_index(inplace=True)

mq_data_final.shape

(5830, 2)
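
The concatenation relies on the two pieces having disjoint indexes, otherwise `sort_index` would interleave duplicated labels. As a hedged sketch on toy frames (the frames `a` and `b` are made up), `pd.concat` can be asked to check this with `verify_integrity=True`:

```python
import pandas as pd

# Two toy frames with disjoint indexes.
a = pd.DataFrame({"text": ["Question 1: a"]}, index=[0])
b = pd.DataFrame({"text": ["Question 1: b"]}, index=[1])

# Succeeds because no index label appears twice.
merged = pd.concat([a, b], verify_integrity=True).sort_index()

# Concatenating a frame with itself duplicates label 0 and raises ValueError.
try:
    pd.concat([a, a], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(len(merged), overlap_detected)  # 2 True
```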

### final intents dataset

In [34]:
# concat pre-processed multi-question data and single-question data
train_set = pd.concat([sq_data, mq_data_final])
train_set.sort_index(inplace=True)

print(train_set.shape)
display(train_set.sample(5))

(6467, 2)


Unnamed: 0,text,sentiment
5497,"Question 1: Hey, I bought this winter jacket f...",postSale
1515,"Question 1: Ugh, I've been scrolling through y...",storefront
4104,Question 1: ksjhfkjsha,irrelevant
2763,Question 1: I've been shopping online for year...,storefront
5470,Question 1: Hey.,conversational


In [35]:
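# sanity check: the training set holds every single-question row plus every kept multi-question row
mq_len = len(qf_data)+len(gpt_data)+len(case_A)+len(case_B)
len(train_set) == len(sq_data)+mq_len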

True

In [36]:
# sanity check: the only rows lost are the unmatched ones we did not fix (case C)
lost = len(mq_data)+len(sq_data)-len(train_set)
len(no_match)-len(case_A)-len(case_B) == lost

True
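
The checks above compare row counts only. A complementary check on the text format could look like the sketch below, which assumes every cleaned text should contain `Question <n>:` markers at line starts, numbered consecutively from 1. The helper names are hypothetical and the sample strings are made up:

```python
import re

def question_numbers(text):
    # Numbers of all 'Question <n>:' markers found at the start of a line.
    return [int(n) for n in re.findall(r"^Question (\d+):", text, flags=re.MULTILINE)]

def is_well_formed(text):
    # At least one marker, numbered 1, 2, ..., n without gaps.
    nums = question_numbers(text)
    return bool(nums) and nums == list(range(1, len(nums) + 1))

samples = [
    "Question 1: Hey.",
    "Question 1: Do you ship abroad?\nQuestion 2: How long does it take?",
    "Client: Hello",                    # row that was never converted
    "Question 1: Hi\nQuestion 3: Bye",  # skipped number
]

print([is_well_formed(s) for s in samples])  # [True, True, False, False]
```

On the real data this could be applied as `train_set.text.map(is_well_formed).all()`; whether the question-format rows follow exactly this numbering is an assumption that would need to be checked against the data.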

In [38]:
# save the final training set (the index is not written to the file)
train_set.to_csv("dataset.csv", index=False)

---