In this notebook, we clean the data and prepare it for model training. Since the only input is 'conversation' and the only output is 'customer_sentiment', we drop all other features. We investigated those features in [1_eda.ipynb](1_eda.ipynb) and found some insights that may interest company managers.

If you have not installed the required packages yet, uncomment and run the line below. Otherwise, the first import will fail.

In [1]:
#!pip install torch numpy transformers datasets tiktoken wandb tqdm matplotlib seaborn scipy scikit-learn

In [2]:
import wandb
from data.utils.prepare import load_reduced_data, encode_labels, remove_redundant_lines, train_val_split, encode_texts, save_dataset

We now initialize the Weights & Biases project. If you are not logged in to your wandb account, you will be asked for your wandb credentials at this step.

In [3]:
wandb.init(
    project="DI725_assignment_1_2389088"
)
config = wandb.config

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33merennarin-92[0m ([33merennarin-92-metu-middle-east-technical-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Before preprocessing, we load the data again and drop the unnecessary features in the same step: we keep only the "conversation" column as the feature and the "customer_sentiment" column as the target. We also split the training data into train and validation sets.

In [4]:
features = ["conversation"]
target = "customer_sentiment"
df_train, df_test = load_reduced_data(features, target)

df_train, df_val = train_val_split(df_train, target=target)

df_train.info()
df_val.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 776 entries, 183 to 359
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   conversation        776 non-null    object
 1   customer_sentiment  776 non-null    object
dtypes: object(2)
memory usage: 18.2+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 194 entries, 26 to 779
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   conversation        194 non-null    object
 1   customer_sentiment  194 non-null    object
dtypes: object(2)
memory usage: 4.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   conversation        30 non-null     object
 1   customer_sentiment  30 non-null     object
dtypes: object(2)
memory usage: 608.0+ b
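
The 776 and 194 rows reported above are consistent with an 80/20 split of the 970 training conversations. The internals of `train_val_split` are not shown in this notebook, so the following is only a minimal sketch of how such a helper could look, assuming it wraps scikit-learn's `train_test_split`. The function name `train_val_split_sketch`, the stratification, and the fixed seed are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def train_val_split_sketch(df, target, val_size=0.2, seed=42):
    """Hypothetical stand-in for train_val_split: a stratified hold-out split."""
    df_train, df_val = train_test_split(
        df,
        test_size=val_size,   # 20% validation, consistent with the 776/194 split
        stratify=df[target],  # assumption: keep class proportions equal in both parts
        random_state=seed,    # assumption: fixed seed for reproducibility
    )
    return df_train, df_val


# Toy data with the same two columns as the notebook
toy = pd.DataFrame({
    "conversation": [f"dialogue {i}" for i in range(20)],
    "customer_sentiment": ["neutral"] * 10 + ["negative"] * 5 + ["positive"] * 5,
})
toy_train, toy_val = train_val_split_sketch(toy, target="customer_sentiment")
print(len(toy_train), len(toy_val))
print(toy_val["customer_sentiment"].value_counts())
```

Stratifying on the target is a common choice when classes are imbalanced, because it prevents a rare sentiment class from being under-represented in the validation set.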

In [5]:
df_train.head(3)

Unnamed: 0,conversation,customer_sentiment
183,"Customer: Hi, I'm calling because I have an is...",negative
823,"Customer: Hi, I received an email from BrownBo...",neutral
649,Agent: Thank you for contacting BrownBox custo...,neutral


In [6]:
df_val.head(3)

Unnamed: 0,conversation,customer_sentiment
26,"Agent: Hello, thank you for calling BrownBox C...",neutral
834,Agent: Thank you for calling BrownBox Customer...,neutral
207,Agent: Thank you for calling BrownBox Customer...,neutral


In [7]:
df_test.head(3)

Unnamed: 0,conversation,customer_sentiment
0,Agent: Thank you for calling BrownBox Customer...,negative
1,Agent: Thank you for calling BrownBox Customer...,negative
2,Agent: Thank you for calling BrownBox Customer...,negative


Since our target values are strings, we need to encode the labels. To keep the encoding unambiguous, we use one standardized label map for all datasets ('neutral': 0, 'positive': 1, 'negative': 2); each label becomes a one-hot vector with a 1 at its mapped index.

In [8]:
df_train_le = df_train.copy()
df_train_le["customer_sentiment"] = df_train["customer_sentiment"].apply(encode_labels)
df_train_le.head(3)

Unnamed: 0,conversation,customer_sentiment
183,"Customer: Hi, I'm calling because I have an is...","[0, 0, 1]"
823,"Customer: Hi, I received an email from BrownBo...","[1, 0, 0]"
649,Agent: Thank you for contacting BrownBox custo...,"[1, 0, 0]"


In [9]:
df_val_le = df_val.copy()
df_val_le["customer_sentiment"] = df_val["customer_sentiment"].apply(encode_labels)
df_val_le.head(3)

Unnamed: 0,conversation,customer_sentiment
26,"Agent: Hello, thank you for calling BrownBox C...","[1, 0, 0]"
834,Agent: Thank you for calling BrownBox Customer...,"[1, 0, 0]"
207,Agent: Thank you for calling BrownBox Customer...,"[1, 0, 0]"


In [10]:
df_test_le = df_test.copy()
df_test_le["customer_sentiment"] = df_test["customer_sentiment"].apply(encode_labels)
df_test_le.head(3)

Unnamed: 0,conversation,customer_sentiment
0,Agent: Thank you for calling BrownBox Customer...,"[0, 0, 1]"
1,Agent: Thank you for calling BrownBox Customer...,"[0, 0, 1]"
2,Agent: Thank you for calling BrownBox Customer...,"[0, 0, 1]"


As you can see, neutral values are converted to [1, 0, 0], and negative values are converted to [0, 0, 1].
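
The encoding seen in these outputs can be reproduced with a few lines of plain Python. This is a sketch only: `encode_labels` itself lives in `data/utils/prepare` and is not shown here, so the names `LABEL_TO_INDEX` and `encode_label_sketch` are hypothetical, and the only thing taken from the notebook is the label map.

```python
# Label map stated in the notebook
LABEL_TO_INDEX = {"neutral": 0, "positive": 1, "negative": 2}


def encode_label_sketch(label):
    """Hypothetical stand-in for encode_labels: one-hot encode a sentiment string."""
    one_hot = [0] * len(LABEL_TO_INDEX)
    one_hot[LABEL_TO_INDEX[label]] = 1  # set the position given by the map
    return one_hot


for name in LABEL_TO_INDEX:
    print(name, encode_label_sketch(name))
# neutral [1, 0, 0]
# positive [0, 1, 0]
# negative [0, 0, 1]
```

Using one fixed map for every dataset guarantees that the same vector position always means the same sentiment, which would not hold if each split were encoded by its own fitted encoder.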

Now we can start cleaning the data. It contains many repetitive sentences, such as opening lines. As a first step, we remove them using regular expressions. Phrases like "After a few seconds" also repeat many times within conversations, but we keep them, since they probably affect customer sentiment.

First, we lowercase the conversation text, then remove the redundant lines.
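
To illustrate the idea, here is a small regex-based sketch. The real `remove_redundant_lines` is not shown in this notebook, so everything in the block is an assumption: the pattern `OPENING_LINE`, the function name, and the premise that each turn sits on its own line. The `C:` and `W:` speaker abbreviations mirror what appears in the notebook's cleaned output.

```python
import re

# Hypothetical pattern: an agent turn that opens with a stock greeting.
OPENING_LINE = re.compile(
    r"^agent: thank you for (calling|contacting)[^\n]*\n?", re.MULTILINE
)


def remove_redundant_lines_sketch(text):
    """Hypothetical stand-in for remove_redundant_lines (expects lowercased text)."""
    text = OPENING_LINE.sub("", text)  # drop boilerplate opening turns
    # Shorten speaker labels to save tokens (assumed from the cleaned output)
    text = re.sub(r"^customer:", "C:", text, flags=re.MULTILINE)
    text = re.sub(r"^agent:", "W:", text, flags=re.MULTILINE)
    return text.strip()


raw = (
    "Agent: Thank you for calling BrownBox Customer Support. My name is Sarah.\n"
    "Customer: Hi Sarah, my order arrived damaged.\n"
    "Agent: I'm sorry to hear that. After a few seconds, I found your order."
)
cleaned = remove_redundant_lines_sketch(raw.lower())
print(cleaned)
```

In this toy conversation the greeting turn disappears, while the "after a few seconds" phrase survives because the pattern only targets opening lines.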

In [11]:
df_train_cleaned = df_train_le.copy()
df_train_cleaned["conversation"] = df_train_cleaned["conversation"].str.lower()
df_train_cleaned["conversation"] = df_train_cleaned["conversation"].apply(remove_redundant_lines)

df_val_cleaned = df_val_le.copy()
df_val_cleaned["conversation"] = df_val_cleaned["conversation"].str.lower()
df_val_cleaned["conversation"] = df_val_cleaned["conversation"].apply(remove_redundant_lines)

df_test_cleaned = df_test_le.copy()
df_test_cleaned["conversation"] = df_test_cleaned["conversation"].str.lower()
df_test_cleaned["conversation"] = df_test_cleaned["conversation"].apply(remove_redundant_lines)

In [12]:
df_train_cleaned.head(3)

Unnamed: 0,conversation,customer_sentiment
183,"C: hi, i'm calling because i have an issue wit...","[0, 0, 1]"
823,"C: hi, i received an email from brownbox stati...","[1, 0, 0]"
649,C: hi rachel. i recently ordered a water geyse...,"[1, 0, 0]"


In [13]:
df_val_cleaned.head(3)

Unnamed: 0,conversation,customer_sentiment
26,"W: hello, thank you for calling brownbox custo...","[1, 0, 0]"
834,W: thank you for calling brownbox customer sup...,"[1, 0, 0]"
207,"C: hi sarah, i recently purchased an air coole...","[1, 0, 0]"


In [14]:
df_test_cleaned.head(3)

Unnamed: 0,conversation,customer_sentiment
0,"C: hi, sarah. i am calling because i am intere...","[0, 0, 1]"
1,"C: hi sarah, my name is john. i'm having troub...","[0, 0, 1]"
2,"C: hi jane, i am calling regarding the refund ...","[0, 0, 1]"


The next step is to encode all datasets with the GPT-2 tiktoken encoder and save them.

In [15]:
df_train_final = df_train_cleaned.copy()
df_val_final = df_val_cleaned.copy()
df_test_final = df_test_cleaned.copy()

df_train_final["conversation"] = df_train_final["conversation"].apply(encode_texts)
df_val_final["conversation"] = df_val_final["conversation"].apply(encode_texts)
df_test_final["conversation"] = df_test_final["conversation"].apply(encode_texts)

In [16]:
df_train_final.head(3)

Unnamed: 0,conversation,customer_sentiment
183,"[34, 25, 23105, 11, 1312, 1101, 4585, 780, 131...","[0, 0, 1]"
823,"[34, 25, 23105, 11, 1312, 2722, 281, 3053, 422...","[1, 0, 0]"
649,"[34, 25, 23105, 3444, 2978, 13, 1312, 2904, 61...","[1, 0, 0]"


In [17]:
df_val_final.head(3)

Unnamed: 0,conversation,customer_sentiment
26,"[54, 25, 23748, 11, 5875, 345, 329, 4585, 7586...","[1, 0, 0]"
834,"[54, 25, 5875, 345, 329, 4585, 7586, 3524, 649...","[1, 0, 0]"
207,"[34, 25, 23105, 264, 23066, 11, 1312, 2904, 81...","[1, 0, 0]"


In [18]:
df_test_final.head(3)

Unnamed: 0,conversation,customer_sentiment
0,"[34, 25, 23105, 11, 264, 23066, 13, 1312, 716,...","[0, 0, 1]"
1,"[34, 25, 23105, 264, 23066, 11, 616, 1438, 318...","[0, 0, 1]"
2,"[34, 25, 23105, 474, 1531, 11, 1312, 716, 4585...","[0, 0, 1]"


In [19]:
save_dataset(df_train_final, "train")
save_dataset(df_val_final, "val")
save_dataset(df_test_final, "test")