In this notebook, we clean the data and prepare it for model training. Since the only input is 'conversation' and the only output is 'customer_sentiment', we drop all other features from the data. We investigated the other features in [1_eda.ipynb](1_eda.ipynb) and found some insights that may be interesting for company managers.

If you have not installed the required packages yet, uncomment and run the line below. Otherwise, the first import will fail.

In [1]:
#!pip install torch numpy transformers datasets tiktoken wandb tqdm matplotlib seaborn scipy scikit-learn

In [2]:
import wandb
from data.utils.prepare import load_reduced_data, encode_labels, remove_redundant_lines, train_val_split, encode_texts, save_dataset

We initialize the Weights & Biases project now. If you are not logged in to your wandb account, you should enter your wandb credentials in this step.

In [3]:
wandb.init(
    project="DI725_assignment_1_2389088"
)
config = wandb.config

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33merennarin-92[0m ([33merennarin-92-metu-middle-east-technical-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Before starting the preprocessing step, we load the data again. In this step, we also drop the unnecessary features, keeping only the "conversation" column as the feature and the "customer_sentiment" column as the target.

In [4]:
features = ["conversation"]
target = "customer_sentiment"
df_train, df_test = load_reduced_data(features, target)

Train Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 970 entries, 0 to 969
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   conversation        970 non-null    object
 1   customer_sentiment  970 non-null    object
dtypes: object(2)
memory usage: 15.3+ KB

Test Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   conversation        30 non-null     object
 1   customer_sentiment  30 non-null     object
dtypes: object(2)
memory usage: 608.0+ bytes


In [5]:
df_train.head(3)

Unnamed: 0,conversation,customer_sentiment
0,Agent: Thank you for calling BrownBox Customer...,neutral
1,Agent: Thank you for calling BrownBox customer...,neutral
2,Agent: Thank you for calling BrownBox Customer...,neutral


In [6]:
df_test.head(3)

Unnamed: 0,conversation,customer_sentiment
0,Agent: Thank you for calling BrownBox Customer...,negative
1,Agent: Thank you for calling BrownBox Customer...,negative
2,Agent: Thank you for calling BrownBox Customer...,negative


Since our target values are in string format, we should encode the labels. To avoid ambiguous labels, we use one standardized map for label encoding in both datasets: ('neutral': 0, 'positive': 1, 'negative': 2)

In [7]:
df_train_le = encode_labels(df_train, target)
df_train_le.head(3)

Unnamed: 0,conversation,customer_sentiment
0,Agent: Thank you for calling BrownBox Customer...,0
1,Agent: Thank you for calling BrownBox customer...,0
2,Agent: Thank you for calling BrownBox Customer...,0


In [8]:
df_test_le = encode_labels(df_test, target)
df_test_le.head(3)

Unnamed: 0,conversation,customer_sentiment
0,Agent: Thank you for calling BrownBox Customer...,2
1,Agent: Thank you for calling BrownBox Customer...,2
2,Agent: Thank you for calling BrownBox Customer...,2


As you can see, neutral values are converted to zeros, and negative values are converted to twos.
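
The fixed label map can be illustrated with a short sketch. The implementation of `encode_labels` is not shown in this notebook, so `encode_labels_sketch` below is a hypothetical stand-in. It assumes a plain dictionary lookup and a cast to `uint8`, which is the dtype that appears in the later `info()` output.

```python
import pandas as pd

# Fixed mapping taken from the text above; reusing it for every dataset
# guarantees that the same string always gets the same integer code.
LABEL_MAP = {"neutral": 0, "positive": 1, "negative": 2}


def encode_labels_sketch(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Hypothetical stand-in for encode_labels (assumed behavior)."""
    unknown = set(df[target].unique()) - set(LABEL_MAP)
    if unknown:
        # Fail loudly instead of silently producing NaN labels.
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    out = df.copy()  # leave the caller's frame untouched
    out[target] = out[target].map(LABEL_MAP).astype("uint8")
    return out


toy = pd.DataFrame({
    "conversation": ["a", "b", "c"],
    "customer_sentiment": ["negative", "neutral", "positive"],
})
encoded = encode_labels_sketch(toy, "customer_sentiment")
print(encoded["customer_sentiment"].tolist())  # [2, 0, 1]
```

A fixed dictionary is preferable here to fitting a separate encoder per dataset, because a per-dataset fit could assign different codes when a class is missing from one split.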

Now we can start cleaning the data. The conversations contain many repetitive sentences, such as opening lines. As a first step, we remove them with regular expressions. Phrases like "After a few seconds" also repeat many times within a conversation, but we keep them, since they probably have an effect on customer sentiment.

First, we lowercase the conversation text through the pandas string accessor, then apply remove_redundant_lines to each conversation.

In [9]:
df_train_cleaned = df_train_le.copy()
df_train_cleaned["conversation"] = df_train_cleaned["conversation"].str.lower()
df_train_cleaned["conversation"] = df_train_cleaned["conversation"].apply(remove_redundant_lines)

df_test_cleaned = df_test_le.copy()
df_test_cleaned["conversation"] = df_test_cleaned["conversation"].str.lower()
df_test_cleaned["conversation"] = df_test_cleaned["conversation"].apply(remove_redundant_lines)

In [10]:
df_train_cleaned.head(3)

Unnamed: 0,conversation,customer_sentiment
0,"C: hi tom, i'm trying to log in to my account ...",0
1,C: hi alex. i recently received an email from ...,0
2,"C: hi sarah, i am calling because i am unable ...",0


In [11]:
df_test_cleaned.head(3)

Unnamed: 0,conversation,customer_sentiment
0,"C: hi, sarah. i am calling because i am intere...",2
1,"C: hi sarah, my name is john. i'm having troub...",2
2,"C: hi jane, i am calling regarding the refund ...",2
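
The regex-based cleaning described above can be sketched as follows. The actual patterns inside `remove_redundant_lines` are not shown in this notebook, so every pattern below is an assumption chosen to reproduce the visible output: the opening greeting disappears and the speaker tag becomes `C:`. The input is assumed to be lowercased already.

```python
import re

# Assumed patterns, for illustration only; the real helper may differ.
GREETING = re.compile(r"^agent: thank you for calling[^\n]*\n?", flags=re.MULTILINE)
SPEAKER = re.compile(r"^(agent|customer):", flags=re.MULTILINE)
BLANK_LINES = re.compile(r"\n{2,}")


def remove_redundant_lines_sketch(text: str) -> str:
    """Hypothetical stand-in for remove_redundant_lines."""
    text = GREETING.sub("", text)  # drop the repeated opening line
    # Shorten the speaker tags to a single capital letter ("C:", "A:").
    text = SPEAKER.sub(lambda m: m.group(1)[0].upper() + ":", text)
    return BLANK_LINES.sub("\n", text).strip()  # collapse empty lines


sample = (
    "agent: thank you for calling brownbox customer support. how may i help you?\n\n"
    "customer: hi tom, i'm trying to log in to my account.\n\n"
    "agent: sure, let me check. (after a few seconds) i found the issue."
)
cleaned = remove_redundant_lines_sketch(sample)
print(cleaned)
```

Note that the "(after a few seconds)" phrase is left in place, in line with the decision to keep it.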


The data cleaning phase is complete. Now we can split the training data into train and validation datasets.

In [12]:
df_train_final, df_val_final = train_val_split(df_train_cleaned, target=target)
df_test_final = df_test_cleaned.copy()

df_train_final.info()
df_val_final.info()
df_test_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 776 entries, 854 to 817
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   conversation        776 non-null    object
 1   customer_sentiment  776 non-null    uint8 
dtypes: object(1), uint8(1)
memory usage: 12.9+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 194 entries, 307 to 835
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   conversation        194 non-null    object
 1   customer_sentiment  194 non-null    uint8 
dtypes: object(1), uint8(1)
memory usage: 3.2+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   conversation        30 non-null     object
 1   customer_sentiment  30 non-null     uint8 
dtypes: object(1), 
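
The sizes above (776 and 194 out of 970 rows) correspond to an 80/20 split. A minimal sketch of such a split is given below. The implementation of `train_val_split` is not shown in this notebook, so the use of scikit-learn's `train_test_split`, the stratification on the target, and the seed value are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def train_val_split_sketch(df: pd.DataFrame, target: str,
                           val_size: float = 0.2, seed: int = 42):
    """Hypothetical stand-in for train_val_split (assumed behavior)."""
    # stratify keeps the class proportions the same in both parts,
    # which matters when the sentiment classes are imbalanced.
    return train_test_split(
        df, test_size=val_size, stratify=df[target], random_state=seed
    )


toy = pd.DataFrame({
    "conversation": [f"text {i}" for i in range(100)],
    "customer_sentiment": [0] * 50 + [1] * 30 + [2] * 20,
})
train_df, val_df = train_val_split_sketch(toy, "customer_sentiment")
print(len(train_df), len(val_df))  # 80 20
```

The rows are shuffled before splitting, which is consistent with the non-sequential index values in the `info()` output above.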

The next step is to encode all datasets with tiktoken's GPT-2 encoder and save them to wandb.

In [13]:
df_train_final["conversation"] = df_train_final["conversation"].apply(encode_texts)
df_val_final["conversation"] = df_val_final["conversation"].apply(encode_texts)
df_test_final["conversation"] = df_test_final["conversation"].apply(encode_texts)

In [14]:
save_dataset(df_train_final, "train")
save_dataset(df_val_final, "val")
save_dataset(df_test_final, "test")

TODO: explain dataset

TODO: summarize process