In this notebook, we clean the data and prepare it for model training. Since the only input is 'conversation' and the only output is 'customer_sentiment', we drop all other features. Those features were explored in [1_eda.ipynb](1_eda.ipynb), which surfaced some insights that may interest company managers.

If you have not installed the required packages yet, uncomment and run the line below. Otherwise, the first import will fail.

In [1]:
#!pip install torch numpy transformers datasets tiktoken wandb tqdm matplotlib seaborn scipy scikit-learn

In [2]:
import wandb
import datasets
from datasets import Dataset
from data.utils.prepare import load_reduced_data, encode_labels, remove_redundant_lines, train_val_split, tokenize_function, save_dataset, save_datasets


We now initialize the Weights & Biases project. If you are not logged in to your wandb account, you will be prompted for your credentials at this step.

In [3]:
wandb.init(
    project="DI725_assignment_1_2389088"
)
config = wandb.config

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33merennarin-92[0m ([33merennarin-92-metu-middle-east-technical-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Before preprocessing, we load the data again and drop the unnecessary features. We keep only the "conversation" column as the feature (renamed to "text") and the "customer_sentiment" column as the target (renamed to "label"). We then split the training data into train and validation sets.

In [4]:
feature = "text"
target = "label"
columns = {"conversation": feature, "customer_sentiment": target}

df_train, df_test = load_reduced_data(columns)

df_train, df_val = train_val_split(df_train, target=target)

df_train.info()
df_val.info()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 776 entries, 183 to 359
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    776 non-null    object
 1   label   776 non-null    object
dtypes: object(2)
memory usage: 18.2+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 194 entries, 26 to 779
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    194 non-null    object
 1   label   194 non-null    object
dtypes: object(2)
memory usage: 4.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    30 non-null     object
 1   label   30 non-null     object
dtypes: object(2)
memory usage: 608.0+ bytes
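
The 776 and 194 row counts above are consistent with an 80/20 split of the training data. The body of `train_val_split` is not shown in this notebook, so the following is only a minimal sketch of one way such a split could be done, assuming it is stratified on the target column. It uses pandas group-wise sampling; the real helper may instead rely on scikit-learn's `train_test_split`. The function name, `val_frac`, `seed`, and the toy rows are all hypothetical.

```python
import pandas as pd


def stratified_train_val_split(df, target, val_frac=0.2, seed=42):
    """Hold out val_frac of each class for validation (hypothetical helper)."""
    # Sample the same fraction from every label group so class ratios are preserved.
    df_val = df.groupby(target).sample(frac=val_frac, random_state=seed)
    # Every row that was not sampled for validation stays in the training set.
    df_train = df.drop(df_val.index)
    return df_train, df_val


# Toy frame standing in for the real conversations (made-up rows).
toy = pd.DataFrame({
    "text": [f"conversation {i}" for i in range(20)],
    "label": ["neutral"] * 10 + ["negative"] * 5 + ["positive"] * 5,
})
toy_train, toy_val = stratified_train_val_split(toy, target="label")
print(len(toy_train), len(toy_val))  # 16 4
```

Because the validation rows keep their original index values, the resulting frames have non-contiguous indexes, much like the `Index: 776 entries, 183 to 359` line in the output above.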


In [5]:
df_train.head(3)

Unnamed: 0,text,label
183,"Customer: Hi, I'm calling because I have an is...",negative
823,"Customer: Hi, I received an email from BrownBo...",neutral
649,Agent: Thank you for contacting BrownBox custo...,neutral


In [6]:
df_val.head(3)

Unnamed: 0,text,label
26,"Agent: Hello, thank you for calling BrownBox C...",neutral
834,Agent: Thank you for calling BrownBox Customer...,neutral
207,Agent: Thank you for calling BrownBox Customer...,neutral


In [7]:
df_test.head(3)

Unnamed: 0,text,label
0,Agent: Thank you for calling BrownBox Customer...,negative
1,Agent: Thank you for calling BrownBox Customer...,negative
2,Agent: Thank you for calling BrownBox Customer...,negative


Since the target values are strings, we need to encode the labels. To avoid ambiguous labels, we use the same standardized map across all datasets: {'neutral': 0, 'positive': 1, 'negative': 2}.

In [8]:
df_train_le = df_train.copy()
df_train_le = encode_labels(df_train_le, target)
df_train_le.head(3)

Unnamed: 0,text,label
183,"Customer: Hi, I'm calling because I have an is...",2
823,"Customer: Hi, I received an email from BrownBo...",0
649,Agent: Thank you for contacting BrownBox custo...,0


In [9]:
df_val_le = df_val.copy()
df_val_le = encode_labels(df_val_le, target)
df_val_le.head(3)

Unnamed: 0,text,label
26,"Agent: Hello, thank you for calling BrownBox C...",0
834,Agent: Thank you for calling BrownBox Customer...,0
207,Agent: Thank you for calling BrownBox Customer...,0


In [10]:
df_test_le = df_test.copy()
df_test_le = encode_labels(df_test_le, target)
df_test_le.head(3)

Unnamed: 0,text,label
0,Agent: Thank you for calling BrownBox Customer...,2
1,Agent: Thank you for calling BrownBox Customer...,2
2,Agent: Thank you for calling BrownBox Customer...,2


As you can see, neutral values are converted to zeros, and negative values are converted to twos.
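
The implementation of `encode_labels` lives in `data.utils.prepare` and is not shown here. As a rough illustration of the fixed mapping described above, a version of it might look like the sketch below. The function name `encode_labels_sketch` and the validation of unknown labels are assumptions, not the project's actual code.

```python
import pandas as pd

# Fixed mapping quoted in the text; reused for every split so ids stay consistent.
LABEL_MAP = {"neutral": 0, "positive": 1, "negative": 2}


def encode_labels_sketch(df, target):
    """Return a copy of df whose target column holds integer class ids."""
    # Fail loudly on labels outside the map instead of silently producing NaN.
    unknown = set(df[target]) - set(LABEL_MAP)
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    out = df.copy()
    out[target] = out[target].map(LABEL_MAP).astype("int64")
    return out


toy = pd.DataFrame({"text": ["a", "b", "c"],
                    "label": ["negative", "neutral", "positive"]})
encoded = encode_labels_sketch(toy, "label")
print(encoded["label"].tolist())  # [2, 0, 1]
```

Using one hard-coded dictionary, rather than fitting a separate encoder per split, guarantees that the same sentiment always maps to the same integer in the train, validation, and test sets.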

Now we can start cleaning the data. It contains many repetitive sentences, such as opening lines. As a first step, we remove them using regular expressions. Phrases like "After a few seconds" also repeat many times within a conversation, but we keep them, since they probably affect customer sentiment.

First, we lowercase the conversation text, then apply `remove_redundant_lines` to each conversation.

In [11]:
df_train_cleaned = df_train_le.copy()
df_train_cleaned[feature] = df_train_cleaned[feature].str.lower()
df_train_cleaned[feature] = df_train_cleaned[feature].apply(remove_redundant_lines)

df_val_cleaned = df_val_le.copy()
df_val_cleaned[feature] = df_val_cleaned[feature].str.lower()
df_val_cleaned[feature] = df_val_cleaned[feature].apply(remove_redundant_lines)

df_test_cleaned = df_test_le.copy()
df_test_cleaned[feature] = df_test_cleaned[feature].str.lower()
df_test_cleaned[feature] = df_test_cleaned[feature].apply(remove_redundant_lines)
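
The regular expressions inside `remove_redundant_lines` are not shown in this notebook. To illustrate the general idea of dropping stock opening lines while keeping informative phrases, here is a small sketch. The pattern, the function name `drop_greeting_lines`, the assumption that turns are separated by newlines, and the sample transcript are all hypothetical; the real helper evidently does more, such as shortening the speaker tags.

```python
import re

# Assumed pattern: an agent turn that opens with the stock BrownBox greeting.
GREETING_RE = re.compile(
    r"^agent:\s*thank you for (?:calling|contacting) brownbox\b.*$"
)


def drop_greeting_lines(conversation):
    """Remove stock agent greetings from a lowercased, newline-separated transcript."""
    kept = []
    for line in conversation.splitlines():
        line = line.strip()
        # Skip empty lines and lines that consist of the boilerplate greeting.
        if line and not GREETING_RE.match(line):
            kept.append(line)
    return "\n".join(kept)


# Made-up transcript, already lowercased as in the cell above.
sample = (
    "agent: thank you for calling brownbox customer support. how may i help you?\n"
    "customer: hi, my order has not arrived.\n"
    "agent: i am sorry to hear that. (after a few seconds) i have located your order."
)
cleaned = drop_greeting_lines(sample)
print(cleaned)
```

Note that the pattern only matches whole turns that start with the greeting, so a phrase such as "after a few seconds" inside a later turn is left untouched.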

In [12]:
df_train_cleaned.head(3)

Unnamed: 0,text,label
183,"C: hi, i'm calling because i have an issue wit...",2
823,"C: hi, i received an email from brownbox stati...",0
649,C: hi rachel. i recently ordered a water geyse...,0


In [13]:
df_val_cleaned.head(3)

Unnamed: 0,text,label
26,"W: hello, thank you for calling brownbox custo...",0
834,W: thank you for calling brownbox customer sup...,0
207,"C: hi sarah, i recently purchased an air coole...",0


In [14]:
df_test_cleaned.head(3)

Unnamed: 0,text,label
0,"C: hi, sarah. i am calling because i am intere...",2
1,"C: hi sarah, my name is john. i'm having troub...",2
2,"C: hi jane, i am calling regarding the refund ...",2


The next step is to tokenize the train and validation datasets with the GPT-2 tokenizer and wrap them in a DatasetDict. We do not tokenize the test dataset for now, so that we can assess the trained model's outputs precisely.

In [15]:
train_dataset = Dataset.from_dict(df_train_cleaned)
# The validation split is stored under the "test" key of the DatasetDict.
val_dataset = Dataset.from_dict(df_val_cleaned)
final_datasets = datasets.DatasetDict({"train": train_dataset, "test": val_dataset})

tokenized_datasets = final_datasets.map(tokenize_function, batched=True)

tokenized_datasets

Map: 100%|██████████| 776/776 [00:01<00:00, 492.88 examples/s]
Map: 100%|██████████| 194/194 [00:00<00:00, 506.08 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 776
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 194
    })
})
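
The `tokenize_function` passed to `map` is defined in `data.utils.prepare` and is not shown here. With `batched=True`, `map` calls the function with a dictionary of column lists and merges the returned columns (here `input_ids` and `attention_mask`) into the dataset. The sketch below illustrates that contract only. To stay runnable offline it uses a toy whitespace tokenizer, which is NOT GPT-2's byte-pair encoding; the names `make_tokenize_function` and `whitespace_tokenizer` and the `max_length` values are assumptions.

```python
def make_tokenize_function(tokenizer, max_length=512):
    """Build a batched map function that tokenizes the "text" column."""
    def tokenize_batch(batch):
        # With map(..., batched=True), batch["text"] is a list of strings.
        return tokenizer(batch["text"], truncation=True, max_length=max_length)
    return tokenize_batch


def whitespace_tokenizer(texts, truncation=True, max_length=512):
    """Toy stand-in for a Hugging Face tokenizer (not GPT-2 BPE)."""
    vocab = {}
    input_ids = []
    for text in texts:
        # Assign each new whitespace-separated token the next free integer id.
        ids = [vocab.setdefault(tok, len(vocab)) for tok in text.split()]
        input_ids.append(ids[:max_length] if truncation else ids)
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


tokenize_batch = make_tokenize_function(whitespace_tokenizer, max_length=4)
result = tokenize_batch({"text": ["c: hi i have an issue", "w: hello"],
                         "label": [2, 0]})
print(result)
```

In a real run, one would presumably pass `AutoTokenizer.from_pretrained("gpt2")` from `transformers` in place of the toy tokenizer. If padding is requested, note that the GPT-2 tokenizer has no padding token by default, so a common workaround is to set `tokenizer.pad_token = tokenizer.eos_token` first.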

In [16]:
save_datasets(tokenized_datasets, "train-val")
save_dataset(df_test_cleaned, "test")

Saving the dataset (1/1 shards): 100%|██████████| 776/776 [00:00<00:00, 82470.48 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 194/194 [00:00<00:00, 43061.76 examples/s]
