# IMDA Dataset processing
Updated Pre-Processing Steps for IMDA Dataset TextGrid Dialog Scripts

1.	Maintain Original Time Coordinate Separation:  
	•	Objective: For easy processing and feature engineering, and the dataset is large enough to filter out adequate instatnces 
	•	Implementation: keep the segmentation based on the original x_min and x_max time coordinates as provided in the TextGrid files.  
2.	Filter Out Short Text Segments:
	•	Objective: Focus on meaningful text segments that are likely to carry sentiment.
	•	Implementation: Filter out text segments that contain fewer than 10 words before performing sentiment analysis.
	•	Reasoning: Short text segments (e.g., “yes”, “no”, “okay”) may not provide enough context for accurate labeling.
3.	Add a Qualification Label for Sentiment Analysis:
	•	Objective: Identify which text segments are eligible for sentiment analysis based on word count.
	•	Implementation: Add a new column, `qualified_label_sentiment`, to the DataFrame: Set to True if the segment contains at least 10 words.

Example Implementation:

 • Input Data: Text segments from IMDA dataset TextGrid dialog scripts.  
 • Process:  
	1.	Parse and load the TextGrid files, maintaining original time coordinates (x_min, x_max).  
	2.	For each text segment, count the number of words.  
	3.	Filter out segments with fewer than 10 words.  
	4.	Add a qualified_for_sentiment column to indicate whether each segment meets the criteria for sentiment analysis.  
 • Outcome: A refined dataset with preserved time coordinates and only the most relevant text segments flagged for sentiment analysis.  

This process update ensures that the sentiment analysis is performed on meaningful segments while respecting the original structure of the dialog scripts, leading to more accurate and context-aware results.

In [None]:
# setup - install library
# !pip install textgrid
# !pip install praatio

In [2]:
import os
import re
import pandas as pd
from praatio import textgrid
from utils_v2 import process_textgrid_file_to_sentences_v2

## Document

### Data processing output 
The combined call center dialogue data has been processed and displayed in a tabular format. Here is a brief overview of the data:

 • unique_id: The unique identifier for the session, derived from the filename. e.g. 0683  
 • speaker_id: The identifier for the speaker (e.g. 0013, 4366).  
 • speaker_type: The type of speaker (agent or client).  
 • dialog_type: The type of conversation topic.  
 • x_min: The start time of the spoken segment.  
 • x_max: The end time of the spoken segment.  
 • text: The transcribed text of the spoken segment.  
 • cleaned_text_sentiment: processed text of the spoken segment.  
 • qualified_label_sentiment:
 

The data is sorted by the start time (x_min) to maintain the chronological order of the conversation.

### Data understanding (GPT Gen)
In TextGrid files, especially in the context of dialogue transcription,   
special symbols like \<B\>, \<Z\>, and \<S\> often represent specific events or markers within the conversation. 
Here are some common interpretations:

	•	<B>: This might indicate a backchannel, which are listener responses (like “uh-huh”, “right”, etc.) to show that they are following along but not taking the floor.
	•	<Z>: This could represent a pause or silence in the conversation, possibly of significant duration.
	•	<S>: This is often used to denote a short pause or a speaker’s hesitation, such as when they are thinking or momentarily pausing in their speech.

These markers help in analyzing the structure and flow of conversations, providing insight into pauses, interruptions, and listener engagement.

## Start with test on one textgrid file

In [3]:
file_path_test = "../data/input/TextGrid_Scripts/app_0683_0013_phnd_cc-bnk.TextGrid"
from utils_v1 import decode_filename
master_data_test = decode_filename(file_path_test)
res_683 = process_textgrid_file_to_sentences_v2(file_path_test, master_data_test)

In [4]:
file_path_test = "../data/input/TextGrid_Scripts/app_0683_4366_phnd_cc-bnk.TextGrid"
from utils_v1 import decode_filename
master_data_test = decode_filename(file_path_test)
res_683 += process_textgrid_file_to_sentences_v2(file_path_test, master_data_test)
pd.DataFrame(res_683)

Unnamed: 0,file_name,session_id,speaker_id,speaker_type,dialog_type,x_min,x_max,text
0,app_0683_0013_phnd_cc-bnk.TextGrid,0683,0013,agent,bank,2.48,5.35000,hi this is A B C bank how can I help you
1,app_0683_0013_phnd_cc-bnk.TextGrid,0683,0013,agent,bank,13.12,14.74000,okay (mmhmm)
2,app_0683_0013_phnd_cc-bnk.TextGrid,0683,0013,agent,bank,17.15,23.55000,ya our bank (uh) do give out ya our bank does ...
3,app_0683_0013_phnd_cc-bnk.TextGrid,0683,0013,agent,bank,23.55,66.39000,(mmhmm) (mmhmm) (mmhmm) <B>
4,app_0683_0013_phnd_cc-bnk.TextGrid,0683,0013,agent,bank,66.39,81.65000,okay (ppb) so (uh) we do have two rates like (...
...,...,...,...,...,...,...,...,...
106,app_0683_4366_phnd_cc-bnk.TextGrid,0683,4366,client,bank,581.65,584.54000,can can okay noted noted
107,app_0683_4366_phnd_cc-bnk.TextGrid,0683,4366,client,bank,584.54,591.44000,(mm) (uh) either it'll be all on my side proba...
108,app_0683_4366_phnd_cc-bnk.TextGrid,0683,4366,client,bank,603.57,611.78000,okay if it's weekend then (uh) anytime with yo...
109,app_0683_4366_phnd_cc-bnk.TextGrid,0683,4366,client,bank,611.78,615.54000,(ppb) (uh) five plus six P_M will be better


## batch processing on the whole folder - Session Call Center Design 2

In [5]:
import os

def process_all_textgrid_files_in_directory(directory_path):
    """
    wrapper function for iteration processing
    """
    all_data = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".TextGrid"):
            try:
                file_path = os.path.join(directory_path, filename)
                file_master_data = decode_filename(file_path)
                temp_file_sentences = process_textgrid_file_to_sentences_v2(file_path, file_master_data)
            except Exception as e:
                print("error occurred at file ", file_path)
                pass            
            all_data += temp_file_sentences
                
    # Combine all dataframes into one
    if all_data:
        combined_df = pd.DataFrame(all_data).sort_values(by=['session_id', 'dialog_type', 'x_min']).reset_index(drop=True)
        return combined_df
    else:
        return pd.DataFrame() 

directory_path = "../data/input/TextGrid_Scripts/"
res_df = process_all_textgrid_files_in_directory(directory_path)

error occurred at file  ../data/input/TextGrid_Scripts/app_0897_4794_phnd_cc-ins.TextGrid
error occurred at file  ../data/input/TextGrid_Scripts/app_1295_5590_phnd_cc-ins.TextGrid
error occurred at file  ../data/input/TextGrid_Scripts/app_1066_5132_phnd_cc-ins.TextGrid
error occurred at file  ../data/input/TextGrid_Scripts/app_0729_4458_phnd_cc-ins.TextGrid
error occurred at file  ../data/input/TextGrid_Scripts/app_1126_0036_phnd_cc-tel.TextGrid
error occurred at file  ../data/input/TextGrid_Scripts/app_0893_0030_phnd_cc-ins.TextGrid
error occurred at file  ../data/input/TextGrid_Scripts/app_1280_5560_phnd_cc-ins.TextGrid
error occurred at file  ../data/input/TextGrid_Scripts/app_0897_0018_phnd_cc-ins.TextGrid


In [9]:
res_df.to_csv('../data/processed/sentence_level_script_data_raw_V2.csv',index=False)

In [6]:
res_df.head(10)

Unnamed: 0,file_name,session_id,speaker_id,speaker_type,dialog_type,x_min,x_max,text
0,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,0.0,2.34,(ppo)
1,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,2.34,4.0,okay
2,app_0683_0013_phnd_cc-bnk.TextGrid,683,13,agent,bank,2.48,5.35,hi this is A B C bank how can I help you
3,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,5.35,15.42,hi my name is john (uh) I'm calling in with in...
4,app_0683_0013_phnd_cc-bnk.TextGrid,683,13,agent,bank,13.12,14.74,okay (mmhmm)
5,app_0683_0013_phnd_cc-bnk.TextGrid,683,13,agent,bank,17.15,23.55,ya our bank (uh) do give out ya our bank does ...
6,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,17.2,20.05,sorry does does your bank give out home loans
7,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,20.05,22.12,okay okay sure okay (ppb)
8,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,23.28,27.41,[oh] ya probably I can give you some informati...
9,app_0683_0013_phnd_cc-bnk.TextGrid,683,13,agent,bank,23.55,66.39,(mmhmm) (mmhmm) (mmhmm) <B>


In [8]:
res_df.shape

(185185, 8)

In [10]:
res_df.groupby(by=['file_name'])['text'].count()

file_name
app_0683_0013_phnd_cc-bnk.TextGrid    50
app_0683_0013_phnd_cc-ins.TextGrid    54
app_0683_0013_phnd_cc-tel.TextGrid    56
app_0683_4366_phnd_cc-bnk.TextGrid    61
app_0683_4366_phnd_cc-ins.TextGrid    88
                                      ..
app_1355_0018_phnd_cc-ins.TextGrid    40
app_1355_0018_phnd_cc-tel.TextGrid    35
app_1355_5710_phnd_cc-bnk.TextGrid    55
app_1355_5710_phnd_cc-ins.TextGrid    56
app_1355_5710_phnd_cc-tel.TextGrid    52
Name: text, Length: 3958, dtype: int64

In [13]:
res_df.file_name.nunique()//2

1979

## Filtering eligible sentences for sentiment labelling 

In [15]:
# data_raw_df = pd.read_csv('../data/processed/sentence_level_script_data_raw_V2.csv')
# remove unnamed column
# data_raw_df.drop(columns=['Unnamed: 0'], inplace=True)
# remove none session id
data_raw_df = res_df
data_raw_df = data_raw_df[~data_raw_df['session_id'].isna()]
# cast session_id into integer
data_raw_df['session_id'] = data_raw_df['session_id'].astype(int)
data_raw_df['speaker_id'] = data_raw_df['speaker_id'].astype(int)
data_raw_df.head()

Unnamed: 0,file_name,session_id,speaker_id,speaker_type,dialog_type,x_min,x_max,text
0,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,0.0,2.34,(ppo)
1,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,2.34,4.0,okay
2,app_0683_0013_phnd_cc-bnk.TextGrid,683,13,agent,bank,2.48,5.35,hi this is A B C bank how can I help you
3,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,5.35,15.42,hi my name is john (uh) I'm calling in with in...
4,app_0683_0013_phnd_cc-bnk.TextGrid,683,13,agent,bank,13.12,14.74,okay (mmhmm)


In [18]:
import re
def clean_text_for_word_count(text):
    # Remove any string in brackets like (<xxx>) or [<xxx>] or <<xxx>>
    text = re.sub(r'\(.*?\)', '', text)
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'\<.*?\>', '', text)
    # Remove specific markers <B>, <S>, <Z>
    # text = re.sub(r'<B>|<S>|<Z>', '', text)
    # Remove extra whitespace
    cleaned_text = re.sub(r'\s+', ' ', text).strip()
    
    return cleaned_text

def clean_text_for_sentiment(text):
    # Remove specific markers <B>, <S>, <Z>
    text = re.sub(r'<B>|<S>|<Z>', '', text)
    # Remove extra whitespace
    cleaned_text = re.sub(r'\s+', ' ', text).strip()
    return cleaned_text

def count_unique_words(text):
    # Split the text into words and count unique words
    words = text.split()
    unique_words = set(words)
    return len(unique_words)

def add_qualification_label(df):
    # Clean the text and count words
    df['cleaned_text_for_sentiment'] = df['text'].apply(clean_text_for_sentiment)
    df['word_count'] = df['text'].apply(clean_text_for_word_count).apply(count_unique_words)
    df = df.assign(duration=lambda x: x.x_max - x.x_min)
    # Filter rows where the word count is less than 10 and duration less than 15 seconds
    # filtered_df = df[df['word_count'] >= 6][df['duration'] <= 15].reset_index(drop=True)
    # Add a qualification label: True if the segment contains at least 10 words and duration less than 15 seconds
    df['qualified_for_sentiment'] = (df['word_count'] >= 7) & (df['duration'] <= 15)
    return df

def filter_by_qualified(df):
    filtered_df = df[df['qualified_for_sentiment'] == True].reset_index(drop=True)
    # filtered_df = filtered_df.drop(columns=['text','word_count','duration'])
    return filtered_df

data_w_stats_df = add_qualification_label(data_raw_df)
data_w_stats_df.to_csv('../data/processed/sentence_level_script_data_raw_V2.csv',index=False)
print("data_w_stats_df number of records: ", data_w_stats_df.shape[0])
qualified_data_df = filter_by_qualified(data_w_stats_df)
qualified_data_df.to_csv('../data/processed/sentence_level_script_data_filtered_V2.csv',index=False)
print("number of qualified records: ", qualified_data_df.shape[0])

data_w_stats_df number of records:  185185
number of qualified records:  86747


In [19]:
qualified_data_df.head(10)

Unnamed: 0,file_name,session_id,speaker_id,speaker_type,dialog_type,x_min,x_max,text,cleaned_text_for_sentiment,word_count,duration,qualified_for_sentiment
0,app_0683_0013_phnd_cc-bnk.TextGrid,683,13,agent,bank,2.48,5.35,hi this is A B C bank how can I help you,hi this is A B C bank how can I help you,12,2.87,True
1,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,5.35,15.42,hi my name is john (uh) I'm calling in with in...,hi my name is john (uh) I'm calling in with in...,20,10.07,True
2,app_0683_0013_phnd_cc-bnk.TextGrid,683,13,agent,bank,17.15,23.55,ya our bank (uh) do give out ya our bank does ...,ya our bank (uh) do give out ya our bank does ...,9,6.4,True
3,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,17.2,20.05,sorry does does your bank give out home loans,sorry does does your bank give out home loans,8,2.85,True
4,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,23.28,27.41,[oh] ya probably I can give you some informati...,[oh] ya probably I can give you some informati...,11,4.13,True
5,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,27.41,31.26,first (ppb) okay I'm actually having a H_D_B r...,first (ppb) okay I'm actually having a H_D_B r...,9,3.85,True
6,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,31.26,38.52,(ppb) (err) (uh) four room flat that I bought ...,(ppb) (err) (uh) four room flat that I bought ...,11,7.26,True
7,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,38.52,49.04,so I'm actually looking (uh) I I having a (uh)...,so I'm actually looking (uh) I I having a (uh)...,19,10.52,True
8,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,49.04,59.2,(ppb) to the bank and hoping to inquire more o...,(ppb) to the bank and hoping to inquire more o...,18,10.16,True
9,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,59.2,64.36,thirdly is (uh) how many years of a loan that ...,thirdly is (uh) how many years of a loan that ...,14,5.16,True
