# IMDA Dataset processing
Updated Pre-Processing Steps for IMDA Dataset TextGrid Dialog Scripts

1.	Maintain Original Time Coordinate Separation:  
	•	Objective: Keep processing and feature engineering simple; the dataset is large enough that filtering still leaves an adequate number of instances.
	•	Implementation: Keep the segmentation based on the original x_min and x_max time coordinates as provided in the TextGrid files.  
2.	Filter Out Short Text Segments:
	•	Objective: Focus on meaningful text segments that are likely to carry sentiment.
	•	Implementation: Filter out text segments that contain fewer than 7 unique words (counted after annotation markers are removed) or that last longer than 15 seconds, matching the thresholds used in the code below.
	•	Reasoning: Short text segments (e.g., “yes”, “no”, “okay”) may not provide enough context for accurate labeling.
3.	Add a Qualification Label for Sentiment Analysis:
	•	Objective: Identify which text segments are eligible for sentiment analysis based on word count and duration.
	•	Implementation: Add a new column, `qualified_for_sentiment`, to the DataFrame: set to True if the segment contains at least 7 unique words and lasts no longer than 15 seconds, otherwise False.

Example Implementation:

 • Input Data: Text segments from IMDA dataset TextGrid dialog scripts.  
 • Process:  
	1.	Parse and load the TextGrid files, maintaining original time coordinates (x_min, x_max).  
	2.	For each text segment, count the number of unique words after removing annotation markers.  
	3.	Add a `qualified_for_sentiment` column to indicate whether each segment meets the criteria for sentiment analysis.  
	4.	Save the full labeled table and a filtered copy that keeps only the qualified segments.  
 • Outcome: A refined dataset with preserved time coordinates and only the most relevant text segments flagged for sentiment analysis.  

This update restricts sentiment analysis to segments with enough content to label, while keeping the original segmentation of the dialog scripts.
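The steps above can be sketched on a few hand-written records. This is a minimal pure-Python illustration, not the notebook's pandas implementation: the record layout, the function name `flag_segments`, and the threshold `min_words=3` (chosen only to keep the toy data short) are assumptions.

```python
# Toy records standing in for parsed TextGrid intervals (hypothetical data).
segments = [
    {"x_min": 0.0, "x_max": 1.2, "text": "okay"},
    {"x_min": 1.2, "x_max": 4.8, "text": "I would like to book a trip"},
    {"x_min": 4.8, "x_max": 5.5, "text": "yes yes"},
]


def flag_segments(records, min_words):
    """Return copies of the records with a word count and a qualification flag."""
    flagged = []
    for rec in records:
        word_count = len(rec["text"].split())
        flagged.append({
            **rec,  # x_min / x_max are carried over unchanged
            "word_count": word_count,
            "qualified_for_sentiment": word_count >= min_words,
        })
    return flagged


flagged = flag_segments(segments, min_words=3)
qualified = [rec for rec in flagged if rec["qualified_for_sentiment"]]
print(len(flagged), len(qualified))
```

Every record keeps its original time coordinates; only the flag decides whether it is passed on to sentiment analysis.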

In [None]:
# setup - install library
# !pip install textgrid
# !pip install praatio

In [1]:
import os
import re
import pandas as pd
from praatio import textgrid
from utils_v2 import process_textgrid_file_to_sentences_v2

## Document

### Data processing output 
The combined call center dialogue data is processed into a single table. The columns are:

 • file_name: The name of the source TextGrid file.  
 • session_id: The identifier for the session, derived from the filename (e.g. 0683).  
 • speaker_id: The identifier for the speaker (e.g. 0013, 4366).  
 • speaker_type: The type of speaker (agent or client).  
 • dialog_type: The type of conversation topic.  
 • x_min: The start time of the spoken segment, in seconds.  
 • x_max: The end time of the spoken segment, in seconds.  
 • text: The transcribed text of the spoken segment.  
 • cleaned_text_for_sentiment: The text of the spoken segment with the \<B\>, \<S\> and \<Z\> markers removed.  
 • word_count: The number of unique words in the segment after annotations are stripped.  
 • duration: The length of the segment (x_max - x_min), in seconds.  
 • qualified_for_sentiment: True if the segment meets the word-count and duration criteria for sentiment labelling, otherwise False.  

The data is sorted by session_id, dialog_type, and start time (x_min) to maintain the chronological order of each conversation.

### Data understanding (GPT Gen)
In TextGrid files, especially in the context of dialogue transcription,   
special symbols like \<B\>, \<Z\>, and \<S\> often represent specific events or markers within the conversation. 
Here are some common interpretations:

	•	\<B\>: This might indicate a backchannel, i.e. a listener response (like “uh-huh”, “right”, etc.) showing that the listener is following along without taking the floor.
	•	\<Z\>: This could represent a pause or silence in the conversation, possibly of significant duration.
	•	\<S\>: This is often used to denote a short pause or a speaker’s hesitation, such as when they are thinking or momentarily pausing in their speech.

These markers help in analyzing the structure and flow of conversations, providing insight into pauses, interruptions, and listener engagement.
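Because these interpretations are uncertain, it may help to count how often each marker actually occurs before deciding how to treat it. A minimal sketch, assuming markers are written as upper-case letters in angle brackets; the function name `count_markers` and the sample strings are made up for illustration.

```python
import re
from collections import Counter

# Matches tokens such as <B>, <S>, <Z>; lower-case tags like <mandarin> are ignored.
MARKER_PATTERN = re.compile(r"<[A-Z]+>")


def count_markers(texts):
    """Count every upper-case angle-bracket marker across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(MARKER_PATTERN.findall(text))
    return counts


sample_texts = [
    "<S> okay <B> so I was thinking <S>",
    "<Z>",
    "okay <mandarin>lai lai</mandarin> okay",
]
print(count_markers(sample_texts))
```

On the real data the input would be the `text` column of the combined table.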

## Start with test on one textgrid file

## batch processing on the whole folder - Session Call Center Design 2

In [22]:
def decode_filename_new(file_path):
    """
    Decode master data (session index, speaker index, dialog type, etc.)
    from a file name such as app_1360_0018_phnd_cc-moe.TextGrid.
    """
    file_name = os.path.basename(file_path)
    session_id = file_name[4:8]     # four digits after 'app_'
    speaker_id = file_name[9:13]    # four digits after the session index
    # Speaker ids starting with '00' are treated as agents, all others as clients
    speaker_type = 'agent' if speaker_id.startswith('00') else 'client'

    print(file_name)
    # Dialog type is the last three characters before the extension
    dialog_type = os.path.splitext(file_name)[0][-3:]

    return {
        'file_name': file_name,
        'session_id': session_id,
        'speaker_id': speaker_id,
        'speaker_type': speaker_type,
        'dialog_type': dialog_type,
        }
decode_filename_new("../data/input/Scripts_3/app_1360_0018_phnd_cc-moe.TextGrid")

app_1360_0018_phnd_cc-moe.TextGrid


{'file_name': 'app_1360_0018_phnd_cc-moe.TextGrid',
 'session_id': '1360',
 'speaker_id': '0018',
 'speaker_type': 'agent',
 'dialog_type': 'moe'}
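
The fixed character offsets used above depend on every file name having exactly the same layout. As an alternative sketch, a regular expression can validate the layout while extracting the same fields. The pattern is inferred from the file names shown in this notebook and may not cover every file in the corpus; `parse_textgrid_name` is a hypothetical helper, not part of `utils_v2`.

```python
import os
import re

# Layout inferred from names such as app_1360_0018_phnd_cc-moe.TextGrid (assumption).
NAME_PATTERN = re.compile(
    r"^app_(?P<session_id>\d{4})_(?P<speaker_id>\d{4})_phnd_cc-(?P<dialog_type>[a-z]+)\.TextGrid$"
)


def parse_textgrid_name(file_path):
    """Return the decoded fields, or None if the name does not match the layout."""
    file_name = os.path.basename(file_path)
    match = NAME_PATTERN.match(file_name)
    if match is None:
        return None
    fields = match.groupdict()
    # Same agent/client rule as decode_filename_new above.
    fields["speaker_type"] = "agent" if fields["speaker_id"].startswith("00") else "client"
    fields["file_name"] = file_name
    return fields


print(parse_textgrid_name("../data/input/Scripts_3/app_1360_0018_phnd_cc-moe.TextGrid"))
print(parse_textgrid_name("notes.txt"))
```

Returning `None` for unexpected names makes malformed files visible instead of silently producing wrong ids.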

In [25]:
import os

def process_all_textgrid_files_in_directory(directory_path):
    """
    Process every .TextGrid file in a directory and combine the results.
    """
    all_data = []
    for filename in os.listdir(directory_path):
        if not filename.endswith(".TextGrid"):
            continue
        file_path = os.path.join(directory_path, filename)
        try:
            file_master_data = decode_filename_new(file_path)
            temp_file_sentences = process_textgrid_file_to_sentences_v2(file_path, file_master_data)
        except Exception as e:
            # Skip the failing file so that stale or undefined results are never appended
            print("error occurred at file ", file_path, ":", e)
            continue
        all_data += temp_file_sentences

    # Build one DataFrame from all collected records
    if all_data:
        combined_df = pd.DataFrame(all_data).sort_values(by=['session_id', 'dialog_type', 'x_min']).reset_index(drop=True)
        return combined_df
    else:
        return pd.DataFrame()

# directory_path = "../data/input/Scripts_1/"
# res_df = process_all_textgrid_files_in_directory(directory_path)

In [4]:
res_df.to_csv('../data/processed/sentence_level_script_data_raw_V2_design1.csv',index=False)

In [5]:
res_df.head(10)

Unnamed: 0,file_name,session_id,speaker_id,speaker_type,dialog_type,x_min,x_max,text
0,app_0001_0010_phnd_cc-hol.TextGrid,1,10,agent,holiday,0.0,7.06,okay <mandarin>来来:lai lai</mandarin> okay <man...
1,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,0.0,3.13,up okay
2,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,3.13,7.97812,so (ppo) okay so hello I'm calling to (err)
3,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,7.97812,10.4337,make some enquiry about (uh) travel
4,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,10.4337,12.49,trip to (uh) korea
5,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,12.49,15.07154,(uh) #busan# I have (err)
6,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,15.07154,18.58958,four adult two children and
7,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,18.58958,23.42,one of the adult is wheelchair bound which is ...
8,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,23.42,25.79,(uh) my father and my mother and
9,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,25.79,32.49143,my spouse and myself and my two children and (...


In [6]:
res_df.shape

(259776, 8)

In [7]:
res_df.groupby(by=['file_name'])['text'].count()

file_name
app_0001_0010_phnd_cc-hol.TextGrid    28
app_0001_0010_phnd_cc-hot.TextGrid    35
app_0001_0010_phnd_cc-res.TextGrid    42
app_0001_3002_phnd_cc-hol.TextGrid    58
app_0001_3002_phnd_cc-hot.TextGrid    55
                                      ..
app_0682_0018_phnd_cc-hot.TextGrid    45
app_0682_0018_phnd_cc-res.TextGrid    47
app_0682_4364_phnd_cc-hol.TextGrid    70
app_0682_4364_phnd_cc-hot.TextGrid    82
app_0682_4364_phnd_cc-res.TextGrid    70
Name: text, Length: 3975, dtype: int64

In [8]:
res_df.file_name.nunique()//2

1987
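
Integer division by two assumes that every dialog contributes exactly two files, one per speaker. Since 3975 is odd, at least one dialog does not have exactly two files. A sketch that counts dialogs directly and lists the incomplete ones, assuming a dialog is identified by the session index and the dialog-type suffix of the file name (the helper name and the sample file list are illustrative):

```python
from collections import defaultdict


def group_files_by_dialog(file_names):
    """Map (session_id, dialog_type) to the list of files belonging to that dialog."""
    dialogs = defaultdict(list)
    for name in file_names:
        stem = name.rsplit(".", 1)[0]          # drop the .TextGrid extension
        session_id = stem.split("_")[1]        # e.g. '0001'
        dialog_type = stem.rsplit("-", 1)[-1]  # e.g. 'hol'
        dialogs[(session_id, dialog_type)].append(name)
    return dict(dialogs)


sample_names = [
    "app_0001_0010_phnd_cc-hol.TextGrid",
    "app_0001_3002_phnd_cc-hol.TextGrid",
    "app_0001_0010_phnd_cc-hot.TextGrid",  # counterpart deliberately missing
]
dialogs = group_files_by_dialog(sample_names)
unpaired = {key: files for key, files in dialogs.items() if len(files) != 2}
print(len(dialogs), sorted(unpaired))
```

On the notebook's data the input would be `res_df['file_name'].unique()`.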

## Filtering eligible sentences for sentiment labelling 

In [11]:
# drop rows without a session id; copy so that the assignments below
# do not trigger pandas' SettingWithCopyWarning
data_raw_df = res_df
data_raw_df = data_raw_df[~data_raw_df['session_id'].isna()].copy()
# cast session_id and speaker_id to integer
data_raw_df['session_id'] = data_raw_df['session_id'].astype(int)
data_raw_df['speaker_id'] = data_raw_df['speaker_id'].astype(int)
data_raw_df.head()

Unnamed: 0,file_name,session_id,speaker_id,speaker_type,dialog_type,x_min,x_max,text
0,app_0001_0010_phnd_cc-hol.TextGrid,1,10,agent,holiday,0.0,7.06,okay <mandarin>来来:lai lai</mandarin> okay <man...
1,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,0.0,3.13,up okay
2,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,3.13,7.97812,so (ppo) okay so hello I'm calling to (err)
3,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,7.97812,10.4337,make some enquiry about (uh) travel
4,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,10.4337,12.49,trip to (uh) korea


In [9]:
import re

def clean_text_for_word_count(text):
    # Remove bracketed annotations: (xxx), [xxx] and <xxx>.
    # For paired tags such as <tag>...</tag> only the tags are removed; the text between them is kept.
    text = re.sub(r'\(.*?\)', '', text)  # Removes content within parentheses
    text = re.sub(r'\[.*?\]', '', text)  # Removes content within square brackets
    text = re.sub(r'\<.*?\>', '', text)  # Removes content within angle brackets

    # Remove any character that is not an English alphabet or space
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove extra whitespace
    cleaned_text = ' '.join(cleaned_text.split())

    return cleaned_text

# Example usage
example_text = "This is an example (with noise) [and more noise] <<additional noise>> 123!"
cleaned_text = clean_text_for_word_count(example_text)
print(f"Cleaned Text: '{cleaned_text}'")

Cleaned Text: 'This is an example'
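
Two properties of this cleaning step are easy to miss: the angle-bracket rule removes only the tags themselves, so text between paired tags is still counted, and apostrophes are dropped, so a contraction counts as one word. The sketch below restates the same three substitutions and applies them to strings in the style of the transcripts shown earlier; it is an illustration, not a replacement for the cell above.

```python
import re


def clean_text_for_word_count(text):
    # Same rules as the cell above: strip (...), [...] and <...>, then non-letters.
    text = re.sub(r"\(.*?\)", "", text)
    text = re.sub(r"\[.*?\]", "", text)
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return " ".join(text.split())


tagged = clean_text_for_word_count("okay <mandarin>来来:lai lai</mandarin> okay")
filled = clean_text_for_word_count("so (ppo) okay so hello I'm calling to (err)")
print(tagged)
print(filled)
# Total words versus unique words: repeated words count once in the unique count.
print(len(filled.split()), len(set(filled.split())))
```

The second string has seven words but only six unique ones, which matters when the threshold is applied to unique words.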


In [12]:
def clean_text_for_sentiment(text):
    # Remove specific markers <B>, <S>, <Z>
    text = re.sub(r'<B>|<S>|<Z>', '', text)
    # Remove extra whitespace
    cleaned_text = re.sub(r'\s+', ' ', text).strip()
    return cleaned_text

def count_unique_words(text):
    # Split the text into words and count unique words
    words = text.split()
    unique_words = set(words)
    return len(unique_words)

def add_qualification_label(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Clean the text for sentiment analysis and count unique words on the stripped text
    df['cleaned_text_for_sentiment'] = df['text'].apply(clean_text_for_sentiment)
    df['word_count'] = df['text'].apply(clean_text_for_word_count).apply(count_unique_words)
    # Segment duration in seconds
    df = df.assign(duration=lambda x: x.x_max - x.x_min)
    # Qualification label: True if the segment has at least 7 unique words
    # and lasts no longer than 15 seconds
    df['qualified_for_sentiment'] = (df['word_count'] >= 7) & (df['duration'] <= 15)
    return df

def filter_by_qualified(df):
    # Keep only the rows flagged as qualified
    filtered_df = df[df['qualified_for_sentiment']].reset_index(drop=True)
    # filtered_df = filtered_df.drop(columns=['text','word_count','duration'])
    return filtered_df

data_w_stats_df = add_qualification_label(data_raw_df)
data_w_stats_df.to_csv('../data/processed/sentence_level_script_data_raw_V2_design1.csv',index=False)
print("data_w_stats_df number of records: ", data_w_stats_df.shape[0])
qualified_data_df = filter_by_qualified(data_w_stats_df)
qualified_data_df.to_csv('../data/processed/sentence_level_script_data_filtered_V2_design1.csv',index=False)
print("number of qualified records: ", qualified_data_df.shape[0])

data_w_stats_df number of records:  259776
number of qualified records:  136831


In [13]:
qualified_data_df.head(10)

Unnamed: 0,file_name,session_id,speaker_id,speaker_type,dialog_type,x_min,x_max,text,cleaned_text_for_sentiment,word_count,duration,qualified_for_sentiment
0,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,18.58958,23.42,one of the adult is wheelchair bound which is ...,one of the adult is wheelchair bound which is ...,13,4.83042,True
1,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,25.79,32.49143,my spouse and myself and my two children and (...,my spouse and myself and my two children and (...,11,6.70143,True
2,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,32.49143,41.16,first for the air airplane right I would like ...,first for the air airplane right I would like ...,16,8.66857,True
3,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,41.16,46.47164,to the toilet so it's more convenient for my (...,to the toilet so it's more convenient for my (...,9,5.31164,True
4,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,46.47164,57.26,parent and also my kids because you know the t...,parent and also my kids because you know the t...,16,10.78836,True
5,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,57.26,66.85,so you know elderly they may may not be able t...,so you know elderly they may may not be able t...,21,9.59,True
6,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,73.86,81.31947,handicap friendly do they have (uh) facility l...,handicap friendly do they have (uh) facility l...,9,7.45947,True
7,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,81.31947,91.85,and you know the restaurant is it also (uh) ok...,and you know the restaurant is it also (uh) ok...,17,10.53053,True
8,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,holiday,99.3736,108.56,we also want to have some (uh) places for them...,we also want to have some (uh) places for them...,17,9.1864,True
9,app_0001_0010_phnd_cc-hol.TextGrid,1,10,agent,holiday,112.8,125.68,okay alright so (err) miss lily [ah] thank you...,okay alright so (err) miss lily [ah] thank you...,34,12.88,True


## Call center design 3 

In [26]:
directory_path = "../data/input/Scripts_3/"
res_df = process_all_textgrid_files_in_directory(directory_path)

app_1440_0014_phnd_cc-moe.TextGrid
app_1747_0038_phnd_cc-moe.TextGrid
app_1576_6152_phnd_cc-moe.TextGrid
app_1450_5900_phnd_cc-msf.TextGrid
app_1592_6184_phnd_cc-hdb.TextGrid
app_2019_6938_phnd_cc-hdb.TextGrid
app_1810_0042_phnd_cc-msf.TextGrid
app_1551_0041_phnd_cc-moe.TextGrid
app_1659_6318_phnd_cc-moe.TextGrid
app_1811_0041_phnd_cc-msf.TextGrid
app_1728_0041_phnd_cc-hdb.TextGrid
app_1749_0014_phnd_cc-moe.TextGrid
app_1431_5862_phnd_cc-moe.TextGrid
app_1601_6202_phnd_cc-moe.TextGrid
app_1919_0001_phnd_cc-moe.TextGrid
app_1846_0043_phnd_cc-hdb.TextGrid
app_1397_0018_phnd_cc-msf.TextGrid
app_1787_6574_phnd_cc-msf.TextGrid
app_1797_6594_phnd_cc-moe.TextGrid
app_1862_6724_phnd_cc-msf.TextGrid
app_1501_0003_phnd_cc-msf.TextGrid
app_2055_7010_phnd_cc-hdb.TextGrid
app_1811_6622_phnd_cc-moe.TextGrid
app_1738_0041_phnd_cc-msf.TextGrid
app_1860_0014_phnd_cc-moe.TextGrid
app_1627_0041_phnd_cc-msf.TextGrid
app_2028_6956_phnd_cc-msf.TextGrid
app_1387_0018_phnd_cc-hdb.TextGrid
app_1870_6740_phnd_c

In [27]:
res_df.to_csv('../data/processed/sentence_level_script_data_raw_V2_design3.csv',index=False)

In [28]:
# drop rows without a session id; copy so that the assignments below
# do not trigger pandas' SettingWithCopyWarning
data_raw_df = res_df
data_raw_df = data_raw_df[~data_raw_df['session_id'].isna()].copy()
# cast session_id and speaker_id to integer
data_raw_df['session_id'] = data_raw_df['session_id'].astype(int)
data_raw_df['speaker_id'] = data_raw_df['speaker_id'].astype(int)

data_w_stats_df = add_qualification_label(data_raw_df)
data_w_stats_df.to_csv('../data/processed/sentence_level_script_data_raw_V2_design3.csv',index=False)
print("data_w_stats_df number of records: ", data_w_stats_df.shape[0])
qualified_data_df = filter_by_qualified(data_w_stats_df)
qualified_data_df.to_csv('../data/processed/sentence_level_script_data_filtered_V2_design3.csv',index=False)
print("number of qualified records: ", qualified_data_df.shape[0])

data_w_stats_df number of records:  176292
number of qualified records:  77807
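
The printed counts translate into the share of segments kept by the filter. A small worked calculation using the record counts printed in this notebook; the dictionary keys follow the suffixes of the output CSV files.

```python
# Record counts copied from the cell outputs in this notebook.
counts = {
    "design1": {"total": 259776, "qualified": 136831},
    "design3": {"total": 176292, "qualified": 77807},
}

# Share of segments that pass the qualification filter, per design.
shares = {name: c["qualified"] / c["total"] for name, c in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of segments qualify")
```

Roughly half of the design 1 segments and somewhat fewer of the design 3 segments pass the filter.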


In [29]:
data_w_stats_df

Unnamed: 0,file_name,session_id,speaker_id,speaker_type,dialog_type,x_min,x_max,text,cleaned_text_for_sentiment,word_count,duration,qualified_for_sentiment
0,app_0302_3604_phnd_cc-hol.TextGrid,302,3604,client,hol,0.00000,3.09350,call three holiday,call three holiday,3,3.09350,False
1,app_0302_0018_phnd_cc-hol.TextGrid,302,18,agent,hol,3.12927,8.50931,hi good afternoon this is lily from A B C trav...,hi good afternoon this is lily from A B C trav...,17,5.38004,True
2,app_0302_3604_phnd_cc-hol.TextGrid,302,3604,client,hol,8.22110,21.06413,hi (uh) lily (uh) I'm joyce here (ppb) (um) I'...,hi (uh) lily (uh) I'm joyce here (ppb) (um) I'...,20,12.84303,True
3,app_0302_3604_phnd_cc-hol.TextGrid,302,3604,client,hol,21.06413,30.21838,(um) I'm looking into (um) a package to either...,(um) I'm looking into (um) a package to either...,15,9.15425,True
4,app_0302_0018_phnd_cc-hol.TextGrid,302,18,agent,hol,30.81900,42.98125,hi miss joy we do have a package to korea and ...,hi miss joy we do have a package to korea and ...,24,12.16225,True
...,...,...,...,...,...,...,...,...,...,...,...,...
176287,app_2055_7010_phnd_cc-msf.TextGrid,2055,7010,client,msf,605.98194,622.57313,(mm) I see but (um) that would mean that my ch...,(mm) I see but (um) that would mean that my ch...,34,16.59119,False
176288,app_2055_0042_phnd_cc-msf.TextGrid,2055,42,agent,msf,622.84706,643.47525,not to worry so as long as your child is a sin...,not to worry so as long as your child is a sin...,50,20.62819,False
176289,app_2055_0042_phnd_cc-msf.TextGrid,2055,42,agent,msf,643.47525,648.53688,leave scheme alright is there anything else th...,leave scheme alright is there anything else th...,15,5.06163,True
176290,app_2055_7010_phnd_cc-msf.TextGrid,2055,7010,client,msf,647.52119,650.98163,(um) no I think I'm good thank you,(um) no I think I'm good thank you,7,3.46044,True
