# IMDA Dataset processing
Updated Pre-Processing Steps for IMDA Dataset TextGrid Dialog Scripts

1.	Maintain Original Time Coordinate Separation:  
	•	Objective: For easy processing and feature engineering, and the dataset is large enough to filter out adequate instatnces 
	•	Implementation: keep the segmentation based on the original x_min and x_max time coordinates as provided in the TextGrid files.  
2.	Filter Out Short Text Segments:
	•	Objective: Focus on meaningful text segments that are likely to carry sentiment.
	•	Implementation: Filter out text segments that contain fewer than 10 words before performing sentiment analysis.
	•	Reasoning: Short text segments (e.g., “yes”, “no”, “okay”) may not provide enough context for accurate labeling.
3.	Add a Qualification Label for Sentiment Analysis:
	•	Objective: Identify which text segments are eligible for sentiment analysis based on word count.
	•	Implementation: Add a new column, `qualified_label_sentiment`, to the DataFrame: Set to True if the segment contains at least 10 words.

Example Implementation:

 • Input Data: Text segments from IMDA dataset TextGrid dialog scripts.  
 • Process:  
	1.	Parse and load the TextGrid files, maintaining original time coordinates (x_min, x_max).  
	2.	For each text segment, count the number of words.  
	3.	Filter out segments with fewer than 10 words.  
	4.	Add a qualified_for_sentiment column to indicate whether each segment meets the criteria for sentiment analysis.  
 • Outcome: A refined dataset with preserved time coordinates and only the most relevant text segments flagged for sentiment analysis.  

This process update ensures that the sentiment analysis is performed on meaningful segments while respecting the original structure of the dialog scripts, leading to more accurate and context-aware results.

In [1]:
# setup - install library
# !pip install textgrid
# !pip install praatio

In [2]:
import os
import re
import pandas as pd
from praatio import textgrid
from utils_v2 import process_textgrid_file_to_sentences_v2

## Document

### Data processing output 
The combined call center dialogue data has been processed and displayed in a tabular format. Here is a brief overview of the data:

 • unique_id: The unique identifier for the session, derived from the filename. e.g. 0683  
 • speaker_id: The identifier for the speaker (e.g. 0013, 4366).  
 • speaker_type: The type of speaker (agent or client).  
 • dialog_type: The type of conversation topic.  
 • x_min: The start time of the spoken segment.  
 • x_max: The end time of the spoken segment.  
 • text: The transcribed text of the spoken segment.  
 • cleaned_text_sentiment: processed text of the spoken segment.  
 • qualified_label_sentiment:
 

The data is sorted by the start time (x_min) to maintain the chronological order of the conversation.

### Data understanding (GPT Gen)
In TextGrid files, especially in the context of dialogue transcription,   
special symbols like \<B\>, \<Z\>, and \<S\> often represent specific events or markers within the conversation. 
Here are some common interpretations:

	•	<B>: This might indicate a backchannel, which are listener responses (like “uh-huh”, “right”, etc.) to show that they are following along but not taking the floor.
	•	<Z>: This could represent a pause or silence in the conversation, possibly of significant duration.
	•	<S>: This is often used to denote a short pause or a speaker’s hesitation, such as when they are thinking or momentarily pausing in their speech.

These markers help in analyzing the structure and flow of conversations, providing insight into pauses, interruptions, and listener engagement.

## processing Session Call Center Design 3

In [3]:
def decode_filename_new(file_path):
    """
    decode master data info from file name 
    like session index, speaker index, dialog type, etc.
    """
    session_id = file_path.split('/')[-1][4:8]
    speaker_id = file_path.split('/')[-1][9:13]
    speaker_type = 'agent' if speaker_id.startswith('00') else 'client'
    file_name = file_path.split('/')[-1]

    dialog_type = os.path.splitext(file_name)[0][-3:]
    
    return {
        'file_name': file_name,
        'session_id': session_id,
        'speaker_id': speaker_id,
        'speaker_type': speaker_type,
        'dialog_type': dialog_type,
        }
decode_filename_new("../data/input/TextGrid_Scripts_Session3/app_1360_0018_phnd_cc-moe.TextGrid")

{'file_name': 'app_1360_0018_phnd_cc-moe.TextGrid',
 'session_id': '1360',
 'speaker_id': '0018',
 'speaker_type': 'agent',
 'dialog_type': 'moe'}

In [4]:
import os

def process_all_textgrid_files_in_directory(directory_path):
    """
    wrapper function for iteration processing
    """
    all_data = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".TextGrid"):
            try:
                file_path = os.path.join(directory_path, filename)
                file_master_data = decode_filename_new(file_path)
                temp_file_sentences = process_textgrid_file_to_sentences_v2(file_path, file_master_data)
            except Exception as e:
                print("error occurred at file ", file_path)
                pass            
            all_data += temp_file_sentences
                
    # Combine all dataframes into one
    if all_data:
        combined_df = pd.DataFrame(all_data).sort_values(by=['session_id', 'dialog_type', 'x_min']).reset_index(drop=True)
        return combined_df
    else:
        return pd.DataFrame() 



In [5]:
directory_path_1 = "../data/input/TextGrid_Scripts_Session1/"
directory_path_3 = "../data/input/TextGrid_Scripts_Session3/"
res_df_1 = process_all_textgrid_files_in_directory(directory_path_1)
res_df_3 = process_all_textgrid_files_in_directory(directory_path_3)
res_df_1.to_csv('../data/processed/sentence_level_script_data_raw_session1.csv',index=False)
res_df_3.to_csv('../data/processed/sentence_level_script_data_raw_session3.csv',index=False)

error occurred at file  ../data/input/TextGrid_Scripts_Session1/app_0459_0021_phnd_cc-hot.TextGrid
error occurred at file  ../data/input/TextGrid_Scripts_Session1/app_0246_0013_phnd_cc-res.TextGrid
error occurred at file  ../data/input/TextGrid_Scripts_Session1/app_0188_0001_phnd_cc-hot.TextGrid
error occurred at file  ../data/input/TextGrid_Scripts_Session3/app_0303_3606_phnd_cc-res.TextGrid
error occurred at file  ../data/input/TextGrid_Scripts_Session3/app_1978_6956_phnd_cc-hdb.TextGrid


In [6]:
res_df_1.shape, res_df_3.shape

((259776, 8), (176292, 8))

## Filtering eligible sentences for sentiment labelling 

In [7]:
# remove none session id
data_raw_df = res_df_1
data_raw_df = data_raw_df[~data_raw_df['session_id'].isna()]
# cast session_id into integer
data_raw_df['session_id'] = data_raw_df['session_id'].astype(int)
data_raw_df['speaker_id'] = data_raw_df['speaker_id'].astype(int)
data_raw_df.head()

Unnamed: 0,file_name,session_id,speaker_id,speaker_type,dialog_type,x_min,x_max,text
0,app_0001_0010_phnd_cc-hol.TextGrid,1,10,agent,hol,0.0,7.06,okay <mandarin>来来:lai lai</mandarin> okay <man...
1,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,hol,0.0,3.13,up okay
2,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,hol,3.13,7.97812,so (ppo) okay so hello I'm calling to (err)
3,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,hol,7.97812,10.4337,make some enquiry about (uh) travel
4,app_0001_3002_phnd_cc-hol.TextGrid,1,3002,client,hol,10.4337,12.49,trip to (uh) korea


In [8]:
import re

def clean_text_for_word_count(text):
    # Remove any string in brackets like (<xxx>) or [<xxx>] or <<xxx>>
    text = re.sub(r'\(.*?\)', '', text)  # Removes content within parentheses
    text = re.sub(r'\[.*?\]', '', text)  # Removes content within square brackets
    text = re.sub(r'\<.*?\>', '', text)  # Removes content within angle brackets

    # Remove any character that is not an English alphabet or space
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove extra whitespace
    cleaned_text = ' '.join(cleaned_text.split())

    return cleaned_text

# Example usage
example_text = "This is an example (with noise) [and more noise] <<additional noise>> 123!"
cleaned_text = clean_text_for_word_count(example_text)
print(f"Cleaned Text: '{cleaned_text}'")

Cleaned Text: 'This is an example'


In [9]:
def clean_text_for_sentiment(text):
    # Remove specific markers <B>, <S>, <Z>
    text = re.sub(r'<B>|<S>|<Z>', '', text)
    # Remove extra whitespace
    cleaned_text = re.sub(r'\s+', ' ', text).strip()
    return cleaned_text

def count_unique_words(text):
    # Split the text into words and count unique words
    words = text.split()
    unique_words = set(words)
    return len(unique_words)

def add_qualification_label(df):
    # Clean the text and count words
    df['cleaned_text_for_sentiment'] = df['text'].apply(clean_text_for_sentiment)
    df['word_count'] = df['text'].apply(clean_text_for_word_count).apply(count_unique_words)
    df = df.assign(duration=lambda x: x.x_max - x.x_min)
    # Filter rows where the word count is less than 10 and duration less than 15 seconds
    # filtered_df = df[df['word_count'] >= 6][df['duration'] <= 15].reset_index(drop=True)
    # Add a qualification label: True if the segment contains at least 10 words and duration less than 15 seconds
    df['qualified_for_sentiment'] = (df['word_count'] >= 7) & (df['duration'] <= 15)
    return df

def filter_by_qualified(df):
    filtered_df = df[df['qualified_for_sentiment'] == True].reset_index(drop=True)
    # filtered_df = filtered_df.drop(columns=['text','word_count','duration'])
    return filtered_df

data_w_stats_df = add_qualification_label(data_raw_df)
data_w_stats_df.to_csv('../data/processed/sentence_level_script_data_raw_session1.csv',index=False)
print("data_w_stats_df number of records: ", data_w_stats_df.shape[0])
qualified_data_df = filter_by_qualified(data_w_stats_df)
qualified_data_df.to_csv('../data/processed/sentence_level_script_data_filtered_session1.csv',index=False)
print("number of qualified records: ", qualified_data_df.shape[0])

data_w_stats_df number of records:  259776
number of qualified records:  136831


In [10]:
# qualified_data_df.head(10)

## Call center design 3 

In [11]:
# remove none session id
data_raw_df = res_df_3
data_raw_df = data_raw_df[~data_raw_df['session_id'].isna()]
# cast session_id into integer
data_raw_df['session_id'] = data_raw_df['session_id'].astype(int)
data_raw_df['speaker_id'] = data_raw_df['speaker_id'].astype(int)
data_raw_df.head()

data_w_stats_df = add_qualification_label(data_raw_df)
data_w_stats_df.to_csv('../data/processed/sentence_level_script_data_raw_session3.csv',index=False)
print("data_w_stats_df number of records: ", data_w_stats_df.shape[0])
qualified_data_df = filter_by_qualified(data_w_stats_df)
qualified_data_df.to_csv('../data/processed/sentence_level_script_data_filtered_session3.csv',index=False)
print("number of qualified records: ", qualified_data_df.shape[0])

data_w_stats_df number of records:  176292
number of qualified records:  77807


# The END