In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
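from utils.preprocessing_helpers import *
import pandas as pd  # used below to load the conditions overview; explicit in case the star import does not expose `pd`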

# Data Cleaning and Preprocessing for Interview Transcripts

This notebook provides a modular workflow to clean and preprocess CSV transcripts from raw interview data. Users can follow the steps that apply to their specific needs, or customize the workflow as required.

## **Sections**

- **Organizing Files by Experiment:** Structure raw files by experiment to improve manageability. *(if mixed in another folder)*
- **Removing Filler Words and Repetitions + Visual Cleaning:** Remove vocalized fillers and repeated words, then tidy the visual formatting of the transcripts.
- **Predicting Speaker Roles:** Automatically assign speaker roles if they are not manually labeled.
- **Text Format + Visual Cleaning:** Prepare transcripts with consistent formatting and timestamps.
- **Matching Translations:** Align translated transcripts with original data for bilingual analysis.
- **Merging Files:** Combine split transcripts from the same interview.
- **Adding Metadata:** Enrich transcripts with additional information such as conditions or utterance indices.

*The removal of vocalized fillers and repetitions, visual cleaning, and speaker role prediction have already been applied to the files in the ``results/processed`` folder.*

# **Workflow Sections**

## Organizing Files by Experiment *(if mixed in another folder)*
Group raw files into folders based on experiments (e.g., OBE1, OBE2, Compassion). This ensures data is well-structured for downstream processing.
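
The grouping step can be pictured with a minimal sketch. This is not the implementation of `organize_csv_files_by_experiment`, which lives in `utils.preprocessing_helpers` and is not shown here. The `EXPERIMENT_BY_ID` lookup and the function name `copy_by_experiment` are hypothetical, and the sketch assumes each file name stem identifies its experiment.

```python
import shutil
from pathlib import Path

# Hypothetical lookup from lower-cased file stem to experiment folder.
EXPERIMENT_BY_ID = {"id 05": "OBE1", "id 08": "OBE1"}


def copy_by_experiment(source_dir, destination_dir, experiment_by_id):
    """Copy each CSV in source_dir into destination_dir/<experiment>/."""
    copied = []
    for csv_path in sorted(Path(source_dir).glob("*.csv")):
        experiment = experiment_by_id.get(csv_path.stem.lower())
        if experiment is None:
            continue  # unknown file: leave it where it is
        target_dir = Path(destination_dir) / experiment
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(csv_path, target_dir / csv_path.name)
        copied.append(target_dir / csv_path.name)
    return copied
```

Copying rather than moving keeps the source folder intact, so the step can be re-run safely.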

In [3]:
source_directory = "../evaluation/references"    
destination_directory = "../interviews_corrected/0_raw"    

organize_csv_files_by_experiment(source_directory, destination_directory)

../evaluation/references\ID 05.csv
Copied ID 05.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 08.csv
Copied Id 08.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 13.csv
Copied Id 13.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 13b.csv
Copied Id 13b.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 14.csv
Copied Id 14.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 15.csv
Copied Id 15.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 16.csv
Copied Id 16.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 17.csv
Copied Id 17.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 18.csv
Copied Id 18.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 19.csv
Copied Id 19.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 19b.csv
Copied Id 19b.csv to ../interviews_corrected/0_raw\OBE1
../eva

## Removing "Vocalized Fillers", Repetitions + Visual Cleaning

Vocalized fillers are sounds, pauses, or expressions that occur naturally in speech, such as ``"mmh"``, ``"uh"``, or ``"hmm"``. They can signal meaningful pauses or conversational intent, but in text analysis they usually carry no additional information and add noise.

### Context-Based Decision:

**Meditation Interviews:** Fillers were removed during transcription to enhance clarity and focus on meaningful content.

**Grief Interviews:** Fillers were retained during transcription, as they provide insight into natural pauses, emotions, and the conversational flow, which are critical for the analysis.

In [4]:
source_directory = "../interviews_corrected/0_raw" 
destination_directory = "../interviews_corrected/1_cleaned" 

# Common vocalized fillers; add more if needed, or set to None if no fillers should be removed.
vocalized_fillers = ["Mm-hmm", "uh", "huh", "um", "hmm", "Mm"]

process_files(source_directory, destination_directory, vocalized_fillers)
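
As an illustration of what filler and repetition removal can look like on a single utterance, here is a minimal regex-based sketch. It is an assumption about the general technique, not the code behind `process_files`; the function name `remove_fillers_and_repetitions` is hypothetical.

```python
import re


def remove_fillers_and_repetitions(text, fillers):
    """Drop stand-alone filler tokens, then collapse immediate word repetitions."""
    if fillers:
        # Longest first so "Mm-hmm" is tried before "Mm".
        ordered = sorted(fillers, key=len, reverse=True)
        alternatives = "|".join(re.escape(f) for f in ordered)
        # A filler must not touch another word character or hyphen;
        # an optional trailing comma and whitespace are removed with it.
        pattern = re.compile(
            rf"(?<![\w-])(?:{alternatives})(?![\w-]),?\s*", re.IGNORECASE
        )
        text = pattern.sub("", text)
    # Collapse "I I think" into "I think" (whole words, case-insensitive).
    text = re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Tidy the whitespace left behind.
    return re.sub(r"\s{2,}", " ", text).strip()
```

For example, `remove_fillers_and_repetitions("Um, I I think, uh, it was was fine.", vocalized_fillers)` gives `"I think, it was fine."`. Note that collapsing repetitions this way also removes legitimate doubles such as "had had", which is one reason to keep the step optional.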

Further cleaning of filler words, including more nuanced removal of conversational fillers, stop-words filtering, and lemmatization, is handled during post-transcription analysis in the analysis notebook. This approach allows for targeted preprocessing based on specific analysis goals, ensuring optimal results for tasks such as topic modeling, sentiment analysis, and text visualization.

By balancing transcription cleaning and post-analysis processing, this workflow maintains flexibility to adapt to different analytical needs while ensuring that important conversational nuances are preserved or highlighted as required.

## Predicting Speaker Roles
- Achieves 100% accuracy on this set of interviews.
- Works well on *normal* interview audio, but may be wrong for files labeled ``"0 : No interview (eg. Set-up)"`` (although these were predicted correctly in this dataset).

In [5]:
source_directory = "../interviews_corrected/1_cleaned"    
destination_directory = "../interviews_corrected/2_cleaned_role" 

process_files(source_directory, destination_directory, roles=True)

File 'S209-2.csv': Couldn't accurately predict the most probable participant. Define the mosts probable interviewers and select by default the participant as a fallback.


## Text Format + Visual Cleaning
Converts transcripts into a standardized text format with optional timestamps for better readability and analysis.

In [6]:
source_directory = "../interviews_corrected/2_cleaned_role"  
destination_directory = "../interviews_corrected/3_text"

process_files(source_directory, destination_directory, text_format=True, time_stamps=True)
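
To make the target format concrete, the following sketch renders transcript rows as timestamped text lines. The row keys (`speaker`, `start`, `text`), the `[HH:MM:SS] Speaker: text` layout, and the merging of consecutive rows by the same speaker are all assumptions for illustration; the actual output of `process_files(..., text_format=True)` may differ.

```python
def format_timestamp(seconds):
    """Render a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def rows_to_text(rows, time_stamps=True):
    """Merge consecutive rows by the same speaker and render one line per turn."""
    lines = []
    previous_speaker = None
    for row in rows:
        if lines and row["speaker"] == previous_speaker:
            # Same speaker keeps talking: extend the current line.
            lines[-1] += " " + row["text"].strip()
            continue
        prefix = f"[{format_timestamp(row['start'])}] " if time_stamps else ""
        lines.append(f"{prefix}{row['speaker']}: {row['text'].strip()}")
        previous_speaker = row["speaker"]
    return "\n".join(lines)
```

Each line carries the start time of the first row in the turn, so long answers stay on one line while remaining traceable to the audio.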

## Matching Translations
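Align each segment of the original transcript with the English translation, pairing segments by largest overlap or, when none overlap, by closest proximity. The English translation serves as the reference, and the matching original text is added alongside it in the same file.
- Translation (*reference*) and original (*target*)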
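
The "largest overlap, else closest proximity" rule can be sketched as follows. This assumes segments are compared by their start and end times in seconds, which the notebook does not state explicitly; `match_segments` and its helpers are hypothetical names and not the code behind `match_transcripts_folder`.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def gap(a_start, a_end, b_start, b_end):
    """Distance between two intervals (0 if they touch or overlap)."""
    return max(0.0, max(a_start, b_start) - min(a_end, b_end))


def match_segments(reference, target):
    """For each reference (start, end), return the index of the best target segment."""
    if not target:
        return [None] * len(reference)
    matches = []
    for r_start, r_end in reference:
        overlaps = [overlap(r_start, r_end, t_start, t_end) for t_start, t_end in target]
        best = max(range(len(target)), key=overlaps.__getitem__)
        if overlaps[best] == 0.0:
            # No overlap at all: fall back to the closest segment in time.
            best = min(range(len(target)), key=lambda i: gap(r_start, r_end, *target[i]))
        matches.append(best)
    return matches
```

With `reference = [(0.0, 4.0), (4.0, 9.0), (20.0, 22.0)]` and `target = [(0.0, 3.5), (3.5, 10.0), (12.0, 15.0)]`, the first two reference segments are matched by overlap and the third by proximity, giving `[0, 1, 2]`.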

In [9]:
ref_folder = "../results/Grief/fr/fr_to_eng"
target_folder = "../results/Grief/french/fr"

output_folder = "../results/Grief/french/matched_transcripts"

match_transcripts_folder(ref_folder, target_folder, output_folder)

Matching file not found for ../results/Grief/fr/fr_to_eng\ADE_pilot_002\ADE_pilot_002_interview_08.10.24_part1.csv
Matching file not found for ../results/Grief/fr/fr_to_eng\ADE_pilot_002\ADE_pilot_002_interview_08.10.24_part2.csv
Matching file not found for ../results/Grief/fr/fr_to_eng\ADE_pilot_010\ADE_pilot_010_interview_08.11.24_part1.csv
Matching file not found for ../results/Grief/fr/fr_to_eng\ADE_pilot_010\ADE_pilot_010_interview_08.11.24_part2.csv
Matching file not found for ../results/Grief/fr/fr_to_eng\ADE_pilot_011\ADE_pilot_011_interview_20.11.24.csv


## Merging Files

Combines multiple transcripts from a single interview into a single file for easier processing and analysis.

- Whenever possible, it is recommended to merge the audio files before processing. This allows the model to work on longer segments, up to a logical limit (depending on computational constraints), which improves its ability to identify and separate different speakers effectively.

In [None]:
source_directory = "../Grief_submission/0_fr_role"
destination_directory = "../Grief_submission/1_fr_merged_role"

merge_csv_in_subdirectories(source_directory, destination_directory)
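
One detail worth illustrating when merging split transcripts is timestamp continuity. The sketch below assumes each part restarts its timestamps at zero and that rows carry `start` and `end` keys in seconds; whether `merge_csv_in_subdirectories` shifts timestamps this way is not shown in this notebook, so treat `merge_parts` as a hypothetical illustration.

```python
def merge_parts(parts):
    """Concatenate transcript parts, shifting timestamps so time keeps increasing."""
    merged = []
    offset = 0.0
    for rows in parts:
        for row in rows:
            shifted = {**row, "start": row["start"] + offset, "end": row["end"] + offset}
            merged.append(shifted)
        if rows:
            # The next part restarts at 0, so continue from where this one ended.
            offset += max(row["end"] for row in rows)
    return merged
```

The parts must be passed in recording order (for example `part1` before `part2`); the sketch does not sort them.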

## Adding Metadata (*if available*)
Enriches transcripts with experiment-specific metadata, such as experimental conditions or utterance indices.

In [13]:
source_directory = "../interviews_corrected/2_cleaned_role"  
destination_directory = "../interviews_corrected/4_conditions"
condition_info = pd.read_csv("../Dataset/meditation_interviews/overview_interviews.csv")

process_files(source_directory, destination_directory, conditions=condition_info)

In [14]:
source_directory = "../interviews_corrected/4_conditions"
destination_directory = "../interviews_corrected/5_turn"

process_files(source_directory, destination_directory, turn=True)
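
As a rough picture of an utterance or turn index, the sketch below numbers rows so that the index increases each time the speaker changes. What `turn=True` actually computes is not shown here; the `speaker` key, the `turn_index` column name, and the function `add_turn_index` are assumptions.

```python
def add_turn_index(rows):
    """Return copies of rows with a 'turn_index' that increments on each speaker change."""
    indexed = []
    turn = -1
    previous_speaker = object()  # sentinel that never equals a real speaker label
    for row in rows:
        if row["speaker"] != previous_speaker:
            turn += 1
            previous_speaker = row["speaker"]
        indexed.append({**row, "turn_index": turn})
    return indexed
```

An index like this makes it easy to group all rows of one conversational turn in later analysis.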

# Workflow used for the Grief Transcriptions

## For English audio files

In [None]:
# Step 1: predict speaker roles.
source_directory = "../results/Grief/eng"
destination_directory = "../Grief_submission/1_eng_role"
process_files(source_directory, destination_directory, roles=True)

# Step 2: merge split transcripts from the same interview.
source_directory = "../Grief_submission/1_eng_role"
destination_directory = "../Grief_submission/2_eng_merged_role_text"
merge_csv_in_subdirectories(source_directory, destination_directory)

# Step 3: convert to text format with timestamps.
source_directory = "../Grief_submission/2_eng_merged_role_text"
destination_directory = "../Grief_submission/3_eng_merged_role_text"
process_files(source_directory, destination_directory, text_format=True, time_stamps=True)

## For French audio files

In [None]:
# Step 1: match the English translation (reference) with the original French transcript (target).
ref_folder = "../results/Grief/fr/fr_to_eng"
target_folder = "../results/Grief/fr/fr"
output_folder = "../results/Grief/fr/matched_transcripts"
match_transcripts_folder(ref_folder, target_folder, output_folder)

# Step 2: predict speaker roles.
source_directory = "../results/Grief/fr/matched_transcripts"
destination_directory = "../Grief_submission/1_fr_role"
process_files(source_directory, destination_directory, roles=True)

# Step 3: merge split transcripts from the same interview.
source_directory = "../Grief_submission/1_fr_role"
destination_directory = "../Grief_submission/2_fr_merged_role"
merge_csv_in_subdirectories(source_directory, destination_directory)

# Step 4: convert to text format with timestamps.
source_directory = "../Grief_submission/2_fr_merged_role"
destination_directory = "../Grief_submission/3_fr_merged_role_text"
process_files(source_directory, destination_directory, text_format=True, time_stamps=True)