# Data cleaning (Preprocessing) for CSV transcripts

Perform basic data cleaning on CSV transcript data, particularly if using the *raw* versions of the files. The steps here are intended to clean up any inconsistencies that may remain after manual verification.

**Steps 1 & 2:** Already completed on the preprocessed version of the files; skip them if you start from that version.

**Using raw data:** Perform all the steps below.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from utils.preprocessing_helpers import *

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")  

## 0. Organize the files to process by experiment (if mixed in another folder)

In [9]:
source_directory = "../evaluation/references"    
destination_directory = "../interviews_corrected/0_raw"    

organize_csv_files_by_experiment(source_directory, destination_directory)

../evaluation/references\ID 05.csv
Copied ID 05.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 08.csv
Copied Id 08.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 13.csv
Copied Id 13.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 13b.csv
Copied Id 13b.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 14.csv
Copied Id 14.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 15.csv
Copied Id 15.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 16.csv
Copied Id 16.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 17.csv
Copied Id 17.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 18.csv
Copied Id 18.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 19.csv
Copied Id 19.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 19b.csv
Copied Id 19b.csv to ../interviews_corrected/0_raw\OBE1
../eva

## 1. Remove filler words and repetitions

In [4]:
#source_directory = "../interviews_corrected/0_raw" 
#destination_directory = "../interviews_corrected/1_cleaned" 

source_directory = "../Grief_submission/1_fr_merged_role"
destination_directory = "../Grief_submission/3_frcleaned_new"

# List of common filler words to remove
fillers_words = ["uh", "huh", "um", "hmm", "Mm"]

process_files(source_directory, destination_directory, fillers_words)
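
As an illustration of what this step does to a single line of text, here is a minimal sketch. It is not the code behind `process_files`; the function name `remove_fillers_and_repetitions` and the exact rules (whole-word, case-insensitive filler matching, and collapsing of immediately repeated words) are assumptions made for the example.

```python
import re


def remove_fillers_and_repetitions(text, fillers):
    """Hypothetical helper: drop standalone filler words, then collapse
    immediately repeated words ("I I think" -> "I think")."""
    if fillers:
        # Whole-word, case-insensitive match of any filler, plus an optional
        # trailing comma and the whitespace that follows it.
        filler_pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(f) for f in fillers) + r"),?\s*" if False else
            r"\b(?:" + "|".join(re.escape(f) for f in fillers) + r")\b,?\s*",
            flags=re.IGNORECASE,
        )
        text = filler_pattern.sub("", text)
    # Collapse a word that is repeated one or more times in a row.
    text = re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Normalize the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


fillers_example = ["uh", "huh", "um", "hmm", "Mm"]
cleaned = remove_fillers_and_repetitions(
    "Um, I I think, uh, it was was fine", fillers_example
)
print(cleaned)
```

Note that a rule like this also collapses legitimate repetitions such as "had had", which is one reason a manual check remains useful afterwards.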

## 2. Predict the speaker role assignment (if not done manually)

- Achieves 100% accuracy on this subset of interviews.

- Works well on *normal* audio interviews, but may be incorrect on files marked "0 : No interview (e.g. Set-up)" in **structured_data_manual.xlsx** (even though these were correctly predicted in this subset).

In [10]:
#source_directory = "../interviews_corrected/1_cleaned"    
#destination_directory = "../interviews_corrected/2_cleaned_role" 

source_directory = "../results/Grief/french/matched_transcripts"
destination_directory = "../Grief_submission/0_fr_role"

process_files(source_directory, destination_directory, roles=True)

## 3. Manually verify the role assignment and cut unrelated parts of the interview (set-up, etc.)

## 4. Text format

In [15]:
source_directory = "../Grief_submission/1_fr_merged_role" 
destination_directory = "../Grief_submission/2_fr_merged_role_text"

process_files(source_directory, destination_directory, text_format=True, time_stamps=True)
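
To make the idea of a timestamped text format concrete, the sketch below turns transcript rows into one line per segment. The row keys (`start`, `end`, `speaker`, `text`), the `HH:MM:SS` layout, and the helper names are all assumptions for illustration; the actual output of `process_files(..., text_format=True, time_stamps=True)` may differ.

```python
def format_timestamp(seconds):
    """Render a number of seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def rows_to_text(rows, time_stamps=True):
    """Hypothetical helper: one text line per transcript row,
    optionally prefixed with its time range."""
    lines = []
    for row in rows:
        prefix = ""
        if time_stamps:
            start = format_timestamp(row["start"])
            end = format_timestamp(row["end"])
            prefix = f"[{start} - {end}] "
        lines.append(f"{prefix}{row['speaker']}: {row['text']}")
    return "\n".join(lines)


example_rows = [
    {"start": 5.2, "end": 9.8, "speaker": "Interviewer", "text": "How are you?"},
    {"start": 3725, "end": 3730.0, "speaker": "Participant", "text": "Fine."},
]
print(rows_to_text(example_rows))
```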

## 5. Match transcripts (translation as reference, original as target)

Align transcripts by largest overlap or closest proximity, using the English translation as the reference and the original transcript as support where needed.

In [16]:
ref_folder = "../results/Grief/french/fr_to_eng"
target_folder = "../results/Grief/french/fr"

output_folder = "../results/Grief/french/matched_transcripts"

match_transcripts_folder(ref_folder, target_folder, output_folder)
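
The matching rule described above (largest overlap, falling back to closest proximity) can be sketched on plain time intervals. This is only an illustration under stated assumptions: segments are `(start, end)` tuples in seconds, and `match_segments`, `overlap`, and `gap` are hypothetical names, not the internals of `match_transcripts_folder`.

```python
def overlap(a, b):
    """Length of the time overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def gap(a, b):
    """Distance between two intervals (0 if they touch or overlap)."""
    return max(a[0] - b[1], b[0] - a[1], 0.0)


def match_segments(ref_segments, target_segments):
    """For each reference segment, return the index of the target segment
    with the largest overlap, or the closest one if nothing overlaps."""
    if not target_segments:
        return [None] * len(ref_segments)
    indices = range(len(target_segments))
    matches = []
    for ref in ref_segments:
        best = max(indices, key=lambda i: overlap(ref, target_segments[i]))
        if overlap(ref, target_segments[best]) == 0:
            # No overlap at all: fall back to the nearest segment in time.
            best = min(indices, key=lambda i: gap(ref, target_segments[i]))
        matches.append(best)
    return matches


ref = [(0.0, 4.0), (4.0, 9.0), (20.0, 22.0)]       # e.g. English translation
target = [(0.0, 3.5), (3.5, 10.0), (12.0, 15.0)]   # e.g. original transcript
print(match_segments(ref, target))
```

In this example the third reference segment overlaps nothing, so it is paired with the target segment that ends closest to it.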

## 6. Merge files (one interview split across several files)

In [11]:
source_directory = "../Grief_submission/0_fr_role"
destination_directory = "../Grief_submission/1_fr_merged_role"

merge_csv_in_subdirectories(source_directory, destination_directory)

Created output directory: ../Grief_submission/1_fr_merged_role
Merged files from ../Grief_submission/0_fr_role\ADE_pilot_002 into ../Grief_submission/1_fr_merged_role\ADE_pilot_002_merged.csv
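
For readers without access to the helper module, the following sketch shows one way such a merge could work: every subdirectory is treated as one interview, and its CSV parts are concatenated in filename order into `<subdirectory>_merged.csv`. The function name `merge_csv_sketch` is hypothetical, and the sketch assumes that all parts share the same header and that timestamps need no shifting between parts, which may not hold for the real data.

```python
import csv
from pathlib import Path


def merge_csv_sketch(source_dir, dest_dir):
    """Hypothetical helper: concatenate the CSV files of each subdirectory
    of source_dir into one file in dest_dir. Returns the written paths."""
    source, dest = Path(source_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    merged_paths = []
    for subdir in sorted(p for p in source.iterdir() if p.is_dir()):
        header, rows = None, []
        for csv_path in sorted(subdir.glob("*.csv")):
            with csv_path.open(newline="", encoding="utf-8") as handle:
                reader = csv.reader(handle)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {csv_path}")
                rows.extend(reader)
        if header is None:
            continue  # nothing to merge in this subdirectory
        out_path = dest / f"{subdir.name}_merged.csv"
        with out_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        merged_paths.append(out_path)
    return merged_paths
```

Sorting by filename is a design choice made for the sketch: it keeps the parts in order only if they are named so that the first part sorts first.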


## 7. Add condition information, if available

In [11]:
import pandas as pd  # explicit import; do not rely on the star import from the helpers to expose pd

source_directory = "../interviews_corrected/3_manual_check"
destination_directory = "../interviews_corrected/5_conditions"

condition_info = pd.read_excel("../interviews_corrected/structured_data_manual.xlsx")

process_files(source_directory, destination_directory, conditions=condition_info)
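
Attaching condition information to a transcript is essentially a lookup by interview identifier. The sketch below shows that idea with a left join in pandas; the column names `Id`, `Content`, and `Condition`, as well as the function `add_conditions_sketch`, are assumptions for illustration and may not match the columns of **structured_data_manual.xlsx** or the behavior of `process_files`.

```python
import pandas as pd


def add_conditions_sketch(transcript, condition_info, key="Id"):
    """Hypothetical helper: attach per-interview condition columns to every
    transcript row. Rows without a matching key keep NaN in the new columns."""
    return transcript.merge(
        condition_info,
        on=key,
        how="left",               # keep every transcript row, in order
        validate="many_to_one",   # fail if an Id appears twice in condition_info
    )


transcript_example = pd.DataFrame(
    {"Id": [5, 5, 8], "Content": ["first line", "second line", "third line"]}
)
conditions_example = pd.DataFrame({"Id": [5, 8], "Condition": ["C", "I"]})
with_conditions = add_conditions_sketch(transcript_example, conditions_example)
print(with_conditions)
```

The `validate="many_to_one"` argument guards against duplicated identifiers in the condition table, which would otherwise silently duplicate transcript rows.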

## 8. Add an utterance index for further analysis

In [4]:
source_directory = "../interviews_corrected/5_conditions" 
destination_directory = "../interviews_corrected/6_final"

process_files(source_directory, destination_directory, utterance=True)
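
One plausible reading of an utterance index is a counter that increases each time the speaker changes, so that consecutive segments from the same speaker share one index. The sketch below implements that reading only; the definition, the `Speaker` key, the `Index_Utterance` column name, and the function name are all assumptions, and the helper called above may define the index differently.

```python
def add_utterance_index_sketch(rows, speaker_key="Speaker"):
    """Hypothetical helper: number utterances from 0, starting a new
    utterance whenever the speaker differs from the previous row."""
    indexed = []
    current_index = -1
    previous_speaker = object()  # sentinel that equals no real speaker
    for row in rows:
        speaker = row[speaker_key]
        if speaker != previous_speaker:
            current_index += 1
            previous_speaker = speaker
        # Copy the row so the input is left unchanged.
        indexed.append({**row, "Index_Utterance": current_index})
    return indexed


segments = [
    {"Speaker": "Interviewer", "Content": "Can you describe it?"},
    {"Speaker": "Interviewer", "Content": "Take your time."},
    {"Speaker": "Participant", "Content": "It felt strange."},
    {"Speaker": "Interviewer", "Content": "In what way?"},
]
indexed_segments = add_utterance_index_sketch(segments)
print([row["Index_Utterance"] for row in indexed_segments])
```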

## Combined run: role assignment, merging, and text formatting

In [7]:
source_directory = "../results/Grief/eng"
destination_directory = "../Grief_submission/1_eng_role"
process_files(source_directory, destination_directory, roles=True)

source_directory = "../Grief_submission/1_eng_role"
destination_directory = "../Grief_submission/2_eng_merged_role_text"
merge_csv_in_subdirectories(source_directory, destination_directory)

source_directory = "../Grief_submission/2_eng_merged_role_text"
destination_directory = "../Grief_submission/3_eng_merged_role_text"
process_files(source_directory, destination_directory, text_format=True, time_stamps=True)

Merged 2 files from ../Grief_submission/1_eng_role\ADE_pilot_003 into ../Grief_submission/2_eng_merged_role_text\ADE_pilot_003_merged.csv
Merged 2 files from ../Grief_submission/1_eng_role\ADE_pilot_004 into ../Grief_submission/2_eng_merged_role_text\ADE_pilot_004_merged.csv
Merged 2 files from ../Grief_submission/1_eng_role\ADE_pilot_005 into ../Grief_submission/2_eng_merged_role_text\ADE_pilot_005_merged.csv
Merged 2 files from ../Grief_submission/1_eng_role\ADE_pilot_006 into ../Grief_submission/2_eng_merged_role_text\ADE_pilot_006_merged.csv
Merged 2 files from ../Grief_submission/1_eng_role\ADE_pilot_007 into ../Grief_submission/2_eng_merged_role_text\ADE_pilot_007_merged.csv
Merged 2 files from ../Grief_submission/1_eng_role\ADE_pilot_008 into ../Grief_submission/2_eng_merged_role_text\ADE_pilot_008_merged.csv
Merged 2 files from ../Grief_submission/1_eng_role\ADE_pilot_009 into ../Grief_submission/2_eng_merged_role_text\ADE_pilot_009_merged.csv
Merged 3 files from ../Grief_submi

In [8]:
ref_folder = "../results/Grief/fr/fr_to_eng"
target_folder = "../results/Grief/fr/fr"
output_folder = "../results/Grief/fr/matched_transcripts"

match_transcripts_folder(ref_folder, target_folder, output_folder)

In [9]:
source_directory = "../results/Grief/fr/matched_transcripts"
destination_directory = "../Grief_submission/1_fr_role"
process_files(source_directory, destination_directory, roles=True)

source_directory = "../Grief_submission/1_fr_role"
destination_directory = "../Grief_submission/2_fr_merged_role"
merge_csv_in_subdirectories(source_directory, destination_directory)

source_directory = "../Grief_submission/2_fr_merged_role"
destination_directory = "../Grief_submission/3_fr_merged_role_text"
process_files(source_directory, destination_directory, text_format=True, time_stamps=True)

Merged 2 files from ../Grief_submission/1_fr_role\ADE_pilot_002 into ../Grief_submission/2_fr_merged_role\ADE_pilot_002_merged.csv
Merged 2 files from ../Grief_submission/1_fr_role\ADE_pilot_010 into ../Grief_submission/2_fr_merged_role\ADE_pilot_010_merged.csv
Merged 1 files from ../Grief_submission/1_fr_role\ADE_pilot_011 into ../Grief_submission/2_fr_merged_role\ADE_pilot_011_merged.csv
