# Data cleaning (Preprocessing) for CSV transcripts

Perform basic data cleaning on CSV transcript data, particularly if using the *raw* versions of the files. The steps here are intended to clean up any inconsistencies that may remain after manual verification.

**Using preprocessed data:** Steps 1 & 2 have already been applied to the preprocessed version, so you can skip them.

**Using raw data:** Run all the steps below.

In [2]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
from utils.preprocessing_helpers import organize_csv_files_by_dir, process_files
from utils.analysis_helpers import *

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")

### 0. Organize the files to process by experiment (if they are mixed together in another folder)

In [5]:
source_directory = "../evaluation/references"    
destination_directory = "../interviews_corrected/0_raw"    

organize_csv_files_by_dir(source_directory, destination_directory)

../evaluation/references\ID 05.csv
Copied ID 05.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 08.csv
Copied Id 08.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 13.csv
Copied Id 13.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 13b.csv
Copied Id 13b.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 14.csv
Copied Id 14.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 15.csv
Copied Id 15.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 16.csv
Copied Id 16.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 17.csv
Copied Id 17.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 18.csv
Copied Id 18.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 19.csv
Copied Id 19.csv to ../interviews_corrected/0_raw\OBE1
../evaluation/references\Id 19b.csv
Copied Id 19b.csv to ../interviews_corrected/0_raw\OBE1
../eva

### 1. Remove filler words and repetitions

In [8]:
source_directory = "../interviews_corrected/0_raw" 
destination_directory = "../interviews_corrected/1_cleaned"    

# List of common filler words to remove
filler_words = ["uh", "huh", "um", "hmm", "Mm"]

process_files(source_directory, destination_directory, filler_words)
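
The internals of `process_files` are not shown in this notebook. As a rough illustration of what this step could involve, here is a minimal sketch, assuming filler words are matched as whole words (case-insensitively) and that a "repetition" means the same word repeated back to back. The function names `remove_fillers` and `collapse_repetitions` are hypothetical and not part of `utils.preprocessing_helpers`.

```python
import re

def remove_fillers(text, filler_words):
    """Remove whole-word fillers (case-insensitive) and tidy the spacing.

    Hypothetical helper for illustration only.
    """
    # \b keeps "um" from matching inside words such as "human";
    # an optional trailing comma is removed together with the filler.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in filler_words) + r")\b,?"
    cleaned = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", cleaned).strip()

def collapse_repetitions(text):
    """Collapse immediately repeated words, e.g. "the the" becomes "the"."""
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)

filler_words = ["uh", "huh", "um", "hmm", "Mm"]
print(remove_fillers("Um, I think uh it was hmm fine", filler_words))
print(collapse_repetitions("I I I think the the answer"))
```

Note that collapsing repeated words can also remove intentional repetitions (for example "that that"), so the cleaned files are still worth a quick check.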

### 2. If not done manually, predict the Speaker Role assignment

- The prediction achieved 100% accuracy on this subset of interviews.

- It works well on *normal* audio interviews, but may be incorrect on recordings labelled "0 : No interview (eg. Set-up)" in **structured_data_manual.xlsx** (although those were predicted correctly in this subset).

In [9]:
source_directory = "../interviews_corrected/1_cleaned"    
destination_directory = "../interviews_corrected/2_cleaned_role" 

process_files(source_directory, destination_directory, roles=True)

File 'S209-2.csv': Couldn't accurately predict the most probable participant. Define the mosts probable interviewers and select by default the participant as a fallback.
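
The prediction logic lives inside `process_files` and is not shown here. Purely to illustrate the kind of heuristic such a step might use, the sketch below assumes the participant is the speaker who talks the most (by word count) and treats every other speaker as an interviewer. This heuristic, the function name `guess_roles`, and the speaker labels are all assumptions, not the method used by the helper.

```python
from collections import Counter

def guess_roles(utterances):
    """Guess a role for each speaker from (speaker, text) pairs.

    Assumed heuristic (not the notebook's actual method): the speaker
    with the highest total word count is the participant; everyone
    else is labelled an interviewer.
    """
    words = Counter()
    for speaker, text in utterances:
        words[speaker] += len(text.split())
    if not words:
        return {}
    participant = max(words, key=words.get)
    return {
        speaker: "Participant" if speaker == participant else "Interviewer"
        for speaker in words
    }

# Hypothetical speaker labels and utterances
sample = [
    ("SPEAKER_00", "How did that feel?"),
    ("SPEAKER_01", "It felt strange at first but then I got used to it"),
    ("SPEAKER_00", "Can you say more?"),
]
print(guess_roles(sample))
```

A heuristic like this can fail on short or atypical recordings, which is why the manual check in the next step remains necessary.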


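### 3. Manually verify the role assignment and cut unrelated parts of the interview (set-up, etc.)

Save the checked files to `../interviews_corrected/3_manual_check`, which steps 4 and 5 use as their source directory.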

### 4. Text format

In [10]:
source_directory = "../interviews_corrected/3_manual_check"    
destination_directory = "../interviews_corrected/4_text" 

process_files(source_directory, destination_directory, text_format=True)
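
The exact output layout produced by `text_format=True` is not shown in this notebook. One plausible reading is a plain-text transcript with one `Speaker: utterance` line per row. The sketch below assumes the CSV has columns named `Speaker` and `Content`; both column names and the function `csv_to_text` are hypothetical.

```python
import csv
import io

def csv_to_text(csv_file, speaker_col="Speaker", content_col="Content"):
    """Turn a transcript CSV into "Speaker: utterance" lines.

    Column names are assumptions; adjust them to the real files.
    Rows with empty content are skipped.
    """
    lines = []
    for row in csv.DictReader(csv_file):
        content = (row.get(content_col) or "").strip()
        if not content:
            continue
        lines.append(f"{row[speaker_col].strip()}: {content}")
    return "\n".join(lines)

# In-memory example standing in for a file opened with open(path, newline="")
sample = io.StringIO(
    "Speaker,Content\n"
    "Interviewer,How are you?\n"
    "Participant,Fine thanks\n"
    "Participant,\n"
)
print(csv_to_text(sample))
```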

### 5. Add conditions information if available

In [11]:
source_directory = "../interviews_corrected/3_manual_check"  
destination_directory = "../interviews_corrected/5_conditions"

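# Explicit import; `pd` may otherwise only be available through the star import above
import pandas as pd

condition_info = pd.read_excel("../interviews_corrected/structured_data_manual.xlsx")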

process_files(source_directory, destination_directory, conditions=condition_info)
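
How `process_files` matches transcripts to rows of **structured_data_manual.xlsx** is not shown here. Conceptually, this step is a left join on an identifier: every transcript row is kept, and condition fields are added where a match exists. The dependency-free sketch below assumes a shared key column named `ID` and a `Condition` column; both names and the function `attach_conditions` are hypothetical.

```python
def attach_conditions(rows, conditions, key="ID"):
    """Add condition fields to each transcript row by matching on `key`.

    Behaves like a left join: rows without a matching condition entry
    are kept unchanged. Column names are assumptions.
    """
    lookup = {entry[key]: entry for entry in conditions}
    merged = []
    for row in rows:
        extra = lookup.get(row[key], {})
        # Copy every condition field except the key itself
        merged.append({**row, **{k: v for k, v in extra.items() if k != key}})
    return merged

rows = [
    {"ID": "S209", "Content": "hello"},
    {"ID": "S999", "Content": "no condition recorded"},
]
conditions = [{"ID": "S209", "Condition": "A"}]
print(attach_conditions(rows, conditions))
```

With pandas DataFrames, the same idea corresponds to `DataFrame.merge(..., how="left")` on the shared key column.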