In [1]:
import pandas as pd

# Explore transcript structure

Load the first transcript to get an idea of the structure.

In [2]:
transcript = pd.read_csv('../data/raw/transcripts/300_Transcript.csv')
transcript

Unnamed: 0,Start_Time,End_Time,Text,Confidence
0,14.3,15.1,so I'm going to,0.934210
1,20.3,21.1,interview in Spanish,0.608470
2,23.9,24.3,okay,0.690606
3,62.1,62.7,good,0.951897
4,68.8,69.8,Atlanta Georgia,0.987629
...,...,...,...,...
72,602.4,603.2,then,0.957885
73,618.0,618.8,thank you,0.989859
74,620.2,620.7,bye-bye,0.622634
75,640.3,643.7,a real life person is really looking at me,0.969585


The transcript appears to be auto-generated from the participant's speech, which causes a few problems. Listening to the audio shows that the first entry was not spoken by the participant. Speech outside the formal interview is also transcribed, since it is part of the recording. Finally, the "Start_Time" in the last row is labeled incorrectly, so the data is not perfect. Let us look at another file.

In [3]:
pd.read_csv('../data/raw/transcripts/301_Transcript.csv') 

Unnamed: 0,Start_Time,End_Time,Text,Confidence
0,0.8,7.0,yeah there's also on Craigslist so that's why,0.883057
1,41.9,42.5,okay,0.960925
2,52.9,55.8,how are you doing today I'm doing good thank you,0.950963
3,59.7,60.7,I'm from Los Angeles,0.970176
4,63.4,64.2,I'm great,0.904099
...,...,...,...,...
67,797.8,798.3,okay,0.781798
68,799.0,801.4,no problem,0.987629
69,802.2,802.8,all right,0.370485
70,818.0,819.2,I was weird,0.881451


The incorrect "Start_Time" in the last row appears to be a recurring problem: the value seems to duplicate the time from the first row. This issue is most likely irrelevant for our task, however.
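
Timing problems like the one described above could be checked programmatically rather than by eye. The following is a minimal sketch, assuming only the "Start_Time" and "End_Time" columns seen in the transcripts; the helper name `check_timestamps` and the toy rows are hypothetical and not part of the dataset.

```python
import pandas as pd


def check_timestamps(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose timing looks suspicious, with one flag column per check."""
    flags = pd.DataFrame(index=df.index)
    # A segment should end after it starts.
    flags["bad_duration"] = df["End_Time"] <= df["Start_Time"]
    # A segment should not start earlier than the previous segment started.
    flags["out_of_order"] = df["Start_Time"].diff() < 0
    # The suspected issue: the last row reusing the first row's start time.
    flags["repeats_first"] = False
    if len(df) > 1:
        last = df.index[-1]
        flags.loc[last, "repeats_first"] = bool(
            df["Start_Time"].iloc[-1] == df["Start_Time"].iloc[0]
        )
    return df[flags.any(axis=1)].join(flags)


# Toy transcript (made-up values) whose last row repeats the first start time.
toy = pd.DataFrame(
    {
        "Start_Time": [14.3, 20.3, 23.9, 14.3],
        "End_Time": [15.1, 21.1, 24.3, 17.7],
        "Text": ["so I'm going to", "interview in Spanish", "okay", "thank you"],
    }
)
suspicious = check_timestamps(toy)
print(suspicious)
```

Running the same helper on each real transcript would show whether the last-row problem is present in every file or only in some of them.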

# Explore labels directory

## Explore detailed labels

In [4]:
detailed_labels = pd.read_csv('../data/raw/labels/detailed_labels.csv')
detailed_labels

Unnamed: 0,Participant,PHQ8_1_NoInterest,PHQ8_2_Depressed,PHQ8_3_Sleep,PHQ8_4_Tired,PHQ8_5_Appetite,PHQ8_6_Failure,PHQ8_7_Concentration,PHQ8_8_Psychomotor,Depression_severity,...,PCL-C_14_Irritability,PCL-C_15_Concentration,PCL-C_16_HyperAlert,PCL-C_17_Jumpy,PTSD_severity,gender,age,Depression_label,PTSD_label,split
0,300,0,0,1,0,1,0,0,0,2,...,2,2,2,1,25.0,male,33,0,0,dev
1,301,0,0,1,1,1,0,0,0,3,...,1,1,1,1,17.0,male,39,0,0,dev
2,302,1,1,0,1,0,1,0,0,4,...,1,1,1,1,28.0,male,25,0,0,train
3,303,0,0,0,0,0,0,0,0,0,...,1,1,1,1,17.0,female,41,0,0,train
4,304,0,1,1,2,2,0,0,0,6,...,1,2,2,1,20.0,female,22,0,0,train
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270,713,0,0,0,0,0,0,0,0,0,...,2,1,1,1,22.0,unknown,58,0,0,dev
271,715,1,1,1,1,1,1,1,0,7,...,3,2,5,3,55.0,male,55,0,0,test
272,716,1,3,3,2,1,2,2,1,15,...,4,5,5,4,73.0,male,37,1,1,test
273,717,0,0,1,0,0,0,0,0,1,...,1,1,1,1,20.0,male,48,0,0,test


In [5]:
detailed_labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 275 entries, 0 to 274
Data columns (total 33 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Participant                 275 non-null    int64  
 1   PHQ8_1_NoInterest           275 non-null    int64  
 2   PHQ8_2_Depressed            275 non-null    int64  
 3   PHQ8_3_Sleep                275 non-null    int64  
 4   PHQ8_4_Tired                275 non-null    int64  
 5   PHQ8_5_Appetite             275 non-null    int64  
 6   PHQ8_6_Failure              275 non-null    int64  
 7   PHQ8_7_Concentration        275 non-null    int64  
 8   PHQ8_8_Psychomotor          275 non-null    int64  
 9   Depression_severity         275 non-null    int64  
 10  PCL-C_1_Memories            275 non-null    int64  
 11  PCL-C_2_Dreams              275 non-null    int64  
 12  PCL-C_3_Reliving            275 non-null    int64  
 13  PCL-C_4_Upset               275 non

Focus on the target for our task:

In [6]:
detailed_labels['Depression_label'].value_counts(normalize=True)

Depression_label
0    0.76
1    0.24
Name: proportion, dtype: float64

It is evident that the classes are imbalanced: roughly three quarters of the participants are labeled 0. Let us explore the distribution for each split.

In [7]:
detailed_labels.groupby('split')['Depression_label'].value_counts(normalize=True)

split  Depression_label
dev    0                   0.785714
       1                   0.214286
test   0                   0.696429
       1                   0.303571
train  0                   0.773006
       1                   0.226994
Name: proportion, dtype: float64

The label distribution is roughly the same for train and dev (about 21-23% labeled 1), while test has a somewhat larger proportion of 'depressed' participants (about 30%).
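
Proportions alone hide how many participants stand behind each number, which matters when a split is small. As a sketch, counts and row-normalized proportions can be shown side by side with `pd.crosstab`. The toy frame below stands in for `detailed_labels` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for detailed_labels (hypothetical values).
toy_labels = pd.DataFrame(
    {
        "split": ["train"] * 4 + ["dev"] * 2 + ["test"] * 2,
        "Depression_label": [0, 0, 0, 1, 0, 1, 0, 1],
    }
)

# Absolute number of participants per split and label.
counts = pd.crosstab(toy_labels["split"], toy_labels["Depression_label"])
# Share of each label within a split (each row sums to 1).
proportions = pd.crosstab(
    toy_labels["split"], toy_labels["Depression_label"], normalize="index"
)

# Combine both views under a two-level column header.
summary = pd.concat({"count": counts, "proportion": proportions}, axis=1)
print(summary)
```

With the real data, replacing `toy_labels` by `detailed_labels` should give the same proportions as the `groupby` above, plus the underlying counts.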

## Explore split files

In [8]:
pd.read_csv('../data/raw/labels/train_split.csv')  

Unnamed: 0,Participant_ID,Gender,PHQ_Binary,PHQ_Score,PCL-C (PTSD),PTSD Severity
0,302,male,0,4,0,28
1,303,female,0,0,0,17
2,304,female,0,6,0,20
3,305,male,0,7,0,28
4,307,female,0,4,0,23
...,...,...,...,...,...,...
158,695,male,0,7,1,62
159,697,male,0,5,0,24
160,702,male,0,0,0,19
161,703,male,0,8,0,28


In [9]:
pd.read_csv('../data/raw/labels/dev_split.csv')  

Unnamed: 0,Participant_ID,Gender,PHQ_Binary,PHQ_Score,PCL-C (PTSD),PTSD Severity
0,300,male,0,2,0,25
1,301,male,0,3,0,17
2,306,female,0,0,0,21
3,317,male,0,8,1,51
4,320,female,0,11,1,64
5,321,female,1,20,1,62
6,331,male,0,8,1,61
7,334,male,0,5,0,32
8,336,male,0,7,0,29
9,343,male,0,9,0,26


In [10]:
pd.read_csv('../data/raw/labels/test_split.csv') 

Unnamed: 0,Participant_ID,Gender,PHQ_Binary,PHQ_Score,PCL-C (PTSD),PTSD Severity
0,600,female,0,5,0,23.0
1,602,female,1,13,1,67.0
2,604,male,1,12,0,30.0
3,605,male,0,2,0,23.0
4,606,female,0,5,0,46.0
5,607,female,0,7,0,29.0
6,609,male,0,0,0,19.0
7,615,male,0,3,0,22.0
8,618,male,0,4,0,23.0
9,619,female,0,6,0,37.0


In [11]:
# todo iterate over all transcripts, look for missing data
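
One possible way to approach the TODO above is to count missing values per column in every transcript file. This is a minimal sketch, assuming all files follow the `*_Transcript.csv` naming and the column layout seen in the two transcripts loaded earlier; the function name `missing_data_report` and the `EXPECTED_COLUMNS` list are hypothetical.

```python
from pathlib import Path

import pandas as pd

# Columns observed in the transcripts inspected above (assumed to hold for all files).
EXPECTED_COLUMNS = ["Start_Time", "End_Time", "Text", "Confidence"]


def missing_data_report(transcript_dir) -> pd.DataFrame:
    """Summarize missing values for every *_Transcript.csv in a directory."""
    rows = []
    for path in sorted(Path(transcript_dir).glob("*_Transcript.csv")):
        df = pd.read_csv(path)
        # reindex adds any absent expected column as all-NaN,
        # so a missing column is reported as fully missing.
        missing = df.reindex(columns=EXPECTED_COLUMNS).isna().sum()
        rows.append(
            {
                "file": path.name,
                "n_rows": len(df),
                **missing.add_prefix("missing_").to_dict(),
            }
        )
    return pd.DataFrame(rows)
```

It could be called as `missing_data_report('../data/raw/transcripts')`. Note that `pd.read_csv` treats strings such as "NA" and "null" as missing by default, so a spoken word transcribed that way would be counted as a missing value unless `keep_default_na=False` is passed.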