Author: Jakidxav </br></br>

Date: 04.24.2019 </br></br>

We have been given three distinct subsets of the DAIC-WOZ data: the training, development, and testing sets. The data were split this way for earlier analyses, but we want to reconstruct the full dataset from these subsets so that we can split it ourselves, in particular to evaluate a given model with k-fold cross-validation.</br></br>

We want to keep at least the following columns:
- `Participant_ID`, 
- `PHQ8_Binary` scores for classification problems, and
- the raw `PHQ8_Score` column for a regression analysis. 

We can also keep the `Gender` column for a different type of classification problem later on.
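
For context, the kind of split the rebuilt dataset is meant to support can be sketched as follows. This is only an illustration under stated assumptions: the toy frame, its values, and the helper `kfold_indices` are hypothetical and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rebuilt dataset; the values are made up for illustration.
toy = pd.DataFrame({
    'Participant_ID': range(300, 310),
    'PHQ8_Binary': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

def kfold_indices(n_rows, k, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Shuffle the row positions once, then cut them into k nearly equal folds
    folds = np.array_split(rng.permutation(n_rows), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(len(toy), k=5))
for train_idx, val_idx in splits:
    # Each row position is used for validation in exactly one fold
    print(len(train_idx), len(val_idx))
```

The folds here are not stratified by `PHQ8_Binary`; with an imbalanced label, a stratified splitter may be a better choice.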

In [1]:
import pandas as pd

In [2]:
#files on disk
train_file = 'train_split_Depression_AVEC2017.csv'
dev_file = 'dev_split_Depression_AVEC2017.csv'
test_file = 'full_test_split.csv'

In [3]:
#read in dataframes
train = pd.read_csv(train_file)
dev = pd.read_csv(dev_file)
test = pd.read_csv(test_file)

In [4]:
#per-item PHQ-8 columns to drop; Gender is kept, it might be interesting to examine later
to_drop = ['PHQ8_NoInterest', 'PHQ8_Depressed', 'PHQ8_Sleep', 'PHQ8_Tired',
       'PHQ8_Appetite', 'PHQ8_Failure', 'PHQ8_Concentrating', 'PHQ8_Moving']

In [5]:
#drop unwanted columns
#test set does not have these unwanted columns
train = train.drop(to_drop, axis=1)
dev = dev.drop(to_drop, axis=1)
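
An alternative to dropping the unwanted columns is to select the wanted ones by name. The sketch below uses a made-up toy frame with only two of the per-item columns; `toy_train` and `to_keep` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy frame with two of the per-item columns; the values are made up.
toy_train = pd.DataFrame({
    'Participant_ID': [303, 304],
    'PHQ8_Binary': [0, 0],
    'PHQ8_Score': [0, 6],
    'Gender': [0, 0],
    'PHQ8_NoInterest': [0, 0],
    'PHQ8_Sleep': [0, 1],
})

to_keep = ['Participant_ID', 'PHQ8_Binary', 'PHQ8_Score', 'Gender']

# Selecting by name raises a KeyError if an expected column is missing,
# which makes a schema mismatch visible early
kept = toy_train[to_keep]
print(list(kept.columns))
```

Selecting by name also fixes the column order, which matters if columns are later matched by position.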

In [6]:
#should have a total of 189 participants
print(len(train)+len(dev)+len(test))

189


In [7]:
#notice that the test set names two columns differently (PHQ_ instead of PHQ8_)
print(train.columns)
print(dev.columns)
print(test.columns)

Index(['Participant_ID', 'PHQ8_Binary', 'PHQ8_Score', 'Gender'], dtype='object')
Index(['Participant_ID', 'PHQ8_Binary', 'PHQ8_Score', 'Gender'], dtype='object')
Index(['Participant_ID', 'PHQ_Binary', 'PHQ_Score', 'Gender'], dtype='object')


In [8]:
#rename the test set columns to match train/dev (relies on the column order being identical)
test.columns = train.columns
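
Assigning `train.columns` works here because the column order is the same in both frames. A variant that does not depend on order is to rename only the two differing columns with an explicit mapping. The sketch below uses a made-up toy frame; `toy_test` and `rename_map` are illustrative names.

```python
import pandas as pd

# Toy test frame using the column names printed above; the values are made up.
toy_test = pd.DataFrame({
    'Participant_ID': [300, 301],
    'PHQ_Binary': [0, 0],
    'PHQ_Score': [2, 3],
    'Gender': [1, 1],
})

# Map only the columns whose names differ from the train/dev sets
rename_map = {'PHQ_Binary': 'PHQ8_Binary', 'PHQ_Score': 'PHQ8_Score'}
renamed = toy_test.rename(columns=rename_map)
print(list(renamed.columns))
```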

In [9]:
#create a new dataset containing all of the data
#concatenate along rows
full_dataset = pd.concat([train, dev, test])

In [10]:
#check to see that all data is accounted for
print(len(full_dataset))

189
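
A matching row count does not rule out a participant appearing in two subsets. A couple of extra sanity checks could look like the sketch below, which uses small made-up frames (`toy_train`, `toy_dev`, `toy_test`) in place of the real data.

```python
import pandas as pd

# Small made-up frames standing in for the three subsets
cols = ['Participant_ID', 'PHQ8_Binary', 'PHQ8_Score', 'Gender']
toy_train = pd.DataFrame([[303, 0, 0, 0], [304, 0, 6, 0]], columns=cols)
toy_dev = pd.DataFrame([[302, 0, 4, 1]], columns=cols)
toy_test = pd.DataFrame([[300, 0, 2, 1], [301, 0, 3, 1]], columns=cols)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat([toy_train, toy_dev, toy_test], ignore_index=True)

# Every participant should appear exactly once
ids_unique = combined['Participant_ID'].is_unique

# Count missing values across all columns
n_missing = int(combined.isna().sum().sum())

print(ids_unique, n_missing)
```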


In [11]:
#sort dataframe by participant id number
full_dataset = full_dataset.sort_values(['Participant_ID'], ascending=True)

#reset index
full_dataset = full_dataset.reset_index(drop=True)

#save to a CSV file
full_dataset.to_csv('full_dataset.csv')
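
Called without `index=False`, `to_csv` also writes the row index as an extra unnamed first column, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted, one option is sketched below with a made-up toy frame and an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Toy frame standing in for full_dataset; the values are made up.
toy = pd.DataFrame({
    'Participant_ID': [300, 301, 302],
    'PHQ8_Binary': [0, 0, 0],
    'PHQ8_Score': [2, 3, 4],
    'Gender': [1, 1, 1],
})

# Write without the index, then read the CSV text back in
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
```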