# DEAP Dataset Preliminary Exploration

In this notebook we explore the preprocessed Python files of the DEAP dataset and run some simple analyses.

## Participant Data

The participant ratings file contains all the participant video ratings collected during the experiment. It is available in OpenOffice Calc (participant_ratings.ods), Microsoft Excel (participant_ratings.xls), and comma-separated values (participant_ratings.csv) formats.

The start_time values were logged by the presentation software. Valence, arousal, dominance and liking were rated directly after each trial on a continuous 9-point scale using a standard mouse. SAM Mannequins were used to visualize the ratings for valence, arousal and dominance. For liking (i.e. how much did you like the video?), thumbs up and thumbs down icons were used. Familiarity was rated after the end of the experiment on a 5-point integer scale (from "never heard it before" to "listen to it regularly"). Familiarity ratings are unfortunately missing for participants 2, 15 and 23.

| Column Name    | Information                                                                                       |
|----------------|---------------------------------------------------------------------------------------------------|
| Participant_id | The unique id of the participant (1-32).                                                          |
| Trial          | The trial number (i.e. the presentation order).                                                   |
| Experiment_id  | The video id corresponding to the same column in  the video_list file.                            |
| Start_time     | The starting time of the trial video playback  in microseconds (relative to start of experiment). |
| Valence        | The valence rating (float between 1 and 9).                                                       |
| Arousal        | The arousal rating (float between 1 and 9).                                                       |
| Dominance      | The dominance rating (float between 1 and 9).                                                     |
| Liking         | The liking rating (float between 1 and 9).                                                        |
| Familiarity    | The familiarity rating (integer between 1 and 5).  Blank if missing.                              |

In [1]:
import pandas as pd

# Load the participant_ratings file and display the resulting table
participant_data = pd.read_csv("deap_data/metadata_csv/participant_ratings.csv")
participant_data

Unnamed: 0,Participant_id,Trial,Experiment_id,Start_time,Valence,Arousal,Dominance,Liking,Familiarity
0,1,1,5,1695918,6.96,3.92,7.19,6.05,4.0
1,1,2,18,2714905,7.23,7.15,6.94,8.01,4.0
2,1,3,4,3586768,4.94,6.01,6.12,8.06,4.0
3,1,4,24,4493800,7.04,7.09,8.01,8.22,4.0
4,1,5,20,5362005,8.26,7.91,7.19,8.13,1.0
5,1,6,31,6176062,3.03,8.14,2.86,8.04,1.0
6,1,7,40,7138735,5.10,7.12,6.17,5.97,3.0
7,1,8,39,8081417,3.24,6.18,7.87,6.15,1.0
8,1,9,13,8960934,1.95,3.12,2.87,6.18,1.0
9,1,10,33,9816492,3.81,3.85,4.78,5.13,1.0


In [13]:
# Each comparison needs its own parentheses because & binds more tightly than ==
participant_data.loc[(participant_data['Participant_id'] == 1) & (participant_data['Trial'] == 1)]


Unnamed: 0,Participant_id,Trial,Experiment_id,Start_time,Valence,Arousal,Dominance,Liking,Familiarity
0,1,1,5,1695918,6.96,3.92,7.19,6.05,4.0


The table has 40 rows for each of the 32 participants, one per trial, as expected. With these two datasets we can already do some simple analysis.
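The row counts can also be checked programmatically rather than read off the display. The sketch below does this on a small synthetic frame that only mimics the column names of the ratings file; the values, the three-participant size, and the names `demo`, `rows_per_participant` and `missing_ids` are assumptions made for illustration. With the real file, `participant_data` would take the place of `demo`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for participant_ratings.csv (illustrative values only):
# 3 participants x 40 trials, with Familiarity left blank for participant 2.
n_participants, n_trials = 3, 40
demo = pd.DataFrame({
    "Participant_id": np.repeat(np.arange(1, n_participants + 1), n_trials),
    "Trial": np.tile(np.arange(1, n_trials + 1), n_participants),
    "Familiarity": 3.0,
})
demo.loc[demo["Participant_id"] == 2, "Familiarity"] = np.nan

# Number of rows per participant; every entry should equal 40.
rows_per_participant = demo.groupby("Participant_id").size()
all_have_40 = bool((rows_per_participant == n_trials).all())

# Participants whose Familiarity ratings are missing for every trial.
all_missing = demo.groupby("Participant_id")["Familiarity"].apply(
    lambda s: s.isna().all()
)
missing_ids = all_missing[all_missing].index.tolist()

print(all_have_40)
print(missing_ids)
```

The same grouping also gives a quick way to confirm which participants lack familiarity ratings in the real table.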

In [2]:
# DEAP preprocessed data construction
# Let's get a brief overview of one piece of data.
# The data is split into 32 pieces, each with its own .dat file loadable via pickle.

import pickle

# encoding='latin1' is needed to read these Python 2 pickles under Python 3
with open('deap_data/data_preprocessed_python/s01.dat', 'rb') as f:
    x = pickle.load(f, encoding='latin1')
print(type(x))
print(x['labels'].shape)
print(x['data'].shape)

<class 'dict'>
(40, 4)
(40, 40, 8064)


## Preprocessed Data Formats

**The data looks as follows:**

| Array name | Array shape    | Array contents                                            |
|------------|----------------|-----------------------------------------------------------|
| data       | 40 x 40 x 8064 | video/trial x channel x data                              |
| labels     | 40 x 4         | video/trial x label (valence, arousal, dominance, liking) |
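To make these shapes concrete, here is a minimal sketch of how the arrays might be sliced. It uses zero-filled arrays with the documented shapes instead of real recordings. The 128 Hz sampling rate and the 32 EEG channels come from the preprocessing notes in this notebook. Treating the first 3 seconds of each trial as the pre-trial baseline is an assumption, suggested by the fact that 8064 samples at 128 Hz is 63 seconds and not 60.

```python
import numpy as np

SAMPLING_RATE_HZ = 128   # from the preprocessing notes
BASELINE_SECONDS = 3     # pre-trial baseline length
N_EEG_CHANNELS = 32      # the first 32 channels are EEG

# Synthetic stand-ins with the documented shapes (zeros, not real recordings).
data = np.zeros((40, 40, 8064), dtype=np.float32)
labels = np.zeros((40, 4), dtype=np.float32)

# 8064 samples at 128 Hz correspond to 63 seconds per trial.
trial_seconds = data.shape[2] / SAMPLING_RATE_HZ

# Split the channel axis into EEG and the remaining 8 channels.
eeg = data[:, :N_EEG_CHANNELS, :]
remaining = data[:, N_EEG_CHANNELS:, :]

# Assumption: the first 3 s of every trial hold the pre-trial baseline.
baseline_samples = BASELINE_SECONDS * SAMPLING_RATE_HZ
baseline = eeg[:, :, :baseline_samples]
stimulus = eeg[:, :, baseline_samples:]

# Label columns follow the order given in the table above.
valence, arousal, dominance, liking = labels.T

print(trial_seconds, eeg.shape, remaining.shape, stimulus.shape)
```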

**Furthermore, note that in this preprocessed dataset the following preprocessing steps were applied to the EEG data (the first 32 channels):**

1. The data was downsampled to 128Hz.
2. EOG artefacts were removed as in [1].
3. A bandpass frequency filter from 4.0-45.0Hz was applied.
4. The data was averaged to the common reference.
5. The EEG channels were reordered so that they all follow the Geneva order as above.
6. The data was segmented into 60 second trials and a 3 second pre-trial baseline removed.
7. The trials were reordered from presentation order to video (Experiment_id) order.

**The remaining 8 channels were preprocessed in the following way:**

1. The data was downsampled to 128Hz.
2. The data was segmented into 60 second trials and a 3 second pre-trial baseline removed.
3. The trials were reordered from presentation order to video (Experiment_id) order.
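Because the preprocessed trials are stored in video (Experiment_id) order while the ratings table is in presentation order, the ratings have to be re-sorted before they can be compared with `labels`. The following is a minimal sketch on synthetic ratings; the helper name `ratings_in_video_order` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

LABEL_COLUMNS = ["Valence", "Arousal", "Dominance", "Liking"]

def ratings_in_video_order(ratings, participant_id):
    """Return one participant's ratings as an array sorted by Experiment_id."""
    rows = ratings[ratings["Participant_id"] == participant_id]
    return rows.sort_values("Experiment_id")[LABEL_COLUMNS].to_numpy()

# Synthetic ratings for one participant: 40 videos shown in a shuffled order.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Participant_id": 1,
    "Trial": np.arange(1, 41),
    "Experiment_id": rng.permutation(np.arange(1, 41)),
})
for column in LABEL_COLUMNS:
    demo[column] = rng.uniform(1, 9, size=40).round(2)

aligned = ratings_in_video_order(demo, participant_id=1)

# Row i of `aligned` now belongs to the video with Experiment_id i + 1.
print(aligned.shape)
```

With the real data, `aligned` for participant 1 would be expected to match `x['labels']` from `s01.dat`, provided the label rows follow the same video ordering.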

We have written a script that combines all 32 sets into a single, easily usable dictionary.

In [3]:
# Load all 32 participants, accessible by number.
import pickle

with open('deap_data/data_preprocessed_python/all_32.dat', 'rb') as f:
    raw_data_dict = pickle.load(f, encoding='latin1')
print(type(raw_data_dict))

<class 'dict'>
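The code that produced `all_32.dat` is not shown here. As a rough sketch of how such a dictionary could be assembled, the hypothetical helper below loads files named `s01.dat`, `s02.dat`, and so on, and keys them by participant number. The integer keys and the helper name `load_all_participants` are assumptions, not the actual layout of `all_32.dat`. The demonstration writes two tiny fake files so that it runs without the dataset.

```python
import os
import pickle
import tempfile

import numpy as np

def load_all_participants(directory, n_participants=32):
    """Load sNN.dat files into a dict keyed by participant number (1-based)."""
    combined = {}
    for number in range(1, n_participants + 1):
        path = os.path.join(directory, "s{:02d}.dat".format(number))
        with open(path, "rb") as handle:
            # encoding="latin1" lets Python 3 read pickles written by Python 2.
            combined[number] = pickle.load(handle, encoding="latin1")
    return combined

# Demonstration on tiny fake files (shortened sample axis, all zeros).
with tempfile.TemporaryDirectory() as tmp:
    for number in (1, 2):
        fake = {"data": np.zeros((40, 40, 8)), "labels": np.zeros((40, 4))}
        with open(os.path.join(tmp, "s{:02d}.dat".format(number)), "wb") as handle:
            pickle.dump(fake, handle)
    demo_dict = load_all_participants(tmp, n_participants=2)

print(sorted(demo_dict))
print(demo_dict[1]["labels"].shape)
```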
