In [1]:
#import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import pyarrow.feather as feather
#NOTE: this import kills the kernel in an environment built from scratch, even with the
#same pyarrow version. It works in the base environment kernel and in a copy of the
#base environment.

# Demographic Data
### Load data from demographic file

In [2]:
#import demographic data held in a single csv: general info about participants and their answers to the provided questionnaire
demo_file = r'Source_Data\Micro_Motion_2012\demographics\self_reportNM12.csv'
demographics = pd.read_csv(demo_file)

### Investigate demographic information

In [3]:
demographics.head()

Unnamed: 0,Group,Subject,Age,Sex,Height (m),Music listening hours/week,Perform/produce/compose music hours/week,Dance hours/week,Exercise (no dance) hours/week,Tiresome experience (1-5),Experience of motion (1-5),Experience of moving more to music (1-5),Eyes open?,Locked knees?,Mean QoM,Mean QoM w/oM,Mean QoM w M,NoMus-Mus Diff
0,A,1,23,M,1.72,5,30.0,1.0,7.0,3,3.0,2,1.0,1.0,8.271082,7.955534,8.586613,0.631079
1,A,2,24,M,1.67,10,10.0,2.0,10.0,1,2.0,2,1.0,1.0,11.224096,10.627763,11.820396,1.192633
2,A,3,27,F,1.63,14,1.0,4.0,3.0,4,3.0,4,0.0,0.5,6.44135,6.063694,6.818985,0.755291
3,A,4,27,M,1.75,5,20.0,2.0,2.0,4,5.0,3,1.0,1.0,5.216179,5.289182,5.143181,-0.146001
4,A,9,24,F,1.64,15,0.0,2.0,6.0,2,2.0,1,1.0,1.0,5.15187,4.940202,5.363525,0.423323


In [4]:
demographics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91 entries, 0 to 90
Data columns (total 18 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Group                                     91 non-null     object 
 1   Subject                                   91 non-null     int64  
 2   Age                                       91 non-null     int64  
 3   Sex                                       90 non-null     object 
 4   Height (m)                                91 non-null     float64
 5   Music listening hours/week                91 non-null     int64  
 6   Perform/produce/compose music hours/week  91 non-null     float64
 7   Dance hours/week                          91 non-null     float64
 8   Exercise (no dance) hours/week            91 non-null     float64
 9   Tiresome experience (1-5)                 91 non-null     int64  
 10  Experience of motion (1-5)              

### Remove and/or modify NaN values

In [5]:
demographics[demographics['Eyes open?'].isnull()]

Unnamed: 0,Group,Subject,Age,Sex,Height (m),Music listening hours/week,Perform/produce/compose music hours/week,Dance hours/week,Exercise (no dance) hours/week,Tiresome experience (1-5),Experience of motion (1-5),Experience of moving more to music (1-5),Eyes open?,Locked knees?,Mean QoM,Mean QoM w/oM,Mean QoM w M,NoMus-Mus Diff
74,P,1,24,M,1.72,35,40.0,0.0,1.0,2,4.0,3,,,9.022333,8.926027,9.118634,0.192607
75,P,2,23,F,1.64,6,3.0,2.0,4.0,3,4.0,3,,,5.662621,5.854347,5.470907,-0.38344
76,P,3,51,F,1.52,10,10.0,0.0,0.5,1,2.0,2,,,4.594001,4.438988,4.749006,0.310018
77,P,4,24,F,1.73,3,0.0,1.0,0.5,3,4.0,3,,,7.74067,7.749527,7.731813,-0.017714
78,P,5,21,M,1.76,3,10.0,0.5,3.0,2,3.0,2,,,7.14062,7.147052,7.134189,-0.012864
79,P,6,36,F,1.61,4,2.0,0.0,0.0,5,4.0,4,,,4.301132,4.151781,4.450474,0.298692
80,P,7,24,M,1.73,3,3.0,0.5,0.5,4,3.0,2,,,5.890001,5.723056,6.056937,0.333881
81,P,8,24,F,1.71,30,30.0,4.0,4.0,3,2.0,4,,,6.004398,6.105638,5.903164,-0.202474
82,P,9,22,F,1.66,20,6.0,3.0,1.0,2,3.0,2,,,5.059475,5.106874,5.012078,-0.094797
83,P,10,22,M,1.89,7,7.0,0.0,4.5,5,3.0,1,,,10.248176,9.667558,10.828762,1.161204


The data seems relatively clean. Most columns are numeric, with the exception of Group and Sex. One participant didn't provide Sex info; we can drop this row without a considerable effect on the data. There are 17 participants who have no value for 'Eyes open?' and 'Locked knees?'. This is the case for every participant in the P group. For now we will fill both columns with 0.5, the value that represents "both". If these two columns turn out to be strong indicators of movement, we can revisit this decision and review its effect more closely. One participant also didn't give an 'Experience of motion' answer; we will fill it with the column mean.
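Before filling anything, it can help to confirm where the missing values actually sit. The following is a minimal sketch on a toy table, assuming the same column names as the demographics file; the values are made up for illustration and are not taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the demographics table (illustrative values only)
toy = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'P', 'P'],
    'Sex': ['M', 'F', np.nan, 'F', 'M'],
    'Eyes open?': [1.0, 0.0, 1.0, np.nan, np.nan],
    'Locked knees?': [1.0, 0.5, 1.0, np.nan, np.nan],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()

# Count missing values per group for the two posture columns
posture_cols = ['Eyes open?', 'Locked knees?']
missing_by_group = toy[posture_cols].isna().groupby(toy['Group']).sum()

print(missing_per_column)
print(missing_by_group)
```

Run against the real `demographics` frame, the per-group count should show whether the gaps in the two posture columns are confined to group P.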

In [6]:
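#Fill NaN values under 'Eyes open?' and 'Locked knees?' with 0.5
#assign the result back instead of calling fillna(inplace=True) on a column selection,
#which newer pandas versions warn about and which may not update the dataframe
demographics['Eyes open?'] = demographics['Eyes open?'].fillna(0.5)
demographics['Locked knees?'] = demographics['Locked knees?'].fillna(0.5)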
print(demographics[demographics['Group']=='P'].T)

                                                74        75        76  \
Group                                            P         P         P   
Subject                                          1         2         3   
Age                                             24        23        51   
Sex                                              M         F         F   
Height (m)                                    1.72      1.64      1.52   
Music listening hours/week                      35         6        10   
Perform/produce/compose music hours/week      40.0       3.0      10.0   
Dance hours/week                               0.0       2.0       0.0   
Exercise (no dance) hours/week                 1.0       4.0       0.5   
Tiresome experience (1-5)                        2         3         1   
Experience of motion (1-5)                     4.0       4.0       2.0   
Experience of moving more to music (1-5)         3         3         2   
Eyes open?                            

The column names are rather long. We are going to replace some of them with shorter versions.

In [7]:
#rename columns with shorter names
new_columns = {'Height (m)':'Height',
               'Music listening hours/week':'Listen', 
               'Perform/produce/compose music hours/week':'Produce', 
               'Dance hours/week':'Dance', 
               'Exercise (no dance) hours/week':'Exercise', 
               'Tiresome experience (1-5)':'Tiresome', 
               'Experience of motion (1-5)':'Exper_silent', 
               'Experience of moving more to music (1-5)':'Exper_music', 
               'Eyes open?':'Eyes', 
               'Locked knees?':'Knees'}
demos = demographics.rename(columns=new_columns)

In [8]:
#drop row with NaN in sex column
#print(demos.shape)
#demos.dropna(subset=['Sex'], inplace=True)
#print(demos.shape)

In [9]:
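#Fill the single missing datapoint in Exper_silent with the column mean
#assign the result back instead of calling fillna(inplace=True) on a column selection
demos['Exper_silent'] = demos['Exper_silent'].fillna(demos['Exper_silent'].mean())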
print(demos.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91 entries, 0 to 90
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Group           91 non-null     object 
 1   Subject         91 non-null     int64  
 2   Age             91 non-null     int64  
 3   Sex             90 non-null     object 
 4   Height          91 non-null     float64
 5   Listen          91 non-null     int64  
 6   Produce         91 non-null     float64
 7   Dance           91 non-null     float64
 8   Exercise        91 non-null     float64
 9   Tiresome        91 non-null     int64  
 10  Exper_silent    91 non-null     float64
 11  Exper_music     91 non-null     int64  
 12  Eyes            91 non-null     float64
 13  Knees           91 non-null     float64
 14  Mean QoM        91 non-null     float64
 15  Mean QoM w/oM   91 non-null     float64
 16  Mean QoM w M    91 non-null     float64
 17  NoMus-Mus Diff  91 non-null     float

### Create global participant ID column
It will be simpler to have a single column that identifies individual participants instead of two, 'Group' and 'Subject'. As all groups are understood to have had the same experience and not to have affected each other, we will create a new column that simply numbers the participants 1 to 91, one per row. This also removes any confusion about the few participants who didn't provide a survey and the resulting gaps in numbering within groups. We will need to add the same column to the motion data in order to line the two up properly, and will leave the 'Group' and 'Subject' columns in place for now in case they are useful for combining on.

In [10]:
#add column for Participant ID (PID), numbering rows 1..n (n = 91 here)
demos['PID'] = list(range(1, len(demos) + 1))
demos.head()

Unnamed: 0,Group,Subject,Age,Sex,Height,Listen,Produce,Dance,Exercise,Tiresome,Exper_silent,Exper_music,Eyes,Knees,Mean QoM,Mean QoM w/oM,Mean QoM w M,NoMus-Mus Diff,PID
0,A,1,23,M,1.72,5,30.0,1.0,7.0,3,3.0,2,1.0,1.0,8.271082,7.955534,8.586613,0.631079,1
1,A,2,24,M,1.67,10,10.0,2.0,10.0,1,2.0,2,1.0,1.0,11.224096,10.627763,11.820396,1.192633,2
2,A,3,27,F,1.63,14,1.0,4.0,3.0,4,3.0,4,0.0,0.5,6.44135,6.063694,6.818985,0.755291,3
3,A,4,27,M,1.75,5,20.0,2.0,2.0,4,5.0,3,1.0,1.0,5.216179,5.289182,5.143181,-0.146001,4
4,A,9,24,F,1.64,15,0.0,2.0,6.0,2,2.0,1,1.0,1.0,5.15187,4.940202,5.363525,0.423323,5
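One way to carry the PID column over to the motion data is to merge on the 'Group' and 'Subject' columns that were left in place. This is a sketch under assumptions: `toy_motion` and its 'QoM' column are hypothetical stand-ins, since the layout of the motion data is not shown here.

```python
import pandas as pd

# Toy demographics table with the PID column already added
toy_demos = pd.DataFrame({
    'Group': ['A', 'A', 'B'],
    'Subject': [1, 2, 1],
    'PID': [1, 2, 3],
})

# Hypothetical motion table keyed by Group and Subject (column names are assumed)
toy_motion = pd.DataFrame({
    'Group': ['A', 'B', 'A'],
    'Subject': [2, 1, 1],
    'QoM': [11.2, 6.4, 8.3],
})

# Left merge keeps every motion row; validate raises if a Group/Subject pair
# appears more than once in the demographics table
motion_with_pid = toy_motion.merge(
    toy_demos[['Group', 'Subject', 'PID']],
    on=['Group', 'Subject'],
    how='left',
    validate='many_to_one',
)
print(motion_with_pid)
```

A left merge is used so that motion rows without a matching survey stay in the table with a NaN PID instead of silently disappearing.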


In [11]:
demos['Group'].unique()

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'P'], dtype=object)

In [12]:
#export demographic dataframe (comment this line out when not needed, to avoid overwriting the saved file unintentionally)
demos.to_pickle('DFs/demos_clean.pkl')
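As an alternative to commenting the export in and out by hand, the write can be guarded so that an existing file is only replaced on request. The helper name `save_pickle_once` and its `overwrite` flag are hypothetical and only illustrate the idea.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_pickle_once(df, path, overwrite=False):
    """Write df to path, refusing to replace an existing file unless overwrite=True."""
    path = Path(path)
    if path.exists() and not overwrite:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(path)
    return True


# Demonstrate in a temporary directory so no real file is touched
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'DFs' / 'demos_clean.pkl'
    toy = pd.DataFrame({'PID': [1, 2, 3]})
    first_write = save_pickle_once(toy, target)   # file does not exist yet, so it is written
    second_write = save_pickle_once(toy, target)  # file exists, so the write is skipped
    reloaded = pd.read_pickle(target)
```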

Note that Group B, Subject 3 has NaN under 'Sex'. This row was left in to simplify the PID import sequence. There are two PIDs in total with NaN in the 'Sex' column, and we will eliminate them before the data enters the model. If 'Sex' turns out to be a weak predictor, we will add them back in, using M for one and F for the other, in order to keep their motion data. If 'Sex' is a strong variable they will simply be dropped, as we don't want to sway the model with incorrect data.
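The two options described above can be sketched as follows. The toy table and its values are illustrative only, and the sketch assumes exactly two participants with a missing 'Sex' value, as stated in the note.

```python
import numpy as np
import pandas as pd

# Toy table with two missing 'Sex' values (illustrative data only)
toy = pd.DataFrame({
    'PID': [1, 2, 3, 4],
    'Sex': ['M', np.nan, 'F', np.nan],
    'Mean QoM': [8.3, 6.1, 5.2, 7.0],
})

# Option 1: drop the participants with no recorded sex before modelling
dropped = toy.dropna(subset=['Sex'])

# Option 2: keep them, assigning M to one and F to the other
# (assumes exactly two missing rows, so the two labels line up with them)
filled = toy.copy()
missing_idx = filled.index[filled['Sex'].isna()]
filled.loc[missing_idx, 'Sex'] = ['M', 'F']

print(dropped)
print(filled)
```

Working on a copy in option 2 keeps the original frame untouched, so either choice can be swapped in later without reloading the data.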


# Summary
The demographic data was read in and cleaned so that each participant's demographic data can be viewed alongside their motion data. The next step is to bring in the sound data and clean it as needed. The EDA and feature engineering process will start with a single participant and then be expanded to the other participants.