# Exploratory Data Analysis

## Dataset Load
Below are the paths to the dataset's zip file and to the directory into which its files will be extracted:

In [1]:
# change these paths to match your local setup
dset_zip_path = "D:\\_coding\\_wd\\_python\\uni\\Behavioral-Context-Recognition\\dataset\\ExtraSensory.per_uuid_features_labels.zip"
dset_final_path = "D:\\_coding\\_wd\\_python\\uni\\Behavioral-Context-Recognition\\dataset\\unzipped_dataset"

From the main zip file <code>dset_zip_path</code>, we extract the inner gz files
(each gz file contains the measurements of one subject). The gz files are extracted to the directory <code>dset_final_path</code>.

In [2]:
from zipfile import ZipFile

# extract every member of the archive (one gz file per subject) into dset_final_path
with ZipFile(dset_zip_path, 'r') as zipObject:
   zipObject.extractall(dset_final_path)
print('All the files are extracted')

All the files are extracted


Now we load the measurements from all files into a single <code>DataFrame</code>.
Since the data come from several gz files (one file per subject),
we tag every measurement time stamp with the id of the corresponding subject, using the file name as the id.

In [3]:
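import os
import gzip
import pandas as pd

df_list = []
for gzFile in os.listdir(dset_final_path):
   # build the path portably instead of hard-coding the Windows separator
   path_gz = os.path.join(dset_final_path, gzFile)
   with gzip.open(path_gz) as f:
      df = pd.read_csv(f, header=0)
      # tag every row with the file name, which identifies the subject
      df.insert(0, 'subj_id', gzFile)
      df_list.append(df)

# stack the per-subject frames into one DataFrame with a fresh index
df_dset = pd.concat(df_list, axis=0, ignore_index=True)
df_dset.head()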

Unnamed: 0,subj_id,timestamp,raw_acc:magnitude_stats:mean,raw_acc:magnitude_stats:std,raw_acc:magnitude_stats:moment3,raw_acc:magnitude_stats:moment4,raw_acc:magnitude_stats:percentile25,raw_acc:magnitude_stats:percentile50,raw_acc:magnitude_stats:percentile75,raw_acc:magnitude_stats:value_entropy,...,label:STAIRS_-_GOING_DOWN,label:ELEVATOR,label:OR_standing,label:AT_SCHOOL,label:PHONE_IN_HAND,label:PHONE_IN_BAG,label:PHONE_ON_TABLE,label:WITH_CO-WORKERS,label:WITH_FRIENDS,label_source
0,00EABED2-271D-49D8-B599-1D4A09240601.features_...,1444079161,0.996815,0.003529,-0.002786,0.006496,0.995203,0.996825,0.998502,1.748756,...,,,0.0,,,,1.0,1.0,,2
1,00EABED2-271D-49D8-B599-1D4A09240601.features_...,1444079221,0.996864,0.004172,-0.00311,0.00705,0.994957,0.996981,0.998766,1.935573,...,,,0.0,,,,1.0,1.0,,2
2,00EABED2-271D-49D8-B599-1D4A09240601.features_...,1444079281,0.996825,0.003667,0.003094,0.006076,0.994797,0.996614,0.998704,2.03178,...,,,0.0,,,,1.0,1.0,,2
3,00EABED2-271D-49D8-B599-1D4A09240601.features_...,1444079341,0.996874,0.003541,0.000626,0.006059,0.99505,0.996907,0.99869,1.865318,...,,,0.0,,,,1.0,1.0,,2
4,00EABED2-271D-49D8-B599-1D4A09240601.features_...,1444079431,0.997371,0.037653,0.043389,0.102332,0.995548,0.99686,0.998205,0.460806,...,,,0.0,,,,1.0,1.0,,2


Looking at the head of the DataFrame, we see that it has a total of 279 columns: 278 from the csv files plus the first one (<code>subj_id</code>), which we added to link each time stamp to its subject id.
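Because <code>subj_id</code> stores the whole file name, it can be convenient to shorten it to the subject's UUID. The snippet below is a minimal sketch of one way to do that, assuming the UUID is everything before the first dot in the file name. The helper <code>subject_uuid</code> and the suffix in the example name are hypothetical, not part of the dataset or the notebook.

```python
import os

def subject_uuid(file_name: str) -> str:
    """Return the part of a file name that precedes its first dot."""
    # Hypothetical helper: assumes the UUID itself contains no dots.
    return os.path.basename(file_name).split(".", 1)[0]

# Only the UUID prefix comes from the output above; the suffix is made up.
example_name = "00EABED2-271D-49D8-B599-1D4A09240601.some_suffix.csv.gz"
short_id = subject_uuid(example_name)
print(short_id)
```

In the loading loop, the same helper could be applied to <code>gzFile</code> before the value is inserted into the <code>subj_id</code> column.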

Now let's describe the <code>subj_id</code> column to verify that all subjects have been added to the DataFrame.

In [6]:
df_dset['subj_id'].describe()

count                                                377346
unique                                                   60
top       78A91A4E-4A51-4065-BDA7-94755F0BB3BB.features_...
freq                                                  11996
Name: subj_id, dtype: object

The <code>unique</code> row reports 60 distinct values in the <code>subj_id</code> column, which matches the 60 subjects in the dataset, so it's safe to say that all subjects' data have been correctly retrieved.
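The same check can be made explicit by comparing the subject ids in the DataFrame with the list of extracted files. The following is a minimal sketch on a toy DataFrame: the names <code>toy</code>, <code>expected_files</code>, <code>rows_per_subject</code> and <code>missing</code> are hypothetical stand-ins, with <code>expected_files</code> playing the role of <code>os.listdir(dset_final_path)</code>.

```python
import pandas as pd

# Toy stand-in for df_dset: three hypothetical subjects with 3, 2 and 1 rows.
toy = pd.DataFrame({
    "subj_id": ["A", "A", "A", "B", "B", "C"],
    "timestamp": [1, 2, 3, 1, 2, 1],
})
# Stand-in for the list of extracted gz files.
expected_files = ["A", "B", "C"]

# Number of measurement rows contributed by each subject.
rows_per_subject = toy["subj_id"].value_counts()

# Files that were extracted but never made it into the DataFrame.
missing = set(expected_files) - set(rows_per_subject.index)

print(rows_per_subject)
print("missing subjects:", missing)
```

An empty <code>missing</code> set means every extracted file contributed at least one row, and <code>rows_per_subject</code> additionally shows how unevenly the measurements are spread across subjects.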

## Identification of variables and data types
The general description of the dataset is accessible at the link http://extrasensory.ucsd.edu/data/primary_data_files/README.txt.

We start by considering one variable at a time, examining its time-series trend and its correlation with the labels.
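As a rough illustration of the correlation step, the sketch below computes the Pearson correlation between one feature and every label column of a synthetic DataFrame. All data and column names here are made up (they only mimic the <code>label:</code> prefix seen in the header above), and Pearson correlation is just one possible choice of measure, not necessarily the one this analysis will end up using.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic binary label and a feature whose mean shifts when the label is 1.
walking = rng.integers(0, 2, size=n).astype(float)
feature = 1.0 + 0.5 * walking + rng.normal(0.0, 0.1, size=n)

# A second label, the opposite of the first, with some missing values (NaN).
sitting = 1.0 - walking
sitting[:20] = np.nan

toy = pd.DataFrame({
    "feature:toy_mean": feature,      # hypothetical feature column
    "label:TOY_WALKING": walking,     # hypothetical label column
    "label:TOY_SITTING": sitting,     # hypothetical label column
})

# Select label columns by prefix; "label:" (with a colon) skips label_source.
label_cols = [c for c in toy.columns if c.startswith("label:")]

# Correlate each label column with the feature; NaN rows are ignored pairwise.
corr = toy[label_cols].corrwith(toy["feature:toy_mean"])
print(corr.sort_values(ascending=False))
```

On the real data, the same pattern would apply to <code>df_dset</code> with one of its feature columns in place of the toy one.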

### Watch Acceleration
---work in progress--