# Example Pre-processing of Trips

This notebook demonstrates pre-processing of the trips. The raw files in the data directory are not in a format suitable for analysis, so each trip is cleaned first.

Preprocessing is encapsulated within two classes: `cats_analysis.io.CleanTrip` and `cats_analysis.io.TripSummaryStatistics`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

In [2]:
from cats_analysis.io import (read_trip_file_names,
                             CleanTrip,
                             TripSummaryStatistics)

The raw data are provided in time series format within **.csv** files. The function `cats_analysis.io.read_trip_file_names` returns a `list` of the CSV file names found in the directory passed as its parameter.

In [3]:
files = read_trip_file_names('/home/tom/Documents/code/cats_data')

For example, the first five files in the directory are:

In [4]:
files[:5]

['/home/tom/Documents/code/cats_data/Data-f12be33-Cats1-30845.csv',
 '/home/tom/Documents/code/cats_data/Data-ea2a232-Cats2-30612.csv',
 '/home/tom/Documents/code/cats_data/Data-53e8a28-Cats1-30087.csv',
 '/home/tom/Documents/code/cats_data/Data-e423bc1-Cats5-31014.csv',
 '/home/tom/Documents/code/cats_data/Data-d0d7e7a-Cats5-31122.csv']
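The implementation of `read_trip_file_names` is not shown in this notebook. As a rough sketch of what a directory listing of this kind might look like, the standard library is enough. The function name `list_csv_files` and the file names below are hypothetical, and the sorting is a choice made for this sketch (the output above is not sorted).

```python
from pathlib import Path
import tempfile


def list_csv_files(directory):
    """Return the paths of all .csv files directly inside `directory` as strings.

    Hypothetical stand-in, not the cats_analysis implementation.
    """
    return sorted(str(p) for p in Path(directory).glob('*.csv'))


# Demonstrate on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['trip_b.csv', 'trip_a.csv', 'notes.txt']:
        (Path(tmp) / name).touch()
    example_files = list_csv_files(tmp)
    example_names = [Path(f).name for f in example_files]

print(example_names)  # the .txt file is ignored
```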

## CleanTrip

In [5]:
# Some example files (index into `files`):
#  1 - multiple measures okay (might have different fields from others...)
#  5 - lots of missing data
#  6 - HR only
# 25 - a nice example
# 32 - contains 'invalid date'

ct = CleanTrip(filepath=files[1], wave_features=['mean', 'std', 'max'])

Run the `CleanTrip.clean()` method to execute the preprocessing.  

In [6]:
ct.clean()

Use the `time_series` property to view the cleaned time series as a `pandas.DataFrame`.

In [18]:
ct.time_series.head()

| timestamp | merged_n | type | hr_0002-4182 | spo2_0002-4bb8 | nbps_0002-4a05 | nbpd_0002-4a06 | nbpm_0002-4a07 | abps_0002-4a15 | abpd_0002-4a16 | abpm_0002-4a17 | ... | wcvp_max | wcoo_mean | wcoo_std | wcoo_max | wpleth_mean | wpleth_std | wpleth_max | wresp_mean | wresp_std | wresp_max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2016-10-07 16:37:27 | 1 | {9} |  |  |  |  |  |  |  |  | ... |  |  |  |  | 0.484108 | 0.066486 | 0.613919 |  |  |  |
| 2016-10-07 16:37:28 | 2 | {9, 6} | 104.0 | 100.0 |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 2016-10-07 16:37:29 | 2 | {9, 6} | 104.0 | 100.0 |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 2016-10-07 16:37:30 | 2 | {9, 6} | 104.0 | 100.0 |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 2016-10-07 16:37:31 | 2 | {9, 6} | 104.0 | 100.0 |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
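The column names `wpleth_mean`, `wpleth_std` and `wpleth_max` match the `wave_features` passed to `CleanTrip`, which suggests that each waveform is summarised once per timestamp by the requested statistics. How `CleanTrip` does this internally is not shown here. The following is only a sketch of the idea using plain pandas on a synthetic waveform, where the 4 Hz sampling rate and the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic waveform: 3 seconds sampled at 4 Hz (values are made up).
idx = pd.date_range('2016-10-07 16:37:27', periods=12, freq='250ms')
wave = pd.Series(np.arange(12, dtype=float), index=idx, name='wpleth')

# Collapse to one row per second, with one column per requested statistic.
wave_features = ['mean', 'std', 'max']
features = wave.resample('1s').agg(wave_features)

# Prefix each statistic with the waveform name, e.g. 'wpleth_mean'.
features.columns = [f'{wave.name}_{stat}' for stat in features.columns]
print(features)
```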


## TripSummaryStatistics

Once a trip has been cleaned, it can be passed to a `TripSummaryStatistics` object. This object summarises the trip and can resample it to a different time frequency.

Instantiate a new `TripSummaryStatistics` object using a `CleanTrip` as a parameter.

In [29]:
trip_summary = TripSummaryStatistics(clean_trip=ct)

In [30]:
trip_summary.calculate(resample='1s')
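What `calculate` does internally with the `resample` argument is not shown in this notebook. As a general illustration of what resampling a time series to a coarser frequency means in pandas, here is a minimal sketch on a synthetic heart-rate series (the values and the 2-second rule are made up for the example).

```python
import numpy as np
import pandas as pd

# Synthetic 1 Hz heart-rate series containing one missing value.
idx = pd.date_range('2016-10-07 16:37:28', periods=6, freq='1s')
hr = pd.Series([104.0, 104.0, np.nan, 106.0, 108.0, 110.0],
               index=idx, name='hr')

# Downsample to 2-second bins; mean() skips NaN values within each bin.
hr_2s = hr.resample('2s').mean()
print(hr_2s)
```

A rule such as `'1s'` keeps one row per second, while larger rules such as `'15s'` or `'60s'` aggregate several observations into each row.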

By default, the trip duration is returned as a `pandas.Timedelta`.

In [31]:
trip_summary.duration

Timedelta('0 days 01:03:43')

To convert to whole minutes, use the following expression. Note that `int` truncates, so the remaining seconds are dropped.

In [27]:
int(trip_summary.duration.total_seconds() / 60)

63
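Because truncation discards up to 59 seconds, it can be worth choosing the conversion deliberately. The sketch below reuses the duration printed above as a literal and compares truncation, rounding and floor division, with variable names chosen only for this example.

```python
import pandas as pd

duration = pd.Timedelta('0 days 01:03:43')  # the duration printed above

minutes_exact = duration.total_seconds() / 60        # fractional minutes
minutes_truncated = int(minutes_exact)               # drops the 43 seconds
minutes_rounded = round(minutes_exact)               # nearest whole minute
whole_minutes = duration // pd.Timedelta(minutes=1)  # floor division on Timedeltas

print(minutes_truncated, minutes_rounded, whole_minutes)  # 63 64 63
```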

A key property is `TripSummaryStatistics.summary_table`. This returns a `pandas.DataFrame` containing summary statistics for each time series within the trip.

In [28]:
trip_summary.summary_table

|  | per_missing | mean | std | min | max | median | iqr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| merged_n | 0.653766 | 1.970519 | 0.17378 | 1.0 | 3.0 | 2.0 | 0.0 |
| hr_0002-4182 | 24.529289 | 104.061677 | 3.970211 | 96.0 | 112.0 | 103.0 | 7.0 |
| spo2_0002-4bb8 | 24.529289 | 99.967221 | 0.259542 | 95.9 | 100.0 | 100.0 | 0.0 |
| nbps_0002-4a05 | 100.0 |  |  |  |  |  |  |
| nbpd_0002-4a06 | 100.0 |  |  |  |  |  |  |
| nbpm_0002-4a07 | 100.0 |  |  |  |  |  |  |
| abps_0002-4a15 | 100.0 |  |  |  |  |  |  |
| abpd_0002-4a16 | 100.0 |  |  |  |  |  |  |
| abpm_0002-4a17 | 100.0 |  |  |  |  |  |  |
| arts_0002-4a11 | 24.529289 | 99.721414 | 5.582531 | 88.0 | 119.0 | 99.0 | 6.0 |


In [None]:
ct.time_series.head()

In [None]:
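# Rebuild the summary statistics by hand from the cleaned time series.
# Keep numeric columns only: recent pandas versions raise an error on
# non-numeric columns such as `type` instead of silently dropping them.
df = ct.time_series.select_dtypes(include='number')
results = {}
# Percentage of rows with no value, per column
results['per_missing'] = (1 - df.count() / df.shape[0]) * 100
results['mean'] = df.mean()
results['std'] = df.std()
results['min'] = df.min()
results['max'] = df.max()
results['median'] = df.quantile(q=0.5)
# Interquartile range: 75th percentile minus 25th percentile
results['iqr'] = df.quantile(q=0.75) - df.quantile(q=0.25)
# One row per time series, one column per statistic
df_summary = pd.DataFrame(results)
df_summary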

In [None]:
ct.time_series['hr_0002-4182'].plot(figsize=(12,8))

In [None]:
r_ts = ct.resample(rule='15s', interp_missing=True)
r_ts['wcoo_mean'].plot.line(figsize=(12,8))
#r_ts['wcoo_std'].plot.line(figsize=(12,8))
#r_ts['wcoo_max'].plot.line(figsize=(12,8))
#r_ts['wcoo_min'].plot.line(figsize=(12,8))

In [None]:
r_ts = ct.resample(rule='60s', interp_missing=False)
r_ts['hr_0002-4182'].plot.line(figsize=(12,8))
#r_ts['abps_0002-4a15'].plot.line(figsize=(12,8))
#r_ts['nbpm_0002-4a07'].plot.line(figsize=(12,8))