# Outlier Removal

This notebook removes rows that were recorded with Apple Watch sensors instead of the Empatica E4 sensor, so that the remaining data share a consistent sampling rate and need no extra interpolation.

__INPUT: Aggregated Patient .csv w/ Activity Label__ (30_all_partic_aggregated_with_activity.csv)

__OUTPUT:__
1. __.csv without Apple Watch Data: label rounds 1 & 2 separated__ (30_data_no_aw.csv)
2. __.csv without Apple Watch Data: label rounds 1 & 2 combined__ (30_no_apple_watch_combined_rounds.csv)

## Imports

In [12]:
import pandas as pd
import warnings
import numpy as np
warnings.simplefilter("ignore")

## Read in Data

In [19]:
df = pd.read_csv("../../20_intermediate_files/30_all_partic_aggregated_with_activity.csv")

In [20]:
df.head()

| | Time | ACC1 | ACC2 | ACC3 | TEMP | EDA | BVP | HR | Subject_ID | Activity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-07-17 11:52:00.000 | 41.0 | 27.2 | 40.0 | 32.39 | 0.275354 | 15.25 | 78.98 | 19-001 | Baseline 1 |
| 1 | 2019-07-17 11:52:00.250 | 41.0 | 27.3 | 40.0 | 32.39 | 0.276634 | -12.75 | 78.835 | 19-001 | Baseline 1 |
| 2 | 2019-07-17 11:52:00.500 | 41.0 | 27.4 | 40.0 | 32.39 | 0.270231 | -42.99 | 78.69 | 19-001 | Baseline 1 |
| 3 | 2019-07-17 11:52:00.750 | 41.0 | 27.5 | 40.0 | 32.39 | 0.270231 | 18.39 | 78.545 | 19-001 | Baseline 1 |
| 4 | 2019-07-17 11:52:01.000 | 41.0 | 27.6 | 40.0 | 32.34 | 0.26895 | 13.61 | 78.4 | 19-001 | Baseline 1 |


In [11]:
df.shape

(296128, 10)

## Outlier Removal Procedure

### Create Round Feature from Activity Round Number

In [15]:
df['Round'] = df['Activity'].str.strip().str[-1]
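
The round extraction above can be illustrated on a couple of labels. This is a minimal sketch: `Baseline 1` appears in the data preview, while `Baseline 2 ` (with a trailing space) is a made-up value used only to show why `.str.strip()` comes first. It assumes every activity label ends in a single-digit round number.

```python
import pandas as pd

# "Baseline 1" is taken from the preview; "Baseline 2 " is a hypothetical label
# with trailing whitespace, included to show the effect of .str.strip().
labels = pd.Series(["Baseline 1", "Baseline 2 "])

# Strip surrounding whitespace, then take the last character as the round.
rounds = labels.str.strip().str[-1]

print(rounds.tolist())  # ['1', '2']
```

Note that the resulting `Round` values are strings, which is why the later filter compares against `'2'` rather than the integer `2`.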

### Remove Data Taken From Apple Watch 

Subject and round combinations recorded with the Apple Watch:

| Subject ID  | Round | 
|---|---|
| 19-021  | 2 |
| 19-030  | 2  |
| 19-047  | 2  |
| 19-056  | 2  |

In [16]:
# Subjects whose Round 2 data were recorded with the Apple Watch
aw_subjects = ['19-021', '19-030', '19-047', '19-056']
is_aw_round2 = df['Subject_ID'].isin(aw_subjects) & (df['Round'] == '2')
# Keep every other row; .copy() lets later cells add columns without a SettingWithCopyWarning
df_round2 = df.loc[~is_aw_round2].copy()
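
To illustrate what this filter keeps and drops, the sketch below applies the same rule to a small, made-up frame. The helper `drop_subject_rounds` and the toy rows are hypothetical and not part of the original notebook; only the column names `Subject_ID` and `Round` come from the data.

```python
import pandas as pd

def drop_subject_rounds(frame, pairs):
    """Return a copy of `frame` without rows matching any (Subject_ID, Round) pair.

    Hypothetical helper for illustration only.
    """
    excluded = set(pairs)
    keep = [
        (subject, rnd) not in excluded
        for subject, rnd in zip(frame["Subject_ID"], frame["Round"])
    ]
    return frame.loc[keep].copy()

# Toy data: one subject recorded entirely with the E4, one with Apple Watch in Round 2.
toy = pd.DataFrame({
    "Subject_ID": ["19-001", "19-001", "19-021", "19-021"],
    "Round": ["1", "2", "1", "2"],
})

kept = drop_subject_rounds(toy, [("19-021", "2")])
print(kept)
```

Only the Round 2 row for `19-021` is removed; that subject's Round 1 row and both rows for `19-001` remain.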

### Create Magnitude Feature from ACC X,Y,Z Components

In [17]:
df_round2['Magnitude'] = np.sqrt(df_round2['ACC1']**2 + df_round2['ACC2']**2 + df_round2['ACC3']**2)
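
As a worked example of the magnitude feature, the sketch below computes the Euclidean norm for the first row of the preview (ACC1 = 41.0, ACC2 = 27.2, ACC3 = 40.0) and compares it with `np.linalg.norm`, which is an equivalent formulation rather than the notebook's own call.

```python
import numpy as np

# Accelerometer components from the first previewed row.
acc = np.array([41.0, 27.2, 40.0])

# Same formula as the Magnitude column: sqrt(x^2 + y^2 + z^2).
magnitude = np.sqrt(np.sum(acc ** 2))

# np.linalg.norm computes the same Euclidean norm.
magnitude_norm = np.linalg.norm(acc)

print(round(float(magnitude), 2))  # 63.41
```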

### Check Outlier Removal

In [18]:
df_round2['Subject_ID'].value_counts()

19-035    5288
19-004    5288
19-031    5288
19-032    5288
19-048    5288
19-041    5288
19-038    5288
19-015    5288
19-014    5288
19-046    5288
19-024    5288
19-042    5288
19-020    5288
19-002    5288
19-029    5288
19-039    5288
19-043    5288
19-034    5288
19-054    5288
19-019    5288
19-053    5288
19-055    5288
19-012    5288
19-006    5288
19-005    5288
19-018    5288
19-008    5288
19-025    5288
19-037    5288
19-010    5288
19-049    5288
19-026    5288
19-033    5288
19-036    5288
19-040    5288
19-044    5288
19-050    5288
19-045    5288
19-027    5288
19-022    5288
19-028    5288
19-052    5288
19-013    5288
19-001    5288
19-011    5288
19-017    5288
19-051    5288
19-023    5288
19-007    5288
19-009    5288
19-016    5288
19-003    5288
19-021    2644
19-056    2644
19-030    2644
19-047    2644
Name: Subject_ID, dtype: int64

This matches expectations: the four subjects whose Round 2 data came from the Apple Watch (19-021, 19-030, 19-047, 19-056) have 2,644 rows each, half as many as the 5,288 rows of the subjects whose Rounds 1 and 2 were both recorded with the Empatica E4.
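
The counts above can also be cross-checked with simple arithmetic. The sketch below uses only numbers reported in this notebook (the original row count from `df.shape` and the per-subject counts); the variable names are illustrative.

```python
rows_before = 296128          # df.shape[0] before filtering
rows_per_full_subject = 5288  # subjects with both rounds from the E4
rows_per_aw_subject = 2644    # subjects with Round 2 removed
n_subjects = 56               # subjects listed in value_counts()
n_aw_subjects = 4             # 19-021, 19-030, 19-047, 19-056

# Before filtering, every subject contributed the same number of rows.
assert rows_before == n_subjects * rows_per_full_subject

# After filtering, the four Apple Watch subjects keep only Round 1.
expected_after = (
    (n_subjects - n_aw_subjects) * rows_per_full_subject
    + n_aw_subjects * rows_per_aw_subject
)
print(expected_after)  # 285552
```

If `df_round2.shape[0]` differs from this value, the filter removed more or fewer rows than intended.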

## Save Output Data to CSV 

In [12]:
df_round2.to_csv("../../20_intermediate_files/30_data_no_aw.csv", index = False)

### Remove Round Numbers from Activity Labels to Combine Rounds

In [13]:
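# Work on a copy so df_round2 keeps its round-specific labels
df_round2_combined_rounds = df_round2.copy()
# Drop the round number ("Baseline 1" -> "Baseline"); regex=True is required in pandas >= 2.0
df_round2_combined_rounds['Activity'] = df_round2_combined_rounds['Activity'].str.replace(r'\d+', '', regex=True).str.rstrip()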
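
The label cleanup in this section depends on the pattern `\d+` being treated as a regular expression. The sketch below shows the intended effect on two sample labels (`Baseline 1` is from the data preview; `Baseline 2` is assumed by analogy). Passing `regex=True` explicitly matters because pandas 2.0 changed the default of `Series.str.replace` to literal matching.

```python
import pandas as pd

# "Baseline 1" comes from the preview; "Baseline 2" is an assumed counterpart.
labels = pd.Series(["Baseline 1", "Baseline 2"])

# Remove digits, then trim the space left behind at the end of the label.
combined = labels.str.replace(r"\d+", "", regex=True).str.rstrip()

print(combined.tolist())  # ['Baseline', 'Baseline']
```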

In [14]:
df_round2_combined_rounds.to_csv("../../20_intermediate_files/30_no_apple_watch_combined_rounds.csv", index = False)