# Working with Activity Data

You can watch [Part 1 here](https://www.youtube.com/watch?v=XKQcHGzYCQQ) and [Part 2 here](https://www.youtube.com/watch?v=tbmPttrWQeU).

In [7]:
# import statements 

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np
import pandas as pd

**RECALL: Data scientists spend about 60% of their time on cleaning data**

Cleaning data typically involves tasks such as:
* Standardizing/scaling data
* Handling missing values
* Merging data from multiple datasets
* etc.
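
As a rough sketch of the first and third items, the snippet below z-score standardizes a column and then merges two small tables on a shared key. The frames `durations` and `ages` and their values are made up for illustration; they are not the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical toy tables (not the real dataset)
durations = pd.DataFrame({'pid': [0, 1, 2], 'duration': [100.0, 200.0, 300.0]})
ages = pd.DataFrame({'pid': [0, 1, 2], 'age': [72, 54, 65]})

# Standardize: subtract the mean and divide by the sample standard deviation
mean = durations['duration'].mean()
std = durations['duration'].std()
durations['duration_z'] = (durations['duration'] - mean) / std

# Merge: combine the two tables on the shared 'pid' key
merged = durations.merge(ages, on='pid', how='inner')
print(merged)
```

After standardizing, the column has mean 0 and standard deviation 1, which makes columns measured in different units easier to compare.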

## Missing Values

Values can be missing for various reasons (sensors/batteries die, data in a database has been compromised, etc.).

`pandas` has special support for `NaN` ("not a number") values, which makes missing data easier to detect and fix.

In [8]:
x = np.arange(0, 10)
ser = pd.Series(x, dtype=float)  # float dtype so the Series can hold NaN
ser[1] = np.nan
ser[5] = np.nan
print(ser)
nans = ser.isnull()
print('\n',nans)
ser.dropna(inplace=True)
print('\n',ser)

0    0.0
1    NaN
2    2.0
3    3.0
4    4.0
5    NaN
6    6.0
7    7.0
8    8.0
9    9.0
dtype: float64

 0    False
1     True
2    False
3    False
4    False
5     True
6    False
7    False
8    False
9    False
dtype: bool

 0    0.0
2    2.0
3    3.0
4    4.0
6    6.0
7    7.0
8    8.0
9    9.0
dtype: float64
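
Dropping rows is not the only way to deal with `NaN` values. As a hedged sketch, the snippet below fills the gaps instead, once with the mean of the observed values and once by linear interpolation. Which strategy is appropriate (if any) depends on the data; the choice here is only for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(0, 10), dtype=float)
ser[1] = np.nan
ser[5] = np.nan

# Option 1: replace each NaN with the mean of the observed values
filled_mean = ser.fillna(ser.mean())

# Option 2: linearly interpolate between the neighbouring observed values
filled_interp = ser.interpolate()

print(filled_mean)
print(filled_interp)
```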


In [11]:
%matplotlib inline

fname = 'data.csv'
df = pd.read_csv(fname, header=0)
print(df.shape)
# each participant contributes 9 rows, one per task
print('number of participants:', df.shape[0] // 9)

(675, 5)
number of participants: 75
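
Dividing the row count by 9 assumes every participant has exactly 9 rows. A sketch that avoids that assumption counts the distinct `pid` values directly; the toy frame below is hypothetical and deliberately has one participant with a missing row.

```python
import pandas as pd

# Hypothetical toy frame: 2 participants, one of them missing a task row
toy = pd.DataFrame({'pid': [0, 0, 0, 1, 1], 'task': ['1', '2', '3', '1', '2']})

n_participants = toy['pid'].nunique()          # number of distinct participant ids
rows_per_participant = toy.groupby('pid').size()  # rows recorded for each participant

print('number of participants:', n_participants)
print(rows_per_participant)
```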


## Explore Data

In [16]:
print(df.head())
print(df.tail())

   pid task duration  age class
0    0    1      146   72   HOA
1    0    2      210   72   HOA
2    0    3      241   72   HOA
3    0    4      328   72   HOA
4    0    5      229   72   HOA
     pid task duration  age class
670   74    5      235   78    PD
671   74    6       41   78    PD
672   74    7       11   78    PD
673   74    8        9   78    PD
674   74  dot     1532   78    PD


In [15]:
print(df['duration'].value_counts()['?'])

10


In [18]:
df.replace('?', np.nan, inplace=True)
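
An alternative, sketched here on assumed toy data, is to tell `read_csv` up front that `?` marks a missing value via its `na_values` parameter. The CSV text below is invented for the example and read from an in-memory buffer rather than from `data.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "pid,task,duration\n0,1,146\n0,2,?\n1,1,63\n"

# '?' is converted to NaN while the file is parsed
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')

print(toy)
print('Number of NaN:', toy['duration'].isnull().sum())
```

A side effect of this approach is that the column is parsed as numeric right away instead of as strings.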

Now let's look at the data in each column.

In [21]:
for col in df.columns:
    ser = df[col]
    print(ser.value_counts())
    print('Number of NaN:', ser.isnull().sum())
    print('*' * 50)

74    9
18    9
20    9
21    9
22    9
     ..
50    9
51    9
52    9
53    9
0     9
Name: pid, Length: 75, dtype: int64
Number of NaN: 0
**************************************************
4      75
8      75
1      75
dot    75
3      75
2      75
5      75
7      75
6      75
Name: task, dtype: int64
Number of NaN: 0
**************************************************
9       29
10      20
11      18
8       17
7       12
        ..
684      1
64       1
164      1
1077     1
291      1
Name: duration, Length: 355, dtype: int64
Number of NaN: 10
**************************************************
70    72
72    36
62    36
64    36
65    36
68    36
54    36
67    27
56    27
57    27
76    27
59    27
80    18
55    18
60    18
61    18
87    18
81    18
83    18
77    18
85     9
69     9
66     9
86     9
89     9
63     9
88     9
73     9
75     9
78     9
79     9
93     9
Name: age, dtype: int64
Number of NaN: 0
**************************************************
HOA          

In [27]:
print(df.shape)

df.dropna(inplace=True)
# renumber the rows 0..n-1 now that some have been dropped
df.reset_index(drop=True, inplace=True)

print('\n', df.shape)

(665, 5)

 (665, 5)


In [30]:
task_decoder = {'1': 'Water Plants', '2': 'Fill Medication Dispenser', '3': 'Wash Countertop',
                '4': 'Sweep and Dust', '5': 'Cook', '6': 'Wash Hands',
                '7': 'Perform TUG', '8': 'Perform TUG w/Questions', 'dot': 'Day Out Task'}

# replace each task code with its readable name and assign the result back to the column
df['task'] = df['task'].replace(task_decoder)

print(df.head(n=11))

    pid                       task duration  age class
0     0               Water Plants      146   72   HOA
1     0  Fill Medication Dispenser      210   72   HOA
2     0            Wash Countertop      241   72   HOA
3     0             Sweep and Dust      328   72   HOA
4     0                       Cook      229   72   HOA
5     0                 Wash Hands       38   72   HOA
6     0                Perform TUG       10   72   HOA
7     0    Perform TUG w/Questions       10   72   HOA
8     0               Day Out Task      680   72   HOA
9     1               Water Plants       63   54   hoa
10    1  Fill Medication Dispenser      202   54   hoa
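
When decoding with a dictionary, it can help to know how unexpected codes are treated. The sketch below contrasts `Series.replace` with `Series.map` on a small, made-up series; the code `'9'` is deliberately absent from the (shortened, hypothetical) decoder.

```python
import pandas as pd

decoder = {'1': 'Water Plants', 'dot': 'Day Out Task'}
codes = pd.Series(['1', 'dot', '9'])  # '9' is deliberately not in the decoder

mapped = codes.map(decoder)        # codes missing from the dict become NaN
replaced = codes.replace(decoder)  # codes missing from the dict are left unchanged

print(mapped)
print(replaced)
```

`map` makes unknown codes easy to spot with `isnull()`, while `replace` keeps them as they were.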

In [32]:
print(df['duration'].dtype)
print(df['age'].dtype)

object
int64


In [33]:
df['duration'] = df['duration'].astype(int)
print(df['duration'].dtype)

int64
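
`astype(int)` raises an error if any entry cannot be parsed as an integer. As a hedged alternative, `pd.to_numeric` with `errors='coerce'` turns unparseable entries into `NaN` instead. The series below is invented for the example.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings, plus one bad entry
raw = pd.Series(['146', '210', '?', '38'])

# entries that cannot be parsed become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric)
print(numeric.dtype)
```

Because `NaN` is a floating-point value, the result is a float column rather than an integer one.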


## Clean Class

In [37]:
ser = df['class'].copy()
for i in range(len(ser)):
    # normalize the label so that matching is case-insensitive
    curr = str(ser[i]).lower()

    if 'hoa' in curr or 'healthy' in curr:
        ser[i] = 'HOA'
    # substring test with `in`, so 'pd' and 'PD' are both recognized
    elif 'pd' in curr or 'parkinson' in curr:
        ser[i] = 'PD'
    else:
        print('Unrecognized class label:', curr)
        ser[i] = np.nan

# write the cleaned labels back and summarize once, after the loop
df['class'] = ser
print(df['class'].value_counts())
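
The same cleaning rule can also be written without an explicit loop. The sketch below uses the `.str` accessor on a small, made-up series of labels; the label spellings are assumptions chosen to exercise each branch, not values taken from `data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical labels covering each case of the cleaning rule
labels = pd.Series(['HOA', 'hoa', 'Healthy', 'PD', 'pd', 'Parkinsons', 'unknown'])
lower = labels.str.lower()

# boolean masks; na=False treats missing labels as "no match"
is_hoa = lower.str.contains('hoa|healthy', na=False)
is_pd = lower.str.contains('pd|parkinson', na=False)

# start from all-NaN so anything unrecognized stays missing
cleaned = pd.Series(np.nan, index=labels.index, dtype=object)
cleaned[is_hoa] = 'HOA'
cleaned[is_pd & ~is_hoa] = 'PD'

print(cleaned.value_counts(dropna=False))
```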

In [None]:
out_fname = 'new_data.csv'
df.to_csv(out_fname)
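
By default `to_csv` also writes the row index as an extra first column. If that column is not wanted when the file is read back, `index=False` leaves it out. The sketch below shows a round trip through an in-memory buffer with a made-up frame, so no file is written.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the cleaned data
toy = pd.DataFrame({'pid': [0, 1], 'class': ['HOA', 'PD']})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # omit the row index from the output
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored)
```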