# Opioid Data - Standardize Timeseries
HW #2 Part 2 - Timeseries.  
Use all rows for each patient, covering about 30 consecutive days.
Standardize every feature column.

## Preprocessing
Input characteristics: 
Patient files are in one of two directories: R or NR.
Each patient is represented by one CSV file.
Each row of each CSV contains readings from one day.
Days are in order and mostly sequential (at least one patient is missing a day).
Most patients have 30 days but some have fewer.

Loaded data structure characteristics:
Labels columns are cohort (R or NR), patient (e.g. 2005_S3), and date.
Labels has one row per cohort, patient, and date combination present in the data.
Features columns are the measurements, without dates.
Features rows correspond one-to-one with Labels rows.

We will assume timeseries records are equally spaced at one per day.
This is mostly true, with a few aberrations.
For example, R patient 2060_S3 is missing the record between 2020-06-01 and 2020-06-03.
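
A gap like the one above can be found by differencing consecutive dates. The following is a minimal sketch, not part of the original notebook: `find_date_gaps` is a hypothetical helper, and the example dates are made up apart from the 2020-06-01 to 2020-06-03 gap mentioned above.

```python
import pandas as pd

def find_date_gaps(dates):
    """Return (previous_date, next_date, days_apart) for every step longer than one day."""
    parsed = pd.to_datetime(pd.Series(list(dates))).reset_index(drop=True)
    step_days = parsed.diff().dt.days  # first entry is NaN because it has no predecessor
    gaps = []
    for i in step_days[step_days > 1].index:
        gaps.append((parsed[i - 1].date().isoformat(),
                     parsed[i].date().isoformat(),
                     int(step_days[i])))
    return gaps

# Hypothetical date column for one patient, with one missing day.
example_dates = ['2020-05-30', '2020-05-31', '2020-06-01', '2020-06-03', '2020-06-04']
print(find_date_gaps(example_dates))  # [('2020-06-01', '2020-06-03', 2)]
```

Running this over each patient's `Date` column would show how often the one-record-per-day assumption is violated.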

We will align all patient records by measuring days-since-start.
We will ignore the specific dates which can be in different months for different patients.
Thus, we are assuming month of year has no effect on the response variable R/NR.
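
One possible way to compute days-since-start is sketched below. It assumes the labels are held in a DataFrame with the columns `cohort`, `patient_name`, and `date`; the helper `add_days_since_start`, the `day` column, and the toy patients `A` and `B` are all hypothetical and do not appear in the notebook.

```python
import pandas as pd

def add_days_since_start(labels):
    """Return a copy of labels with a 'day' column: days elapsed since each patient's first record."""
    out = labels.copy()
    dates = pd.to_datetime(out['date'])
    # Earliest date per patient, broadcast back to every row of that patient.
    first = dates.groupby(out['patient_name']).transform('min')
    out['day'] = (dates - first).dt.days
    return out

# Toy labels: two hypothetical patients starting in different months.
toy = pd.DataFrame({
    'cohort': ['R', 'R', 'R', 'N', 'N'],
    'patient_name': ['A', 'A', 'A', 'B', 'B'],
    'date': ['2020-06-01', '2020-06-03', '2020-06-04', '2020-03-12', '2020-03-13'],
})
print(add_days_since_start(toy)['day'].tolist())  # [0, 2, 3, 0, 1]
```

Unlike a plain row counter, this keeps a missing day visible as a jump in `day` (0 to 2 for patient `A`), so the aligned series can still be checked against the equal-spacing assumption.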

In [1]:
from os import listdir
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import time

In [2]:
pathR='data/ChunkedData_R/'
pathN='data/chunkedData_NR/'
CLASS_SEPARATOR=13  # data[:13] vs data[13:]
WITH_VARIANCE_COLUMNS=True   # Use mean and variance per patient

In [3]:
COL_LIST = ['cohort','patient_name','date']
labels_list = []
features_df = pd.DataFrame()
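# Read one CSV file.
# Append its rows to the global features_df and its labels to the global labels_list.
def load_patient (filepath,cohort,patient_name):
    global features_df   # rebound below, so it must be declared global
    one_patient = pd.read_csv(filepath)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
    features_df = pd.concat([features_df,one_patient])
    # labels_list is only mutated in place, so it needs no global declaration.
    for one_date in one_patient['Date']:
        one_label=(cohort,patient_name,one_date)
        labels_list.append(one_label)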

In [4]:
# Read directory of CSV files (R or NR). 
# Given directory, load all the patients in that directory.
# We use filenames as patient names.
def load_cohort (cohort,directory):
    file_names = listdir(directory)
    for fp in file_names:
        dfp = directory+fp
        one_name = fp.split('.')[0]  # strip away .csv suffix
        one_name = one_name[6:]    # strip away Daily_ prefix
        load_patient(dfp,cohort,one_name)  # returns nothing; fills the global structures

In [5]:
print(len(listdir(pathR))," files of type R")
load_cohort('R',pathR)
print(len(listdir(pathN))," files of type N")
load_cohort('N',pathN)
features_df = features_df.drop('Date',axis=1) 
features_df.head()

14  files of type R
26  files of type N


|  | Morning_Question1 | Morning_Question2 | Morning_Question3 | Morning_Question4 | Morning_Question5 | Morning_Question6 | Afternoon_Question1 | Afternoon_Question2 | Afternoon_Question3 | Afternoon_Question4 | ... | HR_mean | HR_var | HR_std | HR_sk | HR_ku | Stress_mean | Stress_var | Stress_std | Stress_sk | Stress_ku |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 4 | 4 | 3 | 4 | 4 | 4 | 4 | 4 | ... | 74.290849 | 320.934322 | 17.91464 | 1.07101 | 0.770627 | 39.17378 | 769.416191 | 27.738352 | 0.451429 | -1.31996 |
| 1 | 1 | 2 | 4 | 4 | 2 | 4 | 2 | 5 | 5 | 4 | ... | 74.401459 | 308.214628 | 17.556042 | 1.327296 | 1.33125 | 38.691558 | 485.93387 | 22.043908 | 0.706097 | -0.149387 |
| 2 | 2 | 1 | 1 | 4 | 4 | 5 | 4 | 4 | 4 | 4 | ... | 74.329967 | 292.800716 | 17.111421 | 0.993076 | 0.316197 | 35.116129 | 880.102975 | 29.66653 | 0.565035 | -1.181336 |
| 3 | 2 | 2 | 3 | 4 | 3 | 4 | 2 | 2 | 3 | 4 | ... | 77.153765 | 166.815153 | 12.915694 | 1.120305 | 3.849928 | 52.852547 | 349.389489 | 18.691963 | -0.040921 | -0.137957 |
| 4 | 2 | 2 | 4 | 4 | 3 | 5 | 4 | 4 | 4 | 4 | ... | 80.128234 | 197.631934 | 14.058163 | 1.264925 | 0.945865 | 45.83432 | 377.31968 | 19.424718 | 0.539156 | -0.36347 |


In [6]:
# Labels is a list of triples: cohort, patient name, test date.
# Labels is just identifying info; no data.
# Labels and Features are in one-to-one correspondence.
print("labels:",len(labels_list)," like ",labels_list[0])
# Features is a dataframe with named columns.
# Features is just features; no identifying info.
print("features:",features_df.shape)

labels: 1004  like  ('R', '2060_S3', '2020-05-21')
features: (1004, 259)
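
The notebook defines `COL_LIST` but keeps the labels as a plain list of triples. As a hedged sketch of how the two fit together, the triples can be turned into a DataFrame with those column names, which makes per-patient record counts easy to check. The `toy_labels` values below are illustrative only (the second date and the patient `P_demo` are made up), and `records_per_patient` is a hypothetical name.

```python
import pandas as pd

COL_LIST = ['cohort', 'patient_name', 'date']

# Toy stand-in for labels_list: (cohort, patient name, date) triples in load order.
toy_labels = [
    ('R', '2060_S3', '2020-05-21'),
    ('R', '2060_S3', '2020-05-22'),
    ('N', 'P_demo', '2020-03-12'),
]

labels_df = pd.DataFrame(toy_labels, columns=COL_LIST)

# Count records per patient, keeping the order in which patients were loaded.
records_per_patient = labels_df.groupby(['cohort', 'patient_name'], sort=False).size()
print(records_per_patient)
```

Applied to the real `labels_list`, the counts would show which patients have fewer than 30 days.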


## Scaling 
Normalize by subtracting the column mean from every column value.  
Since columns have widely different numerical ranges,   
also normalize by making each column have unit variance.  
Note: without normalization, the covariance plot would be all black except for the few features with large absolute values.
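
As a small worked example of the two steps above, the sketch below standardizes a toy matrix by hand with NumPy. The values are made up. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also the default of `np.std`, so this manual version should match its output for columns with non-zero variance.

```python
import numpy as np

# Toy feature matrix: two columns on very different numerical scales.
x = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

col_mean = x.mean(axis=0)
col_std = x.std(axis=0)       # population std (ddof=0)
z = (x - col_mean) / col_std  # z = (x - u) / s, applied column by column

print(z[:, 0])         # approximately [-1.2247, 0.0, 1.2247]
print(z.mean(axis=0))  # approximately 0 for every column
print(z.std(axis=0))   # approximately 1 for every column
```

Both columns end up with identical standardized values, since they differ only in offset and scale.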

In [7]:
# Standardize features by shifting the mean to zero and scaling to unit variance.
# Subtract the mean and divide by the std.dev: z = (x - u) / s
def scale_features(df):
    scaled = StandardScaler().fit_transform(df.values)
    scaled_df = pd.DataFrame(scaled, index=df.index, columns=df.columns)
    return scaled_df
scaled_features = scale_features(features_df)
scaled_features.head()

|  | Morning_Question1 | Morning_Question2 | Morning_Question3 | Morning_Question4 | Morning_Question5 | Morning_Question6 | Afternoon_Question1 | Afternoon_Question2 | Afternoon_Question3 | Afternoon_Question4 | ... | HR_mean | HR_var | HR_std | HR_sk | HR_ku | Stress_mean | Stress_var | Stress_std | Stress_sk | Stress_ku |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.596739 | -0.701134 | -0.108969 | 0.405753 | 0.629528 | -0.303159 | -0.119786 | -0.027694 | 0.657211 | 1.04132 | ... | -0.403874 | 1.385364 | 1.423754 | 0.611028 | 0.083082 | 0.625278 | 1.402042 | 1.290681 | -0.450538 | -0.757345 |
| 1 | -1.279115 | -0.701134 | -0.108969 | 0.405753 | -0.178715 | -0.303159 | -2.091345 | 0.869223 | 1.317051 | 1.04132 | ... | -0.389217 | 1.273239 | 1.335386 | 1.06104 | 0.354311 | 0.589421 | 0.265149 | 0.402898 | -0.147939 | -0.370571 |
| 2 | -0.596739 | -1.443686 | -3.14799 | 0.405753 | 1.437772 | 0.845414 | -0.119786 | -0.027694 | 0.657211 | 1.04132 | ... | -0.398691 | 1.137365 | 1.225819 | 0.474183 | -0.136771 | 0.323556 | 1.845946 | 1.59129 | -0.315551 | -0.711542 |
| 3 | -0.596739 | -0.701134 | -1.121976 | 0.405753 | 0.629528 | -0.303159 | -2.091345 | -1.821527 | -0.002629 | 1.04132 | ... | -0.024494 | 0.026794 | 0.191876 | 0.697586 | 1.572846 | 1.642417 | -0.282457 | -0.119682 | -1.035555 | -0.366795 |
| 4 | -0.596739 | -0.701134 | -0.108969 | 0.405753 | 0.629528 | 0.845414 | -0.119786 | -0.027694 | 0.657211 | 1.04132 | ... | 0.36967 | 0.298446 | 0.473412 | 0.951524 | 0.167862 | 1.120549 | -0.170444 | -0.005443 | -0.346301 | -0.441308 |


## Extract records for one patient

In [16]:
# Returns the (cohort, name, date) label of the patient's first record, or None if ndx is out of range.
def patient_by_index(ndx):
    prev_name='XXX'
    name_index=-1
    for i in range(0,len(labels_list)):
        (cohort,name,date)=labels_list[i]
        if not name == prev_name:
            prev_name = name
            name_index = name_index+1
        if name_index == ndx:
            return (cohort,name,date)
    return None
    
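# Returns dataframe of all feature rows for one patient.
# Note: rows come from the unscaled features_df.
def features_by_patient_index(ndx):
    prev_name='XXX'
    name_index=-1
    first_row=len(labels_list)   # renamed from min/max to avoid shadowing the builtins
    last_row=-1
    for i in range(0,len(labels_list)):
        (cohort,name,date)=labels_list[i]
        if not name == prev_name:
            prev_name = name
            name_index = name_index+1
        if name_index == ndx:
            if i<first_row:
                first_row=i
            if i>last_row:
                last_row=i
    one_p = features_df.iloc[first_row:last_row+1]
    return (one_p)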



In [17]:
# Demo
ndx=1
my_feat = features_by_patient_index(ndx)
print("Patient number:",ndx)
print("Patient cohort, name, start date:",patient_by_index(ndx))
print("Num records:",len(my_feat))

Patient number: 1
Patient cohort, name, start date: ('R', '2027_S2', '2020-03-12')
Num records: 29
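
The two lookup functions above walk `labels_list` row by row. A vectorized alternative is sketched below under the assumption that the labels are available as a DataFrame with the columns `cohort`, `patient_name`, and `date`. The function `rows_for_patient` and the toy data are hypothetical. Patients are numbered the same way as above: the counter advances each time the name changes between consecutive rows.

```python
import pandas as pd

def rows_for_patient(labels, features, ndx):
    """Return ((cohort, name, start_date), feature_rows) for the ndx-th patient in load order, or None."""
    names = labels['patient_name']
    # Number patients 0, 1, 2, ... each time the name changes between consecutive rows.
    patient_number = (names != names.shift()).cumsum() - 1
    mask = (patient_number == ndx).to_numpy()  # positional mask, safe with repeated index values
    if not mask.any():
        return None
    first = labels[mask].iloc[0]
    return (first['cohort'], first['patient_name'], first['date']), features[mask]

# Toy data: two hypothetical patients with 2 and 3 records.
labels = pd.DataFrame({
    'cohort': ['R', 'R', 'N', 'N', 'N'],
    'patient_name': ['A', 'A', 'B', 'B', 'B'],
    'date': ['2020-06-01', '2020-06-02', '2020-03-12', '2020-03-13', '2020-03-14'],
})
features = pd.DataFrame({'HR_mean': [70.0, 71.0, 80.0, 81.0, 82.0]})

label, feats = rows_for_patient(labels, features, 1)
print(label, len(feats))  # ('N', 'B', '2020-03-12') 3
```

The mask is converted to a NumPy array so that selection is by position, which matters here because concatenated per-patient frames repeat the same index values.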
