NAME: __FULLNAME__

# Homework 2

### Objectives
* Object orientation in Python
* Constructing Data Pre-processing Pipelines
  + Imputing
  + Filtering
  + Simple Numerical Methods
* Do not save work within the ml_practices folder
  + Create a folder in your home directory for assignments, and copy the templates there

### General References
* [scikit-learn Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
* [scikit-learn Impute](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute)
* [scikit-learn Preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
* [Pandas Interpolate](https://pandas.pydata.org/pandas-docs/version/0.16/generated/pandas.DataFrame.interpolate.html)
* [Pandas fillna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
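
As a quick orientation to the two pandas references above, here is a minimal sketch on a made-up toy Series. The values and the names `s`, `inner`, and `filled` are illustrative assumptions, not part of the assignment data. It shows interior gaps being linearly interpolated with `limit_area='inside'`, and the remaining edge gaps being filled from the nearest valid value with `ffill()` and `bfill()`.

```python
import numpy as np
import pandas as pd

# Toy series with a leading gap, an interior gap, and a trailing gap
s = pd.Series([np.nan, 1.2, np.nan, np.nan, 1.5, np.nan])

# Interpolate only NaNs surrounded by valid values; edge NaNs are left alone
inner = s.interpolate(method='linear', limit_area='inside')

# Fill the trailing edge forward, then the leading edge backward
filled = inner.ffill().bfill()

print(filled.round(2).tolist())
```

Splitting the work into an interior step and an edge step makes it easy to inspect what each call changed before combining them.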

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

FIGWIDTH = 10
FIGHEIGHT = 2

%matplotlib inline

# Load Data

In [None]:
fname = '~/ml_practices/imports/datasets/baby1/subject_k1_w10_hw2.csv'
baby_data_raw = # TODO
baby_data_raw.info()

In [None]:
""" TODO
Call describe() on the data to get summary statistics
"""



In [None]:
""" TODO
Call head() on the data to observe the first few examples
"""



In [None]:
""" TODO
Call tail() on the data to observe the last few examples
"""



In [None]:
""" TODO
Display the column names for the data
"""



In [None]:
""" TODO
Determine whether any data are NaN. Use isna() and
any() to obtain a summary of which features have at 
least one missing value
"""



# Create Pipeline Elements
In the lecture, some Pipeline components may have taken or returned numpy arrays, while others worked with pandas DataFrames. For this assignment, the transform method of every Pipeline component takes a pandas DataFrame as input and returns a DataFrame.
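
The DataFrame-in, DataFrame-out convention can be sketched with a toy component. `ColumnScaler`, its `factor` parameter, and the `toy` data below are hypothetical names invented for illustration; they are not part of the assignment.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnScaler(BaseEstimator, TransformerMixin):
    """Hypothetical example component: multiplies every column by a constant."""
    def __init__(self, factor=1.0):
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn; return self so the object can be chained
        return self

    def transform(self, X):
        # DataFrame arithmetic keeps the index and column labels intact
        return X * self.factor

toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
scaled = ColumnScaler(factor=2.0).fit_transform(toy)
print(type(scaled).__name__, list(scaled.columns))
```

Because the result is still a DataFrame, a later component can keep selecting features by column name.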

In [None]:
""" PROVIDED
Pipeline component object for selecting a subset of specified features
"""
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribs):
        self.attribs = attribs
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, X):
        '''
        PARAMS:
            X: is a DataFrame
        RETURNS: a DataFrame of the selected attributes
        '''
        return X[self.attribs]


""" TODO
Complete the Pipeline component object for interpolating and filling in 
gaps within the data. Whenever data are missing in between valid values, 
use interpolation to fill in the gaps. For example,
    1.2 NaN NaN 1.5 
becomes
    1.2 1.3 1.4 1.5 

Whenever data are missing on the edges of the data, fill in the gaps
with the nearest valid value: the first valid value for leading gaps
and the last valid value for trailing gaps. For example,
    NaN NaN 2.3 3.6 3.2 NaN
becomes
    2.3 2.3 2.3 3.6 3.2 3.2
The transform() method should fill in both the interior holes and the 
edge gaps.
"""
class InterpolationImputer(BaseEstimator, TransformerMixin):
    def __init__(self, method='quadratic'):
        self.method = method
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, X): # TODO
        '''
        PARAMS:
            X: is a DataFrame
        RETURNS: a DataFrame without NaNs
        '''
        # TODO: Interpolate holes within the data
        
        # TODO: Fill in the NaNs on the edges of the data
                
        # TODO: return the imputed dataframe


""" TODO
Complete the Pipeline component object for smoothing specific features
using a Gaussian kernel. For a kernel of size 7, apply the filter as:
    x'[t] = ( w[0]*x[t-3] + w[1]*x[t-2] + w[2]*x[t-1] + w[3]*x[t]
           + w[4]*x[t+1] + w[5]*x[t+2] + w[6]*x[t+3])
    NOTE: if you implement this computation on more than one line, 
          place parentheses around the entire expression so that the 
          interpreter reads the lines as one expression
This can be implemented similarly to how the derivative is computed. 
Additionally, before filtering, pad each end of x with three copies of 
its edge value so that the output keeps the original signal length and 
stays smooth at the boundaries. For example,
                1.3 2.1 4.4 4.1 3.2
would be padded as
    1.3 1.3 1.3 1.3 2.1 4.4 4.1 3.2 3.2 3.2 3.2
"""

def computeweights(length=3, sig=1):
    '''
    Computes the weights for a Gaussian filter kernel
    PARAMS:
        length: the number of terms in the filter kernel
        sig: the standard deviation (i.e. the scale) of the Gaussian
    RETURNS: a numpy array of filter weights for the Gaussian kernel,
             normalized so that they sum to 1
    '''
    x = np.linspace(-2.5, 2.5, length)
    kernel = stats.norm.pdf(x, scale=sig)
    return kernel / kernel.sum()

class GaussianFilter(BaseEstimator, TransformerMixin):
    def __init__(self, attribs=None, kernelsize=3, sig=1):
        self.attribs = attribs
        self.kernelsize = kernelsize
        self.sig = sig
        self.weights = computeweights(length=kernelsize, sig=sig)
        print("KERNEL WEIGHTS", self.weights)

    def fit(self, x, y=None):
        return self
    
    def transform(self, X): # TODO
        '''
        PARAMS:
            X: is a DataFrame
        RETURNS: a DataFrame with the smoothed signals
        '''
        w = self.weights
        Xout = X.copy()
        if self.attribs is None:
            self.attribs = Xout.columns
        
        # TODO for each attribute:
            # TODO: pad the data
            
            # TODO: filter the data
                        
        # TODO: return filtered dataframe


""" PROVIDED
Pipeline component object for computing the derivative for specified features
"""
class DerivativeComputer(BaseEstimator, TransformerMixin):
    def __init__(self, attribs=None, prefix='d_', dt=1.0):
        self.attribs = attribs
        self.prefix = prefix
        self.dt = dt
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, X):
        '''
        PARAMS:
            X: is a DataFrame
        RETURNS: a DataFrame with additional features for the derivatives
        '''
        Xout = X.copy()
        if self.attribs is None:
            self.attribs = Xout.columns
        
        for attrib in self.attribs:
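            vals = Xout[attrib].values
            # Forward difference: (x[t+1] - x[t]) / dt
            diff = vals[1:] - vals[0:-1]
            deriv = diff / self.dt
            # Append 0 so the derivative has the same length as the signal
            deriv = np.append(deriv, 0)
            attrib_name = self.prefix + attrib
            # Assign the array directly so values are placed by position,
            # even if X does not have a default integer index
            Xout[attrib_name] = deriv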

        return Xout


# Construct Pipeline
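
A `Pipeline` is built from a list of `(name, transformer)` pairs and applies them in order. The sketch below illustrates the mechanics only and is not the assignment solution. `KeepColumns`, `ForwardDifference`, the step names, and the toy ramp signal are all hypothetical stand-ins. The ramp is sampled at 50 Hz, so dt = 1/50 = 0.02 s, and the forward difference of x(t) = 3t should come out close to 3.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class KeepColumns(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: keeps only the named columns."""
    def __init__(self, names):
        self.names = names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.names]

class ForwardDifference(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: adds a forward-difference column per feature."""
    def __init__(self, dt=1.0):
        self.dt = dt
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        Xout = X.copy()
        for col in X.columns:
            vals = X[col].values
            # (x[t+1] - x[t]) / dt, with 0 appended to keep the length
            Xout['d_' + col] = np.append(np.diff(vals) / self.dt, 0)
        return Xout

# Toy ramp x(t) = 3t sampled at 50 Hz, so dt = 1 / 50 = 0.02 s
dt = 1 / 50
t = np.arange(5) * dt
toy = pd.DataFrame({'time': t, 'x': 3 * t})

# Steps run top to bottom; each receives the previous step's output
pipe = Pipeline([
    ('keep', KeepColumns(['x'])),
    ('diff', ForwardDifference(dt=dt)),
])
out = pipe.fit_transform(toy)
print(list(out.columns))
```

Naming each step keeps the pipeline readable and lets individual components be looked up later through `pipe.named_steps`.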

In [None]:
selected_names = ['left_wrist_x', 'left_wrist_y', 'left_wrist_z']
selected_inds = [baby_data_raw.columns.get_loc(name) for name in selected_names]
nselected = len(selected_names)
time = baby_data_raw['time'].values
Xsel_raw = baby_data_raw[selected_names].values

In [None]:
""" TODO
Create a pipeline that:
1. Selects a subset of features
2. Fills gaps within the data by linearly interpolating the values 
   in between existing data and fills the remaining gaps at the edges
   of the data with the first or last valid value
3. Computes the derivatives of the selected features. The data are 
   sampled at 50 Hz, so the period or elapsed time (dt) between 
   samples is 0.02 seconds (dt=0.02)
"""
pipe1 = # TODO

""" TODO
Create a pipeline that:
1. Selects a subset of features
2. Fills gaps within the data by linearly interpolating the values 
   in between existing data and fills the remaining gaps at the edges
   of the data with the first or last valid value
3. Smooths the data with a Gaussian filter. Use a standard deviation 
   of 2 and a kernel size of 7 for the filter
4. Computes the derivatives of the selected features. The data are 
   sampled at 50 Hz, so the period or elapsed time (dt) between 
   samples is 0.02 seconds (dt=0.02)
"""
pipe2 = # TODO

In [None]:
""" TODO
Fit both Pipelines to the data and transform the data
"""
baby_data1 = # TODO
baby_data2 = # TODO

""" TODO
Display the summary statistics for the pre-processed data
from both pipelines
"""



In [None]:
""" TODO
Display the first few values for the pre-processed data
from both pipelines
"""


In [None]:
""" TODO
Display the last few values for the pre-processed data
from both pipelines
"""


In [None]:
""" TODO
Construct plots comparing the raw data to the pre-processed data 
for each selected feature from both pipelines. For each selected 
feature, create a figure displaying the raw data and the cleaned 
data in the same subplot. The raw data should be shifted upwards 
to clearly observe where the gaps are filled in the cleaned data. 
There should be three subplots per feature figure. Each subplot 
is in a separate row.
    subplot(1) will compare the original raw data to the pipeline1 
               pre-processed data
    subplot(2) will compare the original raw data to the pipeline2 
               pre-processed data
    subplot(3) will compare pipeline1 to pipeline2. Set the x limits 
               to 45 and 55 seconds
For all subplots, include axis labels, legends and titles.
"""


    

In [None]:
""" TODO
Construct plots for each feature presenting the feature and its 
derivative from both pipelines. Each figure should have 
3 subplots:
    1: the pipeline1 feature data and corresponding derivative 
    2: the pipeline2 feature data and corresponding derivative
    3: pipeline1 derivative and pipeline2 derivative. Set the x limits 
       to 8 and 12 seconds.
For all subplots, include axis labels, legends and titles.
"""
