NAME: __FULLNAME__

# Homework 2
## Machine Learning Practice, 2022

### Objectives
* Object orientation in Python
* Constructing Data Pre-processing Pipelines
  + Imputing
  + Filtering
  + Simple Numerical Methods
  
 
### Notes
* Do not save work within the MLP_2022 folder
  + create a folder in your home directory for assignments, and copy the skeleton there  

### Hand-In Procedure
* Execute all cells so that they show correct results
* Notebook (from Jupyter or Colab):
  + Submit this file (.ipynb) to the Gradescope Notebook HW2 dropbox
* Note: there is no need to submit a PDF file or to submit directly to Canvas
  
### General References
(there are hints here)
* [scikit-learn Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
* [scikit-learn Impute](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute)
* [scikit-learn Preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
* [Pandas Interpolate](https://pandas.pydata.org/pandas-docs/version/0.16/generated/pandas.DataFrame.interpolate.html)
* [Pandas fillna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
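
As a small illustration of the pandas methods linked above, the sketch below runs `interpolate()` and the forward/backward fill methods on a toy DataFrame. The data and column names are invented for this example, and the choice of `limit_area="inside"` is an assumption made here to keep the interior and edge steps separate; it is not necessarily the approach expected in the assignment.

```python
import numpy as np
import pandas as pd

# Toy data only; the column names are made up for this illustration.
df = pd.DataFrame({
    "inner_gap": [1.2, np.nan, np.nan, 1.5],
    "edge_gap": [np.nan, 2.3, 3.6, np.nan],
})

# limit_area="inside" restricts interpolation to NaNs that lie
# between two valid values, so the edge NaNs are left alone here.
inner_filled = df.interpolate(method="linear", limit_area="inside")

# Forward fill covers trailing NaNs; backward fill covers leading NaNs.
filled = inner_filled.ffill().bfill()
print(filled)
```

Printing `inner_filled` as well shows which gaps each step is responsible for.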

In [None]:
#Import required packages
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Default figure parameters
plt.rcParams['figure.figsize'] = (6,6)
plt.rcParams['font.size'] = 12
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['figure.constrained_layout.use'] = True

# Can be useful under some conditions where you are executing things locally:
#from IPython import get_ipython
#get_ipython().run_line_magic('matplotlib', 'inline')
#%matplotlib inline

In [None]:
# Mount Google Drive. This step is only needed when running in Google Colab.
from google.colab import drive
drive.mount('/content/drive')

# LOAD DATA

In [None]:
# TODO: Load in the baby data file
fname ='/content/drive/MyDrive/MLP_2022/datasets/baby1/subject_k1_w10_hw2.csv'
#fname ='/home/fagg/datasets/baby1/subject_k1_w10_hw2.csv'


baby_data_raw = # TODO


In [None]:
""" TODO
Call describe() on the data to get summary statistics
"""


In [None]:
""" TODO
Call head() on the data to observe the first few examples
"""


In [None]:
""" TODO
Call tail() on the data to observe the last few examples
"""


In [None]:
""" TODO
Display the column names for the data
"""


In [None]:
""" TODO
Determine whether any data are NaN. Use isna() and
any() to obtain a summary of which features have at 
least one missing value
"""
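
For reference, here is how `isna()` and `any()` combine on a toy DataFrame (invented values, not the baby data). `isna()` returns a boolean DataFrame of the same shape, and `any()` then reduces each column to a single True/False.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in column "x"
toy = pd.DataFrame({
    "time": [0.00, 0.02, 0.04],
    "x": [1.0, np.nan, 3.0],
    "y": [0.5, 0.6, 0.7],
})

# One boolean per column: True if that column has at least one NaN
has_missing = toy.isna().any()
print(has_missing)
```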


# Create Pipeline Elements
In the lecture, some of the Pipeline components received or returned numpy arrays, while others received or returned pandas DataFrames. For this assignment, transform methods for all the Pipeline components will take input as a pandas DataFrame and return a DataFrame.
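
The DataFrame-in, DataFrame-out convention can be sketched with a minimal component. `MeanCenterer` below is a hypothetical class invented for this illustration and is not one of the assignment's components; it only shows that a `transform()` returning a DataFrame passes through a `Pipeline` unchanged in type.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical component: subtracts each column's mean (learned in fit)."""

    def fit(self, X, y=None):
        # X is a DataFrame; X.mean() is a Series indexed by column name
        self.means_ = X.mean()
        return self

    def transform(self, X):
        # DataFrame minus Series aligns on column names and stays a DataFrame
        return X - self.means_


toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
pipe = Pipeline([("center", MeanCenterer())])
out = pipe.fit_transform(toy)
print(type(out).__name__)
print(out)
```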

In [None]:
""" PROVIDED
Pipeline component object for selecting a subset of specified features
"""
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribs):
        self.attribs = attribs
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, X):
        '''
        PARAMS:
            X: is a DataFrame
        RETURNS: a DataFrame of the selected attributes
        '''
        return X[self.attribs]
 
""" TODO
Complete the Pipeline component object for interpolating and filling in 
gaps within the data. Whenever data are missing in between valid values, 
use interpolation to fill in the gaps. For example,
    1.2 NaN NaN 1.5 
becomes
    1.2 1.3 1.4 1.5 

Whenever data are missing on the edges of the data, fill in the gaps
with the nearest valid value (the first valid value for leading gaps,
the last valid value for trailing gaps). For example,
    NaN NaN 2.3 3.6 3.2 NaN
becomes
    2.3 2.3 2.3 3.6 3.2 3.2
The transform() method you create should fill in the holes and the edge cases.

Hint: there are DataFrame methods that will help you implement these features
"""
class InterpolationImputer(BaseEstimator, TransformerMixin):
    def __init__(self, method='quadratic'):
        self.method = method
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, X): # TODO
        '''
        PARAMS:
            X: is a DataFrame
        RETURNS: a DataFrame without NaNs
        '''
        # TODO: Interpolate holes within the data
        Xout = 
        # TODO: Fill in the NaNs on the edges of the data
        Xout = 
        
        return Xout
    
""" TODO
Oftentimes, simple linear interpolation does not produce desirable results.
One way to improve our approach is to use a Gaussian kernel, which applies
a smoothing function over the data. This smoothing process helps to improve 
our result from interpolation, reduces the noise and the impact of outliers, and 
generally improves learning.

A Gaussian kernel is a powerful tool in machine learning - in this case,
we're using a Gaussian kernel to build a filter that we convolve over the 
data. In a kernel, points close to x[t] are used in a weighted
average to compute the new (smoothed) value, x'[t]. Here, the three 
datapoints to the left and right of x[t] are used (along with x[t]) to 
compute x'[t]. In a Gaussian kernel, we use the Gaussian (normal) 
distribution to determine what the weight for each point should be in 
our weighted average. The calculation of this is done for you.

Complete the GaussianFilter component object for smoothing specific features
using a Gaussian kernel. Here is the example formula for a filter of size k=7:
    x'[t] = ( w[0]*x[t-3] + w[1]*x[t-2] + w[2]*x[t-1] + w[3]*x[t]
           + w[4]*x[t+1] + w[5]*x[t+2] + w[6]*x[t+3])
                
This can be implemented similarly to how the derivative is computed, but will
require
1. padding both ends of x with k//2 (integer division) copies of the adjacent
value before filtering, to maintain the original timeseries length and 
smoothness. For example, with k=7,
                1.3 2.1 4.4 4.1 3.2
would be padded as
    1.3 1.3 1.3 1.3 2.1 4.4 4.1 3.2 3.2 3.2 3.2
2. Iterating over the k filter elements (rather than iterating over the 
samples in x)

"""

def computeweights(length=3, sig=1):
    ''' PROVIDED
    Computes the weights for a Gaussian filter kernel
    PARAMS:
        length: the number of terms in the filter kernel
        sig: the standard deviation (i.e. the scale) of the Gaussian
    RETURNS: a list of filter weights for the Gaussian kernel
    '''
    limit = 2.5
    x = np.linspace(-limit, limit, length)
    kernel = stats.norm.pdf(x, scale=sig)
    
    # Return the normalized kernel
    return kernel / kernel.sum()

class GaussianFilter(BaseEstimator, TransformerMixin):
    def __init__(self, attribs=None, kernelsize=3, sig=1):
        self.attribs = attribs
        # Number of kernel elements 
        self.kernelsize = kernelsize
        
        # Check that we have an odd kernel size
        if kernelsize % 2 == 0:
            raise Exception("Expecting an odd kernel size")

        # Standard deviation of the Gaussian
        self.sig = sig
        # Compute the kernel element values
        self.weights = computeweights(length=kernelsize, sig=sig)
        print("KERNEL WEIGHTS", self.weights)
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, X): # TODO
        '''
        PARAMS:
            X: is a DataFrame
        RETURNS: a DataFrame with the smoothed signals
        '''
        w = self.weights
        ks = self.kernelsize
        Xout = X.copy()
        
        # Select all attributes if unspecified
        if self.attribs is None:
            self.attribs = Xout.columns
        
        for attrib in self.attribs:
            # Extract the numpy vector
            vals = Xout[attrib].values
            # TODO: pad signal at both the front and end of the vector so that after
            #   convolution, the length is the same as the length of vals.  Use 
            #   vals[0] and vals[-1] to pad the front and back, respectively.
            #   You may assume that the kernel size is always odd
            
            nfrontpad = ks // 2 # int division
            
            # TODO: apply filter
            # Implementation is the same as for the DerivativeComputer element, but
            #   more general.  You must iterate over the kernel elements.
            #   (NOTE: due to the wonky way indexing works in python, you will have
            #   specific code for one index & iterate over the remaining k-1 indices)
            
            
            Xout[attrib] = pd.Series(avg)
            
        return Xout
    
""" PROVIDED
Pipeline component object for computing the derivative for specified features
"""
class DerivativeComputer(BaseEstimator, TransformerMixin):
    def __init__(self, attribs=None, prefix='d_', dt=1.0):
        self.attribs = attribs
        self.prefix = prefix
        self.dt = dt
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, X):
        '''
        PARAMS:
            X: is a DataFrame
        RETURNS: a DataFrame with additional features for the derivatives
        '''
        Xout = X.copy()
        if self.attribs is None:
            self.attribs = Xout.columns

        # Iterate over all of the attributes that we need to compute velocity over
        for attrib in self.attribs:
            # Extract the numpy array of data
            vals = Xout[attrib].values
            # Compute the difference between neighboring timeseries elements
            diff = vals[1:] - vals[0:-1]
            # Take into account the amount of time between timeseries samples
            deriv = diff / self.dt
            # Add a zero to the end so the resulting velocity vector is the same
            #   length as the position vector
            deriv = np.append(deriv, 0)
            
            # Add a new derivative attribute to the DataFrame
            attrib_name = self.prefix + attrib
            Xout[attrib_name] = pd.Series(deriv)

        return Xout
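
One way to sanity-check a filter implementation is to compare it against NumPy built-ins on a short vector. The sketch below is such a cross-check and is deliberately not the loop-over-kernel-elements implementation the assignment asks for. The helper names `gaussian_weights` and `smooth_reference` are hypothetical. The weights are computed with `np.exp` instead of `stats.norm.pdf`; after normalization the two should agree, because the constant factor of the normal density cancels.

```python
import numpy as np


def gaussian_weights(length=7, sig=2.0, limit=2.5):
    """Normalized Gaussian weights on [-limit, limit] (hypothetical helper)."""
    x = np.linspace(-limit, limit, length)
    kernel = np.exp(-0.5 * (x / sig) ** 2)
    return kernel / kernel.sum()


def smooth_reference(vals, weights):
    """Edge-pad, then take a weighted average over each window of len(weights)."""
    half = len(weights) // 2
    # mode="edge" repeats the first and last values, as in the padding example
    padded = np.pad(vals, half, mode="edge")
    # The Gaussian kernel is symmetric, so convolution equals the weighted sum
    return np.convolve(padded, weights, mode="valid")


vals = np.array([1.3, 2.1, 4.4, 4.1, 3.2])
w = gaussian_weights(length=7, sig=2.0)
padded = np.pad(vals, len(w) // 2, mode="edge")
smoothed = smooth_reference(vals, w)
print(padded)          # padded signal, length len(vals) + 2 * (k // 2)
print(smoothed.shape)  # same shape as vals
```

A constant input is a useful extra check: because the weights sum to 1, smoothing a constant signal should return the same constant everywhere, including at the padded edges.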


# Construct Pipeline

In [None]:
""" PROVIDED
Set up convenience variables. Use the right wrist data as features.
"""
#selected_names = TODO
selected_names = ['right_wrist_x', 'right_wrist_y', 'right_wrist_z']
nselected = len(selected_names)
time = baby_data_raw['time'].values
# raw data for selected features
Xsel_raw = baby_data_raw[selected_names].values

In [None]:
""" TODO
Create a pipeline that:
1. Selects a subset of features specified above
2. Fills gaps within the data by linearly interpolating the values 
   in between existing data and fills the remaining gaps at the edges
   of the data with the first or last valid value
3. Compute the derivatives of the selected features. The data are 
   sampled at 50 Hz, therefore, the period or elapsed time (dt) between 
   the samples is .02 seconds (dt=.02)
"""
pipe1 = Pipeline([
   
])

""" TODO
Create a pipeline that:
1. Selects a subset of features specified above
2. Fills gaps within the data by linearly interpolating the values 
   in between existing data and fills the remaining gaps at the edges
   of the data with the first or last valid value
3. Smooths the data with a Gaussian Filter. Use a standard deviation 
   of 2 and a kernel size of 7 for the filter
4. Compute the derivatives of the selected features. The data are 
   sampled at 50 Hz, therefore, the period or elapsed time (dt) between 
   the samples is .02 seconds (dt=.02)
"""
pipe2 = Pipeline([
    
])
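
The `dt` value in the step descriptions follows directly from the sampling rate. The short sketch below works through the arithmetic on made-up position samples, using the same forward-difference idea as `DerivativeComputer`, with `np.diff` as a shorthand for the slice subtraction.

```python
import numpy as np

sample_rate_hz = 50.0
dt = 1.0 / sample_rate_hz  # 0.02 seconds between samples

# Made-up position samples, one every dt seconds
position = np.array([0.00, 0.01, 0.03, 0.06])

# Forward difference divided by dt; a trailing zero keeps the length equal
velocity = np.append(np.diff(position) / dt, 0.0)
print(velocity)
```

Because the difference is divided by a small `dt`, any sample-to-sample noise in the position is scaled up by a factor of 50 in the derivative, which is one motivation for smoothing before differentiating.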

In [None]:
""" TODO
Fit both Pipelines to the data and transform the data
"""
baby_data1 = # TODO
baby_data2 = # TODO

""" TODO
Display the summary statistics for the pre-processed data
from both pipelines
"""


In [None]:
""" TODO
Display the first 10 values for the pre-processed data
from both pipelines
"""


In [None]:
""" TODO
Display the last 10 values for the pre-processed data
from both pipelines
"""


In [None]:
""" TODO
Construct plots comparing the raw data to the pre-processed data 
for each selected feature from both pipelines. For each selected 
feature, create a figure displaying the raw data and the cleaned 
data in the same subplot. The raw data should be shifted upwards 
to clearly observe where the gaps are filled in the cleaned data. 
There should be three subplots per feature figure. Each subplot 
is in a separate row.
    subplot(1) will compare the original raw data to the pipeline1 
               pre-processed data
    subplot(2) will compare the original raw data to the pipeline2 
               pre-processed data
    subplot(3) will compare pipeline1 to pipeline2. Set the x limit 
               to 45 and 55 seconds
For all subplots, include axis labels, legends and titles.
"""
xlim = [45, 55]
xsel_clean1 = baby_data1.values
xsel_clean2 = baby_data2.values

for f, fname in enumerate(selected_names):
    fig, axs = plt.subplots(3, 1)
    axs = axs.ravel()
    
    # PIPELINE 1
    # TODO

    # PIPELINE 2
    # TODO

    # PIPELINE 1 VS PIPELINE 2
    # TODO
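
The idea of shifting the raw trace upward can be sketched on synthetic data. Everything below (the sine signal, the position of the gap, and the 0.5 offset) is invented for illustration and shows a single axis only, not the three-subplot layout required above.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic signal sampled at 50 Hz with an artificial gap (not the baby data)
t = np.arange(0, 10, 0.02)
raw = np.sin(t)
raw[100:150] = np.nan
clean = pd.Series(raw).interpolate(limit_area="inside").ffill().bfill().values

offset = 0.5  # arbitrary vertical shift so the two traces do not overlap

fig, ax = plt.subplots()
ax.plot(t, raw + offset, label="raw (shifted up)")
ax.plot(t, clean, label="cleaned")
ax.set(xlabel="time (s)", ylabel="position", title="Raw vs. cleaned")
ax.legend()
```

Matplotlib leaves a break in the line wherever the data are NaN, so the gap in the shifted raw trace sits directly above the filled-in segment of the cleaned trace.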

In [None]:
""" TODO
Construct plots for each feature presenting the feature and its 
derivative from both pipelines. Each figure should have 
3 subplots:
    1: the pipeline1 feature data and corresponding derivative 
    2: the pipeline2 feature data and corresponding derivative
    3: pipeline1 derivative and pipeline2 derivative. Set the x limit 
       to 8 and 12 seconds.
For all subplots, include axis labels, legends and titles.
"""
xlim = [8, 12]

for f, fname in enumerate(selected_names):
    d_fname = 'd_' + fname
    fig, axs = plt.subplots(3, 1)
    #fig.subplots_adjust(hspace=.35)
    axs = axs.ravel()
    
    # PIPELINE 1
    # TODO

    # PIPELINE 2
    # TODO
    
    # DERIVATIVES
    # TODO
