# TSFRESH Human Activity Recognition Example
This example shows how to use [tsfresh](https://tsfresh.readthedocs.io/) to extract useful features from multiple time series and use them to improve classification performance.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from tsfresh.examples.har_dataset import download_har_dataset, load_har_dataset, load_har_classes
import seaborn as sns
from tsfresh import extract_features, extract_relevant_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import ComprehensiveFCParameters
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np

import logging

In [None]:
# We set the logger to ERROR level.
# This is not recommended for normal use, as you can overlook important warning messages.
logging.basicConfig(level=logging.ERROR)

## Load and visualize data
The dataset consists of time series for 7352 accelerometer readings. Each reading covers 2.56 seconds sampled at 50 Hz, for a total of 128 samples per reading. Each reading corresponds to one of six activities (walking, walking upstairs, walking downstairs, sitting, standing and laying).

For more information, or to fetch the dataset, go to https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
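
The window size follows directly from the sampling rate and duration quoted above. As a quick sanity check (a minimal sketch; the variable names are illustrative and not part of tsfresh or the dataset):

```python
# Sanity check of the window size quoted above (names are illustrative).
sampling_rate_hz = 50    # samples per second
window_seconds = 2.56    # duration of one reading

# round() guards against floating-point noise in the product
samples_per_reading = round(sampling_rate_hz * window_seconds)
print(samples_per_reading)
```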

In [None]:
# fetch dataset from uci
download_har_dataset()

In [None]:
df = load_har_dataset()
df.head()
df.shape

In [None]:
plt.title('accelerometer reading')
plt.plot(df.iloc[0, :])  # .ix was removed from pandas; .iloc selects the first row by position
plt.show()

## Extract Features

In [None]:
extraction_settings = ComprehensiveFCParameters()

In [None]:
# rearrange first 500 sensor readings column-wise, not row-wise

N = 500
master_df = pd.DataFrame({0: df[:N].values.flatten(),
                          1: np.arange(N).repeat(df.shape[1])})
master_df.head()
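
The cell above turns the wide table (one reading per row) into a two-column frame: one column holding all values and one holding the id of the reading each value belongs to, which is later passed as `column_id`. The following toy sketch mirrors that step on a made-up 3x4 frame; the names `toy`, `n_rows` and `long_df` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the HAR data: 3 readings (rows) of 4 samples each.
toy = pd.DataFrame(np.arange(12).reshape(3, 4))

n_rows = 3
long_df = pd.DataFrame({
    # all values, one reading after the other (row-major order)
    0: toy[:n_rows].values.flatten(),
    # id of the reading each value belongs to, repeated once per sample
    1: np.arange(n_rows).repeat(toy.shape[1]),
})

print(long_df)
```

Each id appears once per sample, so the result has `n_rows * toy.shape[1]` rows.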

In [None]:
%time X = extract_features(master_df, column_id=1, impute_function=impute, default_fc_parameters=extraction_settings);

In [None]:
X.shape

In [None]:
"Number of extracted features: {}.".format(X.shape[1])

## Train and evaluate classifier

In [None]:
y = load_har_classes()[:N]
y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [None]:
cl = DecisionTreeClassifier()
cl.fit(X_train, y_train)
print(classification_report(y_test, cl.predict(X_test)))

## Multiclass feature selection

The feature matrix contains several hundred features (see `X.shape[1]` above; the exact count depends on the tsfresh version).
We can try to select a subset of them with the `select_features` function of tsfresh.

However, it only works for binary classification or regression tasks.

For a six-label multiclass problem, we split the selection into six binary one-versus-all classification problems. For each of them we can run a binary classification feature selection:

In [None]:
relevant_features = set()

for label in y.unique():
    y_train_binary = y_train == label
    X_train_filtered = select_features(X_train, y_train_binary)
    print("Number of relevant features for class {}: {}/{}".format(label, X_train_filtered.shape[1], X_train.shape[1]))
    relevant_features = relevant_features.union(set(X_train_filtered.columns))
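
The loop above can be illustrated on toy data without running tsfresh. This is only a sketch: `y_toy` and the per-class feature sets in `selected_per_class` are made up and stand in for the output of `select_features`.

```python
import pandas as pd

# Toy labels standing in for y_train (three classes instead of six).
y_toy = pd.Series([1, 2, 3, 1, 2, 3])

# Hypothetical per-class selection results; in the notebook these come
# from select_features(X_train, y_train == label).
selected_per_class = {
    1: {"f_mean", "f_max"},
    2: {"f_max", "f_std"},
    3: {"f_mean"},
}

relevant = set()
binary_targets = {}
for label in y_toy.unique():
    # One-versus-all target: True for this class, False for all others
    binary_targets[int(label)] = y_toy == label
    # Accumulate the union of features relevant to any class
    relevant |= selected_per_class[int(label)]

print(sorted(relevant))
```

A feature is kept if it is relevant for at least one class, so the union can be larger than any single per-class selection.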

In [None]:
len(relevant_features)

We keep only the features selected above, for both the training and test sets.

In [None]:
X_train_filtered = X_train[list(relevant_features)]
X_test_filtered = X_test[list(relevant_features)]

In [None]:
X_train_filtered.shape, X_test_filtered.shape

So we reduced the number of features used from 794 to 263 (exact counts vary with the tsfresh version and the random train/test split).

In [None]:
cl = DecisionTreeClassifier()
cl.fit(X_train_filtered, y_train)
print(classification_report(y_test, cl.predict(X_test_filtered)))

It worked! Removing the irrelevant features improved the precision.

## Compare against naive classification accuracy
By extracting time-series features (as opposed to using the raw data points), we can meaningfully increase classification accuracy.

In [None]:
X_1 = df.iloc[:N, :]  # first N raw readings; .ix was removed from pandas
X_1.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_1, y, test_size=.2)

In [None]:
cl = DecisionTreeClassifier()
cl.fit(X_train, y_train)
print(classification_report(y_test, cl.predict(X_test)))

So both our unfiltered and filtered feature-based classifiers are able to beat the model trained on the raw time series values.