# Project 3: Activity Detection with Machine Learning

In this project, we will use a database containing activity data collected in Project 2.
The main steps include:
 - **Loading** the database into a Pandas dataframe
 - **Extracting features** for machine learning
 - **Training** different models to find the best activity detection model

Your assignment is to complete the analysis and submit the required files.

In [None]:
# Mount Google Drive (Optional - for Colab users)
from google.colab import drive
drive.mount('/content/drive')

# Define the working directory (update this path if needed)
data_path = '/content/drive/MyDrive/mHealthCourse/project-3'

## Submission Requirements
Your final submission should be a **single compressed file** containing:
 - **Completed Jupyter Notebook** (this file, with all outputs and analysis)
 - **PDF Report** with the required write-ups and explanations
 - **C Header File (`.h`)** containing the best-trained model ported to C

## Step 1. Load the Activity Database
Ensure that the dataset file is located in the correct directory, or update the file path accordingly.

In [None]:
import pandas as pd
import os

# Define the path to the dataset (update if necessary)
db_path = os.path.join(data_path, 'dataset_spring_2024.csv') 

# Load the dataset into a Pandas dataframe
df = pd.read_csv(db_path, dtype={'label': 'str', 'participant': 'str'})

# Remove unnecessary columns (e.g., timestamp if not needed)
df = df.drop(columns=['time_ms'])

# Display basic dataset information
df.info()

# Display the first 10 rows to inspect the dataset
df.head(10)

In [None]:
# include the dataset from 2023
db_path = os.path.join(data_path, 'dataset_winter_2023.csv')
df_2023 = pd.read_csv(db_path)

# clean the data to match this year's dataset
df_2023['label'] = df_2023['label'].str.lower().str.strip()
df_2023['label'] = df_2023['label'].str.replace('sitting/working with a computer', 'sitting')
df_2023['label'] = df_2023['label'].str.replace('sitting/working\xa0with a computer', 'sitting')
print(df_2023.label.unique())

df_2023['participant'] = df_2023['participant'].astype(str)

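# add 100 to each participant id so 2023 participants do not collide with this year's ids
def _reformat_participant_id(id_str):
    # e.g. 'P7' or 'p7' -> 107 -> 'P107' (avoid shadowing the built-in `id`)
    participant_num = int(id_str.lower().split('p')[-1])
    participant_num += 100
    return f'P{participant_num}'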
df_2023['participant'] = df_2023['participant'].apply(_reformat_participant_id)

df_2023 = df_2023.drop(columns=['time'])

df_2023.head()

In [None]:
# merge the database
df = pd.concat([df, df_2023], ignore_index=True)

# clean up
del df_2023

df.head()

## Step 2. Extract Features

Here we will select a window size and compute aggregations on the dataset. The code below creates a new dataframe of rolling-window statistics, computed separately within each participant and label group.

**Note**: this will take some time to compute, especially on a Google Colab instance. The more complex the aggregation calculations, the longer it will take.
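
To make the windowing concrete, here is a minimal sketch on a made-up toy dataframe. The names `toy`, `toy_window`, and `toy_features` and all values are assumptions for illustration only, and `min`/`max` are shown simply as examples of aggregations that pandas accepts here, not as recommended features.

```python
import pandas as pd

# Toy data: two participants, one label each, six samples per group (values are made up)
toy = pd.DataFrame({
    'participant': ['P1'] * 6 + ['P2'] * 6,
    'label': ['walking'] * 6 + ['sitting'] * 6,
    'accel_x': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})

toy_window = 3  # hypothetical window size, much smaller than the one used in this notebook

# Rolling windows are computed inside each (participant, label) group,
# so a window never mixes samples from different participants or activities.
toy_features = (
    toy.groupby(['participant', 'label'])
       .rolling(window=toy_window)
       .agg({'accel_x': ['mean', 'std', 'min', 'max']})
       .dropna()  # the first (window - 1) rows of each group have no full window
)

# Flatten the two-level column index into names such as 'accel_x_mean'
toy_features.columns = ['_'.join(col) for col in toy_features.columns]
toy_features = toy_features.reset_index(level=['participant', 'label'])

print(toy_features)
```

Each group of six samples yields four complete windows of size three, which is why the first rows of every group are dropped.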


### <span style="color:red">Task 2.1</span>
<span style="color:red">Add at least 4 additional features to the feature set. We have provided `mean` and `std` (standard deviation).</span>

In [None]:
# Define window size for feature extraction
window_size = 100 # Corresponds to approximately 1 second of data

############### edit code below ############### 
df_grouped = df.groupby(['participant', 'label']).rolling(window=window_size).agg({
    # 'time_ms': ['sum'],
    'accel_x': ['mean', 'std'],
    'accel_y': ['mean', 'std'],
    'accel_z': ['mean', 'std'],
    'gyr_x': ['mean', 'std'],
    'gyr_y': ['mean', 'std'],
    'gyr_z': ['mean', 'std']
}).reset_index().dropna()

# flatten column names so there are no column levels
df_grouped.columns = ['_'.join(col).strip() for col in df_grouped.columns.values]

# clean up columns
df_grouped.rename(columns={'participant_': 'participant', 'label_': 'label'}, inplace=True)
df_grouped.drop(columns=['level_2_'], inplace=True)

# # optional - save the database to avoid re-running the above code
# df_group_file_path = f'project2_class_dataset_grouped_w{window_size}.csv'
# df_grouped.to_csv(df_group_file_path, index=False)

# then read the csv file
# df_grouped = pd.read_csv(df_group_file_path)

df_grouped.head()

## Step 3. Feature Selection

Here we will find the *best* features to use in our model. This can be done by using a variety of techniques, including *forward selection*, *backward selection*, *L1 regularization*, or *Random Forest Importance*. See [feature selection in Python](https://scikit-learn.org/stable/modules/feature_selection.html) for more information.

We provide two methods for feature selection: *Univariate Feature Selection* and *Recursive Feature Elimination (RFE)*.

In [None]:
X = df_grouped.drop(columns=['participant', 'label']) # all columns except grouped columns
y = df_grouped['label']

### 3.1 Univariate Feature Selection

### <span style="color:red">Task 3.1</span>
<span style="color:red">Explain the process of univariate feature selection (~1 paragraph: what is the purpose of this method, and how does it work?). Which features have the most importance according to this method?</span>

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Perform univariate feature selection
k = X.columns.size # select all features
selector = SelectKBest(f_classif, k=k)
X_new = selector.fit_transform(X, y)

features_selected = selector.get_support(indices=True)
selected_feature_names = X.columns[features_selected]

# print names and f_scores of selected features
selected_features = pd.DataFrame({'feature': selected_feature_names, 'f_score': selector.scores_[features_selected]})
selected_features = selected_features.sort_values(by='f_score', ascending=False)
selected_features.head(15)
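
As a rough illustration of what the F-score measures, the sketch below calls `f_classif` directly on synthetic data. The feature names `informative` and `noise`, the class labels, and all values are made up for this example. The feature whose class means differ should receive a much larger F-score than the feature drawn from the same distribution for both classes.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n = 200
toy_y = np.repeat(['sitting', 'walking'], n // 2)

toy_X = pd.DataFrame({
    # class means differ by 3 standard deviations: informative
    'informative': np.where(toy_y == 'walking', 3.0, 0.0) + rng.normal(size=n),
    # identical distribution for both classes: uninformative
    'noise': rng.normal(size=n),
})

# f_classif scores each feature independently with a one-way ANOVA F-test
f_scores, p_values = f_classif(toy_X, toy_y)

print(pd.DataFrame({'feature': toy_X.columns, 'f_score': f_scores, 'p_value': p_values}))
```

Because each feature is scored on its own, this method cannot detect features that are only useful in combination with others.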

### 3.2 Recursive Feature Elimination (RFE)

Here we use RFE to select the best features. Unlike univariate feature selection, RFE selects features by recursively considering smaller and smaller sets of features. We choose how the importance of a feature is evaluated by setting the parameter `estimator`. In this case, we use a `RandomForestClassifier` to evaluate the importance of a feature.

**Note**: this will take some time to run (e.g., about 5.5 minutes on an M2 MacBook Pro).

### <span style="color:red">Task 3.2</span>
<span style="color:red">Explain the process of recursive feature elimination (~1 paragraph: what is the purpose of this method, and how does it work?). Which features have the most importance according to this method?</span>

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=5, # the number of trees in the forest
    random_state=42, # for reproducibility
    max_depth=5 # the maximum depth of the tree
)
rfe = RFE(estimator=model, n_features_to_select=4, step=1)

X_rfe = rfe.fit_transform(X, y)

selected_features_rfe = X.columns[rfe.support_]

print("Features selected by RFE:", selected_features_rfe)
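
Besides `support_`, a fitted `RFE` object exposes `ranking_`, where selected features have rank 1 and higher ranks were eliminated earlier. The sketch below shows this on synthetic data. All names (`toy_X`, `informative_a`, `noise_a`, and so on) and the estimator settings are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
n = 300
toy_y = np.repeat([0, 1, 2], n // 3)

toy_X = pd.DataFrame({
    'informative_a': toy_y + rng.normal(scale=0.3, size=n),   # tracks the class
    'informative_b': -toy_y + rng.normal(scale=0.3, size=n),  # tracks the class (inverted)
    'noise_a': rng.normal(size=n),                            # unrelated to the class
    'noise_b': rng.normal(size=n),                            # unrelated to the class
})

toy_rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    n_features_to_select=2,
    step=1,  # drop one feature per iteration
)
toy_rfe.fit(toy_X, toy_y)

# Rank 1 = kept; larger ranks were removed in earlier iterations
ranking = pd.Series(toy_rfe.ranking_, index=toy_X.columns).sort_values()
print(ranking)
```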

### <span style="color:red">Task 3.3: Additional Feature Selection Method</span>

<span style="color:red">Please implement an additional feature selection method of your choice. You can use any feature selection method from the [scikit-learn library](https://scikit-learn.org/stable/modules/feature_selection.html), or any other. In your report, write a short description of the method and the results, similar to the tasks above, and explain **why** you selected this method.</span>

In [None]:
## implement additional feature selection method here ...


### Task 3.4: Feature Selection

Choose the best features from the methods above and create a new dataframe with only those features. In your report, note which features you selected, and why.

In [None]:
# example features, you should choose your own.
selected_features = ['accel_x_std', 'accel_y_std', 'accel_z_std', 'gyr_x_std', 'gyr_y_std', 'gyr_z_std']
df_selected_features = df_grouped[['participant', 'label'] + selected_features]

df_selected_features.head()

## Step 4. Model Training & Selection

### 4.1 Train/Test Split by Percentage

We divide our dataset into a training set and a test set. We will train the model on the training set and evaluate it on the test set. We split by percentage: a randomly selected 33% of the data is held out for testing, and the remaining 67% is used for training.

In [None]:
from sklearn.model_selection import train_test_split

test_size = 0.33 # percentage of data for testing

X = df_selected_features[selected_features]
Y = df_selected_features['label']

# Split the dataset into training and testing sets
# Use the test_size defined above so the split matches the stated 33%
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=42)
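
To see what a 33% split looks like in numbers, here is a small sketch on made-up data. It also passes `stratify`, an optional `train_test_split` argument that is not used in this notebook, to keep class proportions similar in both sets. The arrays `toy_X` and `toy_y` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples with an imbalanced label distribution (70 / 30)
toy_X = np.arange(300).reshape(100, 3)
toy_y = np.array(['sitting'] * 70 + ['walking'] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.33,    # fraction of samples held out for testing
    random_state=42,   # fixed seed so the split is reproducible
    stratify=toy_y,    # optional: preserve class proportions in both sets
)

print('train size:', len(X_tr), 'test size:', len(X_te))
print('test label counts:', dict(zip(*np.unique(y_te, return_counts=True))))
```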

### 4.2 Model Training

Here we will train a series of models and evaluate them. We have provided two models: `DecisionTreeClassifier` and `GradientBoostingClassifier`.

### <span style="color:red">Task 4.2</span>
<span style="color:red">Please train 4 additional classifier models. For each model, including the ones we provide, include a brief description (~1 paragraph). For the models you choose, explain why you chose them. Please limit your selection to machine learning models; we will try out deep learning models in the next project. You are free to use any models from the [scikit-learn library](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning). You may also use others as long as they are not deep learning models (e.g., multi-layer neural networks).</span>

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Train multiple models and compare performance
############### edit code below ############### 
models = []
models.append(DecisionTreeClassifier(random_state=0, max_depth=5))
models.append(GradientBoostingClassifier())

In [None]:
import sklearn.metrics  # needed for accuracy_score, confusion_matrix, classification_report

results = []

############### edit code below ############### 
for model in models:
    print('training: ', model.__class__.__name__)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
    print(' - accuracy: ', accuracy)
    results.append(
        {
            'name': model.__class__.__name__,
            'accuracy': sklearn.metrics.accuracy_score(y_test, y_pred),
            'confusion_matrix': sklearn.metrics.confusion_matrix(y_test, y_pred),
            'classification_report': sklearn.metrics.classification_report(y_test, y_pred)
        }
    )

### <span style="color:red">Task 4.3: Model Reporting</span>
<span style="color:red">Report the results of each model. Which one performed the best? Please print the model results below, and include a short write-up in your report.</span>

In [None]:
# model results
# ...
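
One possible way to compare models is to collect one dictionary per model and turn the list into a dataframe sorted by accuracy. The sketch below does this on synthetic data. The two classifiers, the `toy_*` names, and the dataset are assumptions for illustration and are not a statement about which models to choose.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected features and labels
toy_X, toy_y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=42
)

toy_results = []
for toy_model in [DecisionTreeClassifier(max_depth=3, random_state=0), KNeighborsClassifier()]:
    toy_model.fit(tX_train, ty_train)
    toy_pred = toy_model.predict(tX_test)
    toy_results.append({
        'name': toy_model.__class__.__name__,
        'accuracy': accuracy_score(ty_test, toy_pred),
    })

# One row per model, best accuracy first
summary = (
    pd.DataFrame(toy_results)
      .sort_values('accuracy', ascending=False)
      .reset_index(drop=True)
)
print(summary)
```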

## Step 5. LOSO (Leave-One-Subject-Out) Experiment


We have XX participants in our dataset. In this experiment we will train our best model above, but instead of randomly splitting the data, we will use a Leave-One-Subject-Out (LOSO) method. This means we will train on all participants except one, and then evaluate the model on the left out participant. We will repeat this process for all participants and then average the results. We will then compare the results with the previous model evaluation.
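
The full repeat-and-average procedure can be sketched with scikit-learn's `LeaveOneGroupOut`, which yields one train/test split per group. This is only an illustration on synthetic data: the three participants, the `toy_*` arrays, and the choice of classifier are assumptions, and the tasks below ask you to build the split yourself.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Three made-up participants, 40 samples each, two activity classes
toy_groups = np.repeat(['P1', 'P2', 'P3'], 40)
toy_y = np.tile(np.repeat([0, 1], 20), 3)
toy_X = toy_y[:, None] + rng.normal(scale=0.5, size=(120, 2))

fold_scores = {}
for train_idx, test_idx in LeaveOneGroupOut().split(toy_X, toy_y, groups=toy_groups):
    held_out = str(toy_groups[test_idx][0])  # the single participant in this test fold
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(toy_X[train_idx], toy_y[train_idx])  # train on all other participants
    fold_scores[held_out] = clf.score(toy_X[test_idx], toy_y[test_idx])

print(fold_scores)
print('mean LOSO accuracy:', np.mean(list(fold_scores.values())))
```

Every test fold contains data from exactly one participant, and none of that participant's data appears in the corresponding training fold.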

### <span style="color:red">Task 5.1: LOSO</span>
<span style="color:red">Create the variables `X_test`, `y_test`, `X_train`, `y_train` for the LOSO experiment.</span>

In [None]:
import random

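participants = df_selected_features['participant'].unique()
num_participants = participants.size
# pick one participant at random to hold out
# (randrange excludes the upper bound; randint includes it and could index out of range)
test_participant = participants[random.randrange(num_participants)]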

# implement code here to create X_train, y_train, X_test, y_test for LOSO
# ...

### <span style="color:red">Task 5.2: Train Model w/ LOSO</span>
<span style="color:red">Train the best model with the LOSO method. Report the results and compare them with the previous model evaluation. Write a short explanation of your observations. Why do you think it performs better or worse?
</span>

In [None]:
# train and report results
# ...

### <span style="color:red">Task 5.3 Cross Validation</span>
<span style="color:red">In the interest of time, we will not implement cross-validation. But please include a short explanation of how you would implement cross-validation, and why it is important. (~1 paragraph)</span>

## Step 6. Port Model

We will now port the model to C, for use in Arduino. Please include the C file in your submission, but we will not run it on the embedded device for this assignment. In the next assignment, we will run the model on the Arduino.

### 6.1 Install `micromlgen`

[micromlgen](https://github.com/eloquentarduino/micromlgen) is an open-source project that generates C code from your sklearn models.

In [None]:
%pip install micromlgen

### <span style="color:red">Task 6.2: Port Best Model and save</span>

In [None]:
from micromlgen import port

# Select the best performing model.
# Note: models[0] is the DecisionTreeClassifier; change the index to match your best model.
c_code = port(models[0])

print(c_code)

# save/port the model to a directory
# ...
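
Since `port` returns the generated code as a string (it is printed above), saving it comes down to writing that string to a `.h` file. A minimal sketch, using a stand-in string so it runs without a trained model: the output directory, the file name `model.h`, and the placeholder content are all assumptions.

```python
import tempfile
from pathlib import Path

# Stand-in for the string returned by port(); the real generated code will differ
c_code_example = "#pragma once\n// generated classifier code would go here\n"

out_dir = Path(tempfile.mkdtemp())  # hypothetical output directory; use your project folder instead
header_path = out_dir / 'model.h'   # hypothetical file name

# Write the generated source to disk as a C header
header_path.write_text(c_code_example)

print('wrote', header_path, 'with', header_path.stat().st_size, 'bytes')
```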

### <span style="color:red">Task 6.3: Model Review</span>
<span style="color:red">Take a look at the C code. What do you notice? Do you think the model will run on our microcontrollers? Why or why not? Please include a brief write-up in your report.</span>