# Project 3: Activity Detection with Machine Learning

In this project, we will use a database containing activity data collected in Project 2.
The main steps include:
 - **Loading** the database into a Pandas dataframe
 - **Extracting features** for machine learning
 - **Training** different models to find the best activity detection model

Your assignment is to complete the analysis and submit the required files.

In [1]:

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Mount Google Drive (Optional - for Colab users)
# from google.colab import drive
# drive.mount('/content/drive')

# Define the working directory (update this path if needed)
data_path = '/content/drive/MyDrive/mhealth/'

## Submission Requirements
Your final submission should be a **single compressed file** containing:
 - **Completed Jupyter Notebook** (this file, with all outputs and analysis)
 - **PDF Report** with the required write-ups and explanations
 - **C Header File (`.h`)** containing the best-trained model ported to C

## Step 1. Load the Activity Database
Ensure that the dataset file is located in the correct directory, or update the file path accordingly.

In [3]:
import pandas as pd
import os

# Define the path to the dataset (update if necessary)
db_path = os.path.join(data_path, 'dataset_spring_2024.csv')

# Load the dataset into a Pandas dataframe
df = pd.read_csv(db_path, dtype={'label': 'str', 'participant': 'str'})

# Remove unnecessary columns (e.g., timestamp if not needed)
df = df.drop(columns=['time_ms'])

# Display basic dataset information
df.info()

# Display the first 10 rows to inspect the dataset
df.head(10)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1768519 entries, 0 to 1768518
Data columns (total 8 columns):
 #   Column       Dtype  
---  ------       -----  
 0   participant  object 
 1   accel_x      float64
 2   accel_y      float64
 3   accel_z      float64
 4   gyr_x        float64
 5   gyr_y        float64
 6   gyr_z        float64
 7   label        object 
dtypes: float64(6), object(2)
memory usage: 107.9+ MB


Unnamed: 0,participant,accel_x,accel_y,accel_z,gyr_x,gyr_y,gyr_z,label
0,P1,-0.1,-0.95,0.35,1.68,-3.08,-1.54,sitting
1,P1,-0.1,-0.94,0.36,2.03,-3.15,-1.47,sitting
2,P1,-0.1,-0.94,0.36,2.52,-3.22,-1.54,sitting
3,P1,-0.1,-0.94,0.36,2.52,-3.15,-1.61,sitting
4,P1,-0.1,-0.94,0.36,2.73,-3.01,-1.61,sitting
5,P1,-0.09,-0.94,0.37,2.87,-2.8,-1.82,sitting
6,P1,-0.09,-0.94,0.37,2.73,-2.45,-1.68,sitting
7,P1,-0.09,-0.94,0.37,2.8,-2.31,-1.75,sitting
8,P1,-0.08,-0.94,0.37,2.38,-2.17,-1.82,sitting
9,P1,-0.08,-0.94,0.37,1.82,-2.1,-1.75,sitting


In [4]:
# include the dataset from 2023
db_path = os.path.join(data_path, 'dataset_winter_2023.csv')
df_2023 = pd.read_csv(db_path)

# clean the data to match this year's dataset
df_2023['label'] = df_2023['label'].str.lower().str.strip()
df_2023['label'] = df_2023['label'].str.replace('sitting/working with a computer', 'sitting')
df_2023['label'] = df_2023['label'].str.replace('sitting/working\xa0with a computer', 'sitting')
print(df_2023.label.unique())

df_2023['participant'] = df_2023['participant'].astype(str)

# add 100 to each 2023 participant ID so it cannot collide with this year's IDs
def _reformat_participant_id(id_str):
    participant_num = int(id_str.lower().split('p')[-1])  # avoid shadowing the built-in `id`
    return f'P{participant_num + 100}'
df_2023['participant'] = df_2023['participant'].apply(_reformat_participant_id)

df_2023 = df_2023.drop(columns=['time'])

df_2023.head()

['sitting' 'standing' 'walking' 'ascending stairs' 'descending stairs'
 'jumping' 'running' 'dancing' 'rest']


Unnamed: 0,participant,accel_x,accel_y,accel_z,gyr_x,gyr_y,gyr_z,label
0,P119,-0.36,-0.94,-0.01,1.33,-1.75,0.35,sitting
1,P119,-0.36,-0.94,-0.01,1.47,-1.68,0.56,sitting
2,P119,-0.36,-0.94,-0.01,1.26,-1.89,0.14,sitting
3,P119,-0.36,-0.94,-0.01,1.05,-1.75,0.63,sitting
4,P119,-0.36,-0.94,-0.01,1.26,-1.89,0.42,sitting


In [5]:
# merge the database
df = pd.concat([df, df_2023], ignore_index=True)

# clean up
del df_2023

df.head()

Unnamed: 0,participant,accel_x,accel_y,accel_z,gyr_x,gyr_y,gyr_z,label
0,P1,-0.1,-0.95,0.35,1.68,-3.08,-1.54,sitting
1,P1,-0.1,-0.94,0.36,2.03,-3.15,-1.47,sitting
2,P1,-0.1,-0.94,0.36,2.52,-3.22,-1.54,sitting
3,P1,-0.1,-0.94,0.36,2.52,-3.15,-1.61,sitting
4,P1,-0.1,-0.94,0.36,2.73,-3.01,-1.61,sitting


## Step 2. Extract Features

Here we will select a window size and compute aggregations on the dataset. The code below builds a new dataframe of rolling-window aggregate statistics, computed separately for each participant and label.

**Note**: this will take some time to compute, especially on a Google Colab instance. The more complex the aggregation calculations, the longer it will take.
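To make the rolling-window idea concrete, here is a minimal dependency-free sketch of what a rolling `mean` and `std` compute for a single signal. The input values and the helper name `rolling_features` are made up for illustration; the notebook itself uses pandas `rolling(...).agg(...)`, which also uses the sample standard deviation.

```python
from statistics import mean, stdev

def rolling_features(values, window):
    """Return one (mean, std) pair for every full window of `window` samples."""
    features = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]   # the last `window` samples up to `end`
        features.append((mean(chunk), stdev(chunk)))
    return features

# Illustrative accelerometer readings only, not taken from the dataset
accel_x = [-0.10, -0.10, -0.10, -0.09, -0.09, -0.08]
feats = rolling_features(accel_x, window=3)

for m, s in feats:
    print(f"mean={m:.4f} std={s:.4f}")
```

A signal of length `n` yields `n - window + 1` full windows. In pandas, the first `window - 1` rows of each group are `NaN` and are removed by `dropna()`, which is why the output of the feature extraction cell starts at index 99 for `window_size = 100`.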


### <span style="color:red">Task 2.1</span>
<span style="color:red">Add at least 4 additional features to the feature set. We have provided `mean` and `std` (standard deviation)</span>

In [6]:
# Define window size for feature extraction
window_size = 100 # Corresponds to approximately 1 second of data

############### edit code below ###############
df_grouped = df.groupby(['participant', 'label']).rolling(window=window_size).agg({
    # 'time_ms': ['sum'],
    'accel_x': ['mean', 'std', 'min', 'max', 'median', 'skew'],
    'accel_y': ['mean', 'std', 'min', 'max', 'median', 'skew'],
    'accel_z': ['mean', 'std', 'min', 'max', 'median', 'skew'],
    'gyr_x': ['mean', 'std', 'min', 'max', 'median', 'skew'],
    'gyr_y': ['mean', 'std', 'min', 'max', 'median', 'skew'],
    'gyr_z': ['mean', 'std', 'min', 'max', 'median', 'skew']
}).reset_index().dropna()

# flatten column names so there are no column levels
df_grouped.columns = ['_'.join(col).strip() for col in df_grouped.columns.values]

# clean up columns
df_grouped.rename(columns={'participant_': 'participant', 'label_': 'label'}, inplace=True)
df_grouped.drop(columns=['level_2_'], inplace=True)

# # optional - save the database to avoid re-running the above code
# df_group_file_path = f'project2_class_dataset_grouped_w{window_size}.csv'
# df_grouped.to_csv(df_group_file_path, index=False)

# then read the csv file
# df_grouped = pd.read_csv(df_group_file_path)

df_grouped.head()

Unnamed: 0,participant,label,accel_x_mean,accel_x_std,accel_x_min,accel_x_max,accel_x_median,accel_x_skew,accel_y_mean,accel_y_std,...,gyr_y_min,gyr_y_max,gyr_y_median,gyr_y_skew,gyr_z_mean,gyr_z_std,gyr_z_min,gyr_z_max,gyr_z_median,gyr_z_skew
99,P1,ascending stairs,-0.8253,0.177869,-1.27,-0.53,-0.775,-0.718442,-0.5937,0.283623,...,-159.6,69.79,-43.715,-0.399055,-1.7871,40.846795,-102.06,53.48,16.59,-0.688015
100,P1,ascending stairs,-0.8254,0.177846,-1.27,-0.53,-0.78,-0.717088,-0.5904,0.282056,...,-159.6,69.79,-44.275,-0.369833,-1.2355,40.650189,-102.06,53.48,16.59,-0.728099
101,P1,ascending stairs,-0.8251,0.17788,-1.27,-0.53,-0.78,-0.721753,-0.5888,0.281939,...,-159.6,69.79,-44.695,-0.356676,-0.6027,40.319882,-102.06,53.48,16.59,-0.771554
102,P1,ascending stairs,-0.8249,0.17792,-1.27,-0.53,-0.775,-0.724572,-0.5875,0.282047,...,-159.6,69.79,-44.695,-0.363526,0.1239,39.762283,-102.06,53.48,16.59,-0.812405
103,P1,ascending stairs,-0.8242,0.177998,-1.27,-0.53,-0.77,-0.735353,-0.5869,0.282277,...,-159.6,69.79,-44.695,-0.368161,0.8106,39.170303,-102.06,53.48,16.59,-0.850316


## Step 3. Feature Selection

Here we will find the *best* features to use in our model. This can be done by using a variety of techniques, including *forward selection*, *backward selection*, *L1 regularization*, or *Random Forest Importance*. See [feature selection in Python](https://scikit-learn.org/stable/modules/feature_selection.html) for more information.

We provide two methods for feature selection, *Univariate Feature Selection* and *Recursive Feature Elimination (RFE)*.

In [7]:
X = df_grouped.drop(columns=['participant', 'label']) # all columns except grouped columns
y = df_grouped['label']

### 3.1 Univariate Feature Selection

### <span style="color:red">Task 3.1</span>
<span style="color:red">Explain the process of univariate feature selection (~1 paragraph: what is the purpose of this method? How does it work?). Which features have the most importance according to this method?</span>

In [8]:
from sklearn.feature_selection import SelectKBest, f_classif

# Perform univariate feature selection
k = X.columns.size # select all features
selector = SelectKBest(f_classif, k=k)
X_new = selector.fit_transform(X, y)

features_selected = selector.get_support(indices=True)
selected_feature_names = X.columns[features_selected]

# print names and f_scores of selected features
selected_features = pd.DataFrame({'feature': selected_feature_names, 'f_score': selector.scores_[features_selected]})
selected_features = selected_features.sort_values(by='f_score', ascending=False)
selected_features.head(15)

Unnamed: 0,feature,f_score
1,accel_x_std,414149.491084
7,accel_y_std,274139.330284
13,accel_z_std,266514.200351
25,gyr_y_std,260368.015552
31,gyr_z_std,251370.765099
19,gyr_x_std,222869.245764
14,accel_z_min,175595.507974
26,gyr_y_min,174357.306896
33,gyr_z_max,171594.878481
27,gyr_y_max,165867.704992


### Task 3.1 Answer

Univariate feature selection is a statistical method that evaluates each feature independently to measure its individual relationship with the target variable. The code above uses `SelectKBest` with the ANOVA F-test (`f_classif`) to compute an F-score for every feature in the dataset and ranks the features by that score. In effect, the method tests how well each feature on its own separates the classes: a higher score means the class means differ more relative to the spread within each class. After fitting the selector on the features `X` and target `y`, the features are sorted by their F-scores, and the output shows that **`accel_x_std`** has the highest F-score (**414,149.49**), making it the most important feature according to this univariate approach.
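The F-score itself is the ratio of between-class variance to within-class variance. The sketch below computes it by hand for one feature, using small made-up groups (one list per class); `anova_f_score` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def anova_f_score(groups):
    """One-way ANOVA F statistic for one feature, given its values split by class."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)

    # Variation of the class means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own class mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up values: three classes with clearly different means...
f_separated = anova_f_score([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
# ...versus three classes whose means are identical
f_overlapping = anova_f_score([[1, 2, 3], [2, 3, 1], [3, 1, 2]])

print(f_separated, f_overlapping)
```

A feature whose class means are far apart relative to the within-class spread gets a large score, while a feature whose class means coincide scores zero.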



### 3.2 Recursive Feature Elimination (RFE)

Here we use RFE to select the best features. Unlike univariate feature selection, RFE selects features by recursively considering smaller and smaller sets of features. We choose how the importance of a feature is evaluated by setting the parameter `estimator`. In this case, we use a `RandomForestClassifier`.

**Note**: this will take some time to run. (ex: 5.5mins on M2 Macbook Pro)

### <span style="color:red">Task 3.2</span>
<span style="color:red">Explain the process of recursive feature elimination (~1 paragraph: what is the purpose of this method? How does it work?). Which features have the most importance according to this method?</span>

In [9]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import time

# Start timing
start_time = time.time()

model = RandomForestClassifier(
    n_estimators=5,  # the number of trees in the forest
    random_state=42,  # for reproducibility
    max_depth=5,  # the maximum depth of the tree
    n_jobs=-1  # Use all available cores for parallelization
)

rfe = RFE(
    estimator=model,
    n_features_to_select=4,  # Keep target of 4 features
    step=0.3,  # Remove 30% of features at each iteration instead of just 1
    verbose=1   # Add verbosity to see progress
)

# Fit RFE
X_rfe = rfe.fit_transform(X, y)

# Get selected features
selected_features_rfe = X.columns[rfe.support_]

# Get ranking of all features (1 is best)
feature_ranking = pd.DataFrame({
    'Feature': X.columns,
    'Ranking': rfe.ranking_
})
feature_ranking = feature_ranking.sort_values('Ranking')

# End timing
end_time = time.time()
runtime = end_time - start_time

print(f"RFE completed in {runtime:.2f} seconds")
print("\nFeatures selected by RFE:", selected_features_rfe.tolist())
print("\nFeature Rankings (top 10):")
print(feature_ranking.head(10))

# As a fallback, also calculate direct feature importance
# This is much faster and often gives similar results
print("\nDirect Feature Importance from RandomForest:")
direct_model = RandomForestClassifier(
    n_estimators=10,
    random_state=42,
    max_depth=5,
    n_jobs=-1
)
direct_model.fit(X, y)

# Get feature importances
importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': direct_model.feature_importances_
})
importances = importances.sort_values('Importance', ascending=False)

print(importances.head(10))

Fitting estimator with 36 features.
Fitting estimator with 26 features.
Fitting estimator with 16 features.
Fitting estimator with 6 features.
RFE completed in 124.13 seconds

Features selected by RFE: ['accel_x_std', 'accel_z_std', 'accel_z_min', 'accel_z_max']

Feature Rankings (top 10):
           Feature  Ranking
13     accel_z_std        1
1      accel_x_std        1
15     accel_z_max        1
14     accel_z_min        1
31       gyr_z_std        2
7      accel_y_std        2
24      gyr_y_mean        3
32       gyr_z_min        3
34    gyr_z_median        3
16  accel_z_median        3

Direct Feature Importance from RandomForest:
         Feature  Importance
13   accel_z_std    0.096195
1    accel_x_std    0.079252
21     gyr_x_max    0.068131
7    accel_y_std    0.062602
2    accel_x_min    0.059013
32     gyr_z_min    0.057935
25     gyr_y_std    0.054514
33     gyr_z_max    0.049660
15   accel_z_max    0.047877
12  accel_z_mean    0.040771


### Task 3.2 Answer

Recursive Feature Elimination (RFE) is a backward selection technique that identifies the most relevant features by repeatedly fitting a model (in this case, a `RandomForestClassifier`) and eliminating the least important features at each step until the desired number of features remains. The process starts by training the model on all features and ranking them by importance, as determined by the estimator. It then removes the weakest features (one at a time by default, or more per round depending on the `step` parameter) and retrains the model on what is left. This iterative pruning reduces the dimensionality of the dataset while retaining only the features that contribute most to the predictive power of the model.
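The effect of the `step` parameter can be worked through with plain arithmetic. The sketch below assumes that a fractional `step` is interpreted as a share of the original feature count, rounded down, and that the last round removes only as many features as needed to reach the target. That assumption is consistent with the "Fitting estimator with ... features" log above, but `rfe_fit_sizes` is a hypothetical helper, not scikit-learn code.

```python
def rfe_fit_sizes(n_features, n_to_select, step):
    """Return the feature counts at which the estimator is refit, plus the final count."""
    # Assumption: a fractional step is a share of the ORIGINAL feature count
    per_round = int(max(1, step * n_features)) if 0 < step < 1 else int(step)

    sizes = []
    remaining = n_features
    while remaining > n_to_select:
        sizes.append(remaining)
        # Never remove more features than needed to reach the target
        remaining -= min(per_round, remaining - n_to_select)
    return sizes, remaining

sizes, final = rfe_fit_sizes(n_features=36, n_to_select=4, step=0.3)
print(sizes, final)
```

With 36 features, `step=0.3`, and a target of 4, this gives fits at 36, 26, 16, and 6 features, so the estimator is trained four times instead of the 32 times that `step=1` would require.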

In our implementation, we target 4 features and remove features in steps of 30% (`step=0.3`) rather than one at a time. The features selected by our implementation are:

- `accel_x_std`
- `accel_z_std`
- `accel_z_min`
- `accel_z_max`

With the below feature ranking of top 10:

| Feature | Ranking |
|---------|---------|
| accel_z_std | 1 |
| accel_x_std | 1 |
| accel_z_max | 1 |
| accel_z_min | 1 |
| gyr_z_std | 2 |
| accel_y_std | 2 |
| gyr_y_mean | 3 |
| gyr_z_min | 3 |
| gyr_z_median | 3 |
| accel_z_median | 3 |


Direct Feature Importance from RandomForest:

| Feature | Importance |
|---------|------------|
| accel_z_std | 0.096195 |
| accel_x_std | 0.079252 |
| gyr_x_max | 0.068131 |
| accel_y_std | 0.062602 |
| accel_x_min | 0.059013 |
| gyr_z_min | 0.057935 |
| gyr_y_std | 0.054514 |
| gyr_z_max | 0.049660 |
| accel_z_max | 0.047877 |
| accel_z_mean | 0.040771 |



### <span style="color:red">Task 3.3: Additional Feature Selection Method</span>

<span style="color:red">Please implement an additional feature selection method of your choice. You can use any feature selection method from the [scikit-learn library](https://scikit-learn.org/stable/modules/feature_selection.html), or any others. In your report, write a short description of the method and the results, similar to above, but include **why** you selected this method.</span>

In [10]:
# Extremely Fast Feature Selection - Decision Tree based
import pandas as pd
import numpy as np
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel

# Start timing
start_time = time.time()

# Use a simple decision tree classifier - extremely fast
tree = DecisionTreeClassifier(
    max_depth=5,           # Limit tree depth
    random_state=42
)

# Fit the tree
tree.fit(X, y)

# Get feature importances directly - this is almost instantaneous
feature_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': tree.feature_importances_
})

# Sort by importance
feature_importances = feature_importances.sort_values('Importance', ascending=False)

# Select top features using SelectFromModel
selector = SelectFromModel(tree, prefit=True, threshold='mean')
X_selected = selector.transform(X)
selected_features = X.columns[selector.get_support()]

# End timing
end_time = time.time()
runtime = end_time - start_time

print(f"Decision Tree feature selection completed in {runtime:.2f} seconds")
print(f"\nNumber of features selected: {len(selected_features)} out of {X.shape[1]}")
print("\nFeatures selected by Decision Tree:")
print(selected_features.tolist())
print("\nTop 10 features by importance:")
print(feature_importances.head(10))

Decision Tree feature selection completed in 72.57 seconds

Number of features selected: 6 out of 36

Features selected by Decision Tree:
['accel_x_std', 'accel_y_median', 'accel_z_mean', 'accel_z_median', 'gyr_y_min', 'gyr_z_std']

Top 10 features by importance:
           Feature  Importance
1      accel_x_std    0.521243
12    accel_z_mean    0.155934
16  accel_z_median    0.057751
31       gyr_z_std    0.052585
10  accel_y_median    0.046958
26       gyr_y_min    0.033892
7      accel_y_std    0.024079
30      gyr_z_mean    0.016606
35      gyr_z_skew    0.015050
3      accel_x_max    0.012388




This decision tree-based selection gives a different perspective from the earlier methods:

- It surfaces additional features, such as `accel_y_median` and `accel_z_mean`, that were not as prominent in the other methods
- It confirms the high importance of `accel_x_std`, which was also top-ranked in univariate selection
- It provides an alternative subset that includes both accelerometer and gyroscope features
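The count of six selected features follows from the `threshold='mean'` setting. The sketch below reproduces that cut from the importances printed above, assuming that the importances of a fitted tree sum to 1 (so their mean over 36 features is 1/36) and that a feature is kept when its importance is at least the threshold. Features outside the printed top 10 have lower importances still, so they cannot pass.

```python
n_features = 36

# Top-10 importances copied from the output above
top_importances = {
    "accel_x_std": 0.521243,
    "accel_z_mean": 0.155934,
    "accel_z_median": 0.057751,
    "gyr_z_std": 0.052585,
    "accel_y_median": 0.046958,
    "gyr_y_min": 0.033892,
    "accel_y_std": 0.024079,
    "gyr_z_mean": 0.016606,
    "gyr_z_skew": 0.015050,
    "accel_x_max": 0.012388,
}

# Assumption: importances sum to 1, so the mean importance is 1 / n_features
threshold = 1.0 / n_features

selected = sorted(name for name, imp in top_importances.items() if imp >= threshold)
print(f"threshold = {threshold:.4f}")
print(selected)
```

The threshold works out to about 0.0278, which `gyr_y_min` (0.0339) clears and `accel_y_std` (0.0241) narrowly misses.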

### Task 3.4: Feature Selection

Choose the best features from the methods above and create a new dataframe with only those features. In your report, note which features you selected, and why.

In [11]:
# Selected features based on multiple feature selection methods
selected_features = ['accel_x_std', 'accel_y_std', 'gyr_z_std',
                    'accel_z_min', 'accel_z_max', 'accel_z_std']

# Create new dataframe with participant, label, and selected features
df_selected_features = df_grouped[['participant', 'label'] + selected_features]

# Display the first few rows to verify
df_selected_features.head()

Unnamed: 0,participant,label,accel_x_std,accel_y_std,gyr_z_std,accel_z_min,accel_z_max,accel_z_std
99,P1,ascending stairs,0.177869,0.283623,40.846795,-0.59,0.52,0.234394
100,P1,ascending stairs,0.177846,0.282056,40.650189,-0.59,0.52,0.233043
101,P1,ascending stairs,0.17788,0.281939,40.319882,-0.59,0.52,0.231738
102,P1,ascending stairs,0.17792,0.282047,39.762283,-0.59,0.52,0.229859
103,P1,ascending stairs,0.177998,0.282277,39.170303,-0.59,0.52,0.229644


After analyzing the results from all three feature selection methods (univariate selection, recursive feature elimination, and decision tree importance), we selected features that appear consistently across multiple methods. The six features used in the code above are:

- **accel_x_std** - Top-ranked by univariate selection and by the decision tree, and among the rank-1 features in RFE, making it the most important feature overall
- **accel_z_std** - Ranked highly by both univariate selection and RFE
- **accel_y_std** - Second-highest F-score in univariate selection and rank 2 in RFE
- **gyr_z_std** - Ranked highly by both univariate selection and decision tree selection
- **accel_z_min/accel_z_max** - These related features were both selected by RFE

**accel_z_mean/accel_z_median** also stood out in decision tree selection, but they are not part of the final list above.

## Step 4. Model Training & Selection

### 4.1 Train/Test Split by Percentage

We divide the dataset into a training set and a test set, train each model on the training set, and evaluate it on the test set. Here the split is by percentage: rows are assigned at random, with 33% of the data held out for testing and the remaining 67% used for training.

In [26]:
from sklearn.model_selection import train_test_split

test_size = 0.33 # fraction of the data held out for testing

X = df_selected_features[selected_features]
Y = df_selected_features['label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=42) # test_size changed from 0.2 to 0.33

### 4.2 Model Training

Here we will train a series of models and evaluate them. We have provided two models: `Decision Tree Classifier`, and `GradientBoostingClassifier`.

### <span style="color:red">Task 4.2</span>
<span style="color:red">Please train 4 additional classifier models. For each model, including the ones we provide, include a brief description (~1 paragraph). For the models you choose, explain why you chose them. Please limit your selection to machine learning models; we will try out deep learning models in the next project. You are free to use any models from the [scikit-learn library](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning), and you may also use others as long as they are not deep learning models (e.g., multi-layer neural networks).</span>

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMClassifier
#import xgboost as xgb


# Train multiple models and compare performance
models = []
# Original models
models.append(DecisionTreeClassifier(random_state=0, max_depth=5))
models.append(GradientBoostingClassifier(
    n_estimators=25,     # Reduced from 50 to 25
    max_depth=2,         # Reduced from 3 to 2
    min_samples_split=5, # Require more samples to split
    subsample=0.8,       # Use 80% of samples for each tree
    random_state=0
))
# Additional models
models.append(RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0))
#models.append(LinearSVC(random_state=0, max_iter=1000, dual=False))
models.append(KNeighborsClassifier(n_neighbors=5))
#models.append(LogisticRegression(max_iter=1000, random_state=0))
models.append(KNeighborsClassifier(n_neighbors=5,algorithm='ball_tree', leaf_size=30, weights='distance',  n_jobs=-1 ))
models.append(LGBMClassifier(n_estimators=100,num_leaves=31,learning_rate=0.1,random_state=0,n_jobs=-1))
models.append(LogisticRegression( max_iter=2000,solver='liblinear',C=0.1,penalty='l1',random_state=0))
#models.append(SVC( kernel='rbf',C=10,gamma='scale', random_state=0))
# models.append(xgb.XGBClassifier(n_estimators=100, max_depth=3,learning_rate=0.1,random_state=0))
# models.append(ExtraTreesClassifier(n_estimators=100,max_depth=None,min_samples_split=2,random_state=0))
# models.append(MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000,alpha=0.0001,solver='adam',random_state=0))


For each model, here's a brief description and rationale for selection:

1. **Decision Tree Classifier**: A non-parametric supervised learning method that creates a model that predicts the target by learning simple decision rules inferred from the data features. It's intuitive, easy to interpret, and serves as a good baseline for comparison.

2. **Gradient Boosting Classifier**: An ensemble technique that builds trees sequentially, with each tree correcting the errors of its predecessors. It's powerful for capturing complex patterns and often achieves high accuracy on structured data like sensor readings.

3. **Random Forest Classifier**: An ensemble of decision trees that reduces overfitting through bagging and feature randomization. We chose this because it works well with the standard deviation features in our dataset and is robust to the outliers common in sensor data.

4. **Support Vector Machine (SVC)**: A powerful classifier that finds the optimal hyperplane to separate different classes in high-dimensional space. We selected this because it can capture non-linear relationships in sensor data through kernel functions and often performs well in activity recognition tasks. Note that the `SVC` line is commented out in the cell above, so no results are reported for it.

5. **K-Nearest Neighbors**: A simple yet effective instance-based learning algorithm that classifies based on similarity measures. This is particularly suitable for our accelerometer and gyroscope data, as activity patterns often cluster together in feature space.

6. **Logistic Regression**: A linear model that estimates probabilities for classification. We included this because it provides good interpretability of feature importance and serves as a useful baseline to compare against more complex models.

In [14]:
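import time
import sklearn.metrics  # import the submodule explicitly; `import sklearn` alone does not guarantee it is loaded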


results = []

############### edit code below ###############
for model in models:
    print('training: ', model.__class__.__name__)

    # Measure training time
    train_start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - train_start
    print(f' - Training time: {train_time:.4f} seconds')


    # Measure prediction time
    pred_start = time.time()
    y_pred = model.predict(X_test)
    pred_time = time.time() - pred_start
    print(f' - Prediction time: {pred_time:.4f} seconds')


    accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
    print(' - accuracy: ', accuracy)
    results.append(
        {
            'name': model.__class__.__name__,
            'accuracy': sklearn.metrics.accuracy_score(y_test, y_pred),
            'confusion_matrix': sklearn.metrics.confusion_matrix(y_test, y_pred),
            'classification_report': sklearn.metrics.classification_report(y_test, y_pred)
        }
    )
    print('done')

training:  DecisionTreeClassifier
 - Training time: 17.1304 seconds
 - Prediction time: 0.0912 seconds
 - accuracy:  0.49143460146913714


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


done
training:  GradientBoostingClassifier
 - Training time: 864.1847 seconds
 - Prediction time: 3.1374 seconds
 - accuracy:  0.515949914117604
done
training:  RandomForestClassifier
 - Training time: 154.4312 seconds
 - Prediction time: 3.6341 seconds
 - accuracy:  0.5162388535613786
done
training:  KNeighborsClassifier
 - Training time: 8.1605 seconds
 - Prediction time: 72.7865 seconds
 - accuracy:  0.813651189562548
done
training:  KNeighborsClassifier
 - Training time: 7.3804 seconds
 - Prediction time: 307.9182 seconds
 - accuracy:  0.8390264682235135
done
training:  LGBMClassifier
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.108732 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1518
[LightGBM] [Info] Number of data points in the train set: 1777763, number of used features: 6
[LightGBM] [Info] Start training from score -2.532183
[LightGBM] [Info] Start training from score -2.537128
[Light

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


done
training:  XGBClassifier


ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2 3 4 5 6 7 8], got ['ascending stairs' 'dancing' 'descending stairs' 'jumping' 'rest'
 'running' 'sitting' 'standing' 'walking']
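The `ValueError` above arises because `XGBClassifier` expects integer class labels `0..n-1` rather than strings. One common workaround is to map each label to an integer before fitting and map predictions back afterwards. The dependency-free sketch below shows that mapping for the nine labels in this dataset; scikit-learn's `LabelEncoder` (`fit_transform` / `inverse_transform`) provides the same behaviour. The sample list `y_example` is made up for illustration.

```python
# The nine activity labels reported in the error message above
labels = [
    "ascending stairs", "dancing", "descending stairs", "jumping", "rest",
    "running", "sitting", "standing", "walking",
]

# Sorted unique labels define the integer codes 0..n-1
classes = sorted(set(labels))
to_int = {name: code for code, name in enumerate(classes)}

# Illustrative label column, not taken from the dataset
y_example = ["walking", "sitting", "rest", "walking"]

y_encoded = [to_int[name] for name in y_example]   # what the model would be trained on
y_decoded = [classes[code] for code in y_encoded]  # map predictions back to names

print(y_encoded)
print(y_decoded)
```

The encoded labels would replace `y_train` and `y_test` only for models that require integers; the other classifiers in the list accept string labels directly.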

### <span style="color:red">Task 4.3: Model Reporting</span>
<span style="color:red">Report the results of each model, which one performed the best? Please print the model results below, but include a short write-up in your report.</span>

# Model Performance Summary

| Model | Training Time (s) | Prediction Time (s) | Accuracy |
|-------|------------------|-------------------|----------|
| DecisionTreeClassifier | 17.13 | 0.09 | 49.14% |
| GradientBoostingClassifier | 864.18 | 3.14 | 51.59% |
| RandomForestClassifier | 154.43 | 3.63 | 51.62% |
| KNeighborsClassifier (1) | 8.16 | 72.79 | 81.37% |
| KNeighborsClassifier (2) | 7.38 | 307.92 | 83.90% |
| LGBMClassifier | 131.94 | 47.59 | 73.17% |

## Key Observations:

1. **Highest Accuracy**: The second KNeighborsClassifier implementation achieved the best accuracy at 83.90%, followed by the first KNN implementation at 81.37%.

2. **Training Time**:
  - GradientBoostingClassifier was by far the slowest to train (864.18s)
  - KNN models were fastest to train (7-8s)
  - RandomForest and LightGBM had moderate training times (131-154s)

3. **Prediction Time**:
  - DecisionTreeClassifier had the fastest prediction time (0.09s)
  - KNN models had extremely slow prediction times (73-308s)
  - LGBMClassifier had a moderate prediction time (47.59s)

4. **Efficiency vs. Accuracy Trade-off**:
  - LGBMClassifier offers a good balance between accuracy (73.17%) and moderate prediction time
  - KNN models achieve highest accuracy but at the cost of very slow prediction times
  - Tree-based models (Decision Tree, Random Forest, Gradient Boosting) offer modest accuracy (49-52%) with varied computational demands

5. **Warnings**: Some models generated warnings about ill-defined precision in certain labels, suggesting potential class imbalance issues.
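The trade-off described above can be checked directly from the summary table. The sketch below copies the table values into a list, ranks the models by accuracy, and then picks the fastest predictor among those above a minimum accuracy. The 70% cut-off is an arbitrary value chosen for illustration, not a requirement of the assignment.

```python
# Values copied from the Model Performance Summary table
results_table = [
    {"name": "DecisionTreeClassifier", "train_s": 17.13, "predict_s": 0.09, "accuracy_pct": 49.14},
    {"name": "GradientBoostingClassifier", "train_s": 864.18, "predict_s": 3.14, "accuracy_pct": 51.59},
    {"name": "RandomForestClassifier", "train_s": 154.43, "predict_s": 3.63, "accuracy_pct": 51.62},
    {"name": "KNeighborsClassifier (1)", "train_s": 8.16, "predict_s": 72.79, "accuracy_pct": 81.37},
    {"name": "KNeighborsClassifier (2)", "train_s": 7.38, "predict_s": 307.92, "accuracy_pct": 83.90},
    {"name": "LGBMClassifier", "train_s": 131.94, "predict_s": 47.59, "accuracy_pct": 73.17},
]

# Rank by accuracy, best first
by_accuracy = sorted(results_table, key=lambda r: r["accuracy_pct"], reverse=True)
for row in by_accuracy:
    print(f"{row['name']:<28} {row['accuracy_pct']:6.2f}%  predict {row['predict_s']:7.2f}s")

# Hypothetical cut-off: fastest predictor among models with at least 70% accuracy
min_accuracy_pct = 70.0
candidates = [r for r in results_table if r["accuracy_pct"] >= min_accuracy_pct]
fastest = min(candidates, key=lambda r: r["predict_s"])
print("Fastest above cut-off:", fastest["name"])
```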

In [None]:
# model results
# summarized above

## 5. LOSO (Leave-One-Subject-Out) Experiment


We have 22 participants in our dataset. In this experiment we will train our best model above, but instead of randomly splitting the data, we will use a Leave-One-Subject-Out (LOSO) method. This means we will train on all participants except one, and then evaluate the model on the left-out participant. We will repeat this process for all participants and then average the results. We will then compare the results with the previous model evaluation.
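The full procedure (one fold per participant, then an average) can be sketched without any libraries. In the example below, `loso_splits` is a hypothetical helper, the participant column is a toy list, and the per-fold accuracies are made-up numbers; in practice each fold would train and score a model on the row indices it yields. scikit-learn's `LeaveOneGroupOut` offers equivalent splitting.

```python
def loso_splits(participants):
    """Yield (held_out, train_idx, test_idx) once for each unique participant."""
    for held_out in sorted(set(participants)):
        train_idx = [i for i, p in enumerate(participants) if p != held_out]
        test_idx = [i for i, p in enumerate(participants) if p == held_out]
        yield held_out, train_idx, test_idx

# Toy participant column: one entry per feature row
participants = ["P1", "P1", "P2", "P2", "P2", "P3"]

folds = list(loso_splits(participants))
for held_out, train_idx, test_idx in folds:
    print(f"test on {held_out}: train rows {train_idx}, test rows {test_idx}")

# Made-up per-fold accuracies; the LOSO score is their mean
fold_accuracies = [0.60, 0.70, 0.80]
loso_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(f"LOSO accuracy: {loso_accuracy:.2f}")
```

Because no rows from the held-out participant appear in training, the averaged score estimates how the model performs on a person it has never seen.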

### <span style="color:red">Task 5.1: LOSO</span>
<span style="color:red">Create the variables `X_test`, `y_test`, `X_train`, `y_train` for the LOSO experiment.</span>

In [20]:
import random
import numpy as np
import time
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from lightgbm import LGBMClassifier

# Set random seed for reproducibility
random.seed(42)

# Get all unique participant IDs
all_participants = df_selected_features['participant'].unique()
num_participants = all_participants.size

# Select a random participant for testing
test_participant = all_participants[random.randint(0, num_participants-1)]
print(f"Selected test participant: {test_participant}")

# Create train and test sets based on LOSO
# Train set: all participants except the test participant
# Test set: only the test participant
train_data = df_selected_features[df_selected_features['participant'] != test_participant]
test_data = df_selected_features[df_selected_features['participant'] == test_participant]

# Extract features and labels
X_train = train_data[selected_features]
y_train = train_data['label']
X_test = test_data[selected_features]
y_test = test_data['label']

# Print dataset sizes
print(f"Training set size: {X_train.shape[0]} samples from {len(train_data['participant'].unique())} participants")
print(f"Test set size: {X_test.shape[0]} samples from 1 participant")

# Initialize and train the LGBM model
lgbm_model = LGBMClassifier(
    n_estimators=100,
    num_leaves=31,
    learning_rate=0.1,
    random_state=42,
    n_jobs=-1
)

# Train the model
print("\nTraining LGBM model...")
train_start = time.time()
lgbm_model.fit(X_train, y_train)
train_time = time.time() - train_start
print(f"Training time: {train_time:.4f} seconds")

# Test the model
print("\nTesting LGBM model on left-out participant...")
pred_start = time.time()
y_pred = lgbm_model.predict(X_test)
pred_time = time.time() - pred_start
print(f"Prediction time: {pred_time:.4f} seconds")

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nLOSO Accuracy: {accuracy:.4f}")

# Generate confusion matrix and classification report
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)

# Calculate feature importance
feature_importance = pd.DataFrame({
    'Feature': selected_features,
    'Importance': lgbm_model.feature_importances_
})
feature_importance = feature_importance.sort_values('Importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance)

Selected test participant: P9
Training set size: 2581372 samples from 21 participants
Test set size: 72007 samples from 1 participant

Training LGBM model...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.053445 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1517
[LightGBM] [Info] Number of data points in the train set: 2581372, number of used features: 6
[LightGBM] [Info] Start training from score -2.518034
[LightGBM] [Info] Start training from score -2.510231
[LightGBM] [Info] Start training from score -2.490942
[LightGBM] [Info] Start training from score -2.491208
[LightGBM] [Info] Start training from score -1.178309
[LightGBM] [Info] Start training from score -2.451779
[LightGBM] [Info] Start training from score -2.385521
[LightGBM] [Info] Start training from score -2.384381
[LightGBM] [Info] Start training from score -2.360586
Tr




Confusion Matrix:
[[  125  1912    30     0   854    70     0   164   146]
 [    0     0     0     0     0     0     0     0     0]
 [  183   771   904     0   510    58     4     0   671]
 [   66  4217  2024  4796   798   110   314    52    24]
 [    0     0     0     0     0     0     0     0     0]
 [ 1191  3501  2654   172   653  4407    23     8   692]
 [  231    73     0     0 11431   162  1001     2     1]
 [   16   318    76     1  4486    13  1255  7835   201]
 [  627  3397  5901    19  1220    76     0    36  1525]]

Classification Report:
                   precision    recall  f1-score   support

 ascending stairs       0.05      0.04      0.04      3301
          dancing       0.00      0.00      0.00         0
descending stairs       0.08      0.29      0.12      3101
          jumping       0.96      0.39      0.55     12401
             rest       0.00      0.00      0.00         0
          running       0.90      0.33      0.48     13301
          sitting       0.39 



In [21]:
import random
import numpy as np
import time
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from lightgbm import LGBMClassifier

# Set random seed for reproducibility
random.seed(43)  # Changed seed to get different participant

# Get all unique participant IDs
all_participants = df_selected_features['participant'].unique()
num_participants = all_participants.size

# Exclude P9 from selection and select a random participant for testing
available_participants = [p for p in all_participants if p != 'P9']
test_participant = random.choice(available_participants)
print(f"Selected test participant: {test_participant}")

# Create train and test sets based on LOSO
# Train set: all participants except the test participant
# Test set: only the test participant
train_data = df_selected_features[df_selected_features['participant'] != test_participant]
test_data = df_selected_features[df_selected_features['participant'] == test_participant]

# Extract features and labels
X_train = train_data[selected_features]
y_train = train_data['label']
X_test = test_data[selected_features]
y_test = test_data['label']

# Print dataset sizes
print(f"Training set size: {X_train.shape[0]} samples from {len(train_data['participant'].unique())} participants")
print(f"Test set size: {X_test.shape[0]} samples from 1 participant")

# Initialize and train the LGBM model
lgbm_model = LGBMClassifier(
    n_estimators=100,
    num_leaves=31,
    learning_rate=0.1,
    random_state=42,
    n_jobs=-1
)

# Train the model
print("\nTraining LGBM model...")
train_start = time.time()
lgbm_model.fit(X_train, y_train)
train_time = time.time() - train_start
print(f"Training time: {train_time:.4f} seconds")

# Test the model
print("\nTesting LGBM model on left-out participant...")
pred_start = time.time()
y_pred = lgbm_model.predict(X_test)
pred_time = time.time() - pred_start
print(f"Prediction time: {pred_time:.4f} seconds")

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nLOSO Accuracy: {accuracy:.4f}")

# Generate confusion matrix and classification report
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, zero_division=0)

print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)

# Calculate feature importance
feature_importance = pd.DataFrame({
    'Feature': selected_features,
    'Importance': lgbm_model.feature_importances_
})
feature_importance = feature_importance.sort_values('Importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance)

Selected test participant: P10
Training set size: 2600571 samples from 21 participants
Test set size: 52808 samples from 1 participant

Training LGBM model...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.052926 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1517
[LightGBM] [Info] Number of data points in the train set: 2600571, number of used features: 6
[LightGBM] [Info] Start training from score -2.540740
[LightGBM] [Info] Start training from score -2.517641
[LightGBM] [Info] Start training from score -2.509753
[LightGBM] [Info] Start training from score -2.458474
[LightGBM] [Info] Start training from score -1.191556
[LightGBM] [Info] Start training from score -2.420722
[LightGBM] [Info] Start training from score -2.389868
[LightGBM] [Info] Start training from score -2.375396
[LightGBM] [Info] Start training from score -2.338591
T

### <span style="color:red">Task 5.2: Train Model w/ LOSO</span>
<span style="color:red">Train the best model with the LOSO method. Report the results and compare them with the previous model evaluation. Write a short explanation of your observations. Why do you think it performs better or worse?
</span>

In [None]:
# Model trained above in Task 5.1 (one LOSO fold each for P9 and P10)
# Summary is below


# LOSO Evaluation Comparison: P9 vs P10

## Overall Performance Metrics

| Metric                   | Participant P9 | Participant P10 | Difference |
|--------------------------|----------------|-----------------|------------|
| **Accuracy**             | 28.60%         | 59.95%          | +31.35 pp  |
| **Training Time (s)**    | 161.89         | 169.55          | +7.66      |
| **Prediction Time (s)**  | 3.24           | 3.82            | +0.58      |
| **Weighted Avg F1-score**| 0.39           | 0.60            | +0.21      |

## Per-Class Performance (Precision)

| Activity Class     | P9 Precision | P10 Precision | Difference |
|--------------------|--------------|---------------|------------|
| Ascending stairs   | 0.05         | 0.27          | +0.22      |
| Dancing            | 0.00         | 0.00          | 0.00       |
| Descending stairs  | 0.08         | 0.40          | +0.32      |
| Jumping            | 0.96         | 0.44          | -0.52      |
| Rest               | 0.00         | 0.48          | +0.48      |
| Running            | 0.90         | 0.59          | -0.31      |
| Sitting            | 0.39         | 0.95          | +0.56      |
| Standing           | 0.97         | 0.89          | -0.08      |
| Walking            | 0.47         | 0.44          | -0.03      |

## Per-Class Performance (Recall)

| Activity Class     | P9 Recall | P10 Recall | Difference |
|--------------------|-----------|------------|------------|
| Ascending stairs   | 0.04      | 0.13       | +0.09      |
| Dancing            | 0.00      | 0.00       | 0.00       |
| Descending stairs  | 0.29      | 0.26       | -0.03      |
| Jumping            | 0.39      | 0.49       | +0.10      |
| Rest               | 0.00      | 0.50       | +0.50      |
| Running            | 0.33      | 0.19       | -0.14      |
| Sitting            | 0.08      | 0.96       | +0.88      |
| Standing           | 0.55      | 0.97       | +0.42      |
| Walking            | 0.12      | 0.49       | +0.37      |

## Feature Importance Ranking

| Rank | Feature      | P9 Importance | P10 Importance |
|------|--------------|---------------|----------------|
| 1    | accel_z_min  | 5910          | 5899           |
| 2    | accel_z_max  | 5591          | 5713           |
| 3    | accel_x_std  | 4346          | 4270           |
| 4    | gyr_z_std    | 4051          | 4031           |
| 5    | accel_y_std  | 3652          | 3637           |
| 6    | accel_z_std  | 3450          | 3450           |

## Key Observations:

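1. P10's overall accuracy was 31.35 percentage points higher than P9's (59.95% vs. 28.60%), showing substantial inter-participant variability.

2. The feature importance ranking was identical in both folds. This is expected, since each model was trained on 21 participants, 20 of which are common to both folds; it suggests the feature relevance is stable with respect to which participant is held out.

3. Major differences in class performance (absolute differences, P10 relative to P9):
  - Sitting: recall 0.88 higher for P10 (0.96 vs. 0.08); for P9, 11,431 of 12,901 sitting samples were predicted as rest
  - Rest: P10 recall is 0.50; P9 has no rest samples (support 0), so its 0.00 is undefined rather than a measured failure
  - Jumping: precision 0.52 lower for P10
  - Running: precision 0.31 lower for P10

4. Both participants had no samples for the "dancing" activity class, and P9 additionally had none for "rest" (support 0 in its classification report).

5. P9 showed higher precision for jumping, running, standing, and (marginally) walking, while P10 had higher precision for ascending stairs, descending stairs, rest, and sitting.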

These results highlight the challenge of person-independent activity recognition and demonstrate that model performance can vary drastically depending on individual movement patterns.
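
The P9 figures in this summary can be checked against the confusion matrix printed in Task 5.1. The sketch below recomputes accuracy and per-class recall from that matrix, assuming scikit-learn's convention that rows are true classes and columns are predicted classes, with labels in alphabetical order. It also shows the difference between an absolute gap in percentage points and a relative change; the P10 accuracy is taken from the table above.

```python
labels = ["ascending stairs", "dancing", "descending stairs", "jumping", "rest",
          "running", "sitting", "standing", "walking"]

# P9 confusion matrix copied from the Task 5.1 output (rows = true, columns = predicted).
conf_matrix_p9 = [
    [125, 1912, 30, 0, 854, 70, 0, 164, 146],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [183, 771, 904, 0, 510, 58, 4, 0, 671],
    [66, 4217, 2024, 4796, 798, 110, 314, 52, 24],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1191, 3501, 2654, 172, 653, 4407, 23, 8, 692],
    [231, 73, 0, 0, 11431, 162, 1001, 2, 1],
    [16, 318, 76, 1, 4486, 13, 1255, 7835, 201],
    [627, 3397, 5901, 19, 1220, 76, 0, 36, 1525],
]

# Accuracy = correctly classified samples (the diagonal) / all samples.
total = sum(sum(row) for row in conf_matrix_p9)
correct = sum(conf_matrix_p9[i][i] for i in range(len(labels)))
accuracy_p9 = correct / total

# Recall per class = diagonal entry / row sum; undefined when the class has no samples.
recall_p9 = {}
for i, name in enumerate(labels):
    support = sum(conf_matrix_p9[i])
    recall_p9[name] = conf_matrix_p9[i][i] / support if support else None

accuracy_p10 = 0.5995  # reported in the summary table above
diff_pp = (accuracy_p10 - accuracy_p9) * 100               # absolute gap, percentage points
relative_change = (accuracy_p10 / accuracy_p9 - 1) * 100   # relative change, percent

print(f"P9 accuracy: {accuracy_p9:.4f} ({correct} of {total} samples)")
for name, value in recall_p9.items():
    print(f"  {name:18s} recall: {'n/a (no samples)' if value is None else f'{value:.2f}'}")
print(f"Gap: {diff_pp:.2f} percentage points; relative change: {relative_change:.1f}%")
```

Classes with a row sum of zero (no true samples for this participant) have no defined recall, which is why a reported 0.00 for them should not be read as a model failure.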

### <span style="color:red">Task 5.3 Cross Validation</span>
<span style="color:red">In the interest of time, we will not implement cross-validation. But please include a short explanation of how you would implement cross-validation, and why it is important. (~1 paragraph)</span>
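
One possible way to set up cross-validation for this kind of data is a k-fold split grouped by participant, so that windows from the same person never appear in both the training and the test set. The sketch below is an illustration in plain Python under that assumption; `group_kfold` and the toy participant list are hypothetical, and scikit-learn offers a comparable splitter as `GroupKFold` in `sklearn.model_selection`.

```python
import random


def group_kfold(participants, k, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping each participant in exactly one test fold."""
    unique_ids = sorted(set(participants))
    random.Random(seed).shuffle(unique_ids)
    # Deal the shuffled participant IDs into k folds, round-robin.
    folds = [unique_ids[i::k] for i in range(k)]
    for held_out in folds:
        held_out_set = set(held_out)
        test_idx = [i for i, pid in enumerate(participants) if pid in held_out_set]
        train_idx = [i for i, pid in enumerate(participants) if pid not in held_out_set]
        yield train_idx, test_idx


# Hypothetical toy data: 6 participants with 4 windows each.
participants = [f"P{n}" for n in range(1, 7) for _ in range(4)]

splits = list(group_kfold(participants, k=3))
for fold, (train_idx, test_idx) in enumerate(splits):
    test_ids = sorted({participants[i] for i in test_idx})
    print(f"fold {fold}: test participants {test_ids}, "
          f"{len(train_idx)} train / {len(test_idx)} test windows")
```

Each fold would train a fresh model on `train_idx` and score it on `test_idx`; the mean and spread of the fold scores then estimate how the model generalizes to unseen people.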

## 6. Port Model

We will now port the model to C, for use in Arduino. Please include the C file in your submission, but we will not run it on the embedded device for this assignment. In the next assignment, we will run the model on the Arduino.

In [33]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMClassifier
#import xgboost as xgb


# Train multiple models and compare performance
models = []
# Original models
#models.append(DecisionTreeClassifier(random_state=0, max_depth=5))
# models.append(GradientBoostingClassifier(
#     n_estimators=25,     # Reduced from 50 to 25
#     max_depth=2,         # Reduced from 3 to 2
#     min_samples_split=5, # Require more samples to split
#     subsample=0.8,       # Use 80% of samples for each tree
#     random_state=0
# ))
# Additional models
models.append(RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0))
#models.append(LinearSVC(random_state=0, max_iter=1000, dual=False))
#models.append(KNeighborsClassifier(n_neighbors=5))
#models.append(LogisticRegression(max_iter=1000, random_state=0))
#models.append(KNeighborsClassifier(n_neighbors=5,algorithm='ball_tree', leaf_size=30, weights='distance',  n_jobs=-1 ))
#models.append(LGBMClassifier(n_estimators=100,num_leaves=31,learning_rate=0.1,random_state=0,n_jobs=-1))
#models.append(LogisticRegression( max_iter=2000,solver='liblinear',C=0.1,penalty='l1',random_state=0))
#models.append(SVC( kernel='rbf',C=10,gamma='scale', random_state=0))
# models.append(xgb.XGBClassifier(n_estimators=100, max_depth=3,learning_rate=0.1,random_state=0))
# models.append(ExtraTreesClassifier(n_estimators=100,max_depth=None,min_samples_split=2,random_state=0))
# models.append(MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000,alpha=0.0001,solver='adam',random_state=0))

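import time             # used below to time training and prediction
import sklearn.metrics  # import the submodule explicitly; `import sklearn` alone does not guarantee it is loaded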


results = []

############### edit code below ###############
for model in models:
    print('training: ', model.__class__.__name__)

    # Measure training time
    train_start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - train_start
    print(f' - Training time: {train_time:.4f} seconds')


    # Measure prediction time
    pred_start = time.time()
    y_pred = model.predict(X_test)
    pred_time = time.time() - pred_start
    print(f' - Prediction time: {pred_time:.4f} seconds')


    accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
    print(' - accuracy: ', accuracy)
    results.append(
        {
            'name': model.__class__.__name__,
            'accuracy': accuracy,
            'confusion_matrix': sklearn.metrics.confusion_matrix(y_test, y_pred),
            'classification_report': sklearn.metrics.classification_report(y_test, y_pred)
        }
    )
    print('done')

training:  RandomForestClassifier
 - Training time: 164.2803 seconds
 - Prediction time: 3.4134 seconds
 - accuracy:  0.5162388535613786
done


In [35]:
len(models)

1

### 6.1 install `micromlgen`

[micromlgen](https://github.com/eloquentarduino/micromlgen) is an open-source project that generates C code from your sklearn models.

In [36]:
pip install micromlgen



### <span style="color:red">Task 6.2: Port Best Model and save</span>
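
Saving the ported code to a header file could look like the sketch below. It is a minimal illustration under stated assumptions: `save_header` and the file name `activity_model.h` are hypothetical, the output goes to a temporary directory rather than the project's data path, and `c_code` here is a placeholder string standing in for whatever `port(...)` returns.

```python
import tempfile
from pathlib import Path


def save_header(c_source: str, out_path: Path) -> Path:
    """Write generated C source to out_path, creating parent directories if needed."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(c_source, encoding="utf-8")
    return out_path


# Placeholder for the string returned by micromlgen's port(); not real generated code.
c_code = "#pragma once\n// generated classifier would go here\n"

out_dir = Path(tempfile.mkdtemp())
header_path = save_header(c_code, out_dir / "activity_model.h")  # hypothetical file name
print(f"Wrote {header_path.stat().st_size} bytes to {header_path}")
```

In the notebook, the same helper could be pointed at a folder under the mounted Drive path so the `.h` file is easy to include in the submission.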

In [37]:
from micromlgen import port

# Port the trained model (models[0] is the RandomForestClassifier trained above)
c_code = port(models[0])

print(c_code)

# save/port the model to a directory
# ...

#pragma once
#include <cstdarg>
namespace Eloquent {
    namespace ML {
        namespace Port {
            class RandomForest {
                public:
                    /**
                    * Predict class for features vector
                    */
                    int predict(float *x) {
                        uint8_t votes[9] = { 0 };
                        // tree #1
                        if (x[0] <= 0.07209127023816109) {
                            if (x[4] <= 0.044999999925494194) {
                                if (x[1] <= 0.0024078080896288157) {
                                    if (x[4] <= -0.25) {
                                        if (x[4] <= -0.9750000238418579) {
                                            votes[4] += 1;
                                        }

                                        else {
                                            votes[6] += 1;
                                        }
                                    }

 

### <span style="color:red">Task 6.3: Model Review</span>
<span style="color:red">Take a look at the C code, what do you notice? Do you think the model will run on our microcontrollers? Why or why not? Please include a brief write-up in your report.</span>
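
As a starting point for this review, a rough upper bound on the size of the generated code can be worked out from the forest's hyperparameters alone. The sketch below uses the `n_estimators=50` and `max_depth=5` values from the `RandomForestClassifier` above and the 9-element `votes` array from the generated code. The 4 bytes per threshold is an assumption (32-bit floats); the real flash footprint depends on the compiler and target and will also include the instructions themselves.

```python
n_trees = 50     # n_estimators of the RandomForestClassifier trained above
max_depth = 5    # max_depth of each tree
n_classes = 9    # size of the votes[] array in the generated C code

# A binary tree of depth d has at most 2**d leaves and 2**d - 1 internal split nodes.
max_leaves_per_tree = 2 ** max_depth
max_splits_per_tree = max_leaves_per_tree - 1

max_splits_total = n_trees * max_splits_per_tree
max_leaves_total = n_trees * max_leaves_per_tree

# At prediction time each tree follows a single root-to-leaf path.
max_comparisons_per_prediction = n_trees * max_depth

BYTES_PER_THRESHOLD = 4  # assumption: each threshold is stored as a 32-bit float
threshold_bytes = max_splits_total * BYTES_PER_THRESHOLD

print(f"Up to {max_splits_total} if-statements and {max_leaves_total} vote increments")
print(f"Up to {max_comparisons_per_prediction} float comparisons per prediction")
print(f"At least ~{threshold_bytes} bytes of threshold constants if every tree is full")
print(f"Vote counters: {n_classes} bytes (uint8_t), enough for {n_trees} trees since {n_trees} <= 255")
```

These numbers are bounds for fully grown trees; trees that stop splitting early will be smaller, so the actual header should be inspected or compiled to get a real size.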