<a href="https://colab.research.google.com/github/Christopher-LeeNU/Christopher-LeeNU/blob/main/Chris_Lee_project3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 3

In the project we will use a database created from all activity collected in Project 2. We will first **load** the database into a [pandas] dataframe, **extract features**, and then **train machine learning model** to find the best model for activity detection.

Your assignment will include making edits to this notebook. Please complete the **tasks** in each section.

In [1]:
# !pip install scikit-learn

In [2]:
import sklearn

In [3]:
# OPTIONAL used to mount Google Drive folder with databases
from google.colab import drive
drive.mount('/content/drive')

# working_dir, this will be different for you
data_path = '/content/drive/MyDrive/CS 397'

Mounted at /content/drive


## Submission
Please include the following for your submission, compressed into a **single** file:
* Completed version of this notebook
* A PDF report with the requested write-ups
* A `.h` file the best model ported to C.

## 1. open class database
insure the database is located in the same directory as the notebook, or update the path.

In [4]:
import pandas as pd
import os

# open the class database
db_path = os.path.join(data_path, 'dataset_spring_2024.csv') # this may be different for you
df = pd.read_csv(db_path, dtype={'label': 'str', 'participant': 'str'})

df = df.drop(columns=['time_ms'])

df.head(10)

Unnamed: 0,participant,accel_x,accel_y,accel_z,gyr_x,gyr_y,gyr_z,label
0,P1,-0.1,-0.95,0.35,1.68,-3.08,-1.54,sitting
1,P1,-0.1,-0.94,0.36,2.03,-3.15,-1.47,sitting
2,P1,-0.1,-0.94,0.36,2.52,-3.22,-1.54,sitting
3,P1,-0.1,-0.94,0.36,2.52,-3.15,-1.61,sitting
4,P1,-0.1,-0.94,0.36,2.73,-3.01,-1.61,sitting
5,P1,-0.09,-0.94,0.37,2.87,-2.8,-1.82,sitting
6,P1,-0.09,-0.94,0.37,2.73,-2.45,-1.68,sitting
7,P1,-0.09,-0.94,0.37,2.8,-2.31,-1.75,sitting
8,P1,-0.08,-0.94,0.37,2.38,-2.17,-1.82,sitting
9,P1,-0.08,-0.94,0.37,1.82,-2.1,-1.75,sitting


In [5]:
# include the dataset from 2023
db_path = os.path.join(data_path, 'dataset_winter_2023.csv')
df_2023 = pd.read_csv(db_path)

# clean the data to match this year's dataset
df_2023['label'] = df_2023['label'].str.lower().str.strip()
df_2023['label'] = df_2023['label'].str.replace('sitting/working with a computer', 'sitting')
df_2023['label'] = df_2023['label'].str.replace('sitting/working\xa0with a computer', 'sitting')
print(df_2023.label.unique())

df_2023['participant'] = df_2023['participant'].astype(str)

# add 100 each participant id to separate
def _reformat_participant_id(id_str):
    id = int(id_str.lower().split('p')[-1])
    id += 100
    return f'P{id}'
df_2023['participant'] = df_2023['participant'].apply(_reformat_participant_id)

df_2023 = df_2023.drop(columns=['time'])


df_2023.head()

['sitting' 'standing' 'walking' 'ascending stairs' 'descending stairs'
 'jumping' 'running' 'dancing' 'rest']


Unnamed: 0,participant,accel_x,accel_y,accel_z,gyr_x,gyr_y,gyr_z,label
0,P119,-0.36,-0.94,-0.01,1.33,-1.75,0.35,sitting
1,P119,-0.36,-0.94,-0.01,1.47,-1.68,0.56,sitting
2,P119,-0.36,-0.94,-0.01,1.26,-1.89,0.14,sitting
3,P119,-0.36,-0.94,-0.01,1.05,-1.75,0.63,sitting
4,P119,-0.36,-0.94,-0.01,1.26,-1.89,0.42,sitting


In [6]:
# merge the database
df = pd.concat([df, df_2023], ignore_index=True)

# clean up
del df_2023

df.head()

Unnamed: 0,participant,accel_x,accel_y,accel_z,gyr_x,gyr_y,gyr_z,label
0,P1,-0.1,-0.95,0.35,1.68,-3.08,-1.54,sitting
1,P1,-0.1,-0.94,0.36,2.03,-3.15,-1.47,sitting
2,P1,-0.1,-0.94,0.36,2.52,-3.22,-1.54,sitting
3,P1,-0.1,-0.94,0.36,2.52,-3.15,-1.61,sitting
4,P1,-0.1,-0.94,0.36,2.73,-3.01,-1.61,sitting


## 2. Extract Features

Here we will select a window size and compute aggregations on the dataset. The following line of code will create a new dataframe which will compute aggregate statistics across each participant, label, and window size.

**Note**: this will take some time to compute, especially on a Google Colab instance. The more complex the aggregation calculations, the longer it will take.


#### Task 2.1
Add at least 4 additional features to the feature set. We have provided `mean` and `std` (standard deviation)

In [7]:
window_size = 100 # ~ 1 sec

df_grouped = df.groupby(['participant', 'label']).rolling(window=window_size).agg({
    # 'time_ms': ['sum'],
    'accel_x': ['mean', 'std'],
    'accel_y': ['mean', 'std'],
    'accel_z': ['mean', 'std'],
    'gyr_x': ['mean', 'std'],
    'gyr_y': ['mean', 'std'],
    'gyr_z': ['mean', 'std']
}).reset_index().dropna()

# flatten column names so there are is no column levels
df_grouped.columns = ['_'.join(col).strip() for col in df_grouped.columns.values]

# clean up columns
df_grouped.rename(columns={'participant_': 'participant', 'label_': 'label'}, inplace=True)
df_grouped.drop(columns=['level_2_'], inplace=True)

# # optional - save the database to avoid re-running the above code
# df_group_file_path = f'project2_class_dataset_grouped_w{window_size}.csv'
# df_grouped.to_csv(df_group_file_path, index=False)

# then read the csv file
# df_grouped = pd.read_csv(df_group_file_path)

df_grouped.head()

Unnamed: 0,participant,label,accel_x_mean,accel_x_std,accel_y_mean,accel_y_std,accel_z_mean,accel_z_std,gyr_x_mean,gyr_x_std,gyr_y_mean,gyr_y_std,gyr_z_mean,gyr_z_std
99,P1,ascending stairs,-0.8253,0.177869,-0.5937,0.283623,-0.005,0.234394,-22.239,95.78653,-37.492,41.125919,-1.7871,40.846795
100,P1,ascending stairs,-0.8254,0.177846,-0.5904,0.282056,-0.0086,0.233043,-24.6897,93.403866,-37.9239,41.128009,-1.2355,40.650189
101,P1,ascending stairs,-0.8251,0.17788,-0.5888,0.281939,-0.0116,0.231738,-27.3,90.788858,-38.1045,41.195001,-0.6027,40.319882
102,P1,ascending stairs,-0.8249,0.17792,-0.5875,0.282047,-0.0145,0.229859,-30.0398,87.643445,-37.9491,41.065231,0.1239,39.762283
103,P1,ascending stairs,-0.8242,0.177998,-0.5869,0.282277,-0.0152,0.229644,-32.6606,84.653801,-37.8301,40.958809,0.8106,39.170303


## 3. Feature Selection

Here we will find the *best* features to use in our model. This can be done by using a variety of techniques, including *forward selection*, *backward selection*, *L1 regularization*, or *Random Forest Importance*. See [feature selection in Python](https://scikit-learn.org/stable/modules/feature_selection.html) for more information.

We provide two methods for features selection, *Univariate Feature Selection* and *Recursive Feature Elimination (RFE)*.

In [8]:
X = df_grouped.drop(columns=['participant', 'label']) # all columns except grouped columns
y = df_grouped['label']

### 3.1 Univariate Feature Selection

#### Task 3.1
Explain the process of univariate feature selection (~1 paragraph, what is the purpose of this method? How does it work?). Which feature have the most importance according to this method?

In [9]:
from sklearn.feature_selection import SelectKBest, f_classif

k = X.columns.size # select all features
selector = SelectKBest(f_classif, k=k)
X_new = selector.fit_transform(X, y)

features_selected = selector.get_support(indices=True)
selected_feature_names = X.columns[features_selected]

# print names and f_scores of selected features
selected_features = pd.DataFrame({'feature': selected_feature_names, 'f_score': selector.scores_[features_selected]})
selected_features = selected_features.sort_values(by='f_score', ascending=False)
selected_features.head(15)

Unnamed: 0,feature,f_score
1,accel_x_std,414149.491084
3,accel_y_std,274139.330284
5,accel_z_std,266514.200351
9,gyr_y_std,260368.015552
11,gyr_z_std,251370.765099
7,gyr_x_std,222869.245764
4,accel_z_mean,70981.591504
2,accel_y_mean,14576.150159
0,accel_x_mean,3748.666303
6,gyr_x_mean,1178.499552


### 3.2 Recursive Feature Elimination (RFE)

Here we use the method of RFE to select the best features. Different from the univariate feature selection, RFE selects features by recursively considering smaller and smaller sets of features. We choose how we evaluate the importance of a feature by setting the parameter `estimator`. In this case, we use a `RandomForestClassifier` to evaluate the importance of a feature.

**Note**: this will take some time to run. (ex: 5.5mins on M2 Macbook Pro)

#### Task 3.2
Explain the process of recursive feature elimination (~1 paragraph, what is the purpose of this method? How does it work?). Which feature have the most importance according to this method?

In [10]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=5, # the number of trees in the forest
    random_state=42, # for reproducibility
    max_depth=5 # the maximum depth of the tree
)
rfe = RFE(estimator=model, n_features_to_select=4, step=1)

X_rfe = rfe.fit_transform(X, y)

selected_features_rfe = X.columns[rfe.support_]

print("Features selected by RFE:", selected_features_rfe)

Features selected by RFE: Index(['accel_x_mean', 'accel_x_std', 'accel_y_std', 'accel_z_mean'], dtype='object')


### Task 3.3: Additional Feature Selection Method

Please implement an additional feature selection method of your choice. You can use any feature selection method from the [scikit-learn library](https://scikit-learn.org/stable/modules/feature_selection.html), or any others. In your report, writing a short description of the method and the results, similar to above but include **why** you selected this method.

In [11]:
## implement additional feature selection method here ...



### Task 3.4: Feature Selection

Choose the best features from the methods above and create a new dataframe with only those features. In your report, note which features you selected, and why.

In [12]:
# example features, you should choose your own.
selected_features = ['accel_x_std', 'accel_y_std', 'accel_z_std', 'gyr_x_std', 'gyr_y_std', 'gyr_z_std']
df_selected_features = df_grouped[['participant', 'label'] + selected_features]

df_selected_features.head()

Unnamed: 0,participant,label,accel_x_std,accel_y_std,accel_z_std,gyr_x_std,gyr_y_std,gyr_z_std
99,P1,ascending stairs,0.177869,0.283623,0.234394,95.78653,41.125919,40.846795
100,P1,ascending stairs,0.177846,0.282056,0.233043,93.403866,41.128009,40.650189
101,P1,ascending stairs,0.17788,0.281939,0.231738,90.788858,41.195001,40.319882
102,P1,ascending stairs,0.17792,0.282047,0.229859,87.643445,41.065231,39.762283
103,P1,ascending stairs,0.177998,0.282277,0.229644,84.653801,40.958809,39.170303


## 4. Model Training & Selection


### 4.1 Train/Test Split by Percentage

here we will split the data into a training set and a test set. We will train the model on the training set and evaluate the model on the test set. We split with a percentage, where we use a randomly distribution of the percentage of the data for the training and testing, 33%.

In [13]:
from sklearn.model_selection import train_test_split

test_size = 0.33 # percentage of data for testing

X = df_selected_features[selected_features]
Y = df_selected_features['label']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

### 4.2 Model Training

Here we will train a series of models and evaluate them. We have provided two models: `Decision Tree Classifier`, and `GradientBoostingClassifier`.

#### Task 4.2
Please train 4 additional classifiers model. For each, including the ones we include a brief description (~1 paragraph). For the models you choose, explain why you chose it. Please limit you selection to machine learning models, we will try out deep learning models in the next project. You are free to use any models from the [scikit-learn library](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning), you may also use other as long as they are not deep learning models (e.g., multi-layer neural networks).

In [32]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

models = []

models.append(DecisionTreeClassifier(random_state=0, max_depth=5))
models.append(GradientBoostingClassifier(random_state=0, verbose=1, n_estimators=10))


In [None]:
results = []

for model in models:
    print('training: ', model.__class__.__name__)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
    print(' - accuracy: ', accuracy)
    results.append(
        {
            'name': model.__class__.__name__,
            'accuracy': sklearn.metrics.accuracy_score(y_test, y_pred),
            'confusion_matrix': sklearn.metrics.confusion_matrix(y_test, y_pred),
            'classification_report': sklearn.metrics.classification_report(y_test, y_pred)
        }
    )

training:  DecisionTreeClassifier
 - accuracy:  0.24156891495601174


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


training:  GradientBoostingClassifier
      Iter       Train Loss   Remaining Time 
         1           1.9041           13.56m


### 4.3 Model Reporting
Report the results of each model, which one performed the best? Please print the model results below, but include a short write-up in your report.

In [None]:
# # model results
# # ...

# print(results)



## 5. LOSO (Leave-One-Subject-Out) Experiment


We have XX participants in our dataset. In this experiment we will train our best model above, but instead of randomly splitting the data, we will use a Leave-One-Subject-Out (LOSO) method. This means we will train on all participants except one, and then evaluate the model on the left out participant. We will repeat this process for all participants and then average the results. We will then compare the results with the previous model evaluation.

### Task 5.1: LOSO

create the variables `X_test`, `y_test`, `X_train`, `y_train` for the LOSO experiment.

In [29]:
import random

# Assume df_selected_features contains the relevant data
# Assume 'participant' is a column in the dataframe
# X and Y are the feature matrix and labels respectively, derived from df_selected_features

num_participants = df_selected_features['participant'].nunique()
test_ind = random.randint(0, num_participants - 1)
test_participant = df_selected_features['participant'].unique()[test_ind]


# Creating the training and testing sets
X_train = X[df_selected_features['participant'] != test_participant]
y_train = Y[df_selected_features['participant'] != test_participant]

X_test = X[df_selected_features['participant'] == test_participant]
y_test = Y[df_selected_features['participant'] == test_participant]

# Verify the shapes of the sets
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_train shape: (2555171, 6)
y_train shape: (2555171,)
X_test shape: (98208, 6)
y_test shape: (98208,)


## Task 5.2: Train Model w/ LOSO

Train the best model with the LOSO method. report the results, and compare with the previous model evaluation. Write a short explanation of your observations. Why do you think it perform better or worse?


In [None]:
# train and report results
# ...



### Task 5.3 Cross Validation

In the interest of time, we will not implement cross-validation. But please include a short explanation of how you would implement cross-validation, and why it is important. (~1 paragraph)

## 6. Port Model

We will now port the model to C, for use in Arduino. Please include the C file in your submission, but we will not run it on the embedded device for this assignment. In the next assignment, we will run the model on the Arduino.

### 6.1 install `micromlgen`

[micromlgen](https://github.com/eloquentarduino/micromlgen) is a opensource project which will generate C code from your sklearn models


In [30]:
!pip install micromlgen


Collecting micromlgen
  Downloading micromlgen-1.1.28.tar.gz (12 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: micromlgen
  Building wheel for micromlgen (setup.py) ... [?25l[?25hdone
  Created wheel for micromlgen: filename=micromlgen-1.1.28-py3-none-any.whl size=32152 sha256=c88ffabcb78d764ce185628fe15855ab6598c1bdcd79713f9f379ec8f2f0a231
  Stored in directory: /root/.cache/pip/wheels/97/54/64/5d82c310920abe1be0d120313ceb9e12c88f5701f53f6ed248
Successfully built micromlgen
Installing collected packages: micromlgen
Successfully installed micromlgen-1.1.28


### Task 6.2 Port Best Model and save

In [31]:
from micromlgen import port

c_code = port(models[0]) ## use your best model here

print(c_code)

# save the model to a
# ...

AttributeError: 'DecisionTreeClassifier' object has no attribute 'tree_'

### Task 6.3 Model Review
Take a look at the C code, what do you notice? Do you think the model will run on our microcontrollers? Why or why not? Please include a brief write-up in your report.