# Activity Classification - DecisionTreeClassifier Training

This notebook trains a DecisionTreeClassifier on the physical activity dataset using GridSearchCV for hyperparameter tuning.

## Import Required Libraries

In [2]:
!python3 -m pip install kagglehub



In [3]:
%reset -f

import importlib

import activity_functions
importlib.reload(activity_functions)
from activity_functions import *


# Google Colab only: add the project folder on Drive to the import path
# import sys
# sys.path.append('/content/drive/MyDrive/ds420Projects/project1')
# from activity_functions import *

In [4]:
activity = load_data()

Loaded from Kaggle: /home/thuy/.cache/kagglehub/datasets/diegosilvadefrana/fisical-activity-dataset/versions/4/dataset2.csv


In [5]:
df_train, df_test = create_train_test(activity, test_ratio=0.2)
print(df_train.shape)
print(df_test.shape)

(2291244, 33)
(572812, 33)


In [6]:
X_train, y_train, X_test, y_test = prepare_for_train(df_train, df_test)

In [7]:
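# Optional GPU check. scikit-learn's DecisionTreeClassifier runs on the CPU,
# so the missing-GPU warning below does not affect training.
import tensorflow as tf
print(tf.test.gpu_device_name())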




W0000 00:00:1761255101.367842    7306 gpu_device.cc:2342] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...


## Hyperparameter Tuning with GridSearchCV

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings("ignore")

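def grid_searchCV(X, y):
    """Run a 3-fold grid search over decision tree hyperparameters and return the fitted GridSearchCV."""
    model = DecisionTreeClassifier(
        random_state=42
    )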
    param = {
        "max_depth": [None, 3, 5, 7],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "criterion": ["gini", "entropy"]
    }

    grid = GridSearchCV(
        model,
        param,
        verbose=1,
        refit=True,
        cv=3,
        scoring='accuracy',
        n_jobs=-1,
        return_train_score=True
    )

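    grid.fit(X, y)
    return grid


# best_model is the fitted GridSearchCV object; with refit=True it predicts
# using the best estimator retrained on the full training set.
best_model = grid_searchCV(X_train, y_train)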

Fitting 3 folds for each of 72 candidates, totalling 216 fits


In [9]:
import pandas as pd

cv_result = pd.DataFrame(best_model.cv_results_)
columns = ['params', 'rank_test_score', 'mean_train_score', 'mean_test_score']
cv_result = cv_result[columns]
cv_result.sort_values(by='rank_test_score')

| index | params | rank_test_score | mean_train_score | mean_test_score |
|---|---|---|---|---|
| 36 | {'criterion': 'entropy', 'max_depth': None, 'm... | 1 | 1.000000 | 0.995843 |
| 37 | {'criterion': 'entropy', 'max_depth': None, 'm... | 2 | 0.999791 | 0.995723 |
| 38 | {'criterion': 'entropy', 'max_depth': None, 'm... | 3 | 0.999284 | 0.995462 |
| 39 | {'criterion': 'entropy', 'max_depth': None, 'm... | 4 | 0.999437 | 0.995444 |
| 40 | {'criterion': 'entropy', 'max_depth': None, 'm... | 5 | 0.999384 | 0.995390 |
| ... | ... | ... | ... | ... |
| 47 | {'criterion': 'entropy', 'max_depth': 3, 'min_... | 64 | 0.452366 | 0.452329 |
| 46 | {'criterion': 'entropy', 'max_depth': 3, 'min_... | 64 | 0.452366 | 0.452329 |
| 45 | {'criterion': 'entropy', 'max_depth': 3, 'min_... | 64 | 0.452366 | 0.452329 |
| 52 | {'criterion': 'entropy', 'max_depth': 3, 'min_... | 64 | 0.452366 | 0.452329 |


## Best Hyperparameters Found

Display the best hyperparameters found by GridSearchCV:


In [10]:
print("Best Hyperparameters:")
print(best_model.best_params_)
print(f"\nBest Cross-Validation Accuracy: {best_model.best_score_:.4f}")


Best Hyperparameters:
{'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}

Best Cross-Validation Accuracy: 0.9958


## Model Evaluation

Evaluate the best model on the test set:


In [11]:
from sklearn.metrics import classification_report

# Predict on test set
y_test_hat = best_model.predict(X_test)

# Calculate metrics
compute_scores(y_test, y_test_hat, verbose=True)


Accuracy:  0.9970
F1-Score:  0.9967
Recall:    0.9967
Precision: 0.9968


| index | Accuracy | F1_Score | Recall | Precision |
|---|---|---|---|---|
| 0 | 0.996971 | 0.996741 | 0.99668 | 0.996802 |
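
`compute_scores` comes from `activity_functions`, whose source is not shown in this notebook. As a rough illustration of what the four numbers mean, the sketch below computes accuracy and macro-averaged precision, recall, and F1 in plain Python. The macro averaging and the name `macro_scores` are assumptions, not the helper's actual implementation. Macro averaging is at least consistent with the output above: weighted or micro-averaged recall would equal accuracy, and here the two differ.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (illustrative only)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the true class was different
            fn[t] += 1  # a true instance of class t was missed

    def safe_div(a, b):
        return a / b if b else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        prec = safe_div(tp[label], tp[label] + fp[label])
        rec = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(safe_div(2 * prec * rec, prec + rec))

    n = len(labels)
    return {
        "Accuracy": safe_div(sum(tp.values()), len(y_true)),
        "F1_Score": safe_div(sum(f1s), n),
        "Recall": safe_div(sum(recalls), n),
        "Precision": safe_div(sum(precisions), n),
    }


# Toy labels for illustration; these are not taken from the dataset.
y_true = ["walking", "walking", "sitting", "running"]
y_pred = ["walking", "sitting", "sitting", "running"]
scores = macro_scores(y_true, y_pred)
print(scores)
```

Because every class contributes equally to a macro average, small classes such as rope jumping weigh as much as the much larger transient activities class.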


In [12]:
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_test_hat))


Detailed Classification Report:
                      precision    recall  f1-score   support

      Nordic walking       1.00      1.00      1.00     37621
    ascending stairs       0.99      0.99      0.99     23443
             cycling       1.00      1.00      1.00     32920
   descending stairs       0.99      0.99      0.99     20989
             ironing       1.00      1.00      1.00     47738
               lying       1.00      1.00      1.00     38505
        rope jumping       1.00      0.99      0.99      8594
             running       1.00      1.00      1.00     19640
             sitting       1.00      1.00      1.00     37038
            standing       1.00      1.00      1.00     37986
transient activities       1.00      1.00      1.00    185515
     vacuum cleaning       1.00      1.00      1.00     35071
             walking       1.00      1.00      1.00     47752

            accuracy                           1.00    572812
           macro avg       1.00    

## Summary

The DecisionTreeClassifier was tuned using GridSearchCV with the following hyperparameter grid:
- **max_depth**: [None, 3, 5, 7]
- **min_samples_split**: [2, 5, 10]
- **min_samples_leaf**: [1, 2, 4]
- **criterion**: ["gini", "entropy"]

This resulted in **72 candidate models** evaluated with **3-fold cross-validation** (216 total fits).
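
The candidate count follows from the grid sizes: 4 × 3 × 3 × 2 = 72 combinations, each fit once per fold. A minimal sketch that reproduces this arithmetic with the standard library is shown here; it enumerates the combinations itself rather than calling scikit-learn, and the helper name `count_fits` is illustrative only.

```python
from itertools import product

# Grid copied from the grid_searchCV cell.
param_grid = {
    "max_depth": [None, 3, 5, 7],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}


def count_fits(grid, n_folds):
    """Return (number of candidates, number of cross-validation fits)."""
    # One candidate per element of the Cartesian product of all value lists.
    candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
    return len(candidates), len(candidates) * n_folds


n_candidates, n_fits = count_fits(param_grid, n_folds=3)
print(n_candidates, n_fits)  # 72 216
```

With `refit=True`, GridSearchCV also refits the best candidate once on the full training set; that extra fit is not part of the 216.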

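The best model (`criterion="entropy"`, `max_depth=None`, `min_samples_leaf=1`, `min_samples_split=2`) was selected by mean cross-validation accuracy (0.9958) and reached 0.9970 accuracy on the held-out test set. Shallow trees underfit badly: the entropy candidates with `max_depth=3` reached only about 0.45 cross-validation accuracy.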
