# Activity Classification - DecisionTreeClassifier Training

This notebook trains a DecisionTreeClassifier on the physical activity dataset using GridSearchCV for hyperparameter tuning.

## Import Required Libraries

In [5]:
!python3 -m pip install kagglehub



In [6]:
%reset -f

import importlib

import activity_functions
importlib.reload(activity_functions)
from activity_functions import *


# The following applies only when running on Google Colab:
# import sys
# sys.path.append('/content/drive/MyDrive/ds420Projects/project1')
# from activity_functions import *

In [7]:
activity = load_data()

Loaded from Kaggle: C:\Users\aryan\.cache\kagglehub\datasets\diegosilvadefrana\fisical-activity-dataset\versions\4\dataset2.csv


In [8]:
df_train, df_test = create_train_test(activity, test_ratio=0.2)
print(df_train.shape)
print(df_test.shape)

(2291244, 33)
(572812, 33)
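
The two shapes above add up to 2,864,056 rows, of which 572,812 (20%) are held out. `create_train_test` lives in `activity_functions` and is not shown here, so the following is only a sketch of one common way to produce such a split, using scikit-learn's `train_test_split` on a toy table. The column names, the stratification, and the random seed are assumptions for illustration, not the helper's actual behavior.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the activity table; column names are illustrative only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.normal(size=1000),
    "activity": rng.choice(["walking", "sitting", "running"], size=1000),
})

# Hold out 20% of the rows, stratifying on the label so that class
# proportions stay similar in both parts (an assumption, see above).
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["activity"]
)

print(toy_train.shape)  # (800, 3)
print(toy_test.shape)   # (200, 3)
```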


In [9]:
X_train, y_train, X_test, y_test = prepare_for_train(df_train, df_test)

In [10]:
import tensorflow as tf
print(tf.test.gpu_device_name())


## Hyperparameter Tuning with GridSearchCV

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

def grid_searchCV(X, y):
    model = DecisionTreeClassifier(
        random_state=42
    )
    param = {
        "max_depth": [3, 5, 7, 10],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "criterion": ["gini", "entropy"]
    }

    grid = GridSearchCV(
        model,
        param,
        verbose=1,
        refit=True,
        cv=3,
        scoring='accuracy',
        n_jobs=-1,
        return_train_score=True
    )

    grid.fit(X, y)
    return grid
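
# `best_model` holds the fitted GridSearchCV object. With refit=True it
# delegates predict() to the best estimator retrained on all of X_train.
best_model = grid_searchCV(X_train, y_train)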

Fitting 3 folds for each of 72 candidates, totalling 216 fits
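
The counts in the log line above follow directly from the grid: every combination of values is one candidate, and each candidate is fitted once per fold. A small check of that arithmetic, using only the grid and `cv=3` from the cell above:

```python
import math

# Same grid as in the cell above.
param = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}
cv_folds = 3

# One candidate per combination of values: 4 * 3 * 3 * 2
n_candidates = math.prod(len(values) for values in param.values())
n_cv_fits = n_candidates * cv_folds

print(n_candidates, n_cv_fits)  # 72 216
```

Because `refit=True`, GridSearchCV also fits the winning configuration once more on the full training set after the search, so the total number of trees trained is one more than the 216 cross-validation fits.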


In [19]:
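import pandas as pd  # harmless if activity_functions already exports pd

# Collect the per-candidate cross-validation scores into a DataFrame
cv_result = pd.DataFrame(best_model.cv_results_)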
columns = ['params', 'rank_test_score', 'mean_train_score', 'mean_test_score']
cv_result = cv_result[columns]
cv_result.sort_values(by='rank_test_score')

| | params | rank_test_score | mean_train_score | mean_test_score |
|---|---|---|---|---|
| 63 | {'criterion': 'entropy', 'max_depth': 10, 'min... | 1 | 0.858622 | 0.857831 |
| 64 | {'criterion': 'entropy', 'max_depth': 10, 'min... | 2 | 0.858621 | 0.857831 |
| 65 | {'criterion': 'entropy', 'max_depth': 10, 'min... | 3 | 0.858620 | 0.857830 |
| 66 | {'criterion': 'entropy', 'max_depth': 10, 'min... | 4 | 0.858618 | 0.857829 |
| 67 | {'criterion': 'entropy', 'max_depth': 10, 'min... | 5 | 0.858618 | 0.857828 |
| ... | ... | ... | ... | ... |
| 40 | {'criterion': 'entropy', 'max_depth': 3, 'min_... | 64 | 0.452366 | 0.452329 |
| 43 | {'criterion': 'entropy', 'max_depth': 3, 'min_... | 64 | 0.452366 | 0.452329 |
| 42 | {'criterion': 'entropy', 'max_depth': 3, 'min_... | 64 | 0.452366 | 0.452329 |
| 41 | {'criterion': 'entropy', 'max_depth': 3, 'min_... | 64 | 0.452366 | 0.452329 |
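
In the rows shown above, the scores depend almost entirely on `max_depth`, and the mean train and test scores of each candidate differ by less than 0.001. One way to make that pattern explicit is to average `cv_results_` per depth and look at the train/test gap. The sketch below does this on the small Iris dataset with a reduced grid, purely so that it runs quickly; the dataset, the grid, and the names `results` and `by_depth` are illustrative and not part of this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small stand-in problem so the search finishes in well under a second.
X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [1, 2, 3], "criterion": ["gini", "entropy"]},
    cv=3,
    scoring="accuracy",
    return_train_score=True,
)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)
# cv_results_ exposes each hyperparameter as a `param_<name>` column.
results["max_depth"] = results["param_max_depth"].astype(int)
results["train_test_gap"] = (
    results["mean_train_score"] - results["mean_test_score"]
)

# Average over the other hyperparameters to see the effect of depth alone.
by_depth = results.groupby("max_depth")[
    ["mean_train_score", "mean_test_score", "train_test_gap"]
].mean()
print(by_depth)
```

Applied to `best_model.cv_results_`, the same grouping would show how much accuracy each additional depth level buys and whether the gap between train and test scores starts to widen.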


## Best Hyperparameters Found

Display the best hyperparameters found by GridSearchCV:


In [20]:
print("Best Hyperparameters:")
print(best_model.best_params_)
print(f"\nBest Cross-Validation Accuracy: {best_model.best_score_:.4f}")


Best Hyperparameters:
{'criterion': 'entropy', 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2}

Best Cross-Validation Accuracy: 0.8578
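
The selected `max_depth` of 10 is the largest value in the grid, which usually suggests that the search range may be worth extending in that direction. The helper below is a hypothetical convenience function (not part of scikit-learn or `activity_functions`) that flags numeric hyperparameters whose best value sits on an edge of its grid.

```python
def params_on_grid_edge(best_params, param_grid):
    """Map each numeric hyperparameter whose best value is the smallest or
    largest entry of its grid to 'lower' or 'upper'."""
    on_edge = {}
    for name, best in best_params.items():
        values = param_grid[name]
        if not all(isinstance(v, (int, float)) for v in values):
            continue  # skip categorical options such as `criterion`
        if best == min(values):
            on_edge[name] = "lower"
        elif best == max(values):
            on_edge[name] = "upper"
    return on_edge


# Grid and best parameters copied from the cells above.
param = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}
best = {"criterion": "entropy", "max_depth": 10,
        "min_samples_leaf": 1, "min_samples_split": 2}

print(params_on_grid_edge(best, param))
# {'max_depth': 'upper', 'min_samples_leaf': 'lower', 'min_samples_split': 'lower'}
```

The two "lower" hits are less informative: `min_samples_leaf=1` and `min_samples_split=2` are already the smallest integer values these parameters accept, so only `max_depth` leaves room to extend the grid.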


## Model Evaluation

Evaluate the best model on the test set:


In [22]:
from sklearn.metrics import classification_report

# Predict on test set
y_test_hat = best_model.predict(X_test)

# Calculate metrics
compute_scores(y_test, y_test_hat, verbose=True)


Accuracy:  0.8585
F1-Score:  0.8380
Recall:    0.8184
Precision: 0.8724


| | Accuracy | F1_Score | Recall | Precision |
|---|---|---|---|---|
| 0 | 0.858465 | 0.838007 | 0.818414 | 0.872355 |
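
`compute_scores` is defined in `activity_functions` and is not shown here. Its recall, precision, and F1 values are consistent with the unweighted mean of the per-class figures in the classification report further down, which points to macro averaging, but that is an inference from the numbers rather than something the notebook states. Under that assumption, a minimal equivalent could look like the sketch below; the function name `macro_scores` and the toy labels are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)


def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged F1, recall and precision.

    Macro averaging gives every class the same weight, regardless of
    how many samples it has.
    """
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1_Score": f1_score(y_true, y_pred, average="macro"),
        "Recall": recall_score(y_true, y_pred, average="macro"),
        "Precision": precision_score(y_true, y_pred, average="macro"),
    }


# Toy labels, only to exercise the function.
y_true = ["walking", "walking", "sitting", "sitting", "running", "running"]
y_pred = ["walking", "sitting", "sitting", "sitting", "running", "walking"]
scores = macro_scores(y_true, y_pred)
print({name: round(value, 4) for name, value in scores.items()})
```

With macro averaging, weak classes pull the averages down even when they are small, which is why recall here can sit well below accuracy.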


In [23]:
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_test_hat))


Detailed Classification Report:
                      precision    recall  f1-score   support

      Nordic walking       0.97      0.87      0.92     37621
    ascending stairs       0.69      0.41      0.51     23443
             cycling       0.95      0.87      0.91     32920
   descending stairs       0.61      0.32      0.42     20989
             ironing       0.91      0.91      0.91     47738
               lying       1.00      0.96      0.98     38505
        rope jumping       0.92      0.91      0.91      8594
             running       0.90      0.86      0.88     19640
             sitting       0.99      0.96      0.97     37038
            standing       0.84      0.96      0.90     37986
transient activities       0.78      0.89      0.83    185515
     vacuum cleaning       0.91      0.82      0.86     35071
             walking       0.88      0.91      0.90     47752

            accuracy                           0.86    572812
           macro avg       0.87    
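
The report shows low recall for the two stair classes but does not say which classes their samples are assigned to instead. A row-normalized confusion matrix answers that question. The sketch below uses a handful of invented labels, so its numbers say nothing about the real model; in the notebook, `y_test` and `y_test_hat` would take the place of the toy lists.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["ascending stairs", "descending stairs", "transient activities"]

# Invented toy labels, four samples per true class.
y_true = (
    ["ascending stairs"] * 4
    + ["descending stairs"] * 4
    + ["transient activities"] * 4
)
y_pred = [
    "ascending stairs", "ascending stairs",
    "transient activities", "transient activities",
    "descending stairs", "transient activities",
    "transient activities", "transient activities",
    "transient activities", "transient activities",
    "transient activities", "ascending stairs",
]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)

# Divide each row by its total: the diagonal then equals per-class recall,
# and off-diagonal cells show where the missed samples went.
row_share = cm_df.div(cm_df.sum(axis=1), axis=0)
print(row_share.round(2))
```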

## Summary

The DecisionTreeClassifier was tuned using GridSearchCV with the following hyperparameter grid:
- **max_depth**: [3, 5, 7, 10]
- **min_samples_split**: [2, 5, 10]
- **min_samples_leaf**: [1, 2, 4]
- **criterion**: ["gini", "entropy"]

This resulted in **72 candidate models** evaluated with **3-fold cross-validation** (216 total fits).

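GridSearchCV ranked the candidates by mean cross-validated accuracy. The top-ranked configuration (`criterion='entropy'`, `max_depth=10`, `min_samples_leaf=1`, `min_samples_split=2`) reached a cross-validation accuracy of 0.8578 and an accuracy of 0.8585 on the held-out test set. The selected `max_depth` is the largest value in the grid, so deeper trees were not explored. Per-class results are uneven: ascending stairs and descending stairs have recall of 0.41 and 0.32, well below the other activities.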
