# Activity Classification - DecisionTreeClassifier Training

This notebook trains a DecisionTreeClassifier on the physical activity dataset using GridSearchCV for hyperparameter tuning.

## Import Required Libraries

In [1]:
# Imported for optional warning filtering; neither name is used in the cells below.
import warnings
from sklearn.exceptions import ConvergenceWarning

## Load Data and Prepare Training Set


In [None]:
%reset -f

import importlib

import activity_functions
importlib.reload(activity_functions)
from activity_functions import *


# Google Colab only: add the project folder on Drive to the import path.
# import sys
# sys.path.append('/content/drive/MyDrive/ds420Projects/project1')
# from activity_functions import *

In [4]:
activity = load_data()

Downloading from https://www.kaggle.com/api/v1/datasets/download/diegosilvadefrana/fisical-activity-dataset?dataset_version_number=4...

100%|██████████| 297M/297M [00:01<00:00, 165MB/s]

Extracting files...

Loaded from Kaggle: /root/.cache/kagglehub/datasets/diegosilvadefrana/fisical-activity-dataset/versions/4/dataset2.csv


In [5]:
df_train, df_test = create_train_test(activity, test_ratio=0.2)
print(df_train.shape)
print(df_test.shape)

(2291244, 33)
(572812, 33)


In [6]:
X_train, y_train, X_test, y_test = prepare_for_train(df_train, df_test)
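
The helpers in `activity_functions` are not shown in this notebook. As a rough sketch only, a function like `create_train_test` could be built on scikit-learn's `train_test_split`. The `split_sketch` name, the toy DataFrame, the `label` column, the fixed seed, and the stratification are all assumptions made for illustration, not the project's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: 10 rows, 2 features, 1 label column.
demo = pd.DataFrame({
    "feature_a": np.arange(10, dtype=float),
    "feature_b": np.arange(10, 20, dtype=float),
    "label": [0, 1] * 5,
})

def split_sketch(df, test_ratio=0.2, label_col="label", seed=42):
    """Assumed behavior: random row split that keeps class proportions."""
    df_train, df_test = train_test_split(
        df,
        test_size=test_ratio,
        random_state=seed,
        stratify=df[label_col],  # assumption: the real helper may not stratify
    )
    return df_train, df_test

demo_train, demo_test = split_sketch(demo)
print(demo_train.shape)
print(demo_test.shape)
```

Stratifying keeps the class mix of the test set close to that of the training set, which matters when some activities are much rarer than others.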

In [9]:
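# Informational check of whether a GPU is visible to TensorFlow.
# scikit-learn's DecisionTreeClassifier trains on the CPU regardless.
import tensorflow as tf
print(tf.test.gpu_device_name())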

/device:GPU:0


## Hyperparameter Tuning with GridSearchCV


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

def grid_searchCV(X, y):
    """Run a 3-fold grid search over decision-tree hyperparameters."""
    model = DecisionTreeClassifier(
        random_state=42
    )
    param = {
        "max_depth": [3, 5, 7, 10],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "criterion": ["gini", "entropy"]
    }

    grid = GridSearchCV(
        model,
        param,
        verbose=1,
        refit=True,          # refit the best candidate on all of X, y
        cv=3,
        scoring='accuracy',
        n_jobs=-1            # use all available CPU cores
    )

    grid.fit(X, y)
    return grid

# best_model is the fitted GridSearchCV object; predict() uses the refit best estimator.
best_model = grid_searchCV(X_train, y_train)

Fitting 3 folds for each of 72 candidates, totalling 216 fits

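
The candidate and fit counts reported by the grid search follow directly from the size of the grid. A minimal check, using scikit-learn's `ParameterGrid` on the same dictionary as in `grid_searchCV` (the variable names `n_candidates` and `n_folds` are introduced here only for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in grid_searchCV.
param = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}

n_candidates = len(ParameterGrid(param))  # 4 * 3 * 3 * 2
n_folds = 3                               # matches cv=3
print(n_candidates, n_candidates * n_folds)  # 72 216
```

Because the cost grows multiplicatively with each added value, checking the count this way before fitting on roughly 2.3 million rows can save a long wait.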


## Best Hyperparameters Found

Display the best hyperparameters found by GridSearchCV:


In [1]:
print("Best Hyperparameters:")
print(best_model.best_params_)
print(f"\nBest Cross-Validation Accuracy: {best_model.best_score_:.4f}")


Best Hyperparameters:


NameError: name 'best_model' is not defined
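
Beyond `best_params_` and `best_score_`, a fitted `GridSearchCV` object exposes per-candidate scores in `cv_results_`. The sketch below fits a much smaller grid on synthetic data (the `X_demo`, `y_demo`, and `demo_grid` names and the dataset are assumptions for illustration, not the activity data) and lists the candidates by rank. The same inspection would apply to `best_model` once the tuning cell has been run in the current kernel session.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real training set.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

demo_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5], "criterion": ["gini", "entropy"]},
    cv=3,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

results = demo_grid.cv_results_
order = np.argsort(results["rank_test_score"])  # best candidate first

# One line per candidate: rank, mean and std of the fold accuracies, parameters.
for i in order:
    print(
        results["rank_test_score"][i],
        f"{results['mean_test_score'][i]:.4f}",
        f"{results['std_test_score'][i]:.4f}",
        results["params"][i],
    )
```

Comparing the standard deviation across folds with the gap between the top candidates gives a sense of whether the winning configuration is meaningfully better or within noise.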

## Model Evaluation

Evaluate the best model on the test set:


In [None]:
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, classification_report

# Predict on test set
y_test_pred = best_model.predict(X_test)

# Calculate metrics
test_accuracy = accuracy_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred, average='weighted')
test_recall = recall_score(y_test, y_test_pred, average='weighted')
test_precision = precision_score(y_test, y_test_pred, average='weighted')

print("=" * 60)
print("DecisionTreeClassifier - Test Set Performance")
print("=" * 60)
print(f"Accuracy:  {test_accuracy:.4f}")
print(f"F1 Score:  {test_f1:.4f}")
print(f"Recall:    {test_recall:.4f}")
print(f"Precision: {test_precision:.4f}")
print("=" * 60)


In [None]:
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_test_pred))
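
The metrics above use `average='weighted'`, which averages the per-class scores weighted by each class's support (its number of true samples). The sketch below illustrates this on a small made-up label vector (the `y_true` and `y_pred` arrays are invented for illustration and are not results from this notebook). It also shows that weighted recall equals plain accuracy, so those two figures in the test-set summary should match.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 2])

per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 per class
support = np.bincount(y_true)                          # true samples per class

manual_weighted = np.average(per_class_f1, weights=support)
library_weighted = f1_score(y_true, y_pred, average="weighted")
macro = f1_score(y_true, y_pred, average="macro")      # unweighted mean

weighted_recall = recall_score(y_true, y_pred, average="weighted")
accuracy = accuracy_score(y_true, y_pred)

print(np.isclose(manual_weighted, library_weighted))
print(np.isclose(weighted_recall, accuracy))
```

When classes are imbalanced, the macro average is worth reporting alongside the weighted one, since the weighted score can hide poor performance on rare classes.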


## Summary

The DecisionTreeClassifier was tuned using GridSearchCV with the following hyperparameter grid:
- **max_depth**: [3, 5, 7, 10]
- **min_samples_split**: [2, 5, 10]
- **min_samples_leaf**: [1, 2, 4]
- **criterion**: ["gini", "entropy"]

This resulted in **72 candidate models** evaluated with **3-fold cross-validation** (216 total fits).

The best model was selected by mean cross-validated accuracy, refit on the full training set (`refit=True`), and evaluated on the held-out test set above.
