# Activity Dataset Preparation and Algorithm Investigation

#### Load Helper Functions

In [1]:
%reset -f

import importlib
import activity_functions
importlib.reload(activity_functions)
from activity_functions import *

## Preparation

#### Load Data

In [2]:
activity = load_data()

Loaded from Kaggle: /home/thuy/.cache/kagglehub/datasets/diegosilvadefrana/fisical-activity-dataset/versions/4/dataset2.csv


In [3]:
print(len(activity))

2864056


#### Create Train and Test Sets

In [4]:
train, test = create_train_test(activity, test_ratio=0.2)

In [11]:
print(train.shape)
test.shape

(2291244, 33)


(572812, 33)

#### Prepare Data

In [5]:
X_train, y_train, X_test, y_test = prepare_for_train(train, test)

In [6]:
X_train, y_train, X_dev, y_dev = train_dev_split(X_train, np.ravel(y_train), ratio=0.25)

## Algorithm Investigation

#### Train Models

We will investigate the following five algorithms, selected to cover a diverse range of classification approaches: linear and non-linear, parametric and non-parametric. They should yield different results that will be interesting to compare and contrast.

- SGDClassifier
- LinearSVC
- KNeighborsClassifier
- LogisticRegression
- DecisionTreeClassifier
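
To illustrate why mixing linear and non-linear methods can produce very different results, here is a minimal sketch on a synthetic toy problem. It does not use the activity dataset: the data comes from scikit-learn's `make_circles`, and the names `X_toy`, `toy_models`, and `toy_accuracy` are made up for this example.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data: one ring inside another, so no straight line separates the classes.
X_toy, y_toy = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

# One linear model and two non-linear, non-parametric models.
toy_models = {
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
    "KNeighbors": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

toy_accuracy = {}
for name, model in toy_models.items():
    model.fit(X_tr, y_tr)
    # score() returns mean accuracy on the held-out toy data.
    toy_accuracy[name] = model.score(X_te, y_te)
    print(f"{name}: {toy_accuracy[name]:.3f}")
```

On data like this, a linear decision boundary cannot do much better than chance, while the neighbor-based and tree-based models can follow the curved class boundary. Whether the activity data behaves similarly is something only the evaluation can show.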

In [7]:
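# TODO: tune max_iter. ConvergenceWarning is silenced below, so a value that is too small can go unnoticed.
# Note: every model in this cell is fit on the dev split (X_dev, y_dev).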
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning)

SGD_model = SGDClassifier(max_iter=1000, random_state=42)
SGD_model.fit(X_dev, y_dev)

LinearSVC_model = LinearSVC(max_iter=1000, random_state=42)
LinearSVC_model.fit(X_dev, y_dev)

KNeighbors_model = KNeighborsClassifier()
KNeighbors_model.fit(X_dev, y_dev)

LogisticRegression_model = LogisticRegression(max_iter=1000, random_state=42)
LogisticRegression_model.fit(X_dev, y_dev)

DecisionTree_model = DecisionTreeClassifier(random_state=42)
DecisionTree_model.fit(X_dev, y_dev)

DecisionTreeClassifier(random_state=42)


#### Evaluate Models

Now that the models are fitted, we can evaluate them on the test set using the metrics below. Together, these metrics give a good picture of how well each algorithm predicts the activity types, and they are easy to compare across models.

- Accuracy
- F1 Score
- Recall
- Precision
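
The four metrics can be sketched with `sklearn.metrics` as shown below. This is not the notebook's `compute_scores` helper, whose implementation is not shown here. The function `compute_scores_sketch`, the macro averaging, and the activity labels in the demo are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def compute_scores_sketch(y_true, y_pred, average="macro"):
    """Return a one-row DataFrame with the four metrics (hypothetical helper)."""
    return pd.DataFrame([{
        "Accuracy": accuracy_score(y_true, y_pred),
        # For multiclass targets an averaging strategy is required; "macro" is an assumption.
        "F1_Score": f1_score(y_true, y_pred, average=average, zero_division=0),
        "Recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "Precision": precision_score(y_true, y_pred, average=average, zero_division=0),
    }])


# Made-up labels, only to show the call and the shape of the result.
y_true_demo = ["walk", "run", "walk", "sit", "sit", "run"]
y_pred_demo = ["walk", "run", "sit", "sit", "sit", "run"]
demo_scores = compute_scores_sketch(y_true_demo, y_pred_demo)
print(demo_scores.to_string(index=False))
```

Macro averaging weights every class equally regardless of how many samples it has. Weighted or micro averaging would make recall equal to accuracy, so the choice of `average` changes how the columns should be read.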

In [8]:
results = []

models = {
    'SGDClassifier': SGD_model,
    'LinearSVC': LinearSVC_model,
    'KNeighbors': KNeighbors_model,
    'LogisticRegression': LogisticRegression_model,
    'DecisionTree': DecisionTree_model,
}

for name, model in models.items():
    y_test_hat = model.predict(X_test)
    scores_df = compute_scores(y_test, y_test_hat)
    scores_df['Model'] = name
    results.append(scores_df)

final_results = pd.concat(results, ignore_index=True)
final_results = final_results[['Model', 'Accuracy', 'F1_Score', 'Recall', 'Precision']]
print(final_results.to_string(index=False))

             Model  Accuracy  F1_Score   Recall  Precision
     SGDClassifier  0.691876  0.656217 0.654756   0.707369
         LinearSVC  0.711212  0.680369 0.655716   0.766411
        KNeighbors  0.973311  0.974542 0.975556   0.973831
LogisticRegression  0.753176  0.742560 0.722014   0.777716
      DecisionTree  0.977829  0.974233 0.973396   0.975081


Of the five algorithms investigated, two clearly stand out. From here, we will perform further analysis to find optimal hyperparameters for the following:

- KNeighbors Classifier
- Decision Tree Classifier

Both algorithms scored almost identically on every metric (roughly 0.97 to 0.98), well ahead of the three linear models, whose accuracy ranged from about 0.69 to 0.75. The linear models most likely struggled to capture a non-linear relationship between the predictors and the target, whereas KNeighbors can exploit local patterns and the decision tree can learn hierarchical decision boundaries.
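
One possible way to approach the hyperparameter search for these two models is a cross-validated grid search. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` rather than the activity dataset, and the parameter grids, the `f1_macro` scoring choice, and the names `search_spaces` and `best_searches` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small synthetic multiclass problem standing in for the real features and labels.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn
)

# Illustrative grids only; sensible ranges depend on the real data.
search_spaces = {
    "KNeighbors": (
        KNeighborsClassifier(),
        {"n_neighbors": [3, 5, 11], "weights": ["uniform", "distance"]},
    ),
    "DecisionTree": (
        DecisionTreeClassifier(random_state=42),
        {"max_depth": [5, 10, None], "min_samples_leaf": [1, 5]},
    ),
}

best_searches = {}
for name, (estimator, grid) in search_spaces.items():
    # 3-fold cross-validation over every combination in the grid.
    search = GridSearchCV(estimator, grid, scoring="f1_macro", cv=3)
    search.fit(X_tr, y_tr)
    best_searches[name] = search
    print(name, search.best_params_, round(search.best_score_, 3))
```

With a dataset of several million rows, an exhaustive grid can be expensive, especially for KNeighbors, so searching on a subsample or using `RandomizedSearchCV` may be worth considering.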