# Activity Dataset Preparation and Algorithm Investigation

#### Libraries

In [1]:
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

#### Load Helper Functions

In [1]:
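# Reload the helper module so edits to activity_functions.py are picked up.
# (No %reset -f here: it would clear the sklearn imports from the Library cell.)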

import importlib
import activity_functions
importlib.reload(activity_functions)
from activity_functions import *

## Preparation

#### Load Data

In [2]:
activity = load_data()

Loaded from Kaggle: /Users/brookspeterson/.cache/kagglehub/datasets/diegosilvadefrana/fisical-activity-dataset/versions/4/dataset2.csv


#### Create Train and Test Sets

In [3]:
train, test = create_train_test(activity, test_ratio=0.2)
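
The `create_train_test` helper lives in `activity_functions` and its implementation is not shown here. As a rough illustration of what such a split can look like, the sketch below wraps scikit-learn's `train_test_split`. The function name `create_train_test_sketch`, the `label_col` and `seed` parameters, and the optional stratification are all assumptions made for this example, not the helper's actual behavior.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def create_train_test_sketch(df, test_ratio=0.2, label_col=None, seed=42):
    """Hypothetical stand-in for create_train_test: a reproducible random split."""
    # Assumption: stratify on the label column (if one is given) so that each
    # activity class keeps roughly the same share in both subsets.
    stratify = df[label_col] if label_col is not None else None
    train_df, test_df = train_test_split(
        df, test_size=test_ratio, random_state=seed, stratify=stratify
    )
    return train_df, test_df
```

Fixing the seed keeps the split identical across notebook runs, which matters when comparing models on the same test set.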

#### Prepare Data

In [4]:
X_train, y_train, X_test, y_test = prepare_for_train(train, test)

## Algorithm Investigation

#### Train Models

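We will investigate the following five algorithms, chosen to cover a diverse range of classification approaches: linear and non-linear, parametric and non-parametric. SGDClassifier, LinearSVC, and LogisticRegression are linear, parametric models, while KNeighborsClassifier and DecisionTreeClassifier are non-linear and non-parametric, so we expect their results to differ in ways that are interesting to compare and contrast.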

- SGDClassifier
- LinearSVC
- KNeighborsClassifier
- LogisticRegression
- DecisionTreeClassifier

In [None]:
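# TODO: max_iter=10 is a placeholder. Raise it until the iterative models
# converge (scikit-learn defaults: 1000 for SGDClassifier and LinearSVC,
# 100 for LogisticRegression).

# Silence the convergence warnings that such a low max_iter triggers.
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning)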

SGD_model = SGDClassifier(max_iter=10, random_state=42)
SGD_model.fit(X_train, y_train)

LinearSVC_model = LinearSVC(max_iter=10, random_state=42)
LinearSVC_model.fit(X_train, y_train)

KNeighbors_model = KNeighborsClassifier()
KNeighbors_model.fit(X_train, y_train)

LogisticRegression_model = LogisticRegression(max_iter=10, random_state=42)
LogisticRegression_model.fit(X_train, y_train)

DecisionTree_model = DecisionTreeClassifier(random_state=42)
DecisionTree_model.fit(X_train, y_train)
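
One possible way to choose a suitable `max_iter` for the iterative models above is to refit with increasing values until scikit-learn stops emitting a `ConvergenceWarning`. The sketch below does that. The helper name `smallest_converging_max_iter` and the candidate values are assumptions chosen for illustration.

```python
import warnings

from sklearn.base import clone
from sklearn.exceptions import ConvergenceWarning


def smallest_converging_max_iter(estimator, X, y, candidates=(10, 100, 1000, 10000)):
    """Return (max_iter, fitted_model) for the first candidate that fits
    without a ConvergenceWarning, or (None, None) if none of them does."""
    for max_iter in candidates:
        # Work on a fresh copy so the estimator passed in is left untouched.
        model = clone(estimator).set_params(max_iter=max_iter)
        with warnings.catch_warnings(record=True) as caught:
            # Record convergence warnings even if they are ignored globally.
            warnings.simplefilter("always", category=ConvergenceWarning)
            model.fit(X, y)
        if not any(issubclass(w.category, ConvergenceWarning) for w in caught):
            return max_iter, model
    return None, None
```

For example, `smallest_converging_max_iter(LogisticRegression(random_state=42), X_train, y_train)` would return the first candidate at which the solver reports convergence. This only applies to estimators that accept `max_iter`, so not to KNeighborsClassifier or DecisionTreeClassifier.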



#### Evaluate Models

Now that the models are trained, we evaluate them on the test set with the four metrics below. Together they show how well each algorithm predicts the activity type, and they are easy to compare across models.

- Accuracy
- F1 Score
- Recall
- Precision
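
The four metrics above are computed by the `compute_scores` helper from `activity_functions`, whose source is not shown here. As a minimal sketch of what it might do, the function below builds a one-row DataFrame with the same column names used later in this notebook. The name `compute_scores_sketch` and the macro averaging across activity classes are assumptions. The real helper may average differently.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def compute_scores_sketch(y_true, y_pred, average="macro"):
    """Hypothetical stand-in for compute_scores: one row of metrics."""
    # Assumption: macro averaging weights every class equally, whatever its size.
    return pd.DataFrame([{
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1_Score": f1_score(y_true, y_pred, average=average, zero_division=0),
        "Recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "Precision": precision_score(y_true, y_pred, average=average, zero_division=0),
    }])
```

Macro averaging treats rare and common activities alike. If the classes are imbalanced, `average="weighted"` is an alternative that weights each class by its support.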

In [None]:
import pandas as pd  # may already be in scope via activity_functions

results = []  # one row of scores per model

models = {
    'SGDClassifier': SGD_model,
    'LinearSVC': LinearSVC_model,
    'KNeighbors': KNeighbors_model,
    'LogisticRegression': LogisticRegression_model,
    'DecisionTree': DecisionTree_model,
}

# Score every model on the held-out test set and tag each row with its name.
for name, model in models.items():
    y_test_hat = model.predict(X_test)
    scores_df = compute_scores(y_test, y_test_hat)
    scores_df['Model'] = name
    results.append(scores_df)

final_results = pd.concat(results, ignore_index=True)
final_results = final_results[['Model', 'Accuracy', 'F1_Score', 'Recall', 'Precision']]
print(final_results.to_string(index=False))
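
To pick the two strongest algorithms from a results table like the one printed above, one option is to rank the rows by a single metric. The sketch below assumes F1_Score as the ranking metric, which is a choice made for this example. The helper name `top_models` is also hypothetical.

```python
import pandas as pd


def top_models(results_df, metric="F1_Score", k=2):
    """Return the k rows with the highest value of `metric`, best first."""
    return results_df.nlargest(k, metric).reset_index(drop=True)
```

Calling `top_models(final_results)` would return the two best rows by F1 score. Passing `metric="Accuracy"` ranks by accuracy instead.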

[IN THIS CELL EXPLAIN WHICH TWO ALGORITHMS DID THE BEST AND NOTE THEM AS OUR SELECTION FOR THE NEXT STEP]