# Baseline Model

## Table of Contents
1. [Model Choice](#model-choice)
2. [Feature Selection](#feature-selection)
3. [Implementation](#implementation)
4. [Evaluation](#evaluation)


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error


## Model Choice

As a baseline modell a simple regression modell is mostly a good starting point. A appropriate modell for a classification problem is logistic regression, which I will use in the following.

## Feature Selection
The features will be the different activities. The train dataset will be based on the first 7 people in the datset whereas the test dataset will be the last of the 8 different people who gathered data. 

In [4]:
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv('/home/tomruge/Schreibtisch/UNI/Semester_7/machine_learning_with_tensorflow/archive_physical_activity.csv', engine='pyarrow')

# apply undersampling. sample down to size of smallest class
df = df.groupby('activityID').apply(lambda x: x.sample(df['activityID'].value_counts().min())).reset_index(drop=True)

# Drop all rows with NaN values
df.dropna(inplace=True)

# Mask for train and test split
mask = (df['PeopleId'] == 8)

# Use LabelEncoder to automatically assign numerical values to classes
label_encoder = LabelEncoder()
df['activityID'] = label_encoder.fit_transform(df['activityID'])

# Print the mapping of original class labels to numerical labels
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Label Mapping:", label_mapping)

y_train = df['activityID'][~mask]
X_train = df.drop(['activityID'], axis=1)[~mask]

y_test = df['activityID'][mask]
X_test = df.drop(['activityID'], axis=1)[mask]

Label Mapping: {'Nordic walking': 0, 'ascending stairs': 1, 'cycling': 2, 'descending stairs': 3, 'ironing': 4, 'lying': 5, 'rope jumping': 6, 'running': 7, 'sitting': 8, 'standing': 9, 'transient activities': 10, 'vacuum cleaning': 11, 'walking': 12}


## Implementation

Logistic regression:


In [34]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import copy
models = []
y_tests = []
X_tests = []
# Use LabelEncoder to automatically assign numerical values to classes
label_encoder = LabelEncoder()
df['activityID'] = label_encoder.fit_transform(df['activityID'])

for i in range(9):
    print("PeopleId_mask:", i)
    # Mask for train and test split
    mask = (df['PeopleId'] == i)
    work_df = copy.copy(df)

    # Print the mapping of original class labels to numerical labels
    label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
    
    y_train = work_df['activityID'][~mask]
    X_train = work_df.drop(['activityID'], axis=1, inplace = False)[~mask]

    y_test = work_df['activityID'][mask]
    X_test = work_df.drop(['activityID'], axis=1, inplace=False)[mask]
    
    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)
    # Standardize the features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Create and train the logistic regression model
    model = LogisticRegression(multi_class='multinomial', max_iter=100000)
    model.fit(X_train_scaled, y_train)
    models.append(copy.copy(model))
    y_tests.append(copy.copy(y_test))
    X_tests.append(copy.copy(X_test_scaled))


PeopleId_mask: 0
(558597, 32)
(0, 32)
(558597,)
(0,)


ValueError: Found array with 0 sample(s) (shape=(0, 32)) while a minimum of 1 is required by StandardScaler.

## Evaluation

My metrics will be: $$ Accuracy = \frac{Number\ of\ Correct\ Predictions}{Total\ Number\ of\ Predictions}$$


In [21]:
y_pred_labels_list = []

for i in range(9):
    print("Test_PeopleID:", i)
    
    # Make predictions on the test set
    y_pred = models[i].predict(X_tests[i])

    # Evaluate the model
    accuracy = accuracy_score(y_tests[i], y_pred)
    print(f'Accuracy: {accuracy:.2f}')

    # Display classification report
    print(classification_report(y_tests[i], y_pred))
    y_pred_labels_list.append(y_pred)

    # Check if all entries in y_tests and y_pred_labels_list are the same
    print(f"Unique values in y_tests: {set(y_tests[i])}")
    print(f"Unique values in y_pred_labels_list: {set(y_pred_labels_list[i])}")
    print("-" * 40)


Test_PeopleID: 0
Accuracy: 0.41
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      6599
           1       0.29      0.60      0.39      4341
           2       0.60      0.93      0.73      6546
           3       0.16      0.38      0.22      3948
           4       0.00      0.00      0.00      6017
           5       1.00      0.94      0.97      5418
           6       0.71      0.65      0.68      8806
           7       0.46      0.52      0.49      7299
           8       0.02      0.00      0.00      5447
           9       0.01      0.01      0.01      5675
          10       0.15      0.21      0.18      6697
          11       0.30      0.38      0.34      5956
          12       0.41      0.58      0.48      5679

    accuracy                           0.41     78428
   macro avg       0.32      0.40      0.34     78428
weighted avg       0.33      0.41      0.36     78428

Unique values in y_tests: {0, 1, 2, 3, 4, 5, 6,

Confusion matrix:

In [22]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt



# Calculate the confusion matrix
conf_matrix = confusion_matrix(y_test[-1], y_pred_labels_list[-1])

# Plot the confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=label_mapping.keys(), yticklabels=label_mapping.keys())
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()


KeyError: -1

As expected the modell is confused by transient activities, since they are not well defined and have a big deviation within itself. Also activities which can be seen related like sitting and lying got confused very often. But for a first modell not that bad.