# Group Assignment 1: Human Activity Detection
In this assignment you will create your own dataset for classification. You will explore which ML algorithms classify it best and present your best solution.

- Create your own dataset for custom human motions using Phyphox
- There should be at least 3 distinct types of motions
- The motions should be different to the ones used in the UCI dataset (Not: walking, sitting, standing, laying, stairs)
- Follow the steps and answer the questions given in this notebook

### Generating your dataset:

For this assignment you will create your own dataset of motions that you collect with an Accelerometer and Gyroscope. For this you can use your phone as a sensor.
The easiest way to collect your data is an app called [phyphox](https://phyphox.org/), which is free and available in app stores. The app can be configured to access your sensor data and sample it at a given frequency. You can set up timed experiment runs, and the timestamped data can be exported in the output format you need.

![](https://phyphox.org/wp-content/uploads/2019/06/phyphox_dark-1024x274.png)

Once you have installed the app, you can set up a custom experiment by clicking the + button. Define an experiment name and sample frequency, and activate the Accelerometer and Gyroscope. Your custom experiment will be added; you can run it by pressing the play button, and you will see the sensor readings. Pressing the three dots (...) lets you define timed runs, enable remote access, and export data.

Phyphox will generate two files with sensor data, one for the Accelerometer and one for the Gyroscope. Both files have timestamps, which might not line up between the two sensors. Please preprocess and merge the files so you can use them as your dataset for training, testing and deploying your own supervised learning model.
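The timestamp mismatch can be handled in several ways. As a minimal sketch, assuming both exports have a time column in seconds (named `Time` here) and using toy values with a hypothetical 10 ms tolerance, one option is a nearest-timestamp join with pandas' `merge_asof`:

```python
import pandas as pd

# Toy accelerometer and gyroscope logs with slightly offset timestamps (seconds).
acc = pd.DataFrame({"Time": [0.00, 0.02, 0.04, 0.06], "Acc_x": [0.1, 0.2, 0.3, 0.4]})
gyro = pd.DataFrame({"Time": [0.001, 0.021, 0.200], "Gyro_x": [1.0, 2.0, 3.0]})

# Match each accelerometer row to the nearest gyroscope row in time,
# but only if the two timestamps are within 10 ms of each other (assumed tolerance).
merged = pd.merge_asof(
    acc.sort_values("Time"),
    gyro.sort_values("Time"),
    on="Time",
    direction="nearest",
    tolerance=0.01,
)

# Rows without a sufficiently close gyroscope reading contain NaN and are dropped.
aligned = merged.dropna().reset_index(drop=True)
print(aligned)
```

The tolerance prevents a stale reading from one sensor being paired with a much later reading from the other; a suitable value depends on the sampling frequency you chose.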

### Steps

Use the same sequence of steps as before to train a model on your own dataset.

The generic steps are:
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.
9. Additional Questions


---
The notebook is divided according to these steps, so make sure you do the implementation and analysis at the corresponding locations in the notebook.

You may add additional code blocks, but keep the separation of the given structure.

At the end of each block, summarize, comment on, and conclude your current step in the given text blocks.




```
Roel van der Leest // 4910087
Jari van hoof // 4938135
Wout van der zanden // 4845250
```


# 1. Frame the problem and look at the big picture
*Describe the problem at hand and explain your approach*

### Problem Description
The goal of this assignment is to develop a machine learning model capable of recognizing specific human activities based on sensor data. We will distinguish between three gestures using data collected from a smartphone's accelerometer and gyroscope.

### Selected Activities
We have chosen three motions that involve different arm trajectories and intensities:
1.  **Drawing the letter 'O':** A continuous circular motion performed in the air.
2.  **Throwing a ball:** A rapid, linear acceleration followed by a deceleration.
3.  **Opening a door:** A reach-and-pull or push-and-turn motion.

### Approach & Methodology
*   **Data Collection:** We will use the **Phyphox** app to record 3-axis Accelerometer and Gyroscope data.
*   **Hardware Consistency:** To reduce the risk of overfitting to the sensors of one specific device, we will use two different smartphones for data collection. If this still turns out to be an issue, we will add a third device.
*   **Bias Mitigation:** To ensure our model learns the motion rather than a specific person's movement style, each group member will perform every activity 10 times with one phone and then 10 times with the other. This gives 60 samples per activity (3 people × 2 phones × 10 repetitions) and 180 samples in total.
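The sample counts in this plan can be checked by enumerating every planned recording. This is only an illustrative sketch; the person and phone labels are hypothetical placeholders.

```python
from itertools import product

# Hypothetical placeholder labels for the three group members and two phones.
people = ["person_1", "person_2", "person_3"]
phones = ["phone_A", "phone_B"]
activities = ["Letter O", "Throwing Ball", "Door Movement"]
repetitions = 10

# One entry per planned recording: (activity, person, phone, repetition number).
plan = list(product(activities, people, phones, range(1, repetitions + 1)))

per_activity = sum(1 for activity, *_ in plan if activity == "Letter O")
print(f"Samples per activity: {per_activity}")
print(f"Total samples: {len(plan)}")
```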


# 2. Get the data.

### Data Collection Protocol
To ensure good data and minimize bias, we have established the following protocol:
*   **Device:** Two phones will be used, so that the recordings cover more than one set of hardware sensors.
*   **Participants:** Each group member will perform the motions to prevent the model from overfitting to a single person's movement style.
*   **Volume:** We aim for 20 samples per person per movement with 10 samples on each phone.
*   **Sampling:** We will use fixed time windows in Phyphox to ensure consistent data length for each sample.

### The Classes
We will classify the following 3 activities:

#### 1. Opening a Door
*   **Action:** Miming the action of reaching for a handle, turning/pushing, and returning.
*   **Start Position:** Hands beside the body.
*   **Movement:** As shown in the video [TODO: Add video reference].
*   **End Position:** Hands beside the body.
*   **Phone Orientation:** [TODO: Add image of phone grip]

#### 2. Drawing the Letter 'O'
*   **Action:** Drawing a large circle in the air.
*   **Start Position:** Hands in front of the body at the top of the 'O'.
*   **Movement:** As shown in the video [TODO: Add video reference].
*   **End Position:** Same as start position (completing the loop).
*   **Phone Orientation:** [TODO: Add image of phone grip]

#### 3. Throwing a Ball
*   **Action:** Miming an overhand throw.
*   **Start Position:** Hands beside the body.
*   **Movement:** As shown in the video [TODO: Add video reference].
*   **End Position:** Hand at the top of the arc, after the "release".
*   **Phone Orientation:** [TODO: Add image of phone grip]

In [None]:
import pandas as pd
import os

def get_label_from_path(path):
    """
    Helper function to determine Activity from the file path.
    Adjust the keywords below to match your actual folder names.
    """
    path_lower = path.lower()
    
    # Determine Activity based on folder keywords
    if 'door' in path_lower:
        activity = 'Door Movement'
    elif 'letter o' in path_lower:
        activity = 'Letter O'
    elif 'ball' in path_lower or 'throw' in path_lower:  # Either keyword identifies the throwing folders
        activity = 'Throwing Ball'
    else:
        activity = 'Unknown'
        
    return activity

def load_and_merge_all_data(root_folder):
    all_dataframes = []
    sample_counter = 1 
    
    for root, dirs, files in os.walk(root_folder):
        if 'Accelerometer.csv' in files and 'Gyroscope.csv' in files:
            try:
                acc_path = os.path.join(root, 'Accelerometer.csv')
                gyro_path = os.path.join(root, 'Gyroscope.csv')
                
                df_acc = pd.read_csv(acc_path, sep=',')
                df_gyro = pd.read_csv(gyro_path, sep=',')
                
                df_acc.columns = ['Time', 'Acc_x', 'Acc_y', 'Acc_z']
                df_gyro.columns = ['Time', 'Gyro_x', 'Gyro_y', 'Gyro_z']
                
                df_acc = df_acc.sort_values('Time')
                df_gyro = df_gyro.sort_values('Time')
                
                df_merged = pd.merge_asof(df_acc, df_gyro, on='Time', direction='nearest')
                
                # --- ROBUST LABELING ---
                activity = get_label_from_path(root)
                df_merged['Activity'] = activity

                
                df_merged['Sample_ID'] = sample_counter
                sample_counter += 1
                
                all_dataframes.append(df_merged)
                
            except Exception as e:
                print(f"Error processing {root}: {e}")

    if all_dataframes:
        return pd.concat(all_dataframes, ignore_index=True)
    else:
        return pd.DataFrame()

df_final = load_and_merge_all_data('.')

print(f"Total data points: {len(df_final)}")
print(f"Total unique samples: {df_final['Sample_ID'].nunique() if not df_final.empty else 0}")

if not df_final.empty:
    print("\n--- Data Distribution ---")
    # Show how many data points carry each Activity label
    display(df_final['Activity'].value_counts().to_frame(name='Count'))
    
    print("\nPreview of the labeled dataset:")
    display(df_final.head())

### Data Loading & Merging Strategy

**Why not keep a separate dataset per folder?**

The data is still separated logically, but it is stored in a single structure that is convenient for Machine Learning.

**1. The `Activity` Label:**
Instead of creating separate variables for each folder, we add an `Activity` column.
*   The script automatically detects if a sample is "Door Movement", "Letter O", or "Throwing Ball" based on the folder path.
*   This creates the **Labels ($y$)** required for supervised learning.

**2. The `Sample_ID`:**
*   We assign a unique `Sample_ID` to every 5-second recording found.
*   This allows us to distinguish between individual movements (e.g., "Door Movement #1", "Door Movement #2") within the large dataset.

**Why this format?**
This "Long Format" (Time-Series) is the standard starting point. In **Step 4**, we will group by `Sample_ID` to calculate features (e.g., mean acceleration, max gyro) for each unique movement, creating the final training set.
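Before aggregating, it can help to confirm that every `Sample_ID` really covers one recording of the expected length. The sketch below uses a small made-up long-format table with the same column names as the merged dataset; the values are illustrative only.

```python
import pandas as pd

# Toy long-format data: several rows (time steps) per Sample_ID.
long_df = pd.DataFrame({
    "Sample_ID": [1, 1, 1, 2, 2, 2, 2],
    "Activity": ["Letter O"] * 3 + ["Throwing Ball"] * 4,
    "Time": [0.0, 2.5, 5.0, 10.0, 11.0, 13.0, 15.0],
    "Acc_x": [0.1, 0.4, 0.2, 1.5, 9.0, 3.0, 0.5],
})

# One row per recording: number of data points, duration, and an example feature.
summary = long_df.groupby(["Sample_ID", "Activity"]).agg(
    n_rows=("Time", "size"),
    duration_s=("Time", lambda t: t.max() - t.min()),
    acc_x_max=("Acc_x", "max"),
).reset_index()
print(summary)
```

Recordings whose duration or row count differs strongly from the rest are candidates for closer inspection before feature extraction.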

# 3. Explore the data to gain insights.

*Explore the data in any possible way, visualize the results (if you have multiple plots of the same kind of data put them in one larger plot)*

After the data collection with Phyphox, every recording was stored in a zip file that also contained unnecessary metadata. Manually unzipping and cleaning these files would be very time-consuming, so we used a Python script that automatically unzips all files and removes the metadata.

In [None]:
import os
import zipfile
import shutil

def unzip_and_clean(target_directory):
    
    if not os.path.exists(target_directory):
        print(f"Directory not found: {target_directory}")
        return

    # Walk through all directories recursively
    for root, dirs, files in os.walk(target_directory):
        # Check all files in the current directory
        for filename in files:
            file_path = os.path.join(root, filename)

            # Check if it's a zip file
            if filename.lower().endswith('.zip'):
                # Create a directory name based on the zip file (removing .zip extension)
                extract_folder_name = os.path.splitext(filename)[0]
                extract_path = os.path.join(root, extract_folder_name)
                
                # Unzip
                try:
                    with zipfile.ZipFile(file_path, 'r') as zip_ref:
                        zip_ref.extractall(extract_path)
                    
                    # Remove 'meta' directory if it exists
                    meta_dir = os.path.join(extract_path, 'meta')
                    if os.path.exists(meta_dir) and os.path.isdir(meta_dir):
                        shutil.rmtree(meta_dir)
                        
                    os.remove(file_path)
                except zipfile.BadZipFile:
                    print(f"Error: {filename} is a bad zip file.")
                except Exception as e:
                    print(f"Error processing {filename}: {e}")

if __name__ == "__main__":
    # Base directory containing all movement data folders
    base_dir = r"C:\Users\roelv\Documents\Machine Learning\AIS\assignment 1"
    
    # Process all movement data directories
    movement_types = ["Door movement data", "letter O movement data", "throwing ball movement data"]
    
    for movement_type in movement_types:
        target_dir = os.path.join(base_dir, movement_type)
        if os.path.exists(target_dir):
            print(f"Processing: {movement_type}")
            unzip_and_clean(target_dir)
    
    print("Done!")


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Set plot style
sns.set(style="whitegrid")

# 1. Time Series Visualization
# We plot the first recorded sample of each activity to see the "shape" of the signal.
activities = df_final['Activity'].unique()
# squeeze=False keeps `axes` two-dimensional even if only one activity is present
fig, axes = plt.subplots(len(activities), 2, figsize=(15, 4 * len(activities)), squeeze=False)

for i, activity in enumerate(activities):
    # Pick the first available sample for this activity
    sample_id = df_final[df_final['Activity'] == activity]['Sample_ID'].iloc[0]
    sample_data = df_final[df_final['Sample_ID'] == sample_id].copy()
    
    # Normalize time to start at 0 for clearer plotting
    sample_data['Time'] = sample_data['Time'] - sample_data['Time'].iloc[0]
    
    # Plot Accelerometer
    axes[i, 0].plot(sample_data['Time'], sample_data['Acc_x'], label='X')
    axes[i, 0].plot(sample_data['Time'], sample_data['Acc_y'], label='Y')
    axes[i, 0].plot(sample_data['Time'], sample_data['Acc_z'], label='Z')
    axes[i, 0].set_title(f'{activity} - Accelerometer (Sample {sample_id})')
    axes[i, 0].set_ylabel('m/s^2')
    axes[i, 0].legend(loc='upper right')
    
    # Plot Gyroscope
    axes[i, 1].plot(sample_data['Time'], sample_data['Gyro_x'], label='X')
    axes[i, 1].plot(sample_data['Time'], sample_data['Gyro_y'], label='Y')
    axes[i, 1].plot(sample_data['Time'], sample_data['Gyro_z'], label='Z')
    axes[i, 1].set_title(f'{activity} - Gyroscope (Sample {sample_id})')
    axes[i, 1].set_ylabel('rad/s')
    axes[i, 1].legend(loc='upper right')

plt.tight_layout()
plt.show()

# 2. Intensity Analysis (Boxplot)
# Let's calculate the "Total Acceleration" (Magnitude) to see if some activities are more intense than others.
# Magnitude = sqrt(x^2 + y^2 + z^2)
df_final['Acc_Magnitude'] = np.sqrt(df_final['Acc_x']**2 + df_final['Acc_y']**2 + df_final['Acc_z']**2)

plt.figure(figsize=(10, 6))
sns.boxplot(x='Activity', y='Acc_Magnitude', data=df_final)
plt.title('Distribution of Acceleration Intensity (Magnitude) by Activity')
plt.ylabel('Acceleration Magnitude (m/s^2)')
plt.show()

### Insights from Data Exploration

**1. Signal Patterns (Time Series Plots):**
*   **Door Movement:** We observe a distinct "push/pull" pattern. The acceleration typically spikes in one direction (reaching out) and then reverses (pulling back). The gyroscope shows rotation primarily around one axis (the wrist turning).
*   **Letter O:** This motion is characterized by sinusoidal waves in the X and Y axes of the accelerometer, representing the circular motion. The signals are smoother and more periodic compared to the sudden spikes of the door opening.
*   **Throwing Ball:** This is the most "explosive" movement. We see a very sharp, high-amplitude spike in acceleration (the throw) followed by a quick return to rest. The duration is often shorter than the other two activities.

**2. Intensity (Boxplot):**
*   The **Throwing Ball** activity generally shows the highest outliers and variance in Acceleration Magnitude, which makes sense due to the force required to "throw".
*   **Letter O** tends to have a more consistent magnitude, hovering around 9.8 m/s² (gravity) plus the movement force, as it is a controlled motion.
*   **Door Movement** sits somewhere in between, with moderate peaks.
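The baseline of roughly 9.8 m/s² follows directly from the magnitude formula: a stationary phone measures only gravity, whatever its orientation. A small worked sketch with hypothetical readings:

```python
import math

def magnitude(x, y, z):
    """Euclidean norm of a 3-axis reading, in the same unit as the inputs."""
    return math.sqrt(x**2 + y**2 + z**2)

# Hypothetical accelerometer readings in m/s^2.
at_rest = magnitude(0.0, 0.0, 9.81)    # phone lying flat: gravity on the z-axis only
tilted = magnitude(0.0, 6.94, 6.94)    # stationary but tilted: gravity split over two axes
moving = magnitude(12.0, 3.0, 9.81)    # during a fast motion: gravity plus movement

print(f"at rest: {at_rest:.2f}, tilted: {tilted:.2f}, moving: {moving:.2f}")
```

Because orientation does not change the magnitude at rest, deviations from the baseline mainly reflect the motion itself.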

**Conclusion:**
The visual differences in signal shape (spikes vs. waves) and intensity suggest that features like *Maximum Acceleration*, *Standard Deviation*, and *Mean Gyroscope* values are likely to be useful predictors for our Machine Learning model. This remains to be confirmed when the models are evaluated.

# 4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms

Prepare your data. Is it normalized? Are there outliers? Make a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# --- 0. Data Cleaning & Sanity Checks ---
print("--- Checking for Data Issues ---")

# Check 1: Missing Values in Raw Data
# NaNs can occur if the merge_asof failed to find a close match or if the sensor dropped data.
missing_count = df_final.isnull().sum().sum()
if missing_count > 0:
    print(f"Warning: Found {missing_count} missing values in raw data. Dropping rows with NaNs...")
    df_final = df_final.dropna()
else:
    print("Check 1 Passed: No missing values found in raw data.")

# Check 2: Remove "Short" Samples (Potential recording errors)
# A valid movement should have enough data points. 
# If a sample has very few points (e.g., < 10), it might be a glitch or a mis-click in Phyphox.
sample_counts = df_final['Sample_ID'].value_counts()
min_points_threshold = 10 # Arbitrary safety threshold. 5s @ 50Hz = 250 points.
short_samples = sample_counts[sample_counts < min_points_threshold].index

if len(short_samples) > 0:
    print(f"Warning: Removing {len(short_samples)} samples with too few data points (<{min_points_threshold}): {list(short_samples)}")
    df_final = df_final[~df_final['Sample_ID'].isin(short_samples)]
else:
    print("Check 2 Passed: All samples have sufficient data points.")


# --- 1. Feature Extraction ---
# We cannot feed raw time-series data directly into standard classifiers.
# We must aggregate the data: Turn each "Sample" (time series) into a single row of features.
# We will calculate: Mean, Std, Min, Max for every sensor axis.

# Define the columns we want to summarize
sensor_columns = ['Acc_x', 'Acc_y', 'Acc_z', 'Gyro_x', 'Gyro_y', 'Gyro_z', 'Acc_Magnitude']

# Group by Sample_ID and Activity
# Activity is constant within a sample, so adding it to the group key simply carries the label along
grouped = df_final.groupby(['Sample_ID', 'Activity'])

# Calculate statistical features
df_features = grouped[sensor_columns].agg(['mean', 'std', 'min', 'max'])

# Flatten the column names (e.g., ('Acc_x', 'mean') becomes 'Acc_x_mean')
df_features.columns = ['_'.join(col).strip() for col in df_features.columns.values]

# Reset index so 'Activity' becomes a column again
df_features = df_features.reset_index()

# Check 3: Check for NaNs in the extracted features (e.g. if a sample was empty)
if df_features.isnull().values.any():
    print("Warning: Found NaNs in feature matrix. Dropping...")
    df_features = df_features.dropna()

print("\nFeature Extraction Complete.")
print(f"Original Data Points: {len(df_final)}")
print(f"New Feature Matrix Shape: {df_features.shape} (Rows = Unique Samples)")

display(df_features.head())

# --- 2. Define X (Features) and y (Target) ---
X = df_features.drop(['Sample_ID', 'Activity'], axis=1)
y = df_features['Activity']

# --- 3. Split into Training and Test Sets ---
# We use 'stratify=y' to ensure the class distribution (Door/Ball/O) is the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\nTraining Set: {X_train.shape}")
print(f"Test Set: {X_test.shape}")

# --- 4. Normalization (Scaling) ---
# Many algorithms (like KNN, SVM, Neural Nets) perform better when features are on the same scale.
scaler = StandardScaler()

# Fit on training set ONLY, then transform both
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

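# Convert back to DataFrame for convenience (keeps column names and the original row index,
# so the rows stay aligned with y_train / y_test)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)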

print("\nData Scaled and Ready.")
display(X_train_scaled.head())

### Data Preparation Steps

**1. Data Cleaning:**
Before processing, we perform sanity checks to ensure data quality:
*   **Missing Values:** We check for and remove any rows with `NaN` values that might have occurred during the merging process.
*   **Short Samples:** We check for samples with very few data points (e.g., < 10). A valid 5-second recording should have hundreds of points. Extremely short samples are likely recording errors and are removed.

**2. Feature Engineering (Aggregation):**
Raw sensor data is a time series with many rows per movement (hundreds for a 5-second recording at typical sampling rates). Standard classifiers (like Random Forest or KNN) expect a single row of features per sample.
*   We grouped the data by `Sample_ID` (together with its `Activity` label).
*   We calculated statistical summaries (**Mean, Standard Deviation, Min, Max**) for each axis ($x, y, z$) and the Magnitude.
*   This transforms our dataset from "Long Format" to a **Feature Matrix** where each row represents one complete movement.

**3. Train/Test Split:**
*   We split the data into **80% Training** and **20% Testing**.
*   We used `stratify=y` to ensure that if we have 33% "Door" samples in the total set, we also have 33% in the test set. This prevents the model from being tested on an unrepresentative subset.
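The effect of `stratify` can be seen on a toy label vector. This sketch assumes the planned 60 samples per activity and a dummy feature column; it is not the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels mirroring the planned 60 samples per activity.
y = pd.Series(["Door Movement"] * 60 + ["Letter O"] * 60 + ["Throwing Ball"] * 60)
X = pd.DataFrame({"feature": range(len(y))})  # dummy feature, for illustration only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, each class keeps its share in both subsets.
print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```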

**4. Normalization:**
*   We applied `StandardScaler` to normalize the features (Mean = 0, Variance = 1).
*   **Why?** Algorithms that calculate distances (like K-Nearest Neighbors) are sensitive to the scale of numbers. If `Acc_z` ranges from 0-10 and `Gyro_x` ranges from 0-1000, the Gyroscope would dominate the distance calculation purely because the numbers are bigger. Scaling fixes this.
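The distance argument can be illustrated with two made-up features on very different scales. The numbers below are hypothetical and chosen only to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: column 0 has a small range, column 1 a large one.
X = np.array([
    [1.0, 100.0],
    [9.0, 110.0],
    [1.5, 400.0],
    [8.5, 390.0],
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw distances: column 1 dominates, so row 2 looks much farther from row 0 than row 1 does.
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])

# After standardization both columns contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
scaled_01, scaled_02 = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_01:.2f}  d(0,2)={raw_02:.2f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

Row 0 differs from row 1 mostly in the first feature and from row 2 mostly in the second. Only after scaling are these two differences weighted similarly.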

# 5. Explore many different models and short-list the best ones.

Explore / train and list the top 3 algorithms that score best on this dataset.
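One possible way to approach this step is to compare several classifiers with cross-validation. The sketch below is not the group's result: it runs on synthetic data from `make_classification` (180 samples, 28 features, 3 classes, matching the shape of the planned feature matrix), and the choice of candidate models is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix (assumption, not measured data).
X, y = make_classification(n_samples=180, n_features=28, n_informative=8,
                           n_classes=3, random_state=42)

# Scale-sensitive models get a scaler inside the pipeline, so scaling is
# refit on the training folds only during cross-validation.
candidates = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC()),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {name: cross_val_score(model, X, y, cv=cv).mean()
          for name, model in candidates.items()}

# Short-list the three models with the highest mean accuracy.
top3 = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
for name, acc in top3:
    print(f"{name}: {acc:.3f}")
```

With the real data, `X` and `y` would be replaced by the training features and labels; the ranking on synthetic data says nothing about the ranking on the recorded motions.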

In [None]:
# YOUR CODE HERE 

```
# Place your comments / conclusions / insight here
```


# 6. Fine-tune your models and combine them into a great solution.

Can you get better performance within a model? For example, if you use a KNN classifier, how does it behave if you change K (k=3 vs k=5 vs k=?)? Which parameters can be tuned in the chosen models?
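The question about K can be explored systematically with a grid search. The following is a hedged sketch on synthetic data; the parameter grid is an assumption and the output does not reflect the recorded motions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real training data (assumption).
X, y = make_classification(n_samples=180, n_features=28, n_informative=8,
                           n_classes=3, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

`search.cv_results_` holds the mean score for every combination, which makes it possible to plot accuracy against K instead of looking only at the best value.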

In [None]:
# YOUR CODE HERE 

```
# Place your comments / conclusions / insight here
```


# 7. Present your solution.

Explain why you would choose a specific model.

In [None]:
# YOUR CODE HERE 

```
# Place your comments / conclusions / insight here
```


# 8. Launch, monitor, and maintain your system.

Can you deploy the model?

> NOTE: The app provides the option for remote access, so you are able to get live sensordata from the phone

# 9. Additional Questions

* Explain why you chose these motions for classification.

* Which of these motions is easier/harder to classify and why?

* After your experience, which extra sensor data might help to build a better classifier and why?

* Explain why you think that your chosen algorithm outperforms the rest.

* When recording the same motions with the same sensor data, what do you think would help to improve the performance of your models?


```
# Place your comments / conclusions / insight here
```
