# Group Assignment 1: Human Activity Detection
In this assignment you will create your own dataset for classification. You will explore which ML algorithms classify it best and present your best solution.

- Create your own dataset for custom human motions using Phyphox
- There should be at least 3 distinct types of motions
- The motions should be different from the ones used in the UCI dataset (not walking, sitting, standing, laying, or stairs)
- Follow the steps and answer the questions given in this notebook

### Generating your dataset:

For this assignment you will create your own dataset of motions, collected with an accelerometer and a gyroscope. You can use your phone as the sensor.
The easiest way to collect the data is with [phyphox](https://phyphox.org/), a free app available in the app stores. The app can be configured to access your sensor data and sample it at a given frequency. You can set it up to record in timed experiment slots, and the timestamped data can be exported in the output format you need.

![](https://phyphox.org/wp-content/uploads/2019/06/phyphox_dark-1024x274.png)

Once you have installed the app, you can set up a custom experiment by clicking the + button. Define an experiment name and sample frequency, and activate the Accelerometer and Gyroscope. Your custom experiment will be added to the list; run it by pressing the play button and you will see the sensor readings. Pressing the three dots (...) lets you define timed runs, enable remote access, and export data.

Phyphox will generate two files with sensor data, one for the accelerometer and one for the gyroscope. Both files have timestamps, but the timestamps of the two sensors might not line up with each other. Please preprocess and merge the files so you can use them as your dataset for training, testing and deploying your own supervised learning model.
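
One way to line the two files up is a nearest-timestamp join. The sketch below is only an illustration: it assumes both exports carry a `Time (s)` column (check the header of your own files), uses small made-up frames instead of real exports, and the helper name `merge_sensors` and the 5 ms tolerance are hypothetical choices.

```python
import pandas as pd

TIME_COL = "Time (s)"  # assumed name of the timestamp column in both exports


def merge_sensors(acc: pd.DataFrame, gyro: pd.DataFrame,
                  tolerance_s: float = 0.005) -> pd.DataFrame:
    """Pair each accelerometer sample with the nearest gyroscope sample in time."""
    merged = pd.merge_asof(
        acc.sort_values(TIME_COL),
        gyro.sort_values(TIME_COL),
        on=TIME_COL,
        direction="nearest",
        tolerance=tolerance_s,
    )
    # Rows without a gyroscope sample inside the tolerance contain NaN; drop them.
    return merged.dropna().reset_index(drop=True)


# Toy stand-ins for the two exported files (values are made up).
acc = pd.DataFrame({TIME_COL: [0.000, 0.010, 0.020, 0.030],
                    "Acceleration x (m/s^2)": [0.1, 0.2, 0.3, 0.4]})
gyro = pd.DataFrame({TIME_COL: [0.001, 0.011, 0.021, 0.500],
                     "Gyroscope x (rad/s)": [1.0, 1.1, 1.2, 1.3]})

merged = merge_sensors(acc, gyro)
print(merged)
```

The last accelerometer sample has no gyroscope reading within 5 ms, so it is dropped rather than paired with a distant value. Resampling both signals onto a common time grid would be an alternative.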

### Steps

With your own dataset, follow a similar sequence of steps to train your model.

These are the generic steps to be taken:
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.
9. Additional Questions


---
The notebook uses this structure to separate the steps, so make sure you do the implementation and analysis at the corresponding locations in the notebook.

You may add additional code blocks, but keep the separation of the given structure.

At the end of each block, summarize, comment on, and conclude your current step in the given text blocks.




```
Project group 20:
Lars Claassen   - 4159632
Tonnie Bour     - 4130456
Pjotr Maes      - 3839001
```


# 1. Frame the problem and look at the big picture
Describe the problem at hand and explain your approach

```
A model is designed to classify human activities. To achieve this, data is gathered to train multiple models for classification.

We use our phones and their built-in sensors to track the following activities:
    - Casual walking
    - Walking while looking at the phone
    - Drinking
For each activity, gyroscope, acceleration, and linear acceleration data are recorded and logged. Each activity is measured separately in 10-second intervals. Each recording is then compressed into summary features: the minimum, maximum, mean, and standard deviation of every axis of every sensor. This gives 36 numeric features per measurement, or 38 columns once the subject and activity labels are included.

The compressed values from each measurement are compiled into a single dataset, which is then split into training and testing sets. Different model types are trained and evaluated to determine the most suitable one, providing insight into which model performs best with our data.

```
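
The size of the feature table follows from the naming scheme used by the extraction script in step 2 (`<sensor> <axis> (<unit>)_<statistic>`). A quick way to check the count is to enumerate those names. The sketch below assumes three sensors, three axes and four statistics, as described above; the constant names are illustrative.

```python
from itertools import product

# Sensors and their units, matching the column labels used in step 2.
SENSORS = {
    "Acceleration": "m/s^2",
    "Gyroscope": "rad/s",
    "Linear Acceleration": "m/s^2",
}
AXES = ("x", "y", "z")
STATS = ("mean", "std", "min", "max")

feature_names = [
    f"{sensor} {axis} ({unit})_{stat}"
    for (sensor, unit), axis, stat in product(SENSORS.items(), AXES, STATS)
]
columns = ["subject", "Activity"] + feature_names

print(len(feature_names), len(columns))  # 36 38
```

So the 38 columns of the final table consist of 36 numeric features plus the `subject` and `Activity` labels.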


# 2. Get the data.

Initialize the system, get all needed libraries, retrieve the data and import it.

> Create your own dataset

> Explain and show (with a few images) which motions you are classifying, how you generated them, and what problems you encountered in the process!


```
To gather the data we used an app called Phyphox. Within this app we created our own experiment that tracks the gyroscope, acceleration, and linear acceleration. This data is logged and saved per activity. Each measurement records for about 10 seconds. The duration is the same for every activity, to prevent a bias from forming towards longer-lasting activities.

For casual walking we held the phone in the palm of our hand while swinging it as you would naturally do while walking. 
```

<img src="Images/W1.jpg" alt="Person holding phone" width="300">
<img src="Images/W2.jpg" alt="Person holding phone" width="300">
<img src="Images/W3.jpg" alt="Person holding phone" width="300">

```
To measure walking while looking at the phone, a subject would walk while holding the phone as if they were texting or looking at it.
```

<img src="Images/WW1.jpg" alt="Person holding phone" width="300">
<img src="Images/WW2.jpg" alt="Person holding phone" width="300">
<img src="Images/WW3.jpg" alt="Person holding phone" width="300">

```
To measure drinking a subject would hold the phone in the hand together with their beverage (of their choice) and take multiple sips. 
```

<img src="Images/D1.jpg" alt="Person holding phone" width="300">
<img src="Images/D2.jpg" alt="Person holding phone" width="300">
<img src="Images/D3.jpg" alt="Person holding phone" width="300">

```
To achieve the best model performance, a large amount of data is required. However, creating a prototype model using a smaller dataset is still feasible.

In our case, three subjects each perform every activity 10 times, giving 30 recordings per activity and 90 measurements in total.

Each recording is compressed into summary features (the minimum, maximum, mean, and standard deviation of each sensor axis), giving 36 numeric features per measurement. Together with the subject and activity columns, the dataset has a final shape of 90x38.
```

```
Two significant problems were encountered during data collection:
1) Interpreting and compressing the raw data into useful training data.
2) Aligning different naming conventions, as various phone brands produced results with differing labels.
```

In [17]:
# YOUR CODE HERE 
import os
import pandas as pd

# Define the path to the root folder containing the action folders
root_folder = "datasets/testSet"

# Create an empty list to hold the summarized data
summary_data = []

# Iterate through all folders in the root folder
for folder_name in os.listdir(root_folder):
    folder_path = os.path.join(root_folder, folder_name)

    # Check if the folder_path is a directory
    if os.path.isdir(folder_path):
        # Extract person and action from the folder name
        parts = folder_name.split("-")
        if len(parts) >= 2:
            person = parts[0]
            action_with_index = parts[1].rsplit(" ", 1)[0]
            action = ''.join([i for i in action_with_index if not i.isdigit()])

            summary_row = {
                'subject': person,
                'Activity': action,
                
            }

            # Iterate through all CSV files in the folder
            for file_name in os.listdir(folder_path):
                if file_name.endswith(".csv"):
                    file_path = os.path.join(folder_path, file_name)
                    try:
                        # Read the CSV file
                        data = pd.read_csv(file_path)

                        # Skip empty files
                        if data.empty:
                            print(f"Skipping empty file: {file_path}")
                            continue
                        
                        # Rename columns to ensure device compatibility (Apple devices use a different naming convention)
                        if {"X (m/s^2)", "Y (m/s^2)", "Z (m/s^2)"}.issubset(data.columns) and file_name == "Accelerometer.csv":
                            data.rename(columns={
                                "X (m/s^2)": "Acceleration x (m/s^2)",
                                "Y (m/s^2)": "Acceleration y (m/s^2)",
                                "Z (m/s^2)": "Acceleration z (m/s^2)"
                            }, inplace=True)
                        elif {"X (rad/s)", "Y (rad/s)", "Z (rad/s)"}.issubset(data.columns):
                            data.rename(columns={
                                "X (rad/s)": 'Gyroscope x (rad/s)',
                                "Y (rad/s)": 'Gyroscope y (rad/s)',
                                "Z (rad/s)": 'Gyroscope z (rad/s)'
                            }, inplace=True)
                        elif {"X (m/s^2)","Y (m/s^2)","Z (m/s^2)"}.issubset(data.columns):
                            data.rename(columns={
                                "X (m/s^2)": 'Linear Acceleration x (m/s^2)',
                                "Y (m/s^2)": 'Linear Acceleration y (m/s^2)',
                                "Z (m/s^2)": 'Linear Acceleration z (m/s^2)'
                            }, inplace=True)



                        # Determine which type of data (Accelerometer, Gyroscope, or Linear Acceleration)
                        if {'Acceleration x (m/s^2)', 'Acceleration y (m/s^2)', 'Acceleration z (m/s^2)'}.issubset(data.columns):
                            data_type = "Accelerometer"
                            columns = ['Acceleration x (m/s^2)', 'Acceleration y (m/s^2)', 'Acceleration z (m/s^2)']

                        elif {'Gyroscope x (rad/s)', 'Gyroscope y (rad/s)', 'Gyroscope z (rad/s)'}.issubset(data.columns):
                            data_type = "Gyroscope"
                            columns = ['Gyroscope x (rad/s)', 'Gyroscope y (rad/s)', 'Gyroscope z (rad/s)']

                        elif {'Linear Acceleration x (m/s^2)', 'Linear Acceleration y (m/s^2)', 'Linear Acceleration z (m/s^2)'}.issubset(data.columns):
                            data_type = "Linear Acceleration"
                            columns = ['Linear Acceleration x (m/s^2)', 'Linear Acceleration y (m/s^2)', 'Linear Acceleration z (m/s^2)']

                        else:
                            print(f"File {file_path} does not contain recognized column names. Skipping.")
                            continue

                        # Calculate statistics for relevant columns
                        mean_values = data[columns].mean()
                        std_values = data[columns].std()
                        min_values = data[columns].min()
                        max_values = data[columns].max()

                        # add to the summary row
                        for col in columns:
                            summary_row[f'{col}_mean'] = mean_values[col]
                            summary_row[f'{col}_std'] = std_values[col]
                            summary_row[f'{col}_min'] = min_values[col]
                            summary_row[f'{col}_max'] = max_values[col]
                        
                    except Exception as e:
                        print(f"Error reading file {file_path}: {e}")
            summary_data.append(summary_row)
# Create a dataframe from the summary data
if not summary_data:
    print("No valid data found. Summary CSV will not be created.")
else:
    summary_df = pd.DataFrame(summary_data)

    # Save the summarized data to a single CSV file
    output_path = os.path.join(root_folder, "data_total.csv")
    summary_df.to_csv(output_path, index=False)

    print(f"Summary data saved to {output_path}")
    # Report the shape here: summary_df only exists when data was found
    print(summary_df.shape)

Summary data saved to datasets/testSet\data_total.csv
(90, 38)


In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
file_path = 'datasets/testSet/data_total.csv'
data = pd.read_csv(file_path)

# Set your split ratio (e.g., 0.8 for 80% training, 20% testing)
split_ratio = 0.8  # Change this value as desired (0 < split_ratio < 1)

# Split the dataset into train and test sets
train_set, test_set = train_test_split(data, test_size=(1 - split_ratio), random_state=42)

# Save the splits to new CSV files (optional)
train_set.to_csv('datasets/train_set.csv', index=False)
test_set.to_csv('datasets/test_set.csv', index=False)

# Print the sizes of the splits
print(f"Train set size: {train_set.shape}")
print(f"Test set size: {test_set.shape}")


Train set size: (72, 38)
Test set size: (18, 38)
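
With only 90 measurements, a purely random split can leave the activities unevenly represented in the 18 test rows. A possible refinement, not part of the cell above, is to stratify on the label. The sketch below uses a made-up stand-in frame instead of `data_total.csv`; the activity labels and the `dummy_feature` column are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for data_total.csv: 3 activities x 30 recordings, one dummy feature.
activities = ["Casual walking", "Walking while looking at the phone", "Drinking"]
data = pd.DataFrame({
    "Activity": [name for name in activities for _ in range(30)],
    "dummy_feature": range(90),
})

# stratify keeps the class proportions identical in both splits.
train_set, test_set = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data["Activity"]
)

print(test_set["Activity"].value_counts())
```

Each activity then contributes exactly 6 of the 18 test rows. Splitting by `subject` instead would test how well a model generalizes to a person it has never seen, which is a different question.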


```
# Place your comments / conclusions / insight here
```


# 3. Explore the data to gain insights.

Explore the data in any way you can and visualize the results. If you have multiple plots of the same kind of data, put them in one larger plot.

In [None]:
# YOUR CODE HERE 

```
# Place your comments / conclusions / insight here
```


# 4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms

Prepare your data. Is it normalized? Are there outliers? Make a training and a test set.
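
As a starting point for this step, the sketch below standardizes features and flags possible outliers. It is a minimal illustration on synthetic placeholder data, not the group's dataset; the 3-standard-deviation threshold is a common rule of thumb chosen here as an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic placeholder features on very different scales.
X_train = rng.normal(loc=[0.0, 9.8], scale=[0.5, 3.0], size=(72, 2))
X_test = rng.normal(loc=[0.0, 9.8], scale=[0.5, 3.0], size=(18, 2))

# Fit on the training set only, so no test-set statistics leak into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training mean and std

# Flag rows where any scaled feature lies more than 3 std from the mean.
outlier_rows = (np.abs(X_train_scaled) > 3).any(axis=1)
print(f"{outlier_rows.sum()} of {len(X_train)} training rows flagged")
```

The scaled test set will not have a mean of exactly zero, which is expected: it is transformed with the training statistics.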

In [None]:
# YOUR CODE HERE 

```
# Place your comments / conclusions / insight here
```


# 5. Explore many different models and short-list the best ones.

Explore / train and list the top 3 algorithms that score best on this dataset.
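
One way to short-list models on a small dataset is to compare their cross-validated accuracy. The sketch below is an illustration under assumptions: it uses `make_classification` to generate stand-in data with the same shape as the feature table (90 rows, 36 features, 3 classes), and the three candidate models are arbitrary examples, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=90, n_features=36, n_informative=6,
                           n_classes=3, random_state=0)

candidates = {
    # Scale-sensitive models get a scaler inside the pipeline.
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "k-nearest neighbours": make_pipeline(StandardScaler(),
                                          KNeighborsClassifier()),
    "random forest": RandomForestClassifier(random_state=0),
}

# Mean accuracy over 5 stratified folds for each candidate.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so the validation fold never influences the scaling.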

In [None]:
# YOUR CODE HERE 

```
# Place your comments / conclusions / insight here
```


# 6. Fine-tune your models and combine them into a great solution.

Can you get better performance within a model? For example, if you use a KNN classifier, how does it behave when you change K (k=3 vs k=5 vs k=?)? Which parameters can be tuned in the chosen models?
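
For the KNN example, a grid search is one way to compare values of K systematically. The sketch below assumes synthetic stand-in data from `make_classification` and an illustrative grid; the step names `scale` and `knn` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=90, n_features=36, n_informative=6,
                           n_classes=3, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`search.cv_results_` holds the score of every combination, which is useful for plotting accuracy against K instead of only reporting the best value.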

In [None]:
# YOUR CODE HERE 

```
# Place your comments / conclusions / insight here
```


# 7. Present your solution.

Explain why you would choose a specific model.

In [None]:
# YOUR CODE HERE 

```
# Place your comments / conclusions / insight here
```


# 8. Launch, monitor, and maintain your system.

Can you deploy the model?

> NOTE: The app provides the option for remote access, so you are able to get live sensor data from the phone.
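
For live classification, incoming samples have to be turned into the same summary statistics the model was trained on. The sketch below shows one possible way to do that with a fixed-size sliding window. It deliberately leaves out how the samples are fetched from the phone, and the class name `SensorWindow` and the window size are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class SensorWindow:
    """Keeps the most recent samples of one sensor axis and summarises them."""

    def __init__(self, max_samples: int):
        # A bounded deque discards the oldest sample automatically.
        self.samples = deque(maxlen=max_samples)

    def add(self, value: float) -> None:
        self.samples.append(value)

    def is_full(self) -> bool:
        return len(self.samples) == self.samples.maxlen

    def features(self, name: str) -> dict:
        values = list(self.samples)
        return {
            f"{name}_mean": mean(values),
            f"{name}_std": stdev(values),  # sample std (n - 1), like pandas .std()
            f"{name}_min": min(values),
            f"{name}_max": max(values),
        }


# Tiny window for illustration; a real one would span about 10 seconds of samples.
window = SensorWindow(max_samples=4)
for reading in [1.0, 2.0, 3.0, 4.0, 5.0]:
    window.add(reading)

features = window.features("Gyroscope x (rad/s)")
print(features)
```

One window per sensor axis yields the full feature row, which can be passed to the trained model once every window is full.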

# 9. Additional Questions

* Explain the motions you chose to classify.

* Which of these motions is easier/harder to classify and why?

* After your experience, which extra sensor data might help you get a better classifier and why?

* Explain why you think your chosen algorithm outperforms the rest.

* While recording the same motions with the same sensor data, what do you think would help improve the performance of your models?


```
# Place your comments / conclusions / insight here
```
