# Simple NN for Activity Classification
---
This notebook contains _incomplete_ code to train a neural net to predict an activity—running or walking—based on some provided user motion data. 

### The data

The data is a single CSV file of accelerometer and gyroscope readings, collected from users' watches in 10-second intervals. Each data point is tagged as one of two activity types: running or walking. 

> Take a quick look at this data in Trove, where each column is described: [Run or walk data in Trove](https://trove.apple.com/dataset/run_walk_motion/1.0.0). 

### NN model creation

Andy has started the process of creating and training a neural net, but it will be up to you to fix his code, and improve upon it! 

The **goal** of this notebook is for you to gain experience defining neural nets in PyTorch in multiple ways, and to build intuition for which choices lead to better-performing models.

This notebook is broken up into the following NN model creation steps:
>1. Load the data
>2. Create train/test dataloaders
>3. Define a neural network
>4. Train the model
>5. Evaluate the performance of our trained model on a test dataset
>6. Un-mount the Trove data
>7. UX Considerations

Andy has gotten up to step 3, but his code to train a neural net is incomplete. Can you fix Andy's code, and create an NN model of your own that achieves better performance? 

As usual, your tasks will be marked as **TASKS** in markdown and as `## TODO` comments in code. 

> **TASK**: Run the provided code to load the data and create train and test dataloaders. 

In [2]:
## run provided code

# import PyTorch libraries
import torch
from torch.utils.data import DataLoader
torch.manual_seed(0) # reproducibility

# import data libraries
import turitrove as trove
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

---
# Load the Data


The following cells load run/walk activity data from Trove as a DataFrame.

In [4]:
# load data from Trove

TROVE_URI = 'dataset/run_walk_motion@1.0.0'
trove.umount(TROVE_URI)

# check for temp_data dir, if not found, make one and mount the dataset
if not os.path.isdir('temp_data'):
    os.makedirs('temp_data')

activity_dataset = trove.mount(TROVE_URI, 'temp_data')


In [None]:
# make dataframe from csv path

data_path = os.path.join(activity_dataset.raw_file_path, activity_dataset.primary_index['path'][0])
df = pd.read_csv(data_path)
df.head()

### Explore the data

Andy has done some data exploration to see the amount of data and number of users in the provided data.

> **TASK**: Write some code to answer the question: What percentage of this data is tagged as walking vs. running? 

The answer to this question should give you a good idea of whether you have _enough_ data for each class to make accurate predictions. 

In [None]:
# explore data a bit

# how much data?
print('# rows: ', len(df))
print()
# how many users?
print('Unique users: ', len(df['username'].unique()))
# descriptive stats
df.describe()

In [None]:
## TODO: Find the distribution of run/walk data



### Format and save for model training

This cell formats the DataFrame for model training by dropping the columns that are not used as features and moving the target to the last column.

Finally, the formatted data is saved to a local `data/` directory as a binary file, `run_walk_formatted.pkl`, which can be read back in later in the notebook. 
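
As a quick illustration of that save-and-reload round trip, the sketch below pickles a tiny made-up DataFrame to a temporary directory and reads it back. The column values and the file name `toy_formatted.pkl` are assumptions chosen for the example, not part of the run/walk data.

```python
import os
import tempfile

import pandas as pd

# made-up rows standing in for the formatted run/walk data
toy_df = pd.DataFrame({'gyro_x': [0.1, -0.3], 'activity': [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_formatted.pkl')
    toy_df.to_pickle(toy_path)            # write the binary file
    restored = pd.read_pickle(toy_path)   # read it back as a DataFrame

# the round trip keeps both the values and the column dtypes
print(restored.equals(toy_df))
```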

In [None]:
# prep for model training, format data

# date, time, username columns dropped 
df_formatted = df.drop(['date', 'time', 'username'], axis=1)
# put target (activity) column last
column_order = ['wrist', 'acceleration_x', 'acceleration_y', 'acceleration_z', 
                'gyro_x', 'gyro_y', 'gyro_z','activity']
df_formatted = df_formatted.reindex(columns=column_order)
df_formatted.head()

In [None]:
# save as pkl file 

if not os.path.isdir('data'):
    os.makedirs('data')
    
# save to local data/ dir
df_formatted.to_pickle('data/run_walk_formatted.pkl')

### Create a dataset

The following code creates one dataset that loads the formatted, pickled data as **tensors**, where each sample holds the input features and one corresponding target `activity` value.
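
For intuition, a map-style dataset of this kind might look like the sketch below. The real `RunWalkDataset` lives in `helpers.py` and is not shown in this notebook, so the class name `ToyRunWalkDataset`, the choice to pass in a DataFrame rather than a file path, and the `float32` conversion are all assumptions made for illustration.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class ToyRunWalkDataset(Dataset):
    """Hypothetical stand-in for RunWalkDataset (the real class is in helpers.py)."""

    def __init__(self, df):
        # assumption: every column but the last is a feature; the last is the target
        values = torch.tensor(df.values, dtype=torch.float32)
        self.features = values[:, :-1]
        self.targets = values[:, -1:]  # keep shape (N, 1) to match a one-unit output

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # one sample: a feature tensor and its target tensor
        return self.features[idx], self.targets[idx]


# made-up rows: two features plus a 0/1 activity label
toy_df = pd.DataFrame({'acceleration_x': [0.1, -0.4, 0.9],
                       'gyro_x': [1.2, 0.3, -0.7],
                       'activity': [0, 1, 1]})
toy_dataset = ToyRunWalkDataset(toy_df)
toy_features, toy_target = toy_dataset[1]
print(toy_features, toy_target)
```

Implementing `__len__` and `__getitem__` is what lets utilities such as `DataLoader` and `random_split` work with a custom dataset.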

In [None]:
# create dataset

from helpers import RunWalkDataset
run_walk_dataset = RunWalkDataset('data/run_walk_formatted.pkl')

In [None]:
# print out a few (3) samples to see that it looks right
for i in range(3):
    sample = run_walk_dataset[i]
    print()
    print(sample)


### DataLoaders for Train/Test Datasets

DataLoaders let you batch and shuffle data, among other things; they are the standard way to iterate through data when training a PyTorch model.

The code below also _randomly_ splits the single, loaded RunWalkDataset into separate train and test datasets. 
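
To make the batching concrete, here is a small sketch on made-up tensors rather than the run/walk data. `TensorDataset`, the 100-sample size, and the batch size of 32 are chosen only for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# 100 made-up samples with 7 features each and a 0/1 label
toy_x = torch.randn(100, 7)
toy_y = torch.randint(0, 2, (100, 1)).float()
toy_data = TensorDataset(toy_x, toy_y)

# same 80/20 integer arithmetic as the notebook's split
n_train = len(toy_data) * 80 // 100
n_test = len(toy_data) - n_train
toy_train, toy_test = random_split(toy_data, [n_train, n_test])

# iterate once and record the shape of each batch of features
toy_loader = DataLoader(toy_train, batch_size=32)
batch_shapes = [tuple(xb.shape) for xb, yb in toy_loader]
print(batch_shapes)  # [(32, 7), (32, 7), (16, 7)]
```

The last batch is smaller because 80 training samples do not divide evenly into batches of 32.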

> **TASK**: Critique Andy's method for splitting this data (you do not need to change the code).

> In a sentence or two, describe one thing about the split below that is good practice and one thing that is not. 

**Your answer here**: Double-click to edit.

In [None]:
from torch.utils.data import random_split

# split data into train and test sets randomly ~ 80/20

# lengths or # samples in each dataset
split_80 = len(run_walk_dataset)*80//100
split_20 = len(run_walk_dataset) - split_80

# random split
train_dataset, test_dataset = random_split(run_walk_dataset, [split_80, split_20])

# how many samples per batch
batch_size = 64

# train and test loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

---
# Define the Neural Network Architecture

The architecture is responsible for transforming the input features into a single target class value between 0 and 1. 
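
As a rough illustration of that mapping, one linear unit followed by a sigmoid computes $\sigma(w \cdot x + b)$, where $\sigma(z) = 1 / (1 + e^{-z})$. The sketch below does this in plain Python; the weights, bias, and input are made-up numbers, not trained values.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """One linear unit followed by a sigmoid: sigma(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


toy_weights = [0.5, -0.25, 0.1]
toy_bias = 0.0
toy_features = [1.0, 2.0, 0.0]   # w . x = 0.5 - 0.5 + 0.0 = 0.0

prob = predict_probability(toy_features, toy_weights, toy_bias)
print(prob)  # sigmoid(0) = 0.5
```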

> **TASK**: Something is missing from Andy's code. Find out what it is and fix it so that you can train this model and calculate test metrics on it.  


In [None]:
# importing NN modules
import torch.nn as nn
import torch.nn.functional as F

## TODO: Fix Andy's code

class AndyNet(nn.Module):
    
    ## Defines a single-layer NN
    def __init__(self, input_dim, output_dim):
        '''Defines layers of a neural network.
           :param input_dim: Number of input features
           :param output_dim: Number of outputs
         '''
        super(AndyNet, self).__init__()
                
        # define a linear layer, input -> output
        self.fc1 = nn.Linear(input_dim, output_dim)
        
    
    ## Defines the feedforward behavior of the network
    def forward(self, x):
        '''Feedforward behavior of the net.
           :param x: A batch of input features
           :return: A batch of output values; predictions
         '''
        out = self.fc1(x)
        out = self.sigmoid(out) # final output, activation fn to get class probs 
        return out 

In [None]:
# instantiating the simple NN with specified dimensions

input_dim = 7 # input feats
output_dim = 1 # one target value

model = AndyNet(input_dim, output_dim)

# print model layers (from init fn)
model

### Define loss and optimization strategy

The loss function defines what a network tries to minimize in terms of comparing actual versus predicted values. 

In classification tasks, it is common to use a **cross-entropy loss**. Here, since the model outputs only one value (between 0 and 1), we use the *binary* cross-entropy loss, `BCELoss`. 
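
For a single predicted probability $p$ and label $y \in \{0, 1\}$, binary cross-entropy is $-[y \log p + (1 - y) \log(1 - p)]$, and `nn.BCELoss` averages this over the batch by default. The sketch below computes the same quantity by hand on made-up values, to show how confident wrong predictions are penalized.

```python
import math


def binary_cross_entropy(predictions, labels):
    """Mean binary cross-entropy over a batch of probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predictions, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(predictions)


# made-up predictions for labels [1, 0]
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(confident_right, confident_wrong)  # the wrong predictions give a much larger loss
```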

The optimizer defines how a neural network's weights update, as a result of trying to minimize the loss function. 
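
With plain stochastic gradient descent (no momentum, which is the default for `torch.optim.SGD`), each weight moves a small step against its gradient: $w \leftarrow w - \text{lr} \cdot \partial L / \partial w$. A minimal sketch of one such step, using made-up weights and gradients:

```python
def sgd_step(weights, gradients, lr=0.01):
    """Plain SGD update (no momentum): w <- w - lr * grad."""
    return [w - lr * g for w, g in zip(weights, gradients)]


toy_weights = [0.5, -0.2]
toy_gradients = [2.0, -1.0]   # made-up dLoss/dw values
updated = sgd_step(toy_weights, toy_gradients, lr=0.01)
print(updated)  # roughly [0.48, -0.19]
```

A positive gradient pushes the weight down and a negative gradient pushes it up; the learning rate controls the size of the step.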

In [None]:
# loss function (binary cross-entropy for two-class classification)
criterion = nn.BCELoss()

# optimizer (stochastic gradient descent) and learning rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

---
# Train Loop

In the `helpers.py` file, there is code for a training loop that does the following:
* Iterate through the training data in batches provided by the `train_loader` 
* Calculate the loss (binary cross-entropy) and backpropagate to compute the gradient of that loss with respect to each weight
* Update the weights of this NN to decrease the loss
* Return the trained model
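
The steps above could be sketched roughly as follows. The actual `train` function in `helpers.py` is not shown in this notebook, so `train_sketch`, its per-epoch loss printout, and the toy model and data are hypothetical. The toy setup also assumes a logits-based loss, `nn.BCEWithLogitsLoss`, so that a bare linear layer can be used to exercise the loop.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_sketch(model, train_loader, n_epochs, optimizer, criterion):
    """Hypothetical training loop; the real helpers.train may differ."""
    model.train()
    for epoch in range(n_epochs):
        running_loss = 0.0
        for features, targets in train_loader:
            optimizer.zero_grad()                # clear gradients from the last step
            outputs = model(features)            # forward pass
            loss = criterion(outputs, targets)   # compare predictions with targets
            loss.backward()                      # backpropagate to get gradients
            optimizer.step()                     # update the weights
            running_loss += loss.item()
        mean_loss = running_loss / len(train_loader)
        print('Epoch {}: mean batch loss {:.4f}'.format(epoch + 1, mean_loss))
    return model


# toy setup: made-up data and a throwaway model, only to exercise the loop
torch.manual_seed(0)
toy_x = torch.randn(64, 7)
toy_y = (toy_x[:, :1] > 0).float()
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=16)
toy_model = nn.Linear(7, 1)
# assumption: a logits-based loss, so this toy model needs no output activation
toy_criterion = nn.BCEWithLogitsLoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.01)
toy_model = train_sketch(toy_model, toy_loader, 2, toy_optimizer, toy_criterion)
```

Calling `optimizer.zero_grad()` on every batch matters because PyTorch accumulates gradients across `backward()` calls unless they are cleared.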


In [None]:
from helpers import train

# number of epochs - times you iterate through the entire training dataset
n_epochs = 5

# call provided train function with all params
model = train(model, train_loader, n_epochs, optimizer, criterion)

---
# Test the Trained Network

> **TASK**: Record the accuracy that Andy's network gets.

**Andy's test accuracy was**: Your answer here (double-click to edit)
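
The internals of `test_eval` are not shown in this notebook. A common way to count correct predictions from a sigmoid output is to threshold the probability at 0.5, which is what the following sketch assumes; `count_correct` and the values passed to it are hypothetical.

```python
def count_correct(probabilities, labels, threshold=0.5):
    """Count predictions whose thresholded class matches the 0/1 label."""
    correct = 0
    for p, y in zip(probabilities, labels):
        predicted_class = 1 if p >= threshold else 0
        correct += int(predicted_class == y)
    return correct


toy_probs = [0.92, 0.15, 0.60, 0.40]
toy_labels = [1, 0, 0, 1]
toy_correct = count_correct(toy_probs, toy_labels)
print('Toy accuracy: {:.2f}'.format(toy_correct / len(toy_labels)))  # 2 of 4 correct -> 0.50
```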

In [None]:
from helpers import test_eval

# calculate test accuracy with helper function
num_correct = test_eval(model, test_loader, criterion)

print('Test accuracy: {:.6f}\n'.format(num_correct/len(test_dataset)))


---
## An Improved NN

Now it's your turn to improve upon this code. 

> **TASK**: Using the *Sequential* module, create a new NN class that improves upon Andy's solution. For at least 2 experiments, record:
>* Hypothesis: What you think will improve a model's accuracy and why (e.g., changing the number of nodes in a hidden layer)
>* Experiment results: The resultant test accuracy

🏆 Your final experiment should aim for about **98% test accuracy**! 

In [None]:
## TODO: Define, train, and test your own NN, using the Sequential module

## TODO: Note, in markdown or cell comments, which model choices seemed to work best (# epochs, layers, etc.)


In [None]:
## Room for your experiment notes!


## Un-mount your data

When you're totally done with the Trove dataset, un-mount it to clean up this working directory.

> **TASK**: Un-mount the run/walk Trove data.

In [None]:
## TODO: Un-mount Trove data


---
# Further UX Considerations 📝 

At Apple, we are always thinking about the nuances of the user experience for different populations. This section poses a set of questions that ask us to consider inclusive design practices, such as:

* **Failure cases**: What might go wrong with activity classification, and how does the likelihood of failures vary across users?
* **Delight**: What potential impact of a run/walk detection feature are you most excited about?

For any model you are thinking of putting into production or sharing with a larger team, you should critically consider the different tradeoffs and impacts such a trained model could have on different users. 

> **TASK**: In a sentence or a short bullet point, write down at least one potential failure case for run/walk detection and the user impact of that failure. 

> Additionally, share whether or not you would release your model more widely, considering how _big_ this dataset is, how many users it represents, and what useful data might be missing from it.


**Your answer here**: 
