# Unit 3 Assessment

In this assignment, we will focus on sensor data. The dataset contains accelerometer data from cell phones. An accelerometer helps measure the speed and acceleration of a cell phone's movement. Each row represents a single measurement captured along a timeline of 20 time steps (one column per time step). This is a multiclass classification task: predict what type of transportation each measurement (i.e., row) represents based on the accelerometer data.

## Description of Variables

You will use the **movement.csv** data set for this assignment. Each row represents a single measurement. The columns labeled **1** to **20** are the time steps on the timeline (there are 20 time steps, and each time step has only one measurement).

The last column is the target variable. It shows the label (category) of the measurement. Because it is a text-based column, **it must be converted to ordinal values.**
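
The conversion of the text labels described above can be sketched as a mapping from each class name to an integer code. This is a minimal illustration in NumPy on a made-up label list (only `Bus` and `Car` are taken from the data preview; the list itself is an assumption). It mirrors the behavior of scikit-learn's `LabelEncoder`, which sorts the class names and numbers them from 0.

```python
import numpy as np

# Toy labels for illustration only; the real values come from the Target column.
labels = np.array(["Bus", "Car", "Bus", "Car", "Car"])

# np.unique returns the sorted distinct classes and, with return_inverse=True,
# the position of each label within that sorted array (its integer code).
classes, codes = np.unique(labels, return_inverse=True)

print(classes)  # sorted class names
print(codes)    # one integer code per row

# Decoding is an index lookup back into the sorted class array.
decoded = classes[codes]
```

The codes are identifiers rather than magnitudes: a larger code does not mean a "larger" class.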

## Goal

Use the data set **movement.csv** to predict the column called **Target**. The input variables are columns labeled as **1 to 20**. 

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Read and Prepare the Data (1 point)

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

np.random.seed(42)




In [2]:
# Ingest the data
movement = pd.read_csv('movement.csv')

# Review the data
movement.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,12,13,14,15,16,17,18,19,20,Target
0,1.179784,1.179784,0.810629,0.810629,1.041816,1.041816,1.041816,0.453604,0.453604,0.48392,...,0.250571,0.250571,0.250571,0.250571,0.250571,0.250571,0.167502,0.167502,0.167502,Bus
1,1.115912,0.860983,0.860983,0.860983,0.860983,1.020423,1.020423,1.333723,1.333723,1.333723,...,1.582763,0.936744,0.936744,0.936744,0.936744,1.412754,3.283429,3.283429,3.283429,Bus
2,0.5723,0.14704,0.14704,0.14704,0.14704,0.14704,0.14704,0.14704,0.14704,2.662993,...,2.662993,2.662993,2.662993,1.449779,1.449779,1.147295,0.978355,0.978355,0.978355,Bus
3,1.128633,1.128633,1.128633,1.128633,1.128633,3.181596,4.012386,4.012386,1.349989,1.266019,...,1.266019,1.266019,0.492464,0.710132,0.710132,0.251398,0.251398,1.347456,1.347456,Bus
4,0.548065,0.548065,0.548065,1.441688,0.631261,0.631261,0.631261,9.25807,1.495908,0.723835,...,8.073005,8.073005,8.073005,8.073005,1.124158,0.399042,0.399042,0.399042,0.561521,Car


In [3]:
# Confirm the shape of the data
movement.shape

(118, 21)

In [4]:
# Apply label encoder on the Target values for classification
from sklearn.preprocessing import LabelEncoder

lbl_enc = LabelEncoder()
movement['Target'] = lbl_enc.fit_transform(movement['Target'])

In [5]:
movement.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,12,13,14,15,16,17,18,19,20,Target
0,1.179784,1.179784,0.810629,0.810629,1.041816,1.041816,1.041816,0.453604,0.453604,0.48392,...,0.250571,0.250571,0.250571,0.250571,0.250571,0.250571,0.167502,0.167502,0.167502,0
1,1.115912,0.860983,0.860983,0.860983,0.860983,1.020423,1.020423,1.333723,1.333723,1.333723,...,1.582763,0.936744,0.936744,0.936744,0.936744,1.412754,3.283429,3.283429,3.283429,0
2,0.5723,0.14704,0.14704,0.14704,0.14704,0.14704,0.14704,0.14704,0.14704,2.662993,...,2.662993,2.662993,2.662993,1.449779,1.449779,1.147295,0.978355,0.978355,0.978355,0
3,1.128633,1.128633,1.128633,1.128633,1.128633,3.181596,4.012386,4.012386,1.349989,1.266019,...,1.266019,1.266019,0.492464,0.710132,0.710132,0.251398,0.251398,1.347456,1.347456,0
4,0.548065,0.548065,0.548065,1.441688,0.631261,0.631261,0.631261,9.25807,1.495908,0.723835,...,8.073005,8.073005,8.073005,8.073005,1.124158,0.399042,0.399042,0.399042,0.561521,1


In [6]:
y = movement[['Target']]
x = movement.drop('Target', axis=1)

In [7]:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3)

In [8]:
train_x[:1]

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
62,2.314853,4.819996,2.373515,2.373515,2.373515,5.343154,5.234107,5.328037,2.147838,5.013407,5.082002,3.035419,3.068428,1.625365,1.625365,2.398082,4.666547,4.666547,5.222103,5.222103


In [9]:
# Convert input variables to a 2-D array with float data type
train_x = np.array(train_x)
test_x = np.array(test_x)

train_x = train_x.astype(np.float32)
test_x = test_x.astype(np.float32)

In [10]:
train_x[:1]

array([[2.3148527, 4.819996 , 2.373515 , 2.373515 , 2.373515 , 5.343154 ,
        5.234107 , 5.3280373, 2.1478376, 5.0134068, 5.082002 , 3.0354187,
        3.0684278, 1.6253647, 1.6253647, 2.3980818, 4.6665473, 4.6665473,
        5.2221026, 5.2221026]], dtype=float32)

In [11]:
# Recurrent Keras layers expect 3-D input:
# (samples, time steps, features)

train_x = np.reshape(train_x, (train_x.shape[0], train_x.shape[1], 1))
test_x = np.reshape(test_x, (test_x.shape[0], test_x.shape[1], 1))

In [12]:
train_x.shape

(82, 20, 1)
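
The shape printed above reads as (samples, time steps, features per step). A small sketch of the same reshape on a stand-in array of zeros (not the real data) shows that only a trailing axis of length 1 is added:

```python
import numpy as np

# Stand-in for the 82 training rows with 20 time steps each (zeros, not real data).
flat = np.zeros((82, 20), dtype=np.float32)

# Add a trailing feature axis: one accelerometer reading per time step.
seq = flat.reshape(flat.shape[0], flat.shape[1], 1)
print(seq.shape)  # (82, 20, 1)

# np.expand_dims produces the same layout.
same = np.expand_dims(flat, axis=-1)
```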

In [13]:
train_y[:1]

Unnamed: 0,Target
62,4


In [14]:
# Convert target variable to a 2-D array with float data type
train_y = np.array(train_y)
test_y = np.array(test_y)

train_y = train_y.astype(np.float32)
test_y = test_y.astype(np.float32)

In [15]:
train_y[:1]

array([[4.]], dtype=float32)

In [16]:
# train_x and test_x already have 3 dimensions from the earlier reshape,
# so repeating the reshape here leaves them unchanged

train_x = np.reshape(train_x, (train_x.shape[0], train_x.shape[1], 1))
test_x = np.reshape(test_x, (test_x.shape[0], test_x.shape[1], 1))

In [17]:
train_y.shape

(82, 1)

# Find the baseline (0.5 points)

In [18]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")

dummy_clf.fit(train_x, train_y)

In [19]:
from sklearn.metrics import accuracy_score
# Baseline train accuracy
dummy_train_pred = dummy_clf.predict(train_x)

baseline_train_acc = accuracy_score(train_y, dummy_train_pred)

print('Baseline Train Accuracy: {}'.format(baseline_train_acc))

Baseline Train Accuracy: 0.32926829268292684


In [20]:
# Baseline test accuracy
dummy_test_pred = dummy_clf.predict(test_x)

baseline_test_acc = accuracy_score(test_y, dummy_test_pred)

print('Baseline Test Accuracy: {}'.format(baseline_test_acc))

Baseline Test Accuracy: 0.4444444444444444
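
The `most_frequent` strategy used above amounts to always predicting the most common training label. A minimal sketch in plain Python, with a hypothetical helper name and toy labels (not the assignment data):

```python
from collections import Counter

def most_frequent_baseline(train_labels, eval_labels):
    """Accuracy of always predicting the most common training label."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority)
    return majority, hits / len(eval_labels)

# Toy labels for illustration only.
toy_train = [0, 0, 1, 2, 0, 1]
toy_test = [0, 1, 0, 2]

majority, acc = most_frequent_baseline(toy_train, toy_test)
print(majority, acc)
```

Because the majority class is chosen on the training split, its share of the test split can differ, which is why the two baseline accuracies above are not equal.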


# Build a cross-sectional (i.e., a regular) Neural Network model using Keras (with only one hidden layer) (2 points)

In [21]:
NN = keras.models.Sequential()

NN.add(keras.layers.Input(shape=[train_x.shape[1],]))
NN.add(keras.layers.Dense(20, activation='relu'))
NN.add(keras.layers.Dense(5, activation='softmax'))




In [22]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

NN.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

history = NN.fit(train_x, train_y, epochs=30,
                 validation_data=(test_x, test_y),
                 batch_size=20)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [23]:
# Train values
train_scores = NN.evaluate(train_x, train_y, verbose=0)

# Print the values
print(f"Train {NN.metrics_names[0]}: {train_scores[0]:.2f}")
print(f"Train {NN.metrics_names[1]}: {train_scores[1]*100:.2f}%")

Train loss: 0.52
Train accuracy: 85.37%


In [24]:
# Test values

test_scores = NN.evaluate(test_x, test_y, verbose=0)
print(f"Test {NN.metrics_names[0]}: {test_scores[0]:.2f}")
print(f"Test {NN.metrics_names[1]}: {test_scores[1]*100:.2f}%")

Test loss: 0.92
Test accuracy: 66.67%
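
The `sparse_categorical_crossentropy` loss reported above takes integer class labels and the softmax probabilities from the output layer. The following NumPy sketch shows the computation on toy logits; the helper names are illustrative and this is not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(labels, probs):
    # Pick the predicted probability of the true class for each row,
    # then average the negative log-likelihood.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

# Two toy rows with 5 classes; equal logits give a uniform distribution.
logits = np.zeros((2, 5))
labels = np.array([4, 0])

probs = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probs)
print(loss)  # ln(5), about 1.609
```

A loss near ln(5) ≈ 1.61 corresponds to guessing uniformly over the five classes, which gives a reference point for the loss values printed above.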


# Build a deep cross-sectional (i.e., regular) Neural Network model using Keras (with two or more hidden layers) (2 points)

In [25]:
DNN = keras.models.Sequential()

DNN.add(keras.layers.Input(shape=[train_x.shape[1],]))
DNN.add(keras.layers.Dense(20, activation='relu'))
DNN.add(keras.layers.Dense(15, activation='relu'))
DNN.add(keras.layers.Dense(10, activation='relu'))
DNN.add(keras.layers.Dense(5, activation='softmax'))

In [26]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

DNN.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

history = DNN.fit(train_x, train_y, epochs=30,
                  validation_data=(test_x, test_y),
                  batch_size=20)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [27]:
DNN.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_2 (Dense)             (None, 20)                420       
                                                                 
 dense_3 (Dense)             (None, 15)                315       
                                                                 
 dense_4 (Dense)             (None, 10)                160       
                                                                 
 dense_5 (Dense)             (None, 5)                 55        
                                                                 
Total params: 950 (3.71 KB)
Trainable params: 950 (3.71 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
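
The parameter counts in the summary above follow from the layer widths: each dense layer has one weight per input-output pair plus one bias per output unit. A short check, using a hypothetical helper name:

```python
def dense_params(n_in, n_out):
    # One weight per input-output pair plus one bias per output unit.
    return n_in * n_out + n_out

# Layer widths from the model above: 20 inputs -> 20 -> 15 -> 10 -> 5.
widths = [20, 20, 15, 10, 5]
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]

print(per_layer)       # [420, 315, 160, 55]
print(sum(per_layer))  # 950
```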


In [28]:
# Train values
train_scores = DNN.evaluate(train_x, train_y, verbose=0)

# Print the values
print(f"Train {DNN.metrics_names[0]}: {train_scores[0]:.2f}")
print(f"Train {DNN.metrics_names[1]}: {train_scores[1]*100:.2f}%")

Train loss: 0.48
Train accuracy: 81.71%


In [29]:
# Test values
test_scores = DNN.evaluate(test_x, test_y, verbose=0)
print(f"Test {DNN.metrics_names[0]}: {test_scores[0]:.2f}")
print(f"Test {DNN.metrics_names[1]}: {test_scores[1]*100:.2f}%")

Test loss: 0.84
Test accuracy: 61.11%


# Build an LSTM Model (with only one layer) (2 points)

In [30]:
n_steps = 20
n_inputs = 1

LTSM = keras.models.Sequential([
    keras.layers.LSTM(20, input_shape=[n_steps, n_inputs]),
    keras.layers.Dense(5, activation='softmax')
])

In [31]:
from tensorflow.keras.callbacks import EarlyStopping

earlystop = EarlyStopping(monitor='loss', patience=5, verbose=1, mode='auto')

callback = [earlystop]
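
The `patience=5` setting above means training stops once the monitored value has failed to improve for five consecutive epochs. A simplified sketch of that rule in plain Python (it ignores options such as `min_delta`; the function name and loss curve are made up for illustration):

```python
def early_stop_epoch(losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up loss curve: best value at epoch 3, then five epochs without improvement.
toy_losses = [1.0, 0.9, 0.8, 0.85, 0.82, 0.81, 0.9, 0.95, 1.0]
print(early_stop_epoch(toy_losses, patience=5))  # 8
```

Because `monitor='loss'` tracks the training loss rather than `val_loss`, the callback reacts to a stalled fit, not to overfitting on the validation data.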

In [32]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = keras.optimizers.Adam(learning_rate=0.01)

LTSM.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

history = LTSM.fit(train_x, train_y, epochs=20,
                   validation_data=(test_x, test_y),
                   batch_size=20,
                   callbacks=callback)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [33]:
# evaluate the model

scores = LTSM.evaluate(test_x, test_y, verbose=0)
print("%s: %.2f" % (LTSM.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (LTSM.metrics_names[1], scores[1]*100))

loss: 0.69
accuracy: 63.89%


# Build a deep LSTM Model (with only two layers) (2 points)

In [34]:
n_steps = 20
n_inputs = 1

DLSTM = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[n_steps, n_inputs]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.LSTM(20),
    keras.layers.Dense(5, activation='softmax'),
])
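
In the stacked model above, every recurrent layer except the last sets `return_sequences=True` so that the next layer receives one vector per time step. The sketch below uses a plain tanh recurrence in NumPy (deliberately not an LSTM, with random stand-in weights) only to show how the output shape changes:

```python
import numpy as np

def simple_rnn(x, units, return_sequences, seed=0):
    """Plain tanh recurrence (not an LSTM), used only to show output shapes."""
    rng = np.random.default_rng(seed)
    batch, steps, features = x.shape
    w_in = rng.normal(size=(features, units))
    w_rec = rng.normal(size=(units, units))
    h = np.zeros((batch, units))
    outputs = []
    for t in range(steps):
        # New hidden state from the current input and the previous state.
        h = np.tanh(x[:, t, :] @ w_in + h @ w_rec)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)  # one state per time step
    return h  # only the final state

x = np.zeros((4, 20, 1))  # toy batch: 4 sequences, 20 steps, 1 feature
all_steps = simple_rnn(x, units=20, return_sequences=True)
last_step = simple_rnn(all_steps, units=20, return_sequences=False)
print(all_steps.shape, last_step.shape)  # (4, 20, 20) (4, 20)
```

The final recurrent layer returns only its last state, which is the 2-D input the dense softmax layer expects.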

In [35]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = keras.optimizers.Adam(learning_rate=0.01)

DLSTM.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

history = DLSTM.fit(train_x, train_y, epochs=20,
                    batch_size=20,
                    validation_data=(test_x, test_y),
                    callbacks=callback)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [36]:
# evaluate the model

scores = DLSTM.evaluate(test_x, test_y, verbose=0)
print("%s: %.2f" % (DLSTM.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (DLSTM.metrics_names[1], scores[1]*100))

loss: 0.65
accuracy: 72.22%


# Build a GRU Model (with only one layer) (2 points)

In [37]:
n_steps = 20
n_inputs = 1

GRU = keras.models.Sequential([
    keras.layers.GRU(20, input_shape=[n_steps, n_inputs]),
    keras.layers.Dense(5, activation='softmax')
])
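
A GRU layer of the same width is smaller than its LSTM counterpart because it has three weight sets instead of four. The arithmetic below is a sketch that assumes the Keras defaults in TensorFlow 2 (biases enabled, `reset_after=True` for GRU); the helper names are illustrative.

```python
def lstm_params(n_inputs, units):
    # Four weight sets (input, forget, and output gates plus the cell candidate),
    # each with input weights, recurrent weights, and a bias.
    return 4 * (n_inputs * units + units * units + units)

def gru_params(n_inputs, units):
    # Three weight sets (update gate, reset gate, candidate state). With
    # reset_after=True, each set carries two bias vectors.
    return 3 * (n_inputs * units + units * units + 2 * units)

print(lstm_params(1, 20))  # 1760
print(gru_params(1, 20))   # 1380
```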

In [38]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = keras.optimizers.Adam(learning_rate=0.01)

GRU.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

history = GRU.fit(train_x, train_y, epochs=20,
                  validation_data=(test_x, test_y),
                  batch_size=20,
                  callbacks=callback)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [39]:
# evaluate the model

scores = GRU.evaluate(test_x, test_y, verbose=0)
print("%s: %.2f" % (GRU.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (GRU.metrics_names[1], scores[1]*100))

loss: 0.60
accuracy: 69.44%


# Build a deep GRU Model (with only two layers) (2 points)

In [40]:
n_steps = 20
n_inputs = 1

DGRU = keras.models.Sequential([
    keras.layers.GRU(20, return_sequences=True, input_shape=[n_steps, n_inputs]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.GRU(20),
    keras.layers.Dense(5, activation='softmax')
])

In [41]:
np.random.seed(42)
tf.random.set_seed(42)

optimizer = keras.optimizers.Adam(learning_rate=0.01)

DGRU.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])

history = DGRU.fit(train_x, train_y, epochs=20,
                   validation_data=(test_x, test_y),
                   batch_size=20,
                   callbacks=callback)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [42]:
# evaluate the model

scores = DGRU.evaluate(test_x, test_y, verbose=0)
print("%s: %.2f" % (DGRU.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (DGRU.metrics_names[1], scores[1]*100))

loss: 0.57
accuracy: 72.22%


# Discussion

## List the test values of each model you built (0.5 points)
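
The test metrics printed in the evaluation cells above can be gathered in one place. This is a minimal sketch with the numbers copied by hand from those outputs; the dictionary keys are descriptive labels chosen here, not the variable names used in the notebook.

```python
# Test loss and accuracy (%) copied from the evaluation outputs above.
results = {
    "NN":        (0.92, 66.67),
    "Deep NN":   (0.84, 61.11),
    "LSTM":      (0.69, 63.89),
    "Deep LSTM": (0.65, 72.22),
    "GRU":       (0.60, 69.44),
    "Deep GRU":  (0.57, 72.22),
}
baseline_test_acc = 44.44  # most-frequent baseline on the test split, in %

best_acc = max(acc for _, acc in results.values())
best_models = [name for name, (_, acc) in results.items() if acc == best_acc]

for name, (loss, acc) in results.items():
    gain = acc - baseline_test_acc
    print(f"{name:<10} loss={loss:.2f}  acc={acc:.2f}%  vs baseline={gain:+.2f} pts")
print("Highest test accuracy:", best_models)
```

Two models tie on test accuracy, and the deep GRU has the lower test loss of the pair. With 36 test rows, a single row is worth about 2.8 percentage points, so small gaps between models should be read with caution.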

## Which model performs the best and why? (0.5 points) 

## How does it compare to baseline? (0.5 points)

# Extra credit: 2 points

The dataset is very small, so your test values are likely unreliable. Use your best model and run 10-fold cross-validation on it. Then find and report the mean accuracy score.

Note: to be eligible for this extra credit, you should run your 10-fold cross-validation on the unsplit data.

In [43]:
# Convert data to appropriate dimension array for keras
y = movement['Target'].values
x = movement.drop('Target', axis=1).values

In [44]:
x = x.reshape(-1, n_steps, n_inputs)

In [45]:
from sklearn.model_selection import StratifiedKFold

accuracy_scores = []

# Initialize StratifiedKFold
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

In [46]:
# Iterate through folds
for train_index, test_index in kf.split(x, y):
    train_x, test_x = x[train_index], x[test_index]
    train_y, test_y = y[train_index], y[test_index]

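    # Train the model. Note: DLSTM is the same object that was trained earlier,
    # so its weights carry over between folds instead of starting fresh, and
    # the fold scores are therefore likely optimistic.
    history = DLSTM.fit(train_x, train_y, epochs=20, batch_size=12, callbacks=callback, verbose=0)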
    
    # Evaluate the model on the test set
    y_pred = DLSTM.predict(test_x)
    y_pred_classes = np.argmax(y_pred, axis=1)
    
    # Calculate accuracy for this fold
    accuracy = accuracy_score(test_y, y_pred_classes)
    accuracy_scores.append(accuracy)

Epoch 13: early stopping
Epoch 14: early stopping
Epoch 10: early stopping
Epoch 10: early stopping
Epoch 18: early stopping
Epoch 7: early stopping
Epoch 8: early stopping


In [47]:
# Calculate the mean accuracy score
mean_accuracy = np.mean(accuracy_scores)

In [48]:
accuracy_scores

[0.5833333333333334,
 0.75,
 0.6666666666666666,
 0.75,
 0.75,
 0.75,
 0.8333333333333334,
 0.8333333333333334,
 0.6363636363636364,
 0.9090909090909091]

In [49]:
# Report the mean accuracy score
print("Mean Accuracy Score:", mean_accuracy)

Mean Accuracy Score: 0.7462121212121213
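
In the loop above, `DLSTM.fit` is called on the same model object in every fold, so rows held out in one fold have already been used for training in another. A common pattern is to create a new, untrained model inside the loop. The sketch below shows that pattern in NumPy only, under stated assumptions: `stratified_folds` is a hypothetical helper, `CentroidModel` is a stand-in for a freshly built Keras model, and the data is synthetic.

```python
import numpy as np

def stratified_folds(y, n_splits, seed=42):
    """Hypothetical helper: deal each class's shuffled rows round-robin into folds."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        for i, row in enumerate(idx):
            folds[i % n_splits].append(int(row))
    return [np.array(sorted(f)) for f in folds]

class CentroidModel:
    """Stand-in classifier (nearest class mean); a Keras model would go here."""
    def fit(self, x, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([x[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, x):
        dists = ((x[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[dists.argmin(axis=1)]

def cross_validate(make_model, x, y, n_splits=10):
    scores = []
    all_rows = np.arange(len(y))
    for test_idx in stratified_folds(y, n_splits):
        train_idx = np.setdiff1d(all_rows, test_idx)
        model = make_model()  # fresh, untrained model for every fold
        model.fit(x[train_idx], y[train_idx])
        scores.append(float((model.predict(x[test_idx]) == y[test_idx]).mean()))
    return scores

# Synthetic, well-separated data (not the assignment data): 2 classes, 20 rows each.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 0.1, (20, 20)), rng.normal(5, 0.1, (20, 20))])
y = np.array([0] * 20 + [1] * 20)

scores = cross_validate(CentroidModel, x, y, n_splits=10)
print(np.mean(scores))
```

With a Keras model, `make_model` would be a function that builds and compiles the network from scratch, so that no weights are shared between folds.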
