# WebApp

In this notebook, I'll be preparing the models required to run the web application built in Flask.

The purpose of the webapp is to classify the movement being made, based on phone sensor data. We'll be loading in a CSV of new data and seeing how our ML model performs on it.

---

### Data

In the dataset we've been working on, the data was collected from a Samsung Galaxy S II worn on the participants' waists, at a sampling rate of 50Hz. I'll be replicating this experiment, collecting my own data, and running it through the ML model that's running on the web app.

To replicate this experiment properly, there are a few things that we'll have to do.

Since the experiment's cleaned, feature-engineered data is so complicated, with 561 features, I'll be using the raw data to train the model.
A time series model uses the previous n records to make each prediction, so we'll need to arrange the data that way before training the model.
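
As a rough illustration of what "using the previous n records" can look like, here is a minimal sketch that stacks consecutive rows into fixed-size windows with NumPy (version 1.20 or later). The helper name `make_windows`, the window size and the choice to label each window by its last row are assumptions made for this example, not something the notebook has settled on.

```python
import numpy as np

def make_windows(values, labels, window_size):
    """Stack `window_size` consecutive rows into one sample, labelled by the window's last row."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    # sliding_window_view returns shape (n_windows, n_features, window_size)
    windows = np.lib.stride_tricks.sliding_window_view(values, window_size, axis=0)
    # reorder to (n_windows, window_size, n_features)
    windows = windows.transpose(0, 2, 1)
    window_labels = labels[window_size - 1:]
    return windows, window_labels

signal = np.arange(20, dtype=float).reshape(10, 2)  # 10 rows, 2 toy features
activity = np.array([1] * 5 + [2] * 5)              # toy activity labels
X, y = make_windows(signal, activity, window_size=4)
print(X.shape, y.shape)
```

Each window overlaps the previous one by all but one row, so the number of samples is the number of rows minus `window_size` plus one.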

Also, the app I'll be using collects samples at a rate of 100Hz, so I'll need to downsample my data to 50Hz with the pandas `resample` method.
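
A minimal sketch of that downsampling step, assuming the recording has a timestamp index with one sample every 10 ms. The column names, the synthetic values and the choice to average each pair of readings are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 100Hz recording: one sample every 10 ms for 2 seconds.
timestamps = pd.date_range("2021-01-01", periods=200, freq="10ms")
recording = pd.DataFrame(
    {"acc_x": np.linspace(0.0, 1.0, 200), "acc_y": np.zeros(200)},
    index=timestamps,
)

# 50Hz means one sample every 20 ms; average the two readings in each bin.
downsampled = recording.resample("20ms").mean()
print(len(recording), "->", len(downsampled))
```

Taking the mean of each bin smooths the signal slightly; keeping the first reading of each bin with `.first()` would be another reasonable choice.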

We can start off by importing the required libraries for this analysis.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import joblib
import glob

Now, let's load in the raw data.

In [2]:
# instantiate empty dataframes
acc_features = ['acc_x','acc_y','acc_z','activity']
gyro_features = ['gyro_x','gyro_y','gyro_z','activity']
accelerometer = pd.DataFrame(columns = acc_features)
gyroscope = pd.DataFrame(columns = gyro_features)

We can use the `glob` function to get the file names in the raw data folder, in order to read the files and add them to the empty dataframes.

In [3]:
# glob.glob already returns a list of matching paths
files = glob.glob('data/RawData/*.txt')

In [4]:
files.sort()
files.pop() # getting rid of the activity labels file (labels.txt sorts last)
files

['data/RawData/acc_exp01_user01.txt',
 'data/RawData/acc_exp02_user01.txt',
 'data/RawData/acc_exp03_user02.txt',
 'data/RawData/acc_exp04_user02.txt',
 'data/RawData/acc_exp05_user03.txt',
 'data/RawData/acc_exp06_user03.txt',
 'data/RawData/acc_exp07_user04.txt',
 'data/RawData/acc_exp08_user04.txt',
 'data/RawData/acc_exp09_user05.txt',
 'data/RawData/acc_exp10_user05.txt',
 'data/RawData/acc_exp11_user06.txt',
 'data/RawData/acc_exp12_user06.txt',
 'data/RawData/acc_exp13_user07.txt',
 'data/RawData/acc_exp14_user07.txt',
 'data/RawData/acc_exp15_user08.txt',
 'data/RawData/acc_exp16_user08.txt',
 'data/RawData/acc_exp17_user09.txt',
 'data/RawData/acc_exp18_user09.txt',
 'data/RawData/acc_exp19_user10.txt',
 'data/RawData/acc_exp20_user10.txt',
 'data/RawData/acc_exp21_user10.txt',
 'data/RawData/acc_exp22_user11.txt',
 'data/RawData/acc_exp23_user11.txt',
 'data/RawData/acc_exp24_user12.txt',
 'data/RawData/acc_exp25_user12.txt',
 'data/RawData/acc_exp26_user13.txt',
 'data/RawDa

Now, we can create new dataframes with each of the files and add the associated activity labels to each of them.

In [5]:
activities_columns = ['experiment','user','activity','row_start','row_end']
activities = pd.read_csv('data/RawData/labels.txt', sep=' ', names = activities_columns)
activities

Unnamed: 0,experiment,user,activity,row_start,row_end
0,1,1,5,250,1232
1,1,1,7,1233,1392
2,1,1,4,1393,2194
3,1,1,8,2195,2359
4,1,1,5,2360,3374
...,...,...,...,...,...
1209,61,30,2,13842,14574
1210,61,30,3,14751,15427
1211,61,30,2,15588,16319
1212,61,30,3,16546,17250


When we create the dataframes, we'll have to loop through each file and attach the matching activities. For each file, we find the rows of the `activities` dataframe with the same experiment and user, then use their `row_start`, `row_end` and `activity` values to label the corresponding rows.

In [6]:
# This for loop shows how to extract the user number and experiment number from the file name
for i in files:
    if 'acc' in i:
        print(f'acc | experiment:{i[20:22]}  user: {i[-6:-4]}')
    if 'gyro' in i:
        print(f'gyro | experiment:{i[21:23]}  user: {i[-6:-4]}')

acc | experiment:01  user: 01
acc | experiment:02  user: 01
acc | experiment:03  user: 02
acc | experiment:04  user: 02
acc | experiment:05  user: 03
acc | experiment:06  user: 03
acc | experiment:07  user: 04
acc | experiment:08  user: 04
acc | experiment:09  user: 05
acc | experiment:10  user: 05
acc | experiment:11  user: 06
acc | experiment:12  user: 06
acc | experiment:13  user: 07
acc | experiment:14  user: 07
acc | experiment:15  user: 08
acc | experiment:16  user: 08
acc | experiment:17  user: 09
acc | experiment:18  user: 09
acc | experiment:19  user: 10
acc | experiment:20  user: 10
acc | experiment:21  user: 10
acc | experiment:22  user: 11
acc | experiment:23  user: 11
acc | experiment:24  user: 12
acc | experiment:25  user: 12
acc | experiment:26  user: 13
acc | experiment:27  user: 13
acc | experiment:28  user: 14
acc | experiment:29  user: 14
acc | experiment:30  user: 15
acc | experiment:31  user: 15
acc | experiment:32  user: 16
acc | experiment:33  user: 16
acc | expe
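
The fixed slice positions above depend on the exact folder path. As an alternative sketch, a regular expression can pull the same numbers out of the file name alone. The pattern assumes names of the form `acc_expNN_userNN.txt` and `gyro_expNN_userNN.txt`, and `parse_sensor_file` is a hypothetical helper, not part of the notebook.

```python
import re
from pathlib import Path

# Assumed naming scheme: <sensor>_exp<digits>_user<digits>.txt
FILE_PATTERN = re.compile(r"^(acc|gyro)_exp(\d+)_user(\d+)\.txt$")

def parse_sensor_file(path):
    """Return (sensor, experiment, user), or None if the name does not match."""
    match = FILE_PATTERN.match(Path(path).name)
    if match is None:
        return None
    sensor, experiment, user = match.groups()
    return sensor, int(experiment), int(user)

print(parse_sensor_file("data/RawData/acc_exp01_user01.txt"))
print(parse_sensor_file("data/RawData/labels.txt"))
```

Because non-matching names return `None`, the labels file can be skipped without relying on its position in the sorted list.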

Now, we'll create individual dataframes for each of the files and add the activity labels.

It sounds simple, but there's a level of complexity to it. Let me explain.

First, we'll start off by instantiating our `main` sensor dataframes. These will hold the data from all the files combined. We'll also instantiate some empty lists, which we'll use to check that the gyroscope and accelerometer files are loaded in the same sequential order and are compatible for concatenation.

The code below will loop through all the file names obtained above in the `files` variable, and if the name contains `acc` or `gyro`, we'll add the file data to their respective dataframes. During this process, we'll also get some information about the file, such as the experiment number, user id, number of rows and columns for each file. The purpose of this is to match each experiment, user and number of rows for both the accelerometer and gyroscope data, so that when we combine both dataframes everything will be in sync, as long as there are no row discrepancies.

**Note:** file data is stored in temp_df

When looping through the files, we'll check the experiment number and user id against the activity label dataframe in order to obtain the rows associated with each label. There will be more than one activity label for each experiment, so we get all the indices of these labels and loop through them to find the correct rows to label in our file data.

Then, after all the data is labeled correctly, we'll add the file data into the main dataframes.
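
Before the full loop, here is the labelling idea on a toy example. The frames below are made-up stand-ins for one sensor file and its matching rows in the labels table; only the `row_start`, `row_end` and `activity` column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one sensor file, with an empty label column.
readings = pd.DataFrame({"acc_x": np.arange(10, dtype=float)})
readings["activity"] = np.nan

# Toy stand-in for the label rows that match this file.
segments = pd.DataFrame(
    {"activity": [5, 7], "row_start": [2, 6], "row_end": [4, 7]}
)

# .loc slicing includes both endpoints, so row_end itself gets labelled.
for segment in segments.itertuples(index=False):
    readings.loc[segment.row_start:segment.row_end, "activity"] = segment.activity

print(readings)
```

Rows that fall outside every segment keep their `NaN` label, which is where the missing activity values seen later come from.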

In [7]:
# creating column names
acc_features = ['acc_x','acc_y','acc_z','acc_activity','acc_user','acc_experiment']
gyro_features = ['gyro_x','gyro_y','gyro_z','gyro_activity','gyro_user','gyro_experiment']

# instantiate empty dataframe
accelerometer = pd.DataFrame(columns = acc_features)
gyroscope = pd.DataFrame(columns = gyro_features)

# instantiate empty lists for file comparison
acc_file_params = []
gyro_file_params = []

# looping through each file in folder
for i in files:
    
    # if accelerometer experiment
    if 'acc' in i:
        
        # extract user and experiment numbers
        experiment = int(i[20:22])
        user = int(i[-6:-4])
        
        # read in the file
        temp_df = pd.read_csv(i, sep = ' ', names = acc_features)
        
        #gather some parameters for file comparison
        temp_array = []
        temp_array.append(experiment)
        temp_array.append(user)
        temp_array.append(temp_df.shape[0])
        temp_array.append(temp_df.shape[1])
        
        #add parameters per file to main params list
        acc_file_params.append(temp_array)
        
        # find the rows for the activity label
        # by checking the user and experiment number
        exp_check = (activities['experiment'] == experiment)
        user_check = (activities['user'] == user)
        print(f'acc | experiment:{i[20:22]}  user: {i[-6:-4]}')
        
        #grab the indices of the activity label DF
        temp_index = activities[(exp_check) & (user_check)].index
        
        # loop through the indices of the activity label DF
        # to get the row numbers for each activity, and add the labels to the temp_df
        for j in temp_index:
            start = activities.loc[j, 'row_start']
            end = activities.loc[j, 'row_end']
            activity = activities.loc[j, 'activity']
            
            # assign through a single .loc call to avoid chained indexing
            temp_df.loc[start:end, 'acc_activity'] = activity
            temp_df.loc[start:end, 'acc_user'] = user
            temp_df.loc[start:end, 'acc_experiment'] = experiment
            
            # concatenate the temp_df to our main accelerometer df
            # note: this runs once per activity label, so each file is appended
            # once per label and the row counts reported later include those repeats
            accelerometer = pd.concat([accelerometer, temp_df], axis = 0)
        

    # if gyroscope experiment
    elif 'gyro' in i:
        
        # extract user and experiment numbers
        experiment = int(i[21:23])
        user = int(i[-6:-4])
        
        # read in the file
        temp_df = pd.read_csv(i, sep = ' ', names = gyro_features)
        
        #gather some parameters for file comparison
        temp_array = []
        temp_array.append(experiment)
        temp_array.append(user)
        temp_array.append(temp_df.shape[0])
        temp_array.append(temp_df.shape[1])
        
        #add parameters per file to main params list
        gyro_file_params.append(temp_array)
        
        # find the rows for the activity label
        # by checking the user and experiment number
        exp_check = (activities['experiment'] == experiment)
        user_check = (activities['user'] == user)
        print(f'gyro | experiment:{i[21:23]}  user: {i[-6:-4]}')
        
        #grab the indices of the activity label DF
        temp_index = activities[(exp_check) & (user_check)].index
        
        # loop through the indices of the activity label DF
        # to get the row numbers for each activity, and add the labels to the temp_df
        for j in temp_index:
            start = activities.loc[j, 'row_start']
            end = activities.loc[j, 'row_end']
            activity = activities.loc[j, 'activity']

            # assign through a single .loc call to avoid chained indexing
            temp_df.loc[start:end, 'gyro_activity'] = activity
            temp_df.loc[start:end, 'gyro_user'] = user
            temp_df.loc[start:end, 'gyro_experiment'] = experiment

            # concatenate the temp_df to our main gyroscope df
            # note: this runs once per activity label, so each file is appended
            # once per label and the row counts reported later include those repeats
            gyroscope = pd.concat([gyroscope, temp_df], axis = 0)
print('complete')

acc | experiment:01  user: 01
acc | experiment:02  user: 01
acc | experiment:03  user: 02
acc | experiment:04  user: 02
acc | experiment:05  user: 03
acc | experiment:06  user: 03
acc | experiment:07  user: 04
acc | experiment:08  user: 04
acc | experiment:09  user: 05
acc | experiment:10  user: 05
acc | experiment:11  user: 06
acc | experiment:12  user: 06
acc | experiment:13  user: 07
acc | experiment:14  user: 07
acc | experiment:15  user: 08
acc | experiment:16  user: 08
acc | experiment:17  user: 09
acc | experiment:18  user: 09
acc | experiment:19  user: 10
acc | experiment:20  user: 10
acc | experiment:21  user: 10
acc | experiment:22  user: 11
acc | experiment:23  user: 11
acc | experiment:24  user: 12
acc | experiment:25  user: 12
acc | experiment:26  user: 13
acc | experiment:27  user: 13
acc | experiment:28  user: 14
acc | experiment:29  user: 14
acc | experiment:30  user: 15
acc | experiment:31  user: 15
acc | experiment:32  user: 16
acc | experiment:33  user: 16
acc | expe

Wow, that was a lot of code, and a lot to take in. Let's double check our work!

Remember our comparison lists? Let's take a look to see that all the files had the same number of rows and columns for their respective users and experiments.

In [8]:
acc_file_params == gyro_file_params

True

In [9]:
acc_file_params[:10]

[[1, 1, 20598, 6],
 [2, 1, 19286, 6],
 [3, 2, 18026, 6],
 [4, 2, 16565, 6],
 [5, 3, 20994, 6],
 [6, 3, 17493, 6],
 [7, 4, 17668, 6],
 [8, 4, 15888, 6],
 [9, 5, 16864, 6],
 [10, 5, 15038, 6]]

In [10]:
gyro_file_params[:10]

[[1, 1, 20598, 6],
 [2, 1, 19286, 6],
 [3, 2, 18026, 6],
 [4, 2, 16565, 6],
 [5, 3, 20994, 6],
 [6, 3, 17493, 6],
 [7, 4, 17668, 6],
 [8, 4, 15888, 6],
 [9, 5, 16864, 6],
 [10, 5, 15038, 6]]

Everything looks like it's lining up properly. Let's now take a look at the final dataframes before we concatenate everything.

In [11]:
accelerometer.sample(5, random_state = 42)

Unnamed: 0,acc_x,acc_y,acc_z,acc_activity,acc_user,acc_experiment
8461,0.034722,0.783333,0.629167,6.0,30.0,61.0
2190,0.993056,0.120833,0.226389,4.0,29.0,59.0
16287,-0.45,0.759722,0.506944,,,
6391,1.015278,-0.105556,-0.094444,,,
8643,0.920833,-0.418056,-0.104167,1.0,1.0,1.0


In [12]:
gyroscope.sample(5, random_state = 42)

Unnamed: 0,gyro_x,gyro_y,gyro_z,gyro_activity,gyro_user,gyro_experiment
8461,0.00336,0.002749,-0.001222,6.0,30.0,61.0
2190,0.002749,0.008858,-0.005498,4.0,29.0,59.0
16287,-0.071166,-0.029016,0.069944,,,
6391,-0.001527,0.008858,-0.003054,,,
8643,0.743423,1.408655,-0.309709,1.0,1.0,1.0


There seem to be some missing values to take care of, but overall, the columns look like they've been filled properly.

In [13]:
print(f'The accelerometer dataframe has {accelerometer.shape[0]} rows and {accelerometer.shape[1]} columns')
print(f'The gyroscope dataframe has {gyroscope.shape[0]} rows and {gyroscope.shape[1]} columns')

The accelerometer dataframe has 22496259 rows and 6 columns
The gyroscope dataframe has 22496259 rows and 6 columns


Both dataframes have the same number of rows, which suggests that everything is ready to be combined.
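
Matching row counts are a good sign but not a guarantee that the rows line up. A slightly stricter check, sketched here on toy frames, compares the indices and the bookkeeping columns directly. The helper `check_aligned` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def check_aligned(left, right, key_pairs):
    """Return True if both frames share an index and agree on every key pair."""
    if not left.index.equals(right.index):
        return False
    for left_key, right_key in key_pairs:
        left_values = left[left_key].to_numpy(dtype=float)
        right_values = right[right_key].to_numpy(dtype=float)
        # equal_nan=True treats unlabelled (NaN) rows in the same place as matching
        if not np.array_equal(left_values, right_values, equal_nan=True):
            return False
    return True

acc = pd.DataFrame({"acc_x": [0.1, 0.2, 0.3], "acc_user": [1.0, 1.0, np.nan]})
gyro = pd.DataFrame({"gyro_x": [1.1, 1.2, 1.3], "gyro_user": [1.0, 1.0, np.nan]})
print(check_aligned(acc, gyro, [("acc_user", "gyro_user")]))
```

On the real data, the key pairs would be the user, experiment and activity columns of the two sensor dataframes.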

In [14]:
data = pd.concat((gyroscope, accelerometer), axis=1)

Perfect, let's see if everything concatenated correctly.

In [15]:
data.shape

(22496259, 12)

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22496259 entries, 0 to 19081
Data columns (total 12 columns):
 #   Column           Dtype  
---  ------           -----  
 0   gyro_x           float64
 1   gyro_y           float64
 2   gyro_z           float64
 3   gyro_activity    float64
 4   gyro_user        float64
 5   gyro_experiment  float64
 6   acc_x            float64
 7   acc_y            float64
 8   acc_z            float64
 9   acc_activity     float64
 10  acc_user         float64
 11  acc_experiment   float64
dtypes: float64(12)
memory usage: 2.2 GB


Nice! There are actually a few more things we can do to slim our data down a little.

The user and experiment columns aren't necessary for our analysis; these columns were only added to check the compatibility of the individual sensor dataframes.

The `acc_activity` and `gyro_activity` columns can also be combined into one column, so we're not repeating information. Then, we should have a total of 7 columns.

We can also remove any rows with missing activity labels, since these will not add any insights to our analysis. Duplicate values should be fine because movement patterns or static movements will most likely have some duplicate records over time.
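
As a small sketch of the row removal described above, the toy frame below drops every row whose activity label is missing and checks the result against the expected count. The values are invented; only the column names follow this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gyro_x": [0.1, 0.2, 0.3, 0.4],
    "gyro_activity": [5.0, np.nan, 5.0, np.nan],
    "acc_x": [1.0, 1.1, 1.2, 1.3],
    "acc_activity": [5.0, np.nan, 5.0, np.nan],
})

rows_before = len(toy)
missing = int(toy["gyro_activity"].isna().sum())

# subset= limits the check to the label columns, so sensor readings are never the reason a row is dropped
labelled = toy.dropna(subset=["gyro_activity", "acc_activity"])
print(rows_before - missing, len(labelled))
```

Computing the expected count from the dataframe itself avoids typing the totals in by hand.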

In [23]:
#removing user and experiment columns
# errors='ignore' lets this cell be re-run after the columns are already gone
data = data.drop(columns=['gyro_experiment','gyro_user','acc_experiment','acc_user'], errors='ignore')
data.info()

Let's get rid of the missing values now before we continue.

In [19]:
data.isna().sum()

gyro_x                  0
gyro_y                  0
gyro_z                  0
gyro_activity    13755959
acc_x                   0
acc_y                   0
acc_z                   0
acc_activity     13755959
dtype: int64

Both activity columns have 13,755,959 missing values. Let's subtract that from our 22,496,259 total rows to see how many rows should remain.

In [20]:
print(f"we'll have {22496259-13755959} rows left")

we'll have 8740300 rows left


In [21]:
# dropping na rows
data = data.dropna()

In [22]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8740300 entries, 250 to 18097
Data columns (total 8 columns):
 #   Column         Dtype  
---  ------         -----  
 0   gyro_x         float64
 1   gyro_y         float64
 2   gyro_z         float64
 3   gyro_activity  float64
 4   acc_x          float64
 5   acc_y          float64
 6   acc_z          float64
 7   acc_activity   float64
dtypes: float64(8)
memory usage: 600.1 MB


Everything's looking good so far. We can see that we have the expected number of rows left.

To combine the activity columns, we can check whether every row has the same activity label in both of them. If so, we can use just one of the columns to fill in our new column.

In [24]:
# checking if the columns contain same values
data[data['gyro_activity'] != data['acc_activity']]

Unnamed: 0,gyro_x,gyro_y,gyro_z,gyro_activity,acc_x,acc_y,acc_z,acc_activity


The cell above is checking to see if there are any rows in the activity columns that are not aligned, and as we can see, there are no rows that have discrepant values. We can now safely use a single one of these columns as our `activity` column, and drop the individual sensor's activity columns.
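
The same check can be reduced to a single value, which is easier to read than an empty table on a large dataframe. This is a minimal sketch on toy data.

```python
import pandas as pd

toy = pd.DataFrame({
    "gyro_activity": [5.0, 5.0, 7.0],
    "acc_activity": [5.0, 5.0, 7.0],
})

# Series.equals is True only when every value (and the index) matches
columns_match = toy["gyro_activity"].equals(toy["acc_activity"])
# counting mismatches shows how many rows disagree, if any
mismatches = int((toy["gyro_activity"] != toy["acc_activity"]).sum())
print(columns_match, mismatches)
```

Note that `!=` counts two `NaN` values as a mismatch while `equals` does not, so this comparison is cleanest after the unlabelled rows have been dropped.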

In [25]:
# assign returns a new dataframe, so no value is set on a slice of the old one
data = data.assign(activity = data['gyro_activity'])
data = data.drop(columns = ['gyro_activity','acc_activity'])


In [26]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8740300 entries, 250 to 18097
Data columns (total 7 columns):
 #   Column    Dtype  
---  ------    -----  
 0   gyro_x    float64
 1   gyro_y    float64
 2   gyro_z    float64
 3   acc_x     float64
 4   acc_y     float64
 5   acc_z     float64
 6   activity  float64
dtypes: float64(7)
memory usage: 533.5 MB


In [27]:
data.head()

Unnamed: 0,gyro_x,gyro_y,gyro_z,acc_x,acc_y,acc_z,activity
250,-0.002749,-0.004276,0.002749,1.020833,-0.125,0.105556,5.0
251,-0.000305,-0.002138,0.006109,1.025,-0.125,0.101389,5.0
252,0.012217,0.000916,-0.00733,1.020833,-0.125,0.104167,5.0
253,0.011301,-0.001833,-0.006414,1.016667,-0.125,0.108333,5.0
254,0.010996,-0.001527,-0.004887,1.018056,-0.127778,0.108333,5.0


Let's also reset the row index of the dataframe.

In [28]:
data = data.reset_index(drop=True)

Now that we've got our raw data all organized, we can feature engineer some columns for our analysis.

We will create a CSV file from the data that we have, so we can continue in Google Colab in notebook 05 WebApp.ipynb.

In [46]:
data.to_csv('data.csv',index=False)
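
When the file is read back in the next notebook, the activity labels could be stored as small integers, since they are whole numbers. The sketch below shows the round trip on a toy frame written to a temporary folder. The `int8` choice and the toy values are assumptions, not a step taken in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "gyro_x": [0.1, 0.2],
    "acc_x": [1.0, 1.1],
    "activity": [5.0, 7.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "data.csv"
    toy.to_csv(csv_path, index=False)
    # the labels are written as 5.0, 7.0, so read them as floats first, then convert
    restored = pd.read_csv(csv_path).astype({"activity": "int8"})

print(restored.dtypes)
```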

---