# WebApp

### Feature Engineering Training Data

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import joblib

To classify each reading, we'll use the previous 10 records as extra features. That means making 10 lagged copies of our six XYZ gyro/acc columns, for a total of 60 new columns.

First, we'll read in the `data.csv` exported from notebook `04 WebApp.ipynb`.

In [4]:
data = pd.read_csv('/data.csv')

In [5]:
data.head()

Unnamed: 0,gyro_x,gyro_y,gyro_z,acc_x,acc_y,acc_z,activity
0,-0.002749,-0.004276,0.002749,1.020833,-0.125,0.105556,5.0
1,-0.000305,-0.002138,0.006109,1.025,-0.125,0.101389,5.0
2,0.012217,0.000916,-0.00733,1.020833,-0.125,0.104167,5.0
3,0.011301,-0.001833,-0.006414,1.016667,-0.125,0.108333,5.0
4,0.010996,-0.001527,-0.004887,1.018056,-0.127778,0.108333,5.0


In [7]:
# create a copy of the data to mess around with
data1 = data.copy()

Now we can begin lagging the data to create the new features.

The pandas `.shift(n)` method slides every row down by `n` positions, so each row ends up holding the values recorded `n` rows earlier. Running it for `n` from 1 to 10 gives us the lagged features to help our analysis.

Then we can concatenate the data into one big dataframe for our final clean data.
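To make the effect of `.shift(n)` concrete, here is a minimal sketch on a made-up four-row frame. The values and the `toy` name are illustrative assumptions, not part of the real dataset.

```python
import pandas as pd

# Toy frame with one sensor column; the values are made up for illustration.
toy = pd.DataFrame({"gyro_x": [0.1, 0.2, 0.3, 0.4]})

# shift(1) moves every value down one row, so each row holds the previous reading.
toy["gyro_x1"] = toy["gyro_x"].shift(1)
# shift(2) moves values down two rows, leaving the first two rows empty (NaN).
toy["gyro_x2"] = toy["gyro_x"].shift(2)

print(toy)
```

The rows at the top have nothing earlier to pull from, which is why the lagged columns start with missing values.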

In [8]:
shifted_data = pd.DataFrame()
for i in range(1,11):
    print(f'shift: {i}')
    temp_df = data1.shift(i)
    temp_df = temp_df.rename(columns={
        'gyro_x':f'gyro_x{i}',
        'gyro_y':f'gyro_y{i}',
        'gyro_z':f'gyro_z{i}',
        'acc_x':f'acc_x{i}',
        'acc_y':f'acc_y{i}',
        'acc_z':f'acc_z{i}',
        'activity':f'activity{i}'})
    # drop the lagged activity column: it is our target, we can only have 1 target,
    # and the test data will not have these activity columns as features
    temp_df = temp_df.drop(columns=[f'activity{i}'])
    shifted_data = pd.concat((shifted_data,temp_df),axis=1)
data2 = pd.concat((data1,shifted_data),axis=1)

shift: 1
shift: 2
shift: 3
shift: 4
shift: 5
shift: 6
shift: 7
shift: 8
shift: 9
shift: 10


Awesome, let's take a look at all the new columns we made. We should be expecting 67 columns, because the original data contained 7 columns and we calculated that lagging would add 60 more.
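The expected width can be checked with a little arithmetic: 7 original columns plus 6 sensor columns times 10 lags. The sketch below does this on a small random frame. The `add_lags` helper and the `FEATURES` list are hypothetical names introduced here, and `add_suffix` is used as a shorter alternative to the explicit `rename` in the loop above.

```python
import numpy as np
import pandas as pd

FEATURES = ["gyro_x", "gyro_y", "gyro_z", "acc_x", "acc_y", "acc_z"]

def add_lags(df, feature_cols, n_lags):
    """Return df plus n_lags shifted copies of feature_cols (hypothetical helper)."""
    lagged = [
        df[feature_cols].shift(i).add_suffix(str(i))
        for i in range(1, n_lags + 1)
    ]
    return pd.concat([df] + lagged, axis=1)

# Small random stand-in for the real data: 6 sensor columns plus the target.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(20, 6)), columns=FEATURES)
toy["activity"] = 5.0

wide = add_lags(toy, FEATURES, n_lags=10)
expected_cols = toy.shape[1] + len(FEATURES) * 10  # 7 + 6 * 10
print(wide.shape[1], expected_cols)
```

Because only the sensor columns are shifted, no lagged `activity` columns are created and nothing has to be dropped afterwards.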

In [9]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1424526 entries, 0 to 1424525
Data columns (total 67 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   gyro_x    1424526 non-null  float64
 1   gyro_y    1424526 non-null  float64
 2   gyro_z    1424526 non-null  float64
 3   acc_x     1424526 non-null  float64
 4   acc_y     1424526 non-null  float64
 5   acc_z     1424525 non-null  float64
 6   activity  1424525 non-null  float64
 7   gyro_x1   1424525 non-null  float64
 8   gyro_y1   1424525 non-null  float64
 9   gyro_z1   1424525 non-null  float64
 10  acc_x1    1424525 non-null  float64
 11  acc_y1    1424525 non-null  float64
 12  acc_z1    1424525 non-null  float64
 13  gyro_x2   1424524 non-null  float64
 14  gyro_y2   1424524 non-null  float64
 15  gyro_z2   1424524 non-null  float64
 16  acc_x2    1424524 non-null  float64
 17  acc_y2    1424524 non-null  float64
 18  acc_z2    1424524 non-null  float64
 19  gyro_x3   1424523 non

In [10]:
data2.shape

(1424526, 67)

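Everything looks to be in order; now all we have to do is drop the rows with missing values. Shifting created most of them: the first 10 rows have no earlier records to pull from, so their lagged columns are empty. The `info()` output above also shows one missing value each in `acc_z` and `activity` in the original data, so `dropna()` will remove that row as well.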

In [11]:
data3 = data2.dropna()
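As a rough check on how many rows `dropna()` should remove, the sketch below rebuilds the situation on a 15-row toy frame: 10 lags plus one missing target value in the final row. The frame, its values, and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

n_rows, n_lags = 15, 10
toy = pd.DataFrame({
    "acc_z": np.arange(n_rows, dtype=float),
    "activity": np.full(n_rows, 5.0),
})
# Mimic a missing value in the final row of the original data.
toy.loc[n_rows - 1, "activity"] = np.nan

# Build the lagged copies of the single feature column.
lags = [toy[["acc_z"]].shift(i).add_suffix(str(i)) for i in range(1, n_lags + 1)]
lagged = pd.concat([toy] + lags, axis=1)

clean = lagged.dropna()
rows_removed = len(lagged) - len(clean)
print(rows_removed, clean.index.min(), clean.index.max())
```

The removed rows are the first `n_lags` rows plus any row that was already incomplete before shifting, and the surviving index keeps its original labels rather than restarting at 0.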

In [12]:
data3.head()

Unnamed: 0,gyro_x,gyro_y,gyro_z,acc_x,acc_y,acc_z,activity,gyro_x1,gyro_y1,gyro_z1,acc_x1,acc_y1,acc_z1,gyro_x2,gyro_y2,gyro_z2,acc_x2,acc_y2,acc_z2,gyro_x3,gyro_y3,gyro_z3,acc_x3,acc_y3,acc_z3,gyro_x4,gyro_y4,gyro_z4,acc_x4,acc_y4,acc_z4,gyro_x5,gyro_y5,gyro_z5,acc_x5,acc_y5,acc_z5,gyro_x6,gyro_y6,gyro_z6,acc_x6,acc_y6,acc_z6,gyro_x7,gyro_y7,gyro_z7,acc_x7,acc_y7,acc_z7,gyro_x8,gyro_y8,gyro_z8,acc_x8,acc_y8,acc_z8,gyro_x9,gyro_y9,gyro_z9,acc_x9,acc_y9,acc_z9,gyro_x10,gyro_y10,gyro_z10,acc_x10,acc_y10,acc_z10
10,0.00336,-0.002749,0.000305,1.019445,-0.119444,0.094444,5.0,0.016493,0.003665,0.00336,1.019445,-0.115278,0.094444,0.009774,-0.006414,0.000305,1.020833,-0.127778,0.098611,0.013744,-0.014966,0.004276,1.016667,-0.123611,0.097222,0.010079,-0.003665,0.000305,1.019445,-0.125,0.101389,0.009163,-0.003054,0.010079,1.018056,-0.129167,0.104167,0.010996,-0.001527,-0.004887,1.018056,-0.127778,0.108333,0.011301,-0.001833,-0.006414,1.016667,-0.125,0.108333,0.012217,0.000916,-0.00733,1.020833,-0.125,0.104167,-0.000305,-0.002138,0.006109,1.025,-0.125,0.101389,-0.002749,-0.004276,0.002749,1.020833,-0.125,0.105556
11,-0.00336,-0.008552,0.007941,1.022222,-0.120833,0.1,5.0,0.00336,-0.002749,0.000305,1.019445,-0.119444,0.094444,0.016493,0.003665,0.00336,1.019445,-0.115278,0.094444,0.009774,-0.006414,0.000305,1.020833,-0.127778,0.098611,0.013744,-0.014966,0.004276,1.016667,-0.123611,0.097222,0.010079,-0.003665,0.000305,1.019445,-0.125,0.101389,0.009163,-0.003054,0.010079,1.018056,-0.129167,0.104167,0.010996,-0.001527,-0.004887,1.018056,-0.127778,0.108333,0.011301,-0.001833,-0.006414,1.016667,-0.125,0.108333,0.012217,0.000916,-0.00733,1.020833,-0.125,0.104167,-0.000305,-0.002138,0.006109,1.025,-0.125,0.101389
12,-0.005803,-0.010079,0.003971,1.019445,-0.120833,0.1,5.0,-0.00336,-0.008552,0.007941,1.022222,-0.120833,0.1,0.00336,-0.002749,0.000305,1.019445,-0.119444,0.094444,0.016493,0.003665,0.00336,1.019445,-0.115278,0.094444,0.009774,-0.006414,0.000305,1.020833,-0.127778,0.098611,0.013744,-0.014966,0.004276,1.016667,-0.123611,0.097222,0.010079,-0.003665,0.000305,1.019445,-0.125,0.101389,0.009163,-0.003054,0.010079,1.018056,-0.129167,0.104167,0.010996,-0.001527,-0.004887,1.018056,-0.127778,0.108333,0.011301,-0.001833,-0.006414,1.016667,-0.125,0.108333,0.012217,0.000916,-0.00733,1.020833,-0.125,0.104167
13,-0.009163,0.000916,0.001833,1.016667,-0.120833,0.095833,5.0,-0.005803,-0.010079,0.003971,1.019445,-0.120833,0.1,-0.00336,-0.008552,0.007941,1.022222,-0.120833,0.1,0.00336,-0.002749,0.000305,1.019445,-0.119444,0.094444,0.016493,0.003665,0.00336,1.019445,-0.115278,0.094444,0.009774,-0.006414,0.000305,1.020833,-0.127778,0.098611,0.013744,-0.014966,0.004276,1.016667,-0.123611,0.097222,0.010079,-0.003665,0.000305,1.019445,-0.125,0.101389,0.009163,-0.003054,0.010079,1.018056,-0.129167,0.104167,0.010996,-0.001527,-0.004887,1.018056,-0.127778,0.108333,0.011301,-0.001833,-0.006414,1.016667,-0.125,0.108333
14,-0.004887,0.002443,0.007941,1.019445,-0.120833,0.097222,5.0,-0.009163,0.000916,0.001833,1.016667,-0.120833,0.095833,-0.005803,-0.010079,0.003971,1.019445,-0.120833,0.1,-0.00336,-0.008552,0.007941,1.022222,-0.120833,0.1,0.00336,-0.002749,0.000305,1.019445,-0.119444,0.094444,0.016493,0.003665,0.00336,1.019445,-0.115278,0.094444,0.009774,-0.006414,0.000305,1.020833,-0.127778,0.098611,0.013744,-0.014966,0.004276,1.016667,-0.123611,0.097222,0.010079,-0.003665,0.000305,1.019445,-0.125,0.101389,0.009163,-0.003054,0.010079,1.018056,-0.129167,0.104167,0.010996,-0.001527,-0.004887,1.018056,-0.127778,0.108333


In [13]:
data3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1424515 entries, 10 to 1424524
Data columns (total 67 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   gyro_x    1424515 non-null  float64
 1   gyro_y    1424515 non-null  float64
 2   gyro_z    1424515 non-null  float64
 3   acc_x     1424515 non-null  float64
 4   acc_y     1424515 non-null  float64
 5   acc_z     1424515 non-null  float64
 6   activity  1424515 non-null  float64
 7   gyro_x1   1424515 non-null  float64
 8   gyro_y1   1424515 non-null  float64
 9   gyro_z1   1424515 non-null  float64
 10  acc_x1    1424515 non-null  float64
 11  acc_y1    1424515 non-null  float64
 12  acc_z1    1424515 non-null  float64
 13  gyro_x2   1424515 non-null  float64
 14  gyro_y2   1424515 non-null  float64
 15  gyro_z2   1424515 non-null  float64
 16  acc_x2    1424515 non-null  float64
 17  acc_y2    1424515 non-null  float64
 18  acc_z2    1424515 non-null  float64
 19  gyro_x3   1424515 no

Now that we have our feature-engineered data, let's try running a logistic regression on it!

---

### Logistic Regression

Let's start by importing the required class and splitting our DataFrame into X and y variables, where y is the target.

In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
X = data3.loc[:, data3.columns != 'activity']
y = data3['activity']

Now, we can run the logistic regression!

In [17]:
# instantiate model
logit = LogisticRegression()

# fit on X and y values
logit.fit(X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
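The message above means the `lbfgs` solver stopped at the default `max_iter=100` before converging, so the fitted coefficients may not be optimal. The warning itself suggests scaling the data or raising `max_iter`. A minimal sketch of both, on synthetic data standing in for the lagged features, might look like this. The `max_iter=1000` value, the synthetic data, and all variable names are assumptions, not settings from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 300 rows, 6 columns on very different scales,
# loosely imitating small gyro values next to larger accelerometer values.
X_demo = rng.normal(size=(300, 6)) * np.array([0.01, 0.01, 0.01, 1.0, 1.0, 1.0])
# Noisy binary label driven by one column, purely for illustration.
y_demo = (X_demo[:, 3] + 0.5 * rng.normal(size=300) > 0).astype(float)

# Standardize each column, then fit with a higher iteration cap (assumed value).
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logit.fit(X_demo, y_demo)

print(scaled_logit.score(X_demo, y_demo))
```

Wrapping the scaler and the model in one pipeline means the same scaling is applied automatically at prediction time, which matters if the saved object is later loaded in the web app.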

Now let's save this model as a `.pkl` file so we can use it later on in the web app.

In [19]:
joblib.dump(logit, "logit.pkl")

['logit.pkl']
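Loading the model back is the mirror image of the dump. The sketch below round-trips a tiny stand-in model through `joblib` so it can run on its own. The file name `logit_demo.pkl`, the toy data, and the variable names are assumptions; in the web app the file would be the `logit.pkl` saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model so the example is self-contained.
X_demo = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.1]])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])
model = LogisticRegression().fit(X_demo, y_demo)

# Save to a temporary location (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "logit_demo.pkl")
joblib.dump(model, path)

# Load it back and predict on one new row.
loaded = joblib.load(path)
print(loaded.predict(np.array([[0.9, 1.0]])))
```

The loaded model expects the same feature columns, in the same order, as the `X` it was fitted on, so the web app would need to build the 66 current and lagged values the same way before calling `predict`.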