# A Multi Variable Linear Regression Model Notebook for the Wallaroo Platform
<!-- A Comprehensive Tutorial for:
1. Building a Linear Regression Model
2. Deploying the Model into Wallaroo
3. Using Wallaroo's Monitoring Capabilities to Analyze the Model. -->

#### A Comprehensive Tutorial for Predicting the Number of People in a Room Using Multiple Variables
### Table of Contents
* [Building a Multiple Variable Linear Regression Model](#build)
    * [Importing the Necessary Python Libraries](#pylib)
    * [Importing the Data Set using Pandas](#data)
    * [Pulling our Independent (x) and Dependent (y) Variables from the DataFrame](#variable)
    * [Splitting the Data Set into Testing and Training Subsets](#test_train)
    * [Creating the Linear Regression Model using Sklearn and Fitting our Training Data to the Model](#make_model)
    * [Predicting the Room Occupancy from our Independent Variable Test Set](#predict)
    * [Finding the Metrics to Analyze the Prediction](#metrics)
    * [Plotting the Linear Regression](#plot)
* [Deploying the Model into Wallaroo](#deploying)
    * [Converting the Sklearn Model into Onnx for use on the Wallaroo Platform](#onnx)
    * [Implementing into Wallaroo](#implement)
    * [Creating a Pipeline and Uploading the Model](#pipeline)
* [Using Wallaroo's Monitoring Capabilities to Analyze the Model](#analyze)

## Building a Multiple Variable Linear Regression Model <a class="anchor" id="build"></a>

This model will be built from a data set of sensor values and the occupancy of the room the sensors are in.
The goal is to use the sensor data to predict the room occupancy.

### Importing the Necessary Python Libraries <a class="anchor" id="pylib"></a>

We will use several libraries to implement the linear regression model.
#### These libraries include:
- matplotlib
- numpy
- sklearn
- pandas
- onnx

In [4]:
# Code Source: Us

# Needed for data visualization
import matplotlib.pyplot as plt

# Needed for data tuning
import numpy as np

# Needed for creating the linear regression model
from sklearn import linear_model

# Needed for metrics of the model
from sklearn.metrics import mean_squared_error, r2_score

# Needed for csv importing
import pandas as pd

### Importing the Data Set using Pandas <a class="anchor" id="data"></a>
The first step in creating a linear regression model is to read in the data set using the pandas library.  
The `read_csv` function reads the data into a DataFrame, and the `head()` method displays its first few rows.  
To choose which variables to use, we call `.corr()` and look for the columns that correlate most strongly with `Room_Occupancy_Count`.

In [5]:
# Reading and displaying the dataset
data = pd.read_csv('Occupancy_Estimation.csv')
data.head()

Unnamed: 0,Date,Time,S1_Temp,S2_Temp,S3_Temp,S4_Temp,S1_Light,S2_Light,S3_Light,S4_Light,S1_Sound,S2_Sound,S3_Sound,S4_Sound,S5_CO2,S5_CO2_Slope,S6_PIR,S7_PIR,Room_Occupancy_Count
0,2017/12/22,10:49:41,24.94,24.75,24.56,25.38,121,34,53,40,0.08,0.19,0.06,0.06,390,0.769231,0,0,1
1,2017/12/22,10:50:12,24.94,24.75,24.56,25.44,121,33,53,40,0.93,0.05,0.06,0.06,390,0.646154,0,0,1
2,2017/12/22,10:50:42,25.0,24.75,24.5,25.44,121,34,53,40,0.43,0.11,0.08,0.06,390,0.519231,0,0,1
3,2017/12/22,10:51:13,25.0,24.75,24.56,25.44,121,34,53,40,0.41,0.1,0.1,0.09,390,0.388462,0,0,1
4,2017/12/22,10:51:44,25.0,24.75,24.56,25.44,121,34,54,40,0.18,0.06,0.06,0.06,390,0.253846,0,0,1


In [57]:
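# Displays the correlation of every numeric column with Room_Occupancy_Count
# (Date and Time are text columns, so they are excluded before calling corr())
# occupancy_features = ["S1_Temp", "S1_Light", "S1_Sound", "S5_CO2", "Room_Occupancy_Count"]
# data[occupancy_features].corr()
data.select_dtypes(include="number").corr().loc[["Room_Occupancy_Count"]]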

Unnamed: 0,S1_Temp,S2_Temp,S3_Temp,S4_Temp,S1_Light,S2_Light,S3_Light,S4_Light,S1_Sound,S2_Sound,S3_Sound,S4_Sound,S5_CO2,S5_CO2_Slope,S6_PIR,S7_PIR,Room_Occupancy_Count
Room_Occupancy_Count,0.700868,0.671263,0.652047,0.526509,0.849058,0.788764,0.793081,0.355715,0.573748,0.557853,0.531685,0.460287,0.660144,0.601105,0.633133,0.695138,1.0
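
In the output above, `S1_Light` has the strongest correlation with the target (0.849058). When there are many columns, it can help to sort them rather than read the row by eye. The following is a minimal sketch of that idea on a small made-up frame; the column names mirror the data set, but the values and the `toy` and `ranking` names are only for illustration.

```python
import pandas as pd

# Tiny made-up frame; the column names mirror the data set, the values do not
toy = pd.DataFrame({
    "S1_Temp": [24.9, 25.0, 25.1, 25.4, 25.6],
    "S1_Light": [10, 12, 120, 125, 150],
    "S1_Sound": [0.08, 0.06, 0.40, 0.35, 0.90],
    "Room_Occupancy_Count": [0, 0, 1, 1, 2],
})

# Correlation of every numeric column with the target, strongest first
ranking = (
    toy.select_dtypes(include="number")
    .corr()["Room_Occupancy_Count"]
    .drop("Room_Occupancy_Count")   # the target always correlates 1.0 with itself
    .sort_values(ascending=False)
)
print(ranking)
```

Replacing `toy` with `data` should give the same kind of ranking for the real sensor columns.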


### Pulling our Independent (x) and Dependent (y) Variables from the DataFrame <a class="anchor" id="variable"></a>
Next we select the independent variables `S1_Temp`, `S1_Light`, `S1_Sound`, and `S5_CO2` and store them in `X`, and we store the dependent variable `Room_Occupancy_Count` in `Y`.  
The `values` attribute returns the selected columns as a NumPy array.

In [112]:
# Builds the matrix of features X and prints it
X = data[["S1_Temp", "S1_Light", "S1_Sound", "S5_CO2"]].values
print("The matrix of features: \n{}\n".format(X))

Y = data['Room_Occupancy_Count'].values
print("The dependent variable matrix: \n{}".format(Y))

The matrix of features: 
[[2.494e+01 1.210e+02 8.000e-02 3.900e+02]
 [2.494e+01 1.210e+02 9.300e-01 3.900e+02]
 [2.500e+01 1.210e+02 4.300e-01 3.900e+02]
 ...
 [2.513e+01 6.000e+00 1.100e-01 3.450e+02]
 [2.513e+01 6.000e+00 8.000e-02 3.450e+02]
 [2.513e+01 6.000e+00 8.000e-02 3.450e+02]]

The dependent variable matrix: 
[1 1 1 ... 0 0 0]


### Splitting the Data Set into Testing and Training Subsets <a class="anchor" id="test_train"></a>
Next we use the `train_test_split()` function from the sklearn library to split the data into training and test sets.  
`train_test_split()` returns the training and test portions of both `X` and `Y`.  
Setting `test_size = 0.2` holds out 20% of the data set as test data.  
The `random_state` seeds the shuffle, so the same rows land in each subset every time the cell is run.  

In [113]:
from sklearn.model_selection import train_test_split

# train_test_split() returns the train and test portions of both X and Y
# test_size = 0.2 holds out 20% of the data set as test data
# random_state seeds the shuffle so the split is reproducible
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1)
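
To see what `test_size` and `random_state` do, here is a small sketch on stand-in data (10 rows of 4 made-up features, not the occupancy data). The names `X_demo`, `y_demo`, and the `a_`/`b_` prefixes are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 10 rows, 4 feature columns, one target per row
X_demo = np.arange(40).reshape(10, 4)
y_demo = np.arange(10)

# Two calls with the same random_state produce the same split
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

print(a_train.shape, a_test.shape)      # 80% of the rows for training, 20% for testing
print(np.array_equal(a_test, b_test))   # the held-out rows are identical
```

Changing `random_state` to a different integer would select a different set of held-out rows.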

### Creating the Linear Regression Model using Sklearn and Fitting our Training Data to the Model <a class="anchor" id="make_model"></a>
Now we call `LinearRegression()` to create a linear regression object, which we call `occupancy_model`.  
Then the `.fit(x_train, y_train)` method takes the training data as parameters and estimates the coefficients and intercept that best fit it.

In [114]:
from sklearn.linear_model import LinearRegression

# Creating the linear regression object
occupancy_model = LinearRegression()

# The occupancy_model.fit() estimates the coefficients from the x and y training data
occupancy_model.fit(x_train, y_train)

LinearRegression()

### Predicting the Room Occupancy from our Independent Variable Test Set <a class="anchor" id="predict"></a>
In this step we pass the independent variable test set as the parameter of the `predict()` method in order to predict the dependent variable.

In [115]:
# The occupancy_model.predict() creates a prediction based on the x test data
y_pred = occupancy_model.predict(x_test)
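
For a fitted `LinearRegression`, `predict()` computes the intercept plus the weighted sum of the features. The sketch below checks this on made-up data generated from a known linear rule (not the occupancy data); `X_demo`, `y_demo`, `demo_model`, and `manual` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from a known linear rule, not the occupancy data set
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = 1.0 + X_demo @ np.array([0.5, -2.0, 0.0, 3.0])

demo_model = LinearRegression().fit(X_demo, y_demo)

# Reproduce predict() by hand from the fitted parameters
manual = demo_model.intercept_ + X_demo @ demo_model.coef_
print(np.allclose(manual, demo_model.predict(X_demo)))
```

The same relationship should hold for `occupancy_model`, using `occupancy_model.intercept_` and `occupancy_model.coef_` with `x_test`.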

### Finding the Metrics to Analyze the Prediction <a class="anchor" id="metrics"></a>
In this step we use several attributes and functions to see how well our linear regression model predicts the data. The `coef_` attribute holds the **regression coefficients**, one per feature, which show how much the predicted occupancy changes when that feature increases by one unit while the others stay fixed. Next, `mean_squared_error()` measures the distance between the predicted values and the true values; we report its square root (the root mean squared error), which is in the same units as the occupancy count, and the best possible score is 0. Lastly, `r2_score()` shows how much of the variation in the data the model explains, with an R^2 score of 1 being a perfect fit.

In [116]:
from sklearn.metrics import mean_squared_error, r2_score

In [118]:
# Prints the coefficients
print("Coefficients: \n", occupancy_model.coef_)

# Prints the root mean squared error (the square root of the mean squared error)
print("Root mean squared error: %.2f" % np.sqrt(mean_squared_error(y_test, y_pred)))

# Prints the coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred))

Coefficients: 
 [0.16267571 0.01127446 0.22855009 0.00083072]
Root mean squared error: 0.47
Coefficient of determination: 0.74
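
To make the two metrics concrete, the sketch below computes them by hand with NumPy on four made-up values (not the notebook's predictions). The names `y_true` and `y_hat` are only for illustration.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_hat = np.array([0.5, 1.0, 1.5, 3.0])

# Root mean squared error: square root of the average squared residual
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(rmse, 4), round(r2, 4))
```

Here the residual sum of squares is 0.5 and the total sum of squares is 5, so R^2 is 0.9: the predictions account for 90% of the variation in `y_true`.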


### Plotting the Linear Regression <a class="anchor" id="plot"></a>
Using `plt` from the **matplotlib** library we can plot the results of our linear regression. Because the model uses four features, there is no single x-axis to draw a line of best fit against, so we plot the predicted occupancy against the actual occupancy instead.

In [None]:
# The plt.scatter() plots the predicted occupancy against the actual occupancy for the test set
plt.scatter(y_test, y_pred, color="black", alpha=0.3)

# The plt.plot() draws the diagonal on which a perfect prediction would fall
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="blue", linewidth=3)

# Sets the x-axis, y-axis, and title of the plot
plt.xlabel('Actual Room Occupancy')
plt.ylabel('Predicted Room Occupancy')
plt.title('Predicted VS Actual Room Occupancy (S1_Temp, S1_Light, S1_Sound, and S5_CO2 Sensors)')

# The plt.show() displays the plot
plt.show()

## Deploying the Model into Wallaroo <a class="anchor" id="deploying"></a>
The model we created will now be deployed to the Wallaroo platform.

### Converting the Sklearn Model into Onnx for use on the Wallaroo Platform <a class="anchor" id="onnx"></a>
For the next step, refer to the [sklearn-regression-to-onnx tutorial](https://docs.wallaroo.ai/wallaroo-tutorials/conversion-tutorials/sklearn-regression-to-onnx/) in the Wallaroo documentation for how to convert the model to ONNX.

In [119]:
# Used to load the sk-learn model
import pickle

# Used for the conversion process
import onnx, skl2onnx, onnxmltools
from skl2onnx.common.data_types import FloatTensorType
from skl2onnx.common.data_types import DoubleTensorType

In [120]:
# The model_to_onnx() converts the model to ONNX so it can be uploaded to the Wallaroo platform
# For more detailed steps refer to "model_conversion"
def model_to_onnx(model, cols, *, input_type='Double'):
    input_type_lower=input_type.lower()
    # How to manage float values
    if input_type=='Double':
        tensor_type=DoubleTensorType
    elif input_type=='Float':
        tensor_type=FloatTensorType
    else:
        raise ValueError("bad input type")
    tensor_size=cols
    initial_type=[(f'{input_type_lower}_input', tensor_type([None, tensor_size]))]
    onnx_model=onnxmltools.convert_sklearn(model,initial_types=initial_type)
    return onnx_model

In [141]:
# The model_to_onnx() takes the fitted model and its number of input columns (4) and converts it to ONNX
onnx_model_converted = model_to_onnx(occupancy_model, 4)

# The onnx.save_model() saves the converted model into a file
onnx.save_model(onnx_model_converted, "occupancy_model.onnx")

### Implementing into Wallaroo <a class="anchor" id="implement"></a>
Refer to the [Wallaroo 101 Tutorial](https://docs.wallaroo.ai/wallaroo-101/) for how to access the Wallaroo platform.

In [142]:
# Needed for the use of Wallaroo
import wallaroo

# The wallaroo.Client() connects this notebook to the Wallaroo platform
wl = wallaroo.Client()

In [143]:
# Creates the name for workspace, pipeline, and model
workspace_name = 'multi variable'
pipeline_name = 'occupancymultivarpipeline'
model_name = 'occupancymultivarmodel'

# Created to fetch the model
model_file_name = 'occupancy_model.onnx'

In [144]:
# Needed to detect a pipeline that does not exist yet
from wallaroo.object import EntityNotFoundError

# The get_workspace() gets the workspace, or creates it when needed
# For more detailed steps refer to "wallaroo-101"
def get_workspace(name):
    workspace = None
    for ws in wl.list_workspaces():
        if ws.name() == name:
            workspace = ws
    if workspace is None:
        workspace = wl.create_workspace(name)
    return workspace

# The get_pipeline() gets the pipeline, or creates it when needed
# For more detailed steps refer to "wallaroo-101"
def get_pipeline(name):
    try:
        # Use the name passed in rather than the global pipeline_name
        pipeline = wl.pipelines_by_name(name)[0]
    except EntityNotFoundError:
        pipeline = wl.build_pipeline(name)
    return pipeline

In [145]:
# Calls function to create workspace
workspace = get_workspace(workspace_name)

# The wl.set_current_workspace() sets the workspace currently being worked in
ws = wl.set_current_workspace(workspace)

In [146]:
# The wl.list_workspaces() prints the list of workspaces
wl.list_workspaces()

Name,Created At,Users,Models,Pipelines
v-joseph.bigger@wallaroo.ai - Default Workspace,2022-10-25 20:01:52,['v-joseph.bigger@wallaroo.ai'],0,0
multi variable,2022-10-26 15:19:52,['v-joseph.bigger@wallaroo.ai'],5,1


In [147]:
wl.set_current_workspace(workspace)
gw = wl.get_current_workspace()

In [148]:
# The wl.upload_model() uploads the ONNX model and the Python post-processing script to the platform
occupancy_model_onnx = wl.upload_model(model_name, model_file_name).configure()
module_post = wl.upload_model("postprocess", "./postprocess.py").configure('python')

### Creating a Pipeline and Uploading the Model <a class="anchor" id="pipeline"></a>
In this step we use `build_pipeline()` to create the pipeline, passing it the `pipeline_name` string we defined earlier.
The model was uploaded in the previous cell with `upload_model()`, which takes a name string and a file name, both defined earlier in the tutorial.
Lastly, we add the model as a step in the pipeline using `add_model_step()`, passing it the uploaded model.

In [149]:
# The wl.build_pipeline() creates the pipeline
occupancy_pipeline = wl.build_pipeline(pipeline_name)

# The occupancy_pipeline.add_model_step() adds the model to the pipeline to be deployed
occupancy_pipeline = occupancy_pipeline.add_model_step(occupancy_model_onnx)

In [150]:
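# The add_validation() adds a check named 'no_negative_people' that the model output is not negative
occupancy_pipeline = occupancy_pipeline.add_validation('no_negative_people', occupancy_model_onnx.outputs[0][0] >= float(0))

In [151]:
# Adds the post-processing script as the next step in the pipeline
occupancy_pipeline = occupancy_pipeline.add_model_step(module_post)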

In [152]:
# The occupancy_pipeline.deploy() activates the pipeline
occupancy_pipeline.deploy()

Waiting for deployment - this will take up to 45s .. ok


0,1
name,occupancymultivarpipeline
created,2022-10-26 17:19:32.363105+00:00
last_updated,2022-10-26 17:26:03.231937+00:00
deployed,True
tags,
steps,occupancymultivarmodel


## Using Wallaroo's Monitoring Capabilities to Analyze the Model <a class="anchor" id="analyze"></a>
Now that the model is deployed, we can monitor and analyze it.

In [153]:
# The occupancy_pipeline.status() displays the status of the pipeline
occupancy_pipeline.status()

{'status': 'Running',
 'details': None,
 'engines': [{'ip': '10.244.3.68',
   'name': 'engine-5859994cc8-jsn2k',
   'status': 'Running',
   'reason': None,
   'pipeline_statuses': {'pipelines': [{'id': 'occupancymultivarpipeline',
      'status': 'Running'}]},
   'model_statuses': {'models': [{'name': 'postprocess',
      'version': 'b0803986-0514-4560-b47f-db49f6549439',
      'sha': 'c9326405453ebeb9fc2db95678d08f06d3d2968581b324e8d48daf1b93c76cf8',
      'status': 'Running'},
     {'name': 'occupancymultivarmodel',
      'version': 'cb2628e0-dd79-4a8c-941d-c6eb0a50cb6a',
      'sha': '220af1e3ccd12a792748f4da11411abf9e6a53b244d991aaf228be58f0b2f955',
      'status': 'Running'}]}}],
 'engine_lbs': [{'ip': '10.244.3.67',
   'name': 'engine-lb-67c854cc86-cfpw2',
   'status': 'Running',
   'reason': None}]}

In [154]:
# Needed for the inferences
import json
from wallaroo.object import EntityNotFoundError

In [155]:
# The pandas_to_dict() converts the DataFrame values into a dictionary for inferences
def pandas_to_dict(df):
    input_dict = {
    'tensor': df.to_numpy().tolist()
    }
    return input_dict
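
As a quick illustration of the shape `pandas_to_dict()` produces, the sketch below applies it to two made-up sensor rows. The helper is repeated so the example runs on its own; `sample` and `payload` are illustrative names.

```python
import pandas as pd

def pandas_to_dict(df):
    # Same helper as in the cell above, repeated so the example runs on its own
    return {'tensor': df.to_numpy().tolist()}

# Two made-up sensor rows in the column order the model was trained on
sample = pd.DataFrame(
    [[25.0, 121, 0.9, 390], [24.95, 101, 3.27, 390]],
    columns=["S1_Temp", "S1_Light", "S1_Sound", "S5_CO2"],
)

# One inner list per row, under a single 'tensor' key
payload = pandas_to_dict(sample)
print(payload)
```

Each inner list holds one row of feature values, which is the same layout as the hand-written `input_dict` used for inference later in this notebook.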

In [156]:
# The data.iloc[rows, columns] selects the first 10 rows and the columns at
# positions 2, 6, 10, and 14 (S1_Temp, S1_Light, S1_Sound, and S5_CO2)
raw = data.iloc[:10,[2, 6, 10, 14]]
raw

Unnamed: 0,S1_Temp,S1_Light,S1_Sound,S5_CO2
0,24.94,121,0.08,390
1,24.94,121,0.93,390
2,25.0,121,0.43,390
3,25.0,121,0.41,390
4,25.0,121,0.18,390
5,25.0,121,0.13,390
6,25.0,120,1.39,390
7,25.0,121,0.09,390
8,25.0,122,0.09,390
9,25.0,101,3.84,390


In [161]:
# Store values for inferences
# input_dict = pandas_to_dict(raw)
# input_dict = {'tensor': x_test.tolist()}
input_dict = {'tensor': [[25, 121, 0.9, 390], [24.8, -117, 0.5, 390], [24.95, 101, 3.27, 390]]}
# input_dict

In [162]:
# The occupancy_pipeline.infer() creates a result based on data given
result = occupancy_pipeline.infer(input_dict)
result

[InferenceResult({'check_failures': [{'False': {'expr': 'occupancymultivarmodel.outputs[0][0] '
                                        '>= 0'}}],
  'elapsed': 330305,
  'model_name': 'postprocess',
  'model_version': 'b0803986-0514-4560-b47f-db49f6549439',
  'original_data': {'tensor': [[25, 121, 0.9, 390],
                               [24.8, -117, 0.5, 390],
                               [24.95, 101, 3.27, 390]]},
  'outputs': [{'Json': {'data': [{'original': {'outputs': [{'Double': {'data': [1.5067474012753257,
                                                                                -1.3005291444940523,
                                                                                1.8147881464737432],
                                                                       'dim': [3,
                                                                               1],
                                                                       'v': 1}}]},
                           

In [163]:
result[0].data()[0].tolist()

[2.0, -1.0, 2.0]
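
The contents of `postprocess.py` are not shown in this notebook. The values above are consistent with rounding each raw model output to the nearest whole number, so the following is only a sketch under that assumption, not the actual post-processing script.

```python
import numpy as np

# Raw model outputs copied from the inference result above
raw_outputs = np.array([1.5067474012753257, -1.3005291444940523, 1.8147881464737432])

# Assumption: the post-processing step rounds to the nearest whole person
rounded = np.round(raw_outputs).tolist()
print(rounded)
```

Note that rounding alone keeps the negative value, which is why the `no_negative_people` validation is useful for flagging inputs such as the negative light reading.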

In [164]:
# The occupancy_pipeline.logs() returns the pipeline's recent inference logs
logs = occupancy_pipeline.logs()
# type(logs)
# type(logs[0])
# vars(logs[0])
logs

Timestamp,Output,Input,Anomalies
2022-26-Oct 17:32:03,"[array([2., 1., 9.])]","[[25, 121, 0.9, 390], [24.8, 117, 0.5, 390], [24.95, 101, 32.7, 390]]",0
2022-26-Oct 17:33:05,"[array([ 2., -1., 9.])]","[[25, 121, 0.9, 390], [24.8, -117, 0.5, 390], [24.95, 101, 32.7, 390]]",1
2022-26-Oct 17:33:23,"[array([ 2., -1., 2.])]","[[25, 121, 0.9, 390], [24.8, -117, 0.5, 390], [24.95, 101, 3.27, 390]]",1
