# Prediction Model Blueprint

Topic: Energy consumption prediction

Author: Ananya W.

Packages: [pandas](https://pandas.pydata.org/docs/reference/index.html), [scikit-learn (sklearn)](https://scikit-learn.org/stable/api/index.html), [seaborn](https://seaborn.pydata.org/tutorial.html)

Files: `blueprint_data_{train/assessment}.csv`

Ananya's work in this document is divided into 6 phases (steps):
1. Data analysis
2. Data cleansing
3. Feature engineering
4. Data set splitting
5. Model training and testing
6. Evaluation and encapsulation of the results
   
Ananya has also left you a message at the end of the notebook.

## 1. Data analysis
First, we focus on a brief exploratory data analysis to obtain an initial overview.

### Importing the data set

Importing the data set and outputting dimensions, columns and statistics.

In [None]:
import pandas as pd

dataframe = pd.read_csv('blueprint_data_train.csv')

Output of basic information on the data set:

In [None]:
dataframe.info()

The `energy_joule` column contains null (missing) values that must be handled. This is dealt with in part 2 (data cleansing).
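
As a complement to `info()`, missing entries can also be counted per column. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are assumptions for illustration, not the project data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for blueprint_data_train.csv
toy_df = pd.DataFrame({
    "distance": [10.0, 20.0, 30.0, 40.0],
    "energy_joule": [100.0, np.nan, 310.0, np.nan],
})

# isna() marks missing entries, sum() counts them per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

On the real data, the same call would be `dataframe.isna().sum()`.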

Use the `describe()` method to output the value ranges and simple column statistics:

In [None]:
dataframe.describe()

The column `location_id` has variance 0 (all entries have the value 1). The column is removed in part 2 (data cleansing).
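
Constant columns can also be found programmatically instead of reading them off the `describe()` output. A minimal sketch, assuming a made-up frame `toy_df` (illustrative values only):

```python
import pandas as pd

# Hypothetical toy frame; location_id is constant, distance is not
toy_df = pd.DataFrame({
    "location_id": [1, 1, 1, 1],
    "distance": [10.0, 20.0, 30.0, 40.0],
})

# A column with at most one distinct value carries no information for a model
constant_columns = [
    col for col in toy_df.columns if toy_df[col].nunique(dropna=False) <= 1
]
print(constant_columns)
```

Unlike a variance check, `nunique` also works for string columns.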

Output of the first rows of the data set:

In [None]:
dataframe.head(10)

### Plotting of simple visualizations

The external library `seaborn` is used for visualization.

Display of histograms for each column:

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt

for col in dataframe.columns:
    print(col)
    plt.figure(1, figsize=(5,5))
    sns.histplot(dataframe, x=col, discrete=True)
    plt.show()

The third histogram shows that there are two types of robots, namely “Robot1” and “Robot2”. These are encoded as strings and must be encoded as numeric values as most machine learning models only accept numeric input. This is covered in Part 3 (Feature Engineering).

Pairwise visualization of the distributions of two columns.

The row for the prediction target `energy_joule` is particularly relevant here in order to derive an estimate of which features influence energy consumption.

In [None]:
sns.pairplot(dataframe)

The feature columns `distance`, `number_of_turns`, `additional_cargo` and `rush_level` appear to correlate with energy consumption and are therefore retained (as input for the regression model).
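
The visual impression from the pair plot can be backed up with correlation coefficients. The sketch below uses a made-up frame (`toy_df` and its numbers are assumptions, not project results) and ranks the numeric columns by the strength of their Pearson correlation with the target. Pearson correlation only captures linear relationships, so it supplements the plot rather than replacing it.

```python
import pandas as pd

# Hypothetical toy frame with one informative and one uninformative column
toy_df = pd.DataFrame({
    "distance": [1.0, 2.0, 3.0, 4.0, 5.0],
    "commissioner": [3, 1, 2, 3, 1],
    "energy_joule": [11.0, 19.0, 32.0, 41.0, 48.0],
})

# Restrict to numeric columns, then take the correlation with the target
numeric_df = toy_df.select_dtypes(include="number")
target_corr = numeric_df.corr()["energy_joule"].drop("energy_joule")

# Sort by absolute value so strong negative correlations also rank high
target_corr = target_corr.sort_values(key=abs, ascending=False)
print(target_corr)
```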

## 2. Data cleansing

In this part, we first deal with the missing (NaN) values in the `energy_joule` column. We decide to simply delete the corresponding rows. Note that `dropna()` without arguments removes every row that has a missing value in any column:

In [None]:
dataframe = dataframe.dropna()

In [None]:
dataframe.info()
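
The `dropna()` call above removes rows with a missing value in any column. If only rows with a missing target should be removed, the `subset` parameter can be used. A minimal sketch on made-up data (`toy_df` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing feature and one missing target
toy_df = pd.DataFrame({
    "distance": [10.0, np.nan, 30.0],
    "energy_joule": [100.0, 210.0, np.nan],
})

# Drops every row that has a NaN in any column
dropped_any = toy_df.dropna()

# Drops only rows where the target itself is missing
dropped_target_only = toy_df.dropna(subset=["energy_joule"])

print(len(dropped_any), len(dropped_target_only))
```

Which variant is appropriate depends on whether the feature columns can contain missing values at all; most scikit-learn models do not accept NaN inputs.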

The entries in the `commissioner` column should not be relevant for the robot's energy consumption.
Domain understanding of the problem suggests this, and the pair plot above supports it.

Furthermore, the column `location_id` has no variance.

The columns are removed from the data set:

In [None]:
dataframe = dataframe.drop(columns=["commissioner", "location_id"])

## 3. Feature Engineering

As observed in part 1, the `robot` column describes the type of robot (`Robot1` or `Robot2`) in string format. The entries in this column can be ordinally encoded as integers; `Robot1` is replaced with `1` and `Robot2` is replaced with `2`:

In [None]:
def convert_string_robot_to_numeric(str_robot):
    if str_robot == "Robot1":
        return 1
    elif str_robot == "Robot2":
        return 2
    else:
        print("str_robot must be either Robot1 or Robot2")
        print("Returning None")
        return None
        
dataframe["robot"] = dataframe["robot"].map(convert_string_robot_to_numeric)

In [None]:
display(dataframe)

Instead of the ordinal encoding performed here, you can also create binary indicator variables, one for `Robot1` and one for `Robot2`. This is called one-hot encoding.
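
One-hot encoding could look roughly like this with `pandas.get_dummies`. This is a sketch on a made-up frame (`toy_df` is an assumption), not the encoding used in the rest of this notebook:

```python
import pandas as pd

# Hypothetical toy frame with the string-encoded robot type
toy_df = pd.DataFrame({"robot": ["Robot1", "Robot2", "Robot1"]})

# Each category becomes its own 0/1 column named <column>_<category>
one_hot = pd.get_dummies(toy_df, columns=["robot"], dtype=int)
print(one_hot)
```

With only two robot types, the two indicator columns are redundant (one is always the opposite of the other), so passing `drop_first=True` would keep a single column.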

## 4. Dataset splitting

In order to be able to evaluate the results of a model properly, we first split our training data into two parts:
The first part is actually used for model training, and the second part is reserved for evaluating (testing) the model.

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(dataframe, test_size=0.3, random_state=23)

## 5. Model training and testing

We train different models on the training dataset and evaluate them using the split test data.

We will look at the following models:

* Linear regression
* Polynomial regression
* Decision Tree
* Random Forest

Linear regression models the target as a linear (and therefore monotonic) function of each feature; polynomial regression extends this with polynomial terms of the features.
We must therefore consider for each input whether such a relationship is plausible or whether the feature can be preprocessed accordingly.
The Scikit-Learn documentation on [Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) and [Polynomial Features](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) will also help you here.
The [Scikit-Learn User Guide](https://scikit-learn.org/stable/modules/linear_model.html) provides additional information.
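
To illustrate how `PolynomialFeatures` and `LinearRegression` work together, here is a minimal sketch on synthetic data (the arrays `x` and `y` are made up for illustration). It fits a plain linear model and a degree-2 polynomial model to a curved relationship and compares the R² scores on the fitted data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free data with a curved (quadratic) relationship
x = np.linspace(0, 5, 50).reshape(-1, 1)
y = x[:, 0] ** 2

# Plain linear regression on the raw feature
linear_model = LinearRegression().fit(x, y)

# The pipeline first expands x to [1, x, x^2], then fits a linear model on that
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

r2_linear = linear_model.score(x, y)
r2_poly = poly_model.score(x, y)
print(f"R2 linear: {r2_linear:.3f}, R2 polynomial: {r2_poly:.3f}")
```

Wrapping both steps in a pipeline is a design choice that keeps the feature expansion and the model together, so the same transformation is applied automatically at prediction time.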

Tree-based models (e.g. decision tree and random forest) can usually work well with most features in their raw form and, unlike other methods, are not dependent on (approximately) uniform scaling or variances.
Individual decision trees are rarely used as a forecasting model in practice, as they often do not generalize very well, i.e. they often do not achieve very good results on unseen data.
Instead, so-called random forests are often used: an ensemble of many trees, each trained on a random sample of the data set, whose predictions are averaged with equal weight.
Further information on this can be found in the [Scikit-Learn User Guide](https://scikit-learn.org/stable/modules/tree.html#) and the [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).
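
The equal contribution of the individual trees can be checked directly. The following sketch uses synthetic data (`X`, `y` and all parameter values are assumptions for illustration) and compares the forest prediction with the plain average over its trees, which are available via the fitted `estimators_` attribute:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, made up for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Predict with every single tree, then average with equal weights
tree_predictions = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

forest_prediction = forest.predict(X[:5])
print(np.allclose(manual_average, forest_prediction))
```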

We train and evaluate all four model types on the prepared data and then select the best model.

In [None]:
target_column = "energy_joule"
feature_columns = dataframe.columns.drop(target_column)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

mse_dict = {}

##### 1) Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

model_lin_reg = LinearRegression()
model_lin_reg.fit(train_df[feature_columns], train_df[target_column])

predictions = model_lin_reg.predict(test_df[feature_columns])

lin_reg_mse = mean_squared_error(test_df[target_column], predictions)

mse_dict["Linear Regression"] = lin_reg_mse
print(f"MSE: {lin_reg_mse}")

##### 2) Polynomial Regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures

preprocessor_pol_reg = PolynomialFeatures(degree=2)
train_features_poly = preprocessor_pol_reg.fit_transform(train_df[feature_columns])
model_pol_reg = LinearRegression()
model_pol_reg.fit(train_features_poly, train_df[target_column])

test_features_poly = preprocessor_pol_reg.transform(test_df[feature_columns])
predictions = model_pol_reg.predict(test_features_poly)
pol_reg_mse = mean_squared_error(test_df[target_column], predictions)

mse_dict["Polynomial Regression"] = pol_reg_mse
print(f"MSE: {pol_reg_mse}")

##### 3) Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

model_tree = DecisionTreeRegressor()
model_tree.fit(train_df[feature_columns], train_df[target_column])

predictions = model_tree.predict(test_df[feature_columns])
tree_mse = mean_squared_error(test_df[target_column], predictions)

mse_dict["Decision Tree"] = tree_mse
print(f"MSE: {tree_mse}")

##### 4) Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

model_forest = RandomForestRegressor(n_estimators=100)
model_forest.fit(train_df[feature_columns], train_df[target_column])

predictions = model_forest.predict(test_df[feature_columns])
forest_mse = mean_squared_error(test_df[target_column], predictions)

mse_dict["Random Forest"] = forest_mse
print(f"MSE: {forest_mse}")

In [None]:
for key, value in mse_dict.items():
    print(f"{key:25s}\t{value:>8.3f}")

The best results (lowest MSE) were achieved with the polynomial regression model. This is recommended for further use.

## 6. Evaluation and encapsulation of the results
#### Creation of a function to evaluate an unseen data set

We now implement a function that executes and evaluates the model on unseen datasets in the identical format.
The best results were obtained with polynomial regression, so this model is used in the function.

The function is then executed on the existing test dataset `blueprint_data_assessment.csv`, and the results are displayed.

First, the data pre-processing steps carried out above are summarized in a function:

In [None]:
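def preprocess_data(df):
    # Apply the same steps as for the training data:
    # drop rows with missing values, remove irrelevant columns, encode the robot type
    df = df.dropna()
    df = df.drop(columns=["location_id", "commissioner"])
    df["robot"] = df["robot"].map(convert_string_robot_to_numeric)

    return df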

We then write a function to execute the previously trained model on a preprocessed data set:

In [None]:
def run_model(df):
    features = preprocessor_pol_reg.transform(df[feature_columns])
    predictions = model_pol_reg.predict(features)
    return predictions

We now combine the reading, preprocessing, prediction and calculation of the error in a single function:

In [None]:
def run_and_evaluate_on_dataset(dataset_path):
    dataset_df = pd.read_csv(dataset_path)
    dataset_proc = preprocess_data(dataset_df)
    dataset_pred = run_model(dataset_proc)
    
    error_mse = mean_squared_error(dataset_proc[target_column], dataset_pred)
    error_mape = mean_absolute_percentage_error(dataset_proc[target_column], dataset_pred)
    
    return error_mse, error_mape

We perform the above steps on the test data set using a single function call and output the results to the console:

In [None]:
file = 'blueprint_data_assessment.csv'
error_mse, error_mape = run_and_evaluate_on_dataset(file)
print(f"Evaluation results for the data in {file}")
print(f"{'Error MSE':30s} {error_mse:>10.3f}")
print(f"{'Error MAPE':30s} {100*error_mape:>10.2f} %")

A mean absolute percentage error (MAPE) of 5.17 % in the prediction of energy consumption is achieved on the assessment data set. A solid result!
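
For readers unfamiliar with the metric: the percentage error is the mean of the absolute errors relative to the true values. The sketch below recomputes it by hand on made-up numbers (`y_true` and `y_pred` are illustrative assumptions, not project results) and compares it with scikit-learn's `mean_absolute_percentage_error`, which returns a fraction rather than a percentage:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up true values and predictions
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

# Relative errors are 10 %, 5 % and 0 %, so the mean is 5 %
manual_mape = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
sklearn_mape = mean_absolute_percentage_error(y_true, y_pred)

print(f"{100 * manual_mape:.2f} %")  # prints "5.00 %"
```

This is why the notebook multiplies `error_mape` by 100 before printing it.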

### A message from Ananya:

We are totally excited to see what the long-promised dataset with the records of the productive system will look like!
Hopefully our approaches and models can also be used for this new data.

Unfortunately, we will no longer be on the project team to analyze the new dataset ourselves.
Whoever takes over the project, we hope that the analyses and experiments we have already conducted will help our successors to solve this exciting problem.


All the best and good luck!

Ananya W.