# IDS Challenge
## 02 Data Science: Instructions and notes

Your task in this sub-project is to design and train a machine learning model for the
**prediction of the energy consumption** of the robots during the execution of orders (e.g. picking up a sample from a machine).

For this purpose, a **data set** is available that contains actual measured energy consumption of orders that have already been carried out.
You are to analyze, (pre-)process and use this data set to create a forecast model for future orders.
The use of appropriate Python libraries such as *pandas* and *scikit-learn* is recommended.

Don't worry, you don't have to start from scratch to solve the task. 
On the previous team, Ananya was responsible for data science.
She had already started analyzing data and experimenting with different models and model configurations on a slightly different data set.
From this source, some **code snippets** are available in the notebook `blueprint_EN.ipynb`, which you can reuse and develop further.

Implement your solutions in the existing Jupyter notebook `submission_tp2_data_science.ipynb`. 
Then upload the Jupyter Notebook with your solutions to ILIAS.

### Repository overview

Here you will find an overview of the files for this sub-project.

```bash
├───prediction_blueprint
│   ├───blueprint_EN.ipynb
│   ├───blueprint_data_assessment.csv
│   └───blueprint_data_train.csv
├───production_dataset
│   └───robot_energy_data_train.csv
├───submission_tp2_data_science.ipynb
└───README_EN.ipynb
```

### Dataset

The data set relevant for your forecast model can be found at
`production_dataset/robot_energy_data_train.csv`.

This contains data on completed orders and the energy consumption measured.
The aim is therefore to develop a machine learning model that can predict energy consumption as accurately as possible based on given job characteristics.

The data set has a number of columns that need to be critically examined.
Not all columns are necessarily relevant for energy consumption during order execution.
Furthermore, the data in some columns may not be usable in its existing form
and must first be suitably pre-processed (transformed).
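A first critical look at the columns can be sketched as follows. The helper `summarize_columns` and the tiny sample table are made up for illustration; only the column names are taken from the documentation in this README.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mimics some of the documented columns.
# The values are illustrative only and are NOT taken from the real data set.
sample = pd.DataFrame({
    "Robot": ["R1", "R2", "R1", "R2", "R1", "R2"],
    "Distance [m]": [12.0, 30.5, 8.2, np.nan, 21.0, 15.3],
    "Levels": [0, 1, -1, 0, 2, 0],
    "Cargo [kg]": [1.5, 0.0, 2.0, 3.5, 0.5, 1.0],
    "Battery Error": [False, False, True, False, False, False],
    "Energy [kJ]": [4.1, 11.8, 3.9, 9.7, 8.8, 5.2],
})


def summarize_columns(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Return dtype, missing-value count and correlation with the target per column."""
    # Only numeric and boolean columns can be correlated with the target.
    numeric = df.select_dtypes(include=["number", "bool"]).astype(float)
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })
    # Non-numeric columns (e.g. "Robot") get NaN here because of index alignment.
    summary["corr_with_target"] = numeric.corr()[target]
    return summary


overview = summarize_columns(sample, target="Energy [kJ]")
print(overview)
```

For the real task, `sample` would be replaced by the result of `pd.read_csv("production_dataset/robot_energy_data_train.csv")`, assuming the file uses the default comma separator. A linear correlation is only a first hint; a column with low correlation can still matter in a non-linear model.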

Minimal documentation on the columns of the data set is available:

| Column name | Data type | Description |
|---|---|---|
| **Robot** | str | The name of the robot that performed the task. |
| **Distance [m]** | float | Distance to be covered in meters. |
| **Levels** | int | Floor difference between start and destination point (with sign). |
| **Cargo [kg]** | float | Weight of additional load of the robot during order execution in kg. |
| **Elevation [m]** | int | Difference in height between start and destination point in meters. |
| **Battery Level** | float | Battery level of the robot at the start of job execution in %. According to the robot manufacturer, battery levels below 10% may result in increased energy consumption. |
| **Time of Day [h]** | float | Time of day of job execution, specifically: time in hours since midnight (at the start of the job). May be relevant, as between 9:00 a.m. and 5:00 p.m. there is increased work activity by the logistics employees and the movement of the robot can be influenced by necessary evasive maneuvers. |
| **Battery Error** | bool | Indicates whether there was a battery problem during order execution (e.g., complete emptying). |
| ***Energy [kJ]*** | float | Energy consumed in kilojoules during job execution. Prediction target. |   


Take a close look at the data and, based on the description, statistics, visualizations and understanding of the problem, decide 
which of the columns should be included in the forecast model and in what form.

In some cases, it may make sense not to feed a column into the model in its raw form, but to pre-process it first.
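As one possible illustration of such pre-processing, the hints in the column table (battery levels below 10 %, increased activity between 9:00 a.m. and 5:00 p.m.) could be turned into binary features. This is a sketch, not the required solution: the feature names `low_battery` and `working_hours` are hypothetical, and it assumes that `Battery Level` is stored on a 0 to 100 scale.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two binary features derived from the column hints."""
    out = df.copy()
    # Manufacturer hint: battery levels below 10 % may increase energy consumption.
    out["low_battery"] = (out["Battery Level"] < 10).astype(int)
    # Hint: increased staff activity between 9:00 and 17:00 (both bounds included here).
    out["working_hours"] = out["Time of Day [h]"].between(9, 17).astype(int)
    return out


# Made-up example rows, not taken from the real data set.
example = pd.DataFrame({
    "Battery Level": [5.0, 55.0, 9.9, 80.0],
    "Time of Day [h]": [8.5, 9.0, 13.25, 17.5],
})
features = add_engineered_features(example)
print(features)
```

Whether such flags actually improve the forecast has to be checked on the data; tree-based models can often learn these thresholds on their own, while linear models usually cannot.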

You may also find a few helpful hints in the blueprint notebook (description follows).

Your company's CTO has also let you know that part of the data set has been split off in advance.
He will keep this split-off "assessment" part of the data under lock and key and use it at the end of the project
to check the quality of your forecast model.

### Data science task and blueprint

As already described, the central task of this sub-project is to **create a prediction model to predict the energy consumption** of the robots for individual orders.

The Jupyter notebook under `prediction_blueprint/blueprint_EN.ipynb` contains the work on energy consumption prediction started by Ananya from the previous team.
At the time of creating this blueprint notebook, the existing dataset with production data on robot energy consumption was not yet available.
Therefore, a similar but different data set with partially different characteristics is considered in the blueprint notebook, 
which is located in the same folder as the notebook (`blueprint_data_{train/assessment}.csv`).

The steps to be carried out when processing the task,
which are also carried out in the blueprint notebook, are as follows:

1. Data analysis
2. Data cleansing
3. Feature engineering
4. Dataset splitting
5. Model training and testing
6. Evaluation and encapsulation of the results
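
Steps 3 to 5 can be sketched compactly with a scikit-learn `Pipeline`. The data below is synthetic and the energy formula is invented purely so that the example runs on its own; the choice of columns and of `LinearRegression` is an assumption for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (NOT the real data set); the energy formula is invented.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "Robot": rng.choice(["R1", "R2"], size=n),
    "Distance [m]": rng.uniform(1, 50, size=n),
    "Cargo [kg]": rng.uniform(0, 5, size=n),
})
data["Energy [kJ]"] = (
    0.3 * data["Distance [m]"] + 0.5 * data["Cargo [kg]"] + rng.normal(0, 0.2, size=n)
)

X = data.drop(columns="Energy [kJ]")
y = data["Energy [kJ]"]

# Step 4: hold back 20 % of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 3: one-hot encode the robot name, pass the numeric columns through unchanged.
preprocess = ColumnTransformer(
    [("robot", OneHotEncoder(handle_unknown="ignore"), ["Robot"])],
    remainder="passthrough",
)

# Step 5: fit on the training part only, then score on the held-out part.
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"R^2 on held-out data: {test_r2:.3f}")
```

Bundling pre-processing and model in one pipeline ensures that exactly the same transformations are applied at training and at prediction time.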

You are welcome to use this blueprint as a guide when implementing your forecast model.
However, please note that the data set used in the blueprint differs in some respects from the current data set.
Check carefully which parts of the code can be adopted and where adjustments or additional steps are necessary.

*Note:* The blueprint notebook does not necessarily contain all the experiments carried out by the former project team.
Ananya tried many different variants and experimented a lot with the data/features before finding a variant that worked well.
So don't be afraid to experiment a lot - a suitable data science solution is rarely found at the first attempt!


### Requirements due to the new data basis

As already described, the example model configuration in the blueprint was developed on an older, preliminary data set.

The more up-to-date data set now available contains the energy consumption of two different robot types (R1, R2).
Both robots are fundamentally different in design and have different characteristics (size & shape, movement speeds, maximum additional load,...).
It can therefore be assumed that the parameters of the machine learning models used may have to be selected very differently depending on the robot type.
It is therefore imperative to split the data set into two partial data sets and to train and evaluate separate models for the two robot types.
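
A minimal sketch of such a split with pandas is shown below. The helper name `split_by_robot` and the example rows are made up, and the sketch assumes that the `Robot` column contains exactly the labels `R1` and `R2`; if the column holds individual robot names instead, a mapping to the robot type would be needed first.

```python
import pandas as pd

# Made-up example rows, not taken from the real data set.
orders = pd.DataFrame({
    "Robot": ["R1", "R2", "R1", "R2", "R1"],
    "Distance [m]": [12.0, 30.5, 8.2, 17.4, 21.0],
    "Energy [kJ]": [4.1, 11.8, 3.9, 9.7, 8.8],
})


def split_by_robot(df: pd.DataFrame) -> dict:
    """Return one sub-DataFrame per robot label, without the now-constant Robot column."""
    return {
        robot: group.drop(columns="Robot").reset_index(drop=True)
        for robot, group in df.groupby("Robot")
    }


subsets = split_by_robot(orders)
print({name: len(part) for name, part in subsets.items()})
```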

At least **three different model types** must be trained, including at least one parametric model (e.g. linear regression) and one tree-based model (e.g. decision tree).

Each model type should be trained at least three times, once on each of the following **data bases**:
* On the overall data set
* On the data for robot type *R1*
* On the data for robot type *R2*

*Note:* To avoid redundancy and keep your code and notebook organized, use functions!
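
One way to organize this with functions is sketched below: a single helper trains and scores any model on any data base, and a nested loop fills a result table. Everything here is an assumption for illustration: the synthetic data with an invented per-robot behaviour, the feature selection, the three model types and the use of MAE as the metric.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the real data set).
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "Robot": rng.choice(["R1", "R2"], size=n),
    "Distance [m]": rng.uniform(1, 50, size=n),
    "Cargo [kg]": rng.uniform(0, 5, size=n),
})
slope = np.where(df["Robot"] == "R1", 0.2, 0.4)  # invented per-robot behaviour
df["Energy [kJ]"] = (
    slope * df["Distance [m]"] + 0.5 * df["Cargo [kg]"] + rng.normal(0, 0.2, size=n)
)

FEATURES = ["Distance [m]", "Cargo [kg]"]  # hypothetical feature selection
TARGET = "Energy [kJ]"


def train_and_score(model, data: pd.DataFrame) -> float:
    """Fit model on 80 % of data and return the MAE on the remaining 20 %."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[FEATURES], data[TARGET], test_size=0.2, random_state=0
    )
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


# Factories create a fresh, unfitted model for every data base.
model_factories = {
    "linear": lambda: LinearRegression(),
    "tree": lambda: DecisionTreeRegressor(max_depth=5, random_state=0),
    "forest": lambda: RandomForestRegressor(n_estimators=50, random_state=0),
}
data_bases = {
    "all": df,
    "R1": df[df["Robot"] == "R1"],
    "R2": df[df["Robot"] == "R2"],
}

# Rows: model types, columns: data bases, cells: MAE on held-out data.
results = pd.DataFrame({
    base: {name: train_and_score(make(), part) for name, make in model_factories.items()}
    for base, part in data_bases.items()
})
print(results.round(3))
```

Adding a further model type or data base then only requires one more dictionary entry, not another copy of the training code.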

Specify **appropriate error metrics** for all trained models.
Keep in mind that such error metrics are only meaningful if they are calculated on data that was not used in the model training.
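
The following sketch illustrates why this matters. An unrestricted decision tree is fitted on synthetic data (invented relation plus noise, not the real data set) and its mean absolute error is computed once on the training data and once on held-out data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one feature ("distance") and a noisy, invented linear relation.
rng = np.random.default_rng(2)
X = rng.uniform(0, 50, size=(200, 1))
y = 0.3 * X[:, 0] + rng.normal(0, 1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Without a depth limit the tree can memorize every training sample.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

mae_train = mean_absolute_error(y_train, tree.predict(X_train))
mae_test = mean_absolute_error(y_test, tree.predict(X_test))
print(f"MAE on training data: {mae_train:.3f}")
print(f"MAE on held-out data: {mae_test:.3f}")
```

The error on the training data is practically zero because the tree has memorized the samples, including their noise. Only the error on the held-out data says something about the quality of predictions for future orders.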

**Decide on one of the models that you want to propose for production-use** - justify your decision!
Finally, the selected model should be evaluated on the **assessment dataset** that has been withheld so far.
Specify the quantitative results (meaningful error metrics).
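
Such a final evaluation can be encapsulated in a single function that takes a fitted model and a previously unseen table. The sketch below is an assumption about how this could look: the function name `evaluate_on_unseen`, the helper `make_fake_orders`, the feature selection and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

FEATURES = ["Distance [m]", "Cargo [kg]"]  # hypothetical feature selection
TARGET = "Energy [kJ]"


def evaluate_on_unseen(model, unseen: pd.DataFrame) -> dict:
    """Compute MAE, RMSE and R^2 of a fitted model on a DataFrame it has never seen."""
    y_true = unseen[TARGET]
    y_pred = model.predict(unseen[FEATURES])
    return {
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": float(r2_score(y_true, y_pred)),
    }


def make_fake_orders(n: int, seed: int) -> pd.DataFrame:
    """Create synthetic orders with an invented energy formula (NOT the real data)."""
    rng = np.random.default_rng(seed)
    orders = pd.DataFrame({
        "Distance [m]": rng.uniform(1, 50, size=n),
        "Cargo [kg]": rng.uniform(0, 5, size=n),
    })
    orders[TARGET] = (
        0.3 * orders["Distance [m]"] + 0.5 * orders["Cargo [kg]"]
        + rng.normal(0, 0.2, size=n)
    )
    return orders


train = make_fake_orders(200, seed=3)
assessment = make_fake_orders(50, seed=4)  # stands in for the withheld data

model = LinearRegression().fit(train[FEATURES], train[TARGET])
metrics = evaluate_on_unseen(model, assessment)
print(metrics)
```

Any cleansing or feature engineering applied to the training data must be applied in exactly the same way to the unseen data before prediction, which is easiest if these steps are part of the function or of a scikit-learn pipeline.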


### Acceptance criteria

* Jupyter notebook `submission_tp2_data_science.ipynb` with presentation of the solution path (data analysis, data preprocessing, model training, evaluation)
* Visualization of the most important data correlations (with *seaborn* or *matplotlib*)
* Model training and evaluation for the various sub-data sets and the entire data set (see above). Suitable error metrics must be calculated and specified for all cases.
* Proposal of a scikit-learn prediction model that meets the CEO's performance requirements (function for evaluating an unseen data set!)
* Comprehensible justification of the data processing performed and the selected model configuration (written as text in the Jupyter notebook)
* Appealing presentation of the procedure and the results in the lecture (10 minutes in total)