<div class='bar_title'></div>

*Enterprise AI*

# Assignment 3 - Hyperparameter Optimization

Gunther Gust / Justus Ameling<br>
Chair of Enterprise AI

Summer Semester 2024

<img src="https://github.com/GuntherGust/tds2_data/blob/main/images/d3.png?raw=true" style="width:20%; float:left;" />

## Introduction

In the previous assignment, we built our first pipeline using ZenML. This time, our data engineering team provides us with a clean, preprocessed dataset, which we will use to train a model and perform hyperparameter tuning. The project structure also differs significantly from the previous assignment. Last time, we created the complete pipeline in a single Jupyter notebook. While this is the simplest option, larger pipelines or ML systems quickly become messy when everything lives in one file. This time, we therefore split the code into multiple Python files. Let us first try to understand the new structure:<br>
<img src="./images/Bildschirmfoto 2024-05-08 um 14.21.00.png">


Let us start with the root folder, which contains five files.
- **.gitignore**: This file lists the files and folders that should not be committed to our GitHub repository.
- **README.md**: This is a simple markdown file and is, by default, the first file that is shown in a GitHub repository.
- **requirements.txt**: This file lists all packages (such as scikit-learn) that we want to install, together with the version of each package that should be used.
- **run.py**: This is our first Python file. It starts our pipeline and can be executed by running `python run.py` in a terminal or `!python run.py` in a notebook cell.
- **main.ipynb**: The current notebook you are working in.

<img src="./images/pipeline_folder.png"><br>
Next, let us look at the pipeline folder. It is our first <a href="https://www.geeksforgeeks.org/python-packages/">package</a> and contains two files:
- **\_\_init\_\_.py**: This is a special file in Python that marks the folder as a package. It can be empty or, as in our case, contain some imports. Names imported in this file can be imported directly from the package in other files.
- **pipeline.py**: This file contains the definition of our pipeline. It is the heart of our pipeline and includes all steps that are necessary to train our model.
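The role of the two files can be illustrated with a minimal sketch. It builds a throwaway package in a temporary folder and imports from it. The package name `demo_pipelines` and the stand-in function body are assumptions made for illustration; the real `pipeline.py` defines a ZenML pipeline instead.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk that mirrors the layout described above.
# "demo_pipelines" is a hypothetical name, chosen so it cannot clash with the real folder.
root = Path(tempfile.mkdtemp())
package_dir = root / "demo_pipelines"
package_dir.mkdir()

# pipeline.py holds the definition (here a plain stand-in function, not a ZenML pipeline).
(package_dir / "pipeline.py").write_text(
    "def training_pipeline():\n"
    "    return 'pipeline finished'\n"
)

# __init__.py re-exports the function so callers can import it from the package itself.
(package_dir / "__init__.py").write_text(
    "from .pipeline import training_pipeline\n"
)

# Make the temporary folder importable, then import the package by name.
sys.path.insert(0, str(root))
demo_pipelines = importlib.import_module("demo_pipelines")

result = demo_pipelines.training_pipeline()
print(result)  # prints: pipeline finished
```

Without the import line in `__init__.py`, the function would only be reachable through the longer path `demo_pipelines.pipeline.training_pipeline`.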

<img src="./images/steps.png"><br>
In our training pipeline, we need to import the steps. They are all organized in the steps folder, which is also a package. Again, you can identify the package by its \_\_init\_\_.py file. The steps folder contains multiple files (modules):

- **evaluate_model.py**: This file includes the evaluation step.
- **hp_tuning.py**: This file includes the hyperparameter tuning step.
- **loading_data.py**: This file includes the loading data step.
- **model_trainer.py**: This file includes the model training step.
- **split_data.py**: This file includes the split data step.
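To make the division of labour between these steps concrete, here is a minimal plain-Python sketch of how such steps could be chained. It is not the assignment's ZenML code: the toy data, the one-feature k-nearest-neighbours model, the candidate values for `k`, and the order of the steps are all assumptions made for illustration. Only the function names mirror the module names above.

```python
import random


def loading_data(n=300, seed=0):
    """Create toy (humidity, rain_tomorrow) rows; stands in for reading the CSV."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        humidity = rng.random()
        label = int(humidity > 0.7)
        if rng.random() < 0.1:  # flip 10% of the labels to simulate noise
            label = 1 - label
        rows.append((humidity, label))
    return rows


def split_data(rows, test_share=0.25):
    """Hold out the last quarter of the rows."""
    cut = int(len(rows) * (1 - test_share))
    return rows[:cut], rows[cut:]


def model_trainer(train_rows, k):
    """'Train' a k-nearest-neighbours classifier and return its predict function."""
    def predict(x):
        nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return int(votes * 2 > k)  # majority vote among the k neighbours
    return predict


def evaluate_model(predict, rows):
    """Accuracy: share of rows whose prediction matches the label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def hp_tuning(train_rows, candidates):
    """Grid search: score every candidate k on a validation split, keep the best."""
    fit_rows, val_rows = split_data(train_rows)
    scores = {k: evaluate_model(model_trainer(fit_rows, k), val_rows) for k in candidates}
    return max(scores, key=scores.get), scores


def training_pipeline():
    rows = loading_data()
    train_rows, test_rows = split_data(rows)
    best_k, _ = hp_tuning(train_rows, candidates=[1, 5, 15])
    model = model_trainer(train_rows, best_k)
    return best_k, evaluate_model(model, test_rows)


best_k, test_accuracy = training_pipeline()
print(f"best k: {best_k}, test accuracy: {test_accuracy:.2f}")
```

The key design point is that `hp_tuning` only ever sees the training rows. The test rows are touched once, at the very end, so the reported accuracy is not biased by the search.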


## Task

### Examine the provided data set
First, let us get to know the provided dataset by solving a few small tasks with the `pandas` library.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)

In [4]:
# Load the data from the file `Weather_Perth_transformed.csv`
data = pd.read_csv('./data/Weather_Perth_transformed.csv', index_col="Date")

In [5]:
# Display the first few rows of the data
data.head()

Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,WindGustDir_E,WindGustDir_ENE,WindGustDir_ESE,WindGustDir_N,WindGustDir_NE,WindGustDir_NNE,WindGustDir_NNW,WindGustDir_NW,WindGustDir_S,WindGustDir_SE,WindGustDir_SSE,WindGustDir_SSW,WindGustDir_SW,WindGustDir_W,WindGustDir_WNW,WindGustDir_WSW,WindDir9am_E,WindDir9am_ENE,WindDir9am_ESE,WindDir9am_N,WindDir9am_NE,WindDir9am_NNE,WindDir9am_NNW,WindDir9am_NW,WindDir9am_S,WindDir9am_SE,WindDir9am_SSE,WindDir9am_SSW,WindDir9am_SW,WindDir9am_W,WindDir9am_WNW,WindDir9am_WSW,WindDir3pm_E,WindDir3pm_ENE,WindDir3pm_ESE,WindDir3pm_N,WindDir3pm_NE,WindDir3pm_NNE,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW,RainTomorrow
2008-07-01,0.108911,0.189873,0.0,0.047059,0.654676,0.1,0.0,0.225806,0.976744,0.516484,0.737089,0.768868,0.25,0.375,0.086826,0.258359,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2008-07-02,0.231023,0.25,0.0,0.105882,0.503597,0.128571,0.2,0.290323,0.77907,0.362637,0.65493,0.639151,0.0,0.75,0.164671,0.306991,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2008-07-03,0.234323,0.224684,0.007018,0.129412,0.52518,0.257143,0.0,0.129032,0.825581,0.714286,0.483568,0.558962,0.125,0.375,0.194611,0.246201,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1
2008-07-04,0.333333,0.202532,0.031579,0.070588,0.338129,0.185714,0.366667,0.193548,0.930233,0.736264,0.542254,0.625,0.75,0.75,0.227545,0.246201,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2008-07-05,0.333333,0.113924,0.031579,0.082353,0.352518,0.442857,0.433333,0.548387,0.651163,0.56044,0.568075,0.712264,0.875,0.625,0.308383,0.194529,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1


In [6]:
# The mean of a 0/1 column is the share of ones, i.e. the share of days followed by rain
print(f"It rained the next day on {data['RainTomorrow'].mean():.2%} of the days.")

It rained the next day on 20.20% of the days.
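This share matters when interpreting accuracy later on, because the two classes are imbalanced. As a rough reference point, the following sketch computes the accuracy of a trivial model that always predicts "no rain". It assumes that the test split has the same rain share as the full dataset, which need not hold exactly.

```python
# Share of days followed by rain, as printed above (rounded to two decimals).
rain_share = 0.2020

# A classifier that always predicts "no rain" is correct on every dry day.
baseline_accuracy = 1 - rain_share

print(f"Always predicting 'no rain' would reach about {baseline_accuracy:.2%} accuracy.")
# prints: Always predicting 'no rain' would reach about 79.80% accuracy.
```

A tuned model should therefore be judged against roughly 80%, not against 50%.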


### Fix the pipeline
Before you can execute the pipeline, you need to fix it and add some code to the pipeline steps. Look at the following Python files and follow the instructions:
- **loading_data.py**
- **hp_tuning.py**
- **evaluate_model.py**

In [7]:
# Execute your Pipeline and train your model
from pipelines import training_pipeline
training_pipeline()

Initiating a new run for the pipeline: training_pipeline.
Registered new version: (version 110).
Executing a new run.
Using user: default
Using stack: default
  artifact_store: default
  orchestrator: default
You can visualize your pipeline runs in the ZenML Dashboard. In order to try it locally, please run zenml up.
Caching disabled explicitly for loading_data.
Step loading_data has started.
By default, the PandasMaterializer stores data as a .csv file. If you want to store data more efficiently, you can install pyarrow by running 'pip install pyarrow'. This will allow PandasMaterializer to automati

PipelineRunResponse(body=PipelineRunResponseBody(created=datetime.datetime(2024, 5, 27, 21, 39, 4, 117916), updated=datetime.datetime(2024, 5, 27, 21, 39, 8, 340423), user=UserResponse(body=UserResponseBody(created=datetime.datetime(2024, 4, 30, 8, 6, 25, 461523), updated=datetime.datetime(2024, 5, 6, 19, 44, 2, 646066), active=True, activation_token=None, full_name='', email_opted_in=False, is_service_account=False, is_admin=True), metadata=None, resources=None, id=UUID('b69b3745-b086-42d8-9608-735d4e7a646f'), permission_denied=False, name='default'), status=<ExecutionStatus.COMPLETED: 'completed'>, stack=StackResponse(body=StackResponseBody(created=datetime.datetime(2024, 4, 30, 8, 6, 25, 204420), updated=datetime.datetime(2024, 4, 30, 8, 6, 25, 204421), user=None), metadata=None, resources=None, id=UUID('4783e71a-5396-4e6f-ba3e-307ba02e47c5'), permission_denied=False, name='default'), pipeline=PipelineResponse(body=PipelineResponseBody(created=datetime.datetime(2024, 5, 27, 21, 39, 

In [9]:
# Retrieve the artifact ("Accuracy") from the ZenML repository using the Client and print the value
from zenml.client import Client
client = Client()
artifact = client.get_artifact_version("Accuracy")
artifact.load()

0.8716744913928013