# Hydraulic System Condition Prediction

## Preparing the tools

We're going to use pandas, Matplotlib, seaborn and NumPy for data analysis and manipulation, and scikit-learn for modelling and evaluation.

In [25]:
# Import all the tools we need

# Regular EDA (Exploratory Data Analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# we want our plots to appear inside the notebook
%matplotlib inline

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
# for sklearn version 1.2+, plot_roc_curve is replaced with RocCurveDisplay
from sklearn.metrics import RocCurveDisplay

pd.set_option("display.max_colwidth", 1)


## Load The Data

In [38]:
# # Import our dataset from Kaggle. Note: comment the line below after the dataset has been downloaded
# !kaggle datasets download -d jjacostupa/condition-monitoring-of-hydraulic-systems

In [3]:
# Extract the dataset files into the data folder
import zipfile

with zipfile.ZipFile("condition-monitoring-of-hydraulic-systems.zip") as f:
    f.extractall(path="data/")

## 1. Problem Definition

The data set addresses the condition assessment of a hydraulic test rig based on multi-sensor data. Four fault types are superimposed with several severity grades, impeding selective quantification.

## 2. Data

Original data source: https://archive.ics.uci.edu/dataset/447/condition+monitoring+of+hydraulic+systems

The data set was experimentally obtained with a hydraulic test rig. This test rig consists of a primary working circuit and a secondary cooling-filtration circuit, which are connected via the oil tank [1], [2].

The system cyclically repeats constant load cycles (duration 60 seconds) and measures process values such as pressures, volume flows and temperatures while the condition of four hydraulic components (cooler, valve, pump and accumulator) is quantitatively varied.

## 3. Evaluation

If we can reach > 95% accuracy at predicting the hydraulic system conditions during the proof of concept, we'll pursue the project.
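As a reminder of what this target measures, accuracy is the fraction of cycles whose predicted condition matches the annotated one. The sketch below uses made-up labels for ten cycles; scikit-learn's `accuracy_score` computes the same quantity.

```python
# Hypothetical cooler-condition labels for 10 cycles (possible values: 3, 20, 100).
y_true = [3, 3, 20, 20, 20, 100, 100, 100, 100, 100]
y_pred = [3, 3, 20, 20, 100, 100, 100, 100, 100, 100]

# Accuracy = number of correct predictions / total number of predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# The proof-of-concept goal from this section: strictly more than 95%.
meets_target = accuracy > 0.95

print(f"accuracy: {accuracy:.2f}, meets > 95% target: {meets_target}")
```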

## 4. Features

The data set contains raw process sensor data (i.e. without feature extraction) which are structured as matrices (tab-delimited) with the rows representing the cycles and the columns the data points within a cycle.
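To make this layout concrete, here is a minimal sketch that parses a tiny tab-delimited sample with pandas. The values and the names `sample_text` and `sample_df` are invented for illustration; only the convention (rows are cycles, columns are data points within a cycle) comes from the description above.

```python
import io

import pandas as pd

# Hypothetical miniature sensor file: 3 cycles (rows) x 4 data points per cycle (columns).
sample_text = "1.0\t1.1\t1.2\t1.3\n2.0\t2.1\t2.2\t2.3\n3.0\t3.1\t3.2\t3.3\n"

# The files have no header row, so header=None keeps the first cycle as data.
sample_df = pd.read_csv(io.StringIO(sample_text), sep="\t", header=None)

n_cycles, n_points = sample_df.shape
print(f"cycles: {n_cycles}, data points per cycle: {n_points}")

# Row i holds every reading recorded during cycle i.
first_cycle = sample_df.iloc[0].tolist()
print(first_cycle)
```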

The sensors involved are:

In [14]:
data = [
    ("PS1", "Pressure", "bar", "100 Hz"),
    ("PS2", "Pressure", "bar", "100 Hz"),
    ("PS3", "Pressure", "bar", "100 Hz"),
    ("PS4", "Pressure", "bar", "100 Hz"),
    ("PS5", "Pressure", "bar", "100 Hz"),
    ("PS6", "Pressure", "bar", "100 Hz"),
    ("EPS1", "Motor power", "W", "100 Hz"),
    ("FS1", "Volume flow", "l/min", "10 Hz"),
    ("FS2", "Volume flow", "l/min", "10 Hz"),
    ("TS1", "Temperature", "°C", "1 Hz"),
    ("TS2", "Temperature", "°C", "1 Hz"),
    ("TS3", "Temperature", "°C", "1 Hz"),
    ("TS4", "Temperature", "°C", "1 Hz"),
    ("VS1", "Vibration", "mm/s", "1 Hz"),
    ("CE", "Cooling efficiency (virtual)", "%", "1 Hz"),
    ("CP", "Cooling power (virtual)", "kW", "1 Hz"),
    ("SE", "Efficiency factor", "%", "1 Hz"),
]

sensors_df = pd.DataFrame(data, columns=["Sensor", "Physical Quantity", "Unit", "Sampling Rate"])

sensors_df

| | Sensor | Physical Quantity | Unit | Sampling Rate |
|---|---|---|---|---|
| 0 | PS1 | Pressure | bar | 100 Hz |
| 1 | PS2 | Pressure | bar | 100 Hz |
| 2 | PS3 | Pressure | bar | 100 Hz |
| 3 | PS4 | Pressure | bar | 100 Hz |
| 4 | PS5 | Pressure | bar | 100 Hz |
| 5 | PS6 | Pressure | bar | 100 Hz |
| 6 | EPS1 | Motor power | W | 100 Hz |
| 7 | FS1 | Volume flow | l/min | 10 Hz |
| 8 | FS2 | Volume flow | l/min | 10 Hz |
| 9 | TS1 | Temperature | °C | 1 Hz |


**Here are the labels/targets we will predict:**

The target condition values are annotated cycle-wise in profile.txt (tab-delimited). As before, the row number represents the cycle number. The columns are:

1. Cooler condition / %:
	* 3: close to total failure
	* 20: reduced efficiency
	* 100: full efficiency

2. Valve condition / %:
	* 100: optimal switching behavior
	* 90: small lag
	* 80: severe lag
	* 73: close to total failure

3. Internal pump leakage:
	* 0: no leakage
	* 1: weak leakage
	* 2: severe leakage

4. Hydraulic accumulator / bar:
	* 130: optimal pressure
	* 115: slightly reduced pressure
	* 100: severely reduced pressure
	* 90: close to total failure

5. Stable flag:
	* 0: conditions were stable
	* 1: static conditions might not have been reached yet
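The coded values above can be translated into readable labels when inspecting the targets. A minimal sketch, assuming the column names used for the target dataset in this notebook; the dictionaries `COOLER_LABELS` and `PUMP_LEAKAGE_LABELS` and the three example rows are hypothetical and only illustrate the mapping.

```python
import pandas as pd

# Code-to-description mappings copied from the list above.
COOLER_LABELS = {
    3: "close to total failure",
    20: "reduced efficiency",
    100: "full efficiency",
}
PUMP_LEAKAGE_LABELS = {0: "no leakage", 1: "weak leakage", 2: "severe leakage"}

# Made-up rows for illustration only (not taken from profile.txt).
example_targets = pd.DataFrame({
    "Cooler Condition (%)": [3, 20, 100],
    "Internal Pump Leakage": [0, 2, 1],
})

# Series.map replaces each code with its description.
readable = pd.DataFrame({
    "Cooler": example_targets["Cooler Condition (%)"].map(COOLER_LABELS),
    "Pump": example_targets["Internal Pump Leakage"].map(PUMP_LEAKAGE_LABELS),
})
print(readable)
```

The same pattern extends to the valve and accumulator columns by adding one dictionary per column.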

Since the targets are already provided, let's load profile.txt into pandas as our target dataset.

In [20]:
columns = ["Cooler Condition (%)", "Valve Condition (%)", "Internal Pump Leakage", "Hydraulic Accumulator (bar)", "Stable Flag"]

target_df = pd.read_csv("./data/profile.txt", sep="\t", names=columns)

target_df.head()

| | Cooler Condition (%) | Valve Condition (%) | Internal Pump Leakage | Hydraulic Accumulator (bar) | Stable Flag |
|---|---|---|---|---|---|
| 0 | 3 | 100 | 0 | 130 | 1 |
| 1 | 3 | 100 | 0 | 130 | 1 |
| 2 | 3 | 100 | 0 | 130 | 1 |
| 3 | 3 | 100 | 0 | 130 | 1 |
| 4 | 3 | 100 | 0 | 130 | 1 |


In [21]:
target_df.shape

(2205, 5)

Next, let's load each sensor file into its own DataFrame.

In [27]:
pressure_sensor_1_df = pd.read_csv("./data/PS1.txt", sep="\t", header=None)
pressure_sensor_2_df = pd.read_csv("./data/PS2.txt", sep="\t", header=None)
pressure_sensor_3_df = pd.read_csv("./data/PS3.txt", sep="\t", header=None)
pressure_sensor_4_df = pd.read_csv("./data/PS4.txt", sep="\t", header=None)
pressure_sensor_5_df = pd.read_csv("./data/PS5.txt", sep="\t", header=None)
pressure_sensor_6_df = pd.read_csv("./data/PS6.txt", sep="\t", header=None)
motor_power_df = pd.read_csv("./data/EPS1.txt", sep="\t", header=None)
volume_flow_1_df = pd.read_csv("./data/FS1.txt", sep="\t", header=None)
volume_flow_2_df = pd.read_csv("./data/FS2.txt", sep="\t", header=None)
temperature_sensor_1_df = pd.read_csv("./data/TS1.txt", sep="\t", header=None)
temperature_sensor_2_df = pd.read_csv("./data/TS2.txt", sep="\t", header=None)
temperature_sensor_3_df = pd.read_csv("./data/TS3.txt", sep="\t", header=None)
temperature_sensor_4_df = pd.read_csv("./data/TS4.txt", sep="\t", header=None)
vibration_df = pd.read_csv("./data/VS1.txt", sep="\t", header=None)
cooling_eff_df = pd.read_csv("./data/CE.txt", sep="\t", header=None)
cooling_power_df = pd.read_csv("./data/CP.txt", sep="\t", header=None)
eff_factor_df = pd.read_csv("./data/SE.txt", sep="\t", header=None)
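The seventeen near-identical calls above could also be written as a loop. The following is a sketch of one alternative, not the approach used in the rest of this notebook; `SENSOR_NAMES` and `load_sensors` are hypothetical names, and the file layout (`<sensor>.txt`, tab-delimited, no header) is assumed to match the cell above.

```python
from pathlib import Path

import pandas as pd

# Sensor file stems, in the same order as the sensor table.
SENSOR_NAMES = [
    "PS1", "PS2", "PS3", "PS4", "PS5", "PS6", "EPS1",
    "FS1", "FS2", "TS1", "TS2", "TS3", "TS4",
    "VS1", "CE", "CP", "SE",
]


def load_sensors(data_dir, names=SENSOR_NAMES):
    """Read each <name>.txt in data_dir into a DataFrame, keyed by sensor name."""
    data_dir = Path(data_dir)
    return {
        name: pd.read_csv(data_dir / f"{name}.txt", sep="\t", header=None)
        for name in names
    }
```

With the data extracted, `sensors = load_sensors("./data")` would give a dictionary in which `sensors["PS1"]` plays the role of `pressure_sensor_1_df`.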

In [47]:
print(f"PS1:{pressure_sensor_1_df.shape},\nPS2:{pressure_sensor_2_df.shape},\nPS3:{pressure_sensor_3_df.shape},\nPS4:{pressure_sensor_4_df.shape},\nPS5:{pressure_sensor_5_df.shape},\nPS6:{pressure_sensor_6_df.shape},\nEPS1:{motor_power_df.shape},\nFS1:{volume_flow_1_df.shape},\nFS2:{volume_flow_2_df.shape},\nTS1:{temperature_sensor_1_df.shape},\nTS2:{temperature_sensor_2_df.shape},\nTS3:{temperature_sensor_3_df.shape},\nTS4:{temperature_sensor_4_df.shape},\nVS1:{vibration_df.shape},\nCE:{cooling_eff_df.shape},\nCP:{cooling_power_df.shape},\nSE:{eff_factor_df.shape}")

PS1:(2205, 6000),
PS2:(2205, 6000),
PS3:(2205, 6000),
PS4:(2205, 6000),
PS5:(2205, 6000),
PS6:(2205, 6000),
EPS1:(2205, 6000),
FS1:(2205, 600),
FS2:(2205, 600),
TS1:(2205, 60),
TS2:(2205, 60),
TS3:(2205, 60),
TS4:(2205, 60),
VS1:(2205, 60),
CE:(2205, 60),
CP:(2205, 60),
SE:(2205, 60)


The shapes show that every sensor dataset has 2,205 rows (cycles), the same number as the target dataset.

This matches the `Number of Instances: 2205` line in description.txt.
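The row counts can also be compared programmatically instead of by eye. A minimal sketch with made-up miniature frames; `rows_aligned` is a hypothetical helper, not part of pandas.

```python
import pandas as pd


def rows_aligned(target, sensors):
    """Return True if every sensor frame has as many rows (cycles) as the target frame."""
    return all(len(frame) == len(target) for frame in sensors.values())


# Made-up miniature frames: 3 cycles each, with different numbers of data points per cycle.
example_target = pd.DataFrame({"Stable Flag": [0, 1, 0]})
example_sensors = {
    "fast": pd.DataFrame([[0.0] * 6] * 3),  # 3 cycles x 6 data points
    "slow": pd.DataFrame([[0.0] * 2] * 3),  # 3 cycles x 2 data points
}

print(rows_aligned(example_target, example_sensors))
```

Only the number of rows has to agree; the number of columns is allowed to differ between sensors.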

For the data points within a cycle (columns), there are three different counts:
1. 6,000 data points per cycle for sensors PS1 to PS6 and EPS1 ➡️ Total 7
2. 600 data points per cycle for sensors FS1 and FS2 ➡️ Total 2
3. 60 data points per cycle for sensors TS1 to TS4, VS1, CE, CP and SE ➡️ Total 8

This also matches the `Number of Attributes: 43680 (8x60 (1 Hz) + 2x600 (10 Hz) + 7x6000 (100 Hz))` line in description.txt.
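These column counts follow directly from the sampling rates and the 60-second cycle duration. The arithmetic can be reproduced as a quick check; the variable names below are illustrative.

```python
CYCLE_SECONDS = 60  # constant load cycle duration from the data description

# Number of sensors at each sampling rate in Hz, taken from the sensor table.
sensors_per_rate = {100: 7, 10: 2, 1: 8}

# Data points per cycle = sampling rate (Hz) x cycle duration (s).
points_per_cycle = {rate: rate * CYCLE_SECONDS for rate in sensors_per_rate}

# Total attributes = sum over rates of (sensor count x data points per cycle).
total_attributes = sum(
    count * points_per_cycle[rate] for rate, count in sensors_per_rate.items()
)

print(points_per_cycle)   # {100: 6000, 10: 600, 1: 60}
print(total_attributes)   # 43680
```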

**Question: What does this mean?**

## Data Exploration (Exploratory Data Analysis or EDA)