# Executive summary of Work Package 2

## Objectives

In this WP, you will work on a given training dataset. Your goal is to develop a fault detection model, using the classification algorithms covered in class, that achieves the best possible F1 score.

## Tasks

- Task 1: Develop a fault detection model using the unsupervised learning algorithms covered in class, aiming for the best possible F1 score.
- Task 2: With the help of the supporting script, develop a cross-validation scheme to test the performance of the developed classification algorithms.
- Task 3: Develop a fault detection model using the classification algorithms covered in class, aiming for the best possible F1 score.

## Deliverables

- A Jupyter notebook reporting the process and results of the above tasks


# Before starting, please:
- Fetch the most up-to-date version of the GitHub repository.
- Create a new branch with your name, based on the "main" branch, and switch to it.
- Copy this notebook to the workspace of your group, and rename it to `TD_WP_2_Your name.ipynb`.
- After finishing this task, push your changes to the GitHub repository of your group.

# Task 1: Unsupervised learning approaches

## Implement the statistical testing approach for fault detection

In this exercise, we implement the statistical testing approach for fault detection. The basic idea is to fit a multi-dimensional distribution to the observation data collected under normal working conditions. Then, when a new data point arrives, we design a hypothesis test to check whether it is consistent with that distribution. If it is, we conclude that the sample comes from normal operation; if it is not, we flag it as a potential fault.

The benefit of this approach is that designing the detection algorithm requires no failure data. The computational cost is also low: we only need to compute the pdf and compare it to a threshold.
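Written out, the test compares a log-density to a threshold. The notation here is our own assumption, not taken from the course material: $x \in \mathbb{R}^d$ is a new sample, $\mu$ and $\Sigma$ are the mean and covariance estimated from normal-operation data, and $\log\varepsilon$ is the threshold (called `log_epsilon` in the code of this notebook).

$$
\log p(x) = -\tfrac{1}{2}\left[ d \log(2\pi) + \log\lvert\Sigma\rvert + (x-\mu)^{\top}\Sigma^{-1}(x-\mu) \right],
\qquad
\hat{y}(x) =
\begin{cases}
1 \ (\text{faulty}) & \text{if } \log p(x) < \log\varepsilon,\\
0 \ (\text{normal}) & \text{otherwise.}
\end{cases}
$$

When only per-feature variances are estimated, $\Sigma$ is diagonal and the quadratic term reduces to $\sum_{j=1}^{d} (x_j-\mu_j)^2/\sigma_j^2$.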

In this exercise, you need to:
- Fit a multi-dimensional distribution to the training dataset (all normal samples).
- Design a fault detection algorithm based on the fitted distribution to detect faulty components.

The following block defines a few functions that you can use.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import multivariate_normal


def estimateGaussian(X):
    '''Given X, estimate the parameters of a multivariate Gaussian distribution: the per-feature mean and variance.'''
    mu = np.mean(X, axis=0)
    sigma2 = np.var(X, axis=0)
    return mu, sigma2


def classify(X, distribution, log_epsilon=-50):
    '''Given X, this function classifies each sample in X based on the multivariate Gaussian distribution. 
       The decision rule is: if the log pdf is less than log_epsilon, we predict 1, as the sample is unlikely to be from the distribution, which represents normal operation.
    '''
    p = distribution.logpdf(X)
    predictions = (p < log_epsilon).astype(int)
    
    return predictions

Let us use the dataset `20240105_164214` as the training dataset, since all of its samples come from normal operation. We will use the dataset `20240325_155003` as the testing dataset. Let us try to predict the state of motor 1. For this, we first extract the position, temperature and voltage of motor 1 as features (you can change the features if you want).

In [2]:
import sys
sys.path.insert(0, 'C:/Users/marce/OneDrive/Documents/CS/SG8_industry_4.0/digital_twin_robot/projects/maintenance_industry_4_2024/supporting_scripts/WP_1')

from utility import read_all_csvs_one_test
import pandas as pd

# Specify the path to the data directory.
base_dictionary = '../../dataset/training_data/'
dictionary_name = '20240105_164214'
path = base_dictionary + dictionary_name

# Read the data.
df_data = read_all_csvs_one_test(path, dictionary_name)

# Get the features
X_train = df_data.drop(columns=['time', 'test_condition', 'data_motor_6_position', 'data_motor_5_temperature', 'data_motor_1_label', 'data_motor_2_label', 'data_motor_3_label', 'data_motor_4_label', 'data_motor_5_label', 'data_motor_6_label'])


# We do the same to get the test dataset.
dictionary_name = '20240325_155003'
path = base_dictionary + dictionary_name

# Read the data.
df_data = read_all_csvs_one_test(path, dictionary_name)

# Get the features
X_test = df_data.drop(columns=['time', 'test_condition', 'data_motor_6_position', 'data_motor_5_temperature', 'data_motor_1_label', 'data_motor_2_label', 'data_motor_3_label', 'data_motor_4_label', 'data_motor_5_label', 'data_motor_6_label'])
y_test = df_data['data_motor_1_label']

Please design your algorithm below:

In [13]:
from sklearn.metrics import accuracy_score

# Estimate the Gaussian parameters from the training set
mu, sigma2 = estimateGaussian(X_train)

# Regularize the covariance matrix by adding a small value to the diagonal
regularization_value = 1  # This value can be adjusted as needed
regularized_cov = np.diag(sigma2) + np.eye(X_train.shape[1]) * regularization_value

# Construct the multivariate Gaussian distribution
# Set 'allow_singular=True' to handle the potential singularity issue
distribution = multivariate_normal(mean=mu, cov=regularized_cov, allow_singular=True)

# Now, let's try to predict the labels of the test set X_test.
# Classify the test set and calculate the accuracy
y_pred = classify(X_test, distribution, log_epsilon=-50)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.19422730006013228


**Discussions:**
- Can you please try to improve the performance of this approach?
    - For example, by normalizing the data?
    - By smoothing the data?
    - By reducing the number of features?
    - etc.
- The parameter `log_epsilon` defines the threshold used for classification. What happens if you change it?
- Could you discuss how we should get the best value for this parameter?
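One possible way to choose `log_epsilon` is to sweep candidate thresholds on a labelled validation set and keep the one with the highest F1 score. The sketch below illustrates this on synthetic data, not on the robot dataset: the arrays `X_normal_train`, `X_val` and `y_val`, the shift of the faulty cluster, and the 200-point grid are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 2 features, normal vs. shifted faulty samples.
X_normal_train = rng.normal(0.0, 1.0, size=(500, 2))
X_val = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                   rng.normal(5.0, 1.0, size=(50, 2))])
y_val = np.concatenate([np.zeros(200, dtype=int), np.ones(50, dtype=int)])

# Fit a diagonal Gaussian on normal data only.
mu = X_normal_train.mean(axis=0)
cov = np.diag(X_normal_train.var(axis=0))
dist = multivariate_normal(mean=mu, cov=cov)

# Sweep candidate thresholds over the range of observed validation log-densities.
log_p = dist.logpdf(X_val)
candidates = np.linspace(log_p.min(), log_p.max(), 200)
scores = [f1_score(y_val, (log_p < c).astype(int), zero_division=0)
          for c in candidates]

best_log_epsilon = candidates[int(np.argmax(scores))]
best_f1 = max(scores)
print(f"best log_epsilon = {best_log_epsilon:.2f}, F1 = {best_f1:.3f}")
```

The threshold should be tuned on data that is separate from the final test set, otherwise the reported score is optimistic.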

In [42]:
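# Raw string so the backslashes in the Windows path are not treated as escape sequences.
sys.path.insert(0, r'C:\Users\eugid\Desktop\CoursCS2A\Maintenance 4.0\digital_twin_robot\projects\maintenance_industry_4_2024\supporting_scripts\WP_1')

# Explicit import for the scaler used below (harmless if TD_1_Functions already exposes it).
from sklearn.preprocessing import MinMaxScaler
from TD_1_Functions import *
from TD_1_Functions import remove_outliers_2

window_size = 10
alpha = 10

# Preprocessing: smooth, remove outliers, then min-max normalize.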

#  X_train
X_train_smooth = smooth_data_moving_average(X_train, window_size)
X_train_no_outliers,  removed_indices_train = remove_outliers_2(X_train_smooth, [], alpha)
scaler = MinMaxScaler()
X_train_normalized = scaler.fit_transform(X_train_no_outliers)

#  X_test
X_test_smooth = smooth_data_moving_average(X_test, window_size)
X_test_no_outliers, removed_indices_test = remove_outliers_2(X_test_smooth, [], alpha)
X_test_normalized = scaler.transform(X_test_no_outliers)

y_test = np.delete(y_test, removed_indices_test, axis=0)


ImportError: cannot import name 'remove_outliers2' from 'TD_1_Functions' (C:\Users\eugid\Desktop\CoursCS2A\Maintenance 4.0\digital_twin_robot\projects\maintenance_industry_4_2024\supporting_scripts\WP_1\TD_1_Functions.py)

In [None]:
# Estimate the Gaussian parameters from the training set
mu, sigma2 = estimateGaussian(X_train_normalized)

# Regularize the covariance matrix by adding a small value to the diagonal
regularization_value = 1e-6  # This value can be adjusted as needed
regularized_cov = np.diag(sigma2) + np.eye(X_train_normalized.shape[1]) * regularization_value

# Construct the multivariate Gaussian distribution
# Set 'allow_singular=True' to handle the potential singularity issue
distribution = multivariate_normal(mean=mu, cov=regularized_cov, allow_singular=True)

# Now, let's try to predict the labels of the test set X_test.
# Classify the test set and calculate the accuracy
y_pred = classify(X_test_normalized, distribution, log_epsilon=-50)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

ValueError: Found input variables with inconsistent numbers of samples: [6652, 4863]
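The mismatch reported above (6652 labels against 4863 predictions) is consistent with rows having been removed from the features but not from the labels. One plausible cause is that the preprocessing cell stopped at the `ImportError` before its `np.delete` line ran. A generic way to keep both in sync is to filter them with the same boolean mask. The sketch below uses toy data and a hypothetical outlier rule; it does not reproduce `remove_outliers_2`, whose behavior is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in data (assumption): 8 samples, 2 features, binary labels.
X = pd.DataFrame({"position": [0.1, 0.2, 9.9, 0.3, 0.2, 0.1, -9.5, 0.4],
                  "temperature": [30, 31, 30, 32, 31, 30, 31, 30]})
y = pd.Series([0, 0, 0, 1, 1, 0, 0, 1], name="label")

# Hypothetical outlier rule: keep rows whose position is within 5 of the median.
keep = (X["position"] - X["position"].median()).abs() < 5.0

# Apply the SAME boolean mask to features and labels so they stay aligned.
X_kept = X.loc[keep].reset_index(drop=True)
y_kept = y.loc[keep].reset_index(drop=True)

print(len(X_kept), len(y_kept))
```

Checking `len(X_kept) == len(y_kept)` right after preprocessing catches this kind of misalignment before it reaches the metric call.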

## Local outlier factor (LOF)

The local outlier factor (LOF) algorithm computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors. You can easily implement LOF in scikit-learn ([tutorial](https://www.datatechnotes.com/2020/04/anomaly-detection-with-local-outlier-factor-in-python.html)).

Please implement the LOF algorithm on the dataset `20240325_155003`. You can first try to detect the failure of motor 1 using this model. Please calculate the accuracy score of your prediction.
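As a starting point, here is a minimal sketch of LOF with scikit-learn's `LocalOutlierFactor`, run on synthetic data rather than on the robot dataset. The arrays, the choice `n_neighbors=20`, and the use of novelty mode (fit on normal samples, then score unseen ones) are assumptions for illustration; other setups, such as calling `fit_predict` directly on the test set, are equally possible.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): normal samples near the origin, faulty ones shifted.
X_train = rng.normal(0.0, 1.0, size=(500, 3))
X_test = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)),
                    rng.normal(6.0, 1.0, size=(40, 3))])
y_test = np.concatenate([np.zeros(200, dtype=int), np.ones(40, dtype=int)])

# novelty=True lets the fitted model score unseen samples with predict().
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train)

# predict() returns +1 for inliers and -1 for outliers; map -1 to the fault label 1.
y_pred = (lof.predict(X_test) == -1).astype(int)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"accuracy = {acc:.3f}, F1 = {f1:.3f}")
```

On the real data, `X_train` would be the normal-operation features and `X_test`, `y_test` the features and motor 1 labels of `20240325_155003`. Reporting F1 alongside accuracy is useful when the two classes are imbalanced.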