tcn-hard-disk-failure-prediction


Introduction

A comprehensive machine learning project for predicting whether a hard disk will fail within a given time interval

More than 2500 petabytes of data are generated every day by sources such as social media, IoT devices, etc., and every bit of it is valuable. That’s why modern storage systems need to be reliable, scalable, and efficient. To ensure that data is not lost or corrupted, many large-scale distributed storage systems, such as Ceph or AWS, use erasure-coded redundancy or mirroring. Although this provides reasonable fault tolerance, it can make it more difficult and expensive to scale up the storage cluster.

This project seeks to mitigate this problem using machine learning. Specifically, the goal of this project is to train a model that can predict if a given disk will fail within a predefined future time window. These predictions can then be used by Ceph (or other similar systems) to determine when to add or remove data replicas. In this way, the fault tolerance can be improved by up to an order of magnitude, since the probability of data loss is generally related to the probability of multiple, concurrent disk failures.

In addition to creating models, we also aim to catalyze community involvement in this domain by providing Jupyter notebooks to easily get started with and explore some publicly available datasets such as Backblaze Dataset and Ceph Telemetry. Ultimately, we want to provide a platform where data scientists and subject matter experts can collaborate and contribute to this ubiquitous problem of predicting disk failures.

Code Structure

The code is structured as follows:

tcn-hard-disk-failure-prediction
│
├── algorithms
│   ├── app.py
│   ├── Classification.py
│   ├── Dataset_manipulation.py
│   ├── GeneticFeatureSelector.py
│   ├── json_param.py
│   ├── network_training.py
│   ├── Networks_pytorch.py
│   └── utils.py
│
├── datasets_creation
│   ├── files_to_failed.py
│   ├── find_failed.py
│   ├── get_dataset.py
│   ├── config.py
│   └── toList.py
│
├── inference
│   ├── app.py
│   ├── Dataset_processing.py
│   ├── Inference.py
│   └── Networks_inference.py
│
├── .gitignore
└── README.md

Description of each file

  • Folder: algorithms
    • app.py: This script is used to display the Gradio interface and run the classification functions.
    • Classification.py: This script contains the code to train and test various classification models, including RandomForest, TCN, and LSTM.
    • Dataset_manipulation.py: This script is used for manipulating the dataset, including tasks such as cleaning, preprocessing, and feature extraction.
    • GeneticFeatureSelector.py: This script is used to select the best features using a genetic algorithm.
    • json_param.py: This script is used to load the parameters from the JSON file or save them to the JSON file.
    • logger.py: This script is used to log the information during the training and testing process.
    • network_training.py: This script is used to train the deep learning networks or the classification models.
    • Networks_pytorch.py: This script contains the implementation of the deep learning networks using PyTorch.
    • utils.py: This script contains the utility functions used in the training and testing process.
  • Folder: datasets_creation
    • app.py: This script is used to display the Gradio interface and run the dataset creation functions.
    • get_dataset.py: This script is used to fetch and preprocess the dataset for the hard disk failure prediction task.
    • save_to_grouped_list.py: This script is used to convert the data to a grouped list format for easier manipulation in the project.
    • save_to_list.py: This script is used to find the failed disk instances in the dataset and save them to a list.
    • TODO: save_to_mysql.py: This script will be used to save the dataset to a MySQL database.
    • save_to_pkl.py: This script is used to save the processed dataset to a .pkl file as part of the dataset creation process.
  • Folder: inference
    • app.py: This script is used to display the Gradio interface and run the inference functions.
    • Dataset_processing.py: This script is used to preprocess the dataset for the inference task.
    • Inference.py: This script is used to run the inference on the preprocessed dataset.
    • Networks_inference.py: This script contains the implementation of the deep learning networks for the inference task.

Wiki

For more information, please refer to the wiki.

How to run the code

  1. Clone the repository:

    git clone git@github.com:Prognostika/tcn-hard-disk-failure-prediction.git
  2. Install the required packages:

    pip install -r requirements.txt
  3. Download the dataset via the app.py script:

    python .\datasets_creation\app.py

    After running the script, the dataset will be downloaded and saved in the HDD_dataset directory in the parent folder of the repository. (The total dataset and zip package take around 50 GB, so make sure you have enough space on your disk.)

  4. Run the classification script:

    python .\algorithms\app.py

    The script will preprocess the dataset from the HDD_dataset directory and save the preprocessed dataset as a .pkl file in the output folder, then train and test the classification models on it.

  5. Run the inference script:

    python .\inference\app.py

    The script will preprocess the inference data from the uploaded CSV file, then load the trained model and run predictions on the parsed data.

Core Parts of this Algorithm

  1. Feature Selection: Currently we use the t-test for feature selection, keeping the top 18 features ranked by their t-test scores (see the first sketch after this list).
  2. Dataset Balancing: Currently we use SMOTE to upsample the failed disk samples and RandomUnderSampler to downsample the healthy majority class (see the second sketch after this list).
  3. Hyperparameter Tuning: Currently we use scikit-learn's GridSearchCV and RandomizedSearchCV for hyperparameter tuning, and evaluate the models with the RMSE, MAE, FDR, FAR, F1, recall, and precision metrics, with F1 as the main metric for selecting hyperparameters (see the third sketch after this list). For the deep learning models, we use the Ray Tune library.
  4. Model Training: Currently we train RandomForest, TCN, and LSTM models and evaluate them with the same RMSE, MAE, FDR, FAR, F1, recall, and precision metrics; in our results, the TCN model performs better than the other models (a minimal TCN block is sketched after this list).
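
Below is a hypothetical minimal sketch of the t-test ranking from item 1. The function name, column layout, and "failure" label are illustrative assumptions, not the project's Dataset_manipulation.py.

    # Hypothetical minimal sketch of t-test feature ranking (not the project's
    # Dataset_manipulation.py). df is assumed to hold only numeric SMART columns
    # plus a binary "failure" label; both names are illustrative assumptions.
    import pandas as pd
    from scipy.stats import ttest_ind

    def select_top_features_by_ttest(df: pd.DataFrame, label_col: str = "failure",
                                     top_k: int = 18) -> list:
        failed = df[df[label_col] == 1]
        healthy = df[df[label_col] == 0]
        scores = {}
        for col in df.columns.drop(label_col):
            # Welch's t-test between failed and healthy samples for this feature
            t_stat, _ = ttest_ind(failed[col], healthy[col],
                                  equal_var=False, nan_policy="omit")
            scores[col] = abs(t_stat)
        # Keep the top_k features with the largest absolute t-statistic
        return sorted(scores, key=scores.get, reverse=True)[:top_k]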
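
Item 2's resampling step could look like the following imbalanced-learn sketch; the sampling ratios are illustrative assumptions rather than the project's tuned values.

    # Hypothetical sketch of the resampling step with imbalanced-learn: SMOTE
    # oversamples the rare failed-disk class, then RandomUnderSampler shrinks the
    # healthy majority class. The sampling ratios are illustrative assumptions.
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.pipeline import Pipeline

    resampler = Pipeline(steps=[
        ("smote", SMOTE(sampling_strategy=0.1, random_state=42)),               # failed up to 10% of healthy
        ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),  # healthy down to 2x failed
    ])

    # X, y would be the preprocessed SMART features and the failure labels:
    # X_res, y_res = resampler.fit_resample(X, y)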
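
Item 3's scikit-learn tuning of the RandomForest baseline, with F1 as the selection metric, might look like this sketch; the parameter grid is an illustrative assumption.

    # Hypothetical sketch of tuning the RandomForest baseline with GridSearchCV,
    # scored on F1 as described above; the grid values are illustrative assumptions.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [100, 300, 500],
        "max_depth": [None, 10, 20],
        "min_samples_leaf": [1, 5],
    }

    search = GridSearchCV(
        RandomForestClassifier(class_weight="balanced", random_state=42),
        param_grid=param_grid,
        scoring="f1",   # F1 is the main metric for choosing hyperparameters
        cv=5,
        n_jobs=-1,
    )

    # search.fit(X_res, y_res)
    # best_model = search.best_estimator_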
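
For item 4, the following minimal PyTorch sketch shows the dilated causal convolution block that a TCN stacks; it is an illustration under assumed layer sizes, not the implementation in Networks_pytorch.py.

    # Minimal PyTorch sketch of a dilated causal convolution block, the building
    # unit of a TCN (an illustrative assumption, not the code in Networks_pytorch.py).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConvBlock(nn.Module):
        def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation   # left padding keeps the conv causal
            self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
            self.match = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

        def forward(self, x):                         # x: (batch, features, time)
            out = F.pad(x, (self.pad, 0))             # pad only the past side
            out = torch.relu(self.conv(out))
            return out + self.match(x)                # residual connection

    # Stack blocks with growing dilations so the receptive field covers a long history,
    # then classify the most recent time step. All sizes below are illustrative.
    tcn = nn.Sequential(
        CausalConvBlock(18, 32, dilation=1),   # 18 = number of selected SMART features
        CausalConvBlock(32, 32, dilation=2),
        CausalConvBlock(32, 32, dilation=4),
    )
    head = nn.Linear(32, 2)                    # healthy vs. will-fail within the window
    x = torch.randn(8, 18, 90)                 # 8 disks, 90 days of SMART history
    logits = head(tcn(x)[:, :, -1])            # shape: (8, 2)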

Articles

  1. Predictive models of hard drive failures based on operational data, 2018

  2. Hard Drive Failure Prediction Using Classification and Regression Trees, 2014

  3. Random-forest-based failure prediction for hard disk drives, 2018

  4. Proactive Prediction of Hard Disk Drive Failure, 2017

  5. Hard Drive Failure Prediction for Large Scale Storage System, 2017

  6. Improving Storage System Reliability with Proactive Error Prediction, 2017

  7. Predicting Disk Replacement towards Reliable Data Centers

  8. Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application, 2005

  9. Anomaly detection using SMART indicators for hard disk drive failure prediction, 2017

  10. Failure Trends in a Large Disk Drive Population, 2007

  11. Improving Service Availability of Cloud Systems by Predicting Disk Error, 2018

  12. Proactive error prediction to improve storage system reliability, 2017

  13. A Proactive Drive Reliability Model to Predict Failures in the Hard Disk Drives, 2014

  14. Predicting Hard Disk Failures in Data Centers Using Temporal Convolutional Neural Networks, 2021

  15. Transfer Learning based Failure Prediction for Minority Disks in Large Data Centers of Heterogeneous Disk Systems, 2019

TODO

  1. Thoroughly test the code and fix the bugs.
  2. Add more comments to the code for better understanding, especially the parameters that need to be tuned.
  3. Store the data in PostgreSQL database instead of CSV files.
  4. Refactor the Python scripts in the algorithms folder into Jupyter notebooks, rather than importing those scripts directly.
  5. Add a visual interface for dataset_creation.ipynb for adjusting the parameters.
  6. Add multiple disk support for dataset input.
  7. Add dataset re-train for transfer learning.

Future Work for Algorithm

  1. Use a genetic algorithm (provided by DEAP) for feature selection before the t-test on the statistical significance of the selected features (a minimal sketch follows this list).
  2. Add multiple disk models for prediction. (Currently, we only have one disk model, ST4000DM000)
  3. Use time-based SMOTE for data augmentation on failed disk samples to balance the dataset.
  4. Use transfer learning to improve the model performance.
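
As a rough illustration of item 1 above, a DEAP-based feature selector could evolve bit masks over the candidate features and score each mask with the cross-validated F1 of a RandomForest; this is an assumption for illustration, not the code in GeneticFeatureSelector.py.

    # Rough DEAP sketch (an illustrative assumption, not GeneticFeatureSelector.py):
    # individuals are bit masks over the candidate feature columns, and the fitness
    # is the cross-validated F1 of a RandomForest trained on the masked features.
    import random
    import numpy as np
    from deap import algorithms, base, creator, tools
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    N_FEATURES = 18  # illustrative size of the candidate feature pool

    creator.create("FitnessMax", base.Fitness, weights=(1.0,))
    creator.create("Individual", list, fitness=creator.FitnessMax)

    toolbox = base.Toolbox()
    toolbox.register("attr_bit", random.randint, 0, 1)
    toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bit, N_FEATURES)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)

    def evaluate(individual, X=None, y=None):
        mask = np.array(individual, dtype=bool)
        if not mask.any():
            return (0.0,)   # penalize empty feature sets
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        return (cross_val_score(clf, X[:, mask], y, scoring="f1", cv=3).mean(),)

    toolbox.register("mate", tools.cxTwoPoint)
    toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
    toolbox.register("select", tools.selTournament, tournsize=3)

    # With X_train, y_train as numpy arrays:
    # toolbox.register("evaluate", evaluate, X=X_train, y=y_train)
    # pop = toolbox.population(n=30)
    # pop, _ = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=20, verbose=False)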

Contact

This project is maintained by Prognostika. If you have any questions, please feel free to contact us at lrt2436559745@gmail.com.
