# Technical Design Document

This document describes the technical design of the machine learning project: its architecture, data flow, components, diagrams, and implementation decisions.
>This project seeks to predict machine failures using historical sensor data and machine learning techniques. The goal is to build a predictive
maintenance system that can forecast potential failures before they occur, allowing for proactive maintenance actions.

## Table of Contents
1. [Data Loading](#1-data-loading)
2. [Preliminary Exploratory Data Analysis](#2-preeliminary-exploratory-data-analysis)
3. [Data Preprocessing](#3-data-preprocessing)
4. [Feature Engineering](#4-feature-engineering)
5. [Model Training](#5-model-training)
6. [Model Evaluation](#6-model-evaluation)

## 1. Data Loading

The first step in the project is to load the data from the source. The data is stored in CSV format and contains sensor readings from various machines along with their operational status. Five main CSV files need to be loaded, all available in the Microsoft Azure Predictive Maintenance dataset on Kaggle, [here](https://www.kaggle.com/datasets/arnabbiswas1/microsoft-azure-predictive-maintenance/data).

This data can also be found in the `data` directory of the project repository after running the `src/data_acquistion.py` script, which downloads the data from Kaggle, then orders, parses, and saves it in the appropriate format.
The data loading process reads the CSV files into pandas DataFrames, which allows for easy manipulation and analysis of the data. The following code snippet demonstrates how to download the data:

```python
import kagglehub

# Download the dataset from Kaggle; the call returns the local path to the files.
path = kagglehub.dataset_download("arnabbiswas1/microsoft-azure-predictive-maintenance")
print("\nPath to dataset files:", path)
```
The dataset contains the following files, named as saved in the `data` directory:
>- `telemetry.csv`: Hourly averages of voltage, rotation, pressure, and vibration collected from 100 machines during 2015.

>- `error.csv`: Errors encountered by the machines while in operating condition. Since these errors don't shut down the machines, they are not considered failures. The error dates and times are rounded to the closest hour, since the telemetry data is collected at an hourly rate.

>- `maintenance.csv`: Records of component replacements for the machines. The dates and times are rounded to the closest hour, since the telemetry data is collected at an hourly rate. Components are replaced in two situations:
  >>- Proactive maintenance: the technician replaces the component during a regularly scheduled visit.
  >>- Reactive maintenance: a component breaks down and the technician performs unscheduled maintenance to replace it.

>- `failure.csv`: Records of component replacements due to failure. This data is a subset of the maintenance data. The failure dates and times are rounded to the closest hour, since the telemetry data is collected at an hourly rate.

>- `machines.csv`: Metadata about the machines, such as their model and age.

#### Note
The date-times in the original dataset are written in the ISO 8601 format and are loaded as `object` columns. Python recognizes this format, so these columns can easily be converted to dates using the `parse_date` function.
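
As a minimal sketch of this conversion, assuming the files are read with pandas, the `parse_dates` argument of `pd.read_csv` can convert an ISO 8601 column while loading. The inline sample and the column names (`datetime`, `machine_id`, `voltage`) are hypothetical stand-ins for a file such as `data/telemetry.csv`, and this may differ from how the project's own scripts perform the conversion.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for a file such as data/telemetry.csv.
# The column names are assumptions, not taken from the project.
sample_csv = io.StringIO(
    "datetime,machine_id,voltage\n"
    "2015-01-01 06:00:00,1,176.2\n"
    "2015-01-01 07:00:00,1,162.9\n"
)

# parse_dates converts the listed columns to a datetime dtype while reading.
telemetry = pd.read_csv(sample_csv, parse_dates=["datetime"])

print(telemetry.dtypes)
```

With a real file, the `io.StringIO` object would be replaced by the path to the CSV.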


<a name="2-preeliminary-exploratory-data-analysis"></a>
## 2. Preliminary Exploratory Data Analysis
The next step is to perform preliminary exploratory data analysis (EDA) on the loaded data: examining its structure, identifying any missing values, and gaining insight into the relationships between variables. This step is crucial for understanding the data and preparing it for further analysis. It is performed in the `src/eda.py` script.
This step includes:
- Checking the data types of each column to ensure they are appropriate for analysis.
- Displaying basic information about the datasets (shape and first 5 rows).
- Identifying and handling missing/duplicated values.
- Displaying summary statistics for numerical columns to understand their distributions.
- Visualizing the distribution of key variables to understand their characteristics.
- Identifying any outliers or anomalies in the data that may need to be addressed.
- Checking the relationships between different variables to identify potential correlations or patterns.
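
Several of the checks above can be sketched with standard pandas calls. This is an illustrative example rather than the contents of `src/eda.py`: the small DataFrame is synthetic, it deliberately contains one missing value and one duplicated row so the checks have something to report, and the column names are assumptions.

```python
import pandas as pd

# Synthetic stand-in for the telemetry data; names and values are assumptions.
telemetry = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2015-01-01 06:00",
                "2015-01-01 07:00",
                "2015-01-01 07:00",  # exact copy of the previous row
                "2015-01-01 08:00",
            ]
        ),
        "machine_id": [1, 1, 1, 1],
        "voltage": [176.2, 162.9, 162.9, None],  # one missing reading
    }
)

print(telemetry.dtypes)   # data type of each column
print(telemetry.shape)    # (number of rows, number of columns)
print(telemetry.head())   # first 5 rows (fewer if the frame is shorter)

# Count missing values per column and fully duplicated rows.
missing_per_column = telemetry.isna().sum()
duplicate_rows = int(telemetry.duplicated().sum())
print(missing_per_column)
print("Duplicated rows:", duplicate_rows)

# Summary statistics for the numerical columns.
print(telemetry.describe())
```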


From the preliminary EDA, we can summarize the following insights:
>- All the date columns have been correctly converted to the `datetime64[ns]` data type.
>- All the datasets share a common column `machine_id` made of integers, which can be used to join the datasets.
>- There are no missing values in the datasets.
>- There are no duplicated rows in the datasets.
>- With the exception of the maintenance dataset, which begins on June 1, 2014, all the datasets contain data from January 1, 2015 to January 1, 2016.
>- Since there are only 6 hours of data from January 1, 2016, these rows can be dropped from the datasets to avoid any bias in the analysis.
>- The telemetry dataset contains continuous numerical variables (voltage, rotation, pressure, vibration) with approximately Gaussian (normal) distributions, which can be used as features for modeling.
>- The error and failure datasets contain categorical variables (error types and failure types) that can be used as features for modeling.
>- There are no outliers in the datasets, as all the values are within a reasonable range.
>- There are no notable linear correlations between the variables, as all the correlation coefficients are close to zero. This suggests that multicollinearity is unlikely to be an issue in the modeling phase.
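
Two of these findings, dropping the few rows from 2016 and checking pairwise correlations, can be sketched as follows. The data here is random and purely illustrative, and the `datetime` column name is an assumption; the project's real telemetry would take its place.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic hourly telemetry crossing the 2015/2016 boundary (values are random).
timestamps = pd.date_range(
    start="2015-12-30 00:00", end="2016-01-01 06:00", freq=pd.Timedelta(hours=1)
)
sensor_columns = ["voltage", "rotation", "pressure", "vibration"]
telemetry = pd.DataFrame(
    rng.normal(size=(len(timestamps), len(sensor_columns))), columns=sensor_columns
)
telemetry.insert(0, "datetime", timestamps)

# Keep only rows strictly before January 1, 2016.
telemetry_2015 = telemetry[telemetry["datetime"] < pd.Timestamp("2016-01-01")]

# Pairwise Pearson correlation between the sensor readings.
correlation_matrix = telemetry_2015[sensor_columns].corr()

print(len(telemetry), "->", len(telemetry_2015), "rows")
print(correlation_matrix.round(2))
```

Pearson coefficients near zero indicate the absence of a linear relationship; they do not rule out nonlinear dependence between variables.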

## 3. Data Preprocessing
The next step is to preprocess the data to prepare it for modeling. This involves cleaning the data, handling missing values, and transforming the data into a suitable format for analysis. The preprocessing steps are performed in the `src/data_preprocessing.py` script.
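
One transformation this step could include, given the shared `machine_id` column noted in the EDA, is joining the machine metadata onto the telemetry readings. The sketch below is an assumption about what such a join might look like, not a description of `src/data_preprocessing.py`; the values and the `model` and `age` column names are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the telemetry and machines tables.
telemetry = pd.DataFrame(
    {
        "machine_id": [1, 1, 2],
        "voltage": [176.2, 162.9, 170.1],
    }
)
machines = pd.DataFrame(
    {
        "machine_id": [1, 2],
        "model": ["model3", "model4"],
        "age": [18, 7],
    }
)

# Left join keeps every telemetry row; validate raises an error if machine_id
# is not unique in the machines table.
merged = telemetry.merge(machines, on="machine_id", how="left", validate="many_to_one")

print(merged)
```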


## 4. Feature Engineering
The feature engineering step involves creating new features from the existing data that can help improve the performance of the machine learning models. This includes:

## 5. Model Training
The model training step involves selecting appropriate machine learning algorithms and training them on the preprocessed data. This step is crucial for building a predictive model that can accurately forecast machine failures.

## 6. Model Evaluation
The model evaluation step involves assessing the performance of the trained models using various metrics. This step is essential to ensure that the models are capable of making accurate predictions on unseen data. The evaluation metrics include accuracy, precision, recall, F1-score, and ROC-AUC score.
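
The listed metrics can be computed with scikit-learn, as in the sketch below. The use of scikit-learn, the labels, the predicted scores, and the 0.5 decision threshold are all assumptions made for illustration; none of them come from the project's results.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up ground truth (1 = failure) and predicted failure probabilities.
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.10, 0.20, 0.05, 0.60, 0.80, 0.40, 0.30, 0.90, 0.20, 0.10]

# Turn probabilities into hard predictions with an assumed 0.5 threshold.
threshold = 0.5
y_pred = [int(score >= threshold) for score in y_score]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC-AUC is computed from the scores, not the thresholded predictions.
    "roc_auc": roc_auc_score(y_true, y_score),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

If failures turn out to be rare relative to normal operation, accuracy alone can look high even for a model that misses most failures, which is one reason to report precision, recall, and ROC-AUC alongside it.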