
**Mini Project: Predictive Maintenance using Machine Learning**

**Project Description**

In this mini project, you will develop a predictive maintenance system using machine learning techniques. The goal is to predict machine failures based on sensor data from industrial equipment. You will work with a dataset containing various sensor measurements and use two different machine learning models of your choice.

**------------------Solution------------------**

**Step 1:** Loading dependencies:

The first step is to import the necessary dependencies, since I will use these packages to load the training data and train the models.

In [5]:
import numpy as np
import sklearn as sk
import pandas as pd
import keras as kr


**Step 2:** Loading the dataset with `pandas`

I will only load the relevant columns, since I am only interested in Air temperature, Process temperature, Rotational speed, Torque, Tool wear and Machine failure.

In [6]:
data = pd.read_csv("ai4i2020.csv", usecols=['Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]', "Machine failure"])
data

| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure |
|---|---|---|---|---|---|---|
| 0 | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 |
| 1 | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 |
| 2 | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 |
| 3 | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 |
| 4 | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 9995 | 298.8 | 308.4 | 1604 | 29.5 | 14 | 0 |
| 9996 | 298.9 | 308.4 | 1632 | 31.8 | 17 | 0 |
| 9997 | 299.0 | 308.6 | 1645 | 33.4 | 22 | 0 |
| 9998 | 299.0 | 308.7 | 1408 | 48.5 | 25 | 0 |


**Step 3:** Choice of models

The mini project requires me to choose two different models to predict machine failures from the collected data.

Choosing them requires some consideration of which models are best suited to the task. Below I list the models that have been introduced throughout the course.

- Random Forests                    (Lecture 5)
- Decision Trees                    (Lecture 5)
- Logistic Regression               (Lecture 6)
- Support Vector Machines (SVM)     (Lecture 6)
- Artificial Neural Networks (ANNs) (Lecture 9)

**Dataset Format Considerations**

The provided dataset consists of 10,000 rows and the following columns:
- **Features**: Air temperature [K], Process temperature [K], Rotational speed [rpm], Torque [Nm], Tool wear [min].
- **Target Variable**: Machine failure (binary classification: 0 or 1).
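
The feature/target split described above can be sketched as follows. This is only an illustration: the helper `split_features_target` and the constants `FEATURE_COLUMNS` and `TARGET_COLUMN` are names I made up for this example, and the small `sample` frame is copied from the first five rows displayed in Step 2 so the snippet runs without the CSV file. In the notebook, the `data` frame from Step 2 would be passed in instead.

```python
import pandas as pd

# Column names as they appear in the dataset loaded in Step 2.
FEATURE_COLUMNS = [
    "Air temperature [K]",
    "Process temperature [K]",
    "Rotational speed [rpm]",
    "Torque [Nm]",
    "Tool wear [min]",
]
TARGET_COLUMN = "Machine failure"


def split_features_target(frame: pd.DataFrame):
    """Return (X, y): the five sensor columns and the binary failure label."""
    X = frame[FEATURE_COLUMNS]
    y = frame[TARGET_COLUMN]
    return X, y


# Stand-in frame built from the first five rows shown in the Step 2 output.
# Assumption for illustration only: the real notebook would use `data` here.
sample = pd.DataFrame(
    [
        [298.1, 308.6, 1551, 42.8, 0, 0],
        [298.2, 308.7, 1408, 46.3, 3, 0],
        [298.1, 308.5, 1498, 49.4, 5, 0],
        [298.2, 308.6, 1433, 39.5, 7, 0],
        [298.2, 308.7, 1408, 40.0, 9, 0],
    ],
    columns=FEATURE_COLUMNS + [TARGET_COLUMN],
)

X, y = split_features_target(sample)
print(X.shape, y.shape)
```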

**Characteristics of the Dataset:**
   - All features are numerical, making the dataset suitable for a wide range of ML models without requiring extensive preprocessing.
   - The target variable ("Machine failure") makes the problem a binary classification task, which is ideal for models like Logistic Regression, Random Forests, or Neural Networks.
   - In the rows displayed in Step 2, each feature stays within a fairly narrow range, but the features sit on very different scales (temperatures around 300 K, rotational speed around 1,500 rpm, torque in the tens of Nm). Scaling (e.g., standardization) is therefore advisable for models that are sensitive to feature scale (e.g., SVM, Neural Networks).
   - The dataset might exhibit non-linear relationships or interactions between features, favoring models that can capture such complexities (e.g., Random Forests, ANNs).
   - With 10,000 rows, the dataset is large enough to benefit from complex models like Random Forests and ANNs while remaining manageable in terms of computational cost.
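
To illustrate the standardization mentioned above, here is a minimal sketch using scikit-learn's `StandardScaler`. The arrays `X_train` and `X_new` are assumptions for the example: they are copied from rows 0 to 4 and row 9995 of the Step 2 output, not from a real train/test split. The point is that the scaler is fitted on the training rows only and then reused on unseen rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: air temp [K], process temp [K], speed [rpm], torque [Nm], tool wear [min].
# Rows 0-4 from the Step 2 output, used here as a stand-in training set.
X_train = np.array(
    [
        [298.1, 308.6, 1551, 42.8, 0],
        [298.2, 308.7, 1408, 46.3, 3],
        [298.1, 308.5, 1498, 49.4, 5],
        [298.2, 308.6, 1433, 39.5, 7],
        [298.2, 308.7, 1408, 40.0, 9],
    ],
    dtype=float,
)

# Row 9995 from the Step 2 output, used as a stand-in for unseen data.
X_new = np.array([[298.8, 308.4, 1604, 29.5, 14]], dtype=float)

# Fit on the training rows only: learns one mean and one std per column.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics for new rows; never refit on them.
X_new_scaled = scaler.transform(X_new)

print("per-column mean after scaling:", X_scaled.mean(axis=0).round(6))
print("per-column std after scaling: ", X_scaled.std(axis=0).round(6))
print("scaled new row:", X_new_scaled.round(3))
```

After scaling, every column of the training data has zero mean and unit variance, so the rpm column no longer dominates distance-based or gradient-based models simply because its raw values are larger.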

**Model Format Considerations**

**Random Forests**
- **Strengths**: Robust ensemble model, handles non-linear relationships, provides feature importance, resists overfitting.
- **Weaknesses**: Computationally intensive for large datasets.
- **Reasoning**: Ideal for structured data and reliable classification.

**Decision Trees**
- **Strengths**: Simple, interpretable, handles non-linear data well.
- **Weaknesses**: Prone to overfitting, sensitive to data changes.
- **Reasoning**: Useful as a baseline due to simplicity and interpretability.

**Logistic Regression**
- **Strengths**: Simple, interpretable, efficient for binary classification.
- **Weaknesses**: Assumes linear relationships, struggles with complex patterns.
- **Reasoning**: Could provide a baseline for comparison with more complex models.

**Support Vector Machines (SVM)**
- **Strengths**: Handles linear and non-linear boundaries, works well with smaller datasets.
- **Weaknesses**: Computationally expensive, requires careful tuning.
- **Reasoning**: Suitable for structured data but limited scalability.

**Artificial Neural Networks (ANNs)**
- **Strengths**: Models complex relationships, scalable to large datasets.
- **Weaknesses**: Computationally expensive, less interpretable.
- **Reasoning**: Strong candidate for capturing non-linear relationships.
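
The trade-offs listed above can also be probed empirically. The sketch below compares the five candidates with 3-fold cross-validation. Two assumptions to note: the data comes from `make_classification`, i.e. it is synthetic placeholder data and not the sensor dataset, and the ANN is represented by scikit-learn's `MLPClassifier` rather than a Keras network. The hyperparameters are arbitrary starting values, so the scores say nothing about how the models would rank on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data with five features (NOT the real sensor data).
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with StandardScaler so the
# scaler is refitted inside each cross-validation fold.
candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "ANN (MLPClassifier)": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    ),
}

mean_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=3, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name:22s} mean accuracy = {scores.mean():.3f}")
```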

**Selected Models:**
1. **Random Forests**: Chosen for robustness and for the feature importance estimates it provides.
2. **Artificial Neural Networks**: Selected for handling complex patterns effectively.
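
As a rough sketch of how the two selected models could be fitted and compared on a held-out split, consider the example below. It rests on several assumptions: the data is synthetic (`make_classification`), the names in `FEATURE_NAMES` are only attached as labels to the synthetic columns, the ANN is scikit-learn's `MLPClassifier` standing in for a Keras model, and the layer sizes, tree count and 80/20 split are arbitrary choices rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Labels only: the columns below are synthetic, not real sensor readings.
FEATURE_NAMES = [
    "Air temperature [K]",
    "Process temperature [K]",
    "Rotational speed [rpm]",
    "Torque [Nm]",
    "Tool wear [min]",
]

# Synthetic placeholder data with five features and a binary target.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Model 1: Random Forest (tree-based, so no feature scaling is applied).
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Model 2: small neural network, with standardization fitted on the training split.
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=42),
)
ann.fit(X_train, y_train)

forest_accuracy = forest.score(X_test, y_test)
ann_accuracy = ann.score(X_test, y_test)
print(f"Random Forest test accuracy: {forest_accuracy:.3f}")
print(f"ANN test accuracy:           {ann_accuracy:.3f}")

# Impurity-based importances from the forest; they sum to 1 across features.
for name, importance in zip(FEATURE_NAMES, forest.feature_importances_):
    print(f"{name:26s} {importance:.3f}")
```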