In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Problem Statement

In real-world industrial systems, machine failures are **rare but critical events**, resulting in highly **imbalanced datasets** where normal operating conditions significantly outnumber failure cases. Machine learning models trained on such data often become biased toward the majority class, leading to poor failure detection performance and unreliable maintenance decisions.

The problem addressed in this project is to **develop and evaluate machine learning approaches capable of accurately identifying machine failure events from highly imbalanced operational data**. Using a synthetic milling machine dataset, the project focuses on understanding the impact of class imbalance on model performance and applying suitable techniques (such as resampling strategies, class weighting, and appropriate evaluation metrics) to improve failure detection without compromising model reliability.

The ultimate goal is not only to achieve predictive accuracy, but also to **build a practical understanding of modeling trade-offs** in predictive maintenance scenarios, where false negatives can be significantly more costly than false positives.
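To make the accuracy trap concrete, the following minimal sketch scores a trivial model that always predicts "no failure". The labels are hypothetical (1,000 records with 15 failures) and are not taken from the dataset; they only illustrate why accuracy alone can hide every missed failure.

```python
import numpy as np

# Hypothetical labels for illustration only: 1,000 records, 15 failures (1.5%).
y_true = np.zeros(1000, dtype=int)
y_true[:15] = 1

# A trivial "model" that always predicts the majority class (no failure).
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
false_negatives = int(((y_pred == 0) & (y_true == 1)).sum())
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}, missed failures={false_negatives}")
# prints: accuracy=0.985, recall=0.000, missed failures=15
```

The baseline looks strong on accuracy while detecting no failures at all, which is why recall-oriented metrics matter when false negatives are the costly error.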

## Objective 

**Project Context and Objective**

This project uses a **synthetic predictive maintenance dataset** modeled after a real-world milling machine. The dataset includes multiple independent failure modes and a highly **imbalanced target variable**, closely reflecting real industrial operating conditions.

The primary objective of this project is to **learn and demonstrate techniques for handling imbalanced datasets** in machine learning, with a focus on preprocessing strategies, evaluation metrics, and modeling considerations relevant to predictive maintenance applications.

---

**Dataset Attribution and Credits**

The dataset used in this project originates from the following research publication:

**S. Matzka**  
*Explainable Artificial Intelligence for Predictive Maintenance Applications*  
Proceedings of the **2020 Third International Conference on Artificial Intelligence for Industries (AI4I)**, pp. 69–74.

All credit for the dataset design, failure modeling, and methodology belongs to the **original author and publisher**. This project is strictly for **educational and learning purposes**.

## EDA

### 1.1 Dataset info

In [None]:
# Load the dataset (absolute path is specific to the author's machine)
df = pd.read_csv("C:/sai files/projects/predictive-maintenance-end2end/test.csv")

In [18]:
df.head()

Unnamed: 0,id,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],TWF,HDF,PWF,OSF,RNF
0,136429,L50896,L,302.3,311.5,1499,38.0,60,0,0,0,0,0
1,136430,L53866,L,301.7,311.0,1713,28.8,17,0,0,0,0,0
2,136431,L50498,L,301.3,310.4,1525,37.7,96,0,0,0,0,0
3,136432,M21232,M,300.1,309.6,1479,47.6,5,0,0,0,0,0
4,136433,M19751,M,303.4,312.3,1515,41.3,114,0,0,0,0,0


In [33]:
df.shape

(90954, 13)

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90954 entries, 0 to 90953
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       90954 non-null  int64  
 1   Product ID               90954 non-null  object 
 2   Type                     90954 non-null  object 
 3   Air temperature [K]      90954 non-null  float64
 4   Process temperature [K]  90954 non-null  float64
 5   Rotational speed [rpm]   90954 non-null  int64  
 6   Torque [Nm]              90954 non-null  float64
 7   Tool wear [min]          90954 non-null  int64  
 8   TWF                      90954 non-null  int64  
 9   HDF                      90954 non-null  int64  
 10  PWF                      90954 non-null  int64  
 11  OSF                      90954 non-null  int64  
 12  RNF                      90954 non-null  int64  
dtypes: float64(3), int64(8), object(2)
memory usage: 9.0+ MB


In [23]:
df.describe()

Unnamed: 0,id,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],TWF,HDF,PWF,OSF,RNF
count,90954.0,90954.0,90954.0,90954.0,90954.0,90954.0,90954.0,90954.0,90954.0,90954.0,90954.0
mean,181905.5,299.859493,309.939375,1520.528179,40.335191,104.293962,0.001473,0.005343,0.002353,0.00387,0.002309
std,26256.302529,1.857562,1.385296,139.970419,8.504683,63.871092,0.038355,0.072903,0.048449,0.06209,0.047995
min,136429.0,295.3,305.7,1168.0,3.8,0.0,0.0,0.0,0.0,0.0,0.0
25%,159167.25,298.3,308.7,1432.0,34.6,48.0,0.0,0.0,0.0,0.0,0.0
50%,181905.5,300.0,310.0,1493.0,40.5,106.0,0.0,0.0,0.0,0.0,0.0
75%,204643.75,301.2,310.9,1579.0,46.2,158.0,0.0,0.0,0.0,0.0,0.0
max,227382.0,304.4,313.8,2886.0,76.6,253.0,1.0,1.0,1.0,1.0,1.0


In [35]:
duplicate_count = df.duplicated().sum()
duplicate_count

np.int64(0)

In [36]:
# Missing-value count per column
# (isnull() is an alias of isna(), so a single call is enough)
df.isna().sum()

id                         0
Product ID                 0
Type                       0
Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
TWF                        0
HDF                        0
PWF                        0
OSF                        0
RNF                        0
dtype: int64

### 1.2 Variable Descriptions

1. **Type**
   Indicates the quality variant of the product: Low (`L`), Medium (`M`), or High (`H`).

2. **Air Temperature [K]**
   Represents the air temperature, which is simulated using a random process and adjusted to have a certain variability around a standard value.

3. **Process Temperature [K]**
   Represents the temperature within the process, generated with a slight increase over the air temperature and adjusted to a specific variability.

4. **Rotational Speed [rpm]**
   Describes the speed at which the machine operates, calculated based on a fixed power level with added random variation.

5. **Torque [Nm]**
   Measures the torque (rotational force) applied by the machine, distributed around a certain average value with specific variation, ensuring positive values only.

6. **Tool Wear [min]**
   Indicates the wear on the tool, with the duration increasing based on the quality category of the product.

7. **Tool Wear Failure (`TWF`)**
   Occurs when the tool is replaced or fails after a certain amount of usage time, which is randomly determined within a specific range.

8. **Heat Dissipation Failure (`HDF`)**
   Happens if the temperature difference between the air and process is too small and the machine speed is below a certain threshold.

9. **Power Failure (`PWF`)**
   Occurs when the power required for the process, calculated from torque and speed, falls outside of a defined acceptable range.

10. **Overstrain Failure (`OSF`)**
    Occurs if the combined effect of tool wear and torque exceeds specific limits based on the product quality.

11. **Random Failure (`RNF`)**
    Represents a small probability of failure occurring randomly, independent of other process parameters.
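Several of the failure rules above depend on combined quantities rather than raw sensor readings. As a rough sketch, the temperature gap and the mechanical power can be derived as shown below. The helper `add_physical_features` and the column names `temp_diff_K` and `power_W` are illustrative assumptions, not part of the dataset; the power formula is the standard relation between torque and angular speed. No failure thresholds are applied here because the descriptions above do not state them.

```python
import numpy as np
import pandas as pd

def add_physical_features(frame):
    """Return a copy with two derived columns (names are illustrative)."""
    out = frame.copy()
    # Temperature gap between process and air, in kelvin.
    out["temp_diff_K"] = out["Process temperature [K]"] - out["Air temperature [K]"]
    # Mechanical power in watts: torque [Nm] times angular speed [rad/s].
    omega = out["Rotational speed [rpm]"] * 2 * np.pi / 60
    out["power_W"] = out["Torque [Nm]"] * omega
    return out

# Two rows copied from the df.head() output shown earlier.
sample = pd.DataFrame({
    "Air temperature [K]": [302.3, 301.7],
    "Process temperature [K]": [311.5, 311.0],
    "Rotational speed [rpm]": [1499, 1713],
    "Torque [Nm]": [38.0, 28.8],
})
features = add_physical_features(sample)
print(features[["temp_diff_K", "power_W"]].round(1))
```

On the full data, `add_physical_features(df)` would give two candidate features for studying heat dissipation and power failures.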

### 1.3 Dataset Overview

The dataset consists of **90,954 machine operation records** with **13 columns**, representing operational, thermal, mechanical, and failure-related characteristics of a milling machine. All columns contain **non-null values**, indicating the absence of structurally missing data. While this reduces the need for imputation, further validation is required to identify **outliers, abnormal operating conditions, or logically inconsistent values** that may affect downstream analysis and modeling.
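One possible way to start that validation is sketched below: an interquartile-range (IQR) rule for outliers and a consistency check that the process temperature sits above the air temperature, as the variable descriptions suggest. Both helper functions and the factor `k=1.5` are assumptions chosen for illustration, and the last row of the sample frame is hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each listed column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    outside = frame[columns].lt(q1 - k * iqr) | frame[columns].gt(q3 + k * iqr)
    return outside.sum()

def inconsistent_temperature_rows(frame):
    """Rows where the process temperature is not above the air temperature."""
    return frame[frame["Process temperature [K]"] <= frame["Air temperature [K]"]]

# First five rows follow the df.head() output; the last row is made up
# to trigger both checks.
sample = pd.DataFrame({
    "Air temperature [K]": [302.3, 301.7, 301.3, 300.1, 303.4, 304.4],
    "Process temperature [K]": [311.5, 311.0, 310.4, 309.6, 312.3, 303.0],
    "Torque [Nm]": [38.0, 28.8, 37.7, 47.6, 41.3, 76.6],
})
outliers = iqr_outlier_counts(sample, ["Torque [Nm]"])
bad_rows = inconsistent_temperature_rows(sample)
print(outliers)
print(bad_rows)
```

Flagged rows are candidates for inspection rather than automatic removal, since extreme operating conditions may be exactly where failures occur.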

The dataset is derived from the **AI4I 2020 Predictive Maintenance dataset** and represents a larger version of the original data.

---

**Key Observations**

**1. Data Volume & Completeness**
- The dataset includes **90,954 observations**, providing sufficient scale for **exploratory data analysis (EDA)** and robust machine learning experiments.
- All **13 columns are fully populated**, enabling straightforward preprocessing without handling missing values.

---

**2. Feature Composition and Data Types**
The dataset contains a structured mix of feature types:

- **Numerical features**
  - Continuous variables such as `Air temperature [K]`, `Process temperature [K]`, and `Torque [Nm]`, capturing thermal and mechanical behavior.
  - Discrete numerical variables including `Rotational speed [rpm]` and `Tool wear [min]`, representing operational intensity and tool condition.

- **Categorical features**
  - `Product ID` and `Type`, indicating product variants and quality categories used during machine operation.

- **Failure indicator variables**
  - Binary columns (`TWF`, `HDF`, `PWF`, `OSF`, `RNF`) representing independent failure modes associated with machine operation.

- **Identifier attribute**
  - `id`, serving as a unique row-level identifier and not intended for predictive modeling.
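The grouping above can be captured in code so that later preprocessing steps treat each role consistently. This is a minimal sketch: the helper `split_columns` and the constants `FAILURE_COLS` and `ID_COLS` are names introduced here for illustration.

```python
import pandas as pd

FAILURE_COLS = ["TWF", "HDF", "PWF", "OSF", "RNF"]
ID_COLS = ["id"]

def split_columns(frame):
    """Group column names by role: numeric sensors, categorical, failure flags, ids."""
    reserved = set(FAILURE_COLS + ID_COLS)
    numeric = [c for c in frame.select_dtypes(include="number").columns
               if c not in reserved]
    categorical = frame.select_dtypes(exclude="number").columns.tolist()
    return {
        "numeric": numeric,
        "categorical": categorical,
        "failure": [c for c in FAILURE_COLS if c in frame.columns],
        "id": [c for c in ID_COLS if c in frame.columns],
    }

# Two rows copied from the df.head() output shown earlier.
sample = pd.DataFrame({
    "id": [136429, 136430],
    "Product ID": ["L50896", "L53866"],
    "Type": ["L", "L"],
    "Air temperature [K]": [302.3, 301.7],
    "Process temperature [K]": [311.5, 311.0],
    "Rotational speed [rpm]": [1499, 1713],
    "Torque [Nm]": [38.0, 28.8],
    "Tool wear [min]": [60, 17],
    "TWF": [0, 0], "HDF": [0, 0], "PWF": [0, 0], "OSF": [0, 0], "RNF": [0, 0],
})
roles = split_columns(sample)
print(roles)
```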

---

**3. Target Variable Construction**
The dataset does not provide a single explicit target label for machine failure. Instead, failure information is distributed across **five independent binary failure indicators**:

- **Tool Wear Failure (TWF)**
- **Heat Dissipation Failure (HDF)**
- **Power Failure (PWF)**
- **Overstrain Failure (OSF)**
- **Random Failure (RNF)**

A derived **machine failure target** must therefore be constructed, where a machine failure is defined as the occurrence of **at least one** of the above failure modes within a given observation. This mirrors realistic industrial scenarios in which the precise failure cause may not be directly observable at prediction time.
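A minimal sketch of this construction follows. The target column name `machine_failure` and the helper `add_failure_target` are assumptions made for illustration, and the sample rows are hypothetical flag combinations rather than records from the dataset.

```python
import pandas as pd

FAILURE_COLS = ["TWF", "HDF", "PWF", "OSF", "RNF"]

def add_failure_target(frame, target_name="machine_failure"):
    """Return a copy with a binary target: 1 if any failure flag is set, else 0."""
    out = frame.copy()
    out[target_name] = out[FAILURE_COLS].any(axis=1).astype(int)
    return out

# Hypothetical flag combinations: no failure, one mode, no failure, two modes.
sample = pd.DataFrame({
    "TWF": [0, 1, 0, 0],
    "HDF": [0, 0, 0, 1],
    "PWF": [0, 0, 0, 1],
    "OSF": [0, 0, 0, 0],
    "RNF": [0, 0, 0, 0],
})
labeled = add_failure_target(sample)
print(labeled["machine_failure"].tolist())
# prints: [0, 1, 0, 1]
```

A row with two simultaneous failure modes still counts as a single failure, so the number of positive targets can be smaller than the sum of the individual mode counts.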

---

**4. Failure Distribution and Imbalance**
The individual failure mode counts in the dataset are as follows:

- **Tool Wear Failure (TWF):** 134 occurrences  
- **Heat Dissipation Failure (HDF):** 486 occurrences  
- **Power Failure (PWF):** 214 occurrences  
- **Overstrain Failure (OSF):** 352 occurrences  
- **Random Failure (RNF):** 210 occurrences  

These counts indicate that failure events are **rare relative to the total number of observations**: each mode occurs in roughly 0.15% to 0.53% of the 90,954 rows, and even if no two modes ever co-occurred, at most 1,396 rows (about 1.5%) would contain a failure. This results in a **highly imbalanced classification problem** once a unified machine failure target is constructed. This imbalance reflects real-world predictive maintenance settings, where failure events are infrequent but operationally critical.
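The rates behind these counts can be reproduced with a short calculation. The sketch below uses only the counts listed above and the row count from `df.shape`; the upper bound assumes that no two failure modes occur in the same row, which has not been verified here.

```python
import pandas as pd

n_rows = 90954
counts = pd.Series({"TWF": 134, "HDF": 486, "PWF": 214, "OSF": 352, "RNF": 210})

# Share of rows affected by each failure mode, in percent.
rates_pct = counts / n_rows * 100

# Upper bound on rows with any failure (assumes the modes never overlap).
max_failure_rows = int(counts.sum())
max_failure_pct = max_failure_rows / n_rows * 100

# Lower bound on the ratio of normal rows to failure rows.
min_imbalance_ratio = (n_rows - max_failure_rows) / max_failure_rows

print(rates_pct.round(2))
print(f"at most {max_failure_rows} failure rows ({max_failure_pct:.2f}%)")
print(f"at least {min_imbalance_ratio:.0f} normal rows per failure row")
```

On the loaded data, the per-mode counts themselves should be recoverable with `df[["TWF", "HDF", "PWF", "OSF", "RNF"]].sum()`.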

---

**5. Operational and Mechanical Characteristics**
- Thermal conditions are captured through air and process temperatures, enabling analysis of **heat-related failure mechanisms**.
- Mechanical stress is represented by torque, rotational speed, and accumulated tool wear, which are directly linked to **mechanical degradation and overstrain failures**.
- Product quality variations, captured via the `Type` feature, introduce heterogeneity in operating conditions and failure thresholds.

---

**Summary**

Overall, the dataset is **well-structured, complete, and operationally rich**, making it well-suited for **predictive maintenance and failure detection tasks**. The absence of missing values simplifies preprocessing, while the presence of multiple failure modes and severe class imbalance provides a realistic and challenging environment for exploring **imbalanced classification techniques**, feature engineering, and model evaluation strategies.


### 1.4 Visualizations