## Uploading Data

In [None]:
from google.colab import files
uploaded = files.upload()

## Import Libraries & Load Data

In [24]:
import pandas as pd
import numpy as np

# Load raw data
raw_data = pd.read_csv('machine failure data.csv')

## Feature Engineering & Preprocessing

Based on the insights from the Exploratory Data Analysis (EDA):

- Elevated process temperatures with low temperature differences (ΔT) increase the risk of Heat Dissipation Failures (HDF).
- Extreme torque × rotational speed combinations indicate mechanical stress, contributing to Power or Overstrain Failures (PWF/OSF).
- Tool wear accumulates over time, with high wear strongly correlating with Tool Wear Failures (TWF).

To prepare the dataset for statistical and machine learning modeling, we performed the following:

1. **Derived physics-informed features:**  
   - **ΔT (Temperature Gradient):** captures thermal stress.  
   - **Power (Torque × Speed):** captures combined mechanical load.  
   - **Normalized Tool Wear:** allows models to interpret relative degradation.  
   - **Interaction Feature (ΔT × Tool Wear):** captures combined thermal + mechanical risk patterns.  

2. **Dropped irrelevant columns:**  
   - `Product ID` (high cardinality, no physical meaning).  

3. **Encoded categorical columns:**  
   - `Type` column transformed into `Type_encoded` (L=0, M=1, H=2) for numeric model input.

After these steps, every model input is numeric, interpretable, and tied to a physical quantity.


## Feature Engineering

## 3.1 Temperature Gradient (ΔT)

Why: The EDA showed that heat dissipation failures occur when the process temperature is high but its difference from the air temperature is small.

In [35]:
# Create Temperature Gradient
raw_data['Temp_diff'] = raw_data['Process temperature [K]'] - raw_data['Air temperature [K]']

# Quick check
raw_data[['Air temperature [K]', 'Process temperature [K]', 'Temp_diff']].head()

| | Air temperature [K] | Process temperature [K] | Temp_diff |
|---|---|---|---|
| 0 | 298.1 | 308.6 | 10.5 |
| 1 | 298.2 | 308.7 | 10.5 |
| 2 | 298.1 | 308.5 | 10.4 |
| 3 | 298.2 | 308.6 | 10.4 |
| 4 | 298.2 | 308.7 | 10.5 |


## 3.2 Mechanical Stress Feature (Power Proxy)

Why: Failures occur at extreme torque × speed combinations. The product of torque and rotational speed is proportional to mechanical power, so a single column captures the combined mechanical load.

In [36]:
# Create Power feature: Torque × Rotational Speed (units of Nm·rpm, a proxy rather than watts)
raw_data['Power'] = raw_data['Torque [Nm]'] * raw_data['Rotational speed [rpm]']

# Normalize by max power for scaling
raw_data['Power_norm'] = raw_data['Power'] / raw_data['Power'].max()

raw_data[['Torque [Nm]', 'Rotational speed [rpm]', 'Power', 'Power_norm']].head()

| | Torque [Nm] | Rotational speed [rpm] | Power | Power_norm |
|---|---|---|---|---|
| 0 | 42.8 | 1551 | 66382.8 | 0.663958 |
| 1 | 46.3 | 1408 | 65190.4 | 0.652032 |
| 2 | 49.4 | 1498 | 74001.2 | 0.740157 |
| 3 | 39.5 | 1433 | 56603.5 | 0.566146 |
| 4 | 40.0 | 1408 | 56320.0 | 0.56331 |
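
Because `Torque [Nm]` × `Rotational speed [rpm]` has units of Nm·rpm, `Power` is a proxy rather than power in watts. If physical units are wanted, the conversion can be sketched as below, using the first three rows of the output above. The column name `Power_W` is an assumption for illustration and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# First three torque/speed rows from the output above
df = pd.DataFrame({
    'Torque [Nm]': [42.8, 46.3, 49.4],
    'Rotational speed [rpm]': [1551, 1408, 1498],
})

# rpm -> rad/s: one revolution is 2*pi radians, one minute is 60 seconds
omega_rad_s = df['Rotational speed [rpm]'] * 2 * np.pi / 60

# Mechanical power in watts = torque [Nm] * angular velocity [rad/s]
df['Power_W'] = df['Torque [Nm]'] * omega_rad_s  # hypothetical column name

# The Nm*rpm proxy used in the notebook, for comparison
proxy = df['Torque [Nm]'] * df['Rotational speed [rpm]']

print(df['Power_W'].round(1))
```

The conversion is a constant factor (2π/60), so a max-normalized column such as `Power_norm` comes out the same whether it is computed from the proxy or from watts.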


## 3.3 Normalized Tool Wear

Why: Tool wear strongly correlates with failures. Dividing by the maximum observed wear rescales it to the 0-1 range, so each value reads as a fraction of the highest wear seen in the dataset.

In [40]:
# Normalize Tool Wear (0-1)
raw_data['Tool_wear_norm'] = raw_data['Tool wear [min]'] / raw_data['Tool wear [min]'].max()

raw_data[['Tool wear [min]', 'Tool_wear_norm']].head()

| | Tool wear [min] | Tool_wear_norm |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 3 | 0.011858 |
| 2 | 5 | 0.019763 |
| 3 | 7 | 0.027668 |
| 4 | 9 | 0.035573 |
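
Dividing by `.max()` ties the scale to whatever rows are present when the cell runs. If the same feature is later computed on a held-out split or on new data, one option is to store the maximum once and reuse it. A minimal sketch follows, where `reference_wear`, `new_wear`, and `wear_scale` are hypothetical names and the `new_wear` values are made up for illustration.

```python
import pandas as pd

# Reference values: the first five tool-wear readings shown above
reference_wear = pd.Series([0, 3, 5, 7, 9], name='Tool wear [min]')

# Hypothetical later readings (illustrative values only)
new_wear = pd.Series([4, 12], name='Tool wear [min]')

# Compute the scale once, on the reference data only
wear_scale = float(reference_wear.max())

# Reuse the stored scale so both sets share the same units
reference_norm = reference_wear / wear_scale
new_norm = new_wear / wear_scale

# Readings above the stored maximum map to values above 1
print(new_norm.tolist())
```

Values above 1 then indicate wear beyond anything in the reference data. Whether to clip them is a modeling choice.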


## 3.4 Interaction Feature: ΔT × Tool Wear

Why: Captures combined thermal and mechanical stress; may reveal risk patterns not visible from individual features alone.

In [38]:
# Interaction between thermal and mechanical degradation
raw_data['Temp_Wear_Interaction'] = raw_data['Temp_diff'] * raw_data['Tool_wear_norm']

raw_data['Temp_Wear_Interaction'].head()

| | Temp_Wear_Interaction |
|---|---|
| 0 | 0.0 |
| 1 | 0.124506 |
| 2 | 0.205534 |
| 3 | 0.287747 |
| 4 | 0.373518 |


## 3.5 Drop Product ID Column

The `Product ID` column has extremely high cardinality and carries no physically meaningful information for predicting failures. Keeping it would only add noise to the model.

In [28]:
# Drop Product ID: it has too many unique values to be useful for modeling.
# errors='ignore' keeps this cell safe to re-run once the column is already gone.
raw_data = raw_data.drop(columns=['Product ID'], errors='ignore')
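
The cardinality claim can be checked before a column is dropped. The sketch below uses toy rows with placeholder identifiers (not values from the real file) and an arbitrary 0.9 threshold. Both are assumptions for illustration.

```python
import pandas as pd

# Toy rows; the identifiers and types are placeholders, not real data
toy = pd.DataFrame({
    'UDI': [1, 2, 3, 4],
    'Product ID': ['A1', 'A2', 'A3', 'A4'],
    'Type': ['M', 'L', 'L', 'H'],
})

# Share of distinct values per column: 1.0 means every row is unique
unique_ratio = toy.nunique() / len(toy)

# Columns that are (almost) unique per row behave like identifiers
ID_THRESHOLD = 0.9  # arbitrary cut-off chosen for this sketch
id_like = unique_ratio[unique_ratio > ID_THRESHOLD].index.tolist()

print(unique_ratio)
print(id_like)
```

On this toy frame `UDI` is flagged as well. In the notebook `UDI` remains among the processed columns, so it may be worth excluding from the model inputs too.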

## 3.6 Encode Type Column

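The `Type` column takes three values, L, M, and H, which we treat as machine sizes or classes:

L → Small/Light machine

M → Medium machine

H → Heavy/Large machine

In the AI4I 2020 predictive maintenance dataset, whose column names match this file, L, M, and H denote low, medium, and high product quality variants. If this file derives from that dataset, the labels above should be read that way. The categories are ordered under either reading.

Since statistical and ML models need numeric input, we encode the column as ordinal values (one-hot encoding is an alternative).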

In [29]:
if 'Type' in raw_data.columns:
    type_mapping = {'L': 0, 'M': 1, 'H': 2}
    raw_data['Type_encoded'] = raw_data['Type'].map(type_mapping)
    raw_data = raw_data.drop(columns=['Type'])
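
For the one-hot alternative mentioned above, a minimal sketch using `pd.get_dummies` is shown here on toy `Type` values (assumed for illustration). It also counts values that the ordinal mapping would silently turn into `NaN`, since `Series.map` returns `NaN` for keys missing from the dictionary.

```python
import pandas as pd

# Toy Type values for illustration; 'X' is a deliberately unexpected label
types = pd.Series(['M', 'L', 'L', 'H', 'X'], name='Type')

# Ordinal mapping as used in the notebook
type_mapping = {'L': 0, 'M': 1, 'H': 2}
encoded = types.map(type_mapping)

# Labels missing from the mapping become NaN; count them before modeling
n_unmapped = int(encoded.isna().sum())

# One-hot alternative: fixing the categories keeps the column order stable
# (L, M, H) and yields all-zero rows for labels outside that set
categorical = types.astype(pd.CategoricalDtype(['L', 'M', 'H']))
one_hot = pd.get_dummies(categorical, prefix='Type', dtype=int)

print(n_unmapped)
print(one_hot)
```

Ordinal codes suit models that can exploit the L < M < H ordering. One-hot columns avoid implying equal spacing between the classes, at the cost of two extra columns.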

## Checking

In [39]:
# Check all column names in the processed dataframe
print(raw_data.columns)

Index(['UDI', 'Air temperature [K]', 'Process temperature [K]',
       'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]',
       'Machine failure', 'TWF', 'HDF', 'PWF', 'OSF', 'RNF', 'Temp_diff',
       'Power', 'Power_norm', 'Tool_wear_norm', 'Type_encoded',
       'Temp_Wear_Interaction'],
      dtype='object')
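
Beyond printing the column names, the check can be automated. The sketch below compares a frame against the column list printed above and reports any columns containing `NaN`. The helper name `check_processed` and the toy frame are assumptions for illustration.

```python
import pandas as pd

# Column names as printed above
EXPECTED_COLUMNS = [
    'UDI', 'Air temperature [K]', 'Process temperature [K]',
    'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]',
    'Machine failure', 'TWF', 'HDF', 'PWF', 'OSF', 'RNF', 'Temp_diff',
    'Power', 'Power_norm', 'Tool_wear_norm', 'Type_encoded',
    'Temp_Wear_Interaction',
]

def check_processed(df):
    """Return (missing expected columns, {column: NaN count} where count > 0)."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    nan_counts = df.isna().sum()
    with_nans = nan_counts[nan_counts > 0].to_dict()
    return missing, with_nans

# Toy frame: all expected columns present, one NaN planted in Type_encoded
toy = pd.DataFrame({c: [0.0, 1.0] for c in EXPECTED_COLUMNS})
toy.loc[1, 'Type_encoded'] = float('nan')

missing, with_nans = check_processed(toy)
print(missing, with_nans)
```

In the notebook the call would be `check_processed(raw_data)`. An empty list and an empty dictionary mean every expected column is present and fully populated.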


## Saving Processed Data


In [None]:
# Save to Colab environment
raw_data.to_csv('Machine_failure_processed.csv', index=False)

# Then download to your PC
from google.colab import files
files.download('Machine_failure_processed.csv')

print("Download started")
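
The cell above only works inside Colab, because `google.colab` is not importable elsewhere. A hedged sketch of a save step that also runs locally is shown below. The toy frame, the temporary directory, and the file name `toy_processed.csv` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

try:
    from google.colab import files  # available only inside Colab
except ImportError:
    files = None  # running outside Colab: skip the browser download

# Toy frame standing in for the processed data
toy = pd.DataFrame({'Temp_diff': [10.5, 10.4], 'Type_encoded': [1, 0]})

# Write the CSV first; this step is the same in every environment
out_path = os.path.join(tempfile.mkdtemp(), 'toy_processed.csv')
toy.to_csv(out_path, index=False)

if files is not None:
    files.download(out_path)  # triggers a browser download in Colab
else:
    print(f"Saved to {out_path}")

# Read the file back to confirm it was written as expected
reloaded = pd.read_csv(out_path)
```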

### Summary

- Created physics-informed features: `Temp_diff` (ΔT, temperature gradient), `Power` and `Power_norm` (Torque × Rotational Speed), and `Tool_wear_norm`.
- Added an interaction feature, `Temp_Wear_Interaction` (ΔT × `Tool_wear_norm`), to capture combined thermal + mechanical risk.
- Encoded `Type` column as Type_encoded (L=0, M=1, H=2) for model compatibility.
- Dropped the high-cardinality, irrelevant `Product ID` column.
- Saved the processed dataset for future modeling steps.
- The dataset is now fully numeric and structured for both statistical analysis and predictive modeling.
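
The steps summarized above can be sketched as one function, which makes it easier to apply the same preprocessing to another file. This is a consolidation under stated assumptions rather than the notebook's own code: the function name `engineer_features` is hypothetical, it returns a copy instead of modifying `raw_data` in place, and the `Product ID` and `Type` values in the toy rows are placeholders. The sensor values are the first three rows shown in the outputs earlier.

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the derived features from this notebook."""
    out = df.copy()
    # Thermal feature: process temperature minus air temperature
    out['Temp_diff'] = out['Process temperature [K]'] - out['Air temperature [K]']
    # Mechanical load proxy and its max-normalized version
    out['Power'] = out['Torque [Nm]'] * out['Rotational speed [rpm]']
    out['Power_norm'] = out['Power'] / out['Power'].max()
    # Tool wear as a fraction of the maximum observed wear
    out['Tool_wear_norm'] = out['Tool wear [min]'] / out['Tool wear [min]'].max()
    # Interaction between thermal gradient and normalized wear
    out['Temp_Wear_Interaction'] = out['Temp_diff'] * out['Tool_wear_norm']
    # Ordinal encoding of Type, then drop the columns no longer needed
    out['Type_encoded'] = out['Type'].map({'L': 0, 'M': 1, 'H': 2})
    return out.drop(columns=['Product ID', 'Type'])

# Toy input: sensor values from the outputs above, placeholder IDs and types
toy = pd.DataFrame({
    'Product ID': ['A1', 'A2', 'A3'],
    'Type': ['M', 'L', 'L'],
    'Air temperature [K]': [298.1, 298.2, 298.1],
    'Process temperature [K]': [308.6, 308.7, 308.5],
    'Rotational speed [rpm]': [1551, 1408, 1498],
    'Torque [Nm]': [42.8, 46.3, 49.4],
    'Tool wear [min]': [0, 3, 5],
})

features = engineer_features(toy)
print(features.columns.tolist())
```

Note that the normalized columns depend on the maxima of whichever frame is passed in, so results on a small sample differ from those on the full dataset.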
