# Data Preprocessing

1. `Check for missing values`: Check whether the dataset contains missing values and decide how to handle them. If many values are missing, consider dropping the affected rows or imputing appropriate values.

2. `Check for duplicates`: Check for duplicate rows in the dataset and remove them if necessary.

3. `Data type conversion`: Check that the data type of each column is appropriate. For example, the 'UDI' column should be of integer data type, while the 'Product ID' column should be categorical.

4. `Feature engineering`: Create new features that may be relevant for predictive maintenance. For example, the difference between 'Process temperature' and 'Air temperature' may be an important indicator of machine failure, and the product of 'Rotational speed' and 'Torque' captures power.

5. `Label encoding`: Convert categorical variables such as 'Type' to numerical values using label encoding or one-hot encoding, depending on the algorithm you plan to use.

6. `Feature scaling`: Normalize or standardize the continuous variables so that they have similar ranges. This helps gradient-based learning algorithms converge faster and keeps distance-based methods from being dominated by large-valued features.

7. `Balance the dataset`: Check whether the dataset is balanced in terms of the 'Machine failure' label. If there are far more non-failure instances than failure instances, consider oversampling or undersampling to balance the dataset.

8. `Save the preprocessed dataset`: Save the preprocessed dataset in a suitable format, such as CSV or Parquet, for future use.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, MinMaxScaler

from imblearn.over_sampling import SMOTE

print('Libraries imported')

Libraries imported


In [2]:
raw_df = pd.read_csv('./data/raw/data.csv')
raw_df.head()

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0


### 1. Checking for missing values

In [3]:
raw_df.isnull().sum()*100/len(raw_df)

UDI                        0.0
Product ID                 0.0
Type                       0.0
Air temperature [K]        0.0
Process temperature [K]    0.0
Rotational speed [rpm]     0.0
Torque [Nm]                0.0
Tool wear [min]            0.0
Machine failure            0.0
TWF                        0.0
HDF                        0.0
PWF                        0.0
OSF                        0.0
RNF                        0.0
dtype: float64

* No missing values are present in this dataframe: every column reports 0.0% nulls.
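
Since this dataset has no gaps, the handling step from the plan is never exercised. The following is a minimal sketch of what it could look like, assuming a hypothetical toy frame (`toy`, with invented NaN values) and assuming median and mode imputation are acceptable choices for the data at hand.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_df; the missing values are invented for illustration.
toy = pd.DataFrame({
    "Torque [Nm]": [42.8, np.nan, 49.4, 39.5],
    "Type": ["M", "L", None, "L"],
})

# Share of missing values per column, in percent (same idea as the cell above).
missing_pct = toy.isnull().mean() * 100

# Numeric column: fill with the median. Categorical column: fill with the mode.
filled = toy.copy()
filled["Torque [Nm]"] = filled["Torque [Nm]"].fillna(filled["Torque [Nm]"].median())
filled["Type"] = filled["Type"].fillna(filled["Type"].mode().iloc[0])
```

Dropping rows with `toy.dropna()` is the simpler alternative when only a few rows are affected.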

### 2. Checking for duplicates

In [4]:
raw_df['Product ID'].duplicated().sum()

0

In [5]:
raw_df['UDI'].unique(), raw_df['UDI'].nunique()

(array([    1,     2,     3, ...,  9998,  9999, 10000]), 10000)

* No duplicate 'Product ID' values were found, and all 10000 'UDI' values are unique, so this dataframe contains no duplicate records.
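
The checks above look at the identifier columns only. As a complementary sketch, whole-row duplicates can be counted and removed with `DataFrame.duplicated` and `DataFrame.drop_duplicates`; the toy frame below is hypothetical and contains one deliberately repeated row.

```python
import pandas as pd

# Hypothetical toy frame with the last row repeated on purpose.
toy = pd.DataFrame({
    "UDI": [1, 2, 3, 3],
    "Product ID": ["M14860", "L47181", "L47182", "L47182"],
    "Torque [Nm]": [42.8, 46.3, 49.4, 49.4],
})

# Count rows that are exact copies of an earlier row.
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index.
deduped = toy.drop_duplicates().reset_index(drop=True)
```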

### 3. Data Type conversion

In [6]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  float64
 4   Process temperature [K]  10000 non-null  float64
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
 9   TWF                      10000 non-null  int64  
 10  HDF                      10000 non-null  int64  
 11  PWF                      10000 non-null  int64  
 12  OSF                      10000 non-null  int64  
 13  RNF                      10000 non-null  int64  
dtypes: float64(3), int64(9), object(2)
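
The `info()` output shows 'Type' stored as `object`. One possible conversion, sketched here on a hypothetical toy frame, is an ordered pandas categorical; the ordering L < M < H is an assumption that mirrors the category order used in the encoding step.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the dataset.
toy = pd.DataFrame({"Type": ["M", "L", "L", "H"], "Machine failure": [0, 0, 1, 0]})

# Ordered categorical: assumed quality order L < M < H.
type_dtype = pd.CategoricalDtype(categories=["L", "M", "H"], ordered=True)
toy["Type"] = toy["Type"].astype(type_dtype)

# A 0/1 flag fits comfortably in a small integer type.
toy["Machine failure"] = toy["Machine failure"].astype("int8")
```

The integer codes of the categorical are available through `toy["Type"].cat.codes`.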

### 4. Feature engineering

In [7]:
def type_of_failure(row):
    if row['TWF'] == 1:
        return 'TWF'
    elif row['HDF'] == 1:
        return 'HDF'
    elif row['PWF'] == 1:
        return 'PWF'
    elif row['OSF'] == 1:
        return 'OSF'
    elif row['RNF'] == 1:
        return 'RNF'
    else:
        return 'No Failure'

# Apply the type_of_failure function to create the 'type_of_failure' column
raw_df['TYPE_OF_FAILURE'] = raw_df.apply(type_of_failure, axis=1)
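
Row-wise `apply` works but is slow on larger frames. A vectorized sketch of the same first-match rule, using `np.select` on a hypothetical toy frame, might look like this. The `multi` mask is an added illustration for spotting rows with more than one flag set, which the first-match rule collapses into a single label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; row 2 has two failure flags set at once.
toy = pd.DataFrame({
    "TWF": [0, 1, 0, 0],
    "HDF": [0, 0, 1, 0],
    "PWF": [0, 0, 1, 0],
    "OSF": [0, 0, 0, 0],
    "RNF": [0, 0, 0, 1],
})

# Same priority order as the if/elif chain in type_of_failure.
flags = ["TWF", "HDF", "PWF", "OSF", "RNF"]
conditions = [toy[f] == 1 for f in flags]

# np.select returns the choice for the first condition that is True.
toy["TYPE_OF_FAILURE"] = np.select(conditions, flags, default="No Failure")

# Rows where more than one failure flag is set.
multi = toy[flags].sum(axis=1) > 1
```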

In [8]:
raw_df

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF,TYPE_OF_FAILURE
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0,No Failure
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0,No Failure
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0,No Failure
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0,No Failure
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0,No Failure
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,M24855,M,298.8,308.4,1604,29.5,14,0,0,0,0,0,0,No Failure
9996,9997,H39410,H,298.9,308.4,1632,31.8,17,0,0,0,0,0,0,No Failure
9997,9998,M24857,M,299.0,308.6,1645,33.4,22,0,0,0,0,0,0,No Failure
9998,9999,H39412,H,299.0,308.7,1408,48.5,25,0,0,0,0,0,0,No Failure


In [9]:
raw_df.TYPE_OF_FAILURE.value_counts()

TYPE_OF_FAILURE
No Failure    9652
HDF            115
PWF             91
OSF             78
TWF             46
RNF             18
Name: count, dtype: int64

In [10]:
raw_df.drop(['TWF', 'HDF', 'PWF', 'OSF', 'RNF'], axis=1, inplace=True)

In [11]:
raw_df.drop(['UDI', 'Product ID'], axis=1, inplace=True)
raw_df.head()

Unnamed: 0,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TYPE_OF_FAILURE
0,M,298.1,308.6,1551,42.8,0,0,No Failure
1,L,298.2,308.7,1408,46.3,3,0,No Failure
2,L,298.1,308.5,1498,49.4,5,0,No Failure
3,L,298.2,308.6,1433,39.5,7,0,No Failure
4,L,298.2,308.7,1408,40.0,9,0,No Failure


In [12]:
### Converting Kelvin to Celsius

raw_df['Air temperature [C]'] = raw_df['Air temperature [K]'] - 273.15
raw_df['Process temperature [C]'] = raw_df['Process temperature [K]'] - 273.15
raw_df.drop(['Air temperature [K]', 'Process temperature [K]'], axis=1, inplace=True)
raw_df.head()

Unnamed: 0,Type,Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TYPE_OF_FAILURE,Air temperature [C],Process temperature [C]
0,M,1551,42.8,0,0,No Failure,24.95,35.45
1,L,1408,46.3,3,0,No Failure,25.05,35.55
2,L,1498,49.4,5,0,No Failure,24.95,35.35
3,L,1433,39.5,7,0,No Failure,25.05,35.45
4,L,1408,40.0,9,0,No Failure,25.05,35.55


In [13]:
### Creating a new column by multiplying 'rotational speed' and 'torque' as a proxy for power (the product is proportional to mechanical power, not expressed in watts).

raw_df['Power'] = raw_df['Rotational speed [rpm]'] * raw_df['Torque [Nm]']
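
If a power value in physical units is preferred, the speed can first be converted from rpm to rad/s, since mechanical power is torque times angular velocity. The sketch below uses a hypothetical toy frame; the column names 'Power [W]' and 'Temperature difference [K]' are illustrative and do not appear in the notebook. It also shows the temperature-difference feature mentioned in the plan.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame using the first two rows of the dataset.
toy = pd.DataFrame({
    "Rotational speed [rpm]": [1551, 1408],
    "Torque [Nm]": [42.8, 46.3],
    "Air temperature [K]": [298.1, 298.2],
    "Process temperature [K]": [308.6, 308.7],
})

# Angular velocity in rad/s: one revolution is 2*pi radians, one minute is 60 s.
omega = toy["Rotational speed [rpm]"] * 2 * np.pi / 60

# Mechanical power in watts: P = torque * angular velocity.
toy["Power [W]"] = toy["Torque [Nm]"] * omega

# A temperature difference is identical in Kelvin and Celsius.
toy["Temperature difference [K]"] = (
    toy["Process temperature [K]"] - toy["Air temperature [K]"]
)
```

Because the conversion is a constant factor, the rpm-times-torque product carries the same information for a model; the watt version mainly helps interpretation.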

In [14]:
raw_df

Unnamed: 0,Type,Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TYPE_OF_FAILURE,Air temperature [C],Process temperature [C],Power
0,M,1551,42.8,0,0,No Failure,24.95,35.45,66382.8
1,L,1408,46.3,3,0,No Failure,25.05,35.55,65190.4
2,L,1498,49.4,5,0,No Failure,24.95,35.35,74001.2
3,L,1433,39.5,7,0,No Failure,25.05,35.45,56603.5
4,L,1408,40.0,9,0,No Failure,25.05,35.55,56320.0
...,...,...,...,...,...,...,...,...,...
9995,M,1604,29.5,14,0,No Failure,25.65,35.25,47318.0
9996,H,1632,31.8,17,0,No Failure,25.75,35.25,51897.6
9997,M,1645,33.4,22,0,No Failure,25.85,35.45,54943.0
9998,H,1408,48.5,25,0,No Failure,25.85,35.55,68288.0


### 5. Encoding

In [15]:
### Ordinal Encoding

encoder = OrdinalEncoder(categories=[['L', 'M', 'H']])

raw_df['Type'] = encoder.fit_transform(raw_df[['Type']])
raw_df.head()

Unnamed: 0,Type,Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TYPE_OF_FAILURE,Air temperature [C],Process temperature [C],Power
0,1.0,1551,42.8,0,0,No Failure,24.95,35.45,66382.8
1,0.0,1408,46.3,3,0,No Failure,25.05,35.55,65190.4
2,0.0,1498,49.4,5,0,No Failure,24.95,35.35,74001.2
3,0.0,1433,39.5,7,0,No Failure,25.05,35.45,56603.5
4,0.0,1408,40.0,9,0,No Failure,25.05,35.55,56320.0


In [16]:
raw_df['Type'].value_counts()

Type
0.0    6000
1.0    2997
2.0    1003
Name: count, dtype: int64

In [17]:
encoder = LabelEncoder()

raw_df['TYPE_OF_FAILURE'] = encoder.fit_transform(raw_df['TYPE_OF_FAILURE'])
raw_df.head()

Unnamed: 0,Type,Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TYPE_OF_FAILURE,Air temperature [C],Process temperature [C],Power
0,1.0,1551,42.8,0,0,1,24.95,35.45,66382.8
1,0.0,1408,46.3,3,0,1,25.05,35.55,65190.4
2,0.0,1498,49.4,5,0,1,24.95,35.35,74001.2
3,0.0,1433,39.5,7,0,1,25.05,35.45,56603.5
4,0.0,1408,40.0,9,0,1,25.05,35.55,56320.0


In [18]:
raw_df.loc[raw_df['Machine failure'] == 1]

Unnamed: 0,Type,Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TYPE_OF_FAILURE,Air temperature [C],Process temperature [C],Power
50,0.0,2861,4.6,143,1,3,25.75,35.95,13160.6
69,0.0,1410,65.7,191,1,3,25.75,35.85,92637.0
77,0.0,1455,41.3,208,1,5,25.65,35.75,60091.5
160,0.0,1282,60.7,216,1,2,25.25,35.05,77817.4
161,0.0,1412,52.3,218,1,2,25.15,34.95,73847.6
...,...,...,...,...,...,...,...,...,...
9758,0.0,2271,16.2,218,1,5,25.45,36.65,36790.2
9764,0.0,1294,66.7,12,1,3,25.35,36.35,86309.8
9822,0.0,1360,60.9,187,1,2,25.35,36.25,82824.0
9830,0.0,1337,56.1,206,1,2,25.15,36.15,75005.7


In [19]:
raw_df.iloc[101]

Type                           0.00
Rotational speed [rpm]      1991.00
Torque [Nm]                   20.70
Tool wear [min]               59.00
Machine failure                0.00
TYPE_OF_FAILURE                1.00
Air temperature [C]           25.65
Process temperature [C]       35.65
Power                      41213.70
Name: 101, dtype: float64

In [20]:
encoder.classes_

array(['HDF', 'No Failure', 'OSF', 'PWF', 'RNF', 'TWF'], dtype=object)

In [21]:
classes = [0, 1, 2, 3, 4, 5]
encoder.inverse_transform(classes)

array(['HDF', 'No Failure', 'OSF', 'PWF', 'RNF', 'TWF'], dtype=object)

In [22]:
raw_df['TYPE_OF_FAILURE'].value_counts()

TYPE_OF_FAILURE
1    9652
0     115
3      91
2      78
5      46
4      18
Name: count, dtype: int64
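
`LabelEncoder` assigns codes in sorted order of the class names, which is why 'No Failure' ends up as 1 rather than 0. A small sketch of building an explicit code-to-name lookup from `classes_` follows; the `labels` list and the `code_to_label` name are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels covering all six classes seen in the notebook.
labels = ["No Failure", "HDF", "PWF", "OSF", "TWF", "RNF", "No Failure"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each name is its integer code.
code_to_label = {code: str(name) for code, name in enumerate(encoder.classes_)}
```

Keeping such a lookup next to the processed data makes it easier to read the value counts and, later, model predictions.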

### 6. Feature Scaling

In [23]:
scaler = MinMaxScaler()

scale_cols = ['Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]', 'Air temperature [C]', 'Process temperature [C]']

raw_df_scaled = pd.DataFrame(
    scaler.fit_transform(raw_df[scale_cols]),
    columns=scale_cols,
    index=raw_df.index,  # keep the original index so pd.concat aligns rows
)

raw_df.drop(scale_cols, axis=1, inplace=True)

raw_df_scaled = pd.concat([raw_df, raw_df_scaled], axis=1)
raw_df_scaled.head()

Unnamed: 0,Type,Machine failure,TYPE_OF_FAILURE,Power,Rotational speed [rpm],Torque [Nm],Tool wear [min],Air temperature [C],Process temperature [C]
0,1.0,0,1,66382.8,0.222934,0.535714,0.0,0.304348,0.358025
1,0.0,0,1,65190.4,0.139697,0.583791,0.011858,0.315217,0.37037
2,0.0,0,1,74001.2,0.192084,0.626374,0.019763,0.304348,0.345679
3,0.0,0,1,56603.5,0.154249,0.490385,0.027668,0.315217,0.358025
4,0.0,0,1,56320.0,0.139697,0.497253,0.035573,0.315217,0.37037


In [24]:
raw_df_scaled.iloc[101]

Type                           0.000000
Machine failure                0.000000
TYPE_OF_FAILURE                1.000000
Power                      41213.700000
Rotational speed [rpm]         0.479045
Torque [Nm]                    0.232143
Tool wear [min]                0.233202
Air temperature [C]            0.380435
Process temperature [C]        0.382716
Name: 101, dtype: float64
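
In the cells above, 'Power' is not part of `scale_cols`, so it keeps its raw magnitude (41213.7 in row 101) next to features in the [0, 1] range. If that is not intended, a sketch of scaling it together with the other continuous columns, in place and without a separate concat, could look like the following. The toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame; 'Type' is already encoded and must stay untouched.
toy = pd.DataFrame({
    "Torque [Nm]": [40.0, 50.0, 60.0],
    "Power": [56320.0, 70000.0, 83680.0],
    "Type": [0.0, 1.0, 2.0],
})

scale_cols = ["Torque [Nm]", "Power"]
scaler = MinMaxScaler()

# Assigning back into a copy keeps column order and index intact.
scaled = toy.copy()
scaled[scale_cols] = scaler.fit_transform(toy[scale_cols])
```

This matters for distance-based steps such as SMOTE, which relies on nearest neighbours and would otherwise be dominated by the largest-valued column.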

### 7. Oversampling
The target variable is heavily imbalanced (9652 of 10000 rows are 'No Failure'), so the minority classes are oversampled with SMOTE.

In [25]:
smote = SMOTE(sampling_strategy='auto', random_state=42)  # fixed seed so the synthetic samples are reproducible

X = raw_df_scaled.drop('TYPE_OF_FAILURE', axis=1)
y = raw_df_scaled['TYPE_OF_FAILURE']

X_resampled, y_resampled = smote.fit_resample(X, y)

df_sampled = pd.concat([X_resampled, y_resampled], axis=1)
df_sampled.head()

Unnamed: 0,Type,Machine failure,Power,Rotational speed [rpm],Torque [Nm],Tool wear [min],Air temperature [C],Process temperature [C],TYPE_OF_FAILURE
0,1.0,0,66382.8,0.222934,0.535714,0.0,0.304348,0.358025,1
1,0.0,0,65190.4,0.139697,0.583791,0.011858,0.315217,0.37037,1
2,0.0,0,74001.2,0.192084,0.626374,0.019763,0.304348,0.345679,1
3,0.0,0,56603.5,0.154249,0.490385,0.027668,0.315217,0.358025,1
4,0.0,0,56320.0,0.139697,0.497253,0.035573,0.315217,0.37037,1


In [27]:
df_sampled.shape

(57912, 9)

In [26]:
df_sampled.TYPE_OF_FAILURE.value_counts()

TYPE_OF_FAILURE
1    9652
3    9652
5    9652
2    9652
4    9652
0    9652
Name: count, dtype: int64

In [29]:
# df_sampled.to_csv('./data/processed/data_processed.csv', index=False)
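
The save call above is commented out. A minimal sketch of writing and re-reading a processed frame is shown below; it uses a temporary directory and a hypothetical toy frame so that it runs anywhere, whereas the notebook's own target is `./data/processed/data_processed.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for df_sampled.
toy = pd.DataFrame({"Type": [1.0, 0.0], "TYPE_OF_FAILURE": [1, 3]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "data_processed.csv"

    # Create the output folder if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False avoids writing the row index as an extra column.
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip.
    reloaded = pd.read_csv(out_path)
```

`DataFrame.to_parquet` is the Parquet counterpart; it preserves dtypes but requires an engine such as pyarrow to be installed.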