# Feature Engineering

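This notebook selects the most informative features of the AI4I 2020 Predictive Maintenance Dataset using Partial Least Squares (PLS) regression. The features are standardized, a PLS model is fitted against the machine failure label, and the features with the largest PLS coefficients are kept for model training.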

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.feature_selection import SelectFromModel

# Set plot style
sns.set(style='whitegrid')

## Load the Dataset

In [None]:
# Load the dataset
data = pd.read_csv('../data/ai4i2020.csv')
data.head()

## Data Preprocessing

In [None]:
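# Define features and target variable.
# Column names follow the header of the UCI CSV; adjust them if your copy differs.
# Identifiers carry no signal, and the failure-mode flags (TWF, HDF, PWF, OSF, RNF)
# leak the target, so both groups are dropped.
target = 'Machine failure'
drop_cols = ['UDI', 'Product ID', 'TWF', 'HDF', 'PWF', 'OSF', 'RNF']
X = data.drop(columns=[target] + drop_cols)
y = data[target]

# One-hot encode the categorical product type so every feature is numeric
X = pd.get_dummies(X, columns=['Type'], drop_first=True, dtype=float)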

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
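
Failures are usually a small minority of rows in predictive maintenance data, so a purely random split can give the test set a different failure rate than the training set. The following is a minimal sketch, on synthetic data rather than the AI4I file, of passing `stratify` to `train_test_split`. The names `X_demo` and `y_demo` and the 3% positive rate are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 5 features, 3% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.zeros(1000, dtype=int)
y_demo[:30] = 1

# stratify keeps the class proportions (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate, train: {train_rate:.3f}, test: {test_rate:.3f}")
```

In the notebook itself this would amount to adding `stratify=y` to the existing `train_test_split` call.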

## PLS Regression for Feature Selection

In [None]:
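# Apply PLS regression; PLS cannot extract more components than there are features
n_components = min(10, X_train_scaled.shape[1])
pls = PLSRegression(n_components=n_components)
pls.fit(X_train_scaled, y_train)

# Keep the features whose absolute PLS coefficient is at least the mean absolute
# coefficient (the default SelectFromModel threshold for this estimator).
# np.ravel makes the selection independent of the orientation of pls.coef_,
# which differs between scikit-learn versions.
model = SelectFromModel(pls, prefit=True, importance_getter=lambda est: np.ravel(est.coef_))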
X_train_selected = model.transform(X_train_scaled)
X_test_selected = model.transform(X_test_scaled)

# Get the selected feature names
selected_features = X.columns[model.get_support()]
selected_features

## Visualize Selected Features

In [None]:
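# Plot the PLS coefficients of the selected features only, so that the
# coefficient array and the feature names have the same length
selected_coefs = np.ravel(pls.coef_)[model.get_support()]
plt.figure(figsize=(10, 8))
sns.barplot(x=selected_coefs, y=list(selected_features))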
plt.title('Feature Importance based on PLS Regression')
plt.xlabel('PLS Coefficient')
plt.ylabel('Feature')
plt.show()

## Save the Selected Features

In [None]:
# Save the selected features data
np.savez('../data/selected_features_data.npz', X_train=X_train_selected, X_test=X_test_selected, y_train=y_train, y_test=y_test)
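
For completeness, a downstream training notebook could read the archive back with `np.load`. The sketch below round-trips small random arrays through a temporary directory instead of `../data/`, so the array contents and the temporary path are assumptions. Only the key names match the `np.savez` call above.

```python
import os
import tempfile

import numpy as np

# Small random stand-ins for the real arrays (assumption)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(8, 3))
X_test_demo = rng.normal(size=(2, 3))
y_train_demo = rng.integers(0, 2, size=8)
y_test_demo = rng.integers(0, 2, size=2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "selected_features_data.npz")
    np.savez(
        path,
        X_train=X_train_demo,
        X_test=X_test_demo,
        y_train=y_train_demo,
        y_test=y_test_demo,
    )

    # np.load returns a lazy archive; index it by the keyword names used in savez
    with np.load(path) as archive:
        X_train_loaded = archive["X_train"]
        X_test_loaded = archive["X_test"]
        y_train_loaded = archive["y_train"]
        y_test_loaded = archive["y_test"]

print(X_train_loaded.shape, X_test_loaded.shape, y_train_loaded.shape, y_test_loaded.shape)
```

Note that the archive stores plain arrays, so the names of the selected features are not preserved unless they are saved separately.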

## Conclusion

In this notebook, we have performed feature selection using Partial Least Squares (PLS) regression, visualized the importance of the selected features, and saved the selected features data for model training.