# Feature engineering


While exploring the time- and frequency-domain behaviour of the signals in the first notebook, we found that they carry information that correlates with the wear of the cutting tool. In this notebook we systematically engineer features from the measured signals; these features will subsequently be used to train a predictive model.


Besides the original time-domain data, we can split each signal into frequency ranges using a wavelet decomposition, which yields one set of coefficients for each band of interest.
An 8-level wavelet decomposition produces 9 coefficient arrays, each of which represents a different frequency band:

| Coefficient | Frequency Band (Hz) | Name in this project |
| ----------- | ------------------- | -------------------- |
| cA8         | 0 to 97.7           | band 0               |
| cD8         | 97.7 to 195.3       | band 1               |
| cD7         | 195.3 to 390.6      | band 2               |
| cD6         | 390.6 to 781.25     | band 3               |
| cD5         | 781.25 to 1,562.5   | band 4               |
| cD4         | 1,562.5 to 3,125    | band 5               |
| cD3         | 3,125 to 6,250      | band 6               |
| cD2         | 6,250 to 12,500     | band 7               |
| cD1         | 12,500 to 25,000    | band 8               |

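As we are not interested in frequencies above roughly 1,000 Hz, we keep only the coefficients cA8 and cD8 to cD5 (bands 0 to 4). Note that cD5 (band 4) spans 781.25 to 1,562.5 Hz, so the retained coefficients cover content up to 1,562.5 Hz.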
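
The band edges in the table follow from the dyadic structure of the decomposition: each level halves the remaining frequency range. The sketch below reproduces them. The helper name `wavelet_band_edges` is hypothetical, and the 50 kHz sampling rate is an assumption inferred from the 25,000 Hz upper edge of cD1. The edges are nominal, since real wavelet filters do not cut off sharply.

```python
def wavelet_band_edges(sampling_rate_hz, level):
    """Return (name, low_hz, high_hz) tuples, coarsest band first."""
    nyquist = sampling_rate_hz / 2
    # The approximation coefficients hold everything below nyquist / 2**level.
    bands = [(f"cA{level}", 0.0, nyquist / 2**level)]
    # The detail coefficients follow, from coarsest (cD<level>) to finest (cD1).
    for j in range(level, 0, -1):
        bands.append((f"cD{j}", nyquist / 2**j, nyquist / 2 ** (j - 1)))
    return bands


# ASSUMPTION: 50 kHz sampling rate, inferred from the 25,000 Hz edge of cD1.
for index, (name, low, high) in enumerate(wavelet_band_edges(50_000, 8)):
    print(f"band {index}: {name:>3}  {low:10.2f} to {high:10.2f} Hz")
```

The tuples come out in the same coarsest-first order as the "band" numbering used in this project, so index 4 corresponds to cD5.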


The strategy we will follow is to extract the following set of features:

- min
- max
- mean
- std
- skew
- kurtosis
- energy
- entropy
- zero crossings

These are computed for each raw signal and for each retained wavelet band.
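
As a minimal sketch of what these nine statistics look like for a single 1-D array, the hypothetical helper `summary_features` below computes them with NumPy and SciPy. It is an illustration, not the function used later in this notebook; in particular it uses the population standard deviation throughout and passes the absolute values straight to `entropy`, which normalises its input to sum to 1.

```python
import numpy as np
from scipy.stats import entropy, kurtosis, skew


def summary_features(values):
    """Hypothetical helper: the nine statistics for one 1-D array."""
    values = np.asarray(values, dtype=float)
    return {
        "min": values.min(),
        "max": values.max(),
        "mean": values.mean(),
        "std": values.std(),  # population standard deviation (ddof=0)
        "skew": skew(values),
        "kurtosis": kurtosis(values),  # Fisher definition: 0 for a normal distribution
        "energy": np.sum(values**2),
        # Shannon entropy of the magnitudes, normalised by scipy to sum to 1
        "entropy": entropy(np.abs(values)),
        # number of adjacent sample pairs whose signs differ
        "zero_crossings": int(np.sum(np.diff(np.sign(values)) != 0)),
    }


print(summary_features([1.0, -1.0, 1.0, -1.0]))
```

One detail to keep in mind when comparing features: pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), whereas `np.std` defaults to the population one (`ddof=0`).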


In [1]:
import pandas as pd
import pywt
import numpy as np
from scipy.stats import skew, kurtosis, entropy
import os

# Silence pandas warnings that ruin the display of the notebook on GitHub
from warnings import simplefilter

simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

In [2]:
def extract_features(file_path):
    df = pd.read_csv(file_path, sep=",")
    df.columns = [
        "force_x",
        "force_y",
        "force_z",
        "vibration_x",
        "vibration_y",
        "vibration_z",
        "ae_rms",
    ]
    features = {}
    for column in df.columns:
        col_data = df[column]
        features[column + "_min"] = col_data.min()
        features[column + "_max"] = col_data.max()
        features[column + "_mean"] = col_data.mean()
        features[column + "_std"] = col_data.std()
        features[column + "_skew"] = skew(col_data)
        features[column + "_kurtosis"] = kurtosis(col_data)
        features[column + "_energy"] = np.sum(np.square(col_data))
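        # scipy's entropy normalises its input, so no manual division is needed;
        # dividing by the signed sum is fragile when that sum is close to zero.
        features[column + "_entropy"] = entropy(np.abs(col_data))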
        features[column + "_zero_crossings"] = np.sum(np.diff(np.sign(col_data)) != 0)

        coeffs = pywt.wavedec(df[column], "db4", level=8)
        for i, coeff in enumerate(coeffs[0:5]):
            features[column + "_band_" + str(i) + "_min"] = np.min(coeff)
            features[column + "_band_" + str(i) + "_max"] = np.max(coeff)
            features[column + "_band_" + str(i) + "_mean"] = np.mean(coeff)
            features[column + "_band_" + str(i) + "_std"] = np.std(coeff)
            features[column + "_band_" + str(i) + "_skew"] = skew(coeff)
            features[column + "_band_" + str(i) + "_kurtosis"] = kurtosis(coeff)
            features[column + "_band_" + str(i) + "_energy"] = np.sum(np.square(coeff))
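            # Wavelet coefficients can sum to nearly zero, so pass the magnitudes
            # directly and let scipy's entropy normalise them.
            features[column + "_band_" + str(i) + "_entropy"] = entropy(
                np.abs(coeff)
            )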
            features[column + "_band_" + str(i) + "_zero_crossings"] = np.sum(
                np.diff(np.sign(coeff)) != 0
            )

    return pd.DataFrame([features])
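
The slice `coeffs[0:5]` in the function above relies on the order in which `pywt.wavedec` returns its results: the approximation coefficients come first, followed by the detail coefficients from the coarsest level to the finest. The sketch below illustrates this on a synthetic signal (an assumption standing in for one sensor channel, not real data).

```python
import numpy as np
import pywt

# ASSUMPTION: random noise as a stand-in for one sensor channel.
rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)

coeffs = pywt.wavedec(signal, "db4", level=8)

# wavedec returns [cA8, cD8, cD7, ..., cD1]: coarsest band first.
names = ["cA8"] + [f"cD{j}" for j in range(8, 0, -1)]
for index, (name, coeff) in enumerate(zip(names, coeffs)):
    status = "kept" if index < 5 else "dropped"
    print(f"band {index} ({name}): {len(coeff)} coefficients, {status}")
```

Because the finest levels hold the most coefficients, dropping bands 5 to 8 also discards most of the coefficient values.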

In [3]:
folder_path = "../data/raw/c1/c1/"

df_list = []

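# Sort the listing so rows come out in a deterministic order
# (os.listdir returns entries in arbitrary order).
for filename in sorted(os.listdir(folder_path)):
    file_path = os.path.join(folder_path, filename)
    if os.path.isfile(file_path):
        df_list.append(extract_features(file_path))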

df_features_full = pd.concat(df_list, ignore_index=True)

In [4]:
df_features_full

|  | force_x_min | force_x_max | force_x_mean | force_x_std | force_x_skew | force_x_kurtosis | force_x_energy | force_x_entropy | force_x_zero_crossings | force_x_band_0_min | ... | ae_rms_band_3_zero_crossings | ae_rms_band_4_min | ae_rms_band_4_max | ae_rms_band_4_mean | ae_rms_band_4_std | ae_rms_band_4_skew | ae_rms_band_4_kurtosis | ae_rms_band_4_energy | ae_rms_band_4_entropy | ae_rms_band_4_zero_crossings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -2.501 | 3.744 | 0.400855 | 0.842474 | 0.115213 | -0.245204 | 1.108925e+05 | 11.470652 | 3086 | 3.372719 | ... | 1263 | -0.003087 | 0.002905 | -0.000012 | 0.000750 | 0.148335 | 0.173608 | 0.002242 | 7.991690 | 2476 |
| 1 | -4.219 | 8.427 | 1.028156 | 1.864984 | 0.606318 | 0.022628 | 9.854836e+05 | 11.912433 | 7897 | -5.805412 | ... | 2278 | -0.006229 | 0.017295 | 0.000023 | 0.001936 | 1.052757 | 3.416955 | 0.025474 | 8.425849 | 4374 |
| 2 | -5.994 | 11.534 | 1.776100 | 2.672098 | 0.605039 | 0.073306 | 2.248208e+06 | 11.936780 | 7446 | 0.016538 | ... | 2273 | -0.009216 | 0.021860 | 0.000030 | 0.002584 | 0.879653 | 2.244280 | 0.045633 | 8.505429 | 4478 |
| 3 | -6.157 | 11.788 | 2.104932 | 2.992563 | 0.586797 | 0.102543 | 2.928016e+06 | 11.947874 | 7116 | -4.159454 | ... | 2282 | -0.017264 | 0.031650 | 0.000003 | 0.003155 | 0.941339 | 3.642558 | 0.068113 | 8.495699 | 4525 |
| 4 | -4.288 | 12.555 | 2.831399 | 2.979762 | 0.612709 | 0.016937 | 3.708553e+06 | 11.984017 | 6409 | -9.086499 | ... | 2282 | -0.013142 | 0.020719 | -0.000002 | 0.003228 | 0.693154 | 1.824094 | 0.071532 | 8.495760 | 4521 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 310 | -78.148 | 125.820 | 19.623448 | 50.879922 | -0.031695 | -1.029682 | 6.527566e+08 | 12.055178 | 9158 | -17.400563 | ... | 2285 | -0.028823 | 0.040273 | 0.000008 | 0.006923 | 0.533284 | 0.531403 | 0.329047 | 8.566006 | 4559 |
| 311 | -76.741 | 127.270 | 19.879400 | 51.393585 | -0.042143 | -1.040532 | 6.469066e+08 | 12.031442 | 8891 | 4.396167 | ... | 2217 | -0.024646 | 0.040724 | -0.000016 | 0.006923 | 0.604330 | 0.429033 | 0.319354 | 8.538497 | 4439 |
| 312 | -75.875 | 121.780 | 18.625247 | 48.927014 | -0.020812 | -1.037154 | 6.008062e+08 | 12.055628 | 9141 | -23.853052 | ... | 2280 | -0.031028 | 0.036597 | 0.000038 | 0.006910 | 0.618103 | 0.932967 | 0.327401 | 8.545840 | 4553 |
| 313 | -70.778 | 117.770 | 17.553296 | 47.091621 | -0.007970 | -1.052196 | 5.531372e+08 | 12.059934 | 9143 | -5.750263 | ... | 2278 | -0.022553 | 0.031956 | 0.000070 | 0.006635 | 0.556917 | 0.476263 | 0.301617 | 8.552242 | 4547 |
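
The width of this table can be worked out from the extraction logic: 7 signals, each with 9 statistics on the raw data plus 9 statistics on each of 5 bands. The sketch below rebuilds the expected column names in the same order as `extract_features` produces them; the variable names are illustrative.

```python
signals = [
    "force_x", "force_y", "force_z",
    "vibration_x", "vibration_y", "vibration_z",
    "ae_rms",
]
stats = [
    "min", "max", "mean", "std", "skew",
    "kurtosis", "energy", "entropy", "zero_crossings",
]
bands = range(5)  # bands 0 to 4 are the ones kept

expected_columns = []
for signal in signals:
    # Raw-signal statistics come first, then one group per wavelet band.
    expected_columns += [f"{signal}_{stat}" for stat in stats]
    for band in bands:
        expected_columns += [f"{signal}_band_{band}_{stat}" for stat in stats]

print(len(expected_columns))  # 7 * 9 * (1 + 5) = 378
```

If the extraction ran as written, comparing `list(df_features_full.columns)` against `expected_columns` is a quick sanity check before saving the features.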


In [6]:
df_features_full.to_csv("../data/dashboard/c1/c1_features.csv", index=False)

We repeat the same procedure for the other datasets, c4 and c6.


In [7]:
folder_path = "../data/raw/c4/c4/"

df_list = []

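# Sort the listing so rows come out in a deterministic order
# (os.listdir returns entries in arbitrary order).
for filename in sorted(os.listdir(folder_path)):
    file_path = os.path.join(folder_path, filename)
    if os.path.isfile(file_path):
        df_list.append(extract_features(file_path))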

df_features_full = pd.concat(df_list, ignore_index=True)

df_features_full.to_csv("../data/dashboard/c4/c4_features.csv", index=False)

In [8]:
folder_path = "../data/raw/c6/c6/"

df_list = []

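# Sort the listing so rows come out in a deterministic order
# (os.listdir returns entries in arbitrary order).
for filename in sorted(os.listdir(folder_path)):
    file_path = os.path.join(folder_path, filename)
    if os.path.isfile(file_path):
        df_list.append(extract_features(file_path))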

df_features_full = pd.concat(df_list, ignore_index=True)

df_features_full.to_csv("../data/dashboard/c6/c6_features.csv", index=False)