In [None]:
import io  # needed for io.BytesIO when reading the uploaded file in Colab
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import cm
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
# Colab
from google.colab import files
uploaded = files.upload()
# Only use this if running the notebook on your local machine
#plt.style.use('notebook.mplstyle')

### Task description
You are given a dataset containing frequency spectra computed from recorded sounds of either faulty or normal machines running at a fixed RPM setting. Your task is to train a model (e.g. logistic regression or an SVM) that can detect and classify faulty machines based on their frequency spectra (characteristic sound).

### Step 1: Load the data

In [None]:
# Read the data into a pandas data frame (Colab)
df = pd.read_csv(io.BytesIO(uploaded['Machine_sound_data.csv']), index_col=0)
# Read the data into a pandas data frame (Local machine)
#df = pd.read_csv('Machine_sound_data.csv', index_col=0)
print(df.shape)
# Show the first five rows in the csv file
df.head()

The table contains 1543 rows (data points) and 121 columns. The first column contains the fault code (0 = OK, whereas all higher numbers denote a specific fault), and the remaining columns contain the frequency spectra, with the column label denoting the frequency. We can therefore go ahead and create the data matrix $\mathbf{X}$ and class label vector $\mathbf{y}$ as:

In [None]:
y = df['Fault code'].to_numpy()
X = df.iloc[:, 1:].to_numpy()
frequencies = [int(s) for s in df.columns[1:]]
# Let's also verify that X and y have the expected shape
print(y.shape)
print(X.shape)
# and that frequencies contains the column labels
print(frequencies)

### Check what our data looks like

In [None]:
# Let's plot a few spectra to see what our data looks like
# We add the argument figsize to get a figure that is wider than the default size
fig, ax = plt.subplots(1, 1, figsize=[15, 4])
ax.plot(frequencies, X[0:4, :].T, '-');
ax.set(xlabel='Frequency (Hz)', ylabel='Amplitude');

From the plot above, we note that the spectra have one peak around 150 Hz that dominates everything. This corresponds to the rotation frequency of the machine (10 000 RPM / 60 = 166.7 Hz). However, we expect the characteristic sound of faults to be found at other frequencies. One neat way to reduce the dominance of the RPM frequency is to pre-process the data by simply taking the logarithm.

In [None]:
X_pp = np.log10(X)

# Let's plot a few spectra to see what the preprocessed data looks like
fig, ax = plt.subplots(1, 1, figsize=[15, 4])
ax.plot(frequencies, X_pp[0:4, :].T, '-');
ax.set(xlabel='Frequency (Hz)', ylabel='log10(Amplitude)');

That looks better: now it is possible to see differences at other frequencies besides just the RPM frequency. As a next step, try to get an idea of how difficult this classification problem is. That is, do normal and faulty machines appear to have different spectra?

In [None]:
# A straightforward comparison is to simply plot
# normal spectra and compare these to faulty ones by eye.
fig, ax = plt.subplots(1, 1, figsize=[15, 4])
# Plot all normal spectra
ax.plot(frequencies, X_pp[y==0, :].T, 'k-', alpha=0.1);
# Plot all spectra with a specific fault code
ax.plot(frequencies, X_pp[y==4, :].T, 'r-', alpha=0.1);
ax.set(xlabel='Frequency (Hz)', ylabel='log10(Amplitude)');

Nice, some faults clearly have spectra that deviate from what is normal. The task of the machine learning model will thus be to automatically figure out which frequencies are relevant for detecting faulty machines and how to compute relevant features from those frequencies.

### Step 2: divide the data into training and testing sets
Use the pre-processed data in X_pp.
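One possible way to do the split is sketched below. The helper name `split_spectra`, the 25 % test fraction, and the fixed seed are assumptions made for illustration, not part of the task. Stratifying on the labels is a suggested choice that keeps the proportion of each fault code roughly equal in both sets. The demo arrays are synthetic stand-ins so that the cell runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_spectra(X, y, test_size=0.25, seed=0):
    """Hypothetical helper: stratified train/test split of spectra and labels."""
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )


# Synthetic stand-in data (NOT the machine sound data):
# 200 spectra with 120 frequency bins and 4 equally common class labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 120))
y_demo = np.repeat(np.arange(4), 50)

X_train, X_test, y_train, y_test = split_spectra(X_demo, y_demo)
print(X_train.shape, X_test.shape)
# Stratification keeps the class counts balanced in the test set
print(np.bincount(y_test))
```

In the notebook itself, the call would be `split_spectra(X_pp, y)` with the arrays created in the earlier cells.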

### Step 3: train a model to classify spectra
Try training a classification model, e.g., logistic regression or a support vector classifier. Observe that the mean value and variance of each feature (frequency bin) are very different for this data, and consequently the data will have to be scaled. The default approach is to mean center and scale to unit variance using a "StandardScaler".
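A minimal sketch of this step, assuming the scaler and the classifier are chained with `make_pipeline` so that the scaling parameters are learned from the training set only. The data here is a synthetic stand-in with deliberately mismatched feature scales, and plain `LogisticRegression` is used for brevity; `LogisticRegressionCV` from the imports at the top could be swapped in. All hyperparameters are library defaults, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the machine sound data): 3 classes, 20 features
# with very different scales; only the first 5 features carry the class signal.
rng = np.random.default_rng(0)
y_demo = np.repeat(np.arange(3), 100)
X_demo = rng.normal(size=(300, 20))
X_demo[:, :5] += 4.0 * y_demo[:, None]
X_demo *= np.logspace(-2, 2, 20)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Each pipeline first standardizes the features, then fits the classifier.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVC (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

With the real data, the same pipelines would be fitted on the training part of `X_pp` and `y`. Wrapping the scaler in the pipeline avoids fitting it on test data by accident.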

### Step 4: evaluate how well the model performs
Evaluate how well the model performs on the test set, and check which faults it struggles to classify correctly by computing the confusion matrix.
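The sketch below shows how a confusion matrix can be read, using a small hand-made set of labels rather than real model output. In scikit-learn's convention, rows are the true classes and columns are the predicted classes, so the diagonal divided by the row sums gives the fraction of each class that was classified correctly (the per-class recall).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made example labels (illustrative only, not results from the dataset)
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

# Rows: true class, columns: predicted class
conf_mat = confusion_matrix(y_true, y_pred)
print(conf_mat)

# Fraction of each true class that was predicted correctly
recall_per_class = conf_mat.diagonal() / conf_mat.sum(axis=1)
print(recall_per_class)

# Overall accuracy is the sum of the diagonal over the total count
accuracy = conf_mat.trace() / conf_mat.sum()
print(accuracy)
```

For a fitted model, the equivalent call would be `confusion_matrix(y_test, model.predict(X_test))`, where `model` is whatever classifier was trained in Step 3. Classes with a low value in `recall_per_class` are the faults the model struggles with, and the off-diagonal entries in that row show what they are mistaken for.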