# 🎯 Learning Objectives

By the end of this notebook, you will be able to:

- Understand the structure of the **MaFaulDa dataset** and its class hierarchy.  
- Download, extract, and organize multi-class datasets with **vibration and microphone recordings**.  
- Explore dataset metadata and create a **DataFrame of available files**.  
- Inspect the longest and shortest recordings (top / bottom by signal length) to illustrate variability.  
- Load and visualize sensor signals from CSV files, with **optional normalization** for fair comparison.  
- Compare signals from different classes in the **time domain** and overlay **rolling statistics** (mean/RMS).  
- Perform **spectrogram (time–frequency) analysis** for selected channels.  
- Listen to **microphone recordings** and relate sound to operating conditions.  
- Prepare for feature extraction and ML tasks by automating metadata collection and basic signal summaries.

In [None]:
# ---- Imports ----

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import urllib.request
import ssl
import shutil
from urllib.parse import urljoin
from scipy.signal import spectrogram
from scipy.fft import fft, fftfreq
from IPython.display import Audio, display

In [None]:
# ---- Define Dataset URLs ----

dataset_urls = {
    "Normal": "https://www02.smt.ufrj.br/~offshore/mfs/database/mafaulda/normal.zip",
    "Horizontal Misalignment": "https://www02.smt.ufrj.br/~offshore/mfs/database/mafaulda/horizontal-misalignment.zip",
    # "Vertical Misalignment": "https://www02.smt.ufrj.br/~offshore/mfs/database/mafaulda/vertical-misalignment.zip",
    # "Imbalance": "https://www02.smt.ufrj.br/~offshore/mfs/database/mafaulda/imbalance.zip",
    # "Underhang Bearing": "https://www02.smt.ufrj.br/~offshore/mfs/database/mafaulda/underhang-bearing.zip",
    # "Overhang Bearing": "https://www02.smt.ufrj.br/~offshore/mfs/database/mafaulda/overhang-bearing.zip",
}

In [None]:
# ---- Dataset Source Configuration ----

# Detect environment
try:
    import google.colab
    ON_COLAB = True
except ImportError:
    ON_COLAB = False

# Set paths based on environment and source (⚠️ Update COURSE_PATH below if your folder has a different name or location)
if ON_COLAB:
    # Mount Google Drive if using Colab
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    # Base folder for this course on Google Drive
    COURSE_PATH = "/content/drive/MyDrive/Industrial_ML_Course"
else:
    # Offline / local computer
    # Adjust COURSE_PATH to your local folder
    COURSE_PATH = r"D:\Industrial_ML_Course"

# Subfolders
DATASET_PATH = os.path.join(COURSE_PATH, "datasets/MaFaulDa")
NOTEBOOK_PATH = os.path.join(COURSE_PATH, "notebooks")

# Ensure directories exist
os.makedirs(DATASET_PATH, exist_ok=True)
os.makedirs(NOTEBOOK_PATH, exist_ok=True)

print("Environment:", "Colab" if ON_COLAB else "Local")
print("Course Path:", COURSE_PATH)
print("Dataset Path:", DATASET_PATH)

In [None]:
# ---- Download and Extract All Classes ----

# Unverified SSL context: certificate checks are skipped, so use it only for this known host
ssl_context = ssl._create_unverified_context()

def download_and_extract(cls, url, dataset_path=DATASET_PATH):
    """Download and extract a single MaFaulDa class dataset."""

    # Get folder name (remove .zip extension)
    folder_name = os.path.join(dataset_path, os.path.splitext(os.path.basename(url))[0])

    if os.path.exists(folder_name):
        print(f"[✔] {cls} folder already exists: {folder_name}")
        return

    zip_file = os.path.join(dataset_path, os.path.basename(url))

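    # Download the zip file if it does not exist yet
    if not os.path.exists(zip_file):
        print(f"[↓] Downloading {cls} zip file ...")
        tmp_file = zip_file + ".part"
        # Stream to disk in chunks instead of reading the whole archive into memory,
        # and rename only on completion so an interrupted download is not reused
        with urllib.request.urlopen(url, context=ssl_context) as response, open(tmp_file, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)
        os.replace(tmp_file, zip_file)
        print(f"[✔] {cls} zip file downloaded.")
    else:
        print(f"[✔] {cls} zip file already exists: {zip_file}")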

    # Extract zip file
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        zip_ref.extractall(dataset_path)
    print(f"[✔] {cls} zip file extracted to {folder_name}")

    # Cleanup zip file after successful extraction
    try:
        os.remove(zip_file)
        print(f"[✗] {cls} zip file removed.")
    except OSError as e:
        print(f"[!] Warning: could not remove {zip_file}: {e}")

# Loop through all classes
for cls, url in dataset_urls.items():
    download_and_extract(cls, url)

In [None]:
# ---- Explore Files and Build Metadata ----

# Class folder names are derived from the dataset_urls keys
# (lowercased, spaces replaced by hyphens, matching the extracted zip folder names)
classes = [cls.lower().replace(" ", "-") for cls in dataset_urls.keys()]
metadata = []

for cls in classes:
    class_path = os.path.join(DATASET_PATH, cls)

    # Walk through all subfolders and collect CSV files
    for root, _, files in os.walk(class_path):
        for f in files:
            if f.endswith(".csv"):
                file_path = os.path.join(root, f)

                # Collect dataset metadata (filename, class label, full path, relative subfolder)
                metadata.append({
                    "file": f,
                    "class": cls,
                    "path": file_path,
                    "subfolder": os.path.relpath(root, class_path)
                })

# Convert to DataFrame
df_meta = pd.DataFrame(metadata)

# Display a small sample (at most 5 rows, so this also works with very few files)
display(df_meta.sample(min(5, len(df_meta)), random_state=42))
print(f"Total files: {len(df_meta)}")
print(df_meta["class"].value_counts())
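
One of the stated objectives is to inspect the longest and shortest recordings. The sketch below shows one way this could be done. It runs on two small synthetic CSV files in a temporary folder rather than on the real dataset, and `count_rows`, `toy_meta`, and `by_length` are hypothetical names that do not appear elsewhere in this notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def count_rows(csv_path):
    """Count the non-empty lines of a headerless CSV without parsing the values."""
    with open(csv_path, "r") as fh:
        return sum(1 for line in fh if line.strip())


# Synthetic stand-in for df_meta: two tiny 8-column files of different length
tmp_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
rows = []
for name, n in [("short.csv", 30), ("long.csv", 120)]:
    path = os.path.join(tmp_dir, name)
    pd.DataFrame(rng.normal(size=(n, 8))).to_csv(path, header=False, index=False)
    rows.append({"file": name, "path": path})
toy_meta = pd.DataFrame(rows)

# Add the length as a column, then sort to bring the extremes to the top and bottom
toy_meta["n_samples"] = toy_meta["path"].apply(count_rows)
by_length = toy_meta.sort_values("n_samples", ascending=False)
print(by_length[["file", "n_samples"]])
```

On the real data, the same two lines at the end could be applied to `df_meta` directly. Counting lines avoids loading every file into a DataFrame, which matters when there are many large recordings.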

In [None]:
# ---- Load and Compare Two Sample Signals from Different Classes ----

# Assign sensor names
sensor_names = [
    "tachometer",
    "under_axial", "under_radial", "under_tangential",
    "over_axial", "over_radial", "over_tangential",
    "microphone"
]

# Pick two random samples from different classes
samples = df_meta.groupby("class").sample(1, random_state=42).head(2)

# Settings for plotting improvements
max_samples = 5000   # keep plot readable
normalize_for_plot = True   # option: normalize each plotted channel to [-1,1]
overlay_rolling = True      # option: overlay rolling mean/RMS
rolling_window = 200        # samples

fig, axes = plt.subplots(2, 1, figsize=(14, 10), sharex=True)

for i, (_, sample) in enumerate(samples.iterrows()):
    # Load CSV file (no header → force header=None)
    df_signal = pd.read_csv(sample["path"], header=None)

    # Assign sensor names
    df_signal.columns = sensor_names

    print(f"Sample {i+1}: {sample['file']}")
    print(f"   Class: {sample['class']}")
    print(f"   Subfolder: {sample['subfolder']}")
    print(f"   Shape: {df_signal.shape}")
    print("\n")

    # Subset for plotting
    idx_end = min(max_samples, df_signal.shape[0])
    tach = df_signal['tachometer'].values[:idx_end]
    accel = df_signal['under_radial'].values[:idx_end]
    mic = df_signal['microphone'].values[:idx_end]

    # Normalization function
    def norm_sig(x):
        x = x.astype(float)
        m = np.max(np.abs(x))
        if m == 0:
            return x
        return x / m

    if normalize_for_plot:
        tach_plot = norm_sig(tach)
        accel_plot = norm_sig(accel)
        mic_plot = norm_sig(mic)
    else:
        tach_plot, accel_plot, mic_plot = tach, accel, mic

    axes[i].plot(tach_plot, label="Tachometer", alpha=0.9)
    axes[i].plot(accel_plot, label="Underhang Radial", alpha=0.9)
    axes[i].plot(mic_plot, label="Microphone", alpha=0.9)

    # overlay rolling stats on microphone for emphasis (optional)
    if overlay_rolling:
        mic_series = pd.Series(mic_plot)
        rm = mic_series.rolling(window=rolling_window, min_periods=1).mean()
        # raw=True passes plain NumPy arrays to the lambda, which is faster than passing Series
        rrms = mic_series.rolling(window=rolling_window, min_periods=1).apply(lambda x: np.sqrt(np.mean(x**2)), raw=True)
        axes[i].plot(rm, label="Mic Rolling Mean", color="k", linewidth=1.5, linestyle='--')
        axes[i].plot(rrms, label="Mic Rolling RMS", color="magenta", linewidth=1.5, linestyle=':')

    axes[i].set_title(f"Sample {i+1} - Class: {sample['class']}")
    axes[i].set_ylabel("Amplitude (normalized)" if normalize_for_plot else "Amplitude")
    axes[i].legend(loc='upper right')

axes[-1].set_xlabel("Time (samples)")
plt.tight_layout()
plt.show()
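
To see how the rolling window size changes the picture, the following sketch computes a rolling RMS for several window lengths on a synthetic signal. The helper `rolling_rms` and the test signal are illustrative assumptions, not part of the notebook. The helper squares the signal, takes a rolling mean, and then the square root, which gives the same quantity as the per-window lambda used above.

```python
import numpy as np
import pandas as pd


def rolling_rms(x, window):
    """Rolling RMS: square, take the rolling mean, then the square root."""
    mean_sq = pd.Series(x, dtype=float).pow(2).rolling(window=window, min_periods=1).mean()
    # Clip guards against tiny negative values caused by floating-point round-off
    return np.sqrt(mean_sq.clip(lower=0.0))


# Synthetic test signal: a unit-amplitude sine with one isolated spike
n = 5000
sig = np.sin(2 * np.pi * np.arange(n) / 100.0)
sig[2500] += 10.0

for window in (50, 200, 1000):
    rms = rolling_rms(sig, window)
    print(f"window={window:5d}  max rolling RMS={rms.max():.3f}")
```

Short windows react strongly to the spike, while long windows spread it out, which is the smoothing trade-off to keep in mind when choosing `rolling_window`.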

In [None]:
# ---- Spectrogram (time–frequency analysis) ----

fs = 50000  # sampling frequency in Hz

fig, axes = plt.subplots(2, 1, figsize=(14, 10), sharex=True)

for i, (_, sample) in enumerate(samples.iterrows()):
    # Load CSV file (no header → force header=None)
    df_signal = pd.read_csv(sample["path"], header=None)

    # Assign sensor names
    df_signal.columns = sensor_names

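    # Pick the microphone channel (last column). Only the first max_samples points are used,
    # i.e. max_samples / fs seconds of signal (0.1 s for 5000 samples at 50 kHz)
    signal = df_signal["microphone"].values[:max_samples]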

    # Calculate spectrogram
    # nperseg: window length (smaller -> better time resolution; larger -> better freq resolution)
    # noverlap: overlap between windows
    f, t, Sxx = spectrogram(signal, fs=fs, nperseg=1024, noverlap=512)

    # Plot spectrogram
    pcm = axes[i].pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading='gouraud')  # small eps for numerical stability
    fig.colorbar(pcm, ax=axes[i], label="Power/Frequency (dB/Hz)")

    axes[i].set_title(f"Spectrogram of Microphone Channel - Class: {sample['class']}")
    axes[i].set_ylabel("Frequency [Hz]")
    axes[i].set_xlabel("Time [s]")
    axes[i].set_ylim(0, 5000)  # limit to lower frequencies if needed

plt.tight_layout()
plt.show()
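
The trade-off behind `nperseg` and `noverlap` can be made concrete with a little arithmetic. The sketch below is a plain-Python illustration: `spectrogram_resolution` is a hypothetical helper, and the segment count assumes no padding or boundary extension, which is my understanding of the default behaviour of `scipy.signal.spectrogram`.

```python
def spectrogram_resolution(n_samples, fs, nperseg, noverlap):
    """Frequency bin width, time step, and segment count for an unpadded STFT."""
    hop = nperseg - noverlap  # samples between the starts of consecutive windows
    return {
        "freq_resolution_hz": fs / nperseg,  # spacing of the frequency bins
        "time_step_s": hop / fs,             # spacing of the time columns
        "n_segments": (n_samples - noverlap) // hop,
    }


fs = 50000        # Hz, as in the cell above
n_samples = 5000  # matches max_samples used for plotting

for nperseg in (256, 1024, 4096):
    res = spectrogram_resolution(n_samples, fs, nperseg, noverlap=nperseg // 2)
    print(nperseg, res)
```

With only 5000 samples, a long window leaves very few time columns, so raising `max_samples` may be worth trying before increasing `nperseg`.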

In [None]:
# ---- Listen to the microphone channels ----

for _, sample in samples.iterrows():
    # Load CSV file (no header → force header=None)
    df_signal = pd.read_csv(sample["path"], header=None)

    # Assign sensor names
    df_signal.columns = sensor_names

    # Pick microphone channel
    mic_signal = df_signal['microphone'].values.astype(float)

    # Normalize to avoid clipping (guard against zeros)
    max_abs = np.max(np.abs(mic_signal))
    if max_abs == 0:
        mic_norm = mic_signal
    else:
        mic_norm = mic_signal / max_abs

    print(f"Class: {sample['class']} | File: {os.path.basename(sample['path'])}")
    display(Audio(mic_norm, rate=fs))

# 🚀 Explore More! (Guided Exercises)

1. **Channel Comparison**  
   - Compare signals from different channels (tachometer, accelerometers, microphone) for the same sample.  
   - Try plotting both with and without normalization. Which view makes differences between classes more visible?  

2. **Spectrogram Parameter Tuning**  
   - Experiment with `nperseg` and `noverlap`.  
   - Recall: smaller `nperseg` → better time resolution; larger `nperseg` → better frequency resolution.  
   - Which settings highlight differences between healthy and faulty cases most clearly?  

3. **Time-Domain Feature Extraction (Exploration)**
   - Using a **small subset of files and channels**, compute basic features for each signal:
     - `mean`, `std`, `RMS`, `peak-to-peak`, `skewness`, `kurtosis`.
   - Add these features as columns in a **mini DataFrame** for inspection.
   - Explore differences across channels and classes:
     - Which channels show the largest variability?
     - Are certain features more sensitive to faults or specific classes?
   - Purpose: **understand features and their interpretability** on a manageable scale.

4. **Rolling Statistics**  
   - Apply rolling averages or RMS with different window sizes (e.g., 50, 200, 1000 samples).  
   - How does the amount of smoothing affect the ability to spot patterns or anomalies?  

5. **Automation Challenge (Scaling Up)**
   - Write a function that automatically processes **all files and channels** in the dataset.
   - For each **file × channel** pair, compute the same features as in Exercise 3.
   - Build a **complete feature matrix**:
     - Rows = individual signals (file × channel)
     - Columns = extracted features
     - Include metadata: `file`, `class`, `channel`
   - Inspect the resulting table:
     - Are there clear differences between channels or classes?
     - Which channels or features are most informative for ML models?
   - Purpose: **scale up the manual exploration from Exercise 3** to a dataset-wide feature extraction for later classification.
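
As a possible starting point for Exercises 3 and 5, the sketch below computes the listed features for each channel of one signal table. It is only an illustration under stated assumptions: `time_domain_features`, `features_per_channel`, and the synthetic `toy_signal` are hypothetical and not part of the notebook, and the kurtosis reported is SciPy's default excess kurtosis.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew


def time_domain_features(x):
    """Return basic time-domain features of a 1-D signal as a dict."""
    x = np.asarray(x, dtype=float)
    return {
        "mean": x.mean(),
        "std": x.std(ddof=0),
        "rms": np.sqrt(np.mean(x ** 2)),
        "peak_to_peak": x.max() - x.min(),
        "skewness": skew(x),
        "kurtosis": kurtosis(x),  # excess kurtosis (0 for a normal distribution)
    }


def features_per_channel(df_signal, file, cls):
    """Build one row per channel, with metadata columns next to the features."""
    rows = []
    for channel in df_signal.columns:
        row = {"file": file, "class": cls, "channel": channel}
        row.update(time_domain_features(df_signal[channel].values))
        rows.append(row)
    return pd.DataFrame(rows)


# Synthetic two-channel example standing in for one loaded CSV file
t = np.arange(1000) / 1000.0
toy_signal = pd.DataFrame({
    "under_radial": np.sin(2 * np.pi * 10 * t),
    "microphone": 0.5 * np.cos(2 * np.pi * 5 * t),
})
toy_features = features_per_channel(toy_signal, file="toy.csv", cls="normal")
print(toy_features.round(3))
```

For the dataset-wide version, one option is to loop over the rows of `df_meta`, load each CSV as in the cells above, call `features_per_channel`, and combine the results with `pd.concat`.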