# **Notebook 1: Data Understanding & Exploratory Data Analysis (EDA)**

## **Objective**

The goal of this notebook is to perform an initial exploration of the Case Western Reserve University (CWRU) Bearing Dataset. We will load the data, understand its structure, and visualize the vibration signals for different fault conditions. This foundational analysis is crucial for guiding our subsequent preprocessing and modeling decisions.

### **Key Activities:**
- **Load Data:** Import vibration signal data from the `.mat` files.
- **Inspect Structure:** Examine the shape, data types, and key components of the raw signals.
- **Visualize Signals:** Plot the time-domain waveforms for each bearing condition (Normal, Inner Race Fault, Outer Race Fault, Ball Fault).
- **Analyze Frequency Domain:** Use the Fast Fourier Transform (FFT) to visualize the frequency content of the signals and identify characteristic fault frequencies.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from scipy.fft import fft

# Import custom utility functions from the src directory
import sys
sys.path.append('../src')
from data_utils import load_mat_file

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (15, 7)
plt.rcParams['font.size'] = 12

ImportError: cannot import name 'load_mat_file' from 'data_utils' (D:\Coding\GitHub\AI-Bearing-Diagnosis\notebooks\../src\data_utils.py)
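
The `ImportError` above indicates that `data_utils.py` does not currently expose `load_mat_file`. Until that is resolved, a stand-in loader along the following lines may be enough for the EDA in this notebook. This is a minimal sketch, not the project's implementation: the name `load_de_signal` is hypothetical, and it assumes each `.mat` file stores the drive-end accelerometer signal in a variable whose name ends in `_DE_time`, which should be verified against the actual files. The round trip at the end uses a small synthetic file, not CWRU data.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat


def load_de_signal(path, key_suffix="_DE_time"):
    """Return the first variable whose name ends with key_suffix as a 1-D float array.

    Hypothetical stand-in for load_mat_file; the key suffix is an assumption.
    """
    mat = loadmat(path)
    for name, value in mat.items():
        if name.endswith(key_suffix):
            return np.asarray(value, dtype=float).ravel()
    available = [k for k in mat if not k.startswith("__")]
    raise KeyError(f"No variable ending in {key_suffix!r}; found {available}")


# Round-trip check on a synthetic file (variable names here are illustrative)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.mat")
savemat(demo_path, {"X000_DE_time": np.arange(5.0).reshape(-1, 1)})
demo_signal = load_de_signal(demo_path)
```

If the real files use different variable names, the `KeyError` message lists what was found, which makes it easier to pick the right `key_suffix`.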

## **1. Data Loading**

We begin by loading the raw vibration signals for the different fault conditions. The data used here is organized by motor load (0, 1, and 2 HP). Our project roadmap specifies using the 0 and 1 HP recordings for training and the 2 HP recordings for testing.

Here, we'll load one representative file for each of the four conditions (Normal, BF, IRF, ORF) under a 1 HP load to perform our initial EDA.

In [None]:
SAMPLING_RATE = 12000 # Hz; assumes the 12 kHz recordings (the rate is a property of the recording, not of the motor load)

# Define paths to representative data files
normal_path = '../data/raw/1_HP/Normal.mat'
bf_path = '../data/raw/1_HP/B007_1.mat' # Ball fault, 0.007-inch fault diameter
irf_path = '../data/raw/1_HP/IR007_1.mat' # Inner race fault, 0.007-inch fault diameter
orf_path = '../data/raw/1_HP/OR007@6_1.mat' # Outer race fault, 0.007-inch fault diameter, defect at the 6 o'clock position

# Load the signals using our utility function
signal_normal = load_mat_file(normal_path)
signal_bf = load_mat_file(bf_path)
signal_irf = load_mat_file(irf_path)
signal_orf = load_mat_file(orf_path)

# Store signals in a dictionary for easy access
signals = {
    'Normal': signal_normal,
    'Ball Fault': signal_bf,
    'Inner Race Fault': signal_irf,
    'Outer Race Fault': signal_orf
}

# Report the shape of each loaded signal
for label, signal in signals.items():
    print(f"Loaded '{label}' signal with shape: {signal.shape}")
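
The load-based split described at the start of this section (0 and 1 HP for training, 2 HP for testing) could be encoded once and reused by later notebooks. The sketch below is illustrative only: `SPLIT_BY_LOAD` and `split_for` are hypothetical names, and the `0_HP` and `2_HP` directory names are assumed by analogy with the `1_HP` directory used above.

```python
from pathlib import PurePosixPath

# Assumed mapping, following the roadmap: 0 and 1 HP for training, 2 HP for testing
SPLIT_BY_LOAD = {"0_HP": "train", "1_HP": "train", "2_HP": "test"}


def split_for(path):
    """Return 'train' or 'test' based on the load directory that contains the file."""
    load_dir = PurePosixPath(path).parent.name
    try:
        return SPLIT_BY_LOAD[load_dir]
    except KeyError:
        raise ValueError(f"Unrecognised load directory {load_dir!r} in {path!r}") from None


normal_split = split_for("../data/raw/1_HP/Normal.mat")
```

Keeping the rule in one place avoids accidentally mixing 2 HP recordings into the training set.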

## **2. Time-Domain Signal Visualization**

Visualizing the raw signals in the time domain is the first step to understanding their characteristics. We expect to see different patterns, amplitudes, and periodicities for healthy versus faulty bearings. Faulty bearings often introduce periodic impacts or vibrations that can be seen in the waveform.

In [None]:
# Reusable helper: draw each condition in its own subplot with a shared x-axis.
def plot_signals(signals_dict, points_to_plot=2048):
    """Plots the first N points of signals from a dictionary."""
    num_signals = len(signals_dict)
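    fig, axes = plt.subplots(num_signals, 1, figsize=(18, 4 * num_signals), sharex=True, squeeze=False)
    axes = axes.ravel()  # Always a 1-D array of axes, even when only one signal is passed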
    fig.suptitle('Time-Domain Vibration Signals for Different Bearing Conditions', fontsize=18, y=0.95)
    
    for i, (label, signal) in enumerate(signals_dict.items()):
        ax = axes[i]
        ax.plot(signal[:points_to_plot], label=label, lw=1)
        ax.set_ylabel('Amplitude')
        ax.set_title(f'Condition: {label}')
        ax.legend(loc='upper right')
        ax.grid(True, which='both', linestyle='--', linewidth=0.5)
        
    axes[-1].set_xlabel('Time (samples)')
    plt.tight_layout(rect=[0, 0, 1, 0.93])
    
    # Ensure the output directory exists (missing parent directories are created too)
    output_dir = '../results/figures/'
    os.makedirs(output_dir, exist_ok=True)
        
    # Save the figure to the correct directory
    plt.savefig(os.path.join(output_dir, 'eda_time_domain_signals.png'))
    plt.show()

# Plot the loaded signals
plot_signals(signals)

### **Time-Domain Observations:**

- **Normal Signal:** The normal signal shows a relatively consistent, noise-like pattern with smaller amplitude variations than the faulty signals.
- **Faulty Signals (BF, IRF, ORF):** All three fault types show higher amplitudes and more distinct, periodic spikes or bursts of energy. These are the tell-tale signs of impacts that occur each time a rolling element passes over a defect on a race, or a defective ball strikes a race. Telling the three fault types apart by eye in the time domain is difficult, which is why frequency-domain analysis is essential.
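
The visual impressions above can be backed with a few summary statistics. The following is a minimal sketch, assuming each signal can be flattened to a 1-D array; `summarize_signal` is a hypothetical helper, and kurtosis is used here only as a common indicator of impulsiveness, not as something the notebook already computes. The demonstration uses synthetic signals rather than CWRU data.

```python
import numpy as np


def summarize_signal(signal):
    """Return RMS, peak absolute amplitude, and kurtosis of a signal."""
    x = np.ravel(np.asarray(signal, dtype=float))
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return {
        "rms": float(np.sqrt(np.mean(x ** 2))),
        "peak": float(np.max(np.abs(x))),
        # Non-excess kurtosis: about 3 for Gaussian noise, larger for impulsive signals
        "kurtosis": float(np.mean(centered ** 4) / variance ** 2),
    }


# Synthetic illustration: a smooth tone versus the same tone with periodic impacts
t = np.arange(1000) / 1000.0
tone = np.sin(2 * np.pi * 5 * t)
impacts = tone.copy()
impacts[::100] += 5.0  # one spike every 100 samples

smooth_stats = summarize_signal(tone)
impact_stats = summarize_signal(impacts)
```

Applied to the notebook's data, this would be `{label: summarize_signal(sig) for label, sig in signals.items()}`; a markedly higher kurtosis for the faulty signals would be consistent with the periodic impacts described above.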

## **3. Frequency-Domain Analysis using FFT**

The Fast Fourier Transform (FFT) decomposes a signal into its constituent frequencies. In bearing fault diagnosis, each fault type produces impacts at a predictable rate, together with harmonics of that rate, determined by the geometry of the bearing and the motor's rotational speed. The frequency spectrum can make these patterns much clearer than the time-domain waveform.
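
To make "predictable frequencies" concrete, the standard kinematic formulas for a rolling-element bearing with a stationary outer race can be sketched as follows. The function name is hypothetical, and the example values (9 balls, 0.3126 in ball diameter, 1.537 in pitch diameter, 1772 rpm) are assumptions based on figures commonly quoted for the CWRU drive-end bearing at 1 HP; they should be verified against the dataset documentation before being relied on.

```python
import math


def bearing_fault_frequencies(shaft_hz, n_balls, ball_diameter, pitch_diameter,
                              contact_angle_deg=0.0):
    """Return characteristic bearing frequencies in Hz (outer race assumed stationary).

    ball_diameter and pitch_diameter must share the same unit.
    """
    ratio = (ball_diameter / pitch_diameter) * math.cos(math.radians(contact_angle_deg))
    return {
        "FTF": 0.5 * shaft_hz * (1 - ratio),             # cage (fundamental train) frequency
        "BPFO": 0.5 * n_balls * shaft_hz * (1 - ratio),  # ball pass frequency, outer race
        "BPFI": 0.5 * n_balls * shaft_hz * (1 + ratio),  # ball pass frequency, inner race
        "BSF": 0.5 * (pitch_diameter / ball_diameter) * shaft_hz * (1 - ratio ** 2),  # ball spin
    }


# Assumed geometry and speed (see the note above); not taken from this notebook's data
example = bearing_fault_frequencies(
    shaft_hz=1772 / 60, n_balls=9, ball_diameter=0.3126, pitch_diameter=1.537
)
```

A defective ball typically strikes both races once per spin, so ball faults are often looked for at twice the `BSF` value. Slip in a real bearing also shifts the measured peaks slightly away from these kinematic values.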

In [None]:
def plot_fft(signals_dict, sampling_rate):
    """Calculates and plots the FFT of signals from a dictionary."""
    num_signals = len(signals_dict)
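    fig, axes = plt.subplots(num_signals, 1, figsize=(18, 4 * num_signals), sharex=True, squeeze=False)
    axes = axes.ravel()  # Always a 1-D array of axes, even when only one signal is passed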
    fig.suptitle('Frequency-Domain (FFT) of Vibration Signals', fontsize=18, y=0.95)

    for i, (label, signal) in enumerate(signals_dict.items()):
        ax = axes[i]
        
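        # Perform FFT (flatten first in case the loader returns a column vector)
        signal = np.ravel(signal)
        N = len(signal)
        yf = fft(signal)
        # Bin k sits at k * fs / N; a linspace up to fs/2 would shift the bins slightly
        xf = np.arange(N // 2) * sampling_rate / N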
        
        # Plot the single-sided amplitude spectrum
        ax.plot(xf, 2.0/N * np.abs(yf[0:N//2]), lw=1)
        ax.set_ylabel('Amplitude')
        ax.set_title(f'Condition: {label}')
        ax.grid(True, which='both', linestyle='--', linewidth=0.5)
        ax.set_xlim(0, sampling_rate / 4) # Limit x-axis to a reasonable frequency range

    axes[-1].set_xlabel('Frequency (Hz)')
    plt.tight_layout(rect=[0, 0, 1, 0.93])
    
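    # Save the figure, creating the output directory first if it does not exist
    output_dir = '../results/figures/'
    os.makedirs(output_dir, exist_ok=True)
    plt.savefig(os.path.join(output_dir, 'eda_frequency_domain_signals.png'))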
    plt.show()

# Plot the FFTs
plot_fft(signals, SAMPLING_RATE)
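
Because the frequency axis is easy to get subtly wrong, a quick check on a synthetic tone of known frequency can be useful before interpreting the plots. The sketch below is independent of `plot_fft`: it uses NumPy's `rfft` and `rfftfreq`, and `dominant_frequency` is a hypothetical helper introduced only for this check.

```python
import numpy as np


def dominant_frequency(signal, sampling_rate):
    """Return the frequency (Hz) of the largest non-DC peak in the spectrum."""
    x = np.ravel(np.asarray(signal, dtype=float))
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sampling_rate)
    spectrum[0] = 0.0  # ignore the DC component
    return float(freqs[np.argmax(spectrum)])


# One second of a 157 Hz tone sampled at 12 kHz (synthetic, not CWRU data)
fs = 12000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 157.0 * t)
peak_hz = dominant_frequency(tone, fs)
```

If the recovered peak does not match the tone's frequency, the frequency axis or the sampling rate is likely mis-specified.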

### **Frequency-Domain Observations:**

- **Distinct Frequencies:** The FFT plots show dominant frequency peaks for each fault condition that are absent from the normal signal's spectrum.
- **Harmonics:** The faulty signals exhibit a series of harmonic frequencies (peaks at integer multiples of a fundamental frequency). These harmonics are characteristic signatures of bearing faults.
- **Diagnostic Potential:** The separation of patterns in the frequency domain suggests that frequency-based features will be effective for classification. This supports our plan to use techniques like FFT for feature engineering and spectrograms for deep learning models.
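
As one illustration of what FFT-based feature engineering might look like, the sketch below summarises a spectrum as the fraction of energy in a few equal-width frequency bands. This is an assumption about a possible feature set, not the method the project has settled on; `band_energy_features` and the choice of band count are hypothetical.

```python
import numpy as np


def band_energy_features(signal, n_bands=8):
    """Return the fraction of spectral energy in each of n_bands equal-width bands."""
    x = np.ravel(np.asarray(signal, dtype=float))
    power = np.abs(np.fft.rfft(x)) ** 2
    # Split the one-sided spectrum into contiguous bands and sum the power in each
    band_energy = np.array([band.sum() for band in np.array_split(power, n_bands)])
    total = band_energy.sum()
    return band_energy / total if total > 0 else band_energy


# Synthetic example: a 100 Hz tone sampled at 1 kHz falls in the lowest of four bands
fs = 1000
t = np.arange(fs) / fs
features = band_energy_features(np.sin(2 * np.pi * 100 * t), n_bands=4)
```

Normalising by the total energy makes the features describe the shape of the spectrum rather than its overall level, which may or may not be desirable depending on whether amplitude itself is informative for the classifier.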

## **Conclusion of EDA**

This initial exploratory analysis indicates that, at least for the representative 1 HP files examined here, the CWRU dataset contains distinct, identifiable patterns for each fault class, particularly in the frequency domain. The visual evidence suggests that both classical machine learning with handcrafted features and deep learning models that learn features automatically are viable approaches.

With this understanding, we can now proceed to the next stage: **systematic data preprocessing, feature engineering, and augmentation.**