# Gearbox Vibration Data Preprocessing

## 🎯 Aim
This notebook imports, labels, and merges vibration signal data from healthy and broken gearboxes. The goal is to create a structured dataset for further machine learning tasks.

My task is to preprocess the data by:
- Assigning appropriate labels for the gearbox conditions (healthy or broken).
- Extracting metadata (load percentage) from filenames.
- Creating a single, structured DataFrame that combines both healthy and broken gearbox data.
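The filename-based metadata extraction can be sketched as below. This is a minimal illustration that assumes every file follows the pattern seen in this dataset's names (for example `h30hz90.csv`: a leading `h` or `b`, a number, `hz`, then the load percentage). The helper name `parse_gearbox_filename` and the regular expression are hypothetical and not part of the notebook's code. The notebook itself takes the condition from the folder name, so reading it from the filename prefix is an additional assumption.

```python
import re
from pathlib import Path

# Assumed naming pattern: <h|b><number>hz<load>, e.g. "h30hz90" or "b30hz0".
FILENAME_PATTERN = re.compile(r"^(?P<prefix>[hb])(?P<freq>\d+)hz(?P<load>\d+)$")


def parse_gearbox_filename(filename):
    """Return (condition, load) parsed from a gearbox CSV filename (hypothetical helper)."""
    stem = Path(filename).stem.lower()
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        # Fail loudly instead of silently producing a wrong load value
        raise ValueError(f"Unexpected filename format: {filename}")
    condition = 0 if match.group("prefix") == "h" else 1  # 0 = healthy, 1 = broken
    return condition, int(match.group("load"))


print(parse_gearbox_filename("h30hz90.csv"))
print(parse_gearbox_filename("b30hz0.csv"))
```

Matching the whole stem, rather than slicing its last two characters, means a name that does not fit the pattern raises an error instead of yielding a misleading load.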

## 🪜 Steps Involved
1. **Define and locate the data folders** (`Healthy` and `BrokenTooth`) inside the `data/` directory.
2. **Load all `.csv` files** from each folder.
3. **Keep the sensor column names** from each file's header (`a1` to `a4`) and add metadata:
   - `condition`: 0 for healthy, 1 for broken
   - `load`: extracted from filename
   - `source_file`: original file name
4. **Combine all the processed data** into a single DataFrame.
5. **Export the final structured data** as `processed_gearbox_data.csv`.
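The steps above can be condensed into one reusable function so that the healthy and broken folders share the same loading logic. This is a sketch under assumptions: the function name `load_folder` is hypothetical, each CSV is assumed to carry an `a1,a2,a3,a4` header, and the demo runs on tiny synthetic files with made-up values rather than the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_folder(folder, condition):
    """Read every CSV in `folder` and tag its rows with condition, load, and source_file."""
    frames = []
    for file in sorted(Path(folder).glob("*.csv")):  # sorted for a reproducible order
        df = pd.read_csv(file)
        df["condition"] = condition
        df["load"] = int(file.stem.split("hz")[-1])  # digits after "hz" in the name
        df["source_file"] = file.name
        frames.append(df)
    return frames


# Demo on synthetic files (values are invented for illustration only)
with tempfile.TemporaryDirectory() as tmp:
    healthy = Path(tmp) / "Healthy"
    broken = Path(tmp) / "BrokenTooth"
    healthy.mkdir()
    broken.mkdir()
    (healthy / "h30hz0.csv").write_text("a1,a2,a3,a4\n0.1,0.2,0.3,0.4\n")
    (broken / "b30hz90.csv").write_text("a1,a2,a3,a4\n1.1,1.2,1.3,1.4\n2.1,2.2,2.3,2.4\n")

    demo = pd.concat(load_folder(healthy, 0) + load_folder(broken, 1), ignore_index=True)

print(demo[["condition", "load", "source_file"]])
```

Passing the label as an argument keeps the two folders on a single code path, so a fix to the load parsing only has to be made once.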

## 📦 Final Output
A CSV file containing all vibration data with columns: `a1`, `a2`, `a3`, `a4`, `condition`, `load`, and `source_file`. This dataset will be used for machine learning tasks such as classification and anomaly detection.
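Before using the file downstream, it may help to confirm that it matches the layout described above. The following is a minimal sketch, not part of the notebook: `check_schema` and `EXPECTED_COLUMNS` are hypothetical names, and the small frame stands in for the real data with invented values.

```python
import pandas as pd

# Column layout as documented for processed_gearbox_data.csv
EXPECTED_COLUMNS = ["a1", "a2", "a3", "a4", "condition", "load", "source_file"]


def check_schema(df):
    """Raise ValueError if the frame does not match the documented layout (hypothetical helper)."""
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {list(df.columns)}")
    if not set(df["condition"].unique()) <= {0, 1}:
        raise ValueError("condition must be 0 (healthy) or 1 (broken)")


# Stand-in data with made-up values
sample = pd.DataFrame({
    "a1": [0.1, 0.2, 0.3],
    "a2": [1.0, 1.1, 1.2],
    "a3": [-0.5, -0.4, -0.3],
    "a4": [2.0, 2.1, 2.2],
    "condition": [0, 0, 1],
    "load": [0, 0, 90],
    "source_file": ["h30hz0.csv", "h30hz0.csv", "b30hz90.csv"],
})

check_schema(sample)

# Rows per (condition, load) pair, useful for spotting missing or unbalanced groups
counts = sample.groupby(["condition", "load"]).size()
print(counts)
```

On the real file, the same `groupby` gives a quick view of how many samples each condition and load level contributes before any model is trained.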


In [2]:
import pandas as pd
from pathlib import Path

In [3]:
# Define paths
data_dir = Path("../data")
healthy_dir = data_dir / "Healthy"
broken_dir = data_dir / "BrokenTooth"

In [4]:
# Step 1: Read Healthy files
healthy_data = []
for file in sorted(healthy_dir.glob("*.csv")):  # sorted for a reproducible row order
    try:
        df = pd.read_csv(file)
        df['condition'] = 0  # Healthy
        # Load is the number after "hz" in the filename, e.g., 90 from h30hz90
        df['load'] = int(file.stem.split("hz")[-1])
        df['source_file'] = file.name
        healthy_data.append(df)
    except Exception as e:
        print(f"⚠️ Error reading {file.name}: {e}")

In [5]:
# Step 2: Read BrokenTooth files
broken_data = []
for file in sorted(broken_dir.glob("*.csv")):  # sorted for a reproducible row order
    try:
        df = pd.read_csv(file)
        df['condition'] = 1  # Broken
        # Load is the number after "hz" in the filename, e.g., 90 from b30hz90
        df['load'] = int(file.stem.split("hz")[-1])
        df['source_file'] = file.name
        broken_data.append(df)
    except Exception as e:
        print(f"⚠️ Error reading {file.name}: {e}")

In [6]:
# Step 3: Combine both
all_data = pd.concat(healthy_data + broken_data, ignore_index=True)

In [7]:
# Step 4: Save to CSV
output_file = data_dir / "processed_gearbox_data.csv"
all_data.to_csv(output_file, index=False)

In [8]:
# Step 5: Print summary
print("✅ Data preprocessing complete!")
print(f"🔢 Total samples: {len(all_data)}")
print(f"🔧 Broken samples: {all_data['condition'].sum()}")
print(f"✅ Healthy samples: {len(all_data) - all_data['condition'].sum()}")
print("\n🔍 Sample data:")
print(all_data.head())

✅ Data preprocessing complete!
🔢 Total samples: 2021119
🔧 Broken samples: 1005311
✅ Healthy samples: 1015808

🔍 Sample data:
         a1        a2        a3       a4  condition  load source_file
0  4.636710  0.516978 -3.205940  1.82241          0     0  h30hz0.csv
1  1.992800  4.184660 -2.740610  2.80436          0     0  h30hz0.csv
2 -3.764110  0.997335 -1.303090  1.83668          0     0  h30hz0.csv
3 -4.558710  6.104330 -1.720690  1.72311          0     0  h30hz0.csv
4  0.575382  0.170980 -0.497967 -1.32895          0     0  h30hz0.csv
