# Data Exploration and Cleaning

This notebook performs exploratory data analysis (EDA) on the marine engine fault dataset.

**Objectives:**
- Load and profile the dataset
- Check for missing values and data types
- Visualize fault label distribution
- Analyze sensor feature distributions
- Identify outliers
- Document cleaning decisions

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

## 1. Load Dataset

In [None]:
# Load the marine engine fault dataset
df = pd.read_csv('../data/marine_engine_fault_dataset.csv')

print(f"Dataset shape: {df.shape}")
print(f"Total records: {len(df):,}")
print(f"Total columns (including Timestamp and Fault_Label): {df.shape[1]}")

In [None]:
# Display first few rows
df.head()

## 2. Dataset Profiling

In [None]:
# Check data types
print("Data Types:")
print(df.dtypes)

In [None]:
# Check for missing values
print("Missing Values:")
missing_values = df.isnull().sum()
missing_percent = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percent
})
# Show only the columns that actually contain missing values
missing_only = missing_df[missing_df['Missing Count'] > 0]

if missing_only.empty:
    print("\n✓ No missing values found in the dataset!")
else:
    print(missing_only)

In [None]:
# Check unique counts for all columns
print("Unique Value Counts:")
unique_counts = df.nunique().sort_values(ascending=False)
print(unique_counts)

In [None]:
# Statistical summary
df.describe()

## 3. Fault Label Distribution Analysis

In [None]:
# Define fault label mapping
FAULT_LABELS = {
    0: "Normal",
    1: "Fuel Injection Fault",
    2: "Cooling System Fault",
    3: "Turbocharger Fault",
    4: "Bearing Wear",
    5: "Lubrication Oil Degradation",
    6: "Air Intake Restriction",
    7: "Vibration Anomaly"
}

# Count distribution
fault_counts = df['Fault_Label'].value_counts().sort_index()
fault_percentages = (fault_counts / len(df)) * 100

print("Fault Label Distribution:")
for label, count in fault_counts.items():
    print(f"{label} - {FAULT_LABELS[label]}: {count:,} ({fault_percentages[label]:.2f}%)")

In [None]:
# Visualize fault label distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart with counts
fault_counts.plot(kind='bar', ax=ax1, color='steelblue', edgecolor='black')
ax1.set_title('Fault Label Distribution (Counts)', fontsize=14, fontweight='bold')
ax1.set_xlabel('Fault Label', fontsize=12)
ax1.set_ylabel('Count', fontsize=12)
# Take tick labels from the index so an absent class cannot shift the names
ax1.set_xticklabels([FAULT_LABELS[i] for i in fault_counts.index], rotation=45, ha='right')
ax1.grid(axis='y', alpha=0.3)

# Add count labels on bars
for i, v in enumerate(fault_counts):
    # va='bottom' already places the text above the bar, so no fixed offset is needed
    ax1.text(i, v, f"{v:,}", ha='center', va='bottom', fontweight='bold')

# Pie chart with percentages
colors = plt.cm.Set3(range(len(fault_counts)))
ax2.pie(fault_counts, labels=[FAULT_LABELS[i] for i in fault_counts.index], autopct='%1.1f%%',
        startangle=90, colors=colors)
ax2.set_title('Fault Label Distribution (Percentages)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Check for class imbalance
max_class = fault_percentages.max()
min_class = fault_percentages.min()
imbalance_ratio = max_class / min_class

print("\nClass Balance Analysis:")
print(f"Most common class: {max_class:.2f}%")
print(f"Least common class: {min_class:.2f}%")
print(f"Imbalance ratio: {imbalance_ratio:.2f}x")

if imbalance_ratio < 2:
    print("✓ Dataset is well-balanced!")
else:
    print("⚠ Dataset shows class imbalance - consider stratified sampling")

## 4. Sensor Feature Analysis

This section analyzes the 18 sensor features to understand their distributions and identify potential issues.

In [None]:
# Define sensor feature columns (excluding Timestamp and Fault_Label)
sensor_features = [
    'Shaft_RPM', 'Engine_Load', 'Fuel_Flow', 'Air_Pressure', 'Ambient_Temp',
    'Oil_Temp', 'Oil_Pressure', 'Vibration_X', 'Vibration_Y', 'Vibration_Z',
    'Cylinder1_Pressure', 'Cylinder1_Exhaust_Temp',
    'Cylinder2_Pressure', 'Cylinder2_Exhaust_Temp',
    'Cylinder3_Pressure', 'Cylinder3_Exhaust_Temp',
    'Cylinder4_Pressure', 'Cylinder4_Exhaust_Temp'
]

print(f"Total sensor features: {len(sensor_features)}")
print(f"Features: {sensor_features}")
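
Before plotting, it can help to confirm that every name in `sensor_features` exists in the DataFrame and is numeric. The sketch below is only an illustration on a small stand-in frame with made-up values; `check_features` is a hypothetical helper, not part of this notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_features(frame, features):
    """Return (missing, non_numeric) lists for the requested feature names."""
    missing = [name for name in features if name not in frame.columns]
    non_numeric = [
        name for name in features
        if name in frame.columns and not is_numeric_dtype(frame[name])
    ]
    return missing, non_numeric


# Stand-in data (hypothetical values), not the real dataset
demo = pd.DataFrame({
    "Shaft_RPM": [950.0, 1010.5, 980.2],
    "Oil_Temp": ["78.1", "80.4", "79.9"],  # stored as strings on purpose
    "Fault_Label": [0, 1, 0],
})

missing, non_numeric = check_features(demo, ["Shaft_RPM", "Oil_Temp", "Fuel_Flow"])
print("Missing columns:", missing)
print("Non-numeric columns:", non_numeric)
```

On the real data the same call would be `check_features(df, sensor_features)`; two empty lists mean the histograms and boxplots below can run on all listed features.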

### 4.1 Histograms for All Sensor Features

In [None]:
# Generate histograms for all 18 sensor features
fig, axes = plt.subplots(6, 3, figsize=(18, 20))
axes = axes.flatten()

for idx, feature in enumerate(sensor_features):
    ax = axes[idx]
    df[feature].hist(bins=50, ax=ax, color='skyblue', edgecolor='black', alpha=0.7)
    ax.set_title(f'{feature}', fontsize=11, fontweight='bold')
    ax.set_xlabel('Value', fontsize=9)
    ax.set_ylabel('Frequency', fontsize=9)
    ax.grid(axis='y', alpha=0.3)
    
    # Add mean and median lines
    mean_val = df[feature].mean()
    median_val = df[feature].median()
    ax.axvline(mean_val, color='red', linestyle='--', linewidth=1.5, label=f'Mean: {mean_val:.2f}')
    ax.axvline(median_val, color='green', linestyle='--', linewidth=1.5, label=f'Median: {median_val:.2f}')
    ax.legend(fontsize=8)

plt.tight_layout()
plt.suptitle('Distribution of All Sensor Features', fontsize=16, fontweight='bold', y=1.001)
plt.show()

### 4.2 Boxplots to Identify Outliers

In [None]:
# Generate boxplots for all 18 sensor features
fig, axes = plt.subplots(6, 3, figsize=(18, 20))
axes = axes.flatten()

for idx, feature in enumerate(sensor_features):
    ax = axes[idx]
    df.boxplot(column=feature, ax=ax, patch_artist=True,
               boxprops=dict(facecolor='lightblue', color='black'),
               medianprops=dict(color='red', linewidth=2),
               whiskerprops=dict(color='black'),
               capprops=dict(color='black'),
               flierprops=dict(marker='o', markerfacecolor='orange', markersize=3, alpha=0.5))
    ax.set_title(f'{feature}', fontsize=11, fontweight='bold')
    ax.set_ylabel('Value', fontsize=9)
    ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.suptitle('Boxplots for Outlier Detection', fontsize=16, fontweight='bold', y=1.001)
plt.show()

In [None]:
# Quantify outliers using IQR method
print("Outlier Analysis (IQR Method):")
print("=" * 80)

outlier_summary = []

for feature in sensor_features:
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[feature] < lower_bound) | (df[feature] > upper_bound)]
    outlier_count = len(outliers)
    outlier_percent = (outlier_count / len(df)) * 100
    
    outlier_summary.append({
        'Feature': feature,
        'Outlier Count': outlier_count,
        'Percentage': outlier_percent,
        'Lower Bound': lower_bound,
        'Upper Bound': upper_bound
    })

outlier_df = pd.DataFrame(outlier_summary)
outlier_df = outlier_df.sort_values('Outlier Count', ascending=False)
print(outlier_df.to_string(index=False))

total_outliers = outlier_df['Outlier Count'].sum()
print(f"\nTotal outlier instances across all features: {total_outliers:,}")
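
The total above counts outlier values per feature, so a row that is extreme in several features is counted several times. The following sketch applies the same 1.5 × IQR rule in vectorized form and also reports how many distinct rows are affected. It runs on a small stand-in frame with made-up values, and `iqr_outlier_mask` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_mask(frame, k=1.5):
    """Boolean frame: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the column names
    return (frame < q1 - k * iqr) | (frame > q3 + k * iqr)


# Stand-in data (hypothetical values): the last row is extreme in both columns
demo = pd.DataFrame({
    "Oil_Temp": [80.0, 81.0, 79.0, 80.0, 82.0, 81.0, 80.0, 120.0],
    "Oil_Pressure": [4.0, 4.1, 3.9, 4.0, 4.2, 4.1, 4.0, 1.0],
})

mask = iqr_outlier_mask(demo)
per_feature = mask.sum()                     # outlier values per column
rows_flagged = int(mask.any(axis=1).sum())   # rows with at least one outlier

print(per_feature)
print(f"Outlier values: {int(per_feature.sum())}, distinct rows: {rows_flagged}")
```

Here two outlier values fall in a single row, which is the situation where the per-feature total overstates how many records are unusual.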

## 5. Correlation Analysis

In [None]:
# Compute correlation matrix for sensor features
correlation_matrix = df[sensor_features].corr()

# Visualize correlation heatmap
plt.figure(figsize=(16, 14))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Sensor Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Identify highly correlated feature pairs
print("\nHighly Correlated Feature Pairs (|correlation| > 0.7):")
print("=" * 80)

high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.7:
            high_corr_pairs.append({
                'Feature 1': correlation_matrix.columns[i],
                'Feature 2': correlation_matrix.columns[j],
                'Correlation': correlation_matrix.iloc[i, j]
            })

if high_corr_pairs:
    high_corr_df = pd.DataFrame(high_corr_pairs)
    high_corr_df = high_corr_df.sort_values('Correlation', ascending=False, key=abs)
    print(high_corr_df.to_string(index=False))
else:
    print("No highly correlated feature pairs found.")
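
The nested loop above can also be written without explicit indexing by masking the strict upper triangle of the correlation matrix. This is an alternative sketch, not the notebook's method; `find_high_corr_pairs` is a hypothetical helper and the demo values are made up.

```python
import numpy as np
import pandas as pd


def find_high_corr_pairs(frame, threshold=0.7):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears exactly once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().rename("Correlation").reset_index()
    pairs.columns = ["Feature 1", "Feature 2", "Correlation"]
    # NaN entries (masked cells or constant columns) fail this comparison and drop out
    pairs = pairs[pairs["Correlation"].abs() > threshold]
    pairs = pairs.sort_values("Correlation", key=lambda s: s.abs(), ascending=False)
    return pairs.reset_index(drop=True)


# Stand-in data (hypothetical values)
demo = pd.DataFrame({
    "Cylinder1_Pressure": [10.0, 12.0, 14.0, 16.0, 18.0],
    "Cylinder2_Pressure": [10.5, 12.1, 13.8, 16.2, 17.9],  # tracks cylinder 1
    "Oil_Pressure": [4.0, 3.0, 2.5, 2.0, 1.0],             # falls as pressure rises
    "Ambient_Temp": [25.0, 24.0, 26.0, 24.0, 25.0],        # roughly unrelated
})

pairs = find_high_corr_pairs(demo)
print(pairs.to_string(index=False))
```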

## 6. Data Cleaning Decisions

### Summary of Findings:

1. **Missing Values**: The dataset has no missing values, so no imputation or row removal is needed.

2. **Data Types**: All sensor features are numeric (float64) and Fault_Label is an integer, so no type conversion is needed before modeling.

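3. **Class Balance**: Section 3 reports the imbalance ratio (share of the most common class divided by the share of the least common class). A ratio below 2 is treated as well balanced; otherwise stratified sampling is advised.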

4. **Outliers**: Outliers were flagged with the 1.5 × IQR rule (Section 4.2). These are likely legitimate extreme operating conditions rather than sensor errors, so we will retain them.

5. **Feature Correlations**: Section 5 lists feature pairs with |correlation| > 0.7. Correlation among the cylinder measurements would be expected, since the cylinders are similar components.
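
One way to probe point 4 is to compare how often IQR-flagged rows occur in each fault class. If flagged rows concentrate in the fault classes rather than in class 0 (Normal), that is consistent with outliers reflecting real fault conditions. This is a heuristic check, not proof. The sketch uses made-up stand-in values, and `outlier_rate_by_class` is a hypothetical helper.

```python
import pandas as pd


def outlier_rate_by_class(frame, features, label_col="Fault_Label", k=1.5):
    """Share of rows per class with at least one value outside the k*IQR fences."""
    values = frame[features]
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    flagged = ((values < q1 - k * iqr) | (values > q3 + k * iqr)).any(axis=1)
    # Mean of a boolean Series per group = fraction of flagged rows in that class
    return flagged.groupby(frame[label_col]).mean()


# Stand-in data (hypothetical values): class 2 rows run much hotter than class 0
demo = pd.DataFrame({
    "Oil_Temp": [80.0, 81.0, 79.0, 80.0, 82.0, 81.0, 118.0, 120.0],
    "Fault_Label": [0, 0, 0, 0, 0, 0, 2, 2],
})

rates = outlier_rate_by_class(demo, ["Oil_Temp"])
print(rates)
```

On the real data the call would be `outlier_rate_by_class(df, sensor_features)`.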

### Cleaning Decisions:

- **No data removal**: No records are dropped, since none contain missing values
- **Retain outliers**: Outliers likely represent real fault conditions and extreme operating states
- **Keep all features**: All 18 sensor features provide valuable information
- **Timestamp handling**: Will drop Timestamp column for modeling (not a predictive feature)
- **Stratified splitting**: Will use a stratified train-test split so that both splits preserve the class proportions
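
The last two decisions can be sketched with pandas alone. This is an illustration under assumptions, not the code of the next notebook: `stratified_split` is a hypothetical helper, the 20% test fraction and the seed are arbitrary, and the demo values are made up. A library routine such as scikit-learn's `train_test_split` with its `stratify` argument would be the more usual choice.

```python
import pandas as pd


def stratified_split(frame, label_col="Fault_Label", test_frac=0.2, seed=42):
    """Sample test_frac of every class for the test set; the rest is training data."""
    # Assumes a unique index, so test rows can be removed from the frame by label
    test = frame.groupby(label_col, group_keys=False).sample(frac=test_frac, random_state=seed)
    train = frame.drop(index=test.index)
    return train, test


# Stand-in data (hypothetical values): two classes with ten rows each
demo = pd.DataFrame({
    "Timestamp": pd.date_range("2024-01-01", periods=20, freq="min"),
    "Shaft_RPM": [900.0 + i for i in range(20)],
    "Fault_Label": [0] * 10 + [1] * 10,
})

model_df = demo.drop(columns=["Timestamp"])   # Timestamp is not used for modeling
train_df, test_df = stratified_split(model_df)

X_train, y_train = train_df.drop(columns=["Fault_Label"]), train_df["Fault_Label"]
print(test_df["Fault_Label"].value_counts().sort_index())
```

Sampling within each class keeps the class shares of the test set equal to those of the full data, up to rounding.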

### Next Steps:

The dataset is clean and ready for preprocessing. The next notebook will handle:
- Feature-target separation
- Train-test splitting (stratified)
- Feature scaling using StandardScaler
- Saving the preprocessor for deployment
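
As a reminder of what the scaling step does: standardization subtracts each feature's mean and divides by its standard deviation, with both statistics estimated on the training split only so that no test-set information leaks into preprocessing. The sketch below does this by hand with pandas on made-up values. It uses the population standard deviation (`ddof=0`), which matches what `StandardScaler` computes; the next notebook is stated to use `StandardScaler` itself, so this is only an illustration.

```python
import pandas as pd

# Stand-in splits (hypothetical values)
train = pd.DataFrame({"Oil_Temp": [78.0, 80.0, 82.0], "Shaft_RPM": [900.0, 1000.0, 1100.0]})
test = pd.DataFrame({"Oil_Temp": [84.0], "Shaft_RPM": [950.0]})

# Fit: statistics come from the training split only
mean = train.mean()
std = train.std(ddof=0)

# Transform: apply the same training statistics to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.round(3))
print(test_scaled.round(3))
```

After scaling, each training column has mean 0 and unit variance, while test values are expressed in training-set standard deviations.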

In [None]:
# Final dataset info
print("Final Dataset Summary:")
print("=" * 80)
print(f"Total records: {len(df):,}")
print(f"Sensor features: {len(sensor_features)}")
print(f"Target classes: {df['Fault_Label'].nunique()}")
print(f"Missing values: {df.isnull().sum().sum()}")
print("\n✓ Dataset is clean and ready for preprocessing!")