# Random Forest – Challenge: Marine Engine Condition Monitoring

## Overview

This notebook is designed as a hands-on coding challenge for beginners. You'll implement your own Random Forest classifier to predict marine engine maintenance requirements based on sensor telemetry data.

**Goal Question:** Can we predict engine maintenance status (Normal, Requires Maintenance, Critical) based solely on sensor readings and operational parameters?

## About the Dataset

**Data Source:** Marine engine telemetry data from industrial IoT sensors

### Dataset Details:
- **Total samples:** 5,200 engine readings from multiple vessels
- **Features:** 16 sensor measurements and operational parameters
- **Target:** Maintenance status (3 classes: Normal, Requires Maintenance, Critical)
- **Real-world application:** Predictive maintenance, maritime operations, cost optimization

### Features in Our Dataset:

| Feature | Description | Type | Example Values |
|---------|-------------|------|-----------------|
| **engine_temp** | Engine temperature (°C) | Numerical | 75-105°C |
| **oil_pressure** | Oil pressure (bar) | Numerical | 6-8 bar |
| **fuel_consumption** | Fuel consumption rate | Numerical | 1000-7000 L/h |
| **vibration_level** | Engine vibration (mm/s) | Numerical | 3-5 mm/s |
| **rpm** | Engine revolutions per minute | Numerical | 1400-1800 RPM |
| **engine_load** | Engine load percentage | Numerical | 20-80% |
| **coolant_temp** | Coolant temperature (°C) | Numerical | 75-100°C |
| **exhaust_temp** | Exhaust temperature (°C) | Numerical | 450°C |
| **running_period** | Operating time (hours) | Numerical | 50-150 hours |
| **fuel_consumption_per_hour** | Hourly fuel rate | Numerical | 100 L/h |
| **engine_type** | Type of marine engine | Categorical | 2-stroke, 4-stroke |
| **fuel_type** | Fuel used | Categorical | Diesel, HFO |
| **manufacturer** | Engine manufacturer | Categorical | MAN B&W, Wärtsilä, etc. |
| **failure_mode** | Current failure indication | Categorical | No Failure, Oil Leakage, etc. |

**Target Variable:** `maintenance_status` - Engine condition (Normal, Requires Maintenance, Critical)

## What You'll Learn

Through this exercise, you will:
- Handle mixed numerical and categorical industrial data
- Apply Random Forest to multi-class classification
- Work with imbalanced maintenance classes
- Understand feature importance in predictive maintenance
- Evaluate model performance for industrial applications
- Interpret results for maritime business decisions

## Instructions

**Reference Material:** Look at the Customer Churn notebook for Random Forest examples and patterns to follow.

💡 **Key Reminders:**
- This is a **multi-class classification** problem (3 maintenance states)
- Random Forest handles **mixed data types** well once categorical features are encoded numerically
- Use **stratified splitting** to preserve class proportions in the train and test sets
- **Feature importance** will reveal critical engine parameters
- Focus on **business interpretation** for maintenance decisions

Let's build a predictive maintenance system for marine engines!

## Step 1: Import Required Libraries

You'll need several Python libraries for this industrial IoT challenge. Import the essential tools for:
- **Data manipulation:** pandas, numpy
- **Visualisation:** matplotlib, seaborn  
- **Machine learning:** scikit-learn modules
- **Data preprocessing:** StandardScaler, OneHotEncoder, ColumnTransformer
- **Random Forest:** RandomForestClassifier
- **Model tuning:** GridSearchCV, train_test_split
- **Evaluation:** classification_report, confusion_matrix

💡 **Hint:** Look at the Customer Churn notebook to see which specific imports you need for Random Forest!
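
For orientation, the list above could translate into imports along these lines. This is a sketch rather than the reference answer: the Customer Churn notebook may import a slightly different set, and `Pipeline` is an addition not named in the list, assumed here because later steps combine preprocessing with the model.

```python
# Data manipulation
import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning: preprocessing, model, tuning, evaluation
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline  # assumption: not in the list above
from sklearn.preprocessing import OneHotEncoder, StandardScaler
```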

In [None]:
# TODO: Import all required libraries here
# Refer to Customer Churn.ipynb for the exact imports needed

# Data manipulation


# Visualisation  


# Machine learning


## Step 2: Load and Explore the Marine Engine Dataset

Your first task is to load the marine engine dataset and understand its structure. The dataset is stored in `Data/marine_engine_data.csv`.

### What to do:
1. **Load the data** using pandas
2. **Examine the shape** - How many engine readings and features?
3. **Check for missing values** - Any sensor failures?
4. **Display basic statistics** for numerical sensor readings
5. **Show value counts** for categorical features (engine types, manufacturers)
6. **Check the target distribution** - How balanced are the maintenance classes?

### Key Questions to Answer:
- How many readings fall into Normal vs Requires Maintenance vs Critical?
- Which sensors have the widest range of values?
- What types of engines and manufacturers are represented?
- Are there any obvious data quality issues?
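
The exploration steps above might look roughly like the sketch below. It runs on a small synthetic frame so that it works without the CSV; the generated values, the class probabilities, and the five blanked-out readings are all assumptions made for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (an assumption) so the sketch runs without the file.
# With the real data: df = pd.read_csv("Data/marine_engine_data.csv")
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "engine_temp": rng.uniform(75, 105, n),
    "oil_pressure": rng.uniform(6, 8, n),
    "engine_type": rng.choice(["2-stroke", "4-stroke"], n),
    "maintenance_status": rng.choice(
        ["Normal", "Requires Maintenance", "Critical"], n, p=[0.6, 0.3, 0.1]
    ),
})
# Blank out a few readings to imitate sensor dropouts
df.loc[df.index[:5], "oil_pressure"] = np.nan

print(df.shape)                           # (rows, columns)
missing = df.isna().sum()                 # missing values per column
print(missing)
print(df.describe())                      # statistics for numerical columns
print(df["engine_type"].value_counts())   # counts per category
class_share = df["maintenance_status"].value_counts(normalize=True)
print(class_share)                        # class proportions
```

`value_counts(normalize=True)` reports proportions rather than counts, which makes any class imbalance easier to read at a glance.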

In [None]:
# TODO: Load the marine_engine_data.csv dataset


# TODO: Explore the dataset structure


# TODO: Check for missing values


# TODO: Check maintenance status distribution


# TODO: Display basic info about the dataset


## Step 3: Visualise Engine Sensor Data

Create visualisations to understand patterns in the marine engine data. This will help you identify which sensors are most indicative of maintenance needs.

### Suggested Visualisations:
1. **Maintenance status distribution** - Bar chart of Normal vs Requires Maintenance vs Critical counts
2. **Engine temperature vs maintenance** - Box plot showing temperature distribution by status
3. **Oil pressure vs maintenance** - Box plot showing pressure patterns by status
4. **Vibration levels vs maintenance** - Box plot showing vibration by maintenance needs
5. **Engine type vs maintenance** - Count plot showing maintenance by engine type
6. **Correlation heatmap** - For numerical sensor readings only

### What to Look For:
- Which sensors show clear differences between maintenance states?
- Are there sensor value thresholds that indicate problems?
- Do different engine types have different maintenance patterns?
- Which sensors are most correlated with each other?
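
A minimal sketch of three of the suggested plots follows. It uses plain matplotlib and pandas on synthetic data (an assumption, as are the chosen means and spreads); `sns.boxplot` and `sns.heatmap` from seaborn would be natural substitutes in the notebook itself.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption) so the sketch runs on its own
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "engine_temp": rng.normal(90, 6, n),
    "oil_pressure": rng.normal(7, 0.4, n),
    "vibration_level": rng.normal(4, 0.5, n),
    "maintenance_status": rng.choice(
        ["Normal", "Requires Maintenance", "Critical"], n, p=[0.6, 0.3, 0.1]
    ),
})

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

# 1. Class distribution as a bar chart
df["maintenance_status"].value_counts().plot(
    kind="bar", ax=axes[0], title="Maintenance status"
)

# 2. Sensor distribution per maintenance class
df.boxplot(column="engine_temp", by="maintenance_status", ax=axes[1])
axes[1].set_title("Engine temperature by status")
fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." title

# 3. Correlation heatmap, numerical columns only
corr = df.select_dtypes(include="number").corr()
ticks = list(range(len(corr)))
im = axes[2].imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
axes[2].set_xticks(ticks)
axes[2].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[2].set_yticks(ticks)
axes[2].set_yticklabels(corr.columns)
axes[2].set_title("Sensor correlations")
fig.colorbar(im, ax=axes[2])
fig.tight_layout()
```

Restricting the heatmap to `select_dtypes(include="number")` keeps categorical columns out of the correlation matrix, matching item 6 in the list above.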

In [None]:
# TODO: Create visualisations to understand the engine data
# Set up plotting style similar to Customer Churn notebook

# TODO: 1. Maintenance status distribution bar chart
# TODO: 2. Engine temperature vs maintenance status box plot
# TODO: 3. Oil pressure vs maintenance status box plot  
# TODO: 4. Vibration level vs maintenance status box plot
# TODO: 5. Engine type vs maintenance status count plot
# TODO: 6. Correlation heatmap for numerical sensor readings only

## Step 4: Data Preprocessing for Mixed Industrial Data

Marine engine data combines sensor readings (numerical) with equipment specifications (categorical). Random Forest handles mixed data well, but proper preprocessing is still crucial.

### Major Preprocessing Tasks:

1. **Identify Feature Types:** Separate numerical sensors from categorical equipment data
2. **Handle Missing Values:** Check for and handle any sensor reading gaps
3. **Remove Non-Predictive Features:** Drop columns such as `timestamp` and `engine_id` if they don't add predictive value
4. **Create Preprocessing Pipeline:** Use ColumnTransformer for mixed data handling
5. **Feature Scaling:** StandardScaler for sensor readings (optional for Random Forest)
6. **Categorical Encoding:** OneHotEncoder for equipment specifications
7. **Train/Test Split:** Stratified split to preserve class proportions in both sets

### Why Each Step Matters:

- **Feature type identification:** Different preprocessing for sensors vs specifications
- **Missing value handling:** Sensor failures could be predictive or problematic
- **Non-predictive features:** Engine ID won't help predict other engines
- **Mixed data pipeline:** Consistent preprocessing for numerical and categorical features
- **Stratified split:** Maintains realistic maintenance class proportions

### Industrial Data Considerations:

- **Sensor ranges:** Different sensors have vastly different scales (RPM vs temperature)
- **Equipment categories:** Engine types, manufacturers create distinct groups
- **Time features:** Consider if timestamp adds temporal patterns
- **Operational context:** Running period and load affect sensor readings
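
The preprocessing tasks above can be sketched as follows. The data is synthetic and the column subset is an assumption chosen for brevity; with the real dataset the numerical and categorical column lists would be longer, and whether `engine_id` should be dropped is something to check rather than assume.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (an assumption)
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "engine_id": np.arange(n),
    "engine_temp": rng.uniform(75, 105, n),
    "rpm": rng.uniform(1400, 1800, n),
    "engine_type": rng.choice(["2-stroke", "4-stroke"], n),
    "fuel_type": rng.choice(["Diesel", "HFO"], n),
    "maintenance_status": rng.choice(
        ["Normal", "Requires Maintenance", "Critical"], n, p=[0.6, 0.3, 0.1]
    ),
})

# Separate target and drop the identifier column
X = df.drop(columns=["maintenance_status", "engine_id"])
y = df["maintenance_status"]

# Identify feature types from the column dtypes
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# Scale sensors, one-hot encode equipment categories
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Stratify on y so each split keeps the original class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit on the training split only, then apply to the test split
X_train_t = preprocessor.fit_transform(X_train)
X_test_t = preprocessor.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Fitting the transformer on the training split only avoids leaking test-set statistics into the scaler and encoder.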

In [None]:
# TODO: Data Preprocessing Steps
# Create preprocessing pipeline for mixed industrial data

# TODO: Identify numerical vs categorical features


# TODO: Handle any missing values


# TODO: Remove non-predictive features (timestamp, engine_id if needed)


# TODO: Create ColumnTransformer for mixed data preprocessing


# TODO: Prepare features (X) and target (y)


# TODO: Stratified train/test split to maintain class balance


## Step 5: Hyperparameter Tuning for Random Forest

Optimize your Random Forest for the best predictive maintenance performance using GridSearchCV.

### Key Hyperparameters for Industrial Applications:

1. **n_estimators:** Number of trees (100, 200) - More trees = more stable predictions
2. **max_depth:** Tree depth limit (None, 10, 20) - Controls overfitting
3. **min_samples_split:** Minimum samples required to split a node (2, 5) - Controls granularity

### Evaluation Metric:
Use **accuracy** for balanced multi-class problems, or consider **balanced_accuracy** if classes are imbalanced.
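
To see why the choice of metric matters, consider a hypothetical set of 100 readings (80 Normal, 15 Requires Maintenance, 5 Critical; these counts are invented for illustration) and a useless model that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 80 Normal, 15 Requires Maintenance, 5 Critical
y_true = np.array(
    ["Normal"] * 80 + ["Requires Maintenance"] * 15 + ["Critical"] * 5
)
# A model that ignores its input and always answers "Normal"
y_pred = np.array(["Normal"] * 100)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(f"accuracy: {acc:.2f}")               # accuracy: 0.80
print(f"balanced accuracy: {bal_acc:.2f}")  # balanced accuracy: 0.33
```

Balanced accuracy is the mean of per-class recall, here (1 + 0 + 0) / 3, so it exposes a model that never detects the minority classes even though plain accuracy looks respectable.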

### Business Considerations:
- **False negatives** (missing critical maintenance) are costly
- **False positives** (unnecessary maintenance) waste resources
- **Model stability** is crucial for operational decisions
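
One way to wire the grid above into `GridSearchCV` is sketched below. The synthetic data, the temperature-based labelling rule, the step names `prep` and `rf`, and the choice of `cv=3` are all assumptions made so the example runs quickly on its own; the parameter values are the ones listed earlier in this step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (an assumption)
rng = np.random.default_rng(2)
n = 200
temp = rng.uniform(75, 105, n)
X = pd.DataFrame({
    "engine_temp": temp,
    "vibration_level": rng.uniform(3, 5, n),
    "engine_type": rng.choice(["2-stroke", "4-stroke"], n),
})
# Hypothetical labelling rule so the target carries a learnable signal
y = pd.Series(
    np.where(temp > 101, "Critical",
             np.where(temp > 95, "Requires Maintenance", "Normal")),
    name="maintenance_status",
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocessing and model in one pipeline, so each CV fold is
# preprocessed using only its own training portion
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["engine_temp", "vibration_level"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["engine_type"]),
    ])),
    ("rf", RandomForestClassifier(random_state=42)),
])

# The "rf__" prefix routes each parameter to the pipeline step named "rf"
param_grid = {
    "rf__n_estimators": [100, 200],
    "rf__max_depth": [None, 10, 20],
    "rf__min_samples_split": [2, 5],
}

grid = GridSearchCV(model, param_grid, cv=3, scoring="balanced_accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f"best CV balanced accuracy: {grid.best_score_:.3f}")
```

After fitting, `grid.best_estimator_` holds the pipeline refitted on the full training split with the winning parameters, ready for evaluation on the test set.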

In [None]:
# TODO: Hyperparameter Tuning for Random Forest
# Create pipeline with preprocessing + Random Forest

# TODO: Define hyperparameter grid to search


# TODO: Set up GridSearchCV with appropriate scoring metric


# TODO: Fit grid search and find best parameters


# TODO: Display best hyperparameters and cross-validation score


## Step 6: Evaluate Model Performance for Predictive Maintenance

Test your optimized Random Forest on unseen engine data to evaluate real-world performance.

### Multi-Class Evaluation for Maintenance:

- **Overall Accuracy:** Percentage of correct maintenance predictions
- **Per-Class Performance:** How well does the model predict each maintenance state?
- **Confusion Matrix:** Which maintenance states are confused with each other?
- **Classification Report:** Detailed precision, recall, F1-score for each class

### Business Impact Questions:
- Are "Critical" engines correctly identified (high recall for Critical class)?
- Do we have too many false alarms for "Requires Maintenance"?
- Is the model conservative or aggressive in maintenance recommendations?
- What's the cost of each type of prediction error?
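
The evaluation steps could be sketched as below. The data and the noisy temperature rule that generates the labels are assumptions, and a bare `RandomForestClassifier` stands in for the tuned pipeline from the previous step.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): engine_temp, vibration_level
rng = np.random.default_rng(3)
n = 400
X = np.column_stack([rng.uniform(75, 105, n), rng.uniform(3, 5, n)])
# Hypothetical noisy rule, so the classes overlap a little
noisy_temp = X[:, 0] + rng.normal(0, 1.5, n)
y = np.where(noisy_temp > 101, "Critical",
             np.where(noisy_temp > 95, "Requires Maintenance", "Normal"))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

labels = ["Normal", "Requires Maintenance", "Critical"]
print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, labels=labels, zero_division=0))

# Rows are true classes, columns are predicted classes, ordered as `labels`
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)

# Recall for Critical: share of truly critical readings that were flagged
critical_recall = cm[2, 2] / cm[2].sum()
print(f"Critical recall: {critical_recall:.3f}")
```

Passing `labels` explicitly fixes the row and column order of the confusion matrix, so the last row can be read as "what happened to the truly Critical readings".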

In [None]:
# TODO: Evaluate model performance on test set
# Make predictions and calculate metrics

# TODO: Generate detailed classification report


# TODO: Create and visualize confusion matrix


# TODO: Calculate overall accuracy and interpret results


## Step 7: Feature Importance Analysis

Discover which sensors and parameters are most important for predicting maintenance needs.

### Feature Importance for Maintenance Decisions:

Random Forest provides feature importance scores that reveal:
- **Critical sensors:** Which measurements best predict maintenance needs
- **Equipment factors:** How engine type/manufacturer affects maintenance  
- **Operational patterns:** Which operational parameters matter most
- **Sensor redundancy:** Correlated sensors tend to share importance between them, so compare scores against the correlation heatmap before discarding a sensor

### Business Applications:
- **Sensor prioritization:** Focus monitoring on most predictive sensors
- **Maintenance scheduling:** Use key indicators for early intervention
- **Equipment selection:** Understand which engine types need more attention
- **Cost optimization:** Reduce monitoring costs by focusing on important features

In [None]:
# TODO: Extract and visualize feature importances
# Get feature names from preprocessing pipeline

# TODO: Get feature importance scores from trained Random Forest


# TODO: Create feature importance visualization (top 15-20 features)


# TODO: Analyze and interpret the most important features for maintenance
