# 🚀 Accurate GSM Prediction: CatBoost + Optuna

## Objective
Achieve a Mean Absolute Error (MAE) of **≤ 5 GSM** on fabric samples using 64 extracted computer vision features.

## Dataset
The augmented dataset (~1000 samples) contains engineered features, including:
- `weft_count`, `warp_count` (Thread counting)
- `weft_spacing_avg`, `warp_spacing_avg` (Density)
- `yarn_avg_area`, `yarn_std_area` (Yarn properties)
- `texture_energy`, `texture_entropy` (Surface texture)
- Frequency domain features (FFT/Gabor filters)

## Methodology
1. **Load Data**: Use `dataset_train.csv`, `dataset_val.csv`, `dataset_test.csv` from the augmented features folder.
2. **Preprocessing**: Robust scaling to handle outliers.
3. **Model**: **CatBoost Regressor** (Gradient Boosting).
4. **Optimization**: **Optuna** for Bayesian hyperparameter tuning to minimize MAE.
5. **Evaluation**: 
    - 5-Fold Cross-Validation.
    - Feature Importance Analysis (SHAP).
    - Error Analysis (Predicted vs Actual).

---

### 1. Setup & Environment

In [None]:
!pip install catboost optuna shap -q

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.preprocessing import RobustScaler
from catboost import CatBoostRegressor, Pool
import optuna
import shap
import warnings
import json

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries Installed & Imported")

### 2. Data Loading
We load the augmented datasets which contain the 64 engineered features.

In [None]:
# Mount Google Drive (for Colab)
try:
    from google.colab import drive
    drive.mount('/content/drive')
    IN_COLAB = True
    BASE_PATH = '/content/drive/MyDrive/fabric_gsm_pipeline'
except ImportError:
    IN_COLAB = False
    BASE_PATH = 'data' # Change this to your local path if not in Colab
    print("Running locally")

# Dataset paths
DATASET_PATH = f"{BASE_PATH}/augmented_features_dataset"
TRAIN_CSV = f"{DATASET_PATH}/dataset_train.csv"
VAL_CSV = f"{DATASET_PATH}/dataset_val.csv"
TEST_CSV = f"{DATASET_PATH}/dataset_test.csv"

# Load Data
df_train = pd.read_csv(TRAIN_CSV)
df_val = pd.read_csv(VAL_CSV)
df_test = pd.read_csv(TEST_CSV)

# Combine Train and Val for Cross-Validation during tuning
df_train_full = pd.concat([df_train, df_val], axis=0).reset_index(drop=True)

print(f"Train samples: {len(df_train)}")
print(f"Val samples:   {len(df_val)}")
print(f"Test samples:  {len(df_test)}")
print(f"Total samples: {len(df_train_full) + len(df_test)}")

# Target Column
TARGET = 'gsm'

# Identify Features (exclude metadata)
meta_cols = ['image_name', 'gsm', 'source', 'augmentation', 'original_image', 'split']
features = [col for col in df_train.columns if col not in meta_cols]

print(f"\n🔬 Features used ({len(features)}): {features[:5]}...")

### 3. Preprocessing
- **RobustScaler**: Scales each feature using its median and interquartile range, which keeps the scaling robust to outliers that may be present in features extracted by computer vision.

In [None]:
# Initialize Scaler
scaler = RobustScaler()

# Fit on Full Train Set (Train + Val)
X = df_train_full[features]
y = df_train_full[TARGET]

X_test = df_test[features]
y_test = df_test[TARGET]

X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for convenience with CatBoost (keeps column names)
X_scaled = pd.DataFrame(X_scaled, columns=features)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=features)

print("✅ Data Scaled using RobustScaler")
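
For reference, `RobustScaler` with its default settings transforms each feature as `(x - median) / IQR`. The sketch below reproduces that formula for a single column using only the standard library. The helper name `robust_scale` and the sample values are made up for illustration.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Scale one feature column as (x - median) / IQR."""
    # 'inclusive' quartiles use linear interpolation between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    center = median(values)
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # avoid division by zero for constant columns
    return [(v - center) / iqr for v in values]


# Hypothetical spacing values with one extreme outlier
spacing = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(spacing)
print(scaled)
```

The outlier remains large after scaling, but it does not compress the other values the way mean and standard deviation scaling would.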

### 4. Hyperparameter Tuning using Optuna
We use Optuna to find the best hyperparameters for CatBoost to minimize Mean Absolute Error (MAE).

In [None]:
def objective(trial):
    params = {
        'iterations': trial.suggest_int('iterations', 500, 2000),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'depth': trial.suggest_int('depth', 4, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1, 10),
        'random_strength': trial.suggest_float('random_strength', 1e-9, 10, log=True),
        'bagging_temperature': trial.suggest_float('bagging_temperature', 0, 1),
        'border_count': trial.suggest_int('border_count', 32, 255),
        'loss_function': 'MAE',  # Optimize MAE directly, since the target is stated in MAE
        'verbose': False,
        'random_seed': 42,
        'task_type': 'CPU' # Or GPU if available
    }

    # 5-Fold Cross-Validation within Optuna
    cv_data = Pool(data=X_scaled, label=y)
    
    from catboost import cv
    
    scores = cv(
        pool=cv_data,
        params=params,
        fold_count=5,
        seed=42,
        shuffle=True,
        stratified=False,
        plot=False,
        verbose=False
    )
    
    # Minimize Best Validation MAE
    return min(scores['test-MAE-mean'])

print("⏳ Starting Optuna Optimization...")
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50, show_progress_bar=True)

print("\n🏆 Best Params:", study.best_params)
print("🏆 Best CV MAE:", study.best_value)
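
One caveat with random 5-fold CV on an augmented dataset: if augmented copies of the same source image land in both the training and validation folds, the CV MAE can look better than it really is. The sketch below assigns folds per group so that all copies of an image stay together. It assumes the `original_image` column identifies the source image of each sample, which the column name suggests but the notebook does not confirm. `assign_group_folds` is a hypothetical helper, not part of CatBoost or Optuna, and the file names are invented.

```python
from collections import defaultdict


def assign_group_folds(groups, n_folds=5):
    """Return one fold index per sample, keeping each group in a single fold."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)

    fold_sizes = [0] * n_folds
    fold_of = [None] * len(groups)
    # Place the largest groups first, each into the currently smallest fold
    ordered = sorted(members, key=lambda g: (-len(members[g]), str(g)))
    for group in ordered:
        target = fold_sizes.index(min(fold_sizes))
        for idx in members[group]:
            fold_of[idx] = target
        fold_sizes[target] += len(members[group])
    return fold_of


# Hypothetical values of the `original_image` column
groups = ["a.png", "a.png", "b.png", "c.png", "c.png", "c.png", "d.png", "e.png"]
folds = assign_group_folds(groups, n_folds=3)
print(folds)
```

scikit-learn's `GroupKFold` implements the same idea and may be the more convenient choice in practice.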

### 5. Final Model Training
Train the final model on the full training set (train + validation) using the best parameters found.

In [None]:
best_params = study.best_params
best_params['loss_function'] = 'MAE'
best_params['verbose'] = 100
best_params['random_seed'] = 42

# Final Training
# The test set is not passed as eval_set: using it for early stopping
# would leak test information into the model and bias the test metrics.
model = CatBoostRegressor(**best_params)
model.fit(X_scaled, y)

print("\n✅ Model Training Completed")
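
The tuning objective scores each trial by the minimum of the per-iteration CV curve, so the reported CV MAE belongs to a specific number of trees rather than to the full `iterations` budget. A small helper like the one below can recover that count from the curve. `best_iteration` is a hypothetical name, and the curve values are invented for illustration.

```python
def best_iteration(curve):
    """Return (number_of_trees, score) at the minimum of a CV curve."""
    idx = min(range(len(curve)), key=curve.__getitem__)
    return idx + 1, curve[idx]  # iteration counts start at 1


# Hypothetical mean validation MAE per boosting iteration
cv_mae = [9.8, 7.1, 5.6, 4.9, 4.7, 4.8, 5.0]
n_trees, score = best_iteration(cv_mae)
print(n_trees, score)
```

Inside the objective, the count could be stored with `trial.set_user_attr("best_iteration", n_trees)` and read back from `study.best_trial.user_attrs` before the final fit.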

### 6. Evaluation on Test Set
Check whether the model meets the **MAE ≤ 5 GSM** criterion.

In [None]:
# Predictions
y_pred = model.predict(X_test_scaled)

# Metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("="*40)
print("🧪 FINAL TEST RESULTS")
print("="*40)
print(f"MAE:  {mae:.4f} GSM")
print(f"RMSE: {rmse:.4f} GSM")
print(f"R²:   {r2:.4f}")
print("="*40)

# Success Check
if mae <= 5.0:
    print("🎉 SUCCESS: MAE is within ±5 GSM target!")
else:
    print(f"⚠️ WARNING: MAE {mae:.2f} is above ±5 target.")

# Visualizing Predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.6, color='blue', edgecolor='k', label='Samples')
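# Use the test-set range so the reference line matches the plotted points
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=3, label='Perfect Prediction')

# Tolerance Band
plt.fill_between([y_test.min(), y_test.max()],
                 [y_test.min()-5, y_test.max()-5],
                 [y_test.min()+5, y_test.max()+5],
                 color='green', alpha=0.1, label='±5 GSM Tolerance')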

plt.xlabel('Actual GSM')
plt.ylabel('Predicted GSM')
plt.title(f'Actual vs Predicted GSM (Test Set)\nMAE: {mae:.2f}')
plt.legend()
plt.show()
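
Because MAE is an average, it can meet the target while individual samples still miss by more than 5 GSM. As a complementary check, the sketch below reports the share of samples inside the tolerance band alongside the MAE. It uses plain Python and invented values; `tolerance_report` is a hypothetical helper.

```python
def tolerance_report(y_true, y_pred, tol=5.0):
    """Summarize absolute errors against a +/- tol band."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errors)
    return {
        "mae": sum(errors) / n,
        "within_tol": sum(e <= tol for e in errors) / n,  # fraction inside the band
        "max_error": max(errors),
    }


# Hypothetical actual and predicted GSM values
report = tolerance_report([100, 150, 200, 250], [103, 144, 201, 258])
print(report)
```

In the notebook, the same function could be called as `tolerance_report(y_test, y_pred)`.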

### 7. Feature Importance Analysis
Inspect which features drive the predictions, as a sanity check that the model relies on physically plausible inputs.

In [None]:
feature_importances = model.get_feature_importance()
feature_names = X_scaled.columns

# Create DataFrame
fi_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
fi_df = fi_df.sort_values(by='importance', ascending=False).head(20)

# Plot
plt.figure(figsize=(12, 8))
sns.barplot(x='importance', y='feature', data=fi_df, palette='viridis')
plt.title('Top 20 Important Features for GSM Prediction')
plt.xlabel('CatBoost Feature Importance')
plt.show()

### 8. Save Model
Saving the trained model for future inference.

In [None]:
model_save_path = "catboost_gsm_model.cbm"
model.save_model(model_save_path)
print(f"✅ Model saved to {model_save_path}")
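
At inference time the model needs the same feature columns, in the same order, that it saw during training, and the inputs need the same scaling. The sketch below writes the column metadata to a JSON file next to the model. The metadata file name, the keys, and the shortened feature list are assumptions for illustration. The fitted `RobustScaler` would also need to be persisted (for example with `joblib.dump`), which is not shown here.

```python
import json
import os
import tempfile


def save_metadata(path, feature_names, target, model_file):
    """Write the information needed to rebuild model inputs at inference time."""
    meta = {
        "features": list(feature_names),  # column order matters for the model
        "target": target,
        "model_file": model_file,
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(meta, fh, indent=2)
    return meta


def load_metadata(path):
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical output location and a shortened feature list
out_dir = tempfile.mkdtemp()
meta_path = os.path.join(out_dir, "catboost_gsm_model.meta.json")
saved = save_metadata(meta_path, ["weft_count", "warp_count"], "gsm", "catboost_gsm_model.cbm")
loaded = load_metadata(meta_path)
print(loaded["features"])
```

In the notebook, `features` and `TARGET` would take the place of the example arguments.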