# 🚀 Quick Start: Data Loading & Gradio Interface

This notebook provides a streamlined workflow:
1. **Load data** from Kaggle, Google Drive, or local files
2. **Launch Gradio interface** with pre-loaded data
3. **Start training** immediately

---

## 📋 Table of Contents
- [Cell 1: Environment Setup](#cell-1)
- [Cell 2: Data Loading Options](#cell-2)
- [Cell 3: Verify Data](#cell-3)
- [Cell 4: Launch Gradio with Pre-loaded Data](#cell-4)

---

<a id="cell-1"></a>
## Cell 1: Environment Setup

Install the required packages, clone the repository, and check GPU availability.

In [None]:
# Install required packages (specifiers are quoted so the shell does not treat ">=" as a redirect)
!pip install -q "torch>=2.0.0" "gradio>=4.0.0" "pandas>=2.0.0" "numpy>=1.24.0" "scikit-learn>=1.3.0" "matplotlib>=3.7.0" "seaborn>=0.12.0"

# Clone the repository if it is not already present, then work from inside it
import os
REPO_DIR = 'Industrial-digital-twin-by-transformer'
if os.path.basename(os.getcwd()) != REPO_DIR:  # skip when a re-run is already inside the repo
    if not os.path.exists(REPO_DIR):
        !git clone https://github.com/FTF1990/Industrial-digital-twin-by-transformer.git
    os.chdir(REPO_DIR)

# Check GPU
import torch
print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

<a id="cell-2"></a>
## Cell 2: Data Loading Options

Choose **ONE** of the following methods to load your data:

### Option A: Load from Kaggle Dataset
### Option B: Load from Google Drive
### Option C: Upload Local File
### Option D: Create Example Dataset

In [None]:
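import io
import os

import numpy as np
import pandas as pd

try:
    from google.colab import files  # upload widget, only available inside Google Colab
except ImportError:
    files = None  # Options A and C need Colab; Options B and D do not use `files`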

# ============================================================================
# OPTION A: Load from Kaggle Dataset
# ============================================================================
def load_from_kaggle(dataset_name, file_name):
    """
    Load dataset from Kaggle
    
    Args:
        dataset_name: Kaggle dataset identifier (e.g., 'username/dataset-name')
        file_name: CSV file name in the dataset
    
    Example:
        df = load_from_kaggle('username/industrial-sensors', 'sensor_data.csv')
    """
    print("📦 Setting up Kaggle API...")
    
    # Upload kaggle.json if not present
    if not os.path.exists('/root/.kaggle/kaggle.json'):
        print("⚠️  Please upload your kaggle.json file:")
        uploaded = files.upload()
        
        !mkdir -p /root/.kaggle
        with open('/root/.kaggle/kaggle.json', 'w') as f:
            f.write(list(uploaded.values())[0].decode('utf-8'))
        !chmod 600 /root/.kaggle/kaggle.json
    
    # Install kaggle package
    !pip install kaggle -q
    
    # Download dataset
    print(f"📥 Downloading {dataset_name}...")
    !kaggle datasets download -d {dataset_name} --unzip
    
    # Load CSV
    print(f"📊 Loading {file_name}...")
    df = pd.read_csv(file_name)
    print(f"✅ Loaded data: {df.shape[0]} rows × {df.shape[1]} columns")
    return df

# ============================================================================
# OPTION B: Load from Google Drive
# ============================================================================
def load_from_google_drive(file_path):
    """
    Load dataset from Google Drive
    
    Args:
        file_path: Path to CSV file in Google Drive (e.g., '/content/drive/MyDrive/data.csv')
    
    Example:
        df = load_from_google_drive('/content/drive/MyDrive/sensor_data.csv')
    """
    print("📂 Mounting Google Drive...")
    from google.colab import drive
    drive.mount('/content/drive')
    
    print(f"📊 Loading {file_path}...")
    df = pd.read_csv(file_path)
    print(f"✅ Loaded data: {df.shape[0]} rows × {df.shape[1]} columns")
    return df

# ============================================================================
# OPTION C: Upload Local File
# ============================================================================
def load_from_upload():
    """
    Upload and load CSV file from local computer
    
    Example:
        df = load_from_upload()
    """
    print("📤 Please select your CSV file to upload...")
    uploaded = files.upload()
    
    # Get the first uploaded file
    file_name = list(uploaded.keys())[0]
    print(f"📊 Loading {file_name}...")
    
    df = pd.read_csv(io.BytesIO(uploaded[file_name]))
    print(f"✅ Loaded data: {df.shape[0]} rows × {df.shape[1]} columns")
    return df

# ============================================================================
# OPTION D: Create Example Dataset
# ============================================================================
def create_example_data(n_samples=10000, n_boundary=10, n_target=5, noise_level=0.1):
    """
    Create synthetic industrial sensor dataset
    
    Args:
        n_samples: Number of data points
        n_boundary: Number of boundary (input) sensors
        n_target: Number of target (output) sensors
        noise_level: Noise standard deviation
    
    Example:
        df = create_example_data(n_samples=10000, n_boundary=10, n_target=5)
    """
    print("🔧 Generating synthetic industrial sensor data...")
    
    np.random.seed(42)
    
    # Generate boundary sensor readings (inputs)
    boundary_data = np.random.randn(n_samples, n_boundary) * 10 + 100
    
    # Generate target sensors with complex relationships to boundary sensors
    target_data = np.zeros((n_samples, n_target))
    
    for i in range(n_target):
        # Complex non-linear relationships
        target_data[:, i] = (
            0.3 * boundary_data[:, i % n_boundary] +
            0.2 * boundary_data[:, (i+1) % n_boundary] ** 2 / 100 +
            0.15 * boundary_data[:, (i+2) % n_boundary] * boundary_data[:, (i+3) % n_boundary] / 100 +
            0.1 * np.sin(boundary_data[:, (i+4) % n_boundary] / 10) +
            np.random.randn(n_samples) * noise_level
        )
    
    # Create DataFrame
    boundary_cols = [f'boundary_{i+1}' for i in range(n_boundary)]
    target_cols = [f'target_{i+1}' for i in range(n_target)]
    
    df = pd.DataFrame(
        np.hstack([boundary_data, target_data]),
        columns=boundary_cols + target_cols
    )
    
    print(f"✅ Created dataset: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"   - Boundary sensors: {n_boundary}")
    print(f"   - Target sensors: {n_target}")
    return df

# ============================================================================
# SELECT YOUR DATA LOADING METHOD
# ============================================================================

print("\n" + "="*70)
print("🎯 SELECT YOUR DATA LOADING METHOD")
print("="*70)
print("\nUncomment ONE of the following options:\n")

# OPTION A: Kaggle
# df = load_from_kaggle('your-username/your-dataset', 'your_file.csv')

# OPTION B: Google Drive
# df = load_from_google_drive('/content/drive/MyDrive/your_data.csv')

# OPTION C: Upload
# df = load_from_upload()

# OPTION D: Example Data (DEFAULT)
df = create_example_data(n_samples=10000, n_boundary=10, n_target=5)

print("\n" + "="*70)
print("✅ DATA LOADED SUCCESSFULLY!")
print("="*70)

<a id="cell-3"></a>
## Cell 3: Verify Data

Quick data inspection: shape, column types, missing values, summary statistics, and two plots.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

print("📊 Data Overview:")
print("=" * 70)
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\nColumns: {list(df.columns)}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print("\nFirst 5 rows:")
display(df.head())

print("\n📈 Basic Statistics:")
display(df.describe())

# Visualize first few columns
print("\n📊 Data Visualization (first 1000 samples):")
fig, axes = plt.subplots(1, 2, figsize=(15, 4))

# Plot first 4 columns
df.iloc[:1000, :4].plot(ax=axes[0], alpha=0.7)
axes[0].set_title('First 4 Columns (1000 samples)')
axes[0].set_xlabel('Sample Index')
axes[0].set_ylabel('Sensor Value')
axes[0].legend(loc='best', fontsize=8)
axes[0].grid(True, alpha=0.3)

# Correlation heatmap
corr = df.iloc[:, :min(10, df.shape[1])].corr()
sns.heatmap(corr, annot=False, cmap='coolwarm', center=0, ax=axes[1])
axes[1].set_title('Correlation Matrix (first 10 columns)')

plt.tight_layout()
plt.show()

print("\n✅ Data verification complete! Ready to launch Gradio.")

<a id="cell-4"></a>
## Cell 4: Launch Gradio Interface with Pre-loaded Data

This cell will:
1. Save your loaded data to a temporary CSV file
2. Launch the enhanced Gradio interface
3. **Automatically load the data in Tab 1**

You can then start training immediately in Tab 2.

**Note**: The Gradio interface will open in Tab 1 showing your pre-loaded data!

In [None]:
import sys
import gradio as gr

# Save data to temporary file
temp_data_path = '/tmp/preloaded_data.csv'
df.to_csv(temp_data_path, index=False)
print(f"💾 Data saved to: {temp_data_path}")
print(f"📊 Shape: {df.shape[0]} rows × {df.shape[1]} columns\n")

# Add project to path; stop here if the app script is missing, since later steps need it
if os.path.exists('gradio_residual_tft_app.py'):
    sys.path.insert(0, os.getcwd())
    print("✅ Found gradio_residual_tft_app.py")
else:
    raise FileNotFoundError(
        "gradio_residual_tft_app.py not found. "
        "Make sure Cell 1 ran and the working directory is the project directory."
    )

# Import and modify the Gradio app to auto-load data
print("\n🚀 Launching Enhanced Gradio Interface...")
print("=" * 70)
print("📌 Your data is PRE-LOADED in Tab 1!")
print("📌 You can immediately:")
print("   1. View your data in Tab 1")
print("   2. Select boundary and target signals")
print("   3. Start SST training in Tab 2")
print("   4. Continue with Stage2 Boost training")
print("=" * 70)
print("\n⏳ Loading interface...\n")

# Launch the app with pre-loaded data
# Note: We'll create a wrapper that auto-loads the data
import importlib.util
spec = importlib.util.spec_from_file_location("gradio_app", "gradio_residual_tft_app.py")
gradio_module = importlib.util.module_from_spec(spec)

# Re-read the saved CSV so the app receives exactly what was written to disk
# (pandas was already imported as pd in Cell 2)
preloaded_df = pd.read_csv(temp_data_path)

# Execute the module (this will create the interface)
print("📱 Launching interface (this may take a moment)...\n")
spec.loader.exec_module(gradio_module)

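# Inject preloaded data into the app's global_state, if the module exposes one
if hasattr(gradio_module, 'global_state'):
    gradio_module.global_state['df'] = preloaded_df
    gradio_module.global_state['all_signals'] = list(preloaded_df.columns)
    print("✅ Data injected into Gradio app!")
else:
    print(f"⚠️  No 'global_state' found in the app module; load {temp_data_path} manually in Tab 1.")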

print("\n" + "="*70)
print("🎉 GRADIO INTERFACE IS READY!")
print("="*70)
print("👉 Check Tab 1 - your data is already loaded!")
print("👉 Select signals and start training immediately!")
print("="*70)

---

## 🎓 Quick Workflow Guide

Once Gradio launches, follow this workflow:

### Tab 1: Data Management
✅ Your data is already loaded!
- View data statistics and preview
- Select **boundary signals** (inputs)
- Select **target signals** (outputs to predict)

### Tab 2: SST Model Training
- Configure model parameters (d_model, nhead, num_layers)
- Set training hyperparameters (epochs, batch_size, learning rate)
- Click "Train SST Model" and monitor progress
- Model automatically saves after training
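
The parameters listed above interact: in PyTorch-style multi-head attention, `d_model` must be divisible by `nhead`. The sketch below checks a configuration before training starts. The `SSTConfig` class and its default values are hypothetical and only mirror the UI labels; they are not taken from the project code.

```python
from dataclasses import dataclass


@dataclass
class SSTConfig:
    """Hypothetical container for the Tab 2 settings (defaults are illustrative)."""
    d_model: int = 128
    nhead: int = 8
    num_layers: int = 3
    epochs: int = 100
    batch_size: int = 64
    learning_rate: float = 1e-3

    def validate(self) -> None:
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.nhead != 0:
            raise ValueError(
                f"d_model ({self.d_model}) must be divisible by nhead ({self.nhead})"
            )
        for name in ("d_model", "nhead", "num_layers", "epochs", "batch_size"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")


config = SSTConfig(d_model=128, nhead=8)
config.validate()  # passes: 128 / 8 = 16 dimensions per head
```

A value such as `d_model=100` with `nhead=8` would raise a `ValueError` here, which is cheaper to discover than a failure partway through model construction.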

### Tab 3: Residual Extraction
- Select your trained SST model
- Extract residuals (prediction errors)
- Analyze residual patterns
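
As a rough illustration of what a residual is, the sketch below subtracts predictions from measurements and summarizes each signal. The function name and the sign convention (measured minus predicted) are assumptions; the app may define residuals differently.

```python
import numpy as np


def compute_residuals(y_true, y_pred):
    """Residuals as measured minus predicted, one column per target signal."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: {y_true.shape} vs {y_pred.shape}")
    return y_true - y_pred


# Toy data: 2 samples, 2 target signals.
y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[0.5, 2.5], [3.0, 3.0]])

residuals = compute_residuals(y_true, y_pred)
print("Mean residual per signal:", residuals.mean(axis=0))  # systematic bias
print("Std of residual per signal:", residuals.std(axis=0))  # spread of the errors
```

A mean far from zero suggests a systematic bias in the base model, which is the kind of pattern a second-stage model can try to learn.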

### Tab 4: Stage2 Boost Training
- Select extracted residuals
- Train Stage2 model to learn residual corrections
- Further improve prediction accuracy

### Tab 5: Ensemble Model Generation
- Select base SST + Stage2 models
- Set R² threshold (default: 0.4)
- Generate intelligent ensemble model
- View per-signal improvement metrics
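
This notebook does not spell out how the threshold is applied. One plausible reading, sketched below purely as an assumption, is that a signal receives the Stage2 correction only when the base model's R² for that signal falls below the threshold. The function names and the selection rule are hypothetical; check the project documentation for the actual behavior.

```python
import numpy as np


def r2_per_signal(y_true, y_pred):
    """Coefficient of determination for each column (assumes no constant columns)."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


def select_with_threshold(y_true, base_pred, residual_pred, threshold=0.4):
    """ASSUMED rule: add the Stage2 residual prediction only where base R² < threshold."""
    use_stage2 = r2_per_signal(y_true, base_pred) < threshold
    # The boolean mask has one entry per signal and broadcasts over the rows.
    combined = np.where(use_stage2, base_pred + residual_pred, base_pred)
    return combined, use_stage2


y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
base_pred = np.column_stack([y_true[:, 0], np.full(4, 25.0)])  # signal 2 is predicted poorly
residual_pred = y_true - base_pred  # an idealized Stage2 output, for illustration only

combined, use_stage2 = select_with_threshold(y_true, base_pred, residual_pred)
print(use_stage2)  # signal 1 keeps the base prediction, signal 2 is corrected
```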

### Tab 6: Inference Comparison
- Compare SST vs. Ensemble model performance
- Visualize improvements
- Analyze prediction quality

### Tab 7: Sundial Forecasting (Optional)
- Predict future residual trends
- Long-term forecasting

---

## 💡 Tips

1. **Data Requirements**:
   - At least 1,000 samples recommended (10,000+ ideal)
   - No missing values
   - Numerical data only

2. **Signal Selection**:
   - Choose 5-20 boundary sensors
   - Choose 3-10 target sensors
   - More sensors = longer training time

3. **Training Time**:
   - SST: ~10-30 minutes (depends on data size and epochs)
   - Stage2: ~10-20 minutes
   - Use GPU for faster training

4. **Performance**:
   - Expected R²: 0.8-0.95 for SST
   - Stage2 boost: +15-25% accuracy improvement
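
The data requirements in tip 1 can be checked before launching Gradio. The helper below is a minimal sketch; its name and the `min_samples` default are illustrative and not part of the project.

```python
import pandas as pd


def check_data_requirements(df: pd.DataFrame, min_samples: int = 1000) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(df) < min_samples:
        problems.append(f"only {len(df)} rows; at least {min_samples} recommended")
    n_missing = int(df.isnull().sum().sum())
    if n_missing > 0:
        problems.append(f"{n_missing} missing values")
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    return problems


# Stand-in data; in the notebook you would pass the `df` from Cell 2.
demo = pd.DataFrame({"boundary_1": [1.0, None], "label": ["a", "b"]})
for problem in check_data_requirements(demo):
    print("-", problem)
```

Running it on the `df` from Cell 2 before Cell 4 gives an early warning instead of a training failure later.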

---

## 📚 Documentation

- **Quick Start**: `docs/QUICKSTART.md`
- **Detailed Features**: `docs/ENHANCED_VERSION_README.md`
- **Update Notes**: `docs/UPDATE_NOTES.md`
- **Main README**: `README.md`

---

## 🆘 Troubleshooting

**Q: Gradio doesn't load my data**
- Make sure Cell 2 ran successfully
- Check that `df` variable exists: `print(df.shape)`

**Q: Training is too slow**
- Reduce batch_size or num_layers
- Use fewer epochs for initial testing
- Ensure GPU is available

**Q: Out of memory error**
- Reduce batch_size
- Reduce d_model or num_layers
- Use smaller data subsets for testing
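
For the last point, a simple way to get a smaller subset is to keep the first rows of the DataFrame, which preserves the original row order. The helper name and the row count below are illustrative.

```python
import pandas as pd


def head_subset(df: pd.DataFrame, n_rows: int = 2000) -> pd.DataFrame:
    """Return the first n_rows rows, keeping the original row order."""
    return df.iloc[:n_rows].reset_index(drop=True)


# Stand-in data; in the notebook you would pass the `df` from Cell 2.
full = pd.DataFrame({"boundary_1": range(10_000), "target_1": range(10_000)})
small = head_subset(full, n_rows=2000)
print(small.shape)  # (2000, 2)
```

Assigning the result back to `df` before running Cell 4 makes the Gradio app start from the smaller table.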

**Q: Model performance is poor**
- Check data quality (no outliers, proper scaling)
- Increase epochs (try 100-200)
- Adjust learning rate (try 0.0001 - 0.01)
- Ensure boundary signals are causally related to targets
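
A quick way to act on the data-quality point is to count, per column, how many values sit far from the mean. The sketch below uses a z-score cutoff; the function name and the threshold of 4 standard deviations are assumptions, so pick a cutoff that suits your sensors.

```python
import pandas as pd


def outlier_fraction(df: pd.DataFrame, z_threshold: float = 4.0) -> pd.Series:
    """Fraction of rows per column whose absolute z-score exceeds z_threshold."""
    z = (df - df.mean()) / df.std(ddof=0)
    return (z.abs() > z_threshold).mean()


# Stand-in data; in the notebook you would pass the `df` from Cell 2.
demo = pd.DataFrame({
    "boundary_1": [0.0] * 99 + [100.0],  # one obvious spike
    "boundary_2": [0.0, 1.0] * 50,       # well-behaved
})
print(outlier_fraction(demo))
```

Columns with a noticeably higher fraction than the rest are worth inspecting before spending time on longer training runs.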

---

**Happy Training! 🎉**