# MAST Plasma Volume Challenge

This challenge asks you to infer plasma volume from frames captured by a wide-angle visible spectrum camera on the CCFE's Mega Ampere Spherical Tokamak (MAST).

## Overview

This is the second of three Data Science challenges for the ITER International School 2024.

The animation below shows footage from a wide-angle Photron bullet camera installed on MAST. This visible-spectrum camera captures high frame-rate recordings showing the complete plasma cross-section on both sides of the central column. Similar cameras first captured visual recordings of ELM structures on MAST and AUG in 2007.

![MAST Photron Camera Animation](../media/images/c3_proton_camera.gif)

**Challenge Goal:** Develop a machine learning algorithm that predicts plasma volume from a single frame of the camera feed.

This challenge introduces you to techniques for inferring a parameter from 2D image data.

The MAST Data Catalog provided the data for this competition. Thanks to the curators Samuel Jackson, Nathan Cummings, Saiful Khan, and the wider MAST community for this FAIR dataset initiative.

## Background

The image below shows the maximum plasma volume achieved for all shots in the MAST M9 campaign. These experiments were the last performed before the major upgrade to create the current MAST-U machine.

![Maximum Plasma Volume](../media/images/plasma_volume.png)

Maximum plasma volumes range from ~6 to ~10 cubic meters.

This challenge asks you to predict the volume of a plasma geometry from camera imagery and known characteristics, a problem with practical applications in tokamak operations.

## Dataset 

The `./fair_mast_data/plasma_volume` directory contains all necessary files for this challenge.

### Files
- `train.nc` - Training dataset in netCDF format
- `test.nc` - Test dataset in netCDF format

### Data Structure
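- `shot_id` - Unique identifier for each shot
- `frame` - Stack of camera frames with dimensions (shot_id, height, width)
- `plasma_volume` - Target plasma volume for each shot in the training set (read as `train.plasma_volume` in the example below)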
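
To get a feel for this layout without the real files, the sketch below builds a small synthetic dataset with the same dimension names. The array sizes, shot numbers, and values are made up for illustration, and the per-shot `plasma_volume` variable is an assumption based on how the example code reads the target.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for train.nc: 4 shots of 8x10-pixel frames (illustrative sizes).
rng = np.random.default_rng(0)
demo = xr.Dataset(
    data_vars={
        "frame": (("shot_id", "height", "width"), rng.random((4, 8, 10))),
        # Assumed per-shot target, filled with made-up values.
        "plasma_volume": (("shot_id",), rng.uniform(6.0, 10.0, size=4)),
    },
    coords={"shot_id": [101, 102, 103, 104]},  # hypothetical shot numbers
)

print(demo.sizes)            # dimension names and lengths
print(list(demo.data_vars))  # variables stored in the dataset

# Select a single 2D image by position along the shot_id dimension.
first_frame = demo.frame.isel(shot_id=0)
print(first_frame.shape)
```

Printing `train.sizes` and `list(train.data_vars)` on the real dataset is a quick way to confirm what the files actually contain.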

## Example

Both training and test datasets use the netCDF format. This self-describing format includes important metadata such as image dimensions alongside the data itself.

After importing the prerequisite libraries, load the datasets using the xarray library as shown below.

In [None]:
import pathlib

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sklearn.decomposition
import sklearn.ensemble
import sklearn.kernel_ridge
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import xarray as xr

path = pathlib.Path().absolute().parent / "fair_mast_data/plasma_volume"

try:
    train = xr.open_dataset(path / "train.nc")
    test = xr.open_dataset(path / "test.nc")
except FileNotFoundError:
    print("The plasma volume dataset is too large to commit to the repository.")
    print("Please download it from the FAIR-MAST data repository.")
    # TODO implement FAIR-MAST API here to curate the required train and test datasets.


**Important:** The camera images have dimensions (shot_id, height, width). You must reshape this data to follow the (n_samples, n_features) format required by scikit-learn. After reshaping, you can split the data using standard techniques.

In [None]:
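# Flatten each (height, width) frame into a single feature vector.
X = train.frame.values.reshape(train.sizes["shot_id"], -1)
# Extract the target as a plain NumPy array.
y = train.plasma_volume.values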

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.3, random_state=7
)
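
The reshape step can be checked on a small synthetic array. This is a minimal sketch with made-up sizes (3 frames of 2x4 pixels); it shows that each row of the flattened matrix holds one frame in row-major order, so the original image can be recovered.

```python
import numpy as np

# Synthetic stack of 3 frames, each 2x4 pixels (illustrative sizes).
frames = np.arange(3 * 2 * 4).reshape(3, 2, 4)

# Flatten each frame into one row: (n_samples, n_features).
X_demo = frames.reshape(frames.shape[0], -1)
print(X_demo.shape)

# Row i holds the pixels of frame i, so reshaping it back restores the image.
restored = X_demo[1].reshape(2, 4)
print((restored == frames[1]).all())
```

Flattening discards the explicit 2D neighbourhood structure of the pixels, which is one reason a decomposition step is useful before regression.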


## Dimensionality Reduction

The camera frames have high dimensionality: after reshaping, every pixel becomes a feature. To make the problem tractable, reduce the number of features before fitting a regressor.

The pipeline below uses `KernelPCA` as a preprocessing step, followed by a linear regression. `KernelPCA` exposes several hyperparameters, such as `n_components` and `kernel`, that can affect solution accuracy.

In [7]:
pipeline = sklearn.pipeline.make_pipeline(
    sklearn.decomposition.KernelPCA(n_components=25),
    sklearn.linear_model.LinearRegression(),
)
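
One way to choose `n_components` is to look at how much variance a given number of components retains. The sketch below is an illustration only: it swaps in plain `PCA` (not the `KernelPCA` used in the pipeline) because `PCA` exposes `explained_variance_ratio_`, and it runs on synthetic data with an assumed 5 latent factors rather than on camera frames.

```python
import numpy as np
import sklearn.decomposition

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 50 features driven by 5 latent factors plus small noise.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

pca = sklearn.decomposition.PCA(n_components=20).fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that explains at least 99% of the variance.
n_needed = int(np.searchsorted(cumulative, 0.99) + 1)
print(n_needed)
```

Applied to the flattened frames, the same curve gives a rough starting point for `n_components`; the value that works best for the regression still needs to be checked by validation.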


## Model Training and Evaluation

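With the preprocessing pipeline in place, fit the model to your training data and evaluate its performance on the held-out split (`X_test`, `y_test`). This split is taken from `train.nc` and is distinct from the competition test set in `test.nc`.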

In [None]:
try:
    pipeline.fit(X_train, y_train)
    y_predict = pipeline.predict(X_test)
    R2 = sklearn.metrics.r2_score(y_test, y_predict)
    print(f"model R2 {R2:1.3f}")
except NameError:
    print("Training data not available. Dataset must be loaded first.")


model R2 0.531
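
For reference, the score reported above is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on four made-up values and compares the result with `sklearn.metrics.r2_score`; the numbers are illustrative and unrelated to the dataset.

```python
import numpy as np
import sklearn.metrics

y_true = np.array([6.0, 7.5, 8.0, 9.5])  # made-up volumes in cubic meters
y_pred = np.array([6.5, 7.0, 8.5, 9.0])  # made-up predictions

# Residual sum of squares and total sum of squares about the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = sklearn.metrics.r2_score(y_true, y_pred)
print(f"manual {r2_manual:.3f}, sklearn {r2_sklearn:.3f}")
```

A score of 1 means perfect prediction, while 0 means the model does no better than always predicting the mean volume.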


## Solution Submission

Prepare your solution file following the same approach as the plasma_current competition: a CSV indexed by `shot_id` with a single `plasma_volume` column. Remember to reshape the test frames to match the (n_samples, n_features) format expected by your model.

In [None]:
volume = pipeline.predict(test.frame.values.reshape(test.sizes["shot_id"], -1))
solution = pd.DataFrame(
    {"plasma_volume": volume}, index=pd.Index(test.shot_id, name="shot_id")
)
solution.to_csv(path / "linear_regression.csv")
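
Before submitting, it can help to confirm that the file has the expected layout. The sketch below is a hedged illustration using hypothetical shot numbers and made-up predictions; it writes to an in-memory buffer instead of disk and reads the result back.

```python
import io

import numpy as np
import pandas as pd

shot_ids = np.array([101, 102, 103])    # hypothetical shot numbers
volume_demo = np.array([7.2, 8.1, 9.0])  # made-up predictions

solution_demo = pd.DataFrame(
    {"plasma_volume": volume_demo}, index=pd.Index(shot_ids, name="shot_id")
)

# Write to an in-memory buffer and read it back, as a stand-in for a file on disk.
buffer = io.StringIO()
solution_demo.to_csv(buffer)
header = buffer.getvalue().splitlines()[0]
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col="shot_id")

print(header)
print(loaded.shape)
```

The same checks (header line, row count, no missing values) can be run on the real file with `pd.read_csv(path / "linear_regression.csv", index_col="shot_id")`.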

In [None]:
# Improve the model with hyperparameter tuning.

# Parameter grid: keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "kernelpca__n_components": [25, 50, 100],
    "kernelpca__kernel": ["linear", "rbf", "poly"],
    "linearregression__fit_intercept": [True, False],
}

# Grid search with 5-fold cross-validation on the training split.
grid_search = sklearn.model_selection.GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="r2",
    verbose=1,
)

# Train the model
grid_search.fit(X_train, y_train)

# Best parameters and score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Evaluate on test set
best_model = grid_search.best_estimator_
y_predict_best = best_model.predict(X_test)
R2_best = sklearn.metrics.r2_score(y_test, y_predict_best)
print(f"Best model R2: {R2_best:.3f}")

# Generate final predictions
best_volume = best_model.predict(test.frame.values.reshape(test.sizes["shot_id"], -1))
best_solution = pd.DataFrame(
    {"plasma_volume": best_volume}, index=pd.Index(test.shot_id, name="shot_id")
)
best_solution.to_csv(path / "optimized_regression.csv")
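
The parameter grid keys used above, such as `kernelpca__n_components`, come from the step names that `make_pipeline` assigns automatically (the lowercased class names). The sketch below rebuilds the same two-step pipeline under the name `demo_pipeline`, chosen here for illustration, and lists the names that can appear in a grid.

```python
import sklearn.decomposition
import sklearn.linear_model
import sklearn.pipeline

demo_pipeline = sklearn.pipeline.make_pipeline(
    sklearn.decomposition.KernelPCA(n_components=25),
    sklearn.linear_model.LinearRegression(),
)

# Step names generated by make_pipeline.
step_names = list(demo_pipeline.named_steps)
print(step_names)

# Nested parameters use the "<step name>__<parameter>" form.
tunable = sorted(name for name in demo_pipeline.get_params() if "__" in name)
print(tunable)
```

Any entry in `tunable` is a valid key for `param_grid`, so this listing is a convenient check when adding further hyperparameters to the search.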