(Note) This notebook may do not reproduce the submitted solution. This is due to my insufficient knowledge of Kaggle notebooks, which led to the loss of the exact state at the time of submission.

This competition involved a very unique task: estimating the volume of plasma from its images. Due to strict time and computational constraints, the analysis here is far from comprehensive, but I will document my thoughts and experiments.

For this task, I attempted the following methods to create input data for the regression model:
(1) Compressing all images using PCA
(2) Extracting texture features from all images
(3) Extracting features using a pre-trained CNN

Methods (2) and (3) were abandoned because they performed poorly during cross-validation using the training data.

Regarding (2), it is likely that the critical factors characterizing plasma volume are related more to the shape of the plasma than its texture, which might explain why texture features were not useful.

Regarding (3), one notable characteristic of the dataset is that the images were captured from a strictly fixed camera. Thus, the translational and scale invariance of CNNs may not have been beneficial in this case. Additionally, the ResNet50 CNN used here extracts 2048-dimensional features from each image. This dimensionality may have been too large relative to the size of the dataset, potentially hindering effective training. It may be worth exploring whether compressing the features extracted by the CNN using PCA or other methods and then applying a regression model could improve performance.

The reason (1) worked effectively is likely because the camera was strictly fixed, which suggests that the generalization performance to other fusion experiment facilities may be poor. However, it seems to have performed well on the current dataset. PCA, FastICA, and KernelICA were applied for dimensionality reduction, followed by the application of Linear Regression, Support Vector Machine, Random Forest Regression, and K-Nearest Neighbor.

In this competition, standard PCA (32 components) combined with Random Forest Regression proved effective. However, ensemble inference may improve performance further.
Why standard PCA and Random Forest did work is not investigated, but it may be solely by chance.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/mast-plasma-volume/train.nc
/kaggle/input/mast-plasma-volume/test.nc


In [2]:
import pathlib

import appdirs
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sklearn.decomposition
import sklearn.ensemble
import sklearn.kernel_ridge
import sklearn.metrics
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.model_selection
from sklearn.decomposition import *
import xarray as xr

path = pathlib.Path("/kaggle/input/mast-plasma-volume")
train = xr.open_dataset(path / "train.nc")
test = xr.open_dataset(path / "test.nc")

In [3]:
!pip install ripser pyfeats scikit-neuralnetwork --quiet

In [4]:
from persim import plot_diagrams
from ripser import ripser
from skimage import feature as sk_feature

In [5]:
train_images = [np.resize(np.array(train['frame'][i]), (256, 256)) for i in range(609)]
train_images_raw = [np.array(train['frame'][i]) for i in range(609)]
train_y = np.array(train.plasma_volume)

In [6]:
test_images = np.array([np.array(test['frame'][i]) for i in range(260)])

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score 

In [8]:
import pyfeats

In [9]:
def get_image_feat(image):
    feats_ls, labels_ls = [], []
    f1, f2, l1, l2 = pyfeats.glcm_features(image)
    feats_ls.extend(f1 + f2)
    labels_ls.extend(l1 + l2)
    pdf, cdf = pyfeats.grayscale_morphology_features(image, N=6)
    feats_ls.extend(list(pdf) + list(cdf))
    labels_ls.extend([f"gmf_pdf_{i}" for i in range(len(cdf))] + [f"gmf_cdf_{i}" for i in range(6)])
    f, l = pyfeats.hu_moments(image)
    feats_ls.extend(list(f))
    labels_ls.extend(list(l))
    return feats_ls, labels_ls

In [10]:
tmp = [get_image_feat(train_image) for train_image in train_images]
train_feats = [t[0] for t in tmp]
feat_names = tmp[0][1]

In [11]:
train_X = np.array(train_feats)
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
train_X_ = standard_scaler.fit_transform(train_X)

In [12]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
for model in (SVR(), XGBRegressor(), RandomForestRegressor(), KNeighborsRegressor()):
    print(cross_val_score(model, train_X_, train_y, scoring="r2"))

[0.27785906 0.18415027 0.29756476 0.44784071 0.2741435 ]
[ 0.49764081 -0.10480112  0.34669294  0.624867    0.28189469]
[0.46980869 0.20192377 0.68203589 0.76950454 0.3651174 ]
[0.34000227 0.28527663 0.4499328  0.51174406 0.1772999 ]


In [13]:
from sklearn.feature_selection import SelectFromModel, SequentialFeatureSelector
sfs = SequentialFeatureSelector(XGBRegressor(), scoring="r2", n_features_to_select=10)
sfs.fit(train_X_, train_y)

In [14]:
from sklearn.decomposition import PCA, SparsePCA, KernelPCA

In [15]:
from sklearn.ensemble import RandomForestRegressor
for model in (SVR(), XGBRegressor(), RandomForestRegressor(), KNeighborsRegressor()):
    print(cross_val_score(model, train_X_[:, sfs.support_], train_y, scoring="r2"))

[0.10830132 0.01348686 0.08387914 0.29290865 0.13605593]
[0.50891277 0.10285039 0.57599339 0.66126188 0.30703274]
[0.49016778 0.19223702 0.68911938 0.75089035 0.35092058]
[0.12549751 0.09192903 0.28816827 0.33431255 0.03747459]


In [16]:
def image_preprocess(image):
    image = np.array([image, image, image])
    return ((image / 255.) - 0.126) / 0.201

In [17]:
train_images_preprocessed = [image_preprocess(img) for img in train_images]

In [18]:
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor


model = resnet50(weights="IMAGENET1K_V1")

return_nodes = [
    "layer4"
    ]
cnn_feats_extractor = create_feature_extractor(model, return_nodes=return_nodes)
with torch.inference_mode():
    intermediate_outputs = cnn_feats_extractor(torch.Tensor(np.array(train_images_preprocessed)))

Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 159MB/s]


In [19]:
train_cnn_feats = intermediate_outputs['layer4'].numpy().mean((2, 3))

In [20]:
for n in [1, 3, 5, 8, 11, 15]:
    knr = KNeighborsRegressor(n)
    print(n, cross_val_score(knr, np.array(train_images_raw).reshape(609, -1), train_y, scoring="r2", cv=5))

1 [0.76969302 0.22323418 0.7826507  0.78316694 0.31904024]
3 [0.7776805  0.24523756 0.78149515 0.79351898 0.67496158]
5 [0.77573879 0.2600931  0.71011923 0.81510972 0.56962662]
8 [0.72224166 0.34994136 0.72506839 0.80759008 0.53242911]
11 [0.69362736 0.39616147 0.70200555 0.80142453 0.50045035]
15 [0.69575582 0.38734162 0.69279027 0.78164119 0.44695659]


In [21]:
train_images_raw = np.array(train_images_raw)

In [22]:
for num_comps in [2, 4, 8, 16, 32, 64]:
    print(num_comps)
    pca = PCA(num_comps).fit(train_images_raw.reshape(609, -1))
    train_X_ = pca.transform(train_images_raw.reshape(609, -1))
    for model_ in (LinearRegression(), KNeighborsRegressor(), SVR(), RandomForestRegressor()):
        print(cross_val_score(model_, train_X_, train_y, scoring="r2"))

2
[ 0.1530625  -0.08385636  0.16800747  0.19818772  0.23586218]
[0.24066072 0.01331015 0.14434639 0.42514518 0.250624  ]
[0.29576637 0.08030369 0.26576615 0.43770099 0.21496945]
[0.27196497 0.02673973 0.25267268 0.43876753 0.31250084]
4
[ 0.29275067 -0.06449239  0.20819413  0.43378426  0.247062  ]
[0.68004864 0.10773839 0.51636225 0.70981708 0.45447943]
[0.5657225  0.20741928 0.52490246 0.71483024 0.32698386]
[0.68066638 0.21945233 0.68451966 0.77987177 0.43561142]
8
[ 0.61318298 -0.53061409  0.5084511   0.65465601  0.40354231]
[0.78066419 0.28742298 0.66732716 0.83271408 0.55266067]
[0.674487   0.40473596 0.71146676 0.82951946 0.53049612]
[0.68156479 0.27489694 0.66447266 0.83957371 0.54650874]
16
[0.70487189 0.05470502 0.5805823  0.68516961 0.42975995]
[0.73526525 0.38603133 0.68793149 0.80727891 0.57729592]
[0.72231859 0.47901914 0.74889478 0.81885665 0.56209389]
[0.77934763 0.34811759 0.75734895 0.83997947 0.6439888 ]
32
[ 0.75501311 -0.32492338  0.58124925  0.75464279  0.49662169]

In [23]:
train_images_flatten, test_images_flatten = train_images_raw.reshape(609, -1), test_images.reshape(260, -1)
test_preds = []
pca = FastICA(16).fit(train_images_flatten)
train_X_, test_X_ = pca.transform(train_images_flatten), pca.transform(test_images_flatten)
model = SVR().fit(train_X_, train_y)
test_preds.append(model.predict(test_X_))

pca = KernelPCA(32).fit(train_images_flatten)
train_X_, test_X_ = pca.transform(train_images_flatten), pca.transform(test_images_flatten)
model = RandomForestRegressor().fit(train_X_, train_y)
test_preds.append(model.predict(test_X_))




In [24]:
test_preds[0]

array([7.52047035, 7.65557529, 8.6090596 , 8.39419779, 8.10068662,
       8.71852936, 8.81788353, 8.74446656, 8.78410227, 8.81071331,
       8.76249618, 7.95584919, 6.89764967, 8.98719227, 7.62471505,
       8.76440747, 8.98496448, 9.02797886, 7.05933257, 8.63984791,
       7.57810816, 8.89832732, 8.25140548, 9.06191341, 8.65446637,
       9.0032927 , 7.53873268, 7.27300487, 8.73297436, 8.09055616,
       7.45927478, 7.8106881 , 7.57950961, 8.99885108, 8.80449421,
       8.72455921, 7.7471516 , 9.09136654, 7.35794542, 7.09588413,
       9.07170766, 7.24391254, 9.266258  , 7.17984918, 8.58938118,
       8.95155961, 8.94773522, 6.10713884, 8.35162276, 7.35472152,
       9.11651581, 8.65195968, 8.93449506, 7.84164554, 7.73971743,
       7.45735044, 8.80821013, 7.72544848, 9.00968869, 7.72769014,
       7.68354475, 8.87211964, 9.15885462, 7.33366613, 8.79315326,
       7.70163246, 8.6789409 , 7.32482349, 7.16952148, 7.36665337,
       8.78463655, 8.18049013, 9.06500231, 8.66029762, 7.23434

In [25]:
pd.DataFrame({'shot_id': list(range(260)),
              'plasma_volume': test_preds[0]
             }).to_csv('ica_svr_volume.csv', index=False)
pd.DataFrame({'shot_id': list(range(260)),
              'plasma_volume': test_preds[1]
             }).to_csv('pca_rf_volume.csv', index=False)
pd.DataFrame({'shot_id': list(range(260)),
              'plasma_volume': np.mean(test_preds, 0)
             }).to_csv('mean_volume.csv', index=False)


In [26]:
from IPython.display import FileLink
FileLink('pca_rf_volume.csv')

In [27]:
# Champ: num_comp=16, model=SVR()
for num_comps in [2, 4, 8, 16, 32, 64]:
    print(num_comps)
    pca = FastICA(num_comps).fit(train_images_raw.reshape(609, -1))
    train_X_ = pca.transform(train_images_raw.reshape(609, -1))
    for model_ in (LinearRegression(), KNeighborsRegressor(), SVR(), RandomForestRegressor()):
        print(cross_val_score(model_, train_X_, train_y, scoring="r2"))

2




[ 0.1530625  -0.08385636  0.16800747  0.19818772  0.23586218]
[0.25981622 0.00119024 0.17257201 0.39991882 0.27117907]
[0.36907928 0.13062075 0.24037983 0.51261759 0.26503418]
[0.18190501 0.03368947 0.25375475 0.52290984 0.20376765]
4




[ 0.29275067 -0.06449239  0.20819413  0.43378426  0.247062  ]
[0.67104447 0.12253314 0.59662415 0.68961651 0.54218201]
[0.63599512 0.26512909 0.60751522 0.75242571 0.38169139]
[ 0.70874073 -0.06022756  0.53231567  0.70162682  0.55557059]
8




[ 0.61318308 -0.53061802  0.50845108  0.65465602  0.40354244]
[0.75513023 0.21310832 0.74831908 0.85609115 0.58019765]
[0.7368391  0.51313059 0.76124887 0.85106661 0.54957194]
[0.65072391 0.40986224 0.79282354 0.86431406 0.4572904 ]
16




[0.70486921 0.05173659 0.58057833 0.68522861 0.42960461]
[0.83345135 0.43096192 0.78347249 0.86324118 0.57438473]
[0.7215442  0.59736133 0.79110306 0.84233924 0.58075271]
[0.6451695  0.44937714 0.78251229 0.81743744 0.53953357]
32




[ 0.75464618 -0.32069362  0.58100425  0.75463995  0.4968593 ]
[0.83793053 0.28143602 0.75787456 0.88590109 0.47252791]
[0.6827727  0.44581287 0.79602741 0.8607657  0.48987253]
[0.79093065 0.48282776 0.77255531 0.64574022 0.47025145]
64




[ 0.43375316 -7.31020291  0.58793926  0.76609472  0.50748991]
[0.68866942 0.18101709 0.73576198 0.87824344 0.50796424]
[0.63493319 0.40297877 0.77634622 0.84978878 0.48717719]
[0.54950792 0.20566345 0.59116895 0.52202796 0.39598447]


In [28]:
# Champ: num_comps=32, model = RandomForestRegressor()
from sklearn.decomposition import KernelPCA
for num_comps in [2, 4, 8, 16, 32, 64]:
    print(num_comps)
    pca = KernelPCA(num_comps).fit(train_images_raw.reshape(609, -1))
    train_X_ = pca.transform(train_images_raw.reshape(609, -1))
    for model_ in (LinearRegression(), KNeighborsRegressor(), SVR(), RandomForestRegressor()):
        print(cross_val_score(model_, train_X_, train_y, scoring="r2"))

2
[ 0.1530625  -0.08385636  0.16800747  0.19818772  0.23586218]
[0.24066072 0.01331015 0.14434639 0.42514518 0.250624  ]
[0.29572797 0.08030369 0.26576615 0.43770099 0.21496945]
[0.28841889 0.02912537 0.2018356  0.4087928  0.3113741 ]
4
[ 0.29275067 -0.06449239  0.20819413  0.43378426  0.247062  ]
[0.68004864 0.10773839 0.51636225 0.70981708 0.45447943]
[0.56572252 0.20741928 0.52490245 0.71484875 0.32698385]
[0.6808329  0.21926585 0.66980803 0.77944296 0.41577417]
8
[ 0.61318308 -0.53061802  0.50845108  0.65465602  0.40354244]
[0.78066419 0.28742298 0.66732716 0.83271408 0.55266067]
[0.674452   0.40472531 0.71146676 0.82951956 0.53052131]
[0.72941177 0.31972248 0.71335976 0.8463862  0.56456497]
16
[0.70486921 0.05173659 0.58057833 0.68522861 0.42960461]
[0.73526525 0.38603133 0.68793149 0.80727891 0.57721039]
[0.72241537 0.47901121 0.74885825 0.81881613 0.56221385]
[0.77905686 0.40058556 0.75737179 0.8361189  0.62314559]
32
[ 0.75464618 -0.32069362  0.58100425  0.75463995  0.4968593 ]

In [29]:
for model in (SVR(), XGBRegressor(), RandomForestRegressor(), KNeighborsRegressor()):
    print(cross_val_score(model, train_cnn_feats.reshape(609, -1), train_y, scoring="r2"))

[0.5519821  0.31664282 0.63083604 0.70596369 0.3104576 ]
[0.38250609 0.0298751  0.35386261 0.67539642 0.3272993 ]
[0.39753388 0.29844707 0.50498519 0.66373812 0.33725037]
[0.61106453 0.29598313 0.62811468 0.73938629 0.30108352]
