<a href="https://colab.research.google.com/github/Harsha123456-gitty/LexiLite/blob/main/worked_examples/support_vector_machines/SVM%20example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/sp8rks/MaterialsInformatics/blob/main/worked_examples/support_vector_machines/SVM%20example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Support Vector Machine Example
Support vector machines (SVMs) are often used for classification and regression tasks. They work particularly well in high-dimensional spaces, are memory efficient (the fitted model depends only on the support vectors), and are fairly robust to overfitting. However, they are computationally intensive on large datasets, sensitive to noise, and can be hard to interpret.

For this notebook I'll be pulling some data from Materials Project. The code below uses the `mp-api` client (`MPRester`), installed with pip in the Setup section.

#### Video

https://www.youtube.com/watch?v=ebTe3o6M0Bg&list=PLL0SWcFqypCl4lrzk1dMWwTUrzQZFt7y0&index=21 (Support Vector Machines)

## Setup

Let's start by installing the packages we need. You'll also need a Materials Project API key, which is required to use the MPRester API.

In [3]:
!pip install mp-api pymatgen CBFV



Now let's connect to the API with our key and download the data.

In [9]:
import os

from mp_api.client import MPRester
import pandas as pd

# Read the key from the environment instead of hard-coding it in the notebook
API_KEY = os.environ["MP_API_KEY"]

with MPRester(API_KEY) as mpr:
    docs = mpr.materials.summary.search(
        elements=["Cl"],
        energy_above_hull=(0, 0.001),
        fields=[
            "formula_pretty",
            "band_gap",
            "density",
            "formation_energy_per_atom",
            "volume"
        ]
    )

# Convert to DataFrame
df = pd.DataFrame([{
    "pretty_formula": d.formula_pretty,
    "band_gap": d.band_gap,
    "density": d.density,
    "formation_energy_per_atom": d.formation_energy_per_atom,
    "volume": d.volume
} for d in docs])

print(df.shape)
df.head()


Retrieving SummaryDoc documents:   0%|          | 0/1883 [00:00<?, ?it/s]

(1883, 5)


|   | pretty_formula | band_gap | density | formation_energy_per_atom | volume |
|---|---|---|---|---|---|
| 0 | HgCl | 2.7243 | 6.920947 | -0.621252 | 113.26734 |
| 1 | ICl | 1.8522 | 3.599105 | -0.464541 | 599.262172 |
| 2 | ClF3 | 2.5175 | 2.497464 | -0.618293 | 491.743096 |
| 3 | AuCl | 1.6744 | 7.497046 | -0.264202 | 205.916722 |
| 4 | ZrCl | 0.0 | 4.519366 | -1.476961 | 93.089214 |
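
The query above needs a Materials Project API key. One way to keep the key out of the notebook is to read it from an environment variable and fall back to an interactive prompt. This is a minimal sketch: the helper name `load_api_key` is hypothetical, and `MP_API_KEY` is assumed as the variable name.

```python
import os
from getpass import getpass


def load_api_key(env_var="MP_API_KEY", prompt=True):
    """Return the API key from the environment, or ask for it interactively.

    `load_api_key` and its arguments are illustrative, not part of mp-api.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if prompt:
        # getpass hides the typed key so it is not echoed into the cell output
        return getpass(f"Enter your Materials Project API key ({env_var}): ").strip()
    raise RuntimeError(f"Set the {env_var} environment variable first.")
```

With this in place, `API_KEY = load_api_key()` could stand in for a literal key before calling `MPRester(API_KEY)`.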


The query above picked chlorides within 1 meV of the convex hull. Next, let's rename the columns to the `formula` and `target` names that CBFV expects, using density as the target.

In [10]:
df_cbfv = df.rename(columns={
    "pretty_formula": "formula",
    "density": "target"
})

print(df_cbfv[['formula', 'target']].head())
print(df_cbfv.shape)


  formula    target
0    HgCl  6.920947
1     ICl  3.599105
2    ClF3  2.497464
3    AuCl  7.497046
4    ZrCl  4.519366
(1883, 5)


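## CBFV Data Conversion

CBFV converts each formula into a composition-based feature vector built from tabulated element properties, here the `oliynyk` property set.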

In [13]:
from CBFV import composition

X_cbfv, y, formulae, skipped = composition.generate_features(
    df_cbfv,
    elem_prop="oliynyk",
    drop_duplicates=False,
    extend_features=True,
    sum_feat=True
)

print("X_cbfv shape:", X_cbfv.shape)
print("y shape:", y.shape)
print("Skipped formulas:", len(skipped))

# Drop columns that are entirely NaN
X_cbfv_clean = X_cbfv.dropna(axis=1, how="all")

# Align target
y_clean = y.loc[X_cbfv_clean.index]

print(X_cbfv_clean.shape, y_clean.shape)


print(X_cbfv.shape, y.shape)


Processing Input Data: 100%|██████████| 1883/1883 [00:00<00:00, 10395.80it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 1883/1883 [00:00<00:00, 5619.31it/s]



NOTE: Your data contains formula with exotic elements. These were skipped.
	Creating Pandas Objects...
X_cbfv shape: (1883, 311)
y shape: (1883,)
Skipped formulas: 19
(1883, 305) (1883,)
(1883, 311) (1883,)
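
To give a feel for what a composition-based feature vector is, here is a toy sketch that turns a formula into a few statistics of one element property. It is not CBFV's implementation: the function names are hypothetical, only atomic mass is used as the property, and the parser handles only simple formulas without parentheses.

```python
import re
from collections import Counter

# Standard atomic masses (g/mol), used here only as an example element property
ATOMIC_MASS = {"H": 1.008, "F": 18.998, "Na": 22.990, "Cl": 35.45}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'ClF3' (no parentheses)."""
    counts = Counter()
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] += float(amount) if amount else 1.0
    return dict(counts)


def featurize(formula, prop=ATOMIC_MASS):
    """Toy composition-based features for one formula and one element property."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    values = {el: prop[el] for el in counts}
    return {
        # property summed over every atom in the formula
        "sum": sum(n * values[el] for el, n in counts.items()),
        # property averaged with atomic fractions as weights
        "avg": sum(n / total * values[el] for el, n in counts.items()),
        # spread between the largest and smallest element value
        "range": max(values.values()) - min(values.values()),
    }


features = featurize("ClF3")
```

Repeating this over many element properties and several statistics is what produces a wide feature table like the one above, with one row per formula.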


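## Baseline Training

First we train an SVR with default hyperparameters as a baseline. The features are standardized beforehand because the RBF kernel depends on distances between samples, so features with large numeric ranges would otherwise dominate.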

In [15]:
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hold out a third of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_cbfv_clean,
    y_clean,
    test_size=0.33,
    random_state=42
)

print(X_train.shape, y_train.shape)

# Fit the scaler on the training set only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Baseline: RBF-kernel SVR with default hyperparameters
svr = SVR(kernel="rbf")
svr.fit(X_train_scaled, y_train)

y_pred = svr.predict(X_test_scaled)

print("Baseline R²:", r2_score(y_test, y_pred))
print("Baseline MAE:", mean_absolute_error(y_test, y_pred))


(1261, 305) (1261,)
Baseline R²: 0.9140993768996964
Baseline MAE: 0.24477071051545138
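
To make the two reported metrics concrete, the sketch below computes them by hand on a handful of made-up values. The function names and the numbers are illustrative only; the notebook itself uses scikit-learn's `r2_score` and `mean_absolute_error`.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up densities, purely for illustration
y_true_demo = [2.0, 4.0, 6.0, 8.0]
y_pred_demo = [2.5, 3.5, 6.0, 9.0]

mae_demo = mean_absolute_error_manual(y_true_demo, y_pred_demo)  # 0.5
r2_demo = r2_score_manual(y_true_demo, y_pred_demo)              # 0.925
```

MAE is in the same units as the target (density here), while R² is unitless and compares the model against always predicting the mean.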


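## Bayesian Optimization

Next we tune `C`, `gamma`, and `epsilon` with Optuna, scoring each trial by the mean 5-fold cross-validated R² on the training set.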

In [17]:
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    C = trial.suggest_float("C", 1e-2, 1e3, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e-1, log=True)
    epsilon = trial.suggest_float("epsilon", 1e-3, 1e-1, log=True)

    model = SVR(
        kernel="rbf",
        C=C,
        gamma=gamma,
        epsilon=epsilon
    )

    scores = cross_val_score(
        model,
        X_train_scaled,
        y_train,
        cv=5,
        scoring="r2"
    )

    return scores.mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)

print("Best parameters:", study.best_params)
print("Best CV R²:", study.best_value)



[I 2026-01-22 18:37:19,127] A new study created in memory with name: no-name-fd70e32c-a990-4ca6-9f35-8cf4c2d815ee
[I 2026-01-22 18:37:21,186] Trial 0 finished with value: 0.9279392816746082 and parameters: {'C': 1.512036159902833, 'gamma': 0.0011778276254221817, 'epsilon': 0.012323955854578943}. Best is trial 0 with value: 0.9279392816746082.
[I 2026-01-22 18:37:22,976] Trial 1 finished with value: 0.8748866220814616 and parameters: {'C': 1.486848305834237, 'gamma': 0.00018068928016906724, 'epsilon': 0.0219306585129759}. Best is trial 0 with value: 0.9279392816746082.
[I 2026-01-22 18:37:28,953] Trial 2 finished with value: 0.9492995279310203 and parameters: {'C': 239.8544194637118, 'gamma': 0.0001756713478301683, 'epsilon': 0.00439865774575191}. Best is trial 2 with value: 0.9492995279310203.
[I 2026-01-22 18:37:30,986] Trial 3 finished with value: 0.7847978417905784 and parameters: {'C': 90.71962626402072, 'gamma': 0.012784885409648231, 'epsilon': 0.0038760342445641255}. Best is tria

Best parameters: {'C': 119.5891650906348, 'gamma': 0.00022200612538595473, 'epsilon': 0.007421692801803034}
Best CV R²: 0.9495877889563105
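
The search ranges above span several orders of magnitude, which is why each one is declared with `log=True`. The sketch below shows the idea of sampling on a log scale using only the standard library. It is not Optuna's sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
# Same range as the `C` search space above
samples = [sample_log_uniform(1e-2, 1e3, rng) for _ in range(10_000)]

# On a log scale each of the five decades gets an equal share, so roughly
# 1 in 5 samples lands below 1e-1. A plain uniform draw over the same range
# would put almost every sample above 1.
share_lowest_decade = sum(s < 1e-1 for s in samples) / len(samples)
```

Searching this way gives small values of `C`, `gamma`, and `epsilon` as much attention as large ones.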


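## Final Training

Finally, we refit the SVR on the full training set with the best hyperparameters and evaluate it on the held-out test set.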

In [18]:
best_params = study.best_params

svr_best = SVR(
    kernel="rbf",
    **best_params
)

svr_best.fit(X_train_scaled, y_train)

y_pred = svr_best.predict(X_test_scaled)

print("Final R²:", r2_score(y_test, y_pred))
print("Final MAE:", mean_absolute_error(y_test, y_pred))


Final R²: 0.9618732118875751
Final MAE: 0.19597375492886052
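
One way to read the tuned `gamma` is through the RBF kernel itself, `exp(-gamma * ||x - z||^2)`. The sketch below compares the tuned value with an approximation of scikit-learn's default `gamma="scale"`, which is `1 / (n_features * X.var())`. It assumes the standardized features have an overall variance close to 1, and the two sample vectors are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF similarity exp(-gamma * ||x - z||^2) between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


n_features = 305
# Two hypothetical standardized samples that differ by 1 in every feature
x = [0.0] * n_features
z = [1.0] * n_features

gamma_default = 1.0 / n_features  # approximates gamma="scale" when X.var() is about 1
gamma_tuned = 2.22e-4             # rounded from the best parameters above

k_default = rbf_kernel(x, z, gamma_default)  # exp(-1), about 0.37
k_tuned = rbf_kernel(x, z, gamma_tuned)      # about 0.93
```

Under these assumptions the tuned `gamma` is roughly 15 times smaller than the default, so distant samples still look similar to the kernel and the fitted function is smoother.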
