## Support Vector Machine Example
Support vector machines (SVMs) are often used for classification and regression tasks. They work particularly well in high-dimensional spaces, are memory efficient, and are robust to overfitting. However, they are computationally intensive, sensitive to noise, and can be hard to interpret.

For this notebook I'll be pulling some data from the Materials Project. I'll use the old (legacy) API from my MyPymatgen virtual environment.

#### Video

https://www.youtube.com/watch?v=ebTe3o6M0Bg&list=PLL0SWcFqypCl4lrzk1dMWwTUrzQZFt7y0&index=21 (Support Vector Machines)

## Setup

Let's start by loading our API key, which we need in order to use the MPRester API.

In [None]:
import pandas as pd
from pymatgen.ext.matproj import MPRester
import os

filename = r'G:\My Drive\teaching\5540-6640 Materials Informatics\old_apikey.txt'

def get_file_contents(filename):
    try:
        with open(filename, 'r') as f:
            # It's assumed our file contains a single line,
            # with our API key
            return f.read().strip()
    except FileNotFoundError:
        print("'%s' file not found" % filename)


Sparks_API = get_file_contents(filename)
mpr = MPRester(Sparks_API)

Now let's grab some data to work with. We'll pick chlorides within 1 meV/atom of the convex hull.

In [None]:
# grab some props for stable chlorides
criteria = {'e_above_hull': {'$lte': 0.001}, 'elements': {'$all': ['Cl']}}
# criteria2 = {'e_above_hull': {'$lte': 0.02},'elements':{'$all':['O']},
#              'band_gap':{'$ne':0}}

props = ['pretty_formula', 'band_gap', 'density',
         'formation_energy_per_atom', 'volume']
entries = mpr.query(criteria=criteria, properties=props)

# each entry is a dict keyed by property name, so we can build the
# DataFrame in one step instead of appending row by row
df = pd.DataFrame(entries, columns=props)

When we try to build the support vector regression (SVR) model without using a composition-based feature vector (CBFV), it scores poorly.

In [None]:
from sklearn.svm import SVR
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

X = df[['band_gap','formation_energy_per_atom','volume']]
y = df['density']


We will now split the data into training and test sets. Holding out a test set lets us evaluate how accurately the model predicts data it has not seen.

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RNG_SEED)


Lastly, let's train the model and score it. We are using the SVR model from sklearn.

In [None]:

svr = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('the r2 score is',r2)
mae = mean_absolute_error(y_test, y_pred)
print('the mean absolute error is',mae)
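# take the square root ourselves so this works across scikit-learn versions
rmse_val = np.sqrt(mean_squared_error(y_test, y_pred))
print('the root mean squared error is', rmse_val)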

Our model isn't too great alone, but what if we add CBFV features? 
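
To make the idea of a CBFV concrete before using the real package: roughly speaking, a formula is turned into numbers by combining tabulated elemental properties according to the composition (averages, minima, maxima, ranges and so on). The toy sketch below does this for a single property, atomic mass. The names `ATOMIC_MASS`, `parse_formula`, and `toy_cbfv` are made up for illustration and are not part of the CBFV package, which works from much larger property tables.

```python
import re

# Atomic masses (g/mol) for a few elements: a tiny stand-in for a real
# elemental property table (hypothetical, for illustration only)
ATOMIC_MASS = {'Na': 22.990, 'Mg': 24.305, 'Cl': 35.45, 'K': 39.098}


def parse_formula(formula):
    """Return {element: amount} for a simple formula such as 'MgCl2'.

    Parentheses and nested groups are not handled in this sketch.
    """
    counts = {}
    for element, amount in re.findall(r'([A-Z][a-z]?)(\d*\.?\d*)', formula):
        counts[element] = counts.get(element, 0.0) + (float(amount) if amount else 1.0)
    return counts


def toy_cbfv(formula, prop=ATOMIC_MASS):
    """Build a few composition-based features from one elemental property."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    values = [prop[el] for el in counts]
    fractions = [n / total for n in counts.values()]
    return {
        # average weighted by atomic fraction
        'avg': sum(f * v for f, v in zip(fractions, values)),
        'min': min(values),
        'max': max(values),
        'range': max(values) - min(values),
    }


print(toy_cbfv('NaCl'))
print(toy_cbfv('MgCl2'))
```

A real CBFV repeats this for many elemental properties at once, which is why the feature matrix in the next cell has far more columns than the three we used above.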

In [None]:
from CBFV import composition
import time

rename_dict = {'density': 'target', 'pretty_formula': 'formula'}
df = df.rename(columns=rename_dict)


RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

# Split the rows first, then featurize each split on its own, so that the
# test set contains only materials the model has not been trained on
df_train, df_test = train_test_split(df, test_size=0.33, random_state=RNG_SEED)
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# extend_features=True keeps band_gap, formation_energy_per_atom and volume
# alongside the composition-based features
X_train, y_train, formulae_train, skipped_train = composition.generate_features(
    df_train, elem_prop='oliynyk', drop_duplicates=False,
    extend_features=True, sum_feat=True)
X_test, y_test, formulae_test, skipped_test = composition.generate_features(
    df_test, elem_prop='oliynyk', drop_duplicates=False,
    extend_features=True, sum_feat=True)


# technically we should scale and normalize our data here... but let's skip it for now

svr = SVR(kernel='rbf', C=1e3, gamma=0.1)

# Time only the model fit
start_time = time.time()
svr.fit(X_train, y_train)
training_time = time.time() - start_time

y_pred = svr.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('the r2 score is', r2)
mae = mean_absolute_error(y_test, y_pred)
print('the mean absolute error is', mae)
rmse_val = np.sqrt(mean_squared_error(y_test, y_pred))
print('the root mean squared error is', rmse_val)
print("Training time:", training_time, "seconds")


Way better! Our R^2 went way up and our MAE went way down
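
The cell above skipped scaling. As a minimal sketch of what that step could look like, the example below wraps a `StandardScaler` and an `SVR` in a scikit-learn pipeline, so the scaler is fit on the training data only. It runs on synthetic stand-in data (the `X_demo`/`y_demo` arrays and the `C` and `gamma` values are assumptions for illustration, not the chloride dataset or tuned settings).

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)

# Synthetic stand-in data: three features on very different scales
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 1000.0, 0.001])
y_demo = X_demo[:, 0] + X_demo[:, 1] / 1000.0 + X_demo[:, 2] * 1000.0

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# The pipeline fits the scaler on the training split, then applies the same
# transformation to anything passed to predict()
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10, gamma=0.1))
scaled_svr.fit(Xd_train, yd_train)

yd_pred = scaled_svr.predict(Xd_test)
print('r2 with scaling:', r2_score(yd_test, yd_pred))
```

The RBF kernel depends on distances between samples, so a feature with a large numeric range can dominate the others if left unscaled. A pipeline like this can also be handed to a hyperparameter search, with parameter names prefixed by the step name (for example `svr__C`).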

## Grid Search Hyperparameter Tuning

Now let's try one more time, but this time with hyperparameter tuning! We'll keep using the SVR model from sklearn, and use sklearn's GridSearchCV to try every combination of kernel, C, and gamma with 5-fold cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1]
}

# Create the SVR model
svr = SVR()

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=svr, param_grid=param_grid, cv=5)

# Start the timer
start_time = time.time()

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Calculate the training time
training_time = time.time() - start_time

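# Get the best parameters and the best mean cross-validated score
# (R2 by default for a regressor)
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# GridSearchCV already refit the best model on the full training set
# (refit=True by default), so we can use it directly
svr_best = grid_search.best_estimator_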

# Predict on the test data
y_pred = svr_best.predict(X_test)

# Evaluate the model
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Best parameters:", best_params)
print("Best score:", best_score)
print("R2 score:", r2)
print("Mean absolute error:", mae)
print("Root mean squared error:", rmse)
print("Training time:", training_time, "seconds")
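
To get a feel for how much work the grid search does, here is a small sketch that counts the candidates with scikit-learn's `ParameterGrid`. It also shows an alternative grid, written as a list of dicts, that leaves `gamma` out for the linear kernel, where it has no effect. The variable names (`full_grid`, `split_grid`) are just for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
full_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1],
}
n_full = len(ParameterGrid(full_grid))

# gamma is only used by the rbf kernel here, so the linear kernel does not
# need to be repeated once per gamma value
split_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},
]
n_split = len(ParameterGrid(split_grid))

cv_folds = 5
print(n_full, 'candidates ->', n_full * cv_folds, 'fits')    # 18 candidates -> 90 fits
print(n_split, 'candidates ->', n_split * cv_folds, 'fits')  # 12 candidates -> 60 fits
```

GridSearchCV accepts the list-of-dicts form directly as `param_grid`. The counts above exclude the one final refit on the full training set.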


## Random Search Hyperparameter Tuning
Now let's try random search hyperparameter tuning. This uses the same strategy as before, but with sklearn's RandomizedSearchCV, which samples a fixed number of parameter combinations (10 by default) instead of trying every one.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import time

# Define the parameter grid
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1]
}

# Create the SVR model
svr = SVR()

# Create the RandomizedSearchCV object (random_state makes the sampled
# combinations reproducible)
random_search = RandomizedSearchCV(estimator=svr, param_distributions=param_grid,
                                   cv=5, random_state=RNG_SEED)

# Start the timer
start_time = time.time()

# Fit the model to the training data
random_search.fit(X_train, y_train)

# Calculate the training time
training_time = time.time() - start_time

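# Get the best parameters and the best mean cross-validated score
# (R2 by default for a regressor)
best_params = random_search.best_params_
best_score = random_search.best_score_

# RandomizedSearchCV already refit the best model on the full training set
# (refit=True by default), so we can use it directly
svr_best = random_search.best_estimator_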

# Predict on the test data
y_pred = svr_best.predict(X_test)

# Evaluate the model
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Best parameters:", best_params)
print("Best score:", best_score)
print("R2 score:", r2)
print("Mean absolute error:", mae)
print("Root mean squared error:", rmse)
print("Training time:", training_time, "seconds")
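
Random search becomes more interesting when the parameters are drawn from continuous distributions rather than from the same short lists the grid search used. The sketch below, on synthetic stand-in data from `make_regression`, samples `C` and `gamma` log-uniformly with `scipy.stats.loguniform`. The ranges, `n_iter`, and variable names are assumptions for illustration, not tuned choices for the chloride data.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Synthetic stand-in data so the example runs without the Materials Project
X_rs, y_rs = make_regression(n_samples=150, n_features=5, noise=10.0,
                             random_state=42)

# Lists are sampled uniformly; distribution objects are sampled via .rvs()
param_distributions = {
    'kernel': ['linear', 'rbf'],
    'C': loguniform(1e-1, 1e1),
    'gamma': loguniform(1e-2, 1e0),
}

search = RandomizedSearchCV(SVR(), param_distributions=param_distributions,
                            n_iter=10, cv=5, random_state=42)
search.fit(X_rs, y_rs)

print('Sampled candidates:', len(search.cv_results_['params']))
print('Best parameters:', search.best_params_)
print('Best mean CV score:', search.best_score_)
```

Here `n_iter` sets the search budget directly, so the cost stays fixed however finely the parameter ranges are explored.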
