The UCI Superconductivity Dataset has two files:

(1) train.csv contains 81 features extracted from 21263 superconductors along with the critical temperature in the 82nd column

(2) unique_m.csv contains the chemical formula broken up for all the 21263 superconductors from the train.csv file. The last two columns have the critical temperature and chemical formula. 

The goal was to predict the critical temperature based on the features extracted.

### Task: Build Supervised Learning Models

### Objectives
1: Dimensionality Reduction

2: Create a Regression and Classification model

________________________________________________________

In [1]:
import pandas as pd 
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.metrics import explained_variance_score, mean_squared_error, accuracy_score, max_error, mean_absolute_error
from sklearn.metrics import f1_score, confusion_matrix,matthews_corrcoef, precision_score, recall_score
from scipy.stats import pearsonr
from sklearn.model_selection import KFold

# Preprocessing data

In [33]:
#Load both base datasets
data_train = pd.read_csv("train.csv")
data_unique = pd.read_csv('unique_m.csv')

#ChatGPT generated conversion method but slightly adapted from original output
def get_temp_class(temp):
    if temp < 1.0:
        return "VeryLow"
    elif temp < 5.0:
        return "Low"
    elif temp < 20.0:
        return "Medium"
    elif temp < 100.0:
        return "High"
    else:
        return "VeryHigh"

    
data_train['temp_class'] = data_train['critical_temp'].apply(get_temp_class)

#Separate the dependent variable as our target. This value is the same in both datasets and same order
y = data_train.values[:,81]
y_classes = data_train.values[:,82]

#Drop the dependent variable and the identifier variables
data_train.drop(columns=["critical_temp", "temp_class"], inplace=True)
data_unique.drop(columns=["critical_temp", "material"], inplace=True)

#Create a dataset with both combined
data_combined = pd.concat([data_train, data_unique], axis=1)

## Objective 1: Dimensionality reduction by applying SVD

In [34]:
#Objective 1 - Apply SVD
#Apply Standard Scaler
scaler_combined = StandardScaler().fit(data_combined.values)
scaled_combined = scaler_combined.transform(data_combined.values)

U, S, VT = np.linalg.svd(scaled_combined)

for i in range(len(S)):
  print("first %d components have a combined importance of %7.4f" %(i+1, S[:i+1].sum()/S.sum()))


first 1 components have a combined importance of  0.0507
first 2 components have a combined importance of  0.0775
first 3 components have a combined importance of  0.1032
first 4 components have a combined importance of  0.1263
first 5 components have a combined importance of  0.1467
first 6 components have a combined importance of  0.1633
first 7 components have a combined importance of  0.1795
first 8 components have a combined importance of  0.1946
first 9 components have a combined importance of  0.2087
first 10 components have a combined importance of  0.2224
first 11 components have a combined importance of  0.2353
first 12 components have a combined importance of  0.2480
first 13 components have a combined importance of  0.2602
first 14 components have a combined importance of  0.2719
first 15 components have a combined importance of  0.2837
first 16 components have a combined importance of  0.2952
first 17 components have a combined importance of  0.3066
first 18 components hav

The Singular Value Decomposition (SVD) of a given matrix is a factorization of that matrix into three matrices (U, S and VT). It can be used to reduce the dimensionality of a dataset, i.e., the number of variables. This way it is possible to extract the more important variables in a dataset (or the ones that contibute more to explain the total variance of a given dataset). Dimension reduction is important because it can mean reducing the time to build and perform a model and score data.

In this particular case, after SVD of the data considering 167 variables we can conclude that the first 140 components are sufficient to explain 100% of the total variance of the dataset. Moreover, the 73 first components are sufficient to explain 80% of the data, the first 80 components explain 85%, the first 90 components explain 90% of the data, and the first 105 components explain 95% of the data. 

Usually, retaining >85% of total explained variance of a dataset can be considered suficient to perfom data analysis, but it depends on what kind of analysis is going to be performed next. 

So, let's check how data behaves considering dimensionality reduction while building a supervised learning model, both using regression and classification.

## Objective 2: Create a Regression and Classification model

### Objective 2.1: Regression Model using Linear Regression or Decision Tree regression


In [35]:
#Objective 2.1
X = scaled_combined
# Split the input data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Regarding the regression model analysis, the goal is to predict the critical temperature of a given superconductor.

We started by searching for the best hyperparameters for a Decision Tree Regressor model, considering max_depth and min_amples parameters.

In [36]:
rve = cs = irve = irmse = ics = ima = imar = jrmse = jrve = jma = jmar = 0
ma = mar = rmse = 100000

for i in range(2,20):
    for j in range(1,20):
        model_full= DecisionTreeRegressor(max_depth=j,min_samples_split=i,random_state=123).fit(X_train, y_train)
        preds=model_full.predict(X_test)
        explained_variance_score(y_test, preds)
        corr, pval=pearsonr(y_test, preds)


        if(explained_variance_score(y_test, preds)>rve):
            rve = explained_variance_score(y_test, preds)
            irve = i
            jrve = j
        if(mean_squared_error(y_test, preds, squared=False)<rmse):
            rmse =  mean_squared_error(y_test, preds, squared=False)
            irmse = i
            jrmse = j
        if(max_error(y_test, preds)<ma):
            ma = max_error(y_test, preds)
            ima = i
            jma = j
        if(mean_absolute_error(y_test, preds)<mar):
            mar = mean_absolute_error(y_test, preds)
            imar = i
            jmar = j
            
            
print("The best RVE is: ", rve , "for i: ", irve, ' j: ', jrve)
print("The best rmse is: ", rmse , "for i: ", irmse, ' j: ', jrmse)
print("The min Maximum Error is: ", ma , "for i: ", ima, ' j: ', jma)
print("The min Mean Absolute Error is: ", mar , "for i: ", imar, ' j: ', jmar)            

The best RVE is:  0.8898664821186723 for i:  16  j:  19
The best rmse is:  11.262595629776115 for i:  16  j:  19
The min Maximum Error is:  77.88934343434346 for i:  2  j:  3
The min Mean Absolute Error is:  6.074043295935849 for i:  12  j:  19


The best results were obtained considering max_depth=16 and min_samples=19.

The Decision Tree model considering these hyperparameters was applied to the entired dataset.

Decision Trees tend to overfit and to mitigate that, a K-fold cross validation was implemented.

In [37]:
#Full Dataset
kf = KFold(n_splits=5, shuffle=True, random_state=7)
TRUTH_nfold=None
PREDS_nfold=None

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    mdl = DecisionTreeRegressor(max_depth=16,min_samples_split=19, random_state=123)
    mdl.fit(X_train, y_train)
    preds = mdl.predict(X_test)
    if TRUTH_nfold is None:
        PREDS_nfold=preds
        TRUTH_nfold=y_test
    else:
        PREDS_nfold=np.hstack((PREDS_nfold, preds))
        TRUTH_nfold=np.hstack((TRUTH_nfold, y_test))

print("The RVE is: ", explained_variance_score(TRUTH_nfold, PREDS_nfold))
print("The rmse is: ", mean_squared_error(TRUTH_nfold, PREDS_nfold, squared=False))
corr, pval=pearsonr(TRUTH_nfold, PREDS_nfold)
print("The Correlation Score is is: %6.4f (p-value=%e)\n"%(corr,pval))
print("The Maximum Error is is: ", max_error(TRUTH_nfold, PREDS_nfold))
print("The Mean Absolute Error is: ", mean_absolute_error(TRUTH_nfold, PREDS_nfold))

The RVE is:  0.8858403464424475
The rmse is:  11.573429389916896
The Correlation Score is is: 0.9418 (p-value=0.000000e+00)

The Maximum Error is is:  181.4
The Mean Absolute Error is:  6.440566676770679


A search on the best number (K) of components extracted from SVD to have a good model was performed. 

Accordingly to results obtained in Objective 1, the first 30 to 140 combined components were considered.

In [38]:
#Compute the SVD of the training data
U, S, VT = np.linalg.svd(X_train)

best_score = -1
best_k = 0

#Test between the usage of 30 to 140 components
for k in range(30,140):
    VT_k = VT[:k, :]
    
    #Use the reduced-dimensional training data to train a linear regression model
    X_train_reduced = np.dot(X_train, VT_k.T)
    reg_model_reduced = DecisionTreeRegressor(random_state=123).fit(X_train_reduced, y_train)
    
    #Compute the reduced-dimensional test data using the same k principal components
    X_test_reduced = np.dot(X_test, VT_k.T)
    
    #Evaluate the performance of the model on the test data
    score = reg_model_reduced.score(X_test_reduced, y_test)

    if(score > best_score):
        best_score = score
        best_k = k
    
print("Best K: ", best_k, " with score: ", best_score)

Best K:  65  with score:  0.8616789428863798


A Decision Tree Regression model (considering the hyperparameters computed before) was applied to the dimensionality reduced data considering the first 65 components extracted by SVD from the data.

In [41]:
kf = KFold(n_splits=5, shuffle=True, random_state=7)
TRUTH_nfold=None
PREDS_nfold=None

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    U, S, VT = np.linalg.svd(X_train)
    VT_k = VT[:65, :]
    
    X_train_reduced = np.dot(X_train, VT_k.T)
    X_test_reduced = np.dot(X_test, VT_k.T)
    
    mdl = DecisionTreeRegressor(max_depth=16,min_samples_split=19,random_state=123)
    mdl.fit(X_train_reduced, y_train)
    preds = mdl.predict(X_test_reduced)
    if TRUTH_nfold is None:
        PREDS_nfold=preds
        TRUTH_nfold=y_test
    else:
        PREDS_nfold=np.hstack((PREDS_nfold, preds))
        TRUTH_nfold=np.hstack((TRUTH_nfold, y_test))

corr, pval=pearsonr(TRUTH_nfold, PREDS_nfold)

print("The RVE is: ", explained_variance_score(TRUTH_nfold, PREDS_nfold))
print("The rmse is: ", mean_squared_error(TRUTH_nfold, PREDS_nfold, squared=False))
print("The Correlation Score is is: %6.4f (p-value=%e)\n"%(corr,pval))
print("The Maximum Error is is: ", max_error(TRUTH_nfold, PREDS_nfold))
print("The Mean Absolute Error is: ", mean_absolute_error(TRUTH_nfold, PREDS_nfold))

The RVE is:  0.8642947946373474
The rmse is:  12.618554316049945
The Correlation Score is is: 0.9306 (p-value=0.000000e+00)

The Maximum Error is is:  125.41333333333333
The Mean Absolute Error is:  7.035567528780746


A Decision Tree Regressor model for the all dataset (167 variables) has a RVE of 0.886, while performing a dimensionality reduction and using only the first 65 components a RVE of 0.864 can be achieved with the same model.

This indicates that the model can still have a good performance with much less variable information (about 1/3 of the information).


### Objective 2.2: Classification Model using Decision Tree Classification and Naive Bayes

In [42]:
#Objective 2.2
X = scaled_combined
# Split the input data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_classes, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Regarding the classification model analysis a similar approach was followed.

The goal is to predict the class of a given superconductor regarding 5 classes of their critical temperature (very low, low, medium, high and very high).

The best hyperparameters for a Decision Tree Classification model were investigated.

This was not necessary regarding Gaussian Naive Bayes algorithm.

In [43]:
#Test Params for Decision Tree Class
iPre = iRe = iF1 = iMath = iAcc = [0]
bestPre = bestRe = bestF1 = bestMath = bestAcc = 0

for i in range(3,20):
    for j in range(5,20):
        model_full= DecisionTreeClassifier(max_depth=j,min_samples_split=i, class_weight='balanced',random_state=123).fit(X_train, y_train)
        preds=model_full.predict(X_test)


        if(precision_score(y_test, preds, average='weighted')>bestPre):
            bestPre = precision_score(y_test, preds, average='weighted')
            iPre[0] = ([j,i])
        if(recall_score(y_test, preds, average='weighted')>bestRe):
            bestRe =  recall_score(y_test, preds, average='weighted')
            iRe[0] = ([j,i])
        if(f1_score(y_test, preds, average='weighted')>bestF1):
            bestF1 = f1_score(y_test, preds, average='weighted')
            iF1[0] = ([j,i])
        if(matthews_corrcoef(y_test, preds)>bestMath):
            bestMath = matthews_corrcoef(y_test, preds)
            iMath[0] = ([j,i])
        if(accuracy_score(y_test, preds)>bestAcc):
            bestAcc = matthews_corrcoef(y_test, preds)
            iMath[0] = ([j,i])
            
            
print("Best Precision : " , bestPre, "for (max_depth, min_samples_split): ", iPre)
print("Best Recall : " , bestRe, "for (max_depth, min_samples_split): ", iRe)
print("Best F1 Score : " , bestF1, "for (max_depth, min_samples_split): ", iF1)
print("Best Matthews Correlation : " , bestMath, "for (max_depth, min_samples_split): ", iMath) 
print("Best Accuracy : " , bestAcc, "for (max_depth, min_samples_split): ", iAcc) 

Best Precision :  0.8407690961786045 for (max_depth, min_samples_split):  [[19, 19]]
Best Recall :  0.8365859393369386 for (max_depth, min_samples_split):  [[19, 19]]
Best F1 Score :  0.837668835118235 for (max_depth, min_samples_split):  [[19, 19]]
Best Matthews Correlation :  0.760957900520992 for (max_depth, min_samples_split):  [[19, 19]]
Best Accuracy :  0.7275083769126107 for (max_depth, min_samples_split):  [[19, 19]]


The best results were obtained considering max_depth=19 and min_samples=19.

A Decision Tree model considering these hyperparameters was applied to the entired dataset.

In [44]:
kf = KFold(n_splits=5, shuffle=True, random_state=7)
TRUTH_nfold=None
PREDS_nfold=None

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y_classes[train_index], y_classes[test_index]
    
    mdl = DecisionTreeClassifier(max_depth=19,min_samples_split=19,class_weight='balanced',random_state=123)
    mdl.fit(X_train, y_train)
    preds = mdl.predict(X_test)
    if TRUTH_nfold is None:
        PREDS_nfold=preds
        TRUTH_nfold=y_test
    else:
        PREDS_nfold=np.hstack((PREDS_nfold, preds))
        TRUTH_nfold=np.hstack((TRUTH_nfold, y_test))
    
print("Precision : " , precision_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("Recall : " , recall_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("F1 Score : " , f1_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("Matthews Correlation : " , matthews_corrcoef(TRUTH_nfold, PREDS_nfold))
print("Accuracy : " , accuracy_score(TRUTH_nfold, PREDS_nfold))  


Precision :  0.8217176516808238
Recall :  0.8106570098292809
F1 Score :  0.8140121031432227
Matthews Correlation :  0.7262292946379665
Accuracy :  0.8106570098292809


Decision Tree Classification model (considering the hyperparameters computed before) was applied to the dimensionality reduced data (K=65).

In [46]:
kf = KFold(n_splits=5, shuffle=True, random_state=7)
TRUTH_nfold=None
PREDS_nfold=None

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y_classes[train_index], y_classes[test_index]
    
    U, S, VT = np.linalg.svd(X_train)
    VT_k = VT[:65, :]
    
    X_train_reduced = np.dot(X_train, VT_k.T)
    X_test_reduced = np.dot(X_test, VT_k.T)
    
    mdl = DecisionTreeClassifier(max_depth=19,min_samples_split=19,class_weight='balanced', random_state=123)
    mdl.fit(X_train_reduced, y_train)
    preds = mdl.predict(X_test_reduced)
    if TRUTH_nfold is None:
        PREDS_nfold=preds
        TRUTH_nfold=y_test
    else:
        PREDS_nfold=np.hstack((PREDS_nfold, preds))
        TRUTH_nfold=np.hstack((TRUTH_nfold, y_test))
        
print("Precision : " , precision_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("Recall : " , recall_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("F1 Score : " , f1_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("Matthews Correlation : " , matthews_corrcoef(TRUTH_nfold, PREDS_nfold))
print("Accuracy : " , accuracy_score(TRUTH_nfold, PREDS_nfold))     

Precision :  0.805306717583287
Recall :  0.7933029205662419
F1 Score :  0.7973855867880227
Matthews Correlation :  0.7008306366919892
Accuracy :  0.7933029205662419


Considering the classification models, the Decision Tree Classification model using the entire data set showed:

a precision of around 0.822, a Matthews Correlation around 0.726 and accuracy of 0.811;

while using the reduced data of 65 components the results are quite similar:

the precision is 0.805, a Matthews Correlation around 0.700 and accuracy of 0.793,

indicating that the model still have a good performance with much less information.



A Gaussian Naive Bayes model was also applied to the entired dataset.

In [47]:
kf = KFold(n_splits=5, shuffle=True, random_state=7)
TRUTH_nfold=None
PREDS_nfold=None

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y_classes[train_index], y_classes[test_index]
    
    mdl = GaussianNB()
    mdl.fit(X_train, y_train)
    preds = mdl.predict(X_test)
    if TRUTH_nfold is None:
        PREDS_nfold=preds
        TRUTH_nfold=y_test
    else:
        PREDS_nfold=np.hstack((PREDS_nfold, preds))
        TRUTH_nfold=np.hstack((TRUTH_nfold, y_test))
    
print("Precision : " , precision_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("Recall : " , recall_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("F1 Score : " , f1_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("Matthews Correlation : " , matthews_corrcoef(TRUTH_nfold, PREDS_nfold))
print("Accuracy : " , accuracy_score(TRUTH_nfold, PREDS_nfold))
   

Precision :  0.6149718717908375
Recall :  0.3755349668438132
F1 Score :  0.4027709317927835
Matthews Correlation :  0.2560469722805726
Accuracy :  0.3755349668438132


Gaussian Naive Bayes model was also applied to the dimensionality reduced data (K=65).

In [48]:
kf = KFold(n_splits=5, shuffle=True, random_state=7)
TRUTH_nfold=None
PREDS_nfold=None

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y_classes[train_index], y_classes[test_index]
    
    U, S, VT = np.linalg.svd(X_train)
    VT_k = VT[:65, :]
    
    X_train_reduced = np.dot(X_train, VT_k.T)
    X_test_reduced = np.dot(X_test, VT_k.T)
    
    mdl = GaussianNB()
    mdl.fit(X_train_reduced, y_train)
    preds = mdl.predict(X_test_reduced)
    if TRUTH_nfold is None:
        PREDS_nfold=preds
        TRUTH_nfold=y_test
    else:
        PREDS_nfold=np.hstack((PREDS_nfold, preds))
        TRUTH_nfold=np.hstack((TRUTH_nfold, y_test))
        
print("Precision : " , precision_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("Recall : " , recall_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("F1 Score : " , f1_score(TRUTH_nfold, PREDS_nfold, average='weighted'))
print("Matthews Correlation : " , matthews_corrcoef(TRUTH_nfold, PREDS_nfold))
print("Accuracy : " , accuracy_score(TRUTH_nfold, PREDS_nfold))

Precision :  0.4557740817150485
Recall :  0.47655551897662607
F1 Score :  0.4395545545226367
Matthews Correlation :  0.19980652248165384
Accuracy :  0.47655551897662607



The Gaussian Naive Bayes model, however, shows a poorer performance compared to Decision Trees, 

even when using the entire dataset information, having:

a precision of 0.615, a Matthews Correlation of 0.256 and accuracy of 0.376.

When using dimensionality reduced data the scores get lower with:

a precision around 0.456, a Matthews Correlation of  0.200, but a higher accuracy of 0.477.


Naive Bayes makes the assumption that the variables are independent, which might be the cause for the decreasing in its precision in this case, compared to decision trees.



# Discussion

In the present assignment we were asked to build a Regression and a Classification Supervised Learning models to predict the critical temperature of some superconductores based on some of their features.

Those features were contained in two files: (i) train.csv and (2) unique_m.csv.

We decided to combined the two datasets in an attempt to extract the most important features to further build the SL models.


A SVD analysis indicated that the first 65 extracted components were sufficient to have a good Decision Trees, both for Regression and Classification models, having scores quite similar to the model containing the entire data. 

Gaussian Bayes used for classification seemed to perform worst compared to Decision Trees. That migh be due to variable dependency.

We have also used the train.csv data only to test the regression and classification models. 
The results were actualy quite similar and using the first 55 components extracted by SVD, a Decidion Tree Regression model with a RVE of 0.8555 can be built. 

When using the Decision Tree Classificator, the models showed a RVE of 0.822 for the entire dataset, and of 0.817 when using the first 55 components extracted by SVD from the 81 variables dataset.

Howewer, the Gaussian Naive Bayes model has shown to perform better when using the first 55 components extrated by SVD from the train dataset, with a precision of 0.634 and a increased accuracy of 0.626, compared to the the combined dataset results. This might be due to the dependency of some varibles in study. Thus eliminating some of those noisy varibles might be favourable for the model.

These results illustrates the importance of feature extraction using different SL models. 


