# Supervised Learning Models

**Group 19**

*Cátia Antunes* (fc60494) - 5h  
*Donato Aveiro* (fc46269) -  
*Márcia Vital* (fc59488) -   
*Seán Gorman* (fc59492) -  

The goal of this first home assignment was to predict the critical temperature of a superconductor based on 81 extracted features.

In [1]:
# Import all modules required for the assignment
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor


In [2]:
# Data loading 
df1 = pd.read_csv("train.csv")
#df1.info()

We began our data preprocessing by looking at the data type of each column, to check for mislabeled data types and for missing values. Since no missing values were detected, no missing value imputation was necessary. (Outputs not shown in order to save space.)
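The check described above can be sketched as follows. The small DataFrame `toy` is a hypothetical stand-in for `train.csv` with made-up values; only the two column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "number_of_elements": [4, 5, 4],       # integer column
    "critical_temp": [29.0, 26.0, np.nan], # float column with one missing value
})

# Data type of every column
print(toy.dtypes)

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing values; 0 would mean that no imputation is needed
total_missing = int(missing_per_column.sum())
print("Total missing values:", total_missing)
```

On the real data the same two calls would be applied to `df1`, which is also what `df1.info()` summarises in a single table.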

In [3]:
# As there were two data types, int and float, all integers were converted to float in order to have only one data type
df1[['number_of_elements', 'range_atomic_radius', 'range_Valence']] = df1[['number_of_elements', 'range_atomic_radius', 'range_Valence']].astype(float)

# Separation of the critical temperature into classes
def critical_temp_sep(x):
    if x < 1.0:
        return 'VeryLow'
    elif x < 5.0:
        return 'Low'
    elif x < 20.0:
        return 'Medium'
    elif x < 100.0:
        return 'High'
    else:
        return 'VeryHigh'
    
df1['critical_temp classes'] = df1['critical_temp'].apply(critical_temp_sep)

# With the critical_temp classes created, we can remove our classification and regression labels from our X dataset and store them in two separate datasets, y_clf and y_reg, for the classification and regression tasks, respectively. 
y_clf = df1['critical_temp classes']
y_reg = df1['critical_temp']
X = df1.drop(columns=['critical_temp', 'critical_temp classes'])
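As an aside, the same class boundaries can also be expressed with `pd.cut`. The sketch below is an alternative to `critical_temp_sep`, not the method used in this notebook, and the temperatures in `temps` are made-up values chosen to hit every boundary.

```python
import numpy as np
import pandas as pd

# Made-up temperatures, chosen to hit every class boundary
temps = pd.Series([0.5, 1.0, 4.9, 5.0, 19.9, 20.0, 99.9, 100.0, 150.0])

bins = [-np.inf, 1.0, 5.0, 20.0, 100.0, np.inf]
labels = ["VeryLow", "Low", "Medium", "High", "VeryHigh"]

# right=False makes each bin closed on the left and open on the right: [low, high)
classes = pd.cut(temps, bins=bins, labels=labels, right=False)
print(classes.tolist())
```

One difference to keep in mind: `pd.cut` leaves missing values as missing, whereas the function above would send them to the `else` branch.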



With the classification and regression labels removed, we can now split the data into train and test datasets. For the classification data, the split was stratified to account for unbalanced classes, so that the train and test sets keep the same class proportions as the full dataset. This step is not necessary for the regression data. In both cases, the data was split with a 67/33 ratio (67% for training and 33% for testing).

In [4]:
# Data split
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X, y_clf, 
                     test_size=0.33,
                     stratify=y_clf,
                     random_state=1)

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X, y_reg, 
                    test_size=0.33,
                    random_state=1)

print('Train: ', X_train_clf.shape)
print('Test: ', X_test_clf.shape)

Train:  (14246, 81)
Test:  (7017, 81)
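The effect of `stratify` can be checked by comparing class proportions before and after the split. The sketch below uses synthetic, deliberately unbalanced labels (hypothetical data, not the assignment's); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and unbalanced labels (hypothetical data)
rng = np.random.default_rng(1)
features = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
labels = pd.Series(["High"] * 700 + ["Medium"] * 200 + ["Low"] * 100)

# Same split settings as in the cell above, applied to the synthetic data
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.33, stratify=labels, random_state=1
)

# Class proportions in the full data and in each split
proportions = pd.DataFrame({
    "full": labels.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(proportions.round(3))
```

With stratification, the three columns should agree up to rounding. The same comparison could be run on `y_clf`, `y_train_clf` and `y_test_clf`.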


With the datasets divided into training and testing sets, we next scaled them. Scaling must be done after the train/test split so that the test data does not influence the training data: the scaler is fitted on the training set only, and the test set is then transformed with the parameters learned from the training set.
Scaling matters because the features are on different scales, which would otherwise influence the algorithms. Standardization rescales each feature to a mean of zero and a standard deviation of one.

In [5]:
# Scaling of training data for classification
scaler_clf = StandardScaler()
X_train_clf = pd.DataFrame(scaler_clf.fit_transform(X_train_clf), columns=X.columns)

# Transformation of testing data for classification, using the scaler fitted on the training data
X_test_clf = pd.DataFrame(scaler_clf.transform(X_test_clf), columns=X.columns)

# Scaling of training data for linear regression
scaler_reg = StandardScaler()
X_train_reg = pd.DataFrame(scaler_reg.fit_transform(X_train_reg), columns=X.columns)

# Transformation of testing data for linear regression, using the scaler fitted on the training data
X_test_reg = pd.DataFrame(scaler_reg.transform(X_test_reg), columns=X.columns)
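To illustrate why the scaler is fitted on the training set only, the sketch below standardises synthetic data (hypothetical values, not the assignment's). Each feature is transformed as z = (x - mean) / std, where the mean and standard deviation come from the training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data with a mean near 50 and a standard deviation near 10
rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=(500, 4))
train, test = train_test_split(data, test_size=0.33, random_state=1)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training rows only
test_scaled = scaler.transform(test)        # reuses the training statistics

print("train mean:", train_scaled.mean(axis=0).round(3))
print("train std: ", train_scaled.std(axis=0).round(3))
print("test mean: ", test_scaled.mean(axis=0).round(3))
print("test std:  ", test_scaled.std(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1 by construction. The test columns are generally close to, but not exactly, those values, which is expected because no test statistics were used.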

PRINCIPAL COMPONENT ANALYSIS

Given the high number of features present in the data, dimensionality reduction was necessary. This was achieved with Principal Component Analysis (PCA), which replaces the original features with a smaller set of uncorrelated components.

In [6]:
# Fit a PCA with all components to inspect the variance explained by each one
pca = PCA()
pca.fit(X_train_clf)

# A float n_components keeps the smallest number of components that explain at least 90% of the variance
pca = PCA(n_components=0.90)
pca.fit(X_train_clf)

pca = PCA(n_components = 12)
X_train_clf = pd.DataFrame(pca.fit_transform(X_train_clf))
#X_train_clf

X_test_clf = pd.DataFrame(pca.transform(X_test_clf))
#X_test_clf

pca = PCA(n_components = 12)
X_train_reg = pd.DataFrame(pca.fit_transform(X_train_reg))
#X_train_reg

X_test_reg = pd.DataFrame(pca.transform(X_test_reg))
#X_test_reg
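The number of components needed to reach a given share of explained variance can be read from the cumulative `explained_variance_ratio_`. The sketch below uses synthetic correlated data (hypothetical, not the superconductor features), so the resulting number is not the one for this assignment.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 observed features driven by 5 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 5))
mixing = rng.normal(size=(5, 20))
data = latent @ mixing + 0.1 * rng.normal(size=(400, 20))
data = StandardScaler().fit_transform(data)

# Cumulative share of variance explained by the first k components
full_pca = PCA().fit(data)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 90%
n_needed = int(np.argmax(cumulative >= 0.90)) + 1

# PCA with a float n_components performs the same selection internally
pca_90 = PCA(n_components=0.90).fit(data)
print(n_needed, pca_90.n_components_)
```

Applied to the scaled training data, `pca.n_components_` after fitting `PCA(n_components=0.90)` reports how many components the 90% threshold requires.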


LINEAR MODEL OF THE PCA 

By using PCA, we can keep a subset of principal components that capture most of the variation in the data and use them as predictors in the regression model. This can help avoid the overfitting, multicollinearity, and noise issues that may arise from using too many or irrelevant features. Below is the linear regression on the projected data, followed by a linear model on the full dataset for comparison.

In [None]:
''' Linear regression of the PCA projection '''

# Create and fit a linear regression model on the PCA-projected training set
model = linear_model.LinearRegression()
model.fit(X_train_reg, y_train_reg)

# Make predictions using the testing set
y_pred_reg = model.predict(X_test_reg)

# Evaluate the model performance on the training and testing sets
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean squared error (train):", mean_squared_error(y_train_reg, model.predict(X_train_reg)))
print("R2 score (train):", r2_score(y_train_reg, model.predict(X_train_reg)))
print("Mean squared error (test):", mean_squared_error(y_test_reg, y_pred_reg))
print("R2 score (test):", r2_score(y_test_reg, y_pred_reg))

# Plot predicted against observed values (the features are multi-dimensional, so they cannot go on a single x-axis)
plt.scatter(y_test_reg, y_pred_reg, color="black", s=5)
lims = [y_test_reg.min(), y_test_reg.max()]
plt.plot(lims, lims, color="blue", linewidth=3)  # reference line for perfect predictions
plt.xlabel("Observed critical_temp")
plt.ylabel("Predicted critical_temp")

plt.show()

In [None]:
''' Linear regression of the full dataset '''

# Rebuild the same train/test split on all original features (before PCA)
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X, y_reg,
                    test_size=0.33,
                    random_state=1)

# Standardize with the statistics of the training set only
scaler_full = StandardScaler()
X_train_full = scaler_full.fit_transform(X_train_full)
X_test_full = scaler_full.transform(X_test_full)

# Create and fit a linear regression model on the training set
model_full = linear_model.LinearRegression()
model_full.fit(X_train_full, y_train_full)

# Make predictions using the testing set
y_pred_full = model_full.predict(X_test_full)

# Evaluate the model performance on the training and testing sets
print("Coefficients:", model_full.coef_)
print("Intercept:", model_full.intercept_)
print("Mean squared error (train):", mean_squared_error(y_train_full, model_full.predict(X_train_full)))
print("R2 score (train):", r2_score(y_train_full, model_full.predict(X_train_full)))
print("Mean squared error (test):", mean_squared_error(y_test_full, y_pred_full))
print("R2 score (test):", r2_score(y_test_full, y_pred_full))

# Plot predicted against observed values
plt.scatter(y_test_full, y_pred_full, color="black", s=5)
lims = [y_test_full.min(), y_test_full.max()]
plt.plot(lims, lims, color="blue", linewidth=3)  # reference line for perfect predictions
plt.xlabel("Observed critical_temp")
plt.ylabel("Predicted critical_temp")

plt.show()
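The comparison between a regression on principal components and a regression on all features can also be written compactly with scikit-learn pipelines. This is a minimal sketch on synthetic data from `make_regression` (hypothetical, not the superconductor dataset), so its scores say nothing about the results of this assignment.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (hypothetical data)
X_syn, y_syn = make_regression(
    n_samples=600, n_features=30, n_informative=10, noise=10.0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=1)

# Each pipeline fits its scaler (and PCA) on the training data only
pcr = make_pipeline(StandardScaler(), PCA(n_components=12), LinearRegression())
full = make_pipeline(StandardScaler(), LinearRegression())

# Test-set R2 of both models
scores = {}
for name, model in [("PCA + linear", pcr), ("all features", full)]:
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))
    print(name, round(scores[name], 3))
```

A pipeline keeps the scaling and projection steps tied to the model, which removes the risk of transforming the test set with the wrong scaler.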

**DISCUSSION**


bla bla, I wish I knew