# Model Selection 

In this notebook, I evaluate the baseline performance of several models on one-hot encoded data. One-hot encoding is used because it exposes how each individual category relates to the outcome.

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor 
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
from xgboost import XGBRegressor

from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error

from sklearn.preprocessing import StandardScaler

## Loading in Data 

In [None]:
# Load the data and one-hot encode the categorical columns

# Load dataset 
ds = pd.read_csv('./CW1_train.csv')

# Categorical columns to encode
categorical_cols = ['cut', 'color', 'clarity']

# One-hot encode categorical variables
ds = pd.get_dummies(ds, columns=categorical_cols, drop_first=True)
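To make the effect of `drop_first=True` concrete, here is a minimal NumPy-only sketch of one-hot encoding with the first level dropped. The helper `one_hot_drop_first` and the example labels are hypothetical and only mimic what `pd.get_dummies` does for a single column; they are not part of the notebook's pipeline.

```python
import numpy as np

def one_hot_drop_first(values):
    """One-hot encode 1-D labels and drop the first (alphabetically sorted) level."""
    levels, codes = np.unique(np.asarray(values), return_inverse=True)
    full = np.eye(len(levels), dtype=int)[codes]  # one column per level
    return levels[1:], full[:, 1:]                # drop the reference level

# Made-up labels for illustration only
cut = ["Good", "Ideal", "Fair", "Ideal"]
levels, encoded = one_hot_drop_first(cut)
print(levels)
print(encoded)
```

The dropped level ("Fair" here) becomes the reference category: a row of all zeros means that level, which avoids perfectly collinear columns in linear models.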

In [None]:
# Create splits 
train, test = train_test_split(ds, test_size=0.2, random_state=123)
X_train = train.drop(columns=['outcome'])
y_train = train['outcome']
X_test = test.drop(columns=['outcome'])
y_test = test['outcome']

## Define Evaluation Metric

In [None]:
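# R2 score 
def r2_fn(y_test, y_pred):
    # Flatten both inputs so an (n, 1) target and an (n,) prediction
    # are subtracted element-wise instead of broadcasting to (n, n)
    y_test = np.asarray(y_test).ravel()
    y_pred = np.asarray(y_pred).ravel()
    eps = y_test - y_pred                         # residuals
    rss = np.sum(eps ** 2)                        # residual sum of squares
    tss = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
    r2 = 1 - (rss / tss)
    return r2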

## Linear Regression 

Linear regression is the simplest model tried here. We compare its performance on the raw data against data that has been normalised (using z-scores) and transformed. Normalisation was considered because it puts the features on a common scale; scikit-learn's `LinearRegression` solves ordinary least squares directly rather than by gradient descent, so vanishing gradients are not a concern for this model. Transformations such as log and square root reduce skew, which in turn helps the data better fulfil the normality assumption of linear regression models.

A log transformation was applied to carat and price to reduce skew. For y, a square root transformation was used for the same purpose instead, because y contains zero values, for which the logarithm is undefined.
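The choice between the two transformations can be illustrated with a small sketch on made-up values that include a zero. The `log1p` line is shown only as a common alternative; it is an assumption for comparison and is not used in this notebook.

```python
import numpy as np

# Made-up values including a zero, for illustration only
x = np.array([0.0, 1.0, 4.0, 100.0])

with np.errstate(divide="ignore"):   # silence the divide-by-zero warning
    log_x = np.log(x)                # log(0) evaluates to -inf
sqrt_x = np.sqrt(x)                  # sqrt(0) is 0, so zeros are safe
log1p_x = np.log1p(x)                # log(1 + x), also finite at zero

print(log_x)
print(sqrt_x)
print(log1p_x)
```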

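Judging by the R2 scores below, normalisation had no visible impact: the two runs agree to 12 decimal places. This is expected for ordinary least squares, whose R2 is unchanged when the features and the target are rescaled. Note also that the scaler below is fit on `X_train`, so the log and square root columns are not carried into the scaled arrays. The low score itself was anticipated, as the scatterplots suggested that most, if not nearly all, features are not linearly related to the outcome. Hence, due to its simple structure, which assumes normality and linearity, this model performed poorly.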
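The near-identical R2 scores are what least squares theory predicts. The sketch below checks this on synthetic data, using a hypothetical helper `ols_r2` built on `np.linalg.lstsq`. It is an in-sample illustration under those assumptions, not a re-run of the notebook's experiment.

```python
import numpy as np

def ols_r2(X, y):
    """Fit OLS with an intercept via least squares and return the in-sample R2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

# Synthetic features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
y = X @ np.array([2.0, -0.1, 30.0]) + rng.normal(size=200)

# z-score both the features and the target
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_std = (y - y.mean()) / y.std()

r2_raw = ols_r2(X, y)
r2_std = ols_r2(X_std, y_std)
print(r2_raw, r2_std)  # equal up to floating-point error
```

Standardisation changes the coefficients and the RMSE, which is why the RMSE values in the two runs differ while R2 does not.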

### Without normalisation 

In [None]:
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Evaluate 
y_pred = model.predict(X_test)

# Scorers 
print(r2_fn(y_test, y_pred))
print(root_mean_squared_error(y_pred=y_pred, y_true=y_test))

0.2891358294062252
10.656523176745546


### With normalisation

In [None]:
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
y_train_scaled = y_train.copy()
y_test_scaled = y_test.copy()

# Log and square root transformations 
X_train_scaled['carat'] = np.log(X_train['carat'])
X_train_scaled['price'] = np.log(X_train['price'])
X_train_scaled['y'] = np.sqrt(X_train['y'])

X_test_scaled['carat'] = np.log(X_test['carat'])
X_test_scaled['price'] = np.log(X_test['price'])
X_test_scaled['y'] = np.sqrt(X_test['y'])

In [None]:
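# Standard score normalisation 
# Note: the scaler is fit on X_train / X_test, so the log and square root
# columns computed in the previous cell are overwritten here.
x_scaler = StandardScaler()
X_train_scaled = x_scaler.fit_transform(X_train)
X_test_scaled = x_scaler.transform(X_test)

# Separate scaler for the target, so the feature scaler stays fitted on X
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.to_frame())
y_test_scaled = y_scaler.transform(y_test.to_frame())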
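Because the cell above fits the scaler on `X_train`, the log and square root columns from the previous cell do not reach the scaled arrays. A minimal sketch of the intended order (transform first, then standardise using training statistics only) is given below on synthetic data. The helpers `fit_standardiser` and `apply_standardiser` are hypothetical NumPy stand-ins for `StandardScaler.fit` and `StandardScaler.transform`.

```python
import numpy as np

def fit_standardiser(a):
    """Learn column means and (population) standard deviations from training data."""
    return a.mean(axis=0), a.std(axis=0)

def apply_standardiser(a, mean, std):
    """Apply previously learned statistics to any split."""
    return (a - mean) / std

# Synthetic, skewed, strictly positive data for illustration only
rng = np.random.default_rng(0)
train = rng.lognormal(size=(100, 2))
test = rng.lognormal(size=(25, 2))

# 1) Transform both splits first
train_t, test_t = np.log(train), np.log(test)

# 2) Fit the statistics on the transformed training split only
mean, std = fit_standardiser(train_t)

# 3) Apply the same statistics to both splits
train_s = apply_standardiser(train_t, mean, std)
test_s = apply_standardiser(test_t, mean, std)

print(train_s.mean(axis=0), train_s.std(axis=0))
```

In the notebook's terms, this would correspond to calling `fit_transform` on the transformed copy rather than on `X_train`; whether that changes the scores has not been tested here.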

In [None]:
# Fit model with transformed normalised data
model = LinearRegression()

# Train the model
model.fit(X_train_scaled, y_train_scaled)

# Evaluate
y_pred_scaled = model.predict(X_test_scaled)

# Scorers 
print(r2_fn(y_test_scaled, y_pred_scaled))
print(root_mean_squared_error(y_pred=y_pred_scaled, y_true=y_test_scaled))

0.2891358294061971
0.8363255507891988


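## Kernel Ridge

Although kernel ridge regression can be more flexible than linear regression, `KernelRidge()` uses a linear kernel by default, so in this configuration it behaves much like ridge regression. It did not perform well on either the normalised data or the raw data, with R2 scores close to those of linear regression. The poor performance is consistent with the features not being linearly related to the outcome.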

### Non-normalised data

In [None]:
kr_model = KernelRidge()  

# Train the model
kr_model.fit(X_train, y_train)

# Evaluate 
y_pred = kr_model.predict(X_test)

# Scorers 
print(r2_fn(y_test, y_pred))
print(root_mean_squared_error(y_pred=y_pred, y_true=y_test))

0.2569791821786881
10.894886548521145


### Normalised and transformed data

In [None]:
kr_model = KernelRidge()

# Train the model
kr_model.fit(X_train_scaled, y_train_scaled)

# Evaluate 
y_pred_scaled = kr_model.predict(X_test_scaled)

# Scorers 
print(r2_fn(y_test_scaled, y_pred_scaled))
print(root_mean_squared_error(y_pred=y_pred_scaled, y_true=y_test_scaled))

0.289145238838981
0.8363200157127226


## Random Forests

Because a Random Forest (RF) predicts by recursively partitioning the data on feature thresholds, it is insensitive to the scale of the features and does not require normalisation. Its R2 is roughly 45% higher than that of linear regression (0.42 versus 0.29), which indicates that the features are related to the outcome nonlinearly.

Furthermore, RFs are more robust to noise and outliers, which suits our dataset well. In the EDA, we observed noisy variables such as a1 in the histogram plots and counted the outliers for each variable in the boxplots. RFs can also calculate each feature's importance, which may help prioritise important features even after feature selection.

However, it is important to note that RFs can overfit, so careful validation processes have to be used.
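One common validation process is k-fold cross-validation. The sketch below is a NumPy-only illustration on synthetic data: `k_fold_scores`, `fit_ols`, and `predict_ols` are hypothetical helpers, and ordinary least squares stands in for the random forest purely to keep the example self-contained. With scikit-learn, `cross_val_score` from `sklearn.model_selection` serves the same purpose.

```python
import numpy as np

def k_fold_scores(X, y, fit, predict, k=5, seed=0):
    """Return one validation R2 per fold for a model given as fit/predict callables."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[trn], y[trn])          # train on k-1 folds
        pred = predict(model, X[val])        # score on the held-out fold
        rss = np.sum((y[val] - pred) ** 2)
        tss = np.sum((y[val] - y[val].mean()) ** 2)
        scores.append(1 - rss / tss)
    return np.array(scores)

# Stand-in model (assumption): ordinary least squares with an intercept
def fit_ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Synthetic data for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=150)

scores = k_fold_scores(X, y, fit_ols, predict_ols)
print(scores.mean(), scores.std())
```

A large gap between training and validation scores, or a high spread across folds, would be a sign of overfitting.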

In [None]:
rf_model = RandomForestRegressor()

# Train the model
rf_model.fit(X_train, y_train)

# Evaluate
y_pred = rf_model.predict(X_test)

# Scorers 
print(r2_fn(y_test, y_pred))
print(root_mean_squared_error(y_pred=y_pred, y_true=y_test))

0.4197266709518436
9.628054191787001


## SVR

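Support Vector Regressor (SVR) did not perform well on the raw data, with an R2 of roughly zero. Note that `SVR()` uses an RBF kernel by default, so it is not a linear model; its poor showing on the raw data is more plausibly due to its sensitivity to feature scale. On the normalised data the RMSE (0.820) is slightly lower than that of linear regression (0.836), but the recorded R2 of -2757.87 is not a meaningful score: it arises when a target of shape (n, 1) is subtracted from the 1-D output of `SVR.predict`, which broadcasts to an n × n matrix.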

### Non-normalised Data

In [None]:
svr_model = SVR() 

# Train the model
svr_model.fit(X_train, y_train)

# Evaluate 
y_pred = svr_model.predict(X_test)

# Scorers 
print(r2_fn(y_test, y_pred))
print(root_mean_squared_error(y_pred=y_pred, y_true=y_test))

-0.0016752197724354545
12.649859043169714


### Normalised and Transformed Data

In [None]:
svr_model = SVR() 

# Train the model
svr_model.fit(X_train_scaled, y_train_scaled)

# Evaluate
y_pred_scaled = svr_model.predict(X_test_scaled)

# Scorers 
print(r2_fn(y_test_scaled, y_pred_scaled))
print(root_mean_squared_error(y_pred=y_pred_scaled, y_true=y_test_scaled))

  y = column_or_1d(y, warn=True)


-2757.873224353075
0.8202511178469603
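An R2 of -2757.87 sits oddly beside an RMSE of 0.82. A likely explanation is a shape mismatch: `y_test_scaled` is a column of shape (n, 1), while `SVR.predict` returns a 1-D array. The sketch below reproduces the effect on made-up numbers; the helper `r2` is a hypothetical variant that flattens its inputs first.

```python
import numpy as np

# Made-up values for illustration only
y_true_2d = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like a scaled target
y_pred_1d = np.array([1.1, 1.9, 3.2])         # shape (3,), like SVR predictions

diff_bad = y_true_2d - y_pred_1d              # broadcasts to every pair
diff_ok = y_true_2d.ravel() - y_pred_1d       # element-wise residuals
print(diff_bad.shape, diff_ok.shape)

def r2(y_true, y_pred):
    """R2 computed after flattening both inputs to 1-D."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - rss / tss

print(r2(y_true_2d, y_pred_1d))
```

Summing squared "residuals" over all n × n pairs inflates the residual sum of squares, which is what drives R2 far below zero.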


## XGBoost

To try another type of non-linear regression model, we turn to gradient boosting, where XGBoost presented itself as a strong candidate. Because XGBoost is essentially an ensemble of decision trees, normalisation was again not required. XGBoost is also robust to outliers and noise.

As such, we can observe a relatively high R2 score with this model too!

In [None]:
xgboost_model = XGBRegressor()

# Train the model
xgboost_model.fit(X_train, y_train)

# Evaluate 
y_pred = xgboost_model.predict(X_test)

# Scorers 
print(r2_fn(y_test, y_pred))
print(root_mean_squared_error(y_pred=y_pred, y_true=y_test))

0.39304559866614586
9.846916467093138


# Conclusion 

In conclusion, random forests and gradient boosting with XGBoost both performed better than the linear regression, kernel ridge and SVR baselines.

Random forests scored higher than XGBoost by 0.02668 in R2. Weighing the advantages of both models, I nevertheless decided to continue with feature selection and hyperparameter tuning on XGBoost.
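For reference, the comparison can be reproduced from the R2 values recorded in the raw-data runs above. This is a plain-Python sketch; the dictionary keys are labels chosen here for readability.

```python
# Test-set R2 values copied from the raw-data runs recorded in this notebook
r2_scores = {
    "LinearRegression": 0.2891358294062252,
    "KernelRidge": 0.2569791821786881,
    "RandomForest": 0.4197266709518436,
    "SVR": -0.0016752197724354545,
    "XGBoost": 0.39304559866614586,
}

# Rank the models from best to worst R2
ranked = sorted(r2_scores.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name:<18}{score:.4f}")

# Gap between the two tree-based models
gap = r2_scores["RandomForest"] - r2_scores["XGBoost"]
print(f"RF - XGBoost gap: {gap:.5f}")
```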

Both models are robust to noise and outliers, which gave them a high baseline performance and let them adapt to the non-linear nature of the data. According to [XGBoosting articles](https://xgboosting.com/xgboost-vs-random-forest/), XGBoost tends to train faster and has built-in regularisation techniques to help prevent overfitting. Random forests are less sensitive to their hyperparameters than XGBoost; that higher sensitivity means tuning could push XGBoost's performance further. Even though random forests excel at parallelisation, this is not a factor that needs to be prioritised in this project.