# Baselines

We are predicting a continuous value, so this is a **regression** problem: we regress accuracy onto our features. Our evaluation criterion is **root mean squared error** (RMSE). Given our estimator $\hat{f}$, $n$ datapoints $x_i$ and true accuracies $y_i$, we calculate

$$
\sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}(x_i) \right)^2}
$$
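To make the metric concrete, here is a minimal sketch of root mean squared error in plain Python: the squared errors are averaged over the number of datapoints before taking the square root. The function name `rmse` and the toy accuracies are assumptions for illustration only; they are not part of the `sml` package.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one datapoint")
    # Square each residual, average over the datapoints, then take the root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy accuracies in [0, 1], invented purely for illustration.
y_true = [0.5, 1.0, 0.0, 0.5]
y_pred = [0.5, 0.5, 0.5, 0.5]
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.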

**NOTE**: we care most about the **interpretability** of these results, i.e. we want to know what influences a student's performance. Prediction accuracy should indirectly help, but it should not be all that you focus on.

Our baselines will include the following:
- Mean value (empirical mean from the data)
- Linear Regression model with fairly basic features (not considering time or play history)
- Support Vector Regressor with the same features
- Feedforward Neural Network with the same features

More will be added in the coming days.

In [1]:
from sml import exp_data as ex, util, baselines as bl

In [2]:
data = ex.load(just_accs=True)

In [3]:
X_train, y_train, X_val, y_val, X_test, y_test = bl.get_X_y(data)

Generating one-hot vectors
Determining splits...
Ready to go.


In [6]:
X_train.shape

(151508, 242)

In [7]:
y_train.shape

(151508,)

In [8]:
X_test.shape

(9440, 242)

In [9]:
y_test.shape

(9440,)

## Features Used

One-hot encodings of
- teacher
- class
- level
- unit_module

To keep the dimensionality (i.e. the size of our feature vectors) down, we exclude the user. Adding user information would very likely improve performance.
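The feature construction can be sketched as follows. This is only an illustration of concatenated one-hot encodings over the four columns listed above; it is not the code behind `bl.get_X_y`, and the helper names (`fit_vocab`, `one_hot`) and the example rows are hypothetical.

```python
COLUMNS = ["teacher", "class", "level", "unit_module"]


def fit_vocab(rows, columns):
    """Assign one feature index to every (column, value) pair seen in rows."""
    vocab = {}
    for row in rows:
        for col in columns:
            key = (col, row[col])
            if key not in vocab:
                vocab[key] = len(vocab)
    return vocab


def one_hot(row, columns, vocab):
    """Encode a row as a 0/1 vector with one active entry per column."""
    vec = [0] * len(vocab)
    for col in columns:
        idx = vocab.get((col, row[col]))
        if idx is not None:  # values unseen when fitting stay all-zero
            vec[idx] = 1
    return vec


# Hypothetical rows, invented for illustration.
rows = [
    {"teacher": "t1", "class": "c1", "level": 1, "unit_module": "u1_m1"},
    {"teacher": "t2", "class": "c2", "level": 1, "unit_module": "u1_m2"},
]
vocab = fit_vocab(rows, COLUMNS)
X = [one_hot(r, COLUMNS, vocab) for r in rows]
print(len(vocab), X)
```

With this scheme the feature width equals the total number of distinct values across the encoded columns, which is why adding a one-hot encoding of the user would enlarge the vectors considerably.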

## 1. Mean Accuracy in Training Set

We start with a simple heuristic, to see how difficult the problem is.

In [11]:
bl.eval_mean_acc(data)

Predicting with mean acc in training set yields 0.2138025611910368 rmse on the test set.


Just taking the mean accuracy in the training set to predict the test set yields an RMSE of ~0.21. That means that on a 100-point accuracy scale, our predictions are typically about 21 points off the real value (RMSE weights large errors more heavily than a plain average would). That is very inaccurate. We should have room to improve, if we can make the most of our data.
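The mean baseline amounts to a constant predictor: compute the average accuracy on the training set and predict that value for every test datapoint. A minimal sketch, assuming `bl.eval_mean_acc` does something along these lines (its implementation is not shown here); `mean_baseline_rmse` and the toy numbers are hypothetical.

```python
import math
from statistics import fmean


def mean_baseline_rmse(y_train, y_test):
    """Predict the training mean for every test point and score it with RMSE."""
    mean_acc = fmean(y_train)
    # The prediction is the same constant for every test datapoint.
    mse = fmean([(y - mean_acc) ** 2 for y in y_test])
    return mean_acc, math.sqrt(mse)


# Toy accuracies, invented purely for illustration.
toy_train = [0.25, 0.5, 0.75]
toy_test = [0.0, 1.0]
mean_acc, score = mean_baseline_rmse(toy_train, toy_test)
print(mean_acc, score)
```

Any model that cannot beat this number is not extracting useful signal from the features.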

## 2. Linear Regression

Our linear regression model is also very simple, and we don't expect it to do much better. It uses one-hot encodings of each of our entities of interest, so it still doesn't consider time.

In [12]:
bl.eval_linear_regression(X_train, y_train, X_test, y_test)

Linear regression gives RMSE of 0.2037514594740634


And indeed that is only a small improvement: the RMSE drops from ~0.214 to ~0.204. However, we have applied no regularization and done no hyperparameter tuning...
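One way to add regularization and tune it is sketched below: closed-form ridge regression, with the penalty strength chosen by RMSE on a validation split. This is an assumption about how one might proceed, not what `bl.eval_linear_regression` does; the function names, the grid of `lam` values and the synthetic data are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def fit_ridge(X, y, lam):
    """Closed-form ridge regression with an unpenalized intercept (lam > 0)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # lam > 0 keeps this system solvable even with collinear one-hot columns.
    A = Xc.T @ Xc + lam * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def select_lambda(X_train, y_train, X_val, y_val, lambdas):
    """Fit one model per penalty and keep the one with the lowest validation RMSE."""
    scores = {}
    for lam in lambdas:
        w, b = fit_ridge(X_train, y_train, lam)
        scores[lam] = rmse(y_val, X_val @ w + b)
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic binary features and targets, invented purely for illustration.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 5)).astype(float)
true_w = np.array([0.1, -0.2, 0.0, 0.3, 0.05])
y_demo = 0.6 + X_demo @ true_w + rng.normal(0, 0.05, size=200)

best_lam, scores = select_lambda(
    X_demo[:150], y_demo[:150], X_demo[150:], y_demo[150:],
    lambdas=[0.01, 0.1, 1.0, 10.0],
)
print(best_lam, scores[best_lam])
```

The penalty is selected on the validation split so that the test set is only touched once, for the final reported RMSE.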

## 3. Support Vector Regressor

We use the same features, but a stronger model than simple linear regression. We also don't tune hyperparameters, leaving that for you to do if you like.

In [4]:
bl.eval_svr(X_train, y_train, X_test, y_test)

SVR gives RMSE of 0.20645149457236966


## 4. Neural Network