# Learning outcomes

When you have worked through the exercise in this notebook, you will have:

* built and evaluated SVM and logistic regression models using a standard software library;

* explored selection of the best hyperparameters for your models.

# Objectives


* To explore an existing dataset
> This week, we'll use a subset of the UK Met dataset. You can read more about the UK Met dataset here: https://rmets.onlinelibrary.wiley.com/doi/10.1002/gdj3.78. We will use the 60km-resolution data for 2010 to 2022.

* To apply the support vector machine (SVM) and logistic regression algorithms from the Week 3 mini-videos to predict the number of days of ground frost from other weather variables.

# Section 1 - Explore the UK Met (60km, 2010-2022) dataset

See the dataset on the Week 3 page for the module on Canvas (see 'Week 3 Lab Dataset' on the page). The file is named *curated_data_1month_2010-2022_nonans.csv*.
* What does each variable in the dataset represent?
* What is the distribution of the number of days of ground frost in the dataset? What about the number of days of snow?
* What does this tell you about the data?
* What else can you tell about the data?
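
Once the data has been loaded (Section 2), the distribution questions above can be approached with a short summary like the sketch below. It is only an illustration: `summarise_counts` is a hypothetical helper, and the values passed to it are made up rather than taken from the dataset.

```python
import numpy


def summarise_counts(values):
    """Return basic statistics and a value -> frequency table for a 1-D array."""
    values = numpy.asarray(values, dtype=float)
    unique_vals, freqs = numpy.unique(values, return_counts=True)
    return {
        "min": float(values.min()),
        "max": float(values.max()),
        "mean": float(values.mean()),
        "median": float(numpy.median(values)),
        "frequencies": dict(zip(unique_vals.tolist(), freqs.tolist())),
    }


# made-up example values (NOT from the UK Met dataset)
example_days = [0, 0, 3, 5, 5, 5, 12, 20]
summary = summarise_counts(example_days)
print(summary)
```

In the notebook, the same helper could be called as `summarise_counts(ground_frost_label)`. If the values turn out not to be whole numbers, `numpy.histogram` may give a more readable picture than a per-value frequency table.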


# Section 2 - Load the dataset





1. Download the data first. It is on the Week 3 page for the module on Canvas (see 'Week 3 Lab Dataset' on the page). The file you download will be named *curated_data_1month_2010-2022_nonans.csv*.

2. Then, use the file menu in Google Colab to upload the file to your Colab directory. Once the upload is complete, you should see the file in the listed contents of your Colab directory.

3. You can now run the code in the cell below to load the data.

In [None]:
import csv
import numpy



data_file_full_path = "/content/curated_data_1month_2010-2022_nonans.csv"

data_as_list = []

# load the dataset, skipping the header row
with open(data_file_full_path, newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    next(csv_reader, None)  # skip the header row

    for row in csv_reader:
        data_as_list.append([float(val) for val in row])
data = numpy.array(data_as_list)

# check its shape
print("\n The dataset has shape: "+str(data.shape))


# get features and labels from the data
# based on the objectives (see the Objectives section)
feat_col = [5, 6, 7, 8, 9, 10]
ground_frost_col = 4


feats = data[:, feat_col]
ground_frost_label = data[:, ground_frost_col]



# take a peek
print("\n A peek at the dataset features: \n"+str(feats))
print("\n A peek at the ground frost labels: \n"+str(ground_frost_label))



# Section 3 - Split into training, validation, and test sets

In [None]:
from sklearn.model_selection import train_test_split

all_ids = numpy.arange(0, feats.shape[0])

random_seed = 1

# First randomly split the data into 70:30 to get the training set
train_set_ids, rem_set_ids = train_test_split(all_ids, test_size=0.3, train_size=0.7,
                                 random_state=random_seed, shuffle=True)


# Then further split the remaining data 50:50 into validation and test sets
val_set_ids, test_set_ids = train_test_split(rem_set_ids, test_size=0.5, train_size=0.5,
                                 random_state=random_seed, shuffle=True)


train_data = feats[train_set_ids, :]
train_ground_frost_labels = ground_frost_label[train_set_ids]


val_data = feats[val_set_ids, :]
val_ground_frost_labels = ground_frost_label[val_set_ids]


test_data = feats[test_set_ids, :]
test_ground_frost_labels = ground_frost_label[test_set_ids]
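
As a quick sanity check on a two-stage split like the one above, the sketch below repeats it on a hypothetical set of 1,000 sample indices and confirms that the three subsets are disjoint and together cover every sample. The sample count is an assumption for illustration only.

```python
import numpy
from sklearn.model_selection import train_test_split

n_samples = 1000  # hypothetical size, for illustration only
ids = numpy.arange(n_samples)

# same two-stage scheme: 70:30 first, then 50:50 on the remainder
train_ids, rem_ids = train_test_split(ids, test_size=0.3, random_state=1, shuffle=True)
val_ids, test_ids = train_test_split(rem_ids, test_size=0.5, random_state=1, shuffle=True)

# no index should appear in more than one subset
overlap = (set(train_ids) & set(val_ids)) | (set(train_ids) & set(test_ids)) | (set(val_ids) & set(test_ids))
print("overlapping ids:", len(overlap))

# together the subsets should cover every sample exactly once
covered = numpy.sort(numpy.concatenate([train_ids, val_ids, test_ids]))
print("all samples covered:", numpy.array_equal(covered, ids))
print("subset sizes:", len(train_ids), len(val_ids), len(test_ids))
```

In the notebook, the same checks can be applied to `train_set_ids`, `val_set_ids`, and `test_set_ids`.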


# Section 4 - Scale the features

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fit the scaler on the training set only, so that no information
# from the validation or test sets leaks into the scaling statistics
scaler.fit(train_data)
scaled_train_data = scaler.transform(train_data)

# reuse the training-set statistics for the validation and test sets
scaled_val_data = scaler.transform(val_data)
scaled_test_data = scaler.transform(test_data)
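
After standardization, each training feature should have a mean of approximately 0 and a standard deviation of approximately 1. The sketch below checks this on randomly generated data; the array shape and distribution parameters are assumptions for illustration and do not reflect the dataset.

```python
import numpy
from sklearn.preprocessing import StandardScaler

rng = numpy.random.default_rng(1)
# hypothetical features: 200 samples, 6 columns, arbitrary offset and spread
example_feats = rng.normal(loc=10.0, scale=3.0, size=(200, 6))

example_scaler = StandardScaler()
example_scaled = example_scaler.fit_transform(example_feats)

# per-feature statistics after scaling
example_means = example_scaled.mean(axis=0)
example_stds = example_scaled.std(axis=0)
print("per-feature mean:", example_means.round(6))
print("per-feature std: ", example_stds.round(6))
```

In the notebook, the same two checks can be applied to `scaled_train_data`.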


# Section 5 - Train and evaluate an SVM regression model (with hyperparameter optimization)

* Have a read of the documentation for the software library that implements the SVM for regression: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html.

* Adapt the code below to optimize the regularization parameter C (aka 'box constraint').

In [None]:

from sklearn.svm import LinearSVR
from sklearn.metrics import mean_squared_error
import sys



# train an SVM regression model
model_SVM = LinearSVR(random_state=random_seed, loss='squared_epsilon_insensitive')
model_SVM.fit(scaled_train_data, train_ground_frost_labels)

# evaluate the trained model using the test set
test_pred_SVM = model_SVM.predict(scaled_test_data)
mse_SVM = mean_squared_error(test_ground_frost_labels, test_pred_SVM)
print('\n The test mean squared error (MSE) is: '+str(mse_SVM))
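
One common way to tune a hyperparameter such as C is to train one model per candidate value and keep the value with the lowest validation error, leaving the test set for the final evaluation only. The sketch below illustrates that loop on synthetic data; the candidate grid, the synthetic data, the `max_iter` setting, and the helper name `select_best_C` are all assumptions, not part of the lab.

```python
import numpy
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_squared_error


def select_best_C(train_X, train_y, val_X, val_y, candidates, seed=1):
    """Return (best_C, best_val_mse), chosen by validation MSE over the candidates."""
    best_C, best_mse = None, float("inf")
    for C in candidates:
        model = LinearSVR(C=C, random_state=seed,
                          loss='squared_epsilon_insensitive', max_iter=10000)
        model.fit(train_X, train_y)
        mse = mean_squared_error(val_y, model.predict(val_X))
        # keep the candidate with the lowest validation error seen so far
        if mse < best_mse:
            best_C, best_mse = C, mse
    return best_C, best_mse


# synthetic regression problem (NOT the UK Met data)
rng = numpy.random.default_rng(1)
X = rng.normal(size=(300, 6))
true_weights = numpy.array([2.0, -1.0, 0.5, 0.0, 1.5, -0.5])
y = X @ true_weights + rng.normal(scale=0.1, size=300)

candidate_Cs = [0.001, 0.01, 0.1, 1.0, 10.0]  # hypothetical grid
best_C, best_val_mse = select_best_C(X[:200], y[:200], X[200:], y[200:], candidate_Cs)
print("best C:", best_C, "validation MSE:", best_val_mse)
```

In the notebook, the equivalent call would pass `scaled_train_data`, `train_ground_frost_labels`, `scaled_val_data`, and `val_ground_frost_labels`. The chosen C would then be used to train the model that is evaluated once on the test set.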


# Section 6 - Train and evaluate a logistic regression (LR) classification model

* Use the information from Section 1 to split the ground frost label values into 4 classes.
* Apply this to create classification labels for the labels in Section 2.
* Use the classification labels to train and evaluate a logistic regression model using the scikit-learn library (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
* The code in Section 7 expects the test-set class labels in `test_ground_frost_labels_class` and the model's test-set predictions in `test_pred_LR`.
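
A minimal sketch of the steps above, assuming quartile-based thresholds: `numpy.quantile` gives three cut points from the training labels, `numpy.digitize` maps each label to a class from 0 to 3, and a `LogisticRegression` model is then trained on the class labels. The quartile choice, the synthetic data, and all variable names prefixed `example_` are assumptions; thresholds informed by your Section 1 exploration may well be more appropriate.

```python
import numpy
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# synthetic features and continuous labels (NOT the UK Met data)
rng = numpy.random.default_rng(1)
example_X = rng.normal(size=(400, 6))
example_y = 10.0 + 4.0 * example_X[:, 0] + rng.normal(scale=1.0, size=400)

example_train_X, example_test_X = example_X[:300], example_X[300:]
example_train_y, example_test_y = example_y[:300], example_y[300:]

# three thresholds, estimated from the training labels only, give four classes
example_thresholds = numpy.quantile(example_train_y, [0.25, 0.5, 0.75])

# digitize returns 0 for values below the first threshold, up to 3 above the last
example_train_class = numpy.digitize(example_train_y, example_thresholds)
example_test_class = numpy.digitize(example_test_y, example_thresholds)

# train on the class labels and evaluate on the held-out samples
example_model_LR = LogisticRegression(max_iter=1000, random_state=1)
example_model_LR.fit(example_train_X, example_train_class)
example_test_pred = example_model_LR.predict(example_test_X)

example_accuracy = accuracy_score(example_test_class, example_test_pred)
print("test accuracy:", example_accuracy)
```

Quartiles give roughly balanced classes, which is convenient for a first experiment, but class boundaries with a physical meaning can be easier to interpret.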


# Section 7 - Evaluate using other classification metrics

In [None]:
from sklearn.metrics import f1_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# The F1 score is similar to accuracy in that it ranges between 0 and 1
# We will look at this metric in Week 6
avg_f1_score_LR = f1_score(test_ground_frost_labels_class, test_pred_LR, average='macro')
f1_scores_LR = f1_score(test_ground_frost_labels_class, test_pred_LR, average=None)
print('\n The F1 scores for each of the classes are: '+str(f1_scores_LR))
print('\n The average F1 score is: '+str(avg_f1_score_LR))
print()

# The confusion matrix shows how samples of each class are misclassified
# We will look at this metric in Week 6
confusion_matrix_LR = confusion_matrix(test_ground_frost_labels_class, test_pred_LR)
disp = ConfusionMatrixDisplay(confusion_matrix_LR)
disp.plot()
plt.show()
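
To see how the two metrics relate, per-class F1 scores can be recovered directly from a confusion matrix. In scikit-learn's layout, rows are true classes and columns are predicted classes, so for class k precision is the diagonal entry divided by its column sum, recall is the diagonal entry divided by its row sum, and F1 is their harmonic mean. The sketch below checks this against `f1_score` on made-up labels; the label values are purely illustrative.

```python
import numpy
from sklearn.metrics import confusion_matrix, f1_score

# made-up true and predicted class labels for four classes
example_true = numpy.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])
example_pred = numpy.array([0, 1, 1, 1, 2, 2, 2, 3, 3, 0])

cm = confusion_matrix(example_true, example_pred)

# diagonal entries are the correctly classified samples of each class
true_pos = numpy.diag(cm).astype(float)
precision = true_pos / cm.sum(axis=0)  # column sums: samples predicted as each class
recall = true_pos / cm.sum(axis=1)     # row sums: samples truly in each class

# harmonic mean; this simple form assumes precision + recall > 0 for every class
manual_f1 = 2 * precision * recall / (precision + recall)

print("manual per-class F1: ", manual_f1)
print("sklearn per-class F1:", f1_score(example_true, example_pred, average=None))
print("macro F1:", manual_f1.mean())
```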
