# Support Vector Machines & Random Forests

Here, we will explore the support vector machine (SVM) and random forest algorithms that we covered in the last class.


## Introduction to the data

-- We will use the Covertype dataset this week. Read more about the dataset here: https://scikit-learn.org/stable/datasets/real_world.html#covtype-dataset and https://archive.ics.uci.edu/ml/datasets/Covertype.


-- Here are some prompts to help you find out more about the dataset:

*   What type of labels does it have (continuous or categorical)? What is the range of values of the labels? If the labels are categorical, have they been provided in numeric form, e.g. as integers?
*   What is the feature dimensionality, i.e. the number of features? What is the range of values of each feature? Is the range the same across features?
*   Can you find out how the data was collected?
*   Can you find out how it was labelled?
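
Most of the prompts above can be checked with a few NumPy calls once the data is in a 2-D array. The following is a minimal sketch, assuming the label sits in the last column; the tiny array `toy_data` and the helper `summarize_dataset` are made up for illustration and are not part of the Covertype files.

```python
import numpy


def summarize_dataset(data, label_col=-1):
    """Return basic facts about a dataset stored as a 2-D array."""
    labels = data[:, label_col]
    feats = numpy.delete(data, label_col, axis=1)
    classes, counts = numpy.unique(labels, return_counts=True)
    return {
        "n_samples": data.shape[0],
        "n_features": feats.shape[1],
        "classes": classes,
        "class_counts": counts,
        # True if every label is a whole number, i.e. categorical labels coded as integers
        "labels_are_integers": bool(numpy.all(labels == numpy.round(labels))),
        "feature_min": feats.min(axis=0),
        "feature_max": feats.max(axis=0),
    }


# Made-up stand-in for the real data: three features plus a label column
toy_data = numpy.array([
    [2596.0, 51.0, 0.0, 5.0],
    [2590.0, 56.0, 1.0, 5.0],
    [2804.0, 139.0, 0.0, 2.0],
])

summary = summarize_dataset(toy_data)
for key, value in summary.items():
    print(key, ":", value)
```

Comparing `feature_min` and `feature_max` across columns shows quickly whether the features share a common range.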



















## Loading the data


-- You need to download the data before you can get started. Download from https://archive.ics.uci.edu/ml/datasets/Covertype.

-- You will have downloaded a *covtype.data.gz* file. Unzip this, e.g. using the free 7-Zip software if you are on Windows.

-- The data file in the unzipped folder is *covtype.data*, which can be opened with any text editor. Use the file menu in Google Colab to upload the file to your Colab directory. Once the upload is complete, you should see the file in the listed contents of your Colab directory.

-- You can now run the code in the cell below to load the data.


In [10]:
import csv
import numpy


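# List the Colab directory to check that the upload worked (only works on Colab/Linux)
!ls  /content

# Path to the unzipped data file; change this to wherever your copy is stored
data_file_full_path = r"C:\Users\Administrator\Desktop\master\machine learnning\covtype.data\covtype.data"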

covtype_data_as_list = []

# load the dataset
with open(data_file_full_path) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')

    for row in csv_reader:
        # convert every value in the row from text to a number
        covtype_data_as_list.append([float(val) for val in row])


covtype_data = numpy.array(covtype_data_as_list)


print("\n The dataset has shape: "+str(covtype_data.shape))


# get the features and the labels
feat_col = numpy.arange(0, 54)
label_col = 54

cov_type_feats = covtype_data[:, feat_col]
cov_type_labels = covtype_data[:, label_col]

print("\n A peek at the dataset features: \n"+str(cov_type_feats))
print("\n A peek at the dataset labels: \n"+str(cov_type_labels))


'ls' is not recognized as an internal or external command,
operable program or batch file.



 The dataset has shape: (581012, 55)

 A peek at the dataset features: 
[[2.596e+03 5.100e+01 3.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
 [2.590e+03 5.600e+01 2.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
 [2.804e+03 1.390e+02 9.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
 ...
 [2.386e+03 1.590e+02 1.700e+01 ... 0.000e+00 0.000e+00 0.000e+00]
 [2.384e+03 1.700e+02 1.500e+01 ... 0.000e+00 0.000e+00 0.000e+00]
 [2.383e+03 1.650e+02 1.300e+01 ... 0.000e+00 0.000e+00 0.000e+00]]

 A peek at the dataset labels: 
[5. 5. 2. ... 3. 3. 3.]
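
As an alternative to unzipping by hand and looping over rows with `csv`, `numpy.loadtxt` can read a comma-separated file directly, and it decompresses files whose names end in `.gz`. The following is a minimal sketch that uses a tiny made-up file (`toy.data.gz`, written to a temporary directory) in place of the real *covtype.data.gz*.

```python
import gzip
import os
import tempfile

import numpy

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small gzipped, comma-separated file to stand in for the real download
    toy_path = os.path.join(tmp_dir, "toy.data.gz")
    with gzip.open(toy_path, "wt") as gz_file:
        gz_file.write("2596,51,3,5\n2590,56,2,5\n2804,139,9,2\n")

    # loadtxt decompresses .gz files itself and returns a 2-D float array
    toy_data = numpy.loadtxt(toy_path, delimiter=",")

# Last column holds the labels, everything before it the features
toy_feats = toy_data[:, :-1]
toy_labels = toy_data[:, -1].astype(int)

print(toy_data.shape)
print(toy_labels)
```

For the real dataset, the same call would take the path to the downloaded file instead of `toy_path`.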


## Exploring data & Splitting into training, validation, and test sets

-- After you load the data, you can explore the relationships between each individual feature and the labels, e.g. using scatter plots or boxplots. The code below uses a histogram plot to check the distribution of the labels in the data.

-- Next, you can split the data into training, validation, and test sets. The code below does that using a random split. What is the distribution of the labels in the training set?

In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


random_seed = 1

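# Plot the frequencies of each class in the labels
# Bin edges run from the smallest label to the largest label + 1,
# so that every class gets its own bin (otherwise the last two classes share one)
label_bins = numpy.arange(cov_type_labels.min(), cov_type_labels.max() + 2)
plt.figure()
_, _, _ = plt.hist(cov_type_labels, bins=label_bins, align='left')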
plt.title('Class frequencies for full dataset')
plt.show()
print('\n')

# Let's only consider data with class labels '1' and '2' and only one tenth of that subset
# You can use a larger portion of the dataset
# But it will take longer to train your model
all_ids = numpy.arange(0, cov_type_feats.shape[0])
mid_sub_ids = all_ids[cov_type_labels<3]
cov_type_labels_mid_sub = cov_type_labels[mid_sub_ids]
cov_type_feats_mid_sub = cov_type_feats[mid_sub_ids, :]
all_ids_sub = numpy.arange(0, cov_type_feats_mid_sub.shape[0])

_, sub_ids = train_test_split(all_ids_sub, test_size=0.1, train_size=0.9, 
                                 random_state=random_seed, shuffle=True, stratify=cov_type_labels_mid_sub)
cov_type_labels_sub = cov_type_labels_mid_sub[sub_ids]
cov_type_feats_sub = cov_type_feats_mid_sub[sub_ids, :]

# First randomly split the data 80/20 into training and test sets
all_ids_sub = numpy.arange(0, cov_type_feats_sub.shape[0])
train_set_ids, test_set_ids = train_test_split(all_ids_sub, test_size=0.2, train_size=0.8, 
                                 random_state=random_seed, shuffle=True)


# Then further split the training set into training and validation sets
train_set_ids, val_set_ids = train_test_split(train_set_ids, test_size=0.1, train_size=0.9, 
                                 random_state=random_seed, shuffle=True)



# Show the distribution of the labels in the final training set
plt.figure()
_, _, _ = plt.hist(cov_type_labels_sub[train_set_ids], bins=numpy.unique(cov_type_labels), align='left')
plt.title('Class frequencies for training set')
plt.show()
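
The two `train_test_split` calls that create the training, validation, and test sets above do not pass `stratify`, so class proportions can drift slightly between the sets. The following is a minimal sketch of a stratified 80/20 split; the labels in `toy_labels` are made up and only stand in for the Covertype labels.

```python
import numpy
from sklearn.model_selection import train_test_split

random_seed = 1

# Made-up imbalanced labels: 800 samples of class 1 and 200 of class 2
toy_labels = numpy.array([1] * 800 + [2] * 200)
toy_ids = numpy.arange(toy_labels.shape[0])

# stratify keeps the class proportions (roughly) equal in both parts
train_ids, test_ids = train_test_split(
    toy_ids, test_size=0.2, random_state=random_seed,
    shuffle=True, stratify=toy_labels)


def class_fractions(labels):
    """Return the fraction of samples that belongs to each class."""
    classes, counts = numpy.unique(labels, return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))


train_fractions = class_fractions(toy_labels[train_ids])
test_fractions = class_fractions(toy_labels[test_ids])
print(train_fractions)
print(test_fractions)
```

Dropping the `stratify` argument and comparing the printed fractions shows how much a purely random split can shift the class balance.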



## Preprocessing & Modelling

-- Now that you have training, validation, and test sets, you can start modelling. Let's start with an SVM model: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html. Have a look at the parameters of the function for building the model. Which of them can you relate to the parameters that we considered in class? Which hyperparameters would you choose to optimize for your modelling?

-- Build the model using the training set, use the validation set to select the optimal values for the hyperparameters that you have decided to optimize, then finally evaluate the model with these hyperparameter settings using the test set. 

-- The code below uses an SVM with an RBF kernel and optimizes C (the regularization parameter, also known as the box constraint). The performance of the trained model is evaluated using F1 scores and confusion matrices, which are classification metrics that we will look at in the lectures on *Model Validation*. The code also computes accuracy for the sake of completeness.

-- Some things to try or think about from the results of code below are:


*   Do you find any discrepancies between the accuracy, F1 scores, and average F1 scores: do they tell the same story about the performance of the model?
*   What do you think that the confusion matrix shows?

*   What do you notice about the range of values of the features in *scaled_cov_type_feats_sub*, compared to those in *cov_type_feats_sub*?
*   What do you think would be the outcome if you used *cov_type_feats_sub* instead of *scaled_cov_type_feats_sub*? You can try using the former to see what happens.

*   Can you see the use of the validation set?
*   What if you optimized for a different range of the hyperparameter values?

*   Do you think that the results reflect imbalance in the class distribution in the training set? You could try setting the *class_weight* parameter of the *SVC* function to 'balanced' to see what changes that leads to in performance.


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay



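# Standardize each feature to zero mean and unit variance
# Fit the scaler on the training set only, so that no information from the
# validation and test sets leaks into training
scaler = StandardScaler()
scaler.fit(cov_type_feats_sub[train_set_ids, :])
scaled_cov_type_feats_sub = scaler.transform(cov_type_feats_sub)
print("\n A peek at the scaled dataset features: \n"+str(scaled_cov_type_feats_sub))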


# Use the validation set to optimize the hyperparameters you wish to
c_options = [0.1, 1.0, 10.0]
best_c = 0.1
best_c_perf = 0
val_labels = cov_type_labels_sub[val_set_ids]
train_labels = cov_type_labels_sub[train_set_ids]

for c in c_options:
  print("\n for c="+str(c)+"...")
  model = SVC(C=c, kernel='rbf', degree=3, gamma='scale', class_weight=None, random_state=random_seed)
  model.fit(scaled_cov_type_feats_sub[train_set_ids, :], train_labels)
  val_pred = model.predict(scaled_cov_type_feats_sub[val_set_ids, :])

  avg_f1_score = f1_score(val_labels, val_pred, average='macro')

  if avg_f1_score > best_c_perf:
    best_c = c
    best_c_perf = avg_f1_score

print('\n The optimal c for this data is: '+str(best_c))

# Use the optimized hyperparameter to train the final model
model = SVC(C=best_c, kernel='rbf', degree=3, gamma='scale', class_weight=None, random_state=random_seed)
model.fit(scaled_cov_type_feats_sub[train_set_ids, :], train_labels)

# Evaluate the trained model using the test set
test_labels = cov_type_labels_sub[test_set_ids]
test_pred = model.predict(scaled_cov_type_feats_sub[test_set_ids, :])

avg_f1_score = f1_score(test_labels, test_pred, average='macro')
f1_scores = f1_score(test_labels, test_pred, average=None)
print('\n The F1 scores for each of the classes are: '+str(f1_scores))
print('\n The average F1 score is: '+str(avg_f1_score))


acc = accuracy_score(test_labels, test_pred)
print('\n The overall accuracy is: '+str(acc))

# Use a separate name so that the imported confusion_matrix function is not overwritten
conf_mat = confusion_matrix(test_labels, test_pred)
disp = ConfusionMatrixDisplay(conf_mat)
disp.plot()
plt.show()
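
To see why accuracy and F1 scores can tell different stories on imbalanced data, it helps to work through a case small enough to check by hand. The following is a minimal sketch with made-up labels and predictions (`toy_true` and `toy_pred` are hypothetical, not results from the SVM above): the "model" gets every majority-class sample right but only 2 of the 10 minority-class samples.

```python
import numpy
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up test labels: 90 samples of class 1 and 10 of class 2
toy_true = numpy.array([1] * 90 + [2] * 10)
# Made-up predictions: all 90 class-1 samples correct,
# 8 class-2 samples wrongly predicted as class 1, 2 predicted correctly
toy_pred = numpy.array([1] * 90 + [1] * 8 + [2] * 2)

toy_acc = accuracy_score(toy_true, toy_pred)
toy_f1_per_class = f1_score(toy_true, toy_pred, average=None)
toy_f1_macro = f1_score(toy_true, toy_pred, average='macro')
# Rows are the true classes, columns are the predicted classes
toy_conf_mat = confusion_matrix(toy_true, toy_pred)

print("accuracy:", toy_acc)
print("F1 per class:", toy_f1_per_class)
print("macro F1:", toy_f1_macro)
print(toy_conf_mat)
```

Here the accuracy is 0.92 because 92 of 100 predictions are correct, while the F1 score for class 2 is only 4/12 (about 0.33), which pulls the macro average down to roughly 0.65. The confusion matrix shows where the errors are: 8 class-2 samples end up in the class-1 column.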




## Comparing the SVM with a Random Forest

-- You can build additional models using the Random Forest classifier. You can find the documentation for the *scikit-learn* implementation at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. A hyperparameter that you could optimize here is the number of trees, i.e. 'n_estimators'. How does the performance of the Random Forest compare with the performance of the SVM?
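
The validation loop used for the SVM carries over to a Random Forest almost unchanged. The following is a minimal sketch on synthetic data from `make_classification`, so the scores it prints say nothing about Covertype; the candidate values in `n_trees_options` and all `toy_*` names are assumptions chosen for illustration. Random forests split on feature thresholds, so the features are left unscaled here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

random_seed = 1

# Synthetic two-class data standing in for the Covertype subset
toy_feats, toy_labels = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.7, 0.3], random_state=random_seed)

# 80/20 train/test split, then 90/10 train/validation split
train_feats, test_feats, train_labels, test_labels = train_test_split(
    toy_feats, toy_labels, test_size=0.2,
    random_state=random_seed, stratify=toy_labels)
train_feats, val_feats, train_labels, val_labels = train_test_split(
    train_feats, train_labels, test_size=0.1,
    random_state=random_seed, stratify=train_labels)

# Use the validation set to choose the number of trees
n_trees_options = [10, 50, 100]
best_n_trees = n_trees_options[0]
best_val_f1 = -1.0

for n_trees in n_trees_options:
    model = RandomForestClassifier(n_estimators=n_trees, random_state=random_seed)
    model.fit(train_feats, train_labels)
    val_f1 = f1_score(val_labels, model.predict(val_feats), average='macro')
    if val_f1 > best_val_f1:
        best_n_trees = n_trees
        best_val_f1 = val_f1

# Train the final model with the selected number of trees and evaluate on the test set
final_model = RandomForestClassifier(n_estimators=best_n_trees, random_state=random_seed)
final_model.fit(train_feats, train_labels)
test_f1 = f1_score(test_labels, final_model.predict(test_feats), average='macro')

print("selected n_estimators:", best_n_trees)
print("macro F1 on the test set:", test_f1)
```

To compare with the SVM, the same loop can be run on the Covertype training, validation, and test sets from the earlier cells, using the same metrics.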

