# SoC Module Recommender System

In this project we design a recommendation engine. (Don't worry about the effectiveness of the system; it may be very bad. The idea is just to offer a proof of concept!) The engine suggests to a student a module that closely matches the modules the student has already taken. The dataset comprises two files:
- List of modules in the School of Computing 
- List of graduated students and the modules they had taken during their studies

# Loading the data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
import nltk
from nltk.corpus import stopwords
import string
nltk.download('popular')

# set seed to reproduce the result
rng = np.random.default_rng(seed=42)

courses = pd.read_csv("courses.tsv", sep='\t')
students = pd.read_csv("students.tsv", sep='\t')
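# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas.
courses['specialisation'] = courses['specialisation'].fillna('others')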
courses['specialisation'] = courses['specialisation'].replace('Netoworking', 'Networking')

# Part 1: Creating the preprocessing pipeline

We want to create a sklearn pipeline to efficiently preprocess the data and prepare it for training a model. We use three features from the `courses` data: `specialisation`, `info` and `workload`. We want to represent every feature in numeric form and merge them into one feature vector per course, as follows:
- `specialisation` takes one of six levels. For instance, CS2103 is a Software Engineering (SE) specialisation module. Encode this categorical feature as a vector. How to handle missing values is left to you! *(Hint: you can use `MultiLabelBinarizer`.)*
- `info` provides a short description of the module. We want to convert it into a vector using `CountVectorizer`. *Don't forget to remove the stopwords* while doing so.
- `workload` states the intended distribution of workload over lectures, tutorials, labs and self-study. We want the total workload as the sum of the individual workloads. For instance, a 3-1-1-3-2 workload becomes 10 hours.

Provide implementations for the three classes that help us build the pipeline. `transformed_courses` (spelled `transformeed_courses` in the cells below) should be a numpy array of shape `[n_courses X n_features]`.
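As a minimal sketch of two of the encodings described above, using made-up rows rather than the real `courses.tsv`, the workload string can be summed and the specialisation turned into a multi-hot vector by hand. The helper names `total_workload` and `multi_hot` and the `AI` label are hypothetical; they only illustrate the kind of output the transformers are expected to produce.

```python
def total_workload(workload):
    """Sum a dash-separated workload string such as '3-1-1-3-2'."""
    return sum(float(part) for part in workload.split('-'))


def multi_hot(labels, vocabulary):
    """Return a 0/1 vector with a 1 for every label present in `labels`."""
    return [1 if name in labels else 0 for name in vocabulary]


# Toy specialisation values; a module may list several, separated by commas.
rows = ['SE', 'AI, SE', 'others']
parsed = [row.replace(' ', '').split(',') for row in rows]
vocabulary = sorted({name for labels in parsed for name in labels})

print(total_workload('3-1-1-3-2'))   # 10.0
print(vocabulary)                    # ['AI', 'SE', 'others']
print([multi_hot(labels, vocabulary) for labels in parsed])
```

Each row becomes a fixed-length vector, so the three encodings can later be concatenated column-wise into one feature vector per course.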


In [3]:
class WorkloadTransformer:
    """Turn a workload string such as '3-1-1-3-2' into its total hours."""

    def fit(self, X, y=None, **fit_params):
        # Stateless: nothing to learn from the data.
        return self

    def transform(self, X, y=None, **fit_params):
        def sumWorkload(row):
            return sum(float(x) for x in row.split('-'))

        # Column vector of shape (n_courses, 1) so FeatureUnion can stack it.
        return X['workload'].apply(sumWorkload).values.reshape(-1, 1)

In [4]:
class InfoTransformer:
    """Bag-of-words encoding of the `info` column with stopwords removed."""

    def preprocessText(self, text):
        # Strip punctuation, tokenize, and drop English stopwords.
        text = text.translate(str.maketrans('', '', string.punctuation))
        stop_words = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(text)
        return [t for t in tokens if t.lower() not in stop_words]

    def _clean(self, X):
        # Re-join the kept tokens so CountVectorizer receives plain strings.
        return [' '.join(self.preprocessText(text)) for text in X['info']]

    def fit(self, X, y=None, **fit_params):
        self.vectorizer = CountVectorizer().fit(self._clean(X))
        return self

    def transform(self, X, y=None, **fit_params):
        return self.vectorizer.transform(self._clean(X)).toarray()

In [5]:
class SpecTransformer: 

    def fit(self, X, y = None, **fit_params):
        spec = X['specialisation'].apply(lambda x: x.replace(' ', '').split(','))

        self.mlb = MultiLabelBinarizer().fit(spec)
        #print('SpecTransformer.fit() finish')
        return self
    
    def transform(self, X, y = None, **fit_params):
        #print('SpecTransformer.transform() finish')
        return self.mlb.transform(X['specialisation'].apply(lambda x: x.replace(' ', '').split(',')))

In [6]:
featureTransformer = FeatureUnion([
    ('workload_processing', Pipeline([('wrkld', WorkloadTransformer())])),
    ('info_processing', Pipeline([('info', InfoTransformer())])),
    ('spec_processing', Pipeline([('spec', SpecTransformer())])),
])

featureTransformer.fit(courses)
transformeed_courses = featureTransformer.transform(courses)
print(transformeed_courses)

[[10.  0.  0. ...  0.  0.  0.]
 [10.  0.  0. ...  0.  0.  0.]
 [10.  0.  0. ...  0.  0.  0.]
 ...
 [10.  0.  0. ...  0.  0.  1.]
 [10.  0.  0. ...  0.  0.  1.]
 [10.  0.  0. ...  0.  0.  1.]]


Now we prepare our testing data in the same way we preprocessed the courses.

The `students` data comprises 1000 students, each with the list of modules they have taken.

Create `Xtest` and `Ytest` as two matrices. `Xtest`, of size `1000*5`, holds the first five modules of every student in the list. `Ytest`, of size `1000*[remaining_modules]`, holds the rest of each student's modules.
We do so in order to assess the performance of the recommender: given a list of five modules as input, how well does it predict the student's remaining modules?

For instance: 
- `Xtest[0] = ['CS2105', 'CS4222', 'CS6270', 'CS6205', 'CS4226']`
- `Ytest[0] = ['CS3282', 'CS6204', 'CS5223', 'CS3281', 'CS4344', 'CS5422', 'CS3237', 'CS5233']`.
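
A minimal sketch of the split, assuming each student's modules are stored as one comma-separated string. The helper name `split_modules` is hypothetical, and the toy string below just joins part of the example lists above.

```python
def split_modules(course_string, n_input=5):
    """Split a comma-separated module string into (first n_input, rest)."""
    modules = course_string.split(',')
    return modules[:n_input], modules[n_input:]


taken = 'CS2105,CS4222,CS6270,CS6205,CS4226,CS3282,CS6204,CS5223'
x_row, y_row = split_modules(taken)
print(x_row)  # ['CS2105', 'CS4222', 'CS6270', 'CS6205', 'CS4226']
print(y_row)  # ['CS3282', 'CS6204', 'CS5223']
```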

In [7]:
# Write your code here
Xtest = list(map(lambda x:x.split(',')[:5], students['courses']))
Ytest = list(map(lambda x:x.split(',')[5:], students['courses']))

print("Xtest:\n",Xtest)
print("Ytest:\n",Ytest)

Xtest:
 [['CS5422', 'CS5223', 'CS4237', 'CS3281', 'CS6213'], ['CS6206', 'CS3241', 'CS5237', 'CS4350', 'CS3242'], ['CS5244', 'CS6270', 'CS6234', 'CS5223', 'CS3230'], ['CS5422', 'CS3220', 'CS3103', 'CS4224', 'CS4237'], ['CS6206', 'CS3241', 'CS5240', 'CS4350', 'CS5343'], ['CS5424', 'CS4225', 'CS5425', 'CS5229', 'CS4224'], ['CS2309', 'CS2104', 'CS4215', 'CS4216', 'CS3220'], ['CS3221', 'CS4216', 'CS5339', 'CS2107', 'CS5331'], ['CS3243', 'CS5345', 'CS3244', 'CS5228', 'CS3234'], ['CS5424', 'CS3281', 'CS5223', 'CS4237', 'CS3236'], ['CS5422', 'CS4344', 'CS3237', 'CS6205', 'CS3281'], ['CS6206', 'CS3241', 'CS2040', 'CS5237', 'CS1280'], ['CS6206', 'CS4215', 'CS4216', 'CS6202', 'CS3211'], ['CS4215', 'CS5424', 'CS4224', 'CS2220', 'CS5228'], ['CS5346', 'CS3103', 'CS5223', 'CS3281', 'CS5233'], ['CS4225', 'CS3241', 'CS5425', 'CS4224', 'CS2220'], ['CS1010FC/X', 'CS2040', 'CS1010E', 'CS1231', 'CS5228'], ['CS5223', 'CS5233', 'CS1010', 'CS4223', 'CS2309'], ['CS4347', 'CS4241', 'CS3271', 'CS1010S', 'CS5271'

For every student in `Xtest`, we need to transform the list of 5 modules into the feature space using the `featureTransformer` fitted on the training data. For every module we get a feature vector of size `n_features`. We add these feature vectors to get an aggregate feature vector for every student.

Write a function `getFeatureVector` that takes in the list of modules and `featureTransformer`. It returns the feature vector for the specified list of courses. For instance, `getFeatureVector(Xtest[0], featureTransformer)` will return a vector of size `n_features`.
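
The aggregation step can be sketched with plain lists, assuming the per-module vectors are already available in a dictionary. The name `aggregate`, the toy 3-dimensional features and the code `CS9999` are hypothetical; skipping unknown codes is one possible way to handle modules that are missing from the course list.

```python
def aggregate(modules, module_vectors):
    """Sum per-module feature vectors, ignoring codes with no known vector."""
    known = [module_vectors[m] for m in modules if m in module_vectors]
    if not known:
        return []
    return [sum(column) for column in zip(*known)]


# Hypothetical 3-dimensional features: [workload, word count, SE flag]
toy_vectors = {
    'CS2103': [10.0, 2.0, 1.0],
    'CS3281': [10.0, 0.0, 1.0],
    'CS2105': [10.0, 1.0, 0.0],
}
print(aggregate(['CS2103', 'CS3281', 'CS9999'], toy_vectors))  # [20.0, 2.0, 2.0]
```

Note that the workload entry grows with the number of modules summed, while a single course keeps its own workload.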

In [8]:
def getFeatureVector(modules, featureTransformer):
    moduleFeatures = []
    for module in modules:
      # Skip modules that are missing from the courses list (e.g. CS4245).
      if courses['code'].isin([module]).any():
        moduleFeature = featureTransformer.transform(courses[courses['code'] == module])
        moduleFeature = moduleFeature.astype(np.float64)
        moduleFeatures.append(moduleFeature)
    return np.sum(moduleFeatures,axis=0)

print(getFeatureVector(Xtest[0], featureTransformer))


[[50.  0.  0. ...  0.  0.  3.]]


# Part 2: Content based recommender

We can use a model as simple as K-nearest neighbours (KNN) to perform content-based recommendation. If we provide a list of 5 modules to the recommender, it returns a list of modules that are similar to the specified modules.

`sklearn` provides `NearestNeighbors` as well as `KNeighborsClassifier`, which offer similar functionality. `NearestNeighbors` provides an easy way to retrieve the list of K nearest neighbours, so we prefer it over `KNeighborsClassifier`. If we want to find the K nearest points to a datapoint `d` that is itself in the training data, we need to set `n_neighbors` to K + 1 because the list includes `d` itself.
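
The K + 1 rule can be illustrated with a brute-force search written by hand. This is only a sketch of the idea, not how `NearestNeighbors` is implemented; the function `k_nearest` and the toy points are hypothetical.

```python
import math


def k_nearest(points, query, n_neighbors):
    """Return indices of the n_neighbors points closest to `query` (Euclidean)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return order[:n_neighbors]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
K = 2
# Querying with a training point returns that point first (distance 0),
# so K + 1 neighbours are requested and the first one is dropped.
neighbours = k_nearest(points, points[0], K + 1)
print(neighbours)      # [0, 1, 2]
print(neighbours[1:])  # [1, 2]
```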

You can now train the model using the training data, which comprises `transformed_courses`, with the course codes as the labels.

In [9]:
## Write your code here
K = 5
model = NearestNeighbors(algorithm = "brute", n_neighbors = K + 1)
X_train = transformeed_courses
y_train = courses['code']

model.fit(X_train)

It is time to see our model in action. Let's see what modules our model recommends based on the modules taken by a student.

Write a function that takes in a *pre-trained* model of your choice as input and the list of modules. It returns the top-K recommendations of the model. Print the top 6 recommendations for the first student. 

In [10]:
def recommend(model, modulesTaken, k = 5):
    kRecommend = k + len(modulesTaken)
    X =  getFeatureVector(modulesTaken, featureTransformer)
    distances, indices = model.kneighbors(X, n_neighbors=kRecommend)
    
    recommendCourses = [courses.iloc[x]['code'] for x in indices[0]]
    # Drop modules already taken and keep the k nearest of the rest.
    res = [course for course in recommendCourses if course not in modulesTaken]
    return res[:k]
print(recommend(model, Xtest[0], 6))

['CS3203', 'CS3205', 'CS2020', 'CS3216', 'CS3217', 'CS4222']


# Part 3: Recommender evaluation

Is this model any good? To find out, we use **precision** and **recall** as our metrics. `Ytest` holds the true labels for every student. Using those labels as the ground truth, compute the precision and recall for every student. Write code that prints the average precision and recall for a specific value of `K` over the `students` dataset. Print the average precision and average recall for `K = 10`.
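
For a single student, both metrics reduce to set overlaps. The sketch below uses made-up predictions, and the helper `precision_recall` is hypothetical. Returning 0.0 for an empty list is an assumption of this sketch; skipping such students instead is another reasonable choice.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a predicted module list against the true list."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)  # recommended and actually taken
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall


predicted = ['CS3282', 'CS3203', 'CS5223', 'CS2020']
actual = ['CS3282', 'CS6204', 'CS5223', 'CS3281', 'CS4344']
print(precision_recall(predicted, actual))  # (0.5, 0.4)
```

Two of the four recommendations are correct (precision 2/4) and two of the five true modules are recovered (recall 2/5).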


In [11]:
# Write your code 
from tqdm import tqdm
y_pred = []
for iXtest in tqdm(Xtest):
  y_pred.append(recommend(model, iXtest, 10))

100%|██████████| 1000/1000 [00:38<00:00, 25.97it/s]


In [12]:
def calMetrics(y_pred, Ytest):
  precision = []
  recall = []
  # Calculate the precision and recall for each student
  for iStudent in range(len(y_pred)):
      tp = 0
      fp = 0
      fn = 0
      for jCourse in y_pred[iStudent]:#y_pred[iStudent]:['CS3203', 'CS3205', 'CS5223', 'CS2020', 'CS3216', 'CS3217']
        if jCourse in Ytest[iStudent]:
          tp += 1
        else:
          fp += 1
      for jCourse in Ytest[iStudent]:
        if jCourse not in y_pred[iStudent]:
          fn += 1

      if tp + fp > 0:
          precision.append(tp / (tp + fp))
      if tp + fn > 0:
          recall.append(tp / (tp + fn))
  return precision, recall

precision, recall = calMetrics(y_pred, Ytest)

# Calculate the average precision and recall for K=10
avg_precision = np.mean(precision)
avg_recall = np.mean(recall)
print("Average precision for K=10:", avg_precision)
print("Average recall for K=10:", avg_recall)

Average precision for K=10: 0.12780000000000002
Average recall for K=10: 0.132065830914012


We observe that both precision and recall are low. One reason might be the high feature dimension, which may also be noisy. Extend the existing `featureTransformer` with a PCA step to reduce the dimension.

Print the value of average precision and recall for `K= 10` after the introduction of PCA.
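
As a rough illustration of what the PCA step does, the sketch below centres a toy matrix and projects it onto its top principal directions using an SVD. It is not scikit-learn's implementation (which may use a different solver and sign convention); `pca_project` and the random data are hypothetical.

```python
import numpy as np


def pca_project(X, n_components):
    """Project rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))   # toy data: 20 samples, 6 features
Z = pca_project(X, n_components=2)
print(Z.shape)  # (20, 2)
```

The first output column carries at least as much variance as the second, which is why truncating to the leading components discards the least informative directions first.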


In [14]:
# Write your code here
K = 5
pcaFeatureTransformer = Pipeline([
    ('feature_transformer', featureTransformer),
    ('pca', PCA(n_components=100))
])
#print(pcaFeatureTransformer)
pcaFeatureTransformer.fit(courses)
pca_X_train = pcaFeatureTransformer.transform(courses)

pcaModel = NearestNeighbors(algorithm = "brute", n_neighbors = K + 1)
pcaModel.fit(pca_X_train)


In [24]:
def pcaRecommend(model, modulesTaken, k = 5):
    # Recommend function for the PCA-reduced features; queries are built with
    # the module-level pcaFeatureTransformer.
    kRecommend = k + len(modulesTaken)
    X =  getFeatureVector(modulesTaken, pcaFeatureTransformer)
    distances, indices = model.kneighbors(X, n_neighbors=kRecommend)

    recommendCourses = [courses.iloc[x]['code'] for x in indices[0]]

    # Drop modules already taken and keep the k nearest of the rest.
    res = [course for course in recommendCourses if course not in modulesTaken]
    return res[:k]
#print(pcaRecommend(pcaModel, Xtest[0], 1))

pca_y_pred = []
for iXtest in tqdm(Xtest):
  pca_y_pred.append(pcaRecommend(pcaModel, iXtest, 10))

['CS5424']


100%|██████████| 1000/1000 [00:44<00:00, 22.53it/s]


In [25]:
pca_precision, pca_recall = calMetrics(pca_y_pred, Ytest)
# Calculate the average precision and recall for K=10
pca_avg_precision = np.mean(pca_precision)
pca_avg_recall = np.mean(pca_recall)
print("Average precision for K=10:", pca_avg_precision)
print("Average recall for K=10:", pca_avg_recall)

Average precision for K=10: 0.18740000000000004
Average recall for K=10: 0.19959447902458738


Compared with the results without PCA, both metrics improve: average precision rises from about 0.128 to 0.187 and average recall from about 0.132 to 0.200.

Extend the code to perform a grid search for the value of `K` that gives the best `F1_score`. Try values of `K` from 1 to 10.
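
The F1-score for one student can be computed directly from the counts of true positives, false positives and false negatives, and it equals the harmonic mean of precision and recall. The sketch below checks this on made-up counts; both helper names are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn); defined as 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


tp, fp, fn = 2, 2, 3
precision = tp / (tp + fp)   # 0.5
recall = tp / (tp + fn)      # 0.4
print(f1_from_counts(tp, fp, fn))      # 4/9, about 0.444
print(f1_from_pr(precision, recall))   # same value up to rounding
```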


In [17]:
def calF1score(y_pred, Ytest):
  F1score = []
  for iStudent in range(len(y_pred)):
      tp = 0
      fp = 0
      fn = 0
      for jCourse in y_pred[iStudent]: #y_pred[iStudent]:['CS3203', 'CS3205', 'CS5223', 'CS2020', 'CS3216', 'CS3217']
        if jCourse in Ytest[iStudent]:
          tp += 1
        else:
          fp += 1
      for jCourse in Ytest[iStudent]:
        if jCourse not in y_pred[iStudent]:
          fn += 1
      if 2*tp + fp + fn > 0:
        F1score.append(2*tp /(2*tp + fp + fn))
  #print(F1score[:5])
  return F1score

In [18]:
################### Grid Search to tune parameter ###################
F1scores = []
bestF1 = 0
best_k = 0

for k in range(1,11):
  pca_y_pred = []
  ipcaModel = NearestNeighbors(algorithm = "brute", n_neighbors = k + 1)
  ipcaModel.fit(pca_X_train)
  #print(ipcaModel)
  for iXtest in tqdm(Xtest):
    pca_y_pred.append(pcaRecommend(ipcaModel, iXtest, k))
  #print(pca_y_pred[:5])
  F1score = np.mean(calF1score(pca_y_pred, Ytest))
  F1scores.append(F1score)
  #print(f'\nF1-score = {F1score}, k = {k}.')
  if F1score > bestF1:
    bestF1 = F1score
    best_k = k

print(f'\nThe best F1-score = {bestF1}, k = {best_k}.')

100%|██████████| 1000/1000 [00:46<00:00, 21.56it/s]
100%|██████████| 1000/1000 [00:42<00:00, 23.52it/s]
100%|██████████| 1000/1000 [00:50<00:00, 19.81it/s]
100%|██████████| 1000/1000 [00:46<00:00, 21.72it/s]
100%|██████████| 1000/1000 [00:47<00:00, 21.04it/s]
100%|██████████| 1000/1000 [00:45<00:00, 22.02it/s]
100%|██████████| 1000/1000 [00:50<00:00, 19.62it/s]
100%|██████████| 1000/1000 [00:43<00:00, 22.79it/s]
100%|██████████| 1000/1000 [00:44<00:00, 22.71it/s]
100%|██████████| 1000/1000 [00:47<00:00, 21.00it/s]


The best F1-score = 0.1850544375169821, k = 10.





# Additional Work

Below is some extra work I tried to see whether I could improve the F1-score.

My idea is that if we treat the summed workload as a numerical variable, its values are much larger than the other features, so it may dominate the distance computation in the model. It may therefore be better to convert the sum to a categorical variable and then encode that categorical variable numerically.
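
The effect can be seen on a toy example. The query vector is the sum over five modules, so its workload entry is about 50 (as in the `getFeatureVector` output above), while each course has a workload near 10. The vectors below are hypothetical and only illustrate how that one entry can decide which course looks closest.

```python
import math

# Hypothetical vectors laid out as [workload, word_a, word_b, SE flag].
query = [50.0, 2.0, 1.0, 3.0]      # sum over five 10-hour modules
similar = [10.0, 1.0, 1.0, 1.0]    # shares words and specialisation
unrelated = [12.0, 0.0, 0.0, 0.0]  # shares nothing, slightly higher workload

with_workload = (math.dist(query, similar), math.dist(query, unrelated))
without_workload = (math.dist(query[1:], similar[1:]),
                    math.dist(query[1:], unrelated[1:]))

# With the raw workload, the unrelated course is closer to the query.
print(with_workload[0] > with_workload[1])        # True
# Without it, the content-wise similar course is closer.
print(without_workload[0] < without_workload[1])  # True
```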

To test this idea, I made the following attempts:
1. Treat workload as categorical data: after computing the sum of the workload (7 possibilities in total), convert it into numeric variables by **encoding**.
2. Do not use PCA, to check whether **step 1** works on its own.
3. Use PCA, to check whether **step 1** still helps to improve the score.

In [19]:
from sklearn.preprocessing import OneHotEncoder
class WorkloadTransformer2:  
    # convert workload to categorical data      
    def sumWorkload(self, rows):
        xsplit = rows.split('-')
        return sum([float(x) for x in xsplit])

    def fit(self, X, y = None, **fit_params):
        workload = X['workload'].apply(self.sumWorkload).values.reshape(-1,1)
        # Keep the encoder's default sparse output and densify in transform();
        # the `sparse` keyword was renamed to `sparse_output` in scikit-learn 1.2.
        self.oh_encoder = OneHotEncoder(categories='auto', handle_unknown='ignore')
        self.oh_encoder.fit(workload)
        return self
    
    def transform(self, X, y = None, **fit_params):
        workload = X['workload'].apply(self.sumWorkload).values.reshape(-1,1)
        return self.oh_encoder.transform(workload).toarray()

In [20]:
featureTransformer2 = FeatureUnion([
    ('workload_processing', Pipeline([('wrkld', WorkloadTransformer2())])),
    ('info_processing', Pipeline([('info', InfoTransformer())])),
    ('spec_processing', Pipeline([('spec', SpecTransformer())])),
])

featureTransformer2.fit(courses)
transformeed_courses_cat = featureTransformer2.transform(courses)
#print(transformeed_courses_cat.shape)




(184, 2289)


In [21]:
def recommend2(model, modulesTaken, k = 5):
    # Recommend function that treats workload as a categorical variable.
    kRecommend = k + len(modulesTaken)
    X =  getFeatureVector(modulesTaken, featureTransformer2)
    distances, indices = model.kneighbors(X, n_neighbors=kRecommend)
    
    recommendCourses = [courses.iloc[x]['code'] for x in indices[0]]
    # Drop modules already taken and keep the k nearest of the rest.
    res = [course for course in recommendCourses if course not in modulesTaken]
    return res[:k]

def otherTest(isPCA=True, isCategorical=False):
  # function for other tests, including whether use PCA, whether treat workload as categorical variables
  if isPCA:
    if isCategorical:
      pcaFeatureTransformer = Pipeline([
          ('feature_transformer', featureTransformer2),
          ('pca', PCA(n_components=100))
      ])
    else:
      pcaFeatureTransformer = Pipeline([
          ('feature_transformer', featureTransformer),
          ('pca', PCA(n_components=100))
      ])
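    # NOTE: pcaRecommend builds its queries with the module-level
    # pcaFeatureTransformer, not with the local pipeline fitted below.
    func_recommend = pcaRecommend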
  else:
    if isCategorical:
      pcaFeatureTransformer = featureTransformer2
      func_recommend = recommend2
    else:
      pcaFeatureTransformer = featureTransformer
      func_recommend = recommend

  pcaFeatureTransformer.fit(courses)
  X_train = pcaFeatureTransformer.transform(courses)
  #print(X_train.shape)
  #print(func_recommend)
  

  F1scores = []
  bestF1 = 0
  best_k = 0

  for k in range(1,11):
    y_pred = []
    Model = NearestNeighbors(algorithm = "brute", n_neighbors = k + 1)
    Model.fit(X_train)
    for iXtest in tqdm(Xtest):
      y_pred.append(func_recommend(Model, iXtest, k))
    #print(y_pred[:5])
    F1score = np.mean(calF1score(y_pred, Ytest))
    F1scores.append(F1score)
    #print(f'\nF1-score = {F1score}, k = {k}.')
    if F1score > bestF1:
      bestF1 = F1score
      best_k = k

  print(f'\nThe best F1-score = {bestF1}, k = {best_k}.')
  return F1scores

In [27]:
F1scores_noextraprocess = otherTest(False,False)
print("F1-scores, treating workload as numerical variable and not processed by PCA:")
print(F1scores_noextraprocess)

100%|██████████| 1000/1000 [00:20<00:00, 49.78it/s]
100%|██████████| 1000/1000 [00:21<00:00, 45.87it/s]
100%|██████████| 1000/1000 [00:22<00:00, 44.62it/s]
100%|██████████| 1000/1000 [00:23<00:00, 43.01it/s]
100%|██████████| 1000/1000 [00:20<00:00, 48.42it/s]
100%|██████████| 1000/1000 [00:24<00:00, 41.56it/s]
100%|██████████| 1000/1000 [00:22<00:00, 44.68it/s]
100%|██████████| 1000/1000 [00:22<00:00, 44.41it/s]
100%|██████████| 1000/1000 [00:22<00:00, 43.96it/s]
100%|██████████| 1000/1000 [00:22<00:00, 44.20it/s]


The best F1-score = 0.12430028214756803, k = 10.
F1-scores, treating workload as numerical variable and not processed by PCA:
[0.008746958064295525, 0.014122287018107452, 0.02654178673885902, 0.03511959761271916, 0.041751027130016896, 0.061417748963295465, 0.08149421172133381, 0.09865385744192301, 0.1128729301693241, 0.12430028214756803]





In [22]:
F1scores_cat = otherTest(False,True)
print("F1-scores, treating workload as categorical variable and not processed by PCA:")
print(F1scores_cat)

100%|██████████| 1000/1000 [00:26<00:00, 38.02it/s]
100%|██████████| 1000/1000 [00:27<00:00, 37.02it/s]
100%|██████████| 1000/1000 [00:27<00:00, 36.61it/s]
100%|██████████| 1000/1000 [00:27<00:00, 36.97it/s]
100%|██████████| 1000/1000 [00:28<00:00, 34.69it/s]
100%|██████████| 1000/1000 [00:28<00:00, 34.55it/s]
100%|██████████| 1000/1000 [00:27<00:00, 35.88it/s]
100%|██████████| 1000/1000 [00:29<00:00, 33.85it/s]
100%|██████████| 1000/1000 [00:28<00:00, 35.52it/s]
100%|██████████| 1000/1000 [00:28<00:00, 35.19it/s]


The best F1-score = 0.18437241794048245, k = 10.
F1-scores, treating workload as categorical variable and not processed by PCA:
[0.03874646161372168, 0.07124193376544637, 0.10027417742216505, 0.12270749120183297, 0.13714905595563556, 0.151735170966692, 0.16397946449229867, 0.1724123393490319, 0.17945255134502464, 0.18437241794048245]





In [23]:
pca_F1scores_cat = otherTest(True,True)
print("F1-scores, treating workload as categorical variable and processed by PCA:")
print(pca_F1scores_cat)

100%|██████████| 1000/1000 [00:51<00:00, 19.53it/s]
100%|██████████| 1000/1000 [01:02<00:00, 15.93it/s]
100%|██████████| 1000/1000 [00:53<00:00, 18.85it/s]
100%|██████████| 1000/1000 [00:46<00:00, 21.34it/s]
100%|██████████| 1000/1000 [00:46<00:00, 21.36it/s]
100%|██████████| 1000/1000 [00:45<00:00, 21.87it/s]
100%|██████████| 1000/1000 [00:43<00:00, 23.20it/s]
100%|██████████| 1000/1000 [00:47<00:00, 21.16it/s]
100%|██████████| 1000/1000 [00:46<00:00, 21.31it/s]
100%|██████████| 1000/1000 [00:48<00:00, 20.58it/s]


The best F1-score = 0.09109890583507584, k = 10.
F1-scores, treating workload as categorical variable and processed by PCA:
[0.016431046722455392, 0.031363326385105895, 0.04403655992270543, 0.052349182651452136, 0.06307058622088639, 0.07327365132750918, 0.08037445347720486, 0.08463053386772922, 0.08864527614732995, 0.09109890583507584]





From the results above, we can compare the four configurations:

1. No PCA, numerical workload: best F1-score = 0.12430028214756803, k = 10.
2. PCA, numerical workload: best F1-score = 0.1850544375169821, k = 10.
3. No PCA, categorical workload: best F1-score = 0.18437241794048245, k = 10.
4. PCA, categorical workload: best F1-score = 0.09109890583507584, k = 10.

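Comparing the results above:

Comparing 1 and 3, the F1-score rises from about 0.124 to 0.184, which means that without PCA, encoding the workload as a categorical variable does improve performance.

However, the best performance comes from 2, using PCA with the numerical workload, although only by a small margin over 3 (0.1851 vs 0.1844). This shows that PCA is quite helpful as a feature-engineering step.

What surprised me is that the worst performance comes from 4, with both PCA and the categorical workload. One possible explanation is that applying both to the dataset loses or distorts some features, leading to worse predictions. Another is a mismatch in the code: in this configuration `otherTest` uses `pcaRecommend`, which builds its query vectors with the module-level `pcaFeatureTransformer` (fitted on the numerical workload) rather than with the pipeline fitted inside `otherTest`, so queries and training vectors are not in the same feature space. This result should therefore be treated with caution.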
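
A minimal sketch of one way to keep training and query features in the same space: bind the transform inside the recommender, so the same function produces both. In the notebook, `otherTest` pairs its locally fitted PCA pipeline with `pcaRecommend`, which reads the module-level `pcaFeatureTransformer`. The names `make_recommender` and `toy` are hypothetical, and the toy features stand in for the real pipeline.

```python
import math


def make_recommender(transform, codes):
    """Build a recommender whose training and query vectors share `transform`."""
    train = [transform(code) for code in codes]

    def recommend(modules_taken, k):
        vectors = [transform(m) for m in modules_taken if m in codes]
        if not vectors:
            return []
        # Aggregate the taken modules by summing their feature vectors.
        query = [sum(column) for column in zip(*vectors)]
        ranked = sorted(range(len(codes)),
                        key=lambda i: math.dist(query, train[i]))
        return [codes[i] for i in ranked if codes[i] not in modules_taken][:k]

    return recommend


# Hypothetical 2-dimensional features keyed by module code.
toy = {'A': [0.0, 0.0], 'B': [1.0, 0.0], 'C': [0.0, 3.0], 'D': [2.0, 0.0]}
recommend = make_recommender(toy.__getitem__, list(toy))
print(recommend(['B'], k=2))  # ['A', 'D']
```

With this shape, switching the feature pipeline means building a new recommender, so a model can never be queried with vectors from a different pipeline.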


In conclusion, for this case I will not make any extra changes and will follow the instructions: using the summed workload as a numerical feature and processing the features with PCA works relatively well when `n_neighbors` is tuned from 1 to 10.