In this assignment we design a recomendation engine (*Don't worry about the effectiveness of the system. It maybe very bad. The idea is just to offer you a proof of concept!*). The recommendation engine suggests the students a module that closely matches the modules already taken by the student. The dataset comprices of two files:
- List of modules in the School of Computing 
- List of graduated students and the modules they had taken during their studies


# Loading the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
from sklearn.pipeline import Pipeline, FeatureUnion

'''
    YOU MUST USE THE RANDOM SEED WHEREVER NEEDED OR RANDOM_STATE as 42.
'''
rng = np.random.default_rng(seed=42)

courses = pd.read_csv("courses.tsv", sep='\t')
students = pd.read_csv("students.tsv", sep='\t')

In [2]:
courses

Unnamed: 0,code,name,credits,workload,info,specialisation
0,CS1010,Programming Methodology,4,2-1-1-3-3,This module introduces the fundamental concept...,Core
1,CS1010FC/X,Programming Methodology,4,2-1-1-3-3,This module introduces the fundamental concept...,Core
2,CS1010E,Programming Methodology,4,2-1-1-3-3,This module introduces the fundamental concept...,Core
3,CS1010J,Programming Methodology,4,2-1-1-3-3,This module introduces the fundamental concept...,Core
4,CS1010S,Programming Methodology,4,2-1-1-3-3,This module introduces the fundamental concept...,Core
...,...,...,...,...,...,...
179,CS6281,Topics in Computer Science II,4,3-0-0-3-4,Topics will be of an advanced computer science...,
180,CS6282,Topics in Computer Science III,4,3-0-0-3-4,Topics will be of an advanced computer science...,
181,CS6283,Topics in Computer Science IV,4,3-0-0-3-4,Topics will be of an advanced computer science...,
182,CS6284,Topics in Computer Science V,4,3-0-0-3-4,Topics will be of an advanced computer science...,


# Question 1: Creating the preprocessing pipeline

We want to create a sklearn pipeline to efficiently preprocess the data and prepare it for training a model. We use three different features in the `courses` data: `specialisation`, `info` and `workload`. We want to represent every feature in a numeric form and merge them to form a feature vector for every course. We do so in the following way:
- `specialisation` represents one of the six levels of the module. For instance: CS2103 is a Software Engineering (SE) specialisation module. Encode this categorical feature into a vector. The decision of handling missing values is left to you! *(Hint: You can use `MultiLabelBinerizer` to do so.)*
- `info` provides a short discription of the module. We want to convert it into a vector using CountVectorizer. *Don't forget to remove the stopwords* while doing so.
-  `workload` states the intended distribution of workload over lectures, tutorials, labs and self study. We want to find the workload as the sum of individual workloads. For instnce: 3-1-1-3-2 workload transforms to 10 hours.

Provide implementation for three classes that help us build the pipeline. `transformed_courses` should be a numpy array of shape `[n_courses X n_features]`.

                                                                                             

In [2]:
class WorkloadTransformer:        
    def fit(self, X, y = None, **fit_params):
        return self
    
    def transform(self, X, y = None, **fit_params):
        transformed = X['workload'].str.split('-').apply(lambda x: sum(float(val) for val in x))
        return transformed.values.reshape(-1, 1)

In [3]:
class InfoTransformer:   
    def __init__(self):
        self.vectorizer = CountVectorizer(stop_words='english')
        
    def fit(self, X, y = None, **fit_params):
        self.vectorizer.fit(X['info'])
        return self
    
    def transform(self, X, y = None, **fit_params):
        transformed = self.vectorizer.transform(X['info'])
        return transformed.toarray()

In [4]:
class SpecTransformer: 
    def __init__(self):
        self.mlb = MultiLabelBinarizer()
        
    def fit(self, X, y = None, **fit_params):
        self.mlb.fit(X['specialisation'].fillna('Missing').values.reshape(-1, 1))
        return self
    
    def transform(self, X, y = None, **fit_params):
        transformed = self.mlb.transform(X['specialisation'].fillna('Missing').values.reshape(-1, 1))
        return transformed

In [5]:
featureTransformer = FeatureUnion([
    ('workload_processing', Pipeline([('wrkld', WorkloadTransformer())])),
    ('info_processing', Pipeline([('info', InfoTransformer())])),
    ('spec_processing', Pipeline([('spec', SpecTransformer())])),
])

featureTransformer.fit(courses)
transformed_courses = featureTransformer.transform(courses)

In [6]:
transformed_courses.shape

(184, 2139)

Now we prepare our testing data in the same way we preprocessed the course. `students` data comprises of 1000 students and a list of modules they have taken. 

Create `Xtest` and `Ytest` as two matrices. `Xtest`, of size `1000*5`, comprises of first five modules for every student in the list. `Ytest`, of size `1000*[remaining_modules]`, comprises of rest of the modules for every student in the list. 
We do so in order to assess the performance of the recommender. We assess the recommender based on its effectiveness to predict the modules given a list of five modules as the input.

For instance: 
- `Xtest[0] = ['CS2105', 'CS4222', 'CS6270', 'CS6205', 'CS4226']`
- `Ytest[0] = ['CS3282', 'CS6204', 'CS5223', 'CS3281', 'CS4344', 'CS5422', 'CS3237', 'CS5233']`.

<div align="right"></align>

In [7]:
students['courses'] = students['courses'].str.split(',')
Xtest = students['courses'].apply(lambda x: x[:5]).tolist()
Ytest = students['courses'].apply(lambda x: x[5:]).tolist()

For every student in `Xtest`, we need to transform the list of 5 modules to the feature space using the `featureTransformer` fit on the training data. For every module we will get a feature vector of size `n_features`. We *add* these feature vectors to get an aggregate feature vector for very student.

Write a function `getFeatureVector` that takes in the list of modules and `featureTransformer`. It returns the feature vector for the specified list of courses. For instance, `getFeatureVector(Xtest[0], featureTransformer)` will return a vector of size `n_features`.

<div align="right"></div>

In [8]:
def getFeatureVector(modules, featureTransformer):
    filtered_courses = courses[courses['code'].isin(modules)]
    transformed_modules = featureTransformer.transform(filtered_courses)
    aggregate_feature_vector = np.sum(transformed_modules, axis=0)
    return aggregate_feature_vector

In [10]:
getFeatureVector(Xtest[0], featureTransformer).shape

(2139,)

# Question 2: Content based recommender

We can use a model as simple as K-nearest neighbour (KNN) to perform a content based recommendation. If we provide a list of 5 modules to the recommender, it provide us a list of modules that are similar to the specified modules.

`sklearn` provides `NearestNeighbors` as well as `KNeighborsClassifier`, both of which have a similar functionality. `NearestNeighbors` provides as an easy functionality to predict a list of K nearest neighbours. Therefore, we prefer it over `KNeighborsClassifier`. If we want to find K nearest points to a datapoint`d`, we need to use `n_neighbors` as K + 1 because the list includes `d` itself.

You can now train the model using the training data, which comprises of `transformed_courses` and with their codes as the labels. 
<div align="right"></div>

In [11]:
K = 5
model = NearestNeighbors(algorithm = "brute", n_neighbors = K + 1)
model.fit(transformed_courses)

It is time to see our model in action. Let's see what modules our model reommends based on the modules taken by a student.

Write a function that takes in a *pre-trained* model of your choice as input and the list of modules. It returns the top-K recommendations of the model. Print the top 6 recommendations for the first student. 
<div align="right"></div>

In [12]:
def recommend(model, modulesTaken, k=5):
    course_codes = courses['code'].tolist()
    feature_vector = getFeatureVector(modulesTaken, featureTransformer)
    distances, indices = model.kneighbors([feature_vector],n_neighbors=k+1)
    recommended_indices = indices[0]
    recommended_courses = [course_codes[i] for i in recommended_indices]
    recommended_courses = [course for course in recommended_courses]
    return recommended_courses[:k]
print(recommend(model, Xtest[0], 6))

['CS3203', 'CS3205', 'CS5223', 'CS2020', 'CS3216', 'CS3217']


# Question 3: Recommender evaluation

Is this the model any good? To assess the performance of the model, we use **precision** and **recall** as our metrics. `Ytest` consists of true labels for every students. Using those labels as the ground truth, compute the precision and recall for every student. Write a code that prints values of average precision and recall for a specific value of `K` over the `students` dataset. Print the value of average precision and average recall for `K= 10`.

                                                                                             

In [13]:
def compute_precision_recall(model, Xtest, Ytest, k):
    precisions = []
    recalls = []
    
    for modules_taken, true_labels in zip(Xtest, Ytest):
        recommended = recommend(model, modules_taken, k)
        relevant_recommendations = len(set(recommended).intersection(set(true_labels)))
        precision = relevant_recommendations / len(recommended) if recommended else 0
        precisions.append(precision)
        recall = relevant_recommendations / len(true_labels) if true_labels else 0
        recalls.append(recall)
    
    avg_precision = np.mean(precisions)
    avg_recall = np.mean(recalls)
    return avg_precision, avg_recall

avg_precision, avg_recall = compute_precision_recall(model, Xtest, Ytest, 10)
print("Average Precision:", avg_precision)
print("Average Recall:", avg_recall)

Average Precision: 0.057
Average Recall: 0.056304128942093336


We observe that both precision and recall is not really great. The reason might be igh feature dimension, which may even be noisy. Append the exisiting `featureTransformer` with a PCA to reduce the dimension. 

Print the value of average precision and recall for `K= 10` after the introduction of PCA.

                                                                                              

In [15]:
pca_transformer = Pipeline([
    ('features', featureTransformer),
    ('pca', PCA(n_components=100))  
])

pca_transformer.fit(courses)

transformed_courses_pca = pca_transformer.transform(courses)

course_codes = courses['code'].tolist()

model_pca = NearestNeighbors(algorithm="brute", n_neighbors=K+1)
model_pca.fit(transformed_courses_pca)

def getFeatureVector(modules, featureTransformer):
    filtered_courses = courses[courses['code'].isin(modules)]
    if filtered_courses.empty:
        return np.zeros(featureTransformer.named_steps['pca'].n_components_)
    transformed_modules = featureTransformer.transform(filtered_courses)
    aggregate_feature_vector = np.sum(transformed_modules, axis=0)
    return aggregate_feature_vector

def recommend(model, modulesTaken, k=5):
    feature_vector = getFeatureVector(modulesTaken, pca_transformer)
    distances, indices = model.kneighbors([feature_vector])
    recommended_indices = indices[0]
    recommended_courses = [course_codes[i] for i in recommended_indices]
    recommended_courses = [course for course in recommended_courses if course not in modulesTaken]
    return recommended_courses[:k]

avg_precision_pca, avg_recall_pca = compute_precision_recall(model_pca, Xtest, Ytest, 10)
print("Average Precision:", avg_precision_pca)
print("Average Recall:", avg_recall_pca)

Average Precision: 0.22255
Average Recall: 0.046578402084671434


**From the above results, it can be seen that the model's precision reaches more than 21% with the addition of PCA, which is an improvement of about 16%, and the recall is also reduced by about 1%.**

Can you provide some **concrete** (something that you can implement) suggestions to improve the performance of the system? The improvement does not have to be very significant.

                                                                                              

Here are some concrete steps we can take to potentially improve the performance of the recommendation system:

**1. Enhance Text Preprocessing for 'info' Feature:**
* Instead of just removing stopwords, we can also include stemming or lemmatization to process the text in the 'info' column. This might help in getting a better representation of the text.
* Instead of using simple term frequencies from 'CountVectorizer' for the course descriptions, we can switch to the TF-IDF representation, which will give more weight to terms that are unique to a particular document and less weight to terms that are common across documents.
* Adjust parameters of the 'TfidfVectorizer' like 'max_df', 'min_df', and 'ngram_range' to fine-tune the feature extraction from text.

**2. Handle Missing Values in specialisation Feature:**
* Instead of just filling missing values with 'Missing', we can fill them by 'SimpleImputer' with the mode (most frequent value) of the 'specialisation' column.

**3. Feature Scaling:**
* After obtaining the feature matrix from the 'featureTransformer', we can apply feature scaling (e.g., MinMaxScaler) to ensure that all features have the same scale. This is particularly important for KNN which is distance-based.



In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

class EnhancedInfoTransformer:   
    def __init__(self):
        self.vectorizer = TfidfVectorizer(stop_words='english', max_df=0.85, min_df=2, ngram_range=(1, 2))
        
    def fit(self, X, y=None, **fit_params):
        self.vectorizer.fit(X['info'])
        return self
    
    def transform(self, X, y=None, **fit_params):
        transformed = self.vectorizer.transform(X['info'])
        return transformed.toarray()

class EnhancedSpecTransformer: 
    def __init__(self):
        self.mlb = MultiLabelBinarizer()
        self.imputer = SimpleImputer(strategy='most_frequent')
        
    def fit(self, X, y=None, **fit_params):
        filled_data = self.imputer.fit_transform(X['specialisation'].values.reshape(-1, 1))
        self.mlb.fit(filled_data)
        return self
    
    def transform(self, X, y=None, **fit_params):
        filled_data = self.imputer.transform(X['specialisation'].values.reshape(-1, 1))
        transformed = self.mlb.transform(filled_data)
        return transformed

enhanced_featureTransformer1 = FeatureUnion([
    ('workload_processing', Pipeline([('wrkld', WorkloadTransformer())])),
    ('info_processing', Pipeline([('info', EnhancedInfoTransformer())])),
    ('spec_processing', Pipeline([('spec', EnhancedSpecTransformer())])),
])

enhanced_featureTransformer = Pipeline([
    ('features', enhanced_featureTransformer1),
    ('pca', PCA(n_components=100))  
])

enhanced_featureTransformer.fit(courses)
enhanced_transformed_courses = enhanced_featureTransformer.transform(courses)

# Feature Scaling
scaler = MinMaxScaler()
scaled_courses = scaler.fit_transform(enhanced_transformed_courses)

model2 = NearestNeighbors(algorithm="brute", n_neighbors=K+1)
model2.fit(scaled_courses)

def enhanced_getFeatureVector(modules, featureTransformer, scaler):
    filtered_courses = courses[courses['code'].isin(modules)]
    if filtered_courses.empty:
        return np.zeros(featureTransformer.named_steps['pca'].n_components_)
    transformed_modules = featureTransformer.transform(filtered_courses)
    aggregate_feature_vector = np.sum(transformed_modules, axis=0)
    
    # Scale the aggregate feature vector
    scaled_vector = scaler.transform([aggregate_feature_vector])
    
    return scaled_vector[0]

def enhanced_recommend(model, modulesTaken, k=5):
    course_codes = courses['code'].tolist()
    feature_vector = enhanced_getFeatureVector(modulesTaken, enhanced_featureTransformer, scaler)
    
    distances, indices = model2.kneighbors([feature_vector])
    recommended_indices = indices[0]
    
    # Filter the courses the student has taken and get the top k
    recommended_courses = [course_codes[i] for i in recommended_indices if course_codes[i] not in modulesTaken][:k]
    
    return recommended_courses
def compute_precision_recall(model, Xtest, Ytest, k):
    precisions = []
    recalls = []
    
    for modules_taken, true_labels in zip(Xtest, Ytest):
        recommended = enhanced_recommend(model2, modules_taken, k)
        relevant_recommendations = len(set(recommended).intersection(set(true_labels)))
        precision = relevant_recommendations / len(recommended) if recommended else 0
        precisions.append(precision)
        recall = relevant_recommendations / len(true_labels) if true_labels else 0
        recalls.append(recall)
    
    avg_precision = np.mean(precisions)
    avg_recall = np.mean(recalls)
    return avg_precision, avg_recall

avg_precision_enhanced, avg_recall_enhanced = compute_precision_recall(model2, Xtest, Ytest, 10)
print("Average Precision:", avg_precision_enhanced)
print("Average Recall:", avg_recall_enhanced)

Average Precision: 0.24341666666666667
Average Recall: 0.03865637150087615


**From the above results, it can be seen that the model has been improved to achieve a precision of more than 23%, which is an improvement of about 2%-3%, and the recall has been reduced by about 1%.**