# Model Training

In [4]:
from helpers.functs.nlp_backmap import build_token_backmap, make_pretty_term
from sklearn.feature_extraction.text import TfidfVectorizer
from helpers.functs.StudentProfile import StudentProfile
from sklearn.metrics.pairwise import cosine_similarity
from helpers.functs.NLP import soft_nlp, hard_nlp
from sklearn.decomposition import TruncatedSVD
import pandas as pd
import numpy as np
import random
import ast

# Load dataset
df = pd.read_csv('../Data/Cleaned/cleaned_dataset_hard-NLP.csv')

# Load the raw dataset so user-facing feedback can show the original module names, untouched by NLP
raw_df = pd.read_csv('../Data/Raw/Uitgebreide_VKM_dataset.csv')


For the first model, we use a Bag of Words (BoW) approach with TF-IDF weighting: each text is converted into a numerical vector based on word frequency. We then use cosine similarity to measure how similar two text entries are. Finally, we simulate a student profile and recommend modules based on that profile.

Because we already have a cleaned dataset, we can directly proceed to the model training phase. The dataset already has heavy NLP pre-processing applied, so we can skip that step.
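
As a rough illustration of the idea, the sketch below hand-rolls TF-IDF weighting and cosine similarity in plain Python. It is not the notebook's implementation (the notebook uses `TfidfVectorizer`); the toy token lists and the helper names `fit_idf`, `transform` and `cosine` are made up for this example. The idf formula follows the smoothed form that scikit-learn applies by default.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn a smoothed idf weight per term from tokenised documents."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

def transform(doc, idf):
    """Map one tokenised document to an L2-normalised TF-IDF vector (as a dict)."""
    counts = Counter(t for t in doc if t in idf)  # unseen terms are ignored
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def cosine(a, b):
    """Both vectors have unit length, so the dot product is the cosine."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Toy, made-up module texts and one student query (already tokenised)
module_tokens = [["teken", "kunst", "schets"], ["zorg", "jeugd", "gezin"]]
student_tokens = ["kunst", "teken"]

idf = fit_idf(module_tokens)
module_vecs = [transform(doc, idf) for doc in module_tokens]
student_vec = transform(student_tokens, idf)
scores = [cosine(student_vec, m) for m in module_vecs]
```

The first toy module shares two terms with the student and scores highest; the second shares none and scores zero.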

## 0. A Mocked Student Profile + Filtering
Here we mock a student profile. This is all the data we would receive from a student who wants to use our module helper. We do this first because filtering should be our first step (e.g. a student only wants modules offered at a specific location).

The interests have been written to fit module 388 best. This lets us test whether the model includes this module in its recommendations.

In [5]:
student = StudentProfile(
    interests=[
        "Tekening", 
        "Animatie", 
        "Kunst",
        "Ik hou van optreden, zingen, dansen en spelen! Ook hou ik van uitdaging."
    ],
    wanted_study_credit_range=(15, 30),
    location_preference=["Den Bosch", "Breda", "Tilburg"],
    learning_goals=["skills", "career development"],
    level_preference=["NLQF5", "NLQF6"],
    preferred_language="NL"
)

matching_models = [388, 191]

Next, we use the student profile to filter out modules that cannot be a good match given the parameters in the profile.

In [6]:
# Build the filtered module set. TF-IDF is still fitted on the full dataset: fitting on fewer modules would bias the scores (a smaller comparison set makes high scores easier).
filtered_df = df.copy()

# Helper to normalize the list-like location strings such as "['Den Bosch', 'Tilburg']"
def normalize_locations(series):
    def _to_list(val):
        try:
            parsed = ast.literal_eval(str(val))
            if isinstance(parsed, list):
                return [str(x).strip().lower() for x in parsed]
            return [str(parsed).strip().lower()]
        except Exception:
            return [str(val).strip().lower()]
    return series.apply(_to_list)

# --- 1. Study credits range ---
if hasattr(student, "wanted_study_credit_range") and student.wanted_study_credit_range is not None:
    min_cred, max_cred = student.wanted_study_credit_range
    filtered_df = filtered_df[(filtered_df["studycredit"] >= min_cred) & (filtered_df["studycredit"] <= max_cred)]

# --- 2. Location preference ---
if hasattr(student, "location_preference") and student.location_preference:
    all_locs_filtered = normalize_locations(filtered_df["location"])
    loc_prefs_norm = [str(x).strip().lower() for x in student.location_preference]
    loc_mask = all_locs_filtered.apply(lambda lst: any(x in loc_prefs_norm for x in lst))
    filtered_df = filtered_df[loc_mask]

# --- 3. Language of the module vs preferred language of the student ---
# Skipped: fairly complicated to include, and of little use because TF-IDF cannot link interests written in a different language than the modules

# --- 4. Level preference (e.g. NLQF levels) ---
if hasattr(student, "level_preference") and student.level_preference:
    level_prefs = [str(x).strip().lower() for x in student.level_preference]
    filtered_df = filtered_df[filtered_df["level"].astype(str).str.lower().isin(level_prefs)]

# --- 5. Availability > 0 ---
filtered_df = filtered_df[filtered_df["available_spots"] > 0]

print(f"Original number of modules: {len(df)}")
print(f"Number of modules after filtering: {len(filtered_df)}")

Original number of modules: 211
Number of modules after filtering: 211
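
The printed counts are identical before and after filtering, so it can be worth checking each filter on a few hand-picked values. The sketch below isolates the location step in plain Python; `to_location_list` and `matches_preference` are hypothetical helper names for this illustration, and the sample strings assume the list-like format mentioned in the code comment above.

```python
import ast

def to_location_list(value):
    """Parse a cell such as "['Den Bosch', 'Tilburg']" into lowercase names."""
    try:
        parsed = ast.literal_eval(str(value))
    except (ValueError, SyntaxError):
        parsed = value  # a plain string such as "Breda" is not a Python literal
    items = parsed if isinstance(parsed, (list, tuple)) else [parsed]
    return [str(x).strip().lower() for x in items]

def matches_preference(value, preferences):
    """True if at least one module location is among the preferred ones."""
    wanted = {str(p).strip().lower() for p in preferences}
    return any(loc in wanted for loc in to_location_list(value))

print(to_location_list("['Den Bosch', 'Tilburg']"))
print(matches_preference("Breda", ["Den Bosch", "Tilburg"]))
```

Running the same kind of spot check on the credit and level columns would show whether a filter that removes nothing is expected for this profile or a sign of a mismatch in value formats.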


## 1. Combine Relevant Text Columns

Here we combine relevant text columns into a single text field for TF-IDF vectorization.

In [7]:
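# Combine relevant text columns
# Note: pd.read_csv returns strings rather than Python lists, so the isinstance check
# below is False for every row and module_tags adds no text unless the column is parsed first.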
big_string = (
    df["name"].fillna("") + " " +
    df["description"].fillna("") + " " +
    df["learningoutcomes"].fillna("") + " " +
    df["module_tags"].apply(lambda x: " ".join(x) if isinstance(x, list) else "")
)

big_df = pd.DataFrame({
    "id": df["id"],
    "text": big_string
})

big_df.head()

| | id | text |
|---|---|---|
| 0 | 159 | kennismak psychologi modul ler gedrag jezelf a... |
| 1 | 160 | learn work abroad student kiez binn stam oplei... |
| 2 | 161 | proactiev zorgplann jeroen bosch ziekenhuis gr... |
| 3 | 162 | rouw verlies modul stil gestan rouw verlies va... |
| 4 | 163 | acuut complex zorg modul student verdiep acut ... |
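
Columns loaded from CSV arrive as strings, so a tag column may need parsing before it can be joined into the combined text. The sketch below is one possible way to do that; it assumes the cells look like `"['kunst', 'teken']"`, which the source does not confirm, and `tags_to_text` is a hypothetical helper name.

```python
import ast

def tags_to_text(value):
    """Turn a tag cell (list, list-like string, plain string or NaN) into text."""
    if isinstance(value, list):
        return " ".join(str(t) for t in value)
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value.strip()  # plain text, keep it as it is
        if isinstance(parsed, (list, tuple)):
            return " ".join(str(t) for t in parsed)
        return str(parsed)
    return ""  # NaN, None or another non-text value

print(tags_to_text("['kunst', 'teken']"))
```

It could be applied with `df["module_tags"].apply(tags_to_text)`. Doing so would change the TF-IDF vocabulary, so the shapes and scores reported further down would no longer match.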


## 2. Vectorizing the Dataset
Here we convert the combined text into a matrix of TF-IDF scores, with one row per module and one column per term.

In [8]:
vectorizer = TfidfVectorizer(
    max_features=1000000,    # Far above the actual vocabulary size, so no terms are dropped
    ngram_range=(1,2),       # unigrams + bigrams
    stop_words=None,         # Stop words already removed
)

X_modules_tfidf = vectorizer.fit_transform(big_df["text"])
X_modules_tfidf.shape

(211, 12596)
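
The 12596 columns are the combined unigrams and bigrams found in the module texts. The sketch below shows what `ngram_range=(1, 2)` means for a single token sequence. It is a simplified illustration: `word_ngrams` is a hypothetical helper, and scikit-learn's own tokenisation differs in details such as its token pattern.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the requested lengths, shortest first."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "stevig stan jeugdzorg".split()
features = word_ngrams(tokens)
print(features)
# ['stevig', 'stan', 'jeugdzorg', 'stevig stan', 'stan jeugdzorg']
```

Bigrams let the model distinguish word pairs from their separate words, at the cost of a much larger and sparser vocabulary.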

## 3. Converting to a DataFrame for Later Use
Linking the TF-IDF vectors to the ids of our dataset for comparison with user input.

In [9]:
tfidf_hard_NLP = pd.DataFrame({
    "id": big_df["id"],
    "tfidf_vector": list(X_modules_tfidf)    # each row is a 1xN sparse vector
})
tfidf_hard_NLP.head()

| | id | tfidf_vector |
|---|---|---|
| 0 | 159 | <Compressed Sparse Row sparse matrix of dtype ... |
| 1 | 160 | <Compressed Sparse Row sparse matrix of dtype ... |
| 2 | 161 | <Compressed Sparse Row sparse matrix of dtype ... |
| 3 | 162 | <Compressed Sparse Row sparse matrix of dtype ... |
| 4 | 163 | <Compressed Sparse Row sparse matrix of dtype ... |


Check that the ids still match and that the DataFrame is built correctly.

In [10]:
i = 10  # any row index you want

# showing ID and first 200 chars
print("ID:", big_df.iloc[i]["id"])
print("TEXT:", big_df.iloc[i]["text"][:200], "...")

# Get the tf-idf vector
vec = tfidf_hard_NLP.iloc[i]["tfidf_vector"]

# Find the highest tfidf weight word in this vector
row = vec.toarray().flatten()
top_index = row.argmax()

feature_name = vectorizer.get_feature_names_out()[top_index]
print("TOP TF-IDF WORD:", feature_name)

ID: 169
TEXT: stevig stan jeugdzorg bent stat the art kennis handelingsrepertoir methodisch toegerust werkveld jeugdzorg bred toetsvorm beroepsprestaties jeugd gezinsprofessional ontwikkeld ism jeugdwerkveld  ...
TOP TF-IDF WORD: jeugdzorg
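
A single top word is a fairly weak sanity check. Looking at the top few terms of a row gives a better impression of whether the vector belongs to the text. The sketch below does this for a dense toy row; the feature names and weights are invented for the illustration, and `top_terms` is a hypothetical helper.

```python
import numpy as np

def top_terms(row, feature_names, k=3):
    """Return the k highest-weighted (term, weight) pairs, skipping zero weights."""
    row = np.asarray(row, dtype=float).ravel()
    order = np.argsort(-row)[:k]
    return [(str(feature_names[i]), float(row[i])) for i in order if row[i] > 0]

# Invented weights for four terms
feature_names = np.array(["bred", "jeugdzorg", "kennis", "werkveld"])
row = np.array([0.10, 0.62, 0.0, 0.31])
top = top_terms(row, feature_names, k=3)
print(top)
```

With the notebook's objects, the row would come from `vec.toarray()` and the names from `vectorizer.get_feature_names_out()`.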


## 4. NLP on student profile
The dataset has already been through NLP, so the student profile needs the same processing. Without it, finding relations between the two would work far worse.

In [11]:
studentInterests = student.to_text()
student_hardNLP = hard_nlp(studentInterests)

token_map = build_token_backmap(studentInterests, student_hardNLP)
pretty_term = make_pretty_term(token_map)

student_hardNLP

'teken animatie kunst hou optred zing dans spel hou uitdag skill carer development'
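
The internals of `hard_nlp`, `build_token_backmap` and `make_pretty_term` are not shown in this notebook. The sketch below only illustrates the general idea of a backmap: remembering which original word produced each processed token, so feedback can show the readable form. `toy_stem` is a deliberately crude, hypothetical stand-in for the real processing, and `build_backmap` is not the project's function.

```python
def toy_stem(word):
    """Hypothetical stand-in for real stemming: strips a few Dutch suffixes."""
    word = word.lower()
    for suffix in ("en", "ing", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_backmap(original_text):
    """Map each stemmed token to the first original word that produced it."""
    backmap = {}
    for raw in original_text.split():
        word = raw.strip(".,!?").lower()
        if word:
            backmap.setdefault(toy_stem(word), word)
    return backmap

backmap = build_backmap("Tekening. Ik hou van zingen, dansen en spelen!")
pretty = backmap.get("zing", "zing")  # fall back to the stem if it is unknown
print(pretty)
```

Keeping the first original word per stem is one simple policy; a real backmap may need to handle several originals that collapse to the same token.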

## 5. Vectorizing student interests
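We transform the processed student text with the TF-IDF vectorizer that was already fitted on the modules, so both share the same feature space.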

In [12]:
X_interests_tfidf = vectorizer.transform([student_hardNLP])
X_interests_tfidf.shape

(1, 12596)
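
Because the vectorizer is only fitted on module texts, student terms that never occur in a module are silently dropped by `transform`. The sketch below shows that behaviour on a tiny made-up corpus and computes a simple vocabulary coverage figure. The corpus and the names `toy_vectorizer` and `coverage` are assumptions for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the module texts
corpus = ["teken kunst schets", "zorg jeugd gezin"]
toy_vectorizer = TfidfVectorizer().fit(corpus)

known = toy_vectorizer.transform(["kunst teken"])
unknown = toy_vectorizer.transform(["robotica programmeren"])

print(known.nnz)    # number of non-zero features for a query with known terms
print(unknown.nnz)  # 0: every term is outside the fitted vocabulary

# Share of query tokens that the fitted vocabulary can represent
tokens = "kunst teken robotica".split()
coverage = sum(t in toy_vectorizer.vocabulary_ for t in tokens) / len(tokens)
print(round(coverage, 2))
```

A low coverage value for a real student profile would suggest that the recommendations rest on only a few of the stated interests.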

## 6. Dimensionality Reduction
We perform dimensionality reduction on the vectorized student and module data. We reduce to 200 dimensions, which we expect to give reasonable results; this value may change during the optimization stage.
We choose TruncatedSVD, a PCA-like dimensionality reduction that works directly on sparse TF-IDF matrices, mainly for performance reasons.

In [13]:
# Final dimension amount
n_components = 200
svd = TruncatedSVD(n_components=n_components, random_state=42)

# Transforming the module vectors
X_modules_reduced = svd.fit_transform(X_modules_tfidf)

# Transforming single student vector
X_student_reduced = svd.transform(X_interests_tfidf)

print("Modules reduced:", X_modules_reduced.shape)  # (n_modules, n_components)
print("Student reduced:", X_student_reduced.shape)  # (1, n_components)

Modules reduced: (211, 200)
Student reduced: (1, 200)
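
One way to judge a choice of `n_components` is to look at the cumulative explained variance. The sketch below does this on synthetic data, not on the module matrix, so the numbers say nothing about this dataset. Note that with 211 modules the TF-IDF matrix has rank at most 211, so 200 components probably discard very little; a smaller value may be worth testing during optimization.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Synthetic stand-in for a TF-IDF matrix: 60 "documents", 300 "terms", mostly zeros
X = rng.random((60, 300)) * (rng.random((60, 300)) < 0.05)

svd_demo = TruncatedSVD(n_components=40, random_state=42).fit(X)
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)

# Smallest number of components that reaches the target share, if any does
target = 0.80
reached = np.nonzero(cumulative >= target)[0]
n_needed = int(reached[0]) + 1 if reached.size else None

print(f"Variance kept by 40 components: {cumulative[-1]:.3f}")
print("Components needed for 80%:", n_needed)
```

Unlike PCA, `TruncatedSVD` does not center the data before decomposing it, which is what allows it to keep the input sparse.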


## 7. Cosine Similarity + Motivation
After performing dimensionality reduction, we can finally look at the cosine similarity values to see what our model comes up with.

We also add a column to the top-n output that lists the strongest words from the user input that matched the suggested module. This lets the model show the user how it arrived at its suggestions. To keep the feedback free of words that were altered by NLP, we created an NLP backmap function that links each processed word back to its original form.
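
The matching words can be found by multiplying the student and module TF-IDF vectors element by element: each product is that term's share of the dot product. The sketch below shows this on invented vectors. One caveat worth keeping in mind is that the notebook computes the scores in the SVD-reduced space but the shared terms in the original TF-IDF space, so the listed words approximate the reason for a score.

```python
import numpy as np

# Invented feature names and weights for illustration
feature_names = np.array(["animatie", "kunst", "teken", "zorg"])
student_demo = np.array([0.5, 0.5, 0.7, 0.0])
module_demo = np.array([0.0, 0.6, 0.8, 0.0])

# Per-term contribution to the dot product; zero wherever either side is zero
contrib = student_demo * module_demo
order = np.argsort(-contrib)
shared = [(str(feature_names[i]), float(contrib[i])) for i in order if contrib[i] > 0]

print(shared)
```

Here "teken" and "kunst" are shared, while "animatie" only appears on the student side and therefore contributes nothing.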

In [14]:
# Make sure both matrices exist

if 'X_modules_reduced' not in globals():
    raise NameError("X_modules_reduced not found — run the TruncatedSVD cell first.")
if 'X_student_reduced' not in globals():
    raise NameError("X_student_reduced not found — transform the student TF-IDF first using the SVD pipeline.")

# Making sure student matrix is in form (1, n_components)
X_student_vec = X_student_reduced
if X_student_vec.ndim == 1:
    X_student_vec = X_student_vec.reshape(1, -1)

# Compute global cosine similarity scores for the student against all modules
scores_global = cosine_similarity(X_student_vec, X_modules_reduced)[0]  # (n_modules,)

# Restrict to filtered_df using module ids
candidate_ids = set(filtered_df["id"].tolist())
candidate_mask = big_df["id"].isin(candidate_ids)

# Cosine scores for only the candidate set
scores_candidates = scores_global[candidate_mask.values]
idx_candidates = np.where(candidate_mask.values)[0]

if len(idx_candidates) == 0:
    raise ValueError("No modules remain after filtering; cannot compute recommendations.")

# Select top-n among the candidates
top_n = 5
order = np.argsort(-scores_candidates)[:top_n]
top_idx = idx_candidates[order]  # indices in the original big_df / df space

# =======================================================================================
feature_names = vectorizer.get_feature_names_out()
student_vec = X_interests_tfidf.toarray().flatten()

# Collect top shared terms between student and each recommended module
def top_shared_terms(module_idx, top_k=5):
    module_vec = X_modules_tfidf[module_idx].toarray().flatten()
    # Only features where both vectors are non-zero
    mask = (student_vec > 0) & (module_vec > 0)
    if not mask.any():
        return ""
    shared_scores = (student_vec * module_vec) * mask
    top_term_idx = np.argsort(-shared_scores)[:top_k]

    terms = []
    for i in top_term_idx:
        if shared_scores[i] > 0:
            term = feature_names[i]
            if term not in terms:
                terms.append(term)
    return ", ".join(pretty_term(t) for t in terms)

motivation_list = [top_shared_terms(i) for i in top_idx]
# =======================================================================================

# Map from internal index (0..len(df)-1) to module id, then back to the correct row in raw_df using that id.
module_ids = big_df.iloc[top_idx]["id"].values

if "id" in raw_df.columns:
    # Align on module id, not on positional index
    module_names = []
    for mid in module_ids:
        row_match = raw_df[raw_df["id"] == mid]
        if not row_match.empty and "name" in row_match.columns:
            module_names.append(row_match.iloc[0]["name"])
        else:
            module_names.append("")
else:
    module_names = big_df.iloc[top_idx]["text"].values

# Build recommendations DataFrame using candidate scores only (no separate score_global column here)
recs = pd.DataFrame({
    'rank': list(range(1, len(top_idx) + 1)),
    'module_id': module_ids,
    'module_name': module_names,
    'score': scores_candidates[order],  # cosine score within filtered set
    'Motivation': motivation_list,
})

print('Top recommendations for the current student (filtered candidates, cosine scores):')
display(recs.reset_index(drop=True))

Top recommendations for the current student (filtered candidates, cosine scores):


| | rank | module_id | module_name | score | Motivation |
|---|---|---|---|---|---|
| 0 | 1 | 388 | Tekenen | 0.603486 | tekening |
| 1 | 2 | 191 | De Kracht van de kunsten | 0.514482 | kunst, zingen |
| 2 | 3 | 356 | The art of biology | 0.251662 | tekening, kunst |
| 3 | 4 | 315 | European Project Semester | 0.233205 | spelen |
| 4 | 5 | 385 | Stopmotion | 0.230948 | animatie, zingen |
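
The ranking step scores the student against every module and only afterwards restricts the result to the filtered candidates. The sketch below reproduces that order of operations with NumPy on invented two-dimensional vectors. The module ids are borrowed from the table above purely as labels, and `cosine_scores` is a hypothetical helper rather than the scikit-learn function used in the notebook.

```python
import numpy as np

def cosine_scores(query, matrix):
    """Cosine similarity of one query vector against every row of a matrix."""
    query = np.asarray(query, dtype=float).ravel()
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    dots = matrix @ query
    # Rows with zero norm get a score of 0 instead of a division error
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

module_ids = np.array([388, 191, 356, 315])
modules = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
student_demo = np.array([1.0, 0.0])

scores = cosine_scores(student_demo, modules)            # scored on all modules
candidate_mask = np.isin(module_ids, [191, 356, 315])    # pretend 388 was filtered out
idx = np.where(candidate_mask)[0]
order = np.argsort(-scores[idx])[:2]                     # top 2 among the candidates
top_ids = module_ids[idx[order]].tolist()
print(top_ids)
```

Scoring first and masking second keeps the scores comparable across students, because they do not depend on how many modules survive the filter.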


Now we convert the motivation into a clear, readable sentence for the user.

In [15]:
# Return a randomly chosen phrase describing the match strength, in Dutch or English.
def strength_phrase(score: float, is_dutch: bool = True) -> str:
    if is_dutch:
        if score >= 0.6:
            return random.choice([
                "sluit extreem goed aan bij jouw profiel",
                "is een bijna perfecte match met jouw interesses",
                "past heel sterk bij wat jij leuk vindt",
            ])

        elif score >= 0.4:
            return random.choice([
                "sluit goed aan bij jouw profiel",
                "is een sterke match met jouw interesses",
                "lijkt behoorlijk goed bij jou te passen",
            ])

        else:
            return random.choice([
                "kan een interessante extra optie zijn",
                "heeft een gematigde overlap met jouw interesses",
                "kan alsnog relevant zijn op basis van delen van jouw profiel",
            ])

    else:
        if score >= 0.6:
            return random.choice([
                "fits your profile extremely well",
                "is a near-perfect match for your interests",
                "aligns very strongly with what you like",
            ])

        elif score >= 0.4:
            return random.choice([
                "fits your profile well",
                "is a strong match for your interests",
                "seems to match you quite well"
            ])
        else:
            return random.choice([
                "could be an interesting additional option",
                "has a moderate overlap with your interests",
                "might still be relevant based on parts of your profile",
            ])

# Scaffolding final recommendation sentence
def motivation_sentence(row, is_dutch: bool = True) -> str:
    score = row["score"]
    words = row["Motivation"]
    module_name = row["module_name"]

    if is_dutch:
        base_options = [
            "Deze module {strength}. ",
            "{module} {strength}. ",
            "Op basis van jouw antwoorden {strength}. ",
        ]

    else:
        base_options = [
            "This module {strength}. ",
            "{module} {strength}. ",
            "Based on your answers, this module {strength}. ",
        ]

    strength = strength_phrase(score, is_dutch=is_dutch)
    base_template = random.choice(base_options)

    if "{module}" in base_template:
        base_text = base_template.format(module=module_name, strength=strength)
    else:
        base_text = base_template.format(strength=strength)

    # Mention the matched terms when there are any
    if isinstance(words, str) and words.strip():
        if is_dutch:
            profile_templates = [
                "In jouw studentenprofiel noem je **{words}**, wat goed aansluit bij deze module.",
                "Je profiel vertelt over **{words}**, en deze interesses komen sterk terug in deze module.",
                "We zien dat **{words}** uit jouw antwoorden sterk overlappen met deze module.",
            ]

        else:
            profile_templates = [
                "Your student profile mentions **{words}**, which match well with this module.",
                "In your profile you talk about **{words}**, and these interests align with this module.",
                "We found that **{words}** from your answers overlap strongly with this module.",
            ]
        profile_part = random.choice(profile_templates).format(words=words)
        return base_text + profile_part

    # If no strong hits were found
    else:
        if is_dutch:
            fallback_options = [
                "Ook al vonden we geen heel specifieke trefwoorden, jouw totale profiel suggereert dat deze module interessant voor je kan zijn.",
                "We vonden geen hele sterke individuele woordmatches, maar jouw algemene profiel wijst toch in de richting van deze module.",
                "Er zijn geen heel duidelijke trefwoorden, maar op basis van het bredere profiel lijkt deze module alsnog bij je te passen.",
            ]

        else:
            fallback_options = [
                "Even though we did not find very specific keyword matches, the overall profile still suggests this module could be interesting.",
                "There are no very strong individual word matches, but your general profile still points towards this module.",
                "We did not find very clear keyword overlaps, but the broader profile still connects you to this module.",
            ]

        fallback = random.choice(fallback_options)
        return base_text + fallback

# Select the language of the motivation feedback based on the student's preferred language
use_dutch = True
if hasattr(student, "preferred_language"):
    lang = str(student.preferred_language).strip().lower()
    if lang in ["en", "eng", "english"]:
        use_dutch = False

# Creating full motivation column (Dutch by default)
recs["motivation_full"] = recs.apply(lambda row: motivation_sentence(row, is_dutch=use_dutch), axis=1)

# Removing old motivation column
recs.drop(columns=["Motivation"], inplace=True)

# Pandas settings for max width columns
old_width = pd.get_option("display.max_colwidth")
pd.set_option("display.max_colwidth", None)

# Show updated recommendations
display(recs.reset_index(drop=True))

# Restoring original pandas setting
pd.set_option("display.max_colwidth", old_width)

| | rank | module_id | module_name | score | motivation_full |
|---|---|---|---|---|---|
| 0 | 1 | 388 | Tekenen | 0.603486 | Op basis van jouw antwoorden is een bijna perfecte match met jouw interesses. In jouw studentenprofiel noem je **tekening**, wat goed aansluit bij deze module. |
| 1 | 2 | 191 | De Kracht van de kunsten | 0.514482 | Deze module is een sterke match met jouw interesses. In jouw studentenprofiel noem je **kunst, zingen**, wat goed aansluit bij deze module. |
| 2 | 3 | 356 | The art of biology | 0.251662 | Op basis van jouw antwoorden kan een interessante extra optie zijn. Je profiel vertelt over **tekening, kunst**, en deze interesses komen sterk terug in deze module. |
| 3 | 4 | 315 | European Project Semester | 0.233205 | Op basis van jouw antwoorden kan een interessante extra optie zijn. In jouw studentenprofiel noem je **spelen**, wat goed aansluit bij deze module. |
| 4 | 5 | 385 | Stopmotion | 0.230948 | Op basis van jouw antwoorden kan alsnog relevant zijn op basis van delen van jouw profiel. We zien dat **animatie, zingen** uit jouw antwoorden sterk overlappen met deze module. |
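
Because the phrases are drawn with the global `random` module, the sentences change on every run. The sketch below shows one possible way to make them reproducible by passing a seeded `random.Random` instance. The thresholds 0.6 and 0.4 come from the cell above; `strength_bucket`, `pick_phrase` and the shortened `PHRASES` table are hypothetical names for this illustration.

```python
import random

def strength_bucket(score):
    """Map a cosine score to one of three strength levels (thresholds from above)."""
    if score >= 0.6:
        return "strong"
    if score >= 0.4:
        return "good"
    return "moderate"

# Shortened phrase table, English variants only
PHRASES = {
    "strong": ["fits your profile extremely well", "is a near-perfect match for your interests"],
    "good": ["fits your profile well", "is a strong match for your interests"],
    "moderate": ["could be an interesting additional option", "has a moderate overlap with your interests"],
}

def pick_phrase(score, rng):
    """Choose a phrase using the given random generator, not the global one."""
    return rng.choice(PHRASES[strength_bucket(score)])

first = pick_phrase(0.51, random.Random(42))
second = pick_phrase(0.51, random.Random(42))
print(first == second)  # True: the same seed gives the same phrase
```

Seeding per student, for example from a hash of the profile, would keep the variety between students while giving each student a stable text.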


# Evaluation Additions


## 8. Precision@k
We use precision@k to measure the effectiveness of our model. The ground-truth module ids (matching_models) are defined directly below the mocked student profile.

In [16]:
# Hand-picked ground truth
relevant_ids = set(matching_models)

# Modules given back from the model
recommended_ids = recs["module_id"].tolist()

# We use the same k as the number of recommended items
k = min(5, len(recommended_ids))
top_k_ids = recommended_ids[:k]



# Count how many of the top-k are in the ground-truth list
hits = sum(1 for mid in top_k_ids if mid in relevant_ids)
precision_at_k = hits / k if k > 0 else 0.0



print(f"Relevant module IDs (ground truth): {sorted(relevant_ids)}")
print(f"Top-{k} recommended IDs: {top_k_ids}")
print(f"Hits in top-{k}: {hits}")
print(f"precision@{k}: {precision_at_k:.3f}")

Relevant module IDs (ground truth): [191, 388]
Top-5 recommended IDs: [388, 191, 356, 315, 385]
Hits in top-5: 2
precision@5: 0.400
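
With only two relevant modules in the ground truth, precision@5 cannot exceed 2/5 = 0.4, so the model already reaches the maximum here. Reporting recall@k, or precision at a smaller k, next to it may give a fuller picture. The sketch below wraps both metrics in small functions; the function names are our own, and the ids are the ones printed above.

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    k = min(k, len(recommended))  # same cap on k as in the cell above
    if k == 0:
        return 0.0
    hits = sum(1 for mid in recommended[:k] if mid in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for mid in recommended[:k] if mid in relevant)
    return hits / len(relevant)

recommended = [388, 191, 356, 315, 385]
relevant = {388, 191}

print(precision_at_k(recommended, relevant, 5))  # 0.4
print(precision_at_k(recommended, relevant, 2))  # 1.0
print(recall_at_k(recommended, relevant, 5))     # 1.0
```

Both relevant modules occupy the first two ranks, which precision@2 and recall@5 show more directly than precision@5 does.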
