# Model Training

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from StudentProfile import StudentProfile
from NLP import soft_nlp, hard_nlp
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv('../Data/Cleaned/cleaned_dataset_hard-NLP.csv')

# Load the raw dataset as well: its names and other fields have not been through NLP, so they stay readable when shown to users
raw_df = pd.read_csv('../Data/Raw/Uitgebreide_VKM_dataset.csv')


For the first model, we use a Bag of Words (BoW) approach with TF-IDF weighting: each text is converted into a numerical vector based on word frequency. We then use cosine similarity to measure how similar two text entries are. Finally, we simulate student profiles and recommend courses based on those profiles.
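To make the idea concrete before working on the real data, here is a minimal sketch of the same pipeline on a toy corpus. The three module texts and the student text are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, not from the real dataset
toy_modules = [
    "kunst teken animatie",
    "zorg ziekenhuis patient",
    "kunst muziek dans",
]
toy_student = ["teken kunst"]

# Fit the vocabulary on the modules, then reuse it for the student text
toy_vectorizer = TfidfVectorizer()
toy_module_vectors = toy_vectorizer.fit_transform(toy_modules)
toy_student_vector = toy_vectorizer.transform(toy_student)

# One similarity score per module; the highest score is the best match
toy_scores = cosine_similarity(toy_student_vector, toy_module_vectors)[0]
best_match = int(toy_scores.argmax())
print(toy_scores.round(3), "-> best match:", toy_modules[best_match])
```

The module that shares the most (and the rarest) terms with the student text scores highest, while a module with no shared terms scores exactly 0.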

Because we already have a cleaned dataset, we can directly proceed to the model training phase. The dataset already has heavy NLP pre-processing applied, so we can skip that step.

## 1. Combine Relevant Text Columns

Here we combine relevant text columns into a single text field for TF-IDF vectorization.

In [234]:
big_string = (
    df["name"].fillna("") + " " +
    df["description"].fillna("") + " " +
    df["learningoutcomes"].fillna("") + " " +
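    # Tags are only joined when the cell already holds a list; any other value (e.g. a plain string) becomes ""
    df["module_tags"].apply(lambda x: " ".join(x) if isinstance(x, list) else "")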
)

# Combine id and text in one DataFrame
big_df = pd.DataFrame({
    "id": df["id"],
    "text": big_string
})

big_df.head()

Unnamed: 0,id,text
0,159,kennismak psychologi modul ler gedrag jezelf a...
1,160,learn work abroad student kiez binn stam oplei...
2,161,proactiev zorgplann jeroen bosch ziekenhuis gr...
3,162,rouw verlies modul stil gestan rouw verlies va...
4,163,acuut complex zorg modul student verdiep acut ...


## 2. Vectorization
Converting the text into a matrix of TF-IDF scores.

In [235]:
vectorizer = TfidfVectorizer(
    max_features=1000000,    # Effectively no cap: far above the actual vocabulary size
    ngram_range=(1,2),       # unigrams + bigrams
    stop_words=None,         # Stop words already removed
)

X_modules_tfidf = vectorizer.fit_transform(big_df["text"])
X_modules_tfidf.shape

(211, 12596)

## 3. Converting into a dataframe for later use
Linking the TF-IDF row vectors to the ids of our dataset, for comparison with user input.

In [236]:
tfidf_hard_NLP = pd.DataFrame({
    "id": big_df["id"],
    "tfidf_vector": list(X_modules_tfidf)    # each row is a 1xN sparse vector
})
tfidf_hard_NLP.head()

Unnamed: 0,id,tfidf_vector
0,159,<Compressed Sparse Row sparse matrix of dtype ...
1,160,<Compressed Sparse Row sparse matrix of dtype ...
2,161,<Compressed Sparse Row sparse matrix of dtype ...
3,162,<Compressed Sparse Row sparse matrix of dtype ...
4,163,<Compressed Sparse Row sparse matrix of dtype ...
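
If the per-row vectors stored in this dataframe need to go back into a single matrix later, they can be stacked again. The following is a sketch on a tiny made-up matrix, assuming the rows are stored as 1xN SciPy sparse matrices as above; `demo_matrix`, `demo_frame` and `restored` are illustrative names.

```python
import pandas as pd
from scipy.sparse import csr_matrix, vstack

# Tiny stand-in for the TF-IDF matrix (2 documents, 3 features)
demo_matrix = csr_matrix([[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]])

# Store one 1xN sparse row per id, mirroring the dataframe above
demo_frame = pd.DataFrame({
    "id": [159, 160],
    "tfidf_vector": [demo_matrix[i] for i in range(demo_matrix.shape[0])],
})

# Stack the stored rows back into one sparse matrix
restored = vstack(demo_frame["tfidf_vector"].tolist(), format="csr")
print(restored.shape)
```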


Check that the ids still match and that the dataframe is built correctly.

In [237]:
i = 10  # any row index you want

# showing ID and first 200 chars
print("ID:", big_df.iloc[i]["id"])
print("TEXT:", big_df.iloc[i]["text"][:200], "...")

# Get the tf-idf vector
vec = tfidf_hard_NLP.iloc[i]["tfidf_vector"]

# Find the highest tfidf weight word in this vector
row = vec.toarray().flatten()
top_index = row.argmax()

feature_name = vectorizer.get_feature_names_out()[top_index]
print("TOP TF-IDF WORD:", feature_name)

ID: 169
TEXT: stevig stan jeugdzorg bent stat the art kennis handelingsrepertoir methodisch toegerust werkveld jeugdzorg bred toetsvorm beroepsprestaties jeugd gezinsprofessional ontwikkeld ism jeugdwerkveld  ...
TOP TF-IDF WORD: jeugdzorg


## 4. Mock student profiles

In [238]:
student = StudentProfile(
    current_education="Informatica",
    interests=[
        "Tekening", 
        "Animatie", 
        "Kunst",
        "Ik hou van optreden, zingen, dansen en spelen! Ook hou ik van uitdaging."
    ],
    wanted_study_credit_range=(15, 30),
    location_preference="Den Bosch",
    learning_goals=["skills", "career development"]
)

In [239]:
studentInterests = student.to_text()
student_hardNLP = hard_nlp(studentInterests)
student_hardNLP


'teken animatie kunst hou optred zing dans spel hou uitdag skill carer development'

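## 5. Vectorizing student interests
We reuse the TF-IDF vectorizer that was fitted on the modules, so the student vector has the same vocabulary and dimensions as the module vectors.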

In [240]:
X_interests_tfidf = vectorizer.transform([student_hardNLP])
X_interests_tfidf.shape

(1, 12596)
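
One thing to keep in mind with this step: `transform` only knows the vocabulary learned from the modules, so student words that never occur in any module are silently dropped. A minimal sketch with made-up texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

demo_vectorizer = TfidfVectorizer()
demo_vectorizer.fit(["teken kunst animatie", "zorg ziekenhuis"])

# Both words are in the fitted vocabulary
known = demo_vectorizer.transform(["teken kunst"])
# Neither word is in the fitted vocabulary, so the vector is all zeros
unknown = demo_vectorizer.transform(["optred zing"])

print(known.nnz, unknown.nnz)
```

A student text made up entirely of unknown words would therefore get a cosine similarity of 0 with every module, so checking `X_interests_tfidf.nnz` before recommending can be worthwhile.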

## 6. Dimensionality Reduction
We perform dimensionality reduction on the vectorized student and module data. We start with 200 final dimensions, as we expect this to give reasonably good results. This may of course change during our optimization stage.
We use a PCA-style reduction (TruncatedSVD), mainly for performance reasons: unlike PCA, it does not center the data, so it works directly on the sparse TF-IDF matrix.

In [241]:
# Final dimension amount
n_components = 200
svd = TruncatedSVD(n_components=n_components, random_state=42)

# Transforming the module vectors
X_modules_reduced = svd.fit_transform(X_modules_tfidf)

# Transforming single student vector
X_student_reduced = svd.transform(X_interests_tfidf)

print("Modules reduced:", X_modules_reduced.shape)  # (n_modules, n_components)
print("Student reduced:", X_student_reduced.shape)  # (1, n_components)

Modules reduced: (211, 200)
Student reduced: (1, 200)
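
One possible way to revisit the choice of `n_components` during optimization is to look at the cumulative explained variance of the fitted SVD. The sketch below uses random synthetic data as a stand-in, and the 80 % threshold is an arbitrary assumption for illustration, not a value from this notebook.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in data (50 samples, 30 features), not the real TF-IDF matrix
rng = np.random.default_rng(42)
demo_data = rng.random((50, 30))

demo_svd = TruncatedSVD(n_components=20, random_state=42)
demo_svd.fit(demo_data)

# Running total of the variance explained by the first k components
cumulative = np.cumsum(demo_svd.explained_variance_ratio_)

# Smallest number of components that reaches the (assumed) threshold, if any
threshold = 0.80
reached = cumulative >= threshold
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None
print("Components needed:", n_needed)
```

On the real data, the same attribute is available on `svd` after `fit_transform`.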


## 7. Cosine Similarity
After performing dimensionality reduction, we can finally look at the cosine similarity values to see what our model comes up with.
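
As a reminder of what the score means: cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy on made-up vectors and compares the result with scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up vectors: one "student" row and three "module" rows
a = np.array([[1.0, 2.0, 0.0]])
B = np.array([
    [2.0, 4.0, 0.0],   # same direction as a
    [0.0, 0.0, 3.0],   # orthogonal to a
    [1.0, 0.0, 0.0],   # partial overlap with a
])

# Dot products divided by the product of the vector lengths
manual = (B @ a[0]) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a[0]))
library = cosine_similarity(a, B)[0]

print(manual.round(3), library.round(3))
```

Because only the direction of the vectors matters, a long module description is not favored over a short one just for having more words.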

In [242]:
# Ensure reduced matrices exist
if 'X_modules_reduced' not in globals():
    raise NameError("X_modules_reduced not found — run the TruncatedSVD cell first.")
if 'X_student_reduced' not in globals():
    raise NameError("X_student_reduced not found — transform the student TF-IDF first using the SVD pipeline.")

# Ensure the student reduced vector is shaped (1, n_components)
X_student_vec = X_student_reduced
if X_student_vec.ndim == 1:
    X_student_vec = X_student_vec.reshape(1, -1)

# Compute similarities and top-n recommendations
top_n = 3
scores = cosine_similarity(X_student_vec, X_modules_reduced)[0]  # (n_modules,)

# Get the top results and fill them into a dataframe.
# top_idx holds row positions, so index with .iloc (assumes raw_df rows are in the same order as df).
top_idx = np.argsort(-scores)[:top_n]
recs = pd.DataFrame({
    'rank': list(range(1, len(top_idx) + 1)),
    'module_id': big_df['id'].iloc[top_idx].values,
    'module_name': raw_df['name'].iloc[top_idx].values if 'name' in raw_df.columns else big_df['text'].iloc[top_idx].values,
    'score': scores[top_idx]
})

print(f'Top-{top_n} recommendations (reduced space) for the current student:')
display(recs.reset_index(drop=True))


Top-3 recommendations (reduced space) for the current student:


Unnamed: 0,rank,module_id,module_name,score
0,1,388,Tekenen,0.603486
1,2,191,De Kracht van de kunsten,0.514482
2,3,356,The art of biology,0.251662
