# Model Training

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import sklearn
import nltk

# Load dataset
df = pd.read_csv('../Data/Cleaned/cleaned_dataset_hard-NLP.csv')


For the first model, we use a Bag of Words (BoW) approach with TF-IDF weighting: each text entry is converted into a numerical vector based on word frequency. We then use cosine similarity to measure how similar two text entries are. Finally, we simulate student profiles and recommend courses based on those profiles.

Because the dataset has already been cleaned with heavy NLP pre-processing, we can skip that step and proceed directly to model training.
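To make the plan above concrete, here is a minimal sketch of the whole idea on toy data: vectorize a few course texts, vectorize a simulated student profile with the same fitted vectorizer, and rank the courses by cosine similarity. The course texts, ids, and profile below are invented for illustration and are not taken from our dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy course texts (hypothetical, not from the real dataset)
courses = {
    1: "psychology behaviour mind",
    2: "nursing acute care hospital",
    3: "youth care family professional",
}
ids = list(courses.keys())

toy_vectorizer = TfidfVectorizer()
course_matrix = toy_vectorizer.fit_transform(list(courses.values()))

# A simulated student profile, transformed with the SAME fitted vectorizer
profile = "interested in acute hospital care"
profile_vec = toy_vectorizer.transform([profile])

# Cosine similarity between the profile and every course
scores = cosine_similarity(profile_vec, course_matrix).ravel()

# Rank course ids from most to least similar
ranked = sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
best_id = ranked[0][0]
print(ranked)
```

Words in the profile that the vectorizer never saw during fitting (here "interested" and "in") are simply ignored by `transform`, so a profile only scores on vocabulary it shares with the courses.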

## 1. Combine Relevant Text Columns

Here we combine relevant text columns into a single text field for TF-IDF vectorization.

In [34]:
big_string = (
    df["name"].fillna("") + " " +
    df["description"].fillna("") + " " +
    df["learningoutcomes"].fillna("") + " " +
    df["module_tags"].apply(lambda x: " ".join(x) if isinstance(x, list) else "")
)

# Combine id and text into one DataFrame
big_df = pd.DataFrame({
    "id": df["id"],
    "text": big_string
})

big_df.head()

Unnamed: 0,id,text
0,159,kennismak psychologi modul ler gedrag jezelf a...
1,160,learn work abroad student kiez binn stam oplei...
2,161,proactiev zorgplann jeroen bosch ziekenhuis gr...
3,162,rouw verlies modul stil gestan rouw verlies va...
4,163,acuut complex zorg modul student verdiep acut ...
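One thing to watch in the cell above: `pd.read_csv` does not parse list-like columns, so a list saved to CSV comes back as a plain string such as `"['zorg', 'jeugd']"`. If `module_tags` was stored that way, the `isinstance(x, list)` check is never true and the tags are silently dropped. The notebook does not show how that column was saved, so the helper below is only a sketch under that assumption; `tags_to_text` is a hypothetical name.

```python
import ast

def tags_to_text(value):
    """Join a tag list into one string; accepts real lists or their string form."""
    if isinstance(value, list):
        return " ".join(str(tag) for tag in value)
    # Assumption: tags were written to CSV as the repr of a Python list
    if isinstance(value, str) and value.strip().startswith("["):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return ""
        if isinstance(parsed, list):
            return " ".join(str(tag) for tag in parsed)
    # Missing values (NaN) and anything unparseable contribute no text
    return ""

print(tags_to_text("['zorg', 'jeugd']"))
```

It could replace the lambda as `df["module_tags"].apply(tags_to_text)`. A quick check such as `df["module_tags"].map(type).value_counts()` shows which case applies.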


## 2. Vectorization
Converting the text into a matrix of TF-IDF scores.

In [35]:
vectorizer = TfidfVectorizer(
    max_features=10000,      # Large vocabulary cap; dimensionality reduction is applied later
    ngram_range=(1,2),       # unigrams + bigrams
    stop_words=None,         # Stop words already removed
)

X_tfidf = vectorizer.fit_transform(big_df["text"])
X_tfidf.shape

(211, 10000)

## 3. Converting to a DataFrame for Later Use
Linking each TF-IDF row vector to the id of its entry in our dataset, so the vectors can later be compared with user input.

In [36]:
tfidf_hard_NLP = pd.DataFrame({
    "id": big_df["id"],
    "tfidf_vector": list(X_tfidf)    # each row is a 1xN sparse vector
})
tfidf_hard_NLP.head()

Unnamed: 0,id,tfidf_vector
0,159,<Compressed Sparse Row sparse matrix of dtype ...
1,160,<Compressed Sparse Row sparse matrix of dtype ...
2,161,<Compressed Sparse Row sparse matrix of dtype ...
3,162,<Compressed Sparse Row sparse matrix of dtype ...
4,163,<Compressed Sparse Row sparse matrix of dtype ...
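Storing one sparse row per DataFrame cell works, but an alternative is to keep the matrix intact and store only a mapping from id to row position. The sketch below shows that variant on toy data; `id_to_row` and `vector_for` are hypothetical names, and the texts are invented.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

toy_df = pd.DataFrame({
    "id": [159, 160, 161],
    "text": ["psychologi gedrag", "work abroad student", "zorg ziekenhuis"],
})

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_df["text"])

# Map each module id to its row position in the matrix
id_to_row = {module_id: row for row, module_id in enumerate(toy_df["id"].tolist())}

def vector_for(module_id):
    """Return the 1 x N sparse TF-IDF row that belongs to a module id."""
    row = id_to_row[module_id]
    return X_toy[row:row + 1]  # slicing keeps the result two-dimensional

vec_160 = vector_for(160)
print(vec_160.shape)
```

Keeping the matrix whole means `cosine_similarity` can be called on all rows at once, without stacking the per-row vectors again.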


Check that the ids still match their texts and that the DataFrame is built correctly.

In [None]:
i = 10  # any row index you want

# Show the ID and the first 200 characters of the text
print("ID:", big_df.iloc[i]["id"])
print("TEXT:", big_df.iloc[i]["text"][:200], "...")

# Get the tf-idf vector
vec = tfidf_hard_NLP.iloc[i]["tfidf_vector"]

# Find the term with the highest TF-IDF weight in this vector
row = vec.toarray().flatten()
top_index = row.argmax()

feature_name = vectorizer.get_feature_names_out()[top_index]
print("TOP TF-IDF WORD:", feature_name)

ID: 169
TEXT: stevig stan jeugdzorg bent stat the art kennis handelingsrepertoir methodisch toegerust werkveld jeugdzorg bred toetsvorm beroepsprestaties jeugd gezinsprofessional ontwikkeld ism jeugdwerkveld  ...
TOP TF-IDF WORD: jeugdzorg
