# Model Training

In [115]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from StudentProfile import StudentProfile
from NLP import soft_nlp, hard_nlp
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import sklearn
import nltk

# Load dataset
df = pd.read_csv('../Data/Cleaned/cleaned_dataset_hard-NLP.csv')


For the first model, we use a Bag of Words (BoW) approach with TF-IDF weighting. This converts the text data into numerical vectors based on word frequency, down-weighting words that occur in many entries. We then use cosine similarity to measure how similar two text entries are. Finally, we simulate student profiles and recommend courses based on those profiles.

The dataset has already been cleaned with heavy NLP pre-processing, so we can skip that step and proceed directly to the model training phase.
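As a minimal sketch of the idea described above, the following toy example vectorizes three short texts with TF-IDF and compares them with cosine similarity. The corpus and the `toy_` names are made up for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (illustrative only, not taken from the dataset)
toy_docs = [
    "machine learning data",
    "machine learning algorithms",
    "psychology human behavior",
]

# Turn each text into a TF-IDF weighted bag-of-words vector
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Pairwise cosine similarity between all texts (3 x 3 matrix)
toy_sim = cosine_similarity(toy_matrix)
print(toy_sim.round(2))
```

Texts that share words (the first two) get a similarity above zero, while texts without any shared words get a similarity of zero.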

## 1. Combine Relevant Text Columns

Here we combine relevant text columns into a single text field for TF-IDF vectorization.

In [116]:
big_string = (
    df["name"].fillna("") + " " +
    df["description"].fillna("") + " " +
    df["learningoutcomes"].fillna("") + " " +
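    # Note: pd.read_csv loads list-like columns as plain strings unless they are parsed
    # explicitly, so only real list values are joined here; anything else becomes "".
    df["module_tags"].apply(lambda x: " ".join(x) if isinstance(x, list) else "")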
)

# Combine id and text into a single DataFrame
big_df = pd.DataFrame({
    "id": df["id"],
    "text": big_string
})

big_df.head()

Unnamed: 0,id,text
0,159,kennismak psychologi modul ler gedrag jezelf a...
1,160,learn work abroad student kiez binn stam oplei...
2,161,proactiev zorgplann jeroen bosch ziekenhuis gr...
3,162,rouw verlies modul stil gestan rouw verlies va...
4,163,acuut complex zorg modul student verdiep acut ...


## 2. Vectorization
Converting the text into a matrix of TF-IDF scores.

In [117]:
vectorizer = TfidfVectorizer(
    max_features=1000000,    # Deliberately large cap so that no features are dropped
    ngram_range=(1,2),       # unigrams + bigrams
    stop_words=None,         # Stop words already removed
)

X_modules_tfidf = vectorizer.fit_transform(big_df["text"])
X_modules_tfidf.shape

(211, 12596)

## 3. Converting into a DataFrame for later use
Linking each row of the TF-IDF matrix to the id of its module in our dataset, for comparison with user input.

In [118]:
tfidf_hard_NLP = pd.DataFrame({
    "id": big_df["id"],
    "tfidf_vector": list(X_modules_tfidf)    # each row is a 1xN sparse vector
})
tfidf_hard_NLP.head()

Unnamed: 0,id,tfidf_vector
0,159,<Compressed Sparse Row sparse matrix of dtype ...
1,160,<Compressed Sparse Row sparse matrix of dtype ...
2,161,<Compressed Sparse Row sparse matrix of dtype ...
3,162,<Compressed Sparse Row sparse matrix of dtype ...
4,163,<Compressed Sparse Row sparse matrix of dtype ...


Check that the ids still match and that the DataFrame is built correctly.

In [119]:
i = 10  # any row index you want

# Show the ID and the first 200 characters of the text
print("ID:", big_df.iloc[i]["id"])
print("TEXT:", big_df.iloc[i]["text"][:200], "...")

# Get the tf-idf vector
vec = tfidf_hard_NLP.iloc[i]["tfidf_vector"]

# Find the term with the highest TF-IDF weight in this vector
row = vec.toarray().flatten()
top_index = row.argmax()

feature_name = vectorizer.get_feature_names_out()[top_index]
print("TOP TF-IDF WORD:", feature_name)

ID: 169
TEXT: stevig stan jeugdzorg bent stat the art kennis handelingsrepertoir methodisch toegerust werkveld jeugdzorg bred toetsvorm beroepsprestaties jeugd gezinsprofessional ontwikkeld ism jeugdwerkveld  ...
TOP TF-IDF WORD: jeugdzorg


## 4. Mock student profiles

In [120]:
student1 = StudentProfile(
    current_education="Informatica",
    interests=[
        "machine learning", 
        "web development", 
        "AI",
        "I enjoy exploring new algorithms and building web applications to solve real-world problems."
    ],
    wanted_study_credit_range=(15, 30),
    location_preference="Den Bosch",
    learning_goals=["skills", "career development"]
)

student2 = StudentProfile(
    current_education="Psychologie",
    interests=[
        "cognitieve psychologie", 
        "gedragsanalyse", 
        "onderzoeksvaardigheden",
        "I love studying human behavior and understanding the mind through experiments."
    ],
    wanted_study_credit_range=(10, 20),
    location_preference="Eindhoven",
    learning_goals=["personal development", "understanding human behavior"]
)

student3 = StudentProfile(
    current_education="Bedrijfskunde",
    interests=[
        "marketing", 
        "finance", 
        "entrepreneurship",
        "I am passionate about growing businesses and creating innovative strategies."
    ],
    wanted_study_credit_range=(20, 40),
    location_preference="Amsterdam",
    learning_goals=["career advancement", "leadership skills"]
)

student4 = StudentProfile(
    current_education="Biomedische wetenschappen",
    interests=[
        "genetica", 
        "klinisch onderzoek", 
        "bio-informatica",
        "I enjoy analyzing genetic data and contributing to medical research projects."
    ],
    wanted_study_credit_range=(15, 30),
    location_preference="Utrecht",
    learning_goals=["research skills", "scientific contribution"]
)

students = [student1, student2, student3, student4]

for s in students:
    print(s.to_text())
    print("-"*60)


machine learning web development AI I enjoy exploring new algorithms and building web applications to solve real-world problems. skills career development
------------------------------------------------------------
cognitieve psychologie gedragsanalyse onderzoeksvaardigheden I love studying human behavior and understanding the mind through experiments. personal development understanding human behavior
------------------------------------------------------------
marketing finance entrepreneurship I am passionate about growing businesses and creating innovative strategies. career advancement leadership skills
------------------------------------------------------------
genetica klinisch onderzoek bio-informatica I enjoy analyzing genetic data and contributing to medical research projects. research skills scientific contribution
------------------------------------------------------------


In [121]:
studentInterests = student4.to_text()
student_hardNLP = hard_nlp(studentInterests)
student_hardNLP


'genetica klinisch onderzoek bioinformatica enjoy analyz genet data contribut medic research project research skill scientif contribut'

## 5. Vectorizing student interests
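We transform the pre-processed student text with the TF-IDF vectorizer that was already fitted on the modules, so the student vector shares the same 12596-feature vocabulary.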

In [122]:
X_interests_tfidf = vectorizer.transform([student_hardNLP])
X_interests_tfidf.shape

(1, 12596)

## 6. Dimensionality Reduction
We will perform dimensionality reduction on the vectorized student and module data. We aim for 500 final dimensions as a starting point that we expect to give reasonably good results. This can of course change during our optimization stage.
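The text above does not name a reduction method. As one possible sketch, assuming scikit-learn's `TruncatedSVD` (which works directly on sparse TF-IDF matrices), the reduction could be fitted on the modules and then applied to the student vector. The toy texts and the small `n_components` are assumptions for illustration only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy "module" texts (illustrative only)
toy_modules = [
    "genet data research project",
    "medic research skill",
    "marketing finance strategi",
    "finance leadership skill",
]
toy_vectorizer = TfidfVectorizer()
toy_module_matrix = toy_vectorizer.fit_transform(toy_modules)

# Hypothetical value for the toy data; the text above targets 500
n_components = 2
svd = TruncatedSVD(n_components=n_components, random_state=42)

# Fit the reduction on the modules only, then reuse it for the student
toy_modules_reduced = svd.fit_transform(toy_module_matrix)
toy_student_vec = toy_vectorizer.transform(["genet medic research"])
toy_student_reduced = svd.transform(toy_student_vec)

print(toy_modules_reduced.shape, toy_student_reduced.shape)
```

Fitting on the modules and only transforming the student keeps both in the same reduced space. When choosing the number of dimensions, it may be worth keeping in mind that a matrix with 211 rows has rank at most 211, so components beyond that would carry no additional information.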