# 1. Imports & Configuration

In [None]:
from pathlib import Path
from typing import Tuple, Dict, Any

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math

pd.set_option("display.max_colwidth", 200)


# 2. Loading in the dataset

In [None]:
file_path = Path("../Data/Raw/Uitgebreide_VKM_dataset.csv")
df = pd.read_csv(file_path, low_memory=False)
df.head()

# 3. Overview dataset and data quality

In this section we explore the structure, data types, and overall data quality of the dataset. We look at the shape, column types, missing values, and basic numeric statistics.


In [None]:
# Basic info
print("Shape:", df.shape)
print("\nColumns and dtypes:")
print(df.dtypes)

# Missing values
miss = df.isnull().mean() * 100
print("\nMissing %:")
print(miss.sort_values(ascending=False))

# Numeric summary
num = df.select_dtypes(include=[np.number])
print("\nNumeric summary:")
print(num.describe().T)

# Categorical summary for selected columns
for col in ['name', 'shortdescription', 'description', 'content', 'location', 'level', 'learningoutcomes', 'module_tags', 'start_date']:
    print(f"\n--- {col} ---")
    vc = df[col].value_counts(dropna=False).head(10)
    print("Top values:\n", vc)
    print("Unique (non-null):", df[col].nunique(dropna=True))

# Date parsing for 'start_date' column
parsed_dates = pd.to_datetime(df['start_date'], errors='coerce')
print("\nstart_date parsing:")
print("Nulls after parsing:", parsed_dates.isna().sum())
print("Top parsed dates:")
print(parsed_dates.value_counts().head())

What we found from this:
- Most columns are of type `object` and contain textual data. This is also the kind of data most useful for the content-based recommender system we plan to build.
- The color-coded columns only have 2 rows containing data. Short description and learning outcomes have some missing values that we will look into more closely during data cleaning.
- Popularity score ranges from 10 to 500. Difficulty ranges from 1 to 5. Interest scores seem to range from 0 to 1. These could all be normalized to a value between 0 and 1 for consistency.
- There are some rows containing duplicate data, so duplicates need to be removed during data cleaning.
- Some tags are empty or filled with 'ntb' (short for "nader te bepalen", i.e. to be determined).
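
The normalization idea from the list above could look roughly like the sketch below. It is a minimal min-max scaling example on toy data; the column names `popularity_score` and `difficulty` are assumptions for illustration and may not match the real column names in the dataset.

```python
import pandas as pd

def min_max_scale(series: pd.Series) -> pd.Series:
    """Rescale a numeric series linearly to the range [0, 1]."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Constant column: avoid division by zero and map everything to 0.0
        return pd.Series(0.0, index=series.index)
    return (series - lo) / (hi - lo)

# Toy data with hypothetical column names (not taken from the dataset)
toy = pd.DataFrame({
    "popularity_score": [10, 255, 500],
    "difficulty": [1, 3, 5],
})

# Apply the scaling column by column
scaled = toy.apply(min_max_scale)
print(scaled)
```

Min-max scaling keeps the relative spacing of the values, but it is sensitive to outliers. Whether that matters here depends on the actual distributions.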

# 4. Numeric Values


In [None]:
# Histogram and boxplot for every numeric column
num = df.select_dtypes(include=[np.number]).copy()

for col in num.columns:
    col_s = num[col].dropna()
    if col_s.empty:
        # Skip columns that contain only NaN: there is nothing to plot
        continue

    mean = col_s.mean()
    min_val = col_s.min()
    max_val = col_s.max()

    # Left: histogram with KDE, right: boxplot
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(col_s, ax=axes[0], kde=True)
    axes[0].set_title(f"{col} (mean={mean:.2f})")
    sns.boxplot(x=col_s, ax=axes[1])
    axes[1].set_title(f"Boxplot (min={min_val:.2f}, max={max_val:.2f})")
    plt.tight_layout()
    plt.show()


What we found (besides the findings already mentioned earlier):
- Studycredit only takes the values 15 or 30 points.
- One contactId stands out from the rest.

# 5. Categorical Values

In [None]:
# Top-10 (or fewer, if a column has fewer distinct values) bar charts for categorical columns

# Long free-text columns (shortdescription, description, content) are left out on purpose:
# they are not useful for bar charts. `name` is kept to find potential duplicates.
exclude_text_cols = ['shortdescription', 'description', 'content']
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
cat_cols = [c for c in cat_cols if c not in exclude_text_cols]

top_n = 10

# Shorten the x-axis labels; otherwise the figures become too large.
def _shorten(s, n=50):
    s = str(s)
    return s if len(s) <= n else s[:n - 3] + '...'

for col in cat_cols:
    vc = df[col].fillna('<<NA>>').value_counts()
    top10 = vc.head(top_n)
    labels = [_shorten(x, 50) for x in top10.index]

    fig, ax = plt.subplots(figsize=(6, 3))
    colors = plt.cm.viridis(np.linspace(0, 1, len(top10)))

    top10.plot(kind='bar', color=colors, ax=ax)
    ax.set_xticklabels(labels, rotation=45, ha='right', fontsize=9)
    ax.set_title(f'{col} — top {len(top10)}')
    ax.set_ylabel('count')
    plt.show()


What we found (besides the findings already mentioned earlier):
- Duplicate module names were found; these need to be removed during data cleanup.
- Some location values combine two places. It is better to separate these and assign the module to both locations, i.e. make Location a list.
- Learningoutcomes has several rows with no information. Besides rows filled with NaN, there are rows with placeholders such as 'nog te bepalen' ("to be determined") and 'nog te formuleren' ("to be formulated"). These will have to be cleaned. A tricky variant is 'Nog te bepalen. Bijvoorbeeld: bla bla bla....' ("To be determined. For example: ..."), where the placeholder is followed by extra text.
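
The two cleaning steps described above (flagging placeholder texts and splitting combined locations) might be sketched as follows. This is only an illustration on toy data: the helper names `is_placeholder` and `split_locations`, the list of placeholder phrases, and the assumed separators (`,`, `/`, and the Dutch word "en") are all assumptions that would need to be checked against the real values.

```python
import re
import pandas as pd

# Placeholder phrases mentioned in the findings; the real data may contain more variants
PLACEHOLDER_RE = re.compile(r"^\s*(ntb|nog te bepalen|nog te formuleren)\b", re.IGNORECASE)

def is_placeholder(value) -> bool:
    """True for NaN/None, empty strings, and text that starts with a known placeholder."""
    if pd.isna(value):
        return True
    text = str(value).strip()
    return text == "" or PLACEHOLDER_RE.match(text) is not None

def split_locations(value, sep=r"\s*(?:,|/|\ben\b)\s*"):
    """Split a combined location string into a list of places (separators are assumed)."""
    if pd.isna(value):
        return []
    return [part for part in re.split(sep, str(value).strip()) if part]

# Toy rows, invented for illustration
toy = pd.DataFrame({
    "learningoutcomes": ["Nog te bepalen. Bijvoorbeeld: ...", "Student kan X toepassen", None],
    "location": ["Stad A en Stad B", "Stad C", None],
})

toy["outcome_missing"] = toy["learningoutcomes"].apply(is_placeholder)
toy["location_list"] = toy["location"].apply(split_locations)
print(toy[["outcome_missing", "location_list"]])
```

Matching on the start of the text also catches the tricky "Nog te bepalen. Bijvoorbeeld: ..." variant. The downside is that a genuine learning outcome that happens to begin with one of these phrases would be flagged as well.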

# 6. Correlation heatmap


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix
corr = df.corr(numeric_only=True)

# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Draw the heatmap
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", square=True, cbar_kws={"shrink": .8})

plt.title("Correlation Heatmap")
plt.show()

# 7. Text Analysis (content & description)

In [None]:
df["exact_same"] = df["content"] == df["description"]
df["exact_same"].value_counts()


In [None]:
df.loc[~df["exact_same"], ["content", "description", "shortdescription", "module_tags"]]

In [None]:
import pandas as pd
import numpy as np
import re
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
import matplotlib.pyplot as plt

# --------------------------
# Text cleaning function
# --------------------------
def normalize(text):
    if text is None or (isinstance(text, float) and np.isnan(text)):
        return ""
    text = str(text).lower()
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# --------------------------
# Load model
# --------------------------
print("Loading sentence transformer model...")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print("Model loaded!")

# --------------------------
# Function to compute similarity and visualize
# --------------------------
def explore_similarity(df, col1, col2, threshold=0.8):
    print("Filling NaN values...")
    df[col1] = df[col1].fillna("")
    df[col2] = df[col2].fillna("")
    
    # normalize
    print(f"Normalizing {col1} and {col2}...")
    df[f"{col1}_norm"] = df[col1].apply(normalize)
    df[f"{col2}_norm"] = df[col2].apply(normalize)

    # embeddings (encode each column in one batch, which is much faster than row by row)
    print("Computing embeddings...")
    emb1 = model.encode(df[f"{col1}_norm"].tolist())
    emb2 = model.encode(df[f"{col2}_norm"].tolist())

    # cosine similarity between the two embeddings of each row
    print("Calculating cosine similarity...")
    df["similarity"] = [
        cosine_similarity([a], [b])[0][0] for a, b in zip(emb1, emb2)
    ]

    # filter low similarity rows
    low_sim = df[df["similarity"] < threshold].copy()
    display_cols = [col1, col2, "similarity"]
    low_sim_display = low_sim[display_cols]

    # simple coloring function
    def highlight(val):
        if val < 0.5:
            return "background-color: red"
        elif val < 0.7:
            return "background-color: orange"
        else:
            return "background-color: yellow"

    print(f"Found {len(low_sim)} rows with similarity < {threshold}")
    return low_sim_display.style.applymap(highlight, subset=["similarity"])


In [None]:
# compare content vs description
explore_similarity(df, "content", "description", threshold=0.99)

In [None]:
# Check whether one of the two columns is empty while the other is not
empty_short = df["shortdescription"].isna() | (df["shortdescription"] == "")
empty_tags = df["module_tags"].isna() | (df["module_tags"] == "")

# XOR: True only when exactly one of the two columns is empty
df["empty_or_mismatch"] = empty_short ^ empty_tags
print("Rows where one column is empty and the other is not:")
print(df["empty_or_mismatch"].sum())

explore_similarity(df, "shortdescription", "module_tags", threshold=0.99)


We analyzed the similarity between pairs of text columns, focusing on content versus description and shortdescription versus module_tags. Comparing content and description, we found 11 rows with a cosine similarity below 0.8, which shows that the content of some modules does not really reflect their full description. The lowest similarities were for modules such as Oncologie (0.296) and Teaching English Abroad (0.219), where the two fields differ substantially. For shortdescription versus module_tags, 6 rows had a similarity below 0.88, meaning that some short descriptions or tags do not fully match each other. Overall, most rows had decent similarity, but a few stood out as needing attention, so we know where we might want to refine or standardize the data.
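
To make the similarity scores above easier to interpret, here is a small sketch of how cosine similarity behaves. It uses made-up 3-dimensional vectors instead of real sentence embeddings, and the helper `cosine_sim` is a hypothetical stand-in for the scikit-learn function used in the notebook.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        # Convention chosen here: an all-zero vector has similarity 0 with everything
        return 0.0
    return float(np.dot(a, b) / denom)

# Toy vectors, made up for illustration (real embeddings have hundreds of dimensions)
same_direction = cosine_sim(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
orthogonal = cosine_sim(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))

print(round(same_direction, 3), round(orthogonal, 3))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0. A value such as 0.2 or 0.3 therefore indicates that two texts share very little meaning according to the embedding model.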

## Additional Findings

- The numeric columns show essentially no meaningful correlations with each other.  
- The high correlation between `ID` and `Contact-ID` can be ignored, as it has no practical significance.  
- Color values are not useful and can be disregarded.  
- **Shortdescription vs. module_tags similarity**  
  - Most rows show high similarity.  
  - A few outliers indicate where refinement or standardization may be needed.  
  - Not all modules have `shortdescription` filled in, so this field will likely be dropped during data cleaning.  
- **Content vs. description similarity**  
  - Most rows show high similarity, with some outliers needing review.  
  - Specific modules may require refinement of `content` or `description`.  
  - Ultimately, the `content` field will likely be dropped as it is inconsistently filled.


# 8. Thoughts for final model
Based on our exploration of the dataset, a traditional prediction model (classification or regression) is not a good fit. The available features show no significant correlations with a possible target variable (title or id), so there is nothing meaningful to predict with standard methods. This is further complicated by the fact that most of the data is text, and the target itself represents content rather than a label or a quantity.

Because of this, a recommendation system is a much more suitable solution. Instead of predicting a specific value, a recommender focuses on similarity between items or user preferences, which matches the structure of our dataset.

Therefore, we will build a content-based recommender system, which uses the textual features of the dataset to recommend items with similar content. We will use techniques such as TF-IDF vectorization, embeddings, and cosine similarity to measure the similarity between user profiles and the items in the dataset. This approach lets us provide personalized recommendations without relying on traditional predictive modeling.
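
As a rough illustration of this approach, the sketch below ranks a few toy modules against a free-text user profile using TF-IDF and cosine similarity from scikit-learn. The module names, texts, and user profile are invented for the example; the final model may combine columns, use embeddings, or weight features differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy module texts (invented for illustration, not taken from the dataset)
modules = {
    "Data Science Basics": "python statistics machine learning data analysis",
    "Web Development": "html css javascript frontend design",
    "Applied Machine Learning": "machine learning python models data",
}
names = list(modules)

# Fit TF-IDF on the module texts, then project the user profile into the same space
vectorizer = TfidfVectorizer()
module_vecs = vectorizer.fit_transform(list(modules.values()))
user_profile = "I like python and machine learning"
user_vec = vectorizer.transform([user_profile])

# Rank modules by cosine similarity to the user profile (highest first)
scores = cosine_similarity(user_vec, module_vecs)[0]
ranking = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Words in the user profile that do not occur in any module text are simply ignored by `transform`, so a module without overlapping vocabulary gets a score of 0. Embedding-based similarity would also capture related wording that TF-IDF misses.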

But first, we are going to clean the dataset. This is done in the next notebook.