# **Predicting and understanding viewer engagement with educational videos**

With the accelerating popularity of online educational experiences, the role of online lectures and other educational videos continues to increase in scope and importance. Open access educational repositories such as videolectures.net, as well as Massive Open Online Courses (MOOCs) on platforms like Coursera, have put many thousands of lectures and tutorials within reach of millions of people around the world. Yet this impressive volume of content has also created a challenge: how to find, filter, and match these videos with learners.

One critical property of a video is engagement: how interesting or "engaging" it is for viewers, so that they decide to keep watching. Engagement is critical for learning, whether the instruction comes from a video or any other source. There are many ways to define engagement with video, but one common approach is to estimate it by measuring how much of the video a user watches. If a video is not interesting and does not engage a viewer, they will typically abandon it quickly, for example after watching only 5 or 10% of it.

A first step towards providing the best-matching educational content is to understand which features of educational material make it engaging for learners in general. This is where predictive modeling can be applied, via supervised machine learning. For this assignment, your task is to predict how engaging an educational video is likely to be for viewers, based on a set of features extracted from the video's transcript, audio track, hosting site, and other sources.

# **About the Dataset**

We extracted training and test datasets of educational video features from the VLE Dataset put together by researcher Sahan Bulathwela at University College London.

We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single educational video and includes information about diverse properties of the video content, as described further below. The target variable is engagement, which is defined as True if the median percentage of the video watched across all viewers was at least 30%, and False otherwise.
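To make the target definition concrete, here is a minimal sketch of how such a label could be derived from raw viewing logs. The `views` table, its column names, and its values are hypothetical and are not part of train.csv or test.csv; only the 30% threshold comes from the definition above.

```python
import pandas as pd

# Hypothetical per-viewer watch log: one row per (video, viewer) pair.
# These rows are made up for illustration only.
views = pd.DataFrame({
    "video_id": [1, 1, 1, 2, 2, 2],
    "fraction_watched": [0.10, 0.45, 0.60, 0.05, 0.08, 0.90],
})

ENGAGEMENT_THRESHOLD = 0.30  # the 30% cutoff from the target definition

# Median fraction watched per video, then compare against the cutoff.
median_watched = views.groupby("video_id")["fraction_watched"].median()
engagement = median_watched >= ENGAGEMENT_THRESHOLD

print(engagement)
```

Using the median rather than the mean means a single viewer who watches 90% of video 2 does not make it count as engaging when most viewers leave early.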

# **Evaluation**

Predictions will be given as the probability that the corresponding video will be engaging to learners. The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).
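As a rough illustration of the metric, the sketch below computes AUC on a tiny made-up example with scikit-learn's `roc_auc_score`, then recomputes it as the fraction of (engaging, non-engaging) pairs in which the engaging video receives the higher score. The labels and scores are invented for illustration, and the pairwise shortcut shown here assumes there are no tied scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# Equivalent ranking view: share of positive/negative pairs ordered correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(f"roc_auc_score: {auc:.2f}, pairwise ranking: {pairwise:.2f}")
```

Because AUC depends only on how the predictions are ordered, it rewards models that rank engaging videos above non-engaging ones, regardless of how well calibrated the probabilities are.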

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

def engagement_model():
    train = pd.read_csv("assets/train.csv")
    test = pd.read_csv("assets/test.csv")
    features = [
        "title_word_count", "document_entropy", "freshness",
        "easiness", "fraction_stopword_presence", "speaker_speed", "silent_period_rate"
    ]
    X = train[features]
    y = train["engagement"].astype(int)  # Convert to binary label
    X_test = test[features]

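    # Split first so the scaler statistics come from the training rows only.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    # Scaling is not required for tree-based models; it is kept for consistency.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    X_test_scaled = scaler.transform(X_test)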

    model = RandomForestClassifier(random_state=42)
    param_grid = {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5, 10]
    }
    grid_search = GridSearchCV(model, param_grid, scoring="roc_auc", cv=3, n_jobs=-1)
    grid_search.fit(X_train, y_train)

    best_model = grid_search.best_estimator_
    y_val_pred = best_model.predict_proba(X_val)[:, 1]
    auc_score = roc_auc_score(y_val, y_val_pred)
    print(f"Validation AUC: {auc_score:.4f}")

    y_test_pred = best_model.predict_proba(X_test_scaled)[:, 1]

    rec = pd.Series(y_test_pred, index=test["id"], name="engagement")
    return rec

stu_ans = engagement_model()

In [None]:
stu_ans = engagement_model()
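# The result must be a pandas Series with one probability per test video.
assert isinstance(stu_ans, pd.Series)
assert len(stu_ans) == 2309
# The index must hold integer video ids.
assert np.issubdtype(stu_ans.index.dtype, np.integer)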

Validation AUC: 0.8971