# Predicting and understanding viewer engagement with educational videos 

## About the prediction problem

One critical property of a video is engagement: how interesting or "engaging" it is for viewers, so that they decide to keep watching. One common approach to estimating engagement is to measure how much of the video a user watches. If the video is not interesting and does not engage a viewer, they will typically abandon it quickly, e.g. after watching only 5 or 10% of the total.

The exercise consists of predicting which educational videos are likely to be engaging for viewers, based on a set of features extracted from each video's transcript, audio track, hosting site, and other sources.

This problem is a useful exercise for several reasons:

* It combines a variety of features derived from a rich set of resources connected to the original data;
* The manageable dataset size means the dataset and supervised models for it can be easily explored on a wide variety of computing platforms;
* Predicting popularity or engagement for a media item, especially combined with understanding which features contribute to its success with viewers, is a fun problem and also a practical, representative application of machine learning in a number of business and educational sectors.


## About the dataset
The datasets were put together by researcher Sahan Bulathwela at University College London.

The target variable is `engagement`, which is defined as True if the median percentage of the video watched across all viewers was at least 30%, and False otherwise.
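
As a rough illustration of that definition, the sketch below derives the label from a per-viewer watch log. The `watch_log` table, its column names, and its values are hypothetical; the dataset itself ships only the final `engagement` label, not the raw viewing data.

```python
import pandas as pd

# Hypothetical per-viewer watch log: one row per (video, viewer) pair.
# `fraction_watched` is the share of the video that viewer watched (0.0 to 1.0).
watch_log = pd.DataFrame({
    "video_id": [1, 1, 1, 2, 2, 2, 2],
    "fraction_watched": [0.05, 0.40, 0.90, 0.05, 0.10, 0.20, 0.95],
})

ENGAGEMENT_THRESHOLD = 0.30  # the 30% cut-off from the definition above

# Median watch fraction per video, then threshold it to get the boolean label.
median_watched = watch_log.groupby("video_id")["fraction_watched"].median()
engagement = median_watched >= ENGAGEMENT_THRESHOLD
print(engagement)
```

Using the median rather than the mean keeps a single viewer who watches almost everything (the 0.95 row for video 2) from making an otherwise quickly abandoned video look engaging.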


**File descriptions**

    assets/train.csv
    assets/test.csv

**Data fields**

train.csv & test.csv:

    title_word_count - The number of words in the title of the video.
    
    document_entropy - A score indicating how varied the topics covered in the video are, based on the transcript. Videos with smaller entropy scores will tend to be more cohesive and more focused on a single topic.
    
    freshness - The number of days elapsed between 01/01/1970 and the lecture's publication date. Videos that are more recent will have higher freshness values.
    
    easiness - A text difficulty measure applied to the transcript. A lower score indicates more complex language used by the presenter.
    
    fraction_stopword_presence - A stopword is a very common word like 'the' or 'and'. This feature computes the fraction of all words that are stopwords in the video lecture transcript.
    
    speaker_speed - The average speaking rate in words per minute of the presenter in the video.
    
    silent_period_rate - The fraction of time in the lecture video that is silence (no speaking).
    
train.csv only:
    
    engagement - Target label for training. True if learners watched a substantial portion of the video (see description), or False otherwise.
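
To make a few of these definitions concrete, here is a minimal sketch of how `freshness`, `fraction_stopword_presence`, and `speaker_speed` could be computed from a transcript. The helper functions, the tiny stopword list, and the whitespace tokenization are assumptions for illustration; the exact procedure used to build the dataset is not described here.

```python
from datetime import date

# Tiny illustrative stopword list (an assumption; the real list is not given).
STOPWORDS = {"the", "and", "a", "of", "to", "is", "in"}


def freshness(published: date) -> int:
    """Days elapsed between 1970-01-01 and the publication date."""
    return (published - date(1970, 1, 1)).days


def fraction_stopword_presence(transcript: str) -> float:
    """Fraction of whitespace-separated words that are stopwords."""
    words = transcript.lower().split()
    if not words:
        return 0.0
    return sum(word in STOPWORDS for word in words) / len(words)


def speaker_speed(transcript: str, duration_minutes: float) -> float:
    """Average speaking rate in words per minute."""
    return len(transcript.split()) / duration_minutes


transcript = "the cell is the basic unit of life"
print(freshness(date(2020, 1, 1)))
print(fraction_stopword_presence(transcript))
print(speaker_speed(transcript, duration_minutes=0.5))
```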
    

## Evaluation


Evaluation metric for this assignment: Area Under the ROC Curve (AUC). 
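
For intuition, AUC can be read as the probability that a randomly chosen engaging video receives a higher score than a randomly chosen non-engaging one. The sketch below compares scikit-learn's `roc_auc_score` with a direct pairwise count on made-up labels and scores; `pairwise_auc` is a hypothetical helper written only for this comparison.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for four videos.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)


def pairwise_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


print(auc, pairwise_auc(y_true, y_score))
```

Because AUC depends only on the ranking of the scores, the model needs to output well-ordered probabilities rather than hard True/False predictions.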

The main function, `engagement_model`, returns a pandas Series of length 2309 whose values are the probabilities that each corresponding video from `assets/test.csv` will be engaging (according to a model learned from the `engagement` label in the training set), and whose index is the video's `id` field.
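
A quick sanity check of the returned object can catch shape or indexing mistakes early. The sketch below is one possible check, not part of the assignment: `check_submission` is a hypothetical helper, and the random probabilities stand in for real model output.

```python
import numpy as np
import pandas as pd


def check_submission(result, expected_length=2309):
    """Return a list of problems found in a prediction Series (empty if none)."""
    if not isinstance(result, pd.Series):
        return ["result is not a pandas Series"]
    problems = []
    if len(result) != expected_length:
        problems.append(f"expected {expected_length} rows, got {len(result)}")
    if result.index.name != "id":
        problems.append("index should be the video id")
    if not result.index.is_unique:
        problems.append("duplicate ids in the index")
    if result.isna().any():
        problems.append("missing probabilities")
    if not result.dropna().between(0.0, 1.0).all():
        problems.append("probabilities outside [0, 1]")
    return problems


# Stand-in predictions: random values in [0, 1) on an `id` index of the right length.
rng = np.random.default_rng(0)
ids = pd.RangeIndex(9240, 9240 + 2309, name="id")
fake_result = pd.Series(rng.random(2309), index=ids)
print(check_submission(fake_result))
```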




In [2]:
import warnings
warnings.filterwarnings("ignore")

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def grid_search(clf, grid_values, X_train, y_train):
    """Run a 5-fold cross-validated grid search that maximizes ROC AUC and print the best result."""
    grid_clf_auc = GridSearchCV(clf, param_grid=grid_values, scoring='roc_auc', cv=5)
    grid_clf_auc.fit(X_train, y_train)

    print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
    # best_score_ is the mean cross-validated AUC of the best parameter setting
    print('Grid best score (AUC): ', grid_clf_auc.best_score_)


def engagement_model():
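    # Load the training data (indexed by id); the last column is the target
    df = pd.read_csv('assets/train.csv', index_col=0)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    # Hold out part of the training data as a validation set (default split: 75% / 25%)
    X_train, X_val, y_train, y_val = train_test_split(X, y)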
    
    test = pd.read_csv('assets/test.csv', index_col=0)
    
    
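    # Grid search over decision-tree hyperparameters.
    # 'auto' is omitted from max_features: for DecisionTreeClassifier it was an
    # alias of 'sqrt' and is no longer accepted by recent scikit-learn releases.
    clf = DecisionTreeClassifier()
    grid_values = {'max_features': ['sqrt', 'log2'],
                   'ccp_alpha': [0.1, 0.01, 0.001],
                   'max_depth': [5, 6, 7, 8, 9],
                   'criterion': ['gini', 'entropy']}
    grid_search(clf, grid_values, X_train, y_train)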
    
    
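    # Classifier with hyperparameters taken from a previous grid search
    # (the best parameters can vary between runs because the split and the tree are randomized)
    clf = DecisionTreeClassifier(ccp_alpha=0.001, criterion='entropy', max_depth=5, max_features='sqrt').fit(X_train, y_train)
    # This AUC is measured on the held-out validation split; test.csv has no labels
    print('AUC SCORE (test): {0:.3f}'.format(roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])))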
    
    # Predicted probability of the positive class (engaging) for each test video, indexed by id
    data = clf.predict_proba(test)[:, 1]
    result = pd.Series(data, index=test.index)
    
    return result
    
    
   
engagement_model()

Grid best parameter (max. AUC):  {'ccp_alpha': 0.001, 'criterion': 'entropy', 'max_depth': 7, 'max_features': 'log2'}
Grid best score (AUC):  0.8463674026967425
AUC SCORE (test): 0.810


id
9240     0.009852
9241     0.084878
9242     0.033992
9243     0.904762
9244     0.033992
           ...   
11544    0.033992
11545    0.009852
11546    0.009852
11547    0.746154
11548    0.033992
Length: 2309, dtype: float64
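
The hyperparameters hard-coded in `engagement_model` (`max_depth=5`, `max_features='sqrt'`) differ from the ones the grid search printed (`max_depth` 7, `log2`). One way to avoid that kind of mismatch is to reuse the estimator that `GridSearchCV` refits with the best parameters, and to pass `random_state` so that repeated runs agree. The sketch below shows the idea on synthetic data from `make_classification`, used here only as a stand-in for `assets/train.csv`; the reduced grid is also an assumption to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data: 7 numeric features, binary target.
X, y = make_classification(n_samples=600, n_features=7, n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

grid_values = {
    "max_features": ["sqrt", "log2"],
    "ccp_alpha": [0.01, 0.001],
    "max_depth": [5, 7],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=grid_values,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already retrained on all of
# X_train using best_params_, so nothing has to be copied by hand.
best_clf = search.best_estimator_
val_auc = roc_auc_score(y_val, best_clf.predict_proba(X_val)[:, 1])
print(search.best_params_, round(val_auc, 3))
```

In the notebook, the same pattern would mean returning the fitted search object from `grid_search` and calling `predict_proba` on its `best_estimator_` for both the validation split and the test set.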