# An Experimental Machine Learning Study of High Achievers at the Open University using the Learning Analytics Data Set
Key findings:
- Temporal moments of learning engagement (variance, skewness and kurtosis), which capture for example how evenly engagement with the online learning environment is spread across the module, were the best predictors of receiving a distinction. These were followed by certain types of engagement (e.g. with quizzes and forums). Demographic characteristics were less important than engagement features.
- Having a previous degree, a factor I expected to be important, is an extremely poor predictor of high achievement.


## Introduction

The Open University is designed for those who wish to study for a degree but whose circumstances do not allow for the typical on-campus school leaver experience. Many students are significantly older than typical undergraduates, and many study online and part-time, receiving a degree in six to eight years.

In 2017 the University made public a large database. It contained information on student enrolment and demographics, but also kept track of all times a student interacted with the virtual learning environment (VLE).


# Decision Trees and XGBoost
A 'decision tree' in a machine learning context is a set of nodes which 'branch' based on a 'split', where each split is chosen to maximise the information gain from dividing the data within an existing node. Decision trees can be grouped into a 'tree ensemble'. The simplest example is the Random Forest algorithm, which trains a set of decision trees on samples drawn with replacement and uses a voting system to make predictions.
'XGBoost' is a gradient boosting method: it builds trees one after another, with each new tree fitted to correct the errors of the ensemble so far, so the examples the current model gets wrong have the most influence on the next tree. It is widely regarded as robust, and it performs well on the Learning Analytics Data Set in particular.
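
To make the idea of 'information gain' concrete, here is a minimal sketch using only the Python standard library. The toy labels and the two candidate splits are hypothetical, and the entropy-based criterion is the textbook one for classification trees; XGBoost itself scores splits with a gradient-based gain, so this illustrates the general principle and not its internals.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its two children."""
    total = len(parent)
    children = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - children


# Hypothetical toy node: 1 = distinction, 0 = no distinction.
parent = [1, 1, 1, 0, 0, 0, 0, 0]

# Candidate split A separates the classes perfectly; candidate split B barely helps.
gain_a = information_gain(parent, [1, 1, 1], [0, 0, 0, 0, 0])
gain_b = information_gain(parent, [1, 0, 0, 0], [1, 1, 0, 0])

print(f"split A: {gain_a:.3f} bits, split B: {gain_b:.3f} bits")
```

A tree-growing algorithm would evaluate many candidate splits like these at each node and keep the one with the largest gain.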


## The focus of the project
A lot of data on unequal outcomes focusses on average scores (including Nottingham’s own dashboards). Work with the OULA database schema itself also tends to focus on predicting 'performance' in general, or 'dropping out'. I wanted to focus on something new: the highest achievers and, specifically, whether having a previous degree makes a substantial difference to the probability of achieving highly. I have focussed on:
-	Those who received a distinction in the module the first time they took it. This is the 'machine learning' portion of the project.
-	The top ten scorers in the tutor-marked and computer-marked portions of a module. This is a more exploratory portion of the project, and focuses particularly on the impact of previous education.

At the advice of the Open University itself, I analyse the February and October presentations separately as their content and presentation structure often differ from each other.



## Database Creation
First, I created the necessary tables in SQL. I mostly followed the recommendations on the University website, but where these did not make logical sense I used my own judgement.


## Data Load
Next, I loaded the data into SQL.


## Data Transformation
As the data is spread over seven tables, I had to transform the learning engagement clicks data into a usable format for XGBoost. I used a PL/pgSQL loop to get the sum of clicks for each type of engagement. Not all of these types of engagement are present in every module. Later on, in Python, I automatically filter this information so that only resources available for the module in question are considered as features.
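
The per-module filtering step could look something like the sketch below. It is an illustration only and not the project's actual code: the rows are hypothetical, the function name `active_click_features` is made up, and it assumes that a click column whose value is zero for every student in a module means the resource was not available there.

```python
def active_click_features(rows, prefix="total_clicks_"):
    """Return the click columns that at least one student in `rows` actually used."""
    columns = [name for name in rows[0] if name.startswith(prefix)]
    return [name for name in columns if any(row[name] > 0 for row in rows)]


# Hypothetical rows for one module presentation. The column names follow the
# total_clicks_<activity_type> pattern; the values are invented.
module_rows = [
    {"id_student": 1, "total_clicks_quiz": 11, "total_clicks_forumng": 3, "total_clicks_glossary": 0},
    {"id_student": 2, "total_clicks_quiz": 0, "total_clicks_forumng": 7, "total_clicks_glossary": 0},
]

features = active_click_features(module_rows)
print(features)
```

Here the glossary column is dropped because nobody in the module clicked on it, while the quiz and forum columns are kept as features.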

I have designed a number of features relating to engagement with the Virtual Learning Environment. To start with, I have merged two tables to create 'total clicks' variables for each activity type. This is essentially the total number of times, during a student's attempt at a particular module, that the student clicks on e.g. quizzes. For example, student '111568' might have clicked on resource 5540 (quizzes) a total of 11 times during their attempt at module AAA. You will notice the data is essentially entirely anonymised, even the module names!





The data frame also contains the date at which the student clicked on the resource, relative to when the module started. This allows for the number of clicks of each type to be analysed, as well as the time dynamics. I realised while brainstorming that there's a pretty good theoretical rationale for all statistical 'moments' when it comes to the day on which resources were clicked. This isn't based on a literature review of any kind (I don't consider this a hugely academic project, far more a personal one) but my thinking was as follows:
- The variance (the second central moment) is of course important. I expect that a student who spreads their study across the year is likely to do better than one who does not.
- The skewness (based on the third moment) should be too. A student whose studying is all done at the start, or who leaves their studying until the last few days, is also likely not to perform so well.
- The kurtosis (based on the fourth moment) looks at the thickness of the tails specifically. Someone whose studying is only done at the tails might also be less likely to excel, in a way that won't be picked up by the second or third moments.
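
As a rough illustration of the three moments above, the sketch below computes them for the day on which clicks happened, weighting each day by its number of clicks. The weighting, the use of excess kurtosis (kurtosis minus 3) and the two example students are all assumptions made for the illustration; the project's own feature code may define these differently.

```python
import math


def engagement_moments(clicks_by_day):
    """Click-weighted variance, skewness and excess kurtosis of the study day.

    `clicks_by_day` maps a day (relative to the module start) to the number of
    clicks made on that day. Assumes at least one click in total.
    """
    total = sum(clicks_by_day.values())
    mean = sum(day * n for day, n in clicks_by_day.items()) / total

    def central(k):
        # k-th central moment of the click-weighted distribution of days
        return sum(n * (day - mean) ** k for day, n in clicks_by_day.items()) / total

    variance = central(2)
    if variance == 0:  # all clicks on a single day: higher moments are undefined
        return variance, math.nan, math.nan
    skewness = central(3) / variance ** 1.5
    excess_kurtosis = central(4) / variance ** 2 - 3
    return variance, skewness, excess_kurtosis


# Hypothetical students: one studies steadily, one crams near the end.
steady = {day: 5 for day in range(0, 200, 10)}
crammer = {0: 2, 150: 5, 180: 40, 190: 60}

steady_var, steady_skew, steady_kurt = engagement_moments(steady)
cram_var, cram_skew, cram_kurt = engagement_moments(crammer)

print(f"steady:  variance={steady_var:.0f}, skewness={steady_skew:.2f}")
print(f"crammer: variance={cram_var:.0f}, skewness={cram_skew:.2f}")
```

The steady student has the larger variance and essentially no skew, while the crammer's clicks are bunched late in the module, which shows up as a smaller variance and a negative skew.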

```sql
-- One row per student per module presentation (demographics and final result)
CREATE TABLE studentInfo (
	code_module VARCHAR(45),
	code_presentation VARCHAR(45),
	id_student INT,
	gender VARCHAR(3),
	imd_band VARCHAR(16),
	highest_education VARCHAR(45),
	age_band VARCHAR(16),
	number_of_prev_attempts INT,
	studied_credits INT,
	region VARCHAR(45),
	disability VARCHAR(3),
	final_result VARCHAR(45),
	PRIMARY KEY (id_student, code_module, code_presentation),
	FOREIGN KEY (code_module, code_presentation)
		REFERENCES courses (code_module, code_presentation)
);
```

```sql
-- Loops through each type of learning resource and finds the sum of clicks for each student on that resource
DO $$
DECLARE
	activity RECORD;
	col_name VARCHAR(45);
BEGIN
	FOR activity IN
		SELECT DISTINCT activity_type
		FROM vle
	LOOP
		col_name := 'total_clicks_' || activity.activity_type;

		-- %I quotes the new column name as an identifier
		EXECUTE format('ALTER TABLE studentInfo
			ADD COLUMN %I INT DEFAULT 0;', col_name);

		-- %L quotes the activity type as a literal. activity_type is read from
		-- vle via id_site, because in the published data set studentVle records
		-- only the site id and not the activity type.
		EXECUTE format('UPDATE studentInfo
			SET %I = subquery.total_clicks
			FROM (SELECT sv.id_student, sv.code_module, sv.code_presentation,
						 SUM(sv.sum_click) AS total_clicks
				  FROM studentVle AS sv
				  JOIN vle AS v
					ON v.id_site = sv.id_site
				   AND v.code_module = sv.code_module
				   AND v.code_presentation = sv.code_presentation
				  WHERE v.activity_type = %L
				  GROUP BY sv.id_student, sv.code_module, sv.code_presentation) AS subquery
			WHERE studentInfo.id_student = subquery.id_student
			  AND studentInfo.code_module = subquery.code_module
			  AND studentInfo.code_presentation = subquery.code_presentation;',
			col_name, activity.activity_type);
	END LOOP;
END $$;
```

# Results - XGBoost
First, I try with just default parameters and with only demographic characteristics and the 'sum of clicks' for each resource. Here, the F1 score for not receiving a distinction is 0.96, and for receiving is 0.18. 

When I add my engagement features, and tune the hyperparameters, I can get to 0.92 and 0.40 respectively. This, to put it bluntly, is not fantastic. 
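
For readers unfamiliar with per-class F1 scores, the sketch below shows how a model can score very well on the majority class while scoring poorly on the rare one. The data is a made-up toy example (100 students, 10 distinctions) and not the project's predictions; the function name `f1_for_class` is hypothetical.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: 1 = distinction, 0 = no distinction.
# The model finds 3 of the 10 distinctions and raises 2 false alarms.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 3 + [0] * 7 + [1] * 2 + [0] * 88

f1_distinction = f1_for_class(y_true, y_pred, label=1)
f1_no_distinction = f1_for_class(y_true, y_pred, label=0)

print(f"F1 (distinction):    {f1_distinction:.2f}")
print(f"F1 (no distinction): {f1_no_distinction:.2f}")
```

Because distinctions are rare, a handful of misses and false alarms is enough to pull the F1 score for that class down sharply, while the score for the majority class stays high.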

The interesting part, to me, is the feature importance. The variance, skewness and kurtosis seem to score highly in feature importance for essentially every module. Across all modules together, variance and skewness are the most important features, followed by clicks on external resources, the forums, OU content and quizzes. Total clicks data is by some considerable margin more important than demographic features. Strikingly, having a previous degree is amongst the least important predictors.

# Conclusion and Discussion



## Personal Reflections
This was my first real machine learning project that wasn't basic logistic regression or factor analysis from my quantitative social science degrees. If nothing else, it was absolutely essential that I did this project following my machine learning courses. I feel significantly more confident in my ability to create and apply machine learning models. It has reinforced my understanding of decision trees specifically, but also of general ML principles: bias and variance, feature engineering, parameter tuning, and of course the ML project life cycle, to name but a few.

## Discussion of Results

The most striking result is that both the engagement features individually and their statistical moments are consistently rated higher in importance than demographic features. 
Clearly, the model struggles in some respects. By lowering the decision threshold I can get relatively high recall for both classes (just under 0.8), but at the expense of precision for distinction cases. I found very little that made the precision for distinction cases high, whereas small adjustments to the decision threshold improved recall remarkably. The F1 score for the distinction cases was hard to raise above around 0.4, which is of course still far higher than a random guessing algorithm would achieve, but not as high as I had hoped.
I did not try everything. It was outside the scope of the project to adjust more than a few hyperparameters, or to create synthetic minority samples.
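
The threshold trade-off described above can be seen in a small standard-library sketch. The predicted probabilities and labels are invented for illustration and are not taken from the model; the function name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall for the positive class at a given decision threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 1 = distinction. Scores are hypothetical predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.3, 0.7, 0.45, 0.35, 0.2, 0.2, 0.1, 0.1, 0.05]

precision_default, recall_default = precision_recall(y_true, scores, threshold=0.5)
precision_low, recall_low = precision_recall(y_true, scores, threshold=0.3)

print(f"threshold 0.5: precision={precision_default:.2f}, recall={recall_default:.2f}")
print(f"threshold 0.3: precision={precision_low:.2f}, recall={recall_low:.2f}")
```

Lowering the threshold labels more students as likely distinctions, so more true distinctions are caught (higher recall) but more of the flagged students turn out not to be distinctions (lower precision).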

So we have a significant improvement relative to a random guessing algorithm.