Contextual Embeddings and Graph Convolutional Networks for Concept Prerequisite Learning

This repository contains the code for our Graph Split algorithm, the concept embeddings we generated using concept names and concept definitions from Wikipedia, and the data used for the experiments presented in our paper, Contextual Embeddings and Graph Convolutional Networks for Concept Prerequisite Learning. The paper was accepted to the AIED track at ACM SAC 2024.

Structure of the Repository

├───AL_CPL_Originial_Data #Contains the original AL-CPL data that can be found here:
├───Graph_Split #This is the training and testing data generated from our graph split
│   ├───data_mining # Each subfolder contains the data for a domain + the statistics of each split in `x_split_statistics.csv`
│   ├───geometry
│   ├───physics
│   └───precalculus
│   ├───data_mining
│   ├───geometry
│   ├───physics
│   └───precalculus
└───Scripts #Contains the scripts useful for both types of splits


By examining the concept prerequisite graph generated from the AL-CPL dataset, we noticed that some prerequisite relations are easier to deduce than others because of transitivity. We therefore devised a novel graph-based stochastic split algorithm that avoids cases where prerequisite inference on the test set is trivial due to transitivity (as in Subfigure 5.a). Our Graph Split algorithm is in the script
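To make the transitivity issue concrete, here is a minimal sketch of a check for whether a test relation could be inferred from training relations alone. It is not the Graph Split algorithm itself, only an illustration of the problem it targets. The function name, the `(prerequisite, concept)` edge orientation, and the example concept names are assumptions made for this example.

```python
from collections import defaultdict, deque


def is_trivially_inferable(train_edges, test_edge):
    """Return True if a directed path in the training graph already
    connects the two ends of test_edge (assumed (prerequisite, concept))."""
    successors = defaultdict(set)
    for prerequisite, concept in train_edges:
        successors[prerequisite].add(concept)

    source, target = test_edge
    seen = {source}
    queue = deque([source])
    # Breadth-first search over training edges only.
    while queue:
        node = queue.popleft()
        for nxt in successors[node]:
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical toy data: limit -> derivative -> integral.
train = [("limit", "derivative"), ("derivative", "integral")]
print(is_trivially_inferable(train, ("limit", "integral")))   # True
print(is_trivially_inferable(train, ("integral", "limit")))   # False
```

A split that leaves many such test pairs reachable through training edges would let a model score well by exploiting transitivity alone.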

The following section explains how we split our data for our experiments and how we generate our embeddings.

Split Architecture

For reproducibility purposes, below is a schema that explains how we generate the appropriate splits to train our models.

Schema of the different types of split

A 'Random Split' is the random split of the data you would get from a function similar to sklearn.model_selection.train_test_split, whereas a 'Graph Split' is the random split obtained with our Graph Split algorithm. The in-domain split, whether from a 'Graph Split' or a 'Random Split', generates 5 different train/val/test splits for each of the domains. The transductive split is the method of splitting the training data recommended by Leskovec for transductive link prediction. His slides are available here (see slide 68), as well as the code we used to reproduce this split (we used option 2 with 'torch_geometric').
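As a rough illustration of the in-domain 'Random Split', the sketch below produces 5 seeded train/val/test splits using only the standard library, as a stand-in for a function like sklearn.model_selection.train_test_split. The split fractions, the seeds, and the function name are assumptions for this example and are not taken from the repository.

```python
import random


def random_splits(pairs, n_splits=5, val_frac=0.1, test_frac=0.2, base_seed=0):
    """Return n_splits independent (train, val, test) partitions of pairs.

    val_frac, test_frac and base_seed are illustrative assumptions.
    """
    splits = []
    for i in range(n_splits):
        shuffled = list(pairs)
        # A dedicated generator per split keeps each split reproducible.
        random.Random(base_seed + i).shuffle(shuffled)
        n_test = int(len(shuffled) * test_frac)
        n_val = int(len(shuffled) * val_frac)
        test = shuffled[:n_test]
        val = shuffled[n_test:n_test + n_val]
        train = shuffled[n_test + n_val:]
        splits.append((train, val, test))
    return splits


# Hypothetical concept pairs for one domain.
pairs = [(f"concept_{i}", f"concept_{i + 1}") for i in range(100)]
for train, val, test in random_splits(pairs):
    print(len(train), len(val), len(test))  # 70 10 20 for each split
```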


We use FastText, Transformers, and Sentence Transformers to generate concept embeddings to solve Concept Prerequisite Learning on the AL-CPL dataset. Below, you will find the methodology we followed to generate our embeddings and the metrics we used to evaluate them.

The notebook Metrics_for_Emebddings.ipynb shows how to load the embeddings in Concepts_with_Description_and_Embeddings.csv, provides a visualization, and computes the metrics for each embedding method. If needed, Metrics_for_Emebddings.ipynb can be opened and executed in Google Colaboratory; you only need to upload the Concepts_with_Description_and_Embeddings.csv file.

How our Embeddings Were Generated

We use the following methodology to represent concepts, with the name of the corresponding column in the file Concepts_with_Description_and_Embeddings.csv in parentheses:

  • FastText Embeddings (ConceptEmbeddings_FastText): We use FastText's cc.en.300.bin model and apply its get_sentence_vector method to the names of concepts.

  • OpenAI Embeddings (Phrase_Embedding_OpenAI): We use OpenAI's API with the model text-embedding-ada-002 to generate concept embeddings from the definitions of concepts (Query_Phrase).

  • Sentence Transformer Embeddings (Phrase_Embedding_all-*): We use the English models of the Sentence Transformers library and the definitions of concepts (Query_Phrase) to generate concept embeddings. The name of the model is included in the column's name. For example, the embeddings in column Phrase_Embedding_all-mpnet-base-v2 were generated with the model all-mpnet-base-v2.


Additionally, we propose a method to evaluate embeddings for our CPL task, drawing inspiration from recommender systems. Our evaluation process involves the following steps: First, we compute the cosine similarity matrix for all embeddings. Next, we rank the results for each concept. Finally, we compute the evaluation metrics based on the definitions provided below:

Here, for any given concept C, the term "related concepts" refers to either the prerequisites of C (ancestors) or the concepts for which C is a prerequisite (descendants). The query set, denoted Q, contains all the concepts related to C.

All of these metrics return values in the interval [0,1]. The higher the value of the metric, the higher the probability of concepts being 'related' when their embeddings have a high cosine similarity (see the definition of "related concepts" given above).
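The first two steps of the evaluation described above (cosine similarity, then ranking) can be sketched as follows, using only the standard library and assuming non-zero embedding vectors. The `precision_at_k` helper is a hypothetical stand-in to show how a ranking is compared against the related concepts. It is not one of the paper's metric definitions, and the toy embeddings and concept names are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(embeddings, concept):
    """Return every other concept, most similar to `concept` first."""
    query = embeddings[concept]
    others = [name for name in embeddings if name != concept]
    return sorted(
        others,
        key=lambda name: cosine_similarity(query, embeddings[name]),
        reverse=True,
    )


def precision_at_k(ranking, related, k):
    """Hypothetical stand-in metric: share of the top-k that is related."""
    return sum(1 for name in ranking[:k] if name in related) / k


# Toy 2-dimensional embeddings (assumed values).
embeddings = {
    "derivative": [1.0, 0.0],
    "integral": [0.9, 0.1],
    "triangle": [0.0, 1.0],
}
ranking = rank_by_similarity(embeddings, "derivative")
print(ranking)                                   # ['integral', 'triangle']
print(precision_at_k(ranking, {"integral"}, 1))  # 1.0
```

Like the metrics described above, this stand-in returns values in [0,1] and increases when related concepts appear near the top of the ranking.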


In our paper, the PnPR-GCN architecture has two loss-function hyperparameters, $\lambda$ and $\mu$, in addition to its architecture hyperparameters. We use grid search to find the optimal hyperparameters for our PnPR-GCN architecture in the in-domain setting and report them in the table below:

| Split Type | Average Distance Between Nodes | Number of GCN Embedding Layers | Size of GCN Embeddings | MLP Depth | MLP Dropout Rate |
|---|---|---|---|---|---|
| Random Split | 1.121 | 1 | 675 | 2 | 0.8 |
| Graph Split | 1.0 | 1 | 764 | 2 | 0.7 |

We also find the optimal value for $\lambda$ and $\mu$ to be equal to $1$.
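For readers unfamiliar with grid search, the sketch below shows the general pattern: evaluate every combination of candidate values and keep the best one. The candidate grid, the parameter names, and the toy scoring function are all hypothetical placeholders. In practice, `evaluate` would train PnPR-GCN with the given configuration and return a validation score.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid` and return (best_config, best_score)."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical search space; not the grid used in the paper.
grid = {
    "gcn_layers": [1, 2],
    "mlp_depth": [1, 2, 3],
    "mlp_dropout": [0.5, 0.7, 0.8],
}


def toy_evaluate(config):
    """Placeholder score standing in for a real validation metric."""
    return -(
        abs(config["gcn_layers"] - 1)
        + abs(config["mlp_depth"] - 2)
        + abs(config["mlp_dropout"] - 0.7)
    )


best_config, best_score = grid_search(grid, toy_evaluate)
print(best_config)  # {'gcn_layers': 1, 'mlp_depth': 2, 'mlp_dropout': 0.7}
```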


This work would not have been possible without:


Please cite our work if this repository was helpful to your research!

  title={Contextual Embeddings and Graph Convolutional Networks for Concept Prerequisite Learning},
  booktitle = {X},
  publisher = {X},

