Contextual Embeddings and Graph Convolutional Networks for Concept Prerequisite Learning

This repository contains the code for our Graph Split algorithm, the concept embeddings we generated from concept names and concept definitions from Wikipedia, and the data used for the experiments presented in the paper Contextual Embeddings and Graph Convolutional Networks for Concept Prerequisite Learning. The paper was accepted to the AIED track at ACM SAC 2024.

Structure of the Repository

.
├───AL_CPL_Originial_Data #Contains the original AL-CPL data that can be found here: https://github.com/harrylclc/AL-CPL-dataset
├───Generated_Images
├───Graph_Split #This is the training and testing data generated from our graph split
│   ├───data_mining # Each subfolder contains the data for a domain + the statistics of each split in `x_split_statistics.csv`
│   ├───geometry
│   ├───physics
│   └───precalculus
├───Random_Split
│   ├───data_mining
│   ├───geometry
│   ├───physics
│   └───precalculus
└───Scripts #Contains the scripts useful for both types of splits

Overview

By examining the concept prerequisite graph generated from the AL-CPL dataset, we noticed that some prerequisite relations are easier to deduce than others because of transitivity. We therefore devised a novel graph-based stochastic split algorithm that avoids cases where a prerequisite relation in the test set can be trivially inferred by transitivity (as in Subfigure 5.a). Our Graph Split algorithm is implemented in the script Graph_Split.py.
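To make the transitivity issue concrete, here is a minimal sketch that flags test edges already implied by a path in the training edges. It is not the algorithm in Graph_Split.py; the function names and the toy concepts are hypothetical and only illustrate the kind of leakage the Graph Split is meant to avoid.

```python
from collections import deque

def reachable(edges, source, target):
    """Return True if target can be reached from source by following directed edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def trivially_inferable(train_edges, test_edges):
    """Test edges (u, v) for which a path u -> ... -> v already exists in the training edges."""
    return [(u, v) for u, v in test_edges if reachable(train_edges, u, v)]

# Toy prerequisite edges (hypothetical): (a, b) means "a is a prerequisite of b".
train_edges = [("algebra", "functions"), ("functions", "limits"), ("limits", "derivatives")]
test_edges = [("algebra", "limits"), ("derivatives", "integrals")]
leaky = trivially_inferable(train_edges, test_edges)
print(leaky)
```

In this toy example, only the edge from "algebra" to "limits" is flagged, because the training edges already connect the two concepts through "functions".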

The following section explains how we split our data for our experiments and how we generate our embeddings.

Split Architecture

For reproducibility purposes, below is a schema that explains how we generate the appropriate splits to train our models.

A 'Random Split' is simply the random split of the data you would get by running a function similar to sklearn.model_selection.train_test_split, whereas a 'Graph Split' is the random split obtained with our Graph Split algorithm (see Graph_Split.py). The in-domain split, whether it comes from a 'Graph Split' or a 'Random Split', generates 5 different train/val/test splits for each domain. The transductive split follows the method for splitting the training data recommended by Leskovec for transductive link prediction. His slides are available here (see slide 68), as well as the code we used to reproduce this split (we used option 2 with 'torch_geometric').
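As an illustration of the 'Random Split', the sketch below produces 5 train/val/test splits from a list of labelled concept pairs. It is a dependency-free stand-in for a function like sklearn.model_selection.train_test_split, not the code in Random_Split.py; the split fractions, the seeds, and the toy pairs are assumptions.

```python
import random

def random_split(pairs, val_fraction=0.1, test_fraction=0.2, seed=0):
    """Shuffle labelled pairs and cut them into train/val/test lists.

    The fractions are illustrative assumptions, not the values used in the paper.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical labelled pairs: (concept_a, concept_b, is_prerequisite)
pairs = [(f"c{i}", f"c{i + 1}", i % 2) for i in range(50)]

# One split per seed gives 5 different train/val/test splits.
splits = [random_split(pairs, seed=s) for s in range(5)]
for train, val, test in splits:
    print(len(train), len(val), len(test))
```

Because the pairs are shuffled without looking at the graph, nothing prevents a test pair from being implied by training pairs, which is the situation the Graph Split addresses.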

Embeddings

We use FastText, Transformers, and Sentence Transformers to generate concept embeddings to solve Concept Prerequisite Learning on the AL-CPL dataset. Below, you will find the methodology we followed to generate our embeddings and the metrics we used to evaluate them.

The notebook Metrics_for_Emebddings.ipynb shows how to load the embeddings in Concepts_with_Description_and_Embeddings.csv, provides a visualization, and computes the metrics for each embedding method. If needed, Metrics_for_Emebddings.ipynb can be opened and executed in Google Colaboratory; you only need to upload the Concepts_with_Description_and_Embeddings.csv file.

How our Embeddings Were Generated

We use the following methodology to represent concepts, with the name of the corresponding column in the file Concepts_with_Description_and_Embeddings.csv in parentheses:

FastText Embeddings (ConceptEmbeddings_FastText): We use FastText's cc.en.300.bin model. Its get_sentence_vector method is used on the name of concepts.
OpenAI Embeddings (Phrase_Embedding_OpenAI): We use OpenAI's API and their model text-embedding-ada-002 to generate concept embeddings from the definitions of concepts (Query_Phrase).
Sentence Transformer Embeddings (Phrase_Embedding_all-*): We use the English models of the Sentence Transformers library and the definitions of concepts (Query_Phrase) to generate concept embeddings. The name of the model used is included in the column's name. For example, the embeddings in column Phrase_Embedding_all-mpnet-base-v2 were generated with the model all-mpnet-base-v2.
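The column naming convention and the Sentence Transformers step can be sketched as follows. This is an assumption-laden illustration rather than the script we ran: the helper names are hypothetical, and embed_definitions requires the sentence-transformers package and downloads the model on first use, so it is defined but not called here.

```python
def column_for_model(model_name):
    """Column name holding the embeddings produced by a Sentence Transformers model."""
    return f"Phrase_Embedding_{model_name}"

def embed_definitions(model_name, definitions):
    """Encode a list of concept definitions (strings) into one vector per definition.

    Hypothetical helper: needs the sentence-transformers package and network
    access to download the model the first time it is used.
    """
    from sentence_transformers import SentenceTransformer  # imported lazily
    model = SentenceTransformer(model_name)
    return model.encode(definitions)

column = column_for_model("all-mpnet-base-v2")
print(column)
```

Calling embed_definitions("all-mpnet-base-v2", [...]) on a list of Query_Phrase strings would return one embedding per definition.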

Metrics

Additionally, we propose a method to evaluate embeddings for our CPL task, drawing inspiration from recommender systems. Our evaluation process involves the following steps: First, we compute the cosine similarity matrix for all embeddings. Next, we rank the results for each concept. Finally, we compute the evaluation metrics based on the definitions provided below:

Where for any given concept C, the term "related concepts" refers to either the prerequisites of the given concept C (ancestors) or the concepts for which C is a prerequisite (descendants). The query set, denoted as Q, contains all the related concepts to concept C.

All of these metrics return values in the interval [0,1]. The higher the value of the metric, the higher the probability of concepts being 'related' when their embeddings have a high cosine similarity (see the definition of "related concepts" given above).
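The three evaluation steps (cosine similarities, ranking, metric) can be sketched as below. The exact metric definitions are not reproduced in this text, so precision@k is used here as an assumed stand-in from the recommender-systems literature; the function names, the toy embeddings, and the related set are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_by_similarity(embeddings, concept):
    """Other concepts sorted by decreasing cosine similarity to `concept`."""
    others = [c for c in embeddings if c != concept]
    return sorted(others, key=lambda c: cosine(embeddings[concept], embeddings[c]), reverse=True)

def precision_at_k(ranking, related, k):
    """Share of the top-k ranked concepts that belong to the related set Q."""
    return sum(1 for c in ranking[:k] if c in related) / k

# Toy 2-D embeddings (hypothetical values).
embeddings = {
    "limits": [1.0, 0.0],
    "derivatives": [0.9, 0.1],
    "integrals": [0.8, 0.3],
    "triangles": [0.0, 1.0],
}
related = {"derivatives", "integrals"}  # Q: ancestors and descendants of "limits"

ranking = rank_by_similarity(embeddings, "limits")
score = precision_at_k(ranking, related, k=2)
print(ranking, score)
```

Like the metrics described above, this stand-in returns a value in [0,1], and it is higher when related concepts sit closer to C in the similarity ranking.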

Hyperparameters

In our paper, the PnPR-GCN architecture has two hyperparameters for the loss function, $\lambda$ and $\mu$, in addition to its architecture hyperparameters. We use grid search to find the optimal hyperparameters for our PnPR-GCN architecture in the in-domain setting and report them in the table below:

Split Type	Average Distance Between Nodes	Number of GCN Embedding Layers	Size of GCN Embeddings	MLP Depth	MLP Dropout Rate
Random Split	1.121	1	675	2	0.8
Graph Split	1.0	1	764	2	0.7

We also find that the optimal value for both $\lambda$ and $\mu$ is $1$.

Acknowledgements

This work would not have been possible without:

An INSIGHT grant from the Social Sciences and Humanities Research Council of Canada (SSHRC)
NetworkX
Pytorch Geometric
Hugging Face Transformers
Sentence Transformers Library
The Original AL-CPL dataset

Citation

Please cite our work if this repository was helpful to your research!

@article{X2023PrereqGCN,
  title={Contextual Embeddings and Graph Convolutional Networks for Concept Prerequisite Learning},
  author={X},
  year={2023},
  booktitle = {X},
  publisher = {X},
}
