sudo apt-get update
sudo apt-get install libpq-dev
Then, install the required Python packages:
pip install -r requirements.txt
pip install -e .
Defines the setup configuration for the project package, including dependencies and package metadata.
The main training script includes:
- Loading configurations: Sets up the environment and loads the necessary configurations.
- Training function: Manages the training loop, logging, gradient clipping, and checkpointing.
- Saving utilities: Handles model checkpoints and other utility functions.
- Main function: Initializes the configurations and datasets, then starts the training process.
Contains utility functions for:
- Directory creation.
- Configuration loading.
- JSON handling.
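As an illustration of the kind of helpers described above, here is a minimal sketch using only the Python standard library. The function names `mkdir`, `save_json`, and `load_json` are hypothetical and are not taken from the project; configuration loading is left out because the configuration files are YAML, which needs a third-party parser.

```python
import json
import os
import tempfile


def mkdir(path):
    """Create a directory (and any missing parents); do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path


def save_json(obj, path):
    """Write a JSON-serializable object to `path`, creating its parent directory."""
    parent = os.path.dirname(path)
    if parent:
        mkdir(parent)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, indent=2)


def load_json(path):
    """Read a JSON file and return the decoded object."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Round-trip a small dictionary through a nested, not-yet-existing directory.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "runs", "exp1", "metrics.json")
        save_json({"epoch": 1, "loss": 0.5}, target)
        print(load_json(target))
```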
Includes helper functions for:
- Setting up learning rate schedules.
- Logging.
- Ensuring reproducibility by setting random seeds.
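The sketch below shows one common way to implement two of these helpers. It is not the project's code: `set_seed` and `linear_warmup_decay` are hypothetical names, and the linear warmup followed by linear decay is only an assumed example of a learning rate schedule, since the actual schedule is not described here.

```python
import random


def set_seed(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass


def linear_warmup_decay(step, warmup_steps, total_steps):
    """Return a learning rate multiplier for the given step.

    The multiplier rises linearly from 0 to 1 during warmup,
    then falls linearly back to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))
```

The multiplier is meant to be applied to the base learning rate (for example, the `lr` value in the optimizer settings) at each step.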
A simple script to test a tokenizer.
Defines functions for:
- Connecting to a Neo4j graph database.
- Connecting to a PostgreSQL database.
- Retrieving graph data from these databases.
- CLGP.py: Defines the main model class, CLGP, which includes initialization, parameter settings, and the forward pass.
- CKGGP.py: Defines another model class, CKGGP, a variant of the CLGP model.
- transformer.py: Contains the implementation of a transformer model used as a text encoder.
- graph_sage.py: Implementation of the GraphSAGE model.
- dynamic_graph_cnn.py: Implementation of the Dynamic Graph CNN model.
- graph_utils.py: Utility functions for graph operations.
- graph_isomorphism_network.py: Implementation of the Graph Isomorphism Network (GIN) model.
- feature_selection_graph_neural_net.py: Implementation of the Feature Selection Graph Neural Network (FSGNN) model.
- graph_attention_network.py: Implementation of the Graph Attention Network (GAT) model.
- graph_convolutionnal_network.py: Implementation of the Graph Convolutional Network (GCN) model.
Handles data visualization using Plotly to create a 3D graph of question-graph links.
Script to train tokenizers for node features and prompts using the Unigram tokenizer.
Contains the implementation and training functions for the Unigram tokenizer.
Defines the function to get the data loader for training and evaluation.
Implements a custom dataset class, CLGP_Ebiose_dataset, for handling prompt-graph pairs and preparing data for the model.
Includes functions for:
- Loading data.
- Processing data.
- Computing statistics and displaying results.
Configuration for data paths and tokenizer settings:
graph_data_file: 'Ebiose_CLGP/ebiose_clgp/data/data.jsonl'
gsm8k_validation_file: 'Ebiose_CLGP/ebiose_clgp/data/gsm8k-validation.pkl'
prompt_tokenizer: 'Ebiose_CLGP/ebiose_clgp/data_utils/prompt_tokenizer_1.json'
graph_feature_tokenizer: 'Ebiose_CLGP/ebiose_clgp/data_utils/graph_tokenizer_1.json'
prompt_context_length: 1000
node_feature_context_length: 1000
num_workers: 4
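The two context-length settings above presumably cap the number of tokens kept per prompt and per node feature. Under that assumption, fitting a tokenized sequence to a fixed length could look like the sketch below; `fit_to_context` and the padding id `0` are hypothetical and may not match what the dataset class actually does.

```python
def fit_to_context(token_ids, context_length, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly `context_length`."""
    clipped = token_ids[:context_length]
    return clipped + [pad_id] * (context_length - len(clipped))


if __name__ == "__main__":
    # A short sequence is padded, a long one is truncated.
    print(fit_to_context([5, 9, 2], 5))
    print(len(fit_to_context([7] * 1500, 1000)))
```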
Configuration for model settings:
embed_dim: 256
graph_encoder:
  name: 'GCN'
  layers: 3
  hidden: 256
node_feature_encoder:
  name: 'Transformer'
  layers: 2
  heads: 2
  width: 512
  feedforward_dim: 512
  activation_function: 'gelu'
  layer_norm_eps: 1e-12
  initializer_range: 0.02
text_encoder:
  name: 'Transformer'
  layers: 2
  heads: 2
  width: 512
  feedforward_dim: 512
  activation_function: 'gelu'
  layer_norm_eps: 1e-12
  initializer_range: 0.02
Configuration for training settings:
per_gpu_train_batch_size: 32
logging_steps: 1
gradient_accumulation_steps: 8
num_train_epochs: 20
train_batch_size: 32
model_save_name: 'CLGP_V1.pth'
saved_checkpoints: 'Model checkpoints'
save_steps: 100
optimizer:
  lr: 0.00001
  eps: 0.00000001
  weight_decay: 0.1
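With gradient accumulation, the optimizer updates once every `gradient_accumulation_steps` mini-batches, so the effective batch size is larger than the per-GPU value. The short calculation below works this out for the training settings above; the single-GPU default is an assumption, and `effective_batch_size` is a hypothetical helper rather than a project function.

```python
def effective_batch_size(per_gpu_batch_size, accumulation_steps, num_gpus=1):
    """Number of samples that contribute to a single optimizer update."""
    return per_gpu_batch_size * accumulation_steps * num_gpus


if __name__ == "__main__":
    # Values taken from the training configuration; num_gpus=1 is assumed.
    print(effective_batch_size(per_gpu_batch_size=32, accumulation_steps=8))  # 256
```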
To start training the model, run the following command:
python train.py
This script will:
- Load configurations from YAML files.
- Initialize the dataset and model.
- Train the model with logging, gradient clipping, and checkpointing.
- Save the trained model.
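To make the gradient clipping step more concrete, here is a framework-free sketch of clipping by global L2 norm. It illustrates the general technique only: the function name and the threshold are hypothetical, and the training script may rely on a library utility instead (in PyTorch code this is commonly `torch.nn.utils.clip_grad_norm_`).

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale a flat list of gradient values so their L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        # Already within the limit: leave the gradients unchanged.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


if __name__ == "__main__":
    clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
    print(norm_before, clipped)
```

Clipping rescales all gradients by the same factor, so the update direction is preserved while its magnitude is bounded.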
To visualize the data, run:
python data_viz.py
This will create a 3D graph of question-graph links using Plotly and open it in your default web browser.