Ebiose Contrastive Language Graph Pair (CLGP)

Dependencies

First, install the system packages (libpq-dev provides the PostgreSQL client library headers):

sudo apt-get update
sudo apt-get install libpq-dev

Then, install the required Python packages:

pip install -r requirements.txt
pip install -e .

Overview

Project Structure

`setup.py`

Defines the setup configuration for the project package, including dependencies and package metadata.

`train.py`

The main training script includes:

Loading configurations: Sets up the environment and loads the necessary configurations.
Training function: Manages the training loop, logging, gradient clipping, and checkpointing.
Saving utilities: Handles model checkpoints and other utility functions.
Main function: Initializes configurations, datasets, and starts the training process.
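
The gradient clipping step listed above is commonly done by global norm: the gradients of all parameters are rescaled together whenever their combined L2 norm exceeds a threshold. The sketch below shows that idea in plain Python. It is an illustration only, not the code in train.py; the function name clip_by_global_norm and the max_norm value are assumptions, and this README does not say which clipping method the script uses.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads is a list of gradient vectors (lists of floats), one per parameter.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        # Already within the limit: return an unchanged copy.
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm

# Two one-element gradients with combined norm sqrt(3^2 + 4^2) = 5.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes.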

`utils.py`

Contains utility functions for:

Directory creation.
Configuration loading.
JSON handling.
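
As a rough picture of what such helpers tend to look like, here is a minimal sketch using only the standard library. The function names ensure_dir, save_json and load_json are hypothetical and may not match utils.py. YAML configuration loading is left out because it needs a third-party parser.

```python
import json
import tempfile
from pathlib import Path

def ensure_dir(path):
    """Create the directory (and any missing parents) if it does not exist."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory

def save_json(data, path):
    """Write data to path as indented UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)

def load_json(path):
    """Read a JSON file and return the parsed object."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Round trip inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_dir(Path(tmp) / "checkpoints" / "run_1")
    save_json({"embed_dim": 256}, out_dir / "meta.json")
    restored = load_json(out_dir / "meta.json")
```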

`train_utils.py`

Includes helper functions for:

Setting up learning rate schedules.
Logging.
Ensuring reproducibility by setting random seeds.
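
The two less obvious helpers can be sketched as follows. This README does not say which schedule train_utils.py implements, so the linear warmup followed by linear decay shown here is an assumption, as are the names set_seed and warmup_linear_decay and the step counts. The base learning rate of 1e-5 is taken from train_config.yaml.

```python
import random

def set_seed(seed):
    """Seed Python's random module.

    A real training script would also seed NumPy and the deep learning
    framework at the same point.
    """
    random.seed(seed)

def warmup_linear_decay(step, warmup_steps, total_steps, base_lr):
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Re-seeding with the same value reproduces the same random draw.
set_seed(42)
first = random.random()
set_seed(42)
second = random.random()

# Sample the schedule at the start, mid-warmup, peak, mid-decay and end.
lrs = [
    warmup_linear_decay(s, warmup_steps=10, total_steps=100, base_lr=1e-5)
    for s in (0, 5, 10, 55, 100)
]
```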

`test_data_utils.py`

A simple script to test a tokenizer.

`search_and_retrieval.py`

Defines functions for:

Connecting to a Neo4j graph database.
Connecting to a PostgreSQL database.
Retrieving graph data from these databases.

Model Definitions

CLGP.py: Defines the main model class CLGP, which includes initialization, parameter settings, and the forward pass.
CKGGP.py: Defines another model class CKGGP, a variant of the CLGP model.
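
This README does not spell out the training objective, but the project name suggests a CLIP-style contrastive setup in which matching prompt and graph embeddings are pulled together and mismatched ones pushed apart. The sketch below shows the symmetric contrastive loss usually used for that, in plain Python. It is an assumption about the approach, not a copy of CLGP.py; the function names and the temperature of 0.07 are illustrative.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy(logits, target):
    """Cross-entropy of one list of logits against the index of the correct entry."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

def contrastive_loss(text_embs, graph_embs, temperature=0.07):
    """Symmetric contrastive loss where item i of each batch forms a matching pair."""
    text = [l2_normalize(v) for v in text_embs]
    graph = [l2_normalize(v) for v in graph_embs]
    n = len(text)
    # logits[i][j] is the cosine similarity of prompt i and graph j, divided by temperature.
    logits = [[sum(a * b for a, b in zip(t, g)) / temperature for g in graph] for t in text]
    # Each prompt should pick out its own graph (rows) and vice versa (columns).
    text_to_graph = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    graph_to_text = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (text_to_graph + graph_to_text) / 2

# Correctly paired embeddings give a much lower loss than swapped ones.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```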

Transformer and Neural Network Modules

transformer.py: Contains the implementation of a transformer model used as a text encoder.
graph_sage.py: Implementation of the GraphSAGE model.
dynamic_graph_cnn.py: Implementation of the Dynamic Graph CNN model.
graph_utils.py: Utility functions for graph operations.
graph_isomorphism_network.py: Implementation of the Graph Isomorphism Network (GIN) model.
feature_selection_graph_neural_net.py: Implementation of the Feature Selection Graph Neural Network (FSGNN) model.
graph_attention_network.py: Implementation of the Graph Attention Network (GAT) model.
graph_convolutionnal_network.py: Implementation of the Graph Convolutional Network (GCN) model.
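
To make the role of these graph encoders concrete, here is a minimal sketch of a single GCN propagation step, relu(D^-1/2 (A + I) D^-1/2 X W), written with plain Python lists. It illustrates the standard formulation and assumes an undirected graph with a symmetric adjacency matrix; it is not taken from graph_convolutionnal_network.py, and the helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def gcn_layer(adjacency, features, weight):
    """One graph convolution: relu(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adjacency)
    # Add self-loops so each node keeps its own features.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # Node degrees (including the self-loop) for symmetric normalization.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    norm = [
        [inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]
    aggregated = matmul(norm, features)  # mix each node with its neighbours
    projected = matmul(aggregated, weight)  # linear projection
    return [[max(0.0, v) for v in row] for row in projected]

# Two connected nodes with one-hot features and an identity weight matrix.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output = gcn_layer(adjacency, features, identity)
```

In this toy graph each node ends up with the average of its own and its neighbour's features, which is the smoothing behaviour a GCN layer is built around.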

`data_viz.py`

Handles data visualization using Plotly to create a 3D graph of question-graph links.

`train_tokenizers.py`

Script to train tokenizers for node features and prompts using the Unigram tokenizer.

`tokenizer.py`

Contains the implementation and training functions for the Unigram tokenizer.
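
For readers unfamiliar with the Unigram model, the sketch below shows its core inference step: given a vocabulary of pieces with log-probabilities, pick the segmentation with the highest total score using dynamic programming (Viterbi). This is a toy illustration in plain Python with a made-up vocabulary, not the code in tokenizer.py, which may rely on a tokenizer library instead.

```python
import math

def unigram_segment(text, log_probs):
    """Return the segmentation of text with the highest total log-probability.

    log_probs maps each vocabulary piece to its log-probability.
    """
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best score of any segmentation of text[:i]
    best_start = [0] * (n + 1)          # start index of the last piece in that segmentation
    best_score[0] = 0.0
    max_len = max(len(piece) for piece in log_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        pieces.append(text[best_start[end]:end])
        end = best_start[end]
    return pieces[::-1]

# Toy vocabulary: single letters plus a few longer pieces.
probabilities = {
    "g": 0.05, "r": 0.05, "a": 0.05, "p": 0.05, "h": 0.05, "s": 0.05,
    "gra": 0.15, "ph": 0.15, "graph": 0.4,
}
vocab = {piece: math.log(p) for piece, p in probabilities.items()}
pieces = unigram_segment("graphs", vocab)
```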

`dataloader.py`

Defines the function that builds the data loader for training and evaluation.

`dataset.py`

Implements a custom dataset class CLGP_Ebiose_dataset for handling prompt-graph pairs and preparing data for the model.
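
The general shape of such a dataset can be sketched without any framework. The class below reads JSONL records, tokenizes the prompt, and pads or truncates it to a fixed context length. Everything here is hypothetical: the class name PromptGraphPairs, the field names "prompt" and "graph", the toy tokenizer, and the padding id are assumptions for illustration and are not taken from dataset.py.

```python
import json

class PromptGraphPairs:
    """Index-able collection of (prompt token ids, graph) pairs read from JSONL lines."""

    def __init__(self, jsonl_lines, tokenize, context_length, pad_id=0):
        # "prompt" and "graph" are hypothetical field names for this sketch.
        self.records = [json.loads(line) for line in jsonl_lines if line.strip()]
        self.tokenize = tokenize
        self.context_length = context_length
        self.pad_id = pad_id

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        record = self.records[index]
        # Truncate to the context length, then pad on the right.
        ids = self.tokenize(record["prompt"])[: self.context_length]
        ids = ids + [self.pad_id] * (self.context_length - len(ids))
        return ids, record["graph"]

def toy_tokenize(text):
    """Stand-in tokenizer: one id per word, equal to the word length."""
    return [len(word) for word in text.split()]

lines = [
    '{"prompt": "add two numbers", "graph": {"nodes": ["input", "sum"], "edges": [[0, 1]]}}',
]
pairs = PromptGraphPairs(lines, toy_tokenize, context_length=5)
token_ids, graph = pairs[0]
```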

`data_analysis.py`

Includes functions for:

Loading data.
Processing data.
Computing statistics and displaying results.

Configuration Files

`data_config.yaml`

Configuration for data paths and tokenizer settings:

graph_data_file: 'Ebiose_CLGP/ebiose_clgp/data/data.jsonl'
gsm8k_validation_file: 'Ebiose_CLGP/ebiose_clgp/data/gsm8k-validation.pkl'
prompt_tokenizer: 'Ebiose_CLGP/ebiose_clgp/data_utils/prompt_tokenizer_1.json'
graph_feature_tokenizer: 'Ebiose_CLGP/ebiose_clgp/data_utils/graph_tokenizer_1.json'
prompt_context_length: 1000
node_feature_context_length: 1000
num_workers: 4

`model_config.yaml`

Configuration for model settings:

embed_dim: 256

graph_encoder:
  name: 'GCN'
  layers: 3
  hidden: 256

node_feature_encoder:
  name: 'Transformer'
  layers: 2
  heads: 2
  width: 512
  feedforward_dim: 512
  activation_function: 'gelu'
  layer_norm_eps: 1e-12
  initializer_range: 0.02

text_encoder:
  name: 'Transformer'
  layers: 2
  heads: 2
  width: 512
  feedforward_dim: 512
  activation_function: 'gelu'
  layer_norm_eps: 1e-12
  initializer_range: 0.02
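
One quick sanity check on the encoder settings: in standard multi-head attention the width is split evenly across the heads, so it must be divisible by the head count. The sketch below mirrors a few of the text_encoder fields in a small dataclass to show that check. The class name TransformerEncoderConfig is hypothetical, and whether transformer.py follows this convention is an assumption.

```python
from dataclasses import dataclass

@dataclass
class TransformerEncoderConfig:
    """Subset of the text_encoder fields from model_config.yaml."""
    layers: int
    heads: int
    width: int
    feedforward_dim: int

    def head_dim(self):
        """Per-head dimension under the usual even split of width across heads."""
        if self.width % self.heads != 0:
            raise ValueError("width must be divisible by heads")
        return self.width // self.heads

# Values copied from the text_encoder section above.
text_encoder = TransformerEncoderConfig(layers=2, heads=2, width=512, feedforward_dim=512)
per_head = text_encoder.head_dim()
```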

`train_config.yaml`

Configuration for training settings:

per_gpu_train_batch_size: 32
logging_steps: 1
gradient_accumulation_steps: 8
num_train_epochs: 20
train_batch_size: 32
model_save_name: 'CLGP_V1.pth'
saved_checkpoints: 'Model checkpoints'
save_steps: 100

optimizer:
  lr: 0.00001
  eps: 0.00000001
  weight_decay: 0.1
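
To see how these settings interact, the arithmetic below works out the effective batch size and the number of optimizer steps. It assumes the conventional meaning of gradient accumulation (one optimizer step per 8 batches, including a final partial group) and a single GPU. The dataset size of 10,000 pairs is a made-up figure used only for illustration.

```python
import math

# Values from train_config.yaml.
per_gpu_train_batch_size = 32
gradient_accumulation_steps = 8
num_train_epochs = 20

# Assumptions for this example, not taken from the repository.
num_gpus = 1
dataset_size = 10_000

# Samples that contribute to each optimizer update.
effective_batch_size = per_gpu_train_batch_size * num_gpus * gradient_accumulation_steps

# Forward/backward passes per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(dataset_size / (per_gpu_train_batch_size * num_gpus))
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_optimizer_steps = optimizer_steps_per_epoch * num_train_epochs

print(effective_batch_size, batches_per_epoch, optimizer_steps_per_epoch, total_optimizer_steps)
```

Under these assumptions each update averages over 256 samples, which is worth keeping in mind when tuning the learning rate.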

Usage

Training the Model

To start training the model, run the following command:

python train.py

This script will:

Load configurations from YAML files.
Initialize the dataset and model.
Train the model with logging, gradient clipping, and checkpointing.
Save the trained model.

Data Visualization

To visualize the data, run:

python data_viz.py

This will create a 3D graph of question-graph links using Plotly and open it in your default web browser.
