# Knowledge Graph Word2Vec Pipeline

This notebook reproduces the experimental data preparation pipeline described in the paper.  
It:

1. Loads and cleans the knowledge graph.
2. Removes the edges that appear in the test sets from the knowledge graph.
3. Trains a Word2Vec model on the cleaned graph.
4. Builds train and test feature matrices (`X.npy`, `X_test.npy`) and label vectors (`y.npy`, `Y_test.npy`).

> **Important:** Run all cells in order from top to bottom. Make sure the relative paths in the code match the structure of your project repository.


## 1. Install Required Packages

In [None]:
# Install dependencies (uncomment and run if needed)
# !pip install pandas numpy gensim nltk scikit-learn matplotlib

## 2. Import Required Libraries

This cell imports all libraries used in the pipeline and downloads the NLTK tokenizer models (once).

In [None]:
# Import Required Libraries

import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import defaultdict
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import nltk

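# Download tokenizer data (run once).
# NLTK 3.9 and later load 'punkt_tab' instead of 'punkt'; requesting both covers either version.
nltk.download('punkt')
nltk.download('punkt_tab')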

## 3. Load and Preprocess the Knowledge Graph

This step loads the knowledge graph from CSV, drops duplicate rows and the leftover index column (`Unnamed: 0`), and normalizes the entity prefixes for compounds, proteins, and diseases by replacing `::` with `_`.

In [None]:
# Step 1: Load and Preprocess Knowledge Graph
knowledge_graph_path = "./data/Knowledge_graph.csv"
df = pd.read_csv(knowledge_graph_path)
df.drop_duplicates(inplace=True)
df.drop('Unnamed: 0', axis=1, inplace=True, errors='ignore')

# Replace unwanted prefixes
prefix_replacements = {
    'Compound::': 'Compound_',
    'protein::': 'protein_',
    'Disease::': 'Disease_'
}
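for key, value in prefix_replacements.items():
    # regex=False treats each prefix as a literal string, whatever the pandas default is
    df['source'] = df['source'].str.replace(key, value, regex=False)
    df['target'] = df['target'].str.replace(key, value, regex=False)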

print("Knowledge Graph shape:", df.shape)
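
As a toy illustration of the prefix normalization in this step, the sketch below applies the same replacement table to a three-column frame. The identifiers and the `interacts_with` relation are hypothetical and are not taken from the real graph.

```python
import pandas as pd

# Toy edges with hypothetical identifiers (not taken from the real graph)
toy = pd.DataFrame({
    "source": ["Compound::C1", "protein::P1"],
    "relation": ["has_approved_interaction", "interacts_with"],
    "target": ["Disease::D1", "Compound::C1"],
})

prefix_replacements = {
    "Compound::": "Compound_",
    "protein::": "protein_",
    "Disease::": "Disease_",
}

for old, new in prefix_replacements.items():
    for col in ("source", "target"):
        # regex=False treats the prefix as a literal string
        toy[col] = toy[col].str.replace(old, new, regex=False)

normalized_sources = toy["source"].tolist()
normalized_targets = toy["target"].tolist()
print(toy)
```

After normalization every identifier is a single run of letters, digits, and underscores, which helps it survive tokenization as one token.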

## 4. Load Positive and Negative Test Data

In this step, we load the positive and negative **test** interaction pairs.  
These will later be removed from the knowledge graph to avoid data leakage and will be used to build test features.

In [None]:
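# Step 2: Load Positive and Negative Test Data
positive_test_path = "./data/positive_test_df.csv"
negative_test_path = "./data/negative_test_df.csv"
positive_df = pd.read_csv(positive_test_path)
negative_df = pd.read_csv(negative_test_path)

# Rename columns to match the knowledge graph's column names
positive_df.rename(columns={"drug_id": "source", "ind_id": "target"}, inplace=True)
negative_df.rename(columns={"drug_id": "source", "ind_id": "target"}, inplace=True)

positive_df.drop('Unnamed: 0', axis=1, inplace=True, errors='ignore')
negative_df.drop('Unnamed: 0', axis=1, inplace=True, errors='ignore')

# Normalize prefixes the same way as the knowledge graph so the composite keys in Step 3 can match
# (a no-op if the identifiers are already normalized)
for pairs_df in (positive_df, negative_df):
    for key, value in prefix_replacements.items():
        pairs_df['source'] = pairs_df['source'].str.replace(key, value, regex=False)
        pairs_df['target'] = pairs_df['target'].str.replace(key, value, regex=False)

# Display dataset shapes
print("Positive test shape:", positive_df.shape)
print("Negative test shape:", negative_df.shape)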

## 5. Remove Test Edges from the Knowledge Graph

To prevent information leakage, we remove every knowledge-graph edge that matches a test pair:

- positive test pairs, matched against `has_approved_interaction` edges
- negative test pairs, matched against `has_side_effect` edges

We identify such edges with a composite key (`source_target`) and drop them from the graph.

In [None]:
# Step 3: Filter Rows to Delete from Knowledge Graph

# Create composite keys for matching
df['composite_key'] = df['source'] + '_' + df['target']
positive_df['composite_key'] = positive_df['source'] + '_' + positive_df['target']
negative_df['composite_key'] = negative_df['source'] + '_' + negative_df['target']

# Identify rows to delete
df_to_delete_positive = df[(df['relation'] == 'has_approved_interaction') &
                           (df['composite_key'].isin(positive_df['composite_key']))]

df_to_delete_negative = df[(df['relation'] == 'has_side_effect') &
                           (df['composite_key'].isin(negative_df['composite_key']))]

df_to_delete = pd.concat([df_to_delete_positive, df_to_delete_negative])
df_cleaned = df.drop(df_to_delete.index)
df_cleaned.drop(columns=['composite_key'], inplace=True)

# Save cleaned DataFrame
df_cleaned_path = "./output/train_df_Word2Vec.csv"
df_cleaned.to_csv(df_cleaned_path, index=False)
print("Cleaned Knowledge Graph shape:", df_cleaned.shape)
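
Because the normalized identifiers themselves contain underscores, a composite key joined with `_` could in principle be ambiguous. The sketch below shows one alternative that matches on the two columns directly, using a left merge with `indicator=True`. It runs on toy data; the identifiers, the `targets` relation, and the `drop_test_edges` helper are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Toy knowledge graph; identifiers and the "targets" relation are hypothetical
kg = pd.DataFrame({
    "source": ["Compound_C1", "Compound_C1", "Compound_C2"],
    "relation": ["has_approved_interaction", "targets", "has_approved_interaction"],
    "target": ["Disease_D1", "protein_P1", "Disease_D2"],
})

# Toy positive test pairs that must not remain in the graph
positive_test = pd.DataFrame({
    "source": ["Compound_C1"],
    "target": ["Disease_D1"],
})


def drop_test_edges(graph, pairs, relation):
    """Return `graph` without the `relation` edges whose (source, target) is in `pairs`."""
    flagged = graph.merge(
        pairs[["source", "target"]].drop_duplicates(),
        on=["source", "target"],
        how="left",
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    is_leak = (flagged["_merge"] == "both") & (flagged["relation"] == relation)
    return flagged.loc[~is_leak].drop(columns="_merge").reset_index(drop=True)


kg_clean = drop_test_edges(kg, positive_test, "has_approved_interaction")
print(kg_clean)
```

Only the `has_approved_interaction` edge between `Compound_C1` and `Disease_D1` is dropped; the other edge that starts at `Compound_C1` stays because its target differs.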

## 6. Construct Sentences and Train Word2Vec

Here we convert each triple in the cleaned knowledge graph into a simple sentence of the form:

`source relation target .`

We then tokenize these sentences and train a Word2Vec model to obtain dense embeddings for all entities and relations in the graph.

In [None]:
# Step 4: Prepare Sentences for Word2Vec
sentences = [f"{row['source']} {row['relation']} {row['target']} ." for _, row in df_cleaned.iterrows()]
sentences_tokenized = [word_tokenize(sent) for sent in sentences]

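# Step 5: Train Word2Vec Model
# CBOW (sg=0), 650-dimensional vectors, context window of 2, initial learning rate 0.001
model = Word2Vec(vector_size=650, window=2, sg=0, min_count=1, epochs=100, alpha=0.001)
model.build_vocab(sentences_tokenized, progress_per=1500)
# Note: the epochs value passed to train() (500) is the one this run uses, not the constructor's epochs=100
model.train(sentences_tokenized, total_examples=model.corpus_count, epochs=500, report_delay=1)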
model.save("./output/model1.model")
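
The sentence construction above relies on `word_tokenize` keeping each identifier intact. As a hedged alternative, the sketch below builds the token lists straight from the columns, so every triple yields exactly four tokens. This is not what the notebook does; it should give the same tokens only when the identifiers contain no characters the tokenizer would split on. The toy identifiers are hypothetical.

```python
import pandas as pd

# Toy triples with hypothetical identifiers
triples = pd.DataFrame({
    "source": ["Compound_C1", "Compound_C2"],
    "relation": ["has_approved_interaction", "has_side_effect"],
    "target": ["Disease_D1", "Disease_D2"],
})

# One token list per triple: [source, relation, target, "."]
sentences_tokenized = [
    [str(s), str(r), str(t), "."]
    for s, r, t in zip(triples["source"], triples["relation"], triples["target"])
]
print(sentences_tokenized[0])
```

A list of token lists in this shape is the same kind of input that `build_vocab` and `train` receive in the cell above.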

## 7. Build Training Features (`X.npy`) and Labels (`y.npy`)

In this step, we:

1. Load **positive** and **negative** training interaction pairs.
2. Apply the same prefix cleaning to drug and disease identifiers.
3. Concatenate the drug and disease Word2Vec vectors to form a single feature vector per pair.
4. Create the label vector `y` (1 for positive, 0 for negative) and save both `X` and `y` as NumPy arrays.

In [None]:
# Step 6: Prepare X.npy and y.npy for Training
positive_train_path = "./data/positive_train_df.csv"
negative_train_path = "./data/negative_train_df.csv"
has = pd.read_csv(positive_train_path)
hasnt = pd.read_csv(negative_train_path)

# Rename columns and clean prefixes
has.rename(columns={"source": "drug_id", "target": "ind_id"}, inplace=True)
hasnt.rename(columns={"source": "drug_id", "target": "ind_id"}, inplace=True)
for key, value in prefix_replacements.items():
    # regex=False treats each prefix as a literal string
    has['drug_id'] = has['drug_id'].str.replace(key, value, regex=False)
    has['ind_id'] = has['ind_id'].str.replace(key, value, regex=False)
    hasnt['drug_id'] = hasnt['drug_id'].str.replace(key, value, regex=False)
    hasnt['ind_id'] = hasnt['ind_id'].str.replace(key, value, regex=False)

# Concatenate positive and negative data
frames = [has, hasnt]
f = pd.concat(frames)

# Generate feature vectors: concat(drug_embedding, disease_embedding)
# Concatenating the two 1-D vectors yields one vector of length 2 * vector_size (1300 here)
X = [
    np.concatenate((model.wv[row['drug_id']], model.wv[row['ind_id']]))
    for _, row in f.iterrows()
]

# Create labels: 1 for positive, 0 for negative
y = np.zeros(len(X))
y[:len(has)] = 1

np.save("./output/X.npy", X)
np.save("./output/y.npy", y)
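
Looking up `model.wv[...]` raises a `KeyError` when an identifier has no vector, for example if an entity never appears in the cleaned graph. The sketch below shows one way to build the same concatenated features while setting such pairs aside. A plain dict of 4-dimensional toy vectors stands in for `model.wv`, and the `build_features` helper and all identifiers are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for model.wv: a plain dict of 4-dimensional toy vectors (hypothetical values)
embeddings = {
    "Compound_C1": np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float32),
    "Compound_C2": np.array([0.0, 1.0, 0.0, 0.0], dtype=np.float32),
    "Disease_D1": np.array([0.0, 0.0, 1.0, 0.0], dtype=np.float32),
}

positives = pd.DataFrame({"drug_id": ["Compound_C1"], "ind_id": ["Disease_D1"]})
negatives = pd.DataFrame({
    "drug_id": ["Compound_C2", "Compound_C2"],
    "ind_id": ["Disease_D1", "Disease_D9"],  # Disease_D9 has no vector
})


def build_features(pos, neg, vectors):
    """Concatenate drug and disease vectors per pair; set aside pairs with a missing vector."""
    pairs = pd.concat([pos.assign(label=1), neg.assign(label=0)], ignore_index=True)
    known = (pairs["drug_id"].map(lambda key: key in vectors)
             & pairs["ind_id"].map(lambda key: key in vectors))
    kept = pairs[known]
    X = np.stack([
        np.concatenate((vectors[d], vectors[i]))
        for d, i in zip(kept["drug_id"], kept["ind_id"])
    ])
    y = kept["label"].to_numpy()
    return X, y, pairs[~known]


X_toy, y_toy, skipped = build_features(positives, negatives, embeddings)
print(X_toy.shape, y_toy, len(skipped))
```

Skipping pairs changes the size of the dataset, so whether to skip or to fail loudly is a design choice; the notebook itself fails with a `KeyError`.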

## 8. Build Test Features (`X_test.npy`) and Labels (`Y_test.npy`)

Finally, we repeat the same feature construction procedure for the **test** positive and negative interaction pairs:

- Load the positive and negative test sets.
- Concatenate the corresponding drug and disease embeddings.
- Assign labels (1 for positive, 0 for negative).
- Save `X_test` and `Y_test` as NumPy arrays.

In [None]:
# Step 7: Prepare Test Data
positive_test_path = "./data/positive_test_df.csv"
negative_test_path = "./data/negative_test_df.csv"
positive_test_df = pd.read_csv(positive_test_path)
negative_test_df = pd.read_csv(negative_test_path)

# Apply the same prefix cleaning as for the training pairs (a no-op if already normalized)
for test_df in (positive_test_df, negative_test_df):
    for key, value in prefix_replacements.items():
        test_df['drug_id'] = test_df['drug_id'].str.replace(key, value, regex=False)
        test_df['ind_id'] = test_df['ind_id'].str.replace(key, value, regex=False)

# Add positive and negative test samples to test set
X_test = []
Y_test = []

# Each feature vector is concat(drug_embedding, disease_embedding)
for _, row in positive_test_df.iterrows():
    X_test.append(np.concatenate((model.wv[row['drug_id']], model.wv[row['ind_id']])))
    Y_test.append(1)

for _, row in negative_test_df.iterrows():
    X_test.append(np.concatenate((model.wv[row['drug_id']], model.wv[row['ind_id']])))
    Y_test.append(0)

np.save("./output/X_test.npy", X_test)
np.save("./output/Y_test.npy", Y_test)

# Final Message
print("All tasks completed successfully!")
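
A quick way to confirm that the saved arrays have the expected layout is to reload them and compare shapes. The sketch below does this in a temporary folder with toy rows standing in for the concatenated embeddings; the values and the 8-dimensional size are hypothetical.

```python
import os
import tempfile

import numpy as np

# Toy feature rows standing in for the concatenated embeddings (hypothetical values)
X_rows = [np.zeros(8, dtype=np.float32), np.ones(8, dtype=np.float32)]
Y_rows = [1, 0]

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "X_test.npy")
    y_path = os.path.join(tmp_dir, "Y_test.npy")
    # np.save converts the lists to arrays before writing
    np.save(x_path, X_rows)
    np.save(y_path, Y_rows)

    X_loaded = np.load(x_path)
    Y_loaded = np.load(y_path)

# One row per pair, and one label per row
assert X_loaded.ndim == 2
assert X_loaded.shape[0] == Y_loaded.shape[0]
print(X_loaded.shape, Y_loaded.shape)
```

With the real outputs, the second dimension of `X_test` should be twice the embedding size, since each row concatenates a drug vector and a disease vector.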

## 9. Notes

- The `./data/` folder should contain all input CSV files:
  - `Knowledge_graph.csv`
  - `positive_train_df.csv`, `negative_train_df.csv`
  - `positive_test_df.csv`, `negative_test_df.csv`
- The `./output/` folder must exist before you run the notebook (the code does not create it) and will store:
  - `train_df_Word2Vec.csv`
  - `model1.model`
  - `X.npy`, `y.npy`
  - `X_test.npy`, `Y_test.npy`
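
The folder layout above can be checked before running the pipeline. The following is a minimal sketch, assuming a hypothetical `prepare_workspace` helper that creates the output folder if needed and reports missing inputs. It is demonstrated in a throwaway directory; in the notebook you would pass `"./data"` and `"./output"` instead.

```python
import tempfile
from pathlib import Path

# Input files listed in the notes above
REQUIRED_INPUTS = [
    "Knowledge_graph.csv",
    "positive_train_df.csv", "negative_train_df.csv",
    "positive_test_df.csv", "negative_test_df.csv",
]


def prepare_workspace(data_dir, output_dir):
    """Create the output folder if needed and return the names of missing input files."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    return [name for name in REQUIRED_INPUTS if not (Path(data_dir) / name).is_file()]


# Demonstration in a throwaway directory with only one of the inputs present
with tempfile.TemporaryDirectory() as root:
    data_dir = Path(root) / "data"
    data_dir.mkdir()
    (data_dir / "Knowledge_graph.csv").touch()
    missing = prepare_workspace(data_dir, Path(root) / "output")
    output_created = (Path(root) / "output").is_dir()

print("Missing inputs:", missing)
```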

You can now use these NumPy arrays as input to your downstream models (e.g., IDC_Conv1D or other classifiers).