## Part 1. 
Task:  Analyze node types, edges, and relationships. Provide basic statistics, including node and edge counts and any significant relationships. 

## Dataset Description
1. Each node has an `id` and a `label` (Dataset, Publication, or Scientific Keyword). \
2. The last column in nodes.csv, `properties`, gives further information about each node.

Conclusion: the `properties` column should be encoded and used as node features.
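
The statistics requested in the task can be sketched with a few pandas calls. The snippet below is a minimal illustration on a tiny made-up frame: the column names (`label`, `source`, `target`, `relationship_type`) follow the ones used in this notebook, but the rows and the helper name `basic_graph_stats` are invented for the example.

```python
import pandas as pd

# Toy stand-ins for nodes.csv and train_edges.csv (rows are invented)
toy_nodes = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "label": ["Dataset", "Dataset", "Publication", "ScienceKeyword"],
})
toy_edges = pd.DataFrame({
    "source": [2, 2, 0, 1],
    "target": [0, 1, 3, 3],
    "relationship_type": ["USES_DATASET", "USES_DATASET",
                          "HAS_SCIENCEKEYWORD", "HAS_SCIENCEKEYWORD"],
})

def basic_graph_stats(nodes, edges):
    """Return node/edge counts plus per-label and per-relationship counts."""
    return {
        "num_nodes": len(nodes),
        "num_edges": len(edges),
        "nodes_per_label": nodes["label"].value_counts().to_dict(),
        "edges_per_type": edges["relationship_type"].value_counts().to_dict(),
    }

stats = basic_graph_stats(toy_nodes, toy_edges)
print(stats)
```

Running the same function on the real frames, before the labels are integer-encoded, would give the counts per node type and per relationship type.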

In [32]:
# Keep all imports in this cell
import pandas as pd
import torch
from torch_geometric.data import Data
from sklearn.preprocessing import LabelEncoder
import ast
from sentence_transformers import SentenceTransformer

## Solution

The dataset needs to be represented as a PyG `Data` object. However, the node properties are text, so they are converted to embeddings before being used as features.
1. Encode the node labels as integers. \
2. Generate representations of the node properties to use as node features. Two options:
    a. one-hot encoding
    b. text embeddings (used below)
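
For option (a), one-hot encoding is mainly practical for properties that take a small set of categorical values. A minimal sketch, assuming a hypothetical `platform_type` property that is not taken from the dataset:

```python
import pandas as pd

# Hypothetical categorical property; the values are invented for illustration
toy_props = pd.DataFrame({"platform_type": ["satellite", "aircraft", "satellite"]})

# One column per distinct value, 1.0 where the row has that value
one_hot = pd.get_dummies(toy_props["platform_type"], dtype=float)

print(one_hot.columns.tolist())
print(one_hot.values.tolist())
```

Free-text properties tend to have nearly unique values, so one-hot encoding them would likely produce close to one column per node. That is the motivation for option (b).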


In [None]:

# Load the model once
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

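# Convert a node's properties dict into a text embedding
def properties_to_embedding(properties):
    # Encode an empty string when properties are missing or empty, so every
    # node gets a vector of the same size and the later torch.tensor() call
    # does not fail on None entries
    if not properties:
        return model.encode("")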
    
    # Convert all properties to a single string
    # Use key-value pairs to preserve context
    property_strings = [f"{key}: {value}" for key, value in properties.items() if value]
    
    # Join the properties into a single string
    full_text = " | ".join(property_strings)
    
    # Generate embedding
    return model.encode(full_text)

# Helper function for getting relationship embeddings
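
To make the text that `properties_to_embedding` sends to the model concrete, the sketch below reproduces only the string-building step, without loading the model. The helper name `properties_to_text` and the example dict are invented for illustration.

```python
def properties_to_text(properties):
    """Flatten a properties dict into a 'key: value | key: value' string."""
    # Skip falsy values, mirroring the `if value` filter used above
    parts = [f"{key}: {value}" for key, value in properties.items() if value]
    return " | ".join(parts)

# Hypothetical node properties; keys and values are invented
example = {"title": "Sea Ice Extent", "year": 2020, "abstract": ""}
text = properties_to_text(example)
print(text)  # title: Sea Ice Extent | year: 2020
```

Note that the `if value` filter drops every falsy value, so a numeric property equal to 0 or a boolean `False` is omitted along with empty strings.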


In [36]:
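nodes_df = pd.read_csv("Dataset/nodes.csv")
# Encode the string node labels as integer class ids
Enc = LabelEncoder()
nodes_df['label'] = Enc.fit_transform(nodes_df['label'])
# The properties column is read as a string; parse it into a dict,
# then replace each dict with its text embedding
nodes_df["properties"] = nodes_df["properties"].apply(ast.literal_eval)
nodes_df['properties'] = nodes_df['properties'].apply(properties_to_embedding)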

In [37]:
nodes_df.head()

| Unnamed: 0 | id | label | properties |
|---|---|---|---|
| 0 | 0 | 1 | [-0.056339294, -0.07814679, -0.024745723, -0.0... |
| 1 | 1 | 1 | [-0.0901361, -0.056588154, 0.023120109, 0.0231... |
| 2 | 2 | 1 | [0.028579466, 0.02933162, 0.017187452, 0.01399... |
| 3 | 3 | 1 | [-0.06625115, -0.034307364, 0.070921, -0.03396... |
| 4 | 4 | 1 | [0.015955979, -0.021838093, 0.0187536, -0.0480... |


In [43]:
# Extract node labels
node_labels = torch.tensor(nodes_df['label'].values, dtype=torch.long)

# Extract node embeddings
node_features = torch.tensor(nodes_df['properties'].tolist(), dtype=torch.float)


# Handling edges requires a bit of manipulation

# Load edges
train_edges_df = pd.read_csv("Dataset/train_edges.csv")
val_edges_df = pd.read_csv("Dataset/val_links.csv")
test_edges_df = pd.read_csv("Dataset/test_links.csv")

# Create a LabelEncoder for relationship types
relationship_encoder = LabelEncoder()

# Fit and transform the relationship types
train_edges_df['relationship_encoded'] = relationship_encoder.fit_transform(train_edges_df['relationship_type'])

# Convert edges to tensor
train_edges = torch.tensor(train_edges_df[['source', 'target']].values.T, dtype=torch.long)
val_edges = torch.tensor(val_edges_df[['source', 'target']].values.T, dtype=torch.long)
test_edges = torch.tensor(test_edges_df[['source', 'target']].values.T, dtype=torch.long)

# Convert encoded relationships to tensor
train_edge_attrs = torch.tensor(train_edges_df['relationship_encoded'].values, dtype=torch.long)

# Create graph data object
data = Data(
    x=node_features,          # Node features
    y=node_labels,            # Node labels
    edge_index=train_edges,   # Edge indices
    edge_attr=train_edge_attrs,  # Encoded edge attributes
    val_pos_edge_index=val_edges,
    test_pos_edge_index=test_edges,
)

print(data)

# Keep track of the integer-to-relationship mapping
relationship_mapping = dict(enumerate(relationship_encoder.classes_))
print("Relationship Mapping:", relationship_mapping)

Data(x=[5763, 384], edge_index=[2, 13820], edge_attr=[13820], y=[5763], val_pos_edge_index=[2, 860], test_pos_edge_index=[2, 861])
Relationship Mapping: {0: 'HAS_DATASET', 1: 'HAS_INSTRUMENT', 2: 'HAS_PLATFORM', 3: 'HAS_SCIENCEKEYWORD', 4: 'OF_PROJECT', 5: 'SUBCATEGORY_OF', 6: 'USES_DATASET'}
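
The edge tensors above use the `source` and `target` values directly as row indices into `x`, which is only valid if every node `id` equals its row position in nodes.csv. The `head()` output shows matching values for the first five rows only, so a cheap check may be worthwhile. A minimal sketch on invented data, with a hypothetical helper `remap_edges` that maps ids to row positions:

```python
import pandas as pd

def remap_edges(nodes, edges):
    """Map node ids in `edges` to row positions in `nodes`.

    Raises ValueError if an edge references an id missing from `nodes`.
    """
    id_to_row = {node_id: row for row, node_id in enumerate(nodes["id"])}
    unknown = (set(edges["source"]) | set(edges["target"])) - set(id_to_row)
    if unknown:
        raise ValueError(f"edges reference unknown node ids: {sorted(unknown)}")
    return pd.DataFrame({
        "source": edges["source"].map(id_to_row),
        "target": edges["target"].map(id_to_row),
    })

# Invented example where ids are not 0..N-1
toy_nodes = pd.DataFrame({"id": [10, 20, 30]})
toy_edges = pd.DataFrame({"source": [10, 30], "target": [20, 10]})

remapped = remap_edges(toy_nodes, toy_edges)
print(remapped.values.T.tolist())  # [[0, 2], [1, 0]]
```

If the ids already match the row positions, the remapping leaves the edges unchanged, so applying it before building the tensors would be harmless.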
