In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
# Data loading
df_to = pd.read_pickle("/content/drive/MyDrive/Projects/semantic_alignment_ontology/data/to_embeddings.pkl")
df_go = pd.read_pickle("/content/drive/MyDrive/Projects/semantic_alignment_ontology/data/go_embeddings.pkl")

In [39]:
# Double check if the gene terms are there...
df_go[df_go["Label"].str.lower() == "Unidimensional cell growth".lower()]

Unnamed: 0,URI,Label,Definition,Text,Embedding
7646,http://purl.obolibrary.org/obo/GO_0009826,unidimensional cell growth,The process in which a cell irreversibly incre...,unidimensional cell growth. The process in whi...,"[-0.021255095, -0.03466263, -0.0096545415, 0.0..."


In [53]:
# Double check if the plant trait terms are there...
df_to[df_to["Label"].str.lower() == "root hair density".lower()]

Unnamed: 0,URI,Label,Definition,Text,Embedding
1021,http://purl.obolibrary.org/obo/TO_0001051,root hair density,A root morphology trait (TO:0000043) which is ...,root hair density. A root morphology trait (TO...,"[0.055047, -0.069994874, -0.05368018, 0.011975..."


## Step 1: Small Demo: Vector embedding similarity aligns with biological sense (or not?)
In this step we take several example terms from the Gene Ontology (GO) and the Plant Trait Ontology (TO) and test whether the cosine similarity between each gene label and trait label agrees with biological sense and intuition.

The rationale is a small experiment: a structured comparison between gene terms and plant trait terms using the cosine similarity of their text embeddings. We select examples where a gene and a trait are either **biologically positively correlated**, **semantically but not biologically related**, or **biologically negatively correlated**.

The goal is to test whether language models can **capture meaningful biological relationships purely through text-distributional semantics**. It is also important to make sure that the biological correlation is not confounded by contextual linguistic associations. I therefore introduced **semantic controls**, i.e., plant traits that are semantically linked to a gene term without being functionally associated with it (for instance, a trait that often appears alongside the gene term in texts, or one that overlaps semantically with the biologically related trait term).

I chose three gene terms, each with one functionally positively associated and one functionally negatively associated plant trait. A semantic control for each gene term was also included. The terms picked are explained below:

1. **_Response to abscisic acid (ABA)_**
* Biologically positively correlated: **_leaf water potential_**
(ABA signaling increases leaf water potential by closing stomata and reducing water loss)
* Semantically relevant but biologically uncorrelated: **_leaf size_**
(Related to leaf physiology, but not consistently influenced by ABA response pathways)
* Biologically negatively correlated: **_transpiration rate_**
(ABA causes stomatal closure, directly reducing transpiration rate)


2. **_Response to salt stress_**
* Biologically positively correlated: **_salt tolerance_**
(Salt stress response genes (e.g., SOS1, NHX) are directly linked to salt tolerance traits)
* Semantically relevant but biologically uncorrelated: **_chlorophyll content_**
(Often measured under stress, but not directly regulated by salt stress pathways)
* Biologically negatively correlated: **_shoot dry weight_**
(High salt stress often reduces shoot biomass due to osmotic stress)


3. **_Unidimensional cell growth_**
* Biologically positively correlated: **_root hair length_**
(Root hairs elongate via tip growth; unidimensional growth is essential for their development)
* Semantically relevant but biologically uncorrelated: **_root length_**
(Semantically close, but whole-root elongation involves radial expansion and is regulated differently)
* Biologically negatively correlated: **_root hair density_**
(In some cases, when individual hairs grow longer, fewer hairs are produced, as a developmental trade-off)

In [55]:
# First list gene and trait label examples and corresponding intuitive hypothesis to test:
test_examples = [
    ("response to abscisic acid", "leaf water potential", "positive"),
    ("response to abscisic acid", "leaf size", "neutral"),
    ("response to abscisic acid", "transpiration rate", "negative"),
    ("response to salt stress", "salt tolerance", "positive"),
    ("response to salt stress", "chlorophyll content", "neutral"),
    ("response to salt stress", "shoot dry weight", "negative"),
    ("unidimensional cell growth", "root hair length", "positive"),
    ("unidimensional cell growth", "root length", "neutral"),
    ("unidimensional cell growth", "root hair density", "negative")]

In [5]:
# Function to retrieve the embeddings
def get_embedding(label, df):
  row = df[df["Label"].str.lower() == label.lower()]
  if row.empty:
    return None
  return row["Embedding"].values[0]

In [56]:
# Compute cosine similarity per pair
results = []
for gene, trait, hypo in test_examples:
  gene_emb = get_embedding(gene, df_go)
  trait_emb = get_embedding(trait, df_to)
  if gene_emb is not None and trait_emb is not None:
    score = cosine_similarity([gene_emb], [trait_emb])[0][0]
    results.append((gene, trait, hypo, score))
  else:
    results.append((gene, trait, hypo, None))  # if embedding is missing
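
For reference, the score computed in the loop above is the cosine of the angle between two embedding vectors, i.e. their dot product divided by the product of their norms. The sketch below recomputes it with plain NumPy on toy two-dimensional vectors. The helper name `cosine` and the vectors `a`, `b`, `c` are illustrative assumptions and are not part of the notebook's data.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for illustration only (not real embeddings)
a = [1.0, 0.0]
b = [1.0, 1.0]
c = [-1.0, 0.0]

sim_aa = cosine(a, a)  # same direction
sim_ab = cosine(a, b)  # 45 degrees apart
sim_ac = cosine(a, c)  # opposite directions

print(sim_aa, round(sim_ab, 4), sim_ac)
```

The score is symmetric and only measures how closely two directions agree, so by itself it carries no notion of whether one term promotes or suppresses the other.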


In [57]:
df_results = pd.DataFrame(results, columns=["Gene", "Trait", "Hypothetical Relationship", "Score"])
df_results

Unnamed: 0,Gene,Trait,Hypothetical Relationship,Score
0,response to abscisic acid,leaf water potential,positive,0.199644
1,response to abscisic acid,leaf size,neutral,0.088481
2,response to abscisic acid,transpiration rate,negative,0.220608
3,response to salt stress,salt tolerance,positive,0.514879
4,response to salt stress,chlorophyll content,neutral,0.229908
5,response to salt stress,shoot dry weight,negative,0.114959
6,unidimensional cell growth,root hair length,positive,0.155698
7,unidimensional cell growth,root length,neutral,0.179412
8,unidimensional cell growth,root hair density,negative,0.186883


Interim conclusion:

From the examples above, there does not appear to be a systematic relationship in vector embeddings between gene and plant trait terms that can reliably predict functional gene-trait associations. The cosine similarities did not consistently differ across the biologically positive, neutral, and negative gene-trait pairs. Notably, even pairs with hypothetical negative biological correlations yielded positive similarity scores, sometimes exceeding those of positively correlated pairs.
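
One simple way to make this check explicit is to ask, for each gene term, which hypothesis category received the highest score. The sketch below does that with the scores copied from the results table. The names `rows`, `df_check`, `top`, and `n_expected` are introduced here for illustration only.

```python
import pandas as pd

# Scores copied from the results table in this notebook
rows = [
    ("response to abscisic acid", "leaf water potential", "positive", 0.199644),
    ("response to abscisic acid", "leaf size", "neutral", 0.088481),
    ("response to abscisic acid", "transpiration rate", "negative", 0.220608),
    ("response to salt stress", "salt tolerance", "positive", 0.514879),
    ("response to salt stress", "chlorophyll content", "neutral", 0.229908),
    ("response to salt stress", "shoot dry weight", "negative", 0.114959),
    ("unidimensional cell growth", "root hair length", "positive", 0.155698),
    ("unidimensional cell growth", "root length", "neutral", 0.179412),
    ("unidimensional cell growth", "root hair density", "negative", 0.186883),
]
df_check = pd.DataFrame(rows, columns=["Gene", "Trait", "Hypothesis", "Score"])

# Row with the highest similarity for each gene term
best_idx = df_check.groupby("Gene")["Score"].idxmax()
top = df_check.loc[best_idx, ["Gene", "Trait", "Hypothesis", "Score"]]
print(top.to_string(index=False))

# If similarity tracked biology, the "positive" pair would rank first every time
n_expected = int((top["Hypothesis"] == "positive").sum())
print(f"positive pair ranked first for {n_expected} of {len(top)} gene terms")
```

With these numbers the positive pair ranks first for only one of the three gene terms (response to salt stress / salt tolerance). That pair also shares the word "salt" in both labels, so the high score may partly reflect lexical overlap rather than biology.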

However, because this was only a toy demo covering three gene terms (nine curated gene-trait pairs), it is too early to conclude that distributional semantic models contribute little to functional prediction between genes and traits. Both the validity of the examples and the scalability of the experiment need input from domain experts (notably, it was difficult to find gene terms with plant traits that fit the design of the experiment).

These results nonetheless highlight an important consideration: **context may heavily influence embedding representations**. Positively correlated, negatively correlated, and merely semantically associated gene-trait pairs can all co-occur frequently in similar textual environments. As a result, they may appear equally "similar" in vector space despite their opposing biological roles, which makes them difficult to disentangle using distributional semantics alone.
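
The effect can be illustrated with a deliberately simplified sketch: sentence-level co-occurrence counts on an invented four-sentence corpus. This is not how the embeddings in this notebook were produced. The corpus, the helper `context_vector`, and the choice of count-based vectors are all assumptions made for illustration.

```python
from collections import Counter
import numpy as np

# Hypothetical toy corpus: invented sentences, not real literature
corpus = [
    "aba signaling raises drought tolerance in leaves",
    "aba signaling lowers transpiration rate in leaves",
    "salt stress raises ion content in roots",
    "salt stress lowers shoot biomass in roots",
]

def context_vector(target, sentences, vocab):
    """Count the words that share a sentence with `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return np.array([counts[w] for w in vocab], dtype=float)

vocab = sorted({t for s in corpus for t in s.split()})
v_raises = context_vector("raises", corpus, vocab)
v_lowers = context_vector("lowers", corpus, vocab)

sim = float(v_raises @ v_lowers /
            (np.linalg.norm(v_raises) * np.linalg.norm(v_lowers)))
print(round(sim, 3))  # about 0.714 for this toy corpus
```

"raises" and "lowers" have opposite meanings, yet they share most of their contexts in this corpus, so their context vectors point in similar directions.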

This raises a broader question: if language models rely primarily on contextual co-occurrence, can they meaningfully contribute to functional discovery in biology, where relationships often require **directionality**, **mechanistic interpretation**, or even **causality**? A reasonable next step could be to incorporate structured information, such as **knowledge graphs**, which encode logical and causal relationships between entities. Training vector representations on these structured sources may offer a more biologically faithful foundation for modeling gene-trait associations.
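
As a minimal sketch of what such structured information could look like, the hypotheses from this notebook can be written as signed (subject, relation, object) triples. The relation names `positively_regulates` and `negatively_regulates`, the `Triple` type, and the helper `relation_sign` are hypothetical; the edges restate this notebook's hypotheses and do not come from a curated knowledge graph.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "relation", "obj"])

# Hypothetical edges restating the hypotheses above (not curated facts)
triples = [
    Triple("response to abscisic acid", "positively_regulates", "leaf water potential"),
    Triple("response to abscisic acid", "negatively_regulates", "transpiration rate"),
    Triple("response to salt stress", "positively_regulates", "salt tolerance"),
    Triple("response to salt stress", "negatively_regulates", "shoot dry weight"),
]

# Map each (assumed) relation name to a direction
SIGN = {"positively_regulates": 1, "negatively_regulates": -1}

def relation_sign(gene, trait, kg):
    """Return +1 or -1 for a known edge, or None when no edge exists."""
    for t in kg:
        if t.subject == gene and t.obj == trait:
            return SIGN[t.relation]
    return None

print(relation_sign("response to abscisic acid", "transpiration rate", triples))
print(relation_sign("response to abscisic acid", "leaf size", triples))
```

Unlike a cosine score, this representation distinguishes a positive edge from a negative one and leaves the semantic control ("leaf size") without an edge.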