
While Transformer-based foundation models like BERT [1] excel at capturing contextual and semantic information from sequential data, Graph Neural Networks (GNNs) are particularly adept at capturing structural and relational information from graph data. By combining their strengths, it is possible to create richer and more holistic representations for tasks requiring both semantic understanding and structural reasoning.

# Problem Statement & Objective

In a recent project, my team developed a multitask BERT model that handles three downstream tasks: sentiment analysis, paraphrase detection, and semantic textual similarity. The model has three branches, one per task, which share the same BERT layers and then diverge into separate fully connected layers. We observed that sentiment analysis, which operates on single sentences, underperformed significantly compared to the other two tasks, which operate on sentence pairs.

In this project, I explore whether Graph Neural Networks (GNNs) can enhance single-sentence sentiment analysis.

In [2]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading stanza-1.10.1-py3-none-any.whl (1.1 MB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.14.1 stanza-1.10.1


In [6]:
import numpy as np
import torch

In [3]:
import stanza
# Stanford NLP's Python library

# 1. Download English models (run once)
# The download includes everything needed to process English text: tokenizer,
# POS tagger, lemmatizer, dependency parser, etc.
stanza.download('en')

# 2. Build pipeline
# a) 'en' - Tells Stanza to use the English models downloaded above.
# b) tokenize - Splits raw text into sentences and tokens.
# c) mwt (multi-word token expansion) - Splits a multi-word token into its
#    syntactic words, e.g. French "du" → "de" + "le".
# d) pos (part-of-speech tagger) - Labels each word with universal (upos) and
#    treebank-specific (xpos) tags, e.g. "love" → VERB, "dog" → NOUN.
# e) lemma - Reduces words to their base form, e.g. "running" → "run", "better" → "good".
# f) depparse (dependency parser) - Finds grammatical relations between words, e.g.
#    "The dog chased the cat" → dog = subject, chased = root verb, cat = object.
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,depparse')


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!


In [4]:
import networkx as nx

def create_graph(sentence):
    '''
    Create a graph from a sentence.
    Input: sentence (str)
    Output: G (networkx.DiGraph)
    '''
    # 1️⃣ Create an empty directed graph to store words and relations.
    G = nx.DiGraph()

    # 2️⃣ Process the sentence using Stanza
    doc = nlp(sentence)

    # 3️⃣ Loop through each word in the first sentence
    for word in doc.sentences[0].words:
        # Stanza numbers the words of a sentence starting from 1 (1-based indexing).
        # The value 0 never names a real word: it is reserved for the artificial
        # ROOT, which appears only as the head of the sentence's root word.
        node_id = word.id - 1  # Convert to 0-based index for Python and NetworkX

        # 4️⃣ Add word as a node in the graph
        # Attributes:
        # • label → actual word text
        # • xpos → treebank-specific POS tag (like NN, VBD)
        G.add_node(node_id, label=word.text, xpos=word.xpos)

        # 5️⃣ Add edge: connect the head (parent) to the word
        if word.head != 0:  # head 0 = ROOT, so the root word gets no incoming edge
            # In dependency parsing, each word has a "head": the parent word it
            # depends on grammatically.
            # word.head = 0 → this word is the root of the sentence (usually the main verb).
            # word.head = ID of another word → this word depends on that word.
            parent_id = word.head - 1
            # Add a directed edge from the parent (head) to the current word,
            # labeled with the dependency relation.
            G.add_edge(parent_id, node_id, label=word.deprel)

    return G


# What ROOT Means in Dependency Parsing
In dependency parsing, every sentence has exactly one root word that all other words attach to, directly or indirectly. It is usually the main verb or action.

The root word has no real parent, so its head is an artificial node called ROOT.

Stanza marks the root word by setting word.head = 0 and word.deprel = root.

## 1. Sentence Example

"The cat chased the mouse"

## 2. What Happens in Stanza

When Stanza parses this sentence, it gives:

- Tokens (words)
- POS tags (part-of-speech)
- Dependency relations (who depends on whom)

| Word (word.text) | POS tag (word.xpos) | Head (word.head) | DepRel (word.deprel) |
|---|---|---|---|
| The | DT (determiner) | 2 | det (determiner) |
| cat | NN (noun) | 3 | nsubj (subject) |
| chased | VBD (verb) | 0 (ROOT) | root |
| the | DT (determiner) | 5 | det (determiner) |
| mouse | NN (noun) | 3 | obj (object) |

## 3. Build the Graph

We now convert this into a graph:

| Concept | Representation |
|---|---|
| Nodes | Words in the sentence |
| Node labels | Each node is labeled with its POS tag (e.g., NN, VBD) |
| Edges | A directed edge from head → dependent |
| Edge types | The type of dependency relation (e.g., nsubj, obj, det) |

## 4. Graph Example

Let's create it step-by-step.

Nodes

Each word is a node:

0: The
1: cat
2: chased
3: the
4: mouse


But node labels are POS tags:

0: DT
1: NN
2: VBD
3: DT
4: NN

Edges

Each edge runs from word.head to word.id (0-based node ids in parentheses):

| From (parent) | To (child) | Dependency label |
|---|---|---|
| cat (1) | The (0) | det |
| chased (2) | cat (1) | nsubj |
| chased (2) | mouse (4) | obj |
| mouse (4) | the (3) | det |

## 5. Visual Representation

         chased (VBD)
         /      \
      cat (NN)  mouse (NN)
       |           |
      The (DT)    the (DT)


Node label (inside parentheses) → POS Tag

Edge label → Dependency relation
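
The worked example can also be reproduced without running Stanza. The following is a minimal sketch that hard-codes the parse from the table in section 2 and applies the same rules as `create_graph` (0-based node ids, no edge for the root word). The names `PARSE` and `build_graph` are hypothetical and exist only for this illustration; plain dictionaries and lists stand in for the NetworkX graph.

```python
# Hand-written parse of "The cat chased the mouse".
# Each row is (id, text, xpos, head, deprel) with 1-based ids, as Stanza reports them.
PARSE = [
    (1, "The", "DT", 2, "det"),
    (2, "cat", "NN", 3, "nsubj"),
    (3, "chased", "VBD", 0, "root"),
    (4, "the", "DT", 5, "det"),
    (5, "mouse", "NN", 3, "obj"),
]


def build_graph(parse):
    """Return (nodes, edges, root) using 0-based node ids."""
    nodes = {}   # node_id -> {"label": text, "xpos": tag}
    edges = []   # (parent_id, child_id, deprel)
    root = None
    for word_id, text, xpos, head, deprel in parse:
        node_id = word_id - 1
        nodes[node_id] = {"label": text, "xpos": xpos}
        if head == 0:
            root = node_id  # head 0 marks the root word; no edge is added
        else:
            edges.append((head - 1, node_id, deprel))
    return nodes, edges, root


nodes, edges, root = build_graph(PARSE)
print("root:", nodes[root]["label"])
for parent, child, rel in edges:
    print(f"{nodes[parent]['label']} -> {nodes[child]['label']} ({rel})")
```

The printed edges should match the edge table in section 4, listed in word order rather than grouped by parent.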

# Node embeddings

Node embeddings map each node to a vector so that similarity between nodes is preserved in the embedding space. For instance, nodes that are close to each other in the original graph, such as those connected by an edge, should also have similar embeddings in the vector space.
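
One common way to quantify "similar embeddings" is cosine similarity. The sketch below is illustrative only: the three vectors are made up, and the assumption that nodes `a` and `b` share an edge while `c` is far away is part of the toy setup, not a result from this project.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, invented for illustration.
toy_embeddings = {
    "a": [0.9, 0.1, 0.0],  # assumed to be connected to "b" by an edge
    "b": [0.8, 0.2, 0.1],
    "c": [0.1, 0.9, 0.3],  # assumed to be far from "a" in the graph
}

sim_close = cosine_similarity(toy_embeddings["a"], toy_embeddings["b"])
sim_far = cosine_similarity(toy_embeddings["a"], toy_embeddings["c"])
print(f"a vs b: {sim_close:.3f}")
print(f"a vs c: {sim_far:.3f}")
```

A good embedding would give the neighboring pair a higher score than the distant pair, as these toy vectors do.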

Each word in the sentence is represented as a node in the graph. To initialize these nodes, I utilize pre-trained word embeddings from GloVe [3] (glove.42B.300d). This choice is motivated by GloVe’s ability to efficiently capture semantic relationships while remaining a straightforward and readily available resource.

I also experimented with the pre-trained ‘bert-base-uncased’ model for word embeddings, extracting each word’s representation from the last hidden state, but this approach underperformed compared to using GloVe embeddings.

In [7]:
GLOVE_PATH = "data/glove.42B.300d.txt"  # <-- Update this path to your GloVe file
node_embeddings = {}

# Load the GloVe word embeddings into a dictionary
with open(GLOVE_PATH, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split() # Split line into [word, dim1, dim2, ..., dim300]
        word = values[0]  # The actual word
        vector = np.asarray(values[1:], dtype='double')  # 300-dimensional vector
        node_embeddings[word] = vector

FileNotFoundError: [Errno 2] No such file or directory: 'data/glove.42B.300d.txt'

In [None]:
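def get_glove_embedding(words):
    # Stack the GloVe vectors for `words` into a (len(words), 300) float tensor.
    # Assumption: words missing from the GloVe vocabulary fall back to a zero vector.
    zero = np.zeros(300)
    embedding_vectors = np.array([node_embeddings.get(word, zero) for word in words])
    return torch.tensor(embedding_vectors, dtype=torch.float)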
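
The loading and lookup steps can be tried without the multi-gigabyte GloVe file. The sketch below parses a few made-up lines in the same text layout (a token followed by space-separated numbers) and looks words up with a fallback. The names `SAMPLE`, `load_vectors`, and `lookup` are hypothetical, the 4-dimensional vectors are invented, and both the zero-vector fallback and the lowercasing are assumptions rather than choices confirmed by this notebook. Plain lists are used so the example runs without NumPy or PyTorch.

```python
import io

# Three made-up lines in the GloVe text layout: a token followed by its
# vector components, separated by single spaces (4 dimensions instead of 300).
SAMPLE = io.StringIO(
    "the 0.1 0.2 0.3 0.4\n"
    "cat 0.5 0.6 0.7 0.8\n"
    "mouse 0.9 1.0 1.1 1.2\n"
)


def load_vectors(handle):
    """Parse GloVe-style lines into {token: list of floats}."""
    table = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table


def lookup(table, words, dim):
    """Return one vector per word; unknown words get a zero vector (an assumption)."""
    zero = [0.0] * dim
    # Lowercasing is an assumption: glove.42B.300d is distributed uncased.
    return [table.get(word.lower(), zero) for word in words]


vectors = load_vectors(SAMPLE)
rows = lookup(vectors, ["The", "cat", "chased"], dim=4)
for word, row in zip(["The", "cat", "chased"], rows):
    print(word, row)
```

Here "chased" is absent from the sample vocabulary, so it receives the zero vector instead of raising a `KeyError`.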