<a href="https://colab.research.google.com/github/IgnatiusEzeani/spatio-textual-colab-demos/blob/main/demo_2_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Classification with `spatio-textual`

In this demo, we explore the sentiment classification and analysis features within the `spatio-textual` package.

It defaults to a rule-based approach but also supports large language models (LLMs) and HuggingFace models.

---

## Setting up

### Downloads
As before, download the `spaCy` model and install the `spatio-textual` package.

In [None]:
!python -m spacy download en_core_web_trf
!pip install -q git+https://github.com/SpaceTimeNarratives/spatio-textual.git

### Imports  <a id='imports'></a>
Let's import what we need: `load_spacy_model` and `Annotator` from `spatio_textual.utils`, and `SentimentAnalyzer` from `spatio_textual.sentiment`.

We also need `pandas` for working with data frames.

In [None]:
import spatio_textual
from spatio_textual.utils import load_spacy_model, Annotator
from spatio_textual.sentiment import SentimentAnalyzer
import pandas as pd

### Load `spaCy` model and instantiate `Annotator`

In [None]:
nlp = load_spacy_model("en_core_web_trf")
ann = Annotator(nlp)

### Set up a HuggingFace transformer-based sentiment analysis pipeline

In [None]:
from transformers import pipeline
hf = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


### Set up a hook for LLM-based sentiment analysis

In [None]:
from spatio_textual.llm import LLMRouter

---
## Sentiment Classification

### Quick Demo  <a id='data-demo'></a>

In [None]:
texts = [
    "I felt safe and relieved when we reached the farmhouse.",
    "We were afraid, hungry, and cold during the march.",
    "They asked us questions.",
]
sa = SentimentAnalyzer("rule")
sa.predict(texts)


[{'label': 'positive', 'score': 0.32151273753163434},
 {'label': 'negative', 'score': -0.5827829453479102},
 {'label': 'neutral', 'score': 0.0}]
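
For intuition about what a rule-based backend does, here is a toy lexicon scorer. It is only a sketch and not `spatio-textual`'s actual rule set: the `TOY_LEXICON` weights, the `toy_rule_sentiment` name, and the `threshold` value are all assumptions made for illustration, so its scores differ from the ones above.

```python
import re

# Hypothetical mini-lexicon; the weights are illustrative, not taken from spatio-textual.
TOY_LEXICON = {"safe": 1.0, "relieved": 1.0, "afraid": -1.0, "hungry": -0.5, "cold": -0.5}


def toy_rule_sentiment(text, threshold=0.05):
    """Average the lexicon weights over all tokens and map the mean to a label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"label": "neutral", "score": 0.0}
    score = sum(TOY_LEXICON.get(tok, 0.0) for tok in tokens) / len(tokens)
    if score > threshold:
        label = "positive"
    elif score < -threshold:
        label = "negative"
    else:
        label = "neutral"
    return {"label": label, "score": score}


demo_texts = [
    "I felt safe and relieved when we reached the farmhouse.",
    "We were afraid, hungry, and cold during the march.",
    "They asked us questions.",
]
toy_preds = [toy_rule_sentiment(t) for t in demo_texts]
```

Words missing from the lexicon contribute zero, which is why the third sentence comes out neutral in this sketch.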

### Main Tutorial
#### 1. Annotate + attach sentiment
We can annotate the texts and attach a sentiment label and score to each record using `SentimentAnalyzer("rule")`, the default rule-based approach in `spatio-textual`.

In [None]:
recs = ann.annotate_texts(
    texts,
    file_id="sent_demo",  # Use whatever is relevant for your work
    include_text=True,    # Lets you include the text in the result
    include_verbs=True)   # Lets you extract verbs

sa = SentimentAnalyzer("rule")
preds = sa.predict([r["text"] for r in recs])

for r, p in zip(recs, preds):
    r.update({"sentiment_label": p["label"], "sentiment_score": p["score"]})

cols = ["segId", "entities", "verb_data", "text", "sentiment_label", "sentiment_score"]
pd.DataFrame([{k: r.get(k) for k in cols} for r in recs])


| | segId | entities | verb_data | text | sentiment_label | sentiment_score |
|---|---|---|---|---|---|---|
| 0 | 1 | [{'start_char': 45, 'token': 'farmhouse', 'tag... | [{'sent-id': 0, 'verb': 'felt', 'subject': 'I'... | I felt safe and relieved when we reached the f... | positive | 0.321513 |
| 1 | 2 | [] | [] | We were afraid, hungry, and cold during the ma... | negative | -0.582783 |
| 2 | 3 | [] | [{'sent-id': 0, 'verb': 'asked', 'subject': 'T... | They asked us questions. | neutral | 0.0 |
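
The attach step in this section relies on `zip`, which silently stops at the shorter input if the records and predictions ever fall out of step. A minimal sketch of a stricter helper follows. `attach_predictions` and its `prefix` parameter are hypothetical names, not part of `spatio-textual`; the sketch assumes only that each prediction is a dict with `label` and `score` keys, as in the outputs shown.

```python
def attach_predictions(records, predictions, prefix="sentiment"):
    """Return copies of `records` with `<prefix>_label` and `<prefix>_score` added."""
    if len(records) != len(predictions):
        raise ValueError(
            f"{len(records)} records but {len(predictions)} predictions"
        )
    merged = []
    for rec, pred in zip(records, predictions):
        row = dict(rec)  # shallow copy, so the input records are left untouched
        row[f"{prefix}_label"] = pred["label"]
        row[f"{prefix}_score"] = pred["score"]
        merged.append(row)
    return merged


# Stand-in record and prediction shaped like the ones in this notebook
example_recs = [{"segId": 3, "text": "They asked us questions."}]
example_preds = [{"label": "neutral", "score": 0.0}]
rows = attach_predictions(example_recs, example_preds, prefix="rule_sentiment")
```

Using a different `prefix` per backend keeps the rule, HF, and LLM columns side by side in one data frame instead of overwriting each other.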


#### 2. Using a HuggingFace pipeline
We can also use a transformer-based sentiment analysis model from HuggingFace.

Here we use the [twitter-roberta-base-sentiment-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) model from the [CardiffNLP](https://cardiffnlp.github.io/) team.

In [None]:
recs = ann.annotate_texts(
    texts,
    file_id="sent_demo",  # Use whatever is relevant for your work
    include_text=True,    # Lets you include the text in the result
    include_verbs=True)   # Lets you extract verbs

hf_sentiments = hf(texts)
for r, p in zip(recs, hf_sentiments):
    r.update({"hf_sentiment_label": p["label"],
              "hf_sentiment_score": p["score"]})

pd.DataFrame([{k:r.get(k) for k in [
    "segId","entities","verb_data","text",
    "hf_sentiment_label","hf_sentiment_score"]}
              for r in recs])

| | segId | entities | verb_data | text | hf_sentiment_label | hf_sentiment_score |
|---|---|---|---|---|---|---|
| 0 | 1 | [{'start_char': 45, 'token': 'farmhouse', 'tag... | [{'sent-id': 0, 'verb': 'felt', 'subject': 'I'... | I felt safe and relieved when we reached the f... | positive | 0.869196 |
| 1 | 2 | [] | [] | We were afraid, hungry, and cold during the ma... | negative | 0.84529 |
| 2 | 3 | [] | [{'sent-id': 0, 'verb': 'asked', 'subject': 'T... | They asked us questions. | neutral | 0.895591 |
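
Note that the HF score for the negative sentence is positive (0.84529): the pipeline reports its confidence in the predicted label, not a signed polarity like the rule backend. One way to use that confidence is to flag uncertain predictions for manual review. The sketch below works on plain dicts copied from the output above; `flag_low_confidence` and the `min_score` cut-off are assumptions for illustration, not part of `spatio-textual` or `transformers`.

```python
def flag_low_confidence(predictions, min_score=0.85):
    """Add a `needs_review` flag to predictions whose confidence is below `min_score`."""
    return [{**p, "needs_review": p["score"] < min_score} for p in predictions]


# Labels and scores copied from the HF output table
hf_example = [
    {"label": "positive", "score": 0.869196},
    {"label": "negative", "score": 0.84529},
    {"label": "neutral", "score": 0.895591},
]
flagged = flag_low_confidence(hf_example)
```

With the illustrative cut-off of 0.85, only the second prediction is flagged; a sensible threshold depends on your data and should be chosen by inspecting real outputs.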


#### 3. Hooking up an LLM for sentiment classification
`spatio-textual` has built-in LLM support for these providers and their models:

* **openai**: `gpt-4o-mini`
* **anthropic**: `claude-3-5-sonnet-20240620`
* **google**: `gemini-1.5-pro`
* **groq**: `llama3-70b-8192` (or Mixtral, etc.)
* **xai**: `grok-beta` (OpenAI-compatible; use `base_url="https://api.x.ai"`)
* **ollama**: `llama3:8b` (local)
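
The hosted providers need an API key, and the router cell in this section mentions environment variables as an alternative to pasting the key into the notebook. A small sketch of reading the key yourself and failing early when it is missing is shown here. The `ENV_KEYS` mapping and the `get_api_key` helper are hypothetical and not part of `spatio-textual`; only the four variable names listed in that cell are covered.

```python
import os

# Environment variable names as listed in the notebook's LLMRouter comments.
# Providers without an entry (e.g. xai, ollama) are not covered by this sketch.
ENV_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "groq": "GROQ_API_KEY",
}


def get_api_key(provider):
    """Look up the provider's key in the environment; fail early if it is missing."""
    env_name = ENV_KEYS.get(provider)
    if env_name is None:
        raise KeyError(f"No environment variable known for provider {provider!r}")
    key = os.environ.get(env_name, "")
    if not key:
        raise RuntimeError(f"Set {env_name} before creating the router")
    return key
```

The result can then be passed as `api_key=get_api_key("openai")` when constructing the router, which keeps the secret out of the saved notebook.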


In [None]:
router = LLMRouter(
    provider="openai",
    model="gpt-4o-mini",
    api_key="",  # Paste your key here, or use env vars:
                 # OPENAI_API_KEY / ANTHROPIC_API_KEY / GOOGLE_API_KEY / GROQ_API_KEY
    # Optional override:
    # base_url="https://api.x.ai",  # for OpenAI-compatible endpoints like xAI/Together
    temperature=0.0,
    max_tokens=64,
)

# Your existing ann pipeline
recs = ann.annotate_texts(
    texts,
    file_id="sent_demo",
    include_text=True,
    include_verbs=True
)

# Drop-in LLM sentiment
llm_sentiments = router.sentiment(texts, rate_limit_s=0.0)

for r, p in zip(recs, llm_sentiments):
    r.update({"llm_sentiment_label": p["label"], "llm_sentiment_score": p["score"]})

pd.DataFrame([{k:r.get(k) for k in [
    "segId","entities","verb_data","text",
    "llm_sentiment_label","llm_sentiment_score"]}
              for r in recs])

| | segId | entities | verb_data | text | llm_sentiment_label | llm_sentiment_score |
|---|---|---|---|---|---|---|
| 0 | 1 | [{'start_char': 45, 'token': 'farmhouse', 'tag... | [{'sent-id': 0, 'verb': 'felt', 'subject': 'I'... | I felt safe and relieved when we reached the f... | positive | 0.9 |
| 1 | 2 | [] | [] | We were afraid, hungry, and cold during the ma... | negative | 0.1 |
| 2 | 3 | [] | [{'sent-id': 0, 'verb': 'asked', 'subject': 'T... | They asked us questions. | neutral | 0.5 |
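
The three backends agree on the labels here, but their scores are not on the same scale: the rule score is signed, the HF score is a confidence in the predicted label, and the LLM scores in this run look like a 0 to 1 scale with 0.5 as neutral. That last reading is an assumption inferred from these three rows only, not documented behaviour. The sketch below maps all three to a signed value in [-1, 1] for rough comparison; `to_signed` is a hypothetical helper.

```python
def to_signed(label, score, backend):
    """Map a backend's (label, score) pair to a signed polarity in [-1, 1]."""
    if backend == "rule":
        return score  # already signed in the rule output
    if backend == "hf":
        # score is the confidence in `label`, so give it the label's sign
        sign = {"positive": 1.0, "negative": -1.0}.get(label, 0.0)
        return sign * score
    if backend == "llm":
        return 2.0 * score - 1.0  # ASSUMPTION: 0..1 scale with 0.5 as neutral
    raise ValueError(f"Unknown backend: {backend!r}")


# (label, score) pairs copied from the three output tables, one per text
outputs = {
    "rule": [("positive", 0.321513), ("negative", -0.582783), ("neutral", 0.0)],
    "hf": [("positive", 0.869196), ("negative", 0.84529), ("neutral", 0.895591)],
    "llm": [("positive", 0.9), ("negative", 0.1), ("neutral", 0.5)],
}
signed = {
    backend: [round(to_signed(label, score, backend), 3) for label, score in rows]
    for backend, rows in outputs.items()
}
# True where every backend assigned the same label to that text
labels_agree = [
    len({rows[i][0] for rows in outputs.values()}) == 1 for i in range(3)
]
```

The signed values are only loosely comparable, since a model confidence and a lexicon polarity measure different things; the label agreement check is the more robust comparison.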


## Tips & Troubleshooting  <a id='tips'></a>
- The rule backend runs offline and returns results immediately, but it is simplistic; the HF and LLM backends provide richer signals.
- Keep inputs as short segments for better classifier performance.
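
One way to keep inputs short is to split long passages before classification. The following is a naive sketch using a regular expression for sentence boundaries; `split_into_segments` and `max_chars` are hypothetical names, and a spaCy-based sentence splitter would likely handle abbreviations and quotations better.

```python
import re


def split_into_segments(text, max_chars=200):
    """Split on sentence-ending punctuation, then pack sentences into short segments."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    segments, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so close the segment
            segments.append(current)
            current = sent
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


long_text = (
    "I felt safe and relieved when we reached the farmhouse. "
    "We were afraid, hungry, and cold during the march. "
    "They asked us questions."
)
segments = split_into_segments(long_text, max_chars=60)
```

A single sentence longer than `max_chars` is kept whole as its own segment in this sketch. The resulting list can be passed to any of the backends in place of `texts`.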


## Summary  <a id='summary'></a>
You ran sentiment classification with the rule backend and saw how to plug in a HuggingFace pipeline and an LLM through `LLMRouter`.
