# Exploring Sentiment in the Data

This notebook is an exploratory analysis of sentiment in the dataset. Using VADER (Valence Aware Dictionary and sEntiment Reasoner), it computes sentiment scores for the sentences associated with the videos. The main goal is to understand how emotions are distributed in the data, identify any class imbalance, and prepare the data for further analysis or machine learning models.

### Main Goals:

- **Loading and preprocessing the raw data:** The data is read from Excel files and filtered to keep only videos that exist on disk and have a valid caption.
- **Computing sentiment scores:** Each sentence is analyzed with VADER to obtain a "compound" score, a single value between -1 and 1 that summarizes the polarity and intensity of its sentiment.
- **Classifying emotions:** The sentences are split into three categories (Positive, Negative and Neutral) by comparing the "compound" score against a symmetric threshold.
- **Visualizing the distribution of emotions:** A bar chart shows how the emotions are distributed in the data, alongside a histogram of the "compound" scores.
- **Saving the results:** The processed data, including the videos and their emotions, is saved to a CSV file for further analysis.
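
The classification step can be sketched as a small pure function. This is a minimal illustration rather than the notebook's exact code: the name `classify_compound` is hypothetical, and the default threshold of 0.34 is the value set later in this notebook.

```python
def classify_compound(compound: float, threshold: float = 0.34) -> str:
    """Map a VADER compound score in [-1, 1] to a coarse emotion label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Illustrative scores only, not taken from the dataset
labels = [classify_compound(s) for s in (0.80, -0.50, 0.10, 0.34, -0.34)]
print(labels)
```

Scores exactly on the threshold are assigned to the Positive or Negative class, because both comparisons are inclusive.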

### Why VADER?

VADER was chosen because it is a rule-based analyzer that accounts for cues such as punctuation, capitalization and intensifier words. This makes it well suited to short, informal sentences like the ones in this dataset.
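
For orientation, the "compound" value is a normalized sum of the valence scores of the words in a sentence. The sketch below reproduces that normalization step as it appears, to the best of our reading, in the reference `vaderSentiment` package (x / sqrt(x² + α) with α = 15). It is an illustration under that assumption, not a replacement for the library, and `normalize_valence` is a hypothetical name.

```python
import math


def normalize_valence(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the interval [-1, 1]."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp to guard against floating-point overshoot
    return max(-1.0, min(1.0, score))


# Illustrative raw sums only
for raw in (0.0, 1.9, 3.8, -3.8):
    print(f"{raw:+.1f} -> {normalize_valence(raw):+.4f}")
```

Doubling the raw sum does not double the compound score, which is why strongly worded sentences saturate close to the ends of the range.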

### Challenges and Limitations:

- **Linguistic ambiguity:** Some sentences carry ambiguous sentiment that VADER may not interpret correctly.
- **Classification thresholds:** The choice of thresholds used to classify emotions affects the results and may need further tuning.
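
The effect of the threshold can be explored by counting labels for several candidate values. The sketch below is a hedged illustration on made-up scores: `label_counts` and `toy_scores` are hypothetical names, 0.34 is the value used in this notebook, and 0.05 is the cut-off commonly suggested in the VADER documentation.

```python
from collections import Counter


def label_counts(compound_scores, cutoff):
    """Count Positive/Negative/Neutral labels for a symmetric cut-off."""
    counts = Counter({"Positive": 0, "Negative": 0, "Neutral": 0})
    for score in compound_scores:
        if score >= cutoff:
            counts["Positive"] += 1
        elif score <= -cutoff:
            counts["Negative"] += 1
        else:
            counts["Neutral"] += 1
    return counts


# Illustrative scores only, not taken from the dataset
toy_scores = [0.82, 0.40, 0.20, 0.05, 0.0, -0.10, -0.36, -0.70]

for candidate in (0.05, 0.34, 0.50):
    print(candidate, dict(label_counts(toy_scores, candidate)))
```

A wider neutral band moves borderline sentences out of the Positive and Negative classes, so the class balance should be checked again whenever the threshold changes.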

This step ensures that the data is well understood and ready to be used in the later stages of the EmoSign project.


In [25]:
# Step 1: Required libraries
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [26]:
# # Step 2: Function to compute the sentiment
# def calculate_sentiment(text):
#     analyzer = SentimentIntensityAnalyzer()
#     scores = analyzer.polarity_scores(text)
#     return scores

In [27]:
# sentences = [
#     # "VADER is smart, handsome, and funny.",  # positive sentence example
#     # "VADER is not smart, handsome, nor funny.",  # negation sentence example
#     # "VADER is smart, handsome, and funny!",  # punctuation emphasis handled correctly (sentiment intensity adjusted)
#     # "VADER is very smart, handsome, and funny.",  # booster words handled correctly (sentiment intensity adjusted)
#     # "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
#     # "VADER is VERY SMART, handsome, and FUNNY!!!",  # combination of signals - VADER appropriately adjusts intensity
#     # "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!",  # booster words & punctuation make this close to ceiling for score
#     # "The book was good.",  # positive sentence
#     # "The book was kind of good.",  # qualified positive sentence is handled correctly (intensity adjusted)
#     # "The plot was good, but the characters are uncompelling and the dialog is not great.",  # mixed negation sentence
#     # "At least it isn't a horrible book.",  # negated negative sentence with contraction
#     # "Make sure you :) or :D today!",  # emoticons handled
#     # "Today SUX!",  # negative slang with capitalization emphasis
#     # "Today only kinda sux! But I'll get by, lol",  # mixed sentiment example with slang and contrastive conjunction "but"
#     # "The weather is lousy today",  # mixed sentiment with contrastive conjunction
#     # "The family adopted those puppies. ",
#     "Last break, I completed all of my homework ahead of time. ",
#     "Last night I did a lot of homework. ",
#     "know that I should enter the haunted house at my own risk.",
#     "I love action movies, not boring dramas.",
#     "I watched a movie last night and the actor was good.",
#     "My teacher is a good advisior.",
#     "Alcohol is not always good.",
#     "I tend to exercise after class.",
#     "Last night, there was an accident and traffic was horrible.",
#     "I watched a movie last night. I was really scared.",
#     "My mom is getting old.",
#     "Age isn't important.",
#     "My ancestors are from Germany.",
#     "I watched a movie last night and the actor was good.",
#     "I was picked to join the club. Wow, it was an honor.",
#     "I love riding in airplanes and flying different places.",
#     "My ancestors are from Germany.",
#     "Last night, there was an accident and traffic was horrible.",
#     "My boss works in an office right above me.",
#     "I have an appointment this afternoon with my teacher.",
#     "I asked my teacher for some advice and we had a discussion.",
#     "I noticed my sister acting strange.",
#     "Last night I did a lot of homework.",
#     "My boss works in an office right above me.",
#     "I got a D in the class; I accepted it.",
#     "I watched a movie last night and the actor was good.",
#     "I noticed my sister acting strange.",
#     "I love action movies, not boring dramas.",
#     "I asked my teacher for some advice and we had a discussion.",
#     "My teacher is a good advisior.",
#     "I watched a movie last night. I was really scared.",
#     "This afternoon I'm going to the store.",
# ]

In [28]:
# # Step 3: Extract the sentiment scores
# analyzer = SentimentIntensityAnalyzer()
# for sentence in sentences:
#     vs = analyzer.polarity_scores(sentence)
#     print("{:-<65} {}".format(sentence, str(vs)))

In [29]:
# Read the Excel file and build a mapping from video names to captions
import os

import pandas as pd

# Select the split: 'train', 'val' or 'test'
mode = "train"  # Change to 'val' or 'test' as needed

# Paths to the Excel file and the video folder
excel_path = f"../data/raw/{mode}/how2sign_{mode}.xlsx"
video_folder = f"../data/raw/{mode}/raw_videos_front_{mode}"

# Read the Excel file
print("Reading the Excel file...")
df = pd.read_excel(excel_path)

# Show the first rows to inspect the structure
# print("First 5 rows of the dataset:")
# print(df.head())
print(f"\nAvailable columns: {list(df.columns)}")
print(f"Total number of rows: {len(df)}")

# The video names are expected in the SENTENCE_NAME column
if "SENTENCE_NAME" not in df.columns:
    raise KeyError("Column 'SENTENCE_NAME' not found in the Excel file")

# Keep names and captions aligned: drop incomplete rows once, on both columns
# (the captions are assumed to be in the last column)
pairs = df[["SENTENCE_NAME", df.columns[-1]]].dropna()
video_names = pairs.iloc[:, 0].tolist()
sentences = pairs.iloc[:, 1].tolist()

# Keep only the videos that exist in the raw_videos folder
existing_videos = set(os.listdir(video_folder))
video_caption_mapping = {}
for video, caption in zip(video_names, sentences):
    if f"{video}.mp4" in existing_videos:
        video_caption_mapping[video] = caption

print(
    f"\nCreated mapping for {len(video_caption_mapping)} video-caption pairs (existing videos only)"
)
print("Mapping examples:")
for i, (video, caption) in enumerate(list(video_caption_mapping.items())[:3]):
    print(f"{i+1}. Video: {video}")
    print(f"   Caption: {caption[:100]}...")

Reading the Excel file...

Available columns: ['VIDEO_ID', 'VIDEO_NAME', 'SENTENCE_ID', 'SENTENCE_NAME', 'START', 'END', 'SENTENCE']
Total number of rows: 31165

Created mapping for 2147 video-caption pairs (existing videos only)
Mapping examples:
1. Video: --7E2sU6zP4_10-5-rgb_front
   Caption: And I call them decorative elements because basically all they're meant to do is to enrich and color...
2. Video: --7E2sU6zP4_11-5-rgb_front
   Caption: So they don't really have much of a symbolic meaning other than maybe life is richer, life is beauti...
3. Video: --7E2sU6zP4_12-5-rgb_front
   Caption: Now this is very, this is actually an insert of a kind of an envelope for stationary, and this is a ...
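
The filtering logic can be tried without the real dataset. The following sketch uses a temporary folder and made-up rows (all file names, captions and the helper `build_mapping` are hypothetical) to show how a name is kept only when its caption is present and its `.mp4` file exists.

```python
import os
import tempfile

# Hypothetical rows: (video name, caption); None marks a missing caption
rows = [
    ("clip_a", "First caption."),
    ("clip_b", None),
    ("clip_c", "Third caption."),
]


def build_mapping(rows, video_folder):
    """Map video names to captions, keeping only complete rows with a file on disk."""
    existing = set(os.listdir(video_folder))
    return {
        name: caption
        for name, caption in rows
        if name is not None and caption is not None and f"{name}.mp4" in existing
    }


with tempfile.TemporaryDirectory() as folder:
    # Create empty placeholder files for two of the three clips
    for name in ("clip_a", "clip_b"):
        open(os.path.join(folder, f"{name}.mp4"), "w").close()
    demo_mapping = build_mapping(rows, folder)

print(demo_mapping)
```

Filtering row by row matters here: dropping missing values from each column separately and then zipping the two lists would pair the caption of `clip_c` with `clip_b`.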


In [30]:
# Step 4: Compute the sentiment of each video-caption pair and group by category
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # Non-interactive backend: figures are saved to disk, not displayed
import matplotlib.pyplot as plt

analyzer = SentimentIntensityAnalyzer()

threshold = 0.34  # Compound >= threshold is Positive, <= -threshold is Negative

# Compute the sentiment of each caption
video_sentiment_data = []
for video_name, caption in video_caption_mapping.items():
    scores = analyzer.polarity_scores(caption)
    compound_score = scores["compound"]

    # Determine the category
    if compound_score >= threshold:
        emotion = "Positive"
    elif compound_score <= -threshold:
        emotion = "Negative"
    else:
        emotion = "Neutral"

    video_sentiment_data.append(
        {
            "video_name": video_name,
            "caption": caption,
            "compound_score": compound_score,
            "emotion": emotion,
            "scores": scores,
        }
    )

# Group by category
positive_examples = [
    item for item in video_sentiment_data if item["emotion"] == "Positive"
]
negative_examples = [
    item for item in video_sentiment_data if item["emotion"] == "Negative"
]
neutral_examples = [
    item for item in video_sentiment_data if item["emotion"] == "Neutral"
]

# Count the occurrences of each label
emotion_counts = Counter([item["emotion"] for item in video_sentiment_data])

print("Sentiment distribution:")
print(f"Positive: {len(positive_examples)} examples")
print(f"Negative: {len(negative_examples)} examples")
print(f"Neutral: {len(neutral_examples)} examples")
print(f"Total: {len(video_sentiment_data)} examples")

# Show a few examples for each category
print("\nPositive examples:")
for example in positive_examples[:3]:
    print(f"Video: {example['video_name']}, Caption: {example['caption'][:200]}+++")

print("\nNegative examples:")
for example in negative_examples[:3]:
    print(f"Video: {example['video_name']}, Caption: {example['caption'][:200]}+++")

print("\nNeutral examples:")
for example in neutral_examples[:3]:
    print(f"Video: {example['video_name']}, Caption: {example['caption'][:200]}+++")

# Create the bar chart with a fixed category order, so each color matches its label
# (Counter keys follow first-seen order, which depends on the data)
categories = ["Positive", "Negative", "Neutral"]
plt.figure(figsize=(10, 6))
bars = plt.bar(
    categories,
    [emotion_counts[c] for c in categories],
    color=["green", "red", "gray"],
)
plt.title("Distribution of emotions")
plt.xlabel("Emotion")
plt.ylabel("Number of videos")

# Add the count on top of each bar
for bar in bars:
    height = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width() / 2.0,
        height,
        f"{int(height)}",
        ha="center",
        va="bottom",
        fontsize=10,
    )

# Show the threshold used as text in the top-right corner
plt.text(
    0.95,
    0.95,
    f"Threshold: |{threshold}|",
    transform=plt.gca().transAxes,
    fontsize=10,
    verticalalignment="top",
    horizontalalignment="right",
    bbox=dict(facecolor="white", alpha=0.5),
)

# Save the image to the output folder
output_folder = "../reports/figures"
os.makedirs(output_folder, exist_ok=True)
output_path = os.path.join(output_folder, f"emotions_distribution_{threshold}.png")
plt.savefig(output_path)

Sentiment distribution:
Positive: 693 examples
Negative: 106 examples
Neutral: 1348 examples
Total: 2147 examples

Positive examples:
Video: --7E2sU6zP4_11-5-rgb_front, Caption: So they don't really have much of a symbolic meaning other than maybe life is richer, life is beautiful, but they've become so beautifully stylized and so you find them in different illuminative being+++
Video: --7E2sU6zP4_5-5-rgb_front, Caption: It's almost has a feathery like posture to it.+++
Video: --7E2sU6zP4_6-5-rgb_front, Caption: And so, it's used in architecture as a decorative element in architecture on columns and so on, and it's also used a great deal in illumination.+++

Negative examples:
Video: --8pSDeC-fg_12-5-rgb_front, Caption: Low self esteem tends to bring us down and there's a set of thoughts or cognitions that tend to go with low self esteem.+++
Video: --8pSDeC-fg_13-5-rgb_front, Caption: Such as I am unlove, unlovable and no one will like me.+++
Video: --8pSDeC-fg_4-5-rgb_front, Caption: Low self e
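
The counts printed above already show a clear imbalance: roughly 63% of the captions are Neutral, 32% Positive and 5% Negative. A short sketch of that arithmetic follows, using only the numbers reported by the cell; the variable names are illustrative.

```python
# Counts as printed by the cell above
reported_counts = {"Positive": 693, "Negative": 106, "Neutral": 1348}
reported_total = sum(reported_counts.values())

# Share of each class in the filtered training split
shares = {label: n / reported_total for label, n in reported_counts.items()}
for label, share in shares.items():
    print(f"{label:<8} {share:6.1%}")

# Ratio between the largest and the smallest class
imbalance_ratio = max(reported_counts.values()) / min(reported_counts.values())
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

With Neutral captions outnumbering Negative ones by more than twelve to one, a downstream classifier would likely need resampling or class weights, although that choice is outside the scope of this notebook.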

In [31]:
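# Extract all 'compound' values
compound_scores = [res["compound_score"] for res in video_sentiment_data]

# Create the histogram on a new figure, so it is not drawn over the bar chart
plt.figure(figsize=(10, 6))
plt.hist(compound_scores, bins=10, color="skyblue", edgecolor="black")
plt.title("Distribution of 'compound' scores (sentiment)")
plt.xlabel("Compound value")
plt.ylabel("Number of sentences")

# The Agg backend cannot display figures, so save the histogram instead
# (the file name below is an assumption, chosen to match the bar chart's naming)
histogram_path = os.path.join(output_folder, f"compound_histogram_{threshold}.png")
plt.savefig(histogram_path)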


In [32]:
# Save the positive, negative and neutral videos with their captions to a CSV file
import csv

# Path of the output CSV file
output_csv_path = (
    f"../data/processed/{mode}/video_sentiment_data_with_neutral_{threshold}.csv"
)
os.makedirs(os.path.dirname(output_csv_path), exist_ok=True)

# Keep only the columns to export (every item carries one of the three labels)
positive_negative_neutral_data = [
    {
        "video_name": item["video_name"],
        "caption": item["caption"],
        "emotion": item["emotion"],
    }
    for item in video_sentiment_data
    if item["emotion"] in ["Positive", "Negative", "Neutral"]
]

# Write the data to a CSV file
with open(output_csv_path, mode="w", newline="", encoding="utf-8") as csv_file:
    fieldnames = ["video_name", "caption", "emotion"]
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerows(positive_negative_neutral_data)

print(f"CSV file saved successfully to: {output_csv_path}")

CSV file saved successfully to: ../data/processed/train/video_sentiment_data_with_neutral_0.34.csv
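
A quick way to sanity-check an export like this is to read it back and count the labels. The sketch below does so on a temporary file with made-up rows in the same three-column layout; the rows and the file name `demo_sentiment.csv` are hypothetical.

```python
import csv
import os
import tempfile
from collections import Counter

# Hypothetical rows in the same three-column layout as the exported file
demo_rows = [
    {"video_name": "clip_a", "caption": "Life is beautiful.", "emotion": "Positive"},
    {"video_name": "clip_b", "caption": "Traffic was horrible.", "emotion": "Negative"},
    {"video_name": "clip_c", "caption": "I went to the store.", "emotion": "Neutral"},
    {"video_name": "clip_d", "caption": "It was an honor.", "emotion": "Positive"},
]

with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo_sentiment.csv")

    # Write the file with the same DictWriter settings as the export cell
    with open(demo_path, mode="w", newline="", encoding="utf-8") as demo_file:
        demo_writer = csv.DictWriter(
            demo_file, fieldnames=["video_name", "caption", "emotion"]
        )
        demo_writer.writeheader()
        demo_writer.writerows(demo_rows)

    # Read it back and count the labels
    with open(demo_path, newline="", encoding="utf-8") as demo_file:
        loaded = list(csv.DictReader(demo_file))

reloaded_counts = Counter(row["emotion"] for row in loaded)
print(dict(reloaded_counts))
```

Pointing the reading half at the real output path should reproduce the distribution printed earlier in the notebook, provided the file was written by the same run.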
