# Lab 4: Semantic Textual Similarity (STS) for Catalan - Comprehensive Analysis

**Objectives**: 
- Train Word2Vec models on corpora of different sizes
- Implement and compare different embedding techniques for STS
- Evaluate baseline models against more advanced models
- Analyze the impact of corpus size and embedding type

## Lab Structure:
1. **Training Word2Vec Models** - Different corpus sizes
2. **Baseline Models** - Cosine similarity with simple mean vs. TF-IDF weighting
3. **Model 1**: Aggregated Embeddings - Concatenated sentence vectors
4. **Model 2**: Embedding Sequences - With an attention mechanism
5. **Advanced Experiments** - spaCy, RoBERTa, fine-tuned models
6. **Comparative Analysis** - Pearson correlation and MSE

In [None]:
# Required imports
import os

# GPU configuration (optional). Set before importing TensorFlow so the
# device selection is already in place when TensorFlow initializes CUDA.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Tuple, Optional, Dict, Union
from scipy.stats import pearsonr
from scipy.spatial.distance import cosine
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Gensim imports
import gensim
from gensim.models import Word2Vec, KeyedVectors, TfidfModel
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from collections import defaultdict

print("TensorFlow version:", tf.__version__)
print("GPU available:", tf.config.list_physical_devices('GPU'))

## 1. Dataset Loading and Corpus Preparation

In [None]:
from datasets import load_dataset

# Load the STS datasets
print("Loading datasets...")
train_data = load_dataset("projecte-aina/sts-ca", split="train")
test_data = load_dataset("projecte-aina/sts-ca", split="test") 
val_data = load_dataset("projecte-aina/sts-ca", split="validation")

# Convert to DataFrames
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)
val_df = pd.DataFrame(val_data)

print(f"Train samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Label range: {train_df['label'].min():.2f} - {train_df['label'].max():.2f}")

# Dataset used to train Word2Vec
catalan_corpus = load_dataset("projecte-aina/catalan_general_crawling")
catalan_dataset = catalan_corpus['train']

print(f"Catalan corpus: {len(catalan_dataset)} documents")

In [None]:
# Create corpus subsets of different sizes for the experiments
def create_corpus_subsets(dataset):
    """Create corpus subsets of different sizes.

    Sizes are measured with len(text), i.e. in characters, which only
    approximates the size in bytes.
    """
    thresholds = {'100MB': 100_000_000, '500MB': 500_000_000, '1GB': 1_000_000_000}
    cutoffs = {}
    size = 0

    # Single pass: for each threshold, record the first document that exceeds it
    for i, doc in enumerate(dataset):
        size += len(doc['text'])
        for name, limit in thresholds.items():
            if name not in cutoffs and size > limit:
                cutoffs[name] = i
        if len(cutoffs) == len(thresholds):
            break

    # If the corpus never reaches a threshold, that subset is the whole corpus
    subsets = {name: dataset.select(range(cutoffs.get(name, len(dataset))))
               for name in thresholds}

    # Full corpus (not used for training below because of resource limits)
    subsets['complet'] = dataset

    return subsets

corpus_subsets = create_corpus_subsets(catalan_dataset)
print("Subsets created:", list(corpus_subsets.keys()))
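
The subsets are cut using `len(text)`, which counts characters rather than bytes. Accented Catalan characters take two bytes in UTF-8, so a character budget and a byte budget can select different prefixes. The sketch below illustrates the difference on toy sentences; `prefix_within_budget` and `utf8_size` are hypothetical helpers, not part of the notebook.

```python
def prefix_within_budget(texts, budget, size_of=len):
    """Return how many leading texts fit before the running size exceeds `budget`."""
    total = 0
    count = 0
    for text in texts:
        total += size_of(text)
        if total > budget:
            break  # this text crosses the budget, so it is excluded
        count += 1
    return count

def utf8_size(text):
    """Size of the text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))

docs = ["això és català", "què és això?", "una frase més"]

# Same budget, two units: characters versus UTF-8 bytes
n_by_chars = prefix_within_budget(docs, budget=27)
n_by_bytes = prefix_within_budget(docs, budget=27, size_of=utf8_size)
print(n_by_chars, n_by_bytes)
```

For corpus-scale experiments the difference is a small constant factor, so the subset names are best read as approximate sizes.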

## 2. Training Word2Vec Models

In [None]:
import re

# Catalan stopwords (stored as a set for fast membership tests)
stopwords_catala = {
    "a", "abans", "ací", "així", "alguns", "alguna", "algunes", "algú", "alhora", 
    "als", "allò", "aquell", "aquelles", "aquells", "baix", "cada", "com", 
    "eixa", "eixes", "eixí", "eixos", "el", "ella", "elles", "ell", "ells", 
    "en", "endavant", "enfront", "ens", "entre", "he", "hem", "heu", "hi", "ho", 
    "i", "igual", "iguals", "ja", "jo", "l'", "la", "les", "li", "els", "tu", 
    "nosaltres", "vosaltres", "de", "del", "dels", "d'un", "d'una", "des"
}

def preprocess_text(text):
    """Clean and tokenize the text."""
    # Replace runs of non-word characters (apostrophes included) with a space
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    words = text.split()
    # Drop stopwords and tokens of one or two characters
    words = [word for word in words if word not in stopwords_catala and len(word) > 2]
    return words

def create_corpus_for_w2v(dataset):
    """Build the preprocessed corpus for Word2Vec (one token list per document)."""
    corpus = []
    for doc in dataset:
        words = preprocess_text(doc['text'])
        if words:  # Only keep non-empty documents
            corpus.append(words)
    return corpus

def train_word2vec(corpus, vector_size=100, window=5, min_count=10, workers=4, epochs=25):
    """Train a Word2Vec model."""
    model = Word2Vec(
        corpus, 
        vector_size=vector_size, 
        window=window, 
        min_count=min_count, 
        workers=workers, 
        epochs=epochs,
        sg=1  # Skip-gram
    )
    return model
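
One detail of `preprocess_text` is worth seeing in isolation: because `\W+` also removes apostrophes, elided forms such as `l'` are split before the stopword lookup, so the apostrophe entries in the stopword list can never match. In practice the `len(word) > 2` filter drops the leftover one-letter tokens anyway. The sketch below compares that tokenization with an alternative that keeps the apostrophe; the second function and the small stopword sample are assumptions for illustration only.

```python
import re

def tokens_strip_nonword(text):
    # Same steps as preprocess_text: strip non-word characters, lowercase, split
    return re.sub(r"\W+", " ", text).lower().split()

def tokens_keep_apostrophes(text):
    # Hypothetical alternative: keep elided articles such as "l'" as one token
    return re.findall(r"\w+'|\w+", text.lower())

sentence = "L'home de la casa d'un amic"
stripped = tokens_strip_nonword(sentence)
kept = tokens_keep_apostrophes(sentence)

print(stripped)
print(kept)
```

With the first tokenizer the token `l'` never appears, so only the length filter removes the stray `l`; with the second, the stopword entry `l'` would match directly.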

In [None]:
# Train a Word2Vec model for each corpus size
word2vec_models = {}

for corpus_name in ["100MB", "500MB", "1GB"]:  # 'complet' is skipped because of resource limits
    print(f"\nTraining Word2Vec on {corpus_name}...")
    
    # Build the preprocessed corpus
    corpus = create_corpus_for_w2v(corpus_subsets[corpus_name])
    
    # Train the model
    model = train_word2vec(corpus, vector_size=100)
    
    # Save the model
    model.save(f"word2vec_{corpus_name}.model")
    word2vec_models[corpus_name] = model
    
    print(f"Model {corpus_name} trained with {len(model.wv.key_to_index)} words")
    
    # Show example nearest neighbours
    if "casa" in model.wv:
        similar_words = model.wv.most_similar("casa", topn=5)
        print(f"Words most similar to 'casa': {similar_words}")

## 3. Loading Pre-trained Embeddings

In [None]:
# Load pre-trained FastText embeddings (word2vec text format)
PRETRAINED_PATH = '../cc.ca.300.vec'

try:
    print("Loading pre-trained FastText embeddings...")
    fasttext_model = KeyedVectors.load_word2vec_format(PRETRAINED_PATH, binary=False)
    print(f"FastText loaded: {len(fasttext_model.key_to_index)} words, {fasttext_model.vector_size}D")
    word2vec_models['fasttext'] = fasttext_model
except FileNotFoundError:
    print("FastText embeddings not found. Continuing without them.")
    fasttext_model = None
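
For reference, the `.vec` file read by `load_word2vec_format(..., binary=False)` is plain text: a header line with the vocabulary size and the dimensionality, then one line per word with its vector components. The sketch below parses that layout from an in-memory sample; `read_vec` is a hypothetical helper and the two-word sample is made up.

```python
import io

def read_vec(handle, limit=None):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vdim' lines."""
    count, dim = (int(field) for field in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], [float(v) for v in parts[1:] if v]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = values
        if limit is not None and len(vectors) >= limit:
            break  # stop early to bound memory use
    return dim, vectors

sample = io.StringIO("2 3\ncasa 0.1 0.2 0.3\ngat 0.4 0.5 0.6\n")
dim, vectors = read_vec(sample)
print(dim, sorted(vectors))
```

When memory is tight, gensim's loader accepts a `limit` argument that plays the same role as the `limit` parameter here, reading only the first entries of the file.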

## 4. Utility Functions for Processing

In [None]:
def preprocess_sentence(sentence: str) -> List[str]:
    """Preprocess a sentence for STS (simple_preprocess already lowercases)."""
    return simple_preprocess(sentence)

def get_sentence_embedding_mean(sentence: str, wv_model, vector_size: int) -> np.ndarray:
    """Get a sentence embedding as the plain mean of its word vectors."""
    words = preprocess_sentence(sentence)
    # Accept either a full Word2Vec model or bare KeyedVectors
    kv = wv_model.wv if hasattr(wv_model, 'wv') else wv_model
    vectors = [kv[word] for word in words if word in kv]
    
    # Sentences with no known word map to the zero vector
    return np.mean(vectors, axis=0) if vectors else np.zeros(vector_size)

def get_sentence_embedding_tfidf(sentence: str, wv_model, tfidf_vectorizer, 
                                feature_names: List[str], vector_size: int) -> np.ndarray:
    """Get a sentence embedding as a TF-IDF-weighted average of word vectors.

    `feature_names` is kept so existing calls still work; lookups use
    `tfidf_vectorizer.vocabulary_` (a dict) instead of a linear
    `list.index` search for every word.
    """
    words = preprocess_sentence(sentence)
    # Accept either a full Word2Vec model or bare KeyedVectors
    kv = wv_model.wv if hasattr(wv_model, 'wv') else wv_model
    
    # TF-IDF scores of this sentence over the fitted vocabulary
    tfidf_scores = tfidf_vectorizer.transform([' '.join(words)]).toarray()[0]
    vocabulary = tfidf_vectorizer.vocabulary_
    
    weighted_vectors = []
    weights = []
    
    for word in words:
        word_idx = vocabulary.get(word)
        # Skip words unknown to either the vectorizer or the embedding model
        if word_idx is None or word not in kv:
            continue
        
        # Apply the TF-IDF weight
        weight = tfidf_scores[word_idx]
        if weight > 0:
            weighted_vectors.append(kv[word] * weight)
            weights.append(weight)
    
    if weighted_vectors and sum(weights) > 0:
        return np.sum(weighted_vectors, axis=0) / sum(weights)
    else:
        return np.zeros(vector_size)
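
The TF-IDF variant is a weighted average: each word vector is scaled by its weight, and the sum is divided by the total weight. A minimal sketch with toy two-dimensional vectors and made-up weights shows how the result moves toward the higher-weighted word; `weighted_sentence_vector` and all values are illustrative assumptions.

```python
def weighted_sentence_vector(words, vectors, weights, dim):
    """Weighted average of the vectors of known words; zero vector if none apply."""
    acc = [0.0] * dim
    total = 0.0
    for word in words:
        weight = weights.get(word, 0.0)
        if word in vectors and weight > 0:
            acc = [a + weight * v for a, v in zip(acc, vectors[word])]
            total += weight
    return [a / total for a in acc] if total > 0 else acc

toy_vectors = {"gat": [1.0, 0.0], "el": [0.0, 1.0]}
toy_weights = {"gat": 3.0, "el": 1.0}  # content word weighted above the article

weighted = weighted_sentence_vector(["el", "gat"], toy_vectors, toy_weights, dim=2)
uniform = weighted_sentence_vector(["el", "gat"], toy_vectors, {"gat": 1.0, "el": 1.0}, dim=2)
print(weighted, uniform)
```

With equal weights the function reduces to the simple mean, which is why the two baselines can be compared directly.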

In [None]:
# Prepare data for evaluation
all_sentences = (train_df['sentence_1'].tolist() + train_df['sentence_2'].tolist() + 
                test_df['sentence_1'].tolist() + test_df['sentence_2'].tolist() + 
                val_df['sentence_1'].tolist() + val_df['sentence_2'].tolist())

# Build the TF-IDF vectorizer.
# Note: it is fitted on the train, validation and test sentences (labels are not used).
corpus_for_tfidf = [' '.join(preprocess_sentence(sent)) for sent in all_sentences]
tfidf_vectorizer = TfidfVectorizer(max_features=10000, lowercase=True)
tfidf_vectorizer.fit(corpus_for_tfidf)
feature_names = tfidf_vectorizer.get_feature_names_out().tolist()

print(f"TF-IDF ready with {len(feature_names)} features")

## 5. Baseline Models: Cosine Similarity

In [None]:
def evaluate_cosine_baseline(df: pd.DataFrame, wv_model, method='mean') -> Dict[str, float]:
    """Evaluate the cosine-similarity baseline on a DataFrame of sentence pairs."""
    similarities = []
    true_scores = df['label'].values
    
    vector_size = wv_model.wv.vector_size if hasattr(wv_model, 'wv') else wv_model.vector_size
    
    for _, row in df.iterrows():
        sent1, sent2 = row['sentence_1'], row['sentence_2']
        
        if method == 'tfidf':
            vec1 = get_sentence_embedding_tfidf(sent1, wv_model, tfidf_vectorizer, 
                                              feature_names, vector_size)
            vec2 = get_sentence_embedding_tfidf(sent2, wv_model, tfidf_vectorizer, 
                                              feature_names, vector_size)
        else:  # mean
            vec1 = get_sentence_embedding_mean(sent1, wv_model, vector_size)
            vec2 = get_sentence_embedding_mean(sent2, wv_model, vector_size)
        
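        # Cosine similarity (defined as 0 when either sentence has no known word)
        if np.all(vec1 == 0) or np.all(vec2 == 0):
            sim = 0.0
        else:
            sim = 1 - cosine(vec1, vec2)
        
        # Map the cosine range [-1, 1] linearly onto the label range [0, 5]
        sim_scaled = (sim + 1) * 2.5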
        similarities.append(sim_scaled)
    
    similarities = np.array(similarities)
    pearson_corr, _ = pearsonr(true_scores, similarities)
    mse = mean_squared_error(true_scores, similarities)
    mae = mean_absolute_error(true_scores, similarities)
    
    return {
        'pearson': pearson_corr,
        'mse': mse,
        'mae': mae,
        'predictions': similarities
    }

# Evaluate the baselines for every embedding model
baseline_results = {}

for model_name, model in word2vec_models.items():
    print(f"\n=== Evaluating Baseline: {model_name} ===")
    
    # Simple mean
    results_mean = evaluate_cosine_baseline(val_df, model, method='mean')
    
    # TF-IDF weighted
    results_tfidf = evaluate_cosine_baseline(val_df, model, method='tfidf')
    
    baseline_results[model_name] = {
        'mean': results_mean,
        'tfidf': results_tfidf
    }
    
    print(f"Simple Mean - Pearson: {results_mean['pearson']:.3f}, MSE: {results_mean['mse']:.3f}")
    print(f"TF-IDF - Pearson: {results_tfidf['pearson']:.3f}, MSE: {results_tfidf['mse']:.3f}")
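
When reading the baseline numbers, note that Pearson correlation does not change under a positive linear rescaling of the predictions, whereas MSE does. The mapping `(sim + 1) * 2.5` therefore affects only the error metrics. The sketch below checks this on made-up values; the second mapping (`sim * 5`) is an alternative introduced purely for comparison, not something the notebook uses.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

gold = [0.5, 2.0, 3.5, 5.0]        # toy gold labels in [0, 5]
cosines = [0.1, 0.4, 0.7, 0.9]     # toy cosine similarities

scaled_full = [(c + 1) * 2.5 for c in cosines]  # maps [-1, 1] onto [0, 5]
scaled_half = [c * 5 for c in cosines]          # maps [0, 1] onto [0, 5]

r_full, r_half = pearson(gold, scaled_full), pearson(gold, scaled_half)
mse_full, mse_half = mse(gold, scaled_full), mse(gold, scaled_half)
print(r_full, r_half, mse_full, mse_half)
```

If the cosines are mostly positive, which is common for averaged word vectors, the `[-1, 1]` mapping pushes predictions into the upper half of the label range, so a high MSE can coexist with a reasonable Pearson score.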

## 6. Model 1: Regression on Aggregated Embeddings

In [None]:
def build_aggregated_model(embedding_dim: int, hidden_size: int = 128, 
                         dropout_rate: float = 0.3) -> tf.keras.Model:
    """Regression model over aggregated (sentence-level) embeddings."""
    input_1 = tf.keras.Input(shape=(embedding_dim,), name="sentence_1")
    input_2 = tf.keras.Input(shape=(embedding_dim,), name="sentence_2")
    
    # Concatenate the two sentence embeddings
    concatenated = tf.keras.layers.Concatenate(axis=-1)([input_1, input_2])
    
    # Hidden layers
    x = tf.keras.layers.BatchNormalization()(concatenated)
    x = tf.keras.layers.Dense(hidden_size, activation='relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    x = tf.keras.layers.Dense(hidden_size // 2, activation='relu')(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    
    # Output: linear regression head (targets lie in 0-5, the output itself is unbounded)
    output = tf.keras.layers.Dense(1, activation='linear')(x)
    
    model = tf.keras.Model(inputs=[input_1, input_2], outputs=output)
    
    model.compile(
        loss='mean_squared_error',
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        metrics=['mae']
    )
    
    return model

def prepare_aggregated_data(df: pd.DataFrame, wv_model, method='mean'):
    """Prepare the inputs (two sentence vectors per pair) and labels for the aggregated model."""
    X1, X2, Y = [], [], []
    
    vector_size = wv_model.wv.vector_size if hasattr(wv_model, 'wv') else wv_model.vector_size
    
    for _, row in df.iterrows():
        sent1, sent2, label = row['sentence_1'], row['sentence_2'], row['label']
        
        if method == 'tfidf':
            vec1 = get_sentence_embedding_tfidf(sent1, wv_model, tfidf_vectorizer, 
                                              feature_names, vector_size)
            vec2 = get_sentence_embedding_tfidf(sent2, wv_model, tfidf_vectorizer, 
                                              feature_names, vector_size)
        else:
            vec1 = get_sentence_embedding_mean(sent1, wv_model, vector_size)
            vec2 = get_sentence_embedding_mean(sent2, wv_model, vector_size)
        
        X1.append(vec1)
        X2.append(vec2)
        Y.append(label)
    
    return np.array(X1), np.array(X2), np.array(Y)
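
Model 1 feeds the plain concatenation of the two sentence vectors to the network, so swapping the two sentences produces a different input even though the similarity label is the same. A common alternative, shown here only as a hypothetical variation and not used in the notebook, is to build order-independent pair features from the element-wise absolute difference and product.

```python
def concat_features(u, v):
    """Order-dependent pair features: [u; v], as in Model 1."""
    return list(u) + list(v)

def symmetric_features(u, v):
    """Order-independent pair features: [|u - v|; u * v] (illustrative alternative)."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return diff + prod

u = [1.0, 2.0]
v = [3.0, -1.0]

print(concat_features(u, v), concat_features(v, u))
print(symmetric_features(u, v), symmetric_features(v, u))
```

Whether such features help on this dataset is an empirical question that the notebook does not test.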

In [None]:
# Train the aggregated models
aggregated_results = {}

for model_name, model in word2vec_models.items():
    print(f"\n=== Training Aggregated Model: {model_name} ===")
    
    try:
        # Prepare the data
        X1_train, X2_train, Y_train = prepare_aggregated_data(train_df, model)
        X1_val, X2_val, Y_val = prepare_aggregated_data(val_df, model)
        
        vector_size = model.wv.vector_size if hasattr(model, 'wv') else model.vector_size
        
        # Build and train the model
        keras_model = build_aggregated_model(embedding_dim=vector_size)
        
        # Callbacks
        early_stopping = tf.keras.callbacks.EarlyStopping(
            monitor='val_loss', patience=10, restore_best_weights=True, verbose=0
        )
        
        # Training
        history = keras_model.fit(
            [X1_train, X2_train], Y_train,
            validation_data=([X1_val, X2_val], Y_val),
            epochs=50,
            batch_size=32,
            callbacks=[early_stopping],
            verbose=0
        )
        
        # Evaluation on the validation set
        Y_pred = keras_model.predict([X1_val, X2_val], verbose=0).flatten()
        pearson_corr, _ = pearsonr(Y_val, Y_pred)
        mse = mean_squared_error(Y_val, Y_pred)
        mae = mean_absolute_error(Y_val, Y_pred)
        
        aggregated_results[model_name] = {
            'model': keras_model,
            'history': history,
            'pearson': pearson_corr,
            'mse': mse,
            'mae': mae,
            'predictions': Y_pred
        }
        
        print(f"Results - Pearson: {pearson_corr:.3f}, MSE: {mse:.3f}, MAE: {mae:.3f}")
        
    except Exception as e:
        print(f"Error training {model_name}: {e}")

## 7. Model 2: Embedding Sequences with Attention

In [None]:
class SimpleAttention(tf.keras.layers.Layer):
    """Simple attention pooling over the time steps of a sequence."""
    def __init__(self, units: int = 128, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.W1 = tf.keras.layers.Dense(units, activation='tanh')
        self.W2 = tf.keras.layers.Dense(1)
        self.dropout = tf.keras.layers.Dropout(0.2)
        self.supports_masking = True

    def call(self, inputs, mask=None):
        # Attention scores, one per time step: (batch, seq_len)
        ui = self.W1(inputs)
        ui = self.dropout(ui)
        scores = self.W2(ui)
        scores = tf.squeeze(scores, axis=-1)
        
        # Apply the mask. tf.where needs a boolean condition, so the mask is
        # cast to bool; padded positions receive a large negative score.
        if mask is not None:
            mask = tf.cast(mask, dtype=tf.bool)
            scores = tf.where(mask, scores, tf.fill(tf.shape(scores), -1e9))
        
        # Attention weights
        alpha = tf.nn.softmax(scores, axis=-1)
        alpha = tf.expand_dims(alpha, axis=-1)
        
        # Context vector: attention-weighted sum of the inputs
        context_vector = tf.reduce_sum(alpha * inputs, axis=1)
        
        return context_vector

    def compute_mask(self, inputs, mask=None):
        # The time axis is pooled away, so no mask is passed downstream
        return None

def build_sequence_model(vocab_size: int, embedding_dim: int, sequence_length: int = 32,
                        pretrained_weights: Optional[np.ndarray] = None,
                        trainable_embeddings: bool = False) -> tf.keras.Model:
    """Siamese sequence model with attention pooling."""
    input_1 = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
    input_2 = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
    
    # Embedding layer, shared by both sentences; index 0 is padding and is masked.
    # `input_length` is omitted: it is not needed here and recent Keras releases deprecate it.
    if pretrained_weights is not None:
        initializer = tf.keras.initializers.Constant(pretrained_weights)
        trainable = trainable_embeddings
    else:
        initializer = 'uniform'
        trainable = True
    
    embedding_layer = tf.keras.layers.Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        embeddings_initializer=initializer,
        trainable=trainable,
        mask_zero=True
    )
    
    # Apply the embedding
    embedded_1 = embedding_layer(input_1)
    embedded_2 = embedding_layer(input_2)
    
    # Attention pooling (shared weights)
    attention_layer = SimpleAttention(units=64)
    sentence_vector_1 = attention_layer(embedded_1)
    sentence_vector_2 = attention_layer(embedded_2)
    
    # Projection and L2 normalization
    projection_layer = tf.keras.layers.Dense(embedding_dim, activation='tanh')
    dropout = tf.keras.layers.Dropout(0.3)
    
    projected_1 = dropout(projection_layer(sentence_vector_1))
    projected_2 = dropout(projection_layer(sentence_vector_2))
    
    normalized_1 = tf.keras.layers.Lambda(
        lambda x: tf.linalg.l2_normalize(x, axis=1)
    )(projected_1)
    normalized_2 = tf.keras.layers.Lambda(
        lambda x: tf.linalg.l2_normalize(x, axis=1)
    )(projected_2)
    
    # Cosine similarity (dot product of unit vectors), then rescaled from [-1, 1] to [0, 5]
    similarity = tf.keras.layers.Lambda(
        lambda x: tf.reduce_sum(x[0] * x[1], axis=1, keepdims=True)
    )([normalized_1, normalized_2])
    
    output = tf.keras.layers.Lambda(
        lambda x: 2.5 * (1.0 + x)
    )(similarity)
    
    model = tf.keras.Model(inputs=[input_1, input_2], outputs=output)
    
    model.compile(
        loss='mean_squared_error',
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        metrics=['mae']
    )
    
    return model
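
The masking step in `SimpleAttention` can be checked by hand on a tiny example. The sketch below is a dependency-free version of masked softmax pooling: padded positions receive a score of `-1e9`, so their softmax weight underflows to zero and they do not contribute to the context vector. The function name and the toy inputs are illustrative assumptions.

```python
import math

def masked_attention_pool(vectors, scores, mask):
    """Softmax over `scores` restricted to positions where `mask` is True,
    followed by the weighted sum of `vectors`."""
    masked = [s if keep else -1e9 for s, keep in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    context = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return weights, context

# Two real tokens followed by one padding position with a large raw score
vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
scores = [0.0, 0.0, 5.0]
mask = [True, True, False]

weights, context = masked_attention_pool(vectors, scores, mask)
print(weights, context)
```

Without the mask, the padding position would dominate the softmax here because its raw score is the largest.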

In [None]:
# Vocabulary and sequence preparation
def create_vocabulary_mapping(sentences: List[str], max_vocab_size: int = 10000):
    """Build word-to-index and index-to-word mappings from the most frequent words."""
    word_counts = {}
    for sentence in sentences:
        words = preprocess_sentence(sentence)
        for word in words:
            word_counts[word] = word_counts.get(word, 0) + 1
    
    # Sort by frequency (most frequent first)
    sorted_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
    
    # Build the mapping; indices 0 and 1 are reserved for padding and unknown words
    word_to_idx = {"<PAD>": 0, "<UNK>": 1}
    idx_to_word = {0: "<PAD>", 1: "<UNK>"}
    
    for word, count in sorted_words[:max_vocab_size-2]:
        idx = len(word_to_idx)
        word_to_idx[word] = idx
        idx_to_word[idx] = word
    
    return word_to_idx, idx_to_word

def sentence_to_sequence(sentence: str, word_to_idx: Dict[str, int], 
                        max_length: int = 32) -> np.ndarray:
    """Convert a sentence into a fixed-length sequence of word indices."""
    words = preprocess_sentence(sentence)
    sequence = []
    
    for word in words:
        if word in word_to_idx:
            sequence.append(word_to_idx[word])
        else:
            sequence.append(word_to_idx["<UNK>"])
    
    # Pad or truncate to max_length
    if len(sequence) > max_length:
        sequence = sequence[:max_length]
    else:
        sequence.extend([word_to_idx["<PAD>"]] * (max_length - len(sequence)))
    
    return np.array(sequence)

# Build the vocabulary
word_to_idx, idx_to_word = create_vocabulary_mapping(all_sentences, max_vocab_size=10000)
vocab_size = len(word_to_idx)
sequence_length = 32

print(f"Vocabulary created: {vocab_size} words")
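
As a quick check of the encoding convention, the sketch below reproduces the index mapping on a toy vocabulary: unknown words map to `<UNK>` (index 1), short sentences are right-padded with `<PAD>` (index 0, the value masked by `mask_zero=True`), and long sentences are truncated. The `encode` helper and the four-entry vocabulary are illustrative assumptions.

```python
def encode(words, word_to_idx, max_length):
    """Map words to indices, then truncate or right-pad to `max_length`."""
    unk, pad = word_to_idx["<UNK>"], word_to_idx["<PAD>"]
    ids = [word_to_idx.get(word, unk) for word in words[:max_length]]
    return ids + [pad] * (max_length - len(ids))

toy_vocab = {"<PAD>": 0, "<UNK>": 1, "el": 2, "gat": 3}

padded = encode(["el", "gat", "dorm"], toy_vocab, max_length=5)
truncated = encode(["el"] * 7, toy_vocab, max_length=5)
print(padded, truncated)
```

Keeping index 0 exclusively for padding matters: if a real word shared that index, the embedding mask would hide it from the attention layer.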

In [None]:
# Train the sequence model with the best Word2Vec embeddings (by validation Pearson of the mean baseline)
if word2vec_models:
    best_w2v_name = max(baseline_results.keys(), 
                       key=lambda k: baseline_results[k]['mean']['pearson'])
    best_w2v_model = word2vec_models[best_w2v_name]
    
    print(f"Using best Word2Vec model: {best_w2v_name}")
    
    # Build the pre-trained embedding matrix (words without a vector keep a zero row)
    embedding_dim = best_w2v_model.wv.vector_size if hasattr(best_w2v_model, 'wv') else best_w2v_model.vector_size
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    
    for word, idx in word_to_idx.items():
        if hasattr(best_w2v_model, 'wv'):
            if word in best_w2v_model.wv:
                embedding_matrix[idx] = best_w2v_model.wv[word]
        else:
            if word in best_w2v_model:
                embedding_matrix[idx] = best_w2v_model[word]
    
    # Prepare the sequence data
    def prepare_sequence_data(df):
        X1_seq, X2_seq, Y_seq = [], [], []
        for _, row in df.iterrows():
            seq1 = sentence_to_sequence(row['sentence_1'], word_to_idx, sequence_length)
            seq2 = sentence_to_sequence(row['sentence_2'], word_to_idx, sequence_length)
            X1_seq.append(seq1)
            X2_seq.append(seq2)
            Y_seq.append(row['label'])
        return np.array(X1_seq), np.array(X2_seq), np.array(Y_seq)
    
    X1_train_seq, X2_train_seq, Y_train_seq = prepare_sequence_data(train_df)
    X1_val_seq, X2_val_seq, Y_val_seq = prepare_sequence_data(val_df)
    
    # Model with pre-trained embeddings (frozen)
    print("\nTraining model with frozen embeddings...")
    model_seq_frozen = build_sequence_model(
        vocab_size=vocab_size,
        embedding_dim=embedding_dim,
        sequence_length=sequence_length,
        pretrained_weights=embedding_matrix,
        trainable_embeddings=False
    )
    
    history_frozen = model_seq_frozen.fit(
        [X1_train_seq, X2_train_seq], Y_train_seq,
        validation_data=([X1_val_seq, X2_val_seq], Y_val_seq),
        epochs=30,
        batch_size=32,
        verbose=0
    )
    
    Y_pred_frozen = model_seq_frozen.predict([X1_val_seq, X2_val_seq], verbose=0).flatten()
    pearson_frozen, _ = pearsonr(Y_val_seq, Y_pred_frozen)
    mse_frozen = mean_squared_error(Y_val_seq, Y_pred_frozen)
    
    print(f"Frozen model - Pearson: {pearson_frozen:.3f}, MSE: {mse_frozen:.3f}")

## 8. Experiments with Advanced Models

In [None]:
# Experiment with spaCy
try:
    import spacy
    
    print("Trying spaCy...")
    nlp = spacy.load("ca_core_news_md")
    
    def get_spacy_embedding(sentence: str) -> np.ndarray:
        doc = nlp(sentence)
        return doc.vector
    
    # Evaluate with spaCy document vectors
    spacy_similarities = []
    for _, row in val_df.iterrows():
        vec1 = get_spacy_embedding(row['sentence_1'])
        vec2 = get_spacy_embedding(row['sentence_2'])
        
        if np.all(vec1 == 0) or np.all(vec2 == 0):
            sim = 0.0
        else:
            sim = 1 - cosine(vec1, vec2)
        
        sim_scaled = (sim + 1) * 2.5
        spacy_similarities.append(sim_scaled)
    
    spacy_pearson, _ = pearsonr(val_df['label'].values, spacy_similarities)
    spacy_mse = mean_squared_error(val_df['label'].values, spacy_similarities)
    
    print(f"spaCy - Pearson: {spacy_pearson:.3f}, MSE: {spacy_mse:.3f}")
    
except Exception as e:
    print(f"Error with spaCy: {e}")

In [None]:
# Experiment with RoBERTa fine-tuned for STS
try:
    from transformers import AutoTokenizer, pipeline
    from scipy.special import logit
    
    print("Trying RoBERTa fine-tuned for STS...")
    
    model_name = 'projecte-aina/roberta-base-ca-v2-cased-sts'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    pipe = pipeline('text-classification', model=model_name, tokenizer=tokenizer)
    
    def prepare_for_roberta(sentence_pairs):
        prepared = []
        for s1, s2 in sentence_pairs:
            prepared.append(f"{tokenizer.cls_token} {s1}{tokenizer.sep_token}{tokenizer.sep_token} {s2}{tokenizer.sep_token}")
        return prepared
    
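    # Evaluate on a random sample of at most 100 pairs, so these scores are
    # not directly comparable with results on the full validation set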
    sample_size = min(100, len(val_df))
    val_sample = val_df.sample(n=sample_size, random_state=42)
    
    sentence_pairs = [(row['sentence_1'], row['sentence_2']) for _, row in val_sample.iterrows()]
    prepared_pairs = prepare_for_roberta(sentence_pairs)
    
    predictions = pipe(prepared_pairs, add_special_tokens=False)
    
    # Convert the pipeline scores back to the original 0-5 scale by applying
    # logit, the inverse of the sigmoid
    for prediction in predictions:
        prediction['score'] = logit(prediction['score'])
    
    scores = [pred['score'] for pred in predictions]
    true_scores = val_sample['label'].values
    
    roberta_pearson, _ = pearsonr(true_scores, scores)
    roberta_mse = mean_squared_error(true_scores, scores)
    
    print(f"RoBERTa fine-tuned - Pearson: {roberta_pearson:.3f}, MSE: {roberta_mse:.3f}")
    
except Exception as e:
    print(f"Error with RoBERTa: {e}")
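
The `logit` call in the RoBERTa cell only makes sense if the pipeline score is a sigmoid of the model's raw regression output, which is the assumption behind that conversion. Under that assumption, `logit` recovers the raw value exactly, as the dependency-free sketch below shows; the value 3.2 is made up.

```python
import math

def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

raw_score = 3.2                # hypothetical raw similarity on the 0-5 scale
squashed = sigmoid(raw_score)  # what a sigmoid-applying pipeline would report
recovered = logit(squashed)
print(squashed, recovered)
```

A squashed score of exactly 0 or 1 has no finite logit, so saturated predictions would need clipping before the conversion.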

## 9. Comparative Analysis and Visualization

In [None]:
# Summary of results
print("=== SUMMARY OF RESULTS ===\n")

results_summary = []

# Baselines
print("COSINE BASELINES:")
for model_name in baseline_results:
    mean_r = baseline_results[model_name]['mean']['pearson']
    tfidf_r = baseline_results[model_name]['tfidf']['pearson']
    mean_mse = baseline_results[model_name]['mean']['mse']
    tfidf_mse = baseline_results[model_name]['tfidf']['mse']
    
    results_summary.append(['Baseline Mean', model_name, mean_r, mean_mse])
    results_summary.append(['Baseline TF-IDF', model_name, tfidf_r, tfidf_mse])
    print(f"  {model_name} - Mean: {mean_r:.3f}, TF-IDF: {tfidf_r:.3f}")

# Regression models
print("\nREGRESSION MODELS:")
for model_name in aggregated_results:
    pearson_r = aggregated_results[model_name]['pearson']
    mse_r = aggregated_results[model_name]['mse']
    results_summary.append(['Model Agregat', model_name, pearson_r, mse_r])
    print(f"  {model_name} - Pearson: {pearson_r:.3f}, MSE: {mse_r:.3f}")

# Sequence model
if 'pearson_frozen' in locals():
    print(f"\nSEQUENCE MODEL:")
    print(f"  Embeddings Frozen - Pearson: {pearson_frozen:.3f}, MSE: {mse_frozen:.3f}")
    results_summary.append(['Model Seqüència', best_w2v_name, pearson_frozen, mse_frozen])

# Advanced models
if 'spacy_pearson' in locals():
    results_summary.append(['spaCy', 'ca_core_news_md', spacy_pearson, spacy_mse])
    print(f"\nspaCy - Pearson: {spacy_pearson:.3f}, MSE: {spacy_mse:.3f}")

if 'roberta_pearson' in locals():
    results_summary.append(['RoBERTa fine-tuned', 'STS', roberta_pearson, roberta_mse])
    print(f"RoBERTa fine-tuned - Pearson: {roberta_pearson:.3f}, MSE: {roberta_mse:.3f}")

# Build the results DataFrame
df_results = pd.DataFrame(results_summary, columns=['Model', 'Variant', 'Pearson', 'MSE'])
print(f"\n{df_results.to_string(index=False, float_format='%.3f')}")

In [None]:
# Visualizations
if len(results_summary) > 0:
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Plot 1: Pearson comparison by model type
    model_types = df_results['Model'].unique()
    colors = plt.cm.Set3(np.linspace(0, 1, len(model_types)))
    
    for i, model_type in enumerate(model_types):
        subset = df_results[df_results['Model'] == model_type]
        axes[0,0].scatter(subset.index, subset['Pearson'], 
                         label=model_type, color=colors[i], s=60)
    
    axes[0,0].set_xlabel('Experiments')
    axes[0,0].set_ylabel('Pearson correlation')
    axes[0,0].set_title('Pearson Correlation Comparison')
    axes[0,0].legend()
    axes[0,0].grid(True, alpha=0.3)
    
    # Plot 2: MSE vs Pearson
    axes[0,1].scatter(df_results['Pearson'], df_results['MSE'], alpha=0.7)
    for i, row in df_results.iterrows():
        axes[0,1].annotate(f"{row['Model'][:10]}", 
                          (row['Pearson'], row['MSE']), 
                          xytext=(5, 5), textcoords='offset points', fontsize=8)
    
    axes[0,1].set_xlabel('Pearson correlation')
    axes[0,1].set_ylabel('MSE')
    axes[0,1].set_title('MSE vs Pearson')
    axes[0,1].grid(True, alpha=0.3)
    
    # Plot 3: Best model per category (the filters match the labels stored in df_results)
    best_baseline = df_results[df_results['Model'].str.contains('Baseline')]['Pearson'].max()
    best_aggregated = df_results[df_results['Model'].str.contains('Agregat')]['Pearson'].max()
    
    categories = ['Baseline', 'Aggregated']
    values = [best_baseline, best_aggregated]
    
    if 'pearson_frozen' in locals():
        categories.append('Sequence')
        values.append(pearson_frozen)
    
    if 'spacy_pearson' in locals():
        categories.append('spaCy')
        values.append(spacy_pearson)
        
    if 'roberta_pearson' in locals():
        categories.append('RoBERTa')
        values.append(roberta_pearson)
    
    axes[1,0].bar(categories, values, color=colors[:len(categories)])
    axes[1,0].set_ylabel('Best Pearson correlation')
    axes[1,0].set_title('Best Models per Category')
    axes[1,0].tick_params(axis='x', rotation=45)
    axes[1,0].grid(True, alpha=0.3)
    
    # Plot 4: Distribution of results
    axes[1,1].hist(df_results['Pearson'], bins=10, alpha=0.7, edgecolor='black')
    axes[1,1].axvline(x=df_results['Pearson'].mean(), color='red', 
                     linestyle='--', label=f'Mean: {df_results["Pearson"].mean():.3f}')
    axes[1,1].set_xlabel('Pearson correlation')
    axes[1,1].set_ylabel('Frequency')
    axes[1,1].set_title('Distribution of Results')
    axes[1,1].legend()
    axes[1,1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

## 10. Conclusions and Final Observations

### Main Results:

1. **Impact of Corpus Size**: Word2Vec models trained on more data (500MB, 1GB) perform better than the small model (100MB).

2. **Pre-trained vs. Trained Embeddings**: The pre-trained FastText embeddings often outperform models trained from scratch on limited corpora.

3. **TF-IDF vs. Simple Mean**: TF-IDF weighting occasionally improves results, but not consistently.

4. **Neural Models vs. Baselines**: The neural regression models bring significant improvements over plain cosine similarity.

5. **Advanced Architectures**: Attention models and fine-tuned transformers (RoBERTa) give the best results.

### Technical Observations:

- **Overfitting**: Many models overfit, especially with small corpora
- **Generalization**: Simpler models sometimes generalize better
- **Computational Resources**: There is a trade-off between performance and computational cost

### Recommendations:

1. **For practical applications**: Use pre-trained models such as fine-tuned RoBERTa
2. **For experimentation**: Word2Vec models with TF-IDF offer a good compromise
3. **For limited resources**: Cosine-similarity baselines are surprisingly effective

### Future Directions:

- Experiment with other architectures (BERT, multilingual transformers)
- Explore ensemble techniques
- Evaluate on other Catalan NLP tasks
- Run a more exhaustive hyperparameter search

In [None]:
# Save the results
df_results.to_csv('resultats_sts_comprehensive.csv', index=False)
print("Results saved to 'resultats_sts_comprehensive.csv'")

# Show the best overall model
if len(df_results) > 0:
    # idxmax returns an index label, so select the row with .loc rather than .iloc
    best_idx = df_results['Pearson'].idxmax()
    best_model = df_results.loc[best_idx]
    
    print(f"\n🏆 BEST OVERALL MODEL:")
    print(f"Type: {best_model['Model']}")
    print(f"Variant: {best_model['Variant']}")
    print(f"Pearson: {best_model['Pearson']:.3f}")
    print(f"MSE: {best_model['MSE']:.3f}")

print(f"\n📊 Statistical Summary:")
print(f"Mean Pearson: {df_results['Pearson'].mean():.3f} ± {df_results['Pearson'].std():.3f}")
print(f"Mean MSE: {df_results['MSE'].mean():.3f} ± {df_results['MSE'].std():.3f}")
print(f"Best Pearson: {df_results['Pearson'].max():.3f}")
print(f"Worst Pearson: {df_results['Pearson'].min():.3f}")