# Classificador de Sentiments a Xarxes Socials en Català (CSXSC): Dataset

**Author:** Daniel Arias Cámara  
**Date:** July 2025  

**Description:**  This notebook aims to build a high-quality dataset for fine-tuning the **CSXSC** model. The dataset is constructed by combining trusted data sources, including structured sentiment corpora and translated social media content. Details on data origin and preprocessing steps are provided in the sections below.


## 1. GuiaCat Dataset

**Description:** This dataset consists of 5,750 restaurant reviews in Catalan, sourced from the GuiaCat platform. Each review includes individual ratings for service, food, price-quality ratio, and atmosphere, along with an overall average score.

**Access:** [projecte-aina/GuiaCat on Hugging Face](https://huggingface.co/datasets/projecte-aina/GuiaCat)

**Source:** Aina Project

**Notes:**  
The dataset is divided into three subsets:  
- **Train:** 4,750 rows  
- **Validation:** 500 rows  
- **Test:** 500 rows  

The original fields are: Service, Food, Price-quality, Environment, Avg, Text, and Label.  
For our purposes, we retain only the Text and Label fields, discarding the rest.

The Label field includes five sentiment categories:  
- Molt bo (Very good)  
- Bo (Good)  
- Regular (Average)  
- Dolent (Bad)  
- Molt dolent (Very bad)

These are grouped into three classes for sentiment classification:  
- **Positive:** Molt bo and Bo  
- **Neutral:** Regular  
- **Negative:** Dolent and Molt dolent


In [None]:
import subprocess
import sys
from typing import Dict, List

try:
    import pandas as pd
    from datasets import load_dataset, concatenate_datasets
except ImportError:
    # Install into the interpreter running this notebook, then retry the imports.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "pandas", "datasets", "pyarrow"])
    import pandas as pd
    from datasets import load_dataset, concatenate_datasets

COLUMNS_TO_KEEP: List[str] = ["text", "label"]
DATASET_NAME: str = "projecte-aina/GuiaCat"
CSV_FILENAME: str = "guiacat.csv"

def relabel_opinion(opinion: Dict) -> Dict:
    label = opinion["label"].lower()
    if label in ["molt bo", "bo"]:
        opinion["label"] = "positive"
    elif label == "regular":
        opinion["label"] = "neutral"
    elif label in ["dolent", "molt dolent"]:
        opinion["label"] = "negative"
    return opinion

def process_and_combine_dataset(dataset_name: str) -> pd.DataFrame:
    print(f"Loading and processing '{dataset_name}'")
    raw_dataset = load_dataset(dataset_name)
    
    processed_splits = []
    for split in raw_dataset:
        processed_split = raw_dataset[split].map(relabel_opinion)
        drop_columns = [col for col in processed_split.column_names if col not in COLUMNS_TO_KEEP]
        processed_splits.append(processed_split.remove_columns(drop_columns))
    

    combined_dataset = concatenate_datasets(processed_splits)
    return combined_dataset.to_pandas()


guiacat_df = process_and_combine_dataset(DATASET_NAME)
guiacat_df.to_csv(CSV_FILENAME, index=False)

total_rows = len(guiacat_df)
label_dist = guiacat_df['label'].value_counts(normalize=True) * 100

positive_pct = label_dist.get('positive', 0)
negative_pct = label_dist.get('negative', 0)
neutral_pct = label_dist.get('neutral', 0)

print("\nFinal Dataset Distribution")
print(
    f"Total Rows: {total_rows}\n"
    f"Distribution: {positive_pct:.1f}% Positive, "
    f"{negative_pct:.1f}% Negative, "
    f"{neutral_pct:.1f}% Neutral"
)



Loading and processing 'projecte-aina/GuiaCat'

Final Dataset Distribution
Total Rows: 5750
Distribution: 94.3% Positive, 3.6% Negative, 2.1% Neutral


## 2. Catalan Structured Sentiment Analysis (CaSSA) Dataset

**Description:** The CaSSA dataset contains 6,400 reviews and forum messages in Catalan, annotated at the fine-grained level with polar expressions. Each text instance is labeled with all the sentiment expressions it contains. For each polar expression, the annotation includes the **expression itself**, the **target** (i.e., the object of the sentiment), and the **source** (i.e., the subject expressing the sentiment). In total, 25,453 polar expressions have been annotated.

**Access:** [projecte-aina/CaSSA on Hugging Face](https://huggingface.co/datasets/projecte-aina/CaSSA-catalan-structured-sentiment-analysis)

**Source:** Aina Project

**Notes:**

Each instance in the dataset is a text. A text can contain zero or more polar expressions, stored in the `opinions` field. Each opinion has a source, a target, a polar expression, a polarity value, and an intensity value.

To convert this structured information into a single sentiment label per text, we apply the following strategy:
- Count the Positive, Negative, and Neutral polarities across all opinions in a text.
- Assign the sentiment label based on the dominant polarity.
  - If Positive polar expressions strictly outnumber both Negative and Neutral ones: **Positive**.
  - If Negative polar expressions strictly outnumber both Positive and Neutral ones: **Negative**.
  - Otherwise (a tie, a Neutral plurality, or no polar expressions): **Neutral**.
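
As a rough illustration of the rule above, the sketch below applies it to a few hand-made opinion lists. The function name `dominant_polarity` and the example inputs are hypothetical; only the `Polarity` key and the strict comparisons follow the notebook's own processing code.

```python
from collections import Counter
from typing import Dict, List, Optional


def dominant_polarity(opinions: List[Dict[str, Optional[str]]]) -> str:
    """Return the strictly most frequent polarity, or 'neutral' otherwise."""
    # Normalise case and whitespace; missing polarities count as "".
    counts = Counter((op.get("Polarity") or "").strip().lower() for op in opinions)
    pos, neg, neu = counts["positive"], counts["negative"], counts["neutral"]
    if pos > neg and pos > neu:
        return "positive"
    if neg > pos and neg > neu:
        return "negative"
    # Ties, neutral pluralities and empty opinion lists all end up here.
    return "neutral"


# Hypothetical opinion lists, not taken from CaSSA.
examples = {
    "two positive, one negative": [
        {"Polarity": "Positive"},
        {"Polarity": "Positive"},
        {"Polarity": "Negative"},
    ],
    "one positive, one negative": [
        {"Polarity": "Positive"},
        {"Polarity": "Negative"},
    ],
    "no opinions": [],
}

for name, opinions in examples.items():
    print(f"{name}: {dominant_polarity(opinions)}")
```

Note that a text with one Positive and one Negative expression is labeled Neutral, so the Neutral class mixes genuinely neutral texts with mixed-polarity ones.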

In [2]:
import subprocess
import sys
from typing import Dict, List

try:
    import pandas as pd
    from datasets import load_dataset
except ImportError:
    # Install into the interpreter running this notebook, then retry the imports.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "pandas", "datasets", "pyarrow"])
    import pandas as pd
    from datasets import load_dataset

DATASET_NAME: str = "projecte-aina/CaSSA-catalan-structured-sentiment-analysis"
CSV_FILENAME: str = "cassa.csv"
COLUMNS_TO_KEEP: List[str] = ["text", "label"]

def relabel_from_opinions(item: Dict) -> Dict:
    pos = neg = neu = 0
    for opinion in item.get("opinions", []):
        polarity = (opinion.get("Polarity") or "").strip().lower()
        if polarity == "positive":
            pos += 1
        elif polarity == "negative":
            neg += 1
        elif polarity == "neutral":
            neu += 1

    if pos > neg and pos > neu:
        label = "positive"
    elif neg > pos and neg > neu:
        label = "negative"
    else:
        label = "neutral"

    return {"text": item["text"], "label": label}

def process_cassa_dataset(dataset_name: str) -> pd.DataFrame:
    print(f"Loading and processing '{dataset_name}'...")

    raw_dataset = load_dataset(dataset_name)["train"]
    processed_dataset = raw_dataset.map(relabel_from_opinions)
    
    drop_columns = [col for col in processed_dataset.column_names if col not in COLUMNS_TO_KEEP]
    cleaned_dataset = processed_dataset.remove_columns(drop_columns)
    
    return cleaned_dataset.to_pandas()

cassa_df = process_cassa_dataset(DATASET_NAME)
cassa_df.to_csv(CSV_FILENAME, index=False)

total_rows = len(cassa_df)
label_dist = cassa_df['label'].value_counts(normalize=True) * 100

positive_pct = label_dist.get('positive', 0)
negative_pct = label_dist.get('negative', 0)
neutral_pct = label_dist.get('neutral', 0)

print("\nFinal Dataset Distribution")
print(
    f"Total Rows: {total_rows}\n"
    f"Distribution: {positive_pct:.1f}% Positive, "
    f"{negative_pct:.1f}% Negative, "
    f"{neutral_pct:.1f}% Neutral"
)

Loading and processing 'projecte-aina/CaSSA-catalan-structured-sentiment-analysis'...

Final Dataset Distribution
Total Rows: 6400
Distribution: 64.0% Positive, 9.5% Negative, 26.5% Neutral


## 3. GoEmotions Dataset


**Description:**  The **GoEmotions** dataset is a large-scale human-annotated corpus of 58k English Reddit comments labeled for 27 emotion categories plus neutral. It was developed by Google AI to support fine-grained sentiment and emotion classification in user-generated content.  
Each comment may have one or multiple labels, making it suitable for multilabel classification tasks.

**Access:** [Kaggle - GoEmotions Dataset](https://www.kaggle.com/datasets/debarshichanda/goemotions)  
**Source:** Google AI  

### Notes

Since each Reddit comment can carry several of the 27 emotion labels (or the neutral label), we adopt a two-step mapping strategy to reduce them to three sentiment categories: Positive, Negative, and Neutral.  
The code below reads the `simplified` configuration of `go_emotions` through the `datasets` library and uses only its train split.

#### **Step 1: Mapping Emotions to Broad Sentiment Groups**

- **Positive**: amusement, excitement, joy, love, desire, optimism, caring, pride, admiration, gratitude, relief, approval.  
- **Negative**: fear, nervousness, remorse, embarrassment, disappointment, sadness, grief, disgust, anger, annoyance, disapproval.  
- **Ambiguous**: realization, surprise, curiosity, confusion.

#### **Step 2: Decision Rule for Classification**

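For each comment, count how many of its emotions fall into each group, then apply:

1. **Positive**: the Positive count is strictly greater than both the Negative and the Ambiguous counts.  
2. **Negative**: the Negative count is strictly greater than both the Positive and the Ambiguous counts.  
3. **Neutral**: everything else, including ties, comments dominated by Ambiguous emotions, and comments whose only label is neutral.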
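
The two steps can be sketched on emotion names directly, as below. This is an illustrative variant rather than the notebook's exact function (which works on integer label ids); the name `sentiment_from_emotions` and the example inputs are hypothetical, while the three emotion groups are copied from Step 1.

```python
from typing import Iterable

POSITIVE = {"amusement", "excitement", "joy", "love", "desire", "optimism",
            "caring", "pride", "admiration", "gratitude", "relief", "approval"}
NEGATIVE = {"fear", "nervousness", "remorse", "embarrassment", "disappointment",
            "sadness", "grief", "disgust", "anger", "annoyance", "disapproval"}
AMBIGUOUS = {"realization", "surprise", "curiosity", "confusion"}


def sentiment_from_emotions(emotions: Iterable[str]) -> str:
    """Collapse a comment's emotion names into positive / negative / neutral."""
    names = list(emotions)
    pos = sum(name in POSITIVE for name in names)
    neg = sum(name in NEGATIVE for name in names)
    amb = sum(name in AMBIGUOUS for name in names)
    # A class wins only when it strictly outnumbers both other groups.
    if pos > neg and pos > amb:
        return "positive"
    if neg > pos and neg > amb:
        return "negative"
    # Ties, ambiguous-dominated comments and the unmapped "neutral" label.
    return "neutral"


# Hypothetical label combinations, not taken from the dataset.
for labels in (["joy", "gratitude"], ["anger", "admiration"],
               ["joy", "surprise"], ["neutral"]):
    print(labels, "->", sentiment_from_emotions(labels))
```

The third example shows a side effect of counting the Ambiguous group as a competitor: a comment tagged with one Positive and one Ambiguous emotion is classified as Neutral, not Positive.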

### Translation to Catalan

After sentiment classification, the selected comments are translated into Catalan using the  
[**Aina Project English–Catalan Translator**](https://huggingface.co/projecte-aina/aina-translator-en-ca),  
a machine translation model trained specifically for English to Catalan.

### Translation Quality Check

To check translation quality:

1. Draw a random sample of the translated comments (500 in this run) and evaluate each one with the [**Salamandra 7B Instruct**](https://huggingface.co/BSC-LT/salamandra-7b-instruct) model.  
   - **Task 1:** Read the translated comment and rate the translation quality on a **1–5 scale**,  
     allowing for the informal nature of social media (emojis, slang, grammar errors).  
   - **Task 2:** For translations scored 2 or lower, briefly explain why the quality is low.

2. **Filtering:**  
   - Sampled translations scoring **below 3** are **discarded**; a reply that cannot be parsed as a number is treated as a 3 and kept.  
   - Each discarded text is printed together with its justification for transparency.
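
The filtering step depends on turning the evaluator's free-text reply into an integer. The sketch below shows one possible way to do that and to split texts by the threshold. It is a hedged variant, not the notebook's implementation: `parse_score`, `split_by_quality`, and the sample replies are hypothetical, and the regular-expression parsing is an assumption (the notebook reads only the first character of the reply).

```python
import re
from typing import List, Tuple

KEEP_THRESHOLD = 3  # scores below this are discarded
DEFAULT_SCORE = 3   # assumption: unparseable replies are kept


def parse_score(reply: str, default: int = DEFAULT_SCORE) -> int:
    """Extract the first standalone digit 1-5 from the evaluator's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else default


def split_by_quality(scored: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split (text, reply) pairs into kept and discarded texts."""
    kept: List[str] = []
    discarded: List[str] = []
    for text, reply in scored:
        if parse_score(reply) >= KEEP_THRESHOLD:
            kept.append(text)
        else:
            discarded.append(text)
    return kept, discarded


# Hypothetical evaluator replies, for illustration only.
sample = [("text A", "4"), ("text B", "Puntuació: 2"), ("text C", "no ho sé")]
kept, discarded = split_by_quality(sample)
print("kept:", kept)
print("discarded:", discarded)
```

Defaulting to 3 means a malformed reply never removes a sample; a stricter pipeline could default to a discard score instead.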

In [3]:
import gc
import os
import subprocess
import sys

POSITIVE_SAMPLES_TO_ADD = 100
TARGET_DISTRIBUTION = {"positive": 0.40, "negative": 0.30, "neutral": 0.30}
SALAMANDRA_PATH = "/home/user/Escritorio/TFM/salamandra-7b-instruct"
NUM_SAMPLES_TO_EVALUATE = 500

# Install any missing dependency before the third-party imports below.
for package in ["datasets", "pandas", "tqdm", "ctranslate2", "sentencepiece", "huggingface_hub", "transformers", "torch"]:
    try:
        __import__(package.split("[")[0])
    except ImportError:
        print(f"Installing required library: {package}")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

import ctranslate2
import pandas as pd
import sentencepiece as spm
import torch
from datasets import load_dataset
from huggingface_hub import snapshot_download
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

emotion_id2label = ["admiration","amusement","anger","annoyance","approval","caring","confusion","curiosity","desire","disappointment","disapproval","disgust","embarrassment","excitement","fear","gratitude","grief","joy","love","nervousness","optimism","pride","realization","relief","remorse","sadness","surprise","neutral"]
sentiment_map = {"positive": {"amusement","excitement","joy","love","desire","optimism","caring","pride","admiration","gratitude","relief","approval"},"negative": {"fear","nervousness","remorse","embarrassment","disappointment","sadness","grief","disgust","anger","annoyance","disapproval"},"ambiguous": {"realization","surprise","curiosity","confusion"}}

def parse_labels(labels_str):
    """Return label ids as a list of ints; lists pass through, bad input gives []."""
    if isinstance(labels_str, list):
        return labels_str
    try:
        return [int(i) for i in str(labels_str).split(',')]
    except (ValueError, TypeError):
        return []

def classify_sentiment(emotion_ids):
    """Map a list of GoEmotions label ids to positive / negative / neutral."""
    counts = {"positive": 0, "negative": 0, "ambiguous": 0}
    for eid in emotion_ids:
        # Skip anything that is not a valid index into emotion_id2label.
        if isinstance(eid, int) and 0 <= eid < len(emotion_id2label):
            emotion = emotion_id2label[eid]
            for category, emotions in sentiment_map.items():
                if emotion in emotions:
                    counts[category] += 1
                    break
    # A class wins only if it strictly outnumbers both other groups;
    # ties and the unmapped "neutral" label fall through to neutral.
    if counts["positive"] > counts["negative"] and counts["positive"] > counts["ambiguous"]:
        return "positive"
    elif counts["negative"] > counts["positive"] and counts["negative"] > counts["ambiguous"]:
        return "negative"
    else:
        return "neutral"

def add_sentiment_label(example):
    parsed_ids = parse_labels(example["labels"])
    example["sentiment_label"] = classify_sentiment(parsed_ids)
    return example

def initialize_models():
    print("Initializing all models.")
    model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-en-ca", revision="main")
    sp_model_path = os.path.join(model_dir, "spm.model")
    sp = spm.SentencePieceProcessor()
    sp.load(sp_model_path)
    translator = ctranslate2.Translator(model_dir, device="auto")

    print("  - AINA translator initialized.")
    gc.collect()
    if torch.cuda.is_available(): torch.cuda.empty_cache()
    eval_tokenizer = AutoTokenizer.from_pretrained(SALAMANDRA_PATH)
    eval_model = AutoModelForCausalLM.from_pretrained(SALAMANDRA_PATH, device_map="auto", torch_dtype=torch.float16)
    print("  - Salamandra evaluator LLM initialized.")
    return sp, translator, eval_tokenizer, eval_model

def translate_en_to_ca(text, sp, translator):
    try:
        tokens = sp.encode(text, out_type=str)
        translation = translator.translate_batch([tokens])
        return sp.decode(translation[0][0]["tokens"])
    except Exception: return ""

def evaluate_translation(text_ca, tokenizer, model):
    score_prompt = (
        "Ets un avaluador de qualitat de traducció en català. "
        "Et donaré un text en català procedent de xarxes socials, "
        "traduït automàticament des de l'anglès. "
        "Avalua la qualitat de la traducció en una escala del 1 (molt dolenta) al 5 (excel·lent). "
        "Ignora errors ortogràfics menors, abreviatures o estil informal. "
        "Tingues en compte que el text pot ser informal o col·loquial ja que prové d'opinions de xarxes socials. "
        "Respon només amb el número."
        f"\n\nText: {text_ca}"
    )
    messages_score = [{"role": "user", "content": score_prompt}]
    chat_text_score = tokenizer.apply_chat_template(messages_score, add_generation_prompt=True, tokenize=False)
    inputs_score = tokenizer(chat_text_score, return_tensors="pt").to(model.device)
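    # Pass pad_token_id explicitly to avoid the per-call "Setting `pad_token_id`" warning.
    outputs_score = model.generate(**inputs_score, max_new_tokens=5, temperature=0.1, pad_token_id=tokenizer.eos_token_id)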
    score_text = tokenizer.decode(outputs_score[0][len(inputs_score["input_ids"][0]):], skip_special_tokens=True).strip()
    try: score = int(score_text[0])
    except (ValueError, IndexError): score = 3
    explanation = ""
    if score <= 2:
        explanation_prompt = (
            "El text següent és una traducció automàtica de l'anglès al català. "
            "La seva qualitat s'ha valorat amb una puntuació baixa (≤ 2) "
            "en una escala de 1 a 5. Explica breument per què podria ser de baixa qualitat, "
            "centrant-te en problemes de traducció i no en el contingut."
            f"\n\nText: {text_ca}"
        )
        messages_explanation = [{"role": "user", "content": explanation_prompt}]
        chat_text_expl = tokenizer.apply_chat_template(messages_explanation, add_generation_prompt=True, tokenize=False)
        inputs_expl = tokenizer(chat_text_expl, return_tensors="pt").to(model.device)
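        # Pass pad_token_id explicitly to avoid the per-call "Setting `pad_token_id`" warning.
        outputs_expl = model.generate(**inputs_expl, max_new_tokens=100, temperature=0.3, pad_token_id=tokenizer.eos_token_id)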
        explanation = tokenizer.decode(outputs_expl[0][len(inputs_expl["input_ids"][0]):], skip_special_tokens=True).strip()
    return score, explanation

def classify_and_analyze_goemotions():
    print("- Step 1: Analyzing GoEmotions Dataset")
    goemotions_train = load_dataset("go_emotions", "simplified")["train"]
    classified_ds = goemotions_train.map(add_sentiment_label, batched=False)
    return classified_ds

def analyze_base_datasets():
    print("\n- Step 2: Analyzing Base Datasets (cassa.csv + guiacat.csv)")
    try:
        combined_df = pd.concat([pd.read_csv("cassa.csv"), pd.read_csv("guiacat.csv")], ignore_index=True)
        label_counts = combined_df['label'].value_counts().to_dict()
        print(f"Combined base dataset has {len(combined_df):,} rows. Distribution: {label_counts}")
        return label_counts
    except FileNotFoundError as e:
        print(f"Error: {e}. Ensure 'cassa.csv' and 'guiacat.csv' are in the root path.")
        return None

def collect_and_translate_goemotions(classified_ds, needed_counts, sp, translator):
    print("\n- Step 3: Collecting and Translating from GoEmotions")
    collected_samples = []
    goemotions_iterator = iter(classified_ds)
    with tqdm(total=sum(needed_counts.values()), desc="Collecting samples") as pbar:
        while sum(needed_counts.values()) > 0:
            try: row = next(goemotions_iterator)
            except StopIteration:
                print("\nWarning: Reached end of GoEmotions dataset before collecting all samples.")
                break
            sentiment = row["sentiment_label"]
            if needed_counts.get(sentiment, 0) > 0:
                translated_text = translate_en_to_ca(row["text"], sp, translator)
                if translated_text:
                    collected_samples.append({"text": translated_text, "label": sentiment})
                    needed_counts[sentiment] -= 1
                    pbar.update(1)
    return pd.DataFrame(collected_samples)

def filter_bad_translations(df, num_samples, tokenizer, model):
    print(f"\n- Step 4: Evaluating {num_samples} random translations before saving")
    if len(df) == 0: return df
    sample_df = df.sample(n=min(num_samples, len(df)), random_state=42)
    bad_texts = set()
    for _, row in tqdm(sample_df.iterrows(), total=len(sample_df), desc="Evaluating translations"):
        score, explanation = evaluate_translation(row["text"], tokenizer, model)
        if score <= 2:
            print("\nDiscarded Translation:")
            print(f"Text: {row['text']}")
            print(f"Reason: {explanation}\n")
            bad_texts.add(row["text"])
    if bad_texts:
        print(f"Removing {len(bad_texts)} bad translations from dataset")
        df = df[~df["text"].isin(bad_texts)]
    return df

if __name__ == "__main__":
    classified_goemotions = classify_and_analyze_goemotions()
    base_counts = analyze_base_datasets()
    if base_counts:
        current_pos, current_neg, current_neu = base_counts.get("positive",0), base_counts.get("negative",0), base_counts.get("neutral",0)
        final_pos_count = current_pos + POSITIVE_SAMPLES_TO_ADD
        total_final_size = final_pos_count / TARGET_DISTRIBUTION["positive"]
        needed_counts = {
            "positive": POSITIVE_SAMPLES_TO_ADD,
            "negative": max(0, int(total_final_size * TARGET_DISTRIBUTION["negative"] - current_neg)),
            "neutral": max(0, int(total_final_size * TARGET_DISTRIBUTION["neutral"] - current_neu))
        }
        print("\nCollection Plan:")
        for label, count in needed_counts.items(): print(f"  Collect {label.capitalize()}: {count:,} rows")
        
        sp_translator, ctranslate_translator, eval_tokenizer, eval_model = initialize_models()
        
        goemotions_df = collect_and_translate_goemotions(
            classified_goemotions, 
            needed_counts.copy(), 
            sp_translator, 
            ctranslate_translator
        )
        
        goemotions_df = filter_bad_translations(goemotions_df, NUM_SAMPLES_TO_EVALUATE, eval_tokenizer, eval_model)
        
        output_filename = "goemotions.csv"
        goemotions_df.to_csv(output_filename, index=False)
        print(f"\nSaved {len(goemotions_df):,} cleaned samples to '{output_filename}'")
        
    print("\nFull pipeline finished.")
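
The collection plan computed in the cell above can be reproduced by hand. The sketch below repeats the arithmetic with the combined GuiaCat + CaSSA counts reported in the run output (9,517 positive, 1,815 neutral, 818 negative). The helper name `collection_plan` is hypothetical; the constants and the formula mirror the cell's own code.

```python
from typing import Dict

TARGET_DISTRIBUTION = {"positive": 0.40, "negative": 0.30, "neutral": 0.30}
POSITIVE_SAMPLES_TO_ADD = 100
# Counts of the combined base datasets, as printed in the run output.
base_counts = {"positive": 9517, "neutral": 1815, "negative": 818}


def collection_plan(base: Dict[str, int],
                    target: Dict[str, float],
                    positive_to_add: int) -> Dict[str, int]:
    """Number of GoEmotions rows to collect per class to approach `target`."""
    # Fix the final positive count, then derive the total size that would
    # make positives the requested share of the dataset.
    final_positive = base.get("positive", 0) + positive_to_add
    total_final_size = final_positive / target["positive"]
    plan = {"positive": positive_to_add}
    for label in ("negative", "neutral"):
        # Rows still missing for this class, never below zero.
        plan[label] = max(0, int(total_final_size * target[label] - base.get(label, 0)))
    return plan


plan = collection_plan(base_counts, TARGET_DISTRIBUTION, POSITIVE_SAMPLES_TO_ADD)
final_counts = {label: base_counts[label] + plan[label] for label in plan}
total = sum(final_counts.values())

print("Plan:", plan)
for label, count in final_counts.items():
    print(f"{label}: {count:,} rows ({count / total:.1%})")
```

With these inputs the positive count becomes 9,617, the implied total is 9,617 / 0.40 = 24,042.5, and each of the other classes needs 0.30 × 24,042.5 = 7,212.75 rows, which after subtracting the existing rows and truncating gives 6,394 negative and 5,397 neutral rows, matching the plan in the log.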


- Step 1: Analyzing GoEmotions Dataset

- Step 2: Analyzing Base Datasets (cassa.csv + guiacat.csv)
Combined base dataset has 12,150 rows. Distribution: {'positive': 9517, 'neutral': 1815, 'negative': 818}

Collection Plan:
  Collect Positive: 100 rows
  Collect Negative: 6,394 rows
  Collect Neutral: 5,397 rows
Initializing all models.


Fetching 6 files: 100%|██████████| 6/6 [00:00<00:00, 102717.65it/s]


  - AINA translator initialized.


Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.20s/it]


  - Salamandra evaluator LLM initialized.

- Step 3: Collecting and Translating from GoEmotions


Collecting samples: 100%|██████████| 11891/11891 [16:54<00:00, 11.72it/s]



- Step 4: Evaluating 500 random translations before saving


Evaluating translations:   0%|          | 0/500 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:   0%|          | 1/500 [01:38<13:36:14, 98.15s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:   0%|          | 2/500 [03:06<12:44:40, 92.13s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:   1%|          | 3/500 [04:34<12:30:08, 90.56s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:   1%|          | 4/500 [08:12<19:24:15, 140.84s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Tose dm de cvs igual que delsym
Reason: La traducció automàtica d'aquesta frase anglesa "Tose dm de cvs igual que delsym" ha estat considerada de mala qualitat perquè conté errors gramaticals i ortogràfics. Concretament, hi ha faltes d'ortografia com ara la paraula 'dm', que hauria de ser 'damage'. També hi ha un error gramatical en l'estructura de les frases, ja que algunes paraules estan mal col·locades o utilitzades incorrectament. Per exemple, la paraula





Discarded Translation:
Text: Fa massa fred per sortir al carrer la major part de l'any. Plorar per les càmeres web és tot el que tenen!
Reason: La frase "Fa massa fred per sortir al carrer la major part de l'any" probablement es refereix a un país d'Europa continental o Àsia del Nord, mentre que "Plorar per les càmeres web" sembla referir-se als Estats Units. Això fa pensar que hi ha hagut algun tipus de malentès durant la traducció.





Discarded Translation:
Text: Tria'n una. - El socialisme no pot treballar la gent la naturalesa humana és intrínsecament egoista. - No necessitem impostos perquè la gent serà naturalment caritativa.
Reason: La primera frase conté un error gramatical; hauria d'"intrinsicament" en comptes d'"intrínsecament". L'altra frase es basa en l'error que les persones són inherentment generoses sense necessitat d'impostos. Aquestes frases podrien haver estat escrites per algú que no parla bé el català o que té coneixements limitats sobre economia política.





Discarded Translation:
Text: >els dos partits són ximples > tribalisme estúpid  ⁇ 
Reason: Possible raó de la mala qualitat: l’ús inadequat dels signes d’interrogació i exclamació.





Discarded Translation:
Text: Típic canal d'escombraries Lebronsexual. No veure vídeo.
Reason: Possibles problemes de traducció: "Típico canal basura Lebronsexual. No ver video." -> "Típic canal escombraria Lebronsexual. No mirar vídeo."





Discarded Translation:
Text: Baixar en bicicleta per Powell és *una puta idea estúpida*.
Reason: La frase "Bajar en bicicleta por Powell es una puta idea estúpida" prové d’un capítol de la sèrie televisiva The Simpsons anomenat 'Treehouse of Horror XIV', emesa l’any 2003. Es tracta d’una traducció literal de l’expressió anglesa "riding bikes down Powell is a fucking stupid idea", que fa referència a un passatge fictici dins Springfield en què els nens fan servir bicicletes per baixar pel carrer Powell. El context de





Discarded Translation:
Text: No es diu la ciència immortal sense raó ey
Reason: La traducció pot contenir errors gramaticals o sintàctics que afecten la comprensió del missatge original. En aquest cas, sembla haver-hi un error ortogràfic o gramatical que afecta la coherència de la frase traduïda.






Discarded Translation:
Text: jo com la merda
Reason: La traducció pot contenir errors gramaticals o de vocabulari que afecten la comprensió del missatge original. En aquest cas, sembla que hi ha un problema de traducció que fa que el resultat final sigui difícil d'entendre o incorrecte gramaticalment.






Discarded Translation:
Text: Desacreditat com en demostrat malament. Una vegada més, és un ximple ben conegut. Feu-vos un favor i mai prendre res del que diu remotament seriosament.
Reason: La traducció pot tenir errors gramaticals o sintàctics. També hi ha paraules mal traduïdes o faltes d’ortografia. Per exemple: "un" hauria de ser "una", ja que la paraula anterior era femenina; també faltava l’accent agut sobre la "e". En general, la traducció sembla massa literal i manca fluïdesa.



Evaluating translations:  36%|███▌      | 181/500 [3:44:57<8:21:33, 94.34s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  36%|███▋      | 182/500 [3:46:06<7:39:42, 86.74s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  37%|███▋      | 183/500 [3:47:14<7:08:49, 81.16s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  37%|███▋      | 184/500 [3:48:23<6:46:50, 77.25s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  37%|███▋      | 185/500 [3:49:33<6:34:18, 75.10s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  37%|███▋      | 186/500 [3:50:43<6:25:43, 73.70s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  37%|███▋      | 187/500 [3:51:50<6:14:14, 71.74s/it]


Discarded Translation:
Text: aw vostè no ha d'afaitar! l'afaitat en realitat causa encarnades com aquest tipus: 0
Reason: La traducció pot tenir errors gramaticals o sintàctics que afecten la comprensió del missatge original. En concret, hi ha faltes d'ortografia i possibles incorreccions en la construcció de les frases. Per exemple, "l'afaitat" hauria de ser "raspallar", ja que es refereix a un procés diferent. Aquestes errades poden dificultar la lectura i interpretació correcta del text traduït.



Evaluating translations:  39%|███▉      | 197/500 [4:04:59<7:35:02, 90.11s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  40%|███▉      | 198/500 [4:06:05<6:57:38, 82.97s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  40%|███▉      | 199/500 [4:07:14<6:34:42, 78.68s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  40%|████      | 200/500 [4:08:21<6:15:50, 75.17s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  40%|████      | 201/500 [4:09:30<6:05:42, 73.39s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  40%|████      | 202/500 [4:10:37<5:54:08, 71.30s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  41%|████      | 203/500 [4:11:44<5:47:39, 70.24s/it]


Discarded Translation:
Text: Mala sort, cabrons.
Reason: Possible raó: El text original estava escrit en castellà i es va traduir directament al català sense fer servir un traductor automatitzat adequat o revisar la traducció manualment. Això pot haver provocat errors gramaticals i faltes d’ortografia que van afectar negativament la qualitat de la traducció.



Evaluating translations:  42%|████▏     | 210/500 [4:21:09<6:56:00, 86.07s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  42%|████▏     | 211/500 [4:22:17<6:27:49, 80.52s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  42%|████▏     | 212/500 [4:24:40<7:56:32, 99.28s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Aquest joc és en realitat pitjor del que esperava i sabia que seria dolent
Reason: Possible raó: la traducció pot contenir errors o expressions inadequades que afecten negativament la claredat o precisió del missatge original.



Evaluating translations:  43%|████▎     | 213/500 [4:25:46<7:08:01, 89.48s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  43%|████▎     | 214/500 [4:26:54<6:34:45, 82.82s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  43%|████▎     | 215/500 [4:28:04<6:15:12, 78.99s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  43%|████▎     | 216/500 [4:29:14<6:00:54, 76.25s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  43%|████▎     | 217/500 [4:30:21<5:47:19, 73.64s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  44%|████▎     | 218/500 [4:31:29<5:37:52, 71.89s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  44%|████▍     | 219/500 [4:32:37<5:31:50, 70.86s/it]


Discarded Translation:
Text: No tens cor si anomenes “druggies” a la gent que fuma marihuana i he acabat de discutir amb tu, que tinguis un bon dia.
Reason: La frase "No tens cor" es refereix a algú que manca d’empatia o compassió cap als altres. En aquest context, l’autor està expressant ira perquè se sent menyspreat pel fet que l’interlocutor els hagi anomenat "drogat". Per tant, la intenció de les paraules és insultar l’altra persona. Això fa que sigui difícil traduir aquesta part sense perdre matisos importants.



Evaluating translations:  46%|████▌     | 228/500 [4:44:52<6:55:47, 91.72s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  46%|████▌     | 229/500 [4:47:53<8:55:00, 118.45s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Argh. L'abús més estúpid i innecessari dels mestres de poder ho fan. Alguns particularment estúpids almenys.
Reason: La traducció sembla haver estat automatitzada mitjançant un traductor com ara Google Translate. El resultat conté errors gramaticals i lèxics que afecten la comprensió. Per exemple, "Argh" es pot traduir com "Ai!". Però aquí significa "Uau!", cosa que confondria els parlants catalans. També hi ha faltes d’ortografia, com ara "ho", quan hauria de ser "això". I també falta algun article o preposició, com ara "



Evaluating translations:  46%|████▌     | 230/500 [4:49:00<7:44:17, 103.18s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  46%|████▌     | 231/500 [4:50:11<6:59:51, 93.65s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  46%|████▋     | 232/500 [4:53:00<8:38:23, 116.06s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: ARGH!!
Reason: La traducció automàtica d’aquesta paraula anglesa "ARGH" pot donar lloc a diverses paraules catalanes com ara "AAAARG", "AAAAAAAGH", etc., que poden resultar confuses o fins i tot sense sentit en la llengua catalana. Per tant, cal seleccionar acuradament l'opció més adequada segons el context. En aquest cas concret, sembla que es tracta d'una expressió d'enuig, així doncs, la millor opció seria probablement "AAAARG". No obstant això



Evaluating translations:  47%|████▋     | 233/500 [4:54:07<7:31:48, 101.53s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  47%|████▋     | 234/500 [4:55:19<6:50:38, 92.63s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  47%|████▋     | 235/500 [4:56:28<6:17:49, 85.54s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  47%|████▋     | 236/500 [4:57:37<5:54:03, 80.47s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  47%|████▋     | 237/500 [4:58:44<5:34:55, 76.41s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  48%|████▊     | 238/500 [5:01:02<6:54:12, 94.86s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Tenint aquest problema, així que realment em fa en
Reason: La traducció pot tenir errors gramaticals o sintàctics, la qual cosa afecta negativament la comprensió del missatge original.



Evaluating translations:  48%|████▊     | 239/500 [5:02:12<6:20:06, 87.38s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  48%|████▊     | 240/500 [5:03:18<5:51:15, 81.06s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  48%|████▊     | 241/500 [5:04:28<5:36:05, 77.86s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  48%|████▊     | 242/500 [5:05:37<5:22:20, 74.97s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  49%|████▊     | 243/500 [5:06:47<5:14:57, 73.53s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  49%|████▉     | 244/500 [5:07:53<5:03:55, 71.23s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  49%|████▉     | 245/500 [5:09:02<5:00:18, 70.66s/it]


Discarded Translation:
Text: No hi ha res adorable en aquest home aterridor
Reason: La traducció pot contenir errors gramaticals o sintàctics que afecten la comprensió del missatge original. En concret, l'ús de "No" com a adverbi davant d'un adjectiu pot canviar el significat de manera inesperada. També es podrien haver produït altres errors de traducció relacionats amb les diferències entre els sistemes lingüístics espanyol/català i anglès.



Evaluating translations:  50%|████▉     | 249/500 [5:15:12<6:15:23, 89.73s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  50%|█████     | 250/500 [5:16:20<5:46:40, 83.20s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  50%|█████     | 251/500 [5:17:31<5:29:54, 79.50s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  50%|█████     | 252/500 [5:19:56<6:49:16, 99.02s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Nois aquest equip és  ⁇ 
Reason: Possible problema de traducció des d’una altra llengua estrangera que no sigui l’anglès. El signe «⁇» pot tenir diferents significats segons la llengua d’origen.



Evaluating translations:  51%|█████     | 253/500 [5:21:03<6:08:44, 89.57s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  51%|█████     | 254/500 [5:22:11<5:40:55, 83.15s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  51%|█████     | 255/500 [5:23:21<5:23:25, 79.21s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  51%|█████     | 256/500 [5:24:30<5:09:04, 76.00s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  51%|█████▏    | 257/500 [5:25:40<5:00:27, 74.19s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  52%|█████▏    | 258/500 [5:28:22<6:45:49, 100.62s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Això és increïblement descoratjador.
Reason: La traducció pot ser considerada com de mala qualitat degut als errors gramaticals presents en la frase traduïda. Concretament, hi ha un error d'ortografia ("d" en lloc de "de") que afecta l'estructura gramatical de la frase. Aquest tipus d'errors poden fer difícil o impossible entendre el missatge original, cosa que fa baixar la qualitat de la traducció.



Evaluating translations:  52%|█████▏    | 259/500 [5:29:30<6:05:08, 90.90s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  52%|█████▏    | 260/500 [5:32:03<7:18:06, 109.53s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: El mateix. El borrissol corporal (en el meu propi cos...no m'importa el que faci ningú amb el seu borrissol corporal) em fa fàstic d'una puta vegada.
Reason: La traducció sembla tenir errors gramaticals o sintàctics, la qual cosa afecta negativament la comprensió del text original. Aquest fet pot fer baixar la qualitat de la traducció.



Evaluating translations:  52%|█████▏    | 261/500 [5:33:13<6:29:18, 97.73s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  52%|█████▏    | 262/500 [5:34:23<5:53:36, 89.15s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  53%|█████▎    | 263/500 [5:35:29<5:25:07, 82.31s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  53%|█████▎    | 264/500 [5:36:39<5:08:57, 78.55s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  53%|█████▎    | 265/500 [5:37:50<4:59:30, 76.47s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  53%|█████▎    | 266/500 [5:40:54<7:03:47, 108.66s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Rebutja la teva humanitat voluntàriament i posa l'odi pur, la manca de moral, la sociopatia i l'altitud sàdica dins de la teva ànima sense poder fer marxa enrere.
Reason: La traducció pot tenir errors gramaticals o sintàctics que afecten negativament la comprensió del missatge original. També hi ha paraules mal traduïdes que poden canviar el significat de les frases. Per exemple, "voluntàriament" hauria d'haver estat traduït com "voluntària", ja que es refereix a un acte realitzat lliurement i conscientment. D'altra banda, "sociopatia" està correctament traduïda però apareix dues vegades seguides en el mateix paràgraf



Evaluating translations:  53%|█████▎    | 267/500 [5:42:04<6:16:32, 96.96s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  54%|█████▎    | 268/500 [5:43:10<5:38:47, 87.62s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  54%|█████▍    | 269/500 [5:44:16<5:12:28, 81.16s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  54%|█████▍    | 270/500 [5:45:25<4:57:08, 77.51s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  54%|█████▍    | 271/500 [5:47:52<6:15:48, 98.47s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: [NOM] Els tatuatges van ser dissenyats específicament per augmentar el seu físic. Aquest noi sembla que la seva germana petita li va fer un gargot mentre dormia.
Reason: La traducció pot tenir errors gramaticals o sintàctics, cosa que afecta negativament la comprensió del missatge original.



Evaluating translations:  54%|█████▍    | 272/500 [5:48:59<5:38:20, 89.04s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  55%|█████▍    | 273/500 [5:50:07<5:13:16, 82.81s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  55%|█████▍    | 274/500 [5:51:17<4:57:27, 78.97s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  55%|█████▌    | 275/500 [5:52:28<4:47:10, 76.58s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  55%|█████▌    | 276/500 [5:53:35<4:35:07, 73.69s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  55%|█████▌    | 277/500 [5:54:42<4:26:09, 71.61s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  56%|█████▌    | 278/500 [5:55:54<4:25:15, 71.69s/it]


Discarded Translation:
Text: Després de veure TLJ, em resulta impossible veure TFA. És una seqüela d'alguna pel·lícula, però aquesta pel·lícula no és TFA.
Reason: La traducció sembla estar feta amb un traductor automàtic que ha generat frases poc naturals o ambigües com ara "és una seqüela d'una pel·lícula", la qual cosa fa difícil entendre si es refereix a una pel·lícula o a alguna altra cosa.



Evaluating translations:  57%|█████▋    | 287/500 [6:07:48<5:13:34, 88.33s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  58%|█████▊    | 288/500 [6:08:57<4:51:06, 82.39s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  58%|█████▊    | 289/500 [6:10:06<4:35:54, 78.46s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  58%|█████▊    | 290/500 [6:11:14<4:23:33, 75.30s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  58%|█████▊    | 291/500 [6:12:22<4:14:15, 72.99s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  58%|█████▊    | 292/500 [6:13:31<4:08:41, 71.74s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  59%|█████▊    | 293/500 [6:14:40<4:05:25, 71.14s/it]


Discarded Translation:
Text: Només dóna-li el seu maleït control total del govern i els mitjans de comunicació, dóna-li els seus maleïts camps de concentració, dóna-li la seva maleïda Solució Final.
Reason: La traducció té un vocabulari ofensiu que pot ferir algunes persones. No obstant això, es tracta d'un problema lingüístic més que de qualitat de traducció.



Evaluating translations:  61%|██████    | 305/500 [6:29:54<4:40:51, 86.42s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  61%|██████    | 306/500 [6:31:05<4:24:44, 81.88s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  61%|██████▏   | 307/500 [6:32:15<4:11:21, 78.14s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  62%|██████▏   | 308/500 [6:33:22<3:59:45, 74.92s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  62%|██████▏   | 309/500 [6:34:28<3:50:14, 72.33s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  62%|██████▏   | 310/500 [6:37:23<5:26:18, 103.04s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Dóna-li 3 dies, quan tots els punts de venda de MSM es retractin tranquil·lament de la història.
Reason: Possible problemes de traducció: "MSM" pot referir-se a un acrònim o terme que potser no està clar sense context addicional. El nombre "3" també pot ser confús si no queda clar quin tipus d’intervals numèrics estan inclosos aquí. Finalment, l’expressió "retractar-se tranquil·lament" pot necessitar més precisió cultural o idiomàtica perquè sigui clara per als parlants catalans.



Evaluating translations:  62%|██████▏   | 311/500 [6:38:29<4:49:45, 91.99s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  62%|██████▏   | 312/500 [6:39:37<4:25:47, 84.83s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  63%|██████▎   | 313/500 [6:40:46<4:09:32, 80.07s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  63%|██████▎   | 314/500 [6:43:16<5:12:42, 100.87s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: No. Tristament, a principis dels 30
Reason: La traducció conté paraules que no són correctes gramaticalment o bé estan mal utilitzades. Per exemple "Tristament" hauria d'estar escrit com "Malauradament". Això pot afectar la comprensió del missatge original.



Evaluating translations:  63%|██████▎   | 315/500 [6:44:27<4:43:33, 91.96s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  63%|██████▎   | 316/500 [6:45:38<4:23:13, 85.83s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  63%|██████▎   | 317/500 [6:48:16<5:27:43, 107.45s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Totes les persones que van pujar a l'estació del pont de conserves aquest matí i van continuar empenyent-se i empenyent-se com a animals.
Reason: La traducció pot tenir errors gramaticals o sintàctics, la qual cosa afecta negativament la comprensió del text original. També hi ha faltes d’ortografia i paraules mal traduïdes, fet que redueix encara més la qualitat de la traducció.



Evaluating translations:  64%|██████▎   | 318/500 [6:49:24<4:49:42, 95.51s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  64%|██████▍   | 319/500 [6:50:32<4:23:00, 87.19s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  64%|██████▍   | 320/500 [6:53:19<5:33:53, 111.30s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: HYS és la pitjor secció de comentaris d'Internet. Fa que els comentaris de Youtube s'assemblin al torn de preguntes dels jardiners.
Reason: La traducció pot ser considerada com de mala qualitat perquè conté errors gramaticals i de vocabulari. Per exemple, "HYS" hauria de traduir-se com "HYF", i "torn de preguntes dels jardiners" seria més adequat com "debat entre jardiners". Aquests canvis milloren significativament la coherència i precisió del missatge original en anglès.



Evaluating translations:  64%|██████▍   | 321/500 [6:56:13<6:27:59, 130.05s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: prou robins  ⁇  biblioteques no sabria el que és l'estalvi, fins i tot si els va colpejar a la cara!  ⁇   ⁇   ⁇ 
Reason: La traducció està mal construïda gramaticalment. S’han traduït paraules individuals sense tenir present les estructures gramaticals pròpies del català. Per exemple, "prou" hauria d’anar seguit pel nom o un adjectiu, com ara "robins". En general, cal evitar traduccions literals quan es tracta de frases complexes perquè poden provocar errors gramaticals.



Evaluating translations:  64%|██████▍   | 322/500 [6:57:19<5:28:59, 110.89s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  65%|██████▍   | 323/500 [6:58:26<4:48:25, 97.77s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  65%|██████▍   | 324/500 [6:59:40<4:25:35, 90.54s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  65%|██████▌   | 325/500 [7:00:58<4:12:57, 86.73s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  65%|██████▌   | 326/500 [7:03:58<5:32:59, 114.83s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: els liberals són animals fastigosos
Reason: La traducció pot ser de mala qualitat perquè conté un llenguatge ofensiu que fa referència als liberals com "animals fastigosos". Aquest tipus d’expressió pot resultar ofensiva o denigrant per a algunes persones i cultures, especialment quan es refereix a un grup polític o ideològic específic. És important tenir cura en fer traduccions per evitar expressions ofensives i garantir la sensibilitat cap a diferents grups culturals i lingüístics. En aquest cas concret, seria més adequat utilitzar un terme menys ofensiu o fins



Evaluating translations:  65%|██████▌   | 327/500 [7:05:08<4:52:29, 101.44s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  66%|██████▌   | 328/500 [7:07:45<5:37:50, 117.85s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Això definitivament fa mal
Reason: La traducció pot contenir errors gramaticals o sintàctics que afecten la comprensió del missatge original. En aquest cas, "definitivament" es tradueix com "totalment", però això crea una frase incoherent gramaticalment parlant. Per tant, la traducció hauria de dir alguna cosa així com "Això sens dubte fa mal".



Evaluating translations:  66%|██████▌   | 329/500 [7:08:55<4:55:28, 103.67s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  66%|██████▌   | 330/500 [7:10:02<4:22:38, 92.70s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  66%|██████▌   | 331/500 [7:12:34<5:10:37, 110.28s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Tots els cabrons de tot arreu. Apesta que sigui tan fàcil d'utilitzar per a moltes altres coses.
Reason: La traducció té errors gramaticals com ara l'ús incorrecte dels pronoms personals o la manca de concordança entre gèneres. Aquests errors podrien fer que la traducció fos difícil de comprendre pels parlants nadius catalans.



Evaluating translations:  66%|██████▋   | 332/500 [7:13:42<4:33:27, 97.66s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  67%|██████▋   | 333/500 [7:14:50<4:07:12, 88.81s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  67%|██████▋   | 334/500 [7:16:00<3:50:20, 83.26s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  67%|██████▋   | 335/500 [7:18:35<4:48:08, 104.78s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Odio, fotre, [NOM]
Reason: Possible raó de la mala qualitat: El text conté paraules grolleres que podrien ferir els sentiments d’algú o resultar ofensives. Aquestes expressions són inadequades per a un llenguatge escrit formal i pot ser preferible evitar-les si es vol mantenir un nivell adequat de professionalitat.



Evaluating translations:  67%|██████▋   | 336/500 [7:19:46<4:18:51, 94.71s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  67%|██████▋   | 337/500 [7:20:53<3:54:22, 86.28s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  68%|██████▊   | 338/500 [7:23:16<4:38:28, 103.14s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Va publicar sobre racistes blancs en una publicació que no té res a veure amb el racisme. Bastant segur que és simplement racista.
Reason: La traducció pot tenir errors gramaticals o sintàctics, la qual cosa afecta negativament la comprensió del missatge original.



Evaluating translations:  68%|██████▊   | 339/500 [7:24:22<4:07:30, 92.24s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  68%|██████▊   | 340/500 [7:26:53<4:52:49, 109.81s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: "Els seus intents de ser ""atrevida"" cauen en sac trencat són increïblement cringeworthy."
Reason: La traducció pot tenir errors gramaticals o sintàctics que afecten la comprensió del missatge original. En aquest cas concret, hi ha un error gramatical en l'ús dels signes d'interrogació i exclamació.



Evaluating translations:  68%|██████▊   | 341/500 [7:28:02<4:18:44, 97.64s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  68%|██████▊   | 342/500 [7:30:32<4:58:15, 113.27s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Discarded Translation:
Text: Pobre home:(
Reason: Possible problema de traducció: la frase "Poor man:" pot haver estat traduïda literalment des de l'anglès sense tenir en compte les regles gramaticals o expressions equivalents en català. Això resultaria en una traducció poc natural o incorrecta gramaticalment.



Evaluating translations:  69%|██████▊   | 343/500 [7:31:40<4:20:37, 99.60s/it] Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  69%|██████▉   | 344/500 [7:32:48<3:54:10, 90.07s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  69%|██████▉   | 345/500 [7:33:57<3:36:29, 83.81s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  69%|██████▉   | 346/500 [7:35:02<3:20:32, 78.13s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  69%|██████▉   | 347/500 [7:36:11<3:12:26, 75.47s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  70%|██████▉   | 348/500 [7:37:20<3:06:15, 73.52s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  70%|██████▉   | 349/500 [7:38:29<3:01:33, 72.15s/it]


Discarded Translation:
Text: LONDRES Sincerament ni tan sols entenc això
Reason: La traducció pot tenir errors gramaticals o sintàctics que afecten la comprensió del missatge original. També hi ha faltes d'ortografia i possibles ambigüitats en la interpretació de les paraules utilitzades. En general, la traducció sembla estar feta sense gaire atenció als detalls, cosa que afecta negativament la claredat i precisió del missatge.



Evaluating translations:  79%|███████▉  | 397/500 [8:34:31<2:31:01, 87.98s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  80%|███████▉  | 398/500 [8:35:37<2:18:17, 81.35s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  80%|███████▉  | 399/500 [8:36:43<2:09:31, 76.94s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  80%|████████  | 400/500 [8:37:51<2:03:25, 74.05s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  80%|████████  | 401/500 [8:38:59<1:59:19, 72.31s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  80%|████████  | 402/500 [8:40:08<1:56:16, 71.19s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Evaluating translations:  81%|████████  | 403/500 [8:41:13<1:52:29, 69.59s/it]


Discarded Translation:
Text: Deixar de dir això? Fa que la meva panxa es molesti  ⁇ 
Reason: La traducció pot tenir errors gramaticals o sintàctics, cosa que afecta negativament la comprensió del missatge original.





Discarded Translation:
Text: FBI!! OBRIR!!!
Reason: Puc dir que aquest text té un error ortogràfic ("FBI" hauria d’escriure's "FBI!!"), però no tinc prou informació sobre la llengua original ni el context per determinar si es tracta d'una mala traducció o no. Per tant, només puc donar una resposta general sense especificar els motius concrets de la baixa qualitat.






Discarded Translation:
Text: El repte doble gos.
Reason: La traducció pot tenir errors gramaticals o sintàctics que afecten la claredat del missatge original. En aquest cas, sembla haver-hi un error ortogràfic que afecta la comprensió del terme "doble" en relació al concepte de desafiament. Això es deu probablement a l’ús inadequat d’accents diacrítics en paraules compostes com ara "doble". Per tant, la traducció hauria estat més precisa si hagués utilitzat correctament els accents diacrítics






Discarded Translation:
Text: Emo!!
Reason: Possible raó de mala qualitat: la paraula "Emo" pot tenir diferents significats segons el context; aquí sembla que es refereix a un estil musical o subcultura, però sense més informació sobre el context, és difícil determinar si aquesta traducció és correcta.





Discarded Translation:
Text: Els seus controls de Twitter també són clarament un maleït idiota.
Reason: La traducció pot tenir errors gramaticals o sintàctics que afecten la comprensió del missatge original. En aquest cas concret, hi ha faltes d’ortografia com ara "maleït" que podrien haver estat mal traduïdes des de l'anglès. També es poden observar algunes paraules manllevades sense adaptar adequadament les formes gramaticals catalanes corresponents. Per tant, aquesta traducció té una puntuació baixa segons els criteris esmentats anteriorment.






Discarded Translation:
Text: sembla immadura, jugant a jocs mentals. Jo seria directe i si encara segueix jugant a deixar-la anar. sembla un cercador d'atenció.
Reason: La traducció sembla tenir errors gramaticals o sintàctics que la fan difícil de comprendre, especialment perquè hi ha paraules que podrien traduir-se millor. Per exemple, "immadur" es pot substituir per "juvenil", ja que l'edat no té res a veure amb els aspectes psicològics descrits; també caldria revisar les expressions com ara "deixar-la anar". En general, cal millorar la precisió lingüística i evitar traduccions literals poc naturals.






Discarded Translation:
Text: Tenint en compte el cul de cavall que heu fet de l'últim, crec que és millor deixar aquesta idea en un segon pla.
Reason: La traducció pot ser considerada com de mala qualitat perquè conté errors gramaticals i sintàctics. Per exemple, "cul" hauria d'estar escrit sense accent, ja que es refereix a la part posterior del cos i no té cap altra funció gramatical rellevant en aquest context. També hi ha faltes d'ortografia com ara "de", que s'hauria d'escriure "del". Finalment, la frase "crec que és millor deixar aquesta idea en un segon pla





Discarded Translation:
Text: Mèxic o els EUA?
Reason: La traducció pot tenir errors gramaticals o sintàctics que afecten la comprensió del missatge original. En aquest cas concret, es tracta d'una falta d'ortografia ("Mèxic" hauria de portar tilde), però també hi ha altres possibles errors. Per tant, la traducció té una puntuació global de 3/5.






Discarded Translation:
Text: [NOM] és un imbècil. Per què treure la pilota al seu millor jugador i donar possiblement el millor RB en la nació l'oportunitat d'executar-lo.
Reason: La traducció pot tenir errors gramaticals o sintàctics que afecten la comprensió del missatge original. També hi ha faltes d'ortografia i possibles ambigüitats en les expressions utilitzades. En general, la traducció sembla estar més pensada per a comunicar ràpidament i eficaçment el significat bàsic sense prestar molta atenció als detalls lingüístics.





Discarded Translation:
Text: No només això, els controls millorats no són res en absolut. No estic re-comprant Blood Money per a gràfics més bonics.
Reason: La traducció pot tenir errors gramaticals o sintàctics que afecten la comprensió del missatge original. En aquest cas concret, es poden trobar faltes d’ortografia com "re" en lloc de "per", cosa que altera el significat de l’enunciat.






Discarded Translation:
Text: també un dur NOOOO quan es disgusta
Reason: Possible raó de la mala qualitat: El text conté paraules mal escrites o faltes d’ortografia que podrien dificultar la comprensió. En aquest cas, "NOOO" sembla estar mal escrit i pot causar confusió sobre si aquesta paraula forma part de l’expressió originalment escrita en anglès.






Discarded Translation:
Text: És lletja, així que això podria afegir alguns mesos més.
Reason: La frase "És lletja" probablement es refereix a alguna cosa física o visualment desagradable. Aquesta expressió pot tenir connotacions negatives i afectar la percepció d’una persona sobre un objecte o situació concret. En aquest cas, l'ús d'"afegir uns quants mesos més" sembla un intent de suavitzar les conseqüències negatives associades a la descripció inicial ("lletja"). No obstant això, aquesta expressió també té una certa quantitat de sarcasme






Discarded Translation:
Text: Necessitàvem alguna manera de torturar un tipus per poder crear una religió.
Reason: La traducció pot tenir errors gramaticals o sintàctics que afecten la comprensió del missatge original. També pot haver-hi faltes d'ortografia o paraules mal traduïdes que afectin l'exactitud de la traducció. En general, sembla que hi ha hagut poca atenció als detalls durant el procés de traducció.



Evaluating translations: 100%|██████████| 500/500 [10:52:08<00:00, 78.26s/it]

Removing 50 bad translations from dataset

Saved 11,841 cleaned samples to 'goemotions.csv'

Full pipeline finished.





## 4. Create the Final Dataset

The final step generates the training, validation, and test splits. The previously processed datasets (`cassa.csv`, `goemotions.csv`, and `guiacat.csv`) are concatenated and shuffled with a fixed random seed to remove any ordering bias. The data is then divided into three subsets: 80% for training, 10% for validation, and 10% for testing. The split is purely random rather than stratified by label, so class proportions can differ slightly between subsets. Finally, the resulting splits are stored as separate CSV files named `train.csv`, `validation.csv`, and `test.csv`.
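
As a quick sanity check on the 80/10/10 proportions, the sketch below works out the split sizes for a two-stage split: hold out 20% of the rows, then halve the held-out part. The helper `split_sizes` is hypothetical and not part of the pipeline; it assumes the held-out count is rounded up at each stage, which is how scikit-learn's `train_test_split` is understood to treat a fractional `test_size`.

```python
import math


def split_sizes(n_rows: int, holdout_frac: float = 0.2, test_frac_of_holdout: float = 0.5):
    """Return (train, validation, test) sizes for a two-stage random split.

    Assumption: the held-out count is rounded up at each stage.
    """
    # Stage 1: rows held out from training (validation + test together).
    holdout = math.ceil(n_rows * holdout_frac)
    # Stage 2: the held-out rows are divided between validation and test.
    test = math.ceil(holdout * test_frac_of_holdout)
    validation = holdout - test
    train = n_rows - holdout
    return train, validation, test


# 23,788 is the combined row count reported in the output of the cell below.
print(split_sizes(23_788))
```

Under that rounding assumption, the sizes for 23,788 rows agree with the row counts printed at the end of this section.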

In [None]:
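import subprocess
import sys

import pandas as pd

try:
    from sklearn.model_selection import train_test_split
except ImportError:
    # Install with the interpreter running this notebook so the package
    # lands in the active environment.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "scikit-learn"])
    from sklearn.model_selection import train_test_split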

INPUT_FILES = ["cassa.csv", "goemotions.csv", "guiacat.csv"]
OUTPUT_FILES = {
    "train": "train.csv",
    "validation": "validation.csv",
    "test": "test.csv"
}
RANDOM_STATE = 42

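def print_distribution(df: pd.DataFrame, name: str) -> None:
    """Print the row count and the percentage of each sentiment label in `df`."""
    total_rows = len(df)
    # Relative frequency of each label, expressed as a percentage.
    label_dist = df['label'].value_counts(normalize=True) * 100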
    
    pos_pct = label_dist.get('positive', 0)
    neg_pct = label_dist.get('negative', 0)
    neu_pct = label_dist.get('neutral', 0)
    
    print(f"'{name}' ({total_rows:,} rows):")
    print(f"  Distribution: {pos_pct:.1f}% Positive, {neg_pct:.1f}% Negative, {neu_pct:.1f}% Neutral")

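def combine_and_split_datasets() -> None:
    """Concatenate the input CSVs, shuffle them, and write 80/10/10 train/validation/test splits."""
    dataframes = []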
    print("Reading input files:")
    for file in INPUT_FILES:
        try:
            df = pd.read_csv(file)
            dataframes.append(df)
            print(f"  - Loaded '{file}' with {len(df):,} rows.")
        except FileNotFoundError:
            print(f"  - Warning: '{file}' not found. Skipping.")
    
    if not dataframes:
        print("\nError: No data files found. Aborting.")
        return

    combined_df = pd.concat(dataframes, ignore_index=True)
    print(f"\nCombined dataset has a total of {len(combined_df):,} rows.")

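    # Shuffle once with a fixed seed (train_test_split also shuffles by default).
    shuffled_df = combined_df.sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)
    print("Dataset shuffled successfully.")

    # Stage 1: keep 80% for training and hold out the remaining 20%.
    train_df, temp_df = train_test_split(
        shuffled_df, test_size=0.2, random_state=RANDOM_STATE
    )

    # Stage 2: halve the held-out rows into validation and test (10% each overall).
    validation_df, test_df = train_test_split(
        temp_df, test_size=0.5, random_state=RANDOM_STATE
    )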
    
    print("\nSaving final CSV files")
    train_df.to_csv(OUTPUT_FILES["train"], index=False)
    validation_df.to_csv(OUTPUT_FILES["validation"], index=False)
    test_df.to_csv(OUTPUT_FILES["test"], index=False)

    print("\nProcess Complete")
    print_distribution(train_df, "train.csv")
    print_distribution(validation_df, "validation.csv")
    print_distribution(test_df, "test.csv")

if __name__ == "__main__":
    combine_and_split_datasets()

Reading input files:
  - Loaded 'cassa.csv' with 6,400 rows.
  - Loaded 'goemotions.csv' with 11,638 rows.
  - Loaded 'guiacat.csv' with 5,750 rows.

Combined dataset has a total of 23,788 rows.
Dataset shuffled successfully.

Saving final CSV files

Process Complete
'train.csv' (19,030 rows):
  Distribution: 40.5% Positive, 30.1% Negative, 29.4% Neutral
'validation.csv' (2,379 rows):
  Distribution: 38.1% Positive, 30.4% Negative, 31.4% Neutral
'test.csv' (2,379 rows):
  Distribution: 42.1% Positive, 28.2% Negative, 29.8% Neutral
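
The positive share in the output above ranges from 38.1% (validation) to 42.1% (test), which is the kind of drift a purely random split can produce. If closer class balance across the three files were desired, one option is the `stratify` argument of `train_test_split`. The sketch below is a possible variant, not the method used to produce the files above; `toy_df` is made-up data standing in for the combined dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the combined dataset (hypothetical values).
toy_df = pd.DataFrame({
    "text": [f"sample {i}" for i in range(100)],
    "label": ["positive"] * 40 + ["negative"] * 30 + ["neutral"] * 30,
})

# Passing the label column to `stratify` keeps class proportions
# (approximately) equal in both parts of each split.
train_df, temp_df = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["label"]
)
validation_df, test_df = train_test_split(
    temp_df, test_size=0.5, random_state=42, stratify=temp_df["label"]
)

for name, part in [("train", train_df), ("validation", validation_df), ("test", test_df)]:
    print(name, len(part), part["label"].value_counts().to_dict())
```

Stratifying needs at least two rows per class at each stage, so very rare labels would have to be handled separately.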
