In [1]:
import torch

if torch.cuda.is_available():
    print("GPU is available")
else:
    print("GPU is not available")

GPU is available


# Classificador de Sentiments a Xarxes Socials en Català (CSXSC): Dataset

**Author:** Daniel Arias Cámara  
**Date:** 25-07-2025  

**Description:**  This notebook aims to build a high-quality dataset for fine-tuning the **CSXSC** model. The dataset is constructed by combining trusted data sources, including structured sentiment corpora and translated social media content. Details on data origin and preprocessing steps are provided in the sections below.


## 1. GuiaCat Dataset

**Description:** This dataset consists of 5,750 restaurant reviews in Catalan, sourced from the GuiaCat platform. Each review includes individual ratings for service, food, price-quality ratio, and atmosphere, along with an overall average score.

**Access:** [projecte-aina/GuiaCat on Hugging Face](https://huggingface.co/datasets/projecte-aina/GuiaCat)

**Source:** Aina Project

**Notes:**  
The dataset is divided into three subsets:  
- **Train:** 4,750 rows  
- **Validation:** 500 rows  
- **Test:** 500 rows  

The original fields are: Service, Food, Price-quality, Environment, Avg, Text, and Label.  
Only the `text` and `label` fields are kept; the remaining fields are discarded.

The Label field includes five sentiment categories:  
- Molt bo (Very good)  
- Bo (Good)  
- Regular (Average)  
- Dolent (Bad)  
- Molt dolent (Very bad)

These are grouped into three classes for sentiment classification:  
- **Positive:** Molt bo and Bo  
- **Neutral:** Regular  
- **Negative:** Dolent and Molt dolent
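
The grouping above can also be written as a lookup table. This is a minimal sketch, not the code used in this notebook: the names `LABEL_TO_SENTIMENT` and `to_sentiment` are hypothetical, and it assumes the labels arrive as plain strings. Unlike a silent fallback, it raises an error on any label outside the five listed categories.

```python
# Hypothetical lookup table mirroring the grouping described above.
LABEL_TO_SENTIMENT = {
    "molt bo": "positive",
    "bo": "positive",
    "regular": "neutral",
    "dolent": "negative",
    "molt dolent": "negative",
}


def to_sentiment(label: str) -> str:
    """Map an original GuiaCat label to one of the three sentiment classes."""
    key = label.strip().lower()
    if key not in LABEL_TO_SENTIMENT:
        # Fail loudly rather than letting an unmapped label slip through.
        raise ValueError(f"Unexpected label: {label!r}")
    return LABEL_TO_SENTIMENT[key]


print(to_sentiment("Molt bo"))
print(to_sentiment("Regular"))
```

Raising on unknown labels is a design choice for this sketch; it makes upstream changes to the label set visible immediately.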


In [5]:
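import os
import subprocess
import sys

try:
    import datasets
except ImportError:
    # Install with the interpreter running this notebook so the package
    # lands in the active environment.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "datasets"])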

from datasets import load_dataset

ds_guiacat = load_dataset("projecte-aina/GuiaCat")
ds = {}

def relabel(opinion):
    """Collapse the five GuiaCat labels into positive / neutral / negative."""
    label = opinion["label"].lower()
    if label in ["molt bo", "bo"]:
        opinion["label"] = "positive"
    elif label == "regular":
        opinion["label"] = "neutral"
    elif label in ["dolent", "molt dolent"]:
        opinion["label"] = "negative"
    return opinion

for split in ds_guiacat:
    # Keep only the text and label columns, then apply the three-class mapping.
    drop_columns = [col for col in ds_guiacat[split].column_names if col not in ["text", "label"]]
    ds[split] = ds_guiacat[split].remove_columns(drop_columns).map(relabel)

# Output directory for each split (validation data is written to "validate").
output_dirs = {
    "train": "train",
    "validation": "validate",
    "test": "test"
}

for split, out_dir in output_dirs.items():
    os.makedirs(out_dir, exist_ok=True)
    output_path = os.path.join(out_dir, "guiacat.csv")
    ds[split].to_csv(output_path, index=False)

print("Train:", ds["train"][0])
print("Validation:", ds["validation"][0])
print("Test:", ds["test"][0])

Creating CSV from Arrow format: 100%|██████████| 5/5 [00:00<00:00, 316.32ba/s]
Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 401.64ba/s]
Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 424.27ba/s]

Train: {'text': 'El lloc és acollidor. El tracte, familiar. Els plats són casolans, abundants i de qualitat! Productes de la terra com embutits i altres de la zona. Hi tornarem segur!', 'label': 'positive'}
Validation: {'text': 'Bon Menjar i bon tracte en un restaurant que segú hi tornaràs un altre vegada.', 'label': 'positive'}
Test: {'text': "Fantàstic restaurant ,una carta plena de plats creatius i cuina de temporada que sempre es d'agrair , i el que varem menjar nosaltres molt bé, estic segura que tornaré no ho dubtaré.El tracte bó i l'espai molt acollidor.", 'label': 'positive'}
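
All three sample rows printed above are labeled positive, so it may be worth checking how the classes are distributed after relabeling. The following is a minimal sketch under stated assumptions: `label_distribution` and `EXPECTED_CLASSES` are hypothetical names, and the rows in `sample` are made-up stand-ins, not GuiaCat data.

```python
from collections import Counter

# The three classes produced by the relabeling step.
EXPECTED_CLASSES = {"positive", "neutral", "negative"}


def label_distribution(rows):
    """Count labels and reject anything outside the three expected classes."""
    counts = Counter(row["label"] for row in rows)
    unexpected = set(counts) - EXPECTED_CLASSES
    if unexpected:
        raise ValueError(f"Unexpected labels: {sorted(unexpected)}")
    return counts


# Toy rows standing in for a relabeled split (not real GuiaCat data).
sample = [
    {"text": "Molt bon menjar.", "label": "positive"},
    {"text": "Correcte, sense més.", "label": "neutral"},
    {"text": "Servei molt lent.", "label": "negative"},
    {"text": "Hi tornarem!", "label": "positive"},
]
print(label_distribution(sample))
```

Because iterating over a `datasets.Dataset` yields one dictionary per row, the same function should also accept a split directly, for example `label_distribution(ds["train"])`.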



