This code creates and prepares a dataset in the format required by the function that fine-tunes the model 'kevinscaria/joint_tk-instruct-base-def-pos-neg-neut-combined'.

An example dataset is shown below.


| raw_text                                         | aspectTerms                                                                 |
|--------------------------------------------------|------------------------------------------------------------------------------|
| The cab ride was amazing but the service was pricey | [{'term': 'cab ride', 'polarity': 'positive'}, {'term': 'service', 'polarity': 'negative'}] |
| I ordered the Barbeque Pizza                     | [{'term': 'noaspectterm', 'polarity': 'none'}]                               |


The dataset used to fine-tune the model is labeled by a powerful LLM. For better results, the dataset should be labeled by a human or by a more capable LLM.
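
As a minimal sketch of the target format, the helper below builds the `aspectTerms` value for one review and falls back to the `noaspectterm` placeholder from the table when no aspect is found. The name `to_aspect_terms` is hypothetical and is not part of the fine-tuning code.

```python
from typing import Dict, List, Tuple

# Placeholder entry used in the example table for reviews without aspects.
NO_ASPECT = {"term": "noaspectterm", "polarity": "none"}


def to_aspect_terms(pairs: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build the aspectTerms list for one review (hypothetical helper)."""
    if not pairs:
        return [dict(NO_ASPECT)]
    return [{"term": term, "polarity": polarity} for term, polarity in pairs]


row = {
    "raw_text": "The cab ride was amazing but the service was pricey",
    "aspectTerms": to_aspect_terms([("cab ride", "positive"), ("service", "negative")]),
}
empty_row = {
    "raw_text": "I ordered the Barbeque Pizza",
    "aspectTerms": to_aspect_terms([]),
}
print(row["aspectTerms"])
print(empty_row["aspectTerms"])
```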

In [1]:
import pandas as pd
import os
from langchain_ollama.llms import OllamaLLM
from langchain.prompts import ChatPromptTemplate
import json
from typing import List, Dict


In [40]:
#==
#Loading data
#==
path = "c:\\Users\\jacop\\Desktop\\Lavori\\Consigl_IA_mi-\\data\\raw\\Barcelona_reviews.csv"
raw_data = pd.read_csv(path)

#==
# Creating a sample with a fixed class mix (600 positive, 400 negative) to be labeled
#==

Neg = 400
Pos = 600


# Sample from each class
positive_sample = raw_data[raw_data['sample'] == 'Positive'].sample(n=Pos, random_state=55)
negative_sample = raw_data[raw_data['sample'] == 'Negative'].sample(n=Neg, random_state=55)

# Combine both samples into one dataset and shuffle the rows
balanced_sample = pd.concat([positive_sample, negative_sample]).sample(frac=1, random_state=42).reset_index(drop=True)

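# Keep only the columns needed for labeling; .copy() avoids pandas'
# SettingWithCopyWarning when a new column is added later.
variables = ['restaurant_name', 'review_full']
tuning_data = balanced_sample[variables].copy()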





In [41]:
PROMPT_TEMPLATE = '''
You are a Natural Language Processing (NLP) expert. Your task is to perform Aspect-Based Sentiment Analysis (ABSA) on the following food review.

Identify all aspects mentioned or implied in the text (both explicit and implicit), and assign a sentiment polarity to each one. 
Aspects should refer to specific elements of the restaurant experience such as food, service, waiting time, prices, atmosphere, etc.

Return the result as a **single-line JSON-style list of dictionaries, without line breaks or indentation**:
- Each element must have two keys: "term" and "polarity"
- Polarity must be one of: ["positive", "negative", "neutral", "mixed", "none"]
- Example: `[{{"term": "pizza", "polarity": "positive"}}, {{"term": "service", "polarity": "negative"}}]`
- Use lowercase for all aspect terms
- Only include aspects mentioned or implied in the review
- Group synonymous or repeated terms under the same aspect (e.g., "wait time" instead of both "delay" and "waiting", or "service" instead of both "staff" and "crew")

⚠️ ATTENTION:
- Do not add line breaks, indentation, or extra spaces.
- Do not include reasoning or explanations.
- The output must be a single line inside square brackets, exactly as shown in the examples.

---

Examples:

Input:
"Pizza was very good but the service was quite terrible. Prices were honest."

Output:
[{{"term": "pizza", "polarity": "positive"}}, {{"term": "service", "polarity": "negative"}}, {{"term": "price", "polarity": "positive"}}]

Input:
"We waited over an hour despite the promise of a 20-minute wait. Once seated, the pasta was undercooked and the waiter ignored us."

Output:
[{{"term": "wait time", "polarity": "negative"}}, {{"term": "pasta", "polarity": "negative"}}, {{"term": "service", "polarity": "negative"}}]

Input:
"Empanadas were average but the outdoor patio was beautiful and relaxing. Fair prices too."

Output:
[{{"term": "empanadas", "polarity": "neutral"}}, {{"term": "atmosphere", "polarity": "positive"}}, {{"term": "price", "polarity": "positive"}}]

Input:
"Very nice place near Colosseum"

Output:
[{{"term": "location", "polarity": "positive"}}]

---

Now analyze the following review and return the JSON-style list of dictionaries in a single line:

{food_review}
'''
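
The doubled braces in the template are needed because the template uses Python-style placeholders: `{{` and `}}` render as literal braces, while `{food_review}` is substituted. The sketch below shows the same rule with plain `str.format`, assuming `ChatPromptTemplate` is used with its default f-string template format. The shortened template `MINI_TEMPLATE` is hypothetical.

```python
# Shortened stand-in for PROMPT_TEMPLATE (hypothetical, for illustration only).
MINI_TEMPLATE = (
    'Example: [{{"term": "pizza", "polarity": "positive"}}]\n'
    'Review: {food_review}'
)

# Doubled braces become single literal braces; {food_review} is filled in.
rendered = MINI_TEMPLATE.format(food_review="Great tapas, slow service.")
print(rendered)
```

With single braces around the JSON example, formatting would fail because `"term"` would be read as a placeholder name.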


In [23]:
#Testing the model on a single review

path = "c:\\Users\\jacop\\Desktop\\Lavori\\Consigl_IA_mi-\\data\\raw\\Barcelona_reviews.csv"
raw_data = pd.read_csv(path)

row = raw_data.iloc[24960]['review_full']

####### Testing if it works #######
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(food_review = row)

model = OllamaLLM(model="gemma3:12b")

response_text = model.invoke(prompt)
print(response_text)
#####################################



[{"term": "location", "polarity": "positive"}, {"term": "atmosphere", "polarity": "positive"}, {"term": "food", "polarity": "positive"}, {"term": "portion size", "polarity": "negative"}, {"term": "price", "polarity": "negative"}, {"term": "fish", "polarity": "neutral"}, {"term": "pasta", "polarity": "positive"}]


In [26]:
type(json.loads(response_text))

list
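
`json.loads` succeeds here because the model returned bare JSON. Local models sometimes add a sentence or other text around the list, so a slightly more defensive parser may help. The following is a sketch under that assumption; `extract_json_list` is a hypothetical helper and not part of LangChain or Ollama.

```python
import json
import re
from typing import Any, List, Optional


def extract_json_list(text: str) -> Optional[List[Any]]:
    """Return the JSON list embedded in text, or None if nothing parses."""
    # Take the span from the first '[' to the last ']', ignoring text around it.
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None


wrapped = 'Here is the result: [{"term": "paella", "polarity": "negative"}] Hope this helps.'
print(extract_json_list(wrapped))
print(extract_json_list("no aspects here"))
```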

In [42]:
def clean_absa_output(parsed: List) -> List:
    """
    Clean ABSA model output (a list of dicts).

    Rules:
      - The first key of each dict is renamed to 'term'.
      - The second key of each dict is renamed to 'polarity'.
      - Missing keys are filled with defaults ('' and 'none').
    Non-list input is returned unchanged; non-dict entries are skipped.
    """
    if not isinstance(parsed, list):
        print("Warning: expected a list; returning the input unchanged.")
        return parsed  # nothing to clean

    cleaned = []

    for d in parsed:
        if not isinstance(d, dict):
            print("Warning: expected a dict; skipping invalid entry.")
            continue  # skip invalid entries

        keys = list(d.keys())
        values = list(d.values())

        # Fix first key
        if len(keys) >= 1:
            keys[0] = "term"
        # Fix second key
        if len(keys) >= 2:
            keys[1] = "polarity"

        # Rebuild dict with corrected keys
        corrected = {k: v for k, v in zip(keys, values)}

        # Ensure both keys exist, even if missing
        if "term" not in corrected:
            corrected["term"] = ""
        if "polarity" not in corrected:
            corrected["polarity"] = "none"

        cleaned.append(corrected)

    return cleaned


def absa(df: pd.DataFrame) -> pd.DataFrame:
    """Label every review in df with the LLM and store the result in 'ABSA_results'."""
    # Work on a copy so the caller's DataFrame (possibly a slice) is not modified
    df = df.copy()
    df['ABSA_results'] = None

    # The template does not change between rows, so build it once
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)

    for idx, row in df.iterrows():
        print(f'idx: {idx}')
        original_review = row['review_full']

        if idx % 10 == 0:
            print(f'Progress: {idx}/{len(df)}')

        prompt = prompt_template.format(food_review=original_review)

        # Call the LLM (the global `model` defined in an earlier cell) with the review
        llm_response = model.invoke(prompt)

        try:
            parsed = json.loads(llm_response)
        except (json.JSONDecodeError, TypeError):
            parsed = llm_response  # fallback if the model does not return valid JSON

        parsed = clean_absa_output(parsed)

        df.at[idx, 'ABSA_results'] = parsed

    # Save the labeled DataFrame
    df.to_csv("C:\\Users\\jacop\\Desktop\\Lavori\\InstructABSA_Fine_Tuning\\Dataset\\Data.csv", index=False)

    return df
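
Note that `to_csv` writes the `ABSA_results` lists as plain strings, so they come back as text when the file is reloaded. One option, sketched below with only the standard library, is to store each list as a JSON string and decode it on load. The helpers `encode_aspects` and `decode_aspects` are hypothetical names, and the sample row is made up for illustration.

```python
import csv
import io
import json
from typing import Dict, List


def encode_aspects(aspects: List[Dict[str, str]]) -> str:
    """Serialize one ABSA result as a JSON string for a CSV cell."""
    return json.dumps(aspects, ensure_ascii=False)


def decode_aspects(cell: str) -> List[Dict[str, str]]:
    """Parse a CSV cell written by encode_aspects back into a list of dicts."""
    return json.loads(cell)


aspects = [
    {"term": "paella", "polarity": "positive"},
    {"term": "wine", "polarity": "negative"},
]

# Write one row to an in-memory CSV, then read it back.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["review_full", "ABSA_results"])
writer.writeheader()
writer.writerow({
    "review_full": "Good paella, pricey wine.",
    "ABSA_results": encode_aspects(aspects),
})

buffer.seek(0)
rows = list(csv.DictReader(buffer))
restored = decode_aspects(rows[0]["ABSA_results"])
print(restored)
```

With pandas, the same idea would be to apply `encode_aspects` to the column before calling `to_csv` and to pass `converters={'ABSA_results': decode_aspects}` to `pd.read_csv` when loading, assuming every cell holds a valid list.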

In [44]:
df = absa(tuning_data)



idx: 0
Progress: 0/1000
idx: 1
idx: 2
idx: 3
idx: 4
idx: 5
idx: 6
idx: 7
idx: 8
idx: 9
idx: 10
Progress: 10/1000
...

In [52]:
df['review_full'].iloc[6]

'The location on the seaside promenade is nice, but it has scarce appeal to me, who was born and always lived in seaside cities. Good appetizers (see snails, clams, esqueixada, buñuelos de bacalao). But paella was disappointing, absolutely mediocre. I can make better paella at home. One bottle of wine shared among three, it came to 51 euros each. Actually I suspected that the place was very touristic, but my local friend assured me it was not. He was wrong.'

In [53]:
df['ABSA_results'].iloc[6]

[{'term': 'location', 'polarity': 'positive'},
 {'term': 'appetizers', 'polarity': 'positive'},
 {'term': 'paella', 'polarity': 'negative'},
 {'term': 'wine', 'polarity': 'negative'},
 {'term': 'touristic', 'polarity': 'negative'}]
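
A quick sanity check on the labels is to count polarities and flag values outside the set allowed by the prompt. The sketch below runs on the single result shown above. `ALLOWED` mirrors the polarity list in the prompt in lowercase (an assumption), and `summarize` is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple

# Polarity values the prompt allows, lowercased (assumption).
ALLOWED = {"positive", "negative", "neutral", "mixed", "none"}


def summarize(results: Iterable[Any]) -> Tuple[Counter, List[Dict[str, Any]]]:
    """Count polarities and collect entries whose polarity is not allowed."""
    counts: Counter = Counter()
    unexpected: List[Dict[str, Any]] = []
    for aspects in results:
        if not isinstance(aspects, list):
            continue  # e.g. a raw string kept when JSON parsing failed
        for item in aspects:
            if not isinstance(item, dict):
                continue
            polarity = str(item.get("polarity", "")).lower()
            counts[polarity] += 1
            if polarity not in ALLOWED:
                unexpected.append(item)
    return counts, unexpected


sample = [[
    {"term": "location", "polarity": "positive"},
    {"term": "appetizers", "polarity": "positive"},
    {"term": "paella", "polarity": "negative"},
    {"term": "wine", "polarity": "negative"},
    {"term": "touristic", "polarity": "negative"},
]]
counts, unexpected = summarize(sample)
print(counts)
print(unexpected)
```

On the full dataset, `summarize(df['ABSA_results'])` would give the overall label distribution; rows where the model did not return a list are skipped.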