# Marathi Sentiment Analysis: Model Comparison and Error Analysis

This notebook analyzes the results of fine-tuning multiple transformer models for Marathi sentiment classification. It performs the following steps:

1.  **Gathers Metrics**: Reads the `summary.json` file from the most recent training run of each model.
2.  **Compares Models**: Consolidates the baseline and fine-tuned metrics into a table and ranks the models by macro F1-score, with accuracy reported alongside.
3.  **Visualizes Results**: Creates bar charts comparing the fine-tuned macro F1-score and accuracy of the models.
4.  **Performs Error Analysis**: For the best model, it reloads the saved model, re-runs inference on the test split, and groups the misclassified examples with keyword heuristics to surface common error patterns (e.g., sarcasm, negation, ambiguity).
5.  **Generates README**: Creates a comprehensive `README.md` file with all the findings.

In [29]:
import pandas as pd
import json
from pathlib import Path
import plotly.express as px

# Define the path to the results directory
results_dir = Path(r"C:\LLM's_for_SA\results\notebook_run")

# Check if the directory exists
if not results_dir.exists():
    raise FileNotFoundError(f"The directory {results_dir} does not exist. Please make sure you have run the training notebooks.")

# Get all run directories, sorted by last-modified time (newest first)
run_dirs = sorted(results_dir.iterdir(), key=lambda f: f.stat().st_mtime, reverse=True)

latest_runs = {}
for run_dir in run_dirs:
    summary_path = run_dir / 'summary.json'
    if summary_path.exists():
        with open(summary_path, 'r', encoding='utf-8') as f:
            try:
                summary = json.load(f)
                model_name = summary.get('model')
                if model_name and model_name not in latest_runs:
                    latest_runs[model_name] = run_dir
            except json.JSONDecodeError:
                print(f"Skipping invalid JSON file: {summary_path}")

print(f"Found {len(latest_runs)} unique models:")
for model_name, run_dir in latest_runs.items():
    print(f"- {model_name}: {run_dir.name}")

Found 5 unique models:
- xlm-roberta-base: 20250917_203037
- l3cube-pune/marathi-albert-v2: 20250917_180532
- l3cube-pune/marathi-bert: 20250917_173302
- bert-base-multilingual-cased: 20250917_162312
- google/muril-base-cased: 20250917_142448
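
The selection logic in the cell above (keep only the newest run per model, skip unreadable summaries) can be sketched as a small reusable function. This is a minimal sketch, not the notebook's own code: the helper names `latest_run_per_model` and `_make_run`, and the temporary directory layout used to exercise it, are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def latest_run_per_model(results_dir: Path) -> dict:
    """Map each model name to its most recently modified run directory."""
    latest = {}
    run_dirs = sorted(
        (p for p in results_dir.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first, so the first hit per model wins
    )
    for run_dir in run_dirs:
        summary_path = run_dir / "summary.json"
        if not summary_path.exists():
            continue
        try:
            summary = json.loads(summary_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # skip corrupt summaries instead of failing
        model = summary.get("model")
        if model and model not in latest:
            latest[model] = run_dir
    return latest


def _make_run(root: Path, name: str, model: str, mtime: int) -> None:
    """Create a fake run directory with a fixed modification time (demo only)."""
    run = root / name
    run.mkdir()
    (run / "summary.json").write_text(json.dumps({"model": model}), encoding="utf-8")
    os.utime(run, (mtime, mtime))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    _make_run(root, "run_old", "model-a", 1_000)
    _make_run(root, "run_new", "model-a", 2_000)
    _make_run(root, "run_b", "model-b", 1_500)
    demo = {model: path.name for model, path in latest_run_per_model(root).items()}

print(demo)
```

Because the run directories here are named with timestamps, sorting by directory name would likely give the same order as sorting by modification time, and it would not change if a directory were touched later.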


In [30]:
records = []
for model_name, run_dir in latest_runs.items():
    summary_path = run_dir / 'summary.json'
    with open(summary_path, 'r', encoding='utf-8') as f:
        summary = json.load(f)
        
        # Extract baseline and finetuned metrics
        baseline_metrics = summary.get('baseline', {})
        finetuned_metrics = summary.get('finetuned', {})
        
        records.append({
            'Model': model_name,
            'Baseline Accuracy': baseline_metrics.get('accuracy'),
            'Baseline F1': baseline_metrics.get('f1_macro'),
            'Finetuned Accuracy': finetuned_metrics.get('accuracy'),
            'Finetuned F1': finetuned_metrics.get('f1_macro'),
            'Run Directory': run_dir.name
        })

# Create a DataFrame and display it
df_results = pd.DataFrame(records).sort_values(by='Finetuned F1', ascending=False).reset_index(drop=True)
df_results

index,Model,Baseline Accuracy,Baseline F1,Finetuned Accuracy,Finetuned F1,Run Directory
0,google/muril-base-cased,0.333333,0.166667,0.794333,0.79255,20250917_142448
1,l3cube-pune/marathi-albert-v2,0.333333,0.166667,0.780333,0.779034,20250917_180532
2,l3cube-pune/marathi-bert,0.331,0.168996,0.778333,0.776411,20250917_173302
3,xlm-roberta-base,0.333333,0.166667,0.771667,0.769817,20250917_203037
4,bert-base-multilingual-cased,0.344333,0.256454,0.742,0.740075,20250917_162312
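
Four of the five baseline rows report an accuracy of 0.333333 and a macro F1 of 0.166667. One plausible reading, assuming the 3000-example test set is balanced across the three classes, is that the untrained classification head predicts a single class for every input. The sketch below works through that arithmetic with a hand-written macro F1; the `macro_f1` helper and the synthetic labels are illustrative assumptions, not the metric code used by the training notebooks.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no predictions or hits scores 0."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


labels = [0, 1, 2]
# Assumed balanced test set: 1000 examples per class.
y_true = [0] * 1000 + [1] * 1000 + [2] * 1000
# A degenerate classifier that always predicts class 1.
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = macro_f1(y_true, y_pred, labels)
print(round(accuracy, 6), round(f1, 6))  # 0.333333 0.166667
```

The predicted class has precision 1/3 and recall 1, giving F1 = 0.5, while the other two classes score 0, so the macro average is 0.5 / 3.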


In [31]:
# Visualize the results
fig_f1 = px.bar(df_results, x='Model', y='Finetuned F1', title='Finetuned F1 Score Comparison',
                text_auto='.4f', color='Model',
                labels={'Model': 'Model', 'Finetuned F1': 'F1 Score (Macro)'})
fig_f1.update_traces(textangle=0, textposition='outside')
fig_f1.update_layout(
    xaxis_tickangle=-45,
    margin=dict(b=150),  # Increase bottom margin further
    height=800,          # Increase plot height
    width=800           # Increase plot width
)
fig_f1.show()

fig_acc = px.bar(df_results, x='Model', y='Finetuned Accuracy', title='Finetuned Accuracy Comparison',
                 text_auto='.4f', color='Model',
                 labels={'Model': 'Model', 'Finetuned Accuracy': 'Accuracy'})
fig_acc.update_traces(textangle=0, textposition='outside')
fig_acc.update_layout(
    xaxis_tickangle=-50,
    margin=dict(b=150),  # Increase bottom margin further
    height=800,          # Increase plot height
    width=800           # Increase plot width
)
fig_acc.show()

In [32]:
# --- Error Analysis ---
# Identify the best model
best_model_info = df_results.iloc[0]
best_model_name = best_model_info['Model']
best_model_run_dir = results_dir / best_model_info['Run Directory']

print(f"Performing error analysis on the best model: {best_model_name}")
print(f"Run directory: {best_model_run_dir}")

# Paths for the model and test data
model_path = best_model_run_dir / 'model'
test_data_path = best_model_run_dir / 'splits' / 'test.csv'

# Check if model and data exist
if not model_path.exists() or not test_data_path.exists():
    raise FileNotFoundError(f"Could not find model at {model_path} or test data at {test_data_path}")

# We need to redefine the evaluation utilities and classes from the training notebook
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
import numpy as np

LABEL_ORDER = {'negative': 0, 'neutral': 1, 'positive': 2}
ID2LABEL = {v: k for k, v in LABEL_ORDER.items()}
MAX_LENGTH = 192 # Should match the training configuration

class TextClsDS(Dataset):
    def __init__(self, df, tokenizer, text_col='text', label_col='label_id', max_length=MAX_LENGTH):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.text_col = text_col
        self.label_col = label_col
        self.max_length = max_length
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        enc = self.tokenizer(str(r[self.text_col]), truncation=True, max_length=self.max_length)
        enc['labels'] = int(r[self.label_col])
        return enc

def eval_model_for_error_analysis(model, tokenizer, dataset, device, batch_size=32):
    coll = DataCollatorWithPadding(tokenizer=tokenizer)
    dl = DataLoader(dataset, batch_size=batch_size, collate_fn=coll)
    model.to(device)
    model.eval()
    ys, ps = [], []
    with torch.no_grad():
        for batch in dl:
            labels = batch.get('labels')
            inputs = {k: v.to(device) for k, v in batch.items() if k != 'labels'}
            logits = model(**inputs).logits.detach().cpu().numpy()
            if labels is None: continue
            ys.extend(labels.detach().cpu().numpy().tolist())
            ps.extend(np.argmax(logits, axis=-1).tolist())
    return np.array(ys), np.array(ps)

# Load the model and tokenizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Load the test data
df_test = pd.read_csv(test_data_path)
test_ds = TextClsDS(df_test, tokenizer)

# Get predictions
y_true, y_pred = eval_model_for_error_analysis(model, tokenizer, test_ds, device)

# Create a dataframe with errors
df_test['predicted_id'] = y_pred
df_test['predicted_label'] = df_test['predicted_id'].map(ID2LABEL)
df_test['true_label'] = df_test['label_id'].map(ID2LABEL)

df_errors = df_test[df_test['label_id'] != df_test['predicted_id']].copy()

print(f"Found {len(df_errors)} errors out of {len(df_test)} test samples.")

# Display some of the errors
df_errors[['text', 'true_label', 'predicted_label']].head(20)

Performing error analysis on the best model: google/muril-base-cased
Run directory: C:\LLM's_for_SA\results\notebook_run\20250917_142448
Found 617 errors out of 3000 test samples.


index,text,true_label,predicted_label
4,त्यांच्या दडपलेल्या भावना केवळ त्या पृष्ठभागाव...,neutral,negative
7,मला असे वाटले की आपण असे कोणीतरी आहात ज्याला म...,neutral,negative
13,मी गंभीर आहे.,neutral,negative
21,अरे आता तुला पॅड पाहिजे.,negative,neutral
24,कारण यामध्ये रोमॅंटिक काय आहे आणि कॉमेडी काय आ...,neutral,negative
37,खोलीभोवती त्याचा पाठलाग.,neutral,negative
52,या संकल्पनेसाठी समीर आणि त्याच्या टीमला हॅटस ऑफ,positive,neutral
53,घटस्फोटाची काळजी घेणार्‍या पिलो-टॉप राणीसाठी $...,negative,neutral
54,जाण्यासाठी मार्ग!,positive,neutral
55,दशावतार कलेबद्दल ती साकारणाऱ्यांबद्दल काहीही म...,negative,neutral
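
Before categorizing errors by keyword, it can help to count them by direction (true label to predicted label). The sketch below does this for the ten rows displayed above using only the standard library; the `rows` list is transcribed from that output. On the full error set, `pd.crosstab(df_errors['true_label'], df_errors['predicted_label'])` would give the same kind of breakdown.

```python
from collections import Counter

# (true_label, predicted_label) pairs for the ten displayed rows,
# in the order shown (test indices 4, 7, 13, 21, 24, 37, 52, 53, 54, 55).
rows = [
    ("neutral", "negative"),
    ("neutral", "negative"),
    ("neutral", "negative"),
    ("negative", "neutral"),
    ("neutral", "negative"),
    ("neutral", "negative"),
    ("positive", "neutral"),
    ("negative", "neutral"),
    ("positive", "neutral"),
    ("negative", "neutral"),
]

direction_counts = Counter(rows)
# How many of the displayed errors have neutral on either side.
involving_neutral = sum(1 for true, pred in rows if "neutral" in (true, pred))

for (true, pred), n in direction_counts.most_common():
    print(f"{true} -> {pred}: {n}")
print(f"errors involving neutral: {involving_neutral} of {len(rows)}")
```

In this small excerpt every error has `neutral` on one side, which matches the later "Neutral Ambiguity" category, though ten rows are too few to generalize from.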


In [33]:
# Categorize the errors
def categorize_error(text):
    text_lower = text.lower()
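    # Keywords for negation. These are matched as substrings below, so a short
    # entry such as 'ना' or 'no' also matches inside longer, unrelated words.
    negation_words = ['नाही', 'नाहीये', 'नको', 'नका', 'ना', 'not', 'no']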
    # Keywords for sarcasm or figurative language (this is difficult and context-dependent)
    sarcasm_words = ['भारी', 'उत्तम', 'छानच', 'great', 'nice', 'wow'] # Often used sarcastically
    # Keywords for ambiguity
    ambiguous_words = ['पण', 'but', 'तरी', 'however', 'तरीही']

    if any(word in text_lower for word in negation_words):
        return 'Negation'
    if any(word in text_lower for word in ambiguous_words):
        return 'Ambiguous/Mixed'
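    # Simple check for sarcasm: positive words in a negative context (this is a heuristic).
    # Caveat: the label test reads only the first row of df_errors, not the row being
    # categorized, so the branch is switched on or off for all rows at once. In the run
    # shown here that first row is 'neutral', so 'Potential Sarcasm' is never returned.
    if 'true_label' in df_errors.columns and 'negative' in df_errors['true_label'].iloc[0] and any(word in text_lower for word in sarcasm_words):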
         return 'Potential Sarcasm'
    return 'Other/Complex'

df_errors['error_category_simple'] = df_errors['text'].apply(categorize_error)

# Display the distribution of error categories
error_counts = df_errors['error_category_simple'].value_counts()
print("Error Category Distribution (Simple):")
print(error_counts)

# Visualize the error categories
fig_errors = px.pie(error_counts, values=error_counts.values, names=error_counts.index,
                    title=f'Error Analysis for {best_model_name}',
                    hole=0.3)
fig_errors.update_traces(textposition='inside', textinfo='percent+label')
fig_errors.show()

# --- Detailed Error Analysis (Refined) ---
import re

# Take a sample of up to 30 errors for a closer look
error_samples = df_errors.head(30)

def get_error_category_and_explanation(row):
    text = row['text']
    true_label = row['true_label']
    pred_label = row['predicted_label']
    
    text_lower = text.lower()
    
    # 1. Negation
    negation_words = ['नाही', 'नाहीये', 'नको', 'नका', 'ना', 'not', 'no', "नाहीयेत", "नाहीत"]
    if any(word in text.split() for word in negation_words):
        return "Negation", "The model failed to correctly process a negative word like 'नाही' (not), leading to a misclassification of the sentiment."

    # 2. Code-Mixing
    if re.search(r'[a-zA-Z]', text):
        return "Code-Mixing", "The sentence contains a mix of Marathi and English. The model may have struggled to understand the sentiment of the English word within the Marathi sentence structure."

    # 3. Sarcasm/Irony
    positive_sarcasm_words = ['भारी', 'उत्तम', 'छानच', 'मस्त', 'great', 'nice', 'wow', 'amazing']
    if true_label == 'negative' and any(word in text_lower for word in positive_sarcasm_words):
        return "Sarcasm/Irony", "The model likely misinterpreted a positive word used in a negative or sarcastic context, taking it literally instead of understanding the irony."

    # 4. Neutral Ambiguity
    if true_label == 'neutral' and pred_label != 'neutral':
        return "Neutral Ambiguity", "The sentence is fact-based or lacks strong sentiment, but the model forced a positive or negative classification."
    if pred_label == 'neutral' and true_label != 'neutral':
        return "Neutral Ambiguity", "The sentence has subtle sentiment (positive or negative) that the model missed, causing it to default to a 'safe' neutral prediction."

    # 5. Idioms / Colloquialisms / Complex
    # This is a catch-all for errors that don't fit the other patterns.
    return "Idiom/Complex", "The error is likely due to complex sentence structure, nuanced language, a colloquialism, or an idiom whose meaning is not literal."


# Apply the new categorization
categorization_results = [get_error_category_and_explanation(row) for index, row in error_samples.iterrows()]
df_detailed_errors = pd.DataFrame({
    'Text': error_samples['text'],
    'True Label': error_samples['true_label'],
    'Predicted Label': error_samples['predicted_label'],
    'Error Category': [res[0] for res in categorization_results],
    'Potential Reason for Error': [res[1] for res in categorization_results]
})

print("\nDetailed Error Analysis (Sample):")
df_detailed_errors


Error Category Distribution (Simple):
error_category_simple
Other/Complex      341
Negation           218
Ambiguous/Mixed     58
Name: count, dtype: int64


Detailed Error Analysis (Sample):


index,Text,True Label,Predicted Label,Error Category,Potential Reason for Error
4,त्यांच्या दडपलेल्या भावना केवळ त्या पृष्ठभागाव...,neutral,negative,Negation,The model failed to correctly process a negati...
7,मला असे वाटले की आपण असे कोणीतरी आहात ज्याला म...,neutral,negative,Neutral Ambiguity,The sentence is fact-based or lacks strong sen...
13,मी गंभीर आहे.,neutral,negative,Neutral Ambiguity,The sentence is fact-based or lacks strong sen...
21,अरे आता तुला पॅड पाहिजे.,negative,neutral,Neutral Ambiguity,The sentence has subtle sentiment (positive or...
24,कारण यामध्ये रोमॅंटिक काय आहे आणि कॉमेडी काय आ...,neutral,negative,Negation,The model failed to correctly process a negati...
37,खोलीभोवती त्याचा पाठलाग.,neutral,negative,Neutral Ambiguity,The sentence is fact-based or lacks strong sen...
52,या संकल्पनेसाठी समीर आणि त्याच्या टीमला हॅटस ऑफ,positive,neutral,Neutral Ambiguity,The sentence has subtle sentiment (positive or...
53,घटस्फोटाची काळजी घेणार्‍या पिलो-टॉप राणीसाठी $...,negative,neutral,Neutral Ambiguity,The sentence has subtle sentiment (positive or...
54,जाण्यासाठी मार्ग!,positive,neutral,Neutral Ambiguity,The sentence has subtle sentiment (positive or...
55,दशावतार कलेबद्दल ती साकारणाऱ्यांबद्दल काहीही म...,negative,neutral,Negation,The model failed to correctly process a negati...
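
The simple categorizer in this cell matches keywords as substrings and decides the sarcasm branch from a single row's label. The sketch below shows one possible row-wise variant that matches whole tokens and uses each row's own true label. The function name `categorize_error_row`, the punctuation set, and the example sentences are assumptions made for illustration; the keyword lists are copied from the cell above, and the counts it would produce on the real errors are not known.

```python
NEGATION_TOKENS = {"नाही", "नाहीये", "नको", "नका", "ना", "not", "no"}
CONTRAST_TOKENS = {"पण", "but", "तरी", "however", "तरीही"}
SARCASM_CUES = {"भारी", "उत्तम", "छानच", "great", "nice", "wow"}
PUNCTUATION = ".,!?;:\"'()।"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return {tok.strip(PUNCTUATION) for tok in text.lower().split()}


def categorize_error_row(text, true_label):
    """Keyword heuristic using whole-token matches and the row's own label."""
    tokens = tokenize(text)
    if tokens & NEGATION_TOKENS:
        return "Negation"
    if tokens & CONTRAST_TOKENS:
        return "Ambiguous/Mixed"
    if true_label == "negative" and tokens & SARCASM_CUES:
        return "Potential Sarcasm"
    return "Other/Complex"


# Illustrative sentences (not taken from the dataset).
play = "नाटक छान होते"          # "The play was nice": contains 'ना' only inside 'नाटक'
dislike = "मला हे आवडले नाही."  # "I did not like this."
praise = "वा उत्तम सेवा"         # "Wow, excellent service"

print("ना" in play)                              # substring match fires
print(categorize_error_row(play, "positive"))    # token match does not
print(categorize_error_row(dislike, "negative"))
print(categorize_error_row(praise, "negative"))
print(categorize_error_row(praise, "positive"))
```

With a DataFrame this could be applied per row, for example `df_errors.apply(lambda r: categorize_error_row(r['text'], r['true_label']), axis=1)`. Token matching is still a rough heuristic: it misses negation expressed through verb morphology and cannot confirm that a positive word is actually sarcastic.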


In [35]:
# --- Generate Research Paper Style README.md ---
import plotly.io as pio
from datetime import datetime

# Define paths
# The script runs from LLMs_testing, so paths are relative to it.
readme_path = Path.cwd() / 'README.md'
images_dir = Path.cwd() / 'images'
images_dir.mkdir(exist_ok=True)

# Save figures as static images, using relative paths for the README
f1_img_path = images_dir / 'f1_comparison.png'
acc_img_path = images_dir / 'accuracy_comparison.png'
errors_img_path = images_dir / 'error_analysis.png'

pio.write_image(fig_f1, f1_img_path, scale=2)
pio.write_image(fig_acc, acc_img_path, scale=2)
pio.write_image(fig_errors, errors_img_path, scale=2)

# --- Content for README ---

# 1. Results and Detailed Errors to Markdown
results_md = df_results.to_markdown(index=False)
detailed_errors_md = df_detailed_errors.to_markdown(index=False)

# 2. Best model info
best_model_name = best_model_info['Model']
best_f1 = best_model_info['Finetuned F1']
best_acc = best_model_info['Finetuned Accuracy']

# 3. Error category distribution for summary
error_category_counts = df_detailed_errors['Error Category'].value_counts().to_markdown()

# 4. Construct the README content string
readme_content = f"""
# Fine-Tuning Transformer Models for Marathi Sentiment Analysis

**Author**: Ashish Parmar
**Date**: {datetime.now().strftime('%B %d, %Y')}
**Version**: 1.1.0

---

### **Abstract**

This paper presents an empirical study on fine-tuning various transformer-based language models for sentiment analysis on a Marathi-language dataset. We evaluate and compare the performance of several multilingual and Marathi-specific models, including `xlm-roberta-base`, `bert-base-multilingual-cased`, `google/muril-base-cased`, and variants from the L3Cube-MahaNLP suite. The objective is to identify the most effective model architecture for this task and to conduct a thorough qualitative error analysis to understand its limitations in handling complex linguistic nuances such as negation, sarcasm, and code-mixing. Our results indicate that **{best_model_name}** achieves the highest performance, with a macro F1-score of **{best_f1:.4f}** and an accuracy of **{best_acc:.4f}**. This study provides valuable insights and a reproducible benchmark for practitioners working on sentiment analysis for low-resource Indic languages.

---

## 1. Introduction

Sentiment analysis is a critical task in Natural Language Processing (NLP) with wide-ranging applications, from social media monitoring to customer feedback analysis. While significant progress has been made for high-resource languages like English, low-resource languages such as Marathi present unique challenges due to the scarcity of labeled data and monolingual pretrained models. This project aims to address this gap by systematically fine-tuning and evaluating a suite of transformer models on a custom Marathi sentiment dataset, thereby creating a robust baseline for future research.

The primary contributions of this work are:
- A standardized, end-to-end pipeline for fine-tuning and evaluating sentiment analysis models in Marathi, suitable for GPUs with moderate memory (e.g., NVIDIA RTX 4050).
- A comparative analysis of several state-of-the-art multilingual and Marathi-specific models to determine their relative effectiveness.
- A detailed qualitative error analysis of the best-performing model to identify common failure modes and suggest concrete directions for future research and model improvement.

## 2. Project Structure

The repository is organized to ensure reproducibility and a clear separation of concerns between training, analysis, and results.

```
. (c:/LLM's_for_SA/)
├── LLMs_testing/
│   ├── analysis.ipynb               # Notebook to analyze results and generate this README
│   ├── marathi_finetune_and_eval.ipynb          (for xlm-roberta-base)
│   ├── marathi_finetune_and_eval_mbert.ipynb    (for bert-base-multilingual-cased)
│   ├── marathi_finetune_and_eval_muril.ipynb    (for google/muril-base-cased)
│   ├── marathi_finetune_and_eval_l3cube.ipynb   (for l3cube-pune/marathi-bert)
│   ├── marathi_finetune_and_eval_mahaalbert_v1.ipynb (for l3cube-pune/marathi-albert-v2)
│   ├── README.md                    # This documentation file
│   ├── requirements.txt             # Python dependencies for the analysis
│   └── images/                      # Directory for storing plots and visualizations
│       ├── f1_comparison.png
│       ├── accuracy_comparison.png
│       └── error_analysis.png
│
├── output/
│   └── balanced_mode_strict_domain.csv # The dataset used for training and evaluation
│
└── results/
    └── notebook_run/                  # Directory containing outputs from all training runs
        └── <timestamp>/               # One folder per run (e.g., 20250917_142448) with the saved model, data splits, and summary.json
```

## 3. Methodology

### 3.1. Dataset

The experiments were conducted on a custom dataset of Marathi text, `balanced_mode_strict_domain.csv`, labeled with three sentiment classes: `positive`, `neutral`, and `negative`. The dataset was preprocessed and split into training (80%), validation (10%), and testing (10%) sets using stratified sampling to maintain the label distribution across all splits.

### 3.2. Models Evaluated

The following pretrained transformer models were selected for this study, representing a mix of general multilingual models and models specifically trained on Marathi or other Indic languages:

{df_results['Model'].to_markdown(index=False)}

### 3.3. Experimental Setup

All models were fine-tuned using the Hugging Face `transformers` and `accelerate` libraries on a system equipped with an NVIDIA RTX 4050 GPU. The training was configured with the following key hyperparameters:
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: 16 (with gradient accumulation where necessary)
- **Epochs**: 2
- **Maximum Sequence Length**: 192 tokens
- **Evaluation Strategy**: Metrics were computed at the end of each epoch, and the model checkpoint with the best macro F1-score on the validation set was saved for the final evaluation.

## 4. Results and Discussion

### 4.1. Quantitative Analysis

The performance of each fine-tuned model was evaluated on the held-out test set. The table below summarizes the key metrics, comparing the baseline (pre-fine-tuning) performance with the final fine-tuned results. Models are sorted by macro F1-score.

{results_md}

![F1 Score Comparison](images/f1_comparison.png)
![Accuracy Comparison](images/accuracy_comparison.png)

**{best_model_name}** achieves the highest macro F1-score, **{best_f1:.4f}**, among the evaluated models. The margin over the next-ranked models is small and no significance test was run, so the ranking should be read as indicative. The result is consistent with the view that models pretrained specifically on Indic languages, whose training data often includes transliterated and mixed-language text, may have an advantage over general multilingual models for this task.

### 4.2. Qualitative Error Analysis

To understand the qualitative behavior of the best model, **{best_model_name}**, we conducted a detailed analysis of its misclassifications on the test set.

![Error Analysis for {best_model_name}](images/error_analysis.png)

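A sample of up to 30 misclassified examples was categorized programmatically with keyword heuristics to highlight recurring error patterns (the chart above uses a simpler keyword categorizer applied to all misclassified examples):

{error_category_counts}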

The following table provides concrete examples of these errors and their likely causes:

{detailed_errors_md}

### 4.3. Discussion of Findings

The error analysis reveals several key challenges that are common in sentiment analysis for morphologically rich and code-mixed languages:
- **Linguistic Nuance**: The model struggles with sarcasm, irony, and idiomatic expressions, where the literal meaning of words does not correspond to the overall sentiment. This is a classic NLP challenge that requires deeper contextual or world knowledge.
- **Negation Handling**: While transformers are generally capable of handling negation, complex sentence structures or subtle negations can still cause the model to fail.
- **Code-Mixing**: The presence of English words in Marathi sentences introduces ambiguity that the model is not always equipped to handle, sometimes leading it to weigh the English sentiment more heavily.
- **Ambiguity and Subtlety**: Sentences with subtle or mixed sentiment are often misclassified, typically defaulting to `neutral` or being swayed by a single emotionally charged word.

## 5. Conclusion and Future Work

This study successfully demonstrates the effectiveness of fine-tuning transformer models for Marathi sentiment analysis, establishing that **{best_model_name}** provides the best performance among the evaluated models and serves as a strong baseline for future work.

The error analysis highlights that the primary remaining challenges are linguistic in nature. Future work should focus on:
1.  **Targeted Data Augmentation**: Creating or sourcing more training examples that specifically target sarcasm, complex negation, and common code-mixing scenarios.
2.  **Advanced Modeling Techniques**: Exploring more sophisticated architectures, such as multi-task learning (e.g., jointly predicting sentiment and sarcasm) or incorporating external knowledge bases to better handle idiomatic language.
3.  **Human-in-the-Loop Annotation**: Using the current model's prediction errors and low-confidence predictions to identify ambiguous samples for human review, thereby iteratively improving the quality and robustness of the training dataset.

By addressing these challenges, we can further advance the capabilities of sentiment analysis for Marathi and other low-resource languages, paving the way for more nuanced and accurate NLP applications.
"""

# Write the final README file
with open(readme_path, 'w', encoding='utf-8') as f:
    f.write(readme_content)

print(f"✅ Research paper style README.md has been successfully generated at: {readme_path}")
print(f"All plots have been saved in the '{images_dir.name}' directory.")


✅ Research paper style README.md has been successfully generated at: c:\LLM's_for_SA\LLMs_testing\README.md
All plots have been saved in the 'images' directory.
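
Section 3.1 of the generated README describes an 80/10/10 stratified split. The training notebooks are not shown here, so the sketch below is only one way such a split could be produced with the standard library; the function name `stratified_split`, the seed, and the synthetic labels are assumptions. In practice scikit-learn's `train_test_split` with its `stratify` argument is a common alternative.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists that preserve label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Synthetic, balanced labels purely for demonstration.
labels = ["negative"] * 100 + ["neutral"] * 100 + ["positive"] * 100
train_idx, val_idx, test_idx = stratified_split(labels)

for name, idxs in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(idxs), dict(Counter(labels[i] for i in idxs)))
```

Splitting each class separately keeps the three labels in the same ratio in every split, which is what makes accuracy and macro F1 comparable between validation and test.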
