# Marathi Tweet Sentiment Analysis with SentiWordNet

This notebook combines two Marathi sentiment datasets (MahaSent Movie Reviews and MahaSent Social Tweets), translates Marathi text to English, applies a rule-based sentiment classifier using SentiWordNet, evaluates performance, and visualizes results with word clouds and confusion matrices.

## Pipeline Overview
1. Load and unify datasets (movie reviews + social tweets)
2. Clean and deduplicate
3. Translate Marathi → English with caching
4. Tokenize and score using SentiWordNet
5. Predict sentiment labels (positive / negative / neutral)
6. Evaluate against gold labels
7. Visualize (word clouds, label distributions, confusion matrix)
8. Save merged results, metrics, artifacts

## Why Rule-Based First?
SentiWordNet provides a transparent baseline: every prediction can be traced back to per-word scores. Later you can compare it with ML models (Naive Bayes, SVM) or with fine-tuned transformers such as IndicBERT or mBERT.

---
**Note:** Translation quality affects downstream scoring; idioms or domain phrases may reduce accuracy. Caching limits repeated API usage.

In [5]:
# =====================
# 1. Setup: Imports & Configuration
# =====================
import os
import re
import json
import time
import math
import random
from pathlib import Path
from collections import defaultdict, Counter

import pandas as pd
import numpy as np
from tqdm.auto import tqdm

import nltk
from nltk.corpus import sentiwordnet as swn, stopwords, wordnet
from nltk import pos_tag, word_tokenize

from deep_translator import GoogleTranslator

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix

# Ensure reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Translation & caching configuration
BLOCK_SIZE = 25            # number of rows per translation batch
CACHE_SAVE_INTERVAL = 100  # persist cache every N new translations
TRANSLATION_CACHE_PATH = Path('translation_cache.json')
OUTPUT_DIR = Path('output')
OUTPUT_DIR.mkdir(exist_ok=True)

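# NLTK data download. nltk.download() is idempotent: up-to-date packages are skipped.
# pos_tag needs a perceptron tagger; newer NLTK releases look for the
# punkt_tab / averaged_perceptron_tagger_eng variants, so both ids are requested.
for resource in ["punkt", "punkt_tab", "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
                 "wordnet", "sentiwordnet", "stopwords", "omw-1.4"]:
    try:
        if not nltk.download(resource):
            print(f"Warning: could not download {resource}")
    except Exception as e:
        print(f"Warning: could not download {resource}: {e}")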

# Mapping now supports numeric labels (-1,0,1) commonly used in datasets
MARATHI_LABEL_NORMALIZATION = {
    'pos': 'positive','positive':'positive','1':'positive',
    'neg':'negative','negative':'negative','-1':'negative',
    'neu':'neutral','neutral':'neutral','0':'neutral'
}

EN_STOPWORDS = set(stopwords.words('english'))
# Keep negations
NEGATION_WORDS = {"not","no","never","n't"}
EN_STOPWORDS = {w for w in EN_STOPWORDS if w not in NEGATION_WORDS}

print("Setup complete. Updated label mapping supports -1/0/1.")

Setup complete. Updated label mapping supports -1/0/1.


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ashpa\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\ashpa\AppData\Roaming\nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ashpa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ashpa\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## Load & Inspect Datasets
We automatically detect the MahaSent movie review and social tweet datasets (train/val/test CSV files) and unify them.

Steps:
- Search for files containing `MahaSent_MR_` and `MahaSent_ST_`
- Accept `.csv`; fall back to `.tsv` or `.txt` with common delimiters
- Standardize columns to: `text` (Marathi) and `label`
- Normalize label values to {positive, negative, neutral}
- Drop duplicates / empty rows
- Report dataset statistics

In [14]:
# =====================
# 2. Load Datasets
# =====================

SEARCH_ROOT = Path('.')

CANDIDATE_PATTERNS = [
    ('MR', 'MahaSent_MR_'),
    ('MS', 'MahaSent_ST_')
]

VALID_EXTS = ['.csv', '.tsv', '.txt']
DELIMS = [',','\t','|',';']


def discover_files():
    files = {k: [] for k,_ in CANDIDATE_PATTERNS}
    for k, pattern in CANDIDATE_PATTERNS:
        for p in SEARCH_ROOT.rglob(f"{pattern}*"):
            if p.suffix.lower() in VALID_EXTS and p.is_file():
                files[k].append(p)
    return files


def try_read(path: Path):
    for d in DELIMS:
        try:
            df = pd.read_csv(path, delimiter=d)
            if df.shape[1] > 15:  # defensive: skip obviously wrong delimiter
                continue
            return df
        except Exception:
            continue
    return None


def standardize(df: pd.DataFrame):
    cols_lower = {c.lower(): c for c in df.columns}
    # Find text
    text_col = None
    for cand in ['text','tweet','review','sentence','comment','marathi_text','marathi_sentence']:
        if cand in cols_lower:
            text_col = cols_lower[cand]
            break
    if text_col is None:
        text_col = df.columns[0]
    # Find label
    label_col = None
    for cand in ['label','sentiment','polarity','class']:
        if cand in cols_lower:
            label_col = cols_lower[cand]
            break
    if label_col is None and len(df.columns) > 1:
        label_col = df.columns[1]
    out = pd.DataFrame({'text': df[text_col].astype(str)})
    if label_col and label_col in df.columns:
        out['label_raw'] = df[label_col].astype(str).str.strip()
        out['label'] = out['label_raw'].apply(lambda x: x.lower() if re.search('[A-Za-z]', x) else x)
        out['label'] = out['label'].map(MARATHI_LABEL_NORMALIZATION)
    else:
        out['label'] = np.nan
    return out

files_found = discover_files()
print('Discovered files:')
for k,v in files_found.items():
    for f in v:
        print(f"  [{k}] {f}")

frames = []
for k, paths in files_found.items():
    for p in paths:
        df_raw = try_read(p)
        if df_raw is None:
            print(f"Could not read {p}")
            continue
        df_std = standardize(df_raw)
        df_std['source'] = k
        frames.append(df_std)

if not frames:
    raise RuntimeError('No datasets loaded. Ensure dataset files are present.')

merged = pd.concat(frames, ignore_index=True)

# Clean
merged['text'] = merged['text'].astype(str).str.replace(r'\s+', ' ', regex=True).str.strip()
merged = merged[merged['text'].str.len() > 0]
merged = merged.drop_duplicates(subset=['text'])

print('Dataset size after cleaning:', len(merged))
print('Label distribution (raw mapped):')
print(merged['label'].value_counts(dropna=False))

# --- NEW: Save combined (Marathi only) dataset early ---
# Ensure output dir exists (in case this cell run before setup accidentally)
if 'OUTPUT_DIR' not in globals():
    OUTPUT_DIR = Path('output')
OUTPUT_DIR.mkdir(exist_ok=True)
combined_path = OUTPUT_DIR / 'combined_marathi_dataset.csv'
merged.to_csv(combined_path, index=False, encoding='utf-8')
print(f'Saved combined dataset (before translation) to: {combined_path}')

merged.head()

Discovered files:
  [MR] MahaSent_MR_Train.csv
  [MR] L3Cube_MahaSent_MR\MahaSent_MR_Test.csv
  [MR] L3Cube_MahaSent_MR\MahaSent_MR_Train.csv
  [MR] L3Cube_MahaSent_MR\MahaSent_MR_Val.csv
  [MS] L3Cube_MahaSent_MS\MahaSent_ST_Test.csv
  [MS] L3Cube_MahaSent_MS\MahaSent_ST_Train.csv
  [MS] L3Cube_MahaSent_MS\MahaSent_ST_Val.csv
Dataset size after cleaning: 30000
Label distribution (raw mapped):
label
negative    10000
neutral     10000
positive    10000
Name: count, dtype: int64
Saved combined dataset (before translation) to: output\combined_marathi_dataset.csv


Unnamed: 0,text,label_raw,label,source
0,माने यांचा घटस्फोट झाला आहे तर मोहितेने नुकतेच...,-1,negative,MR
1,एका रात्रीत घडणारी किंबहुना बिघडणारी ही गोष्ट आहे,-1,negative,MR
2,जरी आघात समजण्यायोग्य आहे जरी चित्रपटाला खराब ...,-1,negative,MR
3,पण तो असा आघातही अनुभवत आहे की तो कोणाशीही शेअ...,-1,negative,MR
4,छोटे-छोटे गैरसमज मोठ्या अडचणीत येतात,-1,negative,MR


## Translate Marathi → English
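We translate texts using `deep_translator.GoogleTranslator`, caching results to avoid redundant API calls. Strings are sent one at a time; they are grouped into blocks of `BLOCK_SIZE` only for progress reporting and cache checkpoints, and each call is retried up to four times with exponential backoff.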

Caching strategy:
- Load existing JSON cache if present
- Only translate missing Marathi strings
- Persist cache every N new translations (`CACHE_SAVE_INTERVAL`)

If all retries fail, the cache stores the placeholder `<translation_failed>` with `ok` set to false, which surfaces in the DataFrame as `translation_ok = False`.
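
One consequence of this strategy is that a failed translation is itself a cache entry, so a plain membership check never retries it. The following is a minimal sketch of a selection helper that also re-queues failed entries, assuming the cache layout used in this notebook (`{"english": ..., "ok": ...}`). The name `texts_needing_translation` is hypothetical and not part of `deep_translator`.

```python
def texts_needing_translation(texts, cache):
    """Return unique texts that are absent from the cache or marked as failed.

    Assumes each cache value is a dict like {"english": str, "ok": bool}.
    Hypothetical helper for illustration.
    """
    seen = set()
    pending = []
    for text in texts:
        if text in seen:
            continue  # skip repeated inputs
        seen.add(text)
        entry = cache.get(text)
        if entry is None or not entry.get("ok", False):
            pending.append(text)
    return pending


demo_cache = {
    "a": {"english": "A", "ok": True},
    "b": {"english": "<translation_failed>", "ok": False},
}
pending = texts_needing_translation(["a", "b", "c", "c"], demo_cache)
print(pending)
```

Swapping this in for the `t not in translation_cache` filter would retry earlier failures on the next run, at the cost of extra API calls for strings that fail permanently.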

### Domain-Balanced Combination (New)
To keep the larger source from dominating when merging Movie Reviews (MR) and Social Tweets (MS):
- We keep class balance AND source balance.
- For each label we take the smallest per-source count and sample that many rows from every source, so both domains contribute equally.
- We then create a stratified split that preserves both the sentiment label distribution and the domain proportion.

You can adjust the `BALANCE_MODE` variable to switch strategies:
- `strict`: enforce equal counts per (label, source)
- `proportional`: keep natural frequencies (original merged)
- `none`: skip balancing (raw merged)

In the code below, `proportional` and `none` both leave the merged dataset unchanged. The balanced dataset and splits are saved under `output/combined_dataset/`. This helps keep a downstream model from overfitting to the wording style of the larger source.
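
To make the `strict` mode concrete, here is a small standard-library sketch on toy rows. It is an illustration of the sampling rule under stated assumptions, not the notebook's pandas implementation; `strict_balance` and the toy counts are hypothetical.

```python
import random
from collections import defaultdict


def strict_balance(rows, seed=42):
    """Sample the same number of rows from every source within each label.

    rows: list of dicts with 'label' and 'source' keys (toy stand-in for the DataFrame).
    The per-label sample size is the smallest source count seen for that label.
    """
    rng = random.Random(seed)
    by_label = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_label[row["label"]][row["source"]].append(row)
    balanced = []
    for by_source in by_label.values():
        n = min(len(group) for group in by_source.values())
        for group in by_source.values():
            balanced.extend(rng.sample(group, n))  # sampling without replacement
    return balanced


toy = (
    [{"label": "positive", "source": "MR"}] * 5
    + [{"label": "positive", "source": "MS"}] * 2
    + [{"label": "negative", "source": "MR"}] * 3
    + [{"label": "negative", "source": "MS"}] * 4
)
balanced = strict_balance(toy)
counts = defaultdict(int)
for row in balanced:
    counts[(row["label"], row["source"])] += 1
print(dict(counts))
```

Each (label, source) pair ends up with the minimum count for its label, so strict balancing can discard a large share of the bigger source when the domains differ a lot in size.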

In [None]:
# =====================
# 3. Translation with Caching
# =====================

def load_cache(path: Path):
    if path.exists():
        try:
            with open(path, 'r', encoding='utf-8') as f:
                return json.load(f)
        except Exception as e:
            print(f"Failed to load cache: {e}")
    return {}


def save_cache(cache: dict, path: Path):
    tmp = path.with_suffix('.tmp')
    with open(tmp, 'w', encoding='utf-8') as f:
        json.dump(cache, f, ensure_ascii=False, indent=2)
    tmp.replace(path)

translation_cache = load_cache(TRANSLATION_CACHE_PATH)
print(f"Loaded {len(translation_cache)} cached translations")

translator = GoogleTranslator(source='auto', target='en')

missing_texts = [t for t in merged['text'] if t not in translation_cache]
print(f"Need to translate: {len(missing_texts)} new entries")

new_counter = 0
for i in tqdm(range(0, len(missing_texts), BLOCK_SIZE), desc='Translating'):
    batch = missing_texts[i:i+BLOCK_SIZE]
    for mar_txt in batch:
        if mar_txt in translation_cache:
            continue
        attempt = 0
        backoff = 2
        while attempt < 4:
            try:
                eng = translator.translate(mar_txt)
                if not isinstance(eng, str):
                    eng = str(eng)
                eng = re.sub(r'\s+', ' ', eng).strip()
                translation_cache[mar_txt] = {"english": eng, "ok": True}
                break
            except Exception as e:
                attempt += 1
                if attempt == 4:
                    translation_cache[mar_txt] = {"english": "<translation_failed>", "ok": False, "error": str(e)}
                else:
                    time.sleep(backoff)
                    backoff *= 2
        new_counter += 1
    if new_counter >= CACHE_SAVE_INTERVAL:
        save_cache(translation_cache, TRANSLATION_CACHE_PATH)
        print(f"Intermediate cache saved ({len(translation_cache)} entries)")
        new_counter = 0

# Final cache save
save_cache(translation_cache, TRANSLATION_CACHE_PATH)
print(f"Cache saved: {len(translation_cache)} total entries")

merged['english_text'] = merged['text'].map(lambda x: translation_cache.get(x, {}).get('english',''))
merged['translation_ok'] = merged['text'].map(lambda x: translation_cache.get(x, {}).get('ok', False))

print('Sample translations:')
print(merged[['text','english_text','translation_ok']].head())

Loaded 0 cached translations
Need to translate: 30000 new entries


Translating:   0%|          | 4/1200 [02:54<14:31:31, 43.72s/it]

Intermediate cache saved (100 entries)


Translating:   5%|▍         | 57/1200 [45:43<16:05:54, 50.70s/it]

In [None]:
# =====================
# 2b. Domain-Class Balancing & Stratified Splits
# =====================

from sklearn.model_selection import train_test_split

# Choose balancing mode: 'strict', 'proportional', 'none'
BALANCE_MODE = 'strict'
RANDOM_STATE = SEED

bal_output_dir = OUTPUT_DIR / 'combined_dataset'
bal_output_dir.mkdir(parents=True, exist_ok=True)

base_df = merged.copy()

if BALANCE_MODE == 'none':
    balanced_df = base_df
elif BALANCE_MODE == 'proportional':
    balanced_df = base_df  # keep original distribution
elif BALANCE_MODE == 'strict':
    # For each label, find counts per source; take min and sample that many from each source
    groups = []
    for label, sub in base_df.groupby('label'):
        if pd.isna(label):
            continue
        per_source_counts = sub['source'].value_counts()
        min_count = per_source_counts.min()
        for src, src_sub in sub.groupby('source'):
            if min_count <= 0:
                continue
            sample_n = min_count
            groups.append(src_sub.sample(sample_n, random_state=RANDOM_STATE, replace=False))
    if groups:
        balanced_df = pd.concat(groups, ignore_index=True)
    else:
        balanced_df = base_df
else:
    print(f"Unknown BALANCE_MODE={BALANCE_MODE}, defaulting to unmodified dataset.")
    balanced_df = base_df

balanced_df = balanced_df.sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)
print('Balanced size:', len(balanced_df))
print('Balanced label distribution:')
print(balanced_df['label'].value_counts())
print('Balanced source distribution:')
print(balanced_df['source'].value_counts())

# Create a combined stratification key label|source to preserve both aspects
strat_key = balanced_df['label'].astype(str) + '|' + balanced_df['source'].astype(str)

train_df, temp_df = train_test_split(
    balanced_df,
    test_size=0.3,
    random_state=RANDOM_STATE,
    stratify=strat_key
)
strat_key_temp = temp_df['label'].astype(str) + '|' + temp_df['source'].astype(str)
val_df, test_df = train_test_split(
    temp_df,
    test_size=0.5,
    random_state=RANDOM_STATE,
    stratify=strat_key_temp
)

print('Split sizes -> train:', len(train_df), 'val:', len(val_df), 'test:', len(test_df))

# Save all versions
balanced_path = bal_output_dir / f'balanced_mode_{BALANCE_MODE}.csv'
train_path = bal_output_dir / f'train_{BALANCE_MODE}.csv'
val_path = bal_output_dir / f'val_{BALANCE_MODE}.csv'
test_path = bal_output_dir / f'test_{BALANCE_MODE}.csv'

balanced_df.to_csv(balanced_path, index=False, encoding='utf-8')
train_df.to_csv(train_path, index=False, encoding='utf-8')
val_df.to_csv(val_path, index=False, encoding='utf-8')
test_df.to_csv(test_path, index=False, encoding='utf-8')

print('Saved balanced full dataset to:', balanced_path)
print('Saved train / val / test to:')
print(' ', train_path)
print(' ', val_path)
print(' ', test_path)

# Downstream steps (translation, scoring, evaluation) operate on the balanced
# dataset rather than the original merged one, to avoid domain skew.
merged = balanced_df.copy()

## Apply SentiWordNet Scoring
We tokenize translated English text, map POS tags, fetch SentiWordNet synset scores, aggregate positive/negative/objective scores, and assign a final label using a margin threshold of 0.05.

Design choices:
- Simple POS mapping via `nltk.pos_tag`
- Use first synset match heuristic (fast baseline)
- Aggregate raw sums then length-normalize (optional)
- Skip tokens without sentiment entries
- Provide progress monitoring
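
The margin rule can be traced by hand on a toy lexicon. This is a minimal sketch: the scores in `TOY_SCORES` are made up for illustration and are not real SentiWordNet values, and `score_tokens` is a hypothetical helper that mirrors the sum-then-compare logic described above.

```python
MARGIN = 0.05

# Hypothetical (pos, neg) scores for illustration only; not real SentiWordNet values.
TOY_SCORES = {
    "good": (0.625, 0.0),
    "bad": (0.0, 0.625),
    "boring": (0.0, 0.5),
    "plot": (0.0, 0.0),
}


def score_tokens(tokens, lexicon=TOY_SCORES, margin=MARGIN):
    """Sum per-token scores, then label by which side wins by more than `margin`."""
    pos_sum = sum(lexicon.get(t, (0.0, 0.0))[0] for t in tokens)
    neg_sum = sum(lexicon.get(t, (0.0, 0.0))[1] for t in tokens)
    if pos_sum > neg_sum + margin:
        label = "positive"
    elif neg_sum > pos_sum + margin:
        label = "negative"
    else:
        label = "neutral"
    return pos_sum, neg_sum, label


print(score_tokens(["good", "plot"]))    # positive side wins clearly
print(score_tokens(["boring", "plot"]))  # negative side wins clearly
print(score_tokens(["good", "bad"]))     # tie, falls inside the margin
```

Because the sums are not length-normalized by default, longer texts accumulate larger totals, so a fixed margin of 0.05 is relatively easier to exceed for long inputs than for short ones.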

In [None]:
# =====================
# 4. SentiWordNet Scoring
# =====================

POS_MAP = {
    'J': wordnet.ADJ,
    'N': wordnet.NOUN,
    'R': wordnet.ADV,
    'V': wordnet.VERB
}

MARGIN = 0.05


def get_primary_swn_scores(lemma, wn_pos):
    try:
        synsets = list(wordnet.synsets(lemma, pos=wn_pos))
        if not synsets:
            return 0.0, 0.0, 0.0
        # take first synset
        syn = synsets[0]
        try:
            swn_syn = swn.senti_synset(syn.name())
            return swn_syn.pos_score(), swn_syn.neg_score(), swn_syn.obj_score()
        except Exception:
            return 0.0, 0.0, 0.0
    except Exception:
        return 0.0, 0.0, 0.0


def classify_from_scores(pos_sum, neg_sum):
    if pos_sum > neg_sum + MARGIN:
        return 'positive'
    if neg_sum > pos_sum + MARGIN:
        return 'negative'
    return 'neutral'

pos_scores = []
neg_scores = []
obj_scores = []
pred_labels = []

texts = merged['english_text'].fillna('')
for text in tqdm(texts, desc='Scoring SentiWordNet'):
    tokens = [t for t in word_tokenize(text) if re.search(r'[A-Za-z]', t)]
    tokens_lower = [t.lower() for t in tokens]
    tagged = pos_tag(tokens_lower)
    sent_pos = sent_neg = sent_obj = 0.0
    counted = 0
    for tok, tag in tagged:
        if tok in EN_STOPWORDS:
            continue
        wn_pos = POS_MAP.get(tag[0])
        if not wn_pos:
            continue
        p, n, o = get_primary_swn_scores(tok, wn_pos)
        if p == 0 and n == 0 and o == 0:
            continue
        sent_pos += p
        sent_neg += n
        sent_obj += o
        counted += 1
    if counted > 0:
        # length normalization (optional): uncomment to use
        # sent_pos /= counted; sent_neg /= counted; sent_obj /= counted
        pass
    pos_scores.append(sent_pos)
    neg_scores.append(sent_neg)
    obj_scores.append(sent_obj)
    pred_labels.append(classify_from_scores(sent_pos, sent_neg))

merged['pos_score'] = pos_scores
merged['neg_score'] = neg_scores
merged['obj_score'] = obj_scores
merged['predicted_label'] = pred_labels

print('Predicted label distribution:')
print(merged['predicted_label'].value_counts())
merged[['english_text','pos_score','neg_score','predicted_label']].head()

## Evaluation
We evaluate the rule-based predictions against gold labels using accuracy, precision, recall, F1, and confusion matrix.

We also surface examples of misclassifications for qualitative error analysis.
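
For readers who want to see what the per-class numbers mean, the following standard-library sketch computes precision, recall, and F1 from true-positive, false-positive, and false-negative counts, returning 0.0 when a denominator is zero. The function name `per_class_metrics` and the toy labels are hypothetical; the notebook itself relies on scikit-learn for these values.

```python
LABELS = ["positive", "negative", "neutral"]


def per_class_metrics(gold, pred, labels=LABELS):
    """Compute precision, recall and F1 per label, using 0.0 for empty denominators."""
    metrics = {}
    pairs = list(zip(gold, pred))
    for lbl in labels:
        tp = sum(1 for g, p in pairs if g == lbl and p == lbl)
        fp = sum(1 for g, p in pairs if g != lbl and p == lbl)
        fn = sum(1 for g, p in pairs if g == lbl and p != lbl)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        metrics[lbl] = {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
    return metrics


toy_gold = ["positive", "positive", "negative", "neutral"]
toy_pred = ["positive", "neutral", "negative", "neutral"]
toy_metrics = per_class_metrics(toy_gold, toy_pred)
for lbl, vals in toy_metrics.items():
    print(lbl, vals)
```

In this toy case the one positive example predicted as neutral lowers recall for `positive` and precision for `neutral`, which is the pattern to look for when the neutral class absorbs low-margin cases.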

In [None]:
# =====================
# 5. Evaluation
# =====================

valid_labels = {'positive','negative','neutral'}
_eval_df = merged[merged['label'].isin(valid_labels)].copy()
print(f"Evaluation rows: {_eval_df.shape[0]} / {merged.shape[0]}")

if _eval_df.empty:
    print('No valid gold labels to evaluate.')
else:
    gold = _eval_df['label']
    pred = _eval_df['predicted_label']
    acc = accuracy_score(gold, pred)
    precision, recall, f1, support = precision_recall_fscore_support(gold, pred, labels=['positive','negative','neutral'], zero_division=0)
    report = classification_report(gold, pred, digits=4, zero_division=0)
    cm = confusion_matrix(gold, pred, labels=['positive','negative','neutral'])

    print(f"Accuracy: {acc:.4f}")
    print('\nClassification Report:\n')
    print(report)

    fig, ax = plt.subplots(figsize=(5,4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['positive','negative','neutral'], yticklabels=['positive','negative','neutral'], ax=ax)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Gold')
    ax.set_title('Confusion Matrix')
    plt.tight_layout()
    plt.show()

    # Error analysis
    errors = _eval_df[_eval_df['label'] != _eval_df['predicted_label']]
    print(f"Misclassified examples: {len(errors)}")
    display(errors[['text','english_text','label','predicted_label','pos_score','neg_score']].head(10))

    metrics_summary = {
        'accuracy': acc,
        'per_class': {
            lbl: {
                'precision': float(precision[i]),
                'recall': float(recall[i]),
                'f1': float(f1[i]),
                'support': int(support[i])
            } for i,lbl in enumerate(['positive','negative','neutral'])
        }
    }
# Fallback so the save step can still serialize an (empty) metrics summary
if _eval_df.empty:
    metrics_summary = {}

metrics_summary

## Visualization
We generate word clouds for each predicted sentiment and a bar plot comparing gold vs predicted distribution.

In [None]:
# =====================
# 6. Visualization
# =====================

# WordClouds by predicted label
wc_dir = OUTPUT_DIR
wc_paths = {}

fig, axes = plt.subplots(1,3, figsize=(15,5))
for ax, label, cmap in zip(axes, ['positive','negative','neutral'], ['Greens','Reds','Blues']):
    subset = merged[merged['predicted_label']==label]
    text_blob = ' '.join(subset['english_text'].dropna().astype(str))
    if not text_blob:
        ax.set_title(f"No data for {label}")
        ax.axis('off')
        continue
    wc = WordCloud(width=800, height=600, background_color='white', colormap=cmap, stopwords=EN_STOPWORDS, max_words=300).generate(text_blob)
    ax.imshow(wc, interpolation='bilinear')
    ax.axis('off')
    ax.set_title(label)
    out_path = wc_dir / f'wordcloud_{label}.png'
    wc.to_file(out_path)
    wc_paths[label] = str(out_path)
plt.tight_layout()
plt.show()

# Label distribution bar plot
fig, ax = plt.subplots(figsize=(6,4))
count_gold = merged['label'].value_counts()
count_pred = merged['predicted_label'].value_counts()
all_labels = sorted(set(count_gold.index).union(count_pred.index))
bar_df = pd.DataFrame({
    'gold': [count_gold.get(l,0) for l in all_labels],
    'predicted': [count_pred.get(l,0) for l in all_labels]
}, index=all_labels)
bar_df.plot(kind='bar', ax=ax)
ax.set_ylabel('Count')
ax.set_title('Gold vs Predicted Label Distribution')
plt.tight_layout()
plt.show()

wc_paths

## Save Results
Persist processed dataset, metrics (JSON), word clouds, and translation cache to the `output/` directory.

In [None]:
# =====================
# 7. Save Artifacts
# =====================

results_cols = ['text','english_text','label','predicted_label','pos_score','neg_score','obj_score','translation_ok','source']
results_path = OUTPUT_DIR / 'merged_sentiment_results.csv'
merged[results_cols].to_csv(results_path, index=False, encoding='utf-8')
print(f"Saved results CSV: {results_path}")

metrics_path = OUTPUT_DIR / 'metrics_summary.json'
with open(metrics_path, 'w', encoding='utf-8') as f:
    json.dump(metrics_summary, f, ensure_ascii=False, indent=2)
print(f"Saved metrics JSON: {metrics_path}")

# Confusion matrix plot saving (if existed above)
cm_plot_path = OUTPUT_DIR / 'confusion_matrix.png'
# Recreate + save if evaluation happened
if _eval_df.shape[0] > 0:
    cm = confusion_matrix(_eval_df['label'], _eval_df['predicted_label'], labels=['positive','negative','neutral'])
    fig, ax = plt.subplots(figsize=(5,4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['positive','negative','neutral'], yticklabels=['positive','negative','neutral'], ax=ax)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Gold')
    ax.set_title('Confusion Matrix')
    plt.tight_layout()
    fig.savefig(cm_plot_path)
    plt.close(fig)
    print(f"Saved confusion matrix: {cm_plot_path}\n")

# Final translation cache save (it was also saved during translation)
save_cache(translation_cache, TRANSLATION_CACHE_PATH)
print(f"Translation cache path: {TRANSLATION_CACHE_PATH}")

print('Word cloud image paths:')
for lbl, p in wc_paths.items():
    print(f"  {lbl}: {p}")

print('Done saving artifacts.')

## Conclusion
This notebook established a transparent rule-based baseline for Marathi tweet/movie review sentiment by leveraging machine translation and SentiWordNet.

### Key Takeaways
- Translation quality directly impacts sentiment scoring
- SentiWordNet works better on standard English vocabulary; domain / slang terms are often OOV
- Neutral classification absorbs ambiguous low-margin cases

### Potential Improvements
1. Use better translation (IndicTrans2 or offline Marian models) to reduce noise
2. Lemmatize tokens before SentiWordNet lookup to improve coverage
3. Incorporate negation scope detection (e.g., reversing polarity within window after negation)
4. Add weighting by synset rank or average top-K synsets instead of first-sense heuristic
5. Compare with ML baselines: Logistic Regression / SVM using TF-IDF over English translations
6. Fine-tune transformer models (IndicBERT, mBERT, MuRIL) directly on Marathi without translation
7. Perform error clustering to identify systematic mistranslations
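
As an illustration of improvement 3, here is a minimal sketch of window-based negation handling. It assumes a fixed window of three scored tokens after each negation word and simply swaps the positive and negative scores inside that window. The helper `apply_negation`, the window size, and the toy scores are all hypothetical choices, not part of the pipeline above.

```python
NEGATIONS = {"not", "no", "never", "n't"}


def apply_negation(tokens, scores, window=3):
    """Swap (pos, neg) for up to `window` tokens that follow a negation word.

    tokens: lowercased tokens in order.
    scores: dict mapping token -> (pos, neg); unknown tokens score (0.0, 0.0).
    Returns a list of (token, pos, neg) for the non-negation tokens.
    """
    adjusted = []
    remaining = 0  # tokens left in the current negation window
    for tok in tokens:
        if tok in NEGATIONS:
            remaining = window
            continue
        pos, neg = scores.get(tok, (0.0, 0.0))
        if remaining > 0:
            pos, neg = neg, pos  # reverse polarity inside the window
            remaining -= 1
        adjusted.append((tok, pos, neg))
    return adjusted


# Hypothetical score for illustration only.
toy_scores = {"good": (0.75, 0.0)}
negated = apply_negation(["not", "good"], toy_scores)
plain = apply_negation(["good"], toy_scores)
print(negated)
print(plain)
```

A fixed window is a crude approximation of negation scope; punctuation-bounded or parse-based scopes would be more precise but need more machinery.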

### Next Steps
You can now add a new section that trains a supervised classifier and compares metrics, or wrap this pipeline into a reusable Python module.

---
End of notebook.