# Aspect Based Sentiment Analysis Datasets for Bangla Text

## Dataset Details

### Dataset Characteristics
- **Total Size**: 3,725 comments across 4 domains
- **Source**: Manually created by undergraduate students
- **Language**: Bangla text (complex/compound sentences)
- **Unique Feature**: Each comment contains **2 aspects** with **2 sentiment polarities**
- **Labeling**: Annotated by 3 annotators with majority voting
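
The majority-voting step can be sketched as below. This is a minimal illustration, not the authors' annotation tooling: the function name `majority_label` is hypothetical, and returning `None` on a three-way disagreement is an assumption, since the source does not say how such cases were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Three annotators, two of whom agree
print(majority_label(["Positive", "Positive", "Negative"]))
# Three-way disagreement on an aspect category: no majority exists
print(majority_label(["Food", "Price", "Service"]))
```

With two sentiment classes and three annotators a majority always exists. A tie can only occur for aspect categories, where more than two labels are possible.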

### Domain-wise Distribution
| Domain | Number of Comments | Average Length (words) | Max Length (words) |
|--------|-------------------|------------------------|-------------------|
| Car_ABSA | 1,149 | 10.26 | 19 |
| Mobile_phone_ABSA | 975 | 10.14 | 20 |
| Movie_ABSA | 800 | 11.78 | 35 |
| Restaurant_ABSA | 801 | 11.22 | 23 |

### Aspect Categories by Domain

**Car_ABSA**: Performance, Exterior, Interior, Accessories, Comfort, Safety, Practicality, Others

**Mobile_phone_ABSA**: Performance, Design, Display, Camera, Battery, Storage, Value_For_Money, Others

**Movie_ABSA**: Story, Performance, Music, Miscellaneous

**Restaurant_ABSA**: Food, Price, Service, Ambiance, Miscellaneous

### Data Format
- **Structure**: Id, Comment, {Aspect category, Sentiment Polarity}
- **Sentiment Classes**: Positive, Negative
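
One way to hold a record of this shape in memory is sketched below. The class and field names (`AspectLabel`, `AnnotatedComment`, `to_rows`) are hypothetical, and the example comment is made up for illustration. The actual column names in the released files may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AspectLabel:
    aspect: str    # e.g. "Food"
    polarity: str  # "Positive" or "Negative"


@dataclass(frozen=True)
class AnnotatedComment:
    id: int
    comment: str
    labels: Tuple[AspectLabel, ...]  # two entries per comment in this dataset


def to_rows(item: AnnotatedComment) -> List[Tuple[int, str, str, str]]:
    """Flatten one comment into one (id, comment, aspect, polarity) row per aspect."""
    return [(item.id, item.comment, lab.aspect, lab.polarity) for lab in item.labels]


# Made-up example for illustration, not taken from the dataset
example = AnnotatedComment(
    id=1,
    comment="খাবার ভালো কিন্তু দাম অনেক বেশি",
    labels=(AspectLabel("Food", "Positive"), AspectLabel("Price", "Negative")),
)
for row in to_rows(example):
    print(row)
```

Flattening to one row per aspect gives a table that a standard single-label classifier can consume directly.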

## Best Performing Model

### Model Architecture
**Support Vector Machine (SVM)**

### Performance Results

#### Aspect Category Classification
| Dataset | Precision (%) | Recall (%) | F1-Score (%) |
|---------|---------------|------------|--------------|
| Car_ABSA | 83.75 | 82.38 | **83.13** |
| Mobile_phone_ABSA | 95.63 | 92.63 | **93.63** |
| Movie_ABSA | 99.00 | 93.00 | **95.60** |
| Restaurant_ABSA | 99.33 | 95.00 | **97.00** |

#### Sentiment Polarity Classification
| Dataset | Precision (%) | Recall (%) | F1-Score (%) |
|---------|---------------|------------|--------------|
| Car_ABSA | 84.50 | 84.00 | **84.00** |
| Mobile_phone_ABSA | 82.50 | 82.00 | **82.50** |
| Movie_ABSA | 94.33 | 78.33 | **83.67** |
| Restaurant_ABSA | 99.33 | 92.33 | **95.67** |
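
The tables above do not state how per-class scores were averaged. The sketch below assumes macro averaging (an assumption, not confirmed by the source) and shows how precision, recall, and F1 are computed per class and then averaged. The function names are hypothetical. It also shows that a macro-averaged F1 generally differs from the harmonic mean of the averaged precision and recall, which may explain rows where the reported F1 does not lie between the two.

```python
from typing import Sequence, Tuple


def per_class_prf(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for a single class treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_prf(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float, float]:
    """Unweighted mean of the per-class precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [per_class_prf(y_true, y_pred, label) for label in labels]
    n = len(labels)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


# Toy labels, not dataset results
y_true = ["pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg"]
p, r, f = macro_prf(y_true, y_pred)
print(f"macro precision={p:.4f} recall={r:.4f} f1={f:.4f}")
print(f"harmonic mean of macro precision and recall={2 * p * r / (p + r):.4f}")
```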

### Comparison with Existing Systems
| Dataset | F1-Score |
|---------|----------|
| Cricket [1] | 0.34 |
| Restaurant [1] | 0.38 |
| BAN-ABSA [2] | 0.69 |
| **Proposed Car_ABSA** | **0.83** |
| **Proposed Mobile_phone_ABSA** | **0.94** |
| **Proposed Movie_ABSA** | **0.96** |
| **Proposed Restaurant_ABSA** | **0.97** |



# Cell 1: Import necessary libraries


In [None]:
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Cell 2: Load and explore the dataset


In [None]:
# Load your dataset
df = pd.read_csv('/kaggle/input/final-dataset/final-dataset.csv')

# Display basic information
print(f"Original dataset shape: {df.shape}")
print(f"\nColumn names: {df.columns.tolist()}")
print(f"\nSentiment distribution (before filtering):")
print(df['Polarity'].value_counts())

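# Drop neutral comments to match the paper's setup (Positive/Negative only).
# .copy() makes the result an independent frame so later column assignments are safe.
df = df[df['Polarity'] != 'neutral'].copy()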

print(f"\nDataset shape after removing neutral: {df.shape}")
print(f"\nSentiment distribution (after filtering):")
print(df['Polarity'].value_counts())
print(f"\nSample data:")
print(df.head())

# Cell 3: Define Bangla text preprocessing function


In [None]:
def preprocess_bangla_text(text):
    """Preprocess Bangla text according to the paper"""
    if pd.isna(text):
        return ""
    
    # Convert to string
    text = str(text)
    
    # Remove colons and semicolons, the punctuation marks dropped in the paper
    text = re.sub(r'[:;]', '', text)
    
    # Other punctuation (, ! ? |) is kept because it can carry meaning
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    
    # Strip leading/trailing spaces
    text = text.strip()
    
    return text

# Bangla stop words (common pronouns, determiners, and conjunctions)
bangla_stop_words = [
    'এই', 'সে', 'তা', 'এটা', 'সেটা', 'যে', 'তারা',
    'আমি', 'তুমি', 'আমরা', 'তোমরা', 'এবং', 'ও',
    'কিন্তু', 'যদি', 'তবে', 'নাহলে', 'অথবা'
]

def remove_stop_words(text, stop_words):
    """Remove stop words from text"""
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)
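
Because the preprocessing keeps commas and similar marks, a stop word followed directly by punctuation (for example `কিন্তু,`) is not an exact match and stays in the text. The sketch below is one possible variant that compares tokens with edge punctuation stripped. The helper name `remove_stop_words_punct_aware` and the character set in `EDGE_PUNCT` are assumptions and not part of the original notebook.

```python
# Hypothetical variant of remove_stop_words, shown for illustration only.
# Characters stripped from token edges before the stop-word lookup (assumed set).
EDGE_PUNCT = ",!?।|"


def remove_stop_words_punct_aware(text, stop_words):
    """Drop tokens whose punctuation-stripped form is a stop word."""
    stop_set = set(stop_words)  # set lookup is O(1) per token
    kept = []
    for token in text.split():
        core = token.strip(EDGE_PUNCT)
        if core not in stop_set:
            kept.append(token)  # keep the original token, punctuation included
    return " ".join(kept)


sample = "দাম বেশি কিন্তু, মান ভালো"
print(remove_stop_words_punct_aware(sample, ["কিন্তু", "এবং"]))
```

A dropped stop word takes its attached punctuation with it. Whether that is acceptable depends on how much weight the downstream features give to punctuation.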

# Cell 4: Apply preprocessing


In [None]:
# Apply preprocessing to Text column
df['processed_text'] = df['Text'].apply(preprocess_bangla_text)
df['processed_text'] = df['processed_text'].apply(lambda x: remove_stop_words(x, bangla_stop_words))

# Remove comments that are empty after preprocessing (copy to avoid chained-assignment issues)
df = df[df['processed_text'].str.len() > 0].copy()

print(f"Dataset shape after preprocessing: {df.shape}")
print(f"\nSample processed text:")
print(df[['Text', 'processed_text']].head())

# Cell 5: Encode sentiment labels and split the data


In [None]:
# Create label encoder for sentiment
sentiment_encoder = LabelEncoder()

# Encode sentiment (positive -> 1, negative -> 0)
df['sentiment_encoded'] = sentiment_encoder.fit_transform(df['Polarity'])

# Display encoding
print("Sentiment encoding:")
for label, encoded in zip(sentiment_encoder.classes_, sentiment_encoder.transform(sentiment_encoder.classes_)):
    print(f"{label} -> {encoded}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['processed_text'].values,
    df['sentiment_encoded'].values,
    test_size=0.2,
    random_state=42,
    stratify=df['sentiment_encoded']
)

print(f"\nTraining set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Training set sentiment distribution:")
print(pd.Series(y_train).value_counts())

# Cell 6: Create TF-IDF features


In [None]:
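# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),  # Using unigrams and bigrams
    min_df=2,
    # The default token_pattern relies on \w, which does not match Bangla
    # vowel signs and would break words apart; tokenize on whitespace instead
    token_pattern=r'\S+'
)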

# Fit and transform training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print(f"TF-IDF features shape: {X_train_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")
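
Tokenization deserves a check for Bangla text. scikit-learn's default `token_pattern` is `(?u)\b\w\w+\b`, and in Python's `re` module `\w` does not match combining vowel signs such as `ু` or `া`. The sketch below uses only the standard library to compare that pattern with a plain whitespace pattern on a short phrase. The whitespace pattern is one possible alternative, not necessarily the setup used in the paper.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's default token_pattern
WHITESPACE_PATTERN = r"\S+"         # assumed alternative: split on whitespace only

text = "খুব ভালো"

default_tokens = re.findall(DEFAULT_PATTERN, text)
whitespace_tokens = re.findall(WHITESPACE_PATTERN, text)

print("default pattern:   ", default_tokens)
print("whitespace pattern:", whitespace_tokens)
```

If the default pattern does not return the two whole words, the vectorizer's vocabulary will be built from fragments (or nothing at all), so printing a few entries of `tfidf_vectorizer.vocabulary_` is a quick sanity check.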

# Cell 7: Train SVM model for sentiment polarity classification


In [None]:
# Initialize SVM model with linear kernel (as per paper)
svm_model = SVC(kernel='linear', random_state=42, probability=True)

# Train the model
print("Training SVM for sentiment classification...")
svm_model.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = svm_model.predict(X_test_tfidf)

# Calculate metrics
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"\nSVM Sentiment Classification Results:")
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1-Score: {f1:.2%}")

# Detailed classification report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, 
                          target_names=sentiment_encoder.classes_))

# Cell 8: Compare with other models (Naive Bayes and Logistic Regression)


In [None]:
# Naive Bayes
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
y_pred_nb = nb_model.predict(X_test_tfidf)

precision_nb = precision_score(y_test, y_pred_nb, average='weighted')
recall_nb = recall_score(y_test, y_pred_nb, average='weighted')
f1_nb = f1_score(y_test, y_pred_nb, average='weighted')

# Logistic Regression
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_tfidf, y_train)
y_pred_lr = lr_model.predict(X_test_tfidf)

precision_lr = precision_score(y_test, y_pred_lr, average='weighted')
recall_lr = recall_score(y_test, y_pred_lr, average='weighted')
f1_lr = f1_score(y_test, y_pred_lr, average='weighted')

# Display comparison
print("\nModel Comparison:")
print(f"\nNaive Bayes:")
print(f"  Precision: {precision_nb:.2%}")
print(f"  Recall: {recall_nb:.2%}")
print(f"  F1-Score: {f1_nb:.2%}")

print(f"\nLogistic Regression:")
print(f"  Precision: {precision_lr:.2%}")
print(f"  Recall: {recall_lr:.2%}")
print(f"  F1-Score: {f1_lr:.2%}")

print(f"\nSVM:")
print(f"  Precision: {precision:.2%}")
print(f"  Recall: {recall:.2%}")
print(f"  F1-Score: {f1:.2%}")

# Cell 9: Confusion matrix and error analysis


In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=sentiment_encoder.classes_,
            yticklabels=sentiment_encoder.classes_)
plt.title('Confusion Matrix - SVM Model')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

# Error analysis - look at misclassified examples
misclassified_idx = np.where(y_test != y_pred)[0]
print(f"\nNumber of misclassified samples: {len(misclassified_idx)}")
print(f"Misclassification rate: {len(misclassified_idx)/len(y_test):.2%}")

# Show some misclassified examples
if len(misclassified_idx) > 0:
    print("\nSome misclassified examples:")
    for i in misclassified_idx[:5]:  # Show first 5
        print(f"\nText: {X_test[i]}")
        print(f"True label: {sentiment_encoder.inverse_transform([y_test[i]])[0]}")
        print(f"Predicted: {sentiment_encoder.inverse_transform([y_pred[i]])[0]}")

# Cell 10: Save the models and vectorizer


In [None]:
import pickle

# Save the best performing model (SVM)
with open('svm_sentiment_model.pkl', 'wb') as f:
    pickle.dump(svm_model, f)

# Save the TF-IDF vectorizer
with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

# Save the label encoder
with open('sentiment_encoder.pkl', 'wb') as f:
    pickle.dump(sentiment_encoder, f)

# Also save the preprocessing parameters
preprocessing_params = {
    'stop_words': bangla_stop_words,
    'model_type': 'SVM',
    'features': 'TF-IDF',
    'ngram_range': (1, 2),
    'max_features': 5000
}

with open('preprocessing_params.pkl', 'wb') as f:
    pickle.dump(preprocessing_params, f)

print("Models and preprocessing components saved successfully!")

# Cell 11: Function to predict sentiment for new text


In [None]:
def predict_sentiment(text, tfidf_vectorizer, svm_model, sentiment_encoder):
    """Predict sentiment for new Bangla text"""
    
    # Preprocess the text
    processed_text = preprocess_bangla_text(text)
    processed_text = remove_stop_words(processed_text, bangla_stop_words)
    
    # Transform using TF-IDF
    text_tfidf = tfidf_vectorizer.transform([processed_text])
    
    # Predict sentiment
    sentiment_pred = svm_model.predict(text_tfidf)[0]
    sentiment_proba = svm_model.predict_proba(text_tfidf)[0]
    
    # Get label
    sentiment_label = sentiment_encoder.inverse_transform([sentiment_pred])[0]
    
    return {
        'text': text,
        'processed_text': processed_text,
        'predicted_sentiment': sentiment_label,
        'confidence': {
            # predict_proba columns follow the order of svm_model.classes_
            str(sentiment_encoder.inverse_transform([cls])[0]): float(prob)
            for cls, prob in zip(svm_model.classes_, sentiment_proba)
        }
    }

# Test the function with some examples
test_texts = [
    "এই পণ্যটি খুবই ভালো",
    "খুব খারাপ সার্ভিস, একদম পছন্দ হয়নি",
    "দাম অনেক বেশি কিন্তু মান ভালো"
]

print("Testing sentiment prediction:\n")
for text in test_texts:
    result = predict_sentiment(text, tfidf_vectorizer, svm_model, sentiment_encoder)
    print(f"Text: {result['text']}")
    print(f"Predicted Sentiment: {result['predicted_sentiment']}")
    print(f"Confidence: {result['confidence']}")
    print("-" * 50)

# Cell 12: Load the saved components and test


In [None]:
# Load saved components
with open('svm_sentiment_model.pkl', 'rb') as f:
    loaded_svm = pickle.load(f)

with open('tfidf_vectorizer.pkl', 'rb') as f:
    loaded_vectorizer = pickle.load(f)

with open('sentiment_encoder.pkl', 'rb') as f:
    loaded_encoder = pickle.load(f)

# Test with loaded model
sample_text = "অসাধারণ পণ্য, খুবই সন্তুষ্ট"
result = predict_sentiment(sample_text, loaded_vectorizer, loaded_svm, loaded_encoder)
print(f"Loaded model test:")
print(f"Text: {result['text']}")
print(f"Sentiment: {result['predicted_sentiment']}")
print(f"Confidence: {result['confidence']}")