# 🇪🇹 Amharic Sentiment Analyzer

Welcome to the **Ethiopian Data Science & AI Community** starter project!

In this notebook, you'll build a sentiment classifier for **Amharic text** — a crucial step toward NLP for local languages in Ethiopia.

We'll use:
- Synthetic Amharic dataset (for learning)
- Pre-trained multilingual BERT (mBERT)
- Simple classification with `transformers`

🎯 Goal: Classify text as `Positive`, `Negative`, or `Neutral`.

In [None]:
# Step 1: Install required libraries
!pip install -q transformers torch pandas scikit-learn matplotlib

In [None]:
# Step 2: Import libraries
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

## Step 3: Create Sample Amharic Dataset

💡 This is **synthetic data** for learning. In real projects, you'd collect real Amharic comments from social media, news, etc. (with consent).

In [None]:
# Sample Amharic sentences with sentiment labels
data = [
    ("በጊዜ ሂደት እንደሚሻሻል ተስፋ አለኝ", "Positive"),
    ("ይቅርታ ያ ነገር በጣም ግልጽ አይደለም", "Negative"),
    ("ስምህን ንገረኝ", "Neutral"),
    ("ወደ አዲስ አበባ መመለስ እወዳለሁ", "Positive"),
    ("ይህ ጥሩ ነገር አይደለም", "Negative"),
    ("ከፍተኛ የትምህርት አቅም አለ", "Positive"),
    ("እሱ ወደ ቤቱ ሄደ", "Neutral"),
    ("ይቅርታ፣ ያ አስደናቂ አይደለም", "Negative"),
    ("እንደገና ደስ ብሎኛል", "Positive"),
    ("ይህ ነገር አልተመቸኝም", "Negative")
]

# Convert to DataFrame
df = pd.DataFrame(data, columns=["text", "label"])
print("Sample Data:")
print(df)

## Step 4: Explore the Data

In [None]:
# Check label distribution
df['label'].value_counts().plot(kind='bar', title="Sentiment Distribution")
plt.ylabel("Count")
plt.xticks(rotation=0)
plt.show()

## Step 5: Load mBERT Tokenizer and Model

In [None]:
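# Use mBERT (multilingual BERT) as a starting point.
# Caveat: Amharic was not among mBERT's pre-training languages, and its WordPiece
# vocabulary has little or no coverage of the Ge'ez script, so many Amharic words
# may be tokenized as [UNK]. A model pre-trained on Amharic text (for example
# XLM-RoBERTa) is likely a stronger base for real projects.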
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Map labels to numbers
label_map = {"Positive": 0, "Negative": 1, "Neutral": 2}
# Derive the reverse map from label_map so the two cannot drift apart
reverse_label_map = {idx: name for name, idx in label_map.items()}
df['label_num'] = df['label'].map(label_map)
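Before training, it is worth checking how much of the Amharic text the tokenizer can actually represent. The sketch below measures the share of unknown tokens in a token list. The helper name `unk_fraction` and the stand-in token list are illustrative assumptions, not part of the original notebook; the stand-in list lets the example run without downloading a model.

```python
def unk_fraction(tokens, unk_token="[UNK]"):
    """Return the share of tokens equal to the unknown-token marker."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t == unk_token) / len(tokens)


# Stand-in token list (made up) so the sketch runs without a model download.
example_tokens = ["[UNK]", "[UNK]", "ok", "[UNK]"]
print(f"Unknown-token share: {unk_fraction(example_tokens):.2f}")

# In the notebook, the tokens would come from the tokenizer loaded above:
#   tokens = tokenizer.tokenize(df["text"].iloc[0])
#   unk_fraction(tokens, tokenizer.unk_token)
```

A high share suggests the model sees mostly placeholders rather than Amharic words, in which case a different base model may be needed.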

## Step 6: Tokenize Text

In [None]:
# Tokenize all texts
encoded = tokenizer(
    df['text'].tolist(),
    padding=True,
    truncation=True,
    max_length=64,
    return_tensors="pt"
)

# Prepare labels
labels = torch.tensor(df['label_num'].values)
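To make `padding`, `truncation`, and `max_length` concrete, here is a pure-Python sketch of what happens to a batch of token ids. It is a simplification under stated assumptions: the ids are made up, `pad_and_truncate` is a hypothetical helper, and the real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut each sequence to max_length, then right-pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Made-up token ids, not real mBERT ids.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 3, 4, 5, 6, 102]]
ids, mask = pad_and_truncate(batch, max_length=5)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor.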

## Step 7: Train-Test Split

In [None]:
# Split into train and test
from torch.utils.data import DataLoader, TensorDataset, random_split

# Create dataset
dataset = TensorDataset(encoded['input_ids'], encoded['attention_mask'], labels)
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
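# Fix the seed so the split is reproducible across runs
train_dataset, test_dataset = random_split(
    dataset, [train_size, test_size], generator=torch.Generator().manual_seed(42)
)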

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=4)
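With only two `Neutral` examples, a purely random split can leave a class out of the test set entirely. One possible alternative is a stratified split, sketched below in plain Python. The function `stratified_split` and its defaults are assumptions for illustration, not part of the original notebook.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Aim for at least one test example per class,
        # but always leave at least one example for training.
        n_test = max(1, round(len(indices) * test_fraction))
        n_test = min(n_test, len(indices) - 1)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Labels in the same order as the sample dataset above.
label_names = ["Positive", "Negative", "Neutral", "Positive", "Negative",
               "Positive", "Neutral", "Negative", "Positive", "Negative"]
train_idx, test_idx = stratified_split(label_names)
print("train:", train_idx)
print("test: ", test_idx)
```

In the notebook, the resulting index lists could be passed to `torch.utils.data.Subset(dataset, train_idx)` and `Subset(dataset, test_idx)` in place of `random_split`.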

## Step 8: Train the Model (Simple Training Loop)

In [None]:
# Simple training (1 epoch for demo)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for epoch in range(1):
    total_loss = 0
    for batch in train_loader:
        input_ids, attention_mask, batch_labels = [b.to(device) for b in batch]

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=batch_labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Average Loss: {total_loss/len(train_loader):.4f}")
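To interpret the printed loss, it helps to know what a classifier that has learned nothing would score. For single-label classification with integer labels, the loss the model returns is softmax cross-entropy averaged over the batch. The sketch below computes it for one example in plain Python; the function name `cross_entropy` is chosen here for illustration.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example: -log softmax(logits)[target]."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


# Equal logits mean all three classes are treated as equally likely.
uniform = cross_entropy([0.0, 0.0, 0.0], target=0)
print(f"Loss when all three classes are equally likely: {uniform:.4f}")
```

This prints the natural log of 3, roughly 1.0986. An average loss near that value after training suggests the model is still close to guessing.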

## Step 9: Evaluate the Model

In [None]:
model.eval()
predictions = []
true_labels = []

with torch.no_grad():
    for batch in test_loader:
        input_ids, attention_mask, batch_labels = [b.to(device) for b in batch]

        outputs = model(input_ids, attention_mask=attention_mask)
        preds = torch.argmax(outputs.logits, dim=1).cpu().tolist()

        # Use plain lists and avoid overwriting the `labels` tensor from Step 6
        predictions.extend(preds)
        true_labels.extend(batch_labels.cpu().tolist())

acc = accuracy_score(true_labels, predictions)
print(f"Test Accuracy: {acc:.4f}")
print("\nClassification Report:")
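# Pass `labels` explicitly: the tiny test set may not contain every class, and
# classification_report raises an error if target_names and the observed classes differ in number.
print(classification_report(
    true_labels,
    predictions,
    labels=list(reverse_label_map.keys()),
    target_names=list(reverse_label_map.values()),
    zero_division=0,
))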
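A test set of two examples means each prediction moves accuracy by 0.5, so the number is best read next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent training label. The helper name and the label ids are made up for illustration and are not tied to the actual split above.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Made-up label ids using the notebook's mapping (0 = Positive, 1 = Negative, 2 = Neutral).
example_train = [0, 0, 0, 0, 1, 1, 1, 2]
example_test = [0, 1]
majority, baseline = majority_baseline_accuracy(example_train, example_test)
print(f"Majority label: {majority}, baseline accuracy: {baseline:.2f}")
```

A fine-tuned model is only interesting if it beats this baseline on a test set large enough for the difference to be meaningful.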

## Step 10: Try Your Own Amharic Sentence!

In [None]:
def predict_sentiment(text):
    """Return the predicted sentiment label for a single Amharic string."""
    model.eval()
    encoded = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=64
    ).to(device)

    with torch.no_grad():
        outputs = model(**encoded)
        pred = torch.argmax(outputs.logits, dim=1).item()

    return reverse_label_map[pred]

# Try it!
sample_text = "ይሄ በጣም ጥሩ ነው!"
print(f"Text: {sample_text}")
print(f"Predicted Sentiment: {predict_sentiment(sample_text)}")
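`predict_sentiment` returns only the top label. Reporting a probability alongside it makes it easier to spot predictions that are little better than a guess. The sketch below applies a softmax to a set of made-up logits in plain Python; `id_to_label` mirrors the notebook's `reverse_label_map`, and the logit values are assumptions for illustration.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


id_to_label = {0: "Positive", 1: "Negative", 2: "Neutral"}
example_logits = [1.2, 0.3, -0.5]  # made-up values, not real model output

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{id_to_label[best]} (probability {probs[best]:.2f})")
```

In the notebook, the same probabilities could be obtained from the model output with `torch.softmax(outputs.logits, dim=1)`.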

## ✅ Next Steps & Challenges

- **Add more Amharic data** (collect from social media, forums, etc., with consent)
- **Translate labels** into Amharic for broader access
- **Fine-tune on a larger dataset** for better accuracy
- **Build a web app** using Streamlit or Gradio
- **Contribute your dataset** to the community repo!

📌 Join the [Ethiopian Data Science & AI Community](https://www.linkedin.com/groups/...) to share your results!