## **🧹 NLPTA: Amharic Text Cleaning, Tokenization & Stopword Removal Demo** ##
 

 A step-by-step walkthrough of preprocessing real Amharic text.
 
 **This notebook demonstrates:**
 - Loading sample Amharic text from Wikipedia
 - Cleaning punctuation and whitespace
 - Tokenizing into words
 - Removing stopwords
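
The cleaning and tokenizing steps listed above can be sketched in plain Python. This is only an illustration, not how `nlpta` implements `clean_text` or `tokenize`. The helper names `sketch_clean` and `sketch_tokenize` and the punctuation set are assumptions made for this example.

```python
import re

# Assumed punctuation set for this sketch: the Ethiopic punctuation marks
# (U+1360 to U+1368, which include ። and ፣) plus a few ASCII marks.
_PUNCT = re.compile(r"[\u1360-\u1368!?()]")
_SPACES = re.compile(r"\s+")


def sketch_clean(text: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    without_punct = _PUNCT.sub(" ", text)
    return _SPACES.sub(" ", without_punct).strip()


def sketch_tokenize(text: str) -> list[str]:
    """Split already-cleaned text on whitespace."""
    return text.split()


sample = "መጡ!   ዛሬ ቅዳሜ፣ ነዉ።"
cleaned = sketch_clean(sample)
print(cleaned)
print(sketch_tokenize(cleaned))
```

Punctuation is replaced with a space rather than deleted so that two words separated only by a mark such as `፣` do not get merged into one token.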



In [5]:
import os

# Switch to the repository root so the local `nlpta` package can be imported.
os.chdir("/workspaces/nlpta")
print("Current working directory:", os.getcwd())

from nlpta import load_sample_corpus, clean_text, tokenize, remove_stopwords, load_stopwords

Current working directory: /workspaces/nlpta


In [6]:
print("📥 Loading sample Amharic corpus...")
corpus = load_sample_corpus()
print(f"✅ Loaded {len(corpus)} paragraphs.\n")

📥 Loading sample Amharic corpus...
✅ Loaded 4 paragraphs.



In [7]:
print("=" * 80)
print("🧹 CLEANING DEMO: First 3 Paragraphs")
print("=" * 80)

for i, paragraph in enumerate(corpus[:3]):
    print(f"\n📄 Paragraph {i+1} (Original):")
    print("-" * 40)
    print(paragraph[:200] + "..." if len(paragraph) > 200 else paragraph)

    print(f"\n✨ Paragraph {i+1} (Cleaned):")
    print("-" * 40)
    cleaned = clean_text(paragraph)
    print(cleaned[:200] + "..." if len(cleaned) > 200 else cleaned)

    print(f"\n🔍 Paragraph {i+1} (Tokens):")
    print("-" * 40)
    tokens = tokenize(cleaned)
    print(tokens[:10])  # Show first 10 tokens

    print(f"\n🚫 Paragraph {i+1} (After Stopword Removal):")
    print("-" * 40)
    no_stops = remove_stopwords(tokens)
    print(no_stops[:10])  # Show first 10 tokens

    print("\n" + "=" * 80)

🧹 CLEANING DEMO: First 3 Paragraphs

📄 Paragraph 1 (Original):
----------------------------------------
ወደ ውክፔዲያ እንኳን ደህና መጡ! ማንኛውም ሰው ሊያዘጋጀው የሚችለው ነጻ መዝገበ እውቀት   ዛሬ ቅዳሜ፣ መስከረም 2 ቀን 2016 ዓ.ም. (13 ሴፕቴምበር, 2025 እ.ኤ.አ.) ነዉ።

✨ Paragraph 1 (Cleaned):
----------------------------------------
ወደ ውክፔዲያ እንኳን ደህና መጡ ማንኛውም ሰው ሊያዘጋጀው የሚችለው ነጻ መዝገበ እውቀት ዛሬ ቅዳሜ መስከረም 2 ቀን 2016 ዓ.ም. 13 ሴፕቴምበር, 2025 እ.ኤ.አ. ነዉ

🔍 Paragraph 1 (Tokens):
----------------------------------------
['ወደ', 'ውክፔዲያ', 'እንኳን', 'ደህና', 'መጡ', 'ማንኛውም', 'ሰው', 'ሊያዘጋጀው', 'የሚችለው', 'ነጻ']

🚫 Paragraph 1 (After Stopword Removal):
----------------------------------------
['ወደ', 'ክፔዲያ', 'እንኳን', 'ደህና', 'መጡ', 'ማንኛውም', 'ሰው', 'ሊያዘጋጀው', 'ሚችለው', 'ነጻ']


📄 Paragraph 2 (Original):
----------------------------------------
ከዚያን ጊዜ ጀምሮ፣ ኔንቲዶ በቪዲዮ ጌም ኢንደስትሪ ውስጥ በጣም የተሳካላቸው ኮንሶሎችን አዘጋጅቷል፣ ለምሳሌ ጌም ቦይ፣ ሱፐር ኔንቲዶ መዝናኛ ሲስተም፣ ኔንቲዶ ዲኤስ፣ ዊኢ እና ስዊች። ማሪዮ፣ አህያ ኮንግ፣ የዜልዳ አፈ ታሪክ፣ ሜትሮይድ፣ የእሳት አርማ፣ ኪርቢ፣ ስታር ፎክስ፣ ፖክሞን፣ ሱፐር ስማሽ ብሮስ፣ የእንስሳት...

✨ Paragraph 2 (Cleaned):
---
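
In the Paragraph 1 output above, `ውክፔዲያ` becomes `ክፔዲያ` and `የሚችለው` becomes `ሚችለው`, while `ወደ` is kept. This suggests that `remove_stopwords` is stripping leading characters rather than dropping whole tokens. For comparison, a whole-token filter might look like the sketch below. The function name `filter_stopwords` and the tiny stopword set are hypothetical, chosen only for illustration. They are not what `load_stopwords` returns.

```python
def filter_stopwords(tokens, stopwords):
    """Keep only tokens that are not exact matches in the stopword set."""
    stopword_set = set(stopwords)
    return [tok for tok in tokens if tok not in stopword_set]


# Hypothetical stopword set, for illustration only.
example_stopwords = {"ወደ", "እና"}
tokens = ["ወደ", "ውክፔዲያ", "እንኳን", "ደህና", "መጡ"]
print(filter_stopwords(tokens, example_stopwords))
```

Because the comparison is on whole tokens, a word that merely starts with the same character as a stopword is left unchanged.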

### **🧠 Load Pretrained Word Embeddings**

In [2]:
from nlpta import load_embeddings

print("\n🧠 LOADING EMBEDDINGS")
print("=" * 80)
model = load_embeddings()
print(f"✅ Embeddings loaded with {len(model.wv)} words in vocabulary.")

# Show the first few vector dimensions for each test word found in the vocabulary
test_words = ["ኢትዮጵያ", "አበባ", "ሕዝብ", "መንግሥት"]
for word in test_words:
    if word in model.wv.key_to_index:
        vec = model.wv[word]
        print(f"  {word}: {vec[:5]}...")  # Show first 5 dimensions

if "ኢትዮጵያ" in model.wv.key_to_index and "አበባ" in model.wv.key_to_index:
    similarity = model.wv.similarity("ኢትዮጵያ", "አበባ")
    print(f"\n📈 Similarity between 'ኢትዮጵያ' and 'አበባ': {similarity:.3f}")

ImportError: cannot import name 'load_embeddings' from 'nlpta' (/workspaces/nlpta/nlpta/__init__.py)
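
The names `model.wv` and `key_to_index` suggest a gensim-style model, in which case `similarity` would return the cosine similarity of the two word vectors. Since the cell above did not run, here is a small standalone sketch of that computation. The function `cosine_similarity` and the toy vectors are made up for illustration and do not come from the embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, not real embeddings.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a
print(f"{cosine_similarity(vec_a, vec_b):.3f}")
print(f"{cosine_similarity(vec_a, vec_c):.3f}")
```

Values near 1 mean the two vectors point in almost the same direction, and values near 0 mean they are close to orthogonal.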