## **🧹 NLPTA: Amharic Text Cleaning, Tokenization & Stopword Removal Demo** ##
 

 A step-by-step walkthrough of preprocessing real Amharic text.
 
 **This notebook demonstrates:**
 - Loading sample Amharic text from Wikipedia
 - Cleaning punctuation and whitespace
 - Tokenizing into words
 - Removing stopwords
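
The steps listed above can be sketched in plain Python. This is a minimal illustration, not `nlpta`'s implementation: the punctuation set, the `sketch_*` function names, and the two-word stopword set are all assumptions chosen for the example.

```python
import re

# Ethiopic punctuation (U+1361 to U+1368) plus a few ASCII marks.
# This character set is an assumption for illustration only.
PUNCT_RE = re.compile(r"[\u1361-\u1368!?()\[\]{}]")
WHITESPACE_RE = re.compile(r"\s+")

# Tiny hypothetical stopword set; a real list would be much longer.
SAMPLE_STOPWORDS = {"ነው", "እና"}


def sketch_clean(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    text = PUNCT_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def sketch_tokenize(text):
    """Split cleaned text on whitespace."""
    return text.split()


def sketch_remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


raw = "ይህ   መጽሐፍ ነው።"
cleaned = sketch_clean(raw)
tokens = sketch_tokenize(cleaned)
kept = sketch_remove_stopwords(tokens)

print(cleaned)
print(tokens)
print(kept)
```

Replacing punctuation with a space, rather than deleting it, keeps words apart when the Ethiopic wordspace `፡` is the only separator between them.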



In [5]:
import os
os.chdir("/workspaces/nlpta")
print("Current working directory:", os.getcwd())

from nlpta import load_sample_corpus, clean_text, tokenize, remove_stopwords, load_stopwords

Current working directory: /workspaces/nlpta


In [6]:
print("📥 Loading sample Amharic corpus...")
corpus = load_sample_corpus()
print(f"✅ Loaded {len(corpus)} paragraphs.\n")

📥 Loading sample Amharic corpus...
✅ Loaded 4 paragraphs.



In [7]:
print("=" * 80)
print("🧹 CLEANING DEMO: First 3 Paragraphs")
print("=" * 80)

for i, paragraph in enumerate(corpus[:3]):
    print(f"\n📄 Paragraph {i+1} (Original):")
    print("-" * 40)
    print(paragraph[:200] + "..." if len(paragraph) > 200 else paragraph)

    print(f"\n✨ Paragraph {i+1} (Cleaned):")
    print("-" * 40)
    cleaned = clean_text(paragraph)
    print(cleaned[:200] + "..." if len(cleaned) > 200 else cleaned)

    print(f"\n🔍 Paragraph {i+1} (Tokens):")
    print("-" * 40)
    tokens = tokenize(cleaned)
    print(tokens[:10])  # Show first 10 tokens

    print(f"\n🚫 Paragraph {i+1} (After Stopword Removal):")
    print("-" * 40)
    no_stops = remove_stopwords(tokens)
    print(no_stops[:10])  # Show first 10 tokens

    print("\n" + "=" * 80)

🧹 CLEANING DEMO: First 3 Paragraphs

📄 Paragraph 1 (Original):
----------------------------------------
ወደ ውክፔዲያ እንኳን ደህና መጡ! ማንኛውም ሰው ሊያዘጋጀው የሚችለው ነጻ መዝገበ እውቀት   ዛሬ ቅዳሜ፣ መስከረም 2 ቀን 2016 ዓ.ም. (13 ሴፕቴምበር, 2025 እ.ኤ.አ.) ነዉ።

✨ Paragraph 1 (Cleaned):
----------------------------------------
ወደ ውክፔዲያ እንኳን ደህና መጡ ማንኛውም ሰው ሊያዘጋጀው የሚችለው ነጻ መዝገበ እውቀት ዛሬ ቅዳሜ መስከረም 2 ቀን 2016 ዓ.ም. 13 ሴፕቴምበር, 2025 እ.ኤ.አ. ነዉ

🔍 Paragraph 1 (Tokens):
----------------------------------------
['ወደ', 'ውክፔዲያ', 'እንኳን', 'ደህና', 'መጡ', 'ማንኛውም', 'ሰው', 'ሊያዘጋጀው', 'የሚችለው', 'ነጻ']

🚫 Paragraph 1 (After Stopword Removal):
----------------------------------------
['ወደ', 

### **🧠 Load Pretrained Word Embeddings**

In [2]:
from nlpta import load_embeddings

print("\n🧠 LOADING EMBEDDINGS")
print("=" * 80)
model = load_embeddings()
print(f"✅ Embeddings loaded with {len(model.wv)} words in vocabulary.")

# Show the start of each test word's vector, skipping out-of-vocabulary words
test_words = ["ኢትዮጵያ", "አበባ", "ሕዝብ", "መንግሥት"]
for word in test_words:
    if word in model.wv.key_to_index:
        vec = model.wv[word]
        print(f"  {word}: {vec[:5]}...")  # Show first 5 dimensions

# Show similarity between two of the test words, if both are in the vocabulary
if "ኢትዮጵያ" in model.wv.key_to_index and "አበባ" in model.wv.key_to_index:
    similarity = model.wv.similarity("ኢትዮጵያ", "አበባ")
    print(f"\n📈 Similarity between 'ኢትዮጵያ' and 'አበባ': {similarity:.3f}")

ImportError: cannot import name 'load_embeddings' from 'nlpta' (/workspaces/nlpta/nlpta/__init__.py)
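
The `ImportError` above shows that the installed `nlpta` package does not export `load_embeddings`, so the embeddings cell cannot run as written. One way to keep the notebook running is to guard the import. The sketch below is a generic pattern under that assumption; `optional_attr` is a hypothetical helper, not part of `nlpta`.

```python
import importlib


def optional_attr(module_name, attr_name):
    """Return module_name.attr_name, or None if the module or attribute is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr_name, None)


# `nlpta` and `load_embeddings` are the names used in the cell above.
load_embeddings = optional_attr("nlpta", "load_embeddings")

if load_embeddings is None:
    print("load_embeddings is not available in this nlpta install; skipping the embeddings demo.")
else:
    model = load_embeddings()
```

With this guard, the cell prints a short notice instead of raising when the function is absent.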