<a href="https://colab.research.google.com/github/ChidiTonio/ChidiTonio/blob/main/low_resource_language_processing_toolkit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Low-Resource Language Processing Toolkit

A Python toolkit for processing morphologically complex languages in low-resource NLP scenarios. It provides an extensible preprocessing pipeline, data augmentation techniques, and evaluation utilities designed for languages with limited resources.

## Features

- **Morphological Preprocessing Pipeline**: Tools for processing languages with complex morphology
- **Data Augmentation Techniques**: Methods for augmenting limited data in low-resource settings
- **Evaluation Framework**: Metrics and tools for evaluating performance in cross-lingual settings
- **Language Support**: Settings for morphologically rich languages, including Swahili, Hausa, Yoruba, Zulu, Turkish, Finnish, and Hungarian

## Project Structure

low-resource-nlp/
├── README.md
├── requirements.txt
├── lowresource_nlp/
│   ├── __init__.py
│   ├── preprocessing.py
│   ├── augmentation.py
│   ├── evaluation.py
│   └── utils.py
├── examples/
│   ├── preprocessing_demo.ipynb
│   └── augmentation_demo.ipynb
└── data/
    └── sample/
        ├── swahili_small.txt
        └── hausa_small.txt


### Requirements

- nltk>=3.6.2
- scikit-learn>=0.24.2
- pandas>=1.3.0
- numpy>=1.20.0
- transformers>=4.8.0
- torch>=1.9.0
- sentencepiece>=0.1.96
- sacremoses>=0.0.45
- textaugment>=1.3.4
- morfessor>=2.0.6
- polyglot>=16.7.4
- fasttext>=0.9.2
- pyicu>=2.8


### Quick Start
Here's a simple example to get started with our toolkit:
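
The core flow (normalize, tokenize, collect the results) can be sketched without any third-party dependency. The version below is illustrative only: `simple_preprocess` and its regex tokenizer are hypothetical stand-ins for `MorphologicalPreprocessor.preprocess` and NLTK's `word_tokenize`, and the regex may split punctuation differently from NLTK.

```python
import re
import unicodedata
from typing import Dict, List


def simple_preprocess(text: str) -> Dict[str, object]:
    """Dependency-free stand-in for the toolkit's preprocess() (hypothetical helper)."""
    # Same normalization steps as normalize_text: NFKC, collapse whitespace, strip.
    normalized = unicodedata.normalize("NFKC", text)
    normalized = re.sub(r"\s+", " ", normalized).strip()

    # Assumption: a regex tokenizer instead of NLTK. \w+ keeps word characters
    # together; any other non-space character becomes its own token.
    tokens: List[str] = re.findall(r"\w+|[^\w\s]", normalized)

    return {"original": text, "normalized": normalized, "tokens": tokens}


result = simple_preprocess("  Nairobi   ndio mji mkuu\nwa Kenya. ")
print(result["normalized"])  # Nairobi ndio mji mkuu wa Kenya.
print(result["tokens"])
```

The returned dictionary uses the same `original`, `normalized`, and `tokens` keys as the full pipeline, so code written against this sketch should carry over with few changes.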

### lowresource_nlp/preprocessing.py

In [None]:
!pip install morfessor
import re
import unicodedata
import nltk
from nltk.tokenize import word_tokenize
import morfessor  # needed by train_morfessor() and load_morfessor()
import numpy as np
from typing import List, Dict, Tuple, Optional, Union
import warnings

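# Newer NLTK releases load the 'punkt_tab' resource instead of 'punkt',
# so fetch whichever of the two is missing.
for resource in ('punkt', 'punkt_tab'):
    try:
        nltk.data.find(f'tokenizers/{resource}')
    except LookupError:
        nltk.download(resource)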

class MorphologicalPreprocessor:
    """Preprocessing pipeline for morphologically complex languages."""

    def __init__(self, language: str = 'en', use_morfessor: bool = False):
        """
        Initialize the preprocessor.

        Args:
            language: ISO code for language (default: 'en')
            use_morfessor: Whether to use Morfessor for morphological analysis
        """
        self.language = language
        self.use_morfessor = use_morfessor
        self.morfessor_model = None

        # Language-specific settings
        self.language_settings = {
            'sw': {'complex_morphology': True},  # Swahili
            'ha': {'complex_morphology': True},  # Hausa
            'yo': {'complex_morphology': True},  # Yoruba
            'zu': {'complex_morphology': True},  # Zulu
            'tr': {'complex_morphology': True},  # Turkish
            'fi': {'complex_morphology': True},  # Finnish
            'hu': {'complex_morphology': True},  # Hungarian
            'en': {'complex_morphology': False}, # English
        }

        # Check if the language is supported
        if language not in self.language_settings:
            warnings.warn(f"Language '{language}' not explicitly supported. Using default settings.")
            self.settings = {'complex_morphology': False}
        else:
            self.settings = self.language_settings[language]

        # The Morfessor model stays None until train_morfessor() or load_morfessor()
        # provides one; until then segment() returns the word unsegmented.

    def normalize_text(self, text: str) -> str:
        """
        Normalize text by removing extra whitespace, normalizing unicode, etc.

        Args:
            text: Input text
        Returns:
            Normalized text
        """
        # Normalize unicode
        text = unicodedata.normalize('NFKC', text)

        # Replace multiple whitespace with a single space
        text = re.sub(r'\s+', ' ', text)

        # Remove leading and trailing whitespace
        text = text.strip()

        return text

    def tokenize(self, text: str) -> List[str]:
        """
        Tokenize text.

        Args:
            text: Input text
        Returns:
            List of tokens
        """
        # Normalize first
        text = self.normalize_text(text)

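        # NLTK's Punkt models are keyed by full language name, not ISO code;
        # languages without a model fall back to the English one.
        nltk_languages = {'en': 'english', 'fr': 'french', 'de': 'german', 'it': 'italian',
                          'es': 'spanish', 'pt': 'portuguese', 'nl': 'dutch'}
        tokens = word_tokenize(text, language=nltk_languages.get(self.language, 'english'))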

        return tokens

    def train_morfessor(self, texts: List[str], save_path: Optional[str] = None):
        """
        Train a Morfessor model for morphological segmentation.

        Args:
            texts: List of texts for training
            save_path: Path to save the trained model (optional)
        """
        if not self.use_morfessor:
            warnings.warn("Morfessor is disabled. Enable it by setting use_morfessor=True")
            return

        # Prepare data for Morfessor training
        all_tokens = []
        for text in texts:
            all_tokens.extend(self.tokenize(text))

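        # Create a frequency list in a single pass. Morfessor's load_data
        # expects (count, word) pairs, in that order.
        counts: Dict[str, int] = {}
        for word in all_tokens:
            counts[word] = counts.get(word, 0) + 1
        freq_list = [(count, word) for word, count in counts.items()]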

        # Train model
        self.morfessor_model = morfessor.BaselineModel()
        self.morfessor_model.load_data(freq_list)
        self.morfessor_model.train_batch()

        # Save model if path is provided. MorfessorIO takes a file path
        # and opens the file itself, so no open() call is needed here.
        if save_path:
            morfessor.MorfessorIO().write_binary_model_file(save_path, self.morfessor_model)

    def load_morfessor(self, model_path: str):
        """
        Load a pre-trained Morfessor model.

        Args:
            model_path: Path to the model file
        """
        if not self.use_morfessor:
            warnings.warn("Morfessor is disabled. Enable it by setting use_morfessor=True")
            return

        # MorfessorIO takes a file path and opens the file itself
        self.morfessor_model = morfessor.MorfessorIO().read_binary_model_file(model_path)

    def segment(self, word: str) -> List[str]:
        """
        Segment a word into morphemes.

        Args:
            word: Input word
        Returns:
            List of morphemes
        """
        if not self.use_morfessor or self.morfessor_model is None:
            return [word]

        return self.morfessor_model.viterbi_segment(word)[0]

    def preprocess(self, text: str, segment_morphemes: bool = False) -> Dict:
        """
        Full preprocessing pipeline.

        Args:
            text: Input text
            segment_morphemes: Whether to segment tokens into morphemes
        Returns:
            Dictionary with processed text in different forms
        """
        normalized = self.normalize_text(text)
        tokens = self.tokenize(normalized)

        result = {
            'original': text,
            'normalized': normalized,
            'tokens': tokens,
        }

        if segment_morphemes and self.use_morfessor and self.morfessor_model is not None:
            morphemes = []
            for token in tokens:
                morphemes.extend(self.segment(token))
            result['morphemes'] = morphemes

        return result
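
The frequency list that `train_morfessor` passes to Morfessor can also be built with `collections.Counter`, which counts all tokens in one pass. This is a minimal sketch: `build_frequency_list` is a hypothetical helper, and the `(count, word)` ordering is assumed to match what `BaselineModel.load_data` expects, so it is worth checking against the Morfessor version you install.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def build_frequency_list(tokens: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (count, word) pairs, most frequent first (hypothetical helper)."""
    counts = Counter(tokens)  # one pass over the tokens
    return [(count, word) for word, count in counts.most_common()]


tokens = "lugha ya Kiswahili ni mojawapo ya lugha za Kiafrika".split()
freq_list = build_frequency_list(tokens)
print(freq_list[:2])
```

On a corpus of realistic size this avoids rescanning the whole token list once per unique word.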




### lowresource_nlp/augmentation.py

In [None]:
import random
import numpy as np
from typing import List, Dict, Tuple, Optional, Union
import nltk
from nltk.corpus import wordnet
import warnings
import re

try:
    nltk.data.find('corpora/wordnet')
    nltk.data.find('taggers/universal_tagset')
except LookupError:
    nltk.download('wordnet')
    nltk.download('universal_tagset')

try:
    from textaugment import word2vec
    W2V_AVAILABLE = True
except ImportError:
    W2V_AVAILABLE = False
    warnings.warn("textaugment not available. Some augmentation methods will be disabled.")

class DataAugmenter:
    """Data augmentation techniques for low-resource settings."""

    def __init__(self, language: str = 'en'):
        """
        Initialize the augmenter.

        Args:
            language: ISO code for language (default: 'en')
        """
        self.language = language
        self.w2v_augmenter = None

        # Initialize resources conditionally to avoid unnecessary downloads
        self.resources_initialized = False

    def _ensure_resources(self):
        """Initialize resources when needed."""
        if not self.resources_initialized:
            if W2V_AVAILABLE and self.language == 'en':
                try:
                    self.w2v_augmenter = word2vec.Word2vec()
                except Exception as e:
                    warnings.warn(f"Error initializing Word2Vec augmenter: {e}")

            self.resources_initialized = True

    def word_dropout(self, tokens: List[str], dropout_prob: float = 0.1) -> List[str]:
        """
        Randomly drop words to create augmented samples.

        Args:
            tokens: List of tokens
            dropout_prob: Probability of dropping each token
        Returns:
            List of tokens with some dropped
        """
        if dropout_prob <= 0 or dropout_prob >= 1:
            return tokens

        return [token for token in tokens if random.random() > dropout_prob]

    def random_swap(self, tokens: List[str], n_swaps: int = 1) -> List[str]:
        """
        Randomly swap n pairs of words.

        Args:
            tokens: List of tokens
            n_swaps: Number of swaps to perform
        Returns:
            List of tokens with swapped positions
        """
        if len(tokens) < 2 or n_swaps <= 0:
            return tokens

        result = tokens.copy()
        for _ in range(min(n_swaps, len(tokens) // 2)):
            idx1, idx2 = random.sample(range(len(result)), 2)
            result[idx1], result[idx2] = result[idx2], result[idx1]

        return result

    def synonym_replacement(self, tokens: List[str], n_replacements: int = 1) -> List[str]:
        """
        Replace words with their synonyms.
        Only works well for English.

        Args:
            tokens: List of tokens
            n_replacements: Number of replacements to make
        Returns:
            List of tokens with some replaced by synonyms
        """
        self._ensure_resources()

        if self.language != 'en':
            warnings.warn("Synonym replacement is only well-supported for English")
            return tokens

        if n_replacements <= 0:
            return tokens

        result = tokens.copy()
        replacement_indices = random.sample(range(len(tokens)), min(n_replacements, len(tokens)))

        for idx in replacement_indices:
            synonyms = []
            for syn in wordnet.synsets(tokens[idx]):
                for lemma in syn.lemmas():
                    synonyms.append(lemma.name())

            if synonyms:
                # Make sure we use a different word
                filtered_synonyms = [s for s in synonyms if s != tokens[idx] and '_' not in s]
                if filtered_synonyms:
                    result[idx] = random.choice(filtered_synonyms)

        return result

    def word_embedding_replacement(self, text: str, n_replacements: int = 1) -> str:
        """
        Replace words with similar words based on word embeddings.

        Args:
            text: Input text
            n_replacements: Number of replacements to make
        Returns:
            Augmented text
        """
        self._ensure_resources()

        if not W2V_AVAILABLE or self.w2v_augmenter is None:
            warnings.warn("Word2Vec augmentation unavailable. Install textaugment package.")
            return text

        try:
            return self.w2v_augmenter.augment(text, n_replacements)
        except Exception as e:
            warnings.warn(f"Error in word embedding augmentation: {e}")
            return text

    def character_noise(self, tokens: List[str], noise_prob: float = 0.05) -> List[str]:
        """
        Add character-level noise (substitutions/deletions).

        Args:
            tokens: List of tokens
            noise_prob: Probability of altering a character
        Returns:
            Tokens with character-level noise
        """
        if noise_prob <= 0:
            return tokens

        result = []
        alphabet = 'abcdefghijklmnopqrstuvwxyz'

        for token in tokens:
            if len(token) <= 1 or not token.isalnum():
                result.append(token)
                continue

            chars = list(token)
            for i in range(len(chars)):
                if random.random() < noise_prob:
                    operation = random.choice(['replace', 'delete', 'insert'])

                    if operation == 'replace' and chars[i].isalpha():
                        chars[i] = random.choice(alphabet)
                    elif operation == 'delete':
                        chars[i] = ''
                    elif operation == 'insert' and chars[i].isalpha():
                        chars[i] = random.choice(alphabet) + chars[i]

            result.append(''.join(chars))

        return result

    def augment(self, text: Union[str, List[str]], methods: Optional[List[str]] = None,
                tokenized: bool = False) -> List[Union[str, List[str]]]:
        """
        Apply multiple augmentation methods to generate several augmented versions.

        Args:
            text: Input text or list of tokens
            methods: List of augmentation methods to apply. Options:
                     ['word_dropout', 'random_swap', 'synonym_replacement',
                      'word_embedding', 'character_noise', 'all']
            tokenized: Whether the input is already tokenized
        Returns:
            List of augmented texts
        """
        if methods is None:
            methods = ['all']

        if 'all' in methods:
            methods = ['word_dropout', 'random_swap', 'synonym_replacement',
                      'character_noise']
            if W2V_AVAILABLE:
                methods.append('word_embedding')

        tokens = text if tokenized else nltk.word_tokenize(text)
        results = []

        if 'word_dropout' in methods:
            dropped = self.word_dropout(tokens)
            results.append(' '.join(dropped) if not tokenized else dropped)

        if 'random_swap' in methods:
            swapped = self.random_swap(tokens)
            results.append(' '.join(swapped) if not tokenized else swapped)

        if 'synonym_replacement' in methods:
            replaced = self.synonym_replacement(tokens)
            results.append(' '.join(replaced) if not tokenized else replaced)

        if 'character_noise' in methods:
            noisy = self.character_noise(tokens)
            results.append(' '.join(noisy) if not tokenized else noisy)

        if 'word_embedding' in methods and not tokenized:
            embedding_text = self.word_embedding_replacement(text)
            if embedding_text != text:
                results.append(embedding_text)

        return results
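
Because the augmentation methods draw from Python's global `random` module, two runs generally produce different samples. One way to make augmented data reproducible is to pass in a seeded `random.Random` instance. The sketch below re-implements word dropout and random swap as standalone functions under that assumption; the `rng` parameter is not part of `DataAugmenter`.

```python
import random
from typing import List, Optional


def word_dropout(tokens: List[str], dropout_prob: float = 0.1,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Drop each token independently with probability dropout_prob."""
    if rng is None:
        rng = random.Random()
    if not 0 < dropout_prob < 1:
        return list(tokens)
    return [tok for tok in tokens if rng.random() > dropout_prob]


def random_swap(tokens: List[str], n_swaps: int = 1,
                rng: Optional[random.Random] = None) -> List[str]:
    """Swap randomly chosen pairs of positions, at most len(tokens) // 2 times."""
    if rng is None:
        rng = random.Random()
    result = list(tokens)
    for _ in range(min(n_swaps, len(result) // 2)):
        i, j = rng.sample(range(len(result)), 2)
        result[i], result[j] = result[j], result[i]
    return result


tokens = "Tulienda sokoni kununua matunda na mboga".split()
first = word_dropout(tokens, 0.3, random.Random(13))
second = word_dropout(tokens, 0.3, random.Random(13))
print(first == second)  # True: same seed, same augmented sample
print(random_swap(tokens, 2, random.Random(13)))
```

Keeping the generator local also means augmentation does not disturb any other code that relies on the global random state.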

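`character_noise` samples replacement characters from the ASCII letters a-z. For orthographies that use other letters (Hausa ɓ, ɗ, and ƙ, for example), one option is to sample from the letters observed in the data instead. The sketch below illustrates that idea and is not part of the toolkit: `build_alphabet` and `substitute_noise` are hypothetical names, the sample sentences are only for illustration, and only substitution (no deletion or insertion) is implemented.

```python
import random
from typing import Iterable, List


def build_alphabet(texts: Iterable[str]) -> str:
    """Collect the distinct lowercase letters seen in the texts."""
    letters = {ch for text in texts for ch in text.lower() if ch.isalpha()}
    return "".join(sorted(letters))


def substitute_noise(tokens: List[str], alphabet: str, noise_prob: float,
                     rng: random.Random) -> List[str]:
    """Replace each letter by a random alphabet letter with probability noise_prob."""
    if not alphabet:
        return list(tokens)
    noisy = []
    for token in tokens:
        chars = [
            rng.choice(alphabet) if ch.isalpha() and rng.random() < noise_prob else ch
            for ch in token
        ]
        noisy.append("".join(chars))
    return noisy


corpus = ["Ɗan kasuwa ya ƙera ƙofa", "Ɓarawo ya gudu"]
alphabet = build_alphabet(corpus)
print(alphabet)
print(substitute_noise("ya ƙera ƙofa".split(), alphabet, 0.2, random.Random(0)))
```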



### lowresource_nlp/evaluation.py

In [None]:
import numpy as np
from typing import List, Dict, Callable, Optional
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import warnings

class CrossLingualEvaluator:
    """Evaluation tools for cross-lingual transfer and low-resource models."""

    def __init__(self):
        """Initialize the evaluator."""
        pass

    def evaluate_classification(self, y_true: List, y_pred: List) -> Dict:
        """
        Evaluate classification performance.

        Args:
            y_true: List of true labels
            y_pred: List of predicted labels
        Returns:
            Dictionary of evaluation metrics
        """
        accuracy = accuracy_score(y_true, y_pred)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average='weighted')

        return {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1': f1
        }

    def evaluate_token_overlap(self, reference: List[str], hypothesis: List[str]) -> float:
        """
        Evaluate token overlap as the Jaccard index of the unique reference and hypothesis tokens.

        Args:
            reference: Reference tokens
            hypothesis: Hypothesis tokens
        Returns:
            Token overlap score (0-1)
        """
        if not reference or not hypothesis:
            return 0.0

        ref_set = set(reference)
        hyp_set = set(hypothesis)

        intersection = ref_set.intersection(hyp_set)
        union = ref_set.union(hyp_set)

        return len(intersection) / len(union) if union else 0.0

    def evaluate_morphological_accuracy(self,
                                      gold_segmentations: List[List[str]],
                                      pred_segmentations: List[List[str]]) -> Dict:
        """
        Evaluate morphological segmentation accuracy.

        Args:
            gold_segmentations: List of gold standard morpheme segmentations
            pred_segmentations: List of predicted morpheme segmentations
        Returns:
            Dictionary of evaluation metrics
        """
        if len(gold_segmentations) != len(pred_segmentations):
            warnings.warn("Gold and predicted segmentations must have the same length")
            return {'boundary_precision': 0, 'boundary_recall': 0, 'boundary_f1': 0}

        total_boundaries_gold = 0
        total_boundaries_pred = 0
        correct_boundaries = 0

        for gold, pred in zip(gold_segmentations, pred_segmentations):
            # Convert segmentations to boundary indices
            gold_boundaries = set([len(''.join(gold[:i])) for i in range(1, len(gold))])
            pred_boundaries = set([len(''.join(pred[:i])) for i in range(1, len(pred))])

            total_boundaries_gold += len(gold_boundaries)
            total_boundaries_pred += len(pred_boundaries)
            correct_boundaries += len(gold_boundaries.intersection(pred_boundaries))

        precision = correct_boundaries / total_boundaries_pred if total_boundaries_pred > 0 else 0
        recall = correct_boundaries / total_boundaries_gold if total_boundaries_gold > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

        return {
            'boundary_precision': precision,
            'boundary_recall': recall,
            'boundary_f1': f1
        }

    def calculate_data_efficiency(self,
                               eval_function: Callable,
                               data_sizes: List[int],
                               results: List[float]) -> Dict:
        """
        Calculate data efficiency metrics.

        Args:
            eval_function: The evaluation metric used (kept for reference; it is not called here)
            data_sizes: List of training data sizes
            results: Corresponding results for each data size
        Returns:
            Dictionary of data efficiency metrics
        """
        if len(data_sizes) < 2 or len(data_sizes) != len(results):
            warnings.warn("Need at least two data points with matching results")
            return {}

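        # Calculate area under the learning curve, normalized by the data-size range.
        # NumPy 2.0 deprecates np.trapz in favor of np.trapezoid, so use whichever exists.
        trapezoid = getattr(np, 'trapezoid', None) or np.trapz
        auc = trapezoid(results, data_sizes) / (data_sizes[-1] - data_sizes[0])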

        # Calculate slope at different points
        slopes = []
        for i in range(1, len(data_sizes)):
            slope = (results[i] - results[i-1]) / (data_sizes[i] - data_sizes[i-1])
            slopes.append(slope)

        return {
            'learning_curve_auc': auc,
            'avg_slope': sum(slopes) / len(slopes),
            'final_slope': slopes[-1],
            'max_performance': max(results)
        }
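
To make the boundary-based metrics concrete, the following sketch converts two segmentations of the same word into boundary offsets and computes precision, recall, and F1 by hand. `boundary_indices` is a hypothetical standalone helper that mirrors the set comprehension in `evaluate_morphological_accuracy`, and the segmentations are illustrative.

```python
from typing import List, Set


def boundary_indices(segmentation: List[str]) -> Set[int]:
    """Character offsets at which one morpheme ends and the next begins."""
    boundaries, offset = set(), 0
    for morpheme in segmentation[:-1]:  # no boundary after the last morpheme
        offset += len(morpheme)
        boundaries.add(offset)
    return boundaries


gold = ["wa", "na", "soma"]  # illustrative segmentation of "wanasoma"
pred = ["wa", "nasoma"]

gold_b, pred_b = boundary_indices(gold), boundary_indices(pred)
correct = len(gold_b & pred_b)
precision = correct / len(pred_b)
recall = correct / len(gold_b)
f1 = 2 * precision * recall / (precision + recall)
print(gold_b, pred_b)                   # {2, 4} {2}
print(precision, recall, round(f1, 3))  # 1.0 0.5 0.667
```

The prediction places one boundary and it is correct (precision 1.0), but it misses the second gold boundary (recall 0.5).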

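The learning-curve metrics reduce to simple arithmetic, which the sketch below works through with the trapezoidal rule written out in plain Python. `normalized_auc` is a hypothetical helper, and the sizes and scores are made-up numbers for illustration, not measured results.

```python
from typing import List


def normalized_auc(data_sizes: List[int], results: List[float]) -> float:
    """Trapezoidal area under the curve divided by the range of data sizes."""
    area = 0.0
    for i in range(1, len(data_sizes)):
        width = data_sizes[i] - data_sizes[i - 1]
        area += width * (results[i] + results[i - 1]) / 2
    return area / (data_sizes[-1] - data_sizes[0])


sizes = [100, 200, 400]      # illustrative training-set sizes
scores = [0.50, 0.60, 0.70]  # illustrative scores

slopes = [(scores[i] - scores[i - 1]) / (sizes[i] - sizes[i - 1])
          for i in range(1, len(sizes))]
print(round(normalized_auc(sizes, scores), 4))  # 0.6167
print([round(s, 5) for s in slopes])            # [0.001, 0.0005]
```

The area is 55 + 130 = 185, and dividing by the size range of 300 gives about 0.617, which can be read as the average score across the curve. The falling slope shows diminishing returns from additional data.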

### lowresource_nlp/utils.py

In [None]:
import os
import json
import pickle
from typing import List, Dict, Any, Optional
import warnings
import numpy as np
import re

def load_text_file(file_path: str, encoding: str = 'utf-8') -> str:
    """
    Load text from file.

    Args:
        file_path: Path to text file
        encoding: File encoding
    Returns:
        Text content
    """
    try:
        with open(file_path, 'r', encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError:
        # Try fallback encodings. latin-1 accepts every byte sequence, so it must
        # come last; any encoding listed after it would never be tried.
        encodings = ['cp1252', 'latin-1']
        for enc in encodings:
            try:
                with open(file_path, 'r', encoding=enc) as f:
                    content = f.read()
            except UnicodeDecodeError:
                continue
            # Warn only after the read has actually succeeded
            warnings.warn(f"File {file_path} was read with {enc} encoding instead of {encoding}")
            return content

        raise ValueError(f"Could not decode file {file_path} with any encoding")

def save_model(model: Any, file_path: str):
    """
    Save a model to file.

    Args:
        model: Model to save
        file_path: Path to save the model
    """
    try:
        with open(file_path, 'wb') as f:
            pickle.dump(model, f)
    except Exception as e:
        warnings.warn(f"Error saving model: {e}")

def load_model(file_path: str) -> Any:
    """
    Load a model from file.

    Args:
        file_path: Path to the model file
    Returns:
        Loaded model
    """
    try:
        with open(file_path, 'rb') as f:
            return pickle.load(f)
    except Exception as e:
        warnings.warn(f"Error loading model: {e}")
        return None

def detect_language(text: str, min_length: int = 20) -> Optional[str]:
    """
    Simple language detection for common languages.

    Args:
        text: Text to detect language from
        min_length: Minimum text length for reliable detection
    Returns:
        ISO language code or None
    """
    if len(text) < min_length:
        return None

    try:
        from langdetect import detect
        return detect(text)
    except ImportError:
        warnings.warn("langdetect not installed. Language detection will use basic heuristics.")

        # Simple pattern-based detection
        text = text.lower()

        # Check for characteristic patterns
        patterns = {
            'en': r'\b(the|and|is|in|to|of)\b',  # English
            'sw': r'\b(na|ya|wa|ni|kwa)\b',      # Swahili
            'ha': r'\b(da|na|wa|ta|ba)\b',       # Hausa
            'yo': r'\b(ni|ti|ati|si|oun)\b'      # Yoruba
        }

        scores = {}
        for lang, pattern in patterns.items():
            matches = re.findall(pattern, text)
            scores[lang] = len(matches)

        if max(scores.values()) > 0:
            return max(scores, key=scores.get)

        return None

def split_data(data: List, train_ratio: float = 0.8, val_ratio: float = 0.1,
               random_seed: Optional[int] = None) -> Dict:
    """
    Split data into training, validation, and test sets.

    Args:
        data: List of data items
        train_ratio: Ratio of training data
        val_ratio: Ratio of validation data (test = 1 - train - val)
        random_seed: Random seed for reproducibility
    Returns:
        Dictionary with train, val, and test splits
    """
    if random_seed is not None:
        np.random.seed(random_seed)

    indices = np.random.permutation(len(data))

    train_end = int(train_ratio * len(data))
    val_end = train_end + int(val_ratio * len(data))

    train_indices = indices[:train_end]
    val_indices = indices[train_end:val_end]
    test_indices = indices[val_end:]

    return {
        'train': [data[i] for i in train_indices],
        'val': [data[i] for i in val_indices],
        'test': [data[i] for i in test_indices]
    }

def ensure_dir(directory: str):
    """
    Ensure that a directory exists.

    Args:
        directory: Directory path
    """
    if not os.path.exists(directory):
        os.makedirs(directory)
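
The split sizes in `split_data` come from truncating `ratio * len(data)` to an integer, so any remainder ends up in the test set. The sketch below shows this on indices only, using the standard library instead of NumPy; `split_indices` is a hypothetical helper and uses a local random generator so the global random state is left untouched.

```python
import random
from typing import Dict, List, Optional


def split_indices(n_items: int, train_ratio: float = 0.8, val_ratio: float = 0.1,
                  seed: Optional[int] = None) -> Dict[str, List[int]]:
    """Shuffle range(n_items) and cut it into train/val/test index lists."""
    if train_ratio < 0 or val_ratio < 0 or train_ratio + val_ratio > 1 + 1e-9:
        raise ValueError("ratios must be non-negative and sum to at most 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    train_end = int(train_ratio * n_items)
    val_end = train_end + int(val_ratio * n_items)
    return {
        "train": indices[:train_end],
        "val": indices[train_end:val_end],
        "test": indices[val_end:],
    }


splits = split_indices(25, seed=42)
print({name: len(idx) for name, idx in splits.items()})  # {'train': 20, 'val': 2, 'test': 3}
```

With 25 items, 10% is 2.5, which truncates to 2 validation items and leaves 3 for testing. On very small datasets it is worth checking that no split comes out empty.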

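The order of fallback encodings in `load_text_file` matters: `latin-1` maps every possible byte to a character and therefore never raises `UnicodeDecodeError`, so stricter encodings are only useful if they are tried before it. The sketch below demonstrates this on raw bytes; `decode_with_fallback` is a hypothetical helper, not a toolkit function.

```python
from typing import Sequence, Tuple


def decode_with_fallback(raw: bytes,
                         encodings: Sequence[str] = ("utf-8", "cp1252", "latin-1")
                         ) -> Tuple[str, str]:
    """Return (text, encoding_used), trying each encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("Could not decode bytes with any of: " + ", ".join(encodings))


print(decode_with_fallback("Yorùbá".encode("utf-8")))  # ('Yorùbá', 'utf-8')
print(decode_with_fallback(b"caf\xe9"))                # ('café', 'cp1252')
print(decode_with_fallback(b"\x81"))                   # falls through to latin-1
```

A successful decode does not prove the guess was right. A fallback result is best treated as provisional and inspected before the text is used for training.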

### examples/preprocessing_demo.ipynb


In [None]:
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Low-Resource Language Processing Toolkit: Preprocessing Demo\n",
    "\n",
    "This notebook demonstrates how to use the preprocessing components of the Low-Resource Language Processing Toolkit."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "First, let's install and import the necessary packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Installation (only if running in Colab)\n",
    "import sys\n",
    "if 'google.colab' in sys.modules:\n",
    "    !pip install nltk morfessor polyglot pyicu pycld2\n",
    "    !git clone https://github.com/YourUsername/low-resource-nlp.git\n",
    "    sys.path.append('/content/low-resource-nlp')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import modules\n",
    "from lowresource_nlp.preprocessing import MorphologicalPreprocessor\n",
    "from lowresource_nlp.utils import load_text_file, detect_language\n",
    "\n",
    "import nltk\n",
    "nltk.download('punkt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sample Text\n",
    "\n",
    "Let's define some sample text in different languages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sample texts\n",
    "swahili_text = \"\"\"Lugha ya Kiswahili ni mojawapo ya lugha za Kiafrika. \n",
    "Inatumiwa na watu wengi katika nchi za Afrika Mashariki na Kati.\n",
    "Kiswahili kina muundo wa maneno tata na mfumo wa viambishi vingi.\"\"\"\n",
    "\n",
    "hausa_text = \"\"\"Hausa tana daya daga cikin harsuna manya da ake amfani da su a nahiyar Afirka.\n",
    "Harshen Hausa tana da sautin baki mai yawa da dana hauwa game da gine-gine na kalmomi.\"\"\"\n",
    "\n",
    "english_text = \"\"\"Natural language processing helps computers communicate with humans in their own language.\n",
    "NLP is a component of artificial intelligence.\"\"\"\n",
    "\n",
    "print(f\"Languages detected: Swahili: {detect_language(swahili_text)}, Hausa: {detect_language(hausa_text)}, English: {detect_language(english_text)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Preprocessing\n",
    "\n",
    "Let's use the MorphologicalPreprocessor for basic preprocessing tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize preprocessors for different languages\n",
    "sw_preprocessor = MorphologicalPreprocessor(language='sw')\n",
    "ha_preprocessor = MorphologicalPreprocessor(language='ha')\n",
    "en_preprocessor = MorphologicalPreprocessor(language='en')\n",
    "\n",
    "# Preprocess Swahili text\n",
    "sw_processed = sw_preprocessor.preprocess(swahili_text)\n",
    "print(\"Swahili tokens:\")\n",
    "print(sw_processed['tokens'])\n",
    "\n",
    "# Preprocess Hausa text\n",
    "ha_processed = ha_preprocessor.preprocess(hausa_text)\n",
    "print(\"\\nHausa tokens:\")\n",
    "print(ha_processed['tokens'])\n",
    "\n",
    "# Preprocess English text\n",
    "en_processed = en_preprocessor.preprocess(english_text)\n",
    "print(\"\\nEnglish tokens:\")\n",
    "print(en_processed['tokens'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Morphological Analysis with Morfessor\n",
    "\n",
    "Now let's train a morphological segmentation model using Morfessor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a preprocessor with Morfessor enabled\n",
    "morfessor_preprocessor = MorphologicalPreprocessor(language='sw', use_morfessor=True)\n",
    "\n",
    "# Create training data (this would typically be a larger corpus)\n",
    "training_texts = [\n",
    "    \"Lugha ya Kiswahili ni mojawapo ya lugha za Kiafrika.\",\n",
    "    \"Inatumiwa na watu wengi katika nchi za Afrika Mashariki na Kati.\",\n",
    "    \"Kiswahili kina muundo wa maneno tata na mfumo wa viambishi vingi.\",\n",
    "    \"Wanafunzi wanapenda kusoma vitabu vizuri vya hadithi.\",\n",
    "    \"Tulienda sokoni kununua matunda na mboga kwa ajili ya familia.\"\n",
    "]\n",
    "\n",
    "# Train the Morfessor model\n",
    "morfessor_preprocessor.train_morfessor(training_texts)\n",
    "\n",
    "# Process text with morphological segmentation\n",
    "result = morfessor_preprocessor.preprocess(swahili_text, segment_morphemes=True)\n",
    "\n",
    "print(\"Original tokens:\")\n",
    "print(result['tokens'])\n",
    "print(\"\\nMorphemes:\")\n",
    "print(result['morphemes'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Processing a Collection of Texts\n",
    "\n",
    "Let's demonstrate how to process a collection of texts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Collection of texts\n",
    "swahili_texts = [\n",
    "    \"Jamhuri ya Kenya ni taifa katika Afrika Mashariki.\",\n",
    "    \"Nairobi ndio mji mkuu wa Kenya.\",\n",
    "    \"Kiswahili na Kiingereza ni lugha rasmi za Kenya.\"\n",
    "]\n",
    "\n",
    "# Process all texts\n",
    "processed_texts = []\n",
    "for text in swahili_texts:\n",
    "    processed = sw_preprocessor.preprocess(text)\n",
    "    processed_texts.append(processed)\n",
    "\n",
    "# Display results\n",
    "for i, processed in enumerate(processed_texts):\n",
    "    print(f\"Text {i+1}:\")\n",
    "    print(f\"Original: {processed['original']}\")\n",
    "    print(f\"Tokens: {processed['tokens']}\")\n",
    "    print()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
