### Text Processing: Handling Amharic Text, Tokenization, and Preprocessing

Before the scraped Amharic text can be used downstream, it needs to be tokenized, normalized, and adjusted for Amharic-specific linguistic features. The preprocessing steps below are tailored to the language.

**Steps to Preprocess Amharic Text**

- **Tokenization**: Tokenization splits text into individual units such as words or subwords. Amharic is written in the Ge'ez (Ethiopic) script and has its own punctuation marks, such as the word separator ፡ and the full stop ።, so tokenizers built around Latin-script conventions may need adjusting.
    - Use a library that supports Amharic text, or write a custom rule-based tokenizer.

- **Normalization**: This step cleans the text and converts it to a standard form:

    - Remove special characters, punctuation, and numbers.
    - Normalize characters that are written differently but pronounced the same in Amharic (for example, ሰ and ሠ), so that spelling variants of a word map to one form.
    - Convert text to a standard form (for example, removing diacritics if necessary).

- **Handling Amharic-Specific Features:**

    - Amharic, like other Semitic languages, has root-and-pattern morphology.

    - Handle orthographic variants, and account for the prefixes, suffixes, and infixes attached to words.

    - Identify verb conjugations, plural forms, and possessives to improve tokenization.
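
The normalization and tokenization steps above can be sketched in plain Python. This is a minimal illustration, not the behavior of `amharic_text_processor`: the function names `normalize_amharic` and `tokenize_amharic` are hypothetical, and the choice of homophone series to merge is an assumption based on the Ethiopic Unicode block layout, where each consonant series stores its seven vowel orders at consecutive code points. Morphological analysis (affixes, conjugations) is not attempted here.

```python
import re
from typing import List

# Assumed homophone series to merge: (source base, target base) code points.
# Each series holds its seven vowel orders at consecutive code points.
_HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # ሐ-series -> ሀ-series
    (0x1280, 0x1200),  # ኀ-series -> ሀ-series
    (0x1220, 0x1230),  # ሠ-series -> ሰ-series
    (0x12D0, 0x12A0),  # ዐ-series -> አ-series
    (0x1340, 0x1338),  # ፀ-series -> ጸ-series
]

# Translation table for str.translate: code point -> code point.
_HOMOPHONE_TABLE = {
    src + order: tgt + order
    for src, tgt in _HOMOPHONE_SERIES
    for order in range(7)
}

# Runs of Ethiopic letters (U+1200 to U+135A). Ethiopic punctuation and
# digits, Latin letters, and ASCII digits all fall outside this range.
_ETHIOPIC_WORD = re.compile(r"[\u1200-\u135A]+")


def normalize_amharic(text: str) -> str:
    """Map each homophone character to one canonical form."""
    return text.translate(_HOMOPHONE_TABLE)


def tokenize_amharic(text: str) -> List[str]:
    """Normalize the text, then return its runs of Ethiopic letters."""
    return _ETHIOPIC_WORD.findall(normalize_amharic(text))


# Example: the word separator (፡) and full stop (።) act as token boundaries.
print(tokenize_amharic("ሠላም፡ዓለም።"))
```

Extracting letter runs handles punctuation and number removal in one pass, but it also discards prices and phone numbers, so a task that needs those would have to widen the pattern.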

**Load, preprocess, and label the scraped Telegram data**

import pandas as pd
import logging
import os
import sys
from matplotlib import pyplot as plt
from collections import Counter
from amharic_text_processor import AmharicTextPreprocessor  # type: ignore
from amharic_labeler import AmharicNERLabeler  # type: ignore

# Set max rows and columns to display
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class DataProcessor:
    def __init__(self, data_path: str):
        self.data_path = data_path
        self.data = pd.read_csv(data_path)
        self.tokens = None
        self.labeled_data = None

    def explore_data(self):
        """Log the first and last five rows, the shape, and per-column missing-value counts."""
        logger.info("Exploring the data...")
        logger.info(f"First 5 rows: \n{self.data.head()}")
        logger.info(f"Last 5 rows: \n{self.data.tail()}")
        logger.info(f"Data shape: {self.data.shape}")
        logger.info(f"Missing values: \n{self.data.isnull().sum()}")

    def preprocess_data(self):
        """Preprocess and tokenize the Amharic messages."""
        logger.info("Preprocessing data...")
        preprocessor = AmharicTextPreprocessor()
        self.tokens = preprocessor.preprocess_dataframe(self.data, 'Message')

    def drop_na(self):
        """Drops rows with missing values in the 'Message' column."""
        logger.info("Dropping NaN values in 'Message' column...")
        # Pass the column as a list: a bare label is only accepted by newer pandas releases.
        self.data.dropna(subset=['Message'], inplace=True)

    def get_preprocessed_texts(self):
        """Returns the non-null preprocessed messages as a DataFrame with 'index' and 'message' columns."""
        preprocessed_texts = self.tokens['preprocessed_message'].dropna().tolist()
        return pd.Series(preprocessed_texts).reset_index(name='message')

class NERLabeler:
    def __init__(self):
        self.labeler = AmharicNERLabeler()

    def label_data(self, df: pd.DataFrame):
        """Labels the tokens in the DataFrame."""
        logger.info("Labeling data...")
        df['Tokenized'] = df['message'].apply(lambda x: x.split())
        labeled_df = self.labeler.label_dataframe(df, 'Tokenized')
        return labeled_df

    def save_labeled_data(self, labeled_df: pd.DataFrame, output_path: str):
        """Saves the labeled data in CoNLL format."""
        logger.info(f"Saving labeled data to {output_path}...")
        self.labeler.save_conll_format(labeled_df, output_path)

class AmharicTextPipeline:
    def __init__(self, data_path: str, output_path: str):
        self.data_processor = DataProcessor(data_path)
        self.ner_labeler = NERLabeler()
        self.output_path = output_path

    def run_pipeline(self):
        """Run the full pipeline: explore, clean, preprocess, label, and save."""
        self.data_processor.explore_data()
        # Drop rows with missing messages first so the preprocessor never sees NaN values.
        self.data_processor.drop_na()
        self.data_processor.preprocess_data()
        preprocessed_df = self.data_processor.get_preprocessed_texts()
        labeled_df = self.ner_labeler.label_data(preprocessed_df)
        self.ner_labeler.save_labeled_data(labeled_df, self.output_path)

if __name__ == "__main__":
    # Set the data and output paths
    data_path = '../data/telegram_data.csv'
    output_path = '../data/labeled_data_conll.conll'

    # Run the pipeline
    pipeline = AmharicTextPipeline(data_path, output_path)
    pipeline.run_pipeline()

    logger.info("Pipeline execution completed.")
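
For reference, CoNLL-style files usually hold one token and its label per line, with a blank line between sentences (here, between messages). The sketch below shows that layout under stated assumptions: `write_conll` is a hypothetical helper, not the `save_conll_format` method of `amharic_labeler`, the tab separator is an assumption (some CoNLL files use a space), and the `B-PRODUCT` and `O` labels are placeholders rather than the labeler's actual tag set.

```python
import io
from typing import Iterable, List, TextIO, Tuple

# One sentence is a list of (token, label) pairs.
Sentence = List[Tuple[str, str]]


def write_conll(sentences: Iterable[Sentence], out: TextIO) -> None:
    """Write token/label pairs, one per line, with a blank line after each sentence."""
    for sentence in sentences:
        for token, label in sentence:
            out.write(f"{token}\t{label}\n")
        out.write("\n")  # sentence boundary


# Usage with placeholder labels, writing to an in-memory buffer.
buffer = io.StringIO()
write_conll([[("ጫማ", "B-PRODUCT"), ("ዋጋ", "O")]], buffer)
print(buffer.getvalue())
```

Writing to any text stream keeps the helper easy to check in memory before pointing it at a file opened with `encoding="utf-8"`.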