📊 Data Ingestion and Preprocessing for NER Model

🎯 Objective

Establish a data ingestion system to collect and preprocess messages from Ethiopia-based Telegram e-commerce channels for Named Entity Recognition (NER) tasks.

🛠️ Approach

1️⃣ Data Ingestion

- Channel Identification: Select at least five relevant Telegram channels focused on e-commerce.

- Custom Scraper Development: Create a web scraper to automate the collection of messages, images, and documents from the identified channels.

- Real-Time Data Collection: Implement a system to fetch data as it is posted, ensuring the dataset remains current.

2️⃣ Data Preprocessing

- Text Normalization: Clean the collected text by converting it to a consistent format (e.g., lowercasing Latin-script text, removing emojis and special characters).

- Tokenization: Split the text into individual tokens (words) for easier analysis.

- Handling Amharic-Specific Features: Address linguistic characteristics of Amharic and its Ethiopic (Ge'ez) script, such as homophonous characters with several spellings (e.g., ሀ/ሐ/ኀ, ሰ/ሠ) and Ethiopic punctuation.
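
As a rough illustration of the three preprocessing steps above, the sketch below lowercases Latin text, folds a few homophonous Ethiopic character series onto one spelling, and splits on whitespace. The function names and the choice of which series to fold are assumptions made for this example, not the project's actual normalization rules.

```python
import re

# (source_base, target_base) code points for homophonous series.
# Only the seven basic vowel orders of each series are folded (an assumption).
_HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # ሐ-series -> ሀ-series
    (0x1280, 0x1200),  # ኀ-series -> ሀ-series
    (0x1220, 0x1230),  # ሠ-series -> ሰ-series
    (0x12D0, 0x12A0),  # ዐ-series -> አ-series
    (0x1340, 0x1338),  # ፀ-series -> ጸ-series
]

# Translation table mapping each source code point to its target code point.
_FOLD_TABLE = {
    src + order: dst + order
    for src, dst in _HOMOPHONE_SERIES
    for order in range(7)
}


def normalize_text(text: str) -> str:
    """Lowercase Latin letters and fold homophonous Ethiopic characters."""
    return text.lower().translate(_FOLD_TABLE)


def tokenize(text: str) -> list[str]:
    """Split on runs of whitespace."""
    return re.findall(r"\S+", text)


print(tokenize(normalize_text("ሠላም Price 1500 ብር")))
```

Lowercasing has no effect on Ethiopic characters, which have no case, so it only touches the Latin-script product names and keywords.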

3️⃣ Data Structuring

- Metadata Separation: Organize the data by separating metadata (e.g., sender, timestamp) from the message content.

- Unified Format Creation: Structure the cleaned data into a consistent format (e.g., CSV, JSON) for further analysis.
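
One possible shape for this step is sketched below: each scraped message is turned into a row whose metadata and content live in separate, named columns, and the rows are written as CSV. The column names follow the ones visible in the notebook's CSV output; the `ScrapedMessage` class, the helper names, and the sample channel are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone

FIELDNAMES = ["Channel Username", "ID", "Message", "Date"]


@dataclass
class ScrapedMessage:
    """Hypothetical container for one scraped message."""
    channel: str
    message_id: int
    text: str
    date: datetime


def to_row(msg: ScrapedMessage) -> dict:
    """Keep metadata (channel, ID, date) apart from the message content."""
    return {
        "Channel Username": msg.channel,
        "ID": msg.message_id,
        "Message": msg.text,
        "Date": msg.date.isoformat(sep=" "),
    }


def write_csv(messages, handle) -> None:
    """Write a header plus one CSV row per message to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for msg in messages:
        writer.writerow(to_row(msg))


sample = ScrapedMessage(
    "@example_channel", 1, "sample text",
    datetime(2025, 1, 17, 6, 59, 57, tzinfo=timezone.utc),
)
buffer = io.StringIO()
write_csv([sample], buffer)
print(buffer.getvalue())
```

Writing to a file instead of the in-memory buffer only requires passing a handle opened with `newline=""` and `encoding="utf-8"`.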

4️⃣ Quality Assurance

- Data Review: Conduct a thorough review of the collected data to ensure completeness and accuracy, checking for any missing or corrupted entries.
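
A review like this can be partly automated. The sketch below counts empty messages, repeated `(channel, ID)` pairs, and message texts that appear more than once. The `review_rows` helper is hypothetical and assumes each row is a dict whose `Message` value is a string or `None`.

```python
from collections import Counter


def review_rows(rows):
    """Return simple quality counts for a list of row dicts."""
    texts = [(row.get("Message") or "").strip() for row in rows]
    id_counts = Counter((row.get("Channel Username"), row.get("ID")) for row in rows)
    text_counts = Counter(t for t in texts if t)
    return {
        "rows": len(rows),
        "empty_messages": sum(1 for t in texts if not t),
        "duplicate_ids": sum(c - 1 for c in id_counts.values() if c > 1),
        "repeated_texts": sum(c - 1 for c in text_counts.values() if c > 1),
    }


rows = [
    {"Channel Username": "@a", "ID": 2, "Message": "coffee grinder"},
    {"Channel Username": "@a", "ID": 1, "Message": "coffee grinder"},
    {"Channel Username": "@a", "ID": 3, "Message": None},
]
print(review_rows(rows))
```

Repeated texts are worth checking because a channel may post the same advertisement under different message IDs.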

5️⃣ Data Storage

- Save Preprocessed Data: Store the cleaned and structured data in a suitable format for easy access during the labeling and model training phases.

✅ Summary of Steps

1. Identify relevant Telegram channels.

2. Develop a custom scraper for data collection.

3. Preprocess the collected data.

4. Structure and organize the data.

5. Conduct quality assurance and store the data.

<style>
    h1 {
        color: #ffaa00;
        text-shadow: 2px 2px 5px #000;
        font-family: "Comic Sans MS", sans-serif;
    }
</style>

<h1>✨ Importing Modules ✨</h1>

In [1]:
import os
import sys
notebook_dir = os.getcwd()
sys.path.append(os.path.abspath(os.path.join(notebook_dir, '..')))
sys.path.append(os.path.abspath('../scripts'))


In [2]:
from scripts.analysis import extract_messages_from_html_files, load_csv_to_dataframe


<h1>✨ Loading Extracted Telegram Messages from CSV ✨</h1>

In [4]:
# Load the CSV into a pandas DataFrame
telegram_data = load_csv_to_dataframe(r"C:\Users\fikad\Desktop\10acedamy\EthioMart-NER-Named-Entity-Recognition-\Data\telegram_data.csv")

# Display the first few rows of the DataFrame
print(telegram_data.head())

     Channel Username    ID  \
0  @Shageronlinestore  6211   
1  @Shageronlinestore  6210   
2  @Shageronlinestore  6207   
3  @Shageronlinestore  6206   
4  @Shageronlinestore  6205   

                                             Message  \
0  💥INIMA JAPAN COFFEE GRINDER\n\n💯ከፍተኛ ጥራት \n\n⚡...   
1  💥INIMA JAPAN COFFEE GRINDER\n\n💯ከፍተኛ ጥራት \n\n⚡...   
2  💥stainless still flower shape cake mold\n\n⚡️n...   
3  💥Delux Foldable multifunctional Draying RACK\n...   
4  #አልቆል_ለተባላችሁ_በድጋሚ_አስገብተናል \n💥Automatic rotatin...   

                        Date  
0  2025-01-17 06:59:57+00:00  
1  2025-01-17 06:59:57+00:00  
2  2025-01-16 13:41:31+00:00  
3  2025-01-16 10:07:54+00:00  
4  2025-01-16 09:20:43+00:00  


<style>
    h3 {
        color: #ff1199;
        text-shadow: 2px 2px 5px #000;
        font-family: "Comic Sans MS", sans-serif;
    }
</style>

<h3>✨ Checking for NaN Values and Removing Them ✨</h3>

In [5]:
telegram_data = telegram_data.dropna()  # dropna() is not in-place, so assign the result back
telegram_data

Unnamed: 0,Channel Username,ID,Message,Date
0,@Shageronlinestore,6211,💥INIMA JAPAN COFFEE GRINDER\n\n💯ከፍተኛ ጥራት \n\n⚡...,2025-01-17 06:59:57+00:00
1,@Shageronlinestore,6210,💥INIMA JAPAN COFFEE GRINDER\n\n💯ከፍተኛ ጥራት \n\n⚡...,2025-01-17 06:59:57+00:00
2,@Shageronlinestore,6207,💥stainless still flower shape cake mold\n\n⚡️n...,2025-01-16 13:41:31+00:00
3,@Shageronlinestore,6206,💥Delux Foldable multifunctional Draying RACK\n...,2025-01-16 10:07:54+00:00
4,@Shageronlinestore,6205,#አልቆል_ለተባላችሁ_በድጋሚ_አስገብተናል \n💥Automatic rotatin...,2025-01-16 09:20:43+00:00
...,...,...,...,...
995,@helloomarketethiopia,4210,እንኳን ለአዲሱ ዓመት በሰላም አደረሳችሁ!\nየልጆች የንባብ እንዲሁም የቀ...,2024-09-18 14:39:51+00:00
996,@helloomarketethiopia,4209,እንኳን ለአዲሱ ዓመት በሰላም አደረሳችሁ!\nለፀጉርዎ ልስላሳሴ፡ ጠንካሬ ...,2024-09-18 09:05:51+00:00
997,@helloomarketethiopia,4208,እንኳን ለአዲሱ ዓመት በሰላም አደረሳችሁ!\nለሁለቱም ፆታ የሚሆን በጀርባ...,2024-09-17 18:00:11+00:00
998,@helloomarketethiopia,4207,እንኳን ለአዲሱ ዓመት በሰላም አደረሳችሁ!\nከሸራ የተሰራ የልጆች የምሳእ...,2024-09-17 15:18:21+00:00


In [6]:
print("Checking for NaN values in the 'Message' column:")
nan_count = telegram_data['Message'].isnull().sum()
print(f"Number of NaN values in 'Message' column: {nan_count}")

Checking for NaN values in the 'Message' column:
Number of NaN values in 'Message' column: 0


In [7]:
messages = telegram_data['Message']  # a pandas Series, not a DataFrame
messages

0      💥INIMA JAPAN COFFEE GRINDER\n\n💯ከፍተኛ ጥራት \n\n⚡...
1      💥INIMA JAPAN COFFEE GRINDER\n\n💯ከፍተኛ ጥራት \n\n⚡...
2      💥stainless still flower shape cake mold\n\n⚡️n...
3      💥Delux Foldable multifunctional Draying RACK\n...
4      #አልቆል_ለተባላችሁ_በድጋሚ_አስገብተናል \n💥Automatic rotatin...
                             ...                        
995    እንኳን ለአዲሱ ዓመት በሰላም አደረሳችሁ!\nየልጆች የንባብ እንዲሁም የቀ...
996    እንኳን ለአዲሱ ዓመት በሰላም አደረሳችሁ!\nለፀጉርዎ ልስላሳሴ፡ ጠንካሬ ...
997    እንኳን ለአዲሱ ዓመት በሰላም አደረሳችሁ!\nለሁለቱም ፆታ የሚሆን በጀርባ...
998    እንኳን ለአዲሱ ዓመት በሰላም አደረሳችሁ!\nከሸራ የተሰራ የልጆች የምሳእ...
999    እንኳን ለአዲሱ ዓመት በሰላም አደረሳችሁ!\nእንዲህ ድምቅ ያለ ባህላዊ የ...
Name: Message, Length: 1000, dtype: object

<style>
    h2 {
        color: #ffaa00;
        text-shadow: 2px 2px 5px #000;
        font-family: "Comic Sans MS", sans-serif;
    }
</style>

<h2>✨   Extracting Unique Characters from a CSV Column ✨</h2>

In [8]:
# Combine all rows in the 'Message' column into a single string
combined_text = " ".join(telegram_data["Message"].astype(str))

# Find unique characters
unique_chars = sorted(set(combined_text))

# Print the unique characters
print("Unique characters found:")
print(unique_chars)


Unique characters found:
['\n', ' ', '!', '"', '#', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '~', '\xa0', '®', '°', '×', 'ሀ', 'ሁ', 'ሂ', 'ሃ', 'ሄ', 'ህ', 'ሆ', 'ለ', 'ሉ', 'ሊ', 'ላ', 'ሌ', 'ል', 'ሎ', 'ሐ', 'ሒ', 'ሕ', 'መ', 'ሙ', 'ሚ', 'ማ', 'ሜ', 'ም', 'ሞ', 'ሟ', 'ሠ', 'ሣ', 'ረ', 'ሩ', 'ሪ', 'ራ', 'ሬ', 'ር', 'ሮ', 'ሯ', 'ሰ', 'ሱ', 'ሲ', 'ሳ', 'ሴ', 'ስ', 'ሶ', 'ሸ', 'ሹ', 'ሻ', 'ሼ', 'ሽ', 'ሾ', 'ቀ', 'ቁ', 'ቂ', 'ቃ', 'ቄ', 'ቅ', 'ቆ', 'ቋ', 'በ', 'ቡ', 'ቢ', 'ባ', 'ቤ', 'ብ', 'ቦ', 'ቧ', 'ቨ', 'ቪ', 'ቫ', 'ቭ', 'ቮ', 'ተ', 'ቱ', 'ቲ', 'ታ', 'ቴ', 'ት', 'ቶ', 'ቷ', 'ቸ', 'ቹ', 'ቺ', 'ቻ', 'ቼ', 'ች', 'ቾ', 'ኋ', 'ነ', 'ኑ', 'ኒ', 'ና', 'ኔ', 'ን', 'ኖ', 'ኘ', 'ኙ', 'ኛ', 'ኝ', 'ኞ', 'አ', 'ኢ', 'ኣ', 'ኤ', 'እ', 'ኦ', 
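
A flat list of characters is hard to scan, so it can help to bucket them by script before deciding what the cleaning step should keep. The sketch below does this with the standard `unicodedata` module; the `group_by_script` helper and its four bucket names are assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict


def group_by_script(chars):
    """Bucket characters by the first word of their Unicode name."""
    groups = defaultdict(list)
    for ch in chars:
        # Control characters such as '\n' have no name; fall back to "".
        name = unicodedata.name(ch, "")
        if name.startswith("ETHIOPIC"):
            key = "ethiopic"
        elif name.startswith("LATIN"):
            key = "latin"
        elif name.startswith("DIGIT"):
            key = "digit"
        else:
            key = "other"
        groups[key].append(ch)
    return dict(groups)


print(group_by_script(["\n", "A", "z", "7", "ሀ", "!"]))
```

In the notebook this could be called as `group_by_script(unique_chars)`; the "other" bucket then lists the symbols that the cleaning step has to handle.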

<h3>✨   Removing Emojis and Symbols from Messages ✨</h3>

In [9]:
import emoji
import re


# Function to clean the text (remove emojis, symbols, etc.)
def remove_emoji(text):
    if isinstance(text, str):
        return emoji.replace_emoji(text, replace='')
    return text

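def remove_symbols(text):
    if isinstance(text, str):
        # Keep Latin letters, digits, whitespace, and Ethiopic syllables.
        # The range ends at ፚ (U+135A) so that ፑ, ፒ, ፓ, ፔ, ፕ, ፖ are not stripped.
        return re.sub(r'[^A-Za-z0-9ሀ-ፚ\s]+', '', text)
    return text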

# Apply cleaning functions to 'Message' column
telegram_data['Message'] = telegram_data['Message'].apply(remove_emoji).apply(remove_symbols)
display(telegram_data.head())

Unnamed: 0,Channel Username,ID,Message,Date
0,@Shageronlinestore,6211,INIMA JAPAN COFFEE GRINDER\n\nከፍተኛ ጥራት \n\n150...,2025-01-17 06:59:57+00:00
1,@Shageronlinestore,6210,INIMA JAPAN COFFEE GRINDER\n\nከፍተኛ ጥራት \n\n150...,2025-01-17 06:59:57+00:00
2,@Shageronlinestore,6207,stainless still flower shape cake mold\n\nnon ...,2025-01-16 13:41:31+00:00
3,@Shageronlinestore,6206,Delux Foldable multifunctional Draying RACK\n\...,2025-01-16 10:07:54+00:00
4,@Shageronlinestore,6205,አልቆልለተባላችሁበድጋሚአስገብተናል \nAutomatic rotating noz...,2025-01-16 09:20:43+00:00
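
In the last row above, the hashtag words have fused into a single token because the underscores between them were deleted. One way to avoid this, sketched below under the assumption that underscores only ever act as word separators in these messages, is to turn them into spaces before stripping symbols. The `clean_message` helper is hypothetical; it relies on the regular expression alone to drop emojis and does not use the `emoji` package.

```python
import re

# Keep Latin letters, ASCII digits, Ethiopic syllables (U+1200-U+135A) and whitespace.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\u1200-\u135A\s]+")


def clean_message(text: str) -> str:
    """Drop symbols while keeping the word boundaries that underscores marked."""
    text = text.replace("_", " ")      # hashtag separators become spaces
    text = _DISALLOWED.sub("", text)   # remove emojis, punctuation, other symbols
    # Collapse runs of spaces and tabs; newlines inside the message are kept.
    return re.sub(r"[ \t]+", " ", text).strip()


print(clean_message("#አልቆል_ለተባላችሁ 💥Automatic"))
```

With the words kept apart, the whitespace tokenizer used later in the notebook would see each hashtag word as its own token.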


<h1>✨    Named Entity Recognition (NER) Labeling Function✨</h1>

In [10]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
# Load the tokenizer and model for NER
tokenizer = AutoTokenizer.from_pretrained("mbeukman/xlm-roberta-base-finetuned-amharic-finetuned-ner-amharic")
model = AutoModelForTokenClassification.from_pretrained("mbeukman/xlm-roberta-base-finetuned-amharic-finetuned-ner-amharic")

# Set up NER pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

  from .autonotebook import tqdm as notebook_tqdm
Device set to use cuda:0


In [11]:
# Map the pretrained model's entity tags onto this project's label set.
# Tags not listed here are treated as 'O'.
NER_LABEL_MAP = {
    'B-LOC': 'B-LOC',
    'I-LOC': 'I-LOC',
    'B-ORG': 'B-PRODUCT',
    'I-ORG': 'I-PRODUCT',
    'B-MISC': 'B-PRICE',
    'I-MISC': 'I-PRICE',
}

# Function to map NER model results to CoNLL format labels
def map_ner_to_conll(ner_results, tokens):
    labels = ['O'] * len(tokens)  # Initialize labels as 'O' (Outside)

    for entity in ner_results:
        # Strip subword markers: '##' (WordPiece) and '▁' (SentencePiece, used by XLM-RoBERTa)
        word = entity['word'].replace('##', '').replace('▁', '')
        if not word:
            continue  # a bare marker would otherwise match every token
        label = NER_LABEL_MAP.get(entity['entity'], 'O')

        # Apply the label to every token that contains the subword (substring match)
        for i, token in enumerate(tokens):
            if word in token:
                labels[i] = label

    return labels

# Custom labeling function to identify prices and locations
def custom_label_prices_locations(tokens):
    labels = ['O'] * len(tokens)  # Initialize labels as 'O'
    
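    # Price patterns are applied with re.match, so each one is anchored at the start of the token.
    # Raw strings avoid the invalid '\$' escape sequence in a normal string literal.
    price_patterns = [r'^\d*(00|.*50)(\.\d{1,2})?$', r'ETB', r'ዋጋ', r'\$', r'ብር', r'Birr']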
    
    product_patterns=['sketchers', 'Nike', 'Adidas','Rebook','samsung','Samsung','Vans','Nikon', 'Nike','Puma',
                      'Adidas','Lacosta', 'Rolex','New','Allstar','Vapor','Sketchers','FILA','CK','TRAZER',
                      'Jordan','Womens','Human','Couple','Original','Victorias','BURBERRY','OFFER','Fila','2TB',
                      'CLASICO','Men','Balenciaga','Shose','CASENT','NIKE','Nike','Airforce','ROLEX','LOUIS','CYBER',
                      'Speed','speed','AIR','Air','Skacher','Time','All','Fitron','FITRON','EMPORIO','CK',
                      'CHANEL','Skechers','Sketcher','NB','Old','old','OLD','FENDI','SPEED','BRAND','Brand',
                      'BALENCIAGA','GUCCI','CHEKICH','GIORGIO','Jordan','JORDAN', 'Vest','European','Fur','VIGUER',
                      'Quality', 'QUALITY','SVETSEON','Couple','COUPLE','High','HIGH','Under','ADIDAS','VANS','Sun',
                      'Rolex','LEBRON','Lebron','Yezzy','ALEXANDER','XO','Jacket','55','HURACHE','Clark','Hermes','VM','RADO','Apple',
                      'Fendi','Police','Champion','Gucci','Stan','Calvin','SWISH','SKMEL','FOR','Cr','Military','VEST','YEEZY','DIESEL','chekich']
    
    locations = ['Addis','Ababa', 'ቦሌ', 'ሜክሲኮ', 'ለቡ', 'Mekelle', 'Adama', 'Gondar', 'ለቡ','መዳህኒዓለም', 
                 'መገናኛ', 'አበባ', 'ሀይሎች','ጦር', 'ድሪም', 'ታወር','205','አዲስ', 'ቁጥር', "ቢሮ", "ፎቅ", "2ተኛ" ]
    
    # Tokens with a fixed label; these take precedence over the pattern checks below
    custom_tokens = {
          
        "አድራሻ"  : "B-LOC",       # Beginning of a location
        "Price": "B-PRICE",         #Beginning of a price
        "Prices": "B-PRICE",       #Beginning of a price
        "Free" : "O",
        "Delivery":"O",
        "Inbox" :"O",
        "Hiwe5266": "O",
        "ስልክ":"O",
        "ፋሽን":"O",
        "ተራ":"O",
        "Fashion":"O",
        "Tera":"O",
        "New" : "O",
        "year" :"O",
        "Discount": "O",
        "me" : "O",
        "httpsvmtiktokcomZM2yHbMPH" : "O",
        "contact" : "O",
        "sold" : "O",
        "out" : "O",
        "Sold" : "O",
        "Call" : "O",
        "call" : "O",
        "more" : "O",
        "info" : "O",
        "as" : "O",
        "Anyone" : "O",
        "who" : "O",
        "want" : "O",
        "new" : "O",
        "Original" : "O",
        "BIG" : "O",
        "DISCOUNT" : "O",
        "ብዛት" : "O",
        "ለምትወስዱ" : "O",
        "ልዩ" : "O",
        "ቅናሽ" : "O",
        "አለዉ" : "O",
        "ባሉበት" : "O",
        "እናደርሳለን" : "O",
        "ጫማ" : "O",
        "ለመግዛት" : "O",
        "መርካቶ" : "O",
        "እየሄዱ" : "O",
        "ደክመዋል" : "O",
        "እንግዲያውስ" : "O",
        "ቻናላችንን" : "O",
        "በመቀላቀለ" : "O",
        "የፈለጉትን" : "O",
        "ይዘዙን" : "O",
        "ባሉበት" : "O",
        "እናመጣለን" : "O",
        "httpstmejoinchatAAAAAEYRIOB5Tt7gKGGjA" : "O",
        "Enkuan": "O",
        "le": "O",
        "berhan": "O",
        "meswkelu": "O",
        "beselam": "O",
        "adersachu": "O",
               

    }
 
    # Apply labels based on the tokens
    for i, token in enumerate(tokens):
        # Check if token is in the custom list
        if token in custom_tokens:
            labels[i] = custom_tokens[token]
        # Very long numbers (10 digits or more), such as phone numbers, are not prices
        elif re.match(r'^\d{10,}$', token):
            labels[i] = 'O'
        # Label tokens that contain a known brand or product keyword (substring match)
        elif any(pro in token for pro in product_patterns):
            labels[i] = 'B-PRODUCT'
        # Label prices (e.g., round amounts, ETB, Birr, ብር)
        elif any(re.match(pattern, token) for pattern in price_patterns):
            labels[i] = 'I-PRICE'
        # Label locations (predefined location names, substring match)
        elif any(loc in token for loc in locations):
            labels[i] = 'I-LOC'
        # Fall back to I-PRODUCT for every remaining token
        else:
            labels[i] = 'I-PRODUCT'
    
    return labels 

# Function to combine both NER and custom labels
def combine_labels(ner_labels, custom_labels):
    final_labels = []
    
    for ner_label, custom_label in zip(ner_labels, custom_labels):
        if ner_label != 'O':  # NER label takes precedence
            final_labels.append(ner_label)
        else:
            final_labels.append(custom_label)  # Otherwise, use custom label
    
    return final_labels
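
Matching model output to tokens by substring can label unrelated tokens that happen to contain the same characters. An alternative, sketched below, aligns by character position instead. It assumes each entity dict carries `start` and `end` character offsets, which the `transformers` token-classification pipeline reports when the tokenizer supports offset mapping; entities without offsets are skipped. The helper name and the hand-written `fake_entities` list are hypothetical stand-ins, not real model output.

```python
import re


def align_entities_to_tokens(message, entities):
    """Give each whitespace token the tag of the first entity overlapping it."""
    spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", message)]
    labels = ["O"] * len(spans)
    for ent in entities:
        if ent.get("start") is None or ent.get("end") is None:
            continue  # no offsets available for this entity
        for i, (start, end) in enumerate(spans):
            overlaps = ent["start"] < end and ent["end"] > start
            if overlaps and labels[i] == "O":
                labels[i] = ent["entity"]
    return labels


# Hand-written stand-in for pipeline output; a real run would pass nlp(message).
message = "ቦሌ መድሃኒዓለም 1500 ብር"
fake_entities = [
    {"entity": "B-LOC", "start": 0, "end": 2},
    {"entity": "I-LOC", "start": 3, "end": 10},
]
print(align_entities_to_tokens(message, fake_entities))
```

The returned tags are the model's own; they would still need to pass through the project's tag mapping before being combined with the custom labels.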

In [12]:
# Function to process a message with both NER and custom methods
def process_message(message, nlp_pipeline):
    tokens = re.findall(r'\S+', message)  # Tokenize the message
    
    # Apply NER model
    ner_results = nlp_pipeline(message)
    ner_labels = map_ner_to_conll(ner_results, tokens)
    
    # Apply custom labeling
    custom_labels = custom_label_prices_locations(tokens)
    
    # Combine both label sets
    final_labels = combine_labels(ner_labels, custom_labels)
    
    # Return tokens with their combined labels
    labeled_tokens = [f"{token} {label}" for token, label in zip(tokens, final_labels)]
    return "\n".join(labeled_tokens)

# Apply the combined processing to each message
telegram_data['Labeled_Message'] = telegram_data['Message'].apply(lambda msg: process_message(msg, nlp))

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


In [13]:
telegram_data.head()

Unnamed: 0,Channel Username,ID,Message,Date,Labeled_Message
0,@Shageronlinestore,6211,INIMA JAPAN COFFEE GRINDER\n\nከፍተኛ ጥራት \n\n150...,2025-01-17 06:59:57+00:00,INIMA I-PRODUCT\nJAPAN I-PRODUCT\nCOFFEE I-PRO...
1,@Shageronlinestore,6210,INIMA JAPAN COFFEE GRINDER\n\nከፍተኛ ጥራት \n\n150...,2025-01-17 06:59:57+00:00,INIMA I-PRODUCT\nJAPAN I-PRODUCT\nCOFFEE I-PRO...
2,@Shageronlinestore,6207,stainless still flower shape cake mold\n\nnon ...,2025-01-16 13:41:31+00:00,stainless I-PRODUCT\nstill I-PRODUCT\nflower I...
3,@Shageronlinestore,6206,Delux Foldable multifunctional Draying RACK\n\...,2025-01-16 10:07:54+00:00,Delux I-PRODUCT\nFoldable B-PRODUCT\nmultifunc...
4,@Shageronlinestore,6205,አልቆልለተባላችሁበድጋሚአስገብተናል \nAutomatic rotating noz...,2025-01-16 09:20:43+00:00,አልቆልለተባላችሁበድጋሚአስገብተናል I-LOC\nAutomatic I-PRODU...


In [14]:
# Save the final labeled data to a CoNLL-style file
output_file_combined = r'C:\Users\fikad\Desktop\10acedamy\EthioMart-NER-Named-Entity-Recognition-\Data\labeled_telegram_data_conll.conll'
with open(output_file_combined, 'w', encoding='utf-8') as f:
    # Each entry already holds one "token label" pair per line; a blank line separates messages
    for labeled_message in telegram_data['Labeled_Message']:
        f.write(f"{labeled_message}\n\n")


print(f"labeled data saved to {output_file_combined}")  

labeled data saved to C:\Users\fikad\Desktop\10acedamy\EthioMart-NER-Named-Entity-Recognition-\Data\labeled_telegram_data_conll.conll
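
Before labeling or training, it can be useful to read the file back and confirm it parses as expected. The sketch below splits CoNLL-style text into messages of `(token, label)` pairs and counts the labels. The `parse_conll` helper is hypothetical and assumes every non-blank line holds exactly one token, one space, and one label, as written by the cell above.

```python
from collections import Counter


def parse_conll(text):
    """Split CoNLL-style text into lists of (token, label) pairs, one list per message."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current message, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        token, _, label = line.rpartition(" ")
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


sample = "INIMA I-PRODUCT\nJAPAN I-PRODUCT\n\n1500 I-PRICE\nብር I-PRICE\n\n"
parsed = parse_conll(sample)
label_counts = Counter(label for sentence in parsed for _, label in sentence)
print(len(parsed), dict(label_counts))
```

For the saved file, the text would come from `open(output_file_combined, encoding='utf-8').read()`.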
