📊 Data Ingestion and Preprocessing for NER Model

🎯 Objective

Establish a data ingestion system to collect and preprocess messages from Ethiopia-based Telegram e-commerce channels for Named Entity Recognition (NER) tasks.

🛠️ Approach

1️⃣ Data Ingestion

- Channel Identification: Select at least five relevant Telegram channels focused on e-commerce.

- Custom Scraper Development: Create a web scraper to automate the collection of messages, images, and documents from the identified channels.

- Real-Time Data Collection: Implement a system to fetch data as it is posted, ensuring the dataset remains current.
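One way to keep the dataset current without storing duplicates is to remember the highest message ID already collected per channel and keep only newer messages on each poll. The sketch below assumes messages arrive as dictionaries with an integer `id` that grows over time within a channel. `IncrementalCollector` and its field names are hypothetical, and the actual fetching from Telegram is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalCollector:
    """Keeps only messages that are newer than the last one seen per channel."""
    last_seen: dict = field(default_factory=dict)   # channel -> highest stored id
    collected: list = field(default_factory=list)   # all stored messages

    def ingest(self, channel, messages):
        """Store unseen messages for `channel` and return how many were added."""
        threshold = self.last_seen.get(channel, 0)
        # Index by id so a message repeated within one batch is stored once
        fresh = {m["id"]: m for m in messages if m["id"] > threshold}
        for message_id in sorted(fresh):
            self.collected.append({"channel": channel, **fresh[message_id]})
        if fresh:
            self.last_seen[channel] = max(fresh)
        return len(fresh)


collector = IncrementalCollector()
first = collector.ingest(
    "@Bellaclassics",
    [{"id": 1, "text": "Nike ACG"}, {"id": 2, "text": "ADIDAS OZELIA"}],
)
# A later poll returns one already-seen message and one new message
second = collector.ingest(
    "@Bellaclassics",
    [{"id": 2, "text": "ADIDAS OZELIA"}, {"id": 3, "text": "Nike sb w"}],
)
print(first, second, len(collector.collected))
```

Persisting `last_seen` between runs (for example in a small JSON file) would let a scheduled job resume where it stopped.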

2️⃣ Data Preprocessing

- Text Normalization: Clean the collected text by converting it to a consistent format (e.g., normalizing Unicode and whitespace, removing emojis and special characters, and lowercasing Latin-script text).

- Tokenization: Split the text into individual tokens (words) for easier analysis.

- Handling Amharic-Specific Features: Address characteristics specific to Amharic and its Ge'ez (Ethiopic) script, such as homophonic characters with several spellings (e.g., ሀ/ሐ/ኀ) and Ethiopic punctuation marks (e.g., ።).
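The preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the homophone mapping covers only a few character rows chosen for the example, and the names `normalize_amharic` and `tokenize_amharic` are hypothetical.

```python
import re
import unicodedata

# Rows of homophonic syllables mapped to one canonical row (an illustrative choice)
_HOMOPHONE_ROWS = {"ሐ": "ሀ", "ኀ": "ሀ", "ሠ": "ሰ", "ዐ": "አ", "ፀ": "ጸ"}

# Each consonant occupies consecutive code points, one per vowel order;
# map the seven basic orders of every source row onto the target row.
_HOMOPHONE_TABLE = {
    ord(src) + order: chr(ord(dst) + order)
    for src, dst in _HOMOPHONE_ROWS.items()
    for order in range(7)
}

# Words are runs of characters that are neither whitespace nor Ethiopic
# punctuation; punctuation marks (U+1361..U+1368) become separate tokens.
_TOKEN_RE = re.compile(r"[^\s\u1361-\u1368]+|[\u1361-\u1368]")


def normalize_amharic(text):
    """Apply Unicode normalization, homophone mapping, and whitespace cleanup."""
    text = unicodedata.normalize("NFKC", text)   # e.g. no-break space -> space
    text = text.translate(_HOMOPHONE_TABLE)
    return " ".join(text.split())


def tokenize_amharic(text):
    """Normalize the text, then split it into word and punctuation tokens."""
    return _TOKEN_RE.findall(normalize_amharic(text))


print(normalize_amharic("ሠላም"))
print(tokenize_amharic("ዋጋ\xa03500 ብር።"))
```

Whether homophones should be merged at all depends on the labeling goal, since the mapping changes the surface form of names and addresses.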

3️⃣ Data Structuring

- Metadata Separation: Organize the data by separating metadata (e.g., sender, timestamp) from the message content.

- Unified Format Creation: Structure the cleaned data into a consistent format (e.g., CSV, JSON) for further analysis.
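A small sketch of the structuring step: one record type holds the fields, and a helper splits it into a metadata part and a content part before serializing to JSON. The class and function names (`MessageRecord`, `build_record`, `to_document`) are hypothetical, and the date and time formats are assumed to match the preview shown later in this notebook (`22 August 2023`, `07:30`).

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass
class MessageRecord:
    channel: str
    sender: str
    timestamp: str   # ISO 8601 string
    text: str


def build_record(channel, sender, date, time, text):
    """Combine separate date and time strings into one ISO timestamp."""
    parsed = datetime.strptime(f"{date} {time}", "%d %B %Y %H:%M")
    return MessageRecord(channel, sender, parsed.isoformat(), text)


def to_document(record):
    """Split a record into a metadata part and a content part."""
    fields = asdict(record)
    content = fields.pop("text")
    return {"metadata": fields, "content": content}


record = build_record("@Bellaclassics", "BELLA CLASSIC®", "22 August 2023", "07:30", "Nike ACG")
document = to_document(record)
# ensure_ascii=False keeps Amharic text readable in the output file
print(json.dumps(document, ensure_ascii=False, indent=2))
```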

4️⃣ Quality Assurance

- Data Review: Conduct a thorough review of the collected data to ensure completeness and accuracy, checking for any missing or corrupted entries.
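Part of this review can be automated with simple per-row checks. The sketch below flags empty fields and values that do not look like a price or a phone number. The expected formats are assumptions taken from the sample rows (prices as digits with optional thousands separators, phone numbers as 10 digits starting with 0), and `find_issues` is a hypothetical helper.

```python
import re

# Assumed formats, based on the sample rows in this notebook
PRICE_RE = re.compile(r"\d{1,3}(,\d{3})+|\d+")
PHONE_RE = re.compile(r"0\d{9}")

FORMAT_CHECKS = (
    ("Price", PRICE_RE, "not numeric"),
    ("Phone Number", PHONE_RE, "unexpected format"),
)


def find_issues(row):
    """Return a list of human-readable problems found in one row (a dict)."""
    issues = []
    # Completeness: every field should hold a non-empty value
    for column, value in row.items():
        if value is None or not str(value).strip():
            issues.append(f"{column}: missing")
    # Accuracy: selected fields should match their expected format
    for column, pattern, problem in FORMAT_CHECKS:
        value = str(row.get(column) or "").strip()
        if value and not pattern.fullmatch(value):
            issues.append(f"{column}: {problem}")
    return issues


good = {"Product Name": "Nike ACG", "Price": "3500", "Phone Number": "0944222069"}
bad = {"Product Name": "Unknown", "Price": "1900\u260e\ufe0f ስልክ", "Phone Number": "1900\u260e\ufe0f ስልክ"}
print(find_issues(good))
print(find_issues(bad))
```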

5️⃣ Data Storage

- Save Preprocessed Data: Store the cleaned and structured data in a suitable format for easy access during the labeling and model training phases.

✅ Summary of Steps

1. Identify relevant Telegram channels.

2. Develop a custom scraper for data collection.

3. Preprocess the collected data.

4. Structure and organize the data.

5. Conduct quality assurance and store the data.

<style>
    h1 {
        color: #ffaa00;
        text-shadow: 2px 2px 5px #000;
        font-family: "Comic Sans MS", sans-serif;
    }
</style>

<h1>✨ Logging Setup Example in Python ✨</h1>

In [1]:
import logging

# Configure logging
logging.basicConfig(
    filename="eda_log.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

logger = logging.getLogger()

# Example log
logger.info("Logging setup complete.")

<h1>✨ Importing Modules ✨</h1>

In [2]:
import os
import sys
notebook_dir = os.getcwd()
sys.path.append(os.path.abspath(os.path.join(notebook_dir, '..')))
sys.path.append(os.path.abspath('../scripts'))
logger.info("Imported required libraries.")

In [3]:
from scripts.analysis import extract_messages_from_html, load_and_preview_csv  

<h1>✨ Extracting Messages from HTML and Saving to CSV ✨</h1>

In [4]:
# Define the file paths
html_file = r"C:\Users\fikad\Desktop\10acedamy\EthioMart-NER-Named-Entity-Recognition-\Data\messages.html"
output_csv = r"C:\Users\fikad\Desktop\10acedamy\EthioMart-NER-Named-Entity-Recognition-\Data\messages.csv"

# Call the function
extract_messages_from_html(html_file, output_csv)


Messages extracted and saved to C:\Users\fikad\Desktop\10acedamy\EthioMart-NER-Named-Entity-Recognition-\Data\messages.csv
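`extract_messages_from_html` lives in `scripts.analysis`, and its implementation is not shown in this notebook. As a rough sketch of how such extraction can work with only the standard library, the parser below collects the text of every `<div>` whose class list contains `text`. That class name is an assumption about the export markup, and `MessageTextParser` is a hypothetical name, not the project's code.

```python
from html.parser import HTMLParser


class MessageTextParser(HTMLParser):
    """Collects the text content of <div class="text"> elements."""

    def __init__(self):
        super().__init__()
        self.messages = []
        self._depth = 0      # > 0 while inside a message text <div>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "br" and self._depth:
            self._buffer.append("\n")    # keep line breaks inside a message
            return
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1             # nested <div> inside the message text
        elif "text" in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.messages.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


sample_html = (
    '<div class="message"><div class="from_name">BELLA CLASSIC®</div>'
    '<div class="text">Nike ACG<br>Price 3500</div></div>'
)
parser = MessageTextParser()
parser.feed(sample_html)
print(parser.messages)
```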


In [5]:

# Define the path to your CSV file
csv_file = r"C:\Users\fikad\Desktop\10acedamy\EthioMart-NER-Named-Entity-Recognition-\Data\messages.csv"

# Load and preview the CSV file
telegram_data = load_and_preview_csv(csv_file, num_rows=5)


             Date   Time          Sender         Product Name  Price  \
0  22 August 2023  07:30  BELLA CLASSIC®     crocodile school  2,800   
1  22 August 2023  07:57  BELLA CLASSIC®  ALLIGATOR CROCODILE   3500   
2  22 August 2023  08:31  BELLA CLASSIC®             Nike ACG   3500   
3  22 August 2023  12:14  BELLA CLASSIC®            Nike sb w   3400   
4  23 August 2023  11:51  BELLA CLASSIC®        ADIDAS OZELIA   3400   

             Size  Made In Phone Number    Color    Contact Link  \
0  40,41,42,43,44   Turkey   0944222069  2 kinds  @Bellaclassics   
1  40,41,42,43,44   Turkey   0944222069  2 kinds  @Bellaclassics   
2  40,41,42,43,44  Vietnam   0944222069  2 kinds  @Bellaclassics   
3  40,41,42,43,44  Vietnam   0944222069  4 kinds  @Bellaclassics   
4  40,41,42,43,44  Vietnam   0944222069  2 kinds  @Bellaclassics   

                                             Address  
0  ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...  
1  ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል 

In [6]:
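# Preview the rows without missing values; the result is not assigned, so telegram_data itself is unchanged
telegram_data.dropna()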

Unnamed: 0,Date,Time,Sender,Product Name,Price,Size,Made In,Phone Number,Color,Contact Link,Address
0,22 August 2023,07:30,BELLA CLASSIC®,crocodile school,"2,800","40,41,42,43,44",Turkey,0944222069,2 kinds,@Bellaclassics,ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...
1,22 August 2023,07:57,BELLA CLASSIC®,ALLIGATOR CROCODILE,3500,"40,41,42,43,44",Turkey,0944222069,2 kinds,@Bellaclassics,ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...
2,22 August 2023,08:31,BELLA CLASSIC®,Nike ACG,3500,"40,41,42,43,44",Vietnam,0944222069,2 kinds,@Bellaclassics,ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...
3,22 August 2023,12:14,BELLA CLASSIC®,Nike sb w,3400,"40,41,42,43,44",Vietnam,0944222069,4 kinds,@Bellaclassics,ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...
4,23 August 2023,11:51,BELLA CLASSIC®,ADIDAS OZELIA,3400,"40,41,42,43,44",Vietnam,0944222069,2 kinds,@Bellaclassics,ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...
...,...,...,...,...,...,...,...,...,...,...,...
407,16 January 2025,03:49,BELLA CLASSIC®,Unknown,1900☎️ ስልክ,1900☎️ ስልክ,Unknown,1900☎️ ስልክ,Unknown,1900☎️ ስልክ,1900☎️ ስልክ
408,16 January 2025,09:02,BELLA CLASSIC®,Unknown,1900☎️ ስልክ,1900☎️ ስልክ,Unknown,1900☎️ ስልክ,Unknown,1900☎️ ስልክ,1900☎️ ስልክ
409,16 January 2025,21:13,BELLA CLASSIC®,Unknown,LUKAI MC📐 size,LUKAI MC📐 size,Unknown,LUKAI MC📐 size,Unknown,LUKAI MC📐 size,LUKAI MC📐 size
410,16 January 2025,23:22,BELLA CLASSIC®,Unknown,ALEXANDER📐 size,ALEXANDER📐 size,Unknown,ALEXANDER📐 size,Unknown,ALEXANDER📐 size,ALEXANDER📐 size


In [7]:
print("Checking for NaN values in the 'Address' column:")
nan_count = telegram_data['Address'].isnull().sum()
print(f"Number of NaN values in 'Address' column: {nan_count}")

Checking for NaN values in the 'Address' column:
Number of NaN values in 'Address' column: 0
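A NaN count of zero does not by itself mean every field was parsed correctly: in the last rows of the preview above, the same fragment (for example `1900☎️ ስልክ`) appears in several columns at once. A minimal sketch for flagging such rows follows; `has_repeated_fields` and its `min_repeats` threshold are hypothetical choices, and the placeholder `Unknown` is ignored because it legitimately repeats.

```python
from collections import Counter


def has_repeated_fields(row, columns, min_repeats=3):
    """True if one non-placeholder value fills at least `min_repeats` of the columns."""
    values = [str(row[c]).strip() for c in columns if c in row]
    counts = Counter(v for v in values if v and v != "Unknown")
    return any(n >= min_repeats for n in counts.values())


columns = ["Price", "Size", "Phone Number", "Contact Link", "Address"]
parsed_ok = {
    "Price": "3500", "Size": "40,41,42,43,44", "Phone Number": "0944222069",
    "Contact Link": "@Bellaclassics", "Address": "ሜክሲኮ ኬኬር ህንፃ",
}
parsed_badly = {column: "LUKAI MC\U0001F4D0 size" for column in columns}

print(has_repeated_fields(parsed_ok, columns))
print(has_repeated_fields(parsed_badly, columns))
```

With a DataFrame, the same check could be applied row by row, for instance through `DataFrame.apply` with `axis=1`.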


In [8]:
# Select the 'Address' column (a pandas Series) for inspection
address_series = telegram_data['Address']
address_series

0      ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...
1      ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...
2      ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...
3      ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...
4      ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...
                             ...                        
407                                           1900☎️ ስልክ
408                                           1900☎️ ስልክ
409                                       LUKAI MC📐 size
410                                      ALEXANDER📐 size
411                                AIR-FORCE 1 LOW📐 size
Name: Address, Length: 412, dtype: object

<h1>✨ Extracting Unique Characters from a CSV Column ✨</h1>

In [9]:
# Import pandas
import pandas as pd

# Load the CSV file
csv_file = r"C:\Users\fikad\Desktop\10acedamy\EthioMart-NER-Named-Entity-Recognition-\Data\messages.csv"
df = pd.read_csv(csv_file)

# Combine all rows in the 'Address' column into a single string
combined_text = " ".join(df["Address"].astype(str))

# Find unique characters
unique_chars = sorted(set(combined_text))

# Print the unique characters
print("Unique characters found:")
print(unique_chars)


Unique characters found:
[' ', "'", '(', ')', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', '\xa0', '®', '×', 'ሁ', 'ህ', 'ለ', 'ሉ', 'ላ', 'ሌ', 'ል', 'መ', 'ሙ', 'ማ', 'ሜ', 'ም', 'ሪ', 'ራ', 'ር', 'ሱ', 'ሲ', 'ስ', 'ሻ', 'ሽ', 'ቀ', 'ቁ', 'ቅ', 'ቆ', 'በ', 'ተ', 'ት', 'ና', 'ን', 'ኙ', 'ኛ', 'ኝ', 'አ', 'እ', 'ከ', 'ኩ', 'ኬ', 'ክ', 'ኮ', 'ወ', 'ዋ', 'ው', 'ዓ', 'የ', 'ዩ', 'ያ', 'ይ', 'ደ', 'ዲ', 'ዳ', 'ድ', 'ጀ', 'ጅ', 'ገ', 'ጋ', 'ጡ', 'ጥ', 'ጫ', 'ፃ', 'ፋ', 'ፍ', 'ፎ', 'ፕ', '\u200d', '☎', '️', '🇫', '🇷', '🏘', '🐊', '👔', '👢', '👧', '👩', '💰', '📐', '📥', '🦰']
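A flat list like this is easier to act on once the characters are grouped by kind, since that shows at a glance which ones are Ethiopic letters to keep and which are symbols to strip. The sketch below uses `unicodedata` for the grouping; the category labels and the helper names `classify_char` and `group_chars` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def classify_char(ch):
    """Assign a character to a coarse group based on its Unicode name and category."""
    name = unicodedata.name(ch, "")
    if name.startswith("ETHIOPIC"):
        return "ethiopic"
    if name.startswith("LATIN"):
        return "latin"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "whitespace"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    return "other"   # symbols, emoji, joiners, and anything else


def group_chars(chars):
    """Group the distinct characters of `chars` by their class."""
    groups = defaultdict(list)
    for ch in sorted(set(chars)):
        groups[classify_char(ch)].append(ch)
    return dict(groups)


# A few of the characters from the list above, written as escapes where invisible
sample = [" ", "(", "2", "A", "z", "\xa0", "®", "ሜ", "\u200d", "\u260e", "\U0001F4B0"]
for kind, members in group_chars(sample).items():
    print(kind, [f"U+{ord(c):04X}" for c in members])
```

Characters that land in `other` are the candidates for the removal step that follows.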


In [11]:
import re
import pandas as pd

# Define a function to remove emojis and specific symbols
def remove_emojis_and_symbols(text):
    # Each range covers one Unicode block. A single wide range such as
    # U+24C2..U+1F251 would also delete unrelated scripts (e.g., CJK and Ethiopic Extended).
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U0001F1E6-\U0001F1FF"  # regional indicators (flags such as 🇫🇷)
        "\U0001F200-\U0001F251"  # Enclosed Ideographic Supplement
        "\u2600-\u26FF"          # Miscellaneous Symbols (includes ☎)
        "\u2702-\u27B0"          # Dingbats
        "\u2B00-\u2BFF"          # Miscellaneous Symbols and Arrows
        "\u24C2"                 # circled Latin capital letter M
        "\uFE0F\u200D"           # variation selector-16 and zero-width joiner from emoji sequences
        "]+"
    )
    # Additional dataset-specific symbols to remove
    additional_symbols = r"[®×]"

    # Remove emojis and additional symbols
    text = emoji_pattern.sub('', text)
    text = re.sub(additional_symbols, '', text)

    return text


In [12]:
telegram_data['Address'] = telegram_data['Address'].astype(str).apply(remove_emojis_and_symbols)

# Display the updated DataFrame
print(telegram_data.head())

             Date   Time          Sender         Product Name  Price  \
0  22 August 2023  07:30  BELLA CLASSIC®     crocodile school  2,800   
1  22 August 2023  07:57  BELLA CLASSIC®  ALLIGATOR CROCODILE   3500   
2  22 August 2023  08:31  BELLA CLASSIC®             Nike ACG   3500   
3  22 August 2023  12:14  BELLA CLASSIC®            Nike sb w   3400   
4  23 August 2023  11:51  BELLA CLASSIC®        ADIDAS OZELIA   3400   

             Size  Made In Phone Number    Color    Contact Link  \
0  40,41,42,43,44   Turkey   0944222069  2 kinds  @Bellaclassics   
1  40,41,42,43,44   Turkey   0944222069  2 kinds  @Bellaclassics   
2  40,41,42,43,44  Vietnam   0944222069  2 kinds  @Bellaclassics   
3  40,41,42,43,44  Vietnam   0944222069  4 kinds  @Bellaclassics   
4  40,41,42,43,44  Vietnam   0944222069  2 kinds  @Bellaclassics   

                                             Address  
0  ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...  
1  ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል 

In [13]:
output_csv_file = r"C:\Users\fikad\Desktop\10acedamy\EthioMart-NER-Named-Entity-Recognition-\Data\cleaned_telegram_data.csv"
telegram_data.to_csv(output_csv_file, index=False)

# Display the updated DataFrame
print(telegram_data.head())

             Date   Time          Sender         Product Name  Price  \
0  22 August 2023  07:30  BELLA CLASSIC®     crocodile school  2,800   
1  22 August 2023  07:57  BELLA CLASSIC®  ALLIGATOR CROCODILE   3500   
2  22 August 2023  08:31  BELLA CLASSIC®             Nike ACG   3500   
3  22 August 2023  12:14  BELLA CLASSIC®            Nike sb w   3400   
4  23 August 2023  11:51  BELLA CLASSIC®        ADIDAS OZELIA   3400   

             Size  Made In Phone Number    Color    Contact Link  \
0  40,41,42,43,44   Turkey   0944222069  2 kinds  @Bellaclassics   
1  40,41,42,43,44   Turkey   0944222069  2 kinds  @Bellaclassics   
2  40,41,42,43,44  Vietnam   0944222069  2 kinds  @Bellaclassics   
3  40,41,42,43,44  Vietnam   0944222069  4 kinds  @Bellaclassics   
4  40,41,42,43,44  Vietnam   0944222069  2 kinds  @Bellaclassics   

                                             Address  
0  ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...  
1  ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል 

<h1>✨ Named Entity Recognition (NER) Labeling Function ✨</h1>

In [14]:
import re

# Single-token keywords: tokens are split on whitespace, so multi-word names could never match
LOCATION_KEYWORDS = ['Addis', 'Ababa', 'ለቡ', 'መዳህኒዓለም', 'መገናኛ', 'ቦሌ', 'ሜክሲኮ']
PRICE_KEYWORDS = ['ETB', 'ዋጋ', '$', 'ብር']

def label_message_utf8_with_birr(message):
    # Split the message at the first '\n'; the first line is treated as the product name
    first_line, _, remaining_message = message.partition('\n')

    labeled_tokens = []

    # Label the first token of the first line as B-PRODUCT and the rest as I-PRODUCT
    for i, token in enumerate(re.findall(r'\S+', first_line)):
        labeled_tokens.append(f"{token} {'B' if i == 0 else 'I'}-PRODUCT")

    # Label the remaining lines token by token
    for line in remaining_message.split('\n'):
        previous_entity = None  # entity type of the previous token on this line
        for token in re.findall(r'\S+', line):  # \S+ also matches non-ASCII characters
            if re.match(r'^\d{10,}$', token):
                entity = None  # runs of 10 or more digits are phone numbers, not prices
            elif re.match(r'^\d+(\.\d{1,2})?$', token) or any(k in token for k in PRICE_KEYWORDS):
                entity = 'PRICE'  # a number or a price keyword (e.g., 500, ETB, $100, ብር)
            elif any(loc in token for loc in LOCATION_KEYWORDS):
                entity = 'LOC'  # a known city or area name
            else:
                entity = None  # any other token is outside an entity

            if entity is None:
                labeled_tokens.append(f"{token} O")
            else:
                # BIO scheme: B- opens an entity, I- continues the same entity type
                prefix = 'I' if previous_entity == entity else 'B'
                labeled_tokens.append(f"{token} {prefix}-{entity}")
            previous_entity = entity

    return "\n".join(labeled_tokens)

In [15]:
# Apply the updated function to the non-null messages
telegram_data['Labeled_Address'] = telegram_data['Address'].apply(label_message_utf8_with_birr)

# Display the updated DataFrame
telegram_data.head()

Unnamed: 0,Date,Time,Sender,Product Name,Price,Size,Made In,Phone Number,Color,Contact Link,Address,Labeled_Address
0,22 August 2023,07:30,BELLA CLASSIC®,crocodile school,"2,800","40,41,42,43,44",Turkey,0944222069,2 kinds,@Bellaclassics,ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...,ሜክሲኮ B-PRODUCT\nኬኬር I-PRODUCT\nህንፃ I-PRODUCT\n...
1,22 August 2023,07:57,BELLA CLASSIC®,ALLIGATOR CROCODILE,3500,"40,41,42,43,44",Turkey,0944222069,2 kinds,@Bellaclassics,ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...,ሜክሲኮ B-PRODUCT\nኬኬር I-PRODUCT\nህንፃ I-PRODUCT\n...
2,22 August 2023,08:31,BELLA CLASSIC®,Nike ACG,3500,"40,41,42,43,44",Vietnam,0944222069,2 kinds,@Bellaclassics,ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...,ሜክሲኮ B-PRODUCT\nኬኬር I-PRODUCT\nህንፃ I-PRODUCT\n...
3,22 August 2023,12:14,BELLA CLASSIC®,Nike sb w,3400,"40,41,42,43,44",Vietnam,0944222069,4 kinds,@Bellaclassics,ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...,ሜክሲኮ B-PRODUCT\nኬኬር I-PRODUCT\nህንፃ I-PRODUCT\n...
4,23 August 2023,11:51,BELLA CLASSIC®,ADIDAS OZELIA,3400,"40,41,42,43,44",Vietnam,0944222069,2 kinds,@Bellaclassics,ሜክሲኮ ኬኬር ህንፃ 2ኛ ፎቅ ልክ እንደወጡ በዲስፕሌው በስተቀኝ በኩል እ...,ሜክሲኮ B-PRODUCT\nኬኬር I-PRODUCT\nህንፃ I-PRODUCT\n...


In [16]:

# Define the directory and file path
output_dir = r"C:\Users\fikad\Desktop\10acedamy\EthioMart-NER-Named-Entity-Recognition-\Data"
labeled_data_birr_path = os.path.join(output_dir, 'labeled_telegram_product_price_location.txt')

# Save the updated labeled dataset to a file in CoNLL format
with open(labeled_data_birr_path, 'w', encoding='utf-8') as f:
    # One labeled message per block; a blank line separates messages
    for labeled_message in telegram_data['Labeled_Address']:
        f.write(f"{labeled_message}\n\n")

print(f"Labeled data has been saved to: {labeled_data_birr_path}")


Labeled data has been saved to: C:\Users\fikad\Desktop\10acedamy\EthioMart-NER-Named-Entity-Recognition-\Data\labeled_telegram_product_price_location.txt
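To sanity-check the saved file before training, it helps to read it back and count the labels. The sketch below assumes the layout written above: one `token label` pair per line, with blank lines between messages. `parse_conll` and `label_counts` are hypothetical helper names, and the sample string stands in for the file contents.

```python
from collections import Counter


def parse_conll(text):
    """Return a list of sentences, each a list of (token, label) pairs."""
    sentences, current = [], []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.rsplit(" ", 1)   # the label is the last space-separated field
        if len(parts) == 2:           # skip lines that carry no label
            current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def label_counts(sentences):
    """Count how often each label occurs across all sentences."""
    return Counter(label for sentence in sentences for _, label in sentence)


sample = "ሜክሲኮ B-PRODUCT\nኬኬር I-PRODUCT\n\nNike B-PRODUCT\nACG I-PRODUCT\nsize O\n\n"
sentences = parse_conll(sample)
print(len(sentences), dict(label_counts(sentences)))
```

A heavily skewed label count, or a label that never appears, is a quick signal that the rule-based labeling needs another look.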


In [None]:
# %pip installs into the environment of the running kernel
%pip install torch torchvision torchaudio transformers tenacity

In [17]:
import torch
print(torch.__version__)

2.5.1+cpu


In [18]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
from tenacity import retry, stop_after_attempt, wait_exponential


In [20]:
# Retry the download up to 5 times, waiting between 2 and 10 seconds with exponential backoff
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=2, max=10))
def load_model():
    model_name = "masakhane/afroxlmr-large-ner-masakhaner-1.0_2.0"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    return tokenizer, model

tokenizer, model = load_model()
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# Example usage: run the pipeline on the address at position 10
example = telegram_data['Address'].iloc[10]
ner_results = nlp(example)
print(ner_results)

Device set to use cpu


[{'entity': 'B-LOC', 'score': np.float32(0.99999595), 'index': 1, 'word': '▁ሜ', 'start': 0, 'end': 1}, {'entity': 'I-LOC', 'score': np.float32(0.9981389), 'index': 2, 'word': 'ክሲ', 'start': 1, 'end': 3}, {'entity': 'I-LOC', 'score': np.float32(0.99996245), 'index': 3, 'word': 'ኮ', 'start': 3, 'end': 4}, {'entity': 'I-LOC', 'score': np.float32(0.93774956), 'index': 4, 'word': '▁', 'start': 5, 'end': 6}, {'entity': 'I-LOC', 'score': np.float32(0.81205964), 'index': 5, 'word': 'ኬ', 'start': 5, 'end': 6}, {'entity': 'I-LOC', 'score': np.float32(0.7874235), 'index': 6, 'word': 'ኬ', 'start': 6, 'end': 7}, {'entity': 'I-LOC', 'score': np.float32(0.75342035), 'index': 7, 'word': 'ር', 'start': 7, 'end': 8}, {'entity': 'I-LOC', 'score': np.float32(0.8010717), 'index': 8, 'word': '▁', 'start': 9, 'end': 10}, {'entity': 'I-LOC', 'score': np.float32(0.684481), 'index': 10, 'word': 'ፃ', 'start': 11, 'end': 12}]
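The pipeline output above is at subword level, so one place name is spread over several entries. A minimal sketch of merging such entries into entity spans via their character offsets follows. The helper name `merge_entities` is hypothetical, the sample uses a subset of the entries above with rounded scores, and the text is assumed to begin like the addresses shown earlier.

```python
def merge_entities(results, text):
    """Merge token-level predictions into entity spans using character offsets."""
    spans = []
    for item in results:
        prefix, _, entity_type = item["entity"].partition("-")
        continues = bool(spans) and prefix == "I" and spans[-1]["type"] == entity_type
        if continues:
            # Extend the open span and keep its weakest token score
            spans[-1]["end"] = max(spans[-1]["end"], item["end"])
            spans[-1]["score"] = min(spans[-1]["score"], float(item["score"]))
        else:
            spans.append({
                "type": entity_type,
                "start": item["start"],
                "end": item["end"],
                "score": float(item["score"]),
            })
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


text = "ሜክሲኮ ኬኬር ህንፃ"
results = [
    {"entity": "B-LOC", "score": 1.0, "start": 0, "end": 1},
    {"entity": "I-LOC", "score": 0.998, "start": 1, "end": 3},
    {"entity": "I-LOC", "score": 1.0, "start": 3, "end": 4},
    {"entity": "I-LOC", "score": 0.812, "start": 5, "end": 6},
    {"entity": "I-LOC", "score": 0.787, "start": 6, "end": 7},
    {"entity": "I-LOC", "score": 0.753, "start": 7, "end": 8},
]
print(merge_entities(results, text))
```

The `transformers` token-classification pipeline also accepts an `aggregation_strategy` argument that groups subwords; the helper here only makes the grouping logic explicit.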


In [None]:
# Function to check if a string contains Amharic characters
def is_amharic(message):
    return bool(re.search(r'[\u1200-\u137F]', message))
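A yes/no check treats a message with a single Ethiopic character the same as one written fully in Amharic. As a possible refinement, the sketch below measures the share of non-space characters that fall in the same Ethiopic block (U+1200 to U+137F); `amharic_ratio` is a hypothetical helper, and any threshold applied to it would be a project decision.

```python
import re

# Same Ethiopic block as the check above
ETHIOPIC_CHAR = re.compile(r'[\u1200-\u137F]')


def amharic_ratio(message):
    """Return the fraction of non-space characters that are Ethiopic."""
    characters = [ch for ch in message if not ch.isspace()]
    if not characters:
        return 0.0
    ethiopic = sum(1 for ch in characters if ETHIOPIC_CHAR.match(ch))
    return ethiopic / len(characters)


for message in ["ሜክሲኮ", "Nike ACG", "ቦሌ AB"]:
    print(message, amharic_ratio(message))
```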
