In [7]:
!pip install transformers

Defaulting to user installation because normal site-packages is not writeable
Collecting transformers
  Downloading transformers-4.48.1-py3-none-any.whl (9.7 MB)
Collecting tokenizers<0.22,>=0.21
  Downloading tokenizers-0.21.0-cp39-abi3-win_amd64.whl (2.4 MB)
Collecting safetensors>=0.4.1
  Using cached safetensors-0.5.2-cp38-abi3-win_amd64.whl (303 kB)
Collecting huggingface-hub<1.0,>=0.24.0
  Using cached huggingface_hub-0.27.1-py3-none-any.whl (450 kB)
Collecting fsspec>=2023.5.0
  Using cached fsspec-2024.12.0-py3-none-any.whl (183 kB)
Installing collected packages: safetensors, fsspec, huggingface-hub, tokenizers, transformers
Successfully installed fsspec-2024.12.0 huggingface-hub-0.27.1 safetensors-0.5.2 tokenizers-0.21.0 transformers-4.48.1


In [8]:
import pandas as pd
import re
from transformers import pipeline

In [2]:
# Load the reduced dataset
df = pd.read_csv('final_data.csv')

In [3]:
# Function to preprocess Amharic text
def preprocess_text(text):
    if pd.isna(text):
        return "[UNKNOWN]"  # Handle missing values
    
    # Clean the text: strip URLs first (while "://" is still intact), then
    # punctuation, then everything outside the Ethiopic block (Latin text,
    # digits, emojis, underscores)
    text = re.sub(r'https?://\S+', '', text)  # Remove URLs
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'[^\u1200-\u137F\s]', '', text)  # Remove non-Ethiopic characters
    
    # Tokenize (split text into words)
    tokens = text.split()
    return " ".join(tokens)  # Join tokens with spaces for compatibility

# Apply preprocessing to the Message column
df['Preprocessed Message'] = df['Message'].apply(preprocess_text)

# Save the preprocessed dataset
df.to_csv('preprocessed_data.csv', index=False)
print("Preprocessing complete! Saved as 'preprocessed_data.csv'.")

Preprocessing complete! Saved as 'preprocessed_data.csv'.
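The cleaning step above deletes non-Ethiopic characters outright, so words joined only by an underscore or a symbol run together (the hashtag `#በተመጣጣኝ_ዋጋ` shows up as `በተመጣጣኝዋጋ` in the output table). A minimal sketch of one possible variant follows; it is not the notebook's method. It replaces unwanted characters with a space instead of deleting them. The function name `clean_amharic`, the sample string, and the URL in it are hypothetical, and the character ranges assume that U+1360 to U+1368 (Ethiopic punctuation) should be dropped like other punctuation.

```python
import re

# Assumption: URLs are whitespace-delimited and start with http:// or https://
URL_RE = re.compile(r"https?://\S+")

# Keep Ethiopic letters (U+1200-U+135F) and Ethiopic numerals (U+1369-U+137F);
# U+1360-U+1368 are Ethiopic punctuation marks and are dropped on purpose.
NON_ETHIOPIC_RE = re.compile(r"[^\u1200-\u135F\u1369-\u137F\s]")


def clean_amharic(text: str) -> str:
    """Hypothetical variant of preprocess_text that keeps word boundaries."""
    text = URL_RE.sub(" ", text)           # URLs first, while "://" is intact
    text = NON_ETHIOPIC_RE.sub(" ", text)  # replace, do not delete
    return " ".join(text.split())          # collapse runs of whitespace


# Made-up sample in the style of the dataset's messages
sample = "#በተመጣጣኝ_ዋጋ ለአጠቃቀም ምቹ https://t.me/example 🎯"
cleaned = clean_amharic(sample)
print(cleaned)
```

Whether splitting hashtags into separate words is desirable depends on the labeling scheme; if hashtags should stay single tokens, the original deletion behavior may be preferable.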


In [4]:
df

Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path,Preprocessed Message
0,SINA KIDS/ሲና ኪድስⓇ,@sinayelj,15344,,2025-01-20 05:23:15+00:00,@sinayelj_15344.jpg,[UNKNOWN]
1,SINA KIDS/ሲና ኪድስⓇ,@sinayelj,15343,,2025-01-20 05:23:15+00:00,@sinayelj_15343.jpg,[UNKNOWN]
2,SINA KIDS/ሲና ኪድስⓇ,@sinayelj,15342,,2025-01-20 05:23:15+00:00,@sinayelj_15342.jpg,[UNKNOWN]
3,SINA KIDS/ሲና ኪድስⓇ,@sinayelj,15341,,2025-01-20 05:23:15+00:00,@sinayelj_15341.jpg,[UNKNOWN]
4,SINA KIDS/ሲና ኪድስⓇ,@sinayelj,15340,,2025-01-20 05:23:15+00:00,@sinayelj_15340.jpg,[UNKNOWN]
...,...,...,...,...,...,...,...
10903,Sheger online-store,@Shageronlinestore,12,🎯 Kitchen Sticker\n\nለኪችንዎ ውበት እጅግ ተመራጭ \n🔰ውሀ ...,2021-04-27 05:58:59+00:00,@Shageronlinestore_12.jpg,ለኪችንዎ ውበት እጅግ ተመራጭ ውሀ የማያስገባ ቅባት ዘይት ነገሮች የማያበ...
10904,Sheger online-store,@Shageronlinestore,10,🎯 3in1 One Step Hair Dryer & Styler \n\n👉 ከርል ...,2021-04-27 05:57:12+00:00,@Shageronlinestore_10.jpg,ከርል ለመስራት ለማለስለስ እንዲሁም ለማድረቅ የሚያገለግል ለኢትዮጵያውያን...
10905,Sheger online-store,@Shageronlinestore,9,✅ Home GYM - X5 slimming vibrator \n\n📢📢📢 ታላቅ ...,2021-04-27 05:45:57+00:00,@Shageronlinestore_9.jpg,ታላቅ ቅናሽ የሰዉነትዎ ውፍረት አሳስቧታል ሙሉ በሙሉ ቦርጭን በአጭር ጊዜ...
10906,Sheger online-store,@Shageronlinestore,3,#Finger_tip_pulse_oximeter\n #በተመጣጣኝ_ዋጋ\...,2021-04-12 08:35:47+00:00,@Shageronlinestore_3.jpg,በተመጣጣኝዋጋ ለአጠቃቀም ምቹ በሰዉነታችን ያለውን የኦክስጅን እና የልብ ...


In [5]:
# Convert dataset to CoNLL format
with open('data_conll.txt', 'w', encoding='utf-8') as f:
    for _, row in df.iterrows():
        text = row['Preprocessed Message']
        tokens = text.split()
        
        for token in tokens:
            f.write(f"{token} O\n")  # Default label is "O"
        
        f.write("\n")  # Separate sentences/messages with a blank line

print("Conversion to CoNLL format complete! Saved as 'data_conll.txt'.")

Conversion to CoNLL format complete! Saved as 'data_conll.txt'.
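A quick way to sanity-check the exported file is to read it back into sentences. The sketch below is one possible reader, assuming the two-column `token label` layout written above; `read_conll` is a hypothetical helper, and the sample text is hand-written rather than taken from `data_conll.txt`. It tolerates consecutive blank lines, which the export produces whenever a message is empty after preprocessing.

```python
import os
import tempfile


def read_conll(path):
    """Parse a 'token label' file into a list of sentences of (token, label)."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # A blank line closes the current sentence; extra blanks are ignored
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, label = line.rsplit(" ", 1)  # the label is the last column
            current.append((token, label))
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


# Hand-written sample with a doubled blank line (an empty message)
sample = "ታላቅ O\nቅናሽ O\n\n\nለኪችንዎ O\nውበት O\n\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

sentences = read_conll(tmp_path)
os.remove(tmp_path)
print(len(sentences), sentences[0])
```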


In [9]:
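# Load XLM-RoBERTa large fine-tuned for NER on English CoNLL-2003.
# The base model is multilingual, but the NER fine-tuning data is English,
# so predictions on Amharic text may be unreliable.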
ner_model = pipeline("ner", model="xlm-roberta-large-finetuned-conll03-english")

def label_with_pretrained_model(message):
    if pd.isna(message) or not message.strip():
        return ""

    labeled_tokens = []
    
    # Run the pipeline; with default settings it returns one dict per
    # subword piece and omits pieces predicted as "O"
    entities = ner_model(message)

    # Emit one "piece label" line per predicted entity piece
    for entity in entities:
        token = entity['word']  # subword piece, not a whitespace token
        label = entity['entity']  # e.g., "I-LOC", "I-PER"
        labeled_tokens.append(f"{token} {label}")

    return "\n".join(labeled_tokens)

# Apply the function to the dataset
df['Labeled_Message'] = df['Message'].apply(label_with_pretrained_model)


All PyTorch model weights were used when initializing TFXLMRobertaForTokenClassification.

All the weights of TFXLMRobertaForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaForTokenClassification for predictions without further training.


Device set to use 0
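Because the pipeline reports subword pieces and skips "O" predictions by default, the `Labeled_Message` strings do not line up with the whitespace tokens used in the CoNLL export. One possible way to reconcile the two is sketched below, under the assumption that each prediction dict carries `start` and `end` character offsets (the pipeline provides these with fast tokenizers; treat that as something to verify in your environment). `align_to_words` is a hypothetical helper, and the `entities` list is hand-written for illustration, not real model output.

```python
import re


def align_to_words(message, entities):
    """Label each whitespace token with the first overlapping entity, else 'O'."""
    lines = []
    for match in re.finditer(r"\S+", message):
        label = "O"
        for ent in entities:
            # Character spans overlap if each one starts before the other ends
            if ent["start"] < match.end() and ent["end"] > match.start():
                label = ent["entity"]
                break
        lines.append(f"{match.group()} {label}")
    return "\n".join(lines)


message = "ዋጋ 500 ብር አዲስ አበባ"

# Hand-written stand-in for pipeline output (NOT produced by the model)
entities = []
for word in ("አዲስ", "አበባ"):
    start = message.index(word)
    entities.append(
        {"entity": "I-LOC", "word": word, "start": start, "end": start + len(word)}
    )

aligned = align_to_words(message, entities)
print(aligned)
```

This keeps every whitespace token in the output, so the result can be written in the same two-column layout as `data_conll.txt`.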


In [10]:
# Save the labeled dataset
df.to_csv('labeled_data.csv', index=False)