# Tokenization 

This notebook tokenizes Amharic text with several approaches: rule-based, statistical, and pre-trained. We will:

* Tokenize with the Amharic Segmenter (`amseg.amharicSegmenter.AmharicSegmenter`).
* Tokenize with NLTK (`nltk.tokenize.word_tokenize`).
* Tokenize with a regular-expression tokenizer.
* Tokenize with the pre-trained XLM-RoBERTa base tokenizer, which is built on SentencePiece.

In [2]:
# import basic libraries
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from amseg.amharicSegmenter import AmharicSegmenter
from transformers import AutoTokenizer
nltk.download('punkt')
import warnings
warnings.filterwarnings("ignore")

  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\liulj\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [14]:
# Add the parent directory (project root) to sys.path so the `scripts` package can be imported
import os, sys
sys.path.append(os.path.abspath('..'))

# Import basic functions from the scripts folder
from scripts.tokenization import amseg_tokenizer, nltk_tokenizer, regex_tokenizer, xlm_roberta_tokenizer

Load preprocessed data 

In [4]:
data = pd.read_csv('../data/preprocessed_telegram_data.csv')

In [5]:
data.head()

Unnamed: 0,Channel Title,Channel Username,Message Id,Message Text,Message Date,Media Path
0,አዳማ ገበያ - Adama gebeya,@gebeyaadama,4310,nipple shield የእናት ጡት ጫፍ ማራዘሚያ ዋጋ 450 ብር 09117...,2025-01-17 10:32:58+00:00,../data/photos\@gebeyaadama_4310.jpg
1,አዳማ ገበያ - Adama gebeya,@gebeyaadama,4309,Marc Jacob 3 in 1 glasses,2025-01-16 08:37:21+00:00,
2,አዳማ ገበያ - Adama gebeya,@gebeyaadama,4306,Marc Jacob 3 in 1 sunglass 3 መነፀር በ 1 የያዘ 1 ከስ...,2025-01-16 08:21:16+00:00,../data/photos\@gebeyaadama_4306.jpg
3,አዳማ ገበያ - Adama gebeya,@gebeyaadama,4304,Door Bottom Sealer,2025-01-13 09:16:50+00:00,
4,አዳማ ገበያ - Adama gebeya,@gebeyaadama,4303,Door Bottom Sealer አየር ከውጭ ወደ ቤት ውስጥ እንዳይገባ ይከ...,2025-01-13 09:16:45+00:00,../data/photos\@gebeyaadama_4303.jpg


In [9]:
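# Column to be tokenized; astype(str) ensures every value is a string
# (note that any missing value becomes the literal string 'nan')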
tokenized_column = data['Message Text'].astype(str)

In [10]:
tokenized_column.head()

0    nipple shield የእናት ጡት ጫፍ ማራዘሚያ ዋጋ 450 ብር 09117...
1                            Marc Jacob 3 in 1 glasses
2    Marc Jacob 3 in 1 sunglass 3 መነፀር በ 1 የያዘ 1 ከስ...
3                                   Door Bottom Sealer
4    Door Bottom Sealer አየር ከውጭ ወደ ቤት ውስጥ እንዳይገባ ይከ...
Name: Message Text, dtype: object

### Amharic Segmenter Tokenization: using `amseg.amharicSegmenter`

In [11]:
# Tokenize using the Amharic Segmenter
amseg_tokens = tokenized_column.apply(amseg_tokenizer)

In [31]:
amseg_tokens.head(1).to_list()

[['nipple',
  'shield',
  'የእናት',
  'ጡት',
  'ጫፍ',
  'ማራዘሚያ',
  'ዋጋ',
  '450',
  'ብር',
  '0911762201',
  '0972824252',
  '0988404491',
  '0922282582',
  'በቴሌግራም',
  'ለማዘዝ',
  '@GebeyaAdama21',
  'አድራሻችን',
  'አዳማ',
  'ፖስታ',
  'ቤት',
  'ሶሬቲ',
  'ሞል',
  'ምድር',
  'ላይ',
  'ሱ.ቁ',
  '33',
  'ይሄንን',
  'በመጫን',
  'የቤተሠባችን',
  'አባል',
  'ይሁኑ',
  'የመረጡትን',
  'እቃ',
  'ይዘዙ',
  '፤',
  'ያሉበት',
  'እናደርሳለን',
  '!!',
  'በኪስዎ',
  'ጥሬ',
  'ገንዘብ',
  'ካልያዙ',
  'በሞባይል',
  'ማስተላለፍ',
  'ይችላሉ',
  '።']]

### NLTK Word Tokenizer

In [16]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\liulj\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [17]:
# Tokenize using the NLTK tokenizer
nltk_tokens = tokenized_column.apply(nltk_tokenizer)

In [18]:
nltk_tokens.head(10)

0    [nipple, shield, የእናት, ጡት, ጫፍ, ማራዘሚያ, ዋጋ, 450,...
1                     [Marc, Jacob, 3, in, 1, glasses]
2    [Marc, Jacob, 3, in, 1, sunglass, 3, መነፀር, በ, ...
3                               [Door, Bottom, Sealer]
4    [Door, Bottom, Sealer, አየር, ከውጭ, ወደ, ቤት, ውስጥ, ...
5    [6, in, 1, silicone, set, 6, እቃዎች, የያዘ, መመገቢያ,...
6    [Foldable, High, Capacity, Travel, Bag, ውሃ, የማ...
7    [ምግብ, ማስጀመሪያ, ሠኀን, የራሱ, ማንኪያ, ያለው, ልጆች, ጐትተው, ...
8    [Grooming, set, 3in1, hair, clipper, Mini, sha...
9    [Geemy, hair, clapper, መብራት, ጠፋ, አልጠፋ, ብሎ, መጨነ...
Name: Message Text, dtype: object

### Regex Tokenization

In [19]:
# Tokenize using the regular expression tokenizer
regex_tokens = tokenized_column.apply(regex_tokenizer)

In [25]:
regex_tokens.head()

0    [nipple, shield, የእናት, ጡት, ጫፍ, ማራዘሚያ, ዋጋ, 450,...
1                     [Marc, Jacob, 3, in, 1, glasses]
2    [Marc, Jacob, 3, in, 1, sunglass, 3, መነፀር, በ, ...
3                               [Door, Bottom, Sealer]
4    [Door, Bottom, Sealer, አየር, ከውጭ, ወደ, ቤት, ውስጥ, ...
Name: Message Text, dtype: object
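
The `regex_tokenizer` helper is imported from `scripts/tokenization.py` and its body is not shown in this notebook. As a rough illustration only, here is a minimal sketch of what such a tokenizer could look like. The function name `regex_tokenizer_sketch` and the pattern are assumptions for illustration, not the project's actual code. The pattern keeps `@handles` together, treats runs of word characters (Latin, digits, and Ethiopic syllables) as tokens, and emits every other non-space symbol on its own.

```python
import re

# Assumed pattern (not the project's actual one):
#   @\w+      a Telegram-style handle such as @GebeyaAdama21
#   \w+       a run of letters/digits; in Python 3 this covers Ethiopic script
#   [^\w\s]   any single remaining symbol, e.g. the Ethiopic full stop "።"
TOKEN_PATTERN = re.compile(r"@\w+|\w+|[^\w\s]")


def regex_tokenizer_sketch(text):
    """Split text into handle, word/number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sample = "ዋጋ 450 ብር። በቴሌግራም ለማዘዝ @GebeyaAdama21"
print(regex_tokenizer_sketch(sample))
```

With this particular pattern an abbreviation such as `ሱ.ቁ` would be split at the period, so a project-specific pattern may need extra alternatives for such cases.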

### Pre-Trained XLM-RoBERTa-Base Tokenization

In [21]:
# Tokenize using the XLM-RoBERTa tokenizer
xlm_roberta_tokens = tokenized_column.apply(xlm_roberta_tokenizer)

In [32]:
xlm_roberta_tokens.head(1).to_list()

[['▁ni',
  'pp',
  'le',
  '▁',
  'shield',
  '▁የእ',
  'ናት',
  '▁',
  'ጡት',
  '▁',
  'ጫ',
  'ፍ',
  '▁ማ',
  'ራ',
  'ዘ',
  'ሚያ',
  '▁ዋጋ',
  '▁450',
  '▁ብር',
  '▁09',
  '11',
  '762',
  '201',
  '▁09',
  '72',
  '82',
  '42',
  '52',
  '▁09',
  '884',
  '04',
  '491',
  '▁09',
  '22',
  '28',
  '25',
  '82',
  '▁በ',
  'ቴ',
  'ሌ',
  'ግራ',
  'ም',
  '▁ለማ',
  'ዘ',
  'ዝ',
  '▁@',
  'Ge',
  'be',
  'ya',
  'Adam',
  'a',
  '21',
  '▁አድራሻ',
  'ችን',
  '▁አዳ',
  'ማ',
  '▁ፖ',
  'ስታ',
  '▁ቤት',
  '▁',
  'ሶ',
  'ሬ',
  'ቲ',
  '▁ሞ',
  'ል',
  '▁ምድር',
  '▁ላይ',
  '▁',
  'ሱ',
  '.',
  'ቁ',
  '▁33',
  '▁ይሄን',
  'ን',
  '▁በመ',
  'ጫ',
  'ን',
  '▁የቤተ',
  'ሠ',
  'ባ',
  'ችን',
  '▁አባል',
  '▁ይ',
  'ሁኑ',
  '▁የ',
  'መረጡ',
  'ትን',
  '▁እ',
  'ቃ',
  '▁ይዘ',
  'ዙ',
  '፤',
  '▁ያሉ',
  'በት',
  '▁እና',
  'ደር',
  'ሳ',
  'ለን',
  '!!',
  '▁በ',
  'ኪ',
  'ስ',
  'ዎ',
  '▁ጥ',
  'ሬ',
  '▁ገንዘብ',
  '▁ካል',
  'ያዙ',
  '▁በ',
  'ሞባይል',
  '▁ማስ',
  'ተላለፍ',
  '▁ይችላሉ',
  '።']]
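
In the output above, `▁` (U+2581) is the SentencePiece marker for a preceding space, so a piece that starts with `▁` begins a new word and the other pieces continue the current one. The sketch below, using a hypothetical helper named `merge_subword_pieces`, shows how a few of the pieces listed above can be joined back into whole words. It is an illustration of the marker convention, not part of the project's scripts.

```python
WORD_START = "\u2581"  # "▁", the SentencePiece marker for a preceding space


def merge_subword_pieces(pieces):
    """Join SentencePiece pieces back into whitespace-delimited words."""
    # Concatenate all pieces, turn each marker back into a space, then split.
    return "".join(pieces).replace(WORD_START, " ").split()


# A few pieces copied from the XLM-RoBERTa output above.
pieces = ["▁ni", "pp", "le", "▁", "shield", "▁የእ", "ናት", "▁ዋጋ", "▁450", "▁ብር"]
print(merge_subword_pieces(pieces))
```

The merged words line up with the word-level tokens produced by the Amharic Segmenter for the same text, which makes the two outputs easier to compare side by side.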

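**Note:** Of the tokenization methods above, the Amharic Segmenter (`amseg`) suits this project best. It keeps Amharic words, phone numbers, and handles such as `@GebeyaAdama21` intact as single tokens, whereas the XLM-RoBERTa tokenizer breaks them into subword pieces.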
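
One simple way to put a number on this comparison is to count tokens per message for each method. The sketch below is a hedged illustration: `average_token_count` is a hypothetical helper, and the two small lists reuse a handful of tokens from the outputs above rather than the full dataset.

```python
from statistics import mean


def average_token_count(token_lists):
    """Mean number of tokens per message; an empty input gives 0.0."""
    counts = [len(tokens) for tokens in token_lists]
    return mean(counts) if counts else 0.0


# Word-level tokens and the matching subword pieces for two short fragments,
# copied from the Amharic Segmenter and XLM-RoBERTa outputs above.
word_level = [["nipple", "shield"], ["ዋጋ", "450", "ብር"]]
subword_level = [["▁ni", "pp", "le", "▁", "shield"], ["▁ዋጋ", "▁450", "▁ብር"]]

word_avg = average_token_count(word_level)
subword_avg = average_token_count(subword_level)
print(word_avg, subword_avg, subword_avg / word_avg)
```

On the notebook's own Series the same measure could be obtained with, for example, `amseg_tokens.apply(len).mean()` and `xlm_roberta_tokens.apply(len).mean()`.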

**Save tokenized data**

In [33]:
tokenized_file_path = '../data/tokenized_telegram_data.csv'
amseg_tokens.to_csv(tokenized_file_path, index=False, encoding='utf-8')
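
Because each cell holds a Python list, the CSV stores it as its string form (for example `"['ዋጋ', '450', 'ብር']"`), and reading the file back yields strings rather than lists. The sketch below shows one way to restore the lists with `ast.literal_eval`. It simulates the saved file in memory with the standard `csv` module so that it runs on its own; the single row is made-up sample data built from tokens shown earlier.

```python
import ast
import csv
import io

# Simulate the saved file in memory: one column, each cell is str(list).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Message Text"])              # header, as written for the Series
writer.writerow([str(["ዋጋ", "450", "ብር"])])    # one tokenized message

# Read it back and parse each cell from its string form into a real list.
buffer.seek(0)
reader = csv.DictReader(buffer)
restored = [ast.literal_eval(row["Message Text"]) for row in reader]
print(restored)
```

With pandas, the same idea can be applied at load time, for instance `pd.read_csv(tokenized_file_path, converters={'Message Text': ast.literal_eval})`, assuming the column header is `Message Text`.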