# Classifying Arabic Tweets Using AraBERT

**Workflow:**
1. Import Data
2. Load AraBERT model
3. Preprocessing
4. Training and validation
5. Evaluation and prediction
6. Saving the model


In [None]:
!pip install ktrain

Collecting ktrain
  Downloading ktrain-0.28.3.tar.gz (25.3 MB)
Collecting scikit-learn==0.23.2
  Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)
Collecting syntok
  Downloading syntok-1.3.3-py3-none-any.whl (22 kB)
Collecting seqeval==0.0.19
  Downloading seqeval-0.0.19.tar.gz (30 kB)
Collecting transformers<=4.10.3,>=4.0.0
  Downloading transformers-4.10.3-py3-none-any.whl (2.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
# Number GPUs by PCI bus order and expose only GPU 0 to the frameworks
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ktrain
from ktrain import text
from sklearn.metrics import ConfusionMatrixDisplay
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer 

## Import Data

In [None]:
df_train = pd.read_excel('/content/train.xlsx')
df_test = pd.read_excel('/content/test.xlsx')
df_val = pd.read_excel('/content/val.xlsx')


In [None]:
df_train.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,text,Username,Timestamp,followers,label,categorie
0,15363,2412,2412,RT @directaidorg: اللهم اجعل القرآن العظيم ربي...,salislls,Mon Oct 12 20:36:52 +0000 2020,247,2,religion
1,34580,16232,16232,RT @khaleejTraining: #أنت_المدير_القادم مع دبل...,Drrkh911,Tue Oct 13 03:04:14 +0000 2020,121,4,economy
2,10240,12022,12022,@Leboo777 @freedom0871 @fahdwhy11 @houdaifa199...,mgudoo0,Fri Oct 09 22:28:08 +0000 2020,40,2,religion
3,29716,16396,16396,RT @ZADTVChannel: أكثروا من الصلاة على نبينا م...,mr_m_i_,Fri Oct 09 22:37:45 +0000 2020,223,2,religion
4,12680,315,315,RT @eastturkistan33: نطالب المملكة العربية الس...,Nibras30368137,Sun Jan 09 13:09:00 +0000 2022,170,1,politic


In [None]:
# set hyperparameters
maxlen = 64
batch_size = 16
lr = 2e-5
epochs = 3

In [None]:
df_val.isna().sum()

Unnamed: 0        0
Unnamed: 0.1      0
Unnamed: 0.1.1    0
text              0
Username          0
Timestamp         0
followers         0
label             0
categorie         0
dtype: int64

In [None]:
df_train.isna().sum()

Unnamed: 0        0
Unnamed: 0.1      0
Unnamed: 0.1.1    0
text              0
Username          0
Timestamp         0
followers         0
label             0
categorie         0
dtype: int64

In [None]:
df_test.isna().sum()

Unnamed: 0      0
Unnamed: 0.1    0
text            0
Username        0
Timestamp       0
followers       0
label           0
categorie       0
dtype: int64

In [None]:
df_train

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,text,Username,Timestamp,followers,label,categorie
0,15363,2412,2412,RT @directaidorg: اللهم اجعل القرآن العظيم ربي...,salislls,Mon Oct 12 20:36:52 +0000 2020,247,2,religion
1,34580,16232,16232,RT @khaleejTraining: #أنت_المدير_القادم مع دبل...,Drrkh911,Tue Oct 13 03:04:14 +0000 2020,121,4,economy
2,10240,12022,12022,@Leboo777 @freedom0871 @fahdwhy11 @houdaifa199...,mgudoo0,Fri Oct 09 22:28:08 +0000 2020,40,2,religion
3,29716,16396,16396,RT @ZADTVChannel: أكثروا من الصلاة على نبينا م...,mr_m_i_,Fri Oct 09 22:37:45 +0000 2020,223,2,religion
4,12680,315,315,RT @eastturkistan33: نطالب المملكة العربية الس...,Nibras30368137,Sun Jan 09 13:09:00 +0000 2022,170,1,politic
...,...,...,...,...,...,...,...,...,...
30127,16850,45196,45196,اصابعي تتشرمط وقت الرسم 😍😍😍,i9lir,Thu Oct 08 20:58:36 +0000 2020,365,3,art
30128,6265,9248,9248,RT @AbdullahAB00: موقف الأهلي الاقوى قانونيًا ...,FXopVLFcRLGGSlJ,Mon Oct 12 16:49:02 +0000 2020,184,0,sport
30129,11284,32437,32437,بعيدا عن السياسة والدستور. ✋\n\nمعلق المباراة ...,Saiddjabali4,Fri Oct 09 21:45:51 +0000 2020,2064,1,politic
30130,860,35043,35043,RT @wd077: #الاهلي_ينشد_المساواه8\nاهلينا يناش...,moh075021,Sun Jan 09 23:01:29 +0000 2022,7,0,sport


In [None]:
df_test

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,text,Username,Timestamp,followers,label,categorie
0,34281,34281,RT @MnbrAlhilal: 🔵 وزارة الرياضة تُعلن عن إطلا...,blue_heart79,Thu Oct 08 19:21:38 +0000 2020,199,0,sport
1,22038,22038,RT @ead_tamer: مرتضى رفع قضية ضد وزير الرياضة ...,iw6gtLnQ2YqYG5w,Sat Oct 10 19:48:07 +0000 2020,21,0,sport
2,4740,4740,@tdSuWGsGqgiqrfR @ayed7171 خايفين ان القانون ل...,aleabir11,Sun Jan 09 12:58:56 +0000 2022,59,1,politic
3,42321,42321,RT @faisal_alraoji: رسالتي وفيها طلب تدخل وزير...,Jesus_althagafi,Fri Oct 09 20:55:02 +0000 2020,703,0,sport
4,31683,31683,RT @B_A_TT: @thedark5000 @Lion11197 @ma98_98 @...,NegBsUXNqQBSePK,Sun Jan 09 14:10:34 +0000 2022,646,2,religion
...,...,...,...,...,...,...,...,...
9412,8357,8357,RT @alainliving: .\nمواقيت الصلاة في مدينة الع...,alainawy_k,Fri Oct 09 22:30:28 +0000 2020,76,2,religion
9413,1179,1179,RT @LawClub: انطلاقة جديدة ... \nبآمال و طموحا...,lbrahimAzzam,Mon Oct 12 20:10:17 +0000 2020,3545,1,politic
9414,33041,33041,RT @Arkenu4art: ركن الموسيقى اليوم\nدورة القيث...,D_Omar7,Mon Oct 12 22:10:31 +0000 2020,530,3,art
9415,1801,1801,"RT @MarzouqAlajmi: ""الضغط"" على لجنة الانضباط و...",ibrahim_almqati,Thu Oct 08 19:22:37 +0000 2020,4322,0,sport


In [None]:
df_val

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,text,Username,Timestamp,followers,label,categorie
0,18136,44083,44083,@DR_mouswi2 @Rehab_313_8 @3subhan3 يبووووو راي...,ahmadttl3,Mon Oct 12 21:18:11 +0000 2020,599,3,art
1,12238,12778,12778,RT @shahidq8_1990: اللواء ركن متقاعد البطل جاس...,zATY9jFlK5qNbt2,Sun Jan 09 12:24:43 +0000 2022,388,1,politic
2,37234,25945,25945,RT @mk36xx: هذا الرجل طبق القانون علي الكل وقو...,mk36xx,Mon Oct 12 20:09:38 +0000 2020,949,1,politic
3,33805,21542,21542,RT @Faces_News: #السعودية || رسميًا النجم السع...,shady1921997,Sat Oct 10 19:45:05 +0000 2020,1341,0,sport
4,4589,37406,37406,RT @gptrsh: هذه اعظم لحظه في تاريخ كرة القدم ....,ua4rt,Mon Jan 10 00:11:21 +0000 2022,22,0,sport
...,...,...,...,...,...,...,...,...,...
7529,13340,41142,41142,RT @ATLS1983: أفضل من أتقن دور الست في السينما...,eman6315,Sat Oct 10 19:02:46 +0000 2020,3127,3,art
7530,7020,13754,13754,RT @fitri_warsito: #رسالة_اليوم \n#الاجواء_الح...,isiahharvey18,Mon Jan 10 17:00:43 +0000 2022,3,3,art
7531,24499,1286,1286,RT @Nona_Moh13: الوضع بقى صعب جداً و أختنا (أ)...,marwa_misky,Mon Oct 12 23:27:42 +0000 2020,401,4,economy
7532,3036,29688,29688,RT @Essna32: بكل سهوله تعلم الرسم ثلاثي الابعا...,Yousef7961,Sat Oct 10 23:17:09 +0000 2020,45,3,art


## Load Model

In [None]:
MODEL_NAME = 'aubmindlab/bert-base-arabertv01'
t = text.Transformer(MODEL_NAME, maxlen=maxlen)

## Preprocessing

In [None]:
trn = t.preprocess_train(df_train.text.values, df_train.categorie.values)
val = t.preprocess_test(df_val.text.values, df_val.categorie.values)
tst = t.preprocess_test(df_test.text.values, df_test.categorie.values)

preprocessing train...
language: ar
train sequence lengths:
	mean : 19
	95percentile : 26
	99percentile : 29


Is Multi-Label? False
preprocessing test...
language: ar
test sequence lengths:
	mean : 19
	95percentile : 26
	99percentile : 29


preprocessing test...
language: ar
test sequence lengths:
	mean : 19
	95percentile : 26
	99percentile : 29
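
The reported lengths (mean 19, 99th percentile 29) sit well below `maxlen = 64`, although the model's subword tokenizer can produce more tokens than these counts suggest. As a rough sketch of how such statistics can be computed, the snippet below counts whitespace-separated tokens and takes nearest-rank percentiles. The helper names `token_lengths` and `nearest_rank_percentile` are hypothetical, the sample texts are toy stand-ins, and whitespace splitting is an assumption; ktrain's own length computation may differ.

```python
import math


def token_lengths(texts):
    """Number of whitespace-separated tokens in each text."""
    return [len(t.split()) for t in texts]


def nearest_rank_percentile(values, pct):
    """Smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Toy stand-ins for tweets (not taken from the dataset)
sample = ["a b", "a b c", "a", "a b c d"]
lengths = token_lengths(sample)
summary = {
    "mean": sum(lengths) / len(lengths),
    "p95": nearest_rank_percentile(lengths, 95),
    "p99": nearest_rank_percentile(lengths, 99),
}
print(summary)
```

In the notebook, the same helpers could be applied to `df_train.text.values` before choosing `maxlen`.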


## Train the model

#### Wrap the model in a learner object

In [None]:
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=batch_size)

#### Train

In [None]:
history = learner.fit_onecycle(lr, epochs)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/3
Epoch 2/3
Epoch 3/3
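
The log above mentions a "onecycle policy with max lr of 2e-05". As a simplified illustration of the idea, the sketch below builds a single triangular cycle: the learning rate climbs linearly to the maximum during the first half of training and falls back during the second half. The function name `triangular_lr`, the starting rate of `max_lr / 10`, and the step count are assumptions made for this example; they are not read from ktrain's source, and the library's schedule may also vary other quantities such as momentum.

```python
def triangular_lr(step, total_steps, max_lr, base_lr=None):
    """Learning rate at `step` for one triangular cycle.

    Rises linearly from base_lr to max_lr over the first half of training,
    then falls linearly back to base_lr over the second half.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    if base_lr is None:
        base_lr = max_lr / 10  # assumed floor, not taken from ktrain
    half = (total_steps - 1) / 2
    # Distance from the peak, normalised to the range [0, 1]
    distance = abs(step - half) / half
    return base_lr + (max_lr - base_lr) * (1 - distance)


max_lr = 2e-5      # `lr` from the hyperparameter cell
total_steps = 11   # illustrative only; a real run has epochs * batches_per_epoch steps
schedule = [triangular_lr(s, total_steps, max_lr) for s in range(total_steps)]

for s, rate in enumerate(schedule):
    print(f"step {s:2d}: lr = {rate:.2e}")
```

The peak falls at the middle step, so the largest updates happen mid-training and the final steps use a small rate again.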


## Evaluate

In [None]:
learner.validate(val_data=tst)

              precision    recall  f1-score   support

           0       0.94      0.95      0.95      1891
           1       0.96      0.95      0.95      1803
           2       0.93      0.93      0.93      1894
           3       0.95      0.95      0.95      1855
           4       0.96      0.97      0.96      1974

    accuracy                           0.95      9417
   macro avg       0.95      0.95      0.95      9417
weighted avg       0.95      0.95      0.95      9417



array([[1796,   11,   27,   33,   24],
       [  23, 1708,   42,   21,    9],
       [  26,   43, 1754,   37,   34],
       [  28,   10,   36, 1768,   13],
       [  29,    7,   17,    8, 1913]])
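
The report and the matrix above carry the same information: each row sum of the matrix equals the `support` of that class, so rows are true classes and columns are predicted classes. As a quick check, the sketch below recomputes precision, recall, F1 and accuracy from the matrix in plain Python. Class indices are kept as numbers because the output above does not show which category each index stands for.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = [
    [1796, 11, 27, 33, 24],
    [23, 1708, 42, 21, 9],
    [26, 43, 1754, 37, 34],
    [28, 10, 36, 1768, 13],
    [29, 7, 17, 8, 1913],
]

n = len(cm)
row_sums = [sum(row) for row in cm]                             # support per true class
col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predictions per class

# precision = correct / all predicted as k; recall = correct / all truly k
precision = [cm[k][k] / col_sums[k] for k in range(n)]
recall = [cm[k][k] / row_sums[k] for k in range(n)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
accuracy = sum(cm[k][k] for k in range(n)) / sum(row_sums)

for k in range(n):
    print(f"class {k}: precision={precision[k]:.2f} recall={recall[k]:.2f} "
          f"f1={f1[k]:.2f} support={row_sums[k]}")
print(f"accuracy={accuracy:.2f}")
```

Rounded to two decimals, these values reproduce the classification report, including the overall accuracy of 0.95 on 9417 test tweets.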

Let's make a few predictions on new sentences:

In [None]:
p = ktrain.get_predictor(learner.model, t)

In [None]:
p.predict("الرضا باب الله الأعظم وجنة الدنيا وبستان العارفين")

'religion'

In [None]:
p.predict("صوت الأغلبية ليس اثباتا للعدالة ")

'politic'

In [None]:
p.predict("ألعب دائما من اجل الفوز سواء كان هذا خلال التدريب او المباراة الحقيقية")

'sport'

In [None]:
p.predict("الاكتتابات العامة الأولية في دول مجلس التعاون الخليجي تشهد أداء متميزا")

'economy'

In [None]:
p.predict("أحلم بالرسم وبعد ذلك أرسم حلمي")

'art'

## Saving the model


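In [None]:
# Save the predictor created above as `p` (model plus preprocessor)
p.save("/content/ar-bert-model")

In [None]:
# Reload the saved predictor later with ktrain.load_predictor
reloaded = ktrain.load_predictor("/content/ar-bert-model")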