# 3- (Arabic) Recipe 6-2. Classifying Text with Deep Learning

The NLP pipeline stays the same as in the earlier notebooks. The only change is that the models are built with deep learning algorithms instead of classical machine learning algorithms.

###  Import the libraries

In [1]:
import pandas as pd

### Import data

In [2]:
Data = pd.read_csv("arabic_dataset_classifiction.csv", encoding='utf-8')

Dataset source: [DataSet for Arabic Classification](https://data.mendeley.com/datasets/v524p5dhpj/2) <br>
It was collected from three Arabic online newspapers (Assabah, Hespress and Akhbarona) using a semi-automatic web crawling process.

### Data understanding

In [3]:
Data.head()

Unnamed: 0,text,targe
0,بين أستوديوهات ورزازات وصحراء مرزوكة وآثار ولي...,0
1,قررت النجمة الأمريكية أوبرا وينفري ألا يقتصر ع...,0
2,أخبارنا المغربية الوزاني تصوير الشملالي ألهب ا...,0
3,اخبارنا المغربية قال ابراهيم الراشدي محامي سعد...,0
4,تزال صناعة الجلود في المغرب تتبع الطريقة التقل...,0


In [4]:
Data.rename(columns={"targe": "Target", "text": "Text"}, inplace = True)

In [5]:
Data.shape

(111728, 2)

In [6]:
Data['Target'].unique()

array([0, 1, 2, 3, 4], dtype=int64)

In [7]:
Data.dtypes

Text      object
Target     int64
dtype: object

In [8]:
Data[Data['Target'] == 0].Text.iloc[350]

'قالت إنها انتظرت خمس سنوات لتشارك في فد تي في قالت الفنانة بديعة الصنهاجي التي تتقمص دور عبلة في سيتكوم ديما جيران على شاشة القناة الثانية إن قلة أعمالها التلفزيونية راجعة إلى عدم تفكيرها في الاحتراف بعد تخرجها من معهد الفن المسرحي والتنشيط الثقافي وأضافت بديعة في تصريح الصباح أن دراستها المسرح لمدة سبع سنوات كانت من باب الهواية مشيرة إلى أنها ركزت أثناء دراستها على تجسيد أدوار باللغة الفرنسية لمسرحيات كلاسيكية لموليير وتشيكوف وكان أول ظهور لبديعة الصنهاجي على شاشة التلفزيون رفقة المخرج نور الدين لخماري في السلسلة البوليسية القضية »، التي تولى بعض أصدقائها مهمة اقتراحها للعمل فيها وأوضحت الصنهاجي بخصوص دورها في سلسلة القضية أنه من الشخصيات التي تعتز بتقمصها رغم أنه من الأدوار الثانوية مضيفة أن بحكم وظيفتها لم يكن ممكنا بالنسبة إليها الغياب لمدة طويلة عن العمل إن احتراف الفن أمر صعب وهذا ما جعلني بعيدة لعدة سنوات عنه كما أن ارتباطي بعملي في مجالي التسويق والاتصال في مجال الوكالات العقارية ساهم في ذلك »، تقول بديعة عن أسباب تأخرها في دخول المجال الفني وأكدت بديعة أنها من المعجبات بأداء 

By inspecting the content, I found that the labels represent the following categories: <br>
- 0 celebrities <br>
- 1 crimes <br>
- 2 economy <br>
- 3 politics <br>
- 4 sports

In [9]:
# Selecting non null data
Data = Data[pd.notnull(Data['Text'])]

In [10]:
Data.shape

(108789, 2)

In [11]:
Data.groupby('Target').Text.count().sort_values(ascending=False)

Target
4    43675
3    20485
1    16728
2    14165
0    13736
Name: Text, dtype: int64

I will start with a binary sports vs. crimes classification, then include the remaining classes later.

In [12]:
# Build a balanced two-class subset:
# 16,000 documents each from the crime (1) and sports (4) classes.

sample_crime = Data[Data['Target'] == 1].sample(n=16000)
sample_sports = Data[Data['Target'] == 4].sample(n=16000)

# Here we recreate a 'balanced' dataset.
Data_sample = pd.concat([sample_crime, sample_sports],axis=0)
Data_sample.reset_index(drop=True, inplace=True)

In [13]:
Data_sample.shape

(32000, 2)

In [14]:
Data_sample['Target'].unique()

array([1, 4], dtype=int64)

In [15]:
# Reset the target values to 0 and 1 instead of 1 and 4
Data_sample['Target'] = Data_sample['Target'].replace({1: 0, 4: 1})

In [16]:
categories = {
              0 : 'crimes', 
              1 : 'sports'
             }

### Preprocessing

In [17]:
# remove stop words
from nltk.corpus import stopwords
stop = set(stopwords.words('arabic'))  # a set makes membership tests fast

Data_sample['Text'] = Data_sample['Text'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))

In [18]:
# remove punctuation and multiple spaces

import re

pattern_punctuation = r'[^a-zA-Z0-9ء-ي\s]' # punctuation: anything that is not a digit, an Arabic or English letter, or whitespace
pattern_multi_spaces = '[ ]{2,}'

Data_sample['Text'] = Data_sample['Text'].apply(lambda x: re.sub(pattern_punctuation, '' , x))
Data_sample['Text'] = Data_sample['Text'].apply(lambda x: re.sub(pattern_multi_spaces, ' ' , x))
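
To see what the punctuation pattern keeps and drops, here is a small check on a made-up string (the sample text is an assumption, not taken from the dataset). It also illustrates why the English range is best written as `A-Z`: the wider range `A-z` covers the ASCII characters that sit between `Z` and `a`, such as `[`, `]` and `_`, so they would survive the cleaning.

```python
import re

# Keep ASCII letters, digits, Arabic letters (U+0621 to U+064A) and whitespace.
pattern_strict = r'[^a-zA-Z0-9ء-ي\s]'
# The wider range A-z also keeps [ \ ] ^ _ ` because they lie between 'Z' and 'a'.
pattern_wide = r'[^a-zA-z0-9ء-ي\s]'

sample = 'كأس_العالم [2022]، FIFA!'  # hypothetical sample text

cleaned_strict = re.sub(pattern_strict, '', sample)
cleaned_wide = re.sub(pattern_wide, '', sample)

print(cleaned_strict)
print(cleaned_wide)
```

Note that the Arabic comma `،` (U+060C) lies outside the `ء-ي` range, so both patterns remove it.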

In [19]:
# stemming 

from nltk.stem.snowball import SnowballStemmer

# Load 'stemmer'
stemmer = SnowballStemmer("arabic")
Data_sample['Text'] = Data_sample['Text'].apply(lambda sentence: ' '.join([stemmer.stem(word) for word in sentence.split()]))

### Data preparation for model building

In [20]:
from sklearn.model_selection import train_test_split

#Train and test split with 80:20 ratio
train, test = train_test_split(Data_sample, test_size=0.2)

# Define the sequence lengths, max number of words and embedding dimensions
# Sequence length of each sentence. If more, truncate. If less, pad with zeros
MAX_SEQUENCE_LENGTH = 300

# Top 20000 frequently occurring words
MAX_NB_WORDS = 20000

# Get the frequently occurring words
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(train.Text)

# tokenizing while keeping only the top frequent words
train_sequences = tokenizer.texts_to_sequences(train.Text)
test_sequences = tokenizer.texts_to_sequences(test.Text)

# dictionary containing words and their index
word_index = tokenizer.word_index

# print(tokenizer.word_index)
# total words in the corpus
print('Found %s unique tokens.' % len(word_index))

Found 49664 unique tokens.


For background on embedding layers in Keras, see: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
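
The idea behind the tokenizer step can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: the helper names `fit_word_index` and `to_sequences` are hypothetical, and the real `Tokenizer` also lowercases text and filters punctuation by default. It shows the two behaviours that matter here: words are ranked by frequency with index 1 for the most frequent word (0 is left free for padding), and with `num_words` set, only words whose index is below `num_words` are kept.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by descending frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Map each text to a list of word indices, dropping unknown or rare words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            # Only indices strictly below num_words are kept.
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

# Toy corpus (made up for illustration).
toy_texts = ["goal goal match", "match referee goal"]
toy_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts + ["goal unseen referee"],
                             toy_index, num_words=3)
print(toy_index)
print(toy_sequences)
```

This also explains why the notebook finds 49,664 unique tokens while the sequences only ever contain indices below `MAX_NB_WORDS`.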

In [21]:
# Look at the tokenization done by Keras (it splits the text into words and maps each unique word to a unique integer)
train_sequences[0]

[4991,
 104,
 164,
 43,
 15,
 8,
 679,
 371,
 996,
 241,
 2062,
 14,
 1308,
 101,
 12499,
 214,
 22,
 455,
 2593,
 569,
 54,
 759,
 661,
 2311,
 824,
 911,
 147,
 2856,
 1291,
 371,
 537,
 58,
 5915,
 28,
 5262,
 11,
 55,
 7901,
 661,
 241,
 6507,
 2505,
 24,
 11,
 241,
 172,
 2593,
 55,
 54,
 212,
 107,
 66,
 88,
 2237,
 1154,
 54,
 12203,
 477,
 32,
 270,
 121,
 3668,
 3003,
 1098,
 9,
 55,
 581,
 187,
 43,
 15,
 8,
 371,
 2442,
 3166,
 3836,
 312,
 807,
 661,
 241,
 1205,
 1819,
 24,
 11,
 679,
 11,
 581,
 187,
 394,
 152,
 242,
 232,
 194,
 996,
 679,
 60]

In [22]:
# look at the lengths of the first 6 documents
for i in range(6):
    print(len(train_sequences[i]), ' ', end= ' ')

95   129   175   386   322   208   

In [23]:
# pad train and test sequences

from keras.preprocessing.sequence import pad_sequences
train_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)

test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
print(train_data.shape)
print(test_data.shape)

(25600, 300)
(6400, 300)


In [24]:
# Note that the padding is added at the beginning: pad_sequences defaults to padding='pre'
train_data[0]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,

In [25]:
# look at the lengths of the first 6 documents
for i in range(6):
    print(len(train_data[i]), ' ', end= ' ')

300   300   300   300   300   300   
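
The padding behaviour can be reproduced with a few lines of NumPy. This is a minimal sketch of the default `pad_sequences` settings (`padding='pre'`, `truncating='pre'`), not the library code, and `pad_pre` is a hypothetical helper name. One common rationale for left padding is that recurrent layers read their final state after the last time step, so keeping the real tokens at the end places them closest to the output.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros; sequences longer than maxlen keep their last maxlen items."""
    padded = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays all zeros
        trimmed = seq[-maxlen:]               # truncate from the front
        padded[row, -len(trimmed):] = trimmed  # write at the right edge
    return padded

toy_padded = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6], []], maxlen=4)
print(toy_padded)
```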

In [26]:
train_labels = train['Target']
test_labels = test['Target']

In [27]:
train_labels

22058    1
23432    1
4750     0
1046     0
323      0
15067    0
28157    1
24026    1
27963    1
13669    0
31232    1
13881    0
7572     0
21291    1
23258    1
17153    1
26471    1
1030     0
17600    1
13900    0
22479    1
106      0
5053     0
19323    1
18590    1
11117    0
12738    0
21682    1
19296    1
123      0
        ..
5839     0
2867     0
29763    1
16332    1
15633    0
11325    0
7348     0
16611    1
23254    1
20544    1
24562    1
6556     0
8101     0
28105    1
5384     0
11187    0
11772    0
4964     0
16218    1
26072    1
24993    1
4403     0
6631     0
10565    0
23662    1
10548    0
8567     0
8211     0
15432    0
20611    1
Name: Target, Length: 25600, dtype: int64

In [28]:
# changing labels to one-hot-encoding 
from keras.utils import to_categorical
import numpy as np

labels_train = to_categorical(np.asarray(train_labels))
labels_test = to_categorical(np.asarray(test_labels))
print('Shape of data tensor:', train_data.shape)
print('Shape of label tensor:', labels_train.shape)
print('Shape of label tensor:', labels_test.shape)

Shape of data tensor: (25600, 300)
Shape of label tensor: (25600, 2)
Shape of label tensor: (6400, 2)


In [29]:
labels_train

array([[0., 1.],
       [0., 1.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [0., 1.]], dtype=float32)
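
For integer labels, one-hot encoding amounts to picking rows of an identity matrix. The following is a minimal NumPy sketch of that idea (the helper name `one_hot` is an assumption for illustration; the notebook itself uses `to_categorical`).

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Turn integer class labels into one-hot rows of dtype float32."""
    labels = np.asarray(labels, dtype='int64')
    if num_classes is None:
        num_classes = int(labels.max()) + 1
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype='float32')[labels]

toy_one_hot = one_hot([1, 1, 0])
print(toy_one_hot)
```

Column 0 corresponds to class 0 (crimes) and column 1 to class 1 (sports), which is the column order of the softmax outputs later on.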

In [30]:
EMBEDDING_DIM = 100

### Model building and predicting

We build models with several deep learning architectures (CNN, SimpleRNN,
LSTM, and Bidirectional LSTM) and compare their performance using
precision, recall, and F1-score on the test set.

#### Defining the CNN model.

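Here we define two convolutional blocks, each with 128 filters of kernel
size 5 followed by max pooling, and then a dense hidden layer with 128 units.
The network uses dropout with a probability of 0.5 and batch normalization
after each block. The output layer is a dense layer using the softmax
activation function to output a probability for each class.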

In [31]:
# Import Libraries

from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import SimpleRNN, Bidirectional, LSTM
from keras.layers import BatchNormalization, Dropout, Flatten
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPool1D

**Note on this notebook**: for feature engineering, we use word embeddings as features. The embeddings are not pretrained; they are learned from the input data during training.

In [32]:
print('Training CNN 1D model.')

Training CNN 1D model.


In [33]:
model = Sequential()
model.add(Embedding(input_dim = MAX_NB_WORDS, output_dim = EMBEDDING_DIM, input_length = MAX_SEQUENCE_LENGTH))
model.add(Dropout(0.5))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
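
To follow how the sequence length shrinks through this stack, the arithmetic can be traced by hand. This is a sketch assuming the Keras defaults for these layers (`padding='valid'` for `Conv1D`, and a `MaxPooling1D` stride equal to the pool size); the helper names are hypothetical.

```python
def conv1d_valid_length(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1

def maxpool1d_length(length, pool_size):
    """Output length of non-overlapping max pooling without padding."""
    return (length - pool_size) // pool_size + 1

steps = 300  # MAX_SEQUENCE_LENGTH
trace = [("embedding", steps)]
for block in (1, 2):
    steps = conv1d_valid_length(steps, 5)
    trace.append((f"conv{block}", steps))
    steps = maxpool1d_length(steps, 5)
    trace.append((f"pool{block}", steps))

# Flatten turns (steps, 128 filters) into a single feature vector.
flattened_features = steps * 128

for name, length in trace:
    print(name, length)
print("flatten", flattened_features)
```

Dropout and batch normalization leave the shape unchanged, so they are omitted from the trace.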


We are now fitting our model to the data. Here we have 5 epochs and a
batch size of 64 patterns.

In [34]:
model.fit(train_data, labels_train, batch_size=64, epochs=5, validation_data=(test_data, labels_test))

Train on 25600 samples, validate on 6400 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0xa14569400>

In [35]:
predicted=model.predict(test_data)

In [36]:
predicted.shape

(6400, 2)

In [37]:
# Note that `predicted` holds the predicted probability for each class (so each row sums to 1)
# Let's look at the first 5 predictions
for i in range(5):
    print(predicted[i])

[2.2207178e-19 1.0000000e+00]
[6.254029e-05 9.999374e-01]
[0.49948114 0.50051886]
[9.99988079e-01 1.18791995e-05]
[3.0698534e-11 1.0000000e+00]
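
To turn these probabilities into class names, one option is to take the index of the largest value in each row. The sketch below reuses the five rows printed above and the `categories` mapping defined earlier in the notebook; the variable names `probs`, `class_ids`, `class_names` and `confidence` are introduced here for illustration.

```python
import numpy as np

categories = {0: 'crimes', 1: 'sports'}

# The first five predicted rows, copied from the output above.
probs = np.array([
    [2.2207178e-19, 1.0000000e+00],
    [6.254029e-05, 9.999374e-01],
    [0.49948114, 0.50051886],
    [9.99988079e-01, 1.18791995e-05],
    [3.0698534e-11, 1.0000000e+00],
])

class_ids = probs.argmax(axis=1)   # column index of the most probable class
class_names = [categories[int(i)] for i in class_ids]
confidence = probs.max(axis=1)     # probability of the chosen class

for name, p in zip(class_names, confidence):
    print(f"{name}: {p:.3f}")
```

The third row is a near tie (about 0.4995 vs. 0.5005), which is a reminder that the chosen class alone hides how uncertain a prediction is.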


In [38]:
#model evaluation
import sklearn
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test, predicted.round())

print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test, predicted.round()))

precision: [0.9965035  0.98432698]
recall: [0.98399247 0.99657747]
fscore: [0.99020846 0.99041435]
support: [3186 3214]
############################
             precision    recall  f1-score   support

          0       1.00      0.98      0.99      3186
          1       0.98      1.00      0.99      3214

avg / total       0.99      0.99      0.99      6400



**test using a custom input**

In [84]:
# English: "The World Cup will be held in Qatar in 2020"
text_to_test = 'سيقام، كأس العالم فى قطر عام 2020'

Run it through the preprocessing pipeline:

In [88]:
# remove stop words
processed_text = " ".join(x for x in text_to_test.split() if x not in stop)

# remove punctuation and multiple spaces
processed_text = re.sub(pattern_punctuation, '' , processed_text)
processed_text = re.sub(pattern_multi_spaces, ' ' , processed_text)

# stemming 
processed_text = ' '.join([stemmer.stem(word) for word in processed_text.split()])

# tokenizing
text_series_to_test = pd.Series(processed_text)
text_series_to_test =  tokenizer.texts_to_sequences(text_series_to_test)

# padding
text_series_to_test = pad_sequences(text_series_to_test, maxlen=MAX_SEQUENCE_LENGTH)

In [86]:
model.predict(text_series_to_test)

array([[0.02185888, 0.9781411 ]], dtype=float32)

97.8% class sport (correct)
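
Since the same cleaning steps are applied to the training data and to every custom input, it can help to gather them in one function. The following is a minimal sketch, not part of the original notebook: `preprocess` is a hypothetical helper, and the stop-word set and stemmer are passed in as arguments so the example runs without NLTK. A toy English sentence and `str.lower` stand in for the Arabic stop words and the Snowball stemmer.

```python
import re

PATTERN_PUNCTUATION = r'[^a-zA-Z0-9ء-ي\s]'
PATTERN_MULTI_SPACES = r'[ ]{2,}'

def preprocess(text, stop_words, stem):
    """Apply the notebook's cleaning steps in order to a single string."""
    # 1. remove stop words
    kept = " ".join(w for w in text.split() if w not in stop_words)
    # 2. remove punctuation, then collapse repeated spaces
    kept = re.sub(PATTERN_PUNCTUATION, '', kept)
    kept = re.sub(PATTERN_MULTI_SPACES, ' ', kept)
    # 3. stem every remaining word
    return " ".join(stem(w) for w in kept.split())

# Toy stand-ins (assumptions for illustration only).
toy_stop = {"the", "in"}
toy_result = preprocess("the  World Cup, in Qatar!", toy_stop, stem=str.lower)
print(toy_result)
```

In the notebook's setting, the call would presumably look like `preprocess(text_to_test, stop, stemmer.stem)`, followed by the tokenizing and padding steps shown above.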

#### Defining the RNN model.

In [89]:
#model training
print('Training SimpleRNN model.')

Training SimpleRNN model.


In [90]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(SimpleRNN(2, input_shape=(None,1)))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy',optimizer='adam',metrics = ['accuracy'])
model.fit(train_data, labels_train, batch_size=16, epochs=5, validation_data=(test_data, labels_test))

Train on 25600 samples, validate on 6400 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0xa18a6a6a0>

In [91]:
# prediction on test data
predicted_Srnn=model.predict(test_data)
predicted_Srnn

array([[0.00379755, 0.9962024 ],
       [0.01539066, 0.98460937],
       [0.9919704 , 0.00802953],
       ...,
       [0.00157173, 0.99842834],
       [0.00646456, 0.9935354 ],
       [0.002992  , 0.997008  ]], dtype=float32)

In [92]:
#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test, predicted_Srnn.round())

print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test, predicted_Srnn.round()))

precision: [0.9817782  0.98103823]
recall: [0.98085374 0.98195395]
fscore: [0.98131575 0.98149588]
support: [3186 3214]
############################
             precision    recall  f1-score   support

          0       0.98      0.98      0.98      3186
          1       0.98      0.98      0.98      3214

avg / total       0.98      0.98      0.98      6400



**test using a custom input**

In [93]:
model.predict(text_series_to_test)

array([[0.9380894 , 0.06191062]], dtype=float32)

93.8% class crime (wrong)

#### Defining the LSTM model.

In [94]:
#model training
print('Training LSTM model.')

Training LSTM model.


In [95]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(units=16, activation='relu', recurrent_activation='hard_sigmoid', return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.fit(train_data, labels_train, batch_size=16, epochs=5, validation_data=(test_data, labels_test))

Train on 25600 samples, validate on 6400 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0xa1c3dfdd8>

In [96]:
# prediction on test data
predicted_lstm=model.predict(test_data)
predicted_lstm

array([[1.4037952e-16, 1.0000000e+00],
       [2.7975549e-11, 1.0000000e+00],
       [9.9799418e-01, 2.0058097e-03],
       ...,
       [3.4818143e-10, 1.0000000e+00],
       [8.1087570e-16, 1.0000000e+00],
       [2.8518691e-13, 1.0000000e+00]], dtype=float32)

In [97]:
#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score

precision, recall, fscore, support = score(labels_test, predicted_lstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test,
predicted_lstm.round()))

precision: [0.9881583  0.99529928]
recall: [0.9952919  0.98817673]
fscore: [0.99171228 0.99172521]
support: [3186 3214]
############################
             precision    recall  f1-score   support

          0       0.99      1.00      0.99      3186
          1       1.00      0.99      0.99      3214

avg / total       0.99      0.99      0.99      6400



**test using a custom input**

In [98]:
model.predict(text_series_to_test)

array([[4.4892155e-04, 9.9955100e-01]], dtype=float32)

99.9% class sport (correct)

#### Defining the Bidirectional LSTM model.

An LSTM carries information from earlier inputs forward through its
hidden state. A bidirectional LSTM processes the sequence in both
directions, once from past to future and once from future to past, and
combines the two results, so the representation at each position also
reflects the words that follow it. Bidirectional LSTMs often produce very
good results because they capture context on both sides of a word.
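
The two-direction idea can be illustrated with a much simpler recurrence than an LSTM. The sketch below uses a plain tanh recurrent cell in NumPy with fixed toy weights (all values and helper names are assumptions for illustration): one pass reads the sequence left to right, a second pass reads it right to left, and the two state sequences are concatenated per time step.

```python
import numpy as np

def run_rnn(inputs, w_in, w_rec):
    """Simple tanh recurrence; returns the hidden state at every time step."""
    state = np.zeros(w_rec.shape[0])
    states = []
    for x in inputs:
        state = np.tanh(w_in @ x + w_rec @ state)
        states.append(state)
    return np.stack(states)

def bidirectional(inputs, fwd_weights, bwd_weights):
    """Concatenate a left-to-right pass with a right-to-left pass."""
    forward = run_rnn(inputs, *fwd_weights)
    # Run on the reversed sequence, then flip back so time steps line up.
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]
    return np.concatenate([forward, backward], axis=1)

# Toy sequence: 6 time steps with 4 features each.
toy_inputs = np.arange(24, dtype=float).reshape(6, 4) / 10.0
fwd = (np.full((3, 4), 0.10), 0.5 * np.eye(3))
bwd = (np.full((3, 4), 0.05), 0.5 * np.eye(3))

toy_states = bidirectional(toy_inputs, fwd, bwd)
print(toy_states.shape)
```

Changing only the last time step of `toy_inputs` leaves the forward half of the earlier states untouched but alters their backward half, which is the sense in which each position sees its future context.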

In [99]:
#model training
print('Training Bidirectional LSTM model.')

Training Bidirectional LSTM model.


In [100]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(Bidirectional(LSTM(16, return_sequences=True, dropout=0.1, recurrent_dropout=0.1)))
model.add(Conv1D(16, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform"))
model.add(GlobalMaxPool1D())
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.1))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.fit(train_data, labels_train, batch_size=16, epochs=3, validation_data=(test_data, labels_test))

Train on 25600 samples, validate on 6400 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0xa251868d0>

In [101]:
# prediction on test data
predicted_blstm=model.predict(test_data)
predicted_blstm

array([[3.3783587e-07, 9.9999964e-01],
       [9.6426636e-05, 9.9990356e-01],
       [9.9645293e-01, 3.5470494e-03],
       ...,
       [7.2406688e-06, 9.9999273e-01],
       [6.1590015e-07, 9.9999940e-01],
       [4.0449393e-07, 9.9999964e-01]], dtype=float32)

In [102]:
#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score

precision, recall, fscore, support = score(labels_test,
predicted_blstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test, predicted_blstm.round()))

precision: [0.99557102 0.98795925]
recall: [0.98775895 0.99564406]
fscore: [0.9916496  0.99178677]
support: [3186 3214]
############################
             precision    recall  f1-score   support

          0       1.00      0.99      0.99      3186
          1       0.99      1.00      0.99      3214

avg / total       0.99      0.99      0.99      6400



**test using a custom input**

In [103]:
model.predict(text_series_to_test)

array([[8.705105e-06, 9.999913e-01]], dtype=float32)

99.9% class sport (correct)

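We can see that the deep learning models reach high accuracy (F1-scores of 0.98 to 0.99 on the test set), higher than the classical models from the earlier notebooks. The CNN, LSTM, and Bidirectional LSTM perform almost identically at about 0.99, while the SimpleRNN is slightly lower at 0.98 and was the only model to misclassify the custom input.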