<a href="https://colab.research.google.com/github/Ruqyai/MENADD-DL/blob/main/RNN/Arabic_Poems_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Arabic Poems Generator

### 1.0 Load the packages
<hr/>


In [1]:
!pip install tensorflow==2.1.0 &> /dev/null

Check the TensorFlow version:

In [2]:
import tensorflow as tf
print(tf.__version__)

2.1.0


In [3]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers 
import tensorflow.keras.utils as ku 
import numpy as np 

### 2.0 Loading the data
<hr/>


In [None]:
!wget https://raw.githubusercontent.com/Ruqyai/MENADD-DL/main/Data/arabic_poem_generator.txt

In [5]:
data = open('arabic_poem_generator.txt', 'rb').read().decode(encoding='utf-8')
data[0:300]

'لقينا يوم صهباء سريّه\nحناظلة لهم في الحرب نيّه\nلقيناهم بأسياف حداد\nوأسد لا تفرّ من المنيّه\nوكان زعيمهم إذ ذاك ليث\nهزبرا لا يبالي بالرزيّه\nفخلّفناه وسط القاع ملقى\nوها أنا طالب قتل البقيّه\nورحنا بالسيوف نسوق فيهم\nإلى ربوات معضلة خفيّه\nوكم من فارس منهم تركنا\nعليه من صوارمنا قضيّه\nفوارسنا بنو عبس وإنّا\n'

### 3.0 Tokenizing the training data
<hr/>

In [6]:
tokenizer = Tokenizer()
corpus = data.split("\n")
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
print(tokenizer.word_index)
print('Total number of words in corpus:',total_words)

{'من': 1, 'في': 2, 'إذا': 3, 'على': 4, 'يا': 5, 'ما': 6, 'لا': 7, 'ولا': 8, 'قد': 9, 'كلّ': 10, 'يوم': 11, 'وقد': 12, 'عبل': 13, 'عن': 14, 'إلى': 15, 'بين': 16, 'حتّى': 17, 'وما': 18, 'إن': 19, 'له': 20, 'به': 21, 'لي': 22, 'كان': 23, 'لم': 24, 'مثل': 25, 'بعد': 26, 'إلّا': 27, 'أو': 28, 'لو': 29, 'كنت': 30, 'بها': 31, 'غير': 32, 'عبلة': 33, 'كم': 34, 'عبس': 35, 'وإن': 36, 'أن': 37, 'لها': 38, 'الحرب': 39, 'أنا': 40, 'الخيل': 41, 'ألا': 42, 'عنّي': 43, 'فيها': 44, 'بني': 45, 'الّذي': 46, 'الدهر': 47, 'تحت': 48, 'وإذا': 49, 'أنّ': 50, 'الموت': 51, 'إذ': 52, 'المنايا': 53, 'عليّ': 54, 'ومن': 55, 'وهو': 56, 'حين': 57, 'والخيل': 58, 'منّي': 59, 'عند': 60, 'فما': 61, 'ولم': 62, 'نار': 63, 'وفي': 64, 'منه': 65, 'الزمان': 66, 'ولو': 67, 'مع': 68, 'فوق': 69, 'وكم': 70, 'الوغى': 71, 'ليس': 72, 'فيه': 73, 'القنا': 74, 'عليه': 75, 'كيف': 76, 'إنّ': 77, 'لقد': 78, 'القوم': 79, 'كأنّ': 80, 'لمّا': 81, 'الله': 82, 'عندي': 83, 'فلا': 84, 'عبيلة': 85, 'بنو': 86, 'مثلي': 87, 'ترى': 88, 'قومي': 89, 'سيف

### 4.0 Preparing the data for training
<hr/>
This is the most important part of the script and can be broadly split into four steps.

For each line in the txt file (training data):
 #### 4.1) Converting text to sequences.

   You can do that using the following:

    tokenizer.texts_to_sequences([line])
    
   Once the text is converted to a sequence, the output looks something like this:

    [34, 417, 877, 166, 213, 517]
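
   To make the mapping concrete, here is a minimal, framework-free sketch of what the tokenizer does in this step. It is a simplified stand-in, not the Keras implementation: the helper names `build_word_index` and `text_to_sequence` and the toy corpus are hypothetical, and the sketch skips the lowercasing and punctuation filtering that the Keras `Tokenizer` applies by default.

```python
from collections import Counter


def build_word_index(lines):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for line in lines for word in line.split())
    # Index 0 is never assigned to a word, so it stays free for padding.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def text_to_sequence(line, word_index):
    """Replace each known word by its integer index, skipping unknown words."""
    return [word_index[w] for w in line.split() if w in word_index]


# Hypothetical toy corpus (the notebook uses the lines of the poem file).
toy_corpus = ["a b c", "a b", "a d"]
word_index = build_word_index(toy_corpus)
total_words = len(word_index) + 1  # +1 for the reserved padding index 0

print(word_index)
print(text_to_sequence("a d b", word_index))
```

   This also shows why the notebook computes `total_words` as `len(tokenizer.word_index) + 1`: word indices start at 1, and the extra slot covers index 0.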
 
 #### 4.2) Creating the N_gram sequences.
   Next, create the N-gram sequences: every prefix of the token list that contains at least two tokens. They look like this:

    [34,417]
    [34,417,877] 
    [34,417,877,166]
    [34,417,877,166,213]
    [34,417,877,166,213,517]
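
   The prefixes above can be produced with a single list comprehension. This is a small sketch that mirrors the inner loop of the code cell further down; the helper name `ngram_prefixes` is an assumption introduced for illustration.

```python
def ngram_prefixes(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]


prefixes = ngram_prefixes([34, 417, 877, 166, 213, 517])
for prefix in prefixes:
    print(prefix)
```

   A line with `n` tokens yields `n - 1` training sequences, and a line with a single token yields none.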

 #### 4.3) Finding the max sequence length and padding the rest.
  
   The first step is to find the longest sequence length. Then pre-pad every shorter sequence with zeros using:

    pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

   Once you are done it would look something like this:

    [0,0,0,0,34,417]
    [0,0,0,34,417,877] 
    [0,0,34,417,877,166]
    [0,34,417,877,166,213]
    [34,417,877,166,213,517]
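
   As a rough illustration of what pre-padding does, the sketch below reimplements it in plain Python. The helper `pad_pre` is hypothetical and only approximates `pad_sequences` with `padding='pre'`; it returns nested lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen=None):
    """Left-pad each sequence with zeros up to maxlen."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    # Sequences longer than maxlen keep only their last maxlen tokens.
    return [[0] * (maxlen - len(seq)) + list(seq[-maxlen:]) for seq in sequences]


sequences = [
    [34, 417],
    [34, 417, 877],
    [34, 417, 877, 166],
    [34, 417, 877, 166, 213],
    [34, 417, 877, 166, 213, 517],
]
padded = pad_pre(sequences)
for row in padded:
    print(row)
```

   Padding on the left keeps the most recent words at the end of every row, which is where the label is taken from in the next step.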

 #### 4.4) Creating the predictors and the labels.

   The last element of each padded N-gram sequence above becomes the label, and the rest of the array becomes the predictors:
    
    PREDICTORS                      LABELS
    [0,0,0,0,34]                     417
    [0,0,0,34,417]                   877
    [0,0,34,417,877]                 166
    [0,34,417,877,166]               213
    [34,417,877,166,213]             517
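
   The split and the one-hot encoding of the labels can be sketched without any framework. The helpers `split_predictors_labels` and `one_hot` are hypothetical stand-ins for the array slicing and `to_categorical` call used in the notebook; the value 8212 for `total_words` is taken from the model summary shown later.

```python
def split_predictors_labels(padded):
    """Use the last token of each row as the label, the rest as predictors."""
    predictors = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    return predictors, labels


def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position index."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


padded = [
    [0, 0, 0, 0, 34, 417],
    [0, 0, 0, 34, 417, 877],
    [0, 0, 34, 417, 877, 166],
    [0, 34, 417, 877, 166, 213],
    [34, 417, 877, 166, 213, 517],
]
predictors, labels = split_predictors_labels(padded)
total_words = 8212  # vocabulary size reported in the model summary
encoded = [one_hot(label, total_words) for label in labels]

print(predictors[0], labels[0])
print(len(encoded[0]), encoded[0].index(1.0))
```

   Each label becomes a vector with `total_words` entries, which matches the size of the softmax output layer.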

The code for all of the above steps is combined in the next code block:


In [7]:
# 1- Converting text to sequences.
input_sequences = []
for line in corpus:
	token_list = tokenizer.texts_to_sequences([line])[0]
	# 2-Creating the N_gram sequences.
	for i in range(1, len(token_list)):
		n_gram_sequence = token_list[:i+1]
		input_sequences.append(n_gram_sequence)
# 3-Finding the max sequence length and padding the rest.
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
# 4-Creating the predictors and the labels.
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
label = ku.to_categorical(label, num_classes=total_words)

### 5.0 Defining the model
<hr/>


In [8]:
# Defining the model.
model = Sequential()

model.add(Embedding(total_words,100,input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150,return_sequences=True)))
model.add(Dropout(0.18))
model.add(Bidirectional(LSTM(100)))
model.add(Dense(total_words//2,activation='relu',kernel_regularizer=regularizers.l2(0.01)))  # integer division: the unit count must be an int
model.add(Dense(total_words,activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer = 'adam',metrics = ['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 8, 100)            821200    
_________________________________________________________________
bidirectional (Bidirectional (None, 8, 300)            301200    
_________________________________________________________________
dropout (Dropout)            (None, 8, 300)            0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               320800    
_________________________________________________________________
dense (Dense)                (None, 4106)              825306    
_________________________________________________________________
dense_1 (Dense)              (None, 8212)              33726684  
Total params: 35,995,190
Trainable params: 35,995,190
Non-trainable params: 0
____________________________________________
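
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer shapes. The sketch below assumes the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical, and the sizes (vocabulary 8212, embedding width 100, LSTM units 150 and 100) are read off the summary and the model definition.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per unit."""
    return input_dim * units + units


vocab = 8212
counts = {
    "embedding": embedding_params(vocab, 100),
    "bidirectional": 2 * lstm_params(100, 150),    # forward + backward LSTM
    "bidirectional_1": 2 * lstm_params(300, 100),  # input is 2 * 150 wide
    "dense": dense_params(200, vocab // 2),        # input is 2 * 100 wide
    "dense_1": dense_params(vocab // 2, vocab),
}
for name, value in counts.items():
    print(f"{name:16s}{value:>12,}")
print(f"{'total':16s}{sum(counts.values()):>12,}")
```

Almost all of the parameters sit in the final softmax layer, because its size scales with the product of the vocabulary size and the width of the layer before it.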

### 6.0 Training the model
<hr/>


In [9]:
history = model.fit(predictors, label, epochs=20, verbose=1)

Train on 13043 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### 7.0 Testing the model
<hr/>
To test the model, we have to provide two inputs:

1. The input (seed) text, so the network has something to start predicting from.
2. The number of words you want the network to predict.

In [10]:
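seed_text = "نور"
next_words = 8

for _ in range(next_words):
	# Encode the text generated so far and pad it to the model's input length.
	token_list = tokenizer.texts_to_sequences([seed_text])[0]
	token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
	# Greedy decoding: pick the index with the highest predicted probability.
	# np.argmax over model.predict avoids predict_classes, which newer TensorFlow releases no longer provide.
	predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
	# Map the index back to its word (index 0 is padding and has no word).
	output_word = tokenizer.index_word.get(predicted, "")
	seed_text += " " + output_word
print(seed_text)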

نور من كلّ جبّار الحشا يصفو القنا الجوى العدى
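
The generation loop is an instance of greedy autoregressive decoding: encode the text so far, pad it, predict, append the most probable word, and repeat. The framework-free sketch below illustrates only that control flow. Everything in it is hypothetical: the three-word vocabulary, the `toy_predict` function standing in for the trained model, and the `generate` helper.

```python
# Hypothetical toy vocabulary; index 0 is reserved for padding.
word_index = {"a": 1, "b": 2, "c": 3}
index_word = {index: word for word, index in word_index.items()}


def toy_predict(padded):
    """Stand-in for the model: puts all probability on the index that
    follows the last token cyclically (1 -> 2 -> 3 -> 1)."""
    next_index = padded[-1] % 3 + 1
    probs = [0.0] * (len(word_index) + 1)
    probs[next_index] = 1.0
    return probs


def generate(seed_text, next_words, window):
    """Greedy decoding: repeatedly append the most probable next word."""
    text = seed_text
    for _ in range(next_words):
        tokens = [word_index[w] for w in text.split() if w in word_index]
        tokens = tokens[-window:]                        # keep the latest tokens
        padded = [0] * (window - len(tokens)) + tokens   # pre-pad with zeros
        probs = toy_predict(padded)
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        text += " " + index_word.get(predicted, "")
    return text


print(generate("a", 4, window=3))
```

Because the most probable word is always chosen, the same seed always yields the same continuation; sampling from the predicted distribution instead would give more varied output.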


## Assignment

Retrain any model with a custom dataset:

[Generate music with an RNN](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/audio/music_generation.ipynb)
OR
[Text generation with an RNN](https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/text_generation.ipynb)

