# Arabic Poem Generator
Natural Language Processing (NLP) Course Project  
Section: AI  |  Students:
* Ahad Alsulami
* Raghad Alghamdi
* Reouf Alsahafi
* Latifah Mohammed

## Read Data and Import Libraries

In this section, we read a corpus from the web and import the libraries required for the project.

In [1]:
# download the corpus from a web server and read it as UTF-8 text
!wget -q https://raw.githubusercontent.com/zaidalyafeai/ARBML/master/datasets/Poems/poems
with open("poems", "r", encoding="utf-8") as f:
    dataset = f.read()
print('Dataset size = ', len(dataset))  # size in characters

Dataset size =  282701


In [2]:
# import libraries
import nltk
import numpy as np
import random
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Split Dataset Into Words

In this section, we use the `word_tokenize` function from the `nltk` library to split the dataset into individual words (tokens), which are the units the language model is built from.

In [3]:
tokens = nltk.word_tokenize(dataset)

# print the length
print("The length of the tokens = ",len(tokens))

# print the first 5 words
print("\nThe first 5 words are: \n",tokens[:5])

The length of the tokens =  51945

The first 5 words are: 
 ['فيا', 'عجبا', 'للناس', 'يستشرفونني', 'كان']


## Build the Model

This section outlines how we build an n-gram language model. First, we create a dictionary `ngrams` whose keys are individual words and whose values are lists of the words that follow them in the corpus. A follower is appended once per occurrence, so the lists contain duplicates, and a word that follows the key more often is more likely to be generated.

Next, we generate text by starting from a given word and walking through its followers, accumulating their probabilities until the running total is greater than or equal to a random `threshold`; the follower at which this happens becomes the next word. We repeat this until the terminating condition is reached, which in this case is 4 lines, each made of a starting word followed by 8 generated words.

In [4]:
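# create a dictionary that maps each key to the list of words following it
n = 1 # number of words per key; one word of context makes this a bigram model
ngrams = {}

# loop through all the words in `tokens`
# and record, for each key, the word that comes right after it
for i in range(len(tokens)-n):
    # form the key by joining n consecutive words
    word = ' '.join(tokens[i:i+n])
    if word not in ngrams:
        ngrams[word] = []
    # a list keeps duplicates, so frequent followers appear more often
    ngrams[word].append(tokens[i+n])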
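
To see what this loop produces, here is a minimal sketch on a toy token list. The helper name `build_ngrams` and the sample tokens are illustrative assumptions, not part of the project code; the logic mirrors the cell above.

```python
def build_ngrams(tokens, n=1):
    """Map each run of n consecutive tokens to the list of tokens that follow it."""
    followers = {}
    for i in range(len(tokens) - n):
        key = ' '.join(tokens[i:i + n])
        # setdefault creates the empty list the first time a key is seen
        followers.setdefault(key, []).append(tokens[i + n])
    return followers

# toy corpus (hypothetical), already split into words
toy_tokens = ['a', 'b', 'a', 'c', 'a', 'b']
toy_ngrams = build_ngrams(toy_tokens, n=1)
print(toy_ngrams)
```

The word `a` is followed by `b` twice and by `c` once, so its list keeps all three entries, duplicates included.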

In [5]:
print('Display the values for the word \'عجبا\' :')
ngrams['عجبا']

Display the values for the word 'عجبا' :


['للناس',
 'للناس',
 'للناس',
 'فقالت',
 'للعين',
 'مني',
 'للعين',
 'هذا',
 'للعذب',
 'أساء']
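
Because each follower is stored once per occurrence, the list above already encodes relative frequencies. As a rough illustration, the sketch below turns that list into conditional probabilities with `collections.Counter`. The variable names are ours, and the list is copied by hand from the output above.

```python
from collections import Counter

# followers of 'عجبا', copied from the output above
followers = ['للناس', 'للناس', 'للناس', 'فقالت', 'للعين',
             'مني', 'للعين', 'هذا', 'للعذب', 'أساء']

counts = Counter(followers)   # occurrences of each distinct follower
total = len(followers)        # total number of observed followers

# relative frequency of each distinct follower
probabilities = {word: count / total for word, count in counts.items()}

for word, p in probabilities.items():
    print(word, p)
```

Here 'للناس' receives 3/10, 'للعين' receives 2/10, and each of the remaining five words receives 1/10.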

In [6]:
def generate_poem(start_word):
  """
  Generate a 4-line poem from a given start word.

  Each next word is sampled from the words that follow the current
  word in the corpus (a bigram model), weighted by their frequency.

  - start_word: the first word entered by the user
  """
  start = start_word
  poem = []                   # store generated lines

  # ask for a new word while the current one is not in the model
  # (prompt: "The model does not support this word, please try again: ")
  while start not in ngrams:
    start = input("النموذج لا يدعم هذه الكلمة، حاول/ي مرةً أخرى: ")

  # for loop to generate 4 lines
  for _ in range(4):
    line = start

    # for loop to generate 8 words in each line
    for _ in range(8):
      candidates = ngrams[start] # list of possible next words (with duplicates)

      # count each distinct candidate once so the probabilities sum to 1
      counts = {}
      for word in candidates:
        counts[word] = counts.get(word, 0) + 1

      # draw a fresh random threshold for every generated word
      threshold = random.random()

      # accumulate probabilities until the running total reaches the threshold
      accumulator = 0
      for word, count in counts.items():
        accumulator += count / len(candidates)
        next_word = word
        if accumulator >= threshold:
          break

      line += ' ' + next_word            # add the selected next word to the current line
      start = next_word                  # the new word becomes the context
    poem.append(line)                    # add the completed line to the poem list
    start = random.choice(ngrams[start]) # new random word for the new line

  return '\n'.join(poem)
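
The word-selection step can also be isolated into a small helper, which makes it easier to test on its own. The sketch below is one possible way to do it, assuming a follower list like the ones stored in `ngrams`. `pick_next` is a hypothetical name, and the threshold is passed in explicitly so the result is reproducible.

```python
import random
from collections import Counter
from itertools import accumulate

def pick_next(followers, threshold):
    """Return the first distinct follower whose cumulative probability
    reaches `threshold` (a number in [0, 1))."""
    counts = Counter(followers)                     # distinct words with frequencies
    words = list(counts)                            # in order of first appearance
    cumulative = list(accumulate(counts.values()))  # running totals of the counts
    target = threshold * len(followers)             # threshold on the count scale
    for word, running_total in zip(words, cumulative):
        if running_total >= target:
            return word
    return words[-1]                                # fallback, should not be reached

followers = ['x', 'x', 'x', 'y']   # toy list (hypothetical): 'x' 75%, 'y' 25%
print(pick_next(followers, 0.10))  # low threshold falls in the 'x' range
print(pick_next(followers, 0.90))  # high threshold falls in the 'y' range
print(pick_next(followers, random.random()))
```

Since the follower list already repeats frequent words, calling `random.choice(followers)` gives an equivalent frequency-weighted draw with less code; the explicit version is shown only because it matches the threshold description.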

## Generate Poem

In [7]:
# prompt (Arabic): "Enter a word to start: "
start_word = input("أدخل/ي كلمة للبدء: ")
poem = generate_poem(start_word)
print('\n', poem)

أدخل/ي كلمة للبدء: عهد

 عهد عاد بنفسجا شكوت صبابتي ووقفت للواشين في كل
المنام يزور بكيت إلى أن أرى لقد كنت من
أرهب ألا أيها البيت العتيق المحجب لأستمسكن بالود ما
ما مضى من هوى صادقا إني بالفتاة لمعجب ألا


## References


**Dataset (Corpus):**
* [ARBML GitHub Repository](https://github.com/ARBML/ARBML)

**Text generation:**
* [Text Generation Using N-Gram](https://medium.com/@vsagziyagli/text-generation-using-n-grams-ef49e6e43d39)
* CCAI-413: Natural Language Processing | Lab#3 N-Gram Language Models