
##Reweighting a probability distribution to a different temperature

In [1]:
import numpy as np 
def reweight_distribution(original_distribution, temperature=0.5):
  """
  Reweight a probability distribution to a given temperature.

  original_distribution is a 1D numpy array of probabilities that sums to 1.

  Steps:
    1. take the log of the probabilities to undo the exponential,
    2. divide by the temperature,
    3. take the exponential again,
    4. normalize so the result sums to 1.

  If the input is proportional to e^y, the output is
  [e^(y/t)] / [sum(e^(y/t))], where t is the temperature.

  Note:
    Higher temperatures give sampling distributions of higher entropy, which
    generate more surprising and less structured data, whereas lower
    temperatures give less randomness and more predictable generated data.
  """
  distribution = np.log(original_distribution) / temperature
  distribution = np.exp(distribution)
  return distribution / np.sum(distribution) 
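
As a quick check of the note in the docstring, the sketch below applies the same reweighting to a made-up four-element distribution and prints its entropy at several temperatures. The `toy` array and the `entropy` helper are illustrative assumptions, not part of the notebook; the function is repeated only so the block runs on its own.

```python
import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    # Same computation as the function above, repeated for a standalone run.
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    return distribution / np.sum(distribution)

def entropy(p):
    # Shannon entropy in nats (hypothetical helper, assumes no zero entries).
    return float(-np.sum(p * np.log(p)))

# Made-up distribution used only for illustration.
toy = np.array([0.5, 0.3, 0.15, 0.05])

for t in (0.2, 0.5, 1.0, 1.5):
    reweighted = reweight_distribution(toy, temperature=t)
    print(t, np.round(reweighted, 3), round(entropy(reweighted), 3))
```

At temperature 1.0 the distribution comes back unchanged; below 1.0 the most likely element gains probability mass, and above 1.0 the distribution flattens.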

##Downloading and parsing the initial text file

In [2]:
import tensorflow as tf
#path = tf.keras.utils.get_file(fname = ... , origin = url_link to datasets)
path = tf.keras.utils.get_file( 'nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt') #to download datasets


Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt


In [3]:
text = open(path).read().lower()  # read the corpus and convert it to lowercase

In [4]:
print('Corpus length:', len(text))  # prints the number of characters in the text

Corpus length: 600893


##Vectorizing sequences of characters

Passing a string to set() returns its unique characters; a repeated character appears only once.
``` 
set("Khaled Mohammed AbdelGaber")
output:
      {' ', 'A', 'G', 'K', 'M', 'a', 'b', 'd', 'e', 'h', 'l', 'm', 'o', 'r'}
```
The sorted() function sorts characters or numbers and returns a list of the sorted elements.
```
sorted(set("Khaled Mohammed AbdelGaber"))
output:
      [' ', 'A', 'G', 'K', 'M', 'a', 'b', 'd', 'e', 'h', 'l', 'm', 'o', 'r']
```
We vectorize the text in three steps:
* Store all unique characters in a list called **chars**.
* Store these characters in a dictionary called **char_indices**, whose keys are the characters and whose values are their indices in the **chars** list.
* **Build two numpy arrays, the model input and the model output:**
  * The input is a sentence of 60 characters and the output is a single character (the one that follows the sentence).
  * The model cannot read alphabetic characters, so each sentence is replaced by a 60 × len(chars) matrix of 0s and 1s, with one one-hot row per character.
  * In each row, the 1 sits at the index of the character found at that position in the sentence.
  * The output is a vector of len(chars) elements, all zeros except at the index of the next character.
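
The steps above can be sketched on a tiny made-up string, which makes the array shapes easy to inspect. The `toy_` names, the 11-character text, and the window length of 4 are assumptions chosen for illustration; the notebook itself uses 60-character windows over the full corpus.

```python
import numpy as np

toy_text = "hello world"   # made-up stand-in for the corpus
toy_maxlen = 4             # window length (60 in the notebook)
toy_step = 3               # start a new window every 3 characters

# Unique characters and their indices.
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Sliding windows and the character that follows each one.
starts = range(0, len(toy_text) - toy_maxlen, toy_step)
toy_windows = [toy_text[i:i + toy_maxlen] for i in starts]
toy_targets = [toy_text[i + toy_maxlen] for i in starts]

# One-hot encode: one row per character position in X, one vector per target in y.
toy_X = np.zeros((len(toy_windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(toy_windows):
    for j, c in enumerate(window):
        toy_X[i, j, toy_char_indices[c]] = True
    toy_y[i, toy_char_indices[toy_targets[i]]] = True

print(toy_windows, toy_targets)
print(toy_X.shape, toy_y.shape)
```

Each window becomes a (window length × vocabulary size) matrix with exactly one `True` per row, and each target becomes a single one-hot vector.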


In [5]:
maxlen = 60  # length of each input sentence
sentences = []  # input sentences for the model
next_char = []  # the character that follows each sentence (the model output)
step = 3  # start a new sentence every 3 characters
for i in range(0, len(text) - maxlen, step):
  sentences.append(text[i:i+maxlen])
  next_char.append(text[i+maxlen])
print('number of scenteces = ',len(sentences))
# collect the unique characters
chars = sorted(set(text))
# map each character to its index in chars
char_indices = {char: index for index, char in enumerate(chars)}
print("first ten characters are : ",chars[:10])
print("10th sentences in text is : ",sentences[9])
# one-hot encode the input and the output
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
  for j, char in enumerate(sentence):
    X[i, j, char_indices[char]] = 1
  # set the target once per sentence, not once per character
  y[i, char_indices[next_char[i]]] = 1


number of scenteces =  200278
first ten characters are :  ['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.']
10th sentences in text is :  uth is a woman--what then? is there not ground
for suspectin


In [6]:
train_len = int(0.8 * X.shape[0])
test_len = int(0.1 * X.shape[0])
X_train = X[:train_len,:,:]
X_test = X[train_len:(train_len + test_len) , : , :]
X_val = X[(train_len + test_len): ,: , :]

y_train = y[:train_len,:]
y_test = y[train_len:(train_len + test_len ), : ]
y_val = y[(train_len + test_len ) : ,: ]

In [7]:
X_train.shape

(160222, 60, 57)

##Single-layer LSTM model for next-character prediction

In [35]:
from keras.layers import LSTM, Dense
from keras.models import Sequential
import tensorflow as tf
from keras.callbacks import ModelCheckpoint #to save checkpoints

model = Sequential()
model.add(LSTM(128 , input_shape = (maxlen , len(chars)),return_sequences= False))
model.add(Dense(len(chars) , activation = 'softmax'))
model.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_8 (LSTM)               (None, 128)               95232     
                                                                 
 dense_5 (Dense)             (None, 57)                7353      
                                                                 
Total params: 102,585
Trainable params: 102,585
Non-trainable params: 0
_________________________________________________________________
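
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias vector; the helper names are hypothetical and only the sizes 57 and 128 come from the notebook.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with (input_dim + units) * units weights plus units biases.
    return 4 * ((input_dim + units) * units + units)

def dense_param_count(input_dim, units):
    # One weight per input/output pair plus one bias per output unit.
    return input_dim * units + units

vocab_size = 57    # len(chars) in the notebook
lstm_units = 128

lstm_params = lstm_param_count(vocab_size, lstm_units)
dense_params = dense_param_count(lstm_units, vocab_size)
print(lstm_params, dense_params, lstm_params + dense_params)
```

These formulas give 95,232 parameters for the LSTM layer and 7,353 for the Dense layer, matching the 102,585 total reported by `model.summary()`.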


In [36]:
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001)
model.compile(optimizer = optimizer , loss = 'categorical_crossentropy', metrics = ['accuracy'])

##Model training and sampling from it

In [37]:
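def sample(preds, temperature=1.0):
  # preds is the softmax output of the model (a probability distribution)
  preds = np.asarray(preds).astype('float64')
  # reweight the distribution to the given temperature
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  # draw one sample; probas is one-hot at the sampled index
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)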
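
To see how temperature changes what `sample` returns, the sketch below draws many samples from a made-up three-class prediction and prints the observed frequencies. The `toy_preds` values, the fixed seed, and the number of draws are assumptions for illustration; the function is repeated only so the block runs on its own.

```python
import numpy as np

def sample(preds, temperature=1.0):
    # Same computation as the function above, repeated for a standalone run.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

np.random.seed(0)               # fixed seed so repeated runs agree
toy_preds = [0.1, 0.7, 0.2]     # made-up model output over three characters

for t in (0.2, 1.0, 1.2):
    draws = [sample(toy_preds, t) for _ in range(1000)]
    counts = np.bincount(draws, minlength=len(toy_preds))
    print(t, counts / counts.sum())
```

At low temperature almost every draw is the most likely index, which is consistent with the repetitive text generated at temperature 0.2; at higher temperatures the less likely indices are chosen more often.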

In [38]:
import random 
import sys
for epochs in range(1,60):
  print("\n######################################################\n")
  print('\n epochs number : ',epochs)
  print("\n######################################################\n")
  model.fit(X,y,epochs = 1,batch_size = 128 ) # train the model for one epoch
  start_index = random.randint(0, len(text) - maxlen - 1) # pick a random start index for the seed text
  generated_text = text[start_index: start_index + maxlen] # seed text of 60 characters to feed to the model
  print('--- Generating with seed: "' + generated_text + '"')
  for temperature in [0.2, 0.5, 1.0, 1.2]:
    print("\n===========================================================\n\n")
    print('------ temperature:', temperature)
    sys.stdout.write(generated_text) # write the current 60-character window (the seed for the first temperature)
    for i in range(400):
      # one-hot encode the current window as the model input
      sampled = np.zeros((1, maxlen, len(chars))) 
      for t, char in enumerate(generated_text):
        sampled[0,t,chars.index(char)] = 1.

      # predict the probability distribution over the next character
      preds = model.predict(sampled, verbose=0)[0]
      # reweight the distribution by temperature and sample an index from it
      next_index = sample(preds, temperature)
      # look up the sampled character
      next_char = chars[next_index]
      # append it to the window and drop the oldest character
      generated_text += next_char
      generated_text = generated_text[1:]
      # write one character at a time
      sys.stdout.write(next_char) 




######################################################


 epochs number :  1

######################################################

--- Generating with seed: "                    he whom ye seek?
     ye stare and stop-"



------ temperature: 0.2
                    he whom ye seek?
     ye stare and stop-int and of the the hat on the the whe hand and and and and and in the herere the the hall and and and and and on the the has and and and and and and and and and and and ally and and the here the hand and and and and and and and and and and and in the herer and and and the his the hand the the sone the the here the the hand the the hall and and and and the the hererent and and and the wher and and 


------ temperature: 0.5
d and and and the the hererent and and and the wher and and of the ind is are the ther ar of whith paresy and and somerner there on ind and the the the has mant and cand and als is at are andores be mare om the the and rheris more the sed at or ther ther and on 