<a href="https://colab.research.google.com/github/H0sseinR0stami/DeepLearningProjects/blob/main/Text_generation/Text_Generation_using_LSTMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).




**1. Import the libraries**

In [None]:

%tensorflow_version 2.x

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras
from numpy.random import seed
import tensorflow
tensorflow.random.set_seed(2)
seed(1)

import pandas as pd
import numpy as np
import string, os

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

**2. Load the dataset**

In [None]:
curr_dir = '/content/drive/MyDrive/Colab_Notebooks/project5/database/'
all_headlines = []
for filename in os.listdir(curr_dir):
    if 'Articles' in filename:
        print(filename)
        article_df = pd.read_csv(curr_dir + filename)
        all_headlines.extend(list(article_df.headline.values))
    #if 'CommentsJan2017' in filename:
       #print(filename)
       #article_df = pd.read_csv(curr_dir + filename,nrows=500)
       #all_headlines.extend(list(article_df.commentBody.values))
       ##df = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, nrows=1000)
all_headlines = [h for h in all_headlines if h != "Unknown"]
len(all_headlines)




ArticlesApril2017.csv
ArticlesFeb2017.csv
ArticlesFeb2018.csv
ArticlesApril2018.csv
ArticlesJan2017.csv
ArticlesJan2018.csv
ArticlesMarch2017.csv
ArticlesMarch2018.csv
ArticlesMay2017.csv


8603

**3. Dataset preparation**

**3.1 Dataset cleaning**

In the dataset preparation step, we first clean the text, which includes removing punctuation and lowercasing all the words.

In [None]:
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt

corpus = [clean_text(x) for x in all_headlines]
corpus[:10]


['finding an expansive view  of a forgotten people in niger',
 'and now  the dreaded trump curse',
 'venezuelas descent into dictatorship',
 'stain permeates basketball blue blood',
 'taking things for granted',
 'the caged beast awakens',
 'an everunfolding story',
 'oreilly thrives as settlements add up',
 'mouse infestation',
 'divide in gop now threatens trump tax plan']
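To make the effect of the cleaning step concrete, here is a minimal standalone sketch that repeats the same logic on a few illustrative strings (the sample inputs are made up for demonstration and are not taken from the dataset). Note that removing a hyphen merges the two words around it, and dropping a non-ASCII character can leave a double space behind, as in the output above.

```python
import string

def clean_text(txt):
    # Drop ASCII punctuation, then lowercase
    txt = "".join(ch for ch in txt if ch not in string.punctuation).lower()
    # Drop any remaining non-ASCII characters (e.g. curly quotes, long dashes)
    return txt.encode("utf8").decode("ascii", "ignore")

# Illustrative inputs (assumptions, not rows from the CSV files)
samples = [
    "An Ever-Unfolding Story",
    "Venezuela\u2019s Descent",   # curly apostrophe
    "And Now, \u2014 What?",      # long dash surrounded by spaces
]
for s in samples:
    print(repr(clean_text(s)))
```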

**3.2 Generating Sequence of N-gram Tokens**

Language modelling requires sequential input data: given a sequence of words/tokens, the aim is to predict the next word/token.

The next step is tokenization, the process of extracting tokens (terms/words) from a corpus. Keras has a built-in Tokenizer class which can be used to obtain the tokens and their indices in the corpus. After this step, every text document in the dataset is converted into a sequence of tokens.

In [None]:
tokenizer = Tokenizer()
def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1

    ## convert data to sequence of tokens
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]

[[391, 17],
 [391, 17, 5166],
 [391, 17, 5166, 523],
 [391, 17, 5166, 523, 4],
 [391, 17, 5166, 523, 4, 2],
 [391, 17, 5166, 523, 4, 2, 1601],
 [391, 17, 5166, 523, 4, 2, 1601, 134],
 [391, 17, 5166, 523, 4, 2, 1601, 134, 5],
 [391, 17, 5166, 523, 4, 2, 1601, 134, 5, 1951],
 [7, 57]]
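The tokenization and prefix expansion shown above can be sketched without Keras. The code below is only an approximation of what the Tokenizer does (it ranks words by frequency and reserves index 0 for padding); the helper names `build_word_index` and `ngram_prefixes` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def build_word_index(corpus):
    # Rank words by frequency (most frequent first); index 0 is left for padding
    counts = Counter(word for line in corpus for word in line.split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ranked)}

def ngram_prefixes(corpus, word_index):
    sequences = []
    for line in corpus:
        tokens = [word_index[w] for w in line.split()]
        # Every prefix of length >= 2 becomes one training sequence
        for i in range(1, len(tokens)):
            sequences.append(tokens[: i + 1])
    return sequences

toy_corpus = ["the caged beast awakens", "the dreaded curse"]
toy_word_index = build_word_index(toy_corpus)
toy_sequences = ngram_prefixes(toy_corpus, toy_word_index)
print(toy_word_index)
print(toy_sequences)
```

A headline with n tokens therefore contributes n - 1 training sequences, which is why the first headline above expands into nine rows.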

**3.3 Padding the Sequences and Obtaining Variables: Predictors and Target**

Now that we have generated a dataset containing sequences of tokens, different sequences may have different lengths. Before training the model, we need to pad the sequences so that their lengths are equal. We can use the pad_sequences function of Keras for this purpose. To feed this data into a learning model, we need to create predictors and a label. We will use each N-gram sequence as the predictors and the next word after the N-gram as the label. For example:

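Headline: they are learning data science

| PREDICTORS | LABEL |
| --- | --- |
| they | are |
| they are | learning |
| they are learning | data |
| they are learning data | science |
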
In [None]:
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = keras.utils.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
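The padding and the predictors/label split can be illustrated in plain Python. This is a rough sketch of the same idea, not the Keras implementation; the helper names and the toy sequences are assumptions for illustration.

```python
def pad_pre(sequences):
    # Left-pad every sequence with 0 up to the length of the longest one
    max_len = max(len(s) for s in sequences)
    return [[0] * (max_len - len(s)) + s for s in sequences], max_len

def split_predictors_label(padded):
    # All tokens except the last are the predictors; the last token is the label
    return [row[:-1] for row in padded], [row[-1] for row in padded]

def one_hot(index, num_classes):
    # One-hot vector with a single 1 at position `index`
    vec = [0] * num_classes
    vec[index] = 1
    return vec

toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
toy_padded, toy_max_len = pad_pre(toy_sequences)
toy_predictors, toy_labels = split_predictors_label(toy_padded)
toy_one_hot = [one_hot(y, num_classes=5) for y in toy_labels]

print(toy_padded)
print(toy_predictors)
print(toy_labels)
```

Padding on the left ('pre') keeps the label in the last column of every row, which is what makes the simple slice in generate_padded_sequences work.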

**4. LSTMs for Text Generation**

Unlike feed-forward neural networks, in which activations propagate in only one direction, Recurrent Neural Networks (RNNs) feed the output of their hidden neurons back in as input at the next time step. This creates loops in the network architecture which act as a ‘memory state’ of the neurons. This state gives the network the ability to remember what has been learned so far.

The memory state gives RNNs an advantage over traditional neural networks, but a problem called the vanishing gradient is associated with them. When an RNN is unrolled over many time steps it behaves like a network with a large number of layers, and it becomes really hard to learn and tune the parameters of the earlier layers. To address this problem, a new type of RNN called the LSTM (Long Short-Term Memory) was developed.
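As a rough illustration of the vanishing gradient, consider a one-neuron RNN with a tanh activation. The sketch below multiplies the per-step derivatives along the chain rule and shows how the gradient with respect to the initial state shrinks as the sequence gets longer. The recurrent weight, input and initial state are arbitrary values chosen for illustration.

```python
import math

def gradient_through_time(w, steps, h0=0.5, x=0.1):
    # Scalar RNN: h_t = tanh(w * h_{t-1} + x)
    # d h_t / d h_{t-1} = w * (1 - h_t^2); the chain rule multiplies these factors
    h, grad, history = h0, 1.0, []
    for _ in range(steps):
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)
        history.append(abs(grad))
    return history

history = gradient_through_time(w=0.9, steps=50)
# Gradient magnitude after 1, 10 and 50 time steps
print(history[0], history[9], history[49])
```

Because every factor here has magnitude below 1, the product decays at least geometrically with the number of time steps.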

LSTMs have an additional state called the ‘cell state’ through which the network makes adjustments in the information flow. The advantage of this state is that the model can remember or forget what it has learned more selectively. To learn more about LSTMs, here is a great post. Let's build an LSTM model in our code. The model has the following layers:

Input Layer : Takes the sequence of words as input and maps each word index to a 10-dimensional embedding vector
LSTM Layer : Computes the output using LSTM units. I have added 200 units in the layer, but this number can be fine-tuned later.
Dropout Layer : A regularisation layer which randomly turns off the activations of some neurons in the LSTM layer (15% of them here). It helps in preventing overfitting. (Optional Layer)
Output Layer : Computes the probability of each word in the vocabulary being the next word

We will run this model for a total of 500 epochs, but this can be experimented with further.

In [None]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()

    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))

    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(200))
    model.add(Dropout(0.15))

    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')

    return model

model = create_model(max_sequence_len, total_words)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 23, 10)            112650    
                                                                 
 lstm (LSTM)                 (None, 200)               168800    
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 11265)             2264265   
                                                                 
Total params: 2,545,715
Trainable params: 2,545,715
Non-trainable params: 0
_________________________________________________________________
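The parameter counts in the summary above can be checked by hand. Here is a minimal sketch, assuming the standard formulas for Embedding, LSTM and Dense layers (the helper function names are mine and are not part of Keras); the vocabulary size 11265 is read off the Dense layer's output shape.

```python
def embedding_params(vocab_size, embed_dim):
    # One embedding vector per word index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embed_dim, lstm_units = 11265, 10, 200
param_counts = [
    embedding_params(vocab_size, embed_dim),
    lstm_params(embed_dim, lstm_units),
    dense_params(lstm_units, vocab_size),
]
print(param_counts, sum(param_counts))  # [112650, 168800, 2264265] 2545715
```

Most of the parameters sit in the output layer, because it needs one column of weights for every word in the vocabulary.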


In [None]:
# checkpointing
checkpoint_path = "/content/drive/MyDrive/Colab_Notebooks/project5/training_checkpoints/bestmodel.h5"


# Create a callback that saves the best model
mc = tensorflow.keras.callbacks.ModelCheckpoint(
    monitor="loss",
    filepath=checkpoint_path,
    verbose=1,
    save_best_only=True,
    mode='auto'
)

# early stopping

#es = keras.callbacks.EarlyStopping(monitor = "loss" , min_delta = 0.01 , patience = 10 , verbose =1 , mode = 'auto' )

# model check point

#cp_callback = [es,mc]


**Let's train our model now**

In [None]:
model.fit(predictors, label, epochs=500,steps_per_epoch=1500, verbose=1, callbacks=[mc])
#model.fit(predictors, label, epochs=500, verbose=1, callbacks=[cp_callback])

Epoch 1/500
Epoch 00001: loss improved from 7.23863 to 6.76336, saving model to /content/drive/MyDrive/Colab_Notebooks/project5/training_checkpoints/bestmodel.h5
Epoch 2/500
Epoch 00002: loss did not improve from 6.76336
Epoch 3/500
Epoch 00003: loss improved from 6.76336 to 6.64537, saving model to /content/drive/MyDrive/Colab_Notebooks/project5/training_checkpoints/bestmodel.h5
Epoch 4/500
Epoch 00004: loss improved from 6.64537 to 6.38256, saving model to /content/drive/MyDrive/Colab_Notebooks/project5/training_checkpoints/bestmodel.h5
Epoch 5/500
Epoch 00005: loss improved from 6.38256 to 6.08102, saving model to /content/drive/MyDrive/Colab_Notebooks/project5/training_checkpoints/bestmodel.h5
Epoch 6/500
Epoch 00006: loss improved from 6.08102 to 5.74010, saving model to /content/drive/MyDrive/Colab_Notebooks/project5/training_checkpoints/bestmodel.h5
Epoch 7/500
Epoch 00007: loss improved from 5.74010 to 5.39986, saving model to /content/drive/MyDrive/Colab_Notebooks/project5/tra

<keras.callbacks.History at 0x7f4967650510>

**4.2 Loading the model**

In [None]:
# restart from checkpoint
model = create_model(max_sequence_len, total_words)

model.load_weights(checkpoint_path)


**5. Generating the text**

Our model is now trained, and the best weights have been restored from the checkpoint. Next, let's write a function to predict the next word based on the input words (or seed text). We first tokenize the seed text, pad the sequence, and pass it into the trained model to get the predicted word. The predicted word is appended to the seed text, and the process is repeated to build the generated sequence.

In [None]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')


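        # Predict the distribution over the vocabulary and take the most likely index
        predict_x = model.predict(token_list, verbose=0)
        predicted = int(np.argmax(predict_x, axis=1)[0])

        # Map the index back to its word (index 0 is padding and has no word)
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word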
    return seed_text.title()
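The generation loop can be tried out without a trained model. The sketch below follows the same greedy steps (encode the text so far, ask for next-word probabilities, append the most likely word), but replaces the LSTM with a hypothetical stand-in function `toy_predict`. All names and the four-word vocabulary are assumptions for illustration.

```python
def greedy_generate(seed_text, next_words, predict_probs, index_word, word_index):
    words = seed_text.split()
    for _ in range(next_words):
        # Encode the text so far, skipping words outside the vocabulary
        tokens = [word_index[w] for w in words if w in word_index]
        probs = predict_probs(tokens)
        # Greedy choice: the index with the highest probability
        best = max(range(len(probs)), key=probs.__getitem__)
        words.append(index_word.get(best, ""))
    return " ".join(words).title()

toy_word_index = {"the": 1, "caged": 2, "beast": 3, "awakens": 4}
toy_index_word = {i: w for w, i in toy_word_index.items()}

def toy_predict(tokens):
    # Hypothetical stand-in for model.predict: puts all probability
    # on the index that follows the last token (capped at 4)
    probs = [0.0] * 5
    nxt = min(tokens[-1] + 1, 4) if tokens else 1
    probs[nxt] = 1.0
    return probs

toy_output = greedy_generate("the", 3, toy_predict, toy_index_word, toy_word_index)
print(toy_output)  # The Caged Beast Awakens
```

Because the choice is always the single most likely word, the same seed text always yields the same output.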

**6. Some Results**

In [None]:


print(generate_text("air", 12, model, max_sequence_len))
print(generate_text("trump", 14, model, max_sequence_len))
print(generate_text("donald trump", 14, model, max_sequence_len))
print(generate_text("india", 18, model, max_sequence_len))
print(generate_text("new york", 14, model, max_sequence_len))
print(generate_text("science and technology", 11, model, max_sequence_len))

**Improvement Ideas**

The model produces output that looks fairly reasonable. The results can be improved further in the following ways:

Adding more data
Fine-tuning the network architecture
Fine-tuning the network parameters