<b>Creating a model that generates text about a particular topic</b><br>
Here, the data is collected from Wikipedia. You only need to supply the topic name as it appears on the Wikipedia page.<br>
You can also pass multiple topics as a list, but I wouldn't recommend that unless you can train the model for more epochs.

As an example, I have taken Deep Learning as the topic.


# DATA COLLECTION

The data is acquired by scraping Wikipedia pages.<br>
Here, I use the Beautiful Soup and requests modules for web scraping.

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
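def get_data(topics):
    """Return the paragraph texts of the Wikipedia page for each topic."""
    text_data=[]

    for topic in topics:
        # The topic must match the page name used in the Wikipedia URL, e.g. 'Deep_learning'
        wiki_pages=requests.get('https://en.wikipedia.org/wiki/'+topic)
        soup=BeautifulSoup(wiki_pages.text,'lxml')
        find_p=soup.find("div",{"class":"mw-content-ltr"}).find_all("p")

        for p in find_p:
            if len(p.text)>10:   # skip empty or near-empty paragraphs
                text_data.append(p.text)

    return text_data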

In [3]:
data=get_data(['Deep_learning'])

In [4]:
print(len(data),'\n')
print(data[8])

98 

For supervised learning tasks, deep learning methods eliminate feature engineering, by translating the data into compact intermediate representations akin to principal components, and derive layered structures that remove redundancy in representation.



# DATA PROCESSING


In [5]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 

Using TensorFlow backend.


Fitting the Keras tokenizer on the data gives some useful statistics about it.

In [7]:
tokenizer = Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
tokenizer.fit_on_texts(data)
    
print(tokenizer.document_count,'\n')   #Number of documents (texts/sequences) the tokenizer was trained on
    
print('Found {0} unique tokens.'.format(len(tokenizer.word_index)),'\n')    #total number of unique words
    
#print(tokenizer.word_index,'\n')       #dictionary mapping words to their rank/index (int)
    
#print(tokenizer.word_counts,'\n')           #dictionary mapping words to the number of times they appeared on during fit

#print(tokenizer.word_docs)             #dictionary mapping words to the number of documents/texts they appeared on during fit        

98 

Found 2059 unique tokens. 
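
To make these numbers more concrete, here is a minimal pure-Python sketch of roughly what fitting the tokenizer does: lowercase the text, replace the filtered characters with spaces, and give each word an integer index by frequency, with index 0 left free for padding. The helper names `split_words`, `simple_fit` and `simple_texts_to_sequence` are hypothetical and are not part of Keras. This is an approximation for illustration, not the library's implementation.

```python
from collections import Counter

# Same filter characters as passed to the Keras Tokenizer above
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans(FILTERS, " " * len(FILTERS))
    return text.lower().translate(table).split()


def simple_fit(texts):
    """Return a word -> index map; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # Stable sort: words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Indices start at 1 so that 0 can be used for padding.
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def simple_texts_to_sequence(text, word_index):
    """Map a text to integer ids, skipping words not seen during fitting."""
    return [word_index[w] for w in split_words(text) if w in word_index]


toy_docs = ["Deep learning is fun.", "Deep networks learn features; learning is slow."]
toy_index = simple_fit(toy_docs)
print(toy_index)
print(simple_texts_to_sequence("Learning is deep!", toy_index))
```

Because indices start at 1, a vocabulary of 2059 words needs 2060 output classes, which is why `total_words` is computed as `len(word_index)+1` in the functions that follow.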



Here I have defined two data-preparation functions.<br>
The first creates n-gram predictors of every length (each sequence contains all preceding words of the paragraph), and the second creates 4-gram predictors (at most four preceding words).
<br>

In [0]:
def prepare_data(data):
    tz = Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
    tz.fit_on_texts(data)
    total_words = len(tz.word_index)+1
    
    input_sequences = []
    for line in data:
        token_list = tz.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)    
    max_len_sequence = max([len(x) for x in input_sequences])
    padded_sequence = np.array(pad_sequences(input_sequences, maxlen = max_len_sequence, padding = 'pre'))  # pre-pad so that every training sequence has the same length
    predictors, label = padded_sequence[:,:-1],padded_sequence[:,-1]            # split into predictors and target variable
    label = ku.to_categorical(label, num_classes = total_words)   
    
    return predictors,label,max_len_sequence,total_words
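
The padding and label-splitting steps in `prepare_data` can be illustrated with a small NumPy-only sketch. The helpers `pre_pad` and `one_hot` are hypothetical stand-ins that mimic what `pad_sequences(..., padding='pre')` and `to_categorical` do on toy input. They are not the Keras implementations, and the toy ids are chosen only for illustration.

```python
import numpy as np


def pre_pad(sequences, maxlen):
    """Left-pad (and left-truncate) integer sequences to a fixed length with zeros."""
    padded = np.zeros((len(sequences), maxlen), dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]                       # keep the most recent tokens
        padded[row, maxlen - len(trimmed):] = trimmed
    return padded


def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows."""
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded


toy_sequences = [[7, 8], [7, 8, 58], [7, 8, 58, 490]]
toy_padded = pre_pad(toy_sequences, maxlen=4)

# The last column is the word to predict; everything before it is the input.
toy_predictors, toy_labels = toy_padded[:, :-1], toy_padded[:, -1]
toy_targets = one_hot(toy_labels, num_classes=491)    # largest id + 1

print(toy_predictors)
print(toy_labels)
print(toy_targets.shape)
```

Padding on the left keeps the most recent words next to the label column, so slicing off the last column always separates the target cleanly.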

In [0]:
def prepare_data_4(data):
    tz = Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
    tz.fit_on_texts(data)
    total_words = len(tz.word_index)+1
    
    input_sequences = []
    for line in data:
        count=0
        token_list = tz.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            if i<=4:
                n_gram_sequence = token_list[:i+1]
            else:
                count+=1
                n_gram_sequence = token_list[count:i+1]
            input_sequences.append(n_gram_sequence)    
    max_len_sequence = max([len(x) for x in input_sequences])
    padded_sequence = np.array(pad_sequences(input_sequences, maxlen = max_len_sequence, padding = 'pre'))
    predictors, label = padded_sequence[:,:-1],padded_sequence[:,-1]
    label = ku.to_categorical(label, num_classes = total_words)   
    
    return predictors,label,max_len_sequence,total_words

In [0]:
predictors,labels,max_len_sequence,total_words = prepare_data(data)

In [49]:
print(labels.shape,'\n')
print(predictors.shape,'\n')
print(max_len_sequence,'\n')
print(total_words)

(6941, 2060) 

(6941, 298) 

299 

2060


In [0]:
predictors_4,labels_4,max_len_sequence_4,total_words_4 = prepare_data_4(data)

In [50]:
print(labels_4.shape,'\n')
print(predictors_4.shape,'\n')
print(max_len_sequence_4,'\n')
print(total_words_4)

(6941, 2060) 

(6941, 4) 

5 

2060


In [62]:
print(predictors[:5],'\n')
print(predictors_4[:5])

[[  0   0   0 ...   0   0   7]
 [  0   0   0 ...   0   7   8]
 [  0   0   0 ...   7   8  58]
 [  0   0   0 ...   8  58 490]
 [  0   0   0 ...  58 490  10]] 

[[  0   0   0   7]
 [  0   0   7   8]
 [  0   7   8  58]
 [  7   8  58 490]
 [  8  58 490  10]]


With the n-gram predictors, the label is predicted from all preceding words of the paragraph, whereas with the 4-gram predictors it is predicted from only the four preceding words.
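
The difference between the two schemes can be shown on a toy token list. This is a minimal sketch: `prefix_sequences` and `sliding_sequences` are hypothetical helper names that reproduce the slicing logic of `prepare_data` and `prepare_data_4` before padding, and the token ids are illustrative.

```python
def prefix_sequences(tokens):
    """All growing prefixes: tokens[:2], tokens[:3], ..., tokens[:len(tokens)]."""
    return [tokens[:i + 1] for i in range(1, len(tokens))]


def sliding_sequences(tokens, context=4):
    """Same targets, but each sequence keeps at most `context` preceding tokens."""
    return [tokens[max(0, i - context):i + 1] for i in range(1, len(tokens))]


toy_tokens = [7, 8, 58, 490, 10, 3, 21]

# Left: growing prefix (n-gram). Right: window of at most 4 words plus the label.
for full, short in zip(prefix_sequences(toy_tokens), sliding_sequences(toy_tokens)):
    print(full, "|", short)
```

Both schemes yield one training sample per predicted word, which is why both datasets above have 6941 rows; only the width of the predictor matrix differs.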

# MODEL DEVELOPMENT

I will use an LSTM layer with 100 units, since the dataset is small.<br>

On a large dataset, a GRU is usually cheaper in training time and GPU memory, and its output quality is often nearly the same.

In [0]:
def create_model(max_len_sequence, total_words):
    model = Sequential()
    
    model.add(Embedding(total_words, 10, input_length=max_len_sequence - 1))
    
    model.add(LSTM(100))
    model.add(Dropout(0.1))

    
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
    
    return model

In [14]:
model = create_model(max_len_sequence, total_words)

print(model.summary(),'\n')







Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 298, 10)           20600     
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               44400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 2060)              208060    
Total params: 273,060
Trainable params: 273,060
Non-trainable params: 0
_________________________________________________________________


In [51]:
model_4 = create_model(max_len_sequence_4,total_words_4)

print(model_4.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 4, 10)             20600     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               44400     
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2060)              208060    
Total params: 273,060
Trainable params: 273,060
Non-trainable params: 0
_________________________________________________________________
None
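
The parameter counts in the two summaries can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer. The function names are hypothetical; the sizes are taken from the summaries above.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim


vocab_size, embed_dim, units = 2060, 10, 100
counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "lstm": lstm_params(embed_dim, units),
    "dense": dense_params(units, vocab_size),
}
counts["total"] = sum(counts.values())
print(counts)
```

None of these formulas depends on the sequence length, which is why both models have exactly the same number of parameters even though one reads 298 time steps and the other only 4.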


In [22]:
model.fit(predictors, labels, epochs=100)

Epoch 1/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x7f0325ac8748>

In [52]:
model_4.fit(predictors_4, labels_4, epochs=100)

Epoch 1/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x7f036f43da90>

As you can see, the 4-gram model is much faster to train: each epoch takes just 2-3 seconds, whereas each epoch on the n-gram data took around 90 seconds.

In [0]:
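def generate_text(input_text, next_words_count, model, max_len_sequence):
    # Relies on the global `tokenizer` fitted above; it was fitted on the same data
    # with the same settings as in prepare_data, so the word indices match.
    for i in range(next_words_count):
        token_list = tokenizer.texts_to_sequences([input_text])[0]
        token_list = np.array(pad_sequences([token_list], maxlen=max_len_sequence-1, padding='pre'))

        # Greedy decoding: pick the most probable next word.
        # argmax over model.predict also works on Keras versions without predict_classes.
        predicted = np.argmax(model.predict(token_list), axis=-1)[0]
        output_word = ''

        # Reverse lookup: find the word whose index equals the prediction
        for word,index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break

        input_text = input_text + " " + output_word

    return input_text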
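
The generation loop above is a greedy decoder: at each step it feeds the most recent words to the model and appends the single most probable next word. The sketch below isolates that loop so it runs without a trained model. `greedy_generate`, `toy_predict_proba` and `toy_word_index` are hypothetical names; the toy predictor is a made-up stand-in for `model.predict`, and the reverse dictionary is built once instead of scanning `word_index` on every step.

```python
import numpy as np


def greedy_generate(seed_ids, steps, predict_proba, context_len):
    """Repeatedly append the most probable next id, using a fixed-size context."""
    ids = list(seed_ids)
    for _ in range(steps):
        context = ids[-context_len:]
        context = [0] * (context_len - len(context)) + context  # pre-pad with zeros
        probs = predict_proba(np.array([context]))[0]
        ids.append(int(np.argmax(probs)))
    return ids


# Hypothetical stand-in for model.predict: always favours (last id + 1) modulo 6.
def toy_predict_proba(batch):
    probs = np.full((len(batch), 6), 0.1)
    probs[np.arange(len(batch)), (batch[:, -1] + 1) % 6] = 0.5
    return probs


toy_word_index = {"deep": 1, "learning": 2, "is": 3, "a": 4, "field": 5}
index_to_word = {idx: word for word, idx in toy_word_index.items()}  # built once

generated = greedy_generate([1, 2], steps=3, predict_proba=toy_predict_proba, context_len=4)
print(" ".join(index_to_word.get(i, "") for i in generated))
```

Because greedy decoding always takes the top prediction, the same seed text always produces the same continuation, which is consistent with several outputs below sharing long identical phrases.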

Let's compare the output of both models.

In [54]:
print(generate_text("Deep learning is", 30, model, max_len_sequence),'\n')

print(generate_text("Deep learning is", 30, model_4, max_len_sequence_4))

Deep learning is closely related to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy to the us according to the cost and 223 the first function 

Deep learning is closely related to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy the cap image is an thousands of transformations 1 steps for number


In [55]:
print(generate_text("It is the", 30, model, max_len_sequence),'\n')

print(generate_text("It is the", 30, model_4, max_len_sequence_4))

It is the team led to george intelligent machines in a lack of understanding of the same brain in the us according to be training to produce function player 195 196 197 google 

It is the universal interpretation and abstraction the done that them an defense is learning used to flow is the game 2 221 training uniform regularization model matching variables and “bad hinton amazon


In [56]:
print(generate_text("Artificial neural networks", 30, model, max_len_sequence),'\n')

print(generate_text("Artificial neural networks", 30, model_4, max_len_sequence_4))

Artificial neural networks anns were inspired to produce molecules in the early 2000s when of finite size to approximate time and example the most mathematical nsa and darpa sri deep neural network and 

Artificial neural networks anns were used 2 such for computer weights and won the ann that adjust the style of data neurons is trained may layer at a reverse treating each the biological


In [63]:
print(generate_text("A deep neural network", 30, model, max_len_sequence),'\n')

print(generate_text("A deep neural network", 30, model_4, max_len_sequence_4))

A deep neural network and began and the early 2000s when of finite same into a be representation of deep learning methods in the width and transformed the reverse mathematical b of deep learning 

A deep neural network with relu activation is strictly larger than the input dimension then deep neural network in speech for a visual verifier boolean the cap and uses which where have been explored


In [59]:
print(generate_text("Recommendation systems",30, model, max_len_sequence),'\n')

print(generate_text("Recommendation systems", 30, model_4, max_len_sequence_4))

Recommendation systems have been used to implementing language models since the early 2000s 109 137 lstm helped of the raw hidden input and the us according to yann lecun 73 not before 

Recommendation systems have used deep learning to extract meaningful features for the picture factor at user interface to the genetic breed and training for they cases is the ebola virus 158 and


In [60]:
print(generate_text("computer vision",30,model,max_len_sequence),'\n')

print(generate_text("computer vision", 30, model_4, max_len_sequence_4))

computer vision demonstrated in the early 2000s when of finite same is attackers and defenders in the first states d and possibly the first hierarchy to the cost function 223 224 or 

computer vision be speech recognition tasks to 2011 dramatically while the numbers of the optimization extraction relationship deep networks to extract meaningful features for the picture factor at user interface to the


Both models generate reasonably coherent text, with the n-gram model doing somewhat better. The 4-gram model, however, is much faster to train, so you can train it for more epochs (for example, 500) to improve its output.<br>

<b>NOTE:</b><br>
The 4-gram model predicts the next word from the last four words only. Sometimes more context than that is needed: words that appear earlier than the last four may also influence the word we want to predict.

Some other applications of language modelling:
<ul><li>Speech Recognition</li>
<li>Machine Translation</li>
<li>Spelling Correction</li>
<li>Providing Suggestions</li></ul>