<a href="https://colab.research.google.com/github/PearlSikka/language-ninja/blob/master/Text_generation_using_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I use the New York Times comments and headlines dataset to train an LSTM-based text-generation language model that can generate news headlines.


In [28]:
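import pathlib

# Return the kaggle module, uploading the API token first if it is missing.
def get_kaggle():
  try:
    import kaggle
    return kaggle
  except OSError:
    # Importing kaggle raises OSError when ~/.kaggle/kaggle.json is missing.
    pass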

  token_file = pathlib.Path("~/.kaggle/kaggle.json").expanduser()
  token_file.parent.mkdir(exist_ok=True, parents=True)

  try:
    from google.colab import files
  except ImportError:
    raise ValueError("Could not find kaggle token.")

  uploaded = files.upload()
  token_content = uploaded.get('kaggle.json', None)
  if token_content:
    token_file.write_bytes(token_content)
    token_file.chmod(0o600)
  else:
    raise ValueError('Need a file named "kaggle.json"')
  
  import kaggle
  return kaggle


kaggle = get_kaggle()

In [29]:
!kaggle datasets download -d aashita/nyt-comments

nyt-comments.zip: Skipping, found more recently modified local copy (use --force to force download)


In [30]:
!unzip nyt-comments.zip -d train

Archive:  nyt-comments.zip
replace train/ArticlesApril2017.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: train/ArticlesApril2017.csv  
replace train/ArticlesApril2018.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: train/ArticlesApril2018.csv  
replace train/ArticlesFeb2017.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: train/ArticlesFeb2017.csv  
replace train/ArticlesFeb2018.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: train/ArticlesFeb2018.csv  
  inflating: train/ArticlesJan2017.csv  
  inflating: train/ArticlesJan2018.csv  
  inflating: train/ArticlesMarch2017.csv  
  inflating: train/ArticlesMarch2018.csv  
  inflating: train/ArticlesMay2017.csv  
  inflating: train/CommentsApril2017.csv  
  inflating: train/CommentsApril2018.csv  
  inflating: train/CommentsFeb2017.csv  
  inflating: train/CommentsFeb2018.csv  
  inflating: train/CommentsJan2017.csv  
  inflating: train/CommentsJan2018.csv  
  inflating: train/CommentsMarch2017.csv  

In [31]:
import pandas as pd
import tensorflow as tf
import numpy as np


from numpy.random import seed

tf.random.set_seed(2)
seed(1)



In [32]:
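import os

curr_dir='/content/train/'

# Collect the article CSVs; comment files are skipped.
docs=[]
for filename in os.listdir(curr_dir):
  if 'Articles' in filename:
    data=pd.read_csv(curr_dir+filename)
    docs.append(data)
    # Stop after the first matching file, so only one month of articles is loaded.
    break
frame=pd.concat(docs,axis=0)

print(frame[:10])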


                  articleID  ... articleWordCount
0  58691a5795d0e039260788b9  ...             1324
1  586967bf95d0e03926078915  ...             2836
2  58698a1095d0e0392607894a  ...              445
3  5869911a95d0e0392607894e  ...              864
4  5869a61795d0e03926078962  ...              309
5  5869afd495d0e0392607896c  ...             2180
6  5869d08f95d0e03926078980  ...             1146
7  586a0d8795d0e039260789b3  ...              557
8  586a0d8795d0e039260789b6  ...              784
9  586a32f495d0e039260789f3  ...             1109

[10 rows x 16 columns]
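The loop above stops at its `break`, and `os.listdir` returns names in arbitrary order, so which month gets loaded is not guaranteed. A minimal sketch of listing every `Articles*.csv` file in a stable order follows. The helper name `list_article_files` is hypothetical and not part of the notebook.

```python
from pathlib import Path

def list_article_files(directory):
    """Return all 'Articles*.csv' paths in `directory`, sorted by name.

    Hypothetical helper (not in the notebook): sorting makes the order
    reproducible, unlike the arbitrary order returned by os.listdir.
    """
    return sorted(Path(directory).glob("Articles*.csv"))
```

Each returned path could then be passed to `pd.read_csv` and the results concatenated, for example `pd.concat([pd.read_csv(p) for p in list_article_files(curr_dir)])`. Note that doing so would change the row counts shown in the cells below.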


In [33]:
frame.columns

Index(['articleID', 'abstract', 'byline', 'documentType', 'headline',
       'keywords', 'multimedia', 'newDesk', 'printPage', 'pubDate',
       'sectionName', 'snippet', 'source', 'typeOfMaterial', 'webURL',
       'articleWordCount'],
      dtype='object')

In [34]:
frame.headline[:10]

0     G.O.P. Leadership Poised to Topple Obama’s Pi...
1    Fractured World Tested the Hope of a Young Pre...
2                                 Little Troublemakers
3                  Angela Merkel, Russia’s Next Target
4                        Boots for a Stranger on a Bus
5       Molder of Navajo Youth, Where a Game Is Sacred
6     ‘The Affair’ Season 3, Episode 6: Noah Goes Home
7                Sprint and Mr. Trump’s Fictional Jobs
8                              America  Becomes a Stan
9            Fighting Diabetes, and Leading by Example
Name: headline, dtype: object

In [35]:
frame.shape

(850, 16)

In [36]:
frame=frame[['headline']]

In [37]:
frame.head()

                                            headline
0   G.O.P. Leadership Poised to Topple Obama’s Pi...
1  Fractured World Tested the Hope of a Young Pre...
2                               Little Troublemakers
3                Angela Merkel, Russia’s Next Target
4                      Boots for a Stranger on a Bus


In [38]:
frame.shape

(850, 1)

In [39]:
frame=frame[frame['headline']!='Unknown']

In [40]:
frame.shape

(777, 1)

In [42]:
from tensorflow.keras.preprocessing.text import Tokenizer


In [59]:
tokenizer=Tokenizer(num_words=100,filters='!"#$%&()*+,-./:;<=>?...',lower=True)

In [60]:
tokenizer.fit_on_texts(frame.headline.values)

In [61]:
dict_words=tokenizer.word_index

In [62]:
len(dict_words)

2285

In [64]:
n=tokenizer.num_words=100

In [65]:
print(n)

100
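The word index holds 2285 entries, yet `num_words=100` means only the most frequent words (indices below 100) survive when text is converted to sequences. The sketch below is a simplified pure-Python approximation of that behaviour, not the Keras implementation: the function names are hypothetical, and it splits on whitespace only, without any character filtering.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i + 1 for i, word in enumerate(ranked)}

def to_sequence(text, word_index, num_words):
    """Keep only words whose index is below num_words; drop the rest."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(texts)

# The full index keeps every word, but a small cap drops the rarer ones.
full = to_sequence("the cat sat", word_index, num_words=100)
limited = to_sequence("the cat sat", word_index, num_words=3)
print(len(word_index), full, limited)
```

With a cap of 3, only indices 1 and 2 are kept, which mirrors how the notebook's sequences contain only indices 1 to 99.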


In [69]:
def get_sequence_of_tokens(corpus):
    # vocabulary size seen by the model (the tokenizer's num_words cap)
    total_words = n
    
    # convert each headline into its n-gram prefix sequences
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(frame.headline.values)
inp_sequences[:10]

[[66, 46],
 [66, 46, 59],
 [66, 46, 59, 3],
 [41, 1],
 [41, 1, 4],
 [41, 1, 4, 2],
 [41, 1, 4, 2, 47],
 [5, 2],
 [5, 2, 9],
 [5, 2, 9, 2]]
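Each headline contributes one training sequence per prefix of length two or more, which is why the output above grows one token at a time. A minimal standalone sketch of that inner loop, using a hypothetical helper name `prefix_sequences`:

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2 of a token list."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

# Token ids taken from the second headline in the output above.
example = prefix_sequences([41, 1, 4, 2, 47])
print(example)
```

A headline that reduces to a single token yields no sequences at all, since there is no next word to predict.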

In [70]:
len(inp_sequences)

1501

In [71]:
total_words

100

In [72]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import utils as ku

In [73]:
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)

In [74]:
print(predictors)

[[ 0  0  0 ...  0  0 66]
 [ 0  0  0 ...  0 66 46]
 [ 0  0  0 ... 66 46 59]
 ...
 [ 0  0  0 ...  0  0 60]
 [ 0  0  0 ...  0  0  4]
 [ 0  0  0 ...  0  0 17]]


In [75]:
print(label)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
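The predictors are the left-padded sequences without their last token, and each label is that last token as a one-hot vector. The sketch below reproduces the idea with plain Python lists. It is an illustration only, not what `pad_sequences` or `to_categorical` do internally, and the helper names are hypothetical.

```python
def pre_pad(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + list(seq) for seq in sequences]

def split_and_one_hot(padded, num_classes):
    """Split each row into (all but last token, one-hot of last token)."""
    predictors = [row[:-1] for row in padded]
    labels = []
    for row in padded:
        one_hot = [0.0] * num_classes
        one_hot[row[-1]] = 1.0
        labels.append(one_hot)
    return predictors, labels

# The first three sequences from the notebook's inp_sequences output.
padded = pre_pad([[66, 46], [66, 46, 59], [66, 46, 59, 3]])
predictors_demo, labels_demo = split_and_one_hot(padded, num_classes=100)
print(predictors_demo)
```

Pre-padding keeps the most recent words at the right edge of every row, next to the position the model predicts.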


In [77]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding,Dropout,Dense,LSTM

In [78]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, n)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 12, 10)            1000      
_________________________________________________________________
lstm (LSTM)                  (None, 100)               44400     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 100)               10100     
Total params: 55,500
Trainable params: 55,500
Non-trainable params: 0
_________________________________________________________________
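The parameter counts in the summary can be checked by hand from the layer sizes used in `create_model` (vocabulary 100, embedding size 10, 100 LSTM units). The helper names below are hypothetical. The formulas are the standard ones for these layer types with biases enabled.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # A weight matrix plus one bias per output unit.
    return input_dim * units + units

counts = [embedding_params(100, 10), lstm_params(10, 100), dense_params(100, 100)]
print(counts, sum(counts))  # [1000, 44400, 10100] 55500
```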


In [79]:
model.fit(predictors, label, epochs=20, verbose=5)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7ff085260710>

In [80]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # Greedy choice of the most probable next token. np.argmax over predict()
        # replaces the deprecated model.predict_classes(), as the warning in the
        # output below recommends.
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        
        # Map the predicted index back to its word (index 0 is padding and has no word).
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " "+output_word
    return seed_text.title()

In [81]:
generate_text('Trump',2,model,max_sequence_len)

Instructions for updating:
Please use instead:
* `np.argmax(model.predict(x), axis=-1)`, if your model does multi-class classification (e.g. if it uses a `softmax` last-layer activation).
* `(model.predict(x) > 0.5).astype("int32")`, if your model does binary classification (e.g. if it uses a `sigmoid` last-layer activation).


'Trump Of The'

In [85]:
generate_text('India and China 3 episode',3,model,max_sequence_len)

'India And China 3 Episode A For A'

In [86]:
generate_text('President Trump',9,model,max_sequence_len)

'President Trump Of The Of The Of The Of The Of'
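The repeated "Of The" is typical of always picking the single most probable next word: once the model favours a short cycle of frequent words, greedy decoding never leaves it. One common alternative, which is not what this notebook does, is to sample the next word from the predicted distribution, optionally rescaled by a temperature. The sketch below assumes `probs` is one row of softmax output, and `sample_with_temperature` is a hypothetical helper.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from `probs` after rescaling by `temperature`.

    Temperatures below 1 sharpen the distribution (closer to argmax);
    temperatures above 1 flatten it (more varied output).
    """
    # Rescale in log space; a small epsilon avoids log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    # Softmax weights, subtracting the max for numerical stability.
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]
```

Inside `generate_text`, this could stand in for the most-probable-class step, for example `predicted = sample_with_temperature(model.predict(token_list, verbose=0)[0], temperature=0.8)`. Sampling can also return index 0 (padding), which maps to no word, so a real implementation may want to exclude it.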