## Step 0 :- Importing necessary libraries for analysis 

In [1]:
import pandas as pd
import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense, Embedding, CuDNNLSTM, Dropout, Bidirectional
from keras.callbacks import *
from keras.utils import to_categorical
from random import randint

Using TensorFlow backend.


## Step 1 :- Loading input dataset 
<div class="alert alert-block alert-success">
<b>Data Source :</b> The data is downloaded from - https://github.com/irenetrampoline/taylor-swift-lyrics  <br><br>
 We load the dataset which is in .txt format and display the first 100 characters of text. (The text will contain special characters like new lines, tabs, brackets etc. We will be doing the preprocessing in the next step. 
</div>

In [2]:
path_to_file = r'D:\kaggle_trials\taylorswiftsongs'+'\\all_tswift_lyrics.txt'
file = open(path_to_file,'r')
text = file.read()
file.close()
print('The first 100 characters is given by:- ')
print(text[:100])

The first 100 characters is given by:- 
He said the way my blue eyes shined
Put those Georgia stars to shame that night
I said, "That's a li


## Step 2 :- Preprocessing the data
<div class="alert alert-block alert-success">
<b>Preprocessing involved:</b> The only preprocessing we are going to do is to convert the text to lowercase. We will be using the first 100,000 characters of the dataset in order to save time and get results quickly.
</div>

In [3]:
text_to_use = text.lower()
text_to_use = text[:100000]
print('The first 100 lines of pre-processed data will be :- \n ', text_to_use[:100])

The first 100 lines of pre-processed data will be :- 
  He said the way my blue eyes shined
Put those Georgia stars to shame that night
I said, "That's a li


## Step 3 :- Feature Extraction steps 
<div class="alert alert-block alert-success">
<b>Step 1: </b> Extracting all characters of the text to be used for analysis. We also determine the unique set of characters to be used later for indexing.
</div>

In [4]:
n_chars      = len(text_to_use)
unique_vocab = list(set(text_to_use))
print('The number of characters used in the songs is given by : ', n_chars)
print('The unique character size is ',len(unique_vocab))

The number of characters used in the songs is given by :  100000
The unique character size is  74


In [5]:
print('Let us look at the unique characters in the songs:- \n',unique_vocab)

Let us look at the unique characters in the songs:- 
 ['z', '(', ' ', '}', 'B', 'G', 'd', 'E', 'Y', '"', 'm', 'k', 'p', 'T', "'", ':', 'R', 'b', 'n', 'w', '2', '5', ']', ')', 's', 'O', 'f', 'l', 'I', 'v', 'u', 'F', '[', ';', '4', 'S', '.', 'U', 'J', 'q', 'j', 'e', 'H', 'W', 'a', '\n', ',', 'A', 'y', 'P', '9', '?', 'C', 'L', 'M', '8', '3', 'x', 'Q', 'K', 'D', 'V', 'N', '-', 'r', '1', 'i', 'o', '{', 'h', 't', 'g', 'c', '!']


In [6]:
int_to_char = {n:char for n, char in enumerate(unique_vocab)}
char_to_int = {char:n for n, char in enumerate(unique_vocab)}

## Step 4 :- Data preparation 
<div class="alert alert-block alert-success">
<b>Step 1 (Preparing X tensor): </b> We will create the X and y tensors to be used in the model. The X tensor will be of shape (sample size,sequence_length,features). In our case, the sample size is 100000, the sequence length is 100 and features is 1 (i.e, we are using 100 previous characters to predict the 101th character. <br>
    <b> Step 1.1 Normalize the vectors:- </b>We will further normalize each elements of the tensor to get a value between zero and one(much like min-max scalar) by dividing each integer represntation of token by number of distinct characters. <br>
    <b> Step 2 (Preparing Y tensor): </b> Here the y tensor Will be the actual 101th value(character) to be treated as a target. <br>
    <b> Step 3 (Reshaping the X tensor): </b> The shape of X tensor will be of form (sample size,sequence_length,features)<br>
    <b> Step 4 (One hot encoding Y tensor): </b> The y tensor will be one hot encoded to be passed as an input to the LSTm model we will be using.
</div>

In [23]:
X           = []
y           = []
seq_length  = 100

for i in range(0, n_chars - seq_length, 1):
    seq_in  = text_to_use[i:i + seq_length]
    seq_out = text_to_use[i + seq_length]
    X.append([char_to_int[char] for char in seq_in])
    y.append(char_to_int[seq_out])
X_new       = np.reshape(X, (len(X), seq_length, 1)) 
X_new       = X_new/(float(len(unique_vocab)))
y_new       = to_categorical(y) 
print("X_new shape:", X_new.shape)
print("y_new shape:", y_new.shape)

X_new shape: (99900, 100, 1)
y_new shape: (99900, 74)


## Step 5 :- Creating LSTM model for text generation
<div class="alert alert-block alert-success">
<b>Bidirectional LSTM + Dense layers:</b> We fit a bidirectional LSTM (because we are working with data having no trend values). We will use the last layer's hidden units to a fully connected network to predict the next word. 
</div>

In [24]:
model = Sequential()
model.add(Bidirectional(CuDNNLSTM(200, return_sequences=True), input_shape=(X_new.shape[1], X_new.shape[2])))
model.add(Dropout(0.2))
model.add(Bidirectional(CuDNNLSTM(100)))
model.add(Dropout(0.2))
model.add(Dense(y_new.shape[1], activation='softmax'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy')

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_3 (Bidirection (None, 100, 400)          324800    
_________________________________________________________________
dropout_3 (Dropout)          (None, 100, 400)          0         
_________________________________________________________________
bidirectional_4 (Bidirection (None, 200)               401600    
_________________________________________________________________
dropout_4 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 74)                14874     
Total params: 741,274
Trainable params: 741,274
Non-trainable params: 0
_________________________________________________________________


## Step 6 : Callback Creation
We create callbacks which we will be using while execution of the model

In [25]:
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.2,verbose=1,
                              patience=5, min_lr=0.0001)
es        = EarlyStopping(monitor='loss', patience=5, verbose=1, mode='auto', baseline=None, 
                          restore_best_weights=True)
filepath   = os.getcwd()+'\\chkpts\\'+"weights-improvement-{epoch:02d}-{loss:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='auto')

In [26]:
epochs = 60
batch_sz = 64
model.fit(X_new, y_new, 
          epochs = epochs, 
          batch_size = batch_sz,
         callbacks        = [reduce_lr,es])

Epoch 1/60
Epoch 2/60
Epoch 3/60
Epoch 4/60
Epoch 5/60
Epoch 6/60
Epoch 7/60
Epoch 8/60
Epoch 9/60
Epoch 10/60
Epoch 11/60
Epoch 12/60
Epoch 13/60
Epoch 14/60
Epoch 15/60
Epoch 16/60
Epoch 17/60
Epoch 18/60
Epoch 19/60
Epoch 20/60
Epoch 21/60
Epoch 22/60
Epoch 23/60
Epoch 24/60
Epoch 25/60
Epoch 26/60
Epoch 27/60
Epoch 28/60
Epoch 29/60
Epoch 30/60
Epoch 31/60
Epoch 32/60
Epoch 33/60
Epoch 34/60
Epoch 35/60
Epoch 36/60
Epoch 37/60
Epoch 38/60
Epoch 39/60
Epoch 40/60
Epoch 41/60
Epoch 42/60
Epoch 43/60
Epoch 44/60
Epoch 45/60
Epoch 46/60
Epoch 47/60
Epoch 48/60
Epoch 49/60
Epoch 50/60
Epoch 51/60
Epoch 52/60
Epoch 53/60
Epoch 54/60
Epoch 55/60
Epoch 56/60

Epoch 00056: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
Restoring model weights from the end of the best epoch
Epoch 00056: early stopping


<keras.callbacks.History at 0x29833fae400>

## Step 8 :- Generating text from the model developed
We generate 60 further characters from a given line of text.

In [66]:
token_string = X[np.random.randint(0, len(X)-1)]
complete_string = [int_to_char[value] for value in token_string]
print (''.join(complete_string))

o low
You can't feel nothing at all
And you flashback to
When he said forever and always
And it rain


In [67]:
generate_string = []
for i in range(60):
    x = np.reshape(token_string, (1, len(token_string), 1))
    x = x / float(len(unique_vocab))
    
    prediction = model.predict(x, verbose=0)

    id_pred = np.argmax(prediction)
    seq_in = [int_to_char[value] for value in token_string]
    
    generate_string.append(int_to_char[id_pred])
    
    token_string.append(id_pred)
    token_string = token_string[1:len(token_string)] 

In [68]:
print('String of characters provided :- \n',''.join(complete_string))
print('\n')
text = ""
for char in complete_string+generate_string:
    text = text + char
print('The text completed is:- \n ',text)

String of characters provided :- 
 o low
You can't feel nothing at all
And you flashback to
When he said forever and always
And it rain


The text completed is:- 
  o low
You can't feel nothing at all
And you flashback to
When he said forever and always
And it rains in your bedroom
Everything is wrong
It rains when you're g
