<a href="https://colab.research.google.com/github/AshwinDeshpande96/Speech-Generation/blob/master/president_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Library and Data

In [3]:
import numpy
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Dropout, LSTM, Input
from keras.utils import np_utils
from google.colab import drive
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
from keras.optimizers import Adam
import sys
import inflect
import re
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer
import spacy
import codecs
import pandas as pd

Using TensorFlow backend.


WordNet contains the lemmatizer required to derive root words from its inflections.

In [4]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

I have my files stored on my Google Drive. The Presidential Speech Corpus is imported from this link.

You can choose to upload data to your drive or directly to Google Colaboratory (This implementation is done on Google Colab since it takes care of dependencies).
Use following code to directly upload file.


```
from google.colab import files
uploaded = files.upload()
```



In [5]:
#drive.mount('/content/gdrive', force_remount=True)
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Replace the file_path with the location you have saved your file.

In [0]:
president = 'lbjohnson'
file_path = '/content/gdrive/My Drive/Projects/NLP/President Speech/' +president+'_all.txt'

You can choose to open file in Python Standard Library style: `raw_text = open(file_path).read()`. 

But during pruning Spacy Library Requires Unicode Standard text type and Python Standard open() command returns a str datatype. 

For this purpose codecs libraries is used.

In [7]:
raw_text = codecs.open(file_path, encoding='utf-8').read()
print type(raw_text)

<type 'unicode'>


# Data Pre-Processing and Pruning
This part takes care of tokenizing words and creating word vector and vocabulary vector.

tokenize function as the name specifies will return a list of tokenized words from raw_text: 

For example: 
raw_text =  "Today's weather condition is cloudy with a 76% of rain. Temperature may remain 
    cool at 21°C with Humidity 61%."
    
 raw_words = ['today', ' ', 'weath', ' ', 'condit', ' ', 'is', ' ', 'cloudy', ' ', 'with', ' ', 'a', ' ', 'seventy', ' ', 'six', ' ', 'of', ' ', 'rain', ' ', '.',  ' ', 'temp',  ' ', 'may',  ' ', 'remain',  ' ', 'cool',  ' ', 'at',    ' ', ',  ' ', 'twenty',  ' ',  'on', ' ',  'c',  ' ', 'with',  ' ', 'humid',  ' ', 'sixt', ' ', 'on', ' ', '.']
 
 Do not run this module twice as the function changes the input variable and second execution will consider the altered variable. Make sure to run codes.open() function before tokenize function.
    

In [0]:
porter = PorterStemmer()
lancaster=LancasterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

In [0]:
def tokenize(raw_text):
    #raw_text = raw_text.lower()
    num_set = re.findall(r'\d+', raw_text)
    p = inflect.engine()
    #print num_set
    for num in num_set:
        word = str(p.number_to_words(num))
        raw_text = raw_text.replace(num, word)
    punctuation_chars = ["!", '"', '&', ',', '?', '/', ':', ';', '<', '>', '$', '#', '@', '*', '(', ')', '[', ']', '{', '}', '\n', '-', '`'] 
    for symbol in punctuation_chars:
        raw_text = raw_text.replace(symbol, ' ')
    raw_text = raw_text.replace('.', ' . ')
    raw_text = raw_text.replace( "'", "" )
    raw_text = raw_text.replace( "%", "percent" )
    
    
    spacy_nlp = spacy.load('en_core_web_sm',  disable=['ner', 'textcat']) 
    doc = spacy_nlp(raw_text[:len(raw_text)/2])
    raw_words = [token.text for token in doc]
    doc = spacy_nlp(raw_text[len(raw_text)/2:])
    raw_words = raw_words + [token.text for token in doc]
    return raw_text, raw_words
raw_text, raw_words = tokenize(raw_text)

Further Lemmatization and removing extra spaces in done in `text_preprocess()`.

In [10]:
word_copy = raw_words
def text_preprocess(raw_words):
    stripped = []
    raw_dict = {}
    for val in raw_words:
        if " " not in val:
            root = str(wordnet_lemmatizer.lemmatize(lancaster.stem(val)))
            stripped.append(root)
            raw_dict[root] = str(val)
    raw_words = numpy.array(stripped)
    print "Text Word Count: ", raw_words.shape[0]
    vocab = numpy.unique(sorted(raw_words))
    print "Vocab Length: ", vocab.shape[0]
    return vocab.shape[0],raw_words.shape[0], raw_dict, raw_words, vocab
    
n_vocab, n_words, raw_dict, raw_words, vocab = text_preprocess(raw_words)

Text Word Count:  267474
Vocab Length:  5161


Raw Text Processing is a costly process and will not be executing this cell for every experiment. Hence we save a copy of these raw words for further repeated usage.

In [0]:
df = pd.DataFrame(raw_words, columns=["raw_words"])
df.to_csv('/content/gdrive/My Drive/Projects/NLP/President Speech/words.csv', )

A sample of vocabulary is printed as follows. Uncomment line 2 to see processed words which are in same sequence as the raw text

In [12]:
print "Vocab: ",vocab[:50], "\n...\n", vocab[-10:]
#print "Raw Words: ",raw_words[:50], "\n...\n", raw_words[-10:]


Vocab:  ['.' 'a' 'ab' 'abandon' 'abbrevy' 'abc' 'abel' 'abet' 'abh' 'abhor' 'abid'
 'abl' 'ablest' 'aboard' 'abol' 'abolit' 'about' 'abov' 'abraham' 'abram'
 'abroad' 'abrupt' 'absolv' 'absorb' 'abstract' 'abund' 'abus' 'academy'
 'acc' 'acceiv' 'accel' 'accentu' 'access' 'accid' 'accommod' 'accompany'
 'accompl' 'accord' 'account' 'accredit' 'accru' 'accum' 'accus'
 'accustom' 'ach' 'achiev' 'ackley' 'acknowledg' 'acquaint' 'acquiesc'] 
...
['yourselv' 'yugoslav' 'zagor' 'zeal' 'zealand' 'zero' 'zerotwo' 'zip'
 'zon' 'zoot']


Following code cells are used to convert text data to numeric vectors such that our LSTM Model can use to find patterns.

In [0]:
#chars = sorted(list(set(raw_text)))
vocab_to_int = dict((c, i) for i, c in enumerate(vocab))

In [14]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_words - seq_length, 1):
	seq_in = raw_words[i:i + seq_length]
	seq_out = raw_words[i + seq_length]
	dataX.append([vocab_to_int[word] for word in seq_in])
	dataY.append(vocab_to_int[seq_out])
n_patterns = len(dataX)
print "Total Patterns: ", n_patterns

Total Patterns:  267374


In [15]:
print numpy.array(dataX).shape

(267374, 100)


In [18]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
print X.shape



(267374, 100, 1)


In [16]:
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
print y.shape

(267374, 5161)


# Training

There are two different models you can choose. First one is a deeper and takes longer time to train. You can use either as this model has no notion of overfitting. It is designed to predict values very similar to train data. Hence,  there is no requirement of generalization.

In [19]:
'''
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dense(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1.0e-3))
'''

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()


Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 256)               264192    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 5161)              1326377   
Total params: 1,590,569
Trainable params: 1,590,569
Non-trainable params: 0
_________________________________________________________________


Start Training

In [0]:
# define the checkpoint
filepath="/content/gdrive/My Drive/Projects/NLP/President Speech/weights/"+president+"_weights.hdf5"

callbacks = [
    EarlyStopping(monitor='loss', patience=10, verbose=0, mode='min'),
    ReduceLROnPlateau(monitor='loss', factor=0.1, patience=5, verbose=1, mode='min'),
    ModelCheckpoint(filepath, save_best_only=True,  save_weights_only=False, monitor='loss', mode='min')
]
model.fit(X, y, epochs=1, batch_size=128, callbacks=callbacks)
model.save('/content/gdrive/My Drive/Projects/NLP/President Speech/weights/'+president+'_model.h5')
for i in range(25):
    model = load_model('/content/gdrive/My Drive/Projects/NLP/President Speech/weights/'+president+'_model.h5')
    model.fit(X, y, epochs=10, batch_size=128, callbacks=callbacks)
    model.save('/content/gdrive/My Drive/Projects/NLP/President Speech/weights/'+president+'_model.h5')
#model.load_weights('/content/gdrive/My Drive/Projects/NLP/weights-improvement-20-1.9923.hdf5')

# Testing  

In [0]:
model = load_model('/content/gdrive/My Drive/Projects/NLP/President Speech/'+president+'_model.hdf5')
int_to_vocab = dict((i, c) for i, c in enumerate(vocab))
#print int_to_vocab

In [0]:
start = numpy.random.randint(0, len(dataX)-1)
#start = len(dataX)-150
pattern = dataX[start]
print "Seed:"
#print "\"", ''.join([), "\""
nextline = 25
for x in [raw_dict[int_to_vocab[value]] for value in pattern]:
    nextline = nextline - 1
    if nextline == 0:
        print ""
        nextline = 25
    print x,
result = []
# generate characters
for i in range(100):
	x = numpy.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose=0)
	index = numpy.argmax(prediction)
	result.append(int_to_vocab[index])
	seq_in = [int_to_vocab[value] for value in pattern]
	pattern.append(index)
	pattern = pattern[1:len(pattern)]
print "---",
nextline = 25
for w in result:
    nextline = nextline - 1
    if nextline == 0:
        print ""
        nextline = 25
    print raw_dict[w],
print "\nDone."