<a href="https://colab.research.google.com/github/MrFlygerian/NLP-Document-Summary/blob/master/Document_Summariser_(control).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation with LSTM Deep Neural Networks

#### First things first

To prepare the notebook, Google Drive must be mounted and the working directory changed to the folder containing the relevant files (weights, modules, data, etc.).


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!ls "/content/drive/My Drive/Colab Notebooks"
%cd "/content/drive/My Drive/Colab Notebooks"

## The Project Aims


*   To create a contextual summary of a given document automatically
*   To compare abilities of deep learning on control and real world data
*   To understand deep learning's abilities and limitations






## The Project Ingredients


*   A set of control data (the well-known and widely used NLTK corpus text of Alice in Wonderland was chosen)

*   A set of 'real world' data (some essays on a given topic were used for this experiment, not shown in this notebook)
*   The Spyder IDE, numpy, system modules (later transferred to Google Colab)
*   Keras and related modules
*   A decent laptop (16GB RAM, Ryzen 7 CPU, Radeon Vega GPU)

The relevant libraries and control data are imported below.

In [None]:
%%time
#data manipulation
import numpy
import sys

#keras modules
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils



In [None]:
def txtfile2txt(textfile):
    """Read a UTF-8 text file and return its contents in lower case."""
    # The context manager closes the file even if reading fails
    with open(textfile, 'r', encoding='utf-8') as f:
        raw_text = f.read()

    return raw_text.lower()
    

In [None]:
text_file = "wonderland.txt"
raw_text = txtfile2txt(text_file)

## The Project Method


1. Load files and extract text
2. Map each unique character in the text to a number and store the mapping in a dictionary
3. Create a 'moving window' of arbitrary length to generate sequences of characters (mapped to numbers) as inputs
4. Use the character immediately following the sequence in the text as the output
5. Store inputs and outputs and reformat them separately for use with Keras
6. Define the model (add your bells and whistles), create checkpoints and fit the model to the reformatted data
7. Test the model by feeding it a random sequence from the input data
8. The model predicts the characters which should follow, and these characters are joined together to form a sentence
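
Steps 2 to 4 can be illustrated on a very short string. This is only a sketch: `toy_text`, `toy_seq_length` and the other `toy_` names are made up for the example, and the window is 4 characters long rather than the 100 used later in the notebook.

```python
# Toy illustration of steps 2-4 on a short string (not the real corpus).
toy_text = "hello world"
toy_seq_length = 4  # the notebook itself uses seq_length = 100

# Step 2: map each unique character to an integer.
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}

# Steps 3-4: slide a window over the text; the character right after
# each window is the target for that window.
toy_inputs, toy_targets = [], []
for i in range(len(toy_text) - toy_seq_length):
    window = toy_text[i:i + toy_seq_length]
    next_char = toy_text[i + toy_seq_length]
    toy_inputs.append([toy_char_to_int[c] for c in window])
    toy_targets.append(toy_char_to_int[next_char])

print(toy_chars)
print(toy_inputs[0], "->", toy_targets[0])  # window "hell" -> target "o"
print(len(toy_inputs), "windows")
```

An 11-character string with a window of 4 gives 7 input/output pairs, one for each position where a full window still has a following character.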



In [None]:
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

n_chars = len(raw_text)
n_vocab = len(chars)

In [None]:
%%time
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)


In [None]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))

# normalize
X = X / float(n_vocab)

# one hot encode the output variable
y = np_utils.to_categorical(dataY)
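
To make the shapes concrete, here is a small sketch of the same reshaping, normalisation and one-hot encoding on toy data. The `toy_` values are invented for illustration, and the one-hot step is written with `numpy.eye` instead of `np_utils.to_categorical` so that the example runs without Keras.

```python
import numpy

# Three toy windows of length 4 over a hypothetical 5-character vocabulary.
toy_X = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 0]]
toy_y = [4, 0, 1]
toy_vocab = 5

# [samples, time steps, features]: one integer feature per time step,
# scaled into the range [0, 1) by dividing by the vocabulary size.
X_toy = numpy.reshape(toy_X, (len(toy_X), 4, 1)) / float(toy_vocab)

# One-hot encoding written with plain NumPy: row k of the identity
# matrix is the one-hot vector for class k.
y_toy = numpy.eye(toy_vocab)[toy_y]

print(X_toy.shape)  # (3, 4, 1)
print(y_toy.shape)  # (3, 5)
print(y_toy[0])     # [0. 0. 0. 0. 1.]
```

One difference to keep in mind: `numpy.eye(toy_vocab)` always produces `toy_vocab` columns, whereas `to_categorical` without an explicit class count infers the width from the largest label it sees.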


In [None]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.load_weights('starting-weight.hdf5')
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
#define checkpoint
filepath="control-weights-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]


In [None]:
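%%time
# Longer training run with checkpointing (commented out here):
# model.fit(X, y, epochs=50, batch_size=128, callbacks=callbacks_list)

# Short demonstration run; no checkpoints are saved because callbacks_list is not passed
model.fit(X, y, epochs=4, batch_size=128)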


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
CPU times: user 18min 12s, sys: 2min 37s, total: 20min 50s
Wall time: 12min 49s


<keras.callbacks.callbacks.History at 0x7fc6005cc908>

## The Project Experiment


1.   Ran model(s) over multiple epochs (between 15 and 50)
2.   Used multiple weights in prediction to test effect of loss reduction
3.   Compared results on control and real world data
4.   Compared runtimes on local machine and later using Google Colab
5.   Attempted to get a handle on version control
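
Because the checkpoint pattern `control-weights-{epoch:02d}-{loss:.4f}.hdf5` writes the epoch and loss into each file name, saved weights can be compared by loss without loading them. The following is a minimal sketch under that assumption; `best_checkpoint` is a hypothetical helper and the file names in `example_files` are invented for illustration.

```python
import re

# Matches names produced by the checkpoint pattern used in this notebook:
# control-weights-{epoch:02d}-{loss:.4f}.hdf5
CHECKPOINT_RE = re.compile(r"control-weights-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(filenames):
    """Return (filename, epoch, loss) for the lowest-loss checkpoint, or None."""
    parsed = []
    for name in filenames:
        match = CHECKPOINT_RE.search(name)
        if match:
            parsed.append((float(match.group(2)), int(match.group(1)), name))
    if not parsed:
        return None
    loss, epoch, name = min(parsed)
    return name, epoch, loss

# Hypothetical file names, for illustration only.
example_files = [
    "control-weights-01-2.9500.hdf5",
    "control-weights-10-1.8000.hdf5",
    "control-weights-25-1.2500.hdf5",
    "notes.txt",
]
print(best_checkpoint(example_files))
```

In the notebook, the list could come from `os.listdir('.')`, and the returned file name could then be passed to `model.load_weights` before generating text.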



In [None]:
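# pick a random seed (the upper bound of randint is exclusive)
start = numpy.random.randint(0, len(dataX))
# copy the sequence so that generation does not modify dataX in place
pattern = list(dataX[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")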


Seed:
" a great
hurry, muttering to himself as he came, 'oh! the duchess, the duchess!
oh! won't she be sava "


In [None]:
# generate characters
for i in range(500):
	# reshape and normalise the current window exactly as the training data was
	x = numpy.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose=0)
	# greedy decoding: always take the most probable next character
	index = int(numpy.argmax(prediction))
	result = int_to_char[index]
	sys.stdout.write(result)
	# slide the window forward by one character (builds a new list)
	pattern = pattern[1:] + [index]
print("\nDone.")

ng ' the saod this as the could sot she way at the could, 
'the care rinng then the marter of the bareerin,' said the kanter. 'io would you well to mo ' 
the march hare war toen a lomtte or two of the care was toink agaone the had geve a dang wand the tam to sar to tee thet sae tas aoing to thet the was anwing toted her hand and the was aolnnld the had geve and the tam ao all  and her seat soene in the was aol a ling of ges ann aoutoen ana and coe aoutoeng, and she was aowinuid to tay toee a ler
Done.
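
The generation loop always picks the single most probable character (`numpy.argmax`). A commonly used alternative, which was not part of this experiment, is to sample from the predicted distribution after rescaling it with a "temperature". The sketch below shows the idea on a toy probability vector; `sample_with_temperature` is a hypothetical helper, and whether it would reduce the repetition seen here is untested.

```python
import numpy

def sample_with_temperature(probabilities, temperature=1.0, rng=None):
    """Draw one index from a probability vector after temperature scaling.

    Hypothetical helper, not part of the original notebook.
    """
    rng = rng if rng is not None else numpy.random.default_rng()
    probs = numpy.asarray(probabilities, dtype=numpy.float64)
    # Rescale in log space; the small constant avoids log(0).
    logits = numpy.log(probs + 1e-12) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    scaled = numpy.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

toy_prediction = [0.1, 0.7, 0.2]
rng = numpy.random.default_rng(0)
# A very low temperature behaves almost exactly like argmax.
print(sample_with_temperature(toy_prediction, temperature=0.01, rng=rng))
```

Temperatures close to 0 approach the greedy behaviour of the loop above, while a temperature of 1.0 samples from the model's distribution unchanged. In the notebook, `prediction` has shape `(1, n_vocab)`, so `prediction[0]` would be the vector to pass in.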


## The Project Results


*   So far, not that good!

*   Control data generates separate words but not clear English; the output makes little sense and becomes repetitive once more than 500 characters are generated
*   Seed text from real world data produces individual words for the first ~100 characters and then starts to repeat characters
*   Repetition is seen in both samples for early epochs and for losses of $1<x<2$ (weights with losses above 2 are next to useless)
*   At ‘very low’ losses ($1<x<1.1$) some sentences and structure appear for both texts

*   It’s not easy to explain the predictions/results (DL networks are a black box by construction)
*   Runtime in Google Colab is significantly shorter than on the local machine
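
One way to read the loss values above: the model is trained with categorical cross-entropy, which Keras reports as a mean in natural-log units. Exponentiating it gives the perplexity, roughly "how many characters the model is still choosing between" at each step. The short sketch below works this out for the loss thresholds mentioned; `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_loss)

for loss in (2.0, 1.1, 1.0):
    # exp(-loss) is the geometric-mean probability given to the correct character.
    print(f"loss={loss:.1f}  perplexity={perplexity(loss):.2f}  "
          f"mean p(correct)={math.exp(-loss):.3f}")
```

A loss of 2 corresponds to a perplexity of about 7.39, while a loss of 1.1 corresponds to about 3.00, which is consistent with recognisable structure only appearing at the low end of the range.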



## The Project Conclusion
* The heavy reliance on matrix calculus makes DL networks very powerful and conceptually easy to understand, but also very difficult to break down for insight into specific predictions and problem domains
* Very computationally intensive
* Run time is on the order of hours/days
* Keras allows for an iterative process (weights, which are core components of the model, can be saved and reloaded for further training)
* Using Google GPUs via Colab speeds up runtime dramatically, so it’s probably best to use those services for future DL endeavours


## Future Questions
* Can I use weights trained on one text for another text?
* Can I improve the run time and quality simultaneously?
* How can I evaluate how well my programme did (both quantitatively and pictorially)?
* Can I generalise weights for text (and potentially other problems)?
* Making decent DL models takes a lot of time and resources. Are there pretrained, adaptable models available (ideally either free or very cheap) that can do the jobs I’m trying to do?
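
As one possible starting point for the quantitative evaluation question, the repetition noted in the results could be measured by the fraction of word n-grams in the generated text that are unique. This is only a sketch of one candidate metric, not something used in the experiment; `distinct_ngram_ratio` is a hypothetical helper and both example strings are made up.

```python
def distinct_ngram_ratio(text, n=3):
    """Fraction of word n-grams in `text` that are unique (1.0 = no repeats)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0
    return len(set(ngrams)) / len(ngrams)

# Invented examples: a looping string and a non-repeating one.
looping = "the was the was the was the was"
varied = "alice was beginning to get very tired"
print(distinct_ngram_ratio(looping, n=2))
print(distinct_ngram_ratio(varied, n=2))
```

A ratio near 1.0 means the text rarely repeats itself, and a ratio near 0 means it is stuck in a loop. Computing the ratio for the output of each saved set of weights would give a number to track alongside the training loss.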


