# Text generation

The goal of this project is to demonstrate text generation using LSTM neural networks.
Our dataset contains numerous movie plots taken from Wikipedia, so we will generate similar text.

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
from keras.preprocessing.text import Tokenizer
from sklearn.feature_extraction.text import CountVectorizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences


Using TensorFlow backend.


In [2]:
data = pd.read_csv("movie_plots.csv")
movie_plots = data['Plot']
print("Number of plots: ", movie_plots.shape[0])
movie_plots

Number of plots:  34886


0        A bartender is working at a saloon, serving dr...
1        The moon, painted with a smiling face hangs ov...
2        The film, just over a minute long, is composed...
3        Lasting just 61 seconds and consisting of two ...
4        The earliest known adaptation of the classic f...
                               ...                        
34881    The film begins in 1919, just after World War ...
34882    Two musicians, Salih and Gürkan, described the...
34883    Zafer, a sailor living with his mother Döndü i...
34884    The film centres around a young woman named Am...
34885    The writer Orhan Şahin returns to İstanbul aft...
Name: Plot, Length: 34886, dtype: object

## Tokenize words

In Natural Language Processing projects, the first step is often the removal of punctuation and stop words such as "the", "a", and "an". We will keep stop words since we want to generate human-like text (the Keras `Tokenizer` used below still lowercases the text and strips most punctuation by default).
Tokenization maps each unique word to a unique integer. This step is necessary to prepare the data for the embedding layer.
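
To make the word-to-integer mapping concrete, here is a minimal pure-Python sketch of frequency-ranked tokenization. It is an illustration only, not the Keras implementation: the helper names `build_word_index` and `encode` are hypothetical, the toy sentences are made up, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

def build_word_index(texts):
    """Map each unique word to an integer ID, most frequent word first (IDs start at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending frequency
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index):
    """Replace every known word with its integer ID; unknown words are dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

# Made-up example sentences (assumption, not taken from the dataset)
toy_plots = ["the sailor returns home", "the writer returns to the city"]
word_index = build_word_index(toy_plots)

# "the" is the most frequent word, so it receives ID 1
encoded = encode("the writer returns home", word_index)
print(word_index)
print(encoded)
```

Ranking by frequency means that common words get small IDs, which is why a cap such as `num_words` keeps the most frequent part of the vocabulary.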

In [3]:
max_words = 50000
tokenizer = Tokenizer(num_words = max_words)
tokenizer.fit_on_texts(movie_plots.values)

sequences = tokenizer.texts_to_sequences(movie_plots.values)

In [4]:
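# flatten all plots into a single list of tokens so we can apply sliding windows

text = [item for sublist in sequences for item in sublist]
# note: word_index holds every word seen during fitting, regardless of num_words;
# texts_to_sequences only emits token IDs below max_words
vocab_size = len(tokenizer.word_index)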

In [5]:
print("Vocabulary size: ", vocab_size)

# reverse dictionary so we can decode tokenized sequences back to words
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))

Vocabulary size:  169193
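
The reverse dictionary can be used to turn a sequence of token IDs back into text. A minimal sketch, assuming a toy `toy_word_index` with made-up IDs in place of the real `tokenizer.word_index`; the helper `decode` is hypothetical and not part of Keras.

```python
# Toy stand-in for tokenizer.word_index (word -> integer ID); the values are made up
toy_word_index = {"the": 1, "film": 2, "begins": 3, "in": 4}

# Same inversion as above: swap each (word, id) pair to (id, word)
toy_reverse_word_map = dict(map(reversed, toy_word_index.items()))

def decode(token_ids, reverse_map):
    """Join the words for each ID; IDs missing from the map are skipped."""
    return " ".join(reverse_map[i] for i in token_ids if i in reverse_map)

decoded = decode([1, 2, 3], toy_reverse_word_map)
print(decoded)
```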


In [6]:
# sliding window to build input sequences (X) and next-token targets (y)

def sliding_window(text, seq_len):
    X, y = [], []

    # range() stops early enough that text[i + seq_len] is always a valid index
    for i in range(len(text) - seq_len):
        end_ix = i + seq_len

        X.append(text[i:end_ix])   # seq_len consecutive tokens
        y.append(text[end_ix])     # the token that follows them

    return np.array(X), np.array(y)
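
To see what the function produces, the following sketch applies the same windowing logic to a short list of made-up token IDs. It uses plain Python lists instead of NumPy arrays so that it runs on its own; `toy_sliding_window` is a hypothetical stand-in for `sliding_window`.

```python
def toy_sliding_window(tokens, seq_len):
    """Return (inputs, targets): each input is seq_len tokens, each target is the next token."""
    X, y = [], []
    for i in range(len(tokens) - seq_len):
        X.append(tokens[i:i + seq_len])
        y.append(tokens[i + seq_len])
    return X, y

toy_tokens = [10, 20, 30, 40, 50]  # made-up token IDs
X_toy, y_toy = toy_sliding_window(toy_tokens, seq_len=3)

# Each line shows one training pair: input window -> token to predict
for window, target in zip(X_toy, y_toy):
    print(window, "->", target)
```

A list of `n` tokens yields `n - seq_len` training pairs, and consecutive windows overlap in all but one token.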

### Splitting the data into train and test

In [7]:
seq_len = 20

X_train, y_train = sliding_window(text, seq_len)

In [12]:
X_train[0], y_train[0]

(array([    4,  5634,     6,   299,    23,     4,  3156,  2519,  2189,
            2,  3451,    30,     9,  7798,     4, 37435,  3381,  1695,
         8667,    12]),
 3927)