# Language Modelling and Text Generation using LSTMs — Deep Learning

Now that we have completed the sentiment analysis (in the previous notebook: 1_Sentiment_Analysis_Final), we will use Text Generation to produce plausible future Tweets about a given topic. With both tools combined, we can better gauge the sentiment (opinion) expressed about a certain person or topic.

There are two ways to go about Text Generation for Tweets: either use the standard NLP approach or the Deep Learning method.

With recent developments in deep learning and artificial intelligence, many demanding Natural Language Processing tasks have become easier to implement and execute. Text Generation is one such task, and it can be built with deep learning models, particularly Recurrent Neural Networks (RNNs).

RNNs are used not only to make predictions but also to generate new plausible sequences in a problem domain by learning from prior sequences. From this, we can also learn more about the problem itself.

In this notebook we will cover the Deep Learning method, using RNNs.

# Text Generation for Tweets using LSTMs

As mentioned in class, Text Generation is a type of Language Modelling problem. Other tasks that rely on Language Modelling include speech-to-text, conversational systems and text summarization. A trained language model learns the likelihood of a word occurring given the sequence of words that precedes it.
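
To make "likelihood of a word given the previous words" concrete, here is a minimal sketch that estimates next-word probabilities from bigram counts. It is a deliberately simplified stand-in for the RNN used later: it conditions only on the single previous word, and the toy corpus and the function name `next_word_probabilities` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only).
toy_corpus = [
    "the debate was long",
    "the debate was loud",
    "the crowd was loud",
]

# Count how often each word follows a given previous word.
bigram_counts = defaultdict(Counter)
for sentence in toy_corpus:
    words = sentence.split()
    for prev_word, next_word in zip(words, words[1:]):
        bigram_counts[prev_word][next_word] += 1

def next_word_probabilities(prev_word):
    """Return P(next word | previous word) estimated from the counts."""
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probabilities("was")
print(probs)
```

In this toy corpus, "loud" follows "was" twice and "long" once, so "loud" receives the higher probability. An RNN plays the same role but conditions on the whole preceding sequence rather than on one word.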

In this notebook, we will explain how to create a language model for generating natural language text by implementing and training a Recurrent Neural Network. The goal is ultimately to compare the results of generating future Tweets with an RNN versus the simpler NLP (nltk) approach. We want to generate new plausible sequences of Tweets, based on prior Tweets about the 2016 GOP Debate. By generating these Tweets, we hope to understand some of the key opinions following this debate.

We proceeded with this analysis by doing data cleaning and data preparation steps to really understand the power of the RNN in this context. However, we soon realized that, even with RNNs, the data needs to be well cleaned to generate good results.

## Overview - Completed Analysis

In this notebook, we have divided our analysis into two parts, corresponding to two attempts at the problem. The first attempt describes the steps taken and explains why some of them weren't helpful or needed extra work. It failed at generating text but gave us insight into what needed to be done. In the second attempt, we worked more systematically and applied the required cleaning steps to successfully generate Tweets.

# First Attempt at Text Generation

## Import the Libraries

As the first step, we need to import the required libraries for the complete analysis. Keras is an important library for language modelling: it is a high-level deep learning API that runs on top of a backend framework such as TensorFlow. It will be useful when creating our recurrent neural network.



In [632]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 

# set seeds for reproducibility
# (TensorFlow 1.x API; in TensorFlow 2.x the equivalent is tf.random.set_seed)
from tensorflow import set_random_seed
from numpy.random import seed
set_random_seed(2)
seed(1)

# nltk must be imported before nltk.download is called
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

import pandas as pd
import numpy as np
import string, os
import os.path
import csv

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

[nltk_data] Downloading package stopwords to C:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 1. Load the Dataset

In this notebook, like in the others, we will be using a dataset of opinionated tweets related to presidential candidate speeches during the 2016 GOP Debate. We use it to train a text generation language model that can generate further opinionated tweets on certain topics.

We have saved the dataset in a csv file named Sentiment_4.csv, from which we will read the columns listed below.

In [635]:
def loadDataset(in_file):
    my_path = os.getcwd()
    path = os.path.join(my_path, in_file)
    column_names = ['candidate','sentiment', 'subject', 'retweets', 'text', 'location', 'timezone']
    tweets = pd.read_csv(path, delimiter=',', quotechar='"', header= None, names= column_names, encoding="ISO-8859-1")

    print('Readed ', len(tweets), "tweets")
    return tweets

In [636]:
raw_training_data = loadDataset("Sentiment_4.csv")
raw_training_data.head()

Readed  13871 tweets


Unnamed: 0,candidate,sentiment,subject,retweets,text,location,timezone
1,No candidate mentioned,Neutral,None of the above,5,RT @NancyLeeGrahn: How did everyone feel about...,Unknown,Quito
2,Scott Walker,Positive,None of the above,26,RT @ScottWalker: Didn't catch the full #GOPdeb...,Unknown,Unknown
3,No candidate mentioned,Neutral,None of the above,27,RT @TJMShow: No mention of Tamir Rice and the ...,Unknown,Unknown
4,No candidate mentioned,Positive,None of the above,138,RT @RobGeorge: That Carly Fiorina is trending ...,Texas,Central Time (US & Canada)
5,Donald Trump,Positive,None of the above,156,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,Unknown,Arizona


Our dataset is comprised of 13871 Tweets. However, for our Text Generation, the only column we are interested in is the 'text' column. 

## 2. Initial Preprocessing of Tweets

In the dataset cleaning step, we first perform the text cleaning of the data, which includes:
- Removal of hashtags
- Removal of usernames
- Removal of URLs
- Removal of the word gopdeb (word too common)
- Removal of emoticons
- Splitting of words by boundaries
- Removal of punctuation
- Removal of repeated characters
- Stemming of words

In this step, we start by creating all the functions needed to clean the text. We then apply all the functions together in one wrapper.

Compared to normal texts, tweets contain more noisy data. Therefore, we need to take out the additional noise (hashtags, usernames, URLs) before applying traditional NLP cleaning (as taught in class). To do so, we use regular expressions (regex).
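
As a minimal sketch of this kind of regex cleaning, the snippet below deletes URLs, usernames and hashtags outright. The patterns, the helper name `strip_twitter_noise` and the sample tweet are assumptions made for illustration and are simpler than what real tweets may require. The last lines show a point that is easy to miss: a replacement function that returns `match.group(1)` keeps the tag text and only drops the leading symbol.

```python
import re

# Hypothetical patterns for illustration; real tweets may need broader rules.
URL_PATTERN = re.compile(r"(?:https?|ftp)://\S+")
USER_PATTERN = re.compile(r"@\w+:?")
HASHTAG_PATTERN = re.compile(r"#\w+")

def strip_twitter_noise(text):
    """Drop URLs, @usernames and #hashtags, then collapse extra whitespace."""
    text = URL_PATTERN.sub(" ", text)      # URLs first, before '@' or '#' handling
    text = USER_PATTERN.sub(" ", text)
    text = HASHTAG_PATTERN.sub(" ", text)
    return " ".join(text.split())

sample = "RT @someuser: Great answer tonight #GOPDebate http://example.com/abc"
cleaned = strip_twitter_noise(sample)
print(cleaned)

# Returning group(1) from the replacement keeps the tag text, minus the '#'.
kept = re.sub(r"#(\w+)", lambda m: m.group(1).upper(), "watch #GOPDebate")
print(kept)
```

Whether a hashtag should be deleted entirely or kept as a plain word is a design choice; the two behaviours lead to very different vocabularies.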

#### Removal of Hashtags

In [560]:
#Hashtags
# Note: hash_repl drops the leading '#' but keeps the tag text (upper-cased)
import re
hash_regex = re.compile(r"#(\w+)")
def hash_repl(match):
	return ''+match.group(1).upper()

#### Removal of Usernames

In [561]:
#Usernames
# Note: user_repl drops the leading '@' but keeps the username (upper-cased)
user_regex = re.compile(r"@(\w+)")
def user_repl(match):
	return ''+match.group(1).upper()

#### Removal of URLs

In [562]:
#URL
# Note: group(1) is the scheme, so a matched URL is replaced by 'HTTP', 'HTTPS' or 'FTP'
url_regex = re.compile(r"(http|https|ftp)://[a-zA-Z0-9\./]+")
def url_repl(match):
	return ''+match.group(1).upper()

#### Removal of Emoticons

In [563]:
#Emoticons
emoticons = \
	[	('__EMOT_SMILEY',	[':-)', ':)', '(:', '(-:', ] )	,\
		('__EMOT_LAUGH',		[':-D', ':D', 'X-D', 'XD', 'xD', ] )	,\
		('__EMOT_LOVE',		['<3', ':\*', ] )	,\
		('__EMOT_WINK',		[';-)', ';)', ';-D', ';D', '(;', '(-;', ] )	,\
		('__EMOT_FROWN',		[':-(', ':(', '):', ')-:', ] )	,\
		('__EMOT_CRY',		[':,(', ':\'(', ':"(', ':(('] )	,\
	]
    
def escape_paren(arr):
	return [text.replace(')', '[)}\]]').replace('(', '[({\[]') for text in arr]

def regex_union(arr):
	return '(' + '|'.join( arr ) + ')'

emoticons_regex = [ (repl, re.compile(regex_union(escape_paren(regx))) ) for (repl, regx) in emoticons ]

Once the functions for removing dirty data are completed, we can apply the traditional NLP approach for cleaning the dataset.

#### Removal of the Word Gopdeb

We removed the word gopdeb because it comes from the most frequent string used to query the data and often follows a hashtag. By taking out this word, we aimed to even out the diversity and proportion of words in the tweets. 

We did the same for other repetitive words. 

In [564]:
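#Common word
# Note: the pattern is case-sensitive and only matches the exact token 'gopdeb'
word_regex = re.compile(r"\bgopdeb\b\s+")
def word_repl(match):
	# the pattern has no capture group, so match.group(1) would raise an IndexError
	return ''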

#### Removal of repeated characters

In [565]:
# Repeated characters (like hurrrryyyyyy): a run of two or more identical characters is reduced to two
rpt_regex = re.compile(r"(.)\1{1,}", re.IGNORECASE)
def rpt_repl(match):
	return match.group(1)+match.group(1)
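
A small self-contained sketch of what this substitution does on made-up inputs. The helper name `squeeze_repeats` is hypothetical; the pattern is the same as above.

```python
import re

# Any character followed by one or more copies of itself.
repeat_pattern = re.compile(r"(.)\1{1,}", re.IGNORECASE)

def squeeze_repeats(text):
    """Shorten every run of a repeated character to exactly two characters."""
    return repeat_pattern.sub(lambda m: m.group(1) * 2, text)

for word in ["hurrrryyyyyy", "coool", "good"]:
    print(word, "->", squeeze_repeats(word))
```

Runs are reduced to two characters rather than one so that legitimate double letters (as in "good") survive unchanged.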

#### Splitting of words by boundaries and Removal of Punctuation

We also removed punctuation, which is less common in tweets than in normal text. We still wanted to eliminate this noise, as the generated tweets do not require punctuation. 

In [566]:
# Splitting by word boundaries
word_bound_regex = re.compile(r"\W+")

# Punctuations
punctuations = \
	[	#('',		['.', ] )	,\
		#('',		[',', ] )	,\
		#('',		['\'', '\"', ] )	,\
		('',		['!', '¡', ] )	,\
		('__PUNC_QUES',		['?', '¿', ] )	,\
		('__PUNC_ELLP',		['...', '…', ] )	,\
	]

#For punctuation replacement
def punctuations_repl(match):
	text = match.group(0)
	repl = []
	for (key, parr) in punctuations :
		for punc in parr :
			if punc in text:
				repl.append(key)
	if( len(repl)>0 ) :
		return ' '+' '.join(repl)+' '
	else :
		return ' '

#### Stemming of words

Subsequently, we apply stemming in order to standardize the text and consider the different word variants as one. 

In [567]:
#Stemming
#Porter Stemmer
stemmer = nltk.stem.PorterStemmer()

#### Combining functions and applying them to the text

Now, we combine all the previous functions in one wrapper, which we apply to the text column to create a new column named text_processed.

In [568]:
# Wrapper function that encloses all the processing procedures
def processAll(text):
    
    text = re.sub( hash_regex, hash_repl, text )
    text = re.sub( user_regex, user_repl, text)
    text = re.sub( url_regex, url_repl, text )
    text = re.sub( word_regex, word_repl, text)
    
    for (repl, regx) in emoticons_regex :
        text = re.sub(regx, ' '+repl+' ', text)
    
    text = text.replace('\'','')
    
    text = re.sub( word_bound_regex , punctuations_repl, text )
    text = re.sub( rpt_regex, rpt_repl, text )
    
        
    text = [word if(word[0:2]=='__') else word.lower() for word in text.split() if len(word) >= 3]
    text = [stemmer.stem(w) for w in text]                
    
    return text

In [569]:
raw_training_data['text_processed'] = raw_training_data.text.apply(processAll)
raw_training_data.head()

Unnamed: 0,candidate,sentiment,subject,retweets,text,location,timezone,text_processed
1,No candidate mentioned,Neutral,None of the above,5,RT @NancyLeeGrahn: How did everyone feel about...,Unknown,Quito,"[nancyleegrahn, how, did, everyon, feel, about..."
2,Scott Walker,Positive,None of the above,26,RT @ScottWalker: Didn't catch the full #GOPdeb...,Unknown,Unknown,"[scottwalk, didnt, catch, the, full, gopdeb, l..."
3,No candidate mentioned,Neutral,None of the above,27,RT @TJMShow: No mention of Tamir Rice and the ...,Unknown,Unknown,"[tjmshow, mention, tamir, rice, and, the, gopd..."
4,No candidate mentioned,Positive,None of the above,138,RT @RobGeorge: That Carly Fiorina is trending ...,Texas,Central Time (US & Canada),"[robgeorg, that, carli, fiorina, trend, hour, ..."
5,Donald Trump,Positive,None of the above,156,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,Unknown,Arizona,"[danscavino, gopdeb, realdonaldtrump, deliv, t..."


To reduce the amount of noise, we decided to keep only the column text_processed, which is the new text column with all the data cleaning applied.

In [570]:
all_headlines = raw_training_data['text_processed']
all_headlines.head()

1    [nancyleegrahn, how, did, everyon, feel, about...
2    [scottwalk, didnt, catch, the, full, gopdeb, l...
3    [tjmshow, mention, tamir, rice, and, the, gopd...
4    [robgeorg, that, carli, fiorina, trend, hour, ...
5    [danscavino, gopdeb, realdonaldtrump, deliv, t...
Name: text_processed, dtype: object

Following all these steps, we can see that the data was cleaned. From there, we will start preparing our data to be ingested in the RNN, which we will also create. 

## 3. Dataset Preparation

In this step, we will start by merging all the processed text (arrays) in one array (named corpus). Merging all this information into one corpus allows us to feed the corpus, later on, in the recurrent neural network.

In [571]:
def clean_text(txt):
    return txt 

corpus = [clean_text(x) for x in all_headlines]
corpus[:10]

[['nancyleegrahn',
  'how',
  'did',
  'everyon',
  'feel',
  'about',
  'the',
  'climat',
  'chang',
  'question',
  'last',
  'night',
  '__punc_qu',
  'exactli',
  'gopdeb'],
 ['scottwalk',
  'didnt',
  'catch',
  'the',
  'full',
  'gopdeb',
  'last',
  'night',
  'here',
  'are',
  'some',
  'scott',
  'best',
  'line',
  'second',
  'walker16',
  'httpâ'],
 ['tjmshow',
  'mention',
  'tamir',
  'rice',
  'and',
  'the',
  'gopdeb',
  'wa',
  'held',
  'cleveland',
  '__punc_qu',
  'wow'],
 ['robgeorg',
  'that',
  'carli',
  'fiorina',
  'trend',
  'hour',
  'after',
  'her',
  'debat',
  'abov',
  'ani',
  'the',
  'men',
  'just',
  'complet',
  'gopdeb',
  'say',
  'she'],
 ['danscavino',
  'gopdeb',
  'realdonaldtrump',
  'deliv',
  'the',
  'highest',
  'rate',
  'the',
  'histori',
  'presidenti',
  'debat',
  'trump2016',
  'httpâ'],
 ['gregabbott_tx',
  'tedcruz',
  'first',
  'day',
  'will',
  'rescind',
  'everi',
  'illeg',
  'execut',
  'action',
  'taken',
  'barac

### Generating Sequence of N-Grams Tokens

Language modelling requires sequences as input data: given a sequence of words/tokens, the aim is to predict the next word/token.

The next step is Tokenization, the process of extracting tokens (terms / words) from a corpus. Keras has a built-in Tokenizer class that can be used to obtain the tokens and their index in the corpus. After this step, every text document in the dataset is converted into a sequence of tokens.

Next, we need to convert the corpus into a flat dataset of sentence sequences.
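
The idea can be sketched in plain Python, without Keras. The toy corpus and the helper `prefix_sequences` are made up for illustration: each word is mapped to an integer id (most frequent first, starting at 1 so that 0 stays free for padding, which is also how the Keras Tokenizer numbers words), and every tweet is expanded into all of its prefixes of length two or more.

```python
from collections import Counter

# Toy tokenised corpus (hypothetical); each inner list is one cleaned tweet.
toy_corpus = [
    ["debate", "last", "night"],
    ["last", "night", "was", "loud"],
]

# Word -> integer id, most frequent word first, ids starting at 1.
word_counts = Counter(word for tweet in toy_corpus for word in tweet)
word_index = {
    word: i for i, (word, _) in enumerate(word_counts.most_common(), start=1)
}

def prefix_sequences(tokens):
    """Return every prefix of length >= 2 as a list of integer ids."""
    ids = [word_index[word] for word in tokens]
    return [ids[: i + 1] for i in range(1, len(ids))]

sequences = [seq for tweet in toy_corpus for seq in prefix_sequences(tweet)]
for seq in sequences:
    print(seq)
```

A tweet of n tokens therefore contributes n - 1 training sequences, each one word longer than the previous.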

In [540]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]

[[2192, 42],
 [2192, 42, 67],
 [2192, 42, 67, 283],
 [2192, 42, 67, 283, 346],
 [2192, 42, 67, 283, 346, 17],
 [2192, 42, 67, 283, 346, 17, 2],
 [2192, 42, 67, 283, 346, 17, 2, 399],
 [2192, 42, 67, 283, 346, 17, 2, 399, 252],
 [2192, 42, 67, 283, 346, 17, 2, 399, 252, 33],
 [2192, 42, 67, 283, 346, 17, 2, 399, 252, 33, 25]]

### Padding the Sequences and obtain Variables : Predictors and Target

Now that we have generated a dataset of token sequences, we need to account for the fact that the sequences have different lengths. Before training the model, we pad the sequences so that their lengths are equal, using the pad_sequences function of Keras. To feed this data into a learning model, we also need to create predictors and a label: we use each N-gram sequence without its last word as the predictor and that last word as the label. 
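
A plain-Python sketch of the same three operations (pre-padding, splitting off the last id, one-hot encoding the label) on made-up sequences. The names `pad_pre`, `one_hot` and the `toy_` variables are hypothetical and chosen so they do not clash with the notebook's own variables.

```python
def pad_pre(sequence, length, pad_value=0):
    """Left-pad a list of ids with pad_value so it has exactly `length` items."""
    return [pad_value] * (length - len(sequence)) + sequence

def one_hot(index, size):
    """Encode an integer id as a one-hot list of the given size."""
    return [1 if i == index else 0 for i in range(size)]

# Hypothetical n-gram sequences of integer word ids.
toy_sequences = [[3, 1], [3, 1, 2], [1, 2, 4, 5]]

max_len = max(len(seq) for seq in toy_sequences)
padded = [pad_pre(seq, max_len) for seq in toy_sequences]

# Everything except the last id is the predictor; the last id is the label.
toy_predictors = [row[:-1] for row in padded]
toy_labels = [row[-1] for row in padded]

toy_vocab_size = 6  # 5 toy words + 1 for the padding id 0
toy_labels_one_hot = [one_hot(label, toy_vocab_size) for label in toy_labels]

print(toy_predictors)
print(toy_labels)
```

Padding on the left ("pre") keeps the most recent words next to the position being predicted.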

In [541]:
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)


From there, we obtain the input vectors X (predictors) and the label vector Y (label), which can be used for training. Recurrent neural networks have shown good performance in sequence-to-sequence learning and other text applications.

## 4. LSTMs for Text Generation

Unlike feed-forward neural networks, in which activations are propagated in one direction only (from inputs to outputs), Recurrent Neural Networks feed the hidden state computed at one time step back into the network at the next time step. This loop in the architecture acts as a ‘memory state’: it lets the network remember what it has seen so far in the sequence.

The memory state gives RNNs an advantage over traditional neural networks, but it comes with a problem called the Vanishing Gradient. When gradients are propagated back through many time steps (or many layers), they shrink, and it becomes very hard for the network to learn long-range dependencies and to tune the parameters that act on the earlier steps. To address this problem, a type of RNN called the LSTM (Long Short-Term Memory) was developed.

LSTMs have an additional state called the ‘cell state’, through which the network regulates the flow of information. The advantage of this state is that the model can remember or forget what it has learned more selectively. Here are the layers in our model:

- Input (Embedding) Layer : Takes the sequence of word indices as input and maps each index to a dense vector
- LSTM Layer : Computes the output using LSTM units (number can be tuned)
- Dropout Layer : A regularisation layer which randomly turns off some of the activations of the LSTM layer during training. It helps prevent overfitting
- Output Layer : Computes a probability for every word in the vocabulary; the most probable one is taken as the next word
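
For reference, the cell state and its gates are commonly written as follows. This is the standard textbook formulation of an LSTM cell, not something derived from this notebook's code, and the symbols are introduced here for illustration: $x_t$ is the input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $W$, $U$ and $b$ are learned parameters, $\sigma$ is the logistic sigmoid and $\odot$ is the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate $f_t$ decides how much of the previous cell state is kept, which is what lets the model remember or forget selectively.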

We will also train this model for 1 epoch and for 100 epochs to compare the results.

In [542]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_17 (Embedding)     (None, 25, 10)            127080    
_________________________________________________________________
lstm_17 (LSTM)               (None, 100)               44400     
_________________________________________________________________
dropout_17 (Dropout)         (None, 100)               0         
_________________________________________________________________
dense_17 (Dense)             (None, 12708)             1283508   
Total params: 1,454,988
Trainable params: 1,454,988
Non-trainable params: 0
_________________________________________________________________
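
The parameter counts in this summary can be checked by hand. The sketch below assumes the usual counting rules for these layer types (an embedding matrix, four LSTM gates each with input weights, recurrent weights and a bias, and a dense layer with a bias); the variable names are chosen here for illustration.

```python
vocab_size = 12708     # total_words, read from the Dense output shape above
embedding_dim = 10     # second argument of the Embedding layer
lstm_units = 100       # argument of the LSTM layer

# Embedding: one vector of size embedding_dim per word id.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per (LSTM unit, word) pair plus one bias per word.
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the output layer, because it scores every word of the vocabulary at each prediction.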


We train the model with 1 epoch.

In [543]:
model.fit(predictors, label, epochs=1, verbose=1)
#don't want to run on 100 epochs

Epoch 1/1


<keras.callbacks.History at 0x15a7332ec18>

Now, we train the model with 100 epochs.

In [10]:
model.fit(predictors, label, epochs=100, verbose=1)
# Warning: this call is slow (over 12 hours in our run)

Instructions for updating:
Use tf.cast instead.
Epoch 1/100
...
Epoch 76/100

<keras.callbacks.History at 0x15a4532feb8>

## 5. Generating the Text

Next, we create the function that predicts the next word based on the input words (or seed text). We first tokenize the seed text, pad the sequence and pass it to the trained model to get the predicted word. The predicted word is appended to the seed text and the process is repeated to obtain the predicted sequence.
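
The loop can be sketched without a trained network. In the toy version below, a hand-written table of next-word probabilities stands in for the model, and it conditions only on the last word, whereas the LSTM sees the whole padded sequence. The table, its values and the name `generate_greedy` are made up for illustration.

```python
# Hypothetical stand-in for a trained model: for each current word,
# a probability distribution over possible next words.
toy_next_word_probs = {
    "trump": {"debate": 0.6, "said": 0.4},
    "debate": {"tonight": 0.7, "was": 0.3},
    "tonight": {"was": 0.9, "debate": 0.1},
    "was": {"loud": 0.8, "long": 0.2},
}

def generate_greedy(seed_word, n_words):
    """Append the most probable next word n_words times (greedy decoding)."""
    words = [seed_word]
    for _ in range(n_words):
        candidates = toy_next_word_probs.get(words[-1])
        if not candidates:  # no known continuation: stop early
            break
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

print(generate_greedy("trump", 3))
```

Because the most probable word is always chosen, the output is deterministic for a given seed.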

In [544]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
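        # predict_classes was removed in newer Keras/TensorFlow releases; argmax over predict is equivalent here
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]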
        
        output_word = ""
        for word,index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " "+output_word
    return seed_text.title()

## 6. Results

In our results, we compare running this model for 1 epoch and for 100 epochs. We asked the model to predict the 5 words following a specific seed word (in this case: Trump). 

#### Results for 1 Epoch

In [545]:
print (generate_text("Trump", 5, model, max_sequence_len))

Trump Gopdeb Gopdeb Gopdeb Gopdeb Gopdeb


#### Results for 100 Epochs

In [12]:
print (generate_text("Trump", 5, model, max_sequence_len))

Trump Gopdebate Gopdebate Gopdebate Gopdebate Gopdebate


Surprisingly, the 5 predicted words are all the same. In addition, increasing the number of training epochs doesn't help our model: it fails to learn a useful language model, even with 100 epochs (which ran for over 12 hours). Since the resulting word is Gopdebate or Gopdeb, we can infer that these words weren't completely filtered out and that their presence in the dataset skews the results. 

In the second part of this notebook, we will rework the steps to obtain an effective and useful cleaning process for the data, hoping this will correct the problem above. 

# Second Attempt at Text Generation

## 1. Load the Dataset

To avoid overwriting the above results, we will load the same dataset once again, and create a new dataframe to work from. 

In [606]:
def loadDataset(in_file):
    my_path = os.getcwd()
    path = os.path.join(my_path, in_file)
    column_names = ['candidate','sentiment', 'subject', 'retweets', 'text', 'location', 'timezone']
    tweets = pd.read_csv(path, delimiter=',', quotechar='"', header= None, names= column_names, encoding="ISO-8859-1")

    print('Readed ', len(tweets), "tweets")
    return tweets

In [607]:
raw_training_data2 = loadDataset("Sentiment_4.csv")
raw_training_data2.head()

Readed  13871 tweets


Unnamed: 0,candidate,sentiment,subject,retweets,text,location,timezone
1,No candidate mentioned,Neutral,None of the above,5,RT @NancyLeeGrahn: How did everyone feel about...,Unknown,Quito
2,Scott Walker,Positive,None of the above,26,RT @ScottWalker: Didn't catch the full #GOPdeb...,Unknown,Unknown
3,No candidate mentioned,Neutral,None of the above,27,RT @TJMShow: No mention of Tamir Rice and the ...,Unknown,Unknown
4,No candidate mentioned,Positive,None of the above,138,RT @RobGeorge: That Carly Fiorina is trending ...,Texas,Central Time (US & Canada)
5,Donald Trump,Positive,None of the above,156,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,Unknown,Arizona


## 2. Data Cleaning

As we recall from the first part, we applied all the cleaning functions at once, in a single wrapper. 

In this second try at the data, we understand the importance of correctly cleaning the dataset. Thus, we will systematically clean the dataset, one step at a time, analyzing the impact of each function. This will allow us to have a better overview at the changes made and if they were effective in the task. 

#### Remove Stopwords and Apply Lowercase

As seen above, even when taking out gopdebate, we cannot eliminate all mentions of this word because the matching is case-sensitive. Therefore, we convert all string values (in the text column) to lowercase to avoid this problem. 

We also decide to eliminate stopwords. Tweets normally contain few words, so some tweets may be composed mostly of stopwords, which contribute little to our language model. We remove them so they don't impact the final result. Besides, the aim of the tweet generation is to get an idea of the opinion on certain topics, not necessarily to produce grammatically correct text.
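
A minimal sketch of these two operations on a made-up sentence. The tiny stop-word set below is an assumption for illustration only (the notebook uses NLTK's English list), and the helper name is hypothetical.

```python
# Tiny hypothetical stop-word set; the notebook uses NLTK's English list instead.
toy_stopwords = {"the", "is", "a", "of", "and", "to", "in"}

def lowercase_and_drop_stopwords(text):
    """Lowercase every token and drop the ones found in toy_stopwords."""
    tokens = [word.lower() for word in text.split()]
    return " ".join(word for word in tokens if word not in toy_stopwords)

result = lowercase_and_drop_stopwords("The #GOPDebate is a test of patience")
print(result)
```

Lowercasing first means that "The" and "the" are treated as the same stop word, and that "#GOPDebate" and "#gopdebate" become a single token.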

In [608]:
from nltk.corpus import stopwords
sw = stopwords.words('english')

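# named remove_stopwords so it does not shadow the imported nltk `stopwords` module
def remove_stopwords(text):
    # lowercase every token and drop English stop words
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    return " ".join(text)

raw_training_data2['text'] = raw_training_data2['text'].apply(remove_stopwords)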
raw_training_data2.head()

Unnamed: 0,candidate,sentiment,subject,retweets,text,location,timezone
1,No candidate mentioned,Neutral,None of the above,5,rt @nancyleegrahn: everyone feel climate chang...,Unknown,Quito
2,Scott Walker,Positive,None of the above,26,rt @scottwalker: catch full #gopdebate last ni...,Unknown,Unknown
3,No candidate mentioned,Neutral,None of the above,27,rt @tjmshow: mention tamir rice #gopdebate hel...,Unknown,Unknown
4,No candidate mentioned,Positive,None of the above,138,rt @robgeorge: carly fiorina trending -- hours...,Texas,Central Time (US & Canada)
5,Donald Trump,Positive,None of the above,156,rt @danscavino: #gopdebate w/ @realdonaldtrump...,Unknown,Arizona


#### Remove Common Words

When further analyzing the dataset, we can see that all (or almost all) tweets in the text column start with the word 'rt'. This is an abbreviation for retweet and is irrelevant in determining the future Tweets. We can also see, further in the string values, that the words gopdeb, gopdebate and gopdebat are used. So, we eliminate all of these words at this stage. By this, we hope to avoid skewing the results towards the most popular words.
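
One thing worth checking when removing a prefix such as 'rt': Python's `str.lstrip` and `str.rstrip` remove any characters from the given set, not a literal prefix or suffix. The sketch below illustrates the difference on made-up strings; `drop_rt_prefix` is a hypothetical helper, not part of the notebook.

```python
import re

# str.lstrip / str.rstrip treat their argument as a SET of characters,
# not as a literal prefix or suffix.
stripped_ok = "rt @user: hello".lstrip("rt")   # leading "rt" removed, space kept
stripped_bad = "trump wins".lstrip("rt")       # the "tr" of "trump" is removed too
suffix_bad = "that was good".rstrip("gopdeb")  # "good" only contains set characters

def drop_rt_prefix(text):
    """Remove a leading 'rt' only when it is a whole word at the start."""
    return re.sub(r"^rt\b\s*", "", text)

print(repr(stripped_ok), repr(stripped_bad), repr(suffix_bad))
print(drop_rt_prefix("rt @user: hello"), "|", drop_rt_prefix("trump wins"))
```

On Python 3.9 and later, `str.removeprefix` and `str.removesuffix` are another way to remove a literal prefix or suffix.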

In [609]:
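# regex anchors are used because str.lstrip('rt') / rstrip('gopdeb') strip character sets (e.g. the "tr" of "trump")
raw_training_data2['text'] = raw_training_data2['text'].map(lambda x: re.sub(r'gopdeb$', '', re.sub(r'^rt\b', '', x)))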
raw_training_data2.head()

Unnamed: 0,candidate,sentiment,subject,retweets,text,location,timezone
1,No candidate mentioned,Neutral,None of the above,5,@nancyleegrahn: everyone feel climate change ...,Unknown,Quito
2,Scott Walker,Positive,None of the above,26,@scottwalker: catch full #gopdebate last nigh...,Unknown,Unknown
3,No candidate mentioned,Neutral,None of the above,27,@tjmshow: mention tamir rice #gopdebate held ...,Unknown,Unknown
4,No candidate mentioned,Positive,None of the above,138,@robgeorge: carly fiorina trending -- hours d...,Texas,Central Time (US & Canada)
5,Donald Trump,Positive,None of the above,156,@danscavino: #gopdebate w/ @realdonaldtrump d...,Unknown,Arizona


In [610]:
raw_training_data2['text'] = raw_training_data2.text.str.replace("gopdebate", "")
raw_training_data2['text'] = raw_training_data2.text.str.replace("gopdebat", "")
raw_training_data2.head()

Unnamed: 0,candidate,sentiment,subject,retweets,text,location,timezone
1,No candidate mentioned,Neutral,None of the above,5,@nancyleegrahn: everyone feel climate change ...,Unknown,Quito
2,Scott Walker,Positive,None of the above,26,@scottwalker: catch full # last night. scott'...,Unknown,Unknown
3,No candidate mentioned,Neutral,None of the above,27,@tjmshow: mention tamir rice # held cleveland...,Unknown,Unknown
4,No candidate mentioned,Positive,None of the above,138,@robgeorge: carly fiorina trending -- hours d...,Texas,Central Time (US & Canada)
5,Donald Trump,Positive,None of the above,156,@danscavino: # w/ @realdonaldtrump delivered ...,Unknown,Arizona


#### Remove Words with Only 1 Letter

Having removed the stopwords, we now remove all one-letter words, since these are most likely leftover fragments rather than real words.
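
A small sketch of this step with the standard `re` module, on made-up strings. The helper name is hypothetical. Note that `\w` also matches digits and that a letter left alone after an apostrophe counts as a one-letter word.

```python
import re

def drop_single_letter_words(text):
    """Delete isolated single word characters, then collapse runs of whitespace."""
    text = re.sub(r"\b\w\b", "", text)
    return re.sub(r"\s+", " ", text)

first = drop_single_letter_words("w/ trump a b debate")
second = drop_single_letter_words("scott's best lines")
print(repr(first))
print(repr(second))
```

The second example shows a side effect: the "s" of a possessive is removed while the apostrophe stays, which is why forms such as "scott'" can appear in cleaned text.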

In [611]:
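# regex=True is passed explicitly: recent pandas versions treat the pattern as a literal string by default
raw_training_data2['text'] = raw_training_data2.text.str.replace(r'\b\w\b','', regex=True).str.replace(r'\s+', ' ', regex=True)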
raw_training_data2.head()

Unnamed: 0,candidate,sentiment,subject,retweets,text,location,timezone
1,No candidate mentioned,Neutral,None of the above,5,@nancyleegrahn: everyone feel climate change ...,Unknown,Quito
2,Scott Walker,Positive,None of the above,26,@scottwalker: catch full # last night. scott'...,Unknown,Unknown
3,No candidate mentioned,Neutral,None of the above,27,@tjmshow: mention tamir rice # held cleveland...,Unknown,Unknown
4,No candidate mentioned,Positive,None of the above,138,@robgeorge: carly fiorina trending -- hours d...,Texas,Central Time (US & Canada)
5,Donald Trump,Positive,None of the above,156,@danscavino: # / @realdonaldtrump delivered h...,Unknown,Arizona


Again, to reduce the amount of noise, we decided to keep only the column text, which is the text column now modified with all the data cleaning applied.

In [612]:
all_headlines2 = raw_training_data2['text']
all_headlines2.head()

1     @nancyleegrahn: everyone feel climate change ...
2     @scottwalker: catch full # last night. scott'...
3     @tjmshow: mention tamir rice # held cleveland...
4     @robgeorge: carly fiorina trending -- hours d...
5     @danscavino: # / @realdonaldtrump delivered h...
Name: text, dtype: object

Now we combine, again, all the text in one array and apply further steps of the data cleaning process. 

In [613]:
corpus = [clean_text(x) for x in all_headlines2]
corpus[:10]

[' @nancyleegrahn: everyone feel climate change question last night? exactly. #',
 " @scottwalker: catch full # last night. scott' best lines 90 seconds. #walker16 http://.co/zsff",
 ' @tjmshow: mention tamir rice # held cleveland? wow.',
 ' @robgeorge: carly fiorina trending -- hours debate -- men just-completed # says ',
 ' @danscavino: # / @realdonaldtrump delivered highest ratings history presidential debates. #trump2016 http://.co',
 ' @gregabbott_tx: @tedcruz: "on first day rescind every illegal executive action taken barack obama." # @foxnews',
 ' @warriorwoman91: liked happy heard going moderator. anymore. # @megynkelly https://',
 'going #msnbc live @thomasaroberts around pm et. #',
 'deer headlights rt @lizzwinstead: ben carson, may brain surgeon performed lobotomy himself. #',
 " @nancyosborne180: last night' debate proved it! # #batsask @badassteachersa #tbats https://.co/g2ggjy1bj"]

#### Encoding Text

In [614]:
def clean_text(txt):
    # drop every non-ASCII character (e.g. mis-decoded bytes left over from the ISO-8859-1 read)
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

corpus = [clean_text(x) for x in all_headlines2]
corpus[:10]

[' @nancyleegrahn: everyone feel climate change question last night? exactly. #',
 " @scottwalker: catch full # last night. scott' best lines 90 seconds. #walker16 http://.co/zsff",
 ' @tjmshow: mention tamir rice # held cleveland? wow.',
 ' @robgeorge: carly fiorina trending -- hours debate -- men just-completed # says ',
 ' @danscavino: # / @realdonaldtrump delivered highest ratings history presidential debates. #trump2016 http://.co',
 ' @gregabbott_tx: @tedcruz: "on first day rescind every illegal executive action taken barack obama." # @foxnews',
 ' @warriorwoman91: liked happy heard going moderator. anymore. # @megynkelly https://',
 'going #msnbc live @thomasaroberts around pm et. #',
 'deer headlights rt @lizzwinstead: ben carson, may brain surgeon performed lobotomy himself. #',
 " @nancyosborne180: last night' debate proved it! # #batsask @badassteachersa #tbats https://.co/g2ggjy1bj"]

#### Removing Symbols, Words and Punctuation

In [615]:
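# Replace every URL matched by link_regex with ', '
def strip_links(text):
    # raw string: same pattern as before, without invalid escape sequences
    link_regex    = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)', re.DOTALL)
    links         = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')    
    return text

# Replace punctuation (except '@' and '#') with spaces, then drop every word that starts with '@' or '#'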
def strip_all_entities(text):
    entity_prefixes = ['@','#']
    for separator in  string.punctuation:
        if separator not in entity_prefixes :
            text = text.replace(separator,' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                words.append(word)
    return ' '.join(words)


corpus = [strip_links(x) for x in corpus]
corpus[:10]

[' @nancyleegrahn: everyone feel climate change question last night? exactly. #',
 " @scottwalker: catch full # last night. scott' best lines 90 seconds. #walker16 , ",
 ' @tjmshow: mention tamir rice # held cleveland? wow.',
 ' @robgeorge: carly fiorina trending -- hours debate -- men just-completed # says ',
 ' @danscavino: # / @realdonaldtrump delivered highest ratings history presidential debates. #trump2016 , ',
 ' @gregabbott_tx: @tedcruz: "on first day rescind every illegal executive action taken barack obama." # @foxnews',
 ' @warriorwoman91: liked happy heard going moderator. anymore. # @megynkelly , ',
 'going #msnbc live @thomasaroberts around pm et. #',
 'deer headlights rt @lizzwinstead: ben carson, may brain surgeon performed lobotomy himself. #',
 " @nancyosborne180: last night' debate proved it! # #batsask @badassteachersa #tbats , "]

In [616]:
corpus = [strip_all_entities(x) for x in corpus]
corpus[:10]

['everyone feel climate change question last night exactly',
 'catch full last night scott best lines 90 seconds',
 'mention tamir rice held cleveland wow',
 'carly fiorina trending hours debate men just completed says',
 'delivered highest ratings history presidential debates',
 'tx on first day rescind every illegal executive action taken barack obama',
 'liked happy heard going moderator anymore',
 'going live around pm et',
 'deer headlights rt ben carson may brain surgeon performed lobotomy himself',
 'last night debate proved it']
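
The stray `tx` in the sixth line of the output above comes from the handle `@gregabbott_tx`: the underscore is replaced by a space before the mention is dropped, so only `@gregabbott` is removed. The sketch below reproduces this on a shortened version of that Tweet and shows one possible alternative, `strip_entities_first`, which is a hypothetical helper (not part of the original notebook) that removes mentions and hashtags before touching punctuation:

```python
import re
import string

def strip_all_entities(text):
    # Same logic as the notebook: replace punctuation (except @ and #) by spaces,
    # then drop every word that starts with @ or #
    entity_prefixes = ['@', '#']
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    words = [w for w in text.split() if w[0] not in entity_prefixes]
    return ' '.join(words)

def strip_entities_first(text):
    # Hypothetical variant: remove whole @mentions and #hashtags first,
    # then replace all remaining punctuation by spaces
    text = re.sub(r'[@#]\w+', ' ', text)
    for separator in string.punctuation:
        text = text.replace(separator, ' ')
    return ' '.join(text.split())

tweet = '@gregabbott_tx: @tedcruz: "on first day" # @foxnews'
original_result = strip_all_entities(tweet)
variant_result = strip_entities_first(tweet)
print(original_result)   # tx on first day
print(variant_result)    # on first day
```

Whether the leftover fragment matters depends on how many handles in the dataset contain underscores; this is only a sketch of one way to avoid it.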

#### Removing Links, User Mentions and Special Characters

In [617]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [618]:
TEXT_CLEANING_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"

In [619]:
def preprocess(text, stem=False):
    # Remove link,user and special characters
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)

In [620]:
corpus = [preprocess(x) for x in corpus]
corpus[:10]

['everyone feel climate change question last night exactly',
 'catch full last night scott best lines 90 seconds',
 'mention tamir rice held cleveland wow',
 'carly fiorina trending hours debate men completed says',
 'delivered highest ratings history presidential debates',
 'tx first day rescind every illegal executive action taken barack obama',
 'liked happy heard going moderator anymore',
 'going live around pm et',
 'deer headlights rt ben carson may brain surgeon performed lobotomy',
 'last night debate proved']
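
As a minimal sketch of what this step does, the block below applies the same regular expression to a few short strings. The tiny `stop_words` set is a hypothetical stand-in for the list defined earlier in the notebook, and stemming is left out. It also illustrates a subtlety of the pattern: a mention is only removed when the `@` is the first character matched, because a preceding space lets the last alternative (`[^A-Za-z0-9]+`) consume the `@` together with the space.

```python
import re

TEXT_CLEANING_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"
stop_words = {"it", "just", "himself"}   # hypothetical stand-in for the notebook's stop_words

def preprocess(text):
    # Lower-case, replace every regex match by a space, then drop stop words
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    return " ".join(t for t in text.split() if t not in stop_words)

print(preprocess("Last night debate proved it"))   # last night debate proved
print(preprocess("@user hello"))                   # hello
print(preprocess("hello @user"))                   # hello user
```

In this notebook the mentions were already removed by `strip_all_entities`, so the corpus is not affected by this behaviour.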

#### Removing Whitespace

Each of the steps above removes characters or tokens, which can leave stray whitespace behind. To avoid carrying that noise into the model, we strip leading and trailing whitespace from every Tweet. Whitespace inside each Tweet has already been normalised by `preprocess`, which re-joins the tokens with single spaces, so the output below is unchanged.

In [621]:
def remove_whitespace(in_str):
    return in_str.strip()

corpus = [remove_whitespace(x) for x in corpus]
corpus[:10]

['everyone feel climate change question last night exactly',
 'catch full last night scott best lines 90 seconds',
 'mention tamir rice held cleveland wow',
 'carly fiorina trending hours debate men completed says',
 'delivered highest ratings history presidential debates',
 'tx first day rescind every illegal executive action taken barack obama',
 'liked happy heard going moderator anymore',
 'going live around pm et',
 'deer headlights rt ben carson may brain surgeon performed lobotomy',
 'last night debate proved']
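
Note that `str.strip()` only touches the two ends of a string. If runs of whitespace inside a Tweet also had to be collapsed (for example when this step is used without `preprocess`), one possible sketch is the following; `normalize_whitespace` is a hypothetical helper name, not part of the original notebook:

```python
def normalize_whitespace(in_str):
    # split() without arguments splits on any run of whitespace and discards
    # empty pieces, so joining with a single space collapses inner runs too
    return " ".join(in_str.split())

messy = "  last   night \t debate  "
print(repr(messy.strip()))                # inner runs of whitespace are kept
print(repr(normalize_whitespace(messy)))  # 'last night debate'
```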

## 3. Dataset Preparation

With the newly cleaned dataset, we now apply the same preparation steps as before to build our language model. As mentioned earlier, we believe that further and more systematic data cleaning was needed to correct the problem seen in the generated Tweets (repeated words). If the issue persists, we will then rework the dataset preparation step itself.

### Generating Sequences of N-Gram Tokens

In [622]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]

[[173, 294],
 [173, 294, 274],
 [173, 294, 274, 223],
 [173, 294, 274, 223, 30],
 [173, 294, 274, 223, 30, 7],
 [173, 294, 274, 223, 30, 7, 6],
 [173, 294, 274, 223, 30, 7, 6, 770],
 [1530, 417],
 [1530, 417, 7],
 [1530, 417, 7, 6]]
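
To see how the prefixes are built without needing Keras, here is a minimal pure-Python sketch on two of the cleaned Tweets. One simplification is assumed: indices are assigned in order of first appearance, whereas the Keras `Tokenizer` orders words by frequency, so the actual numbers differ from the output above. A Tweet with n tokens yields n-1 training sequences.

```python
toy_corpus = ["last night debate proved", "going live around pm et"]

# Build a word -> index map starting at 1 (index 0 is reserved for padding)
word_index = {}
for line in toy_corpus:
    for word in line.split():
        word_index.setdefault(word, len(word_index) + 1)

def prefixes(line):
    # Every prefix of length >= 2: the last token of each prefix is the word to predict
    tokens = [word_index[w] for w in line.split()]
    return [tokens[:i + 1] for i in range(1, len(tokens))]

toy_sequences = [seq for line in toy_corpus for seq in prefixes(line)]
for seq in toy_sequences:
    print(seq)
```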

### Padding the Sequences and Obtaining the Variables: Predictors and Target

In [623]:
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
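
The following sketch reproduces the same idea with NumPy only, on three toy sequences (the values and the vocabulary size are assumptions chosen for illustration). Shorter sequences are left-padded with zeros, the last column becomes the target word, and the target is then one-hot encoded:

```python
import numpy as np

toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
max_len = max(len(s) for s in toy_sequences)

# Left-pad with zeros, mirroring pad_sequences(..., padding='pre')
padded = np.array([[0] * (max_len - len(s)) + s for s in toy_sequences])

# All columns but the last are the input; the last column is the word to predict
toy_predictors, toy_label = padded[:, :-1], padded[:, -1]

vocab_size = 5  # 4 toy words + 1 for the padding index 0
one_hot = np.eye(vocab_size, dtype=int)[toy_label]

print(toy_predictors)
print(toy_label)
print(one_hot)
```

Padding on the left keeps the most recent words next to the position being predicted, which is the usual reason for choosing `padding='pre'` with this kind of model.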


## 4. LSTMs for Text Generation

In [624]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_19 (Embedding)     (None, 20, 10)            106640    
_________________________________________________________________
lstm_19 (LSTM)               (None, 100)               44400     
_________________________________________________________________
dropout_19 (Dropout)         (None, 100)               0         
_________________________________________________________________
dense_19 (Dense)             (None, 10664)             1077064   
Total params: 1,228,104
Trainable params: 1,228,104
Non-trainable params: 0
_________________________________________________________________
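
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas for each layer type (an embedding table, an LSTM with four gates that each have input weights, recurrent weights and a bias, and a dense layer with a bias); the sizes are read off the summary above.

```python
vocab_size = 10664   # total_words, from the Dense output shape
embed_dim = 10       # embedding size passed to Embedding
lstm_units = 100     # units passed to LSTM

# Embedding: one vector of size embed_dim per word
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per output word
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the output layer, since it has to score every word in the vocabulary.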


As before, we first train the model for 1 epoch.

In [625]:
model.fit(predictors, label, epochs=1, verbose=1)

Epoch 1/1


<keras.callbacks.History at 0x15a7b39f710>

As before, we then train the model for 100 epochs.

In [518]:
model.fit(predictors, label, epochs=100, verbose=1)

Epoch 1/100
...
Epoch 80

<keras.callbacks.History at 0x15a6f49b0b8>

## 5. Generating the Text

In [626]:
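def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        # Encode the text generated so far and left-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # Greedy decoding: pick the most probable next word.
        # model.predict_classes is not available in recent Keras releases;
        # taking the argmax of model.predict gives the same result.
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
        
        # Map the predicted index back to its word
        output_word = ""
        for word,index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " "+output_word
    return seed_text.title()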
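
`generate_text` always picks the single most probable next word (greedy decoding), which is one reason a model can fall into repeating the same words. A common alternative, which is not used in this notebook, is to sample the next word from the predicted distribution after rescaling it with a temperature. The sketch below uses NumPy only; `sample_with_temperature` is a hypothetical helper and the probabilities are made up.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    # Rescale a probability distribution and draw one index from it.
    # temperature < 1 sharpens the distribution (closer to greedy),
    # temperature > 1 flattens it (more diverse, more random).
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()              # for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

rng = np.random.default_rng(0)
toy_probs = [0.1, 0.7, 0.2]            # hypothetical softmax output
nearly_greedy = sample_with_temperature(toy_probs, temperature=0.01, rng=rng)
diverse = [sample_with_temperature(toy_probs, temperature=1.5, rng=rng) for _ in range(10)]
print(nearly_greedy, diverse)
```

To try it, the line that computes `predicted` in `generate_text` would take the row of probabilities returned by `model.predict` and pass it to this helper; whether this actually improves the generated Tweets would need to be checked experimentally.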

## 6. Results

As the outputs below show, when we seed the generator with the word "Trump" (as in part 1), the model now produces a Tweet-like sequence.

Since the dataset covers the 2016 GOP Debate, we already have an idea of what the output should look like. Surprisingly, the generated Tweets reflect the themes of this debate quite well. We also tried other seed words to check how close the generated Tweets are to the real-life outcome.

#### Results from 1 Epoch

In [627]:
print (generate_text("Trump", 5, model, max_sequence_len))

Trump Like Last Night Obviously Night


#### Results from 100 Epochs

In [520]:
print (generate_text("Trump", 5, model, max_sequence_len))

Trump Said Megyn Ask Nine Candidates


In [521]:
print (generate_text("debate", 5, model, max_sequence_len))

Debate Debate Action Fox News Fed


In [522]:
print (generate_text("climate", 5, model, max_sequence_len))

Climate Change Made 90 Issue Next


In [523]:
print (generate_text("president", 5, model, max_sequence_len))

President Always Tell Truth Said Would


# Conclusion For Tweet Generation

In conclusion, our two previous analyses show that building a deep learning language model requires the right data cleaning steps. Following a systematic process and understanding the data issues at hand helps a lot in identifying the required steps and, in turn, in producing a proper model. Text Generation for Tweets is more difficult because most Tweets are much shorter than texts in general, which makes it a lot harder to generate output that is both relevant and diverse.

In addition, because we do not have much data, training for 1 epoch or for 100 epochs both gives good outputs.

The results could be improved further with the following points:
- Adding more data
- Fine-tuning the network architecture
- Fine-tuning the network parameters

However, there are some limitations to deep learning when generating language models. The most important one, in this exercise, is that a deep learning model is computationally expensive compared with the standard NLP approach, so the model takes a very long time to train. To process additional data, we would need more powerful hardware.

In the next notebook (3_Text_Generation), we will see a different approach to Text Generation for Tweets. Because RNNs are expensive to run, we want to compare this analysis with a more basic approach and see the differences or similarities in output. We will also apply and analyze the standard approach to longer speeches (instead of Tweets). 

### Sourced Information

https://medium.com/@shivambansal36/language-modelling-text-generation-using-lstms-deep-learning-for-nlp-ed36b224b275