# Natural Language Processing: Classic to Deep Methods for Sentiment Analysis

## Resources

Bag-Of-Words and TF-IDF:

https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/

Recurrent Neural Networks (RNNs):

https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9

Long Short Term Memory networks (LSTMs):

https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Word embeddings:

http://jalammar.github.io/illustrated-word2vec/

In [1]:
import os
import numpy as np
import pandas as pd

#TOFILL
import re
import string
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.util import ngrams

[nltk_data] Downloading package punkt to
[nltk_data]     /home/fernando.arroyo@Digital-
[nltk_data]     Grenoble.local/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/fernando.arroyo@Digital-
[nltk_data]     Grenoble.local/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Today we are going to tackle sentiment analysis, a *text classification* problem. The idea is simple: we want to automatically predict whether a text expresses a positive or a negative sentiment. To do so we will use the IMDB dataset, which contains 50,000 movie reviews from www.imdb.com together with their sentiment label: positive or negative. It is thus a binary classification problem, where we want to predict a binary target $y \in \{0,1\}$. We will go through different ways of encoding a text as a vector $x \in \mathbb{R}^d$, as well as different classification models, from classic approaches to modern deep learning models.

## Load the dataset

Load the dataset and explore the data a bit:

In [2]:
#Load and print the dataset
imdb_dataset_original=pd.read_csv('../../data/IMDB Dataset.csv')
imdb_dataset = imdb_dataset_original.copy()
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [3]:
imdb_dataset.shape

(50000, 2)

In [7]:
#Print first review:
print(imdb_dataset["review"][0])

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

In [8]:
#Print the size of each class
imdb_dataset['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

## Text preprocessing

As you can see, the text is quite messy. Before encoding it into features, we are going to apply several preprocessing steps to clean it:
* Removing the HTML tags.
* Removing other special characters: this means all non-alphanumeric characters, including punctuation.
* Lowercasing the text.
* Tokenization: splitting a text into a list of words, now called tokens.
* Stemming: removing the suffixes that come from conjugation, plural, etc. in order to bring a word back to its root form. For example, "reviewers" becomes "review" and "watching" becomes "watch".
* Removing stopwords: words like 'to', 'a', 'the', ... are called stopwords. We remove them because they are very frequent and generally just add noise.

Fill the following functions to perform each of these steps. You are free to use the libraries of your choice to do so. Try to not reinvent the wheel!
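
To get a feel for what the whole chain does before filling in the real functions, here is a minimal sketch on a single made-up sentence, using only the standard library. The stopword set `TOY_STOPWORDS`, the suffix list `TOY_SUFFIXES` and the helpers `toy_stem` and `toy_preprocess` are hypothetical and deliberately crude; a real solution would rely on a library such as NLTK.

```python
import re

# Hypothetical mini stopword and suffix lists, for illustration only.
TOY_STOPWORDS = {"the", "a", "to", "of", "was", "were", "this", "and", "i"}
TOY_SUFFIXES = ("ing", "ed", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def toy_preprocess(text):
    # Replace HTML tags with a space so neighbouring words are not glued together
    text = re.sub(r"<.*?>", " ", text)
    # Remove everything that is neither alphanumeric nor whitespace
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = text.lower()
    # Naive whitespace tokenization
    tokens = text.split()
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return [toy_stem(t) for t in tokens]


example = "I loved this movie!<br /><br />The actors were amazing."
print(toy_preprocess(example))  # ['lov', 'movie', 'actor', 'amaz']
```

Note how crude stems such as `lov` or `amaz` are not real words: stemming only needs to map related word forms to the same token.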

In [3]:
CLEANR = re.compile('<.*?>') 

def remove_html_tags(text):
    """
    Input: str : A string to clean from html tags
    Output: str : The same string with html tags removed
    """
    #TOFILL
    cleantext = re.sub(CLEANR, '', text)
    return cleantext

In [4]:
imdb_dataset['review']=imdb_dataset['review'].apply(remove_html_tags)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [18]:
string.punctuation


'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [4]:
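# Escape the punctuation so that characters such as ']', '\' and '^'
# are treated literally inside the character class
CHAR_SPE = re.compile('[' + re.escape(string.punctuation) + ']')


def remove_special_characters(text):
    """
    Input: str : A string to clean from non alphanumeric characters
    Output: str : The same string without non alphanumeric characters
    """
    #TOFILL
    cleantext = CHAR_SPE.sub('', text)
    return cleantext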
    

In [6]:
imdb_dataset['review']=imdb_dataset['review'].apply(remove_special_characters)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production The filming tech...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically theres a family where a little boy J...,negative
4,Petter Matteis Love in the Time of Money is a ...,positive
5,Probably my alltime favorite movie a story of ...,positive
6,I sure would like to see a resurrection of a u...,positive
7,This show was an amazing fresh innovative ide...,negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [5]:
def lowercase_text(text):
    """
    Input: str : A string to lowercase
    Output: str : The same string lowercased
    """
    #TOFILL
    lower_text = text.lower()
    return lower_text
    

In [8]:
imdb_dataset['review']=imdb_dataset['review'].apply(lowercase_text)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
5,probably my alltime favorite movie a story of ...,positive
6,i sure would like to see a resurrection of a u...,positive
7,this show was an amazing fresh innovative ide...,negative
8,encouraged by the positive comments about this...,negative
9,if you like original gut wrenching laughter yo...,positive


In [6]:
def tokenize_words(text):
    """
    Input: str : A string to tokenize
    Output: list of str : A list of the tokens split from the input string
    """
    #TOFILL
    tokens = nltk.word_tokenize(text)
    return tokens
    

In [10]:
imdb_dataset['review']=imdb_dataset['review'].apply(tokenize_words)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,"[one, of, the, other, reviewers, has, mentione...",positive
1,"[a, wonderful, little, production, the, filmin...",positive
2,"[i, thought, this, was, a, wonderful, way, to,...",positive
3,"[basically, theres, a, family, where, a, littl...",negative
4,"[petter, matteis, love, in, the, time, of, mon...",positive
5,"[probably, my, alltime, favorite, movie, a, st...",positive
6,"[i, sure, would, like, to, see, a, resurrectio...",positive
7,"[this, show, was, an, amazing, fresh, innovati...",negative
8,"[encouraged, by, the, positive, comments, abou...",negative
9,"[if, you, like, original, gut, wrenching, laug...",positive


In [7]:
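# Use a distinct name so that the imported nltk `stopwords` module is not overwritten.
# Note: stopwords.words() with no argument returns the lists of every language
# available in NLTK; pass 'english' to restrict it to English stopwords.
STOPWORDS = set(stopwords.words())

def remove_stopwords(token_list):
    """
    Input: list of str : A list of tokens
    Output: list of str : The new list with removed stopwords tokens
    """
    #TOFILL
    no_stopwords = [word for word in token_list if word not in STOPWORDS]
    return no_stopwords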

In [12]:
imdb_dataset['review']=imdb_dataset['review'].apply(remove_stopwords)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,"[reviewers, mentioned, watching, 1, oz, episod...",positive
1,"[wonderful, production, filming, technique, un...",positive
2,"[wonderful, spend, time, hot, summer, weekend,...",positive
3,"[basically, family, boy, jake, thinks, zombie,...",negative
4,"[petter, matteis, love, time, money, visually,...",positive
5,"[probably, alltime, favorite, movie, story, se...",positive
6,"[resurrection, dated, seahunt, series, tech, t...",positive
7,"[show, amazing, fresh, innovative, idea, 70s, ...",negative
8,"[encouraged, positive, comments, film, forward...",negative
9,"[original, gut, wrenching, laughter, movie, yo...",positive


In [8]:
stemmer = PorterStemmer()

def stem_words(token_list):
    """
    Input: list of str : A list of tokens to stem
    Output: list of str : The list of stemmed tokens
    """
    #TOFILL
    stemmed = [stemmer.stem(word) for word in token_list]
    return stemmed 
    

In [14]:
imdb_dataset['review']=imdb_dataset['review'].apply(stem_words)
imdb_dataset.head(10)

Unnamed: 0,review,sentiment
0,"[review, mention, watch, 1, oz, episod, youll,...",positive
1,"[wonder, product, film, techniqu, unassum, old...",positive
2,"[wonder, spend, time, hot, summer, weekend, si...",positive
3,"[basic, famili, boy, jake, think, zombi, close...",negative
4,"[petter, mattei, love, time, money, visual, st...",positive
5,"[probabl, alltim, favorit, movi, stori, selfle...",positive
6,"[resurrect, date, seahunt, seri, tech, today, ...",positive
7,"[show, amaz, fresh, innov, idea, 70, air, 7, 8...",negative
8,"[encourag, posit, comment, film, forward, watc...",negative
9,"[origin, gut, wrench, laughter, movi, young, l...",positive


Let's put all of that together and apply it to our dataset. The following function simply chains all the preprocessing steps you just implemented.

It also adds a `list_output` flag: if False, the preprocessed tokens are joined back into a single string (with spaces between tokens); if True, each review is kept as a list of tokens. Depending on the libraries you use in the next steps, one or the other representation can be more convenient.

In [9]:
def normalize_text_dataset(dataset, text_col_name = 'review', html_tags = True,
                           special_chars = True, lowercase = True , stemming = True , 
                           stopwords = True, list_output = False ):
    """
    Apply the chosen preprocessing steps to a corpus of texts and return the 
    preprocessed corpus. The list_output flag allows returning either a list
    of tokens, or a rejoined string with spaces between the preprocessed tokens.
    """
    def rejoin_text(token_list):
        return ' '.join(token_list)
    
    
    output = dataset.copy()
    
    if html_tags : 
        output[text_col_name] = output[text_col_name].apply(remove_html_tags)
        
    if special_chars :
        output[text_col_name] = output[text_col_name].apply(remove_special_characters)
        
    if lowercase :
        output[text_col_name] = output[text_col_name].apply(lowercase_text)
    
    #Tokenization for next steps:
    output[text_col_name] = output[text_col_name].apply(tokenize_words)
    
    if stopwords :
        output[text_col_name] = output[text_col_name].apply(remove_stopwords)
        
    if stemming :
        output[text_col_name] = output[text_col_name].apply(stem_words)
        
    if not list_output :
        output[text_col_name] = output[text_col_name].apply(rejoin_text)
        
    return output
        

In [13]:
imdb_clean_dataset = normalize_text_dataset(imdb_dataset_original, html_tags = True,
                           special_chars = True, lowercase = True , stemming = True , 
                           stopwords = True, list_output = False )

## Text classification with Bag-Of-Words

Now that we have cleaned the reviews of our dataset, how do we represent them as vectors in order to classify them?
One classic way to achieve that is the Bag-Of-Words (BOW) approach. To encode a text as a bag of words, we first need to know all the different words $w$ that appear in our reviews, called the vocabulary: $w \in \mathcal{V}$. To each word $w$ we assign an index $idx(w) = i$ with $i \in \{0, \dots, |\mathcal{V}|-1\}$, and we represent a review $r$ as a vector of the size of the vocabulary, $x_r \in \mathbb{R}^{|\mathcal{V}|}$. To encode a review we simply count how many times each word appears in it and store that count at the word's index in the bag-of-words vector: $x_{r,i} = count(w,r)$, where $i = idx(w)$.
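
As an illustration of the definition above, here is a minimal pure-Python sketch on a made-up two-review corpus. The names `reviews`, `vocabulary`, `idx` and `bag_of_words` are only illustrative, and the alphabetical ordering of the indexes is an arbitrary choice.

```python
from collections import Counter

# Toy corpus of already preprocessed reviews (hypothetical data)
reviews = ["good movie good plot", "bad movie"]

# Vocabulary: every distinct word, indexed in alphabetical order
vocabulary = sorted({word for review in reviews for word in review.split()})
idx = {word: i for i, word in enumerate(vocabulary)}


def bag_of_words(review):
    """Return the count vector x_r, with x_r[idx(w)] = count(w, r)."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(review.split()).items():
        vector[idx[word]] = count
    return vector


print(vocabulary)                # ['bad', 'good', 'movie', 'plot']
print(bag_of_words(reviews[0]))  # [0, 2, 1, 1]
print(bag_of_words(reviews[1]))  # [1, 0, 1, 0]
```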

This means that we completely disregard word order and only use the number of times each word appears in a review to represent it. There are many variations of this concept: TF-IDF (term frequency-inverse document frequency), for example, gives more weight to words that are rare across the corpus. Read more about BOW and TF-IDF here:

https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
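
To make the TF-IDF weighting concrete, here is a small sketch of the textbook formula $tf(w,r) \cdot \log(N / df(w))$ on a made-up corpus, where $N$ is the number of documents and $df(w)$ the number of documents containing $w$. This is an assumption about which variant to show: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each vector by default, so its numbers will differ. The names `docs`, `df` and `tf_idf` are illustrative.

```python
import math
from collections import Counter

# Toy corpus of tokenized documents (hypothetical data)
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each word
df = Counter(word for doc in docs for word in set(doc))


def tf_idf(doc):
    """Map each word of doc to (relative term frequency) * log(N / df)."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }


weights = tf_idf(docs[1])
# "bad" appears in a single document, "movie" in two: "bad" gets more weight
print(weights["bad"] > weights["movie"])  # True
```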

Let's start with bag-of-words. In general we don't consider the whole vocabulary but only the most frequent words, in order to reduce the dimensionality and avoid noise from rare words. Here we will only consider the 25000 most frequent words of the training set, meaning that words appearing only in the test set will be ignored. Thus we have: $x_r \in \mathbb{R}^{25000}$.
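
The effect of restricting the vocabulary to the most frequent training words can be sketched as follows, on made-up data with a vocabulary of size 2 instead of 25000. All names (`train_reviews_toy`, `test_review_toy`, `max_vocab`, `vocab`) are hypothetical.

```python
from collections import Counter

train_reviews_toy = ["good movie", "good plot", "good movie again"]
test_review_toy = "good unseen movie"
max_vocab = 2

# Count words on the training set only, then keep the most frequent ones
word_counts = Counter(w for review in train_reviews_toy for w in review.split())
vocab = [word for word, _ in word_counts.most_common(max_vocab)]

# Words outside the kept vocabulary ("unseen" here) simply do not contribute
test_counts = Counter(test_review_toy.split())
test_vector = [test_counts[word] for word in vocab]

print(vocab)        # ['good', 'movie']
print(test_vector)  # [1, 1]
```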

Encode all the reviews as bag-of-words, then train and evaluate a logistic regression model on the following train/test split. As we have seen previously, to investigate this model properly we should also grid search the hyperparameters using cross-validation with validation sets, etc. However this is not the goal today, so we'll simply use a single train/test split for this experiment. Concerning the evaluation metric: we care equally about correctly predicting the positives and the negatives, and the dataset is balanced, so we can simply use accuracy this time.

Once again, don't do everything from scratch: try to find libraries that provide implementations of these concepts!

In [14]:
max_vocab_size = 25000 

#Train/test split:

lb=LabelBinarizer()
sentiment_labels=lb.fit_transform(imdb_clean_dataset['sentiment'])

train_reviews = imdb_clean_dataset.review[:45000]
test_reviews = imdb_clean_dataset.review[45000:]

train_sentiments = sentiment_labels[:45000]
test_sentiments = sentiment_labels[45000:]

In [15]:
#TOFILL

vectorizer = CountVectorizer(max_features=max_vocab_size)
train_bow = vectorizer.fit_transform(train_reviews)
test_bow = vectorizer.transform(test_reviews)

In [38]:
train_bow

<45000x25000 sparse matrix of type '<class 'numpy.int64'>'
	with 3185401 stored elements in Compressed Sparse Row format>

In [47]:
test_bow

<5000x25000 sparse matrix of type '<class 'numpy.int64'>'
	with 356309 stored elements in Compressed Sparse Row format>

In [41]:
df_train_bow = pd.DataFrame(train_bow.toarray(), columns=vectorizer.get_feature_names())



In [49]:
seed = 123
log_reg = LogisticRegression(random_state=seed)
log_reg_fitted = log_reg.fit(train_bow, train_sentiments)

log_reg_fitted.score(test_bow, test_sentiments)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8714

You should get about 88% accuracy, pretty good for such a simple model. Now let's do the same with a tf-idf encoding:

In [51]:
#TOFILL
vectorizer2 = TfidfVectorizer(max_features=max_vocab_size)
train_bow2 = vectorizer2.fit_transform(train_reviews)
test_bow2 = vectorizer2.transform(test_reviews)

In [52]:
log_reg2 = LogisticRegression(random_state=seed)
log_reg2_fitted = log_reg2.fit(train_bow2, train_sentiments)
log_reg2_fitted.score(test_bow2, test_sentiments)


  y = column_or_1d(y, warn=True)


0.8834

And you should get about 90% accuracy this time. Other classic but more sophisticated features include N-grams, part-of-speech tagging and syntax trees. You can read more about them here:

https://www.analyticsvidhya.com/blog/2020/07/part-of-speechpos-tagging-dependency-parsing-and-constituency-parsing-in-nlp/

But we will stop here for the classic approaches and move on to deep learning methods.

...unless you are ahead of schedule, in which case learn about bags of N-grams by yourself and try them out:

In [70]:
#TOFILL (Optional)

# sentence = imdb_clean_dataset['review'][1]

# print(sentence)

# ngram_object = textblob.TextBlob(sentence)

# ngrams = ngram_object.ngrams(n=2)

# print(ngrams)

# ngrams_ = ngrams(train_bow2, 2)

vectorizer_grams = CountVectorizer(max_features=max_vocab_size, ngram_range=(1, 2))
train_grams = vectorizer_grams.fit_transform(train_reviews)
test_grams = vectorizer_grams.transform(test_reviews)


log_reg3 = LogisticRegression(random_state=seed)
log_reg3_fitted = log_reg3.fit(train_grams, train_sentiments)
log_reg3_fitted.score(test_grams, test_sentiments)


  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8714
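
To see what `ngram_range=(1, 2)` adds to the features, here is a small pure-Python sketch that lists the unigrams and bigrams of a made-up token list. The helper `word_ngrams` is hypothetical and only mimics the idea, not the exact tokenization of `CountVectorizer`.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["not", "a", "good", "movie"]
features = word_ngrams(tokens, 1) + word_ngrams(tokens, 2)
print(features)
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

Unlike single words, bigrams keep a little bit of word order, which can help with expressions whose meaning depends on neighbouring words.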

# Recurrent Neural Networks : Long Short Term Memory networks (LSTM)

We have already covered feed-forward neural networks during the computer vision and recommender system modules. For natural language processing, one popular type of deep learning architecture is the Recurrent Neural Network (RNN). RNNs differ from feed-forward networks in that some of their inner layers are recursively updated while iterating over the input sequence of words. We are going to use one specific RNN architecture called Long Short-Term Memory networks (LSTMs), which have been especially successful in various NLP tasks, including machine translation, question answering, ... and text classification, our case study in this module.
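
The recursive update can be sketched in a few lines of pure Python. This is a plain (vanilla) RNN step, $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$, not an LSTM cell, and the weights and inputs below are arbitrary made-up values chosen only to show that the final state depends on the order of the inputs.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrence step: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for j in range(len(h_prev)):
        pre = b[j]
        pre += sum(w_x[j][k] * x_t[k] for k in range(len(x_t)))
        pre += sum(w_h[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t


def encode_sequence(inputs, w_x, w_h, b):
    """Run the recurrence over the whole sequence and return the last state."""
    h = [0.0] * len(b)
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
    return h


# Arbitrary toy weights: 2-dimensional inputs, 2-dimensional hidden state
w_x = [[0.5, -0.3], [0.1, 0.8]]
w_h = [[0.2, 0.0], [-0.4, 0.6]]
b = [0.0, 0.1]
word_a, word_b = [1.0, 0.0], [0.0, 1.0]

print(encode_sequence([word_a, word_b], w_x, w_h, b))
print(encode_sequence([word_b, word_a], w_x, w_h, b))  # different: order matters
```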

Learn more about how RNNs and LSTMs encode texts as vectors first:

https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9

If you want to understand in depth how an LSTM cell works, you can go through these two articles:

https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

http://colah.github.io/posts/2015-08-Understanding-LSTMs/


## Preprocessing

When using sequential models such as LSTMs, stopwords, punctuation and word suffixes carry meaning, and thus matter much more than with BOW-based models. Hence with these models we will only remove HTML tags and lowercase the text, and keep everything else:

In [17]:
imdb_deep_clean_dataset = normalize_text_dataset(imdb_dataset_original, html_tags = True,
                           special_chars = False, lowercase = True, stemming = False, 
                           stopwords = False, list_output = False )

In [None]:
imdb_deep_clean_dataset.head(10)

Keeping a validation set for early stopping is a good habit when training deep models, so let's resplit the dataset, and save the splits:

In [18]:
train_deep_clean = imdb_deep_clean_dataset.iloc[:40000]
valid_deep_clean = imdb_deep_clean_dataset.iloc[40000:45000]
test_deep_clean = imdb_deep_clean_dataset.iloc[45000:]

In [19]:
outdir = '../../data/imdb_clean/'
# Create the output directory (and any missing parent) if it does not exist yet
os.makedirs(outdir, exist_ok=True)

In [20]:
train_deep_clean.to_csv(outdir + 'train.csv', index = False)
valid_deep_clean.to_csv(outdir + 'valid.csv', index = False)
test_deep_clean.to_csv(outdir + 'test.csv', index = False)

This way we can restart the code from here in case of a crash:

In [21]:
outdir = '../../data/imdb_clean/'

In [22]:
train_deep_clean = pd.read_csv(outdir + 'train.csv')
valid_deep_clean = pd.read_csv(outdir + 'valid.csv')
test_deep_clean = pd.read_csv(outdir + 'test.csv')

## Implementing LSTMs with Keras

In [10]:
#Some lines that allow for faster training with this version of tensorflow for these models
import tensorflow as tf
print(tf.__version__)

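physical_devices = tf.config.list_physical_devices('GPU')
# Enable memory growth on every detected GPU (does nothing on a CPU-only machine)
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, enable=True)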

2022-10-12 09:21:11.793123: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1


2.4.1


2022-10-12 09:21:31.551886: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-10-12 09:21:31.553951: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-10-12 09:21:31.623756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:61:00.0 name: NVIDIA GeForce GTX 1660 computeCapability: 7.5
coreClock: 1.785GHz coreCount: 22 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 178.86GiB/s
2022-10-12 09:21:31.623855: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-10-12 09:21:31.805370: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-10-12 09:21:31.805606: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.1

In [11]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import *
from keras.utils.np_utils import to_categorical
from keras.initializers import Constant
from keras.callbacks import EarlyStopping

We will use Keras to implement an LSTM network. First we need to encode each review as a list of indexes, where each word is replaced by its index in the vocabulary (which is also its row in the embedding matrix), using the Keras `Tokenizer`. To make all reviews the same size, we pad them to the length of the longest review with the function `pad_sequences`, which by default adds the padding index 0 at the beginning of the shorter sequences as many times as necessary. Note that the LSTM layer only skips these padded positions if the embedding layer is created with `mask_zero=True`; otherwise index 0 is processed like any other token.

In [23]:
max_vocab_size = 25000 

tokenizer = Tokenizer(num_words=max_vocab_size, split=' ', oov_token='<unw>', filters=' ')
tokenizer.fit_on_texts(pd.concat([train_deep_clean,valid_deep_clean]).review)

# This encodes each review as a sequence of integers,
# each integer being the index of a word in the vocabulary
train_seqs = tokenizer.texts_to_sequences(train_deep_clean.review)
valid_seqs = tokenizer.texts_to_sequences(valid_deep_clean.review)
test_seqs = tokenizer.texts_to_sequences(test_deep_clean.review)

# We need to pad the sequences so that they are all the same length :
# the length of the longest one
max_seq_length = max( [len(seq) for seq in train_seqs + valid_seqs] )

X_train = pad_sequences(train_seqs, max_seq_length)
X_valid = pad_sequences(valid_seqs, max_seq_length)
X_test = pad_sequences(test_seqs, max_seq_length)

y_train = pd.get_dummies(train_deep_clean.sentiment).values[:,1]
y_valid = pd.get_dummies(valid_deep_clean.sentiment).values[:,1]
y_test = pd.get_dummies(test_deep_clean.sentiment).values[:,1]
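
The encoding and padding steps above can be mimicked in pure Python to see what the resulting arrays contain. This is a simplified sketch, not the Keras implementation: the helpers `build_word_index`, `encode` and `pad_left` are hypothetical, and the toy texts are made up. It follows the same conventions as the Keras defaults used here: index 0 for padding, index 1 for out-of-vocabulary words, and padding or truncating at the beginning of the sequence.

```python
from collections import Counter

PAD_INDEX = 0  # reserved for padding
OOV_INDEX = 1  # shared by all out-of-vocabulary words


def build_word_index(texts, max_words):
    """Give the smallest indexes (starting at 2) to the most frequent words."""
    counts = Counter(w for text in texts for w in text.split())
    most_common = counts.most_common(max_words)
    return {word: i for i, (word, _) in enumerate(most_common, start=2)}


def encode(text, word_index):
    """Replace each word by its index, or by OOV_INDEX if it is unknown."""
    return [word_index.get(w, OOV_INDEX) for w in text.split()]


def pad_left(seq, length):
    """Keep the last `length` items and left-pad with PAD_INDEX if too short."""
    seq = seq[-length:]
    return [PAD_INDEX] * (length - len(seq)) + seq


toy_texts = ["good movie", "good movie plot", "good"]
toy_index = build_word_index(toy_texts, max_words=2)
toy_seq = encode("good plot movie", toy_index)

print(toy_index)             # {'good': 2, 'movie': 3}
print(toy_seq)               # [2, 1, 3]
print(pad_left(toy_seq, 5))  # [0, 0, 2, 1, 3]
```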


Now fill in the following function to implement a simple LSTM model: one embedding layer, one LSTM layer, and a final dense layer that outputs a single score with a sigmoid activation function. Use Keras' Sequential API.

In [13]:
def get_lstm_model(vocab_size, embedding_dim, seq_length, lstm_out_dim):
    #TOFILL
        
    model = Sequential()
    
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))
    
    model.add(LSTM(units=lstm_out_dim))
    
    model.add(Dense(1, activation="sigmoid"))
    
    #TOKEEP
    model.compile(loss = 'binary_crossentropy', optimizer='SGD',metrics = ['accuracy'])
    return model

In [33]:
embedding_dim = 100
lstm_out_dim = 200  #Bigger than embedding dim, as it combines all the words of each review

model = get_lstm_model(max_vocab_size, embedding_dim, max_seq_length, lstm_out_dim)
print(model.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 2730, 100)         2500000   
_________________________________________________________________
lstm_3 (LSTM)                (None, 200)               240800    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 201       
Total params: 2,741,001
Trainable params: 2,741,001
Non-trainable params: 0
_________________________________________________________________
None
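
The parameter counts in this summary can be checked by hand. The sketch below uses the usual formulas for an embedding table, an LSTM layer with biases (four gates, each with input weights, recurrent weights and a bias vector) and a dense layer; the helper names are only illustrative.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per word of the vocabulary
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


n_embedding = embedding_params(25000, 100)
n_lstm = lstm_params(100, 200)
n_dense = dense_params(200, 1)

print(n_embedding, n_lstm, n_dense)    # 2500000 240800 201
print(n_embedding + n_lstm + n_dense)  # 2741001
```

Most of the parameters sit in the embedding layer, which is one reason why initializing it with pre-trained embeddings is attractive.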


In [34]:
batch_size = 64
max_epochs = 2
history = model.fit(X_train, y_train, epochs=max_epochs, batch_size=batch_size, 
                    verbose=1, validation_data = (X_valid, y_valid))

2022-10-11 13:41:01.707860: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-10-11 13:41:01.771292: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1700000000 Hz


Epoch 1/2


2022-10-11 13:41:04.215942: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-10-11 13:41:05.808276: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7


Epoch 2/2


In [35]:
test_acc = model.evaluate(X_test, y_test, verbose=0) 
print("Test accuracy: %.2f%%" % (test_acc[1]*100))

Test accuracy: 49.54%


Pretty low accuracy, isn't it? It is actually very easy to train a deep neural net incorrectly. Replace the "SGD" optimizer with "adam", add a dropout layer after the LSTM layer for regularization, and use early stopping:

In [12]:
#dropout, early stopping, adam
def get_lstm_model_2(vocab_size, embedding_dim, seq_length, lstm_out_dim, dropout_rate):
    #TOFILL
            
    model = Sequential()
    
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))
    
    model.add(LSTM(units=lstm_out_dim))

    model.add(Dropout(rate=dropout_rate))
    
    model.add(Dense(1, activation="sigmoid"))
    
    #TOKEEP
    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model

In [38]:
embedding_dim = 100
lstm_out_dim = 200
dropout_rate = 0.2

model = get_lstm_model_2(max_vocab_size, embedding_dim, max_seq_length, lstm_out_dim, dropout_rate)
print(model.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 2730, 100)         2500000   
_________________________________________________________________
lstm_4 (LSTM)                (None, 200)               240800    
_________________________________________________________________
dropout (Dropout)            (None, 200)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 201       
Total params: 2,741,001
Trainable params: 2,741,001
Non-trainable params: 0
_________________________________________________________________
None


In [30]:
#TOFILL
early_stopping = EarlyStopping(monitor="val_accuracy", patience=2, verbose=1, restore_best_weights=True)
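
To clarify what `patience` and `restore_best_weights` mean, here is a small simulation of the early stopping rule on a made-up list of validation accuracies. It is a sketch of the logic only (the function `simulate_early_stopping` is hypothetical and does not train anything): training stops once the monitored value has not improved for `patience` consecutive epochs, and the weights of the best epoch are the ones kept.

```python
def simulate_early_stopping(val_scores, patience):
    """Return (best_epoch, last_epoch_run), both counted from 1."""
    best_score = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # Improvement: remember this epoch and reset the counter
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_scores)


# Hypothetical validation accuracies, one per epoch
val_accuracy = [0.85, 0.88, 0.89, 0.885, 0.88, 0.87]
print(simulate_early_stopping(val_accuracy, patience=2))  # (3, 5)
```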

In [43]:
batch_size = 64
max_epochs = 5
history = model.fit(X_train, y_train, epochs=max_epochs, batch_size=batch_size, 
                    verbose=1, validation_data = (X_valid, y_valid), callbacks=[early_stopping])


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Restoring model weights from the end of the best epoch.
Epoch 00005: early stopping


In [44]:
test_acc = model.evaluate(X_test, y_test, verbose=0) 
print("Test accuracy: %.2f%%" % (test_acc[1]*100))

Test accuracy: 89.38%


Much better. Training for longer might give slightly better results, but the model is already quite slow to train for a small improvement over our classic methods. We could also grid search all the hyperparameters (embedding and layer sizes, dropout rate, ...), but that's not the goal today. Remember however that grid search is standard practice when optimizing a model's predictive performance.

## Predict sentiment for arbitrary sentences

Now you can try to predict the sentiment of any English sentence; try your own. You first need to encode each review as a sequence of indexes (called tokens in Keras), pad these sequences, and finally predict the score with your trained model:


In [67]:
good = "i really liked the movie and had fun"
bad = "worst movie on the planet, so boring"
for review in [good,bad]:
    
    #TOFILL
    token = tokenizer.texts_to_sequences([review])
    padded = pad_sequences(token, max_seq_length)
    pred = model.predict(padded)
    print(pred)
    

    

    

[[0.84054464]]
[[0.00472129]]
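
The model outputs a score between 0 and 1, where 1 corresponds to the `positive` class (the column kept from `pd.get_dummies`). To turn a score into a label, a common choice is a threshold of 0.5; this threshold and the helper `score_to_label` are assumptions, not something fixed by the model.

```python
def score_to_label(score, threshold=0.5):
    """Map a sigmoid score to a sentiment label (threshold is an assumption)."""
    return "positive" if score >= threshold else "negative"


# The two scores printed above
for score in (0.84054464, 0.00472129):
    print(score, score_to_label(score))
```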


## Initialize embeddings with pre-trained word embeddings

Training LSTMs is a bit heavy; one way to speed it up is to reuse pre-trained word embeddings. Many such embeddings are available online. Read this to understand how word embeddings are produced and why they encode information that helps with all NLP tasks:

http://jalammar.github.io/illustrated-word2vec/

We are going to use GloVe embeddings. Download and load the embeddings trained on a corpus of 6 billion tokens (the `glove.6B` files) from: https://nlp.stanford.edu/projects/glove/

In [24]:
embeddings_index = {}
# Each line holds a word followed by the components of its vector
with open('../../data/glove.6B/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


In [25]:
word_index = tokenizer.word_index
print('%s unique words in vocabulary' % len(word_index))

189566 unique words in vocabulary


Given our word index, check for each of our 25000 most frequent words whether it exists in the pretrained GloVe embeddings and, if so, copy its vector to the corresponding row of the embedding matrix. If a word doesn't exist in the GloVe embeddings, assign it a random vector:

In [33]:
embedding_dim = 100

# Allocate the embeddings matrix
embedding_matrix = np.zeros((max_vocab_size, embedding_dim))


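for word, i in word_index.items():
    #TOFILL
    # The matrix only has rows for the max_vocab_size most frequent words
    # (word_index is ordered by decreasing frequency, so we can stop here)
    if i >= max_vocab_size:
        break
    elif word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]
    else:
        embedding_matrix[i] = np.random.normal(size=embedding_dim)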

In [34]:
embedding_matrix

array([[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [-1.11385515e+00,  9.65080264e-01, -9.51645321e-02, ...,
        -4.22016004e-01, -1.69327129e+00, -5.54072097e-01],
       [-3.81940007e-02, -2.44870007e-01,  7.28120029e-01, ...,
        -1.45899996e-01,  8.27799976e-01,  2.70619988e-01],
       ...,
       [ 2.97340006e-01,  2.59339988e-01, -8.60790014e-01, ...,
         1.49480000e-01, -1.29659998e-03,  1.55450001e-01],
       [ 1.27910003e-01, -5.11210024e-01, -7.17790008e-01, ...,
        -3.77020001e-01,  5.42959988e-01,  5.50859988e-01],
       [ 2.97219992e-01,  5.10179996e-02, -1.02019997e-03, ...,
         4.93129998e-01,  2.86110014e-01,  4.56099987e-01]])
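Before training, it can be useful to check how many of our words actually received a pretrained vector rather than a random one. The following is a minimal sketch of such a check; the function name and the toy dictionaries are made up for the example, and in the notebook you would pass `word_index`, `embeddings_index` and `max_vocab_size` instead.

```python
def embedding_coverage(word_index, embeddings_index, max_vocab_size):
    """Return (fraction of kept words found in the embeddings, missing words)."""
    # Only words whose index fits in the embedding matrix are kept.
    kept = [w for w, i in word_index.items() if i < max_vocab_size]
    missing = [w for w in kept if w not in embeddings_index]
    covered = 1.0 - len(missing) / len(kept) if kept else 0.0
    return covered, missing

# Toy data: "awsome" is a typo that has no pretrained vector,
# and "rare" has index 4, so it falls outside a matrix of 4 rows.
toy_word_index = {"the": 1, "movie": 2, "awsome": 3, "rare": 4}
toy_embeddings = {"the": [0.1], "movie": [0.2], "rare": [0.3]}

coverage, missing = embedding_coverage(toy_word_index, toy_embeddings, max_vocab_size=4)
print("coverage: %.2f, missing: %s" % (coverage, missing))
```

Looking at the missing words is often informative: they tend to be typos, rare names or tokens left over from imperfect cleaning.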

Now change your LSTM model so that the embedding layer is initialized with the pretrained embeddings :

In [57]:
def get_lstm_model_pretrained_embs(vocab_size, embedding_dim, seq_length, 
                                   lstm_out_dim, dropout_rate, embedding_matrix):
    #TOFILL
            
    model = Sequential()
    
    # model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))

    model.add(Embedding(input_dim=vocab_size,
                        output_dim=embedding_dim,
                        weights=[embedding_matrix],
                        input_length=seq_length))
    
    model.add(LSTM(units=lstm_out_dim))
    model.add(Dropout(rate=dropout_rate))

    
    model.add(Dense(1, activation="sigmoid"))
    
    #TOKEEP
    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model


In [58]:
embedding_dim = 100
lstm_out_dim = 200
dropout_rate = 0.2

model = get_lstm_model_pretrained_embs(max_vocab_size, embedding_dim, max_seq_length, 
                                       lstm_out_dim, dropout_rate, embedding_matrix)
print(model.summary())

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 2730, 100)         2500000   
_________________________________________________________________
lstm_5 (LSTM)                (None, 200)               240800    
_________________________________________________________________
dropout_2 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 201       
Total params: 2,741,001
Trainable params: 2,741,001
Non-trainable params: 0
_________________________________________________________________
None


In [59]:
batch_size = 64
max_epochs = 5
history = model.fit(X_train, y_train, epochs=max_epochs, batch_size=batch_size, 
                    verbose=1, validation_data = (X_valid, y_valid), callbacks=[early_stopping])


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [61]:
test_acc = model.evaluate(X_test, y_test, verbose=0) 
print("Test accuracy: %.2f%%" % (test_acc[1]*100))

Test accuracy: 90.04%


In [62]:
def get_lstm_model_pretrained_embs2(vocab_size, embedding_dim, seq_length, 
                                   lstm_out_dim, dropout_rate, embedding_matrix):
    #TOFILL
            
    model = Sequential()
    
    # model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))

    model.add(Embedding(input_dim=vocab_size,
                        output_dim=embedding_dim,
                        embeddings_initializer=Constant(embedding_matrix),
                        input_length=seq_length))
    
    model.add(LSTM(units=lstm_out_dim))
    model.add(Dropout(rate=dropout_rate))

    
    model.add(Dense(1, activation="sigmoid"))
    
    #TOKEEP
    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model

In [63]:
model2 = get_lstm_model_pretrained_embs2(max_vocab_size, embedding_dim, max_seq_length, 
                                       lstm_out_dim, dropout_rate, embedding_matrix)
print(model2.summary())

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 2730, 100)         2500000   
_________________________________________________________________
lstm_6 (LSTM)                (None, 200)               240800    
_________________________________________________________________
dropout_3 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 201       
Total params: 2,741,001
Trainable params: 2,741,001
Non-trainable params: 0
_________________________________________________________________
None


In [64]:
history2 = model2.fit(X_train, y_train, epochs=max_epochs, batch_size=batch_size, 
                    verbose=1, validation_data = (X_valid, y_valid), callbacks=[early_stopping])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Restoring model weights from the end of the best epoch.
Epoch 00005: early stopping


In [65]:
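test_acc2 = model2.evaluate(X_test, y_test, verbose=0) 
# Report model2's own score; the output recorded below was printed from test_acc (the previous model)
print("Test accuracy: %.2f%%" % (test_acc2[1]*100))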

Test accuracy: 90.04%


We can see that the validation accuracy indeed progressed much faster than previously.

For more speed-up, at the price of accuracy, let's freeze the embeddings so that they are not trainable parameters of the model, meaning they won't be updated during training:

In [66]:
def get_lstm_model_pretrained_embs(vocab_size, embedding_dim, seq_length, 
                                   lstm_out_dim, dropout_rate, embedding_matrix,
                                    trainable_embeddings):
    #TOFILL
    model = Sequential()
    
    # model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))

    model.add(Embedding(input_dim=vocab_size,
                        output_dim=embedding_dim,
                        embeddings_initializer=Constant(embedding_matrix),
                        input_length=seq_length,
                        trainable=trainable_embeddings))
    
    model.add(LSTM(units=lstm_out_dim))
    model.add(Dropout(rate=dropout_rate))

    
    model.add(Dense(1, activation="sigmoid"))
    
    #TOKEEP
    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model


In [67]:
embedding_dim = 100
lstm_out_dim = 200
dropout_rate = 0.2
trainable_embeddings = False

model = get_lstm_model_pretrained_embs(max_vocab_size, embedding_dim, max_seq_length, 
                                       lstm_out_dim, dropout_rate, embedding_matrix, trainable_embeddings)
print(model.summary())

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 2730, 100)         2500000   
_________________________________________________________________
lstm_7 (LSTM)                (None, 200)               240800    
_________________________________________________________________
dropout_4 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 201       
Total params: 2,741,001
Trainable params: 241,001
Non-trainable params: 2,500,000
_________________________________________________________________
None


Notice the change in the number of trainable parameters in the summary.
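The numbers in the summary can be checked by hand. Here is a small sketch, using the standard parameter-count formulas for embedding, LSTM and dense layers, with the sizes used in this notebook (25000 words, 100-dimensional embeddings, 200 LSTM units); the helper names are made up for the example.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of size embedding_dim per word of the vocabulary
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 blocks (3 gates + candidate state), each with input weights,
    # recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weights plus one bias per output unit
    return input_dim * units + units

emb = embedding_params(25000, 100)    # 2,500,000
lstm = lstm_params(100, 200)          # 240,800
dense = dense_params(200, 1)          # 201

print("total:", emb + lstm + dense)                      # 2741001
print("trainable with frozen embeddings:", lstm + dense)  # 241001
```

Freezing the embeddings removes the 2,500,000 embedding weights from the trainable parameters, which leaves only the LSTM and dense layers to update.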

In [68]:
batch_size = 64
max_epochs = 5
history = model.fit(X_train, y_train, epochs=max_epochs, batch_size=batch_size, 
                    verbose=1, validation_data = (X_valid, y_valid), callbacks=[early_stopping])


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [69]:
test_acc = model.evaluate(X_test, y_test, verbose=0) 
print("Test accuracy: %.2f%%" % (test_acc[1]*100))

Test accuracy: 89.26%


By freezing the word embeddings, the training time shrank a bit, but the validation accuracy progresses more slowly and reaches a plateau. Depending on the network architecture, the trade-off can be interesting; here not so much, but it is good to know the option exists.

# Optional parts

## Bidirectional and stacked LSTMs

LSTMs parse the text from left to right, but also parsing it from right to left and concatenating the two output vectors improves the results. These are called bidirectional LSTMs. It is also possible to stack multiple LSTM layers.

This image is a good illustration of how these two variants work:

https://www.researchgate.net/figure/Illustrations-for-basic-LSTMs-and-the-three-layer-stacked-LSTM-model-for-the-sequential_fig3_313115860
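To see what "bidirectional" means concretely, here is a minimal NumPy sketch. For brevity it uses a plain tanh recurrent cell as a stand-in for the LSTM cell (an assumption of this example, the principle is the same): one pass reads the sequence left to right, another reads it right to left with its own weights, and the two final states are concatenated. All names and sizes are made up for the illustration.

```python
import numpy as np

def run_rnn(inputs, w_x, w_h):
    """Run a plain tanh recurrent cell over `inputs`, return the final state."""
    h = np.zeros(w_h.shape[0])
    for x_t in inputs:
        h = np.tanh(w_x @ x_t + w_h @ h)
    return h

def bidirectional_encode(inputs, fwd_weights, bwd_weights):
    """Concatenate the final states of a left-to-right and a right-to-left pass."""
    h_fwd = run_rnn(inputs, *fwd_weights)
    h_bwd = run_rnn(inputs[::-1], *bwd_weights)   # same text, read in reverse
    return np.concatenate([h_fwd, h_bwd])

rng = np.random.default_rng(0)
seq_length, embedding_dim, units = 5, 4, 3
inputs = rng.normal(size=(seq_length, embedding_dim))
# Each direction has its own, independent set of weights
fwd = (rng.normal(size=(units, embedding_dim)), rng.normal(size=(units, units)))
bwd = (rng.normal(size=(units, embedding_dim)), rng.normal(size=(units, units)))

encoded = bidirectional_encode(inputs, fwd, bwd)
print(encoded.shape)   # (6,): twice the size of a single direction
```

This is why the output of a bidirectional layer has twice as many dimensions as `lstm_out_dim`, and roughly twice as many parameters as a single LSTM.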


First modify your network to make a bidirectional LSTM :

In [71]:
def get_bilstm_model_pretrained_embs(vocab_size, embedding_dim, seq_length, 
                                     lstm_out_dim, dropout_rate, embedding_matrix,
                                    trainable_embeddings):
    #TOFILL

    model = Sequential()
    
    model.add(Embedding(input_dim=vocab_size,
                        output_dim=embedding_dim,
                        embeddings_initializer=Constant(embedding_matrix),
                        input_length=seq_length,
                        trainable=trainable_embeddings))
    
    model.add(Bidirectional(LSTM(units=lstm_out_dim)))
    model.add(Dropout(rate=dropout_rate))

    
    model.add(Dense(1, activation="sigmoid"))
    
    #TOKEEP
    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model

In [72]:
embedding_dim = 100
lstm_out_dim = 200
dropout_rate = 0.2

model = get_bilstm_model_pretrained_embs(max_vocab_size, embedding_dim, max_seq_length, 
                                       lstm_out_dim, dropout_rate, embedding_matrix, True)

print(model.summary())

Model: "sequential_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 2730, 100)         2500000   
_________________________________________________________________
bidirectional (Bidirectional (None, 400)               481600    
_________________________________________________________________
dropout_5 (Dropout)          (None, 400)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 401       
Total params: 2,982,001
Trainable params: 2,982,001
Non-trainable params: 0
_________________________________________________________________
None


In [73]:
batch_size = 64
max_epochs = 5
history = model.fit(X_train, y_train, epochs=max_epochs, batch_size=batch_size, 
                    verbose=1, validation_data = (X_valid, y_valid), callbacks=[early_stopping])


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Restoring model weights from the end of the best epoch.
Epoch 00005: early stopping


In [74]:
test_acc = model.evaluate(X_test, y_test, verbose=0) 
print("Test accuracy: %.2f%%" % (test_acc[1]*100))

Test accuracy: 90.90%



Now try stacking multiple bidirectional LSTM layers, where the number of layers `n_layers` is a parameter of the function building the model :

In [85]:
def get_multilayer_bilstm_model_pretrained_embs(vocab_size, embedding_dim, seq_length, 
                                                lstm_out_dim, dropout_rate, embedding_matrix,
                                                trainable_embeddings, n_layers):
    #TOFILL
    model = Sequential()
    
    model.add(Embedding(input_dim=vocab_size,
                        output_dim=embedding_dim,
                        embeddings_initializer=Constant(embedding_matrix),
                        input_length=seq_length,
                        trainable=trainable_embeddings))
    
    
    for layer in range(n_layers-1): 
        model.add(Bidirectional(LSTM(units=lstm_out_dim, return_sequences=True)))
    
    model.add(Bidirectional(LSTM(units=lstm_out_dim)))


    model.add(Dropout(rate=dropout_rate))

    
    model.add(Dense(1, activation="sigmoid"))
    
    #TOKEEP
    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model

In [82]:
embedding_dim = 100
lstm_out_dim = 100
dropout_rate = 0.2
n_layers = 2

model = get_multilayer_bilstm_model_pretrained_embs(max_vocab_size, embedding_dim, max_seq_length, 
                                       lstm_out_dim, dropout_rate, embedding_matrix, True, n_layers)

print(model.summary())

Model: "sequential_14"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 2730, 100)         2500000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 2730, 200)         160800    
_________________________________________________________________
bidirectional_2 (Bidirection (None, 2730, 200)         240800    
_________________________________________________________________
bidirectional_3 (Bidirection (None, 200)               240800    
_________________________________________________________________
dropout_6 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 201       
Total params: 3,142,601
Trainable params: 3,142,601
Non-trainable params: 0
___________________________________________
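The parameter counts of the bidirectional layers in the summary above can be checked with a short calculation. This sketch assumes the standard LSTM parameter-count formula; the helper names are made up for the example. The point to notice is that a bidirectional layer doubles the parameters of a single LSTM, and that deeper layers receive the concatenated output of the previous layer, so their input size is twice `lstm_out_dim`.

```python
def lstm_params(input_dim, units):
    # 4 blocks (3 gates + candidate state), each with input weights,
    # recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def bilstm_params(input_dim, units):
    # One LSTM per reading direction, each with its own weights
    return 2 * lstm_params(input_dim, units)

# Stacked model: embeddings of size 100, 100 units per direction
first_layer = bilstm_params(input_dim=100, units=100)
deeper_layer = bilstm_params(input_dim=2 * 100, units=100)  # input = both directions
print(first_layer, deeper_layer)    # 160800 240800

# Single bidirectional layer with 200 units per direction
print(bilstm_params(input_dim=100, units=200))   # 481600
```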

In [83]:
batch_size = 32
max_epochs = 5
history = model.fit(X_train, y_train, epochs=max_epochs, batch_size=batch_size, 
                    verbose=1, validation_data = (X_valid, y_valid), callbacks=[early_stopping])


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [84]:
test_acc = model.evaluate(X_test, y_test, verbose=0) 
print("Test accuracy: %.2f%%" % (test_acc[1]*100))

Test accuracy: 90.14%


As you can see, the maximum accuracy reached is not much better than that of our TF-IDF model. This happens because the full word order is actually not so important for sentiment analysis. For this task, Convolutional Neural Networks can attain comparable performance faster, as they have simpler architectures. But that's not true for other tasks such as translation, question answering, ... (tasks that take a bit too long to train to be included in this course, hence the choice of sentiment analysis to practice RNNs).

Let's try a convolutional model using a 1D convolution with a kernel size of 3 over the word embeddings (meaning it convolves the embeddings of consecutive words three at a time), followed by a 1D max pooling and a dense ReLU layer before the final sigmoid:

In [99]:
def get_conv_model_pretrained_embs(vocab_size, embedding_dim, seq_length, 
                                                out_dim, dropout_rate, embedding_matrix,
                                                trainable_embeddings):
    #TOFILL
    
    model = Sequential()
    
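    model.add(Embedding(input_dim=vocab_size,
                        output_dim=embedding_dim,
                        embeddings_initializer=Constant(embedding_matrix),
                        input_length=seq_length,
                        trainable=trainable_embeddings))
    
    # out_dim sets the number of convolution filters (200 in the call below)
    model.add(Conv1D(kernel_size=3, filters=out_dim, padding = "same"))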

    model.add(MaxPool1D(pool_size=2))
    model.add(Flatten())

    model.add(Dense(32, activation='relu'))

    model.add(Dropout(rate=dropout_rate))
    
    model.add(Dense(1, activation="sigmoid"))
    
    #TOKEEP

    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model


In [100]:
embedding_dim = 100
out_dim = 200
dropout_rate = 0.2

model = get_conv_model_pretrained_embs(max_vocab_size, embedding_dim, max_seq_length, 
                                       out_dim, dropout_rate, embedding_matrix, True)

print(model.summary())

Model: "sequential_20"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_20 (Embedding)     (None, 2730, 100)         2500000   
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 2730, 200)         60200     
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 1365, 200)         0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 273000)            0         
_________________________________________________________________
dense_16 (Dense)             (None, 32)                8736032   
_________________________________________________________________
dropout_9 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_17 (Dense)             (None, 1)               

In [101]:
batch_size = 64
max_epochs = 5
history = model.fit(X_train, y_train, epochs=max_epochs, batch_size=batch_size, 
                    verbose=1, validation_data = (X_valid, y_valid), callbacks=[early_stopping])


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Restoring model weights from the end of the best epoch.
Epoch 00005: early stopping


In [102]:
test_acc = model.evaluate(X_test, y_test, verbose=0) 
print("Test accuracy: %.2f%%" % (test_acc[1]*100))

Test accuracy: 89.04%
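To make the convolution step more concrete, here is a minimal NumPy sketch of what a 1D convolution with a kernel size of 3 and "same" padding computes over a sequence of word embeddings, followed by a max pooling with a pool size of 2. No activation is applied after the convolution, as in the model above. The function names and the toy sizes are made up for the illustration; this is a sketch of the computation, not how Keras implements it.

```python
import numpy as np

def conv1d_same(x, kernels, bias):
    """x: (seq_length, embedding_dim); kernels: (kernel_size, embedding_dim, filters)."""
    kernel_size = kernels.shape[0]
    pad = kernel_size // 2
    # Zero-pad both ends so the output has as many positions as the input
    x_padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], kernels.shape[2]))
    for t in range(x.shape[0]):
        window = x_padded[t:t + kernel_size]   # embeddings of 3 consecutive words
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return out

def max_pool1d(x, pool_size=2):
    """Keep the max of each block of `pool_size` consecutive positions."""
    n = x.shape[0] // pool_size
    return x[:n * pool_size].reshape(n, pool_size, -1).max(axis=1)

rng = np.random.default_rng(0)
seq_length, embedding_dim, filters = 6, 4, 5
x = rng.normal(size=(seq_length, embedding_dim))
kernels = rng.normal(size=(3, embedding_dim, filters))
bias = np.zeros(filters)

conv = conv1d_same(x, kernels, bias)
pooled = max_pool1d(conv)
print(conv.shape, pooled.shape)   # (6, 5) (3, 5)

# Parameter count with the notebook's sizes: kernel 3, embeddings 100, 200 filters
print(3 * 100 * 200 + 200)        # 60200
```

The last line matches the parameter count of the `Conv1D` layer in the summary: the convolution itself is cheap, and most of this model's parameters sit in the dense layer that follows the flattened feature map.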


# Very optional parts

The following parts are meant to be resources to explore if you are interested in the advanced concept of attention in deep nets. There are links to explanations, as well as links to code for each of them, but don't feel obliged to implement all of them: they are meant to help you understand each of these concepts.

## Attention

Attention is a mechanism that changes the output of an LSTM: instead of outputting the final hidden state vector $h_n$, where $n$ is the length of the encoded text, attention plugs on top of an LSTM and returns a combination of the hidden state vectors at every word position, $\sum_{t=1}^n \alpha_t h_t$ (where $\alpha_t \in (0,1)$). It thus allows the model to pay a different amount of attention to each part of the text, hence the name.

It was originally proposed for sequence-to-sequence models, like translation models, where a different attention combination is computed for each translated output word. It is thus less useful for text classification, but it can be adapted by computing a single output combination of all the hidden states, as explained in Section 3.3 of the following article:

https://www.aclweb.org/anthology/P16-2034.pdf
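The weighted combination $\sum_{t=1}^n \alpha_t h_t$ can be sketched in a few lines of NumPy. In this example the weights $\alpha_t$ are obtained by scoring each hidden state with a vector `w` (which would be learned in a real model) and applying a softmax; this scoring function is one common choice and is an assumption of the sketch, as are all the names and sizes.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

def attention_pool(hidden_states, w):
    """hidden_states: (n, d) matrix of h_1..h_n; w: (d,) scoring vector."""
    scores = np.tanh(hidden_states) @ w       # one scalar score per word position
    alphas = softmax(scores)                  # weights in (0, 1) that sum to 1
    return alphas @ hidden_states, alphas     # sum_t alpha_t * h_t

rng = np.random.default_rng(0)
n_words, hidden_dim = 5, 4
hidden_states = rng.normal(size=(n_words, hidden_dim))
w = rng.normal(size=hidden_dim)

pooled, alphas = attention_pool(hidden_states, w)
print(pooled.shape, alphas.sum())
```

The pooled vector has the same size as a single hidden state, so it can replace the final hidden state $h_n$ as the input of the dense classification layer.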

Here is a link about how to apply attention for text classification with Keras:

https://www.kaggle.com/yshubham/simple-lstm-for-text-classification-with-attention


You can also read the following link to understand how attention works in sequence-to-sequence models, which are nothing more than a reversed LSTM (the decoder) on top of a first LSTM (the encoder), in this case for translation, where it helps align words in two different languages:

https://towardsdatascience.com/day-1-2-attention-seq2seq-models-65df3f49e263

## Transformer architecture for text classification

State-of-the-art models in NLP are no longer RNNs but Transformers. Transformers do not read text sequentially like RNNs; their core concept is self-attention, an attention mechanism that combines each word embedding separately with the other word embeddings of the text. A layer contains multiple such attention mechanisms, called "attention heads", and multiple such layers are stacked.
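As a rough illustration, a single attention head can be sketched in NumPy as scaled dot-product self-attention: each word embedding is projected into a query, a key and a value, and each output is a weighted average of all the values. The projection matrices would be learned in a real model; here they are random, and all names and sizes are made up for the example.

```python
import numpy as np

def softmax_rows(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention. x: (n_words, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # weights[i, j]: how much word i attends to word j
    weights = softmax_rows(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
n_words, d_model, d_head = 5, 8, 4
x = rng.normal(size=(n_words, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)   # (5, 4) (5, 5)
```

Unlike an RNN, nothing here is computed word after word: all positions are processed at once with matrix products, which is one reason Transformers train efficiently on modern hardware.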

Read this article to understand the self-attention layer:

https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a

This article explains the Transformer for sequence-to-sequence models very well (again, remember that a text classification model is just the encoder part of a sequence-to-sequence model):

http://jalammar.github.io/illustrated-transformer/

Keras code to do text classification with a Transformer :

https://keras.io/examples/nlp/text_classification_with_transformer/



## BERT

Current state-of-the-art performance for text classification is achieved by doing transfer learning from the BERT model. The BERT model combines different techniques, including the Transformer, to pretrain in an unsupervised fashion on plain text. The last layers of BERT provide a high-level contextual representation of English sentences, and can then be reused in any deep NLP model.

The BERT model : http://jalammar.github.io/illustrated-bert/

Keras application for text classification : https://www.section.io/engineering-education/classification-model-using-bert-and-tensorflow/