#Prepare Dataset

This data originally came from Crowdflower's Data for Everyone library.

As the original source says,

>We looked through tens of thousands of tweets about the early August GOP debate in Ohio and asked contributors to do both sentiment analysis and data categorization. Contributors were asked if the tweet was relevant, which candidate was mentioned, what subject was mentioned, and then what the sentiment was for a given tweet. We've removed the non-relevant messages from the uploaded dataset.

In [31]:
import pandas as pd
from sklearn.model_selection import train_test_split
import re

Read the data using pandas.

In [32]:
data = pd.read_csv('Sentiment.csv')

# Keep only the necessary columns
data = data[['text','sentiment']]
data.head(20)

Unnamed: 0,text,sentiment
0,RT @NancyLeeGrahn: How did everyone feel about...,Neutral
1,RT @ScottWalker: Didn't catch the full #GOPdeb...,Positive
2,RT @TJMShow: No mention of Tamir Rice and the ...,Neutral
3,RT @RobGeorge: That Carly Fiorina is trending ...,Positive
4,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,Positive
5,"RT @GregAbbott_TX: @TedCruz: ""On my first day ...",Positive
6,RT @warriorwoman91: I liked her and was happy ...,Negative
7,Going on #MSNBC Live with @ThomasARoberts arou...,Neutral
8,Deer in the headlights RT @lizzwinstead: Ben C...,Negative
9,RT @NancyOsborne180: Last night's debate prove...,Negative


We will create a function that uses regular expressions to remove unwanted characters from the tweets.

In [33]:
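def preProcess_data(text):
    text = text.lower()
    # Drop everything except letters, digits, and whitespace.
    # Note: the range A-z also spans [ \ ] ^ _ `, so underscores survive (see row 5 below).
    new_text = re.sub(r'[^a-zA-z0-9\s]', '', text)
    # Remove the retweet marker. Note: this removes 'rt' anywhere, even inside
    # words (row 7 below: 'thomasaroberts' becomes 'thomasarobes').
    new_text = re.sub('rt', '', new_text)
    return new_text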

data['text'] = data['text'].apply(preProcess_data)
data.head(20)

Unnamed: 0,text,sentiment
0,nancyleegrahn how did everyone feel about the...,Neutral
1,scottwalker didnt catch the full gopdebate la...,Positive
2,tjmshow no mention of tamir rice and the gopd...,Neutral
3,robgeorge that carly fiorina is trending hou...,Positive
4,danscavino gopdebate w realdonaldtrump delive...,Positive
5,gregabbott_tx tedcruz on my first day i will ...,Positive
6,warriorwoman91 i liked her and was happy when...,Negative
7,going on msnbc live with thomasarobes around 2...,Neutral
8,deer in the headlights lizzwinstead ben carso...,Negative
9,nancyosborne180 last nights debate proved it ...,Negative
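
The cleaning function above removes every occurrence of `rt`, including inside words, and its `A-z` range keeps a few punctuation characters such as `_`. A stricter variant might look like the sketch below. `clean_tweet` is a hypothetical name that does not appear in the original notebook, and switching to it would change the tokenizer vocabulary and therefore the outputs shown in the rest of this notebook.

```python
import re

def clean_tweet(text):
    """Hypothetical stricter variant of preProcess_data (an assumption, not the original)."""
    text = text.lower()
    # Keep only ASCII letters, digits, and whitespace (the text is already lowercase).
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Remove 'rt' only when it stands alone as a word (the retweet marker).
    text = re.sub(r'\brt\b', '', text)
    # Collapse repeated whitespace and trim both ends.
    return re.sub(r'\s+', ' ', text).strip()

print(clean_tweet("RT @GregAbbott_TX: Going on #MSNBC with @ThomasARoberts"))
```

With this variant the handle `thomasaroberts` stays intact and the underscore in `gregabbott_tx` is dropped.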


We will use TensorFlow’s `Tokenizer` to tokenize our dataset and TensorFlow’s `pad_sequences` to pad every sequence to the same length.

In [57]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_fatures = 2000

tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['text'].values)
X = tokenizer.texts_to_sequences(data['text'].values)
X = pad_sequences(X, 32) 

Y = pd.get_dummies(data['sentiment']).values
for i in range(3):
  print(data['text'][i])
  print(data['sentiment'][i])
  print(Y[i])




 nancyleegrahn how did everyone feel about the climate change question last night exactly gopdebate
Neutral
[0 1 0]
 scottwalker didnt catch the full gopdebate last night here are some of scotts best lines in 90 seconds walker16 httptcozsff
Positive
[0 0 1]
 tjmshow no mention of tamir rice and the gopdebate was held in cleveland wow
Neutral
[0 1 0]
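
To make the tokenize-and-pad step more concrete, here is a simplified pure-Python sketch of the idea. It is only an approximation of the Keras utilities under a few assumptions: words are ranked by frequency starting at index 1, only words ranked below `num_words` are kept, and sequences are padded and truncated on the left. The helper names `fit_word_index`, `to_sequence`, and `pad_pre` are made up for this illustration.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: -counts[w])  # stable sort: ties keep first-seen order
    return {word: i + 1 for i, word in enumerate(ranked)}

def to_sequence(text, word_index, num_words):
    """Map words to indices, skipping unknown words and words ranked num_words or higher."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items if the sequence is too long."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["the debate was great", "the debate was held in cleveland"]
word_index = fit_word_index(texts)
seq = to_sequence("the debate in ohio", word_index, num_words=2000)
print(pad_pre(seq, 6))
```

The unknown word `ohio` is silently dropped, which is also what happens to out-of-vocabulary words at prediction time when no out-of-vocabulary token is configured.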


Split the dataset into training and testing portions.

In [42]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20)
print(len(X_train), "Training sequences")
print(len(X_test), "Validation sequences")


11096 Training sequences
2775 Validation sequences
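
We can check these counts with a little arithmetic. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample, which is consistent with the numbers printed above.

```python
import math

n_samples = 11096 + 2775   # total rows, taken from the counts printed above
test_size = 0.20

# Assumption: the test share is rounded up to a whole number of samples.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

Because the split is shuffled randomly, passing `random_state` to `train_test_split` is a simple way to make it reproducible between runs.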


#LSTM Model

We will use an embedding layer followed by two LSTM layers with dropout.

In [60]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D

embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.3, recurrent_dropout=0.2, return_sequences=True))
model.add(LSTM(128,recurrent_dropout=0.2))
model.add(Dense(3,activation='softmax'))

model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.summary()

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_21 (Embedding)     (None, 32, 128)           256000    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 32, 128)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 32, 196)           254800    
_________________________________________________________________
lstm_3 (LSTM)                (None, 128)               166400    
_________________________________________________________________
dense_41 (Dense)             (None, 3)                 387       
Total params: 677,587
Trainable params: 677,587
Non-trainable params: 0
_________________________________________________________________
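
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are made up for this illustration.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of size embed_dim per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

counts = [
    embedding_params(2000, 128),  # embedding layer
    lstm_params(128, 196),        # first LSTM
    lstm_params(196, 128),        # second LSTM
    dense_params(128, 3),         # softmax output layer
]
print(counts, sum(counts))
```

The four values match the `Param #` column, and their sum matches the reported total of 677,587.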


In [61]:
batch_size = 256

model.fit(X_train, Y_train, epochs = 10, batch_size=batch_size, validation_data=(X_test, Y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fd7f22cef10>

In [62]:
txt = 'pattonoswalt i hate scott walker'
true_label = 'Negative'
# Same order as the one-hot columns produced by pd.get_dummies (alphabetical)
labels = ['Negative', 'Neutral', 'Positive']
seq = tokenizer.texts_to_sequences([txt])
padded = pad_sequences(seq, maxlen=32)

pred = model.predict(padded)
print(txt, pred)  # print the text that was actually classified
print('\n', labels, '\n', pred)


pattonoswalt i hate scott walker [[0.8655635  0.12268818 0.0117483 ]]

 ['Negative', 'Neutral', 'Positive'] 
 [[0.8655635  0.12268818 0.0117483 ]]


**The model assigns the highest probability (about 0.87) to the negative class, so it predicts our phrase correctly! :)**
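
Rather than reading the probabilities by eye, we can pick the most likely label programmatically. This is a minimal sketch: `decode` is a hypothetical helper, and the probabilities are copied from the output above.

```python
# Alphabetical order, matching the one-hot columns shown earlier
# (Neutral -> [0 1 0], Positive -> [0 0 1]).
labels = ['Negative', 'Neutral', 'Positive']
pred = [[0.8655635, 0.12268818, 0.0117483]]  # copied from the output above

def decode(probabilities, labels):
    """Return the label with the highest probability, together with that probability."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]

print(decode(pred[0], labels))
```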

#Transformer Model

In [43]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

Implement a Transformer block as a layer

In [63]:

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
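
To give some intuition for what the attention layer inside the block computes, here is a heavily simplified pure-Python sketch of scaled dot-product self-attention. It assumes a single head and skips the learned query, key, and value projections that `MultiHeadAttention` applies, so it illustrates the mechanism rather than reimplementing the layer.

```python
import math

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with identity projections (a simplification).

    x is a list of token vectors, all of the same length.
    """
    d = len(x[0])
    # Similarity of every token with every other token, scaled by sqrt(d).
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in x] for qi in x]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all token vectors.
    out = [[sum(w * v[c] for w, v in zip(row, x)) for c in range(d)] for row in weights]
    return weights, out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, out = self_attention(tokens)
print([round(sum(row), 6) for row in weights])
```

Every row of `weights` sums to 1, so each output token is a mixture of all input tokens. This is what lets the block look at the whole tweet at once.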

Implement embedding layer
>Two separate embedding layers, one for tokens, one for token index (positions).

In [64]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
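
The effect of adding the two embeddings can be seen in a small pure-Python sketch. The sizes are tiny and the tables hold random numbers that stand in for learned weights, so everything here is illustrative only.

```python
import random

random.seed(0)
vocab_size, maxlen, embed_dim = 10, 4, 3  # tiny illustrative sizes, not the notebook's

# Two lookup tables with random values standing in for learned weights.
token_table = [[random.random() for _ in range(embed_dim)] for _ in range(vocab_size)]
pos_table = [[random.random() for _ in range(embed_dim)] for _ in range(maxlen)]

def embed(token_ids):
    """Add the vector for each token to the vector for its position."""
    return [
        [t + p for t, p in zip(token_table[tok], pos_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

a = embed([0, 0, 7, 7])
# The same token id at two different positions gets two different vectors.
print(a[2] != a[3])
```

Without the position table, attention would treat the tweet as an unordered bag of tokens.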

Create classifier model using transformer layer
>The Transformer layer outputs one vector for each time step of our input sequence. Here, we take the mean across all time steps and use a feed-forward network on top of it to classify the text.

In [65]:
embed_dim = 64  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 64  # Hidden layer size in feed forward network inside transformer



inputs = layers.Input(shape=(X.shape[1],))  # one padded sequence of token ids per sample
embedding_layer = TokenAndPositionEmbedding(X.shape[1], max_fatures, embed_dim)

x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)

x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(3, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()

Model: "model_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        [(None, 32)]              0         
_________________________________________________________________
token_and_position_embedding (None, 32, 64)            130048    
_________________________________________________________________
transformer_block_10 (Transf (None, 32, 64)            41792     
_________________________________________________________________
global_average_pooling1d_10  (None, 64)                0         
_________________________________________________________________
dropout_42 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_44 (Dense)             (None, 20)                1300      
_________________________________________________________________
dropout_43 (Dropout)         (None, 20)                0  
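
The parameter counts listed above can be checked by hand. The sketch below assumes the usual layout of a multi-head attention layer: query, key, and value projections of size `num_heads * key_dim`, followed by an output projection back to `embed_dim`, all with biases.

```python
embed_dim, num_heads, key_dim, ff_dim = 64, 2, 64, 64
vocab_size, maxlen = 2000, 32

# Token table plus position table.
embedding = vocab_size * embed_dim + maxlen * embed_dim

# Query, key, and value projections: embed_dim -> num_heads * key_dim, with bias.
qkv = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
# Output projection: num_heads * key_dim -> embed_dim, with bias.
attn_out = num_heads * key_dim * embed_dim + embed_dim
# Feed-forward network: Dense(ff_dim) followed by Dense(embed_dim).
ffn = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
# Two layer normalizations, each with a scale and an offset per feature.
layernorm = 2 * (2 * embed_dim)
block = qkv + attn_out + ffn + layernorm

# Dense(20) applied to the pooled 64-dimensional vector.
hidden_dense = embed_dim * 20 + 20

print(embedding, block, hidden_dense)
```

The three values agree with the `Param #` column for the embedding, Transformer block, and first dense layer.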

In [66]:
batch_size=128
model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    X_train, Y_train, epochs = 10, batch_size=batch_size, validation_data=(X_test, Y_test)
)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [69]:
txt = 'pattonoswalt i do no know but hate scott walker'
true_label = 'Negative'
# Same order as the one-hot columns produced by pd.get_dummies (alphabetical)
labels = ['Negative', 'Neutral', 'Positive']
seq = tokenizer.texts_to_sequences([txt])
padded = pad_sequences(seq, maxlen=32)

pred = model.predict(padded)
print(txt, pred)  # print the text that was actually classified
print('\n', labels, '\n', pred)


pattonoswalt i do no know but hate scott walker [[0.83604825 0.1623583  0.00159346]]

 ['Negative', 'Neutral', 'Positive'] 
 [[0.83604825 0.1623583  0.00159346]]


**This model also assigns the highest probability (about 0.84) to the negative class, so it predicts our phrase correctly too! :)**

#Conclusion:

>We can see that the Transformer trained very quickly.

>Also, the Transformer achieved the same val_accuracy with far fewer parameters.