**Steps followed in this notebook for Sentiment analysis of Tweets**

    1. Importing the raw tweet data
    2. Preprocessing the tweets
    3. Creating embeddings using Word2Vec
    4. Building an RNN model with a hand-written attention function
    5. Building an RNN model with the Keras Attention layer

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd 
import re 

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing the dataset
DATASET_COLUMNS  = ["sentiment", "ids", "date", "flag", "user", "tweet"]
DATASET_ENCODING = "ISO-8859-1"
dataset = pd.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv',
                      encoding=DATASET_ENCODING , names=DATASET_COLUMNS)
dataset.head()

In [None]:
dataset.shape

In [None]:
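# Take an explicit copy so that later column assignments do not act on a view of the original frame
dataset = dataset[['sentiment', 'tweet']].copy()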
dataset.head()

In [None]:
#unique values of sentiment
dataset['sentiment'].unique()

In [None]:
#replacing 4 with 1 for positive sentiment
dataset['sentiment'] = dataset['sentiment'].replace(4,1)

In [None]:
dataset['sentiment'].value_counts()

**Preprocessing the Text**

The Preprocessing steps taken are:

1. **Lower Casing:** Each tweet is converted to lowercase.
2. **Replacing URLs:** Links starting with 'http://', 'https://' or 'www.' are replaced by '<url>'.
3. **Replacing Usernames:** @usernames are replaced by the token '<user>'. [eg: '@Kaggle' to '<user>']
4. **Replacing Emoticons:** Text emoticons are matched with regular expressions and replaced by tokens. [eg: ':)' to '<smile>']
5. **Replacing Contractions:** Contractions are replaced by their expanded forms. [eg: "can't" to 'can not']
6. **Removing Non-Alphanumerics:** Every character other than lowercase letters, digits and the angle brackets used by the tokens above is replaced by a space.
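
As a rough illustration of steps 1 to 4 and 6, the sketch below applies simplified versions of the patterns to one made-up tweet. The helper name `demo_preprocess`, the constant names and the sample text are assumptions for illustration only; the notebook's own `preprocess_apply` also covers more emoticons and the contraction lookup.

```python
import re

# Simplified patterns, assumed for illustration only.
URL_PATTERN = r"(https?://[^ ]*|www\.[^ ]*)"
USER_PATTERN = r"@[^\s]+"
SMILE_PATTERN = r"[8:=;]['`\-]?[)d]+"
NON_ALNUM_PATTERN = r"[^a-z0-9<>]"


def demo_preprocess(tweet):
    """Apply lowercasing, token replacement and symbol removal to one tweet."""
    tweet = tweet.lower()
    tweet = re.sub(URL_PATTERN, "<url>", tweet)      # links -> <url>
    tweet = re.sub(USER_PATTERN, "<user>", tweet)    # @name -> <user>
    tweet = re.sub(SMILE_PATTERN, "<smile>", tweet)  # :) ;d ... -> <smile>
    # Anything that is not a letter, digit or angle bracket becomes a space.
    tweet = re.sub(NON_ALNUM_PATTERN, " ", tweet)
    return tweet


sample = "@Kaggle Loving this :) http://example.com"
print(demo_preprocess(sample))  # prints: <user> loving this <smile> <url>
```

The URL step runs before the username and emoticon steps so that characters inside a link (such as ':' or '/') are not picked up by the later patterns.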

In [None]:
# Reading contractions.csv and storing it as a dict.
contractions = pd.read_csv('../input/contractions/contractions.csv', index_col='Contraction')
contractions.index = contractions.index.str.lower()
contractions.Meaning = contractions.Meaning.str.lower()
contractions_dict = contractions.to_dict()['Meaning']

In [None]:
contractions_dict

In [None]:
# Defining regex patterns.
urlPattern        = r"((http://)[^ ]*|(https://)[^ ]*|(www\.)[^ ]*)"
userPattern       = r'@[^\s]+'
hashtagPattern    = r'#[^\s]+'
alphaPattern      = "[^a-z0-9<>]"
sequencePattern   = r"(.)\1\1+"
seqReplacePattern = r"\1\1"

# Defining regex for text emoticons
smileemoji        = r"[8:=;]['`\-]?[)d]+"
sademoji          = r"[8:=;]['`\-]?\(+"
neutralemoji      = r"[8:=;]['`\-]?[\/|l*]"
lolemoji          = r"[8:=;]['`\-]?p+"

def preprocess_apply(tweet):

    tweet = tweet.lower()

    # Replace all URLs with '<url>'
    tweet = re.sub(urlPattern,'<url>',tweet)
    
    # Replace @USERNAME to '<user>'.
    tweet = re.sub(userPattern,'<user>', tweet)

    # Replace all text emoticons.
    tweet = re.sub(r'<3', '<heart>', tweet)
    tweet = re.sub(smileemoji, '<smile>', tweet)
    tweet = re.sub(sademoji, '<sadface>', tweet)
    tweet = re.sub(neutralemoji, '<neutralface>', tweet)
    tweet = re.sub(lolemoji, '<lolface>', tweet)

    for contraction, replacement in contractions_dict.items():
        tweet = tweet.replace(contraction, replacement)

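    # Replace every character except lowercase letters, digits, '<' and '>' with a space
    tweet = re.sub(alphaPattern, ' ', tweet)

    # Adding space on either side of '/' to separate words.
    # Note: alphaPattern above has already replaced '/' with a space, so this substitution no longer matches anything.
    tweet = re.sub(r'/', ' / ', tweet)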
    return tweet

In [None]:
dataset['processed_text'] = dataset.tweet.apply(preprocess_apply)

In [None]:
dataset.head()

In [None]:
dataset['tweet'][0]

In [None]:
dataset['processed_text'][0]

In [None]:
#splitting the data
from sklearn.model_selection import train_test_split

In [None]:
X_data, y_data = np.array(dataset['processed_text']), np.array(dataset['sentiment'])

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=32, stratify=y_data)

In [None]:
X_train.shape, X_test.shape

In [None]:
y_train.shape, y_test.shape

**Creating Word Embedding Using Word2Vec**

In [None]:
from gensim.models import Word2Vec

Embedding_dimensions = 100

#creating the List of words for training data
Word2Vec_training_data = list(map(lambda x: x.split(), X_train))

In [None]:
Word2Vec_training_data[0]

In [None]:
word2vec_model = Word2Vec(Word2Vec_training_data,
                        vector_size=Embedding_dimensions,
                        workers=8,
                        min_count=5)

In [None]:
word2vec_model.wv[1]

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
vocab_length = 60000

tokenizer = Tokenizer(filters="", lower=False, oov_token="<oov>")
tokenizer.fit_on_texts(X_data)
tokenizer.num_words = vocab_length
print("Tokenizer vocab length:", vocab_length)

In [None]:
input_length = 60

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=input_length)
X_test  = pad_sequences(tokenizer.texts_to_sequences(X_test) , maxlen=input_length)

print("X_train.shape:", X_train.shape)
print("X_test.shape :", X_test.shape)

In [None]:
X_train[0]

In [None]:
second_key, second_value = list(tokenizer.word_index.items())[1]
print("Second Key:", second_key)
print("Second Value:", second_value)

In [None]:
embedding_matrix = np.zeros((vocab_length, Embedding_dimensions))

for word, token in tokenizer.word_index.items():
    # Skip tokens that fall outside the matrix and words Word2Vec has no vector for
    if token < vocab_length and word in word2vec_model.wv:
        embedding_matrix[token] = word2vec_model.wv[word]

print("Embedding Matrix Shape:", embedding_matrix.shape)

In [None]:
# Reverse the mapping of tokens to words
index_to_word = {token: word for word, token in tokenizer.word_index.items()}
fifth_word = index_to_word[5]
fifth_word

In [None]:
embedding_matrix[5]

In [None]:
word2vec_model.wv.most_similar('man')

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN, Embedding

**RNN WITH ATTENTION LAYER**

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import SimpleRNN, Dense, Embedding, Input, Flatten
from tensorflow.keras.models import Model
import math


def attention(q, k, v):
    
    d_k = q.shape[-1]
    seq_len = q.shape[-2]
    
    #causal mask: 0 for the current and earlier positions, -inf for later positions
    mask = np.tril(np.ones((seq_len, seq_len), dtype=np.float32))
    mask[mask == 0] = -np.inf
    mask[mask == 1] = 0
    
    #attention scores
    scaled = tf.matmul(q, k, transpose_b=True) / math.sqrt(d_k) 
    scaled = scaled + mask
        
    #attention weights
    attention = tf.nn.softmax(scaled, axis=-1) 
    
    #context vector
    output = tf.matmul(attention, v)
    
    return output


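# Build the model with the functional API so that the attention output is actually part of the graph
# (in a Sequential model the result of attention() would be computed but never connected to the next layer)
inputs = Input(shape=(input_length,))

#Embedding layer
embedded = Embedding(input_dim=vocab_length,
                     output_dim=Embedding_dimensions,
                     weights=[embedding_matrix],
                     input_length=input_length,
                     trainable=False)(inputs)

#RNN layer
rnn_output = SimpleRNN(64, return_sequences=True)(embedded)

#attention mechanism (self-attention: q, k and v are all the RNN outputs)
context_vector = tf.keras.layers.Lambda(lambda t: attention(t, t, t),
                                        name='context_vector')(rnn_output)

#Flatten the context_vector
flattened = Flatten()(context_vector)

#output layer
outputs = Dense(1, activation='sigmoid')(flattened)

model = Model(inputs=inputs, outputs=outputs)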


model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


model.fit(
    X_train, y_train,
    batch_size=512,
    epochs=1,
    validation_split=0.1,
    verbose=1,
)

In [None]:
model.summary()

**Things Observed**
1. The attention layer can be used without the RNN, because its input is already a sequence of word embeddings.
2. The attention layer combined with the RNN is slower than the attention layer alone.
3. Accuracy is lower without the RNN than with it.

Before flattening (the context_vector output):

The shape of the context_vector is (None, 60, 64): 60 time steps, each with the 64 features produced by the RNN.

After flattening (the Flatten layer):

The shape of the flattened vector is (None, 3840), since 60 × 64 = 3840.
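
The flattened size of 3840 follows from 60 time steps times 64 RNN units, and this can be checked without TensorFlow. The following is a minimal NumPy sketch of scaled dot-product self-attention with a causal mask. It assumes a single random sequence in place of real RNN outputs, and the function name `causal_self_attention` is hypothetical.

```python
import numpy as np


def causal_self_attention(x):
    """Scaled dot-product self-attention over x of shape (steps, features)."""
    steps, d_k = x.shape
    scores = x @ x.T / np.sqrt(d_k)
    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones((steps, steps), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


rng = np.random.default_rng(0)
rnn_outputs = rng.normal(size=(60, 64))  # stand-in for one sequence of RNN outputs

context, weights = causal_self_attention(rnn_outputs)
print(context.shape)              # (60, 64)
print(context.reshape(-1).shape)  # (3840,)
```

Because of the mask, the first time step can only attend to itself, so the first row of `context` equals the first row of the input.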

**RNN WITH ATTENTION USING KERAS**

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import SimpleRNN, Dense, Embedding, Input, Attention
from tensorflow.keras.models import Model

#input_layer
input_layer = Input(shape=(60,))

#Embedding layer
embedding_layer = Embedding(input_dim=vocab_length,
                            output_dim=Embedding_dimensions,
                            weights=[embedding_matrix],
                            input_length=input_length,
                            trainable=False)(input_layer)

#SimpleRNN layer
rnn_layer = SimpleRNN(64, return_sequences=True)(embedding_layer)

#Keras Attention layer (dot-product attention; the RNN outputs serve as both query and value)
attention_layer = Attention()([rnn_layer, rnn_layer])

#GlobalAveragePooling1D to get the context vector
context_vector = tf.keras.layers.GlobalAveragePooling1D()(attention_layer)

#output layer
output_layer = Dense(1, activation='sigmoid')(context_vector)

#Define the model
model = Model(inputs=input_layer, outputs=output_layer)

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(
    X_train, y_train,
    batch_size=512,
    epochs=1,
    validation_split=0.1,
    verbose=1,
)