#Introduction + Problem Definition
In this notebook, two models are built as solutions to a sentiment analysis task, which is a Natural Language Processing (NLP) problem. The dataset consists of tweets about certain airlines, and it is the job of the models to identify whether each tweet is positive or negative.

In [17]:
from google.colab import drive
drive.mount('/content/gdrive')
path = "/content/gdrive/My Drive/DW_data/"

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


#Data Preparation
In the following code block, the data and labels are converted into the appropriate formats so that they can be used by a solution to the given NLP problem.

In [18]:
import pandas as pd
df = pd.read_csv(path+"Tweets.csv")
#Encode the labels as integers (positive -> 0, negative -> 1). Any other label value becomes NaN.
df['airline_sentiment'] = df['airline_sentiment'].map({'positive': 0, 'negative': 1})

#The below method is taken from https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python
import re

#Compile the pattern once so that it is not rebuilt for every tweet.
emoji_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, arrows and misc symbols
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # wide range that also includes CJK characters
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector
        u"\u3030"
        "]+", flags = re.UNICODE)

def remove_emojis(t):
  return emoji_pattern.sub(r'', t)

#Characters to strip from each tweet. Note that '#' and '@' are part of this list, so they are removed as well.
#A raw string is used so that the backslash is taken literally.
punc, numbs = r'''!()-[]{};:'"\,<>./?$%^&*_~#@+''', '''0123456789'''
punc_table = str.maketrans('', '', punc)
numb_table = str.maketrans('', '', numbs)

results = []
for text in df['text']:
  #Remove the emojis from the sentence.
  temp = remove_emojis(text)

  #There is an edge case with ellipses, this will be handled here.
  temp = temp.replace("...", " ")

  #Remove any unnecessary punctuation, including both kinds of quotes.
  #The original per-character "http" check could never match, so links lose their punctuation too.
  temp = temp.translate(punc_table)

  #Remove digits. (Original note, left unfinished: "Is it best to remove this condition as some of the")
  temp = temp.translate(numb_table)

  #Store the cleaned, lower-cased sentence. Empty results are kept as well,
  #so that the list stays aligned with the rows of the dataframe.
  results.append(temp.lower())
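
To make the cleaning steps concrete, the sketch below applies the same operations (ellipsis replacement, punctuation and digit removal, lower-casing) to a made-up tweet. It is only a standard-library approximation of the cell above: `clean_tweet`, `STRIP_TABLE` and the sample text are hypothetical, and emoji removal is left out for brevity.

```python
# Characters stripped by the cleaning step: the notebook's punctuation list plus digits.
PUNCTUATION = r'''!()-[]{};:'"\,<>./?$%^&*_~#@+'''
DIGITS = "0123456789"
STRIP_TABLE = str.maketrans("", "", PUNCTUATION + DIGITS)

def clean_tweet(text):
    """Hypothetical helper mirroring the cleaning loop (without emoji removal)."""
    text = text.replace("...", " ")      # ellipses become a single space
    text = text.translate(STRIP_TABLE)   # drop punctuation and digits
    return text.lower()

# Made-up example tweet, not taken from the dataset.
sample = "@VirginAmerica Flight 236 was late... AGAIN! #fail"
cleaned = clean_tweet(sample)
print(cleaned.split())
```

Note that only the `@` and `#` markers are removed; the handle and hashtag text themselves stay in the sentence as ordinary words.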

In the following code block, each cleaned sentence is split into tokens, English stopwords are removed, and the remaining tokens are lemmatized (reduced to their root form). The processed tokens are then joined back into a sentence and written to the dataframe.

In [19]:
import nltk
from nltk import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = nltk.stem.WordNetLemmatizer()

#Retrieve the stopwords from the English language once, as a set for fast lookups.
stopwords = set(nltk.corpus.stopwords.words('english'))

sentences = []
for entry in results:
  tokens = nltk.word_tokenize(entry)

  #Remove the stopwords from the tokens.
  tokens = [word for word in tokens if word not in stopwords]

  #Reduce the tokens to the root form (lemmatize) of the word (token).
  tokens = [lemmatizer.lemmatize(word) for word in tokens]

  sentences.append(" ".join(tokens))

#Update the dataframe with the processed sentences.
df['text'] = sentences

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
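
The stopword filtering step can be illustrated without NLTK. The sketch below uses a tiny, hypothetical stopword set and plain whitespace splitting instead of `nltk.word_tokenize`; lemmatization is omitted because it needs the WordNet data. `STOPWORDS` and `remove_stopwords` are names invented for this example.

```python
# A tiny, hypothetical stopword set; NLTK's English list is much longer.
STOPWORDS = {"the", "was", "a", "is", "and", "to", "my"}

def remove_stopwords(sentence):
    """Split on whitespace and drop every token found in STOPWORDS."""
    return [token for token in sentence.split() if token not in STOPWORDS]

tokens = remove_stopwords("the flight was late and my bag is missing")
print(" ".join(tokens))
```

Storing the stopwords in a set keeps each membership test cheap, which matters when the filter runs over every token of every tweet.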


#Data Segregation.
The data will be split into 80%/0%/20% (train/validation/test). This split has been chosen because the models need as much training data as they can get. Only 20% of the data is held out for testing, as this is intended to be the smallest test set that still gives a usable estimate of performance. Note that no separate validation set is created; the test set is also passed as the validation data when the neural network is trained.

In [20]:
from sklearn.model_selection import train_test_split
import numpy as np

df = df.sample(frac=1).reset_index(drop=True)
x_train, x_test, y_train, y_test = train_test_split(df['text'], df['airline_sentiment'], test_size=0.2)
x_train, x_test = np.array(x_train), np.array(x_test)
y_train, y_test = np.array(y_train), np.array(y_test)
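
For intuition, an 80/20 shuffle split can be sketched with the standard library alone. This is not how `train_test_split` is implemented; `shuffle_split` is a hypothetical helper, and the fixed `seed` is an assumption added here for repeatability (the notebook itself does not set one).

```python
import random

def shuffle_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the items and return (train, test) lists."""
    items = list(items)
    rng = random.Random(seed)          # local generator, so global state is untouched
    rng.shuffle(items)
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

train_items, test_items = shuffle_split(range(100))
print(len(train_items), len(test_items))
```

Every item ends up in exactly one of the two lists, which is the property that keeps the test accuracy an honest estimate.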

#Training Models.
In the following code blocks, an LSTM and an SVM model will be trained and tested on the provided dataset.

First, we need to build these models before we can compare their performance.

In [21]:
##### Recurrent Neural Network #####
#Import the necessary libraries.
from tensorflow.keras.layers import TextVectorization, Embedding, Dense, LSTM
from tensorflow.keras.models import Sequential
import tensorflow.keras as keras
from tensorflow.keras.callbacks import EarlyStopping

#Specify the vectorization of the sentences.
n = 5000
vectorize_layer = TextVectorization(max_tokens=n, output_mode='int')
vectorize_layer.adapt(x_train)

#Build the structure of the neural network.
model = Sequential()
model.add(vectorize_layer)
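#input_length is not set: the vectorization layer does not pad to a fixed length of 300, and the argument is deprecated in recent Keras versions.
model.add(Embedding(input_dim=n, output_dim=128))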
model.add(LSTM(units=32, return_sequences=True))
model.add(LSTM(units=32))
model.add(Dense(units=1, activation='sigmoid'))

#Define the optimization, loss and metrics objects.
optimizer = keras.optimizers.Adam(learning_rate=0.001)
loss = keras.losses.BinaryCrossentropy()
metrics = [keras.metrics.BinaryAccuracy()]

#Compile, train and evaluate the model.
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, batch_size=32, callbacks=[es])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 6: early stopping
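
Stopping at epoch 6 with `patience=5` is consistent with the validation loss not improving after the first epoch. The sketch below is a simplified model of that patience logic, not the Keras implementation; `early_stopping_epoch` is a hypothetical helper, and the loss values are invented for illustration because the output above does not show the real ones.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss      # new best value: reset the patience counter
            wait = 0
        else:
            wait += 1        # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Illustrative validation losses only, not the notebook's actual values.
losses = [0.25, 0.27, 0.30, 0.33, 0.37, 0.41, 0.45, 0.50, 0.55, 0.60]
stop_epoch = early_stopping_epoch(losses, patience=5)
print(stop_epoch)
```

With these made-up losses the best value occurs at epoch 1, the next five epochs bring no improvement, and the function reports epoch 6.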


In the following code blocks, the Support Vector Machine (SVM) will be implemented. Before the model can be trained, the sentences need to be converted into a numerical representation, which is done here with TF-IDF vectors.

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

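#Fit the vectorizer on the training sentences only, so that no vocabulary or statistics from the test set leak into the features.
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(x_train)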

x_train_vect = Tfidf_vect.transform(x_train)
x_test_vect = Tfidf_vect.transform(x_test)
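
The idea behind TF-IDF weighting can be sketched in plain Python. The formula below follows what I understand to be scikit-learn's defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation), but it is an approximation: tokenization is plain whitespace splitting, and `tfidf` and the three sample sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty docs
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Made-up sentences, not taken from the dataset.
docs = ["flight late", "flight cancelled", "great flight crew"]
vectors = tfidf(docs)
print(vectors[0])
```

Because "flight" appears in every sentence, it receives a lower weight than "late", which occurs only once in the collection.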

In [26]:
from sklearn import svm
from sklearn.metrics import accuracy_score

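#degree and gamma have no effect on a linear kernel, so they are not set here.
SVM = svm.SVC(C=1.0, kernel='linear')
SVM.fit(x_train_vect, y_train)

#accuracy_score expects the true labels first and the predictions second.
print("SVM Accuracy Score -> ",accuracy_score(y_test, SVM.predict(x_test_vect))*100)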

SVM Accuracy Score ->  92.20441749675184


#Compare the Models.
Looking at the accuracies of the models, it can be seen that the Support Vector Machine (SVM) is better at classifying the sentiment of these tweets. The SVM's accuracy is higher than that of the Recurrent Neural Network (RNN) by a factor of 1.0172, which is a relative improvement of 1.72% (about 1.56 percentage points in absolute terms).

In [32]:
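#evaluate() returns [loss, binary accuracy]; index 1 is the accuracy.
rnnResult = model.evaluate(x_test, y_test)[1]
#accuracy_score expects the true labels first and the predictions second.
svmResult = accuracy_score(y_test, SVM.predict(x_test_vect))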

print("RNN Accuracy: ", round(rnnResult, 5) * 100)
print("SVM Accuracy: ", round(svmResult, 5) * 100)

print("Improvement:  ", round(round(svmResult, 5) / round(rnnResult, 5), 5))

RNN Accuracy:  90.645
SVM Accuracy:  92.204
Improvement:   1.0172
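
The "Improvement" figure is a ratio of accuracies, so it is worth separating the relative gain from the absolute gain. The short calculation below uses the two accuracies printed above; the variable names are introduced here for illustration only.

```python
rnn_accuracy = 0.90645  # RNN accuracy reported above
svm_accuracy = 0.92204  # SVM accuracy reported above

ratio = svm_accuracy / rnn_accuracy
relative_gain_pct = (ratio - 1) * 100                       # gain relative to the RNN
absolute_gain_points = (svm_accuracy - rnn_accuracy) * 100  # gain in percentage points

print(f"Ratio:         {ratio:.4f}")
print(f"Relative gain: {relative_gain_pct:.2f}%")
print(f"Absolute gain: {absolute_gain_points:.3f} percentage points")
```

The relative gain is the larger of the two numbers because it is measured against the RNN's accuracy rather than against 100%.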
