#Importing packages and the dataset.

In [None]:
#Import potentially useful packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [None]:
#Importing neural network libraries
import tensorflow as tf
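#Everything is imported from tensorflow.keras, so we don't mix the standalone keras package with tensorflow.keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential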
from tensorflow.keras.layers import Dense, Dropout, Embedding, GRU, LSTM, Bidirectional

In [None]:
#Importing the data. I'm lazy, so let's just call the data df, because d and f are right next to each other on the keyboard. How fun.
df_url = "https://github.com/SDS-AAU/SDS-master/raw/e2c959494d53859c1844604bed09a28a21566d0f/M3/assignments/trump_vs_GPT2.gz"
df = pd.read_json(df_url)

#Preprocessing the data

In [None]:
#Let's look at the data itself, first by checking the shape, which gives us 14736 observations (tweets) and 2 features (tweet and label).
df.shape

(14736, 2)

In [None]:
#The top 5 rows. The labels show that all of them are real Trump tweets.
df.head()

Unnamed: 0,0,1
0,I was thrilled to be back in the Great city of...,True
1,The Unsolicited Mail In Ballot Scam is a major...,True
2,"As long as I am President, I will always stand...",True
3,"Our Economy is doing great, and is ready to se...",True
4,If I do not sound like a typical Washington po...,True


In [None]:
#Because I like things pretty, let's rename the columns rather than leaving them as 0 and 1.
df.columns = ['tweet','label']

In [None]:
#And to check that there are equal numbers of true and false tweets
df.label.value_counts()

True     7368
False    7368
Name: label, dtype: int64

In [None]:
#As we want to be able to test our model on fresh data, we split the dataset into a training set and a test set. First we find the row where 80% of the data ends.
14736 * 0.8

11788.800000000001

In [None]:
#Since the dataset has all the true tweets first and the false ones at the end, we shuffle the observations by sampling the whole dataset.
df = df.sample(frac=1).reset_index(drop=True)

In [None]:
#Now we subset the datasets. We use 11788, as that is where it splits on 80% of the dataset.
train = df.iloc[:11788,:]
test = df.iloc[11788:,:]
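The cutoff of 11788 is typed in by hand from the calculation above. A minimal sketch of the same shuffle-and-slice split that derives the cutoff from the data instead; the toy frame `toy_df`, the names `toy_train` and `toy_test`, and the fixed `random_state` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the tweet data (assumption: 10 rows instead of 14736).
toy_df = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(10)],
    "label": [True] * 5 + [False] * 5,
})

# Shuffle all rows; a fixed random_state makes the shuffle repeatable.
shuffled = toy_df.sample(frac=1, random_state=0).reset_index(drop=True)

# Derive the 80% cutoff from the number of rows rather than hard-coding it.
cutoff = int(len(shuffled) * 0.8)

toy_train = shuffled.iloc[:cutoff].reset_index(drop=True)
toy_test = shuffled.iloc[cutoff:].reset_index(drop=True)

print(len(toy_train), len(toy_test))
```

Deriving the cutoff this way keeps the split correct if the number of rows ever changes.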

In [None]:
#We are going to do some merges later, so we reset the index now to make those merges easy.
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

In [None]:
#We want to find out what to use as max_features, so let's use the number of characters in each tweet to determine that.
#We count the characters in each tweet and store the result in a new column of the dataset.
train['chars'] = train['tweet'].astype(str).str.len()

In [None]:
#Now we check for the extra column
train.head()

Unnamed: 0,tweet,label,chars
0,Our big Kentucky Rally on Monday night had a m...,True,80
1,Bolton said “I don’t know what they’re talking...,False,187
2,It would show great weakness if Israel allowed...,True,80
3,“Double Standard - Former FBI lawyer (Lisa Pag...,True,103
4,LAST thing the Make America Great Again Agenda...,True,136


In [None]:
#I first checked the maximum length, because I thought Twitter only allowed 140 characters. Then I learned it is actually 280. The reason some tweets are longer is that & and other symbols are written as HTML entities such as &amp;.
max(train['chars'])

296
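To see why an entity-encoded tweet can exceed the character limit, here is a small sketch using Python's standard `html.unescape`; the example string is made up and not taken from the dataset.

```python
import html

# Made-up text as it might be stored, with "&" written as an HTML entity.
raw = "Jobs &amp; growth"

# Decode entities back to the characters they stand for.
decoded = html.unescape(raw)

print(len(raw), len(decoded))
```

Each `&amp;` takes five characters in the stored text but counts as one in the decoded tweet.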

In [None]:
#Just for fun as well, let's look at the minimum value. The shortest one has 0 chars. Odd.
min(train['chars'])

0

In [None]:
#Now let's see how many of these there are.
train_min_chars = train[train.chars == 0]
train_min_chars.label.value_counts()

False    328
Name: label, dtype: int64

In [None]:
#Dropping the empty tweets
train = train[train.chars > 0]

Okay, so this to me is a breakpoint. At this moment I can't decide. There are a couple hundred GPT-2 generated tweets that are just empty. Removing them changes the network so little that removing them is hardly justified, as we generally don't want skewed data, but it just bothers me that they exist. I've chosen to remove them. Sue me.

This all began because I wanted to use the average number of characters as the max-features, but looking at it now, I don't think 300 as the max length is that bad. Note that the same value, max_words, is used below both as the tokenizer's vocabulary size and as the padded sequence length.

In [None]:
max_words = max(train['chars'])
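`max_words` is measured in characters, while the tokenizer and the padding step further down count words. A rough sketch of how the two word-based quantities could be measured separately; the example tweets and the names `longest_tweet` and `vocabulary_size` are hypothetical, and splitting on whitespace is a simplification of what the Keras `Tokenizer` does.

```python
from collections import Counter

# Made-up example tweets (assumption, not from the dataset).
tweets = [
    "Make America great again",
    "Great rally tonight in Kentucky",
    "Thank you Kentucky",
]

# Simplified tokenization: lowercase and split on whitespace.
tokens_per_tweet = [tweet.lower().split() for tweet in tweets]

# Candidate sequence length: the longest tweet measured in words.
longest_tweet = max(len(tokens) for tokens in tokens_per_tweet)

# Candidate vocabulary size: the number of distinct words.
word_counts = Counter(word for tokens in tokens_per_tweet for word in tokens)
vocabulary_size = len(word_counts)

print(longest_tweet, vocabulary_size)
```

The notebook keeps a single value for both, which is simpler but ties the vocabulary size to the tweet length.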

#Tokenizing

Let's be real - this way of tokenizing is hijacked directly from the task. Thank you Daniel / Roman / Eskil.

In [None]:
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train['tweet'])
sequences = tokenizer.texts_to_sequences(train['tweet'])
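What the tokenizer does can be sketched in plain Python: words are ranked by frequency, the most frequent word gets index 1, and only words with an index below `num_words` are kept. The helpers `build_word_index` and `to_sequences` are hypothetical names, and the sketch skips the punctuation filtering that the real `Tokenizer` performs.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping words with index >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

texts = ["the cat sat", "the cat", "the"]   # made-up texts
word_index = build_word_index(texts)
sequences_demo = to_sequences(texts, word_index, num_words=3)
print(word_index)
print(sequences_demo)
```

With `num_words=3` only the two most frequent words survive, so rarer words simply disappear from the sequences.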

In [None]:
#Now we need to pad the tweets, to make them all equal length
X = pad_sequences(sequences = sequences, maxlen = max_words)
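By default, `pad_sequences` pads with zeros at the front and, for sequences that are too long, drops items from the front as well. A plain-Python sketch of that default behaviour; `pad_pre` is a hypothetical helper written for illustration and assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                      # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[5, 8, 2], [7]], maxlen=2))  # [[8, 2], [0, 7]]
```

Since index 0 is never assigned to a word, the padding value cannot be confused with a real token.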

In [None]:
#Set y as the label, as that is the "result" we're going for
y = train['label'].values

In [None]:
#For good practice, let's check that they have the same number of rows. They do.
print(X.shape)
print(y.shape)

(11460, 296)
(11460,)


#Splitting the data

In [None]:
#We want to split the data, so that we have something to validate on. For this we can use the good old train_test_split, with 0.2 as our test_size.
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 1337)
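To keep track of how many rows end up where, the arithmetic can be written out. The row counts come from the outputs shown earlier in the notebook; the variable names are made up, and the integer division is only a sketch of the 20% share, not the exact rounding rule `train_test_split` applies.

```python
total_rows = 14736            # from df.shape
train_rows = 11788            # rows before the 80% cutoff
test_rows = total_rows - train_rows

rows_after_cleaning = 11460   # from X.shape, after dropping empty tweets
removed_rows = train_rows - rows_after_cleaning

# 20% of the cleaned training rows are held out for validation.
val_rows = rows_after_cleaning * 20 // 100
fit_rows = rows_after_cleaning - val_rows

print(test_rows, removed_rows, val_rows, fit_rows)
```

So the models are fitted on roughly 62% of the original rows, validated on about 16%, and tested on the remaining 20%.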

#Building models

In this part, two models will be built to test which one is better: an LSTM model and a GRU model. The GRU is similar to the LSTM, but according to this paper on arXiv, it doesn't perform as well as the LSTM (https://arxiv.org/abs/1805.04908). Let's test it on our own. How fun.

The models are built with one LSTM or GRU layer, followed by two Dense layers, with a dropout layer before each Dense layer.

The dropout layers are put there as a measure to hopefully prevent overfitting the data. While it might be overkill to put two of them in there, I chose to do so to be sure.

In [None]:
#The LSTM Model
lstm_model = Sequential()
lstm_model.add(Embedding(input_dim = max_words, output_dim = 128))
lstm_model.add(LSTM(units = 128, dropout = 0.2, recurrent_dropout = 0.2))
lstm_model.add(Dropout(rate = 0.5))
lstm_model.add(Dense(units = 128, activation = 'relu'))
lstm_model.add(Dropout(rate = 0.5))
lstm_model.add(Dense(1, activation = 'sigmoid'))



In [None]:
lstm_model.summary()

Model: "sequential_18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_17 (Embedding)     (None, None, 128)         37888     
_________________________________________________________________
lstm_17 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dropout_23 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_24 (Dense)             (None, 128)               16512     
_________________________________________________________________
dropout_24 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_25 (Dense)             (None, 1)                 129       
Total params: 186,113
Trainable params: 186,113
Non-trainable params: 0
_______________________________________________
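The parameter counts in the summary can be reproduced by hand. A short sketch, using the layer sizes from the model definition; the variable names are made up for illustration.

```python
vocab_size = 296      # input_dim of the Embedding layer (max_words)
embed_dim = 128       # output_dim of the Embedding layer
units = 128           # LSTM units, also the width of the first Dense layer

# One embedding vector per word index.
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (units * (embed_dim + units) + units)

# Dense layers: weights plus one bias per output unit.
dense_params = units * 128 + 128
output_params = 128 * 1 + 1

total_params = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, total_params)
```

The Dropout layers have no weights, which is why they show 0 parameters.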

In [None]:
#Compiling the LSTM model.
lstm_model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

In [None]:
#Fitting the LSTM model.
lstm_model.fit(x_train, y_train, batch_size=32, epochs=4, validation_data=(x_val, y_val))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7faf6ba85080>

In [None]:
#The GRU model.
gru_model = Sequential()
gru_model.add(Embedding(input_dim = max_words, output_dim = 128))
gru_model.add(GRU(units = 128, dropout = 0.2, recurrent_dropout = 0.2))
gru_model.add(Dropout(rate = 0.5))
gru_model.add(Dense(units = 128, activation = 'relu'))
gru_model.add(Dropout(rate = 0.5))
gru_model.add(Dense(1, activation = 'sigmoid'))



In [None]:
gru_model.summary()

Model: "sequential_19"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_18 (Embedding)     (None, None, 128)         37888     
_________________________________________________________________
gru_5 (GRU)                  (None, 128)               99072     
_________________________________________________________________
dropout_25 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_26 (Dense)             (None, 128)               16512     
_________________________________________________________________
dropout_26 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_27 (Dense)             (None, 1)                 129       
Total params: 153,601
Trainable params: 153,601
Non-trainable params: 0
_______________________________________________
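The GRU has three gates instead of four, which explains the smaller parameter count. A sketch of the arithmetic, assuming the layer keeps separate input and recurrent bias vectors (the `reset_after=True` default in TensorFlow 2), which matches the 99,072 shown in the summary; the variable names are made up.

```python
vocab_size = 296      # input_dim of the Embedding layer (max_words)
embed_dim = 128       # output_dim of the Embedding layer
units = 128           # GRU units, also the width of the first Dense layer

embedding_params = vocab_size * embed_dim

# Three gates, each with input weights, recurrent weights and two bias vectors.
gru_params = 3 * (units * (embed_dim + units) + 2 * units)

dense_params = units * 128 + 128
output_params = 128 * 1 + 1

total_params = embedding_params + gru_params + dense_params + output_params
print(gru_params, total_params)
```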

In [None]:
#Compiling the GRU model.
gru_model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

In [None]:
#Fitting the GRU model.
gru_model.fit(x_train, y_train, batch_size=32, epochs=4, validation_data=(x_val, y_val))

Epoch 1/4
Epoch 2/4

KeyboardInterrupt: ignored

#What do we do from here?

Let's add the prediction from each model onto our testing dataset, where we have a whole "new" dataset with new tweets. Then we can show each model's prediction alongside the label. This will give us a true score, as we can clearly see which model gets it correct most often.

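To do so, we must first tokenize the words in the test set, reusing the tokenizer that was fitted on the training data.

In [None]:
#The tokenizer is not refitted here: fitting it again on the test tweets would change the word index the models were trained on.
test_sequences = tokenizer.texts_to_sequences(test['tweet'])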

In [None]:
#Now we need to pad the tweets, to make them all equal length
test_sequences = pad_sequences(sequences = test_sequences, maxlen = max_words)

In [None]:
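#Then we create new arrays for each model's prediction. predict_classes was removed in newer TensorFlow versions, so we threshold the sigmoid output at 0.5 instead.
lstm_prediction = (lstm_model.predict(test_sequences) > 0.5).astype("int32")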

In [None]:
gru_prediction = (gru_model.predict(test_sequences) > 0.5).astype("int32")
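For a model ending in a single sigmoid unit, class predictions come down to thresholding the predicted probability at 0.5. A small sketch with made-up probabilities; the values and variable names are assumptions for illustration.

```python
import numpy as np

# Made-up sigmoid outputs with shape (n_samples, 1), the shape a model
# with a single output unit returns from predict().
probabilities = np.array([[0.91], [0.12], [0.50], [0.73]])

# Values strictly above 0.5 become class 1, everything else class 0.
classes = (probabilities > 0.5).astype("int32")

print(classes.ravel().tolist())
```

The result keeps the `(n_samples, 1)` shape, so it can be passed straight to `pd.DataFrame`.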

In [None]:
#Then we turn the numpy arrays into dataframes
lstm_prediction = pd.DataFrame(lstm_prediction)

In [None]:
gru_prediction = pd.DataFrame(gru_prediction)

In [None]:
#Now we merge the predictions together, to keep it all in one place.
final_predictions = lstm_prediction.merge(gru_prediction, how='left', left_index=True, right_index=True)

In [None]:
#We rename the columns for prettiness :-)
final_predictions.columns = ['LSTM_pred','GRU_pred']

In [None]:
#Merging it all together
test = test.merge(final_predictions, how='left', left_index=True, right_index=True)

In [None]:
#Creating a mapping to replace 0 and 1 with False and True. 
cleanup_nums = {"LSTM_pred":     {0: 'False', 1: 'True'},
                "GRU_pred":     {0: 'False', 1: 'True'}
                }

In [None]:
#And replacing values with True/False
test.replace(cleanup_nums, inplace=True)

#Short analysis
Now we get to the fun part. Even though the task originally was to create a model that could determine if a tweet was Trump's or not, we have also taken it upon ourselves to run a mini-test of whether the GRU is worse than the LSTM.

To do so, we cross-tab each prediction with the correct label, and with each other. This way we can compare their results. In some way, it works the same as with students and exams. First we compare the two students with the correct exam answers, and afterwards we compare how often the two students disagree on a question.

In [None]:
pd.crosstab(test.label, test.LSTM_pred, normalize="columns")

In [None]:
pd.crosstab(test.label, test.GRU_pred, normalize="columns")

In [None]:
pd.crosstab(test.GRU_pred, test.LSTM_pred, normalize="columns")
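With `normalize="columns"`, each column of the crosstab sums to 1, so the cells show how often a given prediction was right, not the overall share of correct rows. A toy sketch of the difference; the data, the 0/1 coding and the names `toy`, `counts`, `accuracy` and `by_column` are made up for illustration.

```python
import pandas as pd

# Toy labels and predictions, coded as 1 (real) and 0 (generated).
toy = pd.DataFrame({
    "label": [1, 1, 0, 0, 1],
    "pred":  [1, 0, 0, 0, 1],
})

# Raw counts: rows are the true labels, columns the predictions.
counts = pd.crosstab(toy["label"], toy["pred"])

# Overall accuracy: share of rows where prediction and label agree.
accuracy = (toy["label"] == toy["pred"]).mean()

# Column-normalized table: each predicted class sums to 1.
by_column = pd.crosstab(toy["label"], toy["pred"], normalize="columns")

print(counts)
print(accuracy)
print(by_column)
```

Averaging the diagonal of the column-normalized table gives a slightly different number than the overall accuracy whenever the two predicted classes differ in size.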

#The big wrap-up

From the crosstabs we get that the LSTM is correct 76.8% of the time (False/False + True/True divided by 2) and the GRU model is correct 75.4% of the time. Because the crosstabs are normalized by column, these figures are the average share of correct predictions within each predicted class rather than the overall accuracy. This, however, doesn't necessarily mean that the GRU model is the worse of the two, because that number is influenced by how often the model has predicted a false tweet to be false. If we only wanted a model that could predict the tweets right, the GRU actually performs slightly better than the LSTM model.

However, as the LSTM is more correct in general, I would choose that one, but you cannot dismiss the GRU model either. Interestingly enough, the models agree on the tweets 93% of the time. In theory we could go even further down the rabbit hole and explore the tweets they disagree on, and perhaps do some NLP to look for topics or whatnot. Unfortunately, that is for another time.

Link to the colab:

https://bit.ly/3mrwHp9

In [None]:
!jupyter nbconvert --to html "/content/M3A1.ipynb"