
We begin by loading the relevant libraries and the dataset.

In [None]:
import pandas as pd
import keras
from keras import layers
from keras.preprocessing.text import Tokenizer
import numpy as np
import json

In [None]:
data = pd.read_json("https://github.com/SDS-AAU/SDS-master/raw/e2c959494d53859c1844604bed09a28a21566d0f/M3/assignments/trump_vs_GPT2.gz")

The dataset has 14736 samples and 2 columns. Half of the tweets are real (True) and the other half are fake (False).

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14736 entries, 0 to 14735
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       14736 non-null  object
 1   1       14736 non-null  bool  
dtypes: bool(1), object(1)
memory usage: 244.6+ KB


In [None]:
data.iloc[:,1].value_counts()

True     7368
False    7368
Name: true/false, dtype: int64

In [None]:
data.head()

,tweet,true/false
0,I was thrilled to be back in the Great city of...,True
1,The Unsolicited Mail In Ballot Scam is a major...,True
2,"As long as I am President, I will always stand...",True
3,"Our Economy is doing great, and is ready to se...",True
4,If I do not sound like a typical Washington po...,True


I tried preprocessing the text the same way we did in the NLP assignment: I removed stopwords, numbers and punctuation, and stemmed the text.

None of this made a difference to the model's score, so I decided to leave it out. I realise it looks like I skipped preprocessing almost entirely, but I don't see a reason to include it just for the sake of it.

In [None]:
data = data.rename(columns={0: "tweet", 1: "true/false"})

We split the data into a training set and a test set, so we can evaluate the model on the test data later. By default, train_test_split holds out 25% of the samples for testing.
The X values are the tweets and the y values are the "true/false" (real/fake) labels.

In [None]:
from sklearn.model_selection import train_test_split
X = data.tweet
y = data["true/false"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

We create a tokenizer and fit it on the text of all the tweets. Then we convert each tweet to a sequence of integers, where each integer is the word's index in the tokenizer's vocabulary.

In [None]:
tokenizer = Tokenizer(num_words = 20000)
tokenizer.fit_on_texts(data.tweet)
sequences_train = tokenizer.texts_to_sequences(X_train.tweet)
sequences_test = tokenizer.texts_to_sequences(X_test.tweet)

This is how a tweet looks as a sequence. Every number represents a word.

In [None]:
sequences_train[8]

[131, 18, 96, 45, 10, 714, 18, 103, 923, 2, 17, 5, 28, 100, 102]

In [None]:
# Look up the word behind each index of the sequence shown above.
for idx in sequences_train[8]:
    print(tokenizer.index_word[idx])

again
with
china
as
our
relationship
with
them
continues
to
be
a
very
good
one
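
To make the mapping from words to numbers more concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: the helper names (build_word_index, texts_to_sequences) and the toy tweets are made up for illustration, and the sketch only lowercases and splits on whitespace, whereas the Keras Tokenizer also filters punctuation.

```python
from collections import Counter


def build_word_index(texts, num_words=None):
    """Map each word to an integer rank, most frequent word first (rank 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    word_index = {word: rank for rank, word in enumerate(ranked, start=1)}
    if num_words is not None:
        # Assumption mirroring the num_words cut-off: keep ranks below num_words.
        word_index = {w: r for w, r in word_index.items() if r < num_words}
    return word_index


def texts_to_sequences(texts, word_index):
    """Replace each known word by its rank; unknown words are dropped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


# Hypothetical toy data, not taken from the dataset.
toy_tweets = ["we will win", "we will build", "we win again"]
word_index = build_word_index(toy_tweets)
sequences = texts_to_sequences(toy_tweets, word_index)
index_word = {rank: word for word, rank in word_index.items()}

print(sequences)
print([index_word[i] for i in sequences[2]])
```

Frequent words get small numbers, which is why common words such as "with" and "to" have low indices in the real sequence above.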


Now we build our RNN model. RNNs work well with sequential data, and text is sequential: the meaning of a word depends on the words that came before it in the sentence.
More specifically, we use an LSTM, which is good at "remembering" relevant information over long spans and can therefore pick up more complex dependencies in sequential data.

In [None]:
# The shape is set to None because it allows for variable-length sequences of integers.
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer in a 128-dimensional vector. Each time the model sees a word it changes the vectors slightly, so over time it learns how certain words are connected.
# 20000 is the number of words in our dictionary, which we defined earlier.
x = layers.Embedding(20000, 128)(inputs)
# Adding 2 bidirectional LSTMs. A bidirectional layer, as the name suggests, reads the sequence both forward and backward. Since the information stored in the
# long-term memory differs depending on the direction, a bidirectional LSTM will hopefully find more complex patterns.
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
# Adding a classifier using sigmoid.
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
# binary_crossentropy is the loss the model minimizes during training; it penalizes confident wrong predictions the most.
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.summary()

Model: "functional_27"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_14 (InputLayer)        [(None, None)]            0         
_________________________________________________________________
embedding_13 (Embedding)     (None, None, 128)         2560000   
_________________________________________________________________
bidirectional_18 (Bidirectio (None, None, 128)         98816     
_________________________________________________________________
bidirectional_19 (Bidirectio (None, 128)               98816     
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 129       
Total params: 2,757,761
Trainable params: 2,757,761
Non-trainable params: 0
_________________________________________________________________
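
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are my own and not part of Keras.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


embedding = embedding_params(20000, 128)
bi_lstm_1 = 2 * lstm_params(128, 64)  # forward LSTM + backward LSTM
bi_lstm_2 = 2 * lstm_params(128, 64)  # input is the 2 * 64 = 128-wide output of the first layer
dense = dense_params(128, 1)
total = embedding + bi_lstm_1 + bi_lstm_2 + dense

print(embedding, bi_lstm_1, bi_lstm_2, dense, total)
# prints: 2560000 98816 98816 129 2757761
```

Almost all of the parameters sit in the embedding layer, so the size of the vocabulary and of the embedding vectors dominates the model size.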


Before we train, we need to pad the sequences, which basically means adding 0's so that all sequences have the same length. Since I didn't specify a max length, each set is padded to the length of its own longest sequence.

In [None]:
X_train_pad = keras.preprocessing.sequence.pad_sequences(sequences_train)
X_test_pad = keras.preprocessing.sequence.pad_sequences(sequences_test)
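
As an illustration of what the padding does, here is a small pure-Python sketch. It assumes the Keras defaults, where zeros are added in front of the sequence and sequences that are too long are cut from the front; pad_pre is a hypothetical helper, not the Keras function, and it expects a positive maxlen.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and, if needed, left-truncate) sequences to a common length."""
    if maxlen is None:
        # No max length given: use the longest sequence in this set.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# Hypothetical toy sequences.
toy = [[131, 18, 96], [45, 10], [714]]
print(pad_pre(toy))
# prints: [[131, 18, 96], [0, 45, 10], [0, 0, 714]]
print(pad_pre(toy, maxlen=2))
# prints: [[18, 96], [45, 10], [0, 714]]
```

Because the training and test sets are padded in separate calls, they can end up with different widths. That works here because the model's input shape is (None,), and passing the same maxlen to both pad_sequences calls would make the widths equal.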

We fit the model on the training data.
I tried many different combinations, but in my experience the model overfits very quickly, so it gets the best results with relatively short training.

In [None]:
model.fit(X_train_pad, y_train, batch_size=200, epochs=2, validation_data=(X_test_pad, y_test))

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f235a9f0f98>

The model reaches an accuracy of 0.88 on the test set.

In [None]:
model.evaluate(X_test_pad, y_test, batch_size=32)



[0.306826114654541, 0.8800216913223267]
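
The two numbers returned by evaluate are the loss (binary cross-entropy) and the accuracy. The sketch below shows how both can be computed from predicted probabilities. The four labels and probabilities are made up for illustration, and the function names are my own.

```python
import math


def binary_cross_entropy(y_true, y_prob):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_prob)
    ]
    return sum(losses) / len(losses)


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples where the thresholded probability matches the label."""
    hits = [(p > threshold) == bool(y) for y, p in zip(y_true, y_prob)]
    return sum(hits) / len(hits)


# Hypothetical labels (1 = real, 0 = fake) and predicted probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.1]

loss = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(loss, acc)
```

The third sample is the only misclassified one (0.4 for a real tweet), and it also contributes the largest share of the loss.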

Next we get the predictions to make a brief analysis on the data.

In [None]:
trainPredict = model.predict(X_train_pad)
testPredict = model.predict(X_test_pad)

In [None]:
test_predict = pd.DataFrame(testPredict)

In [None]:
trainPredict

array([[0.10489381],
       [0.4388481 ],
       [0.04345336],
       ...,
       [0.9840471 ],
       [0.96573776],
       [0.9877815 ]], dtype=float32)

These are the predictions our model made. A score above 0.5 counts as True (real) and anything else as False (fake).

In [None]:
test_predict = test_predict.rename(columns={0:"value"})
test_predict

,value
0,0.980559
1,0.012798
2,0.011221
3,0.012102
4,0.983834
...,...
3679,0.059184
3680,0.031930
3681,0.969952
3682,0.844956


We merge the predicted values onto the test dataset. We have to reset the index first, otherwise the rows don't line up correctly.

In [None]:
merged = pd.merge(pd.DataFrame(X_test).reset_index(),test_predict,left_index=True,right_index=True)

In [None]:
merged

,index,tweet,value
0,6862,Nolte: Poll Shows Media Failed to Gaslight Pub...,0.980559
1,13433,"We are working with him, along with many other...",0.012798
2,14303,"It is called the First Step Act of 2017, and i...",0.011221
3,13557,We must ALL MAKE AMERICA GREAT AGAIN! #DemDeba...,0.012102
4,6975,“Proclamation on Recognizing the Golan Heights...,0.983834
...,...,...,...
3679,2372,"Thank you for your support &amp, friendship- G...",0.059184
3680,12301,"We have lost 4,000 Military members in Afghani...",0.031930
3681,6192,"....Mainstream Media, which has lost all credi...",0.969952
3682,8583,The Fake News Media is in a constant state of ...,0.844956


Then we merge the true/false column back on.

In [None]:
predictions = pd.merge(merged,y,left_on="index",right_index=True)

To get the predictions in the same format as the labels, we add a new column that is True for all values above 0.5.

In [None]:
predictions["prediction"] = predictions["value"] > 0.5

In [None]:
predictions[210:220]

,index,tweet,value,true/false,prediction
210,7250,Former FBI top lawyer James Baker just admitte...,0.942338,True,True
211,542,Speaker Pelosi and Chuck Schumer’s drive to tr...,0.984428,True,True
212,11901,J. Trump’s Anger And Anger Over Russia Investi...,0.975224,False,True
213,10821,Your plan to close the American... only reason...,0.972152,False,True
214,13940,"Great book, especially since he totally exoner...",0.056008,False,False
215,6526,The United States stands ready to work with to...,0.96834,True,True
216,9202,", IN THE U.S.S. Capitol, in front of 5,000 har...",0.019221,False,False
217,6569,Departed the and am now on Air Force One with ...,0.964526,True,True
218,14112,"“It is true, this is the biggest political sca...",0.013796,False,False
219,5727,....story about me and a perfectly fine and ro...,0.966546,True,True


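We already knew how good the model is overall. This crosstab is normalized by column, so it shows how often each kind of prediction is correct: about 91% of the tweets predicted as fake really are fake, and about 85% of the tweets predicted as real really are real. The model is almost equally reliable for both, but its fake predictions are a little more trustworthy.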

In [None]:
pd.crosstab(predictions["true/false"],predictions.prediction,normalize="columns")

true/false,prediction=False,prediction=True
False,0.914627,0.14883
True,0.085373,0.85117
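
Note that normalize="columns" divides every column by its total, so the table answers "of the tweets predicted as X, what share actually is X". Dividing by the row totals instead (normalize="index" in pd.crosstab) answers "of the tweets that actually are X, what share was predicted as X", and the two views can rank the classes differently. Here is a small pure-Python sketch of the difference, with made-up labels and hypothetical helper names:

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count each (actual, predicted) label pair."""
    return Counter(zip(actual, predicted))


def normalize(counts, by):
    """by='columns' divides by predicted-label totals, by='index' by actual-label totals."""
    pos = 1 if by == "columns" else 0
    totals = Counter()
    for key, n in counts.items():
        totals[key[pos]] += n
    return {key: n / totals[key[pos]] for key, n in counts.items()}


# Hypothetical toy labels: 5 real tweets (True) and 5 fake tweets (False).
actual = [True] * 5 + [False] * 5
predicted = [True, True, True, True, False, True, True, False, False, False]

counts = confusion_counts(actual, predicted)
by_columns = normalize(counts, "columns")  # keys are (actual, predicted)
by_index = normalize(counts, "index")

print(by_columns[(True, True)], by_columns[(False, False)])
print(by_index[(True, True)], by_index[(False, False)])
```

In this toy example the fake predictions are correct more often than the real predictions (column view), while a larger share of the real tweets than of the fake tweets is found (row view).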


Lastly we make two datasets: one with the right predictions and one with the wrong predictions.

In [None]:
wrong_predict = predictions[predictions["true/false"]!=predictions.prediction]
right_predict = predictions[predictions["true/false"]==predictions.prediction]

Looking at the value column, the wrong predictions seem to be closer to 0.5, which indicates that the model is less sure about them.

In [None]:
wrong_predict.head()

,index,tweet,value,true/false,prediction
7,11605,"We mark our service, the sacrifice they make f...",0.783054,False,True
12,10439,- I’m on track to start paying their Farmers ‘...,0.557719,False,True
39,14593,2.2% which is the lowest in 14 years.,0.696554,False,True
46,12463,They hate the fact that the Fed made a big mis...,0.669921,False,True
65,1606,“Federal Court Deals Major Blow To Sanctuary C...,0.477494,True,False


In [None]:
right_predict.head()

,index,tweet,value,true/false,prediction
0,6862,Nolte: Poll Shows Media Failed to Gaslight Pub...,0.980559,True,True
1,13433,"We are working with him, along with many other...",0.012798,False,False
2,14303,"It is called the First Step Act of 2017, and i...",0.011221,False,False
3,13557,We must ALL MAKE AMERICA GREAT AGAIN! #DemDeba...,0.012102,False,False
4,6975,“Proclamation on Recognizing the Golan Heights...,0.983834,True,True


We can see that the wrong predictions do indeed have a lower standard deviation, which is consistent with their values lying closer to 0.5 instead of clustering near 0 and 1.

In [None]:
print(np.std(wrong_predict.value))
print(np.std(right_predict.value))

0.307710736989975
0.43654176592826843
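
The standard deviation only measures spread around the mean, so it is an indirect way to check how sure the model is. A more direct measure, which I suggest here as an alternative, is the average distance from 0.5. The sketch below uses only the five values shown in each of the two head() outputs above, so it is an illustration and not a result for the full test set. mean_confidence is a hypothetical helper.

```python
from statistics import mean


def mean_confidence(values):
    """Average distance from the 0.5 decision threshold (0 = unsure, 0.5 = certain)."""
    return mean(abs(v - 0.5) for v in values)


# The first five values of wrong_predict and right_predict, copied from the tables above.
wrong_values = [0.783054, 0.557719, 0.696554, 0.669921, 0.477494]
right_values = [0.980559, 0.012798, 0.011221, 0.012102, 0.983834]

print(mean_confidence(wrong_values))
print(mean_confidence(right_values))
```

On the full data the same check could be run on the value columns of the two datasets, for example with (wrong_predict.value - 0.5).abs().mean().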


One of the things I tried in order to get a better model was to make it more complex. This one has a larger embedding dimension and more layers. As you can see in the summary, it has roughly twice as many trainable parameters as the other model.

In [None]:
inputs = keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(20000, 248)(inputs)
x = layers.LSTM(64,return_sequences=True)(x)
x = layers.LSTM(64,return_sequences=True)(x)
x = layers.LSTM(64,return_sequences=True)(x)
x = layers.LSTM(64)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.summary()

Model: "functional_29"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_15 (InputLayer)        [(None, None)]            0         
_________________________________________________________________
embedding_14 (Embedding)     (None, None, 248)         4960000   
_________________________________________________________________
lstm_36 (LSTM)               (None, None, 64)          80128     
_________________________________________________________________
lstm_37 (LSTM)               (None, None, 64)          33024     
_________________________________________________________________
lstm_38 (LSTM)               (None, None, 64)          33024     
_________________________________________________________________
lstm_39 (LSTM)               (None, 64)                33024     
_________________________________________________________________
dense_14 (Dense)             (None, 1)               

The result is very similar to that of the previous model.

In [None]:
model.fit(X_train_pad, y_train, batch_size=200, epochs=2, validation_data=(X_test_pad, y_test))

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f234ee94c88>

I tried a few things to improve the model, with no real success. I tried adding more layers and changing the number of units in these layers. I tried bidirectional layers and "one-way" layers. I changed the embedding and padding inputs and tried different batch sizes and numbers of epochs. Some of these things had little to no effect, and others just made the model worse. I could get the model close to 0.9 with "lucky" training runs, but the average seems to be ~0.88.

