**Dataset origin:** https://www.unb.ca/cic/datasets/truthseeker-2023.html

*S. Dadkhah, X. Zhang, A. G. Weismann, A. Firouzi and A. A. Ghorbani, "The Largest Social Media Ground-Truth Dataset for Real/Fake Content: TruthSeeker," in IEEE Transactions on Computational Social Systems, pp. 1-15, Oct. 2023.*

In [1]:
import nltk
import pandas as pd
import numpy as np
from embedder import Word2VecEmbedder
from sklearn.model_selection import train_test_split
import torch



In [2]:
PATH_TO_FILE = "/Users/mikhailleontev/PycharmProjects/Attestation/TruthSeeker2023/Truth_Seeker_Model_Dataset.csv"
df = pd.read_csv(PATH_TO_FILE)
print(df.shape)
df.head()

(134198, 9)


Unnamed: 0.1,Unnamed: 0,author,statement,target,BinaryNumTarget,manual_keywords,tweet,5_label_majority_answer,3_label_majority_answer
0,0,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@POTUS Biden Blunders - 6 Month Update\n\nInfl...,Mostly Agree,Agree
1,1,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@S0SickRick @Stairmaster_ @6d6f636869 Not as m...,NO MAJORITY,Agree
2,2,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",THE SUPREME COURT is siding with super rich pr...,Agree,Agree
3,3,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@POTUS Biden Blunders\n\nBroken campaign promi...,Mostly Agree,Agree
4,4,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@OhComfy I agree. The confluence of events rig...,Agree,Agree


### Preparation of statements (news related to tweets)

In [3]:
df['statement'].unique()[:10]

array(['End of eviction moratorium means millions of Americans could lose their housing in the middle of a pandemic.',
       'The Trump administration worked to free 5,000 Taliban prisoners.',
       'In Afghanistan, over 100 billion dollars spent on military contracts.',
       'A photo shows two COVID-19 patients lying on the floor awaiting treatment in Florida.',
       'Its been over 50 years since minimum (wage) and inflation parted ways, then over a decade since the federal minimum went up at all.',
       'We have a record 9.3 million job openings in the U.S.',
       'Since 1978, CEO compensation rose over 1,000% and only 11.9% for average workers.',
       'Wisconsins 2019-21 budget produced the first positive general fund balance since 2000, and the governors proposed 2021-23 budget would return it to a deficit.',
       'Opposition to having a fully elected Chicago Board of Education is in the super minority.',
       'We now have more job openings than we do people who are

In [4]:
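# Encode each distinct statement as an integer class id, in order of first appearance.
# pd.factorize does this in one pass; it assumes no missing statements (NaN would map to -1).
class_ids, unique_statements = pd.factorize(df['statement'])
statements = pd.Series(class_ids, index=df.index, name='statement')
codes = dict(enumerate(unique_statements))  # class id -> statement text
print(codes[20])
NUMBER_OF_CLASSES = len(codes)
print(NUMBER_OF_CLASSES)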

The president's own FBI director said that the greatest domestic terrorist threat is white supremacists.
1058




### Preparation of tweets

In [5]:
tweets = df['tweet'].copy()
tweets.head(10)

0    @POTUS Biden Blunders - 6 Month Update\n\nInfl...
1    @S0SickRick @Stairmaster_ @6d6f636869 Not as m...
2    THE SUPREME COURT is siding with super rich pr...
3    @POTUS Biden Blunders\n\nBroken campaign promi...
4    @OhComfy I agree. The confluence of events rig...
5    I've said this before, but it really is incred...
6    As many face backlogged rent payments, America...
7    @Thomas1774Paine @JoeBiden\n#DOJ@TheJusticeDep...
8    @SocialismIsDone @TheeKHiveQueenB Its a win fo...
9    @daysofarelives2 @Sen_JoeManchin There is not ...
Name: tweet, dtype: object

In [7]:
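# nltk.word_tokenize needs the Punkt tokenizer data, fetched once with nltk.download
# ('punkt', or 'punkt_tab' in newer NLTK releases).
tokenized_tweets = []
for i, tweet in enumerate(tweets):
    tokenized_tweets.append(nltk.word_tokenize(tweet))
    if i % 10000 == 0:
        print(f'Tokenized {i} tweets')  # progress report
print(tokenized_tweets[:5])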

Tokenized 0 tweets
Tokenized 10000 tweets
Tokenized 20000 tweets
Tokenized 30000 tweets
Tokenized 40000 tweets
Tokenized 50000 tweets
Tokenized 60000 tweets
Tokenized 70000 tweets
Tokenized 80000 tweets
Tokenized 90000 tweets
Tokenized 100000 tweets
Tokenized 110000 tweets
Tokenized 120000 tweets
Tokenized 130000 tweets
[['@', 'POTUS', 'Biden', 'Blunders', '-', '6', 'Month', 'Update', 'Inflation', ',', 'Delta', 'mismanagement', ',', 'COVID', 'for', 'kids', ',', 'Abandoning', 'Americans', 'in', 'Afghanistan', ',', 'Arming', 'the', 'Taliban', ',', 'S.', 'Border', 'crisis', ',', 'Breaking', 'job', 'growth', ',', 'Abuse', 'of', 'power', '(', 'Many', 'Exec', 'Orders', ',', '$', '3.5T', 'through', 'Reconciliation', ',', 'Eviction', 'Moratorium', ')', '...', 'what', 'did', 'I', 'miss', '?'], ['@', 'S0SickRick', '@', 'Stairmaster_', '@', '6d6f636869', 'Not', 'as', 'many', 'people', 'are', 'literally', 'starving', 'and', 'out', 'in', 'the', 'streets', 'as', 'they', 'were', 'in', 'the', '19th', 

In [8]:
tweet_lengths = [len(tokens) for tokens in tokenized_tweets]
print(f'Mean length of tokenized tweets: {np.mean(tweet_lengths)}')
print(f'Median tweet length: {np.median(tweet_lengths)}')
print(f'Max length of tokenized tweets: {max(tweet_lengths)}')
print(f'Min length of tokenized tweets: {min(tweet_lengths)}')


Mean length of tokenized tweets: 42.12054576074159
Median tweet length: 44.0
Max length of tokenized tweets: 174
Min length of tokenized tweets: 1
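
The statistics above (median 44 tokens, maximum 174) suggest that a fixed sequence length somewhat above the median would truncate only the long tail. One way to check this before fixing a cap is to measure what fraction of tweets fit under each candidate length. The sketch below is illustrative only: `coverage_at_cap` is a hypothetical helper and `toy_lengths` is made-up data standing in for `tweet_lengths`.

```python
import numpy as np


def coverage_at_cap(lengths, cap):
    """Return the fraction of sequences whose length is <= cap (not truncated)."""
    lengths = np.asarray(lengths)
    return float(np.mean(lengths <= cap))


# Toy token counts standing in for tweet_lengths (hypothetical values).
toy_lengths = [1, 12, 30, 44, 44, 47, 52, 60, 90, 174]

for cap in (40, 50, 64):
    share = coverage_at_cap(toy_lengths, cap)
    print(f"cap={cap}: {share:.0%} of sequences fit without truncation")

# Percentiles answer the same question from the other direction.
print(np.percentile(toy_lengths, [50, 90, 95]))
```

In the notebook the same call would be `coverage_at_cap(tweet_lengths, 50)`.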


In [9]:
CAP_LENGTH = 50
tweets_capped = [tokens[:CAP_LENGTH] for tokens in tokenized_tweets]
print(tweets_capped[:3])

[['@', 'POTUS', 'Biden', 'Blunders', '-', '6', 'Month', 'Update', 'Inflation', ',', 'Delta', 'mismanagement', ',', 'COVID', 'for', 'kids', ',', 'Abandoning', 'Americans', 'in', 'Afghanistan', ',', 'Arming', 'the', 'Taliban', ',', 'S.', 'Border', 'crisis', ',', 'Breaking', 'job', 'growth', ',', 'Abuse', 'of', 'power', '(', 'Many', 'Exec', 'Orders', ',', '$', '3.5T', 'through', 'Reconciliation', ',', 'Eviction', 'Moratorium', ')'], ['@', 'S0SickRick', '@', 'Stairmaster_', '@', '6d6f636869', 'Not', 'as', 'many', 'people', 'are', 'literally', 'starving', 'and', 'out', 'in', 'the', 'streets', 'as', 'they', 'were', 'in', 'the', '19th', 'century', '.', 'Isnt', 'capitalism', 'grand', '?', 'Meanwhile', ',', 'were', 'facing', 'an', 'eviction', 'moratorium', 'threatening', 'to', 'make', 'millions', 'of', 'Americans', 'homeless', '.', 'Fuck', 'off', 'with', 'this', 'corporatist'], ['THE', 'SUPREME', 'COURT', 'is', 'siding', 'with', 'super', 'rich', 'property', 'owners', 'and', 'over', 'poor', 'str

In [10]:
# Word2Vec hyperparameters
VECTOR_SIZE = 64
WINDOW = 5
WORKERS = 4

In [11]:
model_vec = Word2VecEmbedder(tweets_capped, vector_size=VECTOR_SIZE, window=WINDOW, min_count=1, workers=WORKERS)

In [14]:
print(len(model_vec.embed(tweets_capped[0])))

50


In [15]:
embedded_tweets = []
for tokens in tweets_capped:
    tweet_vector = model_vec.embed(tokens)
    # Pad with zero vectors if tweet is shorter than CAP_LENGTH
    while len(tweet_vector) < CAP_LENGTH:
        tweet_vector.append(np.zeros(VECTOR_SIZE))
    embedded_tweets.append(tweet_vector)
embedded_tweets = np.array(embedded_tweets) # turn into numpy array
print(embedded_tweets.shape)

(134198, 50, 64)
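
An array of shape (134198, 50, 64) takes about 3.4 GB if it is stored as float64 (134198 × 50 × 64 × 8 bytes), which is what NumPy produces when float64 zero vectors from `np.zeros` are mixed in. The sketch below shows one possible alternative: writing into a preallocated float32 array, which halves the footprint, and recording which positions are padding. `pad_embeddings` and the toy vectors are hypothetical, and the sketch only assumes that each tweet is a list of equal-length vectors.

```python
import numpy as np


def pad_embeddings(sequences, cap_length, vector_size):
    """Stack variable-length lists of vectors into (n, cap_length, vector_size).

    Returns the zero-padded float32 array and a boolean mask that is True
    at padded positions.
    """
    n = len(sequences)
    padded = np.zeros((n, cap_length, vector_size), dtype=np.float32)
    is_padding = np.ones((n, cap_length), dtype=bool)
    for row, vectors in enumerate(sequences):
        vectors = vectors[:cap_length]  # truncate anything longer than the cap
        length = len(vectors)
        if length:
            padded[row, :length] = np.asarray(vectors, dtype=np.float32)
            is_padding[row, :length] = False
    return padded, is_padding


# Toy data: three "tweets" with 2, 5 and 3 token vectors of size 4.
rng = np.random.default_rng(0)
toy = [[rng.normal(size=4) for _ in range(k)] for k in (2, 5, 3)]
padded, is_padding = pad_embeddings(toy, cap_length=4, vector_size=4)
print(padded.shape, padded.dtype)
print(is_padding.sum(axis=1))  # number of padded positions per sequence
```

The mask is optional for the model used here, but it makes it possible to tell real tokens from padding later on.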


In [50]:
TEST_SIZE = 0.1
RANDOM_STATE = 42
SHUFFLE = True
train_x, test_x, train_y, test_y = train_test_split(embedded_tweets, statements, test_size=TEST_SIZE, random_state=RANDOM_STATE, shuffle=SHUFFLE)
print(f'Train shape: {train_x.shape}, Test shape: {test_x.shape}')
print(f'Train labels shape: {train_y.shape}, Test labels shape: {test_y.shape}')

Train shape: (120778, 50, 64), Test shape: (13420, 50, 64)
Train labels shape: (120778,), Test labels shape: (13420,)
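
With 1058 statement classes and a shuffled 10% test split, nothing guarantees that every class in the test set also occurs in the training set. A quick check is sketched below; `classes_missing_from_train` is a hypothetical helper and the label arrays are toy data. Passing `stratify=` to `train_test_split` is the usual way to keep class proportions equal across both splits, but it raises an error when any class has fewer than two samples, so it may not apply here without first inspecting the class counts.

```python
import numpy as np


def classes_missing_from_train(train_labels, test_labels):
    """Return the sorted class ids that occur in test_labels but never in train_labels."""
    return np.setdiff1d(np.unique(test_labels), np.unique(train_labels))


# Toy labels: class 3 appears only in the test split.
toy_train = np.array([0, 0, 1, 2, 2, 2])
toy_test = np.array([1, 3, 3])
missing = classes_missing_from_train(toy_train, toy_test)
print(missing)

# Per-class sample counts help spot classes too small to stratify on.
print(np.bincount(np.concatenate([toy_train, toy_test])))
```

In the notebook the call would be `classes_missing_from_train(train_y, test_y)`.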


### Define torch model

In [69]:
# define torch model
# Hyperparameters
EPOCHS = 25
BATCH_SIZE = 32
LEARNING_RATE = 0.001
NUMBER_OF_HEADS = 4
DROPOUT_RATE = 0.3
class TweeterClassifierStatements(torch.nn.Module):
    """Self-attention over token embeddings, then a two-layer classifier on the flattened sequence."""

    def __init__(self):
        super().__init__()
        # batch_first=True lets the layer take (batch_size, seq_len, embed_dim) directly,
        # so no permute is needed in forward()
        self.self_attention = torch.nn.MultiheadAttention(embed_dim=VECTOR_SIZE, num_heads=NUMBER_OF_HEADS, batch_first=True)
        self.fc1 = torch.nn.Linear(VECTOR_SIZE * CAP_LENGTH, 256)
        self.relu1 = torch.nn.ReLU()
        self.dropout = torch.nn.Dropout(DROPOUT_RATE)
        self.fc2 = torch.nn.Linear(256, NUMBER_OF_CLASSES)

    def forward(self, x):
        # x: (batch_size, CAP_LENGTH, VECTOR_SIZE)
        attn_output, _ = self.self_attention(x, x, x)
        attn_output = attn_output.reshape(attn_output.size(0), -1)  # Flatten to (batch_size, CAP_LENGTH * VECTOR_SIZE)
        x = self.fc1(attn_output)
        x = self.relu1(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x  # raw logits; CrossEntropyLoss applies log-softmax itself
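
Before training, it can help to push a random batch through the network and confirm the output shape. The sketch below rebuilds the same layer layout in a hypothetical class, `AttentionClassifierSketch`, with the sizes passed in as arguments. It also accepts an optional padding mask, which is an addition not present in the notebook: `key_padding_mask` stops padded positions from being attended to as keys.

```python
import torch


class AttentionClassifierSketch(torch.nn.Module):
    """Attention + two linear layers, with sizes as arguments and an optional padding mask."""

    def __init__(self, embed_dim, seq_len, num_heads, num_classes, dropout=0.3):
        super().__init__()
        self.attention = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.fc1 = torch.nn.Linear(embed_dim * seq_len, 256)
        self.dropout = torch.nn.Dropout(dropout)
        self.fc2 = torch.nn.Linear(256, num_classes)

    def forward(self, x, is_padding=None):
        # x: (batch, seq_len, embed_dim)
        # is_padding: bool tensor (batch, seq_len), True where the position is padding
        attn_output, _ = self.attention(x, x, x, key_padding_mask=is_padding)
        flat = attn_output.flatten(start_dim=1)  # (batch, seq_len * embed_dim)
        hidden = self.dropout(torch.relu(self.fc1(flat)))
        return self.fc2(hidden)  # raw logits


# Sizes mirror the notebook: 64-dim vectors, 50 tokens, 4 heads, 1058 classes.
sketch_model = AttentionClassifierSketch(embed_dim=64, seq_len=50, num_heads=4, num_classes=1058)
x = torch.randn(8, 50, 64)
is_padding = torch.zeros(8, 50, dtype=torch.bool)
is_padding[:, 40:] = True  # pretend the last 10 positions of every tweet are padding
logits = sketch_model(x, is_padding)
print(logits.shape)  # torch.Size([8, 1058])
```

A sequence that consists entirely of padding would leave attention with no valid keys, so the mask should only be used when every sequence has at least one real token.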


In [70]:
# Initialize model, loss function, and optimizer
model = TweeterClassifierStatements()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)


In [71]:
# Training loop
for epoch in range(EPOCHS):
    model.train()  # enable dropout
    permutation = np.random.permutation(train_x.shape[0])  # reshuffle the sample order every epoch
    epoch_loss = 0.0
    for i in range(0, train_x.shape[0], BATCH_SIZE):
        indices = permutation[i:i+BATCH_SIZE]
        batch_x, batch_y = train_x[indices], train_y.iloc[indices]
        batch_x_tensor = torch.tensor(batch_x, dtype=torch.float32)
        batch_y_tensor = torch.tensor(batch_y.values, dtype=torch.long)
        optimizer.zero_grad()
        outputs = model(batch_x_tensor)
        loss = criterion(outputs, batch_y_tensor)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()  # each term is a batch mean, so the total is a sum over batches
    print(f'Epoch {epoch+1}/{EPOCHS}, Loss: {epoch_loss}')

Epoch 1/25, Loss: 8611.50347456336
Epoch 2/25, Loss: 4244.128017768264
Epoch 3/25, Loss: 3265.8554413989186
Epoch 4/25, Loss: 2807.796820972115
Epoch 5/25, Loss: 2459.625379288569
Epoch 6/25, Loss: 2204.647118275985
Epoch 7/25, Loss: 2092.054299581796
Epoch 8/25, Loss: 2003.322446545586
Epoch 9/25, Loss: 1905.2302094791085
Epoch 10/25, Loss: 1874.3509437348694
Epoch 11/25, Loss: 1876.5727614364587
Epoch 12/25, Loss: 1804.8124699557666
Epoch 13/25, Loss: 1795.346362634562
Epoch 14/25, Loss: 1853.4327622661367
Epoch 15/25, Loss: 1838.3827989145648
Epoch 16/25, Loss: 1838.0893814766896
Epoch 17/25, Loss: 1824.9691503612266
Epoch 18/25, Loss: 1881.1593841760186
Epoch 19/25, Loss: 1863.1354634733289
Epoch 20/25, Loss: 1885.4942440406303
Epoch 21/25, Loss: 1920.3143676241161
Epoch 22/25, Loss: 1870.282447234611
Epoch 23/25, Loss: 1917.7542819452065
Epoch 24/25, Loss: 1953.6791880533565
Epoch 25/25, Loss: 1937.7106809751567
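
The values printed above are sums of per-batch mean losses over 3775 batches (120778 samples at batch size 32), so epoch 1 corresponds to roughly 2.28 per batch and the minimum at epoch 13 to roughly 0.48. The summed training loss stops improving around that point and drifts upward afterwards. Training loss alone does not show how the model behaves on unseen tweets, so a held-out evaluation is a natural next step. The sketch below is one way to do it: `evaluate` is a hypothetical helper, and the toy model and data only stand in for `model`, `test_x` and `test_y`.

```python
import numpy as np
import torch


def evaluate(model, criterion, features, labels, batch_size=256):
    """Return (mean loss per sample, accuracy) over NumPy arrays, without tracking gradients."""
    model.eval()  # disable dropout
    total_loss, correct = 0.0, 0
    with torch.no_grad():
        for start in range(0, len(features), batch_size):
            x = torch.as_tensor(features[start:start + batch_size], dtype=torch.float32)
            y = torch.as_tensor(labels[start:start + batch_size], dtype=torch.long)
            logits = model(x)
            total_loss += criterion(logits, y).item() * len(y)  # undo the per-batch mean
            correct += (logits.argmax(dim=1) == y).sum().item()
    return total_loss / len(features), correct / len(features)


# Toy stand-ins: 10 samples of shape (5, 4), 3 classes, an untrained linear model.
torch.manual_seed(0)
toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(5 * 4, 3))
toy_x = np.random.default_rng(0).normal(size=(10, 5, 4))
toy_y = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
mean_loss, accuracy = evaluate(toy_model, torch.nn.CrossEntropyLoss(), toy_x, toy_y, batch_size=4)
print(f"mean loss {mean_loss:.3f}, accuracy {accuracy:.2%}")
```

In the notebook the call would look like `evaluate(model, criterion, test_x, test_y.to_numpy())`, since `test_y` is a pandas Series.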
