# Assignment: Trump vs. GPT-2

## SDS 2020 - Module 3: Individual Assignment


The site [https://faketrump.ai/](https://faketrump.ai/) was an interesting example of AI-powered fake-text generation. Its authors wrote in 2019:

>We built an artificial intelligence model by fine-tuning [GPT-2](https://openai.com/blog/better-language-models/) to generate tweets in the style of Donald Trump’s Twitter account. After seeing the results, we also built a discriminator that can accurately detect fake tweets 77% of the time — think you can beat our classifier? Try it yourself!

Unfortunately, they decided to take down the site and the dataset.

GPT-2 is a transformer-based neural language model that OpenAI announced in February 2019. It generated considerable discussion because OpenAI decided, in contrast to its earlier policies, not to release the model to the public. Their central argument was that the model could too easily be used to produce fake news, spam and the like. The footnote of the faketrump page reads: “Generating realistic fake text has become much more accessible. We hope to highlight the current state of text generation to demonstrate how difficult it is to discern fiction from reality.”


Since then several organizations and researchers have shown that it is [possible to develop systems to detect “fake text”](https://www.theguardian.com/technology/2019/jul/04/ai-fake-text-gpt-2-concerns-false-information). We believe that you too can implement a competitive system.

Having no dataset from that project, Roman decided to retrain GPT-2 to generate new fake Trump tweets. If they can do that, we can do that! However, it seems to be easier for ML models to identify our fake tweets... well... they are an AI company and probably spent more time on it.

> I’ve just watched Democrats scream over and over again about trying to Impeach the President of the United States. The Impeachment process is a sham.

> The Media must understand!“The New York Times is the leader on a very important subject: How to Combat Trump.” @foxandfriendsSo pathetic! @foxandfriendsI don’t think so.

> He is going to do it soon, and with proper borders. Border security is my top priority.The Democrats have failed the people of Arizona in everything else they have done, even their very good immigration laws. They have no sense.

The data can be found [here](https://github.com/SDS-AAU/SDS-master/raw/e2c959494d53859c1844604bed09a28a21566d0f/M3/assignments/trump_vs_GPT2.gz) and has the following format:


<table>
  <tr>
   <td>0
   </td>
   <td>1
   </td>
  </tr>
  <tr>
   <td>string
   </td>
   <td>boolean
   </td>
  </tr>
</table>

There are 7368 real Trump tweets and 7368 fake ones.

You can open it with:



```
import pandas as pd
data = pd.read_json('https://github.com/SDS-AAU/SDS-master/raw/e2c959494d53859c1844604bed09a28a21566d0f/M3/assignments/trump_vs_GPT2.gz')
```



* Split the data and preprocess it, vectorizing the text using different approaches (BoW, TFIDF, LSI)

* Create a system that can identify the fake Trump tweets using LogisticRegression or other classifiers (Sklearn; if you like, also more complex models with FastAI, Keras neural nets or the like)

* Explore a subset (~1000) of the real and fake tweets using LDA and visualize your exploration

* Consider exploring using a different approach (LSI + clustering) or perhaps even [CorEx](https://github.com/gregversteeg/corex_topic)

## Load Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_json('https://github.com/SDS-AAU/SDS-master/raw/e2c959494d53859c1844604bed09a28a21566d0f/M3/assignments/trump_vs_GPT2.gz')

In [3]:
#Rename columns
data.columns = ['tweet','is_real']

In [4]:
#Inspect
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14736 entries, 0 to 14735
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   tweet    14736 non-null  object
 1   is_real  14736 non-null  bool  
dtypes: bool(1), object(1)
memory usage: 244.6+ KB


## Preprocess

In [17]:
# There should not be any HTML in the tweets, so there is no need to apply Roman's regex pattern:
#pattern = re.compile('<br /><br />')

In [5]:
# module to split data into training / test
from sklearn.model_selection import train_test_split

In [6]:
# define in and outputs

X = data['tweet'].values
y = data['is_real'].values

In [7]:
# Split the data into 80% training and 20% test; keep random state 18 for reproducibility

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)

## BoW vectorization

### Logistic regression

In [30]:
# Simple BoW vectorizer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer() #vectorizer to be applied
X_train_vec_bow = vectorizer.fit_transform(X_train) # new BoW vectorization
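
To make the bag-of-words idea concrete, the sketch below counts tokens for two made-up sentences in plain Python. It only mimics what `CountVectorizer` does by default (lowercasing, keeping word tokens of two or more characters, counting per document). The toy sentences and the helper name `bow_matrix` are illustrative assumptions, not part of the assignment data.

```python
import re
from collections import Counter

def bow_matrix(texts):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in texts[i]."""
    # Lowercase and keep word tokens of 2+ characters, similar to CountVectorizer's default
    tokenized = [re.findall(r"\b\w\w+\b", t.lower()) for t in texts]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Toy sentences (hypothetical, not taken from the dataset)
toy_texts = ["Fake news is fake", "The news is great"]
vocab, matrix = bow_matrix(toy_texts)
print(vocab)
print(matrix)
```

Each row is one document and each column one vocabulary term, which is the same layout as `X_train_vec_bow` (there stored as a sparse matrix).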

In [31]:
# Instantiate logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=2000)

In [32]:
# Train the model using "fit"
model.fit(X_train_vec_bow, y_train) #fits the logit model on the training data

LogisticRegression(max_iter=2000)

In [33]:
# Vectorize the test set using the BoW vectorizer fitted on the training data
X_test_vec_bow = vectorizer.transform(X_test)

In [34]:
# Check performance of the model
model.score(X_test_vec_bow, y_test)

0.8063093622795116

In [17]:
# Prediction on new data vs. actual data
y_pred = model.predict(X_test_vec_bow) # predicts for each vector based on x_test
pd.crosstab(y_test, y_pred)

col_0  False  True
row_0
False   1180   309
True     262  1197
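
The crosstab above can be read as a confusion matrix (rows are the actual labels, columns the predictions). As a small worked example, the sketch below derives accuracy, precision and recall from those four counts; the variable names are my own, and "real" is treated as the positive class by assumption.

```python
# Counts copied from the crosstab above (rows = actual, columns = predicted)
tn, fp = 1180, 309   # actual fake: predicted fake / predicted real
fn, tp = 262, 1197   # actual real: predicted fake / predicted real

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_real = tp / (tp + fp)   # share of "real" predictions that are actually real
recall_real = tp / (tp + fn)      # share of real tweets that were recognised as real
recall_fake = tn / (tn + fp)      # share of fake tweets that were recognised as fake

print(f"accuracy={accuracy:.4f}")
print(f"precision(real)={precision_real:.4f} recall(real)={recall_real:.4f}")
print(f"recall(fake)={recall_fake:.4f}")
```

The accuracy computed this way agrees with the value returned by `model.score` above.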


### Random forest classifier (and K-fold crossvalidation)

In [48]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score #K-fold cross-validation

model = RandomForestClassifier(random_state=32) #empty model
scores = cross_val_score(model, X_train_vec_bow, y_train, cv = 5)
print(scores)

[0.8524173  0.86217133 0.85538592 0.85956725 0.86041578]
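
The five fold scores are easier to compare with other models when summarised as a mean and a spread. A minimal sketch, using the numbers printed above (the variable names are my own):

```python
import statistics

# Fold accuracies copied from the cross-validation output above
fold_scores = [0.8524173, 0.86217133, 0.85538592, 0.85956725, 0.86041578]

mean_score = statistics.mean(fold_scores)
std_score = statistics.stdev(fold_scores)  # sample standard deviation across folds

print(f"CV accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

A small spread across folds suggests the score does not depend much on which part of the training data is held out.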


In [49]:
model.fit(X_train_vec_bow, y_train)
print(model.score(X_test_vec_bow, y_test))

0.8629579375848032


RandomForestClassifier()

In [47]:
# Prediction on new data vs. actual data
y_pred = model.predict(X_test_vec_bow) # predicts for each vector based on x_test
pd.crosstab(y_test, y_pred)

col_0  False  True
row_0
False   1257   232
True     174  1285


### Xgboost classifier

See documentation: https://xgboost.readthedocs.io/en/latest/tutorials/model.html

In [52]:
# "Gradient boosted" decision models 
import xgboost as xgb

# binary:logistic is the objective for two-class problems
model = xgb.XGBClassifier(objective="binary:logistic")
scores = cross_val_score(model, X_train_vec_bow, y_train, cv = 5)
print("K-val scores:", scores)

model.fit(X_train_vec_bow, y_train)
print('Model fit:', model.score(X_test_vec_bow, y_test))

K-val scores: [0.8490246  0.84860051 0.84563189 0.83453543 0.85150615]
Model fit: 0.8521031207598372


In [53]:
y_pred = model.predict(X_test_vec_bow) # predicts for each vector based on x_test
pd.crosstab(y_test, y_pred)

col_0  False  True
row_0
False   1236   253
True     183  1276


The ELI5 ("Explain Like I'm 5") library can show which words contribute most to the model's scoring of a tweet:

In [63]:
#import eli5
#eli5.show_weights(model, feature_names=vectorizer.get_feature_names(), target_names=['fake','real'], top=20)

In [64]:
#eli5.show_prediction(model, X_test[3], vec=vectorizer, target_names=['fake','real'])

## TFIDF vectorization and regression

### Simple TFIDF vectorizer

In [21]:
# Import the TFIDF vectorizer, vectorize the training data and create an unfitted model

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_vec_tfidf = vectorizer.fit_transform(X_train)

model = LogisticRegression(max_iter=2000)

# Train the model

model.fit(X_train_vec_tfidf, y_train)

LogisticRegression(max_iter=2000)

In [22]:
# Vectorize the test-set using the tfidf vectorizer
X_test_vec_tfidf = vectorizer.transform(X_test)

In [23]:
# Check performance of the model
model.score(X_test_vec_tfidf, y_test)

0.8171641791044776

In [24]:
# Predict on new data
y_pred = model.predict(X_test_vec_tfidf)
pd.crosstab(y_test, y_pred)

col_0  False  True
row_0
False   1189   300
True     239  1220
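
To show what the TFIDF weighting does to raw counts, here is a plain-Python sketch of the scheme `TfidfVectorizer` applies with its default settings: smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The toy documents and the helper name `tfidf_rows` are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """TF-IDF with smoothed IDF and L2-normalised rows."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, idf, rows

# Toy tokenized documents (hypothetical, not taken from the dataset)
toy_docs = [["fake", "news"], ["great", "news"], ["great", "border", "news"]]
vocab, idf, rows = tfidf_rows(toy_docs)
for term in vocab:
    print(term, round(idf[term], 3))
```

A term that occurs in every document ("news" here) gets the lowest weight, while rarer terms are weighted up.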


### TFIDF vectorizer with spaCy

In [11]:
import spacy
# Import language model can be downloaded for many languages.
nlp = spacy.load('en_core_web_md') #english model trained on a corpus of web-data

In [12]:
# Tokenize the tweets
tokenlist = []
for doc in nlp.pipe(X_train):
  tokens =[tok.text.lower() for tok in doc if tok.pos_ in ['NOUN','ADJ','ADV','VERB'] and not tok.is_stop]
  tokenlist.append(tokens)

In [13]:
#transform tokens into dictionary
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(tokenlist)
# the dictionary is not long, so I see what happens without filtering

In [78]:
vectorizer = TfidfVectorizer(vocabulary=list(dictionary.values()))
X_train_vec_spacy = vectorizer.fit_transform(X_train)

model = LogisticRegression(max_iter=2000)

# Train the model

model.fit(X_train_vec_spacy, y_train)

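# Vectorize the test data with the vectorizer fitted on the training data
# (transform, not fit_transform, so the IDF weights are not refit on the test set;
# the score below was recorded with fit_transform and may differ slightly)
X_test_vec_spacy = vectorizer.transform(X_test)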

# Check performance of the model
model.score(X_test_vec_spacy, y_test)

0.7544097693351425

In [79]:
# Random forest on the same spaCy-vocabulary TFIDF features

model = RandomForestClassifier() #empty model

model.fit(X_train_vec_spacy, y_train)
print(model.score(X_test_vec_spacy, y_test))


0.7747625508819539
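
One way to make sure the vectorizer is only ever fitted on the training data is to wrap it together with the classifier in a scikit-learn pipeline. This is a minimal sketch on made-up texts, not the setup used for the scores above; in the notebook, `X_train`, `y_train` and `X_test` would take the place of the toy lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical); stands in for X_train, y_train and X_test
train_texts = ["fake news media", "great border wall", "fake media again", "great great wall"]
train_labels = [False, True, False, True]
test_texts = ["fake news", "border wall"]

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=2000))
pipe.fit(train_texts, train_labels)      # vocabulary, IDF weights and classifier are fitted on training texts only
predictions = pipe.predict(test_texts)   # test texts are only transformed, never used for fitting
print(predictions)
```

The same pipeline object can be passed to `cross_val_score`, in which case the vectorizer is refitted inside each fold.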


The ELI5 ("Explain Like I'm 5") library can show which words contribute most to the model's scoring of a tweet:

In [63]:
import eli5
#eli5.show_weights(model, feature_names=vectorizer.get_feature_names(), target_names=['fake','real'], top=20)

In [64]:
#eli5.show_prediction(model, X_test[3], vec=vectorizer, target_names=['fake','real'])

## LDA topic modelling

**Task**:
- Explore a subset (~1000) of the real and fake tweets using LDA and visualize your exploration


**Background info:**

Topic modeling is a technique for extracting hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular topic-modeling algorithm with an excellent implementation in Python's Gensim package. The challenge, however, is to extract good-quality topics that are clear, well separated and meaningful.

See more: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/



In [38]:
#create tokens for each tweet (using spacy's nlp)
tokens = []

for tweet in nlp.pipe(data['tweet']):
  tweet_tok = [token.lemma_.lower() for token in tweet if token.pos_ in ['NOUN', 'PROPN', 'ADJ', 'ADV'] and not token.is_stop] 
  tokens.append(tweet_tok)

data['tokens'] = tokens


In [78]:
#Create dictionary
# Import the dictionary builder

from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary

dictionary = Dictionary(data['tokens'])

# filter out low-frequency / high-frequency stuff, also limit the vocabulary to max 1000 words
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=1000)
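
The filtering rule can be illustrated without Gensim. The sketch below mimics the behaviour of `filter_extremes` as I understand it: keep tokens that occur in at least `no_below` documents and in at most a fraction `no_above` of all documents, then keep the `keep_n` most frequent of those. The helper `filter_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, no_below=5, no_above=0.5, keep_n=1000):
    """Keep tokens found in >= no_below docs and in <= no_above (fraction) of docs, then the keep_n most frequent."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each token
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    kept = [t for t, c in df.items() if c >= no_below and c <= no_above * n_docs]
    kept.sort(key=lambda t: (-df[t], t))  # most frequent first, ties in alphabetical order
    return kept[:keep_n]

# Toy tokenized documents (hypothetical)
toy_docs = [["great", "border"], ["great", "wall"], ["great", "border"], ["great", "china"]]
kept = filter_vocabulary(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(kept)
```

Here "great" is dropped because it appears in every document, and "wall" and "china" because they appear in only one.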

In [79]:
#split dataset into two lists of size 1000
real_sample = data.loc[data['is_real'] == True, ['tweet', 'tokens']].sample(1000)
fake_sample = data.loc[data['is_real'] == False, ['tweet', 'tokens']].sample(1000)

In [80]:
# reset index to match corpus
real_sample.index = range(len(real_sample))
fake_sample.index = range(len(fake_sample))

In [81]:
#create corpus for each sample (using the common dictionary): transforms the tokens into vectors
real_corpus = [dictionary.doc2bow(doc) for doc in real_sample['tokens']]
fake_corpus = [dictionary.doc2bow(doc) for doc in fake_sample['tokens']]

I am unsure whether it is best practice to train the model on both samples as a single corpus or to treat them separately. My intuition is that the first option lends itself to comparing the frequency of topics across real and fake tweets, and the second to comparing the content/type of topics.

In [85]:
# first for real sample
#LDA model
from gensim.models import LdaMulticore
# Training the model
lda_model = LdaMulticore(real_corpus, id2word=dictionary, num_topics= 5, workers = 4, passes=10)

In [86]:
lda_model.print_topics(-1) 

[(0,
  '0.029*"fake" + 0.028*"news" + 0.017*"election" + 0.017*"democrats" + 0.017*"country" + 0.015*"media" + 0.015*"states" + 0.013*"law" + 0.012*"%" + 0.012*"united"'),
 (1,
  '0.040*"great" + 0.030*"people" + 0.022*"amp" + 0.017*"year" + 0.013*"president" + 0.011*"american" + 0.011*"democrats" + 0.011*"day" + 0.009*"america" + 0.009*"hunt"'),
 (2,
  '0.020*"border" + 0.019*"great" + 0.017*"amp" + 0.012*"democrats" + 0.011*"president" + 0.011*"good" + 0.010*"obama" + 0.009*"united" + 0.009*"states" + 0.009*"administration"'),
 (3,
  '0.023*"amp" + 0.018*"country" + 0.017*"world" + 0.014*"president" + 0.013*"america" + 0.012*"big" + 0.011*"democrats" + 0.010*"people" + 0.010*"u.s." + 0.010*"china"'),
 (4,
  '0.027*"great" + 0.024*"trump" + 0.022*"amp" + 0.018*"state" + 0.017*"people" + 0.016*"biden" + 0.013*"president" + 0.012*"house" + 0.011*"joe" + 0.010*"night"')]

In [95]:
# To explore another time: perplexity and coherence
#from gensim.models import CoherenceModel
# log_perplexity returns a per-word likelihood bound: higher (closer to zero) is better
#print('\nPerplexity: ', lda_model.log_perplexity(real_corpus))

# Compute the coherence score (c_v needs the tokenized texts, not the BoW corpus)
#coherence_model_lda = CoherenceModel(model=lda_model, texts=real_sample['tokens'].tolist(), dictionary=dictionary, coherence='c_v')
#coherence_lda = coherence_model_lda.get_coherence()
#print('\nCoherence Score: ', coherence_lda)

In [87]:
import pyLDAvis.gensim  # in pyLDAvis 3.x this module is named pyLDAvis.gensim_models
%matplotlib inline
pyLDAvis.enable_notebook()

In [88]:
lda_display = pyLDAvis.gensim.prepare(lda_model, real_corpus, dictionary)
# Let's Visualize
pyLDAvis.display(lda_display)

In [89]:
lda_model_fake = LdaMulticore(fake_corpus, id2word=dictionary, num_topics= 5, workers = 4, passes=10)

In [90]:
lda_display = pyLDAvis.gensim.prepare(lda_model_fake, fake_corpus, dictionary)
# Let's Visualize
pyLDAvis.display(lda_display)

## LSI/LSA/SVD

**Task**:
- Consider exploring using a different approach (LSI + clustering) or perhaps even CorEx


**Background info:**

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by taking the cosine of the angle between the two vectors (or the dot product between the normalizations of the two vectors) formed by any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI) (https://en.wikipedia.org/wiki/Latent_semantic_analysis)

see also: https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
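
The description above can be followed step by step on a tiny example. The sketch below builds a made-up term-document count matrix, truncates its SVD with NumPy and compares documents by cosine similarity. It uses `numpy.linalg.svd` directly rather than Gensim's `LsiModel`, and all values are toy assumptions.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents); values are made up
terms = ["border", "wall", "security", "fake", "news", "media"]
A = np.array([
    [2, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)

# Truncated SVD: keep only the k largest singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional row per document

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_01 = cosine(doc_vectors[0], doc_vectors[1])  # documents that share "border" and "wall"
sim_02 = cosine(doc_vectors[0], doc_vectors[2])  # documents with no terms in common
print(round(sim_01, 3), round(sim_02, 3))
```

In the reduced space the first two documents end up pointing in the same direction, while the third is orthogonal to them.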


In [8]:
# Import the dictionary builder
from gensim.corpora.dictionary import Dictionary

# Import the TfidfModel from Gensim
from gensim.models.tfidfmodel import TfidfModel

# Just like before, we import the model
from gensim.models.lsimodel import LsiModel

# Tooling to map between corpus (gensim) and matrix - more general
from gensim.matutils import corpus2csc, corpus2dense

In [10]:
import nltk
#nltk.download('punkt')
#nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [15]:
#create tokens for each tweet (using spacy's nlp)
tokens = []

for tweet in nlp.pipe(data['tweet']):
  tweet_tok = [token.lemma_.lower() for token in tweet if token.pos_ in ['NOUN', 'PROPN', 'ADJ', 'ADV'] and not token.is_stop] 
  tokens.append(tweet_tok)

data['tokens'] = tokens

# Generate a dictionary and filter
dictionary = Dictionary(data['tokens'])
dictionary.filter_extremes(no_below=5, no_above=0.2)

# construct corpus using dictionary
corpus = [dictionary.doc2bow(doc) for doc in data['tokens']]

In [16]:
# Create and fit a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# transform corpus to TFIDF
corpus_tfidf = tfidf[corpus]

# Training the LSI model
model_lsi = LsiModel(corpus_tfidf, num_topics = 300, id2word=dictionary)


In [17]:
# Project the whole TFIDF corpus into the 300-dimensional LSI space

corpus_lsi = model_lsi[corpus_tfidf]

# turn into matrix
corpus_lsi_matrix = corpus2dense(corpus_lsi, 300 )

corpus_lsi_matrix = corpus_lsi_matrix.T

In [18]:
from annoy import AnnoyIndex

In [19]:
f = 300

t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed

for i in range(len(corpus_lsi_matrix)):
    t.add_item(i, corpus_lsi_matrix[i])

# The index has to be built before it can be queried; 10 trees is an assumed value
t.build(10)

In [21]:
t.get_nns_by_item(0, 10)  # indices of the 10 nearest neighbours of tweet 0
#data['tweet'][t.get_nns_by_item(10, 10)]

[0, 4496, 1235, 2824, 1037, 9887, 9237, 10180, 10269, 2894]

With the code below we can browse the nearest neighbours of a tweet in the dataset. In the few I have browsed, the LSI neighbours do not separate real from fake tweets well (which makes sense, since our other models also found them difficult to distinguish).

In [36]:
tweet_number= 1

pd.set_option('display.max_colwidth', None)
data.loc[t.get_nns_by_item(tweet_number, 10),["tweet", "is_real"]]


Unnamed: 0,tweet,is_real
1,"The Unsolicited Mail In Ballot Scam is a major threat to our Democracy, &amp; the Democrats know it.",True
3887,“The only way to shut down the Democrats new Mob Rule strategy is to stop them cold at the Ballot Box.,True
1564,"Exclusive: Eyewitness Says as Many as 20,000 Unverified Absentee Ballots Counted in Detroit Primary via",True
14363,"It is much easier for the Dems &amp, the Do Nothing Democrats to take control and beat Republicans because you can write the votes for the person you want to be on your ballot &amp, then they can do it to you. The Do Nothing Democrats are the new House.",False
6300,"Democrats must change the Loophole &amp, Asylum Laws - but they probably won’t!",True
11444,"I asked for it myself when asked why I left FBI at a time when I was under indictment for Witch Hunt &amp, Impeachment. They just don’t have the answers and will make an even worse mess - it makes the Democrats look good!THANK YOU MAJOR JOB!",False
8712,"It is much easier than that. Absentee Ballots will be mailed to all registered Democrats and Ballots mailed to all registered Republicans will be counted, with appropriate numbers, in the Election System, which means absentee voting is also possible.",False
5598,"“It isn’t often I get angry at the dirty politics of the Democrats in Congress, but this time I am enraged and hope this impeachment charade will backfire on Reps. Pelosi &amp, Schiff, &amp, the Democrats.",True
5675,"It is disgraceful what the Do Nothing Democrats are doing (the Impeachment Scam), but it is also disgraceful what they are NOT doing, namely, the USMCA vote, Prescription Drug Price Reduction, Gun Safety, Infrastructure, and much more!",True
11492,"Not all of this is the Do Nothing Democrats fault, they must have been very naive &amp; dishonest in 2016. Not only did FBI act on them, they have taken almost ALL of the info, including the Clinton Server. Very serious....“What the Democrats are doing is very, very, very bad.",False
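
Rather than browsing neighbours one tweet at a time, the impression above could be quantified as the average share of nearest neighbours that carry the same label as the query tweet. The sketch below does this with brute-force cosine similarity in NumPy instead of Annoy; the function name, the toy vectors and the toy labels are assumptions. In the notebook, `corpus_lsi_matrix` and `data['is_real']` would be the inputs (all-zero rows would need to be dropped first to avoid division by zero).

```python
import numpy as np

def neighbour_label_agreement(vectors, labels, k=3):
    """Average share of each item's k nearest neighbours (by cosine) that carry the same label."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)              # exclude the item itself
    nearest = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar items
    same = labels[nearest] == labels[:, None]
    return float(same.mean())

# Toy vectors and labels (hypothetical)
toy_vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
toy_labels = [True, True, False, False]
agreement = neighbour_label_agreement(toy_vectors, toy_labels, k=1)
print(agreement)
```

On a balanced dataset like this one, a value near 0.5 would mean the neighbourhoods are no better than chance at separating real from fake tweets.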
