# Portfolio for SDSPhD20

This notebook contains the exercises and assignments to be answered in a portfolio for the PhD course "Social Data Science: An Applied Introduction to Machine Learning" at Aalborg University, November 2020.

Each day of the course you are given an hour to work on a portfolio with the possibility of sparring with the course lecturers. 

You are expected to attempt to solve the various assignments using the methods and tools taught during the course. Answers should be combined into a notebook (e.g. by adding answers to a copy of this one).

**Note:** You are not expected to attempt to solve every single assignment. Note the different requirements for each day.

#### How to hand in your portfolio notebooks

You can hand in your portfolio notebooks in two ways:

- Saving your notebooks in a GitHub repository and then sending the repository URL to the course organizer (Kristian Kjelmann)
- Sharing your notebooks directly with the course organizer (Kristian Kjelmann) in Google Colab.

Kristian’s e-mail: kgk@adm.aau.dk

# Portfolio assignments for Thursday

**Requirement:** Work on solutions for the "Trump vs. GPT-2" assignment

## NLP: Trump vs. GPT-2

The site [https://faketrump.ai/](https://faketrump.ai/) was an interesting example of AI-powered fake-text generation. They wrote in 2019:

>We built an artificial intelligence model by fine-tuning [GPT-2](https://openai.com/blog/better-language-models/) to generate tweets in the style of Donald Trump’s Twitter account. After seeing the results, we also built a discriminator that can accurately detect fake tweets 77% of the time — think you can beat our classifier? Try it yourself!

Unfortunately, they decided to take down the site and the dataset.

GPT-2 is a neural transformer-based model that was announced by OpenAI in February 2019. It created considerable discussion because OpenAI decided, in contrast to their earlier policies, not to release the model to the public. Their central argument was that the model could too easily be used to produce fake news, spam and the like. The footnote of the faketrump page reads: “Generating realistic fake text has become much more accessible. We hope to highlight the current state of text generation to demonstrate how difficult it is to discern fiction from reality.”


Since then several organizations and researchers have shown that it is [possible to develop systems to detect “fake text”](https://www.theguardian.com/technology/2019/jul/04/ai-fake-text-gpt-2-concerns-false-information). We believe that you too can implement a competitive system.

Having no dataset from that project, Roman decided to retrain GPT-2 to generate new fake Trump tweets. If they can do that, we can do that! However, it seems that ML models find it easier to identify our fake tweets. Well, they are an AI company and probably spent more time on that.

> I’ve just watched Democrats scream over and over again about trying to Impeach the President of the United States. The Impeachment process is a sham.

> The Media must understand!“The New York Times is the leader on a very important subject: How to Combat Trump.” @foxandfriendsSo pathetic! @foxandfriendsI don’t think so.

> He is going to do it soon, and with proper borders. Border security is my top priority.The Democrats have failed the people of Arizona in everything else they have done, even their very good immigration laws. They have no sense.

The data can be found [here](https://github.com/SDS-AAU/SDS-master/raw/e2c959494d53859c1844604bed09a28a21566d0f/M3/assignments/trump_vs_GPT2.gz) and has the following format:


<table>
  <tr>
   <td>0
   </td>
   <td>1
   </td>
  </tr>
  <tr>
   <td>string
   </td>
   <td>boolean
   </td>
  </tr>
</table>

There are 7368 real Trump tweets and 7368 fake ones.

You can open it with:



```
import pandas as pd

data = pd.read_json('https://github.com/SDS-AAU/SDS-master/raw/e2c959494d53859c1844604bed09a28a21566d0f/M3/assignments/trump_vs_GPT2.gz')
```



* Split the data and preprocess it, vectorizing the text using different approaches (BoW, TFIDF, LSI)

* Create a system that can identify the fake Trump tweets using LogisticRegression or other classifiers (scikit-learn; if you like, also more complex models such as FastAI or Keras neural nets)

* Explore a subset (~1000) of the real and fake tweets using LDA and visualize your exploration

* Consider exploring using a different approach (LSI + clustering) or perhaps even [CorEx](https://github.com/gregversteeg/corex_topic)

In [None]:
# Your solutions from here...

In [None]:
!pip -q install eli5

In [None]:
import pandas as pd
import numpy as np
#import re

In [None]:
data = pd.read_json('https://github.com/SDS-AAU/SDS-master/raw/e2c959494d53859c1844604bed09a28a21566d0f/M3/assignments/trump_vs_GPT2.gz')
data.head()

Unnamed: 0,0,1
0,I was thrilled to be back in the Great city of...,True
1,The Unsolicited Mail In Ballot Scam is a major...,True
2,"As long as I am President, I will always stand...",True
3,"Our Economy is doing great, and is ready to se...",True
4,If I do not sound like a typical Washington po...,True


In [None]:
data.iloc[0,0]

'I was thrilled to be back in the Great city of Charlotte, North Carolina with thousands of hardworking American Patriots who love our Country, cherish our values, respect our laws, and always put AMERICA FIRST!'

BoW

In [None]:
# module to split data into training / test
from sklearn.model_selection import train_test_split

In [None]:
# Define inputs and outputs

X = data[0].values
y = data[1].values

In [None]:
# Split the data into 80% training and 20% test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)

In [None]:
# Simple BoW vectorizer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vec_1 = vectorizer.fit_transform(X_train)
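
To make concrete what a bag-of-words matrix looks like, here is a minimal pure-Python sketch. The two example sentences and the simple regex tokenizer are assumptions for illustration; `CountVectorizer` by default also lowercases the text and keeps tokens of two or more word characters, but this is not its actual implementation.

```python
import re
from collections import Counter

# Toy documents (hypothetical, not taken from the dataset)
docs = ["Great people, great country!", "The country is doing great"]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: sorted unique tokens, each one becomes a column
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# Document-term matrix: one row per document, one count per vocabulary entry
bow = []
for doc in docs:
    counts = Counter(tokenize(doc))
    bow.append([counts.get(tok, 0) for tok in vocab])

print(vocab)
print(bow)
```

Each row only records how often a word occurs, so word order is lost. That is the main simplification the BoW approach makes.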

In [None]:
# Instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=2000)

In [None]:
# Train the model

model.fit(X_train_vec_1, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
# Transform the test-set
X_test_vec_1 = vectorizer.transform(X_test)

In [None]:
# Check performance of the model
model.score(X_test_vec_1, y_test)

0.8059701492537313

In [None]:
# Predict on new data

y_pred = model.predict(X_test_vec_1)

In [None]:
y_pred

array([False,  True, False, ..., False,  True,  True])

In [None]:
# confusion matrix by hand... :-)

pd.crosstab(y_test, y_pred)

col_0,False,True
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1
False,1180,309
True,263,1196


TFIDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_vec_2 = vectorizer.fit_transform(X_train)

model = LogisticRegression(max_iter=2000)

# Train the model

model.fit(X_train_vec_2, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
# Transform the test-set
X_test_vec_2 = vectorizer.transform(X_test)

In [None]:
# Check performance of the model
model.score(X_test_vec_2, y_test)

0.8171641791044776
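
To see why TF-IDF weighs words differently from plain counts, here is a small pure-Python sketch. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents as the `TfidfVectorizer` default; the toy documents are hypothetical.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical)
docs = [["great", "country"], ["great", "wall"], ["great", "great", "deal"]]
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in df.items()}

def tfidf(doc):
    # Raw term count times IDF, then L2-normalize the document vector
    weights = {term: count * idf[term] for term, count in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf(doc) for doc in docs]
print(vectors[0])
```

In the first document "country" ends up with a higher weight than "great", because "great" appears in every document and therefore carries little information.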

In [None]:
import eli5
eli5.show_weights(model, feature_names=vectorizer.get_feature_names(), target_names=['Fake','Actual'], top=20)



Weight?,Feature
… 7361 more positive …,… 7361 more positive …
… 4702 more negative …,… 4702 more negative …
-2.700,enjoy
-2.761,ingrahamangle
-2.934,also
-2.974,am
-3.005,loudobbs
-3.106,total
-3.245,it
-3.398,great


In [None]:
eli5.show_prediction(model, X_test[0], vec=vectorizer, target_names=['Fake','Actual'])

Contribution?,Feature
1.631,<BIAS>
0.289,which
0.22,kasich
0.04,nafta
0.037,and
0.024,tpp
0.015,for
-0.007,pushing
-0.027,gov
-0.092,voted


In [None]:
X_test[0]

'Gov Kasich voted for NAFTA, which devastated Ohio and is now pushing TPP hard- bad for American workers!'

In [None]:
preds_array = model.predict_proba(X_test_vec_2)
preds_array[:,1]

array([0.67935772, 0.75608666, 0.13428755, ..., 0.34302019, 0.67398903,
       0.83636925])

In [None]:
predsDF = pd.DataFrame({'text':X_test, 'pred_pos':preds_array[:,1], 'y_test': y_test})

In [None]:
predsDF['diff'] = predsDF.pred_pos - predsDF.y_test
predsDF.sort_values('diff', ascending=True).iloc[0,0]

'To all of the great people who are working so hard for your Country and not getting paid I say, THANK YOU - YOU ARE GREAT PATRIOTS!'
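
The `diff` column is the predicted probability of "real" minus the true label, so sorting it ascending surfaces real tweets the model was most sure were fake. A minimal sketch of that logic, using made-up rows rather than actual model output:

```python
# Hypothetical rows: (text, predicted probability of "real", true label)
rows = [
    ("tweet A", 0.92, True),
    ("tweet B", 0.08, True),   # real tweet the model is confident is fake
    ("tweet C", 0.85, False),  # fake tweet the model is confident is real
    ("tweet D", 0.10, False),
]

# diff = probability - label (True counts as 1, False as 0)
scored = [(text, prob - int(label)) for text, prob, label in rows]

most_missed_real = min(scored, key=lambda r: r[1])      # diff close to -1
most_convincing_fake = max(scored, key=lambda r: r[1])  # diff close to +1

print(most_missed_real[0], most_convincing_fake[0])
```

Sorting with `ascending=False` instead would therefore show the fake tweets that fooled the model most.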

In [None]:
# Let's fire up spacy

import spacy

# and load the small English language model. Large models can be downloaded for many languages.
nlp = spacy.load("en")

# find more models for other languages here: https://spacy.io/models/

In [None]:
doc = nlp(X_test[1])

In [None]:
X_test[1]

'Texas LC George P. Bush backed me when it wasn’t the politically correct thing to do, and I back him now.'

In [None]:
# let's look at the POS tags
[(tok.text, tok.pos_) for tok in doc]

[('Texas', 'PROPN'),
 ('LC', 'PROPN'),
 ('George', 'PROPN'),
 ('P.', 'PROPN'),
 ('Bush', 'PROPN'),
 ('backed', 'VERB'),
 ('me', 'PRON'),
 ('when', 'ADV'),
 ('it', 'PRON'),
 ('was', 'AUX'),
 ('n’t', 'PART'),
 ('the', 'DET'),
 ('politically', 'ADV'),
 ('correct', 'ADJ'),
 ('thing', 'NOUN'),
 ('to', 'PART'),
 ('do', 'AUX'),
 (',', 'PUNCT'),
 ('and', 'CCONJ'),
 ('I', 'PRON'),
 ('back', 'VERB'),
 ('him', 'PRON'),
 ('now', 'ADV'),
 ('.', 'PUNCT')]
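
The POS tags are used below to keep only content words. As a sketch of that filter, the following applies it to the tagged tokens copied from the output above. It deliberately leaves out the stop-word check (`tok.is_stop`), which needs spaCy's own word list, so the real pipeline may drop further tokens.

```python
# (token, POS) pairs copied from the output above
tagged = [
    ('Texas', 'PROPN'), ('LC', 'PROPN'), ('George', 'PROPN'), ('P.', 'PROPN'),
    ('Bush', 'PROPN'), ('backed', 'VERB'), ('me', 'PRON'), ('when', 'ADV'),
    ('it', 'PRON'), ('was', 'AUX'), ('n’t', 'PART'), ('the', 'DET'),
    ('politically', 'ADV'), ('correct', 'ADJ'), ('thing', 'NOUN'),
    ('to', 'PART'), ('do', 'AUX'), (',', 'PUNCT'), ('and', 'CCONJ'),
    ('I', 'PRON'), ('back', 'VERB'), ('him', 'PRON'), ('now', 'ADV'),
    ('.', 'PUNCT'),
]

# Keep nouns, adjectives, adverbs and verbs, lowercased
keep = {'NOUN', 'ADJ', 'ADV', 'VERB'}
content_tokens = [tok.lower() for tok, pos in tagged if pos in keep]

print(content_tokens)
```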

In [None]:
# Let's tokenize the first 2000 tweets (that should take around 1 minute with this approach)
tokenlist = []
for doc in nlp.pipe(X_train[:2000]):
  tokens =[tok.text.lower() for tok in doc if tok.pos_ in ['NOUN','ADJ','ADV','VERB'] and not tok.is_stop]
  tokenlist.append(tokens)

In [None]:
from gensim.corpora.dictionary import Dictionary

In [None]:
dictionary = Dictionary(tokenlist)

In [None]:
len(dictionary)

3654

In [None]:
dictionary.filter_extremes(no_below=5, no_above=0.2)

In [None]:
len(dictionary)

726
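
The drop from 3654 to 726 entries comes from the two document-frequency thresholds. Below is a rough pure-Python sketch of that kind of filter; the function name and the toy documents are hypothetical, and gensim's `filter_extremes` additionally caps the vocabulary size via `keep_n`, which is omitted here.

```python
from collections import Counter

def filter_by_doc_frequency(token_lists, no_below=5, no_above=0.2):
    """Keep tokens found in at least `no_below` documents and in
    at most a fraction `no_above` of all documents."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {tok for tok, df in doc_freq.items()
            if no_below <= df <= no_above * n_docs}

# Toy corpus of 10 documents (hypothetical)
toy_docs = [["great", "border"]] * 3 + [["great"]] * 6 + [["great", "typo"]]

kept = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.5)
print(kept)
```

Here "great" is dropped for being too common and "typo" for being too rare, so only "border" survives.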

In [None]:
vectorizer = TfidfVectorizer(vocabulary=list(dictionary.values()))
X_train_vec_2 = vectorizer.fit_transform(X_train)

model = LogisticRegression(max_iter=2000)

# Train the model

model.fit(X_train_vec_2, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
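# Transform the test set with the vectorizer fitted on the training data
# (transform, not fit_transform, so the IDF weights are not re-learned on the test set)
X_test_vec_2 = vectorizer.transform(X_test)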

In [None]:
# Check performance of the model
model.score(X_test_vec_2, y_test)

0.7242198100407056

In [None]:
eli5.show_weights(model, feature_names=vectorizer.get_feature_names(), target_names=['Fake','Actual'], top=20)

Weight?,Feature
… 232 more positive …,… 232 more positive …
… 466 more negative …,… 466 more negative …
-1.968,disgrace
-1.988,proud
-1.989,lot
-2.087,big
-2.146,great
-2.148,soon
-2.162,interviewed
-2.163,happen


In [None]:
eli5.show_prediction(model, X_test[0], vec=vectorizer, target_names=['Fake','Actual'])

Contribution?,Feature
0.469,bad
0.277,workers
0.265,hard
0.253,american
0.249,pushing
0.134,voted
-1.407,<BIAS>


In [None]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier

tfidf = TfidfVectorizer(vocabulary=list(dictionary.values()))
svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)
clf = MLPClassifier(verbose=False)


pipe = make_pipeline(tfidf, svd, clf)

pipe.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token...
                               batch_size='auto', beta_1=0.9, beta_2=0.999,
                               early_stopping=False, epsilon=1e-08,
                               hidden_layer_sizes=(100,),
                               learning_rate='constant',
                               learning_rat

In [None]:
pipe.score(X_test, y_test)

0.7435549525101763

In [None]:
from eli5.lime import TextExplainer

te = TextExplainer(random_state=42)
te.fit(X_test[0], pipe.predict_proba)
te.show_prediction(target_names=['Fake','Actual'])

Contribution?,Feature
1.002,american workers
0.797,hard bad
0.602,bad
0.302,bad for
0.147,for nafta
0.147,for american
0.134,tpp hard
0.122,devastated ohio
0.076,is now
0.073,voted for


---

LSI

In [None]:
!pip install annoy

Collecting annoy
  Downloading https://files.pythonhosted.org/packages/a1/5b/1c22129f608b3f438713b91cd880dc681d747a860afe3e8e0af86e921942/annoy-1.17.0.tar.gz (646kB)
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.0-cp36-cp36m-linux_x86_64.whl size=390345 sha256=3a3dacb1798496d33c45a3b25d782d3e0088eed882bbb2b1c61b397ae5887585
  Stored in directory: /root/.cache/pip/wheels/3a/c5/59/cce7e67b52c8e987389e53f917b6bb2a9d904a03246fadcb1e
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.17.0


In [None]:
# Import the dictionary builder
from gensim.corpora.dictionary import Dictionary

# Import the TfidfModel from Gensim
from gensim.models.tfidfmodel import TfidfModel

# Just like before, we import the model
from gensim.models.lsimodel import LsiModel

# Tooling to map between corpus (gensim) and matrix - more general
from gensim.matutils import corpus2csc, corpus2dense

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Generate a dictionary and filter
dictionary = Dictionary(tokenlist)
dictionary.filter_extremes(no_below=5, no_above=0.2)

In [None]:
# construct corpus using this dictionary
corpus = [dictionary.doc2bow(word_tokenize(doc.lower())) for doc in data[0]]

In [None]:
# Create and fit a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

In [None]:
# transform corpus to TFIDF
corpus_tfidf = tfidf[corpus]

In [None]:
# Training the LSI model
model_lsi = LsiModel(corpus_tfidf, num_topics = 300, id2word=dictionary)
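
LSI is based on a truncated singular value decomposition of the term-document matrix. The NumPy sketch below shows the idea on a tiny hypothetical matrix; it is not how gensim computes the model internally (gensim works incrementally on large corpora), and the matrix values are made up.

```python
import numpy as np

# Toy term-document matrix (rows: terms, columns: documents), hypothetical weights
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.9, 0.0, 0.1],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.1, 1.0, 0.9],
])

k = 2  # number of latent topics to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Document coordinates in the k-dimensional latent space (one row per document)
docs_lsi = (np.diag(s[:k]) @ Vt[:k]).T

print(docs_lsi.shape)
```

Documents that share vocabulary (here the first two and the last two columns) end up close together in the reduced space, which is what the nearest-neighbour search further down relies on.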

In [None]:
# Project the TF-IDF corpus into the LSI space

corpus_lsi = model_lsi[corpus_tfidf]

In [None]:
# turn into matrix
corpus_lsi_matrix = corpus2dense(corpus_lsi, 300 )



In [None]:
corpus_lsi_matrix.shape

(300, 14736)

In [None]:
corpus_lsi_matrix = corpus_lsi_matrix.T

In [None]:
from annoy import AnnoyIndex

In [None]:
f = 300

t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed

for i in range(len(corpus_lsi_matrix)):
    t.add_item(i, corpus_lsi_matrix[i])

In [None]:
t.build(10)

True

In [None]:
t.get_nns_by_item(0, 10)

[0, 1130, 609, 1361, 4496, 1012, 1230, 1235, 9509, 1641]
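
Annoy returns approximate neighbours. For intuition, here is an exact brute-force version in pure Python that ranks by cosine similarity, which gives the same ordering as Annoy's `'angular'` distance would for an exact search. The helper names and the four toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(vectors, query_idx, n):
    """Indices of the n vectors most similar to vectors[query_idx], best first."""
    q = vectors[query_idx]
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(q, vectors[i]), reverse=True)
    return order[:n]

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(nearest(vectors, 0, 3))
```

As with `get_nns_by_item`, the query item itself comes back first. The brute-force scan compares against every vector, which is the cost an index like Annoy is meant to avoid on larger collections.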

In [None]:
data[0][44]

'Now is playing Obama’s no crowd, fake speech for Biden, a man he could barely endorse because he couldn’t believe he won.'

In [None]:
data[0][t.get_nns_by_item(44, 10)]

44       Now is playing Obama’s no crowd, fake speech f...
299      Does anybody really believe that Roger Stone, ...
1556     I hear that Fake News CNN just reported that I...
9106     Does anyone believe that it would be necessary...
12964    ” The Fake News doesn’t want to see that..... ...
11463    ” @SteveHilton @JudgeJeanine @FoxNewsSo true! ...
6178     The Washington Post Story, about my speech in ...
10204    We are here for you!🇺🇸🇧 is doing everything po...
4902     Many lawyers and top law firms want to represe...
7664     Additionally, I've authorized the United State...
Name: 0, dtype: object

In [None]:
import spacy
nlp = spacy.load("en")

# Let's apply the model to tweet 44 (as easy as that)
nlp(str(data.loc[44]))

0    Now is playing Obama’s no crowd, fake speech f...
1                                                 True
Name: 44, dtype: object

In [None]:
# we'll use the faster multicore version of LDA

from gensim.models import LdaMulticore

In [None]:
# Training the model
lda_model = LdaMulticore(corpus, id2word=dictionary, num_topics=10, workers = 4, passes=10)

In [None]:
# Which topics does a text belong to?
lda_model[corpus][0]

[(6, 0.31282243), (7, 0.4964905), (8, 0.12703843)]
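
The output is a list of `(topic id, probability)` pairs. A small sketch of how one might pick the dominant topic from it, using the values copied from the output above:

```python
# Topic distribution copied from the output above: (topic id, probability)
doc_topics = [(6, 0.31282243), (7, 0.4964905), (8, 0.12703843)]

# The dominant topic is the pair with the highest probability
dominant_topic, weight = max(doc_topics, key=lambda pair: pair[1])

print(dominant_topic, weight)
```

The three probabilities do not add up to 1, most likely because topics with a very small probability are left out of the list.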

In [None]:
# let's first install this nice visualizer
!pip install -qq pyLDAvis

  Building wheel for pyLDAvis (setup.py) ... done


In [None]:
# and import it
import pyLDAvis.gensim
%matplotlib inline
pyLDAvis.enable_notebook()

In [None]:
# Let's try to visualize
lda_display = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

In [None]:
pyLDAvis.display(lda_display)