# Assignment 2: POTUS

---

## Task 1) President of the United States (Trump vs. Obama)

Surely, you're aware that the 45th President of the United States (@POTUS45) was an active user of Twitter, until (permanently) banned on Jan 8, 2021.
You can still enjoy his greatness at the [Trump Twitter Archive](https://www.thetrumparchive.com/). We will be using original tweets only, so make sure to remove all retweets.
Another fan of Twitter was Barack Obama (@POTUS43 and @POTUS44), who used the platform in a rather professional way.
Please also consider the POTUS Tweets of Joe Biden; we will be using those for testing.

### Data

There are multiple ways to get the data, but the easiest way is to download the files from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group. 
Another way is to directly use the data from [Trump Twitter Archive](https://www.thetrumparchive.com/), [Obama Kaggle](https://www.kaggle.com/jayrav13/obama-white-house), and [Biden Kaggle](https://www.kaggle.com/rohanrao/joe-biden-tweets).
Before you get started, please download the files; you can put them into the data folder.

### N-gram Models

In this assignment, you will be doing some Twitter-related preprocessing and training n-gram models to be able to distinguish between Tweets of Trump, Obama, and Biden.
We will be using [NLTK](https://www.nltk.org), more specifically it's [`lm`](https://www.nltk.org/api/nltk.lm.html) module. 
Install the NLTK package within your working environment.
You can use some of the NLTK functions, but you have to implement the functions for likelihoods and perplexity from scratch.

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [210]:
# Dependencies
import nltk
from nltk import lm
from typing import TypedDict, Iterable, Iterator, Collection, Callable
from dataclasses import dataclass
import json
import csv
import re
from typing import Optional
import random
from functools import reduce
import math
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import numpy as np

### Prepare the Data

1.1 Prepare all the Tweets. Since the `lm` modules will work on tokenized data, implement a tokenization method that strips unnecessary tokens but retains special words such as mentions (@...) and hashtags (#...).

1.2 Partition into training and test sets; select about 100 tweets each, which we will be testing on later. As with any Machine Learning task, training and test must not overlap.

In [211]:
# Notice: ignore retweets

TRUMP = "trump"
OBAMA = "obama"
BIDEN = "biden"

DATA_DIR = "data"
TRUMP_TWEETS_FILE = f"{DATA_DIR}/tweets_01-08-2021.json"
OBAMA_TWEETS_FILE = f"{DATA_DIR}/Tweets-BarackObama.csv"
BIDEN_TWEETS_FILE = f"{DATA_DIR}/JoeBidenTweets.csv"

@dataclass
class Tweet:
    author: str
    text: str
    
    @staticmethod
    def from_TrumpTweets(trump_tweets: Iterable["TrumpTweet"]) -> list["Tweet"]:
        return list(map(
            lambda t: Tweet(author=TRUMP, text=t["text"]),
            filter(lambda t: t["isRetweet"] == "f", trump_tweets)
        ))
    
    @staticmethod
    def from_ObamaTweets(obama_tweets: Iterable["ObamaTweet"]) -> list["Tweet"]:
        return list(map(
            lambda t: Tweet(author=OBAMA, text=t["tweet"]),
            obama_tweets
        ))
    
    @staticmethod
    def from_BidenTweets(biden_tweets: Iterable["BidenTweet"]) -> list["Tweet"]:
        return list(map(
            lambda t: Tweet(author=BIDEN, text=t["Tweet-text"]),
            biden_tweets
        ))

class TrumpTweet(TypedDict):
    id: int
    text: str
    isRetweet: str
    isDeleted: str
    device: str
    favorites: int
    retweets: int
    date: str
    isFlagged: str

class ObamaTweet(TypedDict):
    id: str
    timestamp: str
    url: str
    tweet: str
    replies: str
    retweets: str
    quotes: str
    likes: str

BidenTweet = TypedDict(
    "BidenTweet",
    {
        "Date": str,
        "Username": str,
        "Tweet-text": str,
        "Tweet Link": str,
        "Retweets": str,
        "Likes": str,
        "TweetImageUrl": str,
        "Image": str
    }
)

def load_trump_tweets(filepath) -> list[Tweet]:
    """Loads all Trump tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    with open(filepath) as fp:
        return Tweet.from_TrumpTweets(json.load(fp))
    
    ### END YOUR CODE


def load_obama_tweets(filepath) -> list[Tweet]:
    """Loads all Obama tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    with open(filepath) as fp:
        return Tweet.from_ObamaTweets(csv.DictReader(fp.readlines(), ObamaTweet.__required_keys__)) # type: ignore

    ### END YOUR CODE
    

def load_biden_tweets(filepath) -> list[Tweet]:
    """Loads all Biden tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    with open(filepath) as fp:
        return Tweet.from_BidenTweets(csv.DictReader(fp.readlines(), BidenTweet.__required_keys__)) # type: ignore
    
    ### END YOUR CODE

In [212]:
# Notice: think about start and end tokens

NUM_TEST = 100

def tokenize(text: str) -> Iterator[str]:
    """Tokenizes a single Tweet."""
    ### YOUR CODE HERE
    
    yield "<s>"
    for s in text.split():
        m = re.match(r"^((?:[#@]?\w+)|(?:\w+:\/\/\w+(?:\.\w+)*(?::\d*)?(?:(?:\/[^,.?!]*)*\/?)))[,\.!?]?$", s)
        if m is not None:
            if m.group(1) is None:
                print(m)
                print(m.group(1))
            yield m.group(1)
    yield "</s>"

    ### END YOUR CODE
    

def split_and_tokenize(data: list[Tweet], num_test=NUM_TEST) -> list[list[str]]:
    """Splits and tokenizes the given list of Twitter tweets."""
    ### YOUR CODE HERE

    return [list(tokenize(tweet.text)) for tweet in data]
    
    ### END YOUR CODE

In [213]:
TEST_COUNT = 100

trump_tweets = split_and_tokenize(load_trump_tweets(TRUMP_TWEETS_FILE))
obama_tweets = split_and_tokenize(load_obama_tweets(OBAMA_TWEETS_FILE))
biden_tweets = split_and_tokenize(load_biden_tweets(BIDEN_TWEETS_FILE))

data_train = {}
data_test = {}

data_train[TRUMP], data_test[TRUMP] = train_test_split(trump_tweets, test_size=TEST_COUNT / len(trump_tweets))
data_train[OBAMA], data_test[OBAMA] = train_test_split(obama_tweets, test_size=TEST_COUNT / len(obama_tweets))
data_train[BIDEN], data_test[BIDEN] = train_test_split(biden_tweets, test_size=TEST_COUNT / len(biden_tweets))


### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5] for Obama, Trump, and Biden.

2.2 Also train a joint model, that will serve as background model.

In [214]:
BACKGROUND_MODEL_KEY = "background"

class NGramModel:
    @property
    def n(self):
        return self.__n

    @property
    def V(self):
        return self.__V

    def __init__(self, n: int, data: list[list[str]], V: Optional[set[str]] = None):
        assert n > 1
        self.__n = n
        if V is None: 
            self.__V = set(s for l in data for s in l)
        else:
            self.__V = V
        if n > 2:
            self.__n_minus_1_gram_model = NGramModel(n-1, [l[:n-1] for l in data], self.V)
        n_gram_counts = dict()
        n_minus_1_gram_counts = dict()
        for l in data:
            for i in range(len(l) - n + 1):
                n_gram = tuple(l[i: i + n])
                n_minus_1_gram = n_gram[:-1]
                if not n_minus_1_gram in n_minus_1_gram_counts:
                    n_minus_1_gram_counts[n_minus_1_gram] = len(self.V)
                    n_gram_counts[n_minus_1_gram] = {n_gram: 2}
                else:
                    n_minus_1_gram_counts[n_minus_1_gram] += 1
                    if not n_gram in n_gram_counts[n_minus_1_gram]:
                        n_gram_counts[n_minus_1_gram][n_gram] = 2
                    else:
                        n_gram_counts[n_minus_1_gram][n_gram] += 1
        self.__conditionals = {
            n_minus_1_gram: {
                n_gram: n_gram_counts[n_minus_1_gram][n_gram] / n_minus_1_gram_counts[n_minus_1_gram]
                for n_gram in n_gram_counts[n_minus_1_gram]
            }
            for n_minus_1_gram in n_minus_1_gram_counts
        }
        self.__n_minus_1_gram_counts = n_minus_1_gram_counts
        
    def conditional(self, n_gram: tuple[str,...]) -> float:
        l = len(n_gram)
        assert 1 < l and l <= self.n
        if n_gram[-1] == "<s>":
            return 0
        elif l == self.n:
            n_minus_1_gram = n_gram[:-1]
            if n_minus_1_gram in self.__conditionals:
                if n_gram in self.__conditionals[n_minus_1_gram]:
                    return self.__conditionals[n_minus_1_gram][n_gram]
                else:
                    return 1 / self.__n_minus_1_gram_counts[n_minus_1_gram]
            else:
                return 1 / (len(self.V) - 1)
        else:
            return self.__n_minus_1_gram_model.conditional(n_gram)
        
    def continuations(self, n_minus_1_gram: tuple[str,...]) -> Iterator[tuple[str, float]]:
        l = len(n_minus_1_gram)
        assert 0 < l and l < self.n
        if l == self.n - 1:
            return ((t, self.conditional(n_minus_1_gram + (t,))) for t in self.V)
        else:
            return self.__n_minus_1_gram_model.continuations(n_minus_1_gram)

def build_n_gram_models(n, data: dict[str, list[list[str]]]):
    """
    To predict the first few words of the Tweet, we need the smaller n-grams as
    well. This method does calculate all n-grams up to the given n.
    """
    ### YOUR CODE HERE
    
    d = {k: NGramModel(n, l) for k, l in data.items()}
    d[BACKGROUND_MODEL_KEY] = NGramModel(n, reduce(lambda a, b: a + b, data.values(), []))
    return d
    
    ### END YOUR CODE


def get_suggestion(prev: Collection[str], n_gram_model: NGramModel) -> Optional[str]:
    """
    Gets the next random word for the given n_grams.
    The size of the previous tokens must be exactly one less than the n-value
    of the n-gram, or it will not be able to make a prediction.
    """
    ### YOUR CODE HERE
    
    sum = 0
    r = random.random()
    for cont, cond in n_gram_model.continuations(tuple(prev)):
        sum += cond
        if r < sum:
            return cont
    return None

    ### END YOUR CODE


def get_random_tweet(n: int, n_gram_model: NGramModel):
    """Generates a random tweet using the given data set."""
    ### YOUR CODE HERE
    
    l = ["<s>"]
    for _ in range(n): 
        suggestion = get_suggestion(l[max(0, len(l) - n_gram_model.n + 1):len(l)], n_gram_model)
        if suggestion is not None:
            l.append(suggestion)
        else:
            break
    return " ".join(l[i] for i in range(1, len(l) - 1 if l[-1] == "</s>" else len(l)))
    
    ### END YOUR CODE

In [215]:
n_gram_models = build_n_gram_models(5, data_train)
random_tweet_trump = get_random_tweet(2000, n_gram_models[TRUMP])
print(random_tweet_trump)

was buildilng https://t.co/KBkladnmPQ car https://t.co/XLoCxOFvMA woo Walter Inspections https://t.co/bkmWFZ9JI9 #SaveCulzean https://t.co/YPB8nqX2d6 Opinion Kong Katrina @Netanyahu https://t.co/VU5wh2zXBU technology Income Rancho GRANITE Glimpses mysterious Wacky challenge @hookjan @CAKairport https://t.co/ZnPgnu8vCS #classact http://t.co/ZnT7ayNBiC https://t.co/MwWfmZjQIN http://t.co/BYFRxtzJ Behar http://t.co/RKVJ8h3V1v"" parties JebBush overwhelming Taxing minuses GRAHAM path http://t.co/fBxho8MWgh Fracking https://t.co/y1j2Wf6J4p http://t.co/LhK8EfIA https://t.co/FsrUGByuuD @DangeRussWilson #HandsOffMyGun Factory Mohammed forever Problems @freeillinois TrumpCare Ivana http://t.co/PpCgzJwr http://t.co/r3zVN62W Petersburg @RobertSuppa Garrett @mike_pence @frank_puggi Pressed Historian divorced wee @mtshastacola precautions https://t.co/HfuJeRZbod https://t.co/hJSsx86Azp Steele Castle Although #SemperFidelis CONTINUES memorabilia https://t.co/u5AI1pupVV https://t.co/YsZUgtNoZW https:

### Classify the Tweets

3.1 Use the log-ratio method to classify the Tweets for Trump vs. Biden. Trump should be easy to spot; but what about Obama vs. Biden?

3.2 Analyze: At what context length (n) does the system perform best?

In [216]:
def log_with_zero(x: float):
    return math.log(x) if x > 0 else -math.inf

def calculate_single_token_log_ratio(n_gram: tuple[str,...], n_gram_model1: NGramModel, n_gram_model2: NGramModel) -> tuple[float, float]:
    """Calculates the log ration of a token for two different n-grams"""
    ### YOUR CODE HERE
    
    return log_with_zero(n_gram_model1.conditional(n_gram)), log_with_zero(n_gram_model2.conditional(n_gram))
    
    ### END YOUR CODE


def classify(n: int, tokens: list[str], n_gram_model1: NGramModel, n_gram_model2: NGramModel) -> Optional[bool]:
    """
    Checks which of the two given datasets is more likely for the given Tweet.
    If true is returned, the first one is more likely, otherwise the second.
    """
    ### YOUR CODE HERE
    
    log_prob1 = 0
    log_prob2 = 0
    for i in range(2, len(tokens) + 1):
        n_gram = tuple(tokens[max(0, i - n) : i])
        new_log_prob1, new_log_prob2 = calculate_single_token_log_ratio(n_gram, n_gram_model1, n_gram_model2)
        log_prob1 += new_log_prob1
        log_prob2 += new_log_prob2
    return log_prob1 > log_prob2 if log_prob1 != log_prob2 else None
    
    ### END YOUR CODE


In [217]:
def validate(
        n: int,
        author1: str,
        train_data1: list[list[str]],
        test_data1: list[list[str]],
        author2: str,
        train_data2: list[list[str]],
        test_data2: list[list[str]],
        classify_fn: Callable[[int, list[str], NGramModel, NGramModel], Optional[bool]]
    ):
    """
    Trains the n-gram models on the train data and validates on the test data.
    Uses the implemented classification function to predict the Tweeter.
    """
    ### YOUR CODE HERE
    
    for i in range(2, n + 1):
        model1 = NGramModel(i, train_data1)
        model2 = NGramModel(i, train_data2)
        ground_truth = [author1] * len(test_data1) + [author2] * len(test_data2)
        preds = [
            "indecisive" if x is None else author1 if x else author2
            for x in (classify_fn(i, t, model1, model2) for data in (test_data1, test_data2) for t in data)
        ]
        conf_mat = confusion_matrix(ground_truth, preds, labels=(author1, author2, "indecisive"))
        print(f"results for authors {author1} and {author2}, n = {i}")
        print("confusion matrix:")
        print(conf_mat)
        print(f"accuracy = {conf_mat.diagonal().sum() / conf_mat.sum()}\n")
        
    ### END YOUR CODE

In [218]:
context_length = 5
validate(context_length, TRUMP, data_train[TRUMP], data_test[TRUMP], BIDEN, data_train[BIDEN], data_test[BIDEN], classify_fn=classify)
validate(context_length, OBAMA, data_train[OBAMA], data_test[OBAMA], BIDEN, data_train[BIDEN], data_test[BIDEN], classify_fn=classify)

results for authors trump and biden, n = 2
confusion matrix:
[[  8  92   0]
 [  0 100   0]
 [  0   0   0]]
accuracy = 0.54

results for authors trump and biden, n = 3
confusion matrix:
[[  3  97   0]
 [  0 100   0]
 [  0   0   0]]
accuracy = 0.515

results for authors trump and biden, n = 4
confusion matrix:
[[  0 100   0]
 [  0 100   0]
 [  0   0   0]]
accuracy = 0.5

results for authors trump and biden, n = 5
confusion matrix:
[[  0 100   0]
 [  0 100   0]
 [  0   0   0]]
accuracy = 0.5

results for authors obama and biden, n = 2
confusion matrix:
[[100   0   0]
 [ 78  22   0]
 [  0   0   0]]
accuracy = 0.61

results for authors obama and biden, n = 3
confusion matrix:
[[100   0   0]
 [ 78  22   0]
 [  0   0   0]]
accuracy = 0.61

results for authors obama and biden, n = 4
confusion matrix:
[[100   0   0]
 [ 78  22   0]
 [  0   0   0]]
accuracy = 0.61

results for authors obama and biden, n = 5
confusion matrix:
[[100   0   0]
 [ 78  22   0]
 [  0   0   0]]
accuracy = 0.61



### Compute Perplexities

4.1 Compute (and plot) the perplexities for each of the test tweets and models. Is picking the Model with minimum perplexity a better classifier than in 3.1?

In [219]:
def classify_with_perplexity(n: int, tokens: list[str], n_gram_model1: NGramModel, n_gram_model2: NGramModel) -> Optional[bool]:
    """
    Checks which of the two given datasets is more likely for the given Tweet.
    If true is returned, the first one is more likely, otherwise the second.
    """
    ### YOUR CODE HERE
    
    neg_m_log_perp1 = 0
    neg_m_log_perp2 = 0
    m = 0
    for i in range(2, len(tokens) + 1):
        n_gram = tuple(tokens[max(0, i - n) : i])
        neg_m_log_perp1 += n_gram_model1.conditional(n_gram)
        neg_m_log_perp2 += n_gram_model2.conditional(n_gram)
        m += 1
    perp1 = math.exp(- neg_m_log_perp1 / m)
    perp2 = math.exp(- neg_m_log_perp2 / m)
    return perp1 < perp2 if perp1 != perp2 else None
        
    
    ### END YOUR CODE

In [220]:
context_length = 5
validate(context_length, TRUMP, data_train[TRUMP], data_test[TRUMP], BIDEN, data_train[BIDEN], data_test[BIDEN], classify_fn=classify_with_perplexity)
validate(context_length, OBAMA, data_train[OBAMA], data_test[OBAMA], BIDEN, data_train[BIDEN], data_test[BIDEN], classify_fn=classify_with_perplexity)

results for authors trump and biden, n = 2
confusion matrix:
[[ 72  28   0]
 [  0 100   0]
 [  0   0   0]]
accuracy = 0.86

results for authors trump and biden, n = 3
confusion matrix:
[[ 29  71   0]
 [  0 100   0]
 [  0   0   0]]
accuracy = 0.645

results for authors trump and biden, n = 4
confusion matrix:
[[ 28  72   0]
 [  0 100   0]
 [  0   0   0]]
accuracy = 0.64

results for authors trump and biden, n = 5
confusion matrix:
[[ 28  72   0]
 [  0 100   0]
 [  0   0   0]]
accuracy = 0.64

results for authors obama and biden, n = 2
confusion matrix:
[[100   0   0]
 [100   0   0]
 [  0   0   0]]
accuracy = 0.5

results for authors obama and biden, n = 3
confusion matrix:
[[100   0   0]
 [100   0   0]
 [  0   0   0]]
accuracy = 0.5

results for authors obama and biden, n = 4
confusion matrix:
[[100   0   0]
 [100   0   0]
 [  0   0   0]]
accuracy = 0.5

results for authors obama and biden, n = 5
confusion matrix:
[[100   0   0]
 [100   0   0]
 [  0   0   0]]
accuracy = 0.5

