# Assignment 2: POTUS

---

## Task 1) President of the United States (Trump vs. Obama)

Surely, you're aware that the 45th President of the United States (@POTUS45) was an active user of Twitter, until (permanently) banned on Jan 8, 2021.
You can still enjoy his greatness at the [Trump Twitter Archive](https://www.thetrumparchive.com/). We will be using original tweets only, so make sure to remove all retweets.
Another fan of Twitter was Barack Obama (@POTUS43 and @POTUS44), who used the platform in a rather professional way.
Please also consider the POTUS Tweets of Joe Biden; we will be using those for testing.

### Data

There are multiple ways to get the data, but the easiest way is to download the files from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group. 
Another way is to directly use the data from [Trump Twitter Archive](https://www.thetrumparchive.com/), [Obama Kaggle](https://www.kaggle.com/jayrav13/obama-white-house), and [Biden Kaggle](https://www.kaggle.com/rohanrao/joe-biden-tweets).
Before you get started, please download the files; you can put them into the data folder.

### N-gram Models

In this assignment, you will be doing some Twitter-related preprocessing and training n-gram models to be able to distinguish between Tweets of Trump, Obama, and Biden.
We will be using [NLTK](https://www.nltk.org), more specifically it's [`lm`](https://www.nltk.org/api/nltk.lm.html) module. 
Install the NLTK package within your working environment.
You can use some of the NLTK functions, but you have to implement the functions for likelihoods and perplexity from scratch.

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [1]:
%pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting click (from nltk)
  Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tqdm (from nltk)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5 MB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (796 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m796.9/796.9 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading click-8.1.8-py3-none-any.whl (98 kB)
Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, click, nltk
Successfully insta

In [7]:
# Dependencies
import nltk
import pandas as pd
import regex as re
import sklearn as sk

### Prepare the Data

1.1 Prepare all the Tweets. Since the `lm` modules will work on tokenized data, implement a tokenization method that strips unnecessary tokens but retains special words such as mentions (@...) and hashtags (#...).

1.2 Partition into training and test sets; select about 100 tweets each, which we will be testing on later. As with any Machine Learning task, training and test must not overlap.

In [14]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/yannes/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [24]:
# Notice: ignore retweets 
def beautify_tweet(tweet):
    """Returns a cleaned version of the tweet."""
    ### YOUR CODE HERE
    # remove special charactoars but keep @ and # 
    cleaned_tweet = re.sub(r"[^a-zA-Z\s@#]", "", tweet)
    
    # remove extra spaces
    cleaned_tweet = re.sub(r"\s+", " ", cleaned_tweet)
    
    # remove leading and trailing spaces
    cleaned_tweet = cleaned_tweet.strip()

    # remove https links
    cleaned_tweet = re.sub(r"http\S+|www\S+|https\S+", "", cleaned_tweet, flags=re.MULTILINE)
    
    return cleaned_tweet
    ### END YOUR CODE

def load_trump_tweets(filepath):
    """Loads all Trump tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    data = pd.read_json(filepath)
    data= data['text'].tolist()
    # remove special charactoars but keep @ and # 
    cleaned_tweets = [       beautify_tweet(tweet) for tweet in data]
    
    return cleaned_tweets

    ### END YOUR CODE


def load_obama_tweets(filepath):
    """Loads all Obama tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    data= pd.read_csv(filepath)
    data = data["Tweet-text"].to_list()
    cleaned_tweets = [       beautify_tweet(tweet) for tweet in data]
    
    return cleaned_tweets


    ### END YOUR CODE
    

def load_biden_tweets(filepath):
    """Loads all Biden tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    data= pd.read_csv(filepath)
    data = data["tweet"].to_list()
    cleaned_tweets = [       beautify_tweet(tweet) for tweet in data]
    
    return cleaned_tweets



    ### END YOUR CODE



In [31]:

# Notice: think about start and end tokens

NUM_TEST = 100

def tokenize(text):
    """Tokenizes a single Tweet."""
    ### YOUR CODE HERE
    
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    # Add start and end tokens
    tokens = ["<s>"] + tokens + ["</s>"]
    
    return tokens
    ### END YOUR CODE
    

def split_and_tokenize(data, num_test=NUM_TEST):
    """Splits and tokenizes the given list of Twitter tweets."""
    ### YOUR CODE HERE
    
    train , test= sk.model_selection.train_test_split(data, test_size=num_test)
    train = [tokenize(tweet) for tweet in train]
    test = [tokenize(tweet) for tweet in test]
    
    return train, test
    ### END YOUR CODE

In [32]:
trump_tweets_train, trump_tweets_test = split_and_tokenize(load_trump_tweets("data/trump.json"))
obama_tweets_train, obama_tweets_test = split_and_tokenize(load_obama_tweets("data/obama.csv"))
biden_tweets_train , biden_tweets_test = split_and_tokenize(load_biden_tweets("data/biden.csv"))


### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5] for Obama, Trump, and Biden.

2.2 Also train a joint model, that will serve as background model.

In [33]:
trump_tweets_test

[['<s>', 'THANK', 'YOU', 'WEST', 'VIRGINIA', '</s>'],
 ['<s>',
  '@',
  'ChrisOwens',
  'I',
  'would',
  'love',
  'to',
  'see',
  'how',
  'upset',
  '@',
  'HillaryClinton',
  'would',
  'get',
  'if',
  'she',
  'were',
  'in',
  'a',
  'debate',
  'against',
  '@',
  'realDonaldTrump',
  '</s>'],
 ['<s>',
  '@',
  'gordonsr',
  'Washington',
  'BIG',
  'Thank',
  'You',
  'for',
  'supporting',
  'Trump',
  'We',
  'are',
  'all',
  'going',
  'to',
  'help',
  'Trump',
  'Make',
  'America',
  'Great',
  'Again',
  'Trump',
  '</s>'],
 ['<s>', '@', 'OwenKelly', 'Thanks', '</s>'],
 ['<s>',
  'and',
  'says',
  'something',
  'is',
  'seriously',
  'wrong',
  'He',
  'will',
  'never',
  'go',
  'down',
  'as',
  'great',
  '</s>'],
 ['<s>',
  '#',
  'NATOSummit',
  'Press',
  'Conference',
  'in',
  'Brussels',
  'Belgium',
  '</s>'],
 ['<s>',
  '@',
  'opensezme',
  '@',
  'realDonaldTrump',
  'and',
  'theres',
  'a',
  'move',
  'afoot',
  'to',
  'have',
  'women',
  'in',
  

In [34]:
def build_n_gram_models(n, data):
    """
    To predict the first few words of the Tweet, we need the smaller n-grams as
    well. This method does calculate all n-grams up to the given n.
    """
    ### YOUR CODE HERE
    
    n_gram_models = {}

    for i in range(1, n+1):
        n_gram_model = {}
        for tweet in data:
            for j in range(len(tweet)-i):
                n_gram = tuple(tweet[j:j+i])
                if n_gram not in n_gram_model:
                    n_gram_model[n_gram] = []
                n_gram_model[n_gram].append(tweet[j+i])

        # Store the model
        n_gram_models[i] = n_gram_model
    return n_gram_models

    ### END YOUR CODE


def get_suggestion(prev, n_gram_model):
    """
    Gets the next random word for the given n_grams.
    The size of the previous tokens must be exactly one less than the n-value
    of the n-gram, or it will not be able to make a prediction.
    """
    ### YOUR CODE HERE
    
    # Check if the previous tokens are in the n-gram model
    if tuple(prev) in n_gram_model:
        # Get the next word
        next_words = n_gram_model[tuple(prev)]
        # Choose a random word from the list
        return next_words[0]  # Replace with random choice if needed
    else:
        return None  # No suggestion available for the given previous tokens

    ### END YOUR CODE


def get_random_tweet(n, n_gram_models):
    """Generates a random tweet using the given data set."""
    ### YOUR CODE HERE
    
    # Choose a random starting point
    start = ["<s>"]
    # Choose a random n-gram model
    n_gram_model = n_gram_models[n]
    # Generate the tweet
    tweet = start.copy()
    while True:
        # Get the last n-1 tokens
        prev = tweet[-(n-1):]
        # Get the next word
        next_word = get_suggestion(prev, n_gram_model)
        if next_word is None or next_word == "</s>":
            break
        tweet.append(next_word)


    return " ".join(tweet)  # Remove start and end tokens
    ### END YOUR CODE

In [35]:
n_gram_models = build_n_gram_models(5, trump_tweets_train)
random_tweet_trump = get_random_tweet(3, n_gram_models)
print(random_tweet_trump)

<s>


In [42]:
random_tweet_trump = get_random_tweet(3, n_gram_models)
print(random_tweet_trump)

<s>


### Classify the Tweets

3.1 Use the log-ratio method to classify the Tweets for Trump vs. Biden. Trump should be easy to spot; but what about Obama vs. Biden?

3.2 Analyze: At what context length (n) does the system perform best?

In [51]:
def calculate_single_token_log_ratio(prev, token, n_gram_model1, n_gram_model2):
    """Calculates the log ration of a token for two different n-grams"""
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE


def classify(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given datasets is more likely for the given Tweet.
    If true is returned, the first one is more likely, otherwise the second.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE


In [52]:
def validate(n, data1, data2, classify_fn):
    """
    Trains the n-gram models on the train data and validates on the test data.
    Uses the implemented classification function to predict the Tweeter.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE

In [53]:
# context_length = ...
# validate(context_length, trump_tweets, biden_tweets, classify_fn=classify)
# validate(context_length, obama_tweets, biden_tweets, classify_fn=classify)

### Compute Perplexities

4.1 Compute (and plot) the perplexities for each of the test tweets and models. Is picking the Model with minimum perplexity a better classifier than in 3.1?

In [54]:
def classify_with_perplexity(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given datasets is more likely for the given Tweet.
    If true is returned, the first one is more likely, otherwise the second.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE

In [55]:
# context_length = ...
# validate(context_length, trump_tweets, biden_tweets, classify_fn=classify_with_perplexity)
# validate(context_length, obama_tweets, biden_tweets, classify_fn=classify_with_perplexity)