# Assignment 2: POTUS

---

## Task 1) President of the United States (Trump vs. Obama)

Surely, you're aware that the 45th President of the United States (@POTUS45) was an active user of Twitter, until (permanently) banned on Jan 8, 2021.
You can still enjoy his greatness at the [Trump Twitter Archive](https://www.thetrumparchive.com/). We will be using original tweets only, so make sure to remove all retweets.
Another fan of Twitter was Barack Obama (@POTUS43 and @POTUS44), who used the platform in a rather professional way.
Please also consider the POTUS Tweets of Joe Biden; we will be using those for testing.

### Data

There are multiple ways to get the data, but the easiest way is to download the files from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group. 
Another way is to directly use the data from [Trump Twitter Archive](https://www.thetrumparchive.com/), [Obama Kaggle](https://www.kaggle.com/jayrav13/obama-white-house), and [Biden Kaggle](https://www.kaggle.com/rohanrao/joe-biden-tweets).
Before you get started, please download the files; you can put them into the data folder.

### N-gram Models

In this assignment, you will be doing some Twitter-related preprocessing and training n-gram models to be able to distinguish between Tweets of Trump, Obama, and Biden.
We will be using [NLTK](https://www.nltk.org), more specifically it's [`lm`](https://www.nltk.org/api/nltk.lm.html) module. 
Install the NLTK package within your working environment.
You can use some of the NLTK functions, but you have to implement the functions for likelihoods and perplexity from scratch.

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [1]:
%pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting click (from nltk)
  Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.11.6-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tqdm (from nltk)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5 MB[0m [31m6.9 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading regex-2024.11.6-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (796 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m796.9/796.9 kB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m
[?25hUsing cached click-8.1.8-py3-none-any.whl (98 kB)
Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, click, nltk
Successfully installed click-8.1.8

In [2]:
# Dependencies
import nltk
import pandas as pd
import regex as re
import sklearn as sk

### Prepare the Data

1.1 Prepare all the Tweets. Since the `lm` modules will work on tokenized data, implement a tokenization method that strips unnecessary tokens but retains special words such as mentions (@...) and hashtags (#...).

1.2 Partition into training and test sets; select about 100 tweets each, which we will be testing on later. As with any Machine Learning task, training and test must not overlap.

In [3]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/yannes/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [4]:
# Notice: ignore retweets 
def beautify_tweet(tweet):
    """Returns a cleaned version of the tweet."""
    ### YOUR CODE HERE
    # remove special charactoars but keep @ and # 
    cleaned_tweet = re.sub(r"[^a-zA-Z\s@#]", "", tweet)
    
    # remove extra spaces
    cleaned_tweet = re.sub(r"\s+", " ", cleaned_tweet)
    
    # remove leading and trailing spaces
    cleaned_tweet = cleaned_tweet.strip()

    # remove https links
    cleaned_tweet = re.sub(r"http\S+|www\S+|https\S+", "", cleaned_tweet, flags=re.MULTILINE)
    
    return cleaned_tweet
    ### END YOUR CODE

def load_trump_tweets(filepath):
    """Loads all Trump tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    data = pd.read_json(filepath)
    data= data['text'].tolist()
    # remove special charactoars but keep @ and # 
    cleaned_tweets = [       beautify_tweet(tweet) for tweet in data]
    
    return cleaned_tweets

    ### END YOUR CODE


def load_obama_tweets(filepath):
    """Loads all Obama tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    data= pd.read_csv(filepath)
    data = data["Tweet-text"].to_list()
    cleaned_tweets = [       beautify_tweet(tweet) for tweet in data]
    
    return cleaned_tweets


    ### END YOUR CODE
    

def load_biden_tweets(filepath):
    """Loads all Biden tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    data= pd.read_csv(filepath)
    data = data["tweet"].to_list()
    cleaned_tweets = [       beautify_tweet(tweet) for tweet in data]
    
    return cleaned_tweets



    ### END YOUR CODE



In [22]:

# Notice: think about start and end tokens

NUM_TEST = 100

def tokenize(text):
    """Tokenizes a single Tweet."""
    ### YOUR CODE HERE
    
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    # Add start and end tokens
    tokens = ["<s>"] + tokens + ["</s>"]
    
    return tokens
    ### END YOUR CODE
    

def split_and_tokenize(data, num_test=NUM_TEST):
    """Splits and tokenizes the given list of Twitter tweets."""
    ### YOUR CODE HERE
    
    #train , test= sk.model_selection.train_test_split(data, test_size=num_test)
    train = [tokenize(tweet) for tweet in data]
    #test = [tokenize(tweet) for tweet in test]
    
    return train
    ### END YOUR CODE

In [23]:
trump_tweets_train  = split_and_tokenize(load_trump_tweets("data/trump.json"))
obama_tweets_train  = split_and_tokenize(load_obama_tweets("data/obama.csv"))
biden_tweets_train  = split_and_tokenize(load_biden_tweets("data/biden.csv"))


### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5] for Obama, Trump, and Biden.

2.2 Also train a joint model, that will serve as background model.

In [24]:
trump_tweets_train[:5]

[['<s>',
  'Republicans',
  'and',
  'Democrats',
  'have',
  'both',
  'created',
  'our',
  'economic',
  'problems',
  '</s>'],
 ['<s>',
  'I',
  'was',
  'thrilled',
  'to',
  'be',
  'back',
  'in',
  'the',
  'Great',
  'city',
  'of',
  'Charlotte',
  'North',
  'Carolina',
  'with',
  'thousands',
  'of',
  'hardworking',
  'American',
  'Patriots',
  'who',
  'love',
  'our',
  'Country',
  'cherish',
  'our',
  'values',
  'respect',
  'our',
  'laws',
  'and',
  'always',
  'put',
  'AMERICA',
  'FIRST',
  'Thank',
  'you',
  'for',
  'a',
  'wonderful',
  'evening',
  '#',
  'KAG',
  '</s>'],
 ['<s>',
  'RT',
  '@',
  'CBSHerridge',
  'READ',
  'Letter',
  'to',
  'surveillance',
  'court',
  'obtained',
  'by',
  'CBS',
  'News',
  'questions',
  'where',
  'there',
  'will',
  'be',
  'further',
  'disciplinary',
  'action',
  'and',
  'cho',
  '</s>'],
 ['<s>',
  'The',
  'Unsolicited',
  'Mail',
  'In',
  'Ballot',
  'Scam',
  'is',
  'a',
  'major',
  'threat',
  'to',

In [15]:
import random
def build_n_gram_models(n, data):
    """
    To predict the first few words of the Tweet, we need the smaller n-grams as
    well. This method does calculate all n-grams up to the given n.
    """
    ### YOUR CODE HERE
    
    n_gram_models = {}

    for i in range(1, n+1):
        n_gram_model = {}
        for tweet in data:
            for j in range(len(tweet)-i):
                n_gram = tuple(tweet[j:j+i])
                if n_gram not in n_gram_model:
                    n_gram_model[n_gram] = []
                n_gram_model[n_gram].append(tweet[j+i])

        # Store the model
        n_gram_models[i] = n_gram_model
    return n_gram_models

    ### END YOUR CODE


def get_suggestion(prev, n_gram_model):
    """
    Gets the next random word for the given n_grams.
    The size of the previous tokens must be exactly one less than the n-value
    of the n-gram, or it will not be able to make a prediction.
    """
    ### YOUR CODE HERE
    
    # Check if the previous tokens are in the n-gram model
    if tuple(prev) in n_gram_model:
        # Get the next word
        next_words = n_gram_model[tuple(prev)]
        # Choose a random word from the list
        return next_words[0]  # Replace with random choice if needed
    else:
        return None  # No suggestion available for the given previous tokens

    ### END YOUR CODE


def get_random_tweet(n, n_gram_models):
    """Generates a random tweet using the given data set."""
    model = n_gram_models[n]
    start_ngram = random.choice(list(model.keys()))
    tweet = list(start_ngram)
    tweet =tweet[0:1]
    print("start ", tweet)
    while True:
        prev = tweet[-(n-1):]
        if len(prev) < n:
            next_word = get_suggestion(prev, n_gram_models[len(prev)])
        else:
            next_word = get_suggestion(prev, model)
        if next_word is None:
            break
        tweet.append(next_word)
    return " ".join(tweet)
    ### END YOUR CODE

In [25]:
n_gram_models_trump = build_n_gram_models(5, trump_tweets_train)
n_gram_models_biden = build_n_gram_models(5, biden_tweets_train)
n_gram_models_obama = build_n_gram_models(5, obama_tweets_train)

##random_tweet_trump = get_random_tweet(4, n_gram_models)
#print(random_tweet_trump)

In [10]:
get_suggestion(["<s>","In"], n_gram_models[2])

NameError: name 'n_gram_models' is not defined

In [27]:
random_tweet_trump = get_random_tweet(3, n_gram_models_obama)
print(random_tweet_trump)

start  ['sure']
sure that the rules of democracy are faireverywherebecause the next generationand doing the hard work to get us there </s>


### Classify the Tweets

3.1 Use the log-ratio method to classify the Tweets for Trump vs. Biden. Trump should be easy to spot; but what about Obama vs. Biden?

3.2 Analyze: At what context length (n) does the system perform best?

In [76]:
import math

def calculate_single_token_log_ratio(prev, token, n_gram_model1, n_gram_model2):
    """
    Calculates the log ratio:
      log( P(token|prev, model1) / P(token|prev, model2) )
    using frequency counts from the two provided n-gram models.
    Smoothing is applied if the context is unseen.
    """
    context = tuple(prev)
    
    # Model 1 counts
    if context in n_gram_model1:
        occurrences1 = n_gram_model1[context]
        count1 = occurrences1.count(token)
        total1 = len(occurrences1)
    else:
        count1, total1 = 0, 0
        
    # Model 2 counts
    if context in n_gram_model2:
        occurrences2 = n_gram_model2[context]
        count2 = occurrences2.count(token)
        total2 = len(occurrences2)
    else:
        count2, total2 = 0, 0
    
    smoothing = 1e-8  # Small constant to avoid division by zero
    prob1 = (count1 + smoothing) / (total1 + smoothing) if total1 > 0 else smoothing
    prob2 = (count2 + smoothing) / (total2 + smoothing) if total2 > 0 else smoothing
    
    return math.log(prob1 / prob2)



def classify(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given datasets is more likely for the given Tweet.
    If true is returned, the first one is more likely, otherwise the second.
    """
    ### YOUR CODE HERE
    rations=[]
    for i, token in enumerate(tokens.split()):
        context = tokens.split()[max(0, i - n ):i]
        n_gram_models_size=n
        if len(context) < n:
            n_gram_models_size=len(context) # n-gram Kontext
        ratio = calculate_single_token_log_ratio(context, token, n_gram_models1[n], n_gram_models2[n])
        print(f"Token: {token}, Context: {context}, Ratio: {ratio}")
        rations.append(ratio)
    if sum(rations) > 1:
        return f"First with {sum(rations)}"
    else:
        return f"Second with {sum(rations)}"
    
    ### END YOUR CODE


In [None]:
calculate_single_token_log_ratio(['strategy', 'to', 'secure'], "GOP", n_gram_models_trump[3], n_gram_models_biden[3])

0.0

In [48]:
def validate(n, data1, data2, classify_fn):
    """
    Trains the n-gram models on the train data and validates on the test data.
    Uses the implemented classification function to predict the Tweeter.
    """
    ### YOUR CODE HERE
    train , test= sk.model_selection.train_test_split(data1, test_size=NUM_TEST)

    # Build n-gram models for both datasets
    n_gram_models1 = build_n_gram_models(n, train)
    n_gram_models2 = build_n_gram_models(n, data2)
    n_gram_models_test = build_n_gram_models(n, test)
    

    random_tweet_= get_random_tweet(n, n_gram_models_test)
    print("test: ", random_tweet_)
    classifyresult = classify_fn(n,random_tweet_ , n_gram_models1, n_gram_models2)
    print("classify result ", classifyresult)
    ### END YOUR CODE

In [70]:
context_length = 3
validate(context_length, trump_tweets_train, biden_tweets_train, classify_fn=classify)
validate(context_length, obama_tweets_train, biden_tweets_train, classify_fn=classify)

start  ['RealDonaldTrump']
test:  RealDonaldTrump in the race and shut down the opposition Im ready for America to be the next POTUS # Trump Thanks for the spirit </s>
Token: RealDonaldTrump, Context: [], Ratio: 0.0
Token: in, Context: ['RealDonaldTrump'], Ratio: 0.0
Token: the, Context: ['RealDonaldTrump', 'in'], Ratio: 0.0
Token: race, Context: ['RealDonaldTrump', 'in', 'the'], Ratio: 0.0
Token: and, Context: ['in', 'the', 'race'], Ratio: -20.125428842099883
Token: shut, Context: ['the', 'race', 'and'], Ratio: -1.098612282001443
Token: down, Context: ['race', 'and', 'shut'], Ratio: 0.0
Token: the, Context: ['and', 'shut', 'down'], Ratio: 18.420680743952367
Token: opposition, Context: ['shut', 'down', 'the'], Ratio: -0.18232155646062123
Token: Im, Context: ['down', 'the', 'opposition'], Ratio: 0.0
Token: ready, Context: ['the', 'opposition', 'Im'], Ratio: 0.0
Token: for, Context: ['opposition', 'Im', 'ready'], Ratio: 0.0
Token: America, Context: ['Im', 'ready', 'for'], Ratio: -0.69314

### Compute Perplexities

4.1 Compute (and plot) the perplexities for each of the test tweets and models. Is picking the Model with minimum perplexity a better classifier than in 3.1?

In [81]:
import math

def classify_with_perplexity(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given datasets is more likely for the given Tweet
    using perplexity. Lower perplexity indicates a better model likelihood.
    
    Returns:
      A string: "First with perplexity ..." if the first model is more likely,
      otherwise "Second with perplexity ...".
    """
    model1 = n_gram_models1[n]
    model2 = n_gram_models2[n]
    smoothing = 1e-8
    log_sum1 = 0.0
    log_sum2 = 0.0
    count_tokens = 0

    # Use a sliding context of up to (n-1) tokens.
    for i in range(len(tokens)):
        start = max(0, i - (n - 1))
        context = tuple(tokens[start:i])
        token = tokens[i]

        # Probability for Model 1
        if context in model1:
            occ1 = model1[context]
            count1 = occ1.count(token)
            total1 = len(occ1)
        else:
            count1, total1 = 0, 0
        prob1 = (count1 + smoothing) / (total1 + smoothing) if total1 > 0 else smoothing

        # Probability for Model 2
        if context in model2:
            occ2 = model2[context]
            count2 = occ2.count(token)
            total2 = len(occ2)
        else:
            count2, total2 = 0, 0
        prob2 = (count2 + smoothing) / (total2 + smoothing) if total2 > 0 else smoothing

        log_sum1 += math.log(prob1)
        log_sum2 += math.log(prob2)
        count_tokens += 1

    avg_log1 = log_sum1 / count_tokens
    avg_log2 = log_sum2 / count_tokens
    perplexity1 = math.exp(-avg_log1)
    perplexity2 = math.exp(-avg_log2)

    print (f"Perplexity Model 1: {perplexity1:.4f}, Perplexity Model 2: {perplexity2:.4f}")
    if perplexity1 < perplexity2:
        return f"First with perplexity {perplexity1:.4f}"
    else:
        return f"Second with perplexity {perplexity2:.4f}"

In [None]:
context_length = 3
validate(context_length, trump_tweets_train, biden_tweets_train, classify_fn=classify_with_perplexity)
validate(context_length, obama_tweets_train, biden_tweets_train, classify_fn=classify_with_perplexity)

start  ['agillogly']
test:  agillogly @ realDonaldTrump DT do you think kids are overmedicated and over diagnosed in the USA Yes </s>
Perplexity Model 1: 100000000.0000, Perplexity Model 2: 100000000.0000
classify result  Second with perplexity 100000000.0000
start  ['<s>']
