## Task: Generating Synthetic Trump-Like Tweets

In this task, you are tasked with creating a language model based on n-grams from approximately 300 tweets extracted from Donald Trump's Twitter profile. You will then generate synthetic tweets that emulate the style of Trump's tweets using this language model.

### Requirements:
- Build an n-gram-based language model using the provided tweet dataset.
- Generate synthetic Trump-like tweets using the trained language model.
- Evaluate the perplexity of the language model on the generated tweets.

### Problem
The problem at hand involves developing a language model that can mimic the writing style of Donald Trump in his tweets. To achieve this, we will train an n-gram-based language model using a dataset containing around 300 of Trump's tweets and then use this model to generate synthetic tweets in his distinctive style.

### Solution
The solution involves the following steps:

#### Data Preprocessing
1. Load the dataset containing Trump's tweets from a CSV file.
2. Clean and preprocess each tweet by removing special characters, URLs, and normalizing certain punctuation marks. Tokenize the cleaned tweets using the TweetTokenizer from NLTK.
3. Calculate basic statistics on the dataset, including the average length in characters and words of the tweets.
4. Split in train and test sets

#### Language Model Training
4. Choose the value of 'N' for the n-grams (e.g., bigram or trigram).
5. Create n-grams and a vocabulary from the preprocessed tweet corpus using the NLTK library.
6. Train a Laplace-smoothed n-gram language model on the n-grams and vocabulary.

#### Synthetic Tweet Generation
7. Generate synthetic Trump-like tweets using the trained language model.
8. For each generated tweet, ensure that it meets a specified word count and does not contain duplicate sentences.

#### Perplexity Evaluation
9. Calculate the perplexity of the trained language model on the test set. Perplexity measures how well the model predicts the generated text.


# Esercitazione 3

## Tweet like (a) Trump

In [93]:
import pandas as pd
from nltk import word_tokenize, TreebankWordDetokenizer
from nltk.lm import Laplace, MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.util import ngrams as nltk_ngrams
from nltk.tokenize.treebank import TreebankWordDetokenizer
import random
import re
from sklearn.model_selection import train_test_split
import string
from nltk.tokenize import TweetTokenizer
from itertools import chain
import itertools


### load data

In [94]:
df = pd.read_csv('../Trump/trump_twitter_archive/tweets.csv') # Caricamento del dataset
display(df)

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str
0,Twitter for iPhone,LOSER! https://t.co/p5imhMJqS1,05-18-2020 14:55:14,32295,135445,False,1262396333064892416
1,Twitter for iPhone,Most of the money raised by the RINO losers of...,05-05-2020 18:18:26,19706,82425,False,1257736426206031874
2,Twitter for iPhone,....because they don’t know how to win and the...,05-05-2020 04:46:34,12665,56868,False,1257532112233803782
3,Twitter for iPhone,....lost for Evan “McMuffin” McMullin (to me)....,05-05-2020 04:46:34,13855,62268,False,1257532114666508291
4,Twitter for iPhone,....get even for all of their many failures. Y...,05-05-2020 04:46:33,8122,33261,False,1257532110971318274
...,...,...,...,...,...,...,...
256,Twitter Web Client,.@Cher attacked @MittRomney. She is an average...,05-10-2012 15:10:23,715,465,False,200603697435246592
257,Twitter Web Client,Firing @lisalampanelli may have come as a surp...,05-07-2012 02:58:18,45,19,False,199332301463748609
258,Twitter Web Client,My @SquawkCNBC interview discussing why @MittR...,03-06-2012 17:07:51,32,9,False,177078050750599168
259,TweetDeck,I feel sorry for Rosie 's new partner in love ...,12-14-2011 16:45:55,667,463,False,146994336670822400


### DATA PREPROCESSING

In [95]:
def tweet_cleaner(tweet):
    # Replace "&amp" with "and"
    clean_tweet = re.sub(r"&amp", "and", tweet)
    # Replace "&" with "and"
    clean_tweet = re.sub(r"&", "and", clean_tweet)
    # Remove multiple consecutive dots (ellipsis)
    clean_tweet = re.sub(r'^\.{2,}', '', clean_tweet)
    # Replace curly single quotes with straight single quotes
    clean_tweet = re.sub(r"’", "'", clean_tweet)
    # Replace semicolons with single quotes
    clean_tweet = re.sub(r';', "'", clean_tweet)
    # Remove double quotation marks
    clean_tweet = re.sub(r'“', '', clean_tweet)
    clean_tweet = re.sub(r'"', '', clean_tweet)
    # Replace double hyphens with a space
    clean_tweet = re.sub(r'--', ' ', clean_tweet)
    # Remove text within parentheses, e.g., (text)
    clean_tweet = re.sub(r'\([^)]*\)', '', clean_tweet)

    # Define patterns to identify special tokens
    usernamepattern = r'@[a-zA-Z0-9]+'
    hashtagpattern = r'#[a-zA-Z0-9]+'
    websitepattern = r'http\S+|www\S+|https\S+'

    # Initialize a TweetTokenizer
    tokenizer = TweetTokenizer()
    # Tokenize the cleaned tweet
    tokens = tokenizer.tokenize(clean_tweet)

    cleaned_tokens = []
    # Iterate through the tokens
    for token in tokens:
        # Check if the token matches one of the special token patterns
        if re.match(usernamepattern, token) or re.match(hashtagpattern, token) or re.match(websitepattern, token):
            cleaned_tokens.extend([token])  # Add the special token as a whole
        else:
            cleaned_token = token  # Keep punctuation and regular words
            # Check if the token is not an empty string
            if token.strip():
                cleaned_tokens.append(cleaned_token)

    # Return the list of cleaned and tokenized tokens
    return cleaned_tokens





## SPLIT IN TRAIN AND TEST SET

In [96]:
# Split the data into training and testing sets (80% train, 20% test)
train_corpus, test_corpus = train_test_split(list(df['text']), test_size=0.2, random_state=42)


# Verify the sizes of the training and testing sets
print("Training set size:", len(train_corpus))
print("Testing set size:", len(test_corpus))


Training set size: 208
Testing set size: 53


In [97]:
trump_corpus = []
trump_test_corpus = []
for tweet in train_corpus:
    trump_corpus.append(tweet_cleaner(tweet))

for tweet in test_corpus:
    trump_test_corpus.append(tweet_cleaner(tweet))

average word per sentence

In [98]:
avg_len, avg_word = sum([len(t) for t in df['text']]) // len(df), sum(len(t.split()) for t in df['text']) // len(df)
avg_len, avg_word

(163, 26)

#### CREATE AND TRAIN THE MODEL bi gram and tri gram laplace

In [99]:
## bigram or trigram
# N = 2
N = 2

n_grams, vocabulary = padded_everygram_pipeline(N, trump_corpus)
trump_model = Laplace(N)
trump_model.fit(n_grams, vocabulary)



tri_grams, vocabulary_3 = padded_everygram_pipeline(N, trump_corpus)
trump_model_3 = Laplace(3)
trump_model_3.fit(tri_grams, vocabulary_3)

#### CREATE AND TRAIN THE MODEL tri gram MLE

In [100]:
bi_gram_mle, vocabulary_mle_2 = padded_everygram_pipeline(2, trump_corpus)
trump_model_MLE_2 = MLE(2)
trump_model_MLE_2.fit(bi_gram_mle, vocabulary_mle_2)




tri_gram_mle, vocabulary_mle_3 = padded_everygram_pipeline(3, trump_corpus)
trump_model_MLE_3 = MLE(3)
trump_model_MLE_3.fit(tri_gram_mle, vocabulary_mle_3)

#### GENERATE THE SENTENCE

In [101]:

def generate_sent(model, random_seed, num_words):
    actual_count = 0  # Initialize the count of generated words
    synthetic_sentence = None  # Initialize the synthetic sentence
    attempts = 0  # Initialize the number of attempts
    isFinish = False  # Flag to indicate if generation is finished
    generated_sentences = set()  # Initialize a set to track generated sentences and avoid duplicates
    
    while actual_count < num_words and not isFinish:
        # Generate a sequence of words using the provided model and random seed
        generated_words = model.generate(num_words, random_seed=random_seed)
        content = []  # Initialize a list to store the content of the synthetic sentence
        
        for word, _ in itertools.groupby(generated_words):
            if word.strip():
                if word == '<s>':
                    continue
                if word == '</s>':
                    if actual_count < num_words:
                        continue

                if len(content) == num_words:
                    if word == '</s>':
                        
                        isFinish = True  # Finish generation if we have enough words
                    else:
                        continue
                        
                if word != '<s>' and word != '</s>':
                    content.append(word)
                    actual_count = len(content)  # Update the word count
        
        random_seed = random.randint(1, 100)  # Update random_seed for the next attempt
        attempts += 1
        
    # Detokenize the content to form a synthetic sentence
    synthetic_sentence = TreebankWordDetokenizer().detokenize(content)
    
    # Check if the synthetic sentence is not already generated to avoid duplicates
    if synthetic_sentence not in generated_sentences:
        generated_sentences.add(synthetic_sentence)
    else:
        return None, None  # Return None if the sentence is a duplicate
    
    # Return the synthetic sentence and its content
    return synthetic_sentence, content


#### LAPLACE bigram and trigram

In [102]:

synthetic_tweets = []
# Generate synthetic tweets using the trained model
print("BI GRAM")
for i in range(0, 10):
    random_seed = random.randint(1, 100)  # Generate a new random seed for each tweet
    tmp_tweet = generate_sent(trump_model, random_seed, num_words=avg_word)
    if tmp_tweet[1] is not None:
        if len(tmp_tweet[1]) > 0 and tmp_tweet[0]!=None:
            synthetic_tweets.append(tmp_tweet)


for real_sentence, _ in synthetic_tweets:
    print("Real Sentence:", real_sentence)


print()
print("################################")
print()
print("TRI GRAM")
synthetic_tweets = []

for i in range(0, 10):
    random_seed = random.randint(1, 100)  # Generate a new random seed for each tweet
    tmp_tweet = generate_sent(trump_model_3, random_seed, num_words=avg_word)
    if tmp_tweet[1] is not None:
        if len(tmp_tweet[1]) > 0 and tmp_tweet[0]!=None:
            synthetic_tweets.append(tmp_tweet)


for real_sentence, _ in synthetic_tweets:
    print("Real Sentence:", real_sentence)

BI GRAM
Real Sentence: which I LOVED the Philippines I only horse in a wig realize to win one thing he couldn't get out nasty guy who is the other
Real Sentence: know I never say these lowlifes follow-nothing to look a real positive in conjunction with in this special date September 11th . EVERYONE knows that.Some LOSERS
Real Sentence: to him is a disgrace to all about Andrew McCabe Peter S and continue to run for iPhone, false reporting . But I don't like
Real Sentence: don't even the act . Do Nothing Dems are jealous and caught in anyway? Ignorance Yes he got out of the haters and losers a
Real Sentence: @CNN which is so right from the Central Park 5 Twitter Web Client, Just ck how are jealous and the Tea Party . Stop busting
Real Sentence: Fake News Awards those who hates Michael Wolff is a total fool . Hence he is still follow loser! Strong fierce noble, false reporting
Real Sentence: Major loser with you-and a very big ratings are sick and is how lucky they will flip if they ever 

### CALCULATE THE PERPLEXITY

In [103]:
print(f"Perplexity for N = {N}: {trump_model.perplexity(trump_test_corpus):.2f}")


Perplexity for N = 2: 1968.00


### MLE

In [104]:

synthetic_tweets = []
# Generate synthetic tweets using the trained model
print("BI GRAM")
for i in range(0, 10):
    random_seed = random.randint(1, 100)  # Generate a new random seed for each tweet
    tmp_tweet = generate_sent(trump_model_MLE_2, random_seed, num_words=avg_word)
    if tmp_tweet[1] is not None:
        if len(tmp_tweet[1]) > 0 and tmp_tweet[0]!=None:
            synthetic_tweets.append(tmp_tweet)


for real_sentence, _ in synthetic_tweets:
    print("Real Sentence:", real_sentence)


print()
print("################################")
print()
print("TRI GRAM")
synthetic_tweets = []

for i in range(0, 10):
    random_seed = random.randint(1, 100)  # Generate a new random seed for each tweet
    tmp_tweet = generate_sent(trump_model_MLE_3, random_seed, num_words=avg_word)
    if tmp_tweet[1] is not None:
        if len(tmp_tweet[1]) > 0 and tmp_tweet[0]!=None:
            synthetic_tweets.append(tmp_tweet)


for real_sentence, _ in synthetic_tweets:
    print("Real Sentence:", real_sentence)

BI GRAM
Real Sentence: discussing why you sit this election cycle out of July to all even the nomination to N . @realDonaldTrump The many people will miss him and
Real Sentence: cycle out we followed their so-called Lincoln Project is a total loser? He used Sloppy Steve has ZERO about power than tearing them out nasty
Real Sentence: and that's why George Will the Military Vets and even called my call with people losing their false, 05-08- 2013 01:15: Mary Kissel is
Real Sentence: Editorial on how bad and haters and' Reed Galvin lost . Cruz is doing this really hard but I will self-destruct just started blocking out
Real Sentence: They should not been really boring and watch @CNN bore their own account / lawyer who sadly plays right? His stupidity could have no buyers
Real Sentence: . Low Ratings . Kelly came on right from whatever they are invited to sell the people losing their bad ., 3815633303 4924032 0 Twitter
Real Sentence: Cuban?, 3262266462 4598630 4 Twitter Web Client, 03-09- 2

### CALCULATE THE PERPLEXITY

In [105]:
print(f"Perplexity for N = {N}: {trump_model_MLE_2.perplexity(trump_test_corpus):.2f}")

Perplexity for N = 2: inf
