<a href="https://colab.research.google.com/github/ENGS-108-Fall-23/assignment_6_Fall2023-Filip-Nowicki/blob/main/Assignment_6_Fall2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tweet Generator : NPL and GPT2**

In [1]:
''' Import Statements '''
import os
import random
import numpy as np
import pandas as pd
import csv
import tqdm
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Creating a Trump Tweet Generator!!!
We will be using a repository of Donald Trump tweets scrapped from Twitter through June 2020 from [Kaggle](https://www.kaggle.com/datasets/austinreese/trump-tweets) and will use the following code blocks to download the dataset directly to your Google Drive.

## Creating a Kaggle API Token
First we will need to download an API token from Kaggle in order to download the dataset, so our first step is to create a Kaggle account.
1. Create a Kaggle account by following the sign up instructions [here](https://www.kaggle.com/).
2. Log into your Kaggle account and click your account icon on the upper righthand side.
3. Then select **Account** from the dropdown/sidebar menu.
4. Scroll down to the **API** section and select **Create New API Token**.
5. This will download a JSON file called kaggle.json to your Downloads folder on your computer.
6. Now run the following code block and when the **Browse** button appears, click it and select that kaggle.json file.

In [4]:
# Run this code block after creating a Kaggle API token as instructed above.
! pip install -q kaggle
from google.colab import files

# Will ask you to upload kaggle.json file and remove any old ones.
if os.path.exists('kaggle.json'):
  os.remove('kaggle.json')
files.upload()

# Will create the appropriate directory structure
if not os.path.exists('/root/.kaggle'):
  ! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
# Also we are going to make a directory called result
if not os.path.exists('/content/results'):
  ! mkdir /content/result

Saving kaggle.json to kaggle.json


## Downloading the Dataset

7. Now we have downloaded our Kaggle credentials we can now download the Trump Tweets Dataset (or any other Kaggle dataset for that matter) directly into our Google Drive.

In [5]:
! kaggle datasets download austinreese/trump-tweets
! unzip trump-tweets.zip

Downloading trump-tweets.zip to /content
  0% 0.00/6.88M [00:00<?, ?B/s]100% 6.88M/6.88M [00:00<00:00, 70.5MB/s]
100% 6.88M/6.88M [00:00<00:00, 70.0MB/s]
Archive:  trump-tweets.zip
  inflating: realdonaldtrump.csv     
  inflating: trumptweets.csv         


## Loading the Dataset using Pandas
Now let's inspect the trump tweets dataset and see what we have to work with... Brace yourselves.

In [2]:
# Let's load in the two files that we inflated from the Kaggle download. Both realdonaldtrump.csv and trumptweets.csv are the same.
real_donald_trump_df = pd.read_csv('realdonaldtrump.csv')

In [3]:
real_donald_trump_df.head()

Unnamed: 0,id,link,content,date,retweets,favorites,mentions,hashtags
0,1698308935,https://twitter.com/realDonaldTrump/status/169...,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,,
1,1701461182,https://twitter.com/realDonaldTrump/status/170...,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,,
2,1737479987,https://twitter.com/realDonaldTrump/status/173...,Donald Trump reads Top Ten Financial Tips on L...,2009-05-08 08:38:08,13,19,,
3,1741160716,https://twitter.com/realDonaldTrump/status/174...,New Blog Post: Celebrity Apprentice Finale and...,2009-05-08 15:40:15,11,26,,
4,1773561338,https://twitter.com/realDonaldTrump/status/177...,"""My persona will never be that of a wallflower...",2009-05-12 09:07:28,1375,1945,,


## Problem 1: Pre-Processing the Tweets
As you may have noticed from the dataframe we just loaded, there are some special characters we need to handle. If we are making a tweet or sentence generator, we don't want to mess with special characters like commas or colons or really even captialization. So in the following section you are going to preprocess the data and strip these special characters out.

### Task 1: The Preprocess function
In the following code the preprocess function that will strip or substitute out various special characters or sequences of characters that may not be ideal when training a sentence generator. We will use the built-in re python library to do a number of substitutions.

In [4]:
import re

REGEX_SUBS = {
  '\\': ' ',
  '\n': ' ',
  '&': '',
  'RT ': '',
  '~': '',
  '#': '',
  '!+': '',
  '(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)': 'link',
  '[*]': '',
  '[@]\w+': 'user',
  '[:|;]': '',
  '[\\x]\W+|\d': '',
  '[\\x]\W+|\d': '',
  '[\\x]\w+': '',
  '    ': ' ',
  '   ': ' ',
  '  ': ' ',
}

def preprocess(text):
  # TODO: Complete the function and using the provided regular expression substitutions.
  for key, value in REGEX_SUBS.items():
    #TODO: Do something, maybe look at the re.sub command.
    text.replace(key, value)
  #TODO: Also make everything lowercase
  text = text.lower()
  return text

### Task 2: Preprocess the dataset.
Now that we have our preprocess function, let's preprocess all the tweets.

In [5]:
#TODO: Run all the Dataframe content through your preprocess function.
real_donald_trump_df['pre_processed_content'] = real_donald_trump_df['content'].apply(preprocess)

### Task 3: Tokenize the Data
The next step in every Natural Language Processing task is to tokenize the data, i.e., seperate our words, special characters, etc. into separate unique tokens. We will be using the [Natural Language Tool Kit](https://www.nltk.org/index.html) in python to accomplish this. Study the nltk API docs and see if you can tokenize our data.

In [6]:
from nltk.tokenize import word_tokenize

tokenized = []
# TODO: Tokenize each preprocessed tweet
for i in real_donald_trump_df['pre_processed_content']:
  tokenized.append(['<s>'] + word_tokenize(i) + ['</s>'])

### Task 4: Build a Vocabulary
Now that we have our data tokenized, let's build a vocabulary including beginning and ending of sentence tokens, i.e., \<s>, \</s> for example. At this point let's also add in these beginning and end tokens into each of our data instances.

In [7]:
tweets_concatenated = list(np.concatenate(tokenized))
unique_tokens = list(np.unique(tweets_concatenated))

In [8]:
#TODO: Build a vocabulary dictionary and inverse vocabulary dictionary
vocabulary = {token: idx for idx, token in enumerate(unique_tokens)}
vocabulary_inverse = {idx: token for token, idx in vocabulary.items()}

#TODO: Encode training data.
train_data = [*map(vocabulary.get, tweets_concatenated)]

# Problem 1: N-gram Language Model
We will be building a couple n-gram language modeling and see how well just taking pure frequency counts and building a conditional probability distrbution will work.

## Task 1: A 2-gram (Bigram) Model
Recall that our goal in building a language model is to represent the conditional probability $P(w_i | w_{i-1})$ for pairs of words $w_i$ and $w_j$.

### Part A: Frequencies
Go through the encoded Trump tweets and calculate the frequencies of all words as well as all pair of words that appear next to each other in the corpus.

In [9]:
from collections import defaultdict

def make_grams(tweets, n=2):
    if n > len(tweets) or n <= 0:
        return {}

    counts = {}

    # Count occurrences of each n-gram
    for i in range(len(tweets) - n + 1):
        ngram = tuple(tweets[i:i+n])
        counts[ngram] = counts.get(ngram, 0) + 1

    grams = list(counts.keys())

    return grams, counts

grams, counts = make_grams(train_data)

### Part B: Probabilities
Now from the counts above we will calculate an associated conditional probabilities.

In [10]:
prefix_counts = defaultdict(int)

for gram, count in counts.items():
  prefix = gram[:-1]
  prefix_counts[prefix] += count

probs = defaultdict(dict)

for gram, count in counts.items():
  prefix = gram[:-1]
  next_word = gram[-1]
  probs[prefix][next_word] = count/prefix_counts[prefix]

### Part C: Make an Bigram Generator Function
Now that we have our probability distrbution, let's make a generator function so that we can generate random Trump tweets using our bigram language model.

In [11]:
def get_next_word(next_word_probs):
  #TODO: Write a function to get the next word based on the previous words conditional probabilities.
  words, probabilities = list(next_word_probs.keys()), list(next_word_probs.values())

  next_word = random.choices(words, weights = probabilities, k = 1)

  return next_word[0]

def generate(start_text='', n=10, probs=probs, vocabulary=vocabulary, vocabulary_inverse=vocabulary_inverse):
  # Helper code to create the start text.
  start_text = ['<s>'] + nltk.word_tokenize(preprocess(start_text))

  #TODO: Encode and generate a trump tweet.
  text_encoded = [*map(vocabulary.get, start_text)]
  text = start_text

  for i in range(n):
    last_word = text_encoded[-1]
    next_word = get_next_word(probs[(last_word,)])
    text_encoded.append(next_word)
    text += [vocabulary_inverse[next_word]]

  return text

In [12]:
' '.join(generate('I can make'))

"<s> i can make america '' thx to elude most racist , kentucky- on"

# Problem 2: Using a Transformer
In this section we are going to leverage a pretrained transformer, i.e., [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), built by the [Hugging Faces Team](https://huggingface.co/gpt2?text=A+long+time+ago%2C). We will also be using their tokenizers because they have been optimized for the language generation task. You should be aware that behind the scenes, their model is using [PyTorch](https://pytorch.org/), the deep learning library built by Facebook, and is quickly becoming more popular than our beloved Tensorflow.

## Task 1: Install and load the GPT2 Model

In [14]:
# Run the following code to install the HuggingFace's transformer python package.
! pip install transformers



In [66]:
# TODO: Follow the link above and load the GPT2 model as well as the tokenizer.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

## Task 2: Formatting our Training Data
Upcoming we will be fine-tunning this GPT-2 model on our Trump Tweets, but in order to leverage some of the utility classes built by HuggingFaces, we want to take our preprocessed Trump tweets and place them in a flat text file.

In [67]:
#TODO: Take our preprocessed trump tweets and write them to a text file
with open('trumpdata.txt', 'w') as text_file:
  pd.set_option('display.max_colwidth', -1)
  tweets_string = real_donald_trump_df['pre_processed_content'].to_string(header=False, index=False)
  text_file.write(tweets_string)

  pd.set_option('display.max_colwidth', -1)


## Task 4: Fine-tuning the Model
In the following sections, we will complete a couple functions that will allow us to fine-tune the GPT-2 model to our Trump Tweets. See the following documentation from Hugging Faces to see the attributes for the [Trainer Class](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer).

In [68]:
from transformers import Trainer, TrainingArguments
from transformers import TextDataset, DataCollatorForLanguageModeling

def load_dataset(file_path, tokenizer, block_size = 128):
    # Will load and tokenize the data
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset

def load_data_collator(tokenizer, mlm = False):
    # Helper Function
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=mlm,
    )
    return data_collator

def train(
    train_file_path,
    model,
    tokenizer,
    output_dir,
    overwrite_output_dir,
    num_train_epochs,
    save_steps
    ):
  #TODO: Use the helper functions above to load the dataset and data collector.
  dataset = load_dataset(train_file_path, tokenizer)
  data_collator = load_data_collator(tokenizer)

  tokenizer.save_pretrained(output_dir)
  model.save_pretrained(output_dir)

  training_args = TrainingArguments(
          output_dir=output_dir,
          overwrite_output_dir=overwrite_output_dir,
          per_device_train_batch_size=8,
          num_train_epochs=num_train_epochs,
      )
  #TODO: Use the Trainer class with the necessary parameters to instaniate the trainer
  trainer = Trainer(
      args = training_args,
      data_collator = data_collator,
      model = model,
      train_dataset = dataset,
      tokenizer = tokenizer
  )

  #TODO: Train and save the model using the train and save functions built into the Trainer class.
  trainer.train()
  trainer.save_model('/content/out_model')

In [69]:
#TODO: Set necessary parameters, here are some defaults.
train_file_path = "/content/trumpdata.txt"
output_dir = '/content/result'
overwrite_output_dir = True
num_train_epochs = 10
save_steps = 500

In [70]:
#TODO: Use your train function to train the model. It takes about 30 minutes to train in colab.
train(
    train_file_path,
    model,
    tokenizer,
    output_dir,
    overwrite_output_dir,
    num_train_epochs,
    save_steps
    )



Step,Training Loss
500,3.3944
1000,3.0679
1500,2.9212
2000,2.8072
2500,2.706
3000,2.644
3500,2.5996
4000,2.5344
4500,2.4769
5000,2.4535


## Task 5: Creating a Tweet Generator
Now that we have our trained model, it's time to generate some Tweets. Since we should have saved our model and tokenizer to an output directory, I've already made some helper functions to load those in. We will focus on our *generate_text* function. The function will take as input some start text like "I am" or "My country is", etc., as well as a max_length parameter which tells the model how much text to generate. Let's Go!

In [74]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model

def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer

def generate_text(sequence, max_length):
    #TODO: Load in the finetuned model and tokenizer
    model = load_model('/content/out_model/')
    tokenizer = load_tokenizer('/content/out_model/')

    # Encode our passed sequence
    ids = tokenizer.encode(
        sequence,
        return_tensors='pt'
        )

    final_outputs = model.generate(
        inputs = ids,
        max_length = max_length,
        do_sample=True,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )
    # Function to print and decode output.
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))

In [75]:
#TODO: Now generate some text!
generate_text('My country is', 50)

My country is a country of migrants, and I can't accept that. It's a country where our people are living their lives without hope, without hope of a decent life."

"We are not here to get our country back together,"
