In [1]:
#!pip install transformers
#!pip install datasets

import numpy as np
from transformers import pipeline
import tensorflow as tf
from datasets import load_dataset
import json
import pandas as pd

In [2]:
bart_model = "facebook/bart-large-cnn"
summarizer = pipeline("summarization", model=bart_model)

In [3]:
# pre-trained model

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]


In [4]:
# take first 6000 data points
dataset = load_dataset("reddit",split='train[:6000]')
dataset



Dataset({
    features: ['author', 'body', 'normalizedBody', 'subreddit', 'subreddit_id', 'id', 'content', 'summary'],
    num_rows: 6000
})

In [6]:
dataset.set_format('pandas')
df = dataset[:]
df 

Unnamed: 0,author,body,normalizedBody,subreddit,subreddit_id,id,content,summary
0,raysofdarkmatter,I think it should be fixed on either UTC stand...,I think it should be fixed on either UTC stand...,math,t5_2qh0n,c69al3r,I think it should be fixed on either UTC stand...,Shifting seasonal time is no longer worth it.
1,Stork13,Art is about the hardest thing to categorize i...,Art is about the hardest thing to categorize i...,funny,t5_2qh33,c6a9nxd,Art is about the hardest thing to categorize i...,Personal opinions 'n shit.
2,Cloud_dreamer,Ask me what I think about the Wall Street Jour...,Ask me what I think about the Wall Street Jour...,Borderlands,t5_2r8cd,c6acx4l,Ask me what I think about the Wall Street Jour...,insults and slack ass insight. \n Wall Street ...
3,NightlyReaper,"In Mechwarrior Online, I have begun to use a m...","In Mechwarrior Online, I have begun to use a m...",gamingpc,t5_2sq2y,c8onqew,"In Mechwarrior Online, I have begun to use a m...","Yes, Joysticks in modern games have apparently..."
4,NuffZetPand0ra,"You are talking about the Charsi imbue, right?...","You are talking about the Charsi imbue, right?...",Diablo,t5_2qore,c6acxvc,"You are talking about the Charsi imbue, right?...",Class only items dropped from high-lvl monsters.
...,...,...,...,...,...,...,...,...
5995,enossirt,I'm a guy I haven't had a reddit account in aw...,I'm a guy I haven't had a reddit account in aw...,TwoXChromosomes,t5_2r2jt,cn99inw,I'm a guy I haven't had a reddit account in aw...,ex cheated for money is now a prostitute/hooke...
5996,Volper2,I'm in US and i just joined a couple big commu...,I'm in US and i just joined a couple big commu...,Diablo,t5_2qore,cn98kbi,I'm in US and i just joined a couple big commu...,"join community, post ""LF(number of slots left)..."
5997,buffdude1100,She won't do much because now your only front ...,She won't do much because now your only front ...,summonerschool,t5_2t9x3,cn8o1ax,She won't do much because now your only front ...,She's fine bot lane although not the strongest...
5998,gladeyes,That looked like fun. Notice though the differ...,That looked like fun. Notice though the differ...,explainlikeimfive,t5_2sokd,cn96f91,That looked like fun. Notice though the differ...,guess what I'm really saying is I expect a lar...


In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

In [12]:
dataset

Dataset({
    features: ['author', 'body', 'normalizedBody', 'subreddit', 'subreddit_id', 'id', 'content', 'summary'],
    num_rows: 6000
})

In [13]:
# remove columns not needed
raw_dataset = dataset.remove_columns(['author', 'body', 'normalizedBody', 'subreddit', 'subreddit_id', 'id'])
raw_dataset

Dataset({
    features: ['content', 'summary'],
    num_rows: 6000
})

In [14]:
import warnings
warnings.filterwarnings('ignore')
# preprocess data 

max_input_length = 1024
max_target_length = 128

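def preprocess_function(examples):
    # The dataset is in pandas format, so each column arrives as a Series;
    # convert to plain lists before tokenizing.
    model_inputs = tokenizer(examples['content'].to_list(), max_length=max_input_length, truncation=True)

    # Tokenize the reference summaries as targets. `text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager in recent transformers releases.
    labels = tokenizer(text_target=examples['summary'].to_list(), max_length=max_target_length, truncation=True)

    model_inputs['labels'] = labels['input_ids']
    return model_inputs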

tokenized_datasets = raw_dataset.map(preprocess_function, batched=True, remove_columns=['content', 'summary'])



In [16]:
train_token = tokenized_datasets[:5000]
train_token

Unnamed: 0,input_ids,attention_mask,labels
0,"[0, 100, 206, 24, 197, 28, 4460, 15, 1169, 334...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 3609, 17274, 10175, 86, 16, 117, 1181, 966..."
1,"[0, 23295, 16, 59, 5, 11111, 631, 7, 18072, 20...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 43854, 5086, 128, 282, 15328, 4, 2]"
2,"[0, 33895, 162, 99, 38, 206, 59, 5, 2298, 852,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 1344, 34869, 8, 25163, 8446, 8339, 4, 1437..."
3,"[0, 1121, 45541, 5557, 41659, 5855, 6, 38, 33,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 9904, 6, 11351, 33017, 11, 2297, 426, 33, ..."
4,"[0, 1185, 32, 1686, 59, 5, 732, 2726, 118, 306...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 21527, 129, 1964, 1882, 31, 239, 12, 49793..."
...,...,...,...
4995,"[0, 713, 864, 21, 553, 59, 10, 186, 536, 6, 8,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 25256, 6, 67, 6, 55, 418, 4, 2]"
4996,"[0, 250, 319, 9, 86, 77, 47, 33, 4259, 19, 701...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 43600, 2580, 2]"
4997,"[0, 243, 18, 1256, 5676, 6, 38, 304, 24, 25, 1...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 354, 146, 10, 6135, 6, 146, 686, 32196, 16..."
4998,"[0, 33553, 10827, 45, 164, 81, 112, 6, 245, 84...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 1185, 64, 109, 24, 6, 53, 42, 10646, 16, 4..."


In [17]:
test_token = tokenized_datasets[5000:]
test_token

Unnamed: 0,input_ids,attention_mask,labels
0,"[0, 717, 12, 6298, 12649, 67, 33, 7, 3511, 19,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 100, 218, 75, 206, 33551, 16, 10, 205, 207..."
1,"[0, 2709, 162, 6, 24, 18, 45, 14, 38, 218, 75,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 4528, 2959, 924, 32, 5072, 10, 9296, 8, 12..."
2,"[0, 133, 2526, 44904, 33894, 3260, 101, 22, 48...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 1106, 5323, 64, 33, 10775, 14006, 15, 12, ..."
3,"[0, 243, 16, 2333, 1256, 4678, 114, 51, 40, 90...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 28674, 203, 95, 304, 1537, 1472, 4, 2]"
4,"[0, 713, 16, 4930, 6, 8, 10367, 342, 418, 15, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 1558, 314, 5, 512, 7, 1032, 3621, 223, 571..."
...,...,...,...
995,"[0, 100, 437, 10, 2173, 38, 2220, 75, 56, 10, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 3463, 25177, 13, 418, 16, 122, 10, 36289, ..."
996,"[0, 100, 437, 11, 382, 8, 939, 95, 1770, 10, 8...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 26960, 435, 6, 618, 22, 574, 597, 1640, 30..."
997,"[0, 2515, 351, 75, 109, 203, 142, 122, 110, 12...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 2515, 18, 2051, 14084, 6625, 1712, 45, 5, ..."
998,"[0, 1711, 1415, 101, 1531, 4, 22873, 600, 5, 2...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 5521, 3361, 99, 38, 437, 269, 584, 16, 38,..."


In [18]:
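import datasets
from datasets import Dataset  # needed for Dataset.from_pandas below

# the slices above are pandas DataFrames; convert them back to Dataset objects
train_ds = Dataset.from_pandas(train_token)
test_ds = Dataset.from_pandas(test_token)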
train_ds

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 5000
})

In [20]:
## combine train and test
dd = datasets.DatasetDict({"train":train_ds,"test":test_ds})
dd

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

In [21]:
from transformers import AutoModelForSeq2SeqLM
# BART
model = AutoModelForSeq2SeqLM.from_pretrained(bart_model).to('cuda')
model

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50264, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-11): 12 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): LayerN

In [22]:
from transformers import Seq2SeqTrainingArguments
# fine-tune hyper-parameters
batch_size = 16
num_train_epochs = 3

logging_steps = len(train_token) // batch_size
model_name = bart_model.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-reddit",
    evaluation_strategy="epoch",
    learning_rate=.001,
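    # note: per_gpu_* is deprecated; per_device_*_batch_size is the preferred name (see the warnings in the output)
    per_gpu_train_batch_size=batch_size,
    per_gpu_eval_batch_size=batch_size,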
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=True,
)

In [23]:
#!pip install rouge_score
#! pip install evaluate
import evaluate

rouge_score = evaluate.load("rouge")

In [24]:
from nltk.tokenize import sent_tokenize

# rouge metrics

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
  
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
  
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
   
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
 
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    
    # Compute ROUGE scores
    result = rouge_score.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Scale the scores to percentages and round to four decimals
    result = {key: value * 100 for key, value in result.items()}
    return {k: round(v, 4) for k, v in result.items()}
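
To make the Rouge1 column easier to interpret, here is a minimal sketch of a unigram-overlap F1 score in plain Python. It is a simplification and not the implementation behind `evaluate.load("rouge")`: it splits on whitespace and skips stemming, and the function name `rouge1_f1` is a hypothetical helper introduced only for this illustration.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Four of six unigrams overlap on each side, so precision = recall = 4/6.
score = rouge1_f1("the cat sat on the mat", "the cat lay on a mat")
print(round(score * 100, 4))
```

Multiplying by 100, as `compute_metrics` does, puts the value on the same scale as the scores reported during training.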

In [25]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
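
The collator pads each batch dynamically, and by default it pads `labels` with -100 so that padded positions are ignored by the loss. That is why `compute_metrics` swaps -100 back to the pad token id before decoding. The sketch below mimics this round trip with plain lists; `pad_labels` and `restore_pad_tokens` are hypothetical helpers for illustration, and the pad id of 1 is taken from `padding_idx=1` in the model printout.

```python
PAD_LABEL = -100   # default label_pad_token_id used by DataCollatorForSeq2Seq
PAD_TOKEN_ID = 1   # BART pad token id (padding_idx=1 in the model printout)

def pad_labels(batch, pad_value=PAD_LABEL):
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (longest - len(seq)) for seq in batch]

def restore_pad_tokens(batch, pad_token_id=PAD_TOKEN_ID):
    """Replace the loss-masking value with a real token id so decoding works."""
    return [[pad_token_id if tok == PAD_LABEL else tok for tok in seq] for seq in batch]

# Two label sequences of different lengths, copied from the tokenized output above.
labels = [[0, 43600, 2580, 2], [0, 25256, 6, 67, 6, 55, 418, 4, 2]]
padded = pad_labels(labels)
restored = restore_pad_tokens(padded)
print(padded[0])
print(restored[0])
```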

In [28]:
from transformers import Seq2SeqTrainer

dd.set_format('torch')

# train inputs

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=dd["train"],
    eval_dataset=dd["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,

)

/content/bart-large-cnn-finetuned-reddit is already a clone of https://huggingface.co/ldrace/bart-large-cnn-finetuned-reddit. Make sure you pull the latest changes with `repo.git_pull()`.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


In [29]:
import nltk
nltk.download('punkt')
# Train Step

trainer.train()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future version. Using `--per_device_eval_batch_size` is preferred.
You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
1,3.3463,3.403213,16.1311,3.5596,11.4436,13.8679
2,2.4024,3.457046,16.7384,3.5535,11.6667,13.9389
3,1.771,3.710875,16.6048,3.525,11.5267,13.798
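
One way to read the table: training loss keeps falling while validation loss rises every epoch, which may indicate overfitting after the first epoch. The sketch below copies the logged values into a list and picks the best epoch under two criteria. It is only an illustration of the comparison, not something the training run itself did.

```python
# Values copied from the training log above.
history = [
    {"epoch": 1, "train_loss": 3.3463, "val_loss": 3.403213, "rouge1": 16.1311},
    {"epoch": 2, "train_loss": 2.4024, "val_loss": 3.457046, "rouge1": 16.7384},
    {"epoch": 3, "train_loss": 1.7710, "val_loss": 3.710875, "rouge1": 16.6048},
]

# Best epoch depends on the criterion chosen.
best_by_loss = min(history, key=lambda row: row["val_loss"])
best_by_rouge1 = max(history, key=lambda row: row["rouge1"])

# Compare consecutive epochs to see the direction of each loss curve.
pairs = list(zip(history, history[1:]))
train_loss_falling = all(a["train_loss"] > b["train_loss"] for a, b in pairs)
val_loss_rising = all(a["val_loss"] < b["val_loss"] for a, b in pairs)

print("best epoch by validation loss:", best_by_loss["epoch"])
print("best epoch by Rouge1:", best_by_rouge1["epoch"])
print("train loss falling:", train_loss_falling, "| validation loss rising:", val_loss_rising)
```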


TrainOutput(global_step=939, training_loss=2.5054964816100553, metrics={'train_runtime': 1894.3028, 'train_samples_per_second': 7.918, 'train_steps_per_second': 0.496, 'total_flos': 2.434315371754291e+16, 'train_loss': 2.5054964816100553, 'epoch': 3.0})
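
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; under those assumptions the step count and throughput figures line up with the log.

```python
import math

num_train_examples = 5000
batch_size = 16
num_train_epochs = 3
train_runtime = 1894.3028  # seconds, from TrainOutput

# 5000 / 16 = 312.5, so the last batch of each epoch is a partial one.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# The notebook uses floor division for logging_steps, which gives one step fewer.
logging_steps = num_train_examples // batch_size

samples_per_second = num_train_examples * num_train_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps, logging_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```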

In [30]:
# Eval Step
trainer.evaluate()


{'eval_loss': 3.710874557495117,
 'eval_rouge1': 16.6048,
 'eval_rouge2': 3.525,
 'eval_rougeL': 11.5267,
 'eval_rougeLsum': 13.798,
 'eval_runtime': 251.8783,
 'eval_samples_per_second': 3.97,
 'eval_steps_per_second': 0.25,
 'epoch': 3.0}