<a href="https://colab.research.google.com/github/Sammiexx/AL-ML-Project/blob/main/Text_Summarization_using_BART_Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarization using BART Transformer

In [None]:
!pip install transformers
!pip install datasets

In [None]:
# LOADING THE DATASET
from datasets import load_dataset

ds = load_dataset("knkarthick/dialogsum")

In [None]:
ds

In [None]:
ds['train'][1]['dialogue']

In [None]:
ds['train'][1]['summary']

### WITHOUT FINE-TUNING

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("summarization", model="facebook/bart-large-cnn")

In [None]:
article_1 = ds['train'][1]['dialogue']

In [None]:
pipe(article_1, max_length=20, min_length=10, do_sample=False)

In [None]:
ds['train'][1]['summary']

### WITH FINE-TUNING

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

In [None]:
# tokenization
def preprocess_function(batch):
  source = batch['dialogue']
  target = batch['summary']
  # Tokenize the dialogues (model inputs) and the reference summaries (labels)
  source_ids = tokenizer(source, padding='max_length', truncation=True, max_length=128)
  target_ids = tokenizer(target, padding='max_length', truncation=True, max_length=128)

  # Replace padding token ids in the labels with -100 so the loss ignores them
  labels = target_ids['input_ids']
  labels = [[(label if label != tokenizer.pad_token_id else -100) for label in labels_example] for labels_example in labels]

  return {
      'input_ids': source_ids['input_ids'],
      'attention_mask': source_ids['attention_mask'],
      'labels': labels
  }
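
The label-masking step in the preprocessing function can be seen in isolation on toy data. The sketch below assumes a padding id of 1 (the value BART tokenizers normally use; check `tokenizer.pad_token_id`) and made-up token ids. The helper name `mask_padding` is hypothetical and not part of the notebook.

```python
# Toy illustration of label masking; the token ids below are made up.
PAD_TOKEN_ID = 1      # assumption: confirm with tokenizer.pad_token_id
IGNORE_INDEX = -100   # PyTorch's cross-entropy loss skips targets equal to -100 by default

def mask_padding(labels, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Replace padding ids with ignore_index in a batch of label sequences."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in labels
    ]

batch = [
    [0, 713, 16, 2, 1, 1],   # four real tokens, two padding positions
    [0, 250, 2, 1, 1, 1],    # three real tokens, three padding positions
]
masked = mask_padding(batch)
print(masked)
# [[0, 713, 16, 2, -100, -100], [0, 250, 2, -100, -100, -100]]
```

Only the padded positions change, so the loss is computed over the real summary tokens alone.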

In [None]:
df_source = ds.map(preprocess_function, batched=True, batch_size=1000, num_proc=4)

In [None]:
# training arguments
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='/content',          # output directory
    num_train_epochs=2,              # total number of training epochs
    per_device_train_batch_size=8,
    remove_unused_columns=True
)
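
To get a rough sense of how long training will run, the number of optimizer steps implied by these arguments can be estimated ahead of time. This is a minimal sketch: the function `total_optimizer_steps` is hypothetical, the example count of 12,000 is an illustrative placeholder (use `len(df_source['train'])` for the real figure), and the result is an approximation of what `Trainer` reports.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                          num_devices=1, gradient_accumulation_steps=1):
    """Approximate number of optimizer updates for a full training run."""
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs

# Placeholder dataset size; batch size and epochs match the arguments above.
steps = total_optimizer_steps(num_examples=12_000, per_device_batch_size=8, num_epochs=2)
print(steps)  # 3000
```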

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=df_source['train'],
    eval_dataset=df_source['test']
)


In [None]:
trainer.train()

In [None]:
eval_results = trainer.evaluate()

In [None]:
eval_results
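
Because no `compute_metrics` function is passed to `Trainer`, `evaluate()` reports the loss and timing figures rather than a summary-quality score. Summaries are commonly scored by word overlap with the reference, as in ROUGE. The sketch below is a simplified unigram-overlap F1 on whitespace tokens, not the official ROUGE implementation; the function name `rouge1_f1` and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 style F1: unigram overlap on lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Toy sentences: 5 of 6 tokens overlap, so precision = recall = 5/6
score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 0.8333
```

The same function could be applied to a generated summary and `ds['test'][i]['summary']` to compare the model before and after fine-tuning.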

## SAVING THE MODEL

In [None]:
model.save_pretrained('/content/model_directory')
tokenizer.save_pretrained('/content/model_directory')

In [None]:
tokenizer = AutoTokenizer.from_pretrained('/content/model_directory')
model = AutoModelForSeq2SeqLM.from_pretrained('/content/model_directory')

def summarize(blog_post):
  # Tokenize the input blog post, truncating to at most 1024 tokens
  inputs = tokenizer(blog_post, max_length=1024, truncation=True, return_tensors='pt')

  # Generate the summary with beam search, passing the attention mask explicitly
  summary_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=150, min_length=40, num_beams=4, no_repeat_ngram_size=2, early_stopping=True)

  # Decode the summary
  summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

  return summary

In [None]:
blog_post = """
Presidents aren’t supposed to direct IRS investigations
US law specifically prohibits presidents from directing the IRS to investigate anyone in a section entitled: “Prohibition on executive branch influence over taxpayer audits and other investigations.”

While the IRS falls under the Treasury Department, it’s important that it be as protected from politics as possible. That’s why the IRS has only two politically appointed officials, according to Mark Mazur, who was assistant secretary of treasury for tax policy at the outset of the Biden administration

The US has higher voluntary tax payment rates than other countries, Mazur told me, “because people feel that their interactions with the tax system are fair and based on law.”

If the IRS is suddenly used for political purposes, that trust could be destroyed. During the Obama administration, for instance, the IRS became embroiled in a bona fide scandal when a Treasury Department investigation found the IRS delayed conferring tax-exempt status on conservative groups.

If the IRS did find that its tax-exempt status should be revoked, Harvard would need to be warned and given an opportunity to contest the finding. It would also have the opportunity to challenge the IRS in court.

There is already a lot of chaos at the IRS under the new Trump administration. Multiple acting commissioners have resigned, apparently the result of a standoff over whether tax data could be used by immigration officials.

It would not be unprecedented for a university to lose its tax-exempt status
Back in 1983, the Supreme Court agreed that Bob Jones University should not be tax-exempt because, at the time, it banned interracial relationships among its students.

The university didn’t drop its interracial marriage policy until 2000 — in an announcement on CNN’s Larry King Live, coincidentally — although it did not regain its tax-exempt status until 2017.

The US has now come full circle to the point that one of the main gripes Trump has with Harvard is its diversity programs.
"""
summary = summarize(blog_post)
print(f'Summary : {summary}')