# Build A Text Summariser Using LLMs with Hugging Face

### Source: https://www.analyticsvidhya.com/blog/2023/07/build-a-text-summariser-using-llms-with-hugging-face/

In [1]:
from datasets import load_dataset 
from transformers import pipeline

In [2]:
# Load the XSum dataset (BBC news articles paired with one-sentence summaries).
xsum_dataset = load_dataset(
    "xsum",
    version="1.2.0",
    cache_dir='/Documents/Huggin_Face/data'
)  # Note: cache_dir lets later runs reuse the downloaded data.
xsum_dataset
# The printed representation of this object shows the `num_rows`
# of each dataset split.

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

In [3]:
xsum_sample = xsum_dataset["train"].select(range(10))

display(xsum_sample.to_pandas())

| | document | summary | id |
|---|---|---|---|
| 0 | The full cost of damage in Newton Stewart, one... | Clean-up operations are continuing across the ... | 35232142 |
| 1 | A fire alarm went off at the Holiday Inn in Ho... | Two tourist buses have been destroyed by fire ... | 40143035 |
| 2 | Ferrari appeared in a position to challenge un... | Lewis Hamilton stormed to pole position at the... | 35951548 |
| 3 | John Edward Bates, formerly of Spalding, Linco... | A former Lincolnshire Police officer carried o... | 36266422 |
| 4 | Patients and staff were evacuated from Cerahpa... | An armed man who locked himself into a room at... | 38826984 |
| 5 | Simone Favaro got the crucial try with the las... | Defending Pro12 champions Glasgow Warriors bag... | 34540833 |
| 6 | Veronica Vanessa Chango-Alverez, 31, was kille... | A man with links to a car that was involved in... | 20836172 |
| 7 | Belgian cyclist Demoitie died after a collisio... | Welsh cyclist Luke Rowe says changes to the sp... | 35932467 |
| 8 | Gundogan, 26, told BBC Sport he "can see the f... | Manchester City midfielder Ilkay Gundogan says... | 40758845 |
| 9 | The crash happened about 07:20 GMT at the junc... | A jogger has been hit by an unmarked police ca... | 30358490 |


### There are five original T5 checkpoints to choose from: t5-small, t5-base, t5-large, t5-3b, and t5-11b.
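
Choosing among the T5 checkpoints is mostly a trade-off between summary quality and memory. The sketch below is one way to make that choice explicit. The helper `pick_t5_checkpoint` and the `T5_CHECKPOINTS` table are hypothetical names introduced here, and the parameter counts are the approximate figures usually quoted for these models, not values read from the Hub.

```python
# Approximate parameter counts for the five original T5 checkpoints.
# These are rounded, commonly quoted figures (an assumption, not measured here).
T5_CHECKPOINTS = {
    "t5-small": 60_000_000,
    "t5-base": 220_000_000,
    "t5-large": 770_000_000,
    "t5-3b": 3_000_000_000,
    "t5-11b": 11_000_000_000,
}


def pick_t5_checkpoint(max_params: int) -> str:
    """Return the largest checkpoint whose parameter count fits the budget.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    candidates = [(count, name) for name, count in T5_CHECKPOINTS.items()
                  if count <= max_params]
    if not candidates:
        raise ValueError(f"No T5 checkpoint fits within {max_params} parameters")
    # max() compares the parameter count first, so the largest model wins.
    return max(candidates)[1]


print(pick_t5_checkpoint(300_000_000))
```

The returned name can be passed as the `model` argument of `pipeline(...)` in the next cell.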

### Other models worth experimenting with:
### mT5 (base), XLM-ProphetNet, mBART-50, IndicBART, BanglaT5

In [4]:
summarizer = pipeline(
    task="summarization",
    model="t5-small",
    min_length=20,
    max_length=40,
    truncation=True,
    model_kwargs={"cache_dir": '/Documents/Huggin_Face/'},
) 

In [5]:
summarizer(xsum_sample["document"][0])

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'}]
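
The raw output above is lowercased at sentence starts and has a space before each period. A minimal post-processing sketch, assuming that only simple sentence punctuation needs fixing, could look like this. The function name `tidy_summary` is hypothetical, and the regular expressions will also affect text such as spaced-out decimals, so treat it as a starting point.

```python
import re


def tidy_summary(text: str) -> str:
    """Tighten punctuation spacing and capitalise sentence starts (illustrative helper)."""
    # Drop whitespace that precedes punctuation: "assessed ." -> "assessed."
    text = re.sub(r"\s+([.,!?;:])", r"\1", text.strip())
    # Split after sentence-ending punctuation, keeping the punctuation attached.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Upper-case only the first character so names such as "Newton Stewart" are untouched.
    return " ".join(s[:1].upper() + s[1:] for s in sentences if s)


raw = ("the full cost of damage in Newton Stewart is still being assessed . "
       "many roads in peeblesshire remain badly affected by standing water .")
print(tidy_summary(raw))
```

Note that the helper cannot repair a summary that was cut off at `max_length`, such as the trailing "across the" in the output above.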

In [None]:
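# Ask the user for input
input_text = input("Enter the text you want to summarize: ")

# Generate the summary. The lengths passed here override the
# min_length/max_length defaults set when the pipeline was created.
summary = summarizer(input_text, max_length=150, min_length=30, do_sample=False)[0]['summary_text']

# Print each sentence as a bullet point. t5-small emits " . " between
# sentences, so strip leftover spaces and periods and skip empty pieces.
for point in summary.split(". "):
    point = point.strip(" .")
    if point:
        print(f"- {point}")

# Print the full generated summary
print("Summary:", summary)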

In [None]:
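from transformers import pipeline, set_seed
set_seed(42)  # make the sampled output reproducible

# GPT-2 is a plain text generator; appending "TL;DR:" prompts it to summarize.
summarizer = pipeline("text-generation", model="gpt2")

ARTICLE = "New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York."
ARTICLE = ARTICLE + " TL;DR:"

# top_k=2 restricts sampling to the two most likely tokens at each step.
ans = summarizer(ARTICLE, min_new_tokens=50, max_new_tokens=120, top_k=2)

print(ans)
print(ans[0]['generated_text'])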
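
By default the text-generation pipeline returns the prompt followed by the continuation in `generated_text`, so the printed result repeats the article. The sketch below shows one way to build a "TL;DR:" prompt and keep only the generated part. Both helper names are hypothetical, and the example uses a made-up model output string so that it runs without downloading GPT-2.

```python
def build_tldr_prompt(article: str) -> str:
    """Append the TL;DR cue that nudges a plain language model to summarise."""
    return article.rstrip() + " TL;DR:"


def extract_continuation(prompt: str, generated_text: str) -> str:
    """Return only the text generated after the prompt (illustrative helper)."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the whole string if the prompt was not echoed back.
    return generated_text.strip()


prompt = build_tldr_prompt("She got married in Westchester County, New York.")
# Stand-in for ans[0]['generated_text']; invented for this example only.
fake_output = prompt + " She married in New York."
print(extract_continuation(prompt, fake_output))
```

Passing `return_full_text=False` when calling the pipeline should give a similar result without manual slicing.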