<a href="https://colab.research.google.com/github/23KN5A6106/-ATM-PIN-Verification-and-Withdrawal-Using-Nested-If-Statements/blob/main/Untitled52.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
title_generator.py
Generate a concise title for a long document.

Requires (for model-based):
    pip install transformers torch sentencepiece

If you don't want to install transformers, use the rule-based function only.
"""

import re
from typing import Optional

def normalize_whitespace(s: str) -> str:
    return re.sub(r'\s+', ' ', s).strip()

def first_sentence(text: str) -> str:
    text = normalize_whitespace(text)
    # naive sentence split by punctuation followed by space
    m = re.split(r'(?<=[.!?])\s+', text)
    return m[0] if m else text

def shorten_to_words(s: str, max_words: int = 8) -> str:
    words = s.split()
    if len(words) <= max_words:
        return " ".join(words)
    # trim to full words, avoid trailing punctuation
    shortened = " ".join(words[:max_words])
    shortened = re.sub(r'[^\w\s\-:]+$', '', shortened)  # strip trailing punctuation
    return shortened + "..."

def generate_title_rule(text: str, max_words: int = 8) -> str:
    """
    Rule-based title:
      - uses the first sentence
      - strips surrounding parentheticals and inline citations
      - truncates to max_words
    Good for a fast fallback.
    """
    s = first_sentence(text)
    # remove bracketed citations or parentheses
    s = re.sub(r'\[[^\]]*\]', '', s)
    s = re.sub(r'\([^)]*\)', '', s)
    s = normalize_whitespace(s)
    # remove leading subordinate phrases like "In this paper," or "This study..."
    s = re.sub(r'^(In (this|the) (paper|study|report|article)\s*,?\s*)', '', s, flags=re.I)
    s = s.strip(' .,-:;')
    return shorten_to_words(s, max_words=max_words).title()

# -------------------------
# Model-based generator
# -------------------------
def generate_title_model(text: str,
                         model_name: str = "t5-small",
                         max_length: int = 12,
                         min_length: int = 3,
                         device: int = -1) -> str:
    """
    Generate a short title using a transformer model.
    Defaults to t5-small. For better quality, try "t5-base", "t5-large", or PEGASUS/BART summarization models.
    `device=-1` forces CPU; set device=0 to use first GPU if available.
    If transformers are unavailable, falls back to generate_title_rule.
    """
    try:
        from transformers import pipeline
    except Exception:
        # transformers not installed; fallback
        return generate_title_rule(text, max_words=max_length)

    # try to create a text2text pipeline (T5 style)
    # For t5 you often prefix with "summarize: "
    prefix = "summarize: "
    text = normalize_whitespace(text)
    # Send a shortened input to the model (avoid too long input)
    shortened_input = (text[:4000] + "...") if len(text) > 4000 else text

    try:
        # For some models (pegasus/bart) 'summarization' pipeline may be better,
        # but text2text-generation with prefix works for T5 family.
        pipe = pipeline("text2text-generation", model=model_name, device=device)
        prompt = prefix + shortened_input
        out = pipe(prompt, max_length=max_length, min_length=min_length, do_sample=False)
        title = out[0]["generated_text"]
        title = title.strip().strip(' .,:;')
        # post-process: if the model returns a sentence, shorten it further
        return shorten_to_words(title, max_words=max_length).title()
    except Exception:
        # any failure -> fallback
        return generate_title_rule(text, max_words=max_length)


# -------------------------
# Example / CLI-like usage
# -------------------------
if __name__ == "__main__":
    sample = """
    The goal of this project is to develop an automated system that can generate concise and coherent summaries from long textual documents.
    The system aims to improve information consumption efficiency by reducing the time required to read and understand enormous amounts of text.
    We implement both extractive and abstractive summarization pipelines, evaluate them with ROUGE, and provide a simple CLI and chunking for long inputs.
    """

    print("Rule-based title:")
    print(generate_title_rule(sample, max_words=6))
    print()

    print("Model-based title (attempt):")
    # If you have transformers installed, this will call the model.
    # Otherwise it falls back to the rule-based title.
    print(generate_title_model(sample, model_name="t5-small", max_length=7))


Rule-based title:
The Goal Of This Project Is...

Model-based title (attempt):


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=7) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The System Aims To Improve Information Consumption...
