<a href="https://colab.research.google.com/github/CodeReaper9000/TextSummarizer_Gemma2b/blob/main/TextSummarizer2_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Toolkit (NLTK)

NLTK is a Python library for natural language processing. Its main capabilities include:

Tokenization – Splitting text into words or sentences

Stopword removal – Removing common words like "the", "is", "in", etc.

Stemming & Lemmatization – Reducing words to their root forms

Part-of-speech (POS) tagging – Labeling words as nouns, verbs, adjectives, etc.

Named Entity Recognition (NER) – Identifying entities like people, organizations, locations

Parsing and syntax trees – Analyzing sentence structure

WordNet integration – Working with a lexical database for English

Text classification – Training models to classify text into categories
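
As a rough illustration of the first two items, the sketch below tokenizes text and removes stopwords using only the standard library. The regular expressions and the tiny `STOPWORDS` set are assumptions made for this example; NLTK's `sent_tokenize`, `word_tokenize`, and `stopwords` corpus handle far more cases and are the better choice in practice.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "is", "in", "a", "an", "of", "and", "to"}

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize_words(sentence):
    """Lowercase the sentence and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def remove_stopwords(tokens):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

text = "The cat is in the garden. It chased a bird!"
sentences = split_sentences(text)
tokens = [remove_stopwords(tokenize_words(s)) for s in sentences]
print(sentences)
print(tokens)
```

A regex splitter like this breaks on abbreviations such as "Dr." or "e.g.", which is one reason the notebook relies on NLTK's trained `punkt_tab` tokenizer instead.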

In [None]:
import nltk   #Natural Language Toolkit
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
from google.colab import drive
import torch
import nltk
from nltk.tokenize import sent_tokenize  # splits a block of text into individual sentences
# TF-IDF (Term Frequency–Inverse Document Frequency) measures how important a word is
# to a document within a collection (corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
# pipeline: high-level API for running NLP tasks.
# AutoTokenizer: loads the matching tokenizer, which converts text to tokens and back.
# AutoModelForCausalLM: loads a model that predicts the next token in a sequence
# (causal language modeling).
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from IPython.display import Markdown  # renders Markdown-formatted output in the notebook
import numpy as np

In [None]:
from huggingface_hub import login
login(token="token")  # placeholder: replace "token" with your Hugging Face access token

In [None]:
drive.mount('/content/drive')

model_name = "google/gemma-2b-it"
model_path = "/content/drive/MyDrive/gemma_2b"

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=model_path)  # converts text into the numeric token IDs the model operates on
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=model_path)  # predicts the next token from the preceding tokens
# cache_dir points at Drive so the downloaded files persist across Colab sessions instead of staying in the temporary local cache

Mounted at /content/drive


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
import torch
import gc

gc.collect()
torch.cuda.empty_cache()

In [None]:
pipe = pipeline(
    "text-generation",    # task
    model=model,          # model loaded above
    tokenizer=tokenizer,  # matching tokenizer
    device=0,             # first CUDA GPU
)

Device set to use cuda:0


1. Custom Extractive Summarizer
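Each sentence is scored by summing the TF-IDF weights of its words, so the score reflects:

Sentence length (longer sentences accumulate more weighted terms)

Number of keywords (distinctive terms carry higher TF-IDF weight)

Presence of named entities or numbers (optional; not handled specially in the function below)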
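
To make the keyword criterion concrete, here is a minimal pure-Python sketch of TF-IDF sentence scoring. It treats each sentence as a document and uses the plain formula tf × log(N / df). That formula is an assumption for illustration and differs from scikit-learn's `TfidfVectorizer`, which applies IDF smoothing and L2 normalization by default. The function name `score_sentences` is hypothetical, and no stopword removal is done here.

```python
import math
import re
from collections import Counter

def score_sentences(sentences):
    """Return one score per sentence: the sum of tf * log(N / df) over its words."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this sentence
        scores.append(sum(count * math.log(n / df[word]) for word, count in tf.items()))
    return scores

sentences = [
    "Gemma is a language model.",
    "The model is small.",
    "Tokenizers split text into tokens.",
]
for sentence, score in zip(sentences, score_sentences(sentences)):
    print(f"{score:.3f}  {sentence}")
```

Under this formula, a word that occurs in every sentence contributes nothing, because log(N / N) = 0, while words unique to one sentence contribute the most.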

In [None]:
def extractive_summary_tfidf(text, top_n=5):
    sentences = sent_tokenize(text)  # split the text into sentences
    if len(sentences) <= top_n:
        return text  # nothing to extract if the text is already short

    # stop_words='english' drops common words (e.g., "the", "is") before scoring.
    vectorizer = TfidfVectorizer(stop_words='english')
    # Each sentence is treated as a document and turned into a row of TF-IDF weights.
    # Words that are frequent in a sentence but rare across the other sentences
    # receive higher weights.
    X = vectorizer.fit_transform(sentences)

    # Sum the weights in each row; a higher total marks a more important sentence.
    sentence_scores = np.array(X.sum(axis=1)).ravel()
    ranked_idx = sentence_scores.argsort()[::-1]  # indices from highest to lowest score

    selected = sorted(ranked_idx[:top_n])  # keep the top_n sentences, restored to original order
    summary = " ".join([sentences[i] for i in selected])  # extractive summary
    return summary
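
The rank-then-resort step is easy to misread, so here is a small standalone sketch of it with made-up scores. The helper name `top_n_in_order` is hypothetical; it mirrors the `argsort` / `sorted` logic using plain Python lists, and ties may be broken differently than NumPy would break them.

```python
def top_n_in_order(scores, top_n):
    """Pick the indices of the top_n highest scores, then restore document order."""
    # Rank indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the best top_n and sort them back into their original positions.
    return sorted(ranked[:top_n])

scores = [0.9, 0.1, 0.4, 0.8, 0.3]  # made-up sentence scores
print(top_n_in_order(scores, top_n=3))  # [0, 2, 3]
```

Restoring document order keeps the selected sentences readable as a passage rather than as a list sorted by score.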

2. Abstractive Summarization using Gemma (Refine)
The extractive summary is passed to the LLM in a single prompt, and the model rewrites it in the requested style. The code below does not split the summary into chunks.
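
If chunking is needed for longer inputs, one way it could look is sketched below: sentences are grouped greedily under a word budget, and each chunk could then be refined separately. The names `chunk_sentences` and `max_words` are hypothetical and not part of the notebook, and word counts are only a rough proxy for model tokens.

```python
def chunk_sentences(sentences, max_words=200):
    """Greedily group sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]
print(chunk_sentences(sentences, max_words=5))
```

Splitting on sentence boundaries rather than at a fixed character offset avoids cutting a sentence in half between two prompts.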

In [None]:
# For general text summarization
mesg_text = lambda text: [
    {
        "role": "user",
        "content": f"""Can you provide a comprehensive summary of the given text? The summary should cover all the key points and main ideas presented in the original text, while also condensing the information into a concise and easy-to-understand format. Please ensure that the summary includes relevant details and examples that support the main ideas, while avoiding any unnecessary information or repetition. The length of the summary should be appropriate for the length and complexity of the original text, providing a clear and accurate overview without omitting any important information.

Text:
\"\"\"
{text}
\"\"\""""
    }
]

# For research paper summarization
mesg_research = lambda text: [
    {
        "role": "user",
        "content": f"""Act as an academic research expert. Read and digest the content of the research paper. Produce a concise and clear summary that encapsulates the main findings, methodology, results, and implications of the study. Ensure that the summary is written in a manner that is accessible to a general audience while retaining the core insights and nuances of the original paper. Include key terms and concepts, and provide any necessary context or background information. The summary should serve as a standalone piece that gives readers a comprehensive understanding of the paper's significance without needing to read the entire document. Use this format:

Chapter 1: <title>
  - <point 1>
  - <point 2>
Chapter 2: <title>
  - <point 1>
  - <point 2>

Text:
\"\"\"
{text}
\"\"\""""
    }
]

# For book summarization
mesg_book = lambda text: [
    {
        "role": "user",
        "content": f"""You're a literary summarizer. Summarize the book given. Please include the main plot points, key characters, central themes, and the overall message or takeaway of the book. Keep the summary concise (about 2-3 paragraphs) and present it in a reader-friendly format.
Book Text:
\"\"\"
{text}
\"\"\""""
    }
]


In [None]:
def abstractive_refine(pipe, text, style="research"):
    if style == "text":
        messages = mesg_text(text)
    elif style == "research":
        messages = mesg_research(text)
    elif style == "book":
        messages = mesg_book(text)
    else:
        raise ValueError("Invalid style. Choose from: 'text', 'research', 'book'.")

    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    outputs = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.3,
        top_k=30,
        top_p=0.9
    )

    return outputs[0]["generated_text"][len(prompt):].replace('#', '').strip()
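
The return statement relies on the pipeline echoing the prompt at the start of `generated_text`. The sketch below isolates that post-processing on a made-up string so its effect is visible without loading a model; `strip_prompt` is a hypothetical helper, and the `startswith` guard is an addition for safety rather than something the notebook does.

```python
def strip_prompt(generated_text, prompt):
    """Drop the echoed prompt, remove '#' characters, and trim whitespace."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.replace("#", "").strip()

prompt = "Summarize: cats sleep a lot.\n"
full_output = prompt + "## Summary\nCats sleep a lot. See issue #12."
print(strip_prompt(full_output, prompt))
```

Note that removing every `#` also removes hashes that belong to the content (here, `#12` becomes `12`), not only Markdown heading markers. The text-generation pipeline also accepts `return_full_text=False`, which returns only the completion and makes the slicing unnecessary.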

Full pipeline function

In [None]:
def hybrid_summarize(pipe, input_text, top_n=5, style="research"):
    print("🔍 Extractive phase using TF-IDF scoring...")
    extracted = extractive_summary_tfidf(input_text, top_n=top_n)

    print("🧠 Abstractive refinement with Gemma...")
    refined = abstractive_refine(pipe, extracted, style=style)

    return refined
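
Because the style is only checked inside `abstractive_refine`, an unsupported value is detected after the extractive phase has already run. A minimal sketch of validating the style up front is shown below; `VALID_STYLES` and `validate_style` are hypothetical names, not part of the notebook.

```python
VALID_STYLES = ("text", "research", "book")  # the styles abstractive_refine accepts

def validate_style(style):
    """Normalize a style name and raise early if it is not supported."""
    normalized = style.strip().lower()
    if normalized not in VALID_STYLES:
        raise ValueError(
            f"Invalid style {style!r}. Choose from: {', '.join(VALID_STYLES)}."
        )
    return normalized

print(validate_style(" Research "))  # research
try:
    validate_style("technical")
except ValueError as err:
    print(err)
```

Calling a check like this as the first line of `hybrid_summarize` would surface a bad style before any scoring or generation work is done.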

Using the pipeline

In [None]:
from google.colab import files

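# Function to handle file upload
def upload_file():
    uploaded = files.upload()  # dict mapping filename -> file contents (bytes)
    if not uploaded:
        raise ValueError("No file was uploaded.")
    # Use the first uploaded file; any additional files are ignored.
    filename = next(iter(uploaded))
    return uploaded[filename].decode("utf-8")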

# Prompt the user to choose whether to upload a file or input text directly
user_choice = input("Do you want to upload a text file or input the text directly? (Type 'file' to upload or 'input' to enter text): ").strip().lower()

if user_choice == 'file':
    print("📂 Please upload a `.txt` file.")
    long_text = upload_file()
elif user_choice == 'input':
    print("✍️ Please input your text below:")
    long_text = input("Enter your text: ")
else:
    raise ValueError("Invalid choice. Please type 'file' or 'input'.")

print("\n✅ Text received successfully.")

# Ask user for the type of content
doc_type = input("📘 What type of content is this? (text / book / research): ").strip().lower()

# Map the document type to a summarization style
style_map = {
    "text": "text",
    "book": "book",
    "research": "research"
}

style = style_map.get(doc_type, "text")  # fall back to "text" on unrecognized input
print(style)

# Run the hybrid summarization pipeline
summary = hybrid_summarize(pipe, long_text, top_n=20, style=style)

# Display the result
display(Markdown(summary))

Do you want to upload a text file or input the text directly? (Type 'file' to upload or 'input' to enter text): file
📂 Please upload a `.txt` file.


Saving tester.txt to tester.txt

✅ Text received successfully.
📘 What type of content is this? (text / book / research): research
research
🔍 Extractive phase using TF-IDF scoring...
🧠 Abstractive refinement with Gemma...


Iron Man: A Cinematic Journey

**Introduction:**
Iron Man is a 2008 superhero film that follows the story of Tony Stark, a brilliant industrialist and billionaire, who transforms into the superhero Iron Man. The film explores the complexities of wealth, technology, and the ethical boundaries of power.

**Methodology:**
The film was directed by Jon Favreau and produced by Marvel Studios. The screenplay was written by Mark Fergus and Hawk Ostby and Art Marcum and Matt Holloway. The cast included Robert Downey Jr., Terrence Howard, Jeff Bridges, Gwyneth Paltrow, Leslie Bibb, and Shaun Toub.

**Results:**
Iron Man was a critical and commercial success, grossing over $585.8 million worldwide. The film received numerous awards and nominations, including three Academy Awards for Best Sound Editing, Best Visual Effects, and Best Original Score.

**Discussion:**
Iron Man is a groundbreaking film in several ways. It was one of the first superhero films to explore the darker side of the superhero genre, with a focus on the ethical implications of technology and the consequences of unchecked ambition. The film also broke new ground in its use of motion capture technology to create realistic special effects.

**Key Terms and Concepts:**
* **Iron Man:** A superhero who can transform into a powerful armored suit.
* **Stark Industries:** Tony Stark's company that manufactures the Iron Man suit.
* **Superhero:** A person with superhuman abilities.
* **Technology:** The use of advanced technology to enhance human capabilities.
* **Ethical considerations:** The ethical implications of technology and the use of superhumans.

**Conclusion:**
Iron Man is a thought-provoking and visually stunning film that has had a profound impact on the superhero genre. The film's exploration of wealth, technology, and the human condition has inspired countless fans and filmmakers.